Article

Extraction of Water Bodies from High-Resolution Aerial and Satellite Images Using Visual Foundation Models

1 Department of Geomatics Engineering, Faculty of Engineering and Natural Sciences, Gumushane University, 29100 Gumushane, Turkey
2 Department of Geomatics Engineering, Faculty of Engineering, Karadeniz Technical University, 61080 Trabzon, Turkey
3 Department of Geomatics Engineering, Faculty of Engineering, Gebze Technical University, 41400 Kocaeli, Turkey
* Author to whom correspondence should be addressed.
Sustainability 2024, 16(7), 2995; https://doi.org/10.3390/su16072995
Submission received: 18 February 2024 / Revised: 22 March 2024 / Accepted: 2 April 2024 / Published: 3 April 2024

Abstract

Water, indispensable for life and central to ecosystems, human activities, and climate dynamics, requires rapid and accurate monitoring. This is vital for sustaining ecosystems, enhancing human welfare, and effectively managing land, water, and biodiversity at both local and global levels. In the rapidly evolving domain of remote sensing and deep learning, this study focuses on water body extraction and classification using recent visual foundation models (VFMs). Specifically, the Segment Anything Model (SAM) and Contrastive Language-Image Pre-training (CLIP) models have shown promise in semantic segmentation, dataset creation, change detection, and instance segmentation tasks. A novel two-step approach is proposed, in which images are segmented via the Automatic Mask Generator method of the SAM and the resulting segments are classified in a zero-shot manner using CLIP, and its effectiveness is tested on water body extraction problems. The proposed methodology was applied to both remote sensing imagery acquired from LANDSAT 8 OLI and very high-resolution aerial imagery. Results revealed that the proposed methodology accurately delineated water bodies across complex environmental conditions, achieving a mean intersection over union (IoU) of 94.41% and an F1 score of 96.97% for satellite imagery. Similarly, for the aerial imagery dataset, the proposed methodology achieved a mean IoU of 90.83% and an F1 score of 94.56%. The high accuracy achieved in selecting segments predominantly classified as water highlights the effectiveness of the proposed model in intricate environmental image analysis.

1. Introduction

Water is not only central to natural ecosystems and human activities, but also plays a crucial role in global climate dynamics [1,2,3]. While the Earth’s surface is largely covered by water, the small portion that is drinkable underlines the need for the effective management and monitoring of this essential resource [4,5]. Understanding the distribution and characteristics of water bodies is essential for assessing habitat health, species distribution, and the environmental impact of human activities [6,7]. Furthermore, water bodies, particularly lakes and reservoirs, are integral to the global carbon cycle and the exchange of greenhouse gases with the atmosphere, making water body studies vital for climate change research [1,4,8,9]. The rapid and accurate mapping of water bodies, which are essential for the sustainability of ecosystems and human welfare, is of high importance for land, water, and biodiversity management at local and global scales [8,10]. The detailed mapping and understanding of water bodies, including lakes, rivers, and wetlands, are crucial for various aspects of hydrology and water resource management [6,9,10,11,12,13]. Current and accurate information about water bodies aids in effective water allocation, flood management, and water quality monitoring [2,5,6,14,15]. Additionally, this knowledge is invaluable in environmental research, contributing to studies in ecology, biology, and geology [9].
To extract water body information from remotely sensed data, a variety of techniques ranging from simple band ratio analysis to advanced deep learning models have been employed to enhance accuracy and efficiency [16,17]. Band ratio methods, such as the Normalized Difference Water Index (NDWI), utilize specific spectral bands to identify the presence of water, effectively distinguishing it from other land cover types [18,19,20,21]. However, these techniques often require manual tuning and may struggle with varying environmental conditions [4,17,22]. Machine learning approaches, on the other hand, offer more adaptability through the use of algorithms like support vector machines (SVM) and random forest (RF), which can classify water bodies via learning from labeled data [23,24,25]. Despite their adaptability, these methods still rely on hand-crafted features, limiting their ability to handle complex scenarios [9,17,26]. Deep learning, particularly through the use of convolutional neural networks (CNNs), represents a significant advancement in remotely sensed image analysis [27,28]. These models automatically extract low- to high-level features from raw data, thus enabling the detection of water bodies with high precision, even in challenging environments [1,8,9,15,22,29,30,31,32,33,34,35,36,37]. Wieland et al. [34] explored the use of convolutional neural networks, specifically U-Net [38] and DeepLab-V3+ [39], with various encoder backbones such as MobileNet-V3 [40] and ResNet-50, for water body segmentation in high-resolution images of normal and flood waters. They introduced a large dataset and tested models across different sensors and conditions. The findings demonstrated that U-Net with a MobileNet-V3 backbone was the most effective, with improvements from near-infrared bands and digital elevation models. Duan and Hu [35] introduced a new multiscale refinement network (MSR-Net) for precise water body segmentation, testing the network on satellite images. Dai et al. [36] developed MSLANet to accurately segment both buildings and water bodies from remote sensing images. They evaluated their multiscale location attention network on satellite images for water body segmentation. Liu et al. [37] introduced a CNN-based network, R50A3-LWBENet, for extracting lake water bodies from Google Earth remote sensing imagery, utilizing ResNet50 and three attention mechanisms. Li et al. [1] suggested a novel network named the dense-local-feature-compression (DLFC) network, designed for the automatic extraction of water bodies from various remote sensing images. In this architecture, each layer has the capability to access the feature maps from all preceding layers, facilitated by the densely connected module found in DenseNet. However, although deep learning models offer higher accuracy when compared to traditional machine learning approaches, they have their shortcomings, including the need for significant processing power for training and inference, as well as a substantial requirement for large amounts of labeled data [41].
The segmentation of remotely sensed data has long been conducted through the creation of image objects using region-growing algorithms, mainly multiresolution segmentation [42,43]. The introduction of deep learning has significantly improved the accuracy of segmentation methods for remotely sensed data [1,29,35]. However, as highlighted by Li et al. [1], in the realm of deep learning methods, each sensor requires the creation of its own training dataset, and models trained on samples from a single sensor cannot be effectively transferred to others. Additionally, as stated by Gautam and Singhai [44], further advancements in water body segmentation algorithms are anticipated, aiming for increased automation through enhanced deep learning architectures capable of processing images from various sensors. Recently, the emergence of visual foundation models (VFMs), including the Segment Anything Model (SAM) [45] and the Contrastive Language-Image Pre-training (CLIP) model [46], has dramatically transformed the landscape of image segmentation and classification tasks [47,48,49,50,51]. Their versatility and adaptability have opened new opportunities in remote sensing, enabling these models to tackle a wide array of challenges with unprecedented efficiency. Beyond the peer-reviewed literature, several open-source repositories have also contributed practical implementations and enhancements of the SAM in various applications. Recent studies in remote sensing have explored various innovative approaches to enhance image segmentation and change detection tasks using advanced VFMs.
For instance, Zhang et al. [48] introduced a novel method for remote sensing image semantic segmentation via the combination of VFMs, specifically the SAM, Grounding DINO, and CLIP. These pre-trained models were utilized to generate visual prompts, aiding in the accurate segmentation of geospatial objects. The proposed method was tested on diverse datasets, showing the effectiveness of VFMs for the semantic segmentation of remote sensing data. In a recent study, Wang et al. [47] developed an efficient pipeline to create a comprehensive remote sensing segmentation dataset, called SAMRS, which encompasses over a million instances. The dataset was utilized for the pre-training of deep learning models, addressing issues related to task discrepancies and limited training data. The use of SAMRS in preliminary experiments illustrated its effectiveness in enhancing annotation efficiency, thus improving the training process for remote sensing segmentation tasks. Ren et al. [49] extended the evaluation of the SAM to remote sensing imagery segmentation. Their study carefully selected diverse imagery benchmarks, focusing on interactive annotation and model composition. While the SAM adapted well to these tasks, the study also identified unique challenges specific to remote sensing data, highlighting areas for future research and development. Similarly, Ding et al. [50] adapted the SAM for change detection in high-resolution remote sensing images, developing the SAM-CD network. By employing the FastSAM variant of the SAM and integrating a convolutional adaptor, this research enhanced the detection of changes in remote sensing imagery. The SAM-CD network outperformed existing methods in terms of accuracy and demonstrated sample-efficient learning comparable to semi-supervised methods. Chen et al. [51] explored enhancing instance segmentation in remote sensing imagery using the SAM. They proposed the RSPrompter model to overcome the SAM’s reliance on manual prompts. The RSPrompter generates category-sensitive prompts from intermediate encoder layers, enabling the SAM to produce segmented masks with category labels. This approach was validated through extensive evaluations on various datasets, proving its effectiveness in enhancing the SAM for instance segmentation tasks in remote sensing.
In the evolving landscape of remote sensing and deep learning, recent advancements have set a precedent for innovative image segmentation and analysis techniques. Various adaptations and applications of vision foundation models have been investigated, highlighting their effectiveness in tasks ranging from semantic segmentation to change detection in remote sensing imagery [47,48,49,50,51]. In this study, an innovative approach is proposed to integrate a two-step process using the SAM and CLIP RSICD [52] models, harnessing the potential of these advanced models for water body segmentation tasks. In the first step, the Automatic Mask Generator method from the SAM was utilized, along with the pre-trained Vision Transformer-Huge (ViT-H) model, to segment input images. It is a crucial step for isolating distinct segments within the imagery, demonstrating the capabilities of the SAM in handling complex image characteristics. In the second step, the CLIP RSICD model that is trained on remote sensing imagery was utilized to classify the image segments extracted with the SAM. The proposed methodology was tested using two distinct data sources: Landsat 8 OLI imagery of the YTU-Waternet dataset, incorporating near-infrared (NIR), green, and blue bands, and high-resolution aerial imagery with RGB bands. Our contributions are summarized as follows: We have implemented a novel framework for the zero-shot classification of water bodies using state-of-the-art VFMs in satellite and aerial imagery. Furthermore, we evaluated the framework’s ability to generalize across both types of imagery, demonstrating its versatile application. Additionally, we conducted an analysis on the impact of parameter selection on the performance of these models, offering insights into the optimization of model configurations for enhanced accuracy and efficiency.

2. Materials and Methods

A comprehensive and innovative approach has been developed for detecting and classifying water bodies using remote sensing imagery. The proposed methodology relies on the integration of advanced VFMs, specifically the SAM and the CLIP model adapted for remote sensing imagery. In the following subsections, the characteristics of the datasets employed in this study are described, alongside a detailed exploration of each VFM and its respective role in this study.

2.1. Dataset

To demonstrate the versatility of the proposed methodology, a diverse range of data sources were utilized. The first data source was satellite imagery with near-infrared (NIR), green, and blue bands from the YTU-Waternet dataset. The second dataset included a high-resolution aerial imagery dataset containing RGB bands, offering finer details than the satellite imagery, which are often essential in distinguishing subtle variations. By employing datasets with different characteristics, both the robustness of the proposed methodology and its adaptability in handling heterogeneous data sources were investigated.

2.1.1. Satellite Imagery Dataset

The YTU-Waternet dataset consists of 63 Landsat 8 OLI full-frame images captured in coastal zones across multiple countries, including Albania, Argentina, Bulgaria, England, Georgia, Greece, Ireland, Italy, Libya, Russia, South Africa, Spain, Turkey, and the USA [29] (Figure 1). In total, the YTU-Waternet dataset has 1008 images, where 824 images were selected for training, 92 for validation, and 92 for testing. Each image has a dimension of 512 × 512 pixels, and consists of blue, red, and near-infrared (NIR) bands.
It should be noted that only the test part of the YTU-Waternet dataset was utilized in this study. In the pre-processing stage of the YTU-Waternet dataset, linear contrast stretching was applied to enhance the contrast of the Landsat imagery before the subsequent segmentation and classification tasks. The 2nd and 98th percentiles were chosen as bounds for adjusting the pixel intensities. Intensities below the 2nd percentile were set to the minimum, whereas those above the 98th percentile were adjusted to the maximum. This adjustment was applied to each image band, rescaling the intensities so that the specified lower and upper percentiles became the new minimum and maximum intensity values. This process significantly enhanced the contrast of the images. Following the contrast adjustment, the images were normalized and converted into uint8 format. In the final stage of preprocessing, the adjusted images were saved in a lossless format (*.png) to ensure that they were free from any compression artifacts.
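For clarity, the stretch described above can be sketched as follows. This is a minimal NumPy illustration of the 2nd–98th percentile adjustment; the input array and file names are hypothetical stand-ins rather than the exact implementation used in this study.

```python
import numpy as np
from PIL import Image

def percentile_stretch(band: np.ndarray, low: float = 2.0, high: float = 98.0) -> np.ndarray:
    """Clip one band to its [low, high] percentiles and rescale to uint8."""
    lo, hi = np.percentile(band, (low, high))
    clipped = np.clip(band, lo, hi)
    normalized = (clipped - lo) / max(hi - lo, 1e-12)  # scale to [0, 1]
    return (normalized * 255).astype(np.uint8)

# Stand-in for a Landsat tile (NIR, green, blue) with raw digital numbers.
raw = (np.random.rand(512, 512, 3) * 10000).astype(np.float32)

# Stretch each band independently and save losslessly as PNG.
stretched = np.stack([percentile_stretch(raw[..., b]) for b in range(raw.shape[-1])], axis=-1)
Image.fromarray(stretched).save("adjusted_tile.png")
```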

2.1.2. Aerial Imagery Dataset

In this study, the following two distinct geographical regions in Turkey were chosen to represent the diverse topographical and vegetative features of the country: Rize from the eastern Black Sea region and Malatya from the Central Anatolia region. The aerial images of these regions, along with the interior and exterior orientation parameters and the vector data, were obtained from the General Directorate of Mapping (GDM). The aerial imagery utilized in this study was captured using a Vexcel UltraCam Eagle camera (Vexcel Imaging GmbH, Graz, Austria), producing images with a resolution of 13,080 × 20,010 pixels in RGB bands. These images were then processed to generate orthophotos, utilizing the provided interior and exterior orientation parameters. The processing was carried out using Agisoft PhotoScan software (version 1.2), and the orthophotos were produced with a ground sampling distance of 1 m to ensure a detailed spatial representation.
The study area is the Ardeşen district of Rize province, which is representative of the typical topographical and vegetative features of the Black Sea region (Figure 2). This area is characterized by a mountainous terrain blanketed with tall trees, which transitions into flatter topography featuring dense and sparse settlements along the coastal sections. The study area encompasses two different types of water bodies: the sea and a river. The sea stretches along the east–west axis of the area, generally displaying a uniform structure with natural elements such as a beach and a river mouth, except for artificial constructs like piers, T-shape groynes, and fillings along its coastline. The river, originating from the southern side of the region, flows beneath man-made bridges, adjacent to structures like water treatment plants, and then drains into the northern part of the sea. The river depth is relatively shallow, and it does not fill its bed, leading to the formation of numerous alluvium islets of varying sizes within the riverbed. The width of this riverbed varies significantly, measuring around 35 m at its narrowest point and expanding to approximately 160 m at its widest. The study area was segmented into 12 tiles, selectively excluding those without any water pixels to minimize processing time. Each tile features a resolution of 1024 × 1024 pixels and maintains a ground sampling distance of 1 m per pixel.
The second study area was selected from part of Karakaya Dam Lake, situated between the Kale and Doğanyol districts of Malatya province (Figure 3). This area features a mountainous topography with sparse vegetation. Starting from the northwest border of the study area and extending towards the southeast, there lies a dam lake, which splits into two branches, one heading southeast and the other southwest. The portion of Karakaya Dam Lake within the study area boundaries varies in width, being approximately 360 m at its narrowest point and extending up to about 650 m at its widest. As with the Rize study area, the Malatya study area was segmented into 31 tiles, selectively excluding those without any water pixels in order to minimize processing time. Each tile features a resolution of 1024 × 1024 pixels and maintains a ground sampling distance of 1 m per pixel. In Figure 3, tile boundaries are highlighted in red, and the tile numbers are given in each tile.

2.2. Methodology

The proposed methodology for water body extraction from remote sensing imagery employs a two-step process that leverages the novel SAM and CLIP VFMs for segmentation and classification tasks (Figure 4). In the first step, image segmentation was carried out using the SAM in order to delineate the image segments. In the second step, the CLIP RSICD model, which is specifically trained on remote sensing imagery, was employed for the zero-shot classification of the image segments.

2.2.1. Segment Anything Model (SAM)

First proposed by Kirillov et al. [45], the SAM is a recently introduced image segmentation model designed for promptable segmentation. It was trained on the largest dataset in the computer vision domain, featuring over 1 billion masks on 11 million images, and has remarkable zero-shot transfer capabilities that often surpass previous supervised methods. The SAM framework is composed of the following three essential components: an image encoder, a prompt encoder, and a mask decoder (Figure 5).
The image encoder, built on a pre-trained Vision Transformer (ViT), is adapted for high-resolution input processing. The encoder runs once per image, utilizing a Masked Autoencoder (MAE) approach. The encoder design allows it to be applied even before prompting the model. The prompt encoder in the SAM is designed to handle two types of prompts: sparse (such as points, boxes, and text) and dense (masks). Sparse prompts like points and boxes are represented using positional encodings combined with learned embeddings for each prompt type. A crucial aspect of the SAM is the mask decoder, which can efficiently translate the combined image and prompt embeddings, along with an output token, into a mask. The decoder incorporates a modified block that facilitates prompt self-attention and cross-attention between prompts and image embeddings. Subsequently, an upsampling process and a dynamic linear classifier are employed to calculate the foreground probability of the mask at each location within the image [45].
In the proposed approach, the SAM’s Automatic Mask Generator method was employed. Three main parameters control the automatic mask generation process: points per side, pred iou thresh, and stability score thresh. The points per side parameter defines the number of points sampled along each side of the image, thus creating a grid of prompt points; it controls the granularity of the segmentation operation. The pred iou thresh parameter filters the generated masks based on the internally calculated IoU metric. The stability score thresh parameter filters masks based on their stability under variations of the cutoff used to binarize the mask predictions, providing a measure of the reliability of each prediction. Additionally, the box nms threshold parameter controls the IoU cutoff for non-maximal suppression in order to reduce duplicate masks. The stability score offset parameter represents the value by which the cutoff is shifted when calculating the stability score. The crop n layers parameter dictates the number of layers for running mask predictions on image crops, with each layer having a different number of crops. The crop nms threshold works like the box nms threshold, but applies across different image crops. The crop n points downscale factor parameter scales down the number of points-per-side sampled in each layer, managing the computational load. Lastly, the min mask region area parameter is applied to remove small, disconnected mask regions.
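As an illustration, configuring these parameters with the official segment-anything package might look as follows. This is a minimal sketch assuming a locally downloaded ViT-H checkpoint; the file names are illustrative, and the parameter values mirror the settings reported later in this study (points per side of 32 for the Landsat tiles and 64 for the aerial tiles).

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load the pre-trained ViT-H backbone (checkpoint file name is illustrative).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cuda")

mask_generator = SamAutomaticMaskGenerator(
    model=sam,
    points_per_side=32,           # density of the prompt-point grid
    pred_iou_thresh=0.90,         # filter masks by the predicted IoU
    stability_score_thresh=0.90,  # filter masks by stability under cutoff shifts
    box_nms_thresh=0.7,           # default IoU cutoff for duplicate-mask suppression
    min_mask_region_area=0,       # > 0 removes small disconnected regions
)

# Each returned dict carries 'segmentation', 'predicted_iou', 'stability_score', etc.
image = cv2.cvtColor(cv2.imread("adjusted_tile.png"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)
```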

2.2.2. Contrastive Language-Image Pre-Training Model (CLIP)

In developing the proposed methodology, inspiration was drawn from the principles established by the CLIP model (Figure 6), first proposed by Radford et al. [46]. CLIP represents a novel approach to learning visual representations via the use of natural language supervision.
By adopting a contrastive learning approach instead of trying to match exact textual descriptions with images, CLIP focuses on identifying the correct pairings within a batch of numerous possibilities. This is achieved by optimizing a symmetric cross-entropy loss over the cosine similarity scores between image and text embeddings.
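A minimal PyTorch sketch of this symmetric objective is given below; the batch size, embedding width, and temperature value are illustrative rather than those of the original CLIP training run.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric cross-entropy over cosine similarities of N paired embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (N, N) similarity matrix
    targets = torch.arange(logits.size(0))            # the i-th image matches the i-th text
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy batch of 8 image-text pairs with 512-dimensional embeddings.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```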
However, it is important to note that the original CLIP model was not specifically trained on remote sensing imagery. Recognizing this limitation, researchers in the field of remote sensing sought to adapt the CLIP framework to better align with the specific demands of this domain. This effort resulted in a variant of the CLIP model that was fine-tuned on the Remote Sensing Image Caption Dataset (RSICD) of Lu et al. [53]. This variant is known as CLIP RSICD, as described by Arutiunian et al. [52].

2.3. Water Body Segmentation Framework

The proposed methodology for water body extraction from remote sensing imagery employs a two-step process that leverages advanced VFMs, aiming to be precise and efficient. The first step of our methodology involves image segmentation using the SAM. In this phase, the SAM Automatic Mask Generator method was employed with the “ViT-H” pre-trained model. To enhance the precision of the segmentation process, the SAM Automatic Mask Generator was specifically configured with parameters tailored to the datasets in this study. This included setting points per side to 32 for the YTU-Waternet dataset and 64 for the aerial imagery datasets; this parameter determines the density of the points used in mask generation, thereby ensuring detailed and accurate segmentation. Furthermore, both the pred iou thresh and stability score thresh parameters were set to 0.90. These thresholds are crucial because they define the minimum intersection over union (IoU) and stability score required for a mask to be considered in the final output, therefore ensuring that only the most reliable and accurate masks are generated. Unless otherwise specified, default values were used for the additional parameters.
Following the segmentation with the SAM, the second step of the proposed methodology employs the CLIP RSICD model for zero-shot classification. This step is implemented using the version available in [52]. Before proceeding further, we conducted experiments on the aerial imagery Malatya dataset with various prompts to optimize classification accuracy. For this purpose, we developed four sets of prompts. Prompt 1 was designed to test the model’s response to basic class definitions, focusing solely on the fundamental categories of land and water. Prompt 2 focuses on water-related classes and includes fresh water, vegetation, saltwater, wetland, pond, and residential area classes. Prompt 3 expanded upon the second, incorporating more ground-related classes like grassland, rural, forest, sea, urban, crop field, and suburban, in addition to fresh and saltwater. In Prompt 4, we opted for broader class definitions, incorporating a diverse range of classes like bare land, desert, mountain, river, rural, sparse residential, forest, sea, medium residential, commercial, ocean, pond, farmland, crop field, and industrial classes (Figure 7).
In Prompt 1, the water body in mask 3 was detected with 99% probability. However, significant confusion among classes was observed in image segments without a water body, namely masks 1, 2, 4, and 5, where a water body probability of over 22% was calculated using the CLIP RSICD model. In order to enhance the accuracy of class predictions made by the CLIP RSICD model, different prompts were tested by increasing class diversity. In Prompts 2 and 3, experiments were conducted by adding both ground-related and water-related classes. It was observed that errors similar to those in Prompt 1 not only continued, but even increased in Prompts 2 and 3. However, Prompt 4 showed that a clearer distinction between water and non-water was achieved by including a broader range of classes, encompassing both ground-related and water-related categories. Therefore, in this study, Prompt 4 was selected due to its more successful differentiation of water bodies in various image segments.
After determining the most effective prompt as Prompt 4 for the aerial imagery dataset, each image segment was classified using the CLIP RSICD model, which was prompted with bare land, desert, mountain, river, rural, sparse residential, forest, sea, medium residential, commercial, ocean, pond, farmland, crop field, and industrial classes. Similarly, on the YTU-Waternet dataset, we used the sea, saltwater, ocean, forest, grassland, crop field, urban, suburban, and rural classes for the classification prompt. Our aim was to ensure the precise categorization of each segment. Segments identified as a water-related class with a combined probability exceeding 85% were assigned to the water class.
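The classification step can be sketched as follows using the Hugging Face transformers API. The checkpoint identifier (“flax-community/clip-rsicd-v2”) and the helper function are assumptions for illustration, not the exact implementation used in this study; the class list and the 85% threshold follow the text above.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Checkpoint identifier is an assumption; see Arutiunian et al. [52] for the released weights.
model = CLIPModel.from_pretrained("flax-community/clip-rsicd-v2")
processor = CLIPProcessor.from_pretrained("flax-community/clip-rsicd-v2")

# Prompt 4 classes for the aerial imagery dataset.
classes = ["bare land", "desert", "mountain", "river", "rural", "sparse residential",
           "forest", "sea", "medium residential", "commercial", "ocean", "pond",
           "farmland", "crop field", "industrial"]
water_classes = {"river", "pond", "sea", "ocean"}

def is_water_segment(segment: Image.Image, threshold: float = 0.85) -> bool:
    """Assign a segment to the water class if the water-related classes jointly exceed the threshold."""
    inputs = processor(text=classes, images=segment, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1).squeeze(0)
    water_prob = sum(p.item() for c, p in zip(classes, probs) if c in water_classes)
    return water_prob > threshold
```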

3. Results

In this study, a novel framework designed for the extraction of water bodies from remote sensing images is introduced. The proposed framework is centered around the utilization of state-of-the-art VFMs, notably the SAM and the CLIP RSICD models. These models, known for their advanced capabilities in image processing, are crucial in addressing the task of water body segmentation and classification. The proposed methodology involves an evaluation of each model’s performance and effectiveness on LANDSAT 8 OLI imagery obtained from the YTU-Waternet dataset, coupled with high-resolution aerial images obtained from the General Directorate of Mapping of Turkey. The experiments were conducted on a computer equipped with an Intel i5-11400F processor, 32 GB of RAM, and an NVIDIA RTX 3060 GPU with 12 GB of memory. PyTorch version 2.0 was employed for the SAM and CLIP RSICD models. This section focuses on the performance of VFMs on satellite images in Section 3.1, followed by their performance on aerial images in Section 3.2 and Section 3.3.

3.1. Results for YTU-Waternet Dataset

The SAM Automatic Mask Generator method was configured with the points per side parameter set to 32 due to the lower resolution (512 × 512) of the imagery and the relatively larger objects in the scene. Both the pred iou thresh and stability score thresh parameters were adjusted to 0.90 in order to obtain the most accurate water body delineation from the YTU-Waternet dataset. Following segmentation, each image segment underwent classification through the CLIP RSICD model, using the sea, saltwater, ocean, forest, grassland, crop field, urban, suburban, and rural classes. This prompt was selected after the tests in Section 2.3, which showed that using comprehensive class names improves the CLIP RSICD model’s ability to effectively distinguish between the classes in a scene. It should be noted that the segments identified as sea, saltwater, or ocean with a combined probability above 85% were assigned to the water class.
The SAM + CLIP RSICD framework demonstrated remarkable performance in water body segmentation and classification, achieving an F1-Score of 96.974%, an overall accuracy (OA) of 98.321%, and an intersection over union (IoU) of 94.410%, as detailed in Table 1.
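For reference, the reported metrics follow their standard definitions and can be computed from binary prediction and ground-truth masks as in the short sketch below; the function name is illustrative.

```python
import numpy as np

def binary_metrics(pred: np.ndarray, truth: np.ndarray) -> dict:
    """Precision, recall, F1-Score, overall accuracy, and IoU for binary water masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)    # water predicted as water
    fp = np.sum(pred & ~truth)   # land predicted as water
    fn = np.sum(~pred & truth)   # water predicted as land
    tn = np.sum(~pred & ~truth)  # land predicted as land
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "oa": (tp + tn) / (tp + fp + fn + tn),
        "iou": tp / (tp + fp + fn),
    }
```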
The segmentation and classification results for five sample tiles of the YTU-Waternet dataset are presented in Figure 8.
As shown in the instances in Figure 8, the proposed framework was able to extract the boundaries of these water bodies with exceptional accuracy, demonstrating a robust capability in distinguishing between land and water regions. This accuracy is particularly evident in the precise delineation of water body boundaries, even under the demanding conditions presented by small landforms, such as small islands, within water bodies, examples of which are given in Figure 8b. This suggests that the SAM + CLIP RSICD framework is capable of operating with high segmentation accuracy under complex and varied environmental conditions. The proposed framework was able to delineate complex shoreline boundaries, and it can handle man-made structures, such as bridges over water bodies, achieving consistent segmentation, as shown in Figure 8d.
Despite the promising results, the SAM + CLIP RSICD framework does exhibit certain limitations (Figure 9). This is evident in Figure 9b for the Test 16 data, where the proposed method was unable to extract water body boundaries. In this specific instance, the SAM only segmented land-related classes; however, CLIP RSICD was able to correctly classify the land segment as non-water. Similarly, in the Test 72 data depicted in Figure 9d, the SAM failed to segment a small water body located at the lower-left part of the image, while CLIP RSICD was able to correctly classify the image segments. Furthermore, an examination of the performance of the SAM reveals that the extracted image segments tend to avoid areas with high color contrast; in such cases, the omitted pixels trace the shoreline. This behavior is clearly visible in the segmentation results for the Test 72 data.
In another case, as demonstrated by the Test 6 and Test 43 data in Figure 9a,c, land segments were classified as water. This misclassification can be traced back to the inability of the SAM to distinctly separate land from water during the segmentation process. As a result, the entire image was fed into the CLIP RSICD model, which then interpreted the image as belonging to the water class, as demonstrated in Figure 9. The classification results from the CLIP RSICD model revealed that the water class had a dominant cumulative probability of 85.18%, with the other classes contributing a smaller portion to the overall probability distribution. Despite these specific concerns, it is worth noting that the proposed framework yielded impressive results even though it was not directly trained on remotely sensed data. While this slight segmentation inaccuracy might be considered minor and acceptable within various remote sensing applications, it poses potential challenges for studies concentrating on shoreline analysis. This limitation may be mitigated by fine-tuning the SAM using a more representative dataset.

3.2. Results for Malatya Study Area

In the second part of the experiment, the objective was water body extraction and labeling with the SAM + CLIP RSICD framework on the aerial imagery dataset. Here, the SAM Automatic Mask Generator was configured with the points per side parameter set to 64 due to the higher resolution (1024 × 1024) of the imagery, as well as to capture even the smallest water body objects. Both the pred iou thresh and stability score thresh parameters were set to 0.90 to ensure the best water body delineation accuracy. After the segmentation process, each image segment was classified by prompting the CLIP RSICD model with the bare land, desert, mountain, river, rural, sparse residential, forest, sea, medium residential, commercial, ocean, pond, farmland, crop field, and industrial classes. Segments identified as river, pond, sea, or ocean with a combined probability above 85% were assigned to the water class. The results for the Malatya study area are given in Table 2.
The proposed framework demonstrated a high degree of success in accurately extracting and classifying water bodies, showing remarkable precision in delineating water boundaries, even under complex conditions. The SAM proved to be quite effective in differentiating various objects and establishing precise boundaries. Furthermore, when provided with the appropriate prompt, the CLIP RSICD model accurately classified the image segments, including smaller ones. Also, the SAM segmentation process took approximately 70 s to extract image segments from a single image due to the higher image resolution and increased points per side parameter. Moreover, the time taken by the CLIP RSICD model for segment classification was around 20 s, primarily due to the increased image resolution. To illustrate the combined effectiveness of the SAM and CLIP RSICD, Figure 10 presents segmented water bodies from five test samples in the Malatya study area. The delineated boundaries of water bodies for the test areas are depicted in Figure 11.
The proposed framework demonstrated robust performance and was able to successfully extract and classify water bodies with high accuracy. However, the SAM faced some challenges in segmenting water bodies in Tile 3, similar to the Test 16 data (Figure 9b) in the YTU-Waternet dataset. Upon further investigation, it was discovered that, in certain instances, the SAM generated segments with low stability scores, which led to the observed segmentation challenges. To address this, we adjusted the stability score offset parameter from the default value of 1 to 0.1 in Tile 3. This change allowed the SAM to delineate water bodies more effectively, albeit with slightly reduced segmentation precision. Figure 12 shows the results before and after applying the optimal value of the stability score offset parameter for Tile 3. This situation demonstrates that the SAM involves several parameters controlling the segmentation quality, each of which requires dataset-specific optimization.
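Continuing the earlier configuration sketch, the adjustment for Tile 3 amounts to one changed argument; the `sam` model and `tile3_image` variable are assumed to come from the setup shown in Section 2.2.1.

```python
from segment_anything import SamAutomaticMaskGenerator

# Re-run the generator with a lowered stability score offset for the problematic tile
# (0.1 instead of the default 1.0), as described above.
mask_generator_tile3 = SamAutomaticMaskGenerator(
    model=sam,
    points_per_side=64,
    pred_iou_thresh=0.90,
    stability_score_thresh=0.90,
    stability_score_offset=0.1,  # shift of the cutoff used when scoring stability
)
masks_tile3 = mask_generator_tile3.generate(tile3_image)
```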

3.3. Results for Rize Study Area

The final experiment involved applying the SAM + CLIP RSICD framework to the Rize dataset, which exhibits characteristics distinct from the previous aerial imagery dataset. Similar to the Malatya study area, the SAM Automatic Mask Generator was configured with the points per side parameter set to 64 due to the higher resolution (1024 × 1024) of the imagery, as well as to capture even the smallest water body objects. Both the pred iou thresh and stability score thresh parameters were set to 0.90 to ensure the best water body delineation accuracy. The proposed framework demonstrated outstanding performance for Tiles 1 through 8 in terms of all accuracy metrics, as shown in Table 3. In Tiles 9 and 10, the framework maintained high performance, albeit with some variability in metrics; in these tiles, a lower recall suggested potential misses in water body detection. A more pronounced variation was observed in Tile 11, where a reduced recall score of 62.012% indicated challenges in consistently identifying water bodies. Tile 12 demonstrated a unique situation, with a high recall of 96.659% but a notably lower precision of 62.337%. This pattern suggests the likelihood of false positives being prevalent in this specific area.
In general, the framework exhibited a robust performance with mean scores of 96.246% in precision, 94.240% in recall, and an F1-Score of 94.559%. The mean OA and IoU were 98.840% and 90.827%, respectively. In this experiment, the proposed SAM + CLIP RSICD framework demonstrated a processing time comparable to that of the Malatya dataset. The water boundaries extracted for the Rize dataset are shown in Figure 13. It should be noted that the tile boundaries are represented in red and the respective tile numbers are given in the upper left of the tile boundaries. Green boundaries represent the extracted water body boundaries.
For Tile 11, it was observed that, while the SAM successfully segmented most parts of the river, it failed to accurately segment the southeastern section. Additionally, the CLIP RSICD model encountered challenges in labeling a specific river segment where the water level was low, as illustrated in Figure 14. This inaccuracy in labeling from the CLIP RSICD model could be attributed to the low water level, which likely alters the typical visual and spectral characteristics of the river, thus affecting the ability of the model to recognize and label it correctly. Furthermore, the results regarding sea boundary segments revealed a tendency of the SAM to under-segment shorelines, particularly in areas where water levels were low. This issue was evident in Tile 10 (Figure 14). The low water levels at these shorelines likely result in a blending of land and water features, creating a challenging scenario for the SAM to differentiate the shoreline distinctly. This under-segmentation suggests that the model is sensitive to variations in water levels, which significantly influence the textural and color characteristics essential for accurate segmentation.

4. Discussion

The SAM + CLIP RSICD framework demonstrated remarkable performance in water body segmentation and classification, achieving an F1-Score of 96.974%, an overall accuracy (OA) of 98.321%, and an intersection over union (IoU) of 94.410% on the YTU-Waternet dataset. Comparisons with the recent state-of-the-art deep learning methods reveal that the proposed methodology can closely match the water body extraction performance of these models, as shown in Table 4.
The WaterNet model [29] is an ensemble deep learning approach that combines five models, namely Standard U-Net, FC-DenseNet, Fractal U-Net, Dilated U-Net, and Pix2Pix trained on LANDSAT 8 OLI imagery with blue, red, and near-infrared bands. In the Pix2Pix architecture of the WaterNet model, a U-Net-based network serves as the generative network to produce predictions, utilizing convolution layers with 4 × 4 dimensions and Leaky ReLU activation functions. Similarly, the discriminative network of WaterNet employs 4 × 4 convolution layers and Leaky ReLU functions with transposed convolution layers. The WaterNet model requires extensive training as follows: 4 h 27 min for the Standard U-Net, 4 h 7 min for Dilated U-Net, 1 h 58 min for Fractal U-Net, 8 h 50 min for FC-DenseNet, and 4 h 58 min for Pix2Pix networks. In contrast, the proposed methodology does not require a training process, and the inference times for a single image are around 24 s for the SAM to extract image segments and 11 s for the CLIP RSICD model to classify the extracted segments.
In comparison to similar studies on water body segmentation, Wieland et al. [34] employed the U-Net architecture with the MobileNet-V3 encoder on pan-sharpened IKONOS imagery with red, green, blue, and NIR bands and 0.8 m spatial resolution. They achieved a relatively higher performance (92% precision, 87% recall, and 81% IoU) on the IKONOS data (Test Scenario 1). However, in experiments generalizing from satellite to aerial imagery (Test Scenario 2), their network considerably underperformed, as evidenced by a precision of 79%, a recall of 78%, and an IoU of 62%. Also, Wieland et al. [34] achieved a 7% improvement in the IoU metric by incorporating the NIR band, obtaining the 81% IoU in Test Scenario 1; their study thereby demonstrates the effect of band selection on the accuracy of a deep learning model. Since each sensor requires its own training dataset and models trained on samples from a single sensor encounter challenges when transferred to others [1], it is challenging to generalize across datasets with different band combinations. These challenges are relevant for other deep-learning-based studies [1,29,35,36]. Nevertheless, the SAM + CLIP RSICD framework stands out for its flexibility and generalizability, even with marginally lower performance metrics. Moreover, the SAM + CLIP RSICD framework can be effortlessly adapted to imagery featuring various band combinations. Our experiments have shown that the proposed methodology is adaptable to imagery of varying spatial resolutions, including 30 m LANDSAT 8 OLI and 1 m aerial images. Also, the SAM + CLIP RSICD framework is capable of zero-shot classification across imagery from various sensors, although this requires parameter optimization for the SAM and the careful selection of an appropriate prompt for the CLIP RSICD model.
Unlike common ground objects like buildings and roads, water bodies have a wide range of sizes and forms, complicating their accurate delineation [35]. Surface water bodies, including lakes, ponds, rivers, reservoirs, wetlands, seas, and oceans, serve as Earth’s primary water storage. Seas and oceans are particularly distinct from freshwater bodies due to their high salinity levels [41]. Creating a training dataset for each specific water-related class, and developing an architecture capable of simultaneously handling all of them, would be costly or might require separate architectures. The proposed approach, in contrast, offers a directly applicable methodology that can be effectively used across all these categories.
The proposed approach encounters challenges with changing water levels and seasonal changes. For example, in the Rize study area, lower water levels in the river section significantly diminished classification completeness. Similarly, in another case within the same area, we observed that the proposed approach could not segment low water levels near the shoreline. Despite these limitations, the proposed methodology successfully captures intricate water body boundaries and can delineate even the smaller segments of water bodies. Additionally, shadows created by geographical conditions can adversely affect segmentation accuracy. These limitations of the proposed approach can be mitigated via the fine-tuning of the SAM and the CLIP RSICD model with remote sensing imagery including more distinct types of water bodies. Nonetheless, most image-based approaches face similar issues [1,34,36,41]. One of the most significant factors affecting the performance of the proposed approach is the selection of the appropriate prompt for the CLIP RSICD model. As demonstrated in Section 2.3, the CLIP RSICD model’s responses to prompts vary dramatically. For Masks 1, 4, and 5, the model’s responses to Prompt 2 indicate a high likelihood of freshwater and vegetation, with a significant score for freshwater. Prompt 3 suggests a similar interpretation with even higher confidence in freshwater, whereas Prompt 4 notably identifies the segment as predominantly bare land with some desert, which is a substantial and accurate deviation from the first two prompts. Similarly, Prompt 1 accurately labeled the land class, albeit with some confusion with the water class. While the CLIP RSICD model generates accurate segmentation of the land and water classes with Prompt 1, there is considerable ambiguity in the predictions when compared to Prompt 4. It becomes apparent that the prompt semantics can considerably influence the classifications of the model, as evidenced by the consistently water-related identifications in the initial prompts versus the notable discrepancies observed with Prompt 4. Our experiments revealed that selecting the appropriate prompt is critical for achieving high labeling accuracy when using the CLIP RSICD model.
Additionally, research consistently shows a worldwide reduction in clean water sources [54]. The decline is driven by climate change, the consequences of droughts, expanding urban areas, agricultural demands, and an increasing global population, which collectively strain the limited clean water resources remaining on Earth [54,55,56]. Integrating advanced deep learning techniques into remote sensing imagery is expected to open up extensive opportunities for further advancements in water sciences, driven by the ongoing progress in automation and artificial intelligence [44]. In this context, the SAM + CLIP RSICD framework is particularly valuable for accurate water body mapping in environmental studies, water resources management, and resilience planning, where the ability to efficiently process multiple-source imagery in a timely manner is crucial.

5. Conclusions

In this study, the combination of the SAM and CLIP RSICD foundation models within a unified framework achieved an average accuracy rate exceeding 90% across all metrics for extracting water bodies from aerial imagery datasets, and an even higher average accuracy rate exceeding 94% across all metrics for the satellite imagery dataset. The remarkable performance of the VFMs highlights their advanced capabilities, demonstrating that they hold significant promise for advancing the field of remote sensing applications, particularly in zero-shot segmentation and classification tasks. Furthermore, the results demonstrate the superior generalization capabilities of VFMs when compared to traditional deep learning approaches. In particular, the CLIP RSICD model, initially trained on high-resolution RGB remote sensing imagery, showed an impressive ability to generalize to images that include the NIR band and have comparatively lower spatial resolution. This adaptability indicates that the model has inherent generalization capabilities beyond its original training scope. The success of the CLIP RSICD model in classifying remote sensing imagery, despite changes in spectral bands and resolution, highlights its potential for broader applications. Additionally, the pre-trained nature of these models eliminates the need for extensive task-specific training, thus offering a cost-effective approach that saves time and computational resources. The combination of the VFMs’ relatively short runtime and ease of use enhances the practicality of our methodology, providing an innovative and accessible solution suitable for a wide range of users. Moreover, the ongoing improvement of these models through the incorporation of more varied and extensive datasets could unlock further improvements in accuracy and efficiency.

Author Contributions

Conceptualization, S.O., Z.A., F.K. and T.K.; methodology, S.O., Z.A., F.K. and T.K.; data curation, F.K.; writing—original draft preparation, S.O., Z.A., F.K. and T.K.; writing—review and editing, S.O., Z.A., F.K. and T.K.; visualization, S.O., Z.A., F.K. and T.K.; supervision, F.K. and T.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The YTU-Waternet dataset utilized in this study is openly available at “http://www.remotesensinglab.yildiz.edu.tr/ (accessed on 29 December 2023)”. The aerial imagery dataset was obtained from the General Directorate of Mapping (GDM) and is only available with the permission of GDM.

Acknowledgments

The authors extend their sincere thanks to the General Directorate of Mapping, Turkey, for providing the aerial imagery dataset utilized in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, M.; Wu, P.; Wang, B.; Park, H.; Hui, Y.; Yanlan, W. A Deep Learning Method of Water Body Extraction from High Resolution Remote Sensing Images with Multisensors. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 3120–3132. [Google Scholar] [CrossRef]
  2. Kalogiannidis, S.; Kalfas, D.; Giannarakis, G.; Paschalidou, M. Integration of Water Resources Management Strategies in Land Use Planning towards Environmental Conservation. Sustainability 2023, 15, 15242. [Google Scholar] [CrossRef]
  3. Pietrucha-Urbanik, K.; Rak, J. Water, Resources, and Resilience: Insights from Diverse Environmental Studies. Water 2023, 15, 3965. [Google Scholar] [CrossRef]
  4. Gupta, D.; Kushwaha, V.; Gupta, A.; Singh, P.K. Deep Learning Based Detection of Water Bodies Using Satellite Images. In Proceedings of the 2021 International Conference on Intelligent Technologies (CONIT), Hubli, India, 25–27 June 2021; pp. 1–4. [Google Scholar]
  5. Yang, L.; Driscol, J.; Sarigai, S.; Wu, Q.; Lippitt, C.D.; Morgan, M. Towards Synoptic Water Monitoring Systems: A Review of AI Methods for Automating Water Body Detection and Water Quality Monitoring Using Remote Sensing. Sensors 2022, 22, 2416. [Google Scholar] [CrossRef]
  6. Drogkoula, M.; Kokkinos, K.; Samaras, N. A Comprehensive Survey of Machine Learning Methodologies with Emphasis in Water Resources Management. Appl. Sci. 2023, 13, 12147. [Google Scholar] [CrossRef]
  7. Panahi, J.; Mastouri, R.; Shabanlou, S. Insights into Enhanced Machine Learning Techniques for Surface Water Quantity and Quality Prediction Based on Data Pre-Processing Algorithms. J. Hydroinform. 2022, 24, 875–897. [Google Scholar] [CrossRef]
  8. Wang, Y.; Li, S.; Lin, Y.; Wang, M. Lightweight Deep Neural Network Method for Water Body Extraction from High-Resolution Remote Sensing Images with Multisensors. Sensors 2021, 21, 7397. [Google Scholar] [CrossRef] [PubMed]
  9. Gharbia, R. Deep Learning for Automatic Extraction of Water Bodies Using Satellite Imagery. J. Indian Soc. Remote Sens. 2023, 51, 1511–1521. [Google Scholar] [CrossRef]
  10. Sit, M.; Demiray, B.Z.; Xiang, Z.; Ewing, G.J.; Sermet, Y.; Demir, I. A Comprehensive Review of Deep Learning Applications in Hydrology and Water Resources. Water Sci. Technol. 2020, 82, 2635–2670. [Google Scholar] [CrossRef]
  11. Adli Zakaria, M.N.; Ahmed, A.N.; Abdul Malek, M.; Birima, A.H.; Hayet Khan, M.M.; Sherif, M.; Elshafie, A. Exploring Machine Learning Algorithms for Accurate Water Level Forecasting in Muda River, Malaysia. Heliyon 2023, 9, e17689. [Google Scholar] [CrossRef]
  12. Naeem, K.; Zghibi, A.; Elomri, A.; Mazzoni, A.; Triki, C. A Literature Review on System Dynamics Modeling for Sustainable Management of Water Supply and Demand. Sustainability 2023, 15, 6826. [Google Scholar] [CrossRef]
  13. Latif, S.D.; Ahmed, A.N. Streamflow Prediction Utilizing Deep Learning and Machine Learning Algorithms for Sustainable Water Supply Management. Water Resour. Manag. 2023, 37, 3227–3241. [Google Scholar] [CrossRef]
  14. Mukonza, S.S.; Chiang, J.-L. Meta-Analysis of Satellite Observations for United Nations Sustainable Development Goals: Exploring the Potential of Machine Learning for Water Quality Monitoring. Environments 2023, 10, 170. [Google Scholar] [CrossRef]
  15. Ch, A.; Ch, R.; Gadamsetty, S.; Iwendi, C.; Gadekallu, T.R.; Dhaou, I.B. ECDSA-Based Water Bodies Prediction from Satellite Images with UNet. Water 2022, 14, 2234. [Google Scholar] [CrossRef]
  16. Tambe, R.G.; Talbar, S.N.; Chavan, S.S. Deep Multi-Feature Learning Architecture for Water Body Segmentation from Satellite Images. J. Vis. Commun. Image Represent. 2021, 77, 103141. [Google Scholar] [CrossRef]
  17. Li, W.; Li, Y.; Gong, J.; Feng, Q.; Zhou, J.; Sun, J.; Shi, C.; Hu, W. Urban Water Extraction with UAV High-Resolution Remote Sensing Data Based on an Improved U-Net Model. Remote Sens. 2021, 13, 3165. [Google Scholar] [CrossRef]
  18. Kaplan, G.; Avdan, U. Object-Based Water Body Extraction Model Using Sentinel-2 Satellite Imagery. Eur. J. Remote Sens. 2017, 50, 137–143. [Google Scholar] [CrossRef]
  19. Kuleli, T.; Guneroglu, A.; Karsli, F.; Dihkan, M. Automatic Detection of Shoreline Change on Coastal Ramsar Wetlands of Turkey. Ocean Eng. 2011, 38, 1141–1149. [Google Scholar] [CrossRef]
  20. Hamzaoglu, C.; Dihkan, M. Automatic Extraction of Highly Risky Coastal Retreat Zones Using Google Earth Engine (GEE). Int. J. Environ. Sci. Technol. 2023, 20, 353–368. [Google Scholar] [CrossRef]
  21. Xu, H. Modification of Normalised Difference Water Index (NDWI) to Enhance Open Water Features in Remotely Sensed Imagery. Int. J. Remote Sens. 2006, 27, 3025–3033. [Google Scholar] [CrossRef]
  22. Luo, Y.; Feng, A.; Li, H.; Li, D.; Wu, X.; Liao, J.; Zhang, C.; Zheng, X.; Pu, H. New Deep Learning Method for Efficient Extraction of Small Water from Remote Sensing Images. PLoS ONE 2022, 17, e0272317. [Google Scholar] [CrossRef] [PubMed]
  23. Qin, X.; Yang, J.; Li, P.; Sun, W. Research on Water Body Extraction from Gaofen-3 Imagery Based on Polarimetric Decomposition and Machine Learning. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 6903–6906. [Google Scholar]
  24. Li, A.; Fan, M.; Qin, G.; Xu, Y.; Wang, H. Comparative Analysis of Machine Learning Algorithms in Automatic Identification and Extraction of Water Boundaries. Appl. Sci. 2021, 11, 10062. [Google Scholar] [CrossRef]
  25. Nagaraj, R.; Kumar, L.S. Multi Scale Feature Extraction Network with Machine Learning Algorithms for Water Body Extraction from Remote Sensing Images. Int. J. Remote Sens. 2022, 43, 6349–6387. [Google Scholar] [CrossRef]
  26. Guru Prasad, M.S.; Agarwal, J.; Christa, S.; Aditya Pai, H.; Kumar, M.A.; Kukreti, A. An Improved Water Body Segmentation from Satellite Images Using MSAA-Net. In Proceedings of the 2023 International Conference on Machine Intelligence for GeoAnalytics and Remote Sensing (MIGARS), Hyderabad, India, 27–29 January 2023; Volume 1, pp. 1–4. [Google Scholar]
  27. Kavzoglu, T.; Teke, A.; Yilmaz, E.O. Shared Blocks-Based Ensemble Deep Learning for Shallow Landslide Susceptibility Mapping. Remote Sens. 2021, 13, 4776. [Google Scholar] [CrossRef]
  28. Yilmaz, E.O.; Tonbul, H.; Kavzoglu, T. Marine Mucilage Mapping with Explained Deep Learning Model Using Water-Related Spectral Indices: A Case Study of Dardanelles Strait, Turkey. Stoch. Environ. Res. Risk Assess. 2024, 38, 51–68. [Google Scholar] [CrossRef]
  29. Erdem, F.; Bayram, B.; Bakirman, T.; Bayrak, O.C.; Akpinar, B. An Ensemble Deep Learning Based Shoreline Segmentation Approach (WaterNet) from Landsat 8 OLI Images. Adv. Space Res. 2021, 67, 964–974. [Google Scholar] [CrossRef]
  30. An, S.; Rui, X. A High-Precision Water Body Extraction Method Based on Improved Lightweight U-Net. Remote Sens. 2022, 14, 4127. [Google Scholar] [CrossRef]
  31. Mullen, A.L.; Watts, J.D.; Rogers, B.M.; Carroll, M.L.; Elder, C.D.; Noomah, J.; Williams, Z.; Caraballo-Vega, J.A.; Bredder, A.; Rickenbaugh, E.; et al. Using High-Resolution Satellite Imagery and Deep Learning to Track Dynamic Seasonality in Small Water Bodies. Geophys. Res. Lett. 2023, 50, e2022GL102327. [Google Scholar] [CrossRef]
  32. Nasir, N.; Kansal, A.; Alshaltone, O.; Barneih, F.; Shanableh, A.; Al-Shabi, M.; Al Shammaa, A. Deep Learning Detection of Types of Water-Bodies Using Optical Variables and Ensembling. Intell. Syst. Appl. 2023, 18, 200222. [Google Scholar] [CrossRef]
  33. He, Y.; Yao, S.; Yang, W.; Yan, H.; Zhang, L.; Wen, Z.; Zhang, Y.; Liu, T. An Extraction Method for Glacial Lakes Based on Landsat-8 Imagery Using an Improved U-Net Network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 6544–6558. [Google Scholar] [CrossRef]
  34. Wieland, M.; Martinis, S.; Kiefl, R.; Gstaiger, V. Semantic Segmentation of Water Bodies in Very High-Resolution Satellite and Aerial Images. Remote Sens. Environ. 2023, 287, 113452. [Google Scholar] [CrossRef]
  35. Duan, L.; Hu, X. Multiscale Refinement Network for Water-Body Segmentation in High-Resolution Satellite Imagery. IEEE Geosci. Remote Sens. Lett. 2020, 17, 686–690. [Google Scholar] [CrossRef]
  36. Dai, X.; Xia, M.; Weng, L.; Hu, K.; Lin, H.; Qian, M. Multiscale Location Attention Network for Building and Water Segmentation of Remote Sensing Image. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5609519. [Google Scholar] [CrossRef]
  37. Liu, M.; Liu, J.; Hu, H. A Novel Deep Learning Network Model for Extracting Lake Water Bodies from Remote Sensing Images. Appl. Sci. 2024, 14, 1344. [Google Scholar] [CrossRef]
  38. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation; Springer International Publishing: Cham, Switzerland, 2015. [Google Scholar]
  39. Chen, L.-C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  40. Howard, A.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
41. Nagaraj, R.; Kumar, L.S. Extraction of Surface Water Bodies Using Optical Remote Sensing Images: A Review. Earth Sci. Inform. 2024, 17, 893–956. [Google Scholar] [CrossRef]
  42. Kavzoglu, T.; Erdemir, M.Y.; Tonbul, H. Classification of Semiurban Landscapes from Very High-Resolution Satellite Images Using a Regionalized Multiscale Segmentation Approach. J. Appl. Remote Sens. 2017, 11, 035016. [Google Scholar] [CrossRef]
  43. Kavzoglu, T.; Tonbul, H. A Comparative Study of Segmentation Quality for Multi-Resolution Segmentation and Watershed Transform. In Proceedings of the 2017 8th International Conference on Recent Advances in Space Technologies (RAST), Istanbul, Turkey, 19–22 June 2017; pp. 113–117. [Google Scholar]
  44. Gautam, S.; Singhai, J. Critical Review on Deep Learning Methodologies Employed for Water-Body Segmentation through Remote Sensing Images. Multimed. Tools Appl. 2024, 83, 1869–1889. [Google Scholar] [CrossRef]
  45. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y.; et al. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2023, Paris, France, 2–6 October 2023. [Google Scholar]
  46. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021. [Google Scholar]
  47. Wang, D.; Zhang, J.; Du, B.; Xu, M.; Liu, L.; Tao, D.; Zhang, L. SAMRS: Scaling-up Remote Sensing Segmentation Dataset with Segment Anything Model. arXiv 2023, arXiv:2305.02034. [Google Scholar]
  48. Zhang, J.; Zhou, Z.; Mai, G.; Mu, L.; Hu, M.; Li, S. Text2Seg: Remote Sensing Image Semantic Segmentation via Text-Guided Visual Foundation Models. arXiv 2023, arXiv:2304.10597. [Google Scholar]
49. Ren, S.; Luzi, F.; Lahrichi, S.; Kassaw, K.; Collins, L.M.; Bradbury, K.; Malof, J.M. Segment Anything, from Space? In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision 2023, Waikoloa, HI, USA, 2–7 January 2023. [Google Scholar]
  50. Ding, L.; Zhu, K.; Peng, D.; Tang, H.; Yang, K.; Bruzzone, L. Adapting Segment Anything Model for Change Detection in HR Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 62, 5611711. [Google Scholar]
  51. Chen, K.; Liu, C.; Chen, H.; Zhang, H.; Li, W.; Zou, Z.; Shi, Z. RSPrompter: Learning to Prompt for Remote Sensing Instance Segmentation Based on Visual Foundation Model. IEEE Trans. Geosci. Remote Sens. 2023, 62, 4701117. [Google Scholar] [CrossRef]
  52. Arutiunian, A.; Vidhani, D.; Venkatesh, G.; Bhaskar, M.; Ghosh, R.; Pal, S. CLIP-Rsicd 2021. [GitHub Repository]. Available online: https://github.com/arampacha/CLIP-rsicd (accessed on 13 October 2023).
  53. Lu, X.; Wang, B.; Zheng, X.; Li, X. Exploring Models and Data for Remote Sensing Image Caption Generation. IEEE Trans. Geosci. Remote Sens. 2018, 56, 2183–2195. [Google Scholar] [CrossRef]
  54. Silva, J.A. Wastewater Treatment and Reuse for Sustainable Water Resources Management: A Systematic Literature Review. Sustainability 2023, 15, 10940. [Google Scholar] [CrossRef]
  55. Delanka-Pedige, H.M.K.; Munasinghe-Arachchige, S.P.; Abeysiriwardana-Arachchige, I.S.A.; Nirmalakhandan, N. Wastewater Infrastructure for Sustainable Cities: Assessment Based on UN Sustainable Development Goals (SDGs). Int. J. Sustain. Dev. World Ecol. 2021, 28, 203–209. [Google Scholar] [CrossRef]
  56. Jodar-Abellan, A.; López-Ortiz, M.I.; Melgarejo-Moreno, J. Wastewater Treatment and Water Reuse in Spain. Current Situation and Perspectives. Water 2019, 11, 1551. [Google Scholar] [CrossRef]
Figure 1. Eight pseudo-color sample tiles from the YTU-Waternet test dataset, generated using blue, red and NIR bands from Landsat 8 OLI imagery.
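For readers reproducing such tiles, the sketch below shows one way to assemble a pseudo-color composite from the blue, red, and NIR bands; the file name, the band ordering within the file, and the percentile stretch are assumptions, since the source states only which Landsat 8 OLI bands were used.

```python
# A minimal sketch (file name and band order are assumptions) of building a
# pseudo-color composite from Landsat 8 OLI blue (B2), red (B4) and NIR (B5)
# bands, stretched to 8 bits for display.
import numpy as np
import rasterio

def percentile_stretch(band, low=2, high=98):
    """Linearly stretch a band between its 2nd and 98th percentiles."""
    lo, hi = np.percentile(band, (low, high))
    return np.clip((band - lo) / (hi - lo), 0, 1)

with rasterio.open("LC08_tile.tif") as src:          # hypothetical multiband tile
    # Assumes the file stores the OLI bands in their native order.
    blue, red, nir = src.read(2), src.read(4), src.read(5)

composite = np.dstack([percentile_stretch(b) for b in (blue, red, nir)])
composite = (composite * 255).astype(np.uint8)       # 8-bit pseudo-color image
```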
Figure 2. The study area in the Rize province of Turkey. Tile boundaries are represented in red, and the respective tile numbers are given in the upper left corner of each tile.
Figure 3. The study area in the Malatya province of Turkey. Tile boundaries are represented in red, and the respective tile numbers are given in the upper left corner of each tile.
Figure 4. Flowchart of water body extraction with VFMs.
Figure 5. The framework of the Segment Anything Model (SAM) (adapted from Kirillov et al., 2023) [45].
Figure 6. An illustration of the CLIP model adapted from Radford et al. (2021) [46].
Figure 7. Comparative analysis of the CLIP RSICD model for different prompts on various image segments.
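Figure 7's prompt comparison maps directly onto the zero-shot classification step: each SAM segment crop is scored against several text prompts and the highest-probability prompt decides the class. The sketch below reproduces this in outline with the Hugging Face transformers CLIP API; the checkpoint identifier follows the CLIP-rsicd project [52], and the prompt wordings and file name are illustrative assumptions, not the authors' exact setup.

```python
# Sketch of zero-shot classification of a SAM segment crop with CLIP.
# Checkpoint id follows the CLIP-rsicd project [52]; fall back to
# "openai/clip-vit-base-patch32" if the fine-tuned weights are unavailable.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("flax-community/clip-rsicd-v2")
processor = CLIPProcessor.from_pretrained("flax-community/clip-rsicd-v2")

prompts = [                      # candidate prompt wordings, as compared in Figure 7
    "a photo of a lake",
    "an aerial image of water",
    "a satellite image of a water body",
]
segment = Image.open("segment_crop.png")   # hypothetical crop of one SAM segment

inputs = processor(text=prompts, images=segment, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image   # image-to-prompt similarity, shape (1, 3)
probs = logits.softmax(dim=-1).squeeze(0)

for prompt, p in zip(prompts, probs):
    print(f"{p.item():.3f}  {prompt}")
```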
Figure 8. Results for five sample test areas from the YTU-Waternet dataset. Segments extracted with SAM are delineated with blue boundaries and filled with distinct, randomly assigned colors to ensure clear differentiation.
Figure 9. In-depth analysis of segmentation results of the SAM + CLIP RSICD framework for the YTU-Waternet dataset. Segments extracted by SAM are outlined in blue and filled with a pale blue color.
Figure 10. Segmentation results of the proposed framework on the Malatya study area. Segments identified using SAM have blue boundaries and are filled with random colors for clear differentiation.
Figure 11. Segmentation results of the proposed framework on the Malatya study area. Note that the green lines represent the extracted water body boundaries, while the tile boundaries are outlined in red, and the respective tile numbers are displayed in the upper left corner of each tile.
Figure 12. Comparison of the SAM segmentation results in Tile 3 before and after the adjustment of the stability score offset parameter to 0.1. Segments extracted by SAM are outlined with blue boundaries and filled with random colors for clear differentiation.
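The parameter change behind Figure 12 corresponds to a single argument of the segment-anything automatic mask generator. The sketch below regenerates masks with the offset lowered from its 1.0 default to the 0.1 value used in the figure; the checkpoint and tile file names are assumptions, not the authors' exact configuration.

```python
# A minimal sketch of regenerating SAM masks with a lowered stability score
# offset, as contrasted in Figure 12. Checkpoint and tile names are assumed.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")

# stability_score_offset defaults to 1.0; a smaller offset raises the computed
# stability scores, so more candidate segments survive the stability filter.
mask_generator = SamAutomaticMaskGenerator(sam, stability_score_offset=0.1)

image = cv2.cvtColor(cv2.imread("tile_03.png"), cv2.COLOR_BGR2RGB)  # hypothetical tile
masks = mask_generator.generate(image)   # list of dicts: 'segmentation', 'area', ...
print(f"Generated {len(masks)} segments")
```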
Figure 13. Results for the Rize study area. Note that the green lines represent the extracted boundaries of the water bodies, while the tile boundaries are outlined in red, and the respective tile numbers are displayed in the upper left corner of each tile.
Figure 14. Segmentation results for Tiles 10 and 11 using the proposed framework on the Rize dataset. Segments extracted by SAM are outlined with blue boundaries and filled with random colors for clear differentiation.
Table 1. The performance of the proposed methodology on the YTU-Waternet dataset (all values in %).

Study                          Precision   Recall    F1-Score   OA        IoU
SAM + CLIP RSICD (Proposed)    96.083      98.205    96.974     98.321    94.410
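For reference, the metrics reported in Tables 1–4 are assumed to follow their standard pixel-wise definitions in terms of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN); the extracted section does not restate them, so this block is a reminder rather than a quotation from the source.

```latex
\mathrm{Precision} = \frac{TP}{TP+FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP+FN}, \qquad
\mathrm{F1} = \frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}},

\mathrm{OA} = \frac{TP+TN}{TP+TN+FP+FN}, \qquad
\mathrm{IoU} = \frac{TP}{TP+FP+FN}.
```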
Table 2. Accuracy metrics for the water body classification of aerial imagery (Malatya study area; all values in %).
Tile       Precision   Recall     F1-Score   OA         IoU
Tile 1     99.047      97.803     98.421     99.024     96.891
Tile 2     98.421      97.011     97.711     99.139     95.524
Tile 3     99.819      95.697     97.715     98.426     95.524
Tile 4     98.526      99.886     99.201     99.612     98.416
Tile 5     96.458      99.439     97.925     99.395     95.935
Tile 6     99.917      97.592     98.741     99.569     97.513
Tile 7     100.000     95.685     97.795     99.188     95.685
Tile 8     99.979      97.748     98.851     99.277     97.728
Tile 9     99.443      98.227     98.831     99.149     97.689
Tile 10    92.965      78.345     85.032     98.955     73.961
Tile 11    99.829      97.906     98.858     99.357     97.743
Tile 12    99.290      94.368     96.766     98.409     93.735
Tile 13    99.928      97.949     98.929     99.531     97.881
Tile 14    99.908      98.701     99.301     99.519     98.611
Tile 15    99.893      98.100     98.988     99.528     97.997
Tile 16    99.800      99.024     99.410     99.473     98.827
Tile 17    99.704      98.557     99.127     99.682     98.269
Tile 18    92.885      99.429     96.046     98.403     92.392
Tile 19    98.920      96.668     97.781     99.815     95.659
Tile 20    99.966      98.364     99.158     98.878     98.331
Tile 21    99.811      98.160     98.979     98.767     97.978
Tile 22    99.891      95.904     97.856     98.448     95.803
Tile 23    99.947      97.262     98.586     99.811     97.211
Tile 24    99.924      98.947     99.433     99.479     98.873
Tile 25    99.987      97.980     98.973     98.786     97.968
Tile 26    99.781      96.762     98.248     99.151     96.556
Tile 27    99.956      98.884     99.417     99.668     98.841
Tile 28    100.000     99.063     99.529     99.256     99.063
Tile 29    100.000     97.484     98.726     99.349     97.484
Tile 30    99.988      98.384     99.184     99.155     98.381
Tile 31    99.973      85.565     92.210     98.674     85.546
Mean       99.160      96.803     97.927     99.189     96.065
Table 3. Accuracy metrics for the water body classification of aerial imagery (Rize study area; all values in %).
Tile Number   Precision   Recall    F1-Score   OA        IoU
Tile 1        100.000     100.000   100.000    100.000   100.000
Tile 2        100.000     100.000   100.000    100.000   100.000
Tile 3        100.000     99.896    99.948     99.897    99.896
Tile 4        98.686      99.636    99.158     98.797    98.331
Tile 5        99.995      97.878    98.925     98.649    97.873
Tile 6        98.074      99.495    98.779     99.403    97.588
Tile 7        99.872      99.540    99.706     99.418    99.414
Tile 8        100.000     97.901    98.940     98.760    97.901
Tile 9        99.106      91.311    95.049     98.170    90.565
Tile 10       99.949      86.553    92.770     97.291    86.514
Tile 11       96.930      62.012    75.636     97.087    60.818
Tile 12       62.337      96.659    75.794     98.615    61.022
Mean          96.246      94.240    94.559     98.840    90.827
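As a consistency check on the reconstructed rows, the Tile 11 scores satisfy the usual identities linking the metrics; this is a worked example (values expressed as fractions of 1), not a calculation from the source.

```latex
\mathrm{F1} = \frac{2\,P\,R}{P+R}
            = \frac{2 \times 0.96930 \times 0.62012}{0.96930 + 0.62012}
            \approx 0.75636, \qquad
\mathrm{IoU} = \frac{\mathrm{F1}}{2-\mathrm{F1}}
             = \frac{0.75636}{1.24364}
             \approx 0.60818,
```

in agreement with the 75.636% and 60.818% reported for Tile 11.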
Table 4. Recent studies on water body extraction using satellite imagery (all values in %; "-" denotes a metric not reported).

Study                                                  Precision   Recall    F1-Score   OA        IoU
SAM + CLIP RSICD (Proposed)                            96.083      98.205    96.974     98.321    94.410
WaterNet [29]                                          99.726      99.858    99.792     99.797    99.585
U-Net with MobileNet-V3 encoder, Test Scenario 1 [34]  92.00       87.00     -          -         81.00
U-Net with MobileNet-V3 encoder, Test Scenario 2 [34]  79.00       78.00     -          -         62.00
MSLANet [36]                                           97.86       96.28     97.06      -         94.30
MSR-Net (101) [35]                                     97.49       98.42     97.95      -         95.98
DLFC [1]                                               -           -         95.39      98.44     91.25