1. Introduction
Landslides are a serious geologic hazard that has caused substantial losses worldwide in recent years [1]. They occur when heavy rainfall, earthquakes, or human activities trigger the movement of soil and rock on slopes [2,3,4]. The frequency and severity of landslide occurrences are on the rise, attributed to factors such as global warming, population growth, resource extraction, and environmental degradation [5,6]. Therefore, conducting landslide hazard studies and accurately identifying landslides are essential for assessing the impact of disasters, guiding post-disaster reconstruction, and preventing secondary disasters [7,8]. With the development of remote sensing and satellite technology, remote sensing has become increasingly widely used in large-scale geohazard investigations, and great progress has been made in landslide identification from remote sensing data [9,10,11,12].
There are currently four main approaches to landslide recognition from remote sensing images: visual interpretation [13,14], pixel-based methods [15], object-based methods [16], and methods based on deep learning (DL) techniques [17,18,19]. Visual interpretation refers to professionals manually annotating and classifying landslide areas by directly observing and analyzing the features, morphology, color, texture, and other information in remote sensing images [20,21]. This method achieves the highest accuracy but is time-consuming and labor-intensive. Pixel-based methods reduce the degree of human intervention through supervised training and improve computational efficiency [22], but they struggle to obtain clear landslide boundaries and cannot fully exploit the rich structural and textural information in the images [23]. Object-based methods use a variety of discriminative features, such as the spectral, texture, and morphological features of landslides, for landslide detection [24,25,26]. They can group objects with similar features into the same class, reducing salt-and-pepper noise. However, their recognition accuracy depends largely on the initial segmentation precision, and their limited ability to depict details results in a significant post-processing workload.
With the great progress of DL in the field of computer vision, DL-based methods have been widely applied to landslide recognition tasks and have become the main trend in this field [27,28]. Current research on DL models for landslide recognition focuses on three directions: image classification, object detection, and semantic segmentation. Convolutional neural networks (CNNs) are commonly used in this research [29,30,31]; they are characterized by relatively complex structures, numerous training parameters, and high demands on training data. Ji et al. [32] employed an attention-boosted CNN model, based on image classification, to recognize newly occurring landslides in Bijie, China. Ghorbanzadeh et al. [33] compared the performance of CNNs with that of neural networks, support vector machines, and random forests in the semantic segmentation of landslides and found that CNNs perform better when enough samples are available. In addition, other CNN-based models such as PSPNet [34], AlexNet [35], ResNet [36], U-Net [37], and DenseNet [38] have also been used for the semantic segmentation of landslides. Furthermore, object detection models represented by Faster R-CNN [39] and the YOLO series [40,41,42] have also been applied to landslide recognition tasks.
With the introduction of the vision transformer (ViT) and its notable successes in computer vision, transformer-based models have been widely used in remote sensing identification tasks [43]. Chen et al. [44] developed sparse token transformers (STTs) for extracting buildings from remote sensing images; by using a novel “sparse token sampler” module to represent buildings as sparse feature vectors, the STT achieves excellent performance on benchmark datasets while reducing computational complexity. Wang et al. [45] introduced a novel ViT architecture named BuildFormer that enables accurate building extraction from remote sensing images; it overcomes the limitations of traditional CNN methods in modeling global dependencies and preserving spatial details, achieving state-of-the-art performance. For landslide identification, Huang et al. [46] improved the Swin transformer by incorporating morphological edge analysis to address landslide boundary discretization and irregularity, achieving more accurate landslide boundary extraction in the Luding area of China. Lu et al. [47] proposed ShapeFormer, a shape-enhanced ViT model designed to effectively handle landslides of various sizes and shapes in remote sensing imagery, enhancing the accuracy of landslide detection. Fu et al. [48] significantly improved the accuracy and recognition capabilities of both YOLOv5 and Faster R-CNN by replacing their backbones with Swin transformers. While these models perform well on specific tasks, they often require large amounts of training data, and their limited generalization, transfer, and self-adaptation capabilities constrain their adaptation to downstream tasks.
Recently, remarkable progress has been made in foundation models such as GPT-4 [49], Flamingo [50], and SAM [51], which have had a broad impact across many domains. SAM, a vision foundation model pre-trained on the SA-1B dataset, shows substantial generalization across various image and object segmentation tasks without additional training, opening up new possibilities for the intelligent interpretation of natural images [52,53,54]. SA-1B is the most extensive and diverse image segmentation dataset available: it contains over 11 million high-quality images taken from around the world, covering a wide range of scenes, objects, and environments, together with over 1 billion high-quality segmentation masks collected using Meta’s data engine [51]. SAM comprises an image encoder, a prompt encoder, and a mask decoder. The image encoder is a ViT model pre-trained with a masked autoencoder (MAE) [55] that takes an image as input and generates its embedding. The prompt encoder takes prompt information as input and outputs prompt embeddings. The mask decoder maps the image embedding and prompt embeddings to a mask. Since SAM is an interactive model, it can take point, box, or mask prompts when segmenting images, and the segmentation results vary depending on the type of prompt used. Some researchers have already begun to apply SAM to remote sensing data. Chen et al. [56] designed a prompt learning method, RSPrompter, for remote sensing images based on the SAM foundation model; it generates prompt inputs for SAM, enabling it to automatically produce instance-segmentation-level masks. Sultan et al. [57] introduced GeoSAM, an innovative architecture for fine-tuning SAM using sparse and dense prompts, leading to significant improvements in geographic image segmentation. Zhang et al. [58] proposed RSAM-Seg, which introduces Adapter-Scale in the multi-head attention blocks of SAM’s encoder and inserts Adapter-Feature between ViT blocks; this design generates image-informed prompts and enhances the model’s performance in remote sensing image segmentation tasks.
Figure 1 shows how well SAM recognizes different images with different types of prompts. Without prompts, SAM’s performance on remote sensing images is notably weaker than on natural images. When prompts are provided, SAM excels at recognizing images with distinct boundaries and minimal background interference, such as airplane and factory images, but struggles with landslide images. Remote sensing images often require specialized spectral and spatial analyses owing to their particular acquisition and processing techniques, and they differ significantly from natural images. This is especially true of remote sensing landslide images, whose common challenges include blurred boundaries, complex background interference, and diverse morphological features. Although the SA-1B dataset includes images from various sources, such as natural scenes, urban environments, medical images, and satellite images, there is still room to optimize SAM to improve the accuracy and generalization of target recognition in remote sensing imagery, a limitation also noted by other researchers [56,57,58].
In this paper, we propose the SAM-based cross-feature fusion network (SAM-CFFNet). This network is designed to create a novel semantic segmentation model for the high-precision recognition of landslides. We utilize SAM’s image encoder to extract multi-level features from remote sensing optical images and design the cross-feature fusion decoder (CFFD) tailored to the characteristics of SAM’s image encoder and the requirements of landslide recognition tasks. In the CFFD module, we propose a novel cross-fusion mechanism and demonstrate its effectiveness in subsequent experiments. Furthermore, the CFFD ensures high segmentation accuracy by incorporating a shallow feature extractor (SFE). We comprehensively evaluate the performance of SAM-CFFNet on three open-source landslide datasets, and the experimental results show that our model outperforms the other comparative models.
The innovation of our approach lies in leveraging SAM’s powerful feature extraction capability to design a decoder better adapted to the downstream segmentation task. This enables high-precision landslide recognition with fewer trainable parameters and without human prompts, surpassing traditional semantic segmentation methods. In summary, the main contributions of this study to the remote sensing landslide identification task are as follows:
- (1) We propose a new semantic segmentation model, SAM-CFFNet, which demonstrates excellent performance on three landslide datasets, improving the accuracy of landslide recognition.
- (2) Our proposed CFFD fully considers the characteristics of landslide images, adapts well to SAM’s image encoder, and shows excellent performance in landslide recognition tasks.
- (3) The excellent performance of SAM-CFFNet in the landslide identification task highlights the potential of SAM in this field and provides new ideas and methods for further applications of the SAM foundation model in remote sensing.
3. Methods
3.1. Framework
The SAM-CFFNet proposed in this study is an end-to-end network designed to extract landslide features from remote sensing images and output binary images representing the landslide identification results. The structure of SAM-CFFNet is shown in Figure 3; it mainly consists of the image encoder ViT (IEViT) and the CFFD.
The IEViT extracts four levels of hierarchical deep features from input images with a resolution of 1024 × 1024. The CFFD is the decoder that integrates the multi-scale semantic features produced by the IEViT to obtain refined recognition results. It employs the cross-feature fusion module (CFFM) to fine-tune and cross-fuse the extracted features, thereby amplifying information relevant to landslide characteristics. The fused features are dimensionally reduced via convolutional layers before entering the Bottle ASPP module, which captures background context across different receptive fields. The outputs of this module are then upsampled to match the resolution of the features from the secondary branch, the SFE. This branch captures and refines texture details from the input image and uses an attention module to weight the shallow features according to their relevance to the deeper features of the main pathway. Finally, the processed deep features and the texture features from the SFE are concatenated, and the concatenated feature map is upsampled to the original input resolution and passed through a final convolutional layer to produce the prediction output. The specific structure of each module is described in detail next.
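To make the data flow above concrete, the following shape-only walk-through traces a 1024 × 1024 image through the pipeline. It is a minimal sketch with random tensors standing in for the real modules; the channel widths (1024 for the ViT-L tokens, 256 after reduction, 64 for the SFE branch) are illustrative assumptions rather than the exact configuration.

```python
import torch
import torch.nn.functional as F

B = 1
image = torch.randn(B, 3, 1024, 1024)

# IEViT: four hierarchical feature maps at 1/16 resolution (ViT-L token width assumed 1024).
deep_feats = [torch.randn(B, 1024, 64, 64) for _ in range(4)]

# CFFM cross-fuses the four levels; convolutions then reduce channels (256 assumed).
fused = torch.randn(B, 256, 64, 64)

# Bottle ASPP keeps the shape; upsample to the SFE branch resolution (1/8 of the input).
context = F.interpolate(fused, scale_factor=2, mode="bilinear", align_corners=False)

# SFE: attention-weighted shallow texture features at 1/8 resolution (64 channels assumed).
shallow = torch.randn(B, 64, 128, 128)

# Concatenate, predict a 1-channel mask, and upsample back to the input resolution.
merged = torch.cat([context, shallow], dim=1)                  # (B, 320, 128, 128)
logits = torch.nn.Conv2d(merged.shape[1], 1, kernel_size=1)(merged)
logits = F.interpolate(logits, size=image.shape[-2:], mode="bilinear", align_corners=False)
print(logits.shape)                                            # torch.Size([1, 1, 1024, 1024])
```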
3.2. Image Encoder ViT
Our IEViT is built upon SAM’s image encoder, specifically choosing the ViT-L version for its balanced trade-off between performance and parameter count, and modifying it by removing the neck module positioned at the end of the model. The IEViT loads the pre-trained weights that are publicly available on the official SAM website, and the entire IEViT module remains frozen throughout the experiments.
As illustrated in Figure 4, the IEViT consists of patch embedding, position embedding, and transformer encoder components. The patch embedding divides the 1024 × 1024 input image into a 64 × 64 grid of patches by applying a convolution with a kernel size of 16, a stride of 16, and no padding. The position embedding is a tensor, initialized to zeros, whose dimensions match the patch grid and embedding dimension; it is added to the patches element-wise, thereby incorporating positional information. The resulting patches are fed into the transformer encoder, the core component of the ViT, which processes the serialized patches to learn global features of the image, efficiently capturing long-range dependencies and complex patterns through the self-attention mechanism and stacked MLPs. The transformer encoder consists of 24 transformer blocks, each of which maintains the same input and output dimensions and can therefore be connected in series. An excessive number of transformer blocks can lead to information loss when features are transferred between blocks at different levels. Moreover, as an interactive vision foundation model, SAM may not produce deep feature maps that contain rich semantic information for specific categories. Therefore, to obtain more feature information related to landslides, we output the features of the 6th, 12th, 18th, and 24th transformer blocks. This is reasonable because outputting features at different levels captures image representations at various levels, which enhances the model’s generalization ability and makes it more suitable for various downstream tasks.
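The tap points described above can be implemented without modifying SAM itself, for example by registering forward hooks on the relevant transformer blocks. The sketch below assumes Meta’s publicly released segment_anything package (with its image_encoder.blocks ModuleList) and the ViT-L checkpoint filename from the official repository; adjust names and paths to your installation. Although the neck is removed in the IEViT, the hooks capture the block outputs before the neck, so it can simply be left unused.

```python
import torch
from segment_anything import sam_model_registry  # Meta's public SAM package (assumed installed)

# Load the ViT-L SAM and keep only its image encoder, frozen as in the IEViT.
sam = sam_model_registry["vit_l"](checkpoint="sam_vit_l_0b3195.pth")
encoder = sam.image_encoder
for p in encoder.parameters():
    p.requires_grad = False

taps = {5, 11, 17, 23}             # 6th, 12th, 18th, and 24th blocks (0-indexed)
features = []

def make_hook(storage):
    def hook(module, inputs, output):
        storage.append(output)      # token grid of shape (B, 64, 64, C) for a 1024x1024 input
    return hook

handles = [blk.register_forward_hook(make_hook(features))
           for i, blk in enumerate(encoder.blocks) if i in taps]

with torch.no_grad():
    _ = encoder(torch.randn(1, 3, 1024, 1024))   # the neck output is simply ignored
print([tuple(f.shape) for f in features])        # four hierarchical feature levels
```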
3.3. Cross-Feature Fusion Decoder
The structure of the CFFD is shown in Figure 3. The following subsections focus on its three modules: the SFE, the CFFM, and the Bottle ASPP.
3.3.1. Shallow Feature Extractor
In the patch embedding of the IEViT, downsampling the image by a factor of 16 causes a loss of texture information. To address this issue, we introduce the SFE. As illustrated in Figure 3, the SFE comprises three convolutions with a stride of 2, three EPSA modules [61], and an attention block [62]. The EPSA module, proposed by Zhang et al. [61], extracts fine-grained multi-scale spatial information and establishes long-distance dependencies. The attention block, introduced by Oktay et al. [62], enhances feature representation by dynamically focusing on crucial features, thereby reducing irrelevant information.
The SFE uses the three convolutions to downsample the input image by a factor of eight and then applies the three EPSA modules to extract shallow information. The attention block [62] suppresses the information in the shallow features that is unrelated to the main-branch features in the CFFD, minimizing the information loss and confusion that would otherwise result from their fusion. By supplementing shallow information, the SFE improves the model’s ability to represent details and low-level features and strengthens its capture of image detail and semantic information.
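A minimal sketch of this branch is given below. The EPSA modules are replaced by plain convolutional blocks as stand-ins (the full EPSA design is described in [61]), the attention block follows the gating idea of Oktay et al. [62] in simplified form, and the channel widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Simplified attention gate in the spirit of Oktay et al. [62]:
    deep (gating) features spatially re-weight the shallow features."""
    def __init__(self, shallow_ch, deep_ch, inter_ch):
        super().__init__()
        self.w_x = nn.Conv2d(shallow_ch, inter_ch, 1, bias=False)
        self.w_g = nn.Conv2d(deep_ch, inter_ch, 1, bias=False)
        self.psi = nn.Sequential(nn.ReLU(inplace=True),
                                 nn.Conv2d(inter_ch, 1, 1), nn.Sigmoid())

    def forward(self, x, g):
        g = nn.functional.interpolate(g, size=x.shape[-2:], mode="bilinear",
                                      align_corners=False)
        return x * self.psi(self.w_x(x) + self.w_g(g))

class SFE(nn.Module):
    """Shallow feature extractor: three stride-2 convolutions (8x downsampling),
    followed by conv blocks standing in for the three EPSA modules [61]."""
    def __init__(self, in_ch=3, mid_ch=64, deep_ch=256):
        super().__init__()
        convs, c_in = [], in_ch
        for c_out in (mid_ch // 4, mid_ch // 2, mid_ch):
            convs += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                      nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
            c_in = c_out
        self.down = nn.Sequential(*convs)
        self.epsa = nn.Sequential(*[nn.Sequential(
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True)) for _ in range(3)])
        self.gate = AttentionGate(mid_ch, deep_ch, mid_ch // 2)

    def forward(self, image, deep_feat):
        x = self.epsa(self.down(image))      # (B, mid_ch, H/8, W/8) shallow features
        return self.gate(x, deep_feat)       # gated by the main-branch deep features

image = torch.randn(1, 3, 1024, 1024)
deep = torch.randn(1, 256, 64, 64)
print(SFE()(image, deep).shape)              # torch.Size([1, 64, 128, 128])
```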
3.3.2. Cross-Feature Fusion Module
The CFFM consists of four feature adjustment modules (FAMs) and three feature cross-fusion structures (FCFSs), as shown in Figure 5. The four FAMs are responsible for fine-tuning and resizing the four input features, respectively, and the FCFSs are responsible for the cross-fusion of the four features.
The structure of an FAM is shown in Figure 6. The FAM consists of two multi-layer perceptron (MLP) modules and a neck module. Each MLP module contains two linear layers and an activation function; the two linear layers perform down- and up-projection of the features, respectively, which reduces the number of parameters in the module. Residual connections are applied around the MLPs to reduce the loss of feature information. The feature channels are then permuted, and the neck module reduces the feature dimensionality.
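The following is a minimal sketch of an FAM under stated assumptions: a down-/up-projection ratio of 4, GELU activation, and a 1 × 1 convolution with batch normalization as the neck; the actual widths and neck design follow Figure 6.

```python
import torch
import torch.nn as nn

class MLPAdapter(nn.Module):
    """Two linear layers (down- then up-projection) with an activation;
    the residual connection is applied outside the MLP."""
    def __init__(self, dim, ratio=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim // ratio),
                                 nn.GELU(),
                                 nn.Linear(dim // ratio, dim))

    def forward(self, x):
        return x + self.net(x)

class FAM(nn.Module):
    """Feature adjustment module (sketch): two MLP adapters on the (B, H, W, C)
    tokens, permutation to channel-first, then a conv 'neck' reducing channels."""
    def __init__(self, dim=1024, out_ch=256):
        super().__init__()
        self.mlp1 = MLPAdapter(dim)
        self.mlp2 = MLPAdapter(dim)
        self.neck = nn.Sequential(nn.Conv2d(dim, out_ch, 1, bias=False),
                                  nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, tokens):                   # tokens: (B, H, W, C) from the IEViT
        x = self.mlp2(self.mlp1(tokens))
        x = x.permute(0, 3, 1, 2).contiguous()   # -> (B, C, H, W)
        return self.neck(x)                      # -> (B, out_ch, H, W)

print(FAM()(torch.randn(1, 64, 64, 1024)).shape)  # torch.Size([1, 256, 64, 64])
```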
The structure of an FCFS is shown in Figure 5. Within each FCFS module, the four input features are partitioned into four groups following the permutation rule, where each group consists of three distinct features. The features in each group undergo channel-wise summation before being fed into the EPSA module. The CFFM contains three FCFS submodules, so this process is repeated three times. By using the FCFS to cross-fuse features at different depths, the CFFM fuses information at multiple levels, effectively enhancing the performance and generalization capability of the network.
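Below is a minimal sketch of the cross-fusion step. It assumes a leave-one-out grouping (each of the four output groups sums the other three inputs), which is consistent with "each group consists of three distinct features" but may differ from the exact permutation rule in Figure 5, and it replaces EPSA with a plain convolutional block as a stand-in.

```python
import torch
import torch.nn as nn

class FCFS(nn.Module):
    """Feature cross-fusion structure (sketch): four same-shaped feature maps are
    regrouped so that each output is the channel-wise sum of three of the four
    inputs, then refined by a block standing in for EPSA [61]."""
    def __init__(self, ch=256):
        super().__init__()
        self.refine = nn.ModuleList([nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True)) for _ in range(4)])

    def forward(self, feats):                                  # list of four (B, C, H, W) maps
        outs = []
        for k in range(4):
            group = [f for i, f in enumerate(feats) if i != k]  # three distinct features
            fused = torch.stack(group, dim=0).sum(dim=0)        # channel-wise summation
            outs.append(self.refine[k](fused))
        return outs

class CFFMStages(nn.Module):
    """Three stacked FCFS submodules; the four FAMs (previous sketch) are assumed
    to have been applied to the IEViT outputs beforehand."""
    def __init__(self, ch=256):
        super().__init__()
        self.stages = nn.ModuleList([FCFS(ch) for _ in range(3)])

    def forward(self, feats):
        for stage in self.stages:
            feats = stage(feats)
        return feats

feats = [torch.randn(1, 256, 64, 64) for _ in range(4)]
print([tuple(f.shape) for f in CFFMStages()(feats)])
```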
3.3.3. Bottle ASPP
Building upon the ASPP [63] and inspired by the bottleneck structure, we introduce the Bottle ASPP, as illustrated in Figure 7. In the Bottle ASPP, a 1 × 1 convolution reduces the number of channels in the features entering the ASPP module, and another 1 × 1 convolution restores the channel dimension of the ASPP output. The output features are then combined with the original input features through a residual structure. Compared with the original ASPP, the Bottle ASPP reduces information loss and lowers the parameter count. For instance, when the input feature has 256 channels, the ASPP module has 2.13 MB of parameters, whereas the Bottle ASPP has only 0.17 MB, a 92% reduction.
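A minimal sketch of this squeeze-ASPP-expand-residual structure is shown below. The reduction ratio, dilation rates, and compact ASPP implementation are illustrative assumptions; the printed parameter count refers to this sketch, not to the exact figures reported above.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Compact ASPP: parallel dilated 3x3 convolutions plus a 1x1 branch, then fusion."""
    def __init__(self, ch, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(ch, ch, 1, bias=False)] +
            [nn.Conv2d(ch, ch, 3, padding=r, dilation=r, bias=False) for r in rates])
        self.project = nn.Sequential(
            nn.Conv2d(ch * (len(rates) + 1), ch, 1, bias=False),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

class BottleASPP(nn.Module):
    """Bottle ASPP (sketch): squeeze channels with a 1x1 conv, run ASPP at the
    reduced width, expand back with a 1x1 conv, and add a residual connection."""
    def __init__(self, ch=256, reduction=4):
        super().__init__()
        mid = ch // reduction
        self.squeeze = nn.Conv2d(ch, mid, 1, bias=False)
        self.aspp = ASPP(mid)
        self.expand = nn.Conv2d(mid, ch, 1, bias=False)

    def forward(self, x):
        return x + self.expand(self.aspp(self.squeeze(x)))   # residual structure

m = BottleASPP(256)
n_params = sum(p.numel() for p in m.parameters())
print(f"Bottle ASPP parameters in this sketch: {n_params / 1e6:.2f} M")
```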
3.4. Evaluation Criterion
In this experiment, five performance metrics are used to compare and evaluate the proposed models: precision, recall, F1-score, mean intersection over union (MIoU), and the intersection over union (IoU) of landslide targets. They are defined as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}, \qquad \mathrm{MIoU} = \frac{1}{2}\left(\frac{TP}{TP + FP + FN} + \frac{TN}{TN + FP + FN}\right).$$
In the formula, TP, TN, FP, and FN represent the pixels that are correctly predicted as landslides, the pixels that are correctly predicted as non-landslides, the non-landslide pixels that are incorrectly predicted as landslides, and the landslide pixels that are incorrectly predicted as non-landslides, respectively.
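For reference, a small NumPy implementation of these pixel-wise metrics is sketched below; it assumes binary masks with 1 for landslide pixels and averages the IoU over the landslide and background classes to obtain the MIoU.

```python
import numpy as np

def landslide_metrics(pred, gt):
    """Pixel-wise metrics for binary landslide masks (1 = landslide, 0 = background)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()

    precision = tp / (tp + fp + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    f1 = 2 * precision * recall / (precision + recall + 1e-8)
    iou_landslide = tp / (tp + fp + fn + 1e-8)            # IoU of the landslide class
    iou_background = tn / (tn + fp + fn + 1e-8)
    miou = (iou_landslide + iou_background) / 2           # mean over the two classes
    return dict(precision=precision, recall=recall, f1=f1,
                iou=iou_landslide, miou=miou)

pred = np.random.randint(0, 2, (512, 512))
gt = np.random.randint(0, 2, (512, 512))
print(landslide_metrics(pred, gt))
```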
3.5. Experimental Settings
When evaluating the performance of SAM-CFFNet in landslide recognition, we conducted comparative and ablation experiments on three landslide datasets. The detailed experimental designs for the comparative and ablation experiments will be outlined in Section 4.
Given that the landslide recognition task is essentially a binary classification problem and that the non-landslide background occupies a large proportion of the experimental images, the task is prone to small-target detection problems. Therefore, we use the sum of the binary cross-entropy loss and the dice loss as the total loss function to train the model, maintaining training stability and class balance. This loss function is formulated as follows:
$$L = L_{bce} + L_{dice},$$
where $L_{bce}$ denotes the binary cross-entropy loss and $L_{dice}$ denotes the dice loss. $L_{bce}$ can be denoted as
$$L_{bce} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + \left(1 - y_i\right)\log\left(1 - p_i\right)\right],$$
and $L_{dice}$ can be represented as
$$L_{dice} = 1 - \frac{2\sum_{i=1}^{N} y_i\, p_i}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} p_i},$$
where $N$ denotes the total number of samples, and $y_i$ and $p_i$ denote the true label value and the predicted result value of the $i$th pixel, respectively.
Binary cross-entropy loss is widely used in binary classification and semantic segmentation for its stability and consistency. It quantifies prediction accuracy by comparing predicted probabilities with the actual labels and demonstrates good robustness. Dice loss is effective in segmentation tasks and handles class imbalance particularly well. It measures the overlap between the predicted and ground-truth regions, optimizing their intersection so that the model remains sensitive to object size and shape, resulting in better boundary depiction and more accurate segmentation.
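A minimal PyTorch sketch of this combined loss is given below; the logits-based BCE and the smoothing constant are implementation choices, not details specified in the paper.

```python
import torch
import torch.nn as nn

class BCEDiceLoss(nn.Module):
    """Total training loss (sketch): binary cross-entropy plus dice loss,
    computed on per-pixel probabilities."""
    def __init__(self, eps=1e-6):
        super().__init__()
        self.bce = nn.BCEWithLogitsLoss()
        self.eps = eps

    def forward(self, logits, target):
        prob = torch.sigmoid(logits)
        bce = self.bce(logits, target)
        inter = (prob * target).sum(dim=(1, 2, 3))
        denom = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
        dice = 1.0 - (2.0 * inter + self.eps) / (denom + self.eps)
        return bce + dice.mean()

logits = torch.randn(4, 1, 512, 512)
target = torch.randint(0, 2, (4, 1, 512, 512)).float()
print(BCEDiceLoss()(logits, target).item())
```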
Our experimental environment is based on the Debian operating system, developed in Python 3.7.12, and uses PyTorch 1.11.0 with CUDA 11.3 as the development framework. Our computer is equipped with an Intel Xeon Gold 5218R processor (Intel Corporation, Santa Clara, CA, USA), an NVIDIA A100 Tensor Core GPU (Nvidia Corporation, Santa Clara, CA, USA), and 128 GB of RAM. During the experiments, all models use AdamW as the optimizer, with an initial learning rate of 0.0002 and the number of epochs set to 30. The batch size for SAM-CFFNet and the other SAM-based comparative models was set to 8, while the batch size for the remaining comparative models was set to 64.
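The settings above can be wired up roughly as in the sketch below; a tiny stand-in model and random batches are used so the snippet runs on its own, and only the optimizer choice, learning rate, epoch count, and batch size mirror the reported configuration.

```python
import torch
import torch.nn as nn

# Illustrative training loop matching the reported settings: AdamW, initial
# learning rate 2e-4, 30 epochs, batch size 8 for the SAM-based models.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(8, 1, 1))               # stand-in for SAM-CFFNet
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad],  # frozen IEViT params would be excluded
    lr=2e-4)
criterion = nn.BCEWithLogitsLoss()                       # stand-in for the BCE + dice loss

for epoch in range(30):
    for _ in range(2):                                   # stand-in for the data loader
        images = torch.randn(8, 3, 128, 128)
        masks = torch.randint(0, 2, (8, 1, 128, 128)).float()
        optimizer.zero_grad()
        loss = criterion(model(images), masks)
        loss.backward()
        optimizer.step()
print("training sketch finished")
```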
6. Conclusions
In this study, SAM-CFFNet is proposed as a novel and effective application of SAM. The objective is to improve landslide recognition accuracy using SAM and to address its performance degradation and dependence on prompt information in the task of landslide recognition from remote sensing images. Notably, our specially designed CFFD effectively improves the model’s adaptability to downstream tasks. During training, the IEViT loads the pre-trained weights and keeps them frozen; this strategy fully exploits SAM’s powerful feature extraction capability, improves the convergence speed and training efficiency of the model, and enhances its generalization ability and adaptability on the landslide identification task.
We train and validate SAM-CFFNet against several reference models on three landslide datasets and evaluate its effectiveness in recognizing landslides using precision, recall, F1-score, MIoU, and IoU. Our results show that SAM-CFFNet achieves the best accuracy on all three landslide datasets, significantly outperforming the compared models, and demonstrates excellent generalization ability and robustness across datasets. Furthermore, we substantiated the rationale behind the designed CFFD through comparative analysis with various decoders, discussed the model’s training efficiency, and outlined future research directions.
The results of this study highlight the excellent performance of SAM-CFFNet in landslide identification tasks and the importance of this model in assessing the impact of landslides after a disaster and guiding post-disaster reconstruction efforts. SAM-based models, represented by SAM-CFFNet, show great potential in the field of landslide detection and monitoring, and the insights gained from this study will help promote the further development of SAM-based models in geohazard monitoring.