1. Introduction
In recent years, the risk and scale of landslide hazards have escalated due to intensifying global climate change and rapid urban expansion [1]. Natural factors, including earthquakes, volcanic eruptions, and heavy rainfall, alongside human activities such as mining and road construction [2], frequently disrupt the structural integrity of geomaterials on steep slopes. The destabilized slope masses become susceptible to gravitational movement, culminating in landslide events; consequently, landslide-prone areas cluster in steep mountainous and hilly terrain [3]. Accurate identification of landslides is crucial for disaster prevention and control, as well as for assessing disaster-related damage.
Rapid advancements in remote sensing sensor technology have dramatically enhanced the spatial and temporal resolution of satellite imagery. Owing to its wide coverage, timeliness, and distinct ability to identify landslides in remote areas [4,5,6], remote sensing has demonstrated substantial potential in landslide monitoring and identification. Meanwhile, the rapid development of computing technology provides technical support for efficient large-scale landslide monitoring [7]. Methods that determine whether an area is affected by a landslide from the pixel values of remote sensing images can be broadly divided into supervised and unsupervised approaches. Unsupervised methods such as K-means [8] and ISODATA [9] do not require prior labeling of samples; however, their classification accuracy is relatively low, and the clustering results often require further manual verification to identify the landslide area [10]. Supervised classification methods, relying on the spectral features of individual pixels, distinguish landslides from non-landslide areas using classifiers such as Support Vector Machines [11], random forests [12], and artificial neural networks [13]. Yet these methods neglect other attribute and contextual information, such as shape, texture, and spatial relationships, and are vulnerable to noise and parameter settings, resulting in limited recognition accuracy. To address these limitations, object-oriented methods [14,15,16,17] have been introduced into remotely sensed landslide identification. While object-oriented methods improve landslide segmentation accuracy to some extent, they rely heavily on segmentation algorithms and can be inefficient when dealing with complex images [18,19]. Compared with traditional machine learning methods, the Convolutional Neural Network (CNN) framework has demonstrated superior ability to utilize the comprehensive information embedded in remote sensing images [20,21,22]. Through its hierarchical architecture, the CNN approach effectively extracts high-level features from the imagery, thereby enabling accurate landslide identification [23,24,25,26,27]. Notably, detection accuracy is predominantly contingent upon the computational capabilities and architectural sophistication of the implemented deep learning models [28]. Consequently, the integration of advanced computational models [29], innovative methodologies [30,31], and multi-source data fusion [32] has emerged as a pivotal strategy for enhancing landslide detection accuracy. In addition, the Transformer architecture provides robust performance in landslide detection applications [22], significantly improving the efficiency and accuracy of rapid landslide detection [33,34]. Another report shows that an enhanced SegFormer implementation outperforms traditional CNN-based approaches in seismic landslide identification [35].
In summary, while existing studies have demonstrated effective landslide recognition within single domains, several critical limitations persist, including heavy reliance on abundant training data and on a single form of dataset [36,37]. Given that landslides predominantly occur in mountainous and geographically complex regions, the collection of representative training samples is severely constrained [38], resulting in limited generalization capability of recognition models. When substantial discrepancies exist between the training and testing domains, deep learning models often fail to accurately characterize features in unseen regions, significantly degrading their practical performance [39]. Consequently, the effective utilization of limited labeled data and the achievement of cross-domain landslide extraction across diverse geographic regions have emerged as the primary challenges in landslide recognition research. Unsupervised Domain Adaptation (UDA) techniques address this challenge by leveraging labeled source domain data and unlabeled target domain data to minimize inter-domain distribution gaps, thereby enhancing model performance in target environments [40,41]. Several studies have implemented convolutional neural network optimization strategies to align content features with style features, achieving source–target image statistical matching [42,43]. In cross-domain landslide recognition tasks, the inherent limitations in data completeness, particularly the scarcity of reliably labeled landslide samples in target regions, have prompted researchers to explore UDA-based approaches. Several innovative methodologies have been developed: Li et al. [44] integrated adversarial learning with domain distance minimization for cross-scene landslide detection in high-resolution imagery; Zhang et al. [45] implemented prototype learning to generate pseudo-labels for progressive feature alignment; Xu et al. [46] developed an adversarial domain adaptation network that fuses geological features with remote sensing data for cross-domain landslide extraction; and Li et al. [47] introduced a progressive label upgrading and cross-temporal style adaptation method for multi-temporal landslide detection and domain feature alignment. However, current UDA methods exhibit two key limitations: (1) over-emphasis on source–target domain feature alignment and under-utilization of target domain contextual information (e.g., topographic and geologic features); and (2) insufficient handling of landslide morphology complexity, especially in preserving boundary details, due to over-reliance on global feature similarity. Given that landslide occurrence is significantly influenced by environmental factors, reliance on isolated pixel-level or object-level features often leads to misclassification (e.g., confusion between landslides and bare rock formations). These limitations manifest as insufficient utilization of contextual information and inadequate incorporation of background knowledge in the target domain. To address these challenges, this study proposes a novel framework incorporating masked image modeling principles [48,49], specifically enhancing context learning [50] through target domain image masking within domain adaptation tasks, thereby improving performance across both upstream and downstream processes. Moreover, while existing landslide recognition methods primarily focus on classification through spectral, textural, and shape feature analysis of entire landslide regions, they frequently encounter boundary ambiguity caused by topographic noise and surrounding landform interference.
This study proposes an unsupervised cross-domain landslide extraction framework that integrates image masking and morphological information enhancement to address the limitations of current domain adaptation methods in contextual information utilization and morphological completeness. The proposed methodology employs a knowledge distillation strategy: the teacher network generates pseudo-labels from complete target domain images, while the student model is trained to produce consistent predictions from randomly masked target domain images. This bidirectional learning process facilitates continuous improvement in pseudo-label quality through iterative exchange of contextual information between the teacher and student models. The morphological information enhancement module leverages the distinct spectral characteristics (brightness/color contrast) between landslide regions and their surroundings to extract morphological features, which are subsequently transformed into morphological pseudo-labels. These pseudo-labels guide the student model in learning comprehensive landslide morphological patterns. The synergistic integration of enhanced contextual information utilization and explicit morphological feature incorporation ultimately improves the accuracy and robustness of cross-domain landslide extraction.
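As a concrete illustration, the per-pixel consistency supervision and the brightness-based morphological pseudo-labels described above can be sketched as follows. The function names, the EMA rate, and the confidence threshold are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def ema_update(teacher_w, student_w, alpha=0.99):
    # Teacher weights track the student via an exponential moving
    # average, as is common in self-training UDA (alpha is illustrative).
    return alpha * teacher_w + (1.0 - alpha) * student_w

def consistency_loss(teacher_probs, student_probs, conf_thresh=0.9):
    # The teacher sees the full target image and emits per-pixel class
    # probabilities; confident argmax predictions become pseudo-labels
    # that supervise the student's predictions on the masked image.
    pseudo = np.argmax(teacher_probs, axis=-1)
    conf = np.max(teacher_probs, axis=-1)
    keep = conf >= conf_thresh
    if not keep.any():
        return 0.0
    p = np.take_along_axis(student_probs, pseudo[..., None], axis=-1)[..., 0]
    return float(-np.mean(np.log(p[keep] + 1e-8)))

def morphology_pseudo_label(gray, k=1.0):
    # Landslide scars are typically brighter than their surroundings, so
    # a simple brightness-contrast threshold yields a coarse
    # morphological pseudo-label (k is a purely illustrative parameter).
    return (gray > gray.mean() + k * gray.std()).astype(np.uint8)
```

In this sketch, only confident teacher pixels contribute to the loss, so early, noisy pseudo-labels are down-weighted while the masked-image consistency term still forces the student to infer labels from context.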
The contributions of this study are threefold: (1) an unsupervised cross-domain landslide extraction framework integrating image masking and morphological information enhancement is developed; (2) a novel mask module is designed, which significantly enhances the model’s ability to learn contextual information; and (3) a cross-domain morphological information enhancement module is designed, which markedly improves the accuracy of landslide morphology recognition.
4. Discussion
In this section, we discuss the choice of the two hyperparameters used in the mask module, the mask ratio and the mask pixel size, and present comparative experiments on the choice of optimizer and learning rate. In addition, considering the broad applicability of the proposed method, we evaluate it on three larger-scale and more complex regions.
The image-based mask module in this study utilizes two hyperparameters: the mask ratio and the mask pixel size. The mask ratio controls the proportion of the image area that is masked, while the mask pixel size sets the scale of each mask block. To determine the optimal values for these hyperparameters in IMMDA, a quantitative analysis was performed for different settings. The results for the mask ratio are shown in Table 6, with the mask pixel size fixed at 64 × 64. The experimental findings indicate that the model achieves the highest accuracy when the mask ratio is set to 70%. Below 70%, accuracy improves as the ratio increases, suggesting that a higher mask ratio forces the model to better capture contextual information from the masked images. However, if the mask ratio becomes too large, critical image information is masked out, leading to a decrease in recognition accuracy.
The results of the quantitative analysis for the mask pixel size are presented in Table 7, with the mask ratio fixed at 70%. The findings indicate that learning performance improves as the mask pixel size increases, achieving the best results at 64 × 64. Given that the input image size is 512 × 512, larger block sizes lead to unstable training. Therefore, a mask pixel size of 64 × 64 is adopted as the optimal value for this hyperparameter in the experiments conducted in this paper.
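A minimal sketch of this block masking, assuming a simple zero-fill strategy; the function `random_block_mask` and its defaults mirror the reported hyperparameters (70% ratio, 64 × 64 blocks on 512 × 512 inputs) but are otherwise illustrative.

```python
import numpy as np

def random_block_mask(image, mask_ratio=0.7, patch=64, rng=None):
    """Zero out a random subset of patch x patch blocks of an image.

    Defaults mirror the reported hyperparameters; the zero-fill
    strategy itself is an illustrative assumption.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    gh, gw = h // patch, w // patch            # grid of candidate blocks
    n_masked = int(round(gh * gw * mask_ratio))
    mask = np.zeros(gh * gw, dtype=bool)
    mask[rng.choice(gh * gw, size=n_masked, replace=False)] = True
    mask = mask.reshape(gh, gw)
    masked = image.copy()
    for i in range(gh):
        for j in range(gw):
            if mask[i, j]:
                masked[i * patch:(i + 1) * patch,
                       j * patch:(j + 1) * patch] = 0
    return masked, mask
```

With a 512 × 512 input, 64 × 64 blocks, and a 70% ratio, 45 of the 64 candidate blocks are masked, so the student must infer most of the scene from a small visible fraction.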
In the experiments of this paper, the AdamW optimizer is used with a learning rate of 0.0001 and a weight decay of 0.01 for a total of 20,000 iterations. Different optimizer configurations are quantitatively analyzed, and the results are shown in Table 8. Different optimizers and decay schedules have little effect on the experimental results: the differences in IoU and F1-Score across configurations are only about 0.1 to 0.5 percentage points. Compared with AdamW, the Adam optimizer performs slightly worse, and SGD converges more slowly with slightly lower results. This indicates that the model is not highly sensitive to the choice of optimizer in the cross-domain landslide extraction task.
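For reference, a single AdamW update with decoupled weight decay can be written as follows; the `lr` and `weight_decay` defaults mirror the values above, but the function is a generic sketch rather than the training code used in this paper.

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-4, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.01):
    """One AdamW update (decoupled weight decay, after Loshchilov &
    Hutter); lr and weight_decay defaults mirror the values above."""
    b1, b2 = betas
    m = b1 * m + (1 - b1) * g                  # first-moment estimate
    v = b2 * v + (1 - b2) * g * g              # second-moment estimate
    m_hat = m / (1 - b1 ** t)                  # bias correction at step t
    v_hat = v / (1 - b2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
```

Unlike Adam, which folds the decay term into the gradient and thereby couples it to the adaptive scaling, AdamW applies the decay directly to the weights, which is consistent with the small accuracy gap observed between the two in Table 8.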
To test the effectiveness of the proposed method over larger spatial extents in complex regions, target domain datasets from Iceland (image size: 4151 × 2763), Kupang (image size: 1946 × 1319), and Tbilisi (image size: 5588 × 5632) were selected for testing; the results are shown in Figure 8, where a sliding window was applied during testing to perform block-wise inference. In the Iceland region, the target domain image and label cover a volcanic area, and the spectral features of the landslide and non-landslide regions differ markedly, so the proposed model recognizes essentially all of the landslide regions, demonstrating the effectiveness of the morphology enhancement module in cross-domain landslide extraction. However, in conspicuously bright areas, landslide regions may still be misidentified because of similar spectral and other features; in future studies, methods such as multi-source data fusion and texture feature analysis could be used to screen out non-landslide areas. In the Kupang region, the landslide area is brighter, and the proposed method recognizes the landslide morphology more completely. However, the method fills small voids within the landslide, which can incorrectly fill small non-landslide areas, such as isolated trees within the landslide body. In the Tbilisi region, the difference between the spectral characteristics of the landslides and those of the surrounding area is less pronounced. The lower-right landslide area in this region closely resembles its surroundings, resulting in incomplete morphological identification of that landslide. To recognize all landslide areas, the thresholds for landslide areas were set lower, which caused brighter houses and bare land to be mistaken for landslides in some cases. In subsequent processing, larger area thresholds can be set for large landslide areas.
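The block-wise testing over large scenes can be sketched as follows, assuming the model is wrapped as a `predict_fn` that returns a per-pixel probability map for each tile; the window and stride values, and the averaging of overlapping tiles, are illustrative choices rather than the paper's exact settings.

```python
import numpy as np

def sliding_window_predict(image, predict_fn, window=512, stride=256):
    """Tile a large scene into overlapping windows, run predict_fn on
    each tile, and average the overlapping probability maps."""
    h, w = image.shape[:2]
    prob = np.zeros((h, w), dtype=np.float64)
    count = np.zeros((h, w), dtype=np.float64)
    ys = list(range(0, max(h - window, 0) + 1, stride))
    xs = list(range(0, max(w - window, 0) + 1, stride))
    # Add a final window flush with the right/bottom edge if needed.
    if h > window and ys[-1] != h - window:
        ys.append(h - window)
    if w > window and xs[-1] != w - window:
        xs.append(w - window)
    for y in ys:
        for x in xs:
            tile = image[y:y + window, x:x + window]
            prob[y:y + window, x:x + window] += predict_fn(tile)
            count[y:y + window, x:x + window] += 1
    return prob / np.maximum(count, 1)
```

Averaging the overlapping predictions smooths the seams between adjacent tiles, which matters for scenes such as Tbilisi (5588 × 5632), where a single landslide can span several windows.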