Article

Robust Landslide Recognition Using UAV Datasets: A Case Study in Baihetan Reservoir

1 Power China Huadong Engineering Corporation Limited, Hangzhou 311122, China
2 Zhejiang Huadong Geotechnical Investigation & Design Institute Corporation Limited, Hangzhou 310004, China
3 College of Water Resource and Hydropower, Sichuan University, Chengdu 610065, China
4 State Key Laboratory of Hydraulics and Mountain River Engineering, Sichuan University, Chengdu 610065, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(14), 2558; https://doi.org/10.3390/rs16142558
Submission received: 19 June 2024 / Revised: 8 July 2024 / Accepted: 10 July 2024 / Published: 12 July 2024

Abstract

The task of landslide recognition focuses on extracting the location and extent of landslides over large areas, providing ample data support for subsequent landslide research. This study explores the use of UAV and deep learning technologies to achieve robust landslide recognition in a more rational, simpler, and faster manner. Specifically, the widely successful DeepLabV3+ model was used as a blueprint and a dual-encoder design was introduced to reconstruct a novel semantic segmentation model consisting of Encoder1, Encoder2, Mixer and Decoder modules. This model, named DeepLab for Landslide (DeepLab4LS), considers topographic information as a supplement to DeepLabV3+, and is expected to improve the efficiency of landslide recognition by extracting shape information from relative elevation, slope, and hillshade. Additionally, a novel loss function term—Positive Enhanced loss (PE loss)—was incorporated into the training of DeepLab4LS, significantly enhancing its ability to understand positive samples. DeepLab4LS was then applied to a UAV dataset of Baihetan reservoir, where comparative tests demonstrated its high performance in landslide recognition tasks. We found that DeepLab4LS has a stronger inference capability for landslides with less distinct boundary information, and delineates landslide boundaries more precisely. More specifically, in terms of evaluation metrics, DeepLab4LS achieved a mean intersection over union (mIoU) of 76.0% on the validation set, which is a substantial 5.5 percentage point improvement over DeepLabV3+. Moreover, the study also validated the rationale behind the dual-encoder design and the introduction of PE loss through ablation experiments. Overall, this research presents a robust semantic segmentation model for landslide recognition that considers both optical and topographic semantics of landslides, emulating the recognition pathways of human experts, and is highly suitable for landslide recognition based on UAV datasets.

1. Introduction

Landslides are one of the most common geological disasters in the world. When a landslide occurs, it often carries a large amount of soil, rock, or a mixture of both, moving downhill along the sliding surface formed on the slope [1]. As a result, the built environment (including houses, factories, roads, etc.) and the natural environment (including farms, forests, grasslands, etc.) on the slope are partially or completely destroyed [2,3]. In some extreme cases, landslides and their secondary disasters can cause considerable damage. For example, the 2014 Abe Barek landslide in Badakhshan Province, Afghanistan, resulted in nearly 2700 fatalities [4]; the 2018 Baige landslide in Tibet, China, blocked the main stream of the Jinsha River and formed a huge barrier lake, which breached, causing floods that affected areas as far downstream as Shangri-La and Lijiang, nearly 600 km away, resulting in losses exceeding USD 500 million in Lijiang alone [5,6]. However, landslide disasters are not entirely uncontrollable. Effective monitoring of landslide-prone areas and research into landslide initiation mechanisms, motion processes, and risk management, based on monitoring data, can lead to a comprehensive understanding of landslides. This, in turn, allows for the implementation of early measures to reduce the losses from landslide disasters and even prevent them through certain slope reinforcement measures [7,8,9].
In practice, the effective monitoring of landslide areas is not an easy task. After a major landslide disaster, detailed field investigations, sampling, and experimental work are typically required [10]. However, the mountainous distribution characteristics of landslides often make it difficult for researchers and monitoring equipment to quickly reach the disaster site, thus limiting the possibility of immediate landslide research post-disaster. Fortunately, the recent popularization of low-cost unmanned aerial vehicles (UAVs) has greatly facilitated the monitoring of landslides. UAVs, with their convenience, speed, proximity, and lightweight datasets [11], perfectly meet the needs of landslide monitoring. The use of UAV photogrammetry technology allows researchers to easily obtain extensive landslide photographs from a distance. Then, combined with structure-from-motion (SfM) technology, very high resolution (VHR) landslide image datasets can be quickly constructed, which can be used as the basic data for landslide identification and monitoring [12].
As a landslide deforms, a number of physical changes can be observed through photogrammetry. These include alterations in topography [13], vegetation cover [14], color (spectrum) [15], and cracks [16]. Therefore, based on the aforementioned multifaceted information, landslide recognition can be indirectly achieved from the landslide image dataset. Landslide recognition based on the image dataset can be divided into four methods: the manual interpretation method, the object-oriented method, the object-based deep learning (DL) method, and the pixel-based DL method. Figure 1 provides an intuitive illustration of the results of these four landslide recognition methods. The manual interpretation method (Figure 1a) employs direct observation of landslide features in images [17], or alternatively, the selection of a threshold through trial and error, followed by semi-automated recognition of landslide areas in images based on this threshold [18,19]. The object-oriented method (Figure 1b) often employs the multiresolution segmentation approach, which first divides the image into multiple non-overlapping objects based on homogeneity by selecting the optimal segmentation scale, and then extracts the landslide objects by combining the spectral, textural, and morphological characteristics of landslides [20,21]. The object-based DL method (Figure 1c) uses algorithms to detect objects in images by determining bounding boxes and classes, either in one or two stages. The YOLO series models are key examples of this method [22,23]. The pixel-based method (Figure 1d) relates to semantic segmentation tasks, which classify each pixel in the image using deep learning techniques, considering both local and regional context [24,25]. Overall, the object-based methods focused on object detection and the pixel-based methods focused on semantic segmentation have come together as the dominant methods for landslide recognition [26,27].
In contrast to object detection, semantic segmentation can provide more physical information about landslides, including boundaries and areas. While current popular semantic segmentation network models such as U-Net [28], Mask R-CNN [29], and DeepLabV3+ [30] can accurately perform landslide recognition tasks, they encounter significant difficulties in complex scenarios. Since these popular models usually use only RGB optical images as inputs, they may experience interference from factors such as vegetation, shading, and human activities in real landslide identification scenarios [31,32,33]. To address this problem, we hypothesize that introducing a corresponding digital surface model (DSM) obtained from SfM may ameliorate it [34]. Since the DSM essentially represents a 2.5D reproduction of a real-world 3D model, its introduction into the semantic segmentation model can be interpreted as improving the ability to identify landslides from an engineering geological perspective, in a manner similar to the empirical judgement of human experts during landslide field investigations.
This study will reference the architecture of DeepLabV3+ to re-design and develop a new semantic segmentation model named DeepLab for Landslide (DeepLab4LS) for landslide recognition that integrates optical and topographic information, which will be validated on landslide UAV datasets from the Baihetan reservoir. Additionally, this paper will investigate the issue of the loss function during model training to determine the most suitable loss function scheme for landslide recognition through detailed experiments. Finally, comparison and ablation experiments will be conducted on the novel semantic model to explore the role of various details in the architecture and training of the semantic model in landslide recognition.

2. Study Area and Dataset

2.1. Study Area and Data Acquisition

The Baihetan reservoir was formed after the Baihetan hydropower station began storing water in April 2021, with a total storage capacity of 20.627 billion m³ [35]. The storage of water has changed conditions such as the groundwater level and pore water pressure on the slopes [36], leading to the continuous development of a series of landslides within the reservoir area [16,37,38]. The activity of the landslides has affected the production and traffic of towns along the river within the reservoir area, and the potential impulse wave risk of the landslides also poses a significant threat to the downstream Baihetan hydropower station. In this study, a UAV image dataset obtained from the Baihetan reservoir in the lower reaches of the Jinsha river in China is employed as the data source for model training and validation.
Therefore, to thoroughly investigate the distribution of landslides within the reservoir area, a VHR UAV image dataset was collected between February and September 2023. This dataset, which covers an area of approximately 65 km2, encompasses a stretch of Jinsha river within the reservoir area measuring approximately 50 km in length (Figure 2). The UAV photogrammetry employed the Feima D2000 four-rotor UAV, equipped with a SONY DOP 3000 camera (Shenzhen Feima Robotics Co., Ltd., Shenzhen, China) (Figure 3). The UAV flew in a fixed flight height mode at an altitude of approximately 150 to 200 m above the ground, ensuring that the resolution of the collected photos was stable at approximately 0.03 m. The heading overlap and the side overlap were 80% and 65%, respectively. In total, data collection for the UAV photogrammetry took approximately 1 month.
The original photos captured by the UAV and the corresponding ground control points (GCPs) were input into Pix4D software (4.9.0 trial version), where they underwent the SfM process. As a result, the orthophoto (Figure 2) and DSM of the study area were generated. It is worth mentioning that the orthophoto and DSM were subsequently downsampled to 0.1 m to avoid potential issues during subsequent deep learning training. At a resolution of 0.03 m, a 1024 × 1024 cropped image would represent less than 1000 m² of surface area. In many cases, this would only cover a small part of a landslide, preventing the network model from effectively processing the landslide’s regional context information [39]. Meanwhile, cropping to 1024 × 1024 at a resolution of 0.03 m would result in a dataset with over 50,000 images, which would be a significant burden for training. Therefore, using a dataset with a resolution of 0.1 m could effectively address the two aforementioned issues.

2.2. Dataset Creation

There are three types of image data used as inputs for the semantic segmentation model, which can be summarized as follows: optical semantic data, topographic semantic data, and label data. The optical semantic data are represented by the orthophoto, while the label data are obtained by landslide experts who observe the orthophoto and annotate it using ArcGIS software. Here, it is worth mentioning the selection of topographic semantic data. We ultimately chose relative elevation, slope, and hillshade as the three types of data to collectively constitute the topographic semantic data used for input (Figure 4).
Relative elevation measures the degree of elevation difference within a region, numerically represented by the absolute elevation difference between the highest and lowest points in the area. Relative elevation directly affects the gravitational potential energy of a slope: the greater the relative elevation within an area, the higher the potential energy, indicating a greater driving force for landsliding [40,41]. Although relative elevation is the concept we intend to exploit, in practice we directly use elevation (i.e., the DSM itself) as the input when composing the topographic semantic data, because the normalization applied to the network inputs effectively converts absolute elevation into relative elevation.
Slope represents the angle between the terrain surface and the horizontal plane at a given point, intuitively reflecting the degree of slope inclination, which is also closely related to landslides. An increase in the slope means that the component of gravity projected onto the potential sliding surface of the landslide also increases [42]. According to statistical data, most landslides occur on slopes between 10° and 30° [43].
By simulating the position and angle of the sun’s light source, hillshade calculates the illumination of each pixel to reflect the undulations of the terrain. Hillshade does not have a strong correlation with landslides on a geological level, but it has a unique advantage when analyzing landslides using UAV datasets: it can reduce the impact of illumination differences in UAV data acquired at different times [44]. In fact, the DSM exported by SfM can also mitigate the effects of illumination differences, but compared to the RGB image, the DSM loses a lot of shape and texture information. The hillshade calculated from the DSM, however, can restore a significant amount of shape and texture, making it an ideal source of topographic semantic data for landslides. In this study, both slope and hillshade were directly exported from the DSM using ArcGIS software version 10.8. The solar altitude angle and azimuth used for the hillshade are 45° and 90°, respectively.
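For readers reproducing this preprocessing outside ArcGIS, the following is a minimal NumPy sketch of how slope and hillshade can be derived from a DSM grid. The 45° solar altitude and 90° azimuth match the settings above; the function name, cell size, and the simplified (Horn-style) formulas are illustrative assumptions rather than the exact ArcGIS implementation.

```python
import numpy as np

def slope_and_hillshade(dsm, cell_size=0.1, altitude_deg=45.0, azimuth_deg=90.0):
    """Derive slope (degrees) and hillshade (0-255) from a DSM array.

    dsm: 2D array of elevations; cell_size: ground resolution in metres.
    A simplified version of the standard GIS hillshade/slope formulas.
    """
    # Elevation gradients along rows (north-south) and columns (east-west)
    dz_dy, dz_dx = np.gradient(dsm, cell_size)

    # Slope: angle between the terrain surface and the horizontal plane
    slope_rad = np.arctan(np.hypot(dz_dx, dz_dy))
    slope_deg = np.degrees(slope_rad)

    # Aspect: downslope direction (simplified)
    aspect_rad = np.arctan2(dz_dy, -dz_dx)

    # Hillshade: illumination from a light source at the given altitude/azimuth
    zenith_rad = np.radians(90.0 - altitude_deg)
    azimuth_rad = np.radians((360.0 - azimuth_deg + 90.0) % 360.0)
    hillshade = 255.0 * (np.cos(zenith_rad) * np.cos(slope_rad)
                         + np.sin(zenith_rad) * np.sin(slope_rad)
                         * np.cos(azimuth_rad - aspect_rad))
    return slope_deg, np.clip(hillshade, 0, 255)
```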
Furthermore, it is important to note that in subsequent sections of this paper, the term “topographic image” is used as a generic term for the DSM, hillshade, and slope, while the term “optical image” refers to the orthophoto. Both terms precisely convey the core content of these two types of data and are helpful for understanding the data usage in this study.
To ensure that model training could run smoothly under limited hardware conditions, the orthophoto, DSM, slope, hillshade, and label data were then cropped to a size of 1024 × 1024, and the cropped image set was filtered. Tiles meeting at least one of the following conditions were removed: (1) the proportion of pixels with null values exceeded 90%; (2) there were no pixels labeled as landslides. After the above operations, a dataset composed of 633 images of 1024 × 1024 size was obtained. The dataset was randomly divided into a training set and a validation set in an 8:2 ratio, which were used for model training and validation, respectively.
The entire dataset creation process for this section is shown in Figure 5, which also includes the mathematical expression of the filtering conditions for the image set.
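As a complement to the filtering expression in Figure 5, the tile-filtering rule can be written compactly. The following is a minimal sketch, assuming each tile is available as NumPy arrays for the image (with NaN marking null pixels) and the binary label mask (1 = landslide); the function name keep_tile is hypothetical.

```python
import numpy as np

def keep_tile(image_tile, label_tile, null_threshold=0.9):
    """Return True if a 1024 x 1024 tile should be kept in the dataset.

    A tile is discarded when either filtering condition holds:
    (1) more than 90% of its pixels are null (NaN), or
    (2) it contains no pixels labelled as landslide.
    """
    null_ratio = np.isnan(image_tile).mean()
    has_landslide = np.any(label_tile == 1)
    return (null_ratio <= null_threshold) and has_landslide
```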

2.3. Dataset Augmentation

In order to expand the dataset, avoid model overfitting, and enhance the generalizability of the model, data augmentation (DA) methods were used to improve the training set [45]. Specifically, four types of DA operations were applied: stretching, flipping, Gaussian blur, and rotation (Figure 6). We applied these four transformations using an on-the-fly data augmentation approach [46]. During the training process, these DA operations were applied randomly before the dataset was loaded into memory. Theoretically, this means that the training set for each epoch is different during the training process, increasing the diversity of the data and helping the model generalize better to unseen data.
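A minimal PyTorch-style sketch of this on-the-fly strategy is given below. The dataset class name, probabilities, and parameter ranges are illustrative assumptions; the key points are that each of the four operations is drawn at random every time a sample is loaded, and that the geometric operations are applied identically to the optical image, the topographic image, and the label.

```python
import random
import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode
from torch.utils.data import Dataset

class AugmentedLandslideDataset(Dataset):
    """Wraps (optical, topographic, label) float tensors and augments them on the fly.
    Labels are assumed to be float {0, 1} masks of shape (H, W)."""

    def __init__(self, samples):
        self.samples = samples  # list of (optical, topo, label) tensors

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        optical, topo, label = self.samples[idx]
        label = label.unsqueeze(0)  # add a channel dim for geometric transforms

        # Flipping: applied identically to all three tensors
        if random.random() < 0.5:
            optical, topo, label = TF.hflip(optical), TF.hflip(topo), TF.hflip(label)

        # Rotation by a random multiple of 90 degrees
        angle = random.choice([0, 90, 180, 270])
        if angle:
            optical, topo, label = (TF.rotate(optical, angle), TF.rotate(topo, angle),
                                    TF.rotate(label, angle))

        # Stretching: resize then centre-crop back to the original size
        if random.random() < 0.5:
            h, w = optical.shape[-2:]
            scale = random.uniform(1.0, 1.2)
            new_size = [int(h * scale), int(w * scale)]
            optical = TF.center_crop(TF.resize(optical, new_size), [h, w])
            topo = TF.center_crop(TF.resize(topo, new_size), [h, w])
            label = TF.center_crop(
                TF.resize(label, new_size, interpolation=InterpolationMode.NEAREST), [h, w])

        # Gaussian blur: applied only to the optical image (an assumption)
        if random.random() < 0.3:
            optical = TF.gaussian_blur(optical, kernel_size=5)

        return optical, topo, label.squeeze(0)
```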

3. Methods

3.1. DeepLabV3+

3.1.1. Model Architecture

DeepLabV3+ is a deep learning semantic segmentation model developed by Google, which has been widely applied in fields such as medical image segmentation [47], remote sensing image analysis [48], and autonomous driving [49]. Compared to previous models in the DeepLab series, DeepLabV3+ introduces a decoder module to refine segmentation results, especially along the boundaries of target objects, making it also a typical semantic segmentation model with an encoder-decoder architecture. DeepLabV3+ also upsamples the output feature maps from the encoder through the decoder module and fuses them with low-level features, thereby restoring detailed information about the targets and improving the accuracy of segmentation boundaries. A simplified diagram of the overall architecture of DeepLabV3+ is shown in Figure 7.

3.1.2. The Dilated Convolution & ASPP Optimizations

The two most important optimizations in the DeepLabV3+ encoder are dilated convolutions and Atrous Spatial Pyramid Pooling (ASPP), which indirectly contribute to DeepLabV3+’s excellent performance in multi-scale object recognition. By using dilated convolution operations instead of pooling operations, it is possible to obtain a receptive field larger than the size of the convolutional kernel without reducing the size of the feature map, while keeping the number of parameters manageable. For a standard convolution process, it can be described as:
y_{i,j} = \sum_{m=-M/2}^{M/2} \sum_{n=-N/2}^{N/2} x_{i+m,\,j+n} \cdot w_{m,n} + b, (1)
However, in dilated convolution, the distance between adjacent pixels in the image corresponding to the convolutional kernel is not constantly 1, which means that:
y_{i,j} = \sum_{m=-M/2}^{M/2} \sum_{n=-N/2}^{N/2} x_{i+r \cdot m,\,j+r \cdot n} \cdot w_{m,n} + b, (2)
where y is an element of the output feature map, x is an element of the input feature map, w is the weight of the convolution kernel, b is the bias, m and n are the indices within the convolution kernel, and r is the dilation rate, an integer greater than or equal to 1.
Dilated convolution can effectively improve the ability of the model to capture the features of large targets, such as landslides. As shown in Figure 8, when the dilation rate of the 3 × 3 convolution kernel is 4, the receptive field of the convolution operation is expanded from a space of 9 pixels to a space of 81 pixels, which already contains almost the whole landslide area.
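In PyTorch, dilation is simply an argument of the standard convolution layer. The short sketch below (an illustrative example, not part of the original model code) shows how a 3 × 3 kernel with a dilation rate of 4 keeps the feature-map size while covering the 9 × 9 pixel (81-pixel) neighbourhood described above, provided the padding is set to the dilation rate.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 256, 256)  # a dummy 3-channel feature map

# Standard 3x3 convolution (dilation 1): receptive field of 3x3 = 9 pixels
standard = nn.Conv2d(3, 16, kernel_size=3, padding=1, dilation=1)

# Dilated 3x3 convolution with rate 4: effective receptive field of 9x9 = 81 pixels;
# padding equals the dilation rate so the spatial size is preserved
dilated = nn.Conv2d(3, 16, kernel_size=3, padding=4, dilation=4)

print(standard(x).shape, dilated(x).shape)  # both keep the 256x256 size
```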
Relying solely on dilated convolution can effectively expand the receptive field, but it does not achieve unified feature extraction for multi-scale objects. To address the issue of multi-scale feature extraction, Spatial Pyramid Pooling (SPP) has provided an excellent practical approach—establishing multiple parallel branches and using pooling operations with different kernel sizes in each branch to extract and ultimately integrate multi-scale features [50]. On the other hand, as mentioned above, dilated convolutions can be seen as a non-downsampling alternative to pooling operations. Naturally, the idea arises to use dilated convolution operations to replace the pooling operations in some branches of SPP, giving rise to Atrous Spatial Pyramid Pooling. Since dilated convolutions do not change the size of the feature map, this substitution allows the output results of each branch to be directly concatenated in the channel dimension (except for the pooling branches, which require upsampling), resulting in a multi-scale feature map that fully accounts for the features of objects at different scales. Figure 9 shows the basic structure of the ASPP module.
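As an illustration of this structure, the following is a condensed PyTorch sketch of an ASPP block of the kind shown in Figure 9. The channel counts and dilation rates (6, 12, 18) are common defaults rather than values taken from the original implementation; the sketch only demonstrates how the parallel dilated branches and the pooling branch are concatenated along the channel dimension.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Atrous Spatial Pyramid Pooling: parallel dilated-convolution branches
    plus an image-level pooling branch, fused by channel concatenation."""

    def __init__(self, in_ch=256, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, out_ch, 1)  # 1x1 branch
        self.dilated_branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates
        ])
        self.pool_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1)
        )
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [self.branch1(x)] + [b(x) for b in self.dilated_branches]
        # The pooling branch is upsampled back to the feature-map size before concatenation
        pooled = F.interpolate(self.pool_branch(x), size=(h, w),
                               mode="bilinear", align_corners=False)
        feats.append(pooled)
        return self.project(torch.cat(feats, dim=1))
```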
DeepLabV3+ has shown good performance on several open-source datasets, but these datasets are based on RGB optical images obtained from common imaging devices.
Therefore, the topographic information in the UAV datasets would be lost by using the general DeepLabV3+. For this reason, in order to achieve a landslide semantic segmentation model with good generalization performance, we re-designed a new DeepLabV3+-based semantic segmentation model for landslide recognition that includes topographic information.

3.2. DeepLab4LS: A Dual-Encoder DeepLab Model for Landslide

3.2.1. Basic Design of Model

In order to design a landslide recognition model that fuses both optical and topographic semantics, it is first necessary to consider the fusion strategy. In deep learning tasks for computer vision, common data fusion methods can be categorized into three types: simple fusion, channel fusion, and dual-encoder fusion (Figure 10). Simple fusion is considered the most basic form of data fusion, where two sets of data are directly added or multiplied together. This method necessitates high consistency between the two datasets and is typically used for fusing intermediate features within the network [51,52,53]. Channel fusion concatenates optical and topographic data along the channel dimension before its input into the model, thus better preserving the original characteristics of the data [54,55,56]. Dual-encoder fusion (also known as dual-backbone fusion) goes a step further by feeding each data set into an encoder (which is primarily composed of a backbone) and the features outputted by the encoders are then fused at the model level (usually through direct channel fusion) [57,58,59]. Compared to the other methods, dual-encoder fusion is the most complex, but it allows the data to undergo several layers of neural network processing both before and after fusion, thereby enabling a true deep fusion of data.
Therefore, to achieve better performance, this study employs a dual-encoder fusion approach for the model design. Overall, the network receives two types of inputs: orthophotos as optical features, and DSM, slope, and hillshade images as topographic features. Initially, two separate encoders are used to extract preliminary optical and topographic semantic features. Subsequently, these extracted semantic features are combined through channel fusion, followed by channel adjustment, size adjustment, and SoftMax processing to produce the final segmentation results. Throughout this process, the subsampling rates of the backbone within the encoders are appropriately adjusted to unify the subsampling multiples of the dual encoders.

3.2.2. Encoders Design

Considering the increase in the number of parameters introduced by the new encoders, the optical feature extraction backbone adopts the lightweight MobileNetV2 [60]. MobileNetV2 achieves model miniaturization and efficient inference speed through its unique inverted residual structure combined with depthwise separable convolutions. In numerous experiments [61,62,63], MobileNetV2, as a backbone, has been able to achieve results comparable to or even surpassing those of ResNet50 [64], with the former having only 1/7 to 1/8 of the latter’s parameters. Overall, MobileNetV2 can be regarded as an excellent lightweight backbone in terms of both computational accuracy and parameter volume.
With regard to the topographic feature extraction backbone, no single network emerges as a clear standout. In order to identify the most suitable network for this purpose, this study tested five different networks for their ability to extract topographic semantics: VGG-16 [65], ResNet-18, ResNet-50, ShuffleNetV2 [66], and MobileNetV2. Among these, ShuffleNetV2 is a lightweight network structure similar to MobileNetV2. In ShuffleNetV2, the introduction of channel shuffle operations and pointwise group convolution techniques significantly reduces the model’s parameter count while ensuring high accuracy. To control the overall model’s parameter volume, none of the candidate topographic backbones employs the deep and shallow feature fusion method found in DeepLabV3+; instead, only their final output features are processed.

3.2.3. Model Architecture

With this understanding, we referenced the architecture of DeepLabV3+ and redesigned and implemented a semantic segmentation model for landslide recognition—DeepLab for Landslide (DeepLab4LS) (Figure 11), composed of four modules: Encoder1, Encoder2, Mixer, and Decoder. The detailed structural design of the four modules is as follows:
(1)
Encoder1
Encoder1 serves as the optical feature extraction module, comprising the optical feature extraction backbone and its associated components. Encoder1 takes RGB images (orthophotos) as input and normalizes them to a range of [0, 1] before entering the network. The optical feature extraction backbone is based on the MobileNetV2 model, with the original model’s downsampling factor adjusted to 8× to prevent data distortion caused by excessive downsampling. The final output of Encoder1 is divided into two parts: 48-channel 8× downsampled shallow optical semantic features and 256-channel 8× downsampled deep optical semantic features.
(2)
Encoder2
Encoder2 acts as the topographic feature extraction module, comprising the topographic feature extraction backbone and its associated components. Encoder2 accepts a 3-channel image composed of stacked single-channel images of DSM, slope, and hillshade as input, and normalizes it to a range of [0, 1] before entering the network. In order to align with the deep optical semantic features outputted by Encoder1, the downsampling factor of Encoder2 is also set to 8×, and the final channel count of topographic semantic features is fixed at 256. This allows the shallow optical semantic features, deep optical semantic features, and topographic features to be fused at a ratio of 48:256:256.
(3)
Mixer
The Mixer serves as the multimodal feature fusion module. Within the Mixer, the task is to further achieve cross-semantic fusion of optical and topographic semantics, building upon the foundation of fusing shallow and deep optical semantic features. Before fusion, deep optical semantic features and topographic semantic features are upsampled by a factor of 2 using bilinear interpolation to return to a 4× downsampled size. The three sets of features are then channel-stacked and immediately processed through two 3 × 3 convolution layers. The first convolution layer transforms the 256 + 48 + 256 channel features into 256 channels of fused semantics, while the second convolution layer further refines the fused semantics without changing the number of channels. As a result, the Mixer outputs a 256-channel feature map downsampled by 4×, representing the outcome of deep multimodal feature fusion.
(4)
Decoder
The Decoder is the conversion module for segmentation results. The main function of the Decoder is to convert the feature map output from the Mixer into the final segmentation results. Therefore, its network structure is relatively simple, consisting of a 1 × 1 convolution layer, a bilinear interpolation module, and a final SoftMax module. The 1 × 1 convolution layer is used for channel adjustment (outputting 2 channels, representing the landslide foreground and background), the bilinear interpolation adjusts the image size to match the input image size, and the subsequent SoftMax transformation yields pixel classification probabilities. Finally, the category with the highest probability is selected as the pixel class to produce the final segmentation results.
It is important to emphasize that in the aforementioned Encoder1 and Encoder2 modules, adjustments to the downsampling multiples of each model are made by modifying the stride of the convolutional or pooling layers accordingly. Another modification to these models is the removal of the classifiers, so that they exist only as a backbone for a semantic segmentation model, rather than as complete classification models.
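To make the data flow through the four modules concrete, the following is a simplified, hedged sketch of the DeepLab4LS forward pass. The tiny stand-in encoders, the ReLU activations, and the layer choices are placeholders rather than the authors' code; only the channel ratio (48:256:256), the two 3 × 3 fusion convolutions in the Mixer, and the 1 × 1 convolution, bilinear upsampling, and SoftMax in the Decoder follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyOpticalEncoder(nn.Module):
    """Stand-in for the MobileNetV2-based Encoder1: returns shallow (48-ch)
    and deep (256-ch) optical features, both 8x downsampled here for simplicity."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 48, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(48, 48, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(48, 48, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.deep = nn.Sequential(nn.Conv2d(48, 256, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        shallow = self.stem(x)
        return shallow, self.deep(shallow)

class TinyTopoEncoder(nn.Module):
    """Stand-in for Encoder2: 256-channel topographic features, 8x downsampled."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.body(x)

class DeepLab4LSSketch(nn.Module):
    """Schematic DeepLab4LS: Encoder1 (optical), Encoder2 (topographic),
    Mixer (cross-modal fusion), and Decoder (segmentation head)."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.encoder1 = TinyOpticalEncoder()
        self.encoder2 = TinyTopoEncoder()
        self.mixer = nn.Sequential(
            nn.Conv2d(256 + 48 + 256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True))
        self.decoder = nn.Conv2d(256, num_classes, 1)  # 1x1 channel adjustment

    def forward(self, optical, topo):
        shallow, deep = self.encoder1(optical)   # optical semantics
        terrain = self.encoder2(topo)            # topographic semantics
        # Upsample by 2x to the 4x-downsampled size before channel stacking
        size = [s * 2 for s in deep.shape[-2:]]
        deep, shallow, terrain = (
            F.interpolate(t, size=size, mode="bilinear", align_corners=False)
            for t in (deep, shallow, terrain))
        fused = self.mixer(torch.cat([deep, shallow, terrain], dim=1))   # Mixer
        logits = F.interpolate(self.decoder(fused), size=optical.shape[-2:],
                               mode="bilinear", align_corners=False)     # Decoder
        return torch.softmax(logits, dim=1)  # per-pixel class probabilities

probs = DeepLab4LSSketch()(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))
print(probs.shape)  # torch.Size([1, 2, 256, 256])
```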

3.2.4. Improved Loss Function

In this study, another significant endeavor is the formulation of a loss function tailored for landslide recognition, which is then applied to the DeepLab4LS model. Given the relatively low proportion of landslide pixels in the image, a novel loss function, positive enhanced loss (PE loss), is introduced. This loss function is designed to enhance the model’s ability to comprehend positive samples. The PE loss is defined by the following equation:
PE\ loss = 1 - \frac{1}{\left| C \right|} \sum_{c \in C} \frac{\sum_{pixels} y_{\mathrm{true}} \cdot y_{\mathrm{pred}}}{\sum_{pixels} \left( y_{\mathrm{pred}} + y_{\mathrm{true}} - 2\, y_{\mathrm{true}} \cdot y_{\mathrm{pred}} \right)}, (3)
Here, y_true is equal to 1 when a pixel is a positive sample and 0 otherwise, while y_pred is equal to 1 when a pixel is predicted as a positive sample and 0 otherwise. The value of |C| is the total number of classes, which is consistently equal to 2 in this study.
During the training process of the DeepLab4LS model, the loss function is composed of both Positive Enhanced loss and Cross Entropy loss (CE loss), which is:
loss = PE\ loss + CE\ loss, (4)
CE loss is a commonly used loss function in deep learning. The official implementation of DeepLabV3+ employs CE loss as its loss function [67]. The incorporation of the PE loss component into the loss function serves to reinforce the gradient stability and to enhance the model’s probability estimation capabilities.
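A minimal PyTorch sketch of the combined loss is given below, following Equations (3) and (4). The soft (probability) predictions, one-hot targets, and the small epsilon term for numerical stability are assumptions for illustration, not details taken from the authors' implementation.

```python
import torch
import torch.nn.functional as F

def pe_loss(probs, target_onehot, eps=1e-6):
    """Positive Enhanced loss (Equation (3)).

    probs: (B, C, H, W) per-class probabilities from SoftMax.
    target_onehot: (B, C, H, W) one-hot ground truth.
    """
    dims = (0, 2, 3)  # sum over batch and pixels, keep the class dimension
    intersection = (probs * target_onehot).sum(dims)
    # Denominator counts only the wrongly predicted samples (FP + FN)
    wrong = (probs + target_onehot - 2.0 * probs * target_onehot).sum(dims)
    per_class = intersection / (wrong + eps)
    return 1.0 - per_class.mean()

def combined_loss(logits, target):
    """loss = PE loss + CE loss (Equation (4)); target is a (B, H, W) class-index map."""
    ce = F.cross_entropy(logits, target)
    probs = torch.softmax(logits, dim=1)
    onehot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
    return pe_loss(probs, onehot) + ce
```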

3.3. Training Strategy

To facilitate model convergence and maintain the stability of learning low-level features, this study adopts a strategy of using pre-trained weights combined with a freeze-training approach. Given that the DeepLab4LS model’s backbones are based on widely-used feature extraction networks that have been slightly adjusted for downsampling factors, it is possible to apply the pre-trained weights from these networks, which have been trained on various open-source datasets, as the initial weights for DeepLab4LS’s backbone components. The initial weights, wi, of the non-backbone parts of DeepLab4LS are initialized to follow a normal distribution with a mean of zero and a standard deviation of 0.02.
w_i \sim N\left( 0.0,\ 0.02^2 \right), (5)
The model training was conducted using an NVIDIA GeForce RTX 3060 graphics card with 12 GB of dedicated VRAM. The training phase was set to a total of 500 epochs, with the first 50 epochs designated for freeze training. During the freeze phase, the batch size was set to 8, and for the unfreeze phase, it was reduced to 4. This approach ensured that the total number of training steps exceeded 79,000, guaranteeing full convergence of the model. In the initial freeze training phase, the weights of the backbone components (Encoder1 and Encoder2) were fixed, allowing for rapid adjustment of the Mixer and Decoder weights, which quickly elevated the model to a higher level of accuracy. In the subsequent unfreeze phase, the weights of the entire model, including the backbone networks, were fine-tuned, enabling the model to gradually develop the capability for deep extraction and inference of landslide semantics.
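The initialization and freeze-training schedule can be expressed as a short sketch: non-backbone weights are drawn from N(0.0, 0.02²) as in Equation (5), backbone parameters are frozen for the first 50 epochs, and the batch size switches from 8 to 4 when the backbones are released. The attribute names (encoder1, encoder2, mixer, decoder) follow the earlier model sketch and are assumptions rather than the authors' exact code.

```python
import torch.nn as nn

def init_non_backbone(module):
    """Initialize non-backbone weights from N(0.0, 0.02^2) (Equation (5))."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

def set_backbone_trainable(model, trainable):
    """Freeze or unfreeze the two encoder backbones of DeepLab4LS."""
    for module in (model.encoder1, model.encoder2):
        for p in module.parameters():
            p.requires_grad = trainable

def configure_epoch(model, epoch, freeze_epochs=50):
    """Apply the freeze-training schedule for one epoch and return the batch size."""
    freeze_phase = epoch < freeze_epochs
    set_backbone_trainable(model, not freeze_phase)
    return 8 if freeze_phase else 4  # larger batches fit while the backbones are frozen

# Typical usage: apply init_non_backbone once to model.mixer and model.decoder before
# training, then call configure_epoch(model, epoch) at the start of each of the 500 epochs.
```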
Furthermore, to minimize VRAM consumption, the training process employs PyTorch’s mixed-precision training strategy. This strategy can boost training efficiency by integrating half-precision (float16) with single-precision (float32) floating-point numbers, thus striking a balance between performance and memory usage. Additionally, key hyperparameters that influence the training dynamics, including optimizer and learning rate, are concisely detailed in Table 1.
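A minimal sketch of one mixed-precision training step using PyTorch's torch.cuda.amp utilities is shown below; the model, optimizer, and loss function are assumed to be defined elsewhere (for example, as in the earlier sketches).

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid float16 underflow

def train_step(model, optimizer, optical, topo, target, loss_fn):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():        # forward pass in float16 where safe
        probs = model(optical, topo)
        loss = loss_fn(probs, target)
    scaler.scale(loss).backward()          # backward on the scaled loss
    scaler.step(optimizer)                 # unscale gradients, then update the weights
    scaler.update()
    return loss.item()
```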

3.4. Evaluation Metrics

In this study, mIoU (mean Intersection over Union), mPA (mean Pixel Accuracy) and Accuracy were introduced to evaluate the accuracy of the semantic segmentation model in landslide recognition tasks. mIoU is a crucial metric for almost all semantic segmentation models and indicates the percentage of intersection over the union of the predicted and ground truth regions, providing a quick assessment of the model’s performance across all classes. mPA is derived by averaging the percentage of correctly classified pixels within each class, placing more emphasis on the model’s performance in each individual class. Accuracy, which is more intuitive than the first two, represents the proportion of correctly predicted pixels out of the total pixels, reflecting the model’s performance across the entire dataset. Specifically, the calculation methods for these three evaluation metrics are as follows:
\mathrm{mIoU} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FP_i + FN_i}, (6)
\mathrm{mPA} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FN_i}, (7)
\mathrm{Accuracy} = \frac{\sum_{i=1}^{N} \left( TP_i + TN_i \right)}{\sum_{i=1}^{N} \left( TP_i + TN_i + FP_i + FN_i \right)}, (8)
where N is the total number of classes, and TPi, FPi, TNi, FNi are the true positive, false positive, true negative, and false negative counts for the i-th class, respectively.
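For reference, the three metrics can be computed from a per-class confusion matrix as in the following NumPy sketch, which mirrors Equations (6) and (7) and is numerically equivalent to Equation (8) for the overall accuracy; the function name is hypothetical.

```python
import numpy as np

def segmentation_metrics(conf):
    """Compute mIoU, mPA and Accuracy from an N x N confusion matrix,
    where conf[i, j] counts pixels of true class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp   # predicted as class i but belonging elsewhere
    fn = conf.sum(axis=1) - tp   # belonging to class i but predicted elsewhere

    miou = np.mean(tp / (tp + fp + fn))   # Equation (6)
    mpa = np.mean(tp / (tp + fn))         # Equation (7)
    accuracy = tp.sum() / conf.sum()      # equivalent to Equation (8)
    return miou, mpa, accuracy
```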

4. Results

4.1. Segmentation Results and Comparison Experiments

The segmentation results of DeepLab4LS using different topographic semantic extraction backbones are shown in Table 2. It is evident that no single topographic semantic extraction backbone excels in all three evaluation metrics simultaneously. However, in the overall evaluation, MobileNetV2 emerges as the superior topographic feature extraction backbone network. It has the highest mIoU and the second highest mPA and accuracy, demonstrating its robust ability to extract landslide semantics from topographic features. Somewhat surprisingly, VGG-16 achieved the highest mPA of the five topographic semantic extraction backbones, but scored the lowest in both mIoU and accuracy. This suggests that the model may have a tendency to over-predict pixels as landslides, or it may not have effectively learned the boundary features of landslides. Overall, considering both the number of parameters and performance, MobileNetV2 stands out as a relatively optimal backbone for topographic semantic extraction.
Further, the DeepLab4LS model was compared with five other popular semantic segmentation models, and the results are shown in Table 3. It is quite apparent that the DeepLab4LS model has an absolute advantage in all evaluation metrics, with its mIoU surpassing the second place by 5.5 percentage points, indicating the significant gain brought by the integration of topographic features in landslide recognition. Among the other five models, DeepLabV3+ and PSPNet exhibited similar performance and were superior to the other three semantic segmentation models. Notably, U-Net, which excels in medical imaging datasets, performed moderately on the landslide drone dataset. In fact, models like DeepLabV3+ and PSPNet [68] have adopted a degree of multi-scale optimization to enhance the model’s multi-scale target perception capabilities; SegFormer [69] also employed a Transformer architecture for more effective regional feature modeling. In contrast, U-Net has a simpler architecture and lacks sufficient multi-scale target extraction capabilities, which may be the main reason limiting its landslide recognition ability.
In order to visually analyze the performance enhancement in landslide recognition exhibited by DeepLab4LS compared to the DeepLabV3+ model, we selected a subset of images from the validation set for comparative analysis, as shown in Figure 12. It can be observed that DeepLab4LS shows a significant enhancement in landslide recognition performance over the DeepLabV3+ model. For landslides with distinct boundary information, both DeepLab4LS and DeepLabV3+ can roughly identify the extent of the landslides (as in rows 3, 5, and 8 of Figure 12); however, for those landslides with less distinct boundaries (as in rows 6 and 7 of Figure 12), DeepLabV3+ often struggles to produce reliable results. It is also noteworthy that even for landslides with distinct boundaries, the edges recognized by DeepLabV3+ tend to be coarse. In contrast, DeepLab4LS is able to delineate the true boundaries of the landslides with greater precision.

4.2. Ablation Experiments

In this study, we designed the DeepLab4LS model, which consists of Encoder1, Encoder2, Mixer and Decoder modules, and introduced the PE loss function to enhance the model’s ability to understand positive samples. Through a series of ablation experiments, we aim to demonstrate the contributions of these modules or the loss function. Specifically, we aim to demonstrate:
(1)
The contribution of the new combination of loss functions.
(2)
The contribution of the Mixer module.
(3)
The contribution of the dual-encoder architecture.
The design and results of the ablation experiments are presented in Table 4, revealing significant insights. The loss function in this study is composed of the CE and PE terms, and the removal of either significantly diminishes the performance of the model. Both the Mixer and the dual-encoder design play a crucial role in the model; removing the Mixer and Encoder2 results in decreases of 1.6 and 1.0 percentage points in mIoU, respectively. The contribution of Encoder1 is even more pronounced; without it, the model’s mIoU drops by 12.1 percentage points, which is clearly unacceptable. Overall, the ablation experiments have successfully demonstrated the vital contributions of the combined loss functions, the Mixer module, and the dual-encoder architecture to the model.

5. Discussion

5.1. Topographic Features in Landslide Segmentation

One of the major enhancements that DeepLab4LS offers over DeepLabV3+ is the incorporation of topographic images as part of the input. As indicated in Figure 12, the introduced topographic information aids in bolstering the model’s comprehension of landslide semantics, particularly when landslide boundaries are indistinct. This outcome is unsurprising: since a landslide is a geological entity, analyzing its topographic characteristics is crucial for effective differentiation. Indeed, many features of landslides, such as cracks, fissures, accumulation bodies, and sliding surfaces, are more pronounced in topographic images [70,71], allowing the model to learn these features more easily from the topographic images.
However, despite the significant assistance that topographic features in images provide for landslide recognition, ablation experiments (Table 4) indicate that relying solely on topographic images for recognition yields less accurate results. This phenomenon could be related to various factors, including digital imaging, neural networks, and the inherent characteristics of landslides themselves. Here, we propose a possible explanation: features in optical images are “easier” for neural networks to learn than those in topographic images. Therefore, when both optical and topographic images are used, the model can initially obtain some “hints” of landslide features from the optical images, which then synergistically promotes a deeper learning of landslide features from both optical and topographic images. In contrast, when only topographic images are used, the model is immediately confronted with challenging landslide features to learn, resulting in suboptimal training results.

5.2. Mixer of DeepLab4LS Model

The ablation experiments (Table 4) have confirmed the significant role of the Mixer in the model. It is noteworthy that when employing the dual-encoder structure without the Mixer, the mIoU is 74.4%, which is even lower than the 75.0% obtained when using only Encoder1. This indicates that merely introducing Encoder2 does not aid in a better understanding of the landslide semantics, but rather may somewhat impede the model’s comprehension of the optical landslide semantics.
To gain a better understanding of the Mixer’s function, we visualized the feature maps at five key points in DeepLab4LS by averaging the values across channels, as shown in Figure 13. It is readily observed that in the feature maps F1 and F2 output by Encoder1, the extent of the landslide and the background can be roughly discerned. However, in the feature map F3 output by Encoder2, the extent information is not as clear. This corroborates our conjecture in Section 5.1 that the topographic features of landslides are more challenging to learn. Once the three feature maps are stacked into F4, the network possesses both optical and topographic semantics of the landslide, but the features within the landslide extent remain somewhat disordered. Feature map F5, the output of F4 after Mixer processing, is considerably smoother, with features within the landslide extent exhibiting high consistency. Finally, the Decoder captures such a feature distribution and outputs reliable segmentation results. In summary, the Mixer effectively facilitates the deep integration of optical and topographic semantics in the model.

5.3. Loss Function in Landslide Segmentation

In this study, we attempted to enhance the model’s comprehension of positive samples by introducing a novel loss function—PE loss. The results of the ablation experiments (Table 4) indicate that PE loss effectively achieved our goal. Although the standard loss function in this study was designed as a combination of CE loss and PE loss, the ablation results show that even when using PE loss alone, there is a significant performance enhancement over using CE loss alone. We can try to explain this phenomenon based on the formula of PE loss itself. Observing the fractional term of PE loss, the numerator represents the correctly predicted positive samples, while the denominator accounts for all incorrectly predicted samples. Compared to Soft IoU loss (which will be discussed in detail later), the denominator of PE loss lacks the correctly predicted positive samples, which means that PE loss will easily attain a small value when a large number of positive samples are successfully predicted and few positive and negative samples are mispredicted, thereby amplifying the penalty in the opposite scenario. In essence, through PE loss, we aim to instruct the model to focus its efforts on the current positive sample.
In fact, the design of PE loss in this study was inspired by another similar loss function—Soft IoU loss (SI loss) [72]. Compared to PE loss, SI loss includes correctly predicted positive samples in the denominator, making it “less aggressive” than PE loss. The classic form of SI loss is:
classic\ SI\ loss = 1 - \frac{1}{\left| C \right|} \sum_{c \in C} \frac{\sum_{pixels} y_{\mathrm{true}} \cdot y_{\mathrm{pred}}}{\sum_{pixels} \left( y_{\mathrm{pred}} + y_{\mathrm{true}} - y_{\mathrm{true}} \cdot y_{\mathrm{pred}} \right)}, (9)
We can also logarithmize the fractional term, resulting in:
logarithmic\ SI\ loss = 1 - \log \left[ \frac{1}{\left| C \right|} \sum_{c \in C} \frac{\sum_{pixels} y_{\mathrm{true}} \cdot y_{\mathrm{pred}}}{\sum_{pixels} \left( y_{\mathrm{pred}} + y_{\mathrm{true}} - y_{\mathrm{true}} \cdot y_{\mathrm{pred}} \right)} \right], (10)
Clearly, logarithmic SI loss makes SI loss more “aggressive” in a more direct manner. Here, the meanings of the terms in Equations (9)–(11) are the same as those in Equation (3).
We applied the above two forms of SI loss to the training process, and the results are shown in Table 5. Additionally, we also tried to introduce dice loss [73], which is often used to address the issue of imbalance in samples, to see if it could be profitable for the training. The dice loss is defined as follows:
dice\ loss = 1 - \frac{1}{\left| C \right|} \sum_{c \in C} \frac{2 \sum_{pixels} y_{\mathrm{true}} \cdot y_{\mathrm{pred}}}{\sum_{pixels} \left( y_{\mathrm{pred}}^2 + y_{\mathrm{true}}^2 \right)} (11)
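For completeness, the three alternative loss terms defined above can also be written compactly. The sketch below follows Equations (9)–(11) and shares the conventions of the PE loss sketch in Section 3.2.4 (soft per-class probabilities, one-hot targets, and an added epsilon for numerical stability, which is an assumption for illustration).

```python
import torch

def _soft_iou_per_class(probs, target, eps=1e-6):
    """Per-class soft IoU: sum(y_true*y_pred) / sum(y_pred + y_true - y_true*y_pred)."""
    dims = (0, 2, 3)
    inter = (probs * target).sum(dims)
    union = (probs + target - probs * target).sum(dims)
    return inter / (union + eps)

def classic_si_loss(probs, target):        # Equation (9)
    return 1.0 - _soft_iou_per_class(probs, target).mean()

def logarithmic_si_loss(probs, target):    # Equation (10)
    return 1.0 - torch.log(_soft_iou_per_class(probs, target).mean())

def dice_loss(probs, target, eps=1e-6):    # Equation (11)
    dims = (0, 2, 3)
    inter = (probs * target).sum(dims)
    denom = (probs ** 2 + target ** 2).sum(dims)
    return 1.0 - (2.0 * inter / (denom + eps)).mean()
```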
Generally, the ranking of the loss functions in terms of performance gain for the model is as follows: PE loss (Equation (3)) > classic SI loss (Equation (9)) > dice loss (Equation (11)) > logarithmic SI loss (Equation (10)). PE loss, dice loss, and classic SI loss all improve the model’s ability to recognize landslides to some extent. However, logarithmic SI loss resulted in a negative optimization, which may be due to its “excessive” enhancement of penalty, causing the model to easily fall into local optima. In summary, our practice has proven that PE loss can effectively enhance the performance of landslide recognition models.

6. Conclusions

In this study, we designed and implemented the DeepLab4LS model, a novel semantic segmentation model for landslide recognition based on DeepLabV3+. We introduced a dual-encoder design and established a model architecture composed of Encoder1, Encoder2, Mixer, and Decoder modules. Within this framework, the model receives both optical and topographic images as inputs and achieves deep integration within the model itself. This endows the model with the ability to infer both optical and topographic semantics of landslides, aligning more closely with the approach human experts use to recognize landslides. Another significant optimization in this study is the introduction of a new loss function term—Positive Enhanced loss (PE loss)—to enhance the model’s comprehension of positive samples.
A series of comparison and ablation experiments were conducted on optical and topographic UAV datasets of the Baihetan reservoir. The results show that the DeepLab4LS model achieved an mIoU of 76.0%, outperforming DeepLabV3+’s mIoU of 70.5% and significantly surpassing other popular models such as SegFormer. These results are closely related to the introduction of the dual-encoder architecture and the Mixer module. Our experiments also found that optical and topographic images play a mutually reinforcing role, and that inputting only one type of image would lead to a decline in model performance. With only topographic images as input, the mIoU could drop by as much as 12.1 percentage points, which is completely unacceptable. The introduction of PE loss is also crucial for enhancing model performance, increasing the mIoU by approximately 3.8 percentage points. In fact, PE loss allows the model to focus more on improving the recognition quality of current positive samples, which is evidently beneficial for landslide recognition tasks against complex backgrounds.

Author Contributions

All authors contributed to the manuscript and discussed the results. Z.-H.L. and H.-X.X. drafted the manuscript and were responsible for the data processing, analysis and interpretation of the results. A.-C.S. and Z.-H.N. proposed the ideas for the thesis, designed the structure and contributed to the final revision of the thesis. Y.-X.H. collected the on-site UAV data. H.-B.L. contributed to the UAV modeling and the SPOT analysis. N.J. revised the manuscript and contributed to the final revision of the thesis. All authors have read and agreed to the published version of the manuscript.

Funding

This study has been supported by the National Natural Science Foundation of China (U2240221), the Sichuan Youth Science and Technology Innovation Research Team Project (2020JDTD0006) and the Key Science and Technology Plan Project of PowerChina Huadong Engineering Corporation Limited (KY2021–ZD–03).

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to confidentiality requirements.

Acknowledgments

The authors would like to thank the anonymous reviewers for their time and constructive comments on our article.

Conflicts of Interest

Author Zhi-Hai Li, An-Chi Shi and Zi-Hao Niu were employed by the company Power China Huadong Engineering Corporation Limited, Hangzhou 311122, China. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Hungr, O.; Leroueil, S.; Picarelli, L. The Varnes classification of landslide types, an update. Landslides 2014, 11, 167–194. [Google Scholar] [CrossRef]
  2. Highland, L.M.; Bobrowsky, P. The Landslide Handbook—A Guide to Understanding Landslides; US Geological Survey: Reston, VA, USA, 2008. [Google Scholar]
  3. Fang, K.; Dong, A.; Tang, H.; An, P.; Wang, Q.; Jia, S.; Zhang, B. Development of an easy-assembly and low-cost multismartphone photogrammetric monitoring system for rock slope hazards. Int. J. Rock Mech. Min. 2024, 174, 105655. [Google Scholar] [CrossRef]
  4. Zhang, J.; Gurung, D.R.; Liu, R.; Murthy, M.S.R.; Su, F. Abe Barek landslide and landslide susceptibility assessment in Badakhshan Province, Afghanistan. Landslides 2015, 12, 597–609. [Google Scholar] [CrossRef]
  5. Peng, M.; Ma, C.; Shen, D.; Yang, J.; Zhu, Y. Breaching and Flood Routing Simulation of the 2018 Two Baige Landslide Dams in Jinsha River. In Dam Breach Modelling and Risk Disposal: Proceedings of the First International Conference on Embankment Dams (ICED 2020) 1; Springer: Cham, Switzerland, 2020; pp. 371–373. [Google Scholar]
  6. Yang, W.; Fang, J.; Jing, L.-Z. Landslide-lake outburst floods accelerate downstream slope slippage. Earth Surf. Dyn. Discuss. 2021, 9, 1251–1262. [Google Scholar] [CrossRef]
  7. Li, H.; Li, X.; Ning, Y.; Jiang, S.; Zhou, J. Dynamical process of the Hongshiyan landslide induced by the 2014 Ludian earthquake and stability evaluation of the back scarp of the remnant slope. Bull. Eng. Geol. Environ 2019, 78, 2081–2092. [Google Scholar] [CrossRef]
  8. Zhou, J.; Lu, P.; Yang, Y. Reservoir landslides and its hazard effects for the hydropower station: A case study. In Advancing Culture of Living with Landslides: Volume 2 Advances in Landslide Science; Springer: Cham, Switzerland, 2017; pp. 699–706. [Google Scholar]
  9. Fell, R.; Hartford, D. Landslide Risk Management. Landslide Risk Assessment; Routledge: London, UK, 2018; pp. 51–109. [Google Scholar]
  10. Pazzi, V.; Morelli, S.; Fanti, R. A review of the advantages and limitations of geophysical investigations in landslide studies. Int. J. Geophys. 2019, 2019, 2983087. [Google Scholar] [CrossRef]
  11. Giordan, D.; Adams, M.S.; Aicardi, I.; Alicandro, M.; Allasia, P.; Baldo, M.; De Berardinis, P.; Dominici, D.; Godone, D.; Hobbs, P. The use of unmanned aerial vehicles (UAVs) for engineering geology applications. Bull. Eng. Geol. Environ. 2020, 79, 3437–3481. [Google Scholar] [CrossRef]
  12. Rothmund, S.; Niethammer, U.; Malet, J.; Joswig, M. Landslide surface monitoring based on UAV-and ground-based images and terrestrial laser scanning: Accuracy analysis and morphological interpretation. First Break 2013, 31. [Google Scholar] [CrossRef]
  13. Hu, S.; Qiu, H.; Pei, Y.; Cui, Y.; Xie, W.; Wang, X.; Yang, D.; Tu, X.; Zou, Q.; Cao, P. Digital terrain analysis of a landslide on the loess tableland using high-resolution topography data. Landslides 2019, 16, 617–632. [Google Scholar] [CrossRef]
  14. Furukawa, F.; Laneng, L.A.; Ando, H.; Yoshimura, N.; Kaneko, M.; Morimoto, J. Comparison of RGB and multispectral unmanned aerial vehicle for monitoring vegetation coverage changes on a landslide area. Drones 2021, 5, 97. [Google Scholar] [CrossRef]
  15. Bui, T.; Lee, P.; Lum, K.; Loh, C.; Tan, K. Deep learning for landslide recognition in satellite architecture. IEEE Access 2020, 8, 143665–143678. [Google Scholar] [CrossRef]
  16. Li, Z.; Jiang, N.; Shi, A.; Zhao, L.; Xian, Z.; Luo, X.; Li, H.; Zhou, J. Reservoir landslide monitoring and mechanism analysis based on UAV photogrammetry and sub-pixel offset tracking: A case study of Wulipo landslide. Front. Earth Sci. 2024, 11, 1333815. [Google Scholar] [CrossRef]
  17. Hölbling, D.; Eisank, C.; Albrecht, F.; Vecchiotti, F.; Friedl, B.; Weinke, E.; Kociu, A. Comparing Manual and Semi-Automated Landslide Mapping Based on Optical Satellite Images from Different Sensors. Geosciences 2017, 7, 37. [Google Scholar] [CrossRef]
  18. Rosin, P.L.; Hervás, J. Remote sensing image thresholding methods for determining landslide activity. Int. J. Remote Sens. 2005, 26, 1075–1092. [Google Scholar] [CrossRef]
  19. Yang, X.; Chen, L. Using multi-temporal remote sensor imagery to detect earthquake-triggered landslides. Int. J. Appl. Earth Obs. Geoinf. 2010, 12, 487–495. [Google Scholar] [CrossRef]
  20. Eisank, C.; Smith, M.; Hillier, J. Assessment of multiresolution segmentation for delimiting drumlins in digital elevation models. Geomorphology 2014, 214, 452–464. [Google Scholar] [CrossRef]
  21. Rau, J.; Jhan, J.; Rau, R. Semiautomatic object-oriented landslide recognition scheme from multisensor optical imagery and DEM. IEEE Trans. Geosci. Remote Sensing 2013, 52, 1336–1349. [Google Scholar] [CrossRef]
  22. Liu, Q.; Wu, T.; Deng, Y.; Liu, Z. Intelligent identification of landslides in loess areas based on the improved YOLO algorithm: A case study of loess landslides in Baoji City. J. Mt. Sci. 2023, 20, 3343–3359. [Google Scholar] [CrossRef]
  23. Cheng, L.; Li, J.; Duan, P.; Wang, M. A small attentional YOLO model for landslide detection from satellite remote sensing images. Landslides 2021, 18, 2751–2765. [Google Scholar] [CrossRef]
  24. Chen, X.; Yao, X.; Zhou, Z.; Liu, Y.; Yao, C.; Ren, K. DRs-UNet: A deep semantic segmentation network for the recognition of active landslides from InSAR imagery in the three rivers region of the Qinghai–Tibet Plateau. Remote Sens. 2022, 14, 1848. [Google Scholar] [CrossRef]
Figure 1. Different landslide recognition methods and their general results based on a UAV orthophoto: (a) landslide boundary identification by manual judgement; (b) landslide identification based on the object-oriented method; (c) landslide positioning based on the object-based DL method; (d) landslide identification based on the pixel-based DL method.
Figure 2. The UAV orthophoto (left) and the geographic location (right) of the study area. The UAV orthophoto is approximately 50 km in length along the Jinsha River and covers an area of approximately 65 km².
Figure 3. UAV equipment: (a) photograph of UAV working onsite; (b) Feima D2000 UAV; (c) SONY-DOP 3000 camera.
Figure 4. Input image data used for the semantic segmentation model. The orthophoto consists of three channels, red (R), green (G), and blue (B), captured by the UAV aerial camera, while the relative elevation, slope, hillshade, and ground truth each consist of a single channel.
Figure 5. Workflow of dataset creation from step 1 to step 6. In step 1, the GT was labeled by manually observing the landslide areas in the orthophoto, and the hillshade and slope were exported from the DSM. In step 2, the large orthophoto was cropped (pseudocode is provided) and converted into an indexed set consisting of a series of small-scale images. In step 3, the filter condition is presented, where I is the small image after cropping, (x, y) is the coordinate of a pixel, W and H are the width and height of the image before cropping, respectively, and g(I) is the dataset after filtering. In step 4, the filtering was applied according to this condition (pseudocode is provided). In step 5, each indexed set was divided into two parts for the training and validation procedures, respectively. In step 6, the final dataset was fed into the model for training and validation.
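For readers who prefer runnable code to the pseudocode referenced in Figure 5, the cropping and filtering of steps 2–4 can be sketched roughly as below. The tile size and the positive-pixel threshold are illustrative assumptions; the actual filter condition g(I) is the one stated in the figure.

```python
import numpy as np

TILE = 512  # assumed tile size; the crop size used in the study may differ

def crop_to_tiles(orthophoto, label, tile=TILE):
    """Slide a non-overlapping window over the large orthophoto and its label,
    yielding an indexed set of small image/label pairs (Figure 5, step 2)."""
    h, w = label.shape[:2]
    for row in range(0, h - tile + 1, tile):
        for col in range(0, w - tile + 1, tile):
            yield (orthophoto[row:row + tile, col:col + tile],
                   label[row:row + tile, col:col + tile])

def filter_tiles(tiles, min_positive_ratio=0.01):
    """Keep tiles whose label contains a minimum share of landslide pixels,
    a simple stand-in for the filter condition g(I) (Figure 5, steps 3-4)."""
    return [(img, lab) for img, lab in tiles
            if (lab > 0).mean() >= min_positive_ratio]

# Toy usage with random arrays in place of the real orthophoto and ground truth.
ortho = np.random.randint(0, 255, (2048, 2048, 3), dtype=np.uint8)
gt = np.random.randint(0, 2, (2048, 2048), dtype=np.uint8)
dataset = filter_tiles(crop_to_tiles(ortho, gt))
```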
Figure 6. Dataset augmentation procedure applied while loading the dataset in each epoch. Each original image passes through a sequential process consisting of stretching, flipping, Gaussian blur, and rotation. All procedures are applied randomly to increase the variability of the final augmented image.
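A minimal PyTorch sketch of the random augmentation chain in Figure 6 is given below; the probabilities and parameter ranges are illustrative assumptions, not the values used in this study, and the label is resized and rotated with nearest-neighbor interpolation so that class indices stay intact.

```python
import random
import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode

def augment(image, mask):
    """Randomly stretch, flip, blur and rotate an image/mask pair (tensors of
    shape (C, H, W) and (1, H, W)); every step mirrors a stage of Figure 6."""
    # Random stretching: rescale height and width by independent factors.
    if random.random() < 0.5:
        h, w = image.shape[-2:]
        new_size = [int(h * random.uniform(0.8, 1.2)),
                    int(w * random.uniform(0.8, 1.2))]
        image = TF.resize(image, new_size)
        mask = TF.resize(mask, new_size, interpolation=InterpolationMode.NEAREST)
    # Random horizontal flip.
    if random.random() < 0.5:
        image, mask = TF.hflip(image), TF.hflip(mask)
    # Random Gaussian blur (image only; the label must stay crisp).
    if random.random() < 0.3:
        image = TF.gaussian_blur(image, kernel_size=5)
    # Random rotation applied identically to image and mask.
    if random.random() < 0.5:
        angle = random.uniform(-30.0, 30.0)
        image = TF.rotate(image, angle)
        mask = TF.rotate(mask, angle, interpolation=InterpolationMode.NEAREST)
    return image, mask
```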
Figure 7. The basic structure of the DeepLabV3+ model, showing a typical encoder-decoder architecture.
Figure 8. A dilated convolution with a dilation rate of 4 expands the receptive field of a 3 × 3 convolutional kernel: (a) an orthophoto containing an entire landslide; (b) feature pixels corresponding to dilated and standard convolutions (yellow: feature pixels for dilated convolution; blue: feature pixels for standard convolution; green: feature pixels for both dilated and standard convolutions); (c) schematic of the receptive field area.
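A minimal PyTorch illustration of what Figure 8 depicts: a 3 × 3 convolution with a dilation rate of 4 keeps its nine weights but samples them on a 9 × 9 grid, so the receptive field grows without extra parameters (padding = 4 preserves the spatial size).

```python
import torch
import torch.nn as nn

# Dilated 3x3 convolution (dilation rate 4) versus a standard 3x3 convolution.
dilated = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3,
                    padding=4, dilation=4)
standard = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

x = torch.randn(1, 3, 256, 256)             # a dummy orthophoto tile
print(dilated(x).shape, standard(x).shape)  # both: torch.Size([1, 16, 256, 256])
```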
Figure 9. The basic structure of the ASPP module.
Figure 10. Schematic of common data fusion methods: simple fusion, channel fusion, and dual-encoder fusion.
Figure 11. The basic structure of the DeepLab for Landslide (DeepLab4LS) model, composed of Encoder1, Encoder2, Mixer and Decoder.
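The module layout of Figure 11 can be summarized in a schematic skeleton such as the one below. The placeholder convolutions stand in for the MobileNetV2 backbones, ASPP, and DeepLabV3+-style decoder of the actual DeepLab4LS; only the data flow (Encoder1 for the optical input, Encoder2 for the topographic input, Mixer for fusion, Decoder for prediction) follows the figure.

```python
import torch
import torch.nn as nn

class DualEncoderSegmenter(nn.Module):
    """Schematic dual-encoder skeleton; internal layers are placeholders,
    not the implementation used in this study."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.encoder1 = nn.Sequential(nn.Conv2d(3, 64, 3, stride=4, padding=1), nn.ReLU())  # optical (RGB)
        self.encoder2 = nn.Sequential(nn.Conv2d(3, 64, 3, stride=4, padding=1), nn.ReLU())  # relative elevation, slope, hillshade
        self.mixer = nn.Sequential(nn.Conv2d(128, 64, 1), nn.ReLU())                        # fuse concatenated features
        self.decoder = nn.Sequential(
            nn.Conv2d(64, num_classes, 1),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False))

    def forward(self, optical, topographic):
        f_opt = self.encoder1(optical)        # optical semantics
        f_topo = self.encoder2(topographic)   # topographic semantics
        fused = self.mixer(torch.cat([f_opt, f_topo], dim=1))
        return self.decoder(fused)

model = DualEncoderSegmenter()
logits = model(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))
print(logits.shape)  # torch.Size([1, 2, 256, 256])
```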
Figure 12. Eight typical segmentation results, numbered 1–8: (a) the original UAV orthophoto; (b) the ground truth determined by landslide experts; (c,d) the segmentation results obtained by the DeepLab4LS and DeepLabV3+ models, respectively.
Figure 13. Visualization of feature maps for key nodes in DeepLab4LS. F1: low-level optical feature map of Encoder1 output; F2: high-level optical feature map of Encoder1 output; F3: topographic feature map of Encoder2 output; F4: concatenated feature map before fusion in Mixer; F5: fused feature map of Mixer output.
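Feature maps such as F1–F5 can be captured with PyTorch forward hooks. The sketch below assumes a model that exposes encoder1, encoder2, and mixer attributes (for instance, the schematic skeleton shown after Figure 11); the hook names are placeholders.

```python
import torch

feature_maps = {}

def save_output(name):
    def hook(module, inputs, output):
        feature_maps[name] = output.detach()
    return hook

# `model` is assumed to be an instance of the DualEncoderSegmenter sketch above.
model.encoder1.register_forward_hook(save_output("optical_features"))
model.encoder2.register_forward_hook(save_output("topographic_features"))
model.mixer.register_forward_hook(save_output("fused_features"))

with torch.no_grad():
    model(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))

for name, fmap in feature_maps.items():
    # A quick visualization is to plot the channel-wise mean of each map.
    print(name, fmap.shape)
```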
Table 1. The training parameters set for DeepLab4LS.
Environment and Parameters | Name and Value
Development Environment | Python 3.7.13 / PyTorch 1.10.1 / NVIDIA CUDA 11.3
Epoch | 50 (freeze) + 450 (unfreeze)
Batch Size | 8 (freeze) / 4 (unfreeze)
Optimizer | Stochastic gradient descent
Momentum | 0.9
Weight Decay | 10⁻⁴
Learning Rate | 7 × 10⁻³ (beginning) / 7 × 10⁻⁵ (minimum)
Learning Rate Decay | Cosine annealing
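In PyTorch, wiring up the optimizer and learning-rate schedule listed in Table 1 might look like the sketch below; the freeze/unfreeze handling is simplified, and model and train_one_epoch are placeholders for the actual network and training loop.

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=7e-3,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=500, eta_min=7e-5)  # 50 freeze + 450 unfreeze epochs

for epoch in range(500):
    # In the freeze stage (first 50 epochs) the backbone parameters could be
    # kept fixed, e.g. by setting requires_grad=False on them.
    train_one_epoch(model, optimizer)
    scheduler.step()
```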
Table 2. Segmentation results of DeepLab4LS under different topographic semantic extraction backbones.
Optical Semantic Extraction Backbone | Topographic Semantic Extraction Backbone | mIoU (%) | mPA (%) | Accuracy (%)
MobileNetV2 | MobileNetV2 | 76.0 | 85.3 | 92.3
MobileNetV2 | VGG-16 | 72.6 | 86.3 | 90.3
MobileNetV2 | ShuffleNetV2 | 75.5 | 83.7 | 92.4
MobileNetV2 | ResNet-18 | 74.1 | 84.1 | 91.5
MobileNetV2 | ResNet-50 | 74.8 | 83.7 | 92.0
Table 3. Comparison experiments between DeepLab4LS and other popular semantic segmentation models.
Model | mIoU (%) | mPA (%) | Accuracy (%)
DeepLab4LS (with MobileNetV2 as the topographic backbone) | 76.0 | 85.3 | 92.3
DeepLabV3+ | 70.5 | 80.1 | 90.4
PSPNet | 70.5 | 81.2 | 90.1
SegFormer | 70.1 | 77.8 | 90.8
U-Net | 67.6 | 77.5 | 89.3
HRNet | 69.6 | 80.0 | 89.8
Table 4. Ablation experiments based on the DeepLab4LS model, each experiment differing in module composition or loss function.
Base Model | Encoder1 | Encoder2 | Mixer | Loss Functions | mIoU (%) | mPA (%) | Accuracy (%)
DeepLab4LS (with MobileNetV2 as the topographic backbone) | | | | CE ² + PE ³ | 76.0 | 85.3 | 92.3
 | | | | CE | 72.2 | 81.2 | 91.1
 | | | | PE | 75.6 | 85.0 | 92.2
 | | | | CE + PE | 74.4 | 81.7 | 92.2
 | | | | CE + PE | 63.9 | 76.4 | 83.9
 | | | | ¹ CE + PE | 75.0 | 84.0 | 92.1
¹ The DeepLab4LS model degrades to the DeepLabV3+ model in this scenario (but with different loss functions). ² CE: Cross Entropy (loss). ³ PE: Positive Enhanced (loss).
Table 5. Segmentation results of DeepLab4LS under different loss functions.
Base Model | Loss Functions | mIoU (%) | mPA (%) | Accuracy (%)
DeepLab4LS (with MobileNetV2 as the topographic backbone) | CE ¹ | 72.7 | 81.5 | 91.3
 | CE + PE ² | 76.0 | 85.3 | 92.3
 | CE + dice | 72.9 | 81.5 | 91.5
 | CE + classic SI ³ | 74.0 | 84.3 | 91.4
 | CE + logarithmic SI ³ | 72.2 | 82.5 | 90.8
¹ CE: Cross Entropy (loss). ² PE: Positive Enhanced (loss). ³ SI: Soft IoU (loss).
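For reference, the "CE + dice" combination in Table 5 can be sketched as below; the Positive Enhanced (PE) loss proposed in this study is not reproduced here, and the sketch only illustrates how an additional region-based term is added to the standard Cross Entropy loss.

```python
import torch
import torch.nn.functional as F

def ce_plus_dice(logits, target, dice_weight=1.0, eps=1e-6):
    """Cross Entropy plus a soft Dice term for binary landslide segmentation.
    logits: (N, 2, H, W) raw scores; target: (N, H, W) class indices."""
    ce = F.cross_entropy(logits, target)
    probs = torch.softmax(logits, dim=1)[:, 1]       # landslide probability
    positive = (target == 1).float()
    intersection = (probs * positive).sum()
    dice = 1.0 - (2.0 * intersection + eps) / (probs.sum() + positive.sum() + eps)
    return ce + dice_weight * dice
```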
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
