#### *2.3. Data Modelling*

#### 2.3.1. Spatial Dependence Modelling by GCN

Acquiring complex spatial dependence is one of the critical problems in spatiotemporal prediction. Traditional CNN-based models can capture local spatial features, but only in Euclidean space, such as a regular grid [15]. The GNSS monitoring network deployed on a landslide is a graph structure rather than a two-dimensional grid, which means a traditional CNN cannot capture the spatial dependence correctly.

Recently, the GCN model has received widespread attention for extending convolutional operations to non-Euclidean domains. It has gradually been applied in image classification and traffic road networks, and it has been demonstrated that the spatial structure captured by GCNs improves forecasting accuracy [26].

Given an adjacency matrix *A* and a feature matrix *X*, the GCN model builds a filter to handle specific spectral information in the Fourier domain. The filter, working on the nodes (e.g., monitoring sites) of a graph (e.g., the GNSS network structure), focuses on the nodes' spatial features and measures their closeness within their first-order neighbourhood.

For each node, the GCN model learns the topological relationship between the node and its surrounding nodes. The encoder then applies graph convolution to the geographical units of the monitoring network and the attributes of the nodes to generate latent representations that capture the spatial dependence. As a result, similar units obtain similar representations, which are then used by the decoder for prediction.

As shown in Figure 6a, the GCN model can be constructed by stacking multiple convolutional layers to learn higher-order similarities between the nodes in the graph. The propagation of the GCN can be formulated as:

$$y_{l+1} = \sigma\left(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}\, y_l W_l\right) \tag{5}$$

where $\sigma(\cdot)$ represents a nonlinear activation function, such as the ReLU; $\tilde{A} = A + I$ is the self-connection structure matrix, with $I$ the identity matrix; $\tilde{D}$ is the diagonal degree matrix of $\tilde{A}$, with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$; $W_l$ is the weight matrix of the $l$-th neural network layer; and $y_l$ is the node-level output of layer $l$, with $y_0 = X$.
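The propagation rule of Equation (5) can be sketched in plain Python for a small graph; the helper names (`matmul`, `gcn_layer`) and the choice of ReLU as the activation are illustrative, not part of the original model:

```python
import math

def matmul(A, B):
    # naive matrix product for small dense matrices
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def gcn_layer(A, Y, W):
    """One GCN propagation step: relu(D^{-1/2} A~ D^{-1/2} Y W),
    with A~ = A + I (self-connections) and D the degree matrix of A~."""
    n = len(A)
    A_tilde = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    d = [sum(row) for row in A_tilde]                       # degrees of A~
    # symmetric normalisation D^{-1/2} A~ D^{-1/2}
    A_norm = [[A_tilde[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)]
              for i in range(n)]
    Z = matmul(matmul(A_norm, Y), W)
    return [[max(0.0, z) for z in row] for row in Z]        # ReLU
```

For a two-node graph with one edge, a single feature, and an identity weight, the layer averages the connected nodes' features, which illustrates how first-order neighbourhood information is mixed.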

**Figure 6.** The architectures of two deep-learning models: (**a**) the graph convolution networks; (**b**) the gated recurrent units.

#### 2.3.2. Temporal Dependence Modelling by GRU

Acquiring temporal dependence is another crucial problem in landslide displacement spatiotemporal prediction. The recurrent neural network (RNN) has achieved impressive results in time series forecasting [14]. However, the traditional RNN has limitations for long-term forecasting due to deficiencies such as gradient vanishing and explosion [27].

The LSTM model [28] and the GRU model [29] have been proposed to address these problems. The basic principles of the LSTM and the GRU are almost identical: both incorporate gated mechanisms and can handle longer sequence tasks. Compared to the LSTM, the GRU combines the forget gate and the input gate into an update gate, decreasing the data flow; thus, it has a more straightforward structure, fewer parameters, and faster convergence [30]. Therefore, we chose the GRU model to learn temporal dependence from the displacement time series data.

The data flow of the GRU is illustrated in Figure 6b, where the subscripts indicate time: $x_t$ denotes the model input at time $t$, $h_{t-1}$ is the hidden state at time $t-1$, $r_t$ stands for the reset gate, $u_t$ represents the update gate, $c_t$ is the stored memory content, and $h_t$ denotes the output state. The update gate $u_t$ controls the degree to which state information is transmitted from time $t-1$ to time $t$; the larger the value of $u_t$, the more important the previous state. The reset gate $r_t$ controls the influence of time $t-1$ on time $t$; the smaller the value of $r_t$, the weaker the effect.

The GRU obtains the monitoring information at time $t$ by taking the hidden state at time $t-1$ and the current monitoring information as inputs. While capturing the monitoring information at the present moment, the model still retains the changing trend of the historical monitoring information and can therefore capture the temporal dependence.

#### 2.3.3. Spatiotemporal Model Using GC-GRU-N

To capture both spatial and temporal features from landslide monitoring data, we propose a new deep learning model (GC-GRU-N) based on the graph convolutional network (GCN) and the gated recurrent unit (GRU). In the model, a graph convolution operation augmented with weighted adjacency matrices and an attribute-augmented unit is employed to represent the spatiotemporal correlations of the monitoring network, and its outputs are used as the model inputs.

The architecture of the model is shown in Figure 7. The upper part shows the process of spatiotemporal displacement prediction. The GCN module is used to model spatial dependencies, while the GRU module is used to model temporal dependencies. A fusion layer is implemented to incorporate the extracted features from both the space and time domains. The model-predicted displacement can be represented as *f*(*A*, *S*), with *A* the weighted adjacency matrix and *S* the attribute-augmented matrix.

The lower part (marked by a dashed box) gives the specific structure of a T-GCN cell. In each model cell, $h_{t-1}$ denotes the output at time $t-1$, $gc(\cdot)$ is the graph convolution process, $u_t$ and $r_t$ are the update gate and reset gate at time $t$, and $h_t$ denotes the output at time $t$. The calculation process of the spatiotemporal displacement prediction is shown below:

$$u_t = \sigma\left(W_u \cdot \left[ gc(S_t, A),\, h_{t-1} \right] + b_u\right) \tag{6}$$

$$r_t = \sigma\left(W_r \cdot \left[ gc(S_t, A),\, h_{t-1} \right] + b_r\right) \tag{7}$$

$$c_t = \tanh\left(W_c \cdot \left[ gc(S_t, A),\, (r_t * h_{t-1}) \right] + b_c\right) \tag{8}$$

$$h_t = u_t * h_{t-1} + (1 - u_t) * c_t \tag{9}$$

where $\sigma(\cdot)$ and $\tanh(\cdot)$ represent the sigmoid and hyperbolic tangent activation functions, respectively; $W$ and $b$ stand for the weights and biases learned in the training process; and $*$ denotes element-wise multiplication.
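Equations (6)–(9) can be illustrated with a minimal scalar sketch, where `x` stands in for the graph-convolved input $gc(S_t, A)$ and the weight names in the parameter dictionary are hypothetical:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tgcn_cell(x, h_prev, p):
    """One step of Equations (6)-(9) in scalar form (toy weights in dict p)."""
    u = sigmoid(p["Wu_x"] * x + p["Wu_h"] * h_prev + p["bu"])          # update gate, Eq. (6)
    r = sigmoid(p["Wr_x"] * x + p["Wr_h"] * h_prev + p["br"])          # reset gate,  Eq. (7)
    c = math.tanh(p["Wc_x"] * x + p["Wc_h"] * (r * h_prev) + p["bc"])  # memory,      Eq. (8)
    return u * h_prev + (1.0 - u) * c                                  # output,      Eq. (9)
```

With all weights and biases set to zero, both gates evaluate to 0.5 and the memory content to 0, so the cell simply halves the previous state; this makes the gating behaviour easy to verify by hand.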

**Figure 7.** The overall process of spatiotemporal prediction.

#### 2.3.4. Evaluation Metrics of the Prediction

To evaluate the model performance, we introduce three evaluation indicators, namely the mean absolute error (*MAE*), the mean absolute scaled error (*MASE*), and the root mean square error (*RMSE*) [31]. These metrics are widely used in regression tasks and are defined as follows:

$$MAE = \frac{1}{n} \sum_{t=1}^{n} \left| Y_t - \hat{Y}_t \right| \tag{10}$$

$$MASE = \operatorname{mean}\left( \left| \frac{e_j}{\frac{1}{n-1} \sum_{t=2}^{n} \left| Y_t - Y_{t-1} \right|} \right| \right) \tag{11}$$

$$RMSE = \sqrt{\frac{1}{n} \sum_{t=1}^{n} \left( Y_t - \hat{Y}_t \right)^2} \tag{12}$$

where $n$ is the length of the time series, $Y_t$ represents the actual measurement, $\hat{Y}_t$ denotes the predicted value, and $e_j = Y_j - \hat{Y}_j$ indicates the forecast error for a given period $j$ (the number of forecasts). *MAE* reflects the absolute error of the prediction result. *MASE* has the favourable property of scaling the forecast errors by the in-sample naive forecast error, so it can be used to compare forecasts across data sets with different scales. *RMSE* more accurately reflects the similarity between the predicted and the observed sequences.
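Equations (10)–(12) can be implemented directly; a plain-Python sketch:

```python
import math

def mae(y, y_hat):
    """Mean absolute error, Equation (10)."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    """Root mean square error, Equation (12)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def mase(y, y_hat):
    """Mean absolute scaled error, Equation (11): the MAE scaled by the
    in-sample mean absolute error of the naive one-step forecast."""
    scale = sum(abs(y[t] - y[t - 1]) for t in range(1, len(y))) / (len(y) - 1)
    return mae(y, y_hat) / scale
```

Because the MASE denominator is the error of the naive forecast on the actual series, a MASE below 1 means the model beats the naive forecast regardless of the data scale.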

#### **3. Experiments and Results**

#### *3.1. Analysis of the Spatiotemporal Correlation*

Generally, the deformation of a landslide varies spatially at different parts of the landslide body (e.g., [19,20]). The spatiotemporal correlation of the monitoring sites in the GNSS network therefore shows a strong location dependence. This section analyzes the measurements at each monitoring site to determine their spatiotemporal relations. To do so, we introduce grey relational analysis (GRA) to estimate the correlation between monitoring points [19]; a grey relational degree (GRD) greater than 0.6 is considered to denote a close correlation.
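The GRD computation can be sketched as follows; the distinguishing coefficient ρ = 0.5 is the conventional default, and the series are assumed to be pre-normalised (the exact GRA variant used in the paper may differ):

```python
def grey_relational_degree(ref, cmp, rho=0.5):
    """Grey relational degree between a reference series and a comparison
    series; assumes the two series differ somewhere (so max delta > 0)."""
    deltas = [abs(a - b) for a, b in zip(ref, cmp)]
    dmin, dmax = min(deltas), max(deltas)
    # grey relational coefficient at each time step
    coeffs = [(dmin + rho * dmax) / (d + rho * dmax) for d in deltas]
    # the GRD is the mean coefficient over the series
    return sum(coeffs) / len(coeffs)
```

The closer the result is to 1, the more similar the two displacement series; values above 0.6 are read as a close correlation in this paper.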

For the Shuping landslide, all monitoring stations (Figure 2) are deployed on the active block. As shown in Table 1, the GRD values of all neighbouring connections are greater than 0.6, implying a strong spatial–temporal correlation among them. We select two pairs of adjacent monitoring points (ZG85 with SP-2 and ZG85 with ZG87) and plot the observed deformation time series. The results are illustrated in Figure 8a,b, showing the strong consistency of each pair. The calculated GRD values are 0.8 and 0.87, respectively, suggesting a strong spatiotemporal correlation between them. These GNSS observations on the east block (Figures 2 and 4) show that the landslide deforms markedly at the local scale, indicating an unstable state. Still, this does not mean that the whole landslide is moving, or that failure is likely, unless the landslide has already entered an accelerated deformation stage with increasing velocity.


**Table 1.** Grey relation analysis results of the Shuping landslide.

**Figure 8.** (**a**–**d**) Deformation over time of adjacent monitoring points of different pairs.

For the Baishuihe landslide, two stations, namely ZG93 and ZG118, are installed in the more active block, with the remaining monitoring stations in the other block (Figure 2). It can be seen from Figures 2 and 8c that the monitoring sites located in the more active block exhibit almost identical displacement trends (GRD = 0.74). At the monitoring sites in the other block, the measured displacements are relatively small, fluctuating within 20 mm per month. The GRD between these sites is larger than 0.6, reflecting a remarkable similarity in displacement trends. Generally, the correlation decreases with increasing distance. However, for stations located in different blocks (e.g., ZG92 and ZG93), the monitored displacements still show similarity in local features (Figure 8d), with a GRD of 0.54 (Table 2).


**Table 2.** Grey relation analysis results of the Baishuihe landslide.

In summary, the spatiotemporal correlations of the monitoring sites in the GNSS networks are medium to strong. The grey relational analysis (GRA) results also show a strong location dependence, consistent with the results calculated by the Gaussian similarity function. This confirms that landslide displacement prediction should consider the spatiotemporal relationship between monitoring points.

#### *3.2. Model and Parameter Setting*

#### 3.2.1. Model Inputs

In the experiments, the monthly GNSS displacement measurements (from July 2003 to March 2013 for the Baishuihe landslide and from September 2007 to May 2015 for the Shuping landslide), together with the associated daily reservoir water level and precipitation, are used; all are normalized to the interval [0, 1] using max–min normalization. These datasets are further divided into a training dataset and a test dataset. The training dataset (from July 2003 to August 2011 for the Baishuihe landslide and from September 2007 to March 2014 for the Shuping landslide) is taken as the input in the training process, and the remaining data are used to evaluate the performance of the proposed method.
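The max–min normalization used here maps each series to the interval [0, 1]; a minimal sketch (assuming the series is not constant):

```python
def max_min_normalize(series):
    """Scale a series to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(series), max(series)
    return [(x - lo) / (hi - lo) for x in series]
```

Each input series (displacement, reservoir water level, precipitation) would be normalized independently before being fed to the model, so that features on very different physical scales contribute comparably.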

For the input layer, the training dataset of a single GNSS displacement time series can be denoted as $D_m = \{d_1, d_2, \ldots, d_m\}$, where $m$ represents the length of the time series.

The division of samples from each training dataset is shown in Figure 9. A sliding window with window length $l$ and step size $n$ is used. The length of each obtained sample is thus $l$ ($2 \le l < m$), represented by $D_{train} = \{d_{m-l}, d_{m-l+1}, \ldots, d_{m-1}\}$, with the last $y$ values ($1 \le y < l$) serving as the label and the remaining $l - y$ values used as the sample input.

**Figure 9.** The sample division of a single time series.

In this paper, the sliding window size is set to 6, taking into account the typicality and quantity of the training samples; this corresponds to the half-cycle of the external factors and facilitates the recurrent layer in capturing the temporal dynamics. The first five values of each window are the input sample and the last one is the label. As suggested by [32], the model error increases as the value of $y$ increases; thus, we set it to 1. The resulting numbers of training samples are 93 and 74 for the two landslides, respectively. The test datasets are treated in the same way.
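The sliding-window sample division (window length 6, label length y = 1) can be sketched as:

```python
def make_samples(series, window=6, label_len=1):
    """Slide a window over the series; within each window, the last
    `label_len` values are the label and the rest are the model input."""
    samples = []
    for start in range(len(series) - window + 1):
        chunk = series[start:start + window]
        samples.append((chunk[:-label_len], chunk[-label_len:]))
    return samples
```

Assuming one sample per window position, a 98-month training series yields 98 − 6 + 1 = 93 windows, matching the sample count quoted above.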

For the Baishuihe landslide, a 7 × 7 weighted adjacency matrix $A_w$ is first constructed using the Gaussian similarity function, based on the spatial proximity of the deployed GNSS monitoring stations (Section 2.2.1). Then, a feature matrix $X$ with a size of 7 × 117 is constructed to represent the temporal displacement of each station: the number of rows equals the number of stations, and the number of columns equals the length of the measured displacement time series. In the same way, the dimensions of $A_w$ and $X$ for the Shuping landslide are 6 × 6 and 6 × 93, respectively.
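A weighted adjacency matrix built from a Gaussian similarity function can be sketched as follows; the exact kernel form and the bandwidth σ used in Section 2.2.1 are assumptions here, and a zero diagonal is used since self-connections are added later via Ã = A + I:

```python
import math

def gaussian_adjacency(coords, sigma):
    """W_ij = exp(-dist(i, j)^2 / (2 * sigma^2)) for i != j, 0 on the
    diagonal; `coords` is a list of (x, y) station positions."""
    n = len(coords)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = (coords[i][0] - coords[j][0]) ** 2 + (coords[i][1] - coords[j][1]) ** 2
                A[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))
    return A
```

Nearby stations thus receive weights close to 1 and distant stations weights close to 0, mirroring the distance-decaying correlation observed in the GRA results.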

As illustrated in Section 2.1, for both landslides, the fluctuation of the reservoir water level and varying precipitation are the two main external factors influencing landslide behaviour. We introduce an attribute-augmented unit that integrates features of the displacements, the seasonal rainfall, and the water level fluctuation to represent the effect of the external influencing factors on landslide deformation. The augmented matrix, together with the weighted adjacency matrix, is incorporated into the forecast model to enhance the extraction of realistic spatial–temporal dependency, and the derived results are used as the model inputs.

#### 3.2.2. Model Parameters and Settings

Our proposed model has four hyper-parameters: the learning rate, the number of training epochs, the number of hidden units, and the batch size. In the experiment, we empirically set the learning rate to 0.001 and the batch size to 32 [33]. However, the numbers of training epochs and hidden units are two crucial parameters that may affect the prediction precision and are therefore determined by designed comparison experiments. The ReLU is employed as the activation function for each convolutional layer, and the Adam optimizer is used to minimize the loss function (Equation (13)).

$$\text{loss} = \frac{1}{n} \sum_{t=1}^{n} \left( Y_t - \hat{Y}_t \right)^2 \tag{13}$$

where $n$ is the time series length, $Y_t$ represents the actual measurement, and $\hat{Y}_t$ denotes the predicted value.

Comparison experiments for selecting the optimal hyper-parameters are performed in two stages. First, the number of hidden units is fixed at 64, and the prediction precision is analyzed for numbers of training epochs in {100, 250, 500, 1000, 1500, 2000}. Figure 10 shows the variation of the metrics with different training epochs (horizontal axis: number of training epochs; vertical axis: metric values). When the number of training epochs equals 1000, the metrics reach their minimum and the model attains its optimal performance.

Accordingly, the number of training epochs is then fixed at 1000, and the prediction precision is analyzed for numbers of hidden units in {8, 16, 32, 64, 100, 128}. Figure 11 shows the variation of the metrics with different numbers of hidden units. When the number of hidden units equals 64, the metrics reach their minimum. Consequently, in the following experiments, the number of training epochs and the number of hidden units are set to 1000 and 64, respectively.

**Figure 10.** The variation of metrics with different training epochs.

**Figure 11.** The variation of metrics with different hidden units.
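The two-stage selection above reduces to picking, for each hyper-parameter, the candidate value that minimises the validation metric; the sketch below uses a stand-in `evaluate` function, whereas a real run would train and validate the model for each candidate:

```python
def pick_best(candidates, evaluate):
    """Return the candidate with the lowest validation error."""
    return min(candidates, key=evaluate)

# toy stand-in for a full train-and-validate run: pretend the validation
# RMSE is smallest at 1000 epochs, as observed in Figure 10
epochs = pick_best([100, 250, 500, 1000, 1500, 2000], lambda e: abs(e - 1000))
```

The second stage would then call `pick_best` over the hidden-unit candidates with `epochs` held fixed at the value found in the first stage.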

#### *3.3. Predicted Results and Analysis*

#### 3.3.1. Predicted Results Using the GC-GRU-N

To demonstrate the effectiveness of the proposed model, we compare the GC-GRU-N model with four classical prediction models, namely MLR, ARIMA, SVR, and LSTM, at the two study sites. This section also conducts a comparative analysis using the temporal graph convolutional network (T-GCN) without attribute augmentation to verify the enhancement provided by the attribute-augmented graph convolution (GC) operations. We evaluate the GC-GRU-N model from two aspects: prediction performance and modelling time.

As the above-mentioned classical prediction models can only realize single time-series prediction, their predictions only reflect the displacement behaviour of a single monitoring station. Thus, for a GNSS network with $m$ stations, the classical models need to be run $m$ times separately to obtain the displacement forecasts of all stations. The GC-GRU-N instead utilizes a feature matrix $X_{m \times n}$ (Section 3.2.1) to represent the displacement over time of each station, predicting the displacement of the entire monitoring network at once.

The predicted results of the Baishuihe landslide and the Shuping landslide by the proposed model are shown in Figures 12 and 13, respectively. The predictions for each monitoring station are consistent with the actual observations as a whole. According to Figures 12 and 13, the measurements of several monitoring stations show an abrupt transition in September each year; a more significant prediction error arises at this abrupt state, with maxima of 16.66 mm and 30.35 mm, respectively. The maximum error does not exceed 10 mm for the rest of the year. This could be because fewer samples are available for the mutation state than for the other states, since a monthly prediction time scale is used due to data acquisition limitations. Generally speaking, as the number of samples for the mutation state increases, e.g., with daily-scale displacement, the model's errors gradually decrease. In addition, since the GCN captures spatial features by constantly moving a smooth filter in the Fourier domain, it might also smooth out the peaks.

**Figure 12.** Comparison diagram of the overall prediction of the Baishuihe landslide displacement; the vertical purple line represents the absolute error.

**Figure 13.** Comparison diagram of the overall prediction of the Shuping landslide displacement; the vertical purple line represents the absolute error.

#### 3.3.2. Comparative Experiments

The performance of the forecast models is shown in Table 3. Our proposed model outperforms the other five models in terms of the three evaluation indicators in both study areas (Table 3). The errors of the MLR, ARIMA, and SVR models are relatively large, resulting in poor prediction performance. The LSTM, as a neural network model, performs better than the traditional machine learning model (SVR) and the time series models (MLR and ARIMA). Compared with the LSTM model, the GC-GRU-N and T-GCN models can better describe the displacement trend because their structures capture the spatial features of the monitoring network. Consequently, the prediction accuracy of the GC-GRU-N is effectively enhanced. For the Baishuihe landslide, the three metrics of the GC-GRU-N are an *MAE* of 3.682 mm, a *MASE* of 0.477, and an *RMSE* of 4.429 mm.



In the following discussion, we use the *RMSE* as the primary metric of model performance. The RMSEs of the GC-GRU-N and T-GCN models are reduced by approximately 64% and 50%, respectively, compared with the MLR model, and by around 56% and 38% compared with the ARIMA model. Compared with the SVR model, the RMSEs of the GC-GRU-N and T-GCN models are reduced by 58% and 41%, respectively, and compared with the LSTM, by 47% and 41%. The T-GCN model is less effective than the GC-GRU-N model because the T-GCN considers only the spatial features and ignores the impact of the external factors on landslide displacement.

In terms of computation time, the T-GCN is the most efficient of all tested models, requiring only 19.93 s (Table 3). The proposed GC-GRU-N ranks second, taking 44.88 s, followed by the LSTM model at 229.936 s. The GC-GRU-N is slower than the T-GCN because it needs an additional unit to represent the effects of the triggering factors during the convolution operation. The SVR takes 349.971 s, higher than the LSTM. In contrast, the modelling times of the MLR and ARIMA models are much longer than those of the other methods presented in this paper, requiring 986.435 s and 0.534 h, respectively. In summary, the GC-GRU-N is highly efficient considering its high accuracy among the advanced models.

The results for the ZG93 station installed on the Baishuihe landslide and the ZG85 station deployed on the Shuping landslide are depicted in Figure 14. The predictions of the proposed model are consistent with the actual deformation trend and superior to the other methods as a whole. Despite sometimes overreacting to rapid decreases and underestimating abrupt increases, our model outperforms all the other time series forecast models at both landslides. This result indicates that the graph convolution scheme with spatial correlation consideration can efficiently capture the dynamics of landslide monitoring.

**Figure 14.** Comparison results of the proposed model and five other models. Results of ZG93 installed on the Baishuihe landslide are shown in figure (**a**,**b**), and that of ZG85 deployed on the Shuping landslide are shown in figure (**c**,**d**).

Specifically, MLR and ARIMA, as statistical methods, can also depict the variation trend of the landslide displacement, but with more significant overestimation or underestimation errors. Among the machine-learning-based models, the LSTM model is more efficient and shows more promising results than the SVR model, especially in predicting displacement around transition states. However, sudden rapid changes in the evolution may increase the model's errors, which could be due to the fewer samples available for the mutation state.

The T-GCN and GC-GRU-N models, which capture both spatial and temporal features, achieve more promising time-series forecasts. The T-GCN model gives a lower prediction accuracy because it considers only the spatial features and ignores the external factors impacting landslide displacement. In summary, the GC-GRU-N, as a spatiotemporal model, is highly efficient and accurate compared with the other models in landslide displacement forecasting.

#### 3.3.3. Ablation Experiment and Analysis

An ablation study is used to demonstrate the importance of attribute enhancement, i.e., the attribute-augmented unit in the forecast model, for improving model performance. We design the ablation experiment as follows: consider only the rainfall factor, only the reservoir water level effect, and both factors together (Figure 15). Table 4 shows the results for the Baishuihe landslide, with the T-GCN representing the model without an attribute-augmented unit. According to Table 4, the performance gains from using an attribute-augmented unit are apparent.

**Figure 15.** Comparison results of the ablation experiments considering only the rainfall factor, only the reservoir water effect, and both factors together. The solid black line represents the original value.


**Table 4.** Ablation Experiments with different settings.

Note: R.w.l is the reservoir water level.

We again use the *RMSE* as the primary metric of model performance. When the experiment considers the rainfall factor alone, the reservoir water level factor alone, and both factors together, the RMSE values of the proposed model are 4.442 mm, 4.434 mm, and 4.429 mm, respectively, all of which are lower than that of the T-GCN (6.183 mm). The ablation experiment thus demonstrates the effectiveness of assembling the external inducing factors in the graph convolutional network; the best performance in all indicators is achieved when both factors are considered simultaneously. As shown in Figure 15, the predictions considering both triggering factors are consistent with the actual deformation trend, superior to the other two scenarios, and still valid around transition states, including rapid decreases and abrupt increases.
