**1. Introduction**

Drought is one of the most widespread and costly natural disasters that not only affects agricultural and livestock production but also leads to a series of ecological and socioeconomic problems [1,2]. Globally, dry areas are increasing at a rate of approximately 1.74% per decade from 1950 to 2008 [3]. The Tibetan Plateau (TP) is known as the "Asian Water Tower". Nevertheless, the areas of the arid and semiarid regions of the TP account for 23% and 44%, respectively [4], and approximately 62% of the plateau area is covered by

**Citation:** Cheng, M.; Zhong, L.; Ma, Y.; Wang, X.; Li, P.; Wang, Z.; Qi, Y. A New Drought Monitoring Index on the Tibetan Plateau Based on Multisource Data and Machine Learning Methods. *Remote Sens.* **2023**, *15*, 512. https://doi.org/10.3390/ rs15020512

Academic Editors: Gabriel Senay and Xianjun Hao

Received: 6 December 2022 Revised: 1 January 2023 Accepted: 12 January 2023 Published: 15 January 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

alpine meadows and grasslands [5]. The ecosystems over the TP are fragile due to their high elevation and unique geographical location. Meanwhile, the TP is highly susceptible to global climate change. Zhong et al. [6] noted that the air temperature over the TP has been 1.27 ◦C higher than normal since 2014. The average increase was 2.2 times the global average (0.57 ◦C). Overall, it easily suffers from drought under the combined effects of climate change and fragile ecosystems. Gao et al. [7] calculated the ratio of precipitation to potential evapotranspiration (P/PET) at 83 stations in the TP between 1979 and 2011 and found that the eastern TP was becoming drier. Wang et al. [8] investigated the plateau drought variation based on the self-calibrating Palmer drought severity index (scPDSI) from 1961 to 2009 and revealed that the southern TP experienced a wetting trend even though the northern TP became significantly drier, particularly in spring and autumn. Feng et al. [9] calculated the standardized precipitation evapotranspiration index (SPEI) using data from 274 meteorological stations over the TP during 1970–2017, indicating that severe drought frequency in winter and drought risk in summer showed an increasing trend. According to the statistics of the China Meteorological Administration, drought is the most dominant meteorological disaster among all meteorological disasters over the TP [10]. Therefore, a comprehensive understanding of drought characteristics on the TP is of great importance for drought early warning, prevention, and mitigation.

In general, a universal objective definition of drought is impractical and does not exist without knowledge of the climatologically expected values for the availability of stored water for a given need [11]. Wilhite and Glantz in 1985 [12] classified droughts into four categories: meteorological droughts, agricultural droughts, hydrological droughts, and socioeconomic droughts, which have been widely recognized by the scientific community [13]. Recently, some new drought types were proposed, such as ground water droughts [14], ecological droughts [15], agroecological droughts [16], and environmental droughts [17]. The aridity index, defined as the ratio between precipitation and evapotranspiration, is usually used to classify climatic zones and monitor drought [18]. The aim of this study is to develop a comprehensive index characterizing the complex drought conditions affected by climatic, hydrometeorological, and environmental factors. Accordingly, the drought index was adopted instead of the aridity index. To objectively quantify the onset, intensity, and spatial extent of drought, more than one hundred drought indices have been developed thus far. Among these indices, the Palmer drought severity index (PDSI) [19], the standardized precipitation index (SPI) [20], and the standardized precipitation evapotranspiration index (SPEI) [21] are the most popular and widely used drought indices. However, most of these indices are developed and evaluated on a monthly time scale. Although some daily drought indices, such as the standardized drought and flood potential index (SDFPI) [22] and drought potential index (DPI) [23], have been developed recently, drought indices with high resolution are generally scarce and need to be further investigated. With an average elevation of approximately 4000 m, the TP has the largest frozen soil zone in the mid-latitudes. In the freeze-thaw process, especially for the seasonal transitional period, the soil water phase and energy budget have dramatic changes at the daily temporal scale, which can affect the soil-vegetation-atmosphere interaction. Drought indices with high resolution have been expected to reflect this variation [24]. On the other hand, drought indices with high resolution can provide valuable drought information, such as the onset, end, and duration of drought, which are capable of guiding vulnerable agricultural and livestock production over the TP. In addition, some of these indices are based on a single variable, such as the vegetation condition index (VCI) [25] and temperature condition index (TCI) [26], which mainly reflect one specific aspect of drought. In terms of agricultural drought related to meteorology, soil, and vegetation systems, they cannot adequately capture the complex features of drought evolution. Integrating multiple drought-related variables and indices is an effective method for addressing this issue [27]. For example, Huang et al. [28] constructed an integrated drought index (IDI) based on precipitation, runoff, and soil moisture using the entropy weight method. It is an objective method for weight determination and gives soil moisture a low weight, causing the insensitivity of IDI

to agricultural drought. Lu et al. [29] developed the integrated scaled drought index (ISDI) based on precipitation, the normalized difference vegetation index (NDVI), soil moisture, and land surface temperature. ISDI is a linear combination of four drought factors. It cannot reflect the nonlinear relationships between hydrometeorological factors and drought. Previous studies have demonstrated that these integrated drought indices improved the capacity of drought monitoring. To date, there are three types of fusion methods: linear combination [30], copula-based methods [31,32], and machine learning (ML). Hao and Singh [33] noted that the former two approaches may suffer from the linearity assumption. In comparison, the ML approach has a strong capability of extracting target information from a large amount of random, noisy data and capturing the nonlinear characteristics of physical processes. It has therefore recently been favored by many scholars for the construction of drought indices. For example, Liu et al. [34] proposed an integrated agricultural drought index (IDI) based on remote sensing data and the backpropagation (BP) neural network, and it can effectively monitor drought events on the North China Plain.

Therefore, the objectives of this study are to (1) compare daily integrated drought indices developed by the four ML methods, namely, random forest regression (RFR), k-nearest neighbor regression (KNNR), support vector regression (SVR), and extreme gradient boosting regression (XGBR), based on multisource remote sensing and reanalysis data with the in situ soil moisture and obtain the optimal drought index as the new standardized integrated drought index (SIDI); (2) evaluate the SIDI performance in dry and wet changes against SPI-6, SPEI-6, and European Centre for Medium-Range Weather Forecasts Reanalysis 5 Land (ERA5-Land) soil moisture; and (3) assess the spatiotemporal applicability of SIDI for a typical drought event. The SIDI is expected to monitor plateau droughts with more detail and accuracy for agricultural water resource management.

### **2. Materials and Methods**

### *2.1. Materials*

In situ soil moisture data are obtained from the time-lapse observation dataset of soil temperature and humidity on the Tibetan Plateau from 2015 to 2020 [35–38] and a long-term dataset of integrated land–atmosphere interaction observations on the Tibetan Plateau from 2008 to 2016 [39]. The data were downloaded through the National Tibetan Plateau Data Center (https://data.tpdc.ac.cn/ (accessed on 20 April 2021)). The locations of the field observation stations can be found in Figure 1. These stations over different climates and underlying surface conditions are representative. Soil moisture data in units of percent with an hourly temporal resolution were converted to m3/m3 and further averaged to daily means. Soil moisture has been one of the most direct indicators of agricultural drought. Approximately 62% of the plateau area is covered by alpine meadows and grasslands with shallow root systems [5]. Therefore, the depth of soil moisture data used in this study was 10 cm. Moreover, drought during the critical stage of vegetation growth is more destructive to agricultural production. Hence, the soil moisture data collected at a 10 cm depth during the growing season (from May to October) were selected as the "ground truth" for assessing the performance of drought indices [34,40].

To develop the integrated drought index, we comprehensively consider the overall effects of the meteorological conditions, vegetation information, soil properties, and topographic environment. Thirteen variables in total were selected as predictor variables and summarized as follows [34,41]: four near-surface meteorological elements, including 2 m air temperature (TEMP), specific humidity (SHUM), 10 m wind speed (WIND), and precipitation rate (PREC), are provided by the China meteorological forcing dataset (CMFD) with a temporal resolution of three hours and a spatial resolution of 10 km (http://poles.tpdc.ac.cn (accessed on 20 April 2021)). Land surface temperature (LST) was acquired by a daily 1 km all-weather land surface temperature dataset for western China from 2000 to 2021 (http://data.tpdc.ac.cn (accessed on 15 July 2022)). Evaporation (EVAP) is provided by the ERA5-Land hourly reanalysis dataset at a spatial resolution of 10 km from 1950 to the present (https://cds.climate.copernicus.eu (accessed on 12

July 2019)). The fraction of absorbed photosynthetically active radiation (FAPAR) was obtained from the daily global QA4ECV FAPAR product at 5 km × 5 km during 1982–2016 (http://www.qa4ecv-land.eu (accessed on 16 February 2018)). Five soil characteristic data are provided by the SoilGrids 250 m 2.0 product, which includes bulk density (BDOD), clay, silt, sand, and soil organic carbon (SOC) at six depths: 0–5 cm, 5–15 cm, 15–30 cm, 30–60 cm, 60–100 cm, and 100–200 cm (https://soilgrids.org (accessed on 4 May 2020)). The 5–15 cm soil depth characteristic data were chosen as the input variables. Digital elevation model (DEM) data at 1 km are derived from the Resource and Environment Science and Data Center (https://www.resdc.cn (accessed on 1 September 2008)). The final resolution of all input data is 10 km × 10 km by using bilinear interpolation.

**Figure 1.** Distribution of field observation stations: (**a**) TP, (**b**) Pali, and (**c**) Naqu. The contour color represents different elevations. Pentagrams represent the station locations.

SPI-6 and SPEI-6 represent six months of rainfall and evaporation anomalies. Zhao et al. investigated the correlations of SPI and SPEI at different timescales (1, 3, 6, and 9 months) with NDVI at 33 stations around the Gannan region in the eastern TP and found that SPI-6 and SPEI-6 have good correlations with NDVI. SPI-6 and SPEI-6 are suitable for monitoring the drought conditions of alpine meadows [42]. Therefore, referring to McKee et al. [20] and Vicente-Serrano et al. [21], the SPI-6 and SPEI-6 were calculated. SPI-6, SPEI-6, and the

ERA5-Land hourly volume of water (m3/m3) at the 7–28 cm soil layer (ERA5-Land SM) (https://cds.climate.copernicus.eu/ (accessed on 12 July 2019)) were chosen as contrasts to evaluate the performance of the SIDI over the entire TP. The datasets used in this study are shown in Table 1.

**Table 1.** Datasets used in this study.


*2.2. Methods*

### 2.2.1. Machine Learning Models

RFR is an ensemble model that consists of multiple decision trees [43]. "Random" means that the input of each tree is randomly extracted from the training dataset, and a subset of features at each tree node is randomly selected from the available features to expand the tree. The model's final output is calculated as the average of predictions created by all individual trees. Consequently, RFR decreases the overall variance and avoids overfitting. For both small sample sizes and high-dimensional data, RFR captures nonlinear relationships between features and target variables and hence provides reliable results [34,44].

SVR is a supervised machine learning algorithm. SVR employs the kernel function where the input feature is projected into a high-dimensional feature space for building the optimal hyperplane to regress the training dataset with the minimum loss [45]. The performance of SVR depends on the proper selection of the kernel function. Many applications have demonstrated that the Gaussian radial basis function (RBF) is an excellent kernel function in SVR. In this study, we used SVR with the RBF kernel function.

KNNR is a nonparametric model. Its major advantage is its simplicity and efficiency. Given a data point, KNNR searches for the closest K data points based on the distance between that point and the remaining points in the training dataset. The model finally outputs the average of the target predictions for these K neighbors [46]. The target predictions of K neighbors are equally weighted in the KNNR we used.

XGBR is an advanced ensemble model designed by Chen et al. [47] on the basis of a gradient boosting machine (GBM). Compared with the traditional GBM, XGBR implementation adopts a regularized boosting technique and parallel processing, which help to reduce overfitting and speed up. Therefore, XGBR is a powerful machine learning model, especially when speed and accuracy are taken into consideration.

The detailed technical framework for the development of SIDI can be found in Figure 2.

**Figure 2.** The technical framework for the development of SIDI. Green boxes represent the input data; yellow boxes represent the machine learning models; red boxes represent the model output; blue boxes represent the validation processes.

### 2.2.2. Statistical Indicators

The root mean square error (RMSE), mean absolute error (MAE), and coefficient of determination (R2) were calculated to evaluate the performance of the ML models, as defined below [48,49].

$$\text{RMSE} = \sqrt{\frac{\sum\_{i=1}^{N} (P\_i - O\_i)^2}{N}} \tag{1}$$

$$\text{MAE} = \frac{\sum\_{i=1}^{N} (|P\_i - O\_i|)}{N} \tag{2}$$

$$\mathcal{R}^2 = 1 - \frac{\sum\_{i=1}^{N} \left(P\_i - O\_i\right)^2}{\sum\_{i=1}^{N} \left(P\_i - \overline{P}\right)^2} \tag{3}$$

where *N* is the number of observations. *Pi* and *Oi* are the predictions and observations, respectively. *P* represents the average of the predictions.

### 2.2.3. Shapley Additive Explanation

SHAP was first proposed by Shapley in 1953 to calculate the contribution of each player and allocate the value created by them in a collaborative game [50]. Lundberg and Lee in 2017 first introduced the SHAP to explain the output of machine learning models regarded as black boxes [51]. The SHAP value can be used to interpret individual machine learning predictions. The main idea of the SHAP value is to obtain the marginal contribution across all the possible permutations of the features and then take the average. The expression of the SHAP value is as follows [52]:

$$\mathcal{Q}\_{\mathbf{i}} = \sum\_{s \in N\{\mathbf{i}\}} \frac{|S|!(N-|S|-1)!}{n!} [\upsilon(S \cup \{\mathbf{i}\}) - \upsilon(S)] \tag{4}$$

where, ∅*<sup>i</sup>* is the contribution of feature *i*, *N* is the set of features, *n* is the number of features in *N*, *S* is the subset of *N* that contains feature *i*, and *v*(*N*) is the base value meaning the predicted outcome for each feature in *N* without knowing the feature values.

The sum of the SHAP value of each feature for each observation is considered the model outcome for each observation. Therefore, the explanation model g is formulated as follows:

$$\log(z') = \mathcal{O}\_0 + \sum\_{i=1}^{M} \mathcal{O}\_i z'\_i \tag{5}$$

where, *z* {0, 1}*<sup>M</sup>* and *M* is the number of features.

### **3. Results**

*3.1. Construction and Comparison of Drought Monitoring Index with In Situ Soil Moisture Measurements, SPI-6 and SPEI-6*

### 3.1.1. Construction of Drought Monitoring Index

To achieve an independent assessment of the performance of ML models, we used 70% of the input data (thirteen predictor variables and in situ SM) as the training dataset to construct the drought index, and the remaining 30% was employed as the validation dataset to assess the performance of the drought index. Moreover, min-max standardization was performed on the input data to eliminate the effects of dimensionality and accelerate the convergence speed. Regarding the in situ SM during the growing season as the ground truth, the drought indices were outputted by four ML regression models at the station scale. Figure 3 shows the comparisons of drought indices using four machine learning models with the in situ soil moisture in the training process. All 32,509 samples in the training dataset were used. The drought index based on the SVR model had a poor performance with a low R<sup>2</sup> of 0.66 and a high RMSE of 0.12. This may be because the RBF kernel function is not suitable for these complicated data sets. Another reason may be that SVR hyperparameters such as C and gamma are not good, although they have been optimized by the grid search. In contrast, the drought indices based on the RFR, XGBR, and KNNR models performed well, with R2 values of 0.86, 0.82, and 0.78; RMSE values of 0.08, 0.09, and 0.1; and MAE values of 0.06, 0.07, and 0.08, respectively. The results indicate that the three ML models have good application potential in the construction of the drought index in our study.

**Figure 3.** Comparisons of drought indices using four machine learning models with the in situ soil moisture in the training process: (**a**) RFR, (**b**) KNNR, (**c**) SVR, and (**d**) XGBR. Colors represent data density: the redder the color, the larger the data density is.

3.1.2. Comparison of Drought Monitoring Index with In Situ Soil Moisture Measurements, SPI-6 and SPEI-6

The prediction ability of the model is determined by the validation dataset because it was not involved in its construction. Hence, comparisons of drought indices using four machine learning models with in situ soil moisture were performed in the validation dataset (Figure 4). There are a total of 13,933 data points in the validation dataset. The result in the validation dataset for each model was slightly worse than that of the training dataset, indicating a high generalization level for each model due to the independence between the training and validation datasets [53]. In the validation process, the drought index based on the XGBR model outperformed that based on other models (R2 = 0.76, RMSE = 0.11, MAE = 0.08), followed by RFR (R<sup>2</sup> = 0.74, RMSE = 0.11, MAE = 0.08), KNNR (R2 = 0.73, RMSE = 0.11, MAE = 0.08), and SVR (R2 = 0.66, RMSE = 0.12, MAE = 0.1) [54,55]. The two ensemble models of XGBR and RFR achieve better accuracy and higher correlation compared with other models, suggesting the superiority of ensemble learning. The primary advantage of XGBR lies in the fact that a regularization term is added to the cost function to control the complexity of the model (regularization boosting technique). This technique reduces the variance and overfitting and makes the model simpler and faster [56]. The result confirms that the optimal drought index is the XGBR-based drought index, and thus, the XGBR model is used to construct the SIDI.

**Figure 4.** Comparisons of drought indices using four machine learning models with the in situ soil moisture in the validation process: (**a**) RFR, (**b**) KNNR, (**c**) SVR, and (**d**) XGBR. Colors represent data density: the redder the color, the larger the data density is.

Furthermore, the SIDI from 2000 to 2016 was obtained by inputting the spatial maps of thirteen predictor variables into the XGBR model. SIDI was switched to the monthly average, and then the spatial and temporal distributions of the monthly averaged SIDI were compared against ERA5-Land SM, SPI-6, and SPEI-6 in 2012, as shown in Figure 5. In terms of four variables, a smaller value corresponds to a drier area, and vice versa. In Figure 5a–d, the SM decreases gradually from the southeast to the northwest and exhibits obvious seasonal variability with a small (large) value in winter (summer). The SIDI has a similar spatial pattern and seasonal variability to the SM (Figure 5e–h). The spatial distributions of the SPI-6 and SPEI-6 are generally consistent, and there are some differences compared with the SM. For example, the SPI-6 and SPEI-6 display arid characteristics in the relatively humid southeastern region. This result suggests that they are not able to capture the dry and wet characteristics of the plateau well (Figure 5i–p). SPI-6 is formulated based on precipitation, ignoring the impact of temperature on drought. Considering this, the SPEI-6 is calculated based on precipitation and potential evaporation. However, previous studies found that the Thornth-waite algorithm failed to calculate the potential evapotranspiration when the average monthly temperature was below 0 ◦C, resulting in the poor applicability of SPEI-6 in arid and alpine regions such as the TP [57].

**Figure 5.** Comparisons of spatial and seasonal distributions of SIDI (**e**–**h**) against ERA5-Land SM at 7–28 cm ((**a**–**d**), units: m3/m3), SPI-6 (**i**–**l**), and SPEI-6 (**m**–**p**) in 2012. Colors represent drought degree: the redder the color, the stronger the drought degree is.

### *3.2. Drought Monitoring Performance for Typical Drought Events*

The drought monitoring ability of the SIDI was evaluated for a typical drought event. This event occurred in the Xizang Autonomous Region in May and June of 2009 [58,59]. SIDI classifications are divided into 9 levels using the percentile threshold method (Table 2) [57]. Figure 6 illustrates the spatial evolution characteristics of the drought process captured by the SIDI at an interval of nine days from 22 April to 1 July 2009. As shown in Figure 6a,b, an abnormal drought occurred over most regions of Xizang on 22 April, and then the drought intensity rapidly increased. In particular, the central part of Xizang exhibited extreme drought. Subsequently, the drought eased, whether in drought range or intensity, to a large extent on 10 May (Figure 6c). The drought area constantly expanded from the northwestern to southeastern regions of Xizang, with a higher drought magnitude on 19 May (Figure 6d). Thereafter, the drought continued to weaken (Figure 6e–h).



In addition, the temporal evolution characteristics of this drought event are depicted in Figure 7. There are two valleys (1 May and 18 May) in the SIDI. A valley indicates drought aggravation. Compared with the SM variation, the SIDI is more sensitive and detailed in capturing the key points of the drought process. In addition, with the increase in precipitation, the SIDI increased gradually, revealing that the drought had been relieved. Overall, the SIDI is capable of accurately describing the evolution process of drought events

in space and on a daily temporal scale, owing to its combination with the meteorological, vegetation, soil, and topographic environmental factors. Therefore, the SIDI is a reliable and comprehensive indicator for drought assessment.

**Figure 6.** Spatial evolution characteristics of the drought process captured by the SIDI in the Xizang Autonomous Region from 22 April to 1 July 2009 at an interval of nine days: (**a**) 22 April 2009, (**b**) 1 May 2009, (**c**) 10 May 2009, (**d**) 19 May 2009, (**e**) 28 May 2009, (**f**) 06 June 2009, (**g**) 15 June 2009, and (**h**) 24 June 2009. Colors represent drought degree: the redder the color, the stronger the drought degree is.

**Figure 7.** Temporal evolution characteristics of drought processes captured by SIDI in the Xizang Autonomous Region from 22 April to 01 July 2009, against ERA5-Land SM and CMFD PREC.

Furthermore, the SIDI and daily SPEI were adopted to identify drought characteristics at the BJ station, a representative station covering the alpine meadow on the central TP based on run theory [60] (Figure 8). A short-term drought event lasting 19 days was identified by the SIDI. However, the SPEI cannot identify it. It began on 26 March 2001, and lasted until 13 April 2001 (Figure 8a). The SIDI detected that the drought developed quickly and reached the intensity of extreme drought in several days. Thereafter, the drought eased slowly. Referring to the dry and wet classifications in Table 2, this drought event belonged to a moderate drought (I = 0.19). However, the SPEI shows wet conditions during this period (Figure 8b). The SIDI has a superior ability to identify drought information compared with other traditional indices, such as the SPEI.

**Figure 8.** Daily drought evolution of a drought event identified by the SIDI (**a**) and SPEI (**b**) at the BJ station based on run theory. D: drought duration; S: drought severity, which is the cumulative sum of drought conditions on D days; I: drought intensity, which can be calculated by dividing S by D. Red patterns represent dry conditions, and blue patterns represent wet conditions.

### *3.3. Importance of the Predictor Variables*

The performance of ML models has significant advantages for large data volumes with multiple predictor variables. Relationships between these predictor variables and model output are complicated and poorly identified due to the multiple predictor variables involved in the models and the black-box nature of ML models. How much did each predictor variable (feature value) contribute to the model output? To explain this, the SHAP was introduced into the XGBR model [61]. The average SHAP value of every feature and the SHAP value of every feature for every sample in the training dataset are presented in Figure 9. The color bar represents the feature value (red high, blue low). According to the averaged SHAP value, soil properties contributed 59.5% to the model output, followed by the meteorological (35.8%) and topographic environmental conditions (4.7%). The top three features that influence the model output are bulk density (BDOD), soil organic carbon (SOC), and silt, which are soil properties (Figure 9a). Moreover, the other two features, in addition to bulk density, have positive impacts on the model output (Figure 9b).

To interpret the interactions among predictor variables toward the model output, the BJ station covered by the representative alpine meadow on the central TP was chosen to apply the SHAP for different drought conditions (Figure 10). Figure 10 is the individual SHAP force plot, which includes three important characteristics: model output f(x), base value (the average of model outputs), and colors. The red color pushes the model output higher, whereas the blue color pushes the model output lower. For extreme and severe drought conditions, bulk density, land surface temperature, and soil organic carbon were the three major contributors to decreasing the model output (Figure 10a,b). In the case of moderate drought, abnormal drought, and nondrought conditions (Figure 10c–e), bulk density and soil organic carbon decreased

the model output, while specific humidity slightly increased it. Overall, bulk density (bdod) and soil organic carbon (soc) were two major contributors influencing the model output for all drought conditions, which is consistent with the result in Figure 9b.

**Figure 9.** Average SHAP value of every feature (**a**) and SHAP value of every feature for every sample (**b**) in the XGBR model. Soil properties include BDOD, SOC, SLIT, SAND and CLAY; meteorological conditions include LST, FAPAR, SHUM, EVAP, TEMP, WIND and PREC; and topographic environment includes DEM.

**Figure 10.** The individual force plots at the BJ station for different drought conditions: (**a**) extreme drought on 6 August 2006, (**b**) severe drought on 28 May 2013, (**c**) moderate drought on 11 June 2015, (**d**) abnormal drought on 27 June 2012 and (**e**) nondrought on 19 September 2012.

### **4. Discussion**

The aforementioned results showed that the XGBR-based SIDI can serve as an efficient drought index for TP drought monitoring. Traditional vegetation-based remote sensing indices at 8 days or 16 days are capable of monitoring agricultural drought. However, these vegetation-based indices usually identify drought characteristics through vegetation conditions, which reflect one specific aspect of drought. In fact, agricultural drought is related to meteorology, soil, and vegetation systems. In comparison, the new drought index, SIDI, comprehensively considered multiple factors. Nevertheless, detailed drought information, such as the onset, end, and intensity of drought, can be identified by the SIDI, which is capable of guiding vulnerable agricultural and livestock production, particularly for the growing season over the TP. The time span of the SIDI is from 2000 to 2016 and, thus, can be used to analyze drought change characteristics over a long time scale. However, there remain some issues with the construction of the SIDI. First, in the process of building the ML model, the model input is only from several stations in the central and western parts of the plateau. Moreover, some special underlying surfaces, such as deserts and glaciers, were not included in the model input. In the future, it is expected that more newly collected in situ data will be added to optimize the model, especially data at new stations where no data have been collected before. On the other hand, this new data can be used for model precision evaluation. Second, despite the development of SIDI, which considers multivariate factors, the mutual response relationship between factors is ignored. For example, the memory of soil moisture leads to a time lag effect between soil moisture and meteorological factors. Liu et al. [34] considered the lagging effect of NDVI on LST and precipitation changes in the newly developed integrated agricultural drought index (IDI). Qing et al. [62] constructed a comprehensive agricultural drought index (CADI) that comprehensively integrated the lagging times of soil moisture with precipitation and evapotranspiration. It is also worthwhile to further investigate whether the time lag effect will improve the prediction accuracy. In addition, many drought indices were developed under the assumption of a statistically stationary distribution of meteorological variables. In fact, the meteorological variables are not stationary due to the influences of climate change and human activities. Considering this, some nonstationary drought indices, such as the standardized nonstationary precipitation index (SnsPI) [63] and the nonstationary standardized runoff index (SRINS) [64] have been developed. Therefore, the nonstationarity of meteorological variables should be taken into account when optimizing the drought index. Third, the definition of the drought threshold level is a crucial step for drought severity categorization. However, drought studies currently focus on drought identification rather than categorization, resulting in various drought categorizations [65]. Fixed threshold levels and the percentile method are the two commonly used drought categorizations, which are not applicable to every region. Standardization of drought categorization is still an issue.

It is worth discussing the interesting finding revealed by the SHAP results: soil characteristics are more important than some meteorological variables in modeling drought. The climatological point is striking due to the negligence of soil in previous studies and deserves more attention. In addition, it remains unclear whether the inclusion of soil bulk density and soil organic content is what makes this model an improvement over others, which needs to be validated in future work.

### **5. Conclusions**

In this study, based on multisource remote sensing and reanalysis data, daily drought indices developed by four machine learning methods, including RFR, KNNR, SVR, and XGBR, were compared. The optimal drought index was selected as the SIDI. Furthermore, the drought monitoring ability of the SIDI was investigated based on the SPI-6, SPEI-6, ERA5-Land SM, and a typical drought event. In addition, the impact of predictor variables on the model output was also explored. The main conclusions are as follows.

(1) By comparing drought indices from four ML models with in situ SM data during the growing season, the drought index based on the XGBR model outperformed that based on other models (R<sup>2</sup> = 0.76, RMSE = 0.11, MAE = 0.08), followed by RFR (R<sup>2</sup> = 0.74, RMSE = 0.11, MAE = 0.08), KNNR (R2 = 0.73, RMSE = 0.11, MAE = 0.08), and SVR (R<sup>2</sup> = 0.66, RMSE = 0.12, MAE = 0.1). The result proves the superiority of the XGBR model, and this model is used to develop the SIDI.

(2) Compared with the spatial and seasonal distributions of SPI-6, SPEI-6, and ERA5- Land SM, the SIDI reflects the spatial characteristics of the plateau, which is dry in the northwest and humid in the southeast. It also depicts obvious seasonal variability, with large values in winter and small values in summer. For a typical drought event that occurred in the Xizang Autonomous Region in May and June of 2009, the SIDI accurately describes the evolution process of drought spatial evolution on a daily timescale, demonstrating its application potential in drought detection.

(3) Of the thirteen prediction variables, the contribution of 59.5% to model output was from soil properties, 35.8% was from meteorological conditions, and 4.7% was from the topographic environment. The top three variables that influence the model output are bulk density, soil organic carbon, and silt. Moreover, except for bulk density, the other two features have positive impacts on the model output. This suggests that soil information is an important factor affecting drought evolution, which should be taken into account in the construction of the drought index in the future.

**Author Contributions:** Conceptualization: M.C., L.Z. and Y.M.; methodology: M.C.; software: M.C. and L.Z.; validation: M.C., L.Z. and X.W.; formal analysis: M.C. and P.L.; investigation: X.W., Z.W. and Y.Q.; resources: L.Z.; data curation: M.C. and P.L.; writing—original draft preparation: M.C.; writing—review and editing: L.Z. and Y.M.; visualization: M.C. and Z.W.; supervision: M.C. and Y.Q.; project administration: L.Z.; funding acquisition: L.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** This study was supported by National Natural Science Foundation of China (Grant No. 41875031); the Second Tibetan Plateau Scientific Expedition and Research (STEP) Program, Ministry of Science and Technology of the People's Republic of China (Grant No. 2019QZKK0103); National Natural Science Foundation of China (Grant No. 41522501, 41275028, 42230610); CLIMATE-Pan-TPE in the framework of the ESA-MOST Dragon 5 Programme (Grant No. 58516).

**Data Availability Statement:** Datasets used in this study can be obtained according to the links in Table 1. The newly created SIDI data are available from the authors upon reasonable request as the data need further use.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
