2.1. Datasets
Although they lack fine spatial resolution, remote sensing products containing climatic data are now widely available [32]. For this study we used bioclimatic variables [33] because they are readily available as a collection in Google Earth Engine (Table 1).
Topographic parameters are key in describing species distribution across the landscape. This is particularly relevant for Socotra, where physiography plays a key role in creating different microclimates across the island [5]. We used the Shuttle Radar Topography Mission (SRTM) digital elevation data, at a resolution of 1 arc-second (approximately 30 m), to derive slope, elevation, and aspect for the distribution model [34]. To complement the selected variables, three soil-related variables were included to evaluate their potential contribution to increasing the model’s accuracy. Soil characteristics can also facilitate the stratification of potential restoration sites and allow for more tailored restoration practices and recommendations (Table 2).
To harmonize all environmental layers, we used bicubic interpolation neighbor embedding to interpolate data points and create a collection of “super high resolution” images for analysis (Supplementary Figure S1) [35]. This operation was performed using Google Earth Engine, and is available through the code in the Results section [36].
2.2. Data Analysis—Species Distribution
To model the potential distribution of Dracaena cinnabari on Socotra, we used Google Earth Engine to process the layers, create testing and training sample plots, train a Random Forest classifier, and display the accuracy and final results. Multicollinearity tests were performed using STATISTICA Version 12 [37].
Individual trees were identified using visual interpretation of satellite images, corrected with ground truthing, and then used to map the current distribution of Dracaena cinnabari on Socotra (Supplementary Figure S2) [6]. This database, containing 80,247 tree occurrences, was transformed into a categorical image of the presence and absence of Dracaena cinnabari trees [6]. The operation was performed by applying a 100 m buffer around each tree location (presence = 1) and mapping the remaining pixels as zero (absence = 0).
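The buffering step can be illustrated with a minimal sketch. It is not the Google Earth Engine workflow used in the study: it assumes tree locations in a projected coordinate system with units of meters, tests each cell center against the buffer, and uses a hypothetical function name and a hypothetical 100 m cell size.

```python
import math


def presence_absence_grid(trees, x_min, y_min, n_cols, n_rows,
                          cell_size=100.0, buffer_m=100.0):
    """Return a list of rows holding 1 (presence) or 0 (absence).

    A cell is marked as presence when its center lies within `buffer_m`
    meters of at least one tree. Coordinates are assumed to be projected
    (meters); the cell-center test and cell size are illustrative choices.
    """
    grid = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            # Center of the current cell
            cx = x_min + (c + 0.5) * cell_size
            cy = y_min + (r + 0.5) * cell_size
            near_tree = any(
                math.hypot(cx - tx, cy - ty) <= buffer_m for tx, ty in trees
            )
            row.append(1 if near_tree else 0)
        grid.append(row)
    return grid
```

For tens of thousands of trees a spatial index would be needed in practice; the brute-force loop here only makes the presence = 1 / absence = 0 rule explicit.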
First, a systematic grid of plots was created, with plots spaced 100 m apart in both directions. These plots (368,707 in total) were exported from Google Earth Engine to be tested for multicollinearity (Table 3). The predictor variables were tested using tolerance and the variance inflation factor (VIF) [38,39,40,41,42].
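As an illustration of the two diagnostics, the sketch below computes VIF as the diagonal of the inverse correlation matrix of the predictors, and tolerance as its reciprocal. This is a standard identity, not the STATISTICA procedure used in the study; the function names are hypothetical, and the input is assumed to be a list of equally long numeric columns, one per predictor.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equally long columns."""
    n = len(columns[0])
    means = [sum(col) / n for col in columns]
    centered = [[v - m for v in col] for col, m in zip(columns, means)]
    norms = [math.sqrt(sum(v * v for v in col)) for col in centered]
    k = len(columns)
    return [
        [sum(a * b for a, b in zip(centered[i], centered[j]))
         / (norms[i] * norms[j]) for j in range(k)]
        for i in range(k)
    ]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif_and_tolerance(columns):
    """Return (VIF list, tolerance list), one value per predictor."""
    inv = invert(correlation_matrix(columns))
    vifs = [inv[j][j] for j in range(len(columns))]
    return vifs, [1.0 / v for v in vifs]
```

A predictor whose VIF exceeds the chosen threshold (10 in this study) is largely explained by the other predictors and is a candidate for removal.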
Considering the above results and the ecological coherence of the variables, highly correlated variables and others with high VIF values (i.e., above 10) were discarded, prioritizing variables that are commonly recorded in weather stations and ensuring that both temperature and precipitation were considered in the final selection. Simulations with different variables and combinations were also performed to assist with variable selection. Upon selecting the variables (“bio01”, “bio04”, “bio15”, “cec”, “ph”, “dist”, “sand”, “slope”), another multicollinearity test was performed [42]. With only the selected variables, multicollinearity dropped substantially, and all VIF values were below 10 (Table 4).
Google Earth Engine was used to run a Random Forest (RF) classifier and test its accuracy in estimating the potential distribution of the species. A total of 24,646 plots were used for training and 10,524 plots for testing; the model was trained with 500 decision trees.
The number of plots predicted by the classifier (RF) with presence from the training sample was 17,585, while the number of plots predicted with absence was 351,122. Previous studies found that RF performs better with balanced classes, and here the two classes were markedly imbalanced, with the absence class over-represented [43,44]. An under-sampling method was used to balance the datasets by limiting the number of absence plots as a function of the number of presence plots [45,46]. Cross-validation was then applied by dividing the sample data in a 70:30 proportion: 70% of the plots were used to “train” the model (training data set) and the remaining 30% (validation data set) were used to evaluate its predictive capacity. According to the confusion matrix reported in Table 5, the overall validation accuracy was 0.91.
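The balancing and splitting steps can be sketched as follows. This is an illustrative reconstruction rather than the study's Earth Engine code: the function name, the dictionary key `presence`, and the fixed seed are hypothetical, and a 1:1 absence-to-presence ratio is assumed. That assumption is at least consistent with the reported sample sizes, since 24,646 + 10,524 = 35,170 plots, which is twice the 17,585 presence plots.

```python
import random


def undersample_and_split(plots, ratio=1.0, train_fraction=0.7, seed=0):
    """Balance classes by random under-sampling, then split the sample.

    `plots` is a list of dicts with a `presence` key equal to 0 or 1.
    `ratio` is the assumed number of absence plots kept per presence plot.
    Returns (training plots, validation plots).
    """
    rng = random.Random(seed)
    presence = [p for p in plots if p["presence"] == 1]
    absence = [p for p in plots if p["presence"] == 0]

    # Keep only as many absence plots as the ratio allows
    n_absence = min(len(absence), int(round(ratio * len(presence))))
    balanced = presence + rng.sample(absence, n_absence)

    # Shuffle before cutting so both parts contain both classes
    rng.shuffle(balanced)
    cut = int(round(train_fraction * len(balanced)))
    return balanced[:cut], balanced[cut:]
```

A stratified split, which fixes the class proportions within each part, would be a reasonable alternative to the simple shuffle used here.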
The traditional metric of overall accuracy is no longer considered adequate for describing classifier performance [45,47]; therefore, the confusion matrix was used to calculate additional performance metrics. Here, precision (TP/(TP + FP)) is the number of correctly identified presence plots divided by the total number of plots that the model predicted as presence, while recall (TN/(TN + FP)) is the proportion of actual absence plots that were correctly predicted as absence. The results were precision = 0.87 and recall = 0.96. Therefore, 87% of the plots predicted as presence were actual presence, and 96% of the actual absence plots were correctly predicted as absence. Precision and recall usually trade off against each other: optimizing one tends to lower the other. Therefore, when applying these metrics, we must choose which of them to prioritize.
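The ratios discussed above can be computed directly from the four cells of a confusion matrix. The sketch below takes precision in its verbal sense (correct presence predictions divided by all presence predictions) and the second ratio as TN/(TN + FP). The function names are hypothetical, and any counts passed in are illustrative, not the values of Table 5.

```python
def presence_precision(tp, fp):
    """Share of plots predicted as presence that are actual presence."""
    return tp / (tp + fp)


def true_absence_rate(tn, fp):
    """Share of actual absence plots predicted as absence, TN/(TN + FP)."""
    return tn / (tn + fp)


def overall_accuracy(tp, fp, fn, tn):
    """Share of all plots that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)
```

With illustrative counts of TP = 80, FP = 20, FN = 10, and TN = 90, the three functions return 0.80, about 0.82, and 0.85, which shows how a single overall accuracy can hide different error rates for the two classes.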
For the objectives of this research, a better prediction of true presence takes priority over the prediction of true absence, especially considering the difficulty of reaching some areas of the island. A high probability of success is required in order to efficiently allocate efforts and resources. A greater significance of precision is supported by the analysis of Peterson et al. [48], who state that “in a niche-modeling framework, a model that errors by omitting known points of presence is more seriously flawed than one that predicts areas not known to be inhabited”.
This is also the reason we decided to use precision and recall instead of specificity and sensitivity, two other performance metrics commonly used in studies on species distribution models [47]. Precision focuses on true presence, while it overlooks true absence.
Therefore, an 87% rate of correct presence predictions provides a theoretical framework for identifying the suitable areas for Dracaena reforestation projects; however, further experiments could be implemented with different sampling strategies [49], or different algorithms, as also recommended by Qiao et al. [50], to obtain a higher precision value and consequently a better prediction capacity.
2.3. Data Analysis—Accessibility
In addition to the current Dracaena cinnabari distribution map, this study proposes an accessibility model focused on the locations of villages/settlements on Socotra, which helps in choosing the most suitable restoration and conservation areas for these trees. With these two maps in hand, decision makers can start to plan interventions and investments for reforestation/conservation activities, taking into consideration the metabolic energy cost for the land users.
For an accessibility map to be focused on the local human population, we first need to take into consideration how far the settlements are from other points of the island. The distance from settlements to other parts of the island in meters was calculated using recent available geolocation data of villages across the islands of Socotra (Vahalík et al., in preparation).
Similarly to Riccardi et al. [51], we developed an accessibility model focused on the location of villages/settlements on Socotra to identify the areas most suitable for reforestation/conservation actions. Even though we are aware that not all villages/settlements are occupied throughout the year and that local people often move from and to different areas for part of the year, we considered the maximal potential occupancy of villages at this point in time. First, the distance from settlements to other parts of the island, in meters, was calculated (Supplementary Figure S3).
Considering the trade-offs in place regarding the time available to allocate for restoration efforts, we can assume that land users weigh the costs of accessing a site based on the distance from their household. Any restoration activity must rely on the contribution of its surrounding communities to achieve success. This can be in the form of active contribution—such as tree planting and maintenance, improved management practices, degradation controls, etc.—or indirectly—for example, the exclusion of livestock and people through fencing. Aside from distance, terrain ruggedness is also a discouraging factor for moving through a landscape. In order to account for this factor, a metabolic energy cost image was created based on slope.
Slope and its links to the energy required to move have been extensively researched [52,53,54]. In this study, we used an equation proposed by Minetti et al. [54] to relate the metabolic energy cost of walking to slope, as follows:
$$C_w = 280.5\,m^5 - 58.7\,m^4 - 76.8\,m^3 + 51.9\,m^2 + 19.6\,m + 2.5$$
where m is the slope, expressed as a dimensionless gradient, and Cw (J·kg⁻¹·m⁻¹) is the metabolic energy cost of moving one unit of mass over one unit of distance. Notice that due to the polynomial formulation, this equation is valid for a range of slopes approximately between −0.5 and 0.5, outside of which the behavior of the function becomes counterintuitive [55]. The above approach was implemented using Google Earth Engine for image processing and SRTM as the base elevation model for the calculation of slope (resampled) (Supplementary Figure S1).
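A minimal sketch of the slope-to-cost relation is given below. The coefficients are those of the fifth-order walking polynomial of Minetti et al. [54] as we understand it, and should be checked against that reference before reuse. Clamping slopes to the stated validity range is an assumption of this sketch, not necessarily what the study's Earth Engine code does, and the names are hypothetical.

```python
# Coefficients from the highest power (m^5) down to the constant term,
# as recalled from Minetti et al. [54]; verify against the reference.
MINETTI_WALKING_COEFFS = (280.5, -58.7, -76.8, 51.9, 19.6, 2.5)


def walking_cost(slope, limit=0.5):
    """Metabolic cost of walking in J per kg per m for a given gradient.

    `slope` is rise over run. Values outside [-limit, limit] are clamped
    (an assumption of this sketch), because the polynomial is only
    considered valid within roughly that range.
    """
    m = max(-limit, min(limit, slope))
    cost = 0.0
    for coeff in MINETTI_WALKING_COEFFS:
        # Horner's scheme: evaluate the polynomial without explicit powers
        cost = cost * m + coeff
    return cost
```

On flat ground the function returns the constant term, 2.5 J·kg⁻¹·m⁻¹, and the cost rises as the uphill gradient increases.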
Multiplying the distance to settlements by the metabolic energy cost (Supplementary Figure S4), we estimate the energy required for moving one unit of mass (1 kg) outwards from the settlements’ center points. We assume that the higher the energy required to access an area, the lower the probability of community individuals commuting to it for reforestation and ecological restoration purposes (Figure 1).
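The multiplication described above can be sketched for a single map cell. This is a simplified illustration under stated assumptions: coordinates are projected (meters), the distance is the straight-line distance to the nearest settlement, and the cost per meter at the cell is supplied by the caller. The function name is hypothetical.

```python
import math


def access_energy(cell_xy, settlements, cost_per_metre):
    """Energy (J per kg) to reach a cell from its nearest settlement.

    `cell_xy` and `settlements` hold projected coordinates in meters.
    `cost_per_metre` is the metabolic cost at the cell in J per kg per m.
    Straight-line distance is an assumption of this sketch.
    """
    x, y = cell_xy
    distance = min(math.hypot(x - sx, y - sy) for sx, sy in settlements)
    return distance * cost_per_metre
```

Because the cost is taken at the destination cell only, this follows the distance-times-cost product described in the text and not an accumulated cost along a walking path, which would require a least-cost-path calculation.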
The areas of Socotra with the highest energy requirement for the local populations to access are the Hageher Mountains, the escarpments along the southern part of the island, and the rugged mountain ranges in the west of the island. Flat areas both on the coast and inland, where many population centers are found, have the estimated lowest energy requirements to access.