**4. Conclusions**

Here we present an assessment of seasonal drought forecasts, as characterized by the SPI at 3- and 6-month accumulation periods for 3- and 6-month lead times, respectively. The main advantage of using the SPI for drought monitoring and prediction is that it is already used in operational monitoring systems in many countries around the globe and it is the drought index endorsed by the World Meteorological Organization (WMO).

We evaluated the scalar accuracy of the SPI forecasts together with the skill of probabilistic forecasts of discrete drought events (i.e., SPI ≤ −1). The skill of probabilistic drought identification with the SPI was also assessed. The scalar skill of the SPI-3 and SPI-6 was found to be seasonally and regionally dependent, but for some locations, SPI3 predictions at a lead of 3 months and SPI6 predictions at a lead of 6 months were found to have "useful" skill (the monthly correlation with observations is statistically significant at the 5% significance level). The difference in skill between the ECMWF S4 SPI forecasts for South-Central America and a baseline forecast based on climatological characteristics is positive in many areas and for many months, although it is mostly statistically insignificant. Nevertheless, for the SPI-3, our results show that the skill of the dynamical seasonal forecast is always equal to or above that of the climatological forecast. On the other hand, for the SPI-6, our results indicate that it is more difficult to improve on the climatological forecast.

In a second step, we evaluated several methods to forecast drought events from the ensemble. Ensemble drought detection was based on several methods (Table A2) that can be organized into three types [40]: individual, where the index is based on an individual member or percentile; partially integrative, where the sum of particular individual members or percentiles is used; and integrative, which is represented by the ensemble mean. Although individual dry members and partially integrative methods provided outstanding accuracy for seasonal drought detection, our results show that the spread of the ensemble is too large and that these methods also have a large bias and false alarm ratio. The best (or most consistent) method uses the ensemble mean SPI values, both for SPI3 and SPI6, at three- and six-month lead times. Our decision was based on the GSS index, which according to many authors provides an optimal criterion for selecting a classification method based on the number of hits, misses and false alarms. The ensemble mean achieves an overall accuracy of about 80%, with a POD above 30% for at least 75% of the study area, and a false alarm ratio that is overall below 70%. Although the ECMWF S4 forecast system often overestimates drought onset, it is significantly better than using the climatology (≈16%).

Finally, standard verification measures for probabilistic forecasts were used to assess the accuracy of drought predictions based on the SPI values for the "moderate", "severe" and "extreme" categories. The Brier Skill Score, which measures the probabilistic forecast skill against a forecast derived from the climatology, showed that both the SPI3 and SPI6 were, for some regions, slightly more skillful than the climatology. The ECMWF forecast system behaves better than the climatology for clustered grid points in the north of South America, northeastern Argentina and Mexico. The skillful regions are similar for SPI-3 and SPI-6, but become reduced in extent for the most severe SPI categories. We hypothesize that, because an increase in SPI intensity is accompanied by a decrease in the respective cumulative probability, the likelihood of mismatching is larger. As expected, the BSS is lower for the locations where the scalar mismatch between the forecast and the observations is larger, which implies more categorical misses and/or false alarms at any SPI intensity.

Forecasting different magnitudes of meteorological drought intensity on a seasonal time scale remains a challenge. However, the ECMWF S4 forecasting system does capture reasonably well the onset of drought events (i.e., "moderate" drought) for some regions and seasons. A match is noticeable between the observed and predicted SPI for dry months in arid regions with highly marked precipitation seasonality. Although the performance of Numerical Weather Prediction models is continuously improving and advances in the representation of physical processes in the models are an area of intense research, the performance is still not good enough to provide useful guidance for months with high precipitation amounts; it does, however, provide information that is more skillful than the climatology for dry periods.

**Author Contributions:** Conceptualization, H.C., G.N. and P.B.; Methodology, H.C., G.N. and C.L.; Data computation, H.C. and E.D.; Validation, all authors; Writing-Original Draft Preparation, all authors; Writing-Review & Editing, all authors.

**Funding:** This research received support from the EUROCLIMA regional cooperation program between the European Union (European Commission; DG DEVCO) and Latin America.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A Description of the Validation Metrics**

#### *Appendix A.1 Nonprobabilistic Forecasts of Continuous SPI Values*

We first verify the scalar accuracy of the SPI values for the multimodel ensemble mean at 3 and 6 months lead time (for SPI3 and SPI6, respectively). Ensemble mean SPI values are verified against observations for the hindcast period (i.e., from 1981 to 2010). In this case, the SPI magnitude can take any value in a specified segment of the real line, rather than being limited to a finite number of discrete classes (see Table A1). We perform an independent verification of drought forecasts for each month, by using three common accuracy measures for continuous nonprobabilistic forecasts, namely: the Pearson product-moment correlation coefficient, *r*; the Mean Error (ME); and the Root Mean Squared Error (RMSE). To be considered statistically significant at the 5% (10%) significance level, the *r* between forecast SPI values and those in the verifying GPCC data needs to be greater than 0.37 (0.31), as defined by [41] for *Nyears* = 29 observations (i.e., after subtracting 1 year from the total number of available years in the dataset).

Although the correlation does reflect linear association between two variables (in this case, forecasts and observations), it is not sensitive to biases that may be present in the forecasts. On the other hand, the ME, which is the difference between the average forecast and average observation, expresses the bias of the forecasts. Forecasts that are, on average, too high will exhibit ME > 0 and forecasts that are, on average, too low will exhibit ME < 0. It is important to note that the bias gives no information about the typical magnitude of individual forecast errors, and is therefore not in itself an accuracy measure. To complement the ME, we have computed the RMSE, which has the same physical dimensions as the forecasts and observations, and can be thought of as a typical mean magnitude for individual forecast errors.
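For concreteness, the three accuracy measures can be computed as in the following minimal Python sketch, assuming the ensemble-mean forecast SPI and the observed (GPCC-based) SPI for one grid point and calendar month are available as NumPy arrays; the function name and the synthetic data are purely illustrative.

```python
import numpy as np

def continuous_scores(fcst, obs):
    """Pearson r, Mean Error (ME) and RMSE for one grid point and calendar month."""
    fcst, obs = np.asarray(fcst, float), np.asarray(obs, float)
    r = np.corrcoef(fcst, obs)[0, 1]            # linear association, insensitive to bias
    me = np.mean(fcst - obs)                    # bias: >0 over-forecasting, <0 under-forecasting
    rmse = np.sqrt(np.mean((fcst - obs) ** 2))  # typical magnitude of individual errors
    return r, me, rmse

# Illustrative use with synthetic SPI series for a 30-year hindcast (1981-2010)
rng = np.random.default_rng(0)
obs = rng.standard_normal(30)
fcst = 0.6 * obs + 0.4 * rng.standard_normal(30)
print("r=%.2f  ME=%.2f  RMSE=%.2f" % continuous_scores(fcst, obs))
```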


We also verify the skill score for the multimodel ensemble mean at 3 and 6 months lead time (respectively for SPI3 and SPI6). Skill score refers to the relative accuracy of an ensemble set of forecasts and is interpreted as the improvement over a reference forecast [48]. Therefore, if the ECMWF S4 is providing value-added skill to the SPI forecasts, it will first be manifested by temporal correlations with observations, *r*1, that exceed the expected correlation of the same observations with the climatological SPI baseline value (0), *r*2. Under the assumption that the sets of forecasts are normally distributed, to assess the statistical significance of the difference between two correlations *r*1 and *r*2, we used Fisher's *Z* transformation, as explained in [49]. We define *Zi* as

$$Z\_i = \frac{1}{2} \ln \left( \frac{1 + r\_i}{1 - r\_i} \right)$$

for *i* = 1 and 2. The transformed values *Z* are assumed to be normally distributed with variance 1/(*N* − 3), where *N* = 29 observations (i.e., after subtracting 1 year from the total number of available years in the dataset). We then transformed *r*1 and *r*2 to *Z*1 and *Z*2, and computed the statistical significance of the difference in correlations using the *Z* statistic:

$$Z = \frac{Z\_1 - Z\_2}{\sqrt{\frac{1}{N\_1 - 3} + \frac{1}{N\_2 - 3}}}$$

where *N*1 − 3 and *N*2 − 3 are the degrees of freedom for *r*1 and *r*2, respectively. Using a null hypothesis of equal correlations and a non-directional (two-sided) alternative hypothesis of unequal correlations, if |*Z*| is greater than 1.96, the difference in correlations is statistically significant at the 5% significance level.
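As a reference, the significance test for the difference between the forecast correlation and the climatological baseline correlation can be sketched as below; this is only a direct transcription of the equations above, with illustrative input values.

```python
import numpy as np

def fisher_z_difference(r1, r2, n1=29, n2=29):
    """Z statistic for the difference between two correlations via Fisher's transform.

    |Z| > 1.96 implies the correlations differ at the 5% level (two-sided test).
    """
    z1 = 0.5 * np.log((1 + r1) / (1 - r1))
    z2 = 0.5 * np.log((1 + r2) / (1 - r2))
    return (z1 - z2) / np.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))

# Example: forecast correlation of 0.55 vs. the climatological baseline (r2 = 0), N = 29
print(fisher_z_difference(0.55, 0.0))  # ~2.2 -> significant at the 5% level
```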

A complementary skill score measure was constructed using the *RMSE* as the underlying accuracy statistic. The reference RMSE is based on the climatological average *SPI*, and is computed as:

$$RMSE\_{Clim} = \sqrt{\frac{1}{N} \sum\_{k=1}^{N} \left(\overline{SPI} - SPI\_k\right)^2}$$

For the *SPI*, the climatological average does not change from forecast occasion to forecast occasion (i.e., as a function of the yearly index *k*). This implies that *RMSEClim* is an estimate of the sample standard deviation of the predictand. Using climatology as the control forecast, the RMSE-based skill score becomes

$$SS\_{Clim} = 1 - \frac{RMSE}{RMSE\_{Clim}}$$

Because of the arrangement of the skill score, the *SSClim* based on RMSE is sometimes called the reduction of variance (RV), because the quotient being subtracted is the average squared error (or residual, in the nomenclature of regression) divided by the climatological variance.
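A compact sketch of the RMSE-based skill score, under the assumption that the forecast and observed SPI values are again 1-D NumPy arrays over the hindcast years (names are illustrative):

```python
import numpy as np

def skill_score_clim(fcst, obs):
    """RMSE-based skill score against the climatological-average SPI forecast."""
    fcst, obs = np.asarray(fcst, float), np.asarray(obs, float)
    rmse = np.sqrt(np.mean((fcst - obs) ** 2))
    # Reference forecast: climatological mean SPI (close to zero by construction)
    rmse_clim = np.sqrt(np.mean((np.mean(obs) - obs) ** 2))
    return 1.0 - rmse / rmse_clim  # > 0: the forecast beats climatology
```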

#### *Appendix A.2 Nonprobabilistic Forecasts of Categorical SPI Values*

The temporal correlation between forecast and observed values of the SPI provides an overall measure of forecast accuracy and skill, one that is not limited to the case of drought alone. Therefore, we also evaluated SPI forecasts in the context of being able to detect drought, that is, when the SPI drops below a particular threshold. Here, we identified a drought event as occurring when the SPI value for a given month was ≤−1, which corresponds to a "moderate drought" category or higher in the classification system presented in Table A1.


**Table A1.** SPI classification following McKee et al. [8].

Ensemble drought detection was based on several methods (Table A2) and can be categorized into three types [50]: individual, where the index is based on an individual member or percentile; partially integrative, where the sum of particular individual members or percentiles is used; and integrative, which is represented by the ensemble mean. The individual types should be seen as providing complementary information about the intensity of the SPI, but also about the distribution of the members. The individual types of drought detection have been subdivided into five classes representing dry members (Q13, Q23), wet members (Q77, Q88) or the median. The extreme members of the distribution are not used, to avoid the outliers generally associated with ensemble systems [50].

**Table A2.** Methods to detect drought events from the S4 ensemble system. Adapted from [40].
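The three families of detection rules can be illustrated with the sketch below, which assumes a 1-D array holding the SPI values of the ensemble members for one month and grid point. The "partially integrative" rule shown (an average of the 13th and 23rd percentiles) is only an illustrative placeholder; the combinations actually used are those listed in Table A2.

```python
import numpy as np

SPI_DROUGHT_THRESHOLD = -1.0  # "moderate drought" or more severe (Table A1)

def detect_drought(member_spi):
    """Ensemble drought detection for one month and grid point.

    `member_spi` holds the SPI value of each ensemble member.  The
    "partially integrative" combination below is a placeholder example,
    not the published definition from Table A2.
    """
    member_spi = np.asarray(member_spi, float)
    q13, q23 = np.percentile(member_spi, [13, 23])
    return {
        "individual_Q13": q13 <= SPI_DROUGHT_THRESHOLD,                  # single dry percentile
        "partial_Q13_Q23": 0.5 * (q13 + q23) <= SPI_DROUGHT_THRESHOLD,   # combination of dry percentiles
        "integrative_mean": member_spi.mean() <= SPI_DROUGHT_THRESHOLD,  # ensemble mean
    }
```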


For 3- and 6-month lead times (for SPI3 and SPI6, respectively), we computed several verification measures for the categorical forecasts (i.e., below the SPI −1 threshold) identified with the methods described in Table A2. All verification measures are based on a contingency table approach, which is applied at each grid point in the study area. The entries in the table are defined as follows: *A* is the number of drought events that are forecast and occur; *B* is the number of drought events that are forecast but do not occur; *C* is the number of drought events that are not forecast but do occur; and *D* is the number of drought events that are not forecast and do not occur. The variable *N* is the total number of cases analyzed from 1981 to 2010. Based on these values, the percentage correct (*PC*, perfect = 1) is the fraction of correct forecasts (hits and correct negatives) relative to the total number of cases.

$$PC = \frac{A + D}{N}$$

The extreme dependency score (EDS) provides a skill score in the range [−1, 1] that can be used to find the hit-rate exponent [51]. The EDS takes the value of 1 for perfect forecasts and 0 for random forecasts, and is greater than zero for forecasts whose hit rates converge to zero more slowly than those of random forecasts.

$$EDS = \frac{2\log\frac{A+C}{N}}{\log\frac{A}{N}} - 1$$


The Gilbert skill score (*GSS*) measures the fraction of forecast events that were correctly predicted, adjusted for the frequency of hits that would be expected to occur simply by random chance [40].

$$GSS = \frac{A - A^\*}{A + B + C - A^\*}$$

where *A\** is the number of random hits, computed as:

$$A^\* = \frac{(A+B)(A+C)}{N}$$

The GSS is often used in the verification of rainfall forecasts because its "equitability" allows scores to be compared more fairly across different regimes (for example, it is easier to correctly forecast rain occurrence in a wet climate than in a dry climate). However, because it penalizes both misses and false alarms in the same way, it does not distinguish the source of forecast error. Therefore, it should be used in combination with at least one other contingency table statistic, for example, bias. Here, we compute bias as:

$$BIAS = \frac{A+B}{A+C}$$

The probability of detection (*POD*, perfect = 1) is the fraction of observed events that were correctly forecast.

$$POD = \frac{A}{A+C}$$

The false alarm ratio (*FAR*, perfect = 0) is the fraction of forecast events that did not actually occur.

$$FAR = \frac{B}{A+B}$$
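All of the contingency-table scores above can be evaluated with the short sketch below, which follows the formulas given in this appendix; the counts in the example call are purely illustrative.

```python
import numpy as np

def categorical_scores(a, b, c, d):
    """Verification scores for categorical drought (SPI <= -1) forecasts.

    a: hits, b: false alarms, c: misses, d: correct negatives.
    """
    n = a + b + c + d
    a_star = (a + b) * (a + c) / n  # hits expected by random chance
    return {
        "PC":   (a + d) / n,                                  # percentage correct
        "EDS":  2 * np.log((a + c) / n) / np.log(a / n) - 1,  # extreme dependency score
        "GSS":  (a - a_star) / (a + b + c - a_star),          # Gilbert skill score
        "BIAS": (a + b) / (a + c),                            # frequency bias
        "POD":  a / (a + c),                                  # probability of detection
        "FAR":  b / (a + b),                                  # false alarm ratio
    }

# Illustrative counts for a 30-year hindcast (360 monthly cases)
print(categorical_scores(a=40, b=30, c=20, d=270))
```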

#### *Appendix A.3 Probabilistic Forecasts of Categorical SPI Values*

Verification of probability forecasts is somewhat more subtle than verification of non-probabilistic forecasts. Since non-probabilistic forecasts contain no expression of uncertainty, it is clear whether an individual forecast is correct or not. On the other hand, unless a probabilistic forecast is either 0.0 or 1.0, the situation is less clear-cut. For probability values between these two (certainty) extremes, a forecast is neither right nor wrong, so meaningful assessments can only be made using collections of multiple forecast and observation pairs. A number of accuracy measures for the verification of probabilistic forecasts of dichotomous events exist, but by far the most common is the Brier score (BS) [48]. The Brier score is essentially the mean squared error of the probability forecasts, where the GPCC drought observation at time *k* is *ok* = 1 if a drought event occurs (i.e., SPI ≤ −1) and *ok* = 0 if a drought event does not occur (i.e., SPI > −1). The BS averages the squared differences between pairs of forecast probabilities, *fcstk*, and the subsequent binary reference observations,

$$BS = \frac{1}{N} \sum\_{k=1}^{N} \left( fcst\_k - o\_k \right)^2$$

where the index *k* again denotes a numbering of the *N* forecast-event pairs. Comparing the *BS* with the mean squared error, it can be seen that the two are completely analogous. As a mean-squared-error measure of accuracy, the *BS* is negatively oriented, with perfect forecasts exhibiting *BS* = 0. Less accurate forecasts receive higher *BS* values, but since individual forecasts and observations are both bounded by zero and one, the score can only take values in the range 0 ≤ *BS* ≤ 1.
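A minimal sketch of the BS, assuming the forecast probability of drought (e.g., the fraction of ensemble members with SPI ≤ −1) and the binary observations are available as arrays; names are illustrative:

```python
import numpy as np

def brier_score(prob_fcst, event_obs):
    """Brier score: mean squared error of probability forecasts of SPI <= -1.

    `prob_fcst` is the forecast drought probability for each month, and
    `event_obs` is 1 when the observed SPI was <= -1, and 0 otherwise.
    """
    prob_fcst = np.asarray(prob_fcst, float)
    event_obs = np.asarray(event_obs, float)
    return np.mean((prob_fcst - event_obs) ** 2)
```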

Skill scores of the form of *SSclim* are also often computed for the *BS*, yielding the Brier Skill Score (*BSS*):

$$BSS = 1 - \frac{BS}{BS\_{ref}}$$

The BSS is the conventional skill-score form using the BS as the underlying accuracy measure. Usually, for the SPI, the reference forecasts are the relevant climatological probabilities of a drought event of a given severity taking place (Table A1). For example, the frequency of "Moderate" drought events is approximately 16%. The BSS ranges between minus infinity and 1; 0 indicates no skill when compared to the reference forecast, and the perfect score is 1. A good companion to the BSS is the Relative Operating Characteristic (ROC) of the forecast. The ROC is conditioned on the observations and measures the ability of the probabilistic forecasting system to discriminate between drought events and non-events of different frequencies, that is, the resolution of the forecast. The ROC is not sensitive to bias in the forecast (even a biased forecast can give a good ROC). However, the ROC is a measure of the potential usefulness of the probabilistic forecast, and the area under the ROC curve gives a measure of its skill. Since ROC curves for perfect forecasts pass through the upper-left corner, the area under a perfect ROC curve covers the entire unit square, so *Aperf* = 1. Similarly, ROC curves for random forecasts lie along the 45° diagonal of the unit square, yielding the area *Arand* = 0.5. The area *A* under a ROC curve of interest can also be expressed in standard skill-score form, *SSROC*, as

$$SS\_{ROC} = \frac{A - 1/2}{1 - 1/2}$$

Wilks [48] states that *SSROC* is a reasonably good discriminator among relatively low-quality forecasts, but that relatively good forecasts tend to be characterized by quite similar (near-unit) areas under their ROC curves. The ROC area ranges between 0 and 1, with 0.5 indicating no skill; correspondingly, *SSROC* is 0 for a no-skill forecast and 1 for a perfect forecast.
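For reference, the BSS against a constant climatological probability and the ROC-area skill score can be sketched as below. The reference Brier score of a constant probability forecast p for an event of climatological frequency p is p(1 − p); the ROC area is estimated here with the pairwise (Mann-Whitney) formulation. Both the 16% default base rate and the variable names are assumptions for illustration.

```python
import numpy as np

def brier_skill_score(bs_forecast, p_clim=0.16):
    """BSS against a constant climatological probability forecast.

    For an event with climatological frequency p, the reference Brier score
    of always forecasting p is p * (1 - p) (about 0.134 for p = 0.16).
    """
    return 1.0 - bs_forecast / (p_clim * (1.0 - p_clim))

def roc_skill_score(prob_fcst, event_obs):
    """SS_ROC = (A - 1/2) / (1 - 1/2), with A the area under the ROC curve.

    A is estimated as the probability that a randomly chosen event receives a
    higher forecast probability than a randomly chosen non-event (ties count
    one half), which equals the trapezoidal ROC area.
    """
    prob_fcst = np.asarray(prob_fcst, float)
    event_obs = np.asarray(event_obs, int)
    p_evt = prob_fcst[event_obs == 1]
    p_non = prob_fcst[event_obs == 0]
    wins = (p_evt[:, None] > p_non[None, :]).mean()
    ties = (p_evt[:, None] == p_non[None, :]).mean()
    area = wins + 0.5 * ties
    return (area - 0.5) / (1.0 - 0.5)
```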

**Figure A1.** Monthly correlation of the observed and forecast SPI at 6-month lead time (SPI6) (using the mean of the ensemble) for the hindcast period (1981–2010). Values are indicated in the color bar: 0.31 (0.37) is statistically significant at the 10% (5%) significance level.

**Figure A2.** Monthly difference in forecast skill (Pearson correlation) between the forecast SPI6 at 6-month lead time (using the mean of the ensemble) and the climatological SPI for the hindcast period (1981–2010). Values are indicated in the color bar: 1.96 indicates statistical significance at the 5% significance level.

**Figure A3.** RMSE between the observed and forecast SPI6 at 6-month lead time (mean of the ensemble) for the hindcast period (1981–2010). Differences in percentile magnitude are indicated in the color bar.

**Figure A4.** Skill Score of the SPI6 at 6-month lead time forecast measured in terms of the RMSE relative to climatological RMSE for the hindcast period (1981–2010).

**Figure A5.** Verification measures of categorical drought forecasts (i.e., below the SPI6 "-1" threshold) estimated with the methods described in Table A2.
