1. Introduction
Extreme value modeling for hydrological extreme events (e.g., heavy rainfall and floods) is an essential task for the design of hydraulic structures. For extreme value modeling, the block maxima approach has been widely used, and it uses a sequence of maximum observations extracted from equal periods, such as annual maximum (AM) daily rainfall. The set of block maxima is assumed to be independent and identically distributed (iid) and is fitted with a probability distribution model, such as a generalized extreme value (GEV) distribution, to estimate the design quantiles of extreme hydrological events.
As the impact of climate change has become a significant issue, many efforts have been made to consider nonstationarity in hydrologic applications. One popular approach applies a variety of candidate nonstationary models to the nonstationary data and selects an appropriate model based on model diagnostics. It generally employs the maximum likelihood estimation (MLE) method to estimate nonstationary model parameters due to its adaptability to changes in model structures [1]. So far, this approach has been studied extensively and can be referred to as a “user-friendly” method. However, there is an issue regarding ergodicity in nonstationary extreme value modeling from a statistical point of view [2,3,4]. More specifically, inferring statistical properties from temporal statistics relies on the assumption of ergodicity, since an AM time series can be theoretically considered as a stochastic process. However, if the observed data have a trend or are affected by external variables (i.e., the process is nonstationary), ergodicity cannot hold and inductive inference from the data would not be possible [3,4,5]. In short, it is necessary to determine the relationships between model parameters and covariates prior to model parameter estimation to deal with the ergodicity issue. This is an important issue that needs to be addressed in statistical hydrology, but hydrological applications mostly accept nonstationary models, as mentioned earlier. Hence, we apply the standardized approach in this study, and its limitations regarding ergodicity are discussed in the discussion section.
For nonstationary extreme value modeling, the nonstationary GEV (NS-GEV) is a representative model, and it employs time or exogenous covariates (e.g., large-scale climate modes and hydrological variables) [6,7,8,9,10,11,12,13,14,15]. Here, time t is commonly used as a covariate for modeling the linear or polynomial trend of GEV parameters. It is a simple and straightforward way to introduce nonstationarity into the probability distribution model. However, considering only time t as a covariate could lead to problems such as an increase in the uncertainty of quantile estimates and distortion of the probability distribution model under extrapolation [16,17]. In addition, the trends of hydrological data can vary in the short or long term because of climate variability and other external forces [18,19,20]. Thus, large-scale climate modes have been employed as potential climate covariates since the mid-2000s, as they can account for natural climate variability.
In general, extreme value modeling for rainfall with climate covariates has been performed as follows: (1) identifying significant climate indices related to extreme rainfall; and (2) selecting an appropriate probability distribution model among the various GEV models based on the statistical evaluation of the model fit. As such, various forms of NS-GEV can be modeled to estimate the magnitude and occurrence frequency of extreme rainfall events using climate indices as covariates [8,9,11,12,14,21]. Then, information criteria such as the Akaike information criterion (AIC) and Bayesian information criterion (BIC) are generally used to select the most appropriate model. For example, Vasiliades et al. [8] considered four climate indices (i.e., the Southern Oscillation Index (SOI), North Atlantic Oscillation (NAO), Pacific Decadal Oscillation (PDO), and Mediterranean Oscillation Index (MOI)) reported to illustrate the dependence of the Mediterranean climate on large-scale low-frequency climate patterns. These indices were employed as covariates for the GEV model with a conditional density network, and an appropriate GEV model was finally selected based on the corrected AIC (AICc) and BIC. Agilan and Umamahesh [9] conducted extreme value modeling for developing a nonstationary rainfall intensity-duration-frequency curve. They used five covariates, including the El Niño-Southern Oscillation (ENSO) and Indian Ocean Dipole (IOD), which are known to impact extreme rainfall over India; then, they selected an appropriate model using the AICc. They concluded that the global processes (i.e., global warming, ENSO cycle, and IOD) are the best covariates for the long-duration extreme rainfall of Hyderabad city, India.
As shown in the abovementioned studies, many previous works considered several climate indices reported in the literature for modeling NS-GEV of rainfall extremes. The climate indices were directly used as covariates of location and scale parameters, which are closely linked to the mean and variance of the data, respectively. A more reasonable NS-GEV could be modeled by considering the connection between climate indices and the trend inherent in extreme rainfall. For this, a preliminary analysis is necessary to determine the trends in the mean and variance of AM rainfall and to identify the impact of climate indices on these trends. Recently, many researchers have employed a modern signal processing technique known as ensemble empirical mode decomposition (EEMD) to extract long-term trends from a given data series. The EEMD has the advantage of detecting long-term trends after extracting oscillatory patterns from an original data series [22,23]. It has also been successfully applied to hydro-climatological variables [24,25,26,27]. For example, Kim et al. [27] identified that the behavior of the Atlantic Multidecadal Oscillation (AMO) index indicates a long-term trend in the monthly precipitation series of South Korea through a statistical analysis procedure along with EEMD. Chen et al. [26] employed the EEMD to explore the trend of the AM daily precipitation data. They extracted long-term trends in the mean and variance of the AM daily precipitation data and showed that the extracted trends can be applied to extreme value modeling.
The increase of uncertainty in design levels is one of the most important issues in nonstationary extreme value modeling. The use of a more complex model (i.e., nonstationary distribution model) generally allows a good fit for the given samples, but this could provide quantile estimates with a large amount of uncertainty [3]. In general, however, the best model is selected using only the information criteria that assess the model performance based on the maximized log-likelihood (i.e., model fit) [26,28,29,30,31,32,33], so that the selected nonstationary model usually yields unreliable design quantiles [16]. Cooley [34] demonstrated that relying solely on the information criteria is not enough to select an appropriate model because they cannot take into account uncertainty. Nevertheless, most of the previous studies assessed uncertainties in model parameters and design quantiles only after selecting the best model based on the information criteria [13,16,17,21,35,36,37,38,39,40,41]. Therefore, to provide more reliable design quantiles, the uncertainty of the candidate models should be assessed prior to selecting an appropriate model. The bootstrap method is generally recommended to measure the uncertainty of quantile estimates [36,37,38,42]. It is computationally efficient and provides realistic asymmetric confidence intervals (CIs) without asymptotic assumptions [3,10].
Many studies have attempted extreme value modeling by applying the NS-GEV with climate covariates to estimate rainfall quantiles considering climate variability. However, most of these are limited to modeling for several rainfall gauging stations [8,9,11,12,14,36], and few studies propose an overall procedure from the identification of appropriate climate indices impacting regional rainfall extremes to the selection of an appropriate nonstationary distribution model with climate covariates (e.g., India [21], southern France [38], and southern U.S. [43]). To the best of our knowledge, little attention has been paid to considering the uncertainty in the model selection procedure. The main objective of this study is to propose a procedure for extreme value modeling with large-scale climate modes and to consider the uncertainty of model selection. It can be divided into two main parts: (a) identifying significant seasonal climate indices (SCIs) that impact the long-term trends of AM daily rainfall using the EEMD, and (b) selecting an appropriate GEV distribution considering both model fit and uncertainty among the stationary GEV (ST-GEV) and various NS-GEVs using time and SCIs as covariates. The EEMD was applied to AM daily rainfall observed at 61 stations over South Korea, and the residues that represent the long-term trend of AM daily rainfall were extracted. By conducting correlation analysis between the extracted residues and the various climate indices, significant SCIs that impact the trend of AM daily rainfall were selected. Extreme value modeling was then performed using the significant climate indices as covariates of GEV parameters. Considering both model fit and uncertainty, an appropriate GEV distribution was finally selected at each station. We also discuss the physical meaning of the significant climate indices selected over South Korea and the feasibility of the procedure for nonstationary extreme value modeling.
3. Methodology
This study is grouped largely into two main steps to conduct extreme value modeling with large-scale climate modes: Step 1. identifying significant climate indices that impact the trend of AM daily rainfall based on statistical approaches; Step 2. selecting an appropriate GEV distribution by comparing the performance of the ST-GEV and various NS-GEVs with climate covariates. The procedure can be subdivided into five steps. Figure 3 presents a brief overview of this study, and the details of the methodology are described in the following subsections.
3.1. Seasonal Climate Indices
All climate indices are provided on a monthly scale. To employ the SCIs as covariates in nonstationary extreme value modeling, we converted the monthly climate indices to seasonal ones by seasonal averaging (i.e., Spring: March-April-May (MAM); Summer: June-July-August (JJA); Fall: September-October-November (SON); and Winter: December-January-February (DJF)). To consider the time lag for using SCIs as predictors, the SCIs observed prior to the occurrence of AM daily rainfall should be used. As most AM daily rainfall events over South Korea occur in the JJA season (see Figure 2), the four seasons of lagged SCIs before the JJA season are considered as described in Table 2.
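The seasonal averaging and lagging described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the season labels, and the exact assignment of the four lagged seasons (JJA and SON of the previous year, DJF spanning the turn of the year, and MAM of the AM year) are assumptions, since Table 2 is not reproduced here.

```python
from statistics import mean

# (year offset, month) pairs for each lagged season relative to the AM year.
# Assumption: the four seasons preceding JJA of the AM year are
# JJA(-1), SON(-1), DJF (Dec of the previous year to Feb) and MAM.
LAGGED_SEASONS = {
    "JJA(-1)": [(-1, 6), (-1, 7), (-1, 8)],
    "SON(-1)": [(-1, 9), (-1, 10), (-1, 11)],
    "DJF": [(-1, 12), (0, 1), (0, 2)],
    "MAM": [(0, 3), (0, 4), (0, 5)],
}


def lagged_seasonal_indices(monthly, year):
    """Return the seasonal mean of a monthly climate index for each lagged season.

    monthly: dict mapping (year, month) -> index value (hypothetical layout).
    year:    year in which the AM daily rainfall is observed.
    """
    result = {}
    for label, months in LAGGED_SEASONS.items():
        values = [monthly[(year + dy, m)] for dy, m in months]
        result[label] = mean(values)
    return result
```

Calling `lagged_seasonal_indices(monthly, 2000)` would return one seasonal mean per lagged season, each computed only from months that precede JJA of 2000.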
3.2. Ensemble Empirical Mode Decomposition
EEMD is a decomposition method that has been used widely in hydrological time series. By decomposing an original time series into a set of intrinsic mode functions (IMFs), the EEMD can subsequently extract a long-term trend inherent in the time series. The IMFs indicate physically meaningful information, and they should satisfy two conditions: (1) the number of extrema and the number of zero crossings must either be equal to each other or differ at most by 1 over the whole time series, and (2) the mean of the upper and lower envelopes, which are defined by connecting all local maxima and all local minima, respectively, should be zero at any point [47]. A monotonic function that remains after the decomposition is a residue, which indicates a long-term trend. Here, we briefly describe the EEMD procedure with an example of a time series, as shown in Figure 4. The red-dotted and blue-dotted lines are the upper and lower envelopes of the original time series, respectively, and the gray-dotted line is the mean value of the local maxima and local minima. Assuming that there is an original time series $x(t)$, we can obtain $N$ IMFs as follows: (1) identify the upper and lower envelopes by connecting the local maxima and minima of $x(t)$ using a cubic spline; (2) calculate the mean value of the local maxima and local minima as $m(t)$; (3) extract the mean value from the original time series as $h(t) = x(t) - m(t)$; (4) repeat steps (1) to (3) until $h(t)$ satisfies the condition of the IMF; (5) define a new time series by extracting $h(t)$ from $x(t)$ and repeat steps (1) to (5) until no more IMFs can be extracted. Finally, $x(t)$ is composed of the sum of the IMFs and a residue $r(t)$, as presented in Equation (1):
$$x(t) = \sum_{i=1}^{N} \mathrm{IMF}_i(t) + r(t) \tag{1}$$
where $N$ is the number of IMFs.
In this study, the EEMD is applied to AM daily rainfall to extract the long-term trend of AM daily rainfall as follows (for additional details, see Chen et al. [26]): (1) define a residue $r_m(t)$, extracted from the original time series $x(t)$ using the EEMD procedure, as a long-term trend in the mean of AM daily rainfall (hereafter referred to as the mean trend); (2) calculate the time series of the variance $v(t)$; (3) define a residue $r_v(t)$, extracted from the time series of the variance $v(t)$ using the EEMD procedure, as a long-term trend in the variance of AM daily rainfall (hereafter referred to as the variance trend).
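The three steps above can be sketched as a small pipeline. Two points are assumptions rather than statements from the text: the variance series is taken here as the squared deviation of the AM series from its mean trend, and the trend extractor is passed in as a callable (`extract_residue`, a hypothetical name). The default extractor is an ordinary least-squares line used only as a stand-in so that the sketch runs; it is not EEMD, and an EEMD implementation that returns the residue could be substituted.

```python
def linear_trend(series):
    """Stand-in trend extractor (least-squares line); NOT an EEMD residue."""
    n = len(series)
    t_mean = (n - 1) / 2
    x_mean = sum(series) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (x - x_mean) for t, x in enumerate(series))
    slope = sxy / sxx
    return [x_mean + slope * (t - t_mean) for t in range(n)]


def mean_and_variance_trends(am, extract_residue=linear_trend):
    """Return (mean trend, variance series, variance trend) for an AM series.

    am:              list of annual maximum daily rainfall values.
    extract_residue: callable mapping a series to its long-term trend
                     (hypothetical interface; EEMD residue in the study).
    """
    # Step (1): long-term trend in the mean.
    mean_trend = extract_residue(am)
    # Step (2): variance series, ASSUMED to be the squared deviation
    # of each observation from the mean trend.
    variance_series = [(x - m) ** 2 for x, m in zip(am, mean_trend)]
    # Step (3): long-term trend in the variance.
    variance_trend = extract_residue(variance_series)
    return mean_trend, variance_series, variance_trend
```

Keeping the extractor as a parameter separates the bookkeeping of the mean and variance trends from the decomposition method itself.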
3.3. Spearman’s Rank Correlation Analysis
Spearman’s rank correlation is one of the most widely used statistical estimators to measure the statistical dependence between two different hydrometeorological series [14,38,48,49]. It is based on the nonparametric rank correlation and describes the statistical relationship using a monotonic function. Spearman’s rank correlation coefficient ($\rho$) between two time series $x_t$ and $y_t$ is defined by Equation (2):
$$\rho = 1 - \frac{6\sum_{t=1}^{n} d_t^{2}}{n\left(n^{2}-1\right)} \tag{2}$$
where $d_t$ is the difference between the two ranks of each observation at time $t$, and $n$ is the number of observations. If the two time series have a perfect monotonic relationship, the value of $\rho$ is −1 or 1. As a rule of thumb, a value of $\rho$ between 0.7 and 1.0 (−0.7 and −1.0) represents a strong positive (negative) correlation, a value between 0.3 and 0.7 (−0.3 and −0.7) represents a moderate positive (negative) correlation, and a value between 0 and 0.3 (0 and −0.3) represents a weak positive (negative) correlation [50,51,52].
To identify the significant SCIs impacting AM daily rainfall over South Korea, $\rho$ between the SCI and the mean and variance trends of AM daily rainfall is calculated for all employed stations. Then, the percentage of stations with a significant $\rho$ at the 1% significance level is calculated for each SCI.
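A minimal sketch of the rank correlation and of the station percentage follows. It implements the rank-difference formula for series without ties, and it treats significance through a critical value `rho_crit` supplied by the caller; both the function names and the use of a critical value (rather than a p-value) are illustrative assumptions. In practice, `scipy.stats.spearmanr` returns the coefficient together with a p-value and handles ties.

```python
def ranks(values):
    """Ranks from 1 (smallest) to n; assumes there are no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rank correlation via the rank-difference formula (no ties)."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


def percent_significant(rhos, rho_crit):
    """Percentage of stations whose |rho| reaches the critical value.

    rho_crit is a hypothetical, user-supplied critical value for the
    chosen significance level and record length.
    """
    hits = sum(1 for r in rhos if abs(r) >= rho_crit)
    return 100.0 * hits / len(rhos)
```

For each SCI, `spearman_rho` would be evaluated once per station against the mean trend and once against the variance trend, and `percent_significant` would then summarize the station-wise coefficients.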
3.4. Generalized Extreme Value Distribution Modeling
The GEV distribution is widely used for estimating the magnitude and occurrence probability of hydrological extreme events. Let $x_1, x_2, \ldots, x_n$ be the AM time series data with sample size $n$, which are assumed to be independent and identically distributed (iid). The cumulative distribution function $F(x)$ of the ST-GEV for $x$ is expressed by Equation (3):
$$F(x) = \exp\left\{-\left[1 + \xi\left(\frac{x-\mu}{\sigma}\right)\right]^{-1/\xi}\right\}, \quad 1 + \xi\left(\frac{x-\mu}{\sigma}\right) > 0 \tag{3}$$
where $\mu$ is the location parameter related to the mean of the data, $\sigma$ (>0) is the scale parameter related to the variability of the data, and $\xi$ is the shape parameter related to the heaviness of the distribution tail.
When a trend or the impact of external variables in the observed data is considered in extreme value modeling, the distribution parameters can be modeled as functions of covariates, such as time or climate indices. In this study, the location and scale parameters of the GEV distribution were defined as functions of time-dependent covariates as follows:
$$\mu(t) = \mu_0 + \sum_{i=1}^{m} \mu_i\, c_i(t) \tag{4}$$
$$\sigma(t) = \exp\left(\sigma_0 + \sum_{i=1}^{m} \sigma_i\, c_i(t)\right) \tag{5}$$
where $m$ is the number of covariates and $c_i(t)$ is the $i$-th covariate at time $t$. As the scale parameter should be positive for all $t$, an exponential function was employed as a link function for the scale parameter. The shape parameter was assumed to be stationary, as it is difficult to estimate the shape parameter with precision [1,10].
The combinations of significant SCIs are used as covariates of location and scale parameters. Depending on the nonstationarity in the GEV parameters, two types of NS-GEV are defined: NS-GEV(1,0,0), in which only the location parameter is assumed to be a function of covariates, and NS-GEV(1,1,0), in which both location and scale parameters are assumed to be functions of covariates. The parameters of GEV distributions are estimated using the MLE method.
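The covariate-dependent parameters and the likelihood that MLE maximizes can be sketched as below. The sketch assumes a linear function of the covariates for the location, an exponential link for the scale, a constant nonzero shape, and the sign convention in which a positive shape means a heavy upper tail; the function and argument names are hypothetical. The negative log-likelihood could be handed to a general-purpose optimizer such as `scipy.optimize.minimize`.

```python
import math


def gev_params_at(covs_t, mu_coef, sigma_coef, xi):
    """Location, scale and shape at one time step.

    covs_t:     covariate values at that time, e.g. [SCI_1, SCI_2].
    mu_coef:    [mu_0, mu_1, ...]; intercept followed by slopes.
    sigma_coef: [s_0, s_1, ...]; pass only [s_0] for NS-GEV(1,0,0).
    """
    mu = mu_coef[0] + sum(b * c for b, c in zip(mu_coef[1:], covs_t))
    log_sigma = sigma_coef[0] + sum(b * c for b, c in zip(sigma_coef[1:], covs_t))
    return mu, math.exp(log_sigma), xi  # exp keeps the scale positive


def neg_log_likelihood(x, covs, mu_coef, sigma_coef, xi):
    """Negative log-likelihood of an NS-GEV (assumes xi != 0)."""
    total = 0.0
    for x_t, covs_t in zip(x, covs):
        mu, sigma, _ = gev_params_at(covs_t, mu_coef, sigma_coef, xi)
        z = 1 + xi * (x_t - mu) / sigma
        if z <= 0:
            return math.inf  # observation outside the support
        total += math.log(sigma) + (1 + 1 / xi) * math.log(z) + z ** (-1 / xi)
    return total
```

Passing a single-element `sigma_coef` leaves the scale stationary, so the same function covers the ST-GEV (no slopes at all), NS-GEV(1,0,0), and NS-GEV(1,1,0).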
3.5. Appropriate Model Selection Considering Uncertainty
The Akaike information criterion (AIC) is an information-theoretic model selection method based on Kullback–Leibler information loss [53]. Since the AIC evaluates model performance based on the maximized log-likelihood and the number of parameters for the fitted distribution, a distribution model with a good fit and parameter parsimony is selected as an appropriate model. The equation of the AIC is expressed as
$$\mathrm{AIC} = -2\,\ell + 2k \tag{6}$$
where $\ell$ is the maximized log-likelihood of the fitted distribution and $k$ is the number of distribution parameters. By comparing the AIC values of candidate distribution models, the model with the smallest AIC value is selected as an appropriate model, considering both the parameter parsimony and the goodness-of-fit. The AIC of each model can be rescaled as follows:
$$\Delta_i = \mathrm{AIC}_i - \mathrm{AIC}_{\min} \tag{7}$$
where $\Delta_i$ is the rescaled $\mathrm{AIC}_i$ and $\mathrm{AIC}_{\min}$ is the smallest $\mathrm{AIC}$ among all candidate models. The models with $\Delta_i \le 2$ can also be reasonable choices [53].
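As a small worked illustration, the AIC and its rescaled form can be computed as follows. The model names and log-likelihood values in the usage note are made up for the example and are not results from this study.

```python
def aic(max_log_lik, n_params):
    """AIC from the maximized log-likelihood and the number of parameters."""
    return -2.0 * max_log_lik + 2.0 * n_params


def delta_aic(aic_by_model):
    """Rescale each AIC by subtracting the smallest AIC among the candidates.

    aic_by_model: dict mapping a model label to its AIC value.
    """
    best = min(aic_by_model.values())
    return {name: value - best for name, value in aic_by_model.items()}
```

For instance, with hypothetical fits `{"ST-GEV": aic(-250.0, 3), "NS-GEV(1,0,0)": aic(-247.5, 4)}`, the AIC values are 506.0 and 503.0, so the rescaled values are 3.0 and 0.0: the extra parameter is justified by the gain in log-likelihood in this made-up case.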
The bootstrap method is generally recommended to measure the uncertainty of the quantile estimate in a nonstationary distribution model because it is computationally efficient and provides realistic asymmetric confidence intervals (CIs) without asymptotic assumptions [3,10]. For the NS-GEV, the residual bootstrap method has been employed to calculate the confidence intervals for the parameter and quantile estimates [9,14,54,55,56]. The residual bootstrap method can be conducted as follows:
(1) For the fitted GEV distribution, transform the AM time series data ($x_t$) into the standardized residuals with no trend ($\varepsilon_t$) as follows [1]:
$$\varepsilon_t = \frac{1}{\hat{\xi}} \ln\left\{1 + \hat{\xi}\left(\frac{x_t - \hat{\mu}(t)}{\hat{\sigma}(t)}\right)\right\} \tag{8}$$
where $\hat{\mu}(t)$, $\hat{\sigma}(t)$, and $\hat{\xi}$ are the location, scale, and shape parameters of the fitted GEV distribution, respectively.
(2) Obtain a new sample of $x_t$ by resampling residuals with replacement and back-transforming the resampled residuals using Equation (8).
(3) For back-transformed samples, estimate the T-year quantile at each time ($q_T(t)$, $t = 1, \ldots, n$) using the same GEV distribution.
(4) Repeat steps (2)–(3) $B$ times and calculate the time-averaged 95% CIs for the T-year quantiles ($\overline{\mathrm{CI}}_T$) as follows:
$$\overline{\mathrm{CI}}_T = \frac{1}{n}\sum_{t=1}^{n}\left[q_T^{U}(t) - q_T^{L}(t)\right] \tag{9}$$
where $q_T^{U}(t)$ and $q_T^{L}(t)$ are the 97.5th and 2.5th percentiles of $B$ ordered samples of $q_T(t)$.
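The residual bootstrap steps can be sketched as follows, under several stated assumptions: the residuals are standardized to a Gumbel scale with a nonzero shape parameter, the time-averaged CI is taken as the mean over time of the width between the 97.5th and 2.5th percentiles, and refitting is delegated to a user-supplied callable `fit` (a hypothetical interface returning the per-time location and scale lists and the shape). All names are illustrative.

```python
import math
import random


def to_residuals(x, mu, sigma, xi):
    """Standardize observations to trend-free residuals (xi != 0)."""
    return [math.log(1 + xi * (x_t - m) / s) / xi for x_t, m, s in zip(x, mu, sigma)]


def from_residuals(eps, mu, sigma, xi):
    """Inverse of to_residuals: map residuals back to the data scale."""
    return [m + s * (math.exp(xi * e) - 1) / xi for e, m, s in zip(eps, mu, sigma)]


def gev_quantile(T, mu, sigma, xi):
    """T-year quantile (non-exceedance probability 1 - 1/T), xi != 0."""
    y = -math.log(1 - 1 / T)
    return mu + sigma * (y ** (-xi) - 1) / xi


def percentile(sorted_vals, p):
    """Linearly interpolated percentile (0 <= p <= 1) of an ascending list."""
    k = (len(sorted_vals) - 1) * p
    lo, hi = math.floor(k), math.ceil(k)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)


def time_averaged_ci_width(x, mu, sigma, xi, fit, T=100, B=1000, seed=0):
    """Time-averaged width of the 95% bootstrap CI of the T-year quantile.

    fit: hypothetical callable; fit(sample) -> (mu_list, sigma_list, xi)
         refits the same GEV structure to a bootstrap sample.
    """
    rng = random.Random(seed)
    n = len(x)
    eps = to_residuals(x, mu, sigma, xi)                    # step (1)
    samples = [[] for _ in range(n)]
    for _ in range(B):                                      # step (4)
        eps_star = [rng.choice(eps) for _ in range(n)]      # step (2)
        x_star = from_residuals(eps_star, mu, sigma, xi)
        mu_b, sigma_b, xi_b = fit(x_star)                   # step (3)
        for t in range(n):
            samples[t].append(gev_quantile(T, mu_b[t], sigma_b[t], xi_b))
    widths = []
    for q in samples:
        q.sort()
        widths.append(percentile(q, 0.975) - percentile(q, 0.025))
    return sum(widths) / n
```

Passing the refit step in as a callable keeps the resampling logic independent of the particular optimizer or covariate structure used for each candidate model.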
In this study, the model selection procedure combining the AIC and residual bootstrap method is performed to select an appropriate GEV model considering both model fit and uncertainty. First, the AIC values of all GEV candidates are calculated, and the GEV distributions with a rescaled AIC of $\Delta_i \le 2$ are selected. Second, for each selected GEV distribution, the residual bootstrap method is repeated 1000 times to calculate the time-averaged 95% CIs for the 100-year quantiles ($\overline{\mathrm{CI}}_{100}$). Finally, the GEV distribution with the smallest $\overline{\mathrm{CI}}_{100}$ is selected as an appropriate distribution model.
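The two-stage selection can be sketched as a short function: shortlist the candidates by rescaled AIC, then pick the shortlisted model with the narrowest time-averaged CI. The function name, the dictionary layout, and the default cutoff of 2 (a common rule of thumb for the rescaled AIC) are assumptions made for illustration.

```python
def select_model(aic_by_model, ci_width_by_model, threshold=2.0):
    """Pick the model with the smallest CI width among well-fitting candidates.

    aic_by_model:      dict, model label -> AIC value.
    ci_width_by_model: dict, model label -> time-averaged 95% CI width
                       of the 100-year quantile.
    threshold:         cutoff on the rescaled AIC (assumed default).
    """
    best_aic = min(aic_by_model.values())
    # Stage 1: keep candidates whose rescaled AIC is within the cutoff.
    shortlist = [m for m, a in aic_by_model.items() if a - best_aic <= threshold]
    # Stage 2: among those, choose the smallest uncertainty.
    return min(shortlist, key=lambda m: ci_width_by_model[m])
```

With made-up inputs such as AIC values `{"A": 500.0, "B": 501.5, "C": 510.0}` and CI widths `{"A": 120.0, "B": 90.0, "C": 40.0}`, model C is excluded by its fit despite the narrowest interval, and model B is chosen over A because its interval is narrower at a comparable fit.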
6. Conclusions
In nonstationary extreme value modeling, it is essential to reflect the trend in statistical characteristics, such as the mean and variance of the observations, because nonstationarity is generally considered by the time-dependent location and/or scale parameters of the probability distribution model. Using physically meaningful information, such as climate indices as covariates, a probability distribution model can be improved in terms of not only considering nonstationarity but also reducing uncertainty. In this study, the procedure of extreme value modeling with large-scale climate modes was proposed, from the identification of appropriate SCIs impacting regional AM rainfall based on EEMD to the selection of appropriate NS-GEV, considering both model fit and uncertainty. Using the EEMD, residues that indicate the long-term trends in the mean and variance were extracted from AM daily rainfall. Subsequently, the correlation coefficient was calculated between the extracted residues and various lagged SCIs. After the identification of appropriate SCIs, the AMM_SON(−1), AMO_SON(−1), and NAO_JJA(−1) were finally selected as significant SCIs with an impact on the long-term trends in both the mean and variance of AM daily rainfall over South Korea. The combinations of these significant SCIs were employed as covariates of location and scale parameters for constructing various NS-GEVs.
As the uncertainty increases with the complexity of the probability distribution model, it is necessary to consider the uncertainty to select an appropriate probability distribution model. However, there have been few studies that consider uncertainty in the model selection procedure. In this study, the model selection procedure combining the AIC and residual bootstrap method was proposed to select an appropriate GEV model among ST-GEV, NS-GEV with time covariate, and NS-GEV with SCI covariates. As a result, the NS-GEV with SCI covariates was selected as an appropriate model for more than 65% of the applied stations in South Korea.
Uncertainty is currently a main issue in nonstationary extreme value modeling. Rather than simply using the time covariate, employing appropriate climate indices as covariates can reduce uncertainty. The EEMD could be employed to detect the significant climate indices for use as covariates of the nonstationary extreme value model because it can extract the long-term trend inherent in the time series. Further, by selecting an appropriate probability distribution model considering both model fit and uncertainty, more accurate and reliable quantiles could be estimated. Although this study focused on the determination of significant climate indices and their application to extreme value modeling of extreme rainfall over South Korea, the method and discussion presented are expected to be extended to various hydrological variables, as well as to other regions.