1. Introduction
Frequency analysis represents a direct method of determining extreme events in hydrology that may be expected to occur for a given period.
The analysis involves the use of probability distributions to estimate the quantiles of rare events with given return periods, corresponding to small exceedance probabilities, for which, in most cases, there are no recorded data.
In general, the analysis uses distributions of at least three parameters, in order to calibrate the high-order moments as much as possible. However, there are also situations in which the use of two-parameter probability distributions is preferred, such as the determination of maximum flow, maximum precipitation, and maximum volumes, or the use of probability density functions as synthetic unit hydrographs (SUH).
Considering that many researchers still use two-parameter distributions in frequency analyses of extreme events in hydrology, the article aims to provide guidelines for the proper use of these types of statistical distributions. It also presents new elements and relations for a significant number of distributions applicable in frequency analysis. These distributions have certain advantages over the usual two-parameter distributions in hydrology (such as Gumbel, log-logistic, and exponential), including greater flexibility in modeling the different forms of heavy-tailed curves. This advantage is mainly due to the fact that they are characterized by varying values of the higher-order indicators (coefficient of variation, skewness, kurtosis), which is not the case for distributions such as Gumbel, log-logistic, or exponential.
Thus, this manuscript presents some of the most used two-parameter distributions, i.e., Gumbel (G), gamma (Γ2), log-normal (LN2), logistic (L2), log-logistic (LL2), Weibull (W2), Fréchet (F2), shifted exponential (E2), and generalized exponential (GE), but also some less analyzed distributions that are applicable in these types of analyses, i.e., the Rayleigh (R2), Pearson V (PV2), chi (CH2), and inverse chi (ICH2) distributions. The limited use of these last distributions by researchers is due to the complexity of the equations needed to estimate the parameters, as well as the lack of closed forms of the inverse function. It should be mentioned from the beginning that the manuscript does not exclude the possibility of using other distributions and methods in the frequency analysis of extreme events in hydrology; it only gives researchers a larger number of distributions to choose from, which benefits a holistic analysis from a statistical and hydrological point of view.
Why is the presentation of these distributions important? Two-parameter distributions are still widely used in flood frequency analysis (FFA), and the diagrams and variation relations of L-skewness and L-kurtosis allow a pre-selection of the distributions that can be successfully applied with the L- and LH-moments methods while respecting the performance criteria specific to these methods. The analysis is thus simplified and the calculation time greatly reduced, so that distributions with a larger number of parameters may not be needed. For this purpose, the case study includes a four-parameter distribution, namely the Burr distribution, which calibrates all the linear moments exactly and is therefore characterized, from a statistical point of view, by a high degree of confidence, but whose parameter estimation requires solving a system of four non-linear equations.
The proposed distributions cover a wide variety of extreme events analyzed in hydrology: the Gumbel distribution is usually used to obtain the intensity–duration–frequency curves of rainfall [1,2,3,4,5,6], to determine the maximum flows [7,8,9,10,11], and to determine the minimum flows [12,13,14,15]; the gamma distribution is often used for SUH design [16,17,18,19,20,21], as well as for the analysis of maximum flows [22,23,24]; the log-logistic distribution is applied to determine the maximum precipitation [25,26,27,28] and the maximum flows [25,29,30]; Hussain [28] applied the Fréchet distribution to determine the maximum precipitation; Ahilan et al. [31,32] analyzed the Fréchet distribution to determine the maximum flow rates; Bhunya et al. [19,20,21] used the two-parameter Weibull distribution for SUH design.
In the case of the W2, F2, and PV2 distributions, the new contributions for the MOM and the L-moments method were presented in previous work [33].
In this article, only the new elements regarding these thirteen distributions and their applicability are presented, such as the relations for parameter estimation using the first and second order LH-moments. Moreover, new elements regarding parameter estimation with MOM and L-moments for the Rayleigh, chi, and inverse chi distributions, and improved estimation relations for the gamma distribution, compared to the existing approximate relations in the literature, are introduced for the first time in FFA. All these new elements are centralized in Table 1.
Considering the diverse use of two-parameter distributions in frequency analysis in hydrology, the main purpose of the article is the presentation of exact and approximate parameter estimation relationships for these probability distributions, as well as the presentation of the relationships and diagrams necessary to choose the best distribution using the L-moments method and the LH-moments method. For these methods and distributions, the necessary relationships for parameter estimation and rigorous criteria for selecting the best model have not been presented so far and appear here for the first time.
In general, the main disadvantage of using these two-parameter distributions is the fact that, regardless of the parameter estimation method, they cannot calibrate the higher-order moments (skewness, kurtosis, L-skewness, L-kurtosis, LH-skewness, and LH-kurtosis). This aspect makes it extremely difficult to choose the best distribution. Another disadvantage is the fact that many two-parameter distributions, such as Gumbel, shifted exponential, normal, uniform, Weibull, Fréchet, etc., have constant values of these statistical indicators [22,34]. This disadvantage is greatly reduced in the case of the Γ2, LN2, CH2, and ICH2 distributions, where the higher order statistical indicators (skewness, kurtosis, L-skewness, L-kurtosis, LH-skewness, and LH-kurtosis) have distinct values depending on the corresponding second order statistical indicator (coefficient of variation, L-coefficient of variation, or LH-coefficient of variation). Thus, the variation diagrams of these statistical indicators, for the L- and LH-moments methods, represent another element of novelty and constitute the main criterion for selecting the best distribution. With these two parameter estimation methods, the best distribution is the one whose theoretical L- and LH-skewness and L- and LH-kurtosis differ least from those of the analyzed data set, especially for small (n < 25) and medium (n < 50) series of observed data; the L- and LH-kurtosis are preferred because, in general, they indicate the heavy tail tendency [34]. In the case of large data sets (n > 80), additional criteria can be applied, namely performance indicators and tests (relative mean error, relative absolute error, Kolmogorov–Smirnov, Anderson–Darling, etc.).
In the case of parameter estimation with MOM, although it is possible to create theoretical diagrams of the variation of skewness and kurtosis depending on the coefficient of variation, the big disadvantage is that the method is much less stable and robust to the variability of the length of the recorded data sets, with the skewness and kurtosis requiring important corrections, a requirement that very often cannot be fulfilled. Therefore, in the case of this method, for short and medium data sets, the selection criterion of the best distribution is not a rigorous one, involving subjective aspects that should not be found in such an analysis.
Another important element of the frequency analysis is the need to present the relative errors depending on the variability of the length of the observed data, knowing that this variability implies uncertainties on three levels, namely, in the estimation of statistical indicators, in the estimation of distribution parameters, and in the estimation of quantiles [25].
The manuscript presents as a case study the determination of the maximum flows on the Siret river, Romania, but the main purpose is to present the mathematical foundation of these distributions and parameter estimation methods, presenting for the first time explicit relations (exact and approximate) of parameter estimation using these methods. A comparative analysis is made regarding the advantages and disadvantages of the different parameter estimation methods. Rigorous criteria for choosing the best distribution and methods are provided.
Considering that the vast majority of the distributions presented (using these methods) are not included in dedicated programs, it is imperative to present the equations necessary for the use of these distributions, for the verification and reproduction of the analysis, but most importantly for facilitating the use of these distributions by researchers. These represent aspects of particular importance in a research that must be transparent and objective.
Only these parameter estimation methods are analyzed because they are used in the processes of regionalization of extreme events. Statistical regionalization is usually carried out by identifying homogeneous regions (HRs), where some theoretical moments are assumed as constant, while others can be expressed as functions of specific geomorphic covariates [35]. HRs are often identified by adopting: (1) geographical or administrative criteria; (2) cluster analysis [36,37,38,39]; or (3) artificial intelligence techniques [40]. Moreover, approaches that do not define fixed-boundary regions [41,42,43,44] are receiving increasing attention [45,46,47].
Taking into account that the existing regulations in Romania [48] also recommend the use of two-parameter distributions in determining the maximum flow rates, it is important that their application is performed appropriately and with mathematical rigor.
This research is part of a larger research effort to develop normative proposals regarding the frequency analysis of extreme events, by identifying methods and probability distributions characterized by mathematical and hydrological rigor.
The manuscript brings new contributions regarding the frequency analysis of maximum flows using two-parameter distributions, such as parameter estimation relations for the MOM, L-, and LH-moments methods, rigorous criteria regarding the choice of the best distribution, and diagrams and relationships of the L-coefficient of variation versus L-skewness and of the L-coefficient of variation versus L-kurtosis.
3. Study Area and Data
This study analyzes the Siret river, by performing a frequency analysis of the maximum annual flows, using the methods and probability distributions presented.
The Siret River (cadastral code XII) is located in the north-eastern part of Romania and is a left tributary of the Danube River.
The Siret river system is 647 km long and originates in Ukraine, in the Wooded Carpathians [59,60,61].
Of the total length, 559 km (middle and lower Siret) are on the territory of Romania. It has the largest hydrographic basin on the territory of Romania, namely, 42,890 km², representing approximately 18.1% of the country’s surface.
The river has an average slope of 1.7‰, a sinuosity coefficient of 1.86, an average altitude of 515 m, and a forestation coefficient of 0.37 [61].
Figure 2 shows the hydrographic basin of the Siret river [60].
The maximum flows characteristic of this river are the result of snowmelt superimposed on rainfall, usually occurring in the spring–summer months [62].
The river has a modified regime, with various retention works that influence the maximum annual values. In the Siret hydrographic watershed there are more than 100 multi-use reservoirs, but their contribution to flood mitigation is relatively low, because the mitigation volumes are small compared to the volumes of floods with a peak flow greater than the flow with a 10% exceedance probability.
Table 3 shows the chronological series of the maximum annual flows recorded at the Lungoci hydrometric station, in the period 1970–2008, according to the Romanian Normative NP 129 [48]. The recorded series consists of 39 annual maximum flows. The maximum flow rates of the analyzed series vary between 275 m³/s and 4650 m³/s.
Figure 3 shows the observed data, on the Siret River, Romania, for the entire analysis period.
From a statistical point of view, the three parameter estimation methods require the determination of some characteristic statistical indicators, such as the arithmetic mean, the mean square deviation, the coefficient of variation, the skewness, the kurtosis, the L- and LH-coefficients of variation, the L- and LH-skewness, and the L- and LH-kurtosis. The values of these indicators, related to the analyzed data set, are presented in Table 4.
The values of the variation indicators confirm a low variation in maximum annual flows. The value of the skewness, which often requires certain corrections, confirms that the distribution is skewed to the right. In the case of the LH-moments methods, substantial changes in the indicators can be observed, because the analysis assigns less importance to the lower extreme values.
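As an illustration of how the L-moment indicators listed above can be obtained from a recorded series, the following minimal sketch uses the standard unbiased probability-weighted-moment estimators. The function name `sample_l_moments` and the returned dictionary keys are hypothetical choices made for this example, and the sketch does not cover the LH-moments.

```python
from math import comb


def sample_l_moments(data):
    """Sample L-moments and L-moment ratios from unbiased PWM estimators.

    Illustrative helper (name and return layout are assumptions).
    """
    x = sorted(data)
    n = len(x)
    if n < 4:
        raise ValueError("at least four values are needed for L-kurtosis")
    # b_r = (1/n) * sum_i C(i, r) / C(n-1, r) * x_(i), with zero-based rank i
    b = [
        sum(comb(i, r) * x[i] for i in range(n)) / (n * comb(n - 1, r))
        for r in range(4)
    ]
    l1 = b[0]
    l2 = 2 * b[1] - b[0]
    l3 = 6 * b[2] - 6 * b[1] + b[0]
    l4 = 20 * b[3] - 30 * b[2] + 12 * b[1] - b[0]
    return {
        "l1": l1,            # L-location (arithmetic mean)
        "l2": l2,            # L-scale
        "tau2": l2 / l1,     # L-coefficient of variation
        "tau3": l3 / l2,     # L-skewness
        "tau4": l4 / l2,     # L-kurtosis
    }
```

Applied to the 39 annual maxima of Table 3, such a function would return the L-indicators needed for the selection diagrams; the values themselves are not reproduced here.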
4. Results and Discussion
The purpose of the analysis is to verify the performances of the described methods and distributions, in order to forecast the values of the quantiles corresponding to extremely rare events.
4.1. Verification of Independence, Stationarity, and Homogeneity
In a first stage, the criteria regarding the independence, stationarity, and homogeneity of the analyzed data were checked.
The independence of the data is ensured by the selection criterion, the data representing the maximum value corresponding to each year. However, it is possible for two maximum values to belong to different years but still come from the same flood, especially if the flood is multimodal. Thus, data independence was verified with the Cunnane criterion and the USWRC 1976 criterion [63,64].
The analyzed data are homogeneous, the value of the von Neumann test [65] being greater than 2, namely, 2.026.
Stationarity was checked with the t-test [66] at the 5% level of significance. The calculated value of the test is 0.277, below the critical value of 2.03, so stationarity is not rejected. It should be mentioned that climate change leads to a possible loss of this stationarity.
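For readers who want to reproduce the homogeneity check, a minimal sketch of the von Neumann ratio is given below. It assumes the usual definition (sum of squared successive differences divided by the sum of squared deviations from the mean), whose expected value is 2 for a homogeneous series; the function name is a hypothetical choice, and the exact variant used for the value reported above is not stated in the text.

```python
def von_neumann_ratio(series):
    """Von Neumann ratio of a chronologically ordered series.

    Assumed definition: sum (x_i - x_{i+1})^2 / sum (x_i - mean)^2.
    """
    n = len(series)
    if n < 2:
        raise ValueError("at least two values are required")
    mean = sum(series) / n
    numerator = sum((series[i] - series[i + 1]) ** 2 for i in range(n - 1))
    denominator = sum((value - mean) ** 2 for value in series)
    if denominator == 0:
        raise ValueError("the series is constant")
    return numerator / denominator
```

Values well below 2 would point to a trend or a break in the series, so the annual maxima must be passed in chronological order rather than sorted.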
4.2. Quantile Estimation
Taking into account that two-parameter distributions still have an extended applicability in these types of analyses, for the case study of the Siret river, different two-parameter distributions from different families, with different particularities, are applied.
The obtained results differ depending on the characteristics of the distributions and on the estimation methods of the parameters, but also on the influence of the variability of the lengths of the data sets recorded in the estimation of the statistical indicators, parameters, and estimated quantiles.
Compared to other parameter estimation methods, these three analyzed methods are based on the determination of some statistical indicators related to each method, representing a real advantage in the regionalization process.
Figure 4a–d graphically presents the comparison of the results obtained for the Siret river. The empirical probability used for the graphic representation of the recorded values is the Hazen probability [22,55], based on the considerations presented in Section 4.3. A decimal logarithmic scale was used on the horizontal axis in order to highlight the heavy tail (the domain of rare events).
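The Hazen plotting position mentioned above assigns to the i-th largest of n values the empirical exceedance probability (i − 0.5)/n. A minimal sketch, with a hypothetical function name, could look as follows.

```python
def hazen_exceedance(flows):
    """Pair each flow with its Hazen empirical exceedance probability.

    Flows are ranked in descending order; rank i (1-based) gets (i - 0.5) / n.
    """
    n = len(flows)
    ranked = sorted(flows, reverse=True)
    return [(q, (i - 0.5) / n) for i, q in enumerate(ranked, start=1)]
```

For a series of 39 values, the largest flow is therefore plotted at an exceedance probability of 0.5/39, about 1.28%.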
Following the presented results, it can be observed that the quantile values differ substantially depending on the distribution and the parameter estimation method. Some distributions are characterized by a heavy tail overestimating the values (Fréchet distribution), while other distributions underestimate the values (Rayleigh distribution).
The analysis focused on the estimation of the quantiles for exceedance probabilities between 0.01% and 1%. The analysis of the resulting values must always be performed separately for each estimation method. For all three parameter estimation methods, the extreme curves (upper and lower) are those defined by the Fréchet and Rayleigh distributions, respectively.
In the case of MOM, the maximum values obtained (p = 0.01%) are 20,886 m³/s for the Fréchet distribution and 5687 m³/s for the Rayleigh distribution. For the L-moments method, the quantiles of the two distributions are 45,446 m³/s and 5494 m³/s, respectively. In the case of the two LH-moments methods, the quantiles are 28,814 m³/s and 5859 m³/s for the first level of LH-moments, and 23,709 m³/s and 6156 m³/s for the second level.
Analyzing the data obtained with first and second level LH-moments, one can observe the effect of separating the maximum flow rates, assigning greater importance to the values that actually represent floods.
4.3. Performance Criterion
Considering the varied range of results obtained, a crucial stage in such frequency analyses is the choice of the best probability distribution.
In general, for the selection of the best distribution, certain statistical tests are applied (Kolmogorov–Smirnov, Anderson–Darling, chi-square test, Lilliefors test, Cramér–von Mises, Shapiro–Wilk, etc.) [65] or performance indicators (relative mean error, mean absolute error, Kling–Gupta, Nash–Sutcliffe, root mean square error, etc.) [23,34,65,66], but these are relevant only when the data series are long enough (considered to be n > 80).
In the case of MOM estimation (as in the case of the maximum likelihood method), the big disadvantage is that for small and medium data sets there are no rigorous selection criteria, and in general the analysis is characterized by a high degree of subjectivity, involving only a visual choice based on graphs [67].
Regarding the L- and LH-moments methods, they currently have more rigorous selection criteria. The selection of the best distribution is made exclusively on the basis of the calibration of the statistical indicators of the data series, which are less sensitive to the variability of data lengths [22,34,64]. The selection criterion is the calibration of the indicators of order three (L-skewness) and four (L-kurtosis), and the methods assume the calibration of all these ratios of linear moments [33,68,69].
Two-parameter distributions properly calibrate only the L-coefficient of variation (the ratio of the second-order L-moment to the first-order L-moment). Regarding the moments of order three and four, many two-parameter distributions (Gumbel, normal, shifted exponential, etc.) have constant values of these indicators, which is a big disadvantage. Thus, this article presents distributions for which these indicators are not constant but vary with the L-coefficient of variation. It also presents, as the main criterion for selecting the best distribution, the relationships and variation graphs of the L-coefficient of variation versus L-skewness and of the L-coefficient of variation versus L-kurtosis, thus substantially improving the rigor of the selection (Appendix B and Appendix C). The choice of the best distribution is made using the fourth-order linear moment criterion.
These criteria are important for selecting the best two-parameter distribution with L- and LH-moments estimation because, mathematically, the vast majority of two- and three-parameter probability distributions are defined by the same variation curves of L-skewness versus L-kurtosis (Appendix F shows, as an example, the log-normal distribution with two and three parameters).
Thus, Table 5 presents the values of the theoretical statistical indicators of the distributions obtained based on the analyzed series, as well as the values of two performance indicators (relative mean error and relative absolute error).
The sample and theoretical values of the L- and LH-coefficients of variation are not presented, because all the analyzed distributions calibrate these indicators.
Considering the relatively short length of the analyzed series (n = 39), it is not possible to establish a performance criterion using the resulting values of the two performance indicators.
The relative mean error (RME) and the relative absolute error (RAE) are computed from the sample size, the observed values, and the values estimated for the corresponding probabilities.
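The expressions of the two indicators are not reproduced in this text. The sketch below assumes forms that are common in the flood frequency literature, namely RME = (1/n)·sqrt(Σ((x_i − x̂_i)/x_i)²) and RAE = (1/n)·Σ|(x_i − x̂_i)/x_i|, where x_i is an observed value and x̂_i is the value estimated for the same empirical probability. Both the formulas and the function names are assumptions and should be checked against the original equations.

```python
import math


def relative_mean_error(observed, estimated):
    """Assumed form: RME = (1/n) * sqrt(sum(((x - x_hat) / x)^2))."""
    n = len(observed)
    total = sum(((x - x_hat) / x) ** 2 for x, x_hat in zip(observed, estimated))
    return math.sqrt(total) / n


def relative_absolute_error(observed, estimated):
    """Assumed form: RAE = (1/n) * sum(|(x - x_hat) / x|)."""
    n = len(observed)
    return sum(abs((x - x_hat) / x) for x, x_hat in zip(observed, estimated)) / n
```

Both functions expect the observed and estimated values in the same order, for instance sorted in descending order and paired through the empirical probability.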
In terms of the obtained results, the LN2 distribution performs best, with the values of its L-kurtosis indicators (0.192 for the L-moments method, 0.185 for the first level LH-moments method, and 0.185 for the second level LH-moments method) being the closest to those of the analyzed series. The graphic confirmation can be found in Figure A1, Figure A2, Figure A3 and Figure A4 from Appendix B, Appendix C, Appendix D and Appendix E.
For MOM, as previously mentioned, it is difficult to choose the best distribution based on statistical tests and performance indicators, especially in the case of short and medium data sets. Based on the graphical analysis of the obtained results for the Siret river, the best two-parameter distribution using MOM is also the LN2 distribution.
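To make the LN2 fit concrete, the sketch below estimates the two log-normal parameters from either the ordinary moments (mean and coefficient of variation) or the L-moments (mean and L-coefficient of variation), and then evaluates a quantile for a given exceedance probability. It relies on the standard log-normal identities Cv² = exp(σ²) − 1 and τ2 = erf(σ/2); the function names are hypothetical and this is not necessarily the exact form of the relations given in the paper's appendices.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def ln2_from_mom(mean, cv):
    """LN2 parameters (mu, sigma) from the mean and coefficient of variation."""
    sigma = math.sqrt(math.log(1.0 + cv ** 2))
    mu = math.log(mean) - 0.5 * sigma ** 2
    return mu, sigma


def ln2_from_l_moments(l1, tau2):
    """LN2 parameters from the mean (l1) and the L-coefficient of variation.

    Uses tau2 = erf(sigma / 2), i.e. sigma = sqrt(2) * Phi^-1((1 + tau2) / 2).
    """
    sigma = math.sqrt(2.0) * _STD_NORMAL.inv_cdf((1.0 + tau2) / 2.0)
    mu = math.log(l1) - 0.5 * sigma ** 2
    return mu, sigma


def ln2_quantile(mu, sigma, p_exceed):
    """Quantile with exceedance probability p_exceed (e.g. 0.0001 for 0.01%)."""
    return math.exp(mu + sigma * _STD_NORMAL.inv_cdf(1.0 - p_exceed))
```

With the mean of the recorded series and the indicators of Table 4 as inputs, `ln2_quantile(mu, sigma, 0.0001)` would give the flow with a 0.01% exceedance probability for each estimation method.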
A correct use of two-parameter distributions (using the L- and LH-moments method) can replace a more laborious analysis that requires the use of at least four-parameter distributions for the calibration of all linear moments characteristic of the L and LH-moments methods. Thus, Figure 5 compares the results of the LN2 distribution (the best model result) with those obtained with the four-parameter Burr distribution that fulfills all four conditions imposed by the method of linear moments.
It can be easily observed that the two curves are very close. For the flow with an exceedance probability of 0.01%, the values generated by the two distributions are 11,972 m³/s for LN2 and 11,128 m³/s for the Burr distribution, a relative difference of 7.6%, which is acceptable for the forecast of such very rare events.
4.4. The Influence of the Variability of the Length of the Recorded Series
Any analysis is characterized by certain errors. In the case of the frequency analysis, it is mandatory to present the relative errors specific to the distribution chosen as the best model. Thus, regardless of the parameter estimation methods, it is necessary to present the relative errors in the estimation of statistical indicators, parameters, and quantiles, the latter being the most important. Thus, for exemplification, three cases are presented, for the three distributions chosen as the best models, for the three methods.
Thus, taking into account the results obtained, both for MOM and L-moments, the recorded values are assumed to come from an LN2 distribution.
Starting from the theoretical values of the statistical indicators (characteristic of a large number of values, 1000 < n < 5000), the relative errors are determined for samples of 80, 50, and 25 values, and for a sample of 39 values, as in the case of the Siret river.
In the case of MOM, two usual values of the coefficient of variation are chosen, namely, 1 and 2, and 0.634, corresponding to that of the analyzed data.
In the analysis with L-moments, the values chosen for the L-coefficient of variation were 0.3, 0.5, 0.75, and 0.339, the last corresponding to the analyzed series. In both methods, for simplification, the arithmetic mean is set to 1. The theoretical values of the L-skewness and L-kurtosis are determined, as functions of the L-coefficient of variation, with the following new relationships:
The values of the statistical indicators, the estimated parameters, and the resulting quantiles, characteristic of the sampling, are presented in Table 6, Table 7 and Table 8.
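The dependence of the LN2 L-skewness and L-kurtosis on the L-coefficient of variation can also be checked numerically. The sketch below is not the closed-form approximation proposed in the article; it is a stand-in that integrates the log-normal quantile function against the shifted Legendre polynomials, using τ2 = erf(σ/2) to recover σ. Function names, the integration grid, and the tolerances are assumptions.

```python
import math


def ln2_sigma_from_tau2(tau2, tol=1e-12):
    """Invert tau2 = erf(sigma / 2) by bisection (valid for 0 < tau2 < 1)."""
    lo, hi = 0.0, 20.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if math.erf(mid / 2.0) < tau2:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def ln2_l_ratios(sigma, steps=4000, z_max=12.0):
    """Return (tau2, tau3, tau4) of a log-normal with log-scale sigma.

    lambda_r = integral of exp(sigma * z) * P*_{r-1}(Phi(z)) * phi(z) dz,
    evaluated with an equally spaced rule on [-z_max, z_max] (mu = 0).
    """
    h = 2.0 * z_max / steps
    lam = [0.0, 0.0, 0.0, 0.0]
    for i in range(steps + 1):
        z = -z_max + i * h
        f = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF
        weight = h * math.exp(sigma * z - 0.5 * z * z) / math.sqrt(2.0 * math.pi)
        legendre = (
            1.0,
            2.0 * f - 1.0,
            6.0 * f * f - 6.0 * f + 1.0,
            20.0 * f ** 3 - 30.0 * f * f + 12.0 * f - 1.0,
        )
        for r in range(4):
            lam[r] += weight * legendre[r]
    return lam[1] / lam[0], lam[2] / lam[1], lam[3] / lam[1]
```

Calling `ln2_l_ratios(ln2_sigma_from_tau2(0.339))` gives the theoretical L-skewness and L-kurtosis for the L-coefficient of variation of the analyzed series, which can be compared with the LN2 L-kurtosis reported in Section 4.3.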
Considering that in the frequency analysis the main element is the estimation of the quantiles, for the exceedance probabilities of interest (0.01, 0.1, 0.5, and 1%), the errors corresponding to the analyzed series have values of 7.59, 6.14, 4.96, and 4.38% using MOM, and 0.54, 0.47, 0.43, and 0.4% using L-moments. It can be seen that the L-moments method is much more stable and robust to the variability of the data series lengths. This method is also stable in the presence of outliers [34].
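One way to explore the sensitivity of the two estimation methods to the series length is a small Monte Carlo experiment. This is a swapped-in technique and not necessarily the procedure behind Table 6, Table 7 and Table 8: samples of size n are drawn from an LN2 distribution with unit mean, the 0.01% quantile is re-estimated with MOM and with L-moments, and the mean absolute relative error is reported for each. All names, the number of runs, and the seed are assumptions.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def l_cv(sample):
    """Sample L-coefficient of variation from unbiased PWMs."""
    x = sorted(sample)
    n = len(x)
    b0 = sum(x) / n
    b1 = sum(i * x[i] for i in range(n)) / (n * (n - 1))
    return (2.0 * b1 - b0) / b0


def quantile_errors(n, cv=0.634, p_exceed=0.0001, runs=500, seed=1):
    """Mean absolute relative error of the LN2 quantile: (MOM, L-moments)."""
    rng = random.Random(seed)
    normal = NormalDist()
    z = normal.inv_cdf(1.0 - p_exceed)
    sigma = math.sqrt(math.log(1.0 + cv ** 2))
    mu = -0.5 * sigma ** 2  # unit mean
    true_q = math.exp(mu + sigma * z)
    err_mom = err_lmom = 0.0
    for _ in range(runs):
        s = [rng.lognormvariate(mu, sigma) for _ in range(n)]
        m = mean(s)
        # Method of moments: sigma from the sample coefficient of variation
        s_mom = math.sqrt(math.log(1.0 + (stdev(s) / m) ** 2))
        q_mom = m * math.exp(-0.5 * s_mom ** 2 + s_mom * z)
        # L-moments: sigma from tau2 = erf(sigma / 2)
        s_lm = math.sqrt(2.0) * normal.inv_cdf((1.0 + l_cv(s)) / 2.0)
        q_lm = m * math.exp(-0.5 * s_lm ** 2 + s_lm * z)
        err_mom += abs(q_mom - true_q) / true_q
        err_lmom += abs(q_lm - true_q) / true_q
    return err_mom / runs, err_lmom / runs
```

Running `quantile_errors(n)` for n = 25, 39, 50, and 80 gives a rough picture of how the quantile error shrinks with the series length for each method; the figures obtained this way are sampling errors and are not directly comparable with the values quoted above.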
4.5. Confidence Interval
Considering all these uncertainties regarding the estimation of the quantile values, it is necessary to determine and graphically represent the confidence interval of the inverse function.
It will be represented only for the distribution chosen as the best model for the flow series of the Siret river, namely, the LN2 distribution.
For MOM, the confidence interval is determined based on the relationships presented in the literature [22] specific to this method, as well as on Chow’s relationship under a Gaussian assumption [22,55,64,68,69], an approach that is accessible to all practitioners, which is an important aspect in the development of a standard regarding maximum flows. For the L-moments and LH-moments methods, the confidence interval is determined based on Chow’s relationship, but using the frequency factor with the parameters estimated with these two methods, for a significance level of 10%.
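A minimal sketch of a frequency-factor confidence interval under a Gaussian assumption is given below. It assumes one common form of the relationship: the frequency factor is K = (x_p − mean)/std, the standard error of the quantile is (std/√n)·sqrt(1 + K²/2), and the limits are x_p ± z·SE with z taken from the standard normal distribution for the chosen significance level. The exact expression used in the article may differ, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def chow_confidence_interval(quantile, mean, std, n, significance=0.10):
    """Two-sided confidence limits for a quantile (Gaussian assumption).

    Assumed form: SE = std / sqrt(n) * sqrt(1 + K^2 / 2), K = (x_p - mean) / std.
    """
    k = (quantile - mean) / std                     # frequency factor
    standard_error = std / math.sqrt(n) * math.sqrt(1.0 + 0.5 * k * k)
    z = NormalDist().inv_cdf(1.0 - significance / 2.0)
    return quantile - z * standard_error, quantile + z * standard_error
```

For the L- and LH-moments methods, `quantile` would be the value given by the distribution fitted with those methods, so that the frequency factor reflects the corresponding parameter estimates.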
Figure 6 shows the confidence intervals for the chosen model (LN2) and for all the methods used.
There are other methods for determining the confidence interval, such as the standard bootstrap technique or parametric Monte Carlo simulations, but these require a more complex analysis, are not accessible to everyone, and are still characterized by certain limitations [70,71,72].
5. Conclusions
The article presents a significant number of two-parameter distributions, from different families. New relationships are presented regarding the estimation of their parameters with three methods often used in frequency analysis. The research is motivated by the wide use of two-parameter distributions in the frequency analysis of extreme events [73,74,75], and by the fact that these distributions, for these methods, are not included in programs dedicated to these types of analyses.
An exhaustive analysis regarding the choice of the best model, characteristic of each method, is also presented, as well as a comprehensive analysis regarding the determination of the relative errors, influenced by the variability of the lengths of the data sets. The priority criterion for the selection of the best distribution is the pair formed by the L-coefficient of variation and the L-kurtosis, because it captures the heavy tail tendency more clearly than the pair formed by the L-coefficient of variation and the L-skewness. This avoids the choice of two-parameter distributions that underestimate the values of rare events (low exceedance probabilities).
For the Siret River, following the analysis and the results obtained for the recorded data, the LN2 distribution gives the best results, with the important observation that, for short and medium data sets, the choice in the case of MOM always involves some subjective aspects, as there are no sufficiently rigorous criteria. For the recommended methods (L- and LH-moments), of the thirteen analyzed distributions, the log-normal distribution had the best results, with the theoretical values of the L-coefficient of variation and L-kurtosis best approximating the corresponding values of the recorded data.
Taking into account all the elements presented and the analysis performed, especially the comparative analysis with the four-parameter Burr distribution, it can be concluded that the best method for using two-parameter distributions in FFA is the method of linear moments, because there are more rigorous criteria of choosing the best model.
All this information related to frequency analysis, using these methods and two-parameter distributions, will help researchers to perform a well-founded and rigorous analysis. In general, the methods and criteria are also valid in the case of frequency analysis using distributions of three or more parameters.