Article

Bayesian Technique for the Selection of Probability Distributions for Frequency Analyses of Hydrometeorological Extremes

1
College of Hydropower & Information Engineering, Huazhong University of Science & Technology, Wuhan 430074, China
2
Department of Biological and Agricultural Engineering & Zachry Department of Civil Engineering, Texas A&M University, College Station, TX 77843-2117, USA
*
Author to whom correspondence should be addressed.
Entropy 2018, 20(2), 117; https://doi.org/10.3390/e20020117
Submission received: 13 November 2017 / Revised: 11 January 2018 / Accepted: 16 January 2018 / Published: 11 February 2018
(This article belongs to the Special Issue Entropy Applications in Environmental and Water Engineering)

Abstract

Frequency analysis of hydrometeorological extremes plays an important role in the design of hydraulic structures. A multitude of distributions have been employed for hydrological frequency analysis, and more than one distribution is often found to be adequate for a given data set. The current methods for selecting the best fitted distribution are not fully objective. Using different kinds of constraints, entropy theory was employed in this study to derive five generalized distributions for frequency analysis. These distributions are the generalized gamma (GG) distribution, the generalized beta distribution of the second kind (GB2), the Halphen type A distribution (Hal-A), the Halphen type B distribution (Hal-B), and the Halphen type inverse B distribution (Hal-IB). A Bayesian technique was employed to objectively select the optimal distribution. The method of selection was tested using simulation as well as extreme daily and hourly rainfall data from the Mississippi. The results showed that the Bayesian technique was able to select the best fitted distribution, thus providing a new way of selecting models for frequency analysis of hydrometeorological extremes.

1. Introduction

Frequency analysis of hydrometeorological extremes plays an important role in the design of structures, such as dams, bridges, culverts, levees, highways, sewage disposal plants, waterworks, and industrial buildings [1,2,3,4,5]. From a frequency analysis, the probability of an extreme event can be estimated, and the value of a T-year design event (e.g., rainfall or flood) can be calculated. One of the objectives of frequency analysis of hydrometeorological extremes therefore is to establish a relationship between a flood or rainfall magnitude and its recurrence interval or return period.
A multitude of distributions have been employed for frequency analysis of hydrometeorological extremes. For example, the Pearson type three (P-III) distribution is recommended in China; the Log-Pearson type three (LP3) distribution is used in the U.S. and Australia; and the generalized extreme value (GEV) distribution is usually employed in Europe. Frequency analysis of hydrometeorological extremes at a given site or location is usually performed based on an appropriate probability distribution, which is selected on the basis of statistical tests for extreme hydrometeorological data [6]. However, no single distribution has gained global acceptance [7,8]. The traditional method is to try a variety of distributions and choose the best fitted distribution based on a particular mathematical norm, such as a least square error or a likelihood norm [9]. This approach has two disadvantages: it is laborious, because many different distributions need to be tried, and the empirical choice of candidate distributions makes the results subjective [9,10,11]. To overcome these disadvantages, generalized distributions have recently gained a lot of attention because they have been shown to be an effective tool for frequency analysis of hydrometeorological extremes. Their greatest advantage is that they provide sufficient flexibility to fit a large variety of data sets, which facilitates the selection and comparison of different distributions. For example, Papalexiou and Koutsoyiannis [9] concluded that the generalized beta distribution of the second kind (GB2), which includes the commonly used exponential, Weibull, and gamma distributions as special cases, was a suitable model for rainfall frequency analysis because of its ability to describe both J-shaped and bell-shaped data. Chen et al. [10] and Chen and Singh [11] also used the generalized gamma (GG) and GB2 distributions, respectively, for hydrological frequency analysis, and their results demonstrated that these two distributions could fit hydrometeorological data well. The generalized distributions can be derived using entropy theory by specifying appropriate constraints. The theory also provides a way for efficient parameter estimation [12].
Selection of the most appropriate distribution is of fundamental importance in hydrometeorological frequency analysis, since a wrong choice could lead to significant error and bias in design flood or rainfall estimates, particularly for higher return periods, leading to either under- or over-estimation, which may have serious implications in practice [13].
A distribution is often selected on the basis of statistical tests or by graphical methods [14]. Selection criteria based on the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the Anderson–Darling Criterion (ADC) are widely used in hydrology [4,15]. Laio et al. [16] presented an objective model selection procedure based on the AIC, BIC, and ADC. Using a rigorous numerical framework, they found that the ability of these criteria to recognize the correct parent distribution from the available data varied from case to case, and that the criteria were more effective for two-parameter distributions [13]. In this study, a more objective method based on a Bayesian technique is introduced to select among distributions with more parameters for frequency analysis of hydrometeorological extremes.
Bayesian methods have been widely used in hydrological analysis, for example in model selection and hydrological uncertainty analysis. Duan et al. [17] used Bayesian model averaging for multi-model ensemble hydrologic prediction. Hsu et al. [18] used a sequential Bayesian approach for hydrologic model selection and prediction. Najafi et al. [19] used the Bayesian model averaging method to assess the uncertainties of hydrologic model selection. Robertson and Wang [20] introduced a predictor selection method for the Bayesian joint probability modeling approach to seasonal streamflow forecasting at multiple sites. In addition, Bayesian methods have also been used for model uncertainty analysis [21,22].
The objective of this study is therefore to present a more objective method based on a Bayesian technique to select the most appropriate generalized distribution for frequency analysis of hydrometeorological extremes. The entropy theory was employed to derive generalized distributions for hydrometeorological extremes and estimate their parameters based on the principle of maximum entropy. A simulation test was carried out to evaluate the performance of the proposed Bayesian model selection technique. The proposed method was then tested using annual maximum hourly and daily precipitation data from Mississippi.

2. Entropy Theory

Since the entropy theory was used for the derivation of these generalized distributions and estimation of their parameters, in this section, the entropy theory combined with the principle of maximum entropy (POME) method is introduced.
The entropy, defined by Shannon in 1948, can be expressed by
$$H(X) = -\int_0^\infty f(x) \log f(x)\, dx \tag{1}$$
where f(x) is the probability density function (PDF) of X. f(x) can be derived by maximizing the entropy subject to given constraints, which can be expressed by
$$\max\ H(X) \tag{2a}$$
$$\text{s.t.}\quad \int_0^\infty f(x)\, dx = 1; \tag{2b}$$
$$\int_0^\infty g_i(x) f(x)\, dx = C_i \quad (i = 1, \ldots, m) \tag{2c}$$
Employing the method of Lagrange multipliers, the PDF of X from Equations (1) and (2) can be derived as
$$f(x) = \exp\left(-\lambda_0 - \lambda_1 g_1(x) - \lambda_2 g_2(x) - \cdots - \lambda_m g_m(x)\right) \tag{3}$$
where m is the number of constraints; and λ0, …, λm are the Lagrange multipliers. According to (2b), λ0 can be defined as
$$\exp(\lambda_0) = \int_0^\infty \exp\left(-\lambda_1 g_1(x) - \lambda_2 g_2(x) - \cdots - \lambda_m g_m(x)\right) dx. \tag{4}$$
When different constraints are used, different PDFs can be obtained. According to the POME theory, all of the generalized distributions discussed in the following can be written in the form of Equation (3).
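As a minimal numerical sketch of Equations (3) and (4), assume a single constraint g1(x) = x with a hypothetical multiplier λ1 = 2. The function names, the multiplier value, and the truncation of the infinite integral at a finite upper limit are illustrative assumptions and are not taken from the paper. In this case Equation (4) gives λ0 = ln(1/λ1), and Equation (3) reduces to the exponential density.

```python
import math

def lambda0(lams, gs, upper=50.0, n=100_000):
    """Approximate lambda_0 = ln( int_0^upper exp(-sum_i lam_i * g_i(x)) dx )
    with the midpoint rule. `upper` truncates the infinite range (assumption)."""
    h = upper / n
    total = 0.0
    for k in range(n):
        x = (k + 0.5) * h
        total += math.exp(-sum(l * g(x) for l, g in zip(lams, gs)))
    return math.log(total * h)

def maxent_pdf(x, lam0, lams, gs):
    """Maximum-entropy density in the form of Equation (3)."""
    return math.exp(-lam0 - sum(l * g(x) for l, g in zip(lams, gs)))

lams = [2.0]              # hypothetical multiplier lambda_1
gs = [lambda x: x]        # single constraint function g_1(x) = x
lam0 = lambda0(lams, gs)  # close to ln(1/2), since int_0^inf exp(-2x) dx = 1/2
print(lam0, maxent_pdf(1.0, lam0, lams, gs))
```

With more constraint functions in `gs`, the same two helpers evaluate any density of the form of Equation (3), provided the integral converges within the chosen upper limit.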

3. Generalized Distributions

Five generalized distributions, namely the GG distribution, the GB2 distribution, and three Halphen family distributions, were used in this study. The principle of maximum entropy (POME) method was used for parameter estimation, and it involves the following steps: (1) specification of constraints and maximization of entropy using the method of Lagrange multipliers; (2) derivation of the relation between Lagrange multipliers and constraints; (3) derivation of the relation between Lagrange multipliers and distribution parameters; and (4) derivation of the relation between distribution parameters and constraints. Detailed information on obtaining the equations for parameter estimation of those generalized distributions is given in [10,11,23]. In this paper, we mainly focus on model selection based on the Bayesian method.

3.1. Generalized Gamma Distribution

The probability density function of the GG distribution is given by
$$f(x) = \frac{r_2}{\beta\, \Gamma(r_1 / r_2)} \left(\frac{x}{\beta}\right)^{r_1 - 1} \exp\left(-\left(\frac{x}{\beta}\right)^{r_2}\right) \tag{5a}$$
where Γ(·) is the gamma function; r1 and r2 are the shape parameters, r1 > 0, r2 > 0; and β is the scale parameter, β > 0.
For deriving Equation (5a) from the entropy theory, the following constraints are specified:
$$\int_0^\infty f(x)\, dx = 1$$
$$\int_0^\infty f(x) \ln x\, dx = E(\ln X)$$
$$\int_0^\infty f(x)\, x^q\, dx = E(X^q).$$
The probability density function (PDF) of the GG distribution can then be expressed as [10]:
$$f(x) = \exp\left(-\lambda_0 - \lambda_1 \ln x - \lambda_2 x^q\right)$$
where λ0, λ1, and λ2 are the Lagrange multipliers, and q is a parameter, $q = r_2$ [10].
The relations between Lagrange multipliers and parameters can be summarized as
$$\begin{cases} q = r_2 \\ \lambda_1 = 1 - r_1 \\ \lambda_2 = \beta^{-r_2}. \end{cases}$$
The equations for parameter estimation based on the POME method can be given as [10]
$$\begin{cases} \dfrac{1}{r_2} \ln\left(\beta^{r_2}\right) + \dfrac{1}{r_2} \varphi\left(\dfrac{r_1}{r_2}\right) = E(\ln X) \\ \beta^{r_2}\, \dfrac{r_1}{r_2} = E\left(X^{r_2}\right) \\ \dfrac{1}{r_2^2} \varphi'\left(\dfrac{r_1}{r_2}\right) = \operatorname{var}(\ln X) \end{cases} \tag{8}$$
where φ(·) is the digamma function and φ′(·) is the trigamma function.
As seen in Equation (8), there are three unknown parameters, r1, r2, and β, in the three equations, and the variable X represents the observed hydrometeorological extreme series, which is known. By solving this equation set, the parameters of the GG distribution can be determined. The estimation procedures for the other distributions are the same as those for the GG distribution.
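The correspondence between the GG parameters and the Lagrange multipliers can be checked numerically. The sketch below is illustrative only: the parameter values are hypothetical, the helper names are not from the paper, and λ0 is obtained here from the normalization condition as ln Γ(r1/r2) + r1 ln β − ln r2. It compares the parametric density with its entropy form and verifies the moment E(X^r2) = β^r2 · r1/r2 by midpoint integration.

```python
import math

def gg_pdf(x, r1, r2, beta):
    """Generalized gamma density in the (r1, r2, beta) parameterization."""
    k = r1 / r2
    return (r2 / (beta * math.gamma(k))) * (x / beta) ** (r1 - 1.0) * math.exp(-((x / beta) ** r2))

def gg_multipliers(r1, r2, beta):
    """Multipliers implied by the parameters; lambda_0 follows from normalization."""
    lam0 = math.lgamma(r1 / r2) + r1 * math.log(beta) - math.log(r2)
    return lam0, 1.0 - r1, beta ** (-r2)

def gg_pdf_entropy_form(x, r1, r2, beta):
    """exp(-lam0 - lam1*ln x - lam2*x^q) with q = r2."""
    lam0, lam1, lam2 = gg_multipliers(r1, r2, beta)
    return math.exp(-lam0 - lam1 * math.log(x) - lam2 * x ** r2)

def gg_moment_r2(r1, r2, beta, upper=60.0, n=100_000):
    """E(X^r2) by the midpoint rule on (0, upper); `upper` is an assumed cut-off."""
    h = upper / n
    total = 0.0
    for k in range(n):
        x = (k + 0.5) * h
        total += x ** r2 * gg_pdf(x, r1, r2, beta)
    return total * h

r1, r2, beta = 2.0, 1.5, 3.0          # hypothetical parameter values
moment = gg_moment_r2(r1, r2, beta)   # should be close to beta**r2 * r1 / r2
print(moment, beta ** r2 * r1 / r2)
```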

3.2. Generalized Beta Distribution of the Second Kind

The PDF of the GB2 distribution is given by
$$f(x) = \frac{r_3}{b\, B(r_1, r_2)} \left(\frac{x}{b}\right)^{r_1 r_3 - 1} \left(1 + \left(\frac{x}{b}\right)^{r_3}\right)^{-(r_1 + r_2)} \tag{9a}$$
where B(·) is the beta function; and r1, r2, and r3 are the shape parameters, r1 > 0, r2 > 0 and r3 > 0; and b is the scale parameter, b > 0.
For deriving Equation (9a) from the entropy theory, the following constraints are specified:
$$\int_0^\infty f(x)\, dx = 1$$
$$\int_0^\infty f(x) \ln x\, dx = E(\ln X)$$
$$\int_0^\infty f(x) \ln\left(1 + p x^q\right)^{1/p} dx = E\left(\ln\left(1 + p X^q\right)^{1/p}\right).$$
According to the maximum entropy theory, the PDF of the GB2 distribution can be expressed as [11]
$$f(x) = \exp\left(-\lambda_0 - \lambda_1 \ln x - \lambda_2 \ln\left(1 + p x^q\right)^{1/p}\right)$$
where p and q are two parameters that are related to the parameters of the GB2 distribution, $p = (1/\beta)^{r_3}$ and $q = r_3$; here β denotes the scale parameter written as b in Equation (9a).
The relations between Lagrange multipliers and parameters can be summarized as
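$$\begin{cases} \lambda_1 = 1 - r_1 q \\ \lambda_2 = p \left(r_2 + \dfrac{1 - \lambda_1}{q}\right) \\ p = \left(\dfrac{1}{\beta}\right)^{r_3} \\ q = r_3. \end{cases}$$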
The equations for parameter estimation based on the POME method can be given as [11]
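$$\begin{cases} \ln \beta + \dfrac{1}{r_3} \varphi(r_1) - \dfrac{1}{r_3} \varphi(r_2) = E(\ln X) \\ \varphi(r_1 + r_2) - \varphi(r_2) = E\left(\ln\left(1 + \left(\dfrac{X}{\beta}\right)^{r_3}\right)\right) \\ \dfrac{1}{r_3^2} \varphi'(r_1) + \dfrac{1}{r_3^2} \varphi'(r_2) = \operatorname{var}(\ln X) \\ \varphi'(r_2) - \varphi'(r_1 + r_2) = \operatorname{var}\left(\ln\left(1 + \left(\dfrac{X}{\beta}\right)^{r_3}\right)\right). \end{cases}$$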
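For the GB2 density of Equation (9a), the expected log value satisfies E(ln X) = ln b + [φ(r1) − φ(r2)]/r3, which follows from the fact that (X/b)^r3 has a beta prime distribution with shape parameters r1 and r2. The sketch below checks this identity numerically. It is an illustration under assumptions: the parameter values are hypothetical, the digamma function is approximated by a central difference of the log-gamma function, and the integral is evaluated by the midpoint rule after the substitution x = exp(t).

```python
import math

def gb2_pdf(x, r1, r2, r3, b):
    """GB2 density; the beta function is evaluated through log-gamma."""
    log_beta = math.lgamma(r1) + math.lgamma(r2) - math.lgamma(r1 + r2)
    y = (x / b) ** r3
    return (r3 / b) * math.exp(-log_beta) * (x / b) ** (r1 * r3 - 1.0) * (1.0 + y) ** (-(r1 + r2))

def digamma(z, h=1e-5):
    """Digamma as a central difference of log-gamma (numerical approximation)."""
    return (math.lgamma(z + h) - math.lgamma(z - h)) / (2.0 * h)

def gb2_expect_log(r1, r2, r3, b, t_min=-30.0, t_max=30.0, n=60_000):
    """E(ln X) by the midpoint rule in t = ln x, so that dx = x dt."""
    h = (t_max - t_min) / n
    total = 0.0
    for k in range(n):
        t = t_min + (k + 0.5) * h
        x = math.exp(t)
        total += t * gb2_pdf(x, r1, r2, r3, b) * x
    return total * h

r1, r2, r3, b = 2.0, 3.0, 1.5, 4.0    # hypothetical parameter values
numeric = gb2_expect_log(r1, r2, r3, b)
analytic = math.log(b) + (digamma(r1) - digamma(r2)) / r3
print(numeric, analytic)
```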

3.3. Halphen Type A (Hal-A) Distribution

The PDF of the Hal-A distribution is given as
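$$f(x) = \frac{1}{2 m^{\nu} K_{\nu}(2\alpha)}\, x^{\nu - 1} \exp\left[-\alpha \left(\frac{x}{m} + \frac{m}{x}\right)\right], \quad x > 0 \tag{13a}$$
where $K_{\nu}(\cdot)$ is the modified Bessel function of the second kind of order ν, $\nu \in \mathbb{R}$; and m and α are parameters, m > 0 and α > 0.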
For deriving Equation (13a) from the entropy theory, the following constraints are specified:
$$\int_0^\infty f(x)\, dx = 1$$
$$\int_0^\infty f(x) \ln x\, dx = E(\ln X)$$
$$\int_0^\infty x f(x)\, dx = E(X)$$
$$\int_0^\infty \frac{1}{x} f(x)\, dx = E\left(\frac{1}{X}\right).$$
From the entropy theory, the PDF of the Halphen type A distribution can be expressed as [23]
$$f(x) = \exp\left(-\lambda_0 - \lambda_1 \ln x - \lambda_2 x - \lambda_3 \frac{1}{x}\right), \quad x > 0$$
where $\lambda_3$ is an additional Lagrange multiplier.
The relations between Lagrange multipliers and parameters can be summarized as
$$\begin{cases} \lambda_1 = 1 - \nu \\ \lambda_2 = \dfrac{\alpha}{m} \\ \lambda_3 = m \alpha. \end{cases}$$
The equations for parameter estimation based on the POME method can be given as
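$$\begin{cases} \ln m + \dfrac{1}{K_{\nu}(2\alpha)} \dfrac{\partial K_{\nu}(2\alpha)}{\partial \nu} = E(\ln X) \\ m\, \dfrac{K_{\nu + 1}(2\alpha)}{K_{\nu}(2\alpha)} = E(X) \\ \dfrac{K_{\nu - 1}(2\alpha)}{m K_{\nu}(2\alpha)} = E\left(\dfrac{1}{X}\right). \end{cases}$$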
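The moment relations for the Hal-A distribution can be verified numerically. In the sketch below, which uses hypothetical parameter values and helper names that are not from the paper, the Bessel function K_ν(z) is approximated through its integral representation, the integral of exp(−z cosh t) cosh(νt) over t from 0 to infinity, and the expectations E(X) and E(1/X) are computed by midpoint integration with an assumed finite upper limit.

```python
import math

def bessel_k(nu, z, t_max=12.0, n=24_000):
    """K_nu(z) = int_0^inf exp(-z*cosh(t)) * cosh(nu*t) dt, midpoint rule."""
    h = t_max / n
    total = 0.0
    for k in range(n):
        t = (k + 0.5) * h
        total += math.exp(-z * math.cosh(t)) * math.cosh(nu * t)
    return total * h

def hal_a_expectation(g, m, alpha, nu, upper=80.0, n=80_000):
    """E[g(X)] under the Halphen type A density, midpoint rule on (0, upper)."""
    norm = 2.0 * m ** nu * bessel_k(nu, 2.0 * alpha)   # normalizing constant
    h = upper / n
    total = 0.0
    for k in range(n):
        x = (k + 0.5) * h
        total += g(x) * x ** (nu - 1.0) * math.exp(-alpha * (x / m + m / x))
    return total * h / norm

m, alpha, nu = 2.0, 1.5, 0.8          # hypothetical parameter values
k_nu = bessel_k(nu, 2.0 * alpha)
mean_numeric = hal_a_expectation(lambda x: x, m, alpha, nu)
mean_formula = m * bessel_k(nu + 1.0, 2.0 * alpha) / k_nu
inv_numeric = hal_a_expectation(lambda x: 1.0 / x, m, alpha, nu)
inv_formula = bessel_k(nu - 1.0, 2.0 * alpha) / (m * k_nu)
print(mean_numeric, mean_formula, inv_numeric, inv_formula)
```

In practice a library routine for the Bessel function would normally replace `bessel_k`; the integral form is used here only to keep the sketch self-contained.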

3.4. Halphen Type B (Hal-B) Distribution

The PDF of the Hal-B distribution can be given as
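$$f(x) = \frac{2}{m^{2\nu}\, \mathit{ef}_{\nu}(\alpha)}\, x^{2\nu - 1} \exp\left[-\left(\frac{x}{m}\right)^2 + \alpha \left(\frac{x}{m}\right)\right], \quad x > 0 \tag{17a}$$
where $\mathit{ef}_{\nu}(\cdot)$ is the exponential factorial function, defined as $\mathit{ef}_{\nu}(\alpha) = 2 \int_0^\infty x^{2\nu - 1} \exp\left[-x^2 + \alpha x\right] dx$; m > 0 is a scale parameter; and ν > 0 and α are shape parameters.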
For deriving Equation (17a) from the entropy theory, the following constraints are specified:
$$\int_0^\infty f(x)\, dx = 1$$
$$\int_0^\infty f(x) \ln x\, dx = E(\ln X)$$
$$\int_0^\infty x^2 f(x)\, dx = E(X^2)$$
$$\int_0^\infty x f(x)\, dx = E(X).$$
From the entropy theory, the PDF of the Halphen type B distribution can be expressed as [23]
$$f(x) = \exp\left(-\lambda_0 - \lambda_1 \ln x - \lambda_2 x^2 - \lambda_3 x\right), \quad x > 0.$$
The relations between Lagrange multipliers and parameters can be summarized as
$$\begin{cases} \lambda_1 = 1 - 2\nu \\ \lambda_2 = \dfrac{1}{m^2} \\ \lambda_3 = -\dfrac{\alpha}{m}. \end{cases}$$
The equations for parameter estimation based on the POME method can be given as
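$$\begin{cases} \ln m + \dfrac{1}{2\, \mathit{ef}_{\nu}(\alpha)} \dfrac{\partial\, \mathit{ef}_{\nu}(\alpha)}{\partial \nu} = E(\ln X) \\ m^2\, \dfrac{\mathit{ef}_{\nu + 1}(\alpha)}{\mathit{ef}_{\nu}(\alpha)} = E(X^2) \\ m\, \dfrac{\mathit{ef}_{\nu + 1/2}(\alpha)}{\mathit{ef}_{\nu}(\alpha)} = E(X). \end{cases}$$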

3.5. Halphen Type Inverse B (Hal-IB) Distribution

The PDF of the Hal-IB distribution can be given as
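$$f(x) = \frac{2}{m^{-2\nu}\, \mathit{ef}_{\nu}(\alpha)}\, x^{-2\nu - 1} \exp\left[-\left(\frac{m}{x}\right)^2 + \alpha \left(\frac{m}{x}\right)\right], \quad x > 0 \tag{21a}$$
where m > 0 is a scale parameter, and α and ν > 0 are shape parameters.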
For deriving Equation (21a) from the entropy theory, the following constraints are specified:
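$$\int_0^\infty f(x)\, dx = 1$$
$$\int_0^\infty f(x) \ln x\, dx = E(\ln X)$$
$$\int_0^\infty \frac{1}{x^2} f(x)\, dx = E\left(\frac{1}{X^2}\right)$$
$$\int_0^\infty \frac{1}{x} f(x)\, dx = E\left(\frac{1}{X}\right).$$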
From the entropy theory, the PDF of the Halphen type inverse B can be expressed as [23]
$$f(x) = \exp\left(-\lambda_0 - \lambda_1 \ln x - \lambda_2 \frac{1}{x^2} - \lambda_3 \frac{1}{x}\right), \quad x > 0.$$
The relations between Lagrange multipliers and parameters can be summarized as
$$\begin{cases} \lambda_1 = 2\nu + 1 \\ \lambda_2 = m^2 \\ \lambda_3 = -m \alpha. \end{cases}$$
The equations for parameter estimation based on the POME method can be given as
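$$\begin{cases} \ln m - \dfrac{1}{2\, \mathit{ef}_{\nu}(\alpha)} \dfrac{\partial\, \mathit{ef}_{\nu}(\alpha)}{\partial \nu} = E(\ln X) \\ \dfrac{\mathit{ef}_{\nu + 1}(\alpha)}{m^2\, \mathit{ef}_{\nu}(\alpha)} = E\left(\dfrac{1}{X^2}\right) \\ \dfrac{\mathit{ef}_{\nu + 1/2}(\alpha)}{m\, \mathit{ef}_{\nu}(\alpha)} = E\left(\dfrac{1}{X}\right). \end{cases}$$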

4. Model Selection Based on the Bayesian Technique

First, the five generalized distributions given above were used to fit a given data set D, and the equation sets derived by the POME method were applied to estimate their parameters. Second, the Bayesian technique introduced below was used to select the most appropriate distribution from the set of distributions for the data set D. In this study, the data D can be either simulated or observed data.
Let I be the background information. The posterior probabilities over a set of distributions can be expressed as
$$P(M_i \mid D, I) = \frac{P(M_i \mid I)\, P(D \mid M_i, I)}{P(D \mid I)} \tag{25}$$
where $P(M_i \mid D, I)$ is the posterior probability of distribution (or model) $M_i$, i.e., the probability that this distribution is true given the data series D and background information I. The distribution with the largest posterior probability should be chosen as the most appropriate distribution. $P(M_i \mid I)$ is the prior probability of distribution $M_i$; $P(D \mid M_i, I)$ is the probabilistic evidence, or integrated likelihood, of data D conditional on model $M_i$; and $P(D \mid I)$ is a normalization constant, calculated using the sum and product rules of probability theory as
$$P(D \mid I) = \sum_{i=1}^{N} P(M_i \mid I)\, P(D \mid M_i, I) \tag{26}$$
where N is the number of distributions that are used for the frequency analysis.
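Equations (25) and (26) can be sketched in a few lines once the evidence of each candidate distribution is available. The example below is a minimal illustration: the log-evidence values are hypothetical, equal prior model probabilities are an assumption (the prior used in the study is not stated in this section), and the computation is done on the log scale to avoid numerical underflow.

```python
import math

def posterior_model_probs(log_evidences, priors=None):
    """Posterior probabilities P(M_i | D, I) from ln P(D | M_i, I).
    Equal priors P(M_i | I) = 1/N are assumed when none are given."""
    n = len(log_evidences)
    if priors is None:
        priors = [1.0 / n] * n
    log_terms = [math.log(p) + le for p, le in zip(priors, log_evidences)]
    shift = max(log_terms)                       # log-sum-exp shift for stability
    weights = [math.exp(t - shift) for t in log_terms]
    total = sum(weights)                         # proportional to P(D | I)
    return [w / total for w in weights]

# Hypothetical log-evidences for three candidate distributions
probs = posterior_model_probs([-100.0, -101.0, -105.0])
best = probs.index(max(probs))                   # index of the selected model
print(probs, best)
```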
To obtain the posterior probability, one needs to calculate the probabilistic evidence $P(D \mid M_i, I)$, which can be obtained by integrating the joint distribution $P(\lambda, D \mid M_i, I)$ with respect to the vector $\lambda$:
$$P(D \mid M_i, I) = \int_{-\infty}^{+\infty} P(\lambda, D \mid M_i, I)\, d\lambda \tag{27}$$
Since
$$P(\lambda, D \mid M_i, I) = P(\lambda \mid M_i, I)\, P(D \mid \lambda, M_i, I) \tag{28}$$
where $P(\lambda \mid M_i, I)$ is the prior PDF for the Lagrangian multipliers given distribution $M_i$ and background information I, Equation (27) can be written as
$$P(D \mid M_i, I) = \int_{-\infty}^{+\infty} P(\lambda \mid M_i, I)\, P(D \mid \lambda, M_i, I)\, d\lambda = E\left[P(D \mid \lambda, M_i, I)\right] \tag{29}$$
where $P(D \mid \lambda, M_i, I)$ is the likelihood function of the data in terms of the set of Lagrangian multipliers, and can be expressed by
$$P(D \mid \lambda, M_i, I) = \prod_{k=1}^{n} f(D_k \mid \lambda, M_i, I) \tag{30}$$
where n is the sample size, and $D_k$ denotes a specific value in data set D. For a given data set D, model $M_i$, and background information I, $P(D \mid \lambda, M_i, I)$ is the product of the PDF values of all $D_k$.
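The multivariate Gaussian distribution was selected as the prior distribution for the Lagrangian multiplier vector $\lambda$. The prior mean was set to the estimated $\lambda$, and the covariance matrix Σ was calculated from the Hessian matrix H as $\Sigma = H^{-1}$. The Hessian matrix can be expressed as
$$H = \begin{bmatrix} \dfrac{\partial^2 \lambda_0}{\partial \lambda_1^2} & \dfrac{\partial^2 \lambda_0}{\partial \lambda_1 \partial \lambda_2} & \cdots & \dfrac{\partial^2 \lambda_0}{\partial \lambda_1 \partial \lambda_m} \\ \dfrac{\partial^2 \lambda_0}{\partial \lambda_2 \partial \lambda_1} & \dfrac{\partial^2 \lambda_0}{\partial \lambda_2^2} & \cdots & \dfrac{\partial^2 \lambda_0}{\partial \lambda_2 \partial \lambda_m} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial^2 \lambda_0}{\partial \lambda_m \partial \lambda_1} & \dfrac{\partial^2 \lambda_0}{\partial \lambda_m \partial \lambda_2} & \cdots & \dfrac{\partial^2 \lambda_0}{\partial \lambda_m^2} \end{bmatrix}. \tag{31}$$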
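To illustrate how such a Hessian can be evaluated, the sketch below uses a simple two-constraint case that is not one of the five distributions of this study: g1(x) = ln x and g2(x) = x, for which λ0 has the closed form ln Γ(1 − λ1) − (1 − λ1) ln λ2 (a gamma density). The multiplier values, the finite-difference step, and the helper names are assumptions made for illustration. The Hessian of λ0 with respect to the multipliers equals the covariance matrix of the constraint functions, which gives an independent check: the entry for g2 should equal the variance of the gamma distribution.

```python
import math

def lambda0(lam1, lam2):
    """ln int_0^inf exp(-lam1*ln x - lam2*x) dx = ln Gamma(1 - lam1) - (1 - lam1)*ln(lam2)."""
    a = 1.0 - lam1
    return math.lgamma(a) - a * math.log(lam2)

def hessian(f, point, h=1e-4):
    """Central finite-difference Hessian of f at `point`."""
    m = len(point)
    H = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            def shifted(di, dj):
                p = list(point)
                p[i] += di * h
                p[j] += dj * h
                return f(*p)
            H[i][j] = (shifted(1, 1) - shifted(1, -1)
                       - shifted(-1, 1) + shifted(-1, -1)) / (4.0 * h * h)
    return H

def inverse_2x2(H):
    """Inverse of a 2 x 2 matrix, used here for Sigma = H^{-1}."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return [[H[1][1] / det, -H[0][1] / det],
            [-H[1][0] / det, H[0][0] / det]]

# Hypothetical multipliers: gamma density with shape a = 3 and rate b = 2
H = hessian(lambda0, [-2.0, 2.0])
Sigma = inverse_2x2(H)
print(H, Sigma)   # H[1][1] should be close to a / b**2 = 0.75
```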
From Equation (29), $P(D \mid M_i, I)$ can be obtained by integration. Since the integral in Equation (29) is often complex and high-dimensional in Bayesian statistics, the quantity $P(D \mid M_i, I)$ was calculated as the expectation $E\left[P(D \mid \lambda, M_i, I)\right]$.
A Markov chain Monte Carlo (MCMC) method was used in this study to calculate $P(D \mid M_i, I)$ and the posterior probability of each distribution. The idea of MCMC sampling was first introduced in [24]. Since the target distribution is very complex, we cannot sample from it directly. The indirect method for obtaining samples from the target distribution is to construct a Markov chain with state space E whose stationary (or invariant) distribution is π(·), as discussed in [25]. Then, if the chain is run for sufficiently long, simulated values from the chain can be treated as a dependent sample from the target distribution. Using the MCMC simulation, vectors of Lagrangian multipliers $\lambda$ were drawn from the joint distribution $P(\lambda, D \mid M_i, I)$. The quantity $P(D \mid M_i, I)$ was finally calculated as the expectation $E\left[P(D \mid \lambda, M_i, I)\right]$.
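The expectation in Equation (29) can be illustrated with a much simpler estimator than the MCMC scheme used in the study. The sketch below uses plain Monte Carlo sampling from a Gaussian prior, not MCMC, and a one-multiplier model f(x) = exp(−λ0 − λ1 x) with λ0 = −ln λ1. The data values, the prior standard deviation, and the function names are hypothetical and chosen only for illustration.

```python
import math
import random

def log_likelihood(lam1, data):
    """ln prod_k f(D_k | lam1) for f(x) = exp(-lam0 - lam1*x), lam0 = -ln(lam1)."""
    if lam1 <= 0.0:
        return -math.inf          # density undefined; contributes zero likelihood
    return len(data) * math.log(lam1) - lam1 * sum(data)

def log_evidence_mc(data, prior_mean, prior_sd, n_draws=20_000, seed=0):
    """ln E[P(D | lam)] with lam drawn from a Gaussian prior (plain Monte Carlo)."""
    rng = random.Random(seed)
    logs = [log_likelihood(rng.gauss(prior_mean, prior_sd), data)
            for _ in range(n_draws)]
    shift = max(logs)             # log-sum-exp shift to avoid underflow
    return shift + math.log(sum(math.exp(v - shift) for v in logs) / n_draws)

data = [0.8, 1.9, 0.3, 2.4, 1.1, 0.6, 3.2, 0.9]   # hypothetical sample
lam_hat = len(data) / sum(data)                    # estimated multiplier, used as prior mean
log_ev = log_evidence_mc(data, prior_mean=lam_hat, prior_sd=0.1)
print(log_ev, log_likelihood(lam_hat, data))
```

Because the evidence is an average of likelihood values, it can never exceed the maximum likelihood, which provides a quick sanity check on the estimate.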
In the following, simulated data and real-world data were used for testing the proposed method. The flow chart can be found in Figure 1.

5. Performance Evaluation

Before using the proposed method in a practical application, a simulation test was carried out to evaluate the performance of the proposed Bayesian technique for model selection. The simulation test involves the following steps.
First, a distribution with given parameters was pre-defined.
Second, simulated datasets D were randomly drawn from the pre-defined distributions.
Third, the Gaussian, lognormal, Gamma, and Weibull distributions were used to fit the data set D, and the POME method was applied for parameter estimation.
Fourth, the proposed Bayesian technique was applied for model selection, and the best fitted distributions with the highest posterior probabilities were determined. The results were compared with the pre-defined distributions.
Fifth, the Bayesian model selection technique was compared with commonly used methods in hydrology, such as the root mean square error of the empirical and theoretical probabilities and the AIC criterion.
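The first three and the fifth of the steps above can be sketched as follows for the two traditional criteria. This is an illustration under several assumptions that differ from the study: only two candidate distributions (Gaussian and lognormal) are considered, their parameters are fitted with closed-form moment estimates rather than the POME equation sets, the pre-defined parameters and seed are hypothetical, and the empirical probabilities use the Weibull plotting position k/(n + 1).

```python
import math
import random

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def fit_gaussian(values):
    """Mean and (population) standard deviation."""
    n = len(values)
    mu = sum(values) / n
    sd = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    return mu, sd

def evaluate(data, family):
    """Return (AIC, RMSE) for 'gaussian' or 'lognormal' fitted to positive data."""
    values = [math.log(x) for x in data] if family == "lognormal" else list(data)
    mu, sd = fit_gaussian(values)
    loglik = sum(-0.5 * math.log(2.0 * math.pi * sd * sd)
                 - (v - mu) ** 2 / (2.0 * sd * sd) for v in values)
    if family == "lognormal":
        loglik -= sum(values)     # Jacobian: ln f_X(x) = ln f_Y(ln x) - ln x
    aic = 2 * 2 - 2.0 * loglik    # two fitted parameters
    ordered = sorted(values)
    n = len(ordered)
    sq = 0.0
    for rank, v in enumerate(ordered, start=1):
        empirical = rank / (n + 1)                 # Weibull plotting position
        sq += (empirical - normal_cdf((v - mu) / sd)) ** 2
    return aic, math.sqrt(sq / n)

rng = random.Random(1)
data = [rng.lognormvariate(1.0, 0.8) for _ in range(500)]   # pre-defined lognormal
aic_ln, rmse_ln = evaluate(data, "lognormal")
aic_g, rmse_g = evaluate(data, "gaussian")
print(aic_ln, aic_g, rmse_ln, rmse_g)
```

The fourth step would replace the AIC and RMSE comparison with posterior probabilities computed from the evidence of each candidate.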
According to the steps mentioned above, this test focuses on evaluating the reliability of the Bayesian model selection for different distributions and sample sizes. In order to show the performance of the proposed method, some simple and widely used distributions were considered, namely the Gaussian, lognormal, gamma, and Weibull distributions, which cover both Gaussian and non-Gaussian cases. The parameters used for the simulation are given in Table 1. Simulated datasets were randomly drawn from the pre-defined distributions given in Table 1 with sample sizes of 40, 80, 120, 160, 200, and 240. The proposed Bayesian technique was then applied to determine the best fitted distribution for each dataset. The multivariate Gaussian distribution was used as the prior distribution, in which the mean values are the estimated Lagrangian multipliers and the covariance matrix Σ was calculated from the Hessian matrix H as $\Sigma = H^{-1}$. The Gaussian form was chosen because the estimated parameters were usually close to the true values, and the inverse of the Hessian matrix was used as the covariance matrix. Other prior distributions are not straightforward to try, since this is a multivariate problem for which the multivariate Gaussian distribution is widely used.
The simulation results are shown in Figure 2. When the data were sampled from the Gaussian distribution, the posterior probability of the Gaussian distribution was the highest for all of the sample sizes. For the other tests, in which the lognormal and gamma distributions were the pre-defined distributions, the highest posterior probabilities for all of the sample sizes were likewise those of the lognormal and gamma distributions, respectively. Therefore, the proposed Bayesian technique can select the best fitted distribution even for a small sample size (sample size = 40).
The proposed method was compared with the traditional root mean square error (RMSE) and AIC values, which are also used to select the most appropriate distribution. The results are given in Table 2 and Table 3, in which the best fitted distributions with the smallest RMSE and AIC values are in bold. The smallest RMSE and AIC values do not always identify the correct distribution. Take the Gaussian distribution as an example. When the sample size was 40, 80, 120, and 160, the best fitted distribution was, respectively, gamma, Weibull, Weibull, and Weibull. Only when the sample size was greater than 160 was the Gaussian distribution detected as the correct distribution. Moreover, the differences in the RMSE and AIC values among the distributions were not large. In Table 3, the AIC and RMSE values generally identify the best fitted distribution. However, in some cases the RMSE and AIC values of different distributions were nearly the same, for example for sample sizes of 160 and 200 in Table 3.
According to the performance test, the Bayesian technique identified the correct distribution for every sample size considered. In contrast, the traditional RMSE and AIC do not always work effectively, and the RMSE and AIC values for the data fitted using different distributions do not show large differences. Therefore, the proposed method can provide an effective way for model selection in hydrological frequency analysis.

6. Case Study

Rainfall data for many different timescales were investigated. The timescales of these rainfall data in the Mississippi River basin ranged from hourly to yearly. The annual maximum daily and hourly series were extracted for frequency analysis. Detailed information on the daily and hourly data is shown in Table 4, which lists the length of the data, the mean value, the standard deviation, and the minimum and maximum values. The daily and hourly rainfall histograms for each gauging station are given in Figure 3.
The five generalized distributions were used to fit the data sets, and the entropy method was used to estimate the parameters of these distributions, as given in Table 5 (for daily data) and Table 6 (for hourly data). A full Newton method was used to solve the non-linear equation sets derived above, using the “nleqslv” package in the R language. The initial value was set to 1 for all parameters. The proposed Bayesian technique was used to select the most appropriate distribution for rainfall frequency analysis. The multivariate Gaussian distribution was used as the prior distribution, in which the mean values are the estimated Lagrangian multipliers and the covariance matrix Σ was calculated from the Hessian matrix H as $\Sigma = H^{-1}$. The posterior probabilities are also given in Table 5 (for daily data) and Table 6 (for hourly data), together with the RMSE, AIC, and BIC values. Both the AIC and BIC are based on the likelihood values, with a penalty term for the number of parameters in the model; the difference between them is that the penalty term is larger in the BIC than in the AIC. In this study, it is seen from Table 5 and Table 6 that the models selected by the AIC and BIC are the same. Therefore, only the results given by the AIC are discussed hereafter. The results indicate that for some of the cases, the models selected by the three criteria (RMSE, AIC, and posterior probability) are the same, e.g., gauging stations 225247, 220237, 227840, 220021, and 221314. For some of the stations, the results given by the three methods did not coincide. However, for these cases, the distribution with the lowest AIC value usually had the second-highest posterior probability. Take the gauging station 221094 in Table 5 as an example. The AIC and RMSE criteria suggested that the GB2 distribution was the best, for which the posterior probability was 0.34, smaller than the highest one, 0.58 (Hal-A).
According to the simulation test in Section 5, the proposed method performed better than the traditional AIC and RMSE criteria. The Bayesian method amplified the differences among the generalized distributions. In order to further compare the performance of these models, the theoretical and empirical exceedance probabilities of the daily rainfall data for the gauging station 223107 are shown in Figure 4a.
According to the results given in Table 5, the best fitted distributions for the gauging station 223107 recommended by the RMSE, AIC, and Bayesian methods were GB2, Hal-A, and Hal-IB, respectively. As shown in Figure 4a, if the Hal-A distribution was used, the design values for large return periods would be underestimated. The fitting curves of the GB2 and Hal-IB distributions were nearly the same. Thus, the Hal-A distribution recommended by the AIC is not appropriate, and, compared with GB2, the Hal-IB distribution, with fewer parameters and a higher posterior probability, was finally chosen.
The theoretical and empirical exceedance probabilities of the hourly rainfall data for the gauging station 222773 are shown in Figure 4b. According to the results given in Table 6, the best fitted distribution for the gauging station 222773 recommended by the RMSE, AIC and Bayesian methods was GG, Hal-B, and GB2, respectively. As shown in Figure 4b, if the GG and Hal-B distributions were used, the design values for large return periods would be underestimated.
In order to compare the fitting results more comprehensively, the Q-Q plot, P-P plot, and S-P plot were drawn for the daily rainfall data from the gauging station 223107, as shown in Figure 5. It can be seen from Figure 5a that the fitting results of GB2 and Hal-IB are nearly the same. When the GG, Hal-A, and Hal-B distributions were used, the design rainfall for a large quantile would be underestimated, since the theoretical rainfall values calculated by the GG, Hal-A, and Hal-B distributions are significantly lower than the observed ones. For the P-P and S-P plots, the differences for large probabilities are not so obvious, and the points in Figure 5b,c are well-distributed compared with the Q-Q plot. In Figure 5b, it is easily observed that the Hal-B distribution fits the worst, and the empirical probabilities in the middle part are significantly larger than the theoretical ones. S-P plots remove the impact of variance on the plot, and it is seen that the points in the S-P figure are much more concentrated than those in the P-P figure.
Furthermore, in the U.S., the Log-Pearson three (LP3) distribution has been recommended for hydrological frequency analysis [26,27]. In order to compare the five generalized distributions with the commonly used LP3 distribution, the six distributions were considered and the proposed Bayesian method was used to select the best fitted one. The results are given in Table 7.

7. Conclusions and Discussion

The paper proposes a model selection approach based on a Bayesian technique to choose the best fitted distribution for hydrological frequency analysis. Five generalized distributions, including GG, GB2, Hal-A, Hal-B, and Hal-IB, which are also widely used in hydrology, were considered. The entropy-based method was used to express these distributions and the POME method was applied for parameter estimation. A simulation test was carried out to evaluate the performance of the proposed Bayesian method. Daily rainfall data from five stations and hourly rainfall data from another five stations from the Mississippi basin were selected as case studies. The main conclusions are summarized as follows.
(1) The five entropy-based generalized distributions are presented together with the corresponding equation sets for parameter estimation. The results of the simulation test and the case study show that the POME method provides an effective means of parameter estimation.
(2) The simulation test demonstrates that the Bayesian technique can identify the most suitable distribution and that it performs better than the commonly used RMSE and AIC criteria.
(3) The case study indicates that different model selection criteria do not always yield the same result. In some cases, the three criteria choose the same distribution; in others, the results differ slightly. Because the choice of distribution strongly affects hydraulic design, especially for extreme magnitudes, the distribution should be selected carefully. According to the posterior probabilities calculated by the proposed method for daily and hourly data from 10 gauging stations, the Hal-IB distribution generally gives a better fit for daily data and the GB2 distribution generally gives a better fit for hourly data.
(4) According to the simulation test and the case studies, the Bayesian model selection technique gives more reliable results than the traditional RMSE and AIC criteria. Thus, the proposed method provides an effective way of selecting models for hydrological frequency analysis.
(5) The main contribution of this paper is that, unlike the traditional methods, the proposed method is based on entropy theory, with the posterior probabilities calculated from the generation of Lagrange multipliers. In addition, five generalized distributions are considered, whereas previous research has mainly focused on commonly used or standard distributions.
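As a rough illustration of how posterior model probabilities rank candidate distributions, the sketch below normalizes log evidences under equal prior model probabilities. It is not the procedure used in this paper, which generates Lagrange multipliers; here the evidence is replaced by the Schwarz (BIC) approximation, and the function names, log-likelihood values, and record length are hypothetical.

```python
import math

def posterior_model_probabilities(log_evidences, priors=None):
    """Turn log marginal likelihoods into posterior model probabilities."""
    names = list(log_evidences)
    if priors is None:
        # Assumption: every candidate distribution is equally likely a priori.
        priors = {name: 1.0 / len(names) for name in names}
    log_post = {n: log_evidences[n] + math.log(priors[n]) for n in names}
    # Subtract the maximum before exponentiating to avoid underflow.
    peak = max(log_post.values())
    weights = {n: math.exp(v - peak) for n, v in log_post.items()}
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}

def bic_log_evidence(log_likelihood, n_params, n_obs):
    """Schwarz approximation: log evidence ~ ln L - (k / 2) * ln(n)."""
    return log_likelihood - 0.5 * n_params * math.log(n_obs)

# Hypothetical maximized log-likelihoods and parameter counts for a
# hypothetical 83-value record; these numbers are NOT taken from the paper.
fits = {"GG": (-150.2, 3), "GB2": (-146.9, 4), "Hal-IB": (-145.8, 3)}
log_ev = {name: bic_log_evidence(ll, k, 83) for name, (ll, k) in fits.items()}
probs = posterior_model_probabilities(log_ev)
best = max(probs, key=probs.get)
print(best, {name: round(p, 3) for name, p in probs.items()})
```

The distribution with the largest posterior probability is selected. Because the probabilities sum to one, they also indicate how decisively the data favor that choice over the alternatives.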
This paper concentrates on univariate hydrometeorological frequency analysis. Multivariate hydrological analysis has also attracted growing attention recently [2,4,28,29,30,31]. However, univariate frequency analysis is the basis of multivariate frequency analysis because it provides the marginal distributions for the joint distribution. Thus, the univariate distributions should be established rationally and appropriately before the multivariate distribution is built.
In addition, conventional hydrological frequency analysis assumes that the data are independent and identically distributed [1]. Because climate change and human activities influence streamflow, the mean or the variability of the series may change over time; in other words, the data set may be non-stationary. Non-stationary hydrological frequency analysis has recently become an active and difficult topic in hydrology. This paper focuses on stationary frequency analysis of hydrometeorological extremes; non-stationary hydrological frequency analysis will be addressed in future research.
Although this paper discusses model selection among the five generalized distributions, the traditional and commonly used LP3 distribution remains an effective tool for frequency analysis and can be used for design rainfall or flood calculation.

Acknowledgments

This study was supported by the National Natural Science Foundation of China (51679094), the National Key R&D Program of China (2017YFC0405900), the National Natural Science Foundation of China (51509273; 91547208), and the Fundamental Research Funds for the Central Universities (2017KFYXJJ194, 2016YXZD048).

Author Contributions

Vijay P. Singh conceived and designed the experiments; Lu Chen and Kangdi Huang performed the experiments and analyzed the data; Lu Chen and Vijay P. Singh wrote the paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Rosbjerg, D. Partial Duration Series in Water Resources; Technical University of Denmark: Kongens Lyngby, Denmark, 1993.
  2. Lin-Ye, J.; García-León, M.; Gracia, V.; Sánchez-Arcilla, A. A multivariate statistical model of extreme events: An application to the Catalan coast. Coast. Eng. 2016, 117, 138–156.
  3. Chen, L.; Zhang, Y.; Zhou, J.; Singh, V.P.; Guo, S.L.; Zhang, J. Real-time error correction method combined with combination flood forecasting technique for improving the accuracy of flood forecasting. J. Hydrol. 2015, 521, 157–169.
  4. Chen, L.; Singh, V.P.; Guo, S.; Zhou, J. Copula-based method for Multisite Monthly and Daily Streamflow Simulation. J. Hydrol. 2015, 528, 369–384.
  5. Chen, L.; Singh, V.P.; Guo, S.L.; Zhou, J.Z.; Zhang, J.H.; Liu, P. An objective method for partitioning the entire flood season into multiple sub-seasons. J. Hydrol. 2015, 528, 621–630.
  6. Yoon, P.; Kim, T.M.; Yoo, C. Rainfall frequency analysis using a mixed GEV distribution: A case study for annual maximum rainfalls in South Korea. Stoch. Environ. Res. Risk Assess. 2013, 27, 1143–1153.
  7. Perreault, L.; Bobée, B.; Rasmussen, P. Halphen distribution system. I: Mathematical and statistical properties. J. Hydrol. Eng. 1999, 4, 189–199.
  8. Zhang, J.; Chen, L.; Singh, V.P.; Cao, W.; Wang, D. Determination of the distribution of flood forecasting error. Nat. Hazards 2015, 1, 1389–1402.
  9. Papalexiou, S.M.; Koutsoyiannis, D. Entropy based derivation of probability distribution: A case study to daily rainfall. Adv. Water Resour. 2012, 45, 51–57.
  10. Chen, L.; Singh, V.P.; Xiong, F. An Entropy-Based Generalized Gamma Distribution for Flood Frequency Analysis. Entropy 2017, 19, 239.
  11. Chen, L.; Singh, V.P. Generalized Beta Distribution of the Second Kind for Flood Frequency Analysis. Entropy 2017, 19, 254.
  12. Singh, V.P. Entropy Based Parameter Estimation in Hydrology; Kluwer Academic Publishers: Dordrecht, The Netherlands, 1998.
  13. Rahman, A.S.; Rahman, A.; Zaman, M.A.; Haddad, K.; Ahsan, A.; Imteaz, M. A study on selection of probability distributions for at-site flood frequency analysis in Australia. Nat. Hazards 2013, 69, 1803–1813.
  14. Bobee, B.; Perreault, L.; Ashkar, F. Two kinds of moment ratio diagrams and their applications in Hydrology. Stoch. Hydrol. Hydraul. 1993, 7, 41–65.
  15. Chen, L.; Singh, V.P.; Guo, S.L.; Zhou, J.; Ye, L. Copula entropy coupled with artificial neural network for rainfall-runoff simulation. Stoch. Environ. Res. Risk Assess. 2014, 28, 1755–1767.
  16. Laio, F.; Baldassarre, G.D.; Montanari, A. Model selection techniques for the frequency analysis of hydrological extremes. Water Resour. Res. 2009, 45, 162–174.
  17. Duan, Q.; Ajami, N.K.; Gao, X.; Sorooshian, S. Multi-model ensemble hydrologic prediction using Bayesian model averaging. Adv. Water Resour. 2007, 30, 1371–1386.
  18. Hsu, K.L.; Moradkhani, H.; Sorooshian, S. A sequential Bayesian approach for hydrologic model selection and prediction. Water Resour. Res. 2009, 45, 1079.
  19. Najafi, M.R.; Moradkhani, H.; Jung, I.W. Assessing the uncertainties of hydrologic model selection in climate change impact studies. Hydrol. Process. 2011, 25, 2814–2826.
  20. Robertson, D.E.; Wang, Q.J. A Bayesian Approach to Predictor Selection for Seasonal Streamflow Forecasting. J. Hydrometeorol. 2012, 13, 155–171.
  21. Vrugt, J.A.; Robinson, B.A. Treatment of uncertainty using ensemble methods: Comparison of sequential data assimilation and Bayesian model averaging. Water Resour. Res. 2007, 43, 223–228.
  22. Parrish, M.A.; Moradkhani, H.; Dechant, C.M. Toward reduction of model uncertainty: Integration of Bayesian model averaging and data assimilation. Water Resour. Res. 2012, 48, 3519.
  23. Chen, L.; Singh, V.P. Entropy-based derivation of generalized distributions for hydrometeorological frequency analysis. J. Hydrol. 2018, 557, 699–712.
  24. Metropolis, N.; Rosenbluth, A.W.; Rosenbluth, M.N.; Teller, A.H.; Teller, E. Equations of state calculations by fast computing machines. J. Chem. Phys. 1953, 21, 1087–1091.
  25. Smith, A.E.M.; Roberts, G. Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. J. R. Stat. Soc. B 1993, 55, 3–23.
  26. Griffis, V.W.; Stedinger, J.R. Log-Pearson Type 3 Distribution and Its Application in Flood Frequency Analysis. I: Distribution Characteristics. J. Hydrol. Eng. 2007, 12, 482–491.
  27. Lamontagne, J.R.; Stedinger, J.R.; Yu, X.; Whealton, C.A.; Xu, Z. Robust flood frequency analysis: Performance of EMA with multiple Grubbs-Beck outlier tests. Water Resour. Res. 2016, 52, 3068–3084.
  28. Lin-Ye, J.; Garcia-Leon, M.; Gracia, V.; Ortego, M.I.; Lionello, P.; Sanchez-Arcilla, A. Multivariate statistical modelling of future marine storms. Appl. Ocean Res. 2017, 65, 192–205.
  29. Chen, L.; Singh, V.P.; Guo, S.L. Measure of correlations for rivers flows based on copula entropy method. J. Hydrol. Eng. 2013, 18, 1591–1606.
  30. Chen, L.; Ye, L.; Singh, V.P.; Zhou, J.; Guo, S.L. Determination of input for artificial neural networks for flood forecasting using the copula entropy method. J. Hydrol. Eng. 2014, 19, 217–226.
  31. Chen, L.; Singh, V.P.; Lu, W.; Zhang, J.; Zhou, J.; Guo, S. Streamflow forecast uncertainty evolution and its effect on real-time reservoir operation. J. Hydrol. 2016, 540, 712–726.
Figure 1. Flowchart of the whole paper. POME: principle of maximum entropy.
Figure 2. The posterior probabilities of the simulation tests with the Gaussian distribution, Lognormal distribution, and Gamma distribution as the pre-defined distributions, respectively.
Figure 3. Daily and Hourly rainfall histograms for each gauging station.
Figure 4. Theoretical and empirical exceedance probabilities of the annual maximum rainfall data at the stations 223107 and 222773.
Figure 5. Q-Q, P-P, and S-P plots for the daily rainfall data from the gauging station 223107.
Table 1. Parameters of different distributions for simulation test.
Number | Distribution | Probability Density Function (PDF) | Parameters
1 | Gaussian | $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$ | $\mu = 10$, $\sigma = 3.162$
2 | Lognormal | $f(x;\mu,\sigma) = \begin{cases} \frac{1}{\sqrt{2\pi}\,\sigma x} \exp\left[-\frac{1}{2\sigma^2}(\ln x - \mu)^2\right], & x > 0 \\ 0, & x \le 0 \end{cases}$ | $\mu = 2$, $\sigma = 0.6$
3 | Gamma | $f(x;\beta,\alpha) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\beta x}, \quad x > 0$ | $\alpha = 10$, $\beta = 1$
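The simulation inputs in Table 1 can be reproduced approximately with the Python standard library. The sketch below is only an illustration: it assumes that the column headings 40 to 240 in Tables 2 and 3 are sample sizes, and the seed and the helper name `draw_sample` are arbitrary choices, not part of the paper.

```python
import random
import statistics

# Assumption: these are the sample sizes used in the simulation test.
SAMPLE_SIZES = [40, 80, 120, 160, 200, 240]
DISTRIBUTIONS = ["Gaussian", "Lognormal", "Gamma"]

def draw_sample(name, n, rng):
    """Draw n values from one of the pre-defined distributions of Table 1."""
    if name == "Gaussian":
        return [rng.gauss(10.0, 3.162) for _ in range(n)]
    if name == "Lognormal":
        return [rng.lognormvariate(2.0, 0.6) for _ in range(n)]
    if name == "Gamma":
        # random.gammavariate takes (shape, scale). Table 1 writes beta as a
        # rate, so scale = 1 / beta, which equals 1 for beta = 1.
        return [rng.gammavariate(10.0, 1.0 / 1.0) for _ in range(n)]
    raise ValueError(f"unknown distribution: {name}")

rng = random.Random(2018)  # arbitrary seed, for repeatability only
samples = {
    (name, n): draw_sample(name, n, rng)
    for name in DISTRIBUTIONS
    for n in SAMPLE_SIZES
}

for name in DISTRIBUTIONS:
    largest = samples[(name, 240)]
    print(name, round(statistics.mean(largest), 2), round(statistics.stdev(largest), 2))
```

Each generated sample would then be fitted with the candidate distributions, and the selection criteria compared against the known generating distribution.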
Table 2. The root mean square error (RMSE) and Akaike Information Criterion (AIC) values for the simulation test, in which the pre-defined distribution is the Gaussian distribution.
Distribution | Criterion | 40 | 80 | 120 | 160 | 200 | 240
Gaussian | RMSE | 0.03 | 0.0175 | 0.0247 | 0.0243 | 0.0118 | 0.0115
Gaussian | AIC | −169.07 | −443.78 | −577.56 | −792.76 | −1244.68 | −1482.07
Lognormal | RMSE | 0.0385 | 0.044 | 0.025 | 0.05 | 0.057 | 0.0437
Lognormal | AIC | −155.12 | −279.77 | −558.39 | −525.84 | −683.75 | −913.47
Gamma | RMSE | 0.0239 | 0.0293 | 0.0161 | 0.0377 | 0.0367 | 0.026
Gamma | AIC | −177.05 | −332.5793 | −642.25 | −607.66 | −801.37 | −1116.59
Weibull | RMSE | 0.03 | 0.0168 | 0.016 | 0.0168 | 0.017 | 0.0118
Weibull | AIC | −167.34 | −441.3 | −661.23 | −858.42 | −1082.21 | −1457.45
Table 3. The RMSE and AIC values for the simulation test, in which the pre-defined distribution is the Gamma distribution.
Distribution | Criterion | 40 | 80 | 120 | 160 | 200 | 240
Gaussian | RMSE | 0.0409 | 0.0317 | 0.33 | 0.0333 | 0.0318 | 0.033
Gaussian | AIC | −152.98 | −372.77 | −509.91 | −693.5 | −879.39 | −1030.29
Lognormal | RMSE | 0.0217 | 0.0161 | 0.0211 | 0.0098 | 0.0158 | 0.0179
Lognormal | AIC | −185.92 | −437.1 | −582.86 | −1018.23 | −1094.31 | −1311.04
Gamma | RMSE | 0.0229 | 0.0141 | 0.0169 | 0.0108 | 0.0124 | 0.0126
Gamma | AIC | −189.06 | −486.4 | −632.77 | −1026.04 | −1191.77 | −1421.8
Weibull | RMSE | 0.042 | 0.0398 | 0.0314 | 0.036 | 0.0344 | 0.0332
Weibull | AIC | −150.55 | −373.95 | −523.74 | −672.99 | −853.67 | −1053.41
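For readers who want to see how criteria such as those in Tables 2 and 3 can be computed, the sketch below scores a Gaussian and a lognormal fit to a synthetic Gamma sample. Several choices are assumptions rather than the paper's definitions: the RMSE compares fitted and empirical non-exceedance probabilities using the Weibull plotting position i/(n + 1), the AIC uses the standard likelihood form 2k − 2 ln L (which may differ from the form behind the tabulated values), and the parameters are fitted by maximum likelihood instead of POME.

```python
import math
import random

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def normal_logpdf(x, mu, sigma):
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)

def fit_normal(values):
    """Maximum likelihood estimates of the mean and standard deviation."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    return mu, sigma

def score(data, cdf, logpdf, n_params):
    """Return (RMSE, AIC) of a fitted distribution for the given data."""
    xs = sorted(data)
    n = len(xs)
    # Empirical non-exceedance probability of the i-th smallest value.
    sq_err = sum((cdf(x) - (i + 1) / (n + 1)) ** 2 for i, x in enumerate(xs))
    rmse = math.sqrt(sq_err / n)
    aic = 2 * n_params - 2 * sum(logpdf(x) for x in xs)
    return rmse, aic

# Synthetic sample from the Gamma case of Table 1 (shape 10, scale 1).
rng = random.Random(7)  # arbitrary seed
data = [rng.gammavariate(10.0, 1.0) for _ in range(200)]

mu, sigma = fit_normal(data)
log_mu, log_sigma = fit_normal([math.log(x) for x in data])

scores = {
    "Gaussian": score(
        data,
        lambda x: normal_cdf(x, mu, sigma),
        lambda x: normal_logpdf(x, mu, sigma),
        2,
    ),
    # Lognormal density: normal density of ln(x) divided by x.
    "Lognormal": score(
        data,
        lambda x: normal_cdf(math.log(x), log_mu, log_sigma),
        lambda x: normal_logpdf(math.log(x), log_mu, log_sigma) - math.log(x),
        2,
    ),
}
for name, (rmse, aic) in scores.items():
    print(f"{name}: RMSE={rmse:.4f}, AIC={aic:.2f}")
```

Smaller values of either criterion indicate a better fit. The two criteria need not agree, which is the situation the posterior probabilities are intended to resolve.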
Table 4. Detailed information of daily and hourly annual maximum rainfall series.
Time scale | Gauging station | Number | Length of data | Mean | SD | Max | Min
Daily | Canton gauging | 221389 | 1893–2012 | 3.6 | 1.1 | 6.8 | 1.65
Daily | Brookhaven City, MS | 221094 | 1894–2014 | 4.14 | 1.44 | 8.08 | 1.85
Daily | Crystal Spgs Exp Stn, MS | 222094 | 1893–1954, 1985–2014 | 4 | 1.4 | 9.04 | 2.02
Daily | Forest, MS | 223107 | 1930–2012 | 4.02 | 1.66 | 11.75 | 2
Daily | Louisville, MS | 225247 | 1895–2014 | 3.65 | 1.28 | 7.47 | 1.7
Hourly | Arkabutla dam | 220237 | 1949–2001 | 1.57 | 0.46 | 3.12 | 0.88
Hourly | Enid dam MS | 222773 | 1949–2012 | 1.62 | 0.59 | 4 | 0.2
Hourly | Saucier experimental forest MS | 227840 | 1955–2013 | 2.23 | 0.7 | 5.13 | 1.2
Hourly | Aberdeen MS | 220021 | 1952–2011 | 1.55 | 0.62 | 3.8 | 0.7
Hourly | Calhoun city MS | 221314 | 1948–2009 | 1.6 | 0.57 | 3.88 | 0.8
Table 5. Parameters, RMSE, AIC, and posterior probabilities for daily data calculated by five generalized distributions.
NumberDistributionPara1Para2Para3Para4RMSEAICBICPosterior Probabilities
221389GG13.2510.83970.1335 0.0205−613.05−604.760.02
Gb20.99551.580834.9516.840.0195−605.68−597.390.08
Hal-A5.2353.4310.0327 0.0183−624.56−616.270.14
Hal-B−20.9277.3635.381 0.0237−593.21−584.930.02
Hal-IB−9.1032.8554.776 0.0233−544.21−535.920.74
221094GG13.3080.6840.0527 0.0362−482.72−477.320.08
Gb21.6782.20311.9944.8940.0203−607.37−601.980.34
Hal-A3.5077.812−5.534 0.0229−584.34−578.950.58
Hal-B−20.9188.03225.6835 0.0557−409.26−410.150.00
Hal-IB−10.7552.3193.699 0.0525−539.94−531.550.00
222094GG14.8550.6280.0255 0.0307−412.5−404.940.04
Gb21.38031.3824.9776.5730.0213−462.61−455.040.20
Hal-A2.566912.3193−7.9616 0.021−466.21−458.640.32
Hal-B−12.25897.5813.5971 0.04−385.6−378.040.01
Hal-IB−9.5462.27084.2153 0.0218−464.18−456.610.44
223107GG13.59990.56060.0131 0.0294−409.36−402.110.00
Gb22.1991.8379.9312.43890.0164−466.07−458.810.08
Hal-A0.97730.104−8.194 0.0168−479.18−471.930.21
Hal-B−27.560514.57783.8889 0.0406−362.58−355.330.00
Hal-IB−3.89183.70763.2608 0.0165−457.91−450.650.71
225247GG13.5960.66570.0384 0.0437−433.23−424.860.00
Gb21.8221.15623.623.6410.02456−565.73−557.360.00
Hal-A2.055414.775−8.756 0.0287−533.57−525.210.04
Hal-B−46.51514.1286.0834 0.0653−363.04−354.680.00
Hal-IB−3.3613.9383.602 0.0232−575.37−567.010.96
Table 6. Parameters, RMSE, AIC, and posterior probabilities for hourly data calculated by five generalized distributions.
NumberDistributionPara1Para2Para3Para4RMSEAICBICPosterior Probabilities
220237GG−19.1790.67940.0284 0.048−187.74−181.830.20
GB22.9632.53916.88922.44970.0308−233.74−227.830.57
Hal-A3.025915.972−12.521 0.0344−215.46−209.550.08
Hal-B−27.4588.3846.6655 0.0523−178.6−172.690.10
Hal-IB−7.1854.28985.7116 0.0312−222.62−216.720.06
222773GG4.1691.8581.656 0.0318−291.31−284.830.13
GB24.53.2210.7752.1530.0337−291.92−285.440.70
Hal-A5.36 × 10−21.90 × 10−26.8204 0.0333−274.18−267.700.05
Hal-B1.31051.61321.5702 0.0338−292.69−286.210.12
Hal-IB−50.57190.15653.0044 0.0419−231.43−224.950.00
227840GG17.04150.69770.0225 0.0473−266.6−260.370.05
GB22.0131.296611.72774.60420.0266−275.55−266.320.84
Hal-A2.79958.6585−11.0919 0.0271−274.68−268.450.00
Hal-B−23.67774.72185.8388 0.0422−249.55−243.320.11
Hal-IB−10.6011.8655.7014 0.0316−264.11−257.880.00
220021GG11.8750.650.0173 0.0582−189.32−183.040.10
GB24.10810.93243.72840.96210.0468−221.54−212.260.33
Hal-A1.170410.6076−8.8785 0.0532−217.75−211.470.15
Hal-B−30.30375.69474.203 0.0761−181.74−175.420.00
Hal-IB0.78282.2482.3074 0.0455−228.4−222.120.42
221314GG14.14650.67380.0171 0.0405−281.77−275.390.00
GB21.57340.340644.99994.5160.0318−293.19−286.810.05
Hal-A1.77867.979−9.475 0.0266−316.06−309.680.11
Hal-B−27.7924.93874.622 0.0432−263.62−257.240.00
Hal-IB−6.2171.4384.185 0.0249−309.73−303.350.84
Table 7. Parameters, RMSE, AIC, and posterior probabilities for the daily data from station 223107 calculated by five generalized distributions and the Log-Pearson Type 3 (LP3) distribution.
DistributionsPara1Para2Para3Para4RMSEAICBICPosterior Probabilities
GG13.600.560.013 0.0294−406.36−402.110.002
GB22.201.849.932.440.0164−463.07−460.810.071
Hal-A0.9830.10−8.19 0.0168−476.18−471.930.197
Hal-B−27.5614.583.89 0.0406−359.58−355.330.0003
Hal-IB−3.893.703.26 0.0165−454.91−450.770.674
LP314.290.09−0.03 0.0167−448.65−444.390.056

Share and Cite

MDPI and ACS Style

Chen, L.; Singh, V.P.; Huang, K. Bayesian Technique for the Selection of Probability Distributions for Frequency Analyses of Hydrometeorological Extremes. Entropy 2018, 20, 117. https://doi.org/10.3390/e20020117
