Article

A New Point Process Regression Extreme Model Using a Dirichlet Process Mixture of Weibull Distribution

State Key Laboratory of Mechanics and Control of Mechanical Structures, School of Mathematics, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
*
Author to whom correspondence should be addressed.
Mathematics 2022, 10(20), 3781; https://doi.org/10.3390/math10203781
Submission received: 13 September 2022 / Revised: 8 October 2022 / Accepted: 10 October 2022 / Published: 13 October 2022
(This article belongs to the Section Probability and Statistics)

Abstract

Extreme value theory, which studies the stochastic behavior of extremes associated with rare events, is widely used in economic and environmental domains. In this context, we consider a new mixture model for the analysis of extremal events, comprising a Dirichlet process mixture of Weibull (DPMW) distribution below the threshold and the point process (PP) extreme model for the upper tail. The model introduces a regression structure for the PP extreme model parameters, which explains the variation of the exceedances through all tail parameters. The model parameters are estimated under the Bayesian paradigm using the Markov chain Monte Carlo (MCMC) method. The model is applied to both simulated and real environmental data to demonstrate its performance in extrapolating extreme events.

1. Introduction

Extreme value models have been widely studied to analyze extreme events of maximum and minimum in environmental sciences, engineering and financial econometrics. Coles [1] showed that the motivation of extreme value models is to prevent extreme event damages, because, even though their probability of occurrence is small, their consequences can be undesirable.
The Generalized Pareto Distribution (GPD) is the limiting distribution of exceedances above a threshold and is used to model the tail of the data. It is denoted by $\mathrm{GPD}(u, \sigma_u, \xi)$, with threshold $u$, scale parameter $\sigma_u$ and shape parameter $\xi$. Coles and Tawn [2] and Coles [1] showed that the parameter $\sigma_u$ depends strongly on the threshold $u$ in the GPD model, i.e., $\sigma_u = \sigma_0 + \xi (u - \mu_0)$, where $\sigma_0$ and $\mu_0$ are the scale parameter and location parameter of the Generalized Extreme Value (GEV) model, respectively. This dependence between the parameters can lead to poor mixing [3]. Pickands [4,5] considered a more general tail model, the point process (PP) extreme model, which can more easily be extended to the non-stationary case by allowing the number of threshold exceedances to change over time.
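The threshold dependence of the GPD scale can be checked numerically. The following is a minimal sketch; the function name `gpd_scale` and the GEV parameter values are hypothetical and chosen only for illustration.

```python
def gpd_scale(u, mu0, sigma0, xi):
    """Threshold-dependent GPD scale: sigma_u = sigma0 + xi * (u - mu0)."""
    return sigma0 + xi * (u - mu0)

# Hypothetical GEV parameters, chosen only for illustration.
mu0, sigma0, xi = 1.0, 0.5, 0.2
for u in (1.0, 2.0, 3.0):
    print(f"u = {u:.1f} -> sigma_u = {gpd_scale(u, mu0, sigma0, xi):.2f}")
```

Raising the threshold changes the implied scale whenever $\xi \neq 0$, which is the source of the dependence between the GPD parameters.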
The most common approach for PP models is to fix the threshold $u$, which requires considerable care because different choices of $u$ often result in very different parameter estimates and inferences [1,6]. Extreme value mixture models have therefore been proposed to avoid threshold selection; they combine a distribution below the threshold with an extreme value model in the tail. Frigessi et al. [7] suggested modeling the data with a dynamic mixture in which one component is a GPD and the other is a lighter-tailed Weibull distribution, using maximum likelihood to estimate the approximate standard deviations. Behrens et al. [8] introduced a mixture model that combines a gamma distribution for the center and a GPD for the tail, using all observations for inference about the unknown parameters of both distributions, the threshold included. MacDonald et al. [3] proposed an extreme value mixture model in which one component is a non-parametric kernel density estimator and the other is the point process model. However, in all of the aforementioned approaches the prior for the bulk model takes a specific form, which complicates the density estimation of the parameters.
In this paper, we present a new flexible point process regression extreme model using a Dirichlet process mixture of Weibull distribution (DPMW-PPR). Through the Dirichlet process mixture, the model controls the number of mixture components without requiring prior information. Kottas [9] showed that Dirichlet process (DP) mixture models form a very rich class of Bayesian nonparametric models. DP mixture models arise by placing a DP prior on the mixing distribution in a mixture of a parametric family of distributions [10,11]. Hanson [12] and Jairo [13] proposed a DP mixture of gamma densities for the bulk below the threshold and estimated univariate densities on the positive real line. Here, we introduce a new point process regression extreme model that comprises a Dirichlet process mixture of Weibull (DPMW) distribution below the threshold and the point process (PP) extreme model above the threshold. Moreover, the model adds a regression structure for the PP model parameters: under the Bayesian approach, the prior distributions are assigned not to the parameters of the PP extreme model directly, but to the coefficients of their linear structure. There are three key ingredients in our approach: we model all data, not only those belonging to the tail; we use a mixture model, one component of which is a DPMW distribution and the other a PP extreme model; and we develop a regression structure for the PP model parameters, which explains the variation of the exceedances through all tail parameters. With the DPMW distribution below the threshold, the proposed model works well even in the absence of prior information and with small sample sizes. The regression structure for the PP model parameters explains the behavior of the time series.
This work is organized as follows. In Section 2, we present the new mixture extreme model. In Section 3, we describe the MCMC method used for Bayesian inference on the model parameters. The simulation study of the proposed model with different sample sizes is provided in Section 4. In Section 5, the proposed model is fitted to two real environmental datasets. Finally, Section 6 presents the concluding remarks.

2. A New Point Process Regression Extreme Model

The proposed model assumes that observations below the threshold $u$ follow a distribution $K(\cdot \mid \theta)$ with parameters $\theta$, while the observations above the threshold come from a point process (PP) extreme model.
Pickands [4] showed that the point process defined by
$$P_n = \left\{ \left( \frac{i}{n+1}, X_i \right) ;\ i = 1, \ldots, n \right\}$$
is well approximated, in the limit as $n \to \infty$, by an inhomogeneous Poisson process on the region $A = [0, 1] \times (u, \infty)$, for a sufficiently high threshold $u$, with the intensity measure on the subregion $B = (t_1, t_2) \times (x, \infty)$ given by
$$P(B) = \begin{cases} (t_2 - t_1)\, n_b \left[ 1 + \xi\, \dfrac{x - \mu}{\sigma} \right]_+^{-1/\xi}, & \xi \neq 0, \\[2ex] (t_2 - t_1)\, n_b \exp\left( -\dfrac{x - \mu}{\sigma} \right), & \xi = 0, \end{cases} \tag{1}$$
where $x > u$, the scaling constant $n_b$ is the number of blocks of observations (e.g., the number of years of daily data) and $(\mu, \sigma, \xi, u)$ are the parameters of the PP extreme model.
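As a rough illustration of Equation (1), the sketch below evaluates the intensity of a region $(t_1, t_2) \times (x, \infty)$. The function name `pp_intensity` and the numerical tolerance used to detect $\xi = 0$ are assumptions made for this example.

```python
import math

def pp_intensity(t1, t2, x, mu, sigma, xi, n_b):
    """Intensity P(B) of B = (t1, t2) x (x, inf), following Equation (1)."""
    z = (x - mu) / sigma
    if abs(xi) < 1e-12:                  # treat as the xi = 0 case
        return (t2 - t1) * n_b * math.exp(-z)
    base = max(1.0 + xi * z, 0.0)        # the [.]_+ truncation at zero
    if base == 0.0:
        # x lies outside the support: mass is 0 for xi < 0, unbounded for xi > 0
        return 0.0 if xi < 0 else math.inf
    return (t2 - t1) * n_b * base ** (-1.0 / xi)

# With x = mu the bracket equals 1, so the intensity over the unit time
# interval reduces to n_b.
print(pp_intensity(0.0, 1.0, 2.0, 2.0, 1.0, 0.1, 10))
```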
Therefore, the distribution function $F$ of a sequence of $n$ independent and identically distributed observations $X = \{x_i : i = 1, \ldots, n\}$ can be written as
$$F(x \mid \phi, \theta) = \begin{cases} K(x \mid \theta), & x \le u, \\ K(u \mid \theta) + [1 - K(u \mid \theta)]\, P(x \mid \mu, \sigma, \xi, u), & x > u, \end{cases} \tag{2}$$
where $P(x \mid \mu, \sigma, \xi, u)$ is the PP extreme model defined by (1). The density of the model is
$$f(x \mid \phi, \theta) = \begin{cases} k(x \mid \theta), & x \le u, \\ [1 - K(u \mid \theta)]\, p(x \mid \mu, \sigma, \xi, u), & x > u, \end{cases} \tag{3}$$
where $k(x \mid \theta)$ denotes the density of $K(x \mid \theta)$ and $p(x \mid \mu, \sigma, \xi, u)$ denotes the density of the PP model.
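The piecewise structure of the distribution function in Equation (2) can be sketched as follows. Two simplifying assumptions are made here and are not taken from the paper: the bulk $K$ is replaced by a single Weibull distribution (rather than a Dirichlet process mixture), and the tail term $P$ is taken to be the conditional distribution of an exceedance implied by the intensity in (1), that is, one minus the ratio of the intensities above $x$ and above $u$, for $\xi \neq 0$.

```python
import math

def weibull_cdf(x, a, b):
    """Bulk stand-in: Weibull cdf 1 - exp(-x^a / b) for x > 0."""
    return 1.0 - math.exp(-(x ** a) / b) if x > 0 else 0.0

def tail_cdf(x, u, mu, sigma, xi):
    """Conditional cdf of an exceedance of u (xi != 0, x > u)."""
    top = 1.0 + xi * (x - mu) / sigma
    bottom = 1.0 + xi * (u - mu) / sigma
    return 1.0 - (top / bottom) ** (-1.0 / xi)

def mixture_cdf(x, u, a, b, mu, sigma, xi):
    """Bulk cdf below u, rescaled tail cdf above u, as in Equation (2)."""
    if x <= u:
        return weibull_cdf(x, a, b)
    k_u = weibull_cdf(u, a, b)
    return k_u + (1.0 - k_u) * tail_cdf(x, u, mu, sigma, xi)

# Hypothetical parameter values for illustration.
params = dict(u=2.0, a=1.5, b=1.0, mu=1.0, sigma=0.5, xi=0.2)
print(mixture_cdf(1.0, **params), mixture_cdf(2.0, **params), mixture_cdf(4.0, **params))
```

The factor $1 - K(u \mid \theta)$ rescales the tail so that the two pieces join continuously at $u$ and the total mass is one.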

2.1. The DPMW Densities

This paper uses the Dirichlet process mixture of Weibull (DPMW) distribution in the bulk part of Equation (2). The DP is characterized by a parametric distribution function $H_0$, the center or base distribution of the process, and a positive scalar precision parameter $v$ [9]. We write $H \sim \mathrm{DP}(v H_0)$ for a DP prior on the distribution function $H$. Then, we obtain the DP mixture model
$$K(\cdot\,; \theta) = \int W(\cdot \mid \theta)\, H(d\theta), \tag{4}$$
where $W(\cdot \mid \theta)$ is the distribution function of the parametric kernel of the mixture and $H \sim \mathrm{DP}(v H_0)$. If $w(\cdot \mid \theta)$ is the density corresponding to $W(\cdot \mid \theta)$, the density of the random mixture in (4) is $k(\cdot\,; \theta) = \int w(\cdot \mid \theta)\, H(d\theta)$.
The Dirichlet process mixture of Weibull distributions has been considered by Kottas [9] for survival data. The Weibull density with shape parameter $a$ and scale parameter $b$ is given by
$$w(x \mid \theta) = \mathrm{Weibull}(x \mid a, b) = \frac{a}{b}\, x^{a-1} \exp\left( -\frac{x^a}{b} \right), \tag{5}$$
where $x > 0$ and $\theta = (a, b)$.
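The $(a, b)$ parametrization in (5) differs from the more common shape and scale form of the Weibull density. As a small check, the sketch below compares the two, using the correspondence $\lambda = b^{1/a}$ for the usual scale; the function names are illustrative only.

```python
import math

def weibull_pdf_ab(x, a, b):
    """Density in Equation (5): (a / b) * x^(a-1) * exp(-x^a / b), x > 0."""
    if x <= 0:
        return 0.0
    return (a / b) * x ** (a - 1) * math.exp(-(x ** a) / b)

def weibull_pdf_standard(x, shape, scale):
    """Common form: (k / lam) * (x / lam)^(k-1) * exp(-(x / lam)^k), x > 0."""
    if x <= 0:
        return 0.0
    r = x / scale
    return (shape / scale) * r ** (shape - 1) * math.exp(-(r ** shape))

a, b = 1.5, 2.0
scale = b ** (1.0 / a)   # usual Weibull scale implied by (a, b)
for x in (0.3, 1.0, 2.5):
    print(x, weibull_pdf_ab(x, a, b), weibull_pdf_standard(x, a, scale))
```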
Similar to [9], we use a non-conjugate uniform-inverse gamma base distribution for the unknown parameters,
$$H_0(a, b \mid \psi, \alpha, \beta) = U(a \mid 0, \psi)\, \mathrm{Igamma}(b \mid \alpha, \beta), \tag{6}$$
where $\mathrm{Igamma}(\cdot \mid \alpha, \beta)$ denotes the inverse gamma distribution with mean $\beta / (\alpha - 1)$, provided $\alpha > 1$.
Since the Pareto distribution is conjugate to the uniform distribution, we use it as the prior for $\psi$:
$$a_i \sim U(0, \psi), \qquad \psi \sim \mathrm{Pareto}(x_m, l), \qquad \psi \mid a_i \sim \mathrm{Pareto}(\max\{a_i, x_m\},\ l + n),$$
with defaults $x_m = 6$ and $l = 2$, which gives a prior distribution with infinite variance.
Since $b$ follows an inverse gamma distribution with fixed shape $\alpha$, its scale $\beta$ has a conjugate prior, namely the gamma distribution:
$$b_i \sim \text{Inv-gamma}(\alpha, \beta), \qquad \beta \sim \mathrm{gamma}(\alpha_0, \beta_0), \qquad \beta \mid b \sim \mathrm{gamma}\left( \alpha_0 + n\alpha,\ \beta_0 + \sum_{i=1}^{n} \frac{1}{b_i} \right).$$
We set $\alpha = 2$, $\alpha_0 = 1$, and $\beta_0 = 0.5$ by default.
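The two conjugate updates above can be sketched with the Python standard library. This is an illustrative sketch under stated assumptions: the second argument of the gamma distribution is read as a rate, the maximum in the Pareto update is taken over all $a_i$, and the function names and the argument `prior_shape` (the Pareto shape hyperparameter) are hypothetical.

```python
import random

def sample_psi(a_values, x_m=6.0, prior_shape=2.0, rng=random):
    """Draw psi | a from Pareto(max{a_1..a_n, x_m}, prior_shape + n)."""
    scale = max(max(a_values), x_m)
    shape = prior_shape + len(a_values)
    # random.paretovariate has minimum 1, so multiply by the Pareto scale.
    return scale * rng.paretovariate(shape)

def sample_beta(b_values, alpha=2.0, alpha0=1.0, beta0=0.5, rng=random):
    """Draw beta | b from gamma(alpha0 + n*alpha, rate = beta0 + sum(1/b_i))."""
    shape = alpha0 + len(b_values) * alpha
    rate = beta0 + sum(1.0 / b for b in b_values)
    # random.gammavariate takes a scale, i.e. the reciprocal of the rate.
    return rng.gammavariate(shape, 1.0 / rate)

rng = random.Random(0)
print(sample_psi([1.0, 7.5, 3.0], rng=rng), sample_beta([0.5, 2.0], rng=rng))
```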
The full Bayesian model can be written in the following hierarchical form:
$$\begin{aligned}
x_i \mid a_i, b_i &\overset{\text{i.i.d.}}{\sim} w(x_i \mid a_i, b_i), \quad i = 1, \ldots, n, \\
(a_i, b_i) \mid H &\overset{\text{i.i.d.}}{\sim} H, \quad i = 1, \ldots, n, \\
H \mid v, \psi, \beta &\sim \mathrm{DP}(v H_0), \\
v, \beta, \psi &\sim P(v)\, \mathrm{gamma}(\alpha_0, \beta_0)\, \mathrm{Pareto}(x_m, l),
\end{aligned}$$
with $H_0$ defined in (6); all parameters of the priors of $v$, $\beta$ and $\psi$ are fixed.
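To make the hierarchy more concrete, the sketch below draws one random mixing distribution $H$ from the DP prior using a truncated stick-breaking construction. The stick-breaking representation and the truncation level `n_atoms` are assumptions of this illustration and are not described in the paper; the atoms $(a_k, b_k)$ are drawn from the base distribution $H_0$ in (6).

```python
import random

def draw_dp_weibull_mixture(v, psi, alpha, beta, n_atoms=50, rng=random):
    """Truncated stick-breaking draw: weights w_k and atoms (a_k, b_k) ~ H0."""
    weights, atoms, remaining = [], [], 1.0
    for _ in range(n_atoms):
        stick = rng.betavariate(1.0, v)          # stick proportion ~ Beta(1, v)
        weights.append(remaining * stick)
        remaining *= 1.0 - stick
        a = rng.uniform(0.0, psi)                # a ~ U(0, psi)
        # If G ~ gamma(alpha, rate=beta) then 1/G ~ inverse gamma(alpha, beta).
        b = 1.0 / rng.gammavariate(alpha, 1.0 / beta)
        atoms.append((a, b))
    weights[-1] += remaining                     # fold leftover mass into last atom
    return weights, atoms

weights, atoms = draw_dp_weibull_mixture(v=0.1, psi=6.0, alpha=2.0, beta=1.0,
                                         rng=random.Random(42))
print(sum(weights), max(weights))
```

With a small precision such as $v = 0.1$, most of the mass typically falls on very few atoms, which is how the DP prior limits the effective number of mixture components.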

2.2. Regression Structure for PP Parameters

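The likelihood function of the extreme mixture model in (2) can be separated into the bulk distribution model and the PP tail model:
$$L(\phi \mid X) = L_K(\theta \mid X)\, L_{PP}(u, \mu, \sigma, \xi \mid X),$$
where $\phi = (\theta, u, \mu, \sigma, \xi)$ is the parameter vector.
The likelihood function of the PP model is given by
$$L_{PP}(u, \mu, \sigma, \xi \mid X) = \begin{cases} \exp\left\{ -n_b \left[ 1 + \xi \left( \dfrac{u - \mu}{\sigma} \right) \right]^{-1/\xi} \right\} \displaystyle\prod_{i \in B} \dfrac{1}{\sigma} \left[ 1 + \xi \left( \dfrac{x_i - \mu}{\sigma} \right) \right]^{-1 - 1/\xi}, & \xi \neq 0, \\[3ex] \exp\left[ -n_b \exp\left( -\dfrac{u - \mu}{\sigma} \right) \right] \displaystyle\prod_{i \in B} \dfrac{1}{\sigma} \exp\left( -\dfrac{x_i - \mu}{\sigma} \right), & \xi = 0, \end{cases}$$
where $B = \{i : x_i > u\}$, and $n_b$ is the number of exceedances of the threshold $u$; see [14,15].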
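A direct transcription of the $\xi \neq 0$ branch of the PP likelihood, on the log scale, might look as follows. This is a sketch for constant parameters; the function name `pp_loglik` and the convention of returning $-\infty$ outside the support are choices made for this example.

```python
import math

def pp_loglik(data, u, mu, sigma, xi, n_b):
    """Log of L_PP for xi != 0; returns -inf outside the parameter support."""
    if sigma <= 0:
        return -math.inf
    base_u = 1.0 + xi * (u - mu) / sigma
    if base_u <= 0:
        return -math.inf
    loglik = -n_b * base_u ** (-1.0 / xi)        # exponential term
    for x in data:
        if x > u:                                # product over exceedances (set B)
            base_x = 1.0 + xi * (x - mu) / sigma
            if base_x <= 0:
                return -math.inf
            loglik += -math.log(sigma) - (1.0 + 1.0 / xi) * math.log(base_x)
    return loglik

print(pp_loglik([1.0, 2.0, 3.0], u=1.5, mu=1.5, sigma=1.0, xi=0.5, n_b=2))
```

Working on the log scale avoids numerical underflow when the product runs over many exceedances.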
The model adds a regression structure for the estimation of the PP parameters, in which the prior distributions are assigned not to the parameters directly but to the coefficients of their linear structure [16,17]. Let the parameters $u$, $\mu$, $\sigma$, $\xi$ have the $(p + 1)$-dimensional covariate vector $Z = (1, Z_{1,i}, \ldots, Z_{p,i})$, whose first component equals 1 so that an intercept is included in each regression structure of the model. The regression coefficients form a $4 \times (p + 1)$ matrix composed of the vectors $\beta_u = (\beta_{u,0}, \ldots, \beta_{u,p})$, $\beta_\mu = (\beta_{\mu,0}, \ldots, \beta_{\mu,p})$, $\beta_\sigma = (\beta_{\sigma,0}, \ldots, \beta_{\sigma,p})$ and $\beta_\xi = (\beta_{\xi,0}, \ldots, \beta_{\xi,p})$, where each row of the matrix is associated with one of the parameters $u$, $\mu$, $\sigma$, $\xi$.
Each parameter is connected to its linear structure through a link function. In this model, we opted for a transformation of the parameters $\xi$ and $\sigma$ in the link function, according to the following scheme:
$$t(u, \mu, \sigma, \xi) = g(u, \mu, \sigma^*, \xi^*), \quad \text{where } \sigma^* = \log(\sigma) \text{ and } \xi^* = \log(\xi + 1).$$
With this reparametrization, the components $(\beta_u, \beta_\mu, \beta_\sigma, \beta_\xi)$ are orthogonal (see [18,19]), which facilitates the calculation of the joint prior density $\pi(\beta_u, \beta_\mu, \beta_\sigma, \beta_\xi)$.
Thus, the parameters $u_i$, $\mu_i$, $\sigma_i$, $\xi_i$ are estimated through the following structure:
$$\begin{aligned}
u_i &= \beta_{u,0} + \beta_{u,1} Z_{1,i} + \beta_{u,2} Z_{2,i} + \cdots + \beta_{u,p} Z_{p,i}, \\
\mu_i &= \beta_{\mu,0} + \beta_{\mu,1} Z_{1,i} + \beta_{\mu,2} Z_{2,i} + \cdots + \beta_{\mu,p} Z_{p,i}, \\
\sigma_i &= \exp\left( \beta_{\sigma,0} + \beta_{\sigma,1} Z_{1,i} + \beta_{\sigma,2} Z_{2,i} + \cdots + \beta_{\sigma,p} Z_{p,i} \right), \\
\xi_i &= \exp\left( \beta_{\xi,0} + \beta_{\xi,1} Z_{1,i} + \beta_{\xi,2} Z_{2,i} + \cdots + \beta_{\xi,p} Z_{p,i} \right) - 1,
\end{aligned}$$
where $\beta_u$, $\beta_\mu$, $\beta_\sigma$, $\beta_\xi$ are the estimates of the regression coefficients and $Z_{j,i}$ ($i = 1, \ldots, n$, $j = 1, \ldots, p$) are the covariates of the parameters $u_i$, $\mu_i$, $\sigma_i$ and $\xi_i$, which may or may not have covariates in common.

3. Bayesian Inference

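This section presents the priors for the regression coefficients of the threshold $u$, the location parameter $\mu$, the scale parameter $\sigma$ and the shape parameter $\xi$. In this work, we adopt normal priors for $\beta_u$, $\beta_\mu$, $\beta_\sigma$ and $\beta_\xi$, namely
$$\begin{aligned}
\beta_{u,0} &\sim N(u_0, V_{\beta_{u,0}}), & \beta_{u,j} &\sim N(0, V_{\beta_{u,j}}), \\
\beta_{\mu,0} &\sim N(\mu_0, V_{\beta_{\mu,0}}), & \beta_{\mu,j} &\sim N(0, V_{\beta_{\mu,j}}), \\
\beta_{\sigma,0} &\sim N(\sigma_0, V_{\beta_{\sigma,0}}), & \beta_{\sigma,j} &\sim N(0, V_{\beta_{\sigma,j}}), \\
\beta_{\xi,0} &\sim N(\xi_0, V_{\beta_{\xi,0}}), & \beta_{\xi,j} &\sim N(0, V_{\beta_{\xi,j}}),
\end{aligned}$$
for $j = 1, \ldots, p$. Then, we have
$$\begin{aligned}
\pi(\beta_u, \beta_\mu, \beta_\sigma, \beta_\xi) \propto{} & \exp\left( -\frac{(\beta_{u,0} - u_0)^2}{2 V_{\beta_{u,0}}} - \sum_{k=1}^{p} \frac{\beta_{u,k}^2}{2 V_{\beta_{u,k}}} \right) \times \exp\left( -\frac{(\beta_{\mu,0} - \mu_0)^2}{2 V_{\beta_{\mu,0}}} - \sum_{k=1}^{p} \frac{\beta_{\mu,k}^2}{2 V_{\beta_{\mu,k}}} \right) \\
& \times \exp\left( -\frac{(\beta_{\sigma,0} - \sigma_0)^2}{2 V_{\beta_{\sigma,0}}} - \sum_{k=1}^{p} \frac{\beta_{\sigma,k}^2}{2 V_{\beta_{\sigma,k}}} \right) \times \exp\left( -\frac{(\beta_{\xi,0} - \xi_0)^2}{2 V_{\beta_{\xi,0}}} - \sum_{k=1}^{p} \frac{\beta_{\xi,k}^2}{2 V_{\beta_{\xi,k}}} \right).
\end{aligned}$$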
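The posterior distribution of the proposed point process regression extreme model using a Dirichlet process mixture of Weibull distribution (DPMW-PPR) is then as follows:
$$p(\theta, \phi \mid x) \propto \prod_{i \in A} k(x_i \mid \theta) \times \exp\left\{ -n_b \left[ 1 + \xi_i \left( \frac{u_i - \mu_i}{\sigma_i} \right) \right]^{-1/\xi_i} \right\} \prod_{i \in B} \frac{1}{\sigma_i} \left[ 1 + \xi_i \left( \frac{x_i - \mu_i}{\sigma_i} \right) \right]^{-1 - 1/\xi_i} \times \pi(\beta_u, \beta_\mu, \beta_\sigma, \beta_\xi)$$
for $\xi \neq 0$ and
$$p(\theta, \phi \mid x) \propto \prod_{i \in A} k(x_i \mid \theta) \times \exp\left[ -n_b \exp\left( -\frac{u_i - \mu_i}{\sigma_i} \right) \right] \prod_{i \in B} \frac{1}{\sigma_i} \exp\left( -\frac{x_i - \mu_i}{\sigma_i} \right) \times \pi(\beta_u, \beta_\mu, \beta_\sigma, \beta_\xi)$$
for $\xi = 0$, with $A = \{i : x_i \le u_i\}$ and $B = \{i : x_i > u_i\}$.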

4. Simulation Study

For the central part, the data are simulated from a mixture of lognormal distributions:
$$h(x) = 0.8\, LN(\mu_1 = 0, \sigma_1^2 = 0.25) + 0.2\, LN(\mu_2 = 1.2, \sigma_2^2 = 0.02).$$
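Data from this lognormal mixture can be generated with the Python standard library, as in the sketch below. The function name, the seed and the sample size of 500 are illustrative choices; note that `random.lognormvariate` takes the standard deviation of the underlying normal, so the variances 0.25 and 0.02 are converted with a square root.

```python
import math
import random

def sample_lognormal_mixture(n, rng):
    """Draw n values from 0.8 * LN(0, 0.25) + 0.2 * LN(1.2, 0.02)."""
    draws = []
    for _ in range(n):
        if rng.random() < 0.8:                      # first component, weight 0.8
            draws.append(rng.lognormvariate(0.0, math.sqrt(0.25)))
        else:                                       # second component, weight 0.2
            draws.append(rng.lognormvariate(1.2, math.sqrt(0.02)))
    return draws

sample = sample_lognormal_mixture(500, random.Random(2022))
print(len(sample), min(sample), max(sample))
```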
As in [9], we set the DP precision $v = 0.1$ and $\alpha = 2$; the hyperparameters for $\beta$ and $\psi$ are $\alpha_0 = 1$, $\beta_0 = 0.5$, $x_m = 6$ and $l = 2$.
For the PP part of the simulated data, we fix the threshold at $u = 1.98$, which places 10% of the distribution in the upper tail; consequently, the regression coefficients $\beta_{u,1}$ and $\beta_{u,2}$ are both equal to zero and $\beta_{u,0} = 1.98$. Following Lima [20], the first covariate of the parameters $\mu$, $\sigma$, $\xi$ is given by $Z_{1,i} = \cos(2\pi i / 12)$, representing the seasonal effect, and the second covariate $Z_{2,i}$ is given by $U(1, 0.1)$. We set the regression coefficients as follows:
$$\begin{aligned}
\beta_{\mu,0} &= 5.5, & \beta_{\mu,1} &= 0.05, & \beta_{\mu,2} &= 0.5, \\
\beta_{\sigma,0} &= 0.25, & \beta_{\sigma,1} &= 0.015, & \beta_{\sigma,2} &= 0.28, \\
\beta_{\xi,0} &= 0.4, & \beta_{\xi,1} &= 0.016, & \beta_{\xi,2} &= 0.024.
\end{aligned}$$
The prior of the threshold $u$, which equals $\beta_{u,0}$ in this simulation, is $\beta_{u,0} \sim N(u_0, V_{\beta_{u,0}})$, where $u_0$ is the 90% quantile of the simulated data and $V_{\beta_{u,0}}$ is chosen so that 99% of the prior probability lies between the 50% and 99% quantiles of the simulated data.
For the regression coefficients of $\mu$, $\sigma$ and $\xi$, we consider the following non-informative prior distributions:
$$\begin{aligned}
\beta_{\mu,0} &\sim N(0, 100), & \beta_{\mu,1} &\sim N(0, 100), & \beta_{\mu,2} &\sim N(0, 100), \\
\beta_{\sigma,0} &\sim N(0, 100), & \beta_{\sigma,1} &\sim N(0, 10), & \beta_{\sigma,2} &\sim N(0, 10), \\
\beta_{\xi,0} &\sim N(0, 100), & \beta_{\xi,1} &\sim N(0, 10), & \beta_{\xi,2} &\sim N(0, 10).
\end{aligned}$$
As in the usual Metropolis algorithm, the variance of the proposal density in the Markov chain Monte Carlo (MCMC) simulations is tuned using the Hessian evaluated at the maximum likelihood estimate. After 5000 burn-in iterations, convergence was obtained for all parameters using 10,000 iterations.
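The idea of scaling a Metropolis proposal by the curvature at the mode can be illustrated on a one-dimensional toy problem. The sketch below is not the sampler used in the paper: the target density, the finite-difference curvature estimate, the scaling constant 2.38 and the reading of the chain length as 5000 burn-in draws followed by 10,000 retained draws are all assumptions made for this example.

```python
import math
import random

def second_derivative(f, x, h=1e-4):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

def metropolis(log_post, start, proposal_sd, n_keep, burn_in, rng):
    """Random-walk Metropolis; discards burn_in draws and keeps n_keep."""
    current, current_lp = start, log_post(start)
    chain = []
    for it in range(burn_in + n_keep):
        candidate = current + rng.gauss(0.0, proposal_sd)
        candidate_lp = log_post(candidate)
        accept_prob = math.exp(min(0.0, candidate_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = candidate, candidate_lp
        if it >= burn_in:
            chain.append(current)
    return chain

# Toy log-posterior (hypothetical): normal with mean 3 and variance 0.25.
def log_post(x):
    return -((x - 3.0) ** 2) / (2.0 * 0.25)

mode = 3.0                                        # known here; found numerically in practice
curvature = -second_derivative(log_post, mode)    # negative Hessian at the mode
proposal_sd = 2.38 / math.sqrt(curvature)         # a common 1-D random-walk scaling
chain = metropolis(log_post, mode, proposal_sd, 10_000, 5_000, random.Random(1))
posterior_mean = sum(chain) / len(chain)
print(curvature, posterior_mean)
```

In several dimensions the same idea uses the inverse of the negative Hessian matrix as the proposal covariance, up to a scaling factor.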
Table 1 shows the estimates of the regression coefficients for the simulation with sample size $n = 500$; the estimate of the threshold $u$ is $\hat{u} = 2.0544$, with 95% credibility interval $(1.8223, 2.2867)$. The standard model fit diagnostics for the upper tail of the DPMW-PPR model are plotted in Figure 1, which indicates no issues with the upper tail fit. The standard model fit diagnostics for the full range of values of the DPMW-PPR model are plotted in Figure 2, indicating an adequate fit over the observed range of support.
In standard model fit diagnostics figures, the top left corner is the return level plot, the top right corner is the quantile plot, the bottom left corner is the probability plot, and the bottom right corner is the density plot. In the return level plot, the points are the simulation data, the red line is the return level line fitted to the DPMW-PPR model, the dashed lines illustrate the 95 % credibility interval of the return level, and the two dotted lines are the upper tail fraction and threshold. In the quantile plot, the points are the simulation data, the line of equality is illustrated as the red line, the dashed lines illustrate the 95 % credibility interval, and the dotted lines are the thresholds. In the probability plot, the points are the simulation data, the line of equality is illustrated as the red line, the dashed lines illustrate the 95 % credibility interval, and the dotted lines are the upper tail fractions. In the density plot, the black solid line shows the true density function, the green dashed line shows the posterior predictive density function, and the red dashed line shows the estimate of threshold u.
In Figure 1 and Figure 2, both the probability and quantile plots support the fitted model: each set of plotted points is nearly linear. The return level curves are close to linear and provide a satisfactory representation of the empirical estimates, especially once sampling variability is taken into account. The density plots in the bottom right corners of both figures show the density histogram of the 500 simulated values together with the true density function (black solid line), the posterior density function with regression (blue dashed line) and the estimate of the threshold $u$ (red dashed line).
Table 2 shows the estimates of the regression coefficients for the simulation with sample size $n = 1000$; the estimate of the threshold $u$ is $\hat{u} = 1.9657$, with 95% credibility interval $(1.7416, 2.1897)$. Table 3 shows the estimates of the regression coefficients for the simulation with sample size $n = 5000$; the estimate of the threshold $u$ is $\hat{u} = 1.9851$, with 95% credibility interval $(1.9075, 2.0627)$. The results are similar to those in Table 1. Because a larger sample carries more information about the parameters, the credibility intervals are shorter than the corresponding intervals for the smaller sample sizes. The results suggest that the proposed model performs even better when large sample sizes are considered.
Figure 3 and Figure 4 show the standard model fit diagnostics for the upper tail and for the full range of values of the DPMW-PPR model with sample size $n = 1000$, respectively. Figure 5 and Figure 6 show the corresponding diagnostics for sample size $n = 5000$. These standard diagnostic plots indicate a good fit of the DPMW-PPR model. The probability and quantile plots are better than in the previous example with sample size $n = 500$, and once the intervals are added to the return level curve, the quality of the fit becomes more convincing.
Furthermore, we compare the proposed model (DPMW-PPR) with the Point Process (PP) extreme value model and the Weibull mixture Point Process (WMPP) extreme model, which combines a Weibull distribution in the bulk part with the Point Process extreme value model in the tail [3]. Figure 7, Figure 8 and Figure 9 show the true density and the posterior densities of the three models for the simulated sample sizes $n = 500$, $n = 1000$ and $n = 5000$, respectively. In these comparisons, the proposed model is superior to the other two models.

5. Application to Real Data

This section applies the proposed model to real data. To show the utility of the new model, we apply the methodology to two river flow datasets (environmental data): the monthly flow data for the Iowa river measured at Wapello, IA, USA and the daily flow data for the Patuxent river observed near Bowie, MD, USA. For estimation of the model, the MCMC algorithm was run on a 64-bit Windows 10 system with a 2.20 GHz Opteron processor and 16 GB RAM. The MCMC Metropolis-Hastings sampler was initialized at an arbitrary starting parameter vector and run for 20,000 iterations with a burn-in period of 5000, giving 15,000 posterior draws for each run.

5.1. The Monthly Flow in the Iowa River

The Iowa river is a tributary of the Mississippi River in the State of Iowa in the United States and is approximately 323 miles (520 km) long. Iowa City is located on the river, approximately 65 miles (105 km) from its confluence with the Mississippi River. It is therefore of practical significance to study the extreme values of the flow data. The data used in this paper are the monthly flows (in cubic meters per second) of the Iowa river for the period from October 1958 to September 2006. The sample size of this dataset is $n = 576$. The Iowa River dataset is provided by the National Water Information System (available online: http://waterdata.usgs.gov/ia/nwis/sw, accessed on 21 November 2021).
Figure 10 plots the time series of the Iowa river monthly flow. The data show pronounced annual periodicity and strong seasonal effects, so a single threshold is inappropriate and it is important to develop a regression structure for the threshold. Figure 11 decomposes the time series into trend, seasonal and random components using moving averages; the seasonality of the Iowa river flow can be observed in this figure.
As shown in [1,16], the seasonality clearly present in environmental data can be incorporated into the model through trigonometric functions. In most cases the seasonal effect is well captured by sine and cosine waves. Since the Iowa river flow data form a monthly time series, we use the two covariates $Z_{1,m} = \cos(2\pi m / 12)$ and $Z_{2,m} = \sin(2\pi m / 12)$ in the regression structure of the parameters $u$, $\mu$, $\sigma$ and $\xi$, where $m$ is the month; these covariates capture the seasonal behavior of the data. For the regression coefficients of $\mu$, $\sigma$ and $\xi$, we use the same prior distributions as in the simulation study in Section 4. For the regression coefficients of $u$, we use the prior distributions $\beta_{u,0} \sim N(0, 100)$, $\beta_{u,1} \sim N(0, 100)$ and $\beta_{u,2} \sim N(0, 100)$.
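The seasonal covariates and the resulting month-dependent threshold can be sketched as follows. The coefficient values in `beta_u` are hypothetical and are not the estimates reported in Table 4.

```python
import math

def seasonal_covariates(m, period=12.0):
    """Covariate vector (1, cos(2*pi*m/period), sin(2*pi*m/period))."""
    angle = 2.0 * math.pi * m / period
    return (1.0, math.cos(angle), math.sin(angle))

# Hypothetical threshold coefficients (intercept, cosine term, sine term).
beta_u = (300.0, -50.0, 80.0)
thresholds = {
    m: sum(b * z for b, z in zip(beta_u, seasonal_covariates(m)))
    for m in range(1, 13)
}
for m, u_m in thresholds.items():
    print(f"month {m:2d}: u = {u_m:7.2f}")
```

Using both a cosine and a sine term lets the regression choose the amplitude and the phase of the annual cycle, so the month of the peak does not have to be fixed in advance. Setting `period=365.25` gives the analogous covariates for daily data.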
Table 4 shows the estimates of the regression coefficients and their respective 95% credibility intervals. None of the credibility intervals includes zero, so every coefficient is significant for the estimation of its parameter; in particular, the seasonal covariates $Z_1$ and $Z_2$ have significant effects. For the regression coefficients of the parameter $\xi$, the estimate of $\beta_{\xi,0}$ is negative, which indicates that the data have a light tail. With the estimated regression coefficients, we can see how the parameters of the DPMW-PPR distribution behave over the months.
Figure 12 shows the estimate of the threshold $u$ over the twelve months. The estimated threshold increases markedly from October of the previous year to April of the current year, with high values between March and May, and then decreases markedly from April to October; it reaches its peak in April and its lowest value in October.
Figure 13 shows the return levels for return periods of 10, 50 and 100 years. For the 10-year period, the expected return level is highest in May, at 1439.791 m³/s; for the 50-year and 100-year periods, it is highest in June, at 1816.659 m³/s and 1978.948 m³/s, respectively. The return levels are lowest in October, at 474.8953 m³/s, 562.0131 m³/s and 591.2955 m³/s for the 10-, 50- and 100-year periods, respectively.
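The formula used to compute these return levels is not stated in the text. Purely as an illustration, and assuming the PP parameters are expressed on an annual block scale so that they coincide with the GEV parameters of annual maxima, the $N$-year return level is the $1 - 1/N$ quantile of the GEV distribution. The parameter values below are hypothetical and unrelated to the fitted values.

```python
import math

def gev_return_level(n_years, mu, sigma, xi):
    """(1 - 1/N) quantile of GEV(mu, sigma, xi), for xi != 0."""
    y = -math.log(1.0 - 1.0 / n_years)
    return mu - (sigma / xi) * (1.0 - y ** (-xi))

def gev_cdf(z, mu, sigma, xi):
    """GEV distribution function for xi != 0, within its support."""
    return math.exp(-((1.0 + xi * (z - mu) / sigma) ** (-1.0 / xi)))

# Hypothetical parameters for one month.
mu, sigma, xi = 100.0, 30.0, 0.1
for n_years in (10, 50, 100):
    level = gev_return_level(n_years, mu, sigma, xi)
    print(n_years, round(level, 3), round(gev_cdf(level, mu, sigma, xi), 6))
```

In a seasonal model of the kind described here, the same computation would be repeated with the parameter values of each month, which yields one return level curve per month.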
Figure 14 shows the standard model fit diagnostics for the application of the DPMW-PPR model. Neither the probability plot nor the quantile plot causes doubt in the validity of the fitted model: each set of plotted points is near-linear. We have the adjustment of expected return levels in the original series of the Iowa river flow. The return level curve also provides a satisfactory representation of the empirical estimates. Finally, the corresponding density estimate seems consistent with the histogram of the data. Consequently, all four diagnostic plots lend support to the fitted DPMW-PPR model of the Iowa river monthly flow.
Furthermore, we compare the proposed model (DPMW-PPR) with the Point Process (PP) extreme value model and the Weibull mixture Point Process (WMPP) extreme model [3]. Figure 15 shows the fitted densities of the three models for the Iowa River monthly flow data. In this comparison, the proposed model is superior to the other two models.

5.2. The Daily Flow in the Patuxent River

The Patuxent river is the longest river contained entirely within Maryland in the United States. Along its length, it changes from a slow-moving stream to a wide tidal estuary that empties into the Chesapeake Bay near the southernmost point of Maryland's Western Shore. The daily flow observations for the Patuxent river come from USGS stream gage station 01594440 near Bowie, Maryland. The data used in this subsection are the daily flows (in cubic meters per second) of the Patuxent river for the period from 1 January 2005 to 30 December 2014, with data missing for very few dates. The sample size of this dataset is $n = 3506$. The Patuxent river dataset is provided by the National Water Information System (available online: https://waterdata.usgs.gov/md/nwis/dvstat, accessed on 21 November 2021).
Figure 16 plots the time series of the Patuxent river daily flow. The data show pronounced annual periodicity and strong seasonal effects, so it is important to develop a regression structure for the threshold. Figure 17 shows the decomposition of the time series, from which the seasonality of the Patuxent river flow can be observed.
Since the Patuxent river flow data form a daily time series, we use the two covariates $Z_{1,t} = \cos(2\pi t / 365.25)$ and $Z_{2,t} = \sin(2\pi t / 365.25)$ in the regression structure of the parameters $u$, $\mu$, $\sigma$ and $\xi$, where $t$ is the index of the observation in the series. We use the same prior distributions for the regression coefficients as in Section 5.1.
Table 5 shows the estimates of the regression coefficients and their respective 95% credibility intervals. For the shape parameter $\xi$, the coefficients $\beta_{\xi,1}$ and $\beta_{\xi,2}$ are not significant, showing that there is no seasonality in the shape parameter. We also note that $\beta_{\xi,0}$ is positive, indicating that the data have a long-tailed distribution. For the parameters $\mu$ and $\sigma$ and the threshold $u$, all regression coefficients are significant, so the seasonal covariates $Z_1$ and $Z_2$ both have significant effects on these parameters.
Figure 18 shows the estimate of the threshold $u$ over the 365 days of the year. The estimated threshold decreases markedly from around day 308 of the previous year to around day 129 of the current year, and then increases sharply from around day 129 to around day 308; it reaches its lowest value on about day 129 and its peak on about day 308.
Figure 19 shows the return levels for return periods of 10, 50 and 100 years. The highest expected return level is 203.6169 m³/s for the 10-year period, 256.3437 m³/s for the 50-year period and 303.6306 m³/s for the 100-year period. The lowest return level is 171.9141 m³/s for the 10-year period, 207.4655 m³/s for the 50-year period and 232.2825 m³/s for the 100-year period.
Figure 20 shows the standard model fit diagnostics for the application of the DPMW-PPR model. Neither the probability plot nor the quantile plot casts doubt on the validity of the fitted model: each set of plotted points is near-linear. We have the adjustment of expected return levels in the original series of the Patuxent river flow. Since the shape parameter $\xi > 0$, the return level plot is concave. The return level curve also provides a satisfactory representation of the empirical estimates. Finally, the corresponding density estimate seems consistent with the histogram of the data, and the posterior distribution has a heavy tail. Consequently, all four diagnostic plots lend support to the fitted DPMW-PPR model of the Patuxent river daily flow.
Unlike the general extreme value model, we model all data, not only those belonging to the tail. With the DPMW distribution below the threshold, the proposed model works well even in the absence of prior information and with small sample sizes. The regression structure for the PP model parameters explains the behavior of the time series. Furthermore, we compare the proposed model (DPMW-PPR) with the Point Process (PP) extreme value model and the Weibull mixture Point Process (WMPP) extreme model, which combines a Weibull distribution in the bulk part with the Point Process extreme value model in the tail [3]. Figure 21 shows the fitted densities of the three models for the Patuxent river daily flow. In this comparison, the proposed model is superior to the other two models.

6. Conclusions

In this paper, we consider a new extreme mixture model, which comprises a Dirichlet process mixture of Weibull distribution below the threshold and the point process extreme model for the upper tail. The model adds a regression structure for the estimation of the PP extreme model parameters: under the Bayesian approach, the prior distributions are assigned not to the extreme parameters directly, but to the coefficients of their linear structure. With the Dirichlet process mixture of Weibull distribution, the proposed model works well even in the absence of prior information and with small sample sizes. The regression structure for the PP model parameters explains the behavior of the time series. The model is applied to both simulated and environmental data to demonstrate its performance in extrapolating extreme events.

Author Contributions

Conceptualization, Y.W.; methodology, Y.W.; software, Y.W.; validation, Y.W. and X.L.; formal analysis, Y.W. and X.L.; writing—original draft preparation, Y.W.; writing—review and editing, Y.W.; supervision, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of China (grant no. 61374183) and the Postgraduate Research & Practice Innovation Program of Jiangsu Province (grant no. KYCX22_0322).

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to thank the editors and reviewers for providing useful comments and suggestions to improve the quality of this article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Coles, S. An Introduction to Statistical Modeling of Extreme Values; Springer: London, UK, 2001.
  2. Coles, S.; Tawn, J. A Bayesian analysis of extreme rainfall data. J. R. Stat. Soc. Ser. C-Appl. Stat. 1996, 45, 463–478.
  3. MacDonald, A.; Scarrott, C.J.; Lee, D.; Darlow, B.; Reale, M.; Russell, G. A flexible extreme value mixture model. Comput. Stat. Data Anal. 2011, 55, 2137–2157.
  4. Pickands, J. The two dimensional Poisson process and extremal processes. J. Appl. Probab. 1971, 8, 745–756.
  5. Pickands, J. Statistical inference using extreme order statistics. Ann. Stat. 1975, 3, 119–131.
  6. Northrop, P.J.; Attalides, N.; Jonathan, P. Cross-validatory extreme value threshold selection and uncertainty with application to ocean storm severity. J. R. Stat. Soc. Ser. C-Appl. Stat. 2017, 66, 93–120.
  7. Frigessi, A.; Haug, O.; Rue, H. A dynamic mixture model for unsupervised tail estimation without threshold estimation. Extremes 2002, 5, 219–235.
  8. Behrens, C.N.; Lopes, H.F.; Gamerman, D. Bayesian analysis of extreme events with threshold estimation. Stat. Model. 2003, 4, 227–244.
  9. Kottas, A. Nonparametric Bayesian survival analysis using mixtures of Weibull distributions. J. Stat. Plan. Infer. 2006, 136, 578–596.
  10. Ferguson, T. A Bayesian Analysis of Some Nonparametric Problems. Ann. Stat. 1973, 1, 209–230.
  11. Antoniak, C.E. Mixture of Dirichlet process with applications to Bayesian Nonparametric problems. Ann. Stat. 1974, 2, 1152–1174.
  12. Hanson, T.E. Modelling censoring lifetime data using a mixture of gamma baselines. Bayesian Anal. 2006, 3, 575–593.
  13. Patino, F. A semi-parametric Bayesian extreme value model using a Dirichlet process mixture of gamma densities. J. Appl. Stat. 2015, 42, 267–280.
  14. Smith, R. Maximum likelihood estimation in a class of non-regular cases. Biometrika 1985, 72, 67–90.
  15. Wadsworth, J.L.; Tawn, J.A.; Jonathan, P. Accounting for choice of measurement scale in extreme value modeling. Ann. Appl. Stat. 2010, 4, 1558–1578.
  16. Nascimento, F.F.; Gamerman, D.; Lopes, H.F. Regression models for exceedance data via the full likelihood. Environ. Ecol. Stat. 2011, 18, 495–512.
  17. Nascimento, F.F.; Azevedo, A.; Ferraz, V.R. Regression models to dependence for exceedance. J. Appl. Stat. 2020, 16, 3048–3059.
  18. Chavez-Demoulin, V.; Davison, A.C. Generalized additive modelling of sample extremes. J. R. Stat. Soc. Ser. C-Appl. Stat. 2005, 54, 207–222.
  19. Nascimento, F.F.; Assuncao, A. Regression models for change point data in extremes. Braz. J. Probab. Stat. 2021, 35, 85–100.
  20. Lima, S.R.; Nascimento, F.F.; Ferraz, V.R.S. Regression models for time-varying extremes. J. Stat. Comput. Simul. 2018, 88, 235–249.
Figure 1. Standard model fit diagnostics for the upper tail of the DPMW-PPR model fitted to the simulated data with $n = 500$ (top left: return level plot; top right: quantile plot; bottom left: probability plot; bottom right: density plot).
Figure 2. Standard model fit diagnostics for the DPMW-PPR model fitted to the simulated data with $n = 500$ (top left: return level plot; top right: quantile plot; bottom left: probability plot; bottom right: density plot).
Figure 3. Standard model fit diagnostics for the upper tail of the DPMW-PPR model fitted to the simulated data with $n = 1000$ (top left: return level plot; top right: quantile plot; bottom left: probability plot; bottom right: density plot).
Figure 4. Standard model fit diagnostics for the DPMW-PPR model fitted to the simulated data with $n = 1000$ (top left: return level plot; top right: quantile plot; bottom left: probability plot; bottom right: density plot).
Figure 5. Standard model fit diagnostics for the upper tail of the DPMW-PPR model fitted to the simulated data with $n = 5000$ (top left: return level plot; top right: quantile plot; bottom left: probability plot; bottom right: density plot).
Figure 6. Standard model fit diagnostics for the DPMW-PPR model fitted to the simulated data with $n = 5000$ (top left: return level plot; top right: quantile plot; bottom left: probability plot; bottom right: density plot).
Figure 7. Posterior density plot for the three models (the DPMW-PPR model, the PP extreme value model and the WMPP extreme model) with $n = 500$. The red line shows the true density function; the green, blue and purple lines show the posterior density functions of the DPMW-PPR model, the WMPP extreme model and the PP extreme value model, respectively.
Figure 8. Posterior density plot for the three models (the DPMW-PPR model, the PP extreme value model and the WMPP extreme model) with $n = 1000$. The red line shows the true density function; the green, blue and purple lines show the posterior density functions of the DPMW-PPR model, the WMPP extreme model and the PP extreme value model, respectively.
Figure 9. Posterior density plot for the three models (the DPMW-PPR model, the PP extreme value model and the WMPP extreme model) with $n = 5000$. The red line shows the true density function; the green, blue and purple lines show the posterior density functions of the DPMW-PPR model, the WMPP extreme model and the PP extreme value model, respectively.
Figure 10. Time series of the Iowa River monthly flow.
Figure 11. Decomposition of the additive time series for the Iowa River monthly flow.
Figure 12. Parameter $u$ varying over time for the Iowa River flow data. The green line is the estimate of $u$; the shaded area is the 95% confidence interval of the estimates.
Figure 13. Return levels expected every 10, 50 and 100 years. Green: expected return every 10 years; red: expected return every 50 years; blue: expected return every 100 years.
Figure 14. Standard model fit diagnostics for the DPMW-PPR model fitted to the Iowa River monthly flow (top left: return level plot; top right: quantile plot; bottom left: probability plot; bottom right: density plot).
Figure 15. Posterior density plot for the three models (the DPMW-PPR model, the PP extreme value model and the WMPP extreme model) fitted to the Iowa River monthly flow data. The red line shows the true density function; the green, blue and purple lines show the posterior density functions of the DPMW-PPR model, the WMPP extreme model and the PP extreme value model, respectively.
Figure 16. Time series of the Patuxent River daily flow.
Figure 17. Decomposition of the additive time series for the Patuxent River daily flow.
Figure 18. Parameter $u$ varying over time for the Patuxent River flow data. The green line is the estimate of $u$; the shaded area is the 95% confidence interval of the estimates.
Figure 19. Return levels expected every 10, 50 and 100 years for the Patuxent River daily flow. Green: expected return every 10 years; red: expected return every 50 years; blue: expected return every 100 years.
Figure 20. Standard model fit diagnostics for the DPMW-PPR model fitted to the Patuxent River daily flow (top left: return level plot; top right: quantile plot; bottom left: probability plot; bottom right: density plot).
Figure 21. Posterior density plot for the three models (the DPMW-PPR model, the PP extreme value model and the WMPP extreme model) fitted to the Patuxent River daily flow. The red line shows the true density function; the green, blue and purple lines show the posterior density functions of the DPMW-PPR model, the WMPP extreme model and the PP extreme value model, respectively.
Table 1. Estimates for the regression coefficients of simulations with sample $n = 500$.

| Coefficient | Posterior mean | 95% credibility interval | Interval length |
|---|---|---|---|
| $\beta_{\mu,0} = 5.5$ | 5.8712 | (5.2592, 6.4832) | 1.224 |
| $\beta_{\mu,1} = 0.05$ | −0.0288 | (−0.0548, −0.0028) | 0.0520 |
| $\beta_{\mu,2} = 0.5$ | 0.7027 | (0.4917, 0.9138) | 0.4221 |
| $\beta_{\sigma,0} = 0.25$ | −0.2861 | (−0.3758, −0.1964) | 0.1794 |
| $\beta_{\sigma,1} = 0.015$ | 0.0225 | (−0.0337, 0.0786) | 0.1124 |
| $\beta_{\sigma,2} = 0.28$ | 0.2601 | (0.1402, 0.3809) | 0.2407 |
| $\beta_{\xi,0} = 0.4$ | 0.5870 | (0.0345, 1.1396) | 1.1051 |
| $\beta_{\xi,1} = 0.016$ | −0.0186 | (−0.0292, −0.0081) | 0.0211 |
| $\beta_{\xi,2} = 0.024$ | 0.0401 | (−0.100, 0.1804) | 0.2802 |

Table 2. Estimates for the regression coefficients of simulations with sample $n = 1000$.

| Coefficient | Posterior mean | 95% credibility interval | Interval length |
|---|---|---|---|
| $\beta_{\mu,0} = 5.5$ | 5.4863 | (5.0157, 5.9568) | 0.9411 |
| $\beta_{\mu,1} = 0.05$ | −0.0585 | (−0.1572, 0.0402) | 0.1974 |
| $\beta_{\mu,2} = 0.5$ | 0.5424 | (0.4517, 0.6331) | 0.1814 |
| $\beta_{\sigma,0} = 0.25$ | −0.2404 | (−0.3058, −0.1751) | 0.1308 |
| $\beta_{\sigma,1} = 0.015$ | 0.0209 | (−0.0328, 0.0476) | 0.1074 |
| $\beta_{\sigma,2} = 0.28$ | 0.2639 | (0.2025, 0.3254) | 0.1229 |
| $\beta_{\xi,0} = 0.4$ | 0.4708 | (0.2685, 0.6731) | 0.4046 |
| $\beta_{\xi,1} = 0.016$ | −0.0142 | (−0.0218, −0.0066) | 0.0152 |
| $\beta_{\xi,2} = 0.024$ | 0.0230 | (−0.0186, 0.0646) | 0.0831 |

Table 3. Estimates for the regression coefficients of simulations with sample $n = 5000$.

| Coefficient | Posterior mean | 95% credibility interval | Interval length |
|---|---|---|---|
| $\beta_{\mu,0} = 5.5$ | 5.4952 | (5.2433, 5.7471) | 0.5038 |
| $\beta_{\mu,1} = 0.05$ | −0.0485 | (−0.0698, −0.0271) | 0.0426 |
| $\beta_{\mu,2} = 0.5$ | 0.5036 | (0.4822, 0.5249) | 0.0428 |
| $\beta_{\sigma,0} = 0.25$ | −0.2518 | (−0.2602, −0.2434) | 0.0168 |
| $\beta_{\sigma,1} = 0.015$ | 0.0132 | (−0.0255, 0.0520) | 0.0774 |
| $\beta_{\sigma,2} = 0.28$ | 0.2753 | (0.2636, 0.2871) | 0.0234 |
| $\beta_{\xi,0} = 0.4$ | 0.4162 | (0.3156, 0.5168) | 0.2012 |
| $\beta_{\xi,1} = 0.016$ | −0.0173 | (−0.0214, −0.0132) | 0.0082 |
| $\beta_{\xi,2} = 0.024$ | 0.0248 | (−0.0093, 0.0589) | 0.0682 |

Table 4. Estimates and 95% credibility interval for the regression coefficients of the Iowa river monthly flow.

| Coefficient | Posterior mean | 95% credibility interval |
|---|---|---|
| $\beta_{u,0}$ | 496.8886 | (442.9628, 550.8143) |
| $\beta_{u,1}$ | 209.1493 | (132.8868, 285.4118) |
| $\beta_{u,2}$ | −188.5929 | (−264.8554, −112.3304) |
| $\beta_{\mu,0}$ | 581.5001 | (538.5956, 624.4043) |
| $\beta_{\mu,1}$ | 214.7011 | (171.4820, 257.9182) |
| $\beta_{\mu,2}$ | −199.2846 | (−256.3556, −142.2444) |
| $\beta_{\sigma,0}$ | 5.1310 | (5.0304, 5.2316) |
| $\beta_{\sigma,1}$ | 0.1875 | (0.0637, 0.3113) |
| $\beta_{\sigma,2}$ | −0.5595 | (−0.7006, −0.4184) |
| $\beta_{\xi,0}$ | −0.1859 | (−0.2448, −0.1270) |
| $\beta_{\xi,1}$ | −0.1653 | (−0.2409, −0.0897) |
| $\beta_{\xi,2}$ | −0.1273 | (−0.2091, −0.0455) |

Table 5. Estimates and 95% credibility interval for the regression coefficients of the Patuxent river daily flow.

| Coefficient | Posterior mean | 95% credibility interval |
|---|---|---|
| $\beta_{u,0}$ | 20.0316 | (19.6799, 20.3834) |
| $\beta_{u,1}$ | −4.5074 | (−5.0068, −4.0080) |
| $\beta_{u,2}$ | 3.2883 | (2.7932, 3.7834) |
| $\beta_{\mu,0}$ | 167.3000 | (115.4776, 219.1224) |
| $\beta_{\mu,1}$ | 11.1576 | (4.8359, 17.4793) |
| $\beta_{\mu,2}$ | 9.2531 | (2.7590, 15.7472) |
| $\beta_{\sigma,0}$ | 4.5040 | (3.9989, 5.0091) |
| $\beta_{\sigma,1}$ | 0.1822 | (0.0411, 0.3233) |
| $\beta_{\sigma,2}$ | −0.0208 | (−0.0445, −0.0029) |
| $\beta_{\xi,0}$ | 0.5538 | (0.3828, 0.7248) |
| $\beta_{\xi,1}$ | 0.0642 | (−0.0982, 0.2266) |
| $\beta_{\xi,2}$ | −0.0452 | (−0.1933, 0.1029) |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Wang, Y.; Liu, X. A New Point Process Regression Extreme Model Using a Dirichlet Process Mixture of Weibull Distribution. Mathematics 2022, 10, 3781. https://doi.org/10.3390/math10203781

AMA Style

Wang Y, Liu X. A New Point Process Regression Extreme Model Using a Dirichlet Process Mixture of Weibull Distribution. Mathematics. 2022; 10(20):3781. https://doi.org/10.3390/math10203781

Chicago/Turabian Style

Wang, Yingjie, and Xinsheng Liu. 2022. "A New Point Process Regression Extreme Model Using a Dirichlet Process Mixture of Weibull Distribution" Mathematics 10, no. 20: 3781. https://doi.org/10.3390/math10203781

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
