Article

A Composite Half-Normal-Pareto Distribution with Applications to Income and Expenditure Data

by Neveka M. Olmos 1, Emilio Gómez-Déniz 2, Osvaldo Venegas 3,* and Héctor W. Gómez 1
1 Departamento de Estadística y Ciencias de Datos, Facultad de Ciencias Básicas, Universidad de Antofagasta, Antofagasta 1240000, Chile
2 Department of Quantitative Methods in Economics and TIDES Institute, University of Las Palmas de Gran Canaria, 35017 Las Palmas de Gran Canaria, Spain
3 Departamento de Ciencias Matemáticas y Físicas, Facultad de Ingeniería, Universidad Católica de Temuco, Temuco 4780000, Chile
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(11), 1631; https://doi.org/10.3390/math12111631
Submission received: 24 March 2024 / Revised: 15 May 2024 / Accepted: 15 May 2024 / Published: 23 May 2024

Abstract

The half-normal distribution is combined with the Pareto model to obtain a one-parameter distribution with a heavy right tail, called the composite half-normal-Pareto distribution. This new distribution is useful for modeling positive data with atypical observations. We study the properties and the behavior of the right tail of this new distribution. We estimate the parameter using a method based on percentiles and the maximum likelihood method, and we assess the performance of the maximum likelihood estimator in a Monte Carlo simulation study. We report three applications, one with simulated data and the others with income and expenditure data, in which the new distribution performs better than the Pareto distribution.

1. Introduction

Data on insurance claims, income, and other actuarial variables typically behave asymmetrically: they are generally unimodal, with positive skewness and a heavy right tail. To model such data, investigators therefore use heavy-tailed distributions such as the Pareto distribution, which has been widely used, for example by Beirlant et al. [1], Beirlant et al. [2] and Resnick [3].
A random variable $X$ has a Pareto distribution (see Pareto [4]; Arnold [5]) with scale parameter $\theta$ and shape parameter $\alpha$ if its probability density function (pdf) is given by
$$f_X(x;\theta,\alpha) = \frac{\alpha\,\theta^{\alpha}}{x^{1+\alpha}}, \quad x \ge \theta,$$
with $\theta > 0$ and $\alpha > 0$.
The half-normal (HN) distribution plays an important role in extending the normal distribution to the skew-normal distribution, which adds flexibility to the asymmetry and the tails of the normal distribution (see Azzalini [6]; Henze [7]). We say that a random variable $X$ has an HN distribution with scale parameter $\sigma$ if its pdf is given by
$$f_X(x;\sigma) = \frac{2}{\sigma}\,\phi\!\left(\frac{x}{\sigma}\right), \quad x \ge 0,$$
with $\sigma > 0$ and $\phi(\cdot)$ the standard normal pdf. We denote this by $X \sim HN(\sigma)$. The corresponding cumulative distribution function (cdf) of $X$ is
$$F_X(x;\sigma) = 2\,\Phi\!\left(\frac{x}{\sigma}\right) - 1, \quad x \ge 0,$$
where $\Phi(\cdot)$ is the cdf of the standard normal distribution. Furthermore, its quantile function (Q) is given by
$$Q(p) = \sigma\,\Phi^{-1}\!\left(\frac{1+p}{2}\right), \quad 0 < p < 1,$$
where $\Phi^{-1}$ is the inverse of the standard normal cdf.
The HN distribution has good properties, being a positive truncation of the normal distribution. Some extensions of the HN distribution are given by Cooray and Ananda [8] and Olmos et al. [9,10], among others.
The composite model methodology was introduced by Cooray and Ananda [11], who applied it to obtain the log-normal-Pareto model; Scollnik [12] discusses two extensions of the composite log-normal-Pareto model, while Cooray and Cheng [13] discuss the Bayesian estimators of the composite log-normal-Pareto model. Ciumara [14] obtained a composite Weibull–Pareto model, which they applied to actuarial data using the same design as Cooray and Ananda [11]. Cooray [15] reviewed the construction and properties of the composite Weibull–Pareto model, illustrating it in three sets of real data. Teodorescu [16] applied this methodology to a truncation of the log-normal Pareto model; Teodorescu and Panaitescu [17] applied it to a truncation of the Weibull–Pareto model, and Teodorescu and Vernic [18] applied it to the exponential-Pareto model; Scollnik and Sun [19] developed various composite Weibull–Pareto models and applied them to actuarial data; and Calderín-Ojeda et al. [20] studied the composite exponential arctan–Lognormal model and applied it to income data.
The composite distribution methodology is as follows:
$$f(x) = \begin{cases} c\,f_1(x), & \text{if } 0 < x \le \theta, \\ c\,f_2(x), & \text{if } \theta \le x < \infty, \end{cases}$$
where $f_1$ and $f_2$ are densities with positive support and $c$ is the normalization constant. The following restrictions must also be met:
  • $f_1(\theta) = f_2(\theta)$
  • $\dfrac{d}{dx} f_1(x)\Big|_{x=\theta} = \dfrac{d}{dx} f_2(x)\Big|_{x=\theta}$
The composite exponential-Pareto (CEP) model studied by Teodorescu and Vernic [18] has only one parameter, and our proposal offers an alternative. We say that a random variable $X$ has a CEP distribution with scale parameter $\theta$ if its pdf is given by
$$f(x;\theta) = \begin{cases} \dfrac{0.775}{\theta}\,\exp\!\left(-\dfrac{1.35\,x}{\theta}\right), & \text{if } 0 < x \le \theta, \\[2mm] \dfrac{0.2\,\theta^{0.35}}{x^{1.35}}, & \text{if } \theta \le x < \infty. \end{cases}$$
The objective of the present article is to introduce a composite distribution combining the HN and Pareto distributions. The new distribution has an HN density up to a certain threshold value and a Pareto density for the rest of the distribution. We call it the composite half-normal-Pareto (CHNP) distribution. Thus, we obtain a distribution with one parameter and a heavier right tail than the HN distribution, which can compete with the Pareto distribution.
The article is organized as follows. In Section 2, we give the expression of the CHNP distribution and some of its properties. In Section 3, we estimate the parameter using a percentiles method and the maximum likelihood (ML) method; we also present a simulation study and the asymptotic distribution and asymptotic variance of the ML estimator. In Section 4, we carry out three applications, one with simulated data and the others with income and expenditure data. In Section 5, we offer some concluding remarks.

2. CHNP Distribution

In this section, we introduce the representation, density, properties, and graphs of the new distribution.

2.1. Density Function

The following proposition shows the pdf of the CHNP distribution, which is generated using the methodology given in Equation (2) with its respective conditions (Supplementary Materials).
Proposition 1.
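Let $Z \sim CHNP(\theta)$. Then, the pdf of $Z$ is given by
$$f_Z(z;\theta) = \begin{cases} \dfrac{\sqrt{1+k}}{\Phi(\sqrt{1+k})\,\theta}\,\phi\!\left(\dfrac{\sqrt{1+k}}{\theta}\,z\right), & \text{if } 0 \le z \le \theta, \\[3mm] \dfrac{k\,\theta^{k}}{2\,\Phi(\sqrt{1+k})}\,z^{-(1+k)}, & \text{if } \theta \le z < \infty, \end{cases}$$
where $\theta > 0$, $k = 0.464288$, and $\Phi$ denotes the cdf of the $N(0,1)$ distribution.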
Proof. 
Using the composite distribution methodology, where $f_1$ is the HN density and $f_2$ is the Pareto density, and imposing the two restrictions, the following system of equations is obtained:
$$\phi\!\left(\frac{\theta}{\sigma}\right) = \frac{\alpha\,\sigma}{2\,\theta}, \qquad \frac{2\,\theta}{\sigma^{3}}\,\phi\!\left(\frac{\theta}{\sigma}\right) = \frac{\alpha(\alpha+1)}{\theta^{2}}.$$
Substituting the first equation into the second, we obtain $\theta/\sigma = \sqrt{1+\alpha}$, and we then obtain the equation
$$\phi\!\left(\sqrt{1+\alpha}\right) = \frac{\alpha}{2\sqrt{1+\alpha}}.$$
This equation is solved numerically, giving $\alpha = k = 0.464288$ and $c^{-1} = 2\,\Phi(\sqrt{1+k})$. The result is obtained by substituting these values into the proposed densities.    □
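The numerical step in this proof can be reproduced with a few lines of code. The following is a minimal Python sketch (the paper's own computations use R); the function names `smoothness_gap` and `solve_k`, the bisection method, and the bracket [0.1, 1] are illustrative assumptions and are not taken from the paper.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: provides pdf, cdf and inv_cdf


def smoothness_gap(alpha: float) -> float:
    """phi(sqrt(1 + alpha)) - alpha / (2 * sqrt(1 + alpha)); it vanishes at alpha = k."""
    s = math.sqrt(1.0 + alpha)
    return STD_NORMAL.pdf(s) - alpha / (2.0 * s)


def solve_k(lo: float = 0.1, hi: float = 1.0, tol: float = 1e-12) -> float:
    """Bisection on [lo, hi]; the default bracket is an assumption of this sketch."""
    if smoothness_gap(lo) * smoothness_gap(hi) > 0:
        raise ValueError("bracket does not contain a sign change")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # Keep the half-interval on which the sign change occurs.
        if smoothness_gap(lo) * smoothness_gap(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


k = solve_k()
# Normalization constant c, from c^{-1} = 2 * Phi(sqrt(1 + k)).
c = 1.0 / (2.0 * STD_NORMAL.cdf(math.sqrt(1.0 + k)))
print(f"k = {k:.6f}, c = {c:.6f}")
```

Bisection is used here only because the left-hand side minus the right-hand side changes sign once on the chosen bracket; any standard root finder would serve equally well.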
Remark 1.
From Figure 1, it can be seen that the CHNP distribution has a heavier right tail than the HN distribution, although both have only one parameter. This is an important point with this methodology, in which we increase the weight of the right tail without increasing the parametric space. We also observe that the CHNP distribution maintains some of the properties of the HN model, such as its capacity to include zero. This is a very important characteristic, since the presence of zeros affects distribution modeling. Many parametric models found in the literature cannot be used for datasets containing zeros.

2.2. Properties

This subsection presents some properties of the CHNP distribution, such as its mode, cdf, survival and risk functions, quantile function, median, and its coefficients of asymmetry and kurtosis.
Proposition 2.
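The CHNP distribution is unimodal, and its mode is attained at zero.
Proof. 
Differentiating with respect to $z$, we have
$$f_Z'(z;\theta) = \begin{cases} -\dfrac{(1+k)\sqrt{1+k}}{\Phi(\sqrt{1+k})\,\theta^{3}}\,z\,\phi\!\left(\dfrac{\sqrt{1+k}}{\theta}\,z\right), & \text{if } 0 \le z \le \theta, \\[3mm] -\dfrac{k(1+k)\,\theta^{k}}{2\,\Phi(\sqrt{1+k})}\,z^{-(2+k)}, & \text{if } \theta \le z < \infty. \end{cases}$$
For $0 \le z \le \theta$, the derivative vanishes only at $z = 0$, and it is negative for every $z > 0$. The density is therefore decreasing on $(0, \infty)$, and the mode is 0.    □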
Proposition 3.
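Let $Z \sim CHNP(\theta)$ with $\theta > 0$. Then, the cdf of $Z$ is
$$F_Z(z;\theta) = \begin{cases} \dfrac{1}{\Phi(\sqrt{1+k})}\left[\Phi\!\left(\dfrac{\sqrt{1+k}}{\theta}\,z\right) - \dfrac{1}{2}\right], & \text{if } 0 \le z \le \theta, \\[3mm] 1 - \dfrac{\theta^{k}\,z^{-k}}{2\,\Phi(\sqrt{1+k})}, & \text{if } \theta \le z < \infty. \end{cases}$$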
Proof. 
Applying the definition of cdf directly, the result is obtained.    □
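As a worked version of this step, the integration can be written out as follows. This is a sketch filling in the calculation; the dummy variables $u$ and $v$ and the substitution $v = \sqrt{1+k}\,u/\theta$ are introduced here for illustration.

$$\begin{aligned}
0 \le z \le \theta:\quad F_Z(z;\theta) &= \int_0^z \frac{\sqrt{1+k}}{\Phi(\sqrt{1+k})\,\theta}\,\phi\!\left(\frac{\sqrt{1+k}}{\theta}\,u\right) du
= \frac{1}{\Phi(\sqrt{1+k})}\int_0^{\sqrt{1+k}\,z/\theta} \phi(v)\,dv \\
&= \frac{1}{\Phi(\sqrt{1+k})}\left[\Phi\!\left(\frac{\sqrt{1+k}}{\theta}\,z\right) - \frac{1}{2}\right], \\[2mm]
z \ge \theta:\quad F_Z(z;\theta) &= F_Z(\theta;\theta) + \int_\theta^z \frac{k\,\theta^{k}}{2\,\Phi(\sqrt{1+k})}\,u^{-(1+k)}\,du
= 1 - \frac{1}{2\,\Phi(\sqrt{1+k})} + \frac{1 - \theta^{k} z^{-k}}{2\,\Phi(\sqrt{1+k})} \\
&= 1 - \frac{\theta^{k}\,z^{-k}}{2\,\Phi(\sqrt{1+k})}.
\end{aligned}$$

In particular, $F_Z(\theta;\theta) = 1 - 1/\big(2\,\Phi(\sqrt{1+k})\big)$ does not depend on $\theta$.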
Corollary 1.
1. The survival function $s(t)$, which is the probability that an item will not fail before time $t$, is defined by $s(t) = 1 - F(t)$. The survival function for a $Z \sim CHNP(\theta)$ random variable is given by
$$s(t) = \begin{cases} 1 - \dfrac{1}{\Phi(\sqrt{1+k})}\left[\Phi\!\left(\dfrac{\sqrt{1+k}}{\theta}\,t\right) - \dfrac{1}{2}\right], & \text{if } 0 \le t \le \theta, \\[3mm] \dfrac{\theta^{k}\,t^{-k}}{2\,\Phi(\sqrt{1+k})}, & \text{if } \theta \le t < \infty. \end{cases}$$
2. The hazard function $h(t)$, defined by $h(t) = f(t)/s(t)$, for a $Z \sim CHNP(\theta)$ random variable, is given by
$$h(t) = \begin{cases} \dfrac{\sqrt{1+k}\;\phi\!\left(\dfrac{\sqrt{1+k}}{\theta}\,t\right)}{\theta\left[\Phi(\sqrt{1+k}) - \Phi\!\left(\dfrac{\sqrt{1+k}}{\theta}\,t\right) + \dfrac{1}{2}\right]}, & \text{if } 0 \le t \le \theta, \\[4mm] \dfrac{k}{t}, & \text{if } \theta \le t < \infty. \end{cases}$$
Figure 2 shows plots of the hazard function of the CHNP distribution, and Table 1 indicates that the hazard function is unimodal. The monotonicity intervals of the hazard function can be found by solving the following equation numerically for $t$ in $(0, \theta)$:
$$\phi\!\left(\frac{\sqrt{1+k}}{\theta}\,t\right) - \frac{\sqrt{1+k}\,t}{\theta}\left[\Phi(\sqrt{1+k}) - \Phi\!\left(\frac{\sqrt{1+k}\,t}{\theta}\right) + \frac{1}{2}\right] = 0.$$
Table 1 reports the resulting monotonicity intervals of the hazard function of the CHNP distribution for different values of $\theta$.
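The turning point of the hazard function can be located numerically. The following Python sketch evaluates $h(t)$ and solves the stationarity equation above by bisection; the function names and the choice of bisection are assumptions made for illustration, and the value of $k$ is the one given in Proposition 1.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()
K = 0.464288                   # constant k from Proposition 1
S = math.sqrt(1.0 + K)         # sqrt(1 + k)
PHI_S = STD_NORMAL.cdf(S)      # Phi(sqrt(1 + k))


def hazard(t: float, theta: float) -> float:
    """Hazard function h(t) of CHNP(theta) for t > 0."""
    if t <= theta:
        u = S * t / theta
        return S * STD_NORMAL.pdf(u) / (theta * (PHI_S - STD_NORMAL.cdf(u) + 0.5))
    return K / t


def hazard_turning_point(theta: float, tol: float = 1e-12) -> float:
    """Root in (0, theta) of phi(u) - u * (Phi(S) - Phi(u) + 1/2), with u = S * t / theta."""
    def g(t: float) -> float:
        u = S * t / theta
        return STD_NORMAL.pdf(u) - u * (PHI_S - STD_NORMAL.cdf(u) + 0.5)

    lo, hi = 0.0, theta        # g(0) > 0 and g(theta) < 0 for this model
    while hi - lo > tol * theta:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


for theta in (1.0, 1.8, 2.5, 3.5):
    print(theta, round(hazard_turning_point(theta), 6))
```

Because $\theta$ is a scale parameter, the turning point is proportional to $\theta$, so a single root for $\theta = 1$ determines all the others.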
Remark 2.
From Figure 3, it can be observed that the hazard function of the HN distribution is increasing, whereas the hazard function of the CHNP distribution is more flexible.

Right Tail of the CHNP Distribution

We know that any probability distribution, specified by its cdf $F(t)$ on the real numbers with survival function $s(t) = 1 - F(t)$, has a heavy right tail (see Rolski et al. [21]) if
$$\limsup_{t \to \infty} \frac{-\log s(t)}{t} = 0.$$
The following result shows that the CHNP distribution has a heavy right tail.
Proposition 4.
The distribution of the random variable $T \sim CHNP(\theta)$ is heavy-tailed.
Proof. 
We have
$$\limsup_{t \to \infty} \frac{-\log s(t)}{t} = \limsup_{t \to \infty} \frac{f_T(t;\theta)}{1 - F_T(t;\theta)} = \limsup_{t \to \infty} \frac{k}{t} = 0,$$
where we have applied L’Hospital’s Rule once to obtain the result.    □
Remark 3.
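In the case of a random variable $X \sim HN(\sigma)$, we have
$$\limsup_{x \to \infty} \frac{-\log s(x)}{x} = \limsup_{x \to \infty} \frac{f_X(x;\sigma)}{1 - F_X(x;\sigma)} = \limsup_{x \to \infty} \frac{\phi\!\left(\frac{x}{\sigma}\right)}{\sigma\left[1 - \Phi\!\left(\frac{x}{\sigma}\right)\right]} = \limsup_{x \to \infty} \frac{x}{\sigma^{2}} = \infty.$$
This result, obtained by applying L’Hospital’s Rule twice, indicates that the HN distribution does not have a heavy right tail; this is a reason for developing the CHNP distribution, which does have a heavy right tail.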
Proposition 5.
Let $Z \sim CHNP(\theta)$. Then, the quantile function (Q) of the CHNP distribution is given by
$$Q(p) = \begin{cases} \dfrac{\theta}{\sqrt{1+k}}\,\Phi^{-1}\!\left(p\,\Phi(\sqrt{1+k}) + \dfrac{1}{2}\right), & \text{if } 0 < p \le 0.4614636, \\[3mm] \theta\left[2\,\Phi(\sqrt{1+k})\,(1-p)\right]^{-1/k}, & \text{if } 0.4614636 \le p < 1. \end{cases}$$
Proof. 
Solving the equation $p = F_Z(z;\theta)$ for $z$, the result is obtained.    □
Corollary 2.
Let $Z \sim CHNP(\theta)$. Then, the median ($M_e$) of the CHNP distribution is given by
$$M_e = \theta\left[\Phi(\sqrt{1+k})\right]^{-1/k}.$$
We study the effects of the parameter $\theta$ on the coefficients of asymmetry and kurtosis defined by Galton [22] and Moors [23], respectively. These are based on the quantile function and are given in the following result.
Corollary 3.
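Let $Z \sim CHNP(\theta)$; then, the coefficients of asymmetry ($\beta_1$) and kurtosis ($\beta_2$) are, respectively:
$$\beta_1 = \frac{Q\!\left(\frac{3}{4}\right) + Q\!\left(\frac{1}{4}\right) - 2\,Q\!\left(\frac{2}{4}\right)}{Q\!\left(\frac{3}{4}\right) - Q\!\left(\frac{1}{4}\right)} = 1.54737,$$
$$\beta_2 = \frac{Q\!\left(\frac{3}{8}\right) - Q\!\left(\frac{1}{8}\right) + Q\!\left(\frac{7}{8}\right) - Q\!\left(\frac{5}{8}\right)}{Q\!\left(\frac{3}{4}\right) - Q\!\left(\frac{1}{4}\right)} = 5.648296,$$
where $Q(\cdot)$ is the quantile function given in Equation (7). The value of the coefficient of kurtosis of the HN distribution is $1.176419$, while the value of the coefficient of kurtosis of the CHNP distribution is $5.648296$. In other words, the right tail of the CHNP distribution is heavier than the right tail of the HN distribution.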
Random numbers from the CHNP distribution can be generated by inverting the cdf (Algorithm 1).
Algorithm 1 Simulating from $Z \sim CHNP(\theta)$
1: Generate $Y \sim Uniform(0,1)$.
2: Compute
$$Z = \begin{cases} \dfrac{\theta}{\sqrt{1+k}}\,\Phi^{-1}\!\left(Y\,\Phi(\sqrt{1+k}) + \dfrac{1}{2}\right), & \text{if } 0 < Y \le 0.4614636, \\[3mm] \theta\left[2\,\Phi(\sqrt{1+k})\,(1-Y)\right]^{-1/k}, & \text{if } 0.4614636 \le Y < 1. \end{cases}$$
Using the R software package [24], R version 4.3.3, https://www.r-project.org/ (accessed on 15 January 2024), we generated a random sample of size $n = 50$ from $Z \sim CHNP(\theta)$ with $\theta = 2$, shown in Figure 4. The R code is:
    n <- 50                  # sample size
    k <- 0.464288
    y <- runif(n, 0, 1)      # step 1: uniform draws
    # step 2: inverse-cdf transform with theta = 2
    z <- ifelse(y < 0.4614636,
                (2 / sqrt(1 + k)) * qnorm(y * pnorm(sqrt(1 + k)) + 0.5),
                2 * (2 * pnorm(sqrt(1 + k)) * (1 - y))^(-1 / k))
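For readers who prefer to check the sampler outside R, the same inverse-cdf idea can be sketched in Python using only the standard library. The names `chnp_cdf`, `chnp_quantile` and `rchnp` and the seed are hypothetical; the branch point is computed from the cdf as $F_Z(\theta) = 1 - 1/\big(2\,\Phi(\sqrt{1+k})\big)$ rather than hard-coded, which is an implementation choice of this sketch.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()
K = 0.464288
S = math.sqrt(1.0 + K)
PHI_S = STD_NORMAL.cdf(S)
P_BREAK = 1.0 - 1.0 / (2.0 * PHI_S)   # F_Z(theta): mass of the half-normal part


def chnp_cdf(z: float, theta: float) -> float:
    """cdf of CHNP(theta) as in Proposition 3."""
    if z <= 0.0:
        return 0.0
    if z <= theta:
        return (STD_NORMAL.cdf(S * z / theta) - 0.5) / PHI_S
    return 1.0 - (theta / z) ** K / (2.0 * PHI_S)


def chnp_quantile(p: float, theta: float) -> float:
    """Inverse of chnp_cdf for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if p <= P_BREAK:
        return theta / S * STD_NORMAL.inv_cdf(p * PHI_S + 0.5)
    return theta * (2.0 * PHI_S * (1.0 - p)) ** (-1.0 / K)


def rchnp(n: int, theta: float, seed=None) -> list:
    """Draw n values by applying the quantile function to uniform draws."""
    rng = random.Random(seed)
    draws = []
    while len(draws) < n:
        u = rng.random()
        if u > 0.0:            # random() may return exactly 0.0, which is excluded
            draws.append(chnp_quantile(u, theta))
    return draws


sample = rchnp(50, theta=2.0, seed=2024)
print(min(sample), max(sample))
```

A quick sanity check of such a sampler is the round trip `chnp_cdf(chnp_quantile(p, theta), theta)`, which should return `p` on both branches.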

2.3. Actuarial Measure

Distributions with heavy tails are used to describe the risk exposure; for example, the quantile function is used in the area of actuarial statistics to define the value at risk (VaR). We discuss the VaR measure for the CHNP distribution. The VaR measure is also known as the quantile risk measure or the quantile premium principle and is specified with a given degree of confidence, which may be 90%, 95%, or 99%. The VaR of the random variable $Z \sim CHNP(\theta)$ is defined as (see Artzner [25] and Artzner et al. [26]):
$$VaR_p = Q(p) = \begin{cases} \dfrac{\theta}{\sqrt{1+k}}\,\Phi^{-1}\!\left(p\,\Phi(\sqrt{1+k}) + \dfrac{1}{2}\right), & \text{if } 0 < p \le 0.4614636, \\[3mm] \theta\left[2\,\Phi(\sqrt{1+k})\,(1-p)\right]^{-1/k}, & \text{if } 0.4614636 \le p < 1. \end{cases}$$
Table 2 provides some numerical computations for the $VaR_p$ measure of the CHNP($\theta$) distribution for different parametric values. Figure 5 shows graphs of the $VaR_p$ measure of the CHNP distribution for different values of the parameter $\theta$.
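The entries of Table 2 all lie on the Pareto branch of the quantile function, so they can be checked with a very short computation. The Python sketch below implements only that upper branch; the function name `var_upper` is hypothetical, and the restriction to $p \ge F_Z(\theta)$ is a simplification made here.

```python
import math
from statistics import NormalDist

K = 0.464288
TWO_PHI_S = 2.0 * NormalDist().cdf(math.sqrt(1.0 + K))   # 2 * Phi(sqrt(1 + k))
P_BREAK = 1.0 - 1.0 / TWO_PHI_S                          # F_Z(theta)


def var_upper(p: float, theta: float) -> float:
    """VaR_p = Q(p) on the Pareto branch; valid for F_Z(theta) <= p < 1."""
    if not P_BREAK <= p < 1.0:
        raise ValueError("this sketch only covers the Pareto branch")
    return theta * (TWO_PHI_S * (1.0 - p)) ** (-1.0 / K)


for theta in (0.5, 5.0):
    for p in (0.90, 0.95, 0.99):
        print(f"CHNP({theta}) p={p:.2f} VaR={var_upper(p, theta):.5f}")
```

Since $\theta$ is a scale parameter, the VaR for CHNP(5) is exactly ten times the VaR for CHNP(0.5) at the same level, which is the pattern visible in Table 2.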

3. Parameter Estimation

In this section we present two methods for estimating the parameter θ , the first based on percentiles and the second on ML.

3.1. A Method Based on Percentiles

Let $z_1 \le z_2 \le \dots \le z_n$ be an ordered random sample from the $CHNP(\theta)$ distribution. We assume that $z_m \le \theta \le z_{m+1}$. Using the percentiles, the parameter $\theta$ can be estimated as the $p$-th percentile, where $p = F(\theta)$ and $k = 0.464288$.
From Equation (3), we have
$$p = P(Z \le \theta) = F_Z(\theta) = \frac{1}{\Phi(\sqrt{1+k})}\left[\Phi\!\left(\frac{\sqrt{1+k}}{\theta}\,\theta\right) - \frac{1}{2}\right] = 0.436223.$$
We have an estimate of the $p$-th percentile (see Klugman et al. [27]) given by
$$\tilde{\theta} = (1-h)\,z_m + h\,z_{m+1},$$
where $m = [(n+1)p]$, $h = (n+1)p - m$, and $[a]$ indicates the largest integer less than or equal to $a$.
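The percentile estimator is a one-line interpolation once $m$ and $h$ are known. A minimal Python sketch follows; the function name `percentile_estimate` is hypothetical, and the default $p = 0.436223$ is the value stated above.

```python
import math

P_THETA = 0.436223   # p = F_Z(theta), as reported in the text


def percentile_estimate(sample, p: float = P_THETA) -> float:
    """Smoothed empirical p-th percentile: (1 - h) * z_m + h * z_{m+1}."""
    z = sorted(sample)
    n = len(z)
    position = (n + 1) * p
    m = math.floor(position)       # m = [(n + 1) p]
    h = position - m               # fractional part
    if not 1 <= m < n:
        raise ValueError("sample too small for this percentile")
    # Python lists are 0-indexed, so z_m is z[m - 1] and z_{m+1} is z[m].
    return (1.0 - h) * z[m - 1] + h * z[m]
```

For a sample of size 50 this gives $m = 22$, so only the 22nd and 23rd order statistics enter the estimate.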

3.2. ML Estimation

Let $z_1 \le z_2 \le \dots \le z_n$ be an ordered random sample from the $CHNP(\theta)$ distribution with $z_m \le \theta \le z_{m+1}$; the likelihood function can be written as
$$L(\theta; z_1, \dots, z_n) = \prod_{i=1}^{m} \frac{\sqrt{1+k}}{\Phi(\sqrt{1+k})\,\theta}\,\phi\!\left(\frac{\sqrt{1+k}}{\theta}\,z_i\right) \prod_{i=m+1}^{n} \frac{k\,\theta^{k}}{2\,\Phi(\sqrt{1+k})}\,z_i^{-(1+k)}.$$
The log-likelihood function can be written as
$$\ell(\theta) = c_1 - m\log(\theta) - \frac{m(1+k)}{2\,\theta^{2}}\,\overline{z^{2}}_m + k(n-m)\log(\theta) - (1+k)\sum_{i=m+1}^{n}\log(z_i),$$
where $c_1 = \frac{m}{2}\log(1+k) - m\log\Phi(\sqrt{1+k}) - \frac{m}{2}\log(2\pi) + (n-m)\log(k) - (n-m)\log\!\big(2\,\Phi(\sqrt{1+k})\big)$ and $\overline{z^{2}}_m = \frac{1}{m}\sum_{i=1}^{m} z_i^{2}$.
Differentiating $\ell(\theta)$ with respect to $\theta$, we have
$$\frac{\partial \ell(\theta)}{\partial \theta} = -\frac{m}{\theta} + \frac{m(1+k)}{\theta^{3}}\,\overline{z^{2}}_m + \frac{k(n-m)}{\theta}.$$
Hence, the solution of the equation $\partial \ell(\theta)/\partial \theta = 0$ is
$$\hat{\theta}_m = \sqrt{\frac{(1+k)\,m\,\overline{z^{2}}_m}{(1+k)\,m - k\,n}}, \quad k\,n < (1+k)\,m.$$
For each $m$, $m = 1, 2, \dots, n-1$, we evaluate $\hat{\theta}_m$; if it is found that $z_m \le \hat{\theta}_m \le z_{m+1}$, then the ML estimate of $\theta$ is $\hat{\theta} = \hat{\theta}_m$.
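The search over $m$ described above can be sketched as a short loop. In the following Python sketch the function name `chnp_mle` is hypothetical, and returning the first consistent $m$ (or `None` when no $m$ in $1, \dots, n-1$ qualifies) is an assumption about how to handle the edge cases, which the text does not discuss.

```python
import math

K = 0.464288   # constant k from Proposition 1


def chnp_mle(sample):
    """Return (theta_hat, m) with z_m <= theta_hat <= z_{m+1}, or None if no such m exists."""
    z = sorted(sample)
    n = len(z)
    sum_sq = 0.0                       # running sum of z_i^2 for i = 1..m
    for m in range(1, n):
        sum_sq += z[m - 1] ** 2
        denom = (1.0 + K) * m - K * n
        if denom <= 0.0:               # the closed form requires k n < (1 + k) m
            continue
        # (1 + k) * m * mean(z^2) equals (1 + k) * sum_sq
        theta_m = math.sqrt((1.0 + K) * sum_sq / denom)
        if z[m - 1] <= theta_m <= z[m]:
            return theta_m, m
    return None


print(chnp_mle([1.0, 1.0, 1.0, 10.0]))
```

In the toy call, $m = 2$ is rejected because the candidate estimate exceeds $z_3$, and $m = 3$ is accepted, mirroring the accept/reject logic used in the numerical application of Section 4.1.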
Lemma 1.
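Let $Z \sim CHNP(\theta)$. Then,
(i) $\displaystyle \int_0^{\theta} \frac{\partial^{2}}{\partial\theta^{2}}\log f_1^{*}(z;\theta)\, f_1^{*}(z;\theta)\,dz = \frac{1}{2\,\Phi(\sqrt{1+k})\,\theta^{2}}\left[2 + 3k - 4\,\Phi(\sqrt{1+k})\right],$
(ii) $\displaystyle \int_{\theta}^{\infty} \frac{\partial^{2}}{\partial\theta^{2}}\log f_2^{*}(z;\theta)\, f_2^{*}(z;\theta)\,dz = -\frac{k}{2\,\Phi(\sqrt{1+k})\,\theta^{2}},$
where $f_1^{*}(z;\theta) = \dfrac{\sqrt{1+k}}{\Phi(\sqrt{1+k})\,\theta}\,\phi\!\left(\dfrac{\sqrt{1+k}}{\theta}\,z\right)$ and $f_2^{*}(z;\theta) = \dfrac{k\,\theta^{k}}{2\,\Phi(\sqrt{1+k})}\,z^{-(1+k)}$.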
Proof. 
Results ( i ) and ( i i ) are obtained by performing their respective calculations.    □
Proposition 6.
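Let $z_1 \le z_2 \le \dots \le z_n$ be an ordered random sample from the $CHNP(\theta)$ distribution; we assume that $z_m \le \theta \le z_{m+1}$. Then, the Fisher information ($I_F$) for the parameter $\theta$ of the CHNP distribution is given by
$$I_F(\theta) = \frac{1}{2\,\Phi(\sqrt{1+k})\,\theta^{2}}\left[2m\left(2\,\Phi(\sqrt{1+k}) - 3k - 1\right) + 2nk - (n-2m)\,k^{2}\right].$$
Proof. 
After calculations and applying the Leibniz rule (see Casella and Berger [28], Section 2.4), we have
$$\begin{aligned}
I_F(\theta) &= E\!\left[\left(\frac{\partial \ell(\theta)}{\partial\theta}\right)^{2}\right] \\
&= \sum_{i=1}^{m}\int_0^{\theta}\left(\frac{\partial}{\partial\theta}\log f_1^{*}(z_i;\theta)\right)^{2} f_1^{*}(z_i;\theta)\,dz_i + \sum_{i=m+1}^{n}\int_{\theta}^{\infty}\left(\frac{\partial}{\partial\theta}\log f_2^{*}(z_i;\theta)\right)^{2} f_2^{*}(z_i;\theta)\,dz_i \\
&= -\sum_{i=1}^{m}\int_0^{\theta}\frac{\partial^{2}}{\partial\theta^{2}}\log f_1^{*}(z_i;\theta)\, f_1^{*}(z_i;\theta)\,dz_i - \sum_{i=m+1}^{n}\int_{\theta}^{\infty}\frac{\partial^{2}}{\partial\theta^{2}}\log f_2^{*}(z_i;\theta)\, f_2^{*}(z_i;\theta)\,dz_i + \frac{k(1-k)(n-2m)}{2\,\Phi(\sqrt{1+k})\,\theta^{2}} \\
&= -m\int_0^{\theta}\frac{\partial^{2}}{\partial\theta^{2}}\log f_1^{*}(z;\theta)\, f_1^{*}(z;\theta)\,dz - (n-m)\int_{\theta}^{\infty}\frac{\partial^{2}}{\partial\theta^{2}}\log f_2^{*}(z;\theta)\, f_2^{*}(z;\theta)\,dz + \frac{k(1-k)(n-2m)}{2\,\Phi(\sqrt{1+k})\,\theta^{2}}.
\end{aligned}$$
Then, applying Lemma 1, the result is obtained.    □
Hence, for large samples, the ML estimator $\hat{\theta}$ is asymptotically normal; that is,
$$\hat{\theta} \xrightarrow{\;\mathcal{L}\;} N\!\left(\theta,\, I_F^{-1}(\theta)\right),$$
so the asymptotic variance of the ML estimator $\hat{\theta}$ is the inverse of the Fisher information given in Equation (9), i.e.,
$$Var(\hat{\theta}) \approx \frac{2\,\Phi(\sqrt{1+k})\,\theta^{2}}{2m\left(2\,\Phi(\sqrt{1+k}) - 3k - 1\right) + 2nk - (n-2m)\,k^{2}}.$$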
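The asymptotic variance translates directly into a standard error and a Wald-type confidence interval. The Python sketch below simply evaluates the expression of Proposition 6 as stated; the function names and the use of a symmetric normal interval are assumptions of this sketch rather than a procedure spelled out in the text.

```python
import math
from statistics import NormalDist

K = 0.464288
TWO_PHI_S = 2.0 * NormalDist().cdf(math.sqrt(1.0 + K))   # 2 * Phi(sqrt(1 + k))


def fisher_information(theta: float, n: int, m: int) -> float:
    """I_F(theta) as stated in Proposition 6 (n observations, m of them at or below theta)."""
    bracket = 2.0 * m * (TWO_PHI_S - 3.0 * K - 1.0) + 2.0 * n * K - (n - 2 * m) * K ** 2
    return bracket / (TWO_PHI_S * theta ** 2)


def asymptotic_se(theta_hat: float, n: int, m: int) -> float:
    """Square root of the inverse Fisher information evaluated at theta_hat."""
    info = fisher_information(theta_hat, n, m)
    if info <= 0.0:
        raise ValueError("non-positive information for this (n, m)")
    return 1.0 / math.sqrt(info)


def wald_interval(theta_hat: float, n: int, m: int, level: float = 0.95):
    """Symmetric normal-approximation interval theta_hat +/- z * se."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    se = asymptotic_se(theta_hat, n, m)
    return theta_hat - z * se, theta_hat + z * se


print(wald_interval(2.0, n=100, m=44))
```

Intervals of this kind are what a coverage probability based on the asymptotic distribution, as reported in a simulation study, would be computed from.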

3.3. Simulation Study

A simulation study is presented to assess the performance of the ML estimator of the parameter $\theta$ of the CHNP distribution.
Algorithm 1 given in Section 2.2 can be used to generate random numbers from the CHNP distribution. The simulation analysis was carried out by generating 1000 samples from the CHNP distribution for each of the sample sizes $n = 50$, 100, 150, and 200.
Table 3 shows the empirical bias (Bias), the mean of the standard errors (SEs), the root of the empirical mean squared error (RMSE), and the 95% coverage probability (CP) based on the asymptotic distribution for the ML estimator of parameter θ . As Table 3 shows, the performance of the estimations improves as n increases.

4. Applications

In this section, we show three applications, the first with simulated data and the others with real datasets. To compare the models, we use the Akaike information criterion AIC (see Akaike [29]) and the Bayesian information criterion BIC (see Schwarz [30]).
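For reference, both criteria are simple functions of the maximized log-likelihood: with $q$ estimated parameters and $n$ observations, AIC $= 2q - 2\ell$ and BIC $= q\log(n) - 2\ell$, and smaller values indicate a better fit. The following Python sketch evaluates the CHNP log-likelihood and the two criteria; the function names are hypothetical, and $q = 1$ for the CHNP model.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()
K = 0.464288
S = math.sqrt(1.0 + K)
PHI_S = STD_NORMAL.cdf(S)


def chnp_logpdf(z: float, theta: float) -> float:
    """Log-density of CHNP(theta) at z >= 0."""
    if z < 0.0:
        return -math.inf
    if z <= theta:   # half-normal part
        return math.log(S / (PHI_S * theta)) + math.log(STD_NORMAL.pdf(S * z / theta))
    # Pareto part
    return math.log(K) + K * math.log(theta) - math.log(2.0 * PHI_S) - (1.0 + K) * math.log(z)


def chnp_loglik(sample, theta: float) -> float:
    """Log-likelihood of an i.i.d. sample at a given theta."""
    return sum(chnp_logpdf(z, theta) for z in sample)


def aic(loglik: float, n_params: int) -> float:
    return 2.0 * n_params - 2.0 * loglik


def bic(loglik: float, n_params: int, n_obs: int) -> float:
    return n_params * math.log(n_obs) - 2.0 * loglik


toy = [0.0, 0.4, 1.1, 2.5, 9.0]               # illustrative values, not data from the paper
ll = chnp_loglik(toy, theta=1.5)
print(aic(ll, 1), bic(ll, 1, len(toy)))
```

Note that the log-density is finite at $z = 0$, so samples containing zeros can be scored, in line with Remark 1.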

4.1. Numerical Application

In this numerical application, we use the same 50 simulated data used for the graph in Figure 4; these data were generated from the CHNP(2) distribution. The data are shown in Table 4.
The descriptive statistics for these data are given in Table 5, where CS is the sample coefficient of asymmetry and CK is the sample coefficient of kurtosis. The data were generated with $\theta = 2$, and the high coefficient of kurtosis of 6.477 indicates that the right tail of the data is very heavy.
The first objective of this numerical example is to use the estimation methods given in Section 3.
  • A method based on percentiles. For this example, $p = 0.436223$. We have $m = [(n+1)p] = [22.24737] = 22$ and $h = 22.24737 - 22 = 0.24737$. Thus, the estimate of $\theta$ is $\tilde{\theta} = (1 - 0.24737)\,z_{22} + 0.24737\,z_{23} = 1.875693$, where $z_{22} = 1.84486889$ and $z_{23} = 1.96947573$.
  • ML estimation. Using the value $m = 22$ found with the previous method, we calculate the estimator given in Equation (8). This gives us $\hat{\theta}_{22} = 1.994828$. We observe that the condition $z_{22} \le \hat{\theta}_{22} \le z_{23}$ is not satisfied. Therefore, we must use another value for $m$, for example $m = 23$; with this, we obtain $\hat{\theta}_{23} = 1.9913$, and now the condition $z_{23} \le \hat{\theta}_{23} \le z_{24}$ is satisfied, where $z_{23} = 1.96947573$ and $z_{24} = 2.10059341$. Thus, the ML estimate of $\theta$ is $\hat{\theta} = \hat{\theta}_{23} = 1.9913$.
The second objective of this numerical example is to compare the fit with the Pareto model, since this model has a heavy right tail. Table 6 shows the ML estimations for the parameters of the CHNP model and the Pareto model. Table 6 also shows the values of the AIC and BIC criteria for each model.
The lowest values of the AIC and BIC criteria correspond to the CHNP model, meaning that the CHNP model offers a better fit for these data than the Pareto model. This was to be expected since the data were simulated from the CHNP model. The above values for the measures indicate that the CHNP model has its own shape and may be difficult to replace by any other known model with a heavy right tail.

4.2. Application to Income Data

This dataset comes from the U.S. Survey of Consumer Finances (SCF), a nationally representative sample that contains extensive information on assets, liabilities, income, and demographic characteristics of those sampled (potential U.S. customers). It contains a random sample of 500 households with positive incomes, which were interviewed in the 2004 survey. The variable of interest is the annual income of the family in thousands of USD divided by the number of household members. The descriptive statistics for these data are given in Table 7.
Figure 6 presents a boxplot that shows one very extreme datum, while Figure 7 shows a boxplot of the dataset after removing the extreme datum in order to show the other outliers, which cannot be seen in the boxplot of Figure 6. These atypical data make the right tail heavier. It may be noted that the majority of the observations are around USD 21,125 per capita per family, but there is one very atypical value representing an income of USD 75 million. Because the Pareto distribution has a heavy right tail, we use it to compare its fit to the income data with the fit of the CHNP distribution.
Table 8 shows the ML estimates for the parameters of the Pareto, CEP, and CHNP models, as well as the values of the AIC and BIC criteria for each model.
It can be observed that the lowest values of the AIC and BIC criteria correspond to the CHNP model, meaning that the CHNP model offers a better fit for these data than the CEP and Pareto models.
Figure 8 shows the empirical cdf with estimated cdfs of the CHNP, CEP, and Pareto models. Note the good fit of the CHNP model with the income data.
Table 9 presents VaR estimates for the CHNP, CEP, and Pareto distributions at the following levels: 0.5, 0.6, 0.7, 0.8, 0.9, and 0.95. The model with higher VaR values has a heavier tail. Table 9 shows that the Pareto model has a heavier tail than the CHNP and CEP models at the highest levels. Table 8 and Figure 9 also show the good fit of the CHNP model to the income data. According to the AIC and BIC selection criteria, the CHNP model fits the income data better than the CEP and Pareto models.

4.3. Application to Expenditure Data

The data were obtained from the Medical Expenditure Panel Survey (MEPS), carried out by the US Agency for Healthcare Research and Quality in the civil population. The variable of interest consists of expenditure amounts for ambulatory visits (EXPENDOP). The data can be found in Frees [31]. The descriptive statistics for these data are given in Table 10.
Figure 10 presents a boxplot that shows extreme data, one of which is very extreme. These outliers make the right tail heavy. The dataset has three observations with zero cost; it therefore cannot be fitted with the Pareto model, and we only compared CHNP with the CEP model. These two models can be fitted to datasets containing zero observations.
Table 11 shows the ML estimates for the parameters of the CEP and CHNP models, as well as the values of the AIC and BIC criteria for each model.
It can be observed that the lowest values of the AIC and BIC criteria correspond to the CHNP model, meaning that the CHNP model offers a better fit for these data than the CEP model. Figures 11 and 12 show the empirical cdf and the empirical VaR, respectively, together with the corresponding estimates under the CHNP and CEP models. Note the good fit of the CHNP model to the expenditure data.

5. Concluding Remarks

This paper presents a study of the CHNP model, which was generated by the same methodology introduced by Cooray and Ananda [11]. The CHNP model has only one parameter, making it an attractive competitor with various one-parameter models used in actuarial statistics. The CHNP model is a viable alternative for fitting data with extreme observations. Some other characteristics of the CHNP model are:
  • The CHNP model has a heavy right tail, as is shown by Proposition 4.
  • The support of the CHNP model contains zero, a property that is very important for modeling datasets containing zeros.
  • The cdf, hazard function, and quantile function are explicit and are expressed in terms of known functions.
  • The VaR risk measure is explicit in the CHNP model; in the applications with real data, it is compared with the VaR risk measure of the CEP and Pareto models.
  • The applications with income and expenditure data show that the CHNP model provides a better fit than the CEP and Pareto models; it is also observed that the VaR of the CHNP model is closer to the empirical VaR than the VaR of the CEP and Pareto models.
  • One reviewer noted the importance of performing a comparison of estimation methods, including Bayesian inference. As we have the Fisher information for the parameter $\theta$, we can use it in the Jeffreys prior to perform Bayesian inference. However, a detailed investigation of the performance of Bayesian estimation is beyond the scope of the present paper.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/math12111631/s1.

Author Contributions

Conceptualization, N.M.O. and H.W.G.; methodology, N.M.O. and H.W.G.; software, N.M.O. and H.W.G.; validation, N.M.O., E.G.-D. and H.W.G.; formal analysis, O.V. and H.W.G.; investigation, N.M.O. and O.V.; writing—original draft preparation, N.M.O. and H.W.G.; writing—review and editing, E.G.-D. and O.V.; funding acquisition, O.V. and H.W.G. All authors have read and agreed to the published version of the manuscript.

Funding

The research of N.M. Olmos and H.W. Gómez was supported by Semillero UA-2024.

Data Availability Statement

The first real dataset is available on the website https://www.federalreserve.gov/econres/scfindex.htm (accessed on 15 January 2024) and the second one in Frees [31].

Acknowledgments

The authors are very grateful to the anonymous referees for their efforts in improving the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Beirlant, J.; Teugels, J.L.; Vynckier, P. Practical Analysis of Extreme Values; Leuven University Press: Leuven, Belgium, 1996.
  2. Beirlant, J.; Joossens, E.; Segers, J. Generalized Pareto fit to the society of actuaries’ large claims database. N. Am. Actuar. J. 2004, 8, 108–111.
  3. Resnick, S.I. Discussion of the Danish data on large fire insurance losses. ASTIN Bull. 1997, 27, 139–151.
  4. Pareto, V. Cours d’Économie Politique; F. Rouge: Lausanne, Switzerland, 1897.
  5. Arnold, B.C. Pareto Distributions, 2nd ed.; Chapman & Hall: New York, NY, USA, 2015.
  6. Azzalini, A. A class of distributions which includes the normal ones. Scand. J. Stat. 1985, 12, 171–178.
  7. Henze, N. A probabilistic representation of the Skew-Normal distribution. Scand. J. Stat. 1986, 4, 271–275.
  8. Cooray, K.; Ananda, M.M.A. A Generalization of the Half-Normal Distribution with Applications to Lifetime Data. Commun. Stat. Theory Methods 2008, 37, 1323–1337.
  9. Olmos, N.M.; Varela, H.; Gómez, H.W.; Bolfarine, H. An extension of the half-normal distribution. Stat. Pap. 2012, 53, 875–886.
  10. Olmos, N.M.; Varela, H.; Bolfarine, H.; Gómez, H.W. An extension of the generalized half-normal distribution. Stat. Pap. 2014, 55, 967–981.
  11. Cooray, K.; Ananda, M.M.A. Modeling actuarial data with a composite lognormal-Pareto model. Scand. Actuar. J. 2005, 5, 321–334.
  12. Scollnik, D.P.M. On composite lognormal–Pareto models. Scand. Actuar. J. 2007, 7, 20–33.
  13. Cooray, K.; Cheng, C.I. Bayesian estimators of the lognormal–Pareto composite distribution. Scand. Actuar. J. 2015, 6, 500–515.
  14. Ciumara, R. An actuarial model based on the composite Weibull–Pareto distribution. Math. Rep. 2006, 8, 401–414.
  15. Cooray, K. The Weibull–Pareto composite family with applications to the analysis of unimodal failure rate data. Commun. Stat. Theory Methods 2009, 38, 1901–1915.
  16. Teodorescu, S. On the truncated composite lognormal–Pareto model. Math. Rep. 2010, 12, 71–84.
  17. Teodorescu, S.; Panaitescu, E. On the truncated composite Weibull–Pareto model. Math. Rep. 2009, 11, 259–273.
  18. Teodorescu, S.; Vernic, R. Some composite exponential–Pareto models for actuarial prediction. Rom. J. Econ. Forecast. 2009, 12, 82–100.
  19. Scollnik, D.P.M.; Sun, C. Modeling with Weibull–Pareto models. N. Am. Actuar. J. 2012, 16, 260–272.
  20. Calderín-Ojeda, E.; Azpitarte, F.; Gómez-Déniz, E. Modelling income data using two extensions of the exponential distribution. Physica A 2016, 461, 756–766.
  21. Rolski, T.; Schmidli, H.; Schmidt, V.; Teugels, J. Stochastic Processes for Insurance and Finance; John Wiley & Sons: Hoboken, NJ, USA, 1999.
  22. Galton, F. Enquiries into Human Faculty and Its Development; Macmillan & Company: London, UK, 1883.
  23. Moors, J.J.A. A quantile alternative for kurtosis. J. R. Stat. Soc. Ser. D Stat. 1988, 37, 25–32.
  24. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2022; Available online: https://www.R-project.org/ (accessed on 12 January 2024).
  25. Artzner, P. Application of coherent risk measures to capital requirements in insurance. N. Am. Actuar. J. 1999, 3, 11–25.
  26. Artzner, P.; Delbaen, F.; Eber, J.-M.; Heath, D. Coherent measures of risk. Math. Financ. 1999, 9, 203–228.
  27. Klugman, S.A.; Panjer, H.H.; Willmot, G.E. Loss Models: From Data to Decisions, 4th ed.; Wiley: New York, NY, USA, 1998.
  28. Casella, G.; Berger, R. Statistical Inference, 2nd ed.; Cengage Learning: Boston, MA, USA, 2002.
  29. Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19, 716–723.
  30. Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464.
  31. Frees, E.W. Regression Modeling with Actuarial and Financial Applications; Cambridge University Press: Cambridge, UK, 2010.
Figure 1. Comparison of HN and CHNP distributions.
Figure 1. Comparison of HN and CHNP distributions.
Mathematics 12 01631 g001
Figure 2. Plot of the hazard function h ( t ) of the CHNP distribution with different values of parameter θ .
Figure 2. Plot of the hazard function h ( t ) of the CHNP distribution with different values of parameter θ .
Mathematics 12 01631 g002
Figure 3. Comparison of hazard functions of the HN and CHNP distributions.
Figure 3. Comparison of hazard functions of the HN and CHNP distributions.
Mathematics 12 01631 g003
Figure 4. Histogram using a sample of size n = 50 from density CHNP(2).
Figure 5. Plots of the VaR measure of the CHNP distribution with different values of parameter θ.
Figure 6. Boxplot for income data.
Figure 7. Boxplot for income data without extreme data.
Figure 8. Empirical cdf with the estimated cdfs of the CHNP, CEP, and Pareto models.
Figure 9. Empirical VaR with the VaR of the CHNP, CEP, and Pareto models.
Figure 10. Boxplot for expenditure data.
Figure 11. Empirical cdf with the estimated cdfs of the CHNP and CEP models.
Figure 12. Empirical VaR with the VaR of the CHNP and CEP models.
Table 1. Monotonicity of the hazard function.
θ      Increasing         Decreasing
1      (0, 0.4183847]     (0.4183847, ∞)
1.8    (0, 0.7530924]     (0.7530924, ∞)
2.5    (0, 1.045962]      (1.045962, ∞)
3.5    (0, 1.464346]      (1.464346, ∞)
5      (0, 1.673539]      (1.673539, ∞)
Table 2. VaR_p measure, where p is the significance level.
Model       p     VaR_p       Model       p     VaR_p
CHNP(0.5)   0.90  20.73632    CHNP(5)     0.90  207.3632
CHNP(0.5)   0.95  92.27857    CHNP(5)     0.95  922.7857
CHNP(0.5)   0.99  2955.067    CHNP(5)     0.99  29,550.67
CHNP(1)     0.90  41.47265    CHNP(8)     0.90  331.7812
CHNP(1)     0.95  184.5571    CHNP(8)     0.95  1476.457
CHNP(1)     0.99  5910.134    CHNP(8)     0.99  47,281.07
CHNP(3)     0.90  124.4179    CHNP(10)    0.90  414.7265
CHNP(3)     0.95  553.6714    CHNP(10)    0.95  1845.571
CHNP(3)     0.99  17,730.4    CHNP(10)    0.99  59,101.34
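The entries of Table 2 appear to grow in proportion to θ, which is what one would expect if θ acts as a scale parameter. The following minimal sketch checks this reading numerically using only the tabulated values; the names `table2`, `scaled_from_unit`, and `max_relative_gap` are illustrative and do not come from the article.

```python
# Values copied from Table 2: VaR_p of CHNP(theta) for p = 0.90, 0.95, 0.99.
table2 = {
    0.5: [20.73632, 92.27857, 2955.067],
    1: [41.47265, 184.5571, 5910.134],
    3: [124.4179, 553.6714, 17730.4],
    5: [207.3632, 922.7857, 29550.67],
    8: [331.7812, 1476.457, 47281.07],
    10: [414.7265, 1845.571, 59101.34],
}


def scaled_from_unit(theta):
    """Predict VaR_p for CHNP(theta) as theta times the CHNP(1) row."""
    return [theta * value for value in table2[1]]


def max_relative_gap(theta):
    """Largest relative difference between tabulated and scaled values."""
    pairs = zip(table2[theta], scaled_from_unit(theta))
    return max(abs(tab - pred) / tab for tab, pred in pairs)


for theta in table2:
    # Gaps at the level of rounding error support the scale-parameter reading.
    print(theta, max_relative_gap(theta))
```

Under this reading, a VaR for any other θ could be obtained by multiplying the CHNP(1) row by θ, subject to the rounding of the tabulated figures.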
Table 3. Bias, SE, RMSE, and CP for the CHNP model with sample sizes 50, 100, 150, and 200.
θ   n     Bias     SE       RMSE     CP
1   50    0.0420   0.2492   0.2632   0.9398
1   100   0.0157   0.1720   0.1758   0.9456
1   150   0.0130   0.1402   0.1433   0.9458
1   200   0.0130   0.1402   0.1227   0.9458
2   50    0.0695   0.4952   0.5255   0.9391
2   100   0.0337   0.3443   0.3564   0.9442
2   150   0.0216   0.2791   0.2863   0.9420
2   200   0.0142   0.2408   0.2449   0.9454
3   50    0.1067   0.7441   0.7928   0.9381
3   100   0.0631   0.5176   0.5361   0.9447
3   150   0.0373   0.4193   0.4340   0.9411
3   200   0.0242   0.3617   0.3700   0.9415
4   50    0.0242   0.3617   1.0656   0.9415
4   100   0.0802   0.6897   0.7206   0.9413
4   150   0.0593   0.5599   0.5703   0.9468
4   200   0.0361   0.4829   0.4930   0.9465
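Table 3 reports bias, SE, RMSE, and CP from a Monte Carlo study. The sketch below shows one common way to compute such summaries from replicated estimates. It rests on several assumptions not confirmed by this section: SE is taken as the average of the estimated standard errors, CP as the coverage of a 95% Wald interval, and an exponential model stands in for CHNP because the CHNP sampler and likelihood are not reproduced here. The function names `mc_summary` and `simulate_exponential` are hypothetical.

```python
import math
import random
import statistics


def mc_summary(estimates, std_errors, true_value, z=1.96):
    """Summarize Monte Carlo replicates of an estimator (assumed definitions)."""
    bias = statistics.fmean(estimates) - true_value
    se = statistics.fmean(std_errors)  # average estimated standard error
    rmse = math.sqrt(statistics.fmean([(e - true_value) ** 2 for e in estimates]))
    # Share of replicates whose Wald interval est +/- z * se covers the truth.
    covered = sum(abs(e - true_value) <= z * s for e, s in zip(estimates, std_errors))
    return {"bias": bias, "se": se, "rmse": rmse, "cp": covered / len(estimates)}


def simulate_exponential(theta, n, n_rep, seed=0):
    """Stand-in study: exponential data with mean theta, NOT the CHNP model."""
    rng = random.Random(seed)
    estimates, std_errors = [], []
    for _ in range(n_rep):
        sample = [rng.expovariate(1.0 / theta) for _ in range(n)]
        est = statistics.fmean(sample)          # ML estimate of the mean
        estimates.append(est)
        std_errors.append(est / math.sqrt(n))   # asymptotic standard error
    return mc_summary(estimates, std_errors, theta)


print(simulate_exponential(theta=2.0, n=100, n_rep=500))
```

Replacing the stand-in sampler and estimator with CHNP counterparts would give a study of the same shape as Table 3, although the exact definitions used by the authors may differ.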
Table 4. 50 simulated data.
0.01445457    0.01925126    0.04748003    0.12887746    0.19892610    0.53610582
0.70568085    0.72404104    0.78969039    0.82392295    0.84496216    0.90191984
0.92244296    1.21130369    1.26907096    1.28636763    1.30668599    1.35093138
1.42432950    1.65535034    1.72375311    1.84486889    1.96947573    2.10059341
2.14786935    2.68653223    2.71026918    2.73182813    3.10511668    3.41038988
3.57082832    4.44431142    5.09754874    5.21285944    5.60614295    6.62777414
8.60098244    9.32670082    10.85377372   14.28964739   20.51202824   25.16157611
27.74187053   60.16026763   65.41449211   71.33535927   87.67022926   102.51836463
183.21141945  215.76829133
Table 5. Descriptive statistics for 50 simulated data from the CHNP(2) model.
n    Median   Mean     Variance   CS      CK
50   2.417    19.474   1936.553   0.651   6.477
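The median, mean, and variance in Table 5 can be recomputed from the 50 values listed in Table 4. The sketch below uses Python's standard `statistics` module and assumes the variance is the sample variance with divisor n − 1. CS and CK are not recomputed because their exact definitions are not given in this section.

```python
import statistics

# The 50 simulated values listed in Table 4.
data = [
    0.01445457, 0.01925126, 0.04748003, 0.12887746, 0.19892610, 0.53610582,
    0.70568085, 0.72404104, 0.78969039, 0.82392295, 0.84496216, 0.90191984,
    0.92244296, 1.21130369, 1.26907096, 1.28636763, 1.30668599, 1.35093138,
    1.42432950, 1.65535034, 1.72375311, 1.84486889, 1.96947573, 2.10059341,
    2.14786935, 2.68653223, 2.71026918, 2.73182813, 3.10511668, 3.41038988,
    3.57082832, 4.44431142, 5.09754874, 5.21285944, 5.60614295, 6.62777414,
    8.60098244, 9.32670082, 10.85377372, 14.28964739, 20.51202824, 25.16157611,
    27.74187053, 60.16026763, 65.41449211, 71.33535927, 87.67022926, 102.51836463,
    183.21141945, 215.76829133,
]

n = len(data)
median = statistics.median(data)      # average of the 25th and 26th values
mean = statistics.fmean(data)
variance = statistics.variance(data)  # sample variance, divisor n - 1

print(n, round(median, 3), round(mean, 3), round(variance, 3))
```

The large gap between the median and the mean reflects the heavy right tail of the simulated sample.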
Table 6. Fifty simulated data: Model, ML estimates, AIC, and BIC values.
Model          ML Estimates              AIC       BIC
CHNP(θ)        θ̂ = 1.991                 325.641   329.855
Pareto(θ, α)   θ̂ = 0.015, α̂ = 0.188      378.785   380.697
Table 7. Descriptive statistics for income data.
n     Median   Mean      Variance     CS      CK
500   21.125   216.709   11,270,001   0.435   1.655
Table 8. ML estimates for the income data with corresponding standard errors (in parentheses), AIC and BIC values.
Model          ML Estimates                            AIC        BIC
CHNP(θ)        θ̂ = 19.370 (1.978)                      4937.446   4947.875
CEP(θ)         θ̂ = 21.446 (1.675)                      5049.790   5054.005
Pareto(θ, α)   θ̂ = 0.065 (0.001), α̂ = 0.171 (0.008)    5867.910   5876.339
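For reference, the usual definitions of the two criteria are AIC = 2k − 2ℓ and BIC = k ln(n) − 2ℓ, where ℓ is the maximized log-likelihood, k the number of parameters, and n the sample size. The sketch below is a hedged illustration of how the two columns relate: it recovers ℓ from a tabulated AIC and then recomputes BIC, using n = 500 from Table 7. The helper names are illustrative, and the CEP and Pareto rows are used only as examples.

```python
import math


def aic(loglik, k):
    """Akaike information criterion for log-likelihood loglik and k parameters."""
    return 2 * k - 2 * loglik


def bic(loglik, k, n):
    """Bayesian (Schwarz) information criterion for sample size n."""
    return k * math.log(n) - 2 * loglik


def loglik_from_aic(aic_value, k):
    """Invert the AIC definition to recover the maximized log-likelihood."""
    return (2 * k - aic_value) / 2


n = 500  # income data sample size, from Table 7

# CEP has one parameter; Pareto has two. AIC values are taken from Table 8.
cep_bic = bic(loglik_from_aic(5049.790, k=1), k=1, n=n)
pareto_bic = bic(loglik_from_aic(5867.910, k=2), k=2, n=n)

print(round(cep_bic, 3), round(pareto_bic, 3))
```

Lower values of either criterion indicate a better trade-off between fit and model complexity.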
Table 9. Comparison of the VaR of different models for income data.
Model \ Significance   0.5      0.6      0.7       0.8       0.9          0.95
CHNP(θ̂)                25.086   40.565   75.379    180.519   803.325      3574.872
CEP(θ̂)                 31.813   60.186   136.919   436.094   3159.847     22,895.580
Pareto(θ̂, α̂)           3.744    13.805   74.248    795.165   45,800.120   2,638,007
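The Pareto row of Table 9 can be approximated from the closed-form quantile of a classical Pareto distribution. The sketch below assumes that Pareto(θ, α) denotes the Pareto type I model with scale θ and shape α, whose cdf is F(x) = 1 − (θ/x)^α for x ≥ θ, so that VaR_p = θ(1 − p)^(−1/α). The parameter values are the rounded estimates from Table 8, so small discrepancies from the table are expected. The function name `pareto_var` is illustrative.

```python
def pareto_var(p, theta, alpha):
    """Quantile (VaR) of a Pareto type I distribution at level p.

    Assumes cdf F(x) = 1 - (theta / x) ** alpha for x >= theta.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    return theta * (1.0 - p) ** (-1.0 / alpha)


theta_hat, alpha_hat = 0.065, 0.171  # rounded ML estimates from Table 8

for level in (0.5, 0.6, 0.7, 0.8, 0.9, 0.95):
    print(level, pareto_var(level, theta_hat, alpha_hat))
```

Because the estimated shape is small, the quantile grows very quickly as p approaches 1, which is consistent with the very large Pareto entries at the 0.9 and 0.95 levels.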
Table 10. Descriptive statistics for expenditure data.
n   Median   Mean   Variance   CS   CK
752.56084.955987.0300.0791.165
Table 11. ML estimates for the expenditure data with corresponding standard errors (in parentheses), AIC and BIC values.
Model     ML Estimates         AIC       BIC
CHNP(θ)   θ̂ = 2.042 (0.533)    392.609   399.244
CEP(θ)    θ̂ = 2.007 (0.469)    405.183   411.818
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Olmos, N.M.; Gómez-Déniz, E.; Venegas, O.; Gómez, H.W. A Composite Half-Normal-Pareto Distribution with Applications to Income and Expenditure Data. Mathematics 2024, 12, 1631. https://doi.org/10.3390/math12111631