1. Introduction
In recent years, probability distributions have seen significant advancements, particularly through the creation of new families derived from extensions or generalizations of classical distributions. These innovations aim to overcome the limitations of traditional models and provide greater flexibility to better fit the complex phenomena observed in various fields of knowledge. Examples include distributions based on transformations, such as the generalized beta distribution of Eugene et al. [1]; the family of generalized distributions based on the Kumaraswamy distribution, referred to as Kw-distributions and introduced by Cordeiro and De Castro [2] (Kw-normal, Kw-Weibull, Kw-gamma, Kw-Gumbel, and Kw-inverse Gaussian); and the beta modified Weibull distribution of Silva et al. [3]. These new distributions not only better capture data characteristics such as skewness and kurtosis but also improve accuracy when modeling extreme events or phenomena with heavy tails. Furthermore, their implementation has proven useful in fields such as biomedicine, economics, and engineering, where classical models fail to adequately describe the reality of the data.
In parallel, truncated distributions have emerged as another essential tool, particularly when the data are bounded within a specific range. These distributions are modifications of classical ones in which values outside a certain interval are truncated, improving the model's fit for data restricted by natural or experimental constraints [4]. For example, the truncated normal distribution is widely used in reliability analysis and survival studies where negative values are not possible [5,6]. Similarly, the truncated Weibull distribution has been applied in actuarial sciences to model time-to-event data [7], offering greater flexibility when standard distributions fail to capture the behavior of the tail.
A method for creating new families of distributions involves using a generating distribution as a base. This method has been widely employed by various authors, including Cordeiro et al. [8,9], Zografos and Balakrishnan [10], Ristić and Balakrishnan [11], Castellares et al. [12], and Cordeiro et al. [13]. In the same context, Mahdavi and Silva [4] introduced a method for generating families of truncated distributions, producing a two-parameter extension of the base distribution. This method has been used to derive distributions such as the truncated exponential-exponential and the truncated Lomax-exponential. These innovations in probability distributions have proven to be valuable tools in statistical analysis, providing more robust and adaptable models for complex data.
The method introduced by Mahdavi and Silva [4] can be summarized as follows:

Definition of the truncated distribution: A random variable $U$ with support in the interval $[a, b]$, where $a \le 0$ and $b \ge 1$, and cumulative distribution function (CDF) $F$ is considered. The CDF of the truncated random variable $U$ in the interval $[0, 1]$ is defined as:

$$F_{[0,1]}(u) = \frac{F(u) - F(0)}{F(1) - F(0)}, \quad 0 \le u \le 1. \qquad (1)$$

Generation of the new family of distributions: Using the truncated CDF, the new truncated $F$–$G$ family of distributions is introduced. For each absolutely continuous distribution $G$ (denoted as the baseline distribution), the $F_{[0,1]}$–$G$ distribution is associated. The CDF of the $F_{[0,1]}$–$G$ class of distributions is defined as:

$$F_G(x) = F_{[0,1]}(G(x)) = \frac{F(G(x)) - F(0)}{F(1) - F(0)}, \qquad (2)$$

where $G$ is the CDF of the random variable $V$ used to generate a new distribution.

The probability density function (PDF), $f_G$, survival function, and hazard rate function are given, respectively, by:

$$f_G(x) = \frac{g(x)\, f(G(x))}{F(1) - F(0)}, \qquad (3)$$

$$S_G(x) = \frac{F(1) - F(G(x))}{F(1) - F(0)}, \qquad (4)$$

and

$$h_G(x) = \frac{g(x)\, f(G(x))}{F(1) - F(G(x))}, \qquad (5)$$

where $f$ and $g$ are the PDFs of the random variables $U$ and $V$, respectively. The extension to the location-scale case of the model (3) is obtained from the transformation $Z = \xi + \eta X$, where $X$ follows the distribution in (3), for $\xi \in \mathbb{R}$ and $\eta > 0$; it has PDF given by:

$$f_Z(z) = \frac{1}{\eta}\,\frac{g\!\left(\frac{z - \xi}{\eta}\right) f\!\left(G\!\left(\frac{z - \xi}{\eta}\right)\right)}{F(1) - F(0)}, \quad z \in \mathbb{R}. \qquad (6)$$
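A minimal Python sketch of this construction (the helper names are ours, not the paper's; the truncated exponential-exponential case, with an exponential generator $F$ and an exponential baseline $G$, serves as the example):

```python
import numpy as np
from scipy import stats

def truncated_fg_cdf(x, F, G):
    """CDF of the truncated F-G family: [F(G(x)) - F(0)] / [F(1) - F(0)]."""
    return (F.cdf(G.cdf(x)) - F.cdf(0.0)) / (F.cdf(1.0) - F.cdf(0.0))

def truncated_fg_pdf(x, F, G):
    """PDF of the truncated F-G family: g(x) f(G(x)) / [F(1) - F(0)]."""
    return G.pdf(x) * F.pdf(G.cdf(x)) / (F.cdf(1.0) - F.cdf(0.0))

# Truncated exponential-exponential (TEE) example: F = Exp(rate 2), G = Exp(rate 1).
F = stats.expon(scale=1 / 2.0)   # generator distribution F
G = stats.expon(scale=1.0)       # baseline distribution G
x = np.linspace(0.01, 6.0, 200)
print(truncated_fg_cdf(x, F, G)[-1])          # close to 1 for large x
print(truncated_fg_pdf(np.array([0.5, 1.0]), F, G))
```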
Some distributions that have been derived using the generator proposed by [4] are the truncated exponential-exponential (TEE), the truncated Lomax-exponential of Enami [14], the truncated exponential Marshall-Olkin Lomax distribution of Hadi and Al-Noor [15], and the truncated Nadarajah-Haghighi exponential of Al-Habib et al. [16]. The generator proposed by [4] can also be used to derive distributions useful for modeling data in the interval $(0, 1)$, such as proportions, rates, or indices.
The analysis of phenomena represented by proportion data, confined to values between zero and one, is essential across various scientific disciplines. These data elucidate part-to-whole relationships and are prevalent in numerous applications, including the prevalence of diseases, the distribution of resources in economics, the survival rates of species, and the utilization of habitats in ecology [17]. Modeling such data becomes highly challenging when the proportions exhibit strong inflation at zero and/or one. Traditional statistical models, such as the censored normal or censored log-normal models, are often not the best solution, as they struggle to accurately characterize the underlying distribution of proportion data with inflated extremes.
Numerous authors have worked to develop models that are more robust than the censored normal and censored log-normal models for this type of data. By incorporating distributions such as the Birnbaum-Saunders [18,19], Student-t [20,21], skew-normal (SN) [22,23,24,25], and power-normal (PN) [26,27] distributions, among others, they offer a framework for analyzing data with higher degrees of skewness and kurtosis than traditional models can accommodate.
The beta distribution is perhaps the most well-known distribution in the statistical literature for fitting unit interval data. However, it has limitations when modeling unit data with zero-one inflation. Recent proposals, such as zero-one inflated beta models, have been made to overcome this limitation and have proven to be viable alternatives for handling data with certain degrees of asymmetry [28,29,30,31,32,33]. Despite advancements in modeling data with inflation and asymmetry, there remains a gap in adequately addressing zero-one inflation in proportion data. Existing models fail to fully capture the unique distributional characteristics and complexities introduced by these inflations, leading to biased estimators and imprecise inferences [34,35].
The primary aim of this study is to introduce and develop unit-proportional hazard zero-one inflated (UPHZOI) models, a novel class of regression models specifically designed to address the challenges posed by zero-one inflation in proportional data confined to the unit interval. UPHZOI models combine a continuous-discrete mixture distribution with covariates, enabling them to effectively capture the complex dynamics of such data.
The remainder of this article is structured as follows: Section 2 provides background on the asymmetric proportional hazard model and introduces the truncated proportional hazard model; it also presents parameter estimation under a classical approach using the maximum likelihood method. In Section 3, we introduce new regression models for unit interval data with inflation, including the model formulation, parameter estimation, and the elements of the Hessian matrix. Section 4 demonstrates the application of these models through empirical case studies on doubly censored data and zero-inflated data. Section 5 presents an analysis of the major results, limitations, and future research directions. The article concludes with Section 6.
3. UPHN Zero-One Inflated Regression Model
In this section, we present some regression models for unit interval (proportion) data that account for inflation at values zero and one or any value between zero and one.
3.1. Models for Censored Data
Cragg proposed a two-part model [39], which is a framework for fitting the mixture of a discrete and a continuous random variable. This model is represented by:

$$g(y) = p^{\,z}\left[(1 - p)\, f(y)\right]^{1 - z},$$

where $p$ is the probability that determines the relative contribution of the point mass distribution made by the discrete variable, $f$ is a PDF, and $z$ is an indicator variable that takes values of 0 or 1. This model is optimal in cases where the distribution is inflated at the point mass value (for example, $y = 0$), whose probability cannot be explained by the CDF associated with the PDF $f$. Cragg's model can be extended to the case of a variable with double censoring or two point mass values, for example, 0 and 1, in which case it is given by:

$$g(y) = p_0^{\,z_0}\, p_1^{\,z_1}\left[(1 - p_0 - p_1)\, f(y)\right]^{1 - z_0 - z_1},$$

where $0 < p_0, p_1 < 1$ with $p_0 + p_1 < 1$, and $z_0$ is the indicator variable that takes the value 1 if $y = 0$ and zero otherwise. Similarly, $z_1$ is the indicator variable for $y = 1$. In this model, the three components are determined by different stochastic processes: a response in the interior of the interval necessarily comes from $f$, while a zero or a one comes from the corresponding point mass distribution.
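A minimal sketch of this doubly inflated two-part density under the notation above; the function name is ours, and a Beta(2, 5) density stands in for the continuous component $f$:

```python
import numpy as np
from scipy import stats

def two_part_density(y, p0, p1, f):
    """Generalized two-part (Cragg-type) density: mass p0 at 0, mass p1 at 1,
    and weight (1 - p0 - p1) on the continuous density f over (0, 1)."""
    y = np.asarray(y, dtype=float)
    return np.where(y == 0.0, p0,
           np.where(y == 1.0, p1, (1.0 - p0 - p1) * f(y)))

# Stand-in continuous part on (0, 1).
f = stats.beta(2, 5).pdf
print(two_part_density([0.0, 0.3, 1.0], p0=0.10, p1=0.05, f=f))
```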
3.2. Zero-One Inflated PHN Distribution
Based on Cragg's model, we propose the zero-one inflated PHN model by means of the PDF

$$g(y) = p_0^{\,z_0}\, p_1^{\,z_1}\left[(1 - p_0 - p_1)\, f_{\mathrm{PHN}}(y)\right]^{1 - z_0 - z_1},$$

where $f_{\mathrm{PHN}}$ denotes the PHN density of the continuous component and $p_0$, $p_1$, $z_0$, and $z_1$ are defined as in the extended Cragg model. From this model, the case of inflation only at zero follows by taking $p_1 = 0$, and the case of inflation only at one by taking $p_0 = 0$.

The CDF is represented by:

$$G(y) = \begin{cases} p_0, & y = 0, \\ p_0 + (1 - p_0 - p_1)\, F_{\mathrm{PHN}}(y), & 0 < y < 1, \\ 1, & y = 1, \end{cases}$$

where $F_{\mathrm{PHN}}$ is the CDF associated with $f_{\mathrm{PHN}}$.
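For intuition, the following sketch simulates from such a zero-one inflated mixture; since the PHN sampler is not reproduced in this section, a Beta distribution stands in for the continuous component, and all names are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)

def r_zero_one_inflated(n, p0, p1, cont_sampler):
    """Draw n values: mass p0 at zero, mass p1 at one, and the continuous
    sampler with probability 1 - p0 - p1."""
    u = rng.uniform(size=n)
    y = cont_sampler(n)
    y = np.where(u < p0, 0.0, y)
    y = np.where(u >= 1.0 - p1, 1.0, y)
    return y

# Stand-in for the PHN/UPHN continuous part on (0, 1).
cont = lambda n: stats.beta(2, 5).rvs(size=n, random_state=rng)
y = r_zero_one_inflated(5000, p0=0.15, p1=0.05, cont_sampler=cont)
print((y == 0).mean(), (y == 1).mean())   # roughly 0.15 and 0.05
```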
The most interesting case for this new model arises when covariates are used to explain the response both in the censored part (0 and 1) and in the uncensored part (the continuous part in $(0,1)$). Thus, for the discrete part, it is assumed that the responses at zero and one can be explained by the covariate vectors $\mathbf{x}_{0i}$ and $\mathbf{x}_{1i}$, respectively. Then, to determine the probabilities $p_{0i}$ and $p_{1i}$, a logistic model with a polytomous response can be constructed such that:

$$p_{0i} = \frac{\exp(\mathbf{x}_{0i}^{\top}\boldsymbol{\gamma}_0)}{1 + \exp(\mathbf{x}_{0i}^{\top}\boldsymbol{\gamma}_0) + \exp(\mathbf{x}_{1i}^{\top}\boldsymbol{\gamma}_1)} \quad \text{and} \quad p_{1i} = \frac{\exp(\mathbf{x}_{1i}^{\top}\boldsymbol{\gamma}_1)}{1 + \exp(\mathbf{x}_{0i}^{\top}\boldsymbol{\gamma}_0) + \exp(\mathbf{x}_{1i}^{\top}\boldsymbol{\gamma}_1)},$$

where $\boldsymbol{\gamma}_0$ and $\boldsymbol{\gamma}_1$ are vectors of unknown parameters associated, respectively, with the covariate vectors $\mathbf{x}_{0i}$ and $\mathbf{x}_{1i}$. Similarly, for the continuous component of the model, a unit model is assumed with a logit link function for the mean response, i.e., $\operatorname{logit}(\mu_i) = \mathbf{w}_i^{\top}\boldsymbol{\beta}$, where $\mathbf{w}_i$ is a vector of covariates with associated coefficient vector $\boldsymbol{\beta}$. For this model, it is easy to verify that the log-likelihood function for the parameter vector $\boldsymbol{\theta}$, given the observed responses and covariates, can be written in the form:

$$\ell(\boldsymbol{\theta}) = \ell_1(\boldsymbol{\gamma}_0, \boldsymbol{\gamma}_1) + \ell_2(\boldsymbol{\theta}_c),$$

where

$$\ell_1(\boldsymbol{\gamma}_0, \boldsymbol{\gamma}_1) = \sum_{i=1}^{n}\left[z_{0i}\log p_{0i} + z_{1i}\log p_{1i} + (1 - z_{0i} - z_{1i})\log(1 - p_{0i} - p_{1i})\right]$$

and

$$\ell_2(\boldsymbol{\theta}_c) = \sum_{i:\,0 < y_i < 1}\log f_{\mathrm{PHN}}(y_i; \boldsymbol{\theta}_c),$$

with $\boldsymbol{\theta}_c$ denoting the parameters of the continuous component (including $\boldsymbol{\beta}$).
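A small sketch of the polytomous (multinomial) logistic specification for $p_{0i}$ and $p_{1i}$; the covariate and coefficient names are illustrative rather than those of the paper:

```python
import numpy as np

def inflation_probabilities(X0, X1, gamma0, gamma1):
    """Multinomial-logit probabilities of a zero (p0i) and a one (p1i),
    with the continuous outcome as the reference category."""
    eta0 = X0 @ gamma0            # linear predictor for the zero component
    eta1 = X1 @ gamma1            # linear predictor for the one component
    denom = 1.0 + np.exp(eta0) + np.exp(eta1)
    return np.exp(eta0) / denom, np.exp(eta1) / denom

# Intercept plus one covariate for each component.
X0 = np.column_stack([np.ones(4), [0.1, 0.5, 1.2, 2.0]])
X1 = X0.copy()
p0, p1 = inflation_probabilities(X0, X1,
                                 gamma0=np.array([-1.0, 0.3]),
                                 gamma1=np.array([-2.0, 0.5]))
print(p0, p1, 1.0 - p0 - p1)      # the three probabilities sum to one
```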
Given these characteristics, the MLEs of the model parameters can be obtained separately for each component of the log-likelihood function. The score function is derived by differentiating each component of the log-likelihood function. It can be shown that the Fisher information matrix can be written as a block diagonal matrix of the form:

$$\mathcal{I}(\boldsymbol{\theta}) = \operatorname{diag}\left\{\mathcal{I}(\boldsymbol{\gamma}_0, \boldsymbol{\gamma}_1),\, \mathcal{I}(\boldsymbol{\theta}_c)\right\},$$

where $\mathcal{I}(\boldsymbol{\gamma}_0, \boldsymbol{\gamma}_1)$ corresponds to the information matrix of the discrete part. The elements of the observed information matrix for the discrete part are given in the Appendix A.3. The respective Fisher information matrix is obtained by calculating the expectation of the elements of the observed information matrix. Furthermore, since the inverse of a block diagonal matrix is the block diagonal matrix of the respective inverses, it follows that the variance-covariance matrix is given by:

$$\Sigma(\boldsymbol{\theta}) = \operatorname{diag}\left\{\mathcal{I}^{-1}(\boldsymbol{\gamma}_0, \boldsymbol{\gamma}_1),\, \mathcal{I}^{-1}(\boldsymbol{\theta}_c)\right\}.$$

Hence, for large sample sizes, $\hat{\boldsymbol{\theta}}$ is approximately distributed as $N(\boldsymbol{\theta}, \Sigma(\boldsymbol{\theta}))$. Confidence intervals for $\theta_j$ with confidence coefficient $100(1-\alpha)\%$ can be obtained as $\hat{\theta}_j \mp z_{1-\alpha/2}\,\mathrm{s.e.}(\hat{\theta}_j)$. By taking $p_{1i} = 0$, the zero-inflated model follows and, by taking $p_{0i} = 0$, the one-inflated model is obtained.
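The large-sample results above translate into Wald-type intervals as in the following sketch, which approximates the observed information with a finite-difference Hessian of the negative log-likelihood; a simple one-parameter likelihood is used here as a stand-in for the model's:

```python
import numpy as np
from scipy import optimize, stats

# Stand-in negative log-likelihood: Bernoulli sample with success probability
# modeled on the logit scale.
y = np.array([0, 1, 1, 0, 1, 1, 1, 0, 1, 1], dtype=float)

def negloglik(theta):
    p = 1.0 / (1.0 + np.exp(-theta[0]))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

fit = optimize.minimize(negloglik, x0=np.array([0.0]), method="BFGS")
theta_hat = fit.x

def numerical_hessian(f, x, eps=1e-4):
    """Central finite-difference Hessian (observed information for a neg-loglik)."""
    k = len(x)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            ei, ej = np.eye(k)[i] * eps, np.eye(k)[j] * eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4.0 * eps ** 2)
    return H

cov = np.linalg.inv(numerical_hessian(negloglik, theta_hat))
se = np.sqrt(np.diag(cov))
z = stats.norm.ppf(0.975)
print(theta_hat - z * se, theta_hat + z * se)   # 95% Wald interval on the logit scale
```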
3.3. The Zero-One Inflated UPHN Model
Similarly to how the zero-one inflated PHN model was constructed, a zero- and/or one-inflated UPHN distribution can be proposed, which is given by:

$$g(y) = p_0^{\,z_0}\, p_1^{\,z_1}\left[(1 - p_0 - p_1)\, f_{\mathrm{UPHN}}(y)\right]^{1 - z_0 - z_1},$$

where $z_0$, $z_1$, $p_0$, and $p_1$ are defined as in the zero-one inflated PHN model.

The CDF of this distribution is represented by

$$G(y) = \begin{cases} p_0, & y = 0, \\ p_0 + (1 - p_0 - p_1)\, F_{\mathrm{UPHN}}(y), & 0 < y < 1, \\ 1, & y = 1. \end{cases}$$
For the case of covariates in the model, let $\mathbf{x}_{0i}$ and $\mathbf{x}_{1i}$ denote the covariate vectors for the zero- and one-inflated parts, with associated coefficient vectors $\boldsymbol{\gamma}_0$ and $\boldsymbol{\gamma}_1$. For the continuous component of the model, we connect the response variable with the linear predictor using the logit link function. As before, we choose this link function because, in addition to ensuring that the model predictions lie within the unit interval, the logit function allows for more explicit expressions of the score function elements and the information matrix than the probit function, which depends on the integral of the cumulative distribution function of the standard normal distribution. In this way, we assume the relationship $\operatorname{logit}(\mu_i) = \mathbf{w}_i^{\top}\boldsymbol{\beta}$, where $\mathbf{w}_i$ is a vector of covariates with coefficient vector $\boldsymbol{\beta}$.
The proposal, again, is to use a polytomous logistic model to explain the probabilities $p_{0i}$ and $p_{1i}$. As in the case of the inflated PHN model, the log-likelihood function is given by

$$\ell(\boldsymbol{\theta}) = \ell_1(\boldsymbol{\gamma}_0, \boldsymbol{\gamma}_1) + \ell_2(\boldsymbol{\theta}_c),$$

where $\ell_1$ is the same as in the inflated PHN model, while

$$\ell_2(\boldsymbol{\theta}_c) = \sum_{i:\,0 < y_i < 1}\log f_{\mathrm{UPHN}}(y_i; \boldsymbol{\theta}_c),$$

where the mean $\mu_i$ and the remaining parameters of $f_{\mathrm{UPHN}}$ are as defined in (22).
The score function is obtained by differentiating each component of the log-likelihood function, and the Fisher information matrix can be written as a block diagonal matrix of the form:

$$\mathcal{I}(\boldsymbol{\theta}) = \operatorname{diag}\left\{\mathcal{I}(\boldsymbol{\gamma}_0, \boldsymbol{\gamma}_1),\, \mathcal{I}(\boldsymbol{\theta}_c)\right\}.$$

The elements of the matrix $\mathcal{I}(\boldsymbol{\gamma}_0, \boldsymbol{\gamma}_1)$ are as given in the inflated PHN model, while the elements of the matrix $\mathcal{I}(\boldsymbol{\theta}_c)$ are as given in the information matrix of the UPHN regression model.
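As a sketch of the separate maximization of the two log-likelihood components, the following fits the discrete and continuous blocks independently with scipy.optimize; a Beta density stands in for the UPHN density, and the simulated data and parameter names are illustrative:

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(1)
n = 400
y = stats.beta(2, 5).rvs(size=n, random_state=rng)   # stand-in continuous part
y[rng.uniform(size=n) < 0.15] = 0.0                  # zero inflation
y[rng.uniform(size=n) < 0.05] = 1.0                  # one inflation
z0, z1 = (y == 0).astype(float), (y == 1).astype(float)

def negloglik_discrete(gamma):
    """-ell_1: intercept-only multinomial logit for the point masses."""
    denom = 1.0 + np.exp(gamma[0]) + np.exp(gamma[1])
    p0, p1 = np.exp(gamma[0]) / denom, np.exp(gamma[1]) / denom
    return -np.sum(z0 * np.log(p0) + z1 * np.log(p1)
                   + (1.0 - z0 - z1) * np.log(1.0 - p0 - p1))

def negloglik_continuous(par):
    """-ell_2: stand-in Beta likelihood on the strictly interior observations."""
    a, b = np.exp(par)                               # keep both parameters positive
    yc = y[(y > 0.0) & (y < 1.0)]
    return -np.sum(stats.beta(a, b).logpdf(yc))

fit1 = optimize.minimize(negloglik_discrete, x0=np.zeros(2), method="BFGS")
fit2 = optimize.minimize(negloglik_continuous, x0=np.zeros(2), method="BFGS")
print(fit1.x, np.exp(fit2.x))   # logit-scale mass parameters and Beta shape estimates
```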
3.4. Generalized Two-Part PHN Model
Cragg's two-part model [39] encounters the issue that some censored points may be values at the boundary of the censoring limit. This is particularly problematic for a distribution $f(\cdot)$ within the unit interval $(0,1)$, where a zero or a one could either be a realization from the point mass distribution or a partial observation of the continuous variable whose value is not precisely known but is close to $T_0$ or $T_1$, for pre-specified constants $T_0$ (near zero) and $T_1$ (near one). In practice, the values $T_0$ and $T_1$ are, in some cases, defined as those below or above which the instruments cannot record measurements, respectively, and, consequently, are treated as censoring values. In other cases, these observational limits are defined for ethical or practical reasons. For example, in clinical studies, it may be unethical to continue observing a patient under certain conditions, or the costs of prolonged observation may become prohibitive.
To address this issue in the two-part model, Moulton and Halsey [40] propose a new approach to fitting the mixture of continuous and discrete random variables. This approach allows for the possibility that some limiting responses result from interval censoring of the continuous variable. The model proposed by Moulton and Halsey [40] for left censoring at a point $a$ is given by:

$$g(y) = \left[p + (1 - p)\, F(T)\right]^{z}\left[(1 - p)\, f(y)\right]^{1 - z},$$

where $F$ is the CDF associated with $f$, $z$ is the indicator of a limiting (censored) response, and $T$ is a pre-established constant within the interval of support below which limiting responses are considered censored. Similarly to how we generalized Cragg's model, Moulton and Halsey's model can also be generalized for left and right censoring, or for two boundary inflation points within the definition interval of the PDF $f(\cdot)$. In our case, for the unit PHN distribution within the interval $(0,1)$, this generalization of Moulton and Halsey's model is given by:

$$g(y) = \left[p_0 + (1 - p_0 - p_1)\, F(T_0)\right]^{z_0}\left[p_1 + (1 - p_0 - p_1)\,\{1 - F(T_1)\}\right]^{z_1}\left[(1 - p_0 - p_1)\, f(y)\right]^{1 - z_0 - z_1}.$$
It can be observed that this distribution is a model with double censoring (at zero and one) and, therefore, allows for the fit of datasets with inflation at zero and one. This represents an alternative to the double-censored Tobit model, where the CDF of the normal distribution does not efficiently fit the probability of the point mass where double censoring occurs, i.e., the probability of the inflation points.
Extending this model to the case of covariates in each part of the model, we again assume that $\mathbf{x}_{0i}$ and $\mathbf{x}_{1i}$ are sets of auxiliary covariates for the discrete part at zero and one, respectively, and that $\mathbf{w}_i$ is a set of covariates for the continuous part in the interval $(0,1)$. Then, denoting by $T_0$ the lower detection limit, below which observations are recorded as zero, and by $T_1$ the upper detection limit, above which observations are recorded as one, the extension of the Moulton and Halsey model to the double-censored PHN case can be expressed through the PDF given by

$$g(y_i) = \left[p_{0i} + (1 - p_{0i} - p_{1i})\, F(T_0)\right]^{z_{0i}}\left[p_{1i} + (1 - p_{0i} - p_{1i})\,\{1 - F(T_1)\}\right]^{z_{1i}}\left[(1 - p_{0i} - p_{1i})\, f(y_i)\right]^{1 - z_{0i} - z_{1i}},$$

where $p_{0i}$ and $p_{1i}$ are the probability masses at the points zero and one, while $z_{0i}$ and $z_{1i}$ are as defined above; $\operatorname{logit}(\mu_i) = \mathbf{w}_i^{\top}\boldsymbol{\beta}$, where $\boldsymbol{\beta}$ is the set of coefficients associated with the covariate vector $\mathbf{w}_i$.
The CDF of this model is represented by

$$G(y_i) = \begin{cases} p_{0i} + (1 - p_{0i} - p_{1i})\, F(T_0), & y_i = 0, \\ p_{0i} + (1 - p_{0i} - p_{1i})\, F(y_i), & 0 < y_i < 1, \\ 1, & y_i = 1. \end{cases}$$

To model the responses at the point masses zero and one, a multinomial logistic model with a logit link function is again used, where $\boldsymbol{\gamma}_0$ and $\boldsymbol{\gamma}_1$ are the vectors of coefficients associated with the sets of covariates $\mathbf{x}_{0i}$ and $\mathbf{x}_{1i}$.
The log-likelihood function for the estimation of the parameter vector $\boldsymbol{\theta}$, conditionally on the observed covariates, is given by:

$$\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n}\Big\{ z_{0i}\log\left[p_{0i} + (1 - p_{0i} - p_{1i})\, F(T_0)\right] + z_{1i}\log\left[p_{1i} + (1 - p_{0i} - p_{1i})\,\{1 - F(T_1)\}\right] + (1 - z_{0i} - z_{1i})\log\left[(1 - p_{0i} - p_{1i})\, f(y_i)\right]\Big\}.$$

The score equations are obtained by taking the first derivatives with respect to the model parameters, while the information matrix is obtained by proceeding as in the models studied previously. Models with inflation only at zero or only at one can be studied by taking $p_{1i} = 0$ or $p_{0i} = 0$, respectively.
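A sketch of this doubly censored log-likelihood under the structure given above, with detection limits $T_0$ and $T_1$; a normal density is used here purely as a stand-in for the PHN form, and all names are illustrative:

```python
import numpy as np
from scipy import stats

def negloglik_doubly_censored(params, y, T0=0.01, T1=0.99):
    """Negative log-likelihood of a Moulton-Halsey-type doubly censored mixture.
    params = (logit-scale mass at 0, logit-scale mass at 1, mu, log sigma)."""
    g0, g1, mu, log_s = params
    denom = 1.0 + np.exp(g0) + np.exp(g1)
    p0, p1 = np.exp(g0) / denom, np.exp(g1) / denom
    pc = 1.0 - p0 - p1                       # weight of the continuous component
    dist = stats.norm(mu, np.exp(log_s))     # stand-in continuous distribution
    z0, z1 = (y <= 0.0), (y >= 1.0)
    interior = ~z0 & ~z1
    ll = (np.sum(z0) * np.log(p0 + pc * dist.cdf(T0))
          + np.sum(z1) * np.log(p1 + pc * (1.0 - dist.cdf(T1)))
          + np.sum(np.log(pc * dist.pdf(y[interior]))))
    return -ll

y = np.array([0.0, 0.0, 0.12, 0.45, 0.80, 1.0, 0.33, 0.27])
print(negloglik_doubly_censored(np.array([-1.0, -2.0, 0.4, np.log(0.25)]), y))
```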
4. Empirical Applications
In this section, we illustrate the application of the proposed models and compare them with other models using real data. We show that the proposed models can be a valid alternative to some existing regression models in the statistical literature.
4.1. Application 1: Case Study on Students’ Dropout Data
Student dropout is a major problem many Latin American countries face. In some universities in Colombia, this phenomenon can lead to more than 50% of students who enroll in a university program abandoning their higher education studies. This phenomenon has its greatest impact in the first four semesters of undergraduate studies, which is why it is important to determine the main causes leading to this abandonment of higher education.
This application refers to student dropout in the Faculty of Veterinary Medicine and Zootechnics (MVZ, by its acronym in Spanish) at the University of Córdoba, Colombia. The analyzed information corresponds to a sample of students who dropped out during one of the first four semesters (early dropout) of the programs in the MVZ Faculty at the University of Córdoba. The data correspond to variables from the SPADIES System of the Ministry of National Education (MEN by its acronym in Spanish) and the university itself.
The response variable y corresponds to the proportion of subjects passed up to the point of dropout. The explanatory variables considered were: Saber 11 test score (an exam taken at the end of secondary education); age at the time of taking the Saber 11 test; an indicator of whether the student received financial support (yes, no); mother's educational level (coded 1 if professional and 0 otherwise); number of siblings; socioeconomic status of the student (coded 1 if from strata 1, 2, or 3, referred to as low, and 0 otherwise); and student's gender (coded 1 if male and 0 otherwise).
The zero-one inflated PHN, UPHN, and doubly censored PHN (DCPHN) models were fitted, since some students drop out in the first semester without passing any subjects, while others drop out within the first four semesters even after passing all enrolled subjects.
The results obtained with the models studied in this article show that, in all models, the significant variables for the continuous part were the Saber 11 test score, age at the time of taking the Saber 11 test, and number of siblings. Similarly, the censored part at zero is not explained by any variable in any of the three models, while the censored part at one showed significance for variables such as age at the time of taking the Saber 11 test (directly related to the age of university entry) and number of siblings.
Table 1 shows the results of the best-fitted model for each of the considered models. To determine which model presents better performance, we used the AIC criterion [41] and the corrected AIC (AICc) [42]. These criteria are defined as:

$$\mathrm{AIC} = -2\,\ell(\hat{\boldsymbol{\theta}}) + 2p \quad \text{and} \quad \mathrm{AICc} = \mathrm{AIC} + \frac{2p(p+1)}{n - p - 1},$$

where $p$ is the number of parameters of the model in question and $n$ is the sample size.
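These criteria can be computed directly from the maximized log-likelihood, as in the short sketch below (the numerical values are illustrative only):

```python
def aic_aicc(loglik, p, n):
    """AIC and corrected AIC from a maximized log-likelihood,
    p parameters and n observations."""
    aic = -2.0 * loglik + 2.0 * p
    aicc = aic + 2.0 * p * (p + 1.0) / (n - p - 1.0)
    return aic, aicc

print(aic_aicc(loglik=152.3, p=7, n=110))
```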
The MLEs, with standard errors in parentheses, are given in Table 1. According to the AIC and AICc criteria, the model that best fits the student dropout data is the UPHN, followed by the DCPHN model.
To identify outliers and/or model misspecification, we examined the transformation of the martingale residual, $r_{MT_i}$, as proposed by Barros et al. [43]. These residuals are defined by

$$r_{MT_i} = \operatorname{sgn}(r_{M_i})\sqrt{-2\left[r_{M_i} + \delta_i\log(\delta_i - r_{M_i})\right]},$$

where $r_{M_i} = \delta_i + \log S(y_i; \hat{\boldsymbol{\theta}})$ is the martingale residual proposed by Ortega et al. [44], $\delta_i = 0, 1$ indicates whether the $i$th observation is censored or not, respectively, $\operatorname{sgn}(r_{M_i})$ denotes the sign of $r_{M_i}$, and $S(y_i; \hat{\boldsymbol{\theta}})$ represents the survival function evaluated at $y_i$, where $\hat{\boldsymbol{\theta}}$ is the MLE of $\boldsymbol{\theta}$.
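A sketch of the residual computation described above; the censoring indicators and fitted survival values below are illustrative stand-ins for those of a fitted model:

```python
import numpy as np

def martingale_type_residual(delta, surv):
    """Transformed martingale residuals r_MT from censoring indicators delta
    and fitted survival probabilities S(y_i; theta_hat)."""
    r_m = delta + np.log(surv)                          # martingale residual
    inner = r_m + delta * np.log(np.clip(delta - r_m, 1e-12, None))
    return np.sign(r_m) * np.sqrt(-2.0 * inner)

delta = np.array([1, 1, 0, 1, 0], dtype=float)          # 1 = observed, 0 = censored
surv = np.array([0.62, 0.15, 0.80, 0.40, 0.55])         # illustrative fitted survival
print(martingale_type_residual(delta, surv))
```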
The plots of $r_{MT_i}$ with confidence envelopes generated for the PHN, UPHN, and DCPHN models, shown in Figure 1 and Figure 2, indicate that the fitted PHN, UPHN, and DCPHN regression models with a logit link function exhibit a good fit.
4.2. Application 2: Case Study on Periodontal Disease Data
The data motivating this second application come from a clinical study in which the clinical attachment level (CAL), a key marker of periodontal disease (PD), was measured at six sites on each tooth of a subject. The primary statistical question is to estimate functions that model the relationship between the "proportion of diseased sites associated with a specific tooth type (incisors, canines, premolars, and first molars)" and the covariates described below. The full dataset was previously analyzed by Galvis et al. [45] and includes information from 290 individuals. The response variable in this study is the proportion of diseased sites for the premolars (denoted as Y), with auxiliary covariates being gender, age, glycosylated hemoglobin, and smoking status.
The dataset exhibits significant inflation at zero, but for certain subjects we also observe responses equal to one. To account for this, we applied the zero-one inflated beta (BIZU), zero-one inflated truncated log-normal (LNIZU), doubly censored proportional hazard normal (DCPHN), and zero-one inflated UPHN (UPHNIZU) regression models. Our analysis revealed that only two of the covariates were statistically significant; for the DCPHN model, only one covariate was significant for both discrete components.
We used several information criteria to compare the various models, including the AIC and the AICc. We also used the Bayesian information criterion (BIC) and the Hannan-Quinn criterion (HQC), defined as follows:

$$\mathrm{BIC} = -2\,\ell(\hat{\boldsymbol{\theta}}) + p\log(n) \quad \text{and} \quad \mathrm{HQC} = -2\,\ell(\hat{\boldsymbol{\theta}}) + 2p\log\left(\log(n)\right),$$

where $p$ is the number of parameters of the model in question and $n$ is the sample size.
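These two criteria can be computed analogously to the AIC and AICc; a short sketch with illustrative values:

```python
import numpy as np

def bic_hqc(loglik, p, n):
    """BIC and Hannan-Quinn criterion from a maximized log-likelihood."""
    bic = -2.0 * loglik + p * np.log(n)
    hqc = -2.0 * loglik + 2.0 * p * np.log(np.log(n))
    return bic, hqc

print(bic_hqc(loglik=210.8, p=8, n=290))
```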
The MLEs, with standard errors in parentheses, are given in Table 2.
In Figure 3, Figure 4, Figure 5 and Figure 6, it can be observed that the best fits correspond to the BIZU and UPHNIZU models. Additionally, note that under three of the criteria the UPHNIZU model performs better than the BIZU model, while for the fourth criterion (BIC) no significant difference is found between the two models. It is important to consider that the BIZU model has one fewer parameter, which further supports the superior fit of the UPHNIZU model. This allows us to conclude that the UPHNIZU model is a promising new alternative for modeling responses within the unit interval with zero-one inflation.
We also generated standardized residual plots to identify the presence of outliers when fitting the UPHNIZU model, and we present the cumulative distribution function (CDF) plot of the fitted UPHN model (Figure 5). From these, the model shows a good fit, and no outliers are detected. In addition, envelope plots were obtained for the fitted BIZU, LNIZU, and DCPHN models, which are presented in Figure 3 and Figure 4. These plots show that the BIZU and LNIZU models exhibit a better fit than the DCPHN model.
5. Discussion
In this article, we introduced a broad class of skew regression models designed for response variables that lie within the unit interval, which may exhibit an excess of zeros or ones. These models were derived from a continuous-discrete mixture distribution that incorporates covariates in both its discrete and continuous components. As evidenced by applications using real data, the models we propose serve as a viable alternative for modeling rates and proportions that are inflated at either zero or one.
5.1. Major Results and Implications
Our findings demonstrate that the UPHNIZU model consistently surpassed the other models in terms of AIC, AICc, BIC, and HQC values. These models delivered a superior fit for the data from the case study on student dropout and from the clinical study on periodontal disease, where the response variable was the proportion of diseased tooth sites.
Our findings also demonstrate that UPHNIZU models generate a non-singular information matrix, allowing valid statistical inferences and outperforming other asymmetric models, such as those derived from the skew-normal or the beta distribution. Empirical results show the models' effectiveness in analyzing proportional data with zero and one inflation, highlighting their robustness and practicality in research fields such as biomedicine, economics, and engineering. In addition, we presented parameter estimation by maximum likelihood and discussed applications to student dropout and periodontal disease data. UPHNIZU models are a promising alternative for analyzing bounded data with extreme inflation, providing a robust and flexible tool to capture the complex characteristics of such data. This work also emphasizes the importance of innovations in probability distributions and their application in modeling complex phenomena, offering an advanced solution to the challenges of modeling proportional data with zero and one inflation.
5.2. Model Limitations
Although the results are encouraging, our study has several limitations. First, the models’ complexity and reliance on iterative numerical methods for parameter estimation can lead to high computational demands. Second, while the models showed strong performance with the datasets utilized in this research, additional validation on different types of data is required to ensure their applicability in broader contexts.
5.3. Prospects for Further Investigation
Future research may explore several avenues, including the creation of more efficient algorithms to lessen the computational demands of fitting these models. Furthermore, applying these models in fields like economics or environmental studies could offer additional validation and reveal new applications.
Given the importance of model performance in our analysis, while the methods employed—such as AIC, AICc, BIC, HQC, and martingale residuals—are effective for evaluating model adequacy, there is room for improvement. Future research could investigate additional goodness-of-fit tests specifically designed for bounded and inflated data, which could offer a more thorough evaluation of model performance and robustness. Additionally, exploring Bayesian inference methods for unit interval data with inflation could provide valuable insights and enhance the analytical framework.
An intriguing avenue for future research involves adapting these models to accommodate longitudinal or hierarchical data structures. This would require methods to manage correlations within subjects or groups, often present in practical datasets. Additionally, examining the robustness of these models in various misspecification scenarios could lead to more resilient modeling strategies.
6. Conclusions
Analyzing proportion data, particularly when values are inflated at zero and one, presents significant challenges across various scientific disciplines. Conventional models, such as beta and Tobit regression models, frequently fail to accurately capture the complexities associated with such data. This underscores the need for more sophisticated modeling techniques capable of addressing the unique distributional characteristics of zero-one inflation.
This work tackled these challenges by introducing the proportional hazard normal zero-one inflated models. These models incorporate a continuous-discrete mixture distribution with covariates in both components, offering an advanced framework for analyzing proportion data with specific inflation points. Consequently, the proportional hazard normal zero-one inflated models provide a robust and flexible method for capturing asymmetrically distributed data and mixed discrete-continuous characteristics, prevalent in fields such as medicine, sociology, humanities, and economics.
Our applications, which pertain to two case studies on student dropout and periodontal data, demonstrated that the proportional hazard normal zero-one inflated models with the logit link function are an excellent alternative to traditional models. The transformation of martingale residuals and the generation of simulated envelopes further validated the robustness of our models, underscoring their effectiveness in identifying model misfits and outliers. The proposed models address a critical gap in statistical modeling, providing valuable insights and reliable estimators for handling bounded and inflated data. The flexibility and robustness of the proportional hazard normal zero-one inflated models make them a viable alternative for describing proportion data that are inflated at zero or one.
In conclusion, the proportional hazard normal zero-one inflated models represent a substantial advancement in statistical modeling techniques for proportion data exhibiting zero-one inflation. These models provide a robust and adaptable framework for analyzing such data, yielding deeper insights and more reliable estimators.