1. Introduction
In recent years, probability distributions have seen significant advancements, particularly through the creation of new families derived from extensions or generalizations of classical distributions. These innovations aim to overcome the limitations of traditional models and provide greater flexibility to better fit the complex phenomena observed in various fields of knowledge. Examples include distributions based on transformations, such as the generalized beta distribution of Eugene et al. [1]; the family of generalized distributions based on the Kumaraswamy distribution, referred to as Kw-distributions and introduced by Cordeiro and De Castro [2] (Kw-normal, Kw-Weibull, Kw-gamma, Kw-Gumbel, and Kw-inverse Gaussian); and the beta modified Weibull distribution of Silva et al. [3]. These new distributions not only better capture data characteristics such as skewness and kurtosis but also improve accuracy when modeling extreme events or phenomena with heavy tails. Furthermore, their implementation has proven useful in fields such as biomedicine, economics, and engineering, where classical models fail to adequately describe the reality of the data.
In parallel, truncated distributions have emerged as another essential tool, particularly when the data are bounded within a specific range. These distributions are modifications of classical ones in which values outside a certain interval are truncated, improving the model's fit for data restricted by natural or experimental constraints [4]. For example, the truncated normal distribution is widely used in reliability analysis and survival studies where negative values are not possible [5,6]. Similarly, the truncated Weibull distribution has been applied in actuarial sciences to model time-to-event data [7], offering greater flexibility when standard distributions fail to capture the behavior of the tail.
A method for creating new families of distributions involves using a generating distribution as a base. This method has been widely employed by various authors, including Cordeiro et al. [8,9], Zografos and Balakrishnan [10], Ristić and Balakrishnan [11], Castellares et al. [12], and Cordeiro et al. [13]. In the same context, Mahdavi and Silva [4] introduced a method for generating families of truncated distributions, producing a two-parameter extension of the base distribution. This method has been used to derive distributions such as the truncated exponential-exponential and the truncated Lomax-exponential. These innovations in probability distributions have proven to be valuable tools in statistical analysis, providing more robust and adaptable models for complex data.
The method introduced by Mahdavi and Silva [4] can be summarized as follows:

Definition of the truncated distribution: A random variable $U$ with support in the interval $[a, b]$, where $a \le 0$ and $b \ge 1$, and cumulative distribution function (CDF) $F$ is considered. The CDF of the truncated random variable $U$ in the interval $[0, 1]$ is defined as:

$$F_{[0,1]}(u) = \frac{F(u) - F(0)}{F(1) - F(0)}, \quad 0 \le u \le 1. \qquad (1)$$

Generation of the new family of distributions: Using the truncated CDF, the new truncated $F$–$G$ family of distributions is introduced. For each absolutely continuous distribution $G$ (denoted as the baseline distribution), the $F_{[0,1]}$–$G$ distribution is associated. The CDF of the $F_{[0,1]}$–$G$ class of distributions is defined as:

$$F_G(x) = F_{[0,1]}(G(x)) = \frac{F(G(x)) - F(0)}{F(1) - F(0)}, \qquad (2)$$

where $G$ is the CDF of the random variable $V$ used to generate a new distribution.

The probability density function (PDF), $f_G$, survival function, and hazard rate function are given, respectively, by:

$$f_G(x) = \frac{g(x)\, f(G(x))}{F(1) - F(0)}, \qquad (3)$$

$$S_G(x) = \frac{F(1) - F(G(x))}{F(1) - F(0)}, \qquad (4)$$

and

$$h_G(x) = \frac{g(x)\, f(G(x))}{F(1) - F(G(x))}, \qquad (5)$$

where $f$ and $g$ are the PDFs of the random variables $U$ and $V$, respectively. The extension to the location-scale case of the model (3) is obtained from the transformation $Z = \xi + \eta X$, where $X$ follows the distribution in (3), for $\xi \in \mathbb{R}$ and $\eta > 0$; it has PDF given by:

$$f_Z(z) = \frac{1}{\eta}\,\frac{g\!\left(\frac{z - \xi}{\eta}\right) f\!\left(G\!\left(\frac{z - \xi}{\eta}\right)\right)}{F(1) - F(0)}, \quad z \in \mathbb{R}. \qquad (6)$$
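A minimal Python sketch of this construction (the helper names are ours, not the paper's; the truncated exponential-exponential case, with an exponential generator $F$ and an exponential baseline $G$, serves as the example):

```python
import numpy as np
from scipy import stats

def truncated_fg_cdf(x, F, G):
    """CDF of the truncated F-G family: [F(G(x)) - F(0)] / [F(1) - F(0)]."""
    return (F.cdf(G.cdf(x)) - F.cdf(0.0)) / (F.cdf(1.0) - F.cdf(0.0))

def truncated_fg_pdf(x, F, G):
    """PDF of the truncated F-G family: g(x) f(G(x)) / [F(1) - F(0)]."""
    return G.pdf(x) * F.pdf(G.cdf(x)) / (F.cdf(1.0) - F.cdf(0.0))

# Truncated exponential-exponential (TEE) example: F = Exp(rate 2), G = Exp(rate 1).
F = stats.expon(scale=1 / 2.0)   # generator distribution F
G = stats.expon(scale=1.0)       # baseline distribution G
x = np.linspace(0.01, 6.0, 200)
print(truncated_fg_cdf(x, F, G)[-1])          # close to 1 for large x
print(truncated_fg_pdf(np.array([0.5, 1.0]), F, G))
```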
Some distributions that have been derived using the generator proposed by [4] are the truncated exponential-exponential (TEE), the truncated Lomax-exponential of Enami [14], the truncated exponential Marshall-Olkin Lomax distribution of Hadi and Al-Noor [15], and the truncated Nadarajah-Haghighi exponential of Al-Habib et al. [16]. The generator proposed by [4] can also be used to derive distributions useful for modeling data in the interval $(0, 1)$, such as proportions, rates, or indices.
The analysis of phenomena represented by proportion data, confined to values between zero and one, is essential across various scientific disciplines. These data elucidate part-to-whole relationships and are prevalent in numerous applications, including the prevalence of diseases, the distribution of resources in economics, the survival rates of species, and the utilization of habitats in ecology [17]. Modeling such data becomes highly challenging when the proportions exhibit strong inflation at zero and/or one. Traditional statistical models, such as the censored normal or censored log-normal models, are often not the best solution, as they struggle to accurately characterize the underlying distribution of proportion data with inflated extremes.
Numerous authors have worked to develop models that are more robust than the censored normal and censored log-normal models for this type of data. By incorporating distributions such as the Birnbaum-Saunders [18,19], Student-t [20,21], skew-normal (SN) [22,23,24,25], and power-normal (PN) [26,27] distributions, among others, they offer a framework for analyzing data with higher degrees of skewness and kurtosis than traditional models can accommodate.
The beta distribution is perhaps the most well-known distribution in the statistical literature for fitting unit interval data. However, it has limitations when modeling unit data with zero-one inflation. Recent proposals, such as zero-one inflated beta models, have been made to overcome this limitation and have proven to be viable alternatives for handling data with certain degrees of asymmetry [28,29,30,31,32,33]. Despite advancements in modeling data with inflation and asymmetry, there remains a gap in adequately addressing zero-one inflation in proportion data. Existing models fail to fully capture the unique distributional characteristics and complexities introduced by these inflations, leading to biased estimators and imprecise inferences [34,35].
The primary aim of this study is to introduce and develop unit-proportional hazard zero-one inflated (UPHZOI) models, a novel class of regression models specifically designed to address the challenges posed by zero-one inflation in proportional data confined to the unit interval. UPHZOI models combine a continuous-discrete mixture distribution with covariates, enabling them to effectively capture the complex dynamics of such data.
The remainder of this article is structured as follows: Section 2 provides background on the asymmetric proportional hazard model and introduces the truncated proportional hazard model; it also presents parameter estimation under a classical approach using the maximum likelihood method. In Section 3, we introduce new regression models for unit interval data with inflation, including the model formulation, parameter estimation, and the elements of the Hessian matrix. Section 4 demonstrates the application of these models through empirical case studies on doubly censored data and zero-inflated data. Section 5 presents an analysis of the major results, limitations, and future research directions. The article concludes with Section 6.
3. UPHN Zero-One Inflated Regression Model
In this section, we present some regression models for unit interval (proportion) data that account for inflation at values zero and one or any value between zero and one.
3.1. Models for Censored Data
Cragg proposed a two-part model [39], which is a framework for fitting the mixture of a discrete and a continuous random variable. This model is represented by:

$$g(y) = p^{\,z}\left[(1 - p)\, f(y)\right]^{1 - z},$$

where $p$ is the probability that determines the relative contribution of the point mass distribution made by the discrete variable, $f$ is a PDF, and $z$ is an indicator variable that takes values of 0 or 1. This model is optimal in cases where the distribution is inflated at the point mass value (for example, $y = 0$), whose probability cannot be explained by the CDF associated with the PDF $f$. Cragg's model can be extended to the case of a variable with double censoring or two point mass values, for example, 0 and 1, in which case it is given by:

$$g(y) = p_0^{\,z_0}\, p_1^{\,z_1}\left[(1 - p_0 - p_1)\, f(y)\right]^{1 - z_0 - z_1},$$

where $0 < p_0, p_1 < 1$ with $p_0 + p_1 < 1$, and $z_0$ is the indicator variable that takes the value 1 if $y = 0$ and zero otherwise. Similarly, $z_1$ is the indicator variable for $y = 1$. In this model, the three components are determined by different stochastic processes: a response in the interior of the interval necessarily comes from $f$, while a zero or a one comes from the corresponding point mass distribution.
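A minimal sketch of this doubly inflated two-part density under the notation above; the function name is ours, and a Beta(2, 5) density stands in for the continuous component $f$:

```python
import numpy as np
from scipy import stats

def two_part_density(y, p0, p1, f):
    """Generalized two-part (Cragg-type) density: mass p0 at 0, mass p1 at 1,
    and weight (1 - p0 - p1) on the continuous density f over (0, 1)."""
    y = np.asarray(y, dtype=float)
    return np.where(y == 0.0, p0,
           np.where(y == 1.0, p1, (1.0 - p0 - p1) * f(y)))

# Stand-in continuous part on (0, 1).
f = stats.beta(2, 5).pdf
print(two_part_density([0.0, 0.3, 1.0], p0=0.10, p1=0.05, f=f))
```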
3.2. Zero-One Inflated PHN Distribution
Based on Cragg's model, we propose the zero-one inflated PHN model by means of the PDF

$$g(y) = p_0^{\,z_0}\, p_1^{\,z_1}\left[(1 - p_0 - p_1)\, f_{\mathrm{PHN}}(y)\right]^{1 - z_0 - z_1},$$

where $f_{\mathrm{PHN}}$ denotes the PHN density of the continuous component and $p_0$, $p_1$, $z_0$, and $z_1$ are defined as in the extended Cragg model. From this model, the case of inflation only at zero follows by taking $p_1 = 0$, and the case of inflation only at one by taking $p_0 = 0$.

The CDF is represented by:

$$G(y) = \begin{cases} p_0, & y = 0, \\ p_0 + (1 - p_0 - p_1)\, F_{\mathrm{PHN}}(y), & 0 < y < 1, \\ 1, & y = 1, \end{cases}$$

where $F_{\mathrm{PHN}}$ is the CDF associated with $f_{\mathrm{PHN}}$.
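For intuition, the following sketch simulates from such a zero-one inflated mixture; since the PHN sampler is not reproduced in this section, a Beta distribution stands in for the continuous component, and all names are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)

def r_zero_one_inflated(n, p0, p1, cont_sampler):
    """Draw n values: mass p0 at zero, mass p1 at one, and the continuous
    sampler with probability 1 - p0 - p1."""
    u = rng.uniform(size=n)
    y = cont_sampler(n)
    y = np.where(u < p0, 0.0, y)
    y = np.where(u >= 1.0 - p1, 1.0, y)
    return y

# Stand-in for the PHN/UPHN continuous part on (0, 1).
cont = lambda n: stats.beta(2, 5).rvs(size=n, random_state=rng)
y = r_zero_one_inflated(5000, p0=0.15, p1=0.05, cont_sampler=cont)
print((y == 0).mean(), (y == 1).mean())   # roughly 0.15 and 0.05
```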
The most interesting case for this new model arises when covariates are used to explain the response both in the censored part (0 and 1) and in the uncensored part (the continuous part in $(0,1)$). Thus, for the discrete part, it is assumed that the responses at zero and one can be explained by the covariate vectors $\mathbf{x}_{0i}$ and $\mathbf{x}_{1i}$, respectively. Then, to determine the probabilities $p_{0i}$ and $p_{1i}$, a logistic model with a polytomous response can be constructed such that:

$$p_{0i} = \frac{\exp(\mathbf{x}_{0i}^{\top}\boldsymbol{\gamma}_0)}{1 + \exp(\mathbf{x}_{0i}^{\top}\boldsymbol{\gamma}_0) + \exp(\mathbf{x}_{1i}^{\top}\boldsymbol{\gamma}_1)} \quad \text{and} \quad p_{1i} = \frac{\exp(\mathbf{x}_{1i}^{\top}\boldsymbol{\gamma}_1)}{1 + \exp(\mathbf{x}_{0i}^{\top}\boldsymbol{\gamma}_0) + \exp(\mathbf{x}_{1i}^{\top}\boldsymbol{\gamma}_1)},$$

where $\boldsymbol{\gamma}_0$ and $\boldsymbol{\gamma}_1$ are vectors of unknown parameters associated, respectively, with the covariate vectors $\mathbf{x}_{0i}$ and $\mathbf{x}_{1i}$. Similarly, for the continuous component of the model, a unit model is assumed with a logit link function for the mean response, i.e., $\operatorname{logit}(\mu_i) = \mathbf{w}_i^{\top}\boldsymbol{\beta}$, where $\mathbf{w}_i$ is a vector of covariates with associated coefficient vector $\boldsymbol{\beta}$. For this model, it is easy to verify that the log-likelihood function for the parameter vector $\boldsymbol{\theta}$, given the observed responses and covariates, can be written in the form:

$$\ell(\boldsymbol{\theta}) = \ell_1(\boldsymbol{\gamma}_0, \boldsymbol{\gamma}_1) + \ell_2(\boldsymbol{\theta}_c),$$

where

$$\ell_1(\boldsymbol{\gamma}_0, \boldsymbol{\gamma}_1) = \sum_{i=1}^{n}\left[z_{0i}\log p_{0i} + z_{1i}\log p_{1i} + (1 - z_{0i} - z_{1i})\log(1 - p_{0i} - p_{1i})\right]$$

and

$$\ell_2(\boldsymbol{\theta}_c) = \sum_{i:\,0 < y_i < 1}\log f_{\mathrm{PHN}}(y_i; \boldsymbol{\theta}_c),$$

with $\boldsymbol{\theta}_c$ denoting the parameters of the continuous component (including $\boldsymbol{\beta}$).
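A small sketch of the polytomous (multinomial) logistic specification for $p_{0i}$ and $p_{1i}$; the covariate and coefficient names are illustrative rather than those of the paper:

```python
import numpy as np

def inflation_probabilities(X0, X1, gamma0, gamma1):
    """Multinomial-logit probabilities of a zero (p0i) and a one (p1i),
    with the continuous outcome as the reference category."""
    eta0 = X0 @ gamma0            # linear predictor for the zero component
    eta1 = X1 @ gamma1            # linear predictor for the one component
    denom = 1.0 + np.exp(eta0) + np.exp(eta1)
    return np.exp(eta0) / denom, np.exp(eta1) / denom

# Intercept plus one covariate for each component.
X0 = np.column_stack([np.ones(4), [0.1, 0.5, 1.2, 2.0]])
X1 = X0.copy()
p0, p1 = inflation_probabilities(X0, X1,
                                 gamma0=np.array([-1.0, 0.3]),
                                 gamma1=np.array([-2.0, 0.5]))
print(p0, p1, 1.0 - p0 - p1)      # the three probabilities sum to one
```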
Given these characteristics, the MLEs of the model parameters can be obtained separately for each component of the log-likelihood function. The score function is derived by differentiating each component of the log-likelihood function. It can be shown that the Fisher information matrix can be written as a block diagonal matrix of the form:

$$\mathcal{I}(\boldsymbol{\theta}) = \operatorname{diag}\left\{\mathcal{I}(\boldsymbol{\gamma}_0, \boldsymbol{\gamma}_1),\, \mathcal{I}(\boldsymbol{\theta}_c)\right\},$$

where $\mathcal{I}(\boldsymbol{\gamma}_0, \boldsymbol{\gamma}_1)$ corresponds to the information matrix of the discrete part. The elements of the observed information matrix for the discrete part are given in the Appendix A.3. The respective Fisher information matrix is obtained by calculating the expectation of the elements of the observed information matrix. Furthermore, since the inverse of a block diagonal matrix is the block diagonal matrix of the respective inverses, it follows that the variance-covariance matrix is given by:

$$\Sigma(\boldsymbol{\theta}) = \operatorname{diag}\left\{\mathcal{I}^{-1}(\boldsymbol{\gamma}_0, \boldsymbol{\gamma}_1),\, \mathcal{I}^{-1}(\boldsymbol{\theta}_c)\right\}.$$

Hence, for large sample sizes, $\hat{\boldsymbol{\theta}}$ is approximately distributed as $N(\boldsymbol{\theta}, \Sigma(\boldsymbol{\theta}))$. Confidence intervals for $\theta_j$ with confidence coefficient $100(1-\alpha)\%$ can be obtained as $\hat{\theta}_j \mp z_{1-\alpha/2}\,\mathrm{s.e.}(\hat{\theta}_j)$. By taking $p_{1i} = 0$, the zero-inflated model follows and, by taking $p_{0i} = 0$, the one-inflated model is obtained.
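The large-sample results above translate into Wald-type intervals as in the following sketch, which approximates the observed information with a finite-difference Hessian of the negative log-likelihood; a simple one-parameter likelihood is used here as a stand-in for the model's:

```python
import numpy as np
from scipy import optimize, stats

# Stand-in negative log-likelihood: Bernoulli sample with success probability
# modeled on the logit scale.
y = np.array([0, 1, 1, 0, 1, 1, 1, 0, 1, 1], dtype=float)

def negloglik(theta):
    p = 1.0 / (1.0 + np.exp(-theta[0]))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

fit = optimize.minimize(negloglik, x0=np.array([0.0]), method="BFGS")
theta_hat = fit.x

def numerical_hessian(f, x, eps=1e-4):
    """Central finite-difference Hessian (observed information for a neg-loglik)."""
    k = len(x)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            ei, ej = np.eye(k)[i] * eps, np.eye(k)[j] * eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4.0 * eps ** 2)
    return H

cov = np.linalg.inv(numerical_hessian(negloglik, theta_hat))
se = np.sqrt(np.diag(cov))
z = stats.norm.ppf(0.975)
print(theta_hat - z * se, theta_hat + z * se)   # 95% Wald interval on the logit scale
```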
3.3. The Zero-One Inflated UPHN Model
Similarly to how the zero-one inflated PHN model was constructed, a zero- and/or one-inflated UPHN distribution can be proposed, which is given by:

$$g(y) = p_0^{\,z_0}\, p_1^{\,z_1}\left[(1 - p_0 - p_1)\, f_{\mathrm{UPHN}}(y)\right]^{1 - z_0 - z_1},$$

where $z_0$, $z_1$, $p_0$, and $p_1$ are defined as in the zero-one inflated PHN model.

The CDF of this distribution is represented by

$$G(y) = \begin{cases} p_0, & y = 0, \\ p_0 + (1 - p_0 - p_1)\, F_{\mathrm{UPHN}}(y), & 0 < y < 1, \\ 1, & y = 1. \end{cases}$$
For the case of covariates in the model, let $\mathbf{x}_{0i}$ and $\mathbf{x}_{1i}$ denote the covariate vectors for the zero- and one-inflated parts, with associated coefficient vectors $\boldsymbol{\gamma}_0$ and $\boldsymbol{\gamma}_1$. For the continuous component of the model, we connect the response variable with the linear predictor using the logit link function. As before, we choose this link function because, in addition to ensuring that the model predictions lie within the unit interval, the logit function allows for more explicit expressions of the score function elements and the information matrix than the probit function, which depends on the integral of the cumulative distribution function of the standard normal distribution. In this way, we assume the relationship $\operatorname{logit}(\mu_i) = \mathbf{w}_i^{\top}\boldsymbol{\beta}$, where $\mathbf{w}_i$ is a vector of covariates with coefficient vector $\boldsymbol{\beta}$.
The proposal, again, is to use a polytomous logistic model to explain the probabilities $p_{0i}$ and $p_{1i}$. As in the case of the inflated PHN model, the log-likelihood function is given by

$$\ell(\boldsymbol{\theta}) = \ell_1(\boldsymbol{\gamma}_0, \boldsymbol{\gamma}_1) + \ell_2(\boldsymbol{\theta}_c),$$

where $\ell_1$ is the same as in the inflated PHN model, while

$$\ell_2(\boldsymbol{\theta}_c) = \sum_{i:\,0 < y_i < 1}\log f_{\mathrm{UPHN}}(y_i; \boldsymbol{\theta}_c),$$

where the mean $\mu_i$ and the remaining parameters of $f_{\mathrm{UPHN}}$ are as defined in (22).
The score function is obtained by differentiating each component of the log-likelihood function, and the Fisher information matrix can be written as a block diagonal matrix of the form:

$$\mathcal{I}(\boldsymbol{\theta}) = \operatorname{diag}\left\{\mathcal{I}(\boldsymbol{\gamma}_0, \boldsymbol{\gamma}_1),\, \mathcal{I}(\boldsymbol{\theta}_c)\right\}.$$

The elements of the matrix $\mathcal{I}(\boldsymbol{\gamma}_0, \boldsymbol{\gamma}_1)$ are as given in the inflated PHN model, while the elements of the matrix $\mathcal{I}(\boldsymbol{\theta}_c)$ are as given in the information matrix of the UPHN regression model.
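As a sketch of the separate maximization of the two log-likelihood components, the following fits the discrete and continuous blocks independently with scipy.optimize; a Beta density stands in for the UPHN density, and the simulated data and parameter names are illustrative:

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(1)
n = 400
y = stats.beta(2, 5).rvs(size=n, random_state=rng)   # stand-in continuous part
y[rng.uniform(size=n) < 0.15] = 0.0                  # zero inflation
y[rng.uniform(size=n) < 0.05] = 1.0                  # one inflation
z0, z1 = (y == 0).astype(float), (y == 1).astype(float)

def negloglik_discrete(gamma):
    """-ell_1: intercept-only multinomial logit for the point masses."""
    denom = 1.0 + np.exp(gamma[0]) + np.exp(gamma[1])
    p0, p1 = np.exp(gamma[0]) / denom, np.exp(gamma[1]) / denom
    return -np.sum(z0 * np.log(p0) + z1 * np.log(p1)
                   + (1.0 - z0 - z1) * np.log(1.0 - p0 - p1))

def negloglik_continuous(par):
    """-ell_2: stand-in Beta likelihood on the strictly interior observations."""
    a, b = np.exp(par)                               # keep both parameters positive
    yc = y[(y > 0.0) & (y < 1.0)]
    return -np.sum(stats.beta(a, b).logpdf(yc))

fit1 = optimize.minimize(negloglik_discrete, x0=np.zeros(2), method="BFGS")
fit2 = optimize.minimize(negloglik_continuous, x0=np.zeros(2), method="BFGS")
print(fit1.x, np.exp(fit2.x))   # logit-scale mass parameters and Beta shape estimates
```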
3.4. Generalized Two-Part PHN Model
Cragg's two-part model [39] encounters the issue that some censored points may be values at the boundary of the censoring limit. This is particularly problematic for a distribution $f(\cdot)$ within the unit interval $(0,1)$, where a zero or a one could either be a realization from the point mass distribution or a partial observation of the continuous variable whose value is not precisely known but is close to $T_0$ or $T_1$, for pre-specified constants $T_0$ (near zero) and $T_1$ (near one). In practice, the values $T_0$ and $T_1$ are, in some cases, defined as those below or above which the instruments cannot record measurements, respectively, and, consequently, are treated as censoring values. In other cases, these observational limits are defined for ethical or practical reasons. For example, in clinical studies, it may be unethical to continue observing a patient under certain conditions, or the costs of prolonged observation may become prohibitive.
To address this issue in the two-part model, Moulton and Halsey [40] propose a new approach to fitting the mixture of continuous and discrete random variables. This approach allows for the possibility that some limiting responses result from interval censoring of the continuous variable. The model proposed by Moulton and Halsey [40] for left censoring at a point $a$ is given by:

$$g(y) = \left[p + (1 - p)\, F(T)\right]^{z}\left[(1 - p)\, f(y)\right]^{1 - z},$$

where $F$ is the CDF associated with $f$, $z$ is the indicator of a limiting (censored) response, and $T$ is a pre-established constant within the interval of support below which limiting responses are considered censored. Similarly to how we generalized Cragg's model, Moulton and Halsey's model can also be generalized for left and right censoring, or for two boundary inflation points within the definition interval of the PDF $f(\cdot)$. In our case, for the unit PHN distribution within the interval $(0,1)$, this generalization of Moulton and Halsey's model is given by:

$$g(y) = \left[p_0 + (1 - p_0 - p_1)\, F(T_0)\right]^{z_0}\left[p_1 + (1 - p_0 - p_1)\,\{1 - F(T_1)\}\right]^{z_1}\left[(1 - p_0 - p_1)\, f(y)\right]^{1 - z_0 - z_1}.$$
It can be observed that this distribution is a model with double censoring (at zero and one) and, therefore, allows for the fit of datasets with inflation at zero and one. This represents an alternative to the double-censored Tobit model, where the CDF of the normal distribution does not efficiently fit the probability of the point mass where double censoring occurs, i.e., the probability of the inflation points.
Extending this model to the case of covariates in each part of the model, we again assume that $\mathbf{x}_{0i}$ and $\mathbf{x}_{1i}$ are sets of auxiliary covariates for the discrete part at zero and one, respectively, and that $\mathbf{w}_i$ is a set of covariates for the continuous part in the interval $(0,1)$. Then, denoting by $T_0$ the lower detection limit, below which observations are recorded as zero, and by $T_1$ the upper detection limit, above which observations are recorded as one, the extension of the Moulton and Halsey model to the double-censored PHN case can be expressed through the PDF given by

$$g(y_i) = \left[p_{0i} + (1 - p_{0i} - p_{1i})\, F(T_0)\right]^{z_{0i}}\left[p_{1i} + (1 - p_{0i} - p_{1i})\,\{1 - F(T_1)\}\right]^{z_{1i}}\left[(1 - p_{0i} - p_{1i})\, f(y_i)\right]^{1 - z_{0i} - z_{1i}},$$

where $p_{0i}$ and $p_{1i}$ are the probability masses at the points zero and one, while $z_{0i}$ and $z_{1i}$ are as defined above; $\operatorname{logit}(\mu_i) = \mathbf{w}_i^{\top}\boldsymbol{\beta}$, where $\boldsymbol{\beta}$ is the set of coefficients associated with the covariate vector $\mathbf{w}_i$.
The CDF of this model is represented by

$$G(y_i) = \begin{cases} p_{0i} + (1 - p_{0i} - p_{1i})\, F(T_0), & y_i = 0, \\ p_{0i} + (1 - p_{0i} - p_{1i})\, F(y_i), & 0 < y_i < 1, \\ 1, & y_i = 1. \end{cases}$$

To model the responses at the point masses zero and one, a multinomial logistic model with a logit link function is again used, where $\boldsymbol{\gamma}_0$ and $\boldsymbol{\gamma}_1$ are the vectors of coefficients associated with the sets of covariates $\mathbf{x}_{0i}$ and $\mathbf{x}_{1i}$.
The log-likelihood function for the estimation of the parameter vector $\boldsymbol{\theta}$, conditionally on the observed covariates, is given by:

$$\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n}\Big\{ z_{0i}\log\left[p_{0i} + (1 - p_{0i} - p_{1i})\, F(T_0)\right] + z_{1i}\log\left[p_{1i} + (1 - p_{0i} - p_{1i})\,\{1 - F(T_1)\}\right] + (1 - z_{0i} - z_{1i})\log\left[(1 - p_{0i} - p_{1i})\, f(y_i)\right]\Big\}.$$

The score equations are obtained by taking the first derivatives with respect to the model parameters, while the information matrix is obtained by proceeding as in the models studied previously. Models with inflation only at zero or only at one can be studied by taking $p_{1i} = 0$ or $p_{0i} = 0$, respectively.
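A sketch of this doubly censored log-likelihood under the structure given above, with detection limits $T_0$ and $T_1$; a normal density is used here purely as a stand-in for the PHN form, and all names are illustrative:

```python
import numpy as np
from scipy import stats

def negloglik_doubly_censored(params, y, T0=0.01, T1=0.99):
    """Negative log-likelihood of a Moulton-Halsey-type doubly censored mixture.
    params = (logit-scale mass at 0, logit-scale mass at 1, mu, log sigma)."""
    g0, g1, mu, log_s = params
    denom = 1.0 + np.exp(g0) + np.exp(g1)
    p0, p1 = np.exp(g0) / denom, np.exp(g1) / denom
    pc = 1.0 - p0 - p1                       # weight of the continuous component
    dist = stats.norm(mu, np.exp(log_s))     # stand-in continuous distribution
    z0, z1 = (y <= 0.0), (y >= 1.0)
    interior = ~z0 & ~z1
    ll = (np.sum(z0) * np.log(p0 + pc * dist.cdf(T0))
          + np.sum(z1) * np.log(p1 + pc * (1.0 - dist.cdf(T1)))
          + np.sum(np.log(pc * dist.pdf(y[interior]))))
    return -ll

y = np.array([0.0, 0.0, 0.12, 0.45, 0.80, 1.0, 0.33, 0.27])
print(negloglik_doubly_censored(np.array([-1.0, -2.0, 0.4, np.log(0.25)]), y))
```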
4. Empirical Applications
In this section, we illustrate the application of the proposed models and compare them with other models using real data. We show that the proposed models can be a valid alternative to some existing regression models in the statistical literature.
4.1. Application 1: Case Study on Students’ Dropout Data
Student dropout is a major problem many Latin American countries face. In some universities in Colombia, this phenomenon can lead to more than 50% of students who enroll in a university program abandoning their higher education studies. This phenomenon has its greatest impact in the first four semesters of undergraduate studies, which is why it is important to determine the main causes leading to this abandonment of higher education.
This application refers to student dropout in the Faculty of Veterinary Medicine and Zootechnics (MVZ, by its acronym in Spanish) at the University of Córdoba, Colombia. The analyzed information corresponds to a sample of students who dropped out during one of the first four semesters (early dropout) of the programs in the MVZ Faculty at the University of Córdoba. The data correspond to variables from the SPADIES System of the Ministry of National Education (MEN by its acronym in Spanish) and the university itself.
The response variable y corresponds to the proportion of subjects passed up to the point of dropout. The explanatory variables considered were: Saber 11 test score (an exam taken at the end of secondary education); age at the time of taking the Saber 11 test; an indicator of whether the student received financial support (yes, no); mother's educational level (coded 1 if professional and 0 otherwise); number of siblings; socioeconomic status of the student (coded 1 if from strata 1, 2, or 3, referred to as low, and 0 otherwise); and student's gender (coded 1 if male and 0 otherwise).
The zero-one inflated PHN, UPHN, and doubly censored PHN (DCPHN) models were fitted, since some students drop out in the first semester without passing any subjects, while others drop out within the first four semesters even after passing all enrolled subjects.
The results obtained with the models studied in this article show that, in all models, the significant variables for the continuous part were the Saber 11 test score, age at the time of taking the Saber 11 test, and number of siblings. Similarly, the censored part at zero is not explained by any variable in any of the three models, while the censored part at one showed significance for variables such as age at the time of taking the Saber 11 test (directly related to the age of university entry) and number of siblings.
Table 1 shows the results of the best-fitted model for each of the considered models. To determine which model presents better performance, we used the AIC criterion [41] and the corrected AIC (AICc) [42]. These criteria are defined as:

$$\mathrm{AIC} = -2\,\ell(\hat{\boldsymbol{\theta}}) + 2p \quad \text{and} \quad \mathrm{AICc} = \mathrm{AIC} + \frac{2p(p+1)}{n - p - 1},$$

where $p$ is the number of parameters of the model in question and $n$ is the sample size.
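These criteria can be computed directly from the maximized log-likelihood, as in the short sketch below (the numerical values are illustrative only):

```python
def aic_aicc(loglik, p, n):
    """AIC and corrected AIC from a maximized log-likelihood,
    p parameters and n observations."""
    aic = -2.0 * loglik + 2.0 * p
    aicc = aic + 2.0 * p * (p + 1.0) / (n - p - 1.0)
    return aic, aicc

print(aic_aicc(loglik=152.3, p=7, n=110))
```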
The MLEs, with standard errors in parentheses, are given in Table 1. According to the AIC and AICc criteria, the model that best fits the student dropout data is the UPHN, followed by the DCPHN model.
To identify outliers and/or model misspecification, we examined the transformation of the martingale residual, $r_{MT_i}$, as proposed by Barros et al. [43]. These residuals are defined by

$$r_{MT_i} = \operatorname{sgn}(r_{M_i})\sqrt{-2\left[r_{M_i} + \delta_i\log(\delta_i - r_{M_i})\right]},$$

where $r_{M_i} = \delta_i + \log S(y_i; \hat{\boldsymbol{\theta}})$ is the martingale residual proposed by Ortega et al. [44], $\delta_i = 0, 1$ indicates whether the $i$th observation is censored or not, respectively, $\operatorname{sgn}(r_{M_i})$ denotes the sign of $r_{M_i}$, and $S(y_i; \hat{\boldsymbol{\theta}})$ represents the survival function evaluated at $y_i$, where $\hat{\boldsymbol{\theta}}$ is the MLE of $\boldsymbol{\theta}$.
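A sketch of the residual computation described above; the censoring indicators and fitted survival values below are illustrative stand-ins for those of a fitted model:

```python
import numpy as np

def martingale_type_residual(delta, surv):
    """Transformed martingale residuals r_MT from censoring indicators delta
    and fitted survival probabilities S(y_i; theta_hat)."""
    r_m = delta + np.log(surv)                          # martingale residual
    inner = r_m + delta * np.log(np.clip(delta - r_m, 1e-12, None))
    return np.sign(r_m) * np.sqrt(-2.0 * inner)

delta = np.array([1, 1, 0, 1, 0], dtype=float)          # 1 = observed, 0 = censored
surv = np.array([0.62, 0.15, 0.80, 0.40, 0.55])         # illustrative fitted survival
print(martingale_type_residual(delta, surv))
```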
The plots of $r_{MT_i}$ with confidence envelopes generated for the PHN, UPHN, and DCPHN models, shown in Figure 1 and Figure 2, indicate that the fitted PHN, UPHN, and DCPHN regression models with a logit link function exhibit a good fit.
4.2. Application 2: Case Study on Periodontal Disease Data
The data motivating this second application come from a clinical study in which the clinical attachment level (CAL), a key marker of periodontal disease (PD), was measured at six sites on each tooth of a subject. The primary statistical question is to estimate functions that model the relationship between the "proportion of diseased sites associated with a specific tooth type (incisors, canines, premolars, and first molars)" and the covariates described below. The full dataset was previously analyzed by Galvis et al. [45] and includes information from 290 individuals. The response variable in this study is the proportion of diseased sites for the premolars (denoted as Y), with auxiliary covariates being gender, age, glycosylated hemoglobin, and smoking status.
The dataset exhibits significant inflation at zero, but for certain subjects we also observe responses equal to one. To account for this, we applied the zero-one inflated beta (BIZU), zero-one inflated truncated log-normal (LNIZU), doubly censored proportional hazard normal (DCPHN), and zero-one inflated UPHN (UPHNIZU) regression models. Our analysis revealed that only two of the covariates were statistically significant; for the DCPHN model, only one covariate was significant for both discrete components.
We used several information criteria to compare the various models, including the AIC and the AICc. We also used the Bayesian information criterion (BIC) and the Hannan-Quinn criterion (HQC), defined as follows:

$$\mathrm{BIC} = -2\,\ell(\hat{\boldsymbol{\theta}}) + p\log(n) \quad \text{and} \quad \mathrm{HQC} = -2\,\ell(\hat{\boldsymbol{\theta}}) + 2p\log\left(\log(n)\right),$$

where $p$ is the number of parameters of the model in question and $n$ is the sample size.
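These two criteria can be computed analogously to the AIC and AICc; a short sketch with illustrative values:

```python
import numpy as np

def bic_hqc(loglik, p, n):
    """BIC and Hannan-Quinn criterion from a maximized log-likelihood."""
    bic = -2.0 * loglik + p * np.log(n)
    hqc = -2.0 * loglik + 2.0 * p * np.log(np.log(n))
    return bic, hqc

print(bic_hqc(loglik=210.8, p=8, n=290))
```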
The MLEs, with standard errors in parentheses, are given in Table 2.
In Figure 3, Figure 4, Figure 5 and Figure 6, it can be observed that the best fits correspond to the BIZU and UPHNIZU models. Additionally, note that under three of the criteria the UPHNIZU model performs better than the BIZU model, while for the fourth criterion (BIC) no significant difference is found between the two models. It is important to consider that the BIZU model has one fewer parameter, which further supports the superior fit of the UPHNIZU model. This allows us to conclude that the UPHNIZU model is a promising new alternative for modeling responses within the unit interval with zero-one inflation.
We also generated standardized residual plots to identify the presence of outliers when fitting the UPHNIZU model, and we present the cumulative distribution function (CDF) plot of the fitted UPHN model (Figure 5). From these, the model shows a good fit, and no outliers are detected. In addition, envelope plots were obtained for the fitted BIZU, LNIZU, and DCPHN models, which are presented in Figure 3 and Figure 4. These plots show that the BIZU and LNIZU models exhibit a better fit than the DCPHN model.
5. Discussion
In this article, we introduced a broad class of skew regression models designed for response variables that lie within the unit interval, which may exhibit an excess of zeros or ones. These models were derived from a continuous-discrete mixture distribution that incorporates covariates in both its discrete and continuous components. As evidenced by applications using real data, the models we propose serve as a viable alternative for modeling rates and proportions that are inflated at either zero or one.
5.1. Major Results and Implications
Our findings demonstrate that the UPHNIZU model consistently surpassed the other models in terms of AIC, AICc, BIC, and HQC values. These models delivered a superior fit for the data from the case study on student dropout and from the clinical study on periodontal disease, where the response variable was the proportion of diseased tooth sites.
Our findings also demonstrate that UPHNIZU models generate a non-singular information matrix, allowing valid statistical inferences and outperforming other asymmetric models, such as those derived from the skew-normal or the beta distribution. Empirical results show the models' effectiveness in analyzing proportional data with zero and one inflation, highlighting their robustness and practicality in research fields such as biomedicine, economics, and engineering. In addition, we presented parameter estimation by maximum likelihood and discussed applications to student dropout and periodontal disease data. UPHNIZU models are a promising alternative for analyzing bounded data with extreme inflation, providing a robust and flexible tool to capture the complex characteristics of such data. This work also emphasizes the importance of innovations in probability distributions and their application in modeling complex phenomena, offering an advanced solution to the challenges of modeling proportional data with zero and one inflation.
5.2. Model Limitations
Although the results are encouraging, our study has several limitations. First, the models’ complexity and reliance on iterative numerical methods for parameter estimation can lead to high computational demands. Second, while the models showed strong performance with the datasets utilized in this research, additional validation on different types of data is required to ensure their applicability in broader contexts.
5.3. Prospects for Further Investigation
Future research may explore several avenues, including the creation of more efficient algorithms to lessen the computational demands of fitting these models. Furthermore, applying these models in fields like economics or environmental studies could offer additional validation and reveal new applications.
Given the importance of model performance in our analysis, while the methods employed—such as AIC, AICc, BIC, HQC, and martingale residuals—are effective for evaluating model adequacy, there is room for improvement. Future research could investigate additional goodness-of-fit tests specifically designed for bounded and inflated data, which could offer a more thorough evaluation of model performance and robustness. Additionally, exploring Bayesian inference methods for unit interval data with inflation could provide valuable insights and enhance the analytical framework.
An intriguing avenue for future research involves adapting these models to accommodate longitudinal or hierarchical data structures. This would require methods to manage correlations within subjects or groups, often present in practical datasets. Additionally, examining the robustness of these models in various misspecification scenarios could lead to more resilient modeling strategies.
6. Conclusions
Analyzing proportion data, particularly when values are inflated at zero and one, presents significant challenges across various scientific disciplines. Conventional models, such as beta and Tobit regression models, frequently fail to accurately capture the complexities associated with such data. This underscores the need for more sophisticated modeling techniques capable of addressing the unique distributional characteristics of zero-one inflation.
This work tackled these challenges by introducing the proportional hazard normal zero-one inflated models. These models incorporate a continuous-discrete mixture distribution with covariates in both components, offering an advanced framework for analyzing proportion data with specific inflation points. Consequently, the proportional hazard normal zero-one inflated models provide a robust and flexible method for capturing asymmetrically distributed data and mixed discrete-continuous characteristics, prevalent in fields such as medicine, sociology, humanities, and economics.
Our applications, which pertain to two case studies on student dropout and periodontal data, demonstrated that the proportional hazard normal zero-one inflated models with the logit link function are an excellent alternative to traditional models. The transformation of martingale residuals and the generation of simulated envelopes further validated the robustness of our models, underscoring their effectiveness in identifying model misfits and outliers. The proposed models address a critical gap in statistical modeling, providing valuable insights and reliable estimators for handling bounded and inflated data. The flexibility and robustness of the proportional hazard normal zero-one inflated models make them a viable alternative for describing proportion data that are inflated at zero or one.
In conclusion, the proportional hazard normal zero-one inflated models represent a substantial advancement in statistical modeling techniques for proportion data exhibiting zero-one inflation. These models provide a robust and adaptable framework for analyzing such data, yielding deeper insights and more reliable estimators.