1. Introduction
In many fields, including sociology, psychology, epidemiology and clinical trials, experimental results or measurements are reported as ratios, scores, proportions or percentages. Such data are continuous and take values within the unit interval (0, 1); thus, models tailored to this limited range are worthwhile. Researchers have developed different strategies for modeling such data. First, the beta distribution and beta regression models have been studied extensively by many authors, including [1, 2, 3]. Kieschnick and McCullough [4] summarized and compared different regression models for proportional data in the open interval. Next, the simplex distribution investigated by Zhang and Qiu [5] can also be used to model such continuous proportional data; they further pointed out that the simplex regression model is more robust than the beta model. By mimicking the construction of beta distributions from gamma variates, Lijoi et al. [6] proposed the so-called normalized inverse Gaussian (IG) distribution, obtained by substituting IG variates for the gamma variates, as a new tool for modeling univariate proportional data. Later, Liu et al. [7] renamed it the proportional inverse Gaussian (PIG) distribution and set up regression models. As data become more diverse and higher-dimensional, the univariate continuous proportional models need to be generalized to multivariate cases. Wang and Tu [8] considered semiparametric tests for multigroup proportional data in a closed interval.
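The gamma-ratio construction of the beta distribution, and its inverse Gaussian analogue mentioned above, can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in [6] or [7]: the function names, the (mu, lam) mean/shape parameterization of the IG distribution, the numerical parameter values, and the choice of the Michael, Schucany and Haas transformation sampler are all assumptions introduced here, and the parameterization used in the cited works may differ.

```python
import math
import random


def sample_ig(mu, lam, rng):
    """Draw one inverse Gaussian IG(mu, lam) variate (mean mu, shape lam).

    Uses the transformation method of Michael, Schucany and Haas (1976).
    """
    y = rng.gauss(0.0, 1.0) ** 2
    x = mu + mu * mu * y / (2.0 * lam) - (mu / (2.0 * lam)) * math.sqrt(
        4.0 * mu * lam * y + mu * mu * y * y
    )
    # Keep the smaller root with probability mu / (mu + x), else use mu^2 / x.
    if rng.random() <= mu / (mu + x):
        return x
    return mu * mu / x


def beta_from_gammas(a, b, rng):
    """X = G1 / (G1 + G2) with G1 ~ Gamma(a, 1), G2 ~ Gamma(b, 1) is Beta(a, b)."""
    g1 = rng.gammavariate(a, 1.0)
    g2 = rng.gammavariate(b, 1.0)
    return g1 / (g1 + g2)


def proportion_from_igs(mu1, mu2, lam, rng):
    """Same ratio with IG variates replacing the gamma variates.

    The shared shape parameter lam is an assumption made for illustration.
    """
    y1 = sample_ig(mu1, lam, rng)
    y2 = sample_ig(mu2, lam, rng)
    return y1 / (y1 + y2)


rng = random.Random(0)
beta_draws = [beta_from_gammas(2.0, 3.0, rng) for _ in range(20000)]
ig_draws = [proportion_from_igs(1.0, 2.0, 1.5, rng) for _ in range(20000)]

# The gamma-ratio mean should be close to a / (a + b) = 0.4.
print("mean of gamma-ratio draws:", sum(beta_draws) / len(beta_draws))
print("mean of IG-ratio draws:   ", sum(ig_draws) / len(ig_draws))
```

Both ratios lie strictly inside the unit interval by construction, which is what makes either family a candidate for univariate proportional data.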
From the perspective of data structure, multi-dimensional data confined to unit intervals can be divided into compositional data and multivariate proportional data according to their domains. For compositional data, which appear in fields such as biology, medicine and economics, the components of each observation sum to one; such data are also known as structural relative numbers, reflecting the composition of an object. Thus, the corresponding models for compositional data are defined on the open hyperplane determined by this unit-sum constraint. The constraint induces certain negative correlations between any two components of compositional data. One well-known distribution is the Dirichlet distribution, which can be regarded as a generalization of the beta distribution to more than two components. It was first used to fit two compositional biological data sets in [9]. Campbell and Mosimann [10] considered a Dirichlet regression model that links the parameters to a set of covariates via a polynomial function, and such models, with applications to the analysis of psychiatric data, are investigated in [11]. Incidentally, the beta distribution can be regarded as a two-dimensional Dirichlet distribution, and a beta variate X and its complement 1 − X are also negatively correlated. Other research on related models can be found in the recent literature [12, 13].
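The negative correlation induced by the unit-sum constraint can be checked numerically. The sketch below draws Dirichlet vectors by normalizing independent gamma variates and compares the sample correlation of two components with the standard closed form, corr(X_i, X_j) = -sqrt(a_i a_j / ((a_0 - a_i)(a_0 - a_j))), where a_0 is the sum of the parameters. The helper names and the parameter values are illustrative assumptions and are not taken from the paper.

```python
import math
import random


def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet(alphas) vector by normalizing independent gamma variates."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]


def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def dirichlet_corr(alphas, i, j):
    """Theoretical correlation between components i and j of a Dirichlet vector."""
    a0 = sum(alphas)
    return -math.sqrt(alphas[i] * alphas[j] / ((a0 - alphas[i]) * (a0 - alphas[j])))


alphas = [2.0, 3.0, 4.0]  # illustrative values, not taken from the paper
rng = random.Random(0)
draws = [sample_dirichlet(alphas, rng) for _ in range(20000)]

x1 = [d[0] for d in draws]
x2 = [d[1] for d in draws]
empirical = pearson(x1, x2)
theoretical = dirichlet_corr(alphas, 0, 1)
print("empirical corr:", empirical, "theoretical corr:", theoretical)
```

With only two components the closed form gives a correlation of exactly -1, which matches the remark that a beta variate and its complement are negatively correlated.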
For multivariate proportional data, each component takes values between 0 and 1, with no direct constraint among components. The corresponding models for this type of data are defined on the unit cube, without any such restriction. There are many ways to construct appropriate models, such as beta distributions linked by copula functions. Cepeda-Cuervo et al. [14] defined a bivariate beta regression model from copulas and considered a Bayesian approach, in which the correlation can be positive or negative. Petterle et al. [15] proposed a multivariate generalized linear mixed model for continuous bounded variables in the unit interval. Sun et al. [16] proposed a linear stochastic representation (SR) to construct multivariate positively correlated continuous models based on the IG and gamma distributions, named the multivariate PIG and proportional gamma (PGA) distributions, respectively; these can only fit positively correlated continuous proportional data.
The schizophrenia cortical thickness data used in [16] show high correlations and compensation behaviors related to disease severity among different brain regions. Further, we find that a large number of negatively covarying region pairs may occur in patients if the compensatory changes are reduced. This indicates that observing negatively correlated regions in cortical thickness is of great significance for the study of schizophrenia and its prognosis. Motivated by the construction technique used for the multivariate PIG and PGA distributions, we propose models that capture the negative correlation among components of multivariate proportional data. To the best of our knowledge, work considering negative correlation in multivariate proportional data is quite scarce. Here, we focus on the bivariate case; the proposed models are expected to provide efficient tools for modeling negatively correlated proportional data.
By combining the construction of multivariate PIG/PGA distributions and the negative correlation structure in beta/Dirichlet distributions, we define a new random vector
via the following SR:
where
are independent random variables with the same support
, and each
can follow the same continuous distribution family, with possibly different parameters. In the following, for each
(
), we apply the IG and gamma distributions to construct the bivariate negatively correlated PIG (NPIG) and negatively correlated proportional gamma (NPGA) distributions.
The rest of the paper is organized as follows. In Section 2 and Section 3, the bivariate NPIG and NPGA distributions are proposed, respectively, and related distributional properties (e.g., moments, joint densities) are provided. Moreover, normalized expectation–maximization (N-EM) algorithms, facilitated by one-step gradient descent, are established for calculating the maximum likelihood (ML) estimates of the parameters of interest. In Section 4, simulations for the proposed methods are performed. A data set on the cortical thickness of schizophrenia patients is used to illustrate the proposed methods in Section 5. Finally, a discussion is provided in Section 6. Some technical details are given in Appendix A and Appendix B, and others are shown in the Supplementary Material.
6. Conclusions, Limitations, and Future Research
In this paper, we proposed models for bivariate negatively correlated continuous proportional data, to our knowledge for the first time. Based on the equal-dispersed IG distribution and the gamma distribution with a single parameter, we developed the bivariate NPIG and NPGA distributions. Models with covariates were also considered by formulating mean regression models based on the two new distributions. Moreover, we provided efficient methods for parameter estimation in each of the four models. The N-EM algorithm, aided by gradient descent algorithms based on Jensen’s inequality, is used to overcome the difficulties in calculating the ML estimates of the parameters. Readers interested in these algorithms are referred to [21, 22]. In Section 5, we used two different criteria to evaluate the models. We studied the negatively correlated pairs, which increase in number as compensation behaviors decrease, and the information obtained is consistent with our previous findings on the same dataset [23]. Based on the results, we also proposed hypotheses about their causes, which need further medical exploration. From our analysis of cortical thickness in schizophrenic patients and the control group, we verified the compensatory nature of cortical thickness in schizophrenic patients and found that it was negatively correlated with age. Readers who wish to use the original data and R code of this article for their research may contact the corresponding author by email. In addition, use of the original data must be agreed with the data collection team.
There are other topics worthy of further research beyond this paper. We only considered mean regression models for the proposed distributions and did not consider mode regression, as the modes have no closed forms. The same applies to quantile regression. To better interpret the data, we hope to explore mode regression models and have already constructed a new model with an explicit expression for the mode. Its construction is similar to (1). Moreover, it is difficult to build models that allow arbitrary positive or negative correlations using linear constructions such as SR (1). We are considering replacing the independent variables in (1) with a bivariate correlated vector, which would make the correlation structure between components under construction (1) more flexible. The copula method may also be a feasible approach, or mixture models could be considered by combining PIG with NPIG and PGA with NPGA. Finally, exact tests in the bivariate NPIG and NPGA models for one sample and for multiple samples are also of interest, as they would allow us to assess the significance of differences.