Abstract
A path analysis method for causal systems based on generalized linear models is proposed by using entropy. A practical example is introduced, and a brief explanation of the entropy coefficient of determination is given. Direct and indirect effects of explanatory variables are discussed as log odds ratios, i.e., relative information, and a method for summarizing the effects is proposed. The example dataset is re-analyzed by using the method.
1. Introduction
Path analysis [1] is often applied to causal systems of continuous variables through the linear structural equations model (LISREL) [2,3]. In the LISREL approach, causal relationships among variables are described by a path diagram and translated into linear equations of the variables. Causal effects can then be calculated by regression and correlation coefficients obtained for the linear equations. In contrast, path analysis of categorical variables is more complex than that of continuous variables because the causal system under consideration cannot be described by linear regression equations. Goodman [4–7] considered path analysis of binary variables by using logit models and discussed the effects of explanatory variables, though without discussing direct and indirect effects. Hagenaars [8] discussed path analysis of categorical variables by using a log-linear model, but here as well without discussion of direct and indirect effects. Eshima et al. [9] proposed a path analysis method for categorical variables in logit models. Kuha and Goldthorpe [10] gave a two-stage path analysis method for generalized linear models (GLMs) that uses log odds ratios. In their approach, first the total, direct and indirect effects are defined for mean differences of response variables, and then the method is applied to measuring the effects on the basis of log odds ratios. However, additive decomposition of the total effect into the direct and indirect effects only approximately reflects reality, and assessing effects in categorical (polytomous) variable systems becomes more complicated as the number of variable categories increases [10]. Albert and Nelson [11] proposed a path analysis method to calculate pathway effects for causal systems on the basis of GLMs, but not all pathway effects are identifiable.
As in the two-stage cases, when factors, intermediate variables, and response variables are categorical, pathway effects become very complicated because the variable effects are defined for mean differences of response variables.
In many fields of study, response variables in practical data are not normally distributed. There is a need for a method of path analysis with responses that are not normally distributed, especially categorical responses, and it is useful to discuss a path analysis approach for causal systems of GLMs [12,13]. When describing causal systems of the variables by GLMs, regression parameters or coefficients are related to log odds ratios [14–16], and so it is natural to consider the effects of factors (explanatory variables) according to odds or log odds ratios. However, results become more complicated as the number of categories of the variables increases. In such cases, it is suitable to summarize the effects of factors in GLMs. For this purpose, we use the entropy coefficient of determination (ECD), one of the entropy-based measures of predictive power for GLMs [15,16].
The remainder of this paper is organized as follows: Section 2 presents a practical example of a causal system, the British mobility data [10]. Section 3 considers the relation between the log odds ratio and entropy, and ECD is briefly reviewed. Section 4 introduces a path analysis method for causal systems described by GLMs, and in Section 5 a method for testing effects of variables is given. The British mobility data are re-analyzed by the proposed approach in Section 6. Finally, Section 7 provides some discussion and conclusions for the present approach.
2. Practical Example
British mobility data describe the effects of education on social class mobility [10]. There are three variables, which are causally ordered as shown in Figure 1: parents’ social class, X; individual social class, Y; and education, Z, which intermediates between X and Y. The three variables are discrete. Social classes X and Y have three categories, “salariat and employers”, “middle class”, and “working class”; education Z has seven levels. While the effects of X and Z on Y can be discussed through log odds ratios, the results are complicated because the number of variable categories is large. It is important to summarize causal effects measured with log odds ratios, especially in such practical examples, to assess the intermediate effect of education on social class mobility.
Figure 1.
Path diagram of social class mobility.
3. Log Odds Ratio and Information
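Let $\mathbf{X}$ and $Y$ be a $p \times 1$ explanatory-variable vector and a response variable, respectively, and let $f(y|\mathbf{x})$ be the conditional probability or probability density function of $Y$ given that $\mathbf{X} = \mathbf{x}$. The function $f(y|\mathbf{x})$ is assumed to belong to the following family of exponential distributions:

$$f(y|\mathbf{x}) = \exp\left\{\frac{y\theta - b(\theta)}{a(\varphi)} + c(y,\varphi)\right\}, \qquad (1)$$

where θ and φ are parameters, and $a(\varphi)$ (>0), $b(\theta)$, and $c(y,\varphi)$ are known functions. Let $\eta = \boldsymbol{\beta}^{T}\mathbf{x}$ be the systematic component with regression coefficient vector $\boldsymbol{\beta}$. The function θ is a function of $\eta$ through a link function.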
Remark 1. In general, the systematic component can be extended to be a function of explanatory variable vector x. Then, the model is referred to as a generalized nonlinear model. For the sake of simplicity, the function is denoted by. The discussion below is applicable to this case.
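Let $\boldsymbol{\mu}_X$ and $\mu_Y$ be the means of $\mathbf{X}$ and $Y$, respectively. Then, let us consider the following log odds ratio:

$$\log\frac{f(y|\mathbf{x})\,f(\mu_Y|\boldsymbol{\mu}_X)}{f(\mu_Y|\mathbf{x})\,f(y|\boldsymbol{\mu}_X)} = \log\frac{f(y|\mathbf{x})}{f(\mu_Y|\mathbf{x})} - \log\frac{f(y|\boldsymbol{\mu}_X)}{f(\mu_Y|\boldsymbol{\mu}_X)}.$$

The first and second terms of the right hand side of the above equation are the relative information with respect to response variable $Y$, so the log odds ratio is the change of the relative information in explanatory variable vector $\mathbf{X}$. In GLMs, taking the expectation of the above log odds ratio with respect to $\mathbf{X}$ and $Y$, it is reduced to $\mathrm{Cov}(\theta, Y)/a(\varphi)$. The quantity $\mathrm{Cov}(\theta, Y)/a(\varphi)$ can be expressed as a symmetric type of the Kullback–Leibler (KL) information between a GLM based on (1) and the null model with $\boldsymbol{\beta} = \mathbf{0}$ [15]; thus, we denote $\mathrm{Cov}(\theta, Y)/a(\varphi)$ by $\mathrm{KL}(\mathbf{X}, Y)$ in this paper. Let $f(y)$ be the (marginal) density or probability function of $Y$, corresponding to the null model $\boldsymbol{\beta} = \mathbf{0}$, and let $g(\mathbf{x})$ be that of $\mathbf{X}$. Then:

$$\mathrm{KL}(\mathbf{X}, Y) = \iint f(y|\mathbf{x})\,g(\mathbf{x}) \log\frac{f(y|\mathbf{x})}{f(y)}\,d\mathbf{x}\,dy + \iint f(y)\,g(\mathbf{x}) \log\frac{f(y)}{f(y|\mathbf{x})}\,d\mathbf{x}\,dy = \frac{\mathrm{Cov}(\theta, Y)}{a(\varphi)}.$$

If the variables in the above are discrete, the related integrals are replaced by summations. The ECD is then defined as:

$$\mathrm{ECD}(\mathbf{X}, Y) = \frac{\mathrm{Cov}(\theta, Y)}{\mathrm{Cov}(\theta, Y) + a(\varphi)}.$$

Then, ECD can also be expressed as:

$$\mathrm{ECD}(\mathbf{X}, Y) = \frac{\mathrm{KL}(\mathbf{X}, Y)}{\mathrm{KL}(\mathbf{X}, Y) + 1}.$$

The measure is interpreted as the proportion of the variation in entropy of $Y$ that is explained by $\mathbf{X}$ [15,16]. As shown above, GLMs explain the entropy of response variables, so it is suitable to measure the effects of explanatory variables based on entropy.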
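For discrete variables, the KL information and ECD discussed above can be illustrated numerically. The following is a minimal sketch, assuming that the KL information is the symmetric KL information between the joint distribution of the explanatory variable and the response and the product of their marginals, and that ECD equals KL/(KL + 1). The function names `symmetric_kl` and `ecd` and the 2 × 2 table are hypothetical and are not taken from the data analyzed in this paper.

```python
import numpy as np

def symmetric_kl(joint):
    """Symmetric KL information between a joint table p(x, y) and p(x)p(y).

    Rows index the explanatory variable x, columns the response y.
    All cells must be strictly positive so that both KL terms are finite.
    """
    joint = np.asarray(joint, dtype=float)
    if np.any(joint <= 0):
        raise ValueError("all cells of the joint table must be positive")
    joint = joint / joint.sum()
    indep = joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)
    # KL(joint || indep) + KL(indep || joint) collapses to a single sum.
    return float(np.sum((joint - indep) * np.log(joint / indep)))

def ecd(joint):
    """Entropy coefficient of determination, assumed here to be KL / (KL + 1)."""
    kl = symmetric_kl(joint)
    return kl / (kl + 1.0)

# Hypothetical binary example:
# P(X=0) = P(X=1) = 0.5, P(Y=1 | X=0) = 0.2, P(Y=1 | X=1) = 0.8.
joint = np.array([[0.4, 0.1],   # x = 0: (y = 0, y = 1)
                  [0.1, 0.4]])  # x = 1: (y = 0, y = 1)
kl_value = symmetric_kl(joint)
ecd_value = ecd(joint)

# Cross-check against Cov(theta, Y) with theta = logit P(Y=1 | x) and a(phi) = 1.
px = np.array([0.5, 0.5])
p1 = np.array([0.2, 0.8])
theta = np.log(p1 / (1.0 - p1))
cov_theta_y = np.sum(px * theta * p1) - np.sum(px * theta) * np.sum(px * p1)
```

In this hypothetical logit example the symmetric KL information and Cov(θ, Y) agree (both are approximately 0.416), which gives an ECD of approximately 0.294.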
Remark 2. When applied to the ordinary linear regression model, ECD coincides with the usual coefficient of determination.
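Remark 2 can be checked with a short worked derivation. The sketch below assumes a normal linear regression model with error variance $\sigma^2$, and assumes the definition of ECD as $\mathrm{Cov}(\theta, Y)/(\mathrm{Cov}(\theta, Y) + a(\varphi))$ from [15,16]; $R^2$ denotes the population coefficient of determination.

```latex
% Assumed model: Y | x ~ N(beta^T x, sigma^2), written in exponential-family form.
f(y|\mathbf{x}) = \exp\left\{ \frac{y\theta - \theta^{2}/2}{\sigma^{2}}
   - \frac{y^{2}}{2\sigma^{2}} - \frac{1}{2}\log(2\pi\sigma^{2}) \right\},
\qquad \theta = \boldsymbol{\beta}^{T}\mathbf{x}, \quad a(\varphi) = \sigma^{2}.

% Since Y = theta + error with the error uncorrelated with X:
\mathrm{Cov}(\theta, Y) = \mathrm{Var}(\boldsymbol{\beta}^{T}\mathbf{X}),
\qquad \mathrm{Var}(Y) = \mathrm{Var}(\boldsymbol{\beta}^{T}\mathbf{X}) + \sigma^{2}.

% Hence
\mathrm{ECD}(\mathbf{X}, Y)
  = \frac{\mathrm{Var}(\boldsymbol{\beta}^{T}\mathbf{X})}
         {\mathrm{Var}(\boldsymbol{\beta}^{T}\mathbf{X}) + \sigma^{2}}
  = \frac{\mathrm{Var}(\boldsymbol{\beta}^{T}\mathbf{X})}{\mathrm{Var}(Y)}
  = R^{2}.
```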
4. Measuring the Total, Direct, and Indirect Effects in Recursive GLM Systems
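For simplicity, in the recursive case with $(X_1, X_2, Y)$, where $X_1$ precedes $X_2$, we discuss the effects of $X_1$ and $X_2$ on $Y$. Let $(\mu_1, \mu_2, \mu_Y)$ be the expectation of $(X_1, X_2, Y)$. Then, for a GLM with the conditional density or probability function of $Y$ when $(X_1, X_2) = (x_1, x_2)$ given by (1), the total effect of $(X_1, X_2) = (x_1, x_2)$ on $Y = y$ can be defined by using the following log odds ratio:

$$\log\frac{f(y|x_1,x_2)\,f(\mu_Y|\mu_1,\mu_2)}{f(\mu_Y|x_1,x_2)\,f(y|\mu_1,\mu_2)} = \frac{(y-\mu_Y)\{\theta(x_1,x_2)-\theta(\mu_1,\mu_2)\}}{a(\varphi)}.$$

Taking the expectation of the above effect with respect to $X_1$, $X_2$ and $Y$, we have:

$$\frac{\mathrm{Cov}(\theta(X_1,X_2), Y)}{a(\varphi)} = \mathrm{KL}((X_1,X_2), Y).$$

The above KL information is the (summary) total effect of explanatory variables $(X_1, X_2)$ on response variable $Y$. Let $\mu_2(x_1)$ and $\mu_Y(x_1)$ be the conditional expectations for $X_2$ and $Y$, respectively, given that $X_1 = x_1$. The log odds ratio with respect to $X_2$ and $Y$ for a given $X_1 = x_1$ is:

$$\log\frac{f(y|x_1,x_2)\,f(\mu_Y(x_1)|x_1,\mu_2(x_1))}{f(\mu_Y(x_1)|x_1,x_2)\,f(y|x_1,\mu_2(x_1))} = \frac{(y-\mu_Y(x_1))\{\theta(x_1,x_2)-\theta(x_1,\mu_2(x_1))\}}{a(\varphi)}.$$

The total effect of $X_2 = x_2$ on $Y = y$ when $X_1 = x_1$ is defined by the above log odds ratio because the effect expresses the total effect of $X_2$ on $Y$ when the effect of the preceding variable $X_1$ is excluded. From this, the total effect of $X_1 = x_1$ on $Y = y$ is defined by

$$\frac{(y-\mu_Y)\{\theta(x_1,x_2)-\theta(\mu_1,\mu_2)\}}{a(\varphi)} - \frac{(y-\mu_Y(x_1))\{\theta(x_1,x_2)-\theta(x_1,\mu_2(x_1))\}}{a(\varphi)}.$$

By taking the expectation of the above information with respect to $X_1$, $X_2$ and $Y$, the (summary) total effect of $X_1$ on $Y$ is given by

$$\mathrm{KL}((X_1,X_2), Y) - \mathrm{KL}(X_2, Y|X_1),$$

where $\mathrm{KL}(X_2, Y|X_1)$ is given by

$$\mathrm{KL}(X_2, Y|X_1) = \frac{E\left[\mathrm{Cov}(\theta(X_1,X_2), Y|X_1)\right]}{a(\varphi)}.$$

The second term implies the effect of $X_2$ by itself, that is, the effect of $X_2$ on $Y$ when the effect of $X_1$ is excluded, and is defined as the (summary) total effect of $X_2$ on $Y$. Let $\mu_1(x_2)$ and $\mu_Y(x_2)$ be the conditional expectations of $X_1$ and $Y$, respectively, given that $X_2 = x_2$. The direct effect of $X_1 = x_1$ on $Y = y$ can be understood according to the following log odds ratio:

$$\log\frac{f(y|x_1,x_2)\,f(\mu_Y(x_2)|\mu_1(x_2),x_2)}{f(\mu_Y(x_2)|x_1,x_2)\,f(y|\mu_1(x_2),x_2)} = \frac{(y-\mu_Y(x_2))\{\theta(x_1,x_2)-\theta(\mu_1(x_2),x_2)\}}{a(\varphi)}.$$

The above effect is derived by excluding the effect of $X_2$, so it is defined as the direct effect of $X_1 = x_1$ on $Y = y$. Taking the expectation of the above effect, we have the (summary) direct effect of $X_1$ on $Y$, expressed as follows:

$$\mathrm{KL}(X_1, Y|X_2) = \frac{E\left[\mathrm{Cov}(\theta(X_1,X_2), Y|X_2)\right]}{a(\varphi)}.$$

The above quantity is the amount of entropy of $Y$ explained by $X_1$ alone, that is, excluding the effect of $X_2$. By subtracting the direct effect of $X_1 = x_1$ on $Y = y$ from the total effect, we have the indirect effect of $X_1 = x_1$ on $Y = y$:

$$\frac{(y-\mu_Y)\{\theta(x_1,x_2)-\theta(\mu_1,\mu_2)\}}{a(\varphi)} - \frac{(y-\mu_Y(x_1))\{\theta(x_1,x_2)-\theta(x_1,\mu_2(x_1))\}}{a(\varphi)} - \frac{(y-\mu_Y(x_2))\{\theta(x_1,x_2)-\theta(\mu_1(x_2),x_2)\}}{a(\varphi)}.$$

Taking the expectation of the above effect, the (summary) indirect effect is given by

$$\mathrm{KL}((X_1,X_2), Y) - \mathrm{KL}(X_2, Y|X_1) - \mathrm{KL}(X_1, Y|X_2).$$

As in the previous section, to standardize the above effects by ECD, we define the standardized total, direct, and indirect effects of $X_1$ and $X_2$ on $Y$ as follows:

The total effect of $X_1$ and $X_2$ on $Y$ is:

$$e_T((X_1,X_2) \to Y) = \frac{\mathrm{KL}((X_1,X_2), Y)}{\mathrm{KL}((X_1,X_2), Y) + 1} = \mathrm{ECD}((X_1,X_2), Y).$$

The total effect of $X_1$ on $Y$:

$$e_T(X_1 \to Y) = \frac{\mathrm{KL}((X_1,X_2), Y) - \mathrm{KL}(X_2, Y|X_1)}{\mathrm{KL}((X_1,X_2), Y) + 1}.$$

The direct effect of $X_1$ on $Y$:

$$e_D(X_1 \to Y) = \frac{\mathrm{KL}(X_1, Y|X_2)}{\mathrm{KL}((X_1,X_2), Y) + 1}.$$

The indirect effect of $X_1$ on $Y$:

$$e_I(X_1 \to Y) = e_T(X_1 \to Y) - e_D(X_1 \to Y).$$

The total (direct) effect of $X_2$ on $Y$:

$$e_T(X_2 \to Y) = e_D(X_2 \to Y) = \frac{\mathrm{KL}(X_2, Y|X_1)}{\mathrm{KL}((X_1,X_2), Y) + 1}.$$

In this case:

$$e_T((X_1,X_2) \to Y) = e_T(X_1 \to Y) + e_T(X_2 \to Y).$$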
A general approach based on the above discussion is given below. Let
be variables such that the parents of
are
, that is,
precedes
. Let
be the conditional density or probability of
given
such that:
Explaining response variable
in a GLM framework by explanatory variables
, the effects of the explanatory variables on the response variable can be treated in terms of entropy as discussed above. From this the standardized (summary) total effect of
on
is defined by:
Second, the total effect of
is defined as:
Then, we can find the total effects of
by induction, which yields:
where
and
can be defined as in (2). In the above formulae, we have:
and:
Remark 3. The total effect of on is given by:
where and be the conditional expectations of given and, respectively.
Let
be parent variables of
excluding
. The direct effect of
on
is defined by:
From this, we have the indirect effect of
:
Remark 4. The direct effect of on is given by
where is the conditional expectation of given.
For canonical links:
we have:
and:
From (5) we have:
The present approach is different from the usual approach for linear equation models and from the approach in [10], because it is based on the log odds ratio and entropy by using all the variables concerned.
Remark 5. The total effects of variables by Kuha and Goldthorpe [10] are defined with the marginal distributions of response variables and explanatory variables. Meanwhile, the present approach defines the total effects of explanatory variables based on a recursive structure of all the variables concerned, and we have (6).
Remark 6. Indirect effects are defined as the total effects minus the direct effects, as in (3), (4) and (7); however, they can still be interpreted in terms of entropy. On the other hand, direct and indirect effects are also defined in the approach of [10], though their sum does not equal the total effect.
Remark 7. Assessing the model identification and testing the goodness-of-fit of the model are based on the discussion of GLMs.
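The three-variable decomposition described earlier in this section can be sketched numerically for discrete variables. The sketch assumes a recursive system in which a first explanatory variable precedes a second one, a joint probability table with axes (first variable, second variable, response), and conditional KL information computed as the average, over the conditioning variable, of the symmetric KL information within each slice. The function names, dictionary keys, and the example probabilities are hypothetical and are not taken from the British mobility data.

```python
import numpy as np

def symmetric_kl(joint):
    """Symmetric KL information between a positive 2-D table p(x, y) and p(x)p(y)."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    indep = joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)
    return float(np.sum((joint - indep) * np.log(joint / indep)))

def conditional_kl(p, given_axis):
    """Average, over the conditioning variable, of the symmetric KL information
    between the remaining explanatory variable and the response (last axis)."""
    slabs = np.moveaxis(p, given_axis, 0)
    return float(sum(s.sum() * symmetric_kl(s) for s in slabs))

def standardized_effects(p):
    """Standardized summary effects for a table p[x1, x2, y] (x1 precedes x2)."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    n1, n2, ny = p.shape
    kl_all = symmetric_kl(p.reshape(n1 * n2, ny))  # both variables jointly
    kl_x2_given_x1 = conditional_kl(p, 0)          # second variable, first excluded
    kl_x1_given_x2 = conditional_kl(p, 1)          # first variable, second excluded
    denom = kl_all + 1.0
    total_x1 = (kl_all - kl_x2_given_x1) / denom
    direct_x1 = kl_x1_given_x2 / denom
    return {
        "total_all": kl_all / denom,
        "total_x1": total_x1,
        "direct_x1": direct_x1,
        "indirect_x1": total_x1 - direct_x1,
        "total_x2": kl_x2_given_x1 / denom,
    }

# Hypothetical chain x1 -> x2 -> y with no direct path from x1 to y.
p_x1 = np.array([0.5, 0.5])
p_x2_given_x1 = np.array([[0.7, 0.3], [0.3, 0.7]])  # rows: x1, columns: x2
p_y_given_x2 = np.array([[0.8, 0.2], [0.2, 0.8]])   # rows: x2, columns: y
p = p_x1[:, None, None] * p_x2_given_x1[:, :, None] * p_y_given_x2[None, :, :]
effects = standardized_effects(p)
```

In this hypothetical chain the direct effect of the first variable vanishes, so its total effect is entirely indirect, and the total effect of both variables equals the sum of the two total effects.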
5. Statistical Test for Effects
Let
and
be the ML estimators of
and
, respectively. A similar result presented in Eshima & Tabata [16] can be used to show that:
is asymptotically distributed according to a chi-squared distribution with the degrees of freedom equal to the number of parameters in the conditional independent model with
minus that with
. By using statistic (10), the total effects can be tested. Similarly, the statistic:
is asymptotically distributed according to a chi-squared distribution with degrees of freedom equal to the number of regression coefficients (parameters) related to variable
.
The following statistic is asymptotically distributed according to a non-central chi-squared distribution with degree of non-centrality:
and appropriate degrees of freedom ν, found as the number of parameters in the conditional independent model with
minus that with
. Let:
and let
and
. The statistic
is asymptotically distributed according to the chi-squared distribution with
degrees of freedom. As
becomes large, the chi-squared distribution tends to a normal distribution with mean
and variance
. From this, for sufficiently large sample sizes
, the statistic:
is asymptotically normally distributed with mean
and variance
[17]. For sufficiently large
, we have that:
From this, the asymptotic standard error (ASE) of
is
. Similarly, the asymptotic standard error of:
is
. Moreover:
is asymptotically equal to a normal distribution with mean:
and variance
. By using the above results, ASEs of the estimates of the summary total and direct effects can be calculated.
6. Path Analysis of the British Mobility Data
The British mobility data described in Section 2 were analyzed in detail by using log odds ratios [10]. Here, the proposed path analysis method is applied to summarize the effects of parental class X and education Z on destination class Y, measured by log odds ratios as in the previous section, and to give a simple interpretation from the summary effects of X and Z on Y. The three variables are random, and the GLM system can be composed of logit models. In this example, the employed logistic model can be expressed as follows. Let X be a categorical factor and Z a score, taking levels {1,2,3} and {1,2,…,7}, respectively, and let Y be a categorical response variable with levels {1,2,3}. Let:
Then, dummy variable vectors
and
are identified with categorical variables
and response
, respectively. From this, the systematic component of the above model can be expressed as follows:
where:
Then, the logit model is described as:
where
implies the summation over all u. Then, from Table 4 in [10], the estimated regression parameters for men are calculated as follows:
Table 4.
The effects of X and Z on Y = W.
Similarly, we have the estimated parameters for women as follows:
From Tables 1 and 5 in [10], the joint distributions of parental class X and education Z for men and women are calculated and shown in Table 1.
Table 1.
The estimated joint distributions of parental class X and education level Z.
Table 5.
Summary Direct, Indirect, and Total Effects of X and Z on Y.
On the basis of the estimated parameters shown above and the estimated joint distribution of X and Z in Table 1, the joint distributions of X, Y, and Z by sex can be estimated. The effects of X and Z on Y for men are shown in Tables 2–4. For example, the effects of
and
on
illustrated in Table 2 are as follows:
- the total effect of and on is calculated as follows: 0.51;
- the total effect of is 0.04;
- the total effect of is 0.47 when ;
- the direct effect of is 0.16 when ;
- the indirect effect of is 0.31 when .
Table 2.
The effects of X and Z on Y = S.
Table 3.
The effects of X and Z on Y = I.
Similarly, the effects of X and Z on Y for women can be calculated. The results are omitted to avoid redundancy of the discussion.
The standardized summary effects are shown in Table 5. For men, the total effect of X and Z on Y is 0.276, and so 27.6% of the variation of Y’s entropy is explained by X and Z. The indirect effect of X is about twice the direct effect, and the total (direct) effect of Z on Y is about 1.5-fold that of X. Therefore, the effect of education Z on the destination class Y is large. For women, the total effect of X and Z on Y is 0.289, meaning that 28.9% of the variation of Y’s entropy is explained by X and Z. The indirect effect of X on Y is about 6-fold that of the direct effect, and the direct effect is small. The total effect of Z on Y is about 2.7-fold that of X. The effect of Z on Y is more pronounced for women than for men.
In a comparison of men and women, the effect of Z on Y for women is about 1.3-fold the effect for men, and, contrarily, the effect of X on Y for men is about 1.4-fold the effect for women. For both men and women, the direct effects of X on Y are mostly very small, and this decomposition of effects shows that education plays an important role in determining social class as an adult.
7. Discussion
In the usual path analysis of continuous variable systems, use of the regression coefficients allows straightforward calculation of total, direct and indirect effects, and the total effect can be expressed by the sum of the direct and indirect effects. However, such techniques cannot be applied to structural GLMs with categorical variables or variables that are not normally distributed. Moreover, multiple variable categories make the problem more complicated in comparison with linear equation models for continuous variables. In the present paper, a path analysis approach for structural GLMs was proposed, and calculation of the direct and indirect effects was discussed. Although the analysis of effects of explanatory variables on response variables can be discussed in detail by using log odds ratios, and the effects can be interpreted as changes of relative information, the results are generally quite complicated, as demonstrated in Tables 2–4. The present path analysis summarizes the effects, as measured by log odds ratios, and the standardized summary total, direct, and indirect effects are interpreted in the framework of entropy. The present path analysis approach has potential for wide application in practical data analyses of causal systems represented as GLMs, and is particularly well suited to categorical data analysis. The present study has provided a basic idea for path analysis of recursive systems with GLMs, where all the variables concerned are causally ordered; further studies are needed for performing path analysis of more complicated recursive GLM systems and for assessing spurious effects.
Acknowledgments
The authors would like to thank the two referees and the editor for their useful comments and suggestions for improving the first version of this paper. This research was supported by Grant-in-aid for Scientific Research 26330045, Ministry of Education, Culture, Sports, Science and Technology of Japan.
Author Contributions
Nobuoki Eshima and Claudio Giovanni Borroni wrote the manuscript. Nobuoki Eshima and Minoru Tabata designed the research. Nobuoki Eshima, Claudio Giovanni Borroni and Yutaka Kano carried out the present path analysis for a real dataset. All authors have read and approved the final manuscript.
Conflicts of Interest
The authors declare no conflict of interest.
References
1. Asher, H.B. Causal Modelling; Sage Publications: Beverly Hills, CA, USA, 1976.
2. Bentler, P.M.; Weeks, D.B. Linear structural equations with latent variables. Psychometrika 1980, 45, 289–308.
3. Jöreskog, K.G.; Sörbom, D. LISREL8: User’s Reference Guide, 2nd ed.; Scientific Software International: Chicago, IL, USA, 1996.
4. Goodman, L.A. Causal analysis of data from panel studies and other kinds of surveys. Am. J. Sociol. 1973, 78, 1135–1191.
5. Goodman, L.A. The analysis of multidimensional contingency tables when some variables are posterior to others: A modified path analysis approach. Biometrika 1973, 60, 179–192.
6. Goodman, L.A. The analysis of systems of qualitative variables when some of the variables are unidentifiable: Part I. A modified latent structure approach. Am. J. Sociol. 1974, 79, 1179–1259.
7. Goodman, L.A. Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika 1974, 61, 215–231.
8. Hagenaars, J.A. Categorical causal modeling: Latent class analysis and directed loglinear models with latent variables. Sociol. Methods Res. 1998, 26, 436–489.
9. Eshima, N.; Tabata, M.; Geng, Z. Path analysis with logistic regression models: Effect analysis of fully recursive causal systems of categorical variables. J. Jpn. Stat. Soc. 2001, 31, 1–14.
10. Kuha, J.; Goldthorpe, J.H. Path analysis for discrete variables: The role of education in social mobility. J. R. Stat. Soc. A 2010, 173, 351–369.
11. Albert, J.M.; Nelson, S. Generalized causal mediation analysis. Biometrics 2011, 67, 1028–1038.
12. Nelder, J.A.; Wedderburn, R.W.M. Generalized linear models. J. R. Stat. Soc. A 1972, 135, 370–384.
13. McCullagh, P.; Nelder, J.A. Generalized Linear Models, 2nd ed.; Chapman and Hall: London, UK, 1989.
14. Eshima, N.; Tabata, M. Entropy correlation coefficient for measuring predictive power of generalized linear models. Stat. Probab. Lett. 2007, 77, 588–593.
15. Eshima, N.; Tabata, M. Entropy coefficient of determination for generalized linear models. Comput. Stat. Data Anal. 2010, 54, 1381–1389.
16. Eshima, N.; Tabata, M. Three predictive power measures for generalized linear models: Entropy coefficient of determination, entropy correlation coefficient and regression correlation coefficient. Comput. Stat. Data Anal. 2011, 55, 3049–3058.
17. Patnaik, P.B. The non-central χ2 and F-distributions and their applications. Biometrika 1949, 36, 202–232.
© 2015 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).
