1. Introduction
In modern scientific studies, a primary focus is on selecting an appropriate model. Researchers typically gather data by measuring various aspects of the subjects being observed and then analyze how these variables affect a specific outcome. It is essential to determine which measures matter for the outcome, identify any irrelevant measures, and evaluate any interactions between the variables that require consideration [1].
In particular, model selection in regression analysis with normally distributed response variables is familiar and widely applied in many areas of study, including engineering, biomedical sciences, and social sciences. The overriding interest of researchers in these fields is to obtain a regression model with as few regression parameters as possible (a parsimonious model). In practice, the popular methods are forward selection, backward elimination, and stepwise selection, each carried out through a test of significance of a single regression coefficient, for example, a test of $H_0: \beta_j = 0$ using the F test ([2, 3]), where $\beta_j$ is the jth regression parameter in a multiple linear regression model. Other model selection procedures, such as the Akaike information criterion (AIC) [4] and the Bayesian information criterion (BIC) [5], are also available. However, the properties of these model selection procedures are not well known.
The purpose of this paper is to study the properties of these procedures in generalized linear models (GLMs). In GLMs, the choice of probability distribution is not limited to symmetric distributions like the normal distribution. It encompasses a range of asymmetric probability distributions, including the binomial and Poisson distributions. In this paper, a model selection procedure is developed to accommodate both symmetric and asymmetric regression models.
The history of regression starts with Gauss and Legendre, who first introduced the method of least squares in the early 1800s. Later, between the 1800s and 1900s, Galton and Pearson developed the concept of regression. It was Fisher who combined the works of Gauss and Pearson to form a complete theory of the properties of least squares estimation. Fisher's contribution to this field made regression analysis useful for predicting and understanding correlations as well as making inferences about the relationship between a response and a covariate. Later, nonparametric regression and semiparametric regression methods were developed based on kernels by Fan [6], splines by Eilers and Marx [7], and wavelets by Bock and Pliego [8]. In this paper, we work on generalized linear regression models.
In an ordinary least squares (OLS) linear regression model, we assume that the error terms are normally distributed with common variance. However, in binomial regression models, the error terms can only be 0 or 1 for each observation, and the variance is not constant. Moreover, in the Poisson regression model, error terms can only be positive numbers, whereas in OLS, error terms could take on any value on the real number line. As a result, it cannot be assured, without further investigation, that the model selection procedures that work for normally distributed data will also work for non-symmetric data, such as Poisson and binomial data.
There have been recent advancements in variable selection for GLMs that are suitable for either large datasets or high-dimensional data. References for variable selection in big data include methods based on elastic net regularization paths [9], the debiased lasso [10], reference models [11], and a regularized version of the least-squares criterion [12]. For variable selection in high-dimensional GLMs with binary outcomes, see [13]; for temporal-dependent data, refer to [14]; and for knowledge transfer, refer to [15]. Model selection has also garnered significant attention in the Bayesian approach to generalized linear mixed models [16]. In this context, it is worth noting that there is a large literature on goodness-of-fit tests, for example [17, 18, 19, 20, 21, 22, 23, 24]. However, in this paper, we investigate variable selection in GLMs for non-symmetric data, such as binomial and Poisson regression models, when the dataset is small or moderate. To the best of our knowledge, this type of investigation does not exist in the literature.
There are two aspects to the model selection procedure: (a) finding a suitable test statistic for testing the significance of a single regression coefficient, for example, to test $H_0: \beta_j = 0$, which performs best in holding an appropriate level of significance, say $\alpha$, and has superior power properties, and (b) finding a model selection procedure using this suitable test statistic, which, again, has the best properties with respect to level and power.
For (a), we developed three large sample test statistics, namely, the score test, the likelihood ratio test, and the Wald test. These three tests, along with the usual F test, are compared using a simulation study.
The score test [25] is a special case of the $C(\alpha)$ test [26], where the nuisance parameters are replaced by maximum likelihood estimates, which are $\sqrt{n}$-consistent; here, n denotes the number of observations used in estimating the parameters. The score test is particularly appealing as we only have to study the distribution of the test statistic under the null hypothesis, which is that of the basic model. It often maintains, at least approximately, a preassigned level of significance, and often produces a statistic that is simple to calculate. In contrast, the other two asymptotically equivalent tests (the LRT and the Wald test) require estimates of the parameters under the alternative hypothesis and often show liberal or conservative behavior in small samples. For further discussion, see [27].
For (b), an extensive simulation study was conducted to compare the properties of the forward selection and backward elimination procedures, using the best statistic found in (a), with those of the AIC and the BIC. Further discussion on this is provided in Section 3.1.1.
In Section 2, we develop the three large sample test statistics, which are then specialized for data from the normal, Poisson, and binomial distributions. The F statistic used in model selection for data from a normal distribution is also discussed. The results of an extensive simulation study are reported in Section 3. Extensions for asymmetric distributions, such as over-dispersed Poisson (the negative binomial) and over-dispersed binomial (the beta-binomial) regression models, are presented and evaluated in Section 4. Two examples are presented in Section 5; a discussion follows in Section 6.
2. Generalized Linear Model and the Test Statistics
2.1. Generalized Linear Model
The Generalized Linear Model (GLM) was developed by Nelder and Wedderburn [28]. A GLM generalizes ordinary linear regression models to encompass non-normal response distributions and nonlinear functions of the mean. It is composed of three components:
- (i) The random component: This describes the response variable y (categorical or continuous) and its probability distribution.
- (ii) The systematic component: This connects a set of covariates with a linear predictor in the following form: $\eta_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}$.
- (iii) The link function: It is a monotone differentiable function f applied to each component of $E(Y)$, which connects the random and systematic components through $f(E(Y_i)) = \eta_i$. For more details, see [28, 29].
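The three components can be illustrated with a minimal Python sketch. The pairing of each distribution with its canonical link (identity, log, logit) is an assumption made for illustration, and the names `LINKS`, `linear_predictor`, and `mean_response` are hypothetical.

```python
import math

# Family name -> (link, inverse link). The canonical pairings are assumed here.
LINKS = {
    "normal": (lambda mu: mu, lambda eta: eta),
    "poisson": (lambda mu: math.log(mu), lambda eta: math.exp(eta)),
    "binomial": (lambda mu: math.log(mu / (1.0 - mu)),
                 lambda eta: 1.0 / (1.0 + math.exp(-eta))),
}


def linear_predictor(beta, x):
    """Systematic component: beta[0] + beta[1]*x[0] + ... (x excludes the leading 1)."""
    return beta[0] + sum(b * xj for b, xj in zip(beta[1:], x))


def mean_response(family, beta, x):
    """Map covariates to the mean of the response through the inverse link."""
    _, inverse_link = LINKS[family]
    return inverse_link(linear_predictor(beta, x))


# Same linear predictor, three different means depending on the family.
for family in ("normal", "poisson", "binomial"):
    print(family, mean_response(family, [0.5, 1.0], [0.2]))
```

The random component is not simulated here; the sketch only shows how the link ties the linear predictor to the mean.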
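The random variable Y has a distribution of the GLM form if its probability density (or mass) function can be written as
$$p(y; \theta, \phi) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\},$$
where $E(Y) = \mu = b'(\theta)$ and $\mathrm{Var}(Y) = a(\phi) V(\mu)$. In GLMs, $V(\mu) = b''(\theta)$ is a variance function that characterizes a particular GLM family of distributions. Apart from the normal distribution, the discrete models, namely, the binomial model and the Poisson model, belong to this family. A set of covariates $x_1, \ldots, x_p$ is related to the mean $\mu$ by $f(\mu) = X\beta$, where $f$ is the link function, $X$ is an $n \times (p+1)$ matrix, and $\beta = (\beta_0, \beta_1, \ldots, \beta_p)'$ is the vector of regression parameters. Furthermore, we assume that $x_{i0} = 1$, so that $\beta_0$ is the intercept parameter.
Inference procedures regarding the mean $\mu$ or the regression parameters $\beta$ are made using the log-likelihood function $l$. The log-likelihood for $\beta$ can be written as
$$l(\beta) = \sum_{i=1}^{n} \left\{ \frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right\}. \qquad (1)$$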
2.2. The Test Statistics
Our interest is to develop a test statistic for testing the hypothesis that one of the regression parameters is zero. As such, we consider the null hypothesis $H_0: \beta_j = 0$, with $\beta_k$ ($k \neq j$) unspecified, against $H_A: \beta_j \neq 0$.
In order to develop the test statistic that follows a distribution of the GLM form, we need to obtain the maximum likelihood estimates of the
parameters under the null as well as under the alternative hypotheses using the log-likelihood in Equation (
1) developed above, the first derivative of which is
where
,
, and
, where
denotes differentiation with respect to
. To estimate the parameters
, we need to solve
, which is non-linear in
, so must be solved iteratively.
Note that under the null hypothesis, we estimate for . We denote these estimates through . Furthermore, under the alternative hypothesis, we estimate , for . We denote these estimates by .
2.2.1. The Likelihood Ratio Test and the Wald Test
Generally, the likelihood ratio statistic used to test a null hypothesis against an alternative is the ratio of the maximum likelihood under the null hypothesis to that under the alternative hypothesis. In practice, we maximize the log-likelihoods to find maximum likelihood estimates of the parameters under the null and the alternative hypotheses. Let $\hat{l}_0$ and $\hat{l}_1$ be the maximized log-likelihoods under the null and the alternative hypotheses, respectively. Then, it can be shown that the likelihood ratio statistic for testing the null hypothesis $H_0: \beta_j = 0$, with $\beta_k$ ($k \neq j$) unspecified, against $H_A: \beta_j \neq 0$, is $2(\hat{l}_1 - \hat{l}_0)$.
Similarly, the Wald test statistic is based on the ratio of the maximum likelihood estimate of the parameter of interest under the alternative hypothesis to its standard error. Thus, the Wald test statistic for testing the null hypothesis $H_0: \beta_j = 0$, with $\beta_k$ ($k \neq j$) unspecified, against $H_A: \beta_j \neq 0$, is given by the square of this ratio, $\hat{\beta}_j^2 / \widehat{\mathrm{Var}}(\hat{\beta}_j)$, where $\widehat{\mathrm{Var}}(\hat{\beta}_j)$ is obtained from the Hessian matrix at the end of the iterative process.
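As an illustration of these two statistics, the following Python sketch fits a Poisson regression with a log link, an intercept, and a single covariate by Newton-Raphson, and then computes the likelihood ratio and (squared) Wald statistics for testing that the slope is zero. The one-covariate setting, the function names, and the toy data are assumptions made to keep the example short; they are not taken from the paper.

```python
import math


def _score_and_info(b0, b1, x, y):
    """Score vector and Fisher information for log(mu_i) = b0 + b1 * x_i."""
    mu = [math.exp(b0 + b1 * xi) for xi in x]
    u0 = sum(yi - mi for yi, mi in zip(y, mu))
    u1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
    i00 = sum(mu)
    i01 = sum(xi * mi for xi, mi in zip(x, mu))
    i11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
    return u0, u1, i00, i01, i11


def loglik(b0, b1, x, y):
    """Poisson log-likelihood without the log(y!) term, which cancels in the LRT."""
    return sum(yi * (b0 + b1 * xi) - math.exp(b0 + b1 * xi) for xi, yi in zip(x, y))


def lrt_and_wald(x, y, iters=50, tol=1e-10):
    """Return (LRT, Wald) statistics for H0: b1 = 0. Requires sum(y) > 0."""
    b0_null = math.log(sum(y) / len(y))  # closed-form MLE under H0
    b0, b1 = b0_null, 0.0                # Newton-Raphson starts at the null fit
    for _ in range(iters):
        u0, u1, i00, i01, i11 = _score_and_info(b0, b1, x, y)
        det = i00 * i11 - i01 * i01
        step0 = (i11 * u0 - i01 * u1) / det
        step1 = (i00 * u1 - i01 * u0) / det
        b0, b1 = b0 + step0, b1 + step1
        if abs(step0) + abs(step1) < tol:
            break
    _, _, i00, i01, i11 = _score_and_info(b0, b1, x, y)
    var_b1 = i00 / (i00 * i11 - i01 * i01)  # slope entry of the inverse information
    lrt = 2.0 * (loglik(b0, b1, x, y) - loglik(b0_null, 0.0, x, y))
    wald = b1 * b1 / var_b1
    return lrt, wald


# Toy data with an increasing trend (illustrative only).
x = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
y = [0, 1, 1, 2, 2, 4, 6, 9]
print(lrt_and_wald(x, y))
```

Both statistics require the fit under the alternative hypothesis, which is why an iterative solver is needed here even for a single covariate.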
2.2.2. The Score Test
The score test is based on the partial derivatives of the log-likelihood function with respect to the nuisance parameters and the parameters of interest evaluated at the null hypothesis. The score test statistic can be shown to be
For the derivation of the score test statistics and the definition of
,
,
, and
; see
Appendix A. The above score test can also be obtained from Pregibon [
30], who developed the score test for the generalized linear interactive modeling system. The proof is presented in
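Appendix B. Note that all quantities in the score statistic are evaluated at the MLE under the null hypothesis. Asymptotically (for large n), the distribution of each of the three test statistics (score, likelihood ratio, and Wald) converges to a chi-square distribution with one degree of freedom, $\chi^2_1$ [26]. Therefore, for a fixed significance level $\alpha$, we reject the null hypothesis if the value of a test statistic is greater than the upper $\alpha$ critical value $\chi^2_{1,\alpha}$.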
To save space, the expressions for the three test statistics (score, likelihood ratio, and Wald) for the special cases in which the data distribution is normal, Poisson, and binomial are presented in Appendix A.
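To make the appeal of the score test concrete, here is a minimal Python sketch for the simplest Poisson case: a log link, an intercept as the only nuisance parameter, and one covariate under test. In this special case the null fit is simply the sample mean, so no iteration is needed. This closed form is a standard result for that special case and is an assumption of the sketch; it is not the general expression given in the appendices, and the function name and data are hypothetical.

```python
def poisson_score_stat(x, y):
    """Score statistic for H0: slope = 0 in log(mu_i) = b0 + b1 * x_i.

    Under H0 the MLE of mu is the sample mean, so the score for the slope is
    sum(x_i * (y_i - ybar)) and its adjusted information is
    ybar * sum((x_i - xbar)^2).
    """
    n = len(y)
    ybar = sum(y) / n
    xbar = sum(x) / n
    score = sum(xi * (yi - ybar) for xi, yi in zip(x, y))
    info = ybar * sum((xi - xbar) ** 2 for xi in x)
    return score * score / info


x = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
y = [0, 1, 1, 2, 2, 4, 6, 9]
stat = poisson_score_stat(x, y)
# Compare with 3.841, the upper 5% point of the chi-square(1) distribution.
print(stat, stat > 3.841)
```

Only quantities from the null model appear in the function, which is the computational advantage discussed above.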
2.2.3. The F Test
The
F statistic used in model selection for data from a normal distribution is
, where
and
. Here,
if
holds ([
3], p. 267).
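For intuition, the F test of a single coefficient can be sketched in Python for the simplest case of one covariate plus an intercept, using the extra-sum-of-squares form: the reduction in residual sum of squares from adding the covariate, divided by the residual mean square of the larger model. The single-covariate restriction and the function name are assumptions made for brevity.

```python
def f_statistic_slope(x, y):
    """Partial F statistic for H0: slope = 0 in y = b0 + b1 * x + error."""
    n = len(y)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sse_null = sum((yi - ybar) ** 2 for yi in y)  # intercept-only model
    sse_full = sse_null - sxy * sxy / sxx         # model with the covariate
    # One numerator degree of freedom, n - 2 denominator degrees of freedom.
    return (sse_null - sse_full) / (sse_full / (n - 2))


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
print(f_statistic_slope(x, y))
```

The statistic is referred to an F distribution with 1 and n - 2 degrees of freedom in this setting; with more covariates in the model, the denominator degrees of freedom shrink accordingly.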
2.3. Simulation Study
A simulation study is now conducted to compare the behaviors of the four test statistics, namely, the score, the LRT, the Wald, and the F, in terms of empirical level and power, for testing the significance of a single regression coefficient. We consider a two-variable regression model with link functions , , and for , , and distribution, respectively. and are generated from the standard normal distribution and is considered for normal distribution.
Suppose our interest is to test
against
in each case. For empirical levels, we take
, and
. For power, we take
and
, and different values of
, as presented in
Table 1 for normal and Poisson-distributed data, and
Table 2 for binomial-distributed data.
For data from the binomial distribution, the level and power results may be affected by the binomial index
m. To check this, we conduct simulations for
,
, and
. For both level and power, we consider sample sizes
and 50 for all distributions. Each simulation experiment is based on 10,000 replicated samples. The level and power results are presented in
Table 1 for normal and Poisson distributions and in
Table 2 for binomial distribution. Results in
Table 1 show that for normally distributed data, the score test and the F test maintain the level reasonably well, although the level of the score test is somewhat inflated and, as a result, so is its power. The other two statistics (Wald and LRT) are liberal and consequently show higher power than the score and F tests.
Results in Table 1 and Table 2 show that for data from the Poisson and binomial distributions, the F test performs very poorly. The other three statistics hold the level very well and their power performances are also similar. Furthermore, results in
Table 2 show that the size of the binomial index
m does not have any effect on the size and power of the tests. So, in subsequent sections, we choose
as the binomial index.
We further conducted a simulation study in which the covariates $x_1$ and $x_2$ are correlated for the Poisson and binomial distributions, and the results (not included in the paper) show similar empirical level and empirical power properties.
It is reassuring that the F test does well for data from the normal distribution. So, in Section 3, we use this test in the study of the performance of the model selection procedures for normally distributed data. For data from Poisson and binomial distributions, we use the score test as it has a very simple form, it does not need estimates of the regression parameters under the alternative hypothesis, and its level and power properties are at least as good as those of the LRT and the Wald tests.
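The empirical level computation used in this section can be sketched in Python as follows. The sketch is a simplified version of the design: it uses a Poisson model with an intercept and a single covariate whose true coefficient is zero, the closed-form score statistic for that special case, and fewer replications than the 10,000 used in the paper. The intercept value, seed, and function names are assumptions.

```python
import math
import random


def poisson_draw(rng, mu):
    """Draw one Poisson(mu) variate by Knuth's method (adequate for small mu)."""
    limit, k, prod = math.exp(-mu), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def score_stat(x, y):
    """Score statistic for H0: slope = 0 with an intercept-only null model."""
    n = len(y)
    ybar, xbar = sum(y) / n, sum(x) / n
    info = ybar * sum((xi - xbar) ** 2 for xi in x)
    if info == 0.0:  # degenerate sample (all y zero); count as no rejection
        return 0.0
    score = sum(xi * (yi - ybar) for xi, yi in zip(x, y))
    return score * score / info


def empirical_level(n, reps, beta0=1.0, critical=3.841, seed=1):
    """Proportion of replicated samples, generated under H0, in which H0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [poisson_draw(rng, math.exp(beta0)) for _ in range(n)]
        if score_stat(x, y) > critical:
            rejections += 1
    return rejections / reps


print(empirical_level(n=50, reps=2000))
```

Empirical power is obtained the same way by generating the response with a nonzero coefficient for the covariate under test.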
3. Model Selection
3.1. Empirical Level and Power
Following the findings in Section 2.3, our model selection criterion for normally distributed data is based on testing the significance of a single regression coefficient $\beta_j$ using the F test presented in Section 2.2.3. Also, as discussed in Section 2.3, for data from the Poisson and the binomial distributions, we use the respective score test statistics presented in Appendix A. Our purpose here is to make a comparative study of the performance of forward selection, backward elimination, AIC, and BIC with respect to level and power.
Although these model selection procedures are well known, to help the readers, we provide brief descriptions of them below.
Forward Selection Procedure: The forward selection starts with only one variable in the model. So, if the model has p regression variables, apart from the intercept, in the first step, we fit p regression models and calculate the value of the score test statistic for each model. Then, the variable corresponding to the largest value of the score test statistic, provided it is also significant at a specified level of significance, is kept in the model. In step 2, we fit regression models containing the regression variable selected at step 1 and one of the remaining regression variables, and follow the procedure as in step 1. We then continue this process by adding one more variable each time, until no more variables can be included in the model. In the end, the final model will contain a subset of the p variables.
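The forward selection loop described above can be sketched in Python. To keep the example self-contained, the sketch uses the normal linear model with the partial F statistic as the entry criterion rather than the score test, and a single fixed cutoff `critical` in place of the exact F quantile at each step; both are simplifying assumptions, and all function names are hypothetical.

```python
import random


def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (m[r][n] - sum(m[r][k] * z[k] for k in range(r + 1, n))) / m[r][r]
    return z


def sse(cols, y):
    """Residual sum of squares from regressing y on an intercept plus cols."""
    design = [[1.0] * len(y)] + cols
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in design] for ci in design]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in design]
    beta = solve(xtx, xty)
    fitted = [sum(b * c[i] for b, c in zip(beta, design)) for i in range(len(y))]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def forward_select(columns, y, critical):
    """Add, at each step, the variable with the largest partial F above `critical`."""
    selected, n = [], len(y)
    while len(selected) < len(columns):
        base = sse([columns[s] for s in selected], y)
        best_name, best_f = None, critical
        for name in columns:
            if name in selected:
                continue
            full = sse([columns[s] for s in selected] + [columns[name]], y)
            df = n - len(selected) - 2  # residual degrees of freedom of the larger model
            f_stat = (base - full) / (full / df)
            if f_stat > best_f:
                best_name, best_f = name, f_stat
        if best_name is None:  # no remaining variable is significant
            break
        selected.append(best_name)
    return selected


rng = random.Random(0)
cols = {name: [rng.gauss(0, 1) for _ in range(40)] for name in ("x1", "x2", "x3")}
y = [1.0 + 2.0 * a + rng.gauss(0, 0.5) for a in cols["x1"]]
print(forward_select(cols, y, critical=4.0))
```

Replacing the partial F statistic with a score statistic computed from the current model gives the version of the procedure used for Poisson and binomial data.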
Backward Elimination Procedure: The backward elimination starts with the full model. We calculate the p score test statistics for testing $H_0: \beta_j = 0$, $j = 1, \ldots, p$. Then, if the variable with the smallest value of the score test statistic is found to be insignificant at a specified level of significance, we remove that variable from the model. We then continue this process by removing one more variable each time, until no more variables can be deleted from the model.
AIC and BIC Criteria: AIC judges a model by how close its fitted values tend to be to the true values, in terms of a certain expected value. AIC can be written as $\mathrm{AIC} = -2\hat{l} + 2k$, where $\hat{l}$ is the maximized log-likelihood and $k$ is the number of estimated parameters. Forward selection through AIC starts from the null model, and every variable outside the current model can be added one at a time at each step until AIC no longer improves. A Bayesian argument motivates the BIC, an alternative to AIC. It takes the sample size into account, and the forward selection process through BIC is similar to that through AIC, where $\mathrm{BIC} = -2\hat{l} + k \log n$.
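The two criteria are simple to compute once the maximized log-likelihood is available. The short Python sketch below uses the standard definitions; the function names and the numerical inputs are illustrative only.

```python
import math


def aic(loglik, k):
    """Akaike information criterion: -2 * log-likelihood + 2 * (number of parameters)."""
    return -2.0 * loglik + 2.0 * k


def bic(loglik, k, n):
    """Bayesian information criterion: the per-parameter penalty is log(n) instead of 2."""
    return -2.0 * loglik + k * math.log(n)


# Hypothetical fit: maximized log-likelihood -10 with 3 parameters and n = 100.
print(aic(-10.0, 3), bic(-10.0, 3, 100))
```

Because log(n) exceeds 2 once n is at least 8, BIC penalizes each extra parameter more heavily than AIC for all but the smallest samples, so it tends to select smaller models.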
As mentioned earlier, our purpose is to find the most parsimonious model. Here, we illustrate a method of calculating the empirical level using a p-variable Poisson regression model with $\log(\mu) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$. For given values of the regression parameters and simulated values of the regression variables, we obtain a sample of size n from the Poisson distribution. We then use the score test statistic for testing $H_0: \beta_j = 0$ within a model selection procedure, for example, the forward selection procedure, and obtain a model containing a subset of the regression variables. We repeat this process 10,000 times and obtain 10,000 models. If the given value of $\beta_j$ is zero or very small, we check whether the regression variable $x_j$ is in the final model. We then count the number of models in which the variable is included. Let this number be s. Then the empirical level for rejecting $H_0$ is s/10,000. Empirical power is calculated similarly by taking a larger value of $\beta_j$ during the simulation process.
3.1.1. Simulation Study
We conduct a simulation study to compare the performance of the model selection procedures, forward selection, backward elimination, AIC, and BIC, with respect to empirical level and power. We consider a four-variable regression model. Data are drawn from the normal
regression model, the Poisson
regression model, and the Binomial
regression model with
respectively. Suppose we would like to test
. To calculate the empirical level for each distribution, we choose
, and for empirical power, we take different values of
, as presented in
Table 3. The rest of the parameters are set at
,
,
, and
for normal and Poisson distributions, and
,
,
, and
for binomial distributions. For each distribution, 10,000 replicated samples are taken for sample sizes of
10, 20, 30, and 50.
For the forward selection and backward elimination procedures, we consider . Note that for the other two procedures, cannot be fixed.
The level and power results are presented in
Table 3 and
Table 4, which show that the forward selection method using the
F test for normal-distributed, and both forward selection and backward elimination using the score test for Poisson and binomial-distributed data, always produce a reasonable empirical level (close to the nominal level), irrespective of the sample size. The other two procedures, the AIC and BIC, produce highly inflated type I errors. The BIC, however, does well for a large sample size (
), where its power performance is also comparable to that of the forward selection and backward elimination procedures using the score test.
Thus, for normal regression models, our recommendation is to use the forward selection procedure using the F test. For Poisson and binomial regression models, our recommendation is to use the forward selection procedure using the score test for small to moderate sample sizes, while for large n () sizes, the BIC should be used as it is computationally much simpler.
5. Real Data Analysis
To demonstrate the practical application of the model selection procedures discussed in this paper, we examine two real datasets that have small sample sizes.
Dataset 1: The Lower Respiratory Illness Count Dataset.
This is a dataset provided by LaVange et al. [43], consisting of information on lower respiratory illness in 284 children during their first year of life. Each child was examined every two weeks over a period of one year.
There were eight covariates, namely
- Risk: the number of weeks during which the child is at risk in that year.
- Passive: a dummy variable that indicates whether the child was exposed to cigarette smoking.
- Crowding: a variable that indicates whether or not living conditions at home are crowded.
- Race: an indicator variable for race (1 = white, 0 = not white).
- Socioeconomic status: (1, 0), (0, 1), and (0, 0) for low, medium, and high class, respectively.
- Age group: (1, 0), (0, 1), and (0, 0) for under four, four to six, and more than six months, respectively.
We find this dataset appealing as it comes from a real experiment. However, this is a large dataset. So, we construct a small dataset consisting of a random sample of 50 children with their respective lower respiratory illness status and the covariates. The dataset is presented in Appendix D and is analyzed below.
This is a count dataset. The usual model to analyze such count data is a Poisson regression model. So, we first use a Poisson regression model and apply the model selection procedures discussed earlier. However, there may be overdispersion in the data since children who have an infection are more likely to have other infections. To test this, we apply the score test statistic,
, given by Cameron and Trivedi ([
44], p. 49) to test
versus
. This statistic has an asymptotic standard normal distribution, and for the sample data,
with a
p-value
.
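A score test for overdispersion of this kind can be sketched in Python. The sketch assumes the commonly quoted form of the statistic for a negative binomial alternative with quadratic variance, namely the sum of $(y_i - \hat{\mu}_i)^2 - y_i$ divided by $\sqrt{2 \sum \hat{\mu}_i^2}$, with a one-sided p-value from the standard normal distribution. The exact expression used in the analysis is not displayed in the text, so this form, the function name, and the toy counts are assumptions.

```python
import math


def overdispersion_score_test(y, mu):
    """Score-type test of Poisson variance against overdispersion.

    y  : observed counts
    mu : fitted Poisson means under the null (no overdispersion)
    Returns the statistic and its one-sided standard normal p-value.
    """
    numerator = sum((yi - mi) ** 2 - yi for yi, mi in zip(y, mu))
    denominator = math.sqrt(2.0 * sum(mi * mi for mi in mu))
    t = numerator / denominator
    p_value = 0.5 * math.erfc(t / math.sqrt(2.0))  # upper tail of N(0, 1)
    return t, p_value


# Toy counts with many zeros and a few large values; intercept-only fit, so
# every fitted mean equals the sample mean (illustrative data only).
y = [0, 0, 0, 1, 0, 7, 0, 12, 1, 9]
mu = [sum(y) / len(y)] * len(y)
print(overdispersion_score_test(y, mu))
```

A large positive value of the statistic indicates that the sample variance exceeds what the Poisson model allows, which motivates moving to the negative binomial model.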
We consider a negative binomial regression model to accommodate overdispersion. Thus, the full model considered here for model selection is
In Table 10, we provide the variables that enter the model at each step of the forward selection procedure using the score test, Wald test, LRT, AIC, and BIC.
Table 10 shows that two covariates (passive smoking and crowding) are significant out of the eight covariates using the forward selection procedure through the score test, and through the Wald test for the Poisson and negative binomial regression models. In contrast, the forward selection procedure through the AIC and the BIC provides different parsimonious models. We select the final model using the forward selection through the score test.
Thus, the final model for these data is
Dataset 2: The Coronary Heart Disease Dataset.
The data presented here consist of 50 data points (Rousseauw et al. [45]) from a retrospective sample of 3357 males in a coronary heart disease high-risk region of the Western Cape, South Africa. The response variable y is coronary heart disease, which has two controls. There are nine covariates, namely, systolic blood pressure, cumulative tobacco (kg), low-density lipoprotein cholesterol, adiposity, family history of heart disease, type-A behavior, obesity, current alcohol consumption, and age at onset.
We consider a logistic regression model. Thus, the full model considered here for model selection is
In Table 11, we present the variables that enter the model at each step of the forward selection procedure using the score test, Wald test, LRT, AIC, and BIC.
Table 11 shows that two covariates (low-density lipoprotein cholesterol and family history of heart disease) are significant out of the nine covariates using the forward selection procedure through the score test and the Wald test. However, the forward selection procedure through the LRT test and the BIC provides different parsimonious models.
Thus, the final model for these data is
6. Discussion
In this paper, we first develop a score test procedure for testing the significance of a single covariate in generalized linear models that encompass a range of symmetric and asymmetric probability distributions. This score test is compared, by extensive simulations, with the Wald test, the likelihood ratio test, and the F test.
The F test does well for data from the normal distribution. For data from Poisson and binomial distributions, the score test performs best.
Next, a comparative study of the performance of a few model selection procedures, such as the forward selection, the AIC, and the BIC, with respect to level and power, was conducted. The other two procedures, backward elimination and stepwise selection, are not included in our study, as in practice, these produce similar final models as those obtained by the forward selection procedure. Furthermore, although these model selection procedures are well-known, to be helpful to the readers, we provide a brief description in this paper.
The F test is well known, and as it does best for normally distributed data, it is used in model selection for data from this distribution. The score test performs best for Poisson and binomial data, and it has a very simple form. So, for data from these distributions, the score test is recommended for model selection.
Simulation studies show that the forward selection procedure using the score test performs best in terms of the level and power for data from all three distributions, although model selection using the F test performs very well for normally distributed data.
The development of the score test procedure for testing the significance of a single covariate and, subsequently, using it in model selection, is extended to over-dispersed Poisson and over-dispersed binomial models, specifically for the negative binomial and beta-binomial models.