Article

Modified BIC Criterion for Model Selection in Linear Mixed Models

1 Business Program, University of Guelph-Humber, Toronto, ON M9W 5L7, Canada
2 Department of Mathematics & Statistics, Faculty of Science, York University, Toronto, ON M3J 1P3, Canada
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(9), 2130; https://doi.org/10.3390/math11092130
Submission received: 5 April 2023 / Revised: 25 April 2023 / Accepted: 28 April 2023 / Published: 2 May 2023

Abstract:
Linear mixed-effects models are widely used to analyze clustered, hierarchical, and longitudinal data. Model selection in linear mixed models is more challenging than in linear models because the parameter vector includes both fixed effects and variance component parameters. When selecting the variance components of the random effects, the variances must be non-negative, so the parameters may lie on the boundary of the parameter space. Classical model selection methods therefore cannot be applied directly to this situation. In this article, we propose a modified BIC for model selection with linear mixed-effects models that handles the case in which the variance components lie on the boundary of the parameter space. In our simulations, the modified BIC performed better than the regular BIC in most cases for linear mixed models. The modified BIC was also applied to a real dataset to choose the most appropriate model.

1. Introduction

With the development of technology in recent times, more complex and large datasets have become available. Statisticians and researchers are also developing different statistical models to extract valuable information from data to aid decision-making processes. Classical multiple linear regression models can be used to model the relationship between variables. However, one of the assumptions of linear regression is that the errors are independent. Therefore, when the observations are correlated, as with longitudinal, clustered, and hierarchical data, linear regression models are no longer appropriate. A more flexible class of models for correlated data is the class of mixed-effects models, which has been used in many fields of application. Recently, Sheng et al. [1] compared linear models with linear mixed-effects models and showed that estimators from the latter are more advantageous in terms of both efficiency and unbiasedness. This shows the importance of applying linear mixed-effects models in longitudinal settings.
The correlation between observations may appear when data are collected hierarchically; for example, students may be sampled from the same school, and schools may be sampled within the same district. Consequently, students in the same school have the same teachers and school environment, and therefore, the observations are not independent of one another. Observations may be taken from members of the same family, where each family is considered a group or a cluster. As the observations are dependent, we can consider this clustered data. Another type of correlated data pertains to observations from the same subjects collected over time, such as repeated blood pressure measurements over a patient’s treatment period—an example of longitudinal data. Patients (or subjects) may vary in the number and date of the collected measurements. Since observations are recorded from the same individual over time, it is reasonable to assume that subject-specific correlations exist in the trend of the response variable over time. We wish to model the pattern of the response variable over time within subjects and the variation in the time trends between subjects. Linear mixed-effects models are used to model correlated data, accounting for the variability within and between clusters in clustered data or the variability within and between repeated measurements in longitudinal data.
Model selection is an important procedure in statistical analysis, allowing the most-appropriate model to be chosen from a set of potential candidate models. A desired model is parsimonious and can adequately fit the data in order to improve two important aspects: interpretability and predictability. In linear mixed models, identifying significant random effects is a challenging step in model selection, as it involves conducting a hypothesis test for whether or not the variance components of random effects are equal to zero. For example, we want to test $H_0: \sigma^2 = 0$ against $H_1: \sigma^2 > 0$, where the parameter space of $\sigma^2$ is $[0, \infty)$. Under the null hypothesis, the tested value of the variance component parameter lies on the boundary of the parameter space. This violates one of the classical regularity conditions that the true value of the parameter must be an interior point of the parameter space. Therefore, classical hypothesis tests such as the likelihood ratio, score, and Wald tests are no longer appropriate. We refer to this violation as the boundary issue. (Please see a graphical example of the boundary issue in Appendix A.3). When the boundary issue occurs, the asymptotic null distribution of the likelihood ratio test statistic does not follow a chi-squared distribution. Chernoff [2], Self and Liang [3], Stram and Lee [4], Azadbakhsh et al. [5], and Baey et al. [6] pointed out that, under some conditions on the parameter space and the likelihood functions, the asymptotic null distribution of the likelihood ratio test statistic is a mixture of chi-squared distributions. For instance, the asymptotic null distribution of the likelihood ratio test statistic for testing $H_0: \sigma^2 = 0$ against $H_1: \sigma^2 > 0$ is $\tfrac{1}{2}\chi^2_0 + \tfrac{1}{2}\chi^2_1$, not $\chi^2_1$ [4]. The distribution of a chi-bar-squared random variable depends on its mixing weights (Appendix A.2). Dykstra [7] discussed conditions on the weight distribution to ensure asymptotic normality for chi-bar-squared distributions. Shapiro [8] provided expressions to calculate the exact weights used in the mixture of chi-squared distributions for some special cases. However, in general, determining the exact weights used in the mixture of chi-squared distributions is challenging when the number of the variance components being tested under the null hypothesis is large, as the weights are not available in a tractable form (Baey et al. [6]).
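To make the consequence of the boundary issue concrete, the following minimal R sketch (ours, not part of the original analysis) compares the 5% critical value implied by the $\tfrac{1}{2}\chi^2_0 + \tfrac{1}{2}\chi^2_1$ mixture with the value that would be used if the statistic were treated as $\chi^2_1$.

```r
## For c > 0, P(0.5*chi2_0 + 0.5*chi2_1 > c) = 0.5 * P(chi2_1 > c),
## so the 5% critical value of the mixture solves P(chi2_1 > c) = 0.10.
c_mixture <- qchisq(0.90, df = 1)  # approx 2.71
c_naive   <- qchisq(0.95, df = 1)  # approx 3.84, the (incorrect) chi2_1 value
c(mixture = c_mixture, naive = c_naive)
```

Using the $\chi^2_1$ critical value is therefore conservative for this test, the same phenomenon that makes the regular BIC penalty too large for variance components on the boundary.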
There are a number of information criteria, such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), that were developed for model selection with linear mixed models by Vaida and Blanchard [9], Pauler [10], Jones [11], and Delattre and Poursat [12]. Other methods for identifying important fixed effects and random effects variance components, including shrinkage and permutation methods, were considered in Ibrahim et al. [13], Bondell et al. [14], Peng and Lu [15], and Drikvandi et al. [16].
The BIC is susceptible to the boundary issue. If we use the regular BIC in linear mixed models, that is, we treat this case as if there were no constraints on the model's parameter vector, then the penalty term of the regular BIC includes all the components of the parameter vector. Therefore, the regular BIC overestimates the number of degrees of freedom of the linear mixed model (which we refer to as the model complexity in this article) and does not take into account the fact that the variance components are constrained and bounded below by 0. Consequently, the regular BIC tends to choose under-fitted linear mixed models. Several versions of the modified BIC have been proposed for model selection in linear mixed models [10,11,17]. However, to our knowledge, none of the current BICs can directly deal with the boundary issue.
The main objective of this article is to introduce a modified BIC for model selection when the true values of the variance component parameters lie on the boundary of the parameter space, allowing the most-appropriate model to be chosen from a set of candidate linear mixed models. The general idea of how our proposed method solves the boundary problem is as follows. From the previous literature, we know that the asymptotic null distribution of the likelihood ratio test statistic for testing the nullity of several variances is a chi-bar-squared distribution (Baey et al. [6]). Based on this theoretical result, we take the mean of the chi-bar-squared distribution and include this mean in the complexity of the model. When random effects are correlated, calculating the weights of the chi-bar-squared distribution is not straightforward, as the weights depend on a cone C that contains the set of positive definite matrices. Describing the set of positive definite matrices explicitly using constraints on the components of the random effects covariance matrix is almost impossible. Thus, calculating the weights of the chi-bar-squared distribution for this case is not an easy task and has not been addressed in the literature. Our solution to this problem is to bound the cone C by a bigger cone. The bigger cone has a much simpler structure and allows us to calculate the weights of the chi-bar-squared distribution. The rest of the paper is arranged as follows. In Section 2, we develop the methodology. The simulation and application are provided in Section 3 and Section 4. We conclude with a brief discussion in Section 5.

2. Methodology

2.1. Model Setup and Definitions

Consider the linear mixed model introduced in Laird and Ware [18]:
$$y_i = X_i\beta + Z_i b_i + \epsilon_i, \tag{1}$$
for $i = 1, \ldots, N$, where $y_i$ denotes the $n_i$-dimensional vector of response measurements for cluster $i$; $\beta$ is a $p \times 1$ fixed effects parameter vector; $X_i$ is an $n_i \times p$ matrix of covariates for the fixed effects; $Z_i$ is an $n_i \times q$ matrix of covariates for the random effects; and $b_i$ denotes the random effects vector of the $i$-th cluster, assumed to follow a multivariate normal distribution $N_q(0, D)$, where $D$ is a $q \times q$ covariance matrix. The random effects $b_1, \ldots, b_N$ are assumed to be independent. Fixed effects are used to model the population mean, while random effects are used to model between-cluster variation in the response. The vector of random errors $\epsilon_i$ is assumed to follow a multivariate normal distribution $N(0, \sigma^2_{\epsilon} I_{n_i})$, where $I_{n_i}$ denotes the $n_i \times n_i$ identity matrix. It is assumed that $b_i$ and $\epsilon_i$ are pairwise independent for $i = 1, \ldots, N$. The marginal distribution of $y_i$ is $N(X_i\beta, V_i)$, where $V_i = Z_i D Z_i^T + \sigma^2_{\epsilon} I_{n_i}$.
Let $\tau$ denote the vector of distinct variance and covariance components in matrix $D$, and let $\eta = (\tau^T, \sigma^2_{\epsilon})^T$. The vector of parameters for this model is $\theta = (\beta^T, \eta^T)^T$. We assume that the response vectors $y_1, \ldots, y_N$ from the $N$ clusters are independent random observations. Given a clustered dataset, we wish to choose a linear mixed model that fits the data well and is also parsimonious.
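As an illustration of the setup above, the following R sketch simulates clustered data from model (1) with a random intercept and a random slope and forms the marginal covariance $V_i = Z_i D Z_i^T + \sigma^2_{\epsilon} I_{n_i}$. All dimensions and parameter values here are illustrative assumptions, not those used later in the paper.

```r
set.seed(1)
N  <- 100                               # number of clusters
ni <- 5                                 # observations per cluster (n_i)
beta <- c(1, 2)                         # fixed effects: intercept and slope
D    <- diag(c(4, 1))                   # q x q covariance of b_i (here diagonal)
sig2 <- 1                               # residual variance sigma_eps^2

sim_cluster <- function(i) {
  x  <- rnorm(ni)
  Xi <- cbind(1, x)                     # n_i x p fixed effects design
  Zi <- Xi                              # random intercept and slope share the design
  bi <- drop(t(chol(D)) %*% rnorm(2))   # b_i ~ N_q(0, D)
  yi <- drop(Xi %*% beta + Zi %*% bi) + rnorm(ni, sd = sqrt(sig2))
  data.frame(cluster = i, x = x, y = yi)
}
dat <- do.call(rbind, lapply(seq_len(N), sim_cluster))

## Marginal covariance of one cluster's responses implied by the model
Zi <- cbind(1, rnorm(ni))
Vi <- Zi %*% D %*% t(Zi) + sig2 * diag(ni)   # V_i = Z_i D Z_i' + sigma^2 I
```

A model of this form could be fit, for example, with lme4::lmer(y ~ x + (1 + x || cluster), data = dat), although the fits used in this paper are described in Section 3.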
Definition 1 
(Definition of an approximating cone [2]). Let $\Theta \subseteq \mathbb{R}^p$ and $\theta_0 \in \Theta$. The set $\Theta$ is said to be approximated by a cone $A$ at $\theta_0$ if $d(y, A) = o(\|y - \theta_0\|)$ for all $y \in \Theta$ and $d(x, \Theta) = o(\|x - \theta_0\|)$ for all $x \in A$, where $d(x, \Omega) = \inf_{y \in \Omega} \|x - y\|$ is the distance between the point $x$ and its projection onto the set $\Omega$. In this case, $A$ is called the approximating cone of $\Theta$ at $\theta_0$, and $\Theta$ is said to be Chernoff-regular at $\theta_0$.
Definition 2 
(Definition of a tangent cone [19]). The tangent cone $T_{\Theta}(\theta_0)$ of a set $\Theta$ at a point $\theta_0 \in \Theta$ is the set of limits of sequences $t_n^{-1}(\theta_n - \theta_0)$, where the $t_n$ are positive real numbers with $t_n \to 0$ and the $\theta_n \in \Theta$ converge to $\theta_0$.
Definition 3 
(Definition of the chi-bar-squared distribution [19]). Let $C \subseteq \mathbb{R}^m$ be a closed convex cone, and let $Z \sim N_m(0, V)$, where $V$ is a positive definite matrix. $\bar{\chi}^2(V, C)$ is a random variable with the same distribution as $Z^T V^{-1} Z - \min_{\theta \in C} (Z - \theta)^T V^{-1} (Z - \theta)$. Therefore, we write
$$\bar{\chi}^2(V, C) = \sum_{i=0}^{m} w_i(m, V, C)\, \chi^2_i,$$
where the $w_i(m, V, C)$, $i = 0, \ldots, m$, are some non-negative numbers and $\sum_{i=0}^{m} w_i(m, V, C) = 1$.
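The next R sketch illustrates Definition 3 in the simplest case, assuming (purely for illustration) $V = I_m$ and $C = \mathbb{R}_+^m$: the projection onto $C$ is then the componentwise positive part, and the mixing weights reduce to Binomial$(m, 1/2)$ probabilities.

```r
set.seed(1)
m <- 3
B <- 1e5
Z <- matrix(rnorm(B * m), B, m)            # Z ~ N_m(0, I)
## Z'Z - min_{theta in C} ||Z - theta||^2 equals the sum of squared positive parts
chibar <- rowSums(pmax(Z, 0)^2)

## Weights: probability that exactly i coordinates of Z are positive
w_hat <- table(factor(rowSums(Z > 0), levels = 0:m)) / B
round(w_hat, 3)                            # close to dbinom(0:m, m, 0.5)
mean(chibar)                               # close to sum((0:m) * dbinom(0:m, m, 0.5)) = m/2
```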

2.2. Proposed Methods

In this section, we introduce a modified BIC for linear mixed model selection. In linear mixed models, model selection includes the selection of the regression parameters $\beta$ (fixed effects) and of the variance components of the random effects. We first derive a modified BIC for choosing random effects under the assumption that the random effects are independent. Then, we propose a modified BIC for choosing random effects when the random effects are assumed to be correlated. Lastly, we propose a modified BIC for choosing both fixed effects and random effects simultaneously. We thus consider two cases: the covariance matrix of the random effects $b_i$ is either diagonal or a full matrix.
Let $\mathcal{M} = \{M_k : k \geq 1\}$ be a countable set of possible candidate linear mixed models. Let $\theta_k$ denote the vector of parameters of model $M_k$, and let $d_k$ be the complexity of model $M_k$. Assume that $d_k$ can be calculated and that $d_k < d_l$ if $M_k \subset M_l$. Let $M_T$ be the model that generates the data (called the true model), with parameter $\theta_T$ whose true value is $\theta_{T,0}$. Any model $M_k$ that is more complex than the true model is called an over-fitting model; that is, $M_T \subset M_k$, meaning that the components of $\theta_T$ form a proper subset of the components of $\theta_k$. Let $\mathcal{M}^+$ be the set of all over-fitting models. An under-fitting model $M_k$ is a model such that the components of $\theta_T$ are not a subset of the components of its parameter vector $\theta_k$. Let $\mathcal{M}^-$ be the set of all under-fitting models. Assume that model $M_k$ has parameter vector $\theta_k = (\beta_k^T, \tau_k^T, \sigma^2_{\epsilon,k})^T$, where $\beta_k$ is the vector of fixed effects parameters, which includes the population regression coefficients; $\tau_k$ contains the distinct variance and covariance elements of matrix $D$; and $\sigma^2_{\epsilon,k}$ is the variance of the random error vector $\epsilon_k$. For a general covariance matrix, model $M_k$ is uniquely defined by its non-zero parameters in $\beta$ and the non-zero variance components on the diagonal of matrix $D$. If $d_{ii} = 0$, then all elements in row $i$ and column $i$ of this matrix are set to 0.

2.2.1. Modified BIC for Choosing Random Effects Assuming That the Random Effects Are Independent

In this section, we considered the case where the covariance matrix of random effects, D, is a diagonal matrix. Here, τ is a vector of variances on the diagonal of matrix D.
Lemma 1. 
When $D$ is a diagonal matrix, under Assumptions (C1)–(C4) (Appendix A.1), assume that we wish to test model $M_k$ (with $\tau_k = (d_1, \ldots, d_k)$) against model $M_1$ (with $\tau_1 = (d_1, 0, \ldots, 0)$), and that both models have the same fixed effects part. Then the null limiting distribution of the likelihood ratio test statistic is
$$\bar{\chi}^2(\nu(\theta^*)^{-1}, C) = \sum_{i=0}^{k-1} w_i(m, \nu(\theta^*)^{-1}, C)\, \chi^2_i, \tag{2}$$
where $C = \{0\}^p \times \{0\} \times \mathbb{R}_+^{k-1} \times \{0\}$; the $w_i(m, \nu(\theta^*)^{-1}, C)$, $i = 0, \ldots, k-1$, are some non-negative numbers with $\sum_{i=0}^{k-1} w_i(m, \nu(\theta^*)^{-1}, C) = 1$; $\nu(\theta^*)$ is a positive definite matrix such that $N^{-1/2}\,\nabla l_N(\theta^*) \xrightarrow{d} N_m(0, \nu(\theta^*))$ and $-N^{-1}\,\nabla^2 l_N(\theta^*) \xrightarrow{a.s.} \nu(\theta^*)$; and $m$ is the dimension of $\theta$. Here $\theta^*$ denotes the true value of the parameter $\theta$, and $l_N(\theta; y)$ denotes the marginal log-likelihood function of the linear mixed model (1).
Proof. 
We applied the results from Baey et al. [6] on testing the nullity of $r$ variance components of the $q \times q$ diagonal covariance matrix $D$ using the likelihood ratio test statistic, assuming that the variances that are not being tested are strictly positive. Without loss of generality, assume that matrix $D$ can be written as $D = \begin{pmatrix} D_{11} & 0 \\ 0 & D_{22} \end{pmatrix}$, where $D_{11} = \mathrm{diag}(d_1, \ldots, d_{q-r})$ and $D_{22} = \mathrm{diag}(d_{q-r+1}, \ldots, d_q)$. The parameter is $\theta = (\beta^T, \tau^T, \sigma^2_{\epsilon})^T \in \Theta \subseteq \mathbb{R}^m$ with $\tau = (d_1, \ldots, d_q)^T$. Consider the hypothesis test $H_0: D = \begin{pmatrix} D_{11} & 0 \\ 0 & 0 \end{pmatrix}$ with $D_{11}$ positive definite, versus $H_1$: $D$ is positive definite.
The parameter spaces under the null and alternative hypotheses and their corresponding tangent cones are
$$\Theta_0 = \{\theta \in \mathbb{R}^m : \beta \in \mathbb{R}^p;\ d_1 > 0, \ldots, d_{q-r} > 0;\ d_{q-r+1} = 0, \ldots, d_q = 0;\ \sigma^2_{\epsilon} > 0\}, \qquad T_{\Theta_0}(\theta^*) = \mathbb{R}^p \times \mathbb{R}^{q-r} \times \{0\}^r \times \mathbb{R},$$
$$\Theta = \{\theta \in \mathbb{R}^m : \beta \in \mathbb{R}^p;\ d_1 > 0, \ldots, d_{q-r} > 0;\ d_{q-r+1} \geq 0, \ldots, d_q \geq 0;\ \sigma^2_{\epsilon} > 0\}, \qquad T_{\Theta}(\theta^*) = \mathbb{R}^p \times \mathbb{R}^{q-r} \times \mathbb{R}_+^r \times \mathbb{R}.$$
In this case, $T_{\Theta_0}(\theta^*)$ is a linear subspace of $T_{\Theta}(\theta^*)$. Therefore, $C = T_{\Theta}(\theta^*) - T_{\Theta_0}(\theta^*) = \{0\}^p \times \{0\}^{q-r} \times \mathbb{R}_+^r \times \{0\}$. $C$ is contained in a linear subspace of dimension $r$; thus, $w_i(m, \nu(\theta^*)^{-1}, C) = 0$ for $i = r+1, \ldots, m$. Assume that the null hypothesis holds and $\theta^* \in \Theta_0$. Baey et al. [6] pointed out that the asymptotic null distribution of the log-likelihood ratio test statistic is a mixture of chi-squared distributions with degrees of freedom ranging from 0 to $r$, denoted by
$$\bar{\chi}^2(\nu(\theta^*)^{-1}, C) = \sum_{i=0}^{r} w_i(m, \nu(\theta^*)^{-1}, C)\, \chi^2_i, \tag{3}$$
where $\chi^2_i$ is a chi-squared distribution with $i$ degrees of freedom and $\nu(\theta^*)$ is some positive definite matrix such that $N^{-1/2}\,\nabla l_N(\theta^*) \xrightarrow{d} N(0, \nu(\theta^*))$ and $-N^{-1}\,\nabla^2 l_N(\theta^*) \xrightarrow{a.s.} \nu(\theta^*)$.
We applied this result to our case with $m = p + k + 1$, $q = k$, and $r = k - 1$. Thus, based on (3), the null limiting distribution of the likelihood ratio test statistic is
$$\bar{\chi}^2(\nu(\theta^*)^{-1}, C) = \sum_{i=0}^{k-1} w_i(m, \nu(\theta^*)^{-1}, C)\, \chi^2_i,$$
where $C = \{0\}^p \times \{0\} \times \mathbb{R}_+^{k-1} \times \{0\}$; the $w_i(m, \nu(\theta^*)^{-1}, C)$, $i = 0, \ldots, k-1$, are some non-negative numbers with $\sum_{i=0}^{k-1} w_i(m, \nu(\theta^*)^{-1}, C) = 1$; $\nu(\theta^*)$ is some positive definite matrix such that $N^{-1/2}\,\nabla l_N(\theta^*) \xrightarrow{d} N_m(0, \nu(\theta^*))$ and $-N^{-1}\,\nabla^2 l_N(\theta^*) \xrightarrow{a.s.} \nu(\theta^*)$; and $m$ is the dimension of $\theta$. □
We now take the expectation of the chi-bar-squared distribution in Equation (2) and include it in the complexity of model $M_k$:
$$E\big[\bar{\chi}^2(\nu(\theta^*)^{-1}, C)\big] = \sum_{i=0}^{k-1} w_i(m, \nu(\theta^*)^{-1}, C)\, i.$$
We propose the following modified BIC:
$$BIC(M_k) = -2\, l(\hat{\theta}_k; y) + d_k \log(n),$$
where $\hat{\theta}_k$ is the maximum likelihood estimator of $\theta_k$ in model $M_k$; $n = \sum_{i=1}^{N} n_i$; and $d_k = p + 1.5 + \sum_{i=0}^{k-1} w_i(m, \nu(\theta^*)^{-1}, C)\, i$ for $k > 1$, $d_k = p + 1.5$ for $k = 1$, and $d_k = p + 1$ for $k = 0$. The first term, $-2\, l(\hat{\theta}_k; y)$, measures the goodness-of-fit of model $M_k$, and the second term, $d_k \log(n)$, is the penalty for the model complexity, which ensures that the selected model is parsimonious.
The rationale for choosing the complexity $d_k$ of model $M_k$ when $k > 1$ is as follows: $p$ is the number of fixed effects parameters; 1 accounts for the $\sigma^2_{\epsilon}$ parameter and 0.5 for the random effect assumed to be in the model (such as the random intercept); the rest of $d_k$ is the expectation of the chi-bar-squared distribution in Equation (2). When $k = 1$, $d_1 = p + 1.5$ is the complexity of model $M_1$, which is the model with fixed effects and only one random effect (such as a random intercept). When $k = 0$, $d_0 = p + 1$ is the complexity of model $M_0$, which is the model with fixed effects and no random effects. In this case, the criterion with $d_0$ is exactly the regular BIC for multiple regression models. For example, let $M_3$ be a model with three independent random effects, $D = \mathrm{diag}(\sigma^2_0, \sigma^2_1, \sigma^2_2)$ and $\tau = (\sigma^2_0, \sigma^2_1, \sigma^2_2)$.
We want to test $H_0: \sigma^2_0 > 0,\ \sigma^2_1 = 0,\ \sigma^2_2 = 0$ versus $H_1: \sigma^2_0 > 0,\ \sigma^2_1 > 0,\ \sigma^2_2 > 0$. In this example, $\theta = (\beta^T, \tau^T, \sigma^2_{\epsilon})^T$, $m = p + 3 + 1$, $k = 3$, and $r = 2$. Therefore, the asymptotic null distribution of the log-likelihood ratio test statistic is
$$\bar{\chi}^2(\nu(\theta^*)^{-1}, C) = \sum_{i=0}^{2} w_i(m, \nu(\theta^*)^{-1}, C)\, \chi^2_i,$$
where $C = \{0\}^p \times \{0\} \times \mathbb{R}_+^2 \times \{0\}$.
Cone $C$ can be written as $C = \{\theta \in \mathbb{R}^m : R\theta \geq 0\}$, where $R = (0_{2\times(p+1)} \mid I_2 \mid 0_{2\times 1})$ is a $2 \times m$ matrix and $I_2$ is the identity matrix of order two. The chi-bar-squared weights satisfy $w_i(m, \nu(\theta^*)^{-1}, C) = w_i(r, R\,\nu(\theta^*)^{-1} R^T, \mathbb{R}_+^2)$ by Proposition 3.6.1 of [19]. The matrix $\nu(\theta^*)$ is approximated by $\Gamma = N^{-1} I_N(\hat{\theta}_k)$, where $\hat{\theta}_k$ is the maximum likelihood estimator of $\theta_k$ in model $M_k$ and $I_N(\theta)$ is the Fisher information matrix. The chi-bar-squared weights $w_i(r, R\Gamma^{-1}R^T, \mathbb{R}_+^2)$ can be calculated using the function “con_weights_boot” in the R package “restriktor” of Vanbrabant et al. [20]. In this example, we assumed that $R\Gamma^{-1}R^T = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}$. Then, we obtained the weights $w_0 = 0.334$, $w_1 = 0.503$, and $w_2 = 0.163$. Thus,
$$\bar{\chi}^2(\nu(\theta^*)^{-1}, C) = 0.334\, \chi^2_0 + 0.503\, \chi^2_1 + 0.163\, \chi^2_2.$$
We note that, in theory, $w_1$ must be 0.5; in our simulation, $w_1 = 0.503$. The expectation of this chi-bar-squared distribution is $0.334(0) + 0.503(1) + 0.163(2) = 0.829$. The complexity of this model is therefore $d_3 = p + 1.5 + 0.829$.
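For a general covariance matrix, the weights can also be approximated by direct Monte Carlo, as in the following sketch: simulate $Z \sim N_r(0, V)$, project it onto $\mathbb{R}_+^r$ in the $V^{-1}$ metric by enumerating active sets (cheap for small $r$), and record the dimension of the face containing the projection. The $2 \times 2$ matrix below is illustrative, and the values need not match the weights quoted above, which were computed with restriktor under its own parameterization of the matrix argument.

```r
## Projection of z onto the non-negative orthant in the metric defined by W,
## obtained by enumerating active sets (exact for this small convex quadratic problem).
project_orthant <- function(z, W) {
  r <- length(z)
  best <- rep(0, r)                                   # candidate: all components at 0
  best_obj <- drop(t(z) %*% W %*% z)
  masks <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), r)))[-1, , drop = FALSE]
  for (j in seq_len(nrow(masks))) {
    idx <- which(masks[j, ])
    th  <- rep(0, r)
    th[idx] <- solve(W[idx, idx, drop = FALSE], W[idx, , drop = FALSE] %*% z)
    if (all(th[idx] >= 0)) {                          # keep only feasible candidates
      obj <- drop(t(z - th) %*% W %*% (z - th))
      if (obj < best_obj) { best <- th; best_obj <- obj }
    }
  }
  best
}

set.seed(1)
V <- matrix(c(1, 0.5, 0.5, 1), 2, 2)   # illustrative 2 x 2 matrix, not the paper's Gamma
W <- solve(V)
L <- t(chol(V))
face_dim <- replicate(20000, {
  z <- drop(L %*% rnorm(2))            # Z ~ N_2(0, V)
  sum(project_orthant(z, W) > 1e-8)    # dimension of the face hit by the projection
})
w_hat <- table(factor(face_dim, levels = 0:2)) / length(face_dim)
e_chibar <- sum((0:2) * as.numeric(w_hat))   # expectation of the chi-bar-squared
p <- 2                                        # illustrative number of fixed effects
d3 <- p + 1.5 + e_chibar                      # complexity, as in d_3 = p + 1.5 + 0.829 above
```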
Theorem 1. 
Assume that Assumptions (C1)–(C4) in Appendix A.1 are satisfied, then
$$\lim_{n \to \infty} P\big(BIC(M_T) < BIC(M_k)\big) = 1 \ \text{ for all } M_k \in \mathcal{M}^+, \quad \text{and} \quad \lim_{n \to \infty} P\big(BIC(M_T) < BIC(M_k)\big) = 1 \ \text{ for all } M_k \in \mathcal{M}^-.$$
Proof. 
We used $l(\hat{\theta}_k; y)$ instead of $l_N(\hat{\theta}_k; y)$ for the convenience of exposition.
Case 1: For any under-fitting model $M_k \in \mathcal{M}^-$, we want to prove that $\lim_{n \to \infty} P\big(BIC(M_k) - BIC(M_T) > 0\big) = 1$. We have that
$$BIC(M_k) - BIC(M_T) = -2\big[l(\hat{\theta}_k; y) - l(\hat{\theta}_T; y)\big] + (d_k - d_T)\log(n),$$
$$-2\big[l(\hat{\theta}_k; y) - l(\hat{\theta}_T; y)\big] = -2\big[l(\hat{\theta}_k; y) - l(\theta_{k,0}; y)\big] + 2\big[l(\hat{\theta}_T; y) - l(\theta_{T,0}; y)\big] + 2\big[l(\theta_{T,0}; y) - l(\theta_{k,0}; y)\big] - 2 E_{T,0}\big[l(\theta_{T,0}; Y) - l(\theta_{k,0}; Y)\big] + 2 E_{T,0}\big[l(\theta_{T,0}; Y) - l(\theta_{k,0}; Y)\big].$$
We also have that $l(\hat{\theta}_k; y) - l(\theta_{k,0}; y) = o_p(1)$ and $l(\hat{\theta}_T; y) - l(\theta_{T,0}; y) = o_p(1)$, because $\hat{\theta}_k \xrightarrow{p} \theta_{k,0}$ and $\hat{\theta}_T \xrightarrow{p} \theta_{T,0}$ (as shown in the proof of Theorem 2 in Baey et al. [6]) and the function $l(\theta; y)$ is continuous with respect to $\theta$. Furthermore, under Assumption C4(ii), $\frac{1}{N}\big(l(\theta_{T,0}; y) - E_{T,0}[l(\theta_{T,0}; Y)]\big) \xrightarrow{p} 0$ and $\frac{1}{N}\big(l(\theta_{k,0}; y) - E_{T,0}[l(\theta_{k,0}; Y)]\big) \xrightarrow{p} 0$. Thus,
$$\frac{1}{N}\Big(l(\theta_{T,0}; y) - l(\theta_{k,0}; y) - E_{T,0}\big[l(\theta_{T,0}; Y) - l(\theta_{k,0}; Y)\big]\Big) \xrightarrow{p} 0,$$
and therefore, $l(\theta_{T,0}; y) - l(\theta_{k,0}; y) - E_{T,0}\big[l(\theta_{T,0}; Y) - l(\theta_{k,0}; Y)\big] = o_p(N)$.
The last term can be evaluated as
$$E_{T,0}\big[l(\theta_{T,0}; Y) - l(\theta_{k,0}; Y)\big] = \sum_{i=1}^{N} E_{T,0}\big[\log f_i(Y_i; \theta_{T,0}) - \log f_i(Y_i; \theta_{k,0})\big] = \sum_{i=1}^{N} E_{T,0}\Big[\log \frac{f_i(Y_i; \theta_{T,0})}{f_i(Y_i; \theta_{k,0})}\Big] = O_p(N).$$
This is because $E_{T,0}\big[\log \frac{f_i(Y_i; \theta_{T,0})}{f_i(Y_i; \theta_{k,0})}\big]$ is the Kullback–Leibler distance between $f_i(Y_i; \theta_{k,0})$ and $f_i(Y_i; \theta_{T,0})$, which is positive and finite by Assumption C4(i).
Assume that the cluster sample sizes $n_1, \ldots, n_N$ are uniformly bounded (Assumption C3); then the $O_p(N)$ term dominates $(d_k - d_T)\log(n)$ as $N \to \infty$. Thus, $BIC(M_k) - BIC(M_T) > 0$, and for all $M_k \in \mathcal{M}^-$,
$$\lim_{n \to \infty} P\big(BIC(M_T) < BIC(M_k)\big) = 1.$$
Case 2: For any over-fitting model $M_k \in \mathcal{M}^+$, we also prove that $\lim_{n \to \infty} P\big(BIC(M_k) - BIC(M_T) > 0\big) = 1$. Without loss of generality, assume that $\theta_T = (\beta_T^T, \psi_T^T, \underline{0}^T, \sigma^2_{\epsilon,T})^T$ and $\theta_k = (\beta_k^T, \psi_{k,1}^T, \psi_{k,2}^T, \sigma^2_{\epsilon,k})^T$, where $\psi_T$ has the same dimension as $\psi_{k,1}$ and $\underline{0}$ has the same dimension as $\psi_{k,2}$. Let $r$ be the dimension of $\psi_{k,2}$, i.e., $\dim(\psi_{k,2}) = r$, and all elements of $\underline{0}$ are 0. We have that
$$BIC(M_k) - BIC(M_T) = -2\big[l(\hat{\theta}_k; y) - l(\hat{\theta}_T; y)\big] + (d_k - d_T)\log(n).$$
Then, $-2\big(l(\hat{\theta}_T; y) - l(\hat{\theta}_k; y)\big)$ is the likelihood ratio test statistic of the following hypothesis test:
$$H_0: \psi_{k,1} > 0,\ \psi_{k,2} = 0 \quad \text{versus} \quad H_1: \psi_{k,1} > 0,\ \psi_{k,2} > 0.$$
According to Baey et al. [6], under $H_0$, the asymptotic distribution of $-2\big(l(\hat{\theta}_T; y) - l(\hat{\theta}_k; y)\big)$ is
$$\sum_{i=0}^{r} w_i(m, \nu(\theta^*)^{-1}, C)\, \chi^2_i.$$
Therefore, $2\big(l(\hat{\theta}_k; y) - l(\hat{\theta}_T; y)\big) = O_p(1)$, according to Theorem 2.4 of [21]. We also have that
$$-2\big(l(\hat{\theta}_T; y) - l(\hat{\theta}_k; y)\big) = -2\Big\{\big[l(\hat{\theta}_T; y) - l(\hat{\theta}_1; y)\big] - \big[l(\hat{\theta}_k; y) - l(\hat{\theta}_1; y)\big]\Big\} = 2\big[l(\hat{\theta}_1; y) - l(\hat{\theta}_T; y)\big] - 2\big[l(\hat{\theta}_1; y) - l(\hat{\theta}_k; y)\big],$$
$$E\Big[-2\big(l(\hat{\theta}_T; Y) - l(\hat{\theta}_k; Y)\big)\Big] = E\Big[2\big(l(\hat{\theta}_1; Y) - l(\hat{\theta}_T; Y)\big)\Big] - E\Big[2\big(l(\hat{\theta}_1; Y) - l(\hat{\theta}_k; Y)\big)\Big] = d_k - d_T,$$
where $l(\hat{\theta}_1; y)$ is the maximum log-likelihood of the simplest model, that is, the model with only the random intercept. Therefore,
$$E\Big[-2\big(l(\hat{\theta}_T; Y) - l(\hat{\theta}_k; Y)\big)\Big] = d_k - d_T.$$
On the other hand, $-2\big[l(\hat{\theta}_T; y) - l(\hat{\theta}_k; y)\big]$ asymptotically follows a mixture of chi-squared distributions. Therefore, $E\big[-2\big(l(\hat{\theta}_T; Y) - l(\hat{\theta}_k; Y)\big)\big]$ must be positive, and hence, $d_k - d_T > 0$. Thus, $BIC(M_k) - BIC(M_T) \to \infty$ as $n \to \infty$ and
$$\lim_{n \to \infty} P\big(BIC(M_k) - BIC(M_T) > 0\big) = 1$$
for $M_k \in \mathcal{M}^+$. This completes the proof of Theorem 1. □
Given a set of candidate models, we calculated the proposed BIC value for each model. Then, the selected model is the one that minimizes the proposed BIC.
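A minimal sketch of this selection step is given below. It assumes a data frame dat with columns y, x1, z2, and cluster (hypothetical names), uses lme4::lmer for fitting, and takes the complexities d_k as given (for example, computed from the chi-bar-squared weights as above); the candidate formulas and the numeric d_k values are illustrative only, not the paper's code.

```r
library(lme4)

## Candidate models: the smallest candidate keeps only a random intercept.
candidates <- list(
  M1 = y ~ x1 + (1 | cluster),
  M2 = y ~ x1 + (1 | cluster) + (0 + x1 | cluster),
  M3 = y ~ x1 + (1 | cluster) + (0 + x1 | cluster) + (0 + z2 | cluster)
)
d_k <- c(M1 = 3.5, M2 = 4.0, M3 = 4.3)   # illustrative complexities (p + 1.5 + E[chi-bar^2])

modified_bic <- function(fit, d_k) {
  n <- nobs(fit)                          # total number of observations, n = sum(n_i)
  as.numeric(-2 * logLik(fit) + d_k * log(n))
}

bic_star <- mapply(function(form, d) {
  fit <- lmer(form, data = dat, REML = FALSE)   # maximum likelihood, as required here
  modified_bic(fit, d)
}, candidates, d_k)

names(which.min(bic_star))                # the selected model minimizes the proposed BIC
```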

2.2.2. Modified BIC for Choosing Random Effects Assuming That the Random Effects Are Correlated

In this section, we introduce a modified BIC for selecting linear mixed models with correlated random effects. We still focus on selecting only the random effects. In the parameter vector $\theta = (\beta^T, \tau^T, \sigma^2_{\epsilon})^T$, $\tau$ is the parameter of interest; $\beta$ and $\sigma^2_{\epsilon}$ are considered nuisance parameters. We now consider the linear mixed model (1) in which the covariance matrix of the random effects $b_i$ is a full matrix. Therefore, the vector $\tau$ contains all distinct variances and covariances of matrix $D$.
Lemma 2. 
When $D$ is a full matrix and under Assumptions (C1)–(C4) in Appendix A.1, assume that we test model $M_k$ against model $M_1$, where $M_1$ contains only one random effect (a random intercept), $M_k$ contains $k$ random effects including a random intercept, and both models have the same fixed effects part. Then the null limiting distribution of the likelihood ratio test statistic is
$$\bar{\chi}^2(\nu(\theta^*)^{-1}, C) = \sum_{i=k-1}^{(k-1)(k+2)/2} w_i(m, \nu(\theta^*)^{-1}, C)\, \chi^2_i,$$
where $C = \{0\}^p \times \{0\} \times \mathcal{S}_+^{k-1} \times \{0\}$; $m$ is the dimension of $\theta$; the $w_i(m, \nu(\theta^*)^{-1}, C)$, $i = k-1, \ldots, (k-1)(k+2)/2$, are some non-negative numbers with $\sum_{i=k-1}^{(k-1)(k+2)/2} w_i(m, \nu(\theta^*)^{-1}, C) = 1$; $\chi^2_i$ is a chi-squared distribution with $i$ degrees of freedom; and $\nu(\theta^*)$ is a positive definite matrix such that $N^{-1/2}\,\nabla l_N(\theta^*) \xrightarrow{d} N_m(0, \nu(\theta^*))$ and $-N^{-1}\,\nabla^2 l_N(\theta^*) \xrightarrow{a.s.} \nu(\theta^*)$. $\mathcal{S}_+^{r}$ denotes the set of symmetric positive semi-definite matrices of size $r \times r$.
Proof. 
When $D$ is a full matrix, the number of distinct variances and covariances is $q(q+1)/2$. We also applied the results from Baey et al. [6] on testing the nullity of $r$ variance components of the $q \times q$ covariance matrix $D$ when this matrix is a full matrix. Assume that matrix $D$ is written as $D = \begin{pmatrix} D_{11} & D_{12} \\ D_{12}^T & D_{22} \end{pmatrix}$, where the size of $D_{11}$ is $(q-r) \times (q-r)$ and the size of $D_{22}$ is $r \times r$. Consider the hypothesis test $H_0: D_{11} > 0,\ D_{12} = 0,\ D_{22} = 0$ versus $H_1$: $D$ is a positive definite matrix.
The parameter space under the null hypothesis is
$$\Theta_0 = \{\theta \in \mathbb{R}^m : \beta \in \mathbb{R}^p;\ D_{11} > 0;\ D_{12} = 0;\ D_{22} = 0;\ \sigma^2_{\epsilon} > 0\} = \mathbb{R}^p \times \mathcal{S}_+^{q-r} \times \{0\}^{r(q-r)} \times \{0\}^{r(r+1)/2} \times \mathbb{R}_+,$$
where $\mathcal{S}_+^{q-r}$ is the set of symmetric positive semi-definite matrices of size $(q-r) \times (q-r)$.
Assume that the null hypothesis holds and $\theta^* \in \Theta_0$. Then, applying the results of Baey et al. [6], we obtain the tangent cone to $\Theta_0$ at $\theta^*$:
$$T_{\Theta_0}(\theta^*) = \mathbb{R}^p \times \mathcal{S}^{q-r} \times \{0\}^{r(q-r)} \times \{0\}^{r(r+1)/2} \times \mathbb{R} = \mathbb{R}^p \times \mathbb{R}^{(q-r)(q-r+1)/2} \times \{0\}^{r(q-r)} \times \{0\}^{r(r+1)/2} \times \mathbb{R},$$
where $\mathcal{S}^{q-r}$ is the set of symmetric matrices of size $(q-r) \times (q-r)$. Furthermore,
$$\Theta = \{\theta \in \mathbb{R}^m : \beta \in \mathbb{R}^p;\ D \in \mathcal{S}_+^{q};\ \sigma^2_{\epsilon} > 0\} = \mathbb{R}^p \times \mathcal{S}_+^{q} \times \mathbb{R}_+.$$
According to the results of Baey et al. [6], the tangent cone to $\Theta$ at $\theta^*$ is
$$T_{\Theta}(\theta^*) = \mathbb{R}^p \times \mathbb{R}^{(q-r)(q-r+1)/2} \times \mathbb{R}^{r(q-r)} \times \mathcal{S}_+^{r} \times \mathbb{R},$$
where $\mathcal{S}_+^{r}$ is the set of symmetric positive semi-definite matrices of size $r \times r$. Since $T_{\Theta_0}(\theta^*)$ is a linear subspace of $T_{\Theta}(\theta^*)$, the asymptotic null distribution of the likelihood ratio test statistic for the above hypothesis test is $\bar{\chi}^2(\nu(\theta^*)^{-1}, C)$, where $C = T_{\Theta}(\theta^*) - T_{\Theta_0}(\theta^*) = \{0\}^p \times \{0\}^{(q-r)(q-r+1)/2} \times \mathbb{R}^{r(q-r)} \times \mathcal{S}_+^{r} \times \{0\}$.
When $D$ is a full matrix, under the null hypothesis, Baey et al. [6] pointed out that the asymptotic null distribution of the log-likelihood ratio test statistic is $\bar{\chi}^2(\nu(\theta^*)^{-1}, C)$, a mixture of chi-squared distributions with degrees of freedom ranging from $r(q-r)$ to $r(q-r) + r(r+1)/2$:
$$\bar{\chi}^2(\nu(\theta^*)^{-1}, C) = \sum_{i=r(q-r)}^{r(q-r)+r(r+1)/2} w_i(m, \nu(\theta^*)^{-1}, C)\, \chi^2_i, \tag{8}$$
where the $w_i(m, \nu(\theta^*)^{-1}, C)$, $i = r(q-r), \ldots, r(q-r)+r(r+1)/2$, are some non-negative numbers with $\sum_{i=r(q-r)}^{r(q-r)+r(r+1)/2} w_i(m, \nu(\theta^*)^{-1}, C) = 1$; $\chi^2_i$ is a chi-squared distribution with $i$ degrees of freedom; and $\nu(\theta^*)$ is a positive definite matrix such that $N^{-1/2}\,\nabla l_N(\theta^*) \xrightarrow{d} N_m(0, \nu(\theta^*))$ and $-N^{-1}\,\nabla^2 l_N(\theta^*) \xrightarrow{a.s.} \nu(\theta^*)$.
Assume that model $M_k$ has parameter vector $\theta_k = (\beta_k^T, \tau_k^T, \sigma^2_{\epsilon,k})^T$, where $\beta_k$ represents the parameter vector of the fixed effects; $\tau_k$ contains the distinct variances and covariances of the random effects covariance matrix $D_k$; and $\sigma^2_{\epsilon,k}$ is the variance of the random error term $\epsilon_k$. Let $p$ be the number of parameters in $\beta_k$ and $q_k$ be the number of parameters in $\tau_k$. Assume that we test model $M_k$ against model $M_1$, where $M_1$ contains only one random effect (a random intercept) and $M_k$ contains $k$ random effects including a random intercept, and that the two models contain the same fixed effects part. In this case, $m = \dim(\theta_k) = p + q_k + 1$, $r = k - 1$, $q = k$, and $q - r = 1$. Thus, $r(q-r) = k-1$ and $r(q-r) + r(r+1)/2 = (k-1)(k+2)/2$. Therefore, based on (8), the asymptotic null distribution of the log-likelihood ratio test statistic is
$$\bar{\chi}^2(\nu(\theta^*)^{-1}, C) = \sum_{i=k-1}^{(k-1)(k+2)/2} w_i(m, \nu(\theta^*)^{-1}, C)\, \chi^2_i,$$
where $C = \{0\}^p \times \{0\} \times \mathcal{S}_+^{k-1} \times \{0\}$; the $w_i(m, \nu(\theta^*)^{-1}, C)$, $i = k-1, \ldots, (k-1)(k+2)/2$, are some non-negative numbers with $\sum_{i=k-1}^{(k-1)(k+2)/2} w_i(m, \nu(\theta^*)^{-1}, C) = 1$; $\chi^2_i$ is a chi-squared distribution with $i$ degrees of freedom; and $\nu(\theta^*)$ is a positive definite matrix such that $N^{-1/2}\,\nabla l_N(\theta^*) \xrightarrow{d} N_m(0, \nu(\theta^*))$ and $-N^{-1}\,\nabla^2 l_N(\theta^*) \xrightarrow{a.s.} \nu(\theta^*)$. □
We note that it is too complex to define $\mathcal{S}_+^{k-1}$ using equality and inequality constraints on the variance and covariance components of matrix $D_k$. Since $\mathcal{S}_+^{k-1} \subset \mathbb{R}^{(k-1)(k-2)/2} \times \mathbb{R}_+^{k-1}$, in our work, we approximated $C$ by $C' = \{0\}^p \times \{0\} \times \mathbb{R}^{k-1} \times \mathbb{R}^{(k-1)(k-2)/2} \times \mathbb{R}_+^{k-1} \times \{0\}$. Thus, $\bar{\chi}^2(\nu(\theta^*)^{-1}, C)$ is approximated by
$$\bar{\chi}^2(\nu(\theta^*)^{-1}, C') = \sum_{i=k(k-1)/2}^{(k-1)(k+2)/2} w_i(m, \nu(\theta^*)^{-1}, C')\, \chi^2_i, \tag{10}$$
where the $w_i(m, \nu(\theta^*)^{-1}, C')$, $i = k(k-1)/2, \ldots, (k-1)(k+2)/2$, are some non-negative numbers with $\sum_{i=k(k-1)/2}^{(k-1)(k+2)/2} w_i(m, \nu(\theta^*)^{-1}, C') = 1$; $\chi^2_i$ is a chi-squared distribution with $i$ degrees of freedom; and $\nu(\theta^*)$ is a positive definite matrix such that $N^{-1/2}\,\nabla l_N(\theta^*) \xrightarrow{d} N_m(0, \nu(\theta^*))$ and $-N^{-1}\,\nabla^2 l_N(\theta^*) \xrightarrow{a.s.} \nu(\theta^*)$. This is because $C' = \{0\}^p \times \{0\} \times \mathbb{R}^{k-1} \times \mathbb{R}^{(k-1)(k-2)/2} \times \mathbb{R}_+^{k-1} \times \{0\}$ contains a linear space of dimension $(k-1) + (k-1)(k-2)/2 = k(k-1)/2$ and is included in a linear space of dimension $(k-1) + (k-1)(k-2)/2 + (k-1) = (k-1)(k+2)/2$. Therefore, the weights $w_i(m, \nu(\theta^*)^{-1}, C')$ are zero for $i = 0, \ldots, k(k-1)/2 - 1$ and for $i = (k-1)(k+2)/2 + 1, \ldots, m$ [8]. From (10), let $c_k = E[\bar{\chi}^2(\nu(\theta^*)^{-1}, C')] = \sum_{i=k(k-1)/2}^{(k-1)(k+2)/2} w_i(m, \nu(\theta^*)^{-1}, C')\, i$.
We propose the following modified BIC:
$$BIC(M_k) = -2\, l(\hat{\theta}_k; y) + d_k \log(n),$$
where $\hat{\theta}_k$ is the maximum likelihood estimator of $\theta_k$ in model $M_k$; $n = \sum_{i=1}^{N} n_i$; and $d_k = p + 1.5 + c_k$ for $k > 1$, $d_k = p + 1.5$ for $k = 1$, and $d_k = p + 1$ for $k = 0$.
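The following small helper (ours, for illustration) makes the complexity in the correlated case explicit: the chi-bar-squared degrees of freedom run from $k(k-1)/2$ to $(k-1)(k+2)/2$, and $c_k$ is the expectation of the mixture. The weights are assumed to be supplied, for example estimated by Monte Carlo as in the earlier sketch or obtained from restriktor.

```r
## d_k = p + 1.5 + c_k for k > 1, where c_k = sum over the mixture of w_i * i
complexity_correlated <- function(p, k, weights) {
  df_range <- (k * (k - 1) / 2):((k - 1) * (k + 2) / 2)
  stopifnot(length(weights) == length(df_range), abs(sum(weights) - 1) < 1e-8)
  c_k <- sum(weights * df_range)
  p + 1.5 + c_k
}

## Example: k = 3 correlated random effects, so the degrees of freedom are 3, 4, 5.
complexity_correlated(p = 2, k = 3, weights = c(0.4, 0.4, 0.2))  # illustrative weights
```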

2.2.3. Modified BIC for Selecting Both Fixed Effects and Random Effects in Linear Mixed Models

In this section, we propose a modified BIC to select both fixed effects and random effects for linear mixed models. We also divided the situations into two cases: when the random effects are independent, that is the covariance matrix, D , of random effects is diagonal, and when the random effects are correlated, that is the covariance matrix, D , is a full matrix.
Scenario 1: modified BIC for selecting both fixed effects and random effects when random effects are independent.
In the model selection, we assumed that the smallest model (called model M 1 ) contains only the intercept term for fixed effects and a random intercept for random effects. Model M k contains ( p k + 1 ) fixed effects, and the covariance matrix, D k , of random effects is of order k × k . If random effects are assumed to be independent, then the number of random effects variance components is q k = k .
Lemma 3. 
When $D$ is a diagonal matrix, under Assumptions (C1)–(C4) in Appendix A.1, assume that we test model $M_k$ against model $M_1$. Then the asymptotic null distribution of the log-likelihood ratio test statistic is
$$\bar{\chi}^2(\nu(\theta^*)^{-1}, C) = \sum_{i=p_k}^{p_k+k-1} w_i(m, \nu(\theta^*)^{-1}, C)\, \chi^2_i,$$
where $C = \mathbb{R}^{p_k} \times \{0\} \times \{0\} \times \mathbb{R}_+^{k-1} \times \{0\}$; the $w_i(m, \nu(\theta^*)^{-1}, C)$, $i = p_k, \ldots, p_k+k-1$, are some non-negative numbers with $\sum_{i=p_k}^{p_k+k-1} w_i(m, \nu(\theta^*)^{-1}, C) = 1$; and $m$ is the dimension of $\theta$.
Proof. 
When $D$ is a diagonal matrix, $D = \mathrm{diag}(d_1, \ldots, d_{q-r}, d_{q-r+1}, \ldots, d_q)$. The fixed effects parameter is $\beta = (\beta_0, \beta_1, \ldots, \beta_{p-1})$. Without loss of generality, assume that we want to test the nullity of $s$ components of $\beta$, namely $\beta_1, \ldots, \beta_s$, and the nullity of the last $r$ variance components of matrix $D$, namely $d_{q-r+1}, \ldots, d_q$.
Consider the hypothesis test $H_0: \beta_1 = 0, \ldots, \beta_s = 0;\ d_{q-r+1} = 0, \ldots, d_q = 0$ versus $H_1: \beta_1 \neq 0, \ldots, \beta_s \neq 0;\ d_{q-r+1} > 0, \ldots, d_q > 0$, assuming that the variances that are not tested ($d_1, \ldots, d_{q-r}$) are positive. Let $\theta^*$ be the true value of the parameter vector. Assume that the null hypothesis holds and $\theta^* \in \Theta_0$. Then the parameter spaces under the null and alternative hypotheses and their tangent cones at $\theta^*$ are
$$\Theta_0 = \{0\}^s \times \mathbb{R}^{p-s} \times \mathbb{R}_+^{q-r} \times \{0\}^r \times \mathbb{R}_+, \qquad T_{\Theta_0}(\theta^*) = \{0\}^s \times \mathbb{R}^{p-s} \times \mathbb{R}^{q-r} \times \{0\}^r \times \mathbb{R},$$
$$\Theta = \mathbb{R}^p \times \mathbb{R}_+^{q-r} \times \mathbb{R}_+^r \times \mathbb{R}_+, \qquad T_{\Theta}(\theta^*) = \mathbb{R}^p \times \mathbb{R}^{q-r} \times \mathbb{R}_+^r \times \mathbb{R}.$$
Since $T_{\Theta_0}(\theta^*)$ is again a linear subspace of $T_{\Theta}(\theta^*)$, Baey et al. [6] pointed out that the asymptotic null distribution of the log-likelihood ratio test statistic is $\bar{\chi}^2(\nu(\theta^*)^{-1}, C)$, a mixture of chi-squared distributions with degrees of freedom ranging from $s$ to $s + r$:
$$\bar{\chi}^2(\nu(\theta^*)^{-1}, C) = \sum_{i=s}^{s+r} w_i(m, \nu(\theta^*)^{-1}, C)\, \chi^2_i, \tag{13}$$
where $C = T_{\Theta}(\theta^*) - T_{\Theta_0}(\theta^*) = \mathbb{R}^s \times \{0\}^{p-s} \times \{0\}^{q-r} \times \mathbb{R}_+^r \times \{0\}$; $\chi^2_i$ is a chi-squared distribution with $i$ degrees of freedom; and $\nu(\theta^*)$ is some positive definite matrix such that $N^{-1/2}\,\nabla l_N(\theta^*) \xrightarrow{d} N(0, \nu(\theta^*))$ and $-N^{-1}\,\nabla^2 l_N(\theta^*) \xrightarrow{a.s.} \nu(\theta^*)$.
When we test model $M_k$ against model $M_1$, we are testing the nullity of $s = p_k$ regression coefficients and $r = k - 1$ random effects variance components. Therefore, based on Equation (13), the asymptotic null distribution of the log-likelihood ratio test statistic is
$$\bar{\chi}^2(\nu(\theta^*)^{-1}, C) = \sum_{i=p_k}^{p_k+k-1} w_i(m, \nu(\theta^*)^{-1}, C)\, \chi^2_i,$$
where $C = \mathbb{R}^{p_k} \times \{0\} \times \{0\} \times \mathbb{R}_+^{k-1} \times \{0\}$; the $w_i(m, \nu(\theta^*)^{-1}, C)$, $i = p_k, \ldots, p_k+k-1$, are some non-negative numbers with $\sum_{i=p_k}^{p_k+k-1} w_i(m, \nu(\theta^*)^{-1}, C) = 1$. □
Let $u_k$ be the expectation of $\bar{\chi}^2(\nu(\theta^*)^{-1}, C)$; then $u_k = \sum_{i=p_k}^{p_k+k-1} w_i(m, \nu(\theta^*)^{-1}, C)\, i$. We propose a modified BIC for this case as
$$BIC(M_k) = -2\, l(\hat{\theta}_k; y) + d_k \log(n),$$
where $\hat{\theta}_k$ is the maximum likelihood estimator of $\theta_k$ in model $M_k$; $n = \sum_{i=1}^{N} n_i$; and $d_k = 2.5 + u_k$ for $k > 1$, $d_k = p_k + 2.5$ for $k = 1$, and $d_k = p_k + 2$ for $k = 0$. Here, in the formula $d_k = 2.5 + u_k$ for $k > 1$, we added 2.5 to $u_k$ to account for the degrees of freedom of the fixed effects intercept (1 degree of freedom), the random intercept (0.5 degrees of freedom), and the variance component of the error term $\epsilon$ (1 degree of freedom).
Scenario 2: modified BIC for selecting both fixed effects and random effects when random effects are correlated.
When the random effects in the linear mixed model (1) are correlated, their covariance matrix $D$ is a full matrix. Matrix $D$ can be written as $D = \begin{pmatrix} D_{11} & D_{12} \\ D_{12}^T & D_{22} \end{pmatrix}$, where the size of $D_{11}$ is $(q-r) \times (q-r)$ and the size of $D_{22}$ is $r \times r$. The number of distinct variance and covariance components in $D$ is $q(q+1)/2$.
Consider the hypothesis test $H_0: \beta_1 = 0, \ldots, \beta_s = 0;\ D_{11} > 0;\ D_{12} = 0;\ D_{22} = 0$ versus $H_1: \beta \in \mathbb{R}^p,\ D > 0$; that is, $D$ is a positive definite matrix. Let $\theta^*$ be the true value of the parameter vector. Assume that the null hypothesis holds and $\theta^* \in \Theta_0$. Then the parameter space under the null hypothesis and its tangent cone at $\theta^*$ are
$$\Theta_0 = \{0\}^s \times \mathbb{R}^{p-s} \times \mathcal{S}_+^{q-r} \times \{0\}^{r(q-r)} \times \{0\}^{r(r+1)/2} \times \mathbb{R}_+,$$
$$T_{\Theta_0}(\theta^*) = \{0\}^s \times \mathbb{R}^{p-s} \times \mathcal{S}^{q-r} \times \{0\}^{r(q-r)} \times \{0\}^{r(r+1)/2} \times \mathbb{R} = \{0\}^s \times \mathbb{R}^{p-s} \times \mathbb{R}^{(q-r)(q-r+1)/2} \times \{0\}^{r(q-r)} \times \{0\}^{r(r+1)/2} \times \mathbb{R},$$
where $\mathcal{S}_+^{q-r}$ is the set of symmetric positive semi-definite matrices of size $(q-r) \times (q-r)$. Furthermore, the parameter space under the alternative hypothesis is
$$\Theta = \{\theta \in \mathbb{R}^m : \beta \in \mathbb{R}^p;\ D \in \mathcal{S}_+^{q};\ \sigma^2_{\epsilon} > 0\} = \mathbb{R}^p \times \mathcal{S}_+^{q} \times \mathbb{R}_+.$$
The tangent cone to $\Theta$ at $\theta^*$ is
$$T_{\Theta}(\theta^*) = \mathbb{R}^p \times \mathbb{R}^{(q-r)(q-r+1)/2} \times \mathbb{R}^{r(q-r)} \times \mathcal{S}_+^{r} \times \mathbb{R},$$
where $\mathcal{S}_+^{r}$ is the set of symmetric positive semi-definite matrices of size $r \times r$. Since $T_{\Theta_0}(\theta^*)$ is a linear subspace of $T_{\Theta}(\theta^*)$, the asymptotic null distribution of the likelihood ratio test statistic for the above hypothesis test is $\bar{\chi}^2(\nu(\theta^*)^{-1}, C)$, where $C = T_{\Theta}(\theta^*) - T_{\Theta_0}(\theta^*) = \mathbb{R}^s \times \{0\}^{p-s} \times \{0\}^{(q-r)(q-r+1)/2} \times \mathbb{R}^{r(q-r)} \times \mathcal{S}_+^{r} \times \{0\}$. As in Lemma 2, it is challenging to define $\mathcal{S}_+^{r}$ using equality and inequality constraints. Since $\mathcal{S}_+^{r} \subset \mathbb{R}^{r(r-1)/2} \times \mathbb{R}_+^{r}$, we approximated $C$ by $C' = \mathbb{R}^s \times \{0\}^{p-s} \times \{0\}^{(q-r)(q-r+1)/2} \times \mathbb{R}^{r(q-r)} \times \mathbb{R}^{r(r-1)/2} \times \mathbb{R}_+^{r} \times \{0\}$, and $\bar{\chi}^2(\nu(\theta^*)^{-1}, C)$ is approximated by
$$\bar{\chi}^2(\nu(\theta^*)^{-1}, C') = \sum_{i=s+r(q-r)+r(r-1)/2}^{s+r(q-r)+r(r-1)/2+r} w_i(m, \nu(\theta^*)^{-1}, C')\, \chi^2_i, \tag{16}$$
where the $w_i(m, \nu(\theta^*)^{-1}, C')$, $i = s+r(q-r)+r(r-1)/2, \ldots, s+r(q-r)+r(r-1)/2+r$, are some non-negative numbers with $\sum_i w_i(m, \nu(\theta^*)^{-1}, C') = 1$; $\chi^2_i$ is a chi-squared distribution with $i$ degrees of freedom; and $\nu(\theta^*)$ is a positive definite matrix such that $N^{-1/2}\,\nabla l_N(\theta^*) \xrightarrow{d} N_m(0, \nu(\theta^*))$ and $-N^{-1}\,\nabla^2 l_N(\theta^*) \xrightarrow{a.s.} \nu(\theta^*)$.
In the model selection, we assumed that model $M_1$ contains only the intercept term for the fixed effects and a random intercept for the random effects. Model $M_k$ contains $(p_k + 1)$ fixed effects, and the covariance matrix $D_k$ of the random effects is of order $k \times k$. When the random effects are assumed to be correlated, the number of distinct random effects variance and covariance components is $q_k = k(k+1)/2$. When we test model $M_k$ against model $M_1$, applying (16) with $s = p_k$ and $r = k - 1$ gives $s + r(q-r) + r(r-1)/2 = p_k + k(k-1)/2$ and $s + r(q-r) + r(r-1)/2 + r = p_k + (k-1)(k+2)/2$. Thus, the asymptotic null distribution of the log-likelihood ratio test statistic is approximated by
$$\bar{\chi}^2(\nu(\theta^*)^{-1}, C') = \sum_{i=p_k+k(k-1)/2}^{p_k+(k-1)(k+2)/2} w_i(m, \nu(\theta^*)^{-1}, C')\, \chi^2_i,$$
where $C' = \mathbb{R}^{p_k} \times \{0\} \times \{0\} \times \mathbb{R}^{k(k-1)/2} \times \mathbb{R}_+^{k-1} \times \{0\}$.
Furthermore, let $h_k$ be the expectation of $\bar{\chi}^2(\nu(\theta^*)^{-1}, C')$; then
$$h_k = \sum_{i=p_k+k(k-1)/2}^{p_k+(k-1)(k+2)/2} w_i(m, \nu(\theta^*)^{-1}, C')\, i.$$
Our proposed modified BIC for this case is
$$BIC(M_k) = -2\, l(\hat{\theta}_k; y) + d_k \log(n), \tag{18}$$
where $\hat{\theta}_k$ is the maximum likelihood estimator of $\theta_k$ in model $M_k$; $n = \sum_{i=1}^{N} n_i$; and $d_k = 2.5 + h_k$ for $k > 1$, $d_k = p_k + 2.5$ for $k = 1$, and $d_k = p_k + 2$ for $k = 0$.
Theorem 2. 
Assume that Assumptions (C1)–(C4) in Appendix A.1 are satisfied and that $BIC(M_k)$ is defined as in (18). Then
$$\lim_{n \to \infty} P\big(BIC(M_T) < BIC(M_k)\big) = 1 \ \text{ for all } M_k \in \mathcal{M}^+, \quad \text{and} \quad \lim_{n \to \infty} P\big(BIC(M_T) < BIC(M_k)\big) = 1 \ \text{ for all } M_k \in \mathcal{M}^-.$$
Proof. 
Case 1: For any over-fitting model $M_k \in \mathcal{M}^+$, we also prove that $\lim_{n \to \infty} P\big(BIC(M_k) - BIC(M_T) > 0\big) = 1$. Assume that model $M_k$ contains $p_k$ fixed effects and $q_k$ random effects and that the true model $M_T$ contains $p_T$ fixed effects and $q_T$ random effects. Let $s = p_k - p_T$ and $r = q_k - q_T$, with $s \geq 0$, $r \geq 0$, and $s + r > 0$. Without loss of generality, assume that the covariance matrix of the random effects in model $M_k$ is $D = \begin{pmatrix} D_{11} & D_{12} \\ D_{12}^T & D_{22} \end{pmatrix}$, where $D_{11}$ is the covariance matrix of the random effects of the true model $M_T$. The size of $D_{11}$ is $q_T \times q_T$, and the size of $D_{22}$ is $r \times r$. Let $\theta_T = (\underline{0}_{\beta}^T, \beta_T^T, \psi_T^T, \underline{0}^T, \sigma^2_{\epsilon,T})^T$ and $\theta_k = (\beta_{k,1}^T, \beta_{k,2}^T, \psi_{k,1}^T, \psi_{k,2}^T, \sigma^2_{\epsilon,k})^T$, where $\underline{0}_{\beta}$ has the same dimension as $\beta_{k,1}$; $\beta_T$ has the same dimension as $\beta_{k,2}$; $\psi_T$ has the same dimension as $\psi_{k,1}$; and $\underline{0}$ has the same dimension as $\psi_{k,2}$. All elements of $\underline{0}_{\beta}$ and $\underline{0}$ are 0. We have that
$$BIC(M_k) - BIC(M_T) = -2\big[l(\hat{\theta}_k; y) - l(\hat{\theta}_T; y)\big] + (d_k - d_T)\log(n).$$
Then, $-2\big(l(\hat{\theta}_T; y) - l(\hat{\theta}_k; y)\big)$ is the likelihood ratio test statistic of the following hypothesis test:
$$H_0: \beta_{k,1} = 0;\ D_{11} > 0;\ D_{12} = 0;\ D_{22} = 0 \quad \text{versus} \quad H_1: \beta_k \in \mathbb{R}^p,\ D > 0.$$
As in Lemma 2, under $H_0$, the asymptotic distribution of $-2\big(l(\hat{\theta}_T; y) - l(\hat{\theta}_k; y)\big)$ is $\bar{\chi}^2(\nu(\theta^*)^{-1}, C)$, where $C = T_{\Theta}(\theta^*) - T_{\Theta_0}(\theta^*) = \mathbb{R}^s \times \{0\}^{p-s} \times \{0\}^{(q-r)(q-r+1)/2} \times \mathbb{R}^{r(q-r)} \times \mathcal{S}_+^{r} \times \{0\}$ with $s = p_k - p_T$, $r = q_k - q_T$, $p = p_k + 1$, and $q = q_k$; $\nu(\theta^*)$ is some positive definite matrix such that $N^{-1/2}\,\nabla l(\theta^*) \xrightarrow{d} N_m(0, \nu(\theta^*))$ and $-N^{-1}\,\nabla^2 l(\theta^*) \xrightarrow{a.s.} \nu(\theta^*)$.
Therefore, $-2\big(l(\hat{\theta}_T; y) - l(\hat{\theta}_k; y)\big) = O_p(1)$. We also have that
$$-2\big(l(\hat{\theta}_T; y) - l(\hat{\theta}_k; y)\big) = 2\big[l(\hat{\theta}_1; y) - l(\hat{\theta}_T; y)\big] - 2\big[l(\hat{\theta}_1; y) - l(\hat{\theta}_k; y)\big],$$
$$E\Big[-2\big(l(\hat{\theta}_T; Y) - l(\hat{\theta}_k; Y)\big)\Big] = E\Big[2\big(l(\hat{\theta}_1; Y) - l(\hat{\theta}_T; Y)\big)\Big] - E\Big[2\big(l(\hat{\theta}_1; Y) - l(\hat{\theta}_k; Y)\big)\Big] = d_k - d_T,$$
where $l(\hat{\theta}_1; y)$ is the maximum log-likelihood of the simplest model, that is, the model with only the intercept for the fixed effects and a random intercept for the random effects. Therefore,
$$E\Big[-2\big(l(\hat{\theta}_T; Y) - l(\hat{\theta}_k; Y)\big)\Big] = d_k - d_T.$$
On the other hand, $-2\big[l(\hat{\theta}_T; y) - l(\hat{\theta}_k; y)\big]$ asymptotically follows a mixture of chi-squared distributions. Therefore, $E\big[-2\big(l(\hat{\theta}_T; Y) - l(\hat{\theta}_k; Y)\big)\big]$ must be positive and, therefore, $d_k - d_T > 0$. Thus, $BIC(M_k) - BIC(M_T) \to \infty$ as $n \to \infty$, and $\lim_{n \to \infty} P\big(BIC(M_k) - BIC(M_T) > 0\big) = 1$ for $M_k \in \mathcal{M}^+$.
Case 2: For any under-fitting model $M_k \in \mathcal{M}^-$, we want to prove that $\lim_{n \to \infty} P\big(BIC(M_k) - BIC(M_T) > 0\big) = 1$. Using similar arguments as in the proof of Case 1 in Theorem 1, we obtain this result. □

3. Simulation

In this section, we evaluate the performance of the proposed BIC* and compare it to the regular BIC. For each candidate model, we computed the BIC* and the regular BIC; then, for each method, we chose the model with the minimum value of the criterion. All models were fit using the function “lmer” in the R package lme4 [22]. The chi-bar-squared weights were calculated using the function “con_weights_boot” in the R package “restriktor” [20]. Following the methods used in Gao and Song [23] and Chen and Chen [24], the criteria used to evaluate and compare the proposed BIC to the regular BIC were (1) the positive selection rate (PSR), (2) the false discovery rate (FDR), and (3) the correction rate (CR). For each chosen model, the positive selection rate (PSR) is the ratio of the number of predictors that are correctly identified as significant in the chosen model to the number of predictors that are truly significant in the data-generating model; we then averaged the PSR over all chosen models. The false discovery rate (FDR) is the ratio of the number of predictors that are incorrectly identified as significant in the chosen model to the number of predictors that are identified as significant in the chosen model; we then averaged the FDR over all chosen models. The correction rate (CR) is the proportion of times the true data-generating model is selected among all chosen models. For each selection criterion, we had 1001 models obtained from 1001 simulations. We then calculated the means and standard deviations of the positive selection rate and false discovery rate and the correction rate for each criterion.
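The following R sketch gives minimal versions of the three criteria, assuming that each selected model and the true model are represented as index sets of the predictors they include (a simplification of the actual bookkeeping).

```r
psr <- function(selected, truth) length(intersect(selected, truth)) / length(truth)
fdr <- function(selected, truth) {
  if (length(selected) == 0) return(0)
  length(setdiff(selected, truth)) / length(selected)
}
cr <- function(selected_list, truth)
  mean(vapply(selected_list, function(s) setequal(s, truth), logical(1)))

## Example: true support {1, 2} and three selections from three simulated datasets
truth <- c(1, 2)
sels  <- list(c(1, 2), c(1, 2, 3), c(1))
sapply(sels, psr, truth = truth)   # 1.0, 1.0, 0.5
sapply(sels, fdr, truth = truth)   # 0.00, 0.33, 0.00
cr(sels, truth)                    # 1/3
```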

3.1. Simulation Setup

Our data were generated from the linear mixed model $y = X\beta + Zb + \epsilon$. For all simulations, $\epsilon$ was generated from a multivariate normal distribution $N(0, \sigma^2_{\epsilon} I_n)$ with $\sigma^2_{\epsilon} = 1$.

3.1.1. Setup A: Choose Random Effects Assuming That the Random Effects Are Independent

Scenario 1: With total number of observations n = 500 and number of clusters N = 100 , X is an n × p matrix with p = 2 ; the first column of X includes all ones. The second column is X 1 , which was generated from the standard normal distribution. The vector of fixed effects β = ( 1 , 2 ) T . Matrix Z contains the first two columns z 0 , z 1 , which are the same as two columns of matrix X , and two more columns z 2 , z 3 , both generated from the standard normal distributions. Random effects, b i , were generated from multivariate normal distribution N q ( 0 , D ) with D a 4 × 4 diagonal matrix and D = diag ( σ 0 2 , , σ 3 2 ) . The random intercept, b i 0 , had a standard deviation of σ 0 = 5 . Random effects components, b i 1 , b i 2 , and b i 3 , had standard deviations σ 1 , σ 2 , and σ 3 , respectively. To measure the ability to detect the significance of the variance component parameters of the proposed B I C , we considered different sizes of σ 1 2 , σ 2 2 , and σ 3 2 . σ 1 is a sequence of values from 0 to 0.5 incrementing by 0.05 ; σ 2 is a sequence of values from 0 to 1 incrementing by 0.1 ; σ 3 is a sequence of values from 0 to 2 incrementing by 0.2 .
Scenario 2: With the total number of observations n = 500 and number of clusters N = 100 , X is an n × p matrix with n = 500 ; p = 3 ; the first column of X includes all ones. The last two columns of matrix X 1 and X 2 were generated from the standard normal distributions. The vector of fixed effects β = ( 1 , 2 , 3 ) T . Matrix Z contains the first three columns z 0 , z 1 , and z 2 , which are the same as three columns of matrix X and three more columns z 3 , z 4 , and z 5 , which were generated from the standard normal distributions. Random effects, b i , were generated from multivariate normal distribution N q ( 0 , D ) with D a 6 × 6 diagonal matrix and D = diag ( σ 0 2 , , σ 5 2 ) . To measure the ability to detect the significance of the variance component parameters of the B I C , we also considered different sizes of σ 1 2 , σ 2 2 , and σ 3 2 as in Scenario 1 with σ 4 2 = 0 and σ 5 2 = 0. We then repeated this setup with n = 1000 ( N = 200 ) and n = 250 ( N = 50 ).
Scenario 3: The setup was similar to the one in Scenario 2. However, matrix Z contains the first three columns z 0 , z 1 , and z 2 , which are the same as the three columns of matrix X , and eight more columns z 3 , …, z 10 , which were generated from the standard normal distributions. Random effects, b i , were generated from multivariate normal distribution N q ( 0 , D ) with D a 11 × 11 diagonal matrix and D = diag ( σ 0 2 , , σ 10 2 ) , where σ 1 2 = 0.16 , σ 2 2 = 0.64 , σ 3 2 = 1 , σ 4 2 = 1.44 , and σ 5 2 , , σ 10 2 are all 0. We also repeated this simulation setup with n = 1000 ( N = 200 ) and n = 250 ( N = 50 ).

3.1.2. Setup B: Choose Random Effects Assuming That the Random Effects Are Correlated

In this setup, the total number of observations is $n = 1000$ and the number of clusters is $N = 100$. Matrix $X$ and the vector of fixed effects $\beta$ were generated in the same way as in Setup A, Scenario 2. Matrix $Z$ contains the first three columns $z_0$, $z_1$, and $z_2$, which are the same as three columns of matrix $X$, and three more columns $z_3$, $z_4$, and $z_5$ that were generated from the standard normal distributions. Random effects $b_i$ were generated from the multivariate normal distribution $N_q(0, D)$ with $D$ a $6 \times 6$ matrix. The correlation matrix between the random effects components $b_{i0}$, $b_{i1}$, $b_{i2}$, and $b_{i3}$ in the data-generating model is
$$R = \begin{pmatrix} 1 & 0.7 & 0.6 & 0.5 \\ 0.7 & 1 & 0.4 & 0.3 \\ 0.6 & 0.4 & 1 & 0.5 \\ 0.5 & 0.3 & 0.5 & 1 \end{pmatrix}.$$
To measure the ability of the proposed BIC to detect the significance of the variance component parameters, we created different cases for different sizes of $\sigma^2_0$, $\sigma^2_1$, $\sigma^2_2$, and $\sigma^2_3$, as shown below (a short sketch of how $D$ is built from $R$ and the standard deviations follows the cases). $\sigma^2_4$, $\sigma^2_5$, and the covariances of the random effects $b_{i4}$ and $b_{i5}$ corresponding to $z_4$ and $z_5$ are all 0.
Case 1: The standard deviations of the random effects were σ 0 = 5 , σ 1 = 1.0 , σ 2 = 0.8 , σ 3 = 0.4 , σ 4 = 0 , and σ 5 = 0 .
Case 2: The standard deviations of the random effects were σ 0 = 2 , σ 1 = 0.8 , σ 2 = 0.5 , σ 3 = 0.3 , σ 4 = 0 , and σ 5 = 0 .
Case 3: The standard deviations of the random effects were σ 0 = 2 , σ 1 = 0.5 , σ 2 = 0.4 , σ 3 = 0.2 , σ 4 = 0 , and σ 5 = 0 .
Case 4: In this case, we kept the standard deviations of the random effects the same as the ones in Case 2. However, we increased the correlations by 0.1 for each non-zero correlation in the correlation matrix to see how this affects the correction rates. The correlation matrix between the random effects is
$$R_1 = \begin{pmatrix} 1 & 0.8 & 0.7 & 0.6 \\ 0.8 & 1 & 0.5 & 0.4 \\ 0.7 & 0.5 & 1 & 0.6 \\ 0.6 & 0.4 & 0.6 & 1 \end{pmatrix}.$$
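As referenced above, the sketch below shows how the random-effects covariance matrix used in this setup can be built from the correlation matrix $R$ and a vector of standard deviations (Case 1 values) and how the correlated random effects can be drawn. MASS::mvrnorm is one convenient choice here; the last two zero columns correspond to $b_{i4}$ and $b_{i5}$, whose variances and covariances are 0 in the data-generating model.

```r
library(MASS)

R_mat <- matrix(c(1,   0.7, 0.6, 0.5,
                  0.7, 1,   0.4, 0.3,
                  0.6, 0.4, 1,   0.5,
                  0.5, 0.3, 0.5, 1), 4, 4)
sds  <- c(5, 1.0, 0.8, 0.4)                 # sigma_0, ..., sigma_3 from Case 1
D11  <- diag(sds) %*% R_mat %*% diag(sds)   # covariance of (b_i0, ..., b_i3)

## The full 6 x 6 D has zero rows/columns for b_i4 and b_i5, so only the 4 x 4
## block needs to be simulated; the last two random effects are identically 0.
N <- 100
b <- cbind(mvrnorm(N, mu = rep(0, 4), Sigma = D11), 0, 0)
```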

3.1.3. Setup C: Choose Both Fixed Effects and Random Effects Assuming That the Random Effects Are Correlated

With the total number of observations $n = 1000$ and the number of clusters $N = 100$, $X$ is an $n \times p$ matrix with $p = 6$; the first column of $X$ consists of all ones. The last five columns, $X_1$ to $X_5$, were generated from the standard normal distribution. The vector of fixed effects is $\beta = (1, 2, 3, 1, 0, 0)^T$. Matrix $Z$ contains the first three columns $z_0$, $z_1$, and $z_2$, which are the same as three columns of matrix $X$, and three more columns $z_3$, $z_4$, and $z_5$ that were generated from the standard normal distributions. The correlation matrix between the random effects components $b_{i0}$, $b_{i1}$, $b_{i2}$, and $b_{i3}$ in the data-generating model is
$$R = \begin{pmatrix} 1 & 0.7 & 0.6 & 0.5 \\ 0.7 & 1 & 0.4 & 0.3 \\ 0.6 & 0.4 & 1 & 0.5 \\ 0.5 & 0.3 & 0.5 & 1 \end{pmatrix}.$$
To measure the ability to detect the significance of the fixed effects and variance component parameters of the proposed B I C , we explored two different cases for different sizes of σ 0 2 , σ 1 2 , σ 2 2 , and σ 3 2 , as shown below. The σ 4 2 , σ 5 2 , and covariances corresponding to the random effects of z 4 and z 5 are all 0.
Case 1: The standard deviations of the random effects were σ 0 = 5 , σ 1 = 1.5 , σ 2 = 1 , σ 3 = 0.5 , σ 4 = 0 , and σ 5 = 0 .
Case 2: The standard deviations of the random effects were σ 0 = 2 , σ 1 = 0.8 , σ 2 = 0.5 , σ 3 = 0.3 , σ 4 = 0 , and σ 5 = 0 .
We also ran simulations for the case when the random effects were assumed to be uncorrelated and the variances of random effects were the same as the values in Case 1 and Case 2.

3.2. Simulation Procedure

3.2.1. For Setup A

In Scenario 1, for each set of values of $\sigma^2_1$, $\sigma^2_2$, and $\sigma^2_3$, $B = 1001$ simulations were run. In each simulation, all possible candidate models were fit. All these models had the same fixed effects covariates (including $X_1$ and the intercept); meanwhile, the covariates for the random effects part varied over the power set of $\{1, 2, 3\}$. The proposed BIC and the regular BIC were calculated for each model. Then, one model with the minimum proposed BIC was selected, and one model with the minimum regular BIC was selected. For each selection criterion, we thus had 1001 models obtained from 1001 simulations. We calculated the correction rate (CR) for each criterion.
In Scenario 2, for each set of values of σ 1 2 , …, σ 5 2 , B = 1001 simulations were run. In each simulation, all possible candidate models were run. All these models had the same fixed effect covariates (including X 1 , X 2 , and the intercept); meanwhile, the covariates for random effects varied in the power set of { 1 , , 5 } . The proposed B I C and regular BIC were calculated for each model. Then, one model with the minimum proposed BIC was selected, and one model with the minimum regular BIC was selected. We calculated the means and standard deviations of the positive selection rate and false discovery rate. We also calculated the correction rate for each criterion.
In Scenario 3, with the given set of values of $\sigma^2_1$, …, $\sigma^2_{10}$, $B = 1001$ simulations were run. In each simulation, all possible candidate models were fit. All these models had the same fixed effects covariates; meanwhile, the covariates for the random effects varied over the power set of $\{1, \ldots, 10\}$. We calculated the means and standard deviations of the positive selection rate and false discovery rate and calculated the correction rate for each criterion. All simulations were performed using R version 4.0.2 [25].

3.2.2. For Setup B

In each case presented above, $B = 1001$ simulations were run. In each simulation, all possible candidate models were fit. All these models had the same fixed effects covariates (including the intercept, $X_1$, and $X_2$); meanwhile, the covariates for the random effects part varied over the power set of $\{1, 2, 3, 4, 5\}$, and every model also included a random intercept. The proposed BIC, the regular BIC, and the cAIC were calculated for each model. Greven and Kneib [26] developed an analytic version of the corrected conditional AIC, and their method is implemented in the cAIC4 package in R [27]. Then, for each criterion, the model with the minimum value was selected. We calculated the means and standard deviations of the positive selection rate and false discovery rate and the correction rate for each criterion.

3.2.3. For Setup C

For each case above, we ran B = 1001 simulations. In each simulation, all possible candidate models were run. All models contained the intercept term for the fixed effect and a random intercept for the random effects. The covariates for the fixed effects part varied in the power set of { 1 , 2 , 3 , 4 , 5 } for X 1 to X 5 , and the covariates for random effects part varied in the power set of { 1 , 2 , 3 , 4 , 5 } for z 1 to z 5 . We also included the models that included only the intercept term for the fixed effect with varying random effects and the models that included a random intercept only with varying fixed effects. The proposed B I C and regular BIC were calculated for each model. Then, the model with the minimum proposed BIC was selected, and the model with the minimum regular BIC was selected.

3.3. Simulation Results

Scenario 1: Table 1 summarizes the results of Scenario 1. We observed that the correction rate for the proposed B I C was greater than that of the regular BIC. Furthermore, the correction rates of the two methods were higher when the values of σ 1 2 , σ 2 2 , and σ 3 2 were bigger.
Scenario 2: Table 2 summarizes the results of Scenario 2. The simulation results suggest that the values of the positive selection rate (PSR) for the proposed BIC were higher than those of the regular BIC when the values of the variance components were close to 0. That is, the ability to choose the significant variance components was higher for the proposed BIC than for the regular BIC. Almost all of the false discovery rate (FDR) values were within 5 percent in all cases. We also observed that the proposed BIC approach had a higher FDR and corresponding SD compared to the regular BIC approach. For some very small values of the standard deviations, the FDR values of the proposed BIC were greater than 5 percent. A possible reason is that the penalty term of the regular BIC is based on an exact chi-squared distribution, whereas the penalty term of the proposed BIC uses approximated weights of the chi-bar-squared distribution.
As the values of the variance components increased, the PSR increased. From the results obtained, we also saw that the ability to choose the true model also became larger as the values of the variance components increased. We also noted that the standard deviations were small for all cases. This means that the estimated PSR and FDR were very consistent.
Figure 1 shows the comparison of the proposed BIC and regular BIC methods in terms of the positive selection rate and correction rate for different values of σ 1 , σ 2 , and σ 3 when n = 500 ( N = 100 ) in Scenario 2.
Figure 2 shows the comparison of the positive selection rate (PSR) and correction rates for Scenario 2 when n = 250 , 500, and 1000 with N = 50 , 100, and 200, respectively. Given the same set of values of σ 1 2 , …, σ 5 2 , we observed that the positive selection rate increased as the number of clusters N increased.
We also ran 104 simulations with three more competing methods, "cAIC", "BIC_J", and "Splmm", using the same setting as in Scenario 2. The cAIC is the corrected conditional AIC as implemented in the cAIC4 package in R [27]. BIC_J is a modified BIC for linear mixed models introduced in Jones [11]. Splmm (simultaneous penalized linear mixed-effects models) is a method that selects both the fixed effects and the random effects via a penalized likelihood function; it is based on the results in Yang and Wu [28] and is implemented in the R package "Splmm". Figure 3 shows that the modified BIC performed better than the regular BIC, BIC_J, and Splmm in this scenario in terms of the positive selection rate and correction rate. The ability to choose the correct variables was higher for the cAIC than for the modified BIC; however, the correction rates for the cAIC were not always higher than those of the modified BIC. The Splmm method did not seem to work well in this scenario, perhaps because it is designed for settings in which the number of parameters is much larger than the number of observations.
Scenario 3: Table 3 summarizes the results of Scenario 3. We saw that, in all cases, for the sample sizes n = 250 , 500, and 1000, the mean PSR and the correction rates were higher for the proposed BIC, while the FDRs remained around 5 % .
Table 4 shows the comparison of the proposed BIC, regular BIC, and cAIC methods in terms of the positive selection rate, the false discovery rate, and correction rate for Case 1 to Case 4. In all cases, the correction rate for the proposed BIC was greater than that of the regular BIC. The difference in the correction rate between these two methods was bigger when the values of σ 1 2 , σ 2 2 , and σ 3 2 were smaller. In most cases, the two methods seemed to perform better than the cAIC method.
Table 5 shows the comparison of the proposed BIC, regular BIC, and cAIC methods in terms of fixed effects correction rate, random effects correction rate, and both effects correction rate for both Case 1 and Case 2 when random effects were assumed to be correlated.
Based on the simulation results for the situation when random effects were assumed correlated in Table 5, we saw that the proposed BIC method performed better than the regular BIC and the cAIC methods in terms of the correction rate for selecting the fixed effects, the correction rate for selecting the random effects, and also for selecting both fixed effects and random effects simultaneously. We also saw that, when the values of the variances for random effects were smaller, the correction rates were lower for all methods. However, the performance of the proposed method was still much better than the other two methods.
When random effects were assumed uncorrelated, based on the simulation results in Table 6, we saw that the proposed BIC and regular BIC still performed well and better than the cAIC method. The proposed BIC performed better than the regular BIC in Case 2, but not in Case 1. This may be because the penalty term of the regular BIC is computed from an exact chi-squared distribution and therefore involves no approximation error, whereas the penalty term of the proposed BIC relies on approximated weights of the chi-bar-squared distribution. From the simulation results, we noticed that, when the values of the variances of the random effects were smaller, the correction rates were lower for the proposed and regular BIC methods. However, for the cAIC method, the correction rates in Case 2 were better than those in Case 1.
Comparing the computational complexity, the proposed method requires Monte Carlo simulations to estimate the weights before the penalty term can be computed, which is more computationally intensive than the regular BIC. In our simulations, for one dataset with a given model, Setup A took about 0.11 to 0.42 s for the proposed BIC and about 0.04 to 0.07 s for the regular BIC. Setup B took about 0.19 s for the proposed BIC, 0.10 s for the regular BIC, and 0.22 s for the cAIC. Setup C with independent random effects took about 0.11 s for the proposed BIC, 0.05 s for the regular BIC, and 0.09 s for the cAIC; Setup C with correlated random effects took about 0.19 s for the proposed BIC, 0.10 s for the regular BIC, and 0.21 s for the cAIC. We noted that the model with correlated random effects took longer than the one with independent random effects. Furthermore, the computational time of the proposed method was longer than that of the regular BIC, but quite close to that of the cAIC. The methods were run on Windows 10 with an Intel Core i7-8550U CPU (4 cores, 8 threads) and 8 GB of RAM.

4. Real-Data Application

In this section, we applied the proposed BIC to a real dataset: a subset of 120 schools from the dataset "hsfull" in the R package "spida2", developed by Monette et al. [29]. This dataset was originally drawn from the 1982 "High School and Beyond" (HSB) survey used in Raudenbush and Bryk's text on hierarchical linear models (Raudenbush and Bryk [30]). The data include the mathematics achievement test scores of 5307 students from 50 Catholic and 70 public high schools, with the number of students per school ranging from 19 to 66.
The variables included in the analysis were school identification number, mathematics achievement score ( Y ), socioeconomic status ( X 1 ), sex (female (0) or male (1); X 2 ), visible minority status (yes (1) or no (0); X 3 ), and school sector (Catholic (0) or public (1); X 4 ). Variables X 1 , X 2 , and X 3 are group-centered. The objective was to study the relationship between students’ mathematics achievement score and socioeconomic status, sex, and visible minority status in public and Catholic schools and whether this relationship varies across schools within each sector.
The candidate variables in the fixed effects part were X 1 , X 2 , X 3 , and X 4 . The candidate variables in the random effects part were z 1 , z 2 , and z 3 , which correspond to X 1 , X 2 , and X 3 , respectively.
We first fit a linear mixed model that included only the intercept term for fixed effects and a random intercept. Then, we fit the models with only the intercept term for fixed effects and all possible combinations of z 1 , z 2 , and z 3 with a random intercept for random effects. Next, we fit the models with all possible combinations of X 1 , X 2 , X 3 , and X 4 for fixed effects and only a random intercept for random effects. Lastly, for each combination of X 1 , X 2 , X 3 , and X 4 for fixed effects, we fit the models with all possible combinations of z 1 , z 2 , and z 3 with a random intercept for random effects. For each model, we recorded the values of the proposed BIC, regular BIC, and cAIC. There were 128 values for each method. Now, for each method, we chose the model with the minimum value of the corresponding criterion. We applied this procedure for both cases when random effects were assumed to be correlated and uncorrelated.
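For illustration, one candidate model for this dataset could be fit and its criteria recorded as sketched below. The column names (mathach, ses, sex, minority, sector, school) are assumed for illustration and may differ from those in the actual "hsfull" data frame; proposed_bic() again stands in for the modified BIC.

```r
# Sketch of fitting one candidate model for the HSB subset and recording its
# criteria; column names (mathach, ses, sex, minority, sector, school) are
# illustrative and may differ in the actual "hsfull" data frame.
library(lme4)
library(cAIC4)

data(hsfull, package = "spida2")   # the paper analyses a subset of 120 schools

# Example candidate: all four fixed effects, with a random intercept and a
# correlated random slope for socioeconomic status.
# ML fit (REML = FALSE) so that models with different fixed effects are comparable.
fit <- lmer(mathach ~ ses + sex + minority + sector + (1 + ses | school),
            data = hsfull, REML = FALSE)

c(regular_BIC = BIC(fit), cAIC = cAIC(fit)$caic)
# proposed_bic(fit) would be added here to complete the comparison (placeholder)
```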
When random effects were assumed to be correlated, the optimal model we obtained using the proposed BIC was the model with all X 1 , X 2 , X 3 , and X 4 and a random intercept; the proposed BIC was 34,379.83. The optimal model we obtained using the regular BIC was also the model with X 1 , X 2 , X 3 , and X 4 and a random intercept only. The regular BIC of this model was also 34,379.83. The cAIC yielded the optimal model, which contained X 1 , X 2 , X 3 , and X 4 with a random intercept and random slopes of z 1 and z 3 . The cAIC of the optimal model was 34,166.25.
When random effects were assumed to be uncorrelated, the optimal model we obtained using the proposed BIC was the model with all X 1 , X 2 , X 3 , and X 4 , a random intercept, and random slopes of z 3 ; the proposed BIC value was 34,378.23. The optimal model we obtained using the regular BIC was the model with X 1 , X 2 , X 3 , and X 4 and a random intercept only. The regular BIC of this model was 34,379.83. The cAIC yielded the optimal model, which contained X 1 , X 2 , X 3 , and X 4 with a random intercept and random slopes of z 1 , z 2 , and z 3 . The cAIC of the optimal model was 34,165.13.
Table 7 shows the proposed BIC, regular BIC, and cAIC for all models that contained X 1 , X 2 , X 3 , and X 4 with correlated random effects considered.
Table 8 shows the optimal model chosen by each method when the random effects were assumed to be independent and when the random effects were correlated. All X 1 , X 2 , X 3 , and X 4 were included in the models.
Based on the results presented above, we would choose the model with X 1 , X 2 , X 3 , and X 4 for the fixed effects and a random intercept and a random slope of z 3 for the random effects, assuming that the random effects are uncorrelated. There was a significant relationship between students' math achievement scores and socioeconomic status, sex, and visible minority status in public and Catholic schools, and the school mean math achievement score and the minority gap effect varied across the schools within each sector.
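In lme4 syntax, uncorrelated random effects can be specified by placing the random intercept and the random slope in separate terms (or with the double-bar syntax for numeric covariates). A sketch of the selected model, using the same assumed column names as above, is:

```r
# Sketch of the selected model: four fixed effects, a random intercept, and a
# random slope for the minority indicator, with the two random effects assumed
# uncorrelated (separate terms); column names are illustrative placeholders.
library(lme4)
data(hsfull, package = "spida2")

final_fit <- lmer(mathach ~ ses + sex + minority + sector +
                    (1 | school) + (0 + minority | school),
                  data = hsfull, REML = FALSE)
# equivalently, for a numeric covariate: (1 + minority || school)
summary(final_fit)
```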

5. Discussion

In this article, we introduced a modified BIC for linear mixed models that can directly deal with the boundary issue of variance components. First, we focused on selecting random effects variance components and proposed a model selection criterion when the random effects were assumed to be independent (the covariance matrix of random effects was a diagonal matrix). Second, we proposed a criterion for choosing random effects variance components when the random effects were assumed to be correlated. Instead of working with a complex tangent cone to the alternative parameter space, we approximated the tangent cone using a bigger, but simpler cone. This allowed us to obtain the weights of the chi-bar-squared distribution. Lastly, we presented a model selection criterion for choosing both fixed effects and random effects simultaneously in both cases: when random effects were assumed to be independent and when they were correlated. We also proved the consistency of the modified BIC.
Based on the simulation studies, the modified BIC performed quite well in terms of the correction rate. Its ability to select the data-generating model was better when the random effects variance components or the correlation components were larger. Compared to the regular BIC, the modified BIC gave higher correction rates, especially when the variances of the random effects were small. Based on the correction rate, the modified BIC performed better than the regular BIC in most cases. Furthermore, there was a significant improvement in the positive selection rate in most of the simulation scenarios.
One limitation of the modified BIC is that the proposed method examines all possible models when choosing the optimal model. Since the number of possible models increases exponentially with the number of fixed effects and random effects, the model selection process can become computationally intensive. We may combine the proposed BIC with a screening procedure, such as shrinkage methods or fence methods as introduced in Müller et al. [31], to reduce the number of candidate models, and then use the proposed BIC to perform model selection among the remaining candidates.

Author Contributions

Writing, original draft, H.L.; writing, reviewing and editing, X.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by an NSERC grant held by Gao.

Data Availability Statement

The dataset used in the Application Section is a subset of the dataset “hsfull” from the package “spida2” in R, using the R command: data(hsfull, package = ‘spida2’). The package is available at https://github.com/gmonette/spida2, accessed on 25 January 2021.

Acknowledgments

The authors would like to thank the Editor and reviewers for their helpful comments and suggestions, which have helped us to improve the manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Appendix A.1. Assumptions for Lemmas 1–3 and Theorems 1 and 2

(C1).
The observations y = ( y 1 , … , y N ) from different clusters are independent random vectors. All the assumptions of the linear mixed model (1) are satisfied.
(C2).
Let l N ( θ ; y ) be the log-likelihood function of the linear mixed model (1). Denote by Θ the parameter space of the model parameter vector θ , and let θ ∗ denote the true value of the parameter vector. Denote the vector of first partial derivatives of l N ( θ ; y ) with respect to θ by ∇ l N ( θ ) , and denote the matrix of second partial derivatives of l N ( θ ; y ) with respect to θ by ∇² l N ( θ ) . Directional derivatives are used when θ is on the boundary of Θ . (i) Assume that, for all θ , the first three partial derivatives of the log-likelihood function with respect to θ exist almost everywhere. (ii) Furthermore, assume that N^{-1} times the absolute value of the third derivative of l N ( θ ; y ) is bounded by a function of ( Y 1 , … , Y N ) whose expectation exists and is finite, on the intersection of a neighborhood of θ ∗ and Θ .
(C3).
Assume that n 1 , … , n N are uniformly bounded. That is, there exists a constant K > 0 such that n i ≤ K for i = 1 , … , N .
(C4).
Let θ T be the parameter vector of the true model M T , and let θ T , 0 denote the true value of θ T .
(i)
For any under-fitting model M k with model parameter θ k ∈ Θ k , assume that E T , 0 [ log { f ( y ; θ T , 0 ) / f ( y ; θ k ) } ] exists and that there exists a unique pseudo-true value θ k , 0 such that θ k , 0 = arg min θ k ∈ Θ k E T , 0 [ log { f i ( y ; θ T , 0 ) / f i ( y ; θ k ) } ] for all i .
(ii)
For all θ , ( 1 / N ) ( l ( θ ; y ) − E T , 0 [ l ( θ ; Y ) ] ) converges in probability to 0.
(iii)
For any two nested models M k ⊂ M l , 2 ( l ( θ ^ k ; Y ) − l ( θ ^ l ; Y ) ) is bounded by an integrable function M ( Y ) , with E [ M ( Y ) ] < ∞ .

Appendix A.2. Graphs of Chi-Bar-Squared Distributions

In this section, we created graphs of some density functions of different chi-bar-squared distributions.
Figure A1. Chi-bar-squared distributions.
The graphs show that the shape of a chi-bar-squared distribution depends on its mixing weights.
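For a given vector of mixing weights, such density curves can be reproduced along the lines of the following sketch; the weight vectors shown are illustrative and are not necessarily those used in Figure A1.

```r
# Sketch: density of a chi-bar-squared distribution with given mixing weights
# (a mixture of chi-squared distributions with 0, 1, ..., p degrees of freedom,
# where the 0-df component is a point mass at zero). The weight vectors below
# are illustrative and not necessarily those used for Figure A1.
dchibarsq <- function(x, w) {
  p <- length(w) - 1
  dens <- numeric(length(x))
  for (i in seq_len(p)) dens <- dens + w[i + 1] * dchisq(x, df = i)
  dens   # continuous part only; the point mass w[1] at zero is not drawn
}

x <- seq(0.01, 10, length.out = 400)
plot(x, dchibarsq(x, c(0.5, 0.5)), type = "l",
     xlab = "x", ylab = "density", main = "Chi-bar-squared densities")
lines(x, dchibarsq(x, c(0.25, 0.5, 0.25)), lty = 2)
legend("topright", legend = c("w = (0.5, 0.5)", "w = (0.25, 0.5, 0.25)"),
       lty = 1:2, bty = "n")
```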

Appendix A.3. Graphical Example of the Boundary Issue

In this example, we tested H 0 : σ 1 2 > 0 , σ 2 2 = 0 against H 1 : σ 1 2 > 0 , σ 2 2 > 0 . The parameter space under the null hypothesis is the set of all points of the form ( a , 0 ) with a > 0 , illustrated by the blue interval along the σ 1 2 axis. Under the alternative hypothesis, the parameter space is the set of all points ( a , b ) with a > 0 and b ≥ 0 , illustrated by the shaded orange region on the graph. Under the null hypothesis, the value of the parameter vector being tested lies on the boundary of the parameter space.
Figure A2. Graphical example of the boundary issue.
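For this test, with a single variance component on the boundary and the remaining parameters in the interior of the parameter space, the asymptotic null distribution of the likelihood ratio statistic is typically the equal-weight mixture 0.5 χ²(0) + 0.5 χ²(1) (Self and Liang [3]; Stram and Lee [4]). The sketch below shows the corresponding boundary-corrected p-value calculation for an illustrative observed statistic.

```r
# Sketch: boundary-corrected p-value for the likelihood ratio test above. For a
# single tested variance component with the remaining parameters in the interior,
# the asymptotic null is the equal-weight mixture 0.5*chi^2_0 + 0.5*chi^2_1
# (Self and Liang [3]); the observed statistic below is illustrative.
pchibarsq_upper <- function(t, w) {
  # P(X > t) for a chi-bar-squared distribution with weights w = c(w0, ..., wp)
  p <- length(w) - 1
  w[1] * (t < 0) + sum(w[-1] * pchisq(t, df = seq_len(p), lower.tail = FALSE))
}

lrt_obs <- 3.2                                # illustrative observed LRT value
pchibarsq_upper(lrt_obs, w = c(0.5, 0.5))     # boundary-corrected p-value (~0.037)
pchisq(lrt_obs, df = 1, lower.tail = FALSE)   # naive chi^2_1 p-value (~0.074), conservative
```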

References

1. Sheng, Y.; Yang, C.; Curhan, S.; Curhan, G.; Wang, M. Analytical methods for correlated data arising from multicenter hearing studies. Stat. Med. 2022, 41, 5335–5348.
2. Chernoff, H. On the distribution of the likelihood ratio. Ann. Math. Stat. 1954, 25, 573–578.
3. Self, S.G.; Liang, K.Y. Asymptotic Properties of Maximum Likelihood Estimators and Likelihood Ratio Tests Under Nonstandard Conditions. J. Am. Stat. Assoc. 1987, 82, 605–610.
4. Stram, D.O.; Lee, J.W. Variance Components Testing in the Longitudinal Mixed Effects Model. Biometrics 1994, 50, 1171–1179.
5. Azadbakhsh, M.; Gao, X.; Jankowski, H. Composite likelihood ratio testing under nonstandard conditions using tangent cones. Stat 2021, 10, e375.
6. Baey, C.; Cournède, P.H.; Kuhn, E. Asymptotic distribution of likelihood ratio test statistics for variance components in nonlinear mixed-effects models. Comput. Stat. Data Anal. 2019, 135, 107–122.
7. Dykstra, R. Asymptotic normality for chi-bar-squared distributions. Can. J. Stat. 1991, 19, 297–306.
8. Shapiro, A. Asymptotic Distribution of Test Statistics in the Analysis of Moment Structures Under Inequality Constraints. Biometrika 1985, 72, 133–144.
9. Vaida, F.; Blanchard, S. Conditional Akaike information for mixed-effects models. Biometrika 2005, 92, 351–370.
10. Pauler, D. The Schwarz criterion and related methods for normal linear models. Biometrika 1998, 85, 13–27.
11. Jones, R.H. Bayesian information criterion for longitudinal and clustered data. Stat. Med. 2011, 30, 3050–3056.
12. Delattre, M.; Poursat, M.A. An iterative algorithm for joint covariate and random effect selection in mixed-effects models. Int. J. Biostat. 2020, 16, 1–12.
13. Ibrahim, J.G.; Zhu, H.; Garcia, R.I.; Guo, R. Fixed and random effects selection in mixed-effects models. Biometrics 2011, 67, 495–503.
14. Bondell, H.D.; Krishna, A.; Ghosh, S.K. Joint variable selection for fixed and random effects in linear mixed effects models. Biometrics 2010, 66, 1069–1077.
15. Peng, H.; Lu, Y. Model selection in linear mixed effect models. J. Multivar. Anal. 2012, 109, 109–129.
16. Drikvandi, R.; Verbeke, G.; Khodadadi, A.; Nia, V.P. Testing multiple variance components in linear mixed-effects models. Biostatistics 2012, 14, 144–159.
17. Pauler, D.K.; Wakefield, J.C.; Kass, R.E. Bayes Factors and Approximations for Variance Component Models. J. Am. Stat. Assoc. 1999, 94, 1242–1253.
18. Laird, N.M.; Ware, J.H. Random-Effects Models for Longitudinal Data. Biometrics 1982, 38, 963–974.
19. Silvapulle, M.J.; Sen, P.K. Constrained Statistical Inference: Order, Inequality, and Shape Constraints; John Wiley & Sons: Hoboken, NJ, USA, 2005.
20. Vanbrabant, L.; Rosseel, Y.; Dacko, A. con_weights_boot: Function for Computing the Chi-Bar-Square Weights Based on Monte Carlo Simulation. 2019. Available online: https://www.rdocumentation.org/packages/restriktor/versions/0.2-250/topics/con_weights_boot/ (accessed on 12 August 2020).
21. van der Vaart, A. Asymptotic Statistics; Cambridge University Press: Cambridge, UK, 2000.
22. Bates, D.; Mächler, M.; Bolker, B.; Walker, S. Fitting Linear Mixed-Effects Models Using lme4. J. Stat. Softw. 2015, 67, 1–48.
23. Gao, X.; Song, P.X.K. Composite Likelihood Bayesian Information Criteria for Model Selection in High-Dimensional Data. J. Am. Stat. Assoc. 2010, 105, 1531–1540.
24. Chen, J.; Chen, Z. Extended BIC for small-n-large-P sparse GLM. Stat. Sin. 2012, 22, 555–574.
25. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2021. Available online: https://www.R-project.org/ (accessed on 29 December 2021).
26. Greven, S.; Kneib, T. On the behaviour of marginal and conditional AIC in linear mixed models. Biometrika 2010, 97, 773–789.
27. Säfken, B.; Rügamer, D.; Kneib, T.; Greven, S. Conditional model selection in mixed-effects models with cAIC4. arXiv 2018, arXiv:1803.05664.
28. Yang, L.; Wu, T. Model-based clustering of high-dimensional longitudinal data via regularization. Biometrics 2022, 1–14.
29. Monette, G.; Fox, J.; Friendly, M.; Krause, H.; Zhu, F. spida2: Collection of Tools Developed for the Summer Programme in Data Analysis 2000–2012. R Package Version 0.2.1, 2019. Available online: https://github.com/gmonette/spida2 (accessed on 30 April 2020).
30. Raudenbush, S.; Bryk, A. Hierarchical Linear Models: Applications and Data Analysis Methods; SAGE Publications: Thousand Oaks, CA, USA, 2002.
31. Müller, S.; Scealy, J.L.; Welsh, A.H. Model Selection in Linear Mixed Models. Stat. Sci. 2013, 28, 135–167.
Figure 1. Comparison of the proposed BIC and regular BIC methods in terms of the positive selection rate and correction rate for different values of σ 1 , σ 2 , and σ 3 , n = 500 ( N = 100 ) . In this simulation setup, for each value of σ 1 on the horizontal axis, the value of σ 2 is 2 σ 1 and the value of σ 3 is 4 σ 1 ; σ 4 = 0 and σ 5 = 0 .
Figure 2. Comparison of the positive selection rate and correction rate for n = 250 ( N = 50 ) , n = 500 ( N = 100 ) , and n = 1000 ( N = 200 ) . For each value of σ 1 on the horizontal axis, the value of σ 2 is 2 σ 1 and the value of σ 3 is 4 σ 1 ; σ 4 = 0 and σ 5 = 0 .
Figure 3. Comparison of the positive selection rate and correction rate for n = 500 ( N = 100 ) with different competing methods for different values of σ 1 , σ 2 , and σ 3 . For each value of σ 1 on the horizontal axis, the value of σ 2 is 2 σ 1 and the value of σ 3 is 4 σ 1 ; σ 4 = 0 and σ 5 = 0 .
Table 1. Comparison of the proposed BIC and regular BIC methods in terms of correction rate for the simulation in Scenario 1 with n = 500 and N = 100 .

σ1      σ2      σ3      Correction Rate (Proposed BIC)   Correction Rate (Regular BIC)
0.00    0.00    0.00    0.00                             0.00
0.05    0.10    0.20    0.00                             0.00
0.10    0.20    0.40    0.01                             0.00
0.15    0.30    0.60    0.04                             0.01
0.20    0.40    0.80    0.11                             0.02
0.25    0.50    1.00    0.24                             0.08
0.30    0.60    1.20    0.36                             0.18
0.35    0.70    1.40    0.54                             0.32
0.40    0.80    1.60    0.67                             0.47
0.45    0.90    1.80    0.78                             0.60
0.50    1.00    2.00    0.87                             0.72
“Correction Rate” reports the proportion of times the selected model is the true data-generating model.
Table 2. Comparison of the proposed BIC and regular BIC methods in terms of the positive selection rate, the false discovery rate, and correction rate for different values of σ 1 , σ 2 , σ 3 , σ 4 = 0 , and σ 5 = 0 in Scenario 2 with n = 500 and N = 100 .

                        Proposed BIC                                          Regular BIC
σ1     σ2     σ3       PSR (SD)         FDR (SD)         Correction Rate     PSR (SD)         FDR (SD)         Correction Rate
0.00   0.00   0.00     0.029 (0.010)    0.067 (0.062)    0.00                0.004 (0.001)    0.009 (0.008)    0.00
0.05   0.10   0.20     0.118 (0.031)    0.060 (0.053)    0.00                0.037 (0.011)    0.013 (0.013)    0.00
0.10   0.20   0.40     0.392 (0.034)    0.033 (0.016)    0.01                0.292 (0.027)    0.007 (0.005)    0.00
0.15   0.30   0.60     0.537 (0.034)    0.027 (0.011)    0.03                0.429 (0.026)    0.008 (0.004)    0.01
0.20   0.40   0.80     0.657 (0.032)    0.026 (0.009)    0.12                0.568 (0.032)    0.007 (0.003)    0.04
0.25   0.50   1.00     0.734 (0.028)    0.016 (0.005)    0.23                0.660 (0.025)    0.003 (0.001)    0.10
0.30   0.60   1.20     0.796 (0.028)    0.019 (0.006)    0.37                0.719 (0.021)    0.004 (0.001)    0.18
0.35   0.70   1.40     0.854 (0.027)    0.018 (0.005)    0.53                0.777 (0.025)    0.004 (0.001)    0.33
0.40   0.80   1.60     0.902 (0.023)    0.013 (0.004)    0.67                0.830 (0.028)    0.004 (0.001)    0.48
0.45   0.90   1.80     0.945 (0.015)    0.015 (0.004)    0.80                0.888 (0.025)    0.004 (0.001)    0.66
0.50   1.00   2.00     0.960 (0.012)    0.016 (0.004)    0.83                0.916 (0.021)    0.003 (0.001)    0.74
PSR is positive selection rate, and FDR is false discovery rate; both are averaged over 1001 simulations. All values in brackets are sample standard deviations.
Table 3. Comparison of the proposed BIC and regular BIC methods in terms of the positive selection rate, the false discovery rate, and the correction rate for n = 250 , n = 500 , and n = 1000 in Scenario 3.

(n, N)        Method         Average PSR (SD)   Average FDR (SD)   Correction Rate
(250, 50)     Proposed BIC   0.825 (0.015)      0.063 (0.014)      0.21
(250, 50)     Regular BIC    0.771 (0.014)      0.017 (0.004)      0.15
(500, 100)    Proposed BIC   0.883 (0.016)      0.051 (0.010)      0.40
(500, 100)    Regular BIC    0.832 (0.015)      0.009 (0.002)      0.32
(1000, 200)   Proposed BIC   0.959 (0.009)      0.041 (0.008)      0.67
(1000, 200)   Regular BIC    0.916 (0.014)      0.005 (0.001)      0.65
“Correction Rate” reports the proportion of times the selected model is the true data-generating model.
Table 4. Comparison of the proposed BIC, regular BIC, and cAIC methods in terms of the positive selection rate, the false discovery rate, and correction rate for different values of σ 0 , σ 1 , σ 2 , σ 3 , σ 4 = 0 , and σ 5 = 0 with correlated random effects.

        Proposed BIC                                           Regular BIC                                            cAIC
Case    PSR (SD)          FDR (SD)           Correction Rate   PSR (SD)          FDR (SD)           Correction Rate   PSR (SD)          FDR (SD)           Correction Rate
1       0.99 (0.0032)     0.0007 (0.0002)    0.967             0.9837 (0.0052)   0.0007 (0.0002)    0.9481            0.9950 (0.0025)   0.1110 (0.0212)    0.6054
2       0.8911 (0.0244)   0.0003 (0.0001)    0.6733            0.8541 (0.0273)   0.0 (0.000)        0.5624            0.9933 (0.0026)   0.0935 (0.0188)    0.6603
3       0.7106 (0.0132)   0.0003 (0.0001)    0.1339            0.6893 (0.0084)   0.0003 (0.0001)    0.0739            0.9314 (0.0184)   0.0940 (0.0206)    0.5355
4       0.9204 (0.0202)   0.0 (0.000)        0.7612            0.8901 (0.0246)   0.0 (0.000)        0.6703            0.995 (0.0016)    0.0904 (0.0189)    0.6763
PSR is positive selection rate, and FDR is false discovery rate; both are averaged over 1001 simulations. All values in brackets are sample standard deviations.
Table 5. Comparison of the proposed BIC, regular BIC, and cAIC methods in terms of the correction rate for fixed effects, random effects, and both for different values of σ 0 , σ 1 , and σ 2 with σ 3 , σ 4 = 0 , σ 5 = 0 , and correlated random effects.

        Proposed BIC                     Regular BIC                      cAIC
Case    FE-CR    RE-CR     Both-CR       FE-CR    RE-CR     Both-CR       FE-CR     RE-CR     Both-CR
1       0.983    0.999     0.982         0.982    0.997     0.979         0.3147    0.3177    0.1708
2       0.975    0.6673    0.6503        0.979    0.5684    0.5554        0.3377    0.3746    0.2128
FE-CR is the correction rate for fixed effects variables; RE-CR is the correction rate for random effects variables; Both-CR is the correction rate of selecting the true model. All the rates are calculated over 1001 simulations.
Table 6. Comparison of the proposed BIC, regular BIC, and cAIC methods in terms of fixed effects correction rate, random effects correction rate, and both effects correction rate for different values of σ 0 , σ 1 , σ 2 , and σ 3 with σ 4 = 0 , σ 5 = 0 , and independent random effects.

        Proposed BIC                     Regular BIC                      cAIC
Case    FE-CR    RE-CR     Both-CR       FE-CR    RE-CR     Both-CR       FE-CR     RE-CR     Both-CR
1       0.985    0.9481    0.9351        0.986    0.994     0.981         0.5105    0.4865    0.2667
2       0.980    0.8661    0.8511        0.981    0.7672    0.7532        0.5664    0.5385    0.3017
FE-CR is the correction rate for fixed effects variables; RE-CR is the correction rate for random effects variables; Both-CR is the correction rate of selecting the true model. All the rates are calculated over 1001 simulations.
Table 7. Results of the proposed BIC, regular BIC, and cAIC for all models with correlated random effects considered for the subset of the "hsfull" dataset.

Model                    Proposed BIC   Regular BIC   cAIC
Random Intercept (RI)    34,379.33      34,379.83     34,176.92
RI, z1                   34,380.61      34,385.4      34,167.38
RI, z2                   34,391.38      34,396.17     34,181.23
RI, z1, z2               34,401.4       34,410.2      34,174.07
RI, z3                   34,384.58      34,389.37     34,169.18
RI, z1, z3               34,391.03      34,400.38     34,167.38
RI, z2, z3               34,405.59      34,414.63     34,169.18
RI, z1, z2, z3           34,420.4       34,433.63     34,166.25
All values are rounded to two decimal places.
Table 8. Comparison of the optimal model chosen by each method for correlated random effects and independent random effects.

                              Proposed BIC                  Regular BIC                   cAIC
Case                          Optimal Model   Value         Optimal Model   Value         Optimal Model      Value
Correlated Random Effects     RI              34,379.33     RI              34,379.83     RI, z1, z2, z3     34,166.25
Independent Random Effects    RI, z3          34,377.73     RI              34,379.83     RI, z1, z3         34,165.13
RI means random intercept.
