*Article* **Detection of Near-Multicollinearity through Centered and Noncentered Regression**

#### **Román Salmerón Gómez <sup>1</sup>, Catalina García García <sup>1,</sup>\* and José García Pérez <sup>2</sup>**


Received: 1 May 2020; Accepted: 4 June 2020; Published: 7 June 2020

**Abstract:** This paper analyzes the diagnostic of near-multicollinearity in a multiple linear regression from auxiliary centered (with intercept) and noncentered (without intercept) regressions. From these auxiliary regressions, the centered and noncentered variance inflation factors (VIFs) are calculated. An expression is also presented that relates both of them. In addition, this paper analyzes why the VIF is not able to detect the relation between the intercept and the rest of the independent variables of an econometric model. At the same time, an analysis is also provided to determine how the auxiliary regression applied to calculate the VIF can be useful to detect this kind of multicollinearity.

**Keywords:** centered model; noncentered model; intercept; essential multicollinearity; nonessential multicollinearity

**MSC:** 62JXX; 62J20; 60PXX

#### **1. Introduction**

Consider the following multiple linear model with *n* observations and *k* regressors:

$$\mathbf{y}\_{n \times 1} = \mathbf{X}\_{n \times k} \cdot \boldsymbol{\beta}\_{k \times 1} + \mathbf{u}\_{n \times 1} \tag{1}$$

where **y** is a vector with the observations of the dependent variable, **X** is a matrix containing the observations of the regressors and **u** is a vector representing a random disturbance (that is assumed to be spherical). Generally, the first column of matrix **X** is composed of ones to denote that the model contains an intercept. Thus, **X** = [**1** **X**<sub>2</sub> ... **X**<sub>*k*</sub>] where **1**<sub>*n*×1</sub> = (1 1 ... 1)<sup>*t*</sup>. This model is considered to be centered.

When this model presents worrying near-multicollinearity (hereinafter, multicollinearity), that is, when the linear relation between the regressors affects the numerical and/or statistical analysis of the model, the usual approach is to transform the regressors (see, for example, Belsley [1], Marquardt [2] or, more recently, Velilla [3]). Because these transformations (centering, typification or standardization) imply the elimination of the intercept in the model, the transformed models are considered to be noncentered. Note that even after transforming the data, it is possible to recover the original (centered) model from the estimations of the transformed (noncentered) model. However, in this paper, we refer to the centered and noncentered model depending on whether the intercept is initially included or not. Thus, the model is considered centered if **X** = [**1** **X**<sub>2</sub> ... **X**<sub>*k*</sub>] and noncentered if **X** = [**X**<sub>1</sub> **X**<sub>2</sub> ... **X**<sub>*k*</sub>], given that **X**<sub>*j*</sub> ≠ **1** with *j* = 1, ..., *k*.

Depending on the role of the intercept, it is also possible to distinguish between essential and nonessential multicollinearity:

**Nonessential:** A near-linear relation between the intercept and at least one of the remaining independent variables.

**Essential:** A near-linear relation between at least two of the independent variables (excluding the intercept).

A first idea of these definitions was provided by Cohen et al. [4]: Nonessential ill-conditioning results simply from the scaling of the variables, whereas essential ill-conditioning results from substantive relationships among the variables. While in some papers the idea of distinguishing between essential and nonessential collinearity is attributed to Marquardt [5], it is possible to find this concept in Marquardt and Snee [6]. These terms have been widely used not only for linear models but also, for example, for moderated models with interactions and/or with a quadratic term. However, these concepts have been analyzed fundamentally from the point of view of the solution of collinearity. Thus, as Marquardt and Snee [6] stated: In a linear model, centering removes the correlation between the constant term and all linear terms.

The variance inflation factor is one of the most applied measures to detect multicollinearity. Following O'Brien [7], commonly a VIF of 10 or even one as low as 4 has been used as a rule of thumb to indicate excessive or serious collinearity. Salmerón et al. [8] show that the VIF does not detect the nonessential multicollinearity, while this kind of multicollinearity is detected by the index of Stewart [9] (see Salmerón Gómez et al. [10]). This index has been misunderstood in the literature since its presentation by Stewart, who wrongly identified it with the VIF. Even Marquardt [11], in a published comment on the paper of Stewart [9], stated: Stewart collinearity indices are simply the square roots of the corresponding variance inflation factor. It is not clear to me whether giving a new name to the square of a VIF is a help or a hindrance to understanding. There is a long and precisely analogous history of using the term "standard error" for the square root of the corresponding "variances". Given the continuing necessity for dealing with statistical quantities on both the scale of the observable and the scale of the observable squared, there may be a place for a new term. Clearly, the essential intellectual content is identical for both terms.

However, in Salmerón Gómez et al. [12] it is shown that the VIF and the index of Stewart are not the same measure. This paper analyzes in which cases to use one measure or the other, focusing on the initial distinction between centered and noncentered models. Thus, the algebraic contextualization provided by Salmerón Gómez et al. [12] will be complemented from an econometric point of view. This question was also raised by Jensen and Ramirez [13], who, striving to clarify the misuse given to the VIF over the decades since its first use, noted: To choose a model, with or without intercept, is substantive, is specific to each experimental paradigm and is beyond the scope of the present study. It was also stated that: This differs between centered and uncentered diagnostics.

This paper, focused on the differences between essential and nonessential multicollinearity in relation to its diagnostic, analyzes the behaviour of the VIF depending on whether model (1) initially includes the intercept or not. For this analysis, it will be considered that the auxiliary regression used for its calculation is centered or not since as stated by Grob [14] (p. 304): Instead of using the classical coefficient of determination in the definition of VIF, one may also apply the centered coefficient of determination. As a matter of fact, the latter definition is more common. We may call VIF uncentered or centered, depending on whether the classical or centered coefficient of determination is used. From the above considerations, a centered VIF only makes sense when the matrix **X** contains ones as a column. Additionally, although initially in the centered version of model (1) it is possible to find these two kinds of multicollinearity, and in the noncentered version, it is only possible to find essential multicollinearity, this paper shows that this statement is subject to some nuances.

On the other hand, throughout the paper the following statement of Cook [15] will be illustrated: As a matter of fact, the centered VIF requires an intercept in the model but at the same time denies the status of the intercept as an independent "variable" being possibly related to collinearity effects. Furthermore, another statement was provided by Belsley [16] (p. 29): The centered VIF has no ability to discover collinearity involving the intercept. Thus, the second part of the paper analyzes why the centered VIF is unable to detect the nonessential multicollinearity and, for this, the centered coefficient of determination of the centered auxiliary regression to calculate the centered VIF is analyzed. This analysis will be applied to propose a methodology to detect the nonessential multicollinearity from the centered auxiliary regression.

The structure of the paper is as follows: Section 2 presents the detection of multicollinearity in noncentered models from the noncentered auxiliary regressions, Section 3 analyzes the effects of high values of the noncentered VIF on the statistical analysis of the model and Section 4 presents the detection of multicollinearity in centered models from the centered auxiliary regressions. Section 5 illustrates the contribution of the paper with two empirical applications. Finally, Section 6 summarizes the main conclusions.

#### **2. Auxiliary Noncentered Regressions**

This section presents the calculation of the VIF uncentered, VIFnc, considering that the auxiliary regression is noncentered, that is, it has no intercept. First, the method regarding how to calculate the coefficient of determination for noncentered models is presented.

#### *2.1. Noncentered Coefficient of Determination*

Given the linear regression of Equation (1) with or without the intercept, the following decomposition for the sum of squares is verified:

$$\sum\_{i=1}^{n} y\_i^2 = \sum\_{i=1}^{n} \hat{y}\_i^2 + \sum\_{i=1}^{n} e\_i^2, \tag{2}$$

where **ŷ** represents the estimation of the dependent variable of the model that is fit by employing ordinary least squares (OLS) and **e** = **y** − **ŷ** are the residuals obtained from that fit. In this case, the coefficient of determination is obtained by the following expression:

$$R\_{nc}^2 = \frac{\sum\_{i=1}^n \hat{y}\_i^2}{\sum\_{i=1}^n y\_i^2} = 1 - \frac{\sum\_{i=1}^n e\_i^2}{\sum\_{i=1}^n y\_i^2}. \tag{3}$$

Comparing the decomposition of the sums of squares given by (2) with the traditionally applied method to calculate the coefficient of determination in models with the intercept, as in model (1):

$$\sum\_{i=1}^{n} (y\_i - \overline{y})^2 = \sum\_{i=1}^{n} (\widehat{y}\_i - \overline{y})^2 + \sum\_{i=1}^{n} e\_i^2, \tag{4}$$

it is noted that both coincide if the dependent variable has zero mean. If the mean is different from zero, both models present the same residual sum of squares but different explained and total sum of squares.

Thus, these models lead to the same value for the coefficient of determination (and, as a consequence, for the VIF) only if the dependent variable presents a mean equal to zero.
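As a minimal numerical sketch of this comparison (the simulated data, the seed and the function names are illustrative assumptions, not part of the original study), the following Python code computes both coefficients of determination for the same OLS fit, first with a dependent variable whose mean is far from zero and then after shifting it to zero mean:

```python
import numpy as np

def r2_noncentered(y, y_hat):
    """Expression (3): 1 - sum(e_i^2) / sum(y_i^2)."""
    e = y - y_hat
    return 1.0 - (e @ e) / (y @ y)

def r2_centered(y, y_hat):
    """Based on decomposition (4): 1 - sum(e_i^2) / sum((y_i - mean(y))^2)."""
    e = y - y_hat
    d = y - y.mean()
    return 1.0 - (e @ e) / (d @ d)

def ols_fit(X, y):
    """Fitted values of the OLS regression of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

# Hypothetical data: model with intercept, dependent variable with mean near 10.
rng = np.random.default_rng(0)
n = 50
x = rng.normal(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x])
y = 10.0 + 0.5 * x + rng.normal(0.0, 1.0, n)

y_hat = ols_fit(X, y)
r2_nc_raw, r2_c_raw = r2_noncentered(y, y_hat), r2_centered(y, y_hat)

# Same regressors, dependent variable shifted to have zero mean.
y0 = y - y.mean()
y0_hat = ols_fit(X, y0)
r2_nc_zero, r2_c_zero = r2_noncentered(y0, y0_hat), r2_centered(y0, y0_hat)

print(f"mean != 0: R2_nc = {r2_nc_raw:.4f}, R2_c = {r2_c_raw:.4f}")
print(f"mean == 0: R2_nc = {r2_nc_zero:.4f}, R2_c = {r2_c_zero:.4f}")
```

Under these assumptions the residuals are the same in both fits, so the two coefficients differ only through the total sum of squares, and they agree once the dependent variable has zero mean.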

#### *2.2. Noncentered Variance Inflation Factor*

The VIFnc is obtained from the expression:

$$VIFnc(j) = \frac{1}{1 - R\_{nc}^2(j)}, \quad j = 1, \ldots, k,\tag{5}$$

where *R*<sup>2</sup><sub>*nc*</sub>(*j*) is the coefficient of determination, calculated by following (3), of the noncentered auxiliary regression:

$$\mathbf{X}\_{j} = \mathbf{X}\_{-j}\boldsymbol{\delta} + \mathbf{w}, \tag{6}$$

where **X**<sub>−*j*</sub> is equal to the matrix **X** after eliminating the variable **X**<sub>*j*</sub>, for *j* = 1, ..., *k*, and it does not have a vector of ones representing the intercept.

In this case:

$$\begin{array}{l} \sum\_{i=1}^{n} X\_{ij}^{2} = \mathbf{X}\_{j}^{t} \mathbf{X}\_{j}, \text{ and} \\ \sum\_{i=1}^{n} \widehat{X}\_{ij}^{2} = \widehat{\mathbf{X}}\_{j}^{t} \widehat{\mathbf{X}}\_{j} = \mathbf{X}\_{j}^{t} \mathbf{X}\_{-j} \cdot \left(\mathbf{X}\_{-j}^{t} \mathbf{X}\_{-j}\right)^{-1} \cdot \mathbf{X}\_{-j}^{t} \mathbf{X}\_{j}, \text{ due to } \widehat{\mathbf{X}}\_{j} = \mathbf{X}\_{-j} \cdot \left(\mathbf{X}\_{-j}^{t} \mathbf{X}\_{-j}\right)^{-1} \cdot \mathbf{X}\_{-j}^{t} \mathbf{X}\_{j}. \end{array}$$

Then:

$$\begin{array}{rcl} R\_{nc}^2(j) & = & \frac{\mathbf{X}\_{j}^t \mathbf{X}\_{-j} \cdot \left(\mathbf{X}\_{-j}^t \mathbf{X}\_{-j}\right)^{-1} \cdot \mathbf{X}\_{-j}^t \mathbf{X}\_{j}}{\mathbf{X}\_{j}^t \mathbf{X}\_{j}},\\ 1 - R\_{nc}^2(j) & = & \frac{\mathbf{X}\_{j}^t \mathbf{X}\_{j} - \mathbf{X}\_{j}^t \mathbf{X}\_{-j} \cdot \left(\mathbf{X}\_{-j}^t \mathbf{X}\_{-j}\right)^{-1} \cdot \mathbf{X}\_{-j}^t \mathbf{X}\_{j}}{\mathbf{X}\_{j}^t \mathbf{X}\_{j}},\\ VIFnc(j) & = & \frac{\mathbf{X}\_{j}^t \mathbf{X}\_{j}}{\mathbf{X}\_{j}^t \mathbf{X}\_{j} - \mathbf{X}\_{j}^t \mathbf{X}\_{-j} \cdot \left(\mathbf{X}\_{-j}^t \mathbf{X}\_{-j}\right)^{-1} \cdot \mathbf{X}\_{-j}^t \mathbf{X}\_{j}}.\\ \end{array} \tag{7}$$

Thus, the VIFnc coincides with the expression given by Stewart [9] for the VIF, denoted as *k*<sub>*j*</sub><sup>2</sup>, that is, *VIFnc*(*j*) = *k*<sub>*j*</sub><sup>2</sup>.

However, recently, Salmerón Gómez et al. [12] showed that the index presented by Stewart has been misleadingly identified as the VIF, verifying the following relation between both measures:

$$k\_j^2 = VIF(j) + n \cdot \frac{\overline{\mathbf{X}}\_j^2}{RSS\_j}, \quad j = 2, \dots, k,\tag{8}$$

where **X̄**<sub>*j*</sub> is the mean of the *j*-th variable of **X**. This expression is also shown by Salmerón Gómez et al. [10], where it is used to quantify the proportion of essential and nonessential multicollinearity existing in a concrete independent variable.

Note that the expression:

$$VIFnc(j) = VIF(j) + n \cdot \frac{\overline{\mathbf{X}}\_j^2}{RSS\_j} \tag{9}$$

is obtained by Chennamaneni et al. [17] (expression (6) page 174), although there it is limited to the particular case of the moderated regression **Y** = *α*<sub>0</sub> · **1** + *α*<sub>1</sub> · **U** + *α*<sub>2</sub> · **V** + *α*<sub>3</sub> · **U** × **V** + *ν* where **U** and **V** are ratio-scaled explanatory variables in n-dimensional data vectors. Indeed, these authors proposed a new measure to detect multicollinearity in moderated regression models that is derived from the noncentered coefficient of determination. However, this use of the noncentered coefficient of determination lacks the statistical contextualization provided by this paper.

Finally, from expression (9), it is shown that the VIFnc and the VIF only coincide if the associated variable has zero mean, analogously to what happens in the decomposition of the sum of squares. Note that this expression also clarifies why Stewart's collinearity indices diminish when the variables are centered, which the author attributed to errors in regression variables: This phenomenon is a consequence of the fact that our definition of collinearity index compels us to work with relative errors.
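The relation in expressions (8) and (9) can be checked numerically. The sketch below is only illustrative: the simulated data, the seed and the helper name `rss_of_regression` are assumptions, and *RSS*<sub>*j*</sub> is taken here as the residual sum of squares of the auxiliary regression of the variable on the remaining columns of **X**, including the column of ones.

```python
import numpy as np

def rss_of_regression(target, regressors):
    """Residual sum of squares of the OLS regression of target on regressors."""
    coef, *_ = np.linalg.lstsq(regressors, target, rcond=None)
    resid = target - regressors @ coef
    return resid @ resid

# Hypothetical data: x2 has little variability and a mean far from zero.
rng = np.random.default_rng(1)
n = 100
x2 = rng.normal(5.0, 0.01, n)
x3 = rng.normal(4.0, 4.0, n)
X = np.column_stack([np.ones(n), x2, x3])

j = 1                                  # position of x2 in X
xj = X[:, j]
others = np.delete(X, j, axis=1)       # remaining columns, including the ones

rss_j = rss_of_regression(xj, others)
stewart = (xj @ xj) / rss_j            # k_j^2 = X_j' X_j / RSS_j
vif = ((xj - xj.mean()) ** 2).sum() / rss_j   # centered VIF = 1 / (1 - R^2)
rhs = vif + n * xj.mean() ** 2 / rss_j        # right-hand side of (8)

print(f"k_j^2 = {stewart:.1f}, VIF = {vif:.4f}, VIF + n*mean^2/RSS = {rhs:.1f}")
```

In this sketch the centered VIF stays close to one while the second term, driven by the mean of the variable relative to its residual variability, accounts for almost all of the index.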

**Example 1.** *Considering k* = 4 *in model (1), we use the noncentered coefficient of determination, R*<sup>2</sup><sub>*nc*</sub>*, to calculate the noncentered variance inflation factor, VIFnc. To this end, we consider the values displayed in Table 1. Note that variables* **y***,* **X**<sub>2</sub> *and* **X**<sub>3</sub> *were originally used by Belsley [1] and we have added a new variable,* **X**<sub>4</sub>*, that has been randomly generated (from a normal distribution with a mean equal to 4 and a variance equal to 16) to obtain a variable that is linearly independent with respect to the rest.*


**Table 1.** Data set applied by Belsley [1].

*In these data, the existence of nonessential multicollinearity is intuited. This fact is confirmed by the small values of the coefficient of variation (CV) in two of the independent variables and the following conclusions obtained from the value of the condition indices and the proportions of the variance (see, for example, Belsley et al. [18] and Belsley [16] for more details) shown in Table 2:*


**Table 2.** Diagnostic of collinearity of Belsley–Kuh–Welsch and coefficient of variation of the considered variables.


*Now, other models are proposed apart from the initial model for k* = 4*:*


• *Model 3 (Mod3):* **y** = *β*<sub>1</sub> · **1** + *β*<sub>3</sub> · **X**<sub>3</sub> + *β*<sub>4</sub> · **X**<sub>4</sub> + **u***.*

*Table 3 presents the VIF and the VIFnc of these models. Note that by using the original variables applied by Belsley (Mod1), the traditional VIF (from the centered model, see Theil [19]) provides a value equal to 1 (its minimum possible value), while the VIFnc is equal to 100,032.1. If the additional variable* **X**<sub>4</sub> *is included (Mod0), the traditional VIFs are also close to one while the noncentered VIFs present values higher than 100,000. The conclusion is that the VIF is not detecting the existence of nonessential multicollinearity (see Salmerón et al. [8]) while the VIFnc "does detect it". However, since the calculation of VIFnc excludes the constant term, the detected relation refers to the one between* **X**<sub>2</sub> *and* **X**<sub>3</sub>*, and not to the relation between* **X**<sub>2</sub> *and/or* **X**<sub>3</sub> *with the intercept.*

*This fact is supported by the values obtained for the VIF and VIFnc of the second and fourth variables (Mod2) and for the third and fourth variables (Mod3).*


**Table 3.** Variance inflation factor (VIF) and VIF uncentered (VIFnc) of models proposed from Belsley [1] dataset.

#### *2.3. What Kind of Multicollinearity Does the VIFnc Detect?*

The results of Example 1 for **Mod0** suggest a new definition of nonessential multicollinearity as the relation between at least two variables with little variability. Thus, the particular case when one of these variables is the intercept leads to the definition initially given by Marquardt and Snee [6]. Hence, the initial idea that nonessential collinearity cannot be found in a noncentered model must be nuanced.

Following Salmerón et al. [8] and Salmerón Gómez et al. [10], it can be concluded that the VIF only detects the essential multicollinearity and, with these results, the VIFnc detects the nonessential multicollinearity but in its generalized definition, since the intercept is eliminated in the corresponding auxiliary regression.

This appears to contradict the fact that the VIFnc coincides with the index of Stewart, see expression (7), since this measure is able to detect the nonessential multicollinearity (see Salmerón Gómez et al. [10]). The explanation is that the VIFnc can be "fooled" by including the constant as an independent variable in a model without the intercept, that is:

$$\mathbf{y} = \beta\_1 \cdot \mathbf{X}\_1 + \beta\_2 \cdot \mathbf{X}\_2 + \dots + \beta\_k \cdot \mathbf{X}\_k + \mathbf{u},$$

where **X**<sub>1</sub> is a column of ones but is not considered as the intercept.
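A minimal sketch of this device follows. The simulated variables only loosely mimic the setting of Example 1 (they are not the Belsley data), and the function name `vifnc` and the seed are illustrative assumptions. The same noncentered auxiliary regression is run twice: once without any constant, and once with a column of ones supplied as an ordinary regressor.

```python
import numpy as np

def vifnc(X, j):
    """Noncentered VIF of column j: X_j' X_j / RSS_j, with no intercept added."""
    xj = X[:, j]
    others = np.delete(X, j, axis=1)
    coef, *_ = np.linalg.lstsq(others, xj, rcond=None)
    resid = xj - others @ coef
    return (xj @ xj) / (resid @ resid)

# Hypothetical data: x2 is nearly constant, x4 varies widely and is independent.
rng = np.random.default_rng(3)
n = 100
x2 = rng.normal(5.0, 0.01, n)
x4 = rng.normal(4.0, 4.0, n)

no_const = np.column_stack([x2, x4])                 # noncentered model
with_const = np.column_stack([np.ones(n), x2, x4])   # ones as a regular column

vifnc_without = vifnc(no_const, 0)     # x2 explained by x4 only
vifnc_with = vifnc(with_const, 1)      # x2 explained by the ones and x4

print(f"VIFnc without constant: {vifnc_without:.2f}")
print(f"VIFnc with constant as regressor: {vifnc_with:.1f}")
```

Under these assumptions the first value stays small while the second becomes very large, which is the behaviour that signals the relation between the nearly constant variable and the column of ones.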

**Example 2.** *Now, we start from Model 1 in the Belsley example but include the constant as an independent variable in a model without the intercept (Mod4) and two additional models (Mod5 and Mod6):*


*Table 4 presents the VIFnc obtained from expression (5) in Models 4–6. Results indicate that, considering the centered model and calculating the coefficient of determination of the auxiliary regressions as if the model were noncentered, it is possible to detect the nonessential multicollinearity. Thus, the contradiction indicated at the beginning of this subsection is resolved.*

**Table 4.** VIFnc of Models 4–6 including the constant as an independent variable in a model without the intercept.


#### **3. Effects of the VIFnc on the Statistical Analysis of the Model**

Given the model (1), the expression obtained for the variance of the estimator is given by:

$$var(\hat{\beta}\_{j}) = \frac{\sigma^2}{RSS\_{j}}, \quad j = 1, \ldots, k,\tag{10}$$

where *RSS*<sub>*j*</sub> is the residual sum of squares of the auxiliary regression of the *j*-th independent variable as a function of the rest of the independent variables (see expression (6)).

From expression (10), and considering that expression (7) can be rewritten as:

$$VIFnc(j) = \frac{\mathbf{X}\_j^t \mathbf{X}\_j}{RSS\_j},$$

it is possible to obtain:

$$var(\hat{\beta}\_{j}) = \frac{\sigma^2}{RSS\_j} = \frac{\sigma^2}{\mathbf{X}\_j^t \mathbf{X}\_j} \cdot VIFnc(j), \quad j = 1, \ldots, k. \tag{11}$$

Establishing a model as a reference is required to conclude whether the variance has been inflated (see, for example, Cook [20]). Thus, if the variables in **X** are orthogonal, it is verified that **X**<sup>*t*</sup>**X** = *diag*(*d*<sub>1</sub>, ..., *d*<sub>*k*</sub>) where *d*<sub>*j*</sub> = **X**<sub>*j*</sub><sup>*t*</sup>**X**<sub>*j*</sub>. In this case, (**X**<sup>*t*</sup>**X**)<sup>−1</sup> = *diag*(1/*d*<sub>1</sub>, ..., 1/*d*<sub>*k*</sub>), and consequently, the variance of the estimated coefficients in the hypothetical orthogonal case is given by the following expression:

$$var(\hat{\beta}\_{j,o}) = \frac{\sigma^2}{\mathbf{X}\_j^t \mathbf{X}\_j}, \quad j = 1, \ldots, k. \tag{12}$$

In this case:

$$\frac{var(\hat{\beta}\_{j})}{var(\hat{\beta}\_{j,o})} = VIFnc(j), \quad j = 1, \dots, k,$$

and it is then possible to state that the VIFnc is a factor that inflates the variance.

As a consequence, high values of *VIFnc*(*j*) imply high values of *var*(*β̂*<sub>*j*</sub>) and a tendency not to reject the null hypothesis in the individual significance tests of model (1). Thus, the statistical analysis of the model will be affected.

Note from expression (11) that this negative effect can be offset by low values of the estimation of *σ*<sup>2</sup>, that is, low values of the residual sum of squares of model (1) or high values of the number of observations, *n*. This is similar to what happens with the VIF (see O'Brien [7] for more details).
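The identity behind expressions (10) to (12) can be verified numerically. The following sketch uses simulated regressors and an assumed value of *σ*<sup>2</sup> (both hypothetical, chosen only for illustration) and compares the ratio of the OLS variance to the orthogonal-case variance with the VIFnc computed from the noncentered auxiliary regressions.

```python
import numpy as np

# Hypothetical noncentered design: three regressors, no column of ones.
rng = np.random.default_rng(2)
n = 60
X = np.column_stack([
    rng.normal(3.0, 1.0, n),
    rng.normal(-2.0, 2.0, n),
    rng.normal(1.0, 0.5, n),
])
sigma2 = 4.0  # assumed disturbance variance, for illustration only

# Variance of the OLS estimators: sigma^2 times the diagonal of (X'X)^{-1}.
var_beta = sigma2 * np.diag(np.linalg.inv(X.T @ X))

# Variance in the hypothetical orthogonal case, expression (12).
var_orthogonal = sigma2 / (X ** 2).sum(axis=0)

# VIFnc from the noncentered auxiliary regressions, expression (7).
vifnc = np.empty(X.shape[1])
for j in range(X.shape[1]):
    xj = X[:, j]
    others = np.delete(X, j, axis=1)
    coef, *_ = np.linalg.lstsq(others, xj, rcond=None)
    resid = xj - others @ coef
    vifnc[j] = (xj @ xj) / (resid @ resid)

print("variance ratio:", var_beta / var_orthogonal)
print("VIFnc:         ", vifnc)
```

The two printed vectors should agree up to rounding, and every entry is at least one because the residual sum of squares of an auxiliary regression cannot exceed the sum of squares of the variable itself.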

#### **4. Auxiliary Centered Regressions**

The use of the coefficient of determination of the auxiliary regression (6), where matrix **X**<sub>−*j*</sub> contains a column of ones that represents the intercept, is a very common approach to detect the linear relations between the independent variables of the model (1). The motivation is that the stronger the relation between **X**<sub>*j*</sub> and the rest of the independent variables, that is, the higher the multicollinearity, the higher the value of that coefficient of determination.

However, since the coefficient of determination ignores the role of the intercept, this measure is unable to detect the nonessential linear relations. The question is evident: Does another measure exist related to the auxiliary regression that allows detection of the nonessential multicollinearity?

#### *4.1. Case When There Is Only Nonessential Multicollinearity*

**Example 3.** *Suppose that 100 observations are simulated for variables* **X***,* **Z** *and* **W** *from normal distributions with means of 5, 4 and −4 and standard deviations of 0.01, 4 and 0.01, respectively. Note that* **X** *and* **W** *present little variability and, for this reason, it is expected that the model presents nonessential multicollinearity.*

*Then,* **y** = **1** + **X** + **Z** − **W** + **v** *is generated by simulating* **v** *as a normal distribution with a mean equal to 0 and a standard deviation equal to 2.*

*The second column of Table 5 presents the results obtained after the estimation by ordinary least squares (OLS) of model* **y** = *β*<sub>1</sub> · **1** + *β*<sub>2</sub> · **X** + *β*<sub>3</sub> · **Z** + *β*<sub>4</sub> · **W** + **u***. Note that the estimations of the coefficients of the model differ substantially from the real values used to generate* **y***, except for the coefficient of the variable* **Z***, which is the variable free of multicollinearity (indeed, it is the only coefficient significantly different from zero at the 5% significance level, the value used by default in this paper). This situation illustrates the fact that if the interest is to estimate the effect of variable* **Z** *on* **y***, the analysis will not be influenced by the linear relations between the rest of the independent variables.*

**Table 5.** Estimation by ordinary least squares (OLS) of the first simulated model and its corresponding auxiliary regressions (estimated standard deviation in parenthesis and coefficients significantly different from zero in bold).


*This table also shows the results obtained from the estimations of the centered auxiliary regressions. Note that the coefficients of determination are very small, and consequently, the associated VIFs do not detect the degree of multicollinearity. However, note that in the auxiliary regressions corresponding to variables* **X** *and* **W***:*


*Thus, note that the auxiliary regressions are capturing the existence of nonessential multicollinearity. The problem is that it is not transferred to its coefficient of determination but to another characteristic.*

From this finding, it is possible to propose a way to detect the nonessential multicollinearity from the centered auxiliary regression traditionally applied to calculate the VIF:

**Condition 1 (C1):** Quantify the contribution of the estimation of the intercept to the total sum of the estimations of the coefficients of model (6), that is, calculate:

$$\frac{|\delta\_1|}{\sum\_{j=1}^{k-1} |\delta\_j|} \cdot 100\%.$$

**Condition 2 (C2):** Calculate the number of independent variables with coefficients significantly different from zero and quantify the contribution of the intercept.
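The two conditions can be sketched as follows. This is only one possible reading: **C2** is interpreted here as the share of the intercept among the coefficients that are individually significant (returned as `None` when no coefficient is significant), the 5% critical value is approximated by a fixed constant instead of being taken from a Student-t table, and the data, seed and function names are hypothetical.

```python
import numpy as np

# Approximate two-sided 5% Student-t critical value for about 98 degrees
# of freedom (an assumption made to avoid extra dependencies).
T_CRIT = 1.984

def centered_auxiliary(target, others):
    """OLS of target on a column of ones plus others: coefficients and t-ratios."""
    n = target.shape[0]
    Z = np.column_stack([np.ones(n), others])
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    delta = ZtZ_inv @ (Z.T @ target)
    resid = target - Z @ delta
    s2 = (resid @ resid) / (n - Z.shape[1])
    t_stats = delta / np.sqrt(s2 * np.diag(ZtZ_inv))
    return delta, t_stats

def condition_c1(delta):
    """Share (in %) of |intercept| in the sum of absolute estimated coefficients."""
    return 100.0 * abs(delta[0]) / np.abs(delta).sum()

def condition_c2(t_stats, t_crit=T_CRIT):
    """Share (in %) of the intercept among significant coefficients, or None."""
    significant = np.abs(t_stats) > t_crit
    if not significant.any():
        return None
    return 100.0 * float(significant[0]) / float(significant.sum())

# Hypothetical data: x2 has a small coefficient of variation, x3 does not.
rng = np.random.default_rng(4)
n = 100
x2 = rng.normal(5.0, 0.01, n)
x3 = rng.normal(4.0, 4.0, n)

delta, t_stats = centered_auxiliary(x2, x3)
c1 = condition_c1(delta)
c2 = condition_c2(t_stats)
print(f"C1 = {c1:.3f}%, C2 = {c2}")
```

With a nearly constant variable such as `x2`, the intercept of the auxiliary regression absorbs almost the whole sum of absolute coefficients, so **C1** is expected to be close to 100% under these assumptions.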

A Monte Carlo simulation is presented considering model (1) with *k* = 3, where the variable **X**<sub>2</sub> has been generated from a normal distribution with mean *μ*<sub>2</sub> ∈ **A** and variance *σ*<sub>2</sub><sup>2</sup> ∈ **B**, and the variable **X**<sub>3</sub> has been generated from a normal distribution with mean *μ*<sub>3</sub> ∈ **A** and variance *σ*<sub>3</sub><sup>2</sup> ∈ **C**, with **A** = {0, 1, 2, 3, 4, 5, 10, 15, 20}, **B** = {0.00001, 0.0001, 0.001, 0.1, **C**} and **C** = {1, 2, 3, 4, 5, 10, 15, 20}. The results are presented in Table 6. Taking into account that the sample size has varied within the set {15, 20, 25, ..., 140, 145, 150}, 235,872 iterations have been performed.

**Table 6.** Values of condition **C1** depending on the coefficient of variation (CV).


Considering the thresholds established by Salmerón Gómez et al. [10], 90% of the simulations present values for condition **C1** between 99.402% and 99.999% if *CV* < 0.06674082 and between 95.485% and 99.999% if *CV* < 0.1002506. Thus, we can consider that values of condition **C1** higher than 95.485% will indicate that the auxiliary centered regressions are detecting the presence of nonessential multicollinearity.

Table 7 shows that a high value can be obtained for condition **C1** even if none of the estimated coefficients is significantly different from zero (**C2** = NA).

Thus, the previous threshold, 95.485%, will be considered as valid if it is accompanied by a high value in the second condition.


**Table 7.** Values of condition **C1** depending on condition **C2**.

**Example 4.** *Applying these criteria to the data of Example 1 for Mod1, it is obtained that:*


*Thus, the symptoms shown in the previous simulation also appear, and consequently, in both situations, the nonessential multicollinearity will be detected.*

*Replicating both situations where the VIFnc was not able to detect the nonessential multicollinearity, it is obtained that:*

- *In the auxiliary regression* **X**<sub>2</sub> = *δ*<sub>1</sub> · **1** + *δ*<sub>4</sub> · **X**<sub>4</sub> + **w***, the estimation of the intercept is equal to 99.978% of the total, and the individual significance of the intercept corresponds to 100% of the significant estimated coefficients.*
- *In the auxiliary regression* **X**<sub>4</sub> = *δ*<sub>1</sub> · **1** + *δ*<sub>2</sub> · **X**<sub>2</sub> + **w***, the estimation of the intercept is equal to 50.138% of the total, and none of the estimated coefficients are significantly different from zero.*
- *In the auxiliary regression* **X**<sub>3</sub> = *δ*<sub>1</sub> · **1** + *δ*<sub>4</sub> · **X**<sub>4</sub> + **w***, the estimation of the intercept is equal to 99.984% of the total, and the individual significance of the intercept corresponds to 100% of the significant estimated coefficients.*
- *In the auxiliary regression* **X**<sub>4</sub> = *δ*<sub>1</sub> · **1** + *δ*<sub>3</sub> · **X**<sub>3</sub> + **w***, the estimation of the intercept is equal to 50.187% of the total, and none of the estimated coefficients are significantly different from zero.*

*Once again, it was shown that with this procedure, it is possible to detect the nonessential multicollinearity and the variables that are causing it.*

#### *4.2. Relevance of a Variable in a Regression Model*

Note that the conditions **C1** and **C2** are focused on measuring the relevance of one of the variables, in this case, the intercept, within the multiple linear regression model. It is interesting to analyze the behavior of other measures with this same goal as, for example, the index *ι*<sub>*j*</sub> of Stewart [9]. Given model (1), Stewart defined the relevance of the *j*-th variable as the number:

$$\iota\_j = \frac{|\beta\_j| \cdot ||\mathbf{X}\_j||}{||\mathbf{y}||}, \quad j = 1, \dots, p,$$

where || · || is the usual Euclidean norm. Stewart considered that a variable with a relevance higher than 0.5 should not be ignored.
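As an illustrative sketch of this index (with simulated data, a seed and a function name that are assumptions rather than material from the original study), the relevance of each column can be computed from the OLS estimates and the Euclidean norms of the columns:

```python
import numpy as np

def stewart_relevance(X, y):
    """Relevance |beta_j| * ||X_j|| / ||y|| of each column of X, using OLS estimates."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.abs(beta) * np.linalg.norm(X, axis=0) / np.linalg.norm(y)

# Hypothetical data: the level of y is driven almost entirely by the constant.
rng = np.random.default_rng(5)
n = 50
x = rng.normal(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x])
y = 10.0 + 0.5 * x + rng.normal(0.0, 1.0, n)

iota = stewart_relevance(X, y)
print("relevance of the constant:", round(float(iota[0]), 3))
print("relevance of x:           ", round(float(iota[1]), 3))
```

In this sketch the constant exceeds the 0.5 threshold simply because the dependent variable has a mean far from zero, which gives an idea of why the index tends to flag the intercept as relevant regardless of its role in the collinearity.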

**Example 5.** *Table 8 presents the calculation of ι*<sub>*j*</sub> *for the situations shown in Example 1. Note that in all cases, the intercept will be considered relevant, even when the variable* **X**<sub>4</sub> *is analyzed as a function of* **X**<sub>2</sub> *or* **X**<sub>3</sub>*, even though it was previously shown that the intercept was not relevant in these situations (at least in relation to nonessential multicollinearity).*

**Table 8.** Calculation of *ι*<sub>*j*</sub> for situations **Mod1**, **Mod2** and **Mod3** shown in Example 1.


*Thus, the application of ι*<sub>*j*</sub> *does not seem to be appropriate, contrary to what happens with conditions C1 and C2.*

#### *4.3. Case When There Is Generalized Nonessential Multicollinearity*

**Example 6.** *Suppose that the previous simulation is repeated, except for the generation of the variable* **Z***, which, in this case, is given by* *Z*<sub>*i*</sub> = 2 · *X*<sub>*i*</sub> − *a*<sub>*i*</sub>*, for* *i* = 1, ..., 100*, where* *a*<sub>*i*</sub> *is generated from a normal distribution with a mean equal to 2 and a standard deviation equal to 0.01.*

*Table 9 presents the results of the estimation by OLS of the model* **y** = *β*<sub>1</sub> · **1** + *β*<sub>2</sub> · **X** + *β*<sub>3</sub> · **Z** + *β*<sub>4</sub> · **W** + **u** *and its possible auxiliary regressions.*

*In this case, none of the coefficients are significantly different from zero and the coefficients are very far from the real values used in the simulation.*


**Table 9.** Estimation by OLS of the second simulated model and its corresponding auxiliary regressions (estimated standard deviation in parenthesis and coefficients significantly different from zero in bold).

*In relation to the auxiliary regression, it is possible to conclude that:*


Note that the existence of generalized nonessential multicollinearity distorts the symptoms previously detected. In a centered auxiliary regression, a contribution (in absolute terms) of the estimation of the intercept close to 100% of the total sum (in absolute value) of all estimations, together with the intercept being the only coefficient significantly different from zero, indicates nonessential multicollinearity. However, worrisome nonessential multicollinearity may exist without these symptoms being manifested. Thus, these conditions are sufficient but not necessary.

However, in the situations shown in Example 6, where conditions **C1** and **C2** are not verified, the VIFnc values are equal to 1,109,259.3, 758,927.7 and 100,912.7. These results complement those presented in the previous section in relation to the VIFnc: the VIFnc detects generalized nonessential multicollinearity, while conditions **C1** and **C2** detect the traditional nonessential multicollinearity given by Marquardt and Snee [6].
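A simulation in the spirit of Example 6 gives a rough idea of this behaviour. The sketch below does not reproduce the values reported above: the seed, the function names and the choice to compare against the centered VIF are illustrative assumptions.

```python
import numpy as np

def rss(target, regressors):
    """Residual sum of squares of the OLS regression of target on regressors."""
    coef, *_ = np.linalg.lstsq(regressors, target, rcond=None)
    resid = target - regressors @ coef
    return resid @ resid

def vifnc(data, j):
    """Noncentered VIF: auxiliary regression without intercept."""
    xj = data[:, j]
    return (xj @ xj) / rss(xj, np.delete(data, j, axis=1))

def vif_centered(data, j):
    """Centered VIF: auxiliary regression with a column of ones added."""
    xj = data[:, j]
    others = np.column_stack([np.ones(len(xj)), np.delete(data, j, axis=1)])
    return ((xj - xj.mean()) ** 2).sum() / rss(xj, others)

# Hypothetical replication of the design: all three variables vary very little.
rng = np.random.default_rng(6)
n = 100
X = rng.normal(5.0, 0.01, n)
a = rng.normal(2.0, 0.01, n)
Z = 2.0 * X - a
W = rng.normal(-4.0, 0.01, n)
data = np.column_stack([X, Z, W])

vifnc_values = np.array([vifnc(data, j) for j in range(3)])
vif_values = np.array([vif_centered(data, j) for j in range(3)])
print("VIFnc:", np.round(vifnc_values, 1))
print("VIF:  ", np.round(vif_values, 3))
```

Under these assumptions the noncentered values are several orders of magnitude larger than the centered ones, consistent with the VIFnc reacting to relations among variables with little variability.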

#### **5. Empirical Applications**

In order to illustrate the contribution of this study, this section presents two empirical applications with financial and economic real data. Note that in a financial prediction model, a financial variable with low variance means low risk and a better prediction, because the standard deviation and volatility are lower. However, as discussed above, a lower variance of the independent variable may mean greater nonessential multicollinearity in a GLR model. Thus, the existence of worrisome nonessential collinearity may be relatively common in financial econometric models and this idea can be extended in general to economic applications. Note that the objective is to diagnose the type of multicollinearity existing in the model and indicate the most appropriate treatment (without applying it).

#### *5.1. Financial Empirical Application*

The following model of Euribor (**100%**) is specified from the data set composed of 47 Eurozone observations for the period January 2002 to July 2013 (quarterly and seasonally adjusted data) and previously applied by Salmerón Gómez et al. [10]:

$$\mathbf{Euribor} = \beta\_1 + \beta\_2 \cdot \mathbf{HICP} + \beta\_3 \cdot \mathbf{BC} + \mathbf{u},\tag{13}$$

where **HICP** is the Harmonized Index of Consumer Prices (**100%**), **BC** is the Balance of Payments to net current account (millions of euros) and **u** is a random disturbance (centered, homoscedastic, and uncorrelated).

Table 10 presents the analysis of model (13) and its corresponding auxiliary regressions. The values of the VIFs, which are very close to one, indicate that there is no essential multicollinearity. The correlation coefficient between **HICP** and **BC** is 0.231 and the determinant of the correlation matrix is 0.946. Both values indicate that there is no essential multicollinearity, see García García et al. [21] and Salmerón Gómez et al. [22].
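As a quick check of these figures, recall that with only two nonconstant regressors the determinant of the correlation matrix (denoted here by $\mathbf{R}$, a symbol introduced only for this illustration) and the centered VIF depend solely on the correlation coefficient $r$ between **HICP** and **BC**. Using the rounded value $r = 0.231$:

$$\det(\mathbf{R}) = 1 - r^2 = 1 - 0.231^2 \approx 0.9466, \qquad \text{VIF} = \frac{1}{1 - r^2} \approx 1.056.$$

The determinant agrees with the reported 0.946 up to the rounding of $r$, and the resulting VIF is indeed close to one.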

However, the condition number is higher than 30, indicating strong multicollinearity associated with variable **HICP** (see conditions **C1** and **C2**). The values of conditions **C1** and **C2** are conclusive in the case of variable **HICP**. In the case of variable **BC**, although condition **C1** presents a high value, none of the coefficients of the auxiliary regression is significantly different from zero (condition **C2**). Following the simulation presented previously, this indicates that the variable **BC** is not related to the intercept. This conclusion is in line with the value of the coefficient of variation of variable **HICP**, which is lower than 0.1002506, the threshold established by Salmerón Gómez et al. [10] for moderate nonessential multicollinearity.
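The two diagnostics used in this paragraph, the condition number and the coefficient of variation, can be sketched as follows. This is an illustration on synthetic data, not the Euribor data set: the variable names, distributions and seed are hypothetical, the condition number is assumed to be computed after scaling the columns of **X** (intercept included) to unit length, and the coefficient of variation is assumed to use the standard deviation with divisor *n*.

```python
import numpy as np

def condition_number(design):
    """Ratio of the largest to the smallest singular value after scaling
    every column of the design matrix to unit length (assumed convention)."""
    scaled = design / np.linalg.norm(design, axis=0)
    singular_values = np.linalg.svd(scaled, compute_uv=False)
    return singular_values.max() / singular_values.min()

def coefficient_of_variation(values):
    """Standard deviation (divisor n, an assumption) over the absolute mean."""
    return np.std(values) / abs(np.mean(values))

CV_THRESHOLD = 0.1002506  # threshold quoted in the text for moderate nonessential multicollinearity
CN_THRESHOLD = 30         # condition number threshold quoted in the text

rng = np.random.default_rng(1)
n = 47
low_cv = rng.normal(100.0, 3.0, n)  # hypothetical variable with little relative variability
other = rng.normal(0.0, 1.0, n)     # hypothetical variable centered around zero
design = np.column_stack([np.ones(n), low_cv, other])

cn = condition_number(design)
cv_low = coefficient_of_variation(low_cv)
print("condition number above threshold:", cn > CN_THRESHOLD)
print("coefficient of variation below threshold:", cv_low < CV_THRESHOLD)
```

In this synthetic setup, the variable with a coefficient of variation below the threshold is nearly proportional to the column of ones, which is what pushes the condition number above 30 even though the two nonconstant variables are unrelated.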

Table 11 presents the calculation of the VIFnc. Note that it does not detect the nonessential multicollinearity. As previously commented, the VIFnc only detects the essential and the generalized nonessential multicollinearity. This table also presents the VIFnc calculated in a model without intercept but including the constant as an independent variable (see Section 2.3). In this case, the VIFnc is able to detect the nonessential multicollinearity between the intercept and the variable **HICP**.
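The effect of adding the constant as an explicit regressor can be sketched with synthetic data. The snippet below is only an illustration under stated assumptions: the VIFnc is taken to be $1/(1 - R^2_{nc})$ from the auxiliary regression without intercept, which coincides with the corresponding diagonal element of the inverse Gram matrix when the columns of the design matrix are scaled to unit length. The two variables are hypothetical stand-ins (one with a low coefficient of variation, one centered around zero), not the **HICP** and **BC** series.

```python
import numpy as np

def vif_noncentered_all(design):
    """Assumed VIFnc for every column of `design`, with no intercept added.

    After scaling each column to unit length, the j-th diagonal element of
    the inverse Gram matrix equals 1 / (1 - R^2_nc) for column j.
    """
    scaled = design / np.linalg.norm(design, axis=0)
    return np.diag(np.linalg.inv(scaled.T @ scaled))

rng = np.random.default_rng(2)
n = 47
low_cv = rng.normal(100.0, 3.0, n)    # hypothetical low-variability variable
zero_mean = rng.normal(0.0, 1.0, n)   # hypothetical variable centered around zero

# Noncentered model: only the two variables.
without_constant = vif_noncentered_all(np.column_stack([low_cv, zero_mean]))

# Same model, but with the column of ones treated as one more regressor.
with_constant = vif_noncentered_all(np.column_stack([np.ones(n), low_cv, zero_mean]))

print("VIFnc without the constant:", without_constant)
print("VIFnc with the constant:   ", with_constant)
```

In this synthetic setup, the VIFnc of the low-variability variable is close to one when the constant is omitted and becomes very large once the column of ones is included, which mirrors the behaviour described for Table 11.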

In conclusion, this model will present nonessential multicollinearity caused by the variable **HICP**. This problem can be mitigated by centering that variable (see, for example, Marquardt and Snee [6] and Salmerón Gómez et al. [10]).


**Table 10.** Estimations by OLS of model (13) and its corresponding auxiliary regressions (estimated standard deviation in parentheses and coefficients significantly different from zero in bold).

**Table 11.** VIFnc of auxiliary regressions associated to model (13).


#### *5.2. Economic Empirical Application*

From French economy data from Chatterjee and Hadi [23], also analyzed by Malinvaud [24], Zhang and Liu [25] and Kibria and Lukman [26], among others, the following model is analyzed:

$$\mathbf{I} = \beta\_1 + \beta\_2 \cdot \mathbf{D} \mathbf{P} + \beta\_3 \cdot \mathbf{S} \mathbf{F} + \beta\_4 \cdot \mathbf{D} \mathbf{C} + \mathbf{u},\tag{14}$$

for years 1949 through 1966, where imports (**I**), domestic production (**DP**), stock formation (**SF**) and domestic consumption (**DC**) are all measured in billions of French francs and **u** is a random disturbance (centered, homoscedastic, and uncorrelated).

Table 12 presents the analysis of model (14) and its corresponding auxiliary regressions. The values of the VIFs of variables **DP** and **DC** indicate strong essential multicollinearity. The condition number is higher than 30 also indicating a strong multicollinearity.

Note that the values of condition **C1** for variables **DP** and **DC** are lower than the threshold shown in the simulation. Only the variable **SF** presents a higher value but, in this case, condition **C2** indicates that none of the estimated coefficients of the auxiliary regression are significantly different from zero. This conclusion is in line with the coefficients of variation, which are higher than the threshold established by Salmerón Gómez et al. [10], indicating that there is no nonessential multicollinearity.

Table 13 presents the calculation of the VIFnc. Note that it detects the essential multicollinearity. This table also presents the VIFnc calculated in a model without intercept but including the constant as an independent variable. In this case, the VIFnc also detects the essential multicollinearity between the variables **DP** and **DC**. From the thresholds established by Salmerón Gómez et al. [10] for simple linear regression (*k* = 2), the value 60.0706 will not be worrisome and, consequently, the nonessential multicollinearity will not be worrisome.


**Table 12.** Estimations by OLS of Model (14) and its corresponding auxiliary regressions (estimated standard deviation in parentheses and coefficients significantly different from zero in bold).

**Table 13.** VIFnc of auxiliary regressions associated to Model (14).


To conclude, this model presents essential multicollinearity caused by the variables **DP** and **DC**. In this case, the problem will be mitigated by applying estimation methods other than OLS such as ridge regression (see, for example, Hoerl and Kennard [27], Hoerl et al. [28], Marquardt [29]), LASSO regression (see Tibshirani [30]), raise regression (see, for example, García et al. [31], Salmerón et al. [32], García and Ramírez [33], Salmerón et al. [34]), residualization (see, for example, York [35], García et al. [36]) or the elastic net regularization (see Zou and Hastie [37]).

#### **6. Conclusions**

The distinction between essential and nonessential multicollinearity and its diagnosis has not been adequately treated in either the scientific literature or statistical software, and this lack of information has led to mistakes in some relevant papers, for example Velilla [3] or Jensen and Ramirez [13]. This paper analyzes the detection of essential and nonessential multicollinearity from auxiliary centered and noncentered regressions, obtaining two complementary measures that are able to detect both kinds of multicollinearity. The relevance of the results is that they are obtained within an econometric context, encompassing the distinction between centered and noncentered models that is not only accomplished from a numerical perspective, as was the case presented, for example, in Salmerón Gómez et al. [12] or Salmerón Gómez et al. [10]. An undoubtedly interesting point of view of this situation is the one presented by Spanos [38], who stated: "It is argued that many confusions in the collinearity literature arise from erroneously attributing symptoms of statistical misspecification to the presence of collinearity when the latter is misdiagnosed using unreliable statistical measures". That is, the distinction related to the econometric model provides confidence to the measures of detection and avoids the problems commented on by Spanos.

From a computational point of view, this debate clarifies what is calculated when the VIF is obtained for centered and noncentered models. It also clarifies (see Section 2.3) what type of multicollinearity is detected (and why) when the uncentered VIF is calculated in a centered model. At the same time, a definition of nonessential multicollinearity is presented that generalizes the definition given by Marquardt and Snee [6]. Note that this generalization can be understood as a particular kind of essential multicollinearity: a near-linear relation between two independent variables with light variability. However, it is shown that this kind of multicollinearity is not detected by the VIF and, for this reason, we consider it more appropriate to include it within the nonessential multicollinearity.

In relation to the application of the VIFnc, this paper shows that the VIFnc detects the essential and the generalized nonessential multicollinearity and even the traditional nonessential multicollinearity if it is calculated in a regression without the intercept but including the constant as an independent variable. Note that the VIF, although widely applied in many different fields, only detects the essential multicollinearity. This paper has also analyzed why the VIF is unable to detect the nonessential multicollinearity, and two conditions are presented as sufficient (but not required) to establish the existence of nonessential multicollinearity. Since these conditions, **C1** and **C2**, are based on the relevance of the intercept within the centered auxiliary regression to calculate the VIF, this scenario was compared to the measure proposed by Stewart [9], *ıj*, to measure the relative importance of a variable within a multiple linear regression. It is shown that conditions **C1** and **C2** are preferable to the calculation of *ıj*.

To summarize:


To conclude, in order to detect the kind of multicollinearity and its degree, the greatest possible number of measures must be used (variance inflation factors, condition number, correlation matrix and its determinant, coefficient of variation, conditions **C1** and **C2**, etc.), as in Section 5, and it is inefficient to limit oneself to the management of only a few. Similarly, it is necessary to know what kind of multicollinearity each one of them is capable of detecting.

Finally, the following will be interesting as future lines of inquiry:


**Author Contributions:** Conceptualization, R.S.G. and C.G.G.; methodology, R.S.G.; software, R.S.G.; validation, R.S.G., C.G.G. and J.G.P.; formal analysis, R.S.G. and C.G.G.; investigation, R.S.G., C.G.G. and J.G.P.; resources, R.S.G., C.G.G. and J.G.P.; writing—original draft preparation, R.S.G. and C.G.G.; writing—review and editing, R.S.G. and C.G.G.; supervision, J.G.P.; project administration, R.S.G.; funding acquisition, J.G.P. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by University of Almería.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
