1. Introduction
Regression analysis is one of the most widely used tools in statistical modeling and prediction. It helps researchers and practitioners to identify approximate analytical relationships between outcome variables and predictor variables using multiple empirical observations. Numerous textbooks, monographs, and research papers have been devoted to various types of regression modeling.
Advanced regression methods developed to mitigate multicollinearity effects include modern approaches such as penalized ridge regression, the least absolute shrinkage and selection operator (LASSO), elastic net regression, least angle regression (LARS), and partial least squares (PLS) [1]. Tools designed for big data, statistical learning, and machine learning applications include support vector machines (SVM) and support vector regression (SVR), classification and regression trees (CART), prediction algorithms such as CHAID, multivariate adaptive regression splines (MARS), AdaBoost, neural networks (NN) for deep learning, as well as Bayesian and generalized linear mixed models (GLMM) [2]. These methods, directly or indirectly, use regression as an important component.
Specialized regression techniques have also been proposed, including the stability selection approach [3], the use of Shapley values in regression estimation [4,5], models accounting for errors in predictors [6], and various strategies for identifying the “best” model [7,8]. Many manuals now present applied statistical techniques and their implementation in software environments such as R [9,10], Python [11,12], and other statistical and mathematical packages. However, the wide availability and ease of use of these software tools have led to uncritical applications of regression analysis, where practical studies often stop at reporting seemingly favorable t-statistics, p-values, and model fit indices as the final proof of validity. The present paper proposes a more critical examination of this practice and explores ways to improve it.
In this study, multiple large datasets were generated through sampling simulation, and the results of regression modeling were compared with the parameters originally used in the simulations. The simulated cases were characterized by two meta-parameters of the data: the level of uncertainty and the level of multicollinearity. The results show that there are situations in which simple precision-based criteria may provide more meaningful insights than the conventional indicators automatically produced by regression software and routinely used by practitioners.
It is also demonstrated that multicollinearity can have detrimental effects, producing biased and inflated regression coefficients, even when correlations among predictors are relatively low. Novel methods for selecting the best variables, based on the so-called reference matrix and an efficiency indicator, are proposed, and their usability is demonstrated.
The results are obtained under idealized conditions, where datasets contain a very large number of observations (five thousand), are generated from well-behaved sources (normal distributions), and are free from heterogeneity, clustering, or outliers. For real-world datasets, which are typically smaller and less homogeneous, the effects identified here may be even more pronounced, highlighting the need for caution in the common use of regression modeling in applied statistical analysis. To test this, we also constructed and analyzed “distorted datasets” that deviate from the ideal assumptions through a clustered structure, and we applied the proposed techniques to a real-data example.
The paper is organized as follows. After the Introduction, Section 2 presents the experimental design, and Section 3 describes multiple regression features. The next sections discuss the obtained numerical results: Section 4, quality of the retrieved regression parameters; Section 5, estimations by the reference matrix; Section 6, analysis of the predictors’ contribution; Section 7, efficiency indicator and errors; and Section 8, selection of variables in models. Section 9 describes an example of modeling with real data, and Section 10 summarizes the results.
2. Experimental Design
We generated the data for the multiple linear regression model

y_i = a_0 + \sum_{j=1}^{n} a_j x_{ij} + e_i,   (1)

where x_ij is the ith observation (i = 1, 2, …, N, the number of observations) of the jth predictor (j = 1, 2, …, n, the number of predictors), y_i is the ith observation of the dependent variable, e_i denotes the error, and a_j are the coefficients of the regression.
The number of predictors in the simulations is n = 10. The predictors were set as follows: a subgroup {x1, x2, x3, x4} of correlated variables, another subgroup {x5, x6, x7} of correlated variables, and the remaining variables {x8, x9, x10}, which are uncorrelated. The variables {x1, x5, x8, x9, x10} were generated as normally distributed independent random variables with mean values and standard deviations equal to 1. The variables {x2, x3, x4} were set as correlated one by one with x1, and similarly, the variables {x6, x7} are correlated with x5. The correlated variables were obtained via the well-known formula for the bivariate Cholesky decomposition [13]:

z_2 = \rho z_1 + \sqrt{1 - \rho^2}\, z,   (2)

where z_1 and z are normally distributed random variables with mean 0 and standard deviation 1, ρ is the desired correlation, and z_2 has this correlation with z_1. In a dataset, the level of correlation was taken to be equal for all predictors. Correlations between predictors in different datasets were put at one of five levels (3); the lowest of these were 0.05 and 0.30.
The last two of these levels correspond to a high degree of multicollinearity between the independent variables. When x2 and x3 are each highly correlated with x1, the correlation between x2 and x3 is also high. So, the predictors {x1, x2, x3, x4} are mutually correlated in one subset, and the predictors {x5, x6, x7} are correlated in another subset, but there is no or low correlation between variables from different subsets or with the last three independent variables {x8, x9, x10}.
For the generated predictors, the corresponding coefficients of regression were taken as the constants shown in Table 1.
The purpose of this selection of coefficient values was twofold. First, they should affect the results in different ways: some strongly (values such as 20 and 24), some weakly (values such as 1 or 2), and others at a middle level. Second, the same coefficient was assigned to several variables to capture the effect of mutual correlations between independent variables. Accordingly, the values 1, 10, and 24 were each used twice, once within the correlated subgroups and once among the uncorrelated variables. This setting allows checking how estimates of identical coefficient values differ depending on the correlations between variables.
The outcome variable y was defined with zero intercept by the linear combination (1) of the predictors {x1, …, x10} with the parameters from Table 1. The added random error e in (1) was defined by the normal distribution with zero mean and standard deviation \sigma_e taken at three levels, A, B, and C, as follows:

\sigma_e = 20 (A),   \sigma_e = 100 (B),   \sigma_e = 200 (C).   (4)

These values imitate situations with low, medium, and high levels of uncertainty in the data. Indeed, with the mean value of each predictor equal to one, the mean level of the dependent variable y equals approximately the total of the coefficients in Table 1, which is 126. Relative to this mean level of y, the added random noise of the values (4) constitutes 20/126, 100/126, and 200/126, or about 16%, 80%, and 160%, respectively, for the low, medium, and high levels of data distortion by the errors.
With 5 values of correlations in (3) and 3 values of errors in (4), altogether 15 combinations of these parameters were considered. For each of these 15 scenarios, 120 random samples were generated, which yielded 1800 generated datasets. Each dataset consisted of multivariate data points (y_i, x_i1, …, x_in), with the number of generated observations N = 5000. Finding the parameters of the model (1) from the generated data and averaging them over the 120 samples within each of the 15 simulation cases, we can compare the obtained results with the original constants in Table 1 employed in the data-generating process.
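As an illustration of this design, the following Python sketch generates one such dataset. The coefficient values are our reading of Table 1 from the descriptions in the text (the value 10 for x9 is inferred from the stated total of 126 and is therefore an assumption), and the function and variable names are ours.

```python
import numpy as np

def generate_dataset(N=5000, rho=0.3, sigma_e=100.0, seed=0):
    """Generate one dataset for model (1): two correlated subgroups of predictors
    plus three independent ones, all with mean 1 and standard deviation 1."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((N, 10))          # base standard-normal variables

    X = np.empty((N, 10))
    X[:, 0] = z[:, 0]                         # x1
    for j in (1, 2, 3):                       # x2, x3, x4 correlated with x1 via (2)
        X[:, j] = rho * z[:, 0] + np.sqrt(1 - rho**2) * z[:, j]
    X[:, 4] = z[:, 4]                         # x5
    for j in (5, 6):                          # x6, x7 correlated with x5 via (2)
        X[:, j] = rho * z[:, 4] + np.sqrt(1 - rho**2) * z[:, j]
    X[:, 7:10] = z[:, 7:10]                   # x8, x9, x10 independent
    X += 1.0                                  # shift all means from 0 to 1

    # Coefficients consistent with the description of Table 1 (sum = 126);
    # the value 10 for x9 is an assumption.
    a = np.array([1, 2, 10, 12, 20, 22, 24, 1, 10, 24], dtype=float)

    e = rng.normal(0.0, sigma_e, size=N)      # error term of (1), level set by sigma_e
    y = X @ a + e                             # zero intercept, as described above
    return X, y, a
```

Calling generate_dataset(rho=0.05, sigma_e=20), for example, corresponds to the lowest-correlation, lowest-error scenario.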
To examine how different features of regression behave when the data have a more heterogeneous structure, a “distorted dataset” (DD) was also created. The dataset contained 5000 observations constructed in such a way that three variables, instead of being drawn from a single normal distribution as described earlier, were generated from two different distributions. Specifically, for the variable x1, the first 2000 observations were sampled from a normal distribution with the mean 1 (as before), while the remaining 3000 observations were sampled from a distribution with the mean value 25 (standard deviation equals 1 in both cases). The variable x5 was generated similarly: the first 2000 observations had the mean −20, while the rest of the observations had the mean 1. For the variable x10, the first 1000 observations had the mean value 30, and the last 4000 observations had the mean value 1. This type of distortion created a complex cluster structure in the data. Since x1 and x5 served as the basis for three and two other variables, respectively, those were also affected. The variable x10, being independent, did not influence other predictors, but it strongly affected the target variable because it had the largest coefficient, 24. Naturally, there are countless ways to “distort” the original smooth dataset, and this particular way was chosen purely for illustrative purposes. The results of simulations with DD will be noted in commentary alongside the main conclusions.
3. Several Relations on Multiple Regression
Let us briefly describe the main regression modeling formulae needed for further consideration. For each of the generated datasets, the regression model (1) was built by the ordinary least squares (OLS) criterion of error minimization:

S^2 = \sum_{i=1}^{N} \left( y_i - a_0 - \sum_{j=1}^{n} a_j x_{ij} \right)^2.   (5)

The minimum of the objective (5) can be found by setting to zero its derivatives with respect to the parameters, which yields the so-called normal system of equations:

a_0 N + \sum_{j=1}^{n} a_j \sum_{i=1}^{N} x_{ij} = \sum_{i=1}^{N} y_i,
a_0 \sum_{i=1}^{N} x_{ik} + \sum_{j=1}^{n} a_j \sum_{i=1}^{N} x_{ik} x_{ij} = \sum_{i=1}^{N} x_{ik} y_i,   k = 1, 2, …, n.   (6)

The first equation in (6) can be divided by N, producing the expression for the intercept:

a_0 = \bar{y} - \sum_{j=1}^{n} a_j \bar{x}_j.   (7)

Using (7) in the other n Equations (6) leads to the system which in matrix form is

C_{xx} a = c_{xy},   (8)

where the matrix C_xx and vector c_xy are defined as follows:

C_{xx} = \frac{1}{N} X'X,   c_{xy} = \frac{1}{N} X'y,   (9)

with X denoting the N × n matrix of the centered observations of the predictors, the prime denoting transposition, and y the vector of the centered observations of the dependent variable. The matrix C_xx and vector c_xy correspond to the sample covariance matrix of the predictors x and the vector of their covariances with the dependent variable y. Solution of Equation (8) via the inverse matrix C_{xx}^{-1} yields the vector a of the coefficients of the OLS multiple linear regression:

a = C_{xx}^{-1} c_{xy}.   (10)
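A minimal numerical sketch of Equations (7)–(10) in Python; the function name and the use of NumPy are our choices, and the inputs are assumed to be arrays X (N × n) and y such as those produced by the simulation above.

```python
import numpy as np

def ols_by_normal_equations(X, y):
    """Solve the normal system (8)-(10) and recover the intercept from (7)."""
    N = X.shape[0]
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean                    # centered predictors
    yc = y - y_mean                    # centered dependent variable

    Cxx = Xc.T @ Xc / N                # sample covariance matrix of predictors, (9)
    cxy = Xc.T @ yc / N                # covariances of predictors with y, (9)

    a = np.linalg.solve(Cxx, cxy)      # coefficients of regression, Equation (10)
    a0 = y_mean - x_mean @ a           # intercept, Equation (7)
    return a0, a
```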
Let us add the following feature, useful in applications of the normal system (6): it corresponds to the hyperplane going through the 1 + n points of the (weighted) mean values of all variables. Under the assumption that the total of any x differs from zero, it is possible to divide each jth Equation (6) by the total \sum_{i=1}^{N} x_{ij} multiplying the intercept, which yields

a_0 + \sum_{k=1}^{n} a_k \frac{\sum_{i=1}^{N} x_{ij} x_{ik}}{\sum_{i=1}^{N} x_{ij}} = \frac{\sum_{i=1}^{N} x_{ij} y_i}{\sum_{i=1}^{N} x_{ij}},   j = 1, 2, …, n.   (11)

Let us introduce the weight of the ith observation in the total of each variable xj:

w_{ij} = \frac{x_{ij}}{\sum_{i=1}^{N} x_{ij}},   (12)

with the weights for each jth variable summing to one. Then we can define the weighted means appearing in (11), denoted by bars, as follows:

\bar{y}^{(j)} = \sum_{i=1}^{N} w_{ij} y_i,   \bar{x}_k^{(j)} = \sum_{i=1}^{N} w_{ij} x_{ik},   (13)

where \bar{y}^{(j)} and \bar{x}_k^{(j)} are the mean values of y and xk, respectively, weighted by the jth set of weights built from xj (j = 1, 2, ..., n) as in (12). Then the system (11) can be written compactly as

a_0 + \sum_{k=1}^{n} a_k \bar{x}_k^{(j)} = \bar{y}^{(j)},   j = 1, 2, …, n.   (14)

Therefore, the linear regression can be seen as a hyperplane passing through the 1 + n points of the mean values of each variable weighted by each other variable. Properties of such a hyperplane and its construction were described in detail in [14].
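A sketch of how the reference points (13)–(14) can be assembled from raw (uncentered) data in Python, assuming every predictor has a nonzero column total; the function and variable names are ours.

```python
import numpy as np

def reference_matrix(X, y):
    """Build the reference matrix of Equation (14): row j contains the means of
    all predictors and of y weighted by w_ij = x_ij / sum_i x_ij, Equation (12)."""
    totals = X.sum(axis=0)          # assumed nonzero for every predictor
    W = X / totals                  # weights (12); each column sums to one
    rm_x = W.T @ X                  # element [j, k] is the weighted mean of x_k, (13)
    rm_y = W.T @ y                  # element [j] is the weighted mean of y, (13)
    # Row j of the hyperplane system (14): a0 + sum_k a_k * rm_x[j, k] = rm_y[j]
    return rm_x, rm_y
```

Appending the row of ordinary means (from Equation (7)) gives all 1 + n points through which the regression hyperplane passes.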
The set of mean values in (14) can be called the reference matrix (RM), because it consists of the points through which the hyperplane of the multivariate regression goes. Let us make several observations about the RM in its relation to the normal system (6). The normal system is often built with the exclusion of the intercept, by centering the variables. But the total of the values of a centered variable equals zero, so it is impossible to divide by it as was done in the transition from (6) to (11). However, for the centered data, in place of any second moment \sum_i x_{ij} x_{ik} in (6) there is the centered second moment, which can be transformed as follows:

\sum_{i=1}^{N} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k) = \sum_{i=1}^{N} x_{ij} (x_{ik} - \bar{x}_k) = \left( \sum_{i=1}^{N} x_{ij} \right) \sum_{i=1}^{N} w_{ij} (x_{ik} - \bar{x}_k).   (15)

Then it is still possible to divide each jth row of the normal system (6) by the corresponding total \sum_{i=1}^{N} x_{ij} of the original variable, and the transformation to (14) (without the intercept) holds for the centered variables weighted by the same weights (12). The same is true for the standardized data.
Another specific feature of the RM (14) is that the values in each jth column share the same units of measurement, namely the units of the jth variable. This makes it possible to consider correlations between the values in different columns of the RM (14), which can be used for a special aim described further. This property of the RM differs from the features of the normal system (6) with its matrix of second moments, where each element of the non-standardized matrix carries the units of a product of two variables; thus, it is impossible to find meaningful correlations between columns whose elements have different units of measurement.
The significance of the coefficients of regression can be measured by their t-statistics, defined as the quotient of the parameter a_j and its standard deviation s_{a_j} (whose square is determined by the model residual variance and the corresponding diagonal element of the inverted correlation matrix):

t_j = \frac{a_j}{s_{a_j}}.   (16)

An absolute value of t_j above 1.96 corresponds to a coefficient that is statistically significantly different from zero at the 95% confidence level, and the related confidence intervals can be built.
Another way to directly measure the precision of the recovered regression coefficients in the simulated estimations is the relative deviation (RD) of each jth retrieved parameter a_j from the generating value a_j^{gen} in Table 1:

RD_j = \frac{a_j - a_j^{gen}}{a_j^{gen}}.   (17)

It can also be calculated as the absolute value of the relative deviation.
For the standardized variables (centered and normalized by their standard deviations, \sigma_{x_j} for xj and \sigma_y for y, respectively), the matrix C_xx and vector c_xy in (8)–(9) reduce to the correlation matrix of the predictors R_xx and the vector r_xy with elements r_jy of correlations between the jth predictor xj and the dependent variable y. Then solution (10) produces the normalized coefficients of regression, also known as beta coefficients, which are connected with the original coefficients by the relation

b_j = a_j \frac{\sigma_{x_j}}{\sigma_y}.   (18)
The total quality of the data fit by the regression model is commonly estimated by the coefficient of multiple determination R^2, which can be calculated as the sum of the predictors’ contributions d_j:

R^2 = \sum_{j=1}^{n} d_j = \sum_{j=1}^{n} b_j r_{jy}.   (19)

The value d_j equals the product of the beta coefficient b_j of the jth predictor in the multiple regression and the coefficient of correlation r_jy between xj and y, which also coincides with the coefficient of the pair regression of y by xj. R^2 belongs to the interval [0, 1] and reaches its maximum of 1 when the residual sum of squares S^2 (5) goes to zero. The product d_j can be understood as the contribution of the jth predictor to the model quality measured by R^2: dividing Equation (19) by R^2 allows for calculating the relative contributions of the variables:

\frac{d_j}{R^2} = \frac{b_j r_{jy}}{R^2},   \sum_{j=1}^{n} \frac{d_j}{R^2} = 1.   (20)

However, this interpretation is justified only if the beta coefficients have the same signs as the corresponding coefficients of correlation, so that the contributions are positive. For highly correlated predictors, in other words, under multicollinearity among the variables, the coefficients of regression can be biased and inflated, and those closer to zero can change sign relative to the sign of the pair correlation. In such cases, more complicated methods can be applied for the estimation of the predictors’ contribution to the model (for example, see [4]). Sometimes, it is also interesting to find the shares of the two terms in the product d_j and to calculate the relative importance of each multiplier via their logarithms:

\frac{\ln r_{jy}}{\ln r_{jy} + \ln b_j},   \frac{\ln b_j}{\ln r_{jy} + \ln b_j}.   (21)

Negative r_jy and b_j can be taken by absolute value.
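A Python sketch of the contributions (19)–(20) and of the logarithmic shares following our reading of Equation (21); all names are illustrative.

```python
import numpy as np

def contributions(X, y):
    """Contributions d_j = b_j * r_jy (19), relative shares (20), and the share
    of the correlation in each product via logarithms, per our reading of (21)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)    # standardized predictors
    ys = (y - y.mean()) / y.std()                # standardized dependent variable

    Rxx = np.corrcoef(Xs, rowvar=False)          # correlation matrix of predictors
    rxy = Xs.T @ ys / len(ys)                    # correlations r_jy with y
    beta = np.linalg.solve(Rxx, rxy)             # beta coefficients, solution (10)

    d = beta * rxy                               # contributions, Equation (19)
    R2 = d.sum()                                 # coefficient of multiple determination
    shares = d / R2                              # relative contributions, Equation (20)

    # Negative r_jy and b_j are taken by absolute value, as noted in the text.
    log_r, log_b = np.log(np.abs(rxy)), np.log(np.abs(beta))
    corr_share = log_r / (log_r + log_b)         # share of the correlation in d_j, (21)
    return d, shares, corr_share
```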
Besides finding the key drivers, or the most important variables in the model, it is sometimes necessary to find and eliminate the less influential data points in order to build regressions on reduced data. For this aim, the coefficient of multiple determination can be presented in either of two forms: as the total over all observations of the product of the original standardized dependent variable y_i and its prediction y_{i,pred} by the model, or as the total of this predicted variable squared:

R^2 = \sum_{i=1}^{N} y_i\, y_{i,pred} = \sum_{i=1}^{N} y_{i,pred}^2.   (22)

Thus, it is possible to order observations by their contribution to the quality of fit, and such ordering can be simplified to ordering by the squared values y_i^2 of the normalized dependent variable. This approach corresponds to the earlier works by Wald and by Bartlett, where, with two variables x and y, the ith observations are ordered by x_i, the whole set is divided into three (or another number of) groups, and the middle part is omitted. As shown in [15,16,17,18], connecting the mean points of the two far-distanced remaining groups gives an unbiased estimation of the coefficient of the pair regression. In the current estimations with multidimensional simulated observations on many predictors, the ordering was performed by the values of the squared normalized dependent variable.
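A small sketch of this ordering in Python; the fraction of observations retained is a tunable assumption, and the names are ours.

```python
import numpy as np

def keep_most_influential(X, y, keep_fraction=0.15):
    """Order observations by the squared standardized dependent variable, as in
    the discussion of Equation (22), and keep the top share of them."""
    ys = (y - y.mean()) / y.std()            # standardized dependent variable
    order = np.argsort(ys**2)[::-1]          # largest squared values first
    n_keep = int(round(keep_fraction * len(y)))
    idx = order[:n_keep]
    return X[idx], y[idx]
```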
The Variance Inflation Factor (VIF) is a traditional indicator reported in statistical packages as a measure of multicollinearity between the predictors and is actively used in statistical practice for the elimination of variables with high correlations with other predictors. For the standardized variables, the VIF of each predictor xj is defined as the jth diagonal element of the inverse correlation matrix R_{xx}^{-1}, which in simpler notation is

VIF_j = (R_{xx}^{-1})_{jj}.   (23)

Also, as is known, the coefficient of multiple determination R_j^2 in the model with xj as the dependent variable regressed on all other predictors can be found from the value (23) as follows:

R_j^2 = 1 - \frac{1}{VIF_j}.   (24)
Let us introduce an index which combines two useful features of a predictor in one measure: how strongly it is related to the target variable on the one hand, and to all other predictors on the other. For this aim, we propose the following Relation Index (RI):

RI_j = \frac{r_{jy}^2}{r_{jy}^2 + R_j^2},   (25)

where r_{jy}^2 is the squared coefficient of correlation between xj and y, and R_j^2 is the coefficient of multiple determination (24) between xj and the other predictors. This index lies between 0 and 1, corresponding to a small or a large relation of xj to the dependent variable in comparison with its relation to the other predictors.
Formula (25) can be simplified to another index, which we will call the efficiency indicator (EI):

EI_j = \frac{r_{jy}^2}{r_{jy}^2 + \frac{1}{n-1} \sum_{k \neq j} r_{jk}^2}.   (26)

It is built similarly to (25), but instead of the multiple determination, the mean value of the squared correlations of xj with the other predictors is employed. It is a logical and natural gauge: predictors highly correlated with the target variable, while not highly correlated with the other predictors, are the best candidates for a good fit in models without multicollinearity, and hence with interpretable coefficients of regression. Both measures (25) and (26) are highly correlated, but (26) is simpler to calculate in practice, so we focus on it further.
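A Python sketch computing VIF (23), R_j^2 (24), and the two indices under our reconstruction of Formulas (25)–(26); the function name and the exact forms of RI and EI are assumptions based on the description above.

```python
import numpy as np

def vif_ri_ei(X, y):
    """VIF (23), R_j^2 (24), and the Relation Index / Efficiency Indicator as
    reconstructed in (25)-(26) from the textual description."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    ys = (y - y.mean()) / y.std()

    Rxx = np.corrcoef(Xs, rowvar=False)
    rxy = Xs.T @ ys / len(ys)                     # correlations of predictors with y

    vif = np.diag(np.linalg.inv(Rxx))             # Equation (23)
    R2_j = 1.0 - 1.0 / vif                        # Equation (24)

    n = Rxx.shape[0]
    mean_sq_corr = ((Rxx**2).sum(axis=1) - 1.0) / (n - 1)  # mean r^2 with other predictors
    ri = rxy**2 / (rxy**2 + R2_j)                 # Relation Index, our reading of (25)
    ei = rxy**2 / (rxy**2 + mean_sq_corr)         # Efficiency Indicator, our reading of (26)
    return vif, R2_j, ri, ei
```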
4. Simulation Results on the Quality of Retrieved Regression Parameters
For the recovered regression parameters in the simulated datasets, the results are presented in a series of figures, most of which share a similar structure. Each figure consists of three panels corresponding to the three predefined error levels (see Section 2). In each panel, the x-axis typically represents the correlation level among predictors (3). Anticipating the results (explained in more detail below), these three error levels yield coefficients of determination for their respective groups in the approximate ranges of 85–94%, 18–38%, and 5–14%, or roughly R^2 of about 90%, 30%, and 10%.
As established, all true regression coefficients were assigned positive values (see Table 1). However, a substantial number of estimated coefficients turned out to be negative, which is shown in Figure 1. Only variables x6, x7, and x10, which have the largest true values of 22 and 24, did not exhibit this effect under any condition. For clarity, these variables, as well as some others, are omitted from the chart. For the remaining variables, the proportion of negative coefficient estimates increases with higher error levels. Even when correlations among predictors are small, variables x1, x2, and x8, which have the smallest true coefficients of 1 and 2 (see Table 1), show between 10% and 30% negative estimates for the error level of 100, and 20% to 40% for the error level of 200. This indicates that smaller coefficients are more vulnerable to sign reversal due to the inflationary effects of multicollinearity, particularly under high uncertainty. Only under low-error conditions (error level 20) are these coefficients estimated without severe distortion.
For the distorted data (DD), the percentage of negative coefficients is somewhat higher, though not dramatically: for x1 33% versus 30% in Panel A, and 46% versus 41% in Panel B; for x2 39% versus 30% in Panel B, and similar patterns are observed for the other variables.
Figure 2 presents the percentage of non-significant regression coefficients, as determined by t-statistics, across different combinations of the simulation parameters. The results are shown for the 95% confidence level, i.e., for t-values with absolute magnitude less than 1.96. As can be seen, this percentage can be quite high, even for variables in Panel A. This implies that although, by construction, all variables are truly nonzero (and therefore significant), in many parameter combinations the t-test incorrectly rejects them, concluding that the variables are not significantly different from zero and should be disregarded.
The t-statistic is also highly sensitive to the inter-predictor correlations. For example, variable x7, whose true coefficient equals 24, yields about 20% of “insignificant” estimates in cases with higher correlations (see Panel C).
For the distorted data (DD), the overall patterns are similar, though the specific percentages vary. In some cases, they are higher—for instance, for x1 96% versus 90% in Panel A, and for x6 25% versus 7% in Panel C—while in others they are lower (e.g., for x1 23% versus 90% in Panel A). In general, most of the values remain close between the standard and distorted datasets, showing that the t-statistic’s results are consistently unreliable for different datasets.
Figure 3 presents the average relative deviation (RD) of the regression coefficients, as defined in Equation (17). The values are calculated for positive deviations and averaged across approximately half of the 120 samples in each of the 15 scenarios described in Section 2 (results for negative deviations are similar). For clarity, only four variables are shown: x1 and x8, which have the smallest true coefficients equal to 1, and x7 and x10, which have the largest coefficient equal to 24 (see Table 1). As shown in Panels B and C, the average RD for x1 (with the true coefficient 1) can be 6–12 times greater than the actual value when this variable is highly correlated with other predictors.
Several conclusions follow from Figure 3. First, the higher the level of data uncertainty (the error), the higher the RD of the estimated parameters; this trend is consistent across all variables. This observation is intuitively expected: detecting a weak signal amid strong noise becomes more difficult, even with a large sample size of 5000 observations. Second, the difference in estimation quality between small and large coefficients is quite noticeable: the largest coefficients, 24, are generally estimated with much greater accuracy. Finally, the relative error appears to be highly sensitive to correlations among the independent variables, even when these correlations are only moderate. For x1, RD increases sharply within each panel, reaching extremely high average values, often exceeding the true coefficient by more than an order of magnitude. A similar, though less pronounced, pattern is observed for variables with larger coefficients. This observation supports the conclusion that multicollinearity effects can appear even under moderate correlations, particularly in the presence of high data uncertainty. Greater attention should therefore be paid to detecting and mitigating these effects. Consequently, regression models with low coefficients of determination, which are common in practical statistical research, are especially vulnerable to large estimation errors in their coefficients.
For the distorted data, the picture is practically the same, with one exception: the coefficient for x2 is estimated with very large errors, with RD reaching 90% or more in Panels B and C.
5. Estimations by the Reference Matrix
Let us consider another approach to analyzing the normal system by transforming it from the correlation matrix into the matrix of weighted means, defined in Equations (11)–(14). We refer to this transformed matrix as the reference matrix (RM). To begin, let us examine the correlations observed in the simulated datasets.
Figure 4 illustrates how the various predictors are correlated with the dependent variable. The results appear intuitively reasonable: all correlations are positive, and their magnitudes decrease noticeably as the level of noise in the data increases. Three variables, x5, x6, and x7, exhibit behavior distinct from all the others: their correlations with y increase sharply as multicollinearity rises. This can be explained by the fact that these variables have the highest regression coefficients of 20, 22, and 24, respectively. What is more surprising, however, is that variable x10, which also has the largest coefficient, 24, begins with the same correlation value as its counterpart x7 but then decreases, whereas x7 continues to increase.
Figure 5 shows the correlations between the same variables, but this time computed as the correlations between the columns of the mean values of the predictors and the column of the mean values of the dependent variable in the reference matrix (14). The striking difference is that these correlations are unaffected by the level of uncertainty—their patterns remain identical across all three panels. This indicates that the RM, constructed from the mean values, preserves its structure regardless of noise in the data. The reason is straightforward: mean values are inherently stable measures, largely resistant to random errors, at least in the absence of significant outliers.
Another noticeable difference is that the three variables x5, x6, and x7, which already had the highest correlations, now exhibit even stronger values, while all the others, in contrast, show markedly reduced correlations, often turning strongly negative. This effect arises from the centering applied within the columns of the RM (14), which influences the correlation estimates. Consequently, transforming the system into the reference matrix provides a clearer distinction between groups of variables. In the present example, the three variables with the highest correlations with y emerge as the most important ones, offering a natural basis for reducing the number of predictors, which is one of the recurring challenges in regression modeling (see Section 8).
6. Estimation of Variables’ Contribution
The selection of good variables can also be guided by their contributions to the model’s quality, described in Formulas (19)–(21). Contributions of the predictors are negative in a noticeable share of cases for three variables: x1 in 27% of the samples (481 out of 1800), x2 in 14%, and x8 in 13%. For the other variables, this number is either very small, less than 1% (for x3, x4, x5, and x6), or zero (for x7, x9, and x10). Excluding these negative contributions and computing the average shares according to Equation (20) yields the results shown in Figure 6. These results lead to several noteworthy observations. Most prominently, the variable x10, which has the largest regression coefficient, displays a high contribution to the model’s quality, as expected, when its correlations with other predictors are minimal (0.05). However, its contribution decreases sharply as inter-predictor correlations increase. This behavior stands in marked contrast to that of its counterpart x7, as well as to the other strong predictors x5 and x6, whose contributions grow with higher correlations.
The shares of the correlations in the contributions, estimated by Expression (21), are shown in Figure 7. It reveals that for all variables uncorrelated with the other predictors, {x8, x9, x10}, regardless of the values of their regression coefficients from 1 to 24 and for all conditions across the three panels, the correlation’s share in the product d_j (20) is 50%. This 50% share is theoretically expected because, when correlations among predictors approach zero, the correlation matrix becomes an identity matrix, and the beta coefficients converge to the simple pairwise correlations between predictors and the dependent variable. In this case, the shares of the correlations and the beta coefficients are necessarily equal.
For the mutually correlated predictors, however, the pattern is different. The most sensitive variables—the ones with stronger true effects—exhibit the steepest decline from the 50% level as multicollinearity increases across all three panels.
7. Efficiency Indicator and Estimation Errors
As we saw earlier, even moderate multicollinearity can significantly distort regression results. It is reasonable to expect that a higher efficiency indicator, EI (26), should correspond to higher estimation accuracy. To test this assumption, several analyses were performed, and the results are summarized in Table 2.
For each simulated dataset and each predictor, the following characteristics were calculated: EI (26), VIF (23), the t-statistic (16) (squared), and the relative deviation RD (17) (taken by absolute value). Then, average values of these measures were computed within the 15 groups defined by the simulation design, which combines the three error levels (4) with the five multicollinearity levels (3). Two types of correlations with RD were then examined: correlations based on the original individual values (1800 values per variable), shown in columns 2–4 of Table 2, and correlations based on the group averages (15 values per variable), presented in columns 5–7.
As expected, higher values of EI and t-statistic generally correspond to lower RD, yielding mostly negative correlations. In most cases, EI shows stronger correlations with RD than the t-statistic—across almost all variables in the individual-level data, and across all variables in the grouped data. This indicates that EI has greater predictive power for estimation accuracy.
In contrast, a higher VIF reflects stronger intercorrelation between a given predictor and the other regressors, which should theoretically increase estimation errors. However, the actual results for VIF in Table 2 are mixed (the correlations have different signs), suggesting that VIF is not a reliable predictor of RD in this context.
Some extremely large values of VIF or RD may distort the estimation of correlations. To mitigate this effect, we also applied more robust measures. Specifically, for the regression outputs, Spearman rank correlations were computed between the RD of the coefficients and the corresponding values of EI, t-statistics, and VIF within each of the 15 groups described above. Figure 8 presents these results. For instance, the value −0.58 in Panel A represents the average Spearman correlation between RD and EI across 120 regressions under the condition of a low error level and mutual correlation between predictors equal to 0.3 (shown on the horizontal axis).
As we can see, VIF demonstrates the expected positive correlations with RD, but they are rather weak (ranging from 0.05 to 0.36) and show an evident upward trend with increasing multicollinearity among the predictors. In contrast, the t-statistics exhibit much stronger negative correlations, typically between −0.6 and −0.75. The efficiency indicator EI, though somewhat weaker than the t-statistic in relation to RD, still performs substantially better than VIF. Similar results were obtained for the distorted data, where EI maintains its robust behavior: it does not fall into the “wrong” correlation range and continues to reflect predictor quality consistently well.
Summarizing the results of evaluating EI as a predictor of the accuracy of a coefficient’s estimation (see Table 2 and Figure 8), we can conclude that EI is at least as effective as the t-statistic and, in many cases, even superior. The most reasonable approach is not to replace the t-statistic with EI, but rather to use them jointly: EI reinforces correct inferences when the t-statistic alone may fail. VIF, on the other hand, should be regarded as a supplementary but not decisive measure. We recommend using EI together with another measure, the predictor’s contribution to the variance of the dependent variable (discussed in Section 6). Both indicators are valuable for identifying the best subset of variables and for providing a sound framework to assess the strength of causal relationships between variables.
The term “causal” is used here with caution: while we acknowledge the extensive literature on causality in statistical modeling (see [19,20]), we intentionally avoid delving into it in depth. Nevertheless, it is worth noting that the data simulation design used in this study has a causal nature, since all predictors directly affect the dependent variable. In contrast, for most real-life data, establishing genuine causality remains an exceptionally challenging task. Thus, all the regression issues discussed here would be magnified in typical situations where the very existence of causality is in question.
8. Selection of Variables for the Models
In all our scenarios, the parameters of regression differ from zero; thus, the correct model should presumably include all of them, and the estimated coefficients should be significantly different from zero. This is not what occurred, however.
Table 3 presents information about the “proper models” that satisfy the condition that the absolute t-statistics (16) equal or exceed 1.96 for all coefficients, so that all of them are significant at the 95% confidence level. The total number of such models is 283 out of 1800, or 15.7%, and they all correspond exclusively to the minimum error in the data, which is also associated with models having a coefficient of multiple determination of 85–94%. One can assert that in the majority of real-world scenarios this approach will not yield satisfactory results, because a model with a determination coefficient of 85–94% is itself a rare occurrence. Moreover, these models, while formally adequate, produce substantial deviations in estimation. We calculated the maximum RD (17) in each model and then averaged them within each subgroup. The rationale for this evaluation is that if the deviation for even one variable is sufficiently large, the model cannot be considered satisfactory. These deviations in the two largest subgroups of proper models, with minimum correlations of 0.05 and 0.30, are approximately 30%, closely approaching the average for the entire subgroups shown in the final column, which encompasses both proper and improper models. Thus, selection by t-statistics operates successfully only under limited circumstances and still does not guarantee high-quality estimates. These results are consistent with the outcomes described in relation to Figure 2 and Figure 3.
In practical research, “the actual model” is typically unknown, and many of the collected independent variables can be eliminated for model simplification in order to retain only the best predictors, selected according to one or another criterion of data fit. Let us return to the reference matrix results shown in Figure 5 to identify a set of highly influential variables {x5, x6, x7}, which we designate as “good variables”. Our purpose is to demonstrate that this approach yields surprisingly robust results due to the stability of the correlations presented in Figure 5, in contrast to the use of t-statistics discussed in relation to Table 3. To illustrate the quality of such a selection, we compared models containing the good variables with models employing three alternative variables, x3, x4, and x10, with medium and large coefficients of 10, 12, and 24, respectively (see Table 1). These appear to be the most promising alternative candidates precisely because they possess the largest coefficients outside the good subset.
Figure 9 demonstrates how both types of models approximate the dependent variable in comparison with the complete (and correct by definition) model across all 1800 samples. Models incorporating the good variables follow the correct pattern of the complete models considerably more closely than models utilizing the other three variables. In general, the pattern for the good variables corresponds to the diagonal line (which represents the ideal correspondence), whereas the alternative models deviate substantially from it. The prediction accuracy with which the good-set models approximate the complete model quality is approximately 99%.
It is also possible to compare the RD of these models. If one calculates the averages of the RD in absolute value for the complete and reduced models, and then subtracts the former from the latter, a positive sign of the difference indicates that the reduced model performs worse than the complete model, and vice versa. These differences are shown in Figure 10. We observe that the RDs of the models with alternative variables are substantially higher than those of the good models in Panel A, approximately equivalent in Panel B, and mixed in Panel C; all of these outcomes depend on the level of correlations between predictors. The models of the good set, regardless of the noise level in the data, consistently approximate the complete models within a very narrow interval of several percentage points. Thus, the models of the good set demonstrate remarkable robustness under all conditions (which corresponds to the observations from Figure 6 and Table 3).
Altogether, we have discussed several methods for selecting the best predictors: two novel approaches (the efficiency indicator and the reference matrix) and two traditional ones (t-statistics and VIF). Our purpose was not to propose a new universal method of variable selection, but rather to demonstrate that the most traditional and routine approach, oriented toward t-statistics and p-values, does not function adequately in many situations and requires the supportive measures discussed herein.
We experimented not only with the selection of the best predictors but also with the reduction of the number of observations according to their inputs into the coefficient of multiple determination, following the expressions in (22). Some results demonstrated significant improvement in the model’s quality as measured by RD, which could be achieved with comparatively small samples of 10–15% of the original size. However, it proved difficult to identify intelligible decision rules that would provide practical recommendations; for this reason, we do not present these results here but defer such consideration to future research.
9. Practical Example
For the numerical example with real data, we employed the dataset Cars93 from the MASS library of the R software. It contains 93 observations of American and foreign cars measured by various variables. Price (in US$1000) was taken as the dependent variable y, and the following 11 numerical variables were used as predictors: x1—Engine size (liters), x2—Horsepower, x3—RPM (revolutions per minute at maximum horsepower), x4—Rev.per.mile (engine revolutions per mile in highest gear), x5—Fuel tank capacity (US gallons), x6—Passengers (number of persons), x7—Length (inches), x8—Wheelbase (inches), x9—Width (inches), x10—Turn.circle (U-turn space, feet), x11—Weight (pounds). Since the number of observations, 93, is considerably smaller than the 5000 used in the simulations, all the caveats discussed above apply with even greater force.
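For reproducibility, a Python sketch of this example is given below; it assumes that the Rdatasets copy of Cars93 is reachable through statsmodels and that the column names match the MASS documentation (both are assumptions), and the reference-matrix correlations follow Equations (12)–(14).

```python
import numpy as np
import statsmodels.api as sm

# Fetch Cars93 from the Rdatasets mirror (requires network access);
# the column names below are assumed to match the MASS documentation.
cars = sm.datasets.get_rdataset("Cars93", "MASS").data
predictors = ["EngineSize", "Horsepower", "RPM", "Rev.per.mile",
              "Fuel.tank.capacity", "Passengers", "Length", "Wheelbase",
              "Width", "Turn.circle", "Weight"]
data = cars[["Price"] + predictors].dropna()
y = data["Price"].to_numpy(dtype=float)
X = data[predictors].to_numpy(dtype=float)

# Standard OLS fit with t-statistics and p-values for comparison.
print(sm.OLS(y, sm.add_constant(X)).fit().summary())

# Correlations of each predictor column of the reference matrix (14) with its
# y column, computed from the weights (12); all predictors here are positive.
W = X / X.sum(axis=0)
rm_x, rm_y = W.T @ X, W.T @ y
rm_corr = [np.corrcoef(rm_x[:, k], rm_y)[0, 1] for k in range(X.shape[1])]
print(np.round(rm_corr, 2))
```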
Table 4 presents the reference matrix RM, where each row corresponds to the coefficients of one of the Equations (14). The added bottom row displays the pairwise correlations between the column of mean values of y and each predictor’s column in the reference matrix. Large correlations identify the superior predictors, beginning with x2.
Table 5 presents several statistics discussed above. The bottom row reports the coefficient of multiple determination for this model, R^2 = 0.742, which equals the sum of all contributions (19). The RI (25) is highly correlated with EI (26), at a level of 0.99, and is therefore omitted here. The final columns display the ranks of the three key indicators to simplify the interpretation of the results.
The only variable that ideally fits our description of a “very good” predictor is x2, which ranks highest across all three indicators. All the others raise certain doubts. For example, x9, under the standard approach, might appear to be a strong candidate because it has a high (though negative) t-value. However, closer examination reveals that it has a positive correlation with the target and a very low EI of only 0.27. By contrast, the “insignificant” x11 (with t = 0.66 and p-value = 0.51) shows the third-highest values for both EI and the contributions, making it clearly worthy of consideration. Similarly, x5 could be recommended despite its low t-statistic. This example is merely illustrative, but it demonstrates that examining multiple aspects of regression provides substantially deeper insights than relying solely on t-statistics and p-values. As discussed earlier, standard statistical tools and the new measures of precision and fit may lead to divergent conclusions, an issue that warrants further investigation.
10. Summary
Two factors play a crucial role in linear regression analysis—the level of uncertainty in the data and the degree of multicollinearity among predictors. When both are low, regression results tend to be adequate. In this study, we sought to disentangle these two effects by employing a direct comparison of regression results with known parameters of simulated data. Our main conclusions are as follows.
Impact of correlations: Even moderate correlations between predictors can significantly affect the quality of estimation. This underscores the importance of employing regularization techniques and other strategies to mitigate multicollinearity.
Limitations of traditional tests: The discrepancy between standard significance tests for regression coefficients and relative errors can be substantial, particularly when data uncertainty is high. A cautious approach to standard indicators is therefore recommended, along with wider confidence intervals and the use of complementary methods proposed in this paper.
Reference matrix as a tool: Presenting the normal system in the form of a reference matrix of weighted means enhances the interpretability of regression coefficients and provides a useful basis for numerical evaluation. Examining correlations within the reference matrix makes it possible to identify predictors estimated with greater reliability, even under high uncertainty and strong multicollinearity.
Efficiency Indicator (EI): We propose this measure to quantify the connection of each predictor with the target variable relative to its connections with other predictors. Our analysis demonstrates that EI can serve as a valuable supplementary statistic to the commonly used t-statistic and, in some situations, can substitute for it when the latter proves inadequate.
Combined use of contributions and EI: We introduce a novel procedure that jointly considers a variable’s contribution to the multiple determination coefficient and its efficiency indicator. This combined approach provides a more comprehensive and reliable assessment of predictor importance.
The applicability of the results is supported by the variety of data utilized: a large simulated dataset, in both a comparatively “smooth” and a “distorted” (clustered) version, as well as a small real dataset in the practical example. All the concerns discussed will be magnified by a higher level of uncertainty, smaller sample sizes, and other irregularities encountered in real-world situations.
Overall, the results contribute to a deeper understanding of estimated regression parameters and quality measures. The methods described herein can present valuable additional tools in practical statistical modeling and analysis, particularly in the context of big data, which has become routine in many modern applied research projects.