1. Introduction
Regression analysis is one of the most widely used tools in statistical modeling and prediction. It helps researchers and practitioners to identify approximate analytical relationships between outcome variables and predictor variables using multiple empirical observations. Numerous textbooks, monographs, and research papers have been devoted to various types of regression modeling.
Advanced regression methods developed to mitigate multicollinearity effects include modern approaches such as penalized ridge regression, the least absolute shrinkage and selection operator (LASSO), elastic net regression, least angle regression (LARS), and partial least squares (PLS) [1]. Tools designed for big data, statistical learning, and machine learning applications include support vector machines (SVM) and support vector regression (SVR), classification and regression trees (CART), prediction algorithms such as CHAID, multivariate adaptive regression splines (MARS), AdaBoost, neural networks (NN) for deep learning, as well as Bayesian and generalized linear mixed models (GLMM) [2]. These methods, directly or indirectly, use regression as an important component.
Specialized regression techniques have also been proposed, including the stability selection approach [3], the use of Shapley values in regression estimation [4,5], models accounting for errors in predictors [6], and various strategies for identifying the “best” model [7,8]. Many manuals now present applied statistical techniques and their implementation in software environments such as R [9,10], Python [11,12], and other statistical and mathematical packages. However, the wide availability and ease of use of these software tools have led to uncritical applications of regression analysis, where practical studies often stop at reporting seemingly favorable t-statistics, p-values, and model fit indices as the final proof of validity. The present paper proposes a more critical examination of this practice and explores ways to improve it.
In this study, multiple large datasets were generated through sampling simulation, and the results of regression modeling were compared with the parameters originally used in the simulations. The simulated cases were characterized by two meta-parameters of the data: the level of uncertainty and the level of multicollinearity. The results show that there are situations in which simple precision-based criteria may provide more meaningful insights than the conventional indicators automatically produced by regression software and routinely used by practitioners.
It is also demonstrated that multicollinearity can have detrimental effects, producing biased and inflated regression coefficients, even when correlations among predictors are relatively low. Novel methods for selecting the best variables, based on the so-called reference matrix and an efficiency indicator, are proposed, and their usability is demonstrated.
The results are obtained under idealized conditions, where datasets contain a very large number of observations (five thousand), are generated from well-behaved sources (normal distributions), and are free from heterogeneity, clustering, or outliers. For real-world datasets, which are typically smaller and less homogeneous, the effects identified here may be even more pronounced, highlighting the need for caution in the common use of regression modeling in applied statistical analysis. To test this, we also constructed and analyzed “distorted datasets” that deviate from the ideal assumptions through a clustered structure, and we applied the proposed techniques to a real-data example.
The paper is organized as follows. After the Introduction, Section 2 presents the experimental design, and Section 3 describes multiple regression features. The next sections discuss the obtained numerical results: Section 4, quality of the retrieved regression parameters; Section 5, estimations by the reference matrix; Section 6, analysis of the predictors’ contribution; Section 7, efficiency indicator and errors; and Section 8, selection of variables in models. Section 9 describes an example of modeling with real data, and Section 10 summarizes the results.
2. Experimental Design
We generated the data for the multiple linear regression model

y_i = a_0 + \sum_{j=1}^{n} a_j x_{ij} + e_i,   (1)

where x_ij is the ith observation (i = 1, 2, …, N, the number of observations) of the jth predictor (j = 1, 2, …, n, the number of predictors), y_i is the ith observation of the dependent variable, e_i denotes the error, and a_j are the coefficients of the regression.
The number of predictors in the simulations is n = 10. The predictors were set as follows: a subgroup {x1, x2, x3, x4} of correlated variables, another subgroup {x5, x6, x7} of correlated variables, and the remaining variables {x8, x9, x10}, which are uncorrelated. The variables {x1, x5, x8, x9, x10} were generated as normally distributed independent random variables with mean values and standard deviations equal to 1. The variables {x2, x3, x4} were set as correlated one by one with x1, and similarly, the variables {x6, x7} are correlated with x5. The correlated variables were obtained via the well-known formula for the bivariate Cholesky decomposition [13]:

z_2 = \rho z_1 + \sqrt{1 - \rho^2}\, z,   (2)

where z_1 and z are normally distributed random variables with mean 0 and standard deviation 1, ρ is the desired correlation, and z_2 has this correlation with z_1. In a dataset, the level of correlation was taken to be equal for all predictors. Correlations between predictors in different datasets were put at one of five levels (3); the lowest of these were 0.05 and 0.30.
The last two of these levels correspond to a high degree of multicollinearity between the independent variables. When x2 and x3 are each highly correlated with x1, the correlation between x2 and x3 is also high. So, the predictors {x1, x2, x3, x4} are mutually correlated in one subset, and the predictors {x5, x6, x7} are correlated in another subset, but there is no or low correlation between variables from different subsets or with the last three independent variables {x8, x9, x10}.
For the generated predictors, the corresponding coefficients of regression were taken as the constants shown in Table 1.
The purpose of this selection of coefficient values was twofold. First, they should affect the results in different ways: some strongly (values such as 20 and 24), some weakly (values such as 1 or 2), and others at a middle level. Second, the same coefficient was assigned to several variables to capture the effect of mutual correlations between independent variables. Accordingly, the values 1, 10, and 24 were each used twice, once within the correlated subgroups and once among the uncorrelated variables. This setting allows checking how estimates of identical coefficient values differ depending on the correlations between variables.
The outcome variable y was defined with zero intercept by the linear combination (1) of the predictors {x1, …, x10} with the parameters from Table 1. The added random error e in (1) was defined by the normal distribution with zero mean and standard deviation \sigma_e taken at three levels, A, B, and C, as follows:

\sigma_e = 20 (A),   \sigma_e = 100 (B),   \sigma_e = 200 (C).   (4)

These values imitate situations with low, medium, and high levels of uncertainty in the data. Indeed, with the mean value of each predictor equal to one, the mean level of the dependent variable y equals approximately the total of the coefficients in Table 1, which is 126. Relative to this mean level of y, the added random noise of the values (4) constitutes 20/126, 100/126, and 200/126, or about 16%, 80%, and 160%, respectively, for the low, medium, and high levels of data distortion by the errors.
With 5 values of correlations in (3) and 3 values of errors in (4), altogether 15 combinations of these parameters were considered. For each of these 15 scenarios, 120 random samples were generated, which yielded 1800 generated datasets. Each dataset consisted of multivariate data points (y_i, x_i1, …, x_in), with the number of generated observations N = 5000. Finding the parameters of the model (1) from the generated data and averaging them over the 120 samples within each of the 15 simulation cases, we can compare the obtained results with the original constants in Table 1 employed in the data-generating process.
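As an illustration of this design, the following Python sketch generates one such dataset. The coefficient values are our reading of Table 1 from the descriptions in the text (the value 10 for x9 is inferred from the stated total of 126 and is therefore an assumption), and the function and variable names are ours.

```python
import numpy as np

def generate_dataset(N=5000, rho=0.3, sigma_e=100.0, seed=0):
    """Generate one dataset for model (1): two correlated subgroups of predictors
    plus three independent ones, all with mean 1 and standard deviation 1."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((N, 10))          # base standard-normal variables

    X = np.empty((N, 10))
    X[:, 0] = z[:, 0]                         # x1
    for j in (1, 2, 3):                       # x2, x3, x4 correlated with x1 via (2)
        X[:, j] = rho * z[:, 0] + np.sqrt(1 - rho**2) * z[:, j]
    X[:, 4] = z[:, 4]                         # x5
    for j in (5, 6):                          # x6, x7 correlated with x5 via (2)
        X[:, j] = rho * z[:, 4] + np.sqrt(1 - rho**2) * z[:, j]
    X[:, 7:10] = z[:, 7:10]                   # x8, x9, x10 independent
    X += 1.0                                  # shift all means from 0 to 1

    # Coefficients consistent with the description of Table 1 (sum = 126);
    # the value 10 for x9 is an assumption.
    a = np.array([1, 2, 10, 12, 20, 22, 24, 1, 10, 24], dtype=float)

    e = rng.normal(0.0, sigma_e, size=N)      # error term of (1), level set by sigma_e
    y = X @ a + e                             # zero intercept, as described above
    return X, y, a
```

Calling generate_dataset(rho=0.05, sigma_e=20), for example, corresponds to the lowest-correlation, lowest-error scenario.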
To examine how different features of regression behave when the data have a more heterogeneous structure, a “distorted dataset” (DD) was also created. The dataset contained 5000 observations constructed in such a way that three variables, instead of being drawn from a single normal distribution as described earlier, were generated from two different distributions. Specifically, for the variable x1, the first 2000 observations were sampled from a normal distribution with the mean 1 (as before), while the remaining 3000 observations were sampled from a distribution with the mean value 25 (standard deviation equals 1 in both cases). The variable x5 was generated similarly: the first 2000 observations had the mean −20, while the rest of the observations had the mean 1. For the variable x10, the first 1000 observations had the mean value 30, and the last 4000 observations had the mean value 1. This type of distortion created a complex cluster structure in the data. Since x1 and x5 served as the basis for three and two other variables, respectively, those were also affected. The variable x10, being independent, did not influence other predictors, but it strongly affected the target variable because it had the largest coefficient, 24. Naturally, there are countless ways to “distort” the original smooth dataset, and this particular way was chosen purely for illustrative purposes. The results of simulations with DD will be noted in commentary alongside the main conclusions.
3. Several Relations on Multiple Regression
Let us briefly describe the main regression modeling formulae needed for further consideration. For each of the generated datasets, the regression model (1) was built by the ordinary least squares (OLS) criterion of error minimization:

S^2 = \sum_{i=1}^{N} \left( y_i - a_0 - \sum_{j=1}^{n} a_j x_{ij} \right)^2.   (5)

The minimum of the objective (5) can be found by setting to zero its derivatives with respect to the parameters, which yields the so-called normal system of equations:

a_0 N + \sum_{j=1}^{n} a_j \sum_{i=1}^{N} x_{ij} = \sum_{i=1}^{N} y_i,
a_0 \sum_{i=1}^{N} x_{ik} + \sum_{j=1}^{n} a_j \sum_{i=1}^{N} x_{ik} x_{ij} = \sum_{i=1}^{N} x_{ik} y_i,   k = 1, 2, …, n.   (6)

The first equation in (6) can be divided by N, producing the expression for the intercept:

a_0 = \bar{y} - \sum_{j=1}^{n} a_j \bar{x}_j.   (7)

Using (7) in the other n Equations (6) leads to the system which in matrix form is

C_{xx} a = c_{xy},   (8)

where the matrix C_xx and vector c_xy are defined as follows:

C_{xx} = \frac{1}{N} X'X,   c_{xy} = \frac{1}{N} X'y,   (9)

with X denoting the N × n matrix of the centered observations of the predictors, the prime denoting transposition, and y the vector of the centered observations of the dependent variable. The matrix C_xx and vector c_xy correspond to the sample covariance matrix of the predictors x and the vector of their covariances with the dependent variable y. Solution of Equation (8) via the inverse matrix C_{xx}^{-1} yields the vector a of the coefficients of the OLS multiple linear regression:

a = C_{xx}^{-1} c_{xy}.   (10)
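A minimal numerical sketch of Equations (7)–(10) in Python; the function name and the use of NumPy are our choices, and the inputs are assumed to be arrays X (N × n) and y such as those produced by the simulation above.

```python
import numpy as np

def ols_by_normal_equations(X, y):
    """Solve the normal system (8)-(10) and recover the intercept from (7)."""
    N = X.shape[0]
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean                    # centered predictors
    yc = y - y_mean                    # centered dependent variable

    Cxx = Xc.T @ Xc / N                # sample covariance matrix of predictors, (9)
    cxy = Xc.T @ yc / N                # covariances of predictors with y, (9)

    a = np.linalg.solve(Cxx, cxy)      # coefficients of regression, Equation (10)
    a0 = y_mean - x_mean @ a           # intercept, Equation (7)
    return a0, a
```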
Let us add the following feature, useful in applications of the normal system (6): it corresponds to the hyperplane going through the 1 + n points of the (weighted) mean values of all variables. Under the assumption that the total of any x differs from zero, it is possible to divide each jth Equation (6) by the total \sum_{i=1}^{N} x_{ij} multiplying the intercept, which yields

a_0 + \sum_{k=1}^{n} a_k \frac{\sum_{i=1}^{N} x_{ij} x_{ik}}{\sum_{i=1}^{N} x_{ij}} = \frac{\sum_{i=1}^{N} x_{ij} y_i}{\sum_{i=1}^{N} x_{ij}},   j = 1, 2, …, n.   (11)

Let us introduce the weight of the ith observation in the total of each variable xj:

w_{ij} = \frac{x_{ij}}{\sum_{i=1}^{N} x_{ij}},   (12)

with the weights for each jth variable summing to one. Then we can define the weighted means appearing in (11), denoted by bars, as follows:

\bar{y}^{(j)} = \sum_{i=1}^{N} w_{ij} y_i,   \bar{x}_k^{(j)} = \sum_{i=1}^{N} w_{ij} x_{ik},   (13)

where \bar{y}^{(j)} and \bar{x}_k^{(j)} are the mean values of y and xk, respectively, weighted by the jth set of weights built from xj (j = 1, 2, ..., n) as in (12). Then the system (11) can be written compactly as

a_0 + \sum_{k=1}^{n} a_k \bar{x}_k^{(j)} = \bar{y}^{(j)},   j = 1, 2, …, n.   (14)

Therefore, the linear regression can be seen as a hyperplane passing through the 1 + n points of the mean values of each variable weighted by each other variable. Properties of such a hyperplane and its construction were described in detail in [14].
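A sketch of how the reference points (13)–(14) can be assembled from raw (uncentered) data in Python, assuming every predictor has a nonzero column total; the function and variable names are ours.

```python
import numpy as np

def reference_matrix(X, y):
    """Build the reference matrix of Equation (14): row j contains the means of
    all predictors and of y weighted by w_ij = x_ij / sum_i x_ij, Equation (12)."""
    totals = X.sum(axis=0)          # assumed nonzero for every predictor
    W = X / totals                  # weights (12); each column sums to one
    rm_x = W.T @ X                  # element [j, k] is the weighted mean of x_k, (13)
    rm_y = W.T @ y                  # element [j] is the weighted mean of y, (13)
    # Row j of the hyperplane system (14): a0 + sum_k a_k * rm_x[j, k] = rm_y[j]
    return rm_x, rm_y
```

Appending the row of ordinary means (from Equation (7)) gives all 1 + n points through which the regression hyperplane passes.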
The set of mean values in (14) can be called the reference matrix (RM), because it consists of the points through which the hyperplane of the multivariate regression goes. Let us make several observations about the RM in its relation to the normal system (6). The normal system is often built with the exclusion of the intercept, by centering the variables. But the total of the values of a centered variable equals zero, so it is impossible to divide by it as was done in the transition from (6) to (11). However, for the centered data, in place of any second moment \sum_i x_{ij} x_{ik} in (6) there is the centered second moment, which can be transformed as follows:

\sum_{i=1}^{N} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k) = \sum_{i=1}^{N} x_{ij} (x_{ik} - \bar{x}_k) = \left( \sum_{i=1}^{N} x_{ij} \right) \sum_{i=1}^{N} w_{ij} (x_{ik} - \bar{x}_k).   (15)

Then it is still possible to divide each jth row of the normal system (6) by the corresponding total \sum_{i=1}^{N} x_{ij} of the original variable, and the transformation to (14) (without the intercept) holds for the centered variables weighted by the same weights (12). The same is true for the standardized data.
Another specific feature of the RM (14) is that the values in each jth column share the same units of measurement, namely the units of the jth variable. This makes it possible to consider correlations between the values in different columns of the RM (14), which can be used for a special aim described further. This property of the RM differs from the features of the normal system (6) with its matrix of second moments, where each element of the non-standardized matrix carries the units of a product of two variables; thus, it is impossible to find meaningful correlations between columns whose elements have different units of measurement.
The significance of the coefficients of regression can be measured by their t-statistics, defined as the quotient of the parameter a_j and its standard deviation s_{a_j} (whose square is determined by the model residual variance and the corresponding diagonal element of the inverted correlation matrix):

t_j = \frac{a_j}{s_{a_j}}.   (16)

An absolute value of t_j above 1.96 corresponds to a coefficient that is statistically significantly different from zero at the 95% confidence level, and the related confidence intervals can be built.
Another way to directly measure the precision of the recovered regression coefficients in the simulated estimations is the relative deviation (RD) of each jth retrieved parameter a_j from the generating value a_j^{gen} in Table 1:

RD_j = \frac{a_j - a_j^{gen}}{a_j^{gen}}.   (17)

It can also be calculated as the absolute value of the relative deviation.
For the standardized variables (centered and normalized by their standard deviations, \sigma_{x_j} for xj and \sigma_y for y, respectively), the matrix C_xx and vector c_xy in (8)–(9) reduce to the correlation matrix of the predictors R_xx and the vector r_xy with elements r_jy of correlations between the jth predictor xj and the dependent variable y. Then solution (10) produces the normalized coefficients of regression, also known as beta coefficients, which are connected with the original coefficients by the relation

b_j = a_j \frac{\sigma_{x_j}}{\sigma_y}.   (18)
The total quality of the data fit by the regression model is commonly estimated by the coefficient of multiple determination R^2, which can be calculated as the sum of the predictors’ contributions d_j:

R^2 = \sum_{j=1}^{n} d_j = \sum_{j=1}^{n} b_j r_{jy}.   (19)

The value d_j equals the product of the beta coefficient b_j of the jth predictor in the multiple regression and the coefficient of correlation r_jy between xj and y, which also coincides with the coefficient of the pair regression of y by xj. R^2 belongs to the interval [0, 1] and reaches its maximum of 1 when the residual sum of squares S^2 (5) goes to zero. The product d_j can be understood as the contribution of the jth predictor to the model quality measured by R^2: dividing Equation (19) by R^2 allows for calculating the relative contributions of the variables:

\frac{d_j}{R^2} = \frac{b_j r_{jy}}{R^2},   \sum_{j=1}^{n} \frac{d_j}{R^2} = 1.   (20)

However, this interpretation is justified only if the beta coefficients have the same signs as the corresponding coefficients of correlation, so that the contributions are positive. For highly correlated predictors, in other words, under multicollinearity among the variables, the coefficients of regression can be biased and inflated, and those closer to zero can change sign relative to the sign of the pair correlation. In such cases, more complicated methods can be applied for the estimation of the predictors’ contribution to the model (for example, see [4]). Sometimes, it is also interesting to find the shares of the two terms in the product d_j and to calculate the relative importance of each multiplier via their logarithms:

\frac{\ln r_{jy}}{\ln r_{jy} + \ln b_j},   \frac{\ln b_j}{\ln r_{jy} + \ln b_j}.   (21)

Negative r_jy and b_j can be taken by absolute value.
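A Python sketch of the contributions (19)–(20) and of the logarithmic shares following our reading of Equation (21); all names are illustrative.

```python
import numpy as np

def contributions(X, y):
    """Contributions d_j = b_j * r_jy (19), relative shares (20), and the share
    of the correlation in each product via logarithms, per our reading of (21)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)    # standardized predictors
    ys = (y - y.mean()) / y.std()                # standardized dependent variable

    Rxx = np.corrcoef(Xs, rowvar=False)          # correlation matrix of predictors
    rxy = Xs.T @ ys / len(ys)                    # correlations r_jy with y
    beta = np.linalg.solve(Rxx, rxy)             # beta coefficients, solution (10)

    d = beta * rxy                               # contributions, Equation (19)
    R2 = d.sum()                                 # coefficient of multiple determination
    shares = d / R2                              # relative contributions, Equation (20)

    # Negative r_jy and b_j are taken by absolute value, as noted in the text.
    log_r, log_b = np.log(np.abs(rxy)), np.log(np.abs(beta))
    corr_share = log_r / (log_r + log_b)         # share of the correlation in d_j, (21)
    return d, shares, corr_share
```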
Besides finding the key drivers, or the most important variables in the model, it is sometimes necessary to find and eliminate the less influential data points in order to build regressions on reduced data. For this aim, the coefficient of multiple determination can be presented in either of two forms: as the total over all observations of the product of the original standardized dependent variable y_i and its prediction y_{i,pred} by the model, or as the total of this predicted variable squared:

R^2 = \sum_{i=1}^{N} y_i\, y_{i,pred} = \sum_{i=1}^{N} y_{i,pred}^2.   (22)

Thus, it is possible to order observations by their contribution to the quality of fit, and such ordering can be simplified to ordering by the squared values y_i^2 of the normalized dependent variable. This approach corresponds to the earlier works by Wald and by Bartlett, where, with two variables x and y, the ith observations are ordered by x_i, the whole set is divided into three (or another number of) groups, and the middle part is omitted. As shown in [15,16,17,18], connecting the mean points of the two far-distanced remaining groups gives an unbiased estimation of the coefficient of the pair regression. In the current estimations with multidimensional simulated observations on many predictors, the ordering was performed by the values of the squared normalized dependent variable.
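A small sketch of this ordering in Python; the fraction of observations retained is a tunable assumption, and the names are ours.

```python
import numpy as np

def keep_most_influential(X, y, keep_fraction=0.15):
    """Order observations by the squared standardized dependent variable, as in
    the discussion of Equation (22), and keep the top share of them."""
    ys = (y - y.mean()) / y.std()            # standardized dependent variable
    order = np.argsort(ys**2)[::-1]          # largest squared values first
    n_keep = int(round(keep_fraction * len(y)))
    idx = order[:n_keep]
    return X[idx], y[idx]
```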
The Variance Inflation Factor (VIF) is a traditional indicator reported in statistical packages as a measure of multicollinearity between the predictors and is actively used in statistical practice for the elimination of variables with high correlations with other predictors. For the standardized variables, the VIF of each predictor xj is defined as the jth diagonal element of the inverse correlation matrix R_{xx}^{-1}, which in simpler notation is

VIF_j = (R_{xx}^{-1})_{jj}.   (23)

Also, as is known, the coefficient of multiple determination R_j^2 in the model with xj as the dependent variable regressed on all other predictors can be found from the value (23) as follows:

R_j^2 = 1 - \frac{1}{VIF_j}.   (24)
Let us introduce an index which combines two useful features of a predictor in one measure: how strongly it is related to the target variable on the one hand, and to all other predictors on the other. For this aim, we propose the following Relation Index (RI):

RI_j = \frac{r_{jy}^2}{r_{jy}^2 + R_j^2},   (25)

where r_{jy}^2 is the squared coefficient of correlation between xj and y, and R_j^2 is the coefficient of multiple determination (24) between xj and the other predictors. This index lies between 0 and 1, corresponding to a small or a large relation of xj to the dependent variable in comparison with its relation to the other predictors.
Formula (25) can be simplified to another index, which we will call the efficiency indicator (EI):

EI_j = \frac{r_{jy}^2}{r_{jy}^2 + \frac{1}{n-1} \sum_{k \neq j} r_{jk}^2}.   (26)

It is built similarly to (25), but instead of the multiple determination, the mean value of the squared correlations of xj with the other predictors is employed. It is a logical and natural gauge: predictors highly correlated with the target variable, while not highly correlated with the other predictors, are the best candidates for a good fit in models without multicollinearity, and hence with interpretable coefficients of regression. Both measures (25) and (26) are highly correlated, but (26) is simpler to calculate in practice, so we focus on it further.
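A Python sketch computing VIF (23), R_j^2 (24), and the two indices under our reconstruction of Formulas (25)–(26); the function name and the exact forms of RI and EI are assumptions based on the description above.

```python
import numpy as np

def vif_ri_ei(X, y):
    """VIF (23), R_j^2 (24), and the Relation Index / Efficiency Indicator as
    reconstructed in (25)-(26) from the textual description."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    ys = (y - y.mean()) / y.std()

    Rxx = np.corrcoef(Xs, rowvar=False)
    rxy = Xs.T @ ys / len(ys)                     # correlations of predictors with y

    vif = np.diag(np.linalg.inv(Rxx))             # Equation (23)
    R2_j = 1.0 - 1.0 / vif                        # Equation (24)

    n = Rxx.shape[0]
    mean_sq_corr = ((Rxx**2).sum(axis=1) - 1.0) / (n - 1)  # mean r^2 with other predictors
    ri = rxy**2 / (rxy**2 + R2_j)                 # Relation Index, our reading of (25)
    ei = rxy**2 / (rxy**2 + mean_sq_corr)         # Efficiency Indicator, our reading of (26)
    return vif, R2_j, ri, ei
```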
4. Simulation Results on the Quality of Retrieved Regression Parameters
For the recovered regression parameters in the simulated datasets, the results are presented in a series of figures, most of which share a similar structure. Each figure consists of three panels corresponding to the three predefined error levels (see Section 2). In each panel, the x-axis typically represents the correlation level among predictors (3). Anticipating the results (explained in more detail below), these three error levels yield coefficients of determination for their respective groups in the approximate ranges of 85–94%, 18–38%, and 5–14%, or roughly R^2 of about 90%, 30%, and 10%.
As established, all true regression coefficients were assigned positive values (see Table 1). However, a substantial number of estimated coefficients turned out to be negative, which is shown in Figure 1. Only variables x6, x7, and x10, which have the largest true values of 22 and 24, did not exhibit this effect under any condition. For clarity, these variables, as well as some others, are omitted from the chart. For the remaining variables, the proportion of negative coefficient estimates increases with higher error levels. Even when correlations among predictors are small, variables x1, x2, and x8, which have the smallest true coefficients of 1 and 2 (see Table 1), show between 10% and 30% negative estimates for the error level of 100, and 20% to 40% for the error level of 200. This indicates that smaller coefficients are more vulnerable to sign reversal due to the inflationary effects of multicollinearity, particularly under high uncertainty. Only under low-error conditions (error level 20) are these coefficients estimated without severe distortion.
For the distorted data (DD), the percentage of negative coefficients is somewhat higher, though not dramatically: for x1 33% versus 30% in Panel A, and 46% versus 41% in Panel B; for x2 39% versus 30% in Panel B, and similar patterns are observed for the other variables.
Figure 2 presents the percentage of non-significant regression coefficients, as determined by t-statistics, across different combinations of the simulation parameters. The results are shown for the 95% confidence level, i.e., for t-values with absolute magnitude less than 1.96. As can be seen, this percentage can be quite high, even for variables in Panel A. This implies that although, by construction, all variables are truly nonzero (and therefore significant), in many parameter combinations the t-test incorrectly rejects them, concluding that the variables are not significantly different from zero and should be disregarded.
The t-statistic is also highly sensitive to the inter-predictor correlations. For example, variable x7, whose true coefficient equals 24, yields about 20% of “insignificant” estimates in cases with higher correlations (see Panel C).
For the distorted data (DD), the overall patterns are similar, though the specific percentages vary. In some cases, they are higher—for instance, for x1 96% versus 90% in Panel A, and for x6 25% versus 7% in Panel C—while in others they are lower (e.g., for x1 23% versus 90% in Panel A). In general, most of the values remain close between the standard and distorted datasets, showing that the t-statistic’s results are consistently unreliable for different datasets.
Figure 3 presents the average relative deviation (RD) of the regression coefficients, as defined in Equation (17). The values are calculated for positive deviations and averaged across approximately half of the 120 samples in each of the 15 scenarios described in Section 2 (results for negative deviations are similar). For clarity, only four variables are shown: x1 and x8, which have the smallest true coefficients equal to 1, and x7 and x10, which have the largest coefficient equal to 24 (see Table 1). As shown in Panels B and C, the average RD for x1 (with the true coefficient 1) can be 6–12 times greater than the actual value when this variable is highly correlated with other predictors.
Several conclusions follow from Figure 3. First, the higher the level of data uncertainty (the error), the higher the RD of the estimated parameters; this trend is consistent across all variables. This observation is intuitively expected: detecting a weak signal amid strong noise becomes more difficult, even with a large sample size of 5000 observations. Second, the difference in estimation quality between small and large coefficients is quite noticeable: the largest coefficients, 24, are generally estimated with much greater accuracy. Finally, the relative error appears to be highly sensitive to correlations among the independent variables, even when these correlations are only moderate. For x1, RD increases sharply within each panel, reaching extremely high average values, often exceeding the true coefficient by more than an order of magnitude. A similar, though less pronounced, pattern is observed for variables with larger coefficients. This observation supports the conclusion that multicollinearity effects can appear even under moderate correlations, particularly in the presence of high data uncertainty. Greater attention should therefore be paid to detecting and mitigating these effects. Consequently, regression models with low coefficients of determination, which are common in practical statistical research, are especially vulnerable to large estimation errors in their coefficients.
For the distorted data, the picture is practically the same, with one exception: the coefficient for x2 is estimated with very large errors, with RD reaching 90% or more in Panels B and C.
5. Estimations by the Reference Matrix
Let us consider another approach to analyzing the normal system by transforming it from the correlation matrix into the matrix of weighted means, defined in Equations (11)–(14). We refer to this transformed matrix as the reference matrix (RM). To begin, let us examine the correlations observed in the simulated datasets.
Figure 4 illustrates how the various predictors are correlated with the dependent variable. The results appear intuitively reasonable: all correlations are positive, and their magnitudes decrease noticeably as the level of noise in the data increases. Three variables, x5, x6, and x7, exhibit behavior distinct from all the others: their correlations with y increase sharply as multicollinearity rises. This can be explained by the fact that these variables have the highest regression coefficients of 20, 22, and 24, respectively. What is more surprising, however, is that variable x10, which also has the largest coefficient, 24, begins with the same correlation value as its counterpart x7 but then decreases, whereas x7 continues to increase.
Figure 5 shows the correlations between the same variables, but this time computed as the correlations between the columns of the mean values of the predictors and the column of the mean values of the dependent variable in the reference matrix (14). The striking difference is that these correlations are unaffected by the level of uncertainty—their patterns remain identical across all three panels. This indicates that the RM, constructed from the mean values, preserves its structure regardless of noise in the data. The reason is straightforward: mean values are inherently stable measures, largely resistant to random errors, at least in the absence of significant outliers.
Another noticeable difference is that the three variables x5, x6, and x7, which already had the highest correlations, now exhibit even stronger values, while all the others, in contrast, show markedly reduced correlations, often turning strongly negative. This effect arises from the centering applied within the columns of the RM (14), which influences the correlation estimates. Consequently, transforming the system into the reference matrix provides a clearer distinction between groups of variables. In the present example, the three variables with the highest correlations with y emerge as the most important ones, offering a natural basis for reducing the number of predictors, which is one of the recurring challenges in regression modeling (see Section 8).
6. Estimation of Variables’ Contribution
The selection of good variables can also be guided by their contributions to the model’s quality, described in Formulas (19)–(21). Contributions of the predictors are negative in a noticeable share of cases for three variables: x1 in 27% of the samples (481 out of 1800), x2 in 14%, and x8 in 13%. For the other variables, this number is either very small, less than 1% (for x3, x4, x5, and x6), or zero (for x7, x9, and x10). Excluding these negative contributions and computing the average shares according to Equation (20) yields the results shown in Figure 6. These results lead to several noteworthy observations. Most prominently, the variable x10, which has the largest regression coefficient, displays a high contribution to the model’s quality, as expected, when its correlations with other predictors are minimal (0.05). However, its contribution decreases sharply as inter-predictor correlations increase. This behavior stands in marked contrast to that of its counterpart x7, as well as to the other strong predictors x5 and x6, whose contributions grow with higher correlations.
The shares of the correlations in the contributions, estimated by Expression (21), are shown in Figure 7. It reveals that for all variables uncorrelated with the other predictors, {x8, x9, x10}, regardless of the values of their regression coefficients from 1 to 24 and for all conditions across the three panels, the correlation’s share in the product d_j (20) is 50%. This 50% share is theoretically expected because, when correlations among predictors approach zero, the correlation matrix becomes an identity matrix, and the beta coefficients converge to the simple pairwise correlations between predictors and the dependent variable. In this case, the shares of the correlations and the beta coefficients are necessarily equal.
For the mutually correlated predictors, however, the pattern is different. The most sensitive variables—the ones with stronger true effects—exhibit the steepest decline from the 50% level as multicollinearity increases across all three panels.
7. Efficiency Indicator and Estimation Errors
As we saw earlier, even moderate multicollinearity can significantly distort regression results. It is reasonable to expect that a higher efficiency indicator, EI (26), should correspond to higher estimation accuracy. To test this assumption, several analyses were performed, and the results are summarized in Table 2.
For each simulated dataset and each predictor, the following characteristics were calculated: EI (26), VIF (23), the t-statistic (16) (squared), and the relative deviation RD (17) (taken by absolute value). Then, average values of these measures were computed within the 15 groups defined by the simulation design, which combines the three error levels (4) with the five multicollinearity levels (3). Two types of correlations with RD were then examined: correlations based on the original individual values (1800 values per variable), shown in columns 2–4 of Table 2, and correlations based on the group averages (15 values per variable), presented in columns 5–7.
As expected, higher values of EI and t-statistic generally correspond to lower RD, yielding mostly negative correlations. In most cases, EI shows stronger correlations with RD than the t-statistic—across almost all variables in the individual-level data, and across all variables in the grouped data. This indicates that EI has greater predictive power for estimation accuracy.
In contrast, a higher VIF reflects stronger intercorrelation between a given predictor and the other regressors, which should theoretically increase estimation errors. However, the actual results for VIF in Table 2 are mixed (the correlations have different signs), suggesting that VIF is not a reliable predictor of RD in this context.
Some extremely large values of VIF or RD may distort the estimation of correlations. To mitigate this effect, we also applied more robust measures. Specifically, for the regression outputs, Spearman rank correlations were computed between the RD of the coefficients and the corresponding values of EI, t-statistics, and VIF within each of the 15 groups described above. Figure 8 presents these results. For instance, the value −0.58 in Panel A represents the average Spearman correlation between RD and EI across 120 regressions under the condition of a low error level and mutual correlation between predictors equal to 0.3 (shown on the horizontal axis).
As we can see, VIF demonstrates the expected positive correlations with RD, but they are rather weak (ranging from 0.05 to 0.36) and show an evident upward trend with increasing multicollinearity among the predictors. In contrast, the t-statistics exhibit much stronger negative correlations, typically between −0.6 and −0.75. The efficiency indicator EI, though somewhat weaker than the t-statistic in relation to RD, still performs substantially better than VIF. Similar results were obtained for the distorted data, where EI maintains its robust behavior: it does not fall into the “wrong” correlation range and continues to reflect predictor quality consistently well.
Summarizing the results of evaluating EI as a predictor of the accuracy of a coefficient’s estimation (see Table 2 and Figure 8), we can conclude that EI is at least as effective as the t-statistic and, in many cases, even superior. The most reasonable approach is not to replace the t-statistic with EI, but rather to use them jointly: EI reinforces correct inferences when the t-statistic alone may fail. VIF, on the other hand, should be regarded as a supplementary but not decisive measure. We recommend using EI together with another measure, the predictor’s contribution to the variance of the dependent variable (discussed in Section 6). Both indicators are valuable for identifying the best subset of variables and for providing a sound framework to assess the strength of causal relationships between variables.
The term “causal” is used here with caution: while we acknowledge the extensive literature on causality in statistical modeling (see [19,20]), we intentionally avoid delving into it in depth. Nevertheless, it is worth noting that the data simulation design used in this study has a causal nature, since all predictors directly affect the dependent variable. In contrast, for most real-life data, establishing genuine causality remains an exceptionally challenging task. Thus, all the regression issues discussed here would be magnified in typical situations where the very existence of causality is in question.
8. Selection of Variables for the Models
In all our scenarios, the parameters of regression differ from zero; thus, the correct model should presumably include all of them, and the estimated coefficients should be significantly different from zero. This is not what occurred, however.
Table 3 presents information about the “proper models” that satisfy the condition that the absolute t-statistics (16) equal or exceed 1.96 for all coefficients, so that all of them are significant at the 95% confidence level. The total number of such models is 283 out of 1800, or 15.7%, and they all correspond exclusively to the minimum error in the data, which is also associated with models having a coefficient of multiple determination of 85–94%. One can assert that in the majority of real-world scenarios this approach will not yield satisfactory results, because a model with a determination coefficient of 85–94% is itself a rare occurrence. Moreover, these models, while formally adequate, produce substantial deviations in estimation. We calculated the maximum RD (17) in each model and then averaged them within each subgroup. The rationale for this evaluation is that if the deviation for even one variable is sufficiently large, the model cannot be considered satisfactory. These deviations in the two largest subgroups of proper models, with minimum correlations of 0.05 and 0.30, are approximately 30%, closely approaching the average for the entire subgroups shown in the final column, which encompasses both proper and improper models. Thus, selection by t-statistics operates successfully only under limited circumstances and still does not guarantee high-quality estimates. These results are consistent with the outcomes described in relation to Figure 2 and Figure 3.
In practical research, “the actual model” is typically unknown, and many of the collected independent variables can be eliminated for model simplification in order to retain only the best predictors, selected according to one or another criterion of data fit. Let us return to the reference matrix results shown in Figure 5 to identify a set of highly influential variables {x5, x6, x7}, which we designate as “good variables”. Our purpose is to demonstrate that this approach yields surprisingly robust results due to the stability of the correlations presented in Figure 5, in contrast to the use of t-statistics discussed in relation to Table 3. To illustrate the quality of such a selection, we compared models containing the good variables with models employing three alternative variables, x3, x4, and x10, with medium and large coefficients of 10, 12, and 24, respectively (see Table 1). These appear to be the most promising alternative candidates precisely because they possess the largest coefficients outside the good subset.
Figure 9 demonstrates how both types of models approximate the dependent variable in comparison with the complete (and correct by definition) model across all 1800 samples. Models incorporating the good variables follow the correct pattern of the complete models considerably more closely than models utilizing the other three variables. In general, the pattern for the good variables corresponds to the diagonal line (which represents the ideal correspondence), whereas the alternative models deviate substantially from it. The prediction accuracy with which the good-set models approximate the complete model quality is approximately 99%.
It is also possible to compare the RD of these models. If one calculates the averages of the RD in absolute value for the complete and reduced models, and then subtracts the former from the latter, a positive sign of the difference indicates that the reduced model performs worse than the complete model, and vice versa. These differences are shown in Figure 10. We observe that the RDs of the models with alternative variables are substantially higher than those of the good models in Panel A, approximately equivalent in Panel B, and mixed in Panel C; all of these outcomes depend on the level of correlations between predictors. The models of the good set, regardless of the noise level in the data, consistently approximate the complete models within a very narrow interval of several percentage points. Thus, the models of the good set demonstrate remarkable robustness under all conditions (which corresponds to the observations from Figure 6 and Table 3).
Altogether, we have discussed several methods for selecting the best predictors: two novel approaches (the efficiency indicator and the reference matrix) and two traditional ones (t-statistics and VIF). Our purpose was not to propose a new universal method of variable selection, but rather to demonstrate that the most traditional and routine approach, oriented toward t-statistics and p-values, does not function adequately in many situations and requires the supportive measures discussed herein.
We experimented not only with the selection of the best predictors but also with the reduction of the number of observations according to their inputs into the coefficient of multiple determination, following the expressions in (22). Some results demonstrated significant improvement in the model’s quality as measured by RD, which could be achieved with comparatively small samples of 10–15% of the original size. However, it proved difficult to identify intelligible decision rules that would provide practical recommendations; for this reason, we do not present these results here but defer such consideration to future research.
9. Practical Example
For the numerical example with real data, we employed the dataset Cars93 from the MASS library of the R software. It contains 93 observations of American and foreign cars measured by various variables. Price (in US$1000) was taken as the dependent variable y, and the following 11 numerical variables were used as predictors: x1—Engine size (liters), x2—Horsepower, x3—RPM (revolutions per minute at maximum horsepower), x4—Rev.per.mile (engine revolutions per mile in highest gear), x5—Fuel tank capacity (US gallons), x6—Passengers (number of persons), x7—Length (inches), x8—Wheelbase (inches), x9—Width (inches), x10—Turn.circle (U-turn space, feet), x11—Weight (pounds). Since the number of observations, 93, is considerably smaller than the 5000 used in the simulations, all the caveats discussed above apply with even greater force.
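For reproducibility, a Python sketch of this example is given below; it assumes that the Rdatasets copy of Cars93 is reachable through statsmodels and that the column names match the MASS documentation (both are assumptions), and the reference-matrix correlations follow Equations (12)–(14).

```python
import numpy as np
import statsmodels.api as sm

# Fetch Cars93 from the Rdatasets mirror (requires network access);
# the column names below are assumed to match the MASS documentation.
cars = sm.datasets.get_rdataset("Cars93", "MASS").data
predictors = ["EngineSize", "Horsepower", "RPM", "Rev.per.mile",
              "Fuel.tank.capacity", "Passengers", "Length", "Wheelbase",
              "Width", "Turn.circle", "Weight"]
data = cars[["Price"] + predictors].dropna()
y = data["Price"].to_numpy(dtype=float)
X = data[predictors].to_numpy(dtype=float)

# Standard OLS fit with t-statistics and p-values for comparison.
print(sm.OLS(y, sm.add_constant(X)).fit().summary())

# Correlations of each predictor column of the reference matrix (14) with its
# y column, computed from the weights (12); all predictors here are positive.
W = X / X.sum(axis=0)
rm_x, rm_y = W.T @ X, W.T @ y
rm_corr = [np.corrcoef(rm_x[:, k], rm_y)[0, 1] for k in range(X.shape[1])]
print(np.round(rm_corr, 2))
```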
Table 4 presents the reference matrix RM, where each row corresponds to the coefficients of one of the Equations (14). The added bottom row displays the pairwise correlations between the column of mean values of y and each predictor’s column in the reference matrix. Large correlations identify the superior predictors, beginning with x2.
Table 5 presents several statistics discussed above. The bottom row reports the coefficient of multiple determination for this model, R^2 = 0.742, which equals the sum of all contributions (19). The RI (25) is highly correlated with EI (26), at a level of 0.99, and is therefore omitted here. The final columns display the ranks of the three key indicators to simplify the interpretation of the results.
The only variable that ideally fits our description of a “very good” predictor is x2, which ranks highest across all three indicators. All the others raise certain doubts. For example, x9, under the standard approach, might appear to be a strong candidate because it has a high (though negative) t-value. However, closer examination reveals that it has a positive correlation with the target and a very low EI of only 0.27. By contrast, the “insignificant” x11 (with t = 0.66 and p-value = 0.51) shows the third-highest values for both EI and the contributions, making it clearly worthy of consideration. Similarly, x5 could be recommended despite its low t-statistic. This example is merely illustrative, but it demonstrates that examining multiple aspects of regression provides substantially deeper insights than relying solely on t-statistics and p-values. As discussed earlier, standard statistical tools and the new measures of precision and fit may lead to divergent conclusions, an issue that warrants further investigation.
10. Summary
Two factors play a crucial role in linear regression analysis—the level of uncertainty in the data and the degree of multicollinearity among predictors. When both are low, regression results tend to be adequate. In this study, we sought to disentangle these two effects by employing a direct comparison of regression results with known parameters of simulated data. Our main conclusions are as follows.
Impact of correlations: Even moderate correlations between predictors can significantly affect the quality of estimation. This underscores the importance of employing regularization techniques and other strategies to mitigate multicollinearity.
Limitations of traditional tests: The discrepancy between standard significance tests for regression coefficients and relative errors can be substantial, particularly when data uncertainty is high. A cautious approach to standard indicators is therefore recommended, along with wider confidence intervals and the use of complementary methods proposed in this paper.
Reference matrix as a tool: Presenting the normal system in the form of a reference matrix of weighted means enhances the interpretability of regression coefficients and provides a useful basis for numerical evaluation. Examining correlations within the reference matrix makes it possible to identify predictors estimated with greater reliability, even under high uncertainty and strong multicollinearity.
Efficiency Indicator (EI): We propose this measure to quantify the connection of each predictor with the target variable relative to its connections with other predictors. Our analysis demonstrates that EI can serve as a valuable supplementary statistic to the commonly used t-statistic and, in some situations, can substitute for it when the latter proves inadequate.
Combined use of contributions and EI: We introduce a novel procedure that jointly considers a variable’s contribution to the multiple determination coefficient and its efficiency indicator. This combined approach provides a more comprehensive and reliable assessment of predictor importance.
The applicability of the results is supported by the variety of data utilized: a large simulated dataset, in both a comparatively “smooth” and a “distorted” (clustered) version, as well as a small real dataset in the practical example. All the concerns discussed will be magnified by a higher level of uncertainty, smaller sample sizes, and other irregularities encountered in real-world situations.
Overall, the results contribute to a deeper understanding of estimated regression parameters and quality measures. The methods described herein can present valuable additional tools in practical statistical modeling and analysis, particularly in the context of big data, which has become routine in many modern applied research projects.