Ridge Regression and the Elastic Net: How Do They Do as Finders of True Regressors and Their Coefficients?
Abstract
1. Introduction
2. Materials and Methods
2.1. Algorithmic Description of the RR Alternative to the EN
- (a)
- Starting from left to right, we partition into sub-matrices such that , where denotes the greatest integer function. If , then there will be partitions with regressors in each partition; otherwise, the mth partition will have regressors and all predecessor partitions will have regressors each. This partitioning is done so that all but the last partition have 3 observations per regressor (OPR); the last partition has at least 3 OPR. We let the partitions be named, from left to right, , respectively. The number 3 is chosen as the lower bound on OPR because Austin and Steyerberg [28] have shown that an OPR of 2 is enough to detect statistical significance, and we judgmentally increased that by one. This is set as a hard constraint.
- (b)
- We define a “concatenation” operator that will be applied to matrices having the same number of rows but not necessarily the same number of columns. The operator will concatenate two matrices by creating a new matrix containing all of the columns of the two matrices while retaining duplicate columns only once. Concatenation is done from left to right; when duplicate columns exist, the leftmost one will be the only one (among the duplicates) retained. When such an operator operates over multiple matrices, we will index them, following familiar practice with other well-known operators such as the summation operator (e.g., ). For example, under this convention, .
- (c)
- We define another concatenation operator that will concatenate matrices by retaining all columns, including duplicate columns, as is. The following matrices (where the lower-case letters denote real numbers) illustrate how the two operators work (a code sketch of both operators also appears after this list):
- (d)
- We estimate by RR using the regressors in with set to the value prescribed by HKB, which is . We denote this prescribed value of by . We use the value of for each regressor in to determine whether it is statistically significant at an level (i.e., type I error) of 15%. We denote the sub-matrix of containing the statistically significant regressors found as .
- (e)
- We repeat step “(d)” by using the value of prescribed by LW, which is , where is the usual ratio in the analysis of variance (ANOVA) table resulting from estimating by classical LS. We denote this prescribed value of by and denote the resultant sub-matrix of containing the significant regressors found as .
- (f)
- We create , which is the set of significant regressors in identified by RR using either or . We let denote the number of regressors in .
- (g)
- We repeat steps “(d)”, “(e)” and “(f)” for every partition, where denotes the universal quantifier “for all”, an operator borrowed from Whitehead and Russell [29].
- (h)
- We define the statistically significant sub-matrix of regressors identified in by RR as follows, where denotes “such that”. The idea of subscripting and superscripting alphabets, for creating notation, is borrowed from the idea of the “tensor” (see Ricci-Curbastro and Levi-Civita [30]).
- (i)
- We re-estimate by GRR using the regressors in with the following vector of values: . Under GRR (see Hoerl and Kennard [1]), the usual matrix diagonal increment will be replaced by , where is a row-vector of constants, rather than a single non-stochastic number as it is in the usual “familiar” form of RR; is the corresponding diagonal matrix with the elements of on the diagonal, and is the usual identity matrix. We let the resultant GRR estimated coefficients be denoted by .
- (j)
- We repeat step “(i)” with the following vector of values: . We let the resultant GRR estimated coefficients be denoted by .
- (k)
- We will see that, for “low” to “moderate” levels of multicollinearity in , is a good solution, and, for “severe” levels of collinearity in , is a good solution. The mathematical characterization of the adjectives in quotes will be clarified below. We define the subset (sub-vector) of containing only the true coefficients found by GRR as , where the subscript indicates whether the HKB-based or the LW-based vector of values was used, as the case may be. (A code sketch illustrating steps “(a)” through “(k)” appears after this list.)
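The two column-concatenation operators are straightforward to implement. The sketch below (numpy; the names `concat_unique` and `concat_all` are ad hoc labels, not the paper's notation) keeps a duplicated column only once, retaining the leftmost occurrence, in the first operator, and keeps every column, duplicates included, in the second.

```python
import numpy as np

def concat_unique(*mats):
    """Column-wise concatenation keeping a duplicated column only once;
    the leftmost occurrence is the one retained."""
    cols = []
    for M in mats:
        for j in range(M.shape[1]):
            c = M[:, j]
            if not any(np.array_equal(c, kept) for kept in cols):
                cols.append(c)
    return np.column_stack(cols)

def concat_all(*mats):
    """Column-wise concatenation keeping every column, duplicates included."""
    return np.hstack(mats)

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[2, 5],
              [4, 6]])            # B's first column duplicates A's second column
print(concat_unique(A, B))        # [[1 2 5] [3 4 6]]    -- duplicate kept once
print(concat_all(A, B))           # [[1 2 2 5] [3 4 4 6]] -- duplicates kept as is
```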
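The selection-and-re-estimation logic of steps “(d)” through “(k)”, together with the partitioning in step “(a)”, can be sketched as follows. This is a minimal Python/numpy reading of the algorithm, not the author's code: the HKB and LW formulas for the ridge tuning parameter, the non-stochastic-k covariance approximation behind the ridge t-ratios, and the way the generalized-ridge vector of tuning parameters is assembled are spelled out in the comments, and the parts that are assumptions rather than transcriptions are flagged there.

```python
import numpy as np
from scipy import stats

def to_correlation_form(X, y):
    """Center y; center the columns of X and scale them to unit length,
    so that X'X is the correlation matrix of the regressors."""
    Xc = X - X.mean(axis=0)
    Xs = Xc / np.sqrt((Xc ** 2).sum(axis=0))
    return Xs, y - y.mean()

def partition_columns(p, n, opr=3):
    """Step (a): split column indices 0..p-1, left to right, into blocks of at
    most floor(n/opr) columns, so every block keeps at least `opr` observations
    per regressor; only the last block may be smaller."""
    width = n // opr
    return [list(range(i, min(i + width, p))) for i in range(0, p, width)]

def _ls(X, y):
    """Classical LS fit on standardized data: coefficients and residual variance."""
    n, p = X.shape
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    s2 = np.sum((y - X @ b) ** 2) / (n - p - 1)
    return b, s2

def k_hkb(X, y):
    """HKB choice of the ridge tuning parameter: p * s^2 / (b'b)."""
    b, s2 = _ls(X, y)
    return X.shape[1] * s2 / (b @ b)

def k_lw(X, y):
    """LW choice: p * s^2 / (b'X'Xb), the reciprocal of the LS ANOVA F-ratio."""
    b, s2 = _ls(X, y)
    return X.shape[1] * s2 / (b @ (X.T @ X) @ b)

def ridge_significant(X, y, k, alpha=0.15):
    """Steps (d)-(e): ridge fit at a given k, flagging regressors whose ridge
    t-ratio is two-sided significant at level alpha.  The standard errors use
    the usual non-stochastic-k covariance approximation."""
    n, p = X.shape
    XtX = X.T @ X
    A = np.linalg.inv(XtX + k * np.eye(p))
    beta_r = A @ X.T @ y
    _, s2 = _ls(X, y)
    se = np.sqrt(np.diag(s2 * A @ XtX @ A))
    pvals = 2 * stats.t.sf(np.abs(beta_r / se), df=n - p - 1)
    return beta_r, pvals < alpha

def rr_select(X, y, alpha=0.15):
    """Steps (f)-(h): within each partition, keep the union of regressors found
    significant under either the HKB or the LW k; then pool the survivors across
    partitions (a leftmost-unique concatenation of column indices)."""
    n, p = X.shape
    keep = []
    for block in partition_columns(p, n):
        Xb = X[:, block]
        for choose_k in (k_hkb, k_lw):
            _, sig = ridge_significant(Xb, y, choose_k(Xb, y), alpha)
            keep.extend(int(j) for j in np.array(block)[sig])
    return sorted(set(keep))

def grr_fit(X, y, k_vec):
    """Steps (i)-(k): generalized ridge, (X'X + diag(k_vec))^{-1} X'y.  How k_vec
    is assembled from the per-partition HKB or LW values is an assumption here;
    one plausible choice gives each retained regressor the k of the partition it
    was selected from."""
    XtX = X.T @ X
    return np.linalg.solve(XtX + np.diag(k_vec), X.T @ y)

# Usage sketch:
#   Xs, ys = to_correlation_form(X, y)
#   cols = rr_select(Xs, ys)
#   beta = grr_fit(Xs[:, cols], ys, k_vec=np.full(len(cols), k_hkb(Xs[:, cols], ys)))
```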
2.2. Simple Example to Illustrate Algorithm Use
2.3. Description of Simulation Design
- (i)
- We let be the ith eigenvalue of , where for all .
- (ii)
- We let be the matrix whose columns are the eigenvectors corresponding to the eigenvalues.
- (iii)
- Then , where is the diagonal matrix of the eigenvalues , stored in vector , and , where is the identity matrix of conformable dimension.
- (iv)
- We choose a new vector of arbitrary eigenvalues, , with denoting the ith eigenvalue entry in .
- (v)
- We create and transform to correlation form. We denote the transformed as . We denote by the ith eigenvalue of . We calculate .
- (vi)
- We repeat steps “(iv)” and “(v)” (by trial and error) until .
- (1)
- We pick a multicollinearity level () from {100, 300, 1000, 4000, extreme}, where “extreme” denotes the original multicollinearity level of .
- (2)
- We use the multicollinearity level picked in Step “(1)” to create, as described previously (i.e., Steps “(i)” through “(vi)” above), the partitions: , and . We let , where is the multicollinearity level picked in “(1)”. Where the interpretation is clear, we will simply represent, for notational brevity, as .
- (3)
- We pick a value of (i.e., ) from {10, 25, 50, 100, 250, 500, 1000, 1500, 3000, 5000, 7500, 10,000, 15,000, 30,000, 60,000, 100,000, 150,000, 200,000, 300,000, 600,000}. That is, 20 possible values of are selected.
- (4)
- We pick a value of from {0.05, 0.20, 0.35, 0.50, 0.65, 0.80, 0.95}. For greater granularity, we include additional values of from {0.10, 0.15, 0.25, 0.40, 0.45, 0.70} as necessary. These additional values of are sometimes used to approximately mark -cutoffs where starts being less than .
- (5)
- We generate a vector, , of uniform random numbers in the interval [–1, 1].
- (6)
- We generate a vector, , of Bernoulli random variables (0 or 1) with .
- (7)
- We generate an vector, , of independent, identically distributed normal (0, 1) random variables.
- (8)
- We perform pairwise multiplication of the vectors generated in Steps “(5)” and “(6)” to create the true coefficient vector, adjusted until its squared length equals the value picked in Step “(3)” (a code sketch covering Steps 5 through 10 appears after this list).
- (9)
- We generate .
- (10)
- We generate .
- (11)
- We partition into 9 mutually exclusive sub-matrices: , , , , , , , and . This ensures that, for our case, the OPR is at least 2.
- (12)
- We estimate using RR with all of the regressors in under the HKB determined value of . We use the modified -statistic to select the RR-identified significant variables in and retain them in , a non-null matrix. We let the value of identifying the significant regressors be denoted by and the number of significant regressors identified be: .
- (13)
- We collect the RR-identified significant regressors, over all partitions, under the HKB-determined value of and concatenate them as . Here, “” includes only partitions for which at least one significant regressor is identified by RR. Where obvious, this interpretation of is assumed in lieu of burdening the indexing by expanding the subscript to read as: .
- (14)
- We estimate using RR with all of the regressors in under the LW-determined value of . We use the modified -statistic to select the RR-identified significant variables in and retain them in , a non-null matrix. We let the value of identifying the significant regressors be denoted by and the number of significant regressors identified be: .
- (15)
- We collect the RR-identified significant regressors, over all partitions, under the LW value of , and concatenate them as .
- (16)
- We create .
- (17)
- We create and .
- (18)
- We estimate using generalized RR with the regressors in and . We let the resultant estimated coefficient vector be denoted as: .
- (19)
- We calculate and .
- (20)
- We estimate using generalized RR with the regressors in and . We let the resultant estimated coefficient vector be denoted as: .
- (21)
- We calculate and .
- (22)
- We calculate the proportion of true regressors in and denote it by .
- (23)
- We estimate using the EN. We use the Schwarz Bayesian criterion (SBC) [36] as the “stopping rule” with a maximum of 300 “steps” for EN calculations and . The SBC is defined as , where denotes the sum of squared errors resulting from the fitted EN regression. The EN estimated is stored in vector , where the coefficients of all EN-selected regressors, assumed to be “significant”, are in . (An approximate code analogue of this EN-with-SBC step appears after this list.)
- (24)
- We calculate , and .
- (25)
- We repeat Steps 5 through 24, 2000 times.
- (26)
- We calculate the hit rate: . That is, we calculate the percentage of times this event occurs in 2000 trials (simulations) and interpret it as a probability. (The per-trial bookkeeping behind Steps 26 through 31 is sketched in code after this list.)
- (27)
- We calculate the hit rate: .
- (28)
- We calculate the hit rate: .
- (29)
- We calculate the hit rate: .
- (30)
- We calculate the miss rate: .
- (31)
- We calculate the “empty” rates: and .
- (32)
- We redo Steps 5 through 31 for all of the combinations of the simulation starting conditions in Steps 1 through 4. That is, for each , the values of are crossed with the values of .
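A compact sketch of one reading of Steps “(5)” through “(10)” (Python/numpy, not the author's code): the sparse true coefficient vector is the element-wise product of the uniform and Bernoulli vectors, rescaled to the squared length picked in Step “(3)”, and the regressand is formed by adding N(0, 1) noise. Treating `pi_zero` as the probability that a coefficient is zero, and rescaling (rather than redrawing) to hit the target squared length, are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2022)

def simulate_trial(X, sq_len, pi_zero):
    """One simulated trial: draw a sparse coefficient vector with squared length
    `sq_len` (Steps 5, 6 and 8, as read here) and form the regressand (Steps 7,
    9 and 10, as read here).  `pi_zero` is taken to be the a priori probability
    that a coefficient is zero -- an assumption about Step 6."""
    n, p = X.shape
    while True:
        u = rng.uniform(-1.0, 1.0, size=p)             # Step 5
        b = (rng.random(p) >= pi_zero).astype(float)   # Step 6: 1 with prob 1 - pi_zero
        beta = u * b                                   # Step 8: pairwise product
        if beta @ beta > 0:                            # redraw if everything is zero
            break
    beta *= np.sqrt(sq_len / (beta @ beta))            # rescale so beta'beta = sq_len
    eps = rng.standard_normal(n)                       # Step 7
    y = X @ beta + eps                                 # Steps 9-10, as read here
    return beta, y
```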
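The EN-with-SBC step can be approximated as below. This is only an analogue of the elastic-net fit with the SBC stopping rule described above, not the same procedure: it scans a user-supplied grid of penalty strengths with scikit-learn's `ElasticNet` and keeps the fit that minimizes a Schwarz-type criterion computed from the SSE and the number of nonzero coefficients (that particular form of the SBC is an assumption).

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def elastic_net_sbc(X, y, alphas, l1_ratio=0.5):
    """Fit the elastic net over a grid of penalty strengths and keep the fit
    minimizing SBC = n*ln(SSE/n) + df*ln(n), with df taken to be the number of
    nonzero coefficients plus one for the intercept (an assumed SBC form)."""
    n = len(y)
    best_fit, best_sbc = None, np.inf
    for a in alphas:
        fit = ElasticNet(alpha=a, l1_ratio=l1_ratio, max_iter=10000).fit(X, y)
        sse = float(np.sum((y - fit.predict(X)) ** 2))
        df = int(np.count_nonzero(fit.coef_)) + 1
        sbc = n * np.log(max(sse, 1e-12) / n) + df * np.log(n)
        if sbc < best_sbc:
            best_fit, best_sbc = fit, sbc
    return best_fit
```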
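The per-trial bookkeeping behind the squared distances and the hit, miss and “empty” rates (Steps 19 through 31, as read here) reduces to a few set operations; averaging each indicator over the 2000 trials gives the corresponding rate. The metric definitions below are a plausible reading, not a transcription of the paper's formulas.

```python
import numpy as np

def trial_metrics(beta_true, beta_rr_full, beta_en_full, rr_support, en_support):
    """One trial's contributions to the rates in Steps 24-31 (as read here).
    `beta_rr_full` / `beta_en_full` are the RR (GRR) and EN coefficient estimates
    padded with zeros for unselected regressors; `rr_support` / `en_support` are
    the selected column-index sets."""
    true_set = set(np.flatnonzero(beta_true))
    rr_true = len(true_set & set(rr_support))
    en_true = len(true_set & set(en_support))
    return {
        "rr_sq_dist": float(np.sum((beta_rr_full - beta_true) ** 2)),
        "en_sq_dist": float(np.sum((beta_en_full - beta_true) ** 2)),
        "rr_finds_more_true": rr_true > en_true,
        "rr_empty": rr_true == 0,          # RR found none of the true regressors
        "en_empty": en_true == 0,          # EN found none of the true regressors
        "rr_true_frac": rr_true / max(len(true_set), 1),
        "en_true_frac": en_true / max(len(true_set), 1),
    }

# Each hit/miss/"empty" rate is the average of the corresponding indicator over
# the 2000 trials, e.g.:
#   hit_rate = np.mean([m["rr_finds_more_true"] for m in per_trial_results])
```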
3. Results
3.1. Results by Collinearity Level
- RR is better than the EN at estimating true coefficients. Specifically, the double probability,
- RR is better than the EN at jointly estimating both true and spurious, but “statistically significant”, coefficients. In particular, the double probability
- RR is better than the EN at finding true regressors. In particular, the double probability
- RR is better than the EN at finding at least one true regressor. The probabilities, by , that and are empty, in rows 8 and 9 of Table 2, respectively, indicate this. As can be noted, the probability that is empty is generally much higher than the probability that is empty. In particular, the average probability that is empty across the entire simulation space is (2.58 + 1.75 + 2.94 + 4.91 + 7.98) ÷ 5, or about 4%, and the corresponding probability that is empty is about 25%, roughly six times 4%. (A short numerical check of these averages appears after this list.)
- The conditional probability that is empty, given , , and a trial (simulation) where turns out to be empty, is denoted by:
- For given combinations of and , the average value of the proportion of true regressors found is calculated in the simulation, by . For RR and the EN, these are and , respectively. The averages (and standard deviations) of these two expectations across and , are shown, by , in rows 12 and 13 of Table 2, respectively. They also indicate that RR is better than the EN at finding true regressors and that RR does so with lower volatility.
- Row 6 of Table 2 indicates that the LW value of is better than the HKB value of when estimating the coefficient vector that includes both the true and spurious coefficients. Row 7 of Table 2 indicates that, for low levels of collinearity, the HKB value of is better than the LW value of when estimating the true coefficients.
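A quick arithmetic check of the two averages quoted above, using the row 8 and row 9 entries of Table 2 (in percent):

```python
rr_empty = [2.58, 1.75, 2.94, 4.91, 7.98]     # Table 2, row 8: RR finds none of the true regressors
en_empty = [4.52, 8.77, 24.11, 28.83, 58.28]  # Table 2, row 9: EN finds none of the true regressors

print(sum(rr_empty) / len(rr_empty))  # about 4.0
print(sum(en_empty) / len(en_empty))  # about 24.9, roughly six times 4%
```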
3.2. Recognizing Failure Scenarios
3.3. Regression Modeling of Simulation Outputs to Understand Patterns
3.4. Time-Series Modeling of Simulation Output to Understand Patterns
3.5. Examining Failure Scenarios for RR Miss Rates
3.6. Examining Failure Scenarios for RR Hit Rates
3.7. Examining Simulation Stability as Increases
3.8. Examining Scenarios by Setting Less than 15%
3.9. Results Summary
- (1)
- RR finds more of the true regressors than does EN, with very high probability (≈99%). This is critically important for linear model discovery in the sciences, when . Thus, it is advantageous to consider the set of significant regressors selected by RR using the hard constraint of 3 OPR built into the proposed algorithm. For example, the RR-selected regressors can be compared and contrasted with the EN-selected regressors in the context of the science of the process driving the regressand.
- (2)
- When RR fails to find more of the true regressors than does the EN (0.73% of the time across all simulations), these failures occur when is “small” (≤20%) and is “large” (200,000) or when is “large” (65%) and is “small” (10). All of these failures occur under “mild” ( 300) multicollinearity levels.
- (3)
- The probability that the EN finds none of the true regressors is about six times higher than the corresponding probability (≈25% vs. 4%) that RR finds none of the true regressors. This re-emphasizes the fact that the RR-selected regressors should, at a minimum, be compared and contrasted with those selected by the EN.
- (4)
- The squared length of the RR-estimated coefficient vector (, say) from the true coefficient vector () is less than the corresponding EN with high probability (≈86%). Note that includes the spurious coefficients that are either statistically significant (as in RR) or are the output of the optimization process (as in the EN). This indicates that the simpler RR-estimation process tends to produce more accurate estimates of with high probability. Thus, comparing and contrasting both vectors, in the context of the underlying science of the process generating the regressand would be quite advantageous.
- (5)
- When the squared length of from , yielded by RR, fails to be less than the corresponding one yielded by the EN (13.5% of the time), 95% of these failures occur when 20% and 99% of them occur when 35%. When 35%, the failures occur for “large” (100,000).
- (6)
- The squared length of the true coefficients in the RR-estimated (with the spurious ones set to zero), from , is less than the corresponding one from the corresponding EN with high probability (≈74%). This re-emphasizes, again, that comparing and contrasting the RR with the EN is quite important. We may not know a priori which coefficients in are the true ones, but the science underlying the process generating the regressand may provide insights regarding wherein the truth lies. Conversely, the RR would have an advantage over the EN to alert scientists to the existence of possible causal variables hitherto unconsidered or undiscovered.
- (7)
- When the squared length of the true coefficients in the RR-estimated from fails to be less than the corresponding one for the EN (25.79% of the time), about 95% of these failures occur when 40% and 99% of them occur when 50%. These failure rates drop steeply under extreme multicollinearity.
- (8)
- It is observed in the simulation that, whenever the squared length of from , yielded by RR, fails to be less than the corresponding one yielded by the EN, so does the corresponding squared length for the true coefficients in .
- (9)
- On the other hand, if the squared length of from , yielded by RR, is less than the corresponding one yielded by the EN, there is about a 14% probability that the squared length of the true coefficients in the RR estimated from fails to be less than the corresponding one for the EN. Furthermore, when this event occurs, there is about an 89% probability that it happens when .
- (10)
- For low to moderate levels of collinearity among the regressors, as measured by the traces of the respective matrices of regressors under consideration for RR estimation, the Hoerl, Kennard and Baldwin [26]-proposed values of the RR tuning parameters provide a good estimate of . For higher levels of collinearity, the corresponding values proposed by Lawless and Wang [27] are good. In practice, it would be best to compare and contrast the RR solutions derived using the HKB and LW values of , respectively.
- (11)
- A wide range of input parameters are covered in the simulation. Specifically, the squared length of from the origin (i.e., ), the a priori probability (i.e., ) that an element of is zero and the multicollinearity level of the matrix of regressors (i.e., ) are varied in the simulation. For a given set of data, is knowable, but, and , in general, are not. However, the science underlying the data being examined may yield some insights into and . If so, the tables with simulation output, in the body of this paper and those in the supplementary materials Excel file can indicate where RR is inferior to the EN in terms of metrics such as the accuracy of . Alternatively, the regression equations relating simulation inputs and outputs (e.g., as in Table 3) can indicate this. In such cases, the regressors found by RR and by EN can be pooled, and another EN or RR (or both) re-estimation can be done to see how it impacts the science underlying the data in terms of the causal relationships between the inputs believed to generate the output.
4. Discussion
4.1. This Work
- (i)
- A few simulations were done using the following “ensemble” approach to selecting the RR tuning parameter : when doing RR estimation by partitions, we use the LW value of when the ratio of the trace associated with a partition to the number of regressors in that partition exceeds 10; otherwise, we use the HKB value of . However, the few simulations done in this regard did not improve upon the selections of considered herein. It may be of interest to pursue some variant of this idea further by increasing the cutoff value of 10 to a higher number and seeing whether the ensemble approach to selecting is beneficial. (A sketch of this ensemble rule appears after this list.)
- (ii)
- The hard constraint of keeping 3 OPR when partitioning can be relaxed by increasing OPR to a higher number like 10. This can be done by running simulations on a “wider” dataset, for example one with = 5000 and = 100. For the dataset used herein, = 89 and = 33, which yields a -to- ratio of about 2.7. For the “wider” dataset, this ratio would be 50.
- (iii)
- Alternative selections of can be considered, for example, those of Hoerl and Kennard [63] and Inoue [64]. For RR “failure scenarios”, the conditional probability that RR finds fewer of the true regressors than does the EN is small (see row 16 of Table 2). However, for these failure scenarios, the RR is less accurate (in terms of squared distance, ) than the corresponding EN . It may be worth examining if other selections of , such as those proposed by Hemmerle [65], can improve RR performance in these failure scenarios.
- (iv)
- Following up on “(iii)” above, another consideration worth pursuing is iterative RR estimation for regressor selection. In this paper, significant regressors are selected by identifying significant values of the RR-estimated -ratio (i.e., ). However, this is done only once for each partition, following Hoerl, Schuenemeyer and Hoerl [35]—i.e., all regressors associated with insignificant RR -ratios are dropped, and the remaining regressors are retained as significant. Two versions of iterative RR estimation were proposed by Gana [66]. The first version, called “backward stepwise eliminating” (BSE), is the following: (a) we fit an LS regression to the partition under consideration; (b) we calculate the LW ; (c) we re-estimate the LS regression with RR using the LW and calculate the RR -ratio; (d) we drop the regressor having an RR -ratio with the highest -value above 20%; and (e) we redo steps “(a)” through “(d)” until the -values of the RR -ratios of the remaining regressors are below 20% or until no regressors meet this criterion. The second version, called “backward group eliminating” (BGE), is the following: we follow all of the steps laid out for BSE RR after modifying only step “(d)” to drop all regressors with RR -ratios whose -values are greater than 20%. That is, in BGE RR, groups of insignificant regressors are dropped at once, in contrast to BSE RR, wherein the “most” insignificant regressor is dropped one at a time, iteratively. For our problem, BSE RR may have some advantages. For example, some initial simulations indicate that there is a greater than 50% probability that RR will improve in terms of accuracy. (A code sketch of BSE RR appears after this list.)
- (v)
- Deeper explorations linking simulation outputs and inputs can be pursued. As mentioned before, multicollinearity levels () would be known a priori but not the squared length of the true coefficient vector from the origin () or the probabilities () generating the true coefficients. Linking simulation inputs to outputs may shed light on the nature of and by looking at observable outputs. For example, in Table 5, we note that, when RR fails to find more of the true regressors than EN, the probability of RR missing true regressors drops when RR finds more significant regressors than the number of observations (i.e., when ). Because is observable, the question is whether such results hold over the entire simulation space with high probability for or for other observable simulation outputs.
- (vi)
- Exploring connections between simulation outputs and inputs can also be pursued by researching whether connections exist between EN outputs and RR outputs.
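A sketch of the ensemble rule in “(i)”, assuming the “trace associated with a partition” means the trace of the inverse of the partition's correlation-form X'X (a standard collinearity measure; this reading of the trace is an assumption), with the HKB and LW formulas written out inline:

```python
import numpy as np

def ensemble_k(X_part, y, cutoff=10.0):
    """Per-partition ridge tuning parameter: use the LW value when the partition's
    collinearity measure (here, trace[(X'X)^{-1}] per regressor -- an assumption)
    exceeds `cutoff`, otherwise the HKB value."""
    n, p = X_part.shape
    XtX = X_part.T @ X_part
    b_ls = np.linalg.lstsq(X_part, y, rcond=None)[0]
    s2 = np.sum((y - X_part @ b_ls) ** 2) / (n - p - 1)
    if np.trace(np.linalg.inv(XtX)) / p > cutoff:
        return p * s2 / (b_ls @ XtX @ b_ls)   # LW value of k
    return p * s2 / (b_ls @ b_ls)             # HKB value of k
```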
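The BSE RR procedure in “(iv)” is easy to prototype. The sketch below (Python/numpy, not the code of [66]) recomputes the LW k and the approximate ridge t-ratio p-values at each pass and drops one regressor at a time; changing the drop step to remove every regressor above the threshold turns it into BGE RR.

```python
import numpy as np
from scipy import stats

def _ridge_pvalues(X, y, k):
    """Ridge fit at tuning parameter k plus approximate two-sided p-values for the
    ridge t-ratios (non-stochastic-k covariance approximation)."""
    n, p = X.shape
    XtX = X.T @ X
    A = np.linalg.inv(XtX + k * np.eye(p))
    beta_r = A @ X.T @ y
    b_ls = np.linalg.lstsq(X, y, rcond=None)[0]
    s2 = np.sum((y - X @ b_ls) ** 2) / (n - p - 1)
    se = np.sqrt(np.diag(s2 * A @ XtX @ A))
    return beta_r, 2 * stats.t.sf(np.abs(beta_r / se), df=n - p - 1)

def bse_ridge(X, y, cols, alpha=0.20):
    """Backward stepwise eliminating (BSE) RR: refit LS, compute the LW k, refit
    ridge, drop the single regressor whose ridge t-ratio has the largest p-value
    above `alpha`, and repeat until every survivor is significant or none remain."""
    cols = list(cols)
    while cols:
        Xc = X[:, cols]
        n, p = Xc.shape
        b_ls = np.linalg.lstsq(Xc, y, rcond=None)[0]
        s2 = np.sum((y - Xc @ b_ls) ** 2) / (n - p - 1)
        k_lw = p * s2 / (b_ls @ (Xc.T @ Xc) @ b_ls)        # steps (a)-(b): LW k
        beta_r, pvals = _ridge_pvalues(Xc, y, k_lw)        # step (c)
        worst = int(np.argmax(pvals))
        if pvals[worst] <= alpha:                          # all survivors significant
            return cols, beta_r
        cols.pop(worst)                                    # step (d): drop one regressor
    return [], np.array([])                                # nothing survived
```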
4.2. Limitations
5. Conclusions
Supplementary Materials
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
BGE | Backward group eliminating |
BSE | Backward stepwise eliminating |
EN | Elastic Net |
GRR | Generalized ridge regression |
HKB | Hoerl, Kennard and Baldwin |
LS | Least squares |
LW | Lawless and Wang |
MSE | Mean squared error |
NLP | Nonlinear programming |
RR | Ridge regression |
SM | Supplementary material |
SNR | Signal-to-noise ratio |
VIF | Variance inflation factor |
References
- Hoerl, A.E.; Kennard, R.W. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 1970, 12, 55–67.
- Petkovsek, M.; Wilf, H.S.; Zeilberger, D. A = B; CRC Press, Taylor & Francis Group: Boca Raton, FL, USA, 1996.
- Schott, J.R. Matrix Analysis for Statistics; John Wiley, Inc.: Hoboken, NJ, USA, 2016.
- Seber, G.A.F. A Matrix Handbook for Statisticians; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2007.
- Vinod, H.D. Equivariance of ridge estimators through standardization—A note. Commun. Stat.-Theory Methods 1978, 7, 1157–1161.
- Frank, I.E.; Friedman, J.H. A Statistical View of Some Chemometrics Regression Tools. Technometrics 1993, 35, 109–135.
- Halawa, A.M.; El Bassiouni, M.Y. Tests of regression coefficients under ridge regression models. J. Stat. Comput. Simul. 2000, 65, 341–356.
- Gokpinar, E.; Ebegil, M. A study on tests of hypothesis based on ridge estimator. Gazi Univ. J. Sci. 2016, 29, 769–781.
- Muniz, G.; Kibria, G.B.M.; Shukur, G. On Developing Ridge Regression Parameters: A Graphical Investigation. Stat. Oper. Res. Trans. 2012, 36, 115–138.
- Piegorsch, W.W.; Casella, G. The Early Use of Matrix Diagonal Increments in Statistical Problems. SIAM Rev. 1989, 31, 428–434.
- Hoerl, R.W. Ridge Analysis 25 Years Later. Am. Stat. 1985, 39, 186–192.
- Hoerl, R.W. Ridge Regression: A Historical Context. Technometrics 2020, 62, 420–425.
- Brook, R.J.; Moore, T. On the expected length of the least squares coefficient vector. J. Econom. 1980, 12, 245–246.
- Smith, G.; Campbell, F. A Critique of Some Ridge Regression Methods. J. Am. Stat. Assoc. 1980, 75, 74–81.
- Tibshirani, R. Regression Shrinkage and Selection Via the Lasso. J. R. Stat. Soc. Ser. B 1996, 58, 267–288.
- Osborne, M.; Presnell, B.; Turlach, B. A new approach to variable selection in least squares problems. IMA J. Numer. Anal. 2000, 20, 389–403.
- Efron, B.; Hastie, T.; Johnstone, I.; Tibshirani, R. Least angle regression. Ann. Stat. 2004, 32, 407–499.
- Delbos, F.; Gilbert, J.C. Global Linear Convergence of an Augmented Lagrangian Algorithm to Solve Convex Quadratic Optimization Problems. J. Convex Anal. 2005, 12, 45–69.
- Taylor, H.L.; Banks, S.C.; McCoy, J.F. Deconvolution with the ℓ1 norm. Geophysics 1979, 44, 39–52.
- Santosa, F.; Symes, W.W. Linear Inversion of Band-Limited Reflection Seismograms. SIAM J. Sci. Stat. Comput. 1986, 7, 1307–1330.
- Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B 2005, 67, 301–320.
- Boissonnade, A.; Lagrange, J.L.; Vagliente, V.N. Analytical Mechanics; Springer: Berlin/Heidelberg, Germany, 1997.
- Bussotti, P. On the Genesis of the Lagrange Multipliers. J. Optim. Theory Appl. 2003, 117, 453–459.
- Plackett, R.L. Studies in the History of Probability and Statistics. XXIX. Biometrika 1972, 59, 239–251.
- Stigler, S.M. Gauss and the Invention of Least Squares. Ann. Stat. 1981, 9, 465–474.
- Hoerl, A.E.; Kannard, R.W.; Baldwin, K.F. Ridge regression: Some simulations. Commun. Stat. 1975, 4, 105–123.
- Lawless, J.F.; Wang, P. A simulation study of ridge and other regression estimators. Commun. Stat.-Theory Methods 1976, 5, 307–323.
- Austin, P.C.; Steyerberg, E.W. The number of subjects per variable required in linear regression analyses. J. Clin. Epidemiol. 2015, 68, 627–636.
- Whitehead, A.N.; Russell, B.A.W. Principia Mathematica to *56; Cambridge University Press: Cambridge, UK, 1962.
- Ricci, M.M.G.; Levi-Civita, T. Méthodes de calcul différentiel absolu et leurs applications. Math. Ann. 1900, 54, 125–201.
- Efroymson, M.A. Multiple Regression Analysis. In Mathematical Methods for Digital Computers; Ralston, A., Wilf, H.S., Eds.; John Wiley: New York, NY, USA, 1960.
- Hocking, R.R. A Biometrics Invited Paper. The Analysis and Selection of Variables in Linear Regression. Biometrics 1976, 32, 1–49.
- Miller, A.J. The Convergence of Efroymson’s Stepwise Regression Algorithm. Am. Stat. 1996, 50, 180–181.
- Gana, R. Ridge regression and the Lasso: How do they do as finders of significant regressors and their multipliers? Commun. Stat.-Simul. Comput. 2020, 1–35.
- Hoerl, R.W.; Schuenemeyer, J.H.; Hoerl, A.E. A Simulation of Biased Estimation and Subset Selection Regression Techniques. Technometrics 1986, 28, 369–380.
- Schwarz, G. Estimating the Dimension of a Model. Ann. Stat. 1978, 6, 461–464.
- SAS. SAS Enterprise Guide 7.15 HF9 (64-bit); SAS Institute Inc.: Cary, NC, USA, 2017.
- Marquardt, D.W.; Snee, R.D. Ridge Regression in Practice. Am. Stat. 1975, 29, 3–20.
- Hoerl, A.E.; Kennard, R.W. Ridge Regression: Applications to Nonorthogonal Problems. Technometrics 1970, 12, 69–82.
- Hoerl, A.E.; Kennard, R.W. Ridge Regression: Degrees of Freedom in the Analysis of Variance. Commun. Stat.-Simul. Comput. 1990, 19, 1485–1495.
- White, H. A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity. Econometrica 1980, 48, 817–838.
- Obenchain, R.L. Classical F-Tests and Confidence Regions for Ridge Regression. Technometrics 1977, 19, 429–439.
- Fubini, G. Opere scelte. Cremonese 1958, 2, 243–249.
- Fubini, G. Sugli integrali multipli. Rom. Acc. L. Rend. 1907, 16, 608–614.
- Shampine, L.F. Matlab program for quadrature in 2D. Appl. Math. Comput. 2008, 202, 266–274.
- Shampine, L.F. Vectorized adaptive quadrature in MATLAB. J. Comput. Appl. Math. 2008, 211, 131–140.
- Matlab. R2021b Update 2; The MathWorks Inc.: Natick, MA, USA, 2022.
- Durbin, J. Testing for Serial Correlation in Least-Squares Regression When Some of the Regressors are Lagged Dependent Variables. Econometrica 1970, 38, 410–421.
- Durbin, J. Tests for Serial Correlation in Regression Analysis Based on the Periodogram of Least-Squares Residuals. Biometrika 1969, 56, 1–15.
- Vinod, H.D. Generalization of the Durbin-Watson statistic for higher order autoregressive processes. Commun. Stat. 1973, 2, 115–144.
- Spitzer, J.J. Small-Sample Properties of Nonlinear Least Squares and Maximum Likelihood Estimators in the Context of Autocorrelated Errors. J. Am. Stat. Assoc. 1979, 74, 41–47.
- Elliott, G.; Rothenberg, T.J.; Stock, J.H. Efficient Tests for an Autoregressive Unit Root. Econometrica 1996, 64, 813–836.
- Rose, A. Vision: Human and Electronic; Plenum Press: New York, NY, USA, 1973.
- Burgess, A.E. The Rose Model, Revisited. J. Opt. Soc. Am. A Opt. Image Sci. Vis. 1999, 16, 633–646.
- Bendel, R.B.; Afifi, A.A. Comparison of Stopping Rules in Forward “Stepwise” Regression. J. Am. Stat. Assoc. 1977, 72, 46–53.
- Myers, R.H. Classical and Modern Regression with Applications; International Thomson Publishing: London, UK, 1990.
- Penrose, R. The Emperor’s New Mind; Oxford University Press: Oxford, UK, 2016.
- Levinson, N. A Motivated Account of an Elementary Proof of the Prime Number Theorem. Am. Math. Mon. 1969, 76, 225–245.
- Hardy, G.H. Ramanujan: Twelve Lectures on Subjects Suggested by His Life and Work; Chelsea Publishing Company: New York, NY, USA, 1978.
- Bohr, H. Address of Professor Harald Bohr. In Proceedings of the International Congress of Mathematicians, Cambridge, MA, USA, 3 August–6 September 1950; American Mathematical Society: Cambridge, MA, USA, 1950; p. 129.
- Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
- Breiman, L. Statistical Modeling: The Two Cultures (With Comments and a Rejoinder by the Author). Stat. Sci. 2001, 16, 199–231.
- Hoerl, A.E.; Kennard, R.W. Ridge regression iterative estimation of the biasing parameter. Commun. Stat.-Theory Methods 1976, 5, 77–88.
- Inoue, T. Improving the ‘Hkb’ Ordinary Type Ridge Estimator. J. Jpn. Stat. Soc. 2001, 31, 67–83.
- Hemmerle, W.J. An Explicit Solution for Generalized Ridge Regression. Technometrics 1975, 17, 309–314.
- Gana, R. Did COVID-19 Force a President to Face the St. Petersburg Paradox and lose the White House? Soc. Sci. Res. Netw. 2021, 1, 7. Available online: www.ssrn.com/abstract=3898613 (accessed on 6 August 2021).
- Gana, R.; Vasudevan, S. Ridge regression estimated linear probability model predictions of O-glycosylation in proteins with structural and sequence data. BMC Mol. Cell Biol. 2019, 20, 21.
- Ishwaran, H.; Rao, J.S. Geometry and properties of generalized ridge regression in high dimensions. Contemp. Math. 2014, 622, 81–93.
- Lawless, J.F. Mean Squared Error Properties of Generalized Ridge Estimators. J. Am. Stat. Assoc. 1981, 76, 462–466.
- Sümmermann, M.L.; Sommerhoff, D.; Rott, B. Mathematics in the Digital Age: The Case of Simulation-Based Proofs. Int. J. Res. Undergrad. Math. Educ. 2021, 7, 438–465.
- Diamond, H.G. Elementary methods in the study of the distribution of prime numbers. Bull. Am. Math. Soc. 1982, 7, 553–589.
Metric in Plain English (for Generalists) and Mathematized (for Specialists) | Measure * |
---|---|
Average probability that the squared length of the RR-estimated coefficient vector (from the true coefficient vector) is shorter than the corresponding one for the EN: | 79% |
Probability that the squared length of the RR-estimated coefficient vector is shorter than the corresponding one for the EN more frequently (viz. 50 + % of the time): | 86% |
Average probability that the squared length of the RR-estimated true coefficient vector is shorter than the corresponding one for the EN: | 67% |
Probability that the squared length of the RR-estimated true coefficient vector is shorter than the corresponding one for the EN more frequently: | 74% |
Average probability that RR finds more of the true regressors than does the EN: | 89% |
Probability that RR finds more of the true regressors than does the EN more frequently: | 99% |
Average proportion of true regressors found by RR: | 65% |
Average proportion of true regressors found by the EN: | 41% |
Conditional probability that RR finds fewer of the true regressors than the EN more frequently, given that the squared length of the RR-estimated coefficient vector is shorter than the corresponding one for the EN less frequently (i.e., the RR downside to finding true regressors when RR coefficients are relatively imprecise is small): | 3% |
Probability that RR finds none of the true regressors: | 4% |
Probability that the EN finds none of the true regressors: | 25% |
Probability that the proportion of times the squared length of the RR-estimated coefficient vector chosen with the HKB RR tuning parameter is shorter than the corresponding one chosen with the LW RR tuning parameter more frequently: | 11% |
Probability that the proportion of times the squared length of the RR-estimated true coefficient vector using the HKB RR tuning parameter is shorter than the corresponding one using the LW RR tuning parameter more frequently: | 48% |
Row | Metric (Measured as Percent) | 1 | ||||
---|---|---|---|---|---|---|
100 | 300 | 1000 | 4000 | Extreme | ||
1 | 82.6 (79.3) | 84.8 (78.0) | 87.1 (74.6) | 73.6 (65.1) | 0.0 (0.99) | |
2 | 82.6 (79.5) | 84.8 (78.8) | 87.1 (78.4) | 84.7 (76.4) | 93.3 (83.7) | |
3 | 73.6 (67.0) | 73.1 (66.1) | 70.6 (64.4) | 65.0 (58.7) | 0.0 (5.5) | |
4 | 71.0 (66.4) | 68.4 (64.8) | 67.1 (64.2) | 65.0 (62.0) | 89.0 (76.8) | |
5 | 96.8 (75.2) | 99.4 (84.8) | 100 (90.1) | 100 (94.3) | 100 (99.8) | |
6 | 45.8 (43.1) | 7.0 (36.8) | 0.0 (26.8) | 0.0 (17.7) | 0.0 (0.29) | |
7 | 78.7 (52.1) | 69.0 (50.3) | 51.2 (43.4) | 41.7 (38.1) | 0.0 (5.0) | |
8 | 2.58 | 1.75 | 2.94 | 4.91 | 7.98 | |
9 | 4.52 | 8.77 | 24.11 | 28.83 | 58.28 | |
10 | 2 | 0.54 [0.9] | 1.08 [1.3] | 1.22 [1.6] | 0.81 [1.3] | 0.33 [0.7] |
11 | 2 | 1.60 [2.6] | 1.53 [3.1] | 1.26 [3.2] | 2.39 [4.3] | 9.03 [14.5] |
12 | 2 | 53.7 [12.1] | 57.0 [9.9] | 61.0 [8.7] | 66.9 [6.6] | 85.8 [5.8] |
13 | 2 | 49.0 [16.8] | 47.5 [15.1] | 46.3 [13.8] | 44.4 [12.6] | 18.4 [11.4] |
14 | 73.6 | 84.8 | 91.8 | 96.3 | 100 | |
15 | 40.0 | 39.2 | 51.2 | 58.9 | 98.2 | |
16 | 3 | 7.41 | 3.85 | 0 | 0 | 0 |
Simulation variables used as regressors | LS/RR coefficients (LS VIF) | LS value | Simulation variables used as regressors | LS/RR coefficients (LS VIF) | LS value |
Intercept | 1.3924/1.32 | 38.39 | Intercept | 1.1454/1.12 | 48.56 |
−0.0296/−0.03 (8.7) | 5.50 | −0.0394/−0.04 (6.6) | 8.40 | ||
−0.3842/−0.25 (5.3) | 8.27 | Not significant | |||
−0.4882/−0.42 (8.9) | 12.08 | −0.3338/−0.35 (5.0) | 8.92 | ||
0.5043/0.37 (35.4) | 4.72 | 0.2778/0.33 (21.6) | 3.24 | ||
0.0331/0.035 (32.7) | 3.35 | 0.0348/0.03 (21.5) | 4.14 | ||
−0.5394/−0.47 (5.3) | 17.79 | −0.4787/−0.44 (5.6) | 17.04 | ||
1.3230/1.05 (3.2) | 8.48 | 1.0866/0.95 (3.0) | 13.42 | ||
Sample size | 155 | 155 | |||
LS value | 309.90 | 474.59 | |||
LS/RR RMSE | 0.0727/0.0756 | 0.0727/0.0736 | |||
LS R-squared | 93.7% | 95.1% | |||
Conditional probability that RR finds fewer of the true regressors than EN more frequently, given the RR-estimated coefficients are less precise than those of EN more frequently * | 7.41% | NA (not applicable) | |||
Selected using ridge traces | 0.02 | 0.02 | |||
Average (via double integration) in the rectangular region bounded by , , and , and using the LS/RR regression coefficients | 83.3%/84.7% | Not computed | |||
Average (via double integration) in the rectangular region bounded by , , and , using the LS/RR regression coefficients | 84.3%/81.0% | Not computed |
Regressand Is | Regressand Is | |||
Simulation variables used as regressors | LS/RR coefficients (LS VIF) | LS value | LS/RR coefficients (LS VIF) | LS value |
Intercept | 0.2332/0.2378 | 11.39 | −1.061/−1.074 | 10.93 |
0.0068/0.0051 (2.3) | 4.07 | 0.0279/0.021 (2.3) | 3.51 | |
−0.1981/−0.1719 (2.6) | 10.23 | −1.259/−1.097 (2.6) | 13.71 | |
−0.1587/−0.1067 (4.8) | 8.12 | −0.8406/−0.57 (4.8) | 9.07 | |
1.2296/0.6209 (18.6) | 18.72 | 5.8716/3.015 (18.6) | 18.85 | |
−0.1918/−0.0532 (16.0) | 11.41 | −0.8453/−0.21 (16.0) | 10.61 | |
0.1205/0.1139 (2.4) | 9.91 | 0.5044/0.49 (2.4) | 8.75 | |
Not statistically significant at | ||||
Sample size | 155 | 155 | ||
LS value | 160.99 | 192.50 | ||
LS/RR RMSE | 0.0438/0.0551 | 0.2076/0.2606 | ||
LS R-squared | 86.7% | 88.6% | ||
Selected using ridge traces | 0.04 | 0.04 | ||
Average (via double integration) in the rectangular region bounded by , , , and , and using the LS/RR regression coefficients | Not computed | 23.5%/28.2% | ||
Average (via double integration) in the rectangular region boundedby , , , and , and using the LS/RR regression coefficients | Not computed | 19.5%/18.1% |
5% | 200,000 | 100 | 47.1% | 54.7% | 49.0% |
5% | 600,000 | 100 | 47.6% | 52.8% | 46.8% |
5% | 600,000 | 300 | 16.9% | 55.8% | 52.1% |
65% | 10 | 100 | 57.5% | 76.7% | 20.0% |
80% | 10 | 100 | 56.5% | 82.0% | 17.6% |
95% | 10 | 100 | 56.7% | 90.5% | 10.5% |
Miss Rate | Miss Rate | Miss Rate | Miss Rate | ||||
---|---|---|---|---|---|---|---|
0.5030 | 0.5330 | 0.5235 | 0.5445 | ||||
0.5350 | 0.5415 | 0.5430 | 0.5100 | ||||
0.5225 | 0.5260 | 0.5310 | 0.5400 | ||||
0.5185 | 0.5265 | 0.5245 | 0.5175 |
Metric | ||||||||
---|---|---|---|---|---|---|---|---|
5% | 20% | 35% | 40% | 50% | 65% | 80% | 95% | |
1.22 | 42.67 | 60.67 | 78.13 | 140.88 | 297.41 | 188.19 | 318.21 | |
0.99 | 32.29 | 30.43 | 51.47 | 72.81 | 108.08 | 176.82 | 260.40 | |
490.62 | 354.04 | 330.69 | 247.00 | 544.53 | 661.89 | 512.00 | 776.72 | |
320.83 | 378.64 | 244.60 | 284.69 | 387.35 | 269.77 | 271.48 | 333.68 | |
56.44 | 23.37 | 22.19 | 20.75 | 28.01 | 12.83 | 23.43 | 17.14 |
Row | Metric (Measured as Percent to the Nearest Integer) | ||||||
---|---|---|---|---|---|---|---|
100 | 1000 | 4000 | Extreme | 300 | Extreme | ||
1 | 82 | 79 | 74 | 91 | 84 | 93 | |
2 | 28 | 43 | 50 | 66 | 35 | 71 | |
3 | 54 | 55 | 54 | 85 | 64 | 89 | |
4 | 13 | 25 | 30 | 41 | 20 | 55 | |
5 | 3 | 6 | 76 | 100 | 78 | 100 | |
6 | 13 | 23 | 60 | 100 | 50 | 100 | |
7 | 10 | 12 | 20 | 12 | 4 | 14 | |
8 | 18 | 28 | 55 | 47 | 15 | 47 | |
9 | 6 | 22 | 31 | 48 | 11 | 57 | |
10 | 23 | 40 | 53 | 100 | 35 | 100 |
© 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).