*2.3. Stage 3*

Having eliminated the variables with high *VIF*, we then examined whether the resulting subsets all yield appropriate models with low redundancy. Most commonly, this task is solved by stepwise regression, which identifies an appropriate regression model without examining all subsets [97–100]. For many decades, this approach to regression model building has been used extensively in statistics and econometrics as an acceptable trade-off between time expenditure and model performance [101–103]. Nowadays, regression model building commonly employs the best subsets approach (BSA), which evaluates all possible regression models for a given set of regressors in a time-efficient and accurate manner [104–106].

Generally, BSA-based checking of regression models relies on the adjusted *R*<sup>2</sup> [107], which adjusts the *R*<sup>2</sup> of each subset to account for the number of regressors and the sample size [61]. In this study, adjusted *R*<sup>2</sup> was preferred over *R*<sup>2</sup> because the Stage 2 subsets to be compared contained different numbers of *Xn*. Among the competing subsets, the study proceeded with the one with the largest adjusted *R*<sup>2</sup>. In addition to adjusted *R*<sup>2</sup>, when the goal is to find the most appropriate model among multiple subsets of regressors, Mallows' *C<sub>p</sub>* statistic (Equation (2)) is generally applied [60,61]. Its applications include checking matchings between the subsets [108], model averaging [109–111], measuring the deviations from perfect rankings [112], and model selection [113].
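
As a minimal sketch of the adjusted *R*<sup>2</sup> comparison described above, the snippet below applies the standard correction `1 - (1 - R2)(n - 1)/(n - k - 1)` and picks the subset with the largest value. The function name `adjusted_r2`, the subset labels, and all numeric values are hypothetical illustrations, not data from this study.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 for a model with k regressors fitted to n observations."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than parameters (n > k + 1)")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Hypothetical competing subsets: (label, R^2, number of regressors k).
candidates = [("A", 0.80, 3), ("B", 0.82, 6)]
n_obs = 30  # assumed sample size, identical for both subsets

scores = {label: adjusted_r2(r2, n_obs, k) for label, r2, k in candidates}

# The subset with the largest adjusted R^2 is retained.
best = max(scores, key=scores.get)
print(scores, best)
```

In this made-up example, subset B has the higher raw *R*<sup>2</sup>, yet subset A is selected because B's three extra regressors are penalized more than they improve the fit.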

$$C_p = \frac{\left(1 - R_k^2\right)(n - T)}{1 - R_T^2} - \left(n - 2(k + 1)\right) \tag{2}$$

where *C<sub>p</sub>* = Mallows' *C<sub>p</sub>* statistic; *n* = number of observations; *k* = number of regressors in the subset model; *T* = total number of parameters in the full model, including the intercept; *R*<sup>2</sup><sub>*k*</sub> = coefficient of multiple determination for a model with *k* regressors; *R*<sup>2</sup><sub>*T*</sub> = coefficient of multiple determination for the full model with all *T* parameters.

In this study, *C<sub>p</sub>* was used to measure how far the models constructed at Stage 2 deviate from the optimal (or true) models that best explain the observed correlations. The idea was that the closer *C<sub>p</sub>* is to the number of parameters in a subset model, (*k* + 1), that is, the regressors plus the intercept, the more accurate the model (only random differences from the optimal model might occur). Thus, Stage 3 identified the subsets whose *C<sub>p</sub>* was close to or below (*k* + 1). In total, eight subsets of independent *Xn* variables were built for eight territories.
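
The screening rule above can be sketched as follows. The code evaluates Equation (2) and keeps the subsets whose statistic does not exceed (*k* + 1) by more than a tolerance. The names `mallows_cp` and `passes_cp_screen`, the tolerance `tol` (one possible reading of "close to"), and all numeric inputs are assumptions made for illustration and are not taken from this study.

```python
def mallows_cp(r2_k: float, r2_full: float, n: int, k: int, t: int) -> float:
    """Equation (2): Cp of a k-regressor subset against the full model.

    t is the number of parameters in the full model, including the intercept.
    """
    if r2_full >= 1.0:
        raise ValueError("full-model R^2 must be below 1")
    return (1.0 - r2_k) * (n - t) / (1.0 - r2_full) - (n - 2 * (k + 1))


def passes_cp_screen(cp: float, k: int, tol: float = 0.5) -> bool:
    """True if Cp is close to or below (k + 1); tol is a hypothetical margin."""
    return cp <= (k + 1) + tol


# Hypothetical setting: 30 observations, full model with 6 regressors + intercept.
n_obs, t_full, r2_full = 30, 7, 0.82

# Hypothetical subsets: (label, R^2 of the subset model, number of regressors k).
subsets = [("S1", 0.80, 3), ("S2", 0.70, 2)]

kept = []
for label, r2_k, k in subsets:
    cp = mallows_cp(r2_k, r2_full, n_obs, k, t_full)
    if passes_cp_screen(cp, k):
        kept.append(label)
    print(label, round(cp, 2), "threshold:", k + 1)

# Sanity check: for the full model itself, Equation (2) reduces to Cp = T.
full_cp = mallows_cp(r2_full, r2_full, n_obs, t_full - 1, t_full)
```

With these illustrative numbers, S1 has a *C<sub>p</sub>* below its threshold of 4 and is retained, whereas S2 far exceeds its threshold of 3, which suggests that an important regressor was omitted.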
