1. Introduction
A data-driven culture, grounded in data-informed decision-making, has become increasingly relevant across diverse industries, driven by the growing availability of data and by technological advancements [1]. At the same time, big data has presented new challenges for analyzing and interpreting these massive volumes of information [2].
In this context, Chowdhury and Turin [3] state that several variable selection techniques can be applied when constructing a statistical model. Among them, the stepwise procedure is used to identify a limited number of variables, especially in prediction problems or in situations with many candidate variables [4]. However, other techniques, such as lasso-like regularization procedures, may be more suitable and perform better in certain cases [5]. Ultimately, it is up to the researcher to determine which method best fits their needs.
The stepwise procedure automatically removes or retains the β parameters of the linear regression model according to the chosen criteria, so that the final model contains only β parameters that are statistically different from zero at a given significance level [6].
In addition, stepwise provides a systematic approach to building models, which is especially useful when dealing with many variables. It allows analysts to focus on the most relevant variables, simplifying the interpretation of results [7]. Stepwise ensures the validity and importance of the chosen variables and reduces the additional error introduced by redundant variables [8].
Considering that the practical selection of variables plays a crucial role in constructing regression models, the computational implementation of the stepwise procedure offers an automated and efficient approach to selecting variables, allowing the construction of more robust models. In this sense, it is possible to perform the computational implementation of the stepwise regression procedure in several programming languages, each offering specific packages or libraries to assist in this process.
The stepwise procedure is widely found in the literature, with multidisciplinary applications, such as storm prediction [9]; modeling of real estate sales prices [10]; agricultural production forecast [11,12]; environment-related analyses [13,14,15,16,17]; regression models in healthcare [18,19,20]; and analyses related to human behavior [21]. These are just a few of the many applications of the stepwise procedure in machine-learning models of multiple regressions.
As for the computational implementation of stepwise, packages and libraries exist in several programming languages and software environments, such as R (v. 4.3.1), SAS (v. 9.4M8), MATLAB (v. R2024b), Julia (v. 1.9.3), SPSS (v. 29.0.1.1), Stata (v. 18), Java (v. 21), and C++ (v. 20), among others. However, Python (v. 3.13.0) still lacks a dedicated library that implements stepwise selection based on the statistical significance of the variables. As the most popular programming language for data science [22], Python has gained widespread adoption for statistical analysis and machine learning [23]. This highlights the importance of having tools that cater to rigorous statistical methodologies within such a versatile programming environment.
It is possible to perform the stepwise selection procedure based on the statistical significance of the variables manually; however, for models with a large number of variables, it is convenient to have a package that automates this process, making it easier to select relevant and statistically significant independent variables for the model.
In Python, the mlxtend library provides the SequentialFeatureSelector (SFS) class, which performs stepwise selection. However, two caveats are necessary. First, the package only selects the variables, so the model must be fitted again after the selection. Second, and more importantly, the package selects the variables based on some metric, such as $R^2$, MSE, MAE, RMSE, AIC, and BIC, among others, and it does not consider the statistical significance of the variables.
The stepwise procedure is applied to models with multiple independent variables ($X_1, X_2, \ldots, X_k$), where the issue of multicollinearity arises [24]. A variable may be removed during the stepwise process for several reasons: if its parameter is not statistically different from zero at a given significance level (i.e., the variable $X$ is not significant on its own); or if multicollinearity is present, meaning that there is a correlation between variables, making it unnecessary to include two variables when one may already explain the behavior of the other. From a predictive standpoint, in cases of multicollinearity, one of the independent variables $X$ will be removed to avoid prediction issues caused by multicollinearity. In other words, if an independent variable whose beta is not statistically different from zero is not excluded, the model loses its predictive power [6]. This is because that $\beta$ parameter, in the presence of other variables, ceteris paribus, does not contribute to the predictive composition, making the variable statistically insignificant in explaining the behavior of the dependent variable $Y$ alongside the other $X$ variables. Including such a $\beta$ parameter could alter the magnitude and potentially even the sign of the other beta parameters, without contributing meaningfully to the construction of a predictive model [6,24,25].
In this sense, it is important to provide an implementation alternative that uses the statistical significance of the variables as a criterion. According to [6], metrics such as the $R^2$ adjustment coefficient do not inform researchers whether a given explanatory variable is statistically significant and whether it is the true cause of changes in the behavior of the dependent variable. Moreover, this metric does not allow for an assessment of potential omitted variable bias or whether the choice of the explanatory variables included in the proposed model was appropriate. Thus, regarding the application of stepwise selection, it is insufficient to consider only adjustment or accuracy metrics; it is essential to take into account the statistical significance of the explanatory variables when selecting which ones remain in the final model after stepwise selection, since a model built from sample data that includes statistically insignificant beta parameters is not considered a final model when the goal is prediction [6,24,26].
It is important to note that this approach is more commonly used in applied sciences, such as statistics, engineering, and economics, where there is a need to develop models that identify sample phenomena and align the estimated model with the real-world sample data. However, in some contexts, such as the study of deterministic phenomena, the statistical significance of beta parameters receives less emphasis [6].
Despite the wealth of features offered by libraries such as “statsmodels” and “scikit-learn”, Python has had no specific function for the stepwise procedure based on the statistical significance of the variables. This gap limits the efficiency and practicality of implementing this procedure in research projects and practical applications. The absence of a specific library not only complicates the adoption of the stepwise procedure but also makes it difficult to standardize and reproduce experiments, resulting in additional effort for users who wish to adopt and explore the various methods of variable selection, which can be based on metrics or statistical significance. Filling this gap represents an opportunity to advance the development of specialized tools for automated variable selection.
This article proposes the development of a library in Python that implements the stepwise procedure based on the statistical significance of the variables as a simple and intuitive function. This function fills the gap in the literature by providing specialized functionalities for the automated selection of variables in multiple regression models. The library was developed based on the best practices of programming, modularity, and precise documentation to facilitate users’ use and understanding. In addition, the library will promote the standardization and reproduction of experiments, simplifying the research process and the development of statistical projects.
In summary, this study has the following objectives:
Propose and detail the stepwise function based on the statistical significance of the variables in Python;
Exemplify how the application of the proposed stepwise function can help in retaining only the statistically significant variables, potentially improving the overall model performance and enhancing the statistical reliability of the results;
Present a methodological framework for the treatment and mitigation of problems of heteroskedasticity, multicollinearity, and nonadherence of residuals to normality, providing its computational implementation;
Consolidate the concepts discussed in a real case study of real estate pricing, considering linear and nonlinear multiple regression models with stepwise selection.
Given the above, this paper aims to advance the understanding and application of statistical methods in Python, allowing researchers, students, and professionals from various areas to use these techniques in a way that is more effective, intuitive, and grounded. The results presented here may raise new debates and contributions to data analysis and statistical modeling.
2. Background
2.1. Multiple Linear Regression
Multiple linear regression (MLR) models are a class of statistical models used to analyze the relationship between a continuous dependent variable and several independent variables (or predictors) [27,28,29].
Favero et al. [25] state that MLR primarily enables analyses of the relationship between several explanatory variables, presented in linear form, and a quantitative dependent variable. Thus, it is possible to define a general MLR model as Equation (1):
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + u_i \qquad (1)$$
where $Y$ represents the dependent variable; $\beta_0$ is the intercept; $\beta_j$ ($j = 1, 2, \ldots, k$) are the coefficients of each variable (angular coefficients); $X_j$ are the explanatory variables (metric or dummy); and $u$ is the error term (the difference between the actual and predicted values of $Y$ through the model for each observation). The subscript $i$ represents each observation of the sample under analysis ($i = 1, 2, \ldots, n$, where $n$ is the sample size).
It is possible to write the error term $u_i$, for each observation $i$, as presented in Equation (2):
$$u_i = Y_i - \hat{Y}_i \qquad (2)$$
where $\hat{Y}_i$ represents the predicted value of the dependent variable that the estimated model will generate for observation $i$ [6]. The error terms $u_i$ arise for reasons that researchers need to know and consider [25,30,31], such as the existence of aggregated and non-random variables, failures during model specification, and errors in data collection.
One of the most widely used techniques for estimating a multiple linear regression model is the ordinary least squares (OLS) method [6]. A model estimated by OLS must determine $\beta_0$ and $\beta_j$ so that the sum of the squared residuals is as small as possible [32].
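As a minimal illustration of OLS estimation, the sketch below fits a multiple linear regression with NumPy on simulated data and recovers the residuals as the difference between observed and predicted values. The simulated coefficients, the random seed, and all variable names are assumptions made only for this example; they are not part of the proposed library.

```python
import numpy as np

# Simulated data (assumption): two explanatory variables and a known linear relation.
rng = np.random.default_rng(42)
n = 100
X = rng.normal(size=(n, 2))
y = 1.5 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Design matrix with a leading column of ones for the intercept.
design = np.column_stack([np.ones(n), X])

# OLS: parameters that minimize the sum of squared residuals.
params, *_ = np.linalg.lstsq(design, y, rcond=None)

y_hat = design @ params   # predicted values
u = y - y_hat             # residuals (observed minus predicted)
print(params)
```

With an intercept in the model, the OLS residuals sum to zero up to floating-point error, which is a quick sanity check on the fit.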
In this context, an important concept related to the explanatory power of the regression model is the coefficient of adjustment or explanation ($R^2$) [6]. Considering a multiple regression model, $R^2$ represents how much of the behavior of variable $Y$ is explained by the joint variation of the variables $X$ considered in the model. However, Fávero and Belfiore [32] point out that there is not necessarily a cause-and-effect relationship between variables $X$ and $Y$.
Stock and Watson [33] define the $R^2$ coefficient as the fraction of the sample variance of $Y$ explained by the explanatory variables. It is possible to use the $R^2$ coefficient to measure the degree of fit of the proposed model [31].
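The definition of the coefficient as the explained fraction of the sample variance can be sketched as follows. The function name `r_squared` and the small data set are hypothetical and serve only to make the arithmetic concrete.

```python
import numpy as np

def r_squared(y, y_hat):
    """Fraction of the sample variance of y explained by the predictions y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)        # unexplained (residual) sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)     # total sum of squares around the mean
    return 1.0 - ss_res / ss_tot

# Illustrative values (assumption): observed and predicted responses.
y_obs = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.8, 5.1, 7.2, 8.9]
r2 = r_squared(y_obs, y_pred)
print(round(r2, 4))
```

A perfect fit gives a value of 1, while predicting the sample mean for every observation gives 0.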
Empirical studies highlight the importance of including a comprehensive set of independent variables in MLR models [34,35]. However, it is essential to consider the challenges associated with including many independent variables, such as multicollinearity, which can distort the results and hinder the interpretation of the estimated coefficients [36].
Furthermore, as discussed in the introduction, it is well-known in the context of applied sciences that, for predictive purposes, including irrelevant independent variables can reduce not only the model’s efficiency but also its interpretability [6,24,26]. In this regard, selecting the appropriate variables is crucial to avoid introducing unnecessary elements that do not contribute to the model’s predictive power [37,38].
Therefore, including an adequate number of independent variables in MLR models requires a careful analysis of the theoretical and empirical relationships between variables. It is essential to seek a balance between the inclusion of relevant variables and the consideration of the problems associated with multicollinearity [3], since it is unnecessary to include variables when one may already explain the behavior of the other [6].
Another relevant point when applying regression models concerns qualitative explanatory variables, such as gender, age range, or the classification of a given client, when these appear on the right-hand side of the regression model to be estimated [32].
In these cases, according to [6], assigning arbitrary numeric values to each category of the qualitative variable (a procedure known as arbitrary weighting) is an incorrect approach. The correct way to treat such variables is to resort to dummy, or binary, variables, which assume the values 0 or 1 and stratify the sample before being included in the model under analysis [32].
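The dummy coding described above can be sketched in plain Python. The helper `make_dummies` and the category labels are hypothetical; the first category (in sorted order) is treated as the reference and receives no column, which is a common convention assumed here. In practice, `pandas.get_dummies(..., drop_first=True)` produces the same kind of encoding.

```python
def make_dummies(values, prefix):
    """Encode a qualitative variable as 0/1 columns, dropping the reference category."""
    categories = sorted(set(values))
    # The first category is the reference, so one fewer column than categories is created.
    return {
        f"{prefix}_{category}": [1 if value == category else 0 for value in values]
        for category in categories[1:]
    }

# Illustrative qualitative variable (assumption): age range of five clients.
age_range = ["18-30", "31-50", "51+", "31-50", "18-30"]
dummies = make_dummies(age_range, "age")
for name, column in dummies.items():
    print(name, column)
```

Observations in the reference category are identified by zeros in every dummy column, which avoids imposing an arbitrary numeric scale on the categories.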
2.2. Normality of Residuals
The normality of the residuals is an essential assumption for validating the hypothesis tests of regression models, such as the p-values of the t and F tests [6]. According to [32], this assumption is frequently violated when estimating OLS regression models. However, normality is vital for defining the best functional form and determining confidence intervals for model prediction, with exceptions for sufficiently large samples.
In this context, the Shapiro–Francia statistical test, applied to the model’s error terms, allows for verifying the assumption of the normality of the residuals [32]. This test, proposed by Shapiro and Francia [39], can be applied to samples of size $5 \le n \le 5000$. The Shapiro–Francia test assumes the following hypotheses [6].
H0. The sample comes from a population with normal distribution.
H1. The sample does not come from a population with a normal distribution.
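The calculation of the Shapiro–Francia statistic ($W'$) is shown in Equation (3):
$$W' = \frac{\left[\sum_{i=1}^{n} m_i X_{(i)}\right]^2}{\sum_{i=1}^{n} m_i^2 \cdot \sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2} \qquad (3)$$
where $X_{(i)}$ are the statistics of the i-th ordered observation so that $X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$; $m_i$ is the approximate expected value of the i-th observation (Z-score) [6].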
For the diagnosis of normality, the critical values $W'_c$ should be considered, such that $P(W' < W'_c) = \alpha$ (considering a unilateral test on the left), with the p-values being established by a correspondence table found in [6]. The parameter $\alpha$ corresponds to the established level of statistical significance. For the null hypothesis $H_0$ to be rejected, $W'_{cal} < W'_c$; otherwise, $H_0$ is not rejected [32].
Another form of analysis is the comparison of the p-value (probability associated with $W'_{cal}$), obtained in the same correspondence table [6]. In this case, $H_0$ is rejected if the p-value $\le \alpha$. In short, if $H_0$ is rejected, the residuals do not adhere to normality.
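The Shapiro–Francia statistic can be sketched as the squared correlation between the ordered sample and approximate expected normal order statistics. The sketch below assumes Blom's approximation for those expected values, and it only computes the statistic: the p-value, which requires a correspondence table or an additional approximation, is deliberately left out. The function name and the simulated samples are hypothetical.

```python
from statistics import NormalDist
import numpy as np

def shapiro_francia_w(sample):
    """Shapiro-Francia W' statistic (values close to 1 suggest normality)."""
    x = np.sort(np.asarray(sample, dtype=float))   # ordered observations
    n = len(x)
    # Approximate expected normal order statistics (Blom's approximation, an assumption).
    m = np.array([NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)])
    numerator = np.sum(m * x) ** 2
    denominator = np.sum(m ** 2) * np.sum((x - x.mean()) ** 2)
    return numerator / denominator

# Simulated residuals (assumption): one normal sample and one strongly skewed sample.
rng = np.random.default_rng(0)
w_normal = shapiro_francia_w(rng.normal(size=200))
w_skewed = shapiro_francia_w(rng.exponential(size=200))
print(w_normal, w_skewed)
```

The skewed sample yields a visibly lower statistic than the normal one, which is the direction in which the left-tailed test rejects the null hypothesis.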
2.3. The Problem of Multicollinearity
Multicollinearity occurs when there are very high correlations between the model’s explanatory variables, which can be caused by the presence of variables with the same trend or the use of databases with few observations [32].
As forms of multicollinearity diagnosis, we highlight the Tolerance and Variance Inflation Factor (VIF) statistics, based on estimating auxiliary regressions [40], according to Equations (4) and (5):
$$\text{Tolerance} = 1 - R_k^2 \qquad (4)$$
$$\text{VIF} = \frac{1}{\text{Tolerance}} \qquad (5)$$
where $R_k^2$ is the adjustment coefficient of each of the $k$ estimated auxiliary regressions. According to [32], if Tolerance is too low (which implies a high VIF statistic), there is evidence of multicollinearity problems. In this case, the explanatory variable that acts as the dependent variable of the auxiliary regression shares a high percentage of variance with the other explanatory variables [6].
As a reference value for diagnosing multicollinearity problems, many authors adopt a VIF threshold of 10. However, Favero and Belfiore [32] warn that a VIF value equal to 4 already corresponds to an $R_k^2$ of 0.75 for a given auxiliary regression, a relatively high percentage of shared variance. If the VIF exceeds the established limits, it indicates the presence of multicollinearity in the proposed model. Therefore, the use of stepwise is recommended so that only statistically significant variables remain in the model, thus avoiding the problem of multicollinearity.
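The auxiliary-regression diagnosis can be sketched with NumPy: each explanatory variable is regressed on the remaining ones, and Tolerance and VIF follow from the resulting coefficient of adjustment. The function name `tolerance_and_vif` and the simulated data, in which the second variable is almost a copy of the first, are assumptions for illustration.

```python
import numpy as np

def tolerance_and_vif(X):
    """Return a list of (tolerance, vif) pairs, one per column of X."""
    n, k = X.shape
    results = []
    for j in range(k):
        target = X[:, j]
        # Auxiliary regression: column j explained by an intercept and the other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        tolerance = 1.0 - r2
        results.append((tolerance, 1.0 / tolerance))
    return results

# Simulated explanatory variables (assumption): x2 is nearly collinear with x1.
rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
diagnostics = tolerance_and_vif(np.column_stack([x1, x2, x3]))
for name, (tol, vif) in zip(["x1", "x2", "x3"], diagnostics):
    print(f"{name}: tolerance={tol:.3f}, VIF={vif:.1f}")
```

The same arithmetic links the reference values in the text: a VIF of 4 corresponds to a Tolerance of 0.25 and therefore to an auxiliary coefficient of adjustment of 0.75.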
2.4. Diagnosis of Heteroskedasticity
In addition to the assumptions and statistical tests presented above, the probability distributions of the random error terms, represented by $u_i$ in Equations (1) and (2), must all present the same variance; that is, the error terms must be homoskedastic [6].
As Favero and Belfiore [32] explain, heteroskedasticity can indicate a correlation between the error terms and the explanatory variables, causing problems with the hypothesis tests of the t statistics [41,42]. However, it does not necessarily affect the consistency of the parameter estimates, and it can be addressed with robust standard errors, such as Huber-White. These adjust the standard errors of the estimated coefficients so that statistical tests remain reliable even in the presence of heteroskedasticity.
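To make the contrast between conventional and Huber-White standard errors concrete, the sketch below computes both for the same OLS fit. It assumes the basic sandwich form (often labeled HC0) without small-sample corrections; the function name `ols_with_se` and the simulated data are hypothetical and do not reproduce the library's internals.

```python
import numpy as np

def ols_with_se(X, y, robust=False):
    """OLS coefficients with conventional or Huber-White (HC0) standard errors."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    if robust:
        # Sandwich estimator: (X'X)^-1 X' diag(u^2) X (X'X)^-1
        meat = X.T @ (X * resid[:, None] ** 2)
        cov = xtx_inv @ meat @ xtx_inv
    else:
        # Conventional estimator: a single error variance for all observations.
        cov = (resid @ resid / (n - k)) * xtx_inv
    return beta, np.sqrt(np.diag(cov))

# Simulated heteroskedastic data (assumption): the error spread grows with x.
rng = np.random.default_rng(7)
x = rng.uniform(1, 5, size=300)
y = 1.0 + 2.0 * x + x * rng.normal(size=300)
design = np.column_stack([np.ones_like(x), x])

beta_conv, se_conv = ols_with_se(design, y)
beta_rob, se_rob = ols_with_se(design, y, robust=True)
print(beta_conv, se_conv, se_rob)
```

The coefficient estimates are identical in both cases; only the standard errors, and therefore the t statistics and p-values, change.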
For the diagnosis of heteroskedasticity, the Breusch–Pagan/Cook–Weisberg test stands out. It is based on the Lagrange multiplier, with the null hypothesis $H_0$ corresponding to a constant variance of the error terms (homoskedasticity) [6]; the alternative hypothesis $H_1$ corresponds to a non-constant variance of the error terms, indicating that the terms $u_i$ are a function of one or more explanatory variables (heteroskedastic errors) [32]. Favero and Belfiore [6] recommend this test when the assumption of normality of the residuals holds.
To apply the Breusch–Pagan test, it is first necessary to obtain each standardized squared residual using Equation (6):
$$up_i = \frac{u_i^2}{\left(\sum_{i=1}^{n} u_i^2\right) / n} \qquad (6)$$
where $u$ represents the residual vector, and $n$ is the sample size. Next, the regression model in Equation (7) is estimated, and its regression sum of squares is divided by two to obtain the $\chi^2_{BP/CW}$ statistic [6]:
$$up_i = a + b \cdot \hat{Y}_i + \xi_i \qquad (7)$$
where $\hat{Y}$ represents the vector of predicted values of the dependent variable, and $\xi$ corresponds to the error term of this regression.
Under $H_0$, the calculated statistic $\chi^2_{BP/CW}$ follows a chi-square distribution with 1 degree of freedom, and it is compared with the critical value for a given significance level. In practice, if the error terms are homoskedastic, the squared residuals do not increase or decrease as $\hat{Y}$ increases [6].
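The steps of the test (standardized squared residuals, auxiliary regression on the predicted values, and half of the regression sum of squares) can be sketched as follows. The function name `breusch_pagan` and the simulated data are assumptions; the p-value uses the fact that the upper tail of a chi-square distribution with 1 degree of freedom at x equals erfc(sqrt(x / 2)).

```python
import math
import numpy as np

def breusch_pagan(y, y_hat):
    """Breusch-Pagan/Cook-Weisberg statistic and p-value (1 degree of freedom)."""
    u = y - y_hat
    n = len(u)
    up = u ** 2 / (np.sum(u ** 2) / n)               # standardized squared residuals
    aux = np.column_stack([np.ones(n), y_hat])       # auxiliary regression on predictions
    coef, *_ = np.linalg.lstsq(aux, up, rcond=None)
    fitted = aux @ coef
    ss_reg = np.sum((fitted - up.mean()) ** 2)       # regression sum of squares
    stat = ss_reg / 2.0
    p_value = math.erfc(math.sqrt(stat / 2.0))       # chi-square upper tail, 1 df
    return stat, p_value

# Simulated heteroskedastic data (assumption): error variance grows with x.
rng = np.random.default_rng(11)
x = rng.uniform(1, 10, size=500)
y = 1.0 + 2.0 * x + x * rng.normal(size=500)
design = np.column_stack([np.ones_like(x), x])
params, *_ = np.linalg.lstsq(design, y, rcond=None)

bp_stat, bp_p = breusch_pagan(y, design @ params)
print(bp_stat, bp_p)
```

A small p-value leads to rejecting the hypothesis of constant error variance, as expected for data generated this way.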
2.5. Nonlinear Regression Models
The definition of the best functional form of a regression model is an empirical question to be decided in favor of the best fit of the analyzed data [43]. According to [6], such a definition is based on the highest $R^2$, considering equal samples and the same number of parameters. Otherwise, one should choose the functional form whose model presents the highest adjusted $R^2$.
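In this context, Box and Cox [44] proposed a general regression model as the basis for all functional forms. From the linear regression model with a single variable, $Y_i = \beta_0 + \beta_1 X_i + u_i$, it is possible to obtain a transformed model, replacing $Y$ by $(Y^{\lambda} - 1)/\lambda$ and $X$ by $(X^{\theta} - 1)/\theta$, where $\lambda$ and $\theta$ are the parameters of the transformation [25,44,45]. Thus, it is possible to represent the model by Equation (8):
$$\frac{Y_i^{\lambda} - 1}{\lambda} = \beta_0 + \beta_1 \cdot \frac{X_i^{\theta} - 1}{\theta} + u_i \qquad (8)$$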
From Equation (8), values are assigned to $\lambda$ and $\theta$ that provide adherence to the primary functional forms, such as semilogarithmic to the right, semilogarithmic to the left, logarithmic, inverse, quadratic, and cubic [6]. When applied to an original variable, the Box–Cox transformation generates a new variable, which presents a new distribution [32]. When applying the Box–Cox transformation to a regression model with multiple independent variables $X_j$, each variable can be transformed individually using its own parameter $\theta_j$. The parameters $\lambda$ and $\theta_j$ are estimated through maximum likelihood before fitting the linear regression model to the transformed variables.
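As a minimal sketch of the transformation applied to a single variable, the code below writes the Box–Cox formula explicitly and compares it with `scipy.stats.boxcox`, which estimates the parameter by maximum likelihood when none is supplied. The helper name `box_cox` and the simulated positive variable are assumptions; the transformation is only defined for strictly positive values.

```python
import numpy as np
from scipy import stats

def box_cox(values, lam):
    """Box-Cox transformation; the limit for lam = 0 is the natural logarithm."""
    values = np.asarray(values, dtype=float)
    if lam == 0:
        return np.log(values)
    return (values ** lam - 1.0) / lam

# Simulated strictly positive, right-skewed variable (assumption).
rng = np.random.default_rng(1)
y = rng.lognormal(mean=0.0, sigma=0.5, size=300)

# With no lambda supplied, SciPy returns the transformed data and the ML estimate.
y_transformed, lam_hat = stats.boxcox(y)
print(lam_hat)
```

Applying the explicit formula with the estimated parameter reproduces the transformed variable returned by SciPy, which can then enter the regression in place of the original variable.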
According to [46], problems related to residuals in regression models may arise from specification failures in the model’s functional form. In this sense, the Box–Cox transformation can help define the functional form most adherent to the data, maximizing adherence to normality [6].
2.6. The Stepwise Procedure
The main approaches to stepwise regression are forward selection, backward elimination, and bidirectional elimination [47]. The forward selection procedure begins with an equation without variables. At each step, the technique adds the variable with the highest F-statistic or the lowest p-value until there is none left to be added to the model [8].
The backward elimination procedure starts from a “complete” model, with all the predictive input variables and the intercept. Then, at each step, the variable that contributes least to the model is deleted, one at a time, until no further exclusion can significantly improve the model’s performance [48].
Bidirectional elimination corresponds to a forward selection procedure but with the possibility of excluding a selected variable at each stage, as in backward elimination. This approach is commonly applied to stepwise regression, mainly when there are correlations between variables [48].
2.6.1. Forward Selection
The inclusion criterion (forward selection) determines whether to include an independent variable in the model based on its statistically significant contribution to the model’s fit.
This inclusion is assessed through a significance test, which compares the statistical improvement from adding the variable to the model with a predetermined significance level [27].
The general equation for the inclusion criterion is shown below (Equation (9)):
$$Y = \beta_0 + \beta_1 X_1 + u \qquad (9)$$
where $Y$ represents the dependent variable, $X_1$ represents the independent variable considered for inclusion in the model, and $u$ is the error term, which captures the variation not explained by the model [27].
The inclusion criterion seeks to determine whether adding the variable $X$ to the model results in a statistically significant improvement in the ability to explain the variability of the dependent variable. The procedure takes place by calculating test statistics such as the p-value. If the p-value is less than the predetermined significance level, the variable $X$ is considered statistically significant and is included in the model [49].
Thus, the inclusion criterion in the stepwise procedure allows the selection of independent variables that present a statistically significant relationship with the dependent variable, contributing to a better adjustment and explanation of the MLR model [50].
It is important to mention that, when new variables $X$ are added to the model, other variables already included, which were previously statistically significant, may lose their significance. This occurs because the inclusion of additional variables can alter the relationships between covariates, affecting their p-values. Therefore, it is crucial to thoroughly explore all independent variables and select a set in which all remain statistically significant in the presence of the others. This ensures that each variable meaningfully contributes to the predictive power of the model based on the sample data, avoiding redundancy [3,6,24,26].
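The forward inclusion rule can be sketched with NumPy and SciPy: at each step, the candidate with the lowest p-value enters the model, provided that this p-value is below the significance level. This is an illustrative sketch under stated assumptions, not the implementation of the proposed library; the function names, the default significance level of 0.05, and the simulated data are hypothetical.

```python
import numpy as np
from scipy import stats

def ols_pvalues(design, y):
    """Two-sided t-test p-values for each column of an OLS design matrix."""
    n, k = design.shape
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    cov = (resid @ resid / (n - k)) * np.linalg.inv(design.T @ design)
    t_stats = beta / np.sqrt(np.diag(cov))
    return 2.0 * stats.t.sf(np.abs(t_stats), df=n - k)

def forward_select(X, y, names, alpha=0.05):
    """Add, one at a time, the candidate with the lowest p-value below alpha."""
    n = len(y)
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        best_p, best_j = None, None
        for j in remaining:
            design = np.column_stack([np.ones(n), X[:, selected + [j]]])
            p = ols_pvalues(design, y)[-1]       # p-value of the candidate variable
            if best_p is None or p < best_p:
                best_p, best_j = p, j
        if best_p >= alpha:                      # no candidate is significant: stop
            break
        selected.append(best_j)
        remaining.remove(best_j)
    return [names[j] for j in selected]

# Simulated data (assumption): only x1 and x3 truly affect y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2.0 + 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=200)
selected = forward_select(X, y, ["x1", "x2", "x3", "x4"])
print(selected)
```

This pure forward sketch never revisits variables already included, so it does not handle the loss of significance described above; a bidirectional variant would re-check the included variables after each addition.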
2.6.2. Backward Elimination
The exclusion criterion (backward elimination) guides the decision of whether to remove an independent variable from the model based on statistical criteria. Under this criterion, the independent variables are initially included in the model and then removed one by one if their exclusion results in a statistically significant improvement in model fit. The measure of improvement is usually assessed through a significance test, using a predetermined significance level [27].
The general equation for the exclusion criterion is shown below (Equation (10)):
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + u \qquad (10)$$
In this equation, $Y$ represents the dependent variable; $X_1, X_2, \ldots, X_k$ represent the independent variables, and $u$ is the error term. The exclusion criterion removes independent variables one by one based on their statistically insignificant contribution to explaining the variability of the dependent variable [50].
Statistical tests, such as the p-value associated with each independent variable, guide the decision to remove a variable. If the p-value is greater than the predetermined significance level, the variable does not contribute in a statistically significant way to the model, and the exclusion criterion removes it [50].
Thus, the exclusion criterion in the stepwise procedure allows for refining the multiple linear regression model, eliminating the independent variables that are not statistically significant [27]. If an independent variable with a beta that is not statistically different from zero is not excluded, the model loses predictive power. This beta does not contribute to the predictive composition in the presence of other variables, ceteris paribus, rendering it statistically insignificant in explaining the dependent variable $Y$. Including such a variable can also distort the magnitude and sign of other beta parameters, without adding meaningful value to the predictive model [3,6,24,26].
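The exclusion rule can be sketched in the same spirit: starting from the complete model, the variable with the highest p-value is removed while that p-value exceeds the significance level. As before, this is a sketch under stated assumptions rather than the implementation of the proposed library; the function names, the default significance level, and the simulated data are hypothetical.

```python
import numpy as np
from scipy import stats

def ols_pvalues(design, y):
    """Two-sided t-test p-values for each column of an OLS design matrix."""
    n, k = design.shape
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    cov = (resid @ resid / (n - k)) * np.linalg.inv(design.T @ design)
    t_stats = beta / np.sqrt(np.diag(cov))
    return 2.0 * stats.t.sf(np.abs(t_stats), df=n - k)

def backward_eliminate(X, y, names, alpha=0.05):
    """Drop, one at a time, the variable with the highest p-value above alpha."""
    n = len(y)
    kept = list(range(X.shape[1]))
    while kept:
        design = np.column_stack([np.ones(n), X[:, kept]])
        pvalues = ols_pvalues(design, y)[1:]     # skip the intercept
        worst = int(np.argmax(pvalues))
        if pvalues[worst] <= alpha:              # every remaining variable is significant
            break
        kept.pop(worst)
    return [names[j] for j in kept]

# Simulated data (assumption): only x1 and x3 truly affect y.
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 4))
y = 2.0 + 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=200)
kept = backward_eliminate(X, y, ["x1", "x2", "x3", "x4"])
print(kept)
```

Because the model is refitted after every removal, the p-values of the remaining variables are always evaluated in the presence of the others, which is the condition emphasized in the text.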
2.6.3. Bidirectional Elimination
Bidirectional elimination is a flexible approach to selecting variables in an MLR model. This criterion can add or remove independent variables based on their statistical contribution to the model fit. A significance test usually evaluates this contribution and determines the statistical significance from a pre-established confidence level [51].
2.7. Applications of the Stepwise Procedure in Multiple Regressions
The academic literature presents several relevant applications of the stepwise procedure in multiple regression machine learning models, as shown in Table 1.
Based on the examples of the application of the stepwise procedure in machine learning models of multiple regressions, a wide range of studies are observed, covering several areas of research.
Its use aims to identify and select the most relevant and statistically significant variables, in the most diverse applications, for example, in the prediction of meteorological events, understanding of factors that affect prices and production in different sectors, such as real estate and agriculture, as well as providing valuable information in environmental fields, such as studies on soil, deforestation and air quality.
The approach also has applications in medicine, helping to estimate ages, predict disease and analyze the impact of health conditions on drug responses. Its versatility allows employment in socioeconomic and urban issues, such as impact and balance analyses in various contexts.
Notably, this literature review is not exhaustive since the stepwise procedure has an extensive and multifaceted scope. However, the various studies analyzed in this section allowed us to demonstrate its multidisciplinary applicability.
From predicting meteorological phenomena to analyzing socioeconomic impacts and evaluating health issues, the stepwise procedure has proven to be a versatile and valuable tool for the judicious selection of statistically significant variables in various research fields. In the study of sample data within applied sciences, it is essential to select only those variables that demonstrate statistical significance. A model constructed from sample data that includes betas that are not statistically significant, particularly for predictive purposes, cannot be considered the final model [3,6,24,26].
2.8. Computational Implementation of Stepwise
This section briefly describes implementations in some of the main languages used in the statistical analysis and data modeling.
The R language is widely known for its efficiency in statistical analyses and has a variety of packages that support the stepwise procedure [60]. The “MASS” package (v. 7.3-61) provides the “stepAIC()” function, which allows for the implementation of the stepwise forward, backward, and bidirectional methods. In addition, the “leaps”, “stepwise”, and “caret” packages also provide additional functionality for variable selection and statistical modeling.
The stepwise procedure is available through the PROC REG command for users of SAS (v. 9.4M8), a statistical tool widely used in industry and research. This command allows the automatic selection of variables using different input methods, such as the forward, backward, or stepwise method.
In MATLAB (v. R2023b), a programming platform commonly used in engineering and science, the “Statistics and Machine Learning Toolbox” package assists in implementing the stepwise procedure. This package provides functionality for adjusting regression models, including the stepwise method.
In addition to the languages mentioned above, other programming languages implement the stepwise procedure. Julia, for example, offers the “GLM” (v. 1.9.0) and “StatsModels.jl” (v. 0.6.35) packages that provide functionality for stepwise regression. For SPSS software users (v. 29.0.1.1), the “REGRESSION” command, and the appropriate options to specify the variable selection method, help to perform the stepwise procedure. In the Stata environment (v. 18), the “stepwise” command allows for the implementation of the stepwise procedure.
Other languages like Java, C++, and C# support the stepwise procedure through different packages and libraries. Java, for example, relies on the Weka package (v. 3.8), which provides methods for selecting variables, including stepwise, based on p-values as well as accuracy metrics. Weka allows for forward, backward, or mixed selection approaches. In the case of C++, the Mlpack library (v. 4.5.0) offers functionality for stepwise regression, primarily based on performance metrics rather than p-values, and does not implement a traditional stepwise method but allows for forward or backward approaches based on model performance. C# has the Accord.NET package (v. 3.8.0), which also focuses on accuracy metrics rather than p-values, providing features for variable selection using forward and backward methods.
3. Methodology
The proposed library aims to apply the stepwise procedure to perform a regression analysis. This procedure makes it possible to select a subset of the independent variables that best explain the variability of a dependent variable, in such a way that all the independent variables selected are statistically significant at a pre-established confidence level.
The technological approach used is the Python language, which we used to propose and implement the regression model, facilitating its integration into studies that involve the application of a specific regression model.
Figure 1 illustrates the methodological workflow of the stepwise library to validate and select the statistically significant variables.
After collecting the data, processing and structuring it using Extract, Transform, and Load (ETL) techniques is essential. Defining the variables that we want to predict and the variables that explain that prediction is critical to building a supervised machine learning model. In addition, it is essential to perform an exploratory analysis of the variables using unsupervised modeling to complement the evaluation process.
When one or more explanatory variables are qualitative, it is necessary to transform them into dummy variables, creating $n - 1$ dummies, where $n$ is the number of categories present in each explanatory variable. We then build a general regression model to evaluate the statistical importance of all the variables involved. If some variables are not statistically significant, a diagnosis of multicollinearity should be performed using the VIF and Tolerance statistics.
Then, the main contribution of this paper is applied: the stepwise procedure based on the statistical significance of the variables, implemented as a function in the Python programming language. Stepwise automatically excludes variables that are not statistically significant from the model.
The variable selection process is automated, allowing the model to select independent variables without manual intervention. In situations with many independent variables available, the stepwise method can reduce the dimensionality of the model, avoiding overfitting problems and improving the model’s generalization to new data. During this process, variables may be added and then removed, as their statistical significance can change in the presence of other variables, even if they were significant initially. The algorithm retains those variables that, together, are all statistically significant, providing the best fit with all relevant predictors [
3]. This is crucial for maintaining the model’s predictive capability [
6]. It is essential that all variables considered are statistically significant in explaining the behavior of the dependent variable alongside the other explanatory variables, thereby preventing the β parameters from being altered by the inclusion of a variable whose β is not statistically different from zero [3, 30]. It is important to note that the implemented package allows the user to choose the type of standard error. If the “error_type” argument is omitted, the stepwise function uses conventional standard errors by default. If “error_type” is set to “robust”, the algorithm uses Huber–White standard errors.
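The selection logic described above can be illustrated with a minimal sketch. It is not the implementation shipped in the proposed package: it performs backward elimination only (it never re-adds a variable), uses conventional standard errors, and approximates the t distribution by the standard normal to avoid extra dependencies. The helper names `ols_pvalues` and `backward_stepwise` and the toy data are hypothetical.

```python
import itertools
import math

import numpy as np


def ols_pvalues(X, y):
    """Fit OLS with an intercept and return two-sided p-values for the slopes.

    Assumption: p-values come from a normal approximation to the t
    distribution, which is only reasonable for fairly large samples.
    """
    n = X.shape[0]
    Xc = np.column_stack([np.ones(n), X])        # add the intercept column
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ beta
    dof = n - Xc.shape[1]
    sigma2 = resid @ resid / dof                 # residual variance
    cov = sigma2 * np.linalg.inv(Xc.T @ Xc)      # conventional (non-robust) covariance
    t_stats = beta / np.sqrt(np.diag(cov))
    p = np.array([math.erfc(abs(t) / math.sqrt(2.0)) for t in t_stats])
    return p[1:]                                 # drop the intercept


def backward_stepwise(X, y, names, pvalue_limit=0.05):
    """Drop the least significant predictor until every p-value passes the limit."""
    keep = list(range(X.shape[1]))
    while keep:
        p = ols_pvalues(X[:, keep], y)
        worst = int(np.argmax(p))
        if p[worst] <= pvalue_limit:
            break                                # all remaining predictors are significant
        keep.pop(worst)
    return [names[j] for j in keep]


# Toy data: a replicated 2^3 factorial design, so the three columns are orthogonal.
design = np.array(list(itertools.product([-1.0, 1.0], repeat=3)) * 5)
x1, x2, x3 = design.T
y = 3.0 * x1 - 2.0 * x2 + 0.1 * x1 * x2          # x3 plays no role in y
selected = backward_stepwise(design, y, ["x1", "x2", "x3"])
print(selected)
```

In this toy design the irrelevant column has a coefficient of essentially zero, so it is the first and only variable to be removed.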
After stepwise, a partial model is composed. The next step is to verify the adherence of the residuals to normality, using the Shapiro–Francia test, also implemented in the ‘statstests.tests’ package (v. 1.0.7) in Python and presented in this article.
If the residuals do not adhere to normality, the next analysis is the diagnosis of heteroscedasticity, performed by applying the Breusch–Pagan test. The presence of heteroscedasticity may indicate the omission of a relevant variable from the model.
Following the flow of Figure 1, when the residuals do not adhere to normality and the data present heteroscedasticity, an effective remedy is to apply the Box–Cox transformation to the dependent variable in order to maximize its adherence to normality.
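As an illustration of the transformation itself, the sketch below applies the usual Box–Cox formula, (y^λ − 1)/λ for λ ≠ 0 and ln(y) for λ = 0, and chooses λ by a simple grid search over the profile log-likelihood. The helper names, the grid, and the toy values are assumptions made for this example; in practice a library routine such as `scipy.stats.boxcox` estimates λ by maximum likelihood.

```python
import math


def box_cox(y, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    if any(v <= 0 for v in y):
        raise ValueError("Box-Cox requires strictly positive values")
    if abs(lam) < 1e-12:
        return [math.log(v) for v in y]          # limiting case lambda -> 0
    return [(v ** lam - 1.0) / lam for v in y]


def box_cox_loglik(y, lam):
    """Profile log-likelihood of lambda, assuming normal transformed data."""
    n = len(y)
    z = box_cox(y, lam)
    mean = sum(z) / n
    var = sum((v - mean) ** 2 for v in z) / n    # maximum-likelihood variance
    return (lam - 1.0) * sum(math.log(v) for v in y) - 0.5 * n * math.log(var)


def best_lambda(y, grid=None):
    """Return the grid value of lambda with the highest log-likelihood."""
    grid = grid or [k / 100.0 for k in range(-200, 201)]
    return max(grid, key=lambda lam: box_cox_loglik(y, lam))


# Toy values whose logarithms are symmetric, so lambda close to 0 is expected.
prices = [math.exp(3 + 0.5 * w) for w in (-2, -1, 0, 1, 2)]
lam = best_lambda(prices)
transformed = box_cox(prices, lam)
print(lam, transformed)
```

After the transformation, the regression is re-estimated with the transformed dependent variable, which is why the diagnostics have to be repeated.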
To confirm whether the nonlinear model performs better, the stepwise procedure and multicollinearity, normality, and heteroscedasticity tests should be performed again.
The entire process established in the methodological flow shown in Figure 1 and described in this section is part of a general procedure for constructing and analyzing a multiple regression model. In this process, stepwise is of great value because it ensures that the model used for predictive purposes contains only statistically significant variables.
To exemplify the flowchart, in the next section a real case study of real estate pricing will be carried out, which runs through all the steps described in this section.
4. Case Study
To exemplify the use of the proposed library for implementing the stepwise procedure in Python, an estimation of a multiple linear regression model for evaluating apartments in the city of São Paulo, Brazil, will be performed. We used real estate data from 200 apartments in three neighborhoods (Vila Nova, Brooklin, and Moema).
We evaluated four explanatory variables: apartment area, represented in the database by ‘area’; number of rooms, that is, the number of compartments of the property, represented by ‘rooms’; land area, represented by ‘land_area’; and the neighborhood where the property is located, a qualitative variable represented by ‘neighborhood’. The dependent variable is the price of the property, represented by ‘price’. Notably, among the explanatory variables, the apartment area, number of rooms, and land area are quantitative, while the neighborhood is a qualitative variable with three categories (neighborhoods).
Equation (11) represents the proposed MLR model, in which the observation index ranges from 1 to 200, since the database consists of 200 apartments.
We applied the Python script available in the Supplementary Material to perform all the analyses presented and discussed in this section.
Figure 2 presents the descriptive statistics of the quantitative variables of the model.
The descriptive statistics show that there are 200 observations for each variable; that is, there are no missing values in the sample. We also verified that the neighborhood variable is a polychotomous categorical variable with three categories.
Figure 3 shows the heatmap of the Pearson correlation matrix between the metric variables. The values on the main diagonal are all equal to 1 because they represent the correlation of a given variable with itself. A high correlation (practically equal to 1) between the variables ‘area’ and ‘land_area’ stands out.
Figure 4 consolidates the distributions of the metric variables on the main diagonal, the scatter plots, the correlation values (r), and their respective p-values (p), all lower than 0.01, that is, significant at a confidence level greater than 99%.
Next, the frequency table of the qualitative variable is analyzed, giving the number of apartments in each of the three neighborhoods: 72 apartments in Vilanova, 66 in Moema, and 62 in Brooklyn. Because the variable ‘neighborhood’ is qualitative, it is converted into dummy variables to avoid assigning arbitrary weights to its categories.
In the proposed case study, the Brooklyn neighborhood was used as the reference category because it is the first in alphabetical order; its behavior is captured by the intercept α. Notably, the choice of the reference category does not affect the final result of the analysis, since the β values are simply rearranged.
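The dummy coding with an alphabetical reference category can be sketched in a few lines of plain Python. The helper `make_dummies`, the ‘neighb’ prefix used for the column names, and the five toy observations are assumptions made for this illustration; with pandas, `pandas.get_dummies(..., drop_first=True)` produces an equivalent encoding.

```python
def make_dummies(values, prefix):
    """Return the reference category and n - 1 dummy columns.

    The first category in sorted (alphabetical) order is the reference
    and therefore gets no column of its own.
    """
    categories = sorted(set(values))
    reference, others = categories[0], categories[1:]
    dummies = {
        f"{prefix}_{cat}": [1 if v == cat else 0 for v in values]
        for cat in others
    }
    return reference, dummies


# Toy observations (not the 200 apartments of the case study).
neighborhood = ["Moema", "Brooklyn", "Vila Nova", "Brooklyn", "Moema"]
reference, dummies = make_dummies(neighborhood, prefix="neighb")
print(reference)
for name, column in dummies.items():
    print(name, column)
```

An observation from the reference neighborhood has zeros in every dummy column, which is why its effect is absorbed by the intercept.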
Estimating the multiple linear regression model with the dummy variables, we arrive at the results in Figure 5.
The model equation, after the dummies procedure, is represented in Equation (12). The ‘neighborhood’ variable will be summarized by ‘neighb’.
The value of R² is equal to 0.87; that is, the explanatory variables jointly explain 87% of the behavior of the dependent variable. However, we verified that the p-values of the variables ‘area’ and ‘land_area’ are, respectively, 0.109 and 0.124. These variables are therefore not statistically significant, at a confidence level of 95%, in the presence of the other variables. In this case, the multicollinearity diagnosis was performed using the Variance Inflation Factor (VIF) and Tolerance statistics, as shown in Figure 6.
Although the literature establishes no definitive cutoff for deciding whether there is multicollinearity between explanatory variables, VIF values greater than ten deserve attention. Here, the variables ‘area’ and ‘land_area’ present VIF values close to 4000, with Tolerance practically equal to 0, which corresponds to an R² practically equal to 1 when each of these variables is regressed on the other explanatory variables. This is a preliminary indication of multicollinearity in the proposed model.
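The two statistics are linked by Tolerance = 1 − R² and VIF = 1 / Tolerance, where R² comes from the auxiliary regression of one explanatory variable on all the others. The sketch below computes both with NumPy on synthetic data in which one column is almost a multiple of another; the function name and the data are assumptions made for this illustration and do not reproduce the case-study values.

```python
import numpy as np


def vif_and_tolerance(X):
    """Return (VIF, Tolerance) arrays with one entry per column of X."""
    n, k = X.shape
    vif, tol = np.empty(k), np.empty(k)
    for j in range(k):
        target = X[:, j]
        # Auxiliary regression: column j on an intercept plus the other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        tol[j] = 1.0 - r2          # Tolerance = 1 - R^2
        vif[j] = 1.0 / tol[j]      # VIF = 1 / Tolerance
    return vif, tol


# Synthetic data: 'land_area' is almost an exact multiple of 'area'.
area = np.arange(50.0, 150.0)
land_area = 1.1 * area + np.tile([0.5, -0.5], 50)
rooms = np.tile([1.0, 2.0, 3.0, 4.0], 25)
X = np.column_stack([area, rooms, land_area])

vif, tol = vif_and_tolerance(X)
for name, v, t in zip(["area", "rooms", "land_area"], vif, tol):
    print(f"{name}: VIF={v:.1f}, Tolerance={t:.5f}")
```

The two nearly collinear columns receive very large VIF values, while the unrelated column stays close to 1.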
To confirm the multicollinearity and establish the set of statistically significant variables, we arrive at the main contribution of this work: the implementation of the stepwise procedure based on the statistical significance of the variables as a function in the Python programming language. The stepwise function proposed in this paper is applied to ‘model_apartments’, the MLR model in Figure 5, with the argument ‘pvalue_limit’ set to the level of statistical significance (in this case, 0.05). The result is stored in ‘model_step_apartments’, the MLR model obtained after the stepwise procedure.
Figure 7 shows the new model obtained after applying the stepwise procedure.
Consistent with the preliminary analysis based on the VIF and Tolerance statistics, the variables ‘area’ and ‘land_area’ present multicollinearity and were excluded from the multiple regression model by the stepwise procedure, leaving only the variables ‘rooms’ and ‘neighborhood’ (with two categories, since ‘Brooklyn’ was the reference category).
The following analysis verifies the adherence of the residuals to normality using the Shapiro–Francia test, implemented by the ‘shapiro_francia’ function of the ‘statstests.tests’ package, also presented in this article. The test is applied to ‘model_step_apartments’, the MLR model obtained after the stepwise procedure, and its result is shown in Figure 8. A p-value of 0.00002 was obtained, lower than the established significance level of 0.05. Therefore, we rejected the null hypothesis of normality; that is, the residuals of the analyzed model do not adhere to normality.
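For readers who want to see what the test computes, the following standard-library sketch evaluates the Shapiro–Francia statistic W′ (the squared correlation between the ordered sample and approximate expected normal order statistics) and a p-value based on Royston's (1993) normalizing approximation. It is an independent illustration under those assumptions, not the code of the ‘statstests’ package; the function name and the sample data are hypothetical.

```python
import math
from statistics import NormalDist


def shapiro_francia_sketch(x):
    """Return (W', p-value) for the Shapiro-Francia normality test."""
    values = sorted(x)
    n = len(values)
    if not 5 <= n <= 5000:
        raise ValueError("approximation intended for 5 <= n <= 5000")
    std_normal = NormalDist()
    # Approximate expected normal order statistics (Blom plotting positions).
    scores = [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    mean_x = sum(values) / n
    sxx = sum((v - mean_x) ** 2 for v in values)
    smm = sum(q * q for q in scores)             # scores are symmetric around 0
    sxm = sum((v - mean_x) * q for v, q in zip(values, scores))
    w = sxm ** 2 / (sxx * smm)                   # squared correlation
    # Royston's normalizing transformation of log(1 - W').
    log_n = math.log(n)
    log_log_n = math.log(log_n)
    mu = -1.2725 + 1.0521 * (log_log_n - log_n)
    sigma = 1.0308 - 0.26758 * (log_log_n + 2.0 / log_n)
    z = (math.log(max(1.0 - w, 1e-300)) - mu) / sigma
    return w, 1.0 - std_normal.cdf(z)


# A strongly right-skewed sample versus a sample built from normal quantiles.
skewed = [2.0 ** k for k in range(20)]
near_normal = [NormalDist().inv_cdf((i - 0.5) / 20) for i in range(1, 21)]

w_skewed, p_skewed = shapiro_francia_sketch(skewed)
w_normal, p_normal = shapiro_francia_sketch(near_normal)
print(f"skewed:      W'={w_skewed:.3f}, p={p_skewed:.6f}")
print(f"near normal: W'={w_normal:.3f}, p={p_normal:.3f}")
```

A p-value below the chosen significance level leads to rejecting normality, as happens with the residuals of the model in Figure 7.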
Given the nonadherence to normality, another essential analysis is the diagnosis of heteroskedasticity using the Breusch–Pagan test, whose result is shown in Figure 9. The p-value was lower than the significance level of 0.05. Thus, there is heteroskedasticity in the data, which means that the variance of the residuals is related to one or more explanatory variables. This may indicate that relevant variables are omitted from the model.
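The idea behind the test can be sketched as follows: the squared residuals are regressed on the explanatory variables, and a large share of explained variation signals heteroskedasticity. The example uses Koenker's studentized form of the statistic (n · R², compared with a chi-square distribution with as many degrees of freedom as regressors), which is not necessarily the variant used in the script that accompanies the paper. The function names and the synthetic residuals are assumptions.

```python
import math

import numpy as np


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    t = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-t)                 # Q(1, t)
    else:
        a, q = 0.5, math.erfc(math.sqrt(t))      # Q(1/2, t)
    while a < df / 2.0:                          # Q(a + 1, t) = Q(a, t) + t^a e^-t / Gamma(a + 1)
        q += t ** a * math.exp(-t) / math.gamma(a + 1.0)
        a += 1.0
    return q


def breusch_pagan(resid, X):
    """Return the studentized (Koenker) LM statistic and its p-value."""
    n, k = X.shape
    Z = np.column_stack([np.ones(n), X])
    e2 = resid ** 2
    coef, *_ = np.linalg.lstsq(Z, e2, rcond=None)
    fitted = Z @ coef
    r2 = 1.0 - np.sum((e2 - fitted) ** 2) / np.sum((e2 - e2.mean()) ** 2)
    lm = max(float(n * r2), 0.0)                 # guard against tiny negative rounding
    return lm, chi2_sf(lm, k)


# Synthetic residuals: spread grows with x in one case and is unrelated to x in the other.
x = np.linspace(1.0, 10.0, 100)
X = x.reshape(-1, 1)
resid_het = x * np.tile([1.0, -1.0], 50)
resid_homo = np.tile([1.0, -2.0, 2.0, -1.0], 25)

lm_het, p_het = breusch_pagan(resid_het, X)
lm_homo, p_homo = breusch_pagan(resid_homo, X)
print(f"growing spread:   LM={lm_het:.2f}, p={p_het:.3g}")
print(f"unrelated spread: LM={lm_homo:.2f}, p={p_homo:.3g}")
```

A small p-value, as in the first case, leads to rejecting the hypothesis of constant residual variance.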
Considering that the residuals of the model in question do not adhere to normality and the data present heteroskedasticity, a solution is to apply the Box–Cox transformation to the dependent variable ‘price’, aiming to maximize its adherence to normality.
Figure 10 shows the p-values of the model variables after applying the Box–Cox transformation.
Considering the level of statistical significance of 5%, as in the previous model, we verified that the variables ‘area’ and ‘land_area’ were not statistically significant in the presence of the other variables. In addition, R² increases to 0.893.
Applying the stepwise procedure to ‘model_bc_apartments’, the model obtained after the Box–Cox transformation, we arrive at the result in Figure 11.
At the significance level established, the variables ‘area’ and ‘land_area’ were once again removed from the final model by the stepwise procedure.
Comparing the initial model with the model obtained after the Box–Cox transformation, both having undergone the stepwise procedure through the function proposed in this article, we arrive at Figure 12.
We observed that the R² of the Box–Cox model was higher than that of the initial model (0.8920 versus 0.8679), which reflects a gain in fit from the nonlinear model. Applying the Shapiro–Francia test once again, we arrive at the result in Figure 13.
The p-value was approximately 0.0584, higher than the established significance level of 0.05. Therefore, we did not reject the null hypothesis; that is, the residuals adhere to normality. Then, evaluating heteroskedasticity, we obtain the result of the Breusch–Pagan test (Figure 14).
A p-value of approximately 0.48 is observed, higher than the established significance level. Thus, heteroskedasticity is no longer present in the data, solely as a result of expressing the dependent variable in a nonlinear form. Given the above, we arrive at Equation (13), the final model proposed after the Box–Cox transformation and the application of the stepwise procedure proposed and implemented in this paper. The ‘neighborhood’ variable will be summarized by ‘neighb’.
5. Conclusions
The present research has achieved the proposed objectives, contributing significantly to data analyses and statistical modeling in Python. First, we presented the stepwise function based on the statistical significance of the variables in Python, detailed and exemplified, offering researchers, students, and professionals a practical and intuitive tool for carefully selecting variables in multiple regression models. This new tool differs from existing ones, which rely on accuracy metrics, as it is fundamentally based on the statistical significance of the variables. This approach is extremely important in the context of regression models within applied sciences, as failing to exclude variables whose parameters lack statistical significance can lead to a loss of predictive capability. Such inclusion can alter the magnitude and even the effect of other parameters in the predictive model.
Then, practical examples demonstrated how applying the proposed stepwise function can enhance the reliability of multiple regression models by removing non-significant variables, leading to more accurate and trustworthy model adjustments and predictions. In addition, we presented a comprehensive methodological framework to address common problems in data analyses, such as heteroskedasticity, multicollinearity, and nonadherence of residues to normality, providing a robust and user-friendly computational implementation.
The current case study of real estate pricing, considering linear and nonlinear multiple regression models through stepwise, allowed us to consolidate the concepts discussed and validate the effectiveness of the proposed approach. The results showed that the correct application of the stepwise procedure could lead to more accurate models and a better interpretation of the factors that influence real estate pricing.
In conclusion, this research fully achieved the proposed objectives, contributing significantly to advancing the understanding and application of statistical methods in Python. The results presented in this paper can be widely used and inspire new debates and contributions in data analysis and statistical modeling. The availability of the stepwise function based on the statistical significance of the variables in Python and the methodological framework will provide practitioners with a more informed and practical approach to handling complex data, driving significant advances in their respective research and application areas.
Despite the promising results achieved in this study, it is essential to recognize some limitations that may open space for future research. First, while the stepwise procedure based on the statistical significance of the variables is a valuable approach to variable selection, other techniques may perform better in some scenarios. The stepwise process can also be conducted using metrics of fit and accuracy, leaving it to the researcher to decide which method best suits their study. In addition, we focused the case study on real estate pricing, which may limit the generalization of the results to other application areas. Future research could explore the applicability of the proposed stepwise function in different contexts, addressing specific problems and testing the approach’s effectiveness in other modeling tasks.
Another aspect to consider is that we addressed heteroskedasticity problems, multicollinearity, and nonadherence of residues to normality in this study. Still, the complexity of these problems can vary widely in different data sets. Future research could explore more sophisticated and specific strategies to address these issues, considering their particularities in different scenarios.
Finally, although the methodological framework developed has demonstrated effectiveness in mitigating common problems in data analysis, it is possible to improve its performance and generality through more advanced approaches and adapt to different contexts. Future research could seek methodological improvements and the validation of the approach in other types of problems, contributing to a better understanding and application of statistical methods in Python in an even broader spectrum of research areas and challenges.