1. Introduction
Spatial econometrics, developed to deal with spatial correlation and spatial heterogeneity of data, has become a standard analytical tool for spatial data and has begun to enter the mainstream of econometrics. Spatial models have a long history in econometrics. Much progress has been made in the estimation of spatial models, please refer to the special works of Anselin (1988) [
1]; LeSage and Pace (2009) [
2]. Nowadays, the spatial autoregressive (SAR) model developed by Cliff and Ord in 1973 [
3] is the most studied and widely applied modeling method for dealing with spatial correlation. Additionally, the model can be widely applied in many fields including social networks (Ma et al. (2020) [
4]), real estate (Dubin (1999) [
5]; Osland (2010) [
6]), crime incidents (Kakamu et al. (2008) [
7]), sampled network data (Zhou et al. (2017) [
8]), artificial neural networks (Wang et al. (2019) [
9]), and geospatial data (Khalfi et al. (2021)). We can consider the spatial autoregressive model as an ordinary regression model that additively takes the spatial spillover effect of the dependent variable into account. Thus, this model can model both traditional covariates and network dependence simultaneously in a convenient manner.
However, most of the conventional spatial analyses were designed to address the problem of estimation or prediction based on continuous observations. In the case of discrete or binary variables, for instance, in pathological diagnosis, there are only two possible outcomes, positive (denoted as 1) or negative (denoted as 0). The logistic regressive model is a popular method to deal with discrete binary responses. As one of the most popular classification methods, the logistic regressive model has been studied extensively. Essentially, the model assumes that an individual’s class label is influenced by a set of predictors. In practical use, observational data can be taken from different places. In other words, the data generated can be cross-section data. The cross-section data involve several locations. Therefore, it is possible that spatial effect influences the model. In the presence of spatial effects, the usual logistic regression does not sufficiently model the data. Thus, the spatial logistic regression model will be better to model data that contains spatial effects [
10]. Theoretical economists and practical researchers are interested in the spatial logistic autoregressive model, which investigates how covariates affect the correlation response of spatial discrete values. The study of spatial logistic regression models, which use categorization technologies to model spatial data, is a relatively new area of spatial econometrics, and research in this area is still quite restricted. Calabrese and Elkink (2014) [
11] introduced the binary spatial autoregressive model for the first time. Hilwin Nisa et al. (2019) [
12] proposed the spatial logistic regression model which is obtained by the logistic regression model and spatial binary regression model.
High-dimensional spatial data appear frequently in many fields of social life and scientific research, such as biomedical imaging, X-ray tomography, finance, and geoscience. In recent years, a variety of regression methods have been proposed to model high-dimensional data in spatial statistics. For example, Piribauer et al. (2016) [
13] proposed a Bayesian variable selection procedure in a spatial autoregressive model. A penalized quasi-maximum likelihood method was put forth by Liu et al. (2018) [
14] for variable selection in the spatial autoregressive model. Model selection in spatial autoregressive models with varying coefficients was studied by Wei et al. (2019) [
15]. Variable selection for the spatial autoregressive models with a diverging number of parameters was considered by Xie et al. (2020) [
16]. Cai et al. (2020) [
17] considered variable selection and estimation for a high-dimensional spatial autoregressive model. Li et al. (2020) [
18] proposed a variable selection method for the partially linear varying coefficient spatial autoregressive model. More recently, Li et al. (2021) [
19] proposed a new variable selection method for a higher-order partially linear spatial autoregressive model with a diverging number of parameters. Liu et al. (2021) [
20] studied variable selection for the spatial autoregressive model with autoregressive disturbances. Song et al. (2021) [
21] proposed a new robust variable selection method with an exponential squared loss for the spatial autoregressive model.
The above methods mainly focus on the variable selection of continuous response variables based on the penalty regression technique. Penalized regression techniques shrink the insignificant coefficients to 0, which has attracted increasing attention to high-dimensional data analysis, such as least absolute shrinkage and selection operator (LASSO) (Tibshirani (1996) [
22]), smoothly clipped absolute deviation (SCAD) (Fan and Li (2001) [
23]), and minimax concave penalty (MCP) (Zhang (2010) [
24]) for mean regression. LASSO minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant, for which it tends to produce some coefficients that are exactly 0 and hence give interpretable models. However, LASSO has some bias in the estimation of the coefficients. SCAD attempts to mitigate this bias and produce nearly unbiased estimates for large coefficients while still retaining the continuous penalty for sparsity. MCP provides the convexity of the penalized loss in sparse regions to the greatest extent given certain thresholds for variable selection and unbiasedness. However, penalized spatial logistic autoregression has been rarely studied. For high-dimensional spatial data, there are several problems using spatial logistic autoregressive modeling such as endogeneity and including too many variables in the model. First, the spatial lag term in the model will make it endogenous. In the presence of endogeneity, the ordinary least squares (OLS) method can produce biased and inconsistent parameter estimates. Second, as the dimensionality of the variables increases, redundant variables will bring challenges to the estimation in the modeling process.
Furthermore, as the penalized spatial logistic regression does not account for any prior information, and then we can consider how to incorporate prior information into the modeling procedure. Statistical models with linear constraints on variables have gained widespread applications recently. These constraints on regression coefficients reflect the prior information and structure, which can help us to find the optimal parameters with the given information. To incorporate the prior information in the modeling process, we add linear constraints to the penalized spatial logistic autoregressive model. As far as we are aware, no previous research has investigated penalized spatial logistics autoregressive models with linear constraints. Thus, in this paper, we will study a penalized spatial logistic autoregressive model with linear constraints. For the spatial logistic regression model (4), we estimate
by solving the following optimization
where
is the likelihood function,
,
,
n is the sample size,
is the penalty function, and
, and
are determined concretely according to the experience and knowledge of practical problems. In this paper, our contribution is summarized as follows:
Propose a penalized spatial logistics autoregression with linear constraints. These constraints contain the prior information and structure, which can improve the robustness of the model and help us to find the optimal parameters. Thus, the model performs better in the variable selection and estimation under a high-dimensional data space.
Provide the formula for degrees of freedom, and then construct the model selection criteria to select the optimal tuning parameter.
Simulation results show that the performance of the proposed method is more explanatory and reasonable than penalized quasi-maximum likelihood without linear constraints, and an empirical application illustrates the usefulness of the methods in practical work. The effectiveness of the penalized quasi-maximum likelihood with linear constraints algorithm is demonstrated.
The following is how the paper is arranged. In
Section 2, we introduce the general form of the problem we study and present our penalized quasi-maximum likelihood without linear constraints algorithm. The formula for degrees of freedom using Stein’s unbiased risk estimation (SURE) lemma is derived in detail in
Section 3. Some Monte Carlos results on the performance of the proposed method are discussed in
Section 4.
Section 5 shows our method for analyzing real data sets.
Section 6 presents the conclusions.
4. Simulation Studies
4.1. Simulation Experiment Design
In the simulation experiment, we test the performance of the model through Monte Carlo simulation. The random sample is generated by model (2.1) combined with model (2.7), in which the covariate is considered when the -dimensional normal distribution with zero mean and covariance matrix , where . Therefore, X is an matrix. In the following simulation, we set the number of samples , and the number of insignificant covariates . In this paper, we show the cases of and in the simulation results.
In the spatial autoregressive model, the network autocorrelation coefficient is generated by the uniform distribution on the interval , where . Define the spatial weight matrix , where , and ⊗ is the Kronecker product. Denote as the m-dimensional column vector of 1. In the simulation experiment, we consider , and differet values of R, where . The regression coefficients are set to , where is generated from the 3-dimensional normal distribution with the mean vector of , and the covariance matrix is .
To simplify the calculation, the regression coefficients are set to
, where
is a
q-dimensional zero vector. The response variable is given by the following formula:
Then use the following formula to convert the response variable into a category variable:
Thus, we can obtain the binary classification of the response variable Y. For purpose of confirming the robustness of the model, the error terms s are independently generated from the following two distributions: (a) normal distribution , denoted as and (b) Gaussian mixture distribution , denoted as . is generated by the uniform distribution on the interval , where . In the simulation experiment, we set .
In order to verify that the effect of the model with linear constraints is better, we will compare it with the model without constraints. In one case, we can set the constraints as:
Obviously, we can find C, d, E, and f.
4.2. Evaluation Indicators
According to the above simulation experiments, we set the number of Monte Carlo repetitions at 1000. We define the following three indicators to evaluate the performance of variable selection in different settings.
Correct: the average number of coefficients of the true zeros correctly set to zero;
Incorrect: the average number of coefficients of the true nonzeros incorrectly set to zero;
ME: the mean error between the true value and the estimate, which is defined by:
4.3. Simulation Results
Table 1 and
Table 2 show the results of models with linear constraints without linear constraints, respectively. The constrained model is recorded as "Const”, and the unconstrained model is recorded as “Unconst”. The results in
Table 1 clearly show that the spatial logistic model performs better with linear constraints, which also confirms the effectiveness of our model. Most significantly, when
, the error of the models with constraints and without constraints are very different, which shows that when the spatial effect is strong the constraints can greatly improve the accuracy of model parameter estimations. Moreover, we find that the effect of the model tends to become bad by increasing the network autocorrelation coefficient
, which indicates the importance of the spatial effect. Moreover, by setting two types of errors, we observe that the model has good robustness. Additionally, in most cases, the model with an error term of
performs better than the one with an error term of
. Similarly, with the increase in sample size n, the incorrect rate of variable selection and the estimation error both decrease. This situation is in line with our prediction of the effect of the model. Moreover, it can be seen in
Table 1 that the mean error is minimum when
n = 120,
= 0.2, and
, which also confirms our analysis above.
In
Table 2, by increasing the dimension to
q = 10, we find that with the increase in the network autocorrelation coefficient
, the effect of the model becomes bad. Compared to the case of
q = 5, we find an increase in the mean error of estimation. After analysis, the possible reason is that the proportion of data size to dimension is bad, that is, the dimension of the sample is higher than the sample size. suppose the network complexity increases, it will have a certain negative impact on the effect of variable selection. For high-dimensional samples, the model results can be optimized by increasing the sample size. Simultaneously, according to the simulation results, compared with the unconstrained case, the spatial logistic model has stronger robustness, higher accuracy, and lower estimation error rate.
We compare the variable selection using the SCAD penalty and LASSO penalty under constraints. The simulation results are shown in
Table 3. We can clearly observe that in the case where
and
, the performance of the SCAD penalty is significantly better than the LASSO penalty, which is shown in higher correct selection rate and lower estimation error. As the network autocorrelation coefficient increases to
, the correct rate of variable selection of the SCAD penalty is higher than the LASSO penalty, while the incorrect rate of the LASSO penalty is lower than the SCAD penalty. Additionally, the estimation error between them is not much different. The reason may be that the tuning parameter
in the LASSO penalty is too large and the shrinkage strength is stronger.
5. Real Data Example
In this section, we provide a real-world example to demonstrate the performance of the variable selection procedure proposed in this paper for spatial logistic regression models with linear constraints.
5.1. The Land Area Utilization Data
Land area utilization is analyzed by the spatial logistic model. The data set is different types of land area data from 48 states in the United States from 1954 to 2012 (recorded every five years). The dependent variables are binary, with “1” denoting a low land utilization rate, which means that most of the land has not been properly developed, and “0”, denoting a high land utilization rate, which means that most of the land has been efficiently developed and exploited. As for the independent variables, there are eight properties, which are Cropland used for crops, Cropland used for pasture, Cropland idled, Grassland pasture and range, Forest-use land grazed, Land in rural transportation facilities, Land in urban areas, and Other idle land (shown in
Table 4).
5.2. Variable Selection and Estimation
For the above land area utilization data sets, we constructed a spatial logistic autoregressive model. We use the land utilization rate as a response variable, and take eight variables, Cropland used for crops, Cropland used for pasture, Cropland idled, Grassland pasture and range, Forest-use land grazed, Land in rural transportation facilities, Land in urban areas, and Other idle land, as the independent variables.
According to theoretical knowledge, the idle area of cropland and other idle land areas have a significant impact on the probability of classification results. Through the fitting of the model, the following two cases are considered: the first is the parameter estimation without constraints, and the other is the parameter estimation with linear constraints, as shown in
Table 5 and
Table 6.
According to the results of parameter estimation based on the spatial logistic model, it is found that the performance of variable selection is not obvious under unconstrained conditions. Among them, “Forest-use land grazed” and “Land in rural transportation facilities” have little impact on land use efficiency, which can be almost ignored. However, “Other idle land” has a great influence on the classification effect. Considering the greater relationship between land use area and idle land area, the performance of model selection is greatly improved with constraints. According to the table below, it is found that “Cropland idle” and “Other idle land” have a great impact on the classification results, and the fitting parameters of other attributes are very small and can be ignored.
6. Conclusions
In the paper, we obtain a spatial logistic model from the spatial autoregressive model (SAR) and logistic regression model. In order to improve the accuracy of the model, we take the prior information into account, and finally, propose a variable selection method with linear constraints for the spatial logistic model. According to the simulation results, by comparing the constrained and unconstrained variable selection models, we find that the performance of variable selection is more stable with the increase in sample size in the case of limited samples. When we fix the sample size at a constant value, the performance of the model tends to improve with the increase in network complexity. At the same time, the model has strong robustness to noise. In order to verify the superiority of the SCAD penalty, we compare the performance of the SCAD and LASSO penalties in the case of the linear constraint model and find that the SCAD penalty has a better effect on variable selection.
In practical application, most data show the characteristics of small sample size and high dimensions. For purpose of verifying the wide adaptability of the model, we verify that the proposed model can be effectively applied to the data set of high-dimensional and small samples through simulation experiments. However, we find that in this case, when the network complexity is very high, the performance effect of the model is not very good. The sample size being too small might be the cause of this.