1. Introduction
In recent decades, spatial data analysis based on spatial autoregressive (SAR) models has become an important and active research direction in econometrics and statistics. Inference theory and methods for linear SAR models and their extensions, including estimation, variable selection, and hypothesis testing, have been studied in depth and have produced a large body of literature, such as [1,2,3]; for further results on linear SAR models, see [4,5,6,7]. However, in spatial data analysis, the relationship between the response variable and the covariates is often nonlinear. To capture such relationships, several semiparametric SAR models have been proposed and thoroughly studied in recent years. For example, based on partially linear spatial autoregressive models, Su and Jin [
8] proposed a profile quasi-maximum likelihood estimation method and established the asymptotic properties of the resulting estimators. Using spline approximation and instrumental variable estimation, Du et al. [9] developed an estimation method for partially linear additive spatial autoregressive models and derived the asymptotic properties of the resulting estimators. For partially linear single-index spatial autoregressive models, Cheng and Chen [10] developed an estimation method and established consistency and asymptotic normality of the estimators under some mild assumptions. Other related results on semiparametric SAR models can be found in [11,12]. Previous research on spatial autoregressive models has mainly assumed homoscedasticity, that is, a constant error variance. However, heteroscedasticity is common in spatial data, and inference methods that assume homoscedasticity may then lead to erroneous conclusions, as shown by Lin and Lee [13]. It is therefore necessary to study heterogeneous spatial autoregressive models, and in recent years many researchers have studied SAR models with heteroscedastic errors. For example, Dai et al. [14] studied Bayesian local influence analysis for SAR models with heteroscedasticity. However, the existing literature on heterogeneous spatial autoregressive models assumes that the variance term is fixed and does not model it by regression, as is done for the mean. In many application fields, such as econometrics, this treatment may not be suitable for data that exhibit heteroscedasticity. Therefore, we propose a new model, called the heterogeneous semiparametric spatial autoregressive model, in which the variance parameters are modeled as a function of covariates.
In addition, many estimation methods for SAR models have recently been developed from both frequentist and Bayesian perspectives. In particular, with the rapid development of computing technology in the era of big data, Bayesian analysis of SAR models and other statistical models has received increasing attention, and many related results have appeared in recent years. For example, Tang and Duan [15] studied a semiparametric Bayesian method for generalized partially linear mixed models with longitudinal data. Using spline approximation, Xu and Zhang [16] introduced a Bayesian method for partially linear models with heteroscedasticity based on a variance modeling technique. Ju et al. [17] proposed a new spatial dynamic panel data model, assuming that the response variables and random effects follow multivariate skew-normal distributions, and developed a Bayesian local influence analysis method to simultaneously evaluate the impact of small perturbations to the data, priors, and sampling distributions. Pfarrhofer and Piribauer [18] studied Bayesian variable selection for high-dimensional spatial autoregressive models based on two shrinkage priors. Wang and Tang [19] developed Bayesian inference for a quantile regression model with nonignorable missing covariates. To capture both linear and nonlinear relationships between explanatory variables and responses in spatially correlated data, Chen and Chen [20] developed a sampling-based Bayesian method for partially linear single-index spatial autoregressive models, which uses an efficient MCMC approach and explores the joint posterior distribution with a Gibbs sampler. For longitudinal data, Zhang et al. [21] proposed semiparametric mixed-effects double regression models based on spline approximation, in which the mean and variance of the mixed-effects model are jointly modeled as functions of covariates. To our knowledge, there is little work on semiparametric Bayesian methods for heterogeneous spatial autoregressive models because of their complex spatial correlation structures. Therefore, this paper develops a Bayesian method for heterogeneous semiparametric spatial autoregressive models based on variance modeling, using a hybrid algorithm that combines the Gibbs sampler with the Metropolis–Hastings algorithm.
The outline of the paper is as follows. A new heterogeneous semiparametric spatial autoregressive (SSAR) model is introduced in Section 2. In Section 3, we derive the full conditional distributions required by the sampling-based method and develop a Bayesian estimation procedure that uses a Gibbs sampler and the Metropolis–Hastings algorithm. Section 4 presents simulation studies that illustrate the proposed methodology. As an application, Section 5 analyzes the Boston house price data with the proposed method. A brief conclusion and discussion are given in Section 6.
2. Heterogeneous Semiparametric Spatial Autoregressive Models
As is well known, the classical semiparametric spatial autoregressive model has the form

$$Y = \rho W Y + X\beta + g(U) + \varepsilon,$$

where $Y = (y_1, \dots, y_n)^T$ is an $n$-dimensional response variable, $\rho$ is an unknown spatial lag parameter that reflects spatial autocorrelation between neighbors, and $W$ is a known spatial weight matrix with zero diagonal elements. In the mean model, $X$ is an $n \times p$ explanatory variable matrix whose $i$th row is $x_i^T$, and $\beta$ is a $p$-dimensional vector of unknown regression coefficients to be estimated; moreover, $g(\cdot)$ is an unknown smooth function in the mean model, which needs to be estimated; $U = (u_1, \dots, u_n)^T$ is an $n$-dimensional vector whose $i$th element $u_i$ is a univariate observed covariate, with $g(U) = (g(u_1), \dots, g(u_n))^T$; and $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)^T$ is an $n$-dimensional vector of independent and identically distributed regression disturbances with zero mean and finite variance $\sigma^2$.

In addition, following Xu and Zhang [16], this paper considers heterogeneity of the variance in the model and assumes that the variance parameters depend on explanatory variables; thus, we establish a regression model for the variance parameters, namely

$$\sigma_i^2 = h(z_i^T \gamma), \quad i = 1, \dots, n,$$

where $z_i$ is an explanatory variable vector related to the variance of $y_i$ and $\gamma$ is a $q$-dimensional vector of unknown regression coefficients to be estimated in the variance model. Some elements in $z_i$ may coincide with some elements in $x_i$. In addition, for the identifiability of the model and because the variance must be positive, $h(\cdot)$ is a known monotonic positive function. For example, the exponential function is often used to model the variance. The heterogeneous SSAR model considered in this paper is therefore

$$\begin{cases} Y = \rho W Y + X\beta + g(U) + \varepsilon, \\ \mathrm{Var}(\varepsilon_i) = \sigma_i^2 = h(z_i^T \gamma), \quad i = 1, \dots, n. \end{cases}$$
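To make the data-generating mechanism of the heterogeneous SSAR model concrete, the following minimal sketch draws one dataset from its reduced form, in which the response equals the inverse of (I minus the spatial parameter times W) applied to the sum of the linear part, the nonparametric part, and the error. Several choices here are assumptions for illustration only and are not taken from the paper: the errors are drawn as normal, the variance function is the exponential function, the weight matrix is a small ring-shaped neighbourhood, and all parameter values and the function name `simulate_hssar` are hypothetical.

```python
import numpy as np

def simulate_hssar(W, X, Z, u, rho, beta, gamma, g, rng):
    """Draw Y from Y = rho*W*Y + X*beta + g(u) + eps with Var(eps_i) = exp(z_i' gamma)."""
    n = W.shape[0]
    sigma2 = np.exp(Z @ gamma)                # assumed exponential variance function
    eps = rng.normal(0.0, np.sqrt(sigma2))    # assumed normal, independent, heteroscedastic
    rhs = X @ beta + g(u) + eps
    # Reduced form: solve (I - rho*W) Y = X*beta + g(u) + eps
    Y = np.linalg.solve(np.eye(n) - rho * W, rhs)
    return Y, sigma2

rng = np.random.default_rng(0)
n = 6
# Hypothetical ring neighbourhood: each unit's two neighbours get weight 1/2
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = 0.5
    W[i, (i + 1) % n] = 0.5

X = rng.normal(size=(n, 2))                   # covariates of the mean model
Z = rng.normal(size=(n, 2))                   # covariates of the variance model
u = rng.uniform(0.0, 1.0, size=n)             # covariate of the nonparametric part
beta = np.array([1.0, -1.0])                  # hypothetical true values
gamma = np.array([0.5, 0.5])
Y, sigma2 = simulate_hssar(W, X, Z, u, 0.5, beta, gamma,
                           lambda t: np.sin(2 * np.pi * t), rng)
print(Y.shape, sigma2.min() > 0)
```

Solving the linear system instead of forming the matrix inverse explicitly is a numerical convenience; it does not change the model.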
4. Simulation Study
This section conducts a simulation study to evaluate the performance of the proposed Bayesian method under different sample sizes, spatial parameter values, and prior specifications. Following Lee [24] and Xie et al. [7], let $W = I_R \otimes B_m$ be the weight matrix, in which ⊗ represents the Kronecker product, $B_m = (l_m l_m^T - I_m)/(m-1)$, and $l_m$ is an $m$-dimensional vector with all elements equal to 1. In order to investigate whether the proposed Bayesian estimation method is sensitive to the choice of prior distributions, we consider three sets of hyperparameter values in the prior distributions of the unknown parameters
:
Type I: , , , This setting can be regarded as having good prior information.
Type II: , , , This setting can be regarded as having no prior information.
Type III: , , , This setting can be regarded as having inaccurate prior information.
In this section, R is set to 25, 50, and 75 and m is set to 4, so that n = Rm is 100, 200, and 300, respectively. Furthermore, we generate X and Z, respectively, from the multivariate normal distribution with zero mean vector and covariance matrix where . Moreover, to reflect different degrees of spatial dependence among the response variables, spatial parameters are selected to represent different spatial dependencies; , where follows a uniform distribution , and the structure of the variance model is with .
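The block-diagonal weight matrix used in this design can be sketched in a few lines. The sketch assumes the common construction in which W is the Kronecker product of an R-dimensional identity matrix with an m × m block whose off-diagonal entries all equal 1/(m − 1), so that each of the R groups contains m mutually neighbouring units; the function name `block_weight_matrix` is hypothetical.

```python
import numpy as np

def block_weight_matrix(R, m):
    """Return W = I_R kron B_m with B_m = (l_m l_m^T - I_m) / (m - 1)."""
    ones = np.ones((m, 1))                        # l_m, a column vector of ones
    B = (ones @ ones.T - np.eye(m)) / (m - 1)     # zero diagonal, rows sum to 1
    return np.kron(np.eye(R), B)

for R in (25, 50, 75):
    W = block_weight_matrix(R, 4)
    # Report the dimension n = R*m and check row-normalization and the zero diagonal
    print(W.shape, np.allclose(W.sum(axis=1), 1.0), np.allclose(np.diag(W), 0.0))
```

With m = 4, the three values of R give matrices of dimension 100, 200, and 300, matching the sample sizes considered in this section.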
Based on the parameter settings and generated datasets described above, we use the hybrid MCMC algorithm with 100 replications to evaluate the Bayesian estimates of the unknown parameters under different sample sizes. It is important to check whether the MCMC sampler has converged. Therefore, the estimated potential scale reduction (EPSR) value is used to diagnose convergence of the MCMC algorithm for each dataset [25]. In all the runs considered, the EPSR values are very close to 1 and less than 1.1 after 3000 iterations. Therefore, after discarding the first 3000 burn-in iterations, we collect the observations from the following
for statistical inference. In addition, to evaluate the performance of the nonparametric function estimation, the square root of the average squared errors is used here as the evaluation criterion.
The simulation results are listed in Table 1, Table 2, Table 3 and Table 4. Moreover, in order to directly examine the accuracy of the nonparametric function estimate, we plot the true function and its estimated curve under different cases. To save space, we only show some of the nonparametric estimation curves with different spatial parameters in Figure 1, Figure 2 and Figure 3. These figures depict the true sine curve and its estimated curve based on B-spline approximation. All estimated curves are close to the true curve, which indicates that the B-spline estimation of the nonparametric function in the mean model performs well.
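The square root of average squared errors used to assess the nonparametric fit can be computed as in the following minimal sketch. The evaluation grid, the sine-type true function, and the toy "estimate" are hypothetical and serve only to show the calculation; the paper's own grid points and true function are not reproduced here.

```python
import numpy as np

def rase(g_hat, g_true):
    """Square root of the average squared error between estimated and true function values."""
    g_hat = np.asarray(g_hat, dtype=float)
    g_true = np.asarray(g_true, dtype=float)
    return float(np.sqrt(np.mean((g_hat - g_true) ** 2)))

grid = np.linspace(0.0, 1.0, 101)      # hypothetical evaluation grid
g_true = np.sin(2 * np.pi * grid)      # hypothetical sine-type true function
g_hat = g_true + 0.1                   # toy estimate, shifted by a constant
print(rase(g_hat, g_true))             # approximately 0.1 for this constant shift
```

A smaller value means that the estimated curve is closer to the true curve on the grid.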
In Table 1, Table 2 and Table 3, “Bias” denotes the absolute difference between the true value and the mean of the Bayesian estimates over 100 replications, “SD” denotes the standard deviation of the Bayesian estimates, and “RMS” is the root mean square error between the estimated and true values over 100 replications. From Table 1, Table 2 and Table 3, we can see that (i) the Bias, RMS and SD values show that the Bayesian estimates are quite accurate regardless of the prior information used, indicating that the proposed method is not sensitive to the prior inputs; (ii) the Bayesian estimates improve as the sample size increases; (iii) the Bayesian estimation results obtained under different spatial parameters are similar; and (iv) in the same setting, the RMS and SD values of the mean parameters are smaller than those of the variance parameters, which is consistent with the fact that lower-order moments are easier to estimate than higher-order moments. Furthermore, Figure 1, Figure 2 and Figure 3 show that, under the considered parameter settings, the estimated function curve is very close to the corresponding true curve, which is consistent with the results in Table 4. In summary, the simulation results indicate that the proposed Bayesian estimation method is effective for heterogeneous SSAR models.
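The three summary measures reported in the tables can be computed from the replicated estimates as in the sketch below. It assumes that the estimates are stored as an array with one row per replication and one column per parameter, and that SD is the sample standard deviation with divisor (number of replications − 1); both are assumptions made for illustration, and the simulated "estimates" are hypothetical.

```python
import numpy as np

def summarize(estimates, truth):
    """Bias, SD and RMS of replicated estimates, computed column by column."""
    est = np.asarray(estimates, dtype=float)     # shape: (replications, parameters)
    truth = np.asarray(truth, dtype=float)
    bias = np.abs(est.mean(axis=0) - truth)      # |mean of estimates - true value|
    sd = est.std(axis=0, ddof=1)                 # sample standard deviation (assumed)
    rms = np.sqrt(np.mean((est - truth) ** 2, axis=0))
    return bias, sd, rms

rng = np.random.default_rng(1)
truth = np.array([1.0, -0.5])                    # hypothetical true parameter values
estimates = truth + rng.normal(scale=0.1, size=(100, 2))   # toy replicated estimates
bias, sd, rms = summarize(estimates, truth)
print(bias, sd, rms)
```

The three measures are linked: the squared RMS equals the squared Bias plus the variance of the estimates computed with divisor equal to the number of replications, so RMS is never smaller than Bias.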
5. Real Data Analysis
The Boston housing price data are a commonly used example, and many authors have analyzed them in depth with different statistical models, such as [26,27]. This section uses the Bayesian estimation method proposed in this paper to analyze these data. The dataset can be obtained from R’s spdep library and includes 14 variables and 506 observations. A detailed explanation of the variables is presented in Table 5.
In addition, following the variable selection results of Xie et al. [7], we take MEDV as the response variable of the model (denoted by Y) and the important explanatory variables selected by Xie et al. [7] as the X-variables: CRIM, ZN, NOX, RM, DIS, RAD, TAX, PTRATIO, and B (denoted by $X_1, \dots, X_9$, respectively). In order to facilitate data modeling and analysis, all variables were centered and the index variable was set to
.
In addition, as in [27,28], the Euclidean distances calculated from longitude and latitude are used to generate the spatial weight matrix
, where
is the Euclidean distance, and
takes the value 0.05 as the threshold distance. Thus, the spatial weight matrix contains 19.1% non-zero elements. We then consider the following heterogeneous SSAR model:
where three explanatory variables
are selected in the variance model. Thus, the hybrid algorithm proposed earlier is applied to obtain Bayesian estimates of
’s,
’s, and
’s, in which a B-spline with
and noninformative prior information are used. To check the convergence of the algorithm, Figure 4 plots the EPSR values of all unknown parameters against the number of iterations. The EPSR values of all unknown parameters fall below 1.1 after approximately 3000 iterations, indicating that the algorithm has converged by then. The Bayesian estimates (EST) of
’s,
’s and
’s, together with their standard deviation estimates (SD) and 95% credible intervals (CI), are calculated, and the results are given in Table 6.
Figure 5 displays the Bayesian estimate of the nonparametric function, which confirms a marked nonlinear relationship between housing prices and the variable u. Several useful conclusions can be drawn from Table 6, and they are broadly consistent with the findings of other authors. For example, the regression coefficient of CRIM in the mean model is estimated to be negative, indicating that housing prices decrease as the per capita crime rate in urban areas increases. The estimated coefficient of RM in the mean model is 0.3899, indicating that housing prices increase with the average number of rooms per dwelling. The estimated coefficient of DIS in the mean model is negative, indicating that the greater the weighted distance to five Boston employment centres, the lower the housing price. The regression coefficient of PTRATIO in the mean model is estimated to be negative, indicating that housing prices decrease as the pupil–teacher ratio by town increases. The regression coefficient of DIS in the variance model is estimated to be negative, indicating that the fluctuation of housing prices decreases as the weighted distance to five Boston employment centres increases. In addition, based on the estimated coefficients of the variance model, we can obtain the estimated variances and present the corresponding scatter plot in Figure 6, which indicates that heteroscedasticity modeling is reasonable for this dataset.
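The EPSR convergence check used in this section and in the simulation study can be sketched as follows. The sketch assumes that several parallel chains are run from different starting values and computes the potential scale reduction for a single scalar parameter from the between-chain and within-chain variances; the function name `epsr` and the simulated chains are hypothetical, and the paper's own implementation may differ in detail.

```python
import numpy as np

def epsr(chains):
    """Estimated potential scale reduction for one scalar parameter.

    chains: array of shape (number of chains, draws per chain).
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    between = n * chains.mean(axis=1).var(ddof=1)    # between-chain variance B
    within = chains.var(axis=1, ddof=1).mean()       # within-chain variance W
    pooled = (n - 1) / n * within + between / n      # pooled variance estimate
    return float(np.sqrt(pooled / within))

rng = np.random.default_rng(2)
mixed = rng.normal(size=(3, 2000))                   # three chains with the same target
stuck = mixed + np.array([[0.0], [5.0], [10.0]])     # chains centred at different values
print(epsr(mixed), epsr(stuck))
```

Values below 1.1 for every parameter are taken here, as in the text, as evidence of convergence; chains that have not mixed give values well above that threshold.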