Article

Evaluating Eigenvector Spatial Filter Corrections for Omitted Georeferenced Variables

School of Economic, Political and Policy Sciences, University of Texas at Dallas, Richardson, TX 75080, USA
*
Author to whom correspondence should be addressed.
Daniel A. Griffith is an Ashbel Smith Professor.
Econometrics 2016, 4(2), 29; https://doi.org/10.3390/econometrics4020029
Submission received: 23 December 2015 / Revised: 12 May 2016 / Accepted: 30 May 2016 / Published: 21 June 2016
(This article belongs to the Special Issue Recent Developments of Specification Testing)

Abstract

The Ramsey regression equation specification error test (RESET) furnishes a diagnostic for omitted variables in a linear regression model specification (i.e., the null hypothesis is no omitted variables). Integer powers of fitted values from a regression analysis are introduced as additional covariates in a second regression analysis. The former regression model can be considered restricted, whereas the latter model can be considered unrestricted; this first model is nested within this second model. A RESET significance test is conducted with an F-test using the error sums of squares and the degrees of freedom for the two models. For georeferenced data, eigenvectors can be extracted from a modified spatial weights matrix, and included in a linear regression model specification to account for the presence of nonzero spatial autocorrelation. The intuition underlying this methodology is that these synthetic variates function as surrogates for omitted variables. Accordingly, a restricted regression model without eigenvectors should indicate an omitted variables problem, whereas an unrestricted regression model with eigenvectors should result in a failure to reject the RESET null hypothesis. This paper furnishes eleven empirical examples, covering a wide range of spatial attribute data types, that illustrate the effectiveness of eigenvector spatial filtering in addressing the omitted variables problem for georeferenced data as measured by the RESET.

1. Introduction

A practitioner spends considerable time contemplating which covariates to include in a descriptive regression equation, as well as the functional forms they should have. A serious problem in regression analysis is misspecification of a descriptive equation by failing to include all relevant covariates in it: the omitted variables problem. One result of such omissions is omitted-variable bias (OVB), which arises when parameter estimates for the covariates included in a descriptive equation are over- or under-estimated because estimation attempts to compensate for the omitted variables. In part, this outcome arises from multicollinearity; in part, it arises from a biased error variance estimate (i.e., covariates being removed from a specification because they are deemed insignificant when they are significant). A serious linear regression consequence of OVB for ordinary least squares (OLS) estimation is biased and inconsistent parameter estimates. OVB also affects non-linear regression.
The Ramsey (1969) [1] regression equation specification error test (RESET) furnishes a tool to at least partially assess OVB. Technically, it is not about omitted variables, but rather it is about functional form (e.g., Wooldridge 2013 ([2], Chapter 9)). It addresses the question asking whether or not non-linear combinations of fitted values help explain a response variable. Its supporting logic contends that non-linear combinations (e.g., exponential powers and cross-products) of covariates that correlate with a response variable signify a mis-specified equation. Consequently, the RESET specifically tests functional form, but often with inferences drawn about omitted variables. Shukur and Mantalos (2004) [3] comment that the RESET has good statistical power with increasing misspecification, and as the RESET proxy variate more closely approximates omitted variables. Of note is that the only way to truly assess OVB is to have the omitted variables to assess, which is not practical.
Studies (e.g., Brasington and Hite 2005 [4], Pace and LeSage 2010 [5]) show that spatial models accommodating spatial dependence are less influenced by OVB, especially when a true data generating process contains a spatial dependence component. Comparisons of model specifications between non-spatial and/or spatial models already appear in the literature. LeSage and Parent (2007) [6] investigate OVB with different model specifications, including ones for non-spatial and spatial regression, using a Bayesian model averaging technique. LeSage and Fischer (2008) [7] and Piribauer and Fischer (2015) [8] extend this approach for model uncertainty in spatial growth modeling. Piribauer (2016) [9] further extends it using stochastic search variable selection priors to address OVB as well as over-parameterization.
The purpose of this paper is to demonstrate how eigenvector spatial filtering (ESF) impacts OVB as measured by the RESET. As a popular alternative approach for spatial regression model specification (Griffith 2003 [10], Pace, LeSage, and Zhu 2013 [11], Chun and Griffith 2014 [12]), ESF offers the potential to alleviate OVB by including spatial dependence components.

2. The RESET for a Linear Regression Specification

Ramsey (1969) [1] formulated his test for the case of linear regression. His test begins with the conditional expectation
$$E(Y \mid X) = X\beta \tag{1}$$
where Y is an n-by-1 vector of response values, hat (the diacritical mark) denotes a fitted value, E denotes the expectation operator, X is an n-by-(p + 1) matrix containing p covariates (p must be at least 1 here), n is the number of observations, and β is a (p + 1)-by-1 vector of regression coefficients. If some n-by-q matrix of covariates Z is incorrectly omitted from this regression equation, in the case where X and Z are non-stochastic, then
$$E(\hat{\gamma}) = (X^T X)^{-1} X^T (X\beta + Z\theta) = \beta + \text{bias} \tag{2}$$
where superscript T denotes the matrix transpose operation, θ denotes regression coefficients for the covariates Z, and γ denotes the full set of regression coefficients. If $X^T Z = 0$, which is highly unlikely in practice, then no OVB is present, emphasizing the relationship between OVB and multicollinearity.
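The bias term in Equation (2) can be written out explicitly. The following short derivation is a sketch that assumes the OLS estimator is computed from X alone, that the true model is $Y = X\beta + Z\theta + \epsilon$, and that $E(\epsilon) = 0$; the symbol $\epsilon$ for this error vector is introduced here for illustration.

```latex
\begin{aligned}
\hat{\gamma} &= (X^T X)^{-1} X^T Y
             = (X^T X)^{-1} X^T (X\beta + Z\theta + \epsilon) \\
             &= \beta + (X^T X)^{-1} X^T Z\,\theta + (X^T X)^{-1} X^T \epsilon, \\
E(\hat{\gamma}) &= \beta + \underbrace{(X^T X)^{-1} X^T Z\,\theta}_{\text{bias}} .
\end{aligned}
```

Under these assumptions the bias vanishes when $X^T Z = 0$ or when $\theta = 0$, which is the orthogonality condition noted in the text.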
If the covariate matrix in Equation (2) is expanded to (X Z), then $E(\hat{\gamma}) = \begin{pmatrix} \beta \\ \theta \end{pmatrix}$. Therefore, if this covariate matrix can be augmented with proxy covariates that approximate matrix Z (or at least the part of Z correlated with X), then the OVB decreases, converging on zero as the approximation becomes increasingly better. Thursby and Schmidt (1977) [13] discuss that an approximation being correlated with omitted variables can lead to a powerful test. The RESET uses exponential powers of $X\beta$ for this approximation. Accordingly, matrix X must contain more than the vector of ones (for the intercept term). The resulting set of equations for testing purposes is given by
$$Y = X\beta + \sum_{k=1}^{K} \varphi_k \hat{Y}^k + \epsilon \tag{3}$$
where $\hat{Y}^k = (X\hat{\beta})^k$ for integer k ≥ 2, and $\epsilon$ is an n-by-1 vector of random errors for a non-spatial model. The joint null hypothesis for the $\varphi_k$ coefficients is that all of them are zero, which is tested using the F-ratio
$$F = \frac{(ESS_1 - ESS_2)/(df_2 - df_1)}{ESS_2/(n - df_2)}$$
where $ESS_j$ and $df_j$ are, respectively, the error sum of squares and the degrees of freedom for model j (j = 1, 2). Rejection of the null hypothesis implies misspecification. When implementing Equation (3), in order to exploit the spatial autocorrelation common to X and Z, as well as the spatial autocorrelation unique to Z, our analyses used exponential powers of fitted values from an eigenvector spatial filter for this approximation: $\hat{Y} = X\hat{\beta} + E_h\hat{\beta}_h$, where $E_h$ are the eigenvectors discussed in Section 4. That is, an ESF model can be expressed as
$$Y = X\beta + E_h\hat{\beta}_h + \sum_{k=1}^{K} \varphi_k \hat{Y}^k + \epsilon \tag{4}$$
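The mechanics of the test can be illustrated with a small numerical sketch. The code below is not the authors' implementation: it uses synthetic data, ordinary (non-ESF) fitted values, and powers 2 and 3 of z-scored fitted values, all of which are assumptions made for illustration. The function names `ols_ess` and `reset_f` are hypothetical. The returned pair of degrees of freedom follows the "DF1, DF2" layout of Table 1 (number of added terms, residual degrees of freedom of the unrestricted model).

```python
import numpy as np

def ols_ess(design, y):
    """Fit OLS by least squares; return the error sum of squares and fitted values."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coef
    resid = y - fitted
    return float(resid @ resid), fitted

def reset_f(X, y, powers=(2, 3)):
    """RESET F-ratio; X must already contain the column of ones."""
    n, p = X.shape
    ess1, fitted = ols_ess(X, y)                    # restricted model
    z = (fitted - fitted.mean()) / fitted.std()     # z-scores keep the powers well scaled
    extra = np.column_stack([z ** k for k in powers])
    ess2, _ = ols_ess(np.hstack([X, extra]), y)     # unrestricted model
    q = len(powers)                                 # number of added RESET terms
    df_resid = n - p - q                            # residual df of the unrestricted model
    f_ratio = ((ess1 - ess2) / q) / (ess2 / df_resid)
    return f_ratio, q, df_resid

# Synthetic illustration: one correctly specified and one misspecified response.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
X = np.column_stack([np.ones_like(x), x])
noise = rng.normal(0.0, 0.1, size=100)
y_linear = 1.0 + 2.0 * x + noise                    # linear specification is correct
y_quadratic = 1.0 + 2.0 * x + 3.0 * x ** 2 + noise  # squared term is omitted from X

f_lin, q, df_resid = reset_f(X, y_linear)
f_quad, _, _ = reset_f(X, y_quadratic)
print(f"F (correct spec) = {f_lin:.3f}, F (omitted x^2) = {f_quad:.1f}, df = ({q}, {df_resid})")
```

In this sketch the F-ratio for the response with the omitted squared term is far larger than for the correctly specified response, which is the pattern the RESET is designed to detect. A p-value would come from the F distribution with (q, df_resid) degrees of freedom.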

3. The RESET for a Generalized Linear Regression Specification

Sapra (2005) [14] extends Ramsey’s RESET to generalized linear models (GLMs). The logic remains the same here; the response variable no longer is a normal random variable (RV). Rather, it is a Poisson, binomial, or other RV from the exponential family.
The basic equation is similar to (3): assessment is in terms of powers of a linear combination of covariates. For a Poisson RV, the linear combination is the log-mean estimate. For a binomial random variable, the linear combination is the log-odds ratio function. The test statistic is the chi-square, whereas the calculation is −2 times the log-likelihood function differences (subtracting that for the expanded specifications from the original specification). Sapra (2005) [14] comments that this extended version of the RESET appears to have reasonable statistical power for medium to large sample sizes.
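A likelihood-ratio version of the test for a Poisson response can be sketched as follows. This is an illustrative sketch rather than Sapra's or the authors' code: the data are synthetic, the Poisson GLM is fitted with a hand-written iteratively reweighted least squares loop (`fit_poisson`, a hypothetical helper), and powers 2 and 3 of the z-scored linear predictor are an assumed choice. With two added terms, the chi-square reference distribution has 2 degrees of freedom, whose upper-tail probability is exactly exp(−x/2).

```python
import math
import numpy as np

def fit_poisson(X, y, iterations=50):
    """Log-link Poisson GLM via IRLS; X must start with a column of ones."""
    beta = np.zeros(X.shape[1])
    beta[0] = math.log(y.mean())                  # start at the intercept-only fit
    for _ in range(iterations):
        eta = X @ beta
        mu = np.exp(eta)
        working = eta + (y - mu) / mu             # working response for the log link
        XtW = (X * mu[:, None]).T                 # IRLS weights equal mu
        beta = np.linalg.solve(XtW @ X, XtW @ working)
    eta = X @ beta
    loglik = float(np.sum(y * eta - np.exp(eta)))  # log(y!) constant omitted; it cancels
    return eta, loglik

def glm_reset(X, y, powers=(2, 3)):
    """Likelihood-ratio RESET statistic for a Poisson GLM."""
    eta, ll_restricted = fit_poisson(X, y)
    z = (eta - eta.mean()) / eta.std()            # z-scored linear predictor
    extra = np.column_stack([z ** k for k in powers])
    _, ll_unrestricted = fit_poisson(np.hstack([X, extra]), y)
    return -2.0 * (ll_restricted - ll_unrestricted)

def chi2_sf_2df(stat):
    """Upper-tail probability of a chi-square variate with 2 degrees of freedom."""
    return math.exp(-max(stat, 0.0) / 2.0)

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=500)
X = np.column_stack([np.ones_like(x), x])
y_ok = rng.poisson(np.exp(1.0 + 0.5 * x)).astype(float)               # log-mean linear in x
y_mis = rng.poisson(np.exp(1.0 + 0.5 * x + 1.0 * x ** 2)).astype(float)  # x^2 omitted from X

lr_ok, lr_mis = glm_reset(X, y_ok), glm_reset(X, y_mis)
p_ok, p_mis = chi2_sf_2df(lr_ok), chi2_sf_2df(lr_mis)
print(f"LR (correct spec) = {lr_ok:.3f}, LR (omitted x^2) = {lr_mis:.1f}")
```

For a binomial response the same structure would apply with the logit link and binomial log-likelihood in place of the Poisson ones.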

4. Eigenvector Spatial Filtering and Omitted Variables

One contention about the presence of non-zero spatial autocorrelation in regression residuals is that it arises because covariates with spatial patterns are missing from a descriptive equation specification (e.g., Temple 1999 [15]). Shifting this spatial autocorrelation from the residuals to the systematic part of the equation (e.g., introducing a spatial autoregressive term) furnishes a surrogate for the missing variable(s), which can be seen by, for example, an increase in the accompanying pseudo-R² value. But auto-models are complicated. ESF offers a simpler approach to handling this omitted variables problem. In other words, because spatial autocorrelation can arise from a missing relevant variable that has an underlying spatial map pattern, a spatial filter constructed with eigenvectors that shows this same underlying spatial autocorrelation pattern can serve as a proxy for missing variables by accounting for spatial autocorrelation.
ESF uses a set of synthetic proxy variables, which are extracted as eigenvectors from an adjusted spatial weights matrix C (defined in Equation (5)) that links geographic objects together in space, and then adds these vectors as control variables to an equation specification. These control variables identify and isolate the stochastic spatial dependencies among a given set of georeferenced observations, resulting in their mimicking independent ones, thus allowing spatial statistical analysis to proceed in standard ways. Spatial autocorrelation in regression residuals often arises because of a missing relevant variable that has an underlying spatial pattern (e.g., McMillen 2003 [16]). Thus, a spatial filter constructed with eigenvectors that exhibit appropriate spatial autocorrelation patterns can serve as a proxy by accounting for spatial autocorrelation.
ESF applies the mathematical decomposition that creates eigenfunctions to the following transformed spatial weights matrix:
$$(I - \mathbf{1}\mathbf{1}^T/n)\, C\, (I - \mathbf{1}\mathbf{1}^T/n) \tag{5}$$
where I is an n-by-n identity matrix, and 1 is an n-by-1 vector of ones. This decomposition generates n eigenvectors and their associated n eigenvalues. In descending order, the n eigenvalues can be denoted as λ = (λ1, λ2, λ3, …, λn), ranging between the largest eigenvalue that is positive, λ1, and the smallest eigenvalue that is negative, λn. The corresponding n eigenvectors can be denoted as E = (E1, E2, E3, …, En), where each eigenvector, Ej, is an n-by-1 vector.
These eigenfunctions have a number of important properties. First, the eigenvectors are mutually orthogonal and uncorrelated (Griffith 2000) [17]: the symmetry of matrix C ensures orthogonality, and the projection matrix $(I - \mathbf{1}\mathbf{1}^T/n)$ ensures that eigenvectors have zero means, guaranteeing uncorrelatedness. That is, $EE^T = I$ and $E^T\mathbf{1} = 0$, and the correlation between any pair of eigenvectors, say $E_i$ and $E_j$, is zero when i ≠ j. Second, the eigenvectors portray distinct, selected map patterns. Tiefelsdorf and Boots (1995) [18] establish that each eigenvector portrays a different map pattern exhibiting a specified level of spatial autocorrelation when it is mapped onto the n areal units associated with the corresponding spatial weights matrix C. They also establish that the Moran coefficient (MC) value for a mapped eigenvector is equal to a function of its corresponding eigenvalue (i.e., $MC_j = \frac{n}{\mathbf{1}^T C \mathbf{1}} \lambda_j$, for $E_j$). Third, given a spatial weights matrix C, the feasible range of MC values is determined by the largest and smallest eigenvalues; i.e., by $\lambda_1$ and $\lambda_n$ (de Jong et al. 1984) [19]. Based upon these properties, the eigenvectors can be interpreted as follows (Griffith 2003) [10]:
The first eigenvector, E1, is the set of real numbers that has the largest MC value achievable by any set of real numbers for the spatial arrangement defined by the spatial weight matrix C; the second eigenvector, E2, is the set of real numbers that has the largest achievable MC value by any set that is uncorrelated with E1; the third eigenvector, E3, is the set of real numbers that has the largest achievable MC value by any set that is uncorrelated with both E1 and E2; the fourth eigenvector is the fourth such set of values; and so on through En, the set of real numbers that has the largest negative MC value achievable by any set that is uncorrelated with the preceding (n − 1) eigenvectors.
As such, these eigenvectors furnish distinct map pattern descriptions of latent spatial autocorrelation in spatial variables, because they are mutually both orthogonal and uncorrelated.
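The eigenvector properties described above can be checked numerically. The sketch below assumes a binary rook-adjacency matrix on a 5-by-5 regular grid, which is not one of the paper's datasets; `rook_weights` and `moran_coefficient` are hypothetical helper names. Because the constant vector is itself an eigenvector of the transformed matrix with eigenvalue zero, the zero-mean and MC checks are restricted to eigenvectors with nonzero eigenvalues.

```python
import numpy as np

def rook_weights(rows, cols):
    """Binary rook-adjacency matrix C for a rows-by-cols regular grid."""
    n = rows * cols
    C = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:                      # neighbour to the east
                C[i, i + 1] = C[i + 1, i] = 1.0
            if r + 1 < rows:                      # neighbour to the south
                C[i, i + cols] = C[i + cols, i] = 1.0
    return C

def moran_coefficient(v, C):
    """Moran coefficient of the values v under spatial weights C."""
    z = v - v.mean()
    return (len(v) / C.sum()) * (z @ C @ z) / (z @ z)

C = rook_weights(5, 5)
n = C.shape[0]
M = np.eye(n) - np.ones((n, n)) / n               # projection matrix I - 11'/n
eigenvalues, E = np.linalg.eigh(M @ C @ M)        # symmetric eigen-decomposition
order = np.argsort(eigenvalues)[::-1]             # descending order, as in the text
eigenvalues, E = eigenvalues[order], E[:, order]

keep = np.flatnonzero(np.abs(eigenvalues) > 1e-8)  # drop the zero-eigenvalue vectors
mc_from_eigenvalues = (n / C.sum()) * eigenvalues[keep]
mc_direct = np.array([moran_coefficient(E[:, j], C) for j in keep])

print("largest MC:", round(float(mc_from_eigenvalues[0]), 4))
print("smallest MC:", round(float(mc_from_eigenvalues[-1]), 4))
```

In an ESF application, a subset of these columns of `E` would be selected (for example, by stepwise regression) and appended to the covariate matrix.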
ESF furnishes a promising alternative approach to the popular spatial auto-model for describing a spatial process. Pace, LeSage, and Zhu (2013) [11] comment that ESF is an effective method to alleviate OVB. With a simulation experiment that examines ESF estimates for two different types of data generating processes (i.e., spatial autoregressive and spatial error processes), they find that ESF reduces bias in parameter estimates. One appealing feature of ESF is that it utilizes a relevant subset of eigenvectors extracted from a spatial weights matrix, whereas a spatial autoregressive model utilizes the full set of these eigenvectors, both ones that correlate and ones that do not correlate (and hence introduce noise) with the response variable in question. Another appealing feature of ESF is that determining its associated degrees of freedom is more straightforward; a spatial autoregressive model has a complicated degrees of freedom structure because of its multiplicative form. The number of degrees of freedom for the spatial autocorrelation parameter can differ from 1 (Janson, Fithian, and Hastie 2015) [20].

5. Specimen Empirical Datasets

Illustrative analyses have been completed with eleven empirical datasets¹ that span a range of sample sizes (49 to 3109): Dallas, TX City and County census tracts; United States (US) state economic areas (SEAs); US as well as Texas counties; Anselin’s Columbus neighborhoods; Plano, TX block groups; Mercer-Hall agricultural field plots; and, Puerto Rico municipalities. Figure 1 portrays the various surface partitionings associated with these datasets.
For the linear model specification coupled with a normal probability model, several of the response variables need to be subjected to a Box-Cox power transformation. Puerto Rican irrigated farm counts have been analyzed with both a normal approximation (for their density version) and a binomial generalized linear model specification (for their percentage version). Finally, Texas cancer counts have been analyzed with a Poisson generalized linear model specification.
Crime data are: 1980 for Columbus, OH; 2008 for Plano, TX (vehicle burglary); and, 2010 for the City of Dallas. Population density data are: 2010 for Dallas, TX, and for the US. Mercer-Hall crop data are 1910 wheat yields. Puerto Rico irrigated farms data are: 2007 for density; and, 2002 for percentages. US SEA white male prostate cancer rates are age-adjusted for 1970–1994. Finally, Texas county cancer counts are for 2003, whereas Texas county mortgage data are for 2000.
These datasets not only furnish a range of sizes, but Figure 1 reveals that they also furnish a wide range of qualitatively different surface partitionings. In addition, they furnish a range of covariate set sizes, as well as a range of response variable types that includes examples of each of the three most commonly encountered varieties of georeferenced RVs (i.e., normal, binomial, and Poisson).

6. RESET Results for the Specimen Empirical Datasets

The RESET for an ESF model was conducted with the selected eigenvectors as additional independent variables. That is, the F-test was calculated with the sums of squared errors for the ESF model and its counterpart with additional fitted value terms.² Inclusion of a constructed eigenvector spatial filter improves the RESET analysis in all eleven cases (Table 1 and Table 2). This improvement is of three types: when the diagnostic fails to indicate omitted variables; when the diagnostic indicates omitted variables before, but not after, adding an eigenvector spatial filter; and, when the diagnostic still indicates omitted variables after inclusion of an eigenvector spatial filter.
In all cases, inclusion of an eigenvector spatial filter increases the (pseudo-)R², sometimes more than tripling it. Both Columbus, OH crime rates, and Puerto Rico density of irrigated farms include covariates that do not yield a RESET diagnostic suggesting omitted variables; nevertheless, inclusion of an eigenvector spatial filter increases the null hypothesis (no omitted variables) RESET probability.
Plano vehicle burglary rates, City of Dallas crime rates, Mercer-Hall wheat yield, US SEA prostate cancer rates, and Dallas County population density have an initial RESET diagnostic suggesting omitted variables, and a RESET diagnostic with a probability of at least 0.1 after inclusion of an eigenvector spatial filter. The implication here is that an eigenvector spatial filter substitutes well for omitted variables.
Texas median monthly mortgages, US population density, and GLM results for both percentage of Puerto Rican irrigated farms and Texas cancer counts have RESET diagnostics that indicate the presence of omitted variables both with and without inclusion of an ESF. Inclusion of an ESF increases the RESET probabilities, but not enough for them to be non-significant. These may be cases in which a spatially unstructured term also is needed to compensate for omitted variables.
For comparison purposes, a RESET was conducted for spatial lag and spatial error model specifications using the Columbus dataset. Here, because of their non-linear forms, the RESET employs the chi-square test for the likelihood ratio difference between a restricted model and its unrestricted counterpart (Vaona 2009) [21]. That is, integer powers of (z-score versions of) fitted values from a spatial regression model are introduced as explanatory variables. Here the resulting RESET p-values are 0.3663 and 0.1852, respectively, whereas the resulting pseudo-R² values are 0.6523 and 0.6584, respectively. These findings suggest that spatial autoregressive models also correct for OVB, offering spatial analysts two ways of exploiting spatial autocorrelation to compensate for omitted variables.

Cross-Validation RESET Results for the Specimen Empirical Datasets

Each of the specimen datasets was subjected to a cross-validation evaluation to examine the sensitivity of the RESET to individual observations, with each observation in a dataset being left out, in turn, and then predicted. Table 3 summarizes results for the linear model examples, and Table 4 summarizes results for the generalized linear model examples. These results are encouraging, given the number of improvements, but indicate the need for further refinement work in this area. The goal would be for almost all, if not all, of the cases to improve, achieving a RESET probability exceeding 0.1.

7. Correction for Omitted Variable Bias: Selected Simulation Experiments

OVB results in an estimated regression coefficient differing substantially from its population parameter, often in an attempt by included covariates to compensate for omitted variables. This substantial difference can render an incorrect null hypothesis test result concerning included variables. Empirical evidence presented here suggests that an eigenvector spatial filter helps remediate this situation.
The first simulation experiment summarized here is based upon the Puerto Rico (n = 73) agricultural dataset. The response variable is the sum of the density of farms using irrigation (X1) and Box-Tidwell transformed mean rainfall (X2), plus an independent and identically distributed (iid) random error term that is N(0, 0.12). The correlation between the two covariates is 0.43, indicating modest collinearity. The response variable (containing 73 values) was simulated 10,000 times, followed by estimation of its linear regression equation as well as each of the two individual bivariate regression equations, resulting in
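$$\begin{aligned}
\bar{\hat{Y}} &= \bar{\hat{\beta}}_0 \mathbf{1} + 1.00046 X_1 + 0.99996 X_2 \\
\bar{\hat{Y}} &= \bar{\hat{\beta}}_0 \mathbf{1} + 1.42810 X_1 \\
\bar{\hat{Y}} &= \bar{\hat{\beta}}_0 \mathbf{1} + 1.42419 X_2 \\
\hat{Y}_j &= \hat{\beta}_{0j} \mathbf{1} + E_{kj}\beta_{kj}, \quad j = 1, 2, \ldots, 10{,}000
\end{aligned}$$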
The intercept term estimate is not reported here because it is not of interest. The average regression coefficient estimates of 1.00046 and 0.99996 are not different from 1 (standard errors of roughly 0.049), their population parameter counterparts (i.e., the true model). The bivariate regression coefficient estimates indicate that the OVB is sizeable, exceeding 42%, and significant (standard errors of 0.044). Powers of the eigenvector spatial filter fitted values ($\hat{Y}_j$) furnish the RESET terms for simulation replicate j. Table 5 summarizes outcomes of this simulation experiment, which involved stepwise selection of the RESET terms (which are constructed from eigenvector spatial filters). The average bivariate regression coefficient estimates corrected by the RESET are 0.95574 and 0.94882, both of which are markedly less than their OVB counterparts, although they are modestly deflated. Their respective standard errors are 0.062 and 0.067, which, unlike the original OVB estimates, mean they are not significantly different from 1.
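The size of the uncorrected bias in an experiment of this kind can be reproduced in outline. The sketch below is not the authors' experiment: it replaces the Puerto Rico covariates with two synthetic, unit-variance covariates constructed to have a sample correlation of exactly 0.43, assumes an error standard deviation of 0.1, uses 2000 replicates, and omits the ESF-based RESET correction step entirely. Under these assumptions, omitting one covariate inflates the remaining slope to about 1 + 0.43 = 1.43, which is comparable in magnitude to the bivariate estimates reported above.

```python
import numpy as np

rng = np.random.default_rng(42)
n, replicates, rho, sigma = 73, 2000, 0.43, 0.1    # sigma is an assumed error s.d.

# Two synthetic covariates with zero mean, unit variance, and sample correlation rho.
raw = rng.normal(size=(n, 2))
raw -= raw.mean(axis=0)
q, _ = np.linalg.qr(raw)                           # orthonormal, zero-mean columns
x1 = np.sqrt(n) * q[:, 0]
x2 = np.sqrt(n) * (rho * q[:, 0] + np.sqrt(1.0 - rho ** 2) * q[:, 1])

# True model: Y = X1 + X2 + iid normal error, simulated once per replicate (column).
errors = rng.normal(0.0, sigma, size=(n, replicates))
Y = (x1 + x2)[:, None] + errors

ones = np.ones(n)
full = np.column_stack([ones, x1, x2])             # correctly specified model
only_x1 = np.column_stack([ones, x1])              # X2 omitted

coef_full = np.linalg.lstsq(full, Y, rcond=None)[0]     # shape (3, replicates)
coef_x1 = np.linalg.lstsq(only_x1, Y, rcond=None)[0]    # shape (2, replicates)

mean_full = coef_full[1:].mean(axis=1)             # average slopes, full model
mean_x1 = coef_x1[1].mean()                        # average slope with X2 omitted

print("full-model slopes:", np.round(mean_full, 4))
print("slope with X2 omitted:", round(float(mean_x1), 4))
```

The inflated slope follows from the bias term $(X^TX)^{-1}X^TZ\theta$, which for standardized covariates reduces to the correlation between the included and omitted covariate times the omitted coefficient.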
The second simulation experiment summarized here is based upon the Texas (n = 254) cancer dataset. The response variable is the exponentiated weighted sum of the logarithms of median household income (X1), percentage of white population (X2), and percentage of single (i.e., unmarried) people (X3), plus log-total population as an offset variable. The weights are the Poisson regression coefficients from a GLM. Because the expectation equation is a description of cancer counts that are overdispersed, it was used as the mean of a gamma RV, whose sampled values were treated as means of Poisson RVs.³ The response variable (containing 254 values) was simulated 10,000 times, followed by estimation of its Poisson GLM equation as well as each of the three individual bivariate and individual trivariate Poisson regression equations, resulting in
$$\begin{aligned}
\bar{\hat{Y}} &= \exp[\bar{\hat{\beta}}_0 \mathbf{1} - 0.30 X_1 + 0.20 X_2 - 0.80 X_3 + LN(population)] \\
\bar{\hat{Y}} &= \exp[\bar{\hat{\beta}}_0 \mathbf{1} - 0.46 X_1 + LN(population)] \\
\bar{\hat{Y}} &= \exp[\bar{\hat{\beta}}_0 \mathbf{1} + 1.05 X_2 + LN(population)] \\
\bar{\hat{Y}} &= \exp[\bar{\hat{\beta}}_0 \mathbf{1} - 0.97 X_3 + LN(population)] \\
\bar{\hat{Y}} &= \exp[\bar{\hat{\beta}}_0 \mathbf{1} - 0.32 X_1 + 0.95 X_2 + LN(population)] \\
\bar{\hat{Y}} &= \exp[\bar{\hat{\beta}}_0 \mathbf{1} - 0.31 X_1 - 0.90 X_3 + LN(population)] \\
\bar{\hat{Y}} &= \exp[\bar{\hat{\beta}}_0 \mathbf{1} + 0.29 X_2 - 0.82 X_3 + LN(population)] \\
\hat{Y}_j &= \exp[\hat{\beta}_{0j} \mathbf{1} + E_{kj}\beta_{kj} + LN(population)], \quad j = 1, 2, \ldots, 10{,}000
\end{aligned}$$
Again, the intercept term estimate is not reported here because it is not of interest; however, in some empirical cases, it is of interest, another reason to use the z-score versions of fitted values. Table 6 summarizes outcomes of this simulation experiment, which involved stepwise selection of the RESET terms (which, as before, are constructed from eigenvector spatial filters).
The average regression coefficient estimates of −0.30213, 0.21343, and −0.80155 respectively do not differ from −0.3, 0.2, and −0.8 (standard errors of roughly 0.2), their population parameter counterparts. The bivariate and trivariate Poisson regression coefficient estimates indicate that the OVB is sizeable, many being at least 20%, and statistically significant. For the bivariate regressions, the eigenvector spatial filter reduces the OVB as reported in Table 7.
For the bivariate cases, the estimates with the ESF RESET adjustment are closer to their true values. Specifically, the estimates for X1 and X3 are close to their true values, whereas the adjustment for X2 is less effective. These results indicate that the ESF adjustment is reasonable in a bivariate regression case, but not so in a trivariate regression case. The correlation structure may play a role here: $r_{X_1X_2}$ = 0.11, $r_{X_1X_3}$ = −0.10, and $r_{X_2X_3}$ = −0.53.
These two empirically based simulation experiments furnish a proof of concept, and indicate that ESFs offer promise for effectively dealing with the OVB problem. Clearly, future research should be devoted to this theme.

8. Implications and Conclusions

Properly testing for OVB requires knowing the omitted variables, which does not help in practice. This situation also can be assessed if instrumental variables are available to use. At least in some cases, an eigenvector spatial filter can be treated like an instrument (see Le Gallo and Paez 2013 [22]). Ramsey’s RESET furnishes a special case test where the omitted variables are nonlinear functions of the included covariates. This paper summarizes findings based upon a set of empirical examples and a pair of conditional simulations suggesting that an ESF often can serve as a surrogate for omitted variables. Inclusion of an eigenvector spatial filter tends to increase the (pseudo-)R² and the RESET null hypothesis probability. Combining an eigenvector spatial filter with a spatially unstructured term to correct for OVB merits subsequent research, too.

Author Contributions

Both authors contributed equally to the paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. J.B. Ramsey. “Tests for specification errors in classical linear least squares regression analysis.” J. Royal Stat. Soc.: Ser. B 31 (1969): 350–371.
  2. J. Wooldridge. Introductory Econometrics: A Modern Approach, 5th ed. Mason, OH, USA: South-Western, 2013.
  3. G. Shukur, and P. Mantalos. “Size and power of the RESET test as applied to systems of equations: A bootstrap approach.” J. Mod. Appl. Stat. Methods 3 (2004): 370–385.
  4. D.M. Brasington, and D. Hite. “Demand for environmental quality: A spatial hedonic analysis.” Reg. Sci. Urban Econ. 35 (2005): 57–82.
  5. R.K. Pace, and J.P. LeSage. “Omitted variable biases of OLS and spatial lag models.” In Progress in Spatial Analysis. Edited by A. Páez, J. LeGallo, R. Buliung and S. Dall’Erba. Berlin, Germany: Springer, 2010, pp. 17–28.
  6. J. LeSage, and O. Parent. “Bayesian model averaging for spatial econometric models.” Geogr. Anal. 39 (2007): 241–267.
  7. J. LeSage, and M.M. Fischer. “Spatial growth regressions: Model specification, estimation and interpretation.” Spat. Econ. Anal. 3 (2008): 275–304.
  8. P. Piribauer, and M.M. Fischer. “Model uncertainty in matrix exponential spatial growth regression models.” Geogr. Anal. 47 (2015): 240–261.
  9. P. Piribauer. “Heterogeneity in spatial growth clusters.” Empir. Econ., 2016.
  10. D.A. Griffith. Spatial Autocorrelation and Spatial Filtering: Gaining Understanding through Theory and Scientific Visualization. Berlin, Germany: Springer, 2003.
  11. R.K. Pace, J.P. LeSage, and S. Zhu. “Interpretation and computation of estimates from regression models using spatial filtering.” Spat. Econ. Anal. 8 (2013): 352–369.
  12. Y. Chun, and D.A. Griffith. “A quality assessment of eigenvector spatial filtering based parameter estimates for the normal probability model.” Spat. Stat. 10 (2014): 1–11.
  13. J.G. Thursby, and P. Schmidt. “Some properties of tests for specification error in a linear regression model.” J. Am. Stat. Assoc. 72 (1977): 635–641.
  14. S. Sapra. “A regression error specification test (RESET) for generalized linear model.” Econ. Bull. 3 (2005): 1–6.
  15. J. Temple. “The New Growth Evidence.” J. Econ. Lit. 37 (1999): 112–156.
  16. D.P. McMillen. “Spatial autocorrelation or model misspecification?” Int. Reg. Sci. Rev. 26 (2003): 208–217.
  17. D.A. Griffith. “Eigenfunction properties and approximations of selected incidence matrices employed in spatial analyses.” Linear Algebra Its Appl. 321 (2000): 95–112.
  18. M. Tiefelsdorf, and B.N. Boots. “The exact distribution of Moran’s I.” Environ. Plan. A 27 (1995): 985–999.
  19. P. De Jong, C. Sprenger, and F.V. Veen. “On extreme values of Moran’s I and Geary’s c.” Geogr. Anal. 16 (1984): 17–24.
  20. L. Janson, W. Fithian, and T.J. Hastie. “Effective degrees of freedom: A flawed metaphor.” Biometrika 102 (2015): 479–485.
  21. A. Vaona. “Spatial autocorrelation or model misspecification? The help from RESET and the curse of small samples.” Lett. Spat. Resour. Sci. 2 (2009): 53–59.
  22. J. Le Gallo, and A. Paez. “Using synthetic variables in instrumental variable estimation of spatial series models.” Environ. Plan. A 45 (2013): 2227–2242.
  • ¹ Several of these datasets were used in the 2008 US National Science Foundation funded spatial filtering workshop held at the University of Texas at Dallas during June 16–20 (http://www.spatialfiltering.com/).
  • ² ESS1 was calculated with covariates and selected eigenvectors, and ESS2 was calculated with additional fitted terms as well as the covariates and the selected eigenvectors. For Columbus data, df2 for the non-spatial model is 41 (= 49 – the number of independent variables; that is, 2 covariates, intercept, and 5 fitted terms); df2 for the ESF model is 38 (= 49 – the number of independent variables with 3 additional eigenvectors).
  • ³ The mean of the empirical RV is 133, its standard deviation is 407, and its overdispersion scale parameter is 2.8. The simulated data have a mean of 134, a standard deviation of 419, and a scale parameter of approximately 2.8.
Figure 1. Surface partitionings for the specimen datasets. (a) Columbus, OH (n = 49); (b) US counties (n = 3109); (c) US state economic areas (n = 508); (d) City of Dallas census tracts (n = 264); (e) Dallas County census tracts (n = 529); (f) Texas counties (n = 254); (g) Mercer-Hall agricultural field plots (n = 500); (h) City of Plano census block groups (n = 159); (i) Puerto Rico municipalities (n = 73).
Table 1. Ramsey regression equation specification error test (RESET) results for the linear model empirical examples.
| Data | n | Y | X | Non-spatial model: R² | Non-spatial model: RESET | Non-spatial model: DF1, DF2 | Non-spatial model: p-value | Spatial model (ESF): R² | Spatial model (ESF): RESET | Spatial model (ESF): DF1, DF2 | Spatial model (ESF): p-value |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Columbus | 49 | Crime rates | Housing value, household income | 0.5524 | 1.6122 | 5, 41 | 0.1784 | 0.7419 | 1.4361 | 5, 38 | 0.2337 |
| Puerto Rico | 73 | Irrigated farm density | Mean rainfall | 0.1383 | 1.8075 | 5, 66 | 0.1235 | 0.4686 | 1.4361 | 5, 60 | 0.2245 |
| Plano Census Block Groups | 159 | Box-Cox^1 transformed vehicle burglary rates | Rates of population aged between 18 and 24, distance to highway | 0.1428 | 4.7558 | 5, 151 | 0.0005 | 0.4169 | 1.5777 | 5, 142 | 0.1701 |
| Texas Counties | 254 | Median monthly mortgage | Log of population density, log of household median income, % of housing units built since 1980 | 0.7740 | 10.5403 | 5, 245 | 3.5 × 10^−9 | 0.8597 | 3.6297 | 5, 228 | 0.0035 |
| City of Dallas Census Tracts | 264 | Log of violation crime rates in 2000 | Rates of population aged between 13 and 17, Black population rates, poverty rate | 0.5336 | 20.6077 | 4, 256 | 9.8 × 10^−15 | 0.7374 | 1.9273 | 4, 245 | 0.1065 |
| Mercer Hall | 500 | Wheat yield | Straw yield | 0.5326 | 3.9194 | 3, 496 | 0.0088 | 0.7376 | 0.991 | 4, 455 | 0.4121 |
| US SEA | 508 | White male prostate cancer rates | White male bladder cancer rate, mean indoor radon concentration | 0.1392 | 4.0884 | 5, 500 | 0.0012 | 0.4857 | 0.3308 | 5, 470 | 0.8943 |
| Dallas County Census Tracts | 529 | Box-Cox^2 transformed pop. density | Y coordinates, # of families, log of distance to CBD | 0.1671 | 11.9806 | 3, 522 | 1.3 × 10^−7 | 0.5949 | 1.2653 | 3, 472 | 0.2857 |
| US Counties | 3109 | Log of population density | Log of # of families, old population rates (60+) | 0.7394 | 13.8545 | 5, 3101 | 2.1 × 10^−13 | 0.8952 | 4.4681 | 5, 2894 | 0.0005 |

^1 The Box-Cox transformation was performed with $(y^{\lambda} - 1)/\lambda$, where $\hat{\lambda} = 0.1113$.
^2 The Box-Cox transformation was performed with $\hat{\lambda} = 0.3408$.
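The Box-Cox transformation named in the Table 1 footnotes can be illustrated with a short Python sketch. It assumes that both footnotes use the same functional form, (y^λ − 1)/λ, with the natural logarithm as the λ = 0 limiting case; the function name and the sample values are hypothetical.

```python
import math


def box_cox(y, lam):
    """Box-Cox power transformation of a strictly positive value y."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if lam == 0:
        return math.log(y)  # limiting case as lambda -> 0
    return (y ** lam - 1.0) / lam


LAMBDA_PLANO = 0.1113   # estimate reported in Table 1, footnote 1
LAMBDA_DALLAS = 0.3408  # estimate reported in Table 1, footnote 2

rates = [0.5, 1.0, 12.0]  # hypothetical positive values, not from the paper
print([round(box_cox(r, LAMBDA_PLANO), 4) for r in rates])
print([round(box_cox(r, LAMBDA_DALLAS), 4) for r in rates])
```

Because y = 1 maps to 0 for any λ, and the transformation is monotone increasing, the ordering of the original values is preserved.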
Table 2. RESET results for the generalized linear model (GLM) empirical examples.
| Term | Before ESF: χ² | Before ESF: p-value | After ESF: χ² | After ESF: p-value |
|---|---|---|---|---|
| Puerto Rico (Binomial): Irrigated farms (y) with log of mean rainfall (x) | | | | |
| Ŷ² | 0.4510 | 0.5018 | 0.0003 | 0.9853 |
| Ŷ³ | 100.7835 | <2.2 × 10^−16 | 16.2781 | 0.0003 |
| Pseudo-R² | 0.4528 | | 0.4829 | |
| Texas counties (Poisson): Cancer counts (y) with three covariates^1 | | | | |
| Ŷ² | 127.3967 | <2.2 × 10^−16 | 3.5006 | 0.0614 |
| Ŷ³ | 147.8025 | <2.2 × 10^−16 | 18.2274 | 0.0001 |
| Pseudo-R² | 0.1315 | | 0.3722 | |
1 The covariates are log of household median income, log of white population rates, and log of single marital status rates.
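The χ² statistics in Table 2 can be converted to upper-tail p-values with the closed forms available for 1 and 2 degrees of freedom. The sketch below is an illustration only: the table does not state the degrees of freedom, so treating the Ŷ² rows as 1-df tests and the Ŷ³ rows as 2-df tests is an assumption, chosen because it reproduces the reported after-ESF p-values to rounding. The function name is hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail chi-square probability; closed forms for df = 1 and df = 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch covers df = 1 or df = 2 only")


# After-ESF statistics from Table 2; the df column is an assumption (see lead-in).
checks = [
    ("Puerto Rico, Y-hat^2", 0.0003, 1),
    ("Puerto Rico, Y-hat^3", 16.2781, 2),
    ("Texas counties, Y-hat^2", 3.5006, 1),
    ("Texas counties, Y-hat^3", 18.2274, 2),
]
for label, stat, df in checks:
    print(f"{label}: chi2 = {stat}, df = {df}, p = {chi2_sf(stat, df):.4f}")
```

A general-purpose implementation would use a statistics library's chi-square distribution rather than these two special cases.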
Table 3. RESET cross-validation results for the specimen linear models.
| Data | n | Maintained p ≤ 0.1 | Improved from p ≤ 0.1 to p > 0.1 | Declined from p > 0.1 to p ≤ 0.1 | Maintained p > 0.1 |
|---|---|---|---|---|---|
| Columbus | 49 | 0 | 2 | 2 | 45 |
| Puerto Rico | 73 | 0 | 6 | 2 | 65 |
| Plano Census Block Groups | 159 | 92^1 | 67 | 0 | 0 |
| Texas Counties | 254 | 254^2 | 0 | 0 | 0 |
| City of Dallas Census Tracts | 264 | 261^3 | 3 | 0 | 0 |
| Mercer Hall | 500 | 4 | 495 | 0 | 1 |
| US SEA | 508 | 0 | 507 | 0 | 1 |
| Dallas County Census Tracts | 529 | 0 | 528 | 1 | 0 |
| US Counties | 3109 | 3019^4 | 0 | 0 | 0 |

^1 p-values for 35 (out of 92) cases increased from less than 0.0001 to greater than 0.05.
^2 p-values for 252 (out of 254) cases increased from less than 10^−7 to greater than 0.001.
^3 p-values for 256 (out of 261) cases increased from less than 10^−9 to greater than 0.001.
^4 p-values of 3104 (out of 3109) cases increased from less than 10^−10 to greater than 0.0001.
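The four outcome columns of Table 3 amount to a two-by-two classification of paired p-values at the 0.1 level. The following Python sketch shows that tally, assuming each cross-validation case supplies one RESET p-value without eigenvectors and one with ESF; the function name, category labels, and p-value pairs are hypothetical.

```python
from collections import Counter


def classify(p_before, p_after, alpha=0.1):
    """Assign one (before, after) pair of RESET p-values to a Table 3 column."""
    before_rejects = p_before <= alpha
    after_rejects = p_after <= alpha
    if before_rejects and after_rejects:
        return "maintained p <= alpha"
    if before_rejects and not after_rejects:
        return "improved"
    if not before_rejects and after_rejects:
        return "declined"
    return "maintained p > alpha"


# Hypothetical (non-spatial, ESF) p-value pairs, not taken from the paper.
pairs = [(0.001, 0.002), (0.03, 0.40), (0.25, 0.08), (0.50, 0.90), (0.10, 0.11)]
tally = Counter(classify(before, after) for before, after in pairs)
print(dict(tally))
```

By construction each case falls in exactly one category, which is why the four counts in every row of the table should add up to n.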
Table 4. RESET cross-validation results for the specimen generalized linear models.
| Term | n | Maintained p ≤ 0.1 | Improved from p ≤ 0.1 to p > 0.1 | Declined from p > 0.1 to p ≤ 0.1 | Maintained p > 0.1 |
|---|---|---|---|---|---|
| Puerto Rico (Binomial): Irrigated farms (y) with log of mean rainfall (x) | | | | | |
| Ŷ² | 73 | 0 | 1 | 70 | 2 |
| Ŷ³ | 73 | 72^1 | 1 | 0 | 0 |
| Texas counties (Poisson): Cancer counts (y) with three covariates^1 | | | | | |
| Ŷ² | 254 | 246 | 8 | 0 | 0 |
| Ŷ³ | 254 | 253^2 | 1 | 0 | 0 |

^1 The p-values of 70 cases (out of 72) increased from less than 1.0 × 10^−16 to greater than 1.0 × 10^−5, and for 12 of these cases, to greater than 0.0001.
^2 The p-values of 252 cases (out of 253) increased from less than 1.0 × 10^−16 to greater than 1.0 × 10^−5, and for 190 of these cases, to greater than 0.0001.
Table 5. Selection frequency of RESET terms for the Puerto Rico simulation experiment.
| Variable | None | Ŷ² | Ŷ³ | Ŷ⁴ | Ŷ² & Ŷ³ | Ŷ² & Ŷ⁴ | Ŷ³ & Ŷ⁴ | Ŷ² & Ŷ³ & Ŷ⁴ |
|---|---|---|---|---|---|---|---|---|
| X1 | 0 | 843 | 4607 | 4282 | 0 | 4 | 210 | 54 |
| X2 | 0 | 1673 | 4660 | 3666 | 0 | 1 | 0 | 0 |
Table 6. Selection frequency of RESET terms for the Texas data simulation experiment.
VariableNone Y ^ 2 Y ^ 3 Y ^ 4 Y ^ 2 & Y ^ 3 Y ^ 2 & Y ^ 4 Y ^ 3 & Y ^ 4 Y ^ 2 & Y ^ 3 & Y ^ 4
X127594566923902553206415
X2846776091258271364726665
X31336773251636688500405776
X1 & X26257704991025332406606283
X1 & X31320809249635533515855854
X2 & X31718825287702578517435330
Table 7. Parameter estimates with OVB and ESF RESET adjustments.
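| Number of Omitted Variables | Estimate Type | X1 | X2 | X3 |
|---|---|---|---|---|
| two | Parameter | −0.30 | 0.20 | −0.80 |
| | OVB | −0.46 | 1.05 | −0.97 |
| | ESF RESET adjusted | −0.23 | 0.90 | −0.87 |
| one | OVB | −0.32 | 0.95 | |
| | ESF RESET adjusted | −0.25 | 0.92 | |
| | OVB | −0.31 | | −0.90 |
| | ESF RESET adjusted | −0.34 | | −0.97 |
| | OVB | | 0.29 | −0.82 |
| | ESF RESET adjusted | | 0.39 | −0.76 |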

Share and Cite

MDPI and ACS Style

Griffith, D.A.; Chun, Y. Evaluating Eigenvector Spatial Filter Corrections for Omitted Georeferenced Variables. Econometrics 2016, 4, 29. https://doi.org/10.3390/econometrics4020029

