2.1. Data
The urban density data for this study were obtained from three different sources that are updated every five to ten years. Housing unit and population counts at the Census Block level were compiled nationwide for 2000 and 2010, providing high-resolution population and housing information. The housing data, obtained from the US Census Bureau website, were joined to the Census Block polygons from the Topologically Integrated Geographic Encoding and Referencing (TIGER) GIS data using Python. To stratify the settlement morphologies within which buildings stand, a method developed by Heris [48] was applied that uses a cell-based density profile to assign a general type of urban context to each building's neighborhood. This method provides a national raster layer in which the settlement type of each cell is classified as high-density urban core, medium density, low density, urban fringe, or suburban. To incorporate building footprints in our model, Microsoft's building dataset (https://github.com/Microsoft/USBuildingFootprints, accessed on 5 November 2019), which covers the entire US, was used (Figure 1).
High-resolution (1 m) land-cover datasets available through the EPA's EnviroAtlas data portal (https://www.epa.gov/enviroatlas/enviroatlas-data-approach, accessed on 25 October 2019) were used to measure the percentage of tree canopy and impervious surface cover for each cell. Surface temperature data for 30 m × 30 m cells were extracted from Landsat 8 OLI images with less than 1% cloud cover. Atmospheric corrections were applied using ENVI for all bands, including the thermal ones, and the Landsat Digital Numbers (DNs) of band 10 were converted to top-of-atmosphere (TOA) spectral radiance in accordance with USGS instructions (http://landsat.usgs.gov/Landsat8_Using_Product.php, accessed on 20 July 2021):
$$L_\lambda = M_L \, Q_{cal} + A_L$$

where $L_\lambda$ is the TOA spectral radiance (W/(m²·sr·µm)), $M_L$ is the band-specific multiplicative rescaling factor from the metadata, $A_L$ is the band-specific additive rescaling factor, and $Q_{cal}$ is the quantized and calibrated standard product pixel value (DN). In the next step, the radiance values were converted to the at-satellite brightness temperature:
$$T = \frac{K_2}{\ln\left(\dfrac{K_1}{L_\lambda} + 1\right)}$$

where $T$ is the at-satellite brightness temperature (K), and $K_1$ and $K_2$ are the band-specific thermal conversion constants from the metadata. In the last step, the satellite brightness temperature ($BT$) data were normalized based on emissivity values for each land-cover class using the following equation:
$$LST = \frac{BT}{1 + \left(\dfrac{W \cdot BT}{P}\right)\ln(\varepsilon)}$$

where $W$ is the wavelength of emitted radiance, $\varepsilon$ is the emissivity assigned to the cell's land-cover class, and $P = h \cdot c / s$ ($= 14{,}380$), with $s$ as the Boltzmann constant, $h$ as the Planck constant, and $c$ as the velocity of light (Figure 2).
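As an illustration of this processing chain, the sketch below shows how it could be scripted in R with the terra package; the file names are placeholders, and the rescaling factors, thermal constants, and band-10 wavelength are example values that would normally be read from the scene's MTL metadata.

```r
library(terra)

# Band-10 digital numbers and a per-cell emissivity layer (placeholder file names)
dn  <- rast("LC08_band10_DN.tif")
eps <- rast("emissivity_from_landcover.tif")

# Example metadata values (normally read from the scene's MTL file)
M_L <- 3.342e-4    # band-specific multiplicative rescaling factor
A_L <- 0.1         # band-specific additive rescaling factor
K1  <- 774.8853    # thermal conversion constant K1 (band 10)
K2  <- 1321.0789   # thermal conversion constant K2 (band 10)
W   <- 10.895      # wavelength of emitted radiance (micrometres, band 10)
P   <- 14380       # h * c / s (micrometre-Kelvin)

L_lambda <- M_L * dn + A_L                  # DN -> TOA spectral radiance
BT  <- K2 / log(K1 / L_lambda + 1)          # radiance -> brightness temperature (K)
LST <- BT / (1 + (W * BT / P) * log(eps))   # emissivity-normalized surface temperature

writeRaster(LST, "lst_30m.tif", overwrite = TRUE)
```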
Energy data were collected for the cities of interest from different sources, depending on the city-specific reporting requirements of energy benchmarking initiatives (mostly based on building type and size). For instance, Seattle's Energy Benchmarking Program (SMC 22.920) mandates that the annual energy performance of non-residential and multifamily buildings larger than 1858 m² (20,000 ft²) be reported by owners to the city (see Table 1 for the building-type mix by city). Every building data point that carried energy-use data was geocoded based on longitude and latitude. All the above-mentioned layers were converted to raster layers at 30 m × 30 m resolution. For the land-cover layers, the tree canopy and impervious surface coverage were aggregated from the 1 m × 1 m to the 30 m × 30 m resolution.
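As a sketch of this aggregation step (assuming 1 m binary rasters in which 1 marks tree canopy or impervious cover; file names are placeholders), the 30 m fractional cover can be computed as a block mean:

```r
library(terra)

# 1 m binary layers (1 = tree canopy / impervious surface, 0 = other); placeholder names
canopy_1m     <- rast("enviroatlas_tree_canopy_1m.tif")
impervious_1m <- rast("enviroatlas_impervious_1m.tif")

# The mean of a 30 x 30 block of 0/1 cells is the fraction of the 30 m cell covered
canopy_30m     <- aggregate(canopy_1m, fact = 30, fun = mean, na.rm = TRUE)
impervious_30m <- aggregate(impervious_1m, fact = 30, fun = mean, na.rm = TRUE)
```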
It should be noted that some data processing was required prior to the analysis in order to connect the datasets. First, the Census Block polygons of different years are not identical. Second, in core urban areas, Census Blocks are small and most of their net area is developed. To obtain a normalized building density measure and to address the problem of artificially low housing or population densities in areas with large, coarsely resolved Census Blocks, it was necessary to identify the built sites and exclude open spaces. This was done using the NLCD 'percent developed imperviousness' product to exclude land that was entirely undeveloped, particularly in suburban communities. However, the presence of roads as impervious surfaces in these areas introduced significant noise into the data. After comparing different road network datasets, such as TIGER lines and OpenStreetMap (OSM), the OSM data were chosen for the analysis and used to exclude roads from the impervious surface measurement.
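One way this masking step could be scripted, again with terra and placeholder inputs (an NLCD imperviousness raster and an OSM road layer), is sketched below.

```r
library(terra)

# NLCD percent developed imperviousness and an OSM road layer (placeholder inputs)
imperv <- rast("nlcd_percent_developed_imperviousness.tif")
roads  <- vect("osm_roads.gpkg")

# Treat cells with 0% imperviousness as undeveloped open space
developed <- classify(imperv, rbind(c(-Inf, 0, NA)))

# Rasterize the roads and mask them out so road surfaces do not count as built sites
road_cells         <- rasterize(roads, developed, field = 1)
developed_no_roads <- mask(developed, road_cells, inverse = TRUE)
```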
2.2. Analysis
The two major tasks at hand were to identify the distance from a particular building at which it is most relevant to study the impact of urban form parameters on building energy use, and to determine the relationship between form parameters and energy consumption. To answer the first question, and to examine the variables at different distances from a given building, the processing described above was repeated for windows of different sizes. These hypothetical windows were squares (to match the 30 m × 30 m analysis units) drawn around every single building point, covering areas of 150 × 150, 270 × 270, 390 × 390, …, 1950 × 1950 m². For every window, urban form composites were created using a principal component analysis (PCA) in which correlated variables were loaded onto orthogonal components. The variables used to create the composites were the average area of the buildings, the number of buildings per hectare, the average tree-cover area, the impervious surface area, the morphological density category (1–5; 1 = low density, 5 = high density), the number of housing units per hectare, and the surface temperature within each window. It is important to note that, ultimately, only the first three components of each PCA were used to evaluate the relationship between urban form and EUI (energy use per unit area per year), since they explained 80% or more of the variance in all windows for all the cities.
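A minimal sketch of the composite construction for one window size is shown below, assuming a data frame win_vars with one row per building and the seven window-level variables; all names are placeholders.

```r
# Window-level urban form variables (placeholder column names)
form_vars <- c("mean_bldg_area", "bldg_per_ha", "mean_tree_cover", "impervious_area",
               "density_category", "housing_units_per_ha", "surface_temp")

# PCA on the standardized variables
pca <- prcomp(win_vars[, form_vars], center = TRUE, scale. = TRUE)

# Cumulative share of variance explained; the first three components were retained
cumsum(pca$sdev^2) / sum(pca$sdev^2)

composites <- as.data.frame(pca$x[, 1:3])
names(composites) <- c("form_PC1", "form_PC2", "form_PC3")
```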
A stepwise regression analysis of the EUI, log-transformed to correct for its non-normality, was then carried out on a series of covariates and the urban form composites. Using the Akaike information criterion (AIC), an estimator of the relative quality of statistical models, the additional variance explained by adding the urban form composites to each model was assessed for each window. Two requirements had to be met in selecting the optimal window: the urban form composites needed to increase the model's R² significantly, and this additional predictive power had to be the greatest among all the windows that met the first requirement.
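A sketch of how this window comparison could be set up in R follows, assuming a named list window_data of data frames (one per window size) containing the log-transformed EUI, illustrative covariates (floor_area, year_built, building_type), and the three composites; these names are placeholders, not the exact covariates used here.

```r
# One data frame per window size; each holds log EUI, covariates, and the composites
window_fit <- sapply(window_data, function(d) {
  base <- lm(log_eui ~ floor_area + year_built + building_type, data = d)
  full <- step(update(base, . ~ . + form_PC1 + form_PC2 + form_PC3),
               direction = "both", trace = FALSE)   # stepwise selection by AIC
  c(delta_AIC = AIC(base) - AIC(full),
    delta_R2  = summary(full)$r.squared - summary(base)$r.squared)
})
t(window_fit)   # pick the window with the largest significant gain in R2
```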
As Table 2 shows, in six out of the seven cities the considered urban form parameters showed an impact on energy consumption patterns. In the case of Minneapolis, MN, the small sample size may be the reason that no clear pattern emerged. The results are as interesting as they are intuitive: in cities with more sprawled agglomerations, such as Austin, TX, variables such as surface temperature, average building area, number of buildings per hectare, average tree-cover area, and impervious surface area measured within a 1.5 km radius best explain the influence of urban form on building energy use. In cities with higher urban densities, however, the impact of these variables on EUI was most relevant within the immediate 0.5 km radius around the buildings.
The second research question concerned the relationship between urban density parameters and building energy consumption. Regression models of the EUI on a set of relevant covariates and urban form variables were fitted for each city's optimal window, selected in the previous stage. However, the urban form composites were not used at this stage, since the original variables provide a better basis for interpreting the relationships between form parameters and EUI. Surface temperature was also absent from some of the final models, since it can be modelled as a function of tree-cover area and impervious surface area and did not always add relevant variance to the models. The dependent variable was the log-transformed EUI. The model predictors were selected using a stepwise approach and linear ordinary least squares (OLS) models. All models were checked for multicollinearity to avoid correlated urban form variables inflating the variance explained by the models. Where multicollinearity was present, variables were excluded until the variance inflation factor (VIF) fell to acceptable levels (below 4). The selection of covariates for each model was strongly determined by data availability and model predictive power. There were some discrepancies in the variables incorporated in the city-specific models since the energy data came from different sources.
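A sketch of a city-level OLS fit and the multicollinearity check, using the car package and placeholder variable names, could look like this:

```r
library(car)

# Final city-level OLS model on the original urban form variables (placeholder names)
ols <- lm(log_eui ~ mean_bldg_area + bldg_per_ha + mean_tree_cover +
            impervious_area + housing_units_per_ha + floor_area,
          data = city_data)

vif(ols)   # drop predictors and refit until all variance inflation factors are below 4
```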
The resulting models were also evaluated for spatial dependency by analyzing the residuals using graphical and statistical methods, namely Moran's I [49], a measure of overall spatial autocorrelation. To run the Moran statistic and the spatial regression models, matrices of neighbors were created in which, for the sake of consistency across the models, the 20 closest units in the dataset (according to their GPS coordinates) were defined as the neighbors of each unit. Since this approach can yield asymmetrical neighbor matrices [50], the matrices were corrected to make them symmetric; as a result, some units ended up with more than 20 neighbors. Finally, the matrices were weighted by the inverse distance between units in order to acknowledge the greater influence of closer neighbors. Moran's I can be written as:
$$I = \frac{n}{\sum_{i}\sum_{j} w_{ij}} \cdot \frac{\sum_{i}\sum_{j} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i} (x_i - \bar{x})^2}$$

where $n$ is the number of observations, $w_{ij}$ is an element of the row-standardized weights matrix $W$, and $x_i$ is the value of the variable under examination (for instance, a residual or a dependent variable) for observation $i$. Moran's I was applied to determine whether the EUI should be modelled using spatial regression methods (as opposed to non-spatial ones). As Moran's I is sensitive to spatial patterning, for robustness, we replicated the calculations using Monte-Carlo simulations of the index (the Monte-Carlo replications were consistent with the original index in every single case; see Chen [51] for a more nuanced discussion of this topic). Among the broad range of spatial modeling techniques, the spatial error model (SEM) was chosen:
$$y = X\beta + u, \qquad u = \lambda W u + \varepsilon$$

where $\varepsilon$ is the independent error term (it is assumed that $\varepsilon \sim N(0, \sigma^2 I)$), $W$ is the connectivity matrix, $\lambda$ is the spatial error parameter, and $\lambda W u$ is the spatial component of the error term. SEMs assume that the remaining spatial dependency, not represented by the model variables, is due to a set of unknown spatial factors, as opposed to a form of interaction between the included variables and space [49]. Therefore, applying the SEM framework is more theoretically justifiable here, since the assumption is that the energy consumption of a unit is not directly affected by that of its neighbors, and any spatial correlation is more likely due to unobserved variables. Neighbors are unlikely to share energy consumption data among themselves, and parameters such as property value or the urban form variables capture the possible existing correlations with space.
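Putting these pieces together, the sketch below shows one way the neighbor structure, the Moran tests, and the SEM could be implemented in R with the spdep and spatialreg packages; coords, city_data, and the model formula are placeholders, and ols refers to the OLS sketch above.

```r
library(spdep)
library(spatialreg)

# 20 nearest neighbours per building (coords: matrix of coordinates), then symmetrized;
# after symmetrization some units end up with more than 20 neighbours
nb <- make.sym.nb(knn2nb(knearneigh(coords, k = 20)))

# Inverse-distance weights, row-standardized
inv_d <- lapply(nbdists(nb, coords), function(d) 1 / d)
listw <- nb2listw(nb, glist = inv_d, style = "W")

# Moran's I on the OLS residuals, plus a Monte-Carlo version for robustness
moran.test(residuals(ols), listw)
moran.mc(residuals(ols), listw, nsim = 999)

# Spatial error model with the same weights list
sem <- errorsarlm(log_eui ~ mean_bldg_area + bldg_per_ha + mean_tree_cover +
                    impervious_area + housing_units_per_ha,
                  data = city_data, listw = listw)
summary(sem)   # lambda is the estimated spatial error parameter
```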
As a last step, some of the models were replicated in a Bayesian framework so that more intricate forms of nesting could be explored and disentangled by fitting hierarchical spatial models. To fit these models, the brms package in R was used with standard priors [52]. brms provides a unified framework for multilevel spatial regression, is easy to use, and gives users access to all the merits of the Stan program. Stan is a high-level language in which the user specifies a model and the starting values, after which a Hamiltonian Monte-Carlo (HMC) chain simulation is run to derive the posterior distributions. These methods converge faster than the more commonly used Metropolis–Hastings algorithm and/or Gibbs sampling, especially for high-dimensional models [53]. While HMC adds the time needed to calculate the gradient of the log-posterior, it provides higher-quality samples than other samplers and enables drawing samples from the posterior predictive distribution as well as computing the pointwise log-likelihood [53]. Model fit was assessed by calculating the widely applicable information criterion (WAIC) and the leave-one-out cross-validation criterion (LOO), both available in brms. The calculations were performed on Proteus, a high-performance shared computing cluster of Drexel University's Research Computing Facilities. All models implemented in brms are distributional models, in which a response "y" is estimated by predicting the parameters "θ" of a response distribution "D".
In turn, each parameter of the response distribution may be regressed on its own predictor term $\eta$. While $\eta$ can take non-linear forms, it is typically written as a linear combination of population-level effects ($X$), group-level effects ($Z$), and smooth functions $s_k$ fitted via splines, as follows:

$$\eta = X\beta + Zu + \sum_{k} s_k(x_k)$$
Prior distributions can be specified in brms or left at their defaults. For example, the default priors for a Gaussian distribution are flat priors for the betas and a Student-t distribution with three degrees of freedom for the sigmas; the prior for the spatial error parameter is a flat prior over [0, 1]. With regard to hierarchical Bayesian models, brms builds upon the syntax of other R packages and allows users to model hierarchical relationships for the predictors and/or the response function "y".
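To make this concrete, a hedged sketch of how such a model might be specified with brms is given below; the formula, the grouping variable building_type, the example prior, and the spatial term are illustrative assumptions rather than the exact models reported here, and W is the spatial weights matrix from the earlier sketch.

```r
library(brms)

# W: spatial weights matrix (or listw object) passed via data2; sar(W, type = "error")
# adds a spatial error structure, and (1 | building_type) adds an illustrative
# group-level (hierarchical) intercept. Default priors are kept except for an
# example prior on the regression coefficients.
fit <- brm(
  log_eui ~ mean_bldg_area + bldg_per_ha + mean_tree_cover + impervious_area +
    sar(W, type = "error") + (1 | building_type),
  data   = city_data,
  data2  = list(W = W),
  family = gaussian(),
  prior  = set_prior("normal(0, 5)", class = "b"),
  chains = 4, cores = 4
)

waic(fit)   # widely applicable information criterion
loo(fit)    # leave-one-out cross-validation
```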