In this section, we present the study area and databases underlying both urban and micro-retail descriptors. Next, the spatial unit of analysis and different families of street-based urban form descriptors are briefly defined. Finally, the model and variable selection procedures are described.
3.1. Case Study and Data Sources
The analytical protocol proposed in this work was tested on a real case study of the French Riviera metropolitan area in southern France. This polycentric coastal settlement comprises 88 municipalities that are structured around six main urban centers. From west to east we find: the Cannes–Grasse–Antibes conurbation, with 74,200, 51,000 and 73,800 inhabitants in their central cities, respectively; Nice, with 343,000 inhabitants, representing the largest municipality of the French Riviera and its administrative center; and the enclave of Monaco and the border city of Menton, with 38,000 and 28,000 inhabitants, respectively. About 70% of all micro-retail businesses are found within these six municipalities. Spread around these main centers, 295,000 people live in smaller cities, villages and hamlets surrounded by vast residential areas, according to the morphological properties of the site. All these differently sized centers are interconnected by a pervasive, discontinuous and car-dependent residential fabric. With a total of more than 1 million inhabitants, the French Riviera is the seventh most populated metropolitan area in France.
The combination of all these elements produces a sequence of urban centers and peripheral areas of different sizes that encompass a large variety of urban forms. Previous studies have disentangled the high heterogeneity of the study region, identifying typo-morphological regions both at district and neighborhood scales [
56,
57]. These sub-regions correspond to different urban morphological contexts characterized by specific combinations and distributions of urban configurational and morphological descriptors; moreover, for each of these regions, different zero-inflation and overdispersion properties of the micro-retail distribution are also observed. These characteristics allow the present work to overcome the limitations of traditional works that have investigated only individual core regions of medium- or large-sized monocentric cities [
10], and to assess the current analytical procedure under different contextual and statistical conditions.
Two sources of data are considered in this work. The official data about micro-retail distribution are provided by the local Chamber of Commerce of Nice Côte d’Azur (CCINCA), counting about 50,000 businesses and services active as of 1 January 2017. (More recently, this same information has been made available at the national level by the national statistics agency (INSEE).) The address information allowed us to geocode the database and provide a spatial representation of the phenomena under study. This process was realized through the National Open Addresses Database (Base d’Adresses Nationale Ouverte (BANO)). The BANO geolocation tool associates a score with each geocoding result, describing the localization precision at four levels: null, municipality, street and house number. From our original dataset: (i) 7% of the data presented missing information or fell outside of our study area, and were thus excluded from our analysis; (ii) 2% were geo-localized at the municipality level and 13% at the street level (these mis-localisations often resulted from incomplete address information in the original database, such as a missing civic number, a misspelt street name or an incorrect name of an isolated hamlet, and a manual correction was carried out when the correct retail activity address was available from other online sources); (iii) 78% of the data were correctly located at the house-number level. We obtained a final dataset of 45,726 stores distributed across 33,221 locations (several activities shared the same addresses), 82% with a precise civic number and 18% at the street level (positioned at street segment midpoints). In 135 locations, large planned centers were found with retail surfaces larger than 2000 square meters. This specific retail format does not possess the same combinations of locational factors as smaller activities [
58]; however, its presence has the potential to profoundly modify the surrounding urban morphology and flow, making these centers an attractive element for smaller activities (i.e., retail locomotives). For this reason, these activities (from now on named “anchor stores”) were excluded from the original dataset and considered as a locational factor for smaller commercial activities (see
Section 3.2).
Urban form descriptors were based on the geographic databases (BD TOPO, 2017) from the French National Institute of Geographical and Forest Information (IGN). Four layers of urban morphological elements were used: building, street-network, parcel and digital terrain model (DTM).
Based on these data sources, well-established GIS-based protocols were implemented for the elaboration of the different urban morphological descriptors, while statistical procedures were implemented with R libraries [
59]. The use of relatively simple data and available analytical/statistical protocols makes this work reproducible for future comparative studies.
3.2. The Variables under Investigation
The spatial unit of analysis was the street segment. Streets represent one of the most used spatial units, and have been attracting attention in the last 20 years from urban designers, configurational studies, morphologists and urban geographers [
60]. Streets are considered to be the bridging element between different methodological and theoretical approaches [
44].
The street segment is here defined as the centerline between two street junctions. Four reasons motivate this choice, the first being that “the dominant network model is the one that represents the street junctions as vertices in the graph and the linear street segments as its edges” [
61]. Secondly, by using street network centerlines, the primary approach allows configurational properties to be identified independently of the physical shapes and sizes of the built forms surrounding street segments (isolating configurational properties of the network from morphometric measures of the streetscape and fabrics). Thirdly, the use of a centerline provides a geometrical reference when studying streetscapes from the street point of view (measures of setback, parallelism of facades and so on use street edges and/or street centerlines as references). The street segment therefore becomes both a geometrical (streetscape measures, the geometry of retail agglomerations, etc.) and metric (local configurational properties, local morphological patterns, etc.) reference [
62], whereas the use of visual axes as in SSx or alternative street-like representations of the street network provides a distorted reference system for streetscape descriptors. Finally, the street segment represents a behaviorally oriented partition of space, which is better suited for socioeconomic phenomena such as the distribution of retail businesses in urban space [
35].
To describe different aspects of urban form, several computer-aided procedures from established scientific literature were implemented for our study region. Each street segment was characterized by more than 100 street-based descriptors of urban form (further details about urban form indicators are described in
Appendix B).
Four main subsets of indicators can be recognized: the first comprises 40 indicators that have been defined to describe street network configurational properties using the MCA protocol [
8,
9]. Local Reach, Straightness, Closeness and Betweenness centralities are assessed at different scales and impedances on pedestrian and vehicular modelled street-networks (300-, 600- and 1200-meter radii and 5- and 20-minute radii, respectively). Their normalized versions are obtained following a two-step floating catchment area procedure (2SFCA) [
63].
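As an illustration of how radius-limited centralities of this kind can be computed, the following minimal sketch evaluates Reach and Closeness for one node of a toy network. It is not the MCA implementation used in this work: it measures centrality on junctions rather than on street segments, uses metric impedance only, and adopts one common definition of Closeness (number of reachable nodes divided by the sum of shortest-path distances). The function names and the toy network are hypothetical.

```python
import heapq

def shortest_distances(graph, source, radius):
    """Dijkstra search restricted to nodes within `radius` metres of `source`."""
    dist = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for neighbour, length in graph[node]:
            nd = d + length
            if nd <= radius and nd < dist.get(neighbour, float("inf")):
                dist[neighbour] = nd
                heapq.heappush(queue, (nd, neighbour))
    return dist

def reach_and_closeness(graph, source, radius):
    """Reach: reachable nodes within the radius. Closeness: reach / sum of distances."""
    dist = shortest_distances(graph, source, radius)
    others = [d for node, d in dist.items() if node != source]
    reach = len(others)
    total = sum(others)
    closeness = reach / total if total > 0 else 0.0
    return reach, closeness

# Toy network (assumed values): junction -> [(neighbour, segment length in metres)]
toy = {
    "A": [("B", 100.0)],
    "B": [("A", 100.0), ("C", 150.0)],
    "C": [("B", 150.0), ("D", 400.0)],
    "D": [("C", 400.0)],
}
reach, closeness = reach_and_closeness(toy, "A", 300.0)
```

With a 300-meter radius, junction A reaches B and C but not D, so changing the radius changes both indicators, which is why several radii are evaluated.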
The second subset of indicators is made up of 36 indicators describing the street-network accessibility towards public squares, coastline and anchor stores, which are considered influential components of an urban form on micro-retail distribution. As with the previous metrics, several scales and impedances were considered.
From the urban design and urban morphological literature, 30 indicators describing the built form layout along the street edges have been defined (also named skeletal streetscapes [
64]). Several GIS protocols have been proposed in recent urban form literature [
64,
65,
66,
67], and indicators such as façade alignment, building set-back, average building height and so on are calculated while considering building distribution within a 50-meter distance from street edges through the definition of street-based proximity bands (PBs) and sightlines [
56,
67].
Finally, street-based contextual variables/partition have been obtained through the implementation of the Multiple Fabric Assessment procedure [
56], wherein each street segment is associated with nine values, with each one describing the probability of association with different urban fabric types. In more central and compact regions, historical centers, traditional planned fabrics with adjoining buildings and discontinuous fabrics of buildings and houses are found (respectively, UF1–3). Semi-peripheral and peripheral regions are prevalently composed of modernist urban fabrics and suburban areas with lower/higher natural constraints (respectively, UF4–6). Finally, the least dense regions are described by connective artificial fabrics and natural spaces of hills and mountains (respectively, UF7–9). This urban fabric partition is illustrated in
Figure 1 and further described in [
57]. The study of the spatial organization of these nine urban fabrics allows the identification of three morphological macro-regions within a metropolitan area: First-, Second-Age City (following the morphological categories of [
68]) and Natural Space. These two typo-morphological partitions of the study area, illustrated in
Figure 1, define the sub-regions where count regression approaches are individually applied; the limited number of streets with stores within the Natural Space and UF7–9 prevents the implementation of our analytical procedures in these specific morphological regions.
Of the almost 100,000 street segments composing the whole street network of the French Riviera, we focused on those where built-up elements were found within 50 meters from street edges. Streets crossing natural areas, large public parks and small connective segments were excluded, reducing our dataset to 63,071 units. Each street segment was defined by the number of small stores representing the target variable of our models. Different values of zero-inflation, street density and overdispersion were observed in each morphological sub-region (
Table 1).
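The zero-inflation and overdispersion properties mentioned above can be summarized from the raw store counts of a sub-region. The sketch below is only illustrative: the function name and the toy counts are assumptions, and the variance-to-mean ratio is used as a simple dispersion index (values well above 1 suggest overdispersion with respect to a Poisson distribution).

```python
def summarise_counts(counts):
    """Return the share of zeros, the mean, the variance and the variance-to-mean ratio."""
    n = len(counts)
    zero_share = sum(1 for c in counts if c == 0) / n
    mean = sum(counts) / n
    variance = sum((c - mean) ** 2 for c in counts) / n  # population variance
    dispersion = variance / mean if mean > 0 else float("nan")
    return {
        "zero_share": zero_share,
        "mean": mean,
        "variance": variance,
        "dispersion": dispersion,
    }

# Toy store counts per street segment (assumed values, not taken from Table 1)
toy_counts = [0, 0, 0, 0, 0, 0, 1, 1, 2, 6]
summary = summarise_counts(toy_counts)
```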
Before proceeding with a description of the modelling protocol, two further aspects should be underlined. Firstly, the same four limitations presented in
Section 2 still persist when using other fine-grained spatial unit definitions and urban form descriptors. As such, the modelling solution presented in this paper might also be tested and implemented with other street-based spatial unit definitions (e.g., axial streets, named streets, raster-based solutions, plots, etc.). Nonetheless, the combination of several urban form analytical procedures, each one based on ad hoc spatial unit definitions, would require a supplementary artificial manipulation of the variables, which would lead to the introduction of a statistical bias and compromise both the modelling and variable selection procedure performances and outcomes.
Secondly, this work focuses on the study of the physical properties of urban form, and does not take into consideration any socioeconomic and land-use regulation aspects. It is fully recognized that such aspects play an important role as locational factors in retail distribution, and are each related to urban form in different ways. For this reason, both modelling performance measures and variable selection procedure could be strongly dependent on these variables, confounding the role of other urban descriptors. Their exclusion from the modelling procedure allows the roles of different properties of the urban built environment to be explored and pointed out. Further research would be needed to disentangle the roles of urban form, socioeconomic aspects and planning constraints.
3.3. Modelling Micro-retail Distribution: From Linear to Count Regression Approaches
As discussed in the previous section, count regression approaches seem to be best suited to our case study. These methods have been widely developed over the last 50 years [
30,
69,
70,
71,
72]. GLMs have been specifically developed to handle count data: a mathematical transformation (the link function) is applied to the expected value of the dependent variable, considering the true distribution of errors and assuming a distribution from an exponential family (e.g., binomial, Poisson, multinomial, etc.). A linear relationship is then investigated between the independent variables and the transformed response rather than its raw values. A maximum likelihood estimation (MLE) procedure is implemented for the estimation of the model parameters.
When the distribution of the dependent variables (and errors) follows a Gaussian (G) distribution, the identity function describes the transformation and, subsequently, the GLM results in the same estimates as the traditional MLR [
72]. When the variable to be analyzed is represented by a count variable, the random component assumes the form of a Poisson distribution and the corresponding transformation is usually a log function. The resulting model is called a log-linear or Poisson regression model (P). However, the main assumption of a Poisson model is that the mean and variance of the observed dependent variable are equal (equidispersion), an assumption that is not met when the dependent variable is characterized by high heterogeneity. The negative binomial (NB) model might be considered an alternative to the Poisson model, and this specific form provides a built-in solution to account for overdispersion. P and NB represent two interesting alternatives to G/MLR, overcoming the restrictive assumption of homoscedasticity while considering the true distribution of errors.
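For reference, the two count models can be written as follows. The notation is an assumption introduced here for illustration: $Y_i$ is the store count on street segment $i$, $\mathbf{x}_i$ its vector of urban form descriptors, and $\theta$ the dispersion parameter of the negative binomial in its common quadratic-variance (NB2) parametrization, which may differ from the one adopted by a given software library.

```latex
\[
Y_i \sim \mathrm{Poisson}(\mu_i), \qquad
\log \mu_i = \mathbf{x}_i^{\top}\boldsymbol{\beta}, \qquad
\operatorname{Var}(Y_i) = \mathrm{E}(Y_i) = \mu_i
\]
\[
Y_i \sim \mathrm{NB}(\mu_i, \theta), \qquad
\log \mu_i = \mathbf{x}_i^{\top}\boldsymbol{\beta}, \qquad
\operatorname{Var}(Y_i) = \mu_i + \frac{\mu_i^{2}}{\theta}
\]
```

Under this parametrization the NB variance always exceeds the mean, and the Poisson model is recovered as $\theta \to \infty$.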
Despite being able to handle discrete non-negative and skewed distributions, the models presented so far cannot handle overdispersion due to zero-inflation (heuristic rules suggest a presence of zeroes not higher than 20% of the expected values, which is far less than what was observed in our target variable). In such situations, the GLM approach proposes alternative solutions that are able to integrate and model an excessive presence of zeroes.
With zero-inflated (ZI) regression models [
37], zeros originate according to two simultaneous processes. The probability distribution of zero-inflated models is defined as the combination of a logistic part modelling the structural zeros (or true zeros) and a count part assuming a P (ZIP) or NB (ZINB) form from which random zeros (or false zeros) are produced.
Zero-alternated (ZA, or hurdle) approaches [
73,
74] model all zeros as one part, while the non-zero part is modelled with zero-truncated count regressions. The implementation of the P or NB forms into the zero-truncated part of the model results in zero-alternated Poisson (ZAP) and negative binomial (ZANB) models.
Implementing ZI and ZA models allowed us to explore the possibility that two processes might determine the observed zero and non-zero values instead of considering that these values come from the same data-generating process. Both ZI and ZA are described by the combination of logistic regression and Poisson (ZIP-ZAP) or negative binomial (ZINB-ZANB) models. The main difference among these approaches is that the former considers the observed distribution of values to be the result of the combined processes with a possibility of distinguishing between structural and random zeros, while the latter supposes two separate generating processes producing zero and non-zero values. Finally, the opportunity to use P and NB both in ZI and ZA allows us to control for the combined overdispersion of count and zero parts.
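The difference between the two families can be made concrete by writing their probability mass functions for the Poisson case. The sketch below is a minimal illustration with constant parameters (no regressors); the function names and parameter values are assumptions. In the ZIP model the zero probability mixes structural and random zeros, whereas in the hurdle (ZAP) model all zeros come from the binary part and positive counts follow a zero-truncated Poisson.

```python
import math

def poisson_pmf(k, mu):
    """Poisson probability of observing count k with mean mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)

def zip_pmf(k, mu, pi):
    """Zero-inflated Poisson: structural zero with probability pi, otherwise Poisson(mu)."""
    if k == 0:
        return pi + (1.0 - pi) * poisson_pmf(0, mu)  # structural + random zeros
    return (1.0 - pi) * poisson_pmf(k, mu)

def hurdle_pmf(k, mu, p_zero):
    """Hurdle Poisson (ZAP): zeros from the binary part, positives from a zero-truncated Poisson."""
    if k == 0:
        return p_zero
    truncated = poisson_pmf(k, mu) / (1.0 - poisson_pmf(0, mu))
    return (1.0 - p_zero) * truncated

# Assumed illustrative parameters
mu, pi, p_zero = 1.5, 0.6, 0.7
zip_zero = zip_pmf(0, mu, pi)
hurdle_zero = hurdle_pmf(0, mu, p_zero)
zip_total = sum(zip_pmf(k, mu, pi) for k in range(60))
hurdle_total = sum(hurdle_pmf(k, mu, p_zero) for k in range(60))
```

Both functions define proper distributions (their probabilities sum to one), but only the ZIP form lets a zero arise from the count process itself.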
In addition to the three models previously described (G, P, NB), four further models were implemented and compared (ZIP, ZAP, ZINB, ZANB). The seven models presented here were fitted on the overall study area and on the eight aforementioned sub-regions.
GLM is a powerful technique that enables a wide number of modelling approaches beyond the traditional MLR to investigate different aspects of the dependent variable statistical distribution. While the implementation and comparison of these approaches have been already discussed in several disciplines, no work has investigated this specific aspect in the case of micro-retail distribution and urban form. The implementation of a comparative analysis of seven regression models allowed us to understand whether specific processes should be considered when describing the relationship between urban form and micro-retail distribution. Goodness-of-fit measures are described in the next section as support for the model selection procedure.
Before proceeding with further specifications, another observation should be made. Micro-retail distribution is frequently measured as a density; one might argue that the raw count of stores might be strongly biased by the size of the underlying spatial unit. A specific approach to handle density variables is possible when implementing GLM. Density might be seen as a rate between a count value (the store number) and the underlying spatial unit size (street length), also named the exposure variable. With a log link, GLM handles exposure variables using simple algebra: the model for the rate is rewritten as a model for the count by multiplying both sides of the equation by the exposure variable, so that the logarithm of the exposure moves to the right side of the equation. In the final model, the logged exposure variable enters the linear predictor as an additional term whose coefficient is fixed to one, also called the offset variable. With this solution, GLM permits the preservation of the natural form of the count data while accounting for the variability determined by the underlying spatial unit dimension.
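The algebra behind the offset can be written in one line. The symbols are assumptions introduced for illustration: $\mu_i$ is the expected number of stores on street segment $i$, $\ell_i$ its length (the exposure) and $\mathbf{x}_i$ its vector of urban form descriptors, with a log link.

```latex
\[
\log\!\left(\frac{\mu_i}{\ell_i}\right) = \mathbf{x}_i^{\top}\boldsymbol{\beta}
\quad\Longleftrightarrow\quad
\log \mu_i = \mathbf{x}_i^{\top}\boldsymbol{\beta} + \log \ell_i
\]
```

The term $\log \ell_i$ carries no estimated coefficient (it is fixed to one), so the regression coefficients keep their interpretation as effects on store density while the response remains a count.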
3.4. Modelling Selection: Goodness-of-fit Measures
Defining a common procedure by which to assess and compare the different models is a task of paramount importance when identifying the most suitable modelling approach.
Since the traditional coefficient of determination R2 requires a homoscedastic distribution of error, extensive scientific literature has focused on pseudo-R2 for count regression models [
75,
76,
77,
78]. Nonetheless, there is no consensus on which measure should be preferred, and each choice might lead to certain drawbacks [
79]. For example, goodness-of-fit measures have been specifically conceived for each type of GLM regression, which prevents their application across a large variety of models and, ultimately, their use in support of the model selection phase.
To overcome this limitation, measures based on information criteria (IC) have become increasingly popular. The notion behind IC approaches is the need to find a compromise between likelihood maximization and the principle of parsimony, which favors simpler models [
72]. The Akaike information criterion (AIC) [
80] is obtained as $\mathrm{AIC} = 2K - 2\ln\hat{L}(M)$, where $K$ is the number of estimable parameters, corresponding to the degrees of freedom, and $\hat{L}(M)$ is the maximum value of the likelihood function for the model $M$. In other words, the AIC score combines a penalty proportional to the degrees of freedom of a model with the negative log-likelihood of the model given the data. A lower AIC score reflects models that are closer to reality. AIC scores do not have a specific meaning when independently considered, but a comparison of AIC scores from different models can help an analyst rank and select the best solutions from a finite set of models. AIC can be obtained for all the GLM approaches considered here and allows non-nested models to be compared, which ordinary statistical tests cannot do.
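A minimal sketch of the AIC-based ranking follows, using the standard definition (twice the number of parameters minus twice the maximized log-likelihood). The model labels, log-likelihoods and parameter counts are hypothetical values chosen only to show the mechanics; they are not results of this study.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2K - 2 ln(L_hat)."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical fitted models: label -> (maximized log-likelihood, number of parameters)
candidates = {
    "M1": (-1520.4, 5),
    "M2": (-1390.2, 6),
    "M3": (-1371.9, 11),
}
scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
ranking = sorted(scores, key=scores.get)  # lowest AIC first
```

In this toy case the most complex model still ranks first because its gain in log-likelihood outweighs the penalty for its additional parameters.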
The implementation of likelihood ratio-based tests (LR-test) provides an analyst with further evidence highlighting statistically significant differences between IC scores. The null hypothesis of an LR-test is that both compared models are equally close to the true model. If the null hypothesis is rejected, one of the two models should be considered as having a better performance. The Vuong test [
81] for non-nested models is so far the most applied LR-test among the different domains of the scientific literature without any restrictions on GLMs. In this work, AIC scores and the non-nested Vuong testing were used to quantify and rank our model performances and, ultimately, guide the model selection. As we were aware of possible biases when considering ZI models [
82], rootgrams [
83] were also implemented as a graphic solution to support the model assessment.
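The Vuong statistic can be sketched from the pointwise log-likelihood contributions of two competing models. The version below is the uncorrected form (without the AIC- or BIC-type adjustments that some implementations add), and the function name and toy values are assumptions. Under the null hypothesis the statistic is asymptotically standard normal, so large positive values favor the first model and large negative values the second.

```python
import math

def vuong_statistic(loglik_a, loglik_b):
    """Uncorrected Vuong statistic from per-observation log-likelihoods of models A and B."""
    n = len(loglik_a)
    diffs = [a - b for a, b in zip(loglik_a, loglik_b)]
    mean_diff = sum(diffs) / n
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / n
    return math.sqrt(n) * mean_diff / math.sqrt(var_diff)

# Toy per-observation log-likelihoods (assumed values)
loglik_a = [-1.0, -2.0, -1.5, -0.5]
loglik_b = [-1.5, -2.25, -2.5, -0.75]
v = vuong_statistic(loglik_a, loglik_b)
```

Swapping the two models only changes the sign of the statistic, which is why the test can rank non-nested models in either direction.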
While the aforementioned procedure assessed and supported the model selection procedure, two additional aspects should be outlined. Firstly, log-likelihood-based measures allow comparison only if models share the same underlying dataset (both in terms of variables and records). Therefore, the same approach was not suitable when comparing global model outcomes with those obtained from the subgroup regressions approach. Secondly, AIC is a global measure, and does not reveal the roles of overdispersion and zero-inflation in model performance outcomes.
Other parameters were also implemented, allowing the description of different aspects of the model outcomes. Count pseudo-R2 [
84] was implemented as the proportion of correct estimates on the overall number of predictions; similarly, weighted accuracy, recall and F1 scores were also provided. Traditional measures of dispersion of the residuals (mean absolute error and standard deviation) for each model completed the model outcome description. These measures were applied while considering zero and count parts of each model separately, thus revealing their relative impacts on the overall goodness-of-fit measures.
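A minimal sketch of the count pseudo-R2 is given below. Two choices are assumptions made for illustration: a prediction is treated as correct when the fitted mean, rounded to the nearest integer, equals the observed count, and the zero and count parts are separated according to the observed value of each street segment. The function names and toy values are hypothetical.

```python
def share_correct(pairs):
    """Proportion of (observed, predicted) pairs whose rounded prediction equals the observation."""
    if not pairs:
        return float("nan")
    return sum(1 for y, mu in pairs if round(mu) == y) / len(pairs)

def count_pseudo_r2_by_part(observed, predicted):
    """Count pseudo-R2 overall, on observed zeros and on observed positive counts."""
    pairs = list(zip(observed, predicted))
    return {
        "overall": share_correct(pairs),
        "zero_part": share_correct([p for p in pairs if p[0] == 0]),
        "count_part": share_correct([p for p in pairs if p[0] > 0]),
    }

# Toy observed counts and fitted means (assumed values)
observed = [0, 0, 0, 1, 2, 5]
predicted = [0.2, 0.1, 0.8, 1.3, 1.2, 3.9]
scores = count_pseudo_r2_by_part(observed, predicted)
```

In this toy example, half of the predictions are correct overall, but the zero part is predicted better than the count part, a contrast that a single global score would hide.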
3.5. Feature Selection
In the previous sections we defined a model selection procedure to identify the most suitable approach for describing micro-retail distribution, which we based on overall goodness-of-fit measures, without considering the specific combination of regressors. Nonetheless, as outlined in
Section 2.4, non-experimental studies are nearly always characterized by the presence of multicollinearity; this was even more true in this work, where different facets and metrics of the same phenomenon—the urban physical form—were studied and combined. Another goal of this work was to outline the subsets of individual urban morphological variables related to micro-retail spatial distribution within each sub-region under analysis.
In order to achieve this objective, a specific category of feature selection—penalized regression (PR)—provided a built-in solution for GLM count regression approaches. While the goal of traditional selection procedures is to remove predictors from a model that are not considered significant and thus set their regressor coefficients to zero, the idea underlying PR is to penalize them toward zero without forcing them to be exactly zero (for this reason, these methods are also known as shrinkage or regularization methods). In this way, the complexity of the model is reduced while keeping all or part of the variables in the model. PR traditionally requires the choice of a shrinkage value of lambda to define the magnitude of the penalization.
Three main penalized regression procedures are most commonly used: ridge, least absolute shrinkage and selection operator (LASSO) and elastic net (Enet). In ridge PR, the loss function underlying the regression models is augmented to minimize the sum of the squared residuals while taking into account and penalizing the size of the parameter estimates, with the final goal of shrinking them toward zero. In LASSO PR [
49], the regression coefficients are also shrunk toward zero, but those with a minor contribution might be forced to be exactly equal to zero. Two different penalization functions are considered in ridge and LASSO approaches. While ridge seems to be better suited when coefficient parameters are of a similar size, LASSO regression is typically better suited when a model presents a subset of variables with high coefficient parameters while the remaining have very small coefficients [
85].
Finally, Enet regression combines both ridge and LASSO penalization approaches, shrinking the coefficients toward zero while also setting some of them to exactly zero, producing simpler and more interpretable models. Implementing Enet regression in our case study enabled us to outline the subset of urban morphological variables most related to the spatial distribution of retail.
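The three penalties can be expressed with a single objective function. The form below follows one common parametrization (the factor $1/2$ on the quadratic term is a software convention and may differ elsewhere); $L(\boldsymbol{\beta})$ denotes the likelihood of the count model, $\lambda \ge 0$ the shrinkage value and $\alpha \in [0, 1]$ a mixing parameter, all introduced here as illustrative notation.

```latex
\[
\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}}
\left\{ -\ln L(\boldsymbol{\beta}) + \lambda \, P_{\alpha}(\boldsymbol{\beta}) \right\},
\qquad
P_{\alpha}(\boldsymbol{\beta}) = \alpha \sum_{j} \lvert \beta_j \rvert
+ \frac{1 - \alpha}{2} \sum_{j} \beta_j^{2}
\]
```

Setting $\alpha = 1$ gives the LASSO penalty, $\alpha = 0$ gives ridge, and intermediate values give the elastic net, in which the absolute-value term is what allows some coefficients to be exactly zero.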
In order to find the optimal values for the shrinkage parameters, iterative procedures were implemented to search over a large number of candidate values, using optimization criteria based on IC such as AIC or, similarly, the Bayesian Information Criterion (BIC, [
86]). For each study region, we asked the Enet algorithm to explore 20 values of lambda. The regression coefficients reported in this work correspond to the penalized model for which the lowest BIC scores were observed.
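The BIC-based choice of lambda can be sketched as follows, using the standard definition of BIC (number of parameters times the log of the sample size, minus twice the log-likelihood). The sketch assumes that the degrees of freedom of each penalized fit are approximated by its number of non-zero coefficients, which is a common simplification; the function names and the regularization path are hypothetical values, not results of this study.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: K ln(n) - 2 ln(L_hat)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

def select_lambda(path, n_obs):
    """Return the (lambda, log-likelihood, non-zero coefficients) entry with the lowest BIC."""
    return min(path, key=lambda fit: bic(fit[1], fit[2], n_obs))

# Hypothetical regularization path: (lambda, log-likelihood, number of non-zero coefficients)
toy_path = [
    (1.00, -1450.0, 3),
    (0.10, -1400.0, 8),
    (0.01, -1395.0, 25),
]
best = select_lambda(toy_path, n_obs=1000)
```

Here the intermediate lambda is retained: the weakest penalty improves the log-likelihood only marginally while keeping many more coefficients, which BIC penalizes more heavily than AIC would.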