1. Introduction
The spatial distribution of social infrastructure, including housing and service facilities, is usually not even across a nation [1,2]. Cultural amenities are often concentrated in large cities, and accommodations are primarily located in holiday destinations. The level of accessibility to healthcare can differ considerably not only between urban and rural areas but also between cities [3]. While such unevenness may indicate underdevelopment in particular localities, it is sometimes due to different regional demographic and economic structures [4,5]. The underlying reasons behind the spatial configuration of infrastructure vary, and a comprehensive examination of them is crucial to understanding the true implications of its skewed distribution.
From a methodological point of view, the difficulties in evaluating the spatial distribution of social infrastructure arise from the fact that it encompasses a vast range of human-built features. Even a strict delineation of social infrastructure refers to all physical places that promote sociality, which include schools, hospitals, cafés, and restaurants [6,7]. The increasing availability of open data has facilitated the acquisition of necessary information for analysis. However, simultaneous examination of all social infrastructure is still challenging, and the presence of strong correlations between the facilities may impede finding meaningful patterns.
In this regard, the use of dimension-reduction techniques can be one way of addressing the methodological concern. A dataset representing social infrastructure is essentially multivariate data, where each variable corresponds to one of the facilities. The variable can be logical, indicating the presence or absence of a particular facility in an area, or it can be numeric, which depicts the amount of the facility accessible from a given location. If we reduce several facilities with a similar geographic distribution into a single composite variable, it would be easier to explore regional differences in social infrastructure and identify underdeveloped areas.
Principal component analysis (PCA) is perhaps one of the most widely used multivariate techniques for studying social infrastructure and other urban features. It produces a set of uncorrelated variables called principal components (PCs), each of which is a linear combination of the original variables [8]. Unlike the original variables, the PCs are defined so that the first component explains as much of the total variation in the data as possible and the remaining components are ordered by the share of the remaining variation they account for. The usual objective of PCA is to summarize a large multivariate dataset by the first few components, minimizing the loss of important information.
In urban and regional studies, the construction of a composite index indicating the abundance of social infrastructure is one typical application of PCA. For example, Greyling and Tregenna [9] employed PCA to develop a comprehensive quality-of-life index that incorporates economic and non-economic variables, such as housing and infrastructure, social relationships, health, and safety. Similarly, Manitiu and Pedrini [10] calculated environmental, social, and cultural indices using the PCs that explain over 75% of the variance. Many other past studies have also demonstrated the usefulness of PCA for various urban analytic research, ranging from identifying housing submarkets to regionalizing cities and countries (see, for example, [11,12,13]).
Despite the successful applications of PCA in the literature, it has several drawbacks when the primary interest lies in interpreting, not just evaluating, the spatial unevenness and clustering of social infrastructure. PCs are mathematically determined to account for as much variance as possible while maintaining orthogonality, and the linear combinations derived usually involve all the original variables with non-zero weights or loadings [14,15]. Consequently, the combinations often become overly complicated when the dataset has many variables, making it difficult to recognize the tangible meaning of each component.
Furthermore, PCA does not explicitly consider spatially varying relationships that might be present in the data [16,17]. This limitation is particularly problematic when the analysis deals with large geographic areas. The degree of correlation between various facilities might not be uniform across space, so such spatial heterogeneity should be appropriately addressed to gain a more precise understanding of the process [18].
The ultimate purpose of this work is to develop a new methodological framework that can address the limitations of the existing methods. The proposed approach combines sparse PCA and the geographically weighted model. Sparse PCA is an optimization technique that attempts to reduce the number of non-zero loadings in PCs while retaining the variance explained by them as much as possible [14,15]. It shares the same objective as various rotation techniques [19] and hard thresholding approaches [20] in that it aims to increase the interpretability of PCs. However, sparse PCA achieves the desired level of sparsity more directly and works efficiently even when most variables in the dataset are strongly correlated. In this work, we employ sparse PCA based on a penalized least squares method called the lasso [14] and apply it to a geographically weighted subset of the data to take spatial heterogeneity into account.
We use a dataset on housing and service facilities in Korea to illustrate the proposed method. Although the majority of housing units and service facilities are concentrated in the most populated city, Seoul, and its vicinities, religious and recreational sites are likely spread over suburban and rural areas. It is also expected that different cities display different numbers of education, healthcare, and tourism facilities per capita, depending on their demographic and economic structure. The use of the local sparse PCA would allow us not only to identify the apparent urban–rural distinction but also to explore the underlying intercity differences from a spatial perspective.
The rest of this paper is organized as follows. In the next section, we briefly explain the basic idea behind PCA and its applications in urban and regional studies. The following section describes the methodological details of the proposed local sparse PCA. Section 4 presents the structure and standardization process of the dataset for demonstration, and the subsequent section compares the results from the proposed approach to those from the standard PCA. We conclude in Section 5 by highlighting the advantages and limitations of the proposed approach.
2. PCA in Urban and Regional Studies
The ultimate objective of any dimension reduction technique is to derive a smaller number of latent variables than the original dataset without losing important details. PCA, which is perhaps the most widely used dimension reduction technique in various disciplines, also aims to find new sets of variables that explain underlying patterns in complex multivariate data. Each latent variable, called a PC, is determined to account for the largest variance possible while maintaining orthogonality to the others [21]. The variance of the $j$th PC is usually denoted by $\lambda_j$, so the proportion of the total variance explained by the corresponding PC can be calculated as follows:
\[ \frac{\lambda_j}{\sum_{i=1}^{p} \lambda_i} \tag{1} \]
where $j = 1, \ldots, p$ and $p$ indicates the number of variables in the original dataset.
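As an illustration of the proportion of variance described above, the following minimal sketch computes the eigenvalues of a sample covariance matrix and their shares of the total. The function name `explained_variance_ratio` and the random example data are assumptions introduced here for illustration only; they are not part of the study.

```python
import numpy as np

def explained_variance_ratio(X):
    """Return covariance eigenvalues (largest first) and their shares of the total."""
    Xc = X - X.mean(axis=0)                   # centre each variable
    cov = np.cov(Xc, rowvar=False)            # p x p sample covariance matrix
    eigvals = np.linalg.eigvalsh(cov)[::-1]   # eigvalsh is ascending, so reverse
    eigvals = np.clip(eigvals, 0.0, None)     # guard against tiny negative values
    return eigvals, eigvals / eigvals.sum()

# Hypothetical data: four variables, two of them strongly correlated.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 1] += 2.0 * X[:, 0]
eigvals, ratio = explained_variance_ratio(X)
print(ratio)
```

The shares sum to one by construction, so the cumulative sum of `ratio` gives the variance retained by the first few PCs.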
There is a vast literature on using PCA in a wide range of disciplines, including but not limited to urban studies, environmental sciences, and remote sensing. In the context of urban studies, for example, Bitter and colleagues [22] applied PCA to data on eight housing attributes and reduced it to two PCs for geographically weighted regression. Similarly, Tahmasbi and Haghshenas [23] utilized PCA to construct a composite index of five accessibility measures to urban activities, and Labetski et al. [24] demonstrated its effectiveness in comparing complex 3D building footprints. In the field of environmental sciences, PCA has also been routinely used for studies on the distribution of tropospheric ozone [25] and soil contaminants [26] and the factors causing air pollution in large cities [27]. It is a common technique in remote sensing for simplifying hyperspectral imagery [28] or improving the performance of scene classification [29,30].
Despite numerous successful applications in the existing literature, PCA has limitations. Its inability to capture non-linearity and its sensitivity to outliers are well-known problems of this popular dimension reduction technique [8]. For urban and regional studies, it is also a critical shortcoming that PCA does not consider spatial characteristics of the data [16,18]. Since the presence of spatial autocorrelation is common in many urban phenomena, it is important to take spatial clustering of similar values into account during the dimension reduction process for an accurate summary of the original dataset.
Locally weighted PCA (LWPCA) and geographically weighted PCA (GWPCA) are recent extensions to the standard PCA, which construct a different set of PCs for different geographic locations in data. LWPCA assumes homogeneity of the covariance structure for observations close to each other in the attribute space [16]. It measures the distance between the observations and estimates eigenvalues and eigenvectors not only at observed locations but also at unobserved data locations. GWPCA can identify areas where it is inappropriate or overly simplistic to assume the same basic structure at all locations. It allows us to evaluate how the data dimension changes spatially and how the original variables affect each spatially changing component. It is also useful for addressing collinearity in geographically weighted regression (GWR) [18].
Notwithstanding the advantages of LWPCA and GWPCA, their results are often challenging to interpret in practice. The volume of PCs derived from these local approaches is enormous, so evaluating individual components comprising all variables from the original dataset is infeasible. Sparse PCA is another extension to the standard PCA, whose primary goal is to simplify the component structure. It imposes penalty parameters on estimating the loadings for variables so that each PC can be as sparse as possible in its structure, making it easier to interpret. In this work, we combine sparse PCA with the local statistics framework to facilitate the evaluation of spatial heterogeneity in a large multivariate dataset.
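To make the effect of a sparsity penalty concrete, the sketch below estimates a single sparse component with a soft-thresholded power iteration. This is a deliberately simpler scheme than the lasso-based algorithm of [14] adopted in this work, and it is shown only to illustrate how a penalty drives small loadings to exactly zero. The function names, the penalty value, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def soft_threshold(a, penalty):
    """Shrink each entry towards zero by `penalty`; small entries become exactly zero."""
    return np.sign(a) * np.maximum(np.abs(a) - penalty, 0.0)

def sparse_first_pc(X, penalty, n_iter=100):
    """Rank-one sparse PC via alternating updates (not the algorithm of [14])."""
    Xc = X - X.mean(axis=0)
    v = np.linalg.svd(Xc, full_matrices=False)[2][0]  # start from the ordinary first PC
    for _ in range(n_iter):
        scores = Xc @ v
        norm = np.linalg.norm(scores)
        if norm == 0.0:                # every loading was shrunk to zero
            break
        u = scores / norm              # unit-length score vector
        v = soft_threshold(Xc.T @ u, penalty)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0.0 else v

# Hypothetical data: two variables share a strong signal, two are weak noise.
rng = np.random.default_rng(0)
n = 200
signal = rng.normal(scale=3.0, size=n)
X = np.column_stack([
    signal + rng.normal(scale=0.1, size=n),
    signal + rng.normal(scale=0.1, size=n),
    rng.normal(scale=0.1, size=n),
    rng.normal(scale=0.1, size=n),
])
dense = sparse_first_pc(X, penalty=0.0)   # no penalty: ordinary first PC
sparse = sparse_first_pc(X, penalty=5.0)  # penalty removes the noise variables
print(np.round(dense, 3), np.round(sparse, 3))
```

With a zero penalty the iteration reduces to ordinary power iteration, whereas the penalized run keeps only the two variables that carry the shared signal.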
4. Results and Discussion
4.1. Standard PCA
Table 2 presents the loadings of the first three PCs derived from the standard PCA. We chose these components because they have eigenvalues greater than the mean of all eigenvalues, satisfying Kaiser’s criterion [36,37]. Although the first PC alone accounts for a significant proportion of the total variance in the dataset (i.e., 77.61%), the other two may also contain information on different aspects of the social infrastructure distribution. In this section, therefore, we examine the composition of these three components and the spatial configuration of the associated scores.
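The selection rule stated above, retaining components whose eigenvalues exceed the mean eigenvalue, can be sketched as follows. The function name `kaiser_components` and the eigenvalues in the example are hypothetical and are not the values obtained in this study.

```python
import numpy as np

def kaiser_components(eigvals):
    """Indices of components whose eigenvalue exceeds the mean of all eigenvalues."""
    eigvals = np.asarray(eigvals, dtype=float)
    return np.flatnonzero(eigvals > eigvals.mean())

# Hypothetical eigenvalues, sorted from largest to smallest (mean is about 1.17).
example_eigvals = [4.0, 1.5, 1.2, 0.2, 0.06, 0.04]
kept = kaiser_components(example_eigvals)
print(kept)  # indices of the retained components
```

When PCA is run on a correlation matrix the mean eigenvalue equals one, so the rule reduces to the familiar "eigenvalue greater than one" form.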
A notable feature of the first PC is that all loadings have a positive sign. Furthermore, most have similar values ranging between 0.15 and 0.25, possibly due to the strong correlations between the variables. The largest loading value of 0.335 is attached to the variable ‘performing_venues’, and the smallest of 0.001 to the variable ‘camp_grounds’. As illustrated in Figure 3, the performing art venues are clustered mainly in the Seoul metropolitan area, whereas most camping grounds are in the vicinity of Seoul but not within it. This observation implies that the first PC may be an indicator that distinguishes Seoul from the rest of the country.
On the other hand, there is a mixture of positive and negative loading values in the following two components, making their interpretation less straightforward. Some variables sharing similar characteristics have opposite effects on the same components. For example, both cultural amenities, such as the performing art venues, and medical facilities, including hospitals and postnatal care services, are typical urban features, but they have different signs in the second PC. The same ambiguity occurs in the third one as well. While the presence of large- and medium-scale hospitals leads to a decrease in the PC scores, retail pharmacies, which have almost identical distribution, would increase them.
The maps in Figure 4 also depict some degrees of mixed information across the components. Figure 4a clearly distinguishes Seoul, located in the northwest, from the rest of the country. Figure 4b,c, on the other hand, highlight other urban areas, such as Busan in the southeast end of the mainland and Daegu, the fourth most populated city in the country. At the same time, however, they also exhibit high scores for the grids representing the Seoul metropolitan area mainly because considerable proportions of the medical and educational facilities are clustered in this region. These patterns imply that although the first three components are uncorrelated in an aspatial matrix form, their spatial arrangements could overlap to an extent.
Considering the amount of variance explained by the first PC, it can be an effective indicator that conveys the most significant information regarding the distribution of social infrastructure. Nevertheless, it is a linear combination of all variables with similar weights, so its interpretation is not much simpler than that of the original dataset. Furthermore, since the orthogonality of the components does not guarantee their spatial independence from each other, the subsequent components may provide only little unique information. In the following section, therefore, we will apply sparse PCA to local subsets of the dataset and discuss how this approach addresses the limitations.
4.2. Local Sparse PCA
Before applying sparse PCA to local subsets of the example dataset, we conducted it for the overall distribution of social infrastructure to seek optimal values for the penalty parameters. For comparability to the results from the standard PCA, we used the same number of PCs, k = 3, in this section. The global parameter was set to a small value close to 0.
Figure 5 illustrates how the sum of the variance explained by the first three components changes with different combinations of the component-specific penalty parameters. It shows a clear tradeoff between the degree of sparsity imposed and the amount of variance explained: the relationship is not linear but is monotonically decreasing. With small penalty parameters, sparse PCA accounted for as much variance as the standard PCA but was only marginally different in component structure. As the penalty parameters increased, the degree of sparsity escalated considerably, but the representativeness of the PCs appeared to diminish in return.
It may be worth noting that the variance explained by sparse PCA is not directly comparable to that of the standard PCA. The PCs from sparse PCA are not necessarily orthogonal; thus, Equation (1) will likely overestimate the variance explained. In this work, therefore, we calculated the variance explained in the way described in [14].
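The overestimation arises because correlated components share variance that a naive sum counts more than once. A minimal sketch of one widely used correction, the QR-based adjusted variance, is given below; whether it matches the procedure in [14] in every detail is an assumption, and the function name `adjusted_variance` and the example data are hypothetical.

```python
import numpy as np

def adjusted_variance(X, V):
    """Share of total variance per component after removing overlap with earlier ones.

    X is an n x p data matrix; V is a p x k loading matrix whose columns
    need not be orthogonal.
    """
    Xc = X - X.mean(axis=0)
    Z = Xc @ V                          # component scores, possibly correlated
    _, R = np.linalg.qr(Z)              # Z = QR; R[j, j] is what column j adds anew
    adj = np.diag(R) ** 2 / (Xc.shape[0] - 1)
    total = np.trace(np.cov(Xc, rowvar=False))
    return adj / total

# Hypothetical check: duplicating a component must not add any explained variance.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
Xc = X - X.mean(axis=0)
_, s, vt = np.linalg.svd(Xc, full_matrices=False)
ordinary = adjusted_variance(X, vt[:2].T)                         # orthogonal loadings
duplicated = adjusted_variance(X, np.column_stack([vt[0], vt[0]]))
print(ordinary, duplicated)
```

For orthogonal loadings the adjusted shares coincide with Equation (1), so the correction only matters once the sparsity penalty breaks orthogonality.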
The boxplot suggests that the penalty parameters should be carefully chosen based on the purpose of analysis. In this example, if the objective were to find one or two composite indices that effectively represent the entire dataset, modest penalties that produce 18 or 19 non-zero loadings in the first PC would be useful. Such a setting removes over one-third of the social infrastructure variables from the component but still explains most of the variance in their distribution. On the other hand, if the interest lies in exploring data and generating a hypothesis, larger parameter values may simplify the PCs and disclose the most significant pattern.
Table 3 shows the structures of the first three PCs where only six variables have non-zero loadings. Although these components together explain less than half of the total variance in the original dataset (Figure 5), the implication of each component is more apparent than those in Table 2 because it has only a few loadings, all of which are in the same direction.
From a spatial perspective, however, they seem to portray somewhat similar distributions of the score (Figure 6). As with the case of the standard PCA, the concentration of social infrastructure around Seoul is highlighted in the distribution of the first PC scores (Figure 6a), concealing other regional differences. The other two maps, Figure 6b,c, convey practically the same visual impression, except for more scattered extreme values in the southern part of the country. These results suggest that the global application of sparse PCA may not be more effective than the standard PCA in exploring spatial heterogeneity.
To overcome this limitation of the conventional global application, we applied the sparse PCA algorithm to local subsets with the same penalty parameters. At each location, the corresponding subset was determined using a binary weight matrix whose value is set to 1 for the areas within 100 km of that location and 0 otherwise. This resulted in the same number of local subsets as the number of grids (i.e., 1354). Although the distance threshold of 100 km is large enough to include sufficient observations in most subsets, there were only a few in some grids representing remote islands. We conducted sparse PCA only at those with more than ten observations, that is, 1346 local subsets (Figure 7).
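The construction of local subsets described above can be sketched as follows. The sketch assumes planar coordinates in kilometres, an inclusive 100 km threshold, and that each location belongs to its own subset; these choices, the function name `local_subsets`, and the toy lattice are assumptions for illustration rather than details confirmed by the text.

```python
import numpy as np

def local_subsets(coords_km, threshold_km=100.0, min_obs=11):
    """Binary distance weights and the local subsets that are large enough.

    coords_km is an n x 2 array of planar coordinates in kilometres.
    min_obs=11 encodes "more than ten observations".
    """
    coords_km = np.asarray(coords_km, dtype=float)
    diff = coords_km[:, None, :] - coords_km[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))     # n x n pairwise distances
    W = (dist <= threshold_km).astype(int)       # 1 within the threshold, 0 otherwise
    subsets = {
        i: np.flatnonzero(W[i])                  # indices of observations in subset i
        for i in range(len(coords_km))
        if W[i].sum() >= min_obs
    }
    return W, subsets

# Hypothetical layout: a 12 x 12 lattice at 20 km spacing plus one remote point.
xs, ys = np.meshgrid(np.arange(12) * 20.0, np.arange(12) * 20.0)
coords = np.column_stack([xs.ravel(), ys.ravel()])
coords = np.vstack([coords, [1000.0, 1000.0]])   # stands in for a remote island
W, subsets = local_subsets(coords)
print(len(subsets), "of", len(coords), "locations retained")
```

Each retained subset would then be passed to the sparse PCA routine with the same penalty parameters, while locations such as the remote point are skipped.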
Due to the volume of local outputs, it would be impractical to examine each PC from this approach. Therefore, we summarized the results based on the combination of non-zero loadings in the first PC to facilitate the interpretation process. While there could be up to 736,281 different configurations for choosing six variables from 31 variables, our empirical example found only 370 combinations among the 1346 subsets.
Table 4 presents those that occurred at least 20 times, labeling them from groups 1 to 9 in the order of occurrence. These groups account for about one-third of the total grids (i.e., 448 out of 1354).
Figure 8 shows distinctive spatial distributions for these groups and reveals the regional characteristics that the global approaches could not address. For example, the PCs around Seoul consist of housing units and daily living amenities, such as supermarkets and hospitals (i.e., group 1), likely because these variables distinguish inhabited from uninhabited areas in this well-developed part of the country. Similarly, the components around Jeju Island and the eastern coast areas include accommodations and other tourism-related facilities, as they contrast these tourist destinations with nearby regions. These results provide empirical evidence that the proposed approach can be an effective tool for exploring spatial heterogeneity in various urban and regional phenomena, including the distribution of social infrastructure.
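The summary step used in this section, grouping local results by the set of variables with non-zero loadings in the first PC, can be sketched as below, together with a check of the number of possible six-variable configurations. The function names, the tolerance, and the toy loadings are hypothetical, as are the variable names in the example.

```python
from collections import Counter
from math import comb

# Number of ways to choose six variables out of 31.
n_configurations = comb(31, 6)

def support_signature(loadings, names, tol=1e-12):
    """Sorted tuple of variable names whose loading is non-zero."""
    return tuple(sorted(n for n, w in zip(names, loadings) if abs(w) > tol))

def group_by_support(local_loadings, names, min_count=20):
    """Count identical supports across local subsets and keep the frequent ones."""
    counts = Counter(support_signature(v, names) for v in local_loadings)
    return [(sig, c) for sig, c in counts.most_common() if c >= min_count]

# Hypothetical first-PC loadings from four local subsets.
names = ["housing_units", "supermarkets", "hospitals", "accommodations"]
local_loadings = [
    [0.7, 0.7, 0.0, 0.0],
    [0.6, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.9],
    [0.9, 0.4, 0.0, 0.0],
]
groups = group_by_support(local_loadings, names, min_count=2)
print(n_configurations, groups)
```

Because the grouping looks only at which variables are selected, two subsets with the same support but very different loading magnitudes fall into the same group.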
5. Conclusions
5.1. Summary and Implications
The use of quantitative research methods has been a common practice in urban and regional studies for a long time, dating back to as early as the 1950s. There is much literature that uses a wide range of statistical techniques for exploring human–space interactions, from simple hypothesis testing statistics to sophisticated regression models. While the escalation in the volume and range of open data over the past decade has enabled more accurate modeling of dynamic urban phenomena, it has also become challenging to explore and process the data. Such big data often consist of numerous correlated variables, violating statistical assumptions in conventional methods and making it difficult to uncover underlying patterns.
The ultimate purpose of this work was to develop a method that can assist exploratory spatial analysis for multidimensional data and allow more intuitive interpretation. In particular, we focused on applications to examine spatial inequalities in social infrastructure, including housing and service facilities. To this end, we presented an extension of PCA that constructs sparse PCs for each local subset. It used an elastic net-based algorithm [14] to obtain the required number of non-zero loadings and a kernel function to define local subsets. The results from the proposed approach could be summarized based on the combination of the selected variables in the first PC and then mapped to identify regional differences or inequalities.
To demonstrate the advantages and limitations of the proposed approach, we applied it to the distribution of housing and service facilities in Korea. Although the first PC derived from the standard PCA accounted for most of the variance in the example dataset, it comprised all variables with similar weights. On the other hand, the local sparse PCA could yield PCs with only a few non-zero loadings and enable more straightforward interpretation. Furthermore, the selected variables in each local subset exhibited clear geographic patterns. For example, housing units and daily living amenities, such as supermarkets and hospitals, seemed to be the distinguishing features for urban and suburban parts of Seoul, whereas tourism-related facilities were unique characteristics of specific regions. These results offered valuable insights into the spatial patterns of social infrastructure, which the standard PCA only partly addressed.
This work provides empirical evidence that the recent sparse PCA method can be an effective alternative to the traditional dimension reduction techniques in urban and regional studies. The algorithm adopted in this work is suitable for identifying spatial relationships between variables and exploring their scales, as it runs even when there are more variables than the number of observations [14]. Considering that the small sample size is a persistent problem in local spatial analysis [38], this would be an important advantage over other variations of PCA.
Our case study also showed a monotonic inverse relationship between the degree of sparsity in PCs and the amount of variance explained. The PCs with no penalties applied accounted for as much variance as those from the standard PCA, but the components where half of the variables had zero loadings explained less than 40% of the total variance on average. This clear tradeoff suggests a need to tailor the penalty parameters to the purpose of analysis. Small penalty parameters would be appropriate if the analysis aims to find a few index variables that can summarize the entire dataset. For exploratory analysis, on the other hand, several different configurations may be examined for the balance between interpretability and explainability.
5.2. Limitations and Future Recommendations
Despite the merits of the proposed approach, it also has some drawbacks. First, the results from the proposed approach depend on various parameters whose correct values are generally unknown a priori. In practice, the parameters are often estimated by experiments, but this requires considerable computing resources for iterative evaluations. Moreover, since it is usually infeasible to test all possible configurations of the parameters, the optimal selection is not guaranteed. In this work, we chose the parameters manually by trial and error, and further research is needed on tuning the parameters automatically, for example with state-of-the-art machine learning techniques.
Second, although the combination of the variables with non-zero loadings is a functional criterion for classifying local results, it inevitably causes some loss of information. Sparse PCA usually produces nonnegligible coefficients for all selected variables, but there could still be substantial variations in absolute values. Furthermore, the local subsets possessing the same set of the selected variables may have contrasting distributions of the loadings. Since the current classification method is based only on the inclusion and exclusion of variables, it cannot take such differences in the loadings into account, limiting its ability to distinguish local results.
Third, we did not leverage the characteristics of the facilities in the analysis. Many studies of urban facilities focus on the extent to which people can access the facilities of interest, such as hospitals and grocery stores [39]. In this sense, considering the characteristics of the facilities, rather than only their spatial locations, may provide an improved understanding of their spatial distributions.
Notwithstanding these limitations, the value of the local sparse PCA lies in its capability to address spatially heterogeneous phenomena in reduced dimensions. While various geographically weighted models already exist for exploring spatial effects, they often produce an overwhelming volume of results that are not easy to interpret. In this work, we suggested using sparse PCA for each local subset so that each PC has a simple structure. As a result, each PC is not only intuitive to interpret on its own but also more straightforward to summarize across space. Although more empirical applications are required to verify the strengths and limitations demonstrated, the proposed approach can be a valuable addition to the exploratory tools in urban and regional studies.