Local Sparse Principal Component Analysis for Exploring the Spatial Distribution of Social Infrastructure

Hong, Seong-Yun; Moon, Seonggook; Chi, Sang-Hyun; Cho, Yoon-Jae; Kang, Jeon-Young

doi:10.3390/land11112034


by Seong-Yun Hong ¹, Seonggook Moon ²,*, Sang-Hyun Chi ¹, Yoon-Jae Cho ¹ and Jeon-Young Kang ³

¹ Department of Geography, Kyung Hee University, Seoul 02447, Republic of Korea

² LX Education Institute, Gongju-si 32522, Chungcheongnam-do, Republic of Korea

³ Department of Geography Education, Kongju National University, Gongju-si 32522, Chungcheongnam-do, Republic of Korea

* Author to whom correspondence should be addressed.

Land 2022, 11(11), 2034; https://doi.org/10.3390/land11112034

Submission received: 21 October 2022 / Revised: 4 November 2022 / Accepted: 10 November 2022 / Published: 13 November 2022


Abstract:

The primary purpose of this study is to develop a method that can assist in exploring infrastructure-related multidimensional data. The spatial distribution of social infrastructure, including housing and service facilities, is usually uneven across a nation. The underlying reasons behind the spatial configuration of infrastructure vary, and a comprehensive examination is crucial to understanding the true implications of this skewed distribution. However, simultaneous examination of all social infrastructure is not always straightforward due to the volume of data. The presence of strong correlations between the facilities may further impede the finding of meaningful patterns. To this end, we present an extension of principal component analysis (PCA) that constructs sparse principal components for local subsets of the data. To demonstrate its strengths and limitations, we apply it to a dataset on housing and service facilities in Korea. The results exhibit clear geographic patterns and offer valuable insights into the spatial patterns of social infrastructure, which the standard PCA only partly addressed. This provides empirical evidence that the proposed method can be an effective alternative to the traditional dimension reduction techniques for exploring spatial heterogeneity in massive multidimensional data.

Keywords:

exploratory spatial analysis; principal component analysis; sparse loadings; spatial heterogeneity; spatial distribution; social infrastructure; urban analytics

1. Introduction

The spatial distribution of social infrastructure, including housing and service facilities, is usually not even across a nation [1,2]. Cultural amenities are often concentrated in large cities, and accommodations are primarily located in holiday destinations. The level of accessibility to healthcare could be considerably different not only between urban and rural areas but also between cities [3]. While such unevenness may indicate underdevelopment in particular localities, it is sometimes due to different regional demographic and economic structures [4,5]. The underlying reasons behind the spatial configuration of infrastructure vary, and a comprehensive examination is crucial to understanding the true implications of this skewed distribution.

From a methodological point of view, the difficulties in evaluating the spatial distribution of social infrastructure arise from the fact that it encompasses a vast range of human-built features. Even a strict delineation of social infrastructure refers to all physical places that promote sociality, which include schools, hospitals, cafés, and restaurants [6,7]. The increasing availability of open data has facilitated the acquisition of necessary information for analysis. However, simultaneous examination of all social infrastructure is still challenging, and the presence of strong correlations between the facilities may impede finding meaningful patterns.

In this regard, the use of dimension-reduction techniques can be one way of addressing the methodological concern. A dataset representing social infrastructure is essentially multivariate data, where each variable corresponds to one of the facilities. The variable can be logical, indicating the presence or absence of a particular facility in an area, or it can be numeric, which depicts the amount of the facility accessible from a given location. If we reduce several facilities with a similar geographic distribution into a single composite variable, it would be easier to explore regional differences in social infrastructure and identify underdeveloped areas.

Principal component analysis (PCA) is perhaps one of the most widely used multivariate statistics for studying social infrastructure and other urban features. It produces a set of uncorrelated variables called principal components (PCs), each of which is a linear combination of the original variables [8]. Unlike the original variables, the PCs are defined so that the first component explains the total variation of the data as much as possible and that the other components are also ordered by the remaining variation they account for. The usual objective of PCA is to summarize a large multivariate dataset by the first few components that can minimize the loss of important information.

In urban and regional studies, the construction of a composite index indicating the abundance of social infrastructure is one typical application of PCA. For example, Greyling and Tregenna [9] employed PCA to develop a comprehensive quality-of-life index that incorporates economic and non-economic variables, such as housing and infrastructure, social relationships, health, and safety. Similarly, Manitiu and Pedrini [10] calculated environmental, social, and cultural indices using the PCs that explain over 75% of the variance. Many other past studies have also demonstrated the usefulness of PCA for various urban analytic research, ranging from identifying housing submarkets to regionalizing cities and countries (see, for example, [11,12,13]).

Despite the successful applications of PCA in the literature, it has several drawbacks when the primary interest lies in interpreting, not just evaluating, the spatial unevenness and clustering of social infrastructure. PCs are mathematically determined to account for as much variance as possible while maintaining orthogonality, and the linear combinations derived usually involve all the original variables with non-zero weights or loadings [14,15]. Consequently, the combinations often become overly complicated when the dataset has many variables, making it difficult to recognize the tangible meaning of each component.

Furthermore, PCA does not explicitly consider spatially varying relationships that might be present in the data [16,17]. This limitation is particularly problematic when the analysis deals with large geographic areas. The degree of correlations between various facilities might not be uniform across space, so such spatial heterogeneity should be appropriately addressed to gain a more precise understanding of the process [18].

The ultimate purpose of this work is to develop a new methodological framework that can address the limitations of the existing methods. The proposed approach combines sparse PCA and the geographically weighted model. Sparse PCA is an optimization technique that attempts to reduce the number of non-zero loadings in PCs while retaining the variance explained by them as much as possible [14,15]. It shares the same objective as various rotation techniques [19] and hard thresholding approaches [20] in that it aims to increase the interpretability of PCs. However, sparse PCA achieves the desired level of sparsity more directly and works efficiently even when most variables in the dataset are strongly correlated. In this work, we employ sparse PCA based on a penalized least squares method called the lasso [14] and apply it to a geographically weighted subset of the data to take spatial heterogeneity into account.

We use a dataset on housing and service facilities in Korea to illustrate the proposed method. Although the majority of housing units and service facilities are concentrated in the most populated city, Seoul, and its vicinities, religious and recreational sites are likely spread over suburban and rural areas. It is also expected that different cities display different numbers of education, healthcare, and tourism facilities per capita, depending on their demographic and economic structure. The use of the local sparse PCA would allow us not only to identify the apparent urban–rural distinction but also to explore the underlying intercity differences from a spatial perspective.

The rest of this paper is organized as follows. In the next section, we briefly explain the basic idea behind PCA and its applications in urban and regional studies. Section 3 presents the structure and standardization process of the dataset for demonstration and describes the methodological details of the proposed local sparse PCA. Section 4 compares the results from the proposed approach to those from the standard PCA. We conclude in Section 5 by highlighting the advantages and limitations of the proposed approach.

2. PCA in Urban and Regional Studies

The ultimate objective of any dimension reduction technique is to derive a smaller number of latent variables than the original dataset without losing important details. PCA, which is perhaps the most widely used dimension reduction technique in various disciplines, also aims to find new sets of variables that explain underlying patterns in complex multivariate data. Each latent variable, called a PC, is determined to account for the largest variance possible while maintaining orthogonality to the others [21]. The variance of the $i$th PC is usually denoted by $\lambda_{i}$, so the proportion of the total variance explained by the corresponding PC can be calculated as follows:

$$P_{i} = \frac{\lambda_{i}}{\operatorname{trace}(S)} \tag{1}$$

where $\operatorname{trace}(S) = \sum_{i=1}^{n} \lambda_{i}$ and $n$ indicates the number of variables in the original dataset.
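Equation (1) can be illustrated with a short numerical sketch. The example below assumes that $S$ is the sample covariance matrix of the data and uses randomly generated values. The function name `explained_variance_ratio` and the toy data are hypothetical and are not part of the study.

```python
import numpy as np

def explained_variance_ratio(X):
    """Return the eigenvalues of the covariance matrix S (in descending order)
    and the proportions P_i = lambda_i / trace(S) from Equation (1)."""
    S = np.cov(X, rowvar=False)            # assumed: S is the sample covariance matrix
    eigvals = np.linalg.eigvalsh(S)[::-1]  # eigvalsh returns ascending order; reverse it
    return eigvals, eigvals / np.trace(S)

# Toy data: 200 observations of 5 correlated variables (made-up values).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
eigvals, ratios = explained_variance_ratio(X)
```

Because the trace of $S$ equals the sum of its eigenvalues, the proportions returned by this sketch sum to one.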

There is a vast literature on using PCA in a wide range of disciplines, including but not limited to urban studies, environmental sciences, and remote sensing. In the context of urban studies, for example, Bitter and colleagues [22] applied PCA to data on eight housing attributes and reduced it to two PCs for geographically weighted regression. Similarly, Tahmasbi and Haghshenas [23] utilized PCA to construct a composite index of five accessibility measures to urban activities, and Labetski et al. [24] demonstrated its effectiveness in comparing complex 3D building footprints. In the field of environmental sciences, PCA has also been routinely used for studies on the distribution of tropospheric ozone [25] and soil contaminants [26] and the factors causing air pollution in large cities [27]. It is a common technique in remote sensing for simplifying hyperspectral imagery [28] or improving the performance of scene classification [29,30].

Despite numerous successful applications in the existing literature, PCA has limitations. The incapability to address non-linearity and sensitivity to outliers are the well-known problems of this popular dimension reduction technique [8]. For urban and regional studies, it is also a critical shortcoming that PCA does not consider spatial characteristics of the data [16,18]. Since the presence of spatial autocorrelation is common in many urban phenomena, it is important to take spatial clustering of similar values into account during the dimension reduction process for an accurate summary of the original dataset.

Locally weighted PCA (LWPCA) and geographically weighted PCA (GWPCA) are recent extensions to the standard PCA, which construct a different set of PCs for different geographic locations in data. LWPCA assumes homogeneity of the covariance structure for observations close to each other in the attribute space [16]. It measures the distance between the observations and estimates eigenvalues and eigenvectors not only at observed locations but also at unobserved data locations. GWPCA can identify areas where it is inappropriate or overly simplistic to assume the same basic structure at all locations. It allows us to evaluate how the data dimension changes spatially and how the original variable affects each spatially changing component. It is also useful for solving the collinearity of GWR [18].

Notwithstanding the advantages of LWPCA and GWPCA, their results are often challenging to interpret in practice. The volume of PCs derived from these local approaches is enormous, so evaluating individual components comprising all variables from the original dataset is infeasible. Sparse PCA is another extension to the standard PCA, whose primary goal is to simplify the component structure. It imposes penalty parameters on estimating the loadings for variables so that each PC can be as sparse as possible in its structure, making it easier to interpret. In this work, we combine sparse PCA with the local statistics framework to facilitate the evaluation of spatial heterogeneity in a large multivariate dataset.

3. Data and Method

3.1. Social Infrastructure Data of Korea

In this work, we explore the underlying spatial relationship in the social infrastructure data of Korea. The dataset has 31 variables representing housing and service facilities, which were obtained from Statistics Korea and the Korea Local Information Research & Development Institute. Each observation corresponds to the number of these facilities in a 10 km-by-10 km grid defined by Statistics Korea. There are 1354 grids covering the entire country, as shown in Figure 1.

Since the variables differ in magnitude, their variances also vary greatly (Table 1). The standard deviation for postnatal care services is only 3.7, whereas the same statistic for apartments, the most common housing type in the country, is 32,413. Although this size effect can be eliminated by applying PCA to the correlation matrix of the dataset, doing so may conceal important differences between the facilities. Thus, we column-standardized and mean-centered each variable, $x_{i}$, as below:

$$x_{i}^{'} = \frac{x_{i} - \bar{x}}{\sum_{j=1}^{n} x_{j}} \tag{2}$$

The standardized variable, $x_{i}^{'}$, can indicate the relative abundance of the corresponding facility at a given location.
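Equation (2) can be expressed compactly in code. The following is a minimal sketch, assuming the data are held in a NumPy array with one row per grid and one column per facility, and that every column has a non-zero total. The function name `standardize_columns` and the toy counts are hypothetical.

```python
import numpy as np

def standardize_columns(X):
    """Apply Equation (2) column by column: subtract the column mean,
    then divide by the column total."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.sum(axis=0)  # assumes non-zero column totals

# Made-up counts of two facility types in three grids.
counts = np.array([[10.0, 0.0],
                   [30.0, 2.0],
                   [60.0, 4.0]])
Z = standardize_columns(counts)
```

Each transformed column has a mean of zero, while its spread still reflects how unevenly the facility is distributed across the grids.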

Table 1 presents the mean and standard deviation of all variables before the transformation, as well as the standard deviation after the transformation. It shows that the standardized variables still have different degrees of variance but on a comparable scale. The mean after the transformation is omitted, as all variables have the same mean of zero.

Figure 2 shows the spatial distributions of all facilities. In general, the spatial distribution of social infrastructure appears to be associated with that of the population. Since Seoul is the capital city and a densely populated area in the country, many residential facilities are located in the region (i.e., the upper-left region in the figures). Busan, another major city in Korea (i.e., the lower-right region in the figures), also accommodates many housing units. Likewise, there are concentrations of housing units in other major cities (i.e., Incheon, Gwangju, Daegu, Daejeon, and Sejong).

Compared to the housing units, some service facilities seem to be evenly distributed nationwide. For example, variables such as ‘museums’, ‘culture_centres’, ‘accommodation’, ‘camp_grounds’, and ‘temples’ do not show notable clustering. On the other hand, facilities whose demand is driven by the resident population are still predominantly located in the major cities. Examples include the variables ‘schools’, ‘hagwon’, ‘hospitals’, ‘clinics_gp’, ‘pharmacies’, and ‘postnatal’.

The spatial patterns presented in this dataset demonstrate typical distributions of social infrastructure in a country. While the urban concentration of many facilities would make PCA effective for reducing the number of variables, the strong collinearity is likely to pose difficulties in interpretation. However, it could be an ideal setting for demonstrating the advantages of the proposed method, which focuses on the interpretability of the resulting components. In the subsequent sections, therefore, we present the proposed method in detail, then apply it to the dataset of Korea to illustrate its relative strengths.

3.2. Method

This work adopts the sparse PCA algorithm proposed in [14]. It solves the following equation to obtain the sparse loadings of the first $k$ ordinary PCs:

$$\beta_{j} = \underset{\beta}{\arg\min}\; (\alpha_{j} - \beta)^{T} X^{T} X (\alpha_{j} - \beta) + \lambda \|\beta\|^{2} + \lambda_{1,j} \|\beta\|_{1} \tag{3}$$

where $A = [\alpha_{1}, \alpha_{2}, \dots, \alpha_{k}]$ denotes the original loadings of the first $k$ components, and $\lambda$ and $\lambda_{1,j}$ are non-negative penalty parameters that determine the degree of sparsity. When the number of observations is greater than that of variables (i.e., $n > p$), $\lambda$ can be set to zero for simplicity. In practice, however, the use of a small positive number for $\lambda$ is recommended when there are collinearities in the data [14]. The other penalty parameter, $\lambda_{1,j}$, affects the number of non-zero loadings for each component, and it is generally chosen by experiments.

Since the penalty in this equation is a linear combination of the ridge and lasso penalties, we can calculate $\beta_{j}$ for all $j \in \{1, 2, \dots, k\}$ using an elastic net method [31]. Once $B = [\beta_{1}, \beta_{2}, \dots, \beta_{k}]$ is estimated, a singular value decomposition $X^{T} X B = U D V^{T}$ gives us a new set of loadings, $A = U V^{T}$. The algorithm solves Equation (3) repeatedly using the updated loadings until it converges. The sparsity in $\beta_{j}$ increases as $\lambda_{1,j}$ becomes large, and so does the sparsity in the PCs, whose loadings are defined as $\hat{V}_{j} = \beta_{j} / \|\beta_{j}\|$.
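The alternating procedure described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used in the study. It solves the elastic net subproblem in Equation (3) by plain coordinate descent, takes the matrix $X^{T} X$ directly as input, and uses hypothetical function names, iteration limits, and tolerances. If $\lambda = 0$, every diagonal entry of $X^{T} X$ is assumed to be positive.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used for the lasso penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def elastic_net_step(G, alpha, lam, lam1, beta, n_sweeps=100, tol=1e-10):
    """Coordinate descent for Equation (3) with alpha held fixed.
    G stands for X^T X; assumes G[l, l] + lam > 0 for every l."""
    c = G @ alpha
    beta = beta.copy()
    for _ in range(n_sweeps):
        beta_old = beta.copy()
        for l in range(G.shape[0]):
            # Partial residual that excludes the contribution of coordinate l.
            rho = c[l] - G[l] @ beta + G[l, l] * beta[l]
            beta[l] = soft_threshold(rho, lam1 / 2.0) / (G[l, l] + lam)
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta

def sparse_pca(G, k, lam, lam1, n_iter=200, tol=1e-8):
    """Alternate between the elastic net step for B and the SVD update for A.
    lam1 is a length-k sequence of component-specific penalties."""
    _, eigvecs = np.linalg.eigh(G)
    A = eigvecs[:, ::-1][:, :k]        # ordinary loadings of the first k PCs
    B = A.copy()
    for _ in range(n_iter):
        B_old = B.copy()
        for j in range(k):
            B[:, j] = elastic_net_step(G, A[:, j], lam, lam1[j], B[:, j])
        U, _, Vt = np.linalg.svd(G @ B, full_matrices=False)
        A = U @ Vt                     # update A = U V^T
        if np.max(np.abs(B - B_old)) < tol:
            break
    norms = np.linalg.norm(B, axis=0)
    norms[norms == 0] = 1.0            # leave all-zero columns untouched
    return B / norms                   # normalized sparse loadings

# Small made-up example: one sparse component from a 2-by-2 matrix.
G_demo = np.array([[2.0, 0.5], [0.5, 1.0]])
V_demo = sparse_pca(G_demo, k=1, lam=0.0, lam1=np.array([1.2]))
```

Passing the matrix rather than the raw data keeps the sketch agnostic about how that matrix is formed, so a weighted version can be supplied without changing the function.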

In this work, we propose to apply this algorithm to local subsets of a given spatial dataset. If the relationship between the variables varies across space, the proposed approach can uncover the pattern and extent of the underlying non-stationarity. To this end, we substitute the covariance matrix $X^{T} X$ in Equation (3) with a geographically weighted one, $X^{T} W_{u,v} X$, at each location $(u, v)$, as suggested in [18,32]. The matrix $W_{u,v}$ is an $n$-by-$n$ diagonal matrix containing the spatial weights of all locations with respect to $(u, v)$. In the simplest form, it can be a binary matrix where a value of 1 indicates that the corresponding location is part of the local subset and a value of 0 indicates otherwise.

While any kernel function can be adapted to determine the weights, past studies have shown that a simple bi-square function works reasonably well:

$$w_{i,i} = \begin{cases} \left(1 - \left(\frac{d_{i}}{r}\right)^{2}\right)^{2}, & d_{i} \leq r \\ 0, & d_{i} > r \end{cases} \tag{4}$$

where $d_{i}$ is the Euclidean distance from $(u, v)$ to the location of interest $i$, and $r$ refers to the bandwidth. In general, the bandwidth is considered a more critical parameter than the choice of the kernel function. Although there is a massive amount of literature on optimal bandwidth selection (see, for example, [33,34,35]), it is often chosen through iterations or based on a priori knowledge about the scale of spatial heterogeneity present in the given dataset.
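The bi-square weights in Equation (4) and the geographically weighted matrix $X^{T} W_{u,v} X$ can be sketched as below. The function names `bisquare_weights` and `weighted_gram`, the coordinates, and the data values are hypothetical and serve only as an illustration.

```python
import numpy as np

def bisquare_weights(coords, centre, bandwidth):
    """Diagonal entries of W_{u,v} from Equation (4) for a focal location."""
    d = np.linalg.norm(coords - centre, axis=1)   # Euclidean distances d_i
    w = (1.0 - (d / bandwidth) ** 2) ** 2
    return np.where(d <= bandwidth, w, 0.0)       # zero weight beyond the bandwidth

def weighted_gram(X, w):
    """Compute X^T W X for W = diag(w) without building the n-by-n matrix."""
    return X.T @ (w[:, None] * X)

# Made-up coordinates (in km) and data for three locations.
coords = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
X = np.array([[1.0, 2.0], [0.5, -1.0], [-1.5, -1.0]])
w = bisquare_weights(coords, centre=np.array([0.0, 0.0]), bandwidth=10.0)
G_local = weighted_gram(X, w)
```

In this toy case the three locations lie 0, 5, and 10 km from the focal point, so their weights are 1, 0.5625, and 0, respectively.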

Figure 3 illustrates the calculation procedures of the proposed approach. Once the kernel function and its bandwidth are chosen to determine the local subsets, the number of PCs to use, $k$, should be specified for each. Furthermore, the sparsity penalties $\lambda$ and $\lambda_{1,j}$ may also need to be adjusted for different subsets, which requires significant computing time and resources. In this work, therefore, we demonstrate the proposed approach using the same values for $k$, $\lambda$, and $\lambda_{1,j}$, estimated from the global result. The results consist of $m \times k$ sparse PCs, where $m$ is the number of local subsets; these can be simplified and mapped to identify the spatial structure of the relationships between the variables.

4. Results and Discussion

4.1. Standard PCA

Table 2 presents the loadings of the first three PCs derived from the standard PCA. We chose these components because they have eigenvalues greater than the mean of all eigenvalues, satisfying Kaiser’s criterion [36,37]. Although the first PC alone accounts for a significant proportion of the total variance in the dataset (i.e., 77.61%), the other two may also contain information on different aspects of the social infrastructure distribution. In this section, therefore, we examine the composition of these three components and the spatial configuration of the associated scores.
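The selection rule described above, which retains the components whose eigenvalues exceed the mean of all eigenvalues, can be sketched as follows. The function name `kaiser_components` and the eigenvalues are made up for illustration and do not correspond to Table 2.

```python
import numpy as np

def kaiser_components(eigvals):
    """Number of components whose eigenvalue exceeds the mean eigenvalue."""
    eigvals = np.asarray(eigvals, dtype=float)
    return int(np.sum(eigvals > eigvals.mean()))

# Made-up eigenvalues with a mean of 1.6; only the first two exceed it.
n_keep = kaiser_components([5.0, 2.0, 0.5, 0.3, 0.2])
```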

A notable feature of the first PC is that all loadings have a positive sign. Furthermore, most have similar values ranging between 0.15 and 0.25, possibly due to the strong correlations between the variables. The largest loading value of 0.335 is attached to the variable ‘performing_venues’, and the smallest of 0.001 to the variable ‘camp_grounds’. As illustrated in Figure 2, the performing art venues are clustered mainly in the Seoul metropolitan area, whereas most camping grounds are in the vicinity of Seoul but not within it. This observation implies that the first PC may be an indicator that distinguishes Seoul from the rest of the country.

On the other hand, there is a mixture of positive and negative loading values in the following two components, making their interpretation less straightforward. Some variables sharing similar characteristics have opposite effects on the same components. For example, both cultural amenities, such as the performing art venues, and medical facilities, including hospitals and postnatal care services, are typical urban features, but they have different signs in the second PC. The same ambiguity occurs in the third one as well. While the presence of large- and medium-scale hospitals leads to a decrease in the PC scores, retail pharmacies, which have almost identical distribution, would increase them.

The maps in Figure 4 also depict some degrees of mixed information across the components. Figure 4a clearly distinguishes Seoul, located in the northwest, from the rest of the country. Figure 4b,c, on the other hand, highlight other urban areas, such as Busan in the southeast end of the mainland and Daegu, the fourth most populated city in the country. At the same time, however, they also exhibit high scores for the grids representing the Seoul metropolitan area mainly because considerable proportions of the medical and educational facilities are clustered in this region. These patterns imply that although the first three components are uncorrelated in an aspatial matrix form, their spatial arrangements could overlap to an extent.

Considering the amount of variance explained by the first PC, it can be an effective indicator that conveys the most significant information regarding the distribution of social infrastructure. Nevertheless, it is a linear combination of all variables with similar weights, so its interpretation is not much simpler than that of the original dataset. Furthermore, since the orthogonality of the components does not guarantee their spatial independence from each other, the subsequent components may provide only little unique information. In the following section, therefore, we will apply sparse PCA to local subsets of the dataset and discuss how this approach addresses the limitations.

4.2. Local Sparse PCA

Before applying sparse PCA to local subsets of the example dataset, we conducted it for the overall distribution of social infrastructure to seek optimal values for the component-specific penalty parameters, $\lambda_{1,j}$. For comparability to the results from the standard PCA, we used the same number of PCs, $k = 3$, in this section. The global penalty parameter, $\lambda$, was set to a small value close to 0 (i.e., $10^{-6}$).

Figure 5 illustrates how the sum of the variance explained by the first three components changes with different combinations of the component-specific penalty parameters. It shows a clear tradeoff between the degree of sparsity imposed and the amount of variance explained: the relationship is not linear but is monotonically decreasing. With small penalty parameters, sparse PCA accounted for as much variance as the standard PCA but was only marginally different in component structure. As the penalty parameters increased, the degree of sparsity escalated considerably, but the representativeness of the PCs appeared to diminish in return.

It may be worth noting that the variance explained by sparse PCA is not directly comparable to that of the standard PCA. The PCs from sparse PCA are not necessarily orthogonal; thus, Equation (1) will likely overestimate the variance explained. In this work, therefore, we calculated the variance explained in the way described in [14].
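The adjusted variance in [14] is based on a QR decomposition of the component scores, which removes the part of each component that is already explained by the preceding ones. The sketch below illustrates that idea under the assumption that the data matrix is mean-centered. The function name `adjusted_variance` and the random data are hypothetical.

```python
import numpy as np

def adjusted_variance(X, V):
    """Share of total variance explained by each (possibly correlated) PC.
    X: centered n-by-p data; V: p-by-k loadings with unit-norm columns."""
    Z = X @ V                          # component scores
    R = np.linalg.qr(Z, mode="r")      # upper-triangular factor of Z = QR
    return np.diag(R) ** 2 / np.sum(X ** 2)

# Made-up data and two deliberately non-orthogonal loading vectors.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
X -= X.mean(axis=0)
V = np.array([[1.0, 1.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
V /= np.linalg.norm(V, axis=0)
adjusted = adjusted_variance(X, V)
naive = np.sum((X @ V) ** 2, axis=0) / np.sum(X ** 2)
```

For orthogonal, uncorrelated components the adjusted and naive values coincide; for correlated components the adjusted value is never larger than the naive one.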

The boxplot suggests that the penalty parameters should be carefully chosen based on the purpose of analysis. In this example, if the objective were to find one or two composite indices that effectively represent the entire dataset, modest penalties that produce 18 or 19 non-zero loadings in the first PC would be useful. Such a configuration drops over one-third of the social infrastructure variables from the first PC but still explains most of the variance in their distribution. On the other hand, if the interest lies in exploring data and generating a hypothesis, larger parameter values may simplify the PCs and disclose the most significant pattern.

Table 3 shows the structures of the first three PCs where only six variables have non-zero loadings. Although these components together explain less than half of the total variance in the original dataset (Figure 5), the implication of each component is more apparent than those in Table 2 because it has only a few loadings, all of which are in the same direction.

From a spatial perspective, however, they seem to portray somewhat similar distributions of the score (Figure 6). As with the case of the standard PCA, the concentration of social infrastructure around Seoul is highlighted in the distribution of the first PC scores (Figure 6a), concealing other regional differences. The other two maps, Figure 6b,c, convey practically the same visual impression, except for more scattered extreme values in the southern part of the country. These results suggest that the global application of sparse PCA may not be more effective than the standard PCA in exploring spatial heterogeneity.

To overcome this limitation of the conventional global application, we applied the sparse PCA algorithm to local subsets with the same penalty parameters. At each location $i = (u, v)$, the corresponding subset was determined using a binary weight matrix $W_{u,v}$, whose value is set to 1 for the areas within 100 km of $(u, v)$ and 0 otherwise. This resulted in the same number of local subsets as the number of grids (i.e., 1354). Although the distance threshold of 100 km is large enough to include sufficient observations in most subsets, there were only a few in some grids representing remote islands. We conducted sparse PCA only at those with more than ten observations, that is, 1346 local subsets (Figure 7).
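The construction of binary local subsets can be sketched as follows. The example assumes that each grid is represented by the coordinates of a single point and that a subset includes its own focal grid. The function name `local_subsets`, the 5-by-5 toy lattice, and the threshold values are hypothetical and much smaller than those used in the study.

```python
import numpy as np

def local_subsets(coords, radius, min_obs):
    """Binary weights for every focal location.
    Row i of W marks the observations within `radius` of location i;
    `keep` flags the locations whose subset has more than `min_obs` members."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))      # pairwise Euclidean distances
    W = (dist <= radius).astype(float)
    keep = W.sum(axis=1) > min_obs
    return W, keep

# Toy lattice: 5-by-5 grid points spaced 10 km apart (made-up layout).
xs = np.arange(5) * 10.0
coords = np.array([(x, y) for x in xs for y in xs])
W, keep = local_subsets(coords, radius=15.0, min_obs=5)
```

In this toy lattice the four corner points have only four members in their subsets and are excluded, in the same way that sparsely surrounded grids are dropped from the analysis.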

Due to the volume of local outputs, it would be impractical to examine each PC from this approach. Therefore, we summarized the results based on the combination of non-zero loadings in the first PC to facilitate the interpretation process. While there could be up to 736,281 different configurations for choosing six variables from 31 variables, our empirical example found only 370 combinations among the 1346 subsets. Table 4 presents those that occurred at least 20 times, labeling them from groups 1 to 9 in the order of occurrence. These groups account for about one-third of the total grids (i.e., 448 out of 1354).
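The number of possible configurations and the grouping of local results by their non-zero variables can be sketched as below. The function name `group_by_support`, the variable names, and the loadings are hypothetical and serve only to show the bookkeeping.

```python
from collections import Counter
from math import comb

# Number of ways to choose six variables out of 31.
n_configurations = comb(31, 6)

def group_by_support(loadings, names, tol=1e-12):
    """Count how often each combination of non-zero variables occurs
    in the first PC, ordered from most to least frequent."""
    supports = [
        tuple(name for name, value in zip(names, row) if abs(value) > tol)
        for row in loadings
    ]
    return Counter(supports).most_common()

# Made-up first-PC loadings for three local subsets and four variables.
names = ["apartments", "supermarkets", "hospitals", "accommodation"]
loadings = [
    [0.8, 0.6, 0.0, 0.0],
    [0.7, 0.7, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
groups = group_by_support(loadings, names)
```

Two subsets share the same combination even though their loading values differ, which is the information this classification deliberately ignores.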

Figure 8 shows distinctive spatial distributions for these groups and reveals the regional characteristics that the global approaches could not address. For example, the PCs around Seoul consist of housing units and daily living amenities, such as supermarkets and hospitals (i.e., group 1), likely because these variables distinguish inhabited and uninhabited areas in this well-developed part of the country. Similarly, the components around Jeju Island and the eastern coast areas include accommodations and other tourism-related facilities, as they contrast these tourist destinations with nearby regions. These results provide empirical evidence that the proposed approach can be an effective tool for exploring spatial heterogeneity in various urban and regional phenomena, including the distribution of social infrastructure.

5. Conclusions

5.1. Summary and Implications

The use of quantitative research methods has been a common practice in urban and regional studies for a long time, dating back to as early as the 1950s. There is much literature that uses a wide range of statistical techniques for exploring human–space interactions, from simple hypothesis testing statistics to sophisticated regression models. While the escalation in the volume and range of open data over the past decade has enabled more accurate modeling of dynamic urban phenomena, it has also become challenging to explore and process the data. Such big data often consist of numerous correlated variables, violating statistical assumptions in conventional methods and making it difficult to uncover underlying patterns.

The ultimate purpose of this work was to develop a method that can assist exploratory spatial analysis for multidimensional data and allow more intuitive interpretation. In particular, we focused on applications to examine spatial inequalities in social infrastructure, including housing and service facilities. To this end, we presented an extension of PCA that constructs sparse PCs for each local subset. It used an elastic net-based algorithm [14] to obtain the required number of non-zero loadings and a kernel function to define local subsets. The results from the proposed approach could be summarized based on the combination of the selected variables in the first PC and then mapped to identify regional differences or inequalities.

To demonstrate the advantages and limitations of the proposed approach, we applied it to the distribution of housing and service facilities in Korea. Although the first PC derived from the standard PCA accounted for most of the variance in the example dataset, it comprised all variables with similar weights. On the other hand, the local sparse PCA could yield PCs with only a few non-zero loadings and enable more straightforward interpretation. Furthermore, the selected variables in each local subset exhibited clear geographic patterns. For example, housing units and daily living amenities, such as supermarkets and hospitals, seemed to be the distinguishing features for urban and suburban parts of Seoul, whereas tourism-related facilities were unique characteristics of specific regions. These results offered valuable insights into the spatial patterns of social infrastructure, which the standard PCA only partly addressed.

This work provides empirical evidence that the recent sparse PCA method can be an effective alternative to the traditional dimension reduction techniques in urban and regional studies. The algorithm adopted in this work is suitable for identifying spatial relationships between variables and exploring their scales, as it runs even when there are more variables than the number of observations [14]. Considering that the small sample size is a persistent problem in local spatial analysis [38], this would be an important advantage over other variations of PCA.

Our case study also showed a monotonic inverse relationship between the degree of sparsity in PCs and the amount of variance explained. The PCs with no penalties applied accounted for as much variance as those from the standard PCA, but the components where half of the variables had zero loadings explained less than 40% of the total variance on average. This clear tradeoff suggests a need to tailor the penalty parameters to the purpose of analysis. Small penalty parameters would be appropriate if the analysis aims to find a few index variables that can summarize the entire dataset. For exploratory analysis, on the other hand, several different configurations may be examined for the balance between interpretability and explainability.

5.2. Limitations and Future Recommendations

Despite the merits of the proposed approach, it also has some drawbacks. First, the results from the proposed approach depend on various parameters whose correct values are generally unknown a priori. In practice, the parameters are often estimated by experiments, but this requires considerable computing resources for iterative evaluations. Moreover, since it is usually infeasible to test all possible configurations of the parameters, the optimal selection is not guaranteed. In this work, we chose the parameters manually by trial and error, but further research is needed on tuning the parameters automatically with state-of-the-art machine learning techniques.
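A rough count shows why exhaustive tuning becomes expensive. The sketch below assumes, for illustration only, that the tuned parameters are a kernel bandwidth and a sparsity penalty. The candidate values are hypothetical and are not the ones explored in this work; only the number of local subsets (1346, as in Figure 7) comes from the case study.

```python
from itertools import product

N_LOCAL_SUBSETS = 1346  # local subsets in the case study (Figure 7)

# Hypothetical candidate values; the grid actually explored is not reported.
bandwidths_km = [20, 30, 40, 50]
penalties = [0.0, 0.1, 0.5, 1.0, 2.0]

# Every (bandwidth, penalty) pair requires one sparse PCA fit per local subset.
grid = list(product(bandwidths_km, penalties))
total_fits = len(grid) * N_LOCAL_SUBSETS

print(f"{len(grid)} configurations x {N_LOCAL_SUBSETS} subsets = {total_fits} fits")
```

Even this coarse grid requires tens of thousands of local fits, and each additional parameter multiplies the count again.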

Second, although the combination of the variables with non-zero loadings is a functional criterion for classifying local results, it inevitably causes some loss of information. Sparse PCA usually produces nonnegligible coefficients for all selected variables, but their absolute values can still vary substantially. Furthermore, local subsets that share the same set of selected variables may have contrasting distributions of loadings. Since the current classification method is based only on the inclusion and exclusion of variables, it cannot take such differences in the loadings into account, limiting its ability to distinguish local results.
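The information loss can be made concrete with a small sketch. The loadings below are made up for illustration (only the variable names are taken from Table 4), and the helper names `support` and `cosine` are hypothetical. Two grid cells that select the same variables fall into the same group, even though their loading vectors point in quite different directions.

```python
from collections import defaultdict
from math import sqrt

def support(loadings, tol=1e-12):
    """Set of variable names whose loading is non-zero (the grouping key)."""
    return frozenset(name for name, w in loadings.items() if abs(w) > tol)

def cosine(a, b):
    """Cosine similarity between two loading vectors stored as dicts."""
    names = set(a) | set(b)
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in names)
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)

# Made-up loadings: same selected variables, different emphasis.
cell_a = {"hospitals": 0.90, "clubs": 0.30, "theatres": 0.31}
cell_b = {"hospitals": 0.20, "clubs": 0.95, "theatres": 0.24}

# Group cells by the set of selected variables only.
groups = defaultdict(list)
for cell_id, loadings in {"a": cell_a, "b": cell_b}.items():
    groups[support(loadings)].append(cell_id)

print(len(groups))                        # both cells share one group
print(round(cosine(cell_a, cell_b), 2))   # yet their loadings differ markedly
```

A similarity measure on the loadings themselves, such as the cosine used here, is one possible way to separate such cases; it is offered as an assumption, not as part of the proposed method.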

Third, we did not leverage the characteristics of the facilities in the analysis. Many studies of urban facilities focus on the extent to which people can access facilities of interest, such as hospitals and grocery stores [39]. In this sense, considering the characteristics of the facilities, rather than only their locations, may improve our understanding of their spatial distributions.

Notwithstanding these limitations, the value of the local sparse PCA lies in its capability to address spatially heterogeneous phenomena in reduced dimensions. While various geographically weighted models already exist for exploring spatial effects, they often produce an overwhelming volume of results that are not easy to interpret. In this work, we suggested using sparse PCA for each local subset so that each PC has a simple structure. As a result, each component is not only intuitive to interpret by itself, but the overall distribution in space is also more straightforward to summarize. Although more empirical applications are required to verify the strengths and limitations demonstrated here, the proposed approach can be a valuable addition to the exploratory tools in urban and regional studies.

Author Contributions

Conceptualization, S.-Y.H. and S.-H.C.; methodology, S.-Y.H.; software, S.-Y.H.; validation, S.-Y.H. and S.-H.C.; formal analysis, S.-Y.H.; investigation, S.-Y.H.; resources, S.-Y.H.; data curation, S.-Y.H. and S.M.; writing—original draft preparation, S.-Y.H., S.M., Y.-J.C. and J.-Y.K.; writing—review and editing, S.-Y.H.; visualization, S.-Y.H.; supervision, S.-Y.H.; project administration, S.-Y.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study were obtained from Statistics Korea and the Korea Local Information Research & Development Institute, and are openly available from https://sgis.kostat.go.kr/ and https://www.localdata.go.kr/, respectively (both accessed on 17 September 2022).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Smith, N. Gentrification and uneven development. Econ. Geogr. 1982, 58, 139–155.
2. Li, Z.; Wang, X.; Zarazaga, J.; Smith-Colin, J.; Minsker, B. Do infrastructure deserts exist? Measuring and mapping infrastructure equity: A case study in Dallas, Texas, USA. Cities 2022, 130, 103927.
3. Bissonnette, L.; Wilson, K.; Bell, S.; Shah, T.I. Neighbourhoods and potential access to health care: The role of spatial and aspatial factors. Health Place 2012, 18, 841–853.
4. Landry, S.M.; Chakraborty, J. Street trees and equity: Evaluating the spatial distribution of an urban amenity. Environ. Plan. A Econ. Space 2009, 41, 2651–2670.
5. Rigolon, A.; Németh, J. What shapes uneven access to urban amenities? Thick injustice and the legacy of racial discrimination in Denver’s parks. J. Plan. Educ. Res. 2021, 41, 312–325.
6. Klinenberg, E. Palaces for the People: How Social Infrastructure Can Help Fight Inequality, Polarization, and the Decline of Civic Life; Crown: New York, USA, 2018.
7. Latham, A.; Layton, J. Social infrastructure and the public life of cities: Studying urban sociality and public spaces. Geogr. Compass 2019, 13, e12444.
8. Everitt, B.; Dunn, G. Applied Multivariate Data Analysis, 2nd ed.; John Wiley & Sons, Ltd.: London, UK, 2001; p. 342.
9. Greyling, T.; Tregenna, F. Construction and analysis of a composite quality of life index for a region of South Africa. Soc. Indic. Res. 2017, 131, 887–930.
10. Manitiu, D.N.; Pedrini, G. Urban smartness and sustainability in Europe. An ex ante assessment of environmental, social and cultural domains. Eur. Plan. Stud. 2016, 24, 1766–1787.
11. Bourassa, S.C.; Hamelink, F.; Hoesli, M.; MacGregor, B.D. Defining housing submarkets. J. Hous. Econ. 1999, 8, 160–183.
12. Wu, C.; Sharma, R. Housing submarket classification: The role of spatial contiguity. Appl. Geogr. 2012, 32, 746–756.
13. Wiersma, S.; Just, T.; Heinrich, M. Segmenting German housing markets using principal component and cluster analyses. Int. J. Hous. Mark. Anal. 2022, 15, 548–578.
14. Zou, H.; Hastie, T.; Tibshirani, R. Sparse principal component analysis. J. Comput. Graph. Stat. 2006, 15, 265–286.
15. Shen, H.; Huang, J.Z. Sparse principal component analysis via regularized low rank matrix approximation. J. Multivar. Anal. 2008, 99, 1015–1034.
16. Demšar, U.; Harris, P.; Brunsdon, C.; Fotheringham, A.S.; McLoone, S. Principal component analysis on spatial data: An overview. Ann. Assoc. Am. Geogr. 2013, 103, 106–128.
17. Cartone, A.; Postiglione, P. Principal component analysis for geographical data: The role of spatial effects in the definition of composite indicators. Spat. Econ. Anal. 2021, 16, 126–147.
18. Harris, P.; Brunsdon, C.; Charlton, M. Geographically weighted principal components analysis. Int. J. Geogr. Inf. Sci. 2011, 25, 1717–1736.
19. Kaiser, H.F. The varimax criterion for analytic rotation in factor analysis. Psychometrika 1958, 23, 187–200.
20. Jeffers, J.N.R. Two case studies in the application of principal component analysis. J. R. Stat. Society. Ser. C (Appl. Stat.) 1967, 16, 225–236.
21. Everitt, B.; Hothorn, T. An Introduction to Applied Multivariate Analysis with R; Springer Science & Business Media: New York, USA, 2011.
22. Bitter, C.; Mulligan, G.F.; Dall’erba, S. Incorporating spatial variation in housing attribute prices: A comparison of geographically weighted regression and the spatial expansion method. J. Geogr. Syst. 2007, 9, 7–27.
23. Tahmasbi, B.; Haghshenas, H. Public transport accessibility measure based on weighted door to door travel time. Comput. Environ. Urban Syst. 2019, 76, 163–177.
24. Labetski, A.; Vitalis, S.; Biljecki, F.; Arroyo Ohori, K.; Stoter, J. 3D building metrics for urban morphology. Int. J. Geogr. Inf. Sci. 2022, 1–32.
25. Felipe-Sotelo, M.; Gustems, L.; Hernàndez, I.; Terrado, M.; Tauler, R. Investigation of geographical and temporal distribution of tropospheric ozone in Catalonia (North-East Spain) during the period 2000–2004 using multivariate data analysis methods. Atmos. Environ. 2006, 40, 7421–7436.
26. Zhang, C. Using multivariate analyses and GIS to identify pollutants and their spatial patterns in urban soils in Galway, Ireland. Environ. Pollut. 2006, 142, 501–511.
27. Kazemi, Z.; Jonidi Jafari, A.; Farzadkia, M.; Kazemnezhad Leyli, E.; Shahsavani, A.; Kermani, M. Assessment of the risk of exposure to air pollutants and identifying the affecting factors on making pollution by PCA, CFA. Int. J. Environ. Anal. Chem. 2022, 1–20.
28. Uddin, M.P.; Mamun, M.A.; Hossain, M.A. PCA-based feature reduction for hyperspectral remote sensing image classification. IETE Technol. Rev. 2021, 38, 377–396.
29. Zhao, H.; Liu, F.; Zhang, H.; Liang, Z. Convolutional neural network based heterogeneous transfer learning for remote-sensing scene classification. Int. J. Remote Sens. 2019, 40, 8506–8527.
30. Zhao, B.; Dong, X.; Guo, Y.; Jia, X.; Huang, Y. PCA dimensionality reduction method for image classification. Neural Process. Lett. 2022, 54, 347–368.
31. Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2005, 67, 301–320.
32. Brunsdon, C.; Fotheringham, A.S.; Charlton, M.E. Geographically weighted regression: A method for exploring spatial nonstationarity. Geogr. Anal. 1996, 28, 281–298.
33. Lepski, O.V.; Mammen, E.; Spokoiny, V.G. Optimal spatial adaptation to inhomogeneous smoothness: An approach based on kernel estimates with variable bandwidth selectors. Ann. Stat. 1997, 25, 929–947.
34. Fotheringham, A.S.; Yu, H.; Wolf, L.J.; Oshan, T.M.; Li, Z. On the notion of ‘bandwidth’ in geographically weighted regression models of spatially varying processes. Int. J. Geogr. Inf. Sci. 2022, 36, 1485–1502.
35. Chiu, S.-T. An automatic bandwidth selector for kernel density estimation. Biometrika 1992, 79, 771–782.
36. Kaiser, H.F. The application of electronic computers to factor analysis. Educ. Psychol. Meas. 1960, 20, 141–151.
37. Ferré, L. Selection of components in principal component analysis: A comparison of methods. Comput. Stat. Data Anal. 1995, 19, 669–682.
38. Griffith, D.A.; Chun, Y.; Lee, M. Deeper spatial statistical insights into small geographic area data uncertainty. Int. J. Environ. Res. Public Health 2021, 18, 231.
39. Kang, J.-Y.; Lee, S. Exploring food deserts in Seoul, South Korea during the COVID-19 pandemic (from 2019 to 2021). Sustainability 2022, 14, 5210.

Figure 1. Study area covered by 10 km-by-10 km grids.

Figure 2. Spatial distribution of the social infrastructure variables.

Figure 3. Calculation process of the proposed method.

Figure 4. Spatial distribution of the PC scores from the standard PCA: (a) PC1 scores; (b) PC2 scores; (c) PC3 scores.

Figure 5. Distribution of variances explained by the first three sparse components for a given number of non-zero loadings in the first component.

Figure 6. Spatial distribution of the PC scores from sparse PCA: (a) PC1 scores; (b) PC2 scores; (c) PC3 scores.

Figure 7. Variance explained by the first three sparse components constructed for 1346 local subsets.

Figure 8. Spatial classification based on the combination of the variables with non-zero loadings from the local sparse PCA.

Table 1. A list of the social infrastructure variables with summary statistics before and after standardization.

Variable	Description	$\bar{x}$	$\sigma_{x}$	$\sigma_{x'}$
ho_area1	Houses $\leq$ 40 m² gross floor area	1591.4	7583.1	3.68
ho_area2	Houses $\leq$ 60 m² gross floor area	3867.9	14,116.7	2.82
ho_area3	Houses $\leq$ 85 m² gross floor area	5193.6	17,094.3	2.55
ho_area4	Houses $\leq$ 130 m² gross floor area	2382.4	8034.8	2.61
ho_area5	Houses $>$ 130 m² gross floor area	676.6	2799.7	3.20
ho_type1	Multi-household houses	1601.8	10,542.8	5.08
ho_type2	Detached houses	2867.3	5730.3	1.56
ho_type3	Apartments	8728.5	32,413.0	2.87
ho_type4	Townhouses	360.3	1465.1	3.14
ho_yr79	Houses built in or before 1979	950.2	2618.5	2.14
ho_yr80_89	Houses built in the 1980s	1338.5	6366.9	3.68
ho_yr90_99	Houses built in the 1990s	4068.2	14,690.0	2.79
ho_yr00_09	Houses built in the 2000s	3502.6	13,854.4	3.06
ho_yr10_20	Houses built in or after 2010	3853.0	13,209.1	2.65
schools	Primary and secondary schools	19.1	47.7	1.93
hagwon	Hagwon providing extracurricular lessons	104.5	434.1	3.20
hospitals	Hospitals with at least 30 staffed beds	5.5	21.8	3.07
clinics_gp	Hospitals with less than 30 staffed beds	88.1	449.4	3.92
pharmacies	Retailing pharmaceuticals	49.1	248.9	3.90
postnatal	Postnatal care services	0.7	3.7	3.77
performing_venues	Performing arts venues	1.9	18.4	7.30
museums	Museums, including art museums	0.9	2.3	2.01
theatres	Movie theatres	3.8	16.4	3.31
culture_centres	Community culture centers	0.1	0.4	2.37
accommodation	Accommodations, including hotels and motels	39.8	123.5	2.39
camp_grounds	Caravan parks and camping grounds	2.2	4.1	1.44
temples	Temples (religious services)	1.3	2.7	1.59
bars	Licensed bars	30.4	153.8	3.89
clubs	Hospitality clubs	39.9	146.1	2.82
cctv	Closed circuit televisions	209.0	697.2	2.57
supermarkets	Supermarkets $>$ 3000 m² gross floor area	2.9	14.3	3.85

Table 2. Loadings of the first three PCs from the standard PCA.

Variable	PC1	PC2	PC3
ho_area1	0.221	−0.032	0.220
ho_area2	0.166	0.104	0.016
ho_area3	0.150	0.111	−0.085
ho_area4	0.152	0.083	−0.095
ho_area5	0.190	0.069	0.001
ho_type1	0.287	0.064	0.540
ho_type2	0.086	0.019	−0.149
ho_type3	0.168	0.128	−0.081
ho_type4	0.188	−0.013	0.091
ho_yr79	0.113	−0.052	−0.148
ho_yr80_89	0.205	0.190	0.038
ho_yr90_99	0.162	0.114	−0.078
ho_yr00_09	0.183	0.078	0.022
ho_yr10_20	0.151	0.086	0.020
schools	0.114	0.071	−0.085
hagwon	0.188	0.130	−0.014
hospitals	0.158	0.155	−0.317
clinics_gp	0.234	0.021	0.067
pharmacies	0.239	−0.071	0.041
postnatal	0.213	0.162	0.180
performing_venues	0.335	−0.877	−0.109
museums	0.051	0.110	−0.224
theatres	0.196	0.021	−0.138
culture_centres	0.044	0.091	−0.448
accommodation	0.130	−0.006	−0.227
camp_grounds	0.001	0.001	0.023
temples	0.034	−0.021	−0.136
bars	0.227	−0.020	0.063
clubs	0.140	0.088	−0.262
cctv	0.147	0.043	0.047
supermarkets	0.234	0.014	−0.054
$P_{i}$	77.611	9.520	3.349

Table 3. Loadings of the first three PCs from the selected sparse PCA algorithm.

Variable ¹	PC1	PC2	PC3
ho_type1	0.152	0.578
ho_type3			−0.294
ho_yr80_89		0.405	−0.006
hagwon		0.235
hospitals			−0.844
clinics_gp	0.207	0.198
pharmacies	0.179
postnatal		0.638
performing_venues	0.861
culture_centres			−0.202
bars	0.124
clubs			−0.360
supermarkets	0.382	0.013	−0.175
$P_{i}$	28.04	7.26	2.16

¹ The variables not listed in this table have zero loadings in the first three PCs.

Table 4. Groups based on the variables with non-zero loadings in the first sparse component and the number of grids in each group, n.

Group	Variables with Non-Zero Loadings	$n$
1	clinics_gp, ho_type1, performing_venues, pharmacies, postnatal, supermarkets	172
2	bars, clubs, hagwon, ho_yr80_89, hospitals, supermarkets	50
3	ho_area5, ho_yr80_89, hospitals, performing_venues, supermarkets, theatres	39
4	bars, clinics_gp, ho_type1, performing_venues, pharmacies, supermarkets	35
5	accommodation, bars, clubs, culture_centres, ho_type4, postnatal	35
6	accommodation, clubs, ho_type3, ho_yr90_99, hospitals, theatres	34
7	accommodation, clubs, hagwon, ho_yr80_89, hospitals, theatres	34
8	clubs, ho_area5, ho_yr80_89, performing_venues, supermarkets, theatres	29
9	accommodation, clubs, hagwon, ho_type3, hospitals, theatres	20
Total		448

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

