A Spatial Co-Location Pattern Mining Method Based on Hausdorff Distance Alignment

Liu, Xichen; Li, Yajie; Zou, Muquan

doi:10.3390/ijgi14090331


by Xichen Liu ¹,², Yajie Li ¹,² and Muquan Zou ¹,²,*

¹ School of Information Engineering, Kunming University, Kunming 650214, China

² Yunnan Key Laboratory of Intelligent Logistics Equipment and Systems, Kunming 650214, China

* Author to whom correspondence should be addressed.

ISPRS Int. J. Geo-Inf. 2025, 14(9), 331; https://doi.org/10.3390/ijgi14090331

Submission received: 28 June 2025 / Revised: 5 August 2025 / Accepted: 10 August 2025 / Published: 26 August 2025


Abstract

Spatial co-location patterns are used to describe the spatial associations between features, finding wide applications in geographic information systems, urban planning, and other fields. Traditional frameworks for mining spatial features typically consist of two stages: constructing spatial proximity relationships and discovering frequent patterns. However, existing methods have limitations: the construction of proximity relationships relies on fixed distance thresholds or clustering centers, making it difficult to adapt to spatial density heterogeneity; meanwhile, frequency metrics overly depend on participation indices, lacking quantitative analysis of the strength of geometric associations between features. To address these issues, a spatial co-location pattern mining method based on Hausdorff distance is proposed. Drawing on the concept of Hausdorff distance, this method employs Voronoi tessellation to achieve data-adaptive partitioning of the spatial domain. Combined with a K-dimensional tree, it adopts an iterative strategy of direct allocation, proportional allocation, and residual allocation to align instances, generating a spatial proximity relationship graph. Additionally, a new frequency metric based on instance distribution—alignment rate—is introduced, leveraging the decreasing trend of alignment rate in conjunction with a pruning optimization algorithm. Experimental results demonstrate that this method excels in handling noise points, effectively addressing the challenges of uneven data density distribution while enhancing the identification of weakly associated yet potentially valuable patterns.

Keywords: spatial co-location pattern; Hausdorff distance; Voronoi; k-dimensional tree; alignment rate

1. Introduction

Spatial co-location pattern mining uncovers co-occurrence patterns among diverse features in spatial data, providing data-driven insights for decision-making. Advances in data collection and storage have generated vast spatial datasets rich with information. This approach analyzes frequent feature co-occurrences to reveal synergistic distributions in geographic space [1]. For instance, in urban planning, frequent co-occurrences of hospitals and pharmacies may indicate complementary facility functions, optimizing urban layouts. Typically, spatial co-location pattern mining involves two phases: constructing proximity relationships and quantifying their frequency.

Conventional methods rely on fixed distance thresholds to define proximity and participation ratios to measure frequency [2]. However, these approaches face several limitations. First, fixed thresholds are sensitive to noise and uneven distributions, often misidentifying noise as proximate or missing genuine relationships. Second, participation ratios are susceptible to outliers, where extreme values may inflate results, obscuring true pattern strength. Third, these methods exhibit low efficiency on large-scale datasets, limiting practical applicability. Prior studies have enhanced proximity generation using density-based clustering [3] or tree-based algorithms [4,5], yet they remain threshold-dependent. Efforts to improve frequency measurement with weighted or composite metrics [6] increase complexity and reduce interpretability. Efficiency improvements via graph structures [7] or indexing [8] still encounter computational bottlenecks. Despite these advances, significant opportunities remain to enhance accuracy, robustness, and applicability.

This study proposes a threshold-free, robust, and efficient method for spatial co-location pattern mining, addressing the aforementioned challenges with the following contributions:

1. Proximity Relationship Generation: Conventional methods use fixed distance thresholds or single k-Nearest Neighbor (k-NN) algorithms, which are sensitive to noise and density variations, hindering the capture of complex spatial structures. The proposed method employs Voronoi tessellation to delineate distribution boundaries and constructs KD-trees for feature subsets in each region, eliminating threshold reliance. A dynamic k-NN algorithm with a three-stage allocation strategy (direct, proportional, and residual) adaptively aligns instances across sparse and dense regions. Drawing on the Hausdorff distance, this approach evaluates point set similarity, ensuring accurate and robust proximity relationships. Local relationships are refined by converting directed edges to undirected edges and applying a transfer substitution strategy, maintaining robustness under noise or density fluctuations, effectively filtering noise while preserving true patterns.
2. Frequency Measurement: Traditional participation ratios overlook synergistic feature alignment, making them prone to distortions from local anomalies. The proposed alignment ratio, a novel metric measuring the proportion of aligned feature instances, offers a comprehensive frequency measure. A cumulative multiplication strategy captures k-order pattern effects, preserving global information and mitigating anomaly distortions, thus providing an intuitive and accurate representation of associations.

The rest of the paper is organized as follows: Section 2 reviews related work, Section 3 introduces the related definitions, Section 4 details the proposed method, Section 5 presents experimental results, and Section 6 concludes the study with future work.

2. Related Work

Spatial co-location pattern mining has been extensively studied, with the participation ratio proposed by Huang et al. [9] as a foundational metric for pattern frequency. Its anti-monotonicity reduces the pattern search space, enhancing mining efficiency. Subsequent refinements led to algorithms such as the Join-based Co-location Pattern Mining Algorithm (Join-based) and the Partial-join Co-location Pattern Mining Algorithm (Partial-join) [10]. The latter improves efficiency by minimizing join operations. However, the computational overhead of joins limits performance on large-scale datasets. To address this, the Join-less Co-location Pattern Mining Algorithm (Join-less) [11] employs a lookup strategy, significantly boosting performance. Further optimizations include tree-based algorithms like the Co-location Pattern Index Tree (CPI-tree) [12], the Improved CPI-tree (iCPI-tree) [13], and the Order-Clique-Based Co-location Pattern Mining Algorithm [14]. These leverage prefix-sharing to reduce storage redundancy and integrate proximity relationship generation, improving both precision and efficiency.

Traditional methods for generating proximity relationships rely on distance-based metrics, such as Euclidean [15] and Manhattan distances [16], using fixed thresholds. These approaches struggle with irregularly distributed data or complex topological structures due to their sensitivity to data distribution. Topology-based methods, such as those using Voronoi diagrams [17,18] or Delaunay triangulation [14,19], address these limitations by partitioning space and adapting to its structure, making them suitable for complex spatial data. Network-index-based methods, including k-Nearest Neighbor (k-NN) algorithms [20,21], fuzzy grid clustering [8], and GPU-accelerated approaches [22], enhance efficiency for large-scale datasets through optimized indexing structures. Recent theoretical advances in spatial graph modeling, such as graph neural networks (GNNs) and spatial autoregressive models, provide robust frameworks for capturing complex spatial dependencies. GNNs represent spatial data as graphs, using node and edge features to model intricate relationships across heterogeneous topologies [23]. Spatial autoregressive models incorporate spatial correlation structures to handle dynamic and irregular spatial configurations [24]. These methods offer theoretical rigor for modeling complex spatial interactions but often require intensive computational resources. Our HD-SCPM algorithm complements these approaches by integrating Voronoi tessellation and KD-tree indexing, providing an efficient theoretical framework for proximity relationship generation with reduced computational complexity.

Theoretically, multi-scale spatial pattern mining addresses the challenge of detecting patterns across varying spatial resolutions through frameworks like multi-resolution grid partitioning and hierarchical clustering. Multi-resolution approaches decompose spatial data into nested grids, enabling the identification of patterns at both fine and coarse scales [25]. Hierarchical clustering constructs layered representations of spatial relationships, capturing dependencies across different levels of granularity [26]. Additionally, wavelet-based transformations provide a theoretical basis for analyzing spatial patterns at multiple scales by decomposing data into frequency components [27]. These frameworks are particularly effective for modeling spatial heterogeneity in complex environments. HD-SCPM’s alignment ratio metric is theoretically designed to adapt to multi-scale distributions by dynamically adjusting proximity thresholds, ensuring robust pattern detection across diverse spatial granularities without requiring extensive empirical validation.

Drawing on the Hausdorff distance, this work optimizes distance metrics for complex, irregularly shaped datasets, outperforming traditional metrics in noisy or uneven distributions [28,29]. Additionally, a novel alignment ratio replaces conventional frequency metrics, such as the maximum-minimum [30,31] and weighted participation ratios [32]. This metric better captures relationships between spatial features, offering superior accuracy and robustness, particularly for irregular topological structures. The choice of Hausdorff distance is justified by its ability to capture global geometric relationships, making it robust for noisy and irregular datasets like Shanghai_POI. Compared to alternative similarity metrics, such as Dynamic Time Warping (DTW), which excels in measuring temporal sequence similarity for spatio-temporal data [33], Hausdorff distance is more efficient for static spatial patterns, with a complexity of $O(n \log n)$ versus DTW's $O(nm)$ for sequences of lengths $n$ and $m$. For spatio-temporal extensions, HD-SCPM could potentially integrate DTW to analyze dynamic urban patterns, such as time-varying POI distributions, but this would require additional temporal data and optimization to manage DTW's higher computational cost. The current focus on static spatial patterns limits the direct applicability of DTW, though future work could explore hybrid metrics to enhance HD-SCPM's temporal robustness.

Compared to existing algorithms, HD-SCPM achieves competitive efficiency for large-scale, heterogeneously distributed datasets. Its time complexity, as detailed in Section 4.1 (Equation (25)), is $O(|F|^{3} n \log |F| + p|F|)$, where $|F|$ is the average number of instances per feature, $n$ is the number of instances, and $p$ is the number of frequent patterns. This is comparable to Join-less and CPI-tree ($O(|F| n \log n)$) in moderately sized datasets and outperforms Join-based ($O(|F|^{2} n^{2})$) in scenarios with irregular distributions due to its use of Voronoi tessellation and KD-tree indexing. HD-SCPM excels in multi-scale settings by dynamically adjusting proximity thresholds, unlike the static structures of CPI-tree. However, its scalability is limited in high-dimensional spaces due to KD-tree inefficiencies and in dense datasets where the $|F|^{3}$ term dominates. These trade-offs make HD-SCPM particularly suitable for urban datasets (e.g., Shanghai_POI) with complex topologies and moderate feature density.

This work advances spatial co-location pattern mining by integrating Voronoi tessellation, KD-tree indexing, and a novel alignment ratio, achieving superior accuracy and efficiency for complex spatial datasets, while positioning itself at the intersection of spatial graph modeling and multi-scale pattern mining to theoretically address diverse spatial data challenges.

3. Related Definitions

3.1. Basic Concepts

Spatial features represent distinct categories of objects characterized by specific attributes within a spatial domain, such as buildings, rivers, or roads. Let $F = \{f_{1}, f_{2}, \dots, f_{n}\}$ denote a set of spatial features, where each $f_{i}$ ($i \in \mathbb{N}$, $1 \leq i \leq n$) corresponds to a feature type. Each feature $f_{i}$ is associated with a set of instances $O_{i} = \{o_{i1}, o_{i2}, \dots, o_{i m_{i}}\}$, where each instance $o_{ij}$ ($j \in \mathbb{N}$, $1 \leq j \leq m_{i}$) is positioned at $\mathrm{loc}(o_{ij}) \in \mathbb{R}^{d}$, with $d$ representing the spatial dimensionality (typically $d = 2$ or 3).

Given two instances $o_{in} \in O_{i}$ and $o_{jm} \in O_{j}$, a proximity relationship exists if the distance between their locations satisfies a predefined condition:

$$D(\mathrm{loc}(o_{in}), \mathrm{loc}(o_{jm})) \leq \delta, \tag{1}$$

where $D : \mathbb{R}^{d} \times \mathbb{R}^{d} \to \mathbb{R}_{\geq 0}$ is a distance metric (e.g., Euclidean distance), and $\delta \in \mathbb{R}_{> 0}$ is a user-defined threshold. This relationship is denoted as $o_{in} \sim o_{jm}$. A set $cl = \{o_{1}, o_{2}, \dots, o_{k}\} \subseteq \bigcup_{i=1}^{n} O_{i}$ forms a spatial clique if every pair $(o_{p}, o_{q}) \in cl \times cl$ (with $p \neq q$) satisfies Equation (1). A spatial clique is maximal if it is not contained within any larger clique satisfying the same condition.
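The clique condition can be checked directly from Equation (1). The snippet below is a minimal sketch, assuming instances are given as coordinate tuples and the metric is Euclidean; the function name `is_clique` is hypothetical and not taken from the paper.

```python
from itertools import combinations
from math import dist


def is_clique(points, delta):
    """Return True if every pair of points satisfies Equation (1).

    Assumption: `points` is a list of coordinate tuples and the metric D
    is the Euclidean distance.
    """
    return all(dist(p, q) <= delta for p, q in combinations(points, 2))


# Hypothetical coordinates, for illustration only.
close_triple = [(0, 0), (1, 0), (0, 1)]
stretched_triple = [(0, 0), (1, 0), (3, 0)]
print(is_clique(close_triple, 1.5))      # all pairwise distances are within 1.5
print(is_clique(stretched_triple, 1.5))  # the pair (0, 0)-(3, 0) is too far apart
```

The check is quadratic in the clique size, which is acceptable here because cliques contain at most one instance per feature.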

For example, in Figure 1a, the traditional threshold-based method identifies proximity relationships, such as the one between instances $a_{2}$ and $b_{6}$ (marked by a black solid line), based solely on the distance threshold $\delta$.

A spatial co-location pattern $c \subseteq F$ is a subset of features $c = \{f_{i_{1}}, f_{i_{2}}, \dots, f_{i_{k}}\}$, where $k$ is the pattern's order. A spatial clique $cl$ containing instances of all features in $c$, with each instance belonging to a distinct feature, is called a row instance of $c$. The collection of all row instances forms the table instance $T(c)$. The participation ratio for a feature $f_{i} \in c$ is defined as:

$$PR(c, f_{i}) = \frac{|\{o \in O_{i} \mid \exists\, cl \in T(c),\ o \in cl\}|}{|O_{i}|}, \tag{2}$$

where $|\cdot|$ denotes the cardinality of a set. The participation index of pattern $c$ is the minimum participation ratio across all features:

$$PI(c) = \min_{f_{i} \in c} PR(c, f_{i}). \tag{3}$$

A pattern $c$ is deemed frequent if $PI(c) \geq \theta$, where $\theta \in (0, 1]$ is a user-specified threshold. For example, in Figure 1a, for the pattern $c = \{a, b\}$, the participation ratios are $PR(c, a) = \frac{3}{4} = 0.75$ and $PR(c, b) = \frac{6}{7} \approx 0.857$, yielding $PI(c) = \min(0.75, 0.857) = 0.75$. If $\theta = 0.3$, then $c$ is frequent.
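Equations (2) and (3) translate into a few lines of code. The sketch below assumes a table instance is stored as a list of dictionaries mapping a feature name to an instance identifier; the function names and the row instances are hypothetical, chosen only so that the ratios match the 3/4 and 6/7 reported for Figure 1a (they are not read from the figure).

```python
def participation_ratio(table_instance, feature, n_instances):
    """Equation (2): share of a feature's instances that occur in any row instance."""
    participating = {row[feature] for row in table_instance}
    return len(participating) / n_instances


def participation_index(table_instance, instance_counts):
    """Equation (3): minimum participation ratio over the features of the pattern."""
    return min(
        participation_ratio(table_instance, feature, count)
        for feature, count in instance_counts.items()
    )


# Hypothetical table instance for the pattern {a, b}: 3 of 4 a-instances and
# 6 of 7 b-instances take part in at least one row instance.
rows = [
    {"a": 1, "b": 1}, {"a": 1, "b": 2}, {"a": 2, "b": 3},
    {"a": 2, "b": 4}, {"a": 3, "b": 5}, {"a": 3, "b": 6},
]
counts = {"a": 4, "b": 7}
pi = participation_index(rows, counts)
print(pi, pi >= 0.3)
```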

3.2. Instance Alignment Based on Hausdorff Distance

To enhance spatial proximity mining, we propose a unified alignment framework inspired by the Hausdorff distance, enabling adaptive instance pairing across heterogeneous spatial distributions. Let $A = \{a_{1}, a_{2}, \dots, a_{m}\} \subseteq \mathbb{R}^{d}$ and $B = \{b_{1}, b_{2}, \dots, b_{n}\} \subseteq \mathbb{R}^{d}$ represent instance sets of features $f_{i}$ and $f_{j}$. The directed Hausdorff distance is:

$$h(A, B) = \sup_{a \in A} \inf_{b \in B} d(a, b), \tag{4}$$

where $d(a, b) = \|a - b\|_{2}$ is the Euclidean norm in $\mathbb{R}^{d}$. The Hausdorff distance is:

$$D_{H}(A, B) = \max\{h(A, B), h(B, A)\}. \tag{5}$$
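For finite point sets the supremum and infimum in Equations (4) and (5) reduce to a maximum and a minimum, so both distances can be computed by brute force. The following is a minimal sketch with hypothetical function names and toy coordinates; it runs in $O(|A| \cdot |B|)$ time and is meant to illustrate the definition, not the indexed search used by the algorithm.

```python
from math import dist


def directed_hausdorff(A, B):
    """Equation (4) for finite sets: the largest nearest-neighbour distance from A to B."""
    return max(min(dist(a, b) for b in B) for a in A)


def hausdorff(A, B):
    """Equation (5): the larger of the two directed distances."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))


# Toy coordinates, for illustration only.
A = [(0, 0), (1, 0)]
B = [(0, 0), (4, 0)]
print(directed_hausdorff(A, B))  # 1.0: (1, 0) is one unit from (0, 0)
print(directed_hausdorff(B, A))  # 3.0: (4, 0) is three units from (1, 0)
print(hausdorff(A, B))           # 3.0
```

The example also shows that the directed distance is not symmetric, which is why Equation (5) takes the maximum of both directions.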

For a singleton set $\{o_{i}\}$, the directed Hausdorff distance simplifies to the distance to its nearest neighbor in $B$. The alignment function $\Phi : O_{i} \to O_{j}$ maps instances of $f_{i}$ to $f_{j}$, constrained by a threshold $\delta_{H}$:

$$\Phi(o_{i}) = o_{j} \Rightarrow d(o_{i}, o_{j}) \leq \delta_{H}, \quad o_{j} = \arg\min_{o' \in O_{j}} d(o_{i}, o'). \tag{6}$$

The threshold $\delta_{H}$ is set based on the dataset's spatial density, e.g., 100 m for Shanghai_POI to align with urban co-location distances.
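One way to read Equation (6) is as a thresholded nearest-neighbour lookup. The sketch below is an illustration under stated assumptions: instances are coordinate tuples, the function name `align` is hypothetical, an instance with no neighbour within $\delta_{H}$ is left unaligned (`None`), and equal distances are broken by the lowest index.

```python
from math import dist


def align(o_i, O_j, delta_h):
    """Return the index in O_j of the nearest instance to o_i, or None.

    Assumptions: a brute-force search stands in for the KD-tree query,
    and equal distances are resolved in favour of the lowest index.
    """
    if not O_j:
        return None
    best = min(range(len(O_j)), key=lambda idx: (dist(o_i, O_j[idx]), idx))
    return best if dist(o_i, O_j[best]) <= delta_h else None


# Toy coordinates, for illustration only.
print(align((0, 0), [(3, 0), (1, 0), (-1, 0)], delta_h=2))  # index 1 wins the tie
print(align((0, 0), [(5, 0)], delta_h=2))                   # None: too far away
```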

The alignment process employs a three-stage strategy (direct, proportional, and residual allocation) to handle varying instance cardinalities. The unified allocation set is:

$$A(f_{i}, f_{j}) = A_{direct}(f_{i}, f_{j}) \cup A_{prop}(f_{i}, f_{j}) \cup A_{resid}(f_{i}, f_{j}), \tag{7}$$

with total cost:

$$C(f_{i}, f_{j}) = \sum_{(o_{i}, o_{j}) \in A(f_{i}, f_{j})} d(o_{i}, o_{j}). \tag{8}$$

The HD-SCPM algorithm uses this framework to construct a proximity graph with bidirectional edges:

$$E_{proximity} = \{(o_{i}, o_{j}) \in A(f_{i}, f_{j}) \mid (o_{j}, o_{i}) \in A(f_{j}, f_{i})\}. \tag{9}$$

This process takes instance sets $O_{i}$, $O_{j}$, and threshold $\delta_{H}$ as input, outputting the proximity graph $E_{proximity}$. It uses spatial partitioning to mimic the inf and sup operations of the Hausdorff distance, ensuring efficiency and adaptability. The spatial partitioning employs a randomized Voronoi tessellation, where feature instances serve as seed points to generate Voronoi cells, approximating nearest-neighbor searches for efficient alignment. To control potential variability introduced by randomization, we use fixed seed points derived from the dataset's instance locations and apply multiple sampling iterations (e.g., Monte Carlo methods) to stabilize results, ensuring consistent and reproducible proximity relationships.

To illustrate the evolution of the three-stage alignment process, we present a four-panel figure (Figure 2) that integrates the direct, proportional, and residual allocation stages, culminating in the final proximity graph. Figure 2a depicts direct allocation using green dashed arrows to indicate the nearest-neighbor assignments from instances of feature $f_{i}$ to $f_{j}$, clarifying the directional pairing process. Figure 2b illustrates proportional allocation, adding orange dashed arrows (with a distinct style) to show balanced assignments based on the proportionality factor. Figure 2c incorporates residual allocation, using purple solid arrows to highlight the assignment of remaining instances, ensuring all instances are allocated. Figure 2d presents the final proximity graph with bidirectional edges, consistent with the original visualization. The panels are arranged in a sequential order (top-left, top-right, bottom-right, bottom-left) to clearly demonstrate the step-by-step evolution of the allocation process.

From residual allocation to the final proximity graph, the HD-SCPM algorithm transforms the allocation results into a proximity graph by retaining only bidirectional relationships. Specifically, instance pairs $(o_{i}, o_{j})$ with bidirectional arrows (i.e., $(o_{i}, o_{j}) \in A(f_{i}, f_{j})$ and $(o_{j}, o_{i}) \in A(f_{j}, f_{i})$) are represented as undirected black solid lines in Figure 2d, indicating a confirmed proximity relationship. Instance pairs with only unidirectional arrows (i.e., $(o_{i}, o_{j}) \in A(f_{i}, f_{j})$ but $(o_{j}, o_{i}) \notin A(f_{j}, f_{i})$) are removed, as they do not satisfy the bidirectional condition for proximity, and thus are not considered neighbors. This process, transitioning from the purple solid arrows in Figure 2c to the black solid lines in Figure 2d, ensures that only robust spatial relationships are retained in the proximity graph.
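The bidirectional filter of Equation (9) amounts to a set intersection over reversed pairs. The snippet below is a small sketch, assuming each allocation is stored as a set of (source, target) identifier pairs; the function name and the instance labels are hypothetical.

```python
def mutual_edges(a_ij, a_ji):
    """Equation (9): keep (o_i, o_j) only if (o_j, o_i) was also allocated.

    a_ij holds directed pairs from feature f_i to f_j, and a_ji holds the
    pairs allocated in the opposite direction.
    """
    return {(o_i, o_j) for (o_i, o_j) in a_ij if (o_j, o_i) in a_ji}


# Hypothetical allocations: a1 and b1 choose each other, but b6 prefers a3 to a2.
a_to_b = {("a1", "b1"), ("a2", "b6")}
b_to_a = {("b1", "a1"), ("b6", "a3")}
print(mutual_edges(a_to_b, b_to_a))  # only the mutual pair ("a1", "b1") survives
```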

To highlight the advantages of the HD-SCPM algorithm over traditional threshold-based methods, we present a side-by-side comparison in Figure 1. Subfigure (a) illustrates the proximity relationships identified by the traditional method, where instances such as $a_{2}$ and $b_{6}$ are connected by a black solid line, indicating a proximity relationship based solely on the distance threshold $\delta$. Subfigure (b) shows the HD-SCPM method's results, where the same instance pair $a_{2}$ and $b_{6}$ is connected by a light gray solid line, indicating that it does not satisfy the bidirectional alignment condition and is thus not considered a proximity relationship. This comparison demonstrates that HD-SCPM's requirement for bidirectional alignment reduces false positives, ensuring more robust and accurate proximity relationships.

3.3. Direct Allocation

When $|O_{i}| \geq |O_{j}|$, direct allocation assigns each instance $o_{i} \in O_{i}$ to its nearest neighbor in $O_{j}$:

$$A_{direct}(f_{i}, f_{j}) = \{(o_{i}, o_{j}) \mid o_{i} \in O_{i},\ o_{j} = \Phi(o_{i})\}, \tag{10}$$

with $\Phi(o_{i})$ as defined in Equation (6). Ties are resolved by instance indices. This process is visualized in Figure 2a, where green dashed arrows indicate the directional nearest-neighbor assignments.

Proof. Direct allocation minimizes the total distance by greedily assigning each $o_{i}$ to its nearest neighbor, ensuring local optimality. By induction, this is globally optimal for $|O_{i}| \geq |O_{j}|$. □
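Direct allocation (Equation (10)) and its cost (Equation (8)) can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: a brute-force scan replaces the KD-tree query, ties go to the lowest index, the threshold `delta_h` defaults to no limit, and all names are hypothetical.

```python
from math import dist


def direct_allocation(O_i, O_j, delta_h=float("inf")):
    """Pair every instance of O_i with its nearest instance of O_j (index pairs).

    Pairs farther apart than delta_h are dropped, following Equation (6).
    """
    pairs = []
    for i, p in enumerate(O_i):
        j = min(range(len(O_j)), key=lambda idx: (dist(p, O_j[idx]), idx))
        if dist(p, O_j[j]) <= delta_h:
            pairs.append((i, j))
    return pairs


def total_cost(pairs, O_i, O_j):
    """Equation (8): sum of distances over the allocated pairs."""
    return sum(dist(O_i[i], O_j[j]) for i, j in pairs)


# Toy coordinates with |O_i| >= |O_j|, for illustration only.
O_i = [(0, 0), (2, 0), (5, 0)]
O_j = [(0, 1), (4, 0)]
pairs = direct_allocation(O_i, O_j)
print(pairs, total_cost(pairs, O_i, O_j))
```

Because each instance is assigned independently, several instances of $O_{i}$ may share the same target in $O_{j}$, as the second and third instances do here.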

3.4. Proportional Allocation

When $|O_{i}| < |O_{j}|$, proportional allocation balances the assignment with proportionality factor:

$$k = \left\lfloor \frac{|O_{j}|}{|O_{i}|} \right\rfloor, \tag{11}$$

assigning $k$ instances of $O_{j}$ to each $o_{i} \in O_{i}$:

$$A_{prop}(f_{i}, f_{j}) = \{(o_{i}, o_{j}) \mid o_{j} \in O_{j},\ o_{i} \in \mathrm{NN}_{k}(o_{j}, O_{i})\}, \tag{12}$$

where $\mathrm{NN}_{k}(o_{j}, O_{i})$ is the set of k-nearest neighbors of $o_{j}$ in $O_{i}$. This is depicted in Figure 2b, with orange dashed arrows (distinct from direct allocation) illustrating the balanced assignments.

Proof. Proportional allocation ensures balanced distribution, assigning approximately $k$ instances of $O_{j}$ per $o_{i} \in O_{i}$. The number of assignments is $m_{i} \cdot k \leq m_{j}$, with $r = m_{j} - m_{i} \cdot k$ unallocated instances, ensuring fairness. □
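The sketch below shows one possible reading of proportional allocation that matches the counts in the proof ($m_{i} \cdot k$ assignments and $r$ leftovers). It is an assumption rather than the paper's exact procedure: each instance of $O_{i}$, taken in index order, claims its $k$ nearest still-unclaimed instances of $O_{j}$. The function name and coordinates are hypothetical.

```python
from math import dist


def proportional_allocation(O_i, O_j):
    """Give each instance of O_i its k nearest unclaimed instances of O_j.

    Returns (pairs, leftover), where pairs are (i, j) index pairs and
    leftover lists the indices of O_j that remain unallocated.
    Assumes 0 < len(O_i) < len(O_j).
    """
    k = len(O_j) // len(O_i)  # Equation (11)
    free = set(range(len(O_j)))
    pairs = []
    for i, p in enumerate(O_i):
        nearest = sorted(free, key=lambda j: (dist(p, O_j[j]), j))[:k]
        pairs.extend((i, j) for j in nearest)
        free.difference_update(nearest)
    return pairs, sorted(free)


# Toy coordinates: m_i = 2 and m_j = 5, so k = 2 and r = 1.
O_i = [(0, 0), (10, 0)]
O_j = [(1, 0), (2, 0), (9, 0), (11, 0), (5, 0)]
pairs, leftover = proportional_allocation(O_i, O_j)
print(pairs, leftover)
```

Processing instances in index order makes the result deterministic but order-dependent; a different visiting order could yield a different balanced assignment.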

3.5. Residual Allocation

Residual allocation assigns the remaining $r$ instances of $O_{j}$:

$$O_{j}^{unalloc} = O_{j} \setminus \{o_{j} \mid \exists\, o_{i},\ (o_{i}, o_{j}) \in A_{prop}(f_{i}, f_{j})\}, \tag{13}$$

$$A_{resid}(f_{i}, f_{j}) = \{(o_{i}, o_{j}) \mid o_{j} \in \mathrm{Top}_{r}(O_{j}^{unalloc}, o_{i})\}, \tag{14}$$

where $\mathrm{Top}_{r}(O_{j}^{unalloc}, o_{i})$ selects the $r$ closest unallocated instances to $o_{i}$. This is shown in Figure 2c, where purple solid arrows indicate the assignment of remaining instances.

Proof. Residual allocation ensures all instances are assigned:

$$|A_{prop}(f_{i}, f_{j})| + |A_{resid}(f_{i}, f_{j})| = m_{j}. \tag{15}$$

Greedy selection of the $r$ closest instances minimizes the allocation cost. □
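A simple way to satisfy Equation (15) is to attach every leftover instance of $O_{j}$ to exactly one instance of $O_{i}$. The sketch below does so by nearest distance; this pairing rule is an assumed reading of Equation (14), and the function name and coordinates are hypothetical.

```python
from math import dist


def residual_allocation(O_i, O_j, unallocated):
    """Pair each unallocated index of O_j with its nearest instance of O_i.

    Returns (i, j) index pairs; ties go to the lowest index of O_i.
    """
    pairs = []
    for j in unallocated:
        i = min(range(len(O_i)), key=lambda idx: (dist(O_i[idx], O_j[j]), idx))
        pairs.append((i, j))
    return pairs


# Toy coordinates: the instance at index 4 was left over by the previous stage.
O_i = [(0, 0), (10, 0)]
O_j = [(1, 0), (2, 0), (9, 0), (11, 0), (6, 0)]
resid = residual_allocation(O_i, O_j, unallocated=[4])
print(resid)  # (6, 0) is closer to (10, 0) than to (0, 0)
```

With four proportional pairs and one residual pair, the total equals $m_{j} = 5$, as required by Equation (15).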

3.6. Alignment Ratio of Instances

3.6.1. Alignment Ratio Between Two Spatial Features

The alignment ratio for features $f_{i}$ and $f_{j}$ quantifies the proportion of aligned instance pairs:

$$AR(f_{i}, f_{j}) = \frac{|A_{aligned}(f_{i}, f_{j})|}{|O_{i}| + |O_{j}|}, \tag{16}$$

where $A_{aligned}(f_{i}, f_{j})$ is the set of aligned pairs. A pattern $\{f_{i}, f_{j}\}$ is frequent if:

$$AR(f_{i}, f_{j}) \geq \theta_{AR}, \tag{17}$$

with $\theta_{AR} \in (0, 1]$ as a user-defined threshold. The aligned pairs are defined as:

$$A_{aligned}(f_{i}, f_{j}) = \{(o_{i}, o_{j}) \in E_{proximity} \mid d(o_{i}, o_{j}) \leq \delta_{H}\}, \tag{18}$$

where $d(o_{i}, o_{j})$ is the Euclidean distance defined in Equation (4). The thresholds $\delta_{H}$ and $\theta_{AR}$ are calibrated to ensure statistically significant patterns. For spatial_data_A, $\delta_{H} = 5$ and $\theta_{AR} = 0.3$ are set based on average nearest neighbor distances and frequent pattern significance. For Shanghai_POI, $\delta_{H} = 100$ meters and $\theta_{AR} = 0.5$ reflect urban co-occurrence rates and pattern frequency analysis. For example, in Figure 2d, the pattern $\{a, b\}$ is frequent as $AR(a, b) \approx 0.4545 \geq 0.3$.
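The arithmetic behind the example can be reproduced from Equation (16). In the sketch below the function name is hypothetical, and the count of five aligned pairs is an assumption inferred from $0.4545 \approx 5/11$ together with the 4 and 7 instances implied by the participation-ratio example; it is not read from Figure 2d.

```python
def alignment_ratio(n_aligned, n_i, n_j):
    """Equation (16): aligned pairs divided by the total number of instances."""
    return n_aligned / (n_i + n_j)


# Assumed counts: 5 aligned pairs, 4 instances of a, 7 instances of b.
ar_ab = alignment_ratio(n_aligned=5, n_i=4, n_j=7)
theta_ar = 0.3
print(round(ar_ab, 4), ar_ab >= theta_ar)  # 0.4545 True
```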

3.6.2. Alignment Ratio of k-Order Spatial Co-Location Patterns

For a k-order pattern $c_{k} = \{f_{1}, f_{2}, \dots, f_{k}\}$, the alignment ratio is:

$$AR(c_{k}) = \prod_{1 \leq i < j \leq k} AR(f_{i}, f_{j})^{\omega_{ij}}, \tag{19}$$

with $\omega_{ij} = 1/k$, chosen so that each feature pair contributes equally to the overall alignment ratio, reflecting the balanced influence of all pairwise relationships in a k-order pattern. Under this uniform weighting, the weights of the $k(k-1)/2$ feature pairs sum to $(k-1)/2$ (exactly 1 for $k = 3$); the weighting maintains consistency with the participation index and simplifies computation for heterogeneous spatial distributions. Alternatively, the alignment ratio can be defined as:

$$AR(c_{k}) = \frac{|A(c_{k})|}{|I(c_{k})|}, \tag{20}$$

where $A(c_{k}) = \{cl \in T(c_{k}) \mid \forall (o_{i}, o_{j}) \in cl,\ d(o_{i}, o_{j}) \leq \delta_{H}\}$, and $I(c_{k}) = \bigcup_{f_{i} \in c_{k}} O_{i}$.
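The product form in Equation (19) can be evaluated from a table of pairwise alignment ratios. The sketch below assumes those ratios are stored in a dictionary keyed by `frozenset` pairs and uses the uniform weight $\omega_{ij} = 1/k$; the function name and the numeric values are hypothetical.

```python
from itertools import combinations
from math import prod


def k_order_alignment_ratio(features, pair_ar):
    """Equation (19): product of pairwise ratios, each raised to the power 1/k."""
    k = len(features)
    return prod(
        pair_ar[frozenset(pair)] ** (1 / k) for pair in combinations(features, 2)
    )


# Hypothetical pairwise alignment ratios for a third-order pattern.
pair_ar = {
    frozenset({"a", "b"}): 0.5,
    frozenset({"a", "c"}): 0.5,
    frozenset({"b", "c"}): 0.5,
}
print(k_order_alignment_ratio(["a", "b", "c"], pair_ar))  # approximately 0.5
```

For $k = 3$ the three exponents sum to 1, so the result is the geometric mean of the pairwise ratios.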

3.6.3. Decreasing Property of Alignment Ratio for k-Order Patterns

For a $(k+1)$-order pattern $c_{k+1} = c_{k} \cup \{f_{k+1}\}$, the aligned instance set satisfies:

$$A(c_{k+1}) \subseteq A(c_{k}), \tag{21}$$

and the total instance set grows:

$$I(c_{k}) \subseteq I(c_{k+1}). \tag{22}$$

Thus:

$$|A(c_{k+1})| \leq |A(c_{k})|, \quad |I(c_{k+1})| \geq |I(c_{k})|. \tag{23}$$

From Equation (20), the alignment ratio is:

$$AR(c_{k+1}) = \frac{|A(c_{k+1})|}{|I(c_{k+1})|}, \tag{24}$$

implying $AR(c_{k+1}) \leq AR(c_{k})$.

Proof. Since $|A(c_{k+1})| \leq |A(c_{k})|$ and $|I(c_{k+1})| \geq |I(c_{k})|$, the alignment ratio is non-increasing, as $\frac{|A(c_{k+1})|}{|I(c_{k+1})|} \leq \frac{|A(c_{k})|}{|I(c_{k})|}$. □

4. Algorithm and Analysis

This study transforms traditional spatial co-location pattern mining into an alignment problem between instances of spatial features to enhance characterization accuracy and algorithmic performance. First, Voronoi tessellation is applied based on random spatial features, and on this basis, a KD-tree is constructed for each feature subset within each region. Leveraging this, a three-stage allocation strategy (direct allocation, proportional allocation, and residual allocation) is employed in conjunction with a dynamic k-Nearest Neighbor (k-NN) algorithm to initialize instance alignment, mitigating the imbalance in handling sparse and dense regions. When constructing the spatial proximity relationship, directed edges are converted into undirected edges to improve relationship stability, and a transfer substitution strategy is applied to adjust instance pairs that locally violate “spatial proximity,” ensuring geographical rationality. Ultimately, a Spatial Proximity Relationship Graph (SNRG) is generated, providing the foundational structure for pattern mining. The alignment ratio is adopted as the frequency metric, and the Moran’s index [14] is used to validate spatial autocorrelation, enabling the identification of high-frequency proximity patterns while optimizing computational efficiency.

4.1. Algorithm Implementation

The first part of the algorithm (Alignment Process) generates the Spatial Proximity Relationship Graph (SNRG) through six steps, as illustrated in Figure 3. First, it initializes the edge sets; then, it generates a Voronoi diagram to partition the spatial domain based on random features; next, it constructs a KD-tree for the feature instance set in each region; subsequently, for each pair of features, it performs direct allocation (Equation (10)), proportional allocation (Equation (12)), and residual allocation (Equation (14)), utilizing the dynamic k-NN algorithm and the Hausdorff distance concept (Equation (4)) to adaptively align instances and generate directed edges; afterward, it filters bidirectional edges to convert them into undirected edges; finally, it returns the SNRG.
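The partitioning step can be illustrated without a geometry library, because a point lies in the Voronoi cell of the seed nearest to it. The sketch below is a brute-force stand-in under that observation; the function name, seeds, and points are hypothetical, and in practice a spatial library (for example SciPy's `scipy.spatial` module) would likely handle the tessellation and the per-region KD-trees.

```python
from collections import defaultdict
from math import dist


def voronoi_partition(seeds, points):
    """Group points by the index of their nearest seed (their Voronoi cell).

    Points equidistant from several seeds go to the lowest seed index.
    """
    cells = defaultdict(list)
    for p in points:
        s = min(range(len(seeds)), key=lambda idx: (dist(p, seeds[idx]), idx))
        cells[s].append(p)
    return dict(cells)


# Toy seeds and instances, for illustration only.
seeds = [(0, 0), (10, 0)]
points = [(1, 1), (9, 0), (4, 0)]
print(voronoi_partition(seeds, points))
```

Each resulting cell would then receive its own index structure, so that later nearest-neighbour queries only search within a region.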

The second part of the algorithm (Alignment Ratio Calculation) mines frequent patterns through five steps, as shown in Figure 4: First, it initializes the frequent pattern set; then, it calculates the alignment ratio for each pair of features (Equation (16)), filtering second-order frequent patterns; next, it sets the initial order $k = 2$; subsequently, it iteratively generates higher-order candidate patterns, pruning with the alignment ratio monotonicity (Equation (21)), calculating the k-order alignment ratio (Equation (19)) only for candidates whose sub-patterns are all frequent, and adding them to the frequent pattern set if the threshold is met; finally, it returns all frequent patterns.
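The level-wise search with pruning can be sketched in the style of Apriori. The code below is a minimal illustration under assumptions: the alignment ratio is supplied as a callable `ar` over sorted feature tuples, the function name is hypothetical, and the scores in the example are invented for demonstration.

```python
from itertools import combinations


def mine_frequent_patterns(features, ar, theta):
    """Return {pattern: alignment ratio} for all frequent patterns.

    A candidate of order k + 1 is scored only if every order-k sub-pattern
    was frequent, which is the pruning rule described in the text.
    """
    features = sorted(features)
    frequent = {}
    candidates = list(combinations(features, 2))
    while candidates:
        level = {}
        for c in candidates:
            score = ar(c)
            if score >= theta:
                level[c] = score
        frequent.update(level)
        size = len(candidates[0]) + 1  # order of the next candidates
        extended = {
            tuple(sorted(c + (f,))) for c in level for f in features if f not in c
        }
        candidates = sorted(
            c for c in extended
            if all(sub in level for sub in combinations(c, size - 1))
        )
    return frequent


# Hypothetical alignment ratios; patterns not listed score 0.
scores = {("a", "b"): 0.6, ("a", "c"): 0.5, ("b", "c"): 0.4, ("a", "b", "c"): 0.35}
result = mine_frequent_patterns("abcd", lambda c: scores.get(c, 0.0), theta=0.3)
print(sorted(result, key=lambda c: (len(c), c)))
```

In this example every candidate containing `d` is discarded before scoring at order 3, because none of the pairs involving `d` is frequent.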

The time complexity of constructing the Voronoi diagram is $O(n \log n)$, where $n$ is the number of instances of the randomly selected features. The time complexity of building a local KD-tree is $O(|F| \log |F|)$, where $|F|$ is the average number of instances in each Voronoi region. Since a KD-tree is built for each feature in each region, and there are $O(n)$ regions in total, the total complexity is $O(n |F| \log |F|)$. The time complexity of nearest neighbor search is $O(\log |F|)$. In direct allocation, for each point in each region queried against other features, the complexity per region is $O(|F| \log |F|)$, resulting in a total complexity of $O(|F|^{3} n \log |F|)$. The complexities of proportional allocation and residual allocation are also $O(|F|^{3} n \log |F|)$. The complexity of generating directed and undirected edges is $O(e)$, where $e$ is the number of edges, and in the worst case, $e$ is $O(|F|^{2} n)$. The time complexity of calculating the alignment ratio is $O(|F|^{2} n)$, as it requires counting the number of edges between each pair of features. The complexity of identifying frequent patterns is $O(p |F|)$, where $p$ is the number of frequent patterns. In summary, the total time complexity of the algorithm is:

$$O(|F|^{3} n \log |F| + p |F|) \tag{25}$$

5. Experimental Results Analysis

This section analyzes the runtime efficiency and mining results of the proposed algorithm through experiments, comparing it with the traditional join-based pattern mining algorithm from [1] and the CPI-tree mining algorithm from [4]. The algorithm was implemented in Python 3.10, and all programs were run on a PC with an Intel Core i9 CPU, 32 GB RAM, and Windows 11 Professional.

5.1. Experimental Dataset and Parameter Settings

To analyze the efficiency of the algorithm on datasets of varying scales and distributions, this paper uses a randomly generated spatial feature dataset, spatial_data_A (containing 4 features), with 20% of the data sampled as sample_spatial_data for experiments. Based on spatial_data_A, a non-uniformly distributed dataset, spatial_data_B, is generated through transformation. The heatmaps of spatial_data_A and spatial_data_B are presented as subfigures in Figure 5, illustrating their density variations, with higher energy in high-density regions. Additionally, a real-world dataset, Shanghai Point of Interest (POI) (where each POI category is treated as a feature), is used. The Shanghai_POI dataset includes 217 diverse feature categories (e.g., supermarkets, subway stations, banks, pharmacies), capturing complex urban environments with varied functional zones such as commercial, industrial, and residential areas. Due to the large data volume (242,557 instances), two representative features (primary schools and farmers' markets) are selected for visualization in Figure 6 to ensure clarity, while the full dataset with all features is used for pattern mining. The dataset information is detailed in Table 1.

The spatial_data_A dataset exhibits a uniform spatial distribution, suitable for controlled testing, while spatial_data_B introduces non-uniform clustering to simulate varied density patterns. In contrast, the Shanghai_POI dataset reflects complex urban spatial patterns with diverse functional zones, such as high-density commercial districts and lower-density industrial areas, making it representative of real-world heterogeneity. The synthetic datasets, while effective for validating algorithmic robustness, lack the intricate spatial interactions of real urban environments. The Shanghai_POI dataset, with its large scale and high feature diversity, poses computational challenges due to its volume and complexity. HD-SCPM excels in detecting patterns in high-density, heterogeneous urban settings like Shanghai’s Pudong district, where diverse facilities coexist. However, its performance may be limited in sparse regions (e.g., suburban areas) with fewer co-location opportunities or in single-function zones where patterns are less prevalent. Additionally, the computational complexity of $O(|F|^{3} n \log |F|)$ (Section 4.1, Equation (25)) restricts scalability in extremely high-dimensional or dense datasets, indicating a need for optimization in such scenarios.
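To get a feel for how this bound scales, the expression can be evaluated numerically. The sketch below assumes that |F| is the number of features and n the number of instances (the symbols are defined in Section 4.1, outside this section), omits constant factors, and uses the natural logarithm; the function name `complexity_bound` is hypothetical.

```python
import math

def complexity_bound(num_features, num_instances):
    """Evaluate |F|^3 * n * log|F| with constant factors omitted."""
    return num_features ** 3 * num_instances * math.log(num_features)

base = complexity_bound(10, 1000)
# Doubling n doubles the bound.
print(complexity_bound(10, 2000) / base)            # 2.0
# Doubling |F| multiplies it by 8 * log(20) / log(10), roughly 10.4.
print(round(complexity_bound(20, 1000) / base, 1))  # 10.4
```

Under these assumptions, the bound is far more sensitive to the number of features than to the number of instances.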

In the experiments, the default settings of the algorithm’s required parameters—distance threshold, participation threshold, and alignment ratio threshold—across different datasets are listed in Table 2. The join-based and CPI-tree algorithms maintain the same distance and participation thresholds.

5.2. Experimental Evaluation

The research method proposed in this paper, High-Dimensional Spatial Co-location Pattern Mining (HD-SCPM), differs from traditional spatial co-location pattern mining algorithms. This section analyzes the impact of data scale and complexity on the algorithm from two aspects: (1) the impact of the number of instances on algorithm efficiency, and (2) the impact of the number of features on algorithm efficiency. Additionally, it compares HD-SCPM with the join-based [1] and CPI-tree [4] algorithms in terms of runtime efficiency, mining accuracy, noise tolerance, robustness, and pattern mining quality.

5.2.1. Impact of the Number of Instances and Features on the Runtime Efficiency of the HD-SCPM Algorithm

When the number of features is fixed at 10 and the number of instances is 50, 200, 500, 1000, and 2000, respectively, the algorithm’s runtime is shown in Figure 7a. When the number of instances is fixed at 50 and the number of features is 10, 20, 50, 100, and 200, respectively, the runtime is shown in Figure 7b.

Figure 7 reveals that the runtime of the HD-SCPM algorithm grows significantly as the number of instances and features increases. This is primarily because the computational cost of HD-SCPM stems mainly from instance alignment and alignment ratio computation. Specifically, an increase in the number of instances leads to a multiplicative increase in distance calculations and traversal operations. Meanwhile, an increase in the number of features causes a rapid rise in the number of feature combinations: the number of feature pairs grows quadratically, as |F|(|F| − 1)/2, and the number of possible higher-order candidate patterns can grow exponentially in the worst case. This, in turn, affects the construction of the proximity relationship graph, alignment ratio computation, and frequent pattern determination. Consequently, when both the number of instances and features increase simultaneously, the runtime growth becomes more pronounced. This highlights the importance of optimizing these processes to enhance the algorithm’s efficiency in handling large-scale and high-dimensional data.
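The growth in feature pair combinations can be checked directly for the feature counts used in Figure 7b. The snippet below is only a counting illustration: for |F| features there are |F|(|F| − 1)/2 unordered pairs, which `math.comb` computes. The variable names are illustrative.

```python
from math import comb

feature_counts = [10, 20, 50, 100, 200]  # feature counts used in Figure 7b
pair_counts = {f: comb(f, 2) for f in feature_counts}
for f, pairs in pair_counts.items():
    print(f"{f:>3} features -> {pairs:>5} feature pairs")
# 10 -> 45, 20 -> 190, 50 -> 1225, 100 -> 4950, 200 -> 19900
```

Going from 10 to 200 features multiplies the number of pairs by more than 400, before any higher-order candidates are considered.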

5.2.2. Impact of Dataset Scale on Runtime Efficiency

The experiment evaluates the runtime efficiency differences of the three algorithms on the Shanghai_POI dataset as the dataset size continuously increases. When the number of features increases by 5 each time and the number of instances per feature increases by 50 each time, the runtime of the three algorithms is shown in Figure 8.

Figure 8 presents a 3 × 3 matrix visualization, comparing the performance of three algorithms—HD-SCPM (blue), Join-based (green), and CPI-tree (orange)—across three variables: Feature number (range 0–50, peak around 40), Instance number (range 0–500, peak around 400), and Runtime/s (range 0–200 s, peak 0–50 s). The diagonal panels display the kernel density estimation (KDE) distribution for each variable, while the off-diagonal panels show scatter plots of pairwise variable relationships, with points colored by algorithm. The figure indicates that, on small-scale datasets, both HD-SCPM and the Join-based algorithm [1] exhibit lower runtimes compared to the CPI-tree algorithm [4], with HD-SCPM slightly outperforming the Join-based method. However, as the dataset scale increases (with rising Feature number and Instance number), the efficiency of HD-SCPM declines due to time-consuming alignment operations, and the Join-based algorithm also experiences reduced efficiency due to an increase in intermediate results, though it remains slightly better than HD-SCPM on medium-to-small-scale datasets. The CPI-tree algorithm, through reducing redundancy and employing efficient pruning techniques, shows better adaptability on some large-scale datasets, but its runtime growth trend remains noticeable within the current data range, requiring further validation of its overall performance on large scales.

5.2.3. Differences in Pattern Mining Accuracy Among the Three Algorithms

Accuracy comparison experiments were conducted on the sample_spatial_data dataset (containing 29 frequent patterns), the spatial_data_A dataset (containing 174 frequent patterns), and the spatial_data_B dataset (containing 119 frequent patterns). The accuracy of the three algorithms was analyzed first, followed by further evaluation using precision, recall, and F1-score. The accuracy of the three algorithms on spatial_data_A and spatial_data_B is shown in Figure 9a. The precision, recall, and F1-score for mining frequent patterns on spatial_data_A are presented in Figure 9b, and those for sample_spatial_data are shown in Figure 9c. In all panels, the vertical axis represents proportions ranging from 0 to 1, rounded to one decimal place and displayed as percentages.
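Precision, recall, and F1-score over pattern sets can be computed as in the following sketch. It assumes the standard set-based definitions, with a mined pattern counted as correct when it exactly matches a reference frequent pattern; the source does not spell out its matching rule. The function name and the toy feature labels are hypothetical, and the numbers are not results from the paper.

```python
def pattern_metrics(mined, truth):
    """Return (precision, recall, f1) for mined vs. reference pattern sets."""
    mined = {frozenset(p) for p in mined}
    truth = {frozenset(p) for p in truth}
    tp = len(mined & truth)  # patterns that are both mined and in the reference set
    precision = tp / len(mined) if mined else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Toy example with hypothetical feature labels.
truth = [{"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
mined = [{"B", "A"}, {"A", "C"}, {"A", "D"}]
p, r, f1 = pattern_metrics(mined, truth)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# precision=0.67 recall=0.50 f1=0.57
```

Storing patterns as `frozenset` objects makes the comparison independent of the order in which features are listed.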

Figure 9 shows that the HD-SCPM algorithm outperforms both the join-based [1] and CPI-tree [4] algorithms on both small-scale and large-scale datasets. HD-SCPM maintains high accuracy on both balanced and imbalanced datasets, although there is a slight decrease in accuracy on imbalanced data, demonstrating its adaptability to different data distributions. On small-scale data, HD-SCPM achieves high precision but slightly lower recall. In contrast, the join-based and CPI-tree algorithms exhibit high precision, low recall, and consistent performance across different datasets. On large-scale data, HD-SCPM demonstrates high recall with a slight drop in precision, while the other two algorithms experience declines in both precision and recall. Overall, HD-SCPM excels in accuracy, precision, and recall, particularly showing superior performance in complex pattern mining tasks.

5.2.4. Differences in Noise Tolerance Among the Three Algorithms

Noise tolerance comparison experiments were conducted on the sample_spatial_data and spatial_data_A datasets, building upon the previous results, to evaluate how well the three algorithms avoid mistaking injected noise points for frequent pattern points. The noise levels were set at 10%, 20%, and 30%, and the algorithms’ effectiveness in noisy environments was assessed using the noise identification rate, defined as the ratio of identified noise points to the total number of noise points. The noise identification rates of the three algorithms on sample_spatial_data are shown in Figure 10a, while those on spatial_data_A are presented in Figure 10b.
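The noise identification rate defined above can be sketched as follows. Several details are assumptions, since the source does not specify them: noise points are drawn uniformly over the spatial extent, the noise level is taken relative to the number of original instances, and each noise point receives an existing feature label. The function names and the placeholder data are hypothetical.

```python
import random

def add_noise(instances, level, extent=100.0, seed=0):
    """Append round(level * len(instances)) uniformly placed noise points.

    Returns the augmented list and the set of indices that hold noise points.
    """
    rng = random.Random(seed)
    labels = sorted({feature for feature, _, _ in instances})
    noisy = list(instances)
    noise_ids = set()
    for _ in range(round(level * len(instances))):
        noise_ids.add(len(noisy))
        noisy.append((rng.choice(labels), rng.uniform(0, extent), rng.uniform(0, extent)))
    return noisy, noise_ids

def noise_identification_rate(flagged_ids, noise_ids):
    """Ratio of identified noise points to the total number of noise points."""
    if not noise_ids:
        return 0.0
    return len(set(flagged_ids) & noise_ids) / len(noise_ids)

# Placeholder data standing in for sample_spatial_data (140 instances).
base = [("A", float(i % 10), float(i // 10)) for i in range(140)]
noisy, noise_ids = add_noise(base, level=0.2)
flagged = sorted(noise_ids)[:21]  # pretend a miner flagged 21 of the 28 noise points
print(len(noise_ids), noise_identification_rate(flagged, noise_ids))  # 28 0.75
```

In a real experiment, `flagged` would be the set of instances that the mining algorithm leaves out of every frequent pattern instance.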

Figure 10 demonstrates that the HD-SCPM algorithm, which leverages the Hausdorff distance concept for instance alignment, exhibits strong robustness and effectively resists noise. Although its identification rate decreases slightly as noise levels increase, it remains high. In contrast, the join-based algorithm [1] is highly sensitive to noise: its pattern recognition accuracy declines significantly in noisy environments, resulting in poor performance. The CPI-tree algorithm [4] mitigates noise through pruning techniques, but its performance still degrades under high noise levels, showing moderate effectiveness overall.

5.2.5. Differences in Robustness Among the Three Algorithms

Robustness comparison experiments were conducted on the spatial_data_A dataset (containing 174 frequent patterns) and the spatial_data_B dataset (containing 119 frequent patterns) to evaluate the performance of the three algorithms under both uniform and non-uniform density distributions. The evaluation is based on the pattern recognition rate, defined as the ratio of the number of identified frequent patterns to the total number of frequent patterns. The experimental results are shown in Figure 11.

Figure 11 indicates that HD-SCPM and CPI-tree [4] exhibit similar performance on uniformly distributed datasets, while the join-based algorithm [1] has a lower recognition rate. On non-uniformly distributed datasets, HD-SCPM outperforms both join-based and CPI-tree algorithms, with only a slight but stable decrease in recognition rate. In contrast, both join-based and CPI-tree algorithms experience a significant decline in recognition rate, indicating that HD-SCPM has higher robustness in handling non-uniform density distributions.

5.2.6. Differences in Pattern Mining Quality Among the Three Algorithms

The quality of frequent patterns mined by the three algorithms on real-world datasets is evaluated, focusing on two key aspects: pattern complexity and pattern diversity. Pattern complexity examines whether the three algorithms can deeply mine higher-order frequent patterns, while pattern diversity investigates whether they can uncover rarer and more latent frequent patterns. Comparative experiments were conducted on the Shanghai_POI dataset, with 20% of the data randomly sampled for the experiments. The relationships between the number of frequent patterns, the number of categories, and the order of patterns extracted by the three algorithms are shown in Figure 12.
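The two quality measures can be tabulated per pattern order as in the sketch below. It interprets "number of categories" as the number of distinct feature categories appearing in patterns of a given order, which is an assumption about the source's meaning. The function name is hypothetical, and the input is the HD-SCPM column of Table 3, which lists partial results only.

```python
from collections import defaultdict

def summarize_by_order(patterns):
    """Map pattern order -> (number of patterns, number of distinct features)."""
    grouped = defaultdict(set)
    for pattern in patterns:
        grouped[len(pattern)].add(frozenset(pattern))  # the set drops duplicates
    summary = {}
    for order, group in sorted(grouped.items()):
        features = set().union(*group)
        summary[order] = (len(group), len(features))
    return summary

# HD-SCPM column of Table 3 (partial results only).
hd_scpm = [
    {"supermarket", "public restroom", "bus stop"},
    {"industrial park", "training institute", "gym"},
    {"subway station", "commercial street", "office building"},
    {"subway station", "commercial street", "office building", "bank"},
    {"industrial park", "training institute", "gym", "cafe"},
    {"subway station", "commercial street", "office building", "bank", "pharmacy"},
    {"industrial park", "training institute", "gym", "cafe", "convenience store"},
]
print(summarize_by_order(hd_scpm))
# {3: (3, 9), 4: (2, 8), 5: (2, 10)}
```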

Figure 12 shows that as the order increases, the number and variety of frequent patterns mined by all three algorithms decrease. The join-based [1] and CPI-tree [4] algorithms exhibit similar performance at lower orders, but their performance declines at higher orders (above the fourth order). Specifically, the join-based algorithm produces the fewest effective patterns due to repetitive patterns, while the CPI-tree algorithm remains relatively stable but mines fewer high-order patterns. In contrast, HD-SCPM demonstrates the best performance, excelling notably at the third order and remaining effective at higher orders. Overall, HD-SCPM surpasses both the join-based and CPI-tree algorithms in pattern mining quality, particularly showing superior performance at higher orders.

5.3. Pattern Instance Analysis

On the real-world Shanghai_POI dataset, we conducted an instance analysis of the patterns mined by the proposed High-Dimensional Spatial Co-location Pattern Mining (HD-SCPM) algorithm and the CPI-tree algorithm [4], with partial mining results shown in Table 3. The traditional join-based algorithm [1] generates candidate patterns and verifies their frequency, which often leads to redundancy, especially for high-order patterns (fourth order and above), resulting in limited practical utility. The CPI-tree algorithm leverages a tree structure and pruning techniques to reduce redundancy and improve efficiency. Although its high-order patterns remain valuable, redundancy control is still insufficient. In contrast, HD-SCPM excels in high-order pattern mining by capturing global relationships through instance alignment, eliminating noise, and adapting to multi-scale distributions. It introduces the Alignment Ratio (AR) as a replacement for frequency counting, recursively filtering meaningful high-order patterns to ensure accuracy and reliability. Compared to CPI-tree, HD-SCPM is more effective in reducing redundancy, focusing on genuine spatial relationships, and offering greater practical value in complex data mining scenarios, particularly in high-density urban environments like Shanghai, where diverse functional areas coexist.

The CPI-tree and HD-SCPM algorithms share similarities but also exhibit differences in mining spatial patterns. Table 3 presents some representative patterns mined by the two algorithms. Both algorithms can mine common third-order patterns, such as {supermarket, public restroom, bus stop}, which reflect typical combinations of public facilities and commercial areas, commonly found in urban residential zones to support daily convenience. However, HD-SCPM demonstrates a unique advantage in capturing high-order and rare patterns, effectively uncovering refined combinations like {industrial park, training institute, gym}. This pattern reveals the spatial synergy between office spaces, skill development facilities, and fitness amenities in industrial parks, directly supporting urban planning by guiding the allocation of complementary services to meet the needs of professionals in knowledge-driven economies. For instance, city planners can use this pattern to optimize the placement of training institutes and gyms near industrial parks, enhancing employee productivity and well-being.

Furthermore, in fifth-order pattern mining, the pattern extracted by HD-SCPM, {subway station, commercial street, office building, bank, pharmacy}, aligns more closely with real-world urban layouts. This pattern reflects the functional integration of transportation hubs, commercial activities, and essential services in city centers, where pharmacies are strategically located to meet the health demands of large populations in busy areas. Such insights are actionable for urban planners and policymakers, enabling optimized placement of pharmacies to improve public health access or adjust transportation infrastructure to reduce congestion around commercial districts. In contrast, the pattern mined by CPI-tree, {subway station, commercial street, office building, bank, courier service point}, is less reflective of regional functional integration due to the scattered distribution of courier service points, which are often temporary or less tied to urban core functions. This demonstrates that HD-SCPM not only reduces the generation of redundant patterns in high-order mining but also more accurately captures genuine spatial relationships, offering higher practical application value, particularly for optimizing urban service layouts and resource allocation in complex metropolitan areas like Shanghai.

To enhance the practical utility of these findings, we validated the mined patterns through statistical analysis of spatial densities and co-occurrence rates in the Shanghai_POI dataset, supplemented by domain knowledge from urban planning experts. For example, the pattern {subway station, commercial street, office building, bank, pharmacy} was cross-verified with Shanghai’s urban zoning data, confirming its alignment with high-traffic commercial zones. These patterns can guide actionable decisions, such as prioritizing pharmacy locations near office buildings or enhancing bus stop accessibility near supermarkets. Compared to CPI-tree, HD-SCPM’s ability to focus on semantically meaningful patterns reduces false positives and supports precise urban planning and commercial decision-making, as demonstrated in applications like optimizing service facility placements in Shanghai’s Pudong district.

6. Conclusions and Future Work

The High-Dimensional Spatial Co-location Pattern Mining (HD-SCPM) algorithm demonstrates superior performance in mining spatial co-occurrence patterns with complex feature combinations. It employs a three-level allocation mechanism—direct allocation, proportional allocation, and residual allocation—which effectively overcomes the limitations of traditional fixed-threshold methods in terms of noise sensitivity and density heterogeneity. Experimental results show that the HD-SCPM algorithm exhibits strong robustness in handling local outliers. The Alignment Ratio (AR) metric, which employs a cumulative multiplication strategy to quantify the global synergistic relationships between features, significantly improves pattern recall compared to traditional participation models [5]. Additionally, leveraging its diminishing property, HD-SCPM effectively suppresses the generation of redundant patterns in high-order pattern mining. Through a tight integration of theoretical innovation and empirical analysis, this study proposes an adaptive spatial pattern mining method that combines robustness and scalability.
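The interaction between a multiplicative ratio and pruning can be illustrated with a small sketch. It does not reproduce the definition in Section 3.6. It assumes, purely for illustration, that the AR of a k-order pattern is the product of pairwise ratios in [0, 1], so that extending a pattern can never increase its AR. The function names and the pairwise values are hypothetical.

```python
from itertools import combinations
from math import prod

def pattern_ar(pattern, pair_ar):
    """Assumed k-order AR: product of the pairwise ratios inside the pattern."""
    return prod(pair_ar[frozenset(p)] for p in combinations(sorted(pattern), 2))

def mine(features, pair_ar, min_ar):
    """Level-wise search that extends only patterns whose AR meets min_ar."""
    frequent = []
    level = [frozenset(p) for p in combinations(sorted(features), 2)
             if pair_ar[frozenset(p)] >= min_ar]
    while level:
        frequent.extend(level)
        kept = set(level)
        candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        level = [c for c in candidates
                 # Pruning: every (k-1)-subset must already be frequent.
                 if all(c - {f} in kept for f in c)
                 and pattern_ar(c, pair_ar) >= min_ar]
    return frequent

# Hypothetical pairwise ratios for three features (illustrative values only).
pair_ar = {frozenset("AB"): 0.9, frozenset("AC"): 0.8, frozenset("BC"): 0.7}
print(len(mine("ABC", pair_ar, min_ar=0.5)))  # 4: three pairs plus {A, B, C}
print(len(mine("ABC", pair_ar, min_ar=0.6)))  # 3: {A, B, C} falls below 0.6
```

Because every factor is at most 1, a pattern that fails the threshold cannot have a passing superset, which is what lets the search discard whole branches of higher-order candidates.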

Future work will focus on further optimizing the performance of the HD-SCPM algorithm, particularly its processing efficiency on large-scale datasets. We plan to incorporate the tree structure of the CPI-tree algorithm [4] into HD-SCPM to optimize data storage and retrieval, thereby reducing redundant computations and improving overall operational efficiency. Furthermore, we will introduce the pruning strategy of CPI-tree to efficiently filter out irrelevant feature combinations, further enhancing the algorithm’s speed while maintaining mining accuracy [3]. Additionally, future research will explore frequent pattern mining at different scales to meet the demand for patterns at various levels in practical applications, drawing on multi-scale spatial analysis techniques [2,6,7]. This will provide more diverse solutions for complex application scenarios, such as urban planning and environmental monitoring.

Author Contributions

Conceptualization, Xichen Liu and Muquan Zou; software, Xichen Liu; validation, Xichen Liu; writing—original draft preparation, Xichen Liu; resources, Yajie Li; writing—review and editing, Muquan Zou and Yajie Li; supervision, Muquan Zou. All authors have read and agreed to the published version of the manuscript.

Funding

This article is the outcome of a decision-making consultation project in Yunnan Province, partially supported by the following grants: the Special Basic Cooperative Research Program of the Yunnan Provincial Association of Universities (Grants 202101BA070001-152 and 202101BA070001-155), the Yunnan Provincial Philosophy and Social Science Planning Think Tank Project (Grant ZK2024YB15), the Science and Technology Project for Key Industries in Yunnan Higher Education (Grant FWCY-QYCT2024016), and the Specialized Project on Teacher Education of Yunnan Provincial Education Science Planning (Grant GJZ2412).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are not publicly available due to institutional ownership restrictions, as they are proprietary laboratory assets restricted to internal research use.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Meshram, S.; Wagh, K.P. Spatial co-location pattern mining—A survey. In Fourth Congress on Intelligent Systems; Springer: Singapore, 2023; pp. 265–280.
2. Li, J.; Wang, L.; Yang, P.; Zhou, L. A novel algorithm for efficiently mining spatial multi-level co-location patterns. IEEE Trans. Knowl. Data Eng. 2024, 36.
3. Garaeva, A.; Makhmutova, F.; Anikin, I.; Sattler, K.U. A framework for co-location patterns mining in big spatial data. In Proceedings of the 2017 20th IEEE International Conference on Soft Computing and Measurements, SCM 2017, St. Petersburg, Russia, 24–26 May 2017; pp. 477–480.
4. Wang, L.; Bao, Y.; Lu, J.; Yip, J. A new join-less approach for co-location pattern mining. In Proceedings of the 2008 IEEE 8th International Conference on Computer and Information Technology, Sydney, NSW, Australia, 8–11 July 2008; pp. 197–202.
5. Wang, L.; Bao, Y.; Lu, Z. Efficient discovery of spatial co-location patterns using the iCPI-tree. Open Inf. Syst. J. 2009, 3, 69–80.
6. Feng, S.; Wang, L.; Fang, Y. Mining spatial collocation patterns with dominant features based on fuzzy proximity relations. Comput. Sci. Appl. 2021, 11, 176.
7. Chang, X.; Lu, J.; Chen, S.; Duan, P. Spatial collocation pattern mining method based on improved column calculation. Appl. Res. Comput. 2024, 41.
8. Hu, Z.; Wang, L.; Tran, V.; Chen, H. Efficiently mining spatial co-location patterns utilizing fuzzy grid cliques. Inf. Sci. 2022, 592, 361–388.
9. Huang, Y.; Shekhar, S.; Xiong, H. Discovering colocation patterns from spatial data sets: A general approach. IEEE Trans. Knowl. Data Eng. 2004, 16, 1472–1485.
10. Yoo, J.S.; Shekhar, S. A partial join approach for mining co-location patterns: A summary of results. In Proceedings of the 2005 SIAM International Conference on Data Mining, Newport Beach, CA, USA, 21–23 April 2005; pp. 556–560.
11. Yoo, J.S.; Shekhar, S.; Celik, M. A join-less approach for co-location pattern mining: A summary of results. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM’05), Houston, TX, USA, 27–30 November 2005; pp. 813–816.
12. Sheshikala, M.; Rao, D.R.; Prakash, R.V. A map-reduce framework for finding clusters of colocation patterns—A summary of results. In Proceedings of the 2017 IEEE 7th International Advance Computing Conference (IACC), Hyderabad, India, 5–7 January 2017; pp. 129–131.
13. Golfarelli, M.; Wrembel, R. Big data analytics and knowledge discovery. Data Knowl. Eng. 2023, 146, 102197.
14. Tran, V.; Tran, D.; Le, A.; Ha, T. Efficiently discovering spatial prevalent co-location patterns without distance thresholds. In Proceedings of the 25th International Conference, iiWAS 2023, Bali, Indonesia, 4–6 December 2023; pp. 447–463.
15. Chen, S.; Lu, J. Mining algorithm of fuzzy instance space collocation pattern based on Voronoi diagram and distance decay effect. Hans. J. Data Min. 2024, 14, 65–80.
16. Wang, L.; Fang, Y.; Zhou, L. Mining spatial prevalent co-location patterns based on graph databases. J. Front. Comput. Sci. Technol. 2022, 16, 48–58.
17. Zhang, X.I. Mining spatial co-location kernel patterns based on Voronoi diagrams. Chin. J. Comput. 2022, 45, 1908–1925.
18. Wang, X.; Wang, L.; Chen, H.; Zeng, L. Identifying relationship between pollution sources and cancer cases with spatial ordered pair patterns. Data Anal. Knowl. Discov. 2021, 5, 14–31.
19. Tran, D.H.; Le, A.T.; Ha, T.N. Discovering Prevalent Co-Location Patterns in Different Density Spatial Data Without Distance Thresholds. Master’s Thesis, FPT University, Hanoi, Vietnam, 2023.
20. Qian, F.; Chiew, K.; He, Q.; Huang, H.; Ma, L. Discovery of regional co-location patterns with k-nearest neighbor graph. In Advances in Knowledge Discovery and Data Mining, 1st ed.; Springer: Berlin, Germany, 2013; pp. 174–186.
21. Qian, F.; Chiew, K.; He, Q.; Huang, H.; Ma, L. Mining regional co-location patterns with k-NNG. J. Intell. Inf. Syst. 2014, 42, 485–505.
22. Liturri, G. Temporal Co-Location Pattern Discovery in Spatiotemporal Data Through Parallel Computing. Ph.D. Dissertation, Politecnico di Torino, Turin, Italy, 2023.
23. Bui, K.H.N.; Cho, J.; Yi, H. Spatial-temporal graph neural network for traffic forecasting: An overview and open research issues. Appl. Intell. 2022, 52, 2763–2774.
24. Ver Hoef, J.M.; Peterson, E.E.; Hooten, M.B.; Hanks, E.M.; Fortin, M.J. Spatial autoregressive models for statistical inference from ecological data. Ecol. Monogr. 2018, 88, 36–59.
25. Liu, S.; Qiao, P.; Ye, Z.; Li, W.; Dou, Y. DistGrid: Scalable scene reconstruction with distributed multi-resolution hash grid. arXiv 2024, arXiv:2405.04416.
26. Ran, X.; Xi, Y.; Lu, Y.; Wang, X.; Lu, Z. Comprehensive survey on hierarchical clustering algorithms and the recent developments. Artif. Intell. Rev. 2023, 56, 8219–8264.
27. Seydi, S.T.; Bozorgasl, Z.; Chen, H. Unveiling the power of wavelets: A wavelet-based Kolmogorov-Arnold network for hyperspectral image classification. arXiv 2024, arXiv:2406.07869.
28. van Kreveld, M.; Miltzow, T.; Ophelders, T.; Sonke, W.; Vermeulen, J.L. Between shapes, using the Hausdorff distance. Comput. Geom. 2022, 100, 101817.
29. Jungeblut, P.; Kleist, L.; Miltzow, T. The complexity of the Hausdorff distance. Discrete Comput. Geom. 2024, 71, 177–213.
30. Huang, Y.; Pei, J.; Xiong, H. Mining co-location patterns with rare spatial features. Geoinformatica 2006, 10, 239–271.
31. Shekhar, S.; Lu, C.; Zhang, P. A unified approach to spatial outliers, co-location and clustering. IEEE Trans. Knowl. Data Eng. 2003, 16, 1316–1329.
32. Qi, L.; Han, J.; He, M. Weighted co-location pattern mining in spatial data sets. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Beijing, China; 2012; pp. 134–141.
33. Azab, A.M.; Ahmadi, H.; Mihaylova, L.; Arvaneh, M. Dynamic time warping-based transfer learning for improving common spatial patterns in brain–computer interface. J. Neural Eng. 2020, 17, 016061.

Figure 1. Comparison of proximity relationships between the traditional threshold-based method and HD-SCPM. (a) Traditional method, where instances $a_{2}$ and $b_{6}$ are connected by a black solid line indicating a proximity relationship. (b) HD-SCPM method, where $a_{2}$ and $b_{6}$ are connected by a light gray solid line, indicating they do not satisfy the bidirectional alignment condition and are not considered neighbors.

Figure 2. Evolution of the three-stage alignment process: (a) Direct allocation with green dashed arrows, (b) Proportional allocation with orange dashed arrows, (c) Residual allocation with purple solid arrows, (d) Final proximity graph.

Figure 3. Algorithm 1.

Figure 4. Algorithm 2.

Figure 5. Heatmaps of spatial_data_A and spatial_data_B. Cool tones represent sparse data distribution, warm tones represent dense data distribution, and the transition from cool to warm tones indicates the process of data distribution from sparse to dense. (a) Heatmap of spatial_data_A showing density variations with higher energy in high-density regions. (b) Heatmap of spatial_data_B illustrating density variations of the non-uniformly distributed dataset.

Figure 6. Distribution of two representative features (primary schools and farmers’ markets) from the Shanghai_POI dataset, selected for visualization to avoid overcrowding due to the high density of the full 217-feature dataset.

Figure 7. Impact of instance and feature quantity on runtime efficiency of HD-SCPM. (a) Runtime with varying number of instances and fixed number of features. (b) Runtime with varying number of features and fixed number of instances.

Figure 8. Impact of dataset size on algorithm efficiency.

Figure 9. Pattern mining accuracy comparison across different datasets. (a) Accuracy of the three algorithms on spatial_data_A and spatial_data_B. (b) Precision, recall, and F1-score on spatial_data_A. (c) Precision, recall, and F1-score on sample_spatial_data.

Figure 10. Noise tolerance comparison across different datasets. (a) Noise identification rate of the three algorithms on sample_spatial_data. (b) Noise identification rate on spatial_data_A.

Figure 11. Pattern recognition rates of HD-SCPM, join-based, and CPI-tree algorithms on spatial_data_A and spatial_data_B.

Figure 12. Relationship between pattern mining quality and order for HD-SCPM, join-based, and CPI-tree algorithms on Shanghai_POI.

Table 1. Dataset information.

Dataset	Number of Features	Number of Instances	Spatial Range
spatial_data_A	4	700	100 × 100
sample_spatial_data	4	140	100 × 100
Shanghai_POI	217	242,557	120°–122° E, 30°–32° N

Table 2. Parameter settings.

Dataset	Distance Threshold	Participation Threshold	Alignment Ratio Threshold
spatial_data_A	5	0.3	0.3
sample_spatial_data	5	0.3	0.3
Shanghai_POI	100	0.5	0.5

Table 3. Partial patterns discovered by CPI-tree and HD-SCPM algorithms.

Pattern Order	CPI-tree Algorithm	HD-SCPM Algorithm
3rd Order	{supermarket, public restroom, bus stop}, {subway station, commercial street, office building}	{supermarket, public restroom, bus stop}, {industrial park, training institute, gym}, {subway station, commercial street, office building}
4th Order	{supermarket, public restroom, bus stop, restaurant}, {subway station, commercial street, office building, bank}	{subway station, commercial street, office building, bank}, {industrial park, training institute, gym, cafe}
5th Order	{supermarket, public restroom, bus stop, restaurant, convenience store}, {subway station, commercial street, office building, bank, courier service point}	{subway station, commercial street, office building, bank, pharmacy}, {industrial park, training institute, gym, cafe, convenience store}

Note: Italicized patterns are mined by the HD-SCPM algorithm but not by the CPI-tree algorithm, while bold patterns are mined by the CPI-tree algorithm but not by the HD-SCPM algorithm.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Published by MDPI on behalf of the International Society for Photogrammetry and Remote Sensing. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
