1. Introduction
People often perceive the world through named places. Thus, knowledge of named places can serve as a form of georeferencing knowledge. As science and technology continue to penetrate all levels of society, big data are continuously generated from various sources, such as sensor systems, user-generated content in social networks, socio-economic data, and hybrid data sources, including linked data and synthetic data [
1]. Most of these data involve named places, which effectively stimulates geospatial knowledge sharing among various sectors in cities [
2]. From the ontological perspective, place entities can be formalized using ontological approaches. The Semantic Web and its best practice, Linked Data, help to publish place entities on the Web using ontologies and Resource Description Framework [
3]. It is feasible to organize knowledge and details about place entities according to Linked Data principles and create a place knowledge graph on the Web [
4]. The above ideas can result in a paradigm shift for Geographic Information Science [
5,
6,
7]. From distributed databases accessed through Web services to knowledge represented as graphs, place nodes serve as a powerful “glue” to link any other data to its georeference, thus enabling integration of information across domains.
In the construction of multi-source data knowledge graphs that utilize location nodes as geographic references, place entity matching emerges as a pivotal technology [
7,
8,
9,
10]. Models for matching place entities can be categorized into two primary types, based on the nature of the data: textual matching and spatial matching. Textual matching processes largely depend on measuring similarities in the text-based attributes of entities, such as the congruence of place names and the likeness of categories, to determine if multiple source place entities refer to the same real-world location [
11,
12,
13]. Conversely, spatial matching utilizes geometric measures of spatial coordinates, including spatial distances, as significant indicators of match suitability [
14,
15,
16]. For the task of matching place entities, traditional rule-based methods necessitate the selection of matching factors for the computation of similarities, setting thresholds for each to ensure proper alignment. The most prevalent technique for amalgamating these similarities is the weighted sum approach within mathematical models, which assigns scores to each matching factor according to its relative weight. To diminish the level of manual intervention and circumvent the need for empirical weight configurations, recent advancements have been made in machine learning-based matching methods. Studies such as those by McKenzie et al. [
11] and Santos et al. [
17] have showcased models trained on manually annotated datasets, which serve as a substitute for manual parameter settings.
However, several challenging issues are hindering the advancement of geographical entity matching: (1) Rule-based methods are hampered by manual experiential interference and the tedious nature of feature selection and calculation, preventing the matching models from achieving uniformity across diverse, heterogeneous geographic data sources. (2) Existing machine learning approaches to geographical entity alignment generally depend on character distance similarity of toponyms, lacking the capability to capture deep semantic features and thus limiting further enhancement of place entity matching performance. (3) When multi-source place data lack attribute fields, geographic entity matching methods grounded in feature engineering may prove ineffective. Therefore, the demands on matching models in place entity matching tasks are considerably higher, necessitating the ability to rapidly, accessibly, and accurately align data from disparate sources for aggregation and unification [
18].
With the advent of pre-trained models, new approaches to address these requirements and challenges have been introduced. Pre-trained models, exemplified by Bidirectional Encoder Representations from Transformers (BERT) [
8] and Generative Pre-trained Transformer (GPT) [
19], acquire a broad understanding of language through unsupervised or semi-supervised learning on extensive text corpora. These models, particularly leveraging the Transformer architecture, effectively capture the bidirectional contextual information of language. Extensive research has validated their exemplary performance across various natural language processing downstream tasks. However, the efficacy of these models in addressing the domain-specific needs of place entity matching remains unverified. Place entity matching differs significantly from general domain entity matching, primarily due to its emphasis on spatial characteristics [
20]. A critical area of investigation is integrating spatial feature spaces with textual feature spaces within the model. This integration involves not only understanding the linguistic aspects, but also accurately interpreting and aligning the spatial dimensions of place entities. This would entail enhancing pre-trained models with spatial awareness and tailoring them to better accommodate the unique requirements of geographical data representation and alignment [
21].
In this paper, we propose a Semantic-Spatial Aware Representation Learning Model (SSARLM) for place matching, an end-to-end place entity matching approach based on a large-scale pre-trained model. This approach transcends the cumbersome manual feature extraction steps that are a staple of traditional machine learning models and instead implements a comprehensive place entity matching pipeline system. This system is adept at supporting swift and dynamic updates in geospatial data-intensive tasks.
Our investigation particularly scrutinizes the multifaceted challenges that emerge when fusing diverse datasets to construct a robust knowledge graph. These challenges include, but are not limited to, linguistic variations, orthographic disparities, geometric inconsistencies, temporal discrepancies, and categorical ambiguities. We meticulously demonstrate the refinements and advancements that are brought about by incorporating rule-based models, machine learning models, as well as the more recent pre-trained large models and extensive language models when tackling these complexities.
The experimental data for this study is sourced from the authoritative Baidu Maps, the user-generated OSM, and the digital gazetteer GeoNames. Utilizing the methodology outlined herein, we publish our place entity linkage as an open data source for the place knowledge graph (PlaceKG).
Our contributions are threefold:
We present a Semantic-Spatial Aware Representation Learning Model for Place Matching and Fusion based on a pre-trained large model that achieves a unified mapping of location feature spaces and textual feature spaces.
We evaluate the capabilities of different types of models for the task of place matching, and present a granular showcase of the potential improvements offered by rule models, machine learning models, pre-trained large models, and large language models in response to nine types of case challenges.
We construct the PlaceKG and validate its utility in the realm of location querying and Location-Based Services, furthering the field of Geographic Information Science.
2. Related Work
The place knowledge graph in this paper follows a semantic format based on linked data principles and is constructed by conflating and structuring existing named place datasets. This knowledge graph is oriented toward the place domain. It helps to solve many geospatial technology challenges, such as named entity recognition (NER) [
22], toponym disambiguation [
23], and POI recommendation [
24]. For example, the NER task can recognize and predict the geographical coordinates of named place entities in text documents such as web pages, blogs, encyclopedia articles, news stories, tweets, and travel notes, a task referred to as geographic resolution or geocoding. This work can connect unstructured text with structured GIS entities [
25,
26]. Researchers claim that obtaining extensive gazetteer information is central for NER [
27]. A place knowledge graph that fuses various named places on the network can help address this concern.
At the conceptual level, a POI usually has the following attributes: name, current location, category, and identifier. As an essential component of spatial data infrastructure (SDI), a POI is characterized by type and often uses a name rather than a location to identify a place [
28]. At the application level, POI can be used as a reference point in requesting location-based services, such as the destination of path navigation. The research on place data conflation can be traced back to the early works of digital map conflation and digital gazetteer conflation [
29,
30]. In the GIS context, POI data is usually the object of conflation. The term “conflation” describes integrating data from heterogeneous data sources, combining geographical information of different scales and precisions, and transferring or adding attributes from one dataset to another [
30]. POI conflation aims to determine whether POIs from different data sources represent the same place in the physical world, resolve ambiguity in attribute values such as names and geographical locations, and integrate matched POIs. It often involves the following steps (a minimal code sketch follows the list) [
12,
13,
31]:
Pre-processing: unifying POI datasets into the same data structure and spatial coordinate system and mapping POI categories or types into a common taxonomy.
Candidate selection: using a set of conditions to select candidates from POI datasets.
Similarity measure: computing the attribute similarity of POI candidates, such as spatial similarity, name similarity, and type similarity.
Matching evaluation: aggregating different similarity measures to obtain an overall value that can rate the matching relevance and evaluate whether the POI candidates are matched.
Property conflation: conflating the property values from matched POI candidates. The properties can be either overlapped or complementary. In the former case, merging overlapping information often requires conflict-handling strategies. In the latter case, the final POI entity usually has various properties from different candidates.
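To make these steps concrete, the following minimal sketch wires Steps 1-4 together for point features. The field names, thresholds, and weights are illustrative assumptions, not the configuration used in this paper.

```python
# Minimal sketch of the POI conflation pipeline (Steps 1-4).
import math
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass
class POI:
    name: str
    ptype: str   # type mapped into a common taxonomy (Step 1)
    lat: float   # WGS84 latitude (Step 1: unified coordinate system)
    lon: float   # WGS84 longitude

def haversine_m(a: POI, b: POI) -> float:
    """Great-circle distance in metres between two POIs."""
    r = 6371000.0
    p1, p2 = math.radians(a.lat), math.radians(b.lat)
    dp, dl = p2 - p1, math.radians(b.lon - a.lon)
    h = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(h))

def name_sim(a: POI, b: POI) -> float:
    """A simple string-based name similarity in [0, 1]."""
    return SequenceMatcher(None, a.name.lower(), b.name.lower()).ratio()

def candidates(src, dst, max_dist=1000.0):
    """Step 2: keep only pairs within a spatial search radius."""
    return [(s, d) for s in src for d in dst if haversine_m(s, d) <= max_dist]

def match_score(s: POI, d: POI, w_name=0.5, w_space=0.3, w_type=0.2) -> float:
    """Steps 3-4: weighted sum of attribute similarities (weights assumed)."""
    spatial = 1.0 - min(haversine_m(s, d), 1000.0) / 1000.0
    type_s = 1.0 if s.ptype == d.ptype else 0.0
    return w_name * name_sim(s, d) + w_space * spatial + w_type * type_s

src = [POI("Wuhan Railway Station", "transport", 30.610, 114.424)]
dst = [POI("Wuhan Station", "transport", 30.611, 114.425)]
matched = [(s, d) for s, d in candidates(src, dst) if match_score(s, d) > 0.7]
```

In practice, Step 2 would rely on a spatial index rather than the quadratic scan shown here, and Step 5 would add conflict-handling rules for the matched pairs.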
Among these steps, the matching stage (Steps 3 and 4) is the most important in data conflation, and receives the most attention in existing efforts. For Step 3, i.e., similarity measures, various work has been conducted on spatial similarity, name similarity, as well as semantic similarity [
11,
12,
13]. In Step 4, i.e., matching evaluation, some methods have been proposed, including the regression-based weighted model [
11], the entropy-weighted model [
12], the graph-based matching approach [
13], and the Random Forest Classifier-based matching approach [
17].
3. Data Sources for Named Places
Named place data are available through either authorities or geographical vendors. The former includes data provided by toponymic agencies or gazetteer services, and the latter includes POI data provided by crowdsourcing contributions such as OSM and WikiMapia. In this paper, we select one gazetteer,
GeoNames.org (accessed on 21 November 2023), which is one of the most famous accessible geographical databases in the world. For crowdsourcing data sources, we select POIs from OSM, one of the most representative sources of Volunteered Geographic Information (VGI). Furthermore, on the local scale, we select the POI data from one of the most significant Location Based Service (LBS) providers in China, Baidu Map. The empirical study using these datasets helps address the challenges in conflation caused by variability in language, spelling, historical changes, and feature types.
Table 1 provides a comparison of the three data sources. Detailed descriptions are given as follows.
3.1. Data Sources
3.1.1. GeoNames
GeoNames is a free geographical database for named geographical features [
32]. The dataset currently includes up to 13 million places, and the places are divided into nine categories and 680 subcategories. The fields in the dataset include some basic administrative information, such as administrative divisions, populations, and country codes. Some place entries in this dataset have aliases, such as names in different languages, short names, historical names, and colloquial or slang names. As for historical names, the start and end time stamps (valid dates) should be considered. All position coordinates are in World Geodetic System 1984 (WGS84).
3.1.2. OSM
OSM is a free online map built through crowdsourcing VGI [
33]. The geometries of geographic features are represented by three basic data elements: nodes, ways, and relations. A node is a specific point on Earth, defined by its latitude and longitude in the WGS84 standard. A way is an ordered list of nodes. A relation records relationships among two or more data elements, such as a polygon relation. Each data element carries common attributes related to its creation and editing, such as the user who last modified the element (user/uid), the time of the last modification (timestamp), the version number of the edit (version), and the changeset number for a group of changes (changeset). Feature attributes are represented using tags attached to the data elements. A tag is a key–value pair with free-format text. The community can agree on certain key–value combinations for the most commonly used tags as consensus-based informal standards, such as the classification tag “highway=footway”. Currently, 24 types of primary point features are defined using well-accepted tags [33]. The primary features used as keys in tags can act as types, and their values can be regarded as subtypes. This paper selects 1511 frequently used subtypes. Although tags can be invented and used as needed, tags documented by at least one wiki page can usually be treated as types/subtypes. Notably, historical changes are also tagged as significant features of a particular type. OSM also allows additional properties to be defined using tags, such as “name”, “name:<lg>” for names in different languages, “alt_name” for alternative names, and “old_name” for historical names. The free tagging system makes OSM flexible, but hard to work with; in practice, widely used tags serve as de facto standards.
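For illustration, a node and its tags can be pictured as the following key–value structure; the identifiers and tag values are hypothetical but follow the conventions described above.

```python
# Hypothetical OSM node: common creation attributes plus free-form tags.
node = {
    "id": 123456789,                      # element identifier (hypothetical)
    "lat": 23.1552, "lon": 113.2589,      # WGS84 position
    "user": "mapper42", "uid": 1001,      # last modifying user
    "timestamp": "2023-11-21T00:00:00Z",  # last modification time
    "version": 4, "changeset": 987654,    # edit bookkeeping
    "tags": {
        "railway": "station",             # primary feature key acts as type
        "name": "广州东站",                 # default name
        "name:en": "Guangzhou East Railway Station",
        "alt_name": "广州东",               # alternative name
        "old_name": "天河站",               # historical name (hypothetical)
    },
}
```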
3.1.3. Baidu Map
LBS services have been widely used in China. Baidu Inc., a Chinese-language internet search company, provides LBS services [
34] and is rated as one of the top 10 LBS providers in the world. Baidu offers POI data in China in the Chinese language. As mandated by the governmental agency in China, most domestic online map services use the “GCJ02” coordinate system; Baidu provides Chinese POI data in the encrypted coordinate system “BD09”, which is converted from “GCJ02” and unique to Baidu. Currently, Baidu Chinese POI data comprises 23 categories and 153 subtypes. Each record includes the city where the POI is located, its search popularity (i.e., heat), and some optional detailed attribute information.
3.2. Challenges
When examining the three datasets, several challenging issues must be addressed. Taking named place entities in China as examples, our paper lists the issues to be resolved during the named place matching and conflation process in
Table 2. The differences are highlighted in bold, and the Chinese–English translations are provided in
Appendix A.
Issue I1 concerns the multilingual problem of place names in different data sources. For example, the place Guangzhou East Railway Station is expressed as “Guangzhou Dong” in English in GeoNames and presented as “广州东站” in Chinese in both OSM and Baidu Map. Issues I2–I6 focus on spelling changes of place names across data sources, including case sensitivity (“Sultan Turkish Restaurant” vs. “SULTAN TURKISH RESTAURANT”), misspelling (“珞瑜路” vs. “珞喻路”), word order (“地铁浔峰岗站” vs. “浔峰岗地铁站”), abbreviation (“广州火车站” vs. “广州站”), and synonyms (“从化区办证中心” vs. “从化区政务服务中心”). I7 and I8 are location variability issues across data sources, such as location values expressed in different coordinate systems and differing coordinate precisions. I9 shows the type variability: the same named place can have different type names in different categories across data sources.
4. Approach
Based on the preceding context, it is evident that the matching process necessitates addressing the disparities in names, locations, and types of named places in GeoNames, OSM, and Baidu Map. Our experiment primarily focuses on resolving such issues flexibly and conveniently.
The construction process of PlaceKG, as illustrated in
Figure 1, begins with data cleansing to obtain preliminary data from these three sources. Given that there will be
n × m potential matches between two data sources containing n and m named locations, respectively, the sheer volume of data can make the precise matching process exceedingly laborious. To mitigate this, we initially filter these pairs based on threshold values to reduce the number of mismatched pairs. This section then introduces the SSARLM for merging named place data from diverse sources, simplifying and enhancing similarity measurement. We endow a large-scale pre-trained model with spatial awareness via a location encoding strategy and use the candidate set data to fine-tune the model. The overall framework of the SSARLM model is depicted in
Figure 2.
4.1. Place Entity Serialization
In general natural language tasks, pre-trained models accept token sequences (i.e., text) as input. Similarly, named place entity data require serialization before being fed into the matching model, which facilitates subsequent encoding by the model. For multi-source heterogeneous geographic entity data, effective serialization is crucial for the model to optimally assimilate and process the data information. This step is fundamental in ensuring that the model accurately interprets and utilizes the geographic information embedded within these diverse data streams.
Data pairs composed of two places from two different data sources have their attributes, such as names, types, and locations, transformed into token sequences. This transformation optimizes the model’s encoding process, thereby enhancing its capacity to ascertain whether the candidate pair signifies identical geographical features. In serializing the attributes and values of a place entity from a specific data source, a marker is used to denote each attribute and its value for an individual place. Writing [ATTR] for this marker, if N, T, L represent the attributes of name, type, and location, respectively, and their lowercase forms n, t, l denote the corresponding attribute values, then a geographic entity e can be represented as

serialize(e) = [ATTR] N n [ATTR] T t [ATTR] L l

This structured representation is instrumental in the model’s processing and analysis of the geographical data.
For a pair of place entities constituting a data pair, upon joint input into the model, [CLS] and [SEP] tokens are generated for separation and subsequent classification computations. It is key to note that [CLS] and [SEP] are BERT model-specific notations for the start and separation of entities. The [SEP] token serves as a delimiter marking the boundary between the two entities in the data pair, with the beginning and end of the data pair marked by [CLS] and [SEP], respectively. Once processed by the model, the vector generated at the [CLS] position is fed into a fully connected layer for classification calculation. Therefore, the serialization result of a data pair (e1, e2) can be represented in the following format:

[CLS] serialize(e1) [SEP] serialize(e2) [SEP]
This structure ensures that the model effectively discerns the start, separation, and end of each data pair, facilitating accurate classification and analysis of the geographic entities.
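A minimal sketch of this serialization follows, assuming the [ATTR] marker introduced above; [CLS] and [SEP] are the standard BERT special tokens, and the attribute names and sample values are illustrative.

```python
# Sketch of place-entity serialization for the matching model.
# "[ATTR]" is an illustrative marker string; the exact marker vocabulary
# used in the implementation may differ.
def serialize_entity(place: dict) -> str:
    parts = []
    for attr in ("name", "type", "location"):
        parts.append(f"[ATTR] {attr} {place[attr]}")
    return " ".join(parts)

def serialize_pair(p1: dict, p2: dict) -> str:
    """Join two serialized entities into one [CLS] ... [SEP] ... [SEP] sequence."""
    return f"[CLS] {serialize_entity(p1)} [SEP] {serialize_entity(p2)} [SEP]"

pair = serialize_pair(
    {"name": "Wuhan Station", "type": "railway_station", "location": "30.610,114.424"},
    {"name": "武汉站", "type": "train_station", "location": "30.611,114.425"},
)
```

In practice, the tokenizer of the pre-trained model inserts [CLS] and [SEP] automatically when given the two serialized entities as a sentence pair.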
4.2. Fine-Tuning Pre-Trained Language Models
This study proposes an end-to-end approach for place entity matching using pre-trained models, wherein the matching model is trained by fine-tuning these existing pre-trained architectures. Typical pre-trained models like BERT and GPT demonstrate robust performance across various Natural Language Processing (NLP) tasks. These models are usually composed of deep neural networks with multiple transformer layers, and are trained using unsupervised techniques on extensive text corpora, such as Wikipedia articles. During this pretraining phase, the models enhance their ability to understand sentence semantics by autonomously learning to predict missing tokens and subsequent sentences. This capability stems from the Transformer architecture’s ability to generate token embeddings from all tokens in an input sequence, thereby producing highly contextualized embeddings that encapsulate both the semantic and contextual understanding of words. Consequently, these embeddings adeptly capture polysemy, recognizing that a word can have different meanings in different phrases. For example, place names such as “Wuhan Station” and “Wuhan Railway Station” may still acquire similar word embeddings, despite the former lacking the key word ‘railway’. This similarity arises from training on extensive corpora, as the embeddings in pre-trained models reflect the semantic regularities they have assimilated.
In this experiment, we utilized the case-sensitive pre-trained model DistilBERT provided by Hugging Face as the foundational model. The BERT pre-trained model is fundamentally developed through self-supervised training on a large-scale corpus, essentially constructed using a stacked architecture of multiple transformer layers. DistilBERT, as a distilled version of the BERT model, employs the same training corpus. However, it boasts a parameter size that is only 60% of that of BERT, while retaining 97% of its language understanding capability. Additionally, the model speed is enhanced by 60%. This efficient architecture of DistilBERT ensures a balance between computational resource requirements and the retention of substantial language processing proficiency, making it a suitable choice for our experiment’s objectives.
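A sketch of the corresponding setup with the Hugging Face transformers library is shown below; the checkpoint name distilbert-base-cased matches the cased DistilBERT described above, while the sample inputs and the choice to read the match probability from the second logit are illustrative assumptions.

```python
# Fine-tuning setup sketch using Hugging Face transformers.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-cased", num_labels=2)  # binary: match / mismatch

# Tokenize one serialized candidate pair; [CLS]/[SEP] are added automatically.
enc = tokenizer("name Wuhan Station type railway_station",
                "name Wuhan Railway Station type train_station",
                truncation=True, max_length=256, return_tensors="pt")
logits = model(**enc).logits                      # shape: (1, 2)
prob_match = torch.softmax(logits, dim=-1)[0, 1]  # probability of "match"
```

The classification head is randomly initialized here; fine-tuning on labeled place pairs is what adapts it to the matching task.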
4.3. Semantic-Spatial Aware Representation Learning Model
It is well-known that locations are typically represented by latitude and longitude coordinate values. In multi-source data, issues such as coordinate drift and inconsistencies in decimal places among different data sources are common. When pre-trained models are used to directly encode these positional coordinates, the numeric values are treated as ordinary token strings without any principled arithmetic structure. This approach often fails to adequately address the aforementioned coordinate issues, consequently hindering its ability to effectively represent spatial location similarity. This limitation underscores the necessity for more sophisticated methodologies or preprocessing steps to ensure models accurately capture and reflect spatial relationships and similarities, particularly when dealing with diverse and inconsistent geographical data sources.
In addition, the use of location similarity calculation formulas presents challenges in integrating their computed outcomes with pre-trained models. This arises from the fact that similarity calculations require the separate extraction of coordinate values to obtain corresponding results, while pre-trained models typically process the entire data pair in a unified manner. Moreover, approaches based on location similarity calculations often neglect the original location information, relying solely on the outcomes derived from empirical formulas. This can lead to an over-reliance on empirical methods, potentially resulting in inaccuracies. Even small location deviations can inadvertently lead to the incorrect exclusion of candidate pairs. This highlights the need for a more balanced approach that integrates empirical calculations while retaining inherent location data.
Therefore, this study introduces an embedding fusion module designed for the unified encoding of semantic and spatial embeddings. In the previous work by Mai et al. [
35], a distributed location encoding approach is constructed to generate spatial embeddings. This approach employs methods such as unit vector inner product and periodic functions to transform two-dimensional positional coordinates into dense vectors of the same dimensionality as the pre-trained model. These vectors, when combined with other vectors encoded by the training model, are designed to convey data information more effectively, thereby enabling the training of a more precise overall similarity measure. Fundamentally, this approach can overcome the limitations of pre-trained models in location encoding, avoiding excessive reliance on empirical formulas, and thus enhancing the accuracy of place entities matching tasks. The specific encoding method is as follows:
For a positional coordinate $\mathbf{x}$, the process begins by assigning a specified number of unit vectors $\mathbf{a}_j$ and then performing dot product calculations $\langle \mathbf{x}, \mathbf{a}_j \rangle$, resulting in a new spatial representation. Subsequently, each dot product value undergoes scaling to a specified dimension. This study adheres to the six-dimensional choice as utilized in the original work, leading to a periodic expansion through sine and cosine functions of different frequencies, as illustrated in Equation (2):

$$PE_{s,j}(\mathbf{x}) = \left[\sin\!\left(\frac{\langle \mathbf{x}, \mathbf{a}_j \rangle}{\lambda_{\min} \cdot g^{s/(S-1)}}\right),\ \cos\!\left(\frac{\langle \mathbf{x}, \mathbf{a}_j \rangle}{\lambda_{\min} \cdot g^{s/(S-1)}}\right)\right] \quad (2)$$

where $g$ represents the scale coefficient and $s$ is the scale value. This method effectively transforms the spatial coordinates into a more nuanced and dimensionally rich representation.
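A sketch of this encoder follows Equation (2); the number of unit vectors and the scale constants lam_min and g below are illustrative choices rather than the exact configuration of [35].

```python
# Sketch of the Space2Vec-style multi-scale location encoder.
import numpy as np

def location_encoding(lon, lat, n_vectors=3, S=6, lam_min=1e-2, g=1e4):
    """Encode a 2-D coordinate as sin/cos features at S scales.

    n_vectors unit direction vectors and the constants lam_min, g are
    illustrative assumptions, not the paper's exact configuration.
    """
    x = np.array([lon, lat])
    angles = 2 * np.pi * np.arange(n_vectors) / n_vectors
    units = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (n, 2)
    dots = units @ x                                            # <x, a_j>
    feats = []
    for s in range(S):
        scale = lam_min * g ** (s / (S - 1))
        feats += [np.sin(dots / scale), np.cos(dots / scale)]
    return np.concatenate(feats)  # length n_vectors * S * 2

emb = location_encoding(113.2589, 23.1552)  # e.g., a point in Guangzhou
# In SSARLM, a further projection would lift this vector to the model's
# hidden size before fusion with the text embeddings.
```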
As depicted in
Figure 2, the token sequence is processed through our location encoder and text embedding layer, forming the input for training within the transformer architecture. Finally, linear layers and a softmax function map the output to a binary classification result of 0 or 1.
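A high-level PyTorch sketch of this flow is given below. The additive fusion of projected location features with the word embeddings, and the dimensions used, are simplifying assumptions for illustration; the actual SSARLM fusion module may differ.

```python
# Sketch of input-level fusion: location embeddings are added to the word
# embeddings before the transformer layers; a linear + softmax head yields
# the binary match decision.
import torch
import torch.nn as nn
from transformers import AutoModel

class SSARLMSketch(nn.Module):
    def __init__(self, loc_dim=36, hidden=768):
        super().__init__()
        self.bert = AutoModel.from_pretrained("distilbert-base-cased")
        self.loc_proj = nn.Linear(loc_dim, hidden)  # lift location features
        self.head = nn.Linear(hidden, 2)            # match / mismatch

    def forward(self, input_ids, attention_mask, loc_feats):
        word_emb = self.bert.embeddings.word_embeddings(input_ids)
        # Additive fusion (assumed): broadcast location vector over the sequence.
        fused = word_emb + self.loc_proj(loc_feats).unsqueeze(1)
        out = self.bert(inputs_embeds=fused, attention_mask=attention_mask)
        cls_vec = out.last_hidden_state[:, 0]       # [CLS] representation
        return torch.softmax(self.head(cls_vec), dim=-1)
```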
5. Experiment and Discussion
In this section, we first introduce how named place pairs are selected and labeled as experimental data, then evaluate the SSARLM model for named place pair matching. Based on the matched entities, a provenance-aware place knowledge graph is constructed, using named places in China from the three data sources in
Section 3.
5.1. Data Preparation
The datasets used in this paper include 758,860 gazetteer entities from Chinese GeoNames, 287,182 POIs from Chinese OSM, and 1,127,582 POIs from Baidu Map covering six cities in China, i.e., Guangzhou, Hangzhou, Shanghai, Shenzhen, Zhuhai, and Wenzhou. Our goal is to establish equivalent associations between POIs from the three data sources. Model training rests on whether two records are associated or not. Therefore, each training instance consists of two POIs from distinct sources (a named place pair), along with a binary label (0/1) indicating whether they match. In theory, any two POIs from different sources can be used for model training. However, considering all possible combinations would result in excessively large datasets, not only significantly increasing the labeling effort, but also imposing a burden on model training. Therefore, it is necessary to impose certain limiting conditions to ensure that we obtain relatively high-quality named place pairs.
To reduce the number of candidate named place pairs, our approach first filters out mismatched pairs using a 1000 m distance threshold, a 0.4 place name similarity threshold, and a 0.5 place type similarity threshold. Then, for training and testing the match prediction model, this paper manually labeled matched and mismatched (similar but not matching) place pairs in Guangzhou and Shanghai from the filtered candidates as positive and negative samples, as shown in
Figure 3.
Table 3 shows the statistics of matched and mismatched samples of place entity pairs from three data sources in two areas. It is observed that, despite the filtering process, the quantity of negative samples remains substantially large, with the ratio of positive to negative samples approaching 1:50. To mitigate the adverse effects caused by this imbalance, we extracted 15,000 data entries from the Guangzhou and Shanghai datasets, maintaining a ratio of approximately 1:3 between positive and negative samples.
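For reference, the downsampling can be sketched as follows; the column name label and the exact sampling procedure are illustrative assumptions.

```python
# Sketch of negative downsampling toward a ~1:3 positive:negative ratio.
import pandas as pd

def balance_pairs(pairs: pd.DataFrame, neg_per_pos: int = 3,
                  seed: int = 42) -> pd.DataFrame:
    pos = pairs[pairs["label"] == 1]
    neg = pairs[pairs["label"] == 0].sample(n=len(pos) * neg_per_pos,
                                            random_state=seed)
    # Shuffle so positives and negatives are interleaved for training.
    return pd.concat([pos, neg]).sample(frac=1.0,
                                        random_state=seed).reset_index(drop=True)
```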
5.2. Model Evaluation
The named place pair matching can be regarded as a classification problem with two possible outcomes (match or mismatch). We use Precision (P), Recall, and F1 as measurement indexes to evaluate the models’ performance. They can be calculated using Formulas (4)–(6):

$$P = \frac{TP}{TP + FP} \quad (4)$$

$$Recall = \frac{TP}{TP + FN} \quad (5)$$

$$F1 = \frac{2 \cdot P \cdot Recall}{P + Recall} \quad (6)$$

where $TP$ is the number of true positive results, $FP$ is the number of false positive results, and $FN$ is the number of false negative results.
5.2.1. Overall Performance Evaluation
The named place matching model SSARLM in our paper is compared with several classical and commonly used binary classification machine learning models, including the Support Vector Classifier (SVC), Random Forest Classifier (RFC), and Multilayer Perceptron (MLP). The prerequisite for utilizing these models is the computation of various similarity measures, including string similarity, phonetic similarity, and bag-of-words similarity, as well as location similarity and type similarity. It is important to highlight that type similarity relies on the type fusion system we have constructed. This system integrates types from the three data sources, referencing the national POI classification standards. Consequently, each type from every data source can find its corresponding representation within our system. Simultaneously, we conducted a comparative analysis with the MMCNN, a multi-layer neural network model specifically devised for place entity matching. The MMCNN consists of a two-layer architecture: the first layer is trained on features associated with the names of geographic entities, including string similarity, phonetic similarity, and bag-of-words similarity; the second layer is trained on a composite feature set that includes name similarity, type similarity, and location similarity. We contend that this model offers a comprehensive empirical representation of the pivotal attributes of geographic entities.
We utilized data from the Guangdong region to conduct quintuplicate experiments on each model. The data was partitioned into training, validation, and test sets in an 8:1:1 ratio. For the SSARLM model, we set the learning rate at , with a batch size of 32, and epochs capped at 50. The maximum sequence length was configured to 256. Because the experiment’s objective is to compare the optimal performance differences among various models, the hyperparameters for the other four models were not uniformly constrained.
Existing research has substantiated that large language models, exemplified by ChatGPT and GPT-4, achieve superior performance in a variety of natural language downstream tasks, even attaining state-of-the-art (SOTA) status [
36,
37]. In our study, we evaluated the performance of GPT-4 (zero-shot) and GPT-4 (20-shot) in the task of place entity matching. This approach enables direct testing on the test dataset without the necessity for training data.
For the GPT-4 model, we engineered a high-quality prompt specifically tailored to optimize its performance in the location entity matching task. In the case of GPT-4 (20-shot), we additionally selected examples from the training dataset that comprehensively covered all cases, employing these as in-context learning materials.
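For reference, a zero-shot request of this kind might look as follows; the prompt wording and the sample records are simplified, hypothetical reconstructions rather than our exact prompt.

```python
# Hypothetical zero-shot prompt for GPT-4 place matching (simplified).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
prompt = (
    "You are a geographic entity matching assistant. Given two place "
    "records (name, type, lon/lat), answer 1 if they refer to the same "
    "real-world place, otherwise 0.\n"
    "Record A: name=武汉站, type=train_station, loc=114.425,30.611\n"
    "Record B: name=Wuhan Railway Station, type=railway, loc=114.424,30.610\n"
    "Answer:"
)
resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(resp.choices[0].message.content)  # expected "1" for a match
```

In the 20-shot setting, labeled example pairs covering all nine issue types would be prepended to the prompt as in-context demonstrations.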
The final averaged evaluation metrics for each model are presented in
Table 4. Notably, SSARLM emerges as the only model achieving an F1 score of 0.95, while the other four machine learning models consistently fall within the 0.93–0.94 range. Despite GPT-4’s status as the most powerful current large language model, it demonstrates a relatively lower F1 score in the domain-specific task of entity matching. These results validate the efficacy and advancement of SSARLM. Given that the ultimate goal of our experiment is to achieve geographic entity matching across six cities and construct a knowledge graph, we further assessed the generalization performance of each model on data from other regions.
5.2.2. Generalization Performance Test
We have archived the training outcomes of each model based on the Guangdong region data and applied these models to data from the Shanghai region. This approach is employed to evaluate the generalization capabilities of each model. The results, presented in
Table 5, are derived from averaging five test iterations for each model.
Upon a holistic consideration of various performance metrics, it is observed that our proposed SSARLM model exhibits enhanced overall effectiveness on the Shanghai dataset. It surpasses the F1 scores of the other four models, notably outperforming MMCNN, MLP, and SVC by 0.7 to 1 percentage points, and RFC by 2.6 percentage points. When comparing Precision and Recall values, the SSARLM model demonstrates a more balanced performance across all indicators. Furthermore, in the context of our entity matching task, the probability of accurately predicting positive samples emerges as a crucial metric.
Analysis of the results indicates that SSARLM effectively learns the common features of geographic entities. It robustly and comprehensively assesses the matching degree across data sources with regional differences, demonstrating strong generalization capabilities.
5.2.3. Statistics on the Resolution of Challenging Issues
Utilizing the predictive results from Shanghai, we statistically assessed the ability of five models to address the first six challenging issues described in
Section 3.2. The remaining three issues, which are common to nearly all data pairs, will not be separately discussed further. Our analysis focused on original positive samples, with issue annotations for these samples obtained through both automated methods and manual verification.
Figure 4 presents the statistical distribution of various issues, revealing that Issues I1 and I5 have the highest proportions. While other issues have smaller shares, they still appear dozens of times in the positive samples from Shanghai and can be considered for evaluation purposes. Given that each model underwent five tests, to ensure reliability, we consider a model to accurately classify a data pair only if it predicts correctly in three or more tests.
Table 6 presents the probability statistics of correctly classified positive samples by the five models under various challenging issues. In the first six scenarios, SSARLM outperforms the other four models. Notably, for I2, SSARLM shows a distinct advantage, exceeding the other models by nearly 40 percentage points. Furthermore, in case I4, SSARLM’s performance is more than double that of the other models. In the other scenarios, SSARLM also consistently surpasses the others to varying degrees. Among the remaining four models, the performance differences are minimal, with MMCNN exhibiting relative superiority.
5.3. Discussion
The implementation results in
Section 5.2 show that the similarity measure methods and the SSARLM model presented in this paper can help find similar named places from different data providers with high accuracy. The matched named places conflation and PlaceKG construction methods help to create a merged place dataset.
In
Section 5.2.1, this paper compares the SSARLM based on a large-scale pre-trained model, various machine learning-based models, and models based on GPT-4. We initially conducted multiple training sessions for SSARLM and other machine learning models, using the training dataset from Guangzhou. In the case of the GPT-4 models, testing was directly performed using the test dataset. Subsequently, in
Section 5.2.2, we conducted a generalization performance test for SSARLM and machine learning models using data from Shanghai.
The machine learning models, including MMCNN and MLP, realize alignments between sources and targets automatically. However, implementing such methods requires empirical formulas for calculating various features, which undoubtedly entails a substantial workload. At the same time, in our practice, a mature and complete place type taxonomy with full semantic information is required for type alignment across all named place datasets. Because the three data sources contain a limited number of types, manual creation and alignment is feasible. Unfortunately, a typical drawback of the manual method is that it may cause scalability problems when more named place data sources are included. Furthermore, manually assigning types requires a certain standard as the basis and relies on the builder’s extensive experience. Each type mapping needs careful verification, resulting in a heavy workload.
Our proposed SSARLM model develops an end-to-end model based on the large-scale pre-trained model to replace manual feature selection. This provides a more streamlined and efficient approach, eliminating the need for separate feature engineering steps. Furthermore, whether tested within Guangzhou or using the trained model on Shanghai data, our model exhibits a higher F1 score performance compared to others. It demonstrates a significantly higher Recall than competing models, indicating its superior accuracy in identifying matching geographical entities.
Both the SSARLM model and other machine learning models require high-quality sample data. Despite demonstrating the model’s generalization capabilities in location matching across different regions in
Section 5.2.2, we remain curious about whether training samples can be dispensed with, to reduce the resources needed for location matching. Several existing studies have shown that large models like ChatGPT and GPT-4 can be directly applied to downstream natural language tasks without any fine-tuning or training to update model parameters [
38]. Large language models are more robust than large-scale pre-trained models, possessing greater parameter volume and advanced language comprehension capabilities. In our experiments, we did not use training datasets; instead, we obtained location-matching results by engaging GPT-4 in dialogue using high-quality prompts. The experimental results show that the GPT-4-based model (zero-shot) achieved high Precision, but lower Recall. This suggests that GPT-4 might not effectively recognize that two names refer to the same location in certain specific scenarios or when differences are subtle, indicating that high-quality training data remains necessary for location entity matching tasks. Based on the concept of domain-specific fine-tuning for large models, substantial improvements in performance on specific tasks and better adherence to human instructions can be achieved through extensive fine-tuning with a large volume of high-quality, location-matching-related training data. Ultimately, fine-tuning techniques for large models could enable out-of-the-box place entity matching driven by natural language instructions.
In
Section 5.2.3, our statistical analysis of the six challenging issues demonstrates a distinct advantage of SSARLM in addressing complex geographical entity matching problems. Specifically, for names with different cases (I2) and different word orders (I4), SSARLM’s metrics exceed those of the other models severalfold. These outcomes are likely attributable to the strengths of the pre-trained model DistilBERT. We selected a case-sensitive version of the model, thereby enhancing its focus on variation in the capitalization of place names and its relevance to geographical entity matching during the training phase. Furthermore, BERT-based models utilize an attention mechanism to capture context without emphasizing word order. This feature is beneficial for place name matching, as the inversion of word order typically does not alter the description of a location, and such reversals are common across data sources. Overall, SSARLM demonstrates greater flexibility in representing place names, showing a clear performance advantage.
5.4. PlaceKG Construction
Using the SSARLM model, we can obtain matched named places through the three datasets. The statistical results are shown in
Table 7. In total, 1365 similar named places appear simultaneously in all three datasets, while 7264, 8663, and 97,453 matched named places were found in the pairwise matchings of the three datasets. OSM has the most matched named places with other datasets: 36.9 percent of POIs from OSM can find matched named places in GeoNames and Baidu Map. Only 1.5 percent of POIs from Baidu Map can find matched entities in the other datasets; this is expected, since POIs from Baidu Map cover only six cities and Baidu, as a well-localized company, has stronger capabilities in collecting POIs in China than the other data providers. It is then necessary to conflate the matched named places and resolve conflicts when creating a place knowledge graph.
As a result, a provenance-aware place knowledge graph named PlaceKG is developed using the conflation strategies and the provenance model recording source entities. The PlaceKG contains 2,076,693 PlaceEntity instances in China and their provenance information. It is encoded using RDF format and currently includes 57,801,364 statements.
The PlaceKG can be further published on a SPARQL server and queried via SPARQL.
Figure 5 provides examples of SPARQL queries on the PlaceKG with a Web UI. The examples show how entities in the PlaceKG can be searched and how their provenance information can be tracked for analysis. In
Figure 5a, a PlaceEntity instance is queried by its Chinese name. In
Figure 5b, matched named places from the three data sources that generate the PlaceEntity instance can be traced; in
Figure 5c, the strategies exploited for handling conflicts when conflating place names can also be explored. Finally, in
Figure 5d, other properties, like the location of the PlaceEntity instance, can be obtained from the PlaceKG.
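As an illustration, a name-based lookup in the spirit of Figure 5a could be issued programmatically as below; the endpoint URL and vocabulary terms are hypothetical placeholders, not the published PlaceKG schema.

```python
# Hypothetical SPARQL lookup against a PlaceKG endpoint.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://example.org/placekg/sparql")  # placeholder URL
endpoint.setQuery("""
    PREFIX pkg: <http://example.org/placekg/ontology#>
    SELECT ?entity ?lon ?lat WHERE {
        ?entity pkg:name "广州东站" ;
                pkg:longitude ?lon ;
                pkg:latitude ?lat .
    }
""")
endpoint.setReturnFormat(JSON)
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["entity"]["value"], row["lon"]["value"], row["lat"]["value"])
```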
6. Conclusions and Future Work
This paper utilizes three geospatial data sources, encompassing different construction groups, organizational forms, and both Chinese and English languages, to perform high-performance fusion and transfer of heterogeneous data, ultimately leading to the formation of the PlaceKG knowledge graph.
Differing from traditional rule-matching and machine learning approaches, we propose a Semantic-Spatial Aware Representation Learning Model for Place Matching and Fusion based on a pre-trained large model. This model is built upon the case-sensitive DistilBERT pre-trained model, enhanced with distributed positional encoding for fine-tuning. It is adept at expressing textual features such as names and types, while accurately representing the spatial characteristics of geographical entities. This approach avoids the manual type alignment and formula-based biases typically encountered in traditional methods.
Experimental results demonstrate that the SSARLM model outperforms various types of baseline models, showing superior performance and providing methodological guidance for similarity calculations of geographical entities. Additionally, SSARLM exhibits advantages over traditional methods in multiple metrics during generalization performance tests on data from other regions, showing that methods based on large-scale pre-trained models have stronger transferability and generalization capabilities in place matching and fusion. This enhances the potential for extending our approach to a broader range of place data linking and constructing a comprehensive place knowledge graph. It is hoped that this work can draw attention to the problem of multi-data source conflation to reduce the deviation at the pattern level of knowledge graphs.
The paper leaves several extensions for future research. First, only structured POI data sources have been studied in our work, and more multimodal data can be studied to enrich place knowledge graphs, such as texts, trajectory data, remote sensing images, and vector maps. Second, POI data are constantly updated and changing, and the conflated place knowledge graph should also be updated to ensure its timeliness. While the provenance can help track historical changes, it is necessary to investigate approaches in the future to keep the place knowledge graph updated with the VGI sources.