This section presents the construction method of ISLKG, including the composition and construction framework of ISLKG, ontology design, data collection, knowledge extraction, knowledge fusion, and storage; it then presents the knowledge reasoning model based on embedding and applies it to the island dataset. The model is then validated, followed by the introduction of an ISLKG-based application case.
3.1. ISLKG Construction Techniques
3.1.1. Overall Framework
The island knowledge graph is constructed using a “top-down” methodology, which comprises two stages: constructing the ontology layer and extracting knowledge for the instance layer. This contrasts with general knowledge graphs, which are typically built using a “bottom-up” methodology.
Figure 1 depicts the overall structure of ISLKG. The ontology layer structure represents entity types, attributes, and relationship types between entities, and the instance layer instantiates the designed ontology in accordance with the established concept hierarchy, mapping rules, and constraints in the ontology layer.
ISLKG’s construction framework is depicted in
Figure 2. First, the ontology layer defines a series of concepts, relationship types, attributes, mapping rules, entity constraints, and concept hierarchies, which guide tasks at the instance layer, such as data collection and knowledge extraction. Secondly, the instance layer collects multi-source island data according to the ontology specifications. Mapping is used to collect structured island databases; crawler technology is used to collect data from semi-structured web pages and JSON files, as well as unstructured data such as web page texts and books. The data obtained from the three data sources are independently acquired, processed, and organized for quality control in line with each source’s specific requirements. Data processing is repeated if the quality evaluation result does not meet requirements; if it does, knowledge fusion is performed. The island knowledge graph is completed by storing the fused triples in the Neo4j graph database.
Correspondingly, the construction and knowledge inference process of ISLKG is illustrated in
Figure 3. The first step involves designing the ontology for the islands. Specifically, based on the existing structured knowledge framework in the field of islands, the ontology is semi-automatically constructed using the Protégé tool developed by Stanford University. The second step involves collecting multi-source data based on the constraints of the island ontology. This primarily includes the collection of structured, semi-structured, and unstructured data. Detailed explanations and collection methods for these data types will be provided later in this paper. The third step involves performing initial data cleansing on the collected data. This mainly focuses on removing empty, erroneous, and corrupted text from unstructured data. The data for each island are integrated to maintain uniqueness, resulting in the compilation of the Island Text Corpus. Subsequently, knowledge extraction is conducted on the cleansed Island Text Corpus. Following the entity categories specified by the ontology, entities of all categories for each island are extracted to form the constituent units of the island knowledge graph—triples. A second round of data cleansing is applied to the extracted triples, targeting erroneous, corrupted, and illegitimate entities.
Subsequent to this, the cleaned triples undergo knowledge fusion, primarily utilizing the Dedupe tool for entity alignment, merging duplicate triples, and semantically similar triples. This enhances the data quality and usability of ISLKG. Ultimately, the aligned ISLKG is subjected to quality evaluation. The current evaluation standards primarily encompass two aspects: the accuracy of the triples sampled from ISLKG queries exceeding 80%, and the validation results of the knowledge inference model. If these criteria are met, the knowledge is stored in the graph database. Otherwise, more data are collected to augment the knowledge graph. At this point, the construction of ISLKG is complete, albeit still partially comprehensive. Therefore, additional knowledge inference is applied to enhance the existing knowledge graph of ISLKG. The quality of the augmented knowledge graph is evaluated and validated. Based on these validation results, it is determined whether more data collection is required for ISLKG augmentation. If the validation results meet the criteria, the ISLKG inference process is completed.
3.1.2. Ontology Construction of Island Knowledge Graph
An ontology refers to a precise, formal, and systematic definition of concepts and their relationships in a particular domain [
31]. It describes the knowledge graph’s data format and essentially constitutes the knowledge graph’s uppermost layer. The ontology of the island knowledge graph contains the concepts, relationships, and properties of the island domain, as well as the concept hierarchy, concept constraints, and mapping rules. There are five categories of first-level concepts in the island ontology library: basic information, social information, scientific research activities, infrastructure, and natural properties.
Figure 4 depicts the hierarchical structure of the island ontology library, which is composed of numerous second- and third-level concepts that relate to the first-level concept.
Basic information is the most fundamental category of island knowledge, encompassing concepts such as former names, present names, classification, and the marine areas to which islands belong. The concept names, entity examples, and concept constraints for basic island data are detailed in
Table 1. In addition, the basic properties of the island are outlined in
Table 2, which are included in the island’s basic information.
- 2. Social Information
Island social information refers to the information that indicates the environment and mode of human social activities on the island.
Table 3 presents the main concepts of social information, such as businesses, hospitals, schools, tourist attractions, administrative agencies, and social activities.
- 3. Research Activity
As stated in
Table 4, research activities refer to scientific research undertaken by scholars on the island, including the names of research specialists, papers linked to the island, and the island’s fields of study.
- 4. Infrastructure
Infrastructure refers to the basic engineering facilities on an island that provide public services for social production and residents’ lives, such as ports, anchorages, wharves, waterways, and other transportation facilities that are crucial to the economic operations of the island.
Table 5 illustrates that the island ontology library also includes concepts that are important for an island, such as water conservancy and electric power facilities.
- 5. Natural Property
The natural property describes the natural resources and natural environment of the island and its surrounding waters. The natural resources include, as indicated in
Table 6, land resources, water resources, biological resources, marine resources, vegetation, etc. The natural environment consists of the island’s terrestrial ecosystem, the surrounding sea environment, and natural disasters. Moreover, these ecosystems encompass topics like climate, meteorology, and hydrology.
Table 7 displays the entity examples and conceptual limitations of the natural environment.
In the category of natural resources on islands, the biological resources of islands encompass the plants, animals, and microorganisms of both the island’s terrestrial environment and its surrounding maritime areas. Therefore, the subset of biological resources on islands is further classified into marine and terrestrial biota. Specifically, terrestrial biota includes the animal and plant species that inhabit the island: terrestrial animals include species such as capercaillie, brown bears, and bats, while terrestrial plants encompass species like palms, coconut trees, and dandelions. Marine biota consists of the plants, animals, and microorganisms found in the island’s surrounding waters; for instance, salmon, seaweed, and cyanobacteria are examples of marine organisms.
Beyond biological resources, the vegetation information of islands is categorized separately. This is because the plant resources of islands primarily serve to highlight the diversity of plant species in the region, whereas vegetation information focuses more on the overall characteristics and distribution of plant communities. For instance, island vegetation information may encompass categories like moss, shrub, and mangrove forest.
The natural property of the island in marine meteorology and climate, ocean hydrology, and marine chemistry contains many entities that are associated with numerical data, such as air temperature, surrounding sea temperature, pH, sunshine, wind speed, etc. As the observation data are mostly maintained in an organized format in the database, mapping principles must be established to extract semantic concepts from these structured data. We divide the mapping principles into two parts: the form or structure of conceptual mapping and the approach of mapping.
The island ontology library primarily designs the mapping concept’s structure. For example, when mapping the concept of “average monthly temperature”, the temperature of the island entity is accumulated by hour and then by day to determine its average value, and the associated triplets are recorded. The approach of conceptual mapping will be detailed in
Section 3.1.3 on structured data collection.
- 6. Relationship Definition
In addition to concepts, knowledge graphs also include relationships between entities. Since island knowledge is a subset of geographic knowledge, the relationships between island entities inherit the general relationships between geographic entities and are divided into two types: spatial and semantic. The spatial relationship is subdivided into the distance relationship, orientation relationship, and topological relationship, while the semantic relationship is subdivided into the entity attribute relationship and the object attribute relationship.
Topological relationships are invariant with respect to topological transformations, such as rotation, scaling, and translation [
32]. Topological relationships between entities include intersection, disjointness, containment, inside, equality, overlap, and tangency. In the island knowledge graph, the orientation relationship depicts the orientation of one island relative to another, namely east, west, south, north, northeast, northwest, southeast, and southwest. The distance relationship in the island knowledge graph comprises both quantitative and qualitative distance. The quantitative distance is the geographical straight-line distance between two islands. The qualitative distance describes four levels of perceived distance: quite near, near, far, and quite far, delimited by quantitative distance thresholds of 25 km, 250 km, 500 km, and 1200 km, respectively.
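The qualitative distance levels above can be sketched as a simple classifier. This is an illustrative assumption rather than the paper's implementation: four levels require only three boundaries, so the specific interval assignment (25 / 250 / 1200 km) chosen here is hypothetical.

```python
# Hedged sketch: map a quantitative straight-line distance (km) between two
# islands to one of the four qualitative levels. The boundary choice below
# (25 / 250 / 1200 km) is an assumption drawn from the thresholds in the text.
def qualitative_distance(km: float) -> str:
    if km < 25:
        return "quite near"
    if km < 250:
        return "near"
    if km < 1200:
        return "far"
    return "quite far"
```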
The relationship between the island and its basic attributes reflects membership, subordination, or inclusion. Examples include the triples “<Changxing Island, former name, Changsheng Island>”, “<Changxing Island, the sea area to which it belongs, the East China Sea>”, “<Nan’ao Island, Bridge, Nan’ao Bridge>”, “<marine industry, including, coastal tourism>”, and “<climate, including, precipitation>”.
3.1.3. Data Collection and Preprocessing
The initial stage in generating the instance layer of the island knowledge graph is the collection and processing of multi-source island knowledge data. Since the tasks of the instance layer require the guidance of the ontology layer, data collection and processing must be based on the various data formats, conceptual constraints, and mapping rules. According to the data format, island data sources are classified as structured, semi-structured, and unstructured data, and distinct data collection techniques are utilized for each. The next section describes in detail the features and collection techniques of the three data sources.
Structured data, also referred to as row data, are logically represented and realized by a two-dimensional table structure, strictly adhere to data format and length criteria, and are primarily stored and managed in relational databases. As one form of island multi-source data, structured island data account for a reasonably large fraction of the total data and constitute a very important source for the island knowledge graph. This is particularly true of observational data on marine hydrology, marine water quality, and marine climate and meteorology, such as temperature, humidity, salinity, pH value, dissolved oxygen, and nitrate; such data are seldom semi-structured or unstructured, and the majority are kept in relational databases or other structured data sources. Thus, it is crucial to map structured data into island knowledge graph triples. The collection method for structured data consists mainly of designing a conceptual mapping method based on the structural characteristics of the dataset to be collected, and converting the structured data into the conceptual structure form mandated by the mapping rules of the island ontology library. Different conceptual mapping methodologies are utilized for different structured island data types. For instance, using a MySQL data conversion tool, the structured island data in a MySQL relational database are converted into the triple form of the island knowledge graph, while structured data in ASCII format are collected based on the character positions of the concepts to be mapped.
Figure 5 displays a mapping method example for structured data in this format.
The figure depicts a structured data file derived from the Chinese observation station data of the National Oceanographic Data Center of China (
http://mds.nmis.org.cn, accessed on 1 December 2022). The marine meteorological stations’ observations undergo collection, decoding, format checking, code conversion, standardization, automatic quality control, visual inspection, calibration, etc., after which the real-time data of maritime meteorology, waves, temperature, and salinity are standardized. The format is ASCII [
33], and character positions (e.g., characters 1–15) are used to separate column data: characters 16–19 represent the date, followed by the hour, latitude, longitude, visibility, temperature, wind speed, and air pressure, from left to right, along with additional marine meteorological data. Using Xiaochangshan Island’s temperature as an illustration, the conceptual mapping technique is as follows: first, determine and collect the character positions where the temperature is situated; then, calculate the average temperature of Xiaochangshan Island for a given month by averaging the daily and hourly temperatures for that month; lastly, based on the mapping rules specified in the island ontology library, group the values into the relevant structural form and retrieve the triples pertaining to the island’s temperature.
- 2. Semi-structured Data
The term “semi-structured data” refers to data that are captured or formatted in non-standard ways. Due to the lack of a predetermined schema, semi-structured data do not adhere to the structure of a tabular data model or a relational database. However, the data are not entirely raw or unstructured; they have certain structural features, such as labels and organizational metadata, that facilitate analysis. In comparison to structured data, semi-structured data are more flexible and easier to extend. Popular semi-structured data formats include XML, HTML, and JSON, among others. In the island knowledge graph, semi-structured data can supplement and correct fundamental concepts such as island area, location, and coastline length extremely well. The primary source of semi-structured data for the island knowledge graph is the semi-structured data section of the encyclopedia. Using the encyclopedia page for Nan’ao Island as an example, the page’s semi-structured data consist of island information that has been integrated and processed by the encyclopedia, and its collection relies primarily on web crawler technology. We obtain the island’s encyclopedia page by first conducting a search using the island’s standard name. The second step is to parse the Sea Island Encyclopedia’s HTML page to extract its semi-structured data. Then, island triples are constructed from the acquired semi-structured data, such as <Nan’ao Island, area, 117.73 km2> and <Nan’ao Island, coastline length, 94.3 km>.
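The infobox-to-triple step can be sketched as below. The HTML snippet and tag layout are hypothetical stand-ins for the encyclopedia markup; a real pipeline would first fetch the page with a crawler and then parse the actual page structure.

```python
# Minimal sketch of extracting triples from a semi-structured encyclopedia
# infobox. The <dt>/<dd> layout here is an assumed, simplified structure.
import re

SAMPLE_PAGE = """
<dl><dt>area</dt><dd>117.73 km2</dd>
<dt>coastline length</dt><dd>94.3 km</dd></dl>
"""

def infobox_triples(island, page):
    # Pair each <dt> label with the <dd> value that follows it.
    pairs = re.findall(r"<dt>(.*?)</dt>\s*<dd>(.*?)</dd>", page, re.S)
    return [(island, k.strip(), v.strip()) for k, v in pairs]
```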
- 3. Unstructured Data
This section focuses mostly on the acquisition of unstructured island data; the mechanism for extracting knowledge from unstructured data is discussed in a later section. Unstructured data have an irregular or incomplete structure, no predetermined data model, and are inconvenient to express using a database’s two-dimensional logic tables. This data category encompasses office documents, text, images, HTML, reports, photos, audio/video data, etc. There are numerous formats and standards for unstructured data, and unstructured information is technically more difficult to standardize and comprehend than structured information. In comparison to structured and semi-structured data, unstructured data are vast and contain a wealth of information. Most concepts in the island knowledge graph can be discovered in the island’s unstructured data, and such data can be collected from almost anywhere with relative ease. However, unstructured data also contain a substantial amount of information that is irrelevant to the target information. How to properly and thoroughly extract the desired target information from unstructured data has always been a challenging and worthwhile research task, on which academic circles have undertaken a great deal of research.
The majority of the unstructured data sources for the island knowledge graph consist of island-related books, journals, surveys, news articles, and websites. Initially, books, monographs, survey bulletins, and other paper-based data are semi-automatically scanned into computer-readable text, while island-related webpage data such as newspapers, news articles, and encyclopedias are collected manually or with crawler technologies, which explore and analyze the crawled HTML pages to obtain the islands’ unstructured text. Then, a series of data-processing operations, such as data cleansing and data formatting, are conducted on the two types of collected texts. The processed data are then consolidated and integrated to create an island text library based on unstructured data.
3.1.4. Knowledge Extraction
The second stage in the instance layer of the island knowledge graph is to extract knowledge from the obtained data using the island ontology library. After data collection and processing, structured and semi-structured data can be directly translated into the triples required for the island knowledge graph, owing to their predefined structures; no further knowledge extraction is needed. The acquisition and processing of unstructured island data yield a text library containing rich information about islands, which, however, cannot be loaded into a knowledge graph directly. Thus, the main objective of knowledge extraction is to mine the island text library, extract the abundant island information contained within it, and construct the triples of the island knowledge graph.
On the basis of the characteristics of island domain knowledge in the island text database and the grammatical patterns in which island-related concepts appear, a knowledge extraction method that combines an entity dictionary with rule patterns is proposed, accomplishing entity recognition over the island text database in an efficient and exhaustive manner. This knowledge extraction method, a form of entity recognition, is unsupervised.
Figure 6 demonstrates that the recognition and extraction procedure is divided into two parts: knowledge extraction based on rule patterns and knowledge extraction based on an entity dictionary. The former is primary and the latter is supplementary; the rule model designed and constructed by the former can supplement the entity dictionary for the latter, and the entity dictionary of the latter can guide and assist the construction of the rule model, thereby completing island knowledge extraction from the text database to construct island knowledge graph triples. The next section describes the implementation details and combination of the two approaches.
The concept of knowledge extraction based on rules and patterns is to build entity extraction rules after analyzing the grammatical composition and patterns of specific domain texts and entities in order to accomplish the extraction. The foundation of knowledge extraction based on rule patterns is the building of rule templates, which requires a comprehensive examination of the word formation rules of entity words or attribute values in domain texts, including character word formation rules and part-of-speech combination rules. First, make numerous observations on the island text in the island text database, then identify and describe the laws governing its entity presentation and grammatical structure. Then, design and build the rule pattern based on these laws, and utilize regular expressions for extraction. For instance, the pattern “\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$” is meant to match email addresses of the usual name@domain form for extraction. The regular template “\d{4}[year-]\d{1,2}[month-]\d{1,2}day” is meant to recognize the date; based on the part-of-speech rule of the date entity, the part-of-speech tags of words are used for combination-rule extraction, where word segmentation technology is employed. First, text containing time information is segmented using Jieba word segmentation. Then, words tagged with the parts of speech “m” (number) and “t” (time) are extracted, the date’s authenticity is verified, and sequential time terms are used as date entities.
If the process of rule pattern building necessitates the semantic complement of the entity dictionary, the entity dictionary constructed by the second technique provides support. On the basis of the island text database, it is essential, for instance, to establish which islands have the marine industry of “seawater salt production and salt chemical industry”. After observing and summarizing the rules of entity occurrence in the island text database and querying the relevant terms in the entity glossary, it has been determined that if an island has a seawater-salt-making and salt chemical industry, there is a high likelihood that the relevant professional terms will appear in its island text. Therefore, the professional terms for seawater salt making and the salt chemical industry in the entity glossary are introduced into the rule pattern, and the entity extraction rule pattern of “salt making | salt drying | salt mining | salt processing | raw salt | sea salt | sea salt production | electrodialysis | freezing | salt industry” is built. After the rule pattern of the entity to be retrieved has been generated, the entity of interest is extracted from the island text database using the constructed rule pattern. The correctness of the entities extracted based on the rule pattern depends on the precision of the rule pattern’s design and construction, the rule pattern’s generalizability, and the quality of the island text library. Hence, the retrieved entities must undergo quality evaluation and data cleansing. At the conclusion of the rule-pattern-based knowledge extraction, a series of data-processing operations are performed on the extracted entities, such as quality evaluation, data cleaning, and data organization, and the target triples of the concepts to be extracted are obtained.
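The dictionary-driven rule pattern for the salt industry can be sketched as a regex alternation over glossary terms. The term list here is an abridged, translated illustration of the glossary entries, not the full production pattern.

```python
# Sketch: join professional terms from the entity glossary into a regex
# alternation and test each island's text for the seawater-salt industry.
import re

SALT_TERMS = ["salt making", "salt drying", "salt mining", "sea salt",
              "electrodialysis", "salt industry"]
SALT_PATTERN = re.compile("|".join(map(re.escape, SALT_TERMS)))

def has_salt_industry(island_text):
    return SALT_PATTERN.search(island_text) is not None
```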
- 2. Knowledge Extraction Based on Entity Dictionary
Matching recognition based on an entity dictionary is often used for unsupervised named entity recognition problems in natural language processing. Although an entity dictionary can only partially cover the target text, it is still quite effective; with a domain-specific entity dictionary, matching recognition can supplement and verify the rule-pattern-based knowledge extraction approach and, when it succeeds, achieves very high accuracy.
The knowledge extraction approach used in this article, based on an entity dictionary, creates an island entity dictionary by transforming the existing marine thesaurus with input from marine-domain experts, and then expands it using the entity extraction rule model. The island text library is then word-segmented based on the island entity dictionary: the name and part of speech of each word in the entity dictionary are used as tags to create a user dictionary that intervenes in Jieba word segmentation, ensuring that entity words appear in their entirety; this is realized with the user dictionary function of the Jieba word segmentation tool. Then, the word segmentation results are evaluated, for example with the maximum matching method, to determine whether the island dictionary needs to be expanded. If so, the constructed rule pattern is modified or crawler technology is used to augment the entity dictionary; otherwise, entities are extracted from the segmentation results according to their tags. The retrieved entities are then subjected to the same data processing as in the first technique to acquire the target triples of the concepts to be extracted.
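The dictionary-guided segmentation step can be illustrated with a forward maximum-matching segmenter, a simplified stdlib stand-in for the Jieba user-dictionary mechanism: it shows how dictionary entries are kept intact during segmentation, which is the property the user dictionary guarantees. The dictionary entries are hypothetical.

```python
# Simplified stand-in for Jieba with a user dictionary: forward maximum
# matching over an island entity dictionary, so that multi-word entity
# names are emitted whole rather than split.
def max_match_segment(text, dictionary, max_len=20):
    words, i = [], 0
    while i < len(text):
        # Try the longest dictionary entry starting at position i first;
        # fall back to a single character when nothing matches.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words
```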
- 3. Knowledge Extraction Model Based on the Combination of Entity Dictionary and Rule Pattern
As shown in
Figure 6, taking the knowledge extraction of Sanmen Island as an example, prior to extraction, entity categories and constraints for the words to be extracted are determined based on ontology-based restrictions. For the knowledge extraction of longitude and latitude, for instance, the unstructured text is initially segmented using the Jieba tool and the existing marine dictionary. The quality of segmentation depends on the coverage of the marine dictionary for the entities to be extracted from the text. Since the marine dictionary cannot cover all unstructured text information related to island entities completely, an evaluation of the segmentation result is performed. If the segmentation result contains longitude and latitude that adhere to the entity constraints, there is no need to expand the dictionary. Knowledge extraction is executed, resulting in the construction of the triple <Sanmen Island, longitude and latitude, (East longitude 114°37′58″, North latitude 22°27′47″)>, which is then stored in the knowledge graph, and the process proceeds to the next word to be extracted.
However, if the segmentation result does not contain longitude and latitude, or if the extracted entities do not meet the constraint requirements, the segmentation result is deemed ineffective, prompting the need for dictionary expansion. At this point, the rule pattern extraction method is employed. Initially, the syntax rules for longitude and latitude are summarized and verified, specifically the format X°Y′Z″, where X, Y, and Z are integers. Subsequently, extraction patterns are designed using regular expressions. By determining whether the text preceding X refers to longitude or latitude, the currently extracted value is identified as either longitude or latitude. The longitude and latitude of Sanmen Island are extracted, and the resulting triple is constructed for storage in the ISLKG. Simultaneously, the extracted word is added to the marine dictionary. If the knowledge extraction based on pattern rules still fails to yield longitude and latitude, it is concluded that the text related to Sanmen Island does not contain this information, and the text is marked for future data supplementation. This automated extraction process is applied to each word to be extracted for every entity category in the island text, and is iterated for each island in the island text database until all pending entity extractions are completed.
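The longitude/latitude rule pattern above can be sketched as follows. The English marker words (“East longitude”, “North latitude”) are illustrative stand-ins for the original Chinese text preceding the values.

```python
# Hedged sketch of the X°Y′Z″ rule pattern: match degree-minute-second
# values and decide longitude vs. latitude from the preceding marker text.
import re

DMS = re.compile(r"(East longitude|North latitude)\s*(\d{1,3})°(\d{1,2})′(\d{1,2})″")

def extract_lat_lon(island, text):
    found = {m[0]: f"{m[1]}°{m[2]}′{m[3]}″" for m in DMS.findall(text)}
    if "East longitude" in found and "North latitude" in found:
        return (island, "longitude and latitude",
                (f"East longitude {found['East longitude']}",
                 f"North latitude {found['North latitude']}"))
    return None  # mark the text for future data supplementation
```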
3.1.5. Knowledge Fusion and Storage
Knowledge fusion is the final stage of the island knowledge graph’s instance layer. Once data collection and knowledge extraction are accomplished, the same island entity often exhibits various expressions due to the diverse origins of data sources and the variation in Chinese descriptions. This multiplicity necessitates a process of entity alignment to mitigate potential data redundancy within the knowledge storage of the graph. Entity alignment plays a pivotal role in identifying coreference relationships across distinct knowledge graphs—a foundational task within the realm of knowledge fusion. By facilitating the integration of knowledge from diverse sources, entity alignment enriches the comprehensive representation of information for subsequent analytical endeavors.
In contexts such as data governance, the practice of entity alignment becomes particularly pronounced, serving as a primary strategy for resolving redundancy. Essentially, it involves a deduplication procedure aimed at consolidating duplicated entities. To achieve this, we employ the Dedupe entity alignment tool, a Python library that leverages machine learning techniques to efficiently perform tasks such as entity alignment, fuzzy matching, and deduplication on structured data. The core methodology of Dedupe revolves around comparing and assessing sample similarity across a range of data formats. This approach transforms data deduplication into a feature-based scoring process, culminating in the grouping of related data instances into coherent clusters. Through this iterative process, deduplication emerges as a cornerstone of effective data refinement.
For example, taking the triple <ChongMing Island, area, 1269.1 km2> from ISLKG, due to the multi-source nature of the data, there might be multiple repeated triples. Additionally, variations like <ChongMing Island, area, 1269.1 square kilometers>, <ChongMing Island, area, 1,269,100,000 m2>, and <ChongMing Island, area, 1,269,100,000 square meters> may arise, representing semantically equivalent facts but with differing units or descriptions. There could even be triples like <ChongMing Island, area, 313,903 acres of land>. Despite differences in description, units, and format, all these triples essentially convey the same fact: the area of ChongMing Island is 1269.1 km2. Therefore, the need arises to align these entities and eliminate duplicates. To achieve this, the Dedupe tool is employed for entity alignment. The process begins with data preprocessing—organizing these triples into a dataset, ensuring consistency in the subject entities and relationship fields of each triple. Subsequently, this dataset is fed into the model for training. The model automatically learns similarity features to aid in identifying duplicate entities. Next, the trained model is utilized to align the entities within the dataset. The model applies learned similarity rules to determine which entities should be considered the same. Finally, based on the model’s output, aligned entities are merged to create a clean and consistent entity collection. This process combines various similar yet distinct expressions of entities into a unified representation while retaining the most accurate descriptions and units. This enhances the data quality and usability of ISLKG.
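The alignment of such unit variants can be illustrated with a simplified stand-in for the Dedupe step. Dedupe itself learns similarity features from labeled examples; the sketch below instead hard-codes the unit conversions to show the end result of merging semantically equivalent area triples into one canonical form.

```python
# Simplified, illustrative stand-in for Dedupe-based entity alignment:
# normalize area values to km2 and collapse equivalent triples.
import re

UNIT_IN_KM2 = {"km2": 1.0, "square kilometers": 1.0,
               "m2": 1e-6, "square meters": 1e-6}

def normalize_area(value):
    m = re.match(r"([\d,.]+)\s*(.+)", value)
    number = float(m.group(1).replace(",", ""))
    return round(number * UNIT_IN_KM2[m.group(2).strip()], 1)

def align(triples):
    # Merge duplicates after normalization, keeping one canonical triple.
    return {(s, p, f"{normalize_area(o)} km2") for s, p, o in triples}
```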
Knowledge storage is the final phase of island knowledge graph creation. After knowledge fusion, island information with a clear structure and rich attributes and relationships benefits greatly from being stored in a graph database [34], which can be viewed from several dimensions, including entities, attributes, and concepts. High-performance NoSQL graph databases such as Neo4j store structured data as a network rather than in tables. Neo4j is also a high-performance graph engine with all the attributes of a mature database. Owing to these benefits, Neo4j is currently the most widely used graph database: it has a vibrant community and an established ecosystem, features its own visualization tools for viewing graphs, and provides a dedicated semantic query language called Cypher [35]. To implement knowledge storage for the island knowledge graph, this study selects Neo4j as the storage platform. Island concepts and entities are saved as nodes, spatial and semantic relationships are stored as edges, and the fundamental attributes of the islands are stored as attribute values.
Figure 7 and
Table 8 illustrate an example of the storage format for the island knowledge graph in Neo4j. Taking Changxing Island as an example, the floating box in the figure shows its basic attributes, such as category, latitude and longitude, population, coastline length, and distance to the nearest mainland. Entities within the ISLKG are represented by nodes of different colors, including Changxing Island itself as well as its soil types, economy, mineral resources, energy resources, elemental composition, marine industries, and scenic spots. Arrows between two entities indicate their relationships and the direction of those relationships, such as the relationship indicating that Changxing Island is subordinate to Shanghai.
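A triple such as &lt;Changxing Island, subordinate to, Shanghai&gt; can be written to Neo4j with a parameterized Cypher `MERGE` statement. The helper below is a minimal sketch; the `Entity` node label and relationship-type naming are illustrative assumptions rather than the paper's exact schema, and the statement is only built, not executed:

```python
def triple_to_cypher(head, relation, tail):
    """Build a parameterized Cypher MERGE statement for one (head, relation, tail).

    The node label 'Entity' and the relationship-type naming convention are
    illustrative choices, not the paper's exact schema.
    """
    rel_type = relation.upper().replace(" ", "_")  # 'subordinate to' -> SUBORDINATE_TO
    query = (
        "MERGE (h:Entity {name: $head}) "
        "MERGE (t:Entity {name: $tail}) "
        f"MERGE (h)-[:{rel_type}]->(t)"
    )
    return query, {"head": head, "tail": tail}

query, params = triple_to_cypher("Changxing Island", "subordinate to", "Shanghai")
# With the official neo4j Python driver, this would then be executed as:
#   session.run(query, params)
```

Using `MERGE` rather than `CREATE` keeps node insertion idempotent, so re-importing fused triples does not reintroduce the duplicates removed during knowledge fusion.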
3.2. Knowledge Reasoning Model Based on Knowledge Graph Embedding
Knowledge reasoning based on knowledge graph embedding, also referred to as knowledge reasoning based on knowledge representation learning, aims to embed the components of the knowledge graph (including entities and relationships) into a continuous vector space that preserves the intrinsic structure of the knowledge graph while simplifying the computations needed for tasks such as link prediction. Knowledge graph embedding offers a denser representation of the entities and interactions in a knowledge graph, reduces computational complexity in its application, and mitigates the effect of data sparsity on model inference results. Moreover, by assessing the similarity of their low-dimensional embeddings, knowledge graph embedding can directly reflect the similarity between entities and relationships.
3.2.1. Model Definition
The ConvE model is utilized in this paper. ConvE is a CNN-based method whose fundamental idea is to treat the head entity and relationship of a knowledge graph triple as a feature map and to model the interaction between entities and relationships using convolutional and fully connected layers. The ConvE model assigns a prediction score to each fact triple (h, r, t), where h, r, and t denote the head entity, relationship, and tail entity, respectively; the score is defined by a convolution over the 2D reshaped embeddings, as illustrated in
Figure 8.
In the ConvE model, the embeddings of entities and relationships are first reshaped and concatenated; the resulting matrix serves as the input to the convolutional layer; the resulting feature map tensor is vectorized and projected into the k-dimensional space; and all candidate object embeddings are compared for a match. The scoring function is formally defined as follows:

$$\psi_r(h, t) = f\big(\mathrm{vec}\big(f([\bar{h}; \bar{r}_r] * \omega)\big) W\big)\, t$$

where $r_r \in \mathbb{R}^k$ is a relational parameter that depends on $r$; $\bar{h}$ and $\bar{r}_r$ denote 2D reshaping matrices of $h$ and $r_r$, respectively; if $h, r_r \in \mathbb{R}^k$, then $\bar{h}, \bar{r}_r \in \mathbb{R}^{k_w \times k_h}$, where $k = k_w k_h$.
During the feed-forward pass, the model first performs a row-vector lookup operation on two embedding matrices: one for entities, denoted $\mathbf{E} \in \mathbb{R}^{|\mathcal{E}| \times k}$, and the other for relationships, denoted $\mathbf{R} \in \mathbb{R}^{|\mathcal{R}| \times k'}$, where $k$ and $k'$ are the embedding dimensions of entities and relationships, and $|\mathcal{E}|$ and $|\mathcal{R}|$ represent the number of entities and relationships. The model then concatenates $\bar{h}$ and $\bar{r}_r$ and uses the result as the input of a 2D convolutional layer with convolution kernels $\omega$. This layer returns a feature map tensor $\mathcal{T} \in \mathbb{R}^{c \times m \times n}$, where $c$ is the number of 2D feature maps of dimensions $m$ and $n$. The tensor $\mathcal{T}$ is then reshaped into a vector $\mathrm{vec}(\mathcal{T}) \in \mathbb{R}^{cmn}$, projected into the $k$-dimensional space by a linear transformation parameterized by the matrix $W \in \mathbb{R}^{cmn \times k}$, and matched with the target embedding by the inner product. The parameters of the convolution kernels $\omega$ and the matrix $W$ are independent of the entities $h$ and $t$ and the relationship $r$.
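This feed-forward pass can be sketched in NumPy as follows. The shapes and the naive convolution loop are illustrative only (the paper's hyperparameters and any deep learning framework's optimized convolution would differ):

```python
import numpy as np

def conve_score(h_emb, r_emb, all_entity_emb, omega, W, kw=4, kh=5):
    """Score all candidate tail entities for (h, r, ?) with a ConvE-style pass.

    Shapes (illustrative, not the paper's hyperparameters):
      h_emb, r_emb   : (k,) with k = kw * kh
      all_entity_emb : (num_entities, k)
      omega          : (c, fh, fw) -- c convolution kernels
      W              : (c * m * n, k) -- projection back to k dimensions
    """
    # 1. Reshape the 1D embeddings into 2D maps and stack them vertically.
    stacked = np.concatenate([h_emb.reshape(kh, kw),
                              r_emb.reshape(kh, kw)], axis=0)   # (2*kh, kw)

    # 2. Valid 2D cross-correlation with each kernel (naive loops for clarity).
    c, fh, fw = omega.shape
    m, n = stacked.shape[0] - fh + 1, stacked.shape[1] - fw + 1
    feature_maps = np.empty((c, m, n))
    for ci in range(c):
        for i in range(m):
            for j in range(n):
                feature_maps[ci, i, j] = np.sum(stacked[i:i+fh, j:j+fw] * omega[ci])
    feature_maps = np.maximum(feature_maps, 0.0)  # nonlinearity f (ReLU)

    # 3. Vectorize the feature maps, project to k dimensions, apply f again.
    hidden = np.maximum(feature_maps.reshape(-1) @ W, 0.0)      # (k,)

    # 4. Inner product with every candidate tail embedding: one score per entity.
    return all_entity_emb @ hidden                              # (num_entities,)

rng = np.random.default_rng(0)
k, num_entities, c, fh, fw = 20, 7, 3, 3, 3
scores = conve_score(rng.normal(size=k), rng.normal(size=k),
                     rng.normal(size=(num_entities, k)),
                     rng.normal(size=(c, fh, fw)),
                     rng.normal(size=(c * 8 * 2, k)))  # m=8, n=2 for these shapes
```

Because the final step is an inner product against the whole entity matrix, one forward pass yields a score for every candidate tail at once, which is what enables the 1-N scoring used during training.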
3.2.2. Model Training Process
To train the model parameters, a sigmoid activation function $\sigma$ is applied to the scores, i.e., $p = \sigma(\psi_r(h, t))$, and the following binary cross-entropy loss is minimized:

$$\mathcal{L}(p, t) = -\frac{1}{N} \sum_{i} \big( t_i \log(p_i) + (1 - t_i) \log(1 - p_i) \big)$$

where $t$ is a label vector with dimension $\mathbb{R}^{1 \times 1}$ for 1-1 scoring and $\mathbb{R}^{1 \times N}$ for 1-N scoring. The elements of the vector $t$ are 1 for those relationships that exist; otherwise, the element is 0.
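This loss can be written out directly; a minimal NumPy sketch over the 1-N scores of one training example:

```python
import numpy as np

def bce_loss(scores, labels):
    """Binary cross-entropy over 1-N scores, as used to train ConvE-style models.

    scores : (N,) raw scores psi_r(h, t_i) for all N candidate tails
    labels : (N,) 1 where (h, r, t_i) is a true triple, else 0
    """
    p = 1.0 / (1.0 + np.exp(-scores))   # sigmoid activation
    eps = 1e-12                          # numerical safety for log(0)
    return -np.mean(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps))
```

The loss approaches zero when high scores coincide with positive labels and low scores with negative ones, and grows quickly for confident wrong predictions.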
During training, rectified linear units are first used as the nonlinearity to increase training speed, and batch normalization is applied after each layer to stabilize training and accelerate convergence. Second, the model is regularized with dropout at several stages, including the embedding layers, the feature maps following the convolution operation, and the hidden units following the fully connected operation. Adam is used as the optimizer, and label smoothing is applied to the labels to reduce the overfitting caused by saturation of the output nonlinearity.
For the constructed ISLKG, in the training and validation process, the data of ISLKG are first exported and adjusted to conform to the input format of the ConvE model. Upon initiating the training, ConvE reshapes and concatenates each triple in the ISLKG based on the aforementioned procedure. It then performs convolutional neural network computations, and finally reshapes the results into a one-dimensional vector. This vector is projected into a multi-dimensional space through a linear transformation. Subsequently, it is matched with the target embedding, and prediction probabilities are calculated after this matching process.
3.2.3. Experiment and Results Analysis
One objective of knowledge reasoning is to augment the knowledge graph. The degree of knowledge graph completion is used to evaluate the performance of the knowledge reasoning model, and the completion degree is typically evaluated with the link prediction task: given two elements of a triple, such as a known head entity and relationship, predict the third element, such as the correct tail entity. Formally, given the queries (h, r, ?) or (?, r, t), the task is to predict the correct set of tail entities or head entities, respectively.
The data used in this paper are from the island knowledge graph constructed above and stored in Neo4j. The data were divided into a training set, a validation set, and a test set at a ratio of 84%:7%:7%.
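A split like this can be produced with a simple shuffle-and-slice; this sketch is illustrative (the ratio tuple and seed are assumptions, not the paper's exact procedure):

```python
import random

def split_triples(triples, ratios=(0.84, 0.07, 0.07), seed=42):
    """Shuffle the triples and split them into train/valid/test by the given ratios.

    Any rounding remainder is left in the test portion.
    """
    triples = list(triples)
    random.Random(seed).shuffle(triples)  # deterministic shuffle for reproducibility
    n = len(triples)
    n_train = int(n * ratios[0])
    n_valid = int(n * ratios[1])
    return (triples[:n_train],
            triples[n_train:n_train + n_valid],
            triples[n_train + n_valid:])

train, valid, test = split_triples(range(1000))
```

Shuffling before slicing matters here: triples exported from Neo4j tend to be grouped by entity, and an unshuffled split would leave some islands entirely unseen during training.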
Table 9 displays specific descriptions of the dataset’s entities, relationships, and triples.
The assessment indicators of the link prediction task on the island knowledge graph consist of three ranking indicators: MRR (Mean Reciprocal Rank), MR (Mean Rank), and Hits@k, because the link prediction task treats the learning of entities and relationships as a ranking task. Specifically, for each triple in the test set and validation set, the head entity or tail entity is replaced with other entities related to the relationship in the island knowledge graph. These replaced triples are referred to as negative samples, while the correct triples are referred to as positive samples. After being scored by the scoring function, the candidates are sorted in descending order of score to determine the rank of the correct triple. Taking tail-entity prediction as an example, for an evaluation sample (h, r, t), let its correct label set be $S = \{e \mid (h, r, e) \text{ holds}\}$. To calculate the aforementioned indicators, the ranking value of the correct label t among the candidate entities must be counted. There are two statistical approaches for counting the ranking value: the raw ranking value and the filtered ranking value. The raw ranking value directly counts the rank of t among the candidate entities as the final rank. However, given that the correct label set S includes not only the label t but also other correct entities, counting only the raw rank of t overestimates the ranking value whenever other correct labels happen to be ranked ahead of t. This paper therefore uses the filtered ranking value, which filters out all other correct labels ranked ahead of t and then takes the resulting rank of t as the final rank. Based on the preceding definitions, the calculation procedure for MR is as follows:

$$\mathrm{MR} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{rank}_i$$

where $N$ is the number of evaluation samples, and $\mathrm{rank}_i$ represents the ranking value of the correct label in the i-th sample. MR stands for Mean Rank. As evident from the formula, for each test triple (h, r, t), the model predicts a score for all possible entities and ranks them by score; the average rank over all test triples is the model's MR value. Generally, a smaller MR value indicates better model performance. MRR stands for Mean Reciprocal Rank, a more comprehensive evaluation metric: it considers not only the rank of the correct entity but the reciprocal of that rank, averaging the reciprocal ranks over all test triples. The MRR value ranges from 0 to 1, with values closer to 1 indicating better model performance. The corresponding MRR is calculated as follows:

$$\mathrm{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\mathrm{rank}_i}$$
The Hits@k indicator counts the proportion of evaluation samples in which the correct label is ranked within the top k, and is calculated as follows:

$$\mathrm{Hits}@k = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}(\mathrm{rank}_i \le k)$$

where $\mathbb{I}(\cdot)$ is an indicator function that takes the value 1 when its argument is true and 0 otherwise. For each test triple, the model's prediction is counted as a "hit" if the correct entity falls within the top k predictions and as a "miss" otherwise; Hits@k is the number of hits divided by the total number of test triples. Small values of k, such as 1, 3, or 10, are typically used to report the hit rate of the model's top k predictions. In summary, while MR and MRR mainly focus on ranking, Hits@k emphasizes the performance of the model's predictions within the top k candidates. This paper selects Hits@10, Hits@3, and Hits@1 as evaluation indicators. A smaller MR together with larger MRR and Hits@k values indicates better reasoning performance of the model.
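All three metrics follow directly from the list of filtered ranks; a minimal sketch:

```python
def link_prediction_metrics(ranks, ks=(1, 3, 10)):
    """Compute MR, MRR, and Hits@k from a list of filtered ranks.

    ranks : list of ints, the filtered rank of the correct entity
            for each evaluation sample (1 = ranked first).
    """
    n = len(ranks)
    mr = sum(ranks) / n                                    # Mean Rank
    mrr = sum(1.0 / r for r in ranks) / n                  # Mean Reciprocal Rank
    hits = {k: sum(1 for r in ranks if r <= k) / n for k in ks}
    return mr, mrr, hits

mr, mrr, hits = link_prediction_metrics([1, 2, 5, 12])
# mr = 5.0; mrr = (1 + 1/2 + 1/5 + 1/12) / 4; hits[10] = 0.75
```

Note how one outlier rank (12) dominates MR but barely moves MRR, which is why MRR is usually considered the more robust headline metric.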
The ConvE model is compared with the classic knowledge graph link prediction models DistMult and ComplEx, using the aforementioned evaluation indicators to assess the link prediction performance of the three models; the learning rate during training is adjusted with an adaptive optimizer.
Table 10 displays the appropriate hyperparameter settings for these three models.
On the dataset of island knowledge graphs produced in this paper, the link prediction outcomes of several models are compared.
Table 11 displays the models' average prediction results for head and tail entities, where bold denotes the best value for each indicator. Specifically, the ConvE model achieved higher scores on the MRR metric, indicating its superiority in average reciprocal ranking when predicting the correct head and tail entities. Furthermore, the ConvE model exhibited lower MR values, demonstrating better performance in terms of average rank. Most notably, the ConvE model's hit rates (Hits@10, 3, 1) at the different k values clearly surpassed those of the other two models, implying that ConvE more often identifies the correct entity within its top k predictions. These results suggest that the ConvE model possesses stronger representation learning and prediction capabilities in the context of knowledge graph embeddings.
Comparing the link prediction outcomes of the models on the island knowledge graph dataset reveals that ConvE outperforms the other models on all evaluation indicators, exceeding them by more than 0.02 in MRR and Hits@10.
In addition, this paper uses the link prediction result as one of the quality evaluation indicators of knowledge graph construction, comparing the link prediction result of ConvE on the island knowledge graph dataset with its results on the public link prediction datasets WN18RR, FB15K-237, and YAGO3-10, alongside their numbers of entities, relationships, and triples. This study also introduces the comparative indices T/E and E/R. T/E is the ratio between the number of triples and the number of entities in the knowledge graph, i.e., the average number of relationships between each entity and other entities, reflecting the density of relationships between entities. The greater the T/E value, the more relationships each entity has on average, and the more closely connected the entities in the knowledge graph are. E/R is the ratio between the number of entities and the number of relationships, i.e., the average number of entities corresponding to each relationship, reflecting the complexity of the relationships. When the number of entities in the knowledge graph is large, a smaller E/R value means fewer entities per relationship and therefore a more complex relationship structure. We believe that, once the model has been determined, the numbers of entities, relationships, and triples in the knowledge graph form the foundation for improving link prediction; when the knowledge graph's data volume reaches a certain threshold, the T/E and E/R indicators have a substantial effect on the final prediction results. The experimental results presented in the table below support this conclusion.
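As an illustration, T/E and E/R can be computed directly from dataset statistics. The figures below are the commonly reported sizes of the public benchmarks and are illustrative reference values, not taken from the paper's Table 12 (ISLKG's own counts appear in Table 9 and are not repeated here):

```python
# Commonly reported sizes of the public benchmarks (entities, relations, triples);
# illustrative reference figures, not the paper's Table 12.
DATASETS = {
    "WN18RR":    (40_943,  11,  93_003),
    "FB15K-237": (14_541,  237, 310_116),
    "YAGO3-10":  (123_182, 37,  1_089_040),
}

te = {name: T / E for name, (E, R, T) in DATASETS.items()}  # relationship density
er = {name: E / R for name, (E, R, T) in DATASETS.items()}  # relationship complexity

for name in DATASETS:
    print(f"{name:10s}  T/E = {te[name]:6.2f}   E/R = {er[name]:8.1f}")
```

On these figures, FB15K-237 has both the largest T/E (densest connections) and the smallest E/R (most complex relationships), consistent with the comparison drawn in the text.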
Table 12 displays the data volume of the KG datasets and the outcomes of link prediction on them using the ConvE model. Bold font represents the optimal value of each evaluation indicator across the four datasets. It can be seen that the T/E and E/R of FB15K-237 are superior to those of the other datasets, indicating that this dataset is the most complex and that the links between entities are the closest; its MR is accordingly significantly better than that on the other datasets. Nevertheless, YAGO3-10 has the largest data volume among the four datasets, and its complexity is second only to FB15K-237; the ConvE model achieved the best Hits@10 on the YAGO3-10 dataset. The island knowledge graph dataset created for this paper attained the highest index values for MRR, Hits@3, and Hits@1, while its Hits@10 is second only to the best, YAGO3-10. Specifically, our ISLKG achieved an MRR of 0.454, indicating strong performance in average reciprocal ranking and the ability to position correct tail entities higher in the rankings. In terms of Hits@3, we obtained 0.490, and for Hits@1 we achieved 0.387, signifying high accuracy in the top three and topmost predicted positions. Notably, our ISLKG dataset's Hits@10 score was only slightly lower than that of the best-performing YAGO3-10 dataset, indicating that the model's hit rate within the top 10 predictions also approaches the best performance levels.