1. Introduction
A knowledge graph (KG) consists of semantic edges and diverse entities, represented as triplets (h, r, t). KGs can be general, such as YAGO [1] and DBpedia [2], or domain-specific, such as biomedical KGs [3] and financial KGs [4]. Medical KGs, in particular, exhibit unique structures, contain abundant semantic information [5], and play a crucial role in various healthcare applications, including disease diagnosis [6], drug analysis [7], clinical education [8], and data quality management [9]. However, constructing a medical KG from scratch is a time-consuming and labor-intensive procedure [10]. Recent studies have proposed constructing medical KGs by either fusing existing KGs [11] or extracting entities related to ICD-9 from Freebase [12]. These approaches streamline the construction of medical KGs by leveraging existing resources and extracting relevant entities. However, they inevitably introduce noisy triplets due to human errors and imperfect algorithms. For example, the correct triplet (Bill Gates, found, Microsoft) might be mistakenly identified as (Bill Gates, found, Google). Such errors can propagate to downstream tasks and applications that rely on the accuracy of the knowledge graph [13].
Traditional knowledge graph representation learning algorithms, such as TransE [14], DistMult [15], and RotatE [16], assume the correctness of all triplets in the KG. However, the absence of error detection mechanisms exposes downstream tasks to significant risks. Therefore, it is crucial to develop an error detection algorithm to ensure the reliability of a medical KG.
As the errors present in KGs can be diverse and their nature may be unknown [17], detecting noisy triplets in KGs is nontrivial [18]. Recently, some studies have enabled error-aware learning against noisy triplets [19,20]. In particular, CKRL [19] estimates triplet confidence through path and KG embedding information. To classify triplets directly by trustworthiness, KGTtm [21] proposes a model that integrates entity-, relationship-, and global-KG-level information. The state-of-the-art error detection method, CAGED [22], effectively combines KG embedding with contrastive learning; it estimates the confidence of triplets both locally and globally, enabling accurate detection of errors in the KG. By utilizing additional entity attribute information, AKAE [23] enhances the error detection process and improves the accuracy of identifying nontrivial errors in the KG.
While AKAE has shown promising results, obtaining external attribute information for entities can be challenging. Therefore, we propose an approach that leverages the topological structure of the graph and the rich latent entity information within medical KGs (MKGs) to extract intrinsic label information. This approach is more versatile and can be applied to various medical KGs without relying on explicit attributes. To this end, we propose a framework named Enhancing error detection on Medical knowledge graphs via intrinsic labEL (EMKGEL). Our contributions are as follows:
Noting the abundance of entity labels in medical KGs, we propose a novel method that extracts the intrinsic label information of entities via a hyper-view KG; as no such hyper-view KG exists, we construct it ourselves.
Aiming to integrate topological information and intrinsic label information, we propose a hyper-view GAT consisting of a bi-LSTM layer that captures local structural messages and a modified graph attention mechanism that models neighborhood information together with potential label messages.
Ranking triplets by confidence score, we conduct comprehensive experiments on three medical KGs and one general KG, where our method outperforms existing baselines.
This research aims to make a valuable contribution to the advancement of medical knowledge graphs within the open-source community. The proposed approach can be effectively utilized for identifying errors in knowledge graph construction. By leveraging this method, it is possible to reduce the time required for construction while simultaneously improving the overall quality of the knowledge graph. In summary, the proposed method represents a promising tool for error detection in knowledge graphs and has wide-ranging implications for advancing the field.
3. Problem Statement
Given a knowledge graph G, we define G = (E, R, T), where E, R, and T represent the sets of entities, relations, and triplets, respectively. The major notations used in this paper are summarized in Table 1.
Definition 1. Errors in KGs. Considering a triplet (h, r, t), there is an error if there is a mismatch of the head, relation, or tail, represented as (h', r', t'). Two examples illustrate this. First, the triplet (Nanjing, is the capital of, China) incorrectly identifies Nanjing as the capital of China, whereas the correct head entity should be Beijing. Second, in the triplet (Jimmy Carter, is the son of, Chip Carter), the relationship between the two individuals is inverted. In real-world knowledge graphs, there are no ground-truth datasets that include labeled errors. To simulate noisy triplets, we introduce errors by randomly replacing either the head or the tail of the original triplets, obtaining a modified set of triplets with noise. Entities and relations outside of the knowledge graph are not considered in this process.
Definition 2. The triplet-level KG. We regard each triplet as a node and define a relationship between two triplets that share either the head entity or the tail entity. The triplet-level KG thus consists of the set of triplets from the original knowledge graph G as nodes, together with the edges connecting triplets that share entities. As illustrated in Figure 1b, two triplets that share the entity Alzheimer's disease have a hidden connection.
Definition 3. The hyper-view KG. For any given triplet in G, we define the set of all tail entities associated with the same (h, r) pair and, similarly, the set of all head entities that share the same relation r and tail t. As depicted in Figure 1a, the head entities at the top share the same pair (belong department, Neurology). Entities such as cerebral infarction, NeuroLyme disease, syringomyelia, and Alzheimer's disease are likely to be labeled as diseases. Similarly, chestnut and ginkgo are likely to be labeled as food items.
Definition 4. Confidence score [19]. To classify a triplet, we introduce an estimated score ranging from 0 to 1. For a true triplet (h, r, t), the confidence score is expected to be close to 1, indicating a high degree of confidence in its accuracy. Unless otherwise specified, boldface notation in the formulas represents vectors.
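The error-simulation protocol of Definition 1 (randomly corrupting the head or the tail of a true triplet while staying inside the existing entity set) can be sketched as follows; the entity and triplet names are illustrative, not taken from the datasets.

```python
import random

def corrupt(triplet, entities, existing, rng):
    """Replace the head or the tail with a random in-KG entity,
    rejecting candidates that coincide with known-true triplets."""
    h, r, t = triplet
    while True:
        e = rng.choice(entities)
        cand = (e, r, t) if rng.random() < 0.5 else (h, r, e)
        if cand != triplet and cand not in existing:
            return cand

rng = random.Random(0)
entities = ["Beijing", "Nanjing", "China", "Japan"]
kg = {("Beijing", "is the capital of", "China")}
noisy = corrupt(("Beijing", "is the capital of", "China"), entities, kg, rng)
print(noisy)  # a corrupted variant, e.g. with Nanjing or Japan swapped in
```

The rejection check mirrors the requirement that corrupted triplets must not already appear in the original set.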
4. Methodology
Traditional methods for KG representation learning typically model the KG as a heterogeneous graph and learn embeddings for entities (nodes) and relations (edges) [14,26,27]. However, most algorithms struggle to handle the complex relations between triplets. To tackle this issue, several studies [32,33,34,35] propose methods that aggregate information from neighboring nodes to capture latent messages. Furthermore, some studies [23,37] explore the integration of external information, such as entity labels, to enhance KG embedding. However, obtaining label information for entities in MKGs is not always straightforward. Thus, this paper proposes a novel framework named Enhancing error detection on Medical knowledge graphs via intrinsic labEL (EMKGEL). Inspired by recent work [22,23], we construct a triplet-level KG (Definition 2) to capture neighborhood information. In contrast to CAGED, which employs a hyper-parameter for error filtering, our approach extracts potential label information at the hyper-view level and then integrates it into the attention mechanisms within our framework. Our intuition is that a triplet can acquire analogous label features from the set associated with the same hyperedge; for example, subclass words under the same parent-class word often share similar semantic contexts.
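The triplet-level KG of Definition 2 can be sketched in a few lines: each triplet becomes a node, and two nodes are linked whenever they share a head or a tail entity (the entity names below are illustrative).

```python
from collections import defaultdict

def build_triplet_level_kg(triplets):
    """Link triplet-nodes that share a head or tail entity."""
    by_entity = defaultdict(set)
    for i, (h, _, t) in enumerate(triplets):
        by_entity[h].add(i)
        by_entity[t].add(i)
    neighbors = defaultdict(set)
    for ids in by_entity.values():
        for i in ids:
            neighbors[i] |= ids - {i}  # every co-occurring triplet is a neighbor
    return neighbors

triplets = [
    ("Alzheimer's disease", "belong department", "Neurology"),
    ("Alzheimer's disease", "symptom", "Memory loss"),
    ("Diabetes", "symptom", "Polydipsia"),
]
adj = build_triplet_level_kg(triplets)
print(sorted(adj[0]))  # [1]: triplets 0 and 1 share "Alzheimer's disease"
```

Note that sharing only a relation (as triplets 1 and 2 share "symptom") does not create an edge, matching the definition's restriction to shared entities.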
As illustrated in
Figure 2, our proposed EMKGEL consists of a multi-view KG, a hyper-view GAT, and a joint confidence score. In our proposed approach, we first generate a triplet-level KG and a hyper-view KG, as described in Definitions 2 and 3. We then employ an attention mechanism to capture the intrinsic label information of entities in the hyper-view KG. This information is subsequently incorporated into the triplets using the hyper-view GAT to enhance the nodes. To train the model, we utilize a combination of KG embedding loss and global triplet embedding loss. This joint training enables the model to learn meaningful representations that capture both the structural relationships within the KG and the semantic contexts of the triplets. Finally, we estimate triplets using a joint confidence score based on the learned representations.
4.1. Hyper-View GAT for Representation Learning
Noisy triplets in KGs can have a detrimental effect on representation learning, thereby jeopardizing downstream tasks. It is crucial to ensure the reliability of the encoder to mitigate the impact of these errors. To address this challenge, CAGED introduces an error-aware GNN that filters out errors. However, determining an optimal hyper-parameter for error filtering is a complex task, making it challenging to apply CAGED to different KGs effectively. In this paper, we present a novel method to extract intrinsic label information from the hyper-view KG. We leverage this information to enhance the representation of triplets by incorporating it into a hyper-view GAT. By combining local structural information and neighborhood triplet messages, the hyper-view GAT effectively integrates multiple sources of information. Following the integration process, we estimate the reliability of each triplet by assigning a confidence score. This score serves as a measure of the triplet’s quality, aiding in the interpretation and utilization of the KG data.
Figure 2. For training purposes, all triplets are paired with negative examples, denoted in red, generated by randomly replacing the head entity, the relation, or the tail entity of each triplet. (a) We construct a triplet-level KG and a hyper-view KG. (b) Enhanced by intrinsic label information, triplet-level nodes learn their embeddings from neighborhood messages. (c) After training on a joint loss, we estimate the confidence of triplets by combining their local and global trustworthiness.
4.1.1. Local Structural Information Modeling
Applying a GAT [42] for representation learning emphasizes capturing latent information from neighboring nodes. However, this approach can weaken the inherent structural information contained within the triplets themselves. Taking inspiration from CAGED, we utilize a bidirectional LSTM to acquire local representations that preserve the specific structural information of the triplets during global information learning.
As shown in Equation (1), we initialize a triplet representation and pass the resulting vector through the bi-LSTM layer. The output represents the local triplet embedding, which captures the structural information of the triplet. Subsequently, we utilize this local embedding as the input for the global modeling layer.
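This step can be sketched with PyTorch (the paper's experiments use PyTorch 1.11): the triplet is treated as a three-step sequence (h, r, t) and fed through a bidirectional LSTM. The dimensions and the mean-pooling into a single local embedding are our assumptions for illustration, not necessarily the paper's exact design.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim = 100  # embedding size, matching the hidden size reported in the experiments

# Hypothetical embeddings for one triplet (h, r, t), treated as a length-3 sequence.
h, r, t = torch.randn(1, dim), torch.randn(1, dim), torch.randn(1, dim)
seq = torch.stack([h, r, t], dim=1)  # shape: (batch=1, seq_len=3, dim)

bilstm = nn.LSTM(input_size=dim, hidden_size=dim,
                 bidirectional=True, batch_first=True)
out, _ = bilstm(seq)  # out: (1, 3, 2*dim), forward+backward states per position

# One simple pooling choice: mean over the three positions.
local_emb = out.mean(dim=1)
print(local_emb.shape)  # torch.Size([1, 200])
```

Because the LSTM reads h, r, and t in order (and in reverse), the resulting vector keeps the within-triplet structure that a purely neighborhood-based GAT would dilute.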
4.1.2. Intrinsic Label Information in Hyper-View KG
In MKGs, there is a wealth of untapped entity label information available. To uncover the latent label information of entities without external information, we construct a hyper-view KG that enriches each entity with diverse attributes within the medical domain. For an embedded triplet, the head entity h and the tail entity t each have a corresponding set, as defined in Definition 3. Taking the shared (h, r) pair as an example, we obtain a set of tail entities; under our assumption, the entities in this set likely share similar characteristics or properties, allowing us to infer potential labels or attributes for the tail entity.
To capture the messages associated with potential labels, we employ an attention mechanism in which each coefficient measures the compatibility between the original tail entity t and the i-th tail entity in the set from Definition 3, computed by a single-layer feed-forward neural network. To reduce bias and ensure a fair comparison among the coefficients, we normalize them using a Softmax function. By incorporating information from the entire set, we capture the potential labels or attributes associated with the tail entity t.
Similarly, we obtain the head-side representation from all of the head entities that share the same (r, t) pair.
To estimate the importance of the head-side and tail-side representations, we employ a similarity calculation. The resulting hyper-view score is utilized to gauge the significance of neighboring nodes. By considering the hyper-view score, we can determine how much each neighboring node contributes to the representation and understanding of the target triplet.
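The attention over the co-occurring entity set and its Softmax normalization can be sketched in plain Python; the vectors are toy values, and the single-layer feed-forward scoring is simplified to a dot product purely for illustration.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(target, candidates):
    """Aggregate a set of entity vectors, weighted by attention toward the target."""
    weights = softmax([dot(target, c) for c in candidates])
    agg = [sum(w * c[i] for w, c in zip(weights, candidates))
           for i in range(len(target))]
    return agg, weights

t = [1.0, 0.0]                                      # original tail entity (toy)
co_tails = [[0.9, 0.1], [0.8, -0.2], [-1.0, 1.0]]   # entities sharing (h, r)
agg, w = attend(t, co_tails)
print([round(x, 3) for x in w])  # weights sum to 1; similar entities get more
```

As intended, entities close to the original tail receive larger weights, so the aggregated vector encodes the "potential label" shared by the set.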
4.1.3. Neighborhood Information Modeling
Relying solely on KG embeddings to estimate the confidence of a triplet is insufficient; it is crucial to incorporate contextual information from neighboring triplets. While previous methods such as R-GCN [34] and KGAT [35] have shown effectiveness in leveraging neighborhood information, they may suffer a decline in performance when neighboring triplets contain noisy or erroneous messages, as highlighted in CAGED [22]. To address this issue, CAGED introduces a hyper-parameter that helps mitigate the impact of errors. Given the abundant availability of entity label information in MKGs, we instead utilize hyper-view scores to enhance the neighborhood nodes.
Specifically, for a given anchor triplet with m neighboring triplets, we aggregate the information from these neighbors to update the representation of the anchor triplet. The attention weight of each neighboring triplet with respect to the anchor is calculated by an attention function, in which a trainable parameter matrix projects the triplets into the same vector space.
Then, we incorporate the hyper-view score to enhance the label information, as depicted in Equations (2)–(6). In detail, we take the dot product between the attention weight and the hyper-view score during normalization, which yields the normalized coefficient of the j-th neighboring triplet with respect to the anchor triplet. Finally, we obtain the reconstructed vector for the original anchor triplet, where the Sigmoid function serves as the nonlinearity.
4.2. Joint Training Strategy
To capture the semantic and latent information, we introduce a training strategy to integrate KG embedding loss and global embedding loss.
Based on the translation assumption, we utilize the TransE score function to fit the local structural information. For neighborhood information embedding, we estimate the distance between the anchor triplet and the reconstructed triplet as follows:
To integrate the KG embedding and the global triplet embedding, we introduce a trade-off hyper-parameter, whose effect we investigate in Section 5. The calculation is as follows:
Subsequently, we leverage a margin-based ranking loss function with negative sampling during training, following previous work [14], where the margin is a hyper-parameter, S indicates the set of original triplets, and the negative set contains triplets whose head or tail entities are randomly replaced. It is crucial to ensure that corrupted triplets are not in S and are non-repetitive.
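The margin-based ranking objective of [14] can be sketched as a hinge loss over a positive/negative energy pair; the margin value and the energies below are illustrative.

```python
def margin_ranking_loss(pos_energy, neg_energy, gamma=1.0):
    """Hinge loss: push positive triplets at least `gamma` below negatives.
    Energies come from a TransE-style score, where lower = more plausible."""
    return max(0.0, gamma + pos_energy - neg_energy)

# Toy energies for a (positive, corrupted) pair.
print(margin_ranking_loss(0.3, 2.0))  # 0.0: already separated by more than gamma
print(margin_ranking_loss(0.9, 1.2))  # positive loss: the margin is violated
```

The loss is zero once a corrupted triplet is scored at least one margin worse than its true counterpart, so training effort concentrates on pairs that are not yet well separated.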
4.3. Confidence Score
After training, we obtain the confidence score, as depicted in Equation (14), where the Sigmoid function is applied and a similarity function measures the similarity between the original triplet and the reconstructed triplet. The confidence score ranges from 0 to 1, with higher values indicating a stronger positive correlation for the triplet.
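A hedged sketch of a joint confidence score in this spirit: a sigmoid of the negated TransE energy (local plausibility) blended with the cosine similarity between the original and reconstructed triplet vectors (global consistency). The exact combination and weighting in Equation (14) may differ; this only illustrates the shape of the computation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cosine(u, v):
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def confidence(transe_energy, original_vec, reconstructed_vec, lam=0.5):
    """Blend local plausibility (low energy) with global consistency (similarity)."""
    local = sigmoid(-transe_energy)  # low energy -> value near 1
    global_ = sigmoid(cosine(original_vec, reconstructed_vec))
    return lam * local + (1 - lam) * global_

good = confidence(0.1, [1.0, 0.2], [0.9, 0.25])  # low energy, similar vectors
bad = confidence(3.0, [1.0, 0.2], [-0.7, 0.9])   # high energy, dissimilar vectors
print(round(good, 3), round(bad, 3))             # good clearly above bad
```

Both terms live in (0, 1), so their convex combination is a valid confidence score; the trade-off weight plays the same role as the hyper-parameter studied in Section 5.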
The learning process of our method is summarized in Algorithm 1.
Algorithm 1 Error detection on medical knowledge graphs via intrinsic label information
Input: Knowledge graph with noise
Output: KG embeddings and confidence scores
1: Initialize network parameters
2: Construct a triplet-level KG and a hyper-view KG as per Definitions 2 and 3, respectively
3: while not converged do
4:   for each triplet in S do
5:     Model the local structural information of triplets as defined in Equation (1)
6:     Extract the intrinsic label information in the hyper-view KG and compute the importance score of triplets as defined in Equation (6)
7:     Acquire the triplet representation via the hyper-view GAT in Equation (9)
8:     Compute the KG embedding distance in Equation (10) and the global triplet embedding distance in Equation (11); combine them with the trade-off parameter to obtain the joint loss in Equation (13)
9:   end for
10: end while
11: Compute the confidence score as defined in Equation (14)
5. Experiments and Discussion
In this section, we provide detailed experimental settings. Through parameter analysis, an ablation study, and a case study, we validate the effectiveness of the proposed method, EMKGEL.
5.1. Experimental Settings
In this section, we provide a detailed overview of the experimental settings, including datasets, baseline methods, and evaluation metrics.
5.1.1. Benchmark Datasets
Similar to prior studies [19,22,23], we adopt the approach of randomly replacing head and tail entities to generate noisy triplets. As described in Definition 1, we introduce 5% noisy triplets into three real-world medical KGs and 5%, 10%, and 15% noisy triplets into one general KG to explore the robustness of our method.
PharmKG-8k [11] is a multi-relational, attributed biomedical knowledge graph composed of more than 500,000 individual interconnections between genes, drugs, and diseases, with 29 relation types over a vocabulary of 8000 disambiguated entities.
DiaKG is derived from 41 publicly published diabetes guidelines and consensus documents, covering the most extensive range of research topics and hot areas in recent years.
DiseaseKG is a knowledge graph built upon common disease information utilizing the cnSchema framework.
Detailed information on the datasets is summarized in
Table 2.
5.1.2. Baseline Methods
In our experiments, we introduce KG embedding baseline methods and error detection baseline methods.
KG embedding: We compare with traditional representation learning methods, including TransE [14], DistMult [15], and RotatE [16]. We use the value of each model's score function as the confidence score after training; in TransE, we employ the Euclidean distance as the confidence score.
Error detection: For KG error detection methods, we evaluate the proposed method against state-of-the-art approaches, including CKRL [19], KGTtm [21], and CAGED [22].
5.1.3. Evaluation Metrics
Consistent with previous studies [22,23], we rank all triplets by their confidence scores; triplets with lower scores are considered candidate noisy triplets. Precision@K and Recall@K are used to estimate effectiveness. In detail, Precision@K is the proportion of true noisy triplets among the K triplets with the lowest confidence scores, and Recall@K is the proportion of all injected noisy triplets that appear among those K triplets.
5.1.4. Implementation Details
We conduct experiments on an NVIDIA GeForce RTX 3090 GPU with Python 3.8 and PyTorch 1.11.0. Based on the average in-degree of the datasets, the number of neighbors for each triplet in the four datasets is set to 59/2/7/2. The embedding hidden size is set to 100, the same as the bi-LSTM hidden size. Default Xavier initialization and an initial learning rate of 0.003 are used.
In our experiments, we explore different hyper-parameters to assess their impact on the results. The trade-off parameter is set from 0.001 to 1000, while the margin parameter is adjusted within the range of 0 to 1. To mitigate the impact of randomness introduced by erroneous triplets, we average the experimental data across 10 random seeds ranging from 0 to 9.
5.2. Results and Analysis
In this section, we conduct a comprehensive evaluation to assess the effectiveness of our method across four datasets. Through thorough observation and analysis, we demonstrate that our proposed method performs effectively on all four datasets. For clarity, we highlight the optimal results in bold and underline the second-best results. Additionally, an asterisk (*) indicates that Precision@K is equal to Recall@K when K equals the injected noise ratio.
5.2.1. Main Results
As depicted in Table 3, the results demonstrate that (1) error detection methods outperform traditional embedding methods on the three medical KGs, and (2) notably, our proposed method, EMKGEL, outperforms all existing methods, delivering the best performance.
Specifically, traditional embedding methods such as TransE and RotatE focus solely on local structural information while neglecting global triplet embedding; this limited perspective can lose important messages from neighboring triplets. CAGED addresses this limitation by incorporating global triplet embedding through contrastive learning, which enables it to capture and leverage the global context of triplets and detect a higher number of errors in the KG. In contrast to CAGED's use of an uncertainty parameter for error filtering, we first leverage a hyper-view KG to extract potential label information for entities and then estimate the importance of nodes by assigning hyper-view scores, thereby enhancing their impact on neighboring nodes. In our experiments, we observe that our method demonstrates improvements of 0.7%, 6.1%, and 3.6% on PharmKG-8k, DiseaseKG, and DiaKG, respectively.
Additionally, we introduce different ratios of 5%, 10%, and 15% on Nell-995 to observe their effectiveness. As depicted in
Table 4, our method consistently achieves the best results across different cases, highlighting the robustness of EMKGEL.
5.2.2. Ablation Study
To validate the individual components of our proposed method, we conduct comprehensive experiments. Firstly, we replace the bi-LSTM by simply concatenating the triplet embedding to assess the impact of local structural information. As shown in
Table 5, we can observe that the variant employing only concatenation exhibits inferior performance due to the absence of structural information. Secondly, we replace the hyper-view GAT with a simple GAT. As indicated in
Table 5, our hyper-view GAT outperforms the GAT, thereby demonstrating the effectiveness of label information in enhancing the embeddings. Thirdly, we proceed to eliminate the KG and triplet embedding losses individually. Upon analyzing the results presented in
Table 5, it becomes apparent that the model’s performance suffers when either the KG or triplet embedding loss is removed. This observation suggests a strong interdependence between KG embedding and triplet embedding, indicating that these two components work collaboratively to enhance the model’s performance. Lastly, we introduce a replacement for the TransE score function, as shown in Equation (
10), by adopting the RotatE score function. This modification aims to explore the flexibility of different score functions and their impact on the model’s performance.
5.2.3. Parameter Analysis
The trade-off parameter balances the KG embedding loss and the global triplet embedding loss. To investigate its impact, we vary it from 0.001 to 1000 and conduct experiments on all four datasets, evaluating Recall@K. Based on the findings depicted in Figure 3a, we observe that (1) PharmKG-8k, DiaKG, and Nell-995 achieve their best performance when the trade-off parameter is set to 10; (2) DiseaseKG achieves its optimal result when it is set to 0.1; and (3) as the parameter increases, performance initially improves but declines once the optimal value is passed. This suggests that while enhancing the impact of global embedding initially boosts performance, there is a point of diminishing returns, and pushing the parameter beyond it does not yield the best performance.
The margin is the ranking-loss hyper-parameter. As shown in Figure 3b, the trends of the four datasets are essentially identical.
5.3. Case Study
To investigate how the hyper-score enhances error detection, we conducted a case study on the anchor triplet (Diabetes ketoacidosis, Symptom, Polydipsia) of DiseaseKG. As shown in
Figure 4, we present the hyper-scores of its neighbors. Under our assumption, the hyper-view GAT enhances triplets with high hyper-scores and reduces the impact of triplets with low hyper-scores, so the confidence score should clearly differentiate between true triplets and noisy triplets.
To confirm the results, we report the number of share(h, r)/share(r, t) entities and the confidence score of each triplet, respectively.
As shown in
Table 6, we observe that our proposed method outperforms CAGED. In detail, CAGED fails to lower the confidence scores of the three noisy triplets, while our method does so to a greater degree. Furthermore, three of the neighboring triplets possess more intrinsic label information, as indicated by their hyper-scores of {0.8993, 0.9103, 0.5371}, owing to a high number of share(h, r) and share(r, t) entities; the hyper-view GAT enhances their representations. Conversely, the remaining two neighbors contain less intrinsic information, having few share(h, r) and share(r, t) entities, so the hyper-view GAT diminishes their influence. However, the confidence score of the noisy triplet (Diabetes ketoacidosis, Medication, Shagliptin tablets) is ambiguous under both methods. In real-world scenarios, it is not enough to prescribe medication based solely on simple triplet information; the actual clinical situation must also be considered. This is a common limitation of existing knowledge-graph-based methods.
To show a straightforward validation of our method, we present a visualization in
Figure 5. The y-axis represents the confidence scores assigned to the triplets. True triplets are denoted by green nodes, while false triplets are represented by red nodes. In comparison to CAGED in
Figure 5b, our model in
Figure 5a assigns lower confidence scores to noisy triplets, approaching zero. This visualization serves as evidence of the effectiveness of our model in real-world scenarios.
6. Conclusions
In this paper, we propose a novel framework named Enhancing error detection on Medical knowledge graphs via intrinsic labEL (EMKGEL). Firstly, we construct a hyper-view KG and a triplet-level KG: the former captures intrinsic label information, while the latter focuses on neighborhood information. Secondly, we introduce the hyper-view GAT to incorporate entity label information into the triplets. Then, we integrate KG embedding and global triplet embedding in the training stage. In the end, we estimate each triplet by its confidence score. The evaluation on three medical KGs and one general KG demonstrates the effectiveness of EMKGEL. We believe that our method can be an effective tool for error detection during KG construction.
However, there remain some challenges that future work needs to address:
Existing error detection methods [19,21,22,23] only take into account the entities and relations already present in the knowledge graph, while erroneous triplets could originate from outside the dataset [43]. To address this limitation, combining textual information from large language models with graph structure information is a promising direction [44,45].
The largest dataset currently used contains approximately 500,000 triplets; in the future, we will consider conducting more experiments on large-scale graphs while maintaining effectiveness and reducing training time.
We will explore more meaningful downstream tasks, such as knowledge-based question answering in the medical field.