5.2. Link Prediction
The main task of link prediction is to predict t given (h, r), or h given (r, t), for a relation triple (h, r, t) in which either the head entity or the tail entity is missing. It evaluates the quality of the learned embeddings by ranking the candidate entities in the knowledge graph rather than by directly producing a single best answer. In our experiments, this task was divided into two subtasks, that is, link prediction for instance relation triples and link prediction for concept relation triples. We performed both subtasks on the two datasets.
Experimental design. First, we divided the triple set into the training set, the validation set, and the testing set, with relative percentages of 85%, 5%, and 10%, in accordance with the setting of the widely used benchmark model TransE [2]. To avoid overfitting, we used the parameters selected on the validation set during the testing stage. For each testing triple (h, r, t), we replaced the tail entity t with every entity x in the knowledge graph, obtaining a set of corrupted triples (h, r, x). After calculating the objective-function score F(h, r, x) of each corrupted triple and ranking the scores in ascending order, we obtained the rank of the testing triple (h, r, t) among all candidate triples. Similarly, the ranking over the candidate triples (x, r, t) obtained by replacing the head entity h of the testing triple (h, r, t) could also be computed. Like TransE [
2], we used two evaluation metrics in this task: the mean reciprocal rank (MRR) over all testing triples, and the proportion of correct triples that rank no larger than N (Hits@N). The better the performance of the model, the larger the MRR and Hits@N values. In addition, we applied the “Filter” [4] setting from earlier studies: if a corrupted triple already exists in the knowledge graph (that is, the triple obtained by replacing the head or tail is itself correct), it may reasonably score lower than the original testing triple. To eliminate this interference, before computing the rank of each testing triple we removed every corrupted triple that appears in the training set, the validation set, or the testing set, so that no remaining candidate triple belongs to any of the three sets.
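The filtered ranking protocol described above can be sketched as follows. This is an illustrative NumPy sketch, not the paper's implementation: the score function, entity indexing, and function names (filtered_ranks, mrr, hits_at) are assumptions, and only tail replacement is shown since head replacement is symmetric.

```python
import numpy as np

def filtered_ranks(test_triples, all_triples, num_entities, score_fn):
    """Filtered link-prediction evaluation: for each test triple (h, r, t),
    score every candidate tail x, mask corrupted triples (h, r, x) that are
    already known to be correct (the "Filter" setting), and record the
    1-based rank of the true tail under ascending score order."""
    known = set(all_triples)  # union of training, validation, and testing triples
    ranks = []
    for h, r, t in test_triples:
        scores = np.array([score_fn(h, r, x) for x in range(num_entities)], dtype=float)
        # Remove "interfering" candidates: other known-correct tails must not
        # count against the test triple's rank.
        for x in range(num_entities):
            if x != t and (h, r, x) in known:
                scores[x] = np.inf
        rank = 1 + int(np.sum(scores < scores[t]))  # lower score = better
        ranks.append(rank)
    return ranks

def mrr(ranks):
    """Mean reciprocal rank over all test triples."""
    return float(np.mean([1.0 / r for r in ranks]))

def hits_at(ranks, n):
    """Proportion of test triples ranked no larger than n."""
    return float(np.mean([r <= n for r in ranks]))
```

For example, if a known-correct corrupted triple scores better than the test triple, masking it restores the test triple's rank to 1, which is exactly the effect the “Filter” setting is meant to achieve.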
Experimental implementation. During training, we tuned the learning rate of the stochastic gradient descent algorithm, the margin hyperparameters, the embedding space dimensions, and the weights of the loss components over grids of candidate values. The best configuration was determined by the Hits@10 of the validation set. “Unif” denotes the traditional strategy of replacing the head or tail entity with equal probability, and “self-adv” denotes the self-adversarial negative sampling strategy with its sampling temperature parameter. The best parameters of the model take the same values on both datasets, with batch = 128 and the self-adversarial sampling parameter under the “self-adv” strategy. The model was trained for 5000 iterations on each dataset. For fairness, we selected TransE as the intra-view model in JOIE.
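The “self-adv” strategy weights negative samples by how plausible the current model considers them, following the self-adversarial scheme popularized by RotatE. A minimal sketch, assuming a distance-style score where lower means more plausible; the function names, the temperature argument alpha, and the margin gamma are illustrative, not the paper's exact formulation:

```python
import numpy as np

def self_adv_weights(neg_distances, alpha=1.0):
    """Self-adversarial weights: negatives the model currently scores as more
    plausible (smaller distance) receive larger softmax weight.  alpha is the
    sampling temperature; the weights are treated as constants (no gradient)
    in the original self-adversarial formulation."""
    logits = -alpha * np.asarray(neg_distances, dtype=float)
    logits -= logits.max()          # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def self_adv_loss(pos_distance, neg_distances, gamma, alpha=1.0):
    """Margin-based loss with self-adversarially weighted negatives,
    using sigmoid terms as in the RotatE-style objective."""
    w = self_adv_weights(neg_distances, alpha)
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    pos_term = -np.log(sig(gamma - pos_distance))
    neg_term = -np.sum(w * np.log(sig(np.asarray(neg_distances) - gamma)))
    return float(pos_term + neg_term)
```

Under “unif”, every negative would instead receive the uniform weight 1/k; the self-adversarial weights concentrate the loss on hard negatives, which is what makes the sampled negatives higher quality.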
Analysis of experiment results. Table 2 presents the experiment results. The table shows that CIST performs better than the other baseline models on almost every evaluation metric, especially the CIST (self-adv) variant, which uses the self-adversarial negative sampling method. On Hits@1 and Hits@10, CIST (self-adv) achieves scores almost twice as high as those of the other baselines, which further verifies CIST’s effectiveness.
Compared with the traditional models TransE, DistMult, and HolE, CIST outperforms them on average by 87.2% on MRR, 71.8% on Hits@1, and 144.2% on Hits@10. This shows that CIST can learn the specific features of instances and concepts during entity embedding learning by exploiting the latent semantic relations between instances and concepts. CIST adds these features to the embedding learning of the knowledge graph, which makes its performance much better than that of the other models. Compared with TransC, CIST outperforms it on average by 44.1% on MRR, 57.3% on Hits@1, and 38.3% on Hits@10. This is because, unlike TransC, CIST models instances and concepts in different embedding spaces, thereby avoiding the limited learning performance caused by the gathering of different instances that belong to the same concept. The experiment results demonstrate the effectiveness of modeling instances and concepts in different embedding spaces.
Compared to JOIE, CIST outperforms it on average by 21.4% on MRR, 51.1% on Hits@1, and 29.2% on Hits@10. This is because, unlike JOIE, CIST adds a neighboring-range parameter to the embedding of each concept, which models the transitivity of the isA relations and the situation in which the same instance belongs to different concepts. The experiment results show that CIST learns the latent semantic relations between instances and concepts better. With the “self-adv” setting, CIST generally outperforms its “unif” counterpart on MRR, Hits@1, and Hits@10, demonstrating the effectiveness of the “self-adv” negative sampling strategy.
Compared to the other baseline models, CIST with “unif” sampling performs slightly worse on instance relation triples on the YAGO26K-906 dataset. The reason may be that we selected the best configuration based on the Hits@10 of all triples on the validation set, which may not be optimal for instance relation triples alone and may lead to relatively lower evaluation values.
Overall, CIST is an effective knowledge graph embedding learning model that benefits from the mutual learning of instance embeddings and concept embeddings. As a result, CIST performs well on link prediction tasks.
5.3. Triple Classification
Determining whether a testing triple should be labeled “correct” or “incorrect” is the main task of triple classification. The triples can be instance relation triples, concept relation triples, instanceof triples, or subclassof triples. This is a binary classification task, and its evaluation metrics are accuracy, precision, recall, and F1 score, as commonly used in binary classification. We constructed the negative triples needed for triple classification testing according to the same settings as the neural tensor network model NTN [29]: we constructed one negative triple for each positive triple in the validation set and the testing set, so the numbers of positive and negative triples in the validation set and the testing set are the same.
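The NTN-style construction of one negative per positive can be sketched as follows. This is a hypothetical helper under assumed names (corrupt, known): a negative is built by replacing the head or tail with a random entity, resampled until it is not itself a known positive.

```python
import random

def corrupt(triple, entities, known, rng=random):
    """Build one negative triple for a positive one by replacing the head
    or the tail with a random entity, resampling until the corrupted
    triple is not a known positive.  Assumes at least one valid
    corruption exists, otherwise the loop would not terminate."""
    h, r, t = triple
    while True:
        e = rng.choice(entities)
        neg = (e, r, t) if rng.random() < 0.5 else (h, r, e)
        if neg not in known and neg != triple:
            return neg
```

Because exactly one negative is drawn per positive, the validation and testing sets end up with equal numbers of positive and negative triples, as described above.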
Experimental design. We divided the triple set into the training set, the validation set, and the testing set, with percentages of 60%, 20%, and 20%, respectively. To avoid overfitting, we used the parameters selected on the validation set during the testing stage. We set a threshold for each relation r in the dataset. For a given testing relation triple (h, r, t), we calculated the score of its score function F(h, r, t). If the score was less than the threshold, the triple was predicted to be “correct”; otherwise, it was predicted to be “incorrect”. Similarly, an instanceof triple (x, r, c) was predicted to be “correct” if the score of Formula (4) was below the corresponding threshold, and a subclassof triple (x, r, c) was predicted to be “correct” if the score of Formula (5) was below its threshold. Each relation-specific threshold was determined by maximizing the classification accuracy on the validation set.
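Selecting a per-relation threshold by maximizing validation accuracy can be sketched as below. The function name best_threshold and the midpoint candidate set are illustrative assumptions, not the paper's procedure; any threshold between the same pair of sorted scores yields the same accuracy, so midpoints suffice.

```python
def best_threshold(scores, labels):
    """Pick the score threshold that maximizes classification accuracy on
    one relation's validation triples: triples scoring below the threshold
    are predicted "correct" (label 1).  Candidates are the midpoints
    between consecutive sorted scores, plus one value below the minimum
    and one above the maximum."""
    pairs = sorted(zip(scores, labels))
    candidates = [pairs[0][0] - 1.0]
    candidates += [(a[0] + b[0]) / 2 for a, b in zip(pairs, pairs[1:])]
    candidates.append(pairs[-1][0] + 1.0)

    def accuracy(th):
        return sum((s < th) == bool(y) for s, y in zip(scores, labels)) / len(scores)

    return max(candidates, key=accuracy)
```

At test time, the same rule is applied with the chosen threshold: a triple is labeled “correct” exactly when its score falls below the threshold learned for its relation.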
Experimental implementation. In this task, the model parameters were optimized in the same way as in the link prediction task. The optimal configuration was determined by the accuracy on the validation set. The optimal parameter configurations of the model are as follows: for the YAGO26K-906 dataset, batch = 128 and the self-adversarial sampling parameter under the “self-adv” strategy; for the DB111K-174 dataset, batch = 128 and the self-adversarial sampling parameter under the “self-adv” strategy. For each dataset, all training triples were trained for 5000 iterations.
Experiment results. In our datasets, triples can be categorized into four different types: instance relation triples, concept relation triples, instanceof triples, and subclassof triples. We conducted experiments on each of the four triple sets. The experiment results are shown in Table 3, Table 4, Table 5, and Table 6, respectively.
We have the following observations from the experiment results.
(1) CIST achieves the highest accuracy and F1 score across all experiments, showing that it outperforms the state-of-the-art models on triple classification tasks. Although some baselines outperform CIST on precision or recall, CIST still achieves the highest F1 score. Since the F1 score balances precision and recall, the advantage of CIST is still verified to a large extent.
(2) Compared with TransE, CIST outperforms it on average by 30.5% on accuracy, 29.0% on precision, 17.9% on recall, and 23.3% on F1 score. This further proves that CIST can model the specific features of instances and concepts when learning entity embedding by exploiting the latent semantic relations between instances and concepts.
(3) From Table 3, Table 5, and Table 6, compared to TransC, CIST outperforms it on average by 17.4% on accuracy, 16.1% on precision, 14.4% on recall, and 14.9% on F1 score. The experiment results further demonstrate the effectiveness of modeling instances and concepts in different embedding spaces. TransC is not able to model concept relation triples, and thus we did not include it in Table 4.
(4) Compared to JOIE, CIST outperforms it on average by 16.3% on accuracy, 13.9% on precision, 16.7% on recall, and 15.2% on F1 score, which further demonstrates that CIST can model the transitivity of isA relations and learn the latent semantic links between instances and concepts much better.
(5) For the CIST model, compared to the “unif” sampling strategy, CIST (self-adv) outperforms it on average by 4.5% on accuracy, 3.7% on precision, 5.6% on recall, and 3.9% on F1 score. This shows that the “self-adv” sampling strategy is more effective than “unif” because it extracts higher-quality negative samples.
In conclusion, compared to TransC and JOIE, CIST can alleviate the gathering issue of instance and concept embeddings and can model the transitivity of isA relations much better, enabling CIST to achieve better results in triple classification tasks.