In this section, we first introduce our evaluation indicators and datasets, and give the structure-specific configuration of our model. We then report the results of IntME on four benchmark datasets. Finally, we set up experiments to demonstrate the superiority of our approach and the advantages of our structural choices.
4.1. Evaluation Indicator Selection
We use the general metrics MR, MRR, and H@N to evaluate model performance. MR (mean rank) is the average rank of the correct tail entity's probability or score in the prediction results; a lower MR indicates a better prediction. H@N (Hits@N) is the proportion of test cases in which the real tail entity ranks within the top N predictions; N usually takes the values 10, 3, and 1, giving H@10, H@3, and H@1, and higher values indicate better model performance. MRR (mean reciprocal rank) is the average of the inverse ranks of the correct tail entity in the prediction results; unlike MR, a higher value indicates a better model.
Given the set of ranks $\{\mathrm{rank}_i\}_{i=1}^{|S|}$ obtained over the test triplet set $S$, MRR, MR, and Hits@N (H@N) are calculated by Equations (12)–(14):

$$\mathrm{MRR} = \frac{1}{|S|}\sum_{i=1}^{|S|}\frac{1}{\mathrm{rank}_i} \qquad (12)$$

$$\mathrm{MR} = \frac{1}{|S|}\sum_{i=1}^{|S|}\mathrm{rank}_i \qquad (13)$$

$$\mathrm{H@}N = \frac{1}{|S|}\sum_{i=1}^{|S|}\mathbb{1}\left[\mathrm{rank}_i \le N\right] \qquad (14)$$
Given a triplet in the test set, we filter out candidate triplets that already exist in the training and validation sets by applying the filtered setting from [8]; we then evaluate the model's performance on the prediction results of the tail entities with the above metrics.
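To make the protocol concrete, the following minimal sketch (our own illustration, not the authors' released code) computes filtered tail ranks and then the metrics of Equations (12)–(14); the `scores` matrix and `known_tails` filter sets are hypothetical placeholders for the model's outputs and the train/validation filter.

```python
import numpy as np

def filtered_metrics(scores, targets, known_tails, ns=(10, 3, 1)):
    """Compute filtered MRR, MR, and H@N for tail prediction.

    scores      : (num_queries, num_entities) model scores per (h, r) query
    targets     : (num_queries,) index of the correct tail entity
    known_tails : per-query array of all tails known true for (h, r)
    """
    ranks = []
    for i, t in enumerate(targets):
        s = scores[i].astype(np.float64).copy()
        # Filtered setting [8]: mask every other known true tail of this query
        mask = np.setdiff1d(known_tails[i], [t])
        s[mask] = -np.inf
        # 1-based rank of the correct tail: count entities scoring higher
        ranks.append(int((s > s[t]).sum()) + 1)
    ranks = np.asarray(ranks, dtype=np.float64)
    metrics = {"MRR": float((1.0 / ranks).mean()),  # Equation (12)
               "MR": float(ranks.mean())}           # Equation (13)
    for n in ns:                                    # Equation (14)
        metrics[f"H@{n}"] = float((ranks <= n).mean())
    return metrics
```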
4.2. Datasets
To validate the performance and robustness of our improved model on different types of datasets, we select representative datasets based on graph size and the number of relation categories. FB15K [8] and WN18 [8] are two classical datasets on which many previous excellent works have been evaluated, and FB15k-237 [32] and WN18RR [1], their respective subsets, are more capable of validating model performance, so we select FB15k-237 and WN18RR as representatives of medium-sized datasets. YAGO3-10, a subset of YAGO [2], has a large number of entities and contains a large number of triplets, so we select it as a representative of large datasets. Alyawarra Kinship [33] can be considered representative of a micro dataset compared to the above three, so we select these four datasets for our experiments. The detailed dataset information is shown in Table 2.
The whole Freebase has 1.9 billion triplets, and FB15k-237 is its subset with nearly 15,000 entities and 237 relation types; as a subgraph of Freebase, it has the corresponding information of a father knowledge graph. WN18RR is a subgraph of WordNet with 40,943 entities and 11 types of relations. The size of FB15k-237 and WN18RR entities and the number of types of relations can well verify the expression ability of the model for explicit and latent knowledge. YAGO3-10 has 123,182 entities and 37 relations, and its size is very suitable for verifying the performance of the model under a large number of knowledge systems compared with the previous two datasets. Kinship has only 104 entities and 25 relations, which can verify the learning ability of our model for surface knowledge.
4.4. Experiment Results
We first compare the results of several benchmark models with our IntME after training on FB15k-237 and WN18RR. The benchmark models are chosen from distance-based models (i.e., TransE [8], KMAE [34]), bilinear-based models (i.e., DistMult [13], ComplEx [14]), and convolution-based networks (i.e., ConvE [15], InteractE [18], JointE [17], HypER [25]). It is clear from Table 4 that IntME outperforms all the benchmark models on FB15k-237 and WN18RR.
The performance of IntME on FB15k-237 far exceeds that of ConvE and HypER, with a considerable improvement over InteractE: about 2% on MRR, 0.8% on H@10, 2.1% on H@3, and 1.9% on H@1. Compared with JointE, a state-of-the-art convolution-based model, IntME also demonstrates excellent performance.
From the results on WN18RR, IntME also shows an overwhelming superiority. Unlike JointE, which falls behind DistMult on MR, IntME also performs well on MR. While outperforming ConvE and HypER, the addition of Path 2 allows IntME to outperform InteractE comprehensively, improving MRR from 0.469 to 0.475, MR from 5039 to 4055, and H@10 and H@3 by about 2.6% and 2.7%, respectively.
We re-run the open-source code of InteractE on FB15k-237 and WN18RR and compare it with our improved model IntME. The variation of the best MRR-based performance on the validation set is shown in Figure 5 and Figure 6. As can be clearly seen in Figure 5, IntME takes an overall lead over InteractE after about the 30th epoch, and the gap is very obvious. For the WN18RR results in Figure 6, InteractE performs checked feature reshaping four times, while IntME uses it only once. Moreover, InteractE's open-source code uses a convolution kernel size of 11 for WN18RR, while we use only 9, so we train more slowly over the first 80 epochs; after 80 epochs, however, both H@10 and H@3 surpass InteractE, and after 150 epochs MRR surpasses it as well.
Next, we utilize YAGO3-10 to validate the performance of our model for link prediction on large datasets, and we select several bilinear models, DistMult [13] and ComplEx [14], and several benchmark convolution-based models, ConvE [15], HypER [25], JointE [17], and InteractE [18], as our baselines. The experimental results are shown in Table 5.
On YAGO3-10, our model still performs well; it is optimal on two metrics, MRR and H@1, with a performance that is no weaker than JointE and stronger than the other baselines. Compared to InteractE, IntME increases MRR by 0.7%, H@10 by 0.2%, H@3 by 0.3%, and H@1 by 1.4%. Our model is slightly weaker than JointE on the H@10 and H@3 metrics, which we attribute to it not being quite strong enough in feasible interactions; if we continued to explore channel scaling reshaping, we believe it could reach or even surpass the level of JointE. Furthermore, benefitting from its powerful nonlinear fitting capability, the neural network is much more powerful than a linear model for latent knowledge processing.
Kinship, the smallest of the four datasets, is appropriate for evaluating the model's ability to extract explicit knowledge. We pick several models, including ComplEx [14], ConvE [15], RotatE [21], HAKE [35], and InteractE [18], as baselines. When applying Path 2, IntME exhibits highly extraordinary results in this test of explicit knowledge recognition capability; the results are displayed in Table 6. InteractE is the worst-performing model in the results. IntME outperforms all other baselines in MRR, H@10, and H@3. ComplEx scores highest on H@1, with IntME following. Our model improves on InteractE by 12.2% on MRR, 2.6% on H@10, 7.9% on H@3, and 20.3% on H@1, respectively.
In summary, IntME shows excellent performance on all four datasets. The analysis across datasets of various sizes supports our expectations of what IntME can achieve in link prediction. Meanwhile, it demonstrates that IntME can serve as a state-of-the-art convolution-based model on medium and large knowledge graphs.
4.5. Ablation Study
We design an ablation study to examine whether the individual modules of our model actually work.
To verify whether channel scaling reshaping is able to locate the interaction difference between reasonable interactions and the output features of Path 1 through channel element adjustment, we set up two sets of experiments on FB15k-237 and WN18RR for verification. To simplify the factorization of the 200-dimensional embedding, we only consider square feature maps, i.e., equal width and height in Equation (5). Thus, we set up three groups of square feature maps of different shapes. We implement them on FB15k-237 and WN18RR with the optimal hyper-parameters, and the results are shown in Table 7.
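To make the factorization concrete, the sketch below reshapes a 200-dimensional embedding into square feature maps for several channel counts. Since the paper's actual shape choices are those listed in Table 7, the factorizations here (2, 8, and 50 channels) are merely our own illustrative examples, and the helper name is hypothetical.

```python
import torch

def channel_scaling_reshape(emb: torch.Tensor, channels: int) -> torch.Tensor:
    """Reshape a flat embedding into square feature maps.

    emb      : (batch, d) flat embedding, here d = 200
    channels : number of channels C; d / C must be a perfect square
    returns  : (batch, C, s, s) with s = sqrt(d / C)
    """
    batch, d = emb.shape
    assert d % channels == 0, "channel count must divide the embedding dim"
    side = int((d // channels) ** 0.5)
    assert channels * side * side == d, "d / C must be a perfect square"
    return emb.view(batch, channels, side, side)

# Illustrative square factorizations of a 200-dim embedding (our choices):
# 2 x (10 x 10), 8 x (5 x 5), 50 x (2 x 2)
emb = torch.randn(32, 200)
for c in (2, 8, 50):
    print(c, channel_scaling_reshape(emb, c).shape)
```

Varying the channel count trades off the number of feature maps against the size of each map, which changes how many cross-element interactions a fixed-size convolution kernel can cover.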
The results on FB15k-237 indicate that the three shapes perform differently, with one of them clearly exhibiting the best performance, which indicates the benefit of improved interactions for IntME. In contrast, the best results on WN18RR come from a different shape than on FB15k-237, which suggests that increasing interactions should be carried out reasonably rather than arbitrarily.
As a result, channel scaling reshaping can produce plausible interaction improvements without requiring extra parameters and can quantify the difference through adjustment of the number of channel elements.
In Section 3, we employ an element reordering function, which, in brief, reorders both the row and column elements of the matrix according to different permutations. We anticipate that it can enhance the possibility of interaction between different elements. Therefore, we perform controlled trials on FB15k-237 and WN18RR to verify this anticipation. The structures of the models in the trials are all optimally chosen, and the results are shown in Table 8.
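A minimal sketch of this kind of element reordering is shown below, assuming it is implemented as independent permutations of the rows and columns of the reshaped feature matrix; the exact permutations used by IntME are defined in Section 3, so the random ones here are placeholders.

```python
import torch

def reorder_elements(x: torch.Tensor, row_perm: torch.Tensor,
                     col_perm: torch.Tensor) -> torch.Tensor:
    """Reorder rows and columns of a (batch, H, W) feature matrix.

    row_perm : permutation of range(H)
    col_perm : permutation of range(W)
    """
    return x[:, row_perm][:, :, col_perm]

# Placeholder permutations; IntME's actual permutations come from Section 3.
x = torch.arange(2 * 4 * 5, dtype=torch.float32).view(2, 4, 5)
row_perm = torch.randperm(4)
col_perm = torch.randperm(5)
y = reorder_elements(x, row_perm, col_perm)
print(y.shape)  # torch.Size([2, 4, 5]); same elements, shuffled positions
```

By scattering elements that were originally far apart into the same convolution window, such a permutation gives element pairs that would otherwise never meet a chance to interact.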
According to the results, when element reordering is not applied, the performance of IntME on FB15k-237 is essentially identical to that of InteractE, with little improvement, and the improvement is not significant on WN18RR either. When it is applied, the results of IntME improve highly significantly on FB15k-237 and slightly on WN18RR; in particular, FB15k-237's results improve from 0.354 to 0.360 on MRR, from 0.536 to 0.543 on H@10, from 0.388 to 0.395 on H@3, and from 0.263 to 0.267 on H@1.
In summary, element reordering can effectively enhance the interaction quality through disordered element interactions.
To verify the necessity of using Path 1 and Path 2 together in IntME, we implement an ablation study on FB15k-237 and WN18RR. The results are displayed in Table 9.
We can see from the results that it is necessary to employ Path 1 and Path 2 together. Without Path 1, Path 2 loses some of its capability to extract potential knowledge. The decrease is less significant on FB15k-237, but on WN18RR, MRR decreases from 0.475 to 0.449, H@10 from 0.545 to 0.532, H@3 from 0.495 to 0.478, and H@1 from 0.436 to 0.397. Without Path 2, IntME loses much of its capability to extract explicit knowledge.
In summary, IntME is only complete when both Path 1, which represents the initial value of the interactions, and Path 2, which carries explicit knowledge information and the difference between more reasonable interactions and the initial value, are used.
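As a rough illustration of why the two paths complement each other, the schematic below combines a convolutional scoring path with a shallow linear path by summing their outputs before scoring against all entities. This is our own reading of the two-path design under assumed dimensions, not the authors' exact architecture from Section 3.

```python
import torch
import torch.nn as nn

class TwoPathScorer(nn.Module):
    """Schematic two-path scorer (our sketch, not IntME's exact design).

    Path 1: convolution over reshaped embeddings (latent knowledge).
    Path 2: shallow linear interaction (explicit knowledge).
    """

    def __init__(self, d: int = 200, channels: int = 8, side: int = 5):
        super().__init__()
        assert channels * side * side == d
        self.channels, self.side = channels, side
        self.conv = nn.Conv2d(2 * channels, 32, kernel_size=3, padding=1)
        self.proj = nn.Linear(32 * side * side, d)  # Path 1 output
        self.lin = nn.Linear(2 * d, d)              # Path 2 output

    def forward(self, head, rel, entity_emb):
        b = head.size(0)
        maps = torch.cat([head, rel], dim=1).view(
            b, 2 * self.channels, self.side, self.side)
        p1 = self.proj(torch.relu(self.conv(maps)).view(b, -1))
        p2 = self.lin(torch.cat([head, rel], dim=1))
        # Sum the two paths, then score against all entity embeddings
        return (p1 + p2) @ entity_emb.t()
```

Dropping either term of the sum in `forward` mimics the ablations of Table 9: without the convolutional path the model loses latent-knowledge capacity, and without the linear path it loses explicit-knowledge capacity.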
4.6. Training Cost
Our model has the advantage of not only robust generalization ability but also low training cost. InteractE, a state-of-the-art convolution-based model using external common filters, and JointE, which utilizes internal alternate filters and is also a state-of-the-art model, are well suited as baselines for a training cost comparison. To ensure the consistency of the comparison results, we set a batch size of 256 and choose the 1-N strategy on FB15k-237, WN18RR, and YAGO3-10. There is no open-source code for JointE, so we rewrote its code in our environment according to [17].
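For reference, the 1-N strategy (as used in ConvE [15]) scores each (head, relation) query against all entities at once and trains with binary cross-entropy over the full entity set. A minimal sketch of one such training step, under our own naming and a generic `model(heads, rels)` interface, is shown below.

```python
import torch.nn.functional as F

def one_to_n_step(model, optimizer, heads, rels, tail_multihot):
    """One 1-N training step: each (h, r) query scores all entities at once.

    tail_multihot : (batch, num_entities), 1 at every true tail of (h, r)
    """
    optimizer.zero_grad()
    scores = model(heads, rels)  # (batch, num_entities) logits
    loss = F.binary_cross_entropy_with_logits(scores, tail_multihot)
    loss.backward()
    optimizer.step()
    return loss.item()
```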
As shown in Figure 7, our model offers only a minor improvement in training cost on FB15k-237 but brings a more powerful performance than InteractE, while it takes even less training cost than InteractE with optimal hyper-parameters on WN18RR and YAGO3-10. This benefits from the shallow linear operations in Path 2, the single feature map fusion in Path 1, and the complementary feature extraction capabilities of the two paths. InteractE employs depth-wise circular convolution, which incurs much of its cost in I/O operations rather than in computation, and the internal alternate convolution filters in JointE cost several or even tens of times the I/O overhead of the above models, so that its time cost per epoch is much higher. MFAE [36] also has the advantage of low training cost, but our model achieves better performance than it at a lower training cost; the comparison is displayed in Table 10, all of which again proves the superiority of our model structure.