In this section, we provide a detailed description of the RiQ-KGC model, which is shown in
Figure 1. RiQ-KGC is composed of three functional modules, namely relation instantiation, quaternion space rotation, and geometric decoding. The relation instantiation module localizes relations and produces entity-specific relation representations. The quaternion space rotation module rotates the entity in quaternion space through the instantiated relation to obtain geometric information. The geometric decoding module analyzes the information obtained from the above two modules; its output is compared for similarity with each entity in the quaternion space, and the entity with the highest score is taken as the target of the triple. Starting with the problem definition of link prediction, we describe the working mechanism of each module of RiQ-KGC. For ease of comprehension, the primary notations used in this paper are enumerated in
Table 1.
3.1. Quaternion for Link Prediction
The KG consists of a set of entities $E$ and a set of relations $R$, which are stored as a series of triples $(h, r, t)$, where $h \in E$ and $t \in E$ are the head and tail entities of the triple, respectively, and $r \in R$ is the relation of the triple. Each triple represents a fact in the KG, indicating that the head entity is connected to the tail entity by the relation, which can be denoted as $h \xrightarrow{r} t$. To better express the semantic meaning of entities and relations, triples in the graph embedding problem are represented as a combination of three $d$-dimensional vectors $(\mathbf{h}, \mathbf{r}, \mathbf{t})$, where $\mathbf{h}$ and $\mathbf{t}$ denote the embeddings of the head and tail entities, respectively, and $\mathbf{r}$ denotes the embedding of the relation. The link prediction task aims to discover the missing links between entities in the KG and is therefore considered to play an important role in KG completion.
Each triple is assigned a score, which represents the likelihood that the triple is a true fact. Higher scores indicate that the triple is closer to a true fact. The essence of link prediction is to construct a scoring function that accurately expresses the degree of truth. Given an incomplete triple $(h, r, ?)$ or $(?, r, t)$, we need to find the correct head or tail entity. In this paper, we calculate the score as
$$s = \mathrm{score}(e_s, e_r, e_t),$$
where the known entity in the triple is referred to as the source entity $e_s$ and the entity to be predicted is called the target entity $e_t$. Following Lacroix [45], we unify the two prediction tasks as $(e_s, e_r, ?)$, where $e_s$ can be either the head entity or the tail entity of the triple, and two different sets of relation embeddings $e_r$ are used to distinguish between the head entity prediction task and the tail entity prediction task. We initialize $|E|$ vectors to represent entities and $2|R|$ vectors to represent relations for a dataset containing $|E|$ entities and $|R|$ relations.
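As a minimal sketch of this reciprocal-relation setup (PyTorch; the class and argument names such as `EmbeddingTable` and `predict_head` are illustrative, not the paper's), head prediction can be rewritten as tail prediction with a shifted relation index:

```python
import torch
import torch.nn as nn

class EmbeddingTable(nn.Module):
    """|E| entity vectors and 2|R| relation vectors (one set per direction)."""
    def __init__(self, num_entities: int, num_relations: int, dim: int):
        super().__init__()
        self.num_relations = num_relations
        self.entities = nn.Embedding(num_entities, dim)
        self.relations = nn.Embedding(2 * num_relations, dim)

    def lookup(self, e_s: torch.Tensor, r: torch.Tensor, predict_head: torch.Tensor):
        # Tail prediction (e_s, r, ?) uses relation index r; head prediction
        # (?, r, e_s) is rewritten as (e_s, r', ?) with r' = r + |R|.
        r_idx = r + predict_head.long() * self.num_relations
        return self.entities(e_s), self.relations(r_idx)
```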
In RiQ-KGC, entities are distributed in quaternion space, and relations are represented as transformation patterns of the entities. The quaternion is a hyper-complex number system introduced by Hamilton. A quaternion $Q$ consists of a scalar $a$ and a vector $(b, c, d)$, which can be represented as $Q = a + b\mathbf{i} + c\mathbf{j} + d\mathbf{k}$, where $\mathbf{i}$, $\mathbf{j}$, and $\mathbf{k}$ denote the unit vectors on the x-, y-, and z-axes, respectively. They satisfy $\mathbf{i}^2 = \mathbf{j}^2 = \mathbf{k}^2 = \mathbf{i}\mathbf{j}\mathbf{k} = -1$, so that a quaternion can be represented as a 4-tuple $(a, b, c, d)$. In RiQ-KGC, both the entity embedding and the relation embedding consist of four such parts. The dimension of each part is $d/4$, so the embeddings of entities and relations are still represented as $d$-dimensional vectors.
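As a small sketch of this layout (PyTorch; `as_quaternion` is an illustrative helper, not from the paper), a $d$-dimensional embedding is viewed as four $d/4$-dimensional quaternion parts:

```python
import torch

def as_quaternion(x: torch.Tensor):
    # x: (..., d) with d divisible by 4 -> scalar part a and vector parts b, c, d
    a, b, c, d = torch.chunk(x, 4, dim=-1)
    return a, b, c, d

emb = torch.randn(2, 128)        # a batch of two 128-dimensional embeddings
a, b, c, d = as_quaternion(emb)  # four parts of dimension 32 each
```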
Quaternions can express ideal parametric smooth rotations that are robust to noise and perturbations. Additionally, a quaternion is able to represent simultaneous rotations about three axes, unlike Euler rotation, which rotates in a fixed order and can lead to gimbal lock. Zhang [23] demonstrated the advantages of using quaternions in link prediction, as they possess properties similar to complex rotations, including the ability to model symmetry, anti-symmetry, and inversion. Furthermore, compared to complex rotation, which only allows a single plane of rotation, a quaternion has two planes and thus provides more degrees of freedom. In the traditional quaternion embedding method, the relation is represented as a quaternion that describes a rotation pattern. Meanwhile, a source entity $Q_s = (a_s, b_s, c_s, d_s)$ can be rotated by the relation $Q_r = (a_r, b_r, c_r, d_r)$ using the Hamilton product (quaternion multiplication) as
$$Q_s \otimes Q_r = (a_s a_r - b_s b_r - c_s c_r - d_s d_r) + (a_s b_r + b_s a_r + c_s d_r - d_s c_r)\,\mathbf{i} + (a_s c_r - b_s d_r + c_s a_r + d_s b_r)\,\mathbf{j} + (a_s d_r + b_s c_r - c_s b_r + d_s a_r)\,\mathbf{k}.$$
This method enables interactions between the different parts of the quaternion, leading to richer expressive power.
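As a concrete sketch of this rotation (assuming embeddings are stored as $d$-dimensional tensors whose four $d/4$ chunks are the quaternion parts), the Hamilton product above can be implemented as:

```python
import torch

def hamilton_product(q: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    # q, r: (..., d) embeddings viewed as quaternions with parts (a, b, c, d).
    a1, b1, c1, d1 = torch.chunk(q, 4, dim=-1)
    a2, b2, c2, d2 = torch.chunk(r, 4, dim=-1)
    return torch.cat([
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # scalar part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i part
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j part
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k part
    ], dim=-1)
```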
For link prediction, the similarity between the vector $Q_s \otimes Q_r$ obtained after rotating $Q_s$ and the target entity $Q_t$ is commonly used as a scoring function to evaluate the validity of a triple. The similarity between the two can be measured by computing the inner product of the corresponding quaternions as
$$s = \langle Q_s \otimes Q_r,\; Q_t \rangle.$$
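A corresponding scoring sketch (reusing `hamilton_product` from the block above; note that QuatE additionally normalizes the relation quaternion, which we omit here):

```python
def quaternion_score(q_s, q_r, q_t):
    # Rotate the source entity, then sum the element-wise products over all
    # four quaternion parts, i.e., the inner product <q_s (x) q_r, q_t>.
    return (hamilton_product(q_s, q_r) * q_t).sum(dim=-1)
```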
The traditional method of quaternion embedding imposes a strict constraint on the embedding of entities. Specifically, it expects $Q_s \otimes Q_r$ to be equal to the $Q_t$ in the triple, which maximizes the objective function $\langle Q_s \otimes Q_r, Q_t \rangle$. Nevertheless, in reality, each entity might be rotated onto by numerous other entities through different relations, and the constraints imposed by numerous neighboring entities lead to a Pareto-optimal scenario for the final embedding position of the entity. In such a scenario, the objective is to make the score of each true triple containing the entity as high as possible, rather than to make every one of them achieve the highest possible score. This constraint can significantly degrade model performance, particularly when the entities are densely distributed.
Figure 2 shows how RiQ-KGC leverages a large number of parameters to establish a mapping between $Q_s \otimes Q_r$ and $Q_t$, enabling the deep learning network to capitalize on its powerful fitting ability while still retaining the interpretability advantages of the geometric approach. Additionally, we utilize the Transformer to incorporate contextual information, enabling the model to capture an entity's information more accurately as knowledge accumulates.
3.2. Relation Instantiation
The expressiveness of relations is a critical factor in determining the effectiveness of graph-embedding models. As relations represent patterns of transformation between entities, they play a crucial role in the representation of geometric information. In most KGs, the number of entities is significantly larger than the number of relations. For example, the WN18RR [36] dataset contains 40,943 entities and 93,003 triples, but only 11 relations. As a result, each entity has only 11 available geometric transformations, even though its number of neighbors can be much higher. Instantiating relations to their corresponding entities can alleviate the m-to-1 problem and the problem of complex relations between entities. In the following, we demonstrate the effectiveness of the relation instantiation method in two different cases.
Case 1. Suppose that Hank, Ross, and Bruce are John's father, mother, and uncle, respectively, and that all three of them are doctors at the Johns Hopkins Hospital (JHP). We can represent the relations between them using a relational rotation graph, as shown in Figure 3a. The relations for father, mother, and uncle are different, and therefore their corresponding embedding positions are also different. However, the shared relation connecting each of them to JHP forces JHP to occupy three potential embedding positions simultaneously, which is impossible. Figure 3b shows that, after relation instantiation, this relation can represent a different rotation for each entity, so the embedding position of JHP can be accurately represented. This illustrates how relation instantiation alleviates the m-to-1 problem. The process of relation instantiation can be seen as a form of "reverse clustering", which prevents similar entities from being embedded in close proximity due to m-to-1 relations. This approach enhances the model's ability to distinguish and accurately classify the hardest negative samples.
Case 2. As shown in Figure 4a, when John grows up and becomes a doctor at JHP, Hank is not only his father but also his colleague. However, since father and colleague are represented by two different relations that correspond to different rotations, an error occurs: Hank is assigned two possible embedding positions simultaneously. One potential way to address this issue is to make the two relations represent the same rotation, which would uniquely position Hank, albeit in a way that is not equivalent to the physical world. Figure 4b illustrates how, after relation instantiation, the two relations can express the rotation from John to Hank using different rotations without affecting other entities. This enables us to uniquely position Hank and resolves the issue of two simultaneous embedding positions.
The contextual information of an entity can be reflected by considering its neighboring entities in the KG, also known as “neighbors”. We designed a hierarchical Transformer structure to enable relations to fully integrate the environmental information contained within a given entity’s neighbors. Our goal was to ensure that relations can have different rotations in different environmental contexts.
Figure 5 shows the specific construction of the relation instantiation module. In relation instantiation, we select a significant number of second-hop, third-hop, and fourth-hop neighbors, represented by $N_2$, $N_3$, and $N_4$, to provide contextual information for the source entity. The $n$th-hop neighbors refer to the nodes that the source entity can reach through $n$ transformations of relations; each transformation of the source entity to the tail entity through a relation in a triple counts as one hop. Neighbors with smaller hop counts, such as 1-hop and 2-hop neighbors, have a closer relationship with the source entity and can directly reflect relevant information about it. On the other hand, neighbors with larger hop counts, such as 3-hop and 4-hop neighbors, can provide additional information by roughly reflecting the scene in which the source entity is located. It is important to note that we do not use the relations of these neighboring triples: the size of the relation set $R$ is significantly smaller than the number of neighboring triples we select, so these relations do not contribute to the feature representation of the source entity.
Then, we obtain three context vectors by averaging the representations of $N_2$, $N_3$, and $N_4$. These context vectors are then input into the multi-head Transformer $T_1$ along with the source entity $e_s$, the relation $e_r$, and the two flag vectors $g_1$ and $g_2$. The two relation components obtained via $T_1$ are denoted by $r_1$ and $r_2$, which represent the instantiation results of the relation from different perspectives. Using multiple relation components is intended to convey more contextual information and to help balance the weight of information across multi-hop neighbors during decoding.
The multi-head Transformer consists of a multi-head attention sub-layer connected in series with a feed-forward sub-layer; the two sub-layers are joined by a connection layer that applies residual connections, dropout, and normalization operations.
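A minimal sketch of this module in PyTorch, where `nn.TransformerEncoderLayer` already provides exactly these sub-layers with residual connections, dropout, and normalization (the layer count, head count, and the output positions from which $r_1$ and $r_2$ are read are our assumptions, not the paper's reported settings):

```python
import torch
import torch.nn as nn

d = 128  # embedding dimension (illustrative)
T1 = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=8, dim_feedforward=4 * d,
                               dropout=0.1, batch_first=True),
    num_layers=2)

def instantiate_relation(e_s, e_r, g1, g2, hop2, hop3, hop4):
    # hopN: (batch, num_sampled, d) embeddings of sampled n-hop neighbors.
    c2, c3, c4 = hop2.mean(dim=1), hop3.mean(dim=1), hop4.mean(dim=1)
    seq = torch.stack([e_s, e_r, g1, g2, c2, c3, c4], dim=1)  # (batch, 7, d)
    out = T1(seq)
    r1, r2 = out[:, 2], out[:, 3]  # read the two flag positions as r1 and r2
    return r1, r2
```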
3.3. Quaternion Space Rotation
Figure 6 shows the specific construction of the quaternion space rotation module. We combine the outputs $r_1$ and $r_2$ with $e_r$ to obtain a relation instantiation matrix. We then perform a Hamilton product of $e_s$ with each of the three vectors in this matrix, rotating the source entity from three different angles and producing a quaternion matrix $M_Q$ as
$$M_Q = \left[\, e_s \otimes e_r;\;\; e_s \otimes r_1;\;\; e_s \otimes r_2 \,\right].$$
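A sketch of this step, reusing `hamilton_product` from the code block in Section 3.1:

```python
import torch

def rotate_three_ways(e_s, e_r, r1, r2):
    # Hamilton product of the source entity with each row of the relation
    # instantiation matrix, stacked into M_Q of shape (batch, 3, d).
    return torch.stack([hamilton_product(e_s, e_r),
                        hamilton_product(e_s, r1),
                        hamilton_product(e_s, r2)], dim=1)
```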
$M_Q$ represents the multi-hop neighbor information of the source entity, capturing a coarse-grained contextual situation. Meanwhile, fine-grained neighbor information is derived from the first-hop neighbors of the source entity. To integrate and unify these different levels of contextual information, we link the first-hop neighbor information with $M_Q$. The first-hop neighbors serve as intermediate information that explicitly establishes the association between the source entity and its context, thus creating hierarchical contextual information. Inspired by Chen [22], we utilize a multi-head Transformer $T_2$ to integrate the first-hop neighbor entities $e_i$ and their relations $r_i$. The source entity and relation, $e_s$ and $e_r$, respectively, are also used as input to ensure that the information of the triple can be fully processed. Thus, each input group consists of the flag vector $g$, together with $e_s$ or $e_i$ and $e_r$ or $r_i$, which are sequentially fed into $T_2$. The first-hop neighbor information is combined through $T_2$, and the corresponding outputs $v_i$ are concatenated and stored as the neighbor matrix $M_N$ as
$$M_N = \left[\, v_0;\; v_1;\; \ldots;\; v_f \,\right], \qquad v_0 = T_2(g, e_s, e_r), \quad v_i = T_2(g, e_i, r_i),$$
where $f$ is the number of first-hop neighbors. To ensure that the source entity information is not overwhelmed by excessive neighbor information, $e_s$ in the input of $T_2$ is replaced or masked by a random entity with a certain probability, encouraging the model to recover the source entity later.
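A sketch of this corruption step (the probabilities and the use of a shared learned mask vector are our assumptions; the paper's exact scheme may differ):

```python
import torch

def corrupt_source(e_s, entity_emb, p_replace=0.1, p_mask=0.1, mask_vec=None):
    # With probability p_replace, swap e_s for a random entity embedding;
    # with probability p_mask, overwrite it with a learned mask vector.
    u = torch.rand(e_s.size(0), device=e_s.device)
    rand_ids = torch.randint(0, entity_emb.num_embeddings, (e_s.size(0),),
                             device=e_s.device)
    out = torch.where((u < p_replace).unsqueeze(-1), entity_emb(rand_ids), e_s)
    if mask_vec is not None:
        hit = ((u >= p_replace) & (u < p_replace + p_mask)).unsqueeze(-1)
        out = torch.where(hit, mask_vec.expand_as(out), out)
    return out
```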
$M_N$ conveys the source entity information in a fine-grained environment, while $M_Q$ conveys the target entity information produced by rotation in a coarse-grained environment. We combine $M_N$ with $M_Q$ to produce a mixed matrix $M_X$ as
$$M_X = \left[\, M_Q;\; M_N \,\right].$$
Consequently, $M_X$ covers the complete process of quaternion rotation under the influence of contextual relations. We insert a flag vector as the first row of $M_X$, which is subsequently employed in geometric decoding to resolve the representation of the target entity.
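In code, the mixing step is a simple row-wise concatenation (the decoding flag `g_dec` is the prepended flag vector):

```python
import torch

def mix(g_dec, M_Q, M_N):
    # g_dec: (batch, d) flag vector; M_Q: (batch, 3, d); M_N: (batch, f + 1, d).
    return torch.cat([g_dec.unsqueeze(1), M_Q, M_N], dim=1)  # (batch, f + 5, d)
```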
3.4. Geometric Decoding
As shown in Figure 7, $M_X$ is processed by a multi-head Transformer $T_3$ for decoding. The first two rows of the resulting output, $o_1$ and $o_2$, are used to resolve the representation of the target entity and the degree of source entity recovery, respectively. To provide the model with additional fitting space, $o_1$ is finally input into a linear feed-forward network as
$$\tilde{e}_t = W o_1 + b,$$
where $W$ and $b$ correspond to the weight and bias, and the output $\tilde{e}_t$ represents the target entity. Subsequently, we compute the similarity with each entity $e_i$ through the dot product as
$$s_i = \tilde{e}_t \cdot e_i.$$
The similarity serves as the score of the triple, which can be perceived as a confidence level that the tail entity of the triple corresponds to $e_i$. The higher the score of the triple, the more likely it is that $e_i$ is the target entity and should therefore be predicted.
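A sketch of the decoding step ($T_3$ is again a multi-head Transformer encoder as above; `linear` is an `nn.Linear` layer implementing $W o_1 + b$, and `all_entities` is the full entity embedding matrix):

```python
import torch

def decode_and_score(T3, linear, M_X, all_entities):
    # M_X: (batch, L, d); all_entities: (|E|, d).
    out = T3(M_X)
    o1, o2 = out[:, 0], out[:, 1]        # first two rows of the decoder output
    e_t_hat = linear(o1)                 # \tilde{e}_t = W o1 + b
    scores = e_t_hat @ all_entities.t()  # dot-product similarity with every entity
    return scores, o2
```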
During the training process, the scores are converted by the softmax activation function into a probability distribution over the correct entity. This probability distribution is used to calculate the cross-entropy loss as
$$\mathcal{L}_1 = -\sum_{i} y_i \log p_i,$$
where $p_i = \exp(s_i) / \sum_{j} \exp(s_j)$ and $y_i$ is 1 if $i$ is the true target entity and 0 otherwise.
To avoid overemphasizing neighbor information and disregarding source entity information during decoding, we use $o_2$ to recover the source entity. This is achieved by calculating the similarity of $o_2$ with each entity $e_i$ through the dot product. The degree to which $o_2$ recovers the source entity is measured using the cross-entropy loss as
$$\mathcal{L}_2 = -\sum_{i} y'_i \log p'_i,$$
where $p'_i = \exp(o_2 \cdot e_i) / \sum_{j} \exp(o_2 \cdot e_j)$ and $y'_i$ is 1 if $i$ is the source entity and 0 otherwise.
We obtain the final loss value by adding the two loss values with weights as
$$\mathcal{L} = \lambda_1 \mathcal{L}_1 + \lambda_2 \mathcal{L}_2,$$
where $\lambda_1$ and $\lambda_2$ are weighting coefficients. Additionally, we apply entity regularization to deter over-fitting and to generalize the embedding locations of the entities.
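Putting the two objectives together: softmax plus cross-entropy over raw dot-product scores reduces to `F.cross_entropy`, so both losses can be computed directly from the decoder outputs (the weights `lam1`/`lam2` are hyperparameters we assume rather than the paper's reported values):

```python
import torch.nn.functional as F

def total_loss(scores, target_ids, o2, all_entities, source_ids,
               lam1=1.0, lam2=0.5):
    loss_target = F.cross_entropy(scores, target_ids)         # L1: target entity
    source_scores = o2 @ all_entities.t()                     # recovery scores
    loss_source = F.cross_entropy(source_scores, source_ids)  # L2: source recovery
    return lam1 * loss_target + lam2 * loss_source
```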