In this section, we introduce the idea of trace equivalence for information networks. We first define the concepts of “states” and “actions” in information networks. Using these concepts, we formulate the trace semantics of information networks, inspired by the trace semantics of process theory in concurrent systems. We then present a computational method for determining whether two nodes are trace equivalent, which is used to derive trace-equivalent networks from the original networks. Finally, to verify that information is maintained in trace-equivalent networks, we conduct experiments on both the original networks and their trace-equivalent networks.
2.2.1. Trace Semantics of Information Networks
In a process graph, states are the nodes and transitions are the edges. Equivalences are then used to eliminate duplicate branches by removing equivalent states. Inspired by this, we apply equivalences to information networks by treating nodes as states and edges as transitions. Nevertheless, we continue to use the terms nodes and edges in the subsequent sections for the sake of generality.
To begin, we define the concept of a trace in information networks, following the corresponding concepts in concurrent systems [4]. Information networks are usually represented as graphs, where nodes are entities and edges are relations between entities. Regarding nodes in information networks as states in process theory, we define the trace in information networks as follows, according to the trace semantics of process theory.
Definition 1. Trace in information networks. Given an information network $G = (V, E)$, a path starting from node $v_0$ is represented as $v_0 e_1 v_1 e_2 \cdots e_n v_n$, where $v_k \in V$ and $e_k \in E$. A trace of this path is formally denoted as $e_1 e_2 \cdots e_n$ based on the path. The set of paths starting from node v is denoted as $Path(v)$, and the set of traces is denoted as $T(v)$, respectively.
Note that a path alternates between nodes and edges, whereas a trace consists of the consecutive edges only. Consequently, a trace captures the relationship between two nodes, since it is the edge sequence of a path from one node to another. For example, in a social network, if two people follow the same person, their relationship is common-follow; if they follow each other, their relationship is mutual-follow.
The trace semantics of processes is based on the idea that two processes are identified if they allow the same set of sequences of actions. Similarly, in information networks, two nodes are identified if they have the same set of relationships with other nodes. Moreover, two nodes can be identified only if they are of the same type. With Definition 1, we define trace equivalence as follows.
Definition 2. Trace equivalence in information networks. Given an information network $G = (V, E)$, two nodes v, w are trace equivalent if and only if they are of the same type and their sets of traces are equal. Trace equivalence is formally represented as and, for simplicity, notated as or .
For example, Figure 1 shows a small example abstracted from the DBLP bibliography network, in which there are three authors, two papers, and two venues. For each author, the path set is enumerated first, and the trace set is then obtained by keeping only the edge sequence of each path. Two of the authors are trace equivalent, since they are both of the type author and their sets of traces are equal. Moreover, the trace set of one of these authors contains the trace set of the remaining author, that is, the interactions of the former with other nodes contain the interactions of the latter. We call this situation approximate trace equivalence, because the two sets of traces have common items but are not equal.
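Definitions 1 and 2 can be illustrated with a minimal Python sketch. The network below is a hypothetical toy graph in the spirit of the DBLP example, not a reproduction of Figure 1. The names `EDGES`, `NODE_TYPE`, `traces`, and `trace_equivalent` are illustrative, and the sketch assumes that an edge is identified by its relation name and target node and that traces are enumerated only up to a fixed length.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy network (assumption): node -> outgoing (relation, target) edges.
EDGES: Dict[str, List[Tuple[str, str]]] = {
    "a1": [("writes", "p1"), ("writes", "p2")],
    "a2": [("writes", "p1"), ("writes", "p2")],
    "a3": [("writes", "p1")],
    "p1": [("published_in", "v1")],
    "p2": [("published_in", "v2")],
    "v1": [],
    "v2": [],
}
NODE_TYPE: Dict[str, str] = {
    "a1": "author", "a2": "author", "a3": "author",
    "p1": "paper", "p2": "paper",
    "v1": "venue", "v2": "venue",
}

Trace = Tuple[str, ...]


def traces(node: str, max_len: int) -> Set[Trace]:
    """Collect the edge sequences of all paths of length 1..max_len from node."""
    result: Set[Trace] = set()
    frontier: List[Tuple[str, Trace]] = [(node, ())]
    for _ in range(max_len):
        next_frontier: List[Tuple[str, Trace]] = []
        for current, trace in frontier:
            for relation, target in EDGES[current]:
                # Assumption: an edge label is "relation:target".
                extended = trace + (f"{relation}:{target}",)
                result.add(extended)
                next_frontier.append((target, extended))
        frontier = next_frontier
    return result


def trace_equivalent(v: str, w: str, max_len: int = 2) -> bool:
    """Same node type and equal trace sets (up to max_len edges)."""
    return NODE_TYPE[v] == NODE_TYPE[w] and traces(v, max_len) == traces(w, max_len)
```

In this toy graph, `a1` and `a2` are trace equivalent, while the trace set of `a3` is a proper subset of that of `a1`, which corresponds to the approximate case described above.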
2.2.2. Computational Method of Trace Equivalences
Building on the concept of trace equivalence between nodes, we now give a computational method for determining it. To this end, we represent an information network by its adjacency matrix. Given an information network $G = (V, E)$, the adjacency matrix of G is represented as $A$. $A_{ij} = 1$ if there is an edge between node i and node j, i.e., the two nodes are connected and related; otherwise, $A_{ij} = 0$ means that node i and node j share no edge and are not related. The i-th row of A can be used to depict all the paths whose start node is node i. The adjacency matrix A thus captures the relationships of every node with the other nodes in the network, and with it we can describe the path information, and further the trace information, of every node.
The adjacency matrix A not only describes the relationships of each node but also reflects all the paths of length 1 in the network. Moreover, the trace sets of these paths can also be described by A: $A_{ij} = 1$ means that node i has a trace indicating its relationship with node j. Furthermore, multiplying A by itself can be seen as concatenating two 1-length paths, where the end node of the former path is the start node of the latter path. In this way, we obtain the 2-length paths of the network. Repeating this procedure yields the n-length paths of the network. Formally, we use $A^n$ to represent all the n-length paths of the network, and hence their traces.
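As a small illustration of how matrix powers enumerate paths, the following sketch builds the adjacency matrix of a hypothetical toy network (three authors, two papers, two venues; an assumption, not the network of Figure 1) and squares it. The helper names `adjacency`, `mat_mul`, and `mat_pow` are illustrative, and edges are assumed to be undirected.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[int]]

# Hypothetical toy network (assumption).
NODES = ["a1", "a2", "a3", "p1", "p2", "v1", "v2"]
EDGE_LIST = [
    ("a1", "p1"), ("a1", "p2"),
    ("a2", "p1"), ("a2", "p2"),
    ("a3", "p1"),
    ("p1", "v1"), ("p2", "v2"),
]


def adjacency(nodes: Sequence[str], edges: Sequence[Tuple[str, str]]) -> Matrix:
    """Symmetric 0/1 adjacency matrix; A[i][j] = 1 iff nodes i and j share an edge."""
    index = {name: k for k, name in enumerate(nodes)}
    A = [[0] * len(nodes) for _ in nodes]
    for u, v in edges:
        A[index[u]][index[v]] = 1
        A[index[v]][index[u]] = 1
    return A


def mat_mul(X: Matrix, Y: Matrix) -> Matrix:
    """Plain matrix product of two square integer matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def mat_pow(A: Matrix, n: int) -> Matrix:
    """A^n for n >= 1; entry (i, j) counts the n-length paths from i to j."""
    result = A
    for _ in range(n - 1):
        result = mat_mul(result, A)
    return result


A = adjacency(NODES, EDGE_LIST)
A2 = mat_pow(A, 2)
# Row of a1 in A2: two 2-length paths to a1 and to a2, one to a3, v1, and v2.
print(A2[0])
```

Each row of `A2` can then serve as the vector of 2-length traces of the corresponding node; in this toy graph the rows of `a1` and `a2` coincide.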
Cosine Similarity [25]. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. In other words, it captures the orientation of the vectors rather than their magnitude. The smaller the angle between the vectors, the higher the cosine similarity. Formally, given two vectors A and B, the cosine similarity score of A and B can be calculated by
$$\cos(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{k} A_k B_k}{\sqrt{\sum_{k} A_k^2} \, \sqrt{\sum_{k} B_k^2}}.$$
According to this equation, cosine similarity first normalizes the lengths of the two vectors A and B, then measures their direction, and finally yields a similarity score for the two vectors. Cosine similarity is widely used in many research fields, such as natural language processing, information retrieval, and recommendation systems [25,26]. It equals 1 for identical vectors and 0 for orthogonal vectors, and it lies between 0 and 1 when all elements of the vectors are non-negative.
In this paper, since the adjacency matrix A of an information network encodes the traces derived from the 1-length paths of the network, we use A to show how to calculate the cosine similarity of traces. For two nodes i and j of an information network, their similarity score is calculated directly as the cosine similarity of the two row vectors $A_i$ and $A_j$, i.e., $\mathrm{sim}(i, j) = \cos(A_i, A_j) = \frac{A_i \cdot A_j}{\|A_i\| \, \|A_j\|}$.
With the similarity score $\mathrm{sim}(i, j)$ of the traces of two nodes, we say that two nodes i and j are trace equivalent if and only if $\mathrm{sim}(i, j) = 1$. Furthermore, since these vectors represent sets of traces, their elements are non-negative, and the similarity score of each node pair is thus greater than or equal to 0. Based on this property, we stipulate that two nodes i and j are approximate trace equivalent if and only if $0 < \mathrm{sim}(i, j) < 1$.
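The decision rule above can be sketched in a few lines of Python. The function names, the tolerance `tol` used to absorb floating-point rounding, and the convention of returning 0 for a zero vector are assumptions of this sketch; the check that both nodes are of the same type is omitted for brevity.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two trace vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: a node without traces is treated as similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def classify(a: Sequence[float], b: Sequence[float], tol: float = 1e-9) -> str:
    """Map a similarity score to the three cases discussed in the text."""
    score = cosine_similarity(a, b)
    if abs(score - 1.0) <= tol:
        return "trace equivalent"
    if score > tol:
        return "approximate trace equivalent"
    return "not equivalent"


# Hypothetical trace vectors over two papers (p1, p2).
print(classify([1, 1], [1, 1]))  # same direction
print(classify([1, 1], [1, 0]))  # overlapping traces
print(classify([1, 0], [0, 1]))  # no common trace
```

The tolerance matters in practice: for identical vectors the computed score can fall marginally below 1 because of rounding in the square roots.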
Continuing the example in Figure 1, we show the computation steps in Figure 2. Considering the three authors in this figure, the adjacency matrix A captures the relations held by the three authors and describes their 1-length paths and the traces corresponding to these paths. Since the longest meaningful path in this example has a length of two, we perform the matrix multiplication once to obtain $A^2$, which gives the 2-length paths of the information network. From these paths, we then obtain the vectors representing the traces of every node, so that the trace set of each of the three authors is represented by a vector. Applying the cosine similarity measure to each pair of the three authors yields the similarity score of each pair. The results show that two of the authors are trace equivalent, since their similarity score is 1, while each of the other two pairs is approximate trace equivalent, since their scores are less than 1.
For more complex networks, we repeat the matrix multiplication further to capture longer traces, which provides more detail when determining whether two nodes are trace equivalent. For a given information network, we iterate to obtain $A^n$, which describes the n-length traces of the network; based on the traces of every node, we then calculate the similarity score by applying the similarity function (cosine similarity in this paper).
2.2.3. Derive Trace-Equivalent Networks
With the computational method of trace equivalence, we determine whether two nodes of an information network are trace equivalent by computing the similarity score between them. This section investigates how to derive a trace-equivalent network from a given information network. Given an information network $G = (V, E)$, after performing the similarity measure on it, we acquire many tuples $(i, j)$ of trace-equivalent nodes, where i and j are nodes of the information network G. The set of all such tuples encapsulates the raw trace-equivalent node pairs of the network. In this set, the same node can appear in different tuples. The nodes in these different tuples are trace equivalent by the transitivity of trace equivalence. We therefore group these nodes by trace equivalence and merge all nodes in these tuples into the same set, which is named the trace-equivalence class.
The trace-equivalence classes of the information network G are formally represented as a set, each element of which is a group of nodes that are trace equivalent to one another, meaning that they have the same traces in the network. With the concept of trace-equivalence classes, we are able to simplify the representation of the information network G. By reducing nodes based on these classes, we can derive a trace-equivalent network. To accomplish this, we select a representative node from each class, which is used to replace all nodes in that class. The generated trace-equivalent network not only simplifies the original network but also maintains the essential information. It is a smaller and more manageable network that still reflects the trace relationships between nodes. On this basis, we refine the set of classes so that each trace-equivalence class N is paired with its representative node i. The refined form can be used directly in the process of node reduction by using i to replace N.
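The grouping of tuples into trace-equivalence classes can be sketched with a union-find structure, which merges tuples that share a node and thereby applies transitivity. This is only one possible realization; the function names and the rule for picking the representative (the lexicographically smallest node name) are assumptions of the sketch.

```python
from typing import Dict, Iterable, List, Set, Tuple


def equivalence_classes(pairs: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Merge tuples of trace-equivalent nodes into classes via union-find."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    classes: Dict[str, Set[str]] = {}
    for node in list(parent):
        classes.setdefault(find(node), set()).add(node)
    return list(classes.values())


def choose_representatives(classes: Iterable[Set[str]]) -> Dict[str, Set[str]]:
    """Pair each class with one representative node.

    Assumption: the smallest node name is chosen; any member would do.
    """
    return {min(members): set(members) for members in classes}


# Hypothetical tuples: a2 appears twice, so a1, a2, and a4 end up in one class.
pairs = [("a1", "a2"), ("a2", "a4"), ("b1", "b2")]
print(choose_representatives(equivalence_classes(pairs)))
```

Any deterministic choice of representative works for the reduction, since all members of a class carry the same traces.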
Figure 3 continues the example in Figure 2 and demonstrates the procedure of generating equivalence classes and choosing one representative node of each equivalence class on behalf of the whole class. Subsequently, we can derive the trace-equivalent network from the original information network. In this simple example, the set of trace-equivalent tuples contains only one tuple, which is therefore also the only equivalence class of this network. We choose one of its nodes to represent the whole equivalence class, and the result shows that one node and two edges are removed in the new trace-equivalent network.
We formally denote the derived network with a notation that represents the procedure of reducing nodes by trace equivalences. The derived network encapsulates the complete structural information of the original network: it describes the relationships of the network while representing each equivalence class by one specific node, which makes it possible to remove duplicate nodes and edges of the information network while preserving structural information. The representative nodes and edges of an equivalence class hold the information of the removed nodes and edges. To state the method of deriving trace-equivalent networks more clearly, we summarize the above steps in Algorithm 1.
Algorithm 1: Deriving a trace-equivalent network from a given network.
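A minimal sketch of the final reduction step is given below, assuming the trace-equivalence classes and their representatives have already been computed. It is not a transcription of Algorithm 1: the function name `derive_trace_equivalent_network`, the undirected edge representation, and the toy data are all assumptions.

```python
from typing import Dict, List, Sequence, Set, Tuple

Edge = Tuple[str, str]


def derive_trace_equivalent_network(
    nodes: Sequence[str],
    edges: Sequence[Edge],
    representatives: Dict[str, Set[str]],
) -> Tuple[List[str], List[Edge]]:
    """Replace every node by the representative of its class and drop duplicates.

    representatives maps a representative node to the members of its class;
    nodes that belong to no class are kept unchanged.
    """
    replace = {
        member: rep
        for rep, members in representatives.items()
        for member in members
    }
    new_nodes = sorted({replace.get(n, n) for n in nodes})
    new_edges = set()
    for u, v in edges:
        ru, rv = replace.get(u, u), replace.get(v, v)
        if ru != rv:  # skip self-loops created by merging adjacent nodes
            new_edges.add(tuple(sorted((ru, rv))))  # undirected: normalize order
    return new_nodes, sorted(new_edges)


# Hypothetical toy network in which a1 and a2 are trace equivalent.
nodes = ["a1", "a2", "a3", "p1", "p2", "v1", "v2"]
edges = [
    ("a1", "p1"), ("a1", "p2"),
    ("a2", "p1"), ("a2", "p2"),
    ("a3", "p1"),
    ("p1", "v1"), ("p2", "v2"),
]
new_nodes, new_edges = derive_trace_equivalent_network(
    nodes, edges, {"a1": {"a1", "a2"}}
)
print(len(nodes) - len(new_nodes), len(edges) - len(new_edges))
```

In this toy network, merging `a2` into `a1` removes one node and two edges, because both edges of `a2` duplicate edges of `a1`.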
Since the trace-equivalent network has fewer nodes and edges than the original, data mining algorithms can potentially be accelerated, because less information leads to fewer computations when executing these algorithms. A remaining question is whether these algorithms, when run on the reduced network, maintain an accuracy that is consistent with, or close to, their accuracy on the original network. In this paper, we choose the Pathsim algorithm to verify the maintainability of data mining algorithms on both G and its trace-equivalent network. We prove that the results of the Pathsim algorithm on the two networks are consistent.
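Definition 3. Pathsim [27]. Given an information network G, Pathsim measures the correlation of two entities x and y under a meta-path $\mathcal{P}$. The core component of the Pathsim algorithm is as follows.
$$s(x, y) = \frac{2 \times |\{p_{x \rightsquigarrow y} : p_{x \rightsquigarrow y} \in \mathcal{P}\}|}{|\{p_{x \rightsquigarrow x} : p_{x \rightsquigarrow x} \in \mathcal{P}\}| + |\{p_{y \rightsquigarrow y} : p_{y \rightsquigarrow y} \in \mathcal{P}\}|}$$
where $p_{x \rightsquigarrow y}$ is a path instance between x and y, $p_{x \rightsquigarrow x}$ is a path instance between x and x, and $p_{y \rightsquigarrow y}$ is a path instance between y and y. For elaboration, a meta-path is defined on the network schema and denoted as $A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} A_{l+1}$, so that it describes the relations between node types.
Theorem 1. The results of Pathsim are consistent on the original network G and its trace-equivalent network, i.e., the results of Pathsim are maintainable.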
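Proof. Given an information network $G = (V, E)$, suppose that node $v_i$ and node $v_j$ are trace equivalent, and let $v_k$ be any node reachable from $v_i$ and $v_j$. Because node $v_i$ and node $v_j$ are trace equivalent, their sets of traces are equal; then, for node $v_k$, the traces from node $v_i$ to node $v_k$ and the traces from node $v_j$ to node $v_k$ are equal.
Based on Definition 1, the paths from node $v_i$ to node $v_k$ and the paths from node $v_j$ to node $v_k$ therefore correspond one to one.
Then, $|\{p_{v_i \rightsquigarrow v_k}\}| = |\{p_{v_j \rightsquigarrow v_k}\}|$.
Similarly, we can obtain $|\{p_{v_i \rightsquigarrow v_i}\}| = |\{p_{v_j \rightsquigarrow v_j}\}|$.
So, writing $s(\cdot, \cdot)$ for the Pathsim score, $s(v_i, v_k) = s(v_j, v_k)$.
In the trace-equivalent network, $v_i$ and $v_j$ are in the same equivalence class. We choose $v_i$ to represent this equivalence class. With $s(v_i, v_k) = s(v_j, v_k)$, the results of Pathsim are consistent.
Therefore, the results of Pathsim are maintainable. □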
Trace equivalence indicates that the similarity score of two nodes is equal to 1. Meanwhile, approximate trace equivalence indicates a similarity score strictly between 0 and 1. The smaller the similarity score, the greater the difference between the Pathsim results of the two nodes. In the next section, we verify this result through experiments.
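The consistency claim can be checked numerically on a small case. The sketch below computes Pathsim for a hypothetical author-paper-author meta-path through the commuting matrix $M = W W^{\top}$, where W is an assumed author-by-paper incidence matrix; the names `commuting_matrix` and `pathsim` and the data are illustrative only.

```python
from typing import List

Matrix = List[List[int]]


def commuting_matrix(W: Matrix) -> Matrix:
    """M = W W^T: M[x][y] counts author-paper-author path instances from x to y."""
    return [[sum(a * b for a, b in zip(row_x, row_y)) for row_y in W] for row_x in W]


def pathsim(M: Matrix, x: int, y: int) -> float:
    """Pathsim score 2 * M[x][y] / (M[x][x] + M[y][y]); 0.0 if both nodes are isolated."""
    denominator = M[x][x] + M[y][y]
    return 2 * M[x][y] / denominator if denominator else 0.0


# Hypothetical incidence matrix: rows are authors a1, a2, a3; columns are papers p1, p2.
# a1 and a2 have identical rows, i.e., they are trace equivalent in this toy example.
W = [
    [1, 1],
    [1, 1],
    [1, 0],
]
M = commuting_matrix(W)

print(pathsim(M, 0, 1))                    # a1 vs a2
print(pathsim(M, 0, 2), pathsim(M, 1, 2))  # a1 vs a3 and a2 vs a3
```

Because the rows of `a1` and `a2` are identical, their Pathsim scores against `a3` coincide, so replacing `a2` by `a1` does not change the score that `a3` obtains against the class.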