3.2. Non-Euclidean Autoencoder
We first consider how the structural information and attribute information can be fused so that the result is more effective and more consistent with the real situation expressed by the data. Let us first take a close look at the three graphs in Figure 1, where the blue circles and the yellow squares at the nodes represent two different attributes in the attributed network. The network shown in Figure 1a consists of 7 nodes, and the weights of all edges are 1. Figure 1b shows the same network as Figure 1a, except that the weights of the edges are replaced by the Ricci curvatures of the edges. Figure 1c represents another network, in which all nodes have the same attributes. DANE [27] inputs topological data and attribute data into two autoencoders, each with its own loss function for training, and then uses the hidden-layer representations of the autoencoders to establish the correlation between the structural information and attribute information for fusion. However, we can see from Figure 1a that, if the attribute information is expressed as a network structure, it does not form a connected graph, which is very different from the network topology. Similarly, in Figure 1c, a network constructed from the attributes is a complete graph, which is also very different from the network topology. Hence, the idea of separately learning representations of the structure and the attributes and then merging them does not work very well. As with DANE [27] and ANRL [20], our model uses an autoencoder module. However, we do not process the two heterogeneous information types separately as DANE does, nor do we fuse the structural and attribute information as ANRL does by using an autoencoder to reconstruct the target neighbors. Instead, we adopt the attribute-aggregation approach of RCGCN [30]; that is to say, we integrate aggregation layers into the autoencoder for the fusion of structural information and attribute information, rather than feeding the two types of heterogeneous data into the autoencoder and hoping to achieve fusion with the autoencoder alone. According to the analysis conducted in RCGCN, the Ricci curvature is better suited than the original adjacency matrix for the aggregation of node attributes. The reason is that, as shown in Figure 1b, the Ricci curvature can better distinguish the strength of the connection relationships in the network structure and can reveal the meso-scale structure of the network, namely, its communities. For example, the curvature of edge (3,4) in Figure 1a is −0.667, which illustrates a characteristic property of the Ricci curvature: the curvatures of edges between communities are negative. Of course, RCGCN, like GCN, uses semi-supervised learning, whereas here we conduct unsupervised learning.
Since hyperbolic geometry has advantages in representing hierarchical data, and the Ricci curvature can reflect the geometric characteristics of the underlying space where the network is located, we consider adding these elements into the autoencoder module of our model. First, we introduce the sigmoid function commonly used in neural networks, namely,

$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$
Let $A$ represent the adjacency matrix of the network, and let $C$ represent the curvature matrix, whose elements are the values of the Ricci curvature of the edges processed by the sigmoid function. That is, $C = (C_{ij})$, where $C_{ij}$ is equal to $\sigma(c_{ij})$ if $(v_i, v_j)$ is an edge (and $0$ otherwise), and $c_{ij}$ is the Ricci curvature of the edge $(v_i, v_j)$. Let us define the diagonal matrix $S$ as being equal to $\operatorname{diag}\big(\sum_j C_{1j}, \ldots, \sum_j C_{nj}\big)$, where $v_1, \ldots, v_n$ are all nodes in the network. Consequently, we can define the matrix $F$ for fusion as follows:

$$F = C + S. \tag{19}$$
To avoid the influence of numerical scale on the algorithm, $F$ is normalized. Meanwhile, for the convenience of expression, the normalized aggregation matrix is still denoted as $F$.
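To make the construction concrete, the following is a minimal NumPy sketch of the aggregation matrix, written under our reading of Equation (19), i.e., $F = C + S$ with $S$ the diagonal of curvature-weighted degrees, followed by row normalization; the function and variable names are illustrative, not the paper's code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aggregation_matrix(adj, ricci):
    """Build the fusion matrix F in the spirit of Equation (19).

    adj   : (n, n) 0/1 adjacency matrix of the network.
    ricci : (n, n) matrix with ricci[i, j] = Ricci curvature of edge (v_i, v_j)
            (only meaningful where adj[i, j] == 1)."""
    # C_ij = sigmoid(c_ij) on edges, 0 elsewhere.
    C = np.where(adj > 0, sigmoid(ricci), 0.0)
    # S is diagonal, one entry per node: the curvature-weighted degree.
    S = np.diag(C.sum(axis=1))
    F = C + S
    # Row-normalize so that numerical scale does not affect the algorithm.
    row_sums = F.sum(axis=1, keepdims=True)
    return F / np.clip(row_sums, 1e-12, None)
```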
The autoencoder module we adopted is shown in Figure 2. This autoencoder has a total of 5 layers from the input layer to the output layer, and the representations of the nodes in the middle hidden layer are used in the skip-gram module. The attribute data are fed into the autoencoder module in two ways. One is the randomly selected attribute data inputted in batches, denoted as $x_i$. The other is the selected partial attribute data related to the skip-gram module, denoted as $x_v$, which will be explained in detail in the introduction to the skip-gram module in Section 3.3. It is worth noting that, unlike in traditional autoencoders, the representations of layer $l$ in the autoencoder of our model are not directly multiplied by the weight matrix to obtain the representations of layer $l+1$. As shown in Figure 3, an aggregation layer based on the Ricci curvature is inserted between two adjacent layers in our autoencoder module, and the aggregation matrix is given in Equation (19). After the aggregation layer, there is a hyperbolic embedding layer, which uses a hyperboloid manifold. The operations on the hyperbolic embedding layer involve the tangent space of the hyperboloid manifold, the most frequently used ones being the exponential map and the logarithmic map.
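To make these two maps concrete, the following is a minimal NumPy sketch of the exponential and logarithmic maps at the origin of the hyperboloid model, written for unit negative curvature ($K = -1$; general $K$ only rescales these formulas). The function names and the batched $(n, d+1)$ row layout are our own illustrative choices.

```python
import numpy as np

def minkowski_dot(u, v):
    # Lorentzian inner product <u, v>_L = -u_0 v_0 + sum_{k>=1} u_k v_k,
    # applied row-wise to (n, d+1) arrays.
    return -u[..., 0] * v[..., 0] + (u[..., 1:] * v[..., 1:]).sum(-1)

def exp_map_origin(v, eps=1e-9):
    """Exponential map at the origin o = (1, 0, ..., 0) of the unit hyperboloid.

    v : (n, d+1) tangent vectors at o (their 0-th coordinate is 0)."""
    norm = np.sqrt(np.clip(minkowski_dot(v, v), eps, None))
    o = np.zeros_like(v)
    o[..., 0] = 1.0
    return np.cosh(norm)[..., None] * o + (np.sinh(norm) / norm)[..., None] * v

def log_map_origin(x, eps=1e-9):
    """Logarithmic map at the origin: pulls hyperboloid points back to T_o.

    x : (n, d+1) points on the hyperboloid."""
    o = np.zeros_like(x)
    o[..., 0] = 1.0
    # Geodesic distance from o to x is arccosh(-<o, x>_L).
    d = np.arccosh(np.clip(-minkowski_dot(o, x), 1.0 + eps, None))
    u = x + minkowski_dot(o, x)[..., None] * o   # projection onto T_o
    u_norm = np.sqrt(np.clip(minkowski_dot(u, u), eps, None))
    return (d / u_norm)[..., None] * u
```

These two maps are mutually inverse at the origin, which is what allows the model to move between the Euclidean tangent space and the hyperboloid at every layer.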
We use $X$ to represent the matrix formed by the attribute vectors $x_i$ or $x_v$, which is the input of the autoencoder. Specifically, we use $H^{(l)}$ to represent the node representations at layer $l$ of the autoencoder, and we now derive the node representations at layer $l+1$. First, we take $H^{(l)}$ as the input of layer $l$ of the autoencoder, fuse it with the aggregation matrix $F$, multiply it by the weight matrix $W^{(l)}$ in the Euclidean space, and then map the result to the hyperboloid manifold, i.e.,

$$M^{(l)} = \exp_o^K\!\big(F H^{(l)} W^{(l)}\big). \tag{20}$$

Next, the bias vector $b^{(l)}$ in the Euclidean space is first projected to the tangent space at the origin $o$ of the hyperboloid manifold, and then it is mapped to the hyperboloid manifold by an exponential map, i.e.,

$$B^{(l)} = \exp_o^K\!\big((0, b^{(l)})\big). \tag{21}$$

Then, we perform the addition operation on the two quantities in Equations (20) and (21) according to Equation (12); the results are points in the hyperbolic space, which are pulled back to the Euclidean space, namely, the tangent space, through the logarithmic map:

$$H^{(l+1)} = \phi\Big(\log_o^K\big(M^{(l)} \oplus^K B^{(l)}\big)\Big), \tag{22}$$

where $\phi$ represents the activation function. Note that $H^{(0)}$ is the attribute matrix $X$ as the input.
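Putting the three equations together, one layer of the module can be sketched as follows, using the map functions from the sketch above. This is only an illustration: since Equation (12) defines the hyperbolic addition elsewhere in the paper, we stand in for it here by adding the two tangent-space images at the origin, which agrees with it to first order; the paper's exact operation may differ.

```python
import numpy as np  # exp_map_origin / log_map_origin come from the sketch above

def hyperbolic_layer(H, F, W, b, activation=np.tanh):
    """One autoencoder layer in the spirit of Equations (20)-(22).

    H : (n, d_in)     Euclidean representations at layer l (H^(0) = X).
    F : (n, n)        normalized aggregation matrix of Equation (19).
    W : (d_in, d_out) Euclidean weight matrix; b : (d_out,) Euclidean bias."""
    n = H.shape[0]
    # Equation (20): aggregate with F, transform with W, lift to a tangent
    # vector (0, .) at the origin, and map onto the hyperboloid.
    M = exp_map_origin(np.concatenate([np.zeros((n, 1)), F @ H @ W], axis=1))
    # Equation (21): project the bias to the tangent space at the origin
    # (prepend a zero coordinate), then exponential-map it to the manifold.
    B = exp_map_origin(np.concatenate([[0.0], b])[None, :])
    # Equation (22): combine the two quantities (our tangent-space stand-in
    # for the addition of Equation (12)), pull back with the logarithmic map,
    # and apply the activation; the zero 0-th tangent coordinate is dropped.
    tangent = log_map_origin(M) + log_map_origin(B)
    return activation(tangent[..., 1:])
```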
Let $\hat{x}_i$ and $\hat{x}_v$ be the reconstructed representations obtained after inputting the attribute vectors $x_i$ and $x_v$ into the autoencoder, respectively. The training goal of the autoencoder is to make the reconstructed attribute vectors close enough to the input attribute vectors. The matrix forms of $x_i$, $x_v$, $\hat{x}_i$, and $\hat{x}_v$ are $X$, $X_s$, $\hat{X}$, and $\hat{X}_s$, respectively. Hence, the autoencoder loss in our model is set to

$$\mathcal{L}_{ae} = \big\|\hat{X} - X\big\|^2 + \big\|\hat{X}_s - X_s\big\|^2, \tag{23}$$

where $\|\cdot\|$ is the Euclidean norm.
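As a check on notation, Equation (23) is just the sum of two squared reconstruction errors; a one-line sketch, using the matrix names as reconstructed above:

```python
import numpy as np

def autoencoder_loss(X, X_hat, X_s, X_s_hat):
    # Equation (23): squared Euclidean reconstruction error for both the
    # batched attributes X and the skip-gram-related attributes X_s.
    return np.sum((X_hat - X) ** 2) + np.sum((X_s_hat - X_s) ** 2)
```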
3.3. Skip-Gram Model Based on Ricci Curvature
One of the basic ideas of network embedding is that similar nodes should be close to each other in the embedding space, while dissimilar nodes should be far apart. Purely from the perspective of network topology, some works define similarity according to the neighborhood; that is, they assume that nodes in the same neighborhood should be similar. The definition of the neighborhood differs across the literature: it can consist of the direct neighbors of the target node, or of its first-order and second-order neighbors. In other works, sampling is carried out by random walks starting from the target node, and the node pairs obtained by sampling are considered similar. This similarity actually reflects the structural information of the network. After such similar node pairs are obtained by random walks, they are input into the skip-gram model for training, and the representations of the nodes are obtained.
Take the network in Figure 1a as an example, where we only consider the network structure without considering node attributes. In general, random walks are carried out according to the weights of the edges; note that the weights of the edges in this network are all 1. Assume that a random walk arrives at node 3 and must decide which node to visit next. According to the traditional decision rule of a random walk, nodes 1, 2, and 4 are equally likely to be chosen in the next step. However, we observe that nodes 1, 2, and 3 form one community, while nodes 4, 5, 6, and 7 form another. Considering the community structure, the probability of selecting node 4 in the next step should be smaller than that of selecting node 1 or node 2, because node 4 is not in the same community as node 3, while nodes 1 and 2 are. In other words, if the random walk can reflect the community structure, that is, the meso-scale structure of the network, it reflects the network structure more faithfully. The network in Figure 1b corresponds to the network in Figure 1a, and the weights of its edges are the Ricci curvatures of the edges in Figure 1a. As can be seen from Figure 1b, the Ricci curvature of edge (3,4) is −0.667, which is smaller than the Ricci curvatures of edges (1,3) and (2,3). Therefore, if the random walks are carried out not simply according to the weights of the edges but according to the Ricci curvatures of the edges, then, in the situation above, a walk at node 3 selects node 4 with lower probability than nodes 1 and 2. Consequently, the Ricci curvatures of edges reflect the network structure better than the raw edge weights. In the algorithm, since Ricci curvatures can be negative and are therefore awkward to use directly as sampling weights, we transform them with the sigmoid function.
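The following is a minimal sketch of such a curvature-guided walk, assuming the curvatures are supplied in a matrix aligned with the adjacency matrix; sampled walks would then be cut into node pairs with a sliding context window, as in DeepWalk-style methods. All names are illustrative.

```python
import numpy as np

def curvature_walk(adj, ricci, start, length, rng=None):
    """One random walk whose next step is drawn with probability proportional
    to the sigmoid-transformed Ricci curvature of the incident edge, so that
    inter-community edges (negative curvature) are chosen less often."""
    if rng is None:
        rng = np.random.default_rng()
    walk = [start]
    for _ in range(length - 1):
        v = walk[-1]
        nbrs = np.flatnonzero(adj[v])
        if nbrs.size == 0:                            # dead end: stop the walk
            break
        w = 1.0 / (1.0 + np.exp(-ricci[v, nbrs]))     # sigmoid keeps weights positive
        walk.append(int(rng.choice(nbrs, p=w / w.sum())))
    return walk
```

On the example of Figure 1b, a walk at node 3 would weight node 4 by $\sigma(-0.667) \approx 0.34$, against the larger weights of nodes 1 and 2, which is exactly the behavior argued for above.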
It is assumed that $D$ is the set of node pairs sampled by random walks based on the Ricci curvature. Node pairs in this set are considered to be similar, while node pairs not in this set are considered to be dissimilar. Let $(v, u)$ be any node pair in the set $D$. As mentioned in the previous subsection, the attribute vector $x_v$ corresponding to node $v$ is input into the autoencoder to obtain the reconstructed representation $\hat{x}_v$. The representation of node $v$ in the middle hidden layer of the autoencoder is denoted as $y_v$, which is the representation obtained by integrating the structure and attribute information in the neighborhood of node $v$. Since $(v, u)$ is a similar node pair, we adopt the processing method of ANRL to further integrate the attribute and structural information in the skip-gram module. We use the following conditional probability to express the similarity of the node pair $(v, u)$:

$$p(u \mid v) = \frac{\exp\big(c_u^{\top} y_v\big)}{\sum_{w \in V} \exp\big(c_w^{\top} y_v\big)}, \tag{24}$$

where $c_u$ is the representation of node $u$ when it is treated as a context node. Since the calculation of the denominator of Equation (24) involves the whole network, the computational cost is too high; therefore, we adopt the negative sampling technique of [31] to reduce it. Specifically, for a positive sample node pair $(v, u)$, we have the following loss function:

$$\mathcal{L}(v, u) = -\log \sigma\big(c_u^{\top} y_v\big) - \sum_{t=1}^{T} \mathbb{E}_{w \sim P_n(v)}\big[\log \sigma\big(-c_w^{\top} y_v\big)\big], \tag{25}$$

where the noise distribution $P_n(v) \propto d_v^{3/4}$ is the same as in [31], and $d_v$ is the degree of node $v$. Consequently, the corresponding loss function of the skip-gram module is

$$\mathcal{L}_{sg} = \sum_{(v, u) \in D} \mathcal{L}(v, u). \tag{26}$$
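A compact sketch of the per-pair loss of Equation (25), approximating the expectation with $T$ nodes actually drawn from the noise distribution; all names are illustrative, not the paper's code.

```python
import numpy as np

def skipgram_pair_loss(y_v, c_u, c_neg):
    """Negative-sampling loss for one positive pair (v, u), Equation (25).

    y_v   : (d,)   hidden representation of the center node v.
    c_u   : (d,)   context representation of the positive node u.
    c_neg : (T, d) context representations of T nodes sampled from P_n(v),
            i.e. with probability proportional to degree ** 0.75."""
    log_sigmoid = lambda z: -np.logaddexp(0.0, -z)   # numerically stable log(sigma(z))
    return -(log_sigmoid(c_u @ y_v) + log_sigmoid(-(c_neg @ y_v)).sum())
```

The total skip-gram loss of Equation (26) is then the sum of this quantity over all sampled pairs $(v, u) \in D$.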