Article

Auxiliary Graph for Attribute Graph Clustering

1 School of Computer, National University of Defense Technology, Changsha 410000, China
2 School of Cyberspace Science, Dongguan University of Technology, Dongguan 523808, China
* Author to whom correspondence should be addressed.
Entropy 2022, 24(10), 1409; https://doi.org/10.3390/e24101409
Submission received: 6 September 2022 / Revised: 28 September 2022 / Accepted: 29 September 2022 / Published: 2 October 2022
(This article belongs to the Collection Graphs and Networks from an Algorithmic Information Perspective)

Abstract

Attribute graph clustering algorithms that incorporate topological structure information into node attributes to build robust representations have proven effective in a variety of applications. However, the given topological structure emphasizes local links between connected nodes but fails to convey relationships between nodes that are not directly linked, which limits further improvement of clustering performance. To address this issue, we propose the Auxiliary Graph for Attribute Graph Clustering technique (AGAGC). Specifically, we construct an additional graph as a supervisor based on the node attributes. This additional graph serves as an auxiliary supervisor that complements the given one. To generate a trustworthy auxiliary graph, we introduce a noise-filtering approach. Under the supervision of both the pre-defined graph and the auxiliary graph, a more effective clustering model is trained. Additionally, the embeddings of multiple layers are merged to improve the discriminative power of the representations, and a self-supervised clustering module makes the learned representations more clustering-aware. Finally, our model is trained with a triplet loss. Experiments are conducted on four available benchmark datasets, and the results demonstrate that the proposed model outperforms or is comparable to state-of-the-art graph clustering models.

1. Introduction

Attribute graph data are ubiquitous in the real world, for example, in social networks [1], citation networks [2], and protein-protein interaction networks [3]. Because labeled data are scarce, there is a need to divide such data into groups without supervision.
In the early days, graph clustering methods used only structure information for network embedding. Utilizing structure information, some methods [4,5] based on random walks implement representation learning by maximizing the co-occurrence probability of node pairs. Recently, refs. [6,7,8] suggested mining meaningful features from networks with the Block Decomposition Method (BDM). For example, using BDM, ref. [6] obtains graph motif complexity for network clustering. By removing a minimal subset of edges, refs. [7,8] obtain the desired clusters with minimal loss of information contribution, which is calculated from the algorithmic complexity estimated by BDM. Along with the development of deep models, plenty of deep clustering models have emerged [9,10,11,12,13,14]. However, conventional deep clustering models focus on Euclidean data, such as images of faces, animals, or vehicles. Unlike Euclidean data, the relationships between nodes in a graph have nothing to do with their positions in space. For this reason, traditional deep models cannot handle both the attributes and the structure of graph data properly. Recently, the question of how to exploit both graph structure and node attributes sufficiently has attracted more and more attention in clustering tasks. The Graph Convolutional Network (GCN) [2] is a powerful model that meets this need, and a great number of graph clustering models based on GCN have been developed. Inspired by the AutoEncoder, the Graph Auto-Encoder (GAE) [15] implements representation learning in an encoder-decoder manner. Following GAE, ARGE [16] improves representation learning by introducing an adversarial training module. MGAE [17] proposes to exploit the interplay between node attributes and structure information. GAT [18] introduces an attention mechanism to assign different weights to different neighbors. Following GAT, ref. [19] aggregates neighbors by learning an attention mechanism in an unsupervised way. SDCN [20] is a deep model that can alleviate the impact of over-smoothness by fusing embeddings from different modalities. Based on SDCN, DFCN [21] improves performance by integrating global structure information into local structure information.
To some degree, these GCN-based models exploit structure information in different ways and achieve noticeable improvements. However, we find three kinds of cases that lead to sub-optimal performance: (1) methods that ignore the global structure completely; (2) methods that take the global structure into consideration but are trained only under the guidance of the given graph structure; (3) methods that ignore the guidance of the structure. All these methods fail to exploit the global structure appropriately and consequently lead to sub-optimal performance.
To solve this issue, unlike the shallow models mentioned before [4,5,6,7,8], we propose a deep graph clustering model termed Auxiliary Graph for Attribute Graph Clustering. In particular, we construct an additional graph as a supervisor based on the similarity between nodes in their raw feature space. However, the newly constructed graph is rife with erroneous relationships due to the underlying noise in the raw data. To mitigate this impact, we employ a filtering technique that keeps, for each target node, only a predefined number of nearest neighbors, which can be considered relatively reliable; the remaining relationships are assumed to be untrustworthy and are disregarded. We combine embeddings from various layers to generate highly discriminative representations. Finally, we design a training scheme that incorporates both a reconstruction loss and a clustering loss. On the one hand, we optimize the model by forcing it to reconstruct a graph that approximates both the pre-defined graph and the auxiliary graph; these two types of reconstruction are complementary. On the other hand, we employ a clustering-oriented optimization whose efficacy has been thoroughly proven, which makes the learned representations facilitate the clustering operation.
Our contributions are summarized as follows:
  • We build an auxiliary graph to reveal the relationships that are missed by the given graph. Under the supervision of both the auxiliary graph and the given graph, the learned representations become more reliable.
  • The optimization by clustering loss based on fusing embeddings from multiple layers facilitates both the discriminativeness and the clustering-awareness of representations.
  • Extensive experiments on four popular benchmark datasets are conducted and the results validate the superiority of our method over the state-of-the-art methods.

2. Related Works

Deep clustering has always attracted extensive attention, and during the past few years plenty of deep clustering models have emerged [9,11,13,22,23,24,25,26,27,28,29,30]. Among them, the AutoEncoder is a basic DNN model that is widely used as the backbone of subsequent deep clustering models. In DEC [11], a target distribution is designed to prevent large clusters from distorting the hidden feature space, which alleviates the impact of data imbalance. Inspired by [29], IDEC [9] improved DEC by introducing a reconstruction objective; training by reconstructing the input data preserves the local structure of the data in the embeddings. DSC [22] also introduces an auto-encoder framework into the subspace clustering module; the auto-encoder learns a non-linear mapping that facilitates subspace clustering. DMC [30] keeps the local structure by minimizing the distance between each target point and its K-nearest neighbors, and at the same time constructs a clustering-friendly objective that improves the representations. By forcing the embeddings from a noisy encoder to approximate those from a clean encoder, DEPICT [24] improves the robustness of the representations. Although effective, these deep clustering models neglect the graph structure, which contains a wealth of information that can greatly improve representation learning.
Recently, GCN-based deep clustering models have gained much attention, and an abundance of excellent models have been proposed [15,16,19,20,21,31,32,33,34]. Ref. [15] designed an encoder-decoder framework based on the graph convolutional network (GAE) and its variant (VGAE) based on the VAE [31]. As unsupervised graph-based representation learning methods, they are popular for clustering tasks. AGC [33] argues that each graph has its own distinct structure and that it is unreasonable to perform clustering on different graphs by aggregating neighbors within a fixed neighborhood; instead of keeping a fixed neighborhood for each graph, it proposes a measurement for choosing a proper neighborhood scale. In DAEGC [19], neighbors are not equally important to the target node; it captures the importance of neighbors for the target node with an attention network. Some works develop different training schemes to improve clustering performance. MGAE [17] corrupts node features with a pre-defined probability so that the interaction between node content and structure is reinforced and the representation capacity of the network is improved. ARGE [16] incorporates an adversarial training scheme into GAE, which learns a robust representation. Instead of reconstructing only the graph, ref. [35] improves ARGE by reconstructing both the graph and the features. EGAE-JOCAS [32] utilizes K-means and spectral clustering jointly to guide representation learning and improve performance. Some models [20,21,34] combine deep features of multiple modalities to alleviate over-smoothness. In SDCN [20], a GCN module and an auto-encoder module are integrated; incorporating representations from the auto-encoder module, the GCN is capable of capturing relationships between nodes at longer distances. DFCN [21] improves SDCN by dynamically integrating features of multiple modalities and optimizing with a triplet guidance that generates robust representations. AGCN [34] argues that, when fusing features, features from different modalities should not be considered equally important; it proposes to adaptively fuse features of different modalities at each layer and then adaptively fuse features of different layers.
Most of the methods mentioned above achieve promising clustering performance, but few consider that many relationships between nodes are missed by the given graph structure.

3. Proposed Method

The proposed model consists of the graph encoder, graph decoder, and clustering module, which will be introduced in turn as follows. Figure 1 is the flow chart of our proposed method.

3.1. Problem Definition

Given an undirected graph G = (V, E), V = {v_1, v_2, ..., v_n} is the set of nodes with |V| = n, and E is the edge set. X^T = [x_1, x_2, ..., x_n] ∈ R^{d×n} denotes the node feature matrix. A ∈ R^{n×n} denotes the symmetric adjacency matrix that indicates the connections between nodes, i.e., A_{ij} = A_{ji} = 1 if node i links node j and A_{ij} = A_{ji} = 0 otherwise, with i, j ∈ {1, 2, ..., n} and A_{ij} ∈ {0, 1}. We define D as the degree matrix of A, where D_{ii} = A_{i1} + A_{i2} + ... + A_{in} and D_{ij} = 0 for i ≠ j. More notations are summarized in Table 1.
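As a concrete illustration of these definitions, the following minimal PyTorch sketch builds A, D, and a self-loop-normalized adjacency used later by the encoder; the helper name and the exact normalization are our own assumptions, since the paper only states Â = I + A and the degree definition.

```python
import torch

def build_graph_matrices(edge_list, n):
    # Adjacency matrix A: A_ij = A_ji = 1 for every undirected edge (i, j)
    A = torch.zeros(n, n)
    for i, j in edge_list:
        A[i, j] = 1.0
        A[j, i] = 1.0
    # Degree matrix D: D_ii = sum_j A_ij, zero elsewhere
    D = torch.diag(A.sum(dim=1))
    # Self-loop adjacency A^ = I + A and its row-normalized version,
    # a common choice for the propagation matrix in GCN encoders
    A_hat = A + torch.eye(n)
    A_norm = torch.diag(1.0 / A_hat.sum(dim=1)) @ A_hat
    return A, D, A_norm
```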

3.2. Graph Encoder

GCN is used as a powerful tool for extracting features by integrating topological information into node attributes. In our model, we use the GCN as a basic module for encoding.
In a GCN, node features are filtered in the frequency domain. As a result, the filtered features are expected to be more robust because each node is enhanced by its neighbors. After filtering, the features are transformed linearly by a weight matrix followed by an activation function. This process is formulated as the following equations:
H_1 = \phi(\tilde{D}^{-1} \tilde{A} H_0 W_1)    (1)
H_2 = \phi(\tilde{D}^{-1} \tilde{A} H_1 W_2)    (2)
H_0 denotes the input of the encoder, H_0 = X. l ∈ {0, 1, 2, ..., L} denotes the layer index, and L denotes the index of the last layer of the encoder. W_l is the parameter matrix of the l-th layer. φ denotes an activation function such as Tanh or LeakyReLU. Â = I + A and D̂_{ii} = Σ_j Â_{ij}. In a GCN, different layers are expected to generate features of different scales, so embeddings obtained by fusing features from multiple layers should be more discriminative than embeddings from a single layer. We therefore apply a fusion strategy to the encoder, i.e., we simply concatenate the outputs of the encoder layers to generate robust representations, which can be formulated as the following equation:
H_f = \mathrm{Concat}(H_1, H_2)    (3)
In (3), Concat denotes the concatenation function, H_1 ∈ R^{n×d_1}, H_2 ∈ R^{n×d_2}, and H_f ∈ R^{n×d_f} with d_f = d_1 + d_2.
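The encoder and the fusion step can be sketched as follows in PyTorch. This is a minimal sketch under our own assumptions: node features are stored row-wise (n × d), the normalized propagation matrix A_norm is precomputed as above, the layer sizes d-256-16 follow Section 4.3, and Tanh is chosen as the activation.

```python
import torch
import torch.nn as nn

class FusionGCNEncoder(nn.Module):
    """Two-layer GCN encoder with multi-layer fusion, Equations (1)-(3)."""
    def __init__(self, in_dim, hid_dim=256, out_dim=16):
        super().__init__()
        self.W1 = nn.Linear(in_dim, hid_dim, bias=False)
        self.W2 = nn.Linear(hid_dim, out_dim, bias=False)
        self.act = nn.Tanh()

    def forward(self, X, A_norm):
        H1 = self.act(A_norm @ self.W1(X))      # H_1 = phi(A_norm X W_1)
        H2 = self.act(A_norm @ self.W2(H1))     # H_2 = phi(A_norm H_1 W_2)
        Hf = torch.cat([H1, H2], dim=1)         # H_f = Concat(H_1, H_2)
        return H1, H2, Hf
```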

3.3. Graph Decoder

A graph decoder is usually used to reconstruct the original graph. Following the Graph Auto-Encoder [15], we use an inner-product operation as the graph decoder. The output of the decoder is a symmetric matrix constructed from the output of the encoder:
M = \sigma(H_f^{\top} \cdot H_f)    (4)
σ is an activation function that scales values to the range (0, 1). M is regarded as the recovery of the original graph.
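A one-line sketch of the decoder in PyTorch, assuming row-wise embeddings H_f of shape (n, d_f), so that the inner products between node embeddings yield the n × n matrix M; the transpose placement follows from that row-wise convention rather than from the paper's notation.

```python
import torch

def decode(Hf):
    # M = sigmoid of the pairwise inner products between node embeddings
    return torch.sigmoid(Hf @ Hf.t())
```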

3.4. Optimization by Reconstructing Graphs

To reveal the relationships between nodes thoroughly, we need to find the latent relationships that are missed by the original graph. To achieve this, we use M to reconstruct the original graph and the complementary graph simultaneously.

3.4.1. Optimization by Reconstructing Original Graph

After obtaining M from the graph decoder, we optimize the model by minimizing the reconstruction loss between M and Ã:
L_{ra} = \frac{1}{2n} \| M - \tilde{A} \|_F^2    (5)
This process is widely used in graph auto-encoder models. Here, minimizing the loss between M and the original graph keeps the performance at a basic level.

3.4.2. Optimization by Reconstructing Complementary Graph

This optimization process has three parts, described in the following: graph building, graph processing, and minimization of the reconstruction loss.
  • Graph Building. To complement the given graph, we build a graph based on a similarity metric such as cosine similarity, which can discover latent relationships between nodes from a global view. The complementary graph is constructed by the following equations:
    S_{ij} = \frac{X_i X_j}{\| X_i \|_2 \| X_j \|_2}    (6)
    \tilde{S}_{ij} = \frac{S_{ij}}{\sum_{k=1}^{n} S_{ik}}    (7)
    After calculating the similarity between each pair of nodes, we obtain a graph capturing the global relationships.
  • Graph Processing. After graph building, we obtain an initial graph S that unavoidably contains noise. To obtain a relatively clean graph, we need to filter the noise, and we introduce a simple but effective filtering mechanism:
    S_i^{rank} = \mathrm{sort}(\tilde{S}_i)    (8)
    A_{s,ij} = \begin{cases} \tilde{S}_{ij} & \text{if } \tilde{S}_{ij} \ge \tilde{S}_{i r_K} \\ 0 & \text{otherwise} \end{cases}    (9)
    First, we rank each row of S̃ in descending order with a sort function, so that S_i^{rank} = \{\tilde{S}_{i r_1}, \tilde{S}_{i r_2}, \tilde{S}_{i r_3}, \ldots, \tilde{S}_{i r_n}\} with \tilde{S}_{i r_k} \ge \tilde{S}_{i r_{k+1}}. Then, by the filtering mechanism in (9), we keep only the relations with the top-K highest confidence and set the rest to 0 to reduce the impact of false relations.
  • Minimization of the reconstruction loss. After filtering, we obtain a more reliable graph A_s, and we implement representation learning by minimizing the loss between M and A_s, which is formulated as:
    L_{rs} = \frac{1}{2n} \| M - A_s \|_F^2    (10)
    Ã and A_s are supervisors that are complementary to each other. A code sketch of the auxiliary-graph construction is given after this list.
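The following PyTorch sketch builds the filtered auxiliary graph A_s from the raw features, following Equations (6)-(9); the epsilon terms and the handling of the diagonal are our own assumptions, and K = 100 follows the setting reported in Section 4.3.

```python
import torch

def build_auxiliary_graph(X, K=100):
    # Cosine similarity between all node pairs, Equation (6)
    Xn = X / (X.norm(dim=1, keepdim=True) + 1e-12)
    S = Xn @ Xn.t()
    # Row normalization, Equation (7)
    S = S / (S.sum(dim=1, keepdim=True) + 1e-12)
    # Keep only the top-K entries per row, Equations (8)-(9)
    topk_vals, topk_idx = torch.topk(S, K, dim=1)
    A_s = torch.zeros_like(S)
    A_s.scatter_(1, topk_idx, topk_vals)   # unreliable relations are set to 0
    return A_s
```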

3.4.3. The Joint Reconstruction Loss

A single supervisor may lead to bias in representation learning. Instead of using one single supervisor, we minimize the reconstruction loss with respect to both supervisors Ã and A_s. The objective function is formulated as follows:
L_{rec} = \lambda L_{rs} + L_{ra}    (11)
λ is a hyper-parameter used to control the importance of L_{rs}.
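A minimal sketch of the joint reconstruction loss of Equations (5), (10) and (11) in PyTorch; the default λ = 10 mirrors the value used for most datasets in Section 4.3.

```python
import torch

def reconstruction_loss(M, A_tilde, A_s, lam=10.0):
    n = M.shape[0]
    L_ra = torch.norm(M - A_tilde, p='fro') ** 2 / (2 * n)  # original graph, Eq. (5)
    L_rs = torch.norm(M - A_s, p='fro') ** 2 / (2 * n)      # auxiliary graph, Eq. (10)
    return lam * L_rs + L_ra                                 # joint loss, Eq. (11)
```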

3.5. Clustering Module

For unsupervised learning approaches, there are no labels available to define a target function, so we need an optimization objective that guides the model toward the clustering task. As most graph clustering models do, we adopt a self-training strategy to conduct clustering-oriented optimization. We use the Student's t-distribution as a kernel to measure the similarity between centroids and embeddings:
q_{ij} = \frac{(1 + \| h_i - \mu_j \|^2)^{-1}}{\sum_{t} (1 + \| h_i - \mu_t \|^2)^{-1}}    (12)
μ_k denotes the centroid of cluster k, initialized by k-means or random vectors, and h_i is the embedding of node i taken from H_f. q_{ij} denotes the probability that node i belongs to cluster j. To improve the accuracy of the centroids, we generate a target distribution. By matching the Student's t-distribution Q to the target distribution P, the cluster centroids and the embeddings are optimized simultaneously. The target distribution is constructed by the following equation:
p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_{k} q_{ik}^2 / f_k}    (13)
In (13), f_k = \sum_i q_{ik}. The optimization then minimizes the KL divergence between Q and P:
L_c = \mathrm{KL}(P \| Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}    (14)
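A sketch of the clustering module in PyTorch, covering the soft assignment (12), the target distribution (13), and the KL loss (14); the small epsilon added for numerical stability is our own choice.

```python
import torch

def soft_assignment(H, centroids):
    # Student's t-distribution kernel Q, Equation (12)
    dist = torch.cdist(H, centroids) ** 2            # ||h_i - mu_k||^2
    q = 1.0 / (1.0 + dist)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    # Sharpened target distribution P, Equation (13)
    weight = q ** 2 / q.sum(dim=0)                   # q_ij^2 / f_j
    return weight / weight.sum(dim=1, keepdim=True)

def clustering_loss(q, p):
    # KL(P || Q), Equation (14)
    return torch.sum(p * torch.log(p / (q + 1e-12) + 1e-12))
```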

3.6. Joint Optimization

To train the graph encoder-decoder and clustering module jointly, we design the objective function as:
L = L_c + L_{rec}    (15)
L c denotes the clustering loss, and L r e c denotes the reconstruction loss. After training, we can obtain the clustering results Y from Q, and the prediction of node i is assigned by:
y_i = \arg\max_c q_{ic}    (16)
Specifically, Y = [y_1, y_2, ..., y_n], where y_i is the index of the maximum value in q_i and also serves as the pseudo cluster label. The detailed steps are summarized in Algorithm 1, and a minimal code sketch of the whole procedure is given after the algorithm.
Algorithm 1 Deep Graph Clustering via Graph Augmentation
Require:
  Attribute matrix X, adjacency matrix A, iteration number iter, hyperparameters λ and K
Ensure:
  Clustering result Y;
  • Construct the top-K similarity matrix A_s
  • for t = 1 to iter do
  •    Generate the embeddings h_1, h_2 by (1), (2)
  •    Generate h_f by (3)
  •    Construct M by (4)
  •    Calculate the reconstruction loss by (11)
  •    Generate Q by (12)
  •    Generate P by (13)
  •    Calculate the clustering loss by (14)
  •    Update the whole framework by (15)
  • end for
  • Obtain Y by (16).
  • return Y;
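Putting the pieces together, the following PyTorch sketch mirrors Algorithm 1 using the helper sketches introduced above (build_graph_matrices, FusionGCNEncoder, decode, build_auxiliary_graph, reconstruction_loss, soft_assignment, target_distribution, clustering_loss); the optimizer, learning rate, and the use of a learnable centroid matrix are our own assumptions.

```python
import torch

def train_agagc(X, A_norm, A_tilde, n_clusters, iters=400, lam=10.0, K=100, lr=1e-3):
    """Minimal training loop mirroring Algorithm 1 (a sketch, not the authors' code)."""
    A_s = build_auxiliary_graph(X, K)                        # top-K similarity graph
    encoder = FusionGCNEncoder(X.shape[1])                   # d-256-16 encoder
    centroids = torch.nn.Parameter(torch.rand(n_clusters, 256 + 16))
    opt = torch.optim.Adam(list(encoder.parameters()) + [centroids], lr=lr)
    for _ in range(iters):
        _, _, Hf = encoder(X, A_norm)                        # embeddings, Eqs. (1)-(3)
        M = decode(Hf)                                       # reconstructed graph, Eq. (4)
        L_rec = reconstruction_loss(M, A_tilde, A_s, lam)    # Eq. (11)
        Q = soft_assignment(Hf, centroids)                   # Eq. (12)
        P = target_distribution(Q.detach())                  # Eq. (13), no gradient through P
        loss = clustering_loss(Q, P) + L_rec                 # Eqs. (14) and (15)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        Q = soft_assignment(encoder(X, A_norm)[2], centroids)
    return Q.argmax(dim=1)                                   # cluster labels, Eq. (16)
```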

3.7. Complexity Analysis

Owing to the sparsity of the adjacency matrix, the computational complexity of the GCN is linear in |E|. Let d be the maximum number of neurons in the hidden layers; the complexity is then O(|E| d^2). In addition, let k be the number of clusters; the computational complexity of the clustering module together with the top-K sorting in (8) is O(nk + n log n). Taking both the GCN and the clustering module into account, the overall complexity is O(|E| d^2 + nk + n log n).

4. Experiment

4.1. Datasets

We implement experiments on four widely used graph datasets. More details about them are summarized in Table 2.
  • Citeseer This is a citation dataset. Papers in it are divided into six categories: Agents, Artificial Intelligence, Database, Information Retrieval, Machine Learning, and HCI. Each edge represents a citation relationship between documents. Each node denotes a paper whose feature is a {0, 1} vector, where each dimension corresponds to a keyword from a specific vocabulary.
  • Dblp It is a co-authorship network. Authors in it are divided into four classes: database, data mining, machine learning, and information retrieval. An edge represents a cooperative relationship between authors. The node features are bag-of-words vectors of keywords.
  • Acm It is a paper network. An edge between nodes indicates that the two papers are written by the same author. Papers are divided into three classes: Database, Wireless Communication, and Data Mining. The features are bag-of-words vectors of keywords from the corresponding areas.
  • Pubmed It is a citation dataset about diabetes. The publications in it are divided into three classes: Diabetes Experimental, Diabetes Type 1, and Diabetes Type 2. Each node is represented by a tf-idf vector of keywords.

4.2. Baselines

We compare our proposed method with 12 methods which can be divided into 4 types: Non-model based, Auto-Encoder based, Graph Auto-Encoder based, and Hybrid-module based.
  • K-means A widely used clustering algorithm based on an EM [36] updating strategy.
  • AE [10] A classical Deep model for unsupervised learning.
  • DEC [11] A deep embedding model based on Auto-Encoder for clustering.
  • IDEC [9] A deep model based on DEC with an additional Auto-Encoder module for preserving the local structure of data.
  • GAE&VGAE [15] A GCN-based model for unsupervised learning, based on the frameworks of AE&VAE.
  • ARGE&ARVGE [16] GAE (VGAE)-based models that use an adversarial training strategy to regularize the distribution of the embeddings for robust representations.
  • DAEGC [19] An attention-based graph clustering model. Instead of being guided only by the given graph structure, it learns to aggregate by assigning attention scores to each neighbor.
  • SDCN [20] A hybrid deep clustering model that integrates embeddings from both Auto-Encoder and GCN module, which is designed for easing the problem of over-smoothness.
  • AGCN [34] Based on SDCN, it proposed a method to learn an attention mechanism to fuse the embeddings from different modules reasonably.
  • DFCN [21] Based on SDCN, it introduces a cross-modality fusion mechanism to improve the robustness.

4.3. Parameter Settings

As most GCN-based models do, we use a 2-layer network for our model. The dimensions of the layers are d-256-16, where d is the input dimension. The training process is divided into two steps. In the first step, we pre-train the network without the clustering module to minimize the reconstruction losses of the similarity graph and the given graph structure. In the second step, we train the whole network together with the clustering loss. After analyzing the effect of the hyperparameters, we set λ = 0.1 for Citeseer and λ = 10 for the others, and we set K = 100 for the top-K similarity. For Citeseer, Dblp, and Acm, we set the learning rate to 0.001; for Pubmed, we set it to 0.005. We train the network for 500 epochs on Dblp and Pubmed, 100 epochs on Acm, and 400 epochs on Citeseer. For fairness, we set the network dimensions of GAE&VGAE to be the same as ours. In addition, for Pubmed, we use a sampling strategy for training with a sampling rate of 0.25, i.e., in each epoch we sample a subgraph that contains 25% of the nodes. For AE and GAE&VGAE, we use K-means to obtain the clustering results. For the other clustering methods, we follow the settings of their corresponding papers. We repeat each experiment 10 times and report the average results, which are shown in Table 3, Table 4, Table 5 and Table 6. All experiments are implemented with PyTorch and run on a GPU (GeForce GTX 1080Ti).
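The Pubmed subgraph sampling can be sketched as follows; the node-induced sampling scheme is our own assumption, since the paper only reports the 25% sampling rate.

```python
import torch

def sample_subgraph(X, A, rate=0.25):
    # Randomly pick a fraction of nodes and take the induced subgraph
    n = X.shape[0]
    idx = torch.randperm(n)[: int(rate * n)]
    return X[idx], A[idx][:, idx], idx   # sampled features, adjacency, node indices
```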

4.4. Metrics

We use four popular metrics to evaluate the clustering performance: ACC (accuracy), NMI (Normalized Mutual Information), ARI (Adjusted Rand Index), and F1 (macro F1-score). ACC is obtained by finding the best matching between predicted clusters and ground-truth labels and computing the ratio of correctly assigned samples. NMI measures the mutual information between predictions and true labels, normalized to [0, 1]. ARI measures the agreement between the predicted clustering and the ground truth, corrected for chance. F1 is an overall measurement of precision and recall. Higher values denote better performance.
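For reference, these metrics can be computed as in the sketch below, with ACC and macro F1 obtained after a Hungarian matching between predicted clusters and labels; the use of scipy and scikit-learn here is our own choice, not something the paper specifies.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score, f1_score

def clustering_accuracy(y_true, y_pred):
    # y_true, y_pred: integer numpy arrays of equal length
    k = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1
    row, col = linear_sum_assignment(-cost)        # best cluster-to-label matching
    mapping = dict(zip(row, col))
    y_mapped = np.array([mapping[p] for p in y_pred])
    acc = (y_mapped == y_true).mean()
    f1 = f1_score(y_true, y_mapped, average='macro')
    return acc, f1

def evaluate(y_true, y_pred):
    acc, f1 = clustering_accuracy(y_true, y_pred)
    nmi = normalized_mutual_info_score(y_true, y_pred)
    ari = adjusted_rand_score(y_true, y_pred)
    return acc, nmi, ari, f1
```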

4.5. Analysis of Result

In our experiments, our method is compared with 12 other methods on four benchmark datasets. Table 3, Table 4, Table 5 and Table 6 show the results, where bold numbers represent the best performance and underlined numbers the second best. From these tables, we have the following observations:
  • We can observe from these tables that the proposed method outperforms all the compared baseline methods on four benchmark datasets on most metrics. For example, in Dblp, our model outperforms the second-best one by nearly 4 pp (pp: percentage point), 7 pp, 8 pp, 5 pp on ACC, NMI, ARI, and F1 respectively. In Pubmed, compared to the second strongest, our model outperforms it by nearly 2 pp, 3 pp, 3 pp, 2 pp on ACC, NMI, ARI, F1 respectively. There are three reasons for the effectiveness of our model: First, we fuse embeddings from multiple layers to generate discriminative representations; Second, we construct a filtered graph from the original feature space to preserve the global relations of nodes; Last, we develop a joint training strategy to learn representations that can facilitate clustering and preserve both local relations and intrinsic global relations of nodes.
  • AE, DEC, and IDEC use only node features to generate embeddings, which leads to sub-optimal clustering performance compared with GCN-based models. Since k-means is performed directly in the original feature space, it can be used to gauge the quality of the raw features; from the k-means results, we can observe that the raw features of Acm are of the highest quality.
  • GAE, VGAE, ARGE, and ARVGE generate embeddings from a single layer. Compared with them, besides reconstructing intrinsic relationships, our model fuses multi-scale features to strengthen the discriminativeness of the embeddings.
  • DAEGC exploits an attention mechanism for aggregation. Although it considers relations between nodes in a wider range, it performs representation learning under the supervision of the given graph structure only and therefore cannot exploit the hidden relations missed by that graph. Compared with it, our model has two advantages: first, we explore relations from a global view; second, the explored relations come from the original feature space and can thus be considered more intrinsic.
  • SDCN, AGCN, and DFCN are powerful deep clustering models that exploit multiple modalities to generate discriminative embeddings. Despite alleviating the problem of over-smoothness, these models fail to explore the latent relations between nodes that cannot be observed from the given graph. By measuring the similarity between nodes, our model successfully reveals the missing relations from the original feature space and outperforms these models.

4.6. Ablation Study

To make clear how each part contributes to the proposed model, we conduct experiments in which each part is removed in turn. We also conduct experiments to validate that fusing the embeddings of each layer improves the representations. The results of these experiments are shown in Table 7 and Table 8, respectively.

4.6.1. The Effectiveness of Each Component

Table 7 illustrates how each component of the model influences its performance. No single component dominates the other two across all datasets. On Citeseer and Pubmed, deleting the fusion part has the most significant effect on performance, so we conclude that features from various scales strengthen the representations on these datasets. On Dblp, however, the similarity supervision has the greatest impact, indicating that the effectiveness of mining latent edges is promising. The combination of similarity and fusion dominates the performance on Citeseer, whereas the combination of similarity and adjacency dominates the performance on Dblp. For Acm and Pubmed, however, only the incorporation of all three parts results in significant improvements.

4.6.2. The Effectiveness of Each Layer

To demonstrate the efficacy of the fusion technique, we perform the clustering task on each layer individually. Table 8 provides the results. H_1 and H_2 represent the embeddings from layer 1 and layer 2, respectively, whereas H_f represents the combination of H_1 and H_2. We can observe that, across all datasets, single-layer representations are consistently weaker than the multi-layer representation. In addition, different datasets have different preferences for the neighborhood scale: for Citeseer and Pubmed, the embeddings of layer 1 are preferred, whereas the embeddings of layer 2 yield better clustering performance for Dblp and Acm. However, optimal performance is achieved by combining the embeddings from both layers, validating the efficacy of our fusion technique.

4.7. Analysis of Hyperparameters

In our experiments, we introduce two hyperparameters. K is the number of nearest neighbors kept for each target node, i.e., the number of top-K values retained in each row of the similarity matrix. λ is a hyper-parameter that adjusts the weight of the auxiliary-graph reconstruction loss L_rs in (11).

4.7.1. Analysis of λ

We empirically choose the range of λ as {100, 10, 1, 0.1, 0.01}. On Acm, the fluctuation of the performance is slow and small, but it is clear that the best performance is achieved when λ = 10. The best value of λ for Pubmed and Dblp is also 10, as can be observed in Figure 2, whereas the best performance on Citeseer is achieved when λ = 0.1. These observations validate that: (1) the auxiliary graph is helpful for clustering tasks; (2) compared with Citeseer, the auxiliary graph plays a more important role on Pubmed, Dblp, and Acm. In terms of the degree of improvement, Dblp benefits the most, with an improvement of nearly 20 percentage points in ACC as λ increases from 0.01 to 10; the other datasets also improve, but not as much as Dblp. The reason may be that, compared with the given graph of Dblp, the graphs of the other datasets cover the relationships more completely. We can also observe that the performance tends to decrease, to different degrees for all datasets, when λ varies from 10 to 100. There are two reasons for this: (1) although filtered, the auxiliary graph still contains noise, and putting too much weight on it increases the impact of that noise; (2) there exist node pairs that are linked in the given graph and belong to the same cluster but are not linked in the auxiliary graph, and putting too much emphasis on the auxiliary graph may ignore this kind of relationship, which leads to sub-optimal performance. For these reasons, we cannot put too much weight on the auxiliary graph during training.

4.7.2. Analysis of K

The range of K is {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 50, 100, N}, where N denotes the number of nodes in the dataset. From Figure 3, it is easy to observe that for all datasets the best performance is achieved when K = 100, whereas for K = N the performance decreases to different degrees, since an unfiltered auxiliary graph contains too much noise and harms the performance noticeably. For Acm, the performance stays stable as K varies; although the gain is small, the auxiliary graph still improves the performance. For Citeseer, Dblp, and Pubmed, the performance improves substantially once K reaches or passes a certain threshold; in our experiments, the threshold is 5 for Citeseer, 4 for Dblp, and 50 for Pubmed. In most cases, the performance increases as K increases. However, when K = N, the performance becomes worse than with K = 100, because a fully connected graph built from raw features contains much more noise than a filtered one.

4.8. Study on the Influence of Graph Structure and Attribute

To study how the structure influences our method, we conduct experiments in two different ways: (1) removing the attributes from the input; (2) removing the structure information from the input. The results are shown in Figure 4. From this figure, we can observe that with the structure only, our method achieves better performance than with the features only. We can infer that, for these datasets, the structure plays a more critical role than the features in our method. We can also observe that when we integrate the attributes with the structure as input, we achieve the best performance.

5. Conclusions

In this paper, we propose a clustering model termed Auxiliary Graph for Attribute Graph Clustering. In our model, we build an auxiliary graph to reveal the latent relations of nodes from a global view. To reduce the impact of the inherent noise in datasets, we disregard unreliable relations through a filtering mechanism. With the help of the auxiliary graph, our model learns more reliable representations, and with the help of the fusion strategy and the clustering module, both the discriminativeness and the clustering-awareness of the learned representations are improved. Experiments on four benchmark datasets demonstrate that our model outperforms state-of-the-art baselines in most cases. Although it achieves promising performance, our model still has room for improvement. In the future, we will improve our model to fit different datasets, especially large-scale datasets.

Author Contributions

Resources, Z.Z.; Writing—original draft, W.L.; Writing—review and editing, S.W., X.G. and E.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by National Key R&D Program of China under Grant No. 2022ZD0209103 and the National Natural Science Foundation under Grant No. 62206054.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data available on request from the authors.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Hastings, M.B. Community detection as an inference problem. Phys. Rev. E 2006, 74, 035102. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  2. Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), Toulon, France, 24–26 April 2017. [Google Scholar]
  3. Altaf-Ul-Amin, M.; Shinbo, Y.; Mihara, K.; Kurokawa, K.; Kanaya, S. Development and implementation of an algorithm for detection of protein complexes in large interaction networks. BMC Bioinform. 2006, 7, 207. [Google Scholar] [CrossRef] [PubMed]
  4. Perozzi, B.; Al-Rfou, R.; Skiena, S. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 24–27 August 2014; ACM: New York, NY, USA, 2014; pp. 701–710. [Google Scholar]
  5. Grover, A.; Leskovec, J. node2vec: Scalable Feature Learning for Networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; ACM: New York, NY, USA, 2016; pp. 855–864. [Google Scholar]
  6. Zenil, H.; Kiani, N.A.; Tegnér, J. Algorithmic complexity of motifs clusters superfamilies of networks. In Proceedings of the 2013 IEEE International Conference on Bioinformatics and Biomedicine, Shanghai, China, 18–21 December 2013; Li, G., Kim, S., Hughes, M., McLachlan, G.J., Sun, H., Hu, X., Ressom, H.W., Liu, B., Liebman, M.N., Eds.; IEEE Computer Society: Manhattan, NY, USA, 2013; pp. 74–76. [Google Scholar]
  7. Zenil, H.; Kiani, N.A.; Marabita, F.; Deng, Y.; Elias, S.; Schmidt, A.; Ball, G.; Tegner, J. An algorithmic information calculus for causal discovery and reprogramming systems. iScience 2019, 19, 1160–1172. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  8. Zenil, H.; Kiani, N.A.; Zea, A.A.; Tegnér, J. Causal deconvolution by algorithmic generative models. Nat. Mach. Intell. 2019, 1, 58–66. [Google Scholar] [CrossRef] [Green Version]
  9. Guo, X.; Gao, L.; Liu, X.; Yin, J. Improved Deep Embedded Clustering with Local Structure Preservation. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017; pp. 1753–1759. [Google Scholar]
  10. Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 2006, 313, 504–507. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  11. Xie, J.; Girshick, R.B.; Farhadi, A. Unsupervised Deep Embedding for Clustering Analysis. In Proceedings of the 33rd International Conference on Machine Learning, New York City, NY, USA, 20–22 June 2016; Volume 48, pp. 478–487. [Google Scholar]
  12. Min, E.; Guo, X.; Liu, Q.; Zhang, G.; Cui, J.; Long, J. A Survey of Clustering with Deep Learning: From the Perspective of Network Architecture. IEEE Access 2018, 6, 39501–39514. [Google Scholar] [CrossRef]
  13. Yang, B.; Fu, X.; Sidiropoulos, N.D.; Hong, M. Towards K-means-friendly Spaces: Simultaneous Deep Learning and Clustering. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), Sydney, NSW, Australia, 6–11 August 2017; pp. 3861–3870. [Google Scholar]
  14. Huang, P.; Huang, Y.; Wang, W.; Wang, L. Deep Embedding Network for Clustering. In Proceedings of the 22nd International Conference on Pattern Recognition (ICPR 2014), Stockholm, Sweden, 24–28 August 2014; IEEE Computer Society: Manhattan, NY, USA, 2014; pp. 1532–1537. [Google Scholar]
  15. Kipf, T.N.; Welling, M. Variational Graph Auto-Encoders. arXiv 2016, arXiv:1611.07308. [Google Scholar]
  16. Pan, S.; Hu, R.; Long, G.; Jiang, J.; Yao, L.; Zhang, C. Adversarially Regularized Graph Autoencoder for Graph Embedding. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI 2018), Stockholm, Sweden, 13–19 July 2018; pp. 2609–2615. [Google Scholar]
  17. Wang, C.; Pan, S.; Long, G.; Zhu, X.; Jiang, J. MGAE: Marginalized Graph Autoencoder for Graph Clustering. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, Singapore, 6–10 November 2017; ACM: New York, NY, USA, 2017; pp. 889–898. [Google Scholar]
  18. Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph Attention Networks. arXiv 2017, arXiv:1710.10903. [Google Scholar]
  19. Wang, C.; Pan, S.; Hu, R.; Long, G.; Jiang, J.; Zhang, C. Attributed Graph Clustering: A Deep Attentional Embedding Approach. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019; pp. 3670–3676. [Google Scholar]
  20. Bo, D.; Wang, X.; Shi, C.; Zhu, M.; Lu, E.; Cui, P. Structural Deep Clustering Network. In Proceedings of the WWW ’20: The Web Conference 2020, Taipei, Taiwan, 20–24 April 2020; ACM: New York, NY, USA, 2020; pp. 1400–1410. [Google Scholar]
  21. Tu, W.; Zhou, S.; Liu, X.; Guo, X.; Cai, Z.; Zhu, E.; Cheng, J. Deep Fusion Clustering Network. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence (EAAI 2021), Virtual Event, 2–9 February 2021; AAAI Press: Palo Alto, CA, USA, 2021; pp. 9978–9987. [Google Scholar]
  22. Ji, P.; Zhang, T.; Li, H.; Salzmann, M.; Reid, I.D. Deep Subspace Clustering Networks. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; pp. 24–33. [Google Scholar]
  23. Li, F.; Qiao, H.; Zhang, B. Discriminatively boosted image clustering with fully convolutional auto-encoders. Pattern Recognit. 2018, 83, 161–173. [Google Scholar] [CrossRef] [Green Version]
  24. Dizaji, K.G.; Herandi, A.; Deng, C.; Cai, W.; Huang, H. Deep Clustering via Joint Convolutional Autoencoder Embedding and Relative Entropy Minimization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, 22–29 October 2017; IEEE Computer Society: Manhattan, NY, USA, 2017; pp. 5747–5756. [Google Scholar] [CrossRef] [Green Version]
  25. Jiang, Z.; Zheng, Y.; Tan, H.; Tang, B.; Zhou, H. Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, 19–25 August 2017; Sierra, C., Ed.; 2017; pp. 1965–1972. [Google Scholar] [CrossRef] [Green Version]
  26. Yang, J.; Parikh, D.; Batra, D. Joint Unsupervised Learning of Deep Representations and Image Clusters. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 27–30 June 2016; IEEE Computer Society: Manhattan, NY, USA, 2016; pp. 5147–5156. [Google Scholar] [CrossRef]
  27. Hsu, C.; Lin, C. CNN-Based Joint Clustering and Representation Learning with Feature Drift Compensation for Large-Scale Image Data. IEEE Trans. Multim. 2018, 20, 421–429. [Google Scholar] [CrossRef] [Green Version]
  28. Wang, Z.; Chang, S.; Zhou, J.; Wang, M.; Huang, T.S. Learning A Task-Specific Deep Architecture For Clustering. In Proceedings of the 2016 SIAM International Conference on Data Mining, Miami, FL, USA, 5–7 May 2016; Venkatasubramanian, S.C., Meira, W., Eds.; 2016; pp. 369–377. [Google Scholar] [CrossRef] [Green Version]
  29. Peng, X.; Xiao, S.; Feng, J.; Yau, W.; Yi, Z. Deep Subspace Clustering with Sparsity Prior. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI 2016), New York, NY, USA, 9–15 July 2016; Kambhampati, S., Ed.; IJCAI/AAAI Press: Palo Alto, CA, USA, 2016; pp. 1925–1931. [Google Scholar]
  30. Chen, D.; Lv, J.; Zhang, Y. Unsupervised Multi-Manifold Clustering by Learning Deep Representation. In Proceedings of the The Workshops of the The Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; AAAI Technical Report. AAAI Press: Palo Alto, CA, USA, 2017; Volume WS-17. [Google Scholar]
  31. Doersch, C. Tutorial on Variational Autoencoders. arXiv 2016, arXiv:1606.05908. [Google Scholar]
  32. Li, X.; Zhang, H.; Zhang, R. Embedding Graph Auto-Encoder with Joint Clustering via Adjacency Sharing. arXiv 2020, arXiv:2002.08643. [Google Scholar]
  33. Zhang, X.; Liu, H.; Li, Q.; Wu, X. Attributed Graph Clustering via Adaptive Graph Convolution. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019; pp. 4327–4333. [Google Scholar]
  34. Peng, Z.; Liu, H.; Jia, Y.; Hou, J. Attention-driven Graph Clustering Network. In Proceedings of the MM ’21: ACM Multimedia Conference, Virtual Event, China, 20–24 October 2021; ACM: New York, NY, USA, 2021; pp. 935–943. [Google Scholar]
  35. Pan, S.; Hu, R.; Fung, S.f.; Long, G.; Jiang, J.; Zhang, C. Learning graph embedding with adversarial training methods. IEEE Trans. Cybern. 2019, 50, 2475–2487. [Google Scholar] [CrossRef]
  36. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum Likelihood from Incomplete Data Via the EM Algorithm. J. R. Stat. Soc. Ser. B (Methodol.) 1977, 39, 1–22. [Google Scholar]
Figure 1. The framework of AGAGC. From top to bottom, the model consists of three components: auxiliary graph creation, graph auto-encoder, and clustering module. The top section represents the construction of the auxiliary graph A_s and consists of two steps: build and process. The construction of the auxiliary graph is a prerequisite for training. As the backbone of the middle section, we employ a graph auto-encoder (GAE), with a GCN module as the encoder. The encoder accepts as input the feature matrix X and the given graph A. After encoding, we concatenate the embeddings of each GCN layer to obtain the output, denoted by H_f. As in GAE, we employ an inner product as the decoder, which generates a symmetric matrix M by applying the inner product to H_f followed by a sigmoid function. During training, M is required to approximate both the pre-defined graph A and the auxiliary graph A_s (minimizing L_ra and L_rs, respectively). The bottom section is the clustering module, with H_f as its input; its goal is to increase the clustering-awareness of the representations. The clustering module generates Q using a Student's t-distribution and creates a target distribution P from Q to produce cluster-structured representations. By minimizing L_c (the KL divergence between Q and P), the model improves the cluster-friendliness of the representations.
Figure 2. The sensitivity analysis of λ in (11).
Figure 3. The sensitivity analysis of the hyperparameter K (the parameter of the KNN filter in (9)).
Figure 4. The impact of structure and features on performance.
Table 1. Notations.

Notation         Meaning
X ∈ R^{d×n}      Feature matrix
A ∈ R^{n×n}      Adjacency matrix
I ∈ R^{n×n}      Identity matrix
A_s ∈ R^{n×n}    Filtered similarity matrix
Â ∈ R^{n×n}      Adjacency matrix with self-loops
Ã ∈ R^{n×n}      Normalized adjacency matrix
D ∈ R^{n×n}      Degree matrix
H_f ∈ R^{d×n}    Output of the graph encoder
M ∈ R^{n×n}      Reconstructed matrix
S ∈ R^{n×n}      Constructed similarity matrix
Q ∈ R^{n×K}      Soft assignment distribution
P ∈ R^{n×K}      Target distribution
Table 2. Benchmark Datasets.

Dataset    Nodes    Dimension    Clusters    Edges     Degree
Citeseer   3327     3703         6           4732      99
Dblp       4058     334          4           7056      45
Acm        3025     1870         3           26,256    90
Pubmed     19,717   500          3           44,325    142
Table 3. Clustering results on Citeseer.

Method    ACC     NMI     ARI      F1
k-means   55.06   29.21   24.56    53.03
AE        53.93   27.56   26.03    50.53
DEC       60.96   33.36   33.20    57.13
IDEC      63.16   36.54   36.75    60.37
GAE       60.55   36.34   35.50    56.24
VGAE      51.41   28.96   24.88    49.48
ARGE      54.40   26.10   24.50    52.90
ARVGE     57.30   35.00   34.10    54.60
DAEGC     64.54   36.41   37.78    62.20
SDCN      65.96   38.71   40.157   63.62
AGCN      68.79   41.54   43.79    62.37
DFCN      69.50   43.90   45.50    64.30
AGAGC     70.46   44.36   46.56    64.28
Table 4. Clustering results on Pubmed.

Method    ACC      NMI     ARI      F1
k-means   59.83    31.05   28.1     58.88
AE        63.07    26.32   23.86    64.01
DEC       60.154   22.44   19.55    61.49
IDEC      60.70    23.67   20.58    62.41
GAE       62.09    23.84   20.62    61.37
VGAE      68.48    30.61   30.155   67.68
ARGE      65.26    24.8    24.35    65.69
ARVGE     64.25    23.88   22.82    64.51
DAEGC     68.73    28.26   29.84    68.23
SDCN      64.20    22.87   22.30    65.01
AGCN      63.61    23.31   22.36    64.19
DFCN      68.89    31.43   30.64    68.10
AGAGC     70.77    34.33   33.81    70.46
Table 5. Clustering results on Dblp.

Method    ACC     NMI     ARI     F1
k-means   38.35   10.99   6.68    32.10
AE        38.62   14.03   7.41    31.72
DEC       61.46   27.53   25.25   61.82
IDEC      55.92   24.56   18.37   56.82
GAE       53.42   29.29   16.83   54.9
VGAE      53.06   28.87   16.65   54.34
ARGE      64.44   30.21   26.21   64.32
ARVGE     61.94   25.63   23.91   60.57
DAEGC     62.05   32.49   21.03   61.75
SDCN      68.05   39.50   39.15   67.71
AGCN      73.26   39.68   42.49   72.80
DFCN      76.00   43.70   47.00   75.70
AGAGC     80.50   50.77   55.41   80.16
Table 6. Clustering results on Acm.

Method    ACC     NMI     ARI     F1
k-means   68.17   33.40   31.29   68.42
AE        78.55   44.53   46.98   78.69
DEC       72.52   43.50   43.48   70.60
IDEC      78.33   50.83   51.52   76.44
GAE       89.06   64.69   70.47   89.05
VGAE      76.78   43.33   41.14   76.96
ARGE      83.06   49.31   55.77   84.81
ARVGE     83.65   52.11   57.08   81.40
DAEGC     86.94   56.18   59.35   87.07
SDCN      90.45   68.31   73.91   90.42
AGCN      90.59   68.38   74.20   90.58
DFCN      90.90   69.40   74.90   90.80
AGAGC     91.50   70.74   76.49   91.51
Table 7. Compare the impact of removing each part of the model.

Dataset    Remove     ACC     NMI     ARI     F1
Citeseer   w/o S      65.29   40.26   40.15   58.81
           w/o C      57.24   36.82   31.76   50.26
           w/o A      70.35   43.70   45.12   62.09
           Proposed   70.46   44.36   46.56   64.28
Acm        w/o S      90.54   68.50   74.04   90.58
           w/o C      89.87   67.12   72.51   59.87
           w/o A      88.96   65.65   69.95   89.07
           Proposed   91.60   70.74   76.49   91.51
Dblp       w/o S      59.03   25.55   23.64   58.58
           w/o C      80.11   50.31   55.11   79.68
           w/o A      77.64   49.05   49.37   77.70
           Proposed   80.50   50.77   55.41   80.16
Pubmed     w/o S      62.15   24.78   21.40   62.14
           w/o C      39.95   -       -       19.04
           w/o A      55.60   18.20   14.53   54.31
           Proposed   70.77   34.33   33.81   70.46
Table 8. The clustering performance on each layer.

Dataset    Layer   ACC     NMI     ARI     F1
Citeseer   H_1     67.76   41.80   42.29   63.73
           H_2     67.95   41.56   40.93   59.52
           H_f     70.46   44.36   46.56   64.28
Dblp       H_1     78.11   47.81   51.30   77.36
           H_2     80.08   50.15   55.11   79.62
           H_f     80.50   50.77   55.41   80.16
Acm        H_1     89.60   66.98   71.65   89.67
           H_2     90.62   69.19   74.41   90.63
           H_f     91.50   70.74   76.49   91.51
Pubmed     H_1     63.73   26.03   24.37   64.98
           H_2     62.91   21.34   20.63   63.31
           H_f     70.77   34.33   33.81   70.46
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
