Peer-Review Record

Hierarchical and Unsupervised Graph Representation Learning with Loukas’s Coarsening

Algorithms 2020, 13(9), 206; https://doi.org/10.3390/a13090206
by Louis Béthune 1,*,†, Yacouba Kaloga 1,†, Pierre Borgnat 1, Aurélien Garivier 2 and Amaury Habrard 3
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 6 July 2020 / Revised: 10 August 2020 / Accepted: 17 August 2020 / Published: 21 August 2020
(This article belongs to the Special Issue Efficient Graph Algorithms in Machine Learning)

Round 1

Reviewer 1 Report

The paper proposes a novel approach for unsupervised graph representation learning. The method is unsupervised in the sense that it does not exploit graph labels to generate the representations. The method accounts for hierarchical representations of the graph, generated by means of a graph coarsening algorithm. The method is end-to-end differentiable. The method is inductive, in the sense that it can be used to generate representations of graphs not seen during training.


The subject of the paper is interesting and the authors demonstrated some resourcefulness in combining different approaches to implement their idea. However, I believe the paper should go through a major revision to better clarify some parts of the proposed methodology. I ask the authors to address the following points.

  • I noticed some repetitions in the first part of the manuscript. For example, L56-60 and Section 3.3 repeat each other. This is not the only example of concepts that are repeated and re-introduced multiple times in the manuscript. The authors should reorganize the manuscript to some extent, to improve readability and also to shorten it.
  • It is not clear what the advantage of end-to-end learning is if the method is unsupervised and is not combined with a supervised loss, such as the one used in graph classification problems.
  • In Section 2, I don't understand what the relationship of kernel methods to the proposed approach is. Please clarify.
  • L51: I disagree that kernel methods require re-training the model from scratch on "the extended dataset" (I assume the authors mean out-of-sample data). The basic approach only needs to evaluate the kernel between the new sample and the whole training set, which is different from "re-training the whole model" (see the first sketch after this list). Most importantly, there are many approximation techniques for reducing the computational burden; classic examples are the Nystrom approximation and Random Fourier Features.
  • L54: I think it is imprecise to write that negative sampling is an estimator of mutual information. It is rather a way to sample certain data points in a dataset where the positive examples (e.g., samples belonging to the correct class to predict) are few and the samples from the other classes are many. Negative sampling is basically a way to avoid considering all the possible samples that should NOT be predicted (see the second sketch after this list).
  • In section 2, in the Graph coarsening paragraph, the authors should mention not only Diffpool but also MinCutPool (as presented in the paper "Spectral clustering with graph neural networks for graph pooling"), which is the state-of-the-art method to perform graph coarsening using clustering.
  • L65: performs graph clustering using pooling -> performs graph coarsening using clustering
  • I think there is a problem with Tab. 1. Most methods, including GIN and Diffpool, have a complexity that is O(N^2) or O(E), where E is the number of edges. I would be very surprised to see any method operating on graphs whose complexity is linear in the number of nodes rather than in the number of edges. Also, I don't understand the "x" mark in the last column, "Unsupervised".
  • In Section 3 there is no definition of Loukas's coarsening procedure, which is a core method used in this paper.
  • L78: the main and original purpose of the WL method is to test whether two graphs are isomorphic.
  • L79: I think the authors confused "Negative sampling" with "Node2Vec" here.
  • L82: the letter "Z" is used to express two different things. I encourage using a different letter to define the function that maps a graph node into a real-valued vector
  • L87: What do the authors mean by "distribution of labels"? Are they referring to the sequence of node representations that are generated during the different iterations of the WL algorithm?
  • L103-104: it is not clear what the node and graph embeddings \theta_x and \theta_g are and how they are computed.
  • L113: what do the authors mean by "continuity in node features"? Maybe that the node features are real-valued vectors rather than discrete variables?
  • The definition of P(u) is not clear. In Alg. 1 it is referred to as the "image of node u". Do you mean that P(u) is the neighbourhood of node u? Please clarify/explain.
  • Section 4.1.1 seems to belong to the experimental section rather than to the methodology.
  • L141: I think what the authors are referring to here is what is called in the literature "exact graph matching" and "inexact graph matching". If that is the case, the authors should consider adopting such terminology.
  • L157: I believe that WL characterizes a graph at a global scale. Also, when the authors write "multiscale view" it seems like they are rather referring to local properties, which should be robust to small changes in the graph structure.
  • L164: the spectrum of the graph is a global property. Small changes in the graph might result in big changes in the spectrum.
  • L165: by "low-pass" the authors mean "components associated with a low frequency in the graph spectrum"?
  • L170: I think there is a typo: should be h^l and g^{l-1}
  • L174: The Wasserstein distance is not introduced, and it is not discussed why it should be used and what advantages it brings compared to other types of distances.
  • L182: H_\theta^l and F_\theta_l are not defined
  • L194: what does "uniform marginals" mean?
  • L197: What is an "induced topology"? Also, induced by what?
  • L204: By "recognizes" you mean "implement"?
  • L208: I don't see the link between "the number of nodes pooled" and "the exploding gradient". Please, explain.
  • L212: should the classifier distinguishing between positive and negative samples rather recognize whether a node is a neighbour or not?
  • L214: what is the "stop_gradient" operation? Also, why should backpropagation be prevented? This is neither explained nor clearly introduced.
  • L251: why not use standard Barabasi-Albert graphs when testing on graphs endowed with the scale-free property (see the third sketch after this list)?
  • Why use the Truncated Krylov rather than the more common ChebyNet (as presented in the paper "Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering")?
  • Should 5.3 be a section? I don't see the point. Please, clarify/explain
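The following minimal sketch (not taken from the paper or the authors' code) illustrates the point raised about L51: with a precomputed-kernel classifier, handling an out-of-sample point only requires evaluating the kernel between the new sample and the training set, not re-training the model. The RBF kernel and random feature vectors are placeholders for actual graph kernels/features.

```python
# Out-of-sample prediction with a precomputed kernel: no re-training needed.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 16))      # stand-in for precomputed graph features
y_train = rng.integers(0, 2, size=100)    # placeholder binary labels

K_train = rbf_kernel(X_train, X_train)    # train-vs-train Gram matrix
clf = SVC(kernel="precomputed").fit(K_train, y_train)

X_new = rng.normal(size=(5, 16))          # unseen samples
K_new = rbf_kernel(X_new, X_train)        # only new-vs-train kernel values are required
predictions = clf.predict(K_new)          # the fitted model is reused as-is
```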
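A small, hypothetical sketch of negative sampling as described in the comment on L54: rather than scoring a positive pair against every possible node, a handful of random "negative" nodes is drawn and only those are pushed apart. The embedding table, scores, and loss below are illustrative and not the paper's implementation.

```python
# Word2vec-style negative sampling over node embeddings.
import torch
import torch.nn.functional as F

num_nodes, dim, k_neg = 1000, 64, 5
embed = torch.nn.Embedding(num_nodes, dim)

def negative_sampling_loss(u, v_pos):
    """u, v_pos: LongTensors of node indices forming positive pairs."""
    v_neg = torch.randint(0, num_nodes, (u.size(0), k_neg))        # random negatives
    pos_score = (embed(u) * embed(v_pos)).sum(-1)                  # (batch,)
    neg_score = (embed(u).unsqueeze(1) * embed(v_neg)).sum(-1)     # (batch, k_neg)
    # Pull positive pairs together, push only the sampled negatives apart.
    return -(F.logsigmoid(pos_score).mean() + F.logsigmoid(-neg_score).mean())

loss = negative_sampling_loss(torch.tensor([0, 1]), torch.tensor([2, 3]))
```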
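Finally, regarding the comment on L251, a short networkx snippet (an assumption of mine, not taken from the paper) shows how standard Barabasi-Albert graphs with a scale-free degree distribution could be generated for such a test; the parameters are arbitrary examples.

```python
# Generate a Barabasi-Albert (preferential attachment) graph with networkx.
import networkx as nx

G = nx.barabasi_albert_graph(n=1000, m=3, seed=42)
degrees = [d for _, d in G.degree()]
print(max(degrees), sum(degrees) / len(degrees))  # heavy-tailed degree distribution
```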

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

The paper introduces a hierarchical graph2vec model that combines graph2vec, GNNs, and Loukas's coarsening method. The authors highlight that their method is inductive, unsupervised, and hierarchical.


Strengths:

  1. The paper provides a good amount of background and related work in order to explain the goal of the proposed method and identify their contribution.
  2. The paper provides some empirical results concerning the sensitivity of WL to structural noise in the graph, as shown in Section 4.1.1, by generating random graphs of different categories and removing edges. This insight is good.
  3. They provide the code for reproducibility.
  4. They use multiple datasets, multiple baselines and provide ablation studies.

Weaknesses:

  1. The proposed method seems to be a relatively straightforward combination of existing models (graph2vec, GNNs, Loukas's coarsening) and existing methods (negative sampling and mutual information in the optimization).
  2. While the paper tries to explain the methods, I find the details, equations, etc., disorganized and lacking consistent notation and detailed explanation, which makes the method harder to digest.
  3. For the supervised classification task, hg2v is trained on the whole dataset, and I suppose, given the lack of explanation, that the other models are not trained on the whole dataset. If this is the case, I don't think this is a fair comparison. All models should train only on the training set, regardless of whether they are inductive or not, and then be evaluated on the supervised task on the test set.
  4. From Table 3, we can see that the proposed model does not perform very well on the inductive inference task, which is supposed to be its strength. Even in cases where the training set and inference set come from the same data (ENZYMES and MNIST), the results are worse than the baseline. Only in 1 out of the 6 cases (excluding the MNIST-USPS experiment) is the proposed model better. It seems the model is not very good at inductive inference.
  5. From Table 4, it seems that by removing Loukas's coarsening the model can perform even better? Then why do we need it in the first place? (Maybe I did not understand this table correctly, but there is no explanation of which model the delta is with respect to. I assume the delta is with respect to the proposed model. If not, then in the ablation study you should juxtapose graph2vec, graph2vec+GNN, graph2vec+Loukas, and hg2v and compare them.)
  6. The language needs improvement.

Detailed comments:

  1. In Definition 1, Z is the function that maps u to its attribute vector, but then this vector is denoted X(u); shouldn't it be Z(u)? Why is Z then also the matrix representation of these attributes? Can you make this definition clearer and use consistent notation?
  2. What are x and y in P_{XY} in Equations 3 and 4, respectively? There is no explanation.
  3. On line 169, my understanding is that it should be "By symmetry, the same result follows for h^l and g^{l-1}"
  4. In Fig. 4, the correlation is not really obvious. Can you provide some quantitative measurement, e.g., the correlation coefficient and its significance (see the sketch after this list)? Also, what about different levels/scales of coarsening? Is there a trend for higher levels, e.g., a decreasing correlation as the coarsening level increases?
  5. According to Equations 8 and 9, x^l(u) and g^{l+1}(P(u)) use the exact same neighborhood (v \in N(u) \cup {u}); the only difference is the function. Why then is the former local-scale while the latter is larger-scale? Is something missing here?
  6. In Equation 11, why does g^{l+1} have a bias term but x^l does not? What is the [.,.] operation? Is it concatenation?
  7. In Table 3, the delta is over which baseline?
  8. In Table 4, the delta is over which model?
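Regarding point 4, a minimal sketch of the quantitative check suggested there, assuming the two quantities plotted in Fig. 4 are available as arrays (the values below are placeholders):

```python
# Report the Pearson correlation coefficient and its p-value instead of
# relying on visual inspection of the scatter plot.
from scipy.stats import pearsonr

x = [0.1, 0.4, 0.35, 0.8, 0.7]   # e.g., distances before coarsening (placeholder)
y = [0.2, 0.5, 0.30, 0.9, 0.6]   # e.g., distances after coarsening (placeholder)
r, p_value = pearsonr(x, y)
print(f"Pearson r = {r:.3f}, p = {p_value:.3g}")
```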

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

The authors did quite extensive and timely work to address all the comments raised in the previous round of revision. I have carefully read the replies to the reviewers' comments and the parts in red that have changed in the manuscript since the last round of revision.

I believe that the paper now reads better, the motivation is clear, the presentation of the methodology has improved, and the obtained results can be read and understood more clearly.

Overall, I am satisfied with the current version of the paper and I believe it can be published after a final proofread by the authors.


Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

I appreciate the authors' response. It looks like most of my comments have been addressed. There is one comment, however, that I feel is not fully addressed. For Table 3, from my understanding after the authors' explanation, it compares the same proposed model with domain adaptation vs. without domain adaptation. This comparison does not really tell us the performance of the proposed model vs. existing models. The comparison should be between the proposed model and existing models on their ability to adapt to different domains.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
