Article

Learning from Feature and Global Topologies: Adaptive Multi-View Parallel Graph Contrastive Learning

by Yumeng Song, Xiaohua Li, Fangfang Li and Ge Yu *
School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(14), 2277; https://doi.org/10.3390/math12142277
Submission received: 3 June 2024 / Revised: 10 July 2024 / Accepted: 19 July 2024 / Published: 21 July 2024

Abstract
To address the limitations of existing graph contrastive learning methods, which fail to adaptively integrate feature and topological information and struggle to efficiently capture multi-hop information, we propose an adaptive multi-view parallel graph contrastive learning framework (AMPGCL). It is an unsupervised graph representation learning method designed to generate task-agnostic node embeddings. AMPGCL constructs and encodes feature and topological views to mine feature and global topological information. To encode global topological information, we introduce an H-Transformer to decouple multi-hop neighbor aggregations, capturing global topology from node subgraphs. AMPGCL learns embedding consistency among feature, topology, and original graph encodings through a multi-view contrastive loss, generating semantically rich embeddings while avoiding information redundancy. Experiments on nine real datasets demonstrate that AMPGCL consistently outperforms thirteen state-of-the-art graph representation learning models in classification accuracy, whether in homophilous or non-homophilous graphs.

1. Introduction

Recently, graph contrastive learning (GCL) has been introduced as a novel self-supervised paradigm for graph representation learning (GRL), effectively leveraging the benefits of graph neural networks (GNNs) to capture informative representations. GCL generates multiple views by randomly augmenting the input data and optimizes GNN encoders by learning the consistency between these distinct views. GCL diminishes the reliance of graph representation learning on labels, yielding state-of-the-art performance in tasks such as node classification [1,2,3,4].
While GCL achieves impressive successes in various graph benchmarks, it faces three major challenges:
Challenge I: How can one adaptively integrate feature and topological information in graph contrastive learning? Graphs encompass two crucial elements of discriminative information: feature information and topological information. Existing GCL methods typically employ GNN encoders, such as GCN [5] and GAT [6], which fuse node features along the topological structure and learn the latent information of nodes through contrastive loss. However, these GNNs cannot adaptively learn deep correlation information between the topological structure and node features. For example, in Figure 1a, whether encoding the graph G or the two views, node $v_1$ aggregates information from two different-class nodes $v_5$ and $v_6$ and one same-class node $v_3$. The information from different-class nodes outweighs the same-class information, potentially misleading the encoding of $v_1$ and degrading the performance of existing GCL methods. Although Chen et al. [7] attempt to address this issue by proposing a GCL framework that separately encodes feature information and topological information, learning from these two types of information independently can neither avoid mining redundant semantic information nor preserve the critical semantic information of the original graph.
Challenge II: How can one capture the global structural topology? Existing GCL methods commonly use two layers of GNN as encoders to explore the topological structure [1,2,3,8,9,10,11,12]. Each GNN layer aggregates neighbor information once. This limits the GCL encoder to aggregating neighbor information within a small number of hops, so it can only explore local structural information while ignoring global structural information. To obtain global topological features, a straightforward approach is to stack multiple convolutional layers in GNNs to capture multi-hop neighbor information. However, despite the success of this popular node interaction paradigm, the number of neighbors per node grows exponentially. For example, in a graph where each node has degree d, each node v has d neighbors at one hop and $d^h$ neighbors at the h-th hop, so aggregating all h-hop neighbors requires aggregating $\sum_{i=1}^{h} d^i$ nodes. This poses well-known limitations in terms of scalability and over-smoothing.
Challenge III: How can one parallelize the encoding of multiple views? GCL, as an unsupervised pre-trained model, often requires large-scale iterative training and many model parameters. Parallel learning can effectively reduce training and encoding time as well as memory overhead; however, existing scalable GNNs [13,14,15,16,17], which are capable of supporting parallel training, often yield suboptimal embedding results. Furthermore, since GCL constructs different views and corresponding encoding models, different parallel strategies are necessary to accelerate the synchronization of multiple models.
To address the above issues, we propose Adaptive Multi-view Parallel Graph Contrastive Learning (AMPGCL). It enables the adaptive aggregation of feature and topology information, reducing redundancy and enhancing the ability to fully explore semantic information in a limited-dimensional embedding space, as shown in Figure 1b. AMPGCL encodes the two views and the original graph separately. In the feature view, node $v_1$ receives information from the same-class node $v_4$ and the different-class node $v_6$, as the topology of G is updated based on feature similarity. Compared to View 1 in Figure 1a, this reduces the aggregation by one different-class node. In the global topological view, node $v_1$ receives information from all same-class nodes in the graph. Compared to previous GCL methods, AMPGCL abandons the single feature-fusion approach, using multiple views to adaptively mine more semantic information from the graph. First, we introduce a feature–topology multi-view encoding module consisting of feature encoders and topology encoders. The module extracts the feature and topology information of the graph and fuses the feature and topology information of the original graph. Second, to optimize the learning of the global topological structure, we propose a hop Transformer encoder (H-Transformer) as the topology encoder. It learns the information of h-hop neighbors simultaneously through a multi-head attention module, avoiding the over-smoothing and scalability limitations caused by multiple iterations of learning. We also propose a parallel encoding strategy, in which nodes are encoded in parallel by the H-Transformer, to accelerate the training of contrastive learning. Third, to reduce redundancy and enhance the ability to fully explore semantic information within a limited-dimensional embedding space, we introduce a multi-view contrastive loss comprising contrastive losses between embeddings of different views. Specifically, a feature–topology contrastive loss ensures that the feature and topology information learned by the two encoders preserve the structural and feature differences of the original graph.
We summarize the contributions as follows:
  • We propose an adaptive multi-view parallel GCL framework that constructs feature and topology views and embeds these views and the original graph in parallel using different encoders for self-supervised multi-view contrastive learning. This enables the generation of node representations that adaptively fuse feature and topology information, applicable to both homophilous and non-homophilous graphs.
  • We propose a novel hop Transformer encoder to capture global structural information. It learns from multiple-hop neighbors with multi-head attention modules. We also introduce a parallel encoding learning strategy to enhance training efficiency.
  • We introduce a multi-view contrastive learning loss between the feature view and the original graph, the topology view and the original graph, and the feature view and the topology view. The loss function can adaptively capture both the feature and structural information of the original graph.
  • Experimental results on nine real datasets demonstrate that, when compared to thirteen baseline methods, AMPGCL achieves the best performance on both homophilous and non-homophilous graphs. Compared to state-of-the-art methods, AMPGCL improves the node classification performance by 5.22%.
We proceed to review related work in Section 2 and to introduce preliminaries in Section 3. Section 4 presents AMPGCL, and Section 5 covers the experimental study. Section 6 concludes and offers research directions.

2. Related Work

2.1. Graph Neural Network

Graph Neural Networks (GNNs) encode graph-structured data by utilizing neural networks. The key challenge in GNNs lies in defining the graph convolution operator. GNNs can be divided into two main methods: spectral and spatial methods. Spectral methods [5,18,19] transfer signals from the spatial domain to the spectral domain using graph Fourier transform and perform convolution operations in the spectral domain. Spectral GNN [18] is the first attempt to implement a Convolutional Neural Network (CNN) on graphs. ChebyNet [19] introduces a parameterization method using Chebyshev polynomials for the fast positioning of spectral filters. GCN [5] proposes a simplified version of ChebyNet and achieves success in semi-supervised learning tasks on graphs. However, these spectral methods face challenges in generalization due to the fixed graph sizes, and larger filter sizes increase computational and memory costs. Spatial methods [6,20,21] draw inspiration from the weighted summation in CNN to extract spatial features from the topological structure of the graph. However, these methods use non-learnable aggregators, and the positioning of the convolution operation remains uncertain. While these GNNs perform well in graph analysis tasks, the simple stacking of a few convolutional layers can only aggregate partial structural information, and excessive stacking can lead to over-smoothing and heterophily problems [13,22].
To address these problems, the state-of-the-art GNNs adopt residual connections [23,24] or adaptive aggregation strategies [13,14,25,26] to extend standard GNNs. For example, GCNII [15] is a deep GNN that combines GCN with initial connections and identity mapping. Geom-GCN [13] and WRGAT [27] transform the original graph by discovering non-local semantic similarity neighbors. H2GCN [14] improves the performance of GCN under heterogeneous graphs by utilizing self-separation, neighbor separation, and high-order combinations. GGCN [16] incorporates node-local neighborhood signature information and a degree correction mechanism for node-wise rescaling. FAGCN [17] and ACM-GCN [24] apply low-pass and high-pass filters in each convolutional layer.
Compared to these GNNs, AMPGCL is an unsupervised learning framework that is trained without labels. Moreover, although these GNNs have high expressive power, they exhibit high computational costs and limited scalability on large-scale graphs. Hence, they are not suitable for GCL, which requires multi-epoch pre-training to mine global graph features.

2.2. Graph Contrastive Learning

Benefiting from GNN and Contrastive Learning (CL), many graph contrastive learning (GCL) methods have achieved state-of-the-art performance in unsupervised node/graph representation learning tasks [1,2,3,7,8,9,10,12,28,29,30,31]. InfoGraph [28] and DGI [1] utilize graph structures at different nodes, edges, and subgraphs as contrastive views to learn graph representations. GRACE [12] and GraphCL [9] use random perturbations to the graph structure (e.g., node and edge dropout, graph sampling) to generate contrastive views. MVGRL [8] relies on graph diffusion to generate augmented views of the original graph. BGRL [29] is a contrastive learning method based on BYOL [32] that eliminates negative samples. GCA [10] designs enhancement schemes based on different graph attributes (such as node degree, feature vector, and PageRank [33]). MAGCL [2] proposes a paradigm with different random view encoder architectures to generate diverse views. GCIL [30] exploits the self-supervised node classification task from the standpoint of causality and proposes a contrastive learning loss based on the theory of causality. PiGCL [31] dynamically captures and ignores a portion of self-supervised negative samples based on gradient detection.
Most GCL methods use GNN as backbone encoders and achieve significant performance on homophilous datasets. However, these GCL methods are limited in their performance on non-homophilous datasets due to the bottleneck of GNN. Therefore, state-of-the-art GCL methods focus on considering the homophily of the data. ASP [7] learns different levels of homophily through two contrastive learning modules. HomoGCL [3] expands the positive samples through homophily in node-level representation learning.
These two methods not only improve performance on non-homophilous graphs but also achieve good performance on homophilous graphs. However, learning feature and structural topological information independently ignores the real differences between the distribution of node features and the distribution of graph topology. The generated embeddings, which rely on homophily assumptions, then mix two completely different types of information, deviate from the original features of the graph nodes, and degrade learning performance. AMPGCL instead combines feature-distribution and topology-distribution information adaptively through multiple views.

3. Preliminaries

Definition 1. 
A  graph  is denoted as $G = (V, E)$, where $V = \{v_1, v_2, \ldots, v_n\}$ is a node set and $E \subseteq V \times V$ is an edge set.
Definition 2. 
The  feature matrix  of a graph $G = (V, E)$ is denoted as $X \in \mathbb{R}^{n \times F}$, where each row $x_i$ of $X$ is an F-dimensional feature vector of a node $v_i$.
Definition 3. 
The  adjacency matrix  of a graph $G = (V, E)$ is denoted as $A \in \mathbb{R}^{n \times n}$, where
$$A_{ij} := \begin{cases} 1 & \text{if } (v_i, v_j) \in E \\ 0 & \text{otherwise.} \end{cases}$$
Definition 4. 
The  class label vector  of a graph $G = (V, E)$ is denoted as $\mathbf{y} \in \mathbb{N}^n$, where each element $y_i$ of $\mathbf{y}$ represents the class label of $v_i$ and $\mathbb{N}$ is the set of natural numbers.
Definition 5. 
The  homophily ratio  of a graph, denoted by $homo(G)$, is given by
$$homo(G) := \frac{|\{(v_i, v_j) : (v_i, v_j) \in E \wedge y_i = y_j\}|}{|E|},$$
where the numerator is the number of edges that connect nodes with the same labels and $|E|$ is the number of edges of G.
The homophily of G refers to the likelihood of nodes with the same label being close to each other in G. A graph with a homophily ratio close to 1 often has more edges connecting nodes of the same category, while a graph with a homophily ratio close to 0 has more edges connecting nodes of different categories. A homophily ratio greater than 0.5 indicates a homophilous graph, whereas a ratio less than 0.5 indicates a non-homophilous graph.
Example 1. 
Figure 2 provides an example of a homophilous graph and a non-homophilous graph. For a graph G involving six nodes $v_i$ $(1 \le i \le 6)$, the structural topology is shown in Figure 2a. If the label distribution of graph G is as shown in Figure 2b, out of the 7 edges, the edges connecting nodes with the same label are $(v_1, v_2)$, $(v_2, v_3)$, $(v_4, v_5)$, $(v_5, v_6)$, and $(v_4, v_6)$, while the remaining edges connect nodes with different labels. The homophily ratio of G is 0.71, indicating that it is a homophilous graph. If the label distribution of graph G is as shown in Figure 2c, the edges connecting nodes with the same label are $(v_2, v_3)$, $(v_2, v_5)$, and $(v_4, v_6)$. The homophily ratio of G is 0.43, indicating that it is a non-homophilous graph.
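For concreteness, a minimal Python sketch of Definition 5 (the edge list below is our reconstruction of the Example 1 graph; it is not given explicitly in the text):

```python
def homophily_ratio(edges, labels):
    """Fraction of edges whose endpoints share a label (Definition 5)."""
    same = sum(1 for i, j in edges if labels[i] == labels[j])
    return same / len(edges)

# Edges of the example graph G, 0-indexed (v1 -> 0, ..., v6 -> 5);
# inferred from Example 1 rather than stated in the paper.
edges = [(0, 1), (1, 2), (0, 3), (1, 4), (3, 4), (4, 5), (3, 5)]

labels_b = [0, 0, 0, 1, 1, 1]            # a Figure 2b-style labeling
labels_c = [2, 0, 0, 1, 0, 1]            # a Figure 2c-style labeling
print(homophily_ratio(edges, labels_b))  # 5/7 ~ 0.71 -> homophilous
print(homophily_ratio(edges, labels_c))  # 3/7 ~ 0.43 -> non-homophilous
```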
Definition 6. 
Given a graph $G = (V, E)$, the feature matrix $X$, and the adjacency matrix $A$ of G, adaptive multi-view parallel graph contrastive learning (AMPGCL) performs unsupervised graph representation learning to ensure consistent representations for similar instances while maintaining inconsistent representations for different instances.

4. AMPGCL

4.1. Framework

AMPGCL builds on multi-view contrastive learning [34]: it constructs feature and topology views and generates node embeddings that contain different types of information through contrastive learning, as shown in Figure 3.
  • View construction. Based on the feature matrix $X$ of graph G, we reconstruct the adjacency matrix from features to generate the k-NN feature view $V_f$. Based on the adjacency matrix $A$ of graph G, we construct h-hop node features as the h-hop topology view $V_t$. We elaborate on the construction of these two views in Section 4.2.
  • Graph encoding. Given the two constructed views and the original graph, we encode topology and features and generate four graph representations $Z_i$ $(i = 1, 2, 3, 4)$ using four encoders, including two GCN encoders and two H-Transformer encoders. We detail the graph encoders in Section 4.3.
  • Parallel graph encoding. We propose a multi-view parallel encoding strategy for AMPGCL and introduce mini-batch training for topology encoding. We detail the parallel encoding strategy in Section 4.4.
  • Training loss. We propose four unsupervised contrastive learning training losses. The topology contrastive loss, the feature contrastive loss, and the node contrastive loss maximize the agreement between the embeddings of the global and local topology, between the embeddings of the feature view and the original graph, and between the embeddings from the two original-graph encoders, respectively. The feature–topology contrastive loss maximizes the consistency between the topology–feature differences of G and the differences between the embeddings of the two views. The training loss is covered in Section 4.6.

4.2. View Construction

Many studies have made significant progress in multi-view contrastive learning. However, multi-view contrastive learning may lead to a degradation of representations for graphs that contain rich semantic information [34]. This is because (i) graph augmentation methods, such as node and edge perturbation [9,12], may disrupt the topological structure and features of the original graph, making it difficult to preserve the semantic meaning of the original graph; (ii) the features of augmentation views from the original graph can only be fused along the topological paths, making it difficult to avoid the interference between topology and feature [35], resulting in a decline in performance when learning from non-homophilous graphs.
A previous study [36] demonstrates that topology-based feature aggregation methods (e.g., GCN) can show degraded performance compared to methods encoding only features (e.g., MLP) or only topology (e.g., DeepWalk) in some cases. For instance, GCN’s performance drops by 25% compared to MLP in graphs with random topology and label-correlated features, and by 13% compared to DeepWalk in graphs with label-correlated topology and random features. This is because topology-based feature aggregation cannot adaptively learn from both features and topology, leading to performance degradation. Therefore, to capture the underlying structure of nodes in the feature space and obtain the global topological structure from the original graph, we construct a k-nearest-neighbor (k-NN) feature view and an h-hop topology view. These views extract both feature information and topological information from the original graph and alleviate the interference between them.

4.2.1. Feature View Construction

To capture the rich feature-similarity relationships between nodes, the k-NN feature view $V_f = (A_f, X)$ is built from the node feature matrix $X$, where $A_f$ is the adjacency matrix of $V_f$. The k-NN graph constructs a feature view using feature-similarity information: for each node, the top-k nodes with the most similar features are selected as neighbors. A larger k means each node learns from more feature-similar nodes, but it also introduces more noise. In homophilous graphs, where similar nodes are connected, k should be small to keep the feature view close to the original graph. In non-homophilous graphs, where connected nodes are not necessarily similar, a large k is chosen to gather information from more similar nodes despite the noise. Firstly, we compute the similarity matrix $S \in \mathbb{R}^{n \times n}$ among the n nodes. There are many methods to obtain $S$, such as Cosine similarity [37], Euclidean distance [38], Manhattan distance [39], and Jaccard similarity [40]. Additionally, dimensionality reduction techniques such as PCA [41] and t-SNE [42] can be used to preprocess the feature matrix $X$ before computing $S$, enhancing the quality of the feature view construction. For simplicity and computational efficiency, we use Cosine similarity, a common approach that quantifies the similarity of two vectors by the cosine of the angle between them. Formally, for the feature vectors $x_i$ and $x_j$ of nodes i and j, their similarity can be represented as follows:
$$S_{ij} = \frac{x_i \cdot x_j}{\|x_i\| \, \|x_j\|}.$$
Then, we construct the adjacency matrix $A_f$ of $V_f$ based on $S$. For nodes i and j, the adjacency value $A_{f,ij}$ is defined as
$$A_{f,ij} := \begin{cases} 1 & \text{if } v_j \in M_i \\ 0 & \text{otherwise,} \end{cases}$$
where $M_i$ denotes the set of the top-k nodes most feature-similar to node i. With the feature view $V_f$, we can train node embeddings that encode the specific information in the feature space.
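A minimal PyTorch sketch of this construction (our reconstruction; excluding a node from its own neighbor set $M_i$ is an assumption the text does not settle):

```python
import torch
import torch.nn.functional as F

def knn_feature_view(X: torch.Tensor, k: int) -> torch.Tensor:
    """Build A_f of the k-NN feature view V_f from features X (n x F),
    using the cosine similarity of Eq. (3) and the top-k rule of Eq. (4)."""
    Xn = F.normalize(X, dim=1)        # unit rows, so Xn @ Xn.T is cosine similarity
    S = Xn @ Xn.t()                   # S_ij = cos(x_i, x_j)
    S.fill_diagonal_(float("-inf"))   # assumption: a node is not its own neighbor
    idx = S.topk(k, dim=1).indices    # indices of the k most feature-similar nodes
    A_f = torch.zeros(X.size(0), X.size(0), device=X.device)
    A_f.scatter_(1, idx, 1.0)         # A_f[i, j] = 1 iff v_j is in M_i
    return A_f
```

Note that $A_f$ built this way is generally asymmetric, since the top-k relation is not symmetric; whether the paper symmetrizes it is not specified.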

4.2.2. Topology View Construction

The extraction of information from distant neighbors is crucial for generating representative node embeddings in terms of topological information [43,44]. Existing methods use the topological information of the original graph as input to GNN encoders. However, GNNs require stacking h layers to reach nodes within h hops, resulting in high computational cost and limited scalability [45]. Additionally, the edge diffusion methods [8] used in GCLs primarily rely on personalized PageRank (PPR) [33] or the heat kernel [46] for global views, which involve computationally expensive operations such as matrix exponentiation or matrix inversion. Therefore, we preprocess multi-hop neighbors based on the feature matrix $X$ and adjacency matrix $A$ of graph G to generate the h-hop topology view $V_t = (X_t)$, where $X_t \in \mathbb{R}^{n \times h \times d}$ is the feature tensor of the h-hop neighbors of all nodes. In the topology view, h is the number of hops considered for topological neighbors. A larger h means extracting information from more hops. In homophilous graphs, where nodes are similar to their neighbors, few hops are needed. In non-homophilous graphs, similar nodes may be farther apart and require more hops to reach, necessitating a large h. Formally, the feature vector $x_{t,i}$ of node i can be represented as
$$x_{t,i} = [x_i^0; x_i^1; \ldots; x_i^h], \qquad x_i^j = \hat{A}_i^j X \ (1 \le j \le h),$$
where $\hat{A}_i^j$ is the i-th row of the j-th power of the normalized adjacency matrix and $x_i^0 = x_i$. The topology view construction does not require any learnable parameters, eliminating the computational complexity of node aggregation during the training process.
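A minimal PyTorch sketch of this preprocessing (the row-normalization scheme and the inclusion of the raw features as hop 0 are our assumptions):

```python
import torch

def hop_features(A: torch.Tensor, X: torch.Tensor, h: int) -> torch.Tensor:
    """Precompute the topology view X_t of Eq. (5): for each node, the sequence
    [x^0; x^1; ...; x^h] of features propagated along the normalized adjacency."""
    deg = A.sum(dim=1).clamp(min=1.0)
    A_norm = A / deg.unsqueeze(1)     # row normalization (one possible choice)
    hops, cur = [X], X                # hop 0: the node's own features
    for _ in range(h):
        cur = A_norm @ cur            # one more power of the normalized adjacency
        hops.append(cur)
    return torch.stack(hops, dim=1)   # (n, h + 1, d); no learnable parameters
```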

4.3. Graph Encoding

Graph encoding consists of topology encoding and feature encoding, which extract different pieces of graph information from the topology view and the feature view, respectively. The feature matrix $X$ is aggregated through each of the two views, generating two view-specific embeddings. To enable node embeddings to learn specific knowledge adaptively and apply it to contrastive learning, we also include an encoding of the original graph alongside each view encoding. We detail topology and feature encoding below.

4.3.1. Topology Encoding

Common GNNs are not suitable for obtaining sufficient global topological information spanning multiple hops because they require multiple iterations of aggregation. In the constructed topology view $V_t$, each node i has a feature matrix $X_{t,i}$ covering h hops, which can be considered a sequence of the ordered feature vectors of the node’s multiple-hop neighbors. For encoding sequences, the Transformer [47] is a better choice: with its self-attention mechanism, the Transformer processes the entire input sequence in parallel without requiring recursive processing at each time step, and it has achieved significant performance on various sequence-based tasks by exploring hidden information. Consequently, we propose the hop Transformer (H-Transformer) as a graph encoder for learning multi-hop topological information. The H-Transformer can simultaneously encode the h-hop neighbors of a node, avoiding the need to stack h GCN layers to obtain h-hop information, which could lead to over-smoothing. The framework of the H-Transformer for embedding a 6-hop topology view is illustrated in Figure 4, which includes input construction and a hop interaction encoder. Input construction refers to the process of constructing the h-hop topology view. The encoder independently encodes the multi-hop neighbor features of each node and finally merges all embeddings into one embedding per node.
Given a node v, its feature sequence is $\{x^0, x^1, \ldots, x^h\}$. We first employ a shared parameterized linear layer to encode the feature vectors of all hops. Following an existing study [48], we use a set of trainable parameters to encode the order of hops. The output of the linear transformation and hop-order encoder for $x^i$ $(0 \le i \le h)$ is
$$\hat{x}^i = f(x^i) + o^i, \qquad o^i = [\cos(\omega_1 i), \sin(\omega_1 i), \ldots, \cos(\omega_{d^*} i), \sin(\omega_{d^*} i)],$$
where $f(\cdot)$ is a linear transformation layer, $o^i$ is the i-hop order vector, $\omega$ contains the trainable parameters, and $d^*$ is the output dimensionality of $f(\cdot)$.
The resulting sequence $\{\hat{x}^0, \hat{x}^1, \ldots, \hat{x}^h\}$ is then fed into stacked Transformer encoder layers, where each layer consists of a multi-head self-attention layer and a feed-forward neural network, generating an embedding for each hop. The l-th layer output embedding for the i-hop vector $\hat{x}^i$ is denoted as
$$z_t^{i,(l)} = \mathrm{FFN}(\mathrm{MultiHeadAttention}(z_t^{i,(l-1)})),$$
where $\mathrm{MultiHeadAttention}(\cdot)$ is the multi-head self-attention layer, $\mathrm{FFN}(\cdot)$ is the feed-forward neural network, and $z_t^{i,(0)} = \hat{x}^i$.
After L layers of the Transformer encoder, we obtain the final embedding of the topology view by applying fusion and projection functions. First, we adopt average fusion over all hop vectors of $Z_t^{(L)} = [z_t^{0,(L)}, z_t^{1,(L)}, \ldots, z_t^{h,(L)}] \in \mathbb{R}^{h \times d^*}$, resulting in a single representation $z_t^{(L)} \in \mathbb{R}^{d^*}$. Then, a simple linear projector produces the final embedding of the topology view $z_t \in \mathbb{R}^d$. Formally, the output of the H-Transformer for node v is denoted as
$$z_t^{(L)} = \mathrm{Fusion}(Z_t^{(L)}), \qquad z_t = \mathrm{Proj}(z_t^{(L)}),$$
where $\mathrm{Fusion}(\cdot)$ is the hop fusion function and $\mathrm{Proj}(\cdot)$ is the projection function.
Instead of directly comparing the node embeddings of the original graph and the topology view, we utilize the topology view as complementary information to the graph. Specifically, we first construct a multi-hop input with a small h value (e.g., $h = 2$) as the input for the original graph and feed the constructed multi-hop vectors into the H-Transformer. It is important to note that the H-Transformer encoder for the original graph shares the same weights as the H-Transformer encoder for the topology view. The output of the original graph after the H-Transformer encoder is denoted as $Z_2$. We add the embedding vectors of the topology view to the topological embedding vectors of the original graph to obtain the final embedding of the topology view:
$$Z_1 = Z_t + Z_2.$$
In this way, the topological encoding of the original graph adaptively learns the global information from the topological view while preserving the original semantic information of the nodes in the original graph.
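A compact PyTorch sketch of the H-Transformer (our reconstruction: the layer hyperparameters are illustrative, and the hop-order encoding concatenates rather than interleaves the cos/sin pairs):

```python
import torch
import torch.nn as nn

class HTransformer(nn.Module):
    """Hop Transformer sketch: encodes each node's (h + 1)-long hop sequence
    with stacked self-attention layers and fuses hops by averaging."""
    def __init__(self, in_dim, d_star, d_out, num_layers=2, num_heads=4):
        super().__init__()
        assert d_star % 2 == 0 and d_star % num_heads == 0
        self.lin = nn.Linear(in_dim, d_star)                 # shared f(.) across hops
        self.omega = nn.Parameter(torch.randn(d_star // 2))  # trainable frequencies
        layer = nn.TransformerEncoderLayer(d_model=d_star, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.proj = nn.Linear(d_star, d_out)                 # Proj(.)

    def forward(self, x_t):                                  # x_t: (batch, h + 1, in_dim)
        s = x_t.size(1)
        hops = torch.arange(s, device=x_t.device, dtype=x_t.dtype)
        ang = hops.unsqueeze(1) * self.omega.unsqueeze(0)    # (h + 1, d_star / 2)
        o = torch.cat([torch.cos(ang), torch.sin(ang)], dim=-1)  # hop-order vectors o^i
        z = self.encoder(self.lin(x_t) + o.unsqueeze(0))     # x_hat^i = f(x^i) + o^i
        return self.proj(z.mean(dim=1))                      # average Fusion, then Proj
```

Because every node’s hop sequence is a fixed-length, independent input, a whole batch of nodes can be encoded in one forward pass, which is what the parallel strategy in Section 4.4 exploits.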

4.3.2. Feature Encoding

For the feature view $V_f$, each node’s first-order neighbors consist of the k nodes whose features are most similar to those of the center node. In non-homophilous graphs, where downstream task labels are less correlated with topology, mining semantic information from nodes with similar features can enhance the performance of GCL. Therefore, we use GCN [5] as the feature encoder to capture rich feature information. GCN is a graph learning method that aggregates node features along the topology. Since the constructed feature view updates the original graph topology by connecting feature-similar nodes, the goal of encoding the feature view is to mine hidden information from the first-order neighbors’ features in the constructed feature view, for which GCN is an intuitive and effective method. Compared to the H-Transformer, it has a simpler architecture with fewer parameters. Since neighbors beyond the first hop in the view are not directly similar to the center node, we only utilize a single GCN layer. Specifically, for the feature view $V_f = (A_f, X)$, the output $Z_f$ of the GCN can be represented as
$$Z_f = \sigma(\tilde{D}_f^{-\frac{1}{2}} \tilde{A}_f \tilde{D}_f^{-\frac{1}{2}} X W_f),$$
$$\tilde{A}_f = A_f + I, \qquad \tilde{D}_{f,ii} = \sum_j \tilde{A}_{f,ij},$$
where $W_f$ is the weight matrix of the GCN layer, $\sigma(\cdot)$ is the activation function, $\tilde{A}_f$ is the adjacency matrix $A_f$ with added self-loops, $I$ is the identity matrix, and $\tilde{D}_f$ is the diagonal degree matrix of $\tilde{A}_f$.
Similar to topology encoding, we incorporate the feature view as supplementary to the original graph. Specifically, we first encode the original graph with feature matrix $X$ and adjacency matrix $A$ by an L-layer GCN. The output of the l-th layer $(1 \le l \le L)$ is denoted as
$$Z_{f,o}^{(l)} = \sigma\big(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} Z_{f,o}^{(l-1)} W_f^{(l)}\big),$$
where $\tilde{A}$ and $\tilde{D}$ are the self-loop-augmented adjacency and degree matrices of the original graph, the initial feature matrix is $Z_{f,o}^{(0)} = X$, and $W_f^{(l)}$ is the l-th learnable parameter matrix of the GCN. Note that the two GCN encoders for feature encoding share parameters.
We take the output of the last GCN layer for the original graph as the feature-encoding embedding of the original graph, denoted as $Z_3 = Z_{f,o}^{(L)}$. We add the embedding of the feature view to the feature-encoding embedding of the original graph, resulting in the final embedding representation of the feature view:
$$Z_4 = Z_3 + Z_f.$$
The fusion of the feature-encoding embedding of the original graph and the embedding of the feature view ensures that both embeddings can learn important semantics for each node during contrastive learning, without being hindered by the significant topological differences between $V_f$ and G.
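A minimal single-layer GCN sketch matching Equations (10) and (11) (assuming $\sigma = \mathrm{ReLU}$, which the text does not specify):

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """Single GCN layer with the symmetric normalization of Eqs. (10)-(11)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)      # weight matrix W_f

    def forward(self, A_f, X):
        A_tilde = A_f + torch.eye(A_f.size(0), device=A_f.device)  # self-loops
        d_inv_sqrt = A_tilde.sum(dim=1).pow(-0.5)            # degrees >= 1 after self-loops
        A_hat = d_inv_sqrt.unsqueeze(1) * A_tilde * d_inv_sqrt.unsqueeze(0)
        return torch.relu(A_hat @ self.W(X))                 # sigma(A_hat X W_f)
```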

4.4. Parallel Graph Encoding

Contrastive learning, as an unsupervised learning model, typically requires extensive training [49,50]. Although multi-view learning can effectively extract various types of information, it also presents challenges for model training because it involves multiple views and complex encoding models. The number of edges in the constructed views can exponentially increase with changes in hyperparameters. Additionally, in multiple iterations, each iteration requires complex updates of edge data. Considering that parallel processing has become synonymous with computational efficiency, we designed parallel multi-view encoders and mini-batch training for H-Transformers to enhance the embedding efficiency of AMPGCL.
In AMPGCL, there are four encoders, and since there is no correlation between the different views, these four encoders can perform model parallel encoding. For the two GCN encoders in feature encoding, they are simple neural networks that can usually fit within the memory of a single machine (see Equations (10) and (12)). Moreover, these GCN encoders have fewer convolutional layers, with one of them containing only a single-layer GCN. Therefore, when the graph complexity is not particularly high, these two models occupy less memory and have high encoding efficiency. In the case of complex high-dimensional graphs, GCN can be replaced with other scalable GNNs [13,14,16,17] for distributed training.
The two complex H-Transformers are sampling-based models that can approximate full-batch training through mini-batch training. H-Transformer samples the h-hop neighbors of each node, generating subgraphs as mini-batches that are independent of each other and can be processed in parallel. It avoids the scalability limitations and over-smoothing caused by aggregating and updating all nodes’ neighbors. Although mini-batch parallelism may face load imbalance issues, existing research has explored methods to mitigate this problem [51].
We update gradients across all workers at the end of the feature-encoding GCN and at the end of each mini-batch during topology encoding. During mini-batch processing, synchronization of model parameters and gradient updates is performed at the end of each mini-batch to ensure consistency across all workers.
Several solutions have been proposed for scaling GCN encoders to large datasets; for instance, GraphSAGE [21], Cluster-GCN [52], and GraphSAINT [53] offer scalable GNNs. The H-Transformer can handle encoding across multiple nodes and multiple hops: it encodes nodes independently and can thus be distributed across different machines, while the multi-hop encoding leverages the parallel nature of Transformers [47], allowing efficient parallel processing. Additionally, the computational overhead can be reduced by limiting the number of hops.
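As a rough sketch of this mini-batch strategy (assuming the `hop_features` and `HTransformer` sketches above; cross-worker synchronization would be delegated to a framework such as PyTorch DistributedDataParallel, which we omit):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative shapes only; X_t would come from hop_features(A, X, h).
n, h, d = 2708, 6, 1433
X_t = torch.randn(n, h + 1, d)                          # placeholder hop sequences
model = HTransformer(in_dim=d, d_star=256, d_out=256)   # sketch from Section 4.3.1

# Node-wise independence lets nodes be sharded into mini-batches (here b = 512).
loader = DataLoader(TensorDataset(X_t), batch_size=512, shuffle=True)
for (batch,) in loader:
    z = model(batch)   # encode one mini-batch of hop sequences in parallel
    # A contrastive loss and an optimizer step would follow here; under
    # DistributedDataParallel, the gradient all-reduce happens per mini-batch.
```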

4.5. Time Complexity Analysis

The feature encoder is a single-layer GCN. The time complexity of the GCN to encode all nodes in $G = (V, E)$ is $O(|E|d + |V|d^2)$, where $|E|$ is the number of edges, $|V|$ is the number of nodes, and d is the dimension of the node features. This comprises the multiplication of the adjacency matrix with the feature matrix, $O(|E|d)$, and the linear transformation, $O(|V|d^2)$. The topological encoder is a two-layer H-Transformer. Since each node is encoded independently, the per-node complexity is $O(hd^2 + h^2d)$, where h is the number of hops; this comprises the self-attention mechanism, $O(h^2d + hd^2)$, and the feed-forward neural network, $O(hd^2)$. Encoding $|V|$ nodes is related to the mini-batch size b, so the overall complexity for encoding $|V|$ nodes is $\lceil |V|/b \rceil \cdot O(hd^2 + h^2d)$.
Existing graph encoders, such as those using GCN to extract h-hop information, require iterating through h GCN layers, with time complexity $h \cdot O(|E|d + |V|d^2)$. Compared to the H-Transformer, since $|V| \gg h$ and $|E| \gg h$, and $\lceil |V|/b \rceil$ can be adjusted via b, in most cases $\lceil |V|/b \rceil \cdot O(hd^2 + h^2d) < h \cdot O(|E|d + |V|d^2)$. Thus, the H-Transformer is more efficient for extracting multi-hop information.
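Plugging illustrative values into these bounds (our numbers, chosen only to convey orders of magnitude): with $|V| = 10^4$, $|E| = 10^5$, $d = 256$, $h = 6$, and $b = 512$,
$$h \cdot (|E|d + |V|d^2) = 6 \times (10^5 \cdot 256 + 10^4 \cdot 256^2) \approx 4.1 \times 10^9,$$
$$\lceil |V|/b \rceil \cdot (hd^2 + h^2d) = 20 \times (6 \cdot 256^2 + 36 \cdot 256) \approx 8.0 \times 10^6,$$
so, under these assumptions, the H-Transformer bound is roughly three orders of magnitude smaller.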

4.6. Multi-View Contrastive Loss

The graph encoding yields four embeddings: the embedding of the topology view $Z_1$, the topological embedding of the original graph $Z_2$, the feature embedding of the original graph $Z_3$, and the embedding of the feature view $Z_4$. To learn the similarity between these embeddings, we introduce a multi-view contrastive loss. Specifically, we use the topology contrastive loss and the feature contrastive loss to compare the differences between embeddings learning the same information; the node contrastive loss to compare the differences between the two embeddings of the original graph; and the feature–topology contrastive loss to compare the differences between embeddings learning different types of information. We detail the four contrastive losses in the following, where $z_c^i$ denotes the embedding of node i from the embedding matrix $Z_c$, with $c \in \{1, 2, 3, 4\}$.
  • Topology Contrastive Loss. This loss compares the embedding distributions of each node from the topology view $Z_1$, which has learned global topological information, and the original graph $Z_2$, which has learned local topological information. We use InfoNCE [54] to estimate the bound of mutual information between node embeddings. For node i, $z_1^i$ and $z_2^i$ form a pair of positive samples, while the embeddings of different nodes in different views form pairs of negative samples. For example, in Figure 5a, for node $v_1$, the embeddings of $v_1$ in $Z_1$ and $Z_2$ are positive examples for each other, while the embeddings of $v_1$ and $v_j$ $(2 \le j \le 6)$ in different views are negative examples. Therefore, the topology contrastive loss can be defined as
    $$\mathcal{L}_{\mathrm{topo}} = \frac{1}{2n} \sum_{v_i \in V} \big( \ell_{\mathrm{topo}_1}(z_1^i) + \ell_{\mathrm{topo}_2}(z_2^i) \big).$$
    Here, $\ell_{\mathrm{topo}_1}$ and $\ell_{\mathrm{topo}_2}$ in Formula (14) denote the topology contrastive losses of the two views. Since the two loss terms are symmetric, we only show $\ell_{\mathrm{topo}_1}(z_1^i)$:
    $$\ell_{\mathrm{topo}_1}(z_1^i) = -\log \frac{\exp(\mathrm{sim}(z_1^i, z_2^i)/\tau_{\mathrm{topo}})}{\sum_{v_j \in V} \exp(\mathrm{sim}(z_1^i, z_2^j)/\tau_{\mathrm{topo}})},$$
    where $\mathrm{sim}(\cdot)$ is the cosine similarity function and $\tau_{\mathrm{topo}}$ is the corresponding temperature parameter.
  • Feature Contrastive Loss. Similar to the topology contrastive loss, for node i, $z_3^i$ and $z_4^i$ form a pair of positive samples, while the remaining embeddings form pairs of negative samples. For example, in Figure 5b, for node $v_1$, the embeddings of $v_1$ in $Z_3$ and $Z_4$ are positive examples for each other, while the embeddings of $v_1$ and $v_j$ $(2 \le j \le 6)$ in different views are negative examples. Therefore, the feature contrastive loss can be defined as
    $$\mathcal{L}_{\mathrm{feat}} = \frac{1}{2n} \sum_{v_i \in V} \big( \ell_{\mathrm{feat}_3}(z_3^i) + \ell_{\mathrm{feat}_4}(z_4^i) \big).$$
    Here, $\ell_{\mathrm{feat}_3}$ and $\ell_{\mathrm{feat}_4}$ in Formula (16) denote the feature contrastive losses of the two views. We only show $\ell_{\mathrm{feat}_3}(z_3^i)$:
    $$\ell_{\mathrm{feat}_3}(z_3^i) = -\log \frac{\exp(\mathrm{sim}(z_3^i, z_4^i)/\tau_{\mathrm{feat}})}{\sum_{v_j \in V} \exp(\mathrm{sim}(z_3^i, z_4^j)/\tau_{\mathrm{feat}})},$$
    where $\tau_{\mathrm{feat}}$ is the corresponding temperature parameter.
  • Node Contrastive Loss. The two embeddings $Z_2$ and $Z_3$ come from the same graph but are produced by different encoders. To further enhance their commonality, we use a contrastive loss to measure the similarity of their embedding distributions. For node i, $z_2^i$ and $z_3^i$ are positive samples for each other, while the embeddings of the remaining nodes are negative samples. For example, in Figure 5c, for node $v_1$, the embeddings of $v_1$ in $Z_2$ and $Z_3$ are positive examples for each other, while the embeddings of $v_1$ and $v_j$ $(2 \le j \le 6)$ in different views are negative examples. We define the node contrastive loss for the different embeddings of the same view as
    $$\mathcal{L}_n = \frac{1}{2n} \sum_{v_i \in V} \big( \ell_{n_2}(z_2^i) + \ell_{n_3}(z_3^i) \big).$$
    Here, $\ell_{n_2}$ and $\ell_{n_3}$ in Formula (18) denote the node contrastive losses, where
    $$\ell_{n_2}(z_2^i) = -\log \frac{\exp(\mathrm{sim}(z_2^i, z_3^i)/\tau_n)}{\sum_{v_j \in V} \exp(\mathrm{sim}(z_2^i, z_3^j)/\tau_n)},$$
    and $\tau_n$ is the corresponding temperature parameter.
  • Feature–Topology Contrastive Loss. For the topology view embedding $Z_1$ and the feature view embedding $Z_4$, the learning of the two views is independent. However, the information extracted from the two views should be consistent with the differences in features and topology of the original graph. For homophilous graphs, the feature and topology distributions of the original graph are similar, so the information learned from the two views should be similar; if $Z_1$ and $Z_4$ show significant differences, many errors exist in the extracted information. Conversely, for non-homophilous graphs, the information learned from the two views should be as different as possible, since similar information leads to heavy redundancy in the embeddings. Therefore, we design the feature–topology contrastive loss to keep $Z_1$ and $Z_4$ consistent with the topology and features of the original graph, as illustrated in Figure 6.
Specifically, given the original graph G and the feature view $V_f$, we use the Jaccard coefficient [55] to compute the isomorphic similarity $\mathbf{y}$. For each node i, the similarity is defined as
$$y_i = \frac{|A_i \cap A_{f,i}|}{|A_i \cup A_{f,i}|},$$
where $A_i$ and $A_{f,i}$ denote the neighbor sets of node i in G and $V_f$, respectively, and $y_i$ is the node-level topological isomorphic similarity between G and $V_f$, capturing the discrepancy between the feature and global topologies. To introduce this discrepancy into node representation learning, we concatenate the topology view embedding $Z_1$ and the feature view embedding $Z_4$ and feed them into an MLP to predict the similarity:
$$\hat{\mathbf{y}} = \mathrm{Predictor}(\mathrm{concat}(Z_1, Z_4)),$$
where $\mathrm{Predictor}(\cdot)$ is an MLP predictor and $\mathrm{concat}(\cdot, \cdot)$ is the concatenation of two vectors. We adopt the mean squared error to ensure that the learned topological and feature embeddings are isomorphic to the topology and feature information of the original graph. Therefore, the feature–topology contrastive loss is
$$\mathcal{L}_{\mathrm{ft}} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.$$
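A minimal PyTorch sketch of the full objective (our reconstruction of Formulas (14)-(22) together with the total loss of Section 4.7; `predictor` stands for any small MLP mapping the concatenated embeddings to a scalar per node):

```python
import torch
import torch.nn.functional as F

def info_nce(za, zb, tau=1.1):
    """One direction of an InfoNCE term: positives are the same node across
    two views; cross-view pairs of different nodes act as negatives."""
    za, zb = F.normalize(za, dim=1), F.normalize(zb, dim=1)
    logits = za @ zb.t() / tau                     # sim(z_a^i, z_b^j) / tau
    target = torch.arange(za.size(0), device=za.device)
    return F.cross_entropy(logits, target)         # mean of -log softmax diagonal

def jaccard_targets(A, A_f):
    """Node-level Jaccard similarity y between neighbor sets of G and V_f."""
    inter = ((A > 0) & (A_f > 0)).sum(dim=1).float()
    union = ((A > 0) | (A_f > 0)).sum(dim=1).float().clamp(min=1.0)
    return inter / union

def ampgcl_loss(Z1, Z2, Z3, Z4, A, A_f, predictor, lam=1.0):
    """Total objective L = L_topo + L_feat + L_n + lambda * L_ft."""
    l_topo = 0.5 * (info_nce(Z1, Z2) + info_nce(Z2, Z1))
    l_feat = 0.5 * (info_nce(Z3, Z4) + info_nce(Z4, Z3))
    l_n    = 0.5 * (info_nce(Z2, Z3) + info_nce(Z3, Z2))
    y_hat  = predictor(torch.cat([Z1, Z4], dim=1)).squeeze(-1)  # concat(Z1, Z4) -> MLP
    l_ft   = F.mse_loss(y_hat, jaccard_targets(A, A_f))
    return l_topo + l_feat + l_n + lam * l_ft
```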

4.7. Model Training

Overall, we define the training loss as the summation of the four contrastive losses:
$$\mathcal{L} = \mathcal{L}_{\mathrm{topo}} + \mathcal{L}_{\mathrm{feat}} + \mathcal{L}_n + \lambda \mathcal{L}_{\mathrm{ft}},$$
where $\lambda > 0$ is a hyperparameter that balances the importance of $\mathcal{L}_{\mathrm{ft}}$.
The topology contrastive loss, feature contrastive loss, and node contrastive loss train the model simultaneously to optimize different aspects, while the feature–topology contrastive loss introduces a repulsive force ensuring that the embeddings of the two views remain distinct from each other. The weight parameter $\lambda$ controls the magnitude of this repulsive force.
AMPGCL is trained in a self-supervised manner that does not incorporate supervised information during training, leading AMPGCL to generate task-agnostic node embeddings. For downstream tasks, fine-tuning models are employed to capture task-specific node embeddings. Since this fine-tuning is related to specific downstream tasks, it may lead to label overfitting on small datasets; this can be mitigated using techniques like regularization [56], dropout [57], or early stopping [58], but it is beyond the scope of this paper. For contrastive learning, self-supervised loss functions are generally less prone to overfitting than supervised loss functions, as they do not directly rely on label information. On smaller datasets, AMPGCL constructs feature and topology views to generate more positive and negative sample pairs, enhancing model robustness. Additionally, the design of the multi-view contrastive losses helps balance different aspects of the model’s learning process, and employing harder negative samples across different views further improves learning effectiveness. Moreover, incorporating regularization, dropout, and early stopping strategies during training can also prevent overfitting.

5. Experimental Study

5.1. Experimental Setup

5.1.1. Datasets

We comprehensively evaluated the performance of AMPGCL on ten publicly available real datasets. All datasets are accessed on 20 July 2024. We summarize the statistics of all datasets in Table 1.
Cora (https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.Planetoid.html), Citeseer (https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.Planetoid.html), and Pubmed (https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.Planetoid.html) are citation graphs with papers as nodes and citation links as edges. Node features are the bag-of-words representations of the papers, and the node label is the academic topic of a paper.
Amazon-Computers (Computers) (https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.Amazon.html) is part of the Amazon co-purchase graph, where nodes represent products, and edges represent co-purchasing relationships.
Coauthor-CS (CS) (https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.Coauthor.html) is the co-authorship graph in the field of computer science, where nodes represent authors, and edges represent co-authorship relationships.
Cornell (https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.WebKB.html), Texas (https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.WebKB.html), and Wisconsin (https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.WebKB.html) are webpage datasets collected from computer science departments of various universities by Carnegie Mellon University. Node features are the bag-of-words representation of web pages. The task is to classify the nodes into one of the five categories of student, project, course, staff, and faculty.
Actor (https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.Actor.html) is the actor-only induced subgraph of the film-director-actor-writer network. Each node corresponds to an actor, and the edge between two nodes denotes co-occurrence on the same Wikipedia page. Node features correspond to some keywords on the Wikipedia pages. The task is to classify the nodes into five categories in terms of words of the actor’s Wikipedia.
ogbn-arxiv (https://ogb.stanford.edu/docs/home/) is part of the Open Graph Benchmark (OGB). It is a citation network graph, where nodes represent computer science arXiv papers, and directed edges represent citation relationships between them.
The ten datasets are commonly used in graph representation learning [2,3,5,6,7,8,13,15]. Cora, Citeseer, Pubmed, Computers, CS, and ogbn-arxiv are homophilous graph datasets while Cornell, Texas, Wisconsin, and Actor are non-homophilous graph datasets. We evaluated the performance and efficiency of AMPGCL and all baselines on Cora, Citeseer, Pubmed, Computers, CS, Cornell, Texas, Wisconsin, and Actor datasets and conducted scalability analysis on the ogbn-arxiv dataset.

5.1.2. Competitors

We compare AMPGCL with thirteen state-of-the-art proposals.
  • GCN [5] is a semi-supervised graph convolutional network that employs a convolutional neural network architecture to perform node classification tasks on graphs.
  • GAT [6] utilizes a multi-head attention mechanism to dynamically generate aggregated weights for neighboring nodes.
  • Geom-GCN [13] focuses on neighbor selection on the graph aggregation, which constructs a novel neighborhood set in the continuous space.
  • MixHop [43] is a method that leverages multiple hops to aggregate node features, effectively capturing higher-order neighborhood information.
  • GraphSAGE [21] is a scalable method that samples a fixed number of neighbors for each node and aggregates their features, making it efficient for large-scale graphs.
  • DGI [1] is a self-supervised graph representation learning framework that maximizes the mutual information between the node and high-level summaries of the graph structure.
  • GRACE [12] is a self-supervised approach to learning node- and graph-level representations by contrasting structural views of graphs.
  • MVGRL [8] is a self-supervised GCL method. It samples subgraphs to form graph views based on random walks.
  • GCA [10] leverages the domain knowledge of network science and adopts network centrality to perform graph augmentation.
  • BGRL [29] maintains two distinct graph encoders and learns a node representation by training an online encoder to predict the representation of a target one.
  • MA-GCL [2] focuses on perturbing the architectures of GNN encoders rather than altering the graph input or model parameters.
  • HomoGCL [3] is a self-supervised GCL method that introduces a bound of the mutual information between raw node features and node embeddings in augmented views.
  • ASP [7] is a self-supervised GCL method across three different graph views in a joint fashion and learns node representations preserving both attribute and structure information.
Since unsupervised and supervised learning models exhibit similar performance in the studies in [2,3], we compare AMPGCL with GCN, GAT, Geom-GCN, MixHop, and GraphSAGE, which are semi-supervised GNNs based on task labels. The remaining baselines are self-supervised GCL frameworks. HomoGCL and ASP are two state-of-the-art unsupervised models considering the homophily of graphs.

5.1.3. Hyperparameters and Experimental Settings

For Cora, Citeseer, and Pubmed, we used the public fixed split introduced in [59]. Specifically, (i) for Cora, the dataset was split into 140 nodes for training, 500 nodes for validation, and 1000 nodes for testing; (ii) for Citeseer, the dataset was split into 120 nodes for training, 500 nodes for validation, and 1000 nodes for testing; (iii) for Pubmed, the dataset was split into 60 nodes for training, 500 nodes for validation, and 1000 nodes for testing. For Computers, CS, Cornell, Texas, Wisconsin, Actor, and ogbn-arxiv, we randomly divided the nodes of each class into 60%, 20%, and 20% for training, validation, and testing, respectively. Table 1 details the splits for the ten datasets. All unsupervised GCL methods were trained with the corresponding self-supervised objectives. The resulting embeddings were used to train and test a simple logistic regression classifier with $\ell_2$ regularization. In particular, 10-fold cross-validation was conducted on each dataset. We report the average classification accuracy and standard deviation.
We set the hyperparameters of the baselines as suggested by the respective papers. For AMPGCL, the number of neighbors k was set to 10 in the feature view for homophilous graphs and 30 for non-homophilous graphs. The hop h of the H-Transformer for topology view encoding was set to 6 for homophilous graphs and 10 for non-homophilous graphs. This is because AMPGCL aims to capture more same-class node information: since the homophilous datasets have a high average homophily ratio, they do not need information from distant neighbors, whereas the non-homophilous graphs, with lower homophily and more distant similar neighbors, require a larger h to aggregate farther neighbors and a larger k to include more similar nodes despite the potential noise from distant neighbors. For the view encoding hyperparameters, each H-Transformer contains two Transformer layers, the mini-batch size is 512, and the MLP used for prediction consists of two fully connected layers with 16 neurons in the hidden layer. The embedding dimensionality d of the nodes was set to 256, chosen based on the experimental equipment and dataset scale. For the contrastive learning hyperparameters, we set $\tau_{\mathrm{topo}} = 1.1$, $\tau_{\mathrm{feat}} = 1.1$, and $\tau_n = 1.1$ based on the best contrastive learning parameters in existing works [3,7,9,10], and $\lambda = 1$ to validate the effectiveness of the feature–topology contrastive loss. All experiments were conducted on a server with an Intel(R) Xeon(R) W-2155 CPU, 128 GB memory, and two NVIDIA TITAN RTX GPUs.

5.2. Comparison

Table 2 and Table 3 report the classification accuracy of the 13 baselines and AMPGCL on five homophilous datasets and four non-homophilous datasets, respectively. AMPGCL consistently outperforms the other baselines on all datasets. Specifically, it outperforms the closest competitor by 0.61% on Cora, 1.01% on Citeseer, 0.23% on Pubmed, 0.05% on Computers, 0.2% on CS, 5.22% on Cornell, 4.71% on Texas, 0.59% on Wisconsin, and 0.13% on Actor. We also conducted t-tests comparing AMPGCL with the second-best-performing model and report the p-values on each dataset. The null hypothesis ($H_0$) is that there is no significant difference in performance between AMPGCL and the existing model; the alternative hypothesis ($H_1$) is that AMPGCL’s performance is significantly better. At a significance level of $\alpha = 0.05$, $H_0$ can be rejected for almost all datasets, indicating that AMPGCL performs significantly better than the second-best model. This is because AMPGCL can adaptively utilize both the features and the global topological information of the graph to learn node representations.
Additionally, there are four noteworthy points. Firstly, AMPGCL shows greater improvements on non-homophilous datasets than on homophilous datasets. This is because the feature and topological distributions differ significantly in non-homophilous datasets. By constructing and encoding two views, AMPGCL integrates these two specific types of information. The multi-view loss also promotes the model to learn more differentiated feature and topological types of information, reducing redundancy between the two embeddings.
Secondly, compared to ASP, AMPGCL also improves on homophilous datasets. Although ASP is also a GCL framework that separately encodes topology and features, it faces two problems: (1) ASP may suffer from over-smoothing in the global view due to the stacking of multiple GCN layers; (2) ASP ignores the guidance of these two types of information. In contrast, the H-Transformer of AMPGCL decouples multi-hop encoding, avoiding over-smoothing. Moreover, in homophilous graphs, where the differences between topology and features are small, the multi-view contrastive learning loss prevents large differences between the learned information.
Thirdly, HomoGCL, ASP, and AMPGCL, which consider homogeneity, perform better on non-homophilous graphs than GCL methods, which do not consider graph homophily. This demonstrates that considering graph homophily indeed helps improve the learning performance of GCL methods on non-homophilous graphs.
Fourthly, AMPGCL outperforms both MixHop and GraphSAGE. This is because AMPGCL retains both global topological information and feature information, while MixHop only focuses on multi-hop topology. GraphSAGE, despite its efficiency, does not capture the full graph information due to its sampling strategy. AMPGCL mitigates this limitation by incorporating feature information.
To comprehensively evaluate the performance of AMPGCL, we report additional evaluation metrics, including accuracy, precision, recall, and F1-score [60], for the classification tasks on the Cora and Cornell datasets, as shown in Table 4. Higher values for these metrics indicate better performance. Compared to 13 baselines, AMPGCL consistently demonstrates superior performance across most metrics in both datasets. On the Cora dataset, AMPGCL achieves the highest accuracy, recall, and F1-score, while its precision ranks second. On the Cornell dataset, AMPGCL achieves the highest accuracy and precision, with its recall and F1-score ranking second. Additionally, we visualize the ROC curves [61] of GCN and AMPGCL for the classification task on the Cora dataset, as shown in Figure 7. We present the aggregated curves across all classes, where the red solid line and blue dashed line represent the mean curves of AMPGCL and GCN, respectively, and the red and blue shades indicate their corresponding standard deviations. AMPGCL shows a better average AUC compared to GCN, indicating higher accuracy. The smaller shaded area for AMPGCL suggests a more stable performance. These observations confirm that AMPGCL is robust and reliable. AMPGCL is suitable for various applications and scenarios.

5.3. Ablation Study

To validate the contribution of each module, we conducted ablation studies on three homophilous and three non-homophilous graphs. Figure 8 shows the classification accuracy of the different variants. “A” represents the complete model; “v1” denotes the AMPGCL variant without the feature view and its encoding module; “v2” denotes the variant without the topology view and its encoding module; “v3” denotes the variant that uses the node InfoNCE loss instead of the multi-view contrastive learning loss; “v4” represents the variant using only GCNs for both feature and topological encoding; “v5” represents the variant using only H-Transformers for both feature and topological encoding; and “v6” represents the variant using the multi-view contrastive loss but without the feature–topology contrastive loss.
As shown in Figure 8, the performance of all variants decreases compared to the complete AMPGCL, indicating that the complete model’s embeddings contain richer semantic information, which benefits graph representation learning. However, in homophilous graphs, the performance gaps between the variants and the complete model are significantly smaller than in non-homophilous graphs. This is because, in homophilous graphs, the distributions of topology and features are less diverse, so learning them separately does not capture much additional semantic information. In contrast, in non-homophilous graphs, since topology and features contain different types of information, the model can adaptively combine these two types of information to generate node embeddings, resulting in better performance in downstream tasks.
Both v4 and v5 show lower classification performance than the complete AMPGCL, with v4 exhibiting the more significant drop on non-homophilous graphs and v5 on homophilous graphs. For v4, the drop on non-homophilous graphs is due to the over-smoothing caused by iterating a GCN over 30 layers when encoding topological information. Conversely, v5's drop on homophilous graphs occurs because the H-Transformer averages the feature vectors of 1-hop neighbors as input, which fails to learn the distinct features of each neighboring node.
Moreover, the performance decline of v6, compared to the complete model, indicates that the feature–topology contrastive loss is instrumental in extracting both types of information, contributing more significantly to non-homophilous graphs. This is because the feature–topology contrastive loss encourages the extraction of diverse types of hidden information in non-homophilous graphs through the two encodings.

5.4. Parameter Analysis

5.4.1. The Impact of k

We study the impact of k, the number of feature nearest neighbors per node used when constructing the feature view (see Section 4.2.1). Figure 9 shows the classification accuracy of AMPGCL for values of k in the range [10, 50]. Figure 9a shows the impact of k on the classification accuracy on three homophilous graphs, Cora, Citeseer, and Pubmed. The results indicate that k has no significant impact on AMPGCL for these graphs; the optimal value of k is 40 on Cora, 30 on Citeseer, and 30 on Pubmed. Figure 9b shows the impact of k on the learning performance on three non-homophilous graphs, Cornell, Texas, and Wisconsin. When 10 ≤ k ≤ 30, the classification accuracy improves, but when k > 30, it begins to decline.
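As a concrete illustration of the role of k, the following sketch builds a k-nearest-neighbor feature view. It assumes cosine similarity over the raw attribute matrix, which is one common choice; the construction in Section 4.2.1 may differ in detail.

```python
import numpy as np

def knn_feature_view(X, k=30):
    """Connect each node to its k most cosine-similar nodes (sketch)."""
    Xn = X / np.clip(np.linalg.norm(X, axis=1, keepdims=True), 1e-12, None)
    S = Xn @ Xn.T                          # pairwise cosine similarities
    np.fill_diagonal(S, -np.inf)           # exclude self-matches
    nbrs = np.argsort(-S, axis=1)[:, :k]   # top-k neighbors per node
    rows = np.repeat(np.arange(X.shape[0]), k)
    return np.stack([rows, nbrs.ravel()])  # (2, n_nodes * k) edge index
```

Under such a construction, increasing k past the point where neighbors stop sharing labels naturally injects noise, which is consistent with the decline observed for k > 30 on the non-homophilous graphs.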

5.4.2. The Impact of h

We study the impact of the number of hops used in topology encoding (see Section 4.2.2) on homophilous and non-homophilous graphs. Note that the hop count used in topology view construction is consistent with the hop count used in global topology encoding. Figure 10 shows the classification performance and training speed of the embeddings generated by AMPGCL for h ∈ {0, 6, 10, 20, 30}. When h = 0, only the node features are encoded, without aggregating any neighbors. As shown in Figure 10a, in homophilous graphs, the performance of AMPGCL drops significantly when h = 0, demonstrating that GCL benefits from neighbor aggregation in homophilous graphs. When h > 6, the classification accuracy of AMPGCL slowly decreases due to the noise introduced by aggregating too many hops of neighbors. As shown in Figure 10b, in non-homophilous graphs, the performance of AMPGCL is lowest when h = 0 but remains acceptable. When h > 6, the classification accuracy of AMPGCL improves significantly. This is because, in non-homophilous graphs, nodes with the same label often lie among more distant neighbors, so aggregating the features of these similar nodes allows the model to learn the nodes' latent semantic information.
Additionally, the results for efficiency in both homophilous and non-homophilous graphs indicate that as h increases, the training speed of the model decreases. This is because, although the H-Transformer avoids multiple iterations, increasing h still results in the computation of multi-hop neighbor vectors and an increased input size for the H-Transformer.
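To make the decoupling concrete, the following sketch precomputes the per-hop inputs once, so that training avoids repeated neighborhood iteration. It assumes the h-hop tensor stacks powers of a normalized adjacency matrix applied to the features; the exact construction and the pooling inside the H-Transformer may differ.

```python
import numpy as np
import scipy.sparse as sp

def hop_feature_tensor(A_hat: sp.spmatrix, X: np.ndarray, h: int) -> np.ndarray:
    """Stack X, A_hat @ X, ..., A_hat^h @ X into one (n, h + 1, f) tensor."""
    hops, cur = [X], X
    for _ in range(h):
        cur = A_hat @ cur      # one sparse mat-mul per hop, computed once
        hops.append(cur)
    return np.stack(hops, axis=1)
```

Each extra hop adds one sparse multiplication and enlarges the transformer input by one slice, which matches the gradual slowdown observed as h grows.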

5.4.3. The Impact of λ

We evaluate the impact of λ, which controls the weight of the feature–topology contrastive loss in the multi-view contrastive loss (see Equation (23)), on graph representation learning. Table 5 reports the classification results on seven datasets as λ varies between 0 and 1. The optimal λ differs across datasets: 1.0 for Cora, 0.2 for Citeseer, 0.4 for Pubmed, 0.2 for Cornell, 0.4 for Texas, 1.0 for Wisconsin, and 0.2 for Actor. This is because datasets differ in topological structure and feature distribution, and model initialization parameters also vary, so the model needs to choose the optimal loss weight adaptively.
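As a hedged sketch of the objective that λ enters, assuming Equation (23) combines the four losses as a λ-weighted sum (the paper's exact combination may differ), with the node loss instantiated as a standard InfoNCE:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.5):
    """Standard InfoNCE between two views' node embeddings (n, d)."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                   # pairwise similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)       # positives on the diagonal

def multi_view_loss(l_topo, l_feat, l_n, l_ft, lam=0.4):
    # lam weights the feature-topology term (assumed form of Equation (23)).
    return l_topo + l_feat + l_n + lam * l_ft
```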

5.5. Model Analysis

5.5.1. Training Curves Analysis

We present the training curves of the multi-view contrastive losses (L_topo, L_feat, L_n, and L_ft) and of ASP's loss function (L_ASP) on Cora and Cornell in Figure 11. Since the losses are of different magnitudes, L_topo, L_feat, L_n, and L_ASP are plotted against the left y-axis, while L_ft is plotted against the right y-axis.
As shown in Figure 11, all four contrastive losses of AMPGCL converge quickly during training: around the 200th epoch on Cora and around the 150th epoch on Cornell. This stable convergence indicates the reliability and applicability of the four losses across datasets. Additionally, AMPGCL converges in a similar or slightly smaller number of epochs than ASP, which converges around the 220th epoch on Cora and the 150th epoch on Cornell. This demonstrates that AMPGCL maintains good convergence and generalization even with the multi-view contrastive losses.
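The dual-axis layout of Figure 11 can be reproduced with a twin y-axis; below is a minimal matplotlib sketch, with the per-epoch loss curves assumed to have been logged during training.

```python
import matplotlib.pyplot as plt

def plot_losses(epochs, l_topo, l_feat, l_n, l_ft):
    fig, left = plt.subplots()
    for curve, name in [(l_topo, "L_topo"), (l_feat, "L_feat"), (l_n, "L_n")]:
        left.plot(epochs, curve, label=name)      # left axis: similar scales
    right = left.twinx()                          # right axis for L_ft
    right.plot(epochs, l_ft, "r--", label="L_ft")
    left.set_xlabel("Epoch"); left.set_ylabel("Loss (left)")
    right.set_ylabel("Loss (right)")
    left.legend(loc="upper right"); right.legend(loc="center right")
    plt.show()
```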

5.5.2. Efficiency Analysis

We compare the training and encoding efficiency of AMPGCL with the baseline ASP, as ASP is also a multi-view contrastive learning method. We report the training speed and encoding time on three homophilous graphs (Cora, Citeseer, and Pubmed) and three non-homophilous graphs (Cornell, Texas, and Wisconsin). Figure 12a shows the number of training epochs per second, which includes view generation, encoding, gradient computation, and model parameter updates. Figure 12b shows the time taken to encode the entire dataset using the trained model.
Compared to ASP, AMPGCL demonstrates faster training and encoding: its training speed is twice that of ASP, and its encoding time is half that of ASP. This is because the multi-view parallel encoding strategy allows multiple views to be encoded simultaneously, and because AMPGCL uses a mini-batch strategy for multi-hop encoding. In contrast, ASP aggregates multi-hop information through multiple iterative calculations.
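The training-speed metric in Figure 12a (epochs per second) can be measured with a simple wall-clock harness; `train_epoch` below is a hypothetical stand-in for one full epoch of view generation, encoding, gradient computation, and parameter updates.

```python
import time

def epochs_per_second(train_epoch, n_epochs=50):
    start = time.perf_counter()
    for _ in range(n_epochs):
        train_epoch()          # one complete training epoch
    return n_epochs / (time.perf_counter() - start)
```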

5.6. Scalability Analysis

We report the training speed of AMPGCL and ASP, which uses multi-view encoding, on different graph sizes and node degrees on ogbn-arxiv, as shown in Figure 13. The training speed is measured as the number of training epochs per second on a specific graph. For different graph sizes, we randomly sample 20,000, 30,000, 40,000, 50,000, and 60,000 nodes to construct a training graph. Figure 13a shows that as the number of nodes increases, the training speed of both ASP and AMPGCL decreases. However, the rate of decrease for AMPGCL is significantly lower than that for ASP. For example, the training speed of AMPGCL on a graph with 60,000 nodes is almost the same as ASP on a graph with 30,000 nodes. This is because AMPGCL uses a parallel multi-view encoding strategy.
For different node degrees, we randomly construct graphs whose maximum node degree is 3, 6, 9, 12, and 15. Figure 13b shows that as the degree increases, the training speed of AMPGCL decreases slowly but consistently remains higher than that of ASP. This is because AMPGCL pools the h-hop neighbor vectors into h vectors in the H-Transformer, shielding the aggregation cost from the node degree, while ASP must aggregate more neighbors as the node degree increases.
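The graph-size experiment relies on induced subgraphs over uniformly sampled node sets; below is a minimal sketch using torch_geometric, noting that the authors' exact sampling procedure is not specified here.

```python
import torch
from torch_geometric.utils import subgraph

def sample_training_graph(edge_index, num_nodes, sample_size):
    subset = torch.randperm(num_nodes)[:sample_size]   # e.g., 20,000 nodes
    sub_edge_index, _ = subgraph(subset, edge_index,
                                 relabel_nodes=True, num_nodes=num_nodes)
    return subset, sub_edge_index
```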

5.7. Visualization

We use t-SNE [42] to visualize the node embeddings generated by DGI and AMPGCL on the Cora dataset, as shown in Figure 14. As shown in Figure 14a,b, the embeddings generated by the unsupervised method DGI tend to be evenly distributed, whereas those generated by AMPGCL exhibit more consistent clustering. This indicates that AMPGCL captures the graph's global structure more effectively, leading to semantically rich and coherent node representations. The clear separation and tighter clusters in the AMPGCL embeddings highlight the model's ability to learn meaningful representations that reflect the underlying graph topology.
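A minimal sketch of the Figure 14 visualization, assuming trained embeddings Z of shape (n_nodes, d) and integer class labels:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embeddings(Z, labels):
    # Project embeddings to 2-D and color each node by its class label.
    xy = TSNE(n_components=2, random_state=0).fit_transform(Z)
    plt.scatter(xy[:, 0], xy[:, 1], c=labels, s=5, cmap="tab10")
    plt.axis("off")
    plt.show()
```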

6. Conclusions

We propose AMPGCL, an adaptive multi-view parallel graph contrastive learning framework. AMPGCL uses a feature view and a global topology view to learn two types of information from the graph. Through a multi-view contrastive loss, it adaptively integrates these two types of information into the node representations, thereby learning node embeddings that better reflect the underlying information of the nodes. Additionally, parallel topological and feature encoding strategies improve the encoding efficiency of AMPGCL. We conducted experiments on nine real-world datasets, and the results demonstrate that AMPGCL outperforms thirteen state-of-the-art graph representation learning models in terms of classification accuracy. Overall, AMPGCL exhibits strong learning capability on both homophilous and non-homophilous graphs, making it suitable for many downstream tasks. Future research can explore the application of AMPGCL to large sparse graphs.

Author Contributions

Conceptualization, Y.S.; methodology, Y.S.; software, Y.S.; validation, Y.S., X.L., F.L. and G.Y.; formal analysis, Y.S.; investigation, Y.S.; resources, X.L., F.L. and G.Y.; data curation, Y.S.; writing—original draft preparation, Y.S.; writing—review and editing, Y.S., X.L., F.L. and G.Y.; visualization, Y.S.; supervision, X.L., F.L. and G.Y.; project administration, G.Y.; funding acquisition, G.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The code is available at https://github.com/yumengs-exp/AMPGCL.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Velickovic, P.; Fedus, W.; Hamilton, W.L.; Liò, P.; Bengio, Y.; Hjelm, R.D. Deep Graph Infomax. In Proceedings of the ICLR, New Orleans, LA, USA, 6–9 May 2019.
2. Gong, X.; Yang, C.; Shi, C. MA-GCL: Model augmentation tricks for graph contrastive learning. In Proceedings of the AAAI, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 4284–4292.
3. Li, W.Z.; Wang, C.D.; Xiong, H.; Lai, J.H. HomoGCL: Rethinking homophily in graph contrastive learning. In Proceedings of the KDD, Long Beach, CA, USA, 6–10 August 2023; pp. 1341–1352.
4. Song, Y.; Gu, Y.; Li, T.; Qi, J.; Liu, Z.; Jensen, C.S.; Yu, G. CHGNN: A semi-supervised contrastive hypergraph learning network. IEEE Trans. Knowl. Data Eng. 2024.
5. Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the ICLR, Toulon, France, 24–26 April 2017.
6. Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph Attention Networks. In Proceedings of the ICLR, Vancouver, BC, Canada, 30 April–3 May 2018.
7. Chen, J.; Kou, G. Attribute and structure preserving graph contrastive learning. In Proceedings of the AAAI, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 7024–7032.
8. Hassani, K.; Khasahmadi, A.H. Contrastive multi-view representation learning on graphs. In Proceedings of the ICML, Virtual Event, 13–18 July 2020; pp. 4116–4126.
9. You, Y.; Chen, T.; Sui, Y.; Chen, T.; Wang, Z.; Shen, Y. Graph contrastive learning with augmentations. Adv. Neural Inf. Process. Syst. 2020, 33, 5812–5823.
10. Zhu, Y.; Xu, Y.; Yu, F.; Liu, Q.; Wu, S.; Wang, L. Graph contrastive learning with adaptive augmentation. In Proceedings of the Web Conference 2021, Ljubljana, Slovenia, 19–23 April 2021; pp. 2069–2080.
11. Peng, Z.; Huang, W.; Luo, M.; Zheng, Q.; Rong, Y.; Xu, T.; Huang, J. Graph representation learning via graphical mutual information maximization. In Proceedings of the Web Conference 2020, Taipei, Taiwan, 20–24 April 2020; pp. 259–270.
12. Zhu, Y.; Xu, Y.; Yu, F.; Liu, Q.; Wu, S.; Wang, L. Deep graph contrastive representation learning. arXiv 2020, arXiv:2006.04131.
13. Pei, H.; Wei, B.; Chang, K.C.; Lei, Y.; Yang, B. Geom-GCN: Geometric Graph Convolutional Networks. In Proceedings of the ICLR, Addis Ababa, Ethiopia, 26–30 April 2020.
14. Zhu, J.; Yan, Y.; Zhao, L.; Heimann, M.; Akoglu, L.; Koutra, D. Beyond homophily in graph neural networks: Current limitations and effective designs. Adv. Neural Inf. Process. Syst. 2020, 33, 7793–7804.
15. Chen, M.; Wei, Z.; Huang, Z.; Ding, B.; Li, Y. Simple and deep graph convolutional networks. In Proceedings of the ICML, Virtual Event, 13–18 July 2020; pp. 1725–1735.
16. Yan, Y.; Hashemi, M.; Swersky, K.; Yang, Y.; Koutra, D. Two sides of the same coin: Heterophily and oversmoothing in graph convolutional neural networks. In Proceedings of the ICDM, Orlando, FL, USA, 28 November–1 December 2022; pp. 1287–1292.
17. Bo, D.; Wang, X.; Shi, C.; Shen, H. Beyond low-frequency information in graph convolutional networks. In Proceedings of the AAAI, Vancouver, BC, Canada, 2–9 February 2021; Volume 35, pp. 3950–3957.
18. Bruna, J.; Zaremba, W.; Szlam, A.; LeCun, Y. Spectral Networks and Locally Connected Networks on Graphs. In Proceedings of the ICLR, Banff, AB, Canada, 14–16 April 2014.
19. Defferrard, M.; Bresson, X.; Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. Adv. Neural Inf. Process. Syst. 2016, 29, 3837–3845.
20. Gilmer, J.; Schoenholz, S.S.; Riley, P.F.; Vinyals, O.; Dahl, G.E. Neural message passing for quantum chemistry. In Proceedings of the ICML, Sydney, NSW, Australia, 6–11 August 2017; pp. 1263–1272.
21. Hamilton, W.; Ying, Z.; Leskovec, J. Inductive representation learning on large graphs. Adv. Neural Inf. Process. Syst. 2017, 30, 1024–1034.
22. Li, Q.; Han, Z.; Wu, X.M. Deeper insights into graph convolutional networks for semi-supervised learning. In Proceedings of the AAAI, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
23. Li, G.; Muller, M.; Thabet, A.; Ghanem, B. DeepGCNs: Can GCNs go as deep as CNNs? In Proceedings of the ICCV, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9267–9276.
24. Luan, S.; Hua, C.; Lu, Q.; Zhu, J.; Zhao, M.; Zhang, S.; Chang, X.W.; Precup, D. Revisiting heterophily for graph neural networks. Adv. Neural Inf. Process. Syst. 2022, 35, 1362–1375.
25. Li, X.; Zhu, R.; Cheng, Y.; Shan, C.; Luo, S.; Li, D.; Qian, W. Finding global homophily in graph neural networks when meeting heterophily. In Proceedings of the ICML, Baltimore, MD, USA, 17–23 July 2022; pp. 13242–13256.
26. Chen, J.; Chen, S.; Gao, J.; Huang, Z.; Zhang, J.; Pu, J. Exploiting neighbor effect: Conv-agnostic GNN framework for graphs with heterophily. IEEE Trans. Neural Netw. Learn. Syst. 2023.
27. Suresh, S.; Budde, V.; Neville, J.; Li, P.; Ma, J. Breaking the limit of graph neural networks by improving the assortativity of graphs with local mixing patterns. In Proceedings of the KDD, Singapore, 14–18 August 2021; pp. 1541–1551.
28. Sun, F.; Hoffmann, J.; Verma, V.; Tang, J. InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization. In Proceedings of the ICLR, Addis Ababa, Ethiopia, 26–30 April 2020.
29. Thakoor, S.; Tallec, C.; Azar, M.G.; Munos, R.; Veličković, P.; Valko, M. Bootstrapped representation learning on graphs. In Proceedings of the ICLR 2021 Workshop on Geometrical and Topological Representation Learning, Virtual Event, 3–7 May 2021.
30. Mo, Y.; Wang, X.; Fan, S.; Shi, C. Graph Contrastive Invariant Learning from the Causal Perspective. In Proceedings of the AAAI, Vancouver, BC, Canada, 20–24 February 2024; pp. 8904–8912.
31. He, D.; Zhao, J.; Huo, C.; Huang, Y.; Huang, Y.; Feng, Z. A New Mechanism for Eliminating Implicit Conflict in Graph Contrastive Learning. In Proceedings of the AAAI, Vancouver, BC, Canada, 20–24 February 2024; Volume 38, pp. 12340–12348.
32. Grill, J.B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.; Buchatskaya, E.; Doersch, C.; Avila Pires, B.; Guo, Z.; Gheshlaghi Azar, M.; et al. Bootstrap your own latent—A new approach to self-supervised learning. Adv. Neural Inf. Process. Syst. 2020, 33, 21271–21284.
33. Page, L.; Brin, S.; Motwani, R.; Winograd, T. The PageRank citation ranking: Bringing order to the web. In Proceedings of the Web Conference 1998, Brisbane, Australia, 1 April 1998.
34. Xu, J.; Chen, S.; Ren, Y.; Shi, X.; Shen, H.; Niu, G.; Zhu, X. Self-Weighted Contrastive Learning among Multiple Views for Mitigating Representation Degeneration. Adv. Neural Inf. Process. Syst. 2024, 36, 1119–1131.
35. Yang, L.; Zhou, W.; Peng, W.; Niu, B.; Gu, J.; Wang, C.; Cao, X.; He, D. Graph neural networks beyond compromise between attribute and topology. In Proceedings of the ACM Web Conference 2022, Lyon, France, 25–29 April 2022; pp. 1127–1135.
36. Wang, X.; Zhu, M.; Bo, D.; Cui, P.; Shi, C.; Pei, J. AM-GCN: Adaptive multi-channel graph convolutional networks. In Proceedings of the KDD, Virtual Event, 23–27 August 2020; pp. 1243–1253.
37. Manning, C.D. Introduction to Information Retrieval; Syngress Publishing: St. Rockland, MA, USA, 2008.
38. Duda, R.O.; Hart, P.E. Pattern Classification; John Wiley & Sons: Hoboken, NJ, USA, 2006.
39. Deza, M.M.; Deza, E. Encyclopedia of Distances; Springer: Berlin/Heidelberg, Germany, 2009.
40. Tan, P.N.; Steinbach, M.; Kumar, V. Introduction to Data Mining; Pearson Education: Chennai, India, 2016.
41. Jolliffe, I.T.; Cadima, J. Principal component analysis: A review and recent developments. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 2016, 374, 20150202.
42. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
43. Abu-El-Haija, S.; Perozzi, B.; Kapoor, A.; Alipourfard, N.; Lerman, K.; Harutyunyan, H.; Ver Steeg, G.; Galstyan, A. MixHop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In Proceedings of the ICML, Long Beach, CA, USA, 9–15 June 2019; pp. 21–29.
44. Xu, K.; Li, C.; Tian, Y.; Sonobe, T.; Kawarabayashi, K.i.; Jegelka, S. Representation learning on graphs with jumping knowledge networks. In Proceedings of the ICML, Stockholmsmässan, Stockholm, Sweden, 10–15 July 2018; pp. 5453–5462.
45. Zhang, S.; Liu, Y.; Sun, Y.; Shah, N. Graph-less Neural Networks: Teaching Old MLPs New Tricks via Distillation. In Proceedings of the ICLR, Virtual Event, 3–7 May 2021.
46. Kondor, R.I.; Lafferty, J. Diffusion kernels on graphs and other discrete structures. In Proceedings of the ICML, Sydney, Australia, 8–12 July 2002; pp. 315–322.
47. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
48. Xu, D.; Ruan, C.; Korpeoglu, E.; Kumar, S.; Achan, K. Inductive Representation Learning on Temporal Graphs. In Proceedings of the ICLR, Addis Ababa, Ethiopia, 26–30 April 2020.
49. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the CVPR, Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738.
50. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the ICML, Virtual Event, 13–18 July 2020; pp. 1597–1607.
51. Mondal, S.; Manasi, S.D.; Kunal, K.; Sapatnekar, S.S. GNNIE: GNN inference engine with load-balancing and graph-specific caching. In Proceedings of the DAC, San Francisco, CA, USA, 10–14 July 2022; pp. 565–570.
52. Chiang, W.L.; Liu, X.; Si, S.; Li, Y.; Bengio, S.; Hsieh, C.J. Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks. In Proceedings of the KDD, Anchorage, AK, USA, 4–8 August 2019; pp. 257–266.
53. Zeng, H.; Zhou, H.; Srivastava, A.; Kannan, R.; Prasanna, V. GraphSAINT: Graph sampling based inductive learning method. arXiv 2019, arXiv:1907.04931.
54. van den Oord, A.; Li, Y.; Vinyals, O. Representation Learning with Contrastive Predictive Coding. arXiv 2018, arXiv:1807.03748.
55. Jaccard, P. The distribution of the flora in the alpine zone. New Phytol. 1912, 11, 37–50.
56. Ng, A.Y. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the ICML, Banff, AB, Canada, 4–8 July 2004; p. 78.
57. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
58. Prechelt, L. Early stopping-but when? In Neural Networks: Tricks of the Trade; Springer: Berlin/Heidelberg, Germany, 2002; pp. 55–69.
59. Yang, Z.; Cohen, W.; Salakhudinov, R. Revisiting semi-supervised learning with graph embeddings. In Proceedings of the ICML, New York City, NY, USA, 19–24 June 2016; pp. 40–48.
60. Powers, D.M. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv 2020, arXiv:2010.16061.
61. Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27, 861–874.
Figure 1. Comparison of existing GCLs and AMPGCL, where (i) colored circles denote nodes with class labels; (ii) rectangles attached to the circles denote the feature vectors of the nodes; (iii) augmented views are generated through edge perturbation and feature masking; (iv) feature and topology views are constructed using G's feature vectors and adjacency matrix; (v) contrastive loss serves as the training objective function of GCL.
Figure 2. An example of a homophilous graph and a non-homophilous graph. The circles denote nodes and the lines denote edges of the graph G in (a). The blue and yellow circles in (b,c) denote nodes with different labels. The green and red lines denote edges connecting nodes with the same and different labels, respectively.
Figure 3. AMPGCL model overview. The input is a graph G. The output is node embedding matrices Z_i (i = 1, 2, 3, 4) of G, the topology view, and the feature view. Step 1 constructs the h-hop topology view V_t and the k-NN feature view V_f for G. Step 2 iteratively embeds the feature view with a GCN and the topology view with the H-Transformer and trains the model using the four proposed contrastive losses.
Figure 4. Overview of the proposed Hop Transformer (H-Transformer), illustrated with 6 hops. The input is a topology view with an h-hop feature tensor X_t. The output is a node embedding matrix Z_t.
Figure 5. Examples of multi-view contrastive losses, where the green dashed arrows denote positive examples and the red dashed arrows denote negative examples.
Figure 6. An example of feature–topology contrastive loss.
Figure 7. ROC curves of AMPGCL and GCN.
Figure 8. Ablation study of AMPGCL.
Figure 9. The impact of k.
Figure 10. The impact of h.
Figure 11. Training loss curves of AMPGCL and ASP.
Figure 12. Efficiency analysis of AMPGCL.
Figure 13. Scalability test on ogbn-arxiv.
Figure 14. Embedding visualization on Cora. Colored dots denote nodes with different labels.
Table 1. Graph statistics of the ten datasets, where HR is the homophily ratio and T/V/T denotes training/validation/testing.

| Dataset | Type | # Nodes | # Edges | # Attributes | # Classes | HR | Data Split (T/V/T) |
|---|---|---|---|---|---|---|---|
| Cora | Citation | 2708 | 5278 | 1433 | 7 | 0.81 | 140/500/1000 |
| Citeseer | Citation | 3327 | 4552 | 3703 | 6 | 0.74 | 120/500/1000 |
| Pubmed | Citation | 19,717 | 44,324 | 500 | 3 | 0.80 | 60/500/1000 |
| Computers | Co-purchase | 13,752 | 245,861 | 767 | 10 | 0.79 | 60%/20%/20% |
| CS | Co-authorship | 18,333 | 81,894 | 6805 | 15 | 0.83 | 60%/20%/20% |
| Cornell | Webpage | 183 | 298 | 1703 | 5 | 0.31 | 60%/20%/20% |
| Texas | Webpage | 183 | 325 | 1703 | 5 | 0.11 | 60%/20%/20% |
| Wisconsin | Webpage | 251 | 515 | 1703 | 5 | 0.20 | 60%/20%/20% |
| Actor | Co-occurrence | 7600 | 30,019 | 932 | 5 | 0.22 | 60%/20%/20% |
| ogbn-arxiv | Citation | 169,343 | 1,166,243 | 128 | 40 | 0.66 | 60%/20%/20% |
Table 2. Classification accuracy (%) on homophilous datasets, where (i) SSL indicates semi-supervised learning methods; (ii) Self-SL indicates self-supervised learning methods; (iii) the best and second-best performances are marked in bold and underlined, respectively; and (iv) p is the p-value of the t-test.

| Type | Method | Cora | Citeseer | Pubmed | Computer | CS |
|---|---|---|---|---|---|---|
| SSL | GCN | 81.53 ± 0.65 | 69.34 ± 0.27 | 70.70 ± 0.41 | 86.53 ± 0.14 | 91.05 ± 0.21 |
| SSL | GAT | 83.17 ± 0.35 | 72.53 ± 0.77 | 79.52 ± 0.58 | 86.95 ± 0.17 | 90.33 ± 0.18 |
| SSL | Geom-GCN | 83.93 ± 1.25 | 73.71 ± 1.48 | 81.25 ± 0.43 | 83.97 ± 0.16 | 90.17 ± 0.19 |
| SSL | MixHop | 82.15 ± 1.25 | 71.90 ± 0.89 | 81.30 ± 0.46 | 83.73 ± 2.20 | 89.03 ± 0.60 |
| SSL | GraphSAGE | 74.82 ± 1.29 | 67.45 ± 1.04 | 77.12 ± 0.67 | 82.62 ± 1.80 | 89.55 ± 2.80 |
| Self-SL | DGI | 81.76 ± 0.66 | 71.55 ± 0.73 | 77.37 ± 0.66 | 87.48 ± 0.13 | 90.95 ± 0.13 |
| Self-SL | GRACE | 81.57 ± 0.33 | 70.64 ± 0.52 | 80.25 ± 0.34 | 87.54 ± 0.11 | 90.13 ± 0.16 |
| Self-SL | MVGRL | 82.90 ± 0.71 | 72.62 ± 0.73 | 79.42 ± 0.33 | 87.87 ± 0.13 | 91.12 ± 0.12 |
| Self-SL | GCA | 80.91 ± 0.63 | 70.13 ± 2.05 | 80.32 ± 0.98 | 88.25 ± 0.14 | 91.03 ± 0.15 |
| Self-SL | BGRL | 81.78 ± 0.62 | 71.13 ± 0.85 | 79.64 ± 0.53 | 87.67 ± 0.19 | 92.21 ± 0.13 |
| Self-SL | MA-GCL | 82.91 ± 0.43 | 72.62 ± 0.17 | 83.14 ± 0.43 | 87.82 ± 0.18 | 93.21 ± 0.12 |
| Self-SL | HomoGCL | 83.25 ± 0.53 | 71.33 ± 0.71 | 81.11 ± 0.3 | 87.97 ± 0.12 | 93.23 ± 0.14 |
| Self-SL | ASP | 82.48 ± 0.18 | 70.76 ± 1.17 | 79.29 ± 0.43 | 87.89 ± 0.16 | 93.62 ± 0.11 |
| Self-SL | AMPGCL | 84.54 ± 0.56 (p = 0.00007) | 74.72 ± 0.99 (p = 0.00002) | 83.37 ± 0.28 (p = 0.00372) | 88.30 ± 0.13 (p = 0.000896) | 93.82 ± 0.10 (p = 0.00502) |
Table 3. Classification accuracy (%) on non-homophilous datasets, where (i) SSL indicates semi-supervised learning methods; (ii) Self-SL indicates self-supervised learning methods; (iii) the best and second-best performances are marked in bold and underlined, respectively; and (iv) p is the p-value of the t-test.

| Type | Method | Cornell | Texas | Wisconsin | Actor |
|---|---|---|---|---|---|
| SSL | GCN | 52.70 ± 5.30 | 52.16 ± 5.16 | 45.88 ± 3.06 | 27.32 ± 1.10 |
| SSL | GAT | 54.32 ± 5.05 | 58.38 ± 5.16 | 49.41 ± 4.09 | 27.44 ± 0.89 |
| SSL | Geom-GCN | 77.86 ± 3.79 | 74.57 ± 3.83 | 74.39 ± 3.40 | 31.44 ± 1.30 |
| SSL | MixHop | 60.53 ± 28.53 | 74.62 ± 7.66 | 77.25 ± 7.80 | 31.34 ± 2.34 |
| SSL | GraphSAGE | 71.64 ± 1.24 | 74.26 ± 1.20 | 65.11 ± 5.14 | 31.55 ± 1.31 |
| Self-SL | DGI | 52.25 ± 7.09 | 54.56 ± 6.74 | 54.90 ± 2.77 | 27.87 ± 0.89 |
| Self-SL | GRACE | 53.15 ± 7.75 | 55.86 ± 3.37 | 49.02 ± 6.98 | 29.78 ± 0.51 |
| Self-SL | MVGRL | 67.70 ± 4.75 | 73.11 ± 4.75 | 74.25 ± 4.13 | 29.98 ± 0.53 |
| Self-SL | GCA | 53.11 ± 9.34 | 71.97 ± 2.30 | 73.50 ± 3.00 | 31.13 ± 0.71 |
| Self-SL | BGRL | 61.25 ± 0.85 | 64.20 ± 1.60 | 58.44 ± 1.43 | 25.35 ± 0.27 |
| Self-SL | MA-GCL | 69.36 ± 4.59 | 69.45 ± 2.55 | 67.20 ± 4.23 | 30.78 ± 1.02 |
| Self-SL | HomoGCL | 78.69 ± 3.44 | 66.07 ± 6.07 | 72.13 ± 2.75 | 31.13 ± 0.71 |
| Self-SL | ASP | 76.14 ± 4.95 | 75.83 ± 1.14 | 79.02 ± 2.96 | 31.97 ± 0.93 |
| Self-SL | AMPGCL | 83.91 ± 4.09 (p = 0.00607) | 80.54 ± 4.60 (p = 0.00013) | 79.61 ± 4.04 (p = 0.02373) | 32.10 ± 0.91 (p = 0.65028) |
Table 4. Classification accuracy, precision, recall, and F1-score (%) on the Cora and Cornell datasets, where the best and second-best performances are marked in bold and underlined, respectively.

| Method | Cora Accuracy | Cora Precision | Cora Recall | Cora F1-Score | Cornell Accuracy | Cornell Precision | Cornell Recall | Cornell F1-Score |
|---|---|---|---|---|---|---|---|---|
| GCN | 80.80 | 78.85 | 81.31 | 79.80 | 51.62 | 73.76 | 63.30 | 68.13 |
| GAT | 82.30 | 80.90 | 83.12 | 81.54 | 53.02 | 79.62 | 62.76 | 70.19 |
| Geom-GCN | 83.12 | 81.56 | 83.65 | 82.08 | 77.78 | 77.34 | 62.81 | 69.32 |
| MixHop | 81.02 | 79.12 | 81.67 | 80.25 | 60.31 | 76.13 | 64.62 | 69.90 |
| GraphSAGE | 82.14 | 80.76 | 83.33 | 81.99 | 70.33 | 71.62 | 67.95 | 69.74 |
| DGI | 81.76 | 79.96 | 82.21 | 81.08 | 52.25 | 71.69 | 66.60 | 69.05 |
| GRACE | 81.57 | 79.63 | 82.02 | 80.81 | 52.35 | 70.59 | 64.49 | 67.40 |
| MVGRL | 82.90 | 81.05 | 80.07 | 80.56 | 67.77 | 78.79 | 69.23 | 73.70 |
| GCA | 80.91 | 78.99 | 81.35 | 80.16 | 52.31 | 76.06 | 62.19 | 68.43 |
| BGRL | 81.78 | 79.98 | 82.19 | 81.08 | 61.13 | 77.21 | 64.43 | 70.24 |
| MA-GCL | 82.91 | 81.41 | 83.76 | 82.08 | 69.56 | 70.22 | 65.56 | 67.81 |
| HomoGCL | 82.90 | 80.46 | 84.11 | 81.63 | 78.33 | 76.33 | 84.67 | 76.44 |
| ASP | 82.20 | 79.75 | 83.65 | 80.96 | 76.14 | 78.67 | 87.33 | 79.49 |
| AMPGCL | 83.40 | 81.48 | 85.01 | 82.47 | 83.61 | 81.02 | 85.33 | 79.38 |
Table 5. Classification accuracy (%) for different λ values, where the best is marked in bold.

| λ | Cora | Citeseer | Pubmed | Cornell | Texas | Wisconsin | Actor |
|---|---|---|---|---|---|---|---|
| 0.0 | 83.78 | 73.83 | 83.04 | 64.59 | 64.59 | 64.71 | 31.59 |
| 0.2 | 83.68 | 74.37 | 83.28 | 75.68 | 67.84 | 69.71 | 32.07 |
| 0.4 | 83.53 | 74.20 | 83.52 | 71.62 | 76.31 | 72.22 | 31.64 |
| 0.6 | 83.61 | 74.35 | 83.28 | 70.95 | 73.78 | 74.66 | 31.43 |
| 0.8 | 83.78 | 74.15 | 83.20 | 70.16 | 76.16 | 77.02 | 31.30 |
| 1.0 | 84.54 | 74.19 | 83.22 | 66.80 | 75.95 | 77.16 | 31.27 |
