Article

A Regularized Graph Neural Network Based on Approximate Fractional Order Gradients

1 School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
2 Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou 313001, China
* Author to whom correspondence should be addressed.
Mathematics 2022, 10(8), 1320; https://doi.org/10.3390/math10081320
Submission received: 21 March 2022 / Revised: 8 April 2022 / Accepted: 13 April 2022 / Published: 15 April 2022

Abstract

Graph representation learning is a significant challenge in graph signal processing (GSP). The flourishing development of graph neural networks (GNNs) provides effective representations for GSP. To learn effectively from graph signals, we propose a regularized graph neural network based on approximate fractional order gradients (FGNN). The regularized graph neural network propagates information between neighboring nodes. The approximation strategy for calculating fractional order derivatives avoids falling into fractional order extrema and overcomes the high computational complexity of fractional order derivatives. We further prove that such an approximation is feasible and that FGNN is unbiased towards the global optimization solution. Extensive experiments on citation and community networks show that the proposed FGNN achieves higher recognition accuracy and faster convergence than vanilla GNN. Results on five datasets of different sizes and domains confirm the strong scalability of the proposed method.

1. Introduction

Network data are increasingly available in the digital world, where communication networks, social networks, biological networks, and vehicular networks generate large amounts of data daily [1,2]. Graphs provide effective mathematical descriptions of such data, which are often irregular and complex but have significant potential to reveal intrinsic properties of the networks, such as structures, link information, and object information [3,4]. Thus, learning from such data has attracted significant research interest in both academia and industry. In particular, graph neural networks (GNNs), built upon graph convolutions to design the learning process for graph structure properties, have shown great potential to process graph signals, extract multi-scale spatial information, and map it into higher dimensions for node classification, edge prediction, and graph clustering [3,5,6,7,8,9,10,11,12,13].
GNNs build upon classic neural networks and are typically trained with gradient descent methods [14,15,16]. Such networks converge by following the steepest descent direction given by the first order derivative, which is usually slower than higher order methods (with order greater than one) [17]. Furthermore, there remains a risk of getting stuck in a first order local optimum, even though random sampling methods have significantly reduced this risk [15,16].
To overcome such limitations, the application of fractional calculus in neural networks has been studied to exploit the long-term memory and non-locality characteristics of fractional order derivatives [18]. For example, Pu et al. adopted fractional order derivatives to directly replace integer order derivatives in the gradient descent method [19]. Wang et al. and Bao et al. proposed novel fractional order backpropagation deep networks based on fractional order derivatives [20,21]. Khan et al. proposed a fractional order gradient radial basis function network [22]. Compared with first order gradient models, fractional order gradient methods have two noticeable properties. First, for many functions, the local optimum points of different fractional orders are usually different (see Figure 1). Second, the computational complexity of fractional order derivatives is higher than that of the first order derivative. These two properties make it challenging to design a graph neural network based on fractional order gradients.
  • Although the fractional order gradient descent algorithm can effectively avoid falling into the first order local extreme point by exploiting the inconsistency of extreme points of different orders, it converges to the local extreme point of its own order, which differs from the first order extremum. Therefore, the fractional order algorithm is usually not optimal, because the first order derivative is usually not zero when the fractional order derivative reaches zero (see Figure 1).
  • The fractional derivative of weight parameters in the neural networks requires a complex fractional order chain rule of composite functions, leading to high computational complexity in the iterative optimization of neural networks.
This paper proposes a regularized graph neural network based on approximate fractional order gradients (FGNN) to address the problems by integrating fractional order derivatives into graph neural networks and employing an unbiased simplified approximation strategy to reduce the computation of the fractional order derivatives. The main contributions are summarized as follows.
  • We propose a regularized graph neural network based on approximate fractional order gradients for graph representation learning. The proposed approximation strategy avoids falling into fractional order local extreme points, improves the overall performance, and promises faster convergence. In particular, FGNN achieves better recognition accuracy and faster convergence than vanilla GNN on semi-supervised node classification tasks.
  • We prove that the approximation for calculating fractional derivatives is biased towards its own order optimum while unbiased towards the first order optimum. Therefore, the approximation strategy is beneficial to the optimization problem and guarantees convergence towards the first order extremum, where the first order derivative is zero while the fractional order derivative is not.
  • We analyze the convergence behavior of fractional order derivatives theoretically and identify the convergence condition of the proposed FGNN.
This article is arranged as follows. Section 2 reviews the related work. Section 3 proposes the regularized graph neural network based on approximate fractional order gradients. Section 4 proves the feasibility and convergence of the proposed FGNN. Section 5 evaluates the proposed method using experiments on citation and community network datasets. Section 6 discusses the limitations of our work and future research work. Section 7 concludes this paper.
The key symbols used in this paper are listed in Table 1.

2. Related Work

This section reviews the related research on GNN and introduces fractional calculus theory to set the background for the proposed method.

2.1. Graph Neural Networks

Typical neural networks (NNs) usually operate in the Euclidean space: either sequences or grid structured data (audios, images, videos, etc.). To study graph-structured data, GNNs extend NNs into the graph domain through three main types of propagation methods: convolution aggregators, attention aggregators, and gate aggregators.
The convolution operation of classic convolutional neural networks is generalized to the graph domain by convolution aggregators, which comprise spectral and non-spectral methods. Spectral methods focus on the graph spectrum, which requires computing the Laplacian eigenbasis of the specific graph structure. Bruna et al. [10] proposed a spectral network that extends convolution to the Laplacian spectrum. To reduce the high computational complexity, Defferrard et al. [12] defined a K-localized convolution instead of computing the Laplacian eigenvectors, and Kipf and Welling [6] limited the graph convolution operation to K = 1 to moderate redundant information aggregation in the graph spectrum. Non-spectral methods work directly on the graph, defining the spatial convolution operation on adjacent neighbors. However, it is rather difficult to specify uniform convolution operations on neighborhoods of different sizes. Hamilton et al. [3] proposed GraphSAGE, which generates embeddings by sampling and aggregating information from nodes and their neighbors.
Attention mechanisms, e.g., self- and intra-attention, have proven effective for representing sequence-based data [23]. Velickovic et al. proposed graph attention networks (GAT) [24] to integrate the attention mechanism into the graph propagation process. GAT aggregates the information of nodes and their neighbors by computing self-attention and intra-attention coefficients over the concatenated features of vertex pairs. This method also utilizes K independent attention heads to calculate intermediate states and concatenates or averages their features.
Gate aggregators are proposed to address the limitations of existing GNN methods and enhance long-term information broadcast while processing graph data. Similar to gated recurrent units (GRU) and long short-term memory (LSTM), gated graph neural networks utilize GRUs to aggregate information from other vertices and from the previous stage [25,26]; Zhang et al. proposed the Sentence-State LSTM (S-LSTM) [27] to enhance text encoding by transforming texts into graphs.

2.2. Fractional Calculus

Different from integer calculus, fractional order derivative does not have a unified definition. Common definitions of fractional order derivative include Grünwald–Letnikov (GL), Riemann–Liouville (RL), and Caputo derivatives [28,29]. The GL fractional order derivative is given below,
$$ {}_a^{GL}D_x^{\nu} f(x) = \lim_{h \to 0} h^{-\nu} \sum_{k=0}^{[(x-a)/h]} \binom{\nu}{k} f(x - kh), $$
where
$$ \binom{\nu}{k} = \frac{(-\nu)(-\nu + 1)\cdots(-\nu + k - 1)}{k!}. $$
Furthermore, ${}_a^{GL}D_x^{\nu}$ denotes the fractional order gradient operator according to the GL definition, $f(x)$ is a differentiable and integrable function, $\nu$ is the fractional order, which can take any real value, $a$ denotes the start of the interval $[a, x]$, and $[\cdot]$ is the rounding function.
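For illustration, a minimal Numpy sketch of this definition, truncating the limit at a small finite step $h$, is given below; the test function, order, and step size are arbitrary choices for demonstration, and the closed-form reference value is the standard power-function result $D^{\nu} x^{\mu} = \frac{\Gamma(\mu+1)}{\Gamma(\mu+1-\nu)} x^{\mu-\nu}$.

```python
import numpy as np
from scipy.special import gamma

def gl_fractional_derivative(f, x, nu, a=0.0, h=1e-3):
    """Truncated GL fractional derivative of f at x with lower terminal a and step h."""
    K = int((x - a) / h)                       # number of summation terms, [(x - a)/h]
    k = np.arange(K + 1)
    coeff = np.ones(K + 1)                     # coefficients (-nu)(-nu+1)...(-nu+k-1)/k!
    for i in range(1, K + 1):
        coeff[i] = coeff[i - 1] * (i - 1 - nu) / i
    return h ** (-nu) * np.sum(coeff * f(x - k * h))

# order-0.5 derivative of f(x) = x^2 at x = 1
print(gl_fractional_derivative(lambda t: t ** 2, 1.0, 0.5))   # approximately 1.50
print(gamma(3) / gamma(2.5))                                  # closed-form reference, about 1.5045
```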
The RL fractional order derivative is given below,
$$ {}_a^{RL}D_x^{\nu} f(x) = \frac{1}{\Gamma(n-\nu)} \frac{d^n}{dx^n} \int_a^x \frac{f(y)}{(x-y)^{\nu - n + 1}} \, dy, $$
where ${}_a^{RL}D_x^{\nu}$ denotes the fractional order gradient operator based on the RL definition, $n = [\nu + 1]$ is the minimum integer greater than $\nu$, and $\Gamma(\cdot)$ is the gamma function. Furthermore, the GL fractional order derivative can be derived from the RL definition.
The Caputo fractional order derivative is given below,
$$ {}_a^{C}D_x^{\nu} f(x) = \frac{1}{\Gamma(n-\nu)} \int_a^x (x-y)^{n-\nu-1} f^{(n)}(y) \, dy, $$
where ${}_a^{C}D_x^{\nu}$ denotes the fractional order gradient operator based on the Caputo definition. In this paper, we mainly adopt the GL definition of the fractional order derivative and use the equivalent notation $D_x^{\nu} = {}_0^{GL}D_x^{\nu}$.

3. Fractional Order Graph Neural Network

In this section, we propose a regularized graph neural network based on approximate fractional order gradients to address the problem that the vanilla fractional order neural network may become stuck in its local optima. We first introduce the fractional order GNN, then extend fractional order GNN to semi-supervised tasks by employing transductive learning. Finally, the approximation strategy is proposed for calculating fractional order derivatives.

3.1. Fractional Order GNN

We first introduce backpropagation for GNNs [6]. Graph neural networks were proposed to extend neural networks into the non-Euclidean domain. In a graph $G = (V, E)$, each node $v_i \in V$ has d-dimensional features $h_i \in \mathbb{R}^{d \times 1}$, and each edge $e_{ij} \in E$ denotes the connection between node i and node j: $e_{ij} = 1$ if node i connects to node j and $e_{ij} = 0$ otherwise.
For an undirected graph G, the adjacency matrix $\mathbf{A}$ represents the set of all edges,
$$ A_{ij} = \begin{cases} 1, & e_{ij} = 1; \\ 0, & e_{ij} = 0, \end{cases} $$
where $\mathbf{A}$ is symmetric with all diagonal entries equal to zero, $A_{ii} = 0$. Then, the degree matrix $\mathbf{D}$ can be obtained, which counts the number of neighbors of each node,
$$ D_{ij} = \begin{cases} \sum_{k=1}^{|V|} A_{ik}, & i = j; \\ 0, & i \neq j, \end{cases} $$
where the diagonal entries of $\mathbf{D}$ equal the corresponding row (or column) sums of the adjacency matrix $\mathbf{A}$, and the non-diagonal entries of $\mathbf{D}$ are zeros.
All node features and the adjacency matrix are the input of the GNN. Moreover, the input features of the next layer $m+1$ come from the output of the previous layer m. Both $H^0$ and $\mathbf{A}$ are the input of the first layer,
$$ H^0 = \left[ h_1, h_2, \ldots, h_{|V|} \right]. $$
Based on the node classification task, the iterative scheme for computing the state $H^{m+1}$ is given below,
$$ Z^{m+1} = W^m H^m \tilde{A}; \quad H^{m+1} = \sigma\left(Z^{m+1}\right); \quad \hat{Y} = g\left(H^M\right), $$
where $\sigma(\cdot)$ denotes the activation function, $W^m$ is the weight parameter matrix at layer m, $g(\cdot)$ is the output classification function (such as a sigmoid or softmax layer), and $\hat{Y}$ is the prediction result. Here $\tilde{A}$ is the normalized adjacency matrix,
$$ \tilde{A} = \mathbf{D}^{-\frac{1}{2}} (\mathbf{I} + \mathbf{A}) \mathbf{D}^{-\frac{1}{2}}. $$
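For concreteness, a minimal Numpy sketch of this forward propagation (Equations (8) and (9)) on a toy graph is given below. The layer sizes, random weights, and the ReLU/softmax choices are illustrative assumptions, not the exact configuration used later in the experiments.

```python
import numpy as np

def normalize_adjacency(A):
    """A_tilde = D^{-1/2} (I + A) D^{-1/2}, with D the degree matrix of A."""
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ (np.eye(A.shape[0]) + A) @ D_inv_sqrt

def forward(H0, A_tilde, W0, W1):
    """Two-layer propagation: Z^{m+1} = W^m H^m A_tilde, H^{m+1} = sigma(Z^{m+1})."""
    Z1 = W0 @ H0 @ A_tilde
    H1 = np.maximum(Z1, 0.0)                                   # ReLU activation (illustrative)
    Z2 = W1 @ H1 @ A_tilde
    return np.exp(Z2) / np.exp(Z2).sum(axis=0, keepdims=True)  # column-wise softmax as g(.)

# toy graph: 4 nodes with 3-dimensional features and 2 classes
A = np.array([[0, 1, 0, 0], [1, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 0]], dtype=float)
H0 = np.random.randn(3, 4)                                     # H^0 = [h_1, ..., h_|V|], column-wise
W0, W1 = 0.1 * np.random.randn(8, 3), 0.1 * np.random.randn(2, 8)
print(forward(H0, normalize_adjacency(A), W0, W1).shape)       # (2, 4): class scores per node
```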
The mean square error is adopted as the loss function,
$$ \mathcal{L} = \frac{1}{2} \left( \hat{Y} - Y \right)^2, $$
where the loss function $\mathcal{L}$ evaluates the similarity between the prediction $\hat{Y}$ and the ground-truth value $Y$. Firstly, we define a factor $\delta^m$ to simplify the representation of the gradient,
$$ \delta^m = \frac{\partial \mathcal{L}}{\partial Z^m}. $$
According to Equation (10), we can obtain,
$$ \delta^M = \left( \hat{Y} - Y \right) \odot g'\left(H^M\right) \odot \sigma'\left(Z^M\right), $$
where $\odot$ denotes the Hadamard (entrywise) product and $\sigma'(\cdot)$ denotes the first order derivative of the activation function $\sigma(\cdot)$. Then the relationship between $\delta^m$ and $\delta^{m+1}$ is given by,
$$ \delta^m = \frac{\partial \mathcal{L}}{\partial Z^{m+1}} \frac{\partial Z^{m+1}}{\partial Z^m} = \left(W^m\right)^T \delta^{m+1} \tilde{A} \odot \sigma'\left(Z^m\right). $$
Calculating the derivative of $\mathcal{L}$ with respect to $W^m$,
$$ \frac{\partial \mathcal{L}}{\partial W^m} = \delta^{m+1} \frac{\partial Z^{m+1}}{\partial W^m} = \delta^{m+1} \tilde{A} \left(H^m\right)^T. $$
The weight parameters are thus updated iteratively,
$$ W^{m(+)} = W^m - \eta \frac{\partial \mathcal{L}}{\partial W^m}, \quad m = 0, 1, \ldots, M-1, $$
where $W^{m(+)}$ denotes the updated weight parameters and $\eta > 0$ is the learning rate. Similarly, the weight update of the fractional order GNN is given below,
$$ W^{m(+)} = W^m - \eta D_{W^m}^{\nu} \mathcal{L}, \quad m = 0, 1, \ldots, M-1. $$
Neural networks easily overfit on small-scale datasets. L2 regularization can efficiently alleviate overfitting without modifying the structure of the network [30]. Therefore, by introducing the L2 regularization term into the total loss, the modified loss function becomes,
$$ \mathcal{L}_{L2} = \mathcal{L} + \frac{\lambda}{2} \sum_{m=0}^{M-1} \left\| W^m \right\|^2, $$
where $\lambda$ denotes the L2 regularization coefficient. According to (17), we obtain
$$ W^{m(+)} = W^m - \eta D_{W^m}^{\nu} \mathcal{L}_{L2}, \quad m = 0, 1, \ldots, M-1. $$
The regularized fractional order GNN updates the weight parameters of graph neural networks by means of the fractional order gradient descent method.

3.2. Semi-Supervised Training Mechanism

Although the network structure has been established, the training mechanism should differ for different tasks. In semi-supervised node classification tasks, transductive learning is adopted [6,31], where all samples are visible during training but only the labeled samples are involved in the backward propagation.
In the forward propagation, all node features and the adjacency matrix are input to the network, as follows,
$$ H^0 = \mathrm{concat}\left( X_L, X_U \right), $$
where $X$ denotes the node features, and the subscripts L and U denote the labeled and unlabeled samples, used for training and testing, respectively. The function $\mathrm{concat}(\cdot)$ denotes the concatenation of matrices. Therefore, the input feature $H^0$ of the network contains both $X_L$ and $X_U$.
In the backward propagation, only the loss between the predicted results and the labels of training nodes is calculated,
$$ \mathcal{L}_{L2} = \frac{1}{2} \left( \hat{Y}_L - Y_L \right)^2 + \frac{\lambda}{2} \sum_{m=0}^{M-1} \left\| W^m \right\|^2. $$
According to the given training loss (20), the fractional order gradients of the weight parameters are calculated and the weight parameters are updated by (18).
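A minimal sketch of this transductive loss, assuming one-hot labels stored column-wise and a boolean training mask (the variable names are illustrative), is:

```python
import numpy as np

def regularized_train_loss(Y_hat, Y, train_mask, weights, lam):
    """MSE over the labeled (training) nodes only, plus an L2 penalty on all layer weights."""
    diff = (Y_hat - Y)[:, train_mask]                 # restrict the error to labeled nodes
    mse = 0.5 * np.sum(diff ** 2)
    l2 = 0.5 * lam * sum(np.sum(W ** 2) for W in weights)
    return mse + l2
```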

3.3. Approximation Strategy

The computational complexity of vanilla fractional order derivatives is usually high, given the Faà di Bruno formula for composite functions [32,33]. This subsection introduces an approximate computation for calculating fractional order derivatives. We prove in Section 4.1 that such approximation is unbiased (towards the first order optimum) and feasible for the optimization.
According to the Faà di Bruno formula, the Caputo fractional order derivative of a composite function can be expressed as follows,
$$ D_x^{\nu} f[u(x)] = \sum_{m=0}^{\infty} \frac{f^{(m)}[u(x)]}{m!} \sum_{k=m}^{\infty} \frac{\sin\left(\pi(\nu - k)\right)}{\pi(\nu - k)} \frac{\Gamma(\nu + 1)}{\Gamma(k + 1)} \, x^{k - \nu} \sum_{j=0}^{m} (-1)^j \binom{m}{j} u(x)^j \frac{d^k}{dx^k} u(x)^{m-j}. $$
We can thus obtain the approximate fractional order chain rule [20,21],
$$ D_x^{\nu} \left( f[u(x)] \right) \approx \frac{df}{du} D_x^{\nu} u(x). $$
Obviously, when $\nu = 1$, (22) holds with equality. The approximate chain-rule-based fractional order derivatives of $\mathcal{L}$ and $\mathcal{L}_{L2}$ with respect to $W^m$ are given below,
$$ D_{W^m}^{\nu} \mathcal{L} \approx \frac{\partial \mathcal{L}}{\partial W^m} \odot \frac{\left(W^m\right)^{1-\nu}}{\Gamma(2-\nu)} = \delta^{m+1} \tilde{A} \left(H^m\right)^T \odot \frac{\left(W^m\right)^{1-\nu}}{\Gamma(2-\nu)}, $$
$$ D_{W^m}^{\nu} \mathcal{L}_{L2} \approx \frac{\partial \mathcal{L}}{\partial W^m} \odot \frac{\left(W^m\right)^{1-\nu}}{\Gamma(2-\nu)} + \lambda \frac{\left(W^m\right)^{2-\nu}}{\Gamma(3-\nu)} = \delta^{m+1} \tilde{A} \left(H^m\right)^T \odot \frac{\left(W^m\right)^{1-\nu}}{\Gamma(2-\nu)} + \lambda \frac{\left(W^m\right)^{2-\nu}}{\Gamma(3-\nu)}. $$
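A minimal Numpy sketch of the approximate fractional gradient in Equation (24), assuming the first order gradient $\partial\mathcal{L}/\partial W^m$ has already been obtained by backpropagation (variable names and the regularization coefficient are illustrative; following Section 4.2, $|W|$ is used so that the fractional powers remain well defined):

```python
import numpy as np
from scipy.special import gamma

def approx_fractional_gradient(grad_W, W, nu, lam):
    """Approximate D^nu_W of the L2-regularized loss, Eq. (24): the first order gradient
    scaled by |W|^{1-nu}/Gamma(2-nu), plus lam * |W|^{2-nu}/Gamma(3-nu) for the L2 term."""
    W_abs = np.abs(W) + 1e-12                 # small epsilon keeps the power well defined at zero
    return (grad_W * W_abs ** (1.0 - nu) / gamma(2.0 - nu)
            + lam * W_abs ** (2.0 - nu) / gamma(3.0 - nu))

# weight update of Eq. (18):  W <- W - eta * D^nu_W L_L2
# W -= eta * approx_fractional_gradient(grad_W, W, nu=1.3, lam=5e-4)
```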
Based on the fractional order gradient descent method, semi-supervised training mechanism, and approximate fractional order derivatives, the detailed pipeline and algorithm of FGNN are shown in Figure 2 and Algorithm 1.
Algorithm 1: Approximate fractional order GNN
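The algorithm itself appears as a figure in the published version. The following self-contained Numpy sketch outlines its main loop on a toy graph: forward propagation over all nodes, the masked regularized loss, approximate fractional gradients, and the weight update. The graph, layer sizes, initialization, and the ReLU/sigmoid choices are illustrative assumptions rather than the exact published configuration.

```python
import numpy as np
from scipy.special import gamma

rng = np.random.default_rng(0)

# toy graph: 6 nodes, 4 features, 2 classes, with 2 labeled (training) nodes
A = np.array([[0,1,1,0,0,0],[1,0,1,0,0,0],[1,1,0,1,0,0],
              [0,0,1,0,1,1],[0,0,0,1,0,1],[0,0,0,1,1,0]], float)
X = rng.standard_normal((4, 6))                       # H^0, features stored column-wise
Y = np.zeros((2, 6)); Y[0, :3] = 1; Y[1, 3:] = 1      # one-hot labels
train_mask = np.array([True, False, False, True, False, False])

# normalized adjacency A_tilde = D^{-1/2}(I + A)D^{-1/2}
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(1)))
A_t = D_inv_sqrt @ (np.eye(6) + A) @ D_inv_sqrt

# two-layer FGNN trained with the approximate fractional gradient
W0, W1 = 0.1 * rng.standard_normal((8, 4)), 0.1 * rng.standard_normal((2, 8))
eta, nu, lam = 0.05, 1.3, 5e-4

def frac(grad, W):                                    # Eq. (24): approximate D^nu_W of L_L2
    aW = np.clip(np.abs(W), 1e-3, None)               # floor avoids huge steps near zero weights
    return grad * aW**(1 - nu) / gamma(2 - nu) + lam * aW**(2 - nu) / gamma(3 - nu)

for epoch in range(200):
    # forward: Z^{m+1} = W^m H^m A_tilde, H^{m+1} = sigma(Z^{m+1})
    Z1 = W0 @ X @ A_t; H1 = np.maximum(Z1, 0)
    Z2 = W1 @ H1 @ A_t; Y_hat = 1 / (1 + np.exp(-Z2)) # sigmoid output layer
    # masked MSE error over training nodes only
    E = (Y_hat - Y) * train_mask
    # backward: delta^M = (Y_hat - Y) * sigma'(Z^M), then propagate to W^1 and W^0
    d2 = E * Y_hat * (1 - Y_hat)
    gW1 = d2 @ A_t @ H1.T
    d1 = (W1.T @ d2 @ A_t) * (Z1 > 0)
    gW0 = d1 @ A_t @ X.T
    # fractional order weight updates, Eq. (18)
    W1 -= eta * frac(gW1, W1)
    W0 -= eta * frac(gW0, W0)

print("final masked training MSE:", 0.5 * np.sum(E ** 2))
```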

4. Theoretical Analysis

This section first proves the feasibility of the approximate fractional order GNN and then presents the convergence proof of the proposed FGNN.

4.1. Approximation and Feasibility Proof

It has been revealed that the fractional order extreme points do not coincide with the first order extreme points [19]. Therefore, the fractional order gradient has the potential to escape from a first order local optimum. Nevertheless, such a gradient method cannot guarantee convergence to the true extreme point. To solve this problem, the truncation and approximation principle [20,34,35] was proposed to modify the fractional order gradient method so that it reaches the true extreme point just like the integer order method, but with faster convergence than the first order method.
Here we prove the approximation and feasibility. The truncation and approximation principle can be divided into two types. The first type is the chain rule approximation [20,21] (22), where the fractional order chain rule of the composite function $f[u(x)]$ is approximately decomposed into the product of the first order derivative $\frac{df}{du}$ and the fractional order derivative $D_x^{\nu} u(x)$. If $u(x) = x$, the fractional order derivative of the composite function reduces to,
$$ D_x^{\nu} f[u(x)] \approx \frac{df}{dx} \frac{x^{1-\nu}}{\Gamma(2-\nu)}. $$
The second type is the approximate expansion [34,35],
$$ {}_{x_{k-1}}D_{x_k}^{\nu} f(x) = \sum_{i=1}^{\infty} \frac{f^{(i)}(x_{k-1})}{\Gamma(i + 1 - \nu)} \left( x_k - x_{k-1} \right)^{i-\nu} \approx \frac{f^{(1)}(x_{k-1})}{\Gamma(2-\nu)} \left( x_k - x_{k-1} \right)^{1-\nu}, $$
where the fractional order derivative of the function $f(x)$ is decomposed into its Taylor expansion. The first term significantly dominates the searching direction; therefore, researchers mainly focus on the first order truncation.
Both types of fractional order derivatives can be written as the product of the first order derivative $f^{(1)}(x)$ and a power function of the variable x. Therefore, their extreme points contain the first order one. Furthermore, if $x < 1$, (25) takes a greater step than the first order gradient, and if $x_k - x_{k-1} < 1$, (26) also takes a greater step than the first order gradient.
Figure 3 shows the process of minimizing the function $f(x) = x^2 - x$ with different gradient descent methods: the first order gradient, the fractional order gradient based on the approximate chain rule, the fractional order gradient based on the approximate expansion, and the accurate fractional order gradient. The first row shows the searching process on $f(x) = x^2 - x$ from the starting point (red square) to the endpoint (black circle). The second row shows the objective value during 20 iterations. The first order and fractional order extreme points of this quadratic system are different: $x_1^* = 0.5$ and $x_{\nu}^* = \frac{\Gamma(3-\nu)}{2\Gamma(2-\nu)}$, so $x_1^* \neq x_{\nu}^*$ if $\nu \neq 1$. For the mentioned approximate fractional order derivatives, the iterative values have been calculated, and the results confirm the consistency between the first order and approximate fractional order extreme points. Therefore, the approximate fractional order extrema are unbiased towards the first order extrema but biased towards the accurate fractional order extrema. Furthermore, these two approximation strategies converge faster than the first order method, whereas the approximate expansion strategy needs a suitable initial value $x_0$, to which the optimization performance is sensitive. Therefore, we choose the approximate chain rule strategy (25).
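The following short sketch reproduces the flavor of this comparison for $f(x) = x^2 - x$, contrasting the first order gradient with the approximate chain rule gradient (25); the starting point, learning rate, and order are illustrative choices.

```python
import numpy as np
from scipy.special import gamma

f = lambda x: x ** 2 - x
df = lambda x: 2 * x - 1
nu, eta, steps = 1.3, 0.1, 20
x_first = x_frac = 2.0                                   # common starting point

for _ in range(steps):
    x_first -= eta * df(x_first)                                           # first order step
    x_frac -= eta * df(x_frac) * abs(x_frac) ** (1 - nu) / gamma(2 - nu)   # approximate rule (25)

print(x_first, x_frac)                          # both iterates approach the first order extremum 0.5
print(gamma(3 - nu) / (2 * gamma(2 - nu)))      # exact fractional order extremum, which differs from 0.5
```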
Since the example above is explicit, the analytic expressions of the first order and fractional order derivatives are available. However, for complex implicit functions, we need to use the chain rule or a difference method to calculate the first order and fractional order derivatives. For an implicit function $f[u(x)]$, assuming for ease of calculation that the function value at each point is known, the first order derivative is,
$$ \frac{d f[u(x)]}{dx} = \frac{f[u(x)] - f[u(x-h)]}{h}. $$
Then the approximate fractional order derivative is,
$$ D_x^{\nu} f[u(x)] \approx \frac{d f[u(x)]}{dx} \frac{x^{1-\nu}}{\Gamma(2-\nu)} = \frac{f[u(x)] - f[u(x-h)]}{h} \frac{x^{1-\nu}}{\Gamma(2-\nu)}. $$
Furthermore, the accurate fractional order derivative is,
$$ D_x^{\nu} f[u(x)] = \sum_{i=1}^{\infty} \frac{f^{(i)}[u(x)]}{\Gamma(i + 1 - \nu)} h^{i-\nu} \approx \sum_{i=1}^{N} \frac{f^{(i)}[u(x)]}{\Gamma(i + 1 - \nu)} h^{i-\nu}, $$
where the higher order terms (of order greater than N) are treated as zeros if $f^{(N+1)}[u(x)] = 0$ or $h^{N-\nu} \to 0$.
From these three derivatives of the complex implicit function $f[u(x)]$, we see that the first order derivative needs one addition and one multiplication, and the approximate fractional order derivative needs one addition and three multiplications. For the accurate fractional order derivative, the i-th term ($i \in \{1, 2, \ldots, N\}$) is given as follows,
$$ \frac{f^{(i)}[u(x)]}{\Gamma(i + 1 - \nu)} h^{i-\nu} = \frac{f^{(i-1)}[u(x)] - f^{(i-1)}[u(x-h)]}{\Gamma(i + 1 - \nu)} h^{i-\nu-1}, $$
which needs $2^i - 1$ additions and two multiplications. Therefore, the accurate fractional order derivative needs $2^{N+1} - 3$ additions and $2N$ multiplications in total. The addition and multiplication complexity of the above three derivatives is summarized in Table 2. The approximate fractional order derivative has a computational complexity close to the first order one, while the computational complexity of the accurate fractional order derivative is much higher than the first order one.

4.2. Convergence Proof of the Fractional Gradient Descent Method

Although the fractional order gradient descent method has promising performance in dealing with local optimum problems theoretically, the convergence of such a mechanism remains to be proved.
For the sake of simple expression, we utilize the following notations,
$$ \Delta W = W^{(+)} - W = -\eta D_W^{\nu} \mathcal{L}. $$
Without loss of generality, we analyze the weight parameters $W^m$ of the m-th layer, simply denoted as $W$. The superscript k denotes the k-th iteration.
Lemma 1.
The MSE set $\{\mathcal{L}(W^k)\}$, $(k = 1, 2, \ldots, T)$ is monotonically decreasing, i.e.,
$$ \mathcal{L}\left(W^{k+1}\right) \leq \mathcal{L}\left(W^{k}\right). $$
Proof. 
By using the Taylor mean value theorem with Lagrange remainder,
$$ \Delta = \mathcal{L}\left(W^{k+1}\right) - \mathcal{L}\left(W^{k}\right) = \mathcal{L}' \, \Delta W^k + \frac{1}{2} \mathcal{L}'' \left( \Delta W^k \right)^2, $$
where $\mathcal{L}'$ and $\mathcal{L}''$ denote the first and second order derivatives of $\mathcal{L}$, and $\left(\Delta W^k\right)^2$ is the second order term in $\Delta W^k$.
Substituting (23) and (31) into (32),
$$ \Delta = \mathcal{L}' \left( -\eta D_{W^k}^{\nu} \mathcal{L} \right) + \frac{1}{2} \mathcal{L}'' \left( \Delta W^k \right)^2 = -\frac{\eta}{\Gamma(2-\nu)} \left( \mathcal{L}' \right)^2 \left| W^k \right|^{1-\nu} + \frac{1}{2} \mathcal{L}'' \left( \Delta W^k \right)^2 \leq \left[ -\frac{1}{\eta} \Gamma(2-\nu) \left| W^k \right|^{\nu-1} + C_1 \right] \left( \Delta W^k \right)^2, $$
where $C_1 \geq \frac{1}{2} \mathcal{L}''$. Generally, in the computation of the fractional order derivative, we select $\nu \in [0, 2)$ and replace $W^k$ with $\left| W^k \right|$, which guarantees $\Gamma(2-\nu) > 0$. The bracketed factor is non-positive, and hence $\Delta \leq 0$, when the learning rate satisfies the upper bound $\eta \leq \frac{\Gamma(2-\nu) \left| W^k \right|^{\nu-1}}{C_1}$.
The proof of Lemma 1 is completed. □
Lemma 2.
The MSE function set $\{\mathcal{L}(W^k)\}$, $(k = 1, 2, \ldots, T)$ is convergent, i.e.,
$$ \lim_{k \to \infty} \mathcal{L}\left(W^{k}\right) = \mathcal{L}^{*}. $$
Proof. 
$\mathcal{L}(W^k)$ is a squared error, thus $\mathcal{L}(W^k) \geq 0$. Applying Lemma 1, $\mathcal{L}(W^{k+1}) \leq \mathcal{L}(W^k)$. Hence there exists $\mathcal{L}^{*} \geq 0$ satisfying,
$$ \lim_{k \to \infty} \mathcal{L}\left(W^{k}\right) = \mathcal{L}^{*}. $$
The proof of Lemma 2 is completed. □
Lemma 3.
The fractional order gradients $\left\{ D_W^{\nu} \mathcal{L}\left(W^{k}\right) \right\}$, $(k = 1, 2, \ldots, T)$ converge to zero, i.e.,
$$ \lim_{k \to \infty} D_W^{\nu} \mathcal{L}\left(W^{k}\right) = 0. $$
Proof. 
According to (33), let $\beta = \frac{1}{\eta} \Gamma(2-\nu) \left| W^k \right|^{\nu-1} - C_1 \geq 0$; then,
$$ \mathcal{L}\left(W^{k+1}\right) \leq \mathcal{L}\left(W^{k}\right) - \beta \left( \Delta W^{k} \right)^2 \leq \mathcal{L}\left(W^{k-1}\right) - \beta \left( \Delta W^{k-1} \right)^2 - \beta \left( \Delta W^{k} \right)^2 \leq \cdots \leq \mathcal{L}\left(W^{0}\right) - \beta \sum_{i=0}^{k} \left( \Delta W^{i} \right)^2. $$
Because $\mathcal{L}\left(W^{k+1}\right) \geq 0$, therefore,
$$ \beta \sum_{i=0}^{k} \left( \Delta W^{i} \right)^2 \leq \mathcal{L}\left(W^{0}\right). $$
Letting $k \to \infty$, it holds that,
$$ \sum_{i=0}^{\infty} \left( \Delta W^{i} \right)^2 < \infty. $$
According to the principle of series convergence, the general term satisfies,
$$ \lim_{k \to \infty} \Delta W^{k} = 0. $$
Meanwhile,
$$ -\eta D_W^{\nu} \mathcal{L}\left(W^{k}\right) = \Delta W^{k}; $$
$$ \lim_{k \to \infty} D_W^{\nu} \mathcal{L}\left(W^{k}\right) = 0. $$
The proof of Lemma 3 is completed. □
Based on these three lemmas, we conclude that the fractional order gradient method converges when $\eta \leq \frac{\Gamma(2-\nu) \left| W^k \right|^{\nu-1}}{C_1}$.

5. Experiments

In this section, we evaluate the proposed FGNN on citation and community networks to demonstrate its effectiveness. First, we introduce three citation network datasets and two community network datasets. Second, we give implementation details and evaluation metrics. Third, seven comparative methods are introduced as the baselines. Finally, we discuss the effectiveness of the proposed FGNN by quantitatively and qualitatively comparing our proposed FGNN against the baseline models.

5.1. Citation and Community Network Datasets

Following the experimental setup in [36], we adopt the citation network datasets Cora, Citeseer [37], and Pubmed [38] to establish the benchmark, where articles are regarded as nodes and citation links between articles are regarded as undirected edges. These edges form the binary and symmetric adjacency matrix. The datasets contain sparse bag-of-words features or Term Frequency-Inverse Document Frequency (TF-IDF) vectors for each article and a list of citation links between different articles. The numbers of nodes, edges, features, and classes are shown in Table 3. Twenty labeled nodes per class are used for training, and 500 and 1000 labeled nodes are used for validation and testing, respectively.
In addition, we further introduce two community network datasets, Karate club and Football [39,40] (see Figure 4). In Karate club, each node denotes a club member and is labeled with the club to which it belongs, either 'Mr. Hi' or 'Officer'. In Football, each node denotes an American university and is labeled with the conference to which it belongs. In both datasets, 10% of the samples are used for training and the remaining 90% for testing. Unlike the citation network datasets, the community network datasets have no node features; therefore, identity matrices of the corresponding sizes serve as the input node features for Karate club and Football. The numbers of nodes, edges, features, and classes are shown in Table 4.

5.2. Implementation Details and Evaluation Metrics

We implemented our work with Numpy. The first order and fractional order backward propagations are derived manually rather than computed with automatic differentiation frameworks such as PyTorch and TensorFlow. FGNN and GNN are trained on the citation networks for 5000 epochs with a learning rate of $\eta = 0.006$. Both FGNN and GNN contain two graph convolution layers with 128 hidden neurons. The model parameters are saved whenever the model achieves better recognition accuracy on the validation set. The performance of FGNN with different fractional orders ($\nu \in \{0.5, 0.6, \ldots, 1.7\}$) is evaluated.
For quantitative assessment, recognition accuracy is used as the main evaluation metric for all baseline models and our proposed model,
$$ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, $$
where TP, FP, TN, and FN are the number of true positives, false positives, true negatives, and false negatives, respectively.
To describe the performance difference between GNN and FGNN in more detail, we also adopt the area under the curve (AUC), defined as the area under the receiver operating characteristic (ROC) curve, as an evaluation metric.
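A minimal sketch of these two metrics, assuming predicted labels, one-vs-rest ground-truth indicators, and class scores are available as Numpy arrays, and using scikit-learn's ROC utilities (an illustrative tooling choice; the paper's pipeline is implemented in plain Numpy):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def accuracy(y_true, y_pred):
    """Fraction of correctly classified nodes, i.e. (TP + TN) / (TP + TN + FP + FN)."""
    return np.mean(y_true == y_pred)

def roc_auc(y_true_binary, y_score):
    """Area under the ROC curve for one class (one-vs-rest)."""
    fpr, tpr, _ = roc_curve(y_true_binary, y_score)
    return auc(fpr, tpr)
```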

5.3. Comparative Methods

Seven methods are selected as baseline models: label propagation (LP) [41], semi-supervised embedding (SemiEmb) [42], manifold regularization (ManiReg) [43], skip-gram based graph embeddings (DeepWalk) [44], the iterative classification algorithm (ICA), Planetoid [36], and the vanilla graph neural network (GNN) [6]. The recognition performance of the first six methods is cited from [6], while vanilla GNN and our proposed FGNN are trained without advanced optimizers, employing only the first order and fractional order gradient descent algorithms, respectively.

5.4. Accuracy Comparison on Citation Network

A group of experiments verifies the proposed FGNN on the citation network datasets. To investigate the impact of the fractional order on FGNN, we also carried out experiments with different orders. Finally, we discuss the influence of L2 regularization on model performance.
Table 5 presents the recognition accuracy of the baseline models and FGNN with order $\nu = 1.3$ on Cora, Citeseer, and Pubmed. FGNN obtains the best performance among the compared models, achieving 80.9%, 70.0%, and 79.4%, respectively. Note that Cora and Citeseer have seven and six categories with higher feature dimensions and smaller graph sizes, while Pubmed has three categories with a lower feature dimension and a larger graph size. A total of 140 and 120 nodes are used for training on the first two datasets, while 60 nodes are used for Pubmed. Our proposed FGNN is better at handling higher feature dimensions and more training samples, with improvements of 1.0-21.4% and 0.6-9.9% on these two datasets, respectively. For Pubmed, FGNN has a smaller performance gain of 0.2-8.7% with limited training samples and feature dimensions.
Figure 5 shows the recognition accuracy of FGNN with different orders $\nu \in \{0.5, \ldots, 1.7\}$ on Cora, Citeseer, and Pubmed. With the order increasing from 0.5 to 1.0, the recognition accuracy of FGNN increases from 53.9%, 25.4%, and 75.5% to 79.9%, 69.3%, and 79.2%. With the order increasing from 1.3 to 1.7, the recognition accuracy of FGNN decreases from 80.9%, 70.0%, and 79.4% to 77.3%, 38.5%, and 36.5%. For Cora, FGNN achieves its best performance of 80.9% with $\nu = 1.3$, a gain of 1.0% over GNN; for Citeseer, FGNN achieves its best performance of 70.4% with $\nu = 1.2$, a gain of 1.1% over GNN; for Pubmed, FGNN achieves its best performance of 79.4% with $\nu = 1.2, 1.3$, a gain of 0.2% over GNN.
To illustrate the effectiveness of L2 regularization, the recognition accuracy of GNN and FGNN with and without L2 regularization is shown in Figure 6. On these three citation network datasets, L2 regularization plays a significant role in improving the recognition performance of both GNN and FGNN. For GNN, it improves accuracy by 9.6%, 0.3%, and 6.2% on Cora, Citeseer, and Pubmed; for FGNN, it improves accuracy by 3.3%, 0.7%, and 6.0%. Furthermore, whether L2 regularization is used or not, the recognition accuracy of the proposed FGNN outperforms that of vanilla GNN.
It is worth mentioning that we did not tune the network structure and parameters against specific datasets, and thus further fine-tuning may lead to even greater performance gains. However, this is beyond the scope of this paper.

5.5. Convergence Analysis on Citation Network

A group of experiments examines the MSE loss and recognition accuracy curves of vanilla GNN and our proposed FGNN, then describes the ROC curves and AUC values, and further visualizes the node features of FGNN.
Table 6 shows the convergence time of GNN and FGNN on the citation network datasets. FGNN converges after 11.83 s, 14.25 s, and 15.27 s on Cora, Citeseer, and Pubmed, respectively, while GNN converges after 16.08 s, 34.62 s, and 25.13 s. These experimental results show that FGNN converges faster than vanilla GNN. From a theoretical perspective, the approximate fractional order derivative has a computational complexity similar to the first order one, while taking a greater iterative step under the same learning rate. Therefore, it has been shown both experimentally and theoretically that FGNN converges faster than vanilla GNN.
Figure 7 shows the MSE loss and recognition accuracy curves of GNN and FGNN on Cora and Citeseer. The horizontal axis denotes training epochs, the left vertical axis denotes the MSE loss, and the right vertical axis denotes the recognition accuracy. As the training epochs increase, the MSE losses of GNN and FGNN decrease while the recognition accuracies increase, until they converge to the extreme point. Furthermore, FGNN converges faster and achieves higher performance than vanilla GNN.
Figure 8 shows the ROC curves of GNN and FGNN on Cora and Citeseer. The horizontal axis denotes the false positive rate and the vertical axis denotes the true positive rate. The curve of FGNN is closer to the upper left corner than that of GNN on Cora and Citeseer, so the proposed FGNN has a greater AUC value and better recognition performance. In particular, FGNN achieves AUC values of 0.97 and 0.93, while GNN achieves 0.96 and 0.91.
Figure 9 presents dimensionality reduction visualizations of the node features at the input, hidden, and output layers of FGNN on Cora, Citeseer, and Pubmed. Nodes with different colors represent different categories, and the position of each point is the 2D t-SNE (t-distributed Stochastic Neighbor Embedding) projection of its node features. The visualization shows that the various types of nodes in the input layer are mingled in the space and cannot be directly classified. As the nodes are embedded by the graph layers, nodes from the same category gradually cluster, and the mixing of different node types in the hidden layer is relatively reduced. In the output layer, all kinds of nodes are clustered near their category centers, but some nodes still lie in the cross-region between different clusters; the embedding and information aggregation of these nodes will be the key to further improving the recognition rate.
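A minimal sketch of this visualization step, assuming the layer activations are available as Numpy arrays (features stored column-wise as in Section 3.1) and using scikit-learn's t-SNE with matplotlib (illustrative tooling choices):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_layer_embedding(H, labels, title):
    """Project node features H (here one row per node) to 2D with t-SNE and color by class."""
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(H)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=8)
    plt.title(title)
    plt.show()

# e.g. plot_layer_embedding(H_hidden.T, node_labels, "hidden layer")   # transpose: columns are nodes
```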

5.6. Accuracy Comparison on Community Network

To further illustrate the scalability of the proposed model, we evaluate GNN and FGNN on the community network datasets. Table 7 shows the recognition accuracy of GNN and FGNN on these datasets. Compared with Karate club, the node classification task on Football is more difficult because it has twelve categories, compared with Karate club's two. FGNN achieves recognition accuracies of 96.67% and 61.17%, with performance gains of 3.34% and 9.71% over vanilla GNN on Karate club and Football, respectively. Therefore, the proposed FGNN shows good scalability across five datasets from two different domains, citation networks and community networks.

6. Discussion

Although the proposed FGNN achieves promising performance, there are some potential limitations. First, the comparative results are obtained with the first order and fractional order gradient descent methods without advanced optimizers; such optimizers could further improve model performance. Second, the proposed model contains only two graph convolution layers because of the over-smoothing limitation of graph models; deeper graph models should be explored.

7. Conclusions

In this paper, we generalize the fractional order gradient descent method to the regularized graph neural network. To address the local optimality and high complexity problems of fractional order GNNs, we propose an approximate fractional order mechanism to underpin the GNN. We further prove the feasibility of this approximation and its unbiasedness towards the first order optimum, and the theoretical analysis guarantees the convergence of the proposed FGNN. Experiments on node classification tasks demonstrate the performance of the proposed FGNN: it outperforms the baseline models and converges faster than vanilla GNN.

Author Contributions

Conceptualization, Z.L. and C.L.; methodology, Z.L.; software, Z.L. and Y.W.; validation, Z.L. and Y.W.; formal analysis, Z.L., Y.L. and C.L.; investigation, Z.L.; resources, Y.L. and C.L.; data curation, Z.L. and Y.W.; writing—original draft preparation, Z.L.; writing—review and editing, Z.L., Y.W., Y.L. and C.L.; visualization, Z.L.; funding acquisition, C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under grant No. 61871096.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code for this study is available at https://www.github.com/liu6zijian/frac_GNN (accessed on 20 March 2022). The citation network datasets are openly available at https://www.github.com/kimiyoung/planetoid/tree/master/data (accessed on 20 March 2022).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
GSP | Graph signal processing
GNN | Graph neural network
FGNN | Regularized graph neural network based on fractional order gradients
NN | Neural network
MSE | Mean square error
ROC | Receiver operating characteristic curve
AUC | Area under the ROC curve
t-SNE | t-distributed Stochastic Neighbor Embedding

References

  1. Bi, Z.; Zhang, T.; Zhou, P.; Li, Y. Knowledge transfer for out-of-knowledge-base entities: Improving graph-neural-network-based embedding using convolutional layers. IEEE Access 2020, 8, 159039–159049.
  2. Khalil, E.; Dai, H.; Zhang, Y.; Dilkina, B.; Song, L. Learning combinatorial optimization algorithms over graphs. Adv. Neural Inf. Process. Syst. 2017, 30, 5358–6348.
  3. Hamilton, W.; Ying, Z.; Leskovec, J. Inductive representation learning on large graphs. Adv. Neural Inf. Process. Syst. 2017, 30, 1024–1034.
  4. Fout, A.; Byrd, J.; Shariat, B.; Ben-Hur, A. Protein interface prediction using graph convolutional networks. Adv. Neural Inf. Process. Syst. 2017, 30, 6530–6539.
  5. Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. The graph neural network model. IEEE Trans. Neural Netw. 2008, 20, 61–80.
  6. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907.
  7. Such, F.P.; Sah, S.; Dominguez, M.A.; Pillai, S.; Zhang, C.; Michael, A.; Cahill, N.D.; Ptucha, R. Robust spatial filtering with graph convolutional neural networks. IEEE J. Sel. Top. Signal Process. 2017, 11, 884–896.
  8. Monti, F.; Boscaini, D.; Masci, J.; Rodola, E.; Svoboda, J.; Bronstein, M.M. Geometric deep learning on graphs and manifolds using mixture model cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5115–5124.
  9. Atwood, J.; Towsley, D. Diffusion-convolutional neural networks. Adv. Neural Inf. Process. Syst. 2016, 29, 1993–2001.
  10. Bruna, J.; Zaremba, W.; Szlam, A.; LeCun, Y. Spectral networks and locally connected networks on graphs. arXiv 2013, arXiv:1312.6203.
  11. Henaff, M.; Bruna, J.; LeCun, Y. Deep convolutional networks on graph-structured data. arXiv 2015, arXiv:1506.05163.
  12. Defferrard, M.; Bresson, X.; Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. Adv. Neural Inf. Process. Syst. 2016, 29, 3844–3852.
  13. Niepert, M.; Ahmed, M.; Kutzkov, K. Learning convolutional neural networks for graphs. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; pp. 2014–2023.
  14. Pascanu, R.; Mikolov, T.; Bengio, Y. On the difficulty of training recurrent neural networks. In Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA, 17–19 June 2013; pp. 1310–1318.
  15. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105.
  16. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  17. Gundersen, G.; Steihaug, T. On large-scale unconstrained optimization problems and higher order methods. Optim. Methods Softw. 2010, 25, 337–358.
  18. Song, C.; Cao, J. Dynamics in fractional-order neural networks. Neurocomputing 2014, 142, 494–498.
  19. Pu, Y.F.; Zhou, J.L.; Zhang, Y.; Zhang, N.; Huang, G.; Siarry, P. Fractional extreme value adaptive training method: Fractional steepest descent approach. IEEE Trans. Neural Netw. Learn. Syst. 2013, 26, 653–662.
  20. Wang, J.; Wen, Y.; Gou, Y.; Ye, Z.; Chen, H. Fractional-order gradient descent learning of BP neural networks with Caputo derivative. Neural Netw. 2017, 89, 19–30.
  21. Bao, C.; Pu, Y.; Zhang, Y. Fractional-order deep backpropagation neural network. Comput. Intell. Neurosci. 2018, 2018, 7361628.
  22. Khan, S.; Naseem, I.; Malik, M.A.; Togneri, R.; Bennamoun, M. A fractional gradient descent-based rbf neural network. Circuits Syst. Signal Process. 2018, 37, 5311–5332.
  23. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
  24. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903.
  25. Peng, N.; Poon, H.; Quirk, C.; Toutanova, K.; Yih, W.t. Cross-sentence n-ary relation extraction with graph lstms. Trans. Assoc. Comput. Linguist. 2017, 5, 101–115.
  26. Li, Y.; Tarlow, D.; Brockschmidt, M.; Zemel, R. Gated graph sequence neural networks. arXiv 2015, arXiv:1511.05493.
  27. Zhang, Y.; Liu, Q.; Song, L. Sentence-state lstm for text representation. arXiv 2018, arXiv:1805.02474.
  28. Nishimoto, K. Fractional Calculus: Integrations and Differentiations of Arbitrary Order; Descartes Press: Peterborough, UK, 1984; Volume 5.
  29. Podlubny, I. Fractional Differential Equations: An Introduction to Fractional Derivatives, Fractional Differential Equations, to Methods of Their Solution and Some of Their Applications; Elsevier: Amsterdam, The Netherlands, 1998.
  30. Phaisangittisagul, E. An analysis of the regularization between L2 and dropout in single hidden layer neural network. In Proceedings of the 2016 7th International Conference on Intelligent Systems, Modelling and Simulation (ISMS), Bangkok, Thailand, 25–27 January 2016; pp. 174–179.
  31. Ueffing, N.; Haffari, G.; Sarkar, A. Transductive learning for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, Prague, Czech Republic, 25–27 June 2007; pp. 25–32.
  32. Jordan, C.; Jordán, K. Calculus of Finite Differences; American Mathematical Soc.: Providence, RI, USA, 1965; Volume 33.
  33. Shchedrin, G.; Smith, N.C.; Gladkina, A.; Carr, L.D. Fractional derivative of composite functions: Exact results and physical applications. arXiv 2018, arXiv:1803.05018.
  34. Chen, Y.; Gao, Q.; Wei, Y.; Wang, Y. Study on fractional order gradient methods. Appl. Math. Comput. 2017, 314, 310–321.
  35. Sheng, D.; Wei, Y.; Chen, Y.; Wang, Y. Convolutional neural networks with fractional order gradient method. Neurocomputing 2020, 408, 42–50.
  36. Yang, Z.; Cohen, W.; Salakhudinov, R. Revisiting semi-supervised learning with graph embeddings. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; pp. 40–48.
  37. Sen, P.; Namata, G.; Bilgic, M.; Getoor, L.; Galligher, B.; Eliassi-Rad, T. Collective classification in network data. AI Mag. 2008, 29, 93.
  38. Namata, G.; London, B.; Getoor, L.; Huang, B.; EDU, U. Query-driven active surveying for collective classification. In Proceedings of the 10th International Workshop on Mining and Learning with Graphs, Edinburgh, Scotland, UK, 1 July 2012; Volume 8, p. 1.
  39. Zachary, W.W. An information flow model for conflict and fission in small groups. J. Anthropol. Res. 1977, 33, 452–473.
  40. Girvan, M.; Newman, M.E. Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA 2002, 99, 7821–7826.
  41. Zhu, X.; Ghahramani, Z.; Lafferty, J.D. Semi-supervised learning using gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), Washington, DC, USA, 21–24 August 2003; pp. 912–919.
  42. Weston, J.; Ratle, F.; Mobahi, H.; Collobert, R. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade; Springer: Berlin/Heidelberg, Germany, 2012; pp. 639–655.
  43. Belkin, M.; Niyogi, P.; Sindhwani, V. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. J. Mach. Learn. Res. 2006, 7, 2399–2434.
  44. Perozzi, B.; Al-Rfou, R.; Skiena, S. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 24–27 August 2014; pp. 701–710.
Figure 1. Demonstration of fractional order local optimal points of the fractional order derivatives of $0.1|x|\cos(x)$ with different orders $\nu$. The horizontal axis denotes the variable x over the range $[-8, 8]$. The vertical axis denotes the fractional order derivatives with different orders $\nu \in \{0, 0.2, 0.4, 0.6, 0.8, 1.0\}$. The two circles denote two local extrema of different order derivatives ($\nu = 1$, left; $\nu = 0.4$, right).
Figure 2. The pipeline of the graph transductive learning on semi-supervised node classification tasks.
Figure 3. Illustration of minimizing the function $f(x) = x^2 - x$ with different gradient descent methods: (first column) the first order gradient, (second column) the fractional order gradient based on the approximate chain rule (25), (third column) the fractional order gradient based on the approximate expansion (26), and (last column) the accurate fractional order gradient.
Figure 4. Topology of community network datasets (a) Karate club and (b) Football.
Figure 5. Recognition accuracy (%) of FGNN with different fractional orders $\nu \in [0.5, 1.7]$ on the citation network datasets.
Figure 6. Testing recognition accuracy (%) of GNN and FGNN with or without L2 regularization on the citation network datasets (a) Cora, (b) Citeseer, and (c) Pubmed.
Figure 7. Testing MSE loss and recognition accuracy (%) curves of GNN and FGNN on (a) Cora and (b) Citeseer. The horizontal axis denotes training epochs in the semi-supervised tasks. The left vertical axis denotes the MSE loss. The right vertical axis denotes the recognition accuracy.
Figure 8. Testing ROC curves of GNN and FGNN on (a) Cora and (b) Citeseer. The horizontal axis denotes the false positive rate. The vertical axis denotes the true positive rate.
Figure 9. Dimensionality reduction visualization results of node features with FGNN on Cora, Citeseer, and Pubmed (input layer, hidden layer, and output layer). Nodes with different colors represent different categories, and the distance of each point is the 2D visualization of node features with the t-SNE method.
Table 1. Summary of the key symbols.
Notation | Description
$V$ | vertex set of a graph
$E$ | edge set of a graph
$H^m$ | features of all nodes at the m-th layer
$\mathbf{A}$ | adjacency matrix of a graph
$\mathbf{D}$ | degree matrix of a graph
$\sigma(\cdot)$ | activation function
$W^m$ | weight parameters of the m-th layer
$\tilde{A}$ | normalized adjacency matrix
$\hat{Y}$ | predicted values
$Y$ | ground-truth values
$\mathcal{L}$ | loss function
$\eta$ | learning rate
$W^{m(+)}$ | updated weights of the m-th layer
$D^{\nu}$ | fractional order gradient operator
$\nu$ | arbitrary real order
$a$ | initial point of the fractional order derivative
$\odot$ | Hadamard product
$t$ | index of the current iteration epoch
$T$ | iterative upper bound
$\lambda$ | L2 regularization coefficient
$\mathcal{L}_{L2}$ | loss function with L2 regularization
Table 2. The addition and multiplication complexity of the first order, the approximate fractional order, and the accurate fractional order derivatives.
Derivative | Additions | Multiplications
First order | 1 | 1
Approximate fractional order | 1 | 3
Accurate fractional order | $2^{N+1} - 3$ | $2N$
Table 3. Information of the citation network datasets Cora, Citeseer, and Pubmed.
Property | Cora | Citeseer | Pubmed
Nodes | 2708 | 3327 | 19,717
Edges | 5429 | 4732 | 44,338
Features | 1433 | 3703 | 500
Classes | 7 | 6 | 3
Training nodes | 140 | 120 | 60
Validation nodes | 500 | 500 | 500
Test nodes | 1000 | 1000 | 1000
Table 4. Information of the community network datasets Karate club and Football.
Property | Karate Club | Football
Nodes | 34 | 115
Edges | 78 | 613
Features | Null | Null
Classes | 2 | 12
Training nodes | 4 | 12
Testing nodes | 30 | 103
Table 5. Recognition accuracy (%) of the baseline models and FGNN on the citation network datasets Cora, Citeseer, and Pubmed.
Dataset | ManiReg | SemiEmb | LP | DeepWalk | ICA | Planetoid | GNN | FGNN (Ours)
Cora | 59.5 | 59.0 | 68.0 | 67.2 | 75.1 | 75.7 | 79.9 | 80.9
Citeseer | 60.1 | 59.6 | 45.3 | 43.2 | 69.1 | 64.7 | 69.4 | 70.0
Pubmed | 70.7 | 71.1 | 63.0 | 65.3 | 73.9 | 77.2 | 79.2 | 79.4
Table 6. Convergence time of GNN and FGNN on the citation network datasets.
Method | Cora | Citeseer | Pubmed
GNN | 16.08 s | 34.62 s | 25.13 s
FGNN | 11.83 s | 14.25 s | 15.27 s
Table 7. Recognition accuracy (%) of GNN and FGNN on the community network datasets.
Method | Karate Club | Football
GNN | 93.33 | 51.46
FGNN | 96.67 | 61.17
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
