1. Introduction
In recent years, graph neural networks (GNNs) [1] have exhibited outstanding performance in processing graph-structured data, and graph convolutional networks (GCNs) [2] have emerged as the most popular and widely used variant due to their efficiency, scalability, and ease of application in domains such as social networks [3] and chemical molecules [4]. Furthermore, the GCN serves as the fundamental building block of numerous more complex models [5,6,7]. Although GCNs have been highly successful, the growing need to process large, sparse, and non-linearly separable graphs, together with higher demands from various applications, calls for models that can aggregate multi-hop neighbors at greater distances and improve representation power with a larger number of parameters. When processing large and sparse graph-structured data, the limitations of shallow GCN models become apparent: they cannot access and aggregate information from distant neighbors, and their limited number of parameters restricts their feature extraction ability. Deep GCN models have potential advantages in perceiving more graph-structured information and obtaining higher expressive power, and they are an important current research topic. Many real-world applications involve large-scale graph data, such as social networks, recommendation systems [8], bioinformatics, and transportation networks. In these applications, deep GCN models can better capture the complex relationships between nodes, thereby improving the effectiveness of graph data analysis. Deep GCN models can also be applied to fields that require the processing of time-series data, such as traffic prediction and financial risk prediction. The motivation for studying deep GCN models is therefore to improve the efficiency and accuracy of large-scale graph data analysis and to enable broader applications. It is worth noting that deep residual networks for image classification can have up to 152 layers [9], while Transformer models for natural language processing can reach 1000 layers [10]; these studies provide evidence that deeper architectures can enhance the representation power of models. This motivates us to investigate deeper GCN models to address these requirements.
However, current research indicates that deep GCN models often suffer from performance degradation as depth increases [2]. In practical applications, shallow GCN models with only 2 to 4 layers typically achieve the best generalization performance. By comparing the node classification accuracy of the two most representative GCN architectures, the vanilla GCN and its variant ResGCN [11], the degradation of the deep model can be observed in detail. As shown by the solid curves in Figure 1, on different datasets the test accuracy of the GCN models declines steadily beyond the two-layer model as depth increases, while ResGCNs always maintain a stable and very slow decline. The results indicate that although ResGCNs exhibit better generalization performance than vanilla GCNs, they cannot consistently improve generalization performance as the model depth increases. While recent studies have proposed more powerful and deeper models that can overcome the degradation problem, their effectiveness often requires strong assumptions or tuning many parameters specifically for a given dataset [6,7]. Therefore, identifying the key factors that cause model performance degradation is crucial for guiding the development of models with better generalization, and this remains a significant challenge.
The degradation of the model’s performance as it goes deeper is widely attributed to “over-smoothing” [12,13,14,15,16,17], which manifests in the model output as indistinguishable node representations. Current theoretical studies on the over-smoothing problem focus on the low-pass filter effect of the graph: after multiple graph convolution operations (i.e., propagation operations), the augmented normalized Laplacian matrix of the model converges mathematically in the spectral space [5,18,19]. Specifically, the node representations approach an invariant subspace determined by node degree information alone, which is consistent with the experimental observation that the model outputs become indistinguishable. Based on this understanding, these works have proposed strategies to mitigate or address the over-smoothing problem. Meanwhile, some studies attribute the degradation problem to factors such as gradient vanishing and training difficulty [11,20,21] and have proposed corresponding optimization strategies. Despite the effectiveness of these strategies in enhancing model representation power, there is still no consensus on the understanding of the degradation problem. Furthermore, current experimental studies on the degradation of deep GCN models tend to focus on shallow models with fewer than 10 layers, which is insufficient for studying the problem comprehensively. In the absence of consensus, we therefore conducted both theoretical and experimental investigations into the degradation of model performance with increasing numbers of layers.
In this paper, we conduct research from the following aspects. 1. The GCN architecture includes not only the graph convolution aggregation operation (i.e., propagation operation) governed by the augmented normalized Laplacian matrix, but also the transformation operation governed by the weight matrix; however, the weight matrix is often ignored in investigations of the over-smoothing problem. We conduct a theoretical analysis of both operations of the GCN from a global perspective. 2. As there is still no consensus on the key factors leading to model performance degradation, we continue to investigate this problem by conducting carefully designed comparative experiments between deep GCN models. Motivated by the aforementioned considerations, the contributions of this paper are as follows.
Firstly, we integrate the aforementioned studies on graph signals and extend the findings on the over-smoothing problem under reasonable assumptions. Rather than adopting the habitual perspective of analyzing the operations in a decoupled manner, we analyze the changing trend of graph signals in the spectral space from a unified global perspective that includes the transformation operations. Our analysis shows that the GCN architecture naturally avoids the over-smoothing problem and does not undergo the process of converging to the invariant subspace. Additionally, we show that the random noise in the graph signals has a decreasing impact on the model after passing through the low-pass filter as the number of layers increases.
Secondly, in addition to conventional experimental methods such as comparing node classification accuracy, we propose an experimental analysis strategy that examines the singular value distribution of the model weight matrices under the guidance of random matrix theory (RMT). The experimental results explicitly show how repeated transformation operations lead deep models to gradually capture less information, which enables us to better understand how the representational power of the model degrades as the number of layers increases. The results indicate that the transformation component of the GCN architecture is a key factor leading to the degradation in model performance, laying the foundation for our subsequent research.
Overall, our analytical theory is well-suited to explaining the degradation of deep GCNs in multi-layer architectures. To further support the theoretical analysis, we employ experimental strategies from several additional angles to confirm its conclusions in deeper models.
2. Related Work
Research suggests that a deeper model with increasing complexity can improve accuracy in computer vision or natural language processing tasks [
22,
23,
24,
25]. This has been demonstrated through the Weisfeiler–Lehman (WL) graph isomorphism test, which shows that deep models have a better capacity to distinguish subgraphs compared to shallow models [
21]. Additionally, ref. [
26] conducts an image classification experiment and finds that the model with the largest number of parameters, reaching 550 M, achieved the highest top-1 classification accuracy. Research has shown that increasing the depth and complexity of models can improve the accuracy of computer vision or natural language processing tasks. However, in our research, we have extended the problem to the graph domain, which involves operating on structured data represented as graphs and learning messages through the iterative structure of the graph. However, we have encountered several difficulties in this process. Firstly, graphs have two types of information: topological information and node information, which differ from the information handled in computer vision or natural language processing. Secondly, the performance degradation phenomenon exhibited by deep graph neural networks is unique. There is no consensus on the attribution of the degradation phenomenon in deep models, and it remains an area for exploration. We will provide specific examples below. Finally, many studies have addressed deep problems in a framework that corresponds to shallow models [
27,
28]. However, analyzing the degradation of model performance in shallow models cannot solve the problems in deep models.
It is known that a single GCN layer can be decomposed into three operations (components): the propagation operation (i.e., graph convolution aggregation) with the augmented normalized Laplacian matrix, the transformation operation with the weight matrix, and the non-linear operation with the ReLU activation function. We extend this setting to a multi-layer GCN for the semi-supervised node classification task [2], which requires the model to learn a hypothesis that extracts node features and topology information from the graph and predicts the labels of nodes.
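As a concrete reference for this decomposition, the following minimal NumPy sketch separates the three operations of one layer; the symmetric normalization of the adjacency matrix with self-loops follows the standard GCN formulation, and all variable names are illustrative.

```python
import numpy as np

def augmented_normalized_adjacency(A):
    """Build D~^(-1/2) (A + I) D~^(-1/2), the propagation matrix of GCN."""
    A_tilde = A + np.eye(A.shape[0])        # add self-loops
    d = A_tilde.sum(axis=1)                 # augmented node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def gcn_layer(A_hat, H, W):
    H = A_hat @ H                # propagation: aggregate neighbor features (low-pass filter)
    H = H @ W                    # transformation: linear map with the layer weight matrix
    return np.maximum(H, 0.0)    # non-linearity: ReLU
```

Stacking K such layers yields the multi-layer GCN considered in the remainder of the paper.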
Firstly, the lack of consensus in research on the degradation phenomenon is mainly manifested in the controversial and even contradictory attribution of its cause. For example, ref. [13] proposed the GPR-GNN model, which uses Generalized PageRank techniques to trade off node features and topological features of the graph, thereby preventing the over-smoothing of node representations. Ref. [29] shows that anti-over-smoothing can occur in GCN models during transformation operations, and suggests that overfitting is a major contributor to model deterioration. Ref. [30] further refutes explanations such as overfitting and gradient vanishing, and asserts that the cause is weight matrix multiplication. Secondly, the optimization strategies are diverse and their effectiveness is limited. If the problem is attributed to a specific propagation, transformation, or non-linear operation, it is reasonable to propose a decoupled structure, such as increasing the number of propagation operations while reducing the number of transformation operations. However, the analysis behind these strategies remains insufficiently thorough, and their effectiveness is governed by the setting of various hyper-parameters (e.g., the learning rate and the number of training iterations), the distribution of each layer's output, the gradient distribution, and other settings. For instance, ref. [19] shows that the representation capacity of the model cannot be improved by increasing the number of layers and nonlinear operations. Ref. [5] proposed the SGC model, which consists of a fixed low-pass filter followed by a linear classifier, to remove the excessive complexity caused by the nonlinear activation function and repeated weight matrix multiplication; their experimental evaluation shows that this variant does not hurt classification accuracy (a minimal sketch is given below). In fact, these research ideas run counter to the fundamental need for ever deeper neural networks: the architectures place strict requirements on the datasets, which must be linearly separable, and impose strict restrictions on the assumptions of the model itself, whereas real sparse big data generally does not meet these conditions. As the above summary makes clear, research on the phenomenon that a model's performance degrades as its depth increases is still highly controversial. A consensus on the underlying factors of this issue would be helpful; we need newer perspectives, more detailed experiments, and a more systematic evaluation of this problem.
4. Analysis
4.1. Empirical Analysis of Over-Smoothing
In the previous section, we provided theoretical evidence that multiple convolution operations can cause over-smoothing. However, it is currently unclear whether the performance on real graph data is consistent with the theoretical analysis. To address this question, we compare and analyze the relative norms of the output of each layer in shallow and deep GCN models, as well as ResGCN models, to explicitly observe smoothness on the Cora dataset.
In the following, we explicitly demonstrate the indistinguishability of the model output (i.e., over-smoothing) and point out the underlying issues. Specifically, the empirical results show that the outputs of the model at each layer have a low-rank structure, indicating that deep GCN models do exhibit over-smoothing in practice. However, the effectiveness of the deep ResGCN in addressing the over-smoothing problem allows us to conclude that convergence is related not only to propagation operations but also to transformation operations, which were often not considered in previous analyses.
We evaluate the model’s degree of convergence during the forward pass using the relative norm of each layer’s output with respect to a rank-1 matrix [33]: the closer the output of a layer is to the rank-1 matrix, the greater its degree of convergence. As shown in Figure 3 and Figure 4, the curves depict the variation of this relative norm across the layers of the different (shallow and deep) models, with the horizontal axis representing the layer index and the vertical axis representing the relative norm.
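As one plausible instantiation of this measure (the exact formula is not spelled out here, so the rank-1 reference and the normalization below are assumptions), the relative distance of a layer output to its best rank-1 approximation can be computed from a truncated SVD: values near 1 mean the output is far from any rank-1 matrix, while values near 0 indicate collapse toward a rank-1 (over-smoothed) structure.

```python
import numpy as np

def relative_norm_to_rank1(H):
    """Relative Frobenius distance of H to its best rank-1 approximation.

    Returns values close to 0 when H has (numerically) collapsed to rank 1,
    and values close to 1 when H is far from any rank-1 matrix.
    """
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    H1 = s[0] * np.outer(U[:, 0], Vt[0, :])   # best rank-1 approximation
    return np.linalg.norm(H - H1) / np.linalg.norm(H)

# usage: track the metric over the per-layer outputs of a trained model
# curve = [relative_norm_to_rank1(H_k) for H_k in layer_outputs]
```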
Firstly, Figure 3a shows that the shallow GCN model behaves normally: its curve stays at a high level, and the model achieves good node classification results. Comparing Figure 3a with Figure 4a, the curve changes from a stable high level to a stable low level, indicating that the deep GCN model does suffer from over-smoothing; this seems to confirm the conclusions of the theoretical derivation in Section 3.
Figure 4b demonstrates the effectiveness of the deep ResGCN model, which uses residual connections: its relative norm curve changes from a low level to a stable high level, indicating that it addresses the over-smoothing problem in the forward pass. As a supplement, Figure 3a,b primarily show that the shallow ResGCN models, which also use residual connections, are more stable. Based on these observations, one natural idea is to input the original graph information into each layer of the model instead of only the first layer, which is considered a practical and traditional solution to the convergence problem [34,35]. However, as shown in Equation (7), the ResGCN architecture combines the output of the current layer with the output of the previous layer to create the input of the following layer (see the sketch below). This approach actually differs from the traditional view: by acting on the weight matrix of each layer it reduces convergence, even though it was originally proposed as a strategy for alleviating gradient vanishing.
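The following minimal sketch illustrates the residual combination just described, assuming the common form in which the previous layer's output is added to the current GCN update (the exact form of Equation (7) may differ slightly); A_hat, H_prev, and W are illustrative names.

```python
import numpy as np

def resgcn_layer(A_hat, H_prev, W):
    """One ResGCN layer: standard GCN update plus a residual connection."""
    H_gcn = np.maximum(A_hat @ H_prev @ W, 0.0)  # propagation + transformation + ReLU
    return H_gcn + H_prev                        # residual: combine with previous layer's output
```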
This makes us realize that merely considering the tendency of the propagation operation to converge to the invariant subspace is not enough to address the over-smoothing problem. A holistic perspective that takes into account both the propagation and transformation operations, with particular attention to the existence of the weight matrix, is required.
4.2. Graph Noise Signal Analysis
As the propagation operation acts as a low-pass filter obtained through Graph Fourier Transform, it is necessary to investigate the impact of random noise signals on the model before proceeding with further derivations to ensure that the subsequent results are not affected by noise. The conclusion is presented in Theorem 2.
It is known that the observed signal is composed of the true signal features and noise signals. The following holds under Assumption 2.
Theorem 2 (Informal Noise Signal Bound). Define Q as the random noise of the signal. Then, with high probability, the filtered noise Q is small, and its bound decreases as the model depth K increases. Theorem 2 guarantees that the observed signals are probably approximately correct (PAC) [36,37] estimates of the true features, excluding the noise signals.
Figure 3. How the output of each layer of the shallow (8-layer) GCN and ResGCN models converges on the Cora dataset. The vertical axis represents the relative norm measuring the degree of convergence. (a) The curve of shallow GCNs remains stable at high points. (b) The curve of shallow ResGCNs remains stable at high points, for comparison.
Figure 4. How the output of each layer of the deep GCN and ResGCN models converges on the Cora dataset. The vertical axis represents the relative norm measuring the degree of convergence. (a) The curve of deep GCNs remains stable at low points. (b) The curve of deep ResGCNs remains stable at high points, for comparison.
Armed with Lemma 5 in [18], we mainly demonstrate, through an exponential inequality for chi-square distributions [38], that the filtered noise is small enough with high probability over the realization of the noise. Thus, we theoretically guarantee the probability of the distribution of signals that contain the graph topology information. This provides the premise for the subsequent perceptron analysis and constitutes a rigorous analysis process.
The derivation is inspired by [18], but it should be noted that we only exclude the interference of random noise, because we argue that deterministic noise contains node feature information. Specifically, we argue that the node representation after the filter more closely reflects the topological features of the graph. If the distance between the true features and the topological features is directly regarded as noise, then the noise will also contain node features, resulting in an excessive amount of observed noise. In reality, however, the random noise continuously decreases as the model depth increases (by Theorem 2). As a result, the model would incorrectly believe that it has overfit to the noise and would attempt to avoid overfitting in subsequent operations.
Proof of Theorem 2. We assume the noise signals to be i.i.d. Gaussian variables with zero mean and the same diagonal covariance, i.e., $q \sim \mathcal{N}(0, \sigma^{2} I)$, and consider the filtered noise obtained after the propagation operations. According to Lemma 1 in [38], we adopt the exponential inequality for chi-square distributions for any positive $c$: taking the logarithm of the Laplace transform of the chi-square variable, we choose $c$ so as to obtain a tail bound on the norm of the filtered noise, and substituting this choice of $c$ yields, for any fixed confidence level, the probability bound stated in Theorem 2. To be concrete, the bound is controlled by the well-known Dirichlet energy [39] of the normalized Laplacian $\tilde{L}$, calculated through $E(x) = x^{\top} \tilde{L} x = \tfrac{1}{2} \sum_{i,j} a_{ij} \big( x_i / \sqrt{d_i} - x_j / \sqrt{d_j} \big)^{2}$, and the resulting bound on the filtered noise decreases in the depth $K$. □
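As an empirical sanity check of this attenuation effect (not part of the proof above; the graph, noise level, and depths below are arbitrary assumptions), one can apply increasing powers of the propagation matrix to pure Gaussian noise and watch its norm shrink:

```python
import numpy as np

rng = np.random.default_rng(0)

# random undirected graph (Erdos-Renyi style adjacency), chosen only for illustration
n = 200
A = (rng.random((n, n)) < 0.05).astype(float)
A = np.triu(A, 1); A = A + A.T

# augmented normalized adjacency A_hat = D~^{-1/2} (A + I) D~^{-1/2}
A_tilde = A + np.eye(n)
d = A_tilde.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt

# i.i.d. zero-mean Gaussian noise on each node feature
q = rng.normal(0.0, 1.0, size=(n, 16))

filtered = q
for k in range(1, 33):
    filtered = A_hat @ filtered               # one more propagation (low-pass) step
    if k in (1, 2, 4, 8, 16, 32):
        print(k, np.linalg.norm(filtered))    # norm of the filtered noise shrinks with depth
```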
4.3. Spectrum Analysis and Rethinking Over-Smoothing
In the previous subsections, we suggested analyzing the GCN architecture from a holistic perspective. In this subsection, we rethink the over-smoothing problem by examining the trend of the distance between the node embedding and the invariant subspace after propagation and transformation operations. If the distance does not decrease, it means that the over-smoothing caused by propagation operations has not actually occurred. To address this issue, we introduce Theorem 3.
Recall that $\lambda$ is the eigenvalue (maximum frequency) of the propagation matrix $\hat{A}$, and $s$ is the maximum singular value of the weight matrices. The distance considered in Section 3 differs from the distance used here in that it only takes the propagation operation into account and ignores the transformation operation.
Theorem 3. The distance between the node representations of the K-th layer and the invariant subspace varies exponentially with a rate determined by $s\lambda$ as the number of layers increases, that is, $d_{\mathcal{M}}\big(H^{(K)}\big) \le (s\lambda)^{K} \, d_{\mathcal{M}}(X)$. Here, $d_{\mathcal{M}}(X)$ denotes the distance between the original inputs of the model and the invariant subspace $\mathcal{M}$.
Proof of Theorem 3. We denote the distance between the K-th layer representations $H^{(K)}$ and the invariant subspace $\mathcal{M}$ by $d_{\mathcal{M}}\big(H^{(K)}\big)$, and the distance between the original input $X$ and $\mathcal{M}$ by $d_{\mathcal{M}}(X)$. By Lemma 1 of [19], we have

$$d_{\mathcal{M}}\big(\hat{A} H\big) \le \lambda \, d_{\mathcal{M}}(H), \qquad d_{\mathcal{M}}\big(H W^{(k)}\big) \le s \, d_{\mathcal{M}}(H), \tag{15}$$

where the distance is measured in the Frobenius norm, and $\hat{A} H$ refers to the node observed features after one propagation operation.

Note that $\sigma(\cdot)$ is a ReLU activation function and acts as a contraction mapping. We have the output of the K-th layer, $H^{(K)} = \sigma\big(\hat{A} H^{(K-1)} W^{(K-1)}\big)$, which satisfies

$$d_{\mathcal{M}}\big(H^{(K)}\big) \le d_{\mathcal{M}}\big(\hat{A} H^{(K-1)} W^{(K-1)}\big). \tag{16}$$

Combining Equation (16) with Equation (15), we have

$$d_{\mathcal{M}}\big(H^{(K)}\big) \le s \lambda \, d_{\mathcal{M}}\big(H^{(K-1)}\big). \tag{17}$$

Combining Equations (15)–(17) and unrolling the recursion over all K layers, we have

$$d_{\mathcal{M}}\big(H^{(K)}\big) \le (s \lambda)^{K} \, d_{\mathcal{M}}(X). \tag{18}$$

□
We know that the maximum frequency $\lambda$ of the propagation matrix $\hat{A}$ is always below and close to 1 [21], and according to Gordon's theorem for Gaussian matrices in [40], the maximum singular value $s$ of the weight matrices is usually greater than 1. Therefore, in general, $s\lambda > 1$. Some studies impose a strong assumption on the model to make $s\lambda < 1$ [19].
From Equation (18), we know that if $s\lambda < 1$, the distance between the model's output $H^{(K)}$ and the invariant subspace $\mathcal{M}$ is less than the distance between the original input $X$ and the invariant subspace, and their relative size decreases exponentially with respect to the number of layers. If $s\lambda > 1$, the upper bound on the relative distance increases exponentially with respect to the number of layers. As the number of layers increases, the constraint is continuously relaxed, which contradicts the absolute convergence derived in Section 3.3.
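To make the exponential behavior concrete, consider illustrative values (these numbers are hypothetical, chosen only to show the two regimes of the bound in Equation (18)) with $\lambda = 0.95$ and $K = 64$:

$$(s\lambda)^{K}\Big|_{s=1.10} = (1.045)^{64} \approx 16.7, \qquad (s\lambda)^{K}\Big|_{s=0.90} = (0.855)^{64} \approx 4.4\times 10^{-5}.$$

In the first (typical) regime the bound grows with depth and forces no collapse onto $\mathcal{M}$, whereas only the second (strongly assumed) regime yields exponential convergence.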
Theorem 3 indicates that GCNs working with real-world data do not suffer from the over-smoothing problem brought on by multiple propagation operations, as this effect can be counteracted by the accompanying transformation operations; instead, the degradation of model performance is most likely due to multiple transformation operations. In fact, in the following section we investigate and determine the statistical distribution of $s$ using Random Matrix Theory, and then empirically analyze the transformation operation (a hedged sketch of this kind of spectrum check is given below).
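As an illustration of the kind of RMT-guided inspection referred to above (the actual experimental protocol of the next section may differ; the random-baseline comparison and all names below are assumptions for the sketch), one can compare the empirical singular value spectrum of a trained weight matrix against that of a random Gaussian matrix of the same shape and scale:

```python
import numpy as np

def singular_spectrum(W):
    """Singular values of a weight matrix, sorted in descending order."""
    return np.sort(np.linalg.svd(W, compute_uv=False))[::-1]

def compare_to_random(W, rng=np.random.default_rng(0)):
    """Compare the spectrum of W with that of an i.i.d. Gaussian matrix of the
    same shape and Frobenius norm. If the two spectra are close, W carries
    little learned structure beyond random initialization."""
    sv_w = singular_spectrum(W)
    G = rng.normal(size=W.shape)
    G *= np.linalg.norm(W) / np.linalg.norm(G)   # match overall scale
    sv_g = singular_spectrum(G)
    return sv_w, sv_g, np.abs(sv_w - sv_g).mean()

# usage: run compare_to_random on each layer's trained weight matrix and track
# how the deviation from the random baseline changes with depth.
```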
6. Conclusions
This study analyzes the issue that the generalization performance of deep GCN models continues to decrease as the number of layers increases.
Firstly, we integrate the existing studies on graph signals under appropriate assumptions. The over-smoothing effect manifests as the input data converging smoothly toward the invariant subspace after multiple propagation operations. These results serve as the theoretical basis for studying the degradation issue.
Afterwards, by comparing the node classification accuracy of GCN and ResGCN models with different numbers of layers, as well as the relative norms reflecting the convergence of the outputs at each layer, we observe that the ResGCN model with residual connections does not suffer from convergence to the invariant subspace (i.e., over-smoothing). Based on this, taking a global perspective in the spectral space that considers both the transformation and propagation operations of the GCN architecture, we conclude that the GCN does not suffer from over-smoothing due to multiple propagation operations, as this effect can be resolved through multiple transformation operations; the degradation of the model is more likely caused by the transformation operation.
Finally, the ResGCN model only alleviates the degradation phenomenon and fails to realize the ideal situation in which model performance improves with increasing depth. We proposed an experimental analysis strategy that investigates how the distribution of singular values of the weight matrices of deep GCN models changes with increasing depth, with the goal of directly observing the impact of multiple transformation operations on model performance. The experimental results confirmed our theoretical speculation that multiple transformation operations are indeed a key factor leading to the degradation of model performance. Therefore, we suggest shifting attention from the propagation operation to the transformation operation in the GCN architecture when studying the degradation problem.