*Article* **Learning Data-Driven Propagation Mechanism for Graph Neural Network**

**Yue Wu 1, Xidao Hu 1, Xiaolong Fan 2, Wenping Ma 3,\* and Qiuyue Gao <sup>1</sup>**


**\*** Correspondence: wpma@mail.xidian.edu.cn

**Abstract:** A graph is a relational data structure suitable for representing non-Euclidean structured data. In recent years, graph neural networks (GNNs) and their subsequent variants, which utilize deep neural networks to complete graph analysis and representation, have shown excellent performance in various application fields. However, the propagation mechanism of existing methods relies on hand-designed GNN layer connection architectures, which are prone to information redundancy and over-smoothing problems. To alleviate these problems, we propose a data-driven propagation mechanism to adaptively propagate information between layers. Specifically, we construct a bi-level optimization objective and use the gradient descent algorithm to learn the forward propagation architecture, which improves the efficiency of learning different layer combinations in multilayer networks. The experimental results of the model on seven benchmark datasets demonstrate the effectiveness of the proposed method. Furthermore, combining this data-driven propagation mechanism with models such as Graph Attention Networks can consistently improve the performance of these models.

**Keywords:** graph neural network; propagation mechanism; data-driven method; deep learning

#### **1. Introduction**

Graphs are data structures that model a set of objects (nodes) and their relationships (edges). Graphs can be irregular and have variable-sized unordered nodes, and nodes can have different numbers of neighbors. As a consequence, while some important operations (e.g., convolutions [1]) can be easily applied to the image domain [2], they are difficult to apply to the graph domain. In addition, a key assumption of existing deep learning algorithms is that the data samples are independent of each other. For graphs, however, there are edges between each data sample (node) and other data samples (nodes) that capture the interdependencies between instances. Due to the powerful representational power of graph structures, the study of graph analysis using machine learning methods has received increasing attention. Researchers have defined and designed a neural network architecture for processing graph data. This architecture, the graph neural network (GNN), has become a new research hotspot and achieves excellent performance and interpretability on graph-structured data.

For example, papers in a citation network are linked to each other by citations, and GNNs can classify each paper into a different group [3–6]. In the fields of chemistry and medicine, molecules can be modeled as graphs, and their biological activities can be identified by GNNs for drug development [7–10]. In the field of computer vision, GNNs can identify objects depicted by 3D point clouds and explore their topology [11–15]. In the traffic system, GNNs can accurately predict the traffic speed and traffic flow in the traffic network for route planning and flow control [16,17].

**Citation:** Wu, Y.; Hu, X.; Fan, X.; Ma, W.; Gao, Q. Learning Data-Driven Propagation Mechanism for Graph Neural Network. *Electronics* **2023**, *12*, 46. https://doi.org/10.3390/electronics12010046

Academic Editor: Stefanos Kollias

Received: 21 November 2022 Revised: 12 December 2022 Accepted: 20 December 2022 Published: 22 December 2022

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

GNNs are used to learn node representations (node embeddings), which can simultaneously model node features and graph topology information. In addition, GNNs utilize the relationships (edges) between nodes of a graph to propagate information rather than treating them as features of nodes. Among them, models such as Graph Convolutional Networks (GCN) [3] and Graph Attention Networks (GAT) [18] follow a neighborhood aggregation (message passing) scheme. These models learn to iteratively aggregate the hidden features of each node in the graph and its neighbors as its new hidden features, where the iterations are parameterized by neural network layers. Theoretically, the aggregation process of L iterations fuses the structural information of each node at each layer, which can simultaneously learn the topology and the distribution of node features in the neighborhood.

However, in practice, a deeper version of the model, which has access to more information, often performs worse. For example, the best performance of GCN and GAT experiments on the Planetoid dataset [19] is achieved with a two-layer model, and increasing the number of layers reduces the performance. A similar degradation of learning for computer vision problems is addressed by residual connections [20], which greatly aid the training of deep models. However, even with residual connections, GCNs with more layers do not perform as well as two-layer GCNs on citation network datasets such as PubMed [21], CiteSeer, and Cora [22].

We believe that the structure of different nodes and their neighborhoods (subgraphs) in the graph has a great influence on the result of neighborhood aggregation. The rate of expansion, or the growth rate of the radius of influence, is characterized by the mixing time of random walks and varies significantly across subgraphs of different structures. Therefore, the same number of iterations can result in very different local distributions. For example, consider random walks that start from a node at the center of the graph and from a node at the edge of the graph. After the same number of iterations, the node at the center may already contain nearly all the information of the graph, so it needs to aggregate only a small amount of information from other layers; aggregating all the information of every layer at this point causes redundancy. The node at the edge of the graph, by contrast, may contain only a small amount of information and needs to aggregate more in order to perceive the structure of the graph.
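The contrast between a central node and a peripheral node can be made concrete with a small experiment. The sketch below counts how many nodes fall within *k* hops of a start node using breadth-first search; it is a simplification that measures the k-hop receptive field rather than random-walk mixing, and the toy graph and the function name `k_hop_reach` are hypothetical.

```python
from collections import deque


def k_hop_reach(neighbors, start, k):
    """Number of nodes within k hops of `start` (including `start`), via BFS."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # do not expand beyond k hops
        for v in neighbors[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return len(dist)


# Hypothetical toy graph: hub node 0 linked to nodes 1..5, plus a tail 5 - 6 - 7 - 8.
neighbors = {
    0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0],
    5: [0, 6], 6: [5, 7], 7: [6, 8], 8: [7],
}

center_reach = [k_hop_reach(neighbors, 0, k) for k in (1, 2, 3)]  # start at the hub
edge_reach = [k_hop_reach(neighbors, 8, k) for k in (1, 2, 3)]    # start at the tail end
```

In this toy graph the hub covers most of the nine nodes after a single hop, while the node at the end of the tail still sees fewer than half of them after three hops.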

To adaptively adjust the influence radius of each node and task, we propose a data-driven propagation mechanism that learns to selectively acquire information from various layers. Finally, each node can selectively obtain low-order local structural information and high-order neighborhood information, thereby effectively avoiding the problems of local structural information degradation and information redundancy and enhancing the representation ability of GNNs. Additionally, stacking too many layers and non-linear transformations can lead to over-smoothing issues, where node representations tend to converge to a fixed value, resulting in degraded model performance. To alleviate this problem, we add an identity map to the convolution operation to improve the network performance.

Since learning a combination of different layers in a multilayer network is computationally expensive, we adopt a differentiable approach to reduce the computational cost. The model achieves good results on the node classification task, demonstrating the effectiveness of the proposed data-driven propagation mechanism. In conclusion, we outline the main contributions of this paper as follows:

(1) We propose a data-driven propagation mechanism (GraphSAP), which adaptively learns the connections between different layers, enabling nodes to selectively fuse low-order local structural information while acquiring high-order neighborhood information.

(2) We add the identity map to the neighbor aggregation function of the GraphSAP model and use a differentiable algorithm during training to make the model more efficient while maintaining high performance.

(3) We provide a quantitative comparison of the node classification task under different datasets, showing the good performance of the model.

#### **2. Related Work**

#### *2.1. Graph Neural Networks*

The concept of graph neural networks was first proposed in [23] and further clarified by Scarselli et al. [24], and many variants [18,25] have been proposed over the past few years. Ref. [24] is the first paper to propose a graph neural network model, which applies neural networks to graph-structured data, and elaborates the structure, calculation method, optimization algorithm, and implementation of the neural network model in detail.

GNNs are a research hotspot that emerged after convolutional neural networks (CNNs) [1] matured, with the aim of processing non-Euclidean data. Some existing studies try to apply methods used by CNNs to GNNs in order to exploit the strengths of CNNs. Existing deep GNN models add other operations to the convolution operation to alleviate over-smoothing, or aggregate different layers. Among the contributions that enable stacking more CNN layers, ResNet [20] and DenseNet [26] are excellent methods that can be seen in many deep networks today. JKNet [27] is inspired by ResNet, but it does not achieve good performance by stacking multiple layers as ResNet does and cannot fully realize the representation ability of GNNs. These methods are all hand-crafted networks. Therefore, we cannot directly apply CNN methods to GNNs; these methods need to be adapted for GNNs to obtain better performance. The focus of our work is to better exploit the representational power of GNNs.

#### *2.2. Data-Driven Methods*

Hand-designed interlayer connection network structures have achieved great success in the past. The emergence of ResNet [20] and DenseNet [26] showed the importance of residual and dense connections for the design of deep networks and had a huge impact on the design of deep neural networks. With the continuous development of deep neural networks and the continuous invention and utilization of various models and new modules, people have gradually realized that developing a new neural network structure is increasingly time-consuming and labor-intensive.

People have begun to explore how to use existing machine learning knowledge to independently build networks suitable for business scenarios. Automated Machine Learning (AutoML) is one of the hottest fields in machine learning and deep learning in recent years. Several recent works have demonstrated the feasibility of automated learning [28] and designed some models that go beyond hand-designed ones, such as [29,30]. Using the dataset as the basis for training the network, various network structures can be designed. For example, if you have a four-layer network, then mathematically, there are 15 combinations of layer-to-layer connections in total. Ideally, given sufficient resources and time, data-driven learning methods can simulate all connections between layers, which would cover all hand-designed network structures. A representative method is the Neural Architecture Search (NAS) algorithm, such as [31]. In NAS, the network architecture is mainly designed from three parts: search space, search strategy, and evaluation strategy. The data-driven approach is also a method in the field of AutoML, which adaptively learns a network model suitable for the data based on the existing data, which is used in our work.

#### **3. Background**

Consider an undirected graph $\mathcal{G} = (V, E)$ with node features $X \in \mathbb{R}^{n \times d_i}$, where $V$ and $E$ denote the node and edge sets, respectively, $n$ represents the number of nodes, and $d_i$ is the dimension of node features. We use $N(v)$ to represent the first-order neighbors of a node $v$ in $\mathcal{G}$, i.e., $N(v) = \{u \in V \mid (v, u) \in E\}$. In addition, we use $\tilde{N}(v)$ to denote the set of neighbors including the node itself, i.e., $\tilde{N}(v) = \{v\} \cup \{u \in V \mid (v, u) \in E\}$. Let $\tilde{\mathcal{G}}$ be the graph obtained by adding a self-loop to every $v \in V$. The hidden feature of node $v$ learned by the $l$-th layer of the model is denoted by $h_v^{(l)} \in \mathbb{R}^{d_h}$, where $d_h$ denotes the dimension of the hidden features. For simplicity of illustration, we assume that it is the same between layers. Let $A$ denote the adjacency matrix and $D$ the diagonal degree matrix. Consequently, the adjacency matrix and diagonal degree matrix of $\tilde{\mathcal{G}}$ are defined as $\tilde{A} = A + I$ and $\tilde{D} = D + I$, respectively. The normalized graph Laplacian matrix is defined as $L = I_n - D^{-1/2} A D^{-1/2}$, which is a symmetric positive semidefinite matrix.
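The matrices defined above can be computed directly for a small graph. The following is a minimal sketch in plain Python, assuming a dense adjacency matrix without self-loops stored as a list of lists; the function names and the three-node path graph are hypothetical and chosen only for illustration.

```python
import math


def normalized_laplacian(adj):
    """Return L = I - D^{-1/2} A D^{-1/2} for a dense adjacency matrix."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Only existing edges contribute, so isolated nodes never divide by zero.
            norm = adj[i][j] / math.sqrt(deg[i] * deg[j]) if adj[i][j] else 0.0
            lap[i][j] = (1.0 if i == j else 0.0) - norm
    return lap


def renormalized_adjacency(adj):
    """Return D~^{-1/2} A~ D~^{-1/2} with A~ = A + I and D~ = D + I."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical toy example: path graph 0 - 1 - 2.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
L = normalized_laplacian(A)
P = renormalized_adjacency(A)
```

Adding self-loops before normalizing keeps every degree at least one, which is why the second function needs no guard against isolated nodes.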

#### *3.1. Graph Convolutional Network*

Kipf et al. [3] proposed the Graph Convolutional Network (GCN) model, which can be described as the "pioneering work" of GNN. GCN uses approximation techniques to derive a simple and efficient model that enables convolution operations in image processing to be easily used for graph-structured data processing. Inspired by GCNs, various new graph neural networks are emerging. The form of GCN can be expressed as:

$$h_i^{(l+1)} = \sigma \left( b^{(l)} + \sum_{j \in \mathcal{N}(i)} \frac{1}{c_{ij}} h_j^{(l)} W^{(l)} \right) \tag{1}$$

where $c_{ij} = \sqrt{|\mathcal{N}(i)|} \sqrt{|\mathcal{N}(j)|}$ is a normalization constant, $W^{(l)}$ and $b^{(l)}$ are trainable parameters, and $\sigma$ is a non-linear activation function, e.g., a ReLU.
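To make Equation (1) concrete, the sketch below evaluates one GCN layer node by node. It assumes the symmetric normalization $c_{ij} = \sqrt{|\mathcal{N}(i)|}\sqrt{|\mathcal{N}(j)|}$ and a ReLU activation; the function name `gcn_layer`, the neighbor-list representation, and the toy inputs are illustrative assumptions, not the authors' implementation.

```python
import math


def gcn_layer(neighbors, h, weight, bias):
    """One GCN layer in the form of Eq. (1): h_i' = ReLU(b + sum_j (h_j W) / c_ij)."""
    d_out = len(bias)
    out = []
    for i, nbrs in enumerate(neighbors):
        acc = list(bias)
        for j in nbrs:
            # Assumed symmetric normalization by the neighbor counts of i and j.
            c_ij = math.sqrt(len(neighbors[i])) * math.sqrt(len(neighbors[j]))
            for k in range(d_out):
                # (h_j W)_k = sum_m h_j[m] * W[m][k]
                acc[k] += sum(h[j][m] * weight[m][k] for m in range(len(h[j]))) / c_ij
        out.append([max(0.0, v) for v in acc])  # ReLU
    return out


neighbors = [[1], [0, 2], [1]]   # path graph 0 - 1 - 2 (toy example)
h0 = [[1.0], [2.0], [3.0]]       # one scalar feature per node
W = [[1.0, -1.0]]                # maps 1 input dimension to 2 output dimensions
b = [0.0, 0.0]
h1 = gcn_layer(neighbors, h0, W, b)
```

A practical implementation would use sparse matrix products instead of nested loops; the explicit loops are kept here so that each term of the equation is visible.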

In principle, deeper versions of GCN models that can capture more information will perform better. We conduct node classification experiments on the Cora dataset using GCNs with 2-layer, 4-layer, 6-layer, and 16-layer network structures, respectively, to analyze the performance of GCNs with different layers. The experimental result is shown in Figure 1. The best performance of GCN on the node classification task on the Cora dataset is achieved with a 2-layer model, and increasing the number of layers will reduce the performance. This is due to the over-smoothing problem; as the number of layers in the network increases and the number of iterations increases, the hidden layer representation of each node tends to converge to the same value.

**Figure 1.** Performance of GCNs with different numbers of layers on the node classification task on the Cora dataset.
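The convergence described above can be observed even without trainable weights. The sketch below is a deliberately simplified, linear illustration (no weight matrices, no non-linearity) that repeatedly applies the renormalized propagation matrix to a one-hot signal on a four-node path graph and tracks how distinguishable the nodes remain; the graph, the signal, and the helper names are assumptions made for this example.

```python
import math


def propagate(p, x):
    """Multiply the propagation matrix p by the feature vector x."""
    return [sum(p_ij * x_j for p_ij, x_j in zip(row, x)) for row in p]


def spread(x, deg_tilde):
    """Range of x_i / sqrt(d~_i); zero means the nodes are indistinguishable."""
    scaled = [x_i / math.sqrt(d) for x_i, d in zip(x, deg_tilde)]
    return max(scaled) - min(scaled)


# Path graph 0 - 1 - 2 - 3 with self-loops already added (A~ = A + I).
adj = [[1, 1, 0, 0],
       [1, 1, 1, 0],
       [0, 1, 1, 1],
       [0, 0, 1, 1]]
deg = [sum(row) for row in adj]
P = [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(4)] for i in range(4)]

x = [1.0, 0.0, 0.0, 0.0]  # one-hot signal on node 0
spreads = {}
for depth in range(1, 17):
    x = propagate(P, x)
    if depth in (2, 4, 8, 16):
        spreads[depth] = spread(x, deg)
```

The recorded spread shrinks with depth, which mirrors the loss of node-specific information that the text attributes to over-smoothing.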

#### *3.2. Deep GNNs*

In order to better exploit information from neighborhoods of differing localities and to mitigate the over-smoothing problem of deep GNN models, models such as Jumping Knowledge Networks [27] and GCNII [32] proposed network structures similar to that of ResNet [20]. These models are roughly represented as follows:

$$h_v^{(l+1)} = \sigma\left(W^{(l+1)} \cdot \text{aggregate}\left(\{h_u^{(l)}, u \in \tilde{N}(v)\}\right)\right) \tag{2}$$

$$h_v^{(final)} = \text{layer\_aggregation}\left(h_v^{(1)}, h_v^{(2)}, \dots, h_v^{(n)}\right) \tag{3}$$

where *aggregate* represents aggregation operations between nodes and *layer*\_*aggregation* is the layer aggregation function, indicating that all representations of the middle layer are aggregated in the last layer. However, this hand-designed way of aggregating the features of all layers may result in information redundancy.
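The layer aggregation in Equation (3) can be sketched with two simple aggregators, concatenation and element-wise maximum, both of which are commonly associated with Jumping Knowledge Networks. The function below is an illustrative assumption about how such an aggregator could be written in plain Python, not the implementation used by any of the cited models.

```python
def layer_aggregation(layer_outputs, mode="concat"):
    """Combine per-layer node features as in Eq. (3).

    layer_outputs: list over layers, each a list of per-node feature lists.
    """
    n_nodes = len(layer_outputs[0])
    if mode == "concat":
        # Place the features of every layer side by side for each node.
        return [[v for layer in layer_outputs for v in layer[i]]
                for i in range(n_nodes)]
    if mode == "max":
        # Element-wise maximum across layers for each node.
        return [[max(vals) for vals in zip(*(layer[i] for layer in layer_outputs))]
                for i in range(n_nodes)]
    raise ValueError(f"unknown mode: {mode}")


h1 = [[1.0, 0.0], [0.5, 2.0]]   # layer-1 features of two nodes (toy values)
h2 = [[0.0, 3.0], [1.5, 1.0]]   # layer-2 features of the same nodes
concat = layer_aggregation([h1, h2], "concat")
pooled = layer_aggregation([h1, h2], "max")
```

Both aggregators always consume every layer, which is the fixed, hand-designed behavior that the text identifies as a possible source of redundancy.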

Many GNN models [33–35] obtain node features via a message-passing pattern [7,36,37], where the representation of each node is learned by iteratively aggregating the embeddings ("messages") of its neighbors. APGCN [33] sets each node as an extra unit when the message is passed, which outputs a value that controls whether the communication should continue. This method can better control the information propagation of nodes to combine information from more distant neighbors, but it cannot aggregate information from different layers. To address the above issues, we use GCN as a benchmark and design an adaptive learning method for inter-layer aggregation. Compared with hand-designed networks, it can automatically learn the network aggregation architecture to fully exploit the representational capabilities of GNNs.

#### **4. GraphSAP Network**

#### *4.1. Model Analysis*

To improve the representation ability of the network model, we design a data-driven propagation mechanism that adaptively learns the connections of different layers, that is, the aggregation of different neighbors. We use GCN as the baseline network in our network and alleviate the over-smoothing problem in deep networks by adding an identity map to the convolution operation. Formally, we define the *l*-th layer of our GraphSAP as:

$$H^{(l+1)} = \sigma \left( \kappa_l \tilde{L} H^{(l)} \left( (1 - \beta_l) I_n + \beta_l W^{(l)} \right) \right) \tag{4}$$

where $\kappa_l$ and $\beta_l$ are two hyperparameters, and $\tilde{L} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ is the graph convolution matrix with the renormalization trick. Compared to the vanilla GCN model (Equation (1)), we add the identity map $I_n$ to the $l$-th weight matrix $W^{(l)}$.
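A minimal sketch of the layer in Equation (4) is given below, assuming a ReLU activation and dense list-of-lists matrices. The identity is sized to match the square weight matrix so that the sum is well defined; the function names and the two-node example are hypothetical.

```python
def matmul(a, b):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def identity_mapped_layer(p, h, w, kappa, beta):
    """Eq. (4): ReLU(kappa * P H ((1 - beta) I + beta W)), P = renormalized adjacency."""
    d = len(w)
    # Blend the identity with the weight matrix (identity sized like W: an assumption).
    mixed = [[(1.0 - beta) * (1.0 if i == j else 0.0) + beta * w[i][j]
              for j in range(d)] for i in range(d)]
    out = matmul(matmul(p, h), mixed)
    return [[max(0.0, kappa * v) for v in row] for row in out]


# Two connected nodes with self-loops: every entry of the renormalized matrix is 1/2.
P = [[0.5, 0.5], [0.5, 0.5]]
H = [[1.0, 2.0], [3.0, 4.0]]
W = [[0.0, 1.0], [1.0, 0.0]]    # toy weight matrix that swaps the two feature dimensions

no_weight = identity_mapped_layer(P, H, W, kappa=1.0, beta=0.0)
full_weight = identity_mapped_layer(P, H, W, kappa=1.0, beta=1.0)
blended = identity_mapped_layer(P, H, W, kappa=1.0, beta=0.5)
```

With `beta=0.0` the layer only smooths the features, and with `beta=1.0` it reduces to a plain graph convolution, so `beta` controls how strongly the weight matrix transforms the smoothed representation.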

Each intermediate layer is computed from all its predecessors:

$$\mathit{layer}^{(j)} = \sum_{i<j} o^{(i,j)}\left(\mathit{layer}^{(i)}\right) \tag{5}$$

where $\mathit{layer}$ can be obtained by Equation (4), and $o^{(i,j)}$ denotes the connection state between $\mathit{layer}^{(i)}$ and $\mathit{layer}^{(j)}$.

The network architecture of our GraphSAP is shown in Figure 2. The main difference between our model and the existing models is that we design an adaptive learning network architecture based on a data-driven propagation mechanism instead of relying on handcrafted designs. We incorporate identity maps into convolutions to guarantee model performance and then use a data-driven adaptive approach to learn the best-performing network aggregation structure. Our proposed network achieves good results in the node classification task, demonstrating the feasibility of our proposed method.

Identity maps play an important role in preventing performance degradation in deep models, so we add identity maps to the model's operations. Generally speaking, identity mapping adds the identity matrix to the weight matrix, which can alleviate the over-smoothing problem caused by increasing the number of network layers. Frequent interactions between different dimensions of the feature matrix [38] degrade the performance of the model in semi-supervised tasks, whereas mapping the smoothed representation $\tilde{L} H^{(l)}$ directly to the output reduces this interaction.

**Figure 2.** Network architecture of GraphSAP, which is based on data-driven learning of the connections between different layers. Here, $l$ is the number of layers of the network, $h^{(l)}$ denotes the hidden features learned by the node at layer $l$, *relax* denotes the relaxation operation, *mix* denotes Equation (6), and $X$ denotes the initial node features.

#### *4.2. Data-Driven Propagation Mechanism*

In this subsection, we introduce the proposed propagation mechanism. We first discuss the differences and connections between our data-driven propagation mechanism and existing propagation strategies, such as Learning to Propagation (L2P) [39]. Although both L2P and our proposed GraphSAP perform adaptive propagation, the two methods differ. GraphSAP learns whether the neighbor features of nodes at different layers are aggregated, whereas L2P considers that different nodes may require different numbers of propagation layers, so it needs to learn the order of neighbor nodes. Next, we introduce the continuous relaxation of operations between layers and, finally, the optimization method used to reduce the learning time.

JKNet [27] aggregates the node features of all layers to obtain the final feature representation, as shown in Equation (3). Our method instead defines a space of layer connection operations and adaptively learns the aggregation between different layers, as shown in Equation (5), where each directed edge $(i, j)$ is associated with an edge state $o^{(i,j)}$. Our final task is to find a suitable connection for each pair of layers. Because the combinations of these operations are discrete, learning directly in this space is very difficult.

To make the search space continuous, we relax the categorical choice of a specific operation to a softmax over all possible operations:

$$\bar{o}^{(i,j)}(\mathit{layer}) = \sum_{o \in \mathcal{O}} \frac{\exp\left(\alpha_o^{(i,j)}\right)}{\sum_{o' \in \mathcal{O}} \exp\left(\alpha_{o'}^{(i,j)}\right)} o(\mathit{layer}) \tag{6}$$

where $\mathcal{O}$ is the set of all candidate aggregation operations (e.g., *identity*, *maxpooling*, and *zero*), each operation represents some function $o(\cdot)$ to be applied to the layer, and the operation mixing weights for a pair of layers $(i, j)$ are parameterized by a vector $\alpha^{(i,j)}$ of dimension $|\mathcal{O}|$. Here, $\mathit{layer}$ represents the GNN layer, as shown in Equation (5). The layer aggregation operation of GraphSAP is shown in Equation (4), and the node features of the last layer can be obtained as follows:

$$\mathit{layer}^{(l)} = \left[o(H^{(1)}), \dots, o(H^{(l-1)})\right] \tag{7}$$

where $o$ is a categorical operation indicating whether the corresponding layer participates in information propagation. The final feature of the nodes can be expressed as

$$Z = \operatorname{softmax}\left(\operatorname{layer}^{(l)}\right) \tag{8}$$
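The softmax relaxation in Equation (6) can be sketched as a weighted sum over candidate operations. The example below is a hedged illustration that implements only two candidates, `identity` and `zero`; the dictionary `OPS`, the function names, and the `discretize` helper (one common way to turn the relaxed weights back into a single choice) are assumptions made for this sketch.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate set: only `identity` and `zero` are sketched here.
OPS = {
    "identity": lambda h: h,
    "zero": lambda h: [[0.0] * len(row) for row in h],
}


def mixed_op(alpha, h):
    """Eq. (6): softmax(alpha)-weighted sum of every candidate operation applied to h."""
    weights = softmax([alpha[name] for name in OPS])
    out = [[0.0] * len(row) for row in h]
    for w, op in zip(weights, OPS.values()):
        for i, row in enumerate(op(h)):
            for k, v in enumerate(row):
                out[i][k] += w * v
    return out


def discretize(alpha):
    """Keep the single operation with the largest mixing weight on an edge."""
    return max(OPS, key=lambda name: alpha[name])


half = mixed_op({"identity": 0.0, "zero": 0.0}, [[2.0, 4.0]])
choice = discretize({"identity": 2.0, "zero": -1.0})
```

Because the mixing weights are a smooth function of `alpha`, the connection pattern can be adjusted by gradient descent instead of by enumerating discrete architectures.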

When training the network, we use $\mathcal{L}_{train}$ and $\mathcal{L}_{val}$ to denote the training and validation losses. Both losses depend not only on the architecture $\alpha$, but also on the weights $w$ in the network. The goal of our method is to find $\alpha^*$ that minimizes $\mathcal{L}_{val}(w^*, \alpha^*)$, where $w^*$ is the weight that minimizes $\mathcal{L}_{train}$. Thus, our model actually needs to solve a bi-level optimization [40] problem:

$$\min_{\alpha \in \mathcal{A}} \quad \mathcal{L}_{val}(w^*(\alpha), \alpha) \tag{9}$$

$$\text{s.t.} \quad w^*(\alpha) = \arg\min_{w} \mathcal{L}_{train}(w, \alpha) \tag{10}$$

where $\alpha$ denotes the network architecture, and $w^*(\alpha)$ denotes the weights of this architecture after training. In our experiments, we use the cross-entropy loss for our semi-supervised node classification task:

$$\mathcal{L} = -\sum_{i \in \mathcal{Y}_L} Y_i \log Z_i \tag{11}$$

where $\mathcal{Y}_L$ is the set of node indices with labels, $Y_i$ denotes the ground-truth label of node $i$, and $Z_i$ is the final feature representation of node $i$. The same cross-entropy loss is used both for adaptively learning the network structure and for training the task model.
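The loss in Equation (11) only sums over labeled nodes. A minimal sketch follows, assuming that each label is an integer class index (equivalent to a one-hot $Y_i$) and that each row of `Z` is already a probability vector; the function name and the toy values are hypothetical.

```python
import math


def masked_cross_entropy(probs, labels, labeled_idx):
    """Eq. (11): sum of -log Z_i[y_i] over the labeled nodes only."""
    return -sum(math.log(probs[i][labels[i]]) for i in labeled_idx)


Z = [[0.7, 0.2, 0.1],   # class probabilities of node 0
     [0.1, 0.8, 0.1],   # node 1
     [0.3, 0.3, 0.4]]   # node 2 (treated as unlabeled below)
y = [0, 1, 2]
loss = masked_cross_entropy(Z, y, labeled_idx=[0, 1])
```

Node 2 does not contribute to `loss` because its index is not in `labeled_idx`, which is how the semi-supervised setting restricts supervision to the labeled subset.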

We train the network using a one-shot differentiable method; the optimization details are given in Algorithm 1. In addition, we use the gradient-based approximation method [41–43] to update the operation parameter *α* to save training time, as follows:

$$\nabla_{\alpha} \mathcal{L}_{val}(w^*(\alpha), \alpha) \approx \nabla_{\alpha} \mathcal{L}_{val}\left(w - \gamma \nabla_{w} \mathcal{L}_{train}(w, \alpha), \alpha\right) \tag{12}$$

where *w* denotes the current weights maintained by the algorithm, and *γ* is the learning rate for a step of inner optimization. We use only a single training step to adjust *w* to approximate *w*∗(*α*) without fully solving the internal optimization by training until convergence (Equation (10)).
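The one-step approximation in Equation (12) can be checked on a toy problem where every gradient is available in closed form. In the sketch below, the scalar losses `train_loss` and `val_loss`, the step sizes, and the alternating update schedule are all hypothetical choices made for illustration; they are not the losses or the schedule used in the paper.

```python
def train_loss(w, alpha):
    return (w - alpha) ** 2


def val_loss(w, alpha):
    return (w - 1.0) ** 2 + 0.1 * alpha ** 2


def grad_w_train(w, alpha):
    """Gradient of the toy training loss with respect to w."""
    return 2.0 * (w - alpha)


def approx_alpha_grad(w, alpha, gamma):
    """Gradient w.r.t. alpha of L_val(w - gamma * grad_w L_train(w, alpha), alpha)."""
    w_step = w - gamma * grad_w_train(w, alpha)   # single inner training step
    dw_dalpha = 2.0 * gamma                       # d(w_step)/d(alpha) for this toy loss
    return 2.0 * (w_step - 1.0) * dw_dalpha + 0.2 * alpha


def search(steps=200, gamma=0.1, lr_alpha=0.1):
    """Alternate one architecture step (alpha) with one weight step (w)."""
    w, alpha = 0.0, 0.0
    for _ in range(steps):
        alpha -= lr_alpha * approx_alpha_grad(w, alpha, gamma)
        w -= gamma * grad_w_train(w, alpha)
    return w, alpha


w_final, alpha_final = search()
```

On this toy problem the alternating scheme settles near `alpha = 2/3`, whereas solving the inner problem exactly (`w = alpha`) would give `alpha = 1/1.1`; the gap illustrates that the single inner step trades accuracy for a much cheaper update.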


our experiments, we set *k* = 1), such as the maximum weight in Equation (6), to form our model. After adaptive learning is complete, we train from scratch using the best-performing model and adjust it based on the validation data to receive the final parameters.

#### **5. Experiment**

#### *5.1. Datasets*

To verify the effectiveness of our proposed algorithm, we use seven benchmark datasets to perform the node classification task. Table 1 summarizes the statistics of the datasets. We conduct experiments on three citation network datasets: PubMed [21], CiteSeer, and Cora [22]. Each of their nodes represents a paper, and each edge represents a citation relationship between two papers. Each dataset contains bag-of-words features for each paper (node). The task is to classify papers into different topics according to a citation network, i.e., node classification. We also introduce four new datasets for the node classification task: Coauthor CS, Coauthor Physics, Amazon Computers, and Amazon Photo [44]. Descriptions of the datasets are given below. We split the nodes in all graphs into 60%, 20%, and 20% for training, validation, and testing.


**Cora.** The Cora dataset consists of machine learning papers divided into the following seven categories: Case Based; Genetic Algorithms; Neural Networks; Probabilistic Methods; Reinforcement Learning; Rule Learning; Theory.

**CiteSeer.** The CiteSeer dataset is a portion of papers selected from the CiteSeer Digital Papers repository and is grouped into the following six categories: Agents; AI; DB; IR; ML; HCI.

**PubMed.** The PubMed dataset includes 19,717 scientific publications on diabetes from the PubMed database, divided into three categories: Diabetes Mellitus, Experimental; Diabetes Mellitus Type 1; Diabetes Mellitus Type 2.

**Coauthor CS and Coauthor Physics.** They are coauthorship graphs based on the Microsoft Academic Graph from the KDD Cup 2016 challenge. Nodes in the dataset represent authors and are connected by an edge if two authors coauthored a paper. Node features represent keywords of each author's papers, and category labels represent each author's most active research area.

**Amazon Computers and Amazon Photo.** They are fragments of the Amazon copurchase graph [44], where nodes represent items, edges represent two items that are frequently purchased together, node features are bag-of-words-encoded product reviews, and category labels represent product classifications.
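The 60%/20%/20% split mentioned above can be sketched as a shuffled partition of node indices. The text does not state how the nodes are assigned to each part, so the uniform random shuffle, the fixed seed, and the function name `split_nodes` are assumptions made for this example.

```python
import random


def split_nodes(n_nodes, train=0.6, val=0.2, seed=0):
    """Shuffle node indices and cut them into train / validation / test parts."""
    idx = list(range(n_nodes))
    random.Random(seed).shuffle(idx)   # assumed: uniform random assignment
    n_train = int(train * n_nodes)
    n_val = int(val * n_nodes)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


# 19,717 is the number of PubMed publications given in the dataset description.
train_idx, val_idx, test_idx = split_nodes(19717)
```

Any nodes left over after rounding fall into the test part, so the three parts always cover every node exactly once.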

#### *5.2. Settings*

**Baselines.** To compare our proposed mechanism with other existing methods, we consider the following baselines: Graph Convolutional Network (GCN) [3], Graph Attention Network (GAT) [18], Simplified Graph Convolution Network (SGC) [4], JKNet [27], Multilayer Perceptron (MLP) [45], Graph Sample and Aggregate (GraphSage) [46], DAGNN [47], GCNII [32], DenseGCN [48], and ResGCN [48].

**Configurations.** Our experiments are run on an NVIDIA RTX 3090 Ti graphics card using PyTorch (version 1.7). In our experiments, GCN [3] is used as the baseline model, identity mapping is added to the convolution, and the data-driven propagation mechanism is used to obtain the network model. In all the experiments, we choose the depth from {2, 4, 8, 16, 32, 64}. Throughout the experiments, we use the Adam optimizer [49]. We set the learning rate to 0.005 and the maximum number of epochs to 1000. We set the dropout to 0.5, the dimension of the hidden features to 32, and the weight decay to 0.001. We add L2 regularization to the model parameters. We set $\kappa_l = 1$ and $\beta_l = \log\left(\frac{\lambda}{l} + 1\right) \approx \frac{\lambda}{l}$. The principle of setting $\beta_l$ is to ensure that the decay of the weight matrix adaptively increases as we stack more layers.
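The depth-dependent schedule for $\beta_l$ can be tabulated in a few lines. The text does not give a value for $\lambda$, so the constant `LAMBDA = 0.5` below is purely a placeholder assumption, as is the function name `beta_schedule`.

```python
import math


def beta_schedule(depth, lam):
    """beta_l = log(lam / l + 1) for layers l = 1..depth."""
    return [math.log(lam / l + 1.0) for l in range(1, depth + 1)]


LAMBDA = 0.5  # hypothetical value; the text does not state lambda
betas = beta_schedule(64, LAMBDA)

# For deep layers lam / l is small, so log(lam / l + 1) is close to lam / l.
gap_at_64 = abs(betas[63] - LAMBDA / 64)
```

Since `betas` decreases with the layer index, deeper layers lean more on the identity map and less on their weight matrix.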

#### *5.3. Results*

**Network evaluation.** We evaluate the training behavior of the model by observing the two training stages: network structure selection and the corresponding weight training. The training results are shown in Figure 3a,b. We adaptively learn the network structure using the validation set and optimize the $\alpha$ in Equation (9) to obtain the network structure with the best performance. The jumps in the training loss in Figure 3a arise from the process of optimizing the network architecture. After the adaptive learning of the network structure is completed, we use standard GNN training to optimize the network parameters $w$ and obtain the final network model. Because differentiable methods are used for optimization, both the network structure selection and the parameter training converge quickly. These training results confirm that the differentiable method we use is feasible and effective.

**Figure 3.** The training state of our model on the Amazon Photo dataset [44] with a 64-layer network structure. (**a**) is the validation loss and training loss for the training selection network structure; (**b**) is the validation loss and training loss for training parameters corresponding to the network structure.

**Performance comparison.** The quantitative comparison results of node classification performance with other methods on various datasets are shown in Table 2. All results used for comparison are the best results achievable using the respective models. Our network achieves good performance on all seven datasets, achieving the highest classification accuracy on five of them. GAT shows good results on some datasets, such as Cora, but its performance on the Amazon Computers dataset is mediocre. Among the current deep models GCNII, ResGCN, and DenseGCN, GCNII performs best on the Cora and CiteSeer datasets, but our network achieves the best results on all other datasets. In general, our model can be applied to various datasets and achieves good results, which supports the effectiveness of the model. Our method also provides a feasible direction to better utilize the representational power of GNNs.

To investigate the model performance trends at different depths, we further compare the representational capabilities of our proposed model and existing models at different depths. The detailed comparison results of models with different depths are shown in Figure 4. From these experimental results, we can make the following observations. The baseline model (GCN) struggles to maintain consistent performance as we stack more layers. We also found that residual and dense connections can help improve the model performance on most datasets but not much on the Amazon Computers and PubMed datasets. The Jumping Knowledge (JK) mechanism outperforms the baseline model (GCN) [3] in most cases. However, increasing depth also causes its performance to degrade. The GCNII model outperforms GCN and JKNet on multilayer networks, and the problem of over-smoothing is alleviated with increasing depth. However, GCNII performs poorly on the four new datasets, and its generalizability is questionable. These experimental results further confirm that our proposed method is effective and feasible for training models with excellent representation ability.

**Figure 4.** Performance comparison of our network with different numbers of layers on different datasets. As can be seen in the figure, our data-driven layer connection learning method maintains relatively good network performance as the number of layers increases.


**Table 2.** Comparison of GraphSAP with other models for node classification tasks on Cora, Citeseer, PubMed, Coauthor CS, Coauthor Physics, Amazon Computers, and Amazon Photo datasets.

**Model Visualization.** We visualize the network structure learned by the model for node classification tasks on the Amazon Photo dataset, as shown in Figure 5. The network structure diagram shows that the final classification result is an aggregation of neighbors from different layers. Neighbors that need to be aggregated are adaptively learned by our method without relying on the manual design. The aggregation between layers of the network structure is irregular. Our method is flexible and widely applicable and has excellent graph representation ability.

**Figure 5.** The 16-layer network structure learned by the model for the node classification task on the Amazon Photo dataset, where *Z* denotes the final representation of the node after softmax.

#### **6. Conclusions**

We propose a data-driven propagation mechanism that adaptively learns the connections between different layers, i.e., learns combinations of different neighbors. This mechanism can alleviate the information redundancy and over-smoothing problems caused by previous hand-designed GNN layer connection architectures. Compared with other mainstream methods, the network architecture can be adapted to a variety of different datasets. The proposed GraphSAP achieves good performance on all three public datasets and achieves the best results on one of the public datasets as well as on the four new datasets tested. In addition, our method shows almost no performance degradation as the number of model layers increases. Further, the training efficiency is improved by adopting a more efficient differentiable learning algorithm.

In the future, we will explore more automatic learning methods to further improve the performance of GraphSAP. This includes exploring other layer aggregators and studying the impact of different combinations of layers and node aggregators on the graph structure. Furthermore, we can also explore tasks other than node classification, such as graph classification.

**Author Contributions:** Conceptualization, Y.W.; methodology, Y.W.; validation, X.F.; writing original draft preparation, Q.G. and X.H.; writing—review and editing, Y.W. and X.H.; supervision, W.M.; funding acquisition, Y.W. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work is supported by the National Natural Science Foundation of China (62276200, 62036006), the Natural Science Basic Research Plan in Shaanxi Province of China (2022JM-327) and the CAAI-Huawei MINDSPORE Academic Open Fund.

**Data Availability Statement:** The data used to support the findings of this study are available from the corresponding author upon request.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
