In the process of message propagation in GNNs, nodes first aggregate information from their neighbors according to certain rules and weights and then update their own feature representations using the collected information. However, because all nodes have the same priority during message aggregation, some nodes may transmit useless information or even interfere with the efficient transmission of important information. Assigning a priority to every node feature representation in the graph can effectively solve this issue, but it also leads to a notable rise in computational complexity.
To handle this challenge, we propose a self-attention mechanism that computes the priority of features from nodes of different categories. The mechanism uses self-attention to adaptively evaluate the dependency weights between nodes, thereby capturing the differences in importance between them. Moreover, because it attends only to information relevant to the current node, it reduces the computational complexity and improves the efficiency of information transmission. The Intra-node Contextual Reasoning (ICR) module consists of two modules connected in series: Feature Compression (FC) and Feature Diffusion (FD). The FC module acts as a filtering mechanism that retains the most significant characteristics in the feature map while reducing its size to improve computational efficiency. The FD module then distinguishes the significance of different attributes according to their categorical distinctions and determines their respective weights in the node representation.
3.2.1. Feature Compression Module
The FC module is designed to process node attributes effectively by identifying the most representative features while minimizing the influence of irrelevant attributes. This is crucial in node classification tasks, as it enables the discovery of interrelationships and similarities among attributes. Determining the correlation among all node attributes in an idealized, pairwise manner is computationally expensive: it has a time complexity of $O(N^2C)$, where $N$ represents the number of nodes and $C$ represents the dimension of the input features. This burden grows rapidly with graph size, which limits the applicability of such an approach to large graphs, such as those arising in bioinformatics and recommendation systems.
To address this challenge, the FC module balances the need for thorough attribute investigation against computational cost by organizing the computation by category. It leverages category-oriented feature attention coefficients to assign weightings to the attributes in the node representation. This prioritizes the transmission of the most significant attributes along the edges of the graph within each category, while suppressing less relevant attributes. Importantly, the computational complexity of the FC module primarily depends on the overall number of node categories. By representing all nodes of the same category with a single cascaded attention matrix, the computation is greatly simplified to $O(K^2C)$, where $K \ll N$ denotes the number of node categories.
The FC module consists of three main operations, as illustrated in Figure 2. First, the nodes in each sub-feature matrix are sorted to determine their importance within the category. Next, the sorted matrix is transformed into a three-dimensional sub-feature map, capturing the local salient features of each category. Finally, the FC module highlights the crucial local attributes of each node category, contributing to improved accuracy in node classification. By compressing and selectively emphasizing relevant features, the FC module effectively reduces the computational complexity while preserving the essential information for accurate node classification.
Feature homogenization. Feature homogenization is one approach to capture the similarities between nodes within a group or submatrix. It uses the mean feature vector to represent a particular category, which promotes consistency and reduces noise. Once the common attribute vector for each category is established, the nodes within each submatrix can be sorted by their proximity to this common attribute vector. This sorting helps to identify patterns and relationships within the submatrix, making the data easier to analyze and interpret. Equation (1) compares each node's feature vector against the mean of all node features in its category, in order to quantify the level of feature homogenization between them.

$$ s_j = \cos\left(x_j^i, \bar{x}^i\right), \qquad \bar{x}^i = \frac{1}{n_i}\sum_{j=1}^{n_i} x_j^i \quad (1) $$

where $\cos(\cdot,\cdot)$ is the cosine similarity function, and the mean feature vector $\bar{x}^i$ is calculated as the average of all feature vectors within the category in the $i$-th sub-feature matrix. The sub-feature matrix comprises $n_i$ individual feature vectors $x_j^i$, and $s_j$ measures the similarity between the feature vector $x_j^i$ and the category mean.
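For illustration, a minimal NumPy sketch of this homogenization-and-sorting step is given below; the function name and the descending sort order are our assumptions rather than details taken from the text.

```python
import numpy as np

def homogenize_and_sort(sub_feats):
    # sub_feats: (n_i, C) sub-feature matrix of one category.
    mean_vec = sub_feats.mean(axis=0)                 # common attribute vector
    # Cosine similarity s_j between each x_j^i and the category mean (Eq. 1).
    num = sub_feats @ mean_vec
    denom = (np.linalg.norm(sub_feats, axis=1)
             * np.linalg.norm(mean_vec) + 1e-12)
    sims = num / denom
    order = np.argsort(-sims)                         # most homogeneous first
    return sub_feats[order], sims, order
```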
Feature Transformer Module. Given that adjacent nodes in the sorted sub-feature matrices typically exhibit similar attributes, our main objective is to determine the most suitable attributes for each dimension $C$ in the sorted 2D sub-feature matrix and allocate local advantage values to these attributes. $C$ also represents the depth of the 3D feature map. To accomplish this objective, we transform the sorted $n_i \times C$ sub-feature matrix into an $h_i \times h_i \times C$ feature map, where $n_i$ is the number of feature vectors in the $i$-th sub-feature matrix. It is worth noting that $\sqrt{n_i}$ is not always an integer. Therefore, in order to obtain an integer value for $h_i$, $\sqrt{n_i}$ is rounded up to the nearest integer, i.e., $h_i = \lceil\sqrt{n_i}\rceil$.
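The reshaping step can be sketched as follows; zero-padding the unused slots is our assumption, consistent with the zero-filling described later in the FD module.

```python
import numpy as np

def to_feature_map(sorted_feats):
    # sorted_feats: (n_i, C) sorted sub-feature matrix.
    n_i, C = sorted_feats.shape
    h_i = int(np.ceil(np.sqrt(n_i)))                  # h_i = ceil(sqrt(n_i))
    fmap = np.zeros((h_i * h_i, C), dtype=sorted_feats.dtype)
    fmap[:n_i] = sorted_feats                         # zero-pad missing slots
    return fmap.reshape(h_i, h_i, C)                  # 3D sub-feature map
```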
Max Pooling. In the last stage of the FC, we aim to identify the locally dominant and representative attributes within each feature map and associate them with their corresponding category. To achieve this, we perform max pooling to downsample the feature maps and capture the most prominent features of each node category. The resulting feature maps are updated according to Equation (2) to obtain their new spatial dimension, $h_i'$:

$$ h_i' = \left\lfloor \frac{h_i - k}{S} \right\rfloor + 1 \quad (2) $$

where $k$ is the pooling kernel size and $S$ is the stride.
Max pooling with stride $S$ effectively captures the intrinsic properties of each node category, which are crucial for accurately computing the internal priority of the nodes in the FD module. The choice of $S$ plays a vital role in balancing computational cost and performance: excessively large stride values reduce the computational burden but may degrade performance, whereas excessively small stride values can result in a prohibitively large computational burden.
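A minimal sketch of this downsampling step, assuming a $k \times k$ window with stride $S$ as in Equation (2):

```python
import numpy as np

def max_pool_map(fmap, k=2, S=2):
    # fmap: (h_i, h_i, C) sub-feature map; k and S are hyperparameters.
    h, _, C = fmap.shape
    h_out = (h - k) // S + 1                          # Equation (2)
    out = np.empty((h_out, h_out, C), dtype=fmap.dtype)
    for r in range(h_out):
        for c in range(h_out):
            win = fmap[r * S:r * S + k, c * S:c * S + k]
            out[r, c] = win.max(axis=(0, 1))          # locally dominant attributes
    return out
```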
3.2.2. Feature Diffusion Module
Although the FC can highlight the most important features in each category, the priority between categories remains unknown. Therefore, the main goal of the FD is to determine the priority level of the crucial attributes across different categories, in order to classify the nodes more accurately. To achieve this objective, the FD adopts a sequential approach with two main steps: restoring the 3D sub-feature map to its original size and transforming it back into a 2D sub-feature matrix, and then computing the internal priority. The specific details are shown in Figure 2.
Feature Matrix Recover. The first stage of the FD module restores the feature maps to their original size. To achieve this, we use upsampling to recover the reduced feature maps, restoring them from $h_i' \times h_i' \times C$ to $h_i \times h_i \times C$. All missing pixels are filled with zero values, which has no impact on the semantic information contained in the feature maps. Next, the feature maps are reshaped back into 2D sub-feature matrices for further computation. To classify the nodes, each sub-feature matrix is rearranged according to the original order of its nodes. A matching mask with the same shape as the matrix is then generated, recording whether each node in its corresponding category has representative attributes (true/false). This enables the FD module to determine the priority of the most important features across different categories.
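The recovery step might look like the following sketch; placing each pooled value at the top-left corner of its original pooling window is our assumption, as the text only states that missing pixels are zero-filled.

```python
import numpy as np

def recover_matrix(pooled, h_i, n_i, order, S=2):
    # pooled: (h_i', h_i', C) downsampled map; order: the permutation
    # produced by the homogenization sort in the FC module.
    h_out, _, C = pooled.shape
    restored = np.zeros((h_i, h_i, C), dtype=pooled.dtype)
    restored[:h_out * S:S, :h_out * S:S] = pooled     # zero-fill missing pixels
    flat = restored.reshape(h_i * h_i, C)[:n_i]       # back to 2D sub-feature matrix
    recovered = np.empty_like(flat)
    recovered[order] = flat                           # restore original node order
    mask = recovered.any(axis=1)                      # True where a node kept a
    return recovered, mask                            #   representative attribute
```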
Matrix Calculator. The module applies a self-attention mechanism with learnable parameters to each updated feature matrix, using Equation (3) to calculate the prioritization of nodes within each category. This scheme identifies the most important nodes in each category and assigns them higher priority for further processing.

$$ \alpha_k = \frac{\exp\left(w^{\top} v_k\right)}{\sum_{k'=1}^{C} \exp\left(w^{\top} v_{k'}\right)} \quad (3) $$

where $\alpha_k$ and $v_k$ indicate the attention coefficient and the vector of the $k$-th dimension in each sub-feature matrix, respectively, and $w$ is a learnable parameter vector. We use the masks generated in the previous stage to record the positions of the nodes' key features, in order to reduce the consumption of computational resources. Then, we activate the new nonlinear feature of node $t$ using Equation (4):

$$ h_t = \sigma\left(\alpha \odot x_t\right) \quad (4) $$

where $x_t$ is the feature vector of node $t$, $\alpha$ collects the dimension-wise attention coefficients, and $\sigma(\cdot)$ is a nonlinear activation function.
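Under this reconstruction of Equations (3) and (4) (the softmax form, the mask usage, and the tanh activation are our assumptions), the computation can be sketched as:

```python
import numpy as np

def dimension_attention(feats, w, mask):
    # feats: (n_i, C) recovered sub-feature matrix; w: (n_i,) learnable
    # vector; mask: (n_i,) boolean key-feature positions from the FD.
    v = feats * mask[:, None]                 # attend only to key positions
    scores = w @ v                            # one score per dimension k
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                      # attention coefficients (Eq. 3)
    h = np.tanh(feats * alpha)                # re-weighted activation (Eq. 4)
    return alpha, h
```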
The final step of this process is to obtain an attention matrix from each of the two branches separately, as in Equation (5):

$$ e_{tm} = \sigma\left(W\left[\,h_t \,\Vert\, h_m\,\right]\right) \quad (5) $$

where $W$ represents the weights that can be adjusted during the learning process, $[\cdot \Vert \cdot]$ denotes concatenation, and $\sigma$ is the activation function, with $\sigma_1$ and $\sigma_2$ used for the ICR and IRE branches, respectively. In this context, $h_t$ denotes the representation of the center node $t$, while $m \in \mathcal{N}(t)$ is one of the neighbor nodes associated with $t$, whose representation is denoted as $h_m$. The coefficient $e_{tm}$ indicates the significance assigned by node $t$ to its neighboring node $m$. Finally, the attention matrix obtained by combining the two branches contains both the graph structure information and the node-specific information.
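A sketch of Equation (5) under our GAT-style reading; the concatenation form and the specific activations assigned to each branch are assumptions:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0.0, x, slope * x)

def attention_coefficient(h_t, h_m, W, branch="ICR"):
    # h_t, h_m: (C,) center/neighbor representations; W: (2 * C,) weights.
    z = W @ np.concatenate([h_t, h_m])        # W [h_t || h_m]
    return leaky_relu(z) if branch == "ICR" else np.tanh(z)  # sigma_1 / sigma_2
```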
Algorithm 1 summarizes the ICR algorithm.

Algorithm 1: The algorithmic flow of the Intra-node Contextual Reasoning (ICR).