Article

ESA-GCN: An Enhanced Graph-Based Node Classification Method for Class Imbalance Using ENN-SMOTE Sampling and an Attention Mechanism

1
College of Information Science and Engineering, China University of Petroleum (Beijing), Beijing 102249, China
2
College of Artificial Intelligence, China University of Petroleum (Beijing), Beijing 102249, China
3
Beijing Key Laboratory of Petroleum Data Mining, China University of Petroleum (Beijing), Beijing 102249, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(1), 111; https://doi.org/10.3390/app14010111
Submission received: 18 November 2023 / Revised: 17 December 2023 / Accepted: 19 December 2023 / Published: 22 December 2023
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

In recent years, graph neural networks (GNNs) have achieved great success in node classification tasks. However, as data grow explosively across industries, the problem of class imbalance becomes increasingly severe. Traditional GNNs tend to prioritize majority class nodes when dealing with imbalanced class distributions and fail to adequately capture the features of minority class nodes, leading to significant difficulties and challenges in data classification. To address issues such as inaccurate edge generation during graph data oversampling, insufficient representation of minority classes, and the presence of noisy samples, this paper proposes the ESA-GCN model. The advantages of this model are as follows: (i) it employs the ENN-SMOTE comprehensive sampling method to balance the dataset by reducing majority class nodes and increasing minority class nodes; (ii) the ENN algorithm reduces the classifier’s error rate and improves performance stability by removing low-quality and noisy data; (iii) an attention mechanism is introduced during the edge generation phase between new nodes and original nodes, fully considering the mutual relationships between nodes and concentrating on a subset of key information with high weights, thereby significantly improving classification accuracy while reducing model parameters and computational complexity. Experiments on three public datasets (Cora, Citeseer, and PubMed) demonstrate that the ESA-GCN achieves significant improvements on class-imbalanced node classification tasks.

1. Introduction

Recently, GNNs have achieved advanced performance in node classification. However, most existing GNNs are based on the sometimes incorrect assumption that the samples of different classes are balanced. The resulting class imbalance problem has gained more importance due to the exponential growth of data in various industries. Imbalanced data is prevalent in real-world applications such as fraud detection, disease diagnosis, and financial risk analysis [1]. When dealing with imbalanced datasets in graph node classification tasks, directly training a GNN classifier with the original data may not adequately represent samples from minority classes, resulting in poor performance [2]. The significant imbalance in the number of training instances between different classes greatly affects supervised learning performance, making imbalanced node classification a crucial research area.
Currently, there have been preliminary advancements in addressing imbalanced node classification. Shi et al. [3] proposed a novel graph convolutional neural network with two regularizations to tackle class imbalance in graph data. They trained all unlabeled nodes to have data distributions similar to those of well-trained nodes, promoting balanced training between different classes. However, the limitation of this algorithm lies in its difficulty in generalizing to large-scale graphs. Zhao et al. [2] proposed oversampling through generating nodes and edges to balance the class distribution. However, applying this method to graphs disregards the intricate relationships and interactions between nodes, resulting in potential quality concerns and increased complexity in edge generation. To address this issue, Wu et al. [4] proposed GraphMixup, which creates distinct semantic spaces for mixing semantic features at a semantic level and introduces context-based edge mixing for graphs. Two context-based, self-supervised tasks are designed to consider both local and global structural information in the graph structure [4]. Additionally, a reinforcement mixing mechanism is developed to dynamically determine the upsampling ratio for each minority class. However, GraphMixup interpolates data from two different graphs, mixing structural information between them. This may blur the original graph data features and introduce noise and inconsistency. Qu et al. [5] proposed the ImGAGN model, which addresses class imbalance by synthesizing minority nodes through feature interpolation. This is achieved by generating a weight matrix that interpolates the features of all minority nodes. If a weight in the matrix is greater than a fixed threshold, the synthesized node is connected to the corresponding original minority node. However, the performance of ImGAGN largely depends on the quality and diversity of the training data; if the training data are insufficient or contain significant noise, the quality of the embedding results may decrease. In addition, ImGAGN mainly targets binary classification, so the sample diversity of the synthesized nodes is significantly limited. Park et al. [6] introduced a novel data augmentation model called GraphENS, which utilizes all nodes to synthesize minority nodes. The experimental findings demonstrate that this approach exhibits superior performance in node classification tasks, particularly when the minority classes contain only a small number of nodes. However, the GraphENS algorithm is susceptible to noise and requires data preprocessing, which may lead to loss of information. These drawbacks indicate that the application of the GraphENS algorithm in real-world scenarios may be limited. Li et al. [7] proposed a graph neural network framework with curriculum learning (GNNCL). To overcome the problem that prevents traditional sampling methods from being applied to graphs, a novel graph-based oversampling method, adaptive graph oversampling, is adopted, which supplements nodes and edges in the graph and generates new edges based on two basic attributes: smoothness and homogeneity. The researchers also focus on improving the representation quality of original and synthesized minority class nodes by using metric learning to incrementally improve minority classes and incorporating a neighbor-based triplet loss to identify the sparse boundary of minority class samples. However, the GNNCL model also has some drawbacks.
When handling imbalanced node classification problems, due to the imbalance of labels, GNNCL often cannot distinguish and classify minority classes well, which may affect the accuracy and generalization ability of the model. Additionally, GNNCL relies on graph structure data and necessitates a high-quality and representative graph structure. Incomplete or noisy structural information in the input graph can have a noteworthy impact on the model’s performance.
Overall, existing research has made some progress in addressing the problem of class imbalance in graph data. However, these studies still face challenges such as inaccurate edge generation during graph oversampling, high complexity, insufficient expressive power for minority classes, and the presence of noisy samples. To address these issues, this study proposes the ESA-GCN model, which combines the ENN-SMOTE comprehensive sampling method with attention mechanisms. The contributions of this research are summarized as follows:
  • The ENN-SMOTE sampling method is designed to balance the dataset by decreasing the majority class nodes and increasing the minority class nodes. Furthermore, it improves the classifier’s performance stability and reduces its error rate by eliminating low-quality and noisy sample data using the ENN algorithm.
  • Moreover, the incorporation of attention mechanisms during the edge generation stage of graph oversampling takes into account the interconnectedness between nodes and emphasizes a subset of important information with higher weights.
  • Extensive experimental results on three public datasets demonstrate the performance advantages of the ESA-GCN model.

2. Related Work

2.1. Imbalanced Node Classification

Most existing GNN models are based on the assumption that node samples from different classes are balanced [2]. However, in many real-world graphs, there is a problem of class imbalance. Training a graph neural network (GNN) classifier directly with the original data can be problematic in scenarios where there is an imbalance between the classes. This is because the GNN may not adequately represent samples from minority classes, resulting in poor performance.
Currently, there have been some studies focused on solving the problem of imbalanced node classification. For example, Shi et al. [3] proposed a novel graph convolutional network that employs two regularization methods to handle class imbalance in graph data. It achieves balanced training across different classes by training unlabeled nodes to have a data distribution similar to that of well-trained nodes. However, the limitation of this algorithm is its difficulty in generalizing to large-scale graphs. Additionally, Zhao et al. [2] perform oversampling by generating nodes and edges to achieve class balance. However, this approach encounters challenges in edge generation, such as high time complexity and overlooking intricate relationships and feature interactions among nodes, potentially compromising the quality of the generated edges. Wu et al. [4] introduced GraphMixup, which establishes a distinct semantic space for conducting semantic feature mixing at the semantic level, taking into account both local and global information in the graph structure. Nevertheless, GraphMixup interpolates between data from two distinct graphs, which may blur the original data features and introduce noise and inconsistencies. ImGAGN [5] synthesizes minority nodes by using a generated weight matrix to interpolate the features of all minority nodes, and then performs classification after connecting the synthesized nodes with the original minority nodes. The drawback of this approach is that it may lead to information loss. Subsequently, Park et al. [6] proposed GraphENS, a novel data augmentation model that synthesizes minority nodes using all nodes, leading to superior performance, particularly when the graph has only a small number of minority-class nodes. Nonetheless, the GraphENS algorithm updates the representation of the target node by incorporating information from neighboring nodes, which could be noisy or erroneous, potentially compromising the quality of the embeddings. Despite adopting an adaptive neighbor selection mechanism to mitigate this sensitivity, it cannot fully eliminate it and necessitates preprocessing of the dataset, resulting in potential information loss. Li et al. [7] introduced GNNCL, a graph neural network framework with curriculum learning that employs adaptive graph oversampling to augment nodes and edges in the graph, enhancing the representation quality of minority class nodes through metric learning and a neighbor-based triplet loss. However, due to label imbalance, the GNNCL model often struggles to effectively differentiate and classify minority classes, thereby impacting its accuracy and generalization capability. Furthermore, the GNNCL model heavily relies on structural information from the input graph, and if the input graph is incomplete or contains noise, the model’s performance can be significantly affected.
In summary, existing research on addressing the problem of unbalanced node classification has some drawbacks and limitations, such as inaccurate edge generation during resampling, poor generalization ability, susceptibility to noise, and high requirements for graph structure quality. These limitations may restrict the practical application of these algorithms in real-world scenarios.

2.2. Graph Neural Networks for Node Classification

GNNs [8] have become a widely used class of models in recent years. They transform complex graph-structured data into meaningful representations for downstream mining tasks through information propagation and aggregation, based on the dependencies in the network. Among all GNN models, graph convolutional networks (GCNs) are considered a major solution and can be categorized into two types: spectral methods and spatial methods. Among the spectral methods, Bruna et al. [8] proposed using Fourier basis vectors for graph convolution in the spectral domain. ChebNet [9] introduced smooth filters in spectral convolution, which can be approximated well by K-order Chebyshev polynomials. Kipf et al. [10] further simplified and constrained the parameters of ChebNet by introducing a convolutional architecture based on a local first-order approximation of spectral graph convolution. Alternatively, spatial methods operate directly on the graph by defining operations on a target node and its topological neighbors, enabling aggregation operations on the graph structure. For example, Hamilton et al. [11] proposed the GraphSAGE model, which samples and aggregates neighbor nodes through graph sampling so that each node has a limited number of sampled neighbors. By combining graph sampling and feature aggregation, GraphSAGE can perform node embedding and representation learning by utilizing local node information while still considering global information from the entire network, overcoming the challenge of loading features for all nodes in large-scale networks. Another improvement to the GCN resulted in the Graph Attention Network (GAT) model [12], which incorporates node feature expressions into the definition of the aggregation function using attention mechanisms: the weights of neighboring nodes are computed and aggregated to the central node as a weighted sum. However, these methods do not address the bias introduced by majority class nodes during training, making them unsuitable for directly classifying imbalanced nodes.

2.3. Resampling Algorithm

Resampling techniques balance imbalanced sample distributions through oversampling or undersampling. The Synthetic Minority Oversampling Technique (SMOTE) [13] is a classic oversampling method that randomly selects samples and performs linear interpolation between each selected sample and its nearest neighbors. While SMOTE oversampling can partially alleviate data distribution imbalance, it lacks a specific selection criterion before performing linear interpolation on minority class samples, which may result in classifier overfitting. On the undersampling side, Hart proposed the Condensed Nearest Neighbor (CNN) rule in 1968 [14], which iteratively uses the 1-nearest-neighbor rule to determine whether a sample should be retained in the majority class. In 1976, Tomek defined Tomek Links [15], which are pairs of different-class samples that are nearest neighbors to each other; this method helps identify border samples and noisy samples. Wilson proposed the Edited Nearest-Neighbor (ENN) rule [16], which can clean up samples for each class but may delete more samples and lead to data loss. Hybrid sampling is a combination strategy of undersampling and oversampling that addresses the drawbacks of using either technique alone. Batista et al. [17] proposed two hybrid sampling methods: SMOTE + Tomek and SMOTE + ENN. These methods first apply SMOTE oversampling to the minority class and then use Tomek Links or ENN, respectively, to clean the data by removing samples that encroach on the opposite class. Overall, SMOTE-ENN hybrid sampling effectively manages imbalanced classification problems by synthesizing minority class samples and removing noisy samples, thereby enhancing model performance and generalization ability.
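As a concrete illustration of this hybrid strategy, the sketch below applies SMOTE + ENN to ordinary feature vectors with the imbalanced-learn package; the ESA-GCN described in Section 3 applies the same idea inside the embedding space produced by its GNN feature extractor. The toy data and class sizes are our own choices, not taken from the paper.

```python
# Minimal illustration of SMOTE + ENN hybrid sampling on plain feature
# vectors using the imbalanced-learn package.
import numpy as np
from collections import Counter
from imblearn.combine import SMOTEENN

rng = np.random.default_rng(0)

# Toy imbalanced data: 200 majority samples vs. 20 minority samples.
X_maj = rng.normal(loc=0.0, scale=1.0, size=(200, 8))
X_min = rng.normal(loc=2.0, scale=1.0, size=(20, 8))
X = np.vstack([X_maj, X_min])
y = np.array([0] * 200 + [1] * 20)

print("before:", Counter(y))

# SMOTEENN first oversamples the minority class with SMOTE,
# then removes noisy/ambiguous samples with the Edited Nearest Neighbour rule.
sampler = SMOTEENN(random_state=0)
X_res, y_res = sampler.fit_resample(X, y)

print("after: ", Counter(y_res))
```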

3. Methods

In this section, we will provide a detailed introduction to the proposed ESA-GCN model. The main idea of the ESA-GCN is to use a series of sampling and edge generation techniques in the expressive embedding space obtained by the graph neural network (GNN)-based feature extractor to improve node classification performance on imbalanced datasets. The ESA-GCN consists of four parts: a GraphSAGE-based feature extractor, a resampling module, an edge generator, and a GCN-based node classification module. The conceptual framework of the ESA-GCN model is shown in Figure 1.
The ESA-GCN can fully utilize the feature extraction ability of GNNs and the strategies for enhancing sample balance to improve node classification effectiveness on imbalanced datasets. What makes the ESA-GCN innovative is its comprehensive use of undersampling, oversampling, and edge generation techniques to enhance the ability to solve node classification problems in imbalanced graph data.

3.1. Feature Extractor

The feature extractor plays a critical role in learning node representations by preserving node attributes and graph topology, facilitating the generation of synthesized nodes. Graph convolutional networks (GCNs) can typically be used to accomplish this task. However, a GCN has certain limitations and may not handle various types of local topologies well, leading to limited effectiveness when dealing with new graph structures. To overcome these issues, GraphSAGE is introduced as an improvement over GCNs because it can effectively learn from multiple types of local topologies and possesses strong generalization capabilities. Considering this, the ESA-GCN adopts a GraphSAGE block as the core module of the feature extractor to achieve learning of node representations. Within this block, the message passing and fusion processes are formulated as Equation (1).
$h_v^1 = \sigma\left(W^1 \cdot \mathrm{CONCAT}\left(F[v,:],\; F \cdot A[:,v]\right)\right)$   (1)
Here, $F$ represents the input node attribute matrix, and $F[v,:]$ represents the attributes of node $v$. $A[:,v]$ is the $v$-th column of the adjacency matrix, $h_v^1$ is the embedding of node $v$, $W^1$ is the weight parameter, and $\sigma(\cdot)$ refers to an activation function such as $\mathrm{ReLU}(\cdot)$.
The ESA-GCN model uses GraphSAGE in the GNN as the feature extractor, which learns node representations through message passing and fusion processes to measure similarity and interpolation operations between nodes.
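The following is a minimal PyTorch sketch of the message-passing step in Equation (1), assuming a dense adjacency matrix and a dense attribute matrix; the layer name, dimensions, and toy inputs are illustrative assumptions rather than the authors' implementation.

```python
# Sketch of the GraphSAGE-style feature extractor in Equation (1), assuming a
# dense adjacency matrix A (n x n) and attribute matrix F (n x d).
import torch
import torch.nn as nn

class SageLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        # W^1 acts on the concatenation of a node's own attributes and the
        # aggregated attributes of its neighbours, hence 2 * in_dim inputs.
        self.linear = nn.Linear(2 * in_dim, out_dim)
        self.act = nn.ReLU()

    def forward(self, F: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        neigh = A.t() @ F                   # row v equals F^T A[:, v], i.e. aggregated neighbours
        h = torch.cat([F, neigh], dim=1)    # CONCAT(F[v, :], F . A[:, v]) for every node v
        return self.act(self.linear(h))     # h_v^1 = sigma(W^1 . concat)

# Toy usage: 5 nodes, 4 attributes, 8-dimensional embeddings.
F = torch.rand(5, 4)
A = torch.eye(5)
extractor = SageLayer(4, 8)
H1 = extractor(F, A)                        # H1[v] is the embedding h_v^1
print(H1.shape)                             # torch.Size([5, 8])
```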

3.2. Resampling Module

The original imbalanced dataset is made more balanced by the resampling module. The ESA-GCN model combines undersampling and oversampling methods to address the imbalanced dataset problem. The Edited Nearest Neighbor (ENN) undersampling method, used by the ESA-GCN, reduces the majority class nodes based on their density in the feature space. It removes low-density majority class nodes, reducing noise and duplicate samples in the majority class samples of the imbalanced dataset and improving the accuracy of the classifier. Then, the ESA-GCN uses the Synthetic Minority Oversampling Technique (SMOTE) to generate minority class nodes. SMOTE synthesizes new minority class nodes to increase their sample count and interpolates them in the feature space with neighboring minority class nodes, thereby enhancing the classifier’s ability to learn from the minority class. The resampling module includes both the ENN undersampling process and the SMOTE oversampling process.

3.2.1. ENN Undersampling Process

After obtaining the representations of each node in the embedding space constructed by the feature extractor, we perform undersampling on the majority of class nodes. The basic idea of ENN undersampling is as follows:
First, for each majority class sample, we calculate its distance to its nearest neighbor sample. We iteratively go through the majority class samples, and for each sample, we check the classes of its neighboring samples. If more than half of the neighboring samples belong to the majority class, the sample is retained; otherwise, it is deleted.
Let $h_v^1$ be the embedding of a labeled majority class node with label $Y_v$. The first step is to use the Euclidean distance in the embedding space to find the closest node $ne(v)$ that belongs to the same class as $h_v^1$, and we record the number of same-class neighbors as $m$. The calculation of $ne(v)$ is defined in Equation (2):
$ne(v) = \operatorname{argmin}_{u} \left\| h_u^1 - h_v^1 \right\|, \quad \text{s.t. } Y_u = Y_v$   (2)
Then, we find the total number of neighbors of node $h_v^1$, denoted as $n$. If the number of same-class neighbors is greater than half of the total number of neighbors according to Equation (3), the node $h_v^1$ is kept; otherwise, the node and its corresponding edges are deleted.
$m \geq \dfrac{n}{2}$   (3)
By following the aforementioned steps, ENN undersampling is completed, reducing noise and duplicate samples in the majority class samples of an imbalanced dataset.
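A minimal sketch of this ENN rule applied to node embeddings is shown below; it uses scikit-learn's NearestNeighbors, and the helper name, the neighbor count, and the toy data are our own choices rather than the paper's implementation.

```python
# Sketch of the ENN undersampling step (Equations (2) and (3)) on node
# embeddings: keep a majority node only if at least half of its nearest
# neighbours share its class.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def enn_keep_mask(H: np.ndarray, y: np.ndarray, majority_classes, k: int = 3) -> np.ndarray:
    """Return a boolean mask of nodes to keep after ENN undersampling."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(H)   # +1: each node is its own nearest neighbour
    _, idx = nn.kneighbors(H)
    keep = np.ones(len(y), dtype=bool)
    for v in range(len(y)):
        if y[v] not in majority_classes:
            continue                                  # only majority nodes are candidates for removal
        neigh = idx[v, 1:]                            # drop the node itself
        m = np.sum(y[neigh] == y[v])                  # same-class neighbours (m in Eq. (3))
        if m < k / 2:                                 # fewer than half agree -> treat as noise
            keep[v] = False
    return keep

# Toy usage on random embeddings: class 0 is the majority class.
rng = np.random.default_rng(1)
H = rng.normal(size=(50, 16))
y = np.array([0] * 40 + [1] * 10)
mask = enn_keep_mask(H, y, majority_classes={0})
print(mask.sum(), "of", len(y), "nodes kept")
```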

3.2.2. SMOTE Oversampling Process

After obtaining node representations in the embedding space through the feature extractor, we oversample the minority class to generate representations for new samples. The basic idea of SMOTE is to interpolate the target minority class samples with their nearest same-class neighbors in the embedding space. Using Equation (2), we find the closest node $ne(v)$ that belongs to the same class as $h_v^1$.
For the nearest neighbor node, we generate a new node according to Equation (4) and obtain the node embedding of the new node, $h_{n(v)}^1$:
$h_{n(v)}^1 = (1 - \delta) \cdot h_v^1 + \delta \cdot h_{ne(v)}^1$   (4)
In Equation (4), $\delta$ is a random variable uniformly distributed in the range $[0, 1]$. Since $h_v^1$ and $h_{ne(v)}^1$ belong to the same class and their distributions are close to each other, the generated synthetic node $h_{n(v)}^1$ should also belong to the same class. This allows us to obtain labeled synthetic nodes.
After the above steps, an augmented dataset is obtained in which the majority class samples are reduced using the ENN method and the minority class samples are increased using the SMOTE method. This sampling method can make the dataset more balanced and improve the performance of the classifier.
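The following sketch illustrates Equations (2) and (4) on a plain embedding matrix; the function name, the number of synthetic nodes, and the toy data are illustrative assumptions, since the actual count is governed by the imbalance ratio.

```python
# Sketch of the SMOTE oversampling step (Equations (2) and (4)) in the
# embedding space: interpolate each chosen minority node with its nearest
# same-class neighbour.
import numpy as np

def smote_oversample(H: np.ndarray, y: np.ndarray, minority_class: int,
                     n_new: int, rng: np.random.Generator):
    """Generate n_new synthetic embeddings for one minority class."""
    idx = np.where(y == minority_class)[0]
    new_embs = []
    for _ in range(n_new):
        v = rng.choice(idx)
        # Equation (2): nearest same-class neighbour by Euclidean distance.
        others = idx[idx != v]
        ne_v = others[np.argmin(np.linalg.norm(H[others] - H[v], axis=1))]
        # Equation (4): interpolate between the node and its neighbour.
        delta = rng.uniform(0.0, 1.0)
        new_embs.append((1.0 - delta) * H[v] + delta * H[ne_v])
    new_embs = np.stack(new_embs)
    new_labels = np.full(n_new, minority_class)
    return new_embs, new_labels

# Toy usage: add 15 synthetic nodes for minority class 1.
rng = np.random.default_rng(2)
H = rng.normal(size=(30, 16))
y = np.array([0] * 25 + [1] * 5)
H_new, y_new = smote_oversample(H, y, minority_class=1, n_new=15, rng=rng)
print(H_new.shape, y_new.shape)   # (15, 16) (15,)
```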

3.3. Edge Generator

An edge generator is required in order to establish connections and create a complete graph structure, as there are no direct edges between the newly generated nodes from the resampling module and the original nodes. The ESA-GCN incorporates the attention mechanism and GraphSAGE to predict the links between synthetic nodes. By learning the relationships and weights between nodes, the edge generator generates links between synthetic nodes within the embedding space, resulting in the construction of a balanced augmented graph. This enables the GNN to more effectively capture the structure and information between nodes, thereby enhancing classification performance in node classification tasks.
The edge generator incorporates the attention mechanism to dynamically adjust the importance of each node. This ensures that the learning process gives more weight to minority class nodes and takes into account the interaction between nodes when computing the feature vectors. The similarity between the feature vectors of two nodes is calculated using the sigmoid function, and edges are generated accordingly to establish a complete graph structure. The calculation formula for node embedding is given in Equation (5):
$h_i' = a_{i,i} W h_i + \sum_{j \in N(i)} a_{i,j} W h_j$   (5)
In the equation, $W$ represents the weight parameters, $N(i)$ denotes the set of neighboring nodes of node $i$, $h_i$ represents the feature vector of node $i$, $h_j$ represents the feature vector of node $j$, $h_i'$ represents the feature vector of node $i$ after the attention computation, and $a_{i,j}$ represents the attention of node $j$ on node $i$. LeakyReLU is a variant of the rectified linear unit (ReLU) that introduces a small slope for negative input values instead of setting them to zero. The calculation formula for $a_{i,j}$ is given in Equation (6):
$a_{i,j} = \dfrac{\exp\left(\mathrm{LeakyReLU}\left(a^{T}\left[W x_i \,\|\, W x_j\right]\right)\right)}{\sum_{k \in N(i) \cup \{i\}} \exp\left(\mathrm{LeakyReLU}\left(a^{T}\left[W x_i \,\|\, W x_k\right]\right)\right)}$   (6)
Equation (7) is used to compute the relational information $E_{v,u}$ between the predicted node representations of nodes $v$ and $u$. Here, $S$ is a parameter matrix that captures the interaction between nodes.
$E_{v,u} = \mathrm{softmax}\left(\sigma\left(h_v^1 \cdot S \cdot h_u^1\right)\right)$   (7)
By introducing the attention mechanism, the ESA-GCN is able to consider the interactions between nodes more comprehensively. It flexibly adjusts the weights of nodes during the edge generation process, resulting in a more accurate graph structure and making it more effective in handling imbalanced data. Additionally, incorporating the attention mechanism allows for the concentrated processing of crucial information in the input data with higher weights while ignoring irrelevant information. This significantly reduces the model parameters and computational complexity.
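Below is a hedged PyTorch sketch of the edge generator, combining a GAT-style attention coefficient (Equation (6)), attention-weighted aggregation (Equation (5)), and a bilinear edge score (Equation (7)); the class name, the use of a plain sigmoid in place of the softmax wrapper of Equation (7), and the 0.5 threshold for accepting an edge are our assumptions.

```python
# Sketch of the attention-based edge generator (Equations (5)-(7)): attention
# coefficients over neighbours, weighted aggregation, then a bilinear score
# sigma(h_v S h_u) used to decide whether to connect two nodes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionEdgeGenerator(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)       # shared projection W
        self.a = nn.Parameter(torch.randn(2 * dim))    # attention vector a
        self.S = nn.Parameter(torch.randn(dim, dim))   # interaction matrix S in Eq. (7)

    def attention(self, H: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Equation (6): a_{i,j} = softmax_j( LeakyReLU( a^T [W h_i || W h_j] ) )
        Wh = self.W(H)                                              # (n, d)
        scores = F.leaky_relu(
            (Wh @ self.a[: Wh.size(1)]).unsqueeze(1)                # contribution of W h_i
            + (Wh @ self.a[Wh.size(1):]).unsqueeze(0))              # contribution of W h_j
        scores = scores.masked_fill(adj == 0, float("-inf"))        # restrict to N(i) and self-loops
        return torch.softmax(scores, dim=1)

    def forward(self, H: torch.Tensor, adj: torch.Tensor):
        alpha = self.attention(H, adj)
        H_att = alpha @ self.W(H)                                   # Equation (5): weighted aggregation
        E = torch.sigmoid(H_att @ self.S @ H_att.t())               # Equation (7): edge scores E_{v,u}
        return (E > 0.5).float(), E                                 # thresholded adjacency and raw scores

# Toy usage: 6 nodes with self-loops only.
H = torch.rand(6, 16)
adj = torch.eye(6)
gen = AttentionEdgeGenerator(16)
new_adj, scores = gen(H, adj)
print(new_adj.shape)   # torch.Size([6, 6])
```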

3.4. Classification Module

The node classification module utilizes the GraphSAGE method to predict node classes. Assuming the edge generator module produces an augmented graph $\tilde{G} = \{\tilde{A}, \tilde{H}\}$, $\tilde{H}$ is the enhanced node representation set obtained by concatenating $H$ (the embeddings of the original nodes) with the embeddings of the synthetic nodes, $\tilde{A}$ is the enhanced adjacency matrix obtained by extending $A$ (the adjacency matrix of the original nodes) with the edges of the synthetic nodes, and $\tilde{V}_L$ refers to the enhanced label set after merging the synthetic nodes into the original labeled node set $V_L$.
In $\tilde{G}$, the sizes of different data categories become balanced, and an unbiased GCN classifier can be trained on this basis. In this module, a GraphSAGE block is used in conjunction with a linear layer for node classification on $\tilde{G}$.
$h_v^2 = \sigma\left(W^2 \cdot \mathrm{CONCAT}\left(h_v^1,\; \tilde{H} \cdot \tilde{A}[:,v]\right)\right)$   (8)
$P_v = \mathrm{softmax}\left(\sigma\left(W^c \cdot \mathrm{CONCAT}\left(h_v^2,\; H^2 \cdot \tilde{A}[:,v]\right)\right)\right)$   (9)
In Equation (9), $H^2$ represents the node representation matrix produced by the second GraphSAGE block, and $W^c$ is the weight parameter. $P_v$ represents the probability distribution of node $v$ over the class labels. The classifier module is optimized using the cross-entropy loss:
$L_{node} = \sum_{u \in V_L} \sum_{c=1}^{C} \mathbb{1}\left(Y_u = c\right) \cdot \log\left(P_u[c]\right)$   (10)
During the testing process, the predicted class $Y_v'$ of node $v$ is set to the class with the highest probability, as shown in Equation (11):
$Y_v' = \operatorname{argmax}_{c} P_v[c]$   (11)
By following the above steps, the final node classification result $Y_v'$ is obtained.
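A minimal sketch of the classification module is shown below; it reuses the concatenate-and-project pattern of Equation (1), folds the softmax of Equation (9) into PyTorch's cross-entropy loss, and the class name and toy dimensions are assumptions rather than the authors' implementation.

```python
# Sketch of the GraphSAGE-based classification module (Equations (8)-(11))
# applied to the augmented graph.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SageClassifier(nn.Module):
    def __init__(self, in_dim: int, hid_dim: int, n_classes: int):
        super().__init__()
        self.lin2 = nn.Linear(2 * in_dim, hid_dim)       # W^2 in Equation (8)
        self.lin_c = nn.Linear(2 * hid_dim, n_classes)   # W^c in Equation (9)

    def forward(self, H1: torch.Tensor, A_aug: torch.Tensor) -> torch.Tensor:
        # Equation (8): second GraphSAGE block on the augmented graph.
        h2 = F.relu(self.lin2(torch.cat([H1, A_aug.t() @ H1], dim=1)))
        # Equation (9): class logits from the node and its aggregated neighbours.
        logits = self.lin_c(torch.cat([h2, A_aug.t() @ h2], dim=1))
        return logits                                    # softmax is folded into the loss below

# Toy usage with 8 nodes, 16-dim embeddings, 3 classes.
H1 = torch.rand(8, 16)
A_aug = torch.eye(8)
y = torch.randint(0, 3, (8,))
clf = SageClassifier(16, 32, 3)
logits = clf(H1, A_aug)
loss_node = F.cross_entropy(logits, y)   # Equation (10): cross-entropy over labelled nodes
pred = logits.argmax(dim=1)              # Equation (11): class with the highest probability
print(loss_node.item(), pred.shape)
```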

3.5. Training Algorithm

The feature extractor, resampling module, edge generator, and node classifier are combined to form the ESA-GCN model, with its final objective function $L_{total}$ given by Equation (12):
$L_{total} = \min_{\theta, \phi, \varphi} L_{node} + \lambda \cdot L_{edge}$   (12)
where $\theta$, $\phi$, and $\varphi$ are the parameters of the feature extractor, edge generator, and node classifier, respectively. As the performance of the model depends on the quality of the embedding space and the generated edges, we also pretrain the feature extractor and edge generator using $L_{edge}$ to make the training process more stable.
The advantages of the ESA-GCN model can be summarized as follows: (1) The ENN algorithm effectively reduces the classifier’s error rate and improves its performance stability by removing low-quality and noisy sample data. (2) The combination of the ENN algorithm and SMOTE algorithm can make the dataset more balanced and improve the model’s classification performance. (3) The attention mechanism can better consider the relationships between nodes and feature interactions, enabling more accurate aggregation of nodes on the graph. (4) The attention mechanism can concentrate on a part of crucial information in the input data with high weights while ignoring other irrelevant information, greatly reducing the model parameters and computation expense.
The complete workflow of the ESA-GCN model is shown in Algorithm 1. In each optimization step, the feature extractor is used to obtain node representations in the 2nd to 3rd lines. Then, ENN undersampling is performed from the 5th to the 7th line. SMOTE oversampling is then executed from the 8th to the 10th line to balance the node classes. After predicting the generated new sample edges in the 11th and 12th lines, the entire framework is trained together with edge prediction loss and node classification loss, as shown in the 14th line.
Algorithm 1. Full Training Algorithm
Require: $V_{maj}$, $V_{min}$: majority class and minority class nodes
Require: $A$: adjacency matrix
Require: $F$: node attribute matrix
Require: $Y$: node labels
Require: $\sigma(\cdot)$: activation function
Require: $F[v,:]$: the attributes of node $v$
Require: $A[:,v]$: the $v$-th column of the adjacency matrix
Require: $\mathrm{CONCAT}(\cdot)$: tensor concatenation
Require: $\lambda$: loss weight
1: for each t ∈ [1, max_iterations] do
2:  Feature extraction;
3:     $h_v^1 = \sigma\left(W^1 \cdot \mathrm{CONCAT}\left(F[v,:],\; F \cdot A[:,v]\right)\right)$
4:   Resampling;
5:     for node in V m a j do
6:       Perform ENN undersampling in class b, following Equations (2) and (3) of ENN algorithm;
7:     end for
8:     for node in V m i n do
9:       Generate a new sample in class c, following Equations (2) and (4) of SMOTE algorithm;
10:     end for
11:   Generate edges;
12:      Generate A’ using the edge generator based on Equations (5)–(7);
13:   Compute the total loss;
14:       $L_{total} = \min_{\theta, \phi, \varphi} L_{node} + \lambda \cdot L_{edge}$
15:    return trained feature extractor, edge predictor, and node classifier module.
16: end for
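The sketch below condenses Algorithm 1 into a single training step, assuming modules with the interfaces of the earlier sketches (feature extractor, edge generator, classifier). The binary-cross-entropy form of the edge loss and the omission of the resampling step are simplifications we introduce for brevity; the paper itself only fixes the combined objective of Equation (12).

```python
# High-level sketch of one optimization step of Algorithm 1.
import torch
import torch.nn.functional as F

def train_step(extractor, edge_gen, classifier, optimizer,
               Fx, A, y, train_mask, lam=1e-6):
    optimizer.zero_grad()

    # Lines 2-3: feature extraction (Equation (1)).
    H1 = extractor(Fx, A)

    # Lines 5-10: ENN undersampling and SMOTE oversampling would be applied to
    # H1 / y / train_mask here (see the sketches in Section 3.2); this
    # simplified step keeps them unchanged so the example stays short.

    # Lines 11-12: predict edges for the node set (Equations (5)-(7)).
    A_aug, edge_scores = edge_gen(H1, A)
    loss_edge = F.binary_cross_entropy(edge_scores, (A > 0).float())  # assumed L_edge form

    # Line 14: joint objective L_total = L_node + lambda * L_edge (Equation (12)).
    logits = classifier(H1, A_aug)
    loss_node = F.cross_entropy(logits[train_mask], y[train_mask])
    loss = loss_node + lam * loss_edge
    loss.backward()
    optimizer.step()
    return loss.item()
```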

4. Experimental Results and Discussion

This section describes the experiments using three public datasets: Cora [18], Citeseer [18], and PubMed [19]. The Cora [18] dataset is composed of 2708 scientific publications, and its citation network consists of 5429 links. Each publication is represented by a binary word vector with 0/1 values, indicating the absence/presence of the corresponding word in a dictionary. The dictionary consists of 1433 unique words, meaning that each publication is characterized by 1433 features, each represented by a 0/1 value. The Citeseer [18] dataset includes 3312 scientific publications, and its citation network is composed of 4732 links. Similar to the Cora dataset, each publication in Citeseer is described by a 0/1 word vector, indicating the absence or presence of a word in a dictionary. The dictionary for Citeseer contains 3703 unique words, and thus each publication is characterized by 3703 features, each represented by a 0/1 value. The PubMed [19] dataset consists of 19,717 scientific publications on diabetes, and its citation network is made up of 44,338 links. Each publication is described by a TF/IDF-weighted word vector over a dictionary of 500 unique words; therefore, each publication in the PubMed dataset is characterized by 500 TF/IDF-weighted features. In these datasets, the class distribution is relatively balanced, so we used a simulated imbalanced setting: randomly selecting three classes as the minority classes and downsampling them. Each majority class has a training set containing 20 nodes, and for each minority class, the number of training nodes is 20 × imbalance_ratio. All experiments are conducted on a 64-bit machine using an Nvidia GPU (Tesla V100, 1246 MHz, 7 GB of memory). The ESA-GCN model is evaluated on imbalanced classification using two metrics, the average AUC-ROC score and the F1-Macro score. For all methods, the learning rate is initialized as 0.001, and the weight decay is set to 5 × 10−4. λ is set to 1 × 10−6. Unless otherwise specified, the imbalance ratio is set to 0.5, and the undersampling neighbor count K is set to 3. All models are trained until convergence. For the three datasets, the top 4, 3, and 2 classes (sorted by the number of nodes) are selected as the majority classes, and the rest as the minority classes. In the ablation experiment, an additional runtime indicator is included to analyze the training speed of the models.
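A small sketch of this simulated imbalance construction is given below; the function name, the random selection of training nodes, and the toy labels are our own assumptions about details not fully specified above.

```python
# Sketch of the simulated imbalance setting: each majority class keeps 20
# training nodes, each minority class keeps round(20 * imbalance_ratio).
import numpy as np

def make_imbalanced_split(y: np.ndarray, minority_classes, imbalance_ratio: float = 0.5,
                          per_class: int = 20, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    train_idx = []
    for c in np.unique(y):
        candidates = np.where(y == c)[0]
        n = per_class if c not in minority_classes else max(1, round(per_class * imbalance_ratio))
        train_idx.append(rng.choice(candidates, size=n, replace=False))
    return np.concatenate(train_idx)

# Toy usage: 7 classes (as in Cora), classes 4-6 treated as minority classes.
y = np.random.randint(0, 7, size=2708)
train_idx = make_imbalanced_split(y, minority_classes={4, 5, 6}, imbalance_ratio=0.5)
print(len(train_idx))   # 4 * 20 + 3 * 10 = 110 training nodes
```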

4.1. Compared Methods

To validate the effectiveness of the proposed ESA-GCN model, we compared it with four representative methods, which can be divided into two categories: balanced network embedding methods (i.e., GCN and GraphSAGE) and imbalanced network embedding methods (i.e., DR-GCN and GraphSMOTE).
  • GCN: A graph convolutional network (GCN) [10] is the most representative balanced network embedding method, which obtains node embeddings by aggregating the features of neighboring nodes.
  • GraphSAGE: GraphSAGE [11] is another representative GNN method. Unlike GCNs, which use full-sized neighboring nodes to obtain node embeddings, GraphSAGE saves memory by using a fixed number of neighboring nodes for each target node. Additionally, it learns three different aggregators: mean-aggregator, LSTM-aggregator, and pooling-aggregator. We report the best-performing aggregator among these three as the final result of GraphSAGE.
  • DR-GCN: The DR-GCN [3] is an imbalanced network embedding method based on GCNs. It proposes to use conditional adversarial training to enhance the separation of different classes. Additionally, distribution alignment training is applied to balance the majority and minority nodes.
  • GraphSMOTE: GraphSMOTE [2], based on SMOTE [13], has two types of edge generators that can connect synthetic nodes to the original graph through pretraining. We report the results of the best variant of the generator for each dataset.
From Table 1, it can be seen that the ESA-GCN method achieves the greatest improvement over the baseline GNN (GCN) when the dataset is imbalanced. This indicates that the ESA-GCN can effectively alleviate data imbalance and enhance the classification performance of the GNN.

4.2. Ablation Experiment

In this section, we investigate the effect on model performance of adding ENN undersampling and the attention mechanism individually, as well as adding both simultaneously. The experiments were conducted with well-defined constraints: the imbalance ratio was set to 0.5, the loss weight to 0.1, and the undersampling neighbor count K to 3. Each experiment was repeated three times, and the results were averaged. Table 2 presents the average results.
From Table 2, we observed the following points:
After adding ENN undersampling during the sampling phase,
  • Compared to the original model, the Cora, Citeseer, and PubMed datasets showed improvements in both AUC-ROC and F1 scores. This indicates that the ENN algorithm effectively reduces the classifier’s error rate and improves its performance stability by removing low-quality and noisy samples.
  • The ENN undersampling algorithm identifies unreasonable samples by calculating the distances between each sample and its nearest neighbors. Removing these unreasonable samples reduces the data volume, alleviates the burden of model training, and can slightly improve the training speed. However, because of the added undersampling process, the runtime on the Cora dataset is slightly longer than that of the original model, while the runtimes on the Citeseer and PubMed datasets are shorter than those of the original model.
After adding attention mechanisms during the edge generation phase,
  • Compared to the original model, the Cora, Citeseer, and PubMed datasets showed improvements in both AUC-ROC and F1 scores. This demonstrates that attention mechanisms can better consider the relationships and feature interactions between nodes, resulting in more accurate node aggregation on the graph.
  • The running time also decreased, indicating that with the inclusion of attention mechanisms, the model can concentrate on processing key information with higher weights in the input data while ignoring other irrelevant information, thus significantly reducing model parameters and computational complexity.
After simultaneously adding both ENN undersampling and attention mechanisms,
  • The original model only oversampled minority class nodes without considering reducing majority class nodes. The ENN algorithm reduces noisy data by removing unimportant or erroneous samples. Therefore, the samples obtained through SMOTE-ENN sampling are cleaner and cause less interference during model training.
  • The original model used GraphSAGE and node similarity calculations during the edge generation phase, which was slower and did not fully consider the interrelationships between nodes.
Overall, our proposed ESA-GCN model showed improvements in AUC-ROC, F1 scores, and running speed compared to the original GraphSmote model.

4.3. Research on the Influence of Undersampling Neighbor Count

This section analyzes the performance of different algorithms at various undersampling neighbor counts to evaluate the impact of undersampling neighbor counts on model performance. The experiments were conducted using the Cora, Citeseer, and PubMed datasets with well-defined constraints. The imbalance ratio was set to 0.5, the loss weight was set to 0.1, and the undersampling neighbor count K varied as {3, 4, 5}. Each experiment was repeated three times, and the results were averaged.
From the experimental results in Figure 2, the following observations can be made:
In the Cora dataset, the AUC-ROC metric is optimal when K = 4, while the F1-Macro metric is optimal when K = 5. This suggests that increasing the undersampling neighbor count improves model performance on this dataset.
In the Citeseer dataset, K = 3 yields the best performance, while both AUC-ROC and F1-Macro metrics are worse than the original model when K = 5. This indicates that a smaller undersampling neighbor count is more favorable for model performance on this dataset.
In the PubMed dataset, K = 4 yields the optimal AUC-ROC and F1-Macro metrics, but the F1-Macro metric is worse than the original model when K = 5. This suggests that an undersampling neighbor count of 4 is more advantageous for model performance on this dataset.
In general, choosing a too small value for K may result in selecting inappropriate neighbors, thus impacting model performance. On the other hand, choosing a too large value for K would increase the number of nearest neighbors for each sample, leading to an inadequate number of retained minority class samples and compromising undersampling effectiveness while reducing model generalization capacity. Therefore, selecting an appropriate K value in the ENN algorithm requires careful consideration based on specific datasets and application requirements.

4.4. Study on the Impact of Imbalance Rate

In this section, the performance of different algorithms at different imbalance rates is analyzed to evaluate their robustness. The experiments were conducted using the Cora, Citeseer, and PubMed datasets with well-defined constraints. The undersampling neighbor count K was set to 3, the loss weight was set to 0.1, and the imbalance ratios varied as {0.1, 0.2, 0.4, 0.6}. Each experiment was repeated three times, and the results were averaged.
From the experimental results in Figure 3, Figure 4 and Figure 5, it can be concluded that the ESA-GCN model, which incorporates undersampling and attention mechanisms, is more adaptable to different imbalance rates compared to the original GraphSmote model. This indicates that the ESA-GCN model exhibits better robustness under different imbalance rates and is capable of handling imbalanced node classification tasks more effectively.

4.5. Performance Changes after Adjusting Hyperparameters

In this section, we experimentally evaluate the performance changes in different algorithms under different loss weights λ and analyze the results. We set the undersampling neighbor count K to 3, the imbalance ratio to 0.5, and vary the loss weights as {1 × 10−7, 1 × 10−6, 2 × 10−6, 4 × 10−6, 6 × 10−6, 8 × 10−6, 1 × 10−5}. Each experiment is conducted three times, and the results are averaged.
From the experimental results in Figure 6, Figure 7 and Figure 8, a conclusion can be drawn: in the Cora, Citeseer, and PubMed datasets, the improved ESA-GCN model outperforms the original GraphSmote model after adjusting the parameter λ. Specifically, the improved model demonstrates better performance in terms of AUC-ROC and F1-Macro metrics. This indicates that the ESA-GCN, which incorporates undersampling and attention mechanisms, can reduce noise, balance the dataset, and effectively capture key information between nodes. As a result, it exhibits good stability and robustness in handling imbalanced node classification tasks.

5. Conclusions and Future Work

5.1. Conclusions

The problem of class imbalance in graph nodes is widely prevalent in real-world tasks such as fraud detection, fake user detection, malicious software detection, etc. This issue significantly affects the performance of classifiers on these minority classes, but there has been limited research in this area. Therefore, in this work, we investigate the task of imbalanced node classification and propose the ESA-GCN model to address the problem more effectively.
Specifically, in this paper, we address the problem by employing ENN undersampling to balance the original dataset and reduce the interference of noisy samples on model training. Additionally, we enhance the model’s performance and reduce the model parameters and computational complexity by incorporating attention mechanisms to adjust the importance of each node and regenerate edges. Through experiments conducted on three public datasets, we observe significant improvements in terms of AUC-ROC, F1-Macro metrics, and runtime speed compared to the traditional GraphSmote algorithm.

5.2. Future Work

The ESA-GCN model has achieved better performance in handling imbalanced node classification. In future work, the following directions can be further explored:
  • More effective minority class sample selection strategies: Although ENN undersampling can reduce the interference of noisy samples, it may also result in the loss of useful minority class samples. Therefore, it is necessary to explore more refined strategies for selecting minority class samples to better utilize their information.
  • More efficient feature learning methods: Current algorithms mainly utilize self-attention mechanisms for feature learning of minority class samples. In future work, other more efficient or precise feature learning methods can be explored to further improve the model’s performance.
  • More complex graph data classification tasks: Current experiments are primarily focused on node classification tasks. In future work, more complex graph data classification tasks such as graph matching and community detection can be explored, and the improved algorithms can be validated on these tasks.
  • Extension to other application domains: This study primarily conducted experiments on citation networks, but there are many other real-world applications that can be regarded as having imbalanced node classification problems. Therefore, we hope to extend our framework to more application domains, such as cancer diagnosis, malicious code detection, oil spill detection, and improper social account detection, in order to address specific application scenarios.

Author Contributions

Conceptualization, methodology: L.Z. and H.S.; software, visualization, formal analysis, writing—original draft preparation: H.S.; writing—review and editing, supervision, funding acquisition: L.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by the National Natural Science Foundation of China, Grant No. 72374210.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Acknowledgments

We would like to express our sincere gratitude to all the individuals and institutions who have contributed to this paper. We especially appreciate the contributions made by Y.S. and B.S. in the areas of writing, review, and editing. Furthermore, we would like to thank Xinzhu Zheng, Yufa Sun, and Bingbo Shi for their financial support. We are deeply grateful to all the individuals and institutions mentioned above for their support and assistance, which have enabled us to complete this paper.

Conflicts of Interest

The research was conducted without any commercial or financial conflict of interest.

References

  1. Muñoz, M.A.; Villanova, L.; Baatar, D.; Smith-Miles, K. Instance spaces for machine learning classification. Mach. Learn. 2018, 107, 109–147. [Google Scholar] [CrossRef]
  2. Zhao, T.X.; Zhang, X.; Wang, S.H. GraphSMOTE: Imbalanced node classification on graphs with graph neural networks. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, Jerusalem, Israel, 8–12 March 2021; ACM Press: New York, NY, USA, 2021; pp. 833–841. [Google Scholar]
  3. Song, Y.; Wang, Y.; Ye, X.; Wang, D.; Yin, Y.; Wang, Y. Multi-view ensemble learning based on distance-to-model and adaptive clustering for imbalanced credit risk assessment in P2P lending. Inf. Sci. 2020, 525, 182–204. [Google Scholar] [CrossRef]
  4. Wu, L.; Lin, H.; Gao, Z.; Tan, C.; Li, S. Graphmixup: Improving class-imbalanced node classification on graphs by self-supervised context prediction. arXiv 2021, arXiv:2106.11133. [Google Scholar]
  5. Qu, L.; Zhu, H.; Zheng, R.; Shi, Y.; Yin, H. ImGAGN: Imbalanced Network Embedding via Generative Adversarial Graph Networks. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, 14–18 August 2021; ACM Press: New York, NY, USA, 2021. [Google Scholar]
  6. Joonhyung, P.; Jaeyun, S. GraphENS: Neighbor-aware ego network synthesis for class-imbalanced node classification. In Proceedings of the 9th International Conference on Learning Representations, Virtual Event, Austria, 3–7 May 2021. [Google Scholar]
  7. Li, X.; Wen, L.; Deng, Y.; Feng, F.; Hu, X.; Wang, L.; Fan, Z. Graph Neural Network with Curriculum Learning for Imbalanced Node Classification. arXiv 2022, arXiv:2202.02529. [Google Scholar]
  8. Bruna, J.; Zaremba, W.; Szlam, A.; LeCun, Y. Spectral networks and locally connected networks on graphs. arXiv 2013, arXiv:1312.6203. [Google Scholar]
  9. Duvenaud, D.K.; Maclaurin, D.; Iparraguirre, J.; Bombarell, R.; Hirzel, T.; Aspuru-Guzik, A.; Adams, R.P. Convolutional networks on graphs for learning molecular fingerprints. arXiv 2015, arXiv:1509.09292. [Google Scholar]
  10. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
  11. Hamilton, W.L.; Ying, R.; Leskovec, J. Inductive representation learning on large graphs. In Proceedings of the Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Ed.; MIT Press: Cambridge, UK, 2017; pp. 1024–1034. [Google Scholar]
  12. Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903. [Google Scholar]
  13. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. arXiv 2011, arXiv:1106.1813. [Google Scholar] [CrossRef]
  14. Hart, P. The condensed nearest neighbor rule. IEEE Trans. Inf. Theory 1968, 14, 515–516. [Google Scholar] [CrossRef]
  15. Tomek, I. Two modifications of CNN. IEEE Trans. Syst. Man Cybern. 1976, 6, 769–772. [Google Scholar]
  16. Laurikkala, J. Improving Identification of Difficult Small Classes by Balancing Class Distribution. In Artificial Intelligence in Medicine Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2001; Volume 2101, pp. 63–66. [Google Scholar]
  17. Batista, G.E.; Prati, R.C.; Monard, M.C. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor. Newsl. 2004, 6, 20–29. [Google Scholar] [CrossRef]
  18. Sen, P.; Namata, G.; Bilgic, M.; Getoor, L.; Galligher, B.; Eliassi-Rad, T. Collective Classification in Network Data. AI Mag. 2008, 29, 93. [Google Scholar] [CrossRef]
  19. PubMed. Available online: https://www.ncbi.nlm.nih.gov/pubmed/ (accessed on 19 January 2023).
Figure 1. Conceptual framework of the ESA-GCN model.
Figure 2. Changes in AUC-ROC and F1-Macro with varying undersampling neighbor counts.
Figure 3. Changes in AUC-ROC and F1-Macro when adjusting the imbalance ratio on the Cora dataset.
Figure 4. Changes in AUC-ROC and F1-Macro when adjusting the imbalance ratio on the Citeseer dataset.
Figure 5. Changes in AUC-ROC and F1-Macro when adjusting the imbalance ratio on the PubMed dataset.
Figure 6. Changes in AUC-ROC and F1-Macro when adjusting hyperparameters on the Cora dataset.
Figure 7. Changes in AUC-ROC and F1-Macro when adjusting hyperparameters on the Citeseer dataset.
Figure 8. Changes in AUC-ROC and F1-Macro when adjusting hyperparameters on the PubMed dataset.
Table 1. Performance comparison of different methods for classifying imbalanced nodes.

| Model | Cora AUC-ROC | Cora F1-Macro | Citeseer AUC-ROC | Citeseer F1-Macro | PubMed AUC-ROC | PubMed F1-Macro |
|---|---|---|---|---|---|---|
| GCN | 0.8543 | 0.5205 | 0.6334 | 0.4503 | 0.8154 | 0.5501 |
| GraphSAGE | 0.8687 | 0.5376 | 0.707 | 0.4743 | 0.8332 | 0.5721 |
| DR-GCN | 0.8776 | 0.5513 | 0.6102 | 0.3924 | 0.6714 | 0.5628 |
| GraphSmote | 0.8942 | 0.5960 | 0.7193 | 0.3987 | 0.8457 | 0.6881 |
| ESA-GCN | 0.8997 | 0.6392 | 0.7614 | 0.4475 | 0.8746 | 0.7229 |
Table 2. Results of the ablation experiment.

| Model | Cora AUC-ROC | Cora F1-Macro | Cora Time | Citeseer AUC-ROC | Citeseer F1-Macro | Citeseer Time | PubMed AUC-ROC | PubMed F1-Macro | PubMed Time |
|---|---|---|---|---|---|---|---|---|---|
| GraphSmote | 0.8942 | 0.5960 | 532.7449 s | 0.7193 | 0.3987 | 830.3261 s | 0.8457 | 0.6881 | 14,938.8945 s |
| GraphSmote + ENN undersampling | 0.9144 | 0.6657 | 574.32 s | 0.7407 | 0.4399 | 705.73 s | 0.8729 | 0.7209 | 14,763.00 s |
| GraphSmote + Attention mechanism | 0.8991 | 0.6465 | 220.66 s | 0.7525 | 0.41 | 379.96 s | 0.8647 | 0.7066 | 8834.59 s |
| ESA-GCN | 0.8997 | 0.6392 | 223.30 s | 0.7614 | 0.4475 | 378.08 s | 0.8746 | 0.7229 | 8973.52 s |

