### *2.2. Imbalanced Learning*

A dataset is called imbalanced when its classes are of markedly unequal size, i.e., a few classes contain many more samples than the others. Representative techniques for handling imbalanced datasets include sampling methods, ensemble algorithms, and cost-sensitive approaches.

Re-sampling the original dataset is a strategy for balancing the majority and minority classes at the data level. Methods of this type construct a well-balanced training set by over-sampling the minority class or under-sampling the majority class during preprocessing. Any learning algorithm can then be trained on the new dataset, reducing the system's bias towards the majority classes. A typical example of sampling models is SMOTE [14]. To shift the learning bias toward the minority class, it generates synthetic minority samples by interpolating between existing minority samples and their nearest neighbors. Numerous extensions of the regular SMOTE algorithm have been proposed, using various distance measures or selection criteria for the seed samples; representative methods among them include borderline SMOTE [41], safe-level SMOTE [42], and density-based SMOTE [43].
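To make the interpolation step concrete, the following is a minimal NumPy sketch of the vanilla SMOTE idea; the function name, parameters, and implementation details are our own illustration, not the reference implementation of [14]:

```python
import numpy as np

def smote_oversample(X_min, n_synthetic, k=5, seed=None):
    """Generate synthetic minority samples by interpolating between
    each seed sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    n, d = X_min.shape
    k = min(k, n - 1)  # cannot have more neighbors than other samples
    # pairwise Euclidean distances within the minority class
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)            # exclude each sample itself
    nn = np.argsort(dist, axis=1)[:, :k]      # k nearest neighbors per sample
    synthetic = np.empty((n_synthetic, d))
    for i in range(n_synthetic):
        s = rng.integers(n)                   # random minority seed sample
        t = nn[s, rng.integers(k)]            # one of its nearest neighbors
        lam = rng.random()                    # interpolation factor in [0, 1)
        synthetic[i] = X_min[s] + lam * (X_min[t] - X_min[s])
    return synthetic
```

The extensions above differ mainly in how the seed samples and neighbors are chosen (e.g., restricting seeds to borderline or "safe" regions), not in this basic interpolation scheme.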

Over-sampling and under-sampling are essentially data pre-processing steps and are therefore not tied to any particular modeling algorithm. Ensemble classifiers and cost-sensitive approaches, which are algorithm-specific, can also address data imbalance as algorithm-level enhancements. The former include diverse hybrid sampling/boosting algorithms, such as SMOTEBoost [44], random under-sampling boost (RUSBoost) [45], and the balance cascade approach [46]. Besides boosting algorithms, other ensemble methods such as balanced random forest [47] can also be applied to imbalanced datasets. Cost-sensitive approaches, which penalize the misclassification of minority-class samples more severely, have also been reported to be effective in addressing class imbalance. Two popular examples are AdaCost [48] and weighted random forest [47].
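As a concrete illustration of the cost-sensitive idea (a hedged sketch in the same spirit as weighted random forest [47], not a reproduction of that algorithm), scikit-learn's random forest can reweight classes inversely to their training frequency, so that misclassifying a minority sample incurs a larger loss:

```python
from sklearn.ensemble import RandomForestClassifier

# Cost-sensitive ensemble sketch: "balanced" sets each class weight to
# n_samples / (n_classes * class_count), penalizing minority errors more.
clf = RandomForestClassifier(
    n_estimators=200,
    class_weight="balanced",
    random_state=0,
)
# clf.fit(X_train, y_train)  # X_train, y_train: an imbalanced training set
```

Class reweighting is only one way to encode misclassification costs; AdaCost [48], for instance, instead injects a cost-adjustment function into the boosting weight-update rule.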

### *2.3. Connections to Our Work*

The focus of this study is to investigate solutions for semi-supervised GRL and graph node classification on imbalanced multi-label graphs. The most closely related works can be found in the emerging area of imbalanced graph node classification [5,17,49]. Among these works, the dual-regularized graph convolutional network (DR-GCN) [5] relies on a class-conditioned adversarial training process to facilitate the separation of labeled nodes and the identification of minority-class nodes. GraphSMOTE [17] transfers the classical SMOTE method [14] for imbalanced data to graph data. In addition, the relaxed GCN network (RECT) [49] has reported the best performance on imbalanced graph node classification tasks; its core idea is the design and optimization of a class-semantic-related objective function. Unlike GraphSMOTE, which generates labeled minority nodes, we present the first graph node over-sampling model that utilizes synthetic *unlabeled* nodes to inhibit the tendency of GNNs to overfit to the majority classes under the topological effect. The new supervision signal from labeled synthetic nodes and the blocking of over-propagated majority features by unlabeled synthetic nodes together facilitate balanced learning across classes, exploiting the strong topological interdependence between nodes on a graph. We specify the details of the proposed model in the next section.
