**Citation:** Duan, Y.; Liu, X.; Jatowt, A.; Yu, H.-t.; Lynden, S.; Kim, K.-S.; Matono, A. SORAG: Synthetic Data Over-Sampling Strategy on Multi-Label Graphs. *Remote Sens.* **2022**, *14*, 4479. https://doi.org/10.3390/rs14184479

Academic Editors: Jungho Im and Gwanggil Jeon

Received: 26 June 2022; Accepted: 26 August 2022; Published: 8 September 2022

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

**1. Introduction**

Graphs are becoming ubiquitous across a large spectrum of real-world applications in the form of social networks, citation networks, telecommunication networks, biological networks, etc. [1]. In addition, numerous multimedia applications, such as video surveillance, video streaming, healthcare systems, and intelligent indoor security systems, rely on graphs as research objects [2–4]. For a considerable number of real-world graph node classification tasks, the training data follow a *long-tail* distribution, and node classes are *imbalanced*: each of a few "majority" classes has a large number of samples, while most classes contain only a handful of instances. Taking the NCI chemical compound graph as an example, only approximately 5% of molecules are labeled as active in the anticancer bioassay test [5]. Graph node classification tasks are often further complicated by the fact that, in many real-world network data, graph nodes can be associated with multiple labels. Many social media sites, such as BlogCatalog, Flickr, and YouTube, allow users to apply a diverse set of labels representing their various interests. A person can join several interest groups on Flickr, such as *Landscape* and *Travel*, and follow different video genres on YouTube, such as *Cooking* and *Wrestling*. Furthermore, many networks are characterized by an imbalanced label distribution and multi-label nodes at the same time, as shown in Figure 1.

**Figure 1.** Label distribution of three real-world multi-label network datasets: (**a**) BlogCatalog3 [6], (**b**) Flickr [6], and (**c**) YouTube [7]. The horizontal axis represents the label ID, and the vertical axis represents the number of nodes containing each label. It can be clearly observed that these label distributions are all highly imbalanced: a few classes contain many more nodes than the rest of the classes.

To date, a large body of work has focused on graph representation learning (GRL) with balanced node classes and single labels [8–12]. However, these models do not perform well when graphs exhibit the aforementioned characteristics of being imbalanced and multi-label, for the following reasons. (1) *The problem caused by the imbalanced setting*: Imbalanced data make the classifier overfit the majority classes, so the features of the minority classes cannot be sufficiently learned [13]. Furthermore, this problem is aggravated by the *topological interplay* effect [5] between graph nodes, which causes feature propagation to be dominated by the majority classes. (2) *The problem caused by the multi-label setting*: Multi-label graph architectures typically encode very complex interactions between nodes with shared labels [5], which are challenging to capture. Therefore, it is essential to develop a graph learning method specifically for class-imbalanced multi-label graph data. However, research in this direction is still in its infancy. Thus, in this study, we propose *imbalanced multi-label GRL* to address this challenge while also contributing to graph learning theory.

For imbalanced data, *minority over-sampling* is an effective measure to improve classification accuracy [14–16]. This strategy has recently been confirmed to be effective for graph data as well [17]. Traditional over-sampling techniques consist of a two-step process: (1) selecting some minority instances as "seed examples"; (2) generating synthetic data with features and labels similar to the seed examples, which are then added to the training data. For example, the most popular over-sampling technique, the synthetic minority over-sampling technique (SMOTE) [14], generates minority samples by interpolating between randomly selected minority instances and their nearest neighbors. Cost-sensitive learning is another effective approach for alleviating the problem of imbalanced data in classification [16]. Its basic assumption is that the costs resulting from different types of misclassification vary significantly (e.g., the cost of treating an intruder as a non-intruder is much greater than that of treating a non-intruder as an intruder). The principle of applying cost-sensitive learning to imbalanced learning problems is to assign a larger penalty cost to misclassified minority class samples [18]. Existing cost-sensitive classification algorithms can generally be grouped into three categories [19]: algorithms that (1) pre-process the training data, (2) post-process the output, and (3) apply direct cost-sensitive learning methods. Data pre-processing aims to make the classification results on the new training set equivalent to cost-sensitive classification decisions on the original training set, typically along the lines of sampling [18] and weighting [16]. Post-processing the output biases the classifier toward minority classes by adjusting the classifier's decision threshold, as represented by MetaCost [20] and ETA [21]. Direct cost-sensitive learning methods embed the cost information into the objective function of the learning algorithm to obtain the minimal expected cost, as in cost-sensitive decision trees [22] and cost-sensitive SVMs [23].
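The SMOTE-style interpolation step can be sketched as follows. This is a minimal illustration of the idea, not the reference implementation of [14]; the function name and parameters are ours:

```python
import numpy as np

def smote_interpolate(minority, k=5, n_new=10, rng=None):
    """SMOTE-style synthesis: interpolate between a randomly selected
    minority sample and one of its k nearest minority-class neighbors.
    `minority` is an (n, d) array of minority-class feature vectors."""
    rng = np.random.default_rng(rng)
    n = len(minority)
    # pairwise Euclidean distances within the minority class
    dist = np.linalg.norm(minority[:, None] - minority[None, :], axis=-1)
    np.fill_diagonal(dist, np.inf)            # exclude self-matches
    nn = np.argsort(dist, axis=1)[:, :k]      # k nearest neighbors per sample
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                   # random seed example
        j = nn[i, rng.integers(k)]            # one of its k neighbors
        lam = rng.random()                    # interpolation factor in [0, 1)
        synthetic.append(minority[i] + lam * (minority[j] - minority[i]))
    return np.stack(synthetic)
```

Because each synthetic point is a convex combination of two real minority samples, it always lies on the line segment between them, which is what makes plain SMOTE ill-suited to graphs whose minority nodes are local outliers.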
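As a minimal illustration of the weighting idea, per-class penalties can be set inversely proportional to class frequency and folded into the loss, so that misclassified minority samples incur a larger cost. This generic sketch is our own, not the specific scheme of [16] or [18]:

```python
import numpy as np

def inverse_frequency_weights(labels, n_classes):
    """Per-class penalty weights inversely proportional to class frequency."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    # balanced weighting: total samples / (number of classes * class count)
    return counts.sum() / (n_classes * np.maximum(counts, 1))

def weighted_cross_entropy(probs, labels, weights):
    """Cost-sensitive cross-entropy: each sample's loss is scaled by the
    penalty weight of its true class."""
    n = len(labels)
    true_probs = probs[np.arange(n), labels]
    return float(np.mean(weights[labels] * -np.log(true_probs + 1e-12)))
```

With labels `[0, 0, 0, 1]`, for example, class 1 receives three times the weight of class 0, so each misclassified minority sample contributes three times as much to the loss.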

However, mainstream over-sampling techniques have significant shortcomings when applied to graph data: the selection of seed examples prioritizes global minority nodes while ignoring local minority nodes, and each synthetic instance is always assigned a label based on some specific strategy, which may be incorrect. This is because, in contrast to non-graph data, the relationships between graph nodes are explicitly expressed by the edges connecting them, meaning that the representation learned for a node can depend heavily on its neighboring *unlabeled* nodes through the feature propagation mechanism inherent to graphs.

Motivated by the above observations, we propose and test the following hypothesis: in addition to synthetic minority samples, synthetic *unlabeled* samples can also facilitate the debiasing of graph neural networks (GNNs) trained on an imbalanced training set. In particular, near global minority samples that are a local majority, we can "safely" produce virtual samples of the same class and add them to the training set to balance the class distribution. Global minority samples that are also a local minority are more likely to be local outliers, and thus, they are risky to select as seed examples for further over-sampling. Near global minority samples whose neighbors are class-balanced, it is difficult to determine the labels of virtual samples; thus, the production of *unlabeled* virtual nodes should be encouraged, which can help minorities by "blocking" the over-aggregation of majority features delivered through edges. This idea is illustrated in Figure 2. **We argue that the key to over-sampling on an imbalanced multi-label graph is to flexibly combine the synthesis of both labeled and unlabeled instances enriched by label correlations.**
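The three cases above can be sketched as a simple neighborhood test on a global-minority node. The thresholds and function below are hypothetical and only illustrate the decision logic, not SORAG's actual criterion:

```python
import numpy as np

# Hypothetical thresholds on the minority fraction of a node's neighborhood
# (illustrative values, not taken from the paper).
SAFE, BALANCED = 0.7, 0.3

def seed_decision(node, adj, labels, minority_class):
    """Classify a global-minority node by its local neighborhood:
    'labeled'   -> local majority: safe to synthesize same-class nodes (A1)
    'unlabeled' -> locally balanced: synthesize unlabeled nodes (A2)
    'skip'      -> local minority, likely outlier: no over-sampling (A3)"""
    neighbors = np.flatnonzero(adj[node])
    if len(neighbors) == 0:
        return "skip"
    ratio = np.mean(labels[neighbors] == minority_class)
    if ratio >= SAFE:
        return "labeled"
    if ratio >= BALANCED:
        return "unlabeled"
    return "skip"
```

For a single-label graph this reproduces the A1/A2/A3 regions of Figure 2; in the multi-label setting the ratio would have to be computed per label.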

**Figure 2.** A comparison between our method and the current state-of-the-art graph over-sampling method **GraphSMOTE** [17]. GraphSMOTE generates new minority instances near randomly selected minority nodes and creates virtual edges (dotted lines in the figure) between those synthetic nodes and real nodes. Instead, we synthesize minority instances in safe areas (i.e., A1), generate *unlabeled* instances in locally balanced areas (i.e., A2), and do not conduct over-sampling near minority nodes that are outliers (i.e., A3). For simplicity of illustration, only a single-label scenario is shown.

We extend the existing over-sampling algorithms to a novel framework for the imbalanced multi-label graph node classification task based on the above considerations. We extend the classic global minority-based seed example selection to the local minority perspective (see Section 3.2). Distinct from interpolation, which is commonly used in mainstream over-sampling techniques [24], we use a generative adversarial network (GAN) [25] to generate new instances. As a representative deep generative model, a GAN can capture label correlation information by estimating the probability distribution of seed examples [26]. We propose an ensemble architecture of a GAN and conditional GAN (CGAN) [27] for the flexible generation of both unlabeled and labeled synthetics (see Section 3.3). To make use of the graph topology information, we propose a method to obtain new edges between the generated samples and existing data with an edge predictor (see Section 3.4). The augmented graph is finally sent to a graph convolutional network (GCN) [12] for representation learning, together with the learned label correlations (see Section 3.5). We name our proposed framework **SORAG**, abbreviated from **S**ynthetic data **O**versampling St**RA**tegy on **G**raph. In summary, our contribution is three-fold:
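As a toy illustration of the edge-prediction step, each (synthetic, real) node pair can be scored by the sigmoid of the inner product of their embeddings, keeping pairs whose score exceeds a threshold. The embeddings and threshold here are illustrative only; the predictor in Section 3.4 is trained on the observed graph:

```python
import numpy as np

def predict_edges(z_new, z_real, threshold=0.5):
    """Score each (synthetic, real) node pair by sigmoid(<z_u, z_v>) and
    return the pairs whose score exceeds `threshold` as candidate edges.
    z_new: (n_new, d) embeddings of synthetic nodes;
    z_real: (n_real, d) embeddings of existing nodes."""
    scores = 1.0 / (1.0 + np.exp(-z_new @ z_real.T))   # (n_new, n_real)
    edges = [(i, j)
             for i in range(scores.shape[0])
             for j in range(scores.shape[1])
             if scores[i, j] > threshold]
    return scores, edges
```

The accepted edges are then added to the adjacency matrix, so the synthetic nodes participate in GCN feature propagation like any real node.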

