*3.2. Imbalance Measurement*

In multi-label learning, a commonly used measure of the global imbalance of a particular label is *IRLbl*. Let |*Cj*| be the number of instances whose *j*-th label value is 1; *IRLbl* is then defined as follows:

$$IRLbl\_{j} = \frac{\max\{|C\_1|, |C\_2|, \dots, |C\_m|\}}{|C\_j|} \,. \tag{1}$$

Therefore, the larger the value of *IRLbl* for a label, the stronger its minority character. For a node *vi*, its GMD is defined as follows:

$$GMD\_i = \frac{\sum\_{j=1}^{m} IRLbl\_j \cdot \left[B\_{ij} = 1\right]}{\sum\_{j=1}^{m} \left[B\_{ij} = 1\right]},\tag{2}$$

where [*Bij* = 1] means *vi* has the *j*-th label, and the denominator $\sum\_{j=1}^{m} [B\_{ij} = 1]$ counts the number of labels that *vi* has.
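As a minimal sketch (NumPy; the toy label matrix below is hypothetical), Eqs. (1)–(2) can be computed directly from a binary label matrix *B* ∈ {0, 1}^(*n*×*m*):

```python
import numpy as np

# Toy binary label matrix: B[i, j] = 1 iff node v_i has label c_j.
B = np.array([[1, 0, 1],
              [1, 0, 0],
              [1, 1, 0],
              [0, 0, 1]])

label_counts = B.sum(axis=0)               # |C_j| for each label
irlbl = label_counts.max() / label_counts  # Eq. (1)

# Eq. (2): average IRLbl over the labels each node carries.
gmd = (B * irlbl).sum(axis=1) / B.sum(axis=1)
```

A node carrying only rare labels thus receives a large GMD, marking it as a global minority node.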

The LMD of a node can be measured by the proportion of opposite class values in its local neighborhood. For *vi*, let $N\_i^k$ denote the set of its *k*-hop neighbors. Then, for label *cj*, the proportion of neighbors whose class is opposite to that of *vi* is computed as

$$S\_{ij} = \frac{\sum\_{v\_m \in N\_i^k} [B\_{ij} \neq B\_{mj}]}{|N\_i^k|} \,, \tag{3}$$

where *S* ∈ R*n*×*m* is a matrix that stores the local imbalance of all nodes for each label. Given *S*, a straightforward way to compute the LMD of *vi* is to average *Sij* over the labels for which *vi* belongs to the minority class:

$$LMD\_i = \frac{\sum\_{j=1}^{m} S\_{ij}[B\_{ij} = g\_j]}{m} \,, \tag{4}$$

where *gj* ∈ {0, 1} denotes the minority class of the *j*-th label: if |*cj*| ≥ 0.5 · *n*, then *gj* = 0; otherwise, *gj* = 1. Here, *n* is the total number of vertices. Further, we group the global minority nodes into different types based on the LMD; each type poses a different level of difficulty for the classifier to identify correctly. Following [50,51], we discretize the range [0, 1] of *LMDi* into four node types, namely safe (SF), borderline (BD), rare (RR), and outlier (OT), according to their local imbalance.
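Eqs. (3)–(4) can be sketched as follows (a minimal NumPy example; the toy label matrix and *k*-hop neighbor lists are hypothetical stand-ins for quantities computed on the real graph):

```python
import numpy as np

# Toy setup: 4 nodes, 2 labels; B[i, j] = 1 iff v_i has label c_j.
B = np.array([[1, 0],
              [1, 1],
              [1, 0],
              [0, 0]])
n, m = B.shape
# Hypothetical k-hop neighbor lists N_i^k (precomputed elsewhere).
neighbors = [[1, 2], [0, 2, 3], [0, 1], [1]]

# Eq. (3): per label, the fraction of k-hop neighbors with the opposite class.
S = np.array([(B[nbrs] != B[i]).mean(axis=0)
              for i, nbrs in enumerate(neighbors)])

# Minority class g_j of label c_j: 1 iff fewer than half of the nodes carry it.
g = (B.sum(axis=0) < 0.5 * n).astype(int)

# Eq. (4): average local imbalance over labels where v_i is in the minority.
lmd = (S * (B == g)).sum(axis=1) / m
```

Nodes whose LMD falls near 0 are surrounded mostly by same-class neighbors (SF), while values near 1 indicate RR/OT nodes isolated among the opposite class.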


Based on the above categories, we can confidently generate new virtual samples for the global minority samples belonging to SF by imitating their features and labels, thereby balancing the distorted class distribution. The global minority samples belonging to BD lie on the decision boundary; hence, it is difficult to determine labels for virtual samples that resemble them. We therefore keep such new samples unlabeled and use them to weaken the over-propagation of majority-class features by exploiting the feature-smoothing mechanism on the graph. The global minority samples belonging to RR/OT are more likely to be outliers and should not be selected as seeds for generating new samples.

Furthermore, for *vi*, we define two metrics, the labeled seed probability (LSP) and the unlabeled seed probability (USP), to describe the probability of *vi* being selected as a seed example for generating labeled and unlabeled synthetic nodes, respectively. The seed probabilities (LSP/USP) are calculated as follows:

$$SP\_i = GMD\_i \cdot LMD\_i = \begin{cases} \displaystyle LSP\_i, & \text{if } v\_i \in \text{SF} \\ \displaystyle USP\_i, & \text{if } v\_i \in \text{BD} \end{cases} \tag{5}$$

We compute the LSP and USP scores for all nodes and sort them in descending order. The top-ranked nodes, whose number is controlled by the seed example rate hyper-parameter *ρ*, are selected as seed examples. All GMD and LMD scores are first min-max normalized to improve computational stability.
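The selection step above can be sketched as follows (a minimal example; the score arrays, SF/BD masks, and the value of *ρ* are hypothetical placeholders for quantities produced by the preceding steps):

```python
import numpy as np

def minmax(x):
    """Min-max normalize scores to [0, 1] for numerical stability."""
    return (x - x.min()) / (x.max() - x.min() + 1e-12)

# Hypothetical precomputed scores and node-type masks for 5 nodes.
gmd = np.array([2.0, 1.0, 3.0, 1.5, 2.5])
lmd = np.array([0.2, 0.1, 0.3, 0.9, 0.25])
is_sf = np.array([True, True, True, False, True])
is_bd = np.array([False, False, False, True, False])

sp = minmax(gmd) * minmax(lmd)   # Eq. (5)
rho = 0.4                        # seed example rate (hyper-parameter)
k = int(rho * len(sp))

# Rank SF nodes by LSP and BD nodes by USP; keep the top-rho fraction of each.
lsp_seeds = sorted(np.flatnonzero(is_sf), key=lambda i: -sp[i])[:k]
usp_seeds = sorted(np.flatnonzero(is_bd), key=lambda i: -sp[i])[:k]
```

Normalizing GMD and LMD before taking their product keeps either factor from dominating the ranking when the two scores live on very different scales.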
