*4.3. Evaluation Metrics*

Following many previous studies on multi-label GRL [5,30,31], we adopt Micro-F1 and Macro-F1 to evaluate node classification performance; they are defined as follows:

$$\text{Micro-F1} = \frac{\sum_{i=1}^{m} 2 \cdot TP^i}{\sum_{i=1}^{m} \left(2 \cdot TP^i + FP^i + FN^i\right)}\tag{20}$$

$$\text{Macro-F1} = \frac{1}{m} \sum_{i=1}^{m} \frac{2 \cdot TP^i}{2 \cdot TP^i + FP^i + FN^i} \tag{21}$$

where *TP<sup>i</sup>*, *FP<sup>i</sup>*, and *FN<sup>i</sup>* denote the numbers of true positives, false positives, and false negatives for the *i*-th label, respectively, and *m* denotes the total number of labels. Micro-F1 computes the F1-score over the aggregated contributions of all labels, whereas Macro-F1 is the arithmetic mean of the label-wise F1-scores; unlike Micro-F1, Macro-F1 therefore does not account for label frequency. Both metrics combine precision and recall, take values in [0, 1], and larger values indicate stronger model performance.
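To make Eqs. (20) and (21) concrete, the following minimal sketch computes both metrics from binary label-indicator matrices and cross-checks the result against scikit-learn's `f1_score`; the helper name `micro_macro_f1` and the toy matrices are illustrative assumptions, not part of the evaluated pipeline.

```python
import numpy as np
from sklearn.metrics import f1_score

def micro_macro_f1(y_true, y_pred):
    """Compute Micro-F1 (Eq. 20) and Macro-F1 (Eq. 21) from binary
    multi-label indicator matrices of shape (num_nodes, num_labels)."""
    tp = np.sum((y_true == 1) & (y_pred == 1), axis=0)  # per-label true positives
    fp = np.sum((y_true == 0) & (y_pred == 1), axis=0)  # per-label false positives
    fn = np.sum((y_true == 1) & (y_pred == 0), axis=0)  # per-label false negatives

    # Eq. (20): pool the counts over all labels, then take a single F1.
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())

    # Eq. (21): F1 per label, then the unweighted mean over labels.
    per_label = 2 * tp / np.maximum(2 * tp + fp + fn, 1)  # guard against empty labels
    macro = per_label.mean()
    return micro, macro

# Toy example: 5 nodes, 3 labels (multi-label indicator matrices).
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [1, 0, 0],
                   [0, 0, 1],
                   [1, 1, 0]])

micro, macro = micro_macro_f1(y_true, y_pred)
print(micro, macro)  # ~0.714, ~0.667

# Cross-check against scikit-learn's implementation.
print(f1_score(y_true, y_pred, average="micro"),
      f1_score(y_true, y_pred, average="macro"))
```

Note that the denominator guard for labels with no positive instances or predictions is an implementation assumption here; the paper's formulas do not specify how such labels are handled.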
