### *5.5. Parameter Tuning*

On the validation set of each dataset, we used a grid search to tune the parameters in the following order: learning rate (range of {0.001, 0.005, 0.01, 0.05, 0.1}) → weight decay (range of {10⁻⁵, 5 × 10⁻⁵, 10⁻⁴, 5 × 10⁻⁴, 10⁻³}) → dropout rate (range of {0.1–0.9}, step size 0.1) → *k* (range of {1, 2, 3}) → *ρ* (range of {0.1–0.9}, step size 0.1) → *α* (range of {0.1–0.9}, step size 0.1) → *β* (range of {0.1–0.9}, step size 0.1) → *λ* (range of {0.1, 0.5, 1, 5, 10}) → *μ* (range of {0.1, 0.5, 1, 5, 10}). Among these, the last six parameters are specific to the **SORAG** family. When tuning the parameters, the sampling ratios for the BLOGCATALOG3, FLICKR, and YOUTUBE networks were set to 10%, 1%, and 1%, respectively.
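The search is sequential (coordinate-wise) rather than exhaustive: each parameter is tuned while the previously chosen values are held fixed and the remaining parameters stay at their starting values. The sketch below illustrates this procedure; `evaluate_on_validation` and the starting values in `defaults` are placeholders for illustration, not part of our released code.

```python
# Sequential grid search following the tuning order described above.
# `evaluate_on_validation(params)` is a hypothetical helper that trains the model
# with the given hyper-parameters and returns the validation Micro-F1.

search_space = {
    "learning_rate": [0.001, 0.005, 0.01, 0.05, 0.1],
    "weight_decay":  [1e-5, 5e-5, 1e-4, 5e-4, 1e-3],
    "dropout":       [round(0.1 * i, 1) for i in range(1, 10)],  # 0.1 ... 0.9
    "k":             [1, 2, 3],
    "rho":           [round(0.1 * i, 1) for i in range(1, 10)],
    "alpha":         [round(0.1 * i, 1) for i in range(1, 10)],
    "beta":          [round(0.1 * i, 1) for i in range(1, 10)],
    "lam":           [0.1, 0.5, 1, 5, 10],
    "mu":            [0.1, 0.5, 1, 5, 10],
}

def tune(defaults, evaluate_on_validation):
    """Tune one parameter at a time, keeping earlier choices fixed.

    `defaults` must provide a starting value for every parameter name
    in `search_space`.
    """
    best = dict(defaults)
    for name, candidates in search_space.items():
        scores = {}
        for value in candidates:
            trial = {**best, name: value}
            scores[value] = evaluate_on_validation(trial)  # validation Micro-F1
        best[name] = max(scores, key=scores.get)
    return best
```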

Figure 7 shows how the performance of **SORAG***F* on the validation set of each dataset varies with the values of the key parameters. We first note that the three generic parameters (learning rate, weight decay, and dropout rate) have a significant effect on classification performance and should therefore be set carefully. For *k*, *ρ*, and *μ*, we observed that the optimal values were consistent across all three datasets. The experimental results suggest that the local minority of a node is best computed from its two-hop neighbors (i.e., *k* = 2). We assume that this is because one-hop information leads to the omission of valid neighbors, while three-hop information introduces noise from irrelevant nodes. Moreover, in the objective function, L*edge* carries the same weight as L*nc*, which indicates that the generation of virtual edges is of considerable importance. On the two larger datasets (FLICKR and YOUTUBE), the synthesis of virtual nodes and that of virtual edges are shown to have equal weights, whereas the construction of virtual nodes has a smaller weight on the BLOGCATALOG3 network.
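To make the weighting concrete, the snippet below shows one plausible reading of how *λ* and *μ* enter the objective: an additive combination in which *μ* scales the virtual-edge loss and *λ* the virtual-node loss. This is an illustrative assumption consistent with the observations above, not the exact formula of **SORAG**.

```python
def total_loss(loss_nc, loss_node, loss_edge, lam=1.0, mu=1.0):
    """Assumed additive objective: L = L_nc + lam * L_node + mu * L_edge.

    Under this reading, mu = 1 keeps virtual-edge generation as important as
    node classification, while a smaller lam (as tuned on BLOGCATALOG3)
    down-weights virtual-node synthesis.
    """
    return loss_nc + lam * loss_node + mu * loss_edge
```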

By contrast, the optimal values of *α* and *β* are close to 1 under almost all conditions (the only exception is *α* = 0.5 on FLICKR). This observation verifies our expectation that *P*<sub>BD</sub>(*x*) ≈ *P*<sub>u</sub>(*x*) and *P*<sub>SF</sub>(*x*, *y*) ≈ *P*<sub>l</sub>(*x*, *y*). By introducing these two parameters, we can regulate the distribution of the synthetic nodes in a more flexible manner.
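One way to read the role of these parameters (an illustrative assumption on our part, not the exact definition in the model) is as mixture weights that pull the two generator distributions toward the real unlabeled and labeled data distributions, with hypothetical remainder distributions $Q_u$ and $Q_l$:

$$
P_{BD}(x) = \alpha\, P_u(x) + (1-\alpha)\, Q_u(x), \qquad
P_{SF}(x, y) = \beta\, P_l(x, y) + (1-\beta)\, Q_l(x, y).
$$

Under this reading, $\alpha, \beta \to 1$ indeed yields $P_{BD}(x) \approx P_u(x)$ and $P_{SF}(x, y) \approx P_l(x, y)$, while smaller values shift probability mass toward the auxiliary distributions and thus diversify the synthetic nodes.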

**Figure 7.** The effect of key parameters on the performance of **SORAG***F* for each dataset: (**a**) learning rate; (**b**) weight decay; (**c**) dropout rate; (**d**) *k*; (**e**) *ρ*; (**f**) *α*; (**g**) *β*; (**h**) *λ*; (**i**) *μ*.

### *5.6. Validation of Key Procedures in Training SORAG*

In this section, we answer the following research question: Do pre-training and fine-tuning the network components (i.e., two node generators and one edge generator) improve the model performance (see Algorithm 1)? For all datasets, we tested the following variants: {A, only pre-training the unlabeled node generator without fine-tuning; B, only pre-training the labeled node generator without fine-tuning; C, only pre-training the edge generator without fine-tuning; D, only training the unlabeled node generator jointly with the other components without pre-training; E, only training the labeled node generator jointly with the other components without pre-training; F, only training the edge generator jointly with the other components without pre-training; and G, the full model}. Naturally, {G−A, G−B, G−C} describe the effect of fine-tuning each of the corresponding components on performance, whereas {G−D, G−E, G−F} describe the performance difference between pre-training each component and not pre-training it. The results are presented in Table 5. The test method used was **SORAG***F*, and all the experimental settings and parameters were the same as those in Section 5.2.
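For clarity, the sketch below outlines the full strategy G as two stages: each component is first pre-trained on its own objective, and all components are then fine-tuned jointly. The modules and loss functions are simplified stand-ins (plain linear layers with unspecified losses), not the actual SORAG architecture or the exact steps of Algorithm 1.

```python
import torch
import torch.nn as nn

# Simplified stand-ins for the two node generators, the edge generator, and the
# classifier; the dimensions below are arbitrary placeholders for illustration.
feat_dim, hid_dim, n_class = 64, 32, 6
unlabeled_gen = nn.Sequential(nn.Linear(hid_dim, feat_dim))
labeled_gen   = nn.Sequential(nn.Linear(hid_dim + n_class, feat_dim))
edge_gen      = nn.Sequential(nn.Linear(2 * feat_dim, 1))
classifier    = nn.Sequential(nn.Linear(feat_dim, n_class))

def pretrain(module, loss_fn, data, epochs=100, lr=1e-2):
    """Stage 1: train one component in isolation on its own objective."""
    opt = torch.optim.Adam(module.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(module, data)  # hypothetical per-component loss
        loss.backward()
        opt.step()
    return module

def finetune(modules, loss_fn, data, epochs=100, lr=1e-3):
    """Stage 2: jointly update all pre-trained components on the full objective."""
    params = [p for m in modules for p in m.parameters()]
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(modules, data)  # hypothetical combined loss
        loss.backward()
        opt.step()
    return modules
```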


**Table 5.** Performance comparisons of **SORAG***F* with different training procedures.

As presented in Table 5, we found that for all datasets, pre-training **SORAG***F* improved Micro-F1. Macro-F1 was also improved in most cases, with the exception of pre-training the labeled generator on the BLOGCATALOG3 dataset, which slightly reduced Macro-F1. The average Micro-F1 and Macro-F1 improvements were 5.9.

Therefore, we concluded that training strategy G, which combined pre-training and fine-tuning of key components, was the best practice.

### *5.7. Case Study: Performance of SORAG on a Geographic Knowledge Graph*

In this section, we examine the performance of **SORAG** on a standard geographic knowledge graph named US-AIRPORT as an example of the application of our approach in the field of remote sensing, as geographic knowledge graphs are reported to play an increasing role in the computation, analysis, and visualization of large-scale remote sensing data [55]. US-AIRPORT is the dataset used in struc2vec [8] (https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html (accessed on 25 June 2022)), where nodes denote airports and labels correspond to activity levels. One-hot encodings were used as features. We randomly selected 20% of all nodes as the test set and 80% as the training set, and we report the average results over 10 separate runs. Table 6 shows the performance of the analyzed methods in terms of Micro-F1 and Macro-F1. As shown, the performance gain of **SORAG***F* over the state-of-the-art method GraphSMOTE is considerable, which demonstrates that our data over-sampling strategy mitigates the drawbacks of the imbalanced node distribution.
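The snippet below sketches how such an evaluation can be set up, assuming the `Airports` dataset shipped with PyTorch Geometric (with `name="USA"`) corresponds to the US-AIRPORT graph used here; the root directory and the split construction are illustrative.

```python
import torch
from torch_geometric.datasets import Airports

# US airports graph from the struc2vec paper: nodes are airports,
# labels correspond to activity levels.
dataset = Airports(root="data/airports", name="USA")
data = dataset[0]

# Random 80% / 20% train / test split; in our experiments this is repeated
# over 10 independent runs and the results are averaged.
perm = torch.randperm(data.num_nodes)
train_idx = perm[: int(0.8 * data.num_nodes)]
test_idx = perm[int(0.8 * data.num_nodes):]
```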


**Table 6.** Comparison in terms of Micro-F1 and Macro-F1 on US-AIRPORT. The best results are boldfaced.
