*5.3. Influence of Over-Sampling Rate*

In this section, we explore how the performance of **SORAG** varies with the over-sampling rate. We varied the number of synthetic unlabeled nodes and the number of synthetic labeled nodes from 10% to 100% of the training-set size, in steps of 10%, on each dataset and recorded the resulting performance of **SORAG** (see Figure 5). The sampling ratios for the BLOGCATALOG3, FLICKR, and YOUTUBE networks were set to 10%, 1%, and 1%, respectively, and all other parameters were the same as those in Section 5.2.
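The sweep described above can be sketched as a simple grid search over candidate rates. The helper below is illustrative only: `train_and_evaluate` is a hypothetical stand-in for training SORAG with a given number of synthetic nodes and returning a score (e.g., Micro-F1), and `toy_eval` is a toy evaluator that merely fluctuates with the rate.

```python
def sweep_oversampling_rates(train_set_size, train_and_evaluate):
    """Try over-sampling rates 10%, 20%, ..., 100% of the training set.

    Returns (best_rate, best_score, scores) where `scores` maps each
    rate to the score returned by `train_and_evaluate`.
    """
    rates = [r / 10 for r in range(1, 11)]  # 0.1, 0.2, ..., 1.0
    scores = {}
    for rate in rates:
        # round() avoids float artifacts such as int(0.7 * 100) == 69
        n_synthetic = round(rate * train_set_size)
        scores[rate] = train_and_evaluate(n_synthetic)
    best_rate = max(scores, key=scores.get)
    return best_rate, scores[best_rate], scores


# Toy stand-in evaluator (NOT the real SORAG pipeline): the score
# fluctuates non-monotonically with the rate, loosely mimicking the
# sensitivity observed in Figure 5.
def toy_eval(n_synthetic):
    return 0.5 + 0.01 * ((n_synthetic * 7) % 13)


best_rate, best_score, curve = sweep_oversampling_rates(100, toy_eval)
print(best_rate, round(best_score, 2), len(curve))
```

In the actual experiments, the unlabeled-node and labeled-node rates are swept independently (panels a/c/e vs. b/d/f in Figure 5), so the loop above would be run once per node type.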

**Figure 5.** Influence of over-sampling rate on the model performance for each dataset: (**a**) BLOGCATALOG3, UNLABELED NODE GENERATION; (**b**) BLOGCATALOG3, LABELED NODE GENERATION; (**c**) FLICKR, UNLABELED NODE GENERATION; (**d**) FLICKR, LABELED NODE GENERATION; (**e**) YOUTUBE, UNLABELED NODE GENERATION; (**f**) YOUTUBE, LABELED NODE GENERATION.

One clear observation is that the performance of **SORAG** fluctuates markedly with the over-sampling rate on all datasets. Unlike with non-graph data, **SORAG** must also generate new edges for the new samples, a step that introduces additional random noise. As the over-sampling rate varies, the features of the virtual nodes affect how the virtual edges form, which in turn substantially alters feature propagation on the new graph and thus the classification results. This may explain the high sensitivity of **SORAG** to the over-sampling rate. Table 4 lists the selected optimal over-sampling rates for all datasets.


**Table 4.** Optimal over-sampling rates for the synthetic unlabeled nodes (denoted Rate*U*) and synthetic labeled nodes (denoted Rate*L*) on each dataset. N/A stands for "not applicable".
