**6. Experimental Results**

We conducted a series of experiments to evaluate the performance of the proposed Grey-Box model, using classification accuracy and *F*1-score as evaluation metrics under 10-fold cross-validation.
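As a minimal sketch of this evaluation protocol, the snippet below computes accuracy and a binary *F*1-score and generates 10-fold train/test index splits. It assumes a binary task with a designated positive class and contiguous, unshuffled folds. The text does not state whether folds were stratified or how *F*1 was averaged, and all function names here are illustrative only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def k_fold_indices(n, k=10):
    """Yield (train_idx, test_idx) for k contiguous folds over n samples.

    Assumption: no shuffling or stratification is applied here.
    """
    # Spread the remainder over the first n % k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size
```

In a full experiment, each model would be trained on `train_idx` and scored on `test_idx`, and the ten per-fold scores would then be summarized, for example as a box plot.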

In these experiments, we compared the performance of Black-Box, White-Box, and Grey-Box models on six datasets, using five different labeled ratios: 10%, 20%, 30%, 40%, and 50%. The term "ratio" refers to the ratio of labeled to unlabeled data used in each experiment. The White-Box model uses a single White-Box base learner in a self-training framework, the Black-Box model uses a single Black-Box base learner in a self-training framework, and the Grey-Box model uses a White-Box learner together with a Black-Box base learner in a self-training framework.
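The labeled-ratio setup and a generic single-learner self-training loop can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the experiments: the ratio is taken to be the labeled fraction of the training data, the confidence threshold and iteration cap are hypothetical parameters, and a toy one-dimensional nearest-centroid classifier stands in for the actual base learners. How the Grey-Box model combines its two learners is not specified in this section and is not modeled here.

```python
import random


def split_by_labeled_ratio(samples, ratio, seed=0):
    """Keep labels for a `ratio` fraction of (x, y) samples; hide the rest."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_labeled = round(ratio * len(shuffled))
    labeled = shuffled[:n_labeled]
    unlabeled = [x for x, _ in shuffled[n_labeled:]]  # labels are discarded
    return labeled, unlabeled


def self_train(fit, predict, labeled, unlabeled, threshold=0.9, max_iter=10):
    """Generic self-training: repeatedly add confident pseudo-labels.

    `fit(labeled) -> model` and `predict(model, x) -> (label, confidence)`
    are supplied by the caller; `threshold` and `max_iter` are assumptions.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    model = fit(labeled)
    for _ in range(max_iter):
        newly_labeled, remaining = [], []
        for x in pool:
            label, confidence = predict(model, x)
            if confidence >= threshold:
                newly_labeled.append((x, label))
            else:
                remaining.append(x)
        if not newly_labeled:
            break  # nothing confident enough is left to add
        labeled.extend(newly_labeled)
        pool = remaining
        model = fit(labeled)  # retrain on the enlarged labeled set
    return model, labeled


def centroid_fit(labeled):
    """Toy learner: one mean value (centroid) per class."""
    groups = {}
    for x, y in labeled:
        groups.setdefault(y, []).append(x)
    return {y: sum(xs) / len(xs) for y, xs in groups.items()}


def centroid_predict(model, x):
    """Predict the nearest centroid; confidence grows as it gets closer."""
    distances = {label: abs(x - c) for label, c in model.items()}
    best = min(distances, key=distances.get)
    total = sum(distances.values())
    confidence = 1.0 if total == 0 else 1.0 - distances[best] / total
    return best, confidence
```

Passing `centroid_fit` and `centroid_predict` to `self_train` gives a single-learner model in the sense described above; swapping in a different `fit`/`predict` pair changes the base learner without touching the loop.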

Additionally, to investigate which combination of Black-Box and White-Box learners yields the best results, we tried two White-Box and three Black-Box learners. Specifically, for each dataset we used the Bayes-net (BN) [25] and the random tree (RT) [26] as White-Box learners, and the sequential minimal optimization (SMO) [27], the random forest (RF) [28], and the multi-layer perceptron (MLP) [29] as Black-Box learners. These classifiers are probably among the most popular and efficient machine learning methods for classification tasks [30]. Table 2 presents a summary of all models used in the experiments.
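The set of models implied by these learners can be enumerated with a short sketch. It assumes that every Black-Box learner is paired with every White-Box learner, and it borrows the "Black-White" naming pattern from the label Grey-Box (SMO-RT) used in the text; the variable names are illustrative only.

```python
from itertools import product

WHITE_BOX = ["BN", "RT"]          # Bayes-net, random tree
BLACK_BOX = ["SMO", "RF", "MLP"]  # SMO, random forest, multi-layer perceptron

# Assumption: each Grey-Box model pairs one Black-Box with one White-Box learner.
GREY_BOX = [f"{black}-{white}" for black, white in product(BLACK_BOX, WHITE_BOX)]

for name in GREY_BOX:
    print("Grey-Box", name)
```

Under this assumption there are two White-Box, three Black-Box, and six Grey-Box models.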

The experimental analysis was conducted in two phases: in the first phase, we compared the performance of the Grey-Box models against that of the White-Box models, while in the second phase, we compared the performance of the Grey-Box models against that of the Black-Box models.


**Table 2.** White-, Black-, and Grey-Box models summary.

### *6.1. Grey-Box Models vs. White-Box Models*

Figures 3–8 present the performance of the Grey-Box and White-Box models. For both metrics, the Grey-Box model outperformed the White-Box model for each combination of Black-Box and White-Box classifiers in most cases, especially on the EduDataA, EduDataAB, and Australian datasets. On the Coimbra dataset the two models exhibit similar performance, with the Grey-Box model being slightly better. On the Wisconsin dataset the Grey-Box model performed better in general, except for ratios of 40% and 50%, where the White-Box model with RT as base learner outperformed it. Figures 3–8 also show that on the EduDataA, EduDataAB, and Bank marketing datasets the performance of all models increases with the ratio, whereas no similar conclusion can be drawn for the remaining datasets. In most cases the Grey-Box model performs better for small ratios (10%, 20%, 30%), while for higher ratios (40%, 50%) the White-Box model performs similarly to the Grey-Box model. In summary, the proposed Grey-Box model performs better for small labeled ratios. Additionally, the Grey-Box (SMO-RT) model reported the best overall performance, achieving the highest classification accuracy on three of the six datasets, and across all ratios the Grey-Box model performed best when RT was used as the White-Box learner.

**Figure 3.** Box-plot with performance evaluation of Grey-Box and White-Box models for EduDataA dataset.

**Figure 4.** Box-plot with performance evaluation of Grey-Box and White-Box models for EduDataAB dataset.

**Figure 5.** Box-plot with performance evaluation of Grey-Box and White-Box models for Australian dataset.

**Figure 6.** Box-plot with performance evaluation of Grey-Box and White-Box models for Bank dataset.

**Figure 7.** Box-plot with performance evaluation of Grey-Box and White-Box models for Coimbra dataset.

**Figure 8.** Box-plot with performance evaluation of Grey-Box and White-Box models for Wisconsin dataset.

### *6.2. Grey-Box Models vs. Black-Box Models*

Figures 9–14 present the performance of the Grey-Box and Black-Box models used in our experiments. Overall, the Grey-Box model was nearly as accurate as the Black-Box model, with the Black-Box model being slightly better. More specifically, the Black-Box model clearly outperformed the Grey-Box model on the EduDataA and EduDataAB datasets, while the Grey-Box model reported better performance on the Australian dataset. On the Bank marketing, Coimbra, and Wisconsin datasets, both models exhibited similar performance.

**Figure 9.** Box-plot with performance evaluation of Grey-Box and Black-Box models for EduDataA dataset.

**Figure 10.** Box-plot with performance evaluation of Grey-Box and Black-Box models for EduDataAB dataset.

**Figure 11.** Box-plot with performance evaluation of Grey-Box and Black-Box models for Australian dataset.

**Figure 12.** Box-plot with performance evaluation of Grey-Box and Black-Box models for Bank dataset.

**Figure 13.** Box-plot with performance evaluation of Grey-Box and Black-Box models for Coimbra dataset.

**Figure 14.** Box-plot with performance evaluation of Grey-Box and Black-Box models for Wisconsin dataset.
