*4.1. Datasets*

The compared classification algorithms were evaluated using the chest X-ray (Pneumonia) dataset, the Shenzhen lung mask (Tuberculosis) dataset and the CT Medical images dataset.


The training partition was randomly divided into labeled and unlabeled subsets. In order to study the influence of the amount of labeled data, four different ratios (*R*) of the training data were used: 10%, 20%, 30% and 40%. Following the recommendation established in [9,32], we do not maintain the class proportions in the labeled and unlabeled sets during the division, since the main aim of semi-supervised classification is to exploit unlabeled data for better classification results. Hence, a random selection of examples is marked as labeled instances, and the class labels of the remaining instances are removed. Furthermore, we ensure that every class has at least one labeled representative instance.
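The split described above can be sketched as follows; this is a minimal illustration under our reading of the procedure, and the function name, boolean-mask representation and RNG seeding are our own, not from the paper:

```python
import numpy as np

def split_labeled_unlabeled(y, R, seed=None):
    """Mark a random fraction R of the training instances as labeled,
    without preserving class proportions, while guaranteeing that every
    class keeps at least one labeled representative."""
    rng = np.random.default_rng(seed)
    n = len(y)
    labeled = np.zeros(n, dtype=bool)
    for c in np.unique(y):                       # one representative per class
        labeled[rng.choice(np.flatnonzero(y == c))] = True
    n_target = int(round(R * n))                 # e.g. R = 0.10, 0.20, 0.30, 0.40
    extra = max(0, n_target - int(labeled.sum()))
    pool = np.flatnonzero(~labeled)
    labeled[rng.choice(pool, size=extra, replace=False)] = True
    return labeled                               # labels of ~labeled are treated as unknown
```

The mask can then be used to hide the labels of the unlabeled subset before running the self-labeled algorithms.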

### *4.2. Performance Evaluation of WvEnSL against Ensemble Self-Labeled Algorithms*

Next, we focus our interest on the experimental analysis for evaluating the classification performance of the WvEnSL algorithm against the ensemble self-labeled algorithms CST-Voting and DTCo, which utilize simple voting methodologies. It is worth noticing that our main goal is to measure the effectiveness of the proposed weighted voting strategy over simple majority voting; therefore, we compare ensembles using identical sets of classifiers. This eliminates any discrepancy originating from unequal classifiers, so the difference in accuracy can be attributed solely to the difference in voting methodologies.
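As a minimal illustration of the difference this comparison isolates, the two combiners can be sketched in pure NumPy; the function names and the label-matrix layout are our own, not the paper's:

```python
import numpy as np

def weighted_vote(preds, weights):
    """preds: (n_classifiers, n_samples) matrix of predicted labels;
    each classifier's vote counts proportionally to its weight."""
    classes = np.unique(preds)
    # scores[c, s] = total weight of classifiers voting class c on sample s
    scores = np.array([((preds == c) * weights[:, None]).sum(axis=0)
                       for c in classes])
    return classes[scores.argmax(axis=0)]

def majority_vote(preds):
    """Simple majority voting: every classifier gets unit weight."""
    return weighted_vote(preds, np.ones(preds.shape[0]))
```

With three classifiers predicting `[[0,1,1],[0,0,1],[1,0,1]]`, majority voting returns `[0,0,1]`, whereas giving the first classifier weight 3 flips the second sample to class 1.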

Furthermore, the base learners utilized in all self-labeled algorithms are the Sequential Minimal Optimization (SMO) algorithm [33], the *C*4.5 decision tree algorithm [34] and the *k*NN algorithm [35], as in [2,7–9]; these are among the most effective and popular machine learning algorithms for classification problems [36].
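A sketch of such a base-learner pool with scikit-learn stand-ins, assembled under simple majority voting for reference: `SVC` is fitted with an SMO-type solver, `DecisionTreeClassifier` implements CART (a close stand-in for C4.5, not C4.5 itself), and the dataset and hyperparameters here are illustrative, not the paper's configuration (Table 1):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Stand-ins for the three base learners
base = [("smo", SVC()),                                   # SVM via SMO-type solver
        ("c45", DecisionTreeClassifier(random_state=0)),  # CART as a C4.5 stand-in
        ("knn", KNeighborsClassifier(n_neighbors=5))]

ens = VotingClassifier(estimators=base, voting="hard")    # simple majority voting
acc = ens.fit(X_tr, y_tr).score(X_te, y_te)
```

The self-labeled wrappers in the experiments train these same learners on the labeled subset and iteratively extend it with confident predictions on the unlabeled subset.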


The configuration parameters for all supervised classifiers and self-labeled algorithms, utilized in our experiments, are presented in Table 1.


**Table 1.** Parameter specification for all the base learners and self-labeled methods used in the experimentation.

Tables 2–4 present the performance of all ensemble self-labeled methods on the Pneumonia, Tuberculosis and CT Medical datasets, respectively. Notice that the highest classification performance for each ensemble of classifiers and performance metric is highlighted in bold. The aggregated results show that the new weighted voting strategy exploits the individual predictions of each component classifier more efficiently than the simple voting schemes, yielding better classification performance. WvEnSL3 exhibits the best performance, reporting the highest *F*1-score and accuracy across all classification benchmarks and labeled ratios, followed by WvEnSL2. In more detail, WvEnSL3 demonstrates 82.53–83.49%, 69.79–71.73% and 69–77% classification accuracy on the Pneumonia, Tuberculosis and CT Medical datasets, respectively, while WvEnSL2 reports 81.89–83.17%, 69.79–71.55% and 67–77% in the same situations.

The statistical comparison of several classification algorithms over multiple datasets is fundamental in the area of machine learning and is usually performed by means of a statistical test [2,7–9]. Since we are interested in rejecting, at a given significance level, the hypothesis that all algorithms perform equally well in terms of classification accuracy, and in highlighting significant differences between the proposed algorithm and the classical self-labeled algorithms, we utilized the non-parametric Friedman Aligned Ranking (FAR) test [37].



**Table 2.** Performance evaluation of WvEnSL against ensemble self-labeled algorithms for Pneumonia dataset.


**Table 3.** Performance evaluation of WvEnSL against ensemble self-labeled algorithms for Tuberculosis dataset.


**Table 4.** Performance evaluation of WvEnSL against ensemble self-labeled algorithms for CT Medical dataset.


Let *r*<sup>*j*</sup><sub>*i*</sub> be the rank of the *j*-th of *k* learning algorithms on the *i*-th of *M* problems. Under the null hypothesis *H*<sub>0</sub>, which states that all the algorithms are equivalent, the Friedman aligned ranks test statistic is defined by:

$$F\_{AR} = \frac{(k-1)\left[\sum\_{j=1}^{k} \hat{\mathcal{R}}\_j^2 - (kM^2/4)(kM+1)^2\right]}{\frac{kM(kM+1)(2kM+1)}{6} - \frac{1}{k}\sum\_{i=1}^{M} \hat{\mathcal{R}}\_i^2}$$

where *R̂*<sub>*i*</sub> is the rank total of the *i*-th dataset and *R̂*<sub>*j*</sub> is the rank total of the *j*-th algorithm. The test statistic *F*<sub>*AR*</sub> is compared with the *χ*<sup>2</sup> distribution with (*k* − 1) degrees of freedom. It is worth noticing that, since it is non-parametric, the FAR test does not require the commensurability of the measures across different datasets, nor does it assume the normality of the sample means; thus, it is robust to outliers.
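A minimal sketch of this statistic, assuming the accuracies are stored in an *M* × *k* array (rows = datasets/ratios, columns = algorithms); the function name and data layout are illustrative, not taken from the paper:

```python
import numpy as np
from scipy.stats import chi2, rankdata

def friedman_aligned_ranks(A):
    """Return the FAR statistic and its chi-squared p-value (k-1 dof)."""
    M, k = A.shape
    aligned = A - A.mean(axis=1, keepdims=True)      # subtract per-dataset mean
    ranks = rankdata(aligned.ravel()).reshape(M, k)  # joint ranking of all k*M values
    R_j = ranks.sum(axis=0)                          # rank total per algorithm
    R_i = ranks.sum(axis=1)                          # rank total per dataset
    num = (k - 1) * (np.sum(R_j**2) - (k * M**2 / 4) * (k * M + 1)**2)
    den = k * M * (k * M + 1) * (2 * k * M + 1) / 6 - np.sum(R_i**2) / k
    T = num / den
    return T, chi2.sf(T, k - 1)
```

When all columns are identical the aligned observations are all tied, the numerator vanishes and the test returns a *p*-value of 1, as expected under *H*<sub>0</sub>.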

Additionally, in order to identify which algorithms report significant differences, the Finner test [38] with a significance level *α* = 0.05 is applied as a post-hoc procedure. More analytically, the Finner procedure adjusts the value of *α* in a step-down manner. Let *p*<sub>1</sub> ≤ *p*<sub>2</sub> ≤ ··· ≤ *p*<sub>*k*−1</sub> be the ordered *p*-values and *H*<sub>1</sub>, *H*<sub>2</sub>, ... , *H*<sub>*k*−1</sub> the corresponding hypotheses. The Finner procedure rejects *H*<sub>1</sub>–*H*<sub>*i*−1</sub> if *i* is the smallest integer such that *p*<sub>*i*</sub> > 1 − (1 − *α*)<sup>(*k*−1)/*i*</sup>, while the adjusted Finner *p*-value is defined by:

$$p\_F = \min\left\{1, \max\_{1 \leq j \leq i}\left\{1 - (1 - p\_j)^{(k-1)/j}\right\}\right\},$$

where *p*<sub>*j*</sub> is the *p*-value obtained for the *j*-th hypothesis and 1 ≤ *j* ≤ *i*. The test rejects the hypothesis of equality when *p*<sub>*F*</sub> is less than *α*.
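The step-down adjustment can be sketched as follows, assuming the *k* − 1 raw *p*-values of the comparisons against the control algorithm are given; the function name is ours:

```python
import numpy as np

def finner_adjusted(pvals):
    """Finner step-down adjusted p-values for m = k-1 hypotheses
    compared against a control algorithm."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    adj = np.empty(m)
    running_max = 0.0
    # walk the hypotheses in ascending p-value order (step-down)
    for j, idx in enumerate(np.argsort(p), start=1):
        val = 1.0 - (1.0 - p[idx]) ** (m / j)   # 1 - (1 - p_j)^((k-1)/j)
        running_max = max(running_max, val)      # enforce monotonicity
        adj[idx] = min(1.0, running_max)
    return adj
```

Each adjusted value is at least as large as the corresponding raw *p*-value, and a hypothesis is rejected whenever its adjusted value falls below *α*.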

The control algorithm for the post-hoc test is determined by the best (lowest) ranking obtained in each FAR test. Moreover, the adjusted *p*-value of Finner's test (*p*<sub>*F*</sub>) is reported relative to the corresponding control algorithm, and the post-hoc test rejects the hypothesis of equality when *p*<sub>*F*</sub> is less than *α*. It is worth mentioning that the FAR test and the Finner post-hoc test were performed based on the classification accuracy of each algorithm over all datasets and labeled ratios.

Table 5 presents the results of the statistical analysis performed by non-parametric multiple comparison procedures for all ensemble self-labeled algorithms. The interpretation of Table 5 demonstrates that WvEnSL3 reports the highest probability-based ranking, statistically presenting the best results, followed by WvEnSL2 and WvEnSL1 (C4.5). Moreover, it is worth mentioning that all weighted voting ensembles outperformed the corresponding ensembles that utilize classical voting schemes. Finally, based on the statistical analysis, we conclude that the new weighted voting scheme had a significant impact on the performance of all ensembles of self-labeled algorithms.


**Table 5.** Friedman Aligned Ranking (FAR) test and Finner post-hoc test.

### *4.3. Performance Evaluation of WvEnSL against Classical Supervised Algorithms*

Next, we compare the classification performance of the proposed algorithm against the classical supervised classification algorithms SMO, C4.5 and *k*NN. Moreover, we compare its performance against the ensemble of classifiers (Voting), which combines the individual predictions of the supervised classifiers utilizing a simple majority voting strategy.


Table 6 presents the performance of the proposed algorithm WvEnSL3 against the supervised algorithms SMO, C4.5, *k*NN and Voting on the Pneumonia, Tuberculosis and CT Medical datasets. As mentioned above, the highest classification performance for each labeled ratio and performance metric is highlighted in bold. The aggregated results show that WvEnSL3 is the most efficient algorithm, since it exhibits the best overall classification performance. More specifically, WvEnSL3 reports the highest *F*1-score and classification accuracy on the Pneumonia and Tuberculosis datasets, while on the CT Medical dataset it reports the second-best performance, being considerably outperformed by C4.5.


**Table 6.** Performance evaluation of WvEnSL3 against state-of-the-art supervised algorithms on Pneumonia dataset, Tuberculosis dataset and CT Medical dataset.
