## *2.1. Co-Training Studies*

The first part of this section is dedicated to the Co-training scheme and some of its numerous variants that have been demonstrated in the literature. Co-training is a representative multi-view SSL method established by Blum and Mitchell [7] for binary classification problems and, in particular, for categorizing web pages either as course or as non-course pages. It is based on the premise that each example can be naturally divided into two separate sets of features, usually referred to as views, an assumption of great importance for the implementation of the method. Co-training is identified as a "two-view weakly supervised algorithm" [6], since it incorporates the self-training approach to separately teach each of the two supervised classifiers in the corresponding feature view and boost the classification performance by exploiting the unlabeled examples in the most efficient manner [19]. Moreover, Zhu and Goldberg consider co-training a wrapper method that is not affected by the choice of the two supervised classifiers employed in the relevant procedure, provided that they produce good predictions on unlabeled data [20]. Several modifications have been implemented since then, including mutual-learning and co-EM—bringing together the co-training and Expectation-Maximization (EM) approaches—exploiting mainly simple classifiers like naive Bayes (NB) [7,21].

In addition to the "two view" assumption, the effectiveness of the method depends largely on two other key assumptions: the first is that each view is adequate for classifying the unlabeled data using only a small set of labeled examples for training, while the second is that each view is conditionally independent given the class label. For the cases where either of these assumptions is not met, different co-training variants have been proposed that achieve comparable results. In the case where the "two view" assumption is not fulfilled, a random feature partition can take place to facilitate the application of the method, as proposed by Zhu and Goldberg [20]. In such cases, the feature set is partitioned into two subsets of almost equal size, which henceforth form the two feature views, while different classifiers C1, C2 are employed. Alternatively, the same classifier may be used under different configuration parameters, thus ensuring diversity between the two learners [22].

The number of studies that propose co-training as an effective SSL method is rather limited. One of these is presented in [22], where sentiment analysis is the main focus, while in [23] the authors tackled a health care issue. Although this type of problem is widespread, and even though the shortcomings associated with demanding a large amount of labeled data can be efficiently leveraged by other SSL methods, co-training does not seem to have been explored thoroughly enough. In the former study [22], three different sources of text data were examined: news articles, online reviews, and blogs. A number of co-training variants were designed, focusing on the way the split of the feature space takes place so as to fit the specific properties that characterize text data, such as building one view from unigrams and the other from bigrams, or adopting character-based language models and bag-of-words models, respectively. The produced results demonstrate the effectiveness of the co-training algorithm.

Another task that has been efficiently tackled using the co-training method is that of drug discovery, where classification methods need to be applied so as to predict the suitability of molecules for treatments of diseases and their possibly induced side-effects during the initial steps of tedious experiments [24]. Accurate predictions may save both time and money, since fewer combinations would be investigated and the final results could be acquired much faster. In this work, two different views were available, stemming from chemistry and biology, and had to be combined to reach the final conclusion. The approaches that were examined may be summed up as follows: (i) access each view separately, either with a base classifier or with the partial least squares (PLS) regression method [25]; (ii) fuse the different views, either by joining the heterogeneous data without any preprocessing or after having applied the PLS method, also used for dimensionality reduction; and (iii) apply a modification of the co-training method (co-FTF). Ensemble tree-based learners were preferred in this last approach, handling imbalanced datasets appropriately and leading to promising results, while two labeled-ratio scenarios were examined. In addition, a random forest of predictive clustering trees has been incorporated in a self-training scheme for multi-target regression, thus improving the performance of the employed SSL approach [26].

An expansion of the co-training algorithm, which includes an ensemble of tree-based learners as its base learner, has been proposed in [27]. Under the assumptions presented there, the necessity of two sufficient and redundant views has been eliminated for the proper operation of Co-Forest. Furthermore, the bootstrap method that is exploited during the creation of the included decision trees provides the required diversity and, at the same time, reduces the chance of exporting biased decisions, leading to an efficient operation of the SSL scheme. Adaptive Data Editing based Co-Forest (ADE-Co-Forest) [28] constitutes a variant of the original Co-Forest algorithm, introducing an internal mechanism to tackle mislabeled instances, thus improving the overall predictive behavior, since both false negative and false positive error rates are further reduced compared to its ancestor. A boosted co-training algorithm has also been proposed for a real-world task—more specifically, human action recognition—which is based on the mutual information and the consistency between labeled and unlabeled data. Two metrics, named inter-view and intra-view confidence, are introduced and exploited dynamically so as to select the most appropriate subset of the unlabeled pool with the corresponding pseudo-labels [29].

Recently, a quite effective co-training method was introduced in [30] for early prognosis of undergraduate students' performance in the final examinations of a distance learning course, based on attributes which are naturally divided into two separate and independent views. The first one concerns students' characteristics and academic achievements, which are manually filled in by tutors, while the second one refers to attributes tracking students' online activity in the course learning management system, which are automatically recorded by the system. It should be mentioned that semi-supervised multi-view learning has also been successfully applied to gene network reconstruction, combining the interactions predicted by a number of different inference methods [19]. In a similar work, an ensemble-based SSL approach has been proposed for the computational discovery of miRNA regulatory networks from large-scale predictions produced by different algorithms [31].

## *2.2. Ensemble Selection Strategies*

The second part briefly reports some of the most important points related to the Ensemble Selection concept [32,33]. Common keywords in this field are Multiple Classification Systems (MCSs), Static Ensemble Classifier (SEC), and Dynamic Ensemble Classifier (DEC), as well as classifiers' competence and diversification. These terms are connected by the fact that, when a new ensemble learner is designed, the main ambition is the employment of complementary and diverse participants, following the main asset of MCSs regarding the continuous increase of the predictive rate. The main difference between the two remaining terms is that SEC strategies seek a global solution for the total set of unknown instances, while DEC approaches provide a separate solution per test instance using mainly local restrictions. Despite their distinct roles, they can be combined under hybrid mechanisms sharing similar measurement metrics or ML techniques for converging to their decisions [34,35].

Ensemble Selection has been inserted as a new stage into the original chain of constructing an ensemble learner, motivated both by the computational needs that arise when ensembles with too many participants are trusted and by the benefit of discarding less accurate models or models that reduce the internal diversity. This tactic is usually referred to as ensemble pruning or selective ensemble. A taxonomy of these techniques has been proposed in [36], assigning them to four different categories: (i) ranking-based, (ii) clustering-based, (iii) optimization-based, and (iv) others, including the remaining techniques that cannot be strictly assigned to any of the previous three subsets. Another taxonomy was presented in 2014, concerning mainly the actual need for DEC in practice and the relation between the inherent complexity of a classification problem, measured by appropriate metrics, and the contribution of the examined Dynamic Selection approaches [16]. Prototype selection techniques have also been examined in the abovementioned framework, acting beneficially towards both reducing computational resources and boosting the classification accuracy [37]. Furthermore, a related work in the field of SSL has been proposed, using the competence of selected classifiers that stems from an affinity graph, achieving smoothness of the decisions for neighboring data [17].

### **3. The Proposed Co-Training Scheme**

Motivated by the above studies, in the present paper we put forward an ensemble-based co-training scheme for binary classification problems, adopting a strategy of choosing the base classifiers of the ensemble, per dataset, from an available pool of candidate classification algorithms. The most important points concerning our contribution are outlined below:


Let the whole dataset (X) consist of *n* instances and *k* features, apart from the class variable (Y) that, in the context of this work, is restricted to be a binary one. Thus, without loss of generality, we assume that y*i* ∈ {0,1} for each labeled instance {l*i*, 1 ≤ *i* ≤ n*l*}, while each unlabeled instance {u*i*, 1 ≤ *i* ≤ n*u*} is characterized by the absence of the corresponding y*i* value. The parameters n*l* and n*u* represent the cardinalities of the L and U subsets, respectively. After having removed all instances with missing values—leading to a new cardinality of total instances (n′)—it is evident that the following equation holds:

$$n' = n_l + n_u \tag{1}$$

Besides numeric features, X may also hold categorical ones; all features of the latter form are converted into binary ones, increasing the initial number of k features to k′, in case X contains at least one of them. Otherwise, since no augmentation of the initial features has been applied, the two quantities coincide: k ≡ k′. Under this generic approach, classification algorithms that cannot handle categorical data are not rejected by the overall proposed process.

This choice seems safe enough, since it does not reject the adoption of any learner—this mainly refers to learning algorithms that cannot efficiently handle the coexistence of numerical and categorical data—although the manipulation of heterogeneous features remains an open issue [39]. Afterwards, without introducing any specific assumption about the relationship or the origin of any included feature, the available feature vector F: <f1, f2, ... , fk′> is split into two newly formed subsets F1 and F2, where F = F1 ∪ F2. Hence, two different datasets X1, X2 are generated, respectively, both including disjoint feature sets but sharing the same class variable Y. Therefore, the final hypothesis space can be summarized as follows: Fview: Xview → [0, 1], where view = 1, 2.

Through the above-described methodology, two choices are enabled: either to apply a common learning strategy for both views, such as adopting the same learner, or to tackle each view separately, depending on underlying properties, such as the views' cardinalities, independence or correlation assumptions that affect the views' internal structure, or other kinds of relationships that specify the nature of each view, since two distinct tasks have been raised. Following the majority of the existing approaches found in the literature, and taking into consideration that a random feature split operates as an agnostic factor regarding the structure of the constructed views, the first approach was adopted in the present study [40].

Under this strategy, and before the common base learner is built per view, a preprocessing stage is inserted. This stage aims to measure the rate of class imbalance in the provided training set and to define the number of instances that have to be mined from each class per iteration (Minedclass0, Minedclass1). Due to the SSL concept, the quota of the L and U subsets is defined by a labeled ratio value (R). Given this setting, the size of the initial training set (L0view) is computed according to the following formula:

$$\text{InitSize} = R \times \text{size}(X), \quad \forall\, \text{view} = 1, 2 \tag{2}$$

The cardinalities of both classes are then computed (Cmax, Cmin) with regard to the available L0view. The number mined for the minority class is set equal to 1 (Minedclass0), while the other one is set equal to Cmax/Cmin (Minedclass1). In this way, the provided class distribution of the labeled instances is assumed to be representative of the total problem, which is also defined by the unknown instances that must be assessed. Finally, these two variables are exploited during the learning stage to retrieve a suitable number of unlabeled instances per class during each iteration.

Now, as regards the choice of the base learners, we selected five representative algorithms from different learning families, capturing a wide spectrum of properties—both assets and defects, which should be combined and avoided, respectively—in order to construct an appropriately accurate and robust ensemble learner per dataset so as to initialize the co-training process [41]. For this purpose, our pool of classifiers (C) consists of support vector machines (SVMs) [42], k-nearest-neighbors (kNN) [43], a simple tree inducer (DT) from the family of decision trees [44], naive Bayes (NB) [45], and logistic regression (LR) [46]. In order to keep the computational needs of the exported ensemble low, we restrict the number of classifier participants in our Voting scheme, setting this number equal to 2. Thus, we employed a soft variant of the Voting scheme, which takes into account the class probabilities of each algorithm and combines these decisions through an averaging process, instead of hard voting through on-off decisions [29], where ties would appear too frequently with an even number of base learners. Furthermore, the stage of averaging the decisions of each individual participant generally reduces the ensemble's variance and helps to surpass the sensitivity to the structure of the input data that is usually detected in more unstable methods.

To be more specific, if we assume that we tackle a binary classification problem with a set of labels Y = {0, 1} and a feature space *X* ∈ R*k*, such that any probabilistic classifier *F* realizes a function *F* : *X* → Y, then for each instance *m* the decision profile of learner *j* is a pair of class probabilities *P*j0, *P*j1 which sum to 1. Consequently, the mechanism of a simple soft-Voting classifier without weighting factors, given an instance *x*m, combines the decisions of all the candidate classification algorithms, searching for the most probable class (ω) as follows:

$$\hat{y}_m = \arg\max_{\omega} \frac{1}{p} \sum_{j=1}^{p} P_j(Y = \omega \mid x_m), \quad \hat{y}_m \in Y, \ m \in \{1, 2, \dots, n'\}, \tag{3}$$

The class with the largest average probability is exported as the prevalent one through this pipeline, where *ŷ*m ∈ *Y* and *p* denotes the number of combined classifiers.

Turning to the function of our preprocessing stage, which constructs the base learner of the proposed co-training scheme, we should mention that the ambition of any Static Ensemble Selection strategy is to construct a subset *C*∗, such that *C*∗ ⊂ *C* and |*C*∗| = 2, which best satisfies the chosen criteria for obtaining the most desired performance over test instances. In our case, we investigate the most compatible pair of learners that maximizes our proposed criterion under an unweighted soft-Voting scheme. Through this criterion, we measure the number of instances for which the decision of the soft-Voting scheme remains correct when the two candidate participants disagree (*q*corrected), normalized by the total number of disagreements based on the label of the examined instances (*q*disagree), as well as the rate of non-common errors (1 − *q*common errors/*v*). To this end, we introduce the objective function of Equation (4), which is defined as a linear combination of the mentioned quantities:

$$Q_{soft}^{a}(i,j) = a \cdot \frac{q_{corrected}^{i,j}}{q_{disagree}^{i,j}} + (1-a) \cdot \left(1 - \frac{q_{common\ errors}^{i,j}}{v}\right), \quad 0 \le a \le 1, \ i, j \in \{1, 2, \dots, |C|\} \text{ with } i \ne j, \tag{4}$$

where *a* is a parameter that balances the importance of the included terms. The first term rewards the pair of classifiers that manage to act complementarily, since the more often the confidence of the classifier that guessed the class label correctly overpowers that of the erroneous one, the larger the values this term records. On the other hand, the second term penalizes the pair of classifiers whose common decisions coincide with mislabeling cases, its value being reduced whenever such behavior occurs. The parameter *v* symbolizes the cardinality of the validation set over which the remaining quantities are calculated. Giacinto and Roli called this diversity measure "the double-fault measure" [47].

Although an analysis of the selected *a* value could be of interest for further research, we selected the value of 0.5 for equal importance. Thus, for each examined dataset D, which contains both labeled and unlabeled data, we split the labeled set into a training and a validation set, in the same manner as the default k-fold cross-validation strategy, applying the previously described Static Ensemble Selection strategy so as to detect the most favorable pair of classifiers for our soft-Voting ensemble learner. In case *q*disagree = 0, *a* is set equal to 0, keeping only the second term.

Exploiting the exported soft-Voting ensemble learner as the base learner of our co-training variant, each L0view is fitted with *Co*(*Vote*soft(*C*∗i, *C*∗j)) ≡ *Co*(*Vote*SECsoft)—we use the notation *C*∗i and *C*∗j for the selected learners included in *C*∗—and the corresponding class probabilities for each unlabeled instance per view (uiview) are computed per iteration. Next, only the top-class0 and top-class1 instances per class are selected, based on the estimated confidence measure. Subsequently, these instances are removed from the current U subset (since both views share a common unlabeled set, no view index is needed when referring to U). Then, they are added to the training set of the opposite view along with the most prominent class label based on the base learner's decision. Therefore, if the target variable of the m-th instance of U is categorized as class0 by the F1 classifier (xm: <f1, f2, ... , fk′/2> with probclass0 > 0.5), then the Liter2 subset during the iter-th iteration is augmented with the same instance, using the corresponding features of the second view and the estimated class variable (xm: <fk′/2+1, fk′/2+2, ... , fk′ | class0>).

According to this learning scheme, whose main ambition is to teach two different learners of the same classification algorithm through the mutual disagreement concept, each learner injects into the other the information retrieved from the supplied view per iteration. A more theoretical analysis of the error bounds that can be achieved through the disagreement-based concept in the case of Co-training can be found in [48]. Since our strategy constructs the base learner of co-training through a static ensemble selection mechanism (*Vote*SECsoft), we assume that we provide a sufficiently accurate algorithm whose competence and diversity have been verified over a validation set, so as to avoid overfitting phenomena or heavy mislabeling behaviors.

To sum up, the pseudo-code of the introduced SEC strategy (SSoftEC) as well as the proposed co-training variant are presented in Algorithms 1 and 2, respectively.

**Algorithm 1.** *SSoftEC strategy*

**Input:**
  L—labeled set
  f—number of folds to split L
  C—pool of classification algorithms exporting class probabilities
  α—value of the balancing parameter
**Main Procedure:**
  **For** each *i*, *j* ∈ {1, ... , |*C*|} with *i* ≠ *j* **do**
    **Set** *iter* = 1, *Q^a_soft*(*i*, *j*) = 0
    **Split** L into f separate folds: *L*(1), *L*(2), ... , *L*(*f*)
    **While** *iter* ≤ f **do**
      **Train** Ci, Cj on *L* \ *L*(*iter*)
      **Apply** Ci, Cj on *L*(*iter*)
      **Update** *Q^a_soft*(*i*, *j*) according to Equation (4)
      *iter* = *iter* + 1
**Output:** **Return** the pair of indices (*i*, *j*)∗ = arg max *Q^a_soft*(*i*, *j*).

**Algorithm 2.** *Ensemble based co-training variant*

**Mode:** Pool-based scenario over a provided dataset D = X(n × k) ∪ Y(n × 1)
  xi—vector with k features <f1, f2, ... , fk>, ∀ 1 ≤ i ≤ n
  yi—scalar class variable with yi ∈ {0, 1}, ∀ 1 ≤ i ≤ n
  {xi, yi}—i-th labeled instance (li), with 1 ≤ i ≤ nl
  {xi}—i-th unlabeled instance (ui), with 1 ≤ i ≤ nu
  Fview—separate feature sets, with view ∈ {1, 2}
  learnerview—selected learner built on the corresponding view, ∀ view = 1, 2
**Input:**
  Liter—labeled instances during the iter-th iteration, Liter ⊂ D
  Uiter—unlabeled instances during the iter-th iteration, Uiter ⊂ D
  iter—number of executed iterations
  MaxIter—maximum number of iterations
  C—pool of classifiers ≡ {SVM, kNN, DT, NB, LR}
  (f, α)—number of folds to split the validation set during SEC and balancing value of Equation (4)
**Preprocess:**
  k′—number of features after having converted each categorical feature into binary ones
  n′—number of instances after having removed instances with at least one missing value
  Cj—instance cardinalities of both existing classes, with j ∈ {min, max}
  Minedc—number of mined instances per class, where c ∈ {class0, class1}
**Main Procedure:**
  **Apply** SSoftEC(L0, f, C, α) and obtain *C*∗i, *C*∗j
  **Construct** *Vote*soft(*C*∗i, *C*∗j)
  **Set** *iter* = 0
  **While** *iter* < MaxIter **do**
    **For** each *view* **do**
      **Train** learnerview on L(iter)view
      **Assign** class probabilities to each ui ∈ U(iter)
      **For** each *class* **do**
        **Detect** the top Minedclass instances ≡ Indview
    **Update:**
      L(iter+1)view ← L(iter)view ∪ {xj, arg maxclass P(Y = class | Xview)} ∀ j ∈ Ind∼view
      (the sign ∼view denotes the opposite view from the current one)
      U(iter+1) ← U(iter) \ {xj} ∀ j ∈ Ind∼view
    *iter* = *iter* + 1
**Output:** **Use** *Vote*soft(*C*∗i, *C*∗j) trained on L(MaxIter) to predict the class labels of the test data.

### **4. Experimental Procedure and Results**


For the purpose of our study, a number of experiments were carried out using 27 benchmark datasets from the UCI Machine Learning Repository [49] regarding binary classification problems (Table 1), where the sign # depicts the cardinality of the corresponding quantity. Note that the column entitled # Features in Table 1 counts all the features apart from the class variable. These datasets were partitioned into 10 equal-sized folds using the stratified 10-fold-CV resampling procedure, so that each fold has the same class distribution as the entire dataset [50]. This process was repeated 10 times, until all folds had been used as the testing set, and the results were averaged. Moreover, each fold was divided into two subsets, one labeled and the other unlabeled, in accordance with a selected labeled ratio value (R), which is defined as follows:

$$R = |L| / (|L| + |U|). \tag{5}$$


**Table 1.** Description of datasets used from the UCI repository.

#: the cardinality of the corresponding quantity.

In order to study the influence of the amount of labeled data in the training set, three different ratios were used, in particular 10%, 20%, and 30%. In general, the *R* (%) values of interest to researchers are the smaller ones (*R* < 50%), so as to be consistent with the practical aspect of the SSL scenario. The effectiveness of the proposed co-training scheme was compared to several co-training and self-training variants. To verify the supremacy of *Vote*SECsoft as base classifier, we built the soft-Voting versions based on all pairs of the inserted pool of classifiers (C). Furthermore, the version that exploits the decisions of all the participants of the C pool was implemented, as well as the individual variants without voting. Thus, 16 different supervised classifiers were exhibited as base learners, all imported from the scikit-learn Python library [51], and in particular:

The SVMs, using the Radial Basis Function as kernel inside their implementation, representing one universal learner that tries to separate instances using hyper-planes and the 'kernel trick' [52],


For simplicity, we made use of the following notation in the experiments, while the parameters' configuration for all applied classification methods is presented in Table 2:




**Table 2.** Configuration of exploited algorithms' parameters.

As mentioned before, there are 10 different pairs of algorithms that can be formed from a pool of five candidate classifiers. In addition, the case of applying each one individually is considered, as well as the case where all participants of pool C are exploited under the same Voting stage. Thus, 16 self-training variants and 16 co-training variants are examined against the proposed co-training algorithm, which selects, through a static selection strategy, the soft-Voting ensemble base learner per task. As concerns the parameter f, it has been set equal to 10, leading to a 10-fold cross-validation procedure per examined dataset. The next tables depict only one of the three different labeled ratio scenarios, concerning the top five algorithms based on the overall Friedman Ranking statistical process, along with a smaller statistical comparison concerning only these top five algorithms. For a deeper analysis, the complete results can be found at http://mL.math.upatras.gr/wp-content/uploads/2019/12/Official_results_co_training_ssoftec_voting.7z. Moreover, a pie chart is provided in Figure 1, depicting the participation, in percentage terms, of the pairs of classifiers that were employed as base learners by the proposed strategy during all the experiments.

**Figure 1.** Pie chart depicting the participation of each combination inside our Static Ensemble strategy.

For evaluating the predictive performance of the proposed algorithm, three representative and widely used evaluation measures were adopted for measuring the obtained performance over the test set: classification accuracy, F1-score, and Area Under the ROC Curve (AUC). Accuracy corresponds to the percentage of correctly classified instances, while F1-score is an appropriate metric for imbalanced datasets and is defined as the harmonic mean of recall (r) and precision (p). In the case of a binary classification problem, they are defined as:

$$Accuracy = (tp + tn)/n, \tag{6}$$

$$F_1\text{-score} = 2 \times tp/(2 \times tp + fp + fn), \tag{7}$$

where *tp*, *tn*, *fp*, *fn*, and *n* correspond to the number of true positive, true negative, false positive, false negative, and total number of instances, respectively. Finally, the AUC is related to the quality of the ranking the examined classifier produces for any randomly chosen instance and is computed by aggregating the corresponding performance across all possible classification thresholds. The most common way to visualize this metric is through plots of the *TPR* vs. *FPR*, or Sensitivity vs. (1 − Specificity), relationship at different classification thresholds, where *TPR* stands for True Positive Rate and *FPR* for False Positive Rate. Their analytical formulas are provided here:

$$TPR = tp/(tp + fn),\tag{8}$$

$$\text{FPR} = fp/(fp+tn). \tag{9}$$

The experimental results using a 10% labeled ratio are summarized in Tables 3–5, where the best value per dataset is highlighted in bold. Overall, it appears that the co-training Vote variants perform better than the corresponding self-training variants. Moreover, among the co-training variants employed, the proposed algorithm takes precedence over the rest on most of the datasets. In addition, we applied a familiar statistical tool to confirm the observed results. Hence, the Friedman Aligned Ranks [56] non-parametric test (significance level α = 0.05) was used to compare all the employed SSL methods (Table 6). According to the calculated results, the algorithms are sorted from the best performer (lowest ranking) to the worst one (highest ranking). Therefore, the supremacy of the *Co*(*Vote*SECsoft) algorithm is statistically confirmed, while the null hypothesis *H*0 (i.e., that the means of the results of two or more algorithms are the same) is rejected. Furthermore, the Nemenyi post-hoc test [57] (α = 0.05), a commonly used non-parametric test for pairwise multiple comparisons, was applied to detect the specific differences between the algorithms. Table 6 includes the computed Critical Difference (CD), which is the same for all the cases of this R-based scenario (CD = 2.27). It is statistically confirmed that the difference between the *Co*(*Vote*SECsoft) algorithm and the majority of the other methods is statistically significant for all examined metrics, thus verifying the predominance of the proposed co-training scheme. The fact that the proposed algorithm also outperforms the Vote(all) variants means that the implemented time-efficient SEC strategy provides a more accurate base learner for the field of SSL. In this direction, we visualize the performance of the proposed algorithm against *Co*(*Vote*(*all*)) for the examined metrics in the case of R = 10% via violin plots, which facilitate the comparison of the distribution of the achieved values per algorithm and also depict some important statistical quantities: median, interquartile range, and 1.5× interquartile range (Figure 2). Therefore, we can deduce experimentally the success of the proposed approach, especially when generic binary datasets constitute the main issue to be tackled and the collected labeled instances are severely limited in number.

**Figure 2.** Violin plots of the proposed algorithm against *Co*(*Vote*(*all*)) approach over the three examined metrics.


**Table 3.** Classification accuracy (±stdev) values for the best five variants (labeled ratio 10%).

Bold highlighted means the best value per dataset.


**Table 4.** F1-score (±stdev) values for the best five variants (labeled ratio 10%).

Bold highlighted means the best value per dataset.

**Table 5.** Area Under the ROC Curve (AUC) (±stdev) values for the best five variants (labeled ratio 10%).


Bold highlighted means the best value per dataset.


**Table 6.** Friedman Rankings for all examined algorithms and statistical importance based on Nemenyi post-hoc test.

More dedicated preprocessing stages, oriented towards specific problems, could be adopted in order to boost the performance of the SSoftEC strategy and provide the co-training scheme with a more appropriate base learner [58–60]. However, in our generic experimental study, which covers various applications, the proposed algorithm recorded a both robust and sufficiently accurate performance—especially for the F1-score metric, which is critical for real problems whose class distribution departs from the balanced case—in a computationally inexpensive manner, in contrast with DEC strategies that perform a new classifier search per test instance. The smoothing of the decisions produced by the proposed soft-Voting ensemble seems to favor the exported decision profile, since a large number of decisions that were initially misclassified based on individual predictions were reverted towards the ground-truth label, while, at the same time, numerous cases where the two participants disagreed over the binary label were not adversely affected. This happens because a large correct confidence value combined with a smaller incorrect one still yields the correct decision under such a voting scheme, according to Equation (3).
