2.1. Evaluation of Oversampling Techniques with Ensemble Learning
This section provides an empirical study evaluating several oversampling algorithms, namely, SMOTE, BorderlineSMOTE, SVM-SMOTE, and KMeanSMOTE (SMOTE-KN). SMOTE [23] creates artificial samples for the minority class, thereby effectively addressing the imbalance issue without relying on simple random oversampling with replacement. The creation of synthetic instances through SMOTE depends on the resemblances within the feature space of existing minority instances. SMOTE randomly selects a data point x from the minority class S_min. It then identifies the k-nearest neighbors of the selected data point (where k is an integer) and creates a new data point by linearly interpolating between the selected point and one of its k-nearest neighbors, i.e., x_new = x + δ(x_nn − x), where x_nn is a randomly chosen neighbor and δ is a random value in [0, 1]. This is repeated until the desired number of artificial samples is created.
Figure 1 depicts an illustration of the SMOTE algorithm.
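The interpolation step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the imbalanced-learn implementation; the toy minority matrix and the `smote_sample` helper are hypothetical:

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=10, rng=None):
    """Generate n_new synthetic minority samples by linearly interpolating
    between a randomly selected point and one of its k nearest minority
    neighbors (the core SMOTE step)."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                       # random minority point
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]             # k nearest neighbors, self excluded
        j = rng.choice(nbrs)
        delta = rng.random()                      # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + delta * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_min = np.random.default_rng(0).normal(size=(20, 4))  # toy minority class
X_new = smote_sample(X_min, k=5, n_new=30, rng=1)
print(X_new.shape)  # (30, 4)
```

Each synthetic point lies on the line segment between an existing minority sample and one of its neighbors, which is why SMOTE stays within the minority class's feature-space neighborhood.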
BorderlineSMOTE [
24] represents a modification of the original SMOTE algorithm. It involves the identification of borderline samples, which are subsequently utilized for the generation of new synthetic samples. It has two variations: SMOTE-B1 and SMOTE-B2; both of them are evaluated in this study. SVM-SMOTE [
25] is another variation of the original SMOTE. It employs an SVM classifier to identify support vectors and utilizes this information to generate new samples. Another variation of SMOTE is KMeanSMOTE [
26], which employs a K-Means clustering technique prior to the application of SMOTE. Through clustering, samples are grouped together, and new samples are generated based on the density of each cluster. All the evaluated oversampling techniques are implemented using the imbalanced-learn library [
27]. The oversampling techniques are applied only to the training datasets, since it is not practical to create synthetic testing data and assess models using synthetic test samples [
28]. Once the oversampling techniques are applied, the quantity of training samples for every class across all datasets is adjusted to match the highest counts of the majority classes in the original datasets. A recently published dataset called Curated Microarray Database (CuMiDa) [
29] was used for the empirical analysis. CuMiDa provides a gene expression dataset for leukemia composed of 22,283 genes and 64 samples distributed among the five leukemia types presented in
Table 2.
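The practice of oversampling only the training split can be sketched as follows. This illustrative Python sketch uses random oversampling with replacement as a simple stand-in for SMOTE, and the toy data and class sizes are hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced data: 50 majority samples (class 0), 10 minority (class 1).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = np.array([0] * 50 + [1] * 10)

# Split first: the held-out test set keeps the original class distribution.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Oversample only the training classes up to the majority-class count
# (random oversampling with replacement as a stand-in for SMOTE).
maj = max(np.bincount(y_tr))
parts_X, parts_y = [], []
for c in np.unique(y_tr):
    Xc = resample(X_tr[y_tr == c], n_samples=maj, replace=True, random_state=0)
    parts_X.append(Xc)
    parts_y.append(np.full(maj, c))
X_bal, y_bal = np.vstack(parts_X), np.concatenate(parts_y)
print(np.bincount(y_bal))  # training classes are now balanced
```

Splitting before oversampling ensures that no synthetic (or duplicated) sample leaks into the evaluation data.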
Different ensemble-based learning models were generated and evaluated using bagging, random forests, stacking, voting, and boosting.
Table 3 presents a description of the evaluated ensemble learning models implemented using scikit-learn [
30]. As for the evaluation method, a 10-fold cross-validation approach was adopted, serving the dual purpose of benchmarking our results and facilitating comparisons with the existing literature. Principal Component Analysis (PCA) was additionally employed to condense the input features to only 50 components.
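Assuming a scikit-learn setup, the PCA-plus-ensemble evaluation under 10-fold cross-validation might look like the following sketch. The synthetic data stand in for the CuMiDa expression matrix, and random forests are shown as one representative ensemble:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a gene expression matrix (genes as features).
X, y = make_classification(n_samples=100, n_features=500, n_informative=20,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

# PCA condenses the inputs to 50 components before the ensemble classifier;
# placing PCA inside the pipeline refits it on each training fold.
model = make_pipeline(PCA(n_components=50), RandomForestClassifier(random_state=0))
scores = cross_val_score(model, X, y, cv=10)
print(scores.mean())
```

Keeping PCA inside the cross-validation pipeline avoids fitting the projection on data that later serves as a test fold.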
Table 4 presents the obtained results. The highest results were obtained using SVM-SMOTE with the random forests and hard voting classifiers, each reporting 100% accuracy.
For the empirical analysis of the evaluated oversampling techniques, we also evaluated their effects and compared them with the baselines. We considered the ensemble learning classifiers without oversampling as the baselines. As depicted in
Figure 2, applying the oversampling techniques had a positive impact on all the evaluation measures in all but two cases. The first was applying SMOTE with the stacking-based ensemble classifier, where the results dropped by around 1.5%. The second was applying SMOTE with the hard voting-based ensemble classifier.
2.2. Evaluation of Proposed Feature Selection Method
This section presents an empirical study of the proposed feature selection technique, which combines the chi-square (ChiS) and information gain (IG) methods, under two scenarios: one involving the application of the oversampling technique and the other without it. The evaluation is conducted using three gene expression datasets, as described in
Table 5 and
Table 6. The empirical analysis employs three classifiers—namely, multilayer perceptron (MLP), sequential minimal optimization support vector machines (SMO-SVM), and random forests.
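A sketch of one way to combine chi-square and information-gain rankings is to keep the genes ranked highly by both criteria. The intersection rule, the cutoff k, and the synthetic data below are assumptions for illustration, not necessarily the paper's exact combination scheme:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2, mutual_info_classif

# Synthetic stand-in for an expression matrix with 300 candidate genes.
X, y = make_classification(n_samples=80, n_features=300, random_state=0)
X = X - X.min(axis=0)          # chi-square requires non-negative features

k = 100                        # top-k genes per ranker (illustrative cutoff)
chi_scores, _ = chi2(X, y)
ig_scores = mutual_info_classif(X, y, random_state=0)

top_chi = set(np.argsort(chi_scores)[-k:])
top_ig = set(np.argsort(ig_scores)[-k:])
selected = sorted(top_chi & top_ig)   # genes ranked highly by both criteria
print(len(selected))
```

An intersection-style combination naturally yields a smaller gene set than either ranker alone, consistent with the 300-versus-233 gene counts reported later.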
For the MLP classifier, an extensive exploration of various neural network architectures was conducted. This exploration involved experimenting with different configurations, including varying the number of neurons in the hidden layers and adjusting the learning rates. The goal was to systematically analyze how these architectural parameters impact the performance of the MLP classifier in our study. In the first network configuration, the count of hidden layer neurons h is determined by computing the average of the input and output dimensions such that h = ⌈(n_inputs + n_classes)/2⌉. As an illustration, if there are 300 genes in the input and seven classes in the output (representing the number of distinct classes), the number of neurons in the hidden layer would be ⌈(300 + 7)/2⌉ = 154. Regarding the learning rate, a fixed value of 0.3 was employed.
We explored other architectures of the MLP by varying the number of neurons in the hidden layer. Specifically, we investigated configurations with 20, 50, and 80 neurons in the hidden layer. Additionally, we examined various learning rates, including 0.1, 0.3, and 0.5, thereby considering all the possible combinations of these parameters. Every MLP configuration was thoroughly examined using a momentum value of 0.2 in conjunction with a backpropagation learning algorithm.
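The hidden-layer sizing rule and the configuration grid above can be expressed directly. The function name `hidden_neurons` is ours, and the rounding up to the next integer is inferred from the 300-gene, 7-class example yielding 154:

```python
from itertools import product
from math import ceil

def hidden_neurons(n_inputs, n_outputs):
    """First configuration: hidden size is the average of the input and
    output dimensions, rounded up to an integer."""
    return ceil((n_inputs + n_outputs) / 2)

print(hidden_neurons(300, 7))  # 154

# Grid explored in the text: fixed hidden sizes x learning rates.
configs = list(product([20, 50, 80], [0.1, 0.3, 0.5]))
print(len(configs))  # 9 combinations
```

Every configuration in the grid would then be trained with backpropagation and a momentum of 0.2, as described above.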
In the case of the SVM classifier, we employed the SMO-SVM as an optimization algorithm used for training support vector machines [
31]. The SMO-SVM is a specific algorithm designed to efficiently solve the optimization problem associated with SVM training. The main idea behind SMO is to break down the large quadratic programming problem into a series of smaller subproblems that can be solved analytically and efficiently. This approach is particularly useful when dealing with large datasets and high-dimensional feature spaces. This investigation focused on assessing the performance of two distinct kernel functions: PUK and Poly-kernel. The complexity parameter (C) was set to one for both experiments. To address the multiclass problem, we applied pairwise classification, which is a technique commonly referred to as “one-vs-one”.
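A comparable configuration in scikit-learn, whose `SVC` is backed by libsvm's SMO-type solver, might look like this sketch. The synthetic data are illustrative, and the PUK kernel is omitted because it is not available in scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy three-class problem standing in for a gene expression dataset.
X, y = make_classification(n_samples=60, n_features=20, n_informative=5,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

# libsvm-backed SVC solves the SVM quadratic program with an SMO-type
# algorithm and handles multiclass problems pairwise (one-vs-one).
clf = SVC(kernel="poly", C=1.0, decision_function_shape="ovo").fit(X, y)
print(clf.decision_function(X[:1]).shape)  # one column per class pair: (1, 3)
```

For c classes, the one-vs-one scheme trains c(c − 1)/2 binary classifiers, which is why three classes yield three pairwise decision values.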
As ensemble learning classifiers, we applied random forests, which aggregate decision tree predictors. Each individual tree depends on the values of a random vector sampled independently, and all the trees within the forest share the same distribution [
32].
Our approach involves suggesting a homogeneous ensemble of k-NN classifiers for the classification of cancers. This ensemble leverages the concept of majority voting among the predicted labels generated by individual k-NN models. The majority voting algorithm aggregates the predictions from each of the 1-NN, 3-NN, and 5-NN models separately (where 1-NN stands for one nearest neighbor, 3-NN for three nearest neighbors, and so on). Subsequently, it assigns a label to a sample by considering the most frequently occurring prediction among these models. This approach was implemented using WEKA-3.6.13 [
33].
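Although the implementation above used WEKA, an equivalent homogeneous k-NN ensemble with hard majority voting can be sketched in scikit-learn (synthetic data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-in for a gene expression dataset.
X, y = make_classification(n_samples=100, n_features=20, random_state=0)

# Homogeneous ensemble: 1-NN, 3-NN and 5-NN combined by majority (hard) voting.
ensemble = VotingClassifier(
    estimators=[(f"{k}nn", KNeighborsClassifier(n_neighbors=k)) for k in (1, 3, 5)],
    voting="hard",
)
ensemble.fit(X, y)
print(ensemble.score(X, y))
```

With hard voting, each sample receives the label predicted most often among the three k-NN members, matching the majority-voting rule described above.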
The outcomes of our experiments are consolidated in
Table 7,
Table 8 and
Table 9 corresponding to MLP, SMO-SVM, and random forests, respectively.
Each table is structured to present the outcomes of individual feature selection without oversampling, individual feature selection with oversampling, and the proposed combined feature selection technique before and after the implementation of the oversampling technique. Additionally, each table is divided into three sections, with each section depicting the results for a specific dataset.
The most favorable performances are denoted by bold numbers, wherein we define the best performance as the one associated with the highest precision, recall, F measure, and accuracy. Subsequently, the number of selected genes was considered. In situations where the results achieved through various feature selection methods were equal, our preference leaned towards the technique with the smallest number of genes. For instance, in the Leukemia-subtype scenario, the results obtained using ChiS were identical to those achieved using the combined technique (ChiSIG). Nevertheless, we considered the outcomes obtained through the combined technique to be superior, because it excelled in terms of the number of selected genes. Specifically, while the ChiS resulted in 300 genes, the combined technique yielded a reduced set of 233 genes. This principle was applied consistently across the other cases as well.
In the case of the MLP, the best performance was achieved by employing the first structure described above for all datasets. For the SMO-SVM, the peak performance was observed when utilizing the poly kernel function for the Leukemia-subtype and Leukemia-ALLAML datasets and the PUK kernel function for the Colon dataset.
In the Leukemia-subtype dataset, the best performance was achieved when utilizing our proposed feature selection method in conjunction with the SMO-SVM. Similarly, for the Leukemia-ALLAML dataset, our feature selection technique yielded the best performance when employed alongside the MLP. For the Colon dataset, both the MLP and SMO-SVM exhibited their highest performance when our proposed feature selection technique was employed.
In order to assess the impact of the SMOTE techniques, we categorized their effects into three distinct types: positive influence, negative influence, and negligible influence. The experimental findings presented in
Table 7,
Table 8 and
Table 9 illustrate that in 19 out of the 27 scenarios, the application of the SMOTE technique had a favorable impact, thereby resulting in improved outcomes. Furthermore, in three instances, the SMOTE technique yielded adverse effects, while it showed no discernible impact in five other cases.
We investigated the sensitivity of the classifiers to the evaluated oversampling techniques. The experimental findings reveal that the random forests classifier exhibited the highest degree of positive improvement. This sensitivity analysis is visually represented in
Figure 2, which demonstrates that the overall average improvement in all evaluation measures achieved through the application of the evaluated oversampling techniques was more pronounced in the case of the random forests.
We conducted additional experiments involving random forests and an ensemble of k-NN classifiers. For these experiments, we employed the ChiS feature selection method within a 10-fold cross-validation framework while applying the SMOTE. We conducted an examination to identify the most informative genes within the three datasets.
Figure 3 presents the correlation between the number of these informative genes and the associated accuracies when employing the random forests classifier. In the case of the Leukemia-subtype dataset, it is evident that the random forests classifier achieved the highest accuracy, reaching 96.56%, with a gene set comprising 90 genes. Shifting to the Leukemia-ALLAML dataset, it is apparent that the random forests classifier achieved a perfect accuracy rate of 100% using a gene set containing 50 genes. Finally, in the context of the Colon dataset, it is clear that the highest accuracy achieved by the random forests classifier stood at 92.50%, and this performance was maintained with a gene set consisting of 60 genes or more.
Table 10 provides an overview of the top-performing results achieved using the random forests classifier for the most significant gene sets.
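The gene-count sweep described above can be sketched as follows, using chi-square ranking plus 10-fold cross-validated random forests on synthetic stand-in data; the gene counts shown are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import chi2
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for an expression matrix with 200 candidate genes.
X, y = make_classification(n_samples=80, n_features=200, random_state=0)
X = X - X.min(axis=0)                       # chi-square needs non-negative inputs
ranking = np.argsort(chi2(X, y)[0])[::-1]   # genes ordered by chi-square score

# Sweep the number of top-ranked genes and record cross-validated accuracy.
results = {}
for n_genes in (10, 30, 50, 90):
    Xs = X[:, ranking[:n_genes]]
    clf = RandomForestClassifier(random_state=0)
    results[n_genes] = cross_val_score(clf, Xs, y, cv=10).mean()
print(results)
```

Plotting accuracy against the gene count, as in Figures 3 and 4, then identifies the smallest gene set that sustains the peak accuracy.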
The same process described above was also implemented when using an ensemble of k-NN classifiers in lieu of random forests.
Figure 4 displays the relationship between the number of the most-informative genes and their corresponding accuracies when employing the ensemble of k-NN classifiers. In the case of the Leukemia-subtype dataset, the highest accuracy reached was 93.85%, which was achieved with a gene set containing 100 genes. Shifting to the Leukemia-ALLAML dataset, the highest accuracy obtained was 98.94% utilizing a gene set comprising 60 genes. Finally, in the context of the Colon dataset, it is apparent that the highest accuracy achieved stood at 92.50%, and this performance was maintained with a gene set consisting of 30 genes or more.
Table 11 presents the top-performing results obtained with the ensemble of k-NN classifiers for the most-prominent gene sets.
It is imperative to contextualize our results within the broader research landscape by conducting a thorough comparison of our obtained results with those of related studies. This comparative analysis was guided by a set of well-defined criteria, as elaborated in
Table 12. Importantly, our findings stood out in this comparative analysis, as they consistently outperformed the results reported in the most closely related works. This not only underscores the robustness of our approach but also highlights its potential to make a significant contribution to the domain.