1. Introduction
Breast cancer, which develops from breast tissue, is one of the major problems in the medical domain. It is the second most severe cancer among all known cancers. Several factors have been found to contribute to breast cancer, such as obesity, a lack of physical exercise, alcoholism, hormone replacement therapy during menopause, ionizing radiation, and a family history of breast cancer [1]. In practice, many medical institutes have paid close attention to the early detection of breast cancer.
In the related literature, many data mining and machine learning techniques have been used to develop breast cancer prediction models. Some studies focus on improving the learning models, whereas others focus on the data pre-processing steps. For example, convolutional neural networks (CNNs), a representative deep learning technique, have been modified to improve their prediction performance [2,3]. Other studies focus on feature selection, which filters out irrelevant features from a given dataset in order to construct more effective classifiers [4,5], and on data sampling, which re-balances class imbalanced datasets in order to reduce the effect of a skewed class distribution on the learning process [6,7].
Regarding related work on feature selection, Sasikala et al. [8] propose a novel feature selection method based on the genetic algorithm to select a gene subset from high dimensional gene data, which allows different classifiers to perform better than those trained without feature selection. In [9], a genetic algorithm is used for feature selection, and the selected subset is used to construct different classifiers for performance comparison. Jiang and Jin [10] use a gradient boosting decision tree with Bayesian optimization to remove irrelevant and redundant features from gene expression data. Raj et al. [11] compare several feature selection methods to determine the best one to combine with the random forest classifier.
Regarding related work on class imbalance learning, Zhang et al. [12] propose a clustering-based under-sampling method to select informative samples from the clusters identified in the majority and minority classes, and a decision tree combined with boosting is employed as the prediction model. In [13], eighteen different under- and over-sampling methods are used to balance class imbalanced cancer datasets, and the over-sampling methods perform better than the under-sampling ones. Cai et al. [14] apply the synthetic minority over-sampling technique (SMOTE) to balance the training dataset and employ the stacking ensemble method to combine multiple classifiers, which achieves better performance than conventional methods. Rani et al. [15] investigate the effect of performing SMOTE on five different classifiers to determine the best one for breast cancer prediction.
According to Fernandez et al. [16], SMOTE over-sampling can benefit from feature selection: feature selection is first performed over the class imbalanced dataset to select a feature subset, and the reduced dataset is then over-sampled so that the majority and minority classes contain the same number of data samples. More recently, Solanki et al. [17] propose the opposite procedure, in which SMOTE is performed first to re-balance the breast cancer dataset, and wrapper-based feature selection methods are then applied to reduce the feature dimensionality.
However, to the best of our knowledge, no study has examined the performance of both procedures for combining feature selection and over-sampling for breast cancer prediction. Therefore, the research objective of this paper is to compare these two combination orders with two baselines that employ feature selection and over-sampling individually. In particular, filter and wrapper-based feature selection methods are combined with SMOTE for performance comparison. In addition, one small-scale and one large-scale breast cancer dataset are used in order to understand the performance of the different approaches.
The contribution of this paper is two-fold. First, the procedures for combining the feature selection and over-sampling steps are compared in terms of breast cancer prediction, which, to the best of our knowledge, has not been done before. Second, the best combination procedure and combined algorithms identified in this paper can be used as one of the representative baseline methods for future research.
The rest of this paper is organized as follows. Section 2 overviews related literature on feature selection and over-sampling. Section 3 describes the two different combination procedures and the experimental setup. Section 4 presents the experimental results, and Section 5 concludes the paper.
2. Literature Review
2.1. Feature Selection
Feature selection is an important data pre-processing step in data mining and knowledge discovery from databases. It focuses on selecting representative features from a given training set, i.e., those with higher discriminative power, so that classifiers are better able to distinguish between different classes. Another advantage of feature selection is that it reduces feature dimensionality, which lowers the computational complexity of the classifier training stage [5,18].
In general, feature selection algorithms are composed of four basic steps: a generation procedure to generate the candidate feature subset, an evaluation function to evaluate the effectiveness of the feature subset, a stopping criterion to determine when to stop the previous steps, and a validation procedure to examine whether the feature subset is valid [19].
Existing feature selection algorithms can be divided into filter, wrapper, and embedded methods depending on how they combine the feature selection search with the construction of the classifiers. In filter methods, the relevance of features is assessed by measures such as distance, consistency, dependency, information, and correlation. That is, a relevance score is calculated for each feature, and low-scoring features are removed. Some representative methods include Relief, the Fisher score, and information gain.
In wrapper methods, a specific classification algorithm is used to determine the quality of different subsets of features. Since the space of feature subsets can grow exponentially with the number of features, heuristic search methods are used to guide the search for an optimal subset. Therefore, wrapper methods are very computationally intensive, especially when the construction of the chosen classifier requires a high computational cost. One representative wrapper method is the genetic algorithm.
In embedded methods, feature selection is incorporated as part of the classifier training process. That is, the feature selection method is embedded in the modeling algorithm, where the classifier is used to evaluate the quality of the selected subset of features. Embedded methods have the advantage of including interaction with the classification model, while at the same time being far less computationally intensive than wrapper methods. One representative embedded method is the decision tree classifier.
2.2. Over-Sampling
In practice, the class imbalance problem occurs when the number of data samples in one class is significantly different from that in the other class, e.g., an imbalance ratio of 1:100. Breast cancer datasets, for example, usually do not contain equal numbers of malignant and benign patients, which are denoted as the minority and majority classes, respectively. If the class imbalance problem is not dealt with, most machine learning models aim at maximizing the accuracy of their classification rule by ignoring the minority class examples, which can result in all testing examples being classified into the majority class [6].
In general, there are three types of solutions to the class imbalance problem: algorithm level, data level, and cost-sensitive methods. Among them, the data level methods, which are based on data sampling techniques, are usually considered first since they can be used independently of the classifier [6]. Data sampling techniques focus on re-balancing the given training set. In particular, under- and over-sampling techniques have been used, where the former reduces the size of the majority class and the latter enlarges the size of the minority class. Among the over-sampling techniques, the synthetic minority over-sampling technique (SMOTE) is one representative method, which has been used as the baseline in many related studies [16].
The aim of SMOTE is to produce new synthetic examples for the minority class. For example, a minority class instance i is selected as the basis to create new synthetic data. According to a specific distance metric, usually the Euclidean distance, several of the nearest neighbors of i are chosen from the minority class samples in the training set, e.g., three neighbors. Next, a randomized interpolation between i and each chosen neighbor is conducted to obtain new synthetic data, i.e., one synthetic example per chosen neighbor.
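The interpolation step described above can be illustrated with a minimal sketch. It is not the WEKA implementation used in this paper; the function name `smote_samples`, the parameter `n_neighbors`, and the choice of one synthetic example per neighbor are assumptions made for illustration only.

```python
import math
import random


def smote_samples(minority, basis_index, n_neighbors=3, rng=None):
    """Create one synthetic sample per chosen neighbor of one minority instance.

    `minority` is a list of feature vectors from the minority class only.
    All names and defaults here are illustrative assumptions.
    """
    rng = rng or random.Random(0)
    basis = minority[basis_index]
    # Rank the other minority samples by Euclidean distance to the basis.
    others = [x for j, x in enumerate(minority) if j != basis_index]
    others.sort(key=lambda x: math.dist(basis, x))
    synthetic = []
    for neighbor in others[:n_neighbors]:
        gap = rng.random()  # random interpolation weight in [0, 1)
        # The new point lies on the segment between the basis and the neighbor.
        synthetic.append([b + gap * (n - b) for b, n in zip(basis, neighbor)])
    return synthetic
```

Each synthetic example lies on the line segment between the basis instance and one of its neighbors, so the minority class is enlarged without exact duplication of existing samples.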
3. Research Methodology
3.1. Two Combination Orders for Feature Selection and Over-Sampling
In this paper, two orders of combining the feature selection and over-sampling steps are compared. Consider a training set, denoted as TR, which is composed of M majority class and N minority class data samples, where each data sample is represented by k dimensional features. For the first order, i.e., performing feature selection first and over-sampling second, a chosen feature selection algorithm is employed to select some representative features from TR. As a result, a reduced feature subset of TR is produced, denoted as TR_FS, where each data sample is represented by o dimensional features (k > o). Next, the over-sampling algorithm is used to generate M − N synthetic data samples for the minority class, leading to a balanced training set, denoted as TR_FS+OS, which is composed of 2M data samples. That is, the numbers of data samples in the majority and minority classes are the same.
On the other hand, for the second combination order, the over-sampling algorithm is used first to re-balance the training set TR, which results in a balanced training set, denoted as TR_OS. TR_OS is composed of 2M data samples, and each data sample is represented by k dimensional features. Next, the chosen feature selection algorithm is performed over TR_OS, leading to a reduced feature subset of TR_OS, denoted as TR_OS+FS. In TR_OS+FS, each data sample is represented by p dimensional features (k > p). Note that the number of features in TR_FS+OS by the first combination order and in TR_OS+FS by the second combination order are not necessarily the same, i.e., o and p may differ.
Therefore, the performances of the classifiers trained by TR_FS+OS and TR_OS+FS can be compared directly based on the same testing set. Moreover, classifiers trained by TR_FS through performing feature selection alone and by TR_OS through performing over-sampling alone are regarded as the baseline approaches for further performance comparison.
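The two combination orders can be sketched as two small pipelines that differ only in the order of the calls. The sketch below is illustrative and rests on stated assumptions: `toy_select` (a class-mean-gap filter) and `toy_oversample` (random duplication of minority rows) are hypothetical stand-ins, not the IG, GA, or SMOTE methods used in the experiments, and the label 1 is assumed to mark the minority class.

```python
import random


def project(X, kept):
    """Keep only the selected feature columns."""
    return [[row[j] for j in kept] for row in X]


def fs_then_os(X, y, select, oversample):
    """First order: select features on TR, then balance the reduced set."""
    kept = select(X, y)
    X_bal, y_bal = oversample(project(X, kept), y)
    return X_bal, y_bal, kept


def os_then_fs(X, y, select, oversample):
    """Second order: balance TR, then select features on the balanced set."""
    X_bal, y_bal = oversample(X, y)
    kept = select(X_bal, y_bal)
    return project(X_bal, kept), y_bal, kept


def toy_select(X, y, top=2):
    """Hypothetical filter: rank features by the gap between class means."""
    def class_mean(j, label):
        vals = [row[j] for row, c in zip(X, y) if c == label]
        return sum(vals) / len(vals)
    gaps = [abs(class_mean(j, 1) - class_mean(j, 0)) for j in range(len(X[0]))]
    ranked = sorted(range(len(gaps)), key=lambda j: gaps[j], reverse=True)
    return sorted(ranked[:top])


def toy_oversample(X, y, seed=0):
    """Hypothetical over-sampler: duplicate random minority rows (label 1)."""
    rng = random.Random(seed)
    minority = [row for row, c in zip(X, y) if c == 1]
    n_extra = sum(1 for c in y if c == 0) - len(minority)
    extra = [list(rng.choice(minority)) for _ in range(n_extra)]
    return X + extra, y + [1] * n_extra
```

Both functions return the indices of the kept features so that the same columns can be retained in the testing set, which is itself neither over-sampled nor used for selection.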
3.2. Experimental Setup
3.2.1. Datasets
In order to examine the performances of both orders of combining feature selection and over-sampling, two breast cancer datasets are considered. The first one is based on the KDD Cup 2008 breast cancer dataset (https://www.kdd.org/kdd-cup/view/kdd-cup-2008 (accessed on 15 February 2021)), which contains 102,294 data samples. Each data sample is represented by 117 different image features, which are extracted from 4 X-ray images per patient. The class imbalance ratio is 163.2.
The second dataset is based on the Breast Cancer Wisconsin Dataset downloaded from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29 (accessed on 15 February 2021)). It is composed of 699 data samples, in which each data sample is represented by 10 features including clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses. In addition, the class imbalance ratio is 1.86.
To train and test the classifier, 5-fold cross validation is used, in which each dataset is divided into five subsets, and in each fold 80% of the data are used for training and the remaining 20% for testing. This means that every subset is used for testing once and for training four times, and the average prediction accuracy over the five folds is obtained. In other words, each patient's data are used as both training and testing examples. In addition, the class imbalance ratio of the training set in each fold is controlled to be the same as that of the original dataset.
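The fold construction described above can be sketched as a stratified split, in which each class is distributed evenly over the five folds so that every training set keeps approximately the original class imbalance ratio. This is a minimal illustration rather than the WEKA procedure; the function names and the round-robin assignment are assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign sample indices to folds, class by class, in round-robin order."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds


def train_test_splits(labels, n_folds=5, seed=0):
    """Yield (train_indices, test_indices); each fold is the test set once."""
    folds = stratified_folds(labels, n_folds, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Feature selection and over-sampling would then be applied to the training indices of each split only, with the held-out fold left untouched for testing.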
3.2.2. The Feature Selection and Over-Sampling Methods
In this paper, the information gain (IG) as the filter method and the genetic algorithm (GA) as the wrapper method are used for feature selection. These two methods have been used in many research problems, including text classification [20], gene expression microarray analysis [21], intrusion detection [22], financial distress prediction [23], and software defect prediction [24].
IG evaluates the gain of each variable in the context of the target variable, which is based on calculating the reduction in entropy. That is, the feature ranking stage ranks the features by their information gain in decreasing order. In GA, an initial set of candidate solutions (i.e., individuals) is created, and their corresponding fitness values are calculated for the later crossover and mutation steps. Specifically, the individuals are subsets of predictors, and the fitness values are measures of the model performance.
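As an illustration of the entropy reduction behind IG, the following minimal sketch computes the information gain of discrete features and ranks them in decreasing order. It assumes that the feature values are already discrete (continuous features would need to be discretized first), and the function names are illustrative rather than taken from WEKA.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in class entropy after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(X, y):
    """Return feature indices sorted by information gain, highest first."""
    n_features = len(X[0])
    gains = [information_gain([row[j] for row in X], y)
             for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: gains[j], reverse=True)
```

A filter method would then keep the top-ranked features or those whose gain exceeds a chosen threshold.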
Analyses were performed using the WEKA data mining software package. Most related parameters were set to their default values, except for the genetic algorithm, where the population size, crossover rate, and mutation rate were set to 50, 0.8, and 0.01, respectively [25].
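To show how the population size, crossover rate, and mutation rate enter a GA-based feature search, a minimal sketch is given below. Individuals are binary masks over the features. The binary tournament selection, single-point crossover, the number of generations, and the user-supplied `fitness` callable are assumptions made for illustration and do not describe the internals of the WEKA implementation.

```python
import random


def ga_select(n_features, fitness, pop_size=50, crossover_rate=0.8,
              mutation_rate=0.01, generations=20, seed=0):
    """Search for a binary feature mask that maximizes `fitness(mask)`.

    Requires n_features >= 2. `generations` is an assumed setting.
    """
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        scored = [(fitness(ind), ind) for ind in pop]

        def pick():
            # Binary tournament: the fitter of two random individuals wins.
            a, b = rng.sample(scored, 2)
            return (a if a[0] >= b[0] else b)[1]

        next_pop = []
        while len(next_pop) < pop_size:
            p1, p2 = pick()[:], pick()[:]
            if rng.random() < crossover_rate:
                # Single-point crossover swaps the tails of the two parents.
                cut = rng.randint(1, n_features - 1)
                p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (p1, p2):
                # Flip each bit independently with the mutation probability.
                next_pop.append([1 - g if rng.random() < mutation_rate else g
                                 for g in child])
        pop = next_pop[:pop_size]
        best = max(pop + [best], key=fitness)
    return best
```

In a wrapper setting, `fitness` would train and evaluate the chosen classifier on the features selected by the mask, which is why wrapper methods are computationally intensive.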
The over-sampling method is based on SMOTE, which has been widely used as a baseline over-sampling method for breast cancer datasets [14,15,16,17]. The percentage of synthetic instances was set so that both datasets became balanced, i.e., the malignant and benign classes contained the same number of data samples. Other related parameters were based on the default values of WEKA.
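The percentage needed to balance a dataset follows directly from the class sizes. The short sketch below assumes that the percentage is expressed relative to the size of the minority class; the function name is illustrative.

```python
def smote_percentage(n_majority, n_minority):
    """Percentage of synthetic minority samples needed to balance the classes."""
    return 100.0 * (n_majority - n_minority) / n_minority
```

Under this assumption, a dataset with imbalance ratio r requires a percentage of 100(r − 1), i.e., about 86% for a ratio of 1.86 and about 16,220% for a ratio of 163.2.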
3.2.3. The Classifier Design
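After the original training set TR was pre-processed by the different approaches, i.e., feature selection alone (TR_FS), over-sampling alone (TR_OS), feature selection followed by over-sampling (TR_FS+OS), and over-sampling followed by feature selection (TR_OS+FS), the resulting training sets were used to train the support vector machine (SVM) classifier for performance comparisons. In the related literature, SVM has been widely used as the baseline classifier for breast cancer prediction [26,27,28,29].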
The implementation of SVM was based on the RBF kernel function, and its related parameters were based on the default values of WEKA.
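For reference, the RBF kernel measures the similarity of two samples as a decreasing function of their squared Euclidean distance. The sketch below shows the standard form; the value of `gamma` is left to the caller because the default used in the experiments is not stated here.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and z)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)
```

Identical samples give a kernel value of 1, and the value decays towards 0 as the samples move apart, at a rate controlled by `gamma`.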
5. Conclusions
Feature selection aims at selecting representative features from a given training set, whereas over-sampling is for re-balancing the class imbalanced training set. In this paper, the two orders of combining feature selection and over-sampling for breast cancer prediction are compared in terms of classification accuracy. In order to assess the performances of the different combination approaches, the information gain (IG) and the genetic algorithm (GA) as the filter and wrapper-based feature selection methods and the synthetic minority over-sampling technique (SMOTE) were employed to create the combinations. Moreover, two breast cancer datasets with significantly different class imbalance ratios and numbers of features were used for the experiments.
Regarding the experimental results, for the highly imbalanced dataset containing a large number of features, performing both feature selection and over-sampling allows the SVM classifier to provide higher AUC rates than performing feature selection or over-sampling alone, as well as the baseline. In particular, it is recommended to execute feature selection first and over-sampling second. In contrast, for the dataset with the low imbalance ratio and small number of features, performing over-sampling alone is the better choice.