1. Introduction
The number of records and the size of databases holding medical data are both expanding. Various databases are constantly updated with medical data from measurements, tests, prescriptions, and other procedures. Traditional methodologies cannot analyze and search for interesting patterns and information within this enormous amount of data [
1]. In order to find useful information in massive data repositories, there is a growing need for cutting-edge methods and technologies. Cancer has emerged as one of the deadliest illnesses in recent years. Women all across the world are particularly susceptible to breast cancer. In the field of bioinformatics or medical research, the accurate diagnosis of certain vital information is a significant problem [
2]. Countless websites, hospitals, research facilities, and diagnostic organizations hold enormous amounts of medical diagnostic data. Classification is a complementary step toward automating and expediting disease diagnosis. Breast tumors, which can be benign or malignant, are most frequently caused by an overgrowth of the cells lining the breast ducts. A benign tumor forms when the cells grow abnormally. The most prevalent type of benign breast tumor is a fibroadenoma [
3]. This could need to be surgically removed to establish the diagnosis, and further therapy might not be required. Malignant tumors’ cancerous cells have the capacity to spread outside of the breast if untreated. If caught early enough, breast cancer is usually curable.
A lump or clump, changes in breast or nipple size or shape, changes in breast or nipple color, the possibility of nipple discharge, breast enlargement or thickness, or even persistent discomfort are some of the signs of breast cancer [
4,
5]. During a screening exam, it can be caught early using mammography or a portable cancer detection tool. According to the cancer stage, the tissues of breast cancer evolve as the condition worsens. The breast cancer stage (I–IV) indicates how far cancer has extended in that patient [
6].
Stage I: The tumors grow slowly and are unlikely to spread. This stage can often require surgery to cure.
Stage II: The tumors grow and spread very little, but this stage may come back after the treatment.
Stage III: The tumor cells divide rapidly, but no dead cells are found. This stage grows quickly.
Stage IV: The tumors are actively dividing, and they have both growth and dead tissues. In this stage, tumors can grow and spread quickly.
Breast cancer can take many different forms. There are many different ways to categorize and forecast this condition [
7].
It is well known that ensemble classifiers have outperformed single classifiers in most studies conducted over the past ten years. The motivation behind this study is to use a homogeneous ensemble, which combines one base method with at least two alternative configurations or variants, or combines one base learning model with one meta-ensemble, such as boosting.
The main contributions of this study are the following:
(1) We improved the quality of data in pre-processing using the SMOTE technique.
(2) We used the average splitting technique to randomly divide the dataset into smaller subsets and then individually model each division using the different classifiers.
(3) Utilizing the accuracy ranking technique to keep the best base models and discard the worst ones will help the suggested homogeneous ensemble technique perform better.
The article is organized as follows. The following section reviews the literature and related work on breast cancer detection. The third section presents the proposed model, and the next section provides the performance analysis of different ML algorithms and ensemble methods against various performance factors, namely precision, recall, and F1 measures.
2. Related Works
Some women may have an increased chance of developing breast disease due to genetic, lifestyle, weight, radiation, and regeneration factors. When the condition is identified quickly, the patient can be saved, thanks to advancements in the treatment of malignant growths [
8]. Improved imaging and ongoing staging have been demonstrated to enhance image resolution and lesion depiction.
Breast electrography, in addition to traditional ultrasonography (US) as well as mammography, provides information on breast abnormalities. It provides a non-invasive assessment of a lesion’s stiffness [
9]. The major findings suggest that it can improve on USG’s specificity and predictive value when characterizing breast masses. A lesion is visible on mammography or USG because its density and echogenicity differ from those of the surrounding breast tissue. A framework for breast illness prediction using a genetic-algorithm-based technique was offered, and it was noted that the prediction of breast cancer growth is still an underexplored area. In this article [
10], various AI algorithms are used to forecast the occurrence of breast illness. Decision trees, random forests, SVMs, neural networks, linear models, Adaboost, and naive Bayesian methods are used for forecasting. An engineering technique is used to improve the accuracy of predicting breast disease. The GA-based weighted average aggregation methodology for direct data collection is a recently developed method that circumvents the limitations of the traditional weighted average technique [
11]. A few models are predicted using the weighted average technique in combination with genetic algorithms.
SVM, MLP, and NN are just a few of the ML algorithms that are used to evaluate the Wisconsin Diagnostic Dataset [
12]. The aforementioned dataset includes features extracted from FNA (fine needle aspiration) scans of breast tumors. The dataset is divided into two parts for the ML algorithms: 70% for training and 30% for testing [
13]. Their findings revealed that all of the ML algorithms performed well on the binary classification of carcinomas, i.e., determining whether a tumor is benign or malignant. In this approach [
14], the statistical solutions to the classification problem are also satisfactory. It is advised to use a CV process, such as K-fold cross-validation, to validate this review’s outcomes [
15]. This will help identify the optimal hyperparameters for the ML algorithms and provide a more precise estimate of model prediction performance. In this paper [
16], ML techniques that improve diagnostic accuracy are examined. CART, RF, and K-nearest neighbors were the techniques considered. The dataset was obtained from the UCI (University of California Irvine) machine learning repository, which currently maintains 488 datasets of various characteristics.
It is frequently observed that the KNN algorithm performs significantly better than other comparable methods [
17]. The K-nearest neighbor model is the most trustworthy. Classification models such as random forest and boosted trees demonstrate comparable precision [
18]. Therefore, the most reliable classifier may be used to identify tumors so that treatment can be initiated at an early stage. Breast illness can be diagnosed early using several AI methods together with blood test data to rule out cancer. Four different AI algorithms are used in this article to identify cancer early on [
19].
Malignant tumors are cancerous, while benign tumors are not and are typically less dangerous. Benign tumors do not invade other tissues or organs [
20]. They can, however, grow to great proportions, which may put pressure on the surrounding region and cause discomfort and other complications. If not surgically removed, they are likely to remain in the same location, developing slowly and preserving the defined borders around the perimeter of the tumor [
7]. Malignant tumors, by contrast, can metastasize, or spread to new organs and tissues to form new tumors. Primary malignant tumors disseminate to secondary locations during metastasis [
21]. The liver, lungs, brain, and bones are frequently the sites of metastases that are established when cancer cells separate from the tumor and travel through the blood or lymphatic system [
22]. Malignant tumors must, therefore, be treated right away to prevent metastasizing. Depending on how far along the cancer is, the typical course of treatment includes surgical removal, radiotherapy, chemotherapy, or a combination of these.
We have developed the AdaBoost homogeneous ensemble strategy in this study and used this novel method to develop an accurate, automatic prediction model that can distinguish benign breast tumors from malignant ones. These forecasts aim to categorize patients into benign and malignant groups, helping individuals with benign breast tumors avoid or reduce the invasive treatments they may need to undergo. The suggested strategy enables us to utilize a number of ensemble approaches simultaneously to enhance the performance of the prediction system.
3. Proposed Methodology
The Breast Cancer Wisconsin (diagnostic) dataset was obtained via Kaggle. Ten independent variables—”Sample code number,” “Clump Thickness,” “Uniformity of Cell Size,” “Uniformity of Cell Shape,” “Marginal Adhesion,” “Single Epithelial Cell Size,” “Bare Nuclei,” “Bland Chromatin,” “Normal Nucleoli,” and “Mitoses”—are considered to be “X,” and one dependent variable—”Y”—consists of class labels [
23]. The dataset has roughly 137 samples. The data is separated into two groups: X contains all of the input variables, while Y contains the class label, which serves as the output variable [
24]. The proposed methodology is depicted in
Figure 1.
3.1. Data-Preprocessing
A class imbalance occurs when there is an unbalanced distribution of classes in a dataset, meaning that there are many more data points in the negative (majority) class than in the positive (minority) class. The classifier model’s performance will suffer if the skewed data are not corrected beforehand. The majority of the predictions will match the majority class, while the minority class features will be treated as data noise and ignored. The model will have a significant bias as a result. As mentioned, to achieve more accurate classification model results, the data are separated into training and test portions, with the training data comprising 70% of the total dataset and the test portion comprising the remaining 30%.
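The 70–30 holdout split described above can be sketched as follows (a minimal version; the helper name is illustrative, not code from the paper):

```python
import numpy as np

def holdout_split(X, y, train_frac=0.7, seed=42):
    """Shuffle the sample indices once, then cut at the 70% mark."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(train_frac * len(X))
    tr, te = idx[:cut], idx[cut:]
    return X[tr], X[te], y[tr], y[te]
```

Shuffling before cutting avoids any ordering bias in the original file (e.g., benign samples listed before malignant ones).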
Many classification algorithms strive to collect only pure examples for learning and make the boundary between each class as clear as possible to improve the prediction. Most classifiers find it significantly more difficult to learn how to classify synthetic cases close to the boundary than those far from it. Based on these results, the authors present an enhanced pre-processing method (A-SMOTE) for imbalanced training sets [
25]. SMOTE uses a k-nearest neighbor method to generate synthetic data. SMOTE begins by randomly selecting data from the minority class, after which the data’s k-nearest neighbors are determined [
26]. The k-nearest neighbor selected at random and the random data would then be combined to create synthetic data. The SMOTE method is explained in the section below.
Step A: Using Equation (1), a synthetic instance is created:
s = z + rand(0, 1) × (x − z) (1)
where x is a majority class sample, z is a minority class sample, and s is the newly created synthetic instance.
Step B: The steps mentioned below are carried out to remove the outliers, that is, noise.
If S = {s1, s2, …, sm} is the set of new instances received from Step A, then we calculate the distance between each si and the original minority samples zj, d(si, zj), defined using Equation (2):
d(si, zj) = ‖si − zj‖ (2)
where si and zj are the samples being compared and, as per Equation (2), L is calculated using Equation (3):
L = min over j of d(si, zj) (3)
Step C: Calculate the distance between each si and every original majority sample xj, d(si, xj), described using Equation (4):
d(si, xj) = ‖si − xj‖ (4)
where si and xj are the samples being compared and, as per Equation (4), H is computed using Equation (5):
H = min over j of d(si, xj) (5)
A synthetic instance whose distance H to the majority class is smaller than its distance L to the minority class lies closer to the majority region and is discarded as noise.
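A minimal sketch of SMOTE-style oversampling as described above, assuming Euclidean distance and a NumPy array of minority-class rows (the function name and parameters are illustrative, not from the paper):

```python
import numpy as np

def smote_sample(minority, k=5, n_new=100, seed=0):
    """Generate synthetic minority samples by interpolating toward a
    randomly chosen k-nearest neighbor, as in the basic SMOTE step."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        z = minority[rng.integers(len(minority))]      # random minority sample
        d = np.linalg.norm(minority - z, axis=1)       # distances to all others
        nn_idx = np.argsort(d)[1:k + 1]                # k nearest, skipping z itself
        zn = minority[rng.choice(nn_idx)]              # random neighbor
        out.append(z + rng.random() * (zn - z))        # interpolate between them
    return np.array(out)
```

Each synthetic point lies on the segment between a minority sample and one of its neighbors, so it stays inside the minority region of feature space.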
Entropy is nothing but the measure of disorder. Claude E. Shannon put this link between probability and heterogeneity or impurity into mathematical form with Equations (6) and (7):
H = −Σ (i = 1 to c) pi log2(pi) (6)
which for the binary case reduces to
H = −p log2(p) − (1 − p) log2(1 − p) (7)
The uncertainty or impurity of a category with likelihood pi is given by the log to base 2 of pi weighted by pi. The index
i runs over the potential categories, and since our issue is a binary categorization,
there are two categories in this instance. Equation (7) produces a symmetric curve: the probability of the occurrence is plotted on axis X, and the heterogeneity or impurity H is shown on axis Y. The sample graph of entropy is shown in
Figure 2.
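Shannon’s binary entropy described above can be computed directly:

```python
import math

def entropy(p):
    """Shannon entropy of a binary split with positive-class probability p."""
    if p in (0.0, 1.0):
        return 0.0                      # a pure node has zero disorder
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
```

The curve is symmetric about p = 0.5, where the impurity peaks at 1 bit, matching the shape plotted in Figure 2.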
3.2. Adaboost Classifier
Adaboost enables the fusion of many “weak classifiers” into a single, so-called “strong classifier”. These trees are often referred to as “Decision Stumps.” This method builds an initial model by giving each data point the same weight. The improperly categorized points are subsequently given more weight, and points with higher weights are given more significance in the following model. The models keep being trained until a reduced error is obtained. The pseudo code of the Adaboost classifier is depicted in Algorithm 1. Adaboost and LogitBoost, as depicted in Algorithms 1 and 2, are the two boosting algorithms that are utilized to induce the ADTree variations discussed in the following section.
Algorithm 1: Adaboost classifier—pseudo code
Input: Let D be the dataset that includes {(a1, b1), (a2, b2), …, (am, bm)};
Let λ be the learning (base) algorithm;
Let T be the total number of learning rounds.
Process:
D1(i) = 1/m // initialize a uniform weight distribution
for t = 1, …, T:
εt = Σ over {i : ht(ai) ≠ bi} of Dt(i) // weighted error measure of ht
ht = λ(D, Dt) // weak learner is trained with distribution Dt
αt = ½ ln((1 − εt)/εt) // weight assigned to weak learner ht
Dt+1(i) = (Dt(i)/Zt) ∗ exp(−αt bi ht(ai)) // Zt normalizes Dt+1 to a distribution
Outcome: H(a) = sign(Σ (t = 1 to T) αt ht(a))
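One round of the Adaboost weight update can be traced numerically (the labels and weak-learner predictions below are toy values, not from the dataset):

```python
import numpy as np

# toy labels b in {-1, +1} and one weak learner's predictions h
b = np.array([1, 1, -1, -1, 1])
h = np.array([1, -1, -1, -1, 1])        # one mistake, at index 1
D = np.full(len(b), 1 / len(b))         # uniform initial distribution

eps = D[h != b].sum()                   # weighted error of the weak learner
alpha = 0.5 * np.log((1 - eps) / eps)   # weight assigned to the learner
D = D * np.exp(-alpha * b * h)          # up-weight mistakes, down-weight hits
D /= D.sum()                            # renormalize to a distribution
```

The single misclassified point ends up carrying half of the total weight, so the next weak learner is forced to focus on it.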
Like the Adaboost, the LogitBoost also starts off with a uniform weight distribution. Additionally, it records the regression value G(a) and the probability prediction of the positive class p(a) at each boosting iteration.
Algorithm 2: LogitBoost
Input: Training dataset {(a1, b1), …, (am, bm)}, bi ∈ {−1, +1}
1. Initialize: zi* = (bi + 1)/2, G(a) = 0, and p(ai) = 0.5
2. For t = 1, …, T:
2.1a Calculate the working response as well as the weights:
zi = (zi* − p(ai)) / (p(ai)(1 − p(ai))), wi = p(ai)(1 − p(ai))
2.1b Fit the regression function ft(a) to zi by weighted least squares using the weights wi
2.1c Update G(a) ← G(a) + ½ ft(a) and p(a) = e^G(a) / (e^G(a) + e^−G(a))
Output: the classifier sign(G(a))
At the beginning of each boosting procedure, the working response as well as weight distribution are changed. The weighted least squares regression issue is fitted with a regression function.
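The working response and weights of one LogitBoost iteration can be traced on toy values (illustrative numbers, not from the paper):

```python
import numpy as np

b = np.array([1, -1, 1])        # labels in {-1, +1}
z_star = (b + 1) / 2            # mapped to {0, 1} as in the initialization step
p = np.full(3, 0.5)             # initial probability estimates p(a_i)
w = p * (1 - p)                 # weights w_i = p(1 - p)
z = (z_star - p) / w            # working response fed to the regression fit
```

At the uniform starting point every weight is 0.25, and the working responses are simply ±2 depending on the label, which is what the first weighted least-squares fit sees.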
3.3. Alternating Decision-Tree (ADTree)
An ML approach for classification is ADTree. It generalizes decision trees (DTs) and is linked to boosting. An ADTree is made up of decision nodes, each representing a predicate condition, and prediction nodes, each storing a single number [
27]. The Real Adaboost is implemented in Algorithm 3 to learn an ADTree model. The precondition set P and the base conditions C are the two significant variables maintained during the induction. The division of the input space greatly affects any decision tree’s quality.
In step 1.a of Algorithm 3, all the training samples are given an equal weight by setting wi,0 = 1/n. The training sample and the boosting step are indexed by i and t, respectively. Additionally, the precondition true, which defines the complete input space, is initialized as part of the precondition set P. With the root decision rule, a prior classifier is created by deriving a first prediction value,
a = ½ ln(W+(true)/W−(true)),
from the ratio of positive to negative training data.
W+(c) stands for the total weight of the positive samples that meet condition c, and
W−(c) for that of the negative samples. Before starting a fresh boosting procedure, the training dataset is reweighted in accordance with this root decision rule (step 1.c).
Algorithm 3: ADTree with Adaboost
Input: Training dataset D
1. Process of initialization
1.a Set wi,0 = 1/n; P = {true}
1.b First DT rule r0(x): { if (true) then a = ½ ln(W+(true)/W−(true)) else 0 }
1.c Update wi,1 = wi,0 exp(−bi r0(xi))
2. Repeat for boosting cycle t = 1 : T
2.1 For every precondition p ∈ P and each condition c, compute
Z(p, c) = 2(√(W+(p ∧ c) W−(p ∧ c)) + √(W+(p ∧ ¬c) W−(p ∧ ¬c))) + W(¬p)
2.2 Compute the prediction values for the selected (p, c) that minimizes Z, with +1 smoothing:
a = ½ ln((W+(p ∧ c) + 1)/(W−(p ∧ c) + 1)), b = ½ ln((W+(p ∧ ¬c) + 1)/(W−(p ∧ ¬c) + 1))
2.3 Update P: P ← P ∪ {p ∧ c, p ∧ ¬c}
2.4 Update wi,t+1 = wi,t exp(−bi rt(xi))
Output: F(x) = sign(Σ over t of rt(x))
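The Z splitting criterion that ADTree minimizes when choosing a rule can be sketched as a small helper (the signature and the smoothing-free form are assumptions for illustration; the paper’s exact formulation may differ):

```python
import math

def z_value(wp_c, wn_c, wp_nc, wn_nc, w_not_p):
    """ADTree split criterion: smaller Z indicates a purer split.
    wp_c / wn_c: positive / negative weight satisfying (p AND c);
    wp_nc / wn_nc: the same for (p AND NOT c);
    w_not_p: total weight of samples outside precondition p."""
    return 2 * (math.sqrt(wp_c * wn_c) + math.sqrt(wp_nc * wn_nc)) + w_not_p
```

A perfectly separating condition drives both square-root terms to zero, so the induction prefers conditions whose two branches are each dominated by one class.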
3.4. Reduced Error Pruning Tree (REPTree)
Poor pruning-phase performance has been seen to hinder the top-down induction of DTs. For instance, it is known that the size of the generated tree increases linearly with the sample count even though its accuracy does not improve. The REPTree method is based on the idea that variance-induced error can be minimized by computing the information gain using entropy and by backfitting with reduced-error pruning.
3.5. Naïve Bayes (NB) Classifier
In the naive Bayesian technique, the data are classified in two steps. The first step uses the training input data to estimate the parameters of the probability distribution. In the second step, the test dataset is classified according to the highest posterior probability.
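A minimal Gaussian naive Bayes sketch of this two-step procedure (parameter estimation, then maximum-posterior classification); the function names and the Gaussian likelihood choice are illustrative assumptions:

```python
import numpy as np

def fit_gnb(X, y):
    """Step 1: estimate per-class prior, feature means and variances."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),          # prior P(c)
                     Xc.mean(axis=0),           # per-feature mean
                     Xc.var(axis=0) + 1e-9)     # per-feature variance (smoothed)
    return params

def predict_gnb(params, x):
    """Step 2: pick the class with the highest (log-)posterior."""
    best, best_lp = None, -np.inf
    for c, (prior, mu, var) in params.items():
        lp = np.log(prior) - 0.5 * np.sum(np.log(2 * np.pi * var)
                                          + (x - mu) ** 2 / var)
        if lp > best_lp:
            best, best_lp = c, lp
    return best
```

Working in log space avoids underflow when many features are multiplied together under the naive independence assumption.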
3.6. Random Forest (RF) Classifier
RF functions by building many decision trees from the training data. In a classification situation, every tree votes for an output class, and the class with the greatest number of votes is ultimately chosen. The number of trees to construct must be specified. Such a method for aggregating models is called random forest (RF). This technique is used to lower the variance of the results, a crucial metric.
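Assuming scikit-learn is available, the majority-vote behavior can be sketched on toy data (the feature values below are illustrative, not the Wisconsin dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# two well-separated clusters standing in for benign/malignant feature vectors
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(5, 1, (50, 4))])
y = np.array([0] * 50 + [1] * 50)

# each of the 100 trees votes; the majority class wins
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```

Averaging over many decorrelated trees is what lowers the variance relative to a single decision tree.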
3.7. CART
Breiman et al. were the first to present the CART algorithm. CART is produced with the use of Hunt’s algorithm. It can process categorical and continuous attributes to create a DT. Additionally, it handles missing data and builds the DT using the Gini Index as the attribute selection criterion [
28]. CART creates binary trees by splitting the given dataset (training set) into binary segments. Unlike CART, ID3 and C4.5 rely on entropy-based probabilistic assumptions and do not use the Gini Index. The CART method uses cost-complexity pruning to eliminate erratic branches from the DT to improve classification accuracy.
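The Gini Index that CART uses for attribute selection can be computed as:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())
```

A pure node scores 0, and a 50–50 binary split scores 0.5, the maximum for two classes; CART picks the split that most reduces this impurity.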
3.8. Homogeneous Adaboost Technique
The mentioned algorithms, namely the ADTree, NB, RF, REPTree and CART classifiers, are executed as individual classifiers on the given dataset, and their performance in terms of precision, recall, F1 measure and accuracy is compared. In the next stage, the Adaboost classifier is combined with each of these individual classifiers (referred to as a homogeneous Adaboost classifier), and the performance is recorded against the same metrics. Finally, the model that provides the best performance is recommended for deployment in the early detection of breast cancer.
4. Results and Discussion
The various ML techniques of naive Bayes, ADTree, REPTree, CART and RF are used in the dataset as separate classifiers. In the following section, a comparison of their performances is made using different metrics [
29]. The f1 measure, recall, precision, and accuracy are computed using Equations (8)–(11).
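The four classification metrics can be computed from the confusion-matrix counts as a direct transcription of the standard definitions (the helper name is illustrative):

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)                          # Eq. (8)-style
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy
```
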
The mean absolute error (MAE), root mean squared error (RMSE), relative absolute error (RAE) and relative squared error (RRSE) are computed using Equations (12)–(15).
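The four error metrics follow the standard definitions, with the relative errors normalized by deviations from the mean of the true values (a sketch; the function name is illustrative):

```python
import numpy as np

def error_metrics(y_true, y_pred):
    """MAE, RMSE, RAE and RRSE for a vector of predictions."""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    err = y_true - y_pred
    dev = y_true - y_true.mean()                 # deviation from the mean predictor
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    rae = np.sum(np.abs(err)) / np.sum(np.abs(dev))
    rrse = np.sqrt(np.sum(err ** 2) / np.sum(dev ** 2))
    return mae, rmse, rae, rrse
```

RAE and RRSE are below 1 exactly when the model beats the trivial predictor that always outputs the mean of the true values.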
According to
Table 1, random forest is the best model because it can be built in 2.11 s, as opposed to 58.28 s for ADTree (TTBM—time to build model).
For each classifier, the accuracy forecast is shown in
Figure 3. RF has the best accuracy, 96.87%, among all the classifiers employed in this research. The lowest accuracy is provided by the CART classifier, at 84.67%.
The rates of mistakes derived from each classifier are shown in
Figure 4a,b. For ADTree, the RMSE and MAE values are 0.36 and 0.24, respectively, which demonstrates that the prediction process recorded few errors. However, the error rate for NB is higher, with 0.72 and 0.65 for RMSE and MAE, respectively.
The Adaboost is now an ensemble with the individual classifiers described as homogeneous classifiers, as mentioned earlier. AdaBoost-RF is the best model, as shown by
Table 2, which also shows that it was built in just 8.52 s. The AdaBoost-CART, however, is the worst model because it requires 200.12 s to create. AdaBoost-NB has the lowest F1-value (0.70), whereas AdaBoost-RF has the greatest F1-value (0.98).
With an accuracy of 97.95 percent, the Adaboost-RF forecasts outperform all other classification algorithms, as shown in
Figure 5. Adaboost-ADTree, on the other hand, comes in second with a prediction accuracy of 93.96%. Adaboost-REPTree offers a prediction rate of 84.66%, which is the lowest.
The varying error rates that were observed are shown in
Figure 6. The lowest error rates are offered by Adaboost-RF ensemble classifiers, which are 0.11 for MAE as well as 0.33 for RMSE. On the other hand, Adaboost-NB has a greater error rate, with values for MAE and RMSE of 0.51 and 0.69 that are very close to the values of the NB individual classifier.
All the performance measures reported in this study are statistically different from those of the other studies, as shown in
Table 3.
In the proposed method, we estimated 95% confidence intervals (CI) based on 10 replications for different random seed numbers. Using a holdout (70–30%) validation technique, Walid Theib Mohammad et al. [
33] reported a maximum classification accuracy of 97.7%, while our proposed study achieved 97.95% (95% CI: 96.5–98.6%) classification accuracy. In summary, we have achieved 0.25% and 0.59% higher accuracies compared to the second-best models by Walid Theib Mohammad et al. [
33] and Chaurasia V et al. [
31] (10-fold cross validation), respectively.