Proceeding Paper

Breast Cancer Diagnosis Using Bagging Decision Trees with Improved Feature Selection †

1 Department of Computer Science and Engineering, Maharishi Markandeshwar Deemed to be University, Ambala 133203, India
2 Department of Computer Science and Engineering, Manipal University Jaipur, Jaipur 302034, India
3 Department of Computer Science and Engineering, R.V.S. College of Engineering, RVS Nagar, Dindigul 624005, India
4 Department of Computer Science, ABES Engineering College, Ghaziabad 201009, India
5 Department of Mathematics, School of Arts and Sciences, University of the People, Pasadena, CA 00012, USA
6 Department of Computer Science and Engineering, BBD University, Lucknow 226010, India
7 Department of Electrical and Electronics Engineering, School of Engineering and Technology, Dhanalakshmi Srinivasan University, Samayapuram 621112, India
* Author to whom correspondence should be addressed.
Presented at the International Conference on Recent Advances on Science and Engineering, Dubai, United Arab Emirates, 4–5 October 2023.
Eng. Proc. 2023, 59(1), 17; https://doi.org/10.3390/engproc2023059017
Published: 11 December 2023
(This article belongs to the Proceedings of Eng. Proc., 2023, RAiSE-2023)

Abstract

Machine learning is the study of computer algorithms that allow systems to learn from data and improve with experience through pattern recognition, adjusting their behavior without explicit programming. This work offers a practical introduction to the core concepts and principles of bagging decision trees applied to breast cancer diagnosis. Three main algorithms were used: logistic regression (LR), decision tree (DT), and random forest (RF). The random forest method used bagging techniques to select data points, and feature optimization was also carried out. Our experiments show that the bagging trees algorithm outperforms a single decision tree tuned with its best parameters. A feature optimization scheme was also introduced in the selection of data points during the training phase, which further increased accuracy.

1. Introduction

Breast cancer is the cancer that most frequently affects women. According to the World Health Organization (WHO), it affects about 2.1 million women each year and is responsible for a large share of cancer-related deaths among women. In 2018 alone, breast cancer caused the deaths of almost 627,000 women, about 15% of all cancer fatalities among women. Although breast cancer incidence is rising in almost every corner of the world, rates have climbed fastest in the more developed regions. In breast cancer, cells in the breast tissue transform and divide uncontrollably, forming a mass or lump. Breast cancer commonly originates in the milk glands or milk ducts. A small tumor is difficult to detect because it produces no symptoms, which is why preliminary screening is required for early identification. Most masses do not turn out to be malignant. Once cancer is suspected, further testing in the form of a surgical biopsy is necessary. Diagnosis and screening are therefore extremely important in the fight against breast cancer. Because resources are limited, low-cost technologies and solutions for early detection are required. Recent developments in image processing and machine learning algorithms have made early detection possible with a minimal initial investment. There is no limit to the amount of data that may be used to train machine learning models; the only requirement is that the resulting model be stable and accurate. Several machine learning models exist, such as logistic regression and decision trees. These models are straightforward and fast, but accurate prediction can be difficult to achieve with them alone. The objective of this study is to investigate the effectiveness of bagging decision trees for breast cancer diagnosis, aiming to develop a robust and accurate classification model that can aid early detection and improve overall diagnostic performance in differentiating between malignant and benign breast tumors.

2. Literature Review and Existing Methods

Hemasundara et al. [1] proposed a PCA-based approach in which the condition of a patient is evaluated using recall and positive predictive value in relation to sensitivity, false alarm rate, F1-score, and classification efficiency. Breast cancer was detected by Angulo et al. [2] based on the appearance and detectability of skin sores using a variety of image processing techniques. A CNN-based pre-trained model technique for mass evaluation that follows a cross-validation procedure was proposed by Richa et al. [3]. Mammography images were analyzed by Ragab et al. [4], who grouped cancerous and benign masses using a process that involved both segmentation and deep learning; the newly designed DCNN achieved an accuracy of up to 71.01%. Deep learning was used to analyze MRI images in the research described in [5]. Jung et al. [6] developed a model based on RetinaNet, "a one-stage object detector". The article in [7] suggested a molecular-based understanding of mammographic density (MD) and how its demography may be used to define high-density tissue. A CAD framework for detecting masses in breasts was proposed by Yousefi et al. [8]. According to Feng et al. [9], breast tumors can be classified as metastatic, invasive, or non-invasive malignancies; they identified molecular subtypes that provide treatment targets in cancerous cells. Breast ultrasound images were analyzed by Xiao et al. [10], who established a dataset of 2058 ultrasound images of breast masses, of which 688 were malignant and 1370 were benign lesions. Rao et al. [11] investigated the use of computer-aided detection (CAD) systems for automated recognition, segmentation, and classification of masses extracted from mammograms. Histopathology and mammography phenotypes were analyzed by Hamidinekoo et al. [12] with the purpose of determining their biological characteristics. The pre-processing approach for mammograms was upgraded by Gowri et al. [13] to facilitate the automated extraction of tissue.
In the study by Mohamed et al. [14], an image processing approach was presented for classification and segmentation; it achieved an accuracy of 82.5% and helped diagnose and analyze tumors at low cost so that appropriate treatment could be prescribed. The application of machine and deep learning in healthcare and medicine, particularly in clinical imaging, is anticipated to become practical within a reasonable amount of time, according to Lee et al. [15]. Kooi et al. [16] investigated temporal data to determine how to identify malignant lesions in fragile mammographic tissue. A CNN and a massive-training ANN were two of the key deep learning models that Suzuki [17] utilized and examined. The imaging challenges described in [18] included hazy borders, high speckle noise, low intensity, intensity inhomogeneity, and low SNR. Deep learning models were illustrated by Carneiro et al. [19] for creating division maps of breast lesions using the mammographic test, which is the primary methodology used in this field; they suggested a framework for using segmentation-based maps produced by automated mass detection and micro-calcification detection systems. Using machine learning methods, the authors of [20] assisted with the analysis of images in the clinical domain. The researchers in [21] investigated deep learning as well as structured output models for the purpose of advancing their proposed CAD framework. The researchers in [22] presented evidence that a CNN model could be developed and optimized for the identification of malignant growths in mammograms using data augmentation and transfer learning.

3. Methodology

The fact that decision trees (DTs) do not presuppose anything about the way that data are organized is one of the many reasons why they are superior to models such as the logistic regression model. While performing logistic regression, it is assumed that a line may be drawn to divide the available data. Sometimes, the organization of our data is not as simple as that. No matter how the data are organized, a decision tree may be able to get to the heart of the matter using its branching logic. The tendency of decision trees to overfit data is the primary detriment associated with their use. Pruning can be used to improve their performance, but the work presented here demonstrates how decision trees can be utilized to create a more accurate model.
The dataset used for breast cancer diagnosis in this paper is the “Breast Cancer Wisconsin (Diagnostic) Data Set”, commonly known as the Wisconsin Breast Cancer dataset. This dataset consists of 569 samples, where each sample represents a biopsy of breast tissue. There are 30 features computed from each digitized image of a biopsy, including the mean, standard error, and worst (the mean of the three largest values) of ten different characteristics such as the radius, texture, and smoothness. The target variable is binary, indicating whether the biopsy is diagnosed as malignant (1) or benign (0).
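As an illustrative sketch (not the authors' exact code), the dataset can be loaded through scikit-learn's bundled copy of the Wisconsin Diagnostic data. Note that scikit-learn encodes the target as 0 = malignant and 1 = benign, so the label is remapped below to match the convention used in this paper; the variable names and the remapping are assumptions made for this example.

```python
# A minimal sketch of loading the Wisconsin Diagnostic dataset via scikit-learn's
# bundled copy (the same data is also available on Kaggle).
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)  # 569 rows x 30 features

# scikit-learn encodes the target as 0 = malignant, 1 = benign; flip it so that
# malignant = 1 and benign = 0, matching the convention used in this paper.
df['target'] = (data.target == 0).astype(int)

print(df.shape)                      # (569, 31)
print(df['target'].value_counts())   # 357 benign (0), 212 malignant (1)
```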

3.1. Bagging Trees

Decision trees are extremely vulnerable to random anomalies in their training dataset. Decision trees have a high variance, and a random change in their training dataset can result in a significant difference in the tree’s appearance. A random forest is a model made up of many decision trees. The purpose of random forests is to reap the benefits of decision trees while minimizing the issues concerning variance. A random forest is an example of ensemble learning, since it combines numerous machine learning models to form a single model. For example, three bootstrap resamples drawn (with replacement) from an original set {A, B, C, D} might look like this:
  • A, A, B, C—Sample 1
  • B, B, B, D—Sample 2
  • A, A, C, C—Sample 3
Bootstrap aggregation (or bagging) is a strategy for minimizing variance in a single model by assembling an ensemble of models built on bootstrapped samples. Bagged decision trees are trained on bootstrapped resamples of the training dataset. Hence, if we have 100 data points in our training set, each resample contains 100 data points chosen at random from the training set with replacement, which means that some data points appear several times and some not at all. In the bagging trees approach, a prediction is made with each of the ten decision trees, and the final prediction is determined by majority vote. Bootstrapping the training set washes out the individual decision tree’s variance: averaging numerous trees grown on different training sets produces a model that better captures the essence of the data. Hence, bagging decision trees is a way of minimizing model variance.
The trees in bagged decision trees may still be too similar to one another to have produced the optimal model. They are based on different resamples, but they all have access to the same features. As a result, an additional constraint is applied when growing each decision tree, resulting in greater diversity among the trees; this is referred to as de-correlating the trees. At each node of an ordinary decision tree, every split threshold for every feature is evaluated to find the single best feature and threshold. For each node of a decision tree in a random forest, only a random subset of the features is considered. As a result, each split selects a decent, but not necessarily the best, feature to split on. It is worth noting that this random selection of features occurs at each node. The square root of the number of features is a common choice for the number of features to assess at each split; hence, with nine features, three randomly chosen features would be considered at each node. A random forest is generated by bagging in this manner. Any single decision tree in a random forest is almost certainly inferior to a normal decision tree; nonetheless, averaging produces a very powerful model.
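As a minimal sketch under the same assumptions as the loading example (variable names such as X_train and the random_state value are illustrative, not the authors' settings), bagged decision trees and their de-correlated counterpart, the random forest, can be built in scikit-learn as follows:

```python
# Hedged sketch of bagging ten decision trees on bootstrapped resamples,
# assuming X_train / y_train come from the split described in Section 3.2.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# Each of the 10 trees sees a bootstrap resample (same size, drawn with replacement);
# the final class is decided by majority vote across the trees.
bagged_trees = BaggingClassifier(DecisionTreeClassifier(),
                                 n_estimators=10,
                                 bootstrap=True,
                                 random_state=101)

# A random forest goes one step further and de-correlates the trees by evaluating
# only a random subset of features (sqrt of the feature count by default) at each split.
rf = RandomForestClassifier(n_estimators=10, max_features='sqrt', random_state=101)
```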
Using bagging techniques to improve decision trees and random forest models for breast cancer diagnosis holds significant importance due to several key advantages. Bagging reduces variance, mitigates overfitting, and enhances the model’s generalization by combining predictions from multiple decision trees. This results in increased accuracy and robustness, crucial in medical applications like breast cancer diagnosis, where reliable predictions of unseen data are paramount. The bagging method’s ability to handle noisy data and its parallelizability make it well-suited for large medical datasets, facilitating real-time or near real-time diagnoses. Moreover, the interpretability of random forest models provides valuable insights into the most critical features influencing breast cancer prediction, contributing to medical practitioners’ understanding of and trust in the model. Overall, the adoption of bagging techniques encourages more efficient and accurate breast cancer diagnosis, leading to potentially improved treatment outcomes and patient care.

3.2. Random Forest Model for Breast Cancer Detection

The Wisconsin Breast Cancer dataset was used to predict whether a node is malignant in order to demonstrate the utility of the random forest (RF) approach. This dataset measures many characteristics of a lump in breast tissue, together with a label indicating whether the tumor is malignant. There are 569 data points and 30 features in the dataset. Python’s scikit-learn library was used for prediction and modeling. The syntax for creating and deploying a random forest model in scikit-learn is identical to that for logistic regression models and decision trees. When the code is set up for comparison, a random state is fixed so that the same train-test split is produced every time; without it, the data points in the training and testing sets would differ between runs, making it more difficult to test the code.
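A hedged sketch of this workflow is shown below, assuming the df DataFrame from the loading example; the random_state value of 101 is an arbitrary choice used purely for illustration.

```python
# Sketch of the fit workflow described above, assuming df and data from the
# earlier loading example. Fixing random_state makes the split reproducible.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X = df[data.feature_names]
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=101)

rf = RandomForestClassifier(random_state=101)
rf.fit(X_train, y_train)
```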

3.2.1. Random Forest Parameters

Several options for tweaking the random forest classifier’s behavior can be found in the algorithm’s documentation in scikit-learn. Only a subset of these options is covered here; the documentation describes the full set. The pre-pruning tuning parameters for a random forest are the same as those for a decision tree: max_depth, min_samples_leaf, and max_leaf_nodes. Because overfitting is rarely an issue with random forests, tuning these parameters is usually unnecessary. The number of trees (n_estimators) and max_features (the number of features to consider at each split) are two more tuning options we investigate. By default, max_features is set to the square root of the number of features (predictors), p. While we rarely need to deviate from this default, the following code demonstrates how to do so. The number of estimators (decision trees) can be increased from the default of 10 used in the scikit-learn version employed here; this default is usually adequate but can be too small in some situations. In subsequent sections, we examine how to change the default value and how to determine an optimal number of trees. One of the main selling points of random forests is how little tuning is typically required; most datasets do well with the default settings.
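The following is an illustrative sketch of these options; the specific values are assumptions chosen for demonstration, not the settings used to produce the reported results.

```python
# Hedged examples of the tuning parameters discussed above.
from sklearn.ensemble import RandomForestClassifier

# Override the default number of features considered at each split
rf_feats = RandomForestClassifier(max_features=5, random_state=101)

# Grow more trees than the (older scikit-learn) default of 10
rf_trees = RandomForestClassifier(n_estimators=50, random_state=101)

# Pre-pruning knobs shared with a single decision tree
rf_pruned = RandomForestClassifier(max_depth=5, min_samples_leaf=2,
                                   max_leaf_nodes=20, random_state=101)
```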

3.2.2. Grid Search

The purpose of the Grid Search is to guide us in selecting the best possible settings. Let us utilize Grid Search to evaluate how well different sizes of random forests perform. The parameters to be varied and the range of values to be tested must be specified in a parameter grid. The following is one possible format for presenting this:
n_estimators = [10, 25, 50, 75, 100]
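In scikit-learn, this grid would be expressed as a dictionary mapping parameter names to candidate values; the sketch below assumes the values listed above.

```python
# Parameter grid for the grid search: parameter name -> list of candidate values.
param_grid = {'n_estimators': [10, 25, 50, 75, 100]}
```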

3.2.3. Parameters Grid

We can now create a Grid Search together with a random forest classifier. Keep in mind that k-fold cross-validation is performed automatically by the Grid Search; our cross-validation level was set to 5, with cv = 5. Scoring defaults to accuracy, which is appropriate here. Keep in mind that the best accuracy score may change slightly between runs because of the random split into 5 folds. As the classes in the breast cancer dataset are roughly balanced, accuracy works well for us. If there were a significant imbalance between the classes, we might switch to a different metric, such as the F1-score, by setting the scoring parameter (for example, to “f1”). Setting the classifier’s random state prevents unexpected results by producing the same optimal parameter every time. Further configurations can be compared by adding more parameters and parameter values, such as max_features, to the parameter grid dictionary.
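A sketch of this search is given below, assuming the param_grid dictionary from Section 3.2.2 and the X_train / y_train split from earlier; the random_state is again an illustrative choice.

```python
# Grid search over the random forest size, with 5-fold cross-validation.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier(random_state=101)

# scoring defaults to accuracy; switch to scoring='f1' for imbalanced classes.
gs = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy')
gs.fit(X_train, y_train)

print(gs.best_params_)   # e.g. {'n_estimators': 50}
```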

3.2.4. Elbow Graph

Adding more trees to a random forest has not been shown to negatively impact performance, but beyond a certain threshold the performance gains from adding more trees plateau, while the complexity of the algorithm keeps increasing. An algorithm with higher complexity takes more computing power to run, so adding complexity is worthwhile only when it actually improves performance, and we should avoid it otherwise. An elbow graph can be used to identify the optimal number of estimators in ensemble learning methods such as random forests; the elbow graph shown in Figure 1 helps us zero in on this point. The elbow graph is a simple yet effective tool for maximizing efficiency. Let us run a Grid Search with n_estimators varied between 1 and 100 to determine which value is best. This time, rather than just the best parameters, we use the complete result of the Grid Search, which can be found in the cv_results_ attribute. This dictionary contains a great deal of information, but for our purposes we only need the mean test score (the mean_test_score key). We save these values in a variable and plot them with matplotlib. The graph flattens out after about 10 trees. Because of run-to-run variability, the nominal best scores appeared at n_estimators = 33 and n_estimators = 64. As our goal is to use as few estimators as possible while achieving near-optimal performance, a value of around 10 seems about right. Our random forest model can be constructed with this tree count. Elbow graphs are useful whenever we want to discover the minimum amount of complexity that gives the best performance, which is a frequent question when adding complexity to a model.
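A hedged sketch of this procedure, assuming the earlier X_train / y_train variables, is shown below.

```python
# Elbow graph: grid-search n_estimators from 1 to 100 and plot the mean
# cross-validated accuracy stored in cv_results_.
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

n_estimators = list(range(1, 101))
param_grid = {'n_estimators': n_estimators}

gs = GridSearchCV(RandomForestClassifier(random_state=101), param_grid, cv=5)
gs.fit(X_train, y_train)

scores = gs.cv_results_['mean_test_score']   # one mean accuracy per grid point
plt.plot(n_estimators, scores)
plt.xlabel('n_estimators')
plt.ylabel('mean cross-validated accuracy')
plt.show()
```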

3.3. Features’ Importance

The cancer dataset used in this paper includes 30 different features. Does each feature carry the same weight when constructing a model? If not, which subset of features should we employ? This is the problem of feature selection. Simple feature selection can be achieved with the help of random forests. To recap, a random forest is made up of numerous decision trees, and in each tree every split is chosen to give the largest drop in impurity, typically Gini impurity or entropy for classification. As a result, the amount of impurity reduced by each feature in a tree can be calculated, and averaging these contributions across the whole forest gives an importance score for each feature. Think of this as a feature importance meter that helps us prioritize and narrow down the feature set (see the feature heat map in Figure 2). Scikit-learn exposes a feature importance attribute on the fitted model that details how significant the various features are; the scores are scaled so that they sum to 1. Using a random forest with n_estimators = 10, we can find the most important features in the training dataset and list them in descending order of relevance. The results in Table 1 show that the worst radius (0.31) is the most consequential feature, followed by the mean concave points and the worst concave points. In linear regression, by contrast, the variance rather than averaged impurity reductions is used to determine which features are the most significant, as shown in Figure 3.
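A short sketch of this ranking, assuming the X_train / y_train variables defined earlier, is given below; the variable names are assumptions for this example.

```python
# Rank the features by impurity-based importance; the values in
# feature_importances_ are scaled so that they sum to 1.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=10, random_state=101)
rf.fit(X_train, y_train)

importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```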

3.4. A New Model Based on Selected Features

Why do we need to perform feature selection? The main benefits of feature selection are that it allows us to train a model faster and reduces the complexity of a model, making it easier to interpret; if the correct subset is picked, it can also improve a model’s accuracy. Selecting that subset frequently requires domain knowledge and creativity. In our dataset, we found that the features whose names contain the word “worst” appear to be more important. As a result, we can create a new model using the chosen attributes and see whether it increases accuracy. The model from the previous section had an accuracy of 0.9650. We start with the features whose names contain the term “worst”: ‘worst radius’, ‘worst texture’, ‘worst perimeter’, ‘worst area’, ‘worst smoothness’, ‘worst compactness’, ‘worst concavity’, ‘worst symmetry’, ‘worst fractal dimension’. There are nine such features. We then build another data frame with the chosen features, followed by a training-test split with the same random state. After fitting the model and calculating its accuracy, we find that the accuracy has increased to 0.9720. In this case, we improved accuracy by employing a subset of features, roughly one-third of the total, because removing some noisy and highly correlated features yields a better model. When the sample size is large, the advantage of developing a better model with fewer features becomes even more obvious. There is no universally ideal way of selecting features; instead, we must figure out what works best for the individual problem and use our domain experience to create a successful model. The scikit-learn library makes it simple to determine the relevance of a feature.
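As a hedged sketch of this step (assuming df, y, and the earlier random_state), the code below keeps every column whose name contains “worst” in scikit-learn’s copy of the dataset; that set also includes ‘worst concave points’, so it may differ slightly from the exact list above.

```python
# Rebuild the model on the "worst" features only, with the same random_state so
# the train/test split matches the full-feature model.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

worst_cols = [c for c in df.columns if c.startswith('worst')]
X_worst = df[worst_cols]

X_train_w, X_test_w, y_train_w, y_test_w = train_test_split(X_worst, y,
                                                            random_state=101)

rf_worst = RandomForestClassifier(n_estimators=10, random_state=101)
rf_worst.fit(X_train_w, y_train_w)
print(rf_worst.score(X_test_w, y_test_w))   # accuracy on the reduced feature set
```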

4. Results and Discussion

Now we can use the model to make a prediction. For example, let us take the first row of the test set and see what the prediction is. Note that the predict method takes an array of points, so even a single point has to be wrapped in a list. The model predicted that the lump was cancerous, and its prediction was correct. We can use the score method to calculate the accuracy over the whole test set: the accuracy is 96.5%. For the decision tree model, the accuracy is 90.2%, much worse than that of the random forest (for a comparison of the models, refer to Table 2 and Figure 4). With feature selection, accuracy improved further, to 92% for the decision tree and 97.2% for the random forest.
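A sketch of the single-point prediction and the overall accuracy, assuming the rf model and split from Section 3.2, is shown below.

```python
# Predict the first test point and compute accuracy over the whole test set.
first_row = X_test.iloc[[0]]      # predict() expects a 2-D array, hence the list
print(rf.predict(first_row))      # e.g. [1] -> predicted malignant
print(y_test.iloc[0])             # true label for comparison

print(rf.score(X_test, y_test))   # overall test-set accuracy, about 0.965 here
```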

4.1. Performance

Probably the biggest advantage of random forests is that they generally perform well without any tuning, and they perform decently on almost every dataset. A linear model, for example, cannot perform well on a dataset that cannot be split with a line. It is not possible to separate the synthetic dataset used in this paper (see Figure 5) with a straight line without transforming its features; however, a random forest performs just fine on it. We can see this by generating the synthetic dataset and comparing a logistic regression model with a random forest model. The function “make_circles” creates a classification dataset with concentric circles. Using k-fold cross-validation to compare the accuracy scores, the logistic regression model performs worse than random guessing, while the random forest model performs quite well.
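A hedged sketch of this benchmark is given below; the sample size, noise level, and random_state are illustrative assumptions.

```python
# Logistic regression vs. random forest on the concentric-circles dataset:
# no straight line separates the classes, so the linear model struggles.
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_c, y_c = make_circles(n_samples=300, noise=0.1, factor=0.5, random_state=101)

lr_scores = cross_val_score(LogisticRegression(), X_c, y_c, cv=5)
rf_scores = cross_val_score(RandomForestClassifier(random_state=101), X_c, y_c, cv=5)

print(lr_scores.mean())   # typically near or below 0.5 (no better than guessing)
print(rf_scores.mean())   # substantially higher
```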
When looking to obtain a benchmark for a new classification problem, it is common practice to start by building a logistic regression model and a random forest model, as these two models both have the potential to perform well without any tuning. This will give you values for your metrics to try to beat. Oftentimes, it is almost impossible to do better than these benchmarks.

4.2. Interpretability

Random forests, despite being made up of decision trees, are not easy to interpret. A random forest has several decision trees, each of which is not a very good model; but, when averaged, they create an excellent model. Thus, random forests are not a good choice when aiming for interpretability.

4.3. Computation

Random forests can be a little slow to build, especially with a large number of trees. Building a random forest involves building (usually) 10–100 decision trees. Each of these decision trees is faster to build than a standard decision tree because not every feature is compared at every split; however, because a random forest contains many such trees, it is often slow to build overall. Similarly, prediction with a random forest is slower than with a single decision tree, since each of the 10–100 trees must make its own prediction before the final vote is taken. Random forests are not the fastest model, but this is generally not a problem, since computers have plenty of computational power.

5. Conclusions

In conclusion, this research contributes to the field of breast cancer diagnosis by demonstrating the effectiveness of using bagging decision trees for accurate classification. The bagging ensemble method helps improve the model’s robustness and generalization, leading to more reliable predictions on unseen data. Additionally, this study highlights the crucial role of feature selection in enhancing accuracy. By identifying and utilizing the most relevant features from the Wisconsin Breast Cancer dataset, the model becomes more efficient and less prone to overfitting. The implications of these findings are significant for breast cancer diagnosis. The bagging decision trees model presents a practical and powerful tool for early detection, aiding medical professionals in making timely and accurate diagnoses. Its ability to handle complex interactions within the data and mitigate the impact of outliers enhances the model’s reliability in real-world scenarios. Moreover, improved feature selection ensures that the most informative and meaningful attributes are utilized, reducing the computational burden and providing a clearer understanding of the underlying factors contributing to breast cancer. Overall, this research’s contributions pave the way for more effective and efficient breast cancer diagnoses, with potential applications in automated screening systems and decision-making support tools for medical practitioners. By harnessing the advantages of the bagging decision tree model and emphasizing the importance of improved feature selection, the accuracy and reliability of breast cancer diagnosis can be significantly enhanced, ultimately improving patient outcomes and contributing to early interventions that save lives.

Author Contributions

Conceptualization, D.D. and A.N.; methodology, S.L., V.S., V.K., S.R. and R.R.; validation, D.D. and A.N.; formal analysis, S.L., V.S., V.K., S.R. and R.R.; investigation, D.D. and A.N.; resources, D.D. and A.N.; data curation, S.L., V.S., V.K., S.R. and R.R.; writing—original draft preparation, S.L., V.S., V.K., S.R. and R.R.; writing—review and editing, S.L., V.S., V.K., S.R. and R.R.; visualization, S.L., V.S., V.K., S.R. and R.R.; supervision, D.D. and A.N. All authors contributed equally to the research and writing of this paper. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in this research is publicly available on Kaggle.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Rao, C.H.; Naganjaneyulu, P.V.; Satyaprasad, K. Automatic Classification Breast Masses in Mammograms using Fusion Technique and FLDA Analysis. Int. J. Innov. Technol. Explor. Eng. 2019, 8, 1061–1071.
  2. Angulo, P.A.; Castellano, C.R.; Rodriguez, C.A.; González, M.J.L. Value of a computer-assisted detection (CAD) system designed for digital mammography (DM) in the diagnosis of breast cancer assessed by DM and digital breast tomosynthesis (DBT). Eur. Congr. Radiol. 2019, 1–49.
  3. Agarwal, R.; Diaz, O.; Lladó, X.; Yap, M.H.; Martí, R. Automatic mass detection in mammograms using deep convolutional neural networks. J. Med. Imaging 2019, 6, 031409.
  4. Hameed, B.M.Z.; Shah, M.; Naik, N.; Singh Khanuja, H.; Paul, R.; Somani, B.K. Application of artificial intelligence-based classifiers to predict the outcome measures and stone-free status following percutaneous nephrolithotomy for staghorn calculi: Cross-validation of data and estimation of accuracy. J. Endourol. 2021, 35, 1307–1313.
  5. Lundervold, A.S.; Lundervold, A. An overview of deep learning in medical imaging focusing on MRI. Z. Für Med. Phys. 2019, 29, 102–127.
  6. Jung, H.; Kim, B.; Lee, I.; Yoo, M.; Lee, J.; Kang, J. Detection of masses in mammograms using a one-stage object detector based on a deep convolutional neural network. PLoS ONE 2018, 13, e0203355.
  7. Nazari, S.S.; Mukherjee, P. An overview of mammographic density and its association with breast cancer. Breast Cancer 2018, 25, 259–267.
  8. Yousefi, M.; Krzyżak, A.; Suen, C.Y. Mass detection in digital breast tomosynthesis data using convolutional neural networks and multiple instance learning. Comput. Biol. Med. 2018, 96, 283–293.
  9. Feng, Y.; Spezia, M.; Huang, S.; Liu, B. Breast cancer development and progression: Risk factors, cancer stem cells, signaling pathways, genomics, and molecular pathogenesis. Genes Dis. 2018, 5, 77–106.
  10. Xiao, T.; Liu, L.; Li, K.; Qin, W.; Yu, S.; Li, Z. Comparison of transferred deep neural networks in ultrasonic breast masses discrimination. BioMed Res. Int. 2018, 2018, 4605191.
  11. Thummalapalem, G.D.; Pradesh, A.; Vaddeswaram, G.D. Automated Detection, Segmentation and Classification Using deep Learning Methods for Mammograms-A Review. Int. J. Pure Appl. Math. 2018, 119, 627–666.
  12. Hamidinekoo, A.; Denton, E.; Rampun, A.; Honnor, K.; Zwiggelaar, R. Deep learning in mammography and breast histology, an overview and future trends. Med. Image Anal. 2018, 47, 45–67.
  13. Gowri, V.; Valluvan, K.R.; Chamundeeswari, V.V. Automated Detection and Classification of Microcalcification Clusters with Enhanced Preprocessing and Fractal Analysis. Asian Pac. J. Cancer Prev. 2018, 19, 3093.
  14. Mohamed, S.E.; Wahbi, T.M.; Sayed, M.H. Automated Detection and Classification of Breast Cancer Using Mammography Images. Int. J. Sci. Eng. Technol. Res. 2018, 7, 2278–7798.
  15. Lee, J.G.; Jun, S.; Cho, Y.W.; Lee, H.; Kim, G.B.; Seo, J.B.; Kim, N. Deep learning in medical imaging: General overview. Korean J. Radiol. 2017, 18, 570–584.
  16. Kooi, T.; Karssemeijer, N. Classifying symmetrical differences and temporal change for the detection of malignant masses in mammography using deep neural networks. J. Med. Imaging 2017, 4, 044501.
  17. Suzuki, K. Overview of deep learning in medical imaging. Radiol. Phys. Technol. 2017, 10, 257–273.
  18. Huang, Q.; Luo, Y.; Zhang, Q. Breast ultrasound image segmentation: A survey. Int. J. Comput. Assist. Radiol. Surg. 2017, 12, 493–507.
  19. Carneiro, G.; Nascimento, J.; Bradley, A.P. Automated analysis of unregistered multi-view mammograms with deep learning. IEEE Trans. Med. Imaging 2017, 36, 2355–2365.
  20. Patil, V.; Saxena, J.; Vineetha, R.; Paul, R.; Shetty, D.K.; Sharma, S.; Smriti, K.; Singhal, D.K.; Naik, N. Age assessment through root lengths of mandibular second and third permanent molars using machine learning and Artificial Neural Networks. J. Imaging 2023, 9, 33.
  21. Cardoso, J.S.; Marques, N.; Dhungel, N.; Carneiro, G.; Bradley, A.P. Mass segmentation in mammograms: A cross-sensor comparison of deep and tailored features. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 1737–1741.
  22. Zhang, X.; Zhang, Y.; Han, E.Y.; Jacobs, N.; Han, Q.; Wang, X.; Liu, J. Whole mammogram image classification with convolutional neural networks. In Proceedings of the 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Kansas City, MO, USA, 13–16 November 2017; pp. 700–704.
Figure 1. Elbow graph to estimate n_estimators.
Figure 2. Heat map of different features of the breast cancer dataset.
Figure 3. Features mapped against each other to show correlations.
Figure 4. Accuracy of different models for two different methods of feature optimization.
Figure 5. Random data points (two types of classes are symbolized by triangles and circles).
Table 1. Feature importance ratings.
  • worst radius: 0.309701
  • mean concave points: 0.183126
  • worst concave points: 0.115641
  • mean perimeter: 0.064119
  • mean radius: 0.058742
  • worst concavity: 0.050951
  • radius error: 0.049103
  • mean texture: 0.017197
  • worst area: 0.016512
  • mean concavity: 0.014696
Table 2. Comparison of results of different models for two different methods of feature optimization.

Sr. No. | Model | With Random Selection of Data | With Feature Importance Selection of Data
1 | LR | 0.36 | 0.83
2 | DT | 0.902 | 0.92
3 | RF | 0.965 | 0.972
