1. Introduction
Breast cancer is a leading cause of cancer-related death among women worldwide, and its incidence is rising rapidly in several developing countries [1]. In 2021, an estimated 2.3 million women received a diagnosis of breast cancer, and approximately 685,000 lives were lost around the globe. Projections indicate that these figures may grow to 3.2 million new diagnoses and 1.0 million fatalities by the year 2040. A range of innovative approaches harnessing recent technologies has emerged, enabling specialists to detect breast cancer in women more effectively. Over the decades, researchers have amassed a wealth of information on breast cancer patients, providing invaluable insights for the diagnosis, monitoring, and management of this disease [2].
In the era of advanced, data-driven technologies, the application of ML in healthcare has transformed predictive analytics and clinical decision-making [3]. Among its various applications, breast cancer prediction remains a critical challenge due to the inherent complexity of medical data [4]. The effectiveness of ML in this field depends heavily on appropriate data preprocessing, especially for skewed or imbalanced datasets [5,6]. To address this challenge, the Box–Cox transformation has been combined with ML models to normalize data distributions and stabilize variance, thereby improving the overall performance and reliability of predictive models.
Preprocessing data before training ML models is crucial for reliable results, especially with medical datasets that often suffer from skewness, outliers, and class imbalance [7,8]. These issues can lead to biased predictions, reduced generalizability, and lower clinical assessment accuracy. The Box–Cox transformation offers a solution by converting non-normally distributed data into a distribution more suitable for predictive modeling [9,10]. Despite its potential, the application of the Box–Cox transformation in conjunction with machine learning remains underexplored in healthcare datasets, particularly in breast cancer research [11].
In this study, we applied AI models to two datasets: the first is a synthetic dataset generated from a gamma distribution [12], and the second is the SEER breast cancer dataset [13]. The synthetic dataset simulates the positively skewed features commonly observed in real-world data, providing a controlled environment for evaluating the impact of data transformation. The SEER dataset, extracted from the SEER program, represents a real-world example of an imbalanced medical dataset, with a large proportion of cases classified as alive compared to dead [14]. By comparing the outcomes of ML models on these two datasets, we aim to highlight the role of feature transformation in addressing data skewness and enhancing prediction accuracy.
Several ML models were used to evaluate the impact of the Box–Cox transformation on training performance. These models were chosen for their predictive power and broad application in medical and predictive analytics [11,15]. This study shows how feature transformation techniques can be applied in healthcare analytics to improve the training results of ML models. By demonstrating this impact, the study provides a scalable framework for addressing data skewness and improving model performance in a variety of healthcare applications. This study makes the following key contributions:
This study addresses a gap in healthcare studies by applying data preprocessing with the Box–Cox transformation to both synthetic and real-world datasets, providing a framework for enhancing the reliability of AI models.
This study compares the performance of several ML models, including ensemble methods (voting and stacking), before and after applying the Box–Cox transformation, demonstrating the combined benefit of model ensembling and effective data preprocessing.
This study provides a scalable framework for integrating feature transformation techniques into breast cancer prediction and other medical datasets, contributing to the broader field of medical informatics and improving clinical decision support systems.
The remainder of this manuscript is structured as follows: Section 2 presents a review of related studies, while Section 3 offers a detailed explanation of the proposed methodology. Section 4 displays the outcomes, accompanied by a discussion highlighting the importance of the findings. Finally, the study is concluded and potential directions for future research are outlined.
2. Related Work
A recent study focused on classifying the survival status of breast cancer patients (alive or deceased) using the SEER dataset. It performed a Cox regression analysis on the SEER data to identify key prognostic factors [16] and then developed an extreme gradient boosting (XGBoost) model to predict survival outcomes in patients with bone metastatic breast cancer. Additionally, the study highlighted that combining neoadjuvant chemotherapy with surgery significantly improved overall survival and breast cancer-specific survival across multiple molecular subtypes. Model performance was strong, with AUC values of 0.818 (test) and 0.845 (training) for 1-year survival prediction and 0.798 (test) and 0.839 (training) for 3-year survival.
Another study applied both machine learning and deep learning techniques, highlighting their effectiveness in biomedical data analysis [17]. A two-step feature selection approach with a variance threshold followed by PCA was used to optimize input features. Several supervised and ensemble classifiers were evaluated, including AdaBoost, XGBoost, Gradient Boosting, Naive Bayes, and decision trees. Among these, the decision tree algorithm achieved the highest accuracy of 89%, demonstrating its strong predictive performance. The limitations of the study include potential biases in feature selection, issues with class imbalance, and limited generalizability due to reliance on a single dataset.
In addition, the study in [18] utilized data from the SEER database (2010–2019) and performed Cox regression analysis to identify prognostic factors for BCBM patients. The researchers applied ML models to predict survival rates, validating the models with a hospital cohort. The XGBoost models demonstrated high accuracy in predicting survival, with AUC values indicating strong performance over different time frames (6-month AUC = 0.824, 1-year AUC = 0.813).
A recent prognostic study targeted female survivors of first primary cancer who developed second primary breast cancer (SPBC), utilizing SEER data from 1998 to 2018. The study developed five survival prediction models, including four ML algorithms and a Cox proportional hazards model, using 16 features selected through Cox regression analyses [19]. The random survival forest model outperformed the others, achieving a time-dependent AUC of 0.805. Key predictive features included age, cancer stage, regional nodes positive, latency, radiotherapy, and surgery.
Z. Jiang et al. [20] developed an ML-based model for invasive micropapillary carcinoma (IMPC) prognosis using SEER data from 1123 patients diagnosed between 1998 and 2019, incorporating variables such as age, race, tumor site, TNM stage, and treatment status. Five ML algorithms were evaluated, with the XGBoost model outperforming the others, achieving a precision of 0.818 and an AUC of 0.863 for predicting 5-year survival. The study highlights the promising role of ML; however, it is limited by its retrospective design and lack of external validation. The study acknowledged that various confounding factors could influence survival outcomes, such as the patients' age, overall health, and specific characteristics of the tumors. These factors were not fully explored, which may have affected the robustness of the findings.
Another study analyzed data from the SEER database (2010–2018) to develop an ML model using CatBoost for predicting survival rates in metastatic breast cancer patients [21]. The CatBoost model achieved high predictive performance, with AUC values of 0.833, 0.806, and 0.810 for 1-, 3-, and 5-year survival, respectively, on the SEER dataset, and maintained excellent accuracy on an external independent dataset, with AUCs of 0.937, 0.907, and 0.890 for the same time frames.
In yet another study, the authors developed and validated predictive models to forecast breast cancer recurrence and metastasis by utilizing a large, comprehensive dataset of 272,252 records from multiple reputable sources [22]. Essential prognostic factors were identified through survival analysis, and machine learning models, including LightGBM, XGBoost, and RF, were subsequently trained and validated utilizing data from the Baheya Foundation. The performance of the AI models was robust, with LightGBM achieving an AUC of 92% in the prediction of recurrence.
Study [23] focused on predicting breast cancer by applying multiple ML techniques to a dataset of 5178 independent records, including demographic, laboratory, and mammographic data, with breast cancer present in 25% of the cases. The ML models evaluated included RF, MLP, gradient boosting trees, and genetic algorithms. Models were initially trained on 20 features from demographic and laboratory data. The results showed that the RF model outperformed the others, achieving an accuracy of 80%, coupled with a sensitivity of 95%, a specificity of 80%, and an AUC of 0.56.
Despite the promising results of various machine learning models applied to breast cancer prognosis using the SEER dataset and other biomedical data, existing studies often overlook the critical role of advanced data preprocessing techniques, such as the Box–Cox transformation, in addressing skewed and imbalanced data distributions. While prior research has predominantly focused on model selection and feature engineering, there remains a notable gap in systematically integrating and evaluating the impact of feature transformation methods to stabilize variance and normalize distributions before model training. Moreover, many studies lack comprehensive comparisons of transformation effects across diverse machine learning algorithms and datasets, particularly synthetic skewed data versus real-world clinical datasets. This gap highlights the need for more focused research on how advanced preprocessing, like the Box–Cox transformation, can enhance model performance and reliability in breast cancer prediction tasks, ultimately improving clinical decision support systems.
3. Materials and Methods
The methodology of this study is implemented as shown in the graphical overview in Figure 1. The process begins with data acquisition, which includes generating a synthetic dataset using a gamma distribution and utilizing the SEER breast cancer dataset. In the preprocessing stage, tasks such as cleaning and handling missing values, identifying outliers, and encoding categorical values into numeric representations are performed to prepare the data for analysis. The third stage, Box–Cox transformation, applies the transformation to continuous variables with different values of lambda. This transformation stabilizes variance and normalizes the data distribution, which is crucial for improving the performance of ML models. The final stage, AI model training, involves splitting the datasets into training and testing subsets. Various ML models, including LR, SVM, RF, XGBoost, the ensemble voting model, and the ensemble stacking model, are trained on the original and transformed data and then evaluated using performance metrics such as accuracy, precision, recall, and F1-score.
3.1. Dataset Description
3.1.1. Dataset Simulation Using Gamma Distribution
We generated a synthetic dataset by simulating two independent variables (x1 and x2) from a gamma distribution [24]. A fixed random seed (np.random.seed) ensures that the generated data are reproducible, allowing consistent results across multiple runs. The dataset consists of 1000 samples, as specified by the sample-size parameter. This approach creates a dataset with two positively skewed features, which is useful for testing machine learning models, applying transformations, or exploring statistical methods for non-normally distributed data.
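A minimal sketch of this data-generation step is shown below. The gamma shape/scale parameters and the threshold rule used to derive a binary label are assumptions (the exact values are not reported here); the threshold is chosen so that roughly 16% of samples fall into the minority class, in line with the 836/164 split reported in Table 1.

```python
# Sketch of the synthetic gamma-distributed dataset (parameter values assumed).
import numpy as np
import pandas as pd

np.random.seed(42)          # fixed seed for reproducible random draws
sample_size = 1000          # number of synthetic samples

# Two independent, positively skewed features drawn from gamma distributions
x1 = np.random.gamma(shape=2.0, scale=2.0, size=sample_size)
x2 = np.random.gamma(shape=3.0, scale=1.5, size=sample_size)

# Hypothetical binary target ("alive"/"dead") derived from a simple threshold
score = x1 + x2
y = (score > np.percentile(score, 84)).astype(int)   # ~16% minority class

synthetic_df = pd.DataFrame({"x1": x1, "x2": x2, "target": y})
print(synthetic_df.describe())
print(synthetic_df["target"].value_counts())
```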
3.1.2. SEER Breast Cancer Dataset
This dataset of breast cancer patients was obtained from the November 2017 update of the SEER program of the National Cancer Institute (NCI) in the United States, which provides population-based cancer statistics [25]. The dataset consisted of female patients diagnosed with infiltrating ductal and lobular carcinoma of the breast (SEER primary sites recode NOS histological codes 8522/3) during the period from 2006 to 2010. Exclusions were made for patients with indeterminate tumor size, unknown numbers of examined or positive regional lymph nodes, and a survival period of less than one month; as a result, 4024 patients were included in the final analysis.
3.2. Preprocessing
The SEER breast cancer dataset contains both categorical and continuous variables. Categorical variables represent discrete categories that classify data into distinct labels, such as race, marital status, T Stage, N Stage, 6th Stage, and Grade. Continuous variables represent numerical data on a measurable scale, such as Age, Tumor Size, Regional Node Examined, Regional Node Positive, and Survival Months. For categorical variables, label encoding (using LabelEncoder) transforms each unique value in a column into a corresponding integer, making the data suitable for machine learning models that require numeric inputs.
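A minimal sketch of this encoding step is shown below, assuming the column names quoted above (the actual SEER headers, and the file name in the commented usage, may differ).

```python
# Sketch of label encoding for the categorical SEER columns.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

categorical_cols = ["Race", "Marital Status", "T Stage", "N Stage",
                    "6th Stage", "Grade"]

def encode_categoricals(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Replace each categorical column with integer codes via LabelEncoder."""
    df = df.copy()
    for col in columns:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))
    return df

# seer_df = pd.read_csv("seer_breast_cancer.csv")   # hypothetical file name
# seer_df = encode_categoricals(seer_df, categorical_cols)
```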
3.2.1. Box–Cox Transformation
The Box–Cox transformation, traditionally a one-dimensional transformation with a single parameter (commonly denoted as λ), is applied element-wise to a vector y [26] and is computed using the following equation:

y_i(\lambda) =
\begin{cases}
\dfrac{y_i^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\
\ln(y_i), & \lambda = 0,
\end{cases}

where the transformation applies to a vector y \in \mathbb{R}^n and each element of the vector is transformed. In this study, the Box–Cox transformation was utilized twice: first on the continuous variables alone and then on all columns, using λ values of 0, 0.5, and 1.
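A minimal sketch of this step is shown below, using scipy.stats.boxcox. The column names and the +1 shift (applied because Box–Cox requires strictly positive inputs) are assumptions; passing λ = None lets scipy estimate λ by maximum likelihood, while λ = 0 corresponds to the logarithm.

```python
# Sketch of applying Box–Cox to the continuous SEER columns for several λ values.
import pandas as pd
from scipy.stats import boxcox

continuous_cols = ["Age", "Tumor Size", "Regional Node Examined",
                   "Regional Node Positive", "Survival Months"]

def boxcox_transform(df: pd.DataFrame, columns, lmbda=None) -> pd.DataFrame:
    out = df.copy()
    for col in columns:
        shifted = out[col].astype(float) + 1.0   # ensure strictly positive values
        if lmbda is None:
            out[col], _ = boxcox(shifted)        # λ estimated by maximum likelihood
        else:
            out[col] = boxcox(shifted, lmbda=lmbda)
        # λ = 0 gives log(x); λ = 1 reduces to a simple shift (x - 1)
    return out

# transformed = {lam: boxcox_transform(seer_df, continuous_cols, lam)
#                for lam in (None, 0.0, 0.5, 1.0)}
```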
Figure 2 shows the Box–Cox transformation for age and tumor size. The figure illustrates the distribution of the original and Box–Cox-transformed values for two continuous variables, age and tumor size. The top row shows histograms of the original data for age (left) and tumor size (right), highlighting their respective distributions. The bottom row depicts the distributions after applying the Box–Cox transformation. The transformation was performed with selected λ values (0, 0.5, and 1) to stabilize variance and improve data normality. For age, the transformation adjusted the scale while preserving the overall distribution. For tumor size, the transformation effectively reduced skewness, yielding a more symmetric distribution. These visualizations demonstrate the impact of the Box–Cox transformation on preprocessing continuous data for improved modeling.
3.2.2. Logarithmic Transformation
The logarithmic transformation is often used to handle skewed data by compressing large values and expanding smaller values, making the data distribution more symmetric [27]. This technique is beneficial when dealing with highly skewed data, such as tumor sizes or patient survival time, which might follow a right-skewed distribution.
Logarithmic transformation is applied to continuous variables to stabilize variance and reduce skewness in the dataset by transforming variables such as age, tumor size, number of regional nodes examined, number of regional nodes positive, and survival months. This study aims to normalize the distribution and improve the performance of ML models, which often assume normally distributed data.
The logarithmic transformation is applied to the continuous variables using the natural logarithm, with a small constant added to avoid the issue of zero values. The general form of the logarithmic transformation is

x' = \ln(x + 1),

where x is the original value of the continuous variable and the +1 is added to avoid log(0), which is undefined; zero-valued entries are therefore mapped to 0 rather than causing errors.
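A minimal sketch of this step is shown below, using numpy's log1p to implement log(x + 1); the column names are taken from the text, and clipping negative values to zero is an assumption.

```python
# Sketch of the log(x + 1) transformation for the continuous SEER columns.
import numpy as np
import pandas as pd

def log_transform(df: pd.DataFrame, columns) -> pd.DataFrame:
    out = df.copy()
    for col in columns:
        # log1p computes log(x + 1) accurately and avoids log(0)
        out[col] = np.log1p(out[col].astype(float).clip(lower=0))
    return out

# seer_log = log_transform(seer_df, ["Age", "Tumor Size",
#                                    "Regional Node Examined",
#                                    "Regional Node Positive",
#                                    "Survival Months"])
```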
Figure 3 presents histograms of two continuous variables, age and tumor size, both before and after applying the logarithmic transformation. It effectively reduces skewness in both variables, making them more suitable for modeling and improving the performance of machine learning algorithms.
3.3. Dataset Splitting
The two datasets used in this study were divided into two subsets for training and testing to develop and evaluate the models, as shown in Table 1. Dataset 1, simulated using a gamma distribution, contains 1000 rows, of which 836 are classified as alive and 164 as dead. Dataset 2, extracted from the SEER dataset, contains 4024 records, of which 3408 are classified as alive and 616 as dead. The two datasets were split into 80% training and 20% testing, maintaining the same distribution of classes within each group to ensure balanced representation in training.
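A minimal sketch of this splitting step is shown below, assuming the label column is named "Status" and that seer_df is the preprocessed DataFrame from the earlier sketches; the random seed is also an assumption.

```python
# Sketch of the 80/20 stratified train/test split described above.
from sklearn.model_selection import train_test_split

def split_dataset(df, label_col="Status", test_size=0.20, seed=16):
    X = df.drop(columns=[label_col])
    y = df[label_col]
    # stratify=y keeps the alive/dead proportions identical in both subsets
    return train_test_split(X, y, test_size=test_size, stratify=y,
                            random_state=seed)

# X_train, X_test, y_train, y_test = split_dataset(seer_df)
```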
3.4. Machine Learning Models
Various machine learning models were used to analyze breast cancer datasets and predict outcomes. The selected models represent a combination of traditional statistical methods and modern ensemble techniques, each contributing unique strengths to the analysis. These models were carefully selected to address the challenges posed by skewed and imbalanced data distributions, ensuring a comprehensive assessment of their effectiveness in breast cancer prediction tasks.
3.4.1. LR
LR is a linear statistical model commonly applied for binary classification tasks. It computes the probability of a target variable belonging to a specific class by applying a logistic (sigmoid) function to a linear combination of the input features [28], making it suitable for linearly separable data. In this study, LR is not only evaluated as an individual model but also serves as the meta-learner in the stacking ensemble. Its simplicity and interpretability make it an ideal candidate for synthesizing the outputs of more complex base models.
3.4.2. SVM
SVM is a supervised learning algorithm that constructs a hyperplane or a collection of hyperplanes within a high-dimensional space to separate distinct classes [29]. It is particularly effective in high-dimensional settings, especially when the number of features exceeds the sample size. While SVM was assessed during the model selection process with the parameter probability set to True to enable probability estimates for ensemble integration, it was ultimately omitted from the final stacking model due to performance and compatibility considerations. Nevertheless, the performance benchmark established by SVM played a crucial role in affirming the superiority of ensemble methods.
3.4.3. RF
RF is a robust ensemble learning algorithm that combines multiple decision trees through bagging (bootstrap aggregation) [30]. It enhances generalization and reduces overfitting by averaging predictions from multiple trees trained on different data subsets. In this study, the RF model was configured with n_estimators = 146 and random_state = 16 to ensure consistent results across trials. RF was used as both an individual model and a base learner in the ensemble voting and stacking classifiers. Its ability to capture complex feature interactions and handle noisy data made it a valuable component in ensemble learning strategies.
3.4.4. XGBoost
XGBoost is a powerful ensemble algorithm based on gradient-boosted decision trees [31]. It achieves high performance in classification tasks by constructing a series of trees sequentially, with each new tree correcting the errors of the previous ones. In this paper, the XGBoost model is implemented with n_estimators = 146 and random_state = 16 to ensure repeatability. The algorithm provides built-in regularization and feature importance estimates, reducing the risk of overfitting while enhancing generalization.
3.4.5. Ensemble Voting Model
An ensemble voting model is created by combining the outputs of multiple classifiers. In this experiment, the XGBoost and RF models were combined using soft voting. Soft voting averages the class probabilities predicted by each base model and selects the class with the highest average probability as the final prediction. Both base learners were configured with n_estimators = 146 and random_state = 16 to ensure consistent training behavior. By combining two different algorithms (XGBoost and RF), the ensemble leverages the strengths of both boosting and bagging, resulting in improved classification.
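A minimal sketch of this voting ensemble is shown below, using scikit-learn's VotingClassifier with the hyperparameters stated above; the training call is given only as a commented usage example.

```python
# Sketch of the soft-voting ensemble (RF + XGBoost, probabilities averaged).
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from xgboost import XGBClassifier

rf = RandomForestClassifier(n_estimators=146, random_state=16)
xgb = XGBClassifier(n_estimators=146, random_state=16)

voting_model = VotingClassifier(
    estimators=[("rf", rf), ("xgb", xgb)],
    voting="soft")          # average predicted class probabilities

# voting_model.fit(X_train, y_train)
# y_pred = voting_model.predict(X_test)
```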
3.4.6. Ensemble Stacking Model
The stacking model combines multiple base models and uses a meta-learner to aggregate their predictions [32]. The base models in this approach are the RF and XGBoost models. The LR model was chosen as the meta-learner due to its simplicity and effectiveness in aggregating outputs from diverse models. The stacking classifier was built using 5-fold cross-validation (cv = 5) to ensure robust learning from the base-model predictions without overfitting. This hierarchical structure allows the model to capture nonlinear relationships and dependencies between models, ultimately enhancing classification accuracy.
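A minimal sketch of the stacking ensemble is shown below, using scikit-learn's StackingClassifier with the base learners, meta-learner, and cv = 5 described above; the max_iter value for LR is an assumption.

```python
# Sketch of the stacking ensemble: RF and XGBoost as base learners, LR as meta-learner.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

stacking_model = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=146, random_state=16)),
        ("xgb", XGBClassifier(n_estimators=146, random_state=16)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5)                   # 5-fold CV to generate base-model predictions

# stacking_model.fit(X_train, y_train)
# y_pred = stacking_model.predict(X_test)
```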
3.5. Convergence and Generalization of the Proposed ML Algorithms
In machine learning, convergence refers to the behavior of an algorithm as it approaches a solution or optimal state, and generalization refers to the ability of a model to perform well on unseen data [33]. The convergence and generalization properties of machine learning algorithms are often tied to concepts like optimization, loss function minimization, the bias-variance tradeoff, and regularization.
Convergence in the context of ML algorithms often refers to the point at which the optimization process reaches a minimum of the objective (e.g., loss) function [34]. Many of the ML models employed in this manuscript, such as LR, SVM, RF, and XGBoost, rely on iterative optimization methods. For instance, LR converges when the gradient descent algorithm reaches the optimal weights that minimize the logistic loss function; SVM solves a convex optimization problem to find the hyperplane that maximizes the margin between classes; and XGBoost, based on gradient boosting, iteratively minimizes the residual errors, updating weights at each step. Ensemble methods such as voting do not involve an additional optimization step of their own; they aggregate predictions from base learners.
Mathematically, this convergence can often be characterized using stochastic gradient descent (SGD) for linear models or by analyzing the properties of the optimization objective function in nonlinear models like XGBoost.
Generalization refers to the ability of a model to perform well on unseen data, which is essential for ensuring the robustness of the predictive model in real-world scenarios. For ML models, ensemble learning techniques like RF and stacking, used in this manuscript, improve generalization by averaging predictions across multiple models, reducing variance, and mitigating the risk of overfitting. Ensemble models benefit from reduced variance. Stacking uses a meta-learner (here, LR) to combine the base predictions, improving generalization via the equation

\hat{y} = g\big(\hat{f}_1(x), \hat{f}_2(x), \ldots, \hat{f}_M(x)\big),

where \hat{f}_m(x) denotes the prediction of the m-th base learner and g is the meta-learner.
3.6. Evaluation
The outputs of the AI models were evaluated using standard metrics that measure the accuracy and effectiveness of predictive models, as conducted in our previous work [35]. These evaluation metrics are discussed below.
Accuracy is used to measure the proportion of correctly classified instances out of the total number of instances. It is computed using the equation

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},

where TP, TN, FP, and FN represent true positive, true negative, false positive, and false negative values, respectively.
Precision indicates the proportion of true positive predictions among all positive predictions, calculated as

\text{Precision} = \frac{TP}{TP + FP}.
Recall (sensitivity) represents the proportion of actual positives that were correctly identified by the model. It is computed as

\text{Recall} = \frac{TP}{TP + FN}.
The F1-score is the harmonic mean of precision and recall, providing a balanced measure of model performance, particularly for imbalanced datasets. It is defined as

F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.
These metrics provide a comprehensive evaluation of the AI models’ predictive capabilities, allowing for robust comparison and assessment of their performance across the datasets.
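A minimal sketch of how these metrics can be computed with scikit-learn is shown below; y_true and y_pred are assumed to be binary labels and predictions from one of the fitted models above.

```python
# Sketch of computing the four reported evaluation metrics.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred) -> dict:
    """Return accuracy, precision, recall, and F1 for binary predictions."""
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall":    recall_score(y_true, y_pred),
        "f1":        f1_score(y_true, y_pred),
    }

# evaluate(y_test, stacking_model.predict(X_test))
```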
4. Results and Discussion
This section presents detailed results and a discussion of the application of ML models under four scenarios with different data preprocessing methods. The study was applied to two different datasets and examined the effect of the Box–Cox transformation on improving prediction accuracy and overall performance metrics. By exploring these scenarios, the analysis aims to highlight the importance of feature normalization and variance stabilization in improving the results of ML models.
4.1. Scenario A: Applying AI Machine Learning Models on the Synthetic Dataset
In Scenario A, various machine learning models were applied to a synthetic dataset generated using a gamma distribution to simulate realistic data variability. The models were evaluated on the original dataset and on data transformed using the Box–Cox method with different λ values, including the case where λ was estimated automatically (None) and fixed values of 0.5 and 1. This transformation was used to stabilize variance and make the data more normally distributed, potentially improving model performance. Table 2 summarizes the performance metrics of the AI models on the synthetic dataset.
The consistent improvement in performance metrics across all models following the Box–Cox transformation highlights the critical role of data preprocessing in machine learning pipelines. The gamma distribution inherently produces skewed data, which can negatively impact models that assume or perform better on data with approximate normality. Applying the Box–Cox transformation reduced skewness, leading to more symmetrical and stabilized data distributions. The lambda parameter (λ) controls the nature of the transformation, and the results demonstrate that tuning λ can significantly affect model outcomes. When λ was set to 1, most models achieved their highest accuracies and F1-scores, indicating that this particular power transformation best normalized the dataset for these algorithms. The case where λ was None corresponds to letting the transformation parameter be estimated automatically from the data, which still yielded improved results compared to the original data but was outperformed by the fixed settings.
Among the models, ensemble methods such as RF and XGBoost showed relatively high baseline performance even on the original data, reflecting their robustness to data distribution irregularities. Nonetheless, they too benefited from the Box–Cox transformation, achieving peak performance metrics close to or exceeding 97% accuracy. This reinforces the advantage of combining robust algorithms with effective data transformation techniques.
LR and SVM, which rely more heavily on assumptions about data distribution and linear separability, exhibited the most marked improvements, emphasizing the importance of normalization and variance stabilization for these models. Overall, these results validate the integration of data transformation steps in the modeling process, especially when handling skewed or non-normal data distributions. Employing such preprocessing methods enhances model generalization and prediction accuracy, which is critical for deploying reliable AI systems in practical applications.
The results demonstrate the significant impact of the Box–Cox transformation on the performance of ML models when applied to non-normally distributed data, as shown in Figure 4. The transformation effectively reduced skewness, leading to improved variance stabilization and better feature normalization. Among the tested models, RF achieved the highest performance metrics, likely due to its reliance on the local structure of the data, which benefits from normalized feature distributions.
Figure 4 presents the distribution of data before and after applying the Box–Cox transformation for the machine learning models. The histograms illustrate the original data and Box–Cox transformed data for lambda values of 0.5 and 1. The transformation effectively reduces skewness and normalizes the data, resulting in improved feature distribution for machine learning models. Each subplot corresponds to one model, showcasing how preprocessing impacts feature values and aids in stabilizing variance for better model performance.
4.2. Scenario B: Applying AI Machine Learning Models on the SEER Dataset and Applying the Box–Cox Transformation for Continuous Values
In this scenario, ML models were applied to the SEER breast cancer dataset, with a focus on evaluating the impact of the Box–Cox transformation on continuous variables. The Box–Cox transformation is applied to stabilize variance and reduce skewness in the continuous features of the dataset, which can improve the predictive performance of machine learning models. The transformation was applied to continuous values, and the performance of several ML models was compared both with and without applying the transformation.
Table 3 provides a summary of the performance metrics of the ML models. The models were evaluated using the original data as well as after applying the Box–Cox transformation with different values of the transformation parameter λ: None (estimated automatically), 0.5, and 1.
The Box–Cox transformation significantly enhanced the predictive performance of all models, with most improvements observed when λ was set to 1. This indicates that the transformation was effective in normalizing the continuous features of the dataset, which likely improved model interpretability and accuracy.
All machine learning models demonstrated improved performance on the SEER dataset after applying the Box–Cox transformation, particularly with λ = 1. LR and SVM showed notable gains in accuracy and F1-score, highlighting improved handling of variance and skewness in continuous features. Ensemble methods like RF and XGBoost also benefited, with accuracy rising to 94.29% and 93.04%, respectively. The ensemble voting model achieved 93.42% accuracy, while the stacking model outperformed all others, reaching 94.53% accuracy and an F1-score of 94.74%. These results confirm that the Box–Cox transformation effectively enhanced model performance across the board.
The confusion matrices in Figure 5 provide a detailed overview of the ensemble stacking model's performance on the SEER breast cancer dataset, with different λ values applied to the Box–Cox transformation for continuous variables. The matrices illustrate the ensemble stacking model for each λ value, providing insight into how well the model classifies the "Alive" and "Dead" classes across varying levels of data transformation.
The confusion matrix results show that the model’s performance gradually improves with the application of the Box–Cox transformation. At λ = None, the model correctly classifies 652 individuals as alive and 76 as dead but still classifies 57 dead individuals as alive (false positives) and 20 living individuals as dead (false negatives). At λ = 0.5, the transformation reduces these misclassifications, increasing true positives to 658 and true negatives to 90 while reducing false positives to 43 and false negatives to 14. Performance peaks at λ = 1, where true positives rise to 661 and true negatives to 100, while false positives and false negatives fall to their lowest levels (33 and 11, respectively).
Figure 6 displays the performance of the ML models on the SEER breast cancer dataset, with and without the application of the Box–Cox transformation for continuous features. The chart compares the accuracy of these models across four different scenarios: Original Data, λ = None, λ = 0.5, and λ = 1. The Box–Cox transformation significantly improves model performance by stabilizing variance and reducing skewness, with the greatest improvement observed when λ = 1. LR and SVM show notable gains in accuracy, increasing from 88–89% to 92–93%. RF and XGBoost also benefit, reaching accuracies of 94.29% and 93.04%, respectively, with λ = 1. Ensemble models, particularly ensemble stacking, achieve the highest accuracy of 94.53%, demonstrating the benefit of combining multiple base models. While λ = 0.5 improves performance, λ = 1 consistently delivers the best results by reducing both false positives and false negatives, leading to more accurate classification.
4.3. Scenario C: Applying AI Machine Learning Models on the SEER Dataset Using the Logarithmic Transformation for Continuous Values
This scenario explores the impact of applying a logarithmic transformation to continuous features in the SEER breast cancer dataset.
The logarithmic transformation is commonly used to stabilize variance and reduce skewness in data, particularly for features that follow a right-skewed distribution, such as tumor size or age. The goal of this transformation is to improve the performance of machine learning models by making the data more normally distributed, which is a key assumption for many machine learning algorithms. The results of applying the ML models to the SEER dataset after the logarithmic transformation of continuous variables are summarized in Table 4. After applying the logarithmic transformation, all models showed performance improvements. LR saw a rise in precision from 87.95% to 92.46%, with minor gains in accuracy and recall. SVM showed a notable precision of 94.30%, though its accuracy and recall remained stable. RF achieved the highest accuracy of 90.68%, with 92.63% precision and a 91.35% F1-score.
XGBoost’s accuracy reached 89.81%, and its F1-score reached 90.39%. Ensemble models also showed gains, with the voting model achieving 90.36% accuracy and the stacking model achieving the highest accuracy of 90.66% and precision of 92.31%.
4.4. Scenario D: Applying AI Machine Learning Models on the SEER Dataset Using SMOTE Augmentation
In this scenario, SMOTE was applied to address the class imbalance present in the SEER breast cancer dataset. The original dataset comprised a total of 4024 samples, with a substantial disparity between the alive class (3408 samples) and the dead class (616 samples), which can negatively affect the performance and generalizability of machine learning models.
To mitigate this imbalance, the dataset was first divided into 80% training and 20% testing subsets. This resulted in 3219 samples in the training set, with 2736 alive and 438 dead cases. The SMOTE method was then applied exclusively to the training set, generating synthetic examples for the minority class (Dead) to match the number of majority class instances. This augmentation produced a balanced training dataset with 2736 samples for each class, totaling 5472 training instances.
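A minimal sketch of this step is shown below, using imbalanced-learn's SMOTE; X_train and y_train are assumed to come from the earlier splitting sketch, and the random seed is an assumption.

```python
# Sketch of the SMOTE step: oversampling is applied to the training split only,
# so the test set keeps its original (imbalanced) class distribution.
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=16)
# X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)
# After resampling, both classes have the same number of training samples:
# print(pd.Series(y_train_bal).value_counts())
```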
Table 5 summarizes the data distribution before and after the SMOTE augmentation. Notably, the testing set comprises 672 alive cases and 133 dead cases, which remained untouched to ensure unbiased evaluation of model performance on imbalanced real-world data.
As presented in Table 6, the use of SMOTE augmentation on the SEER dataset improved model performance, with the ensemble stacking model achieving the highest accuracy at 87.33%, followed closely by RF with 86.71% and ensemble voting with 86.58%. This confirms SMOTE's usefulness in addressing class imbalance by synthetically increasing the minority class.
However, as summarized in Table 3, applying the Box–Cox transformation significantly improved classification accuracy and F1-score, indicating enhanced model performance compared to SMOTE. For instance, the ensemble stacking model reached an accuracy of 94.53% when λ = 1. This demonstrates that transforming the continuous feature distributions had a more significant positive impact than merely rebalancing class proportions.
4.5. Impact of the Box–Cox Transformation on the Results of ML Models Compared to the Logarithmic Transformation
The Box–Cox transformation plays a crucial role in enhancing the performance of ML models, especially when handling skewed or non-normally distributed data. In the context of breast cancer prediction using both synthetic and real-world datasets, the Box–Cox transformation demonstrates its power in addressing common issues such as skewness and variance instability. The transformation has been systematically applied to the continuous variables of the datasets, with varying λ values to stabilize variance and normalize distributions, leading to significant improvements in model accuracy and robustness.
In the synthetic dataset, which was intentionally generated using a gamma distribution, the data naturally exhibited positive skewness, which could hinder the predictive performance of many machine learning models. By applying the Box–Cox transformation with different λ values (0, 0.5, and 1), the data’s skewness was reduced, and its variance was stabilized. This normalization created a more symmetric distribution, which is crucial for models like LR and SVM that rely on assumptions of normality in the data. The performance metrics for these models were significantly enhanced after applying the Box–Cox transformation, particularly when λ was set to 1, which provided the optimal normalization for the datasets.
The Box–Cox transformation significantly improved the performance of ML models applied to the SEER breast cancer dataset, which contains skewed and imbalanced data. By normalizing continuous variables like age, tumor size, and survival months, models such as LR, SVM, and RF showed notable performance gains. For example, LR’s accuracy improved from 88.74% to 92.42% with λ = 1. RF saw a peak accuracy of 94.29% and an F1-score of 94.56%, demonstrating the transformation’s ability to enhance model generalization.
Ensemble methods, like voting and stacking models, also benefited from the transformation, with the stacking model achieving the highest accuracy of 94.53% and an F1-score of 94.74% at λ = 1. The study highlights the Box–Cox transformation’s ability to normalize data and stabilize variance, making it a valuable preprocessing tool for healthcare applications with skewed and imbalanced datasets.
When comparing the Box–Cox transformation to the logarithmic transformation, both methods serve to normalize skewed data, but their effectiveness and flexibility differ. The logarithmic transformation is often used when data exhibits a strong right skew, particularly for variables with large ranges, such as tumor size or survival months in healthcare datasets. However, it is limited to data with strictly positive values and may not always fully stabilize variance.
On the other hand, the Box–Cox transformation is more flexible: by allowing a range of λ values, it can adapt to the specific characteristics of the data, potentially leading to more effective normalization and variance stabilization. Like the logarithm, however, it requires strictly positive inputs, which can be satisfied by shifting the data (e.g., adding a small constant).
In our study, the Box–Cox transformation provided more significant improvements in model performance, particularly for models like LR and SVM, compared to the logarithmic transformation. This is especially evident in the SEER dataset, where the Box–Cox transformation improved accuracy and F1-score more effectively than the logarithmic transformation, demonstrating its superiority in addressing skewness and variance instability in healthcare data.
Preprocessing techniques such as the Box–Cox transformation address generalization by stabilizing variance and normalizing data distributions, which in turn helps reduce both bias and variance, thereby improving generalization. The transformation improves the model's ability to learn the underlying data distribution without overfitting to noise.
For instance, ensemble models like RF and stacking naturally improve generalization by aggregating predictions from multiple weak learners. From a theoretical perspective, the generalization of ensemble methods can be analyzed using bias–variance decomposition. In this stacking model, the meta-learner (often a logistic regression) combines the predictions of multiple base learners, which helps in reducing the model’s variance and improves generalization.
4.6. Ablation Study
This ablation study focuses on evaluating various ML strategies for predicting breast cancer risk using a dataset comprising clinical and biochemical features from 166 participants [36]. The dataset includes eight features (age, BMI, serum glucose levels, insulin levels, HOMA, leptin, adiponectin, and resistin) and a classification label indicating breast cancer risk. The primary goal of this study is to assess the effectiveness of various ML algorithms in processing these features to predict the likelihood of breast cancer.
The clinical and biochemical attributes within the dataset offer valuable predictive insight into breast cancer risk, particularly when enhanced by the application of the Box–Cox transformation. The superior performance of advanced ML models following this transformation underscores its efficacy in stabilizing variance and normalizing feature distributions. Notably, variables such as BMI, glucose levels, and adiponectin emerge as influential contributors to model predictions, highlighting the necessity of integrating heterogeneous data sources to achieve comprehensive and reliable risk stratification. This is further illustrated in Figure 7, which presents a heatmap correlation matrix depicting inter-feature relationships within the breast cancer risk prediction dataset.
The results presented in Table 7 provide a comprehensive evaluation of the ML models applied, with the Box–Cox transformation, to the clinical and biochemical dataset for breast cancer risk prediction. The following analysis discusses the strengths, limitations, and implications of the performance metrics for each model, drawing comparisons to highlight the impact of different machine learning techniques in medical data analysis.
LR performed moderately, with 75% accuracy, but struggled with complex, nonlinear patterns. SVM improved upon LR, with 83.33% accuracy, handling high-dimensional data better. RF and XGBoost achieved 91.67% accuracy, excelling in feature interactions through ensemble methods. The ensemble voting model matched their performance, showcasing the strength of combining models. The ensemble stacking model slightly outperformed the voting model, with 91.88% accuracy, offering the best overall performance by effectively synthesizing diverse model outputs.
4.7. Evaluation of Comparisons in the Context of Recent Research Outcomes
This section provides a comparative evaluation of recent studies on breast cancer prediction, focusing on their methodologies and accuracy results.
Table 8 presents a collection of relevant studies that utilized different datasets and machine learning methods. These studies applied a range of techniques such as Cox regression analysis, ensemble methods, and gradient boosting to predict breast cancer survival and outcomes. The accuracies reported across these studies vary, with some achieving notable results.
In comparison, the proposed work in this study demonstrates the highest accuracy of 97.87% on a synthetic dataset and 94.53% on the SEER dataset, leveraging an ensemble model with the Box–Cox transformation. This highlights the significant impact of the Box–Cox transformation in improving the predictive performance of machine learning models, particularly in handling skewed and imbalanced datasets.
4.8. Limitations and Future Work
This study highlights two significant limitations encountered during the analysis: the limited availability of breast cancer data and the imbalanced class distribution within the dataset. These limitations have implications for the generalizability and accuracy of the findings. An additional limitation is the difficulty of selecting an optimal lambda value for certain datasets, as well as the computational cost that may arise when applying the Box–Cox transformation to large datasets. Several areas for future research have emerged, focusing on advanced learning paradigms and broader datasets, as well as exploring the model's deployment within a simulated CDSS environment with an emphasis on real-time data integration, multimodal integration, and multi-task learning strategies. Future studies may apply ensemble vision transfer learning [35] or federated learning [37] and utilize diverse datasets [38,39].
4.9. Conclusions
This study emphasizes the significant role of the Box–Cox transformation in enhancing ML models for breast cancer prediction, particularly in addressing the challenges posed by skewed and imbalanced medical datasets. By utilizing the Box–Cox transformation, the study successfully demonstrated its ability to normalize and stabilize the variance of datasets, resulting in enhanced prediction accuracy and model performance. The results indicate that the transformation, particularly with an optimal lambda value of 1, consistently improved the performance of ML models such as LR, SVM, RF, and ensemble methods like stacking and voting classifiers.
Furthermore, the findings suggest that the integration of data preprocessing techniques like Box–Cox offers a robust framework for improving the reliability and generalizability of AI models in healthcare applications.
The transformation process itself is computationally manageable, but model training, especially for ensemble methods, may demand substantial processing power, particularly for larger datasets like the SEER breast cancer data. The choice of appropriate lambda values for the Box–Cox transformation also requires careful consideration to balance performance improvement with computational efficiency. The application of the Box–Cox transformation significantly impacted both synthetic and real-world datasets, including the SEER breast cancer data, thereby reinforcing the importance of feature transformation in predictive healthcare analytics.
This work provides a valuable contribution to the field of medical informatics, offering a scalable solution to improve the performance of predictive models in breast cancer detection and potentially extending it to other medical datasets characterized by skewed distributions and class imbalances. Future research should explore the broader applicability of the Box–Cox transformation across various healthcare datasets, as well as investigate alternative preprocessing techniques that could further enhance predictive performance. Additionally, the integration of more complex machine learning architectures and the use of large, diverse datasets would help refine the approaches discussed in this study, paving the way for the development of more accurate and efficient clinical decision support systems.