Article

A Novel Approach for Predicting the Survival of Colorectal Cancer Patients Using Machine Learning Techniques and Advanced Parameter Optimization Methods

by Andrzej Woźniacki 1, Wojciech Książek 1,* and Patrycja Mrowczyk 2
1 Department of Computer Science, Faculty of Computer Science and Telecommunications, Cracow University of Technology, Warszawska 24, 31-155 Cracow, Poland
2 Oncology Clinical Department, The University Hospital in Cracow, Kopernika 50, 31-501 Cracow, Poland
* Author to whom correspondence should be addressed.
Cancers 2024, 16(18), 3205; https://doi.org/10.3390/cancers16183205
Submission received: 28 August 2024 / Revised: 17 September 2024 / Accepted: 18 September 2024 / Published: 20 September 2024

Simple Summary

Colorectal cancer remains a major health challenge with high mortality rates and increasing diagnoses among younger adults. This study introduces a new method to predict patient survival using machine learning, a technique that can greatly enhance early diagnosis and treatment. The aim of our study is to apply eight different machine learning algorithms to a large dataset of patients with colorectal cancer in Brazil and optimize these models with advanced parameter tuning tools. The best-performing models achieved around 80% accuracy in predicting survival rates over one, three, and five years, as well as overall and cancer-specific mortality. This approach promises to improve clinical decision making by providing more accurate survival predictions, ultimately supporting better patient care and management.

Abstract

Background: Colorectal cancer is one of the most prevalent forms of cancer and is associated with a high mortality rate. Additionally, an increasing number of adults under 50 are being diagnosed with the disease. This underscores the importance of leveraging modern technologies, such as artificial intelligence, for early diagnosis and treatment support. Methods: Eight classifiers were utilized in this research: Random Forest, XGBoost, CatBoost, LightGBM, Gradient Boosting, Extra Trees, the k-nearest neighbor algorithm (KNN), and decision trees. These algorithms were optimized using the frameworks Optuna, RayTune, and HyperOpt. This study was conducted on a public dataset from Brazil, containing information on tens of thousands of patients. Results: The models developed in this study demonstrated high classification accuracy in predicting one-, three-, and five-year survival, as well as overall mortality and cancer-specific mortality. The CatBoost, LightGBM, Gradient Boosting, and Random Forest classifiers delivered the best performance, achieving an accuracy of approximately 80% across all the evaluated tasks. Conclusions: This research enabled the development of effective classification models that can be applied in clinical practice.

1. Introduction

Cancer is currently one of the leading causes of death worldwide, and colorectal cancer is one of the most significant due to its high incidence and mortality rates. In 2020, nearly 2 million new cases of colorectal cancer were diagnosed, and close to a million people died from the disease [1]. Table 1 provides detailed statistics on morbidity and mortality rates by continent.
The highest number of cases has been detected in China, the USA, and Japan [2]. According to the International Agency for Research on Cancer, the number of cases is projected to increase to 3.2 million by 2040, resulting in 1.6 million deaths. These figures represent increases of 63% and 73.4%, respectively, compared with 2020 [3]. There is also a concerning trend in the rising incidence of this type of cancer among young people. Studies conducted in the USA showed that between 2004 and 2016, the percentage increase in incidence was 7.9% for the 20–29 age group, 4.9% for the 30–39 age group, and 1.6% for the 40–49 age group. An analysis of data from 1975 to 2010 estimates a 90% increase in the incidence of colorectal cancer in the 20–30 age group by 2030 [4]. According to research [5], by 2030, 15% of colorectal cancer cases will be diagnosed in young people. Unfortunately, the causes of early onset are still unknown. Participants in studies often lack typical risk factors, such as a family history of this type of cancer or the presence of polyps. However, colorectal cancer is still predominantly diagnosed in older individuals, with the average age of diagnosis being 68 for men and 72 for women. Potential risk factors include having first-degree relatives with colorectal cancer and lifestyle factors such as obesity, physical inactivity, and alcohol consumption [6]. The main treatment methods include surgical interventions, chemotherapy, radiotherapy, and immunotherapy [7].
The rising number of cancer cases is placing increasing pressure on healthcare systems worldwide. To address this challenge, modern technologies, particularly artificial intelligence, are becoming essential for supporting the diagnosis and treatment of patients. Machine learning methods have already been successfully applied in the diagnosis of various cancers, including breast cancer [8], lung cancer [9], and liver cancer [10]. Numerous scientific studies have also highlighted the potential of machine learning in the detection of colorectal cancer, with a particular focus on deep learning techniques [11].
For example, in the research from 2024 [12], convolutional neural networks (CNNs) and ranking methods were employed to diagnose colorectal cancer using histopathological data. The authors first converted the images to grayscale and applied adaptive Gaussian filtering, followed by Otsu thresholding. An 18-layer convolutional network, combined with a ranking method, was used for classification, achieving an accuracy of 91%, precision of 92%, and recall of 93%. Another approach [13] explored multispectral classification of tissues from colorectal cancer using CNN. This task involved a three-class classification: benign hyperplasia (BH), intraepithelial neoplasia (IN), and carcinoma (Ca). The research also emphasized image segmentation, achieving a classification accuracy of 99.17%, with a Jaccard similarity coefficient (JSC) of 0.86 and a dice similarity coefficient (DSC) of 0.90 for segmentation. Machine learning methods have also been applied beyond colorectal cancer diagnosis, such as in predicting mutation signatures from histopathological images and forecasting cancer recurrence [14]. In this work, the authors combined CNNs with support vector machines (SVMs). The models designed for predicting cancer recurrence achieved AUC values ranging from 0.63 to 0.93 on the test set, with similar results on the training set. This consistency indicates that the model is not overfitted and has strong generalization capabilities.
In recent years, machine learning methods have gained significant traction in the field of oncology, particularly in predicting patient survival times. This task is crucial, as it holds immense value for patients, clinicians, researchers, and decision-makers alike. Predicting life expectancy is fundamental to many clinical decisions in oncology, such as determining the likelihood of cancer recurrence or patient mortality. Traditionally, these estimates are based on the cancer’s location and stage, but the complexity of this task often involves numerous variables that surpass the analytical capabilities of a physician during a patient consultation. Advanced machine learning techniques address this challenge by analyzing complex relationships between variables, enabling more accurate life expectancy predictions [15]. For instance, in a separate investigation [16], researchers predicted the survival period of breast cancer patients using the NIH SEER dataset, which includes data from 2024 patients. The work achieved 72% accuracy in predicting two-year survival using a random forest algorithm. Comparable efforts have also been conducted for lung cancer patients [17], utilizing 12 demographic and clinical features. The best results were obtained using an 11-layer deep neural network, with a classification accuracy of 88.58%. However, when it comes to predicting the survival of patients with colorectal cancer, there is a scarcity of research in the literature. One notable exception is the recent effort in [18], where the authors conducted five types of classification: predicting one-year, three-year, and five-year survival, as well as determining whether the patient died and whether the death was cancer-related. The study used publicly available data from a hospital in São Paulo, covering the years 2000–2021. The XGBoost algorithm achieved over 74% accuracy across all five classification tasks, with AUC values exceeding 82%.
These results demonstrate the feasibility of building effective models to predict survival in patients with colorectal cancer and highlight the need for further research to improve model efficiency and generalizability.
Building on this foundation, the authors of this article aimed to develop more effective classification models for predicting the survival of colorectal cancer patients. They used an expanded dataset from São Paulo, covering the years 2000–2023 (including additional data from 2021 to 2022). The primary innovation in this work is the use of a broader range of classifiers and the application of advanced libraries to optimize classifier parameters, such as Optuna, RayTune, and HyperOpt. Proper selection of classifier parameters is crucial for achieving high classification accuracy on test sets. Additionally, our research emphasizes model explainability, which is a critical aspect of medical data analysis.

2. Materials and Methods

2.1. Dataset

The research utilized a substantial dataset comprising information on 72,961 colorectal cancer patients. This publicly available dataset, sourced from the Hospital-Based Cancer Registries of São Paulo and coordinated by the Fundação Oncocentro de São Paulo [19], spans the years 2000 to 2023. The extensive size of this dataset enables the development of robust classification models for predicting patient survival. It includes comprehensive data on both clinical and demographic factors.
Figure 1 illustrates the distribution of patients by age, highlighting that the disease predominantly affects the elderly. However, the percentage of patients aged 0–49 is 22.41%.
Figure 2 shows a survival curve indicating the percentage of patients surviving over time in months. The survival rates at one, three, and five years are highlighted on the graph, along with the specific survival percentages at those time points. At one year, the survival rate is approximately 77%; at three years, it drops to 60%; and at five years, it is approximately 54%.
The initial dataset consisted of 107 columns. Due to issues such as incomplete data, inadequate descriptions, or the irrelevance of certain features for survival prediction, 49 columns were removed. This resulted in a final dataset with 58 remaining feature columns. Table 2 details these features. The Supplementary Materials include an appendix with comprehensive information on all features in the dataset.
Columns with text data were encoded using label encoding. Columns containing missing values were imputed using the k-nearest-neighbor algorithm. Before being input into the classifiers, the data were standardized to zero mean and unit variance.
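As an illustration, the preprocessing steps described above (label encoding, nearest-neighbor imputation, and standardization) could be sketched with scikit-learn as follows; the column names and values here are hypothetical stand-ins for the registry features, not the actual dataset:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy frame standing in for the FOSP registry data; columns are illustrative.
df = pd.DataFrame({
    "EC": ["II", "III", "II", None, "IV"],      # clinical staging (text)
    "IDADE": [64.0, 71.0, np.nan, 58.0, 80.0],  # age, with a missing value
})

# 1) Label-encode text columns (missing entries kept as NaN for the imputer).
le = LabelEncoder()
mask = df["EC"].notna()
df.loc[mask, "EC"] = le.fit_transform(df.loc[mask, "EC"])
df["EC"] = df["EC"].astype(float)

# 2) Impute missing values with a k-nearest-neighbor imputer.
X = KNNImputer(n_neighbors=2).fit_transform(df)

# 3) Standardize to zero mean and unit variance before classification.
X = StandardScaler().fit_transform(X)
print(X.shape)  # (5, 2)
```

In practice, the encoder, imputer, and scaler should be fit on the training split only and then applied to the test split to avoid information leakage.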
Consistent with the approach of the study [18], several classification problems were explored based on this dataset. For this purpose, the following columns were prepared:
  • overall_death: A binary variable where a value of 1 indicates that the patient has died, and a value of 0 signifies that the patient is either still alive or under follow-up.
  • cancer_death: A binary variable similar to “overall_death” but specifically for cancer-related deaths. A value of 1 indicates that the patient died from cancer, while a value of 0 indicates that the patient died from other causes or is still alive.
  • alive_year1, alive_year3, alive_year5: Binary variables indicating whether the patient is still alive 1, 3, or 5 years after diagnosis. A value of 1 signifies that the patient is still alive, and a value of 0 indicates that the patient has died or is no longer under follow-up after the specified period.
During the experiments, the data were split into a training set comprising 75% of the data and a test set comprising the remaining 25%, which were used for the final evaluation of the model.
Table 3 displays the number of samples in each training and test set for each classification problem.
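A minimal sketch of how such binary targets and the 75/25 split might be constructed (with a hypothetical follow-up column; the real registry fields differ):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000

# Hypothetical follow-up data standing in for the registry columns.
df = pd.DataFrame({"followup_months": rng.integers(0, 120, n)})

# Binary survival targets analogous to alive_year1/3/5: a patient is labeled 1
# when known to be alive at the horizon, and 0 when they died or left
# follow-up before it (approximated here by follow-up length alone).
for years in (1, 3, 5):
    df[f"alive_year{years}"] = (df["followup_months"] >= 12 * years).astype(int)

# 75/25 train/test split, stratified on the five-year target.
train_df, test_df = train_test_split(
    df, test_size=0.25, random_state=42, stratify=df["alive_year5"]
)
print(len(train_df), len(test_df))  # 750 250
```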

2.2. Machine Learning Methods

In the conducted experiments, several widely used and highly effective classification methods were employed, including Random Forest [20] XGBoost [21], CatBoost [22], LightGBM [23], Gradient Boosting [24], Extra Trees [25], the k-nearest neighbor algorithm (KNN) [26], and decision trees [27]. These methods were optimized using various frameworks designed for tuning machine learning model parameters. The following libraries were utilized:
  • Optuna [28] is a hyperparameter optimization framework that allows for automated and efficient searching of the best parameters for machine learning models. It utilizes a process known as “define-by-run”, where the search space is dynamically constructed during execution, offering greater flexibility compared with static hyperparameter configurations. Optuna integrates advanced techniques such as Bayesian optimization and early stopping to efficiently explore the parameter space and reduce training time. Moreover, it supports both single-objective and multiobjective optimization, making it a powerful tool for complex machine learning tasks. Its simplicity and performance have made it increasingly popular for tuning machine learning models.
  • HyperOpt [29] is a widely adopted library designed for hyperparameter optimization that leverages distributed asynchronous search algorithms. It provides support for both Bayesian optimization and Tree-structured Parzen Estimator (TPE), which helps to efficiently explore hyperparameter spaces. The framework can be easily adapted to various machine learning tasks, including deep learning, and supports parallel computation, making it highly scalable. Its flexibility allows users to optimize complex objective functions, including loss functions in deep neural networks.
  • RayTune [30] is a scalable hyperparameter tuning library that integrates seamlessly with the Ray distributed computing framework. Designed for large-scale parallel hyperparameter optimization, RayTune supports advanced search algorithms, such as population-based training (PBT), asynchronous hyperband, and Bayesian optimization. Its ability to scale effortlessly across multiple GPUs or clusters makes it highly suitable for tuning complex models, including those in deep learning or reinforcement learning. RayTune also features easy integration with other popular machine learning libraries, allowing users to conduct large experiments with minimal setup.
In recent years, the three frameworks mentioned above have gained popularity for their ability to often outperform traditional optimization methods, such as grid search or random search, and frequently surpass the results of evolutionary methods like genetic algorithms. A primary objective of this work is to compare these three modern tools in the context of predicting patient survival with colorectal cancer.
Table 4 displays the parameters that were optimized for each classifier. It is evident that classifiers within the gradient-boosting category have a particularly large number of parameters, making parameter optimization techniques essential for selecting the appropriate values. Proper parameter selection is a crucial step in developing effective machine learning models.

2.3. Metrics

In this research, standard metrics derived from the confusion matrix were used and calculated on the test set. Additionally, ROC curves were generated. The formulas for the selected metrics are provided below.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = (2 · Precision · Recall) / (Precision + Recall) = 2TP / (2TP + FP + FN)
where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.
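These formulas translate directly into code; for example, a minimal helper computing all four metrics from confusion-matrix counts:

```python
# Compute the four standard metrics from confusion-matrix counts.
def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Example: 80 true positives, 70 true negatives, 20 false positives,
# 30 false negatives.
acc, prec, rec, f1 = classification_metrics(80, 70, 20, 30)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
# 0.75 0.8 0.727 0.762
```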

2.4. Experiment Schema

Figure 3 illustrates the experimental design.
The experiments were carried out using a dataset from the Hospital-Based Cancer Registries of São Paulo, coordinated by the Fundação Oncocentro de São Paulo. Initial preprocessing involved removing unnecessary columns, standardizing the data, and encoding categorical columns with label encoding. The data were then divided into a training set (75%) and a test set (25%).
Eight widely used and validated classifiers were employed: Random Forest, XGBoost, CatBoost, LightGBM, Gradient Boosting, Extra Trees, KNN, and Decision Tree. A key aspect of the experiment was the optimization of parameters using three frameworks: Optuna, RayTune, and HyperOpt. Classification accuracy and F1-score were used as the primary metrics, calculated on the test set. Detailed descriptions of the individual components of the experiment can be found in Section 2.1, Section 2.2 and Section 2.3.

3. Results

In this section, we present the results of our research. They are organized into three subsections, each corresponding to one of the parameter selection methods used. For each optimization method, the eight classifiers were optimized, and the following metrics were calculated on the test set: accuracy, precision, recall, and F1-score. To ensure robust performance evaluation, 10-fold cross-validation was applied during training, using the stratified version of K-Folds to maintain balanced class distributions across the folds. For the top-performing model within each optimization method, additional analyses, including the confusion matrix and ROC curve, were generated.
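The stratified 10-fold evaluation described above can be reproduced with scikit-learn; the snippet below uses a synthetic imbalanced dataset as a stand-in for the registry data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced stand-in for one of the survival targets.
X, y = make_classification(n_samples=500, n_features=12,
                           weights=[0.7, 0.3], random_state=0)

# Stratified folds preserve the class ratio in every fold, which keeps
# per-fold metrics comparable when classes are imbalanced.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0),
                         X, y, cv=cv, scoring="accuracy")
print(len(scores), round(scores.mean(), 3))
```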
The experiments were conducted on a computer with the following specifications:
  • CPU: AMD Ryzen 5 5600G
  • GPU: AMD Radeon RX 6600
  • RAM: 32 GB
The source code was developed in Python 3.11, using libraries such as scikit-learn [31], Pandas [32], and NumPy [33]. Detailed results, including confusion matrices and ROC curves, are provided in the Supplementary Materials because of the extensive number of experiments which could not be fully included in the main body of the article.

3.1. Optuna Optimization

In the first experiment, Optuna was employed to optimize the eight classification models described in Section 2.2: Random Forest, XGBoost, CatBoost, LightGBM, Gradient Boosting, Extra Trees, KNN, and Decision Tree. The optimized parameters and their search ranges are detailed in Table 4.
Table 5 presents the results of classification models optimized using the Optuna framework. For the “alive_year1” problem, the LightGBM algorithm achieved the highest performance, with a classification accuracy of 0.8187 and an F1-score of 0.7544. In the “alive_year3” problem, the Gradient Boosting algorithm was the most effective, achieving a classification accuracy of 0.7861 and an F1-score of 0.7811. For the “alive_year5” problem, the CatBoost algorithm performed best, with a classification accuracy of 0.8185 and an F1-score of 0.7615. Regarding the “overall_death” problem, Gradient Boosting yielded the top results, with a classification accuracy of 0.7888 and an F1-score of 0.7880. In the “cancer_death” problem, LightGBM delivered the best results, with a classification accuracy of 0.7959 and an F1-score of 0.7836. Overall, most classifiers performed well, although XGBoost was the least effective, with a classification accuracy slightly above 60%. The other boosting algorithms demonstrated similar and high prediction accuracy.

3.2. RayTune Optimization

In the second experiment, the optimization framework was switched to the RayTune library for selecting classifier parameters. The set of classifiers and the range of parameters used remained the same as in the previous section.
Table 6 presents the results obtained using the RayTune library for model optimization. The results are comparable to those achieved with the Optuna framework and are notably high. For the “alive_year1” classification problem, the CatBoost algorithm delivered the best performance, with a classification accuracy of 0.8167 and an F1-score of 0.7511. Similarly, in the “alive_year3” problem, CatBoost again achieved the top results, with a classification accuracy of 0.7834 and an F1-score of 0.7786. In the “alive_year5” problem, LightGBM emerged as the best performer, yielding a classification accuracy of 0.8164 and an F1-score of 0.7578. For the “overall_death” problem, LightGBM also produced the best results, with a classification accuracy of 0.7879 and an F1-score of 0.7871. In the “cancer_death” problem, CatBoost proved to be the most effective, achieving a classification accuracy of 0.7955 and an F1-score of 0.7835. Overall, CatBoost was the top-performing classifier for three of the problems, while LightGBM excelled in two. The classifiers optimized using RayTune demonstrated similar classification accuracy to those optimized with Optuna.

3.3. HyperOpt Optimization

In the final experiment, the HyperOpt library was used for parameter optimization, while the rest of the experimental setup remained unchanged.
Table 7 presents the results obtained using the HyperOpt library for parameter optimization. This method also produced effective classification models, with an accuracy in all problems of around 80%. For the “alive_year1” problem, the CatBoost model achieved the best performance, with a classification accuracy of 0.8189 and an F1-score of 0.7543. Similarly, for the “alive_year3” problem, CatBoost was the most effective, reaching a classification accuracy of 0.7864 and an F1-score of 0.7820. In the “alive_year5” problem, the Gradient Boosting algorithm delivered the top results, with a classification accuracy of 0.8184 and an F1-score of 0.7620. For the “overall_death” problem, Gradient Boosting also excelled, achieving a classification accuracy of 0.7907 and an F1-score of 0.7897. In the “cancer_death” problem, the LightGBM classifier achieved the highest performance, with a classification accuracy of 0.7954 and an F1-score of 0.7835. Overall, ensemble methods using boosting continued to deliver the best results, while KNN and decision tree classifiers showed notably lower performance. The classification efficiency of the models optimized with HyperOpt was consistent with the results obtained with Optuna and RayTune.

3.4. Comparison of Results Using Optuna and Grid Search

To further validate our results, we performed additional comparisons using Grid Search optimization for the best-performing models. As shown in Table 8, the accuracy obtained with Grid Search was consistently lower than the accuracy achieved with our chosen optimization methods, such as Optuna. This demonstrates the superior performance of more advanced adaptive methods, which can better navigate the complex parameter space. We limited the Grid Search comparison to the best classifiers, as an exhaustive grid search over all classifiers and parameter ranges would have been computationally prohibitive.
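For reference, a small exhaustive grid of the kind used in this comparison might be set up as follows (a toy grid, not the ranges from Table 4); the number of fits grows multiplicatively with every added parameter, which is what makes exhaustive search impractical at scale:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Hand-picked grid: 2 x 2 x 2 = 8 combinations, each evaluated with 3-fold CV.
param_grid = {
    "n_estimators": [100, 200],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 4],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Adaptive methods such as Optuna's TPE sampler instead spend their trial budget preferentially in promising regions of the space, which is why they tend to find better configurations with far fewer evaluations.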

3.5. Explainability of Models

Model explainability is crucial in designing machine learning models, particularly in medical applications. To address this, the authors incorporated model interpretability into their experiments using one of the most widely recognized methods: SHAP (SHapley Additive exPlanations) values [34]. Due to the complexity of our current experiment, we chose not to perform feature selection analyses at this stage. This will be a focus of future work, where SHAP values will play a more critical role. For now, SHAP was used primarily to highlight the key features that influenced model decisions, which is important for clinical practice and the eventual deployment of algorithms in healthcare settings.
The SHAP summary plot for overall death (Figure 4) highlights the most influential features affecting the prediction of the model. Clinical staging (EC), year of diagnosis (ANODIAG), and absence of recurrence (RECENHUM) emerge as key factors, with clinical staging showing the highest impact. Surgery and the health care provider also play moderate roles in driving the model’s predictions.
Similarly, the SHAP summary plot for cancer death (Figure 5) reveals clinical staging (EC) as the dominant predictor, along with year of diagnosis (ANODIAG) and treatment-related factors such as chemotherapy (QUIMIO) and radiotherapy (RADIO). The distributions of the SHAP values provide insight into how these characteristics influence the predicted risk of cancer-specific mortality.
The SHAP summary plots for survival predictions at different time points reveal interesting patterns in the factors that drive the models. In year 1 (Figure 6), clinical staging (EC) emerges as the dominant predictor, followed by the absence of recurrence (RECENHUM) and year of diagnosis (ANODIAG). This indicates that early survival is heavily influenced by the stage of cancer and whether a recurrence has been detected.
In year 3 (Figure 7) and year 5 (Figure 8), the plot changes slightly, and the clinical staging (EC) continues to be important, but now age (IDADE) joins the year of diagnosis (ANODIAG) as a key factor. Long-term survival appears to be more closely related to the age at which the cancer was diagnosed and the age of the patient.
The analysis highlights the evolving importance of clinical staging, diagnosis year, and patient age as key predictors over time. This interpretability not only strengthens our confidence in the model but also helps clinicians focus on the most relevant factors when considering patient treatment plans and long-term prognosis.

4. Discussion

Colorectal cancer is one of the most common cancers, characterized by high mortality rates and an increasing incidence among young adults. To improve the diagnosis and treatment of this disease, modern technologies are essential, particularly artificial intelligence and machine learning methods. The research carried out by the authors represents one of the pioneering efforts in predicting patient survival for colorectal cancer. This study used a substantial dataset from Brazil that included patient data collected between 2000 and 2023. The extensive dataset enables reliable research and the development of classification models with high generalizability and minimal overfitting.
The authors addressed five distinct problems: one-year survival (referred to as alive_year1), three-year survival (alive_year3), five-year survival (alive_year5), overall death prediction (overall_death), and cancer-specific death prediction (cancer_death). Each of these problems was framed as a separate binary classification task. A range of classifiers known for their high performance across various problems was employed, including Random Forest, XGBoost, CatBoost, LightGBM, Gradient Boosting, Extra Trees, KNN, and Decision Tree. By evaluating a diverse set of models, this study aimed to identify the most effective algorithms.
The authors also conducted preliminary work with deep learning models, such as CNNs and LSTMs. However, the initial results from these models were quite poor. It is possible that the complexity of the data, considering the number of features, is not substantial enough for deep learning models to effectively learn and provide significant improvements in performance. Additionally, incorporating these models would introduce further complexity to the study without a corresponding benefit in performance.
A significant aspect of this work was the application of three popular parameter optimization frameworks: Optuna, RayTune, and HyperOpt. Despite their widespread use, there are few studies that compare these three approaches directly. Parameter optimization is a critical stage in the development of effective machine learning models, and the authors dedicated considerable effort to this aspect. This comprehensive comparison ensures that the parameters have been optimally selected, enhancing the potential effectiveness of the models in clinical practice.
Table 9 provides a comparative analysis of the best models built using these three optimization techniques for the five classification problems.
It is evident that all three parameter optimization techniques performed exceptionally well in selecting appropriate model parameters, with only minor differences among them. This shows that several highly effective frameworks are available for optimizing the parameters of machine learning models. HyperOpt provided the best parameters for three of the problems, while Optuna excelled for two. The results of RayTune were only slightly less favorable, with differences between the frameworks being less than one percent. The best performance was achieved for predicting one-year survival, where the CatBoost classifier attained a classification accuracy of 0.8189 and an F1-score of 0.7543. This problem appears to be less complex than long-term predictions. For the three-year survival forecast, CatBoost again delivered the highest performance, with an accuracy of 0.7864 and an F1-score of 0.7820. HyperOpt was particularly effective for one-year and three-year predictions. For the prediction of five-year survival, CatBoost, optimized with Optuna, achieved the highest performance, with an accuracy of 0.8185 and an F1-score of 0.7615. In predicting overall death, Gradient Boosting, optimized with HyperOpt, performed best, with an accuracy of 0.7907 and an F1-score of 0.7897. For predicting cancer-specific death, the LightGBM model, optimized with Optuna, achieved the best results, with a classification accuracy of 0.7959 and an F1-score of 0.7836. In general, the models demonstrated high effectiveness in the five classification problems, with accuracies around 80%. The F1-scores were consistent with the accuracies, indicating robust generalization and effective class handling. Compared with previous work [18], which also used the same dataset but only covered up to 2021 and relied on Random Forest, Naive Bayes, and XGBoost, our study represents a significant advance.
We extended the dataset to include information up to 2023, explored a wider array of classifiers, and employed advanced parameter optimization frameworks. This approach led to improved results. Table 10 compares the best results from our study with those reported by Buk Cardoso et al. [18]. Furthermore, this study aligns with larger findings in the field, such as those reviewed by Kourou et al. [35], which emphasize the transformative impact of machine learning on cancer research and underscore the importance of robust and explainable models in clinical practice.
The results for each of the five problems exceed those reported in the study by [18]. According to the authors, these improvements are attributed to the use of advanced parameter optimization frameworks: Optuna, RayTune, and HyperOpt. The studies also emphasized model explainability through the use of SHAP values. Model explainability is crucial for integrating machine learning algorithms into clinical practice, as it enhances trust in these methods and encourages their adoption for diagnostic and therapeutic support across a broad spectrum of diseases. The most notable achievements of this research include the following:
  • Developing effective machine learning models for survival prediction.
  • Addressing five classification problems with eight different classifiers and three parameter optimization methods.
  • Comparing three advanced parameter optimization frameworks to determine the most effective approach.
Regarding the limitations of the research, it is important to note the following:
  • Data Scope: The dataset is limited to patients from Brazil, which may impact the model’s generalizability to other populations. Cancer outcomes can vary significantly between countries due to factors such as diet, genetic predispositions, and differences in healthcare systems. Therefore, it is essential to validate the model on datasets from different regions to ensure it can be effectively applied across diverse populations. At present, data from other countries are not readily available, but with the growing use of machine learning and big data in healthcare, we anticipate that such data will become more accessible in the near future. This would allow for cross-country comparisons and further refinement of the model to mitigate potential biases and enhance its robustness in different contexts.
  • Feature Selection: This study did not address feature selection. Future research should incorporate feature selection methods, such as Principal Component Analysis (PCA) or genetic algorithms, to further refine model performance.
  • Sample Balancing: The research did not apply oversampling techniques to balance class distributions. Implementing methods like SMOTE (Synthetic Minority Oversampling Technique) in the training set could potentially improve model effectiveness.
This study has led to the development of effective classification models for predicting the survival of patients with colorectal cancer, offering promising prospects for clinical application. Future research will focus on feature selection using algorithms such as Principal Component Analysis (PCA) and genetic algorithms, areas in which the authors have extensive experience [36]. Given the large number of features in the current study, this is a complex and multifaceted challenge, and a comprehensive exploration of feature selection methods is necessary to build the most accurate and efficient models, ensuring that all relevant dimensions of the data are thoroughly analyzed and optimized. Additionally, genetic algorithms will be explored for parameter optimization, allowing comparative studies of the Optuna, RayTune, and HyperOpt frameworks alongside evolutionary methods. The use of oversampling techniques, particularly SMOTE (Synthetic Minority Oversampling Technique), will also be considered to balance sample distributions across classes, a strategy known to enhance model performance [37]. Furthermore, the research will explore other ensemble-building techniques, such as model stacking [38].
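The stacking direction cited above ([38] describes a stacking ensemble) can be sketched with scikit-learn's StackingClassifier, combining two of the classifier families used in this study. This is an illustrative example on synthetic data, not the planned implementation.

```python
# Sketch of a stacking ensemble (cf. [38]): scikit-learn's StackingClassifier
# combines two classifier families used in this study. Synthetic data only.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=7)),
        ("gb", GradientBoostingClassifier(random_state=7)),
    ],
    final_estimator=LogisticRegression(),  # meta-learner on base predictions
    cv=3,
)
stack.fit(X_train, y_train)
stack_accuracy = stack.score(X_test, y_test)
```

The meta-learner is trained on cross-validated predictions of the base models, which limits overfitting of the ensemble to the training split.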

5. Conclusions

In this study, several effective classification models were developed to predict one-, three-, and five-year survival, as well as overall mortality and cancer-specific mortality. These models were designed using advanced machine learning parameter optimization frameworks: Optuna, RayTune, and HyperOpt. The results achieved demonstrate promising potential for applying these models in clinical practice. However, further research is required to enhance the effectiveness of these methods and ensure their optimal performance.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/cancers16183205/s1, which contains the complete dataset before preprocessing, the complete dataset used in the article after preprocessing, and results for each classifier with each optimization method (ROC curves and confusion matrices).

Author Contributions

Conceptualization, A.W. and W.K.; methodology, A.W. and W.K.; software, A.W. and W.K.; validation, A.W. and W.K.; formal analysis, A.W. and W.K.; investigation, A.W. and W.K.; resources, A.W. and W.K.; data curation, A.W. and W.K.; writing—original draft preparation, A.W., W.K., and P.M.; writing—review and editing, A.W., W.K., and P.M.; visualization, A.W. and W.K.; supervision, W.K.; project administration, W.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this research are publicly available and can be accessed at https://fosp.saude.sp.gov.br/ (accessed on 17 September 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Roshandel, G.; Ghasemi-Kebria, F.; Malekzadeh, R. Colorectal Cancer: Epidemiology, Risk Factors, and Prevention. Cancers 2024, 16, 1530. [Google Scholar] [CrossRef] [PubMed]
  2. Ferlay, J.; Ervik, M.; Lam, F.; Colombet, M.; Mery, L.; Piñeros, M.; Znaor, A.; Soerjomataram, I.; Bray, F. Global Cancer Observatory: Cancer Today (Version 1.1); International Agency for Research on Cancer: Lyon, France. Available online: https://gco.iarc.who.int/today (accessed on 12 August 2024).
  3. Eng, C.; Yoshino, T.; Ruíz-García, E.; Mostafa, N.; Cann, C.G.; O’Brian, B.; Benny, A.; Perez, R.O.; Cremolini, C. Colorectal cancer. Lancet 2024, 404, 294–310. [Google Scholar] [CrossRef]
  4. Bailey, C.E.; Hu, C.Y.; You, Y.N.; Bednarski, B.K.; Rodriguez-Bigas, M.A.; Skibber, J.M.; Cantor, S.B.; Chang, G.J. Increasing Disparities in the Age-Related Incidences of Colon and Rectal Cancers in the United States, 1975–2010. JAMA Surg. 2015, 150, 17. [Google Scholar] [CrossRef] [PubMed]
  5. Dharwadkar, P.; Zaki, T.A.; Murphy, C.C. Colorectal Cancer in Younger Adults. Hematol./Oncol. Clin. N. Am. 2022, 36, 449–470. [Google Scholar] [CrossRef] [PubMed]
  6. Hossain, M.S.; Karuniawati, H.; Jairoun, A.A.; Urbi, Z.; Ooi, D.J.; John, A.; Lim, Y.C.; Kibria, K.M.K.; Mohiuddin, A.M.; Ming, L.C.; et al. Colorectal Cancer: A Review of Carcinogenesis, Global Epidemiology, Current Challenges, Risk Factors, Preventive and Treatment Strategies. Cancers 2022, 14, 1732. [Google Scholar] [CrossRef] [PubMed]
  7. Abedizadeh, R.; Majidi, F.; Khorasani, H.R.; Abedi, H.; Sabour, D. Colorectal cancer: A comprehensive review of carcinogenesis, diagnosis, and novel strategies for classified treatments. Cancer Metastasis Rev. 2023, 43, 729–753. [Google Scholar] [CrossRef]
  8. Khalid, A.; Mehmood, A.; Alabrah, A.; Alkhamees, B.F.; Amin, F.; AlSalman, H.; Choi, G.S. Breast Cancer Detection and Prevention Using Machine Learning. Diagnostics 2023, 13, 3113. [Google Scholar] [CrossRef]
  9. Nazir, I.; Haq, I.u.; AlQahtani, S.A.; Jadoon, M.M.; Dahshan, M. Machine Learning-Based Lung Cancer Detection Using Multiview Image Registration and Fusion. J. Sens. 2023, 2023. [Google Scholar] [CrossRef]
  10. Zhang, Z.M.; Huang, Y.; Liu, G.; Yu, W.; Xie, Q.; Chen, Z.; Huang, G.; Wei, J.; Zhang, H.; Chen, D.; et al. Development of machine learning-based predictors for early diagnosis of hepatocellular carcinoma. Sci. Rep. 2024, 14, 5274. [Google Scholar] [CrossRef]
  11. Tamang, L.D.; Kim, B.W. Deep Learning Approaches to Colorectal Cancer Diagnosis: A Review. Appl. Sci. 2021, 11, 10982. [Google Scholar] [CrossRef]
  12. Karthikeyan, A.; Jothilakshmi, S.; Suthir, S. Colorectal cancer detection based on convolutional neural networks (CNN) and ranking algorithm. Meas. Sens. 2024, 31, 100976. [Google Scholar] [CrossRef]
  13. Haj-Hassan, H.; Chaddad, A.; Harkouss, Y.; Desrosiers, C.; Toews, M.; Tanougast, C. Classifications of Multispectral Colorectal Cancer Tissues Using Convolution Neural Network. J. Pathol. Inform. 2017, 8, 1. [Google Scholar] [CrossRef] [PubMed]
  14. Mazaki, J.; Umezu, T.; Saito, A.; Katsumata, K.; Fujita, K.; Hashimoto, M.; Kobayashi, M.; Udo, R.; Kasahara, K.; Kuwabara, H.; et al. Novel AI Combining CNN and SVM to Predict Colorectal Cancer Prognosis and Mutational Signatures from HE Images. Mod. Pathol. 2024, 37, 100562. [Google Scholar] [CrossRef] [PubMed]
  15. Vale-Silva, L.A.; Rohr, K. Long-term cancer survival prediction using multimodal deep learning. Sci. Rep. 2021, 11, 13505. [Google Scholar] [CrossRef] [PubMed]
  16. Naser, M.Y.M.; Chambers, D.; Bhattacharya, S. Prediction Model of Breast Cancer Survival Months: A Machine Learning Approach. In Proceedings of the SoutheastCon 2023, Orlando, FL, USA, 1–16 April 2023; IEEE: Piscataway, NJ, USA, 2023. [Google Scholar] [CrossRef]
  17. Huang, S.; Yang, J.; Shen, N.; Xu, Q.; Zhao, Q. Artificial intelligence in lung cancer diagnosis and prognosis: Current application and future perspective. Semin. Cancer Biol. 2023, 89, 30–37. [Google Scholar] [CrossRef] [PubMed]
  18. Buk Cardoso, L.; Cunha Parro, V.; Verzinhasse Peres, S.; Curado, M.P.; Fernandes, G.A.; Wünsch Filho, V.; Natasha Toporcov, T. Machine learning for predicting survival of colorectal cancer patients. Sci. Rep. 2023, 13, 8874. [Google Scholar] [CrossRef]
  19. Fundação Oncocentro de São Paulo. Available online: https://fosp.saude.sp.gov.br/ (accessed on 12 August 2024).
  20. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  21. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the KDD ’16, 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; ACM: New York, NY, USA, 2016. [Google Scholar] [CrossRef]
  22. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased boosting with categorical features. In Proceedings of the NIPS’18, 32nd International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 3–8 December 2018; pp. 6639–6649. [Google Scholar]
  23. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A highly efficient gradient boosting decision tree. In Proceedings of the NIPS’17, 31st International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 4–9 December 2017; pp. 3149–3157. [Google Scholar]
  24. Natekin, A.; Knoll, A. Gradient boosting machines, a tutorial. Front. Neurorobot. 2013, 7, 21. [Google Scholar] [CrossRef]
  25. Wang, Z.; Mu, L.; Miao, H.; Shang, Y.; Yin, H.; Dong, M. An innovative application of machine learning in prediction of the syngas properties of biomass chemical looping gasification based on extra trees regression algorithm. Energy 2023, 275, 127438. [Google Scholar] [CrossRef]
  26. Uddin, S.; Haque, I.; Lu, H.; Moni, M.A.; Gide, E. Comparative performance analysis of K-nearest neighbour (KNN) algorithm and its different variants for disease prediction. Sci. Rep. 2022, 12, 6256. [Google Scholar] [CrossRef]
  27. Ying, K.; Ameri, A.; Trivedi, A.; Ravindra, D.; Patel, D.; Mozumdar, M. Decision tree-based machine learning algorithm for in-node vehicle classification. In Proceedings of the 2015 IEEE Green Energy and Systems Conference (IGESC), Long Beach, CA, USA, 9 November 2015; IEEE: Piscataway, NJ, USA, 2015. [Google Scholar] [CrossRef]
  28. Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A Next-generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Anchorage, AK, USA, 4–8 August 2019. [Google Scholar]
  29. Bergstra, J.; Yamins, D.; Cox, D. Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, 17–19 June 2013; Proceedings of Machine Learning Research. Dasgupta, S., McAllester, D., Eds.; Volume 28, pp. 115–123. [Google Scholar]
  30. Liaw, R.; Liang, E.; Nishihara, R.; Moritz, P.; Gonzalez, J.E.; Stoica, I. Tune: A Research Platform for Distributed Model Selection and Training. arXiv 2018, arXiv:1807.05118. [Google Scholar]
  31. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  32. The Pandas Development Team. Pandas-Dev/Pandas: Pandas. 2020. Available online: https://zenodo.org/records/10957263 (accessed on 17 September 2024).
  33. Harris, C.R.; Millman, K.J.; van der Walt, S.J.; Gommers, R.; Virtanen, P.; Cournapeau, D.; Wieser, E.; Taylor, J.; Berg, S.; Smith, N.J.; et al. Array programming with NumPy. Nature 2020, 585, 357–362. [Google Scholar] [CrossRef] [PubMed]
  34. Louhichi, M.; Nesmaoui, R.; Mbarek, M.; Lazaar, M. Shapley Values for Explaining the Black Box Nature of Machine Learning Model Clustering. Procedia Comput. Sci. 2023, 220, 806–811. [Google Scholar] [CrossRef]
  35. Kourou, K.; Exarchos, K.P.; Papaloukas, C.; Sakaloglou, P.; Exarchos, T.; Fotiadis, D.I. Applied machine learning in cancer research: A systematic review for patient diagnosis, classification and prognosis. Comput. Struct. Biotechnol. J. 2021, 19, 5546–5555. [Google Scholar] [CrossRef]
  36. Pałka, F.; Książek, W.; Pławiak, P.; Romaszewski, M.; Książek, K. Hyperspectral Classification of Blood-Like Substances Using Machine Learning Methods Combined with Genetic Algorithms in Transductive and Inductive Scenarios. Sensors 2021, 21, 2293. [Google Scholar] [CrossRef]
  37. Elreedy, D.; Atiya, A.F. A Comprehensive Analysis of Synthetic Minority Oversampling Technique (SMOTE) for handling class imbalance. Inf. Sci. 2019, 505, 32–64. [Google Scholar] [CrossRef]
  38. Daza, A.; Arroyo-Paz; Bobadilla, J.; Apaza, O.; Pinto, J. Stacking ensemble learning model for predict anxiety level in university students using balancing methods. Inform. Med. Unlocked 2023, 42, 101340. [Google Scholar] [CrossRef]
Figure 1. Age distribution of patients.
Figure 2. Survival curve.
Figure 3. Experiment schema.
Figure 4. SHAP summary plots for the best classifiers of overall death.
Figure 5. SHAP summary plots for the best classifiers of cancer death.
Figure 6. SHAP summary plots for the best classifiers for Year 1.
Figure 7. SHAP summary plots for the best classifiers for Year 3.
Figure 8. SHAP summary plots for the best classifiers for Year 5.
Table 1. Colorectal cancer incidence and mortality statistics [2].

| Population | Number of Cases | Number of Deaths |
|---|---|---|
| Oceania | 22,243 | 8036 |
| Africa | 70,428 | 46,087 |
| Latin America and the Caribbean | 145,120 | 66,155 |
| Northern America | 183,973 | 73,647 |
| Europe | 538,262 | 247,842 |
| Asia | 966,399 | 462,252 |
| Total | 1,926,425 | 904,019 |
Table 2. Details of dataset.

| Name | Description | Data Type |
|---|---|---|
| SEXO | Patient's gender | int |
| IDADE | Patient's age | int |
| ESCOLARI | Education level | int |
| UFNASC | Birthplace | char |
| IBGE | Postal code | char |
| CIDADE | City of residence | char |
| CATEATEND | Health service provider | int |
| DTCONSULT | Date of first consultation | date |
| CLINICA | Diagnosis department | int |
| DIAGPREV | Diagnosis status | int |
| DTDIAG | Date of diagnosis | date |
| BASEDIAG | Basis of suspicion | int |
| TOPO | Cancer subgroup | char |
| TOPOGRUP | Cancer group | char |
| DESCTOPO | Cancer location | char |
| MORFO | Morphological code | char |
| DESCMORFO | Morphological subtype | char |
| EC | Clinical staging | char |
| ECGRUP | Staging group | char |
| T | Tumor growth | char |
| N | Spread to lymph nodes | char |
| M | Distant metastases | char |
| PT | PT scale | char |
| PN | PN scale | char |
| PM | PM scale | char |
| S | S scale | int |
| G | G scale | char |
| IDMITOTIC | Mitotic index | int |
| PSA | PSA classification | int |
| GLEASON | Gleason score | int |
| OUTRACLA | OUTRACLA code | char |
| META01 | Metastasis 1 | char |
| META02 | Metastasis 2 | char |
| META03 | Metastasis 3 | char |
| META04 | Metastasis 4 | char |
| DTTRAT | Treatment start date | date |
| NAOTRAT | Reason for no treatment | int |
| TRATAMENTO | Type of treatment | char |
| TRATHOSP | Treatment in hospital | char |
| TRATFANTES | Treatment before hospital | char |
| TRATFAPOS | Treatment after hospital | char |
| NENHUM | No treatment in hospital | boolean |
| CIRURGIA | Surgery in hospital | boolean |
| RADIO | Radiotherapy in hospital | boolean |
| QUIMIO | Chemotherapy in hospital | boolean |
| HORMONIO | Hormone therapy in hospital | boolean |
| TMO | Stem cell transplant in hospital | boolean |
| IMUNO | Immunotherapy in hospital | boolean |
| OUTROS | Other methods in hospital | boolean |
| NENHUMANT | No treatment outside hospital | boolean |
| CIRURANT | Surgery outside hospital | boolean |
| RADIOANT | Radiotherapy outside hospital | boolean |
| QUIMIOANT | Chemotherapy outside hospital | boolean |
| HORMOANT | Hormone therapy outside hospital | boolean |
| TMOANT | Stem cell transplant outside hospital | boolean |
| IMUNOANT | Immunotherapy outside hospital | boolean |
| OUTROANT | Other methods outside hospital | boolean |
| NENHUMAPOS | No treatment anywhere | boolean |
| CIRURAPOS | Surgery anywhere | boolean |
| RADIOAPOS | Radiotherapy anywhere | boolean |
| QUIMIOAPOS | Chemotherapy anywhere | boolean |
| HORMOAPOS | Hormone therapy anywhere | boolean |
| TMOAPOS | Stem cell transplant anywhere | boolean |
| IMUNOAPOS | Immunotherapy anywhere | boolean |
| OUTROAPOS | Other methods anywhere | boolean |
| DTULTINFO | Last info date | date |
| ULTINFO | Last info status | int |
| CONSDIAG | Days from consultation to diagnosis | int |
| TRATCONS | Days from consultation to treatment | int |
| DIAGTRAT | Days from diagnosis to treatment | int |
| ANODIAG | Year of diagnosis | int |
| FAIXAETAR | Age group | char |
| LATERALI | Tumor laterality | char |
| INSTORIG | Treatment location if previously diagnosed | char |
| DRS | NFZ department | char |
| RRAS | Regional health network | char |
| DTPREENCH | End of treatment date | date |
| REGISTRADO | Registration status | char |
| DTRECIDIVA | Recurrence date | date |
| RECNENHUM | No recurrence | int |
| RECLOCAL | Local recurrence | int |
| RECREGIO | Regional recurrence | int |
| RECDIST | Distant metastasis | int |
| REC01 | Metastasis type 1 | char |
| REC02 | Metastasis type 2 | char |
| REC03 | Metastasis type 3 | char |
| REC04 | Metastasis type 4 | char |
| HABILIT2 | Certification status | char |
Table 3. Distribution of data in a dataset.

| Split | Class | Overall_Death | Cancer_Death | Alive_Year1 | Alive_Year3 | Alive_Year5 |
|---|---|---|---|---|---|---|
| Train | No (Class 0) | 11,256 | 14,426 | 6819 | 14,065 | 17,704 |
| Train | Yes (Class 1) | 12,962 | 9792 | 17,399 | 10,153 | 6514 |
| Test | No (Class 0) | 3752 | 4809 | 2273 | 4688 | 5902 |
| Test | Yes (Class 1) | 4321 | 3264 | 5800 | 3385 | 2171 |
| Total | No (Class 0) | 15,008 | 19,235 | 9092 | 18,753 | 23,606 |
| Total | Yes (Class 1) | 17,283 | 13,056 | 23,199 | 13,538 | 8685 |
Table 4. Optimized parameters for classifiers.

| Name | Parameter | Range |
|---|---|---|
| RandomForest | n_estimators | 1–200 |
| | max_depth | 2–20 |
| | min_weight_fraction_leaf | 0.0–0.5 |
| | min_samples_split | 2–10 |
| | min_samples_leaf | 1–10 |
| | max_samples | 0.7–1.0, step = 0.05 |
| | criterion | 'gini', 'entropy' |
| XgBoost | n_estimators | 1–200 |
| | max_depth | 3–10 |
| | scale_pos_weight | 1–3 |
| | learning_rate | 0.01–0.3 |
| | subsample | 0.5–1 |
| | colsample_bytree | 0.5–1 |
| | gamma | 0–1 |
| | min_child_weight | 1–10 |
| CatBoost | iterations | 100–1000 |
| | depth | 2–10 |
| | learning_rate | 0.01–0.3 |
| | l2_leaf_reg | 1 × 10⁻⁵–1 |
| | border_count | 1–255 |
| | random_strength | 0–1 |
| LightGBM | num_leaves | 2–256 |
| | max_depth | 2–50 |
| | learning_rate | 0.01–0.3 |
| | n_estimators | 100–500 |
| | min_child_samples | 5–100 |
| | subsample | 0.4–1 |
| | colsample_bytree | 0.4–1 |
| | reg_alpha | 1 × 10⁻⁸–10.0 |
| | reg_lambda | 1 × 10⁻⁸–10.0 |
| GradientBoostingClassifier | max_depth | 2–50 |
| | learning_rate | 0.01–0.3 |
| | n_estimators | 100–500 |
| | subsample | 0.4–1.0 |
| | min_samples_split | 2–100 |
| | min_samples_leaf | 1–100 |
| | max_features | 'sqrt', 'log2', None |
| Extra Trees | max_depth | 2–50 |
| | n_estimators | 100–1000 |
| | min_samples_split | 2–100 |
| | min_samples_leaf | 1–100 |
| | max_features | 'sqrt', 'log2', None |
| KNN | n_neighbors | 1–20 |
| | weights | 'uniform', 'distance' |
| Decision Tree | max_depth | 2–50 |
| | min_samples_split | 2–100 |
| | min_samples_leaf | 1–100 |
| | max_features | 'sqrt', 'log2', None |
Table 5. Results of parameter optimization with Optuna.

| Problem | Classifier | Accuracy | F1-Score | Precision | Recall | AUC |
|---|---|---|---|---|---|---|
| alive_year1 | Random Forest | 0.8126 | 0.7379 | 0.7903 | 0.7160 | 0.846 |
| | XgBoost | 0.6808 | 0.5707 | 0.5823 | 0.5680 | 0.650 |
| | CatBoost | 0.8165 | 0.7503 | 0.7885 | 0.7311 | 0.850 |
| | LightGBM | 0.8187 | 0.7544 | 0.7904 | 0.7356 | 0.855 |
| | Gradient Boosting | 0.8148 | 0.7503 | 0.7835 | 0.7325 | 0.850 |
| | Extra Trees | 0.8059 | 0.7328 | 0.7748 | 0.7136 | 0.838 |
| | KNN | 0.7848 | 0.7009 | 0.7430 | 0.6840 | 0.806 |
| | Decision Tree | 0.7907 | 0.7204 | 0.7457 | 0.7065 | 0.814 |
| alive_year3 | Random Forest | 0.7803 | 0.7755 | 0.7816 | 0.7769 | 0.860 |
| | XgBoost | 0.6325 | 0.6303 | 0.6753 | 0.6621 | 0.716 |
| | CatBoost | 0.7811 | 0.7765 | 0.7753 | 0.7781 | 0.860 |
| | LightGBM | 0.7834 | 0.7788 | 0.7776 | 0.7806 | 0.865 |
| | Gradient Boosting | 0.7861 | 0.7811 | 0.7803 | 0.7821 | 0.865 |
| | Extra Trees | 0.7747 | 0.7671 | 0.7693 | 0.7655 | 0.852 |
| | KNN | 0.7360 | 0.7267 | 0.7292 | 0.7251 | 0.807 |
| | Decision Tree | 0.7516 | 0.7450 | 0.7450 | 0.7449 | 0.833 |
| alive_year5 | Random Forest | 0.8133 | 0.7461 | 0.7687 | 0.7318 | 0.883 |
| | XgBoost | 0.7315 | 0.6615 | 0.6602 | 0.6630 | 0.785 |
| | CatBoost | 0.8185 | 0.7615 | 0.7716 | 0.7534 | 0.885 |
| | LightGBM | 0.8185 | 0.7601 | 0.7722 | 0.7508 | 0.888 |
| | Gradient Boosting | 0.8149 | 0.7570 | 0.7666 | 0.7492 | 0.886 |
| | Extra Trees | 0.8151 | 0.7576 | 0.7666 | 0.7503 | 0.881 |
| | KNN | 0.7786 | 0.6783 | 0.7229 | 0.6620 | 0.825 |
| | Decision Tree | 0.8012 | 0.7361 | 0.7486 | 0.7269 | 0.858 |
| overall_death | Random Forest | 0.7889 | 0.7746 | 0.7870 | 0.7689 | 0.861 |
| | XgBoost | 0.6569 | 0.6329 | 0.6755 | 0.6420 | 0.736 |
| | CatBoost | 0.7856 | 0.7846 | 0.7845 | 0.7848 | 0.869 |
| | LightGBM | 0.7868 | 0.7857 | 0.7858 | 0.7855 | 0.870 |
| | Gradient Boosting | 0.7888 | 0.7880 | 0.7877 | 0.7884 | 0.871 |
| | Extra Trees | 0.7838 | 0.7831 | 0.7828 | 0.7836 | 0.865 |
| | KNN | 0.7528 | 0.7513 | 0.7515 | 0.7512 | 0.830 |
| | Decision Tree | 0.7585 | 0.7574 | 0.7573 | 0.7576 | 0.838 |
| cancer_death | Random Forest | 0.7860 | 0.7849 | 0.7849 | 0.7849 | 0.861 |
| | XgBoost | 0.6080 | 0.6078 | 0.6260 | 0.6276 | 0.683 |
| | CatBoost | 0.7947 | 0.7831 | 0.7902 | 0.7789 | 0.864 |
| | LightGBM | 0.7959 | 0.7836 | 0.7923 | 0.7789 | 0.867 |
| | Gradient Boosting | 0.7951 | 0.7835 | 0.7906 | 0.7793 | 0.864 |
| | Extra Trees | 0.7910 | 0.7778 | 0.7880 | 0.7726 | 0.857 |
| | KNN | 0.7602 | 0.7456 | 0.7536 | 0.7415 | 0.829 |
| | Decision Tree | 0.7722 | 0.7575 | 0.7677 | 0.7526 | 0.837 |

Rows highlighted with a green background indicate the best performing models in terms of accuracy for the respective time frame.
Table 6. Results of parameter optimization with RayTune.

| Problem | Classifier | Accuracy | F1-Score | Precision | Recall | AUC |
|---|---|---|---|---|---|---|
| alive_year1 | Random Forest | 0.7184 | 0.4181 | 0.3592 | 0.5000 | 0.727 |
| | XgBoost | 0.6753 | 0.5271 | 0.5490 | 0.5323 | 0.608 |
| | CatBoost | 0.8167 | 0.7511 | 0.7880 | 0.7322 | 0.847 |
| | LightGBM | 0.8165 | 0.7535 | 0.7852 | 0.7362 | 0.851 |
| | Gradient Boosting | 0.8106 | 0.7466 | 0.7754 | 0.7304 | 0.847 |
| | Extra Trees | 0.7219 | 0.4319 | 0.8293 | 0.5064 | 0.805 |
| | KNN | 0.7821 | 0.7058 | 0.7343 | 0.6916 | 0.792 |
| | Decision Tree | 0.7466 | 0.5823 | 0.7049 | 0.5840 | 0.730 |
| alive_year3 | Random Forest | 0.6984 | 0.6707 | 0.6995 | 0.6690 | 0.765 |
| | XgBoost | 0.6329 | 0.6287 | 0.6890 | 0.6672 | 0.733 |
| | CatBoost | 0.7834 | 0.7786 | 0.7775 | 0.7800 | 0.859 |
| | LightGBM | 0.7827 | 0.7779 | 0.7769 | 0.7793 | 0.863 |
| | Gradient Boosting | 0.7822 | 0.7772 | 0.7764 | 0.7783 | 0.862 |
| | Extra Trees | 0.6918 | 0.6412 | 0.7212 | 0.6481 | 0.808 |
| | KNN | 0.7279 | 0.7187 | 0.7205 | 0.7174 | 0.794 |
| | Decision Tree | 0.7023 | 0.6882 | 0.6946 | 0.6858 | 0.748 |
| alive_year5 | Random Forest | 0.7311 | 0.4223 | 0.3655 | 0.5000 | 0.785 |
| | XgBoost | 0.7318 | 0.6826 | 0.6751 | 0.6991 | 0.786 |
| | CatBoost | 0.8161 | 0.7577 | 0.7685 | 0.7493 | 0.878 |
| | LightGBM | 0.8164 | 0.7578 | 0.7691 | 0.7491 | 0.887 |
| | Gradient Boosting | 0.8137 | 0.7553 | 0.7650 | 0.7475 | 0.884 |
| | Extra Trees | 0.7333 | 0.4333 | 0.7772 | 0.5049 | 0.833 |
| | KNN | 0.7767 | 0.6911 | 0.7153 | 0.6784 | 0.810 |
| | Decision Tree | 0.7480 | 0.6109 | 0.6697 | 0.6023 | 0.769 |
| overall_death | Random Forest | 0.7342 | 0.7041 | 0.7391 | 0.6987 | 0.814 |
| | XgBoost | 0.6257 | 0.5646 | 0.6885 | 0.6023 | 0.699 |
| | CatBoost | 0.7860 | 0.7850 | 0.7849 | 0.7852 | 0.868 |
| | LightGBM | 0.7879 | 0.7871 | 0.7869 | 0.7873 | 0.869 |
| | Gradient Boosting | 0.7860 | 0.7849 | 0.7849 | 0.7850 | 0.869 |
| | Extra Trees | 0.7572 | 0.7525 | 0.7609 | 0.7512 | 0.835 |
| | KNN | 0.7479 | 0.7465 | 0.7467 | 0.7464 | 0.820 |
| | Decision Tree | 0.6895 | 0.6859 | 0.6883 | 0.6853 | 0.755 |
| cancer_death | Random Forest | 0.7336 | 0.7286 | 0.7359 | 0.7277 | 0.812 |
| | XgBoost | 0.4646 | 0.4180 | 0.5871 | 0.5385 | 0.704 |
| | CatBoost | 0.7955 | 0.7835 | 0.7915 | 0.7790 | 0.863 |
| | LightGBM | 0.7931 | 0.7813 | 0.7886 | 0.7770 | 0.865 |
| | Gradient Boosting | 0.7909 | 0.7794 | 0.7856 | 0.7756 | 0.861 |
| | Extra Trees | 0.7427 | 0.7048 | 0.7657 | 0.6998 | 0.836 |
| | KNN | 0.7542 | 0.7409 | 0.7461 | 0.7378 | 0.816 |
| | Decision Tree | 0.6327 | 0.5951 | 0.6124 | 0.5962 | 0.635 |

Rows highlighted with a green background indicate the best performing models in terms of accuracy for the respective time frame.
Table 7. Results of parameter optimization with HyperOpt.

| Problem | Classifier | Accuracy | F1-Score | Precision | Recall | AUC |
|---|---|---|---|---|---|---|
| alive_year1 | Random Forest | 0.8123 | 0.7366 | 0.7910 | 0.7145 | 0.847 |
| | XgBoost | 0.7069 | 0.5463 | 0.5980 | 0.5524 | 0.654 |
| | CatBoost | 0.8189 | 0.7543 | 0.7912 | 0.7352 | 0.851 |
| | LightGBM | 0.8174 | 0.7532 | 0.7880 | 0.7348 | 0.854 |
| | Gradient Boosting | 0.8157 | 0.7519 | 0.7843 | 0.7343 | 0.851 |
| | Extra Trees | 0.7203 | 0.4255 | 0.8305 | 0.5034 | 0.806 |
| | KNN | 0.7848 | 0.7009 | 0.7430 | 0.6840 | 0.806 |
| | Decision Tree | 0.7541 | 0.6263 | 0.7031 | 0.6162 | 0.763 |
| alive_year3 | Random Forest | 0.7784 | 0.7738 | 0.7726 | 0.7757 | 0.858 |
| | XgBoost | 0.6591 | 0.6586 | 0.6669 | 0.6702 | 0.708 |
| | CatBoost | 0.7864 | 0.7820 | 0.7808 | 0.7838 | 0.864 |
| | LightGBM | 0.7850 | 0.7804 | 0.7792 | 0.7820 | 0.866 |
| | Gradient Boosting | 0.7840 | 0.7791 | 0.7781 | 0.7804 | 0.863 |
| | Extra Trees | 0.6922 | 0.6432 | 0.7190 | 0.6493 | 0.809 |
| | KNN | 0.7360 | 0.7267 | 0.7292 | 0.7251 | 0.807 |
| | Decision Tree | 0.6888 | 0.6798 | 0.6802 | 0.6794 | 0.754 |
| alive_year5 | Random Forest | 0.8139 | 0.7430 | 0.7722 | 0.7263 | 0.882 |
| | XgBoost | 0.7276 | 0.6636 | 0.6597 | 0.6688 | 0.789 |
| | CatBoost | 0.8180 | 0.7617 | 0.7706 | 0.7546 | 0.884 |
| | LightGBM | 0.8174 | 0.7607 | 0.7698 | 0.7533 | 0.888 |
| | Gradient Boosting | 0.8184 | 0.7620 | 0.7712 | 0.7545 | 0.887 |
| | Extra Trees | 0.7332 | 0.4315 | 0.8187 | 0.5042 | 0.830 |
| | KNN | 0.7786 | 0.6783 | 0.7229 | 0.6620 | 0.825 |
| | Decision Tree | 0.7587 | 0.6507 | 0.6879 | 0.6380 | 0.798 |
| overall_death | Random Forest | 0.7879 | 0.7733 | 0.7864 | 0.7674 | 0.861 |
| | XgBoost | 0.6537 | 0.6221 | 0.6829 | 0.6365 | 0.740 |
| | CatBoost | 0.7860 | 0.7850 | 0.7849 | 0.7852 | 0.869 |
| | LightGBM | 0.7899 | 0.7891 | 0.7889 | 0.7895 | 0.871 |
| | Gradient Boosting | 0.7907 | 0.7897 | 0.7896 | 0.7899 | 0.870 |
| | Extra Trees | 0.7555 | 0.7511 | 0.7584 | 0.7499 | 0.831 |
| | KNN | 0.7528 | 0.7513 | 0.7515 | 0.7512 | 0.830 |
| | Decision Tree | 0.6990 | 0.6986 | 0.6989 | 0.6999 | 0.770 |
| cancer_death | Random Forest | 0.7863 | 0.7854 | 0.7852 | 0.7855 | 0.865 |
| | XgBoost | 0.6530 | 0.6513 | 0.6581 | 0.6637 | 0.730 |
| | CatBoost | 0.7944 | 0.7826 | 0.7899 | 0.7783 | 0.864 |
| | LightGBM | 0.7954 | 0.7835 | 0.7911 | 0.7792 | 0.866 |
| | Gradient Boosting | 0.7895 | 0.7780 | 0.7842 | 0.7742 | 0.862 |
| | Extra Trees | 0.7410 | 0.7014 | 0.7664 | 0.6968 | 0.835 |
| | KNN | 0.7602 | 0.7456 | 0.7536 | 0.7415 | 0.829 |
| | Decision Tree | 0.6758 | 0.6302 | 0.6718 | 0.6317 | 0.731 |

Rows highlighted with a green background indicate the best performing models in terms of accuracy for the respective time frame.
Table 8. Comparison of optimization methods for best classifiers.

| Classification Problem | Best Classifier | Accuracy | Grid Search Accuracy |
|---|---|---|---|
| alive_year1 | LightGBM | 0.8187 | 0.7235 |
| alive_year3 | Gradient Boosting | 0.7861 | 0.7441 |
| alive_year5 | CatBoost | 0.8185 | 0.7044 |
| overall_death | Random Forest | 0.7889 | 0.7122 |
| cancer_death | LightGBM | 0.7959 | 0.7412 |
Table 9. Summary of results for all parameter optimization methods.

| Problem | Metric | Optuna | RayTune | HyperOpt |
|---|---|---|---|---|
| alive_year1 | Accuracy | LightGBM 0.8187 | CatBoost 0.8167 | CatBoost 0.8189 |
| | F1-Score | 0.7544 | 0.7511 | 0.7543 |
| alive_year3 | Accuracy | Gradient Boosting 0.7861 | CatBoost 0.7834 | CatBoost 0.7864 |
| | F1-Score | 0.7811 | 0.7786 | 0.7820 |
| alive_year5 | Accuracy | CatBoost 0.8185 | LightGBM 0.8164 | Gradient Boosting 0.8184 |
| | F1-Score | 0.7615 | 0.7578 | 0.7620 |
| overall_death | Accuracy | Gradient Boosting 0.7888 | LightGBM 0.7879 | Gradient Boosting 0.7907 |
| | F1-Score | 0.7880 | 0.7871 | 0.7897 |
| cancer_death | Accuracy | LightGBM 0.7959 | CatBoost 0.7955 | LightGBM 0.7954 |
| | F1-Score | 0.7836 | 0.7835 | 0.7835 |

Rows highlighted with a green background indicate the best performing models in terms of accuracy for the respective time frame.
Table 10. Comparison of results with existing literature.

| Problem | Metric | Cardoso et al. [18] | Current Study |
|---|---|---|---|
| alive_year1 | Accuracy [%] | 77.4 | 81.89 |
| alive_year3 | Accuracy [%] | 74.7 | 78.64 |
| alive_year5 | Accuracy [%] | 77.9 | 81.85 |
| overall_death | Accuracy [%] | 77.7 | 79.07 |
| cancer_death | Accuracy [%] | 77.1 | 79.59 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
