1. Introduction
Lung cancer is one of the most common and fatal forms of cancer worldwide, causing a large number of cancer-related deaths annually and affecting every aspect of patients' lives [1]. Within the complex structure of the human pulmonary system, lung cancer develops as a hidden threat and often remains undetected until it has progressed to an advanced stage [2]. Lung cancer encompasses a spectrum of distinct forms, ranging from the aggressive small cell carcinoma to the more prevalent non-small cell lung cancer, each posing unique diagnostic and therapeutic challenges [3]. The impact is substantial, spanning social, emotional, and physical dimensions.
According to the Centers for Disease Control and Prevention, smoking is the most common risk factor for lung cancer, accounting for 80–90% of lung cancer deaths in the US [4]. Beyond smokers themselves, non-smokers exposed to secondhand smoke also face a significantly elevated risk of lung cancer [2]. Radon, a gas that originates from rocks, soil, and water, is the second leading cause of lung cancer; it can infiltrate buildings through fissures and may induce lung cancer after prolonged inhalation [1]. Radiation treatment, atypical dietary habits, and familial genetic predisposition also contribute to lung cancer-related fatalities. The literature emphasizes that risk prediction is of greater importance than clinical evaluation in lung cancer screening [1,2,3,5].
Although medical technology has made strides, lung cancer persists as a significant health challenge, mainly due to the difficulty of early diagnosis [6]; delayed detection often reduces treatment effectiveness and leads to poorer patient outcomes [6,7]. The literature has applied several mathematical models to detect and prevent diseases, facilitate timely treatment interventions, and improve health outcomes [8,9]. Early detection, before metastasis, significantly enhances the possibility of effective treatment in lung cancer cases, as early cancer detection and risk factor assessment enable the administration of appropriate therapies and preventive measures [7]. The evidence demonstrates that the inherent difficulties in early lung cancer detection result in diagnoses delayed beyond six months, substantially diminishing survival rates and rendering treatment significantly more challenging [7,8,10].
In recent years, machine learning (ML) and data mining have become reliable and essential tools in the medical field, owing to their ability to reveal hidden patterns in data and improve decision-making accuracy [8,9]. Researchers have utilized ML and soft computing techniques to identify many types of cancer at early stages using classification methods [7,8,11] and have developed advanced models for predicting cancer therapy outcomes at the onset of treatment [6,7]. However, selecting an appropriate learning algorithm is critical for precisely diagnosing lung cancer and understanding its relation to patient habits [3,5]. ML automates disease prediction, while data mining integrates ML, statistics, and database techniques to preprocess and extract meaningful patterns from large datasets [8,11]. It is therefore critical to develop and refine computational methods that can accurately predict lung cancer risk and facilitate timely intervention. To address these concerns, this study systematically evaluates and compares ML and deep learning (DL) models for lung cancer prediction, focusing on patient symptoms and lifestyle factors.
This research aims to identify the ML model with the best predictive accuracy for lung cancer, using patient features including lifestyle factors and symptoms, and to inform the development of effective diagnostic technologies that enable early diagnosis and enhance patient outcomes. The following research questions are explored to fulfill the aims of the study:
What are the predictive accuracy, precision, recall, and F-measure of various ML models for lung cancer prediction, based on patient symptoms and lifestyle factors, and which algorithm performs the best across these metrics?
How does feature selection affect the ML lung cancer detection accuracy?
Do DL methods, such as neural networks, outperform the traditional ML classifiers in lung cancer prediction?
This study’s findings are expected to clarify the applicability of ML approaches for lung cancer detection, leading to improved screening and early intervention strategies.
2. Literature Review
Lung cancer, characterized by uncontrolled cell growth in lung tissues, is a highly malignant disease and a leading cause of cancer-related deaths worldwide. The importance of early detection and accurate diagnosis in improving the patient outcomes cannot be overstated. This section reviews the recent studies on the ML and DL techniques in lung disease diagnosis, particularly lung cancer.
2.1. Machine Learning Applications in Lung Disease Diagnosis
Machine learning has become an invaluable tool in medical science, with its ability to analyze complex datasets and identify patterns that may be imperceptible to the human eye. In the context of lung disease, ML algorithms have been employed for various tasks, including risk prediction, early detection, and the classification of disease subtypes. Maurya et al. [1] conducted a comparative analysis of several machine learning algorithms for lung cancer prediction and identified the K-Nearest Neighbor (KNN) and Naive Bayes (NB) models as the most effective methods for early lung cancer prediction. Khanam and Foo (2021) [11] compared ML algorithms for diabetes prediction; although focused on diabetes, this study provides valuable insights into the comparative performance of ML models, which is relevant to selecting appropriate algorithms for lung disease diagnosis.
Protić et al. (2023) [12] explored numerical feature selection in the ML-based detection of anomalies, demonstrating that methodologies such as feature selection are critical in developing accurate and efficient diagnostic models. Dudáš (2024) [13] investigated the graphical representation of data prediction potential using correlation graphs and chains, showing that visualizing data relationships can aid in understanding the underlying factors that contribute to lung diseases. Patra (2020) [14] utilized ML classifiers such as Support Vector Machine (SVM), Logistic Regression (LR), Random Forest (RF), Decision Trees (DT), NB, and KNN to predict lung cancer; the proposed RBF classifier achieved an accuracy of 81.25%. Radhika et al. (2019) [15] performed a comparative study of lung cancer detection using ML algorithms, in which the proposed ensemble classifier yielded the best classification performance, with SVM second best and NB the least effective. Faisal et al. (2018) [16] evaluated machine learning classifiers and ensembles for the early-stage prediction of lung cancer; the Gradient-Boosted Tree classifier achieved 90% accuracy, outperforming all the other individual and ensemble classifiers.
2.2. DL Applications in Lung Disease Diagnosis
DL, a subfield of artificial intelligence, has emerged as a powerful tool for analyzing medical images and other complex data in lung disease diagnosis. Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have shown promise in this field. Esteva et al. (2021) [17] provided a comprehensive overview of deep learning-enabled medical computer vision, highlighting its applications in various medical domains, including lung disease, and noting that the success of ML models hinges on thorough data assessment, proactive planning for model limitations, active community participation, and the cultivation of trust. Alzubaidi et al. (2021) [8] reviewed DL concepts, CNN architectures, challenges, applications, and future directions, offering a valuable foundation for understanding the deep learning methodologies used in lung disease diagnosis.
Vieira et al. (2021) [18] used a data mining approach with Artificial Neural Networks (ANNs) to classify lung cancer cases; applying multiple data mining models within RapidMiner, they identified smoking as the principal risk factor for lung cancer. Chakraborty et al. (2024) [6] reported that the implementation of ML and DL in real-world medical settings is nascent and faces significant challenges: these technologies require further development, rigorous validation, and scalable implementation to be effectively utilized in healthcare. Liu et al. (2023) [9] explored how medical image processing and biomedical models enhance image quality, enabling doctors to better visualize lesion areas.
Table 1 illustrates the significant role of ML and DL in improving the accuracy and efficiency of lung disease diagnosis.
2.3. Research Deficiencies and Prospective Avenues
The efficacy of lung cancer prediction extends beyond ML model selection and necessitates integrating patient-specific information, including symptoms, lifestyle habits, and genetic predispositions. Smoking is the primary cause of lung cancer, responsible for 80–90% of occurrences in the United States [1,2]. Therefore, ML models must include these characteristics to improve their predictive accuracy.
Despite the effectiveness of ML and DL models in lung cancer prediction, challenges remain. Many studies rely on small or imbalanced datasets, hindering the result generalizability. Additionally, the computational complexity and large data requirements of deep learning models pose obstacles to clinical application.
Future investigations need to include feature selection to pinpoint the most relevant features and explainable ML and DL models to enhance the transparency in decision making. Moreover, using extensive datasets helps improve the model resilience and relevance across distinct patient demographics.
3. Materials and Methods
This section details the methodology employed to systematically evaluate ML and DL models for lung cancer prediction. A rigorous and well-defined methodology is crucial to ensure the validity, reliability, and reproducibility of the research findings. The overall workflow, depicted in Figure 1, encompasses data acquisition and preprocessing, feature selection, model development, and performance evaluation [19]. This structured process enables a transparent and systematic comparison of the models’ predictive capabilities.
3.1. Dataset Description and Preparation
3.1.1. Dataset Source and Characteristics
The study utilized a lung cancer prediction dataset obtained from Kaggle, an online platform for data scientists and ML practitioners operated by Google LLC. The dataset comprises clinical features, offering a complementary approach to studies focused primarily on imaging data and CNNs [17,20]. Unlike some previous work with smaller sample sizes [14], this dataset includes information on 309 patients with 16 unique attributes (Table 2), providing a robust foundation for analysis. The dataset’s dependent variable is the “Lung_Cancer” attribute, with the remaining fifteen attributes serving as independent predictors. The “Lung_Cancer” attribute uses two categories: 1 represents patients with lung cancer and 2 represents patients without it.
A detailed statistical analysis, including descriptive statistics, was conducted to understand the dataset’s distributions, central tendencies, and variabilities. This analysis is essential for identifying potential biases, informing the preprocessing steps, and interpreting the model performance.
Table 2 presents the dataset’s characteristics, detailing each attribute (feature), including its data type and basic descriptive statistics such as the mean and standard deviation for numeric attributes. This table is essential for understanding the raw data before any preprocessing or analysis.
The data were first analyzed by examining the distribution of positive and negative cases across Age and Gender.
Figure 2 indicates that most patients’ distribution ranged between the ages of 55 and 75.
Weka (Waikato Environment for Knowledge Analysis) was utilized in this study to automate several ML tasks. Weka provides pre-built implementations of various machine learning algorithms and tools for data preprocessing, making it efficient for functions such as duplicate record removal, missing value imputation, outlier identification and removal, feature selection, data normalization, and the training and evaluation of machine learning classifiers [11].
The utilization of Weka facilitated the standardized and efficient implementation of the common ML workflow steps. However, NN models were custom implemented using Python 3.11 along with the Keras and TensorFlow libraries, providing greater flexibility in designing and training the deep learning architectures. This hybrid approach allowed us to leverage the strengths of Weka’s automation and Python’s customizability.
The NN models were run in Python within a Jupyter Notebook 7.2.2 environment.
3.1.2. Preprocessing of the Dataset
Data preprocessing is a critical step to transform raw data and improve the accuracy of ML models. The following preprocessing steps were applied to the dataset:
Duplicate Record Removal: Weka’s “RemoveDuplicates” filter identified and removed 33 duplicate records, leaving 276 data points.
Missing Value Handling: Weka’s “ReplaceMissingValues” filter was used to check for missing values; unlike studies that apply simple mean imputation, this approach ensures more robust handling of missing data. No missing values were found in the dataset.
Outlier Detection and Removal: Weka’s “InterquartileRange” filter was used to identify and remove outliers and extreme values. No outliers were detected.
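The study ran these steps in Weka; for readers working in Python, the pipeline can be approximated with pandas. This is a minimal sketch on a toy table, not the actual dataset: the column names, values, and the 1.5×IQR outlier factor are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for the Kaggle dataset; column names and values are illustrative.
df = pd.DataFrame({
    "AGE": [63, 63, 70, 55, 200],   # 200 is an artificial extreme value
    "SMOKING": [1, 1, 2, 2, 1],
    "LUNG_CANCER": [1, 1, 1, 2, 1],
})

# 1. Duplicate record removal (analogue of Weka's "RemoveDuplicates" filter)
df = df.drop_duplicates()

# 2. Missing value check (the study found none)
assert df.isna().sum().sum() == 0

# 3. Outlier removal via the interquartile range
#    (analogue of Weka's "InterquartileRange" filter, using a 1.5*IQR fence)
q1, q3 = df["AGE"].quantile(0.25), df["AGE"].quantile(0.75)
iqr = q3 - q1
df = df[df["AGE"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
print(len(df))  # rows remaining after deduplication and outlier removal
```

On the real data these steps reduced 309 records to 276, with no missing values or outliers remaining.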
3.2. Feature Selection Method
Pearson’s correlation technique was employed to identify the most relevant attributes for lung cancer prediction. This method was chosen for its simplicity and effectiveness in quantifying linear relationships between attributes and the target variable. The correlation coefficient, ranging from −1 to 1, indicates the strength and direction of the linear relationship: a value of zero signifies no linear relationship, while an absolute value of 0.5 or higher suggests a strong association. Weka’s “correlation” filter was used to determine the correlation coefficient between the input and output attributes.
A threshold of 0.15 was set to determine the relevant attributes. Consequently, Anxiety, Chronic Disease, Age, Shortness of Breath, and Smoking were excluded, while Coughing, Wheezing, Alcohol_Consuming, Swallowing_Difficulty, and Allergy were retained as the five most essential input attributes.
Table 3 shows the correlation coefficients.
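The thresholding procedure can be sketched in NumPy. The synthetic features below are illustrative stand-ins (one constructed to correlate with the target, one independent of it), not the study's data; only the 0.15 threshold comes from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 276  # sample count after preprocessing

# Synthetic binary target (1 = lung cancer, 2 = no lung cancer) and two
# illustrative features; names follow the paper, values are made up.
y = rng.integers(1, 3, size=n)
features = {
    "ALLERGY": y + rng.normal(0.0, 1.0, n),          # built to correlate with y
    "SHORTNESS_OF_BREATH": rng.normal(0.0, 1.0, n),  # roughly independent of y
}

THRESHOLD = 0.15  # inclusion threshold used in the study
selected = []
for name, x in features.items():
    r = np.corrcoef(x, y)[0, 1]  # Pearson's r between feature and target
    if abs(r) >= THRESHOLD:
        selected.append(name)
print(selected)
```

Applying this rule to the real correlation coefficients (Table 3) retains Coughing, Wheezing, Alcohol_Consuming, Swallowing_Difficulty, and Allergy.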
3.3. Data Normalization Techniques
To improve the computational efficiency of the algorithms, min-max scaling was used to normalize the data to the range of 0 to 1. The “Normalize” filter in Weka was applied for this purpose.
Table 4 presents the mean and standard deviation of the attributes after normalization.
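Min-max scaling maps each attribute x to (x − min) / (max − min). A small self-contained sketch of the transformation applied by Weka's "Normalize" filter (the sample ages are illustrative):

```python
import numpy as np

def min_max_scale(x):
    """Rescale a 1-D array to the [0, 1] range (min-max normalization)."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo)

age = np.array([55, 60, 75])   # illustrative values
scaled = min_max_scale(age)
print(scaled)  # [0.   0.25 1.  ]
```

After this step, every attribute lies in [0, 1], which keeps features on a comparable scale and speeds up gradient-based training.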
Figure 3 shows the correlation between the input and output attributes post-preprocessing. It illustrates a correlation coefficient of 0.33 between “Allergy” and “Lung_Cancer”, indicating the highest correlation.
3.4. Experimental Setup for Testing and Training the Data
The preprocessed dataset, now consisting of 276 samples (38 without lung cancer and 238 with lung cancer), was divided into 80% for training and 20% for testing. To ensure a robust estimate of model performance, 10-fold cross-validation was also employed. Stratified k-fold cross-validation was used to maintain the class distribution in each fold, which is particularly important for imbalanced datasets [11,21]. In k-fold cross-validation, one fold is used for testing and the remaining k−1 folds for training; this process is repeated until each fold has served as the test set, and the model’s efficacy is assessed as the mean of the outcomes across all k iterations.
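The splitting scheme can be reproduced with scikit-learn. The feature matrix below is random filler with the study's sample counts; the random seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold

rng = np.random.default_rng(1)
X = rng.normal(size=(276, 5))           # 276 samples, 5 selected features (filler values)
y = np.array([1] * 238 + [2] * 38)      # imbalanced classes, as in the dataset

# 80/20 split, stratified so both partitions preserve the class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Stratified 10-fold cross-validation on the training portion:
# each fold keeps roughly the same cancer/no-cancer ratio as the whole set
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
fold_sizes = [len(test_idx) for _, test_idx in skf.split(X_tr, y_tr)]
print(len(X_te), fold_sizes)
```

Stratification matters here because only about 14% of samples are negatives; an unstratified fold could easily contain almost none of them.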
3.5. Development and Execution of a Classification Model
3.5.1. Machine Learning Models
Several ML algorithms, namely DT, RF, AdaBoost (AB), KNN, LR, NB, and SVM, were implemented using Weka with default parameter settings to provide a baseline comparison. For the KNN model, K was set to 7.
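Since the classifiers were run with default settings, the comparison can be approximated in scikit-learn. This is a sketch, not the study's exact Weka configuration: the synthetic dataset and scikit-learn defaults are illustrative stand-ins; only K = 7 for KNN comes from the paper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in with the study's sample count and feature count
X, y = make_classification(n_samples=276, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

models = {
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "AB": AdaBoostClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=7),  # K = 7, as in the study
    "LR": LogisticRegression(max_iter=1000),
    "NB": GaussianNB(),
    "SVM": SVC(),
}

# Mean accuracy over 10-fold cross-validation for each baseline classifier
results = {}
for name, model in models.items():
    results[name] = cross_val_score(model, X, y, cv=10).mean()
print(results)
```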
3.5.2. Development of Deep Learning Models
Neural Network (NN) models with one, two, and three hidden layers were developed using Python 3.11 with the Keras and TensorFlow libraries. The NN models were trained with different numbers of epochs (200, 400, and 800) to assess the impact of the training duration on the performance. ReLU activation functions were used in the hidden layers, and a sigmoid function was used in the output layer for binary classification. Learning rates of 0.1, 0.01, and 0.005 were tested to determine the optimal value.
The NN models were developed using the Sequential class from Keras, where the optimizer minimizes the output error during backpropagation. Stochastic Gradient Descent (SGD) was employed as the optimization technique, with the learning rate controlling how strongly the weights are updated based on the loss gradient.
We conducted experiments with various learning rates to determine the optimal value. The “train_test_split” function from the scikit-learn package was used to partition the training and test data. For k-fold cross-validation, we utilized the “cross_val_score” function from scikit-learn, with the “StratifiedKFold” methodology to address the binary nature of the target variable.
Table 5 details the architectures of the NN models.
The single-hidden-layer NN model was constructed incorporating an input layer, a hidden layer, and an output layer. Eight neurons in the input layer represent the eight attributes. The hidden layer consists of a single neuron with the ReLU activation function, and the output layer consists of one neuron with a sigmoid activation function for binary classification.
A four-layer NN model was constructed. The input and output layers (first and fourth) maintained identical input shapes, neuron counts, and activation functions as the single-hidden-layer model. The second layer comprises 41 hidden neurons, and the third layer comprises eight hidden neurons, with all the neurons within these layers utilizing the ReLU activation function.
A five-layer NN model was also constructed. Its input and output layers (first and fifth) maintained identical input shapes, neuron counts, and activation functions as the single-hidden-layer Neural Network (NN). The second, third, and fourth hidden layers consisted of 25, 16, and 8 neurons, respectively, all using the ReLU activation function.
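The architectures described above (ReLU hidden layers, sigmoid output, SGD optimizer) can be sketched with the Keras Sequential API. This is a minimal illustration under stated assumptions, not the authors' exact code: the feature count is parameterized (five is used here, matching the selected attributes, although the paper's description of the input layer mentions eight neurons), and the builder function name is ours.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_nn(hidden_sizes, n_features=5, learning_rate=0.1):
    """Sequential NN mirroring the paper's setup: ReLU hidden layers,
    a sigmoid output neuron for binary classification, SGD optimizer.
    hidden_sizes examples: (1,), (41, 8), (25, 16, 8)."""
    model = keras.Sequential([keras.Input(shape=(n_features,))])
    for units in hidden_sizes:
        model.add(layers.Dense(units, activation="relu"))
    model.add(layers.Dense(1, activation="sigmoid"))
    model.compile(
        optimizer=keras.optimizers.SGD(learning_rate=learning_rate),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model

one_layer = build_nn((1,))            # single-hidden-layer model
three_layer = build_nn((25, 16, 8))   # three-hidden-layer model
print(one_layer.count_params(), three_layer.count_params())
```

Training would then proceed with `model.fit(X_tr, y_tr, epochs=200)` (or 400/800), matching the epoch counts examined in the study.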
3.5.3. Selection of Different Learning Rates
The learning rates of 0.1, 0.01, and 0.005 were selected based on a combination of common practice and preliminary experimentation. These values represent a range from relatively large (0.1) to small (0.005), allowing us to observe the impact of the learning rate magnitude on the model convergence and performance. While a more complete hyperparameter tuning method could be employed, this initial range provided a reasonable starting point for evaluating the model’s sensitivity to the learning rate.
3.6. Model Evaluation
The model performance was evaluated using accuracy, precision, recall, and the F1-score. These metrics provide a comprehensive assessment of the models’ classification capabilities. The following formulas were used for the evaluation metrics:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 × (Precision × Recall) / (Precision + Recall)
where TP = true positive; TN = true negative; FP = false positive; FN = false negative.
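These metrics follow directly from the confusion-matrix counts. A short sketch (the counts below are illustrative, not results from the study):

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts only
acc, prec, rec, f1 = classification_metrics(tp=50, tn=40, fp=5, fn=5)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
```

Note that when precision and recall are equal, as in this example, the F1-score equals both.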
4. Results
4.1. Performance of the Machine Learning Algorithms
Table 6 shows the confusion matrices for all the classifiers used in the study, which were evaluated via 10-fold cross-validation (K = 10) and train/test splitting.
Table 7 displays the accuracy of all the classification algorithms used in the study, demonstrating that all of them exceeded 85%. Moreover, DT, NB, and KNN (K = 7) exhibited superior accuracy in the testing methodologies.
Figure 4 shows the ROC curve for the KNN model, which demonstrated the highest accuracy in the train/test split evaluation and the K-fold cross-validation (Table 7).
The different colors in the ROC curve represent the classifier's performance on each fold during the cross-validation process (Figure 4). Each colored line corresponds to the ROC curve obtained when one fold of the data was used as the test set and the remaining folds were used for training. This visualization shows the variability in the classifier's performance across different subsets of the data. The Area Under the ROC Curve (AUC), displayed at the top of the plot (AUC = 0.9149 in this case), summarizes the overall performance.
Figure 5a–d and Figure 6a–d present the performance of all the classifiers based on four metrics, Precision, Recall, F-measure, and Accuracy, evaluated using 10-fold cross-validation (K = 10) and the train/test splitting method.
4.2. Findings of the NN Model
In the NN model with a single hidden layer and 200 epochs, we conducted experiments with learning rates of 0.1, 0.01, and 0.005, as presented in Table 8. A learning rate of 0.1 yielded the best accuracy and was therefore used in all subsequent cases.
Table 9 shows the effect of epoch count on the NN model with one, two, and three hidden layers, using a learning rate of 0.1. The NN model featuring a single hidden layer and trained for 800 epochs attained the highest accuracy of 92.86%, demonstrating superior accuracy in both training and testing compared to all the other neural network models examined.
Figure 7 illustrates the ROC curve for a model using one hidden layer trained over 800 epochs. We computed the accuracy of the single-hidden-layer NN model over 800 epochs using 10-fold cross-validation (K = 10). The model achieved an average accuracy of 90%. This finding provides strong evidence of the one-hidden-layer NN model’s effectiveness in lung cancer prediction. It highlights the potential of DL techniques to improve diagnosis and the necessity of further exploring model architectures and hyperparameters for optimal performance.
5. Discussion
Given the high mortality rate of lung cancer and the importance of early, precise identification, this research evaluates the predictive efficacy of several ML and DL models for lung cancer detection using patient symptoms and lifestyle factors. This research compared traditional ML classifiers—DT, KNN, RF, NB, AB, LR, and SVM—with Neural Network (NN) models to determine the most effective approach.
5.1. Evaluation of the Machine Learning Models
The findings demonstrate that DT, NB, and KNN attained superior classification accuracy, surpassing 85% in both the K-fold cross-validation (K = 10) and train/test splitting methodologies. The strong performance of these models indicates their suitability for lung cancer classification, presumably owing to their capacity to manage the intricate connections among features. The findings correspond with previous research that has shown the efficacy of these models in lung cancer prediction [3,14].
RF and AB also showed strong performance, with accuracy rates similar to DT, NB, and KNN. However, SVM and LR had slightly lower accuracy, likely due to their sensitivity to feature selection and hyperparameter optimization.
5.2. Efficacy of NN in Lung Cancer Diagnosis
The DL models surpassed the conventional ML classifiers, with a single-hidden-layer NN trained for 800 epochs attaining a maximum accuracy of 92.86%. This result emphasizes the capability of DL to enhance lung cancer detection, since NNs can identify intricate nonlinear patterns within the dataset, yielding improved classification accuracy relative to the conventional ML models.
The research investigated several NN topologies, including single-, dual-, and triple-hidden-layer models. The single-hidden-layer NN with a learning rate of 0.1 and extended training (800 epochs) consistently surpassed the deeper models, indicating that adding hidden layers did not improve performance on this dataset and that model optimization and hyperparameter adjustment are essential for balancing accuracy and computational efficiency.
5.3. Influence of Data Preprocessing and Feature Selection
The results indicated that the preprocessing methods significantly improved the classification outcomes, underscoring the need for systematic data preparation in ML applications. Pearson’s correlation was employed for feature selection, revealing coughing, wheezing, alcohol consumption, dysphagia, and allergies as the most pertinent predictors of lung cancer. Removing the less significant features like anxiety, chronic illness, and shortness of breath improved the model performance by reducing the dimensionality and eliminating noise. Furthermore, data standardization and outlier elimination improved the model efficacy by establishing a consistent data distribution, resulting in more precise predictions.
5.4. Comparative Analysis with Prior Studies
The outcomes of this study corroborate prior research underscoring the efficacy of ML and DL models in lung cancer diagnosis (Table 10). Prior research has shown that ML classifiers, namely DT, NB, and RF, attain accuracy rates beyond 85% when applied to medical datasets [1,14,16]. Patra [14] achieved an accuracy of 81.25% using SVM, among other algorithms, while Gultepe [22] reported 83% accuracy using KNN, Naïve Bayes, and Decision Trees. Maurya et al. [1] reached a maximum of 91.07% with Naïve Bayes, and Faisal et al. [16] demonstrated 90% accuracy using Gradient-Boosted Trees. These findings illustrate a gradual improvement in prediction accuracies over time as model architectures and feature engineering techniques have advanced.
The current study notably surpasses these benchmarks. Its NN model, featuring a single hidden layer trained over 800 epochs, achieves an accuracy of 92.86%. This outcome underscores the effectiveness of DL architectures, particularly when paired with rigorous data preprocessing and optimal hyperparameter tuning strategies like learning rate and epoch selection.
Importantly, this study further illustrates the advantage of DL in predictive accuracy, in contrast to past studies that focused only on classical ML models [16,22]. The findings indicate that a well-tuned NN surpasses the classical classifiers in lung cancer detection, making it a significant asset for early diagnosis based on symptomatic and lifestyle data.
6. Conclusions
This study investigated the efficacy of ML and DL techniques for lung cancer prediction, addressing the critical need for early and accurate diagnosis to improve patient outcomes. The research provides a comparative analysis of various ML algorithms (DT, RF, KNN, NB, AB, LR, and SVM) and NN models for lung cancer prediction using a dataset of patient symptoms and lifestyle factors. It shows how a rigorous data preprocessing pipeline, including outlier removal, normalization, and feature selection with Pearson’s correlation, improves model prediction performance and reduces noise. The study identifies coughing, wheezing, alcohol consumption, swallowing difficulty, and allergies as the key predictive features for lung cancer, highlighting significant risk indicators. It evaluates the performance of NN models with varying architectures (one, two, and three hidden layers) and training parameters, determining the optimal configuration for lung cancer prediction.
Several ML models, including DT, NB, and KNN, achieved high classification accuracy (above 85%), demonstrating their potential for effective lung cancer prediction. DL models, notably a single-hidden-layer NN trained for 800 epochs, outperformed the traditional ML classifiers, reaching a maximum accuracy of 92.86% and highlighting the superior capability of NNs to capture complex patterns in the data.
7. Limitations and Future Study
Despite the promising findings, a few limitations should be acknowledged. First, the dataset used was small, potentially affecting the generalizability of the results to broader groups. The research also concentrated on a restricted array of parameters, and the inclusion of additional clinical and genetic data may boost the models’ prognostic efficacy. Subsequent studies should therefore examine more extensive datasets and assess the incorporation of DL methodologies with feature engineering strategies to further augment the predictive accuracy. Second, Pearson’s correlation was used to find the most relevant attributes, and this technique has limitations: it only detects linear correlations and may miss complex, non-linear relationships that could also be predictive. Future work should therefore use more advanced feature selection methods, such as information gain, the chi-square test, recursive feature elimination (RFE), and model-based selection, to evaluate their impact on model performance and to identify any non-linear relationships that Pearson’s correlation might have missed; this would provide a more comprehensive understanding of the feature space and potentially improve predictive accuracy. Finally, the literature indicates that shortness of breath is one of the key features for determining lung cancer. The current study set a Pearson correlation threshold of 0.15 for inclusion, and shortness of breath was eliminated because its correlation coefficient (0.0644) fell below this threshold. By contrast, swallowing difficulty (0.2689) and alcohol consumption (0.2944) exceeded the threshold, indicating a stronger correlation with lung cancer in this dataset, and were therefore retained for the analysis.
Future studies should consider other feature selection methods, such as information gain, the chi-square test, or Principal Component Analysis (PCA), to confirm these outcomes.