1. Introduction
Lung cancer is one of the most common and fatal forms of cancer worldwide, causing a large number of cancer-related deaths annually and affecting every aspect of patients' lives [1]. Within the complex structure of the human pulmonary system, lung cancer develops as a hidden threat and often remains undetected until it has progressed to an advanced stage [2]. Lung cancer encompasses a spectrum of distinct forms, ranging from the aggressive small cell carcinoma to the more prevalent non-small cell lung cancer, each posing unique diagnostic and therapeutic challenges [3]. The impact is substantial, spanning social, emotional, and physical dimensions.
According to the Centers for Disease Control and Prevention, smoking is the most common risk factor for lung cancer, accounting for 80–90% of lung cancer deaths in the US [4]. Beyond smokers themselves, non-smokers exposed to secondhand smoke also face a significantly elevated risk of lung cancer [2]. Radon, a gas that originates from rocks, soil, and water, is the second leading cause of lung cancer; it can infiltrate buildings through fissures and may induce lung cancer after prolonged inhalation [1]. Radiation treatment, atypical dietary habits, and familial genetic predisposition also contribute to lung cancer-related fatalities. The literature emphasizes that risk prediction is of greater importance than clinical evaluation in lung cancer screening [1,2,3,5].
Although medical technology has made strides, lung cancer persists as a significant health challenge, mainly due to the difficulty of early diagnosis [6]; delayed detection often reduces treatment effectiveness and leads to poorer patient outcomes [6,7]. The literature has applied several mathematical models to detect and prevent diseases, facilitate timely treatment interventions, and improve health outcomes [8,9]. Early detection, before metastasis, significantly enhances the possibility of effective treatment in lung cancer cases, as early cancer detection and risk factor assessment enable the administration of appropriate therapies and preventive measures [7]. The evidence demonstrates that the inherent difficulties in early lung cancer detection result in diagnoses delayed beyond six months, substantially diminishing survival rates and rendering treatment significantly more challenging [7,8,10].
In recent years, machine learning (ML) and data mining have become reliable and essential tools in the medical field, owing to their ability to reveal hidden patterns in data and improve decision-making accuracy [8,9]. Researchers have utilized ML and soft computing techniques to identify many types of cancer at early stages using classification methods [7,8,11] and have developed advanced models for predicting cancer therapy outcomes at the onset of treatment [6,7]. However, selecting an appropriate learning algorithm is critical for precisely diagnosing lung cancer and understanding its relation to patient habits [3,5]. ML automates disease prediction, while data mining integrates ML, statistics, and database techniques to preprocess and extract meaningful patterns from large datasets [8,11]. It is therefore critical to develop and refine computational methods that can accurately predict lung cancer risk and facilitate timely intervention. To address these concerns, this study systematically evaluates and compares ML and deep learning (DL) models for lung cancer prediction, focusing on patient symptoms and lifestyle factors.
This research aims to identify the ML model with the best predictive accuracy for lung cancer, using patient features including lifestyle factors and symptoms, and to inform the development of effective diagnostic technologies that enable early diagnosis and enhance patient outcomes. The following research questions are explored to fulfill the aims of the study:
What are the predictive accuracy, precision, recall, and F-measure of various ML models for lung cancer prediction, based on patient symptoms and lifestyle factors, and which algorithm performs the best across these metrics?
How does feature selection affect the ML lung cancer detection accuracy?
Do DL methods, such as neural networks, outperform the traditional ML classifiers in lung cancer prediction?
This study’s findings are expected to clarify the applicability of ML approaches for lung cancer detection, leading to improved screening and early intervention strategies.
2. Literature Review
Lung cancer, characterized by uncontrolled cell growth in lung tissues, is a highly malignant disease and a leading cause of cancer-related deaths worldwide. The importance of early detection and accurate diagnosis in improving the patient outcomes cannot be overstated. This section reviews the recent studies on the ML and DL techniques in lung disease diagnosis, particularly lung cancer.
2.1. Machine Learning Applications in Lung Disease Diagnosis
Machine learning has become an invaluable tool in medical science, with its ability to analyze complex datasets and identify patterns that may be imperceptible to the human eye. In the context of lung disease, ML algorithms have been employed for various tasks, including risk prediction, early detection, and the classification of disease subtypes. Maurya et al. [1] conducted a comparative analysis of several machine learning algorithms for lung cancer prediction and identified the K-Nearest Neighbor (KNN) and Naive Bayes (NB) models as the most effective methods for early lung cancer prediction. Khanam and Foo (2021) [11] compared ML algorithms for diabetes prediction; although focused on diabetes, this study provides valuable insights into the comparative performance of ML models, which is relevant to selecting appropriate algorithms for lung disease diagnosis.
Protić et al. (2023) [12] explored numerical feature selection in the ML-based detection of anomalies, demonstrating that methodologies such as feature selection are critical in developing accurate and efficient diagnostic models. Dudáš (2024) [13] investigated the graphical representation of data prediction potential using correlation graphs and chains, showing that visualizing data relationships can aid in understanding the underlying factors that contribute to lung diseases. Patra (2020) [14] utilized ML classifiers such as Support Vector Machine (SVM), Logistic Regression (LR), Random Forest (RF), Decision Trees (DT), NB, and KNN to predict lung cancer; the proposed RBF classifier achieved an accuracy of 81.25%. Radhika et al. (2019) [15] performed a comparative study of lung cancer detection using ML algorithms, in which the proposed ensemble classifier yielded the best classification performance, with SVM second best and NB the least effective. Faisal et al. (2018) [16] evaluated machine learning classifiers and ensembles for the early-stage prediction of lung cancer; the Gradient-Boosted Tree classifier achieved 90% accuracy, outperforming all the other individual and ensemble classifiers.
2.2. DL Applications in Lung Disease Diagnosis
DL, a subfield of artificial intelligence, has emerged as a powerful tool for analyzing medical images and other complex data in lung disease diagnosis. Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have shown promise in this field. Esteva et al. (2021) [17] provided a comprehensive overview of deep learning-enabled medical computer vision, highlighting its applications in various medical domains, including lung disease, and noting that the success of ML models hinges on thorough data assessment, proactive planning for model limitations, active community participation, and the cultivation of trust. Alzubaidi et al. (2021) [8] reviewed DL concepts, CNN architectures, challenges, applications, and future directions, offering a valuable foundation for understanding the deep learning methodologies used in lung disease diagnosis.
Vieira et al. (2021) [18] used a data mining approach with Artificial Neural Networks (ANNs) to classify lung cancer cases; applying multiple data mining models within RapidMiner, they identified smoking as the principal risk factor for lung cancer. Chakraborty et al. (2024) [6] reported that the implementation of ML and DL in real-world medical settings is nascent and faces significant challenges: these technologies require further development, rigorous validation, and scalable implementation to be effectively utilized in healthcare. Liu et al. (2023) [9] explored how medical image processing and biomedical models enhance image quality, enabling doctors to better visualize lesion areas.
Table 1 illustrates the significant role of ML and DL in improving the accuracy and efficiency of lung disease diagnosis.
2.3. Research Deficiencies and Prospective Avenues
The efficacy of lung cancer prediction extends beyond ML model selection and necessitates integrating patient-specific information, including symptoms, lifestyle habits, and genetic predispositions. Smoking is the primary cause of lung cancer, responsible for 80–90% of occurrences in the United States [1,2]. Therefore, ML models must include these characteristics to improve their predictive accuracy.
Despite the effectiveness of ML and DL models in lung cancer prediction, challenges remain. Many studies rely on small or imbalanced datasets, hindering the result generalizability. Additionally, the computational complexity and large data requirements of deep learning models pose obstacles to clinical application.
Future investigations need to include feature selection to pinpoint the most relevant features and explainable ML and DL models to enhance the transparency in decision making. Moreover, using extensive datasets helps improve the model resilience and relevance across distinct patient demographics.
3. Materials and Methods
This section details the methodology employed to systematically evaluate ML and DL models for lung cancer prediction. A rigorous and well-defined methodology is crucial to ensure the validity, reliability, and reproducibility of the research findings. The overall workflow, depicted in Figure 1, encompasses data acquisition and preprocessing, feature selection, model development, and performance evaluation [19]. This structured process enables a transparent and systematic comparison of the models’ predictive capabilities.
3.1. Dataset Description and Preparation
3.1.1. Dataset Source and Characteristics
The study utilized a lung cancer prediction dataset obtained from Kaggle, an online platform for data scientists and ML practitioners operated by Google LLC. The dataset comprises clinical features, offering a complementary approach to studies focused primarily on imaging data and CNNs [17,20]. Unlike some previous work with smaller sample sizes [14], this dataset includes information on 309 patients with 16 unique attributes (Table 2), providing a robust foundation for analysis. The dataset’s dependent variable is the “Lung_Cancer” attribute, with the remaining fifteen attributes serving as independent predictors. The “Lung_Cancer” attribute uses two categories: 1 represents patients with lung cancer and 2 represents patients without it.
A detailed statistical analysis, including descriptive statistics, was conducted to understand the dataset’s distributions, central tendencies, and variabilities. This analysis is essential for identifying potential biases, informing the preprocessing steps, and interpreting the model performance.
Table 2 presents the dataset’s characteristics, detailing each attribute (feature), including its data type and basic descriptive statistics such as the mean and standard deviation for numeric attributes. This table is essential for understanding the raw data before any preprocessing or analysis.
The data were first analyzed by examining the distribution of positive and negative cases across Age and Gender.
Figure 2 indicates that most patients’ distribution ranged between the ages of 55 and 75.
Weka (Waikato Environment for Knowledge Analysis) was utilized in this study to automate several ML tasks. Weka provides pre-built implementations of various machine learning algorithms and tools for data preprocessing, making it efficient for functions such as duplicate record removal, missing value imputation, outlier identification and removal, feature selection, data normalization, and the training and evaluation of machine learning classifiers [11].
The utilization of Weka facilitated the standardized and efficient implementation of the common ML workflow steps. However, NN models were custom implemented using Python 3.11 along with the Keras and TensorFlow libraries, providing greater flexibility in designing and training the deep learning architectures. This hybrid approach allowed us to leverage the strengths of Weka’s automation and Python’s customizability.
The NN models were run in Python within a Jupyter Notebook 7.2.2 environment.
3.1.2. Preprocessing of the Dataset
Data preprocessing is a critical step to transform raw data and improve the accuracy of ML models. The following preprocessing steps were applied to the dataset:
Duplicate Record Removal: Weka’s “RemoveDuplicates” filter identified and removed 33 duplicate records, leaving 276 data points.
Missing Value Handling: Weka’s “ReplaceMissingValues” filter was used to check for missing values; unlike studies that apply simple mean imputation, this approach ensures more robust handling of missing data. No missing values were found in the dataset.
Outlier Detection and Removal: Weka’s “InterquartileRange” filter was used to identify and remove outliers and extreme values. No outliers were detected.
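The study ran these steps in Weka; for readers working in Python, the pipeline can be approximated with pandas. This is a minimal sketch on a toy table, not the actual dataset: the column names, values, and the 1.5×IQR outlier factor are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for the Kaggle dataset; column names and values are illustrative.
df = pd.DataFrame({
    "AGE": [63, 63, 70, 55, 200],   # 200 is an artificial extreme value
    "SMOKING": [1, 1, 2, 2, 1],
    "LUNG_CANCER": [1, 1, 1, 2, 1],
})

# 1. Duplicate record removal (analogue of Weka's "RemoveDuplicates" filter)
df = df.drop_duplicates()

# 2. Missing value check (the study found none)
assert df.isna().sum().sum() == 0

# 3. Outlier removal via the interquartile range
#    (analogue of Weka's "InterquartileRange" filter, using a 1.5*IQR fence)
q1, q3 = df["AGE"].quantile(0.25), df["AGE"].quantile(0.75)
iqr = q3 - q1
df = df[df["AGE"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
print(len(df))  # rows remaining after deduplication and outlier removal
```

On the real data these steps reduced 309 records to 276, with no missing values or outliers remaining.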
3.2. Feature Selection Method
Pearson’s correlation technique was employed to identify the most relevant attributes for lung cancer prediction. This method was chosen for its simplicity and effectiveness in quantifying linear relationships between attributes and the target variable. The correlation coefficient, ranging from −1 to 1, indicates the strength and direction of the linear relationship: a value of zero signifies no linear relationship, while an absolute value of 0.5 or higher suggests a strong association. Weka’s “correlation” filter was used to determine the correlation coefficient between the input and output attributes.
A threshold of 0.15 was set to determine the relevant attributes. Consequently, Anxiety, Chronic Disease, Age, Shortness of Breath, and Smoking were excluded, while Coughing, Wheezing, Alcohol_Consuming, Swallowing_Difficulty, and Allergy were retained as the five most essential input attributes.
Table 3 shows the correlation coefficients.
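The thresholding procedure can be sketched in NumPy. The synthetic features below are illustrative stand-ins (one constructed to correlate with the target, one independent of it), not the study's data; only the 0.15 threshold comes from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 276  # sample count after preprocessing

# Synthetic binary target (1 = lung cancer, 2 = no lung cancer) and two
# illustrative features; names follow the paper, values are made up.
y = rng.integers(1, 3, size=n)
features = {
    "ALLERGY": y + rng.normal(0.0, 1.0, n),          # built to correlate with y
    "SHORTNESS_OF_BREATH": rng.normal(0.0, 1.0, n),  # roughly independent of y
}

THRESHOLD = 0.15  # inclusion threshold used in the study
selected = []
for name, x in features.items():
    r = np.corrcoef(x, y)[0, 1]  # Pearson's r between feature and target
    if abs(r) >= THRESHOLD:
        selected.append(name)
print(selected)
```

Applying this rule to the real correlation coefficients (Table 3) retains Coughing, Wheezing, Alcohol_Consuming, Swallowing_Difficulty, and Allergy.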
3.3. Data Normalization Techniques
To improve the computational efficiency of the algorithms, min-max scaling was used to normalize the data to the range of 0 to 1. The “Normalize” filter in Weka was applied for this purpose.
Table 4 presents the mean and standard deviation of the attributes after normalization.
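Min-max scaling maps each attribute x to (x − min) / (max − min). A small self-contained sketch of the transformation applied by Weka's "Normalize" filter (the sample ages are illustrative):

```python
import numpy as np

def min_max_scale(x):
    """Rescale a 1-D array to the [0, 1] range (min-max normalization)."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo)

age = np.array([55, 60, 75])   # illustrative values
scaled = min_max_scale(age)
print(scaled)  # [0.   0.25 1.  ]
```

After this step, every attribute lies in [0, 1], which keeps features on a comparable scale and speeds up gradient-based training.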
Figure 3 shows the correlation between the input and output attributes post-preprocessing. It illustrates a correlation coefficient of 0.33 between “Allergy” and “Lung_Cancer”, indicating the highest correlation.
3.4. Experimental Setup for Testing and Training the Data
The preprocessed dataset, now consisting of 276 samples (38 without lung cancer and 238 with lung cancer), was divided into 80% for training and 20% for testing. To ensure a robust estimate of model performance, 10-fold cross-validation was also employed. Stratified k-fold cross-validation was used to maintain the class distribution in each fold, which is particularly important for imbalanced datasets [11,21]. In k-fold cross-validation, one fold is used for testing and the remaining k−1 folds for training; this process is repeated until each fold has served as the test set, and the model’s efficacy is assessed as the mean of the outcomes across all k iterations.
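The splitting scheme can be reproduced with scikit-learn. The feature matrix below is random filler with the study's sample counts; the random seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold

rng = np.random.default_rng(1)
X = rng.normal(size=(276, 5))           # 276 samples, 5 selected features (filler values)
y = np.array([1] * 238 + [2] * 38)      # imbalanced classes, as in the dataset

# 80/20 split, stratified so both partitions preserve the class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Stratified 10-fold cross-validation on the training portion:
# each fold keeps roughly the same cancer/no-cancer ratio as the whole set
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
fold_sizes = [len(test_idx) for _, test_idx in skf.split(X_tr, y_tr)]
print(len(X_te), fold_sizes)
```

Stratification matters here because only about 14% of samples are negatives; an unstratified fold could easily contain almost none of them.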
3.5. Development and Execution of a Classification Model
3.5.1. Machine Learning Models
Several ML algorithms, namely DT, RF, AdaBoost (AB), KNN, LR, NB, and SVM, were implemented using Weka with default parameter settings to provide a baseline comparison. For the KNN model, K was set to 7.
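Since the classifiers were run with default settings, the comparison can be approximated in scikit-learn. This is a sketch, not the study's exact Weka configuration: the synthetic dataset and scikit-learn defaults are illustrative stand-ins; only K = 7 for KNN comes from the paper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in with the study's sample count and feature count
X, y = make_classification(n_samples=276, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

models = {
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "AB": AdaBoostClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=7),  # K = 7, as in the study
    "LR": LogisticRegression(max_iter=1000),
    "NB": GaussianNB(),
    "SVM": SVC(),
}

# Mean accuracy over 10-fold cross-validation for each baseline classifier
results = {}
for name, model in models.items():
    results[name] = cross_val_score(model, X, y, cv=10).mean()
print(results)
```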
3.5.2. Development of Deep Learning Models
Neural Network (NN) models with one, two, and three hidden layers were developed using Python 3.11 with the Keras and TensorFlow libraries. The NN models were trained with different numbers of epochs (200, 400, and 800) to assess the impact of the training duration on the performance. ReLU activation functions were used in the hidden layers, and a sigmoid function was used in the output layer for binary classification. Learning rates of 0.1, 0.01, and 0.005 were tested to determine the optimal value.
The NN models were developed using the Sequential class from Keras, where the optimizer minimizes the output error during backpropagation. Stochastic Gradient Descent (SGD) was employed as the optimization technique, with the learning rate controlling how strongly the weights are updated based on the loss gradient.
We conducted experiments with various learning rates to determine the optimal value. The “train_test_split” function from the scikit-learn package was used to partition the training and test data. For k-fold cross-validation, we utilized the “cross_val_score” function from scikit-learn, with the “StratifiedKFold” methodology to address the binary nature of the target variable.
Table 5 details the architectures of the NN models.
The single-hidden-layer NN model was constructed incorporating an input layer, a hidden layer, and an output layer. Eight neurons in the input layer represent the eight attributes. The hidden layer consists of a single neuron with the ReLU activation function, and the output layer consists of one neuron with a sigmoid activation function for binary classification.
A four-layer NN model was constructed. The input and output layers (first and fourth) maintained identical input shapes, neuron counts, and activation functions as the single-hidden-layer model. The second layer comprises 41 hidden neurons, and the third layer comprises eight hidden neurons, with all the neurons within these layers utilizing the ReLU activation function.
A five-layer NN model was also constructed. Its input and output layers (first and fifth) maintained identical input shapes, neuron counts, and activation functions as the single-hidden-layer Neural Network (NN). The second, third, and fourth hidden layers consisted of 25, 16, and 8 neurons, respectively, all using the ReLU activation function.
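The architectures described above (ReLU hidden layers, sigmoid output, SGD optimizer) can be sketched with the Keras Sequential API. This is a minimal illustration under stated assumptions, not the authors' exact code: the feature count is parameterized (five is used here, matching the selected attributes, although the paper's description of the input layer mentions eight neurons), and the builder function name is ours.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_nn(hidden_sizes, n_features=5, learning_rate=0.1):
    """Sequential NN mirroring the paper's setup: ReLU hidden layers,
    a sigmoid output neuron for binary classification, SGD optimizer.
    hidden_sizes examples: (1,), (41, 8), (25, 16, 8)."""
    model = keras.Sequential([keras.Input(shape=(n_features,))])
    for units in hidden_sizes:
        model.add(layers.Dense(units, activation="relu"))
    model.add(layers.Dense(1, activation="sigmoid"))
    model.compile(
        optimizer=keras.optimizers.SGD(learning_rate=learning_rate),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model

one_layer = build_nn((1,))            # single-hidden-layer model
three_layer = build_nn((25, 16, 8))   # three-hidden-layer model
print(one_layer.count_params(), three_layer.count_params())
```

Training would then proceed with `model.fit(X_tr, y_tr, epochs=200)` (or 400/800), matching the epoch counts examined in the study.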
3.5.3. Selection of Different Learning Rates
The learning rates of 0.1, 0.01, and 0.005 were selected based on a combination of common practice and preliminary experimentation. These values represent a range from relatively large (0.1) to small (0.005), allowing us to observe the impact of the learning rate magnitude on the model convergence and performance. While a more complete hyperparameter tuning method could be employed, this initial range provided a reasonable starting point for evaluating the model’s sensitivity to the learning rate.
3.6. Model Evaluation
The model performance was evaluated using accuracy, precision, recall, and the F1-score. These metrics provide a comprehensive assessment of the models’ classification capabilities. The following formulas were used for the evaluation metrics:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 × (Precision × Recall) / (Precision + Recall)
where TP = true positive; TN = true negative; FP = false positive; FN = false negative.
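These metrics follow directly from the confusion-matrix counts. A short sketch (the counts below are illustrative, not results from the study):

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts only
acc, prec, rec, f1 = classification_metrics(tp=50, tn=40, fp=5, fn=5)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
```

Note that when precision and recall are equal, as in this example, the F1-score equals both.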
4. Results
4.1. Performance of the Machine Learning Algorithms
Table 6 shows the confusion matrices for all the classifiers used in the study, which were evaluated via 10-fold cross-validation (K = 10) and train/test splitting.
Table 7 displays the accuracy of all the classification algorithms used in the study, demonstrating that all of them exceeded 85%. Moreover, DT, NB, and KNN (K = 7) exhibited superior accuracy in the testing methodologies.
Figure 4 shows the ROC curve for the KNN model, which demonstrated the highest accuracy in the train/test split evaluation and the K-fold cross-validation (Table 7).
The different colors in the ROC curve represent the classifier's performance on each fold during the cross-validation process (Figure 4). Each colored line corresponds to the ROC curve obtained when one fold of the data was used as the test set and the remaining folds were used for training. This visualization shows the variability in the classifier's performance across different subsets of the data. The Area Under the ROC Curve (AUC), displayed at the top of the plot (AUC = 0.9149 in this case), summarizes the overall performance.
Figure 5a–d and Figure 6a–d present the performance of all the classifiers based on four metrics, Precision, Recall, F-measure, and Accuracy, evaluated using 10-fold cross-validation (K = 10) and the train/test splitting method.
4.2. Findings of the NN Model
In the NN model with a single hidden layer and 200 epochs, we conducted experiments with learning rates of 0.1, 0.01, and 0.005, as presented in Table 8. A learning rate of 0.1 yielded the best accuracy and was therefore used in all subsequent cases.
Table 9 shows the effect of epoch count on the NN model with one, two, and three hidden layers, using a learning rate of 0.1. The NN model featuring a single hidden layer and trained for 800 epochs attained the highest accuracy of 92.86%, demonstrating superior accuracy in both training and testing compared to all the other neural network models examined.
Figure 7 illustrates the ROC curve for a model using one hidden layer trained over 800 epochs. We computed the accuracy of the single-hidden-layer NN model over 800 epochs using 10-fold cross-validation (K = 10). The model achieved an average accuracy of 90%. This finding provides strong evidence of the one-hidden-layer NN model’s effectiveness in lung cancer prediction. It highlights the potential of DL techniques to improve diagnosis and the necessity of further exploring model architectures and hyperparameters for optimal performance.
5. Discussion
Given the high mortality rate of lung cancer and the importance of early, precise identification, this research evaluates the predictive efficacy of several ML and DL models for lung cancer detection using patient symptoms and lifestyle factors. This research compared traditional ML classifiers—DT, KNN, RF, NB, AB, LR, and SVM—with Neural Network (NN) models to determine the most effective approach.
5.1. Evaluation of the Machine Learning Models
The findings demonstrate that DT, NB, and KNN attained superior classification accuracy, surpassing 85% in both the K-fold cross-validation (K = 10) and train/test splitting methodologies. The strong performance of these models indicates their suitability for lung cancer classification, presumably owing to their capacity to manage the intricate connections among features. The findings correspond with previous research that has shown the efficacy of these models in lung cancer prediction [3,14].
RF and AB also showed strong performance, with accuracy rates similar to DT, NB, and KNN. However, SVM and LR had slightly lower accuracy, likely due to their sensitivity to feature selection and hyperparameter optimization.
5.2. Efficacy of NN in Lung Cancer Diagnosis
The DL models surpassed the conventional ML classifiers, with a single-hidden-layer NN trained for 800 epochs attaining a maximum accuracy of 92.86%. This result emphasizes the capability of DL to enhance lung cancer detection, since NNs can identify intricate nonlinear patterns within the dataset, yielding improved classification accuracy relative to the conventional ML models.
The research investigated several NN topologies, including single-, dual-, and triple-hidden-layer models. The single-hidden-layer NN with a learning rate of 0.1 and extended training (800 epochs) consistently surpassed the deeper models, indicating that adding hidden layers did not improve performance on this dataset and that model optimization and hyperparameter adjustment are essential for balancing accuracy and computational efficiency.
5.3. Influence of Data Preprocessing and Feature Selection
The results indicated that the preprocessing methods significantly improved the classification outcomes, underscoring the need for systematic data preparation in ML applications. Pearson’s correlation was employed for feature selection, revealing coughing, wheezing, alcohol consumption, dysphagia, and allergies as the most pertinent predictors of lung cancer. Removing the less significant features like anxiety, chronic illness, and shortness of breath improved the model performance by reducing the dimensionality and eliminating noise. Furthermore, data standardization and outlier elimination improved the model efficacy by establishing a consistent data distribution, resulting in more precise predictions.
5.4. Comparative Analysis with Prior Studies
The outcomes of this study corroborate prior research underscoring the efficacy of ML and DL models in lung cancer diagnosis (Table 10). Prior research has shown that ML classifiers, namely DT, NB, and RF, attain accuracy rates beyond 85% when applied to medical datasets [1,14,16]. Patra [14] achieved an accuracy of 81.25% using SVM, among other algorithms, while Gultepe [22] reported 83% accuracy using KNN, Naïve Bayes, and Decision Trees. Maurya et al. [1] reached a maximum of 91.07% with Naïve Bayes, and Faisal et al. [16] demonstrated 90% accuracy using Gradient-Boosted Trees. These findings illustrate a gradual improvement in prediction accuracies over time as model architectures and feature engineering techniques have advanced.
The current study notably surpasses these benchmarks. Its NN model, featuring a single hidden layer trained over 800 epochs, achieves an accuracy of 92.86%. This outcome underscores the effectiveness of DL architectures, particularly when paired with rigorous data preprocessing and optimal hyperparameter tuning strategies like learning rate and epoch selection.
Importantly, this study further illustrates the advantage of DL in predictive accuracy, in contrast to past studies that focused only on classical ML models [16,22]. The findings indicate that a well-tuned NN surpasses the classical classifiers in lung cancer detection, making it a significant asset for early diagnosis based on symptomatic and lifestyle data.
6. Conclusions
This study investigated the efficacy of ML and DL techniques for lung cancer prediction, addressing the critical need for early and accurate diagnosis to improve patient outcomes. The research provides a comparative analysis of various ML algorithms (DT, RF, KNN, NB, AB, LR, and SVM) and NN models for lung cancer prediction using a dataset of patient symptoms and lifestyle factors. It shows how a rigorous data preprocessing pipeline, including outlier removal, normalization, and feature selection with Pearson’s correlation, improves model prediction performance and reduces noise. The study identifies coughing, wheezing, alcohol consumption, swallowing difficulty, and allergies as the key predictive features for lung cancer, highlighting significant risk indicators. It evaluates the performance of NN models with varying architectures (one, two, and three hidden layers) and training parameters, determining the optimal configuration for lung cancer prediction.
Several ML models, including DT, NB, and KNN, achieved high classification accuracy (above 85%), demonstrating their potential for effective lung cancer prediction. DL models, notably a single-hidden-layer NN trained for 800 epochs, outperformed the traditional ML classifiers, reaching a maximum accuracy of 92.86% and highlighting the superior capability of NNs to capture complex patterns in the data.
7. Limitations and Future Study
Despite the promising findings, a few limitations should be acknowledged. First, the dataset used was small, potentially affecting the generalizability of the results to broader groups. The research also concentrated on a restricted array of parameters, and the inclusion of additional clinical and genetic data may boost the models’ prognostic efficacy. Subsequent studies should therefore examine more extensive datasets and assess the incorporation of DL methodologies with feature engineering strategies to further augment the predictive accuracy. Second, Pearson’s correlation was used to find the most relevant attributes, and this technique has limitations: it only detects linear correlations and may miss complex, non-linear relationships that could also be predictive. Future work should therefore use more advanced feature selection methods, such as information gain, the chi-square test, recursive feature elimination (RFE), and model-based selection, to evaluate their impact on model performance and to identify any non-linear relationships that Pearson’s correlation might have missed; this would provide a more comprehensive understanding of the feature space and potentially improve predictive accuracy. Finally, the literature indicates that shortness of breath is one of the key features for determining lung cancer. The current study set a Pearson correlation threshold of 0.15 for inclusion, and shortness of breath was eliminated because its correlation coefficient (0.0644) fell below this threshold. By contrast, swallowing difficulty (0.2689) and alcohol consumption (0.2944) exceeded the threshold, indicating a stronger correlation with lung cancer in this dataset, and were therefore retained for the analysis.
Future studies should consider other feature selection methods, such as information gain, the chi-square test, or Principal Component Analysis (PCA), to confirm these outcomes.