1. Introduction
The incidence of and mortality from cancer are critical public health issues in India, and both have been increasing, so that cancer now accounts for a significant proportion of deaths. About 75,000 new cases of cancer were reported in the year 2022. The burden is further compounded by a high prevalence of risk factors and by challenges in the early detection and diagnosis of the disease [1].
Tobacco use is the leading cause of lung cancer in India, accounting for about 85% of cases. According to the 2016-17 GATS report, an estimated 28.6% of adults in India use tobacco in some form, with a higher percentage among men [2]. Tobacco use includes smoking cigarettes and bidis as well as smokeless products such as gutka and paan. Moreover, secondhand smoke exposes non-smokers to carcinogenic agents, which in turn raises their risk of lung cancer [3].
Environmental pollution is another significant cause of the growing number of lung cancer cases in India. Delhi, Mumbai, and Kolkata are among the most polluted cities worldwide. Emissions from automobiles and industrial activities, together with particulate matter released during construction, damage the lungs and are associated with an increased incidence of lung cancer. A study published in The Lancet Planetary Health found that exposure to ambient PM2.5 is associated with an increased risk of lung cancer [4].
Household air pollution is a significant concern in rural areas, where many people rely mainly on solid fuels such as firewood, cow dung, and coal for cooking and heating. The WHO estimates that about 700 million people in India are exposed to household air pollution, which increases lung cancer risk, especially for women, who spend more of their time in the kitchen [5].
Occupational hazards are another major contributor to lung cancer. People who work in the construction, mining, and manufacturing industries are often exposed to carcinogens such as asbestos and radon gas. Strict workplace safety measures, along with annual checkups, can greatly reduce these occupational risks [6].
If lung cancer is discovered early, survival improves substantially. The five-year survival rate for lung cancer diagnosed at an early stage can reach 60% to 80%, whereas the rate for advanced-stage lung cancer falls below 15% [7]. The National Programme for the Prevention and Control of Cancer and Diabetes is designed to enhance cancer screening and early detection initiatives throughout India. However, more than 70% of lung cancer diagnoses occur at a late stage, so more advanced screening and diagnostic techniques are needed [8].
Current developments in ML and AI open promising possibilities for better detection and diagnosis of lung cancer. Machine learning algorithms can extract insights from large datasets and identify risk factors and other patterns associated with the disease, which supports early detection and personalized treatment planning. A recent study in The Lancet Oncology showed that an AI-based system reduced false positives and false negatives in lung cancer screening and increased accuracy [9].
The present study focuses on developing a hybrid model for lung cancer prediction while also conducting a comparative analysis of various machine learning algorithms. The proposed preprocessing techniques, including ADASYN and label encoding, enhance model performance by addressing class imbalance and feature representation. This study aims to improve early detection and diagnosis by integrating advanced machine learning techniques to analyze key risk factors associated with lung cancer. By leveraging a hybrid approach that combines traditional machine learning models with deep learning, the research seeks to achieve higher predictive accuracy. Additionally, the study evaluates multiple machine learning models to identify the most significant features contributing to lung cancer prediction, ensuring a robust and reliable methodology for practical healthcare applications.
Translational Advancements in Machine Learning for Cancer Prediction
Lung carcinoma, characterized by the malignant growth of lung cells, remains a significant challenge in oncology, emphasizing the need for advancements in diagnostic and predictive technologies. Recent progress in computer vision and data analytics has paved the way for sophisticated diagnostic methods, mainly by analyzing temporal medical images. Beyond imaging, clinical text data, such as patient symptoms and medical histories, have emerged as invaluable resources for lung cancer diagnosis, enabling the integration of diverse data modalities for enhanced prediction. Machine learning (ML) techniques, including neural networks, support vector machines (SVMs), decision trees, convolutional neural networks (CNNs), and nonlinear cellular automata, have shown considerable success in predicting lung cancer recurrence and survivability. Ensemble methods such as Random Forest, XGBoost, Bagging, and Adaboost have demonstrated superior predictive accuracy by combining multiple models to reduce error rates. Public datasets, like the SEER database, have been instrumental in evaluating these methods using metrics such as precision, recall, F1-score, and the Area Under the Curve (AUC) of Receiver Operating Characteristic (ROC) curves.
Binary classification models have also played a vital role in diagnosis, prognosis, and prediction tasks. Techniques like logistic regression, Gaussian naïve Bayes, k-nearest neighbors (KNN), artificial neural networks (ANNs), radial basis function (RBF) networks, gradient-boosted trees, and multilayer perceptrons (MLPs) have been extensively utilized. These models incorporate demographic factors like age and gender and behavioral factors like smoking habits into personalized diagnostic tools. However, significant challenges persist in integrating diverse data types, such as imaging and clinical data, and conducting comprehensive analyses of ML methods. Addressing these challenges requires multidisciplinary approaches to refine diagnostic accuracy and improve treatment planning.
The advancements in ML for lung cancer diagnosis also involve integrating traditional and emerging methodologies. Unlike conventional AI, ML evolves by learning from past data, enabling dynamic and complex decision-making. Several studies have highlighted the transformative potential of ML in lung cancer detection and prognosis. Shah et al. [
10] identified limitations in existing diagnostic methods, such as the need for advanced feature extraction techniques, robust noise removal, and hybrid classification approaches to improve ML and DL applications in clinical settings. Gayap et al. [
11] highlighted deep learning techniques, including 2D and 3D CNNs, dual-path networks, and vision transformers (ViTs), which outperform classical ML methods in detecting lung cancer using CT scans. Didier et al. [
12] demonstrated that ML models surpass logistic regression in predicting lung cancer survival, offering higher discriminatory accuracy. Similarly, Raoof et al. [
13] reviewed the strengths and limitations of ML techniques, including deep learning and ensemble methods, for specific imaging types.
Other notable contributions include the work of Javed et al. [
14], who summarised the effectiveness of CNNs and recurrent neural networks (RNNs) in lung cancer detection while addressing challenges related to higher accuracy and generalizability. Li et al. [
15] showcased the potential of SVMs and Random Forest in predicting lung cancer progression and emphasized their clinical decision-making utility. Dodia et al. [
16] discussed the challenges of translating ML models for early lung cancer detection into clinical practice, mainly using CT scans. Huang et al. [
17] compared deep learning and traditional ML methods, highlighting deep learning models’ superior accuracy and sensitivity while stressing the need for better interpretability and generalizability.
The literature underscores the transformative potential of ML in lung cancer diagnosis. By addressing challenges such as multimodal data integration, feature selection, and implementation barriers, these advancements promise to improve patient care and outcomes significantly. Early detection and accurate prediction are pivotal in improving patient outcomes and tailoring effective treatment strategies. Advances in ML and deep learning (DL) techniques have shown immense potential in addressing the complexities of disease prediction and diagnosis.
2. Methodology
Figure 1 illustrates a comparison between the machine learning (ML) and deep learning (DL) approaches used for lung cancer prediction. Both ML and DL follow similar processes, starting with data input, analysis, and preprocessing. They proceed with feature exploration, followed by the training and testing of the models. In the ML approach, various algorithms such as logistic regression, k-nearest neighbors, support vector machine, Random Forest, Gaussian naïve Bayes, multinomial naïve Bayes, Gradient Boosting, and decision tree are used, and the best model is selected based on hyperparameter tuning and performance analysis. The DL approach, on the other hand, employs a multilayer perceptron (MLP) as its core algorithm. Both approaches involve making predictions and conducting thorough performance analyses.
2.1. Data Collection
A dataset for lung cancer prediction has been collected from the source. The dataset consists of 16 attributes and 309 instances. In this study, we selected only 12 attributes, listed in Table 1, because the remaining attributes were not relevant to lung cancer [18].
2.2. Data Preprocessing
In this study, only two techniques were employed: ADASYN and label encoding, as described below. Our primary focus was on ensuring data consistency rather than handling missing values, as the dataset contained no missing values [
19].
2.2.1. Adaptive Synthetic Sampling (ADASYN)
In this study, Adaptive Synthetic Sampling (ADASYN) was used to address the class imbalance in the dataset before training the ML models to predict lung cancer. The lung cancer dataset comprises 309 records and 12 variables, with the target Lung Cancer (Yes/No). When one class is substantially smaller than the other, a classifier tends to be biased in favor of the majority class, which leads to poor recall on the minority class [20].
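The idea behind ADASYN can be illustrated with a minimal pure-Python sketch. It is a simplified version written for illustration, not the implementation used in this study: the function name `adasyn_sketch`, the parameters `k` and `beta`, and the choice to interpolate only between minority-class points are assumptions. Minority points that are surrounded by more majority-class neighbours receive more synthetic samples.

```python
import math
import random

def adasyn_sketch(X, y, minority_label, k=5, beta=1.0, seed=0):
    """Return synthetic minority samples (simplified ADASYN, two classes assumed)."""
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_majority = len(X) - len(minority)
    # Number of synthetic samples needed to reach the desired balance level.
    total_needed = int((n_majority - len(minority)) * beta)
    if total_needed <= 0 or len(minority) < 2:
        return []

    def nearest(point, pool, count):
        # Indices of the `count` closest points in `pool`, excluding `point` itself.
        order = sorted(range(len(pool)), key=lambda j: math.dist(point, pool[j]))
        return [j for j in order if pool[j] is not point][:count]

    # r_i: share of majority-class points among the nearest neighbours of x_i.
    ratios = []
    for x in minority:
        neighbours = nearest(x, X, k)
        ratios.append(sum(y[j] != minority_label for j in neighbours) / len(neighbours))
    ratio_sum = sum(ratios)
    if ratio_sum == 0:
        return []  # no minority point lies close to the majority class

    synthetic = []
    for x, r in zip(minority, ratios):
        # Harder-to-learn points (larger r) receive more synthetic samples.
        n_new = round(r / ratio_sum * total_needed)
        partners = nearest(x, minority, k)
        for _ in range(n_new):
            partner = minority[rng.choice(partners)]
            gap = rng.random()
            # Interpolate between x and a nearby minority point.
            synthetic.append([a + gap * (b - a) for a, b in zip(x, partner)])
    return synthetic
```

Because the per-point counts are rounded, the number of generated samples may differ slightly from the target in general. The synthetic rows would be appended to the training data only, so that the test set keeps its original class distribution.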
2.2.2. Label Encoding
In this dataset, the LUNG_CANCER attribute is stored as object data, so we converted it to numerical values using LabelEncoder from sklearn [21]. LabelEncoder is a utility class that normalizes labels so that they contain only values between 0 and n_classes − 1. It can also be used to transform non-numerical labels (as long as they are hashable and comparable) into numerical labels [22].
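The mapping that label encoding produces can be sketched in a few lines of plain Python. This is only an illustration of the idea (distinct labels sorted and numbered from 0 to n_classes − 1); the helper names and the example values are hypothetical, and the study itself used sklearn's LabelEncoder.

```python
def fit_label_mapping(values):
    """Map each distinct label to an integer in 0..n_classes-1, in sorted order."""
    classes = sorted(set(values))
    return {label: index for index, label in enumerate(classes)}

def encode(values, mapping):
    """Replace every label by its integer code."""
    return [mapping[value] for value in values]

# Hypothetical values of the LUNG_CANCER column.
labels = ["YES", "NO", "YES", "YES", "NO"]
mapping = fit_label_mapping(labels)
encoded = encode(labels, mapping)
```

With two classes, the sorted order assigns 0 to "NO" and 1 to "YES", which matches the convention of 0 for no cancer and 1 for cancer used later in the paper.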
2.3. Data Analysis
In this study, we analyzed the dataset and found no missing values, as presented in Table 2. However, the dataset contained 33 duplicate entries, which were removed before processing. A Pearson correlation matrix was also plotted as a heat map to assess the relationships among the attributes, as shown in Figure 2. The attributes of the clinical dataset were chosen based on the advice of experts in this specialization and to measure the effectiveness of the cancer-prediction application, which helps patients understand their cancer risk at low cost and make decisions about appropriate treatment. For data analysis, we used VS Code 1.96 and the Python libraries NumPy, Pandas, Matplotlib, and Seaborn [23].
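The two analysis steps described above, removing duplicate records and computing Pearson correlations, can be sketched without external libraries. This is an illustrative sketch with hypothetical helper names; the study used Pandas and Seaborn for the same purpose.

```python
import math

def drop_duplicates(rows):
    """Keep the first occurrence of each row and preserve the original order."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Undefined (division by zero) when either column is constant.
    return covariance / math.sqrt(var_x * var_y)
```

Applying `pearson` to every pair of encoded attributes yields the matrix that a heat map visualizes; values near 1 or −1 indicate strong linear relationships, and values near 0 indicate none.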
2.4. Data Transformation
In this phase of the study, the dataset was divided into two subsets: training and testing. Since the given data are in categorical form, label encoding was applied to convert categorical variables into numerical representations, thereby enhancing the model’s performance. The encoding process is illustrated in Figure 2 [24, 25].
2.5. Model Construction
We applied various machine learning algorithms to the clinical lung cancer dataset following an initial statistical analysis. To enhance model performance, attribute correlation analysis was conducted, allowing us to refine and optimize the dataset for lung cancer prediction [26].
2.6. Data Split
In this study, the dataset was split into two sets, with 80% for training and 20% for testing. During the training process, each model underwent 10-fold cross-validation [
27].
2.7. Model Training
As shown in
Figure 2, this phase involved training the machine learning models using the prepared dataset. After applying label encoding to handle categorical data, the dataset was split into two subsets: training and testing [
28]. The training subset was used to train the models, while the testing subset facilitated model evaluation. Several machine learning algorithms were implemented, and their performance was assessed using metrics such as accuracy, precision, recall, and the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) [
29].
Hybrid Model Approach
We developed a Gaussian naïve Bayes (GNB) model for lung cancer prediction. This model follows the same pipeline as the other machine learning and deep learning models; however, it is based on Bayes’ theorem with the naïve independence assumption, which uses conditional probability, as discussed in Section 2.12.7. Further, we integrated the GNB model with the AdaBoost model to enhance classification performance by combining the probabilistic classification of GNB with the boosting procedure of AdaBoost.
2.8. Cross-Validation
We used 10-fold cross-validation to improve model generalization and reduce overfitting. The dataset was split into 10 subsets, with each fold serving as a validation set once, and the remaining folds were used for training. This ensured a reliable performance estimation for all models [
30].
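The fold construction described above can be sketched as follows. This is a minimal illustration that assumes contiguous, unshuffled folds; the function name `k_fold_indices` is hypothetical, and in practice the samples would usually be shuffled (or stratified by class) first.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size
```

Each sample appears in exactly one validation fold, so averaging the k validation scores uses every record once for evaluation and k − 1 times for training.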
2.9. Model Prediction
After training the machine learning models, the next step was model prediction. In this phase, the trained models were applied to the testing subset to predict the outcomes based on the input features. The input features were mapped through the trained models to generate the predicted outcomes, which were then compared to the actual results for evaluation.
2.10. Confusion Matrix
In this study, the confusion matrix was utilized to evaluate model performance by calculating key metrics such as accuracy, precision, recall, and F1 score [
31].
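The four metrics follow directly from the counts in a binary confusion matrix. The sketch below shows the arithmetic under the assumption that class 1 (cancer) is the positive class; the function name is hypothetical.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # Guard against empty denominators (e.g., no positive predictions).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

On imbalanced data, accuracy alone can look high even when the minority class is poorly detected, which is why precision, recall, and F1 are reported alongside it.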
2.11. Model Settings and Tuning
Further, we utilized the optimal hyperparameters to improve the performance of machine learning and deep learning models for predicting lung cancer, as mentioned in
Table 3.
2.12. Applied Algorithms for Predicting Lung Cancer
In the application of this study, we utilized supervised learning algorithms involving machine learning algorithms, including the following:
2.12.1. Support Vector Machine
A support vector machine (SVM) is a computational model employed for categorization and prediction tasks. In simpler terms, decision trees ask questions, a Random Forest obtains opinions from many trees, and a support vector machine draws a smart line to make sense of things [
32].
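The decision function of a linear SVM is

$$f(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x} + b$$

Here, $\mathbf{w}$ is the weight vector, $\mathbf{x}$ is the input feature vector, and $b$ is the bias term.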
2.12.2. Logistic Regression
Logistic regression is a statistical technique that estimates the probability that an outcome belongs to a given category, typically denoted as “1” (e.g., that a patient has the disease). This method is frequently applied to binary classification problems, where the aim is to estimate the probability of an event occurring [33].
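How logistic regression turns a weighted sum of features into a probability can be sketched as follows. The weights and bias here are hypothetical placeholders for values that would be learned during training.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function, mapping any real z into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(weights, bias, features):
    """Probability of class 1 for one sample under a logistic regression model."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

def predict(weights, bias, features, threshold=0.5):
    """Assign class 1 when the estimated probability reaches the threshold."""
    return int(predict_proba(weights, bias, features) >= threshold)
```

The default threshold of 0.5 is a common convention; a lower threshold would trade precision for recall, which may be preferable in screening settings.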
2.12.3. K-Nearest Neighbor
The k-nearest neighbor (KNN) classification algorithm identifies k objects in the training dataset that are closest to a given test object. The majority class among these k neighbors determines the test object’s class. This approach addresses two key challenges in real-world datasets: the rarity of exact matches between objects and the potential for conflicting class information among nearby objects. More broadly, KNN belongs to the family of instance-based learning algorithms, which also include case-based reasoning, a method that operates on symbolic data. Additionally, KNN exemplifies lazy learning techniques, as the algorithm postpones the generalization process until query time, relying directly on the training data for decision-making [
34].
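In this research, the Minkowski distance between two samples $\mathbf{x}$ and $\mathbf{x}'$ with $n$ features is taken as

$$d(\mathbf{x}, \mathbf{x}') = \left( \sum_{i=1}^{n} \left| x_i - x'_i \right|^{p} \right)^{1/p}$$

where $p$ is the parameter that defines the metric’s power. The predicted class is the majority label among the $k$ nearest neighbors:

$$\hat{y} = \operatorname{mode}\{\, y_j : j = 1, \dots, k \,\}$$

where $y_j$ is the class label of the $j$-th nearest neighbor.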
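The distance and the majority vote can be combined into a short classifier sketch. The default `p=2` (Euclidean distance) is an assumption, since the paper does not state the value used; `k=5` follows the n_neighbors setting reported in Table 3.

```python
from collections import Counter

def minkowski(a, b, p=2):
    """Minkowski distance of order p between two feature vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def knn_predict(train_X, train_y, query, k=5, p=2):
    """Return the majority label among the k training points closest to `query`."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: minkowski(pair[0], query, p))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

No model is fitted ahead of time: all distance computations happen at query time, which is the lazy-learning behaviour described above.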
2.12.4. Decision Tree
Decision trees have been used widely in classification and regression problems. Therefore, they were aptly used to predict lung cancer, where the target variable is a binary value (0 or 1) indicating no cancer and cancer, respectively. The decision tree establishes a relationship between the target output and other input features [
35].
Gini Impurity
The Gini impurity, used to gauge the quality of a decision tree split, is defined as

$$Gini(D) = 1 - \sum_{i=1}^{C} p_i^{2}$$

where $p_i$ is the probability of class $i$ in the dataset $D$, and $C$ is the number of classes in the dataset.
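As a worked illustration, the Gini impurity of a node (one minus the sum of squared class proportions) and the size-weighted impurity of a candidate split can be computed as follows. The function names are hypothetical.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(left_labels, right_labels):
    """Impurity of a binary split, weighting each child node by its size."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n * gini_impurity(left_labels)
            + len(right_labels) / n * gini_impurity(right_labels))
```

A pure node has impurity 0, and a perfectly mixed binary node has impurity 0.5; the tree chooses the split with the lowest weighted impurity.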
Mathematical Representation
Let the feature set be denoted by $X$ and the target variable by $Y$, where $Y \in \{0, 1\}$. The definition is such that 0 denotes no cancer, and 1 denotes the presence of cancer. Thus, the decision function is defined as

$$f(X) = \begin{cases} 0, & X \in R_0 \\ 1, & X \in R_1 \end{cases}$$

where $R_0$ and $R_1$ correspond to the regions representing class 0 and class 1, respectively.
2.12.5. Random Forest
This algorithm is applied as a form of ensemble learning, suitable for application in medical diagnostic purposes, such as predicting lung cancer. RF has the flexibility to support variables with both qualitative and quantitative types and scales. Therefore, it can readily be utilized with a binary version of the lung cancer dataset, using 0 to denote the lack of the disease and 1 to denote cancer [
36].
The Gini impurity measures the impurity of the decision nodes of the trees in a Random Forest:

$$Gini(D) = 1 - \sum_{i=1}^{C} p_i^{2}$$

where $p_i$ is the class probability at node $D$, and $C$ is the number of classes.
2.12.6. Gradient Boosting
Boosting algorithms work by combining many simple models, called weak learners, which are only slightly better than guessing randomly. By combining these weak models step by step, they create a much stronger and more accurate model. Gradient Boosting is a boosting algorithm often used for regression problems. It improves the prediction model by adding minor corrections at each step, building the final solution as a weighted combination of these corrections [
37].
$$\hat{y}_i = \sum_{m=1}^{M} \gamma_m h_m(x_i)$$

where $\hat{y}_i$ is the predicted output for the $i$-th data point after $M$ iterations, $\sum$ denotes the summation notation, indicating that the terms from $m = 1$ to $M$ are to be added together, $\gamma_m$ is a weight or coefficient that adjusts the contribution of each function $h_m$, and $h_m(x_i)$ is the output of the $m$-th function (or model) for the $i$-th input data point $x_i$.
2.12.7. Gaussian Naïve Bayes
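When working with continuous data, it is often assumed that the values for each class follow a Gaussian (normal) distribution. To handle this, the training data are divided by class, and the mean and standard deviation of each feature are calculated for each class. Using these estimates, the class-conditional probability of a continuous value can be computed with the following formula [38]:

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^{2}}} \exp\left( -\frac{(x_i - \mu_y)^{2}}{2\sigma_y^{2}} \right)$$

where $\mu_y$ and $\sigma_y$ are the mean and standard deviation of feature $x_i$ within class $y$.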
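The procedure described above (split the data by class, estimate a mean and standard deviation per feature, then score new samples with the Gaussian density) can be sketched as follows. This is an illustration only: the function names are hypothetical, and the small constant `eps` added to each variance is a simplification that is not identical to sklearn's var_smoothing rule.

```python
import math
from collections import defaultdict

def gaussian_log_pdf(x, mean, std):
    """Logarithm of the normal density with the given mean and standard deviation."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def fit_gnb(X, y, eps=1e-9):
    """Estimate the prior and the per-feature mean and std for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # eps keeps the variance positive when a feature is constant within a class.
        stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n + eps)
                for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, stds)
    return model

def predict_gnb(model, x):
    """Return the class with the highest log posterior for sample x."""
    def log_posterior(label):
        prior, means, stds = model[label]
        return math.log(prior) + sum(
            gaussian_log_pdf(v, m, s) for v, m, s in zip(x, means, stds))
    return max(model, key=log_posterior)
```

Working with log densities avoids numerical underflow when the likelihoods of many features are multiplied together.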
2.12.8. Multinomial Naïve Bayes
Multinomial naïve Bayes is commonly used for discrete feature counts, such as word counts in text classification or event counts in health datasets. While it is not the primary method for lung cancer prediction, it may still be applicable in specific contexts [
39].
2.12.9. Multilayer Perceptron
Artificial neural networks (ANNs) are inspired by how the human brain uses multiple layers. ANNs can learn patterns and relationships in data by training on examples, allowing them to generalize and make predictions for new situations. One of the most widely used types of ANN is multilayer perceptron (MLP), which is a powerful tool for modeling complex relationships. MLPs create nonlinear models, enabling them to predict outputs based on given inputs [
40].
Lung cancer prediction is typically a classification task (e.g., predicting whether a patient has lung cancer or not). Since tabular data are used in this study, MLP is a suitable choice. Since MLPs consist of multiple layers of neurons, the layers typically involved in the lung cancer prediction model are as follows:
Hidden Layer
Hidden layers are responsible for learning the underlying patterns and relationships in the data. In this study, the network has a single hidden layer of 100 neurons (hidden_layer_sizes = (100,)), as mentioned in Table 3. The sigmoid activation function was chosen because its output is bounded between 0 and 1, which suits binary classification tasks. Sigmoid units can suffer from vanishing gradients in deep networks, but this is of limited concern with a single hidden layer.
Table 3.
Applied optimal tuning parameters for individual models.
Model | Hyperparameters |
---|---|
Logistic regression | L2 penalty, C = 1.0, solver = ‘lbfgs’, max_iter = 100 |
K-nearest neighbors (KNN) | n_neighbors = 5, weights = ‘uniform’, algorithm = ‘auto’, leaf_size = 30 |
Support vector machine (SVM) | kernel = ‘rbf’, gamma = ‘auto’, C = 1.0, degree = 3 |
Decision tree | criterion = ‘gini’, max_depth = None, min_samples_split = 2 |
Random Forest | n_estimators = 100, criterion = ‘gini’, max_depth = None, bootstrap = True |
Multilayer perceptron (MLP) | hidden_layer_sizes = (100,), activation = ‘sigmoid’, solver = ‘adam’, alpha = 0.0001 |
Gradient Boosting | n_estimators = 100, learning_rate = 0.1, max_depth = 3, subsample = 1.0 |
Multinomial naïve Bayes | alpha = 1.0 |
Gaussian naïve Bayes | var_smoothing = 1e-9 |
Hybrid model | “n_estimators”: [50, 100, 200], “learning_rate”: [0.01, 0.1, 0.5, 1.0], “algorithm”: [“SAMME”, “SAMME.R”] |
Output Layer
The output layer determines the prediction. For binary classification, such as predicting the presence or absence of lung cancer, the output layer consists of a single neuron with a sigmoid activation function, which outputs a probability between 0 and 1.
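The forward pass of such a network, with one sigmoid hidden layer and a single sigmoid output neuron, can be sketched as follows. The weights and biases are hypothetical arguments standing in for parameters that would be learned during training; the sketch does not cover training itself.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def mlp_forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Return the predicted probability of class 1 for one input vector x.

    hidden_weights holds one weight vector per hidden neuron.
    """
    hidden = [sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
              for weights, bias in zip(hidden_weights, hidden_biases)]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)
```

The returned value lies strictly between 0 and 1 and can be thresholded (for example at 0.5) to obtain the predicted class.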
3. Results
The preliminary data analysis categorized cases into positive and negative lung cancer diagnoses, revealing 238 positive and 38 negative cases.
Figure 2 illustrates the correlation values between variables, where positive values near 1 denote strong positive correlations, and negative values near −1 indicate strong negative correlations. Moderate positive correlations include anxiety and swallowing difficulty (0.48) and alcohol consumption and allergy (0.38). Negative correlations include alcohol consumption and yellow fingers (−0.27) and anxiety and coughing (−0.22). Insignificant correlations, such as alcohol consumption and swallowing difficulty (−0.00063), suggest no meaningful relationship.
Table 4 presents model evaluation metrics prior to cross-validation. SVM, Random Forest, and Gradient Boosting exhibited the highest accuracy (0.98), with KNN achieving the highest precision (1.00) but lower recall (0.91). For recall, SVM, Random Forest, and Gradient Boosting outperformed the rest with 0.98. The F1 score reaffirmed their robustness (0.98), ensuring balanced precision and recall. AUC analysis highlighted Random Forest and decision tree as the most discriminative (0.96), while multinomial naïve Bayes underperformed, having the lowest accuracy (0.81) and AUC (0.74).
Figure 3 and Figure 4 provide a comparative visualization of model performance across various metrics. Specifically, Figure 4 highlights the ROC curves of the nine algorithms. The Receiver Operating Characteristic (ROC) curve helps evaluate the performance of classifiers by illustrating the trade-off between the true positive rate and the false positive rate at various threshold levels.
Figure 5 compares the accuracy of the existing and proposed machine learning models for the prediction of lung cancer. The proposed models achieved higher accuracy than the existing ones for most algorithms, indicating that the refined methodology is more effective.
Table 5 shows a comprehensive comparison of accuracy across the different ML and DL models. For most models, the proposed methods achieve higher accuracy than the existing approaches.
Logistic regression improved in accuracy by 5.58 percentage points (87.50% to 93.08%), and KNN improved by 0.44 percentage points (92.86% to 93.30%). The support vector classifier (SVC), which is suited to high-dimensional classification, showed a significant improvement of 9.48 percentage points, from 85.71% to 95.19%. The decision tree model, which is typically valued for its interpretability, improved by 5.47 percentage points, from 89.29% to 94.76%.
The ensemble methods showed significant gains. Random Forest, which uses multiple decision trees to reduce overfitting, gained 9.69 percentage points (85.71% to 95.40%). Gradient Boosting, which improves predictions by sequentially correcting the errors of earlier learners, gained 6.32 percentage points (89.29% to 95.61%). MLP, a deep learning technique that captures intricate feature interactions, gained 4.84 percentage points, from 89.29% to 94.13%.
Gaussian naïve Bayes reached 88.46%, which is lower than the previous 91.07% (a decrease of 2.61 percentage points). Meanwhile, multinomial naïve Bayes reached 74.84%, the lowest accuracy among the models reported here.
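The changes in accuracy can be checked by simple arithmetic on the previous and proposed values quoted in the text. The snippet below uses only those reported figures and expresses each change in percentage points; the variable names are illustrative.

```python
# (previous accuracy %, proposed accuracy %) as quoted in the text.
reported = {
    "Logistic regression": (87.50, 93.08),
    "KNN": (92.86, 93.30),
    "SVC": (85.71, 95.19),
    "Decision tree": (89.29, 94.76),
    "Random Forest": (85.71, 95.40),
    "Gradient Boosting": (89.29, 95.61),
    "MLP": (89.29, 94.13),
    "Gaussian naive Bayes": (91.07, 88.46),
}

# Change in percentage points; a negative value means the accuracy decreased.
changes = {model: round(after - before, 2)
           for model, (before, after) in reported.items()}

for model, delta in sorted(changes.items(), key=lambda item: -item[1]):
    print(f"{model}: {delta:+.2f} percentage points")
```

Random Forest and SVC show the largest gains, while Gaussian naïve Bayes is the only model in this list whose accuracy is lower than the previously reported value.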
The hybrid model was developed by first performing cross-validation on the Gaussian naïve Bayes (GNB) model, followed by integrating the AdaBoost algorithm, a boosting technique, to enhance performance. In this study, the proposed hybrid model achieved an accuracy of 96.42%, as shown in
Table 5.
These findings provide a valid comparison of the proposed and previous machine learning and deep learning methods. The hyperparameters, as detailed in Table 3, were used to achieve optimal performance in lung cancer prediction. This outcome should be helpful to the ML research community and to practitioners seeking better classification performance through optimization strategies. We conclude that Gradient Boosting and the Gaussian naïve Bayes-based hybrid model performed best on this problem, given the small dataset and its binary attributes. Such models would be more suited to situations where the attributes/features of the dataset are highly independent of each other. In contrast, the performance of the other models was restricted by attribute correlation and the training/testing splits.
4. Discussion
This study highlights the significant role of machine learning and deep learning models in enhancing lung cancer prediction. By combining traditional ML classifiers with deep learning-based feature extraction and optimizing model parameters, the predictive performance of the proposed approach showed notable improvements. The hybrid model, which integrates Gaussian naïve Bayes with the AdaBoost algorithm, achieved a high accuracy of 96.42%, surpassing conventional models by addressing key limitations such as feature independence assumptions in naïve Bayes classifiers. Compared to existing research, many previous studies have utilized individual ML models like support vector machines (SVMs), decision trees, and naïve Bayes for lung cancer prediction. For instance, Maurya et al. (2024) reported that the accuracy of Random Forest and Gradient Boosting models was 85.71% and 89.29%, respectively [
41]. Our findings indicate substantial advancements, with Random Forest achieving 95.40% and Gradient Boosting reaching 95.61%, confirming the effectiveness of hyperparameter tuning and the application of data augmentation techniques, such as Adaptive Synthetic Sampling (ADASYN), in improving model performance.
Furthermore, Shah et al. (2023) emphasized the necessity of advanced feature selection and denoising techniques to improve ML-based diagnostic accuracy [
10]. Our study aligns with this by employing label encoding and feature selection, resulting in better data representation and minimizing the impact of irrelevant variables. Li et al. (2022) also demonstrated the advantage of ensemble learning techniques like XGBoost in cancer prediction [
15], which is a conclusion that is consistent with our results, where Gradient Boosting emerged as one of the highest-performing classifiers. Deep learning approaches have gained attention in lung cancer diagnostics, particularly medical imaging applications. Gayap et al. (2024) demonstrated that CNN-based models outperform conventional ML models for CT scan-based lung cancer detection [
11]. While our research focused on structured textual clinical data rather than image-based datasets, incorporating multilayer perceptron (MLP) into our model significantly improved classification accuracy from 89.29% to 94.13%. This finding suggests that DL models can effectively capture complex, nonlinear patterns even in non-imaging datasets, enhancing diagnostic precision.
Moreover, our study observed an increase in the accuracy of multinomial naïve Bayes to 74.84%, indicating that preprocessing improvements addressed some of the model’s inherent feature independence constraints. The performance of MLP also improved from 89.29% to 94.13%, likely due to optimized activation functions, deeper network architectures, and regularisation techniques, such as dropout. These results reinforce the notion that deep learning, combined with practical hyperparameter tuning and preprocessing, can yield superior outcomes compared to traditional statistical classifiers.
The findings of this research further validate the capability of ML models in facilitating early detection of lung cancer and supporting clinical decision-making by identifying high-risk patients based on symptom-based data. The proposed hybrid model, with an accuracy of 96.42%, suggests that ensemble learning techniques can significantly enhance prediction reliability. The ability of ML algorithms to analyse large-scale patient datasets and detect early-stage indicators of lung cancer can transform current diagnostic methods, leading to more personalised and timely treatment interventions.
5. Conclusions
The lungs are the primary organs of respiration, providing a constant supply of oxygen to the blood, without which a human could not survive. Lung cancer is the leading cause of cancer-related deaths for both males and females, with most fatalities attributable to the late stage at which the disease is detected. Early diagnosis significantly increases life expectancy. In this study, supervised learning was explored to train a model for diagnosing lung cancer based on feature extraction from symptom-related data. The models were compared using accuracy, precision, recall, F1-score, and AUC values as performance criteria for logistic regression (LR), k-nearest neighbors (KNN), support vector machine (SVM), Gradient Boosting (GB), Gaussian naïve Bayes (GNB), multinomial naïve Bayes (MNB), decision tree (DT), Random Forest (RF), and multilayer perceptron (MLP). The experimental results obtained through 10-fold cross-validation revealed that Gradient Boosting outperformed the other individual models, achieving 95.61% accuracy, precision, recall, and F1-score, along with an AUC of 98%. After a detailed evaluation of nine machine learning algorithms, both Gradient Boosting (95.61% accuracy) and the hybrid model (96.42% accuracy) were identified as the most suitable predictive models for lung cancer diagnosis.
6. Limitation and Future Scope
This study highlights the potential of various machine learning algorithms in analyzing textual clinical data for the early detection of lung cancer. However, it is essential to note that the study was conducted on a relatively small dataset, which primarily relies on patients and symptoms. Future research should focus on applying these techniques to larger datasets to assess variability in algorithm performance. Additionally, employing a more comprehensive and authentic dataset containing at least 16 essential parameters would enhance the robustness of classification, as the current approach is based on symptoms and behavioural attributes. Moreover, advanced machine learning models, such as multilayer perceptrons (MLPs), warrant further exploration when applied to larger datasets. Observational insights from this study suggest that male patients with a history of alcohol consumption, chest pain, and allergies may have a higher likelihood of developing lung cancer. However, these findings require expert validation, which could lead to the establishment of a weighting system for specific attributes in lung cancer detection, thereby improving diagnostic accuracy.
The integration of Electronic Health Records (EHRs) can play a crucial role in the early detection of lung cancer by leveraging clinical data to identify patterns among patients. This approach enables a comparative analysis of relevant attributes and allows for a more reliable assessment of the Area Under the Curve (AUC) achieved by different machine learning models. Further studies are needed to validate the model adoption process and refine classification strategies for improved real-world applications. In hospitals with limited resources, EHR-based automated risk assessment tools can assist doctors in decision-making by flagging high-risk lung cancer cases based on patients’ medical history. However, further validation through clinical trials and real-world deployments is necessary to assess the feasibility and ethical considerations of such implementations.