Article

Integrating Machine Learning Algorithms: A Hybrid Model for Lung Cancer Outcome Improvement

1 Department of Computer Science & Design, Faculty of Engineering and Technology, Datta Meghe Institute of Higher Education & Research (DU), Wardha 442001, India
2 Department of Computer Science & Medical Engineering, Faculty of Engineering and Technology, Datta Meghe Institute of Higher Education & Research (DU), Wardha 442001, India
3 Department of Applied Sciences, Chandigarh College of Engineering, Chandigarh Group of Colleges, Jhanjeri 140307, India
4 Department of Artificial Intelligence & Machine Learning, Faculty of Engineering and Technology, Datta Meghe Institute of Higher Education & Research (DU), Wardha 442001, India
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(9), 4637; https://doi.org/10.3390/app15094637
Submission received: 24 December 2024 / Revised: 22 February 2025 / Accepted: 26 February 2025 / Published: 22 April 2025
(This article belongs to the Special Issue Artificial Intelligence for Healthcare)

Abstract

Lung cancer is a major global health threat, affecting millions annually and resulting in severe complications and high mortality rates, particularly when diagnosed late. It remains one of the leading causes of cancer-related deaths worldwide, often detected at advanced stages due to the lack of early symptoms. This study introduces a novel hybrid machine learning model aimed at enhancing early detection accuracy and improving patient outcomes. By integrating traditional machine learning classifiers with deep learning techniques, the proposed framework optimizes feature selection, hyperparameter tuning, and data-balancing strategies, such as Adaptive Synthetic Sampling (ADASYN). A comparative evaluation with existing models demonstrated substantial improvements in predictive accuracy, ranging from 0.44% to 9.69%, with Gradient Boosting and Random Forest models achieving the highest classification performance. The study highlights the importance of hybrid methodologies in refining lung cancer diagnostics, ensuring robust, scalable, and clinically viable predictive models.

1. Introduction

Cancer incidence and mortality are critical and growing public health issues in India, with cancer accounting for a significant proportion of deaths. About 75,000 new cases were reported in 2022. The burden is further compounded by a high prevalence of risk factors and by challenges in the early detection and diagnosis of the disease [1].
Tobacco use is the leading cause of lung cancer in India, accounting for about 85% of cases. According to the 2016-17 GATS report, an estimated 28.6% of adults in India use tobacco in some form, with a higher percentage among men [2]. This includes smoking cigarettes and bidis and using smokeless tobacco products such as gutka and paan. Moreover, exposure to secondhand smoke increases contact with carcinogenic agents, which in turn raises the risk of lung cancer [3].
Environmental pollution is another significant driver of the growing number of lung cancer cases in India. Delhi, Mumbai, and Kolkata rank among the most polluted cities worldwide. Emissions from automobiles and industrial activities, together with particulate matter released during construction, significantly damage the lungs and are associated with an increased incidence of lung cancer. A study published in The Lancet Planetary Health showed that exposure to ambient PM2.5 is associated with a further increased risk of lung cancer [4].
Household air pollution is a significant rural concern because many households rely on solid fuels such as firewood, cow dung, and coal for cooking and heating. According to the WHO, an estimated 700 million people in India are exposed to household air pollution, which increases lung cancer risk, especially for women, who spend more of their time in the kitchen [5].
Occupational hazards are another major contributor to the development of lung cancer. Workers in the construction, mining, and manufacturing industries are often exposed to carcinogens such as asbestos and radon gas. Strict safety measures in the workplace, along with annual checkups, can substantially reduce these occupational risks [6].
If lung cancer is discovered early, the survival rate can be higher. The five-year survival rate for lung cancer diagnosed at an early stage can reach between 60% and 80%, whereas the survival rate for advanced-stage lung cancer falls below 15% [7]. The National Programme for the Prevention and Control of Cancer and Diabetes is designed to enhance cancer screening and early detection initiatives throughout India. However, more than 70% of lung cancer diagnoses occur at the late stages, and therefore, more advanced screening and diagnostic techniques are fundamentally required [8].
Current developments in ML and AI open exciting possibilities for better detection and diagnosis of lung cancer. Machine learning algorithms can extract insights from large datasets, identify associated risk factors, and recognize patterns, enabling earlier detection and personalized treatment planning. A recent study in The Lancet Oncology showed that an AI-based system reduced false positives and false negatives in lung cancer screening and increased accuracy [9].
The present study focuses on developing a hybrid model for lung cancer prediction while also conducting a comparative analysis of various machine learning algorithms. The proposed preprocessing techniques, including ADASYN and label encoding, enhance model performance by addressing class imbalance and feature representation. This study aims to improve early detection and diagnosis by integrating advanced machine learning techniques to analyze key risk factors associated with lung cancer. By leveraging a hybrid approach that combines traditional machine learning models with deep learning, the research seeks to achieve higher predictive accuracy. Additionally, the study evaluates multiple machine learning models to identify the most significant features contributing to lung cancer prediction, ensuring a robust and reliable methodology for practical healthcare applications.

Translational Advancements in Machine Learning for Cancer Prediction

Lung carcinoma, characterized by the malignant growth of lung cells, remains a significant challenge in oncology, emphasizing the need for advancements in diagnostic and predictive technologies. Recent progress in computer vision and data analytics has paved the way for sophisticated diagnostic methods, mainly by analyzing temporal medical images. Beyond imaging, clinical text data, such as patient symptoms and medical histories, have emerged as invaluable resources for lung cancer diagnosis, enabling the integration of diverse data modalities for enhanced prediction. Machine learning (ML) techniques, including neural networks, support vector machines (SVMs), decision trees, convolutional neural networks (CNNs), and nonlinear cellular automata, have shown considerable success in predicting lung cancer recurrence and survivability. Ensemble methods such as Random Forest, XGBoost, Bagging, and Adaboost have demonstrated superior predictive accuracy by combining multiple models to reduce error rates. Public datasets, like the SEER database, have been instrumental in evaluating these methods using metrics such as precision, recall, F1-score, and the Area Under the Curve (AUC) of Receiver Operating Characteristic (ROC) curves.
Binary classification models have also played a vital role in diagnosis, prognosis, and prediction tasks. Techniques like logistic regression, Gaussian naïve Bayes, k-nearest neighbors (KNN), artificial neural networks (ANNs), radial basis function (RBF) networks, gradient-boosted trees, and multilayer perceptrons (MLPs) have been extensively utilized. These models incorporate demographic factors like age and gender and behavioral factors like smoking habits into personalized diagnostic tools. However, significant challenges persist in integrating diverse data types, such as imaging and clinical data, and conducting comprehensive analyses of ML methods. Addressing these challenges requires multidisciplinary approaches to refine diagnostic accuracy and improve treatment planning.
The advancements in ML for lung cancer diagnosis also involve integrating traditional and emerging methodologies. Unlike conventional AI, ML evolves by learning from past data, enabling dynamic and complex decision-making. Several studies have highlighted the transformative potential of ML in lung cancer detection and prognosis. Shah et al. [10] identified limitations in existing diagnostic methods, such as the need for advanced feature extraction techniques, robust noise removal, and hybrid classification approaches to improve ML and DL applications in clinical settings. Gayap et al. [11] highlighted deep learning techniques, including 2D and 3D CNNs, dual-path networks, and vision transformers (ViTs), which outperform classical ML methods in detecting lung cancer using CT scans. Didier et al. [12] demonstrated that ML models surpass logistic regression in predicting lung cancer survival, offering higher discriminatory accuracy. Similarly, Raoof et al. [13] reviewed the strengths and limitations of ML techniques, including deep learning and ensemble methods, for specific imaging types.
Other notable contributions include the work of Javed et al. [14], who summarized the effectiveness of CNNs and recurrent neural networks (RNNs) in lung cancer detection while addressing challenges related to higher accuracy and generalizability. Li et al. [15] showcased the potential of SVMs and Random Forest in predicting lung cancer progression and emphasized their clinical decision-making utility. Dodia et al. [16] discussed the challenges of translating ML models for early lung cancer detection into clinical practice, mainly using CT scans. Huang et al. [17] compared deep learning and traditional ML methods, highlighting deep learning models’ superior accuracy and sensitivity while stressing the need for better interpretability and generalizability.
The literature underscores the transformative potential of ML in lung cancer diagnosis. By addressing challenges such as multimodal data integration, feature selection, and implementation barriers, these advancements promise to improve patient care and outcomes significantly. Early detection and accurate prediction are pivotal in improving patient outcomes and tailoring effective treatment strategies. Advances in ML and deep learning (DL) techniques have shown immense potential in addressing the complexities of disease prediction and diagnosis.

2. Methodology

Figure 1 illustrates a comparison between machine learning (ML) and deep learning (DL) approaches for lung cancer improvement analysis. Both ML and DL follow similar processes, starting with data input, analysis, and preprocessing. They proceed with feature exploration, followed by the training and testing of the models. In the ML approach, various algorithms such as logistic regression, k-nearest neighbors, support vector machine, Random Forest, Gaussian naïve Bayes, multinomial naïve Bayes, Gradient Boosting, and decision tree are used, leading to the selection of the best model based on hyperparameter tuning and performance analysis. On the other hand, DL employs multilayer perceptron (MLP) as the core algorithm. Both approaches involve making predictions and conducting thorough performance analyses.

2.1. Data Collection

A publicly available lung cancer prediction dataset was used in this study. The dataset comprises 16 attributes and 309 instances; we selected only the 12 attributes listed in Table 1, as the others were not relevant to lung cancer [18].

2.2. Data Preprocessing

In this study, only two techniques were employed: ADASYN and label encoding, as described below. Our primary focus was on ensuring data consistency rather than handling missing values, as the dataset contained no missing values [19].

2.2.1. Adaptive Synthetic Sampling (ADASYN)

In this study, Adaptive Synthetic Sampling (ADASYN) was used to cope with class imbalance in the dataset when building the ML system for lung cancer prediction. The data used for this task comprise 309 records, 12 variables, and the target LUNG_CANCER (Yes/No). When one class is substantially smaller than the other, a model may become biased in favor of the majority class, yielding poor recall for the minority class [20].
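The oversampling step above can be sketched in plain numpy. This is a minimal ADASYN-style illustration, not the exact implementation used in the study (in practice the `ADASYN` class from the imbalanced-learn library would typically be used); all data and parameter values below are made up.

```python
import numpy as np

def adasyn_sketch(X, y, k=5, rng=None):
    """Minimal ADASYN sketch: oversample the minority class by
    interpolating toward minority neighbors, generating more
    synthetics for points surrounded by the majority class."""
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    X_min = X[y == minority]
    G = counts.max() - counts.min()          # synthetics needed

    # Ratio of majority samples among each minority point's k-NN
    # (neighbors searched over the whole dataset, excluding self).
    r = np.empty(len(X_min))
    for i, x in enumerate(X_min):
        d = np.linalg.norm(X - x, axis=1)
        nn = np.argsort(d)[1:k + 1]
        r[i] = np.mean(y[nn] != minority)
    if r.sum() == 0:
        r[:] = 1.0                            # fall back to uniform
    g = np.rint(r / r.sum() * G).astype(int)  # per-point quota

    # Interpolate between each minority point and a random
    # minority neighbor to create its synthetic samples.
    new = []
    for i, x in enumerate(X_min):
        d = np.linalg.norm(X_min - x, axis=1)
        nn = np.argsort(d)[1:k + 1]
        for _ in range(g[i]):
            lam = rng.random()
            new.append(x + lam * (X_min[rng.choice(nn)] - x))
    if not new:
        return X, y
    X_new = np.vstack([X, np.array(new)])
    y_new = np.concatenate([y, np.full(len(new), minority)])
    return X_new, y_new

# Toy imbalanced data standing in for the lung cancer records.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (10, 2))])
y = np.array([0] * 50 + [1] * 10)
X_res, y_res = adasyn_sketch(X, y, rng=1)
print((y == 1).sum(), "->", (y_res == 1).sum())  # minority grows toward 50
```

The key ADASYN idea captured here is the density-based quota g: minority points with more majority-class neighbors (harder to learn) receive more synthetic samples.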

2.2.2. Label Encoding

In this dataset, the LUNG_CANCER attribute is stored as object (string) data, so it was converted to numerical values using LabelEncoder from scikit-learn [21]. LabelEncoder is a utility class that normalizes labels so that they contain only values between 0 and n_classes − 1. It can also transform non-numerical labels (as long as they are hashable and comparable) into numerical labels [22].
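A minimal sketch of this encoding step with scikit-learn's LabelEncoder; the column values shown are the dataset's YES/NO labels, but the specific rows are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

lung_cancer = ["YES", "NO", "YES", "YES", "NO"]

le = LabelEncoder()
encoded = le.fit_transform(lung_cancer)

# Classes are stored in sorted order, so 'NO' -> 0 and 'YES' -> 1.
print(list(le.classes_))   # ['NO', 'YES']
print(list(encoded))       # [1, 0, 1, 1, 0]
```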

2.3. Data Analysis

In this study, we analyzed the dataset and found no missing values, as presented in Table 2. However, 33 duplicate entries were found among the instances and were removed before processing. A Pearson correlation heat map was also plotted to assess the importance of the attributes, as shown in Figure 2. The attributes of the clinical dataset were chosen on the advice of domain experts and to measure the effectiveness of the cancer-prediction application, which helps patients understand their cancer risk at low cost and make decisions about appropriate treatment. For data analysis, we used VS Code 1.96 and Python libraries, including NumPy, Pandas, Matplotlib, and Seaborn, for data visualization [23].
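Assuming pandas, the missing-value check, duplicate removal, and Pearson correlation described above might look like the following sketch; the toy frame stands in for the clinical dataset and its values are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "AGE": [62, 55, 62, 70],
    "SMOKING": [1, 0, 1, 1],
    "LUNG_CANCER": [1, 0, 1, 1],
})

print(df.isnull().sum().sum())   # 0 -> no missing values
df = df.drop_duplicates()        # row 2 duplicates row 0
print(len(df))                   # 3 rows remain

# Pearson correlation matrix (rendered as a heat map in the
# paper, e.g. with seaborn.heatmap(corr)).
corr = df.corr(method="pearson")
print(round(corr.loc["SMOKING", "LUNG_CANCER"], 2))
```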

2.4. Data Transformation

In this phase of the study, the dataset was divided into two subsets: training and testing. Since the given data are in categorical form, label encoding was applied to convert categorical variables into numerical representations, thereby enhancing the model’s performance. The encoding process is illustrated in Figure 2 [24,25].

2.5. Model Construction

We implemented various machine learning algorithms on the clinical lung cancer dataset following an initial statistical analysis. To enhance model performance, attribute correlation analysis was conducted, allowing us to refine and optimize the dataset for lung cancer prediction [26].

2.6. Data Split

In this study, the dataset was split into two sets, with 80% for training and 20% for testing. During the training process, each model underwent 10-fold cross-validation [27].
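The 80/20 split can be sketched with scikit-learn as follows; the arrays are placeholders, and the use of stratification is an assumption (the paper does not state it).

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)
y = np.array([0, 1] * 25)

# test_size=0.2 gives the 80/20 split; stratify keeps the class
# ratio equal in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(len(X_train), len(X_test))   # 40 10
```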

2.7. Model Training

As shown in Figure 2, this phase involved training the machine learning models using the prepared dataset. After applying label encoding to handle categorical data, the dataset was split into two subsets: training and testing [28]. The training subset was used to train the models, while the testing subset facilitated model evaluation. Several machine learning algorithms were implemented, and their performance was assessed using metrics such as accuracy, precision, recall, and the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) [29].

Hybrid Model Approach

We developed a Gaussian naïve Bayes (GNB) model for lung cancer prediction. This model follows the same overall pipeline as the other machine learning and deep learning models but is based on Bayes’ theorem, which relies on conditional probability, as discussed in Section 2.12.7. We then integrated the GNB model with the AdaBoost algorithm to enhance classification performance, combining the probabilistic outputs of GNB with AdaBoost’s iterative re-weighting of misclassified samples.

2.8. Cross-Validation

We used 10-fold cross-validation to improve model generalization and reduce overfitting. The dataset was split into 10 subsets, with each fold serving as a validation set once, and the remaining folds were used for training. This ensured a reliable performance estimation for all models [30].
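The 10-fold cross-validation step can be sketched with scikit-learn's `cross_val_score` (synthetic data and an arbitrary base model; `max_iter` is raised here only so the sketch converges cleanly).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=12, random_state=0)

# cv=10: each fold is held out once for validation while the
# remaining nine folds train the model.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print(len(scores), round(scores.mean(), 2))   # 10 fold scores and mean
```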

2.9. Model Prediction

After training the machine learning models, the next step was model prediction. In this phase, the trained models were applied to the testing subset to predict the outcomes based on the input features. The input features were mapped through the trained models to generate the predicted outcomes, which were then compared to the actual results for evaluation.

2.10. Confusion Matrix

In this study, the confusion matrix was utilized to evaluate model performance by calculating key metrics such as accuracy, precision, recall, and F1 score [31].
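A short worked example of deriving these metrics from a confusion matrix with scikit-learn; the label vectors are illustrative.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

# For binary labels, ravel() unpacks the 2x2 matrix as
# (true neg, false pos, false neg, true pos).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                    # 2 1 1 4
print(accuracy_score(y_true, y_pred))    # (tp + tn) / total = 0.75
print(precision_score(y_true, y_pred))   # tp / (tp + fp) = 0.8
print(recall_score(y_true, y_pred))      # tp / (tp + fn) = 0.8
print(f1_score(y_true, y_pred))          # harmonic mean = 0.8
```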

2.11. Model Settings and Tuning

Further, we utilized the optimal hyperparameters to improve the performance of machine learning and deep learning models for predicting lung cancer, as mentioned in Table 3.

2.12. Applied Algorithms for Predicting Lung Cancer

In this study, we applied the following supervised machine learning algorithms:

2.12.1. Support Vector Machine

A support vector machine (SVM) is a computational model employed for categorization and prediction tasks. Intuitively, where a decision tree asks a sequence of questions and a Random Forest aggregates the opinions of many trees, an SVM draws an optimal separating boundary between the classes [32].
$f(x) = W \cdot x + b$
Here, W is the weight vector, x is the input feature vector, and b is the bias term.

2.12.2. Logistic Regression

A technique of mathematical forecasting called logistic regression predicts the possibility that a result belongs to some category—typically denoted as “1” (e.g., whether a patient has a disease or not). This method is frequently applied to binary classification issues, where estimating the probability of an event occurring is the aim [33].
  • Model Formulation:
    Let the binary outcome variable be y, with y = 0 indicating no lung cancer and y = 1 indicating the presence of lung cancer.
    $P(y = 1 \mid X) = \sigma(z) = \dfrac{1}{1 + e^{-z}}$
    Here, z is a linear combination of the input features, and $\sigma$ is the sigmoid function that maps z to a probability.
  • Model Training:
    The logistic regression model was trained using maximum likelihood estimation. The likelihood function L of the data is
    $L(\beta) = \prod_{i=1}^{N} P(y_i \mid X_i) = \prod_{i=1}^{N} \sigma(z_i)^{y_i} \left(1 - \sigma(z_i)\right)^{1 - y_i}$
    where N denotes the number of observations and $z_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_n x_{in}$.
    Taking the natural logarithm of the likelihood yields the log-likelihood function:
    $\ell(\beta) = \sum_{i=1}^{N} \left[ y_i \log\left(\sigma(z_i)\right) + (1 - y_i) \log\left(1 - \sigma(z_i)\right) \right]$
    To find the optimal parameters $\beta$, we maximized the log-likelihood, typically using optimization algorithms such as gradient ascent or solvers like lbfgs in Python libraries such as scikit-learn.
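As a worked numeric check of the log-likelihood above (numpy only; the tiny dataset and weight vectors are made up for illustration):

```python
import numpy as np

def log_likelihood(beta, X, y):
    """l(beta) = sum_i [ y_i log s(z_i) + (1 - y_i) log(1 - s(z_i)) ],
    with z_i = beta_0 + x_i . beta_1..n and s the sigmoid."""
    z = beta[0] + X @ beta[1:]
    s = 1.0 / (1.0 + np.exp(-z))
    return float(np.sum(y * np.log(s) + (1 - y) * np.log(1 - s)))

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0])

# A flat model (all betas zero) predicts p = 0.5 everywhere, so
# its log-likelihood is 3 * log(0.5); a separating slope does better.
print(log_likelihood(np.array([0.0, 0.0]), X, y))   # 3 * log(0.5)
print(log_likelihood(np.array([-2.0, 2.0]), X, y) >
      log_likelihood(np.array([0.0, 0.0]), X, y))   # True
```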

2.12.3. K-Nearest Neighbor

The k-nearest neighbor (KNN) classification algorithm identifies k objects in the training dataset that are closest to a given test object. The majority class among these k neighbors determines the test object’s class. This approach addresses two key challenges in real-world datasets: the rarity of exact matches between objects and the potential for conflicting class information among nearby objects. More broadly, KNN belongs to the family of instance-based learning algorithms, which also include case-based reasoning, a method that operates on symbolic data. Additionally, KNN exemplifies lazy learning techniques, as the algorithm postpones the generalization process until query time, relying directly on the training data for decision-making [34].
In this research, the Minkowski distance is used:
$d(x_a, x_b) = \left( \sum_{j=1}^{m} |x_{aj} - x_{bj}|^{p} \right)^{1/p}$
where p is the parameter that defines the metric’s power. The predicted class is the majority vote of the k nearest neighbors:
$\hat{y} = \operatorname{argmax}_{c} \sum_{j=1}^{k} \mathbf{1}(y_j = c)$
where $y_j$ is the class label of the j-th nearest neighbor and $\mathbf{1}(\cdot)$ is the indicator function.
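The rule above can be sketched with scikit-learn's KNeighborsClassifier; the data points are illustrative, and p = 2 (the Euclidean special case of the Minkowski metric) is an assumption.

```python
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated toy clusters standing in for the two classes.
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = [0, 0, 0, 1, 1, 1]

# metric='minkowski' with p=2 reduces to the Euclidean distance;
# the 3 nearest neighbors vote on each query point's class.
knn = KNeighborsClassifier(n_neighbors=3, metric="minkowski", p=2)
knn.fit(X, y)
print(knn.predict([[1, 1], [5, 4]]).tolist())   # [0, 1]
```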

2.12.4. Decision Tree

Decision trees have been used widely in classification and regression problems. Therefore, they were aptly used to predict lung cancer, where the target variable is a binary value (0 or 1) indicating no cancer and cancer, respectively. The decision tree establishes a relationship between the target output and other input features [35].

Gini Impurity

The following is the definition of the Gini impurity used to gauge a decision tree split’s quality:
$\mathrm{Gini}(D) = 1 - \sum_{i=1}^{C} p_i^2$
In the dataset D, p i is the probability of class i, and C is the number of classes in the dataset.

Mathematical Representation

Let the feature set be denoted by X and the target variable by Y, where Y { 0 , 1 } . The definition is such that 0 denotes no cancer, and 1 denotes the presence of cancer. Thus, the decision function is defined as
$f(X) = \begin{cases} 0 & \text{if } X \in R_0 \\ 1 & \text{if } X \in R_1 \end{cases}$
where R 0 and R 1 correspond to the regions representing class 0 and class 1, respectively.
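The Gini impurity used to choose splits can be checked with a few lines of plain Python; the label lists are illustrative.

```python
def gini(labels):
    """Gini(D) = 1 - sum_i p_i^2 over the classes present in D."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

print(gini([0, 0, 0, 0]))            # pure node -> 0.0
print(gini([0, 1, 0, 1]))            # 50/50 split -> 0.5
print(round(gini([0, 0, 0, 1]), 3))  # 1 - (0.75^2 + 0.25^2) = 0.375
```

A split is chosen to minimize the weighted Gini impurity of the resulting child nodes.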

2.12.5. Random Forest

This algorithm is applied as a form of ensemble learning, suitable for application in medical diagnostic purposes, such as predicting lung cancer. RF has the flexibility to support variables with both qualitative and quantitative types and scales. Therefore, it can readily be utilized with a binary version of the lung cancer dataset, using 0 to denote the lack of the disease and 1 to denote cancer [36].
The Gini impurity measures the impurity of the decision nodes of the trees in a Random Forest:
$\mathrm{Gini}(D) = 1 - \sum_{i=1}^{C} p_i^2$
where $p_i$ is the probability of class i at node D.

2.12.6. Gradient Boosting

Boosting algorithms work by combining many simple models, called weak learners, which are only slightly better than guessing randomly. By combining these weak models step by step, they create a much stronger and more accurate model. Gradient Boosting is a boosting algorithm often used for regression problems. It improves the prediction model by adding minor corrections at each step, building the final solution as a weighted combination of these corrections [37].
$\hat{y}_i^{(M)} = \sum_{m=1}^{M} \lambda\, h_m(x_i)$
where $\hat{y}_i^{(M)}$ is the predicted output for the i-th data point after M iterations, $\lambda$ is a weight (the learning rate) that scales the contribution of each function, and $h_m(x_i)$ is the output of the m-th weak learner for the i-th input $x_i$.
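The staged procedure above can be sketched with scikit-learn's GradientBoostingClassifier, using the Table 3 settings; the synthetic data are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=12, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=42)

# Each of the 100 stages fits a depth-3 tree to the current
# residuals and adds it scaled by learning_rate (the lambda above).
gb = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, subsample=1.0
).fit(X_tr, y_tr)
print(round(gb.score(X_te, y_te), 2))
```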

2.12.7. Gaussian Naïve Bayes

When working with continuous data, it is often assumed that the values for each class follow a Gaussian (normal) distribution. To handle this, the training data are divided by class, and the mean and standard deviation for each class are calculated. By using this, the probabilities for the continuous data can be estimated using the following formula [38].
  • Bayes theorem
    Consider Y as the class variable (lung cancer status) and $X = (x_1, x_2, \dots, x_n)$ as the feature vector.
    $P(Y \mid X) = \dfrac{P(X \mid Y)\, P(Y)}{P(X)}$
  • Gaussian likelihood function
    With the naïve conditional-independence assumption, the likelihood $P(X \mid Y)$ factorizes as
    $P(X \mid Y = y) = \prod_{i=1}^{n} P(x_i \mid Y = y)$
    For Gaussian-distributed features, $P(x_i \mid Y = y)$ follows the Gaussian probability density function:
    $P(x_i \mid Y = y) = \dfrac{1}{\sqrt{2\pi\sigma_y^2}} \exp\!\left( -\dfrac{(x_i - \mu_y)^2}{2\sigma_y^2} \right)$
  • Class-conditional probability
    The posterior probability is therefore proportional to
    $P(Y = y \mid X) \propto P(Y = y) \prod_{i=1}^{n} \dfrac{1}{\sqrt{2\pi\sigma_y^2}} \exp\!\left( -\dfrac{(x_i - \mu_y)^2}{2\sigma_y^2} \right)$
  • Prediction
    The predicted class $\hat{y}$ is obtained by maximizing the posterior probability:
    $\hat{y} = \operatorname{argmax}_{y}\, P(Y = y) \prod_{i=1}^{n} P(x_i \mid Y = y)$
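The prediction rule above can be sketched with scikit-learn's GaussianNB; the two one-dimensional clusters are illustrative.

```python
from sklearn.naive_bayes import GaussianNB

# One feature, two clearly separated classes (around 1 and 5).
X = [[1.0], [1.2], [0.8], [5.0], [5.2], [4.8]]
y = [0, 0, 0, 1, 1, 1]

# fit() estimates a per-class mean and variance for each feature;
# predict() maximizes the posterior shown above.
gnb = GaussianNB(var_smoothing=1e-9).fit(X, y)
print(gnb.predict([[1.1], [4.9]]).tolist())   # [0, 1]
print(gnb.theta_.ravel().tolist())            # per-class means, near 1 and 5
```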

2.12.8. Multinomial Naïve Bayes

Multinomial naïve Bayes is commonly used for discrete feature counts, such as word counts in text classification or event counts in health datasets. While it is not the primary method for lung cancer prediction, it may still be applicable in specific contexts [39].

Mathematical Formulation

The multinomial naïve Bayes classifier gives the probability of a class Y = y given a feature vector $X = (x_1, x_2, \dots, x_n)$ as
$P(Y = y \mid X) \propto P(Y = y) \prod_{i=1}^{n} P(x_i \mid Y = y)^{x_i}$

2.12.9. Multilayer Perceptron

Artificial neural networks (ANNs) are inspired by how the human brain uses multiple layers. ANNs can learn patterns and relationships in data by training on examples, allowing them to generalize and make predictions for new situations. One of the most widely used types of ANN is multilayer perceptron (MLP), which is a powerful tool for modeling complex relationships. MLPs create nonlinear models, enabling them to predict outputs based on given inputs [40].
Lung cancer prediction is typically a binary classification task (predicting whether or not a patient has lung cancer). Since this study uses tabular data, an MLP is a suitable choice. An MLP consists of multiple layers of neurons; the layers involved in the lung cancer prediction model are as follows:

Input Layer

The input layer of the multilayer perceptron (MLP) represents the features (columns) of the dataset. Our dataset uses 12 selected features, such as age, alcohol consumption, chest pain, and chronic disease, so the input layer has 12 neurons, each corresponding to one feature.

Hidden Layer

Hidden layers are responsible for learning the underlying patterns and relationships in the data. Here, the network has a single hidden layer of 100 neurons, as listed in Table 3. The sigmoid activation function was chosen because it suits binary classification tasks, improving the network’s ability to learn effectively.
Table 3. Applied optimal tuning parameters for individual models.
Model | Hyperparameters
Logistic regression | L2 penalty, C = 1.0, solver = ‘lbfgs’, max_iter = 100
K-nearest neighbors (KNN) | n_neighbors = 5, weights = ‘uniform’, algorithm = ‘auto’, leaf_size = 30
Support vector machine (SVM) | kernel = ‘rbf’, gamma = ‘auto’, C = 1.0, degree = 3
Decision tree | criterion = ‘gini’, max_depth = None, min_samples_split = 2
Random Forest | n_estimators = 100, criterion = ‘gini’, max_depth = None, bootstrap = True
Multilayer perceptron (MLP) | hidden_layer_sizes = (100,), activation = ‘sigmoid’, solver = ‘adam’, alpha = 0.0001
Gradient Boosting | n_estimators = 100, learning_rate = 0.1, max_depth = 3, subsample = 1.0
Multinomial naïve Bayes | alpha = 1.0
Gaussian naïve Bayes | var_smoothing = 1e-9
Hybrid model | “n_estimators”: [50, 100, 200], “learning_rate”: [0.01, 0.1, 0.5, 1.0], “algorithm”: [“SAMME”, “SAMME.R”]

Output Layer

The output layer determines the prediction. For binary classification, such as predicting the presence or absence of lung cancer, the output layer consists of a single neuron with a sigmoid activation function, which outputs a probability between 0 and 1.
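The full MLP described above can be sketched with scikit-learn's MLPClassifier on synthetic data. Note that scikit-learn names the sigmoid activation 'logistic'; `max_iter` is raised here only so the sketch converges.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, n_features=12, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=42)

# One hidden layer of 100 neurons, per Table 3.
mlp = MLPClassifier(hidden_layer_sizes=(100,), activation="logistic",
                    solver="adam", alpha=0.0001, max_iter=1000,
                    random_state=42).fit(X_tr, y_tr)

# The single sigmoid output neuron yields a probability in [0, 1].
proba = mlp.predict_proba(X_te)[:, 1]
print(proba.min() >= 0.0 and proba.max() <= 1.0)   # True
```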

3. Results

The preliminary data analysis categorized cases into positive and negative lung cancer diagnoses, revealing 238 positive and 38 negative cases.
Figure 2 illustrates the correlation values between variables, where positive values near 1 denote strong positive correlations, and negative values near −1 indicate strong negative correlations. Moderate positive correlations include anxiety and swallowing difficulty (0.48) and alcohol consumption and allergy (0.38). Negative correlations include alcohol consumption and yellow fingers (−0.27) and anxiety and coughing (−0.22). Insignificant correlations, such as alcohol consumption and swallowing difficulty (−0.00063), suggest no meaningful relationship.
Table 4 presents model evaluation metrics prior to cross-validation. SVM, Random Forest, and Gradient Boosting exhibited the highest accuracy (0.98), with KNN achieving the highest precision (1.00) but lower recall (0.91). For recall, SVM, Random Forest, and Gradient Boosting outperformed the rest with 0.98. The F1 score reaffirmed their robustness (0.98), ensuring balanced precision and recall. AUC analysis highlighted Random Forest and decision tree as the most discriminative (0.96), while multinomial naïve Bayes underperformed, having the lowest accuracy (0.81) and AUC (0.74).
Figure 3 and Figure 4 provide a comparative visualization of model performance across various metrics. Specifically, Figure 4 highlights the ROC curves of nine different algorithms. The ROC (Receiver Operating Characteristic) curve helps evaluate classifier performance by illustrating the trade-off between the true positive and false positive rates at various threshold levels.
Figure 5 compares the accuracy of the existing and proposed machine learning models for lung cancer prediction; the proposed enhancements significantly outperformed the existing models, demonstrating the effectiveness of the refined methodologies. Table 5 presents a comprehensive accuracy comparison across the ML and DL models, showing that the proposed methods outperform existing approaches.
Logistic regression showed an accuracy improvement of 5.58% (87.50% to 93.08%), and KNN improved by 0.44% (92.86% to 93.30%). The SVC, which is suited to high-dimensional classification, showed a significant improvement of 9.48% (85.71% to 95.19%). The decision tree model, typically valued for interpretability, improved by 5.47% (89.29% to 94.76%).
The ensemble methods produced significant gains. Random Forest, which uses multiple decision trees to reduce overfitting, achieved a 9.69% gain (85.71% to 95.40%). Gradient Boosting, which iteratively refines predictions through boosting, achieved a 6.32% gain (89.29% to 95.61%). MLP, a deep learning technique that captures intricate feature interactions, achieved a gain of 4.84%, from 89.29% to 94.13%.
Gaussian naïve Bayes reached 88.46%, below the previously reported 91.07%, while multinomial naïve Bayes declined to 74.84%.
The hybrid model was developed by first performing cross-validation on the Gaussian naïve Bayes (GNB) model, followed by integrating the AdaBoost algorithm, a boosting technique, to enhance performance. In this study, the proposed hybrid model achieved an accuracy of 96.42%, as shown in Table 5.
These findings support a valid comparison between the proposed and previous machine learning and deep learning methods. The hyperparameters detailed in Table 3 were used to achieve optimal performance in lung cancer prediction. These results are useful to the ML research community and to practitioners seeking better classification performance through optimization strategies. We conclude that the Gradient Boosting and Gaussian naïve Bayes models performed better on this problem given the smaller dataset and binary features; such models are well suited to situations where the dataset’s attributes are largely independent of each other. In contrast, the performance of the other models was limited by attribute correlations and the training/testing splits.

4. Discussion

This study highlights the significant role of machine learning and deep learning models in enhancing lung cancer prediction. By combining traditional ML classifiers with deep learning-based feature extraction and optimizing model parameters, the predictive performance of the proposed approach showed notable improvements. The hybrid model, which integrates Gaussian naïve Bayes with the AdaBoost algorithm, achieved a high accuracy of 96.42%, surpassing conventional models by addressing key limitations such as feature independence assumptions in naïve Bayes classifiers. Compared to existing research, many previous studies have utilized individual ML models like support vector machines (SVMs), decision trees, and naïve Bayes for lung cancer prediction. For instance, Maurya et al. (2024) reported that the accuracy of Random Forest and Gradient Boosting models was 85.71% and 89.29%, respectively [41]. Our findings indicate substantial advancements, with Random Forest achieving 95.40% and Gradient Boosting reaching 95.61%, confirming the effectiveness of hyperparameter tuning and the application of data augmentation techniques, such as Adaptive Synthetic Sampling (ADASYN), in improving model performance.
Furthermore, Shah et al. (2023) emphasized the necessity of advanced feature selection and denoising techniques to improve ML-based diagnostic accuracy [10]. Our study aligns with this by employing label encoding and feature selection, resulting in better data representation and minimizing the impact of irrelevant variables. Li et al. (2022) also demonstrated the advantage of ensemble learning techniques like XGBoost in cancer prediction [15], a conclusion consistent with our results, where Gradient Boosting emerged as one of the highest-performing classifiers. Deep learning approaches have gained attention in lung cancer diagnostics, particularly in medical imaging applications. Gayap et al. (2024) demonstrated that CNN-based models outperform conventional ML models for CT scan-based lung cancer detection [11]. While our research focused on structured textual clinical data rather than image-based datasets, incorporating a multilayer perceptron (MLP) into our model significantly improved classification accuracy from 89.29% to 94.13%. This finding suggests that DL models can effectively capture complex, nonlinear patterns even in non-imaging datasets, enhancing diagnostic precision.
Moreover, our study recorded an accuracy of 74.84% for multinomial naïve Bayes, indicating that preprocessing improvements partially mitigated the model’s inherent feature independence constraints. The performance of MLP also improved from 89.29% to 94.13%, likely due to optimized activation functions, deeper network architectures, and regularization techniques, such as dropout. These results reinforce the notion that deep learning, combined with practical hyperparameter tuning and preprocessing, can yield superior outcomes compared to traditional statistical classifiers.
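A hedged sketch of a regularized MLP configuration follows. Note that scikit-learn's `MLPClassifier` provides L2 weight decay (`alpha`) and early stopping rather than dropout, which would require a deep learning framework; all settings here are illustrative, not the study's exact configuration.

```python
# Sketch of a regularized MLP on a synthetic stand-in dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, n_features=12, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

mlp = MLPClassifier(
    hidden_layer_sizes=(64, 32),  # deeper architecture
    activation="relu",            # common activation choice
    alpha=1e-3,                   # L2 regularization strength
    early_stopping=True,          # holds out 10% of training data
    max_iter=500,
    random_state=1,
)
mlp.fit(X_tr, y_tr)
acc = mlp.score(X_te, y_te)
print(f"test accuracy: {acc:.4f}")
```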
The findings of this research further validate the capability of ML models to facilitate early detection of lung cancer and support clinical decision-making by identifying high-risk patients based on symptom-based data. The proposed hybrid model, with an accuracy of 96.42%, suggests that ensemble learning techniques can significantly enhance prediction reliability. The ability of ML algorithms to analyze large-scale patient datasets and detect early-stage indicators of lung cancer can transform current diagnostic methods, leading to more personalized and timely treatment interventions.

5. Conclusions

The lungs are the primary organs of respiration, providing a constant supply of oxygen to the blood, without which a human cannot survive. Lung cancer is the leading cause of cancer-related deaths in both males and females, with most fatalities attributable to the advanced disease stage at detection. Early diagnosis significantly increases life expectancy. In this study, supervised learning was explored to train a model for diagnosing lung cancer based on features extracted from symptom-related data. The models were compared using accuracy, precision, recall, F1-score, and AUC as performance criteria for logistic regression (LR), k-nearest neighbors (KNN), support vector machine (SVM), Gradient Boosting (GB), Gaussian naïve Bayes (GNB), multinomial naïve Bayes (MNB), decision tree (DT), Random Forest (RF), and multilayer perceptron (MLP). The experimental results obtained through 10-fold cross-validation revealed that Gradient Boosting outperformed the other individual models, achieving 95.61% for accuracy, precision, recall, and F1-score, along with an AUC of 98%. After a detailed evaluation of nine machine learning algorithms, both Gradient Boosting (95.61% accuracy) and the hybrid model (96.42% accuracy) were identified as the most suitable predictive models for lung cancer diagnosis.
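The 10-fold evaluation protocol with the five reported metrics can be sketched as follows, using a synthetic dataset and default model settings as placeholders for the study's configuration:

```python
# Sketch of the 10-fold cross-validation protocol with the five metrics
# reported in this study (accuracy, precision, recall, F1, AUC).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=300, n_features=12, random_state=7)
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]

cv_results = cross_validate(
    GradientBoostingClassifier(random_state=7), X, y, cv=10, scoring=scoring
)
for metric in scoring:
    vals = cv_results[f"test_{metric}"]
    print(f"{metric:10s} {vals.mean():.4f}")
```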

6. Limitations and Future Scope

This study highlights the potential of various machine learning algorithms for analyzing textual clinical data for the early detection of lung cancer. However, it is essential to note that the study was conducted on a relatively small dataset that relies primarily on patient-reported symptoms. Future research should apply these techniques to larger datasets to assess variability in algorithm performance. Additionally, employing a more comprehensive and authentic dataset containing at least 16 essential parameters would enhance the robustness of classification, as the current approach is based on symptoms and behavioral attributes. Moreover, advanced machine learning models, such as multilayer perceptrons (MLPs), warrant further exploration on larger datasets. Observational insights from this study suggest that male patients with a history of alcohol consumption, chest pain, and allergies may have a higher likelihood of developing lung cancer. However, these findings require expert validation, which could lead to the establishment of a weighting system for specific attributes in lung cancer detection, thereby improving diagnostic accuracy.
The integration of Electronic Health Records (EHRs) can play a crucial role in the early detection of lung cancer by leveraging clinical data to identify patterns among patients. This approach enables a comparative analysis of relevant attributes and allows for a more reliable assessment of the Area Under the Curve (AUC) achieved by different machine learning models. Further studies are needed to validate the model adoption process and refine classification strategies for improved real-world applications. In hospitals with limited resources, EHR-based automated risk assessment tools can assist doctors in decision-making by flagging high-risk lung cancer cases based on patients’ medical history. However, further validation through clinical trials and real-world deployments is necessary to assess the feasibility and ethical considerations of such implementations.

Author Contributions

P.M.G., H.K. and M.M.J.; Methodology and Validation, P.M.G., H.K. and M.M.J.; Writing—original draft, P.K.; Supervision, P.K. and P.V. All authors have read and agreed to the published version of the manuscript.

Funding

Open access funding was provided by Datta Meghe Institute of Higher Education and Research. No external funding was involved in preparing the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ramniwas, S.; Chakrabarti, S.; Kumawat, R.; Sen, M.K.; Gupta, A.; Bhattacharya, D.; Gupta, N.K.; Suri, J.C. Clinico-pathological profile of lung cancer patients in a tertiary care hospital, India: A prospective, cross-sectional study. Indian J. Chest Dis. Allied Sci. 2022, 64, 75–81. [Google Scholar]
  2. Jena, D.; Padhi, B.K.; Zahiruddin, Q.S.; Ballal, S.; Kumar, S.; Bhat, M.; Sharma, S.; Kumar, M.R.; Rustagi, S.; Gaidhane, A.M.; et al. Estimation of burden of cancer incidence and mortality in India: Based on global burden of disease study 1990–2021. BMC Cancer 2024, 24, 1278. [Google Scholar] [CrossRef] [PubMed]
  3. Behera, D.; Balamugesh, T. Lung cancer in India. Indian J. Chest Dis. Allied Sci. 2012, 46, 269–281. [Google Scholar]
  4. Goswami, S.; Adhikary, S.; Bhattacharya, S.; Agarwal, R.; Ganguly, A.; Nanda, S.; Rajak, P. The alarming link between environmental microplastics and health hazards with special emphasis on cancer. Life Sci. 2024, 355, 122937. [Google Scholar] [CrossRef]
  5. Noronha, V.; Dikshit, R.; Raut, N.; Joshi, A.; Pramesh, C.S.; George, K.; Agarwal, J.P.; Munshi, A.; Prabhash, K. Epidemiology of lung cancer in India: Focus on the differences between non-smokers and smokers: A single-centre experience. Indian J. Cancer 2012, 49, 74–81. [Google Scholar] [PubMed]
  6. Malik, P.S.; Raina, V. Lung cancer: Prevalent trends and emerging concepts. Indian J. Med. Res. 2015, 141, 5–7. [Google Scholar]
  7. Mohan, A.; Garg, A.; Gupta, A.; Sahu, S.; Choudhari, C.; Vashistha, V.; Ansari, A.; Pandey, R.; Bhalla, A.S.; Madan, K.; et al. Clinical profile of lung cancer in North India: A 10-year analysis of 1862 patients from a tertiary care center. Lung India 2020, 37, 190–197. [Google Scholar] [CrossRef]
  8. Kadir, T.; Gleeson, F. Lung cancer prediction using machine learning and advanced imaging techniques. Transl. Lung Cancer Res. 2018, 7, 304. [Google Scholar] [CrossRef]
  9. Kaasa, S.; Loge, J.H.; Aapro, M.; Albreht, T.; Anderson, R.; Bruera, E.; Brunelli, C.; Caraceni, A.; Cervantes, A.; Currow, D.C.; et al. Integration of oncology and palliative care: A Lancet Oncology Commission. Lancet Oncol. 2018, 19, e588–e653. [Google Scholar] [CrossRef]
  10. Shah, S.N.A.; Parveen, R. An extensive review on lung cancer diagnosis using machine learning techniques on radiological data: State-of-the-art and perspectives. Arch. Comput. Methods Eng. 2023, 30, 4917–4930. [Google Scholar] [CrossRef]
  11. Gayap, H.T.; Akhloufi, M.A. Deep machine learning for medical diagnosis, application to lung cancer detection: A review. BioMedInformatics 2024, 4, 236–284. [Google Scholar] [CrossRef]
  12. Didier, A.J.; Nigro, A.; Noori, Z.; Omballi, M.A.; Pappada, S.M.; Hamouda, D.M. Application of machine learning for lung cancer survival prognostication—A systematic review and meta-analysis. Front. Artif. Intell. 2024, 7, 1365777. [Google Scholar]
  13. Raoof, S.S.; Jabbar, M.A.; Fathima, S.A. Lung Cancer prediction using machine learning: A comprehensive approach. In Proceedings of the 2020 2nd International Conference on Innovations in Mechanical and Industrial Applications (ICIMIA), Bengaluru, India, 27–29 February 2020; pp. 108–115. [Google Scholar]
  14. Javed, R.; Abbas, T.; Khan, A.H.; Daud, A.; Bukhari, A.; Alharbey, R. Deep learning for lung cancer detection: A review. Artif. Intell. Rev. 2024, 57, 197. [Google Scholar]
  15. Li, Y.; Wu, X.; Yang, P.; Jiang, G.; Luo, Y. Machine learning for lung cancer diagnosis, treatment, and prognosis. Genom. Proteom. Bioinform. 2022, 20, 850–866. Available online: https://pubmed.ncbi.nlm.nih.gov/38662768/ (accessed on 7 December 2024).
  16. Dodia, S.; Annappa, B.; Mahesh, P.A. Recent advancements in deep learning based lung cancer detection: A systematic review. Eng. Appl. Artif. Intell. 2022, 116, 105490. [Google Scholar]
  17. Huang, S.; Arpaci, I.; Al-Emran, M.; Kılıçarslan, S.; Al-Sharafi, M.A. A comparative analysis of classical machine learning and deep learning techniques for predicting lung cancer survivability. Multimed. Tools Appl. 2023, 82, 34183–34198. [Google Scholar]
  18. Lung Cancer. Available online: https://www.kaggle.com/datasets/sanjoli02/lung-cancer (accessed on 9 November 2024).
  19. García, S.; Ramírez-Gallego, S.; Luengo, J.; Benítez, J.M.; Herrera, F. Big data preprocessing: Methods and prospects. Big Data Anal. 2016, 1, 9. [Google Scholar] [CrossRef]
  20. Gosain, A.; Sardana, S. Handling class imbalance problem using oversampling techniques: A review. In Proceedings of the 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Udupi, India, 13–16 September 2017; pp. 79–85. [Google Scholar]
  21. Olisah, C.C.; Smith, L.; Smith, M. Diabetes mellitus prediction and diagnosis from a data preprocessing and machine learning perspective. Comput. Methods Programs Biomed. 2022, 220, 106773. [Google Scholar]
  22. Kasongo, M.K.D.; Joe, I. A deep-learned embedding technique for categorical features encoding. IEEE Access 2021, 9, 114381–114391. [Google Scholar]
  23. Famili, A.; Shen, W.M.; Weber, R.; Simoudis, E. Data preprocessing and intelligent data analysis. Intell. Data Anal. 1997, 1, 3–23. [Google Scholar]
  24. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
  25. Bengio, Y. Learning deep architectures for AI. Found. Trends Mach. Learn. 2012, 2, 1–127. [Google Scholar] [CrossRef]
  26. Gal, M.S.; Rubinfeld, D.L. Data standardization. New York Univ. Law Rev. 2019, 94, 737–790. [Google Scholar] [CrossRef]
  27. Lapchak, P.A.; Zhang, J.H. Data standardization and quality management. Transl. Stroke Res. 2018, 9, 4–8. [Google Scholar] [CrossRef]
  28. Zhang, L.; Wen, J.; Li, Y.; Chen, J.; Ye, Y.; Fu, Y.; Livingood, W. A review of machine learning in building load prediction. Appl. Energy 2021, 285, 116452. [Google Scholar] [CrossRef]
  29. Nguyen, Q.H.; Ly, H.B.; Ho, L.S.; Al-Ansari, N.; Le, H.V.; Tran, V.Q.; Prakash, I.; Pham, B.T. Influence of data splitting on performance of machine learning models in prediction of shear strength of soil. Math. Probl. Eng. 2021, 2021, 4832864. [Google Scholar] [CrossRef]
  30. Zhang, X.; Liu, C.-A. Model averaging prediction by K-fold cross-validation. J. Econom. 2023, 235, 280–301. [Google Scholar] [CrossRef]
  31. Luque, A.; Carrasco, A.; Martín, A.; de Las Heras, A. The impact of class imbalance in classification performance metrics based on the binary confusion matrix. Pattern Recognit. 2019, 91, 216–231. [Google Scholar] [CrossRef]
  32. Alam, J.; Alam, S.; Hossan, A. Multi-stage lung cancer detection and prediction using multi-class SVM classifier. In Proceedings of the 2018 International Conference on Computer, Communication, Chemical, Material and Electronic Engineering (IC4ME2), Rajshahi, Bangladesh, 8–9 February 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–4. [Google Scholar]
  33. Tirzïte, M.; Bukovskis, M.; Strazda, G.; Jurka, N.; Taivans, I. Detection of lung cancer with electronic nose and logistic regression analysis. J. Breath Res. 2018, 13, 016006. [Google Scholar] [CrossRef]
  34. Steinbach, M.; Tan, P.-N. kNN: K-nearest neighbors. In The Top Ten Algorithms in Data Mining; Chapman and Hall/CRC: Boca Raton, FL, USA, 2009; pp. 165–176. [Google Scholar]
  35. Kim, T.W.; Koh, D.H.; Park, C.Y. Decision tree of occupational lung cancer using classification and regression analysis. Saf. Health Work 2010, 1, 140–148. [Google Scholar] [CrossRef]
  36. Lavanya, C.; Pooja, S.; Kashyap, A.H.; Rahaman, A.; Niranjan, S.; Niranjan, V. Novel biomarker prediction for lung cancer using random forest classifiers. Cancer Inform. 2023, 22, 11769351231167992. [Google Scholar]
  37. Friedman, J.H. Greedy Function Approximation: A Gradient Boosting Machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar]
  38. Pradeep, K.R.; Naveen, N.C. Lung cancer survivability prediction based on performance using classification techniques of support vector machines, C4.5 and Naive Bayes algorithms for healthcare analytics. Procedia Comput. Sci. 2018, 132, 412–420. [Google Scholar]
  39. Radhika, P.R.; Nair, R.A.; Veena, G. A comparative study of lung cancer detection using machine learning algorithms. In Proceedings of the 2019 IEEE International Conference on Electrical, Computer and Communication Technologies (ICECCT), Coimbatore, India, 20–22 February 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–4. [Google Scholar]
  40. Taud, H.; Mas, J.-F. Multilayer Perceptron (MLP). In Geomatic Approaches for Modeling Land Change Scenarios; Springer: Cham, Switzerland, 2018; pp. 451–455. [Google Scholar]
  41. Maurya, S.P.; Sisodia, P.S.; Mishra, R.; Singh, D.P. Performance of machine learning algorithms for lung cancer prediction: A comparative approach. Sci. Rep. 2024, 14, 18562. [Google Scholar]
Figure 1. Schematic representation of the proposed methodology.
Figure 2. Correlation matrix for analysis.
Figure 3. Model performance for various machine learning models.
Figure 4. ROC of the proposed ML and DL models.
Figure 5. A comparative analysis between the existing and proposed models.
Table 1. Dataset description.
Feature No. | Feature Name | Description
01 | Yellow Finger | Number of people with yellow fingers
02 | Anxiety | Number of people who have anxiety
03 | Peer Pressure | Number of people influenced by peer pressure
04 | Chronic Disease | Number of people with a chronic disease
05 | Fatigue | Number of people experiencing fatigue
06 | Allergy | Number of people with allergies
07 | Age | Age of the person
08 | Alcohol Consuming | Number of people consuming alcohol
09 | Coughing | Number of people experiencing coughing
10 | Swallowing Difficulty | Number of people with swallowing difficulty
11 | Chest Pain | Number of people experiencing chest pain
12 | Lung Cancer | Number of people who have lung cancer
Table 2. Number of missing values per feature.
Feature No. | Feature Name | Number of Missing Values
01 | Yellow Finger | 0
02 | Anxiety | 0
03 | Peer Pressure | 0
04 | Chronic Disease | 0
05 | Fatigue | 0
06 | Allergy | 0
07 | Age | 0
08 | Alcohol Consuming | 0
09 | Coughing | 0
10 | Swallowing Difficulty | 0
11 | Chest Pain | 0
12 | Lung Cancer | 0
Table 4. Performance metrics of the proposed models for lung cancer detection.
Models | Accuracy | Precision | Recall | F1 Score | AUC
Logistic regression | 0.97 | 0.98 | 0.97 | 0.97 | 0.93
KNN | 0.96 | 1.00 | 0.91 | 0.95 | 0.93
SVM | 0.98 | 0.98 | 0.98 | 0.98 | 0.95
Random Forest | 0.98 | 0.98 | 0.98 | 0.98 | 0.96
Multilayer perceptron | 0.97 | 0.98 | 0.95 | 0.96 | 0.94
Gradient Boosting | 0.98 | 0.98 | 0.98 | 0.98 | 0.95
Multinomial naïve Bayes | 0.81 | 0.75 | 0.89 | 0.81 | 0.74
Gaussian naïve Bayes (Hybrid model) | 0.92 | 0.88 | 0.95 | 0.91 | 0.89
Decision tree | 0.94 | 0.96 | 0.91 | 0.94 | 0.96
Table 5. Comparison of previous and proposed models for lung cancer detection.
Model Name | Previous Accuracy % (Maurya et al. [41]) | Proposed Accuracy %
Logistic regression | 87.50 | 93.08
KNN | 92.86 | 93.30
Support vector machines | 85.71 | 95.19
Decision tree | 89.29 | 94.76
Random Forest | 85.71 | 95.40
Gaussian naïve Bayes | 91.07 | 88.46
Gradient Boosting | 89.29 | 95.61
Multilayer perceptron | 89.29 | 94.13
Multinomial naïve Bayes | - | 74.84
Hybrid model | - | 96.42
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Gote, P.M.; Kumar, P.; Kumar, H.; Verma, P.; Jiet, M.M. Integrating Machine Learning Algorithms: A Hybrid Model for Lung Cancer Outcome Improvement. Appl. Sci. 2025, 15, 4637. https://doi.org/10.3390/app15094637
