1. Introduction
According to the global cancer statistics from the International Agency for Research on Cancer, lung cancer is one of the most prevalent cancers worldwide. There were approximately 2.5 million new lung cancer cases in 2022, accounting for 12.4% of all new cancer cases. The number of lung cancer deaths in 2022 was approximately 1.8 million, representing 18.7% of total cancer fatalities and establishing it as a leading cause of cancer-related death [
1]. Lung cancer primarily consists of two types: non-small cell lung cancer (NSCLC) and small cell lung cancer. NSCLC represents more than 85% of all lung cancer cases, with its primary histological subtypes being lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC). Seventy percent of lung cancer cases are diagnosed at advanced stages, and treatment options are often limited [
2]. Therefore, accurately determining lung cancer subtypes and identifying high-risk patients can facilitate individualized treatment and follow-up. Histopathological examination is the gold standard for identifying the subtypes of lung cancer, but it relies on the experience of pathologists and is time-consuming [
3,
4]. Moreover, in predicting the survival of patients with lung cancer, common linear methods such as nomograms may fail to capture intricate nonlinear relationships in high-dimensional datasets [
5,
6].
A growing number of studies have attempted to apply machine- and deep-learning models for subtype classification and survival prediction in patients with NSCLC [
7]. For subtype classification, for example, Han et al. collected positron emission tomography (PET)/computed tomography (CT) images of 867 and 552 patients with LUAD and LUSC, respectively, from Peking University Cancer Hospital. They used VGG16 for classification based on first-order intensity statistics and texture features, achieving an accuracy of 84.1% and an area under the curve (AUC) of 0.903 [
8]. Kriegsmann et al. collected image data of 499 LUAD and 440 LUSC cases from Heidelberg University Thoracic Hospital. Utilizing imaging features and a random forest (RF) model for classification, they obtained an accuracy of 90.6% [
9]. However, several studies had limitations in their evaluation methodology. Some split different images from the same patient between the training and testing sets, so that highly similar data were used for both training and testing, which may have inflated the reported accuracy [
10]; in other studies, the training and testing images were derived from a common set of tissue slices, from which additional images were generated using different post-processing methods [
11]. For example, in one study, the dataset was expanded using techniques such as rotation and flipping before it was split into training, validation, and testing sets [
12]. Furthermore, some studies utilized datasets with limited sample sizes of fewer than 350 patients [
13,
14,
15,
16,
17]. Other studies had adequate sample sizes, but model performance was compromised, with AUC values under 0.9, limiting the models’ practical application. For example, Hyun et al. collected data from 210 LUAD and 186 LUSC cases from the Samsung Medical Center in Korea and used a logistic regression model to classify subtypes based on PET/CT radiomics and clinical features, including age, sex, tumor size, and smoking status, achieving an accuracy of 76.9% and an AUC of 0.859 [
18]. Song et al. downloaded the CT images of 700 lung cancer cases (including 496 LUAD and 204 LUSC cases) from The Cancer Imaging Archive (TCIA) platform and used the bagging–Adaptive Boosting (AdaBoost)–support vector machine approach for subtype classification based on CT intensity, texture, and filtered image features, achieving an AUC of 0.823 [
19].
For overall survival (OS) prediction, some studies have likewise been limited by small datasets [
20,
21,
22,
23]. For example, He et al. downloaded only 186 CT images of NSCLC cases from the TCIA platform and predicted survival status using intensity, shape, texture, and wavelet features in combination with an RF model, achieving an AUC of 0.9296 [
20]. Chaddad et al. downloaded the data of 315 NSCLC cases from the TCIA platform, and they used CT image features, demographic data, and tumor, node, metastasis (TNM) staging variables to predict whether the patients’ survival would be above, equal to, or below the median survival time with an RF model, achieving an AUC of 0.7617 [
22]. In contrast, while some studies employed a sufficient number of samples, their performance remained compromised, with AUC values under 0.75 or even under 0.7 [
24,
25,
26]. Thus, a more effective model is needed.
Changes in the distribution, appearance, size, morphology, and arrangement of the nuclei of cancer cells have been shown to predict cancer aggressiveness. Petersen et al. showed that different lung cancer types had different nuclear sizes. In NSCLC, especially LUAD, the size of nuclei correlated significantly with the grading and survival of patients [
27]. In addition, Sigel et al. suggested that several cytomorphologic features, such as nuclear size, chromatin pattern, and nuclear contours, could be used in a scoring system as they correlated well with histologic grade and prognosis [
28]. However, their study only showed the correlation without proposing a model. Based on the attributes of individual nuclei (e.g., shape, size, and texture), Lu et al. classified the long-term versus short-term survival of patients with early-stage NSCLC, yielding a mean AUC of 0.68 in the training cohort [
29]. The utilization of nuclear features to predict the OS of patients with lung cancer remains limited.
To establish a suitable model with optimal performance in lung cancer subtype classification and OS prediction, we constructed a large dataset consisting of 1252 LUAD and LUSC cases, integrating nuclear, clinical, and genetic features for a total of 95 variables. Uncorrelated factors were excluded using Pearson’s correlation coefficient (PCC) analysis. To compare the performance of different models, we employed four machine-learning models (light gradient boosting machine [LightGBM], extreme gradient boosting [XGBoost], RF, and AdaBoost) and three deep-learning models (multilayer perceptron [MLP], TabNet, and convolutional neural network [CNN]) for subtype classification as well as prediction of survival at 1, 2, and 3 years among patients with LUAD and LUSC. In addition, we assessed the performance of these models using metrics such as precision, accuracy, AUC, recall, F1-score, and learning duration. Through this multidimensional assessment, we aimed to identify an optimal model for subtype classification and OS prediction in patients with NSCLC.
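As an illustration of the PCC-based screening step, a minimal sketch (not the study’s actual code) is given below; it assumes the candidate variables are stored in a pandas DataFrame alongside a binary subtype label, and the significance threshold is an assumption rather than the study’s exact criterion.

```python
import pandas as pd
from scipy.stats import pearsonr

def pcc_filter(features: pd.DataFrame, target: pd.Series, alpha: float = 0.05) -> list:
    """Return the names of variables whose Pearson correlation with the
    target is statistically significant (alpha is an assumed threshold)."""
    kept = []
    for column in features.columns:
        r, p_value = pearsonr(features[column], target)
        if p_value < alpha:
            kept.append(column)
    return kept

# Hypothetical usage: `data` would hold the 95 candidate variables plus a
# binary subtype label (0 = LUAD, 1 = LUSC).
# data = pd.read_csv("nsclc_features.csv")
# selected = pcc_filter(data.drop(columns=["subtype"]), data["subtype"])
```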
4. Discussion
Our study employed seven models, namely four machine-learning models (LightGBM, XGBoost, RF, and AdaBoost) and three deep-learning models (MLP, TabNet, and CNN), to conduct an in-depth analysis of survival prediction and subtype classification in patients with NSCLC, aiming to provide scientific evidence for the diagnosis and treatment of lung cancer. To our knowledge, ours is the first study to incorporate the characteristics of nuclei together with the genetic information of patients to predict the subtypes and OS of patients with lung cancer. The combination of these different factors and the use of machine- and deep-learning methods increased the predictive accuracy compared with previous studies [
13,
14,
15,
16,
17,
18,
19,
22,
23]. Determining the subtypes of lung cancer in patients is important for selecting treatment options; meanwhile, more attention can be paid to patients who are screened as high-risk. Some subtype classification and survival prediction studies showed limitations related to insufficient sample sizes and compromised model accuracy [
13,
14,
15,
16,
17,
18,
19,
22,
23]. Some studies also relied only on data from a single institution, which further restricted the applicability of the models used [
8,
9,
13,
15,
16,
17]. The characteristics of images acquired at different medical centers can vary widely, and consequently, the generalization ability of a prediction model trained using data from only one center tends to be weak [
53]. Our research utilized the TCGA database, which contains data from multiple cancer research institutions, and we obtained data on 1252 patients with NSCLC (525 with LUAD and 727 with LUSC). The training, validation, and testing datasets were generated by splitting the TCGA IDs (i.e., patients) at a 60%:20%:20% ratio, so that all records from a given patient fell into a single dataset and the three datasets remained independent.
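The patient-level split can be expressed in a few lines of code; the sketch below is illustrative only (not the study’s implementation) and assumes a pandas DataFrame `records` in which each row is one nucleus-level record carrying a `tcga_id` column.

```python
import numpy as np
import pandas as pd

def split_by_patient(records: pd.DataFrame, id_col: str = "tcga_id",
                     ratios=(0.6, 0.2, 0.2), seed: int = 42):
    """Split records into training/validation/testing sets at the patient
    level, so that all records of one patient land in exactly one subset."""
    rng = np.random.default_rng(seed)
    ids = records[id_col].unique()
    rng.shuffle(ids)

    n_train = int(len(ids) * ratios[0])
    n_val = int(len(ids) * ratios[1])
    train_ids = set(ids[:n_train])
    val_ids = set(ids[n_train:n_train + n_val])

    train = records[records[id_col].isin(train_ids)]
    val = records[records[id_col].isin(val_ids)]
    test = records[~records[id_col].isin(train_ids | val_ids)]
    return train, val, test
```

Splitting on TCGA IDs rather than on individual records is what keeps correlated data from the same patient out of both the training and testing sets.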
The majority of previous studies focusing on subtype classification or survival prediction were based on PET/CT images. For example, Han et al. used VGG16 for classification based on the first-order intensity statistics and texture features of PET/CT images from Peking University Cancer Hospital, achieving an AUC of 0.903 [
8]. Bicakci et al. collected the PET/CT images of 94 lung cancer cases (including 38 LUAD and 56 LUSC cases) from Acıbadem Kayseri Hospital in Turkey and used the VGG19 model for classification, achieving an AUC of 0.69 [
13]. Marentakis et al. downloaded the preprocessed CT images of 102 lung cancer cases (including 48 LUAD and 54 LUSC cases) from the TCIA platform. They used a long short-term memory + inception model based on four groups of different CT radiomic features (statistical features of the first order, shape, texture features, and wavelet features) for lung cancer subtype classification, achieving an AUC of 0.78 and an accuracy of 74% [
14]. Bashir et al. collected the data from 64 LUAD and 42 LUSC cases from a local hospital in the UK and used an RF model to predict the subtypes on the basis of CT radiomics, nodule semantics, and background parenchymal features, achieving an AUC of 0.82 [
15]. Overall, the best AUC value for subtype classification was around 0.9. For OS prediction, Jha et al. collected CT imaging data of 200 NSCLC cases from the Tata Memorial Hospital in Mumbai, India. Based on radiomic features, they used an RF model to predict the 2-year survival rate, achieving an accuracy of 81% [
22]. Regarding image analysis, many previous studies manually delineated entire tumor regions in images [
14,
16,
19,
21,
22] or used H&E-stained tissue slice images [
11,
12]. Stained tissues in histological images contain not only nuclei but also various other structures, such as connective tissue and blood vessels. Using entire images may therefore neglect fine-grained nuclear detail or allow these structures to distract the model. Few studies have used nuclear features to classify lung cancer subtypes or predict OS. In a previous study, long-term versus short-term survival among patients with early-stage NSCLC was classified based on the spatial proximity and attributes (e.g., shape, size, and texture) of individual nuclei, yielding a mean AUC of 0.68 in the training cohort [
29]. Alsubaie et al. characterized the morphometric features of tumor nuclei and found that they correlated significantly with OS in LUAD; however, they did not propose a predictive model [
54]. Because the cell nucleus can provide key information for identifying the presence or the stage of disease [
55], we chose to focus on the nuclear features of lung cancer cells.
We downloaded SVS-format images from the TCGA database and used OpenSlide to divide them into 1024 × 1024 pixel tiles. A UNet model, trained with the location and shape information of lung cancer cell nuclei annotated in LabelMe, was then used to segment the cell nuclei in these tiles accurately and efficiently. Subsequent post-processing based on Otsu’s thresholding method was performed to eliminate fragmented regions and retain high-quality cell nucleus mask images. This automated segmentation approach improved processing efficiency and provided precise cell nucleus data for subsequent subtype classification and survival prediction. A total of 20 nuclear features were then extracted from the histological images: R_average, G_average, B_average, R_var, G_var, B_var, area, perimeter, circularity, compactness, eccentric, Hu[0], Hu[1], Hu[2], Hu[3], Hu[4], Hu[5], Hu[6], cD_average, and cD_var. PCC analysis was conducted to filter the features and retain only those relevant to the prediction targets. After the PCC analysis, 16 nuclear variables (area, perimeter, circularity, compactness, Hu[0], Hu[3], Hu[4], Hu[5], Hu[6], R_average, G_average, B_average, R_var, G_var, B_var, and cD_var) were found to be significantly correlated with the lung cancer subtype, implying that nuclear features can serve as indicators for subtype classification and, potentially, OS prediction. Almost all the nuclear features and the mRNA levels of the mutated genes were important predictors of subtype, indicating that the morphological characteristics of cancer cell nuclei and the mRNA levels differ between LUAD and LUSC samples. In addition, machine- and deep-learning models can reduce inconsistencies arising from pathologists’ subjective judgments, thus enhancing the reliability of diagnosis and assessment while reducing time and labor costs. By combining nuclear and mRNA features, we found that the XGBoost model performed best for subtype classification, achieving an accuracy of 94% and an AUC of 0.9821, a notable improvement over previous studies.
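To make the feature definitions concrete, the sketch below shows one way such per-nucleus features could be computed with OpenCV, NumPy, and PyWavelets from a binary nucleus mask and the corresponding RGB tile; it is illustrative only, and the exact formulas for circularity, compactness, and the wavelet detail coefficients (cD) used in the study may differ.

```python
import cv2
import numpy as np
import pywt

def nucleus_features(tile_rgb: np.ndarray, mask: np.ndarray) -> dict:
    """Compute illustrative per-nucleus features from an RGB tile (uint8)
    and a binary mask (255 inside the nucleus, 0 elsewhere)."""
    feats = {}

    # Color statistics inside the nucleus mask.
    inside = mask > 0
    for i, channel in enumerate(("R", "G", "B")):
        values = tile_rgb[..., i][inside]
        feats[f"{channel}_average"] = float(values.mean())
        feats[f"{channel}_var"] = float(values.var())

    # Shape features from the largest contour of the mask.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    contour = max(contours, key=cv2.contourArea)
    area = cv2.contourArea(contour)
    perimeter = cv2.arcLength(contour, True)
    feats["area"] = area
    feats["perimeter"] = perimeter
    # One common pair of definitions; the paper's formulas may differ.
    feats["circularity"] = 4.0 * np.pi * area / (perimeter ** 2)
    feats["compactness"] = (perimeter ** 2) / area

    # Eccentricity from a fitted ellipse (needs at least 5 contour points).
    if len(contour) >= 5:
        (_, _), (ax1, ax2), _ = cv2.fitEllipse(contour)
        major, minor = max(ax1, ax2), min(ax1, ax2)
        feats["eccentric"] = float(np.sqrt(1.0 - (minor / major) ** 2))

    # Seven Hu invariant moments of the nucleus contour.
    hu = cv2.HuMoments(cv2.moments(contour)).flatten()
    for i in range(7):
        feats[f"Hu[{i}]"] = float(hu[i])

    # Mean and variance of the diagonal detail coefficients (cD) of a
    # single-level 2D wavelet transform of the grayscale tile.
    gray = cv2.cvtColor(tile_rgb, cv2.COLOR_RGB2GRAY)
    _, (_, _, cD) = pywt.dwt2(gray.astype(float), "haar")
    feats["cD_average"] = float(cD.mean())
    feats["cD_var"] = float(cD.var())
    return feats
```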
OS is relatively complicated to predict, as it can be affected by various factors, such as the stage at diagnosis, the treatment options chosen, and the patient’s mental state and response to treatment. Some studies were based on PET/CT images [
22], some were based on mutational genes [
56], while others were merely based on the clinical characteristics of patients [
24]. For example, a study using the Surveillance, Epidemiology, and End Results (SEER) database only focused on age, sex, race, TNM stage, and the number of positive lymph nodes without incorporating genetic characteristics. The predictive AUC was only 0.744 [
24]. While some studies incorporated the genetic information of patients, the AUC values obtained were compromised. For example, Zhang et al.’s study [
25] achieved an AUC of only 0.67. Other studies did not provide an AUC value [
56]. In our study, we integrated clinical, nuclear, and genetic features, aiming to provide detailed structural and morphological information about the cancerous tissues and thereby support OS prediction to a large extent. After the PCC analysis, five nuclear variables (area, perimeter, R_var, G_var, and B_var) were found to be significantly correlated with the 1-year survival of patients with lung cancer; 15 nuclear variables (area, perimeter, Hu[0], Hu[2], Hu[3], Hu[4], Hu[5], Hu[6], R_average, G_average, B_average, R_var, G_var, B_var, and cD_var) were significantly correlated with 2-year survival; and 15 nuclear variables (circularity, compactness, Hu[0], Hu[1], Hu[2], Hu[3], Hu[4], Hu[5], Hu[6], R_average, G_average, B_average, R_var, G_var, and B_var) were significantly correlated with 3-year survival. In general, not only nuclear features but also clinical features and the SNV, CNV, and mRNA levels of the top 20 mutated genes are clinically important determinants for OS prediction. Regarding the methods for OS prediction, traditional linear methods such as nomograms can only address linear relationships, whereas machine- and deep-learning models can model nonlinear risk functions and capture complex features within high-dimensional datasets. We found that XGBoost also achieved the best overall performance for OS prediction across all the metrics: accuracy, AUC, precision, recall, F1-score, and learning duration. However, if only the AUC values were taken into consideration, RF performed best: its AUC values for predicting OS at 1, 2, and 3 years were 0.9134, 0.8706, and 0.8765, respectively, all above 0.87 (
Figure 9). These results demonstrate the value of incorporating nuclear characteristics along with the genetic information of patients. Our study thus provides a more accurate and comprehensive model for distinguishing NSCLC subtypes and predicting OS based on the morphometric features of tumor nuclei.
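For illustration, training and evaluating one of the tabular OS classifiers could take the following form; this is a minimal sketch (not the study’s code), assuming pre-split feature matrices and binary survival labels, and the hyperparameters shown are assumptions rather than the tuned values used in the study.

```python
import xgboost as xgb
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def train_and_evaluate(X_train, y_train, X_val, y_val, X_test, y_test):
    """Train an XGBoost classifier for binary survival status (e.g., alive
    vs. deceased at 1 year) and report the metrics used in this paper."""
    model = xgb.XGBClassifier(
        n_estimators=300,      # illustrative values, not the tuned ones
        max_depth=6,
        learning_rate=0.05,
        eval_metric="auc",
    )
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)

    prob = model.predict_proba(X_test)[:, 1]
    pred = (prob >= 0.5).astype(int)
    return {
        "accuracy": accuracy_score(y_test, pred),
        "AUC": roc_auc_score(y_test, prob),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "F1": f1_score(y_test, pred),
    }
```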
Our results show that machine learning has significant advantages over deep learning when dealing with structured data. First, machine-learning algorithms can effectively process structured tabular data. Our dataset consists of finely extracted clinical, image, and genetic features, and XGBoost, which is based on gradient-boosted trees, is particularly suitable for such structured data. In contrast, deep-learning models often rely more on features automatically extracted from raw data. Moreover, when handling tabular data, XGBoost typically outperforms deep-learning models such as TabNet [
57]. Second, machine learning performs better on small to medium-sized datasets. Because nuclei from the same patient are correlated, the effective sample size is 1252 patients rather than the total of 250,400 data records. Although deep learning performs outstandingly on large-scale data [
58], our dataset is relatively small at the patient level, which may not be sufficient to support complex deep-learning models. By contrast, machine-learning models exhibit stronger generalization ability on small to medium-sized datasets. We can further optimize the XGBoost and RF models by tuning hyperparameters and employing parallel computing techniques to improve training efficiency.
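One way to carry out such tuning while respecting the patient-level correlation noted above is a grouped cross-validated search; the sketch below is an assumed setup (not the study’s procedure) using scikit-learn’s GroupKFold so that records from one patient never appear in both the training and validation folds, with the parameter grid being purely illustrative.

```python
import xgboost as xgb
from sklearn.model_selection import GridSearchCV, GroupKFold

def tune_xgboost(X, y, patient_ids):
    """Grid-search a few XGBoost hyperparameters with patient-grouped CV;
    the grid and scoring choice are illustrative."""
    param_grid = {
        "max_depth": [4, 6, 8],
        "learning_rate": [0.01, 0.05, 0.1],
        "n_estimators": [200, 400],
    }
    search = GridSearchCV(
        estimator=xgb.XGBClassifier(eval_metric="auc", n_jobs=-1),  # parallel tree building
        param_grid=param_grid,
        scoring="roc_auc",
        cv=GroupKFold(n_splits=5),
        n_jobs=-1,  # evaluate grid points in parallel
    )
    search.fit(X, y, groups=patient_ids)
    return search.best_estimator_, search.best_params_
```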
Although our study showed improved AUC values for subtype classification and OS prediction in NSCLC cases, it still had some limitations. First, our dataset was drawn only from the TCGA database. Future studies could validate our model using other databases; however, the current open-source datasets, such as the TCIA and SEER databases, do not provide the complete genomic, imaging, and clinical data of patients. Future studies could therefore collaborate with local hospitals to collect complete datasets and further validate our model. Second, for feature selection, we only used histopathological images, clinical information, and genetic features. Future studies could incorporate other features, such as metabolomic, proteomic, and immunomic data, which may further enhance the predictive capabilities of the model. Third, in addition to PCC, Spearman’s rank correlation analysis and RF feature-importance assessment could be introduced to evaluate the importance of factors more comprehensively and to remove weakly correlated factors more accurately. Fourth, beyond the machine- and deep-learning models employed here, and with advancements in algorithms and computational power, future studies could introduce self-supervised learning [
59] to leverage unlabeled histopathological and genomic data for feature pretraining; reinforcement learning [
60] to capture long-term dynamic associations by designing a cumulative reward function for survival prediction; transformers [
61] to enhance the extraction of global contextual features from imaging and sequencing data; and graph neural networks [
62] to model complex interactions among genetic, clinical, and imaging variables. Additionally, using an optimized UNet++ model instead of the basic UNet model may further enhance segmentation accuracy and robustness. In the future, a Python-based toolkit could also be developed to establish a fully automated workflow, from inputting the sliced images and the clinical and genetic features of patients to outputting the subtype and OS predictions. Overall, future research should emphasize the integration of multidimensional data and explore more suitable machine- and deep-learning models, which may further improve the accuracy of OS prediction and subtype classification in patients with NSCLC and provide more reliable support for clinical decision-making.