*3.2. Prediction of Non-Small Cell Lung Cancer (NSCLC) Using Abnormal Methylation Levels of RUNX1*

Features for predicting NSCLC were selected using 42 tumors and their matched normal tissues. The tumor and matched normal tissues were divided into training and test datasets at a ratio of 7:3. We built the models on the training dataset and evaluated their performance on the test dataset. Five supervised machine learning algorithms were applied for feature selection: k-nearest neighbor (kNN), support vector machine (SVM), neural network, logistic regression, and decision tree. Because the individual CpGs were correlated with each other, only one CpG was included in each model. Among the applied algorithms, a logistic regression model based on cg04228935 showed the best performance in classifying NSCLCs in the test dataset (N = 28), with a sensitivity of 92.9% and a specificity of 92.9% (area under the curve (AUC) = 0.96; 95% confidence interval (CI) = 0.81–0.99, *p* < 0.0001; Figure 2A).
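The workflow described above (a 7:3 split, a single-CpG logistic regression, and sensitivity and specificity on the held-out samples) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the β-values are synthetic, the helper names (`stratified_split`, `fit_logistic_1d`, `sensitivity_specificity`) are hypothetical, and the model is fitted by plain gradient descent because the software used for model fitting is not stated in the text. The split sizes produced here need not match the test set size reported above.

```python
import math
import random

def stratified_split(labels, train_frac=0.7, seed=0):
    """Split sample indices into train/test, handling each class separately."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_frac * len(idx))
        train_idx += idx[:cut]
        test_idx += idx[cut:]
    return sorted(train_idx), sorted(test_idx)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_1d(x, y, lr=0.5, epochs=5000):
    """Fit p(tumor) = sigmoid(w * x + b) by batch gradient descent."""
    w, b, n = 0.0, 0.0, len(x)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for xi, yi in zip(x, y):
            err = sigmoid(w * xi + b) - yi
            grad_w += err * xi
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

def predict(x, w, b, threshold=0.5):
    """Label a sample as tumor (1) when the predicted probability >= threshold."""
    return [1 if sigmoid(w * xi + b) >= threshold else 0 for xi in x]

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# Synthetic beta-values for one CpG (ASSUMPTION: not the study data).
# Tumors are drawn with higher methylation than normals, clipped to [0, 1].
rng = random.Random(42)
def clip(v):
    return min(1.0, max(0.0, v))

beta = [clip(rng.gauss(0.6, 0.1)) for _ in range(42)]    # tumors
beta += [clip(rng.gauss(0.2, 0.05)) for _ in range(42)]  # matched normals
labels = [1] * 42 + [0] * 42

train_idx, test_idx = stratified_split(labels, train_frac=0.7)
w, b = fit_logistic_1d([beta[i] for i in train_idx],
                       [labels[i] for i in train_idx])

y_test = [labels[i] for i in test_idx]
y_pred = predict([beta[i] for i in test_idx], w, b)
sens, spec = sensitivity_specificity(y_test, y_pred)
print(f"sensitivity={sens:.3f}, specificity={spec:.3f}")
```

Splitting within each class keeps the tumor-to-normal proportion the same in the training and test sets, which matters more for an imbalanced cohort than for a matched one.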

**Figure 2.** Evaluation of the prediction performance of five supervised machine learning algorithms in non-small cell lung cancer (NSCLC). (**A**) The true and false positive rates of the logistic regression model based on three CpGs were evaluated in a test dataset (N = 28) of 42 NSCLCs, and the receiver operating characteristic (ROC) curves were plotted using the MedCalc software. (**B**) The prediction certainty of the support vector machine model was evaluated in the test datasets of our data and the TCGA lung cancer data. The X-axis indicates the degree of certainty (0% to 100%) with which our tissues and the TCGA tissues were predicted as normal or tumor for each β-value on the Y-axis. The sky blue and red orange circles indicate tumor and normal tissues, respectively. (**C**) The β-values of the three CpGs in our data and the TCGA data were compared to assess differences in *RUNX1* hypermethylation across ethnic groups or populations.

To determine whether *RUNX1* hypermethylation could serve as a biomarker for detecting NSCLC in other ethnic groups, we tested *RUNX1* hypermethylation in 899 tissues from the TCGA lung cancer dataset (75 normal tissues and 824 primary tumor tissues). As with our data, the TCGA data were divided into a training dataset (N = 630) and a test dataset (N = 269), and the performance of the logistic regression model based on three CpGs was evaluated on the test dataset (Table S1). The sensitivity and specificity of the model based on cg04228935 in the test dataset (N = 269) were 91.8% and 96.4%, respectively, and the AUC was 0.95 (95% CI = 0.93–0.98, *p* < 0.0001). The degree of prediction certainty for NSCLC in the test datasets was high in both our data and the TCGA lung cancer data (Figure 2B). Finally, we compared the methylation levels of the three CpGs at a CpG island of *RUNX1* between our data and the TCGA lung cancer data. No significant difference was found between the two datasets (Figure 2C).
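The AUC values reported for both cohorts can be read as the probability that a randomly chosen tumor receives a higher model score than a randomly chosen normal tissue, with ties counted as one half. The sketch below computes the AUC directly from that definition. The scores are made up for illustration, and the function name `auc_from_scores` is hypothetical. The ROC analysis in this study was performed with MedCalc, so this is only an equivalent way to view the statistic, not the authors' procedure.

```python
def auc_from_scores(tumor_scores, normal_scores):
    """AUC as the fraction of (tumor, normal) pairs ranked correctly.

    A pair counts 1 if the tumor score is higher, 0.5 if the scores tie,
    and 0 otherwise. This equals the area under the empirical ROC curve.
    """
    wins = 0.0
    for t in tumor_scores:
        for n in normal_scores:
            if t > n:
                wins += 1.0
            elif t == n:
                wins += 0.5
    return wins / (len(tumor_scores) * len(normal_scores))

# Illustrative predicted probabilities (ASSUMPTION: not the study data).
tumor = [0.9, 0.8, 0.7, 0.4]
normal = [0.1, 0.2, 0.5, 0.3]

auc = auc_from_scores(tumor, normal)
print(f"AUC = {auc:.4f}")  # 15 of the 16 pairs are ranked correctly
```

Because the statistic is defined over tumor and normal pairs, it does not depend on the ratio of tumors to normals, which makes it convenient for comparing a matched cohort with an imbalanced one such as the TCGA set (824 tumors and 75 normals).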
