Next Article in Journal
Lonicera japonica Thunb. Ethanol Extract Exerts a Protective Effect on Normal Human Gastric Epithelial Cells by Modulating the Activity of Tumor-Necrosis-Factor-α-Induced Inflammatory Cyclooxygenase 2/Prostaglandin E2 and Matrix Metalloproteinase 9
Previous Article in Journal
The Impact of HIV and Parasite Single Infection and Coinfection on Telomere Length: A Systematic Review
Previous Article in Special Issue
Multiomics Analysis of the PHLDA Gene Family in Different Cancers and Their Clinical Prognostic Value
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Application of Transcriptome-Based Gene Set Featurization for Machine Learning Model to Predict the Origin of Metastatic Cancer

1
Oncocross Ltd., Seoul 04168, Republic of Korea
2
Department of Pathology, Kangbuk Samsung Hospital, Sungkyunkwan University School of Medicine, Seoul 03181, Republic of Korea
3
Yonsei Institute of Pharmaceutical Sciences, College of Pharmacy, Yonsei University, Incheon 21983, Republic of Korea
*
Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Curr. Issues Mol. Biol. 2024, 46(7), 7291-7302; https://doi.org/10.3390/cimb46070432 (registering DOI)
Submission received: 29 May 2024 / Revised: 3 July 2024 / Accepted: 3 July 2024 / Published: 9 July 2024
(This article belongs to the Collection Bioinformatics Approaches to Biomedicine)

Abstract

:
Identifying the primary site of origin of metastatic cancer is vital for guiding treatment decisions, especially for patients with cancer of unknown primary (CUP). Despite advanced diagnostic techniques, CUP remains difficult to pinpoint and is responsible for a considerable number of cancer-related fatalities. Understanding its origin is crucial for effective management and potentially improving patient outcomes. This study introduces a machine learning framework, ONCOfind-AI, that leverages transcriptome-based gene set features to enhance the accuracy of predicting the origin of metastatic cancers. We demonstrate its potential to facilitate the integration of RNA sequencing and microarray data by using gene set scores for characterization of transcriptome profiles generated from different platforms. Integrating data from different platforms resulted in improved accuracy of machine learning models for predicting cancer origins. We validated our method using external data from clinical samples collected through the Kangbuk Samsung Medical Center and Gene Expression Omnibus. The external validation results demonstrate a top-1 accuracy ranging from 0.80 to 0.86, with a top-2 accuracy of 0.90. This study highlights that incorporating biological knowledge through curated gene sets can help to merge gene expression data from different platforms, thereby enhancing the compatibility needed to develop more effective machine learning prediction models.

1. Introduction

Finding the primary site of cancer is important for determining the treatment regimen for the cancer. Cancer of unknown primary (CUP) describes the diagnosis of metastatic cancer where the primary site of origin eludes detection despite comprehensive diagnostic evaluations [1]. CUP remains a perplexing challenge in oncology, representing approximately 3–5% of all malignancies a decade ago compared to 2–4% in recent years [1,2]. Nevertheless, CUP ranks as the third to fourth leading cause of cancer-related mortality [3]. CUP patients also exhibit greater levels of anxiety and depression than patients with known primary cancer and non-metastatic known primary cancer, along with impaired physical, mental, and social relationships [4]. Certainly, the majority of patients (80–90%) diagnosed with CUPs belong to an unfavorable group; their median overall survival (OS) spans from 3 to 11 months, with a 1-year OS of only 25–40% [5].
Identification of primary cancer site characteristics and associated expression targets enables the utilization of targeted anticancer drugs, enhancing prognosis over broad chemotherapy [3,6,7]. Ding et al. performed a meta-analysis, and determined that identifying the tumor of origin and administering targeted therapy are efficacious approaches, particularly for CUP patients with receptive tumor types [8]. However, traditional pathology methods, while considered the gold standard, often face limitations in exhaustively identifying the primary site due to tissue constraints and the complexity of diagnostic stains. Recent studies have demonstrated the potential of ML algorithms trained on diverse tumor and normal tissue datasets in discerning tissue-specific and tumor-specific patterns from high-resolution molecular data. Molecular diagnostic methodologies, exemplified by genome screening, have the potential to facilitate the identification of elusive origins in CUP cases [9,10]. Researchers are also using transcriptome-wide profiling to identify genes responsible for cancer or disease [11,12]. By utilizing gene expression data and employing sophisticated ML techniques, researchers have achieved notable successes in improving diagnostic accuracy for various cancers [13,14]. Using gene expression profiling (GEP) analysis, classical statistics and machine learning classification techniques can be used to predict primary cancers [15,16]. Researchers have used public transcriptome data to develop deep learning models that can identify genetic markers of primary and metastatic cancers, validating the model using newly acquired clinical samples [17,18]. SCOPE was trained on a collection of 10,688 untreated primary tumor samples and tested on 201 metastatic cancers. The model was validated on 15 cancer types, achieving an overall mean accuracy rate of 86% [17]. CUP-AI-Dx was trained on the transcriptome of 18,217 primary cancer samples, and external validation on metastatic cancer samples was performed on 92 metastatic cancer samples of 18 types collected from clinical laboratories in the US and Australia [18]. The resulting CUP-AI-Dx models showed top-1 accuracy results of 86.96% and 72.46%, respectively. Moon et al. developed OncoNPC, an XGBoost classifier based on next-generation sequencing, which identified distinct subgroups within CUP and improved treatment outcomes through genomically guided therapies [19]. OncoNPC was trained on 36,445 tumors from three medical centers and is able to classify 22 cancers; 971 CUP tumors were collected from the Dana-Farber Cancer Institute, and the model was able to identify the primary cancer type from metastatic carcinoma with 41.2% accuracy.
With the increase in the amount of tumor sample transcriptome data obtained through RNA sequencing, the amount of data available for learning is steadily increasing. However, a significant volume of cancer transcriptome data has been generated using micro-arrays, and compatibility must be ensured in order to utilize data from both platforms. Inter-platform compatibility can be achieved through featurization using gene set information that reflects biological knowledge [20]. By employing this approach, it is possible to construct a more extensive dataset for training, which is expected to improve accuracy. In this manner, our model was able to train more data than previous studies and make predictions for both micro-array and RNA sequencing data.
In this study, we used transcriptome data for 17 solid tumors from The Cancer Genome Atlas project (TCGA) [21] and Oncopression (OCP) datasets [22], and utilized training data featurized into gene set enrichment scores. We then trained a classification machine learning ensemble model by combining logistic regression, LightGBM, and SVM through a voting method (Figure 1). Furthermore, we conducted external validation using clinical samples collected from metastatic sites. Our ONCOfind-AI model showed a high robustness of 96.8 ± 2% with 5-fold validation on 27,941 primary cancer data samples. This is the largest number in studies that have built models from public data, and ONCOfind-AI produces the highest accuracy, with 86.2% for top-1 predicted sites and 90.0% for top-2 predicted sites on metastatic cancer samples in external validation.

2. Methods and Materials

2.1. Data Source

The data sources for training included transcriptome data for primary cancer collected from the OCP and TCGA databases. OCP data were created via micro-array with normalization [22], whereas the TCGA data, downloaded from Firehose (https://gdac.broadinstitute.org/, accessed on 1 July 2023), were created via RNA-Seq and normalized to RNA-Seq by expectation maximization (RSEM). A total of 27,941 samples for 17 cancer tissue type were selected for training (Table 1). For external validation, we used 103 formalin-fixed paraffin-embedded (FFPE) tissue samples from patients collected by Kangbuk Samsung Medical Center (KBSMC) between 2018 and 2022. Patient samples were collected after informed consent was obtained; the study followed the guidelines of the Declaration of Helsinki and received approval from the Institutional Review Board (KBSMC 2022-11-018). Additionally, public datasets comprising 107 samples across seven cohorts were obtained from the Gene Expression Omnibus (GEO) to further validate our findings (Table 2).

2.1.1. RNA Sequencing and Gene Expression Profiling

mRNA was extracted from FFPE tissue samples taken from patients using an RNeasy FFPE Kit (Qiagen, Hilden, Germany) according to the manufacturer’s instruction. In summary, FFPE tissue sections were deparaffinized by treatment with deparaffinization solution and lysed by proteinase K digestion followed by heat treatment. Next, the supernatants treated with DNase were added to Buffer RBC and ethanol to adjust the binding conditions for RNA. The samples were applied to the RNeasy MinElute spin column, where the total RNA was bound to the membrane and contaminants were efficiently washed away. RNA was then eluted in RNase-free water. The RNA concentration was determined using a NanoDrop (Thermo Fisher Scientific, Waltham, MA, USA). Subsequently, RNA-seq was performed by Macrogen (Seoul, Republic of Korea) on the Illumina (San Diego, CA, USA) RNA-Seq platform for paired-end sequencing employing the SureSelectXT RNA Direct Reagent Kit (Agilent, Santa Clara, CA, USA). The raw FASTQ RNA-Seq data were trimmed using the Trimmomatic-0.39-1 tool [23]. Alignment and quantification were performed using STAR 2.7.8a and RSEM 1.3.3 with the GRCh38.105 genome reference [24,25].

2.2. Featurization and Feature Selection

Clinical data for external validation from KBSMC and GEO were quantified by transcriptome profile, as described in Section 2.1.1. Data from TCGA and OCP, the primary site cancer samples for training, are available in public databases and provide preprocessed transcriptome profiles. However, because the two public databases measure transcriptomes differently, they are normalized differently, and even samples from the same cancer type show different behavior. Figure 2A shows that the average expression levels of each gene between the TCGA and OCP groups for breast cancer are different in terms of both range and distribution pattern. To integrate the different characteristics of these data, we converted the gene-wise information of the transcriptome to the gene set dimension. We created gene set enrichment scores for all samples; statistical values were extracted using the Kolmogorov–Smirnov test with 8300 gene sets from the “Hallmarker gene sets”, “C2 curated gene sets”, “C6 oncogenic signature gene sets”, and “C8 cell type signature gene sets” obtained from MsigDB (www.gsea-msigdb.org, accessed on 5 March 2024, v2023.2) [26,27,28,29,30,31,32,33,34,35]. Positive or negative signs were assigned based on the directionality of the expression difference. Each gene set was defined by canonical pathway, cancer type, tissue type, cell type, and oncogenic gene. Therefore, we used the transcriptomic profiles collected from the tumors to calculate which gene sets are activated, allowing us to know which tissues, which tumors, and which pathways they originate from. To calculate the score per gene set, we used Gene Set Enrichment Analysis (GSEA) and calculated the normalized enrichment score for each gene set [26].
After computing the 8300 gene set scores, we investigated which of these could provide a comprehensive representation of the OCP and TCGA data. In feature selection, the receiver operating characteristic (ROC)-based feature selection approach can be used as an effective tool to evaluate individual features [36,37]. In particular, for binary-class problems the single feature classifier constructed from feature f i can establish an appropriate threshold θ . If x θ , then x is classified as the TCGA class, and if x < θ , then x is classified as the OCP class. If one feature f i has an area under the ROC curve (AUC) value for a single-feature classifier farther from 0.5 than another feature f j , we can say that f i is more discriminative than f j in the two classes [36]. We calculated the AUC for each gene set feature in order to discriminate between the OCP cohort and TCGA cohort. Figure 2B shows a histogram of the AUC values calculated for each gene set feature. An AUC of 0.5 means that the classifier has no discriminative capacity at all, which means that OCP and TCGA can be used comprehensively when using features around 0.5. Therefore, feature selection was performed at intervals of ±0.05 around a baseline of AUC 0.5, and these selected feature groups were used for modeling.
We found that simply converting to gene set scores significantly normalized the ranges and distribution patterns of the OCP and TCGA cohorts. Figure 2(C-1)) shows all 8300 gene sets, demonstrating that the TCGA and OCP data are more highly correlated than at the gene level. Moreover, filtering gene sets that show differences between groups by AUC range can achieve a greater correlation (Figure 2(C-2)). When examining the distribution through T-SNE clustering for major cancer types such as lung, stomach, and breast cancer, compatibility between TCGA and OCP can be observed (Figure 2D). Dimension reduction was performed on the features using T-SNE from Scikit-learn in order to visualize the distribution of samples and groups in a two-dimensional space, further aiding in understanding and optimizing the model. Figure 2(D-1) is when all 8300 gene sets are used, while Figure 2(D-2) is when 1249 gene sets from the AUC range of 0.45–0.55 are used. In Figure 2(D-2), it can be seen that when using only the 1249 features selected by AUC, the features are more focused on the characteristics of the cancer type than those of the data type.

2.3. Cancer Primary Site Classification Model

To create a classification ensemble model based on machine learning, we used logistic regression, LightGBM, and support vector machine (SVM) with a soft voting classification approach. For this algorithm, we used the Scikit-learn (version 0.22.1) and LightGBM (version 3.1.1) Python packages. LightGBM is renowned for its high efficiency and low memory usage, which make it particularly effective for handling large datasets with high dimensionality. It also excels in terms of speed and accuracy. Additionally, SVM is advantageous for its effectiveness in high-dimensional spaces and its versatility through the use of different kernel functions, enabling it to model nonlinear boundaries. The ensemble approach combining these powerful classifiers leverages their individual strengths to enhance the predictive performance and achieve significant results. The parameters were set using a grid search.

2.4. External Validation

For external validation in metastatic cancer, predictions were made through the developed model using KBSMC and GEO data and a ranking was created based on the probability. To further validate the model under real-world conditions, we trained the model on public data and then evaluated it on a prospectively collected sample of cancer patients from the hospital. Through this process, the top-1 and top-2 prediction accuracy was calculated.

3. Results and Discussion

We demonstrate that more integrated data lead to better prediction performance on datasets consisting of real-world clinical data. In addition, we show how to integrate the two largest sources of transcriptomic data, namely, micro-arrays and RNA sequencing, and use them as features in a machine learning model. The model is found to be robust, with auccuracy scores of 0.988, 0.981, 0.965, 0.963, and 0.942 when performing a 5-fold cross-validation using randomized shuffling of the full TCGA + OCP data. Figure 3A shows the average F1 scores for each model with features in the range of AUC values shown in Figure 2B. In this case, the average F1 score is the average of the predictions across the 17 cancer primary organs. The model was trained with TCGA and then tested with cancer types from OCP data; conversely, the model was also trained with OCP data and then tested with TCGA cancer types. The overall line does not shown the median, as there is a larger amount of OCP data than TCGA data (19,541 vs. 8400) (Table 1). We found that AUC values of 0.5 ± 0.1 to 0.3 were associated with an average F1 score of 0.9 or higher (Figure 3A); therefore, we selected features from this range to build our model and conduct external data validation (Figure 3B).
Models trained on TCGA data, OCP data, and a combination of both TCGA and OCP were validated using external data. The best scores for the validation sets in each model are shown in Figure 3B and Table 3. Notably, models trained on TCGA data generated through RNA sequencing exhibited significantly lower performance when validated with KBSMC data, which were also produced through RNA sequencing. However, we found a significant improvement in performance when micro-array data from OCP were included in the training with our feature integration method. The models trained on the combined TCGA + OCP dataset demonstrated an accuracy range of approximately 0.80 to 0.86 depending on the range of feature selection (Figure 3B, Table 3). Figure 3C,D shows the confusion matrix for the top-1 prediction of the external validation set. The AUC of the feature selection group adopted in the evaluation model ranged from 0.5 ± 0.25. Most of the answers matched the actual primary site, and for those that were incorrect, the most likely answer was a neighboring organ. For example, when bile duct was the correct answer, the incorrect answers were pancreas and colorectum, whereas when the correct answer was uterus were the incorrect answers were ovary and kidney. In particular, the prostate, uterus, bile duct, and pancreas, which had relatively high incorrect answer rates, each had fewer than ten data points, making it difficult to perform sufficient validation. Our model had an average accuracy of 0.9 when calculating the accuracy for KBSMC+GEO data up to the top-2 predictions (Table 4).

3.1. Performance of ONCOfind-AI

We compared ONCOfind-AI to previous studies that have used machine learning to predict the primary site of metastatic carcinoma over the past five years (Table 5).
As it is very difficult to collect metastatic carcinomas, the models were all trained on cancer samples from the primary site. This study presents an integrated feature calculation method for the data types that provide transcriptome profiles, making it the largest study to use public data. The most recent study, OncoNPC from 2023, was a collaboration across three clinical research centers and collected the largest number of cancer samples; however, it achieved relatively low accuracy using a simple XGBoost classifier model. SCOPE used 201 metastatic cancer samples for validation, with 168 obtained from metastatic sites and the remaining 33 from the origin site. CUP-AI-Dx was developed at the Jackson Laboratory for Genomic Medicine and trained on RNA-Seq data from TCGA and the Cancer Genome Consortium (ICGC). For clinical validation, 92 FFPE samples representing 18 cancer types were collected from two clinical laboratories in the USA and Australia. Of these, 23 samples were from the JAX CLIA lab, six of of which were primary cancers. The primary site was predicted with 86.96% accuracy. The remaining 69 samples were from 18 types of metastatic cancer from the the University of Melbourne, for which the primary site was predicted with 72.46% accuracy. Although there is a large amount of primary cancer data, as there are many surgeries, commonly patients do not undergo surgery when metastasis occurs, making metastatic cancer data very rare and difficult to obtain. Therefore, most studies have only used a very small amount of testing data for clinical validation compared to the amount of training data. However, in our study all 210 samples were taken from metastatic tissues of metastatic cancer, and ONCOfind-AI predicted the primary site with the highest top-1 accuracy. Furthermore, as the answers that missed the top-1 prediction mentioned organs similar to the correct answer, it can be interpreted that our model learned the characteristics of tissues. This suggests that with more comprehensive feature selection based on the featurization approach outlined in our research, we can anticipate even better performance as the volume of data available across all cancer types increases.

3.2. Limitation and Future Work

This study shows the potential for further data integration by including other public databases such as ICGC, which we plan to address in the future. Collecting cancer samples from metastatic sites is very challenging, as is any study based on the resulting data, as patients with metastatic cancer often do not undergo biopsies. Our laboratory will continue to work with KBSMC to collect data and refine the model. In future research, we will validate whether our model could perform as well for other non-Asian ethnicities. Cancer genomic variation varies significantly by race, and only 672 out of 11,122 patients in the public TCGA database are Asian [21]; therefore, we expect our model to perform better for Americans.

Author Contributions

Y.J.: writing—original draft, writing—review and editing, conceptualization, methodology, validation, data curation. J.C.: methodology, resources, data curation, writing—review. J.K.: writing—original draft, conceptualization, methodology, validation. S.B.: data curation. J.-H.L.: supervision, writing—review. D.-S.J.: supervision, writing—review. W.-W.K.: supervision, project administration, funding acquisition. Y.-R.K.: supervision, project administration, funding acquisition. J.K.: supervision, project administration, funding acquisition, writing—review and editing. I.-G.D.: supervision, project administration, funding acquisition, writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Seoul R&BD Program (BT220252) through the Seoul Business Agency (SBA) funded by the Seoul Metropolitan Government.

Institutional Review Board Statement

This study was conducted according to the guidelines of the Declaration of Helsinki and was approved by the Institutional Review Board of Kangbuk Samsung Hospital (KBSMC 2022-11-018).

Informed Consent Statement

Complete written informed consent was obtained from the patient for the publication of this study and accompanying images.

Data Availability Statement

The data that support the findings of this study are available on request from the corresponding author, In-Gu Do. The data are not publicly available due to their containing information that could compromise the privacy of research participants. The source code for this project has been distributed on GitHub (http://github.com/yeonuk-Jeong/ONCOfind, accessed on 2 July 2024).

Conflicts of Interest

Authors Yeonuk Jeong, Juwon Kang, Seungjun Baek, Jae-Hak Lee, Dong-Sub Jung, Won-Woo Kim, Yi-Rang Kim, and Jihoon Kang were employed by the company Oncocross Ltd. Authors Jinah Chu and In-Gu Do were employed by the Kangbuk Samsung Hospital. The authors declare that this study received funding from the Seoul Business Agency (SBA). The funder was not involved in the study design, the collection, analysis, and interpretation of data, the writing of this article, or the decision to submit it for publication.

References

  1. Pavlidis, N.; Pentheroudakis, G. Cancer of unknown primary site. Lancet 2012, 379, 1428–1435. [Google Scholar] [CrossRef] [PubMed]
  2. Varadhachary, G.; Abbruzzese, J.L. Carcinoma of unknown primary. In Abeloff’s Clinical Oncology; Elsevier: Amsterdam, The Netherlands, 2020; pp. 1694–1702. [Google Scholar]
  3. Qaseem, A.; Usman, N.; Jayaraj, J.S.; Janapala, R.N.; Kashif, T. Cancer of unknown primary: A review on clinical guidelines in the development and targeted management of patients with the unknown primary site. Cureus 2019, 11, e5552. [Google Scholar] [CrossRef]
  4. Hyphantis, T.; Papadimitriou, I.; Petrakis, D.; Fountzilas, G.; Repana, D.; Assimakopoulos, K.; Carvalho, A.F.; Pavlidis, N. Psychiatric manifestations, personality traits and health-related quality of life in cancer of unknown primary site. Psycho-Oncology 2013, 22, 2009–2015. [Google Scholar] [CrossRef] [PubMed]
  5. Ma, W.; Wu, H.; Chen, Y.; Xu, H.; Jiang, J.; Du, B.; Wan, M.; Ma, X.; Chen, X.; Lin, L.; et al. New techniques to identify the tissue of origin for cancer of unknown primary in the era of precision medicine: Progress and challenges. Briefings Bioinform. 2024, 25, bbae028. [Google Scholar] [CrossRef] [PubMed]
  6. Rassy, E.; Pavlidis, N. Progress in refining the clinical management of cancer of unknown primary in the molecular era. Nat. Rev. Clin. Oncol. 2020, 17, 541–554. [Google Scholar] [CrossRef] [PubMed]
  7. Shuel, S.L. Targeted cancer therapies: Clinical pearls for primary care. Can. Fam. Physician 2022, 68, 515. [Google Scholar] [CrossRef]
  8. Ding, Y.; Jiang, J.; Xu, J.; Chen, Y.; Zheng, Y.; Jiang, W.; Mao, C.; Jiang, H.; Bao, X.; Shen, Y.; et al. Site-specific therapy in cancers of unknown primary site: A systematic review and meta-analysis. ESMO Open 2022, 7, 100407. [Google Scholar] [CrossRef] [PubMed]
  9. Massard, C.; Loriot, Y.; Fizazi, K. Carcinomas of an unknown primary origin—Diagnosis and treatment. Nat. Rev. Clin. Oncol. 2011, 8, 701–710. [Google Scholar] [CrossRef] [PubMed]
  10. Varghese, A.; Arora, A.; Capanu, M.; Camacho, N.; Won, H.; Zehir, A.; Gao, J.; Chakravarty, D.; Schultz, N.; Klimstra, D.; et al. Clinical and molecular characterization of patients with cancer of unknown primary in the modern era. Ann. Oncol. 2017, 28, 3015–3021. [Google Scholar] [CrossRef]
  11. Mai, J.; Lu, M.; Gao, Q.; Zeng, J.; Xiao, J. Transcriptome-wide association studies: Recent advances in methods, applications and available databases. Commun. Biol. 2023, 6, 899. [Google Scholar] [CrossRef]
  12. Cao, C.; Kwok, D.; Edie, S.; Li, Q.; Ding, B.; Kossinna, P.; Campbell, S.; Wu, J.; Greenberg, M.; Long, Q. kTWAS: Integrating kernel machine with transcriptome-wide association studies improves statistical power and reveals novel genes. Briefings Bioinform. 2021, 22, bbaa270. [Google Scholar] [CrossRef] [PubMed]
  13. Petinrin, O.O.; Saeed, F.; Toseef, M.; Liu, Z.; Basurra, S.; Muyide, I.O.; Li, X.; Lin, Q.; Wong, K.C. Machine learning in metastatic cancer research: Potentials, possibilities, and prospects. Comput. Struct. Biotechnol. J. 2023, 21, 2454–2470. [Google Scholar] [CrossRef]
  14. Divate, M.; Tyagi, A.; Richard, D.J.; Prasad, P.A.; Gowda, H.; Nagaraj, S.H. Deep learning-based pan-cancer classification model reveals tissue-of-origin specific gene expression signatures. Cancers 2022, 14, 1185. [Google Scholar] [CrossRef]
  15. Zheng, Y.; Ding, Y.; Wang, Q.; Sun, Y.; Teng, X.; Gao, Q.; Zhong, W.; Lou, X.; Xiao, C.; Chen, C.; et al. 90-gene signature assay for tissue origin diagnosis of brain metastases. J. Transl. Med. 2019, 17, 331. [Google Scholar] [CrossRef] [PubMed]
  16. Jiang, W.; Shen, Y.; Ding, Y.; Ye, C.; Zheng, Y.; Zhao, P.; Liu, L.; Tong, Z.; Zhou, L.; Sun, S.; et al. A naive Bayes algorithm for tissue origin diagnosis (TOD-Bayes) of synchronous multifocal tumors in the hepatobiliary and pancreatic system. Int. J. Cancer 2018, 142, 357–368. [Google Scholar] [CrossRef]
  17. Grewal, J.K.; Tessier-Cloutier, B.; Jones, M.; Gakkhar, S.; Ma, Y.; Moore, R.; Mungall, A.J.; Zhao, Y.; Taylor, M.D.; Gelmon, K.; et al. Application of a neural network whole transcriptome–based pan-cancer method for diagnosis of primary and metastatic cancers. JAMA Netw. Open 2019, 2, e192597. [Google Scholar] [CrossRef]
  18. Zhao, Y.; Pan, Z.; Namburi, S.; Pattison, A.; Posner, A.; Balachander, S.; Paisie, C.A.; Reddi, H.V.; Rueter, J.; Gill, A.J.; et al. CUP-AI-Dx: A tool for inferring cancer tissue of origin and molecular subtype using RNA gene-expression data and artificial intelligence. EBioMedicine 2020, 61, 103030. [Google Scholar] [CrossRef]
  19. Moon, I.; LoPiccolo, J.; Baca, S.C.; Sholl, L.M.; Kehl, K.L.; Hassett, M.J.; Liu, D.; Schrag, D.; Gusev, A. Machine learning for genetics-based classification and treatment response prediction in cancer of unknown primary. Nat. Med. 2023, 29, 2057–2067. [Google Scholar] [CrossRef]
  20. van der Kloet, F.M.; Buurmans, J.; Jonker, M.J.; Smilde, A.K.; Westerhuis, J.A. Increased comparability between RNA-Seq and microarray data by utilization of gene sets. PLoS Comput. Biol. 2020, 16, e1008295. [Google Scholar] [CrossRef] [PubMed]
  21. Yuan, J.; Hu, Z.; Mahal, B.A.; Zhao, S.D.; Kensler, K.H.; Pi, J.; Hu, X.; Zhang, Y.; Wang, Y.; Jiang, J.; et al. Integrated analysis of genetic ancestry and genomic alterations across cancers. Cancer Cell 2018, 34, 549–560. [Google Scholar] [CrossRef]
  22. Lee, J.; Choi, C. Oncopression: Gene expression compendium for cancer with matched normal tissues. Bioinformatics 2017, 33, 2068–2070. [Google Scholar] [CrossRef] [PubMed]
  23. Bolger, A.M.; Lohse, M.; Usadel, B. Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics 2014, 30, 2114–2120. [Google Scholar] [CrossRef] [PubMed]
  24. Dobin, A.; Davis, C.A.; Schlesinger, F.; Drenkow, J.; Zaleski, C.; Jha, S.; Batut, P.; Chaisson, M.; Gingeras, T.R. STAR: Ultrafast universal RNA-seq aligner. Bioinformatics 2013, 29, 15–21. [Google Scholar] [CrossRef]
  25. Li, B.; Dewey, C.N. RSEM: Accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinform. 2011, 12, 323. [Google Scholar] [CrossRef]
  26. Subramanian, A.; Tamayo, P.; Mootha, V.K.; Mukherjee, S.; Ebert, B.L.; Gillette, M.A.; Paulovich, A.; Pomeroy, S.L.; Golub, T.R.; Lander, E.S.; et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA 2005, 102, 15545–15550. [Google Scholar] [CrossRef]
  27. Liberzon, A.; Birger, C.; Thorvaldsdóttir, H.; Ghandi, M.; Mesirov, J.P.; Tamayo, P. The molecular signatures database hallmark gene set collection. Cell Syst. 2015, 1, 417–425. [Google Scholar] [CrossRef] [PubMed]
  28. Shao, X.; Gomez, C.D.; Kapoor, N.; Considine, J.M.; Grams, C.; Gao, Y.; Naba, A. MatrisomeDB 2.0: 2023 updates to the ECM-protein knowledge database. Nucleic Acids Res. 2023, 51, D1519–D1530. [Google Scholar] [CrossRef] [PubMed]
  29. Newman, J.C.; Weiner, A.M. L2L: A simple tool for discovering the hidden significance in microarray expression data. Genome Biol. 2005, 6, R81. [Google Scholar] [CrossRef]
  30. Zeller, K.I.; Jegga, A.G.; Aronow, B.J.; O’Donnell, K.A.; Dang, C.V. An integrated database of genes responsive to the Myc oncogenic transcription factor: Identification of direct genomic targets. Genome Biol. 2003, 4, R69. [Google Scholar] [CrossRef]
  31. Nishimura, D. BioCarta. Biotech Softw. Internet Rep. Comput. Softw. J. Sci. 2001, 2, 117–120. [Google Scholar] [CrossRef]
  32. Kanehisa, M.; Goto, S.; Furumichi, M.; Tanabe, M.; Hirakawa, M. KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res. 2010, 38, D355–D360. [Google Scholar] [CrossRef] [PubMed]
  33. Schaefer, C.F.; Anthony, K.; Krupa, S.; Buchoff, J.; Day, M.; Hannay, T.; Buetow, K.H. PID: The pathway interaction database. Nucleic Acids Res. 2009, 37, D674–D679. [Google Scholar] [CrossRef] [PubMed]
  34. Jassal, B.; Matthews, L.; Viteri, G.; Gong, C.; Lorente, P.; Fabregat, A.; Sidiropoulos, K.; Cook, J.; Gillespie, M.; Haw, R.; et al. The reactome pathway knowledgebase. Nucleic Acids Res. 2020, 48, D498–D503. [Google Scholar] [CrossRef] [PubMed]
  35. Pico, A.R.; Kelder, T.; Van Iersel, M.P.; Hanspers, K.; Conklin, B.R.; Evelo, C. WikiPathways: Pathway editing for the people. PLoS Biol. 2008, 6, e184. [Google Scholar] [CrossRef]
  36. Sun, L.; Wang, J.; Wei, J. AVC: Selecting discriminative features on basis of AUC by maximizing variable complementarity. BMC Bioinform. 2017, 18, 73–89. [Google Scholar] [CrossRef]
  37. Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27, 861–874. [Google Scholar] [CrossRef]
Figure 1. This figure presents the scheme of modeling process. RNA sequencing and micro-array data were collected from The Cancer Genome Atlas project (TCGA) and Oncopression (OCP), respectively, and featurization was conducted using gene sets representing the characteristics of each tissue and organ. Using the resulting feature scores, we developed a model to predict the primary site among 17 organs.
Figure 1. This figure presents the scheme of modeling process. RNA sequencing and micro-array data were collected from The Cancer Genome Atlas project (TCGA) and Oncopression (OCP), respectively, and featurization was conducted using gene sets representing the characteristics of each tissue and organ. Using the resulting feature scores, we developed a model to predict the primary site among 17 organs.
Cimb 46 00432 g001
Figure 2. Featurization and feature selection: (A) average expression of each genes by groups from breast cancer (BRCA); (B) Histogram of TCGA-OCP distinguishing AUC value by gene sets. The red line marks 0.5; (C) Average enrichment score of each gene set by groups from BRCA; (D) Distribution of samples by T-SNE. (C-1,D-1) are when all 8300 gene sets are used, while (C-2,D-2) are when 1249 gene sets from AUC ranging from 0.45–0.55 are used.
Figure 2. Featurization and feature selection: (A) average expression of each genes by groups from breast cancer (BRCA); (B) Histogram of TCGA-OCP distinguishing AUC value by gene sets. The red line marks 0.5; (C) Average enrichment score of each gene set by groups from BRCA; (D) Distribution of samples by T-SNE. (C-1,D-1) are when all 8300 gene sets are used, while (C-2,D-2) are when 1249 gene sets from AUC ranging from 0.45–0.55 are used.
Cimb 46 00432 g002
Figure 3. (A) TCGA vs OCP cross validation. Models were created to evaluate metastatic carcinoma using the AUC range of f1 scores of 0.9 and above, indicated by the red line; (B) Accuracy of external validation by feature selection groups; (C) Confusion matrix of external validation for KBSMC and GEO data; (D) Confusion matrix of external validation for only KBSMC data.
Figure 3. (A) TCGA vs OCP cross validation. Models were created to evaluate metastatic carcinoma using the AUC range of f1 scores of 0.9 and above, indicated by the red line; (B) Accuracy of external validation by feature selection groups; (C) Confusion matrix of external validation for KBSMC and GEO data; (D) Confusion matrix of external validation for only KBSMC data.
Cimb 46 00432 g003
Table 1. Training datasets.
Table 1. Training datasets.
Cancer TypeData SourceSample No.
Adrenal GlandOCP311
TCGA261
Bile DuctOCP184
TCGA36
BladderOCP310
TCGA408
BrainOCP3054
TCGA696
BreastOCP5543
TCGA1093
ColorectalOCP3074
TCGA381
Head And NeckOCP622
TCGA520
KidneyOCP356
TCGA891
LiverOCP413
TCGA373
LungOCP2243
TCGA1018
OvaryOCP1143
TCGA307
PancreasOCP207
TCGA178
ProstateOCP247
TCGA497
SkinOCP294
TCGA103
StomachOCP920
TCGA599
ThyroidOCP298
TCGA501
UterusOCP322
TCGA538
TotalOCP19,541
TCGA8400
OCP + TCGA27,941
Table 2. Validation datasets.
Table 2. Validation datasets.
CancerData SourceSample No.
Bile DuctKBSMC4
BladderKBSMC6
BreastGSE1401729
GSE14799513
GSE1912307
KBSMC5
ColorectalKBSMC20
GSE403677
Head And NeckKBSMC2
KidneyKBSMC6
LiverGSE4036715
KBSMC13
LungKBSMC5
OvaryKBSMC5
PancreasKBSMC10
ProstateKBSMC5
SkinKBSMC2
StomachKBSMC11
GSE2469638
GSE1911394
ThyroidGSE6054224
KBSMC3
UterusKBSMC6
TotalKBSMC103
GEO107
Total210
Table 3. External validation results.
Table 3. External validation results.
Training DataTest DataFeature AUC Range 0.5±Weighted Accuracy
OCP + TCGAGEO + KBSMC0.250.862
OCP + TCGAGEO + KBSMC0.30.852
OCP + TCGAGEO + KBSMC0.150.843
OCP + TCGAGEO + KBSMC0.20.843
OCP + TCGAGEO + KBSMC0.10.833
OCP + TCGAKBSMC0.250.825
OCP + TCGAKBSMC0.30.806
OCPGEO + KBSMC0.10.79
OCP + TCGAKBSMC0.150.777
OCPGEO + KBSMC0.20.776
OCPGEO + KBSMC0.150.776
OCP + TCGAKBSMC0.20.767
OCPGEO + KBSMC0.30.757
OCP + TCGAKBSMC0.10.748
OCPGEO + KBSMC0.250.738
TCGAGEO + KBSMC0.30.724
TCGAGEO + KBSMC0.150.719
TCGAGEO + KBSMC0.10.719
TCGAGEO + KBSMC0.20.71
TCGAGEO + KBSMC0.250.71
OCPKBSMC0.150.689
OCPKBSMC0.20.689
OCPKBSMC0.10.68
OCPKBSMC0.30.65
OCPKBSMC0.250.631
TCGAKBSMC0.30.534
TCGAKBSMC0.10.524
TCGAKBSMC0.150.515
TCGAKBSMC0.250.495
TCGAKBSMC0.20.495
Table 4. Top-2 accuracy of external validation.
Table 4. Top-2 accuracy of external validation.
Cancer TypeTop-2 Accuracy for GEO + KBSMC Validation
BileDuct0.250
Bladder1.000
Breast0.926
Colorectal0.963
HeadAndNeck0.500
Kidney1.000
Liver0.964
Lung0.800
Ovary1.000
Pancreas0.900
Prostate0.800
Skin1.000
Stomach0.870
Thyroid0.889
Uterus0.667
Weighted Average0.900
Table 5. Performance comparison.
Table 5. Performance comparison.
ModelTrain SetTest Set (Metastatic)Average Accuracy
SCOPE [17]10,688201 (33 from origin site)15 types, 86%
CUP-AI-Dx [18]18,21792 (23 from origin site)18 types, 83.33%
OncoNPC [19]36,44597122 types, 41.2%
ONCOfind-AI27,94121017 types, 86.2%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Jeong, Y.; Chu, J.; Kang, J.; Baek, S.; Lee, J.-H.; Jung, D.-S.; Kim, W.-W.; Kim, Y.-R.; Kang, J.; Do, I.-G. Application of Transcriptome-Based Gene Set Featurization for Machine Learning Model to Predict the Origin of Metastatic Cancer. Curr. Issues Mol. Biol. 2024, 46, 7291-7302. https://doi.org/10.3390/cimb46070432

AMA Style

Jeong Y, Chu J, Kang J, Baek S, Lee J-H, Jung D-S, Kim W-W, Kim Y-R, Kang J, Do I-G. Application of Transcriptome-Based Gene Set Featurization for Machine Learning Model to Predict the Origin of Metastatic Cancer. Current Issues in Molecular Biology. 2024; 46(7):7291-7302. https://doi.org/10.3390/cimb46070432

Chicago/Turabian Style

Jeong, Yeonuk, Jinah Chu, Juwon Kang, Seungjun Baek, Jae-Hak Lee, Dong-Sub Jung, Won-Woo Kim, Yi-Rang Kim, Jihoon Kang, and In-Gu Do. 2024. "Application of Transcriptome-Based Gene Set Featurization for Machine Learning Model to Predict the Origin of Metastatic Cancer" Current Issues in Molecular Biology 46, no. 7: 7291-7302. https://doi.org/10.3390/cimb46070432

Article Metrics

Back to TopTop