Article
Peer-Review Record

Application of Transcriptome-Based Gene Set Featurization for Machine Learning Model to Predict the Origin of Metastatic Cancer

Curr. Issues Mol. Biol. 2024, 46(7), 7291-7302; https://doi.org/10.3390/cimb46070432 (registering DOI)
by Yeonuk Jeong 1,†, Jinah Chu 2,†, Juwon Kang 1,3, Seungjun Baek 1, Jae-Hak Lee 1, Dong-Sub Jung 1, Won-Woo Kim 1, Yi-Rang Kim 1, Jihoon Kang 1,* and In-Gu Do 2,*
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 29 May 2024 / Revised: 3 July 2024 / Accepted: 3 July 2024 / Published: 9 July 2024
(This article belongs to the Collection Bioinformatics Approaches to Biomedicine)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The authors implemented logistic regression, LightGBM, and SVM classifiers to validate their proposed model. They combined these three different types of classifiers into what is generally called an ensemble model (an ensemble combines similar types of classifiers). It is also important to mention whether 5-fold or 10-fold cross-validation was used, as this detail is currently missing.

 

The authors applied the machine learning model to predict various types of cancer using a fairly large dataset. They mentioned the reason for using the stack ensemble model is that LightGBM is renowned for its high efficiency, making it particularly effective for handling large datasets with high dimensionality. However, the authors compared their model with KBSMC+GCO, which has slightly more than two hundred samples (shown in the form of confusion matrices in Figures 3 C & D). These samples are insufficient to justify the superiority of the model. It would be better for the authors to either stick with their own dataset or collect more data for a more robust comparison.

The authors need to address, in the revised version, whether their proposed model has potential overfitting issues.

Author Response

Comments 1: The authors implemented logistic regression, LightGBM, and SVM classifiers to validate their proposed model. They combined these three different types of classifiers into what is generally called an ensemble model (an ensemble combines similar types of classifiers). It is also important to mention whether 5-fold or 10-fold cross-validation was used, as this detail is currently missing.


Response 1: Thank you for pointing this out. I agree with this comment. We had already performed validation on the training set but omitted it from the manuscript. We have added the model's 5-fold cross-validation score to the last paragraph of the 'Introduction' and the first paragraph of the 'Results and Discussion'.
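
To illustrate what a 5-fold cross-validation score reports, the following is a minimal sketch in plain Python. It is not the authors' pipeline: the nearest-centroid classifier is a stand-in for the stacked ensemble, the toy data and class labels are invented for the example, and the helper names (`stratified_folds`, `cross_val_accuracy`) are hypothetical.

```python
import random
from collections import defaultdict
from statistics import mean


def stratified_folds(labels, k=5, seed=0):
    """Assign each sample index to one of k folds, keeping class balance."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def nearest_centroid_fit(X, y):
    """Stand-in classifier (assumption): one mean vector per class."""
    sums, counts = {}, defaultdict(int)
    for row, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        sums[label] = [s + v for s, v in zip(sums[label], row)]
        counts[label] += 1
    return {c: [s / counts[c] for s in vec] for c, vec in sums.items()}


def nearest_centroid_predict(centroids, row):
    """Return the class whose centroid is closest to the row."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda c: sq_dist(centroids[c]))


def cross_val_accuracy(X, y, k=5, seed=0):
    """Return the per-fold accuracies of a k-fold cross-validation."""
    scores = []
    for held_out in stratified_folds(y, k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = nearest_centroid_fit([X[i] for i in train],
                                     [y[i] for i in train])
        hits = sum(nearest_centroid_predict(model, X[i]) == y[i]
                   for i in held_out)
        scores.append(hits / len(held_out))
    return scores


# Toy data (invented): two well-separated classes with two features each.
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)] + \
    [[rng.gauss(5, 1), rng.gauss(5, 1)] for _ in range(50)]
y = ["lung"] * 50 + ["colon"] * 50

fold_scores = cross_val_accuracy(X, y, k=5)
print("per-fold accuracy:", fold_scores)
print("mean 5-fold accuracy:", mean(fold_scores))
```

Each sample is held out exactly once, so the mean of the five fold accuracies estimates performance on data the model did not see during fitting. Reporting the per-fold spread alongside the mean also gives a rough indication of overfitting.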

Comments 2: The authors applied the machine learning model to predict various types of cancer using a fairly large dataset. They mentioned the reason for using the stack ensemble model is that LightGBM is renowned for its high efficiency, making it particularly effective for handling large datasets with high dimensionality. However, the authors compared their model with KBSMC+GCO, which has slightly more than two hundred samples (shown in the form of confusion matrices in Figures 3 C & D). These samples are insufficient to justify the superiority of the model. It would be better for the authors to either stick with their own dataset or collect more data for a more robust comparison.

Response 2: Thank you for pointing this out. I have revised the Results and Discussion section to emphasize this point. Primary cancer data are abundant because many primary tumors are surgically removed, but many patients do not undergo surgery once metastasis occurs, so metastatic cancer data are scarce and difficult to obtain. First, we added a subsection, 'Performance of ONCOfind-AI', to compare the number of primary carcinomas used for training and the number of metastases used for testing with those in other similar studies. Second, in the Limitations and Future Work subsection, we stated that we will continue to work with hospitals to accumulate metastatic carcinoma data to improve our model.
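
The concern about test-set size can be made concrete with a confidence interval on accuracy. The sketch below computes a Wilson score interval for a binomial proportion. The counts are hypothetical values chosen for illustration (an accuracy of 90% on 200 and on 2000 samples), not results reported in the manuscript, and `wilson_interval` is a helper name introduced here.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total
                         + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Hypothetical counts for illustration only (not results from the paper).
for n in (200, 2000):
    low, high = wilson_interval(round(0.9 * n), n)
    print(f"n={n}: 95% interval for 90% accuracy = [{low:.3f}, {high:.3f}]")
```

With a few hundred test samples the interval spans several percentage points, so small differences in accuracy between two models evaluated on such a set may not be distinguishable.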

Reviewer 2 Report

Comments and Suggestions for Authors

Clarification is needed on how the features are computed. Is each feature an enrichment score (e.g., single-sample GSEA) computed over the 8300 genes in each patient/sample? Consider adding a fuller description of the methods for this part.

Tumor purity could be a confounding factor in the overall origin prediction. Was there any effort to evaluate the model's robustness in this regard?

Consider adding cross-validation to ensure model robustness.

Limited information is reported for replicability and/or reproducibility.

Author Response

Comments 1: Clarification is needed on how the features are computed. Is each feature an enrichment score (e.g., single-sample GSEA) computed over the 8300 genes in each patient/sample? Consider adding a fuller description of the methods for this part.

Response 1: Thank you for pointing this out. I agree with this comment. I have revised ‘2.2. Featurization and Feature Selection’ to emphasize this point. I added explanations at the beginning and end of the first paragraph.
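
For readers unfamiliar with gene set featurization, the following sketch shows one simple way to turn per-gene expression values into per-gene-set features. It is an assumption-laden illustration, not the scoring used in the manuscript and not ssGSEA: each set is scored by the mean within-sample rank of its genes. The function name `gene_set_scores`, the toy gene sets, and the expression values are all invented for the example.

```python
def gene_set_scores(expression, gene_sets):
    """Score each gene set in one sample by its mean within-sample rank.

    expression: dict mapping gene symbol -> expression value (one sample).
    gene_sets:  dict mapping set name -> iterable of gene symbols.
    Returns a dict mapping set name -> score in [0, 1]. Genes absent from
    the sample are ignored; sets with no measured genes get None.
    """
    ordered = sorted(expression, key=expression.get)
    n = len(ordered)
    # Normalized rank: lowest-expressed gene -> 0.0, highest -> 1.0.
    rank = {g: (i / (n - 1) if n > 1 else 0.5) for i, g in enumerate(ordered)}
    scores = {}
    for name, genes in gene_sets.items():
        present = [rank[g] for g in genes if g in rank]
        scores[name] = sum(present) / len(present) if present else None
    return scores


# Toy sample and toy gene sets (invented values, illustration only).
sample = {"ALB": 900.0, "APOA1": 750.0, "SFTPC": 3.0,
          "SFTPB": 5.0, "ACTB": 400.0, "GAPDH": 500.0}
toy_sets = {"liver_toy": ["ALB", "APOA1"], "lung_toy": ["SFTPC", "SFTPB"]}

scores = gene_set_scores(sample, toy_sets)
print(scores)
```

Because the score depends only on the ordering of genes within a sample, it is insensitive to the absolute scale of the measurements. That property is one common motivation for rank-based scores when data from different platforms are combined, although whether it applies to the authors' method is not stated here.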

Comments 2: Tumor purity could be a confounding factor in the overall origin prediction. Was there any effort to evaluate the model's robustness in this regard?

Response 2: The tumor volume was not available for all samples from Kangbuk Samsung Medical Center, but the average obtained was 50% (range 10–80%). The TCGA data used for training have different purity distributions for each cancer type, but the average for each cancer type mostly lies between 0.38 and 0.72 (doi: 10.1038/s42003-023-04764-8).

Because the training data are primary cancer samples and the gene set feature system proposed in this study also generates organ- or tissue-specific features, the model should be able to learn the characteristics of those regions even from tumor margins. We also consider that the model's detection of close to 90% of the primary sites, despite an average tumor purity of around 50%, supports the reliability of the model.

Comments 3: Consider adding cross-validation to ensure model robustness.

Response 3: Thank you for pointing this out. I agree with this comment. We had already performed validation on the training set but omitted it from the manuscript. We have added the model's 5-fold cross-validation score to the last paragraph of the Introduction and the first paragraph of the Results and Discussion.

Comments 4: Limited information is reported for replicability and/or reproducibility.

Response 4: We will offer the model as a service through our company and release it on GitHub so that anyone can run it and check its execution.

Reviewer 3 Report

Comments and Suggestions for Authors

The manuscript titled "Application of Transcriptome-Based Gene Set Featurization for Machine Learning Model to Predict the Origin of Metastatic Cancer" presents a novel and significant approach to improving the diagnosis of metastatic cancer origins, particularly for cases of Cancer of Unknown Primary (CUP). The study introduces a machine learning framework, ONCOfind-AI, which leverages transcriptome-based gene set features to enhance predictive accuracy. This is a well-conceived study; however, I have a few comments.

There is insufficient comparison with existing machine learning models and methodologies. Providing a comparative analysis would better demonstrate the advantages of the proposed approach.

The introduction could benefit from a more detailed literature review on existing machine learning approaches for predicting cancer origins to better position the novelty of this study.

The method of normalizing and integrating micro-array and RNA-seq data is not thoroughly explained, potentially affecting reproducibility.

The algorithm proposed by the authors and Transcriptome-Wide Association Studies (TWAS) share some conceptual similarities, particularly in their use of transcriptomic data and the integration of multiple data sources. However, they serve different purposes and employ different methodologies. It would be beneficial for the authors to discuss these differences in more detail and explore how TWAS methodologies could contribute to their research. The authors could consider citing the following TWAS-related literature to support this discussion: kTWAS, mkTWAS, and "Power analysis of transcriptome-wide association study".

The results section is dense and could be better organized. Key findings should be highlighted more prominently to improve readability.

The limitations of the study, particularly regarding the diversity of the dataset and potential biases, are not adequately addressed.

The conclusion is somewhat repetitive and does not offer new insights beyond what is already discussed.

Comments on the Quality of English Language

Minor editing of English language required.

Author Response

Comments 1: There is insufficient comparison with existing machine learning models and methodologies. Providing a comparative analysis would better demonstrate the advantages of the proposed approach.

Response 1: Thank you for pointing this out. I agree with this comment. In the second paragraph of the Introduction, I have provided a more detailed description of the previous research. In Results and Discussion, I added a new subsection, ‘3.1. Performance of ONCOfind-AI’, which compares its performance with recent studies in a table.

Comments 2: The introduction could benefit from a more detailed literature review on existing machine learning approaches for predicting cancer origins to better position the novelty of this study.

Response 2: In the second paragraph of the Introduction, I have provided a more detailed description of the previous research.

Comments 3: The method of normalizing and integrating micro-array and RNA-seq data is not thoroughly explained, potentially affecting reproducibility.

Response 3: Thank you for pointing this out. I agree with this comment. I have revised ‘2.2. Featurization and Feature Selection’ to emphasize this point. I added explanations at the beginning and end of the first paragraph.

Comments 4: The algorithm proposed by the authors and Transcriptome-Wide Association Studies (TWAS) share some conceptual similarities, particularly in their use of transcriptomic data and the integration of multiple data sources. However, they serve different purposes and employ different methodologies. It would be beneficial for the authors to discuss these differences in more detail and explore how TWAS methodologies could contribute to their research. The authors could consider citing the following TWAS-related literature to support this discussion: kTWAS, mkTWAS, and "Power analysis of transcriptome-wide association study".

Response 4: In the Introduction, we now mention how researchers have used TWAS to study genes associated with cancer or other diseases, and we cite the kTWAS paper.

Comments 5: The results section is dense and could be better organized. Key findings should be highlighted more prominently to improve readability.

Response 5: Thank you for pointing this out. We have revised the overall text and added a new subsection '3.1. Performance of ONCOfind-AI' to highlight the novelty of the study.

Comments 6: The limitations of the study, particularly regarding the diversity of the dataset and potential biases, are not adequately addressed.

Response 6: We will integrate additional public databases to test whether we can achieve the same results with more than three databases. We added this content to the Limitations and Future Work section.

Comments 7: The conclusion is somewhat repetitive and does not offer new insights beyond what is already discussed.

Response 7: I agree with this comment. I removed the Conclusion section and made the Results and Discussion section more comprehensive instead.

Round 2

Reviewer 3 Report

Comments and Suggestions for Authors

I think that the authors have adequately addressed the comments made by the reviewers in the revised version of the manuscript. Therefore, I have no further comments.

Author Response

Thank you for your kind review. 

Sincerely,

Yeonuk Jeong
