Evaluation of Genomic Contamination Detection Tools and Influence of Horizontal Gene Transfer on Their Efficiency through Contamination Simulations at Various Taxonomic Ranks
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
CRitical Assessment of genomic COntamination detection at several Taxonomic ranks (CRACOT) by Cornet et al. attempted to compare the detection performance of six commonly used tools in genome contamination detection. The manuscript is well-written but would benefit from some improvements.
The Introduction is good but says nothing about HGT.
In like vein, the methodology section would benefit from a justification for the inclusion of HGT and some description of the method used. As it stands, HGT seems to have appeared unplanned in the work, and a lay reader might wonder why a confirmatory technique was left out of the manuscript until the discussion section.
Other suggestions are highlighted in the attached reviewed manuscript.
Comments for author File: Comments.pdf
Comments on the Quality of English Language
Need minor corrections
Author Response
Reviewer 1
CRitical Assessment of genomic COntamination detection at several Taxonomic ranks (CRACOT) by Cornet et al. attempted to compare the detection performance of six commonly used tools in genome contamination detection. The manuscript is well-written but would benefit from some improvements.
The Introduction is good but says nothing about HGT.
- HGT has been added to the introduction, with a definition (Lines 41 and 55).
In like vein, the methodology section would benefit from a justification for the inclusion of HGT and some description of the method used. As it stands, HGT seems to have appeared unplanned in the work, and a lay reader might wonder why a confirmatory technique was left out of the manuscript until the discussion section.
- The HGT analysis had been planned from the start of the study, but the reviewer is correct that the original structure of the manuscript did not do justice to its importance. We have restructured and expanded the text in several places to better introduce the HGT aspects of the study (Lines 41, 55 and 137).
Other suggestions are highlighted in the attached reviewed manuscript.
Line 62-64: The median contamination was 0.45% for CheckM V1.2.1 [6], 0.02% for GUNC V1.0.5 [8], 0.87% for BUSCO V5.4.3 [7], 22.3% for Physeter V0.213470 [5], 2.45% for Kraken2 V2.1.2 [9], 8% for CheckM2 V0.1.3 [11]. I suggest you arrange in either descending or ascending %.
- The sentence has been rewritten so that the values appear in ascending order.
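For reference, the ascending order of the medians quoted in the comment above can be checked with a minimal Python sketch. The values are copied from the reviewer's comment; the variable names are illustrative only and do not come from the manuscript.

```python
# Median contamination (%) per tool, as quoted in the reviewer comment above.
median_contamination = {
    "CheckM V1.2.1": 0.45,
    "GUNC V1.0.5": 0.02,
    "BUSCO V5.4.3": 0.87,
    "Physeter V0.213470": 22.3,
    "Kraken2 V2.1.2": 2.45,
    "CheckM2 V0.1.3": 8.0,
}

# Sort tools by their median contamination, lowest first.
ordered = sorted(median_contamination.items(), key=lambda item: item[1])

for tool, pct in ordered:
    print(f"{tool}: {pct}%")
```

Sorting on the numeric value rather than the tool name gives the order GUNC, CheckM, BUSCO, Kraken2, CheckM2, Physeter.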
Line 117-118: CRACOT was used to simulate contamination events, not only the redundant, replaced or single type separately, but also as a combination of the three types. Rephrase to make this sentence more comprehensive.
- The sentence has been modified, Line 132.
Line 152: Our simulations were performed at six different taxonomic ranks, from intra-phylum to intra-species. What is the justification for this approach?
- We added a sentence to explain the justification of the taxonomic ranks. Line 186.
Line 208: "conflated". Is this a typo?
- We have changed “conflated” to “confused”.
Line 235: www.mdpi.com/xxx/s1 Correct Link
- This will be updated by the journal after the reviews.
Reviewer 2 Report
Comments and Suggestions for Authors
The manuscript provides a freely available contamination simulation framework (CRACOT); the paper shows a simulation framework with 705 high quality genomes belonging to the Clostridia class, which were simulated contamination levels of genomes generated with CRACOT. The accuracy of six tools for detecting the level of CRACOT-generated contamination was then compared, showing that the contamination simulation framework. The paper could be useful for assessing the accuracy of future tools designed to detect the presence of multiple marker genes or for whole genome studies. However, the manuscript presents some inconsistencies that should be corrected before being accepted for publication in the Applied Microbiology Journal.
The title of the study refers to the CRACOT contamination simulation framework but does not describe the content of the manuscript. Please propose a more descriptive title.
The abstract of the manuscript does not clearly describe the objectives, methods and main results obtained, which makes it difficult both to properly understand and to disseminate the study. Please elaborate a more descriptive study.
The introduction of the manuscript presents some general elements of the genome contamination study but is not sufficient for proper understanding of the scope and importance of the study, especially for the lay reading public. Please enrich this section in a general way. Please include the concepts of contamination types, plasmids, protein prediction, orthology inference, orthogroups, chimerism of genome contigs and horizontal gene transfer, as well as describe more fully the tools available for the study of marker genes and whole genomes.
In the Materials and Methods section of the manuscript it is noted (Lines: 61-64) that the initial 705 genomes had varying percentages of contamination (median 0.02-22.3%) detected by the tools evaluated. This raises the question of whether this data should be in the results section or is the initial state of contamination. If it is a result, please relocate and re-elaborate on this information and if it is the initial state of the test genomes, please include in the discussion the foreseeable effect of starting from input genomes without contamination and only evaluate the effect of simulated contamination with CRACOT. Additionally, it is necessary to analyze and discuss each tool separately, comparing the purpose for which the specific algorithm was created and its use in detecting contamination.
In the Materials and Methods section of the manuscript it is noted (Lines: 101-103) that the "duplicate" contamination events were extracted from the set of common orthogroups and the corresponding gene sequences from the slave genome were added to the end of the last contig of the master genome, modifying the position of a gene within the genome, as well as its genomic context. Please analyze and discuss the advantage of not replacing the contaminating gene at the position of the original genes and the effect that forcing the position of the contaminating gene may have on the contamination estimation of the evaluated tools.
All figures included in the manuscript require re-editing or replacement. Please make the following corrections: Figure 1 requires spacing and text sizes to be corrected; It also requires defining the abbreviation THG (Horizontal Gene Transfer). In Figures 2, S1-S3, the axes of different amplitude make it difficult to compare the tools; the title of the "Y" axis uses "percentage" as a variable which is only the unit of measurement of the contamination level; the blue line (median values of the contamination level) and the red numbers (Spearman correlation values) are very faint and overlap the outline of the violin.
Comments on the Quality of English Language
The manuscript uses imprecise colloquial terms ("Nowadays", "it is now", etc.) that induce inadequate understanding of the message implicit in the text. Please correct. An inadequate use of commas in the captions of figures S2 and S3 is also detected. Define the abbreviation THG (Horizontal Gene Transfer) from the first time it appears in the text.
Author Response
Reviewer 2
The manuscript provides a freely available contamination simulation framework (CRACOT); the paper shows a simulation framework with 705 high quality genomes belonging to the Clostridia class, which were simulated contamination levels of genomes generated with CRACOT. The accuracy of six tools for detecting the level of CRACOT-generated contamination was then compared, showing that the contamination simulation framework. The paper could be useful for assessing the accuracy of future tools designed to detect the presence of multiple marker genes or for whole genome studies. However, the manuscript presents some inconsistencies that should be corrected before being accepted for publication in the Applied Microbiology Journal.
The title of the study refers to the CRACOT contamination simulation framework but does not describe the content of the manuscript. Please propose a more descriptive title.
- The title has been changed to “Evaluation of Genomic Contamination Detection Tools and Influence of Horizontal Gene Transfer on Their Efficiency through Contamination Simulations at Various Taxonomic Ranks”.
The abstract of the manuscript does not clearly describe the objectives, methods and main results obtained, which makes it difficult both to properly understand and to disseminate the study. Please elaborate a more descriptive study.
- The abstract has been modified to better describe the goals and our results, notably in terms of HGT. Lines 13-26.
The introduction of the manuscript presents some general elements of the genome contamination study but is not sufficient for proper understanding of the scope and importance of the study, especially for the lay reading public. Please enrich this section in a general way. Please include the concepts of contamination types, plasmids, protein prediction, orthology inference, orthogroups, chimerism of genome contigs and horizontal gene transfer, as well as describe more fully the tools available for the study of marker genes and whole genomes.
- The introduction has been modified to add the types of contamination and HGT (Lines 41 and 55), chimerism (Line 40), and orthology inference, orthogroups and plasmids (Line 56). We believe that protein prediction is better introduced in the methods section.
In the Materials and Methods section of the manuscript it is noted (Lines: 61-64) that the initial 705 genomes had varying percentages of contamination (median 0.02-22.3%) detected by the tools evaluated. This raises the question of whether this data should be in the results section or is the initial state of contamination. If it is a result, please relocate and re-elaborate on this information and if it is the initial state of the test genomes, please include in the discussion the foreseeable effect of starting from input genomes without contamination and only evaluate the effect of simulated contamination with CRACOT. Additionally, it is necessary to analyze and discuss each tool separately, comparing the purpose for which the specific algorithm was created and its use in detecting contamination.
- It is the initial state of contamination. We now state this clearly in the methods (Line 81) and discuss the use of this initial contamination state (Line 182). The goals of all tools with regard to the simulations are now discussed (Line 228).
In the Materials and Methods section of the manuscript it is noted (Lines: 101-103) that the "duplicate" contamination events were extracted from the set of common orthogroups and the corresponding gene sequences from the slave genome were added to the end of the last contig of the master genome, modifying the position of a gene within the genome, as well as its genomic context. Please analyze and discuss the advantage of not replacing the contaminating gene at the position of the original genes and the effect that forcing the position of the contaminating gene may have on the contamination estimation of the evaluated tools.
- The influence of adding duplicated genes at the end of the contig is discussed for the six tools (Line 116).
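The "duplicate" event described in the comment above can be illustrated with a minimal Python sketch. This is not CRACOT's implementation: the function name, the data layout (a list of contig identifier and sequence pairs) and the toy sequences are all assumptions made for illustration.

```python
def add_duplicated_genes(master_contigs, donor_genes):
    """Return a copy of master_contigs with donor gene sequences
    appended to the end of the last contig (hypothetical helper)."""
    if not master_contigs:
        raise ValueError("master genome has no contigs")
    contigs = list(master_contigs)  # shallow copy; the input list is not modified
    last_id, last_seq = contigs[-1]
    # The original genes stay in place; donor copies are added at the end.
    contigs[-1] = (last_id, last_seq + "".join(donor_genes))
    return contigs


# Toy data, purely illustrative.
master = [("contig_1", "ATGAAA"), ("contig_2", "ATGCCC")]
donor = ["ATGGGG", "ATGTTT"]
contaminated = add_duplicated_genes(master, donor)
```

In this sketch the master genome keeps its own gene copies, so each affected orthogroup is present twice, while the added copies sit outside their original genomic context, which is the effect the reviewer asks the authors to discuss.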
All figures included in the manuscript require re-editing or replacement. Please make the following corrections: Figure 1 requires spacing and text sizes to be corrected; It also requires defining the abbreviation THG (Horizontal Gene Transfer). In Figures 2, S1-S3, the axes of different amplitude make it difficult to compare the tools; the title of the "Y" axis uses "percentage" as a variable which is only the unit of measurement of the contamination level; the blue line (median values of the contamination level) and the red numbers (Spearman correlation values) are very faint and overlap the outline of the violin.
- Thank you for the detailed suggestions.
The axes of different amplitudes in Figures 2 and S1-S3 make the figures easier to understand, so we decided to keep this version. The version with the same axes for all panels is presented below.
The manuscript uses imprecise colloquial terms ("Nowadays", "it is now", etc.) that induce inadequate understanding of the message implicit in the text. Please correct. An inadequate use of commas in the captions of figures S2 and S3 is also detected. Define the abbreviation THG (Horizontal Gene Transfer) from the first time it appears in the text.
- “Nowadays” was deleted and “it is now” has been changed to “it is”. The captions have been changed and HGT is now defined at the beginning of the introduction.