Machine-Learning-Based Proteomic Predictive Modeling with Thermally-Challenged Caribbean Reef Corals
Round 1
Reviewer 1 Report
At this point, the paper should be returned to the author for substantial revision without extensive reviews.
- There are too many figures, tables, and acronyms, which make it harder to follow the story and dilute what the main findings are. (Figures need improvement, as well.)
- The study design is not clear, and Figure 1 does not help me understand the study design well. I understand that it can be complicated to explain a nested study design, and I highly recommend a diagram to help readers.
- Results and Discussion are in one section, which can be meaningful for short manuscripts, but I believe this manuscript will be better understood by separating it into two sections. Also, some of the results should be moved to the methods (especially the beginning of section 3.2).
Author Response
Reviewer #1’s summary: At this point, the paper should be returned to the author for substantial revision without extensive reviews.
Author’s response to reviewer #1’s summary: Thank you for taking the time to critically review my article and for making comments and suggestions to improve its flow and readability. Both Figure 1 and Table 1 were confusing, in part because of the unexpected result that not all corals from the same reef (or of the same genotype) behaved similarly in the bleaching assay. In reality, the most useful portion of the manuscript that outlined this was Table A1, so I have now replaced former Table 1 with former Table A1 (i.e., new Table 1). By comparing the various columns, one can see which corals experienced which fates in the bleaching assay. I have also better defined the two parameters I tracked and attempted to predict; after not having read this article in many weeks, I realized that I had never really defined them clearly. Since SO many methods were employed, it was clearly easy for the reader to get lost, and because I had written this for data analytics experts (not coral biologists, who are actually the target audience), the text assumed a lot of statistical knowledge that many marine biologists may not have. Therefore, I have now included “road map” introductory paragraphs at the beginning of both the Materials and Methods and Results and Discussion sections that effectively outline my plan of attack. Hopefully, it is now clearer why I took the approach that I did.
Reviewer #1’s major comment #1: There are too many figures, tables, and acronyms, which make it harder to follow the story and dilute what the main findings are. (Figures need improvement, as well.)
Author’s response to reviewer #1’s major comment #1: Thank you for your suggestions, and I can see why it could be hard to navigate this article. In addition to removing the original Figure 1, which appeared to be confusing, I moved former Figures 4-5 to the appendix because the associated method (stepwise discriminant analysis) was ultimately found to be inappropriate for these sorts of data. I also moved the associated text to the manuscript. I had wanted to show a comparison between a very simple, cheap model and a complex, expensive one, but since the latter was the only one that performed well, it doesn’t make sense to leave the inferior one in the main text, where it could be distracting. Now, you will find that there are only three figures (#3 is easily the most important) and six tables (#1 and #6 being the most important).
Reviewer #1’s major comment #2: The study design is not clear, and Figure 1 does not help me understand the study design well. I understand that it can be complicated to explain a nested study design, and I highly recommend a diagram to help readers.
Author’s response to reviewer #1’s major comment #2: You are correct that I did not do a good job of explaining the goals and why I made such a huge effort to try out so many different statistical approaches. I have now provided a highly revised methods section that outlines the plan of attack. I also have a “road map” paragraph in the Results and Discussion to let the reader know what I will discuss. At the end of the day, it is Table 6 that is the most important part of the article, despite being so boring to look at (not sure how to make an algorithm striking), but I do appreciate your suggestion to try to remove unnecessary figures and make the work more focused (see previous comment on this). Hopefully now the reader can more fully grasp what I set out to achieve: to make a predictive model based on protein concentrations that would tell me if a coral would be likely to bleach or not. Since this had never been done before, I had to develop a lot of the methods from scratch, and originally I had wanted to include everything, but I now see that that was too distracting, so I have moved basically all of the things that did NOT work (or that I would not recommend to others) to the appendix. In essence, this should be read as a methods article since the actual experiment was already published.
I have now taken a supplemental table that described the coral fragment and coral colony health states and made that the new Figure 1. I have no idea why I didn’t do this before, because this table basically summarizes the entire study and so should have been in the main text from the beginning, and for that I apologize. I also have a new paragraph that describes the nomenclature, which I had done a poor job of explaining before. This is critical because a “healthy” fragment could be taken from a “bleaching-resistant” or “bleaching-susceptible” colony, whereas a bleaching fragment could only come from a bleaching-susceptible colony. It sounds like a no-brainer, but I suspect this was the cause of some confusion. Basically, the artificial intelligence needs to be able to detect which data are from which type of fragment, as well as from which type of colony, so it needs to be trained with different combinations of fragments of differing health states. This should now be more apparent.
Reviewer #1’s major comment #3: Results and Discussion are in one section, which can be meaningful for short manuscripts, but I believe this manuscript will be better understood by separating it into two sections. Also, some of the results should be moved to the methods (especially the beginning of section 3.2).
Author’s response to reviewer #1’s major comment #3: Section 3.2 (now Section 3.3) describes results of this study. I did strongly consider dividing the Results and Discussion into two sections given the prior complexity of the article. However, since the article is now more focused (with ~two pages moved to the appendix) and hopefully has a more logical flow given the restructuring of the methods (which then influenced the structure of the Results and Discussion), I think it may still work better combined. That said, if you and the editor (being the two most critical of the four reviewers) still think it would read better as two sections, I will be happy to split it. Thanks again for your comments, which have hopefully resulted in a superior manuscript.
Reviewer 2 Report
This manuscript contains a proteomic comparative analysis of O. faveolata and symbionts under different conditions. The author has collected considerable data and done an extensive analysis with a view to developing a predictive tool for bleaching events based on proteomic data. It is unclear to me what the benefit of such a tool might be, if by predicting a bleaching event something could be done to mitigate it? Would modelling bleaching be improved by collecting early warning of bleaching events? Transcriptomic and proteomic approaches will both suffer from large feature size (~100k features for transcripts and proteins respectively) compared to very small sample sizes (10-100) and both will suffer from the curse of dimensionality and overfitting. The limited number of peptides identified is something of a blessing in this respect, identifying the most abundant proteins in the sample, which are probably ideal targets for a simple robust test to warn of bleaching. The work is thorough and a number of statistical and analytic approaches were used, but the motivation for the work is not clear to me.
Author Response
Author’s response to reviewer #2: Thank you for reviewing my article. You are correct in pointing out that I never blatantly stated the overarching goal of this project, which was as you hypothesized: to develop a predictive model capable of forecasting coral bleaching susceptibility. I have now added two sentences to the end of the Introduction to emphasize this. It would be a shame were a reader of this article to wonder why I spent so much money and did so much work only to produce a lot of dense, statistical mumbo-jumbo!
“Were a rigorously field-tested model developed that was capable of predicting bleaching susceptibility weeks or even months before the advent of a high-temperature event, marine managers could attempt to mitigate local environmental stressors in a way that may limit the corals’ stress loads. Additionally, knowledge of a coral’s bleaching susceptibility could be useful for reef restoration initiatives, in which BLR corals would clearly be better suited targets for outplanting than BLS conspecifics.”
I had strayed away from including this sort of statement before since I did not yet properly field-test these models, but I do think it’s important to state the long-term goal of this project.
Additionally, you are correct in pointing out the issue with the “wide” nature of the molecular datasets (in which you'll always have many more features than samples). In fact, this was exactly why the statistical pipeline was so complex; I had to tailor it to approaches that could handle wide datasets. For instance, MANOVA could not be used to analyze these data. It was also important to use approaches that can handle multicollinearity, another hallmark of these sorts of datasets (i.e., many analytes correlating with one another). Missing data, as I mention in the Discussion, is what I am most worried about in the future; what if I have the world’s best protein-based predictive model, only to NOT sequence those same proteins in my TBD field corals (very likely to be the case)? This is surely the major limitation of this entire ‘Omics+machine-learning approach (over-fitting certainly being a concern as well).
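The point about “wide” datasets can be illustrated with a minimal sketch. It uses synthetic random numbers, not the coral data, and the dimensions (20 samples, 86 features) are borrowed from figures quoted elsewhere in this review record purely for illustration. When there are more features than samples, the feature covariance matrix cannot be full rank, which is one reason covariance-inverting methods such as MANOVA break down.

```python
import numpy as np

# Synthetic stand-in for a "wide" dataset: far more features than samples.
# The sizes are illustrative assumptions, not the author's actual data matrix.
rng = np.random.default_rng(0)
n_samples, n_features = 20, 86
X = rng.normal(size=(n_samples, n_features))

# Feature-by-feature covariance matrix (86 x 86).
cov = np.cov(X, rowvar=False)

# After mean-centering, at most n_samples - 1 directions carry variance,
# so the matrix is rank-deficient and cannot be inverted.
rank = np.linalg.matrix_rank(cov)
print(f"covariance shape: {cov.shape}, rank: {rank}")
```

Because the rank is capped at one less than the number of samples, adding more proteins without adding more corals only makes the matrix more singular.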
Reviewer 3 Report
Very nice and complete. Original by the huge amount of perspectives. Felicitations, I appreciate it very much (and I must say it is the first time I'm so enthusiastic about a review!).
Author Response
Author’s response to reviewer #3: Thank you very much for the endorsement of my article. Having taken in the suggestions of the editor, as well as those of the other two reviewers, I hope you will later find it even easier to read (with a more focused Results and Discussion, in particular, as well as more methods incorporated into the main text).
Round 2
Reviewer 1 Report
The paper has improved its readability. However, overall, it is not a strong manuscript since the study design is highly limited (limited sample sizes in terms of # of genotypes, # colonies, # of samples, and how the long- and short-term experiments were mixed). It is hard to think that a good predictive model can be constructed from such small sample sizes; only 20 samples, of which only two were actively bleaching, and the number of proteins used was also small. It is not surprising to read that the model produced 100% accuracy because of the small sample size. The model was likely highly overfitted.
How the author filters the proteins to be included also unfortunately likely limited the proteins used to the most common ones, which were part of basic metabolic functions and were not necessarily the interesting proteins. Bleaching susceptibility has been shown to be a highly complex response in corals, involving a large number of genes (e.g. Fuller et al. 2020). Thousands of proteins are expressed in a cell at a time, so the 40 most common coral host proteins were likely not the right candidates to predict bleaching susceptibility (running sophisticated ML approaches on a limited sample cannot yield sophisticated results).
Author Response
Reviewer #1’s summary: However, overall, it is not a strong manuscript since the study design is highly limited (limited sample sizes in terms of # of genotypes, # colonies, # of samples, and how the long- and short-term experiments were mixed). It is hard to think that a good predictive model can be constructed from such small sample sizes; only 20 samples, of which only two were actively bleaching, and the number of proteins used was also small. It is not surprising to read that the model produced 100% accuracy because of the small sample size. The model was likely highly overfitted.
Author’s response to reviewer #1’s summary: Thank you for having a second look at my article. You are correct in pointing out that only two samples were bleaching, but, for the purposes of model-building, it was more important that there was a mix of bleaching-susceptible and bleaching-resistant colonies used in the experiments; in fact, it was actually not necessary to analyze the proteome of a bleaching sample at all, only that of samples that later went on to bleach. From experience, I can tell you that the proteome of a significantly bleached coral is hard, if not impossible, to analyze since cell death pathways cannot be distinguished from cells that have simply died. The associated RNAs, DNAs, and proteins are also highly degraded. But that is beside the more significant point, that of the small sample size. Even when combined with our prior work with these corals, the sample size only increases to 35-40 (i.e., still small). In the future, I hope to analyze the proteomes of many more corals. It is worth mentioning that an article with no fresh (newly acquired) samples that employed a machine learning+’Omics approach was published last year in Nature Communications (Roach et al. 2021), though the authors opted to leave in all metabolites, regardless of their identity or confidence scores. Perhaps this work, then, would be better suited for that journal.
Reviewer #1’s major comment #2: How the author filters the proteins to be included also unfortunately likely limited the proteins used to the most common ones, which were part of basic metabolic functions and were not necessarily the interesting proteins. Bleaching susceptibility has been shown to be a highly complex response in corals, involving a large number of genes (e.g. Fuller et al. 2020). Thousands of proteins are expressed in a cell at a time, so the 40 most common coral host proteins were likely not the right candidates to predict bleaching susceptibility (running sophisticated ML approaches on a limited sample cannot yield sophisticated results).
Author’s response to reviewer #1’s major comment #2: The goals of this work were to 1) corroborate the results of a competing proteomic technology and 2) develop proteomic predictive models, not to develop a cellular model of the bleaching process. In fact, I have a paragraph in the Discussion that states just why these data should NOT be used to develop mechanistic inference (one reason for which was stated above, i.e., the low sample size for bleaching samples). With respect to the predictive model, this is actually untrue, as the models were validated and tested with samples that did not feature in the model building. Given the extensive length of the methods section (which spanned the main text and the supplementary materials), this may have been missed, but it is a point of critical importance: machine learning approaches are, as pointed out, likely to fit 100% of the variation in the training dataset when numbers of samples and analytes are low (as in this work). For this reason, it is imperative to validate the models in a number of ways (as explained in the article). Of course, the optimal means of ensuring that the models were not overfit is to “test” them with field samples, which is the goal of my future work.
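The distinction drawn here between training fit and held-out performance can be sketched as follows. This is a hedged illustration on synthetic noise, not the author's pipeline: the least-squares classifier, the label coding, and the leave-one-out scheme are all assumptions chosen for brevity. The labels are deliberately unrelated to the features, so any apparent skill on the training data is pure overfitting.

```python
import numpy as np

def fit_predict(X_train, y_train, X_test):
    """Minimum-norm least-squares classifier on labels coded -1/+1.

    A simple stand-in model (hypothetical, not the author's method).
    """
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return np.where(A_test @ coef >= 0, 1.0, -1.0)

rng = np.random.default_rng(0)
n_samples, n_features = 20, 86  # illustrative sizes only
X = rng.normal(size=(n_samples, n_features))
y = np.array([1.0, -1.0] * (n_samples // 2))  # labels carry no signal

# Training accuracy: the model is scored on the same samples it was fit to.
train_acc = float(np.mean(fit_predict(X, y, X) == y))

# Leave-one-out: each sample is predicted by a model that never saw it.
hits = []
for i in range(n_samples):
    keep = np.arange(n_samples) != i
    pred = fit_predict(X[keep], y[keep], X[i:i + 1])
    hits.append(pred[0] == y[i])
loo_acc = float(np.mean(hits))

print(f"training accuracy: {train_acc:.2f}")
print(f"leave-one-out accuracy: {loo_acc:.2f}")
```

With more features than samples the training fit is exact even though the labels are noise, which is why only the held-out score says anything about predictive power.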
My goal for this work was to essentially compare methods and provide a pipeline and analytical framework for incorporating ‘Omics data into predictive models. As I mentioned in my response to the editor, this could just as easily be carried out with other analytes: gene mRNAs, lipids, metabolites, or even physiological response metrics (provided they are acquired/measured at pre-lethal levels). As such, I do not necessarily believe that proteins will always be the best analytes. They might be, though mRNAs or other response variables could be just as good (if not better), especially since mRNAs in particular yield far larger datasets than proteomics. Nevertheless, these models trained with only 86 proteins were still able to predict the bleaching susceptibility of samples held back from the models. Whether their predictive power holds in situ remains to be determined.