Article
Peer-Review Record

Machine Learning and Multiple Imputation Approach to Predict Chlorophyll-a Concentration in the Coastal Zone of Korea

Water 2022, 14(12), 1862; https://doi.org/10.3390/w14121862
by Hae-Ran Kim 1, Ho Young Soh 2, Myeong-Taek Kwak 3,* and Soon-Hee Han 4,*
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Reviewer 5: Anonymous
Reviewer 6: Anonymous
Submission received: 20 April 2022 / Revised: 4 June 2022 / Accepted: 7 June 2022 / Published: 10 June 2022
(This article belongs to the Section Oceans and Coastal Zones)

Round 1

Reviewer 1 Report

Summary and general comments

In this study, the authors used supervised learning to enhance the seawater physiochemical dataset with missing data and evaluated the performance of each ML algorithm.

The study is interesting and scientifically sound; however, there are some issues the authors need to clarify. Please see the comments below for the feedback and use whichever authors see fit to improve the quality of the manuscript.

Specific comments

  1. The abstract can be improved. It focuses only on the ML part; the authors also need to say a bit more about the chlorophyll section.
  2. Some of the abbreviations were not defined at first use in the abstract.
  3. It is not clear about “21C oil”. Is it a database?
  4. There are many abbreviations in the introduction which are not defined. Do recheck the manuscript for all of these errors. Do look for an online resource on proper uses of abbreviations in scientific writing.
  5. Line 86-90. There are many problems with this paragraph. Many terms and abbreviations are inserted without any context or definition. Is SVD the same as SVR? What are MICE and Amelia? Are they software? Some of these are machine learning algorithms.
  6. Line 91-105. This is the paragraph where the authors highlight the main aims and objectives of this study. Obviously, chapter 1 is the introduction and chapter 5 is the conclusion; therefore, there is no need to list the content of the different chapters. What the authors need to do is describe the objectives of this study clearly.
  7. Line 118 “chemical (pH, DO, SPM, POC, PON, DSi, DIP, DIN” It is not clear what some of these abbreviations represent. I suggest authors to include an abbreviation page.
  8. Section 2.1. Grammatical errors. Many words in the middle of sentences are capitalized, e.g.,
    “…salinity, Transparency),….”
    “of Water temperature (℃), Salinity (psu), Transparency (m),”.
    Do correct all these errors.
  9. Line 216 “SMAPE, MAE, MAPE, MSE” some of these abbreviations are not defined. It is difficult to know what they mean.
  10. Line 292-302. Were the Shapiro test and PCA performed using the same software?
  11. Section 3.5 In this section, the authors only mentioned how the performance of the ML algorithm is evaluated, however, did not mention how these algorithms were optimised. Please see the article on how the authors describe the methodology https://doi.org/10.1016/j.jtice.2021.11.001
  12. Table 3. The R2 is very low. Are these values due to the default setting of the algorithms? Can the values of R2 be improved if the algorithms are optimised?
  13. Section 3.5. Although XGBoost is said to be the best-performing algorithm, its RMSE and R2 values are very close to those of random forest and SVR. Is the performance of RF and SVR far off?
  14. There are many variants of regression tree or decision tree. Do look into the software package and determine specifically which variant is used.
  15. Authors have been using rf as an abbreviation for random forests algorithm and should continue to use the abbreviated form.
  16. How are the predicted values for the missing data? Are there any outliers detected?

Author Response

Response to Reviewer 1

[General Comment] In this study, the authors used supervised learning to enhance the seawater physiochemical dataset with missing data and evaluated the performance of each ML algorithm. The study is interesting and scientifically sound; however, there are some issues the authors need to clarify. Please see the comments below for the feedback and use whichever authors see fit to improve the quality of the manuscript.

Response: Thank you for your comment.

Specific comments

[Comment 1] The abstract can be improved. It focuses only on the ML part; the authors also need to say a bit more about the chlorophyll section.

Response: Thank you for your suggestion. We have mentioned it in the revised manuscript. [Page1, Abstract]

[Comment 2] Some of the abbreviations were not defined at first use in the abstract.

Response: Thank you for your comment. We have mentioned it in the revised manuscript. [Page1, Abstract]

 

[Comment 3] It is not clear about “21C oil”. Is it a database?

Response: We have replaced “21C oil” with “oil of the 21st Century” for better understanding. [Page2, Introduction]

 [Comment 4] There are many abbreviations in the introduction which are not defined. Do recheck the manuscript for all of these errors. Do look for an online resource on proper uses of abbreviations in scientific writing.

Response: We have taken the reviewer’s comment into full consideration and reflected it throughout the entire manuscript.

[Comment 5] Line 86-90. There are many problems with this paragraph. Many terms and abbreviations are inserted without any context or definition. Is SVD the same as SVR? What are MICE and Amelia? Are they software? Some of these are machine learning algorithms.

Response: Thank you for your comment. The problem arose because the terminology of the referenced paper was carried over without being aligned with this manuscript. That paper used the term SVD with the same meaning as SVR. MICE and Amelia are R packages. We have revised the sentence. [Page 2, Line 93-94]

[Comment 6] Line 91-105. This is the paragraph where the authors highlight the main aims and objectives of this study. Obviously, chapter 1 is the introduction and chapter 5 is the conclusion; therefore, there is no need to list the content of the different chapters. What the authors need to do is describe the objectives of this study clearly.

Response: Thank you very much for the reminder. We have revised the sentence as suggested. [Page 1-2, Line 36-41, 97-100]

“Google Scholar search containing all three keywords (i.e., imputation, chlorophyll-a, and machine learning) showed about 150 cases in the last two years at the beginning of May of 2022, while there were about 359,000 machine learning cases, 19,000 chlorophyll-a cases, and 25,000 imputation cases, respectively in the single keyword search. The small cases of convergence research with the three keywords demonstrate the need for this study.”

 

“Therefore, our study utilized techniques such as multiple imputation and machine learning for the marine coastal ecosystem observation data. We attempted to predict Chl-a and derived the importance of input features for the target variable (i.e., Chl-a) in the marine field”

 

[Comment 7] Line 118 “chemical (pH, DO, SPM, POC, PON, DSi, DIP, DIN” It is not clear what some of these abbreviations represent. I suggest authors to include an abbreviation page.

Response: Thank you very much for the reminder. We have added it to the caption of Table/Figure for a better understanding. Please refer to the revised manuscript.

 

[Comment 8] Section 2.1. Grammatical errors. Many words in the middle of sentences are capitalized, e.g., “…salinity, Transparency),….” and “of Water temperature (℃), Salinity (psu), Transparency (m),”. Do correct all these errors.

Response: Thank you very much for the reminder. We have corrected it. [Page 3, section 2.1]

[Comment 9] Line 216 “SMAPE, MAE, MAPE, MSE” some of these abbreviations are not defined. It is difficult to know what they mean.

Response: Thank you very much for the reminder. We have corrected the sentence as follows:

“symmetric mean absolute percentage error (SMAPE), MAE, mean absolute percentage error (MAPE), mean squared error (MSE), and RMSE” [Page 6, section 2.4, Line 224-225]
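
For readers who want the definitions behind these acronyms, the following is a minimal Python sketch of the five error metrics. It is an illustration only and not the authors' code (the record states they worked in R). SMAPE has several variants in the literature; the version with the (|observed| + |predicted|) / 2 denominator is assumed here, and the toy values are invented.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(mse(y_true, y_pred))

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent (undefined if a true value is 0)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def smape(y_true, y_pred):
    """Symmetric MAPE in percent; assumed variant with (|t| + |p|) / 2 in the denominator."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        denom = (abs(t) + abs(p)) / 2
        # A pair where both values are 0 contributes no error.
        total += 0.0 if denom == 0 else abs(t - p) / denom
    return 100.0 * total / len(y_true)

# Invented toy values, not data from the study.
observed = [2.0, 4.0, 8.0]
predicted = [1.0, 4.0, 10.0]
for name, fn in [("MAE", mae), ("MSE", mse), ("RMSE", rmse), ("MAPE", mape), ("SMAPE", smape)]:
    print(name, round(fn(observed, predicted), 4))
```

MAE, MSE, and RMSE are in the units of the variable (or its square), whereas MAPE and SMAPE are relative measures, so they behave differently when observed values are close to zero.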

 

[Comment 10] Line 292-302. Were the Shapiro test and PCA performed using the same software?

Response: We used the R software for Shapiro test and PCA. We have added the sentence as follows:

“We used R software for all data preprocessing and data analysis.” [Page 3, section 2.1, Line 124-125]

[Comment 11] Section 3.5 In this section, the authors only mentioned how the performance of the ML algorithm is evaluated, however, did not mention how these algorithms were optimised. Please see the article on how the authors describe the methodology https://doi.org/10.1016/j.jtice.2021.11.001

Response: Thank you very much for the reminder. We have revised section 3.5 with the addition of your suggested reference. [Page 12, section 3.5] Please refer to the revised manuscript.

[Comment 12] Table 3. The R2 is very low. Are these values due to the default setting of the algorithms? Can the values of R2 be improved if the algorithms are optimised?

Response: Thank you for your comment. We tested different hyperparameter combinations for the machine learning models. The R2 values remained low even with the optimized hyperparameters, so we considered the cause carefully; however, we had no choice but to present the results obtained with these data. We believe the Chl-a prediction accuracy in this study is low because features in the meteorological and hydrodynamic categories were not considered, and only physical and chemical features were used.

[Comment 13] Section 3.5. Although XGBoost is said to be the best-performing algorithm, its RMSE and R2 values are very close to those of random forest and SVR. Is the performance of RF and SVR far off?

Response: Thank you very much for pointing this out. The values of R2 and RMSE in Table 3 are arithmetic averages of 10 values calculated from ten models trained through 10-fold cross-validation. XGboost showed slightly better values than other algorithms on the basis of R2 and RMSE. We revised section 3.5 considering your comment. [Page 12, section 3.5].
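
The averaging described in this response can be sketched as follows. This is a hedged Python illustration, not the authors' R workflow: a one-feature least-squares line stands in for the learners, the data are synthetic, and names such as `cross_validate` are hypothetical.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (a stand-in for any learner)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def r2_and_rmse(ys, preds):
    """R2 and RMSE of predictions on one held-out fold."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot, math.sqrt(ss_res / len(ys))

def cross_validate(xs, ys, k=10, seed=0):
    """Train k models, score each on its held-out fold, and average the scores."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        preds = [a + b * xs[i] for i in held_out]
        scores.append(r2_and_rmse([ys[i] for i in held_out], preds))
    mean_r2 = sum(s[0] for s in scores) / k
    mean_rmse = sum(s[1] for s in scores) / k
    return mean_r2, mean_rmse, scores

# Synthetic data for illustration only.
data_rng = random.Random(1)
xs = [data_rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 + 0.5 * x + data_rng.gauss(0, 1) for x in xs]
mean_r2, mean_rmse, fold_scores = cross_validate(xs, ys, k=10)
print(round(mean_r2, 3), round(mean_rmse, 3))
```

Keeping the ten per-fold scores, rather than only their mean, makes it possible to compare the fold-to-fold spread with the gap between algorithms, which bears on the reviewer's question of whether a slightly better average is a meaningful difference.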

[Comment 14] There are many variants of regression tree or decision tree. Do look into the software package and determine specifically which variant is used.

Response: Thank you very much for pointing this out. We have added the sentence as follows:

“The regression tree algorithm (i.e., rpart in R) is optimized by adjusting the hyperparameter (minsplit=16, maxdepth=9, cp=0.01).” [Page 12, section 3.5].

[Comment 15] Authors have been using rf as an abbreviation for random forests algorithm and should continue to use the abbreviated form.

Response: Thank you very much for pointing this out. In this paper, the uniformity of terminology is specified as follows:

“rf : random forest imputations which is built-in imputation method of the mice function, the mice package in R” [Page5, Figure2 caption].

“random forest : machine learning algorithms”

[Comment 16] How are the predicted values for the missing data? Are there any outliers detected?

Response: Thank you very much for pointing this out. Figure 8 (b) shows that imputed values of pH have no outlier in 2016. Figure 8 (d) shows that imputed values of transparency have no outlier in 2016.

Reviewer 2 Report

Paper water-1713140 “Machine learning and multiple imputation approach to predict chlorophyll-a concentration in the Coastal Zone of Korea”

 

Comments

This study focuses on the machine learning and multiple imputation approach to predict chlorophyll-a concentration in the coastal zone of Korea. I think the paper fits well the scope of the journal and addresses an important subject. However, a number of revisions are required before the paper can be considered for publication. There are some weak points that have to be strengthened. Below please find more specific comments:

 

*Abstract: The abstract seems to be adequate. No comments.

*Introduction: I suggest to provide some statistical information to highlight the importance of the subject at hand (probably in the first paragraph). Please also use supporting references for the statistical information provided.

*The literature review coverage seems to be acceptable. Just please check for the most recent and relevant studies that have been published over the last years (i.e., the last 2-3 years). It is essential that the literature review is up to date.

*Section 2: Please provide a bit more detailed discussion regarding the selection of the study area. Some readers may question the study area selection, if it is not justified adequately.

*The authors primarily rely on machine learning in this study. The authors should create a broader and more general discussion regarding the importance of advanced AI algorithms (e.g., not just machine learning but also heuristics, metaheuristics, self-adaptive algorithms) for challenging decision problems (not just prediction of the chlorophyll-a concentration). There are many different domains where advanced AI algorithms have been applied as solution approaches, such as online learning, scheduling, multi-objective optimization, transportation, medicine, data classification, and others (not just the decision problem addressed in this study). The authors should create a discussion that highlights the effectiveness of advanced AI algorithms in the aforementioned domains. This discussion should be supported by the relevant references, including but not limited to the following:

An online-learning-based evolutionary many-objective algorithm. Information Sciences 2020, 509, pp.1-21.

Two hybrid meta-heuristic algorithms for a dual-channel closed-loop supply chain network design problem in the tire industry under uncertainty. Advanced Engineering Informatics 2021, 50, p.101418.

A many-objective evolutionary algorithm with angle-based selection and shift-based density estimation. Information Sciences 2020, 509, pp.400-419.

An Optimization Model and Solution Algorithms for the Vehicle Routing Problem with a “Factory-in-a-Box”. IEEE Access 2020, 8, pp.134743-134763.

An Adaptive Polyploid Memetic Algorithm for scheduling trucks at a cross-docking terminal. Information Sciences 2021, 565, pp.390-421.

A proposal for distinguishing between bacterial and viral meningitis using genetic programming and decision trees. Soft Computing 2019, 23(22), pp.11775-11791.

Such a discussion will help improving the quality of the manuscript significantly. Such a discussion can be placed either in section 1 or 2.

*Please provide a bit more details regarding the input data used throughout the experiments. In particular, some supporting references would be helpful to justify the data selection.

*The manuscript contains quite a lot of figures/tables with the results from numerical experiments. Please try to provide a more detailed description of these figures/tables to make sure that the future readers will have a reasonable understanding of the main findings.

*The conclusions section could expand on limitations of this study and future research needs. I suggest listing the bullet points.

 

Author Response

Response to Reviewer 2

[General Comment] This study focuses on the machine learning and multiple imputation approach to predict chlorophyll-a concentration in the coastal zone of Korea. I think the paper fits well the scope of the journal and addresses an important subject. However, a number of revisions are required before the paper can be considered for publication. There are some weak points that have to be strengthened. Below please find more specific comments:

Response: Thank you for your comment.

 

Specific comments

[Comment 1] Introduction: I suggest to provide some statistical information to highlight the importance of the subject at hand (probably in the first paragraph). Please also use supporting references for the statistical information provided.

Response: Thank you very much for your suggestion. We have added the sentence as follows:

“Google Scholar search containing all three keywords (i.e., imputation, chlorophyll-a, and machine learning) showed about 150 cases in the last two years at the beginning of May of 2022, while there were about 359,000 machine learning cases, 19,000 chlorophyll-a cases, and 25,000 imputation cases, respectively in the single keyword search. The small cases of convergence research with the three keywords demonstrate the need for this study.” [Page 1, section 1, Line 35-40].

 

[Comment 2] The literature review coverage seems to be acceptable. Just please check for the most recent and relevant studies that have been published over the last years (i.e., the last 2-3 years). It is essential that the literature review is up to date.

Response: Thank you for the nice reminder. We believe that the cited literature is sufficiently recent.

[Comment 3] Section 2: Please provide a bit more detailed discussion regarding the selection of the study area. Some readers may question the study area selection, if it is not justified adequately.

Response: Thank you for the nice reminder. We have revised section 2.1 [Page 3, section 2.1] and provided the following citation to support the study area. Please refer to the revised manuscript.

“Kim Y.N.; Yoo J.K.; Yeo J.W.; Kho B.S.; Hwang I.S. History and Status of the National Marine Ecosystem Monitoring Program in Korea. The Sea: Journal of the Korean Society of Oceanography. 2019, 24(1), 49-53”

 

[Comment 4] The authors primarily rely on machine learning in this study. The authors should create a broader and more general discussion regarding the importance of advanced AI algorithms (e.g., not just machine learning but also heuristics, metaheuristics, self-adaptive algorithms) for challenging decision problems (not just prediction of the chlorophyll-a concentration). There are many different domains where advanced AI algorithms have been applied as solution approaches, such as online learning, scheduling, multi-objective optimization, transportation, medicine, data classification, and others (not just the decision problem addressed in this study). The authors should create a discussion that highlights the effectiveness of advanced AI algorithms in the aforementioned domains. This discussion should be supported by the relevant references, including but not limited to the following:

Response: Thank you for this suggestion. It would have been interesting to explore this aspect. However, in the case of our study, it seems slightly out of scope because we do not currently have knowledge of advanced AI algorithms such as heuristics, metaheuristics, and self-adaptive algorithms, and we do not know much about their utility in the field of marine and fisheries.

 

[Comment 5] Please provide a bit more details regarding the input data used throughout the experiments. In particular, some supporting references would be helpful to justify the data selection.

Response: Thank you for the nice reminder. We have revised section 2.1 and added sentences as follows: [Page 3, section 2.1]

 

“Survey stations and survey items can be found in the reference to the history and status of the national marine ecosystem monitoring program in Korea [54].”

“The provided raw data consisted of several excel files organized by category within each year folder. Data preprocessing started by reading an excel file into R (version 4.1.2; https://www.r-project.org). We used R software for all data preprocessing and data analysis. Since the data in each category are mainly classified by year folder, they were primarily merged horizontally and then sorted and filtered. Next, we vertically connected each integrated data (i.e., five-year data combined by category) based on the year, season, observatory, and depth.”

 

[Comment 6] The manuscript contains quite a lot of figures/tables with the results from numerical experiments. Please try to provide a more detailed description of these figures/tables to make sure that the future readers will have a reasonable understanding of the main findings.

Response: Thank you for the nice reminder. We went through the entire manuscript to provide a more detailed description of these figures/tables. We have added sentences and detailed captions of the figures/tables. Please refer to the revised manuscript.

 

[Comment 7] The conclusions section could expand on limitations of this study and future research needs. I suggest listing the bullet points

Response: Thank you for your comment. We have added the sentence as follows:

“The results of our study suggest that our overall process and techniques can be generalized to make biological feature (e.g., phytoplankton abundance, zooplankton abundance, and Chl-a) predictions and derive important influencing features.

It is estimated that the reason why the Chl-a prediction accuracy of this study is low is that the feature in the meteorological and hydrodynamic categories was not considered in aquatic ecosystem situations, and only physical and chemical features were used. For more accurate predictions of biological features using machine learning or deep learning in the future, it is necessary to collect long-term accumulation of field observation data and various features presented in Table 5.” [Page 16, 5.conclusions]

Reviewer 3 Report

The authors have submitted an interesting manuscript on the prediction of chlorophyll-a concentration in sea water in Korea. The paper mainly focuses on imputation methods for missing data. The description of the methods is very good and the research procedure seems adequate.

However, the wording must be substantially improved, regarding both coherence and English grammar. There are several aspects that are not well explained and, consequently, cannot be fully understood until the full paper has been read. I cannot list all these issues, but I will indicate many of them in order to show what aspects should be improved:

  • Many acronyms and variables are not adequately defined when they first appear in the text. For example, the relevant variables are not described until the Discussion section (lines 424-431). The general rule is to describe an acronym and put the acronym in parenthesis the first time it appears. Variables should be defined after their first use.
  • Line 40: “21C oil” Please consider rewording (for example, the oil of the 21st century)
  • Line 45: “sometimes over half of the values are missing”. This is a very vague idea, sometimes is more, sometimes is less…
  • Line 61: “The prediction of Chl-a, algal bloom warning, is mainly used in machine learning methods rather than traditional multiple regression analysis methods”. Please, consider rewording this sentence (machine learning is used instead of multiple regression for predicting chl-a)
  • Line 74: KNN (k nearest neighbors) vs. k nearest neighbors (KNN). Please check all the acronyms this way.
  • Lines 109-110: Please check “data was” – “data were”
  • Lines 118-120: There is an unnecessary duplication of the variables list. Moreover, the meaning of the variables should be given here. Please also check the consistency of variable names (capitals) and units (the symbol for liter is “l”, not “L”, etc.)
  • Line 127. Please, try to explain better this: initial dataset with both depths gives 30% missing data; final dataset with shallower depth gives XX missing data (it is not until line 237 that we discover that this is about 19%, so it is advantageous to discard deeper depth data)
  • Lines 141-145. Remember “acronym explanation (acronym)” format
  • Line 145. Please explain here where the 20 complete datasets come from (20 different variables?, 20 combinations of season, station…? , any other method?)
  • Line 159. What is n in the RMSE equation of Figure 2?
  • Line 216. Acronyms to be described
  • Figure 6. It seems that 20 dots appear in each method. Please, explain this in the text in connection with the “20 complete datasets” issue referred above.
  • Figure 6. An explanation should be given regarding 0 dispersion for cart method and possible overfitting.
  • Figure 7(b), please check Dim1-Dim2 labeling in connection with text in lines 308-309 (Dim1 is parallel to DIN,DIP,NOx?)

The improvement obtained with the different imputation methods on the model predictions should be better described. I would suggest adding a table (similar to Table 3) for different models and different imputation methods (perhaps choosing the most relevant error metrics). The question to be answered: Does all this effort in data imputation have a relevant effect on model predictions improvement?

Author Response

Response to Reviewer 3

[General Comment] This study focuses on the machine learning and multiple imputation approach to predict chlorophyll-a concentration in the coastal zone of Korea. I think the paper fits well the scope of the journal and addresses an important subject. However, a number of revisions are required before the paper can be considered for publication. There are some weak points that have to be strengthened. Below please find more specific comments:

Response: Thank you for your comment.

 

Specific comments

[Comment 1] Many acronyms and variables are not adequately defined when they first appear in the text. For example, the relevant variables are not described until the Discussion section (lines 424-431). The general rule is to describe an acronym and put the acronym in parenthesis the first time it appears. Variables should be defined after their first use.

Response: Thank you for the nice reminder. We went through the entire manuscript to apply a general rule of an acronym. Please refer to the revised manuscript.

 

[Comment 2] Line 40: “21C oil” Please consider rewording (for example, the oil of the 21st century)

Response: Thank you for the nice reminder. We have replaced 21C oil with oil of the 21st Century. [Page 2, section 1, Line 48]

 

[Comment 3] Line 45: “sometimes over half of the values are missing”. This is a very vague idea, sometimes is more, sometimes is less…

Response: Thank you for the nice reminder. We have revised the sentence as follows:

“ sometimes it may have missing values of more than half.” [Page 2, section 1, Line 53-54]

 

[Comment 4] Line 61: “The prediction of Chl-a, algal bloom warning, is mainly used in machine learning methods rather than traditional multiple regression analysis methods”. Please, consider rewording this sentence (machine learning is used instead of multiple regression for predicting chl-a).

Response: Thank you for the nice reminder. We have revised it. [Page 2, section 1, Line 70-71]

 

[Comment 5] Line 74: KNN (k nearest neighbors) vs. k nearest neighbors (KNN). Please check all the acronyms this way.

Response: Thank you for the nice reminder. We have unified the k nearest neighbors (KNN) form. [Page 2, section 1, Line 82]

 

[Comment 6] Lines 109-110: Please check “data was” – “data were”.

Response: Thank you for the nice reminder. We have checked as follows:

“Technically, "data" is a plural noun—it is the plural form of the noun "datum." However, it is used with both singular and plural verbs.”

 

[Comment 7] Lines 118-120: There is an unnecessary duplication of the variables list. Moreover, the meaning of the variables should be given here. Please also check the consistency of variable names (capitals) and units (the symbol for liter is “l”, not “L”, etc.)

Response: Thank you for the nice reminder. We have deleted unnecessary duplication of the variables list. We have checked and modified the consistency of variable names and units. [Page3, section 2.1; Page10, Table 2] Please refer to the revised manuscript. 

 

[Comment 8] Line 127. Please, try to explain better this: initial dataset with both depths gives 30% missing data; final dataset with shallower depth gives XX missing data (it is not until line 237 that we discover that this is about 19%, so it is advantageous to discard deeper depth data).

Response: Thank you for the nice reminder. We have added the sentences as follows:

“Therefore, the number of data was 729 and the missing rate of Chl-a was 0.14%.” [Page 3, section 2.1] Please refer to the revised manuscript.

 

[Comment 9] Lines 141-145. Remember “acronym explanation (acronym)” format.

Response: Thank you for the nice reminder. We went through the entire manuscript to apply a general rule of an acronym. Please refer to the revised manuscript.

 

[Comment 10] Line 145. Please explain here where the 20 complete datasets come from (20 different variables?, 20 combinations of season, station…? , any other method?)

Response: Thank you for the nice reminder. We have revised section 2.2 for a better understanding. Please refer to section 2.2 of the revised manuscript.

 

“In the imputation phase of Figure 2, m is the number of multiple imputation datasets. We generated m=20 complete imputed datasets for each of the seven methods using the mice package which creates multiple imputations for multivariate missing data.” [Page 4, section 2.2, Line 148-151]

[Comment 11] Line 159. What is n in the RMSE equation of Figure 2?

Response: Thank you for the nice reminder. We have added the sentence as follows:

“n is the number of observations.” [Page 5, section 2.2, Line 163]

 

[Comment 12] Line 216. Acronyms to be described

Response: Thank you for the nice reminder. We have revised the sentence as follows:

“symmetric mean absolute percentage error (SMAPE), MAE, mean absolute percentage error (MAPE), mean squared error (MSE), and RMSE” [Page 6, section 2.4, Line 224-225]

 

[Comment 13] Figure 6. It seems that 20 dots appear in each method. Please, explain this in the text in connection with the “20 complete datasets” issue referred above.

Figure 6. An explanation should be given regarding 0 dispersion for cart method and possible overfitting.

Response: Thank you for the nice reminder. We have added the sentence as follows:

“By the experiment in Figure 2, we generated 20 complete imputed datasets from the incomplete dataset (n=594), so 20 RMSE values were derived for each variable, and their distribution was shown as a boxplot. And since seven imputation methods were used, seven boxplots for each variable were displayed in Figure 6. We used the RMSE as an indicator to show the difference between the true value and the imputed value for each variable. The closer the RMSE value is to 0, the better.” [Page 8, section 3.2, Line 275-280]
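
The evaluation described in this quotation can be sketched in Python as follows. This is a hedged illustration under stated assumptions and not the authors' procedure: values are masked at random in a synthetic complete variable, and two deliberately simple stand-in imputers (`random_draw` and `mean_fill`, both hypothetical names) replace the seven mice methods. Only m = 20 and the use of RMSE against the true values come from the record.

```python
import math
import random

def rmse(true_vals, imputed_vals):
    """RMSE between true and imputed values; n is the number of masked observations."""
    n = len(true_vals)
    return math.sqrt(sum((t - i) ** 2 for t, i in zip(true_vals, imputed_vals)) / n)

def evaluate_imputer(values, missing_rate, impute, m=20, seed=0):
    """Mask a share of the values, impute them m times, and return the m RMSE values."""
    rng = random.Random(seed)
    n_missing = max(1, int(len(values) * missing_rate))
    masked = set(rng.sample(range(len(values)), n_missing))
    observed = [v for i, v in enumerate(values) if i not in masked]
    truth = [values[i] for i in sorted(masked)]
    results = []
    for _ in range(m):
        # One pass over the masked cells yields one "complete" dataset.
        imputed = [impute(observed, rng) for _ in truth]
        results.append(rmse(truth, imputed))
    return results

def random_draw(observed, rng):
    """Stochastic stand-in imputer: draw a random observed value."""
    return rng.choice(observed)

def mean_fill(observed, rng):
    """Deterministic stand-in imputer: always fill with the observed mean."""
    return sum(observed) / len(observed)

# Synthetic pH-like variable; the values are invented for illustration.
data_rng = random.Random(42)
values = [data_rng.gauss(8.0, 0.2) for _ in range(594)]

draw_rmses = evaluate_imputer(values, 0.2, random_draw)
mean_rmses = evaluate_imputer(values, 0.2, mean_fill)
print(len(draw_rmses), min(draw_rmses), max(draw_rmses))
print(len(set(mean_rmses)))
```

Each list holds the 20 values that one box in a boxplot would summarize. A stochastic imputer spreads them out, whereas a deterministic one returns the same RMSE 20 times and the box collapses to a line. Whether that is the reason for the zero dispersion the reviewer noted for the cart method is not established in this record.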

 

Overfitting is not related to the imputation part but is related to the model in machine learning.

 

[Comment 14] Figure 7(b), please check Dim1-Dim2 labeling in connection with text in lines 308-309 (Dim1 is parallel to DIN,DIP,NOx?)

Response: Thank you for the nice reminder. We have revised the sentences as follows:

“We performed the principal component analysis (PCA) for exploratory data analysis. The loading plot of PCA shows how strongly each variable influences a principal component and the correlation between variables (Figure 7b). The two principal components account for 43.6% of the total variance of the data. Nutrients such as DIN, DIP, NO3, and NO2 strongly influence Dim1, while physical environment information such as water temperature, DO, pH, and salinity have more influence on Dim2. Each of POC and NH4 is positively correlated with Chl-a because the two variable vectors are close, forming a small angle. Moreover, Transparency is negatively correlated with Chl-a because they form a large angle close to 180° and is located on the opposite side of Chl-a (Figure 7b).” [Page 10-11, section 3.3, Line333-340]
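
The angle-based reading of the loading plot can be made concrete with a short Python sketch. The loading coordinates below are invented placeholders, not values from Figure 7b, and `angle_deg` is a hypothetical helper.

```python
import math

def angle_deg(u, v):
    """Angle in degrees between two 2-D loading vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

# Hypothetical (Dim1, Dim2) loadings for illustration only.
loadings = {
    "Chl-a": (0.30, 0.50),
    "POC": (0.35, 0.45),
    "Transparency": (-0.30, -0.45),
}

small = angle_deg(loadings["Chl-a"], loadings["POC"])
large = angle_deg(loadings["Chl-a"], loadings["Transparency"])
print(round(small, 1), round(large, 1))
```

An angle near 0° suggests a positive correlation, one near 180° a negative correlation, and one near 90° little correlation. The reading is approximate: with two components explaining 43.6% of the variance, as stated above, angles in the plane can differ noticeably from the correlations in the full data.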

 

[Comment 15] The improvement obtained with the different imputation methods on the model predictions should be better described. I would suggest adding a table (similar to Table 3) for different models and different imputation methods (perhaps choosing the most relevant error metrics). The question to be answered: Does all this effort in data imputation have a relevant effect on model predictions improvement?

Response: Thank you for your comment.

We consider the imputation of missing values a necessary step before building a machine learning model. Listwise deletion is the default way of handling incomplete data in many statistical packages, including R, SAS, and Stata: a case is dropped from the analysis if it has a missing value in at least one of the specified variables. This can reduce the sample size and introduce bias. Our aim was to replace the missing values through multiple imputation, instead of deleting incomplete cases, before applying machine learning.
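
The effect of listwise deletion described above can be shown with a small sketch. The records, field names, and helper functions below are hypothetical and serve only to illustrate how scattered missing values compound; they are not taken from the study's dataset.

```python
def listwise_delete(rows, variables):
    """Keep only rows with no missing (None) value in the specified variables."""
    return [r for r in rows if all(r.get(v) is not None for v in variables)]


def missing_fraction(rows, variables):
    """Fraction of rows in which each variable is missing."""
    return {v: sum(r.get(v) is None for r in rows) / len(rows) for v in variables}


# Toy records, invented for illustration only.
rows = [
    {"temp": 12.1, "salinity": 33.1, "din": 0.21},
    {"temp": None, "salinity": 32.8, "din": 0.18},
    {"temp": 15.0, "salinity": None, "din": 0.25},
    {"temp": 18.2, "salinity": 31.9, "din": None},
    {"temp": 21.7, "salinity": 30.4, "din": 0.09},
    {"temp": 24.3, "salinity": 29.8, "din": 0.07},
]
variables = ["temp", "salinity", "din"]

complete = listwise_delete(rows, variables)
fractions = missing_fraction(rows, variables)

for name, frac in fractions.items():
    print(f"{name}: {frac:.0%} missing")
print(f"rows kept after listwise deletion: {len(complete)} of {len(rows)}")
```

In this toy case each variable is missing in only one of six rows, yet half of the rows are lost once all three variables are required, which is the kind of sample shrinkage the response refers to.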

 

For multiple imputation, an RMSE is obtained for every variable, whereas the machine learning step predicts only the target variable, Chl-a. We therefore think a boxplot is more appropriate for presenting the imputation results.

Reviewer 4 Report

The authors studied machine learning and multiple imputation approaches to predict chlorophyll-a concentration in the Coastal Zone of Korea. The manuscript can be published after minor revision. Comments for the authors:

  • Some references need more explanation, such as 14-18.
  • Please add more information about Fig. 2.
  • In the results section, more discussion is needed before going through the results.
  • Fig. 6 is not clear; please replace it with a higher-quality figure.
  • Figs. 7 and 8 are not clear.

 

Author Response

Response to Reviewer 4

[General Comment] The authors studied machine learning and multiple imputation approaches to predict chlorophyll-a concentration in the Coastal Zone of Korea. The manuscript can be published after minor revision. Comments for the authors:

Response: Thank you for your comment.

Specific comments

[Comment 1] Some references need more explanation, such as 14-18.

Response: Thank you for your comment.

We cited references 14-18 as examples of applying machine learning to predict chlorophyll-a, and we have compiled the features used for predicting chlorophyll-a and algal blooms in Table 5.

[Comment 2] Please add more information about Fig. 2.

Response: Thank you for the reminder. We have revised the text and the caption of Fig. 2. [Pages 4-5, section 2.2]

[Comment 3] In the results section, more discussion is needed before going through the results.

Response: Thank you very much. We acknowledge that our discussion is not perfect, but we did our best.

[Comment 4] Fig. 6 is not clear; please replace it with a higher-quality figure.

Response: Thank you very much for your comment. We have replaced Fig. 6 with a new figure for better understanding.

[Comment 5] Figs. 7 and 8 are not clear.

Response: Thank you very much for your comment. We have replaced Fig. 8 with a new figure for better understanding. We have also added more description to the caption of Fig. 7.

Reviewer 5 Report

Dear Authors,

I read your paper. From my point of view, the results of your work are ready for publication in their present form. Congratulations.

Yours sincerely,

Reviewer

 

Author Response

Response to Reviewer 5

[General Comment] I read your paper. From my point of view, the results of your work are ready for publication in their present form. Congratulations.

Response: Thank you very much. We went through the entire manuscript to check the English language and style for a better understanding.

Reviewer 6 Report

The authors analyzed surface water data observed during the spring and summer of 2015 to 2019 in the Coastal Zone of Korea. For biological research, this is a sufficient period to draw correct conclusions. The authors tried to predict chlorophyll-a concentration using machine learning, replacing the missing data through multiple imputation rather than deleting them. In this respect, the paper is innovative. Collecting physical, chemical, biological, meteorological, and hydrodynamic parameters and determining their role as features for predicting chlorophyll-a and algal blooms is a new approach to environmental research and indicates the possibility of using mathematical methods to predict the size of biomass in water bodies. The key to applying machine learning methods in environmental research is access to basic data on the state of the environment; the authors proposed an interesting solution that allows missing environmental data to be supplemented with mathematical methods. The proposed method can be used in studies of other natural environments.

Author Response

Response to Reviewer 6

[General Comment] The authors analyzed surface water data observed during the spring and summer of 2015 to 2019 in the Coastal Zone of Korea. For biological research, this is a sufficient period to draw correct conclusions. The authors tried to predict chlorophyll-a concentration using machine learning, replacing the missing data through multiple imputation rather than deleting them. In this respect, the paper is innovative. Collecting physical, chemical, biological, meteorological, and hydrodynamic parameters and determining their role as features for predicting chlorophyll-a and algal blooms is a new approach to environmental research and indicates the possibility of using mathematical methods to predict the size of biomass in water bodies. The key to applying machine learning methods in environmental research is access to basic data on the state of the environment; the authors proposed an interesting solution that allows missing environmental data to be supplemented with mathematical methods. The proposed method can be used in studies of other natural environments.

Response: Thank you very much. We went through the entire manuscript to check the English language and style for a better understanding.

Round 2

Reviewer 1 Report

There are a lot of improvements after the revision.

Author Response

[Comments and Suggestions for Authors]

There are a lot of improvements after the revision.

Response: Thank you for your comment. The grammar and language of the manuscript have been polished by Editage (http://www.cactusglobal.com). Please refer to the revised manuscript.

Reviewer 2 Report

The authors took my previous comments seriously and made the required revisions to the manuscript. The quality and presentation of the manuscript have improved. Therefore, I recommend acceptance.

Author Response

Response to Reviewer 2

[Comments and Suggestions for Authors]

The authors took my previous comments seriously and made the required revisions to the manuscript. The quality and presentation of the manuscript have improved. Therefore, I recommend acceptance.

Response: Thank you for your comment. The grammar and language of the manuscript have been polished by Editage (http://www.cactusglobal.com). Please refer to the revised manuscript.

Reviewer 3 Report

The authors have improved the paper or provided adequate explanations.

Author Response

Response to Reviewer 3

[Comments and Suggestions for Authors]

The authors have improved the paper or provided adequate explanations.

Response: Thank you for your comment. The revision included extensive English editing: the grammar and language of the manuscript have been polished by Editage (http://www.cactusglobal.com). Please refer to the revised manuscript.
