Article
Peer-Review Record

A Comparison of Machine Learning Models for Predicting Flood Susceptibility Based on the Enhanced NHAND Method

Sustainability 2023, 15(20), 14928; https://doi.org/10.3390/su152014928
by Caisu Meng and Hailiang Jin *
Submission received: 31 July 2023 / Revised: 8 October 2023 / Accepted: 9 October 2023 / Published: 16 October 2023
(This article belongs to the Section Hazards and Sustainability)

Round 1

Reviewer 1 Report

The comments can be found in the attached PDF file.

Comments for author File: Comments.pdf

Editing of English language required.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

This study introduces a new model, NHAND (New Height Above the Nearest Drainage), to assess the performance of various machine learning models (three individual models and three ensemble models) in predicting flood susceptibility. It revealed that the Support Vector Machine (SVM), Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Stacking models performed better when elevation was included as a flood-inducing factor, while the ensemble models consistently showed superior prediction accuracy and stability. The study also emphasizes that the selection of flood inducers significantly influences model performance, suggesting ensemble models as a better choice for analyzing complex flood susceptibility problems.

Generally, the paper is well written and presents a topic of relevance and broad interest to the journal's readers. The research evaluates a total of six ML models, ranging from the simplest white-box models to complex black-box models. This approach provides a useful means of identifying the optimal predictive model and of assessing whether complex black-box models are needed, which is a strength of the paper. However, the paper exhibits critical flaws in both its methodology and results, as outlined below. Therefore, the paper should be carefully revised to address the following major comments.

1.     The methodology depicted in Figure 2 is misleading and flawed. If 5-fold cross-validation is used, then the validation set should account for 20% of the total, not 30%.

2.     Lines 328-330: If 70% of the dataset is allocated for training and 30% for validation, how much of the dataset is used for model testing? Is there no evaluation performed on unseen or test data?

3.     The evaluation of the models' performance is lacking because it relies solely on the coefficient of determination. A comprehensive evaluation should involve common metrics such as root mean squared error (RMSE), mean squared error (MSE), and mean absolute percentage error (MAPE). Incorporating these metrics would provide a more thorough understanding of model performance.

4.     The study's methodology is not adequately explained and lacks depth. For example, the discussion of K-fold cross-validation and the model training procedures is underdeveloped. Please expand the explanation of the methodology.

5.     Hyperparameter optimization is a key step in developing machine learning models, yet it appears to have been overlooked or inadequately executed in this study. Please provide a detailed account of the optimization processes and methods for each model. All model hyperparameters should be optimized simultaneously using a systematic approach such as grid search or random search, considering a wide range of hyperparameter values (see the illustrative sketch after this comment list). For more guidance, refer to Section 3.6 and Figure 10 of doi.org/10.1016/j.jclepro.2022.134203 and Section 5 of doi.org/10.1016/j.engstruct.2022.113903.

6.     All hyperparameters of all models should be optimized. Therefore, it's crucial to discuss the hyperparameters of all models, their roles, and their values post-optimization. The lack of appropriate hyperparameter optimization, a significant aspect of machine learning model development, exposes a serious deficiency in the study's methodology. This needs to be corrected as the research findings' validity hinges on the accuracy and authenticity of this stage.

7.     Table 3: Why are only a few hyperparameters considered for each model? If a systematic and appropriate methodology is not followed, the developed models and study results will lack validity and significance.

8.     The results shown in Figure 10 and Figure 12 clearly highlight the models' inadequacies, which stem from the deficient and inadequate methodology.

9.     The implementation of the stacking model in this study appears to be flawed. In a stacking model, any machine learning models can be combined to enhance predictive accuracy and generalizability. It is unclear why Lasso regression was chosen for the stacking model over other advanced models with greater predictive capability, such as XGBoost. Furthermore, several model types can be stacked in a stacking ensemble, yet it is not clear why Lasso was the only one chosen. This reveals significant flaws in the paper. Please refer to related literature, such as doi.org/10.1016/j.engstruct.2021.112808 and doi.org/10.1016/j.jclepro.2022.135279, where several models are used in a stacking model to develop robust models, and revise accordingly.

10.  Results: The results are not discussed in depth. The models should be comprehensively evaluated on both the training and unseen/test datasets using various performance evaluation metrics. Please refer to the literature and substantially revise the results and discussion. Additionally, scatterplots for both the training and test datasets should be provided to clearly demonstrate the model's adequacy and generalizability.

11.  How were the results in Figure 13 obtained? For feature analysis and interpretability of black-box models like RF and XGBoost, it is recommended to use the recently developed model-agnostic unified Shapley Additive exPlanation (SHAP) approach. This approach can provide both local and global interpretability and analyze the influence of input features on the predicted response.

12.  The conclusions section requires substantial revisions considering the aforementioned comments.

13.  Given that the current study didn't create a graphical user interface tool using the developed model, how can others utilize the developed model?

14.  Please ensure all abbreviations are defined before they are used.
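
For illustration, the kind of workflow suggested in comments 2, 3, and 5 is sketched below: a held-out test set, grid-search tuning with 5-fold cross-validation on the training data only, and evaluation with several metrics on unseen test data. This is a minimal sketch assuming a Python setup with scikit-learn and XGBoost; the synthetic data, hyperparameter grid, and settings are placeholders rather than the authors' actual configuration.

```python
# Minimal illustrative sketch: hold out a test set, tune hyperparameters with a
# 5-fold cross-validated grid search on the training set only, then report
# several error metrics on the unseen test data.
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBRegressor

# Placeholder data standing in for the flood-inducing factors and NHAND target.
X, y = make_regression(n_samples=600, n_features=10, noise=5.0, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {                       # illustrative grid, not the authors' values
    "n_estimators": [200, 400],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": [0.7, 1.0],
}
search = GridSearchCV(
    XGBRegressor(random_state=42),
    param_grid,
    cv=5,                            # 5-fold cross-validation within the training set
    scoring="neg_root_mean_squared_error",
    n_jobs=-1,
)
search.fit(X_train, y_train)

# Evaluation on data never seen during training or tuning.
y_pred = search.best_estimator_.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Best hyperparameters:", search.best_params_)
print("RMSE:", mse ** 0.5)
print("MSE :", mse)
print("MAPE:", mean_absolute_percentage_error(y_test, y_pred))
print("R2  :", r2_score(y_test, y_pred))
```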

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

* Why did the authors choose to compare machine learning models for predicting flood susceptibility?

* What is a flood-inducing factor?

* Why do SVM, KNN, and Lasso give lower R² results?

* What is the purpose of Table 1 (Top 5 keywords with the strongest citation bursts)?

* How does the NHAND method overcome the limitations of HAND?

* What do the authors predict from Figure 5?

* What is the use of the Stacking method?

* What is the purpose of the idea in the paragraph on page 11, lines 332-334?

* Why did the authors choose a 5-fold cross-validation approach for these results?

* The authors use two different sets of flood-inducing factors. What is the difference between these two sets of data?

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

This paper can be accepted in its present form.

Moderate Editing of English Language Required.

Author Response

Thank you for your valuable feedback on our paper. After carefully considering your advice, we have thoroughly reviewed the manuscript and corrected spelling errors and awkward sentence structures to meet rigorous academic standards.

Reviewer 2 Report

This reviewer appreciates the authors' efforts to address the comments raised. However, there are still fundamental flaws in the paper that call into question its validity and significance.

The purpose of stacking ensemble models is to improve predictive accuracy by combining multiple base models. This approach aims to leverage the strengths of each individual model to create a more robust and accurate meta-model. However, in this study, the stacking ensemble failed to enhance accuracy and, in some instances, even underperformed compared to the XGBoost model, as seen in Table 6. The selection of base learners should not be arbitrary but should be determined based on specific criteria to improve the accuracy.
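
For reference, a minimal sketch of the kind of stacking setup described here is given below, assuming a Python environment with scikit-learn and XGBoost; the base learners, meta-learner, and synthetic data are illustrative placeholders, not the authors' configuration. The final loop reflects the point that the ensemble is only worth keeping if it outperforms its strongest base learner.

```python
# Illustrative sketch of a stacking ensemble whose base learners and meta-learner
# are chosen to complement each other, with a direct comparison against each base learner.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR
from xgboost import XGBRegressor

# Placeholder data standing in for the flood-inducing factors and NHAND target.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

base_learners = [
    ("rf", RandomForestRegressor(n_estimators=300, random_state=0)),
    ("svr", SVR(C=10.0)),
    ("xgb", XGBRegressor(n_estimators=300, learning_rate=0.05, random_state=0)),
]
stack = StackingRegressor(
    estimators=base_learners,
    final_estimator=Ridge(alpha=1.0),   # meta-learner; could itself be any regressor
    cv=5,                               # out-of-fold predictions are used to fit the meta-learner
)

# The ensemble should only be retained if it beats its strongest base learner.
for name, model in base_learners + [("stacking", stack)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name:8s} mean CV R^2 = {scores.mean():.3f}")
```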

Furthermore, the inadequacy of the models developed in this study is evident from the results in Figures 8 and 9, which show that they cannot reliably predict NHAND (New Height Above the Nearest Drainage) values. The question arises: what is the point of developing a machine learning model if it doesn't yield reasonable accuracy? Specifically, Figure 9 clearly indicates that the NHAND values are consistently overpredicted. This research will lack scientific significance and value unless it employs appropriate methodology to develop a reliable model.

Therefore, the authors are kindly advised to retrain their model to predict NHAND values more accurately. This could be achieved by ensuring that appropriate methodologies are followed and by referring to the literature, particularly with regard to hyperparameter optimization.

Other comments are listed below:

Comment 9: Based on the reply to this comment, it’s clear that the stacking ensemble model hasn’t been appropriately developed to enhance accuracy. The selection of base learners, as well as the types of models used for these base learners, should be determined based on specific criteria aimed at improving accuracy. The current results indicate that the developed model is insufficient and has not improved upon the base learners, which is the result of random selection of the base learner and flaws in the methodology. Please revise.

Comment 11: The response to this comment is not valid. Global importance is calculated as the average of absolute Shapley values and can be depicted without indicating the direction of each parameter's effect. To understand the direction of influence for each factor, one could sum the Shapley values (as displayed in Figure 10) for all specimens in the database. This directional influence could be represented on the global importance plot using different colors, as demonstrated in Figure 11 of the original reference. For example, in Figure 15 of doi.org/10.1016/j.istruc.2022.08.023, some factors have a negative summation. The authors are advised to consult these references and enrich the discussion by including the directional effects of each factor on NHAND prediction.
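
For illustration, a minimal sketch of the plot described in this comment is given below, assuming Python with the shap package and an XGBoost regressor; the synthetic data and factor names are hypothetical placeholders, not the paper's inputs. Bar length shows the mean absolute SHAP value per factor (global importance), and bar colour encodes the sign of the summed SHAP values, i.e. the overall direction of each factor's effect on the predicted NHAND.

```python
# Illustrative sketch: global SHAP importance (mean |SHAP|) with bars coloured by
# the sign of the summed SHAP values, indicating each factor's overall direction.
import matplotlib.pyplot as plt
import numpy as np
import shap
from sklearn.datasets import make_regression
from xgboost import XGBRegressor

# Placeholder data and hypothetical factor names standing in for the flood dataset.
X, y = make_regression(n_samples=500, n_features=8, random_state=0)
feature_names = [f"factor_{i}" for i in range(X.shape[1])]

model = XGBRegressor(n_estimators=300, random_state=0).fit(X, y)
shap_values = shap.TreeExplainer(model).shap_values(X)   # shape: (n_samples, n_features)

importance = np.abs(shap_values).mean(axis=0)   # global importance per factor
direction = shap_values.sum(axis=0)             # sign: overall push up/down on the prediction
colors = ["tab:red" if d >= 0 else "tab:blue" for d in direction]

order = np.argsort(importance)
plt.barh(np.array(feature_names)[order], importance[order], color=np.array(colors)[order])
plt.xlabel("Mean |SHAP value|")
plt.title("Global importance with directional colouring")
plt.tight_layout()
plt.show()
```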

Comment 13: The response to this comment is also invalid. The authors should either acknowledge the study's practical limitations, develop a simple GUI tool, or at least make the developed code publicly available through a platform such as GitHub.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 3

Reviewer 2 Report

This reviewer appreciates the authors' effort in addressing some of the comments; however, the paper still exhibits methodological flaws and flaws in the results.

In machine learning, ensemble methods like stacking are designed to improve predictive performance by combining multiple models (base learners). The objective is to capitalize on the strengths and offset the weaknesses of individual models. Nevertheless, the results in this paper distinctly show that the stacking ensemble underperforms in predictive accuracy compared to one of its base learners, specifically the XGBoost model. Several factors could contribute to this observation:

(a) Lack of proper hyperparameter optimization for the stacking ensemble model: the hyperparameters of the stacking ensemble should be optimized collectively using hyperparameter tuning techniques, rather than integrating pre-optimized or independently optimized models.

(b) Improper weighting or aggregation: if the stacking method uses an inappropriate mechanism to combine predictions (e.g., wrong weights or a suboptimal meta-model), it can produce less accurate results.

(c) Quality of base learners: if all the base learners are poorly constructed, combining them will not magically create a good model.

(d) Meta-model bias: if the meta-model itself has some form of bias or inadequate training, it could worsen the output.

(e) Inadequate training or validation: improper cross-validation during the training of base models or the meta-model could yield optimistic or pessimistic performance estimates that do not represent the true predictive power of the ensemble compared to individual models.
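
As an illustration of point (a), a minimal sketch of collectively tuning the stacking ensemble's hyperparameters is given below, assuming a Python setup with scikit-learn and XGBoost; the grids and synthetic data are placeholders only. Nested parameter names such as rf__n_estimators and final_estimator__alpha let a single grid search optimize the base learners and the meta-learner together, rather than plugging in independently pre-tuned models.

```python
# Illustrative sketch: tuning base-learner and meta-learner hyperparameters of the
# stacking ensemble jointly, instead of combining independently optimized models.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# Placeholder data standing in for the flood-inducing factors and NHAND target.
X, y = make_regression(n_samples=400, n_features=10, noise=5.0, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(random_state=0)),
        ("xgb", XGBRegressor(random_state=0)),
    ],
    final_estimator=Ridge(),
    cv=5,
)

# Nested parameter names reach into the base learners and the meta-learner,
# so the whole ensemble is optimized as one model.
param_grid = {
    "rf__n_estimators": [200, 500],
    "rf__max_depth": [None, 10],
    "xgb__learning_rate": [0.05, 0.1],
    "xgb__n_estimators": [200, 500],
    "final_estimator__alpha": [0.1, 1.0, 10.0],
}
search = GridSearchCV(stack, param_grid, cv=5, scoring="neg_root_mean_squared_error", n_jobs=-1)
search.fit(X, y)
print("Best ensemble configuration:", search.best_params_)
```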

Therefore, the authors should re-examine their model and optimize it properly to produce meaningful and logical results.

Moreover, the evident shortcomings shown in Figure 8 and Figure 9 further substantiate the models' inadequacy. This could be the consequence of multiple factors, including but not limited to improper data preprocessing, inadequate outlier identification, and insufficient model optimization or feature engineering.

Therefore, substantial revisions are required to address these critical flaws in the paper. 

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
