Article

Ensemble Model Development for the Prediction of a Disaster Index in Water Treatment Systems

1
Department of Civil and Environmental Engineering, Hanbat National University, 125, Dongseo-daero, Yuseong-gu, Daejeon 34158, Korea
2
G&C Environmental Solution, 16-5, Seongmisan-ro 23-gil, Mapo-gu, Seoul 03979, Korea
3
Korea Institute of Civil Engineering and Building Technology, 283, Goyang-daero, Ilsanseo-gu, Goyang-si, Gyeonggi-do 10223, Korea
4
Department of Environmental Energy Engineering, University of Suwon, 17, Wau-an-gil, Bongdam-eup, Hwaseong-si, Gyeonggi-do 18323, Korea
5
Disaster Prevention Research Division, National Disaster Management Research Institute, 365, Jongga-ro, Jung-gu, Ulsan 44538, Korea
6
Bayesian AI Laboratory, BAIES, Fairfax, VA 22030, USA
7
Department of Civil, Environmental and Construction Engineering, University of Central Florida, 12800 Pegasus Dr., Orlando, FL 32816, USA
8
Department of Information & Statistics, Chungbuk National University, Chungdae-Ro 1, SeoWon-Gu, Cheongju, Chungbuk 28644, Korea
*
Authors to whom correspondence should be addressed.
Water 2020, 12(11), 3195; https://doi.org/10.3390/w12113195
Submission received: 17 October 2020 / Revised: 10 November 2020 / Accepted: 13 November 2020 / Published: 15 November 2020
(This article belongs to the Section Water Resources Management, Policy and Governance)

Abstract

The quantitative analysis of the disaster effect on water supply systems can provide useful information for water supply system management. In this study, a total disaster index (TDI) was developed using open-source public data in 419 water treatment plants in Korea with 23 input variables. The TDI quantifies the possible effects or damage caused by three major disasters (typhoons, heavy rain, and earthquakes) on water supply systems. The four components (regional factor, risk factor, urgency factor, and response and recovery factor) were calculated using input variables to determine the disaster index (DI) of each disaster. The weight of the input variables was determined using principal component analysis (PCA), and the weights of the DI of three natural disasters and four components used to calculate the TDI were determined by the analytical hierarchy process (AHP). Specifically, two ensemble machine learning models, random forest (RF) and XGBoost (XGB), were used to develop models to predict the TDI. Both models predicted the TDI with the coefficient of determination and root-mean-square error-observations standard deviation ratio of 0.8435 and 0.3957 for the RF model and 0.8629 and 0.3703 for the XGB model, respectively. The relative importance analysis suggests that the number of input variables can be minimized, which improves the models’ practical applicability.

1. Introduction

Various natural disasters, such as floods and earthquakes, cause considerable damage to water supply systems. This damage includes the destruction of plants, intake systems, pipelines, and electric systems, and the consequent interruption of water supply to the public [1]. The assessment of damage to water supply systems caused by natural disasters is important for proper management and decision-making processes to prevent and restore the damage caused by natural disasters [2,3].
Assessing risk and measuring disaster resilience are the keys for predicting possible events, quantifying contributing factors, and identifying potential consequences. One good example is the lone house that remained standing after Hurricane Ike in 2008. It was rebuilt based on the experience from Hurricane Rita in 2005 on elevated ground, with an appropriate roof pitch and windows that were designed to withstand winds of up to 209 km/h, thus surviving Hurricane Ike with its winds of 177 km/h [4]. Although there have been many efforts to develop quantitative and indicator-based assessments, such as the comprehensive disaster resilience index (CDRI) [5,6], there is no universal standard for the measurement of disaster and related consequences [7]. A reliable disaster resilience framework with unified terminology and its quantitative evaluation would be an important tool in the decision-making processes for both policymakers and engineering professionals [8].
Statistical methods such as principal component analysis (PCA) or analytic hierarchy process (AHP) are often applied for the evaluation of disaster effects on civil infrastructures. For example, Park et al. [9] suggested a disaster risk index for 51 high-speed railroad stations in Korea. The index was calculated from a linear equation of four main indices (hazard, exposure, vulnerability, and emergency response and recovery capability) suggested by Rossi and Gilmartin [10] where the weights of each main index were determined by PCA. Recent studies have also used statistical analysis based on survey data for the assessment of disaster risk on flooding or water security [11,12].
In recent decades, advanced data mining and machine learning (ML) technologies have been used to manage disasters such as typhoons and earthquakes [13,14,15,16,17,18,19], and emerging remote sensing technologies have been increasingly used for monitoring and detecting data related to disaster management [20]. The continuous increase in available data due to advanced data collection technologies, such as remote sensors or unmanned aerial vehicles (UAVs), has accelerated the application and improved the accuracy of ML models [21,22,23]. Ofli et al. [21] used aerial images captured from UAVs to identify features of interest, such as damaged shelters and blocked roads, to assist with disaster response. The features of interest in the images were annotated and used for training ML models, including support vector machine (SVM) and random forest (RF), where the overall accuracy of the models' classification results ranged from 0.73 to 0.85. Sheykhmousa et al. [24] analyzed satellite images of land cover and land use using an SVM classifier. The image data during Typhoon Haiyan, which caused massive damage in the Philippines in 2013, were compared with the image data from 2017, four years after Typhoon Haiyan, to assess the post-disaster recovery process. Chen et al. [15] analyzed the impact indices of flood disasters using RF and developed a risk assessment model based on the neural network method. Various data, including rainfall and socioeconomic data in the Yangtze River Delta area between 2008 and 2018, were used for model development. More recently, Kao et al. [19] used an advanced deep learning algorithm, long short-term memory (LSTM), to forecast flood events.
Recent studies also used social platforms with text information about disasters to analyze the characteristics of disasters such as typhoons and earthquakes [14,25,26]. Resch et al. [25] analyzed earthquake characteristics from social media information during an earthquake in Napa, California, USA, in 2014 using latent Dirichlet allocation (LDA), which is widely used for topic analysis. The spatial hot spots of the earthquake were determined from the LDA model with 86.45% accuracy compared with the United States Geological Survey earthquake footprint report. More recently, Yu et al. [14] analyzed text information in social media during Typhoon Anemone along the coast of China in August 2012 to develop a typhoon disaster classification system using a model based on a convolutional neural network.
ML is also increasingly used in environmental management. Zhang et al. [27] predicted air pollution by PM2.5 with a fusion model based on three gradient boosted decision tree (GBDT) algorithms. The root-mean-squared error (RMSE) of the fusion model was 32.300. Bi et al. [28] also used a GBDT-based model, light gradient boost, coupled with a fast Fourier transform for the assessment of a liquefaction disaster. However, despite substantial efforts on the classification and analysis of disasters, quantitative and indicator-based assessments of their effects on water infrastructure have not been thoroughly conducted.
In this study, the effects of various disasters on water supply systems are quantified from the perspective of management using two statistical data analysis methods, PCA and AHP. In the first part, a total disaster index (TDI) was developed from this statistical approach. In the second part, tree-based ensemble models (i.e., RF and GBDT) were used to predict the TDI, which provides valuable information for the safety management of water supply systems.

2. Methods

2.1. Data Sources

A total of 23 input variables, covering facility specifications and operational data for 419 water treatment plants in Korea, were used to develop the TDI. The data were obtained from statistical yearbooks and open-source public data (Table 1). The 23 input variables provide information about the water supply systems, including water supply capacity, pipeline density, number of customers, management labor, and regional characteristics of natural conditions where the water treatment plants are located (Table 2). The local peak ground acceleration by an earthquake at each water treatment plant was estimated from the Korea Seismicity Map program developed by Cao et al. [29]. The data for regional natural conditions were obtained from meteorological data available from the national meteorological administration information portal [30]. The financial status of the local government that manages each water treatment plant was collected from the public data portal of the Ministry of the Interior and Safety in Korea [31].

2.2. Disaster Index

2.2.1. Type of Disaster

Typhoons and heavy rains are among the most frequent disasters in Korea, while Korea has been known to be relatively safe from earthquakes. However, interest in earthquakes in Korea has increased since the two earthquakes with magnitudes of 5.8 and 5.4 on the Richter scale in 2016 and 2017, respectively. In this study, three natural disasters, typhoons, heavy rains, and earthquakes, were selected as the most influential disasters on the water supply system and used for the TDI development considering natural characteristics in Korea.

2.2.2. Component of Disaster Index

The four components (i.e., regional factor, risk factor, urgency factor, and response and recovery factor), describing the level of the damage caused by each type of disaster, were used to determine the disaster index (DI) of three natural disasters as follows.
  • Regional factor (RE) represents regional characteristics such as the frequency of natural disaster occurrence in the selected areas;
  • Risk factor (RI) represents the quantity of possible damage caused by natural disasters. For example, the RI increases as the capacity of water treatment plants or the length of water supply pipelines increases;
  • Urgency factor (UR) represents the urgency of recovery after a disaster. For example, the UR increases with a larger population in the area receiving drinking water; and,
  • Response and recovery factor (RR) represents the recovery ability during and after a disaster, which is estimated by the financial status or manpower of the authority of a water treatment plant, such as the local government.
A total of 23 input variables obtained from open-source public data were used to determine the four components (RE, RI, UR, and RR), as summarized in Table 2.

2.2.3. PCA Analysis for Index Weight

The weights of each variable for the DI of three natural disasters (typhoon, heavy rain, and earthquake) were determined using PCA. PCA is a statistical method that reduces the dimension of variables and determines each variable's relative importance using an eigenvector. Before PCA, the input variables were standardized to a mean of zero and a standard deviation of one [34,35,36].
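The standardization and weighting step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes that the weights are the absolute loadings of the first principal component, rescaled to sum to one (the text states only that eigenvectors were used). The function name `pca_weights` and the synthetic data are hypothetical.

```python
import numpy as np

def pca_weights(X):
    """Return one weight per column of X from the first principal component.

    Assumption (not stated in the source): weights are the absolute loadings
    of the first eigenvector, rescaled so that they sum to 1.
    """
    X = np.asarray(X, dtype=float)
    # Standardize each variable to mean 0 and standard deviation 1.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Correlation matrix of the standardized variables.
    corr = (Z.T @ Z) / Z.shape[0]
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(corr)
    first = np.abs(eigvecs[:, -1])  # eigenvector of the largest eigenvalue
    return first / first.sum()

# Synthetic example: 419 plants and 3 hypothetical variables of one component.
rng = np.random.default_rng(0)
base = rng.normal(size=419)
X = np.column_stack([
    base + 0.1 * rng.normal(size=419),
    base + 0.1 * rng.normal(size=419),
    base + 1.0 * rng.normal(size=419),  # noisier, so less aligned with PC1
])
weights = pca_weights(X)
```

In this synthetic case the noisier third variable receives a smaller weight than the first two, which share most of their variance with the first principal component.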

2.2.4. AHP Analysis

The DI of three natural disasters and four components were used for the calculation of TDI in 419 water treatment plants. However, there was limited data available for the statistical determination of the relative weight of each natural disaster and four components for the TDI calculation. In addition, although the effect of earthquakes on the water supply system is expected to be extremely large, there were only two significant earthquakes in Korea that occurred in 2016 and 2017. Thus, it should be noted that quantitative data for the analysis of the effect of earthquakes was limited.
The weights of three natural disasters and four components used to calculate the TDI were determined by the AHP suggested by Saaty [37,38]. The AHP is a structured data analysis method for complex decision-making, which is also widely used to analyze disaster data [9,23,39]. In AHP, a pairwise comparison matrix of each element for the decision-making process is structured. This structure relates to the matrix’s eigenvector, which represents the weight of each element in the decision-making process [37,40].
The survey results from 62 experts or engineers currently working in water treatment plants were used for AHP analysis. The survey data with a consistency ratio (CR) of less than 0.2 was used to calculate the weight of each input variable to maintain the consistency of the AHP analysis result [40,41,42].
$$\mathrm{CI} = \frac{\lambda_{max} - n_f}{n_f - 1} \quad \text{and} \quad \mathrm{CR} = \frac{\mathrm{CI}}{\mathrm{RI}} \tag{1}$$
where
$\lambda_{max}$: principal eigenvalue in the pairwise comparison matrix,
$n_f$: number of features,
CI: consistency index,
RI: random consistency index (RI = 0.90 for $n_f$ = 4 and RI = 0.58 for $n_f$ = 3), and
CR: consistency ratio.
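The consistency screening can be sketched for a single survey response as follows. The pairwise judgments in `A` are hypothetical, and `ahp_weights` is an illustrative helper rather than code from the study. Only the random consistency index values and the CR < 0.2 threshold are taken from the text.

```python
import numpy as np

# Random consistency index for matrix sizes 3 and 4, as given in the text.
RANDOM_INDEX = {3: 0.58, 4: 0.90}

def ahp_weights(A):
    """Return (weights, CI, CR) for a reciprocal pairwise comparison matrix A."""
    A = np.asarray(A, dtype=float)
    nf = A.shape[0]
    eigvals, eigvecs = np.linalg.eig(A)
    k = np.argmax(eigvals.real)        # index of the principal eigenvalue
    lam_max = eigvals[k].real
    w = np.abs(eigvecs[:, k].real)
    w = w / w.sum()                    # normalize the weights to sum to 1
    ci = (lam_max - nf) / (nf - 1)     # consistency index
    cr = ci / RANDOM_INDEX[nf]         # consistency ratio
    return w, ci, cr

# Hypothetical judgments for three disasters (typhoon, heavy rain, earthquake).
A = [
    [1.0, 3.0, 2.0],
    [1 / 3, 1.0, 1 / 2],
    [1 / 2, 2.0, 1.0],
]
weights, ci, cr = ahp_weights(A)
keep_response = cr < 0.2   # screening threshold used in the study
```

A perfectly consistent matrix gives a principal eigenvalue equal to the matrix size and therefore CI = 0. Larger CR values indicate that the pairwise judgments contradict one another.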

2.2.5. Disaster Index Model

The TDI is determined by the weighted sum of the DI for three natural disasters using the following equations (Equations (2)–(5)).
TDI = a(TI) + b(HI) + c(EI)
TI = at(REt) + bt(RIt) + ct(URt) − dt(RRt)
HI = ah(REh) + bh(RIh) + ch(URh) − dh(RRh)
EI = ae(REe) + be(RIe) + ce(URe) − de(RRe)
where
TI: DI for typhoon;
HI: DI for heavy rain;
EI: DI for earthquake;
a, b, and c: weight of each natural DI; and
at, bt, ct, dt, ah, bh, ch, dh, ae, be, ce, and de: weight of each component.
The subscripts t, h, and e in Equations (3)–(5) represent typhoon, heavy rain, and earthquake, respectively.

2.3. Disaster Prediction Model

2.3.1. Model Selection

Two ensemble models, RF and GBDT, have been increasingly used as ML models to manage the water environment. Both models perform well even for nonlinear relationships and for data with outliers, and both are applicable to classification and regression [43,44].
RF is a tree-based ensemble model in which a random data selection approach generates multiple decision trees. RF randomly selects several sets of input features from the original input features by a bagging method before generating the decision trees, which increases the independence and variability of each decision tree. The final RF prediction is determined by averaging the predictive results from individual decision trees in RF [45]. Consequently, the prediction performance of RF can be substantially improved over that of a single decision tree [46,47,48], and RF has been reported to outperform other ML models in some applications [49]. RF has shown high performance in various domains and has also been continuously applied to environmental research, such as water quality prediction [50,51].
GBDT is an ensemble model based on a gradient boosting method (GBM), called a sequential tree-based calculation process [45,52,53], and a set of decision trees. Unlike RF which determines the final prediction by voting (for classification) or averaging (for regression), GBDT uses the decision tree, called a weak learning model, from a previous stage in the ML process to improve model performance in the following stage. Residual errors of the prior stage are included in developing the decision tree in the current stage to reduce the residual errors by optimizing a specified loss function [45,52]. This optimization process is sequentially performed until the predefined number of decision trees is reached, which is a major difference with RF, where the calculation of each tree is independent.
GBDT is optimized by minimizing an objective function, J, for a training data set with n samples. The regularization term can be added to avoid overfitting of the model [44,54]. Equation (6) shows an illustrative example of the objective function of GBDT [44,54].
$$J = \sum_{i=1}^{n} L(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k) \tag{6}$$
where
$f_k$: function of the kth decision tree,
$L$: loss function that calculates the difference between an observation ($y_i$) and model prediction ($\hat{y}_i$) in each decision tree,
$\Omega$: regularization function that penalizes the complexity of the model, and
n: number of data samples.
The schematics of RF and GBDT are compared in Figure 1, where X denotes input features as X = x1, x2, …, xn, h(X, θk), (k = 1, 2, …, K) is a collection of decision trees, and the θk are independent and identically distributed random vectors [44,45,54].
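The sequential residual fitting that distinguishes GBDT from RF can be illustrated with a deliberately small sketch. It assumes a squared-error loss, one-split trees (stumps) on a single feature, and no regularization term, so it is not the XGBoost algorithm itself. The functions `fit_stump` and `boost` and the synthetic data are hypothetical.

```python
import numpy as np

def fit_stump(x, residual):
    """Fit a one-split regression stump to the residuals on a single feature."""
    best = None
    for threshold in np.unique(x)[:-1]:      # keep both sides non-empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return lambda z: np.where(z <= threshold, left_value, right_value)

def boost(x, y, n_trees=50, learning_rate=0.1):
    """Sequentially add stumps, each fitted to the current residuals."""
    prediction = np.full_like(y, y.mean(), dtype=float)
    losses = []
    for _ in range(n_trees):
        residual = y - prediction            # negative gradient of 0.5 * (y - pred)^2
        stump = fit_stump(x, residual)
        prediction = prediction + learning_rate * stump(x)
        losses.append(float(((y - prediction) ** 2).mean()))
    return prediction, losses

# Synthetic one-dimensional regression problem.
rng = np.random.default_rng(1)
x = rng.uniform(0, 6, size=200)
y = np.sin(x) + 0.1 * rng.normal(size=200)
prediction, losses = boost(x, y)
```

Each stump depends on the residuals left by all earlier stumps, so the trees cannot be built independently as in RF. The training loss recorded in `losses` does not increase from one stage to the next.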
In this study, both RF and GBDT models were used for the TDI estimation of 419 drinking water treatment plants. The Python open-source libraries of Scikit-learn (for RF) and XGBoost (for GBDT) were used for regression model development [55,56]. XGBoost (XGB) is one of the most popular GBDT implementations developed by Chen and Guestrin [45,54]. Scikit-learn is also a popular Python-based ML library developed by Pedregosa et al. [55].

2.3.2. Model Optimization

The hyperparameters of RF and XGB were optimized by a trial and error method with ten-fold cross-validation using the grid search library in Scikit-learn [57]. The models were developed with 23 input variables of 419 water treatment plants, where the ratio of data used for training and testing of the models was 8:2.
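A grid search with ten-fold cross-validation and an 8:2 split might look like the sketch below. The data are synthetic stand-ins for the 419 plants and 23 input variables, and the parameter grid is hypothetical because the searched values are not listed in this section. Only the RF model is shown. `xgboost.XGBRegressor` follows the scikit-learn estimator interface, so the same pattern should apply to it.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the 419 plants x 23 input variables (not the real data).
rng = np.random.default_rng(42)
X = rng.normal(size=(419, 23))
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] ** 2 + 0.05 * rng.normal(size=419)

# 8:2 split between training and test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical grid; the values searched in the study are not reported here.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 6]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=10,                                   # ten-fold cross-validation
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)
best_model = search.best_estimator_
test_r2 = best_model.score(X_test, y_test)   # R^2 on the held-out 20%
```

The test data are kept out of the cross-validation, so `test_r2` is an estimate on plants the search never saw.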

2.3.3. Feature Importance (FI) of Input Variables

The relative importance of input variables on RF and XGB model performance was calculated using the feature importance (FI) algorithm in Scikit-learn [57]. The FI in the tree-based model was computed as the total impurity reduction of the model brought by that feature [55,58,59].

2.3.4. Model Evaluation

The model performance was evaluated by three evaluation indexes (Equations (7)–(9)): RMSE, the coefficient of determination (R2), and the RMSE-observation standard deviation ratio (RSR). RSR ranges from 0 upward and approaches 0 as the model fit to the observations improves. The model prediction is considered satisfactory when RSR < 0.70 [60,61].
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (M_{i,obs} - M_{i,model})^2}{\sum_{i=1}^{n} (M_{i,obs} - \overline{M}_{obs})^2} \tag{7}$$
$$RMSE = \sqrt{\frac{\sum_{i=1}^{n} (M_{i,obs} - M_{i,model})^2}{n}} \tag{8}$$
$$RSR = \frac{\sqrt{\sum_{i=1}^{n} (M_{i,obs} - M_{i,model})^2}}{\sqrt{\sum_{i=1}^{n} (M_{i,obs} - \overline{M}_{obs})^2}} \tag{9}$$
where
$M_{i,obs}$: observed values,
$\overline{M}_{obs}$: mean of observed values, and
$M_{i,model}$: model predicted value.
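The three indexes can be written directly from their definitions. The sketch below uses only the standard library, and the observed and predicted values are made up for illustration. Under these definitions R2 = 1 − RSR², which agrees with the values reported for the models (for example, 1 − 0.3957² ≈ 0.843).

```python
import math

def _sums(obs, model):
    """Return (sum of squared errors, sum of squared deviations from the mean)."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - m) ** 2 for o, m in zip(obs, model))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return sse, sst

def r2(obs, model):
    sse, sst = _sums(obs, model)
    return 1.0 - sse / sst

def rmse(obs, model):
    sse, _ = _sums(obs, model)
    return math.sqrt(sse / len(obs))

def rsr(obs, model):
    sse, sst = _sums(obs, model)
    return math.sqrt(sse) / math.sqrt(sst)

# Made-up observations and predictions.
obs = [1.0, 2.0, 3.0, 4.0]
model = [1.1, 1.9, 3.2, 3.8]
scores = {"R2": r2(obs, model), "RMSE": rmse(obs, model), "RSR": rsr(obs, model)}
```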

3. Results and Discussion

3.1. Characteristics of Input Variables

A total of 23 input variables for the development of the DI were identified from open-source public statistical data. The characteristics of the input variables are summarized in Table 3. The frequency of warning advisories of natural disasters was calculated at each water treatment plant from the sum of the three variables in Table 3 (i.e., RAIN, SWIND, and TYPHOON). The frequency of warning advisories ranged from 0.017 to 1.29 times/km2 and tended to be higher in areas near the ocean, as shown in Figure 2 (mapped using ArcGIS Pro).

3.2. Disaster Index (DI) Model Development

3.2.1. PCA Analysis

The weights for each natural disaster index were determined from PCA with 23 input variables (Table 4). The eigenvectors were calculated from PCA and normalized so that the weights within each component sum to 1.

3.2.2. AHP Analysis

The weights for each disaster type were determined from the AHP analysis using the survey data (CR < 0.2) (Table 5). The response rate of the survey was in the range between 52 and 69% for each item. The weights of each disaster are in the order of typhoons, earthquakes, and heavy rain.

3.2.3. Disaster Index (DI)

The TDI was determined using the following model (Equations (10)–(13)), which was developed from the PCA and AHP analysis (Table 4 and Table 5).
TDI = 0.481(TI) + 0.198(HI) + 0.321(EI)
TI = 0.275(REt) + 0.265(RIt) + 0.216(URt) − 0.244(RRt)
where
REt = 0.309(REt1) + 0.345(REt2) + 0.346(REt3),
RIt = 0.143(RIt1) + 0.143(RIt2) + 0.144(RIt3) + 0.052(RIt4) + 0.140(RIt5) + 0.132(RIt6) + 0.136(RIt7) + 0.110(RIt8),
URt = 0.334(URt1) + 0.334(URt2) + 0.332(URt3), and
RRt = 0.248(RRt1) + 0.235(RRt2) + 0.263(RRt3) + 0.254(RRt4).
HI = 0.279(REh) + 0.247(RIh) + 0.221(URh) − 0.253(RRh)
where
REh = 0.500(REh1) + 0.500(REh2),
RIh = 0.143(RIh1) + 0.143(RIh2) + 0.144(RIh3) + 0.052(RIh4) + 0.140(RIh5) + 0.132(RIh6) + 0.136(RIh7) + 0.110(RIh8),
URh = 0.334(URh1) + 0.334(URh2) + 0.332(URh3), and
RRh = 0.248(RRh1) + 0.235(RRh2) + 0.263(RRh3) + 0.254(RRh4).
EI = 0.215(REe) + 0.370(RIe) + 0.235(URe) − 0.180(RRe)
where
REe = 0.333(REe1) + 0.336(REe2) + 0.331(REe3),
RIe = 0.143(RIe1) + 0.143(RIe2) + 0.144(RIe3) + 0.052(RIe4) + 0.140(RIe5) + 0.132(RIe6) + 0.136(RIe7) + 0.110(RIe8),
URe = 0.334(URe1) + 0.334(URe2) + 0.332(URe3), and
RRe = 0.234(RRe1) + 0.056(RRe2) + 0.222(RRe3) + 0.249(RRe4) + 0.239(RRe5).
Using the developed models, the TDI values of the 419 water treatment plants ranged from −0.526 to 3.813, with an average of 0 and a standard deviation of 0.343. A higher TDI represents a higher potential effect of, or damage from, a disaster in water treatment systems. The TDI tends to be higher in water treatment plants near metropolitan cities and in areas near the ocean.
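The top-level weighted sums of Equations (10)–(13) can be sketched as follows. The weights are taken from the equations above. The component values passed in are hypothetical, and the helper names `disaster_index` and `total_disaster_index` are illustrative only. The sub-weights of the individual input variables are omitted for brevity.

```python
# Component weights (RE, RI, UR, RR) for each disaster, from Equations (11)-(13).
COMPONENT_WEIGHTS = {
    "typhoon":    (0.275, 0.265, 0.216, 0.244),
    "heavy_rain": (0.279, 0.247, 0.221, 0.253),
    "earthquake": (0.215, 0.370, 0.235, 0.180),
}
# Disaster weights from Equation (10).
DISASTER_WEIGHTS = {"typhoon": 0.481, "heavy_rain": 0.198, "earthquake": 0.321}

def disaster_index(disaster, regional, risk, urgency, recovery):
    """DI of one disaster; the response and recovery factor is subtracted."""
    a, b, c, d = COMPONENT_WEIGHTS[disaster]
    return a * regional + b * risk + c * urgency - d * recovery

def total_disaster_index(components):
    """TDI as the weighted sum of the three disaster indexes."""
    return sum(
        DISASTER_WEIGHTS[name] * disaster_index(name, *values)
        for name, values in components.items()
    )

# Hypothetical plant with every component score equal to 1.
plant = {name: (1.0, 1.0, 1.0, 1.0) for name in DISASTER_WEIGHTS}
tdi = total_disaster_index(plant)

# The same plant with a stronger response and recovery factor.
prepared = {name: (1.0, 1.0, 1.0, 2.0) for name in DISASTER_WEIGHTS}
tdi_prepared = total_disaster_index(prepared)
```

Because the RR term enters with a negative sign, a larger response and recovery score lowers the TDI, as in the `prepared` case.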
The TDI was developed considering the natural status of Korea. For example, there were only two earthquakes in 2016 and 2017, which were considered to have caused actual damage to water treatment plants in Korea. As the data available for the quantification of damage by earthquakes is minimal, the AHP based on survey data was used for the DI calculation.
Although there were not many cases of damage in water treatment systems from earthquakes, the weight of the earthquake was larger than that of heavy rain. The AHP results indicate that, although earthquakes have been rare in Korea, the damage and consequences of an earthquake would not be negligible when one occurs, and that a preventive plan against earthquakes should be prepared in advance. In addition, given that most of the facilities already experience heavy rain and are relatively well prepared for it, the actual damage caused by heavy rain is expected to be relatively small compared to other disasters.

3.3. Ensemble Model Simulation

3.3.1. Total Disaster Index (TDI) Prediction using Ensemble Models

Two ensemble ML models, RF and XGB, were used to develop a model to predict TDI. The model performance with the test data set was evaluated by three indices, as summarized in Table 6. The R2 and RSR were 0.8435 and 0.3957 for the RF model and 0.8629 and 0.3703 for the XGB model, respectively.
The observed data and model predictions are compared in Figure 3. Both the RF and XGB predictions show a similarly good fit with the observations, while XGB showed a slightly better performance for all three evaluation indexes (Table 6 and Figure 3).

3.3.2. Feature Importance (FI) Analysis

The FI of the 23 input variables for both the RF and XGB models to predict the TDI is shown in Figure 4. The FI differed between RF and XGB, although the variables that represent the scale of water treatment plants, such as PUMP_EP and Q, tended to have a larger effect on model performance for both models. For RF, the nine input variables with the highest FI accounted for more than 80% of the total FI, whereas for XGB, the four variables with the highest FI accounted for more than 80%.
The performance of the RF and XGB models was compared using fewer input variables, starting with one and adding up to ten input variables in descending order of FI (Figure 5). The RF model tended to improve as the number of input variables increased from one to ten; even with three input variables, the RSR was 0.6954, which is below the 0.70 threshold for an acceptable prediction. XGB performed better than RF with fewer input variables: its RSR was 0.5323 with only three input variables and decreased to 0.3937 with ten. The FI analysis shows that a few input variables with high FI have a considerable effect on model performance, and both the RF and XGB models showed similar performance when using five or more input variables with the highest FI. The FI is one of several factors to consider for the model structure, not an absolute standard. The necessary input variables are not always obtainable from the actual operation and management of water treatment systems; thus, the practical applicability of the model improves as fewer input variables are required. The FI analysis suggests that the models perform acceptably with only a subset of the input variables with the highest FI, which would increase their practical applicability.
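The ranking and the reduced-input comparison can be sketched with the impurity-based importances exposed by scikit-learn. The data below are synthetic, the number of trees is arbitrary, and only the RF model is shown. The local `rsr` helper follows the RSR definition in Section 2.3.4.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def rsr(obs, pred):
    """RMSE-observation standard deviation ratio."""
    obs, pred = np.asarray(obs), np.asarray(pred)
    return np.sqrt(((obs - pred) ** 2).sum()) / np.sqrt(((obs - obs.mean()) ** 2).sum())

# Synthetic stand-in: 23 inputs, of which only the first three drive the target.
rng = np.random.default_rng(7)
X = rng.normal(size=(419, 23))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + 0.5 * X[:, 2] + 0.1 * rng.normal(size=419)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

# Rank the inputs by impurity-based feature importance.
full = RandomForestRegressor(n_estimators=200, random_state=7).fit(X_train, y_train)
order = np.argsort(full.feature_importances_)[::-1]
cumulative = np.cumsum(full.feature_importances_[order])
n_for_80 = int(np.searchsorted(cumulative, 0.80) + 1)  # inputs needed to reach 80% of FI

# Refit with the top-k inputs, k = 1..10, and record the test RSR.
rsr_by_k = {}
for k in range(1, 11):
    cols = order[:k]
    model = RandomForestRegressor(n_estimators=200, random_state=7)
    model.fit(X_train[:, cols], y_train)
    rsr_by_k[k] = float(rsr(y_test, model.predict(X_test[:, cols])))
```

Plotting `rsr_by_k` against k would give a curve of the same kind as the comparison in Figure 5, although the values here reflect the synthetic data only.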

4. Summary and Conclusions

In this study, a disaster index (DI) for predicting the effect or damage caused by three major natural disasters in Korea (i.e., typhoons, heavy rain, and earthquakes) was newly developed to quantify each natural disaster’s effect on water utilities.
Although the operational data in water utilities provide a good understanding of the effect of disasters, the data are usually collected in an individually specified, often site-specific format, making them difficult to collect, organize, and analyze. In addition, the operational data for water utilities were not easily accessible, limiting the comprehensive development of the DI. Therefore, in this study, the DI of natural disasters in water treatment systems was developed using statistical open-source public data. Two well-defined statistical data analysis methods (i.e., AHP and PCA) were used for the determination of the DI.
Open-source public data are more accessible and are updated regularly, so the DI can also be updated to reflect the current status, which is a significant benefit of using such data. The DI developed in this study may be specific to the location and conditions of the water utilities considered, but the developed framework would be applicable for quantifying the effect of disasters on water treatment systems in other regions with different natural conditions.
In the second part, two ensemble models (i.e., RF and XGB) were used to develop models to predict TDI. Both RF and XGB showed similar satisfactory performance for prediction of the DI, while the XGB showed a slightly better performance in general. The FI analysis also suggested that the models have sufficient performance for practical use with only several input variables of the highest FI, which can improve the practical applicability of the models.
Quantitative assessment of disaster effects on water treatment systems is essential for better management of the water treatment systems and stable supply of drinking water to the public. However, data related to disaster analysis are often limited and difficult to quantify. One possible solution would be to keep collecting data and analyzing them statistically, while facilitating frequent discussions among experts who have experienced disasters in their utilities [11,35]. Recent advances in information and communication technologies, such as sensor-based real-time monitoring methods, can provide continuous monitoring data about the operational condition of water treatment plants and related infrastructure, which can improve the pre- and post-management planning processes [20,22]. However, the quantification and assessment of disasters on water treatment systems are still at an early stage, and the use of field operational data and responses, in particular during disaster events, is currently limited.
This study provided quantified information on the impact of various natural disasters on water treatment systems with open-source public data, which would be useful for creating a plan to reduce damage to water supply systems caused by natural disasters. Further study is warranted to use high-frequency real-time data to improve the model performance and practical applicability.

Author Contributions

Data curation and software, J.P.; conceptualization, J.P., J.-H.P., J.-S.C., J.C.J., K.P., W.H.L. and T.-Y.H.; investigation, J.P., J.-H.P., J.-S.C., J.C.J., K.P., H.C.Y., C.Y.P., W.H.L. and T.-Y.H.; writing-original draft, J.P.; writing-review and editing, J.-H.P., J.-S.C., J.C.J., K.P., H.C.Y., C.Y.P. and T.-Y.H.; project administration, J.P. and J.-H.P.; supervision, J.P. and T.-Y.H.; funding acquisition, J.P. and J.-H.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Korea Environment Industry and Technology Institute (KEITI) through Environmental R&D Project on the Disaster Prevention of Environmental Facilities Project, funded by Korea Ministry of Environment (MOE) (2019002870001).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Pan American Health Organization (PAHO). Emergencies and Disasters in Drinking Water Supply and Sewage Systems: Guidelines for Effective Response; PAHO: Washington, DC, USA, 2002; pp. 5–12.
  2. Davis, C.A. Water system service categories, post-earthquake interaction, and restoration strategies. Earthq. Spectra 2014, 30, 1487–1509.
  3. Matthews, J.C. Disaster resilience of critical water infrastructure systems. J. Struct. Eng. 2016, 142, C6015001.
  4. World Meteorological Organization (WMO). Atlas of Mortality and Economic Losses from Weather, Climate and Water Extremes (1970–2012); WMO-No. 1123; WMO: Geneva, Switzerland, 2014.
  5. Marzi, S.; Mysiak, J.; Essenfelder, A.H.; Amadio, M.; Giove, S.; Fekete, A. Constructing a comprehensive disaster resilience index: The case of Italy. PLoS ONE 2019, 14, e0221585.
  6. Beccari, B. A comparative analysis of disaster risk, vulnerability and resilience composite indicators. PLoS Curr. 2016, 8.
  7. Franc, J.M.; Ingrassia, P.L.; Verde, M.; Colombo, D.; Della Corte, F. A simple graphical method for quantification of disaster management surge capacity using computer simulation and process-control tools. Prehosp. Disast. Med. 2015, 30, 9.
  8. Cimellaro, G.P.; Reinhorn, A.M.; Bruneau, M. Framework for analytical quantification of disaster resilience. Eng. Struct. 2010, 32, 3639–3649.
  9. Park, Y.; Han, S.; Choi, S. Development of Disaster Risk Index for Evaluating the Natural Disaster Hazards of High-speed Railroad Facilities. J. Korean Soc. Hazard Mitig. 2019, 19, 1–9.
  10. Rossi, R.J.; Gilmartin, K.J. The Handbook of Social Indicators: Sources, Characteristics, and Analysis; Garland STPM Press: New York, NY, USA, 1980.
  11. Bruce, A.; Brown, C.; Avello, P.; Beane, G.; Bristow, J.; Ellis, L.; Fisher, S.; Freeman, S.G.; Jiménez, A.; Leten, J.; et al. Human dimensions of urban water resilience: Perspectives from Cape Town, Kingston upon Hull, Mexico City and Miami. Water Secur. 2020, 9, 100060.
  12. Lee, S.; Yoon, H. Development of disaster risk assessment method in river confluence using AHP. J. Korean Soc. Hazard Mitig. 2018, 18, 545–553.
  13. Zagorecki, A.T.; Johnson, D.E.; Ristvej, J. Data mining and machine learning in the context of disaster and crisis management. Int. J. Emerg. Manag. 2013, 9, 351–365.
  14. Yu, J.; Zhao, Q.; Chin, C.S. Extracting Typhoon Disaster Information from VGI Based on Machine Learning. J. Mar. Sci. Eng. 2019, 7, 318.
  15. Chen, J.; Li, Q.; Wang, H.; Deng, M. A machine learning ensemble approach based on random forest and radial basis function neural network for risk evaluation of regional flood disaster: A case study of the Yangtze River Delta, China. Int. J. Environ. Res. Public Health 2020, 17, 49.
  16. Khouj, M.; Lopez, C.; Sarkaria, S.; Marti, J. Disaster management in real time simulation using machine learning. In Proceedings of the 2011 24th Canadian Conference on Electrical and Computer Engineering (CCECE), Niagara Falls, ON, Canada, 8–11 May 2011; pp. 001507–001510.
  17. Chang, F.J.; Hsu, K.; Chang, L.C. (Eds.) Flood Forecasting Using Machine Learning Methods; MDPI: Basel, Switzerland, 2019.
  18. Chang, F.-J.; Guo, S. Advances in hydrologic forecasts and water resources management. Water 2020, 12, 1819.
  19. Kao, I.-F.; Zhou, Y.; Chang, L.-C.; Chang, F.-J. Exploring a Long Short-Term Memory based Encoder-Decoder framework for multi-step-ahead flood forecasting. J. Hydrol. 2020, 583, 124631.
  20. Khan, A.; Gupta, S.; Gupta, S.K. Multi-hazard disaster studies: Monitoring, detection, recovery, and management, based on emerging technologies and optimal techniques. Int. J. Disast. Risk Reduct. 2020, 47, 101642.
  21. Ofli, F.; Meier, P.; Imran, M.; Castillo, C.; Tuia, D.; Rey, N.; Briant, J.; Millet, P.; Reinhard, F.; Parkan, M. Combining human computing and machine learning to make sense of big (aerial) data for disaster response. Big Data 2016, 4, 47–59.
  22. Park, J.; Kim, K.T.; Lee, W.H. Recent Advances in Information and Communications Technology (ICT) and Sensor Technology for Monitoring Water Quality. Water 2020, 12, 510. [Google Scholar] [CrossRef] [Green Version]
  23. Orencio, P.M.; Fujii, M. A localized disaster-resilience index to assess coastal communities based on an analytic hierarchy process (AHP). Int. J. Disast. Risk Reduct. 2013, 3, 62–75. [Google Scholar] [CrossRef]
  24. Sheykhmousa, M.; Kerle, N.; Kuffer, M.; Ghaffarian, S. Post-disaster recovery assessment with machine learning-derived land cover and land use information. Remote Sens. 2019, 11, 1174. [Google Scholar] [CrossRef] [Green Version]
  25. Resch, B.; Usländer, F.; Havas, C. Combining machine-learning topic models and spatiotemporal analysis of social media data for disaster footprint and damage assessment. Cartogr. Geogr. Inf. Sci. 2018, 45, 362–376. [Google Scholar] [CrossRef] [Green Version]
  26. Ragini, J.R.; Anand, P.R.; Bhaskar, V. Big data analytics for disaster response and recovery through sentiment analysis. Int. J. Inf. Manag. 2018, 42, 13–24. [Google Scholar] [CrossRef]
  27. Zhang, Y.; Zhang, R.; Ma, Q.; Wang, Y.; Wang, Q.; Huang, Z.; Huang, L. A feature selection and multi-model fusion-based approach of predicting air quality. ISA Trans. 2020, 100, 210–220. [Google Scholar] [CrossRef] [PubMed]
  28. Bi, C.; Fu, B.; Chen, J.; Zhao, Y.; Yang, L.; Duan, Y.; Shi, Y. Machine learning based fast multi-layer liquefaction disaster assessment. World Wide Web 2019, 22, 1935–1950. [Google Scholar] [CrossRef]
  29. Cao, A.-T.; Tran, T.-T.; Nguyen, T.-H.-X.; Kim, D. Simplified Approach for Seismic Risk Assessment of Cabinet Facility in Nuclear Power Plants Based on Cumulative Absolute Velocity. Nucl. Technol. 2020, 206, 743–757. [Google Scholar] [CrossRef]
  30. Korea Meteorological Administration Information Portal. Available online: https://data.kma.go.kr (accessed on 28 March 2020).
  31. Korea Ministry of the Interior and Safety Information Portal. Available online: http://lofin.mois.go.kr/portal/main.do (accessed on 15 April 2020).
  32. Korea Ministry of Environment (MOE). 2018 Statistics of Waterworks; MOE: Sejong, Korea, 2020.
  33. Korea Ministry of Land, Infrastructure and Transport (MOLIT). Korea Design Standard; MOLIT: Sejong, Korea, 2016; p. 45.
  34. Razmkhah, H.; Abrishamchi, A.; Torkian, A. Evaluation of spatial and temporal variation in water quality by pattern recognition techniques: A case study on Jajrood River (Tehran, Iran). J. Environ. Manag. 2010, 91, 852–860.
  35. Tripathi, M.; Singal, S.K. Use of Principal Component Analysis for parameter selection for development of a novel Water Quality Index: A case study of river Ganga India. Ecol. Indic. 2019, 96, 430–436.
  36. Sahoo, M.M.; Patra, K.; Khatua, K. Inference of water quality index using ANFIA and PCA. Aquat. Procedia 2015, 4, 1099–1106.
  37. Saaty, T.L. The Analytic Hierarchy Process; McGraw-Hill: New York, NY, USA, 1980.
  38. Wind, Y.; Saaty, T.L. Marketing applications of the analytic hierarchy process. Manag. Sci. 1980, 26, 641–658.
  39. Chakraborty, S.; Kumar, R.N. Assessment of groundwater quality at a MSW landfill site using standard and AHP based water quality index: A case study from Ranchi, Jharkhand, India. Environ. Monit. Assess. 2016, 188, 335.
  40. Saaty, T.L. How to make a decision: The analytic hierarchy process. Eur. J. Oper. Res. 1990, 48, 9–26.
  41. Saaty, R.W. The analytic hierarchy process—What it is and how it is used. Math. Model. 1987, 9, 161–176.
  42. Saaty, T.L. Priority setting in complex problems. IEEE Trans. Eng. Manag. 1983, 3, 140–155.
  43. Uddameri, V.; Silva, A.L.B.; Singaraju, S.; Mohammadi, G.; Hernandez, E.A. Tree-Based Modeling Methods to Predict Nitrate Exceedances in the Ogallala Aquifer in Texas. Water 2020, 12, 1023.
  44. Shin, Y.; Kim, T.; Hong, S.; Lee, S.; Lee, E.; Hong, S.; Lee, C.; Kim, T.; Park, M.S.; Park, J. Prediction of Chlorophyll-a Concentrations in the Nakdong River Using Machine Learning Methods. Water 2020, 12, 1822.
  45. Zhang, D.; Qian, L.; Mao, B.; Huang, C.; Huang, B.; Si, Y. A data-driven design for fault detection of wind turbines using random forests and XGboost. IEEE Access 2018, 6, 21020–21031.
  46. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
  47. Genuer, R.; Poggi, J.-M.; Tuleau-Malot, C. Variable selection using random forests. Pattern Recognit. Lett. 2010, 31, 2225–2236.
  48. Hollister, J.W.; Milstead, W.B.; Kreakie, B.J. Modeling lake trophic state: A random forest approach. Ecosphere 2016, 7, e01321.
  49. Fernández-Delgado, M.; Cernadas, E.; Barro, S.; Amorim, D. Do we need hundreds of classifiers to solve real world classification problems? J. Mach. Learn. Res. 2014, 15, 3133–3181.
  50. Singh, B.; Sihag, P.; Singh, K. Modelling of impact of water quality on infiltration rate of soil by random forest regression. Model. Earth Syst. Environ. 2017, 3, 999–1004.
  51. Read, E.K.; Patil, V.P.; Oliver, S.K.; Hetherington, A.L.; Brentrup, J.A.; Zwart, J.A.; Winters, K.M.; Corman, J.R.; Nodine, E.R.; Woolway, R.I. The importance of lake-specific characteristics for water quality across the continental United States. Ecol. Appl. 2015, 25, 943–955.
  52. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
  53. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. Lightgbm: A highly efficient gradient boosting decision tree. In Proceedings of the Advances in Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 3146–3154.
  54. Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
  55. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
  56. XGBoost. Available online: https://xgboost.readthedocs.io/en/latest/build.html (accessed on 15 February 2020).
  57. Scikit-Learn. Available online: https://scikit-learn.org/stable/index.html (accessed on 3 January 2020).
  58. Fabris, F.; Doherty, A.; Palmer, D.; De Magalhães, J.P.; Freitas, A.A. A new approach for interpreting random forest models and its application to the biology of ageing. Bioinformatics 2018, 34, 2449–2456.
  59. Grömping, U. Variable importance assessment in regression: Linear regression versus random forest. Am. Stat. 2009, 63, 308–319.
  60. Moriasi, D.N.; Arnold, J.G.; Van Liew, M.W.; Bingner, R.L.; Harmel, R.D.; Veith, T.L. Model evaluation guidelines for systematic quantification of accuracy in watershed simulations. Trans. ASABE 2007, 50, 885–900.
  61. Bennett, N.D.; Croke, B.F.; Guariso, G.; Guillaume, J.H.; Hamilton, S.H.; Jakeman, A.J.; Marsili-Libelli, S.; Newham, L.T.; Norton, J.P.; Perrin, C. Characterising performance of environmental models. Environ. Model. Softw. 2013, 40, 1–20.
Figure 1. Schematics of the random forest (RF) and gradient boosted decision tree (GBDT) models.
Figure 2. A spatial distribution of water treatment plants and frequency of natural disasters determined from the disaster warning advisories in Korea.
Figure 3. Comparison of model prediction.
Figure 4. Feature importance (FI) of (a) RF and (b) XGBoost (XGB).
Figure 5. Model sensitivity to the number of input variables included in a model (RF or XGB).
Table 1. Data sources.

| Data | Reference |
|---|---|
| Water treatment plant operational information and facility specification | Statistical yearbook of water treatment system [32] |
| Meteorological data | Meteorological administration information portal [30] |
| Financial status in local government | Ministry of the interior and safety information portal [31] |
| Design standard for wind speed | Korean Design Standard [33] |
| Local peak ground acceleration by earthquake | Korea Seismicity Map [29] |
Table 2. Input variables.

| Variables | Description |
|---|---|
| CUSTOMER | Population that receives drinking water from the water treatment plant |
| EMPLOYEES_AREA | Employees per management area of the authority * (person/km²) |
| EMPLOYEES_SITE | Employees per number of water treatment plants of the authority * (person/ea) |
| EQ1 | Seismic design application status (applied: 1, not applied: 0) |
| LINE_DENSE | Total pipeline length per management area of the authority * (m/km²) |
| LOCAL_EMPLOYEES | Number of employees in the water supply plant |
| MONEY | Financial independence of the local government (%) |
| PGA_500 | 500 years frequency peak ground acceleration (%) |
| PGA_1000 | 1000 years frequency peak ground acceleration (%) |
| PGA_2400 | 2400 years frequency peak ground acceleration (%) |
| PUMP | Number of water supply pumps in the water treatment plant (ea) |
| PUMP_EP | Sum of electrical capacity of all water supply pumps in the water supply plant (kW) |
| Q | Water supply capacity of the water treatment plant (m³/day) |
| Q_DAILY | Daily average water production capacity of the water treatment plant (m³/day) |
| Q_MAX | Daily maximum water production capacity of the water treatment plant (m³/day) |
| Q_PRO | Total annual water production capacity of the water treatment plant (m³) |
| QT | Maximum water supply capacity per hour of the water treatment plant (m³/hr) |
| QW | Total annual electric power usage of the water treatment plant (kWh) |
| QY | Total annual amount of water treated by the water treatment plant (m³/year) |
| RAIN | Number of flood warning advisories between 2015 and 2019 in the region where the water treatment plant is located (times/km²) |
| SWIND | Number of strong wind advisories between 2015 and 2019 in the region where the water treatment plant is located (times/km²) |
| TYPHOON | Number of typhoon warning advisories between 2015 and 2019 in the region where the water treatment plant is located (times/km²) |
| WIND_R | Regional standard wind speed (m/s) in the region where the water treatment plant is located |

* authority: the owner of the water supply plant (i.e., the local government) and one authority may manage multiple plants.
Table 3. Characteristics of input variables.

| Variables | Average | Max | Min | Standard Deviation |
|---|---|---|---|---|
| CUSTOMER | 89,902.749 | 3,030,917.000 | 0.000 | 276,768.491 |
| EMPLOYEES_AREA | 0.195 | 3.017 | 0.006 | 0.449 |
| EMPLOYEES_SITE | 25.819 | 304.333 | 0.833 | 50.505 |
| EQ1 | 0.222 | 1.000 | 0.000 | 0.416 |
| LINE_DENSE | 2718.442 | 25,817.066 | 274.779 | 3812.635 |
| LOCAL_EMPLOYEES | 8.043 | 134.000 | 0.000 | 16.030 |
| MONEY | 18.972 | 78.450 | 4.020 | 15.198 |
| PGA_500 | 8.532 | 11.000 | 3.199 | 1.792 |
| PGA_1000 | 11.515 | 14.000 | 5.431 | 1.884 |
| PGA_2400 | 16.695 | 19.000 | 9.222 | 1.959 |
| PUMP | 2.852 | 176.000 | 0.000 | 9.093 |
| PUMP_EP | 444.527 | 14,400.000 | 0.000 | 1637.036 |
| Q | 48,083.905 | 1,600,000.000 | 30.000 | 141,017.670 |
| Q_DAILY | 30,932.308 | 1,081,369.000 | 6.000 | 90,686.216 |
| Q_MAX | 37,629.160 | 1,221,400.000 | 20.000 | 106,454.278 |
| Q_PRO | 11,278,843.986 | 394,699,507.000 | 2288.000 | 33,103,181.424 |
| QT | 1818.253 | 69,120.000 | 0.000 | 6465.596 |
| QW | 2,406,152.687 | 78,414,779.000 | 0.000 | 7,988,910.560 |
| QY | 11,623,330.642 | 402,072,337.000 | 3276.000 | 33,936,442.393 |
| RAIN | 0.084 | 0.900 | 0.009 | 0.123 |
| SWIND | 0.077 | 0.813 | 0.002 | 0.108 |
| TYPHOON | 0.016 | 0.060 | 0.001 | 0.013 |
| WIND_R | 29.084 | 44.000 | 24.000 | 5.143 |
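Table 4. PCA analysis for weight of each component.

| Disaster (Index) | Component | Input Variable (Symbol) | Weight |
|---|---|---|---|
| Typhoon (TI) | REt | WIND_R (REt1) | 0.309 |
| | | TYPHOON (REt2) | 0.345 |
| | | SWIND (REt3) | 0.346 |
| | | Sum | 1.000 |
| | RIt | Q (RIt1) | 0.143 |
| | | QY (RIt2) | 0.143 |
| | | Q_PRO (RIt3) | 0.144 |
| | | PUMP (RIt4) | 0.052 |
| | | PUMP_EP (RIt5) | 0.140 |
| | | QT (RIt6) | 0.132 |
| | | QW (RIt7) | 0.136 |
| | | LINE_DENSE (RIt8) | 0.110 |
| | | Sum | 1.000 |
| | URt | Q_DAILY (URt1) | 0.334 |
| | | Q_MAX (URt2) | 0.334 |
| | | CUSTOMER (URt3) | 0.332 |
| | | Sum | 1.000 |
| | RRt | LOCAL_EMPLOYEES (RRt1) | 0.248 |
| | | MONEY (RRt2) | 0.235 |
| | | EMPLOYEES_SITE (RRt3) | 0.263 |
| | | EMPLOYEES_AREA (RRt4) | 0.254 |
| | | Sum | 1.000 |
| Heavy rain (HI) | REh | WIND_R (REh1) | 0.500 |
| | | RAIN (REh2) | 0.500 |
| | | Sum | 1.000 |
| | RIh | Q (RIh1) | 0.143 |
| | | QY (RIh2) | 0.143 |
| | | Q_PRO (RIh3) | 0.144 |
| | | PUMP (RIh4) | 0.052 |
| | | PUMP_EP (RIh5) | 0.140 |
| | | QT (RIh6) | 0.132 |
| | | QW (RIh7) | 0.136 |
| | | LINE_DENSE (RIh8) | 0.110 |
| | | Sum | 1.000 |
| | URh | Q_DAILY (URh1) | 0.334 |
| | | Q_MAX (URh2) | 0.334 |
| | | CUSTOMER (URh3) | 0.332 |
| | | Sum | 1.000 |
| | RRh | LOCAL_EMPLOYEES (RRh1) | 0.248 |
| | | MONEY (RRh2) | 0.235 |
| | | EMPLOYEES_SITE (RRh3) | 0.263 |
| | | EMPLOYEES_AREA (RRh4) | 0.254 |
| | | Sum | 1.000 |
| Earthquake (EI) | REe | PGA_500 (REe1) | 0.333 |
| | | PGA_1000 (REe2) | 0.336 |
| | | PGA_2400 (REe3) | 0.331 |
| | | Sum | 1.000 |
| | RIe | Q (RIe1) | 0.143 |
| | | QY (RIe2) | 0.143 |
| | | Q_PRO (RIe3) | 0.144 |
| | | PUMP (RIe4) | 0.052 |
| | | PUMP_EP (RIe5) | 0.140 |
| | | QT (RIe6) | 0.132 |
| | | QW (RIe7) | 0.136 |
| | | LINE_DENSE (RIe8) | 0.110 |
| | | Sum | 1.000 |
| | URe | Q_DAILY (URe1) | 0.334 |
| | | Q_MAX (URe2) | 0.334 |
| | | CUSTOMER (URe3) | 0.332 |
| | | Sum | 1.000 |
| | RRe | LOCAL_EMPLOYEES (RRe1) | 0.234 |
| | | EQ1 (RRe2) | 0.056 |
| | | MONEY (RRe3) | 0.222 |
| | | EMPLOYEES_SITE (RRe4) | 0.249 |
| | | EMPLOYEES_AREA (RRe5) | 0.239 |
| | | Sum | 1.000 |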
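As a rough illustration of how the weights in Table 4 could be applied, the sketch below computes the typhoon regional factor (REt) as a weighted sum of its three input variables. The min-max scaling (using the ranges in Table 3), the linear weighted sum, the helper names, and the example plant are assumptions made for this illustration; they are not necessarily the procedure used in the study.

```python
# Illustrative sketch only: the min-max scaling and the linear weighted sum are
# assumptions. The weights come from Table 4 and the ranges from Table 3.

# PCA-derived weights for the typhoon regional factor (REt), Table 4.
RE_T_WEIGHTS = {"WIND_R": 0.309, "TYPHOON": 0.345, "SWIND": 0.346}

# (min, max) of each input variable, Table 3.
RANGES = {
    "WIND_R": (24.000, 44.000),
    "TYPHOON": (0.001, 0.060),
    "SWIND": (0.002, 0.813),
}


def min_max_scale(value, lower, upper):
    """Scale a raw value to [0, 1] using the observed range (assumed scaling)."""
    return (value - lower) / (upper - lower)


def weighted_component(raw_values, weights, ranges):
    """Weighted sum of min-max scaled input variables."""
    return sum(
        weights[name] * min_max_scale(raw_values[name], *ranges[name])
        for name in weights
    )


# Hypothetical plant whose inputs equal the Table 3 averages.
average_plant = {"WIND_R": 29.084, "TYPHOON": 0.016, "SWIND": 0.077}
re_t = weighted_component(average_plant, RE_T_WEIGHTS, RANGES)
print(f"REt for the hypothetical average plant: {re_t:.3f}")
```

Because the weights of each component sum to 1.000, a component built this way stays within [0, 1] whenever the scaled inputs do.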
Table 5. Analytical hierarchy process (AHP) analysis results.

(a) Weights for Disaster Type.

| Disaster | Weight |
|---|---|
| Typhoon | 0.481 |
| Heavy rain | 0.198 |
| Earthquake | 0.321 |
| Sum | 1.000 |
| CR | 0.054 |

(b) Weights for Each Component.

| Disaster | Component | Weight |
|---|---|---|
| Typhoon | REt | 0.275 |
| | RIt | 0.265 |
| | URt | 0.216 |
| | RRt | 0.244 |
| | Sum | 1.000 |
| | CR | 0.017 |
| Heavy rain | REh | 0.279 |
| | RIh | 0.247 |
| | URh | 0.221 |
| | RRh | 0.253 |
| | Sum | 1.000 |
| | CR | 0.004 |
| Earthquake | REe | 0.215 |
| | RIe | 0.370 |
| | URe | 0.235 |
| | RRe | 0.180 |
| | Sum | 1.000 |
| | CR | 0.040 |
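The two weight levels in Table 5 can be combined as in the following sketch, which first forms a disaster index (DI) per disaster from its four components and then forms the TDI from the three DIs. The linear weighted aggregation, the function names, and the component scores of the example plant are assumptions for illustration only; in particular, how the response and recovery factor enters the index is not specified here and may differ in the study.

```python
# Illustrative sketch only: the linear weighted aggregation and the component
# scores are assumptions. The weights come from Table 5.

# Table 5(a): AHP weights for each disaster type.
DISASTER_WEIGHTS = {"typhoon": 0.481, "heavy_rain": 0.198, "earthquake": 0.321}

# Table 5(b): AHP weights for the four components of each disaster.
COMPONENT_WEIGHTS = {
    "typhoon": {"RE": 0.275, "RI": 0.265, "UR": 0.216, "RR": 0.244},
    "heavy_rain": {"RE": 0.279, "RI": 0.247, "UR": 0.221, "RR": 0.253},
    "earthquake": {"RE": 0.215, "RI": 0.370, "UR": 0.235, "RR": 0.180},
}


def disaster_index(scores, weights):
    """Weighted sum of the four component scores of one disaster."""
    return sum(weights[name] * scores[name] for name in weights)


def total_disaster_index(component_scores):
    """Weighted sum of the three disaster indices."""
    return sum(
        DISASTER_WEIGHTS[d] * disaster_index(component_scores[d], COMPONENT_WEIGHTS[d])
        for d in DISASTER_WEIGHTS
    )


# Hypothetical component scores in [0, 1] for a single plant.
plant_scores = {
    "typhoon": {"RE": 0.20, "RI": 0.10, "UR": 0.05, "RR": 0.30},
    "heavy_rain": {"RE": 0.25, "RI": 0.10, "UR": 0.05, "RR": 0.30},
    "earthquake": {"RE": 0.60, "RI": 0.10, "UR": 0.05, "RR": 0.35},
}
tdi = total_disaster_index(plant_scores)
print(f"TDI for the hypothetical plant: {tdi:.3f}")
```

Since every weight set in Table 5 sums to 1.000, a plant scoring 1 on every component would receive a TDI of 1 under this scheme.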
Table 6. Summary of model evaluation results.

| Model | RMSE | R² | RSR |
|---|---|---|---|
| RF | 0.100 | 0.8435 | 0.3957 |
| XGB | 0.093 | 0.8629 | 0.3703 |
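The three metrics in Table 6 can be computed as in the sketch below. RSR follows the usual definition of RMSE divided by the standard deviation of the observations [60]; taking R² as 1 − SSE/SST is an assumption here, although it agrees with Table 6, where 1 − RSR² gives about 0.843 for RF and 0.863 for XGB. The observed and predicted values are made up for the example.

```python
import math

# Illustrative sketch only: the data are hypothetical, and R2 is assumed to be
# the coefficient of determination computed as 1 - SSE/SST.


def _sse(observed, predicted):
    """Sum of squared errors between observations and predictions."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))


def _sst(observed):
    """Sum of squared deviations of the observations from their mean."""
    mean_obs = sum(observed) / len(observed)
    return sum((o - mean_obs) ** 2 for o in observed)


def rmse(observed, predicted):
    """Root-mean-square error."""
    return math.sqrt(_sse(observed, predicted) / len(observed))


def rsr(observed, predicted):
    """RMSE divided by the standard deviation of the observations."""
    return math.sqrt(_sse(observed, predicted)) / math.sqrt(_sst(observed))


def r2(observed, predicted):
    """Coefficient of determination, 1 - SSE/SST."""
    return 1.0 - _sse(observed, predicted) / _sst(observed)


# Hypothetical observed and predicted TDI values.
observed = [0.12, 0.25, 0.31, 0.48, 0.77]
predicted = [0.15, 0.22, 0.35, 0.44, 0.70]

print(f"RMSE = {rmse(observed, predicted):.4f}")
print(f"R2   = {r2(observed, predicted):.4f}")
print(f"RSR  = {rsr(observed, predicted):.4f}")
```

Under these definitions R² = 1 − RSR² holds exactly, so the two columns carry the same information; RMSE adds the error magnitude in the units of the TDI.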
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Park, J.; Park, J.-H.; Choi, J.-S.; Joo, J.C.; Park, K.; Yoon, H.C.; Park, C.Y.; Lee, W.H.; Heo, T.-Y. Ensemble Model Development for the Prediction of a Disaster Index in Water Treatment Systems. Water 2020, 12, 3195. https://doi.org/10.3390/w12113195
