1. Introduction
Due to urbanization and population growth in metropolitan areas, the water quality (WQ) of urban rivers is changing, water pollution is increasing, and WQ accidents occur frequently around the globe [1,2,3]. Despite river maintenance projects, the WQ of downtown rivers continues to deteriorate. Dissolved oxygen (DO) is among the WQ elements of downtown rivers that are worsening due to water pollution [4], and as a result, various WQ accidents occur frequently. According to the Seoul Institute of Health and Environment (2018), over the past 13 years (2005–2017), there were about 50 WQ accidents in Seoul. For this reason, it is necessary to intensively manage the WQ and aquatic ecosystems of urban rivers [5].
The importance of monitoring WQ in urban rivers is only increasing, as WQ deteriorates and WQ accidents occur more frequently due to the concentration of populations in large cities [6,7]. Since 1990, Seoul has operated an automatic WQ measurement network that measures WQ on an hourly or daily basis in order to track changes in the WQ of urban rivers [5]. These efforts have made it possible to develop quantitative and sophisticated predictive model algorithms for WQ changes in urban rivers caused by population concentration in large cities [8,9].
Studies to date have sought to predict changes in the WQ of urban rivers in large cities by applying traditional time series models to data accumulated by automatic WQ measurement networks [10,11]. Recently, however, the scale of measurement data has become vast and measurement intervals have shortened due to the development of internet-of-things (IoT) technology, which makes such data difficult to process with existing time series models [12]. First, the measured variables exhibit non-linear relationships. Second, because these models assume that the covariance between the time series moving average and the observed values does not change with time, it is difficult for them to reflect long-term changes. Third, they have difficulty learning from discontinuous time series data.
Since prediction is performed based on input data, machine learning algorithms, developed to be universally applicable to data and image analysis, can be used flexibly in various fields, and their use in the WQ field is growing rapidly. The ensemble model, which improves performance by combining the results of several models, is relatively uncomplicated and offers excellent predictive performance compared to deep learning models. For this reason, it has been used in various fields until recently [13,14,15,16,17].
Recently, the number of studies using machine learning techniques to process and model massive data has been increasing [17,18]. For efficient WQ management, it is necessary to check the current status of WQ and predict changes that are likely to occur. For this purpose, various WQ prediction models based on WQ, environmental conditions, hydrometeorological factors, etc., have been developed and utilized [11,19,20,21].
Therefore, this study defined three tasks for predicting the WQ of urban rivers: accidents caused by the deterioration of urban river WQ, the degree of that deterioration, and changes in water environment data.
2. Materials and Methods
The scope of this study was to develop a model to predict dissolved oxygen (DO), a key indicator of water pollution, and to evaluate its predictive performance. There were three main processes: (i) initial model development, (ii) model optimization, and (iii) performance evaluation.
This study implements an algorithm for predicting WQ using machine learning, based on data provided by a state agency. A machine learning model can improve performance by selecting input variables suited to the characteristics of the items to be predicted, which increases its practical applicability. In addition, by utilizing boosting techniques, frequent urban river WQ problems can be prevented by predicting the deterioration of urban river WQ and changes in water environment data.
Machine learning techniques address the limitations noted above. First, they can model non-linear relationships between variables and capture correlations among the training variables. Second, the long-term correlation of time series data is reflected in learning. Third, because data segments are used for learning, they perform well in learning and predicting discontinuous time series data; for these reasons, they are currently being actively used in WQ prediction models [22,23,24,25].
In this study, we first built initial models centered on the gradient boosting (GB) model and random forest, which are representative ensemble algorithms, and then optimized the model with the best predictive power. In particular, we attempted to increase the predictive power and learning speed of the implemented algorithm by selecting appropriate parameters through Grid Search, adjusting the user-specified loss function and learning rate. Using AdaBoost, one of the most widely used GB-family algorithms, a model was built to predict the WQ concentration of the Hwanggujicheon in Korea. In addition, we examined how the input data used for building the model affect the outcomes of the analyses.
Additionally, the scalability of the machine-learning-based urban river DO prediction model was considered by using the open-source programming language Python (version 3.6) and the open-source libraries Keras and Orange 3 for model development and validation. We also propose a new WQ prediction model, reflecting the shift in urban river WQ prediction techniques from traditional time series models to machine-learning-based models.
The initial prediction model for each measurement point was optimized to predict the DO level, and the final prediction model was developed through predictive performance evaluation. After that, an operating algorithm was developed to derive the optimal set values of the system control variables, and finally, the predictive power was confirmed by applying simulated and actual data.
In addition, predictive performance and reliability were identified through evaluation. To assess the predictive performance and reliability of the developed algorithm, we used the correlation criterion between measured and predicted values presented in ASHRAE (American Society of Heating, Refrigerating and Air-Conditioning Engineers) Guideline 14, R² > 0.8, together with the coefficient of variation of the root mean square error, CVRMSE < 30% [21].
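For reference, these two acceptance criteria can be computed directly; the following minimal Python sketch (an illustration, not part of the study's published code) returns R², CVRMSE, and whether both thresholds are met.

```python
import numpy as np

def evaluate_ashrae(y_true, y_pred):
    """Check predictions against the ASHRAE Guideline 14 criteria
    used in this study: R^2 > 0.8 and CVRMSE < 30%."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    cvrmse = 100.0 * rmse / y_true.mean()            # expressed as a percentage
    return r2, cvrmse, (r2 > 0.8 and cvrmse < 30.0)
```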
2.1. Literature Review
2.1.1. Overview of Gradient Boosting and Research Cases
The GB model is an ensemble of decision trees; unlike the random forest bagging algorithm, each tree is created in a way that compensates for the errors of the previous trees. The GB model has no randomness and builds trees whose depth does not exceed five per tree. The GB modeling method can therefore be described as connecting as many shallow trees as possible [26]. Friedman's (2001) GB algorithm is as follows [5,27].
Here, x is an explanatory variable, y is a dependent variable, and L(y, F(x)) is a differentiable loss function; as in Equation (3), pseudo-residuals are calculated by repeating the procedure m times.
After fitting the base learner h_m(x) to the calculated pseudo-residuals, the process of calculating and updating the residuals is repeated m times. The loss function quantifies the error of the prediction model, and to find the model parameters that minimize the loss function value, general machine learning models use the gradient descent method.
GB performs this loss-function minimization in the space of model functions F(x) rather than in parameter space: according to Equation (4), the loss function is differentiated with respect to the tree model function learned so far, not the model parameters. In Equation (4) below, ν is the learning rate.
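The equations referenced above did not survive extraction; in Friedman's standard formulation, Equations (3) and (4) take the following form (a reconstruction, with h_m denoting the base learner and ν the learning rate):

r_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)}, \quad i = 1, \ldots, n    (3)

F_m(x) = F_{m-1}(x) + \nu \, h_m(x)    (4)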
That is, in the GB model, the derivative of the tree model function indicates the weakness of the model trained so far, and when fitting the next tree model, the derivative is used to compensate for that weakness and boost performance [28]. The GB algorithm was created for classification with the logistic likelihood for different classes and for regression with the least absolute deviation, Huber-M, and least squares loss functions [29]. It provides a very powerful and competitive environment for mining regression and classification problems, especially with less-than-clean data sets.
GB takes a forward stagewise additive approach through gradient descent in function space, sequentially constructing regression trees for each feature in a fully distributed manner. GB involves three basic factors: the loss function must be adjusted, the weak learner model must produce predictions, and the additive model must merge all weak learners to reduce the overall loss function value; a minimal sketch of this loop is given below. The basic structure of the GB machine algorithm is shown in Figure 1 [30].
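As an illustration of these three factors, the following minimal Python sketch (using scikit-learn decision trees; the study itself used Orange 3 and Keras, so this is an assumption for illustration) fits each shallow tree to the residuals of the squared-error loss and accumulates the trees additively. The parameter values echo those reported for the initial GB model in Section 3.1.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_trees=13, learning_rate=0.464, max_depth=5):
    """Minimal gradient boosting for squared-error loss: each shallow tree
    (the weak learner) is fit to the current residuals, i.e., the negative
    gradient of the loss, and added to the ensemble."""
    f0 = float(np.mean(y))                 # initial constant model
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred               # negative gradient of 0.5*(y - F)^2
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        pred = pred + learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(X, f0, trees, learning_rate=0.464):
    """Additive model: the constant plus the scaled sum of all weak learners."""
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred = pred + learning_rate * tree.predict(X)
    return pred
```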
Its advantage is that, like other tree-based models, it works well on datasets with mixed feature scales and with both nominal and numeric variables. Its disadvantages are that it is sensitive to parameters and its training time is long; its performance is also known to be poor on very high-dimensional data sets [28].
Turning to AdaBoost, Freund's AdaBoost algorithm is the most widely used boosting algorithm [30]. AdaBoost is a high-accuracy model that uses a decision tree as its base model.
It therefore trains on updated weights and aggregates the results obtained from multiple decision trees. A particular advantage of AdaBoost is that the number of parameters to be estimated is small compared to other learning methods. In addition, when boosting is trained with respect to false positives, a cascade classification model can easily be constructed in stages, each with a positive error rate below a certain standard. Moreover, because a weak classifier selects one specific dimension at each step, AdaBoost can also be applied to feature selection.
AdaBoost is a learning technique that generates a strong classifier by repeatedly training a weak classifier on samples from two classes. Figure 2 shows the basic model of AdaBoost. The input is given as input–output pairs (x_i, y_i), and the weak learner classifier initially assigns the same weight to all the data. When the training of the first classifier is completed, the weights of the data to be applied to the second classifier are modified according to the result.
At this time, the weight decreases for samples classified without error and increases for samples classified with error; the AdaBoost algorithm thus focuses on erroneous (highly weighted) data. This process is performed m times.
Each classifier is trained using the adjusted weights, and in the final combining step, the value of α_i obtained during training is applied so that classifiers with small error rates play a more important role in the decision [31].
The AdaBoost classifier can be obtained using the following steps.
First, we obtain the training data {(x_n, y_n)}, n = 1, …, N. In the k-th step (k = 1, 2, …, T), the probabilities p_n^{(k)} are used to sample with replacement from the training data and generate a new training set. A classifier h_k is generated using the generated training data; if the n-th observation is improperly classified, the indicator I_n = 1, and if the n-th observation is properly classified, I_n = 0.

The error ε_k is defined as the following equation:

\epsilon_k = \sum_{n=1}^{N} p_n^{(k)} I_n

The (k + 1)-th probabilities to be updated are as follows, with classifier weight \alpha_k = \ln\bigl((1 - \epsilon_k)/\epsilon_k\bigr):

p_n^{(k+1)} = \frac{p_n^{(k)} \exp(\alpha_k I_n)}{\sum_{j=1}^{N} p_j^{(k)} \exp(\alpha_k I_j)}

This process is repeated for the T steps. After completing the T-th step, the classifiers h_1, …, h_T are combined, each with weight α_k, to create the final classifier [26].
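To make these steps concrete, here is a minimal from-scratch sketch in Python (an illustration under the stated two-class setting, with labels in {−1, +1} and scikit-learn decision stumps as weak classifiers; it is not the study's implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=7):
    """Minimal discrete AdaBoost for two-class labels y in {-1, +1}:
    weighted error, classifier weight alpha, and multiplicative weight
    updates that emphasize misclassified observations."""
    n = len(y)
    p = np.full(n, 1.0 / n)                  # equal initial weights
    learners, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=p)     # weak classifier on weighted data
        miss = stump.predict(X) != y         # indicator I_n
        eps = float(np.dot(p, miss))         # weighted error epsilon_k
        alpha = 0.5 * np.log((1.0 - eps) / max(eps, 1e-10))
        p = p * np.exp(alpha * np.where(miss, 1.0, -1.0))
        p = p / p.sum()                      # renormalize to a distribution
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    """Final classifier: sign of the alpha-weighted vote of all learners."""
    votes = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(votes)
```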
The advantage of AdaBoost is that it is adaptive, because instances misclassified by previous classifiers are re-weighted for subsequent classifiers. A disadvantage is that AdaBoost is sensitive to noisy data and outliers [32].
The AdaBoost model is optimized to minimize an objective function composed of a loss function L, which measures the difference between the measured value y_i of the item to be predicted and the model's predicted value ŷ_i, and a regularization function Ω, which is a function of the individual decision tree (DT) models f_k [33,34,35].
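In the notation above, this objective can be written as follows (a standard rendering; the formula itself is not reproduced in the source text):

\mathrm{Obj} = \sum_{i} L(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k)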
In this study, the optimal prediction algorithm was implemented using the AdaBoost algorithm. WQ measurement data were used as independent variables to predict the dependent variable DO. The grid search method was used to optimize the model, and cross-validation was performed by dividing the input data into 10 sets. Model construction and optimization were performed using open-source Python libraries [36].
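A minimal sketch of this tuning step, assuming scikit-learn (the study names only Python, Keras, and Orange 3, so this specific API and the parameter grid are assumptions for illustration):

```python
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import GridSearchCV

# Hypothetical grid; the text reports tuning via Grid Search with the
# input data divided into 10 cross-validation sets.
param_grid = {
    "n_estimators": [5, 7, 9, 13],
    "learning_rate": [0.1, 0.5, 1.0],
    "loss": ["linear", "square", "exponential"],  # the three losses compared later
}
search = GridSearchCV(
    AdaBoostRegressor(random_state=155),  # seed 155 is reported in Section 3.2
    param_grid,
    cv=10,            # 10-fold cross-validation
    scoring="r2",
)
# search.fit(X_train, y_train)   # X_train, y_train: scaled WQ features and DO
# print(search.best_params_)
```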
2.1.2. Prior Research
Changes in river WQ have been predicted through traditional time series modeling for various forms of water pollution [10,37], and the amount of research being conducted is on the rise [23,38]. As the size of the data has grown and the limitations of traditional time series models have been revealed, deep-learning-based and machine-learning-based prediction models have emerged as an alternative [12].
For WQ prediction based on deep learning, Lim and An (2018) used recurrent neural network (RNN) and long short-term memory (LSTM) algorithms to predict the pollution load [19].
Among machine-learning studies, one [11] presented a model for predicting Chl-a concentration using artificial neural networks (ANN) and support vector machines (SVM), which are representative machine learning algorithms, and Kwon et al. (2018) predicted Chl-a concentration using ANN and SVM algorithms together with satellite image data [39]. Lee et al. (2020) built a prediction model using random forest (RF) and the gradient boosting decision tree (GBDT), representative ensemble algorithms that improve performance by combining the results of several models. Research on predicting WQ changes with machine-learning models based on advanced data analysis technology is also active, and such models have recently been used in various fields [13,14,19]. In some cases, studies were performed with LightGBM [16,33,35,40,41,42].
Categorized by target variable, previous studies include PM concentration prediction [22], Chl-a concentration prediction [11,39], pollution-load prediction [19,43], prediction of other variables [44], and image recognition [45].
In summary, many prior studies on WQ prediction use deep learning techniques. However, no previous study has developed a model for predicting DO concentration in urban rivers using a GB-based boosting algorithm, and no algorithm has predicted the DO concentration downstream in urban rivers using AdaBoost of the GB series, which shows high predictive power.
2.2. GB Series Prediction Model Development
2.2.1. Data Sources
The data used in this study were provided by https://aihub.or.kr (accessed on 23 March 2022). AI Hub is an integrated AI platform operated by the Korea Intelligent Information Society Agency. As part of the 2017 AI learning data building and dissemination project, it aims to provide, in one place, the AI data, software, computing resources, and material information essential for AI technology and service development [46].
The data for AI learning in this study were the WQ measurement data of the water environment measurement network, covering WQ, automatic measurement, total amount, sediment, radioactive material, KRF, and related measurement data. The detailed sources for the water-quality-related fields are the National Institute of Environmental Sciences of the Ministry of Environment and the Korea Water Resources Corporation [46].
In the pretreatment process, the data corresponding to Hwanggujicheon were extracted. Hwanggujicheon is a national river that originates at Obongsan in Uiwang-si, Gyeonggi-do, and flows southward, fed by tributaries in the Suwon area such as the Osancheon, Homaesilcheon, Seohocheon, Suwoncheon, and Woncheoncheon, before joining the Jinwicheon in Seotan-myeon, Pyeongtaek-si [47].
Figure 3 shows the stream (cheon) that is the subject of this study.
In Table 1, the latitude is 37.23056, the longitude is 126.9936, the catchment area ID (CAT_ID) is 11011204, and the CAT_DID division area of 1.1 × 10⁹ denotes Hwanggujicheon 1. However, the WQ measurement network entry for Hwanggujicheon-1 appears to be an error, as the location indicated by the above longitude corresponds to a different area; it was therefore excluded from this study.
Figure 3 shows the relevant area for the estimation of water pollution in this study. The dataset also provides the name and value of each measurement item and whether the item has been refined. A total of 20 items are provided, including measurement date, flow rate (m³/s), water temperature (°C), pH, DO (mg/L), BOD (mg/L), COD (mg/L), SS (mg/L), EC (μS/cm), T-N (mg/L), DTN (mg/L), NO₃-N (mg/L), NH₃-N (mg/L), T-P (mg/L), DTP (mg/L), PO₄-P (mg/L), chlorophyll-a, and TOC (mg/L).
As shown in Table 2, electrical conductivity (EC), total phosphorus (T-P), chlorophyll-a, flow rate, phosphate (PO₄-P), and total organic carbon (TOC) were excluded due to missing values. Monthly data from January 2008 to December 2020 were used. The data in this study did not form a regular time series: the measurement interval is irregular, and monthly data for certain years are missing. Parts with many missing values were deleted; for example, chlorophyll-a measurements exist only for recent dates, with no values before 2020. Regarding the number of collected records, the water environment field contained 264,147,400 cases, from which the data related to the WQ of the Hwanggujicheon were extracted.
The data used in this study were source data collected from the National Institute of Environmental Sciences, Statistics Korea, and the Korea Meteorological Administration, and were primarily refined based on related laws, such as the announcement of the water environment monitoring network operation plan. For refinement, outliers were identified and removed by determining whether values fell within the confidence interval. In addition, cross-validation between data construction institutions and inspection institutions was performed by designating a dedicated inspection team among the participating institutions, and an expert inspection was performed by a national consultative body composed of water-quality experts from the National Institute of Environmental Sciences [46].
In this study, MinMaxScaler() was used for scaling after data preprocessing. The normalization method used in the DO prediction model was min–max scaling, which puts all input variable characteristics on the same range: each variable is mapped to a value between 0 and 1, with the smallest value converted to 0 and the largest to 1, so that all properties lie in the range (0–1). Records with many missing values were deleted.
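A minimal preprocessing sketch, assuming pandas and scikit-learn, with hypothetical file and column names (the dataset schema is not fully specified in the text):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical file and column names for the nine input features and DO.
features = ["pH", "SS", "water_temp", "TN", "DTP", "NHN", "COD", "DTN", "NON"]
df = pd.read_csv("hwanggujicheon_wq.csv")     # hypothetical extract of the AI Hub data
df = df.dropna(subset=features + ["DO"])      # rows with missing values deleted

scaler = MinMaxScaler()                       # maps each feature to the range [0, 1]
X = scaler.fit_transform(df[features])        # smallest value -> 0, largest -> 1
y = df["DO"].to_numpy()
```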
The number of instances extracted through preprocessing to build the DO prediction model for Hwanggujicheon, the research target area, is 761, measured between 2008 and 2020. This count excludes records removed because of missing values or data errors.
Each element and sub-item was selected through the literature search and prior research; sub-items that were not properly learned or had many missing values were removed when constructing the DO prediction model. Nine features were used as the modeling optimization factors of the DO prediction model. As for the model variables, the DO data from the WQ measurement network are used as the dependent variable of the boosting-based DO prediction model, and the data of nine WQ items from the automatic WQ measurement network were used as the independent (input) variables.
The criteria for selecting the learning data were the literature search, previous studies, and the living-environment criteria items for rivers and lakes, based on Article 12 (2) of the Framework Act on Environmental Policy (setting of environmental standards) and the environmental standards of the enforcement decree of the same act. However, total organic carbon (TOC) was excluded because it had many missing values and too many unmeasured areas, and biochemical oxygen demand (BOD) was excluded because it is measured from the amount of DO.
In the boosting-based DO prediction model, the algorithm was initially applied to each measurement point, but sufficient results were not obtained due to limited regional data; therefore, this study used the whole of Hwanggujicheon. Data partitioning was set to 80:20, and a 10-fold cross-validation method was used, as sketched below. In addition, a simulation was performed using the latest data to evaluate the prediction algorithm, but the number of instances was insufficient and the predictive power was minimal.
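Continuing the sketch above, the 80:20 split and 10-fold cross-validation could look as follows (scikit-learn is assumed; the seed value 155 is the one reported for the optimized model below):

```python
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import train_test_split, cross_val_score

# 80:20 partition of the preprocessed data (X, y from the sketch above).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=155
)

model = AdaBoostRegressor(n_estimators=7, learning_rate=0.5, random_state=155)
scores = cross_val_score(model, X_train, y_train, cv=10, scoring="r2")
print(scores.mean())   # average R^2 over the 10 folds
```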
2.2.2. Statistical Data and Its Visualization
The correlations of the data used in this study are as follows.
Table 4 shows the statistics of the data.
Visualization of the individual data is based on the index, and each region has a similar shape (Figure 5).
3. Results
3.1. Initial Model and Results
First, we designed a bagging-based random forest. As model parameters, the number of trees was set to nine and the maximal number of considered features to five. Replicable training was not set, the maximal tree depth was set to five, and node splitting was stopped at a minimum of two instances.
There are 609 training-data instances, and the features are pH, SS, water temperature, TN, DTP, NHN, COD, DTN, and NON. The index is used as a meta-attribute, and DO is used as the target variable.
For the boosting-based gradient boosting model, the number of trees was set to 13, the learning rate to 0.464, and replicable training was enabled. The maximum tree depth was set to 5 and the regularization strength to 1. The fraction of training instances was set to 0.899, the fraction of features for each tree to 0.899, the fraction of features for each level to 0.849, and the fraction of features for each split to 0.499.
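For reference, these two initial models correspond roughly to the following scikit-learn configurations (an approximation: the study used Orange 3, and the reported per-level and per-split feature fractions resemble XGBoost's colsample options with no direct scikit-learn equivalent; the seed is hypothetical):

```python
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Bagging-based random forest with the parameters reported above.
rf = RandomForestRegressor(
    n_estimators=9,          # number of trees
    max_features=5,          # maximal number of considered features
    max_depth=5,             # maximal tree depth
    min_samples_split=2,     # stop splitting nodes below this size
    random_state=155,        # hypothetical seed for reproducibility of this sketch
)

# Boosting-based gradient boosting with the shared parameters; the per-level
# and per-split feature fractions from the text are omitted here.
gb = GradientBoostingRegressor(
    n_estimators=13,
    learning_rate=0.464,
    max_depth=5,
    subsample=0.899,         # fraction of training instances
    max_features=0.899,      # fraction of features for each tree
)
```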
Looking at the test scores on the training data in Table 5, AdaBoost shows the best learning ability, with an R² of 0.998 and a CVRMSE of 2.199. In contrast, the random forest, at 0.925 and 15.372, lacks explanatory power. In all three models the MSE is 0.000, but RMSE and MAE differ, as does the running time.
Table 6 shows the results learned by 10-fold cross-validation. AdaBoost again shows the best learning ability, with an R² of 0.896 and a CVRMSE of 18.082, while the random forest has relatively poor explanatory power at 0.887 and 18.874. Again, although the MSE is 0.000, RMSE and MAE differ.
For the predictions of the initial modeling, the data comprise 152 instances, 11 variables, and 9 features (no missing values), and the target variable is DO. The three models used were gradient boosting, AdaBoost, and random forest (Figure 6).
Table 7 presents the prediction results, which show the same relative predictive power as the previous results: AdaBoost is the best across the evaluation indices. As in the training process, the MSE values are the same, but RMSE, MAE, R², and CVRMSE differ. AdaBoost's RMSE of 0.016 is relatively close to 0, its R² of 0.901 is closest to 1, and its CVRMSE is relatively low at 18.435, satisfying the evaluation criteria for both R² and CVRMSE.
3.2. Optimal Model and Design
In designing the optimized model, AdaBoost was selected because its learning and predictive abilities were superior to those of the bagging-based random forest and the GB-based XGBoost. In addition, we aimed to improve prediction performance by adjusting the basic parameters.
For the AdaBoost model parameters, the base estimator is a tree and the number of estimators is seven; the learning rate is 0.500. For reproducibility, the fixed seed for the random generator was set to 155. There are 609 data instances, and the features are pH, SS, water temperature, TN, DTP, NHN, COD, DTN, and NON. The index is used as a meta-attribute, and the target was set to DO.
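In scikit-learn terms, this configuration corresponds roughly to the following sketch (an assumption for illustration; the study configured the model in Orange 3):

```python
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

ada = AdaBoostRegressor(
    estimator=DecisionTreeRegressor(),  # base estimator: tree
                                        # (named base_estimator in older scikit-learn)
    n_estimators=7,
    learning_rate=0.5,
    loss="exponential",                 # initial choice; compared with linear/square below
    random_state=155,                   # fixed seed for the random generator
)
# ada.fit(X_train, y_train); y_pred = ada.predict(X_test)
```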
As Table 8 shows, the evaluation indices differ according to the form of the loss function. Comparing R² and CVRMSE, R² is 0.999 for both the linear and exponential loss functions, but the CVRMSE is 2.066 for linear and 1.463 for exponential; the latter is closer to 0, indicating better learning ability. In this study, the exponential function was first selected as the loss function and the other parameters were held fixed to estimate predictive ability; the predictive ability was then compared by applying the linear and square loss functions.
3.3. Predictive Performance Evaluation
Various indices are used to evaluate the prediction performance of machine learning models, including the root mean square error (RMSE), mean absolute error (MAE), mean square error (MSE), R², and CVRMSE. To evaluate the DO prediction performance of the AdaBoost model constructed in this study, RMSE, MSE, MAE, CVRMSE, R², and running time were used, with R² and CVRMSE as the main indices in this paper. MAE and MSE often take the same value, in which case predictive performance cannot be properly distinguished, while RMSE compares the absolute magnitude of the difference between the predicted and measured values. Among these indicators, MSE, MAE, RMSE, and CVRMSE indicate better performance the closer they are to 0, and R² the closer it is to 1.
RMSE is the square root of the MSE; taking the root converts the error index back to units similar to the actual values, which makes interpretation somewhat easier.
MSE is the main loss function of the regression model, and it is defined as the mean square of the errors, which is the difference between the predicted value and the actual value. Because it is squared, it is sensitive to outliers. MAE is the mean of the absolute values of errors, which is the difference between the actual value and the predicted value and is less sensitive to outliers than the MSE.
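In standard notation, with y_i the measured value, ŷ_i the predicted value, and n the number of samples, these indices are defined as:

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \quad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \quad \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert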
R² (coefficient of determination) is a variance-based prediction performance evaluation index. Other indicators, such as MAE and MSE, take values that depend on the scale of the data, whereas R² allows an intuitive judgment of relative performance. That is, the R² score is an index that measures prediction accuracy by comparing the variance of the predicted values with the variance of the actual observations.
R² is expressed as a number from 0 to 1; the better the linear regression fits the data, the closer R² is to 1. R² is obtained from the residual sum of squares and the sum of squared deviations about the mean as follows:

R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}

where ŷ_i is the fitted value and ȳ is the mean value.
The coefficient of variation of the root mean square error (CVRMSE) is a measure suggested in ASHRAE (American Society of Heating, Refrigerating and Air-Conditioning Engineers) Guideline 14 for understanding the predictive performance and reliability of the optimized AdaBoost model.
The prediction accuracy of the AdaBoost model was evaluated using the correlation criterion (R² > 0.8) and the coefficient of variation of the root mean square error (CVRMSE) [21].
To analyze the predictive performance and reliability of the AdaBoost model, actual water pollution data and predicted results were compared. ASHRAE provides statistical criteria for comparing measured data with simulation results, and the predictive performance of the AdaBoost model was evaluated mainly with the R² value and CVRMSE. R², representing the model's explanatory power, measures how much of the variation in the output variable is explained by the input variables; the correlation was judged appropriate when it met the ASHRAE standard of 0.8 or higher. CVRMSE, the coefficient of variation (CV) of the RMSE, was used as a measure of the difference between actual and predicted values.
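In standard notation, CVRMSE normalizes the RMSE by the mean of the measured values and is expressed as a percentage:

\mathrm{CVRMSE} = \frac{\mathrm{RMSE}}{\bar{y}} \times 100\,(\%)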
4. Discussion
Table 9 shows the learning performance according to the loss function, with the hyperparameters obtained by Grid Search. The train time is 0.074–0.081 s and the test time 0.003–0.008 s. The MSE shows the same value for every loss function.
For RMSE, loss: square is 0.002 and loss: exponential is 0.001, so the exponential loss function is closest to 0. For MAE, the values are 0.001 and 0.000, so loss: exponential is again closest to 0. Similarly, for R², loss: exponential is 0.999 against 0.998 for loss: square, slightly closer to 1. Since R² is 80% or more in every case, all loss functions are suitable by the goodness-of-fit criterion. For CVRMSE, loss: square is highest at 2.199, followed by 2.066 and 1.463; all are appropriate values of less than 30%.
Figure 7 corresponds to the model learned with AdaBoost's loss function loss: square. As shown in Table 9, the learning ability is close to 1. The loss function loss: square was selected in this study because it shows better predictive power than the other loss functions in the prediction output of Table 10.
Figure 8 corresponds to a line plot of the training output and shows the distribution of values for each variable.
Table 10 shows the performance evaluation according to the loss function. The loss functions loss: linear and loss: exponential show relative overfitting in comparison to loss: square. The MSE of all loss functions is equally 0, and for RMSE, loss: square is the closest to 0; for MAE, however, loss: exponential is relatively closer to 0 than the other two loss functions.
This is probably because the MAE is calculated as the average of the absolute errors between the actual and predicted values and is therefore less sensitive to outliers. For R², loss: square is 0.912, loss: linear is 0.901, and loss: exponential is 0.889, so loss: square is closest to 1. Unlike the train scores, here loss: square is closer to 1 than loss: linear and loss: exponential, showing that its predictive power is high; loss: linear also shows better explanatory power than loss: exponential.
For CVRMSE, loss: square is 17.404, loss: linear is 18.435, and loss: exponential is 19.501; loss: square is closest to 0, so the reliability of the model is high and the model evaluation is the most valid.
Therefore, when the model is evaluated based on R² and CVRMSE, loss: square shows a higher model fit and predictive power than loss: linear and loss: exponential. All cases where R² is 0.8 or more and CVRMSE is less than 30 are accepted, and AdaBoost's loss function loss: square showed higher predictive power than the other loss functions.
Figure 9 shows the distribution of the results as a line plot of the prediction output. The line plot of the prediction output has a shape similar to that of the training output. Because the train data and test data are divided, there is a difference in the depth.
Figure 10 shows the result predicted by AdaBoost’s loss function loss: square.
To summarize the experimental results, among the random forest, XGBoost, and AdaBoost models tuned by Grid Search, the AdaBoost algorithm showed the best predictive power for this research model, and adjusting its hyperparameters increased its predictive power further.
In other words, the results predicted with AdaBoost's loss function loss: square showed slightly lower learning power than the loss functions loss: linear and loss: exponential, but better predictive power on the verification data.
Therefore, in this study, AdaBoost’s loss function loss: square was selected and the prediction algorithm was implemented. The implemented prediction result was closer to 1 with R2 0.912, and the model predictive power was high. In addition, the CVRMSE was 17.404, which is closer to 0 than other loss functions; the reliability of the model is high and it is more valid in model evaluation. Therefore, if the model of the implementation algorithm of this study is evaluated based on R2 and CVRMSE, all cases where R2 is 0.8 or more and CVRMSE is less than 30 are accepted, showing high model fit and predictive power.
5. Conclusions
This study is a preliminary stage in the development of an algorithm to predict the optimal WQ of Hwanggujicheon based on data from the open AI Hub, and it implemented an algorithm to predict WQ using AdaBoost.
The conclusion is summarized as follows.
First, a WQ prediction model for Hwanggujicheon was implemented using AdaBoost. This prediction model can be used to predict and utilize WQ by selecting representative points in the water source protection areas of the four major rivers and applying it there as a pilot.
Second, to implement a boosting-based WQ prediction algorithm, the AdaBoost algorithm, which had excellent predictive performance and model suitability, was selected from among the random forest and GB-based boosting models. To predict the optimized WQ, the input variables of the AdaBoost model were pH, SS, water temperature, TN, DTP, NHN, COD, DTN, and NON, and DO was used as the target variable.
Third, using random forest and GB-series algorithms in the initial model made it possible to analyze the prediction accuracy according to the input variables, and the algorithm with the best predictive power was selected. After the optimization process, the square loss function scored lower on the training data criteria (R² and CVRMSE) than the other loss functions, but it was selected because R² and CVRMSE of the prediction score were adopted as the selection criteria.
Fourth, in the performance evaluation of the final predictive model, the RMSE was 0.015, the MAE was 0.009, the R² was 0.912, and the CVRMSE was 17.404, indicating that the developed predictive model meets the criteria of ASHRAE Guideline 14.
The WQ measurement algorithm of this study can also serve as a policy suggestion. In the policy field, national and administrative agencies, such as the Environmental Technology Institute and the Korea Water Resources Corporation, can refer to the data and models for WQ measurement and pollution-source prediction from environmental-pollution AI data when evaluating and developing WQ prediction models. It can also support decision-making regarding environmental and urban policy.
Future directions for this study include developing an operating algorithm for the WQ prediction system, controlling the set values and variables of each system, and applying it to simulations and actual WQ prediction. In addition, WQ prediction should make it possible to prevent WQ accidents, such as fish kills, in advance.