**3. Data and Methods**

#### *3.1. Data Collection and Processing*

Eight years (2010 to 2017) of bimonthly WQ data at four WQ monitoring stations alongside the La Buong River (Figure 1) were collected from the Dong Nai Department of Natural Resources and Environment. The measured WQ data consisted of ten variables: temperature (T), pH, DO, BOD, COD, turbidity (TUR), total suspended solid (TSS), coliform, ammonium (NH<sub>4</sub><sup>+</sup>), and phosphate (PO<sub>4</sub><sup>3−</sup>). Sampling, preservation, storage, and analysis procedures followed the national guidelines for monitoring surface water.

In the current study, the ten WQ variables were utilized to compute the WQI based on Decision No. 879/QD-TCMT, issued by the Ministry of Natural Resources and Environment (MONRE) of Vietnam [20]. The WQI is expressed as follows:

$$\text{WQI} = \frac{\text{WQI}\_{\text{pH}}}{100} \left[ \frac{1}{5} \sum\_{\mathbf{a}=1}^{5} \text{WQI}\_{\mathbf{a}} \times \frac{1}{2} \sum\_{\mathbf{b}=1}^{2} \text{WQI}\_{\mathbf{b}} \times \text{WQI}\_{\mathbf{c}} \right]^{1/3} \tag{1}$$

where WQI<sub>a</sub> is the WQI value for the chemical variables (DO, BOD, COD, NH<sub>4</sub><sup>+</sup>, and PO<sub>4</sub><sup>3−</sup>), WQI<sub>b</sub> is the WQI value for the physical variables (TSS and TUR), WQI<sub>c</sub> is the WQI value for the biological variable (coliform), and WQI<sub>pH</sub> is the WQI value for pH.
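The aggregation in Equation (1) can be sketched in a few lines of Python. This is only an illustration: the function name `aggregate_wqi` is hypothetical, the sub-index values in the example are made up, and the computation of each sub-index from the measured concentrations (defined in the MONRE guideline [20]) is not reproduced here.

```python
def aggregate_wqi(wqi_ph, chemical, physical, wqi_coliform):
    """Combine sub-indices (each assumed to lie on a 0-100 scale) as in Equation (1)."""
    if len(chemical) != 5 or len(physical) != 2:
        raise ValueError("expected 5 chemical and 2 physical sub-indices")
    mean_chemical = sum(chemical) / 5   # (1/5) * sum of WQI_a
    mean_physical = sum(physical) / 2   # (1/2) * sum of WQI_b
    product = mean_chemical * mean_physical * wqi_coliform
    return (wqi_ph / 100) * product ** (1 / 3)


# Hypothetical sub-index values, for illustration only
example = aggregate_wqi(
    wqi_ph=100,
    chemical=[64, 64, 64, 64, 64],  # DO, BOD, COD, NH4+, PO4(3-)
    physical=[64, 64],              # TSS, TUR
    wqi_coliform=64,
)
print(round(example, 2))
```

Because the three groups are multiplied before the cube root is taken, a coliform sub-index of zero drives the whole index to zero regardless of the other variables.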

Based on the WQI values, the river water quality is classified into five levels: excellent (WQI = 91–100), good (WQI = 76–90), fair (WQI = 51–75), poor (WQI = 26–50), and very poor (WQI = 0–25). Full details on the guideline for calculating WQI can be found in MONRE [20].

The descriptive statistics of the WQ variables and WQI are presented in Table 1. The TSS, TUR, and coliform concentrations varied considerably, with high coefficient of variation (CV) values of 153.9% for TSS, 158.4% for TUR, and 343.2% for coliform. The large variations in these variables can be explained by the sources (point source and nonpoint source) and nature of the pollution [23]. Furthermore, the differences can be associated with seasonal effects of hydro-climatic conditions in the study area. Additionally, the WQI values indicated that the water quality of the La Buong River varies from very poor (WQI = 3.02) to excellent (WQI = 98.30).
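The five-level classification and the CV statistic used above can be sketched as follows. Two points are assumptions rather than statements from the source: fractional WQI values that fall between the integer bands (for example, 90.5) are assigned to the lower class, and the CV uses the sample standard deviation. The function names and the example series are hypothetical.

```python
from statistics import mean, stdev


def classify_wqi(wqi):
    """Map a WQI value to one of the five quality levels."""
    if not 0 <= wqi <= 100:
        raise ValueError("WQI must lie between 0 and 100")
    # Assumption: values between bands (e.g. 90.5) fall into the lower class.
    if wqi >= 91:
        return "excellent"
    if wqi >= 76:
        return "good"
    if wqi >= 51:
        return "fair"
    if wqi >= 26:
        return "poor"
    return "very poor"


def coefficient_of_variation(values):
    """CV in percent; assumes the sample standard deviation is intended."""
    return 100 * stdev(values) / mean(values)


# The minimum and maximum WQI values reported in the text
print(classify_wqi(3.02), "|", classify_wqi(98.30))

# Hypothetical concentration series, for illustration only
print(round(coefficient_of_variation([2, 4, 4, 4, 5, 5, 7, 9]), 1))
```

A CV above 100%, as reported for TSS, TUR, and coliform, means that the standard deviation exceeds the mean.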

**Table 1.** Descriptive statistics of the observed WQ variables and WQI in the La Buong River during 2010–2017 (n = 220).


The La Buong River WQ data were divided into two parts: 70% for the training process and 30% for the testing process. This split ratio is widely used in data-driven modeling [1,7]. To improve the training speed and predictive accuracy of the ML models, the WQ data were normalized to a 0–1 range before the modeling process using the following equation:

$$\mathbf{x}'\_{i} = \frac{\mathbf{x}\_{i} - \mathbf{x}\_{\min}}{\mathbf{x}\_{\max} - \mathbf{x}\_{\min}} \tag{2}$$

where *x′<sub>i</sub>* and *x<sub>i</sub>* are the normalized and original values of a WQ variable (e.g., pH, DO, or BOD) at a station, and *x<sub>min</sub>* and *x<sub>max</sub>* are the minimum and maximum values of that variable, respectively.
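A minimal sketch of the 70/30 division and the normalization in Equation (2) is given below. The source does not state whether the split was random or chronological, nor which subset supplied the minimum and maximum, so the shuffled split with a fixed seed is an assumption. The helper names `split_70_30` and `min_max_normalize` are hypothetical.

```python
import random


def split_70_30(records, seed=0):
    """Shuffle the records and return (training, testing) parts in a 70/30 ratio."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # assumption: random rather than chronological
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def min_max_normalize(values):
    """Rescale one variable to the 0-1 range as in Equation (2)."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        raise ValueError("a constant variable cannot be min-max scaled")
    return [(x - x_min) / (x_max - x_min) for x in values]


# 220 records, matching the sample size reported for Table 1
train, test = split_70_30(range(220))
print(len(train), len(test))                  # 154 66

print(min_max_normalize([2.0, 4.0, 6.0]))     # [0.0, 0.5, 1.0]
```

In practice, taking the minimum and maximum from the training part only and reusing them for the testing part avoids leaking information from the test data, although the source does not say which approach was followed.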

#### *3.2. Machine Learning Models*

As mentioned above, the current study utilized twelve ML models for predicting WQI, organized into three major groups: boosting-based algorithms, decision tree-based algorithms, and ANN-based algorithms.

#### 3.2.1. Boosting-Based Algorithms

Boosting is an ensemble meta-algorithm that aims to improve the predictive performance of several weak learners, primarily by reducing bias and variance in supervised learning problems [24]. The method starts by fitting a model to the training data and then constructs a second model that builds on the first by reducing the bias error that arises when the first model fails to capture the relevant patterns in the data. Each time a new learner is added, the weights of the data are readjusted, a step known as "re-weighting". Learners are added sequentially until the training data are reasonably predicted or the maximum number of learners has been added to the ensemble [25]. Five boosting-based algorithms were utilized in the current study: adaptive boosting (AdaBoost), gradient boosting (GBM), histogram-based gradient boosting (HGBM), light gradient boosting (LightGBM), and extreme gradient boosting (XGBoost). Full details on these boosting-based algorithms can be found in Wu et al. [26].
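The sequential principle described above can be illustrated with a toy example in plain Python. Note the substitution: this sketch fits each new weak learner (a one-split regression stump) to the residuals of the current ensemble, which is the gradient boosting variant for squared error, rather than re-weighting samples as AdaBoost does. It is not the implementation used in the study, and the data and all names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals (squared-error criterion)."""
    candidates = sorted(set(xs))[:-1]  # thresholds that leave both sides non-empty
    if not candidates:
        raise ValueError("need at least two distinct x values")
    best = None
    for threshold in candidates:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, n_learners=20, learning_rate=0.5):
    """Add stumps one at a time, each fitted to the errors of the current ensemble."""
    base = sum(ys) / len(ys)  # the first "model" is simply the mean
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_learners):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy data: one input variable and a WQI-like target
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [90, 85, 80, 60, 55, 40, 30, 20]
model = boost(xs, ys)
baseline = (sum(ys) / len(ys), [], 0.5)  # mean-only model, no stumps
print(mse(baseline, xs, ys) > mse(model, xs, ys))
```

The `n_learners` argument plays the role of the maximum number of learners mentioned in the text, and `learning_rate` shrinks each learner's contribution.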

#### 3.2.2. Decision Tree-Based Algorithms

The decision tree and its many variants form another family of learning algorithms, which divide the input space into regions and have separate parameters for each region [27]. They are non-parametric supervised learning methods that are widely applied for classification and regression and that represent decisions and decision making visually and explicitly. The typical structure of a decision tree is, as the name suggests, a tree-like flowchart in which each internal node represents a "test" on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label (the decision taken after all attributes have been evaluated). The paths from root to leaf represent classification rules. In the present study, three decision tree-based models with different learning algorithms were assessed: decision tree (DT), extra trees (ExT), and random forest (RF). Full details on these decision tree-based algorithms can be found in Ahmad et al. [28].
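The node, branch, and leaf structure described above can be made concrete with a hand-built tree. Everything in this sketch is hypothetical: the attributes chosen for the tests, the thresholds, and the leaf labels are illustrative and are not taken from the fitted models in the study.

```python
# Internal node: (attribute, threshold, subtree_if_at_or_below, subtree_if_above)
# Leaf node: a class label string
# All attributes, thresholds, and labels below are hypothetical.
TREE = ("coliform", 5000.0,
        ("TSS", 50.0, "good", "fair"),
        "poor")


def classify(tree, sample):
    """Walk from root to leaf; return the leaf label and the rule that was followed."""
    rule = []
    node = tree
    while not isinstance(node, str):
        attribute, threshold, low_branch, high_branch = node
        if sample[attribute] <= threshold:
            rule.append(f"{attribute} <= {threshold}")
            node = low_branch
        else:
            rule.append(f"{attribute} > {threshold}")
            node = high_branch
    return node, " AND ".join(rule)


label, rule = classify(TREE, {"coliform": 1200.0, "TSS": 80.0})
print(label, "|", rule)
```

Each root-to-leaf path reads as one classification rule, which is what makes single trees easy to inspect. Ensembles such as RF and ExT combine many such trees.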

#### 3.2.3. ANN-Based Algorithms

In recent decades, AI-based models have developed considerably, reaching state-of-the-art architectures that comprise a number of learning algorithms and modern computational structures, and have been applied across various aspects of river water quality modeling [8]. ANN-based models have recently gained popularity owing to their robustness and their capability to handle nonlinear data, whether in the typical structure with a single hidden layer or in advanced structures with multiple hidden layers. A basic ANN includes three layers: input, hidden, and output. As the complexity of the problem increases, the number of layers rises and so do the computational resources required. In this study, both structures of ANN-based models were utilized for predicting WQI, namely multilayer perceptron (MLP), radial basis function (RBF), deep feed-forward neural network (DFNN), and convolutional neural network (CNN). Full details on these ANN-based algorithms can be found in Tiyasha et al. [8] and Tahmasebi et al. [29].

#### *3.3. Construction of ML Models*

The first important step in constructing an ML model is the selection of input variables, which determines a sufficient set of variables carrying enough underlying information to predict WQI. This selection can also improve model accuracy by avoiding undesirable impacts on predictive performance. In the current study, ten WQ variables were identified as potential inputs. Several methods exist for assessing input combinations, including the autocorrelation function, partial autocorrelation function, cross-correlation function, and correlation coefficient. Among these techniques, the correlation coefficient was selected for the current study because it is efficient and straightforward [4].

Table 2 shows that the WQ variable with the highest R<sup>2</sup> value was coliform, followed by TSS, TUR, COD, BOD, PO<sub>4</sub><sup>3−</sup>, NH<sub>4</sub><sup>+</sup>, pH, DO, and T. Notably, coliform, TUR, and TSS had the highest correlations with WQI, owing to the impacts of cropping and livestock activities on water quality in the La Buong River. Based on the correlations of the ten WQ variables with WQI, ten input variable combinations were defined, as listed in Table 3.
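The correlation-based ranking can be sketched as follows. The variable names and data are placeholders, not the observed values. The nested construction of scenarios (all ranked variables first, then dropping the weakest one at a time) is only one plausible scheme, adopted here as an assumption; the combinations actually used are those listed in Table 3.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_inputs(variables, wqi):
    """Order variable names by squared correlation (R^2) with WQI, highest first."""
    scores = {name: pearson_r(values, wqi) ** 2 for name, values in variables.items()}
    return sorted(scores, key=scores.get, reverse=True)


def nested_scenarios(ranked):
    """Assumed scheme: start with all variables, then drop the weakest one each time."""
    return [ranked[:k] for k in range(len(ranked), 0, -1)]


# Placeholder data, for illustration only
wqi = [10, 20, 30, 40, 50]
variables = {
    "var_c": [2, 1, 2, 1, 2],
    "var_a": [50, 40, 30, 20, 10],
    "var_b": [1, 3, 2, 5, 4],
}
ranked = rank_inputs(variables, wqi)
print(ranked)
print(nested_scenarios(ranked))
```

Squaring the coefficient ranks strongly negative relationships as highly as strongly positive ones, which suits variables such as coliform, for which higher concentrations imply a lower WQI.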

After selecting the input WQ variables, the parameter values of each ML model were determined using a "trial and error" technique [23]. With the twelve ML models and ten scenarios of input variable combinations, 120 ML models for predicting the WQI were built during the training process, and their performance was evaluated during the testing process [7]. In the present study, the scikit-learn library, a Python-based package, was utilized to develop the twelve ML models for predicting the WQI.
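The experimental grid of twelve models and ten input scenarios can be enumerated as in the sketch below. The scenario labels `S1` to `S10` and the `train_and_score` callback are hypothetical placeholders. The callback stands in for whatever training and testing routine is applied to each pair and is not part of the source.

```python
from itertools import product

MODELS = [
    "AdaBoost", "GBM", "HGBM", "LightGBM", "XGBoost",  # boosting-based
    "DT", "ExT", "RF",                                 # decision tree-based
    "MLP", "RBF", "DFNN", "CNN",                       # ANN-based
]
SCENARIOS = [f"S{k}" for k in range(1, 11)]  # hypothetical labels for the Table 3 rows

experiments = list(product(MODELS, SCENARIOS))
print(len(experiments))  # 120


def run_grid(train_and_score):
    """Apply a user-supplied routine to every (model, scenario) pair."""
    return {(m, s): train_and_score(m, s) for m, s in product(MODELS, SCENARIOS)}


# Placeholder routine that returns a dummy score for each pair
results = run_grid(lambda model_name, scenario: 0.0)
print(len(results))  # 120
```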



**Table 3.** Scenarios of input variables for the current study.

