*4.2. Results from Decision Tree Models*

The precipitation, temperature, and land use data, together with the experimentally measured fecal coliform values, are used to formulate decision tree models. Several different decision tree models are developed for fecal coliform analysis. All input parameters (precipitation P, temperature T, ULUF u, FLUF f, ALUF a) are used to create decision tree models with different data split methods. Precipitation was the single most important hydrological input parameter for determining fecal coliform, which agrees with the correlation values obtained earlier. 70% of the dataset is used as the training set, and the remaining 30% is used as the testing set. For CART, an accuracy of 63.05% was obtained in the training phase and 60.29% in the testing phase; an accuracy of 62.22% was obtained on the entire dataset. In CART decision tree modeling, the data is split at the root node on the attribute with the lowest Gini index, and splitting proceeds top-down on attributes of increasing Gini index until the leaf nodes are reached. In this way, the impurity, or uncertainty, in the data is minimized through recursive partitioning [37]. Similar results were obtained for the ID3 model: accuracies of 61.78%, 61.76%, and 61.77% were obtained in the training phase, the testing phase, and on the entire dataset, respectively. In the ID3 model, the feature with the maximum information gain, i.e., the smallest entropy, is used to split the data at the root node first and then at the subsequent nodes until the leaf nodes are reached. The smallest entropy corresponds to the feature with the least uncertainty, or randomness, in the data [36]. Both DTs, CART and ID3, belong to the family of Top-Down Induction of Decision Trees (TDIDT). CART performs slightly better than ID3 in training, while ID3 performs slightly better than CART in testing. Overall, however, the performance of CART is slightly better than that of ID3.
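The two split criteria described above can be sketched in a few lines of plain Python. The functions and the toy class labels below are our own illustration, not the study's implementation: CART would pick the candidate split with the lowest weighted Gini index, while ID3 would pick the one with the lowest weighted entropy (i.e., the highest information gain).

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy: -sum of p_k * log2(p_k) over class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def weighted_impurity(splits, impurity):
    """Impurity of a candidate split, weighted by branch size."""
    total = sum(len(s) for s in splits)
    return sum(len(s) / total * impurity(s) for s in splits)

# Toy FC class labels (0-3) partitioned by a hypothetical precipitation threshold.
left, right = [0, 0, 1, 0], [2, 3, 3, 2, 3]

# CART evaluates this split by its weighted Gini index;
# ID3 evaluates it by its weighted entropy.
print(weighted_impurity([left, right], gini))
print(weighted_impurity([left, right], entropy))
```

Repeating this evaluation over every candidate attribute and threshold, and recursing on each branch, yields the top-down tree-growing procedure described above.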
The CART and ID3 models were improved by augmenting the simple models with bagging and boosting methods. The highest test set accuracy was obtained for the CART model with adaptive boosting: accuracies of 81.53%, 72.06%, and 78.67% were obtained in the training phase, the testing phase, and on the entire dataset, respectively. The bagged and adaptively boosted versions of CART and ID3 perform much better than the simple (without bagging or adaptive boosting) CART and ID3 models. Although bagging of ID3 yields the largest training accuracy among the simple and ensemble models of CART and ID3, adaptive boosting of CART gives the largest testing accuracy among the same models. The overall accuracy of the bagged ID3 model is the highest among the simple and ensemble models of CART and ID3. Apart from the CART and ID3 models, Random Forest was also implemented on the experimental dataset to predict the fecal coliform density or concentration.
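The mechanism behind adaptive boosting can be illustrated with a minimal sketch of one reweighting round. The function and the toy sample weights below are our own, and the weak learner is abstracted away as a list of booleans; the idea is simply that samples the current tree misclassifies receive more weight in the next round, and that the tree's vote in the final ensemble (alpha) depends on its weighted error.

```python
from math import log, exp

def adaboost_reweight(weights, misclassified):
    """One AdaBoost round: up-weight the samples the weak tree got wrong.

    weights       -- current sample weights (assumed to sum to 1)
    misclassified -- booleans, True where the weak tree erred
    Returns (alpha, new_weights): the learner's vote and the reweighted samples.
    """
    err = sum(w for w, m in zip(weights, misclassified) if m)
    alpha = 0.5 * log((1 - err) / err)  # this tree's weight in the final vote
    new = [w * exp(alpha if m else -alpha) for w, m in zip(weights, misclassified)]
    total = sum(new)
    return alpha, [w / total for w in new]  # renormalize to sum to 1

# Five samples, uniform start; the weak tree misclassifies sample 2 only.
alpha, w = adaboost_reweight([0.2] * 5, [False, False, True, False, False])
```

With a weighted error of 0.2, the misclassified sample's weight grows from 0.2 to 0.5, so the next tree in the sequence concentrates on the hard case; bagging, by contrast, keeps weights uniform and varies only the bootstrap sample.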

The Random Forest model gives an accuracy of 98.7% on the training set, 64.7% on the testing set, and 88.4% on the overall dataset. The Random Forest model is built by creating an ensemble of a large number of decision trees and then predicting the mode (for classification) or mean (for regression) of the individual trees' results. The less correlated the individual decision trees are, the better the final prediction [48]. The sub-samples used to build the individual trees are drawn randomly, with replacement, from the original dataset. The trees are grown to the largest extent possible without pruning. For the Random Forest model to be effective, the features selected in the sub-sample trees must be informative rather than pure random guesses. The Random Forest model outperforms decision trees such as CART and ID3, but its testing accuracy is slightly lower than that of gradient boosted trees, extremely randomized trees, and DTs with bagging and adaptive boosting. A few other models, such as extremely randomized trees, were also implemented, and the accuracy results of all models are summarized in Table 2 below:
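The two ingredients just described, bootstrap sampling with replacement and aggregation by mode, can be sketched as follows. The helper names, the toy dataset, and the hard-coded per-tree votes are illustrative only; real trees would of course be grown on the bootstrap samples.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) rows from the dataset with replacement."""
    return [rng.choice(data) for _ in data]

def forest_predict(per_tree_predictions):
    """Final class = mode of the individual trees' votes for each sample."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*per_tree_predictions)]

rng = random.Random(0)
data = list(range(10))
sample = bootstrap_sample(data, rng)  # some rows repeat, some are left out

# Three hypothetical trees voting on the FC class (0-3) of four samples:
votes = [[0, 1, 3, 2],
         [0, 1, 2, 2],
         [1, 1, 3, 2]]
final = forest_predict(votes)  # majority vote per sample
```

Because each tree sees a different bootstrap sample (and, in a full Random Forest, a random feature subset at each split), the trees decorrelate, which is what makes the majority vote stronger than any single tree.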


**Table 2.** Accuracies of various DT models in the prediction of FC.

where ID3-Bagg is ID3 with bagging and ID3-AB is ID3 with adaptive boosting. The extremely randomized trees (also known as "Extra Trees") give the fourth-best testing accuracy, 66.17%, and the second-best overall accuracy, 88.89%, among all the models. While the Random Forest model uses sub-samples drawn with replacement, the extremely randomized trees use the whole input sample. Also, while Random Forest selects the optimal split to choose cut points, the extremely randomized trees use random cut points. The extremely randomized trees are therefore faster and reduce both bias and variance, owing to their use of the original input sample and random split points [49]. The gradient boosting model (GBM) gives the third-best testing accuracy, 69.12%, and the best overall accuracy, 89.33%, among all the models. In the gradient boosting model, a loss function such as the mean square error is minimized via gradient descent by sequentially adding an ensemble of weak learners, which together eventually form a strong learner [50]. The GBM, ERT, and RF perform better than the simple and ensemble models of CART and ID3 in training and overall accuracies, while the ensemble models of CART and ID3 are slightly better in testing accuracy. This could be due to possible overfitting of the GBM, ERT, and RF models. Further, optimal pruning of the trees may raise the training, testing, and overall accuracies of GBM, ERT, and RF above those of the simple and ensemble models of CART and ID3. The accuracy of a decision tree model is given by the number of correct predictions divided by the total number of predictions; here, the prediction is the class to which the water sample belongs. These results are presented in Figure 6.
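The gradient boosting idea, fitting each new weak learner to the residuals left by the current ensemble (the negative gradient of the squared loss), can be sketched with one-split regression stumps. The functions, the learning rate, and the toy data below are our own illustration, not the study's GBM configuration.

```python
def fit_stump(x, residuals):
    """Best single-threshold regression stump on one feature (squared loss)."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        loss = (sum((r - lmean) ** 2 for r in left)
                + sum((r - rmean) ** 2 for r in right))
        if best is None or loss < best[0]:
            best = (loss, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda xi: lmean if xi <= t else rmean

def gradient_boost(x, y, n_rounds=20, lr=0.3):
    """Each round fits a stump to the current residuals and adds a
    damped (learning-rate-scaled) copy of it to the ensemble."""
    pred = [sum(y) / len(y)] * len(y)  # F0: constant initial model
    for _ in range(n_rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, resid)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return pred

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
pred = gradient_boost(x, y)
mse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)
```

Each stump alone is a weak learner, but the sum of damped stumps drives the training error far below that of the initial constant model, which is also why such models can overfit without a held-out test set to monitor.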

We have used the various decision tree algorithms for classification rather than regression on the current dataset. For classification, the target variable, i.e., fecal coliform (FC), was divided into four classes following the United States Environmental Protection Agency (USEPA) recommendations (given in Table 3).
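The discretization of FC into the four classes amounts to a simple threshold lookup, which could be sketched as below. The numeric boundaries used here are placeholder values for illustration only; the actual class boundaries are those listed in Table 3.

```python
from bisect import bisect_right

# Placeholder FC class boundaries (counts per 100 mL), for illustration only;
# the real boundaries are the USEPA-based ones in Table 3.
BOUNDS = [200, 1000, 2000]
CLASSES = ["body contact and recreation", "fishing and boating",
           "domestic utilization", "dangerous"]

def fc_class(fc):
    """Map a measured fecal coliform value to its water-use class."""
    return CLASSES[bisect_right(BOUNDS, fc)]

low_use = fc_class(150)     # falls below the first boundary
high_use = fc_class(2500)   # exceeds the last boundary
```

After this binning, each water sample carries one of four class labels, and the decision tree models are trained to predict that label rather than the raw FC count.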

**Table 3.** Fecal Coliform and its Class.


**Figure 6.** Bar graph of training, testing, and overall accuracies of different DT models.

The classification results for all decision tree algorithms, grouped into the four classes or categories, namely body contact and recreation, fishing and boating, domestic utilization, and dangerous, are presented in Figure 7. The number of samples in each class is also shown in the same figure. From Figure 7, we can see that high values of precision, recall, and F1-score are obtained for Random Forest, CART with adaptive boosting, ID3 with adaptive boosting, and extremely randomized trees for the supports shown for each of the four classes. The definitions of accuracy, precision, recall, and F1-score are given by:

$$Accuracy = \frac{TP + TN}{TP + FP + TN + FN} \tag{8}$$

$$Precision = \frac{TP}{TP + FP} \tag{9}$$

$$Recall = \frac{TP}{TP + FN} \tag{10}$$

$$F1\text{-}score = 2 \times \frac{Precision \times Recall}{Precision + Recall} \tag{11}$$

where TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative, and support is the number of occurrences of each class in ground truth (correct) target values.
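Equations (8)-(11) translate directly into code. The function below is a straightforward sketch of the four definitions, and the confusion-matrix counts in the example are hypothetical, not taken from the study's results.

```python
def metrics(tp, tn, fp, fn):
    """Per-class metrics from confusion-matrix counts, per Eqs. (8)-(11)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)          # of predicted positives, how many correct
    recall = tp / (tp + fn)             # of actual positives, how many found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

# Hypothetical counts for one FC class:
acc, prec, rec, f1 = metrics(tp=40, tn=45, fp=10, fn=5)
```

Because the F1-score is the harmonic mean of precision and recall, it stays high only when both are high, which is why it is reported alongside them in the per-class reports of Figure 7.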

This means that the above four algorithms can correctly distinguish the positive samples from the negative samples for each respective class and can recall most of their positive samples, with the F1-score indicating that both of these abilities are balanced in the classification. The next-best values of precision, recall, and F1-score are obtained for the CART with bagging and ID3 with bagging algorithms. Simple CART and ID3 yielded lower precision, recall, and F1-score than the other algorithms discussed above for each support class. Although not presented here, accuracies were also highest for the extremely randomized trees and Random Forest algorithms, slightly lower for CART and ID3 with bagging and boosting, and lowest for the simple CART and ID3 algorithms.

**Figure 7.** Classification reports of the entire dataset using different DT models.
