*4.1. Overview*

This section discusses the results obtained using decision trees. A preliminary statistical analysis of the experimental data is performed first, and its results are given in Table 1.
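As an illustration of what such a preliminary analysis computes, the following minimal sketch summarizes one variable. The function name `summarize` and the values in `fc_toy` are hypothetical and are not taken from the study's data; the coefficient of variation is added here only as a convenient measure of relative spread.

```python
import statistics

def summarize(values):
    """Return basic descriptive statistics for one variable."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    return {
        "n": len(values),
        "mean": mean,
        "std": std,
        "min": min(values),
        "max": max(values),
        # Coefficient of variation: spread relative to the mean.
        "cv": std / mean if mean else float("nan"),
    }

# Hypothetical fecal coliform values, NOT the study's measurements.
fc_toy = [20.0, 35.0, 50.0, 400.0, 1200.0]
stats = summarize(fc_toy)
```

A coefficient of variation above one, as in this toy sample, indicates a standard deviation larger than the mean.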


**Table 1.** Statistics of Water Quality Parameter and inputs.

The correlation structure of all the input water sample parameters and the fecal coliform is given by the correlation heatmap shown in Figure 4.
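The matrix behind a correlation heatmap of this kind can be sketched as follows, assuming the Pearson coefficient is used (the text does not state which coefficient was computed). The column names and values in `toy` are hypothetical placeholders, not the study's data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(columns):
    """Return {name_i: {name_j: r_ij}} for a dict of named columns."""
    names = list(columns)
    return {i: {j: pearson(columns[i], columns[j]) for j in names} for i in names}

# Hypothetical toy values, NOT the study's data.
toy = {
    "precip": [0.0, 5.0, 10.0, 20.0],
    "fc": [10.0, 40.0, 90.0, 200.0],
    "forest": [0.8, 0.6, 0.5, 0.2],
}
corr = correlation_matrix(toy)
```

Each entry `corr[a][b]` lies in [-1, 1]; a heatmap simply colors these entries, so strong positive and strong negative pairs stand out visually.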

The agriculture land-use factor (ALUF or *a*) is strongly and negatively correlated with the forest land-use factor (FLUF or *f*). FLUF also has a strong negative correlation with the urban land-use factor (ULUF or *u*). A notable inference from the correlation heatmap is the strong positive correlation between precipitation and fecal coliform (see Figure 4). This makes precipitation the most significant variable in determining the output of the decision tree method, as is evident from the CART and ID3 diagrams given later. Scatter plots of fecal coliform against each individual input variable essentially reproduce the information in the correlation map (Figure 4) and are not shown here.

The precipitation value at each sampling location is obtained by interpolating the two-day cumulative precipitation recorded at all the gauges in the watershed. Rainfall generates surface flow over the watershed's overland planes, which consist predominantly of urban, forest, and agricultural land uses. This flow enters the tributaries first and then the main stem of the Green River before reaching the watershed outlet. Based on watershed characteristics and time-of-concentration studies, the two-day cumulative, interpolated precipitation values are more suitable drivers of fecal coliform concentrations at all the sampling sites than other precipitation measures [29]. The positive correlation between microbial indicators such as fecal coliform bacteria and precipitation, rainfall, or wet-weather conditions agrees with other studies, such as [40–44] for rivers and bays and [45,46] for lakes.

DTs are best suited to large datasets; however, they are attempted in the current study because of the multi-dimensional feature space (five independent variables), with monthly data from forty-two locations collected over a six-month period. DTs are expected to work better for shorter datasets with fewer features, as the intrinsic data complexities are reduced. In the present scenario, the limited time frame of the data is offset by the multi-dimensional feature space across varied spatial locations. Reduction of the input dimension has been examined for the same dataset elsewhere using principal component analysis (PCA), canonical correlation analysis (CCA), and artificial neural networks (ANNs) [47]. The authors found comprehensive predictions when using all the spatial parameters, such as land uses, together with the temporal climate parameters.

The histogram of fecal coliform is given in Figure 5. The variability of fecal coliform is large, as is evident from the high standard deviation and the histogram.
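The precipitation input described above, a two-day cumulative value at each gauge interpolated to the sampling locations, can be sketched as follows. This is a minimal illustration under two loudly stated assumptions: "two-day cumulative" is taken to mean the sampling day plus the preceding day, and inverse-distance weighting (IDW) is used as the interpolator. The text does not specify either choice, and the gauge coordinates and rainfall values are hypothetical.

```python
import math

def two_day_cumulative(daily, day_index):
    """Sum of precipitation on `day_index` and the day before it (assumed definition)."""
    start = max(0, day_index - 1)
    return sum(daily[start:day_index + 1])

def idw(gauges, site, power=2.0):
    """Inverse-distance-weighted estimate at `site` from (x, y, value) gauges.

    IDW is an assumed interpolator; the source does not name the method used.
    """
    num = den = 0.0
    for gx, gy, value in gauges:
        d = math.hypot(site[0] - gx, site[1] - gy)
        if d == 0.0:
            return value  # site coincides with a gauge
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den

# Hypothetical gauges: (x, y, daily precipitation series in mm).
gauge_series = [
    (0.0, 0.0, [0.0, 4.0, 6.0]),
    (10.0, 0.0, [1.0, 8.0, 12.0]),
]
sample_day = 2  # index of the sampling day in each series
gauges = [(x, y, two_day_cumulative(p, sample_day)) for x, y, p in gauge_series]
site_precip = idw(gauges, site=(5.0, 0.0))
```

For a site equidistant from both gauges, the weights are equal and the estimate reduces to the average of the two gauges' two-day totals.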

**Figure 4.** Correlation Heatmap.

**Figure 5.** Histogram of Fecal Coliform (FC).
