**1. Introduction**

Water is a life-sustaining element that determines the establishment and survival of civilizations. The availability of safe, good-quality water for household and industrial activities is essential for economic prosperity. Organizations worldwide, such as the World Health Organization (WHO), have specified quality standards for each natural source of this vital element. In the United States of America, the United States Environmental Protection Agency (USEPA) establishes standards for water quality and undertakes quality control measures. Because rainfall is sporadic globally, several countries depend on their rivers as a primary source of water. Massive dams are constructed, and the stored water is used for varied purposes, including irrigation, electricity generation, and domestic activities. Over the past decade, the deposition of industrial and human fecal waste into rivers, driven by burgeoning urbanization, has substantially exacerbated water contamination levels. Human interference in nature has caused imbalances which, if not regulated, are deleterious and threaten humankind's very existence. Therefore, it is essential to monitor the damage that we have caused.

**Citation:** Hannan, A.; Anmala, J. Classification and Prediction of Fecal Coliform in Stream Waters Using Decision Trees (DTs) for Upper Green River Watershed, Kentucky, USA. *Water* **2021**, *13*, 2790. https://doi.org/10.3390/w13192790

Academic Editors: Nigel W.T. Quinn, Ariel Dinar, Iddo Kan and Vamsi Krishna Sridharan

Received: 24 August 2021 Accepted: 5 October 2021 Published: 8 October 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Water quality assessment is an integral part of environmental engineering. It includes the evaluation of the chemical, biological, and physical characteristics of water. Factors that determine water quality are (i) physical: temperature, turbidity, suspended solids, color, and taste; (ii) chemical: pH, dissolved oxygen, Total Dissolved Solids (TDS), alkalinity, and hardness; and (iii) biological/microbiological: pathogens and coliforms. Fecal coliform bacteria originate in the intestines of humans and other warm-blooded animals. Their presence in a water body poses no direct harm by itself, but it indicates the possible existence of harmful pathogens. Hence, tests and experiments are performed to measure fecal coliform concentration, which helps determine the quality of a water sample. Water quality determination is a tedious, lab-intensive process.

Moreover, current monitoring methods cannot provide real-time results because of testing time requirements. There is a need for practical, cost-efficient, and labor-efficient methods to indicate bacterial concentration on a real-time basis. Fecal coliform presence is measured as the number of colonies per 100 mL of sampled water. In general, the sources of fecal coliform loads in freshwater systems are wastewater treatment plant effluents, failed septic systems, and human and animal manure [1]. High loads of bacterial contamination are found in the streams of both rural or agricultural watersheds and urban watersheds. The farm cattle waste and failed septic systems of rural watersheds are replaced by domestic pet waste and failed sanitary sewers in urban watersheds. The current USEPA stipulations for four classes of freshwater systems are as follows: (i) less than one colony/100 mL for drinking water standards, (ii) fewer than 200 colonies/100 mL for body contact recreation, (iii) fewer than 1000 colonies/100 mL for fishing and boating, and (iv) fewer than 2000 colonies/100 mL for domestic water supply. The fecal coliform present in human and animal waste reaches freshwater streams and rivers through household and business drains, septic systems, overland plane areas, and illegal or leaky sanitary sewer pipes. It is then transported within the streams by advection, diffusion, adsorption, and dispersion down to the outlets. Because bacteria have a high affinity for soil, high sediment loads also carry high concentrations or loads of bacteria. For the same reasons, high runoff events or storm runoff are also known to contain higher bacterial concentrations. Management actions usually include steps such as (i) routine maintenance of septic tanks, (ii) repair of broken field lines, (iii) elimination of straight pipes and failing septic systems, and (iv) isolation of cattle from streams [1].
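The four USEPA limits listed above can be read as a simple threshold lookup. The following is a minimal sketch of that reading, not an official USEPA procedure: the limits are taken from the text, while the function names and class labels are illustrative assumptions.

```python
# Limits in colonies per 100 mL, as quoted in the paragraph above.
# The labels and function names are hypothetical, chosen for illustration.
USE_CLASS_LIMITS = [
    (1, "drinking water"),
    (200, "body contact recreation"),
    (1000, "fishing and boating"),
    (2000, "domestic water supply"),
]


def permitted_uses(colonies_per_100ml):
    """Return every use class whose limit the sample stays strictly below."""
    if colonies_per_100ml < 0:
        raise ValueError("concentration cannot be negative")
    return [label for limit, label in USE_CLASS_LIMITS
            if colonies_per_100ml < limit]


def strictest_class(colonies_per_100ml):
    """Return the most demanding use class the sample still satisfies."""
    uses = permitted_uses(colonies_per_100ml)
    return uses[0] if uses else "exceeds all listed limits"


if __name__ == "__main__":
    for sample in (0, 150, 200, 2500):
        print(sample, "->", strictest_class(sample))
```

The comparisons are strict because each limit is stated as "less than" or "fewer than", so a sample at exactly 200 colonies/100 mL does not qualify for body contact recreation in this sketch.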

In recent years, Artificial Intelligence has been used extensively in environmental engineering across several applications. Researchers have implemented various Machine Learning algorithms for water quality assessment. D'Agostino [2], Gaus [3], and Arslan [4] used Geographical Information Systems (GIS) to assess water quality parameters. Ahn & Chon [5] combined thematic maps of pH, electrical conductivity, nitrate, and sulfate to map the suitability of water for drinking purposes. Bae [6] used Classification and Regression Trees (CART) to predict indicator bacterial concentrations in coastal Californian waters. In that CART analysis, dissolved oxygen was found to be the most important parameter for predicting total and fecal coliforms, while turbidity was important for enterococci (ENT). The pH, temperature, and streamflow were found to be less important for predicting indicator bacteria. CART made it possible to predict indicator bacterial concentrations in real time, saving substantial monitoring costs for the state of California. Liao & Sun [7] analyzed the water quality of Chao Lake in China using Improved Decision Tree Learning (IDTL) models, which use a feedforward neural network model for preclassification. This model performed comparably to pure neural network models and to pure decision tree models such as C4.5. It was recommended for practitioners because it is faster and uses fewer decision rules. Nikoo [8] developed an integrated water quantity and quality model using the M5P decision tree algorithm, with Total Dissolved Solids (TDS) as the water quality indicator. A comparison was made between optimization, support vector regression (SVR), and M5P models. The M5P model was found to yield explicit relationships between the inputs and an output such as TDS, which are useful for a decision maker or a practitioner.
Azam [9] classified water quality data for two cities using Decision Trees (DTs), Logistic Regression (LR), and Linear Discriminant Analysis (LDA). Maier and Keller [10] developed Random Forest (RF), Multivariate Adaptive Regression Splines (MARS), and Extreme Gradient Boosting (XGB) regression models in conjunction with hyperspectral data to estimate water quality parameters for inland waters. Jerves Cobo [11] used Decision Tree models to assess microbial pollution in rivers by studying the presence of macroinvertebrates as indicators. This was needed to set up pathogen pollution standards and to review aquatic ecosystem health. Geetha Jenifel and Jemila Rose [12] used recursive partitioning with decision trees and regression trees to predict water quality parameters and obtained better results than with other models such as linear and support vector machine (SVM) models. They found the decision tree models to be more accurate, practical, reasonable, and acceptable. Ho [13] used the ID3 decision tree model to predict the Water Quality Index (WQI) class for one of Malaysia's most polluted rivers. Sepahvand [14] compared the performance of the M5P model tree, bagging M5P, Random Forest (RF), and the group method of data handling (GMDH) in estimating the sodium absorption ratio (SAR). Of all the models tried, the bagging M5P model was found to be the most accurate in estimating SAR, based on indices such as the correlation coefficient (CC), root mean square error (RMSE), and mean absolute error (MAE). The uncertainty analysis also confirmed the accuracy of the bagging M5P model compared to the other models. Lu and Ma [15] used various hybrid Decision Tree methods and ensemble techniques such as Random Forest (RF), CEEMDAN, XGBoost, LSSVM, RBFNN, LSTM, etc., and their combinations to predict water quality parameters of the Tualatin River in Oregon, USA. Shin [16] predicted chlorophyll-a concentrations in the Nakdong River, Korea, using various machine learning (ML) models and found the best results using recurrent neural networks (RNNs).
The RNNs performed best when time-lagged memory terms were built into the model for predictions. Mosavi [17] compared the performance of two ensemble decision tree models, boosted regression trees (BRT) and random forest (RF), in predicting groundwater hardness. More recently, Naloufi [18] used six machine learning (ML) models, including Decision Trees, to predict *E. coli* concentrations in the Marne River in France. They found the Random Forest model to be the most accurate of the models compared. Using these results, they were able to identify the best ML model for sampling optimization. Other recent discussions and applications of Decision Trees in the context of river water quality modeling include [19–27].
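Several of the studies above rank models by the correlation coefficient (CC), root mean square error (RMSE), and mean absolute error (MAE). The sketch below shows how these three indices are commonly computed, assuming CC denotes the Pearson correlation between observed and predicted values. The sample numbers are made up for illustration and do not come from any cited study.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between paired observations and predictions."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean absolute error between paired observations and predictions."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n


def correlation_coefficient(observed, predicted):
    """Pearson correlation coefficient (assumed meaning of CC here)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_o * var_p)


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    observed = [2.0, 4.0, 6.0, 8.0]
    predicted = [3.0, 3.0, 7.0, 7.0]
    print("CC  :", correlation_coefficient(observed, predicted))
    print("RMSE:", rmse(observed, predicted))
    print("MAE :", mae(observed, predicted))
```

A higher CC together with lower RMSE and MAE indicates closer agreement between predictions and observations. RMSE penalizes large individual errors more heavily than MAE does.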

In light of the above literature, the objectives of the current study are: (i) to study the potential and applicability of various Decision Tree (DT) algorithms in predicting fecal coliform from causal parameters such as climate (precipitation and temperature) and land use parameters, (ii) to apply CART, ID3, RF, and ensemble methods such as bagging and boosting specifically, and (iii) to suggest a Decision Support System (DSS) based on a Decision Tree Classifier (DTC) for water quality management. The paper is organized as follows. Section 2 describes the study area and the data considered for the present work. Section 3 details the methodology, including GIS land use analysis, a brief overview of Decision Trees for classification, measures of attribute selection and their relation to Decision Trees, and the ensemble methods of bagging and boosting. The results are provided and discussed in Section 4, which also contains a detailed discussion of the Decision Tree Classifier based Decision Support System (DTCDSS). Section 5 provides a concise summary of findings.
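To make the attribute selection measures mentioned above concrete, the sketch below computes the two impurity measures conventionally associated with these algorithms: entropy-based information gain (ID3) and the Gini index (CART classification). It is a minimal illustration on a made-up four-sample dataset. The attribute names, category values, and function names are assumptions, and the authors' own formulation is the one given in Section 3.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_gain(rows, labels, attribute, impurity):
    """Impurity reduction obtained by splitting on a categorical attribute."""
    n = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    weighted = sum(len(g) / n * impurity(g) for g in groups.values())
    return impurity(labels) - weighted


if __name__ == "__main__":
    # Hypothetical toy samples: two categorical attributes, one class label.
    rows = [
        {"precipitation": "high", "land_use": "urban"},
        {"precipitation": "high", "land_use": "agricultural"},
        {"precipitation": "low", "land_use": "urban"},
        {"precipitation": "low", "land_use": "agricultural"},
    ]
    fecal_coliform_class = ["high", "high", "low", "low"]
    for attribute in ("precipitation", "land_use"):
        print(attribute,
              "information gain:",
              split_gain(rows, fecal_coliform_class, attribute, entropy),
              "Gini reduction:",
              split_gain(rows, fecal_coliform_class, attribute, gini))
```

In this toy data, splitting on precipitation separates the classes completely while land use does not separate them at all, so a tree grown with either measure would choose precipitation for its first split.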
