1. Introduction
With rapid economic growth and increasing urbanization, water pollution has worsened. Understanding the issues and patterns of water quality is critical for reducing and regulating water pollution. Most countries have started to develop environmental water management schemes to understand the quality of the marine ecosystem. Water is life’s most important substance. Although 71% of the Earth’s surface is covered with water, the vast majority of it (95%) is salt water [1]. Thus, conserving the quality of fresh water is essential. Almost one billion people do not have access to adequate drinking water sources, and two million people die every year from contaminated water and poor sanitation and hygiene [2].
Water quality is important to the sustainability of a diversion scheme. Predicting water quality involves forecasting variation patterns in the quality of a water system at a certain time, and it is important for water quality planning and regulation. By predicting future changes in water quality at varying levels of contamination, rational strategies can be devised to prevent and regulate water contamination. In water diversion schemes, where a large volume of water is transported to meet everyday drinking needs, the general condition of the water should be estimated. Thus, strategies for forecasting water quality should be investigated [3].
Water of low quality can also be economically challenging, given that resources must be diverted to upgrade water delivery infrastructure any time a problem arises. For these reasons, the demand for improved water treatment and water quality control has been increasing to ensure clean drinking water at affordable rates. Systematic analyses of raw water, disposal systems, and organizational monitoring problems are required to resolve these challenges [4]. Achieving precise predictions of changes in water quality can immensely improve the efficiency of aquaculture. In general, water quality data are pre-processed before water quality parameters are predicted. Thus, the prediction process consists of two stages. The first stage consists of the pre-treatment of water quality data and a correlation analysis between different water quality parameters. The second stage is the prediction of the water quality parameters.
With advanced computing using artificial intelligence (AI) techniques, the modelling of water quality has been developed to resolve water quality issues. Artificial neural networks (ANNs) have aided in the monitoring of water quality systems by predicting changes in water quality [5]. Simulating water quality conditions with hydrodynamic and water quality models, a relatively novel computational approach, has difficulties and challenges. ANNs have been widely established in many disciplines and provide an alternative technique for understanding and monitoring water quality in reservoirs, and they have been successfully applied to simulate and forecast water quality in water bodies. Numerous ANN methods, such as feed-forward neural networks [6], have been used in various applications. Fuzzy logic systems have been developed to solve complex nonlinear systems [7]. ANN applications have been successfully used as tools to compute and predict the quality of water bodies [8,9,10,11,12]. ANN models require parameter values for designing predictions [13]. ANNs have numerous advantages, including their ability to learn, manage very complex nonlinear systems, and work with parallel processing. Shafi et al. [14] used support-vector machines, neural networks (NNs), deep NNs, and K-nearest neighbors (KNN) to classify water quality using drinking water data from the Pakistan Council of Research in Water Resources (PCRWR).
The authors of [15] used a hybrid deep learning model, a convolutional neural network (CNN) combined with long short-term memory (LSTM), to predict water quality parameters including total nitrogen, total phosphorus, and total organic carbon. Liu et al. [16] used an LSTM network model to predict the quality of drinking water in the Yangtze River Basin. The LSTM model was developed using pH, dissolved oxygen (DO), chemical oxygen demand (COD), and NH3-N. They noted that the LSTM model shows promise for monitoring water quality.
Chen et al. [17] proposed artificial intelligence for modelling and predicting water quality and noted that the ANN model gave a better result. Singh [18] used the ANN model to compute dissolved oxygen (DO) and biological oxygen demand (BOD) parameters to predict the quality of river water. Zheng et al. [19] applied the immune particle swarm optimization (PSO) method, which employed a neural network with a hidden layer, to predict sewage effluent water quality. Gao [20] enhanced the back-propagation (BP) neural network by using the grey correlation analysis method to predict water quality. Zhang et al. [21] combined ANNs with a genetic algorithm to predict water quality by using time data to enhance the stability of the forecasting results. Wang et al. [22] proposed a genetic regression neural network (GA-GRNN) model as an efficient method of predicting water quality to ensure water security in the South-to-North Water Diversion (SNWD) Project. Correlation coefficients were applied to investigate the relationship between significant parameters. Abyaneh [23] introduced ANNs and regression models to predict COD and biochemical oxygen demand (BOD). The radial-basis function was used as a kernel function of the ANN model [24,25]. Barzegar et al. [26] developed a hybrid convolutional neural network (CNN)–LSTM model to predict DO and chlorophyll-a (Chl-a) in Small Prespa Lake in Greece. The deep learning model outperformed the traditional support-vector regression (SVR) model. Maiti et al. [27] predicted DO levels using the ANN model. Deep learning methods showed higher performance in predicting the water quality index (WQI) than traditional machine learning techniques, as did AI techniques such as ANNs, Bayesian NNs, and adaptive neuro-fuzzy systems [28]. Piazza et al. [29] presented a comparison between the proposed model’s numerical optimization approach and the results of an experimental campaign; a genetic algorithm with a hydraulic simulator was applied to test and evaluate water quality monitoring. Sambito et al. [30] developed a smart system based on the Internet of Things and a Bayesian decision network (BDN) for predicting wastewater; the proposed system focused on the analysis of soluble conservative pollutants such as metals. Decision support systems and auto-regressive moving averages were applied to predict the water quality (WQ) of groundwater [31].
Currently, water quality is assessed by costly and time-consuming laboratory and statistical analyses that require sample collection, transportation to laboratories, and considerable time and calculation. This is quite ineffective because water is a highly transmissible medium and time is critical if the water is contaminated with disease-causing waste. The catastrophic consequences of water contamination necessitate a faster and less expensive alternative. In this regard, we developed a real-time system to evaluate an alternative approach based on advanced artificial intelligence methods for modelling and predicting water quality. Existing models, however, face some challenges; for example, they do not consider factors affecting WQ. The contributions of the current study are as follows. An advanced AI model, the adaptive neuro-fuzzy inference system (ANFIS), was developed to predict the water quality index (WQI). The feed-forward neural network (FFNN) and KNN were used for water quality classification (WQC). The highly efficient advanced AI can be generalized and then used to forecast the water pollution process, which will aid decision-makers in making timely decisions.
2. Materials and Methods
Figure 1 displays the framework of the methodology used.
2.1. Dataset
The datasets employed to conduct the research were acquired from different locations in India and contained 1679 samples from 666 different sources of rivers and lakes in the country. The data were collected between 2005 and 2014. The datasets include eight important parameters: DO, pH, conductivity, biological oxygen demand (BOD), nitrate, fecal coliform, temperature (temp), and total coliform. However, seven parameters were considered to show significant values, and the developed models were evaluated based on some statistical parameters. All the experiments consisted of temp parameters. The Indian government collected these data to ensure the quality of the drinking water supplied. This dataset was obtained from Kaggle: https://www.kaggle.com/anbarivan/indian-water-quality-data (accessed on 3 December 2020).
2.2. Data Preprocessing
The preprocessing phase is very important in data analysis because it improves data quality. In this phase, the WQI was calculated from the most important parameters of the dataset. Then, water samples were classified on the basis of WQI values. The z-score method was used as a data normalization technique for superior accuracy.
2.2.1. Water Quality Index (WQI) Calculation
The WQI, which is calculated using several parameters that affect WQ [32], was used to measure water quality. The performance of the proposed system was evaluated on the published dataset, with seven important water quality parameters. The WQI was calculated using the following formula:
$$\mathrm{WQI} = \frac{\sum_{i=1}^{N} q_i \, w_i}{\sum_{i=1}^{N} w_i} \tag{1}$$
where $N$ denotes the total number of parameters included in the WQI formula, $q_i$ denotes the quality estimate scale for each parameter $i$ calculated by Formula (2), and $w_i$ denotes the unit weight of each parameter in Formula (3).
$$q_i = 100 \times \frac{V_i - V_{\mathrm{Ideal}}}{S_i - V_{\mathrm{Ideal}}} \tag{2}$$
where $V_i$ is a measured value that refers to the water samples tested, $V_{\mathrm{Ideal}}$ is an ideal value and indicates pure water (0 for all parameters except DO = 14.6 mg/L and pH = 7.0), and $S_i$ is a standard value recommended for parameter $i$, as shown in Table 1.
$$w_i = \frac{K}{S_i} \tag{3}$$
where $K$ denotes the constant of proportionality, which is calculated using the following formula:
$$K = \frac{1}{\sum_{i=1}^{N} \frac{1}{S_i}} \tag{4}$$
Table 2 and Table 3 represent the parameters of the unit weight and the WQC, respectively.
The WQI can also be calculated from more parameters than those selected here, since it depends on the variable data. The proposed system can test any parameters with any water quality data.
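The WQI calculation described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the common weighted arithmetic form of the index: a quality rating q_i = 100 (V_i - V_Ideal) / (S_i - V_Ideal), unit weights w_i = K / S_i, and K = 1 / sum(1 / S_i). The function names and the standard values in `EXAMPLE_STANDARDS` are hypothetical placeholders and are not the values of Table 1.

```python
from typing import Dict, Mapping

# Ideal values stated in the text: 0 for every parameter except DO and pH.
IDEAL_VALUES: Dict[str, float] = {"DO": 14.6, "pH": 7.0}

# HYPOTHETICAL standard values S_i, for illustration only; the study takes
# its standards from Table 1, which is not reproduced here.
EXAMPLE_STANDARDS: Dict[str, float] = {"DO": 10.0, "pH": 8.5, "BOD": 5.0}


def quality_rating(value: float, standard: float, ideal: float = 0.0) -> float:
    """Quality rating q_i = 100 * (V_i - V_ideal) / (S_i - V_ideal)."""
    return 100.0 * (value - ideal) / (standard - ideal)


def unit_weights(standards: Mapping[str, float]) -> Dict[str, float]:
    """Unit weights w_i = K / S_i with K = 1 / sum(1 / S_i)."""
    k = 1.0 / sum(1.0 / s for s in standards.values())
    return {name: k / s for name, s in standards.items()}


def water_quality_index(
    sample: Mapping[str, float],
    standards: Mapping[str, float],
    ideals: Mapping[str, float] = IDEAL_VALUES,
) -> float:
    """WQI = sum(q_i * w_i) / sum(w_i) over the parameters in `standards`."""
    weights = unit_weights(standards)
    weighted = sum(
        quality_rating(sample[name], standards[name], ideals.get(name, 0.0))
        * weights[name]
        for name in standards
    )
    return weighted / sum(weights.values())


if __name__ == "__main__":
    # Made-up sample values, used only to show the call.
    sample = {"DO": 6.5, "pH": 7.8, "BOD": 3.2}
    print(round(water_quality_index(sample, EXAMPLE_STANDARDS), 2))
```

With this form, a sample at its ideal values scores 0 and a sample exactly at the standard values scores 100, so lower values indicate cleaner water.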
2.2.2. Z-Score Normalization Method
The Z-score is used to normalize data by computing both the mean (μ) and the standard deviation (σ). In this study, the Z-score was applied to scale parameter values between 0 and 2. It is calculated using the following formula:
$$z = \frac{x - \mu}{\sigma}$$
where $x$ represents the tested sample in the dataset to be evaluated.
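As an illustration of the normalization step, the following sketch standardizes each column of a feature matrix with NumPy. It assumes the population standard deviation (NumPy's default, `ddof=0`) and leaves constant columns at zero; both choices, and the function name `zscore_normalize`, are assumptions made here rather than details given in the study.

```python
import numpy as np


def zscore_normalize(features: np.ndarray) -> np.ndarray:
    """Standardize each column: z = (x - mean) / standard deviation."""
    features = np.asarray(features, dtype=float)
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)  # population standard deviation (ddof=0)
    # Guard against division by zero for constant columns (assumption).
    sigma = np.where(sigma == 0.0, 1.0, sigma)
    return (features - mu) / sigma


if __name__ == "__main__":
    # Made-up values for two parameters (for example, DO and pH).
    data = np.array([[6.7, 7.5], [5.7, 7.2], [6.3, 6.9], [5.8, 6.9]])
    print(zscore_normalize(data))
```

After this transformation, every non-constant column has zero mean and unit standard deviation, so parameters measured in different units contribute on a comparable scale.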
2.3. Adaptive Neuro-Fuzzy Inference System (ANFIS) Model
The ANFIS model is a type of ANN algorithm proposed by Jang [34,35]. This model is used to solve complex and nonlinear problems. The algorithm combines a neural network with fuzzy logic and is, therefore, powerful. It is used to predict data and obtain the optimal membership function through an adaptive system in the input layer. The ANFIS model consists of five layers: fuzzification, antecedent, strength normalization, consequent, and inference [36]. Each layer contains many nodes. The ANFIS model is represented by two input parameters and an output parameter, as illustrated in Figure 2. The if-then rules are applied as follows:
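$$\text{Rule 1: if } x \text{ is } A_1 \text{ and } y \text{ is } B_1, \text{ then } f_1 = p_1 x + q_1 y + r_1$$
$$\text{Rule 2: if } x \text{ is } A_2 \text{ and } y \text{ is } B_2, \text{ then } f_2 = p_2 x + q_2 y + r_2$$
where $x$ and $y$ are the input parameters for node $i$; $A_1$, $A_2$, $B_1$, and $B_2$ are the fuzzy sets; $p_i$, $q_i$, and $r_i$ are the consequent parameters; and $f$ is the output of the ANFIS model.
The first layer implements a membership function to convert the input data into a fuzzy set.
$$O_{1,i} = \mu_{A_i}(x), \quad i = 1, 2; \qquad O_{1,i} = \mu_{B_{i-2}}(y), \quad i = 3, 4$$
$$\mu_{A_i}(x) = \frac{1}{1 + \left| \dfrac{x - c_i}{\sigma_i} \right|^{2 b_i}}$$
where $\mu(x)$ and $\mu(y)$ are membership functions; $A_i$ is the linguistic variable; and $\sigma_i$, $b_i$, and $c_i$ are the parameters of the Bell function.
Nodes in the second layer are fixed nodes where inputs from the previous layer are multiplied with the node value to form an output signal for the second layer.
$$O_{2,i} = w_i = \mu_{A_i}(x) \, \mu_{B_i}(y), \quad i = 1, 2$$
where the $w_i$ signal refers to the firing strength of the rule.
The ratio of each firing strength $w_i$ to the sum of all firing strengths is calculated to normalize firing strength.
$$O_{3,i} = \bar{w}_i = \frac{w_i}{w_1 + w_2}, \quad i = 1, 2$$
where $O_{3,i}$ is the output of layer 3 and $\bar{w}_i$ is the normalized firing strength.
The nodes of the fourth layer are adaptable, and the output of this layer is $O_{4,i}$. The node function of the fourth layer is defined in the following equation:
$$O_{4,i} = \bar{w}_i f_i = \bar{w}_i \left( p_i x + q_i y + r_i \right)$$
where $p_i$, $q_i$, and $r_i$ are consequent parameters used for the fuzzy inference system function ($f_i$).
The fifth layer is applied to obtain the model output. The final output of the network is described as follows:
$$O_{5} = \sum_{i} \bar{w}_i f_i = \frac{\sum_{i} w_i f_i}{\sum_{i} w_i}$$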
ANFIS is trained with a back-propagation algorithm in which the error between the expected and actual outputs is calculated through an error function. Weights are updated backward from the fifth layer to the first, and the process continues until the lowest error rate is obtained.
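To make the five layers concrete, the sketch below runs one forward pass of a two-input, two-rule ANFIS in NumPy. It assumes the usual first-order Sugeno formulation with generalized bell membership functions; the function names and all parameter values are hypothetical, and training by back-propagation is not shown.

```python
import numpy as np


def bell(x: float, sigma: float, b: float, c: float) -> float:
    """Generalized bell membership: 1 / (1 + |(x - c) / sigma|^(2b))."""
    return 1.0 / (1.0 + np.abs((x - c) / sigma) ** (2.0 * b))


def anfis_forward(x, y, premise_x, premise_y, consequent):
    """One forward pass; rule i pairs premise_x[i] with premise_y[i].

    premise_x, premise_y: lists of (sigma, b, c) bell parameters.
    consequent: list of (p, q, r) so that f_i = p*x + q*y + r.
    """
    # Layer 1: fuzzification of both inputs.
    mu_a = np.array([bell(x, *params) for params in premise_x])
    mu_b = np.array([bell(y, *params) for params in premise_y])
    # Layer 2: firing strength of each rule (product of memberships).
    w = mu_a * mu_b
    # Layer 3: normalized firing strengths.
    w_bar = w / w.sum()
    # Layer 4: weighted first-order consequents.
    f = np.array([p * x + q * y + r for p, q, r in consequent])
    layer4 = w_bar * f
    # Layer 5: overall output as the sum of the weighted consequents.
    return float(layer4.sum())


if __name__ == "__main__":
    # Hypothetical parameters, chosen only to show the call.
    premise_x = [(2.0, 2.0, 0.0), (2.0, 2.0, 5.0)]
    premise_y = [(1.5, 2.0, 1.0), (1.5, 2.0, 4.0)]
    consequent = [(0.5, 0.2, 1.0), (-0.3, 0.8, 2.0)]
    print(anfis_forward(1.0, 3.0, premise_x, premise_y, consequent))
```

Because the normalized firing strengths sum to one, the output is always a weighted average of the rule consequents, which is what makes the fifth layer a simple summation.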
Figure 3 shows the framework of an FFNN model for predicting WQI.
The data were divided into 70% for the training phase and 30% for the testing phase. The ANFIS model was built with the scatter partition fuzzy approach, which uses clustering to divide the dimension vectors into specific regions for the fuzzy rules. The model was developed by integrating fuzzy c-means clustering and back-propagation algorithms. Seven clusters, a minimum improvement of $10^{-5}$, a partition matrix exponent of 2, and 150 epochs were found appropriate.
2.4. Classification of Water Quality
In this section, two classification algorithms, namely, KNN and FFNN, are presented.
2.4.1. K-Nearest Neighbors (KNN) Model
The KNN algorithm is a traditional machine learning algorithm used for the classification of data. It uses the K nearest neighbors of an object, that is, the K closest points in the feature vectors, to decide the class of the object. The K-value should be unique. In this study, three K-values were found appropriate for obtaining good results. The Euclidean distance function (Di) was applied to find the nearest neighbors in the feature vectors.
$$D_i = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$
where $x_1$, $x_2$, $y_1$, and $y_2$ are variables for input data.
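The classification rule can be sketched as a small majority-vote KNN built on the Euclidean distance. The value K = 3 used as the default is one reading of the text and should be treated as an assumption; the function names, class labels, and sample points are hypothetical.

```python
from collections import Counter

import numpy as np


def euclidean(a, b) -> float:
    """Euclidean distance between two feature vectors."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))


def knn_predict(train_x, train_y, query, k: int = 3):
    """Return the majority label among the k training points nearest to query."""
    distances = np.array([euclidean(row, query) for row in train_x])
    # A stable sort keeps the original order among equally distant points.
    nearest = np.argsort(distances, kind="stable")[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Made-up two-feature samples with hypothetical class labels.
    train_x = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
    train_y = ["good", "good", "good", "poor", "poor", "poor"]
    print(knn_predict(train_x, train_y, [0.2, 0.1]))
```

An odd K avoids ties in two-class votes; with three classes ties remain possible, and this sketch resolves them in favor of the label encountered first among the nearest neighbors.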
2.4.2. Artificial Neural Networks (ANNs)
The artificial neural network is a very powerful computation method that has been used to develop a number of real medical applications [37]. In general, ANN models are used as very powerful machine learning algorithms for time series prediction in different engineering applications. The ANN model consists of an input layer, hidden layers, and an output layer. Each hidden layer has weight and bias parameters to manage neurons. An activation function is used to transfer the data from the hidden layer to the output layer. Learning algorithms are used to select the weights within the NN framework; the weights are selected to minimize a performance measure, such as the mean square error (MSE).
Figure 4 shows the architecture of the FFNN for water quality classification (WQC).
In this study, the ANN algorithm was used to classify water quality. The network has three types of layers: input, hidden, and output. Five hidden layers were used to pass the training inputs from the input layer to the output layer through the sigmoid function. The output layer had three classes.
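The forward pass of such a network can be sketched with NumPy. The sketch assumes seven inputs, five hidden layers, a sigmoid activation on every layer, and three output classes; the hidden-layer width of 10, the random initialization, and the function names are hypothetical choices, and the weights here are untrained.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))


def init_network(layer_sizes, seed: int = 0):
    """Create (weights, bias) pairs for consecutive layer sizes."""
    rng = np.random.default_rng(seed)
    return [
        (rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    ]


def forward(x: np.ndarray, params) -> np.ndarray:
    """Propagate a batch of inputs through every layer with sigmoid units."""
    activation = np.asarray(x, dtype=float)
    for weights, bias in params:
        activation = sigmoid(activation @ weights + bias)
    return activation


def predict_class(x: np.ndarray, params) -> np.ndarray:
    """Pick the output unit with the largest activation for each sample."""
    return np.argmax(forward(x, params), axis=-1)


if __name__ == "__main__":
    # 7 inputs, five hidden layers of HYPOTHETICAL width 10, 3 output classes.
    params = init_network([7, 10, 10, 10, 10, 10, 3])
    batch = np.zeros((4, 7))  # placeholder for z-score normalized samples
    print(predict_class(batch, params))
```

In practice the weights would be fitted by a learning algorithm that minimizes a loss such as the MSE mentioned above; only the inference path is shown here.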
2.5. Performance Measurement
Performance measurement approaches, such as MSE, were applied to evaluate the ability of the proposed model to predict the WQI. Furthermore, the accuracy, specificity, sensitivity, precision, recall, and F-score performance measurements were determined to evaluate the FFNN and KNN classification algorithms to classify the WQC. The statistical methods used are defined as follows:
Root mean square error (RMSE)
where R is Pearson’s correlation coefficient, x is the observation input data in the first set of the training data, y is the observation input data of the second set of the training data, and n is the total number of input variables.
F-score
where TP, TN, FP, and FN are the true positive, true negative, false positive, and false negative, respectively.
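The measures named in this section can be computed as in the sketch below. The formulas did not survive in this text, so the code assumes the standard textbook definitions of MSE, RMSE, Pearson's R, and the confusion-matrix measures; the function names are hypothetical. For the three-class problem, the confusion-matrix measures would be computed per class in a one-versus-rest fashion.

```python
import math
from typing import Dict, Sequence


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean square error between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error: square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two series."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / math.sqrt(
        (n * sxx - sx ** 2) * (n * syy - sy ** 2)
    )


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Accuracy, sensitivity (recall), specificity, precision, and F-score."""
    sensitivity = tp / (tp + fn)  # also called recall
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "f_score": 2 * precision * sensitivity / (precision + sensitivity),
    }


if __name__ == "__main__":
    # Made-up values, used only to show the calls.
    print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
    print(classification_metrics(tp=8, tn=9, fp=1, fn=2))
```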
4. Discussion
Modelling and predicting water quality play a pivotal role in saving time and resources in laboratory analysis. Artificial intelligence algorithms were explored as an alternative method to estimate and predict water quality. This study used experimental data of 1679 samples from 666 different rivers and lakes in different states of India. The dataset includes seven selected important parameters: DO, pH, conductivity, BOD, nitrate, fecal coliform, and total coliform.
Table 6 compares the results of existing models with those of our proposed system. Various studies have used machine learning models for modelling and predicting WQ. Ahmed et al. [38] applied the FFNN model to predict WQI, using 25 parameters as input data. Gazzaz et al. [39] applied machine learning to predict WQI with 23 input parameters. Sakizadeh [40] employed 16 parameters. Rankovic et al. [41] proposed an artificial intelligence model to predict WQ using 10 input parameters. Umair Ahmed et al. [42] used various machine learning models for WQI and WQC with four input parameters; they noted that the polynomial regression model is good for predicting WQI, whereas the multi-layer perceptron (MLP) model is suitable for classifying WQC.
Although fewer parameters were used in this investigation than in most of the studies above, the results of this research are superior to others. Selecting few parameters is suitable for expensive real-time systems. In this study, seven significant parameters were used for modelling and predicting WQI, with superior results: a very low prediction error (MSE = 0.00336) and a high correlation coefficient (R = 96.17%).
Moreover, using the FFNN model, a system to detect WQC was developed with the highest accuracy (100%). The proposed method uses only seven water quality parameters for predicting and classifying water quality, and the empirical results confirmed the effectiveness of the model, whereas previous research used machine learning models with less accuracy.
This system can monitor drinking water and contaminated water with high accuracy. This study suggests that the combined approach of the artificial intelligence techniques proposed here is a promising tool for accurately simulating water level and quality. The developed model has shown acceptable performance when compared with the available ones, as presented in Table 6.
The ultimate goal of this work is to align directly with Sustainable Development Goal (SDG) 6, which aims to ensure access to clean water for all. The developed model can be used easily and inexpensively to predict the water quality index and, in turn, the water quality class with high accuracy. In addition, this kind of model is robust and can forecast water contamination, and can thus guide the authorized governments and agencies in developing effective strategies for better water sustainability and management, by removing the contamination source and/or seeking an alternative source of pure water to meet community demand.