1. Introduction
Water is one of the essential components of life; without water, no life would be possible on Earth. Although water covers about 66% of the Earth's surface, only about 1% of it is usable; the rest is not safe to use because it is saline. From an economic point of view, water is an important part of a nation's economy and wealth. However, over the last few years, water levels have fallen considerably, making this one of the biggest emerging problems of today's world [1]. As the world population keeps increasing and its predicted growth puts water resources under pressure, providing clean water to this growing population has become a challenging task. Rapid population growth directly affects water quality (WQ) as well, and the cost of providing safe water is rising rapidly [2]. According to research, the lack of clean water may increase the probability of individuals living in poverty. Water distribution between countries is uneven. About 60% of water resources are considered accessible [3], meaning that they are readily available for industrial, agricultural, and drinking use [4].
Rivers and groundwater are the fundamental sources of fresh water, and social and economic development is directly linked with fresh water [5]. Due to human activities, both surface water and groundwater are under great pressure. Activities such as commercialization, urbanization, population growth, and industrialization have a direct impact on water quality and quantity [6]. In addition, climate change and global warming further degrade water quality. Therefore, water-quality evaluation and estimation are of great concern today [7].
The index used for the assessment and classification of surface water and groundwater is the water-quality index (WQI), a widely used parameter for water-quality classification. Brown et al. [8] proposed an index for estimating the water-quality level. The index is computed from physicochemical parameters of the water, such as pH, the concentration of pollutants, dissolved oxygen, temperature, turbidity, and biochemical oxygen demand. For policymakers, the WQI provides meaningful qualitative information and is helpful to planners of water-distribution systems. The drawback of the WQI is that it involves lengthy and complex computations that require considerable time and effort [9]. To address these problems, there is a pressing need for an alternative, state-of-the-art system for efficient water-quality classification (WQC).
AI-based modeling removes the complex and lengthy calculations and classifies the WQI promptly [9]. Therefore, water-quality classification using artificial intelligence-based systems is attracting the attention of many researchers, and different WQC systems based on machine learning and deep learning models have been proposed. However, such efforts often achieve low accuracy. Furthermore, the datasets available for experiments contain missing values for attributes that are essential for water-quality prediction, which has a direct impact on the results.
Clean and easily available water is required for drinking, domestic use, recreational activities, and food production. Better water-supply and resource management can significantly improve a country's economic development. Sufficient water should be available for personal and domestic use and should always be safe, easily accessible, and available to everyone. Every year, many individuals die from kidney failure, cancer, and other diseases caused by polluted water. Laboratory methods for classifying water quality are resource-intensive and time-consuming. Many water-quality classification methods are already available; however, many lack accuracy. As a result, it is very important to have an automated system that can classify water quality with minimal human effort and in a time-efficient manner.
The continuous, diligent evaluation of drinking-water sources and their acceptability by the public health community is referred to as potable-water-quality surveillance. A well-designed water distribution and monitoring system safeguards people's health, provided the potable water is treated without errors; conversely, even a perfect water-treatment system is in vain if the architecture of the water supply allows contamination to enter the potable water. Concerns about water contamination have grown during the last decade, and the prediction of water quality has emerged as an important topic, since it directly relates to the survival of life on Earth. As a result, there is a vast amount of work on automated water-quality prediction techniques, but such efforts often yield comparatively low accuracy. Moreover, the datasets available for experimentation contain missing values and missing attributes, which affect the results of water-quality prediction. To address this issue efficiently, this study makes the following contributions:
A novel H2O AutoML stacked ensemble model is proposed that provides higher accuracy for drinking-water-quality prediction.
To resolve the issue of missing values, experiments are performed under two scenarios: in the first, records with missing values are deleted, while in the second, a K-nearest neighbor (KNN) imputer is used.
Experiments are conducted to assess the performance of the KNN imputer and the proposed H2O AutoML stacked ensemble model against several learning models, including logistic regression (LR), extra tree classifier (ETC), random forest (RF), stochastic gradient descent classifier (SGDC), Gaussian naïve Bayes (GNB), and gradient-boosting machine (GBM).
The importance of different features is explained using the SHapley Additive exPlanations (SHAP) model.
This study of WQC consists of four further sections:
Section 2 briefly discusses the previous research related to WQC.
Section 3 consists of the description of the dataset, proposed methodology, and description of the machine learning model used in this study.
Section 4 describes the results, and
Section 5 discusses the conclusions of the study.
2. Related Work
Water is one of the most important resources for the existence of life, and human needs are directly linked with the availability of water from both surface and groundwater sources. Thus, it is very important to have a state-of-the-art system that can classify water quality. Many studies carried out for water-quality classification have provided promising results. This section reviews several previous works that used artificial intelligence systems for water-quality index prediction.
Juna et al. [10] worked on automatic water-quality prediction using a KNN imputer and an MLP. They handled the missing values efficiently and obtained high accuracy, proposing a nine-layer multilayer perceptron (MLP) combined with a KNN imputer and comparing it against seven machine learning algorithms. Experimental results show that the proposed nine-layer MLP achieved 99% accuracy for water-quality prediction when the KNN imputer was used. A dependable approach was proposed by Nida Nasir et al. [4] for accurately predicting water quality. The authors used various machine learning and stacked ensemble learning models for water-quality classification via the water-quality index, including LR, RF, DT, SVM, XGBoost, CatBoost, and MLP. The results show that CatBoost achieved an accuracy of 94.51%. For water-quality classification, Radhakrishnan and Pillai [2] used three machine learning models, DT, SVM, and NB, on multiple datasets. A comparison of the models revealed that DT achieved the best classification accuracy, i.e., 98.50%.
Aldhgani et al. [11] used a non-linear autoregressive neural network (NARNET) and long short-term memory (LSTM). In addition to these deep learning models, they also used three machine learning models, NB, SVM, and KNN, in their experiments. NARNET and LSTM achieved almost the same accuracy but slightly different regression coefficients (94.21% for LSTM and 96.17% for NARNET), and among the machine learning models, SVM achieved an accuracy of 97.01%. Shahra et al. [12] proposed a deep learning-based system for water-quality classification in water distribution networks, aiming for high accuracy with low computation time. They used two learning algorithms, ANN and SVM; the ANN outperformed the SVM, achieving 94% accuracy versus 89% for SVM.
An adaptive neuro-fuzzy system was proposed by Hadi et al. [13] for classifying drinking water into two classes, safe and unsafe. They used a real-time time-series dataset with four water-quality parameters: bacteria count, color, turbidity, and pH. The proposed adaptive neuro-fuzzy system achieved an accuracy of 92% for detecting contaminated data. Abuzir and Abuzir [14] used J48, MLP, and NB for water-quality classification on a dataset with 10 features. Different feature-extraction techniques were used for dimensionality reduction, and experiments were run under three scenarios: using all features, using five features, and using two features. With both the full and the selected feature sets, MLP outperformed the other two learning models.
Hassan et al. [15] used machine learning and deep learning models for the classification of Indian water-quality data. The authors used SVM, RF, NN, multinomial logistic regression (MLR), and bagged tree models (BTM). The results revealed that key features such as total coliform, biological oxygen demand, dissolved oxygen, conductivity, pH, and nitrate affect the water-quality classification. A study by Sillbery et al. [16] used attribute realization (AR) and SVM for water-quality classification of the Chao Phraya River; applying AR-SVM to six features of the river-water data yielded accuracies ranging from 86% to 95%. The study by Ahmed et al. [17] used four features, turbidity, pH, TDS, and temperature, for water-quality prediction. Experimental results show that MLP outperformed the other learning algorithms, achieving an accuracy of 85.05% with a (3,7) configuration.
IoT-based systems have also played a role in water-quality classification. Kakkar et al. [18] used IoT-based devices to collect data from residential overhead tanks and then applied machine learning and deep learning systems for WQC. Malek et al. [19] used Kelantan River data from 2005 to 2020 for water-quality classification, employing different kinds of machine learning models with 13 physical and chemical water-quality parameters. The experimental results show that gradient boosting with a learning rate of 0.1 achieved an accuracy of 94.90%. For water-quality and water-demand prediction, Rustam et al. [20] proposed an artificial neural network with one hidden layer and several dropout and activation layers. Experiments were conducted on two datasets to predict water quality and water consumption; they achieved an accuracy of 0.96 for water-quality prediction, while the score for water-consumption prediction was 0.99. A comparative analysis of existing approaches for water-quality prediction is presented in Table 1.
3. Material and Methods
This section explains the proposed approach for predicting water quality, as well as the machine learning models and dataset used in the experiments.
Figure 1 illustrates the proposed architecture used for experiments in water-quality prediction.
3.1. Description of Dataset
The dataset used in this study, known as “Water Quality”, is obtained from Kaggle, a well-known platform, and is freely available at [21]. A brief description of the dataset is given in Table 2. The dataset comprises 935 instances and 10 columns, with the target class ‘potable’. The target class has two values, 1 and 0, where 1 indicates that the water is safe for drinking and 0 indicates that it is not.
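As a minimal illustration of this structure (the file name water_quality.csv and the column label Potability are assumptions made here for the sketch; the exact names depend on the Kaggle download), the dataset can be loaded and inspected with pandas:

```python
import pandas as pd

# Load the Kaggle "Water Quality" dataset (file name assumed).
df = pd.read_csv("water_quality.csv")

# Basic structure: number of instances, columns, and missing values per column.
print(df.shape)                          # expected: (935, 10) per the description above
print(df.isna().sum())                   # per-column count of missing values
print(df["Potability"].value_counts())   # target: 1 = safe to drink, 0 = not safe
```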
3.2. KNN Imputer
In today's world, a large amount of data is available for research and decision-making. These data are generated from different, heterogeneous sources, so their adequacy and relevance may vary with respect to a research objective. Often, such datasets are limited by missing information for one or more of their attributes. This may happen due to human error during data extraction or collection, or due to erroneous conversions and other processing routines. As a result, dealing with missing values has become an important part of data preparation, and the choice of imputation method matters because model performance is directly linked with it. The KNN imputer from scikit-learn is a common approach for imputing missing data and is widely used in place of traditional imputation methods [22].
The KNN imputer facilitates the imputation of missing values in observations by using a Euclidean distance matrix to determine the nearest neighbors. The Euclidean distance is calculated by ignoring missing values and increasing the weight of the non-missing coordinates:

$$d(x, y) = \sqrt{\frac{n_{\text{total}}}{n_{\text{present}}} \sum_{i \in P} (x_i - y_i)^2},$$

where $P$ is the set of coordinates present in both observations, $n_{\text{present}} = |P|$ is the number of non-missing coordinates, and $n_{\text{total}}$ is the total number of coordinates.
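A minimal sketch of this imputation step using scikit-learn's KNNImputer (the number of neighbors, n_neighbors=5, is scikit-learn's default and an assumption here; the study does not state the value used):

```python
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.read_csv("water_quality.csv")          # file name assumed, as above
features = df.drop(columns=["Potability"])     # impute only the predictor columns

# KNNImputer fills each missing entry with the mean of that feature over the
# k nearest neighbors, using the nan-aware Euclidean distance described above.
imputer = KNNImputer(n_neighbors=5, weights="uniform")
X_imputed = pd.DataFrame(imputer.fit_transform(features), columns=features.columns)

df_imputed = X_imputed.assign(Potability=df["Potability"].values)
```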
3.3. Deleting Missing Values from the Dataset
The second approach for dealing with the data is to delete the missing values. This approach is used in the second set of experiments, where all fields with missing data are deleted.
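A corresponding sketch of this deletion scenario in pandas (again with the file name assumed):

```python
import pandas as pd

df = pd.read_csv("water_quality.csv")  # file name assumed, as above

# Drop every row that contains at least one missing value (listwise deletion).
df_complete = df.dropna(axis=0, how="any").reset_index(drop=True)
print(f"{len(df)} rows before deletion, {len(df_complete)} rows after")
```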
3.4. H2O AutoML
H2O AutoML [23] is an automated machine learning algorithm included in the H2O framework [24]. It is easy to understand and implement, is designed for enterprise environments, and produces high-quality models. On tabular datasets, H2O AutoML supports multiple kinds of problems, such as binary classification, multi-class classification, and regression. A major advantage of H2O AutoML is its capacity for fast scoring: multiple H2O models can produce predictions in very little time. Another benefit is that it offers APIs in different languages, which allows it to be used seamlessly in different fields, and it integrates tightly with big-data analytics. H2O AutoML is a fully automatic supervised learning framework implemented in the H2O library; it is open-source, distributed, and scalable, and it is widely used both in academia and in industry.
To evaluate the performance of the learning models for water-quality detection, several classifiers are used with the H2O AutoML technique to check the efficacy of the proposed system. H2O version 3.10.3.1 is used to train the learning models, and all learning algorithms are implemented using the H2O AutoML module. This study uses seven learning models for water-quality classification: logistic regression [25], Gaussian naïve Bayes [26], random forest [27], extra tree classifier [28], gradient boosting machine [29], stochastic gradient descent [30], and the H2O stacked ensemble [23].
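A minimal sketch of how such models can be trained through the H2O AutoML Python interface (the file name, column names, split ratio, run-time limit, and random seed below are assumptions for illustration, not values taken from the study):

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()

# Load the (already imputed) dataset into an H2O frame; file name assumed.
frame = h2o.import_file("water_quality_imputed.csv")
frame["Potability"] = frame["Potability"].asfactor()   # treat the target as categorical

predictors = [c for c in frame.columns if c != "Potability"]
train, test = frame.split_frame(ratios=[0.8], seed=42)

# AutoML trains and cross-validates a collection of models (GBM, DRF, GLM, ...)
# and builds stacked ensembles on top of them.
aml = H2OAutoML(max_runtime_secs=600, seed=42)
aml.train(x=predictors, y="Potability", training_frame=train)

print(aml.leaderboard.head())                 # ranked list of trained models
print(aml.leader.model_performance(test))     # evaluate the best model on held-out data
```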
3.5. Logistic Regression
LR is extensively used for classification. It can deal with a large number of features because it provides a straightforward equation for binary classification problems. To achieve the best results, several of its hyperparameters were optimized. To compute the probability of a certain event occurring, a mathematical function called the logistic regression hypothesis function is used, and the sigmoid function transforms the logistic regression output into a probability value. The cost function of LR can be calculated as follows.
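In the standard binary cross-entropy (log-loss) form, which is the textbook formulation rather than a detail specific to this study,

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta\!\left(x^{(i)}\right) + \left(1 - y^{(i)}\right)\log\!\left(1 - h_\theta\!\left(x^{(i)}\right)\right)\right],$$

where $m$ is the number of training samples, $y^{(i)} \in \{0, 1\}$ is the class label, and $h_\theta(x) = 1/(1 + e^{-\theta^{\top}x})$ is the sigmoid hypothesis function.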
3.6. Gaussian Naïve Bayes
GNB is an advanced variant of naïve Bayes and is likewise based on Bayes' theorem. Standard naïve Bayes handles categorical variables efficiently and expects categorical inputs; however, the water-quality classification dataset consists of numeric data, which is why GNB is used here. GNB can also handle large datasets through partial fitting, taking chunks of data into account during training.
3.7. Random Forest
RF is a tree-based ensemble model. It is an advanced version of the decision tree and is used for supervised learning problems. Because RF combines many weak learners, it produces highly accurate predictions. RF uses the bagging technique: many decision trees are trained on different bootstrap samples obtained by sub-sampling the training dataset, and the size of each bootstrap sample is the same as the size of the training dataset. In RF, a notable issue in the construction of a tree is identifying the attribute to use at each level, starting from the root node; this process is known as attribute selection. In ensemble classification, two or more classifiers are trained, and their results are combined using a voting process; the most common ensemble techniques are bagging and boosting. The RF prediction can be defined as the majority vote over the predictions of its individual trees.
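Formally (a generic formulation of majority voting, given here for illustration),

$$\hat{y} = \operatorname{mode}\{h_1(x), h_2(x), \ldots, h_B(x)\},$$

where $h_b(x)$ is the prediction of the $b$-th decision tree and $B$ is the number of trees in the forest.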
For the majority of classification tasks, the Gini index is used as the cost function for estimating a split in the dataset. The Gini index can be computed as shown below.
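Using the standard definition for a node containing samples from $C$ classes,

$$Gini = 1 - \sum_{c=1}^{C} p_c^{2},$$

where $p_c$ is the proportion of samples of class $c$ in the node; a lower value indicates a purer split.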
3.8. Stochastic Gradient Descent
SGD is a renowned optimization method that updates the model's parameters in each iteration to reduce the cost function $J(\theta)$. SGD is a well-known variant of gradient descent that introduces randomness: in each iteration, a single randomly selected sample is used to train the model. SGD therefore needs less training time, since it evaluates the cost of only a single training sample at each iteration while moving toward a local minimum. It updates the model parameters for every training sample $x_i$ and target class $y_i$ as

$$\theta = \theta - \eta \, \nabla_{\theta} J\!\left(\theta; x_i, y_i\right),$$

where $\eta$ is the learning rate and $\theta$ is the parameter vector. For better performance, several hyperparameters of SGD are tuned.
3.9. Gradient Boosting Machine
GBM is a boosting algorithm that is widely used for classification and regression problems. GBM consists of three main components: a loss function, a weak learner, and an additive model. The additive model in gradient boosting minimizes the loss function by combining many weak learners, and GBM handles imbalanced datasets efficiently. The purpose of boosting is to strengthen the algorithm so that it can detect the model's weaknesses and compensate for them with subsequent learners to produce near-perfect outcomes. GBM does this by training numerous models gradually, additively, and sequentially.
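Conceptually (a generic formulation of the additive model in gradient boosting, included for illustration rather than taken from this study), each stage adds a new weak learner that corrects the current ensemble:

$$F_m(x) = F_{m-1}(x) + \gamma_m h_m(x),$$

where $F_{m-1}$ is the model after $m-1$ stages, $h_m$ is the weak learner (typically a shallow decision tree) fitted at stage $m$ to the negative gradient of the loss, and $\gamma_m$ is its step size.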
3.10. Extra Tree Classifier
ETC is an ensemble learning model composed of multiple unpruned decision trees. For splitting nodes, it uses a random subset of features and, unlike RF, it uses the whole training set to construct each decision tree rather than bootstrap samples. ETC has two primary parameters, the number of randomized input features selected at each node and the minimum sample size needed to split a node, in addition to the number of decision trees in the ensemble (M). The decision trees in ETC are much less likely to be correlated because the split points are selected randomly. ETC aggregates the predictions of the decision trees in the ensemble to produce the final prediction, averaging them in the case of regression.
3.11. H2O AutoML Stacked Ensemble
The H2O stacked ensemble is a supervised learning model used to find an optimal combination of several prediction algorithms; this process of combining multiple prediction algorithms is called stacking. The H2O stacked ensemble supports binary classification, multi-class classification, and regression problems. This work uses an RF classifier as the base learner and a gradient boosting machine as the meta-estimator to predict drinking-water quality.
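A minimal sketch of this stack through the H2O Python API, reusing the predictors, train, and test objects from the AutoML sketch above (the cross-validation settings, the seed, and the assumption that a single cross-validated base model is accepted by the stacked-ensemble estimator are illustrative choices, not taken from the study):

```python
from h2o.estimators import H2ORandomForestEstimator, H2OStackedEnsembleEstimator

# Base learner: a random forest trained with cross-validation so that its
# out-of-fold predictions are available to the meta-learner.
rf = H2ORandomForestEstimator(nfolds=5,
                              keep_cross_validation_predictions=True,
                              fold_assignment="Modulo",
                              seed=42)
rf.train(x=predictors, y="Potability", training_frame=train)

# Stacked ensemble: a GBM meta-learner is trained on the base model's predictions.
stack = H2OStackedEnsembleEstimator(base_models=[rf],
                                    metalearner_algorithm="gbm",
                                    seed=42)
stack.train(x=predictors, y="Potability", training_frame=train)

print(stack.model_performance(test))
```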
3.12. Explainable Machine Learning
To support decision-making, traditional machine learning-based predictions need post hoc interpretation, whose function is to let the community easily understand the rationale behind the predictions. In machine learning applications, interpretability is regarded as being as important as accuracy [31]. Explainable ML provides a basic add-on to machine learning models by improving the transparency of automatically obtained predictions. Broadly, such models are divided into two groups: data-driven interpretation and model-driven interpretation. To interpret the machine learning-based predictions, this study uses the SHAP explainable model because it provides Shapley values as a unified measure of feature importance.
3.13. Shapley Additive Explanations
According to Lundberg and Lee [32], SHAP explains ML predictions based on game theory: the inputs are treated as players and the prediction as the payout, so the contribution of each player to the game can be calculated with the help of SHAP. Several versions of SHAP have been introduced by Lundberg and Lee, including TreeSHAP, KernelSHAP, LinearSHAP, and DeepSHAP, each aimed at a specific category of machine learning model. In this study, TreeSHAP is used to explain the ML predictions. TreeSHAP uses a linear explanatory model and Shapley values to estimate the initial prediction model:
$$g(z') = \phi_0 + \sum_{i=1}^{M} \phi_i z'_i,$$

where $z' \in \{0, 1\}^M$ represents the simplified (binary) input features, $\phi_i$ denotes the feature attribution, and $g$ is the explanation model. Lundberg and Lee [32] calculate each feature attribution using the following equation:

$$\phi_i = \sum_{K \subseteq N \setminus \{i\}} \frac{|K|!\,(|N| - |K| - 1)!}{|N|!}\left[f_x(K \cup \{i\}) - f_x(K)\right],$$

where $N$ represents the set of all inputs, $K$ is a subset of the inputs that excludes feature $i$, and $f_x(K)$ is the expected value of the function on the subset $K$. SHAP thus uses a linear additive feature attribution method to obtain a simpler explanation.
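A minimal sketch of how TreeSHAP can be applied with the shap Python package (the tree model shown here, a scikit-learn random forest, and the file name are stand-ins for illustration; the study's explanations are produced for its own trained models):

```python
import shap
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("water_quality_imputed.csv")           # file name assumed
X, y = df.drop(columns=["Potability"]), df["Potability"]

model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

# TreeSHAP computes exact Shapley values for tree-based models.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global view of feature importance: distribution of SHAP values per feature.
shap.summary_plot(shap_values, X)
```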
3.14. Proposed Framework
This section describes all phases of the proposed framework and the modules utilized in the experiments. Figure 2 elaborates the architecture of the proposed framework, which consists of two phases, each explained separately. In Phase 1, all the learning algorithms are implemented using H2O AutoML model selection on the dataset containing the missing values. In Phase 2, the missing values are imputed with the KNN imputer, which removes the sparsity from the dataset, and the learning algorithms are then implemented. The results obtained clearly show the superiority of the H2O stacked ensemble model over the rest. The contribution of each feature toward the prediction is then explained using the SHAP explainable-AI technique, which reveals the proportion to which each feature participates in the prediction of drinking-water quality. The reason for choosing this stacking is that both of these models individually perform well compared to the other models and are suitable for the task at hand. The SHAP technique is used to relate the final prediction to the features and thereby explain the model's behavior.
3.15. Evaluation
Model evaluation is an important step that focuses on estimating the performance of the model on unseen data. For water-quality classification, the four possible outcomes are described below:
True Positive (TP): instances that are actually positive and are predicted positive.
True Negative (TN): instances that are actually negative and are predicted negative.
False Positive (FP): instances that are negative and are predicted as positive.
False Negative (FN): instances that are positive and are predicted as negative.
This study evaluates the proposed system in terms of accuracy, precision, recall, and F-score. The values of these parameters range between 0 and 1.
Accuracy is the proportion of correctly predicted instances. Precision measures the exactness of the classifier, while recall measures its completeness. The harmonic mean of precision and recall is the F1 score, also referred to as the F-score. These metrics are computed from the four outcomes defined above.
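In terms of TP, TN, FP, and FN as defined above, the standard formulas are

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP},$$

$$\text{Recall} = \frac{TP}{TP + FN}, \qquad F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.$$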
Figure 2. Proposed stacked ensemble H2O AutoML architecture.