Hypertuning-Based Ensemble Machine Learning Approach for Real-Time Water Quality Monitoring and Prediction

Shahid, Md. Shamim Bin; Rifat, Habibur Rahman; Uddin, Md Ashraf; Islam, Md Manowarul; Mahmud, Md. Zulfiker; Sakib, Md Kowsar Hossain; Roy, Arun

doi:10.3390/app14198622


by Md. Shamim Bin Shahid ¹, Habibur Rahman Rifat ¹, Md Ashraf Uddin ²,*, Md Manowarul Islam ¹, Md. Zulfiker Mahmud ¹, Md Kowsar Hossain Sakib ³ and Arun Roy ¹

¹ Department of Computer Science and Engineering, Jagannath University, Dhaka 1100, Bangladesh
² School of Information Technology, Deakin University, Burwood, VIC 3125, Australia
³ School of Computer Science, Taylor’s University, 1, Jln Taylors, Subang Jaya 47500, Malaysia
* Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(19), 8622; https://doi.org/10.3390/app14198622

Submission received: 2 August 2024 / Revised: 14 September 2024 / Accepted: 23 September 2024 / Published: 24 September 2024

(This article belongs to the Special Issue Edge-Enabled Big Data Intelligence for 6G and IoT Applications)


Abstract:

In the present day, the health of the populace is significantly jeopardized by the presence of contaminated water, and the majority of the population is unaware of the distinction between safe and unsafe water consumption. Agricultural, industrial, and other human-induced activities are causing a significant decline in the availability of drinking water. Consequently, the issue of ensuring the safety of ingesting water is becoming increasingly prevalent. People should be aware of the purity of the water and the locations where it can be used in order to resolve this situation. There are numerous IoT-based system architectures that are capable of monitoring water parameters; however, the majority of these architectures do not allow for real-time water quality prediction or visualization. In order to achieve this, we suggest a wireless framework that is based on the Internet of Things (IoT). The sensors are able to capture water parameters and transmit the data to the cloud, where a machine learning (ML) model operates to classify the water quality. After that, Grafana enables us to effortlessly visualize the real-time data and predictions from any location. We employed a multi-class dataset from China for the model’s construction. GridSearchCV was implemented to identify the optimal parameters for model optimization. The proposed model is a combination of the Random Forest (RF), Extreme Gradient Boosting (XGB), and Histogram Gradient Boosting (HGB) models. The accuracy of the model for the China dataset was 99.80%. To assess the robustness of the proposed model, we acquired a new dataset from the Bangladesh Water Development Board (BWDB) and used it to test the proposed model. The model’s accuracy for this dataset was 99.72%. In summary, the proposed wireless IoT framework enables individuals to effortlessly monitor the purity of water and view its parameters from any location.

Keywords:

water quality index; machine learning; key water parameters; ensemble technique; hypertuning

1. Introduction

Water is an indispensable resource for all living organisms, including humans, and is often referred to as the essence of life [1]. The survival of aquatic species is limited by the level of water pollution, and exceeding a given threshold poses a threat to their existence [2]. Water sources such as rivers, lakes, and groundwater have varying quality standards. As surface water of high quality is scarce, about one-third of the world’s population depends on groundwater for their daily needs, particularly for drinking [3]. The quality of surface water is rapidly declining due to natural environmental changes and human activities, including large-scale agricultural harvesting and industrial operations, resulting in a water scarcity crisis and numerous environmental concerns. Recently, there has been growing interest in using IoT [4,5,6], cloud [7,8], and ML [9] technologies to monitor and analyze water quality. The IoT employs networked sensors and other devices to collect and transmit data from the environment to digital platforms. This allows for automated water quality monitoring with the ability to access real-time data remotely, resulting in a significant reduction in time and cost. Using cloud storage and resource management systems, we can easily retrieve data and ensure the confidentiality of the data. By utilizing ML techniques [10], it is possible to analyze the data collected from IoT sensors in order to detect complex patterns. ML algorithms have the capability to predict future water quality by analyzing the current conditions and historical data. This advanced technology aids in identifying potential issues before they become more serious. In order to fully utilize the potential of these technologies, it is vital to establish a comprehensive framework for classifying water quality using the IoT and ML. 
This framework should include trustworthy IoT sensors, optimized data processing and analysis techniques, and ML algorithms that can deliver accurate and reliable water quality predictions. Developing a water quality indexing (WQI) model by refining the best learning algorithms might be beneficial for predicting water quality. Datasets for determining water quality can contain many parameters, but only a few parameters can determine WQI [11]. Therefore, to optimize the use of learning methods, only useful parameters should be chosen using different feature selection methods. Determining the quality of the water is very important because more than two billion people are at risk of consuming water contaminated with harmful microorganisms and toxic substances according to the World Health Organization [12]. Drinking water from various sources is estimated to cause around 829,000 deaths annually due to water-related illnesses. Numerous cutting-edge studies [13,14,15,16] developed ML and deep learning models for assessing surface water quality. However, many of these previous works relied on synthetic datasets that did not accurately represent real-life water conditions. Moreover, their models were trained on limited datasets, which might hinder their applicability in real-world scenarios. To address these limitations, this research explores multiple ML models and ensemble methods, training them on a dataset collected from various surface areas in Bangladesh, as well as one other dataset.

This proposed approach aims to provide a more comprehensive and robust solution for measuring surface water quality. This work aims to predict water quality using ML models that utilize historical data to make accurate classifications. The ML approach has become increasingly popular in recent times due to its cost-effectiveness and time-saving benefits compared to traditional methods. In addition, this methodology determines the quality level of a new dataset using WQI [17]. The main contributions of the research include the following:

An autonomous IoT-based water quality classification system is designed using an ensemble ML model, Terraform, Amazon Web Services (AWS), Telegraf, InfluxDB, and Grafana. These beneficial technologies enhance the system through effective infrastructure management, adaptable data handling, and comprehensive monitoring, along with visualization capabilities.
This study presents an enhanced machine learning model that ensembles three cutting-edge models (RF, XGB, and HGB). Before the model integration, each model’s hyperparameters were carefully tuned, as they significantly influenced the final results.
To verify the proposed model’s efficiency on regional water samples other than those from China, a novel dataset collected from the BWDB is introduced in this study. The proposed model also achieved high accuracy on the newly introduced data. This study can assist the BWDB, or any other water monitoring board, in predicting water quality and improving water resource planning.

The application of the IoT and ML to water quality monitoring could revolutionize the management of our water resources. These technologies provide real-time data and predictive insights, ensuring safe and usable water supplies for future generations.

2. Literature Review

Various related studies were reviewed in order to construct an autonomous Industry 4.0-focused cloud-based IoT architecture for water quality classification and monitoring. Relevant research findings have been provided in Table 1, highlighting the key technologies, contributions, and drawbacks of these studies. The initial section concentrates on cloud-based projects, while the subsequent section concentrates on the development of ML models for water quality prediction.

2.1. IoT Cloud-Based Framework for Real-Time Water Quality Monitoring

Saif Allah H. AlMetwally et al. [18] proposed a real-time water quality system that is founded on the IoT. The system utilizes sensors of pH, temperature, turbidity, water level, and discharge rate. Data are collected by these sensors and transmitted to the microcontroller (MCU). The MCU determines whether the water is potable or not. They did not incorporate an ML model into their system, nor did they utilize cloud storage to store the data. This is the reason why users are unable to remotely determine the water’s condition or parameter values.

A real-time water quality monitoring system with an IoT-based approach was presented by Mohammad Salah Uddin Chowdury et al. [19]. The data were collected using a variety of sensors. Subsequently, the data were compared to standard values and processed using Spark streaming analysis, Spark MLlib, deep learning neural network models, and the Belief Rule Base (BRB) system. They also implemented an automated SMS warning system for contaminated water. However, there was still no cloud database system in place to retain the data for subsequent analysis.

A cost-effective cloud-based water monitoring system was proposed by Rita Wiryasaputra et al. [20]. The data are collected using a variety of sensors and then transmitted to the cloud for prediction by an ML model, which in their case was a decision tree. A web-based application and a mobile notification system were made available to inform consumers about the water quality, and Grafana was used to construct the water monitoring system’s dashboard. However, the decision tree model’s accuracy was only 97%, which is lower than that reported by others [14]. They also did not specify how the cloud architecture would be maintained or how data timing would be handled in their storage system.

The water quality monitoring system introduced by Bineet Kumar Jha et al. [7] is equipped with IoT sensors, a microcontroller, a cloud platform, and a decision tree classifier. The decision tree model’s accuracy is a mere 84%, which is exceedingly low. They have not specified the method by which the cloud architecture would be managed in the event that it required expansion. They employ an SMS notification system; however, a real-time dashboard would be more beneficial, as it would enable users to identify the parameters demonstrating that the water quality is deteriorating.

2.2. Machine Learning Techniques for Water Quality

Haghiabi et al. [13] used several popular ML models, namely the group method of data handling (GMDH), a support vector machine (SVM), and an artificial neural network (ANN), as well as artificial intelligence techniques, to predict the water quality of the Tireh River (Iran). They measured several essential parameters (Ca, Cl, Mg, HCO3, pH, SO4, EC, TDS, Na, etc.) from which the water quality could be predicted. After training and testing, they reported which of the three models was most suitable for each parameter. The SVM’s accuracy was higher than that of GMDH and the ANN for Ca, Cl, Na, and SO4. For Mg and HCO3, both the SVM and the ANN outperformed GMDH, and for EC and TDS, all three models performed equally well. The SVM had the best performance for pH estimation, but its accuracy was slightly lower than the standard.

In another study, Ahmad et al. [21] demonstrated that combining multiple artificial neural networks provides better WQI predictions than a single feedforward neural network. Using the forward selection and backward elimination selective combination methods, they obtained coefficient of determination (R²) values of 0.9340 and 0.9270 and mean squared error (MSE) values of 0.1156 and 0.1256, respectively. They used a total of 25 parameters for the model input, excluding biological oxygen demand (BOD) and chemical oxygen demand (COD).

Similar to this research, Chen et al. [14] estimated water quality and identified the key parameters for water quality prediction. They stated that not only the ML models but also some key parameters of the water dataset are responsible for accurate predictions. They used a total of 10 learning models for water quality prediction on a big dataset. Of these, the decision tree (DT), RF, and deep cascade forest (DCF) models showed better performance with two key parameter sets (DO, CODMn, and NH3-N; CODMn and NH3-N).

Ahmed et al. [22] estimated WQI and the water quality class (WQC) quickly and inexpensively. Using only four parameters, they found that gradient boosting, with a learning rate of 0.1, and polynomial regression, with a degree of 2, predicted WQI most efficiently among all of the models, with mean absolute error (MAE) values of 1.9642 and 2.7273, respectively. For predicting WQC, a multi-layer perceptron (MLP) with a configuration of (3, 7) performed best, with an accuracy of 0.8507.

When predicting NH4⁺-N in the Xiaoqing River estuary, China, Wang et al. [23] observed a strong nonlinear relation between estuarine NH4⁺-N and NH4⁺-N in the upper reaches. They used several ML models, such as ANNs, RF, and XGB, to validate their observations, and a Shapley additive explanation (SHAP) method was used to interpret the XGB model. This study revealed that when the NH4⁺-N concentration in the upper reaches is less than 2 mg/L, there is no negative impact on estuarine NH4⁺-N.

For surface water quality prediction, Jin et al. [24] developed a data-driven model whose primary concerns were accounting for water quality variations and providing real-time early warnings. The model integrates an improved genetic algorithm (IGA) with a backpropagation neural network (BPNN). It was applied to the Ashi River, China, and compared with a simple BPNN model. The data-driven model performed better in every case in terms of accuracy, reliability, and the provision of real-time early warnings.

The authors in [16] used five learning models to determine the biochemical oxygen demand (BOD) values of the Euphrates River, Iraq. A total of eleven parameters were considered within a period of ten years. A comparison between specific and integrative models with two feature extraction algorithms, a genetic algorithm (GA) and Principal Component Analysis (PCA), was shown. After the comparison, they stated that the quantile regression forest (QRF) model showed excellent performance among the learning models, and the integrative PCA-QRF model performed better than all the other models, with the statistical criteria of the coefficient of determination (R²), the root mean squared error (RMSE), and the mean absolute error (MAE) of PCA-QRF being 0.94, 0.12, and 0.05, respectively.

Khan et al. [25] used the principal component regression (PCR) technique to predict the water quality in the area of Gulshan Lake. After computing WQI, they extracted the important parameters using PCA and then applied different regression algorithms to these parameters. Finally, they determined the water quality status (WQS) using a Gradient Boosting Classifier. PCR achieved 95% accuracy, whereas the Gradient Boosting Classifier made 100% accurate predictions.

Moreover, researchers have determined different types of models in various regions on the basis of creative methodologies. Some of them have found that the SVM model yields more accurate WQI predictions than other methods, observed in an R² score of 0.98 in Iran [13]. In Malaysia, the application of a Least Square Support Vector Machine (LSSVM) achieved an R² value of 0.9 [26]. Conversely, superior results were observed in Poland, where neural networks attained an R² value of 0.99 [27]. Similarly, reliable outcomes have been reported in Ethiopia, Vietnam, and Brazil, as documented in references [28,29,30,31].

Table 1. Existing works with key technologies and drawbacks.

Authors	Key Technologies	Contributions	Drawbacks
Haghiabi et al. (2018) [13]	SVM (98%)	Determined which model is suitable for which parameters among an SVM, an ANN, and GMDH.	Accuracy was reduced significantly for predicting pH.
Ahmad et al. (2017) [21]	FS-MNN (93.4%)	Combined two neural networks (forward selection and backward elimination) for better WQI.	BOD and COD parameters were excluded.
Ahmed et al. (2019) [15]	WDT-ANFIS (98%)	An augmented wavelet de-noising technique (WDT-ANFIS) was introduced.	Optimal selection of hyperparameters still required.
U. Ahmed et al. (2019) [22]	MLP (85.07%)	Both WQI and WQCs calculated in a quicker and inexpensive way.	In trying to make their solution inexpensive, they used only four parameters.
Shou Wang et al. (2022) [23]	XGB-SHAP (95.6%)	An additive explanation method was introduced to interpret the XGB model.	Only monthly (no long-term data) readings were used in this study.
Tao Jin et al. (2019) [24]	IGA-BPNN (97.34%)	Adjusted integration of an improved genetic algorithm for water quality for prolonged periods.

3. Research Methodology

We propose an automated system for predicting water quality using IoT technology, which is shown in Figure 1. The system leverages the AWS (version 2) cloud infrastructure for its scalability, reliability, and wide range of services. Terraform (version 2) is employed to define and provision the infrastructure as code, ensuring consistency and reproducibility. The system comprises several key components: data collection, storage, preprocessing, prediction, and visualization.

Data collection: Sensors are deployed in the aquatic environment, including lakes, large rivers, and canals, to collect raw water quality data. We have limited our research to a few primary parameters that are crucial for calculating the water quality index: pH, DO, COD, and NH3-N. For the sensors, we used the ammo::lyser eco sensor module from the manufacturer s::can. A Zigbee module is employed to transmit the collected data to a central system via wireless communication networks.
Data preprocessing: Raw data are preprocessed to prepare them for ML. This entails transforming, cleansing, and, where appropriate, reducing the data to enhance the efficacy of the model. Telegraf (version 1.17) is also essential here because it controls the interval at which data are collected. The sensors produce a reading every second, but readings at this resolution are not always required, so the desired interval can be set using Telegraf.
Water quality prediction: The water quality is predicted by feeding the preprocessed data into the system’s machine learning model. In this investigation, the preprocessed dataset, consisting of 214 instances, was fed into an ensemble model, which is more precise and resilient than a single model. We implemented a hard-voting ensemble approach to improve the prediction accuracy and robustness. The ensemble combined three distinct models: XGB, RF, and HGB. Each model independently predicted the water quality class, and the final class was decided by a majority vote on the models’ outputs. This hard-voting approach ensured that the ensemble capitalized on the strengths of each model, resulting in more accurate predictions than those made by the individual models.
Data storage: InfluxDB, which is hosted on AWS, was employed for data storage. InfluxDB is specifically engineered to facilitate the management of time series data. InfluxDB is the repository for all the raw data and predicted values. The database can also contain additional information about the sensors, such as their uptime and whether or not they require maintenance.
Visualization and monitoring: Grafana is employed to create interactive dashboards that visualize both raw water quality data and the generated predictions. These dashboards provide a user-friendly interface for monitoring water quality trends and patterns. The visualization process is hosted on AWS, enabling access to dashboards and data from any location with an internet connection.

By combining IoT, InfluxDB (version 3), ML, Grafana (version 10.0), AWS, and Terraform, this system offers a robust and scalable solution for automated water quality classification and monitoring.
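The hard-voting step described in the prediction component can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `hard_vote`, the example predictions, and the tie-breaking rule (the first listed model wins) are assumptions made only for illustration, since the source does not describe how ties are handled.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the class predicted by the majority of the base models.

    Ties are broken in favour of the earliest model in the list; this is
    an assumption of the sketch, not something stated in the paper.
    """
    counts = Counter(predictions)
    top = max(counts.values())
    # Return the first prediction that reaches the highest vote count.
    for label in predictions:
        if counts[label] == top:
            return label

# Hypothetical class predictions (1 = best ... 6 = worst) for one sample
sample_votes = {"RF": 2, "XGB": 2, "HGB": 3}
print(hard_vote(list(sample_votes.values())))  # 2
```

With three base models and a single dissenting model, the dissenting prediction is simply outvoted, which is the behaviour the hard-voting description relies on.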

3.1. Datasets

3.1.1. Water Quality Parameters

WQI can be determined from a variety of features. pH, DO, ammonia nitrogen, and COD have frequently been chosen by numerous authors, as indicated in the study by Chidiac et al. [11]. We provide a short overview of these parameters below, and the standards for these features are given in Table 2.

pH: The acidity or alkalinity of a solution is quantified by its pH. It quantifies the concentration of hydrogen ions (H+) in a solution.
Dissolved oxygen (DO): This denotes the quantity of oxygen gas that is dissolved in a liquid, most commonly water.
Chemical oxygen demand (COD): The quantity of oxygen necessary to chemically oxidize organic and inorganic compounds in water or other liquid samples is quantified by this measurement.
Ammonia nitrogen (NH3-N): Ammonia is a compound composed of hydrogen and nitrogen. This parameter denotes the nitrogen present in water as ammonia (NH3) and ammonium ions (NH4+).

3.1.2. Data Collection

The model is trained using dataset 1, which is sourced from China. The samples were collected from several large lakes and rivers. Chen et al. [14] also employed this dataset in their research. The dataset has five columns in total: four of them hold the water parameters listed in Table 3, namely potential of hydrogen (pH), dissolved oxygen (DO), chemical oxygen demand (CODMn), and ammonia nitrogen (NH3-N), and the final column, which serves as the target label, represents the water quality. The water quality was classified and predicted using six classes, ranging from excellent to bad: 1, 2, 3, 4, 5, and 6 (worse than 5). The dataset contains a total of 34,607 data points [14].

The BWDB is the source of dataset 2, which is shown in Figure 2, and the data were collected from the southern region divisions of Khulna, Barisal, and Chittagong for the period of 2012-13. A total of 69 groundwater samples (duplicate) were analyzed in the laboratories of BUET, the University of Dhaka, and DPHE to verify and cross-check the quality of the analysis. Two categories of water quality parameters were collected, namely physical parameters and chemical parameters. The physical parameters include temperature, pH, Eh, EC, salinity, TDS, and arsenic. The chemical parameters that were collected include significant anions, cations, and trace elements. The values of the water quality parameters were determined in the field using field test kits, as well as in the laboratories of the BWDB, Dhaka University, BUET, and DPHE.

In this dataset, there are a total of 1045 data points and 28 parameters, but WQI is not mentioned in this dataset.

Table 4 shows the monitoring events and the total number of samples taken in the dry and wet seasons for well nests, line wells (up to 100 m), line wells (up to 30 m), and surface water. Line wells are a series of water wells that are drilled at regular intervals along a straight line to access or monitor a broader area. In contrast, nested wells are multiple wells that are drilled at a single location but at varying depths to target specific aquifer strata. The primary distinction is that line wells extend horizontally throughout a region, whereas nested wells are vertically aligned at a single location.

3.2. The Adopted Ensemble Model (RF, XGB, and HGB) with Hyperparameter Tuning

The model was trained with dataset 1, and we evaluated nine distinct state-of-the-art models in this investigation. To facilitate model training, the data were scaled to the range 0 to 1 using a min–max scaler, given in Equation (1), where X represents an individual column of the dataset and X_min and X_max represent the minimum and maximum values of that column.

X_{scaled} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}

(1)
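As a minimal sketch of Equation (1), the following Python function scales a single column to the range 0 to 1. The function name `min_max_scale`, the sample pH readings, and the handling of a constant column are illustrative assumptions; the source does not show its scaling code.

```python
def min_max_scale(column):
    """Scale one feature column to [0, 1] following Equation (1)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant column: the denominator would be zero, so map every
        # value to 0.0 (a convention chosen for this sketch).
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

ph_readings = [6.5, 7.0, 8.5, 7.5]   # hypothetical pH values
print(min_max_scale(ph_readings))    # [0.0, 0.25, 1.0, 0.5]
```

In practice, X_min and X_max are usually taken from the training split only and then reused for the test data, so that no information from the test set influences the scaling.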

Subsequently, we applied grid search cross-validation to identify the optimal parameters for each model, since tuning is necessary to achieve the best performance. We used 80% of the data for training and 20% for testing. Three of the nine models performed better than the others, so we selected these three (RF, XGB, and HGB) to build the ensemble. To further improve the accuracy, we combined them using the voting ensemble methodology in hard-voting mode, which gave the highest accuracy among all the models. Dataset 2 was employed to evaluate the model’s robustness. Because dataset 2 does not specify the water quality index, we calculated it for 316 data points using the Weighted Arithmetic Index (WAI) method. Figure 3 represents the entire process of developing the ensemble ML model.

3.3. Data Preprocessing

The same four features from dataset 2 were chosen to evaluate the proposed model, as the training dataset contained only four features. After selecting them, we computed WQI for 316 data points. WQI is determined by aggregating the values of numerous water quality parameters into a single value using the WAI method. For WQI, we need the standard values of the parameters, and for this purpose, we followed the Environment Conservation Rules (1997) [33]. The process of WQI using the WAI method typically involves the following steps:

1. Select the water quality parameters for WQI; for instance, the selected parameters in this study are pH, DO, CODMn, and NH3-N.
2. Assign weighting factors to each parameter based on their relative importance.
3. Normalize the values of each parameter according to a scale of 0 to 1.
4. Multiply the normalized values of each parameter by their corresponding weighting factors.
5. Sum the weighted values of all the parameters for WQI.

Here is a brief explanation of how we can use the WAI for WQI.

Calculation of the sub-index of quality rating (q_n): Let there be n water quality parameters, and let q_n denote the quality rating or sub-index corresponding to the nth parameter, that is, a number representing the relative value of this parameter in the polluted water with respect to its permissible standard value. The q_n value may be obtained from the following equation:

$q_{n} = \frac{100\,(V_{n} - V_{i0})}{S_{n} - V_{i0}}$

(2)

Here, q_n = the quality rating for the nth water quality parameter; V_n = the estimated value of the nth parameter of a given sample; S_n = the standard permissible value of the nth parameter; and V_i0 = the ideal value of the nth parameter in pure water. The ideal values V_i0 are taken as zero (0) for drinking water for all parameters other than pH and DO [35].
Calculation of the unit weight (W_n): The unit weight W_n of each water quality parameter is inversely proportional to its recommended standard value (S_n).

$W_{n} = \frac{K}{S_{n}}$

(3)

Here, S_n = the standard value for the nth parameter; K = a constant for proportionality.
Calculation of WQI: Using this formula, we can determine the WQI:

$W Q I = \frac{\sum q_{n} W_{n}}{\sum W_{n}}$

(4)
WQI assessment: There are 5 tiers of WQI. The five categories of water quality according to the mathematical WQI technique are shown in Table 5.
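Equations (2) to (4) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the function name `wqi_weighted_arithmetic` is hypothetical, and the sample values and standard values below are invented for the example and are not the standards of Table 2. Only two parameters with an ideal value of zero are used to keep the arithmetic easy to follow.

```python
def wqi_weighted_arithmetic(values, standards, ideals, k=1.0):
    """Compute WQI with the weighted arithmetic method, Equations (2)-(4).

    values, standards, ideals: dicts keyed by parameter name that hold
    V_n, S_n and V_i0. k is the proportionality constant K; it cancels
    out in Equation (4), so any positive value gives the same WQI.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for name, v_n in values.items():
        s_n = standards[name]
        v_i0 = ideals[name]
        q_n = 100.0 * (v_n - v_i0) / (s_n - v_i0)  # Equation (2)
        w_n = k / s_n                              # Equation (3)
        weighted_sum += q_n * w_n
        weight_total += w_n
    return weighted_sum / weight_total             # Equation (4)

# Hypothetical sample and standards (illustrative numbers only)
sample = {"CODMn": 2.0, "NH3-N": 0.25}
standard = {"CODMn": 4.0, "NH3-N": 0.5}
ideal = {"CODMn": 0.0, "NH3-N": 0.0}
print(wqi_weighted_arithmetic(sample, standard, ideal))  # 50.0
```

Both parameters sit at half of their assumed standard value, so each q_n is 50 and the weighted average is 50 regardless of the weights. Because W_n = K / S_n, parameters with stricter (smaller) standards receive larger weights.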

In this research’s sample data points, only 5 classes are found: 1, 2, 3, 4, and 5.

3.4. Hyperparameter Tuning with GridSearchCV

We used the grid search cross-validation approach to find the optimal parameters for the models. The number of cross-validation folds was set to 10 for this study. The best parameters we found are shown in Table 6. With these parameters, the models perform better than those of Chen et al. [14]. The “C” parameter is the regularization parameter in the SVM model; it balances maximization of the margin against minimization of classification errors. The “Criterion” parameter determines how the quality of a split is measured in the RF and DT models. Each hyperparameter has a substantial impact on the model’s convergence, efficacy, and complexity. Through appropriate tuning of these parameters, generalization to unseen data can be improved.
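The idea behind grid search with cross-validation can be shown with a small, dependency-free Python sketch. It is a simplified stand-in and not the GridSearchCV setup used in the study: the classifier is a toy one-dimensional k-nearest-neighbour model, the data are synthetic, the folds are interleaved without shuffling or stratification, and 5 folds are used instead of 10 because the toy dataset is tiny. All names (`knn_predict`, `cross_val_accuracy`, `grid_search`) are hypothetical.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Classify x by the majority label of its k nearest 1-D training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cross_val_accuracy(data, k, n_folds):
    """Mean accuracy over n_folds interleaved folds (no shuffling)."""
    scores = []
    for fold in range(n_folds):
        test = data[fold::n_folds]
        train = [p for i, p in enumerate(data) if i % n_folds != fold]
        hits = sum(knn_predict(train, x, k) == y for x, y in test)
        scores.append(hits / len(test))
    return sum(scores) / n_folds

def grid_search(data, k_values, n_folds):
    """Score every candidate value and return the best one with all scores."""
    scores = {k: cross_val_accuracy(data, k, n_folds) for k in k_values}
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Synthetic (feature, class) pairs: low values are class 1, high values class 2
toy_data = [(i / 20, 1 if i < 10 else 2) for i in range(20)]
best_k, all_scores = grid_search(toy_data, k_values=[1, 3, 5], n_folds=5)
print(best_k, all_scores)
```

Each candidate value is scored only on folds that were held out during fitting, so the selected hyperparameter reflects performance on unseen data rather than on the training set.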

3.5. Environment Setup

The models are built with the Keras library and executed on a machine with an AMD Ryzen 7 processor and 32 GB of RAM. The proposed system employs Python for all computational tasks. The models are trained using 80% of the data from dataset 1, while 20% of the data from both datasets are used for testing purposes. Accuracy, precision, recall, specificity, the F1 measure, and a confusion matrix are employed to evaluate our method. The model’s performance is summarized by the true negative (TN), true positive (TP), false negative (FN), and false positive (FP) values in the confusion matrix. In this investigation, the following performance metrics were used:

Accuracy:

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

(5)

Precision:

\text{Precision} = \frac{TP}{TP + FP}

(6)

Sensitivity:

\text{Sensitivity} = \frac{TP}{FN + TP}

(7)

Specificity:

\text{Specificity} = \frac{TN}{TN + FP}

(8)

F1 measure:

F_{1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

(9)

Accuracy measures the overall proportion of samples that the classifier assigns to the correct class, whereas precision quantifies the proportion of samples predicted as positive that are truly positive. Sensitivity (also called recall) evaluates a model’s ability to detect positive samples, while specificity evaluates its ability to identify negative ones. The F1 score is the harmonic mean of precision and sensitivity and therefore takes both false positives and false negatives into account.
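Equations (5) to (9) can be evaluated directly from the four confusion-matrix counts, as in the following minimal sketch. The function name `classification_metrics` and the counts are hypothetical; for a multi-class problem such as this one, the counts would typically be obtained per class in a one-vs-rest fashion, which is an assumption here because the source does not state how the per-class values were averaged.

```python
def classification_metrics(tp, tn, fp, fn):
    """Evaluate Equations (5)-(9) from confusion-matrix counts.

    No guard against zero denominators is included in this sketch.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    sensitivity = tp / (fn + tp)      # also called recall
    specificity = tn / (tn + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }

# Hypothetical counts for one class treated as the positive class
metrics = classification_metrics(tp=90, tn=95, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For these counts, the accuracy is 185/200 = 0.925, the sensitivity is 0.900, the specificity is 0.950, and the F1 score equals 180/195, or about 0.923.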

4. Results and Analysis

This section illustrates the detailed results of our experiment and also compares our evaluation to other findings. The initial section explains the instruments, setup, and evaluation methodologies employed for model testing. This is followed by a concise description, comparison, and presentation of the results of this study.

4.1. Model Comparisons with Related Works

The accuracy of the models is shown in Figure 4. Chen et al. did not employ the XGB and HGB models in their investigation. The accuracy of the RF, XGB, and HGB models is higher than that of the other individual models, with values of 99.76%, 99.75%, and 99.69%, respectively. When these three models are combined using the voting ensemble approach, the accuracy rises to 99.80%, surpassing that of all other models, as evidenced by Figure 4. Table 7 displays the classification reports of all the models, demonstrating that the performance of all the models was enhanced by our approach. The LR model shows the greatest improvement, with its precision increasing from 0.63 to 0.70 when trained using our approach. The precision, recall, and F1 score of the other models also showed slight improvements.

The ensemble model achieves the best overall results, with a precision of 1.00, a recall of 0.99, and an F1 score of 0.99. Conversely, the LR and GNB models exhibit the lowest accuracy in both experiments.
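As a minimal sketch of how hard voting combines three classifiers, the snippet below takes the per-sample class labels predicted by each model and returns the majority label. The prediction lists are hypothetical, and the tie-breaking rule (the smallest class label wins) is an assumption of this sketch rather than something stated in the study.

```python
from collections import Counter


def hard_vote(*model_predictions):
    """Combine per-model label lists by majority vote, sample by sample."""
    if not model_predictions:
        raise ValueError("at least one list of predictions is required")
    n_samples = len(model_predictions[0])
    if any(len(preds) != n_samples for preds in model_predictions):
        raise ValueError("all models must predict the same number of samples")

    combined = []
    for votes in zip(*model_predictions):
        counts = Counter(votes)
        top = max(counts.values())
        # Assumption: ties are resolved in favour of the smallest class label.
        combined.append(min(label for label, c in counts.items() if c == top))
    return combined


# Hypothetical predictions from three models for four water samples.
rf_pred = [1, 2, 3, 4]
xgb_pred = [1, 2, 2, 5]
hgb_pred = [1, 3, 2, 6]
ensemble_pred = hard_vote(rf_pred, xgb_pred, hgb_pred)
```

With three voters, a sample is decided by any two models that agree, so a single model's error is corrected whenever the other two are right.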

4.2. Performance Analysis of the Proposed Model for Dataset 1

Figure 5 illustrates the confusion matrix and the ROC AUC for the ensemble model. The ensemble model misclassified the most instances for class 2, with eight instances misclassified. Conversely, classes 1, 3, 5, and 6 were the least misclassified. The model was capable of accurately classifying 627 instances from class 1, 2617 instances from class 2, 1802 instances from class 3, 1141 instances from class 4, 288 instances from class 5, and 410 instances from class 6.

The ROC curves confirm that the model performs exceptionally well: the curves for all classes lie very close to the ideal, and the AUC score for each class is 1.
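Per-class counts such as those quoted above can be read off a multi-class confusion matrix by treating each class in turn as "positive" and all others as "negative". The sketch below assumes rows are true classes and columns are predicted classes; the 3 × 3 matrix is a made-up example, not the matrix in Figure 5.

```python
def one_vs_rest_counts(cm, k):
    """Return (TP, FP, TN, FN) for class index k of a square confusion matrix.

    Assumption: cm[i][j] counts samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                  # true class k, predicted as something else
    fp = sum(row[k] for row in cm) - tp   # other classes predicted as k
    tn = total - tp - fn - fp
    return tp, fp, tn, fn


def overall_accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


# Hypothetical three-class confusion matrix, for illustration only.
toy_cm = [
    [50, 2, 0],
    [1, 45, 4],
    [0, 3, 47],
]
counts_class_1 = one_vs_rest_counts(toy_cm, 1)
accuracy = overall_accuracy(toy_cm)
```

The four counts for each class can then be substituted directly into Equations (6)–(9).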

4.3. Testing the Robustness of the Proposed Model with Dataset 2

In this section, we tested the model performance with Bangladesh’s unseen data. The model performed extremely well even though it was not trained with data on Bangladeshi water bodies. The ensemble model had an accuracy of 99.72% with a precision of 0.9, recall of 0.98, and an F1 score of 0.99. The confusion matrix and the ROC AUC are shown in Figure 6. Every instance from classes 4 and 5 was accurately classified, as we can see from the confusion matrix. Only one instance was misclassified for both classes 1 and 3. On the other hand, for class 2, four instances were misclassified.

For all classes, the ROC is close to 1, and the AUC score is 1 for classes 4 and 5. Classes 1 and 3 have the same AUC score of 0.99, and class 2 has the lowest AUC score of 0.98.
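For readers unfamiliar with per-class AUC, the following sketch computes a one-vs-rest AUC as the probability that a randomly chosen sample of the class receives a higher score than a randomly chosen sample of any other class (ties count as one half). The function names, labels, and scores are illustrative assumptions; the study does not state how its AUC values were computed.

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    if not pos_scores or not neg_scores:
        raise ValueError("both groups must contain at least one score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(pos_scores) * len(neg_scores))


def one_vs_rest_auc(labels, class_scores, k):
    """AUC for class k, given each sample's predicted score for class k."""
    pos = [s for label, s in zip(labels, class_scores) if label == k]
    neg = [s for label, s in zip(labels, class_scores) if label != k]
    return auc_from_scores(pos, neg)


# Hypothetical true labels and predicted scores for class 2.
toy_labels = [2, 2, 2, 1, 3, 1]
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
toy_auc = one_vs_rest_auc(toy_labels, toy_scores, k=2)
```

An AUC of 1 therefore means that every sample of the class is scored above every sample of the other classes.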

4.4. Grafana Dashboard Implementation with InfluxDB

We selected InfluxDB, running on an AWS cloud server, to store the water parameter data. A bucket must first be created to serve as the repository for the sensors’ data; this is straightforward and can be done through either the user interface or the command shell. Connecting Grafana to InfluxDB requires only a few simple steps. Grafana can be installed on a local computer and connected to the AWS server on which InfluxDB is running. After installation, the user logs in with their credentials and navigates to the settings. The next step is to select InfluxDB as the data source. The user then specifies the URL that serves as the connection to the cloud and configures the InfluxDB data source: the database name is set to the name of the bucket created in InfluxDB, and the InfluxDB authentication details are provided. The connection is then fully configured. Subsequently, a new dashboard can be created that presents the data and classifications in a more user-friendly manner. The Grafana dashboard is illustrated in Figure 7.
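To make the stored data concrete, the sketch below formats one sensor reading as a record in InfluxDB line protocol (`measurement,tags fields timestamp`), the text format InfluxDB accepts for writes. The measurement name `water_quality`, the `station` tag, the field names, and the timestamp are hypothetical; the study does not specify its schema.

```python
def _escape(text, chars):
    """Backslash-escape the given characters, as line protocol requires."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one reading as an InfluxDB line protocol record with float fields."""
    tag_part = "".join(
        f",{_escape(key, ', =')}={_escape(str(value), ', =')}"
        for key, value in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape(key, ', =')}={float(value)}" for key, value in fields.items()
    )
    return f"{_escape(measurement, ', ')}{tag_part} {field_part} {timestamp_ns}"


# Hypothetical reading; the parameter values mirror the first row of Table 3.
line = to_line_protocol(
    measurement="water_quality",
    tags={"station": "demo-01"},
    fields={"ph": 7.09, "do": 10, "cod": 5.7, "nh3_n": 0.33},
    timestamp_ns=1700000000000000000,
)
```

A record of this form can be written to the bucket, after which Grafana queries the same measurement and fields for its panels.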

The proposed system is designed to classify water samples into multiple classes. In this study, dataset 1 contains six distinct classes, whereas dataset 2 contains five. Based on parameter values such as pH, DO, COD, and NH3-N, the proposed model determines the class number, which represents the quality of the water sample. The standard values for the water parameters in this study follow the Environment Conservation Rules (1997) [33]. The Grafana module was employed to present the results in a clear and user-friendly manner. The Grafana monitoring system uses a color scheme to distinguish between standard and non-standard parameter values: green represents standard values, while other colors, such as blue and red, indicate non-standard values. For instance, when all the monitored parameter values fall within the accepted range, the Grafana dashboard shows the water sample as excellent-quality, as illustrated in Figure 7a. Similarly, when all the monitored parameter values are too high and fall outside the accepted range, the Grafana dashboard marks those values in red; the proposed model classifies such a sample as class 5, and the dashboard shows the predicted result as bad-quality, as illustrated in Figure 7b.
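The color logic described above can be sketched as a simple range check. The limits are taken from the Bangladesh column of Table 2; treating the DO limit as a minimum, the COD and NH3-N limits as maxima, and blue as the color for values below the range are assumptions of this sketch, since the text only states that red marks values that are too high.

```python
# (lower limit, upper limit); None means the side is unbounded in this sketch.
LIMITS = {
    "pH": (6.5, 8.5),
    "DO": (5.0, None),      # assumption: 5 mg/L is a minimum
    "COD": (None, 10.0),    # assumption: 10 mg/L is a maximum
    "NH3-N": (None, 0.5),   # assumption: 0.5 mg/L is a maximum
}


def status_color(parameter, value):
    """Return 'green' inside the accepted range, 'red' above it, 'blue' below it."""
    low, high = LIMITS[parameter]
    if low is not None and value < low:
        return "blue"   # assumption: blue marks values below the range
    if high is not None and value > high:
        return "red"    # the text states that too-high values are shown in red
    return "green"


# Hypothetical reading, for illustration only.
reading = {"pH": 7.09, "DO": 10.0, "COD": 5.7, "NH3-N": 1.82}
colors = {name: status_color(name, value) for name, value in reading.items()}
```

In Grafana itself, an equivalent effect would typically be configured through panel thresholds rather than application code.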

5. Discussion

This study employed a wireless IoT framework in which data are collected automatically by sensors and transmitted to the AWS cloud. The AWS cloud infrastructure is managed with Terraform, while Telegraf and Grafana are used to preprocess and visualize the data. An ensemble ML model is also proposed for predicting the water quality from those data; it likewise runs in the AWS cloud and classifies the data in real time. Several authors have previously suggested IoT-based frameworks for water quality prediction. Chaudhari [37] proposed an Arduino-based water quality prediction system, but this system was not fully wireless and had no visualization system. Rahu et al. [9] proposed an IoT-based cloud framework for water quality prediction, but it offered no visualization method for the predictions or for the parameters. This study also employed two multi-regional datasets to develop a precise ensemble learning algorithm. Before proposing the ensemble algorithm, eight standalone learning algorithms (LR, SVM, DT, KNN, RF, GNB, HGB, and XGB) were evaluated separately on dataset 1 after their parameters had been tuned using the grid search cross-validation (GridSearchCV) technique. The predictive strength of these individual algorithms has previously been examined by researchers [14,38] for indexing water quality in various datasets. The outcomes of those studies [14,38], particularly the study of Chen et al. [14], illustrate that the standalone algorithms perform only moderately in predicting the correct class when their parameters are not well tuned.

In the current study, the GridSearchCV approach improved the performance of all of the standalone algorithms shown in Figure 4; the improvement was most significant for the LR and SVM algorithms. In a similar study [14], the authors reported that the LR and SVM algorithms achieved accuracies of 64.73% and 93.68%, respectively. In contrast, this study demonstrated improved accuracies of 70.15% for LR and 96.64% for the SVM. Combining the top-performing algorithms can improve the classification accuracy further. These findings also indicate that after finely tuning and merging the RF, XGB, and HGB algorithms, the ensemble algorithm outperformed the standalone algorithms, with an accuracy of 99.8%. Considering that 75% of water comes from groundwater [26] and the fact that a massive quantity of data is needed for WQI, this study included groundwater data samples collected from nested and line wells in Bangladesh to validate the proposed algorithm.
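The tuning step can be illustrated with a pure-Python sketch of exhaustive grid search with k-fold cross-validation. It is a simplified stand-in for scikit-learn's GridSearchCV, which the study used. The toy one-dimensional data, the midpoint-threshold classifier, and its `offset` hyperparameter are hypothetical and serve only to show the mechanics.

```python
from itertools import product
from statistics import mean


def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    start = 0
    for fold in range(k):
        size = n_samples // k + (1 if fold < n_samples % k else 0)
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, val
        start += size


def grid_search_cv(param_grid, fit_and_score, n_samples, k=5):
    """Score every parameter combination by mean k-fold accuracy; return the best."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = mean(
            fit_and_score(params, train, val)
            for train, val in k_fold_splits(n_samples, k)
        )
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy data (hypothetical): one feature, two classes.
X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]


def fit_and_score(params, train, val):
    """Fit a midpoint-threshold classifier on train; return accuracy on val."""
    mean0 = mean(X[i] for i in train if y[i] == 0)
    mean1 = mean(X[i] for i in train if y[i] == 1)
    threshold = (mean0 + mean1) / 2 + params["offset"]
    correct = sum((1 if X[i] > threshold else 0) == y[i] for i in val)
    return correct / len(val)


best_params, best_score = grid_search_cv(
    {"offset": [-3.0, 0.0, 3.0]}, fit_and_score, n_samples=len(X)
)
```

Each candidate is scored only on folds held out from fitting, which is what allows the search to compare settings without consulting the final test data.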

The applicability of the proposed technique extends beyond water quality classification, demonstrating potential utility in water resource management and environmental damage reduction. The proposed framework can be useful in different ways, which are outlined below:

Smart agriculture: In agricultural settings, IoT-based water quality monitoring can help with efficient water usage, ensuring that crops receive clean water and reducing the risk of soil and crop contamination.
Data-driven policy making: Comprehensive data analysis can assist decision-makers in improving rules and regulations based on accurate and real-time data, ensuring better management of water resources.
Scientific research and innovation: Comprehensive data collection can support scientific research, bringing innovative ideas and perspectives to water treatment technologies, pollution control methods, and environmental sustainability practices.

6. Conclusions and Future Works

Accurate classification in WQI is crucial for ensuring the availability of potable water and monitoring pollution. Consequently, effective planning and management of water resources benefit significantly from precise predictions. This study aimed to develop an IoT-based framework within which a new model would be capable of accurately predicting the water quality in groundwater, rivers, and lakes by learning from WQI evaluations in regions of China. We also tested the model accuracy using unseen data from Bangladesh, on which the model performed very well. Some traditional systems are available for assessing water potability for drinking and other household activities. While these approaches are trusted by local communities, they tend to be expensive and time-consuming. With the proposed IoT-based wireless framework, people and organizations can easily assess the quality of the water. They can monitor not only the water parameters but also the status of the sensors and whether they need maintenance. The proposed ensemble model achieved high accuracy on both datasets, which supports its reliability. Furthermore, the visualization can be accessed from anywhere in the world, which is convenient for users who cannot be physically present at the monitoring site. Storage and connectivity are not expected to be a concern, as both the AWS cloud service and the InfluxDB database are scalable. Additionally, the Grafana interface provides effective visualization and surveillance of the data, which is beneficial for comprehending the parameters and the classifications. The proposed wireless IoT framework is a monolithic service; however, it can be adapted to a microservice system if an organization wishes to expand its monitoring of water quality. In the future, we intend to incorporate additional sensors and visualization of the water flow from one location to another, as well as the water’s velocity.

Author Contributions

Conceptualization: H.R.R. and M.S.B.S.; supervision: M.A.U.; methodology: M.S.B.S. and H.R.R.; data curation: M.S.B.S. and A.R.; software: H.R.R. and M.S.B.S.; formal analysis: M.M.I., M.A.U. and M.Z.M.; writing—original draft: M.A.U., M.S.B.S., H.R.R., M.K.H.S. and A.R.; investigation: M.M.I., M.Z.M. and M.K.H.S.; writing—review and editing: M.M.I.; project administration: M.A.U. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data will be made available on request.

Acknowledgments

This research was conducted in the Emerging Data Science Lab, the Department of Computer Science and Engineering, Jagannath University, Dhaka, Bangladesh.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Fredriksson, I. Just Ordinary Water-A Necessity for All Forms of Life. Univers. J. Psychol. 2016, 4, 178–183.
2. Huang, L.Y. Not just another drop in the human rights bucket: The legal significance of a codified human right to water. Fla. J. Int’l L. 2008, 20, 353.
3. Boyd, C.E. Water Quality: An Introduction; Springer Nature: Berlin, Germany, 2019.
4. Mirani, A.A.; Memon, M.S.; Rahu, M.A.; Bhatti, M.N.; Shaikh, U.R. A review of agro-industry in IoT: Applications and challenges. Quaid-E-Awam Univ. Res. J. Eng. Sci. Technol. Nawabshah 2019, 17, 28–33.
5. Lashari, M.H.; Karim, S.; Alhussein, M.; Hoshu, A.A.; Aurangzeb, K.; Anwar, M.S. Internet of Things-based sustainable environment management for large indoor facilities. PeerJ Comput. Sci. 2023, 9, e1623.
6. Lashari, M.H.; Memon, A.A.; Shah, S.A.A.; Nenwani, K.; Shafqat, F. IoT based poultry environment monitoring system. In Proceedings of the 2018 IEEE International Conference on Internet of Things and Intelligence System (IOTAIS), Bali, Indonesia, 1–3 November 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–5.
7. Jha, B.K.; Sivasankari, G.; Venugopal, K. Cloud-based smart water quality monitoring system using IoT sensors and machine learning. Int. J. Adv. Trends Comput. Sci. Eng. 2020, 9, 3403–3409.
8. Garrido-Momparler, V.; Peris, M. Smart sensors in environmental/water quality monitoring using IoT and cloud services. Trends Environ. Anal. Chem. 2022, 35, e00173.
9. Rahu, M.A.; Chandio, A.F.; Aurangzeb, K.; Karim, S.; Alhussein, M.; Anwar, M.S. Towards design of Internet of Things and machine learning-enabled frameworks for analysis and prediction of water quality. IEEE Access 2023, 11, 101055–101086.
10. Uddin, M.G.; Nash, S.; Rahman, A.; Olbert, A.I. Performance analysis of the water quality index model for predicting water state using machine learning techniques. Process Saf. Environ. Prot. 2023, 169, 808–828.
11. Chidiac, S.; El Najjar, P.; Ouaini, N.; El Rayess, Y.; El Azzi, D. A comprehensive review of water quality indices (WQIs): History, models, attempts and perspectives. Rev. Environ. Sci. Bio/Technol. 2023, 22, 349–395.
12. World Health Organization. Available online: https://www.who.int/ (accessed on 30 September 2022).
13. Haghiabi, A.H.; Nasrolahi, A.H.; Parsaie, A. Water quality prediction using machine learning methods. Water Qual. Res. J. 2018, 53, 3–13.
14. Chen, K.; Chen, H.; Zhou, C.; Huang, Y.; Qi, X.; Shen, R.; Liu, F.; Zuo, M.; Zou, X.; Wang, J.; et al. Comparative analysis of surface water quality prediction performance and identification of key water parameters using different machine learning models based on big data. Water Res. 2020, 171, 115454.
15. Ahmed, A.N.; Othman, F.B.; Afan, H.A.; Ibrahim, R.K.; Fai, C.M.; Hossain, M.S.; Ehteram, M.; Elshafie, A. Machine learning methods for better water quality prediction. J. Hydrol. 2019, 578, 124084.
16. Al-Sulttani, A.O.; Al-Mukhtar, M.; Roomi, A.B.; Farooque, A.A.; Khedher, K.M.; Yaseen, Z.M. Proposition of new ensemble data-intelligence models for surface water quality prediction. IEEE Access 2021, 9, 108527–108541.
17. Bouslah, S.; Djemili, L.; Houichi, L. Water quality index assessment of Koudiat Medouar Reservoir, northeast Algeria using weighted arithmetic index method. J. Water Land Dev. 2017, 35, 221.
18. AlMetwally, S.A.H.; Hassan, M.K.; Mourad, M.H. Real time internet of things (IoT) based water quality management system. Procedia CIRP 2020, 91, 478–485.
19. Chowdury, M.S.U.; Emran, T.B.; Ghosh, S.; Pathak, A.; Alam, M.M.; Absar, N.; Andersson, K.; Hossain, M.S. IoT based real-time river water quality monitoring system. Procedia Comput. Sci. 2019, 155, 161–168.
20. Wiryasaputra, R.; Huang, C.Y.; Lin, Y.J.; Yang, C.T. An IoT Real-Time Potable Water Quality Monitoring and Prediction Model Based on Cloud Computing Architecture. Sensors 2024, 24, 1180.
21. Ahmad, Z.; Rahim, N.; Bahadori, A.; Zhang, J. Improving water quality index prediction in Perak River basin Malaysia through a combination of multiple neural networks. Int. J. River Basin Manag. 2017, 15, 79–87.
22. Ahmed, U.; Mumtaz, R.; Anwar, H.; Shah, A.A.; Irfan, R.; García-Nieto, J. Efficient water quality prediction using supervised machine learning. Water 2019, 11, 2210.
23. Wang, S.; Peng, H.; Liang, S. Prediction of estuarine water quality using interpretable machine learning approach. J. Hydrol. 2022, 605, 127320.
24. Jin, T.; Cai, S.; Jiang, D.; Liu, J. A data-driven model for real-time water quality prediction and early warning by an integration method. Environ. Sci. Pollut. Res. 2019, 26, 30374–30385.
25. Khan, M.S.I.; Islam, N.; Uddin, J.; Islam, S.; Nasir, M.K. Water quality prediction and classification based on principal component regression and gradient boosting classifier approach. J. King Saud Univ. Comput. Inf. Sci. 2022, 34, 4773–4781.
26. Leong, W.C.; Bahadori, A.; Zhang, J.; Ahmad, Z. Prediction of water quality index (WQI) using support vector machine (SVM) and least square-support vector machine (LS-SVM). Int. J. River Basin Manag. 2021, 19, 149–156.
27. Kulisz, M.; Kujawska, J.; Przysucha, B.; Cel, W. Forecasting water quality index in groundwater using artificial neural network. Energies 2021, 14, 5875.
28. Abera, K.A.; Gebreyohannes, T.; Abrha, B.; Hagos, M.; Berhane, G.; Hussien, A.; Belay, A.S.; Van Camp, M.; Walraevens, K. Vulnerability mapping of groundwater resources of Mekelle City and surroundings, Tigray Region, Ethiopia. Water 2022, 14, 2577.
29. Giao, N.T.; Dan, T.H.; Ni, D.V.; Anh, P.K.; Nhien, H.T.H. Spatiotemporal variations in physicochemical and biological properties of surface water using statistical analyses in Vinh Long Province, Vietnam. Water 2022, 14, 2200.
30. Abuzaid, A.S.; Jahin, H.S. Combinations of multivariate statistical analysis and analytical hierarchical process for indexing surface water quality under arid conditions. J. Contam. Hydrol. 2022, 248, 104005.
31. Braga, F.H.R.; Dutra, M.L.S.; Lima, N.S.; da Silva, G.M.; de Cássia Mendonça de Miranda, R.; da Cunha Araújo Firmo, W.; de Moura, A.R.L.; de Souza Monteiro, A.; da Silva, L.C.N.; da Silva, D.F.; et al. Study of the influence of physicochemical parameters on the water quality index (WQI) in the maranhão amazon, Brazil. Water 2022, 14, 1546.
32. Rahman, M.M.; Haque, T.; Mahmud, A.; Al Amin, M.; Hossain, M.S.; Hasan, M.Y.; Shaibur, M.R.; Hossain, S.; Hossain, M.A.; Bai, L. Drinking water quality assessment based on index values incorporating WHO guidelines and Bangladesh standards. Phys. Chem. Earth Parts A/B/C 2023, 129, 103353.
33. The Environment Conservation Rules. 1997. Available online: https://faolex.fao.org/docs/pdf/bgd19918.pdf (accessed on 30 July 2024).
34. Bangladesh Water Development Board. Available online: https://bwdb.portal.gov.bd/ (accessed on 30 May 2022).
35. Voudouris, K.; Sotiriadis, M.; Pavlou, A.; Hatziliontos, C. Groundwater quality in the coastal aquifer system of eastern Thermaikos Gulf, North Greece. J. Environ. Prot. Ecol. 2006, 7, 269–279.
36. Tyagi, S.; Sharma, B.; Singh, P.; Dobhal, R. Water quality assessment in terms of water quality index. Am. J. Water Resour. 2013, 1, 34–38.
37. Chaudhari, K.G. Water quality monitoring system using internet of things and swqm framework. Int. J. Innov. Res. Comput. Commun. Eng. 2019, 7, 3898–3903.
38. Aslam, B.; Maqsoom, A.; Cheema, A.H.; Ullah, F.; Alharbi, A.; Imran, M. Water quality management using hybrid machine learning and data mining algorithms: An indexing approach. IEEE Access 2022, 10, 119692–119705.

Figure 1. Autonomous IoT and cloud-based framework for water quality prediction and visualization.

Figure 2. Location of collected samples in dataset 2 (Image source is the BWDB [34]).

Figure 3. Proposed ensemble-based ML with hyperparameter optimization.

Figure 4. Comparison of accuracy of models from this study and a related study for dataset 1 [14].

Figure 5. Performance analysis of the proposed model using the confusion matrix, ROC, and AUC on dataset 1.

Figure 6. Robustness testing of proposed model using confusion matrix, ROC, and AUC on dataset 2.

Figure 7. Grafana dashboard for monitoring and visualization of water quality.

Table 2. Water quality standards.

Parameters	Units	Bangladesh Standards’ Drinking Limit [32]	WHO Drinking Limit [33]
pH	–	6.5–8.5	6.5–8.5
DO	mg/L	5	6
COD	mg/L	10	4
NH3-N	mg/L	0.5	0.5

Table 3. Features of dataset 1.

pH	DO (mg/L)	CODMn (mg/L)	NH3-N (mg/L)	Result
7.09	10	5.7	0.33	3
6.8	11.6	6.3	0.59	4
9.41	7.94	10.8	0.1	6
7.65	7.21	4	0.44	2
8.57	10.9	0	0.15	1
7.85	7.57	9.2	1.82	5

Table 4. Number of water samples analyzed.

Monitoring Event	Type of Water	Wet Season	Dry Season
Well Nests	Groundwater	151	133
Line Wells (Up to 100 m)	Groundwater	430	190
Line Wells (Up to 30 m)	Groundwater	-	53
Surface Water	Surface water	95	44
Total	-	676	420

Table 5. WQI ranges [36].

WQI (Range)	Water Quality Status	Possible Usage
0–25	Excellent	Mainly drinking
26–50	Good	Drinking, irrigation, and industrial
51–75	Poor	Irrigation and industrial
76–100	Very poor	Only irrigation
Greater than 100	Unsuitable for drinking or fish culture	Proper treatment needed before use

Table 6. Optimal hyperparameters of the models employed in this study.

Learning Models	Hyperparameters
LR	C = 100; Max iteration = 300; Intercept scaling = 10
SVM	C = 100; Kernel = rbf; Cache size = 200
DT	Criterion = entropy; Max depth = 30; Min sample leaf = 6; Min sample split = 2
KNN	Leaf size = 1; Neighbors = 5; Penalty = 1
RF	Criterion = entropy; Max depth = 500; Bootstrap = True
HGB	warm_start = True; max_iter = 150; max_leaf_nodes = 42
XGB	Objective = multi:softmax; Colsample by tree = 0.75; Max depth = 6; Min child weight = 5; N_estimators = 100; Booster = dart; Sampling method = gradient-based
Ensemble Model	Estimators = (RF, XGB, HGB); voting = hard

Table 7. Classification report for dataset 1 on different algorithms.

Method	Model	Precision	Recall	F1 Score
Chen et al. [14]	LR	0.63	0.65	0.59
	KNN	0.92	0.929	0.924
	RF	0.99	0.99	0.99
	GNB	0.74	0.73	0.74
	SVM	0.94	0.94	0.94
	DT	0.98	0.99	0.99
	XGB	0.98	0.983	0.987
This study	LR	0.70	0.69	0.701
	KNN	0.92	0.927	0.93
	RF	0.99	0.99	0.99
	GNB	0.74	0.73	0.74
	SVM	0.97	0.967	0.96
	DT	0.99	0.99	0.998
	XGB	0.99	0.99	0.99
	HGB	0.99	0.99	0.99
	Ensemble Model	1.00	0.99	0.99

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
