Water Quality Evaluation and Analysis by Integrating Statistical and Machine Learning Approaches
Abstract
1. Introduction
- Develop and evaluate a hybrid machine learning framework that integrates statistical diagnostics for enhanced WQI prediction and classification.
- Identify and quantify the most influential water quality parameters affecting WQI using feature importance analysis.
- Assess model reliability and generalization through residual analysis, diagnostic testing, and learning curve analysis.
- Reduce data dimensionality while preserving variance using PCA for computational efficiency and pattern discovery.
2. Related Works
2.1. Residual Analysis
2.2. Diagnostic and Assumption Tests
2.3. Feature Importance
2.4. Learning Curve Analysis
2.5. Principal Component Analysis (PCA)
3. Methods
3.1. Sources of Water Quality Data
3.2. Statistical Analysis
3.2.1. Feature Importance Analysis Method
3.2.2. Assumption and Diagnostic Tests
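As an illustration of the heteroscedasticity check named in Section 4.1.1, the following is a minimal sketch of a Breusch–Pagan-type statistic. It assumes the Koenker (studentized) variant, in which the statistic is n times the R² of an auxiliary regression of squared residuals on the predictors; the article does not state which variant was used. The data are synthetic, the helper names `ols_residuals` and `breusch_pagan_lm` are hypothetical, and the Shapiro–Wilk test is not covered here.

```python
import numpy as np

def ols_residuals(X, y):
    """Residuals from an ordinary least-squares fit with an intercept."""
    Xb = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return y - Xb @ coef

def breusch_pagan_lm(X, residuals):
    """Koenker-style statistic: n * R^2 of squared residuals regressed on X."""
    n = len(residuals)
    Xb = np.column_stack([np.ones(n), X])
    e2 = residuals ** 2
    coef, *_ = np.linalg.lstsq(Xb, e2, rcond=None)
    fitted = Xb @ coef
    ss_res = np.sum((e2 - fitted) ** 2)
    ss_tot = np.sum((e2 - e2.mean()) ** 2)
    return n * (1.0 - ss_res / ss_tot)

# Synthetic example (assumption): one predictor, two noise regimes.
rng = np.random.default_rng(42)
x = rng.uniform(1.0, 10.0, size=500)
y_hetero = 2.0 * x + rng.normal(size=500) * x   # noise spread grows with x
y_homo = 2.0 * x + rng.normal(size=500)         # constant noise spread

# 5% critical value of the chi-square distribution with 1 degree of freedom.
CHI2_CRIT_DF1 = 3.841

lm_hetero = breusch_pagan_lm(x, ols_residuals(x, y_hetero))
lm_homo = breusch_pagan_lm(x, ols_residuals(x, y_homo))
print(f"heteroscedastic data: LM = {lm_hetero:.2f}")
print(f"homoscedastic data:   LM = {lm_homo:.2f}")
```

A statistic above the critical value points to non-constant residual variance. With more than one predictor, the degrees of freedom equal the number of predictors, so the critical value changes accordingly.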
3.2.3. Analysis of Learning Curve
- Step 1: Incremental Training
- Step 2: Error Metrics
- For regression, training error was calculated as the mean squared error (MSE) on the training subset.
- Validation error was calculated with the same metric on the validation set.
- For classification, training error was the proportion of misclassified samples in the training subset.
- Validation error was the proportion of misclassified samples in the validation set, i.e., 1 minus the accuracy score.
- Step 3: Analysis of Training and Validation Curves
- Training Curve: Shows how the error decreases as the model receives more data to learn from. An initial rapid decline in error suggests effective learning, while a flattening indicates saturation of model performance [44].
- Validation Curve: Demonstrates the model’s generalization ability. A validation error that is high relative to the training error suggests overfitting, while consistently high errors in both curves indicate underfitting [44].
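The three steps above can be sketched as follows for the regression case. This is a minimal illustration rather than the article's implementation: the model is a plain least-squares linear regression standing in for whichever learner is used, the data are synthetic, and the function name `learning_curve_mse` and the chosen training sizes are hypothetical.

```python
import numpy as np

def learning_curve_mse(X, y, X_val, y_val, train_sizes):
    """Return (train_errors, val_errors) as MSE for each training-subset size."""
    train_errors, val_errors = [], []
    # Intercept column so the least-squares fit includes a bias term.
    Xb_val = np.column_stack([np.ones(len(X_val)), X_val])
    for n in train_sizes:
        Xb = np.column_stack([np.ones(n), X[:n]])
        # Step 1: fit on the first n training samples only.
        coef, *_ = np.linalg.lstsq(Xb, y[:n], rcond=None)
        # Step 2: MSE on the training subset and on the fixed validation set.
        train_errors.append(float(np.mean((Xb @ coef - y[:n]) ** 2)))
        val_errors.append(float(np.mean((Xb_val @ coef - y_val) ** 2)))
    return train_errors, val_errors

# Synthetic data (assumption): six predictors, linear signal, unit-variance noise.
rng = np.random.default_rng(0)
X_all = rng.normal(size=(600, 6))
true_w = np.array([3.0, -2.0, 1.0, 0.5, 0.0, 0.0])
y_all = X_all @ true_w + rng.normal(scale=1.0, size=600)

X_train, y_train = X_all[:500], y_all[:500]
X_val, y_val = X_all[500:], y_all[500:]
sizes = [10, 25, 50, 100, 250, 500]

train_err, val_err = learning_curve_mse(X_train, y_train, X_val, y_val, sizes)

# Step 3: inspect both curves side by side (plotting is left out here).
for n, tr, va in zip(sizes, train_err, val_err):
    print(f"n={n:4d}  train MSE={tr:.3f}  val MSE={va:.3f}")
```

For a classifier, the same loop applies with the error replaced by the misclassification rate, that is, 1 minus the accuracy score on each set.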
3.2.4. Principal Component Analysis (PCA)
- Step 1: Data Standardization
- Step 2: Covariance Matrix Calculation
- Step 3: Eigenvalue Decomposition
- Step 4: Cumulative Variance and Scree Plot
- Step 5: Heatmap of Correlation Coefficients
- Step 6: Projection of Data onto Principal Components
- Step 7: Visualization of Principal Components
- PC1 vs. PC2 scatterplot: Showed data distribution in a reduced two-dimensional space.
- Scatterplot matrix (PC1–PC4): Examined higher-dimensional relationships, uncovering latent structures.
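Steps 1 to 4 and Step 6 above can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions: the input is a synthetic 200 × 6 matrix rather than the article's dataset, the function name `pca` is hypothetical, and the plotting steps (scree plot, heatmap, scatterplots) are omitted.

```python
import numpy as np

def pca(X, n_components):
    """PCA via eigen-decomposition of the covariance of standardized data."""
    # Step 1: standardize each column to zero mean and unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Step 2: covariance matrix of the standardized data
    # (this equals the correlation matrix of the raw data).
    C = np.cov(Z, rowvar=False)
    # Step 3: eigen-decomposition; eigh handles symmetric matrices and
    # returns eigenvalues in ascending order, so reverse them.
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Step 4: explained and cumulative variance ratios (scree plot inputs).
    explained = eigvals / eigvals.sum()
    cumulative = np.cumsum(explained)
    # Step 6: project the standardized data onto the leading components.
    loadings = eigvecs[:, :n_components]
    scores = Z @ loadings
    return scores, loadings, explained, cumulative

# Synthetic data (assumption): two latent factors mixed into six variables.
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.3 * rng.normal(size=(200, 6))

scores, loadings, explained, cumulative = pca(X, n_components=4)
print("explained variance ratio:", np.round(explained, 3))
print("cumulative variance:     ", np.round(cumulative, 3))
print("scores shape:", scores.shape)
```

The columns of `scores` are mutually uncorrelated by construction, which is why pairwise scatterplots of principal components show no linear trend between axes; any visible structure in those plots reflects groupings in the data rather than correlation between components.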
4. Results and Discussion
4.1. Assumption and Diagnostic Tests
4.1.1. Breusch–Pagan Test Results
4.1.2. Shapiro–Wilk Test Results
4.2. Feature Importance Analysis Results
- Dissolved oxygen (DO) has the highest importance score (1.000) in both models, making it the most influential predictor.
- Ammoniacal nitrogen (AN) ranks second with an average importance score of 0.565 across the two models.
- COD and BOD have moderate average importance scores (0.261 and 0.158, respectively).
- pH and TSS have negligible average importance (0.019 and 0.007), indicating minimal influence on model predictions.
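The averaged scores can be recomputed from the per-model values reported in the feature-importance table of this article. The sketch below assumes the "average across models" column is the simple arithmetic mean of the regression and classification scores, which is consistent with the tabulated values to within rounding.

```python
# Per-model scores copied from the feature-importance table in this article.
regression = {"DO": 1.000, "AN": 0.306, "COD": 0.025,
              "BOD": 0.037, "pH": 0.015, "TSS": 0.005}
classification = {"DO": 1.000, "AN": 0.824, "COD": 0.496,
                  "BOD": 0.279, "pH": 0.024, "TSS": 0.009}

# Assumption: the cross-model score is the arithmetic mean of the two models.
average = {p: (regression[p] + classification[p]) / 2 for p in regression}

# Rank parameters from most to least influential.
ranking = sorted(average, key=average.get, reverse=True)
for p in ranking:
    print(f"{p:>3}: {average[p]:.4f}")
```

Averaging gives each model equal weight. Note that COD scores far higher in the classification model than in the regression model, so its averaged rank depends heavily on that weighting.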
4.3. Learning Curve Analysis Results
4.4. Principal Component Analysis (PCA) Results
- The PC1 versus PC2 panel reveals distinct clustering patterns, suggesting that these components effectively capture significant groupings within the dataset, possibly related to water quality classifications. The clusters in this plot highlight the separation between observations with different underlying properties.
- The PC1 versus PC3 and PC1 versus PC4 panels display a more dispersed pattern with no clear grouping. This suggests that PC3 and PC4 contribute incremental, rather than major, information compared to PC1.
- The PC2 versus PC3 and PC2 versus PC4 panels exhibit random distributions, further emphasizing the diminishing explanatory power of the higher-order components (PC3 and PC4).
- The PC3 versus PC4 panel appears more uniformly scattered, which is consistent with principal components being uncorrelated by construction and indicates that these two components carry minimal overlapping information.
4.5. Comparative Discussion and Implications
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Schreiber, S.G.; Schreiber, S.; Tanna, R.N.; Roberts, D.R.; Arciszewski, T.J. Statistical tools for water quality assessment and monitoring in river ecosystems—A scoping review and recommendations for data analysis. Water Qual. Res. J. 2022, 57, 40–57.
- Statswork. Applications of Statistical Analyses on Water Quality Data and Its Recent Research Trends. Pioneer Statistical Consulting. Available online: https://statswork.com/blog/applications-of-statistical-analyses-on-water-quality-data-and-its-recent-research-trends/ (accessed on 13 November 2023).
- Fu, L.; Wang, Y.-G. Statistical Tools for Analyzing Water Quality Data. 2012. Available online: www.intechopen.com (accessed on 20 November 2023).
- Zhou, K.; Wu, B.; Zhang, X. Worldwide Research Progress and Trends in Application of Machine Learning to Wastewater Treatment: A Bibliometric Analysis. Water 2025, 17, 1314.
- Koronides, M.; Stylianidis, P.; Michailides, C.; Onoufriou, T. Real-Time Monitoring of Seawater Quality Parameters in Ayia Napa, Cyprus. J. Mar. Sci. Eng. 2024, 12, 1731.
- Cao, X.; Xiong, F.; Wang, Y.; Ma, H.; Zhang, Y.; Liu, Y.; Kong, X.; Wang, J.; Shi, Q.; Fan, P.; et al. Spectral Analysis of Dissolved Organic Carbon in Seawater by Combined Absorption and Fluorescence Technology. J. Mar. Sci. Eng. 2024, 12, 2297.
- Albrekht, V.; Mukhamediev, R.I.; Popova, Y.; Muhamedijeva, E.; Botaibekov, A. Top2Vec Topic Modeling to Analyze the Dynamics of Publication Activity Related to Environmental Monitoring Using Unmanned Aerial Vehicles. Publications 2025, 13, 15.
- Fox, A.; Leonard, H.; Springer, E.; Provoncha, T. Glyphosate Herbicide Impacts on the Seagrasses Halodule wrightii and Ruppia maritima from a Subtropical Florida Estuary. J. Mar. Sci. Eng. 2024, 12, 1941.
- Liao, S.L.; Chen, L.C.; Tsai, M.H.; Hua, M.C.; Yao, T.C.; Su, K.W.; Yeh, K.W.; Chiu, C.Y.; Lai, S.H.; Huang, J.L. Prenatal exposure to bisphenol-A is associated with dysregulated perinatal innate cytokine response and elevated cord IgE level: A population-based birth cohort study. Environ. Res. 2020, 191, 110123.
- Hino, M.; Benami, E.; Brooks, N. Machine learning for environmental monitoring. Nat. Sustain. 2018, 1, 583–588.
- Zhang, S.; Harrop, B.; Leung, L.R.; Charalampopoulos, A.T.; Barthel Sorensen, B.; Xu, W.; Sapsis, T. A Machine Learning Bias Correction on Large-Scale Environment of High-Impact Weather Systems in E3SM Atmosphere Model. J. Adv. Model Earth Syst. 2024, 16, e2023MS004138.
- Mak, H.W.L. Improved Remote Sensing Algorithms and Data Assimilation Approaches in Solving Environmental Retrieval Problems. Ph.D. Thesis, Hong Kong University of Science and Technology, Hong Kong, China, 2019.
- Qin, T.; Liang, T.; Fan, D.; He, H.; Lan, G.; Fu, B. A novel hybrid machine learning approach for accurate retrieval of ocean surface chlorophyll-a across oligotrophic to eutrophic waters. Environ. Res. 2025, 279, 121864.
- Benko, Ľ.; Munkova, D.; Munk, M.; Benkova, L.; Hajek, P. The use of residual analysis to improve the error rate accuracy of machine translation. Sci. Rep. 2024, 14, 1–19.
- Soleimani, F.; Hajializadeh, D. Bridge seismic hazard resilience assessment with ensemble machine learning. Structures 2022, 38, 719–732.
- Wang, X.; Mazumder, R.K.; Salarieh, B.; Salman, A.M.; Shafieezadeh, A.; Li, Y. Machine Learning for Risk and Resilience Assessment in Structural Engineering: Progress and Future Trends. J. Struct. Eng. 2022, 148, 03122003.
- Ohaegbulem, E.U.; Iheaka, V.C. On Remedying the Presence of Heteroscedasticity in a Multiple Linear Regression Modelling. Afr. J. Math. Stat. Stud. 2024, 7, 225–261.
- Yulia, Y.; Helvira, R.; Tunisa, J. Impact Analysis of Inflation, ROA, FDR, and Financing on Non-Performing Financing in Indonesian Islamic Banks. Dinar J. Ekon. Dan Keuang. Islam 2024, 11, 222–235. Available online: https://journal.trunojoyo.ac.id/dinar/article/view/22743 (accessed on 26 June 2025).
- Yang, S.; Berdine, G. Normality tests. Southwest Respir. Crit. Care Chron. 2021, 9, 87–90.
- Saariniemi, J. Case-Study: Twitter Data Analysis by Linear Regression Modelling. 2023. Available online: https://lutpub.lut.fi/handle/10024/166121 (accessed on 17 October 2024).
- Wang, W.; Melnyk, L.; Kubatko, O.; Kovalov, B.; Hens, L. Economic and Technological Efficiency of Renewable Energy Technologies Implementation. Sustainability 2023, 15, 8802.
- Zheng, Z.; Yang, Y.; Zhou, J.; Gu, F. Research on Time Series Data Prediction Based on Machine Learning Algorithms. In Proceedings of the 2024 IEEE 2nd International Conference on Control, Electronics and Computer Technology, ICCECT 2024, Jilin, China, 26–28 April 2024; pp. 680–686.
- Qu, X.; Zhao, F.; Gao, L.; Zhang, Z. The application of machine learning regression algorithms and feature engineering in practical application. In Proceedings of the 2022 10th International Conference on Information Systems and Computing Technology, ISCTech 2022, Guilin, China, 28–30 December 2022; pp. 259–263.
- Zheng, Z.; Yuan, J.; Yao, W.; Kwan, P.; Yao, H.; Liu, Q.; Guo, L. Fusion of UAV-Acquired Visible Images and Multispectral Data by Applying Machine-Learning Methods in Crop Classification. Agronomy 2024, 14, 2670.
- Catav, A.; Fu, B.; Zoabi, Y.; Meilik, A.L.; Shomron, N.; Ernst, J.; Sankararaman, S.; Gilad-Bachrach, R. Marginal Contribution Feature Importance - an Axiomatic Approach for Explaining Data. Proc. Mach. Learn. Res. 2021, 139, 1324.
- Framling, K. Feature Importance versus Feature Influence and What It Signifies for Explainable AI. In Communications in Computer and Information Science CCIS; Springer Nature: Cham, Switzerland, 2023; Volume 1901, pp. 241–259.
- Oukhouya, H.; El Himdi, K. A comparative study of ARIMA, SVMs, and LSTM models in forecasting the Moroccan stock market. Int. J. Simul. Process Model. 2023, 20, 125–143.
- Verma, V.K.; Kumar, V. Optimization of Regression algorithms using Learning curve in WSN. In Proceedings of the 2021 International Conference on Advance Computing and Innovative Technologies in Engineering, ICACITE 2021, Greater Noida, India, 4–5 March 2021; pp. 379–382.
- Hannula, O.; Hällberg, V.; Meuronen, A.; Suominen, O.; Rautiainen, S.; Palomäki, A.; Hyppölä, H.; Vanninen, R.; Mattila, K. Self-reported skills and self-confidence in point-of-care ultrasound: A cross-sectional nationwide survey amongst Finnish emergency physicians. BMC Emerg. Med. 2023, 23, 23.
- Liu, H.; Yang, S.; Qi, F.; Wang, S. Learning to Rank Normalized Entropy Curves with Differentiable Window Transformation. 2023. Available online: https://arxiv.org/abs/2301.10443v1 (accessed on 17 November 2024).
- Lu, J.; Gu, J.; Han, J.; Xu, J.; Liu, Y.; Jiang, G.; Zhang, Y. Evaluation of Spatiotemporal Patterns and Water Quality Conditions Using Multivariate Statistical Analysis in the Yangtze River, China. Water 2023, 15, 3242.
- Ma, X.; Wang, L.; Yang, H.; Li, N.; Gong, C. Spatiotemporal Analysis of Water Quality Using Multivariate Statistical Techniques and the Water Quality Identification Index for the Qinhuai River Basin, East China. Water 2020, 12, 2764.
- Camargo, A. PCAtest: Testing the statistical significance of Principal Component Analysis in R. PeerJ 2022, 10, e12967.
- Brereton, R.G. Principal components analysis with several objects and variables. J. Chemom. 2023, 37, e3408.
- Krzyśko, M.; Nijkamp, P.; Ratajczak, W.; Wołyński, W.; Wenerska, B. Spatio-temporal principal component analysis. Spat. Econ. Anal. 2024, 19, 8–29.
- Lokman, A.; Wan Zakiah, W.I.; Nor Azlina, A.A. A Review of Water Quality Forecasting and Classification Using Machine Learning Models and Statistical Analysis. Water 2025, 17, 2243.
- Mohammed, A.H.; Ashour, M.A.H. Improving the efficiency measurement index using principal component analysis (PCA). Int. J. Health Sci. 2022, 6, 6584–6600.
- Haryati, A.E.; Sugiyarto. Clustering with Principal Component Analysis and Fuzzy Subtractive Clustering Using Membership Function Exponential and Hamming Distance. IOP Conf. Ser. Mater. Sci. Eng. 2021, 1077, 012019.
- Jolliffe, I.T.; Cadima, J. Principal component analysis: A review and recent developments. Philos. Trans. A Math. Phys. Eng. Sci. 2016, 374, 20150202.
- Devasahayam, S.; Albijanic, B. Predicting hydrogen production from co-gasification of biomass and plastics using tree based machine learning algorithms. Renew. Energy 2024, 222, 119883.
- Jain, N.; Sharma, S.; Thakur, V.; Nutakki, M.; Mandava, S. Prediction and Analysis of Household Energy Consumption Integrated with Renewable Energy Sources using Machine Learning Algorithms in Energy Management. Int. J. Renew. Energy Res. 2024, 14, 354–362.
- Mathew, S.; Idi, D.; Stephen, M. Modeling and Inference of Insurance Sector Development on Nigeria Economic Growth. African Multidisciplinary J. Sci. Artif. Intell. 2024, 1, 249–263.
- Mikolajczyk, A.P.; Fortela, D.L.; Berry, J.C.; Chirdon, W.M.; Hernandez, R.A.; Gang, D.D.; Zappi, M.E. Evaluating the Suitability of Linear and Nonlinear Regression Approaches for the Langmuir Adsorption Model as Applied toward Biomass-Based Adsorbents: Testing Residuals and Assessing Model Validity. Langmuir 2024, 40, 20428–20442.
- Deshpande, A.M.; Minai, A.A.; Kumar, M. One-shot recognition of manufacturing defects in steel surfaces. Procedia Manuf. 2020, 48, 1064–1071.
- Boddu, Y.; Manimaran, A. Maximizing Forecasting Precision: Empowering Multivariate Time Series Prediction with QPCA-LSTM. Comput. Econ. 2024, 2024, 1–36.
- Malek, N.H.A.; Yaacob, W.F.W.; Nasir, S.A.M.; Shaadan, N. Prediction of Water Quality Classification of the Kelantan River Basin, Malaysia, Using Machine Learning Techniques. Water 2022, 14, 1067.
- Lap, B.Q.; Du Nguyen, H.; Hang, P.T.; Phi, N.Q.; Hoang, V.T.; Linh, P.G.; Hang, B.T. Predicting Water Quality Index (WQI) by feature selection and machine learning: A case study of An Kim Hai irrigation system. Ecol. Inform. 2023, 74, 101991.
- Wong, W.Y.; Al-Ani, I.; Khallel, A.; Khairuddin, M.; Salwa, A. Water Quality Index Using Modified Random Forest Technique: Assessing Novel Input Features. Comput. Model. Eng. Sci. 2022, 132, 1011–1038.
- Uddin, M.G.; Rahman, A.; Nash, S.; Diganta, M.T.; Sajib, A.M.; Moniruzzaman, M.; Olbert, A.I. Marine waters assessment using improved water quality model incorporating machine learning approaches. J. Environ. Manag. 2023, 344, 118368.
- Thia, J.A.; Thia, C.A.J. Guidelines for standardizing the application of discriminant analysis of principal components to genotype data. Mol. Ecol. Resour. 2023, 23, 523–538.
- Auerswald, M.; Moshagen, M. How to determine the number of factors to retain in exploratory factor analysis: A comparison of extraction methods under realistic conditions. Psychol. Methods 2019, 24, 468–491.
Parameter | Feature Importance in Regression Model | Feature Importance in Classification Model | Average Feature Importance Across Models |
---|---|---|---|
DO | 1.000 | 1.000 | 1.000 |
AN | 0.306 | 0.824 | 0.565 |
COD | 0.025 | 0.496 | 0.261 |
BOD | 0.037 | 0.279 | 0.158 |
pH | 0.015 | 0.024 | 0.019 |
TSS | 0.005 | 0.009 | 0.007 |

Metric | Mean | 95% CI Lower | 95% CI Upper |
---|---|---|---|
MAE | 11.521 | 9.769 | 13.273 |
MSE | 201.301 | 142.375 | 260.269 |
R² Score | −1.476 | −2.989 | −0.556 |

Variable | PC1 | PC2 | PC3 | PC4 |
---|---|---|---|---|
pH | 0.42 | 0.06 | −0.11 | 0.89 |
DO | 0.58 | −0.01 | 0.15 | −0.04 |
COD | 0.16 | 0.81 | −0.13 | −0.18 |
BOD | 0.53 | 0.03 | −0.43 | −0.14 |
TSS | 0.19 | 0.56 | 0.57 | 0.22 |
AN | 0.36 | −0.10 | −0.67 | 0.30 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Lokman, A.; Ismail, W.Z.W.; Aziz, N.A.A. Water Quality Evaluation and Analysis by Integrating Statistical and Machine Learning Approaches. Algorithms 2025, 18, 494. https://doi.org/10.3390/a18080494