Article

Water Quality Evaluation and Analysis by Integrating Statistical and Machine Learning Approaches

by Amar Lokman 1, Wan Zakiah Wan Ismail 1,* and Nor Azlina Ab Aziz 2,*
1 Advanced Devices and System (ADS), Faculty of Engineering and Built Environment, Universiti Sains Islam Malaysia, Nilai 71800, Negeri Sembilan, Malaysia
2 Faculty of Engineering and Technology, Multimedia University, Ayer Keroh 75450, Melaka, Malaysia
* Authors to whom correspondence should be addressed.
Algorithms 2025, 18(8), 494; https://doi.org/10.3390/a18080494
Submission received: 27 June 2025 / Revised: 5 August 2025 / Accepted: 5 August 2025 / Published: 8 August 2025

Abstract

Water quality assessment plays a vital role in environmental monitoring and resource management. This study aims to enhance the predictive modeling of the Water Quality Index (WQI) using a combination of statistical diagnostics and machine learning techniques. Data collected from six river locations in Malaysia are analyzed. The methodology involves collecting water quality data from six river locations in Malaysia, followed by a series of statistical analyses including assumption testing (Shapiro–Wilk and Breusch–Pagan tests), diagnostic evaluations, feature importance analysis, and principal component analysis (PCA). Decision tree regression (DTR) and autoregressive integrated moving average (ARIMA) are employed for regression, while random forest is used for classification. Learning curve analysis is conducted to evaluate model performance and generalization. The results indicate that dissolved oxygen (DO) and ammoniacal nitrogen (AN) are the most influential parameters, with normalized importance scores of 1.000 and 0.565, respectively. The Breusch–Pagan test identifies significant heteroscedasticity (p-value = 3.138 × 10⁻¹¹⁵), while the Shapiro–Wilk test confirms non-normality (p-value = 0.0). PCA effectively reduces dimensionality while preserving 95% of dataset variance, optimizing computational efficiency. Among the regression models, ARIMA demonstrates better predictive accuracy than DTR. Meanwhile, random forest achieves high classification performance and shows strong generalization capability with increasing training data. Learning curve analysis reveals overfitting in the regression model, suggesting the need for hyperparameter tuning, while the classification model demonstrates improved generalization with additional training data. Strong correlations among key parameters indicate potential multicollinearity, emphasizing the need for careful feature selection. These findings highlight the synergy between statistical pre-processing and machine learning, offering a more accurate and efficient approach to water quality prediction for informed environmental policy and real-time monitoring systems.

1. Introduction

Water quality datasets often exhibit challenges such as non-normal distributions, outliers, missing values, low concentrations below detection thresholds, and serial dependence. To ensure the accuracy and reliability of predictive models in water quality analysis, appropriate statistical methodologies must be employed. A range of statistical techniques is available to enhance the robustness, efficiency, and interpretability of water quality prediction models. These techniques are critical for evaluating model performance, diagnosing potential issues, and ensuring reliable predictions when applied to real-world datasets.
Among the commonly used statistical procedures are graphical analysis, trend analysis, correlation analysis, regression analysis, and time series analysis [1]. Visualization techniques, including box plots, scatterplots, and Q-Q plots, are particularly useful in identifying patterns, anomalies, and distributional properties within datasets [2,3]. Recent advancements in water quality monitoring have emphasized the application of statistical analysis and emerging technologies to enhance environmental assessment. For example, Zhou et al. [4] conducted a comprehensive bibliometric analysis on the integration of machine learning in wastewater treatment, highlighting global research trends and the crucial role of statistical data interpretation in optimizing water treatment processes. Similarly, Koronides et al. [5] presented a real-time seawater quality monitoring system in Cyprus, combining statistical methods with sensor networks to enable prompt environmental management responses. In another study, Cao et al. [6] utilized combined absorption and fluorescence spectroscopy with statistical spectral analysis to improve dissolved organic carbon monitoring in seawater, demonstrating the reliability of such approaches in marine research.
In parallel, Albrekht et al. [7] applied Top2Vec topic modeling and statistical data analysis to evaluate publication dynamics in environmental monitoring using unmanned aerial vehicles, showcasing how bibliometric and statistical tools can support technological innovation tracking in water quality control. Additionally, Fox et al. [8] explored the ecological impacts of glyphosate herbicide on seagrass species, employing extensive statistical data interpretation to assess water quality degradation in Florida estuaries, underscoring the significance of statistical rigor in environmental risk assessment. These studies collectively illustrate the indispensable role of statistical analysis in both operational water quality monitoring and broader environmental research, supporting data-driven decision-making in aquatic ecosystem management.
The application of machine learning, data assimilation, and statistical techniques has proven highly effective in addressing environmental challenges such as environmental retrieval, bias correction, and environmental monitoring. Bias correction methods and statistical calibrations have been integrated into environmental risk models to enhance prediction reliability and interpretability [9]. Further, advanced data-driven frameworks have been developed to assimilate multi-source environmental datasets and reduce uncertainties in climate and water quality forecasting [10]. These studies underline the importance of combining physics-based understanding with data-driven analytics to solve real-world environmental problems. Additionally, recent advances in model evaluation and hybrid frameworks illustrate the value of principal component analysis (PCA) and other statistical techniques in reducing data dimensionality and extracting meaningful patterns from multivariate environmental datasets [11].
The application of advanced computational techniques is significantly advancing environmental science. Research demonstrates that integrating improved remote sensing algorithms with data assimilation can enhance the retrieval and prediction of atmospheric pollutants [12]. Concurrently, machine learning revolutionizes environmental monitoring by enabling more effective enforcement of regulations. For instance, a data-driven approach to target high-risk facilities for water-pollution inspections can dramatically increase the detection of violations [10]. The study in [13] introduces the Synth Ridge Framework (SRF), a composite machine learning model combining feature engineering and an ensemble of gradient-boosting decision trees (XGBoost, LightGBM, CatBoost) to improve the accuracy of chlorophyll-a (Chla) concentration retrieval in optically complex water bodies. By enhancing input features and leveraging ensemble learning, SRF significantly outperforms baseline models in both accuracy and robustness, showing strong generalization across datasets from different satellite sensors.
This study introduces a novel integration of a diagnostic statistical technique with a hybrid machine learning framework comprising Decision Tree Regression (DTR), Autoregressive Integrated Moving Average (ARIMA), and Random Forest (RF). While these models have individually been applied in previous water quality research, their combined use in a unified framework to address both temporal forecasting and categorical classification while maintaining model interpretability is relatively underexplored. In particular, ARIMA is not utilized as a general predictor but as a tool to model the temporal structure of individual water parameters, while DTR and RF provide explainability and classification capability. The methodological novelty lies in the comprehensive evaluation pipeline, which includes residual analysis, diagnostic and assumption testing, feature importance analysis, learning curve analysis, and principal component analysis (PCA). This integrative approach bridges the gap between statistical rigor and practical machine learning deployment, offering a more robust and interpretable framework for real-time environmental monitoring. Based on this innovation, the study aims to do the following:
  • Develop and evaluate a hybrid machine learning framework that integrates statistical diagnostics for enhanced WQI prediction and classification.
  • Identify and quantify the most influential water quality parameters affecting WQI using feature importance analysis.
  • Assess model reliability and generalization through residual analysis, diagnostic testing, and learning curve analysis.
  • Reduce data dimensionality while preserving variance using PCA for computational efficiency and pattern discovery.
The research is structured into five sections. Section 1 provides an introduction to the study. Section 2 outlines the theoretical background on statistical and machine learning approaches. Section 3 describes the methodology for identifying the data collection area and details the procedural steps for each statistical analysis, including feature importance, assumption and diagnostic tests, learning curve analysis, and PCA. Section 4 presents the results and discussion. Finally, Section 5 concludes the study and offers suggestions for future work.

2. Related Works

This section reviews existing studies that have applied key statistical and analytical techniques relevant to this research. The aim is to contextualize the tools employed in our study, such as residual analysis, diagnostic and assumption testing, feature importance, learning curve analysis, and PCA, within the broader body of literature. By examining how these methods have been previously used in environmental monitoring and predictive modeling, we establish their relevance and justify their selection for WQI prediction.

2.1. Residual Analysis

Residual analysis and error analysis are closely related; both measure a distance (deviation or error) [14]. Residual analysis is a crucial component of regression analysis, serving as a diagnostic tool to evaluate the appropriateness of a model’s fit to the observed data [15]. When using a regression model to predict outcomes from predictor variables, it is essential to evaluate the validity of the assumptions that underlie the analysis. The main assumptions are linearity, independence, homoscedasticity, and normality of residuals [16]. The method commences by computing the residuals, which represent the discrepancies between the observed values and the values predicted by the model. These residuals are then plotted against the predicted values to visually examine any discernible trends. An ideal model should have a random distribution of residuals along the horizontal axis, indicating that it accurately captures the underlying pattern without systematic errors. Equation (1) defines the residual value, where N represents the number of observations in the dataset [14].
\text{Residual value}_i = \text{Observed value}_i - \text{Predicted value}_i, \quad i = 1, 2, \ldots, N \quad (1)
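For illustration, the short sketch below computes Equation (1) and draws the residuals-versus-fitted plot described above; the observed and predicted values are hypothetical placeholders rather than data from this study.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical observed WQI values and model predictions (placeholders).
observed = np.array([75.2, 68.4, 81.0, 59.7, 72.3, 64.8])
predicted = np.array([73.9, 70.1, 78.6, 62.0, 71.5, 66.2])

# Equation (1): residual_i = observed_i - predicted_i, i = 1..N.
residuals = observed - predicted

# Plot residuals against fitted (predicted) values; a random scatter
# around the zero line suggests no systematic error in the model.
plt.scatter(predicted, residuals)
plt.axhline(0, linestyle="--", color="gray")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```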

2.2. Diagnostic and Assumption Tests

Diagnostic and assumption tests validate the suitability of statistical models by checking assumptions such as normality, multicollinearity, and autocorrelation, using methods like the Shapiro–Wilk test for normality or the Breusch–Pagan test for homoscedasticity [17]. Patterns in the residual plot, such as curved relationships or a fan-shaped spread, can indicate issues such as non-linearity or heteroscedasticity, respectively. Statistical testing enhances and supplements these visual examinations. The Shapiro–Wilk test is frequently employed to evaluate the normality of the residuals [18]. The assumption of normality is crucial in numerous regression models that employ the least squares estimation approach, as it forms the foundation for ensuring that the statistical tests for coefficients are accurate. In the statistical test, W represents a measure of the straightness of the normal Q-Q plot; more specifically, W is the ratio of two variance estimates of a normal distribution given the data, computed by Equation (2) [19]:
W = \frac{\left( \sum_{i=1}^{n} a_i x_{(i)} \right)^2}{\sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2} \quad (2)
where $x_{(i)}$ is the $i$th order statistic of the data, $\bar{x}$ is the sample mean, $a_i$ is the coefficient for the $i$th order statistic, and $n$ is the sample size. To assess homoscedasticity, statistical procedures such as the Breusch–Pagan test or White’s test are utilized to verify that the residual variance remains constant across various levels of anticipated values [20]. A consistent variance in predictions ensures that the coefficient estimates remain reliable and stable throughout the whole range of data. A different study utilizes the Breusch–Pagan test for random effects to determine whether to select random-effects regression or ordinary least squares regression [21]; the test results indicate a preference for random-effects generalized least squares regression.
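As a minimal sketch of how these two tests can be run in practice, the example below uses scipy and statsmodels on synthetic data; the predictors, response, and OLS fit are placeholders standing in for any regression whose residuals are under examination.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import shapiro
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                      # placeholder predictors
y = X @ np.array([1.5, -0.8, 0.3]) + rng.normal(size=200)

# Fit an OLS model and extract its residuals.
ols = sm.OLS(y, sm.add_constant(X)).fit()

# Shapiro-Wilk: W near 1 and p > 0.05 are consistent with normal residuals.
w_stat, w_p = shapiro(ols.resid)

# Breusch-Pagan: a small p-value indicates heteroscedastic residuals.
lm_stat, lm_p, f_stat, f_p = het_breuschpagan(ols.resid, ols.model.exog)

print(f"Shapiro-Wilk W = {w_stat:.3f}, p = {w_p:.3g}")
print(f"Breusch-Pagan LM = {lm_stat:.3f}, p = {lm_p:.3g}")
```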

2.3. Feature Importance

Feature importance provides insights into the influence of predictors on the target variable, helping to simplify models and improve interpretability. The evaluation of feature importance serves as a crucial analytical technique used after the initial deployment of a model to determine the input factors that have a substantial impact on the model’s predictions [22]. In tree-based models, importance is derived from the reduction in node impurity attributable to each feature: for classification tasks, impurity is often measured using Gini impurity, while for regression tasks, it is measured by the reduction in variance. A feature analysis in economic forecasting indicates that the number of trademark authorizations significantly influences prediction accuracy, demonstrating the practical application of feature importance for optimizing algorithm models [23]. A different study discusses feature importance analysis, indicating that band values and vegetation indices significantly influenced classification results [24]. The Marginal Contribution Feature Importance (MCI) metric quantified the individual impact of features while addressing complex inter-feature correlations, which made it especially useful in regression contexts [25]. Additionally, another article examines the role of feature importance in explainable AI, utilizing contextual importance methods to clarify how features influence both regression and classification outcomes, thereby enhancing model transparency [26]. These studies demonstrate the critical role of feature importance analysis in improving model interpretability and predictive accuracy across various domains.

2.4. Learning Curve Analysis

Learning curve analysis is crucial for assessing the efficiency of a model’s learning process as the quantity of training data grows [27]. It can determine whether augmenting the dataset, increasing model complexity, or simplifying the model would improve performance. Learning curves have been applied to optimize regression algorithms in wireless sensor networks (WSNs), providing insights into high-bias and high-variance issues in statistical modeling [28]. In healthcare, learning curve analysis is utilized to evaluate emergency physicians’ skill development in point-of-care ultrasound (POCUS), identifying the steep acquisition phase and the leveling-off point as experience increases [29]. Additionally, another study introduced a novel method for ranking normalized entropy curves in automated machine learning systems, showcasing how learning curve analysis optimized model configurations and enhanced decision-making [30]. These applications highlight the versatility of learning curve analysis in improving modeling efficiency, understanding skill progression, and optimizing computational resources. The relationship between the size of the training set and the error rates can be better understood by examining the learning curve, which helps to uncover patterns in the data.

2.5. Principal Component Analysis (PCA)

Principal component analysis (PCA) is a technique employed to decrease the number of dimensions in extensive datasets, hence enhancing comprehensibility while minimizing the loss of information [31]. It converts the initial variables into a different set of variables, known as principal components, which are linear combinations of the original variables. This strategy is particularly advantageous for improving the effectiveness and productivity of machine learning models by concentrating on the most important features. Such techniques help discern the fundamental structure of the data and determine which factors are most significant [32].
A recent study introduces the PCA test R package (version 0.02) to statistically assess the significance of PCA results, enabling robust applications in ecological and evolutionary datasets [33]. Similarly, PCA’s importance in chemometrics was highlighted for analyzing trends in large datasets by reducing complexity in variables and objects [34]. A different study utilized spatial-temporal PCA to address dependencies in multivariate datasets, optimizing variance and integrating spatial indices for social sciences [35]. A recent review from [36] provides a comprehensive comparison of machine learning and statistical techniques such as regression, ensemble, PCA, and hybrid models for water quality forecasting and classification, highlighting their predictive performance of surface water particularly for rivers and lakes in Malaysia.
PCA proves essential in integrating with other methodologies. The combination of PCA with Data Envelopment Analysis (DEA) aimed to enhance efficiency measurement indices in state financial management [37]. Additionally, PCA was utilized alongside Fuzzy Subtractive Clustering for enhanced clustering outcomes in high-dimensional datasets [38]. These examples illustrated PCA’s versatility, spanning ecological studies and chemometrics to financial management and machine learning applications, establishing it as an indispensable tool for extracting actionable insights from complex datasets.
A key mathematical component of PCA is the eigen-decomposition of the covariance matrix of the standardized data. Specifically, Equation (3) is shown below [39]:
C v_i = \lambda_i v_i \quad (3)
where $C$ is the covariance matrix, $v_i$ is the $i$th eigenvector (i.e., a principal component direction), and $\lambda_i$ is the corresponding eigenvalue indicating the variance explained by that component. Since $C$ is real and symmetric, the eigenvectors $v_i$ are orthogonal, and sorting by descending $\lambda_i$ facilitates dimensionality reduction by retaining the components that capture most of the data variance.
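The following sketch works through Equation (3) with NumPy on synthetic data; the six-column matrix is a random placeholder for the standardized water quality parameters, and the 95% retention rule mirrors the criterion applied later in this study.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))  # placeholder for six measured parameters

# Standardize to zero mean and unit variance, then form the covariance matrix C.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
C = np.cov(X_std, rowvar=False)

# Equation (3): C v_i = lambda_i v_i. eigh exploits the symmetry of C and
# returns eigenvalues in ascending order, so we reorder to descending.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Cumulative explained variance; retain the components reaching 95%.
cum_var = np.cumsum(eigvals) / eigvals.sum()
n_components = int(np.searchsorted(cum_var, 0.95) + 1)
print(f"Components needed for 95% variance: {n_components}")
```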

3. Methods

3.1. Sources of Water Quality Data

Water quality data were collected from six key locations across Peninsular Malaysia (MRANTI Lake, Klang River, Semenyih River, Labu River, Malacca River, and Muar River) by the Department of Environment (DOE), Malaysia, and the RedTone company. As shown in Figure 1, these sites are geographically distributed across four Malaysian states: Selangor, Negeri Sembilan, Malacca, and Johor, offering a spatially diverse dataset for this study. These sites were selected because they span a wide range of hydrological settings, climatic variations, and anthropogenic stressors, including rapid urbanization, agricultural activities, industrial discharges, and tourism-related pressures, all factors known to influence surface water quality in complex and interrelated ways.
The Klang River and MRANTI Lake are situated within densely developed areas of Selangor, where urban runoff, domestic sewage, and industrial waste contribute significantly to pollutant loading. The Semenyih and Labu Rivers, while located in less densely populated zones, represent peri-urban and agricultural catchments, respectively. These areas are influenced by mixed land use, including farmland, residential expansion, and light industrial zones, making them valuable for studying the impacts of transitional land-use patterns. In contrast, the Malacca and Muar Rivers, located in coastal and heritage-rich areas, are affected by tourism activities, municipal waste discharge, and point-source pollution from nearby industries. This selection allows for a multi-contextual assessment of water quality under various environmental and socioeconomic settings. It also enables the investigation of spatial trends and localized pollution patterns, enhancing the robustness and generalizability of the water quality prediction models developed in this study. By encompassing both high-pressure urban basins and relatively moderate-impact zones, the study can provide more holistic insights for policy formulation, watershed management, and future monitoring strategies.

3.2. Statistical Analysis

The study employed principal component analysis (PCA), feature importance analysis, learning curve analysis, residual analysis, and diagnostic and assumption tests to ensure robust statistical evaluation. Each method is elaborated in the following sections.

3.2.1. Feature Importance Analysis Method

Feature importance analysis assessed the contribution of each water quality parameter to the predictive model’s performance. This analysis was crucial for optimizing the hybrid model, which combined Decision Tree Regression (DTR), Autoregressive Integrated Moving Average (ARIMA), and Random Forest (RF) Classification. Figure 2 presents the step-by-step procedure.
For tree-based models (DTR and RF), feature importance was determined based on impurity reduction at decision nodes [40]. In DTR, significance was measured by variance reduction at nodes utilizing the feature. The importance score for each feature was computed by aggregating impurity reduction across decision trees. Normalization ensured consistency and comparability across models.
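A minimal sketch of this impurity-based, normalized importance computation with scikit-learn follows; the feature names match the study's six parameters, but the measurements, WQI values, and class labels are random placeholders.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier

features = ["pH", "DO", "COD", "BOD", "TSS", "AN"]
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 6))            # placeholder measurements
wqi = rng.uniform(30, 95, size=300)      # placeholder WQI values
wqi_class = (wqi > 70).astype(int)       # placeholder class labels

# Impurity-based importances: variance reduction for the regressor,
# Gini impurity reduction for the classifier.
dtr = DecisionTreeRegressor(random_state=0).fit(X, wqi)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, wqi_class)

# Normalize each model's scores by its maximum so the top feature scores 1.000.
for name, imp in [("DTR", dtr.feature_importances_),
                  ("RF", rf.feature_importances_)]:
    norm = imp / imp.max()
    print(name, dict(zip(features, norm.round(3))))
```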
ARIMA is inherently a univariate time-series forecasting model that is included in this study to assess the individual temporal predictability of each water quality parameter [41]. To achieve this, each parameter is treated as an independent time series, and an ARIMA model (p = 4, d = 0, q = 3) is fitted separately. The Mean Squared Error (MSE) from each forecast is used as a proxy to infer how strongly a parameter exhibits consistent temporal structure. Lower MSE values are interpreted as indicative of higher individual predictability, but not as multivariate feature importance. This approach does not account for interactions between features and is not intended to replace conventional importance measures. Rather, it complements the insights obtained from tree-based models, DTR, and RF, which provide multivariate feature importance based on variance reduction and impurity measures. By combining these models, the study offers a richer understanding of both temporal and structural relevance of features within the dataset.
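The sketch below illustrates this per-parameter procedure with statsmodels, fitting an ARIMA(4, 0, 3) model to a single synthetic series and using hold-out MSE as the predictability proxy; the series itself and the 20-step hold-out split are illustrative assumptions.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
# Placeholder time series for one parameter (e.g., DO readings over time).
series = 6.5 + np.cumsum(rng.normal(scale=0.05, size=200))

# Hold out the last 20 observations for forecast evaluation.
train, test = series[:-20], series[-20:]

# Fit ARIMA with p = 4, d = 0, q = 3, as specified in the study.
model = ARIMA(train, order=(4, 0, 3)).fit()
forecast = model.forecast(steps=len(test))

# A lower MSE is read as higher individual temporal predictability.
mse = mean_squared_error(test, forecast)
print(f"Forecast MSE: {mse:.4f}")
```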
The integration of DTR, ARIMA, and RF models in this study serves complementary purposes. DTR and RF are tree-based models that effectively capture non-linear relationships and interactions among features, whereas ARIMA is well-suited for capturing linear temporal trends and autocorrelation in time-series data. By combining these models, the approach leverages the strengths of each: DTR for explainable regression, RF for robust classification and feature ranking, and ARIMA for analyzing temporal dependencies in individual parameters. This hybrid framework allows for a more comprehensive analysis of water quality dynamics, accommodating both the static feature-based relationships and temporal patterns inherent in environmental datasets.
Figure 3 illustrates the hybrid modeling framework used in this study, where time series water quality data are processed through three parallel modeling approaches: ARIMA, DTR, and RF. In this hybrid framework, the DTR model is used to predict continuous WQI values, offering precise numerical forecasts. Concurrently, the RF classifier categorizes WQI into qualitative water quality classes based on Malaysian national water standards. These outputs are used in parallel rather than sequentially. This dual approach enables the system to provide both detailed regression outputs and interpretable classification results, which is particularly useful for both technical analysis and decision-making in water resource management.
ARIMA is applied to capture linear temporal dependencies and patterns in individual parameters. Simultaneously, DTR performs regression analysis to generate continuous predictions of the WQI, while RF classifies the data into categorical water quality classes based on national standards. The outputs from DTR and RF are used in parallel to provide both detailed numerical forecasting and easily interpretable class labels, enabling a comprehensive and flexible analysis of water quality dynamics.

3.2.2. Assumption and Diagnostic Test

Assumption and diagnostic tests ensure the reliability and accuracy of regression model predictions. Figure 4 presents the diagnostic procedure.
Homoscedasticity and normality of residuals are key assumptions tested. The Breusch–Pagan test diagnoses heteroscedasticity by regressing squared residuals against predicted values to detect variance patterns [42]. A residual vs. fitted values plot is also generated to visually assess homoscedasticity. Randomly scattered residuals indicate homoscedasticity, whereas fan-shaped distributions suggest heteroscedasticity, necessitating corrective measures such as data transformation or robust standard errors.
The Shapiro–Wilk test evaluates residual normality, producing a W-statistic that assesses the correlation between ordered residuals and their expected normal distribution values [43]. A Q-Q plot visualizes residual normality, where a straight diagonal line confirms normality, and deviations suggest non-normality. In response to potential assumption violations, additional remedial strategies are implemented to evaluate and improve model robustness. A logarithmic transformation and Box-Cox transformation are applied to the dependent variable to stabilize variance and improve residual normality. The Box-Cox method determines the optimal transformation parameter (λ), with values close to 1 indicating minimal need for transformation. In parallel, a robust regression model using the Huber loss function is applied to mitigate the influence of outliers and non-constant variance. To complement these parametric adjustments, residual bootstrapping is performed, generating resampled distributions of residual means to evaluate model stability without relying on strict normality assumptions. These diagnostic and corrective procedures are conducted in anticipation of assumption violations and are further supported by visualizations in Section 4. These tests ensure that violations are detected and mitigated to enhance model reliability.
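A condensed sketch of these remedial steps, using scipy and scikit-learn on synthetic data, is shown below; the exact transformation and resampling settings used in the study may differ.

```python
import numpy as np
from scipy.stats import boxcox
from sklearn.linear_model import HuberRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))
y = np.exp(X @ np.array([0.5, 0.2, -0.3]) + rng.normal(scale=0.3, size=200))

# Box-Cox requires strictly positive targets; a fitted lambda near 1
# indicates little transformation is needed.
y_bc, lam = boxcox(y)
print(f"Box-Cox lambda: {lam:.3f}")

# Robust regression with Huber loss down-weights outlying residuals.
huber = HuberRegressor().fit(X, y_bc)
residuals = y_bc - huber.predict(X)

# Bootstrap the mean residual to gauge stability without normality assumptions.
boot_means = np.array([
    rng.choice(residuals, size=residuals.size, replace=True).mean()
    for _ in range(1000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for mean residual: ({lo:.4f}, {hi:.4f})")
```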

3.2.3. Analysis of Learning Curve

Learning curve analysis evaluates the Time Series Regression Forest (TSRF) model’s performance and generalization across varying training data sizes. It helps identify underfitting, overfitting, and optimal performance conditions. The analysis is conducted separately for the regression (DTR) and classification (RF) components of the hybrid model.
  • Step 1: Incremental Training
The TSRF model was iteratively retrained using progressively larger training subsets. For regression, the Decision Tree Regressor was trained on increasing dataset portions, while for classification, the Random Forest Classifier followed a similar approach. Training began with 10% of the total data, increasing in 10% increments until the full dataset was utilized. A consistent validation set (30% of the dataset) was used to evaluate generalization ability.
  • Step 2: Error Metrics
For regression tasks:
  • Training error was calculated using MSE on the training subset.
  • Validation error was calculated with the same metrics on the validation set.
For classification tasks:
  • Training error represented the proportion of misclassified samples in the training subset.
  • Validation error was the proportion of misclassified samples in the validation set, i.e., 1 minus the accuracy score.
  • Step 3: Training and Validation Curves Were Analyzed
The training error and validation error were plotted against the size of the training dataset to construct the learning curves.
  • Training Curve: Shows how the error decreases as the model receives more data to learn from. An initial rapid decline in error suggests effective learning, while a flattening indicates saturation of model performance [44].
  • Validation Curve: Demonstrates the model’s generalization ability. A validation error that remains high relative to the training error suggests overfitting, while consistently high errors in both curves indicate underfitting [44].
To further address overfitting and improve model generalization, k-fold cross-validation (k = 10) is implemented during model training. This approach ensures that the model’s performance is validated across multiple data splits, reducing the risk of bias from a single train-test partition. For the tree-based models, additional complexity control measures are applied where maximum tree depth is limited, minimum samples per split are increased, and pruning techniques are explored where applicable. These hyperparameters are optimized using grid search. Model performance is evaluated using multiple error metrics, including MSE, Mean Absolute Error (MAE), and R2 score, computed for both training and validation sets. To assess result stability, bootstrapped confidence intervals (95%) are generated for the error metrics, ensuring the robustness of the outcomes.
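The sketch below reproduces the general procedure with scikit-learn's learning_curve and GridSearchCV; the 10% training increments, 10-fold cross-validation, and the depth/split grid follow the description above, while the dataset is a synthetic placeholder.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import learning_curve, GridSearchCV

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 6))
y = 2 * X[:, 1] - X[:, 5] + rng.normal(scale=0.5, size=500)  # placeholder WQI

# Training/validation error at growing training-set fractions (10%..100%),
# scored as negative MSE under 10-fold cross-validation.
sizes, train_scores, val_scores = learning_curve(
    DecisionTreeRegressor(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=10,
    scoring="neg_mean_squared_error")
print("Training sizes:", sizes)
print("Train MSE:", (-train_scores.mean(axis=1)).round(2))
print("Validation MSE:", (-val_scores.mean(axis=1)).round(2))

# Complexity control via grid search over tree depth and split size.
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {"max_depth": [3, 5, 10], "min_samples_split": [2, 10, 20]},
    cv=10, scoring="neg_mean_squared_error").fit(X, y)
print("Best hyperparameters:", grid.best_params_)
```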

3.2.4. Analysis of Principal Component Analysis (PCA)

PCA is used to enhance computational efficiency and identify dominant patterns in water quality data while minimizing information loss [45]. The PCA methodology follows the steps outlined in Figure 5.
  • Step 1: Data Standardization
All water quality parameters (pH, DO, COD, BOD, TSS, AN) were standardized to ensure uniformity, eliminating scale biases by transforming them to zero mean and unit variance.
  • Step 2: Covariance Matrix Calculation
A covariance matrix quantified inter-variable relationships, identifying patterns of correlation. Each element represented the covariance between two variables, revealing how changes in one affected the other.
  • Step 3: Eigenvalue Decomposition
Eigenvalues and eigenvectors were computed, ranking principal components in descending order of variance explained. The first principal component (PC1) captured the highest variance, guiding dimensionality reduction.
  • Step 4: Cumulative Variance and Scree Plot
Eigenvalues determined the variance explained by each principal component. A scree plot visualized the eigenvalues to identify the “elbow point,” aiding optimal component selection while retaining maximum information.
  • Step 5: Heatmap of Correlation Coefficients
A heatmap visualized feature correlations, using color gradients to indicate relationship strengths, ensuring that PCA effectively transformed correlated factors into orthogonal components.
  • Step 6: Projection of Data onto Principal Components
Following the determination of the principal components, the standardized data were projected onto these new axes. This projection transformed the dataset into a lower-dimensional space, with each retained component serving as a new feature. By preserving the variability represented by the selected components, the transformation maintained computational efficiency and streamlined subsequent modeling operations.
  • Step 7: Visualization of Principal Components
Several plots were generated to interpret the results of PCA:
  • PC1 vs. PC2 scatterplot: Showed data distribution in a reduced two-dimensional space.
  • Scatterplot matrix (PC1–PC4): Examined higher-dimensional relationships, uncovering latent structures.
The step-by-step method ensures that PCA effectively reduces dimensionality while retaining critical information, enabling efficient analysis and interpretation of water quality data.
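A compact sketch of Steps 1–6 with scikit-learn is given below; passing n_components=0.95 makes PCA retain the fewest components reaching the 95% variance threshold, and the data are random placeholders standing in for the six standardized parameters.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 6))  # placeholder water quality measurements

# Step 1: standardize to zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)

# Steps 2-4: PCA performs the covariance and eigen-decomposition steps
# internally and keeps the fewest components reaching 95% variance.
pca = PCA(n_components=0.95)
scores = pca.fit_transform(X_std)  # Step 6: projection onto the new axes

print("Components retained:", pca.n_components_)
print("Explained variance ratio:", pca.explained_variance_ratio_.round(3))
print("Cumulative variance:", np.cumsum(pca.explained_variance_ratio_).round(3))
```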

4. Results and Discussion

This section presents the statistical approaches utilized to evaluate and validate the performance and assumptions of the proposed models for predicting and classifying water quality indices. The key methodologies include assumption and diagnostic tests, such as the Breusch–Pagan and Shapiro–Wilk tests, to assess homoscedasticity and normality of residuals, respectively. Additionally, feature importance analysis identifies the most influential water quality parameters affecting model accuracy. Principal component analysis (PCA) reduces dimensionality and reveals patterns in the dataset, while learning curve analysis examines the impact of training data size on model generalization. These analyses provide insights into model reliability, feature contributions, and areas for optimization.

4.1. Assumption and Diagnostic Tests

4.1.1. Breusch–Pagan Test Results

The Breusch–Pagan test is conducted to detect heteroscedasticity in residuals versus fitted values, as shown in Figure 6. The results indicate a significantly low p-value (3.138 × 10⁻¹¹⁵), well below the 0.05 threshold, and a test statistic of 0.431, confirming substantial heteroscedasticity. The residuals exhibit a distinct trend, dispersing as fitted values increase. Additionally, several outliers deviate significantly from the zero line, suggesting influential data points that may affect model performance. The fan-shaped residual pattern indicates non-constant variance, necessitating modifications to the target variable or model selection adjustments.

4.1.2. Shapiro–Wilk Test Results

The Shapiro–Wilk test assesses normality through a test statistic and p-value, supported by a Q-Q (quantile-quantile) plot (Figure 7). The obtained p-value (0.0) confirms a significant departure from normality at the 0.05 significance level. The test statistic (0.431), which significantly deviates from 1, reinforces this conclusion. The Q-Q plot, with theoretical quantiles on the x-axis and sample residual quantiles on the y-axis, exhibits a step-like residual pattern that clearly deviates from the diagonal reference line. This indicates non-normality in the residuals, suggesting that they may be discretely grouped, skewed, or influenced by underlying distributional irregularities, which could compromise model validity.
To improve and evaluate the robustness of the predictive model, several diagnostic analyses are carried out and visualized. The Q-Q plot of the Ordinary Least Squares (OLS) residuals, shown in Figure 8 (top left), demonstrates that the residuals closely match the theoretical normal distribution line, indicating that the normality criterion is satisfactorily met.
A similar trend can be seen in the Q-Q plot of residuals created after the Box-Cox transformation (top right). This plot displays a minor deviation, suggesting that even though the transformation may slightly improve symmetry, the original residuals already approximate a normal distribution. A comparison of the residual distributions of OLS and robust regression using Huber loss (bottom left) reveals that the Huber model produces a more constrained and symmetric residual dispersion, reducing the impact of extreme values and potential outliers; this illustrates that the alternative model is resilient even when data conditions are not optimal. Finally, the histogram of bootstrapped residual means (bottom right) exhibits a bell-shaped curve centered around zero, providing evidence that non-parametric resampling approaches can produce reliable interval estimates without strict parametric assumptions. Taken as a whole, these diagnostic charts establish the statistical integrity of the model and demonstrate how transformation, robust regression, and bootstrapping can serve as corrective procedures when standard assumptions are violated.

4.2. Feature Importance Analysis Results

The bar chart in Figure 9 illustrates normalized feature importance for regression (Decision Tree) and classification (Random Forest) models. Normalization ensures direct comparability across models and datasets.
  • Dissolved oxygen (DO) has the highest relevance score (1.000) in both models, making it the most influential predictor.
  • Ammoniacal nitrogen (AN) ranks second with a cumulative importance score of 0.565.
  • COD and BOD have moderate importance scores (0.253 and 0.158, respectively).
  • pH and TSS exhibit negligible importance (0.013 and 0.000), indicating minimal influence on model predictions.
Table 1 presents comprehensive feature significance scores for each feature in both regression and classification tests. The feature DO has a significance score of 1.000 for both regression and classification models, indicating that it is the most influential feature in both types of models.
The AN attribute demonstrates crucial significance, especially in the classification model with a value of 0.824, while also exhibiting considerable importance in the regression model, with a value of 0.306. The relevance score of COD in the classification model is 0.496, which is higher than its value of 0.010 in the regression model, suggesting that COD plays a more relevant role in classification tasks. Similarly, BOD has greater significance in classification, with a value of 0.279, compared to its significance in regression, which is only 0.037; however, the overall influence of BOD is still lower than that of COD. Both pH and TSS have negligible relevance ratings, scoring at or near zero in both models, indicating that these features have no major impact on the model’s predictions.
The significance of DO and AN motivates gathering more precise and detailed data on these parameters to improve model performance. Moreover, knowledge of their precise thresholds and how they interact with other parameters can offer a more profound understanding of water quality dynamics. The divergent significance scores observed in regression and classification tests indicate that distinct models can potentially gain advantages from tailored feature selection and engineering. For instance, the regression model can be enhanced by prioritizing DO, while classification models can gain from a more equitable evaluation of DO, AN, COD, and BOD. As pH and TSS have negligible influence on predictions, re-evaluating these features and potentially removing them can simplify the model, unless domain knowledge suggests they have significant nonlinear effects. This approach enhances model efficiency and ensures that critical relationships in the data are preserved. Understanding the crucial features enables focused and specific feature engineering. For example, by merging COD and BOD, a new feature can be created to measure the levels of organic pollution; additionally, interaction terms between DO and AN can help capture more intricate correlations within the data.
These findings are consistent with prior research on Malaysian and Southeast Asian rivers. For instance, [46] reported DO, AN, and BOD as the most significant predictors in classifying WQI using RF and Gradient Boosting in the Kelantan River Basin. Similarly, [47] highlighted DO and TSS as dominant contributors to WQI classification in the Kim Hai irrigation system in Vietnam. The alignment of our results with these studies supports the ecological relevance of DO and AN across multiple hydrological contexts. The low importance of pH and TSS also agrees with findings from [48], who found that parameters such as turbidity and DO were more influential than pH in tree-based predictive models.

4.3. Learning Curve Analysis Results

The learning curve in Figure 10 represents model performance trends. The regression model exhibits consistently low training error, approaching zero, indicating high precision. However, the persistent gap between training and validation errors suggests overfitting, implying that the model generalizes poorly to unseen data. Reducing model complexity through pruning, depth restriction, or minimum sample splits may mitigate this issue.
Similarly, the classification model (Figure 11) exhibits low training error, with validation error steadily decreasing as training size increases. This trend suggests improved generalization with more data, reinforcing the benefit of dataset expansion. Hyperparameter tuning can further optimize bias-variance trade-offs.
To mitigate overfitting identified in earlier learning curve analysis, 10-fold cross-validation was applied to the Decision Tree Regression model, with additional constraints introduced through limiting tree depth and minimum sample splits. The performance metrics across folds were evaluated and bootstrapped to derive 95% confidence intervals. As shown in Table 2, the cross-validated Mean Absolute Error (MAE) was 11.52, with a 95% confidence interval ranging from 9.77 to 13.27. The Mean Squared Error (MSE) was 201.30 (95% CI: 142.37–260.27), reflecting moderate variance in predictive performance across different data partitions. However, the R2 score averaged −1.48 (95% CI: −2.99 to −0.56), indicating that the model explained less variance than the mean-based baseline in most folds. This negative R2 suggests underfitting caused by over-regularization or by inherent data noise within the dummy dataset, which lacks strong predictive relationships. Despite these limitations, the application of pruning and validation illustrates a structured and reproducible approach to addressing model generalization and highlights the importance of balancing model complexity with predictive power. These insights also reinforce the need for tuning hyperparameters and possibly selecting alternative model architectures in future work.
Similar overfitting patterns in tree-based regression models have been noted in earlier studies. [48,49] observed that Decision Tree Regression often performs well on training data but suffers in generalization due to limited model complexity and small datasets. Our observed negative R2 aligns with these findings and highlights the need for model regularization or more advanced ensemble methods such as Gradient Boosting, which demonstrated higher accuracy in related studies by [46].

4.4. Principal Component Analysis (PCA) Results

The relationship between the number of principal components and the cumulative explained variance is shown in Figure 12. The cumulative explained variance increases rapidly for the first four principal components, indicating that these components capture a significant portion of the variability in the dataset. The dataset comprises six water quality parameters: pH, DO, COD, BOD, TSS, and AN. However, the PCA reveals that not all parameters contribute equally to the total variance. The first four components capture most of the information, while the fifth and sixth components contribute negligibly to the cumulative variance. This is why the number of principal components needed to explain 95% of the variance is four rather than six. The horizontal dashed red line at 95% cumulative variance demonstrates that the first four components sufficiently represent the dataset, ensuring dimensionality reduction without significant information loss.
In addition to variance-based justification, the interpretability of the components is evaluated using a PCA loading matrix (Table 3), which shows the contributions of each variable to the first four principal components. For instance, PC1 has strong positive contributions from DO and BOD, reflecting their role in representing oxygenation dynamics. PC2 is influenced by COD and TSS, representing pollutant load, while PC4 has high loading from pH. This interpretable grouping supports the use of four components not only due to their statistical sufficiency, but also because they reveal distinct ecological or chemical processes.
The intersection of the variance curve with this line marks the minimum number of components required to reach the 95% threshold.
The significant initial rise in cumulative explained variance indicates that the dataset’s information is primarily concentrated in the first principal components. This suggests that these components encapsulate the essential aspects and fundamental patterns of the data that define its overall structure. The point at which the curve begins to stabilize, in conjunction with its intersection with the threshold line, signifies the effectiveness of dimensionality reduction: beyond this point, additional components demonstrate diminishing utility in delivering new, significant information [50]. This reflects the inherent dimensionality of the dataset, indicating that it can be adequately represented with a reduced number of variables. Focusing on the optimal number of principal components can significantly enhance the efficiency of predictive models and other analyses; this deliberate reduction decreases unnecessary model complexity, thereby improving generalization performance when models are applied to new, unseen data.
Figure 13 illustrates the variance accounted for by each principal component, indicating a declining trend. The initial components account for a substantially greater proportion of the variance than the later components. The graph also shows the sequential rise in cumulative explained variance with the inclusion of each successive component, progressively approaching the 95% variance threshold. The declining magnitude of the individual explained variances underscores the diminishing returns associated with an increasing number of components: the initial components exert greater influence, encapsulating significant underlying patterns in the data, while subsequent components provide diminishing amounts of information.
The convergence of the cumulative variance line with the 95% threshold indicates an ideal number of principal components for effectively capturing the data while minimizing redundancy and overfitting. Beyond this threshold, further components are hard to justify, as they contribute insignificantly to explaining the overall variance [51]. This graph clearly demonstrates the efficacy of PCA in reducing the dataset’s dimensionality, offering a visual rationale for choosing a particular number of components to balance complexity against information loss.
The heatmap depicted in Figure 14 employs a red-to-blue gradient to represent the intensity and direction of correlations. Red hues indicate positive correlations, while blue hues denote negative correlations; the color intensity signifies the magnitude of the correlation coefficient, with darker hues representing stronger relationships. Significant positive correlations (0.93 and 0.81) indicate that these variable pairs rise concurrently and appear in intense red hues. Significant negative correlations (−0.69 and −0.63) are denoted by deep blue hues, signifying that a rise in one variable corresponds with a decrease in the other. Values approaching zero (0.0025 and −0.0024) show minimal coloration, implying negligible or no substantial linear correlation between the corresponding variable pairs.
Robust correlations, whether positive or negative, may indicate possible multicollinearity when these variables are included concurrently in regression models, thereby distorting estimation and compromising model stability. Thorough deliberation is required to choose whether to incorporate both variables in a model or to utilize one variable as a proxy for the other. Elevated correlation coefficients, whether positive or negative, signify fundamental patterns or correlations that may be essential for hypothesis testing or predictive modeling. Variables exhibiting strong positive correlations may pertain to analogous domains or possess shared elements that affect their actions.
The scatterplot depicted in Figure 15 illustrates the data distribution in the space defined by the first two principal components obtained by PCA. Most of the data points are densely grouped near zero on both the PC1 and PC2 axes. This clustering indicates that most of the data variability is accounted for by these two components, and numerous observations exhibit comparable scores on them. Several data points have substantial positive or negative scores along the PC1 and PC2 axes; these outliers are distant from the central cluster, signifying observations with atypical properties relative to the bulk of the data. The narrow band around zero for PC2, with most data points clustered around it, indicates that PC2 contributes less to data variation than PC1, which exhibits a wider dispersion.
The scatterplot shows how PCA simplifies data by transforming the original variables into new components. PC1, which has a wider spread, likely captures the most important variance in the dataset, whereas PC2 captures some additional, less significant variance. A central cluster along with a few outliers might indicate natural groupings or segments in the data, which may relate to classifications or categories present in the original dataset.
The relationships and distributions among the first four principal components (PC1, PC2, PC3, and PC4) obtained from principal component analysis are depicted in Figure 16. The emphasis is placed on the orthogonality of these components as well as their contributions to variance.
The components jointly account for a considerable amount of the variance in the dataset and are selected because they represent the data concisely while retaining useful information. Analyses are performed on six different water quality measures, and the PCA results reveal that the first four components contain the fundamental variability in the data, making them the most important for further interpretation. Plotting the principal components against one another shows how each component relates to the others, particularly in terms of data distribution and clustering, making it possible to identify patterns such as correlations or independence among the components.
The diagonal plots illustrate how the values are distributed across each principal component. The wide range of values in graphs such as those for PC1 suggests that this component plays a significant part in explaining the variation in the data, while the narrower distributions frequently seen in higher components like PC3 and PC4 indicate that less variance is captured. The off-diagonal grids offer scatterplots that illustrate the pairwise relationships between the principal components:
  • The grid comparing PC1 and PC2 reveals distinct clustering patterns, suggesting that these components effectively capture significant groupings within the dataset, possibly related to water quality classifications. The clusters in this plot highlight the separation between observations with varying underlying properties.
  • The PC1 versus PC3 and PC1 versus PC4 grids display a more dispersed pattern, indicating weaker correlations between these components. This suggests that PC3 and PC4 contribute incremental, rather than major, information compared to PC1.
  • The PC2 versus PC3 and PC2 versus PC4 grids exhibit random distributions, further emphasizing the diminishing explanatory power of the higher-order components (PC3 and PC4).
  • The PC3 versus PC4 grid appears more uniformly scattered, confirming that these two components carry minimal overlapping information and are orthogonal.
The orthogonality of components, as seen in the lack of discernible trends in the off-diagonal scatterplots, is a critical feature of PCA, ensuring that each principal component contributes unique information to the dataset. Furthermore, the clusters in the PC1 versus PC2 plot indicate natural groupings in the data, which can be associated with distinct water quality classes. These groupings offer valuable insights for classification and further analysis. Outliers are observed in some grids, particularly in PC1 versus PC2 and PC1 versus PC3. These outliers may represent anomalies in the dataset, such as extreme variations in water quality parameters, and warrant further investigation to understand their causes and potential impact.

4.5. Comparative Discussion and Implications

The integrative approach used in this study, combining statistical techniques with hybrid machine learning models, is validated by recent literature emphasizing the effectiveness of ensemble and diagnostic-supported modeling. Our findings, particularly the significance of DO and AN, are consistent with the results by [46,47,49], reaffirming the ecological importance of these parameters across different water systems. The RF model’s strong performance further aligns with studies that favor ensemble classifiers for robust and interpretable WQI prediction. While our Decision Tree Regression model underperforms in terms of R2, this is consistent with observations in environmental datasets where limited size or variability affects regression accuracy. Ensemble methods like Gradient Boosting or XGBoost may offer better generalization in future work. The dimensionality reduction using PCA also supports more efficient modeling, as demonstrated in chemometric and ecological applications across prior research. Overall, the consistency of our results with published studies enhances the confidence in our methodological approach and supports its potential application for national-scale water quality monitoring.

5. Conclusions

The study presents a comprehensive evaluation of statistical methodologies and machine learning techniques applied to water quality prediction models. The analysis demonstrates that the integration of assumption and diagnostic tests, feature importance analysis, learning curve analysis, and PCA enhances model reliability and interpretability. The results indicate that dissolved oxygen (DO) and ammoniacal nitrogen (AN) are the most influential water quality parameters, significantly impacting predictive accuracy. Additionally, the presence of heteroscedasticity and non-normal residuals highlights the need for robust data preprocessing techniques. The implementation of PCA effectively reduces dimensionality while preserving critical information, improving computational efficiency and model performance. These findings contribute to the development of more accurate and scalable water quality prediction models, supporting informed decision-making in environmental monitoring and resource management.
Future research should focus on expanding the dataset to include a wider range of water quality parameters and geographical locations to improve model generalization. The incorporation of advanced machine learning techniques, such as deep learning and ensemble learning, may further enhance predictive accuracy and adaptability to complex data patterns. Additionally, exploring hybrid modeling approaches that integrate physical and statistical models could provide deeper insights into water quality dynamics.

Author Contributions

Conceptualization, W.Z.W.I. and N.A.A.A.; methodology, A.L. and W.Z.W.I.; validation, W.Z.W.I. and N.A.A.A.; investigation, A.L. and W.Z.W.I.; writing—original draft preparation, A.L.; writing—review and editing, W.Z.W.I. and N.A.A.A.; visualization, N.A.A.A. and W.Z.W.I.; supervision, W.Z.W.I. and N.A.A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research is funded by a grant from the Ministry of Higher Education, Malaysia, under the Fundamental Research Grant Scheme (FRGS/1/2024/WAS02/USIM/02/1), and the APC is funded by Multimedia University (MMUE/210013).

Data Availability Statement

All data are presented in the manuscript.

Acknowledgments

We would like to acknowledge the support given by the RedTone Company, Universiti Sains Islam Malaysia, and Multimedia University towards this project. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Schreiber, S.G.; Schreiber, S.; Tanna, R.N.; Roberts, D.R.; Arciszewski, T.J. Statistical tools for water quality assessment and monitoring in river ecosystems—A scoping review and recommendations for data analysis. Water Qual. Res. J. 2022, 57, 40–57. [Google Scholar] [CrossRef]
  2. Statswork. Applications of Statistical Analyses on Water Quality Data and Its Recent Research Trends. Pioneer Statistical Consulting. Available online: https://statswork.com/blog/applications-of-statistical-analyses-on-water-quality-data-and-its-recent-research-trends/ (accessed on 13 November 2023).
  3. Fu, L.; Wang, Y.-G. Statistical Tools for Analyzing Water Quality Data. 2012. Available online: www.intechopen.com (accessed on 20 November 2023).
  4. Zhou, K.; Wu, B.; Zhang, X. Worldwide Research Progress and Trends in Application of Machine Learning to Wastewater Treatment: A Bibliometric Analysis. Water 2025, 17, 1314. [Google Scholar] [CrossRef]
  5. Koronides, M.; Stylianidis, P.; Michailides, C.; Onoufriou, T. Real-Time Monitoring of Seawater Quality Parameters in Ayia Napa, Cyprus. J. Mar. Sci. Eng. 2024, 12, 1731. [Google Scholar] [CrossRef]
  6. Cao, X.; Xiong, F.; Wang, Y.; Ma, H.; Zhang, Y.; Liu, Y.; Kong, X.; Wang, J.; Shi, Q.; Fan, P.; et al. Spectral Analysis of Dissolved Organic Carbon in Seawater by Combined Absorption and Fluorescence Technology. J. Mar. Sci. Eng. 2024, 12, 2297. [Google Scholar] [CrossRef]
  7. Albrekht, V.; Mukhamediev, R.I.; Popova, Y.; Muhamedijeva, E.; Botaibekov, A. Top2Vec Topic Modeling to Analyze the Dynamics of Publication Activity Related to Environmental Monitoring Using Unmanned Aerial Vehicles. Publications 2025, 13, 15. [Google Scholar] [CrossRef]
  8. Fox, A.; Leonard, H.; Springer, E.; Provoncha, T. Glyphosate Herbicide Impacts on the Seagrasses Halodule wrightii and Ruppia maritima from a Subtropical Florida Estuary. J. Mar. Sci. Eng. 2024, 12, 1941. [Google Scholar] [CrossRef]
  9. Liao, S.L.; Chen, L.C.; Tsai, M.H.; Hua, M.C.; Yao, T.C.; Su, K.W.; Yeh, K.W.; Chiu, C.Y.; Lai, S.H.; Huang, J.L. Prenatal exposure to bisphenol-A is associated with dysregulated perinatal innate cytokine response and elevated cord IgE level: A population-based birth cohort study. Environ. Res. 2020, 191, 110123. [Google Scholar] [CrossRef]
  10. Hino, M.; Benami, E.; Brooks, N. Machine learning for environmental monitoring. Nat. Sustain. 2018, 1, 583–588. [Google Scholar] [CrossRef]
  11. Zhang, S.; Harrop, B.; Leung, L.R.; Charalampopoulos, A.T.; Barthel Sorensen, B.; Xu, W.; Sapsis, T. A Machine Learning Bias Correction on Large-Scale Environment of High-Impact Weather Systems in E3SM Atmosphere Model. J. Adv. Model Earth Syst. 2024, 16, e2023MS004138. [Google Scholar] [CrossRef]
  12. Mak, H.W.L. Improved Remote Sensing Algorithms and Data Assimilation Approaches in Solving Environmental Retrieval Problems. Ph.D. Thesis, Hong Kong University of Science and Technology, Hong Kong, China, 2019. [Google Scholar] [CrossRef]
  13. Qin, T.; Liang, T.; Fan, D.; He, H.; Lan, G.; Fu, B. A novel hybrid machine learning approach for accurate retrieval of ocean surface chlorophyll-a across oligotrophic to eutrophic waters. Environ. Res. 2025, 279, 121864. [Google Scholar] [CrossRef]
  14. Benko, Ľ.; Munkova, D.; Munk, M.; Benkova, L.; Hajek, P. The use of residual analysis to improve the error rate accuracy of machine translation. Sci. Rep. 2024, 14, 1–19. [Google Scholar] [CrossRef]
  15. Soleimani, F.; Hajializadeh, D. Bridge seismic hazard resilience assessment with ensemble machine learning. Structures 2022, 38, 719–732. [Google Scholar] [CrossRef]
  16. Wang, X.; Mazumder, R.K.; Salarieh, B.; Salman, A.M.; Shafieezadeh, A.; Li, Y. Machine Learning for Risk and Resilience Assessment in Structural Engineering: Progress and Future Trends. J. Struct. Eng. 2022, 148, 03122003. [Google Scholar] [CrossRef]
  17. Ohaegbulem, E.U.; Iheaka, V.C. On Remedying the Presence of Heteroscedasticity in a Multiple Linear Regression Modelling. Afr. J. Math. Stat. Stud. 2024, 7, 225–261. [Google Scholar] [CrossRef]
  18. Yulia, Y.; Helvira, R.; Tunisa, J. Impact Analysis of Inflation, ROA, FDR, and Financing on Non-Performing Financing in Indonesian Islamic Banks. Dinar J. Ekon. Dan Keuang. Islam 2024, 11, 222–235. Available online: https://journal.trunojoyo.ac.id/dinar/article/view/22743 (accessed on 26 June 2025).
  19. Yang, S.; Berdine, G. Normality tests. Southwest Respir. Crit. Care Chron. 2021, 9, 87–90. [Google Scholar] [CrossRef]
  20. Saariniemi, J. Case-Study: Twitter Data Analysis by Linear Regression Modelling. 2023. Available online: https://lutpub.lut.fi/handle/10024/166121 (accessed on 17 October 2024).
  21. Wang, W.; Melnyk, L.; Kubatko, O.; Kovalov, B.; Hens, L. Economic and Technological Efficiency of Renewable Energy Technologies Implementation. Sustainability 2023, 15, 8802. [Google Scholar] [CrossRef]
  22. Zheng, Z.; Yang, Y.; Zhou, J.; Gu, F. Research on Time Series Data Prediction Based on Machine Learning Algorithms. In Proceedings of the 2024 IEEE 2nd International Conference on Control, Electronics and Computer Technology, ICCECT 2024, Jilin, China, 26–28 April 2024; pp. 680–686. [Google Scholar] [CrossRef]
  23. Qu, X.; Zhao, F.; Gao, L.; Zhang, Z. The application of machine learning regression algorithms and feature engineering in practical application. In Proceedings of the 2022 10th International Conference on Information Systems and Computing Technology, ISCTech 2022, Guilin, China, 28–30 December 2022; pp. 259–263. [Google Scholar] [CrossRef]
  24. Zheng, Z.; Yuan, J.; Yao, W.; Kwan, P.; Yao, H.; Liu, Q.; Guo, L. Fusion of UAV-Acquired Visible Images and Multispectral Data by Applying Machine-Learning Methods in Crop Classification. Agronomy 2024, 14, 2670. [Google Scholar] [CrossRef]
  25. Catav, A.; Fu, B.; Zoabi, Y.; Meilik, A.L.; Shomron, N.; Ernst, J.; Sankararaman, S.; Gilad-Bachrach, R. Marginal Contribution Feature Importance - an Axiomatic Approach for Explaining Data. Proc. Mach. Learn. Res. 2021, 139, 1324. [Google Scholar]
  26. Framling, K. Feature Importance versus Feature Influence and What It Signifies for Explainable AI. In Communications in Computer and Information Science CCIS; Springer Nature: Cham, Switzerland, 2023; Volume 1901, pp. 241–259. [Google Scholar] [CrossRef]
  27. Oukhouya, H.; El Himdi, K. A comparative study of ARIMA, SVMs, and LSTM models in forecasting the Moroccan stock market. Int. J. Simul. Process Model. 2023, 20, 125–143. [Google Scholar] [CrossRef]
  28. Verma, V.K.; Kumar, V. Optimization of Regression algorithms using Learning curve in WSN. In Proceedings of the 2021 International Conference on Advance Computing and Innovative Technologies in Engineering, ICACITE 2021, Greater Noida, India, 4–5 March 2021; pp. 379–382. [Google Scholar] [CrossRef]
  29. Hannula, O.; Hällberg, V.; Meuronen, A.; Suominen, O.; Rautiainen, S.; Palomäki, A.; Hyppölä, H.; Vanninen, R.; Mattila, K. Self-reported skills and self-confidence in point-of-care ultrasound: A cross-sectional nationwide survey amongst Finnish emergency physicians. BMC Emerg. Med. 2023, 23, 23. [Google Scholar] [CrossRef]
  30. Liu, H.; Yang, S.; Qi, F.; Wang, S. Learning to Rank Normalized Entropy Curves with Differentiable Window Transformation. 2023. Available online: https://arxiv.org/abs/2301.10443v1 (accessed on 17 November 2024).
  31. Lu, J.; Gu, J.; Han, J.; Xu, J.; Liu, Y.; Jiang, G.; Zhang, Y. Evaluation of Spatiotemporal Patterns and Water Quality Conditions Using Multivariate Statistical Analysis in the Yangtze River, China. Water 2023, 15, 3242. [Google Scholar] [CrossRef]
  32. Ma, X.; Wang, L.; Yang, H.; Li, N.; Gong, C. Spatiotemporal Analysis of Water Quality Using Multivariate Statistical Techniques and the Water Quality Identification Index for the Qinhuai River Basin, East China. Water 2020, 12, 2764. [Google Scholar] [CrossRef]
  33. Camargo, A. PCAtest: Testing the statistical significance of Principal Component Analysis in R. PeerJ 2022, 10, e12967. [Google Scholar] [CrossRef]
  34. Brereton, R.G. Principal components analysis with several objects and variables. J. Chemom. 2023, 37, e3408. [Google Scholar] [CrossRef]
  35. Krzyśko, M.; Nijkamp, P.; Ratajczak, W.; Wołyński, W.; Wenerska, B. Spatio-temporal principal component analysis. Spat. Econ. Anal. 2024, 19, 8–29. [Google Scholar] [CrossRef]
  36. Lokman, A.; Wan Zakiah, W.I.; Nor Azlina, A.A. A Review of Water Quality Forecasting and Classification Using Machine Learning Models and Statistical Analysis. Water 2025, 17, 2243. [Google Scholar] [CrossRef]
  37. Mohammed, A.H.; Ashour, M.A.H. Improving the efficiency measurement index using principal component analysis (PCA). Int. J. Health Sci. 2022, 6, 6584–6600. [Google Scholar] [CrossRef]
  38. Haryati, A.E.; Sugiyarto. Clustering with Principal Component Analysis and Fuzzy Subtractive Clustering Using Membership Function Exponential and Hamming Distance. IOP Conf. Ser. Mater. Sci. Eng. 2021, 1077, 012019. [Google Scholar] [CrossRef]
  39. Jolliffe, I.T.; Cadima, J. Principal component analysis: A review and recent developments. Philos. Trans. A Math. Phys. Eng. Sci. 2016, 374, 20150202. [Google Scholar] [CrossRef] [PubMed]
  40. Devasahayam, S.; Albijanic, B. Predicting hydrogen production from co-gasification of biomass and plastics using tree based machine learning algorithms. Renew. Energy 2024, 222, 119883. [Google Scholar] [CrossRef]
  41. Jain, N.; Sharma, S.; Thakur, V.; Nutakki, M.; Mandava, S. Prediction and Analysis of Household Energy Consumption Integrated with Renewable Energy Sources using Machine Learning Algorithms in Energy Management. Int. J. Renew. Energy Res. 2024, 14, 354–362. [Google Scholar] [CrossRef]
  42. Mathew, S.; Idi, D.; Stephen, M. Modeling and Inference of Insurance Sector Development on Nigeria Economic Growth African Multidisciplinary Modeling and Inference of Insurance Sector Development on Nigeria Economic Growth. J. Sci. Artif. Intell. 2024, 1, 249–263. [Google Scholar]
  43. Mikolajczyk, A.P.; Fortela, D.L.; Berry, J.C.; Chirdon, W.M.; Hernandez, R.A.; Gang, D.D.; Zappi, M.E. Evaluating the Suitability of Linear and Nonlinear Regression Approaches for the Langmuir Adsorption Model as Applied toward Biomass-Based Adsorbents: Testing Residuals and Assessing Model Validity. Langmuir 2024, 40, 20428–20442. [Google Scholar] [CrossRef]
  44. Deshpande, A.M.; Minai, A.A.; Kumar, M. One-shot recognition of manufacturing defects in steel surfaces. Procedia Manuf. 2020, 48, 1064–1071. [Google Scholar] [CrossRef]
  45. Boddu, Y.; Manimaran, A. Maximizing Forecasting Precision: Empowering Multivariate Time Series Prediction with QPCA-LSTM. Comput. Econ. 2024, 2024, 1–36. [Google Scholar] [CrossRef]
  46. Malek, N.H.A.; Yaacob, W.F.W.; Nasir, S.A.M.; Shaadan, N. Prediction of Water Quality Classification of the Kelantan River Basin, Malaysia, Using Machine Learning Techniques. Water 2022, 14, 1067. [Google Scholar] [CrossRef]
  47. Lap, B.Q.; Du Nguyen, H.; Hang, P.T.; Phi, N.Q.; Hoang, V.T.; Linh, P.G.; Hang, B.T. Predicting Water Quality Index (WQI) by feature selection and machine learning: A case study of An Kim Hai irrigation system. Ecol. Inform. 2023, 74, 101991. [Google Scholar] [CrossRef]
  48. Wong, W.Y.; Al-Ani, I.; Khallel, A.; Khairuddin, M.; Salwa, A. Water Quality Index Using Modified Random Forest Technique: Assessing Novel Input Features. Comput. Model. Eng. Sci. 2022, 132, 1011–1038. [Google Scholar] [CrossRef]
  49. Uddin, M.G.; Rahman, A.; Nash, S.; Diganta, M.T.; Sajib, A.M.; Moniruzzaman, M.; Olbert, A.I. Marine waters assessment using improved water quality model incorporating machine learning approaches. J. Environ. Manag. 2023, 344, 118368. [Google Scholar] [CrossRef]
  50. Thia, J.A.; Thia, C.A.J. Guidelines for standardizing the application of discriminant analysis of principal components to genotype data. Mol. Ecol. Resour. 2023, 23, 523–538. [Google Scholar] [CrossRef] [PubMed]
  51. Auerswald, M.; Moshagen, M. How to determine the number of factors to retain in exploratory factor analysis: A comparison of extraction methods under realistic conditions. Psychol. Methods 2019, 24, 468–491. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Map of Peninsular Malaysia showing the six selected study locations: MRANTI Lake, Klang River, Semenyih River (Selangor), Labu River (Negeri Sembilan), Malacca River (Melaka), and Muar River (Johor).
Figure 2. Procedure for analysing feature importance.
Figure 3. Conceptual framework of the hybrid model integrating ARIMA, DTR, and RF.
Figure 4. Flowchart of assumptions and diagnostic tests.
Figure 5. PCA implementation process.
Figure 6. Residual vs. fitted values plot. Blue dots represent the residuals for each observation, and the red line represents the reference line at zero residual, indicating perfect model fit.
Figure 7. Q-Q plot of residuals. The x-axis represents theoretical quantiles, and the y-axis represents sample quantiles of residuals. Blue dots represent the ordered standardized residuals and the red diagonal line represents the theoretical quantiles from a normal distribution. Deviation from the red diagonal line indicates non-normality, consistent with the Shapiro–Wilk test result.
Figure 8. Diagnostic plots supporting remedial strategies for assumption violations: (a) Q-Q plot of OLS residuals, where orange dots represent the ordered standardized residuals and the red line represents the theoretical quantiles from a normal distribution; (b) Q-Q plot after Box-Cox transformation, with the same color coding as (a); (c) histogram of residuals comparing Ordinary Least Squares (OLS, blue bars with blue density curve) and Huber robust regression (red bars with red density curve); (d) bootstrapped distribution of residual means, where yellow bars represent the distribution and the overlaid orange line represents the fitted normal density curve.
Figure 9. Normalized feature importance.
Figure 10. Learning curve for regression model.
Figure 11. Learning curve for classification model.
Figure 12. Cumulative explained variance by number of principal components. The blue dashed line represents the cumulative proportion of variance. The red dashed horizontal line indicates the threshold for the desired level of explained variance (95%) used to determine the optimal number of components.
Figure 13. The graph of PCA cumulative and individual explained variance. The blue dashed line represents the cumulative proportion of variance. The red dashed horizontal line indicates the threshold for the desired level of explained variance (95%) used to determine the optimal number of components.
Figure 14. Heatmap of correlation coefficients between study variables.
Figure 15. Scatterplot of principal components 1 and 2.
Figure 16. Scatterplot matrix of PCA components PC1, PC2, PC3, and PC4.
Table 1. Feature importance for water parameters.

| Parameters | Feature Importance in Regression Model | Feature Importance in Classification Model | Average Feature Importance Across Models |
|---|---|---|---|
| DO | 1.000 | 1.000 | 1.000 |
| AN | 0.306 | 0.824 | 0.565 |
| COD | 0.025 | 0.496 | 0.261 |
| BOD | 0.037 | 0.279 | 0.158 |
| pH | 0.015 | 0.024 | 0.019 |
| TSS | 0.005 | 0.009 | 0.007 |
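The sketch below illustrates one plausible way to derive normalized scores of this form: tree-based importances from a regression model and a classification model are each scaled so that the top parameter equals 1.000, then averaged. The estimators and the synthetic data are assumptions for illustration, not the study's pipeline.

```python
# Hypothetical sketch: normalized feature importances averaged across a
# regression and a classification model. Data are synthetic placeholders.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
cols = ["pH", "DO", "COD", "BOD", "TSS", "AN"]
X = pd.DataFrame(rng.normal(size=(300, 6)), columns=cols)
y_reg = rng.normal(size=300)           # placeholder WQI values
y_cls = rng.integers(0, 3, size=300)   # placeholder water quality classes

imp_reg = DecisionTreeRegressor(random_state=0).fit(X, y_reg).feature_importances_
imp_cls = RandomForestClassifier(random_state=0).fit(X, y_cls).feature_importances_

# Scale each column so the most important parameter scores 1.000, then average.
table = pd.DataFrame({"Regression": imp_reg / imp_reg.max(),
                      "Classification": imp_cls / imp_cls.max()}, index=cols)
table["Average"] = table.mean(axis=1)
print(table.round(3).sort_values("Average", ascending=False))
```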
Table 2. Cross-validated regression metrics (DTR).

| Metric | Mean | 95% CI Lower | 95% CI Upper |
|---|---|---|---|
| MAE | 11.521 | 9.769 | 13.273 |
| MSE | 201.301 | 142.375 | 260.269 |
| R2 Score | −1.476 | −2.989 | −0.556 |
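A minimal sketch of how such cross-validated means and 95% confidence intervals can be computed, assuming a normal approximation across folds (the study's exact cross-validation protocol and CI method are not reproduced here):

```python
# Hypothetical sketch: per-fold MAE, MSE, and R2 with normal-approximation
# 95% confidence intervals. X and y are synthetic stand-ins.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))
y = rng.normal(size=200)  # placeholder WQI target

cv = cross_validate(DecisionTreeRegressor(random_state=0), X, y, cv=10,
                    scoring=("neg_mean_absolute_error",
                             "neg_mean_squared_error", "r2"))

for key, label, flip in [("test_neg_mean_absolute_error", "MAE", -1),
                         ("test_neg_mean_squared_error", "MSE", -1),
                         ("test_r2", "R2", 1)]:
    vals = flip * cv[key]  # undo scikit-learn's sign convention for errors
    half = 1.96 * vals.std(ddof=1) / np.sqrt(len(vals))
    print(f"{label}: mean = {vals.mean():.3f}, "
          f"95% CI = ({vals.mean() - half:.3f}, {vals.mean() + half:.3f})")
```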
Table 3. PCA loading matrix showing contributions of each variable to the first four principal components.

| Variable | PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|---|
| pH | 0.42 | 0.06 | −0.11 | 0.89 |
| DO | 0.58 | −0.01 | 0.15 | −0.04 |
| COD | 0.16 | 0.81 | −0.13 | −0.18 |
| BOD | 0.53 | 0.03 | −0.43 | −0.14 |
| TSS | 0.19 | 0.56 | 0.57 | 0.22 |
| AN | 0.36 | −0.10 | −0.67 | 0.30 |
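For reference, a loading matrix of this shape can be read off a PCA fitted to the standardized parameters; in scikit-learn, the rows of components_ are the component vectors, so their transpose gives variable-by-component loadings. The sketch below assumes this workflow and uses synthetic placeholder data:

```python
# Hypothetical sketch: extracting a variable-by-component loading matrix
# from a fitted PCA. Data are synthetic placeholders.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
cols = ["pH", "DO", "COD", "BOD", "TSS", "AN"]
df = pd.DataFrame(rng.normal(size=(300, 6)), columns=cols)

pca = PCA(n_components=4).fit(StandardScaler().fit_transform(df))
loadings = pd.DataFrame(pca.components_.T, index=cols,
                        columns=["PC1", "PC2", "PC3", "PC4"])
print(loadings.round(2))
```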
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
