3.1. Developing Regression Models to Be Used in Modified WQIs
Spearman’s correlation analysis was used to verify the possible relationships between the parameters of the water under study.
Table 2 shows the results obtained from this analysis. Only correlations greater than or equal to 0.4 in absolute value (in bold) were considered. It was observed that eligible parameters could be used to compose predictive models for BOD, TN, TP, and TS concentrations. Very weak correlations were obtained for coliform values, and this variable was therefore disregarded.
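The screening step described above can be sketched as follows. This is a minimal illustration, assuming Spearman's ρ is computed as the Pearson correlation of average ranks; the function names and the sample readings are hypothetical and are not taken from the study's dataset.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def eligible_predictors(data, target, threshold=0.4):
    """Keep parameters whose |rho| with the target is >= threshold."""
    selected = {}
    for name, series in data.items():
        if name == target:
            continue
        rho = spearman(series, data[target])
        if abs(rho) >= threshold:
            selected[name] = rho
    return selected


# Hypothetical readings, for illustration only.
data = {
    "BOD": [2.0, 3.5, 3.0, 5.0, 6.5, 4.0],
    "EC": [80, 120, 110, 150, 210, 160],
    "pH": [7.1, 6.9, 7.4, 7.0, 7.2, 7.3],
}
print(eligible_predictors(data, "BOD"))
```

With these made-up values, only EC passes the 0.4 threshold as a candidate predictor for BOD.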
Moderate positive correlations (ρ) between BOD and EC (0.4141), BOD and NH3-N (0.4982), and BOD and turbidity (0.4643) were observed. Additionally, TP exhibited positive moderate correlations with EC (0.5181), NH3-N (0.4626), and turbidity (0.5265). TN exhibited significant correlations with NH3-N (0.5343), NO3-N (0.4553), and turbidity (0.4201). Notably, EC exhibited a strong positive correlation (coefficient of 0.7041) with TN. These observed correlations for BOD, TP, and TN may be attributed to the discharge of domestic wastewater and agricultural runoff, which are major sources of organic matter and nutrients [25,26,27,28,29,30].
TS in water could be influenced by several direct or indirect factors. Although EC has a high correlation with TS (0.728), which can be explained by the known relationship between EC and total dissolved solids (TDS) [27,30], it may not reflect the true relationship between them in some situations. Therefore, a simple linear regression based on only one variable (EC) may not capture the complexity and variability of water quality. To avoid these problems and to obtain a more accurate and robust predictive model of TS, we used multiple linear regression with all the available variables as predictors. This way, we were able to account for possible interactions and confounding effects among the variables and improve the explanatory power of the model.
The predictive models constructed using the time series from 2008 to 2017 are shown in Table 3.
Regression models are generally adjusted to predict responses for new observations, plot the relationships between variables, or find values that optimize one or more responses. The proposed models were, therefore, adjusted to describe the relationships found between the explanatory variables and the response variable through the regression of generalized linear models.
Table 4 shows the results of the metrics obtained when adjusting the models.
Note that the regressions constructed for each of the parameters obtained an excellent fit between the predicted and observed values, as they presented coefficients of determination greater than 0.60. According to Barros Neto [28], this result indicates that the models can be used for predictive purposes, allowing the equations to estimate the concentrations of BOD, TP, TN, and TS.
Model validation aims to evaluate the performance of equations with datasets that are different from those used in developing the model. To determine the magnitude of the associated distortion, cross-validation was carried out using the coefficient of determination (R²) and the adjusted coefficient of determination (R²adj). To confirm the good performance of the model, the Pearson correlation (r) and the Nash–Sutcliffe coefficient of efficiency (NSE) were used with data collected between 2018 and 2020.
Table 5 shows the coefficients of the adjusted regression models, the coefficients of the validated models, and the NSE for each response variable. NSE values range from negative infinity to one, with higher values indicating better model performance, while lower or negative values suggest poorer model performance [21]. Values less than 0.36 are considered unsatisfactory, values between 0.36 and 0.75 are classified as good, and values greater than 0.75 are regarded as excellent [29].
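For reference, the NSE and the performance bands quoted above can be computed as in the sketch below. The formula is the standard Nash–Sutcliffe definition; the function names and the sample values are illustrative assumptions, not data from the study.

```python
def nash_sutcliffe(observed, predicted):
    """NSE = 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - residual / variance


def classify_nse(nse):
    """Map an NSE value to the bands quoted in the text."""
    if nse < 0.36:
        return "unsatisfactory"
    if nse <= 0.75:
        return "good"
    return "excellent"


# Hypothetical observed and predicted concentrations.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.5, 2.5, 2.5, 4.5, 4.5]
nse = nash_sutcliffe(observed, predicted)
print(nse, classify_nse(nse))
```

A model that always predicts the observed mean yields an NSE of zero, which is why negative values indicate a model that performs worse than that baseline.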
Each of the parameters exhibited R² and R²adj values greater than 0.60, indicating a good fit between the observed and predicted data. This means that the values estimated by the model were close to those observed during the period. Additionally, it should be noted that the coefficients of determination obtained in validation were greater than those obtained in modeling. Hence, the models not only fit the new data but also maintained their performance on datasets other than those used in their construction.
The Pearson correlation (r) for the parameters was close to 1, indicating that for each unit added in one group, there was a proportional increase in the other group. Additionally, the NSE confirmed a similar behavior to that found for the aforementioned metrics. All the parameters showed acceptable performance based on the range of values (0.36–0.75) shown in the literature [18]. It should be noted that the models for TP, TN, and TS obtained an NSE beyond this range and were considered to have good performance.
The regression models proved suitable for predicting the values obtained from laboratory procedures, and they can make the process of monitoring water quality more practical and economically viable [6]. Additionally, the results demonstrate that the regression models obtained in the present study should perform well with water quality datasets from reservoirs under conditions similar to those found in the state of São Paulo, in southeastern Brazil.
3.2. Online Modified Water Quality Indices
In constructing the online modified WQIs, a decision was made to exclude the thermotolerant coliforms (TC) parameter, including its E. coli subset, due to the complexity of obtaining reliable predictive models [31,32,33,34]. To compensate for the omission of TC when calculating the modified WQIs, two strategies were employed. The first strategy involved assigning new weights to each of the parameters that were retained, as presented in Table 6. The second strategy involved weighted redistribution of all the remaining variables, following the methodology proposed by Srivastava et al. [35].
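One plausible reading of the second strategy is a proportional rescaling of the retained weights so that they again sum to one. The sketch below illustrates that idea only; it is an assumption and does not reproduce the exact procedure of Srivastava et al. The weight values, the parameter list, and the multiplicative aggregation of sub-index scores are likewise illustrative assumptions, not values taken from Table 6.

```python
def redistribute_weights(weights, dropped):
    """Rescale the retained weights proportionally so they sum to 1."""
    kept = {name: w for name, w in weights.items() if name not in dropped}
    total = sum(kept.values())
    return {name: w / total for name, w in kept.items()}


def weighted_product_index(scores, weights):
    """Aggregate sub-index scores q_i (0-100) as the product of q_i ** w_i.

    Assumes a multiplicative index; other aggregation forms exist.
    """
    index = 1.0
    for name, weight in weights.items():
        index *= scores[name] ** weight
    return index


# Illustrative weights only (assumed, summing to 1 before TC is dropped).
original_weights = {
    "DO": 0.17, "TC": 0.15, "pH": 0.12, "BOD": 0.10, "temperature": 0.10,
    "TN": 0.10, "TP": 0.10, "turbidity": 0.08, "TS": 0.08,
}
new_weights = redistribute_weights(original_weights, dropped={"TC"})

# Hypothetical sub-index scores for one sample.
scores = {name: 80.0 for name in new_weights}
print(new_weights["DO"], weighted_product_index(scores, new_weights))
```

Because the rescaled weights sum to one, a sample whose sub-index scores are all equal keeps that same value as its index, which is a quick sanity check on any redistribution scheme.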
DO was assigned the highest weight among the parameters that were retained, as it is a key indicator of water quality degradation and loss. Turbidity, which is often, but not exclusively, related to bacterial contamination, obtained a relatively high value compared to the original weights of WQICETESB. In addition, pH was also given a high weight due to its potential to indicate the discharge of industrial wastewater and significant disturbances in aquatic ecosystems [24,27].
Using the aforementioned methods, the modified WQI values were calculated and evaluated using a dataset of water quality data for reservoirs in the state of São Paulo between 2018 and 2020.
Table 7 presents the correlation between WQICETESB values and the values obtained through the modified WQI calculations, using both the assigned weights (WQIAW) and the redistributed weights (WQIRW).
The results demonstrate a high and statistically significant correlation between the original WQICETESB values and the values obtained through the modified WQI calculations, using both the assigned and redistributed weights. This suggests that, despite the omission of TC and the use of estimated concentrations through the regression models, the modified WQIs produced values that closely approximated those obtained using the original CETESB methodology.
Subsequently, water quality classifications made by the reference WQI and the modified WQIs were analyzed to evaluate the efficacy of the proposed WQIs in terms of the range (color) classification presented in Table 1.
Figure 2 presents the water quality obtained through the modified WQIs.
The modified WQIs were shown to be comparable to the method that requires numerous field samplings and laboratory analyses. This was achieved by using sensor readings of electrical conductivity, dissolved oxygen, ammoniacal nitrogen, nitrate-nitrogen, pH, and turbidity, together with information on the current rain regime.
In both modified WQIs, the percentage of overestimation at more than one rating level was low (0.2%), corresponding to a single observation. However, WQIAW had a higher percentage of underestimation (5.2%) than WQIRW, which underestimated only 1.3% of the time. These results suggest that online monitoring should not be used as the sole method for assessing water quality, and that sample collections and laboratory analyses should be conducted, not only when atypical measurements are observed, but also on a periodic basis, even at longer intervals [6,23,26].
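The over- and underestimation percentages discussed above can be tallied as in the following sketch. It assumes five rating levels with lower bounds of 19, 36, 51, and 79; these thresholds stand in for Table 1, which is not reproduced here, and should be treated as an assumption. An overestimation means that the modified index indicates a better rating level than the reference index. All scores are hypothetical.

```python
LEVELS = ["Very Poor", "Poor", "Fair", "Good", "Excellent"]
LOWER_BOUNDS = [19, 36, 51, 79]  # assumed class limits, stand-in for Table 1


def rating_level(score):
    """Return the index into LEVELS for a 0-100 score."""
    return sum(score > bound for bound in LOWER_BOUNDS)


def error_breakdown(reference_scores, modified_scores):
    """Percentage of samples by direction and size of the rating-level error."""
    counts = {"correct": 0, "over_1": 0, "over_2plus": 0,
              "under_1": 0, "under_2plus": 0}
    for ref, mod in zip(reference_scores, modified_scores):
        diff = rating_level(mod) - rating_level(ref)
        if diff == 0:
            counts["correct"] += 1
        elif diff == 1:
            counts["over_1"] += 1
        elif diff > 1:
            counts["over_2plus"] += 1
        elif diff == -1:
            counts["under_1"] += 1
        else:
            counts["under_2plus"] += 1
    n = len(reference_scores)
    return {key: 100 * value / n for key, value in counts.items()}


# Hypothetical reference and modified index scores.
reference = [85, 60, 40, 25, 10]
modified = [82, 85, 38, 60, 30]
print(error_breakdown(reference, modified))
```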
The results obtained from the modified WQIs led to the generation of new fitting regression models for each WQI, which were aimed at reducing the errors associated with the modified indices. The resulting models were of the linear type, utilizing the scores obtained from WQIAW and WQIRW, and the scores obtained from WQICETESB, as presented in Table 8.
Both adjustment models obtained a good fit for the paired observations, as evidenced by R² and R²adj values greater than 0.85. This indicates that the resulting equations can explain more than 85% of the variation observed in the WQICETESB scores, and that the WQIAW and WQIRW adjustment models can be utilized to minimize errors [28,31].
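A linear adjustment of this kind, together with R² and R²adj, can be sketched with ordinary least squares on paired scores. The paired scores below are hypothetical, the function names are illustrative, and the fitted coefficients are not those reported in Table 8.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


def r_squared(y, y_hat, n_predictors=1):
    """Return (R2, adjusted R2) for observed y and fitted y_hat."""
    n = len(y)
    my = sum(y) / n
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - my) ** 2 for a in y)
    r2 = 1 - ss_res / ss_tot
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    return r2, r2_adj


# Hypothetical paired scores: modified index (x) vs. reference index (y).
modified_scores = [60.0, 70.0, 80.0, 90.0]
reference_scores = [55.0, 68.0, 74.0, 88.0]

slope, intercept = fit_line(modified_scores, reference_scores)
adjusted_scores = [slope * s + intercept for s in modified_scores]
print(slope, intercept, r_squared(reference_scores, adjusted_scores))
```

The adjusted coefficient penalizes the number of predictors, so with a single predictor and many observations it stays close to R².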
Pearson correlation analyses were conducted between the adjusted modified WQIs and WQICETESB, with the results presented in Table 9. The coefficients obtained were greater than 0.92, indicating a strong correlation [17,35]. This suggests that the scores derived from the adjusted modified WQIs closely aligned with those obtained using WQICETESB.
Figure 3 illustrates the proportion of correct and incorrect classification levels obtained by the adjusted modified WQIs, considering the intervals presented in Table 1. The success rate of WQIAWadj was slightly lower than that of WQIAW, while there was a 5.8% improvement in the success rate of WQIRW. It was also observed that for WQIRWadj, the adjustment could eliminate errors at more than one rating level, while, for WQIAWadj, it was not possible to eliminate these errors.
The adjustment equation for WQIAWadj decreased overestimation errors to 12.8%, while underestimation errors increased, resulting in an 11.6% underestimation rate in one rating level. A similar trend was observed for WQIRWadj, with a decrease in overestimation error (12.6%) compared to the original value (19.2%), and an increase in underestimation error (10.5%) compared to the original value (5.2%). Although the adjustments were unable to completely eliminate errors, WQIRWadj showed no errors in two or more rating levels of water quality, indicating robust results.
3.3. Comparison with Other Modified WQIs
In order to evaluate the performance of the modified WQIs in comparison to other water quality indices that also used easily measurable parameters, the indices proposed by Naveedullah et al. [4], Pesce and Wunderlin [23], and Moscuzza et al. [24] were compared.
Figure 4 displays a comparison of the modified WQIs, the literature-based WQIs, and WQICETESB, using the water quality database of reservoirs in the state of São Paulo from 2018 to 2020.
WQIRW and WQIAW were found to frequently indicate ‘Excellent’ water quality, which can be attributed to the tendency to classify samples in the ‘Good’ quality level as ‘Excellent’. However, WQIRW exhibited overestimation of the ‘Poor’ rating levels and underestimation of the ‘Fair’ and ‘Very Poor’ rating levels (with the latter considered to be null), while WQIAW was more effective in indicating samples as ‘Fair’. Despite these differences, both WQIRW and WQIAW can provide useful information for decision-making in watershed management.
Upon observing the frequencies of each rating level of water quality indicated by WQIRWadj, it can be concluded that the adjustment was effective in correcting the errors associated with WQIRW and was successful in reducing the primary distortions identified earlier in the modified WQIRW. However, for WQIAWadj, despite the adjustment leading to fewer errors in the ‘Excellent’ rating level, it did not correctly identify any observations as ‘Very Poor’, which resulted in only four observations being classified as such. The adjustment also caused an overestimation of the ‘Poor’ category, although it did lead to an improvement in the success rate in the ‘Good’ and ‘Fair’ rating levels.
The frequencies of the observations obtained by the WQIs proposed in the literature differed from those obtained by the reference WQI. Moreover, in terms of success rate, these indices performed worse than the four modified WQIs proposed in this study. This can be explained by the absence of multiple dimensions of water quality in these indices, coupled with the fact that they were designed to cover the diverse situations, geographical locations, and inherent attributes of distinct water bodies.
3.4. Simplified Classification of Water Quality
Table 10 presents a simplified classification, which considers that water classified as ‘Excellent/Good’ and ‘Poor/Very Poor’ overlap each other mainly with regard to the collection/supply and treatment of water for public/municipal purposes [6].
Figure 5 shows the success and error rates of the proposed WQIs when the simplified classification is used.
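The effect of merging rating levels on the success rate can be illustrated as follows. The mapping assumes that ‘Excellent’ and ‘Good’ are merged, that ‘Poor’ and ‘Very Poor’ are merged, and that ‘Fair’ is kept as its own level; this is inferred from the description above rather than copied from Table 10. The labels in the example are hypothetical.

```python
# Assumed mapping from the five original levels to the simplified scheme.
SIMPLIFIED = {
    "Excellent": "Excellent/Good",
    "Good": "Excellent/Good",
    "Fair": "Fair",
    "Poor": "Poor/Very Poor",
    "Very Poor": "Poor/Very Poor",
}


def success_rate(reference_labels, modified_labels, simplified=False):
    """Percentage of samples where both indices give the same rating level."""
    hits = 0
    for ref, mod in zip(reference_labels, modified_labels):
        if simplified:
            ref, mod = SIMPLIFIED[ref], SIMPLIFIED[mod]
        if ref == mod:
            hits += 1
    return 100 * hits / len(reference_labels)


# Hypothetical classifications from the reference and a modified index.
reference_labels = ["Excellent", "Good", "Fair", "Poor"]
modified_labels = ["Good", "Good", "Fair", "Very Poor"]

print(success_rate(reference_labels, modified_labels))
print(success_rate(reference_labels, modified_labels, simplified=True))
```

Errors between two levels that fall in the same merged class disappear under the simplified scheme, which is why the success rate can only stay the same or increase.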
The modified WQIs achieved notably higher success rates under the simplified scheme. The simplified classification markedly reduced the share of overestimation errors in rating level, from the previously observed range of 12.64% to 27.39% down to a narrower range of 1.92% to 4.41%. The simplified approach also reduced the underestimation error at one rating level, by up to 9.97 percentage points in the case of WQIRWadj.
WQIRWadj outperformed the other modified WQIs, indicating the correct classification in 96.9% of cases, with no errors at more than one rating level. It also had the lowest underestimation error rate (0.6%) and a low overestimation rate (2.5%).
In order to compare the performance of the proposed WQIs with other WQIs proposed in the literature, the simplified classification was used, and the frequencies that each WQI indicated for each category are plotted in Figure 6. It was observed that the modified WQIs performed similarly to each other and to the reference WQI, indicating a good level of agreement. However, the WQIs proposed in the literature exhibited poor performance, with the WQI proposed by Naveedullah et al. [4] being the one that exhibited the best performance among them.
3.5. Validating the Modified Water Quality Indices
During the validation step of the modified WQIs, a database of water quality from reservoirs in the state of São Paulo was used, covering a period prior to that used during the modeling step (2003 to 2007). The results of the modified WQIs in comparison to WQICETESB are presented in Figure 7.
All of the compared indices had a success rate of approximately 70%, with WQIRW having the lowest performance at 69.54%. Conversely, WQIAWadj had the highest success rate at 73.25%. These success rates were similar to the values obtained during the construction of the modified WQIs. Furthermore, the adjusted indices were able to eliminate errors at more than one rating level during the validation step. WQIRWadj stands out as it succeeded in eliminating errors at more than one rating level in both the construction and validation steps, and presented good success rates (76.82% and 72.66%, respectively) throughout the present study.
Additionally, it is important to note that WQIRWadj continued to exhibit a 10% rate of overestimation at one rating level, while also displaying a portion of underestimation at one rating level, which reached 16.5%. Furthermore, WQIAWadj showed a decrease in the overestimation error to 8.3% at one rating level, but an increase in the underestimation error to 18.4% at one rating level, when compared to the results obtained during the construction step of the modified WQI.
Based on the results obtained in the present study, WQIRWadj was found to be the most effective modified WQI. During the construction phase, it demonstrated the highest rate of correct classification, with no errors occurring at more than one rating level. In the validation phase, it continued to perform well, with no errors occurring at more than one rating level and achieving a low overestimation error percentage, resulting in a high success rate.
To assess the performance of modified indices relative to WQICETESB, a correlation analysis was conducted between the scores obtained by modified indices proposed in other studies [4,23,24] and those obtained by modified indices proposed in this study.
Table 11 presents the correlation values obtained, allowing for a comparison of the performance of the different modified WQIs using WQICETESB as a reference. Thus, it is possible to verify that the indices proposed in the present study, both modified and adjusted modified, presented very strong correlations (>0.9279) with the values obtained by the CETESB water quality assessment methodology. In contrast, we observed that the modified index proposed by Pesce and Wunderlin [23] failed to obtain results similar to those obtained by WQICETESB. The modified indices proposed by Naveedullah et al. [4] and by Moscuzza et al. [24] had better performances, showing strong (0.7467) and moderate (0.6511) correlations, respectively.
The simplified classification scheme presented in Table 10 was also used in the validation step to evaluate the performance of modified WQIs when using fewer rating levels.
Figure 8 shows the results achieved in this step for each of the modified WQIs.
It can be observed that all the WQIs had a high success rate, around 96%, in the validation step. However, the error at more than one rating level remained in the modified WQIs. This type of error is inadmissible because it indicates a water quality very different from reality, which can lead to poor decision-making. It is noteworthy that the adjustment was able to eliminate this type of error in both the attributed-weight and redistributed-weight strategies. Even with the WQI adjustment, errors in indicating the status of the water quality still occurred, but they were reduced. In most cases, the WQIs either indicated the correct water quality rating or indicated a worse rating level than the actual one. Thus, the results validated the efficiency of the adjusted modified WQIs when applying a simplified classification scheme. In general, WQIRWadj stood out for having the smallest share of underestimation error and a higher success rate.
The modified WQIs constructed in this study are capable of indicating water quality classifications for datasets other than those used in their modeling and construction, as evidenced by the validation step. The performance of WQIRWadj should also be highlighted, as this WQI presented no errors at more than one rating level, a lower overestimation error rate, and a very high success rate. WQIRWadj can be renamed as WQISOL to make it more accessible and promote its dissemination for use in monitoring reservoirs. The letter S indicates the locality for which it was designed, i.e., the State of São Paulo, and the letters OL denote the methodology of determining the status of water quality, which is an online determination method. Therefore, the initials form the word “SOL,” which means sun in Portuguese, giving it an even more friendly connotation. To encourage the application of WQISOL in monitoring reservoirs, a schematic diagram for the calculation of the modified WQI was prepared to facilitate the application of the methodology proposed in the present study; it is shown in Figure 9.
The process for calculating WQISOL, as shown in Figure 9, involves several steps. First, measurement data obtained from sensors are used as inputs for the prediction regression models, which generate predicted values for relevant parameters. These predicted values are then used to calculate WQIRW. The resulting output value is then passed through the adjustment equation. Based on the resulting value, the appropriate classification method can be selected. To help users understand the process, explanatory notes, such as equations or weighted values, have been included in the diagram, denoted by a line and an empty diamond with a parenthesis. As a result, the diagram can serve as a guide for using WQISOL to monitor reservoirs.
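The sequence of steps described above can be sketched as a small pipeline. Everything in this sketch is a placeholder: the regression coefficients, quality curves, weights, adjustment equation, and classification thresholds are hypothetical inputs supplied by the caller, not the values from Tables 3, 6, and 8 or from Figure 9. The multiplicative aggregation is an assumption, and the rain regime input mentioned earlier is omitted for brevity.

```python
def predict_parameters(sensors, models):
    """Apply linear models of the form intercept + sum(coef * reading)."""
    predicted = {}
    for target, model in models.items():
        value = model["intercept"]
        for name, coef in model["coefficients"].items():
            value += coef * sensors[name]
        predicted[target] = value
    return predicted


def wqi_sol(sensors, models, quality_curves, weights, adjustment, classify):
    """Sensor readings -> predicted parameters -> WQI_RW -> adjusted score."""
    predicted = predict_parameters(sensors, models)
    values = {**sensors, **predicted}
    # Weighted product of sub-index scores (assumed aggregation form).
    wqi_rw = 1.0
    for name, weight in weights.items():
        wqi_rw *= quality_curves[name](values[name]) ** weight
    slope, intercept = adjustment
    adjusted = slope * wqi_rw + intercept
    return adjusted, classify(adjusted)


# Hypothetical inputs, for illustration only.
sensors = {"EC": 100.0, "DO": 8.0}
models = {"TS": {"intercept": 10.0, "coefficients": {"EC": 0.5}}}
quality_curves = {"DO": lambda v: 90.0, "TS": lambda v: 40.0}
weights = {"DO": 0.5, "TS": 0.5}
adjustment = (1.0, 5.0)  # placeholder slope and intercept


def classify(score):
    """Placeholder two-class rule; a real rule would follow Table 1 or 10."""
    return "Good" if score > 51 else "Other"


print(wqi_sol(sensors, models, quality_curves, weights, adjustment, classify))
```

Passing the quality curves and the classification rule as arguments keeps the pipeline independent of whether the full or the simplified classification scheme is selected in the final step.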