### *2.1. Dataset*

The Aquarisc project team collected the data using sensors that measure water quality parameters and meteorological variables at several water treatment plants in the department of Cauca (Colombia). The measurements were recorded from January 2020 to January 2021. However, an initial analysis revealed measurement failures in the precipitation parameter. For this reason, we used a meteorological database provided by the Institute of Hydrology, Meteorology and Environment Studies (IDEAM), accessed through the open data portal on its website [13].

The following parameters were used in this study: pH, dissolved oxygen (DO), conductivity, oxidation-reduction potential (ORP), turbidity, temperature (T), relative humidity (RH), vapor pressure (VP), barometric pressure (BP), wind speed, precipitation (P), and radiation.

### *2.2. Data Modeling*

### 2.2.1. Data Pre-Processing

The dataset preparation process consisted of three phases: data structuring, cleaning, and fusion.


which resulted in a dataset consisting of one treatment plant in the urban area and three treatment plants in the rural area.

• Data fusion phase: Once the data were organized and cleaned, they were fused with the dataset provided by IDEAM. Since the two datasets did not share the same time scale, the IDEAM dataset was temporally adapted so that its records matched each point in the database obtained from the previous phases.
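The method used for the temporal adaptation is not detailed here. As a minimal sketch of one plausible approach, the code below performs a backward "as-of" match, assigning to each sensor timestamp the most recent meteorological record at or before it. The function name `align_asof`, the hourly spacing, and all values are hypothetical and serve only as illustration.

```python
from bisect import bisect_right
from datetime import datetime


def align_asof(sensor_times, met_records):
    """Return, for each sensor timestamp, the latest meteorological value
    recorded at or before it (None if no earlier record exists).

    met_records must be a list of (timestamp, value) tuples sorted by timestamp.
    """
    met_times = [t for t, _ in met_records]
    aligned = []
    for t in sensor_times:
        # Number of meteorological records with timestamp <= t
        i = bisect_right(met_times, t)
        aligned.append(met_records[i - 1][1] if i > 0 else None)
    return aligned


# Hypothetical hourly precipitation records (assumed, not from the study)
met = [
    (datetime(2020, 12, 1, 0, 0), 0.0),
    (datetime(2020, 12, 1, 1, 0), 2.5),
]
# Hypothetical sensor timestamps at a finer, irregular time scale
sensor = [
    datetime(2020, 11, 30, 23, 50),
    datetime(2020, 12, 1, 0, 10),
    datetime(2020, 12, 1, 1, 0),
    datetime(2020, 12, 1, 1, 40),
]
precip = align_asof(sensor, met)
```

A nearest-neighbor match or interpolation would be equally reasonable alternatives; the appropriate choice depends on how quickly each meteorological variable changes relative to the sampling interval.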

### 2.2.2. Exploratory Analysis of Data

• Distribution of water turbidity: The turbidity distributions at the four treatment plants are strongly right-skewed. To reduce this asymmetry and bring the variable closer to a normal distribution, a logarithmic transformation was applied. Because the turbidity measurements contain many zero values, one unit was added before taking the logarithm to avoid values of −∞. The transformation applied to the water turbidity variable was:

$$
\log \text{Turbidity} = \log(\text{turbidity} + 1) \tag{1}
$$
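Equation (1) can be illustrated with a short sketch. The base of the logarithm is not stated, so the natural logarithm is assumed here; the function names and sample values are hypothetical.

```python
import math


def log_turbidity(turbidity):
    """Apply Equation (1): log(turbidity + 1), assuming the natural logarithm."""
    # math.log1p(x) computes log(1 + x) accurately, including for x == 0
    return math.log1p(turbidity)


def inverse_log_turbidity(value):
    """Map a transformed value back to the original turbidity scale."""
    return math.expm1(value)


# Hypothetical turbidity readings, including a zero value
values = [0.0, 1.5, 40.0]
transformed = [log_turbidity(v) for v in values]
recovered = [inverse_log_turbidity(t) for t in transformed]
```

A zero reading maps to zero instead of −∞, and predictions made on the transformed scale can be converted back with the inverse function.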

Plotting a histogram for each station gives a clearer view of the distribution of the variables. For example, Figures 1 and 2 show the histograms of the log-transformed water turbidity at the four stations.

**Figure 1.** Distribution of the logarithmic transformation of the water turbidity variable at two of the four stations with available data (Timbio and Popayan) after data cleaning.

**Figure 2.** Distribution of the logarithmic transformation of the water turbidity variable at two of the four stations with available data (La Sierra and Santander) after data cleaning.

The Popayan station, located in the urban zone, shows greater variability. However, this observation cannot be taken as conclusive: other water treatment stations may have even greater variability and higher overall turbidity levels, but the available data are insufficient to confirm it.

• Analysis of the relationship of variables with turbidity: A correlation analysis was conducted to investigate the possible relationship between turbidity and various water variables at all stations. In the initial analysis, no significant correlation was found between turbidity and variables such as pH, DO, conductivity, and ORP. We therefore repeated the analysis using the logarithmic transformation of turbidity against the same water variables. Although the correlation coefficients increased, they remained low, indicating no linear relationship between turbidity and the examined water variables. This result suggests that a non-linear model is needed to capture the relationship between these variables; a linear model would likely produce inaccurate predictions and large error metrics, making interpretation and decision-making difficult.
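The two-step correlation check described above can be sketched as follows. This is an illustration only: the Pearson coefficient is assumed as the correlation measure (it is not named here), and the readings are hypothetical placeholders, not data from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical paired readings (assumed values, not from the dataset)
ph = [6.8, 7.0, 7.1, 7.3, 6.9, 7.2]
turbidity = [0.0, 3.0, 1.0, 40.0, 2.0, 5.0]

# Step 1: correlation with raw turbidity
r_raw = pearson(ph, turbidity)

# Step 2: correlation with log-transformed turbidity, log(turbidity + 1)
r_log = pearson(ph, [math.log1p(t) for t in turbidity])
```

The Pearson coefficient only measures linear association, so a low value in both steps does not rule out a non-linear dependence between the variables.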

### 2.2.3. Data Modeling

For data modeling, a process of experimentation was carried out in which the following variations were considered:


The root mean square error (RMSE) was the metric used to determine the best model.
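For reference, the standard definition of the RMSE is given below. The symbols are introduced here for illustration and are not taken from the source: $y_i$ is the measured turbidity for sample $i$, $\hat{y}_i$ is the corresponding model prediction, and $n$ is the number of samples in the evaluation set. Whether the error was computed on raw or log-transformed turbidity is not stated.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
$$

Lower values indicate predictions closer to the measurements, expressed in the same units as the target variable.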

### **3. Results and Discussion**

During the linear modeling process, the database of the four stations was considered, and some variations were performed to determine which model offered better performance. The results obtained are presented in Table 1.

**Table 1.** Summary of linear modeling.


The linear models produced high RMSE values, suggesting no clear linear relationship between the parameters and the turbidity variable. Consequently, we sought alternative models that could better fit the data and conducted a non-linear analysis using different algorithms.

The K-Neighbors Regressor algorithm was employed as the first non-linear model, using the database composed of the four stations. The second model was based on the Extra Trees Regressor algorithm and was evaluated solely at the Popayan station. Finally, the best-performing model was based on the Random Forest algorithm and was evaluated on the database of the three rural stations.

All three models used all parameters as predictors, and their performance improved significantly when a temporal data split was performed, with the month of December as the test set. Table 2 summarizes the results of these models and their corresponding performance metrics.
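A temporal split of this kind can be sketched as follows. Several details are assumptions made for illustration: the test set is taken to be December 2020, records after December are simply left out, the row layout and values are hypothetical, and a mean-value baseline stands in for the actual regressors.

```python
import math
from datetime import date


def temporal_split(rows, test_start, test_end):
    """Train on rows dated before test_start; test on rows in [test_start, test_end)."""
    train = [r for r in rows if r["date"] < test_start]
    test = [r for r in rows if test_start <= r["date"] < test_end]
    return train, test


def rmse(y_true, y_pred):
    """Root mean square error between measurements and predictions."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical records (assumed values, not from the dataset)
rows = [
    {"date": date(2020, 10, 5), "turbidity": 2.0},
    {"date": date(2020, 11, 12), "turbidity": 4.0},
    {"date": date(2020, 12, 3), "turbidity": 3.0},
    {"date": date(2020, 12, 20), "turbidity": 5.0},
    {"date": date(2021, 1, 8), "turbidity": 6.0},
]

# Hold out December 2020; earlier months form the training set
train, test = temporal_split(rows, date(2020, 12, 1), date(2021, 1, 1))

# Placeholder model: predict the training mean for every test record
baseline = sum(r["turbidity"] for r in train) / len(train)
score = rmse([r["turbidity"] for r in test], [baseline] * len(test))
```

Unlike a random split, holding out a contiguous month keeps future measurements out of the training set, so the score reflects how a model would behave on unseen periods.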

**Table 2.** Summary of non-linear modeling.


The results show that the Random Forest model performs best in predicting water turbidity. In contrast, the K-Neighbors-based model overfitted the training data, while the Extra-Trees-based model yielded a high RMSE value. Consequently, the best-performing model was evaluated in more detail to reduce the number of predictors and adapt it to the limited instrumentation available in rural water treatment plants.

This evaluation yielded a model that uses only pH, T, VP, and P as predictors and was the most effective in predicting water turbidity. Selecting these predictors reduced the number of parameters in the model without compromising prediction accuracy. Notably, the meteorological parameters strongly influenced the variation of water turbidity during the evaluation, suggesting that local weather conditions have a significant impact on it.

In summary, the results demonstrate that a non-linear predictive model of water turbidity can be obtained with fewer predictors, which is particularly beneficial in rural areas where instrumentation is limited. Furthermore, the importance of including meteorological parameters as input variables in the model has been emphasized: their significant impact on water turbidity can be crucial for predicting the parameter and, therefore, for supporting decisions on appropriate water treatment.
