#### *2.1. DES Characteristics: Experimental pH Values and σ Profiles*

This work aimed to develop a simple and robust mathematical model for predicting the pH values of DESs based on *S*<sup>i</sup><sub>mix</sub> descriptors. To obtain a user-friendly model that predicts pH over a wide range, we selected both acidic and basic DESs from our database. We chose 38 DESs by varying the HBA, the HBD, and the water share (Table 1). The selected HBAs and HBDs can be roughly classified as quaternary ammonium salts (choline chloride, betaine), amino acids (proline), organic acids (citric and malic acid), and sugars (fructose, glucose, sucrose, xylose). These classes offer more HBD than HBA candidates, and the choice of HBD has been shown to have a direct effect on pH (Table 1). Overall, the synthesized DESs cover a wide pH range, from 0.36 for Ch:CA containing 30% water (*w*/*w*) to 9.31 for Ch:U containing 10% water (*w*/*w*). Monitoring the pH of the same HBA/HBD pair while varying the water content shows that water influences the measured pH. However, this influence is specific to each individual DES and cannot be generalized to all DESs studied in this work.


**Table 1.** Experimentally measured pH values.



Furthermore, the DESs were described mathematically using the σ profile computed with the COSMOtherm software. The HBA and HBD molecules were optimized in TmoleX with respect to both energy and geometry. The generated COSMO files contain all information necessary for the calculation of the σ profile function and thus for the calculation of the σ profile descriptors. For the preparation of the descriptor set, the DESs were modeled as a molar mixture of HBA and HBD according to Table 1. The σ profile curves for each HBA and HBD were divided into 10 regions, the area under each region was calculated, and the resulting numerical values were correlated with the experimental pH values using mathematical models.
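The descriptor construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the 10 regions are taken to be of equal width in σ, and the mixture descriptor is taken to be the mole-fraction-weighted sum of the component areas. The region boundaries and the exact mixing rule used in this work are not given in this section.

```python
import numpy as np


def region_areas(sigma, p_sigma, n_regions=10):
    """Area under a sigma profile in n_regions equal-width sigma intervals.

    Assumption: equal-width regions spanning the whole sigma grid.
    `sigma` must be sorted in increasing order.
    """
    sigma = np.asarray(sigma, dtype=float)
    p_sigma = np.asarray(p_sigma, dtype=float)
    # Cumulative trapezoidal integral evaluated at every grid point.
    steps = 0.5 * (p_sigma[1:] + p_sigma[:-1]) * np.diff(sigma)
    cumulative = np.concatenate(([0.0], np.cumsum(steps)))
    # Region boundaries, then the integral between consecutive boundaries.
    # Interpolating the cumulative integral is exact when the boundaries
    # coincide with grid points and an approximation otherwise.
    edges = np.linspace(sigma[0], sigma[-1], n_regions + 1)
    return np.diff(np.interp(edges, sigma, cumulative))


def mixture_descriptors(component_areas, mole_fractions):
    """Mole-fraction-weighted sum of per-component region areas.

    Assumption: a linear mixing rule; fractions are normalized to sum to 1.
    """
    areas = np.asarray(component_areas, dtype=float)  # (n_components, n_regions)
    x = np.asarray(mole_fractions, dtype=float)
    x = x / x.sum()
    return x @ areas
```

For a DES with an HBA:HBD molar ratio of 1:2, for example, the call would be `mixture_descriptors([areas_hba, areas_hbd], [1, 2])`, where both area vectors come from `region_areas` applied to the respective σ profile.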

#### *2.2. Multiple Linear Regression and Piecewise Linear Regression*

The applicability of the MLR and PLR models for predicting the pH values of DESs was assessed using the coefficient of determination (*R*<sup>2</sup>), the adjusted coefficient of determination (*R*<sup>2</sup><sub>adj</sub>), and the root mean square error (*RMSE*). The obtained model coefficients and the basic statistical analysis are presented in Table 2, while a comparison between the experimental and model-estimated pH values is given in Figure 1.

**Table 2.** MLR and PLR regression coefficients. Statistically significant coefficients are marked in bold.


**Figure 1.** Comparison between experimental data and (**a**) MLR model, (**b**) PLR model, and (**c**) ANN model. (-) data set for model development, (ᶺ) data set for model validation.

As described in the literature, linear regression calculates an equation that minimizes the sum of squared distances between the fitted line and the data points. In general, a model fits the data well if the discrepancies between the observed and predicted values are small and unbiased. According to Cheng et al. (2014) [16], the coefficient of determination and the adjusted coefficient of determination can be considered summary measures of the goodness of fit of any linear regression model. Moreover, Le Mann et al. (2010) stated that a model can be regarded as appropriate if the coefficient of determination is above 0.75 [17]. On this basis, both the MLR (*R*<sup>2</sup> = 0.7758) and PLR (*R*<sup>2</sup> = 0.9654) models developed in this work are applicable for describing the pH values of DESs from the *S*<sup>i</sup><sub>mix</sub> descriptors, although not with the same accuracy. The *RMSE* values show that the PLR model (Figure 1b) gives considerably smaller data dispersion (*RMSE* = 0.6558) than the MLR model (*RMSE* = 1.1865) (Figure 1a). As previously described, a high-accuracy model is strongly desired. However, higher accuracy is usually achieved by increasing the number of model parameters and hence the model complexity. For practical application, a model with fewer parameters is easier to interpret and, therefore, more suitable.
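For reference, the three goodness-of-fit measures discussed above can be computed as in the following sketch. The function name is hypothetical, and the *RMSE* is assumed here to be normalized by the number of observations; some software divides by the residual degrees of freedom instead, and this section does not state which convention was used.

```python
import math


def fit_metrics(y_obs, y_pred, n_predictors):
    """Return (R2, adjusted R2, RMSE) for observed and predicted values."""
    n = len(y_obs)
    mean_obs = sum(y_obs) / n
    # Residual and total sums of squares.
    ss_res = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in y_obs)
    r2 = 1.0 - ss_res / ss_tot
    # The adjusted R2 penalizes the number of predictors in the model.
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
    # Assumption: RMSE normalized by n rather than by n - p - 1.
    rmse = math.sqrt(ss_res / n)
    return r2, r2_adj, rmse
```

Because *R*<sup>2</sup><sub>adj</sub> decreases as predictors are added without a matching drop in residual error, it is the more informative of the two when models with different numbers of parameters are compared.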

A high *R*<sup>2</sup> value alone does not guarantee that the model fits the data well, so the goodness of fit was further checked by residual analysis. The residuals of a fitted model are the differences between the observed responses and the corresponding predictions computed with the regression function. If the model's fit to the data were correct, the residuals would approximate the random errors that make the relationship between the explanatory variables and the response variable a statistical one. Therefore, residuals that appear to behave randomly suggest that the model fits the data well [18]. In the results presented in Figure 2, the residuals for the MLR and PLR models were found to be normally distributed (Figure 2a,b). Because the points in the normal probability plots fall roughly along a straight line, the normality condition was met. The bell-shaped histograms of the residuals also support their normal distribution (Figure 2a,b). The residual vs. predicted value plots (Figure 2a,b) show no pattern in the residuals, implying that the models match the experimental data well. Additionally, the residuals scatter around the central value (Figure 2a,b) without obvious outliers, which means that the level of randomization was appropriate and that the sequence of testing had no effect on the findings [19].
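The straight-line criterion used for the normal probability plots can also be expressed numerically. The sketch below is an illustration, not the procedure used in this work: it pairs the sorted residuals with standard normal quantiles and returns their correlation, which approaches 1 when the points lie on a straight line. The function name and the plotting positions (*i* − 0.5)/*n* are assumptions of this example.

```python
from statistics import NormalDist, mean


def normal_plot_correlation(residuals):
    """Correlation between sorted residuals and standard normal quantiles."""
    ordered = sorted(residuals)
    n = len(ordered)
    # Theoretical quantiles at plotting positions (i - 0.5) / n, i = 1..n.
    quantiles = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    mean_r, mean_q = mean(ordered), mean(quantiles)
    # Pearson correlation computed from its definition.
    cov = sum((r - mean_r) * (q - mean_q) for r, q in zip(ordered, quantiles))
    var_r = sum((r - mean_r) ** 2 for r in ordered)
    var_q = sum((q - mean_q) ** 2 for q in quantiles)
    return cov / (var_r * var_q) ** 0.5
```

A value close to 1 is consistent with normally distributed residuals, whereas a single large outlier or a strongly skewed distribution lowers it noticeably.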

Analysis of the MLR and PLR model coefficients showed that all coefficients except *b*<sup>6</sup> (the coefficient multiplying *S*<sup>6</sup><sub>mix</sub>) were statistically significant. In both models, the coefficients *b*<sup>1</sup> to *b*<sup>5</sup> have a positive influence on the output variable, while the coefficients *b*<sup>6</sup> to *b*<sup>10</sup> have a negative influence. The results are easily interpreted: *b*<sup>1</sup> to *b*<sup>5</sup> are associated with the negative potential region and thus with hydrogen bond accepting and basicity properties, whereas *b*<sup>7</sup> to *b*<sup>10</sup> are associated with the positive potential region and thus with hydrogen bond donating and acidity properties. *b*<sup>6</sup> is related to the neutral potential region, which contributes insignificantly to the pH value. For the other *b* coefficients, the more distant the potential region is from zero (the neutral value), the stronger its influence (whether positive or negative) on the pH value. Thus, the model seems to have a clear and rather simple physical significance. Although statistical analysis showed that the coefficient *b*<sup>6</sup> was not significant, the variable *S*<sup>6</sup><sub>mix</sub> was not excluded from the modeling. A non-significant coefficient indicates no detectable correlation with the dependent variable at the population level, but this could change if a different data set were used.
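To make the role of the *b* coefficients concrete, the following sketch fits a linear model of the form pH = *b*<sub>0</sub> + Σ *b*<sub>i</sub> *S*<sup>i</sup><sub>mix</sub> by ordinary least squares. It is a hedged illustration only: the intercept *b*<sub>0</sub>, the function names, and all numerical values are assumptions of this example. The data are synthetic placeholders that merely mimic the sign pattern described above and are not the coefficients or measurements of this work.

```python
import numpy as np


def fit_mlr(S, pH):
    """Least-squares coefficients [b0, b1, ..., bk] for pH = b0 + S @ b."""
    S = np.asarray(S, dtype=float)
    # Prepend a column of ones for the (assumed) intercept b0.
    X = np.column_stack([np.ones(S.shape[0]), S])
    coef, *_ = np.linalg.lstsq(X, np.asarray(pH, dtype=float), rcond=None)
    return coef


def predict_mlr(coef, S):
    """Predicted pH for descriptor rows S using fitted coefficients."""
    return coef[0] + np.asarray(S, dtype=float) @ coef[1:]


# Synthetic demonstration: 38 mixtures, 10 descriptors, noise-free response.
rng = np.random.default_rng(0)
S_demo = rng.random((38, 10))
b_demo = np.array([7.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.0, -1.0, -2.0, -3.0, -4.0])
pH_demo = b_demo[0] + S_demo @ b_demo[1:]
coef_demo = fit_mlr(S_demo, pH_demo)
```

With a noise-free synthetic response, the fit recovers the placeholder coefficients, including a near-zero value for the sixth descriptor, which is the situation in which a coefficient would be reported as not significant.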

**Figure 2.** Analysis of the residuals for the MLR model (**a**–**d**), PLR model (**e**–**h**), and ANN model (**i**–**l**).

The ANOVA revealed that the MLR and PLR models were statistically significant, with *p* values < 0.001. Moreover, according to Greenland et al. (2016) [20], higher *F*-test results (*F* value = 39.8120) and lower *p* values indicate the relative relevance of the created models. Overall, these results demonstrate the reliability of the models across the range of variables evaluated.
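For orientation, the overall *F* statistic of a linear regression with an intercept is related to *R*<sup>2</sup> as in the sketch below. The function name is hypothetical, and the example does not attempt to reproduce the *F* value reported above, since the number of observations and predictors entering that ANOVA are not stated in this section.

```python
def overall_f_statistic(r2, n_obs, n_predictors):
    """Overall regression F statistic from R2 (model with an intercept)."""
    df_model = n_predictors
    df_resid = n_obs - n_predictors - 1
    if df_resid <= 0:
        raise ValueError("need more observations than model parameters")
    # Ratio of explained to unexplained variance, each per degree of freedom.
    return (r2 / df_model) / ((1.0 - r2) / df_resid)
```

For a fixed *R*<sup>2</sup>, the statistic falls as predictors are added and rises with the number of observations, which is why it is read together with the *p* value.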
