Next Article in Journal
Chemical Composition, Antioxidant, Antibacterial, and Antibiofilm Activities of Backhousia citriodora Essential Oil
Next Article in Special Issue
Enhanced Recovery of Natural Antioxidants from Grape Waste Using Natural Eutectic Solvents-Based Microwave-Assisted Extraction
Previous Article in Journal
Metagenomics Approach to the Intestinal Microbiome Structure and Abundance in High-Fat-Diet-Induced Hyperlipidemic Rat Fed with (−)-Epigallocatechin-3-Gallate Nanoparticles
Previous Article in Special Issue
Liquefying Flavonoids with Terpenoids through Deep Eutectic Solvent Formation
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Predicting the Surface Tension of Deep Eutectic Solvents: A Step Forward in the Use of Greener Solvents

by
Amit Kumar Halder
1,2,*,
Reza Haghbakhsh
3,4,
Iuliia V. Voroshylova
1,
Ana Rita C. Duarte
4 and
Maria Natalia D. S. Cordeiro
1,*
1
LAQV@REQUIMTE, Department of Chemistry and Biochemistry, Faculty of Sciences, University of Porto, 4169-007 Porto, Portugal
2
Dr B C, Roy College of Pharmacy and Allied Health Sciences, Dr. Meghnad Saha Sarani, Bidhannagar, Durgapur 713212, WB, India
3
Department of Chemical Engineering, Faculty of Engineering, University of Isfahan, Isfahan 81746-73441, Iran
4
LAQV@REQUIMTE, Department of Chemistry, NOVA School of Science and Technology, 2829-516 Caparica, Portugal
*
Authors to whom correspondence should be addressed.
Molecules 2022, 27(15), 4896; https://doi.org/10.3390/molecules27154896
Submission received: 11 July 2022 / Revised: 28 July 2022 / Accepted: 28 July 2022 / Published: 31 July 2022

Abstract

:
Deep eutectic solvents (DES) are an important class of green solvents that have been developed as an alternative to toxic solvents. However, the large-scale industrial application of DESs requires fine-tuning their physicochemical properties. Among others, surface tension is one of such properties that have to be considered while designing novel DESs. In this work, we present the results of a detailed evaluation of Quantitative Structure-Property Relationships (QSPR) modeling efforts designed to predict the surface tension of DESs, following the Organization for Economic Co-operation and Development (OECD) guidelines. The data set used comprises a large number of structurally diverse binary DESs and the models were built systematically through rigorous validation methods, including ‘mixtures-out’- and ‘compounds-out’-based data splitting. The most predictive individual QSPR model found is shown to be statistically robust, besides providing valuable information about the structural and physicochemical features responsible for the surface tension of DESs. Furthermore, the intelligent consensus prediction strategy applied to multiple predictive models led to consensus models with similar statistical robustness to the individual QSPR model. The benefits of the present work stand out also from its reproducibility since it relies on fully specified computational procedures and on publicly available tools. Finally, our results not only guide the future design and screening of novel DESs with a desirable surface tension but also lays out strategies for efficiently setting up silico-based models for binary mixtures.

1. Introduction

The last two decades have witnessed a significant shift in the design and development of new chemicals for large-scale industrial applications. One of such efforts has been driven towards the replacement of flammable and environmentally hazardous substances with green and sustainable solvents. More benign solvents, even if dispensed in a large amount into the environment, are known to produce less harmful effects [1,2,3]. Deep eutectic solvents (DES) represent a class of “green solvents” with tremendous potential to replace conventional toxic chemicals. Indeed, apart from having a wide range of applications, DESs exhibit much less environmental toxicity, even when compared to their predecessor, ionic liquids (ILs) [4,5]. Thus, it is not surprising that the emergence of DESs at the beginning of this century drew considerable attention from the scientific community, as is confirmed by the growing number of publications related to DESs in the last two decades [4,5]. DESs may simply be defined as the low melting point mixture of at least two compounds, one acting as a hydrogen bond acceptor (HBA) and another as a hydrogen bond donor (HBD) in a specific molar ratio [6,7]. Their low melting point, which is a result of complex hydrogen bonding interactions between the components of DESs, allows them to remain in the liquid phase at room temperature [8]. Besides being less eco-toxic in nature, DESs are generally easy to prepare, cost-effective and biocompatible [7,8,9]. However, even with many advantages, the suitability of any chemical for long-term industrial applications often depends on its fundamental physical properties, such as density, viscosity, surface tension, vapor pressure, the speed of sound etc. [4,5], and DESs are of no exception. The physicochemical profile of a DES can readily be tailored by choosing different combinations of their starting components or by modifying their chemical structures [10,11]. Still, the number of possible combinations that can be envisaged to form DESs is extremely high. As such, without detailed knowledge of the relation between structure and properties, their fine-tuning is barely applicable in practice and often limited to a trial-and-error procedure.
Surface tension is one of the most crucial physical properties that must be considered, as it is required for the set-up of industrial processes, such as the design of heating systems, distillation columns and heat exchangers [12]. Therefore, the measurement of their surface tension is essential to assess the suitability of DESs for industrial applications. Normally, with increasing temperature, the surface tension of a DES decreases, but it is well-known that its components and their molar ratio are also responsible for their resulting surface tension [13].
Previously, we have reported a general thermodynamic model to estimate the surface tension for DESs of different nature [14]. The model has been developed with an up-to-date data bank containing surface tension values for a large number of structurally diverse DESs. The question that yet remains is whether a more predictive model for the surface tension of DESs can be achieved by alternative in silico-based modeling approaches such as the one proposed here, Quantitative Structure-Property Relationships (QSPR). In fact, the application of QSPR modeling techniques has long stood as particularly useful to estimate a wide range of properties of different materials [15,16,17,18,19,20,21]. Thus, many QSPR modeling studies have been devoted to different physicochemical properties of DESs. Very recently, for example, Wang et al. developed QSPR models based on Conductor-like Screening Model for Real Solvents (COSMO-RS) descriptors for characterizing the CO2 solubility in DESs [22]. The authors found that the linear model was unable to successfully fit the whole dataset but a random forest non-linear model showed greater reliability, judging from the Absolute-Average-Relative-Deviation (AARD) value of 7.8% attained for that data (59 DESs). Balali and co-workers developed also QSPR models for probing the thiophene distribution between choline chloride (ChCl)-based DES and hydrocarbon phases in ternary systems [23]. The proposed linear models displayed good accuracy and included topological descriptors, which indicate the influence of the degree of the structure of HBDs on the thiophene distribution. In another study, Khajeh et al. employed QSPR modeling for predicting the melting and freezing points of DESs [24]. Their results showed that both properties of 181 DESs could be predicted with good accuracy (R2~0.80) by the derived linear models. Multiple attempts have been undertaken as well to set up linear and non-linear QSPR-based models for estimating the density and viscosity of DESs [25,26,27]. All the later models resorted to COSMO-RS descriptors and showed great predictivity performance (R2 > 0.95). However, these models have been based on a small number of data points pertaining to just certain families of DESs (e.g.: 49 hydrophilic [26] or 54 hydrophobic [27] DESs), which thus limits their general applicability.
The present work is encouraged by our very recent QSPR modeling efforts on the density of DESs [28], which yielded statistically more robust models than a thermodynamic model developed with the same dataset [29]. On one hand, such QSPR efforts demonstrated to offer more options and versatility for setting up predictive models as compared to thermodynamic modeling. On the other hand, QSPR models demand several statistical conditions be satisfied just as inspected here as per the guidelines of the Organization for Economic Co-operation and Development (OECD) [30] to expand their overall applicability as well as statistical reliability [31]. Moreover, in order to address the requirement of robust validation strategies applicable for the binary mixtures, we have recently designed an open-source standalone Python-based tool “QSAR-Mx” (freely available to download at https://github.com/ncordeirfcup/QSAR-Mx, last accessed on 28 April 2022) [28]. This work extensively utilizes such a tool to set up the predictive QSPR models for probing the surface tensions of DESs. Therefore, the scope of the current work goes beyond the development of QSPR models by proposing and comparatively testing novel methodologies that can be utilized in the future towards reliably predicting the properties of binary mixtures.

2. Materials and Methods

2.1. Dataset Collection and Splitting

The dataset employed for the development of the QSPR models was adopted from our previously published work on DES surface tension estimation [14]. It contains 553 data points, compiled from 112 different binary and ternary DESs of diverse compositions. However, here we solely focused on binary DESs and therefore the current dataset was reduced to 530 data points pertaining to 99 different binary DESs. This dataset was combined with an additional set comprising 89 data points that were collected from measurements reported in the literature since 2020 [32,33,34,35] and that, thus, were not included in our previously thermodynamic model. The final dataset comprises 619 unique data points coming from 113 different types of binary DESs. It is worth noting that the experimental surface tension of each DES in the dataset was measured at atmospheric pressure in a large temperature range (278.15–358.15 K), rendering the temperature an important independent variable to consider in the QSPR modeling for understanding how it influences such physical property.
Predictive validation is a required but delicate task in any QSPR modeling—i.e., to assess model adequacy to new mixtures, and it is related directly to the dataset division scheme adopted. In fact, as shown by Muratov et al. [36], the random division of the original dataset for validation purposes is unacceptable since it can lead to unreliable QSPR models and to an over-optimistic estimation of their predictive performance. The authors have thus proposed and described in detail different validation strategies for the QSPR modeling of mixtures [36]. In this work, two such validation strategies were utilized to search for the most predictive QSPR models, namely: the mixtures-out (MO) and compounds-out (CO) schemes. Briefly, in the MO scheme, mixtures of the modeling set are distributed among the training and the test sets without repetition. By contrast, in the CO scheme, at least one chemical of the dataset is never placed in the training set. Naturally, these validation strategies are only applicable to binary mixtures and require some guidance to follow. Due to the complexity of the data matrices, any random MO- or CO-based division scheme may not yield the most predictive model since variables selection depends largely on the training set [28]. Even though the CO-based validation is considered to be the most robust strategy [36], it may give rise to underfitted models with poor statistical quality. At the same time, while the MO-based validation is less robust, this strategy definitely provides more meaningful solutions than any random data distributions or other validation division schemes such as the points-out one proposed by the same authors [28,36,37,38].
As referred to above, we have recently developed a Python-based tool named QSAR-Mx [28], specifically devised to address and automate some crucial steps involved in the QSPR modeling of binary mixtures. A detailed description of the functionalities of this tool can be found in its instruction manual (accessible from https://github.com/ncordeirfcup/QSAR-Mx, last accessed on 28 April 2022). Essentially, QSAR-Mx lets the user generate multiple MO- and CO-based data distributions and then develop models with the latter and select the most predictive ones based on their statistical metrics. Firstly, the user should choose two parameters—i.e., seed and interval, for generating the MO- and CO-based data distributions. In the MO division scheme, QSAR-Mx detects unique mixtures present in the dataset and then sorts them by the number of occurrences in descending order. From the sorted list, the mixtures are grouped according to the seed (the starting point for selection) and interval values given. The unique mixtures selected are then incorporated into the test set. Likewise, in the CO scheme, the QSAR-Mx tool begins by sorting the unique chemicals that belong to component-1 of the mixtures, followed by sorting them in descending order and lastly, by choosing some chemicals based upon the maximum values of the seed and interval chosen. This procedure is then replicated for the unique chemicals belonging to component-2 of the mixtures. The unique chemicals selected are then placed in the test set. One should notice however here that the following QSPR models were always derived with generated data distributions in which the training set size was always greater than the test set size and, simultaneously, the size of the latter was at least 15% of the former. It should be also mentioned that the QSAR-Mx tool has been slightly modified since our previous work [28] because we found that the MO-based data distributions vary from one run to another. The new version of QSAR-Mx is now able to generate the same data distributions (i.e., MO-based training and test sets) every time, independently of the seed and interval given, leading thus to more reproducible modeling results.
In this work, to begin with, we divided the whole dataset into a modeling set and an external validation set (535 and 84 data points, respectively), using the CO-based division scheme with values for the seed and interval of 3 and 4, respectively. The modeling dataset was subsequently divided into training and test sets by MO- and CO-based schemes, setting both the maximum seed and maximum interval as 6. The DESs in the training set coming from the two schemes were employed separately for the development of the QSPR models, and those from the test sets were only used to test such models. The DESs in the external validation set were utilized for extra validation of the final most predictive QSPR models found. Details about the investigated DESs along with their experimental surface tension values, and corresponding references are given in Table S1 of the Supplementary Materials.

2.2. Mixture Descriptors

Due to the unique nature of binary DESs, the calculation of their descriptors requires additional steps to take into account the specificity of each component as well as the molar fractions. Here, we resorted to the strategy previously suggested by Oprisiu et al. [37], in which the descriptors are initially calculated for each component and then modified on the basis of ‘mixture descriptors weighted by molar fraction’ formulas. As such, two types of modified descriptors (from now on, referred to as WM descriptors), i.e., Dpmix and Dnmix, were computed by the following formulas:
Dpmix = x1 D1 + x2 D2
Dnmix = |x1 D1x2 D2|
where Di stands for the descriptor of each component i (i = 1, 2) and xi for the respective molar fraction in the mixture.
Both formulas have already been successfully applied to generate predictive models for various properties of DESs [9,28,38] and implemented in the QSAR-Mx tool. Basically, QSAR-Mx includes two methods for calculating such mixture descriptors. Starting from the descriptors previously obtained for each component (D1 and D2), ‘Method-1’ calculates only the Dpmix descriptors whereas ‘Method-2’ provides both the Dpmix and Dnmix descriptors. In the present work, we used both of the aforementioned methods separately and then performed a comparative analysis to elucidate the method that provides the best solution as far as the predictivity of the overall model is concerned.
To start with, the 3D structures of each DES component were obtained by inputting the SMILES (Simplified Molecular Input Line Entry Specification) strings into the application MarvinView (https://docs.chemaxon.com/display/docs/marvinview.md, last accessed on 15 March 2022) and subsequently standardized by the ChemAxon Standardizer tool with the following options: strip salts, aromatize, neutralize and add explicit hydrogen atoms [39]. Here, we have resorted to the 0D-2D descriptors available in the Dragon software [40] for describing each DES component. Actually, 3D descriptors were excluded due to the high computational effort required for structure optimization of each component, especially for large datasets, and the fact that those may also give rise to misleading information not ensuring reliable property prediction by 3D QSPR [28,41]. Finally, along with the WM descriptors, three independent variables were also included, namely: the measuring temperature, T (in K), the presence/absence of chlorine ions, and the presence/absence of bromine ions. The importance of temperature for the modeling was referred to before. Note in addition that only the cationic part was considered during the calculation of WM descriptors for the DES HBA component. Hence, two binary indicator variables (i.e., the presence (1) or absence (0) of halide ions) were required to be included to account for the anionic part of the HBA components.

2.3. Modeling Techniques and Evaluation

As to the modeling techniques, we started by opting for a regression-based approach like in our previous work [28] thanks to its easy interpretation but also high reliability. Specifically, the regression coefficients were obtained by the multiple linear regression analysis (MLR) implemented in our in-house QSAR-Mx tool and by selecting the variables through the sequential forward selection (SFS) algorithm using the Sequential Feature Selector module of Mlxtend (http://rasbt.github.io/mlxtend/) [42]. The following different conditions were applied for scoring the SFS selection:
(a)
determination coefficient (R2), no cross-validation;
(b)
negative mean absolute error (NMAE), no cross-validation;
(c)
negative mean Poisson deviance (NMPD), no cross-validation;
(d)
determination coefficient (R2), five-fold cross-validation (CV) or ten-fold CV.
Yet, a correlation cutoff of 0.95 and variance cutoff of 0.001 were always set to discard highly intercorrelated and near-constant descriptors. Additionally, the selection of the optimal number of descriptors for the MLR models was controlled by the %MAELOO reduction policy, also implemented in QSAR-Mx. The %MAELOO reduction scheme guarantees that no new descriptor is added to the model during feature selection if its inclusion does not reduce the leave-one-out (LOO) cross-validation and mean absolute error (MAELOO) by at least 5% of the previous model. As such, this policy guarantees that the optimal number of descriptors is present in the model and, at the same time, that models generated with multiple model development strategies may be compared from a neutral ground [28].
To check if higher accuracy could be achieved when estimating the surface tension of DESs, non-linear models were also developed using five different machine learning (ML) techniques, i.e.: (i) k-Nearest Neighbors (k-NN), (ii) Random Forests (RF), (iii) Support Vector Machines (SVM), (iv) Neural Network Multilayer Perceptron (NN-MLP), and (v) Gradient boosting (GB) [43,44,45,46,47]. Such ML-based models were set up by resorting to the tools available in the Scikit-learn programs (https://scikit-learn.org/stable/) with QSAR-Mx (last accessed on 28 April 2022) and for each of them, hyperparameter tuning was performed by varying their crucial parameters (see the list in Table S2). The best parameters for a given ML estimator were determined by a 5-fold cross-validation scheme using the same training sets as before. In the same manner, the external predictivity of the promising non-linear models found was firstly accessed and further validated using the same test and external sets.
As described above, QSAR-Mx generates multiple models based on three types of inputs provided by the user, namely: (i) descriptor calculation strategy (Method-1 or -2), (ii) dataset division schemes (MO- or CO-based data-division), and (iii) scoring conditions. No matter the model generation strategy followed, any QSPR regression model requires an evaluation with robust diagnostic tools to assess and compare its acceptability as well as quality over other models.
In this work, the internal predictivity of the developed QSPR regression models was primarily checked by statistical parameters, such as the MAELOO and Q2LOO (LOO cross-validation R2) [48,49]. Keeping in mind the importance of the compounds-out validation, we have recently introduced two new statistical parameters based on the so-called leave-chemical-out (LCO) cross-validation, which is conceptually similar to the well-known leave-many-out CV but more effective whenever dealing with mixtures [28]. These new parameters, i.e., Q2LCO and MAELCO, are mostly important for the MO-based data distributions. Indeed, even though the latter often produce more predictive models than the CO-based data distributions, their predictivity remains questionable due to the lack of CO-based validation. Both Q2LCO and MAELCO have actually helped us monitor the model performance upon the removal of each component (belonging either to HBA or to HBD) one by one from the training set with further model redevelopment using the remaining components. A detailed description of these two parameters can be found in our previous work [28]. Importantly, while assessing the quality of the models, the difference between Q2LOO and Q2LCO should also be evaluated. In fact, a large discrepancy between the values of the latter suggests that the mixtures based on one or more components are not predicted well enough by the QSPR model. Besides the above-mentioned statistics, the internal predictivity of the final regression models was also evaluated by using scaled rm2 validation metrics, such as rm2(LOO) and Δrm2(LOO) [50]. Basically, rm2 metrics are based on the correlation between the observed and predicted values, with and without setting to zero the intercept of the least square regression lines. In addition, the AARD calculated for each data distribution was also used for checking the overall errors of the derived models. Although not quite common in QSPR modeling, the latter allowed us to compare the quality of our QSPR models with that of previously reported thermodynamic models [29]. To access the external predictive ability of the models, similar statistical validation metrics were also employed, i.e., the mean absolute error for the test set or external validation set (MAEtest and MAEext) and the variance explained for external prediction (R2Pred) [44] along with the scaled rm2(test), Δrm2(test), rm2(ext), and Δrm2(ext) metrics [50].
Other aspects that deserve special attention are the absence of highly collinear descriptors and the lack of chance correlations in the final derived models. Highly collinear variables were simply checked by inspecting the cross-correlation matrix of the models’ descriptors. On the other hand, the Y-randomization technique identifies models with chance correlations, using the cRP2 parameter [51], after the sequence of the response vector has been randomly modified. Here, the procedure was repeated 1000 times, and new models were developed with the randomly reordered responses employing the same set of variables. The uniqueness of the final regression model and its lack of chance correlations is confirmed by the value obtained for cRP2, which should be closer to one [51].
Finally, apart from inspecting the models’ robustness and predictivity, one should also define their applicability domain (AD), that is, the response and chemical structure space for which the models form reliable predictions without extrapolating. In this work, the AD of the developed models was determined by the leverage approach [52], which renders a measure of the similarity of a particular substance from all other substances (distance between its descriptor values and the average for all descriptor values). So, one can plot the standardized residuals against the leverage values for each DES of the several sets. From such a plot, the so-called William’s plot [52], we were able to identify the response and structural DES outliers. All plots shown in the present work were conceived with Matplotlib [53].

2.4. Consensus Modeling

The task here is to explore whether the overall quality of predictions for new substances might be improved by an “Intelligent” selection of multiple models. Towards that end, the most predictive QSPR models derived were subjected to consensus modeling, using the software tool Intelligent Consensus Predictor (freely available through the web https://dtclab.webs.com/software-tools, last accessed on 23 March 2022) developed by Roy et al. [54]. The four strategic techniques of this tool were applied, namely: Consensus Models (CM) 0–3, just as in our previous work [28], and as fully described in the work by the authors [54]. In short, CM0 is the simplest strategy and consists in computing the arithmetic average of predicted response values from all input individual models. In contrast, CM1 is based on the simple arithmetic average of predictions from all qualified individual models. CM2 corresponds to weighted average predictions from all qualified models, formerly giving appropriate weightage to those models. Finally, CM3 applies compound-wise predictions based on the best selection coming from the qualified models. Independently of the consensus modeling methodology, our main purpose was thus to combine multiple statistically robust models to improve the predictivity over the external validation set.

3. Results

3.1. Model Calibration and Evaluation

Figure 1 depicts the workflow chart followed in the present QSPR modeling, which was mostly carried out using our recently developed tool, QSAR-Mx. As can be seen, all the involved steps and methodology employed to cope with the major goal of this work are shown, i.e., to build reliable predictive QSPR regression models from the compiled data that could be used to estimate the surface tension of DESs.
In total, 258 models were set up by varying data splitting schemes, descriptor calculation methods (Method-1 or Method-2) and SFS-MLR modeling. Among these, 136 models pertained to the MO-based division scheme, whereas the remaining 112 models were generated with the CO-based division scheme. The overall predictive quality of each of these regression models was judged by means of the average value computed for the statistical parameters Q2LOO, Q2LCO and R2Pred. Essentially, the two parameters—Q2LOO and R2Pred—account for the internal and external predictivity of the QSPR models, respectively. Nevertheless, the parameter Q2LCO was also included to ensure that the most predictive models do not suffer overfitting due to bias towards some specific components of the binary DES mixtures. Naturally, the higher the average value obtained from these three parameters is, the more predictive the model is. Considering this, we selected the top 15 unique models for further processing. A summary of the statistical results of these models is given in Table 1. Interestingly, out of these 15 models, 14 were derived from MO-based data distributions, and only one model arose from CO-based distributions. Undoubtedly, that clearly shows that MO-based data distributions are more likely to produce more predictive models in comparison to the CO-based data distributions since the latter provides a more rigorous validation strategy.
As referred to before, in the entire model-building process, the data distributions were varied for the selection of the most predictive models. Therefore, the generated test sets serve as a validation set to estimate the external predictivity of the models but, at the same time, as calibration sets for the selection of the best models. In contrast, the external validation sets (containing 84 data points) were treated as the ‘true validation set’ for assessing the external predictivity of the models. The latter was built with the CO-based data-distribution scheme, thus holding a significant challenge to the generated models as far as their external predictivity is concerned. A comparison of the predictivity of the top 15 models is shown in Table 2.
As can be clearly observed from Table 2, only a few models show satisfactory predictions against the external validation set. Nevertheless, six of these models had R2Pred values greater than 0.50, as well as average %AARD values lower than 12. Moreover, three models, namely M09, M10 and M12, supplied the most satisfactory predictivity towards such external validation set with R2Pred > 0.65 and %AARD < 10. Therefore, these three models were considered the best models obtained for predicting the surface tension of DESs. Remarkably, M10, the only CO-based model included in the top 15, emerged as one of the most predictive models. Still, on the basis of overall predictivity, M12 was selected as the best individual QSPR model, even taking into account its slightly lower internal predictivity, compared to M10, and its slightly lower external predictivity towards the test set, as compared to M09. Even so, M12 afforded a balanced prediction against all three sets with an average %AARD value of 7.126, which is lower than that obtained for the other two models. At the same time, model M12 provides the best solution if the average value of Q2LOO, Q2LCO and R2Pred (against the two validation sets) is considered. In fact, for M12, this average value was found to be 0.859, while for M09 and M10, the average values were estimated as 0.820 and 0.831, respectively.
In summary, the best predictive model found for the DESs’ surface tension (a six-variable equation, model M12) can be expressed as detailed below, while the meaning of the selected WM descriptors is given in Table 3.
σ = +89.611 (±3.452)
+0.405 (±0.026) P_VSA_MR_6pmix
−5.034 (±0.874) Eig02_EA(dm)pmix
−23.145 (±3.320) CATS2D_02_ANpmix
+8.835 (±0.174) BLTF96pmix
−25.191 (±2.352) MATS5snmix
−0.104 (±0.011) T
In this equation, Xpmix and Xnmix stand for WM descriptors of the type Dpmix in line with Equation (1) and Dnmix following Equation (2), respectively, T is the temperature (in K) under which the surface tension has been measured, and σ is the surface tension (in mN/m).
A summary of the extended statistical results for model M12 is given in Table 4. The determination coefficient values (R2 = 0.916 and R2Adj = 0.915), the sample size (Ntr = 360), the Fisher ratio (F = 642.4), but especially the high ratio between the number of data points to adjustable variables (ρ = 60) [59] are indicative of the model’s statistical significance and fitness. Model M12 also provides a satisfactory internal and external predictivity as follows from the cross-validation, rm2 and R2Pred metrics values (see Table 4). Moreover, built with only six descriptors, it led to %AARD values of 5.805, 11.155 and 4.418 against the training, test and validation sets, respectively. The model prediction ability was further checked by analyzing the relative deviations (%RD = 100*(σPred − σexp)/σexp) between the predicted and experimental DES surface tension values for all three sets. As Figure 2 shows, model M12 performs more accurately regarding the training and external validation sets than the test set. Yet, the latter also demonstrates a normal behavior considering the shape of the RD distribution according to the proposed model, also displayed in Figure 2. This histogram plot clearly depicts that most of the RD error values are within ±20% and that those are normally distributed, suggesting that the model estimations are not biased.
Figure 3 shows the plot of the predicted surface tensions obtained from model M12 vs. the observed experimental ones. As seen, the majority of the data points are sufficiently close to the diagonal line, denoting the model’s reliability and soundness of its predictions. Indeed, the model’s performance is even better than that of the previously developed thermodynamic model for the DES surface tension [14], which, despite having fewer data points (a total of 530 data points, considering only the 99 unique binary DES), led to %AARD values of 8.87 and 14.81 for the training and test sets, respectively. However, the purpose and outcomes of the current QSPR modeling are different from that of any thermodynamic model, as the former demands several different conditions to be satisfied, apart from validation, to establish the statistical robustness of the model. For example, so far, we have demonstrated the acceptable results on the reliability of the QSPR model M12, but it is also important to inspect the non-intercollinearity among any two of its descriptors. The latter was found to be 0.238, indicating that the variables included in the model are indeed not interrelated to each other. Furthermore, the model was itself checked for its uniqueness by the Y-based randomization technique, which was performed by scrambling the endpoint responses for the training set. The high value obtained for cRP2 (=0.908) implies that the model is not correlated by chance. Another crucial aspect is related to the applicability domain of the model that here was assessed by analyzing the Williams plot (plot of standardized residual vs. leverages). As seen in Figure 3, eight data points from the training set and thirteen from the test set can be considered structural outliers of the model, but no structural outliers were found in the external validation set. Interestingly, most of these structural outliers were well predicted by the model and were thus retained, as previously suggested by Gramatica et al. [49]. In addition, only twelve data points of the entire dataset were found to be response outliers, which also proves the high predictive accuracy of the model [60].

3.2. Model Interpretation

In our previous investigation on density [28] we observed that, in spite of providing less mechanistic interpretability, graph-based topological descriptors often help in characterizing the physicochemical properties. In the present work, a number of topological descriptors were also proven to be significant for describing the surface tension of DESs. Figure 4 shows the relative importance of each descriptor of model M12, estimated on the basis of the absolute value of its regression coefficients.
As can be observed, the WM descriptor MATS5snmix was found to have the highest importance and besides, it is the only Dnmix type descriptor in the model. Being derived from graph-based topological descriptors, MATS5dnmix points out that the differences in topological geometry of the DESs’ components may play a significant role in the surface tension of these solvents. The Dpmix type WM descriptor CATS2D_02_ANpmix is the second most influencing descriptor of the model. Chemically Advanced Template Search (CATS) descriptors are a useful group of descriptors that account for the topological distance among scaffold features in the molecules [58]. CATS2D_02_AN, in particular, means that the acceptor and negatively charged groups are separated by a small topological distance (=2). In this case, higher values of this descriptor are found to be negatively correlated to the surface tension. Descriptor BLTF96pmix appears as the third most important descriptor in the model. Unlike the first two descriptors of topological nature, this descriptor is based on an important molecular property—lipophilicity [55,56]. Since this descriptor belongs to the Dpmix type, it may be inferred that higher lipophilicity of the components would trigger higher surface tension for the DESs. Apart from lipophilicity, another well-known physicochemical property—dipole moment—was also found to have important contributions in ascertaining the DES surface tension. The importance of the dipole moment is derived from the presence of descriptor Eig02_EA(dm)pmix. The fifth most important descriptor belongs to the class of P_VSA descriptors, which represent the amount of van der Waals surface area (VSA) having a property (P) in a certain range [56]. In the case of the descriptor P_VSA_MR_6pmix, the property is the molar refractivity (MR) at a larger range (bin size 6). The positive relation of P_VSA_MR_6pmix with the dependent property is highly significant as it suggests that increased MR (i.e., polarizability) within the van der Waals surface of each component contributes towards a higher surface tension for the respective DESs. Finally, the last descriptor of the model is the temperature of surface tension measurements, T. As expected, with increasing temperature the surface tension is found to decrease, which fits well with the experimental findings. Still, to further check how model M12 actually addresses the influence of temperature, we randomly selected six DESs with a range of surface tension values. From Figure 5, it can be clearly seen that both experimental and predicted properties followed the same trend, i.e., the surface tension gradually decreases as the temperature is increased.

3.3. Non-Linear Models

Albeit M12 emerged as the most accurate linear model in estimating the DES surface tension, the question that still remains is whether a non-linear model based on its descriptors might have better performance. Table 5 shows a statistical summary of the performance of the non-linear models resulting from applying the five different machine learning techniques, i.e.: k-NN, RF, SVM, NN-MLP and GB [49,50,51,52,53]. It can be observed that most of these ML techniques failed to produce predictive models and their results are not accurate either. Still, the SVM technique yields a predictive model and thus has the highest performance among the other ML techniques, though both the internal and the external predictivity of the latter remain inferior to the linear model. These results indicate that for the selected set of descriptors, the multiple linear regression-based model has the best accuracy in estimating the surface tension of DESs as well as sufficient predictivity that cannot be achieved with other model development techniques.

3.4. Consensus Modeling

Finally, we applied the intelligent consensus modeling [54] to see whether the surface tension predictions for the external validation set could be improved. To do so, sets of the three most predictive linear models—M09, M10 and M12—were subjected to consensus predictions in different combinations, namely: (a) C1-based using models M09, M10 and M12; (b) C2-based using models M10 and M12; (c) C3-based using models M09 and M10; and (d) C4-based with models M09 and M12. In each case, the modeling dataset containing 535 data points was treated as the training set whereas the external validation set was used to check the external predictivity of the consensus model. The results of all consensus modeling attempts are presented in Table 6.
Interestingly, the resulting models C1 and C4 lead to similar predictivities. Yet, none of the later consensus models display an external predictivity considerably better than that of the best individual model, M12. The R2Pred and %AARD values obtained for consensus model C1 are 0.864 and 4.459, respectively, and similarly for C3 (i.e., 0.854 and 4.393). As can be seen, both C1 and C3 may therefore be projected as alternative models to M12. However, let us mainly focus on C3, since it reveals that M09 and M10 may indeed work as complementary models for each other towards improving the external predictivity. Details about the M09 and M10 models are provided in Table S3 of the SI.
Since M09 was developed with the same data distribution as M12, these two models have four descriptors in common, namely: CATS2D_02_ANpmix, P_VSA_MR_6pmix, BLTF96pmix, and T. Obviously, these four descriptors have a high significance in describing the surface tension of DESs. Significantly, CATS2D_02_ANpmix, which was found to be the second most important descriptor of M12, appears to be the most influential descriptor of M09. It undoubtedly indicates that this descriptor may be considered the most crucial descriptor in predicting the surface tension of DESs. Presumably, due to the similarity between M12 and M09, consensus modeling with these two models failed to provide any better solution. Model M10, in contrast, is established as a unique model because, save for T, none of its descriptors is found either in M09 or in M12. Most likely due to this reason, its combination with the other two models produces good consensus models. Unlike models M09 and M12, model M10 yields are slightly higher but still have acceptable intercollinearity between descriptors, with a maximum R2 value of 0.713. The selected descriptors for models M09 and M10 are described in detail in Table S4 of the SI.

4. Conclusions

The present work aimed to establish a systematic and well-designed QSPR modeling for predicting the surface tension of a wide range of DESs, following the OECD guidelines. Towards such aim, the largest surface tension data bank of binary DESs known to date, comprising 619 data points from 113 unique DESs of various families was gathered from the literature. In addition, special emphasis was put on employing robust validation strategies for setting up the QSPR models. In so doing, the best QSPR models were set up with multiple data distributions resulting from MO- and CO-data splitting schemes along with a weighted mixture type of descriptors, using our in-house open access tool QSAR-Mx. After considering several statistical parameters, the top three individual linear regression models stand out for their accuracy and robustness. The most predictive individual linear model was however selected based on the predictivity towards the external validation set.
Similar to our previous study on the development of a thermodynamic model [14], the surface tension of DESs was found to be a particularly difficult property to predict. This may be related to the challenging nature of the accurate experimental estimation of surface tensions, associated with the presence of surface-active impurities and differences in the measuring protocols [33]. Nevertheless, our most predictive individual QSPR regression model (i.e., model M12) yielded a satisfactory overall %AARD value (=7.126), especially when compared to the aforementioned thermodynamic model (%AARD = 10.31, considering only the binary DESs therein) [14]. This model depicted also the structural and physicochemical features related to the surface tension of DESs. Just as in our previously developed model for the density of DESs [28], graph-based topological descriptors were found to be highly useful in this respect. Some physicochemical factors, such as the lipophilicity, polarizability and dipole moment of the DESs’ components, were found to be responsible for ruling their surface tensions. We also attempted to generate consensus models based on the top three individual linear models. Interestingly, consensus models based on the two other best individual models—M09 and M10—were found to be equally predictive towards the external validation set.
Overall, this work definitely provides valuable information about the structural and physicochemical features required for predicting the surface tension of binary DESs. At the same time, it also lends important guidelines to set up predictive and validated linear interpretable QSPR models for the various properties of binary mixtures. The high predictivity of the models ensures that these models may be used on the industrial scale to at least predict the surface tension of the DESs that are newly developed or under-developed to assess their suitability as an industrial solvent. The models may also be used to screen a large number of DESs (obtained from databases) to predict the DESs with desirable surface tension properties. What is more, all the proposed models are easily reproducible since they rely on fully specified computational procedures and were built with non-commercial software tools. Finally, both the individual and consensus models developed in this work shall help the future screening as well as the design of new sustainable DES, with major time and cost savings.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/molecules27154896/s1, Table S1. List of all investigated DESs and experimental surface tension data. (XLSX); Table S2. Hyperparameter tuning of the different machine learning techniques. (PDF); Table S3. Detailed description of the M09 and M10 models. (PDF); Table S4. Descriptors used in the M09 and M10 models. (PDF)

Author Contributions

Conceptualization, A.K.H., R.H., A.R.C.D. and M.N.D.S.C.; methodology, A.K.H., R.H. and M.N.D.S.C.; software, A.K.H.; formal analysis, A.K.H. and R.H.; investigation, A.K.H., R.H. and I.V.V.; writing—original draft preparation, A.K.H. and R.H.; writing—review and editing, I.V.V. and M.N.D.S.C.; supervision, A.R.C.D. and M.N.D.S.C.; project administration, A.R.C.D. and M.N.D.S.C.; funding acquisition, A.R.C.D. and M.N.D.S.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by UIDB/50006/2020 with funding from FCT/MCTES through national funds.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All the data files pertaining to the QSPR modeling are available from the authors. The training, test and external datasets were taken from cited publications, and the DES chemical structures along with the collected surface tension data are provided in the Supporting Information (Table S1). Dragon 7.0, MarvinView, and Standardizer were used in this study under academic license (see Material and methods section). Three other open source software tools were also used in this study, namely: QSAR-Mx, a Python-based tool developed by the authors that is available to download at https://github.com/ncordeirfcup/QSAR-Mx (last accessed on 28 April 2022); Mlxtend, a Python library of useful tools that is accessible from https://rasbt.github.io/mlxtend/; scikit-learn, a Python library of useful machine learning tools that is accessible from https://scikit-learn.org/stable/; and Intelligent Consensus Predictor, a Java-based tool available through the web https://dtclab.webs.com/software-tools (last accessed on 23 March 2022)(see Material and methods section).

Acknowledgments

The authors are thankful to ChemAxon for providing the academic licenses of MarvinView and Standardizer to AKH.

Conflicts of Interest

The authors declare no conflict of interest.

Sample Availability

Samples of the compounds (full dataset used for modeling) are available from the authors.

References

  1. Clarke, C.J.; Tu, W.C.; Levers, O.; Brohl, A.; Hallett, J.P. Green and Sustainable Solvents in Chemical Processes. Chem. Rev. 2018, 118, 747–800. [Google Scholar] [CrossRef]
  2. Sheldon, R.A. Green Solvents for Sustainable Organic Synthesis: State of the Art. Green Chem. 2005, 7, 267–278. [Google Scholar] [CrossRef]
  3. Sheldon, R.A. Fundamentals of Green Chemistry: Efficiency in Reaction Design. Chem. Soc. Rev. 2012, 41, 1437–1451. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  4. García, G.; Aparicio, S.; Ullah, R.; Atilhan, M. Deep Eutectic Solvents: Physicochemical Properties and Gas Separation Applications. Energy Fuels 2015, 29, 2616–2644. [Google Scholar] [CrossRef]
  5. Hansen, B.B.; Spittle, S.; Chen, B.; Poe, D.; Zhang, Y.; Klein, J.M.; Horton, A.; Adhikari, L.; Zelovich, T.; Doherty, B.W.; et al. Deep Eutectic Solvents: A Review of Fundamentals and Applications. Chem. Rev. 2021, 121, 1232–1285. [Google Scholar] [CrossRef] [PubMed]
  6. El Achkar, T.; Greige-Gerges, H.; Fourmentin, S. Basics and Properties of Deep Eutectic Solvents: A Review. Environ. Chem. Lett. 2021, 19, 3397–3408. [Google Scholar] [CrossRef]
  7. Abbott, A.P.; Capper, G.; Davies, D.L.; Rasheed, R.K.; Tambyrajah, V. Novel Solvent Properties of Choline Chloride/Urea Mixtures. Chem. Commun. 2003, 70–71. [Google Scholar] [CrossRef] [Green Version]
  8. Zhang, Q.; De Oliveira Vigier, K.; Royer, S.; Jérôme, F. Deep Eutectic Solvents: Syntheses, Properties and Applications. Chem. Soc. Rev. 2012, 41, 7108–7146. [Google Scholar] [CrossRef]
  9. Halder, A.K.; Cordeiro, M.N.D.S. Probing the Environmental Toxicity of Deep Eutectic Solvents and Their Components: An In Silico Modeling Approach. ACS Sustain. Chem. Eng. 2019, 7, 10649–10660. [Google Scholar] [CrossRef]
  10. Palmelund, H.; Andersson, M.P.; Asgreen, C.J.; Boyd, B.J.; Rantanen, J.; Löbmann, K. Tailor-Made Solvents for Pharmaceutical Use? Experimental and Computational Approach for Determining Solubility in Deep Eutectic Solvents (DES). Int. J. Pharm. X 2019, 1, 100034. [Google Scholar] [CrossRef]
  11. Nam, M.W.; Zhao, J.; Lee, M.S.; Jeong, J.H.; Lee, J. Enhanced Extraction of Bioactive Natural Products Using Tailor-Made Deep Eutectic Solvents: Application to Flavonoid Extraction from Flos Sophorae. Green Chem. 2015, 17, 1718–1727. [Google Scholar] [CrossRef]
  12. Chen, Y.; Chen, W.; Fu, L.; Yang, Y.; Wang, Y.; Hu, X.; Wang, F.; Mu, T. Surface Tension of 50 Deep Eutectic Solvents: Effect of Hydrogen-Bonding Donors, Hydrogen-Bonding Acceptors, Other Solvents, and Temperature. Ind. Eng. Chem. Res. 2019, 58, 12741–12750. [Google Scholar] [CrossRef]
  13. Ghaedi, H.; Ayoub, M.; Sufian, S.; Shariff, A.M.; Lal, B. The Study on Temperature Dependence of Viscosity and Surface Tension of Several Phosphonium-Based Deep Eutectic Solvents. J. Mol. Liq. 2017, 241, 500–510. [Google Scholar] [CrossRef]
  14. Haghbakhsh, R.; Taherzadeh, M.; Duarte, A.R.C.; Raeissi, S. A General Model for the Surface Tensions of Deep Eutectic Solvents. J. Mol. Liq. 2020, 307, 112972. [Google Scholar] [CrossRef]
  15. Le, T.; Epa, V.C.; Burden, F.R.; Winkler, D.A. Quantitative Structure-Property Relationship Modeling of Diverse Materials Properties. Chem. Rev. 2012, 112, 2889–2919. [Google Scholar] [CrossRef]
  16. Mikolajczyk, A.; Gajewicz, A.; Rasulev, B.; Schaeublin, N.; Maurer-Gardner, E.; Hussain, S.; Leszczynski, J.; Puzyn, T. Zeta Potential for Metal Oxide Nanoparticles: A Predictive Model Developed by a Nano-Quantitative Structure-Property Relationship Approach. Chem. Mater. 2015, 27, 2400–2407. [Google Scholar] [CrossRef]
  17. Kim, M.; Li, L.Y.; Grace, J.R. Predictability of Physicochemical Properties of Polychlorinated Dibenzo-p-Dioxins (PCDDs) Based on Single-Molecular Descriptor Models. Environ. Pollut. 2016, 213, 99–111. [Google Scholar] [CrossRef] [PubMed]
  18. Moura, A.S.; Halder, A.K.; Cordeiro, M.N.D.S. From Biomedicinal to In Silico Models and Back to Therapeutics: A Review on the Advancement of Peptidic Modeling. Future Med. Chem. 2019, 11, 2313–2331. [Google Scholar] [CrossRef] [PubMed]
  19. Sepehri, B. A Review on Created QSPR Models for Predicting Ionic Liquids Properties and Their Reliability from Chemometric Point of View. J. Mol. Liq. 2020, 297, 112013. [Google Scholar] [CrossRef]
  20. Muratov, E.N.; Bajorath, J.; Sheridan, R.P.; Tetko, I.V.; Filimonov, D.; Poroikov, V.; Oprea, T.I.; Baskin, I.I.; Varnek, A.; Roitberg, A.; et al. QSAR Without Borders. Chem. Soc. Rev. 2020, 49, 3525–3564. [Google Scholar] [CrossRef]
  21. Awfa, D.; Ateia, M.; Mendoza, D.; Yoshimura, C. Application of Quantitative Structure–Property Relationship Predictive Models to Water Treatment: A Critical Review. ACS EST Water 2021, 1, 498–517. [Google Scholar] [CrossRef]
  22. Wang, J.; Song, Z.; Chen, L.; Xu, T.; Deng, L.; Qi, Z. Prediction of CO2 Solubility in Deep Eutectic Solvents using Random Forest Model Based on COSMO-RS-Derived Descriptors. Green Chem. Eng. 2021, 2, 431–440. [Google Scholar] [CrossRef]
  23. Balali, M.; Sobati, M.A.; Gorji, A.E. QSPR Modeling of Thiophene Distribution Between Deep Eutectic Solvent (DES) and Hydrocarbon Phases: Effect of Hydrogen Bond Donor (HBD) Structure. J. Mol. Liq. 2021, 342, 117496. [Google Scholar] [CrossRef]
  24. Khajeh, A.; Shakourian-Fard, M.; Parvaneh, K. Quantitative Structure-Property Relationship for Melting and Freezing Points of Deep Eutectic Solvents. J. Mol. Liq. 2021, 321, 114744. [Google Scholar] [CrossRef]
  25. Benguerba, Y.; Alnashef, I.M.; Erto, A.; Balsamo, M.; Ernst, B. A Quantitative Prediction of the Viscosity of Amine Based DESs Using Ss-profile Molecular Descriptors. J. Mol. Struct. 2019, 1184, 357–363. [Google Scholar] [CrossRef]
  26. Lemaoui, T.; Hammoudi, N.E.H.; Alnashef, I.M.; Balsamo, M.; Erto, A.; Ernst, B.; Benguerba, Y. Quantitative Structure Properties Relationship for Deep Eutectic Solvents Using Sσ-profile as Molecular Descriptors. J. Mol. Liq. 2020, 309, 113165. [Google Scholar] [CrossRef]
  27. Lemaoui, T.; Darwish, A.S.; Attoui, A.; Hatab, F.A.; Hammoudi, N.E.H.; Benguerba, Y.; Vega, L.F.; Alnashef, I.M. Predicting the Density and Viscosity of Hydrophobic Eutectic Solvents: Towards the Development of Sustainable Solvents. Green Chem. 2020, 22, 8511–8530. [Google Scholar] [CrossRef]
  28. Halder, A.K.; Haghbakhsh, R.; Voroshylova, I.V.; Duarte, A.R.C.; Cordeiro, M.N.D.S. Density of Deep Eutectic Solvents: The Path Forward Cheminformatics-Driven Reliable Predictions for Mixtures. Molecules 2021, 26, 5779. [Google Scholar] [CrossRef]
  29. Haghbakhsh, R.; Bardool, R.; Bakhtyari, A.; Duarte, A.R.C.; Raeissi, S. Simple and Global Correlation for the Densities of Deep Eutectic Solvents. J. Mol. Liq. 2019, 296, 111830. [Google Scholar] [CrossRef]
  30. Organization for Economic Co-Operation and Development (OECD). Guidance Document on the Validation of (Quantitative) Structure-Activity Relationship ((Q)SAR) Models; OECD Series on Testing and Assessment 69; OECD Document ENV/JM/MONO2007; OECD Publishing: Paris, France, 2014; pp. 55–65. [Google Scholar]
  31. Toropov, A.A.; Toropova, A.P. QSPR/QSAR: State-of-Art, Weirdness, the Future. Molecules 2020, 25, 1292. [Google Scholar] [CrossRef] [Green Version]
  32. Omar, K.A.; Sadeghi, R. Novel Deep Eutectic Solvents Based on Pyrogallol: Synthesis and Characterizations. J. Chem. Eng. Data 2021, 66, 2088–2095. [Google Scholar] [CrossRef]
  33. Nunes, R.J.; Saramago, B.; Marrucho, I.M. Surface Tension of dl-Menthol:Octanoic Acid Eutectic Mixtures. J. Chem. Eng. Data 2019, 64, 4915–4923. [Google Scholar] [CrossRef]
  34. Lapeña, D.; Bergua, F.; Lomba, L.; Giner, B.; Lafuente, C. A Comprehensive Study of the Thermophysical Properties of Reline and Hydrated Reline. J. Mol. Liq. 2020, 303, 112679. [Google Scholar] [CrossRef]
  35. Abdallah, M.M.; Müller, S.; González de Castilla, A.; Gurikov, P.; Matias, A.A.; Bronze, M.d.R.; Fernández, N. Physicochemical Characterization and Simulation of the Solid–Liquid Equilibrium Phase Diagram of Terpene-Based Eutectic Solvent Systems. Molecules 2021, 26, 1801. [Google Scholar] [CrossRef] [PubMed]
  36. Muratov, E.N.; Varlamova, E.V.; Artemenko, A.G.; Polishchuk, P.G.; Kuz’min, V.E. Existing and Developing Approaches for QSAR Analysis of Mixtures. Mol. Inform. 2012, 31, 202–221. [Google Scholar] [CrossRef] [PubMed]
  37. Oprisiu, I.; Novotarskyi, S.; Tetko, I.V. Modeling of Non-Additive Mixture Properties Using the Online CHEmical Database and Modeling Environment (OCHEM). J. Cheminformatics 2013, 5, 4. [Google Scholar] [CrossRef] [Green Version]
  38. Halder, A.K.; Cordeiro, M.N.D.S. Development of Predictive Linear and Non-linear QSTR Models for Aliivibrio Fischeri Toxicity of Deep Eutectic Solvents. IJQSPR 2019, 4, 50–69. [Google Scholar] [CrossRef]
  39. ChemAxon. Standardizer; Version 15.9.14.0 Software; ChemAxon: Budapest, Hungary, 2010. [Google Scholar]
  40. Mauri, A.C.V.; Pavan, M.; Todeschini, R. Dragon Software: An Easy Approach to Molecular Descriptor calculations. MATCH Commun. Math. Comput. Chem. 2006, 56, 237–248. [Google Scholar]
  41. Hechinger, M.; Leonhard, K.; Marquardt, W. What is Wrong with Quantitative Structure–Property Relations Models Based on Three-Dimensional Descriptors? J. Chem. Inf. Model. 2012, 52, 1984–1993. [Google Scholar] [CrossRef]
  42. Raschka, S. MLxtend: Providing Machine Learning and Data Science Utilities and Extensions to Python’s Scientific Computing Stack. J. Open Source Softw. 2018, 3, 638. [Google Scholar] [CrossRef]
  43. Cover, T.; Hart, P. Nearest Neighbor Pattern Classification. IEEE Trans. Inform. Theory 1967, 13, 21–27. [Google Scholar] [CrossRef] [Green Version]
  44. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef] [Green Version]
  45. Boser, B.E.; Guyon, I.M.; Vapnik, V.N. A Training Algorithm for Optimal Margin Classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory ACM, Pittsburgh, PA, USA, 27–29 July 1992; pp. 144–152. [Google Scholar]
  46. Guang-Bin, H.; Babri, H.A. Upper Bounds on the Number of Hidden Neurons in Feedforward Networks with Arbitrary Bounded Nonlinear Activation Functions. IEEE Trans. Neural Netw. 1998, 9, 224–229. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  47. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
  48. Golbraikh, A.; Tropsha, A. Beware of Q2! J. Mol. Graph. Model. 2002, 20, 269–276. [Google Scholar] [CrossRef]
  49. Gramatica, P. On the Development and Validation of QSAR Models. Methods Mol. Biol. 2013, 930, 499–526. [Google Scholar] [PubMed]
  50. Roy, P.P.; Paul, S.; Mitra, I.; Roy, K. On Two Novel Parameters for Validation of Predictive QSAR Models. Molecules 2009, 14, 1660–1701. [Google Scholar]
  51. Ojha, P.K.; Roy, K. Comparative QSARs for Antimalarial Endochins: Importance of Descriptor-Thinning and Noise Reduction Prior to Feature Selection. Chemom. Intell. Lab. Syst. 2011, 109, 146–161. [Google Scholar] [CrossRef]
  52. Gramatica, P. Principles of QSAR Models Validation: Internal and External. QSAR Comb. Sci. 2007, 26, 694–701. [Google Scholar] [CrossRef]
  53. Hunter, J.D. Matplotlib: A 2D Graphics Environment. Comput. Sci. Eng. 2007, 9, 90–95. [Google Scholar] [CrossRef]
  54. Roy, K.; Ambure, P.; Kar, S.; Ojha, P.K. Is It Possible to Improve the Quality of Predictions from an “Intelligent” Use of Multiple QSAR/QSPR/QSTR Models? J. Chemom. 2018, 32, e2992. [Google Scholar] [CrossRef]
  55. Todeschini, R.; Consonni, V. Handbook of Molecular Descriptors; Wiley-VCH: Weinheim, Germany, 2000. [Google Scholar]
  56. Todeschini, R.; Consonni, V. Molecular Descriptors for Chemoinformatics, 2nd ed.; Wiley-VCH: Weinheim, Germany, 2009. [Google Scholar]
  57. Labute, P. A Widely Applicable Set of Descriptors. J. Mol. Graph. Model. 2000, 18, 464–477. [Google Scholar] [CrossRef]
  58. Reutlinger, M.; Koch, C.P.; Reker, D.; Todoroff, N.; Schneider, P.; Rodrigues, T.; Schneider, G. Chemically Advanced Template Search (CATS) for Scaffold-Hopping and Prospective Target Prediction for ‘Orphan’ Molecules. Mol. Inform. 2013, 32, 133–138. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  59. García-Domenech, R.; Julián-Ortiz, J.V. Antimicrobial Activity Characterization in a Heterogeneous Group of Compounds. J. Chem. Inf. Comput. Sci. 1998, 38, 445–449. [Google Scholar] [CrossRef] [PubMed]
  60. Khan, K.; Khan, P.M.; Lavado, G.; Valsecchi, C.; Pasqualini, J.; Baderna, D.; Marzo, M.; Lombardo, A.; Roy, K.; Benfenati, E. QSAR Modeling of Daphnia magna and Fish Toxicities of Biocides Using 2D Descriptors. Chemosphere 2019, 229, 8–17. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Basic workflow chart for the QSPR regression modeling, followed in this study.
Figure 1. Basic workflow chart for the QSPR regression modeling, followed in this study.
Molecules 27 04896 g001
Figure 2. Relative deviations (%RD) between the predicted and observed DES surface tensions (left) and histogram plot of the distribution of %RD values (right).
Figure 2. Relative deviations (%RD) between the predicted and observed DES surface tensions (left) and histogram plot of the distribution of %RD values (right).
Molecules 27 04896 g002
Figure 3. Predicted surface tension values vs. observed experimental ones (left) and Williams plot (right) obtained for model M12.
Figure 3. Predicted surface tension values vs. observed experimental ones (left) and Williams plot (right) obtained for model M12.
Molecules 27 04896 g003
Figure 4. Relative importance of the descriptors found in the best individual model M12.
Figure 4. Relative importance of the descriptors found in the best individual model M12.
Molecules 27 04896 g004
Figure 5. Comparison of surface tension calculated by the M12 model with literature data in the temperature range from 278.15-358.15 K for six DESs at atmospheric pressure. DES1: DL-menthol and octanoic acid (3:1); DES2: tetrabutylammonium chloride and arginine (8:1); DES3: tetraprpylammonium bromide and ethylene glycol (1:6); DES3: tetraprpylammonium bromide and ethylene glycol (1:6); DES4: N,N-diethylethanolammonium chloride and glycerol (1:5); DES5: choline chloride and glycerol (1:5); DES6: choline chloride and D-glucose (1:1).
Figure 5. Comparison of surface tension calculated by the M12 model with literature data in the temperature range from 278.15-358.15 K for six DESs at atmospheric pressure. DES1: DL-menthol and octanoic acid (3:1); DES2: tetrabutylammonium chloride and arginine (8:1); DES3: tetraprpylammonium bromide and ethylene glycol (1:6); DES3: tetraprpylammonium bromide and ethylene glycol (1:6); DES4: N,N-diethylethanolammonium chloride and glycerol (1:5); DES5: choline chloride and glycerol (1:5); DES6: choline chloride and D-glucose (1:1).
Molecules 27 04896 g005
Table 1. Statistical results of the top 15 unique QSPR regression models generated.
Table 1. Statistical results of the top 15 unique QSPR regression models generated.
ModelSeed; IntervalDescriptor aSplit bScoring cNtrdQ2LOOeQ2LCOfMAELOO gNtshR2PrediMAEtest jAvg k
M012; 5Method-1MONMAE4080.8840.8542.5861270.9073.8630.882
M024; 5Method-2MONMAE4350.8730.8492.6581000.8993.9660.874
M031; 4Method-2MONMAE4080.8980.8542.7751270.8652.5690.872
M045; 4Method-2MONMAE4090.8980.8552.7711260.8622.5840.872
M054; 5Method-1MONMAE4350.8710.8392.6351000.8984.0390.869
M063; 5Method-1MONMAE4430.8810.8492.671920.8714.0730.867
M071; 3Method-1MONMAE3590.9010.8622.3691760.8364.2230.866
M084; 5Method-1MOR24350.8830.8582.8541000.8554.6950.865
M094; 3Method-1MONMAE3600.9060.8542.3221750.834.3890.864
M101; 2Method-2COR23010.9310.9031.6602340.7546.0300.862
M114; 5Method-2MO5-fold4350.8650.8413.0001000.8764.2820.861
M124; 3Method-2MOR23600.9080.8822.6081750.7835.1340.858
M134; 5Method-1MO5-fold4350.8710.8452.7061000.8574.6260.858
M141; 4Method-1MO10-fold4080.8690.8493.3401270.8472.0500.855
M155; 4Method-1MO10-fold4090.8690.853.3361260.8442.0520.855
a Descriptor calculation method used. b Data splitting scheme utilized. c Scoring condition applied. d Number of data points in the training set. e Leave-one-out cross-validation determination coefficient. f Leave-chemical-out cross-validation determination coefficient. g LOO cross-validation mean absolute error. h Number of data points in the test set. i Variance explained for external prediction. j Mean absolute error of the test set. k Average value of Q2LOO, Q2LCO and R2Pred.
Table 2. Internal and external predictivity for the top 15 regression models against the training, test and external validation sets a.
Table 2. Internal and external predictivity for the top 15 regression models against the training, test and external validation sets a.
Model bTraining SetTest SetExternal Validation Set
NtrQ2LOOQ2LCO%AARDNtsR2Pred%AARDNexcR2Pred%AARD
M014080.8840.8545.5411270.90711.84384−0.33515.931
M024350.8730.8496.111000.8997.517840.46412.176
M034080.8980.8546.0631270.8655.41884−0.19615.833
M044090.8980.8556.0571260.8625.44684−0.1915.818
M054350.8710.8395.9651000.8987.538840.39211.838
M064430.8810.8496.052920.8717.65840.51611.021
M073590.9010.8625.2221760.8369.33184−7.22527.7
M084350.8830.8586.61000.8558.456840.46611.45
M093600.9060.8545.2021750.839.872840.6888.527
M103010.9310.9034.2082340.75412.754840.7347.777
M114350.8650.8416.8751000.8767.78840.56810.047
M123600.9080.8825.8051750.78311.155840.8624.418
M134350.8710.8456.2041000.8578.442840.54411.093
M144080.8690.8497.3441270.8473.943840.35211.642
M154090.8690.857.3391260.8443.929840.35311.638
a For the meaning of Ntr, Nts, Q2LOO, Q2LCO, R2Pred and %AARD, please check the footnotes of Table 1. b The more predictive models are marked in bold. c Number of data points in the external validation set.
Table 3. The five WM molecular descriptors selected for model M12—Equation (3).
Table 3. The five WM molecular descriptors selected for model M12—Equation (3).
SymbolDefinition [55,56,57,58]Class
P_VSA_MR_6pmixP_VSA-like on Molar Refractivity, at bin size 6P_VSA-like descriptor a (Dpmix type)
Eig02_EA(dm)pmixeigenvalue n. 2 from edge adjacency matrix, weighted by dipole momentEdge adjacency indices (Dpmix type)
CATS2D_02_ANpmixCATS2D Acceptor-Negative at lag 22D CATS b (Dpmix type)
BLTF96pmixVerhaar Fish base-line toxicity from MLOGP (mmol/L)Molecular properties (Dpmix type)
MATS5snmixMoran autocorrelation of lag 5, weighted by I-state c2D autocorrelations (Dnmix type)
a P_VSA-like descriptors stand for the van der Waals surface area (VSA) with a particular property (P), in this case, the molar refractivity (MR) [57]. b Chemically Advanced Template Search (CATS) descriptors expressly designed to identify scaffold hops [58]. c I-states are based on the Kier-Hall atomic electronegativity modified by the number of σ bonds, number of hydrogen atoms, number of electrons in π orbitals, and number of lone pair electrons [55,56].
Table 4. MLR statistical results for model M12—Equation (3) a.
Table 4. MLR statistical results for model M12—Equation (3) a.
Training SetTest SetExternal Set
Ntr = 360;Nts = 175;Nex = 84;
R2 = 0.916; R2Adj = 0.915; F(6353) = 642.4;R2Pred = 0.783;R2Pred = 0.862;
Q2LOO = 0.908; MAELOO = 2.608; Q2LCO = 0.882; MAELCO = 3.122;MAE = 5.134;MAE = 1.777;
rm2(LOO) = 0.869, ∆rm2(LOO) = 0.066;rm2(test) = 0.573, ∆rm2(test) = 0.197;rm2(ext) = 0.767, ∆rm2(ext) = 0.097;
%AARD = 5.805; cRP2 (1000 runs) = 0.908%AARD = 11.155%AARD = 4.418
a R2: Determination coefficient; R2Adj: Adjusted R2; F(6,353): Fisher’s statistic; MAELOO and MAELCO: Leave-one-out and leave-chemicals-out cross-validation mean absolute error, respectively; rm2(LOO) and ∆rm2(LOO): LOO cross-validation rm2 and its associated deviation, respectively; rm2(test) and ∆rm2(test): rm2 of the test set and its associated deviation, respectively; rm2(ext) and ∆rm2(ext): rm2 of the external test set and its associated deviation, respectively. For the meaning of Ntr, Nts, Nex, Q2LOO, Q2LCO, R2Pred, and %AARD, check the footnotes of Table 1 and Table 2.
Table 5. Summary of the statistical parameters obtained from non-linear models based on different machine learning methods.
Table 5. Summary of the statistical parameters obtained from non-linear models based on different machine learning methods.
Method aTraining Set (Q25-fold)Test Set (R2Pred)External Set (R2Pred)
k-NN0.1760.597not determined
RF0.4730.746not determined
SVM0.8740.7740.767
MLP0.5410.269not determined
GB0.4530.471not determined
a k-NN: k-Nearest Neighbors; RF: Random Forests; SVM: Support Vector Machines; NN-MLP: Neural Network Multilayer Perceptron; GB: Gradient boosting.
Table 6. External predictivity of the best individual model M12 and consensus models (C1-C4) built with different combinations of the top three models (M09, M10 and M12).
Table 6. External predictivity of the best individual model M12 and consensus models (C1-C4) built with different combinations of the top three models (M09, M10 and M12).
Consensus ModelsModelsCM aR2Predbrm2(test)cMAEtest d%AARD e
C1M09, M10, M1200.8640.8011.8694.459
C2M10 and M1220.8230.8122.0894.732
C3M09 and M1020.8530.7871.9794.393
C4M09 and M12None--------------------
M12----------0.8620.7671.7774.418
a Method of Intelligent consensus prediction that yielded the best external validation result. b Variance explained for the external prediction. c Metric rm2 for the test set. d Mean absolute error for the test set. e Absolute average relative deviation.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Halder, A.K.; Haghbakhsh, R.; Voroshylova, I.V.; Duarte, A.R.C.; Cordeiro, M.N.D.S. Predicting the Surface Tension of Deep Eutectic Solvents: A Step Forward in the Use of Greener Solvents. Molecules 2022, 27, 4896. https://doi.org/10.3390/molecules27154896

AMA Style

Halder AK, Haghbakhsh R, Voroshylova IV, Duarte ARC, Cordeiro MNDS. Predicting the Surface Tension of Deep Eutectic Solvents: A Step Forward in the Use of Greener Solvents. Molecules. 2022; 27(15):4896. https://doi.org/10.3390/molecules27154896

Chicago/Turabian Style

Halder, Amit Kumar, Reza Haghbakhsh, Iuliia V. Voroshylova, Ana Rita C. Duarte, and Maria Natalia D. S. Cordeiro. 2022. "Predicting the Surface Tension of Deep Eutectic Solvents: A Step Forward in the Use of Greener Solvents" Molecules 27, no. 15: 4896. https://doi.org/10.3390/molecules27154896

Article Metrics

Back to TopTop