1. Introduction
Alongside the world of material movements, where everyday events such as wind, rain, and the movement of clouds are visible and tangible, there is a world of probabilistic actions, accidents, and tendencies that influence each other but are neither visible nor tangible to us. Quantitative structure-property/activity relationships (QSPR/QSAR) may allow one to look into this world of accidents and mutually influencing trends.
There is no mysticism here, but the phenomena occurring in such a space are not always described ideally and reliably. In other words, situations that defy logic can be encountered. For example, the quality of calculations (models) can be affected by the collection of substances available in the database, as well as by the priorities and criteria selected in the software used for QSPR/QSAR simulation.
However, in any case, it remains an indisputable axiom that models of random events are knowledge only when they are understandable and allow the possibility of verification by establishing and confirming their reproducibility.
Traditional QSPR were initially based on molecular structure [1,2,3] and later came to involve an extended set of descriptors that included information not only on molecular architecture but also on the magnitudes of various physicochemical properties [4]. This would be a good fit for nano-QSPR, if not for the lack of a clear relationship between the molecular structure and the attractive/useful/dangerous nano-physicochemical properties of the respective nanomaterials [5,6,7,8,9,10,11]. It cannot be said that the molecular architecture does not affect the physicochemical properties of nano-substances in any way, but this influence is very intricate for nano-substances. That is, whereas for small organic molecules a modification of the geometrical/topological arrangement of a pair of atoms necessarily changes the physicochemical parameters, for fullerenes, and even more so for multilayer nanotubes, changes in the arrangement of a pair of substituent atoms are very difficult to establish and/or measure experimentally. Naturally, simple homologous series, which formed the basis of the first QSPR experiments on organic compounds [1,2,3], are extremely rare for nano-substances, owing to the high cost of and weak motivation for experimental work designed to provide the corresponding numerical data on the physicochemical parameters of homologous series of fullerenes.
How do we obtain information on all the promised applications of nanomaterials? How do we select and use the unique potential of nanomaterials? Hints, hypotheses, and intuition must be transformed into knowledge.
Can a model be knowledge?
Knowledge is a tool. It is preferable if knowledge is convenient for use in solving practical problems. Consequently, a model can be a way to reach knowledge when all excess is removed from the model and only the necessary remains. It is not surprising that a brief instruction may be more useful than an excessively detailed one. That is why most researchers profess the principle that “to understand is to simplify”.
Taking into account the absence of large databases on various nanomaterials and the availability of sufficiently large arrays of experimental data on the interaction of individual nanomaterials with different organic substances (for example, with solvents [12]), one should look for the possibility of constructing models of the behaviour of nanomaterials in interaction with “traditional” organic substances.
In the case of the QSPR study of the solubility of C60 and C70 fullerenes [12], the traditional paradigm of QSPR/QSAR simulation is represented as
This may be extended as
where S = solubility, F = mathematical function, and M = molecular structure.
The transition from the model expressed by Equation (1) to the model expressed by Equation (2) is essentially a transition from traditional QSPR to nano-QSPR.
It should be noted that the model expressed by Equation (2), like the model expressed by Equation (1), must comply with the requirements for QSPR formulated as the well-known OECD principles [13]:
A defined endpoint (including experimental protocol);
An unambiguous algorithm;
A defined domain of applicability;
Appropriate measures of goodness-of-fit, robustness, and predictive power;
A mechanistic interpretation, when it is possible.
One can use these principles for nano-QSPR expressed by Equation (2). Can the OECD principles be improved? Latent attempts to do this can be seen in many studies [14,15,16,17,18,19,20,21].
The approach considered here is that each object (solvent = SMILES, fullerene = [C60] or [C70]) is represented by a character string. The program divides the symbols into special groups, for which the so-called correlation weights (some coefficients) are found. The descriptor for each object is the sum of the correlation weights. The Monte Carlo method is used to find such correlation weights that provide the maximum value of the objective function. This optimization is carried out on the basis of partitioning the available data into special subsets: an active training set (its task is to develop a model), a passive training set (its task is to check the objectivity of the current model), a calibration set (its task is to detect the start of the overtraining), and the validation set to assess the predictive potential of the final model.
2. Results
Three schemes for constructing models of the solubility of fullerenes C60 and C70 in organic solvents were evaluated:
First scheme: the models were constructed using new components of the model, named correlation weights of fragments of local symmetry (FLS). Here, the Monte Carlo optimization of the extended set of quasi-SMILES codes was carried out without the vector of the ideality of correlation, which has two components: the index of ideality of correlation (IIC) and the correlation intensity index (CII).
Second scheme: the models were constructed via the Monte Carlo optimization of the set of quasi-SMILES codes, without correlation weights of FLS, using the above-mentioned vector of the ideality of correlation.
Third scheme: the models were built using the Monte Carlo optimization of an extended list of the correlation weights, including FLS, together with the vector of the ideality of correlation.
Figure 1 contains the graphical representation of the simulation processes observed for the three schemes in the case of split 1. One can see that the third scheme appears to be the most promising.
In addition, one can see that the practically justified optimal descriptor for the first scheme is DCW(3,5). In contrast, the preferable optimal descriptor for the second and third schemes is DCW(3,15).
According to the principle “QSAR/QSPR is a random event”, it is necessary to study the statistical quality of models observed under different distributions in the training set (here, the set is structured into three components: active training, passive training, and calibration sets).
Table 1 contains the results of applying the first scheme on splits 1–10.
One can see that the determination coefficients for the active training, passive training, and calibration sets are, as a rule, equal to or slightly larger than the determination coefficient of the validation set. However, if the optimization is continued, the determination coefficients for the active and passive training sets will increase, whereas the determination coefficient for the external control (validation) set will decrease (Figure 1).
The second scheme (Table 2) is characterized by a significant decrease in the statistical quality for the active and passive training sets, accompanied by a noticeable increase in the coefficient of determination for the validation set. This confirms the influence of IIC and CII described in the literature [18]: IIC and CII improve the statistical quality of the QSPR/QSAR models for the validation set, but to the detriment of the statistical quality of the model for the training set.
The statistical quality, as well as the general logic, of the models obtained using the third scheme (Table 3) is very similar, but not identical, to that of the models obtained using the second scheme.
Figure 2 contains the graphical representations of the models obtained using the second and third schemes for split 1. It should be noted that although the statistical quality of the model for the active and passive training sets is low, these sets contain two latent correlations (Figure 2). Apparently, this is an effect of the vector of the ideality of correlation. Analogous pairs of correlations were observed in computer experiments described in the literature [20,21].
Figure 2 indicates that latent correlations on active and passive training sets are statistically more significant than total correlations on these sets.
It is under these circumstances that the problem arises regarding how to distinguish between the two approaches (second and third schemes). Which approach is more efficient, more precise, and more reliable?
Figure 1 shows some improvement in the statistical quality of the model for the third scheme compared with the results observed for the second scheme. However, this relates only to split 1. Will this conclusion/hypothesis hold for splits 2, 3, …, 10?
2.1. System of Self-Consistent Models Observed for the Second Scheme
Table 4 contains the test results of the predictive potential of models with external validation sets from which the quasi-SMILES involved in constructing the tested models were excluded.
It can be seen that the results of applying the models to different test sets, after removing the quasi-SMILES that participated in the construction of the corresponding models, are far from being the same. However, in all cases, there is a good predictive potential. The average value of the determination coefficients for external validation sets is 0.8989 ± 0.0267.
2.2. System of Self-Consistent Models Observed for the Third Scheme
Table 5 contains the test results of the predictive potential of models with external validation sets from which the quasi-SMILES involved in constructing the tested models were excluded.
It can be seen again that the results of applying the models to different test sets, after removing the quasi-SMILES that participated in the construction of the corresponding models, are far from being the same. However, in all cases, there is a good predictive potential. The average value of the determination coefficients for external validation sets is 0.9255 ± 0.0163.
2.3. The Comparison of Second and Third Schemes
The predictive potential of models built using the third scheme is better than that of models built using the second scheme. The dispersion of the determination coefficient values for the third scheme is also smaller than that for the second scheme (0.0163 versus 0.0267).
Figure 3 shows the difference in the predictive potential of models obtained using the second and third schemes. One can see that the second scheme has the better predictive potential for splits #2, #7, and #10. However, all other splits demonstrate the advantage of using the third scheme.
2.4. What Do QSAR/QSPR and Nano-QSAR/QSPR Have in Common?
First, QSAR/QSPR and nano-QSAR/QSPR are random events.
Second, the predictive potential in both cases can change markedly depending on the distribution of available data into training and validation sets.
Third, both QSAR/QSPR and nano-QSAR/QSPR cannot replace a natural experiment in measuring the values of various “usual” and nano-endpoints.
3. Discussion
The principle “QSAR/QSPR is a random event” is confirmed by the results obtained in this study: completely homogeneous distributions into the training and control subsets, for the same approach to simulating the solubility of fullerenes in organic solvents, yield different values of the statistical characteristics of the models (Table 1, Table 2 and Table 3).
The proposed approach provides a certain “mathematical expectation” for the models obtained using the second and third schemes. It thus becomes possible to compare the average values and, on this basis, to put forward a fairly reasonable hypothesis that the third scheme provides better models than the second scheme. It is appropriate to note that reliable criteria for the quality of models are not only the average values of the coefficients of determination but also their variances, as was observed in previous studies where systems of self-consistent models were used [20,21].
The FLS described here may not be a universal tool for developing arbitrary models; it is only a technique that has proven successful for this task (i.e., for developing a model for the solubility of fullerenes in organic solvents). However, the vector of the ideality of correlation (or perhaps the IIC and the CII separately) can perhaps be recognized as a useful and versatile tool for testing, and maybe even for improving, the predictive potential of traditional QSAR/QSPR and nano-QSAR/QSPR.
In this study, a quite simple version of quasi-SMILES has been applied to develop the models. However, one can easily extend the list of codes for quasi-SMILES to express more detailed and complex experimental conditions. In other words, one can hope that quasi-SMILES will serve as a language of communication between “classic” experimentalists who study nanomaterials and developers of nano-QSAR/QSPR models. A certain trend towards recognizing this language and even some experience in the practical use of this language have already been outlined [22].
4. Materials and Methods
4.1. Data
The experimental solubility values of C60 and C70 fullerenes in diverse solvents were reported as mole fractions determined at 298 K [12].
Table 6 contains the list of pairs of duplicates observed in [12]. From each pair of duplicates, only one was retained for further analysis. After this removal, 206 quasi-SMILES representing various pairs of fullerenes (C60 or C70) and solvents were used for further computational experiments.
To this end, these quasi-SMILES were randomly distributed into the following subsets: (i) active training set (25%); (ii) passive training set (25%); (iii) calibration set (25%); and (iv) validation set (25%). Ten splits obtained with these proportions are examined here.
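As a minimal sketch of how such splits could be generated, the following Python fragment shuffles a list of identifiers and deals them into four subsets. The function name `split_four_ways`, the placeholder identifiers, and the use of seeds 1 to 10 are assumptions made for illustration; the procedure actually used to generate the ten splits may differ.

```python
import random

def split_four_ways(items, seed):
    """Randomly distribute items into four subsets of (almost) equal size."""
    rng = random.Random(seed)          # seeded generator, so each split is reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    names = ["active", "passive", "calibration", "validation"]
    # Deal the shuffled items round-robin; subset sizes differ by at most one.
    return {name: shuffled[i::4] for i, name in enumerate(names)}

# 206 placeholder identifiers stand in for the 206 quasi-SMILES (hypothetical names).
records = [f"qs_{i}" for i in range(206)]
splits = [split_four_ways(records, seed) for seed in range(1, 11)]
sizes = sorted(len(subset) for subset in splits[0].values())
```

With 206 records, the four subsets hold 52, 52, 51, and 51 items, which is the closest achievable approximation to equal quarters.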
Table 7 contains the measures of identity for ten such splits examined in this study.
Each of the above sets has a defined task. The active training set is used to build the model: molecular features extracted from the quasi-SMILES of the active training set are involved in the Monte Carlo optimization, which searches for correlation weights of these features that maximize the target function. The target function is calculated from the descriptors (each descriptor is the sum of the correlation weights of all the components of a quasi-SMILES) and the endpoint values of the active training set. The task of the passive training set is to check whether the model obtained for the active training set is satisfactory for quasi-SMILES that were not involved in the active training set. The calibration set should detect the start of overtraining (overfitting); the optimization must stop when overtraining starts. After the optimization procedure has stopped, the validation set is used to assess the predictive potential of the obtained model.
4.2. Optimal Descriptor
The model of fullerene solubility in organic solvents studied here is as follows:
where DCW(T, N) is the optimal descriptor.
The optimal descriptor is the basis for calculating the model value of the solubility of fullerenes in organic solvents from the correlation weights of quasi-SMILES codes representing the “fullerene-solvent” systems. The quasi-SMILES reflect the presence of nano-features by two codes, [C60] and [C70], which indicate the fullerenes C60 and C70, respectively. From the traditional SMILES representing the solvent, data on the atomic composition of the solvent (denoted as S) and interatomic bonds (denoted as SS) are extracted. Here, atoms mean SMILES atoms, i.e., one symbol (e.g., ‘C’, ‘N’, ‘=’) or a group of symbols that cannot be considered separately (e.g., ‘Cl’, ‘%11’). In this study, the so-called fragments of local symmetry (FLS) are additionally used. Three types of FLS are considered: (i) XYX; (ii) XYYX; and (iii) XYZYX, where X, Y, and Z are arbitrary symbols, but X is not equal to Y. FLS are characteristics of the SMILES/quasi-SMILES strings. Generally, they do not reflect molecular features that are correlated with symmetry in the traditional sense. Nevertheless, as SMILES or quasi-SMILES features, they can be useful participants in the described optimization procedure since they improve the predictive potential of the models obtained using the approach considered here.
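The three FLS patterns can be illustrated with a short Python sketch that scans a list of SMILES atoms for windows of the form XYX, XYYX, and XYZYX with X different from Y. The function name `find_fls`, the requirement that the input is already tokenized, and the choice to report every matching window (rather than, say, one code per pattern type) are assumptions made for illustration and need not match the software used here.

```python
def find_fls(tokens):
    """Return (pattern, fragment) pairs for every FLS window in a token list."""
    found = []
    for i in range(len(tokens)):
        w3, w4, w5 = tokens[i:i + 3], tokens[i:i + 4], tokens[i:i + 5]
        # XYX: first and third symbols coincide, middle one differs.
        if len(w3) == 3 and w3[0] == w3[2] and w3[0] != w3[1]:
            found.append(("XYX", "".join(w3)))
        # XYYX: mirrored pair around a doubled symbol.
        if len(w4) == 4 and w4[0] == w4[3] and w4[1] == w4[2] and w4[0] != w4[1]:
            found.append(("XYYX", "".join(w4)))
        # XYZYX: mirrored pairs around an arbitrary central symbol Z.
        if len(w5) == 5 and w5[0] == w5[4] and w5[1] == w5[3] and w5[0] != w5[1]:
            found.append(("XYZYX", "".join(w5)))
    return found

ether = find_fls(["C", "O", "C"])             # SMILES atoms of "COC"
peroxide = find_fls(["C", "O", "O", "C"])     # SMILES atoms of "COOC"
allene = find_fls(["C", "=", "C", "=", "C"])  # SMILES atoms of "C=C=C"
```

In the last example, the string contains one XYZYX window ("C=C=C") and three overlapping XYX windows, which shows that FLS are properties of the character string rather than of molecular geometry.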
The above-listed features extracted from quasi-SMILES have so-called correlation weights (CW) obtained via the Monte Carlo optimization. Thus, the optimal descriptor is calculated as follows:
where T is the threshold, i.e., an integer used to separate codes into two categories. If a code has a frequency in the active training set less than T, it is considered rare and removed from the simulation process. If the code has a frequency in the active training set larger than T, it is considered active and involved in the simulation process. N is the number of epochs of the Monte Carlo optimization.
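The role of the threshold T can be sketched in Python as follows. The descriptor is computed as the sum of the correlation weights of the active codes, as described above. The function names, the toy code lists, and the numerical weights are hypothetical. Two further assumptions are made: frequency is counted as the total number of occurrences in the active training set, and a code whose frequency equals T is treated as active.

```python
from collections import Counter

def active_codes(training_code_lists, threshold):
    """Codes whose frequency in the active training set reaches the threshold T."""
    freq = Counter(code for codes in training_code_lists for code in codes)
    # Assumption: a frequency exactly equal to T counts as active.
    return {code for code, n in freq.items() if n >= threshold}

def dcw(codes, weights, active):
    """Sum of correlation weights over active codes; rare codes contribute nothing."""
    return sum(weights.get(code, 0.0) for code in codes if code in active)

# Toy active training set (hypothetical quasi-SMILES codes).
training = [
    ["C", "C", "O", "[C60]"],
    ["C", "N", "[C60]"],
    ["C", "Cl", "[C70]"],
]
active = active_codes(training, threshold=2)
# Hypothetical correlation weights, as they might come out of an optimization run.
weights = {"C": 0.5, "[C60]": -1.0, "O": 2.0}
value = dcw(["C", "C", "O", "[C60]"], weights, active)
```

Here only ‘C’ and ‘[C60]’ occur at least twice, so the weight of ‘O’ is ignored and the descriptor of the first quasi-SMILES is 0.5 + 0.5 − 1.0 = 0.0.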
4.3. The Monte Carlo Optimization
The correlation weights necessary to calculate the optimal descriptors DCW(T,N) are obtained using the Monte Carlo optimization based on special target functions.
Equation (4) requires numerical values of these correlation weights. Here, two target functions for the Monte Carlo optimization are examined:
In these target functions, the two correlation coefficients are those between the observed and predicted endpoints for the active and passive training sets, respectively; IIC is the index of ideality of correlation [14,15]; and CII is the correlation intensity index [14,15].
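A minimal sketch of a Monte Carlo search for correlation weights is given below. It is an illustration under stated assumptions, not the optimization procedure of the software used here. The target function is assumed to be the sum of the correlation coefficients for the active and passive training sets minus a small penalty on their difference; the IIC and CII terms are omitted. The search is a simple random perturbation of one weight at a time that keeps only improvements. All names, step sizes, and toy data are hypothetical.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom > 0 else 0.0

def target(weights, active, passive):
    """Assumed target: r_active + r_passive - 0.1 * |r_active - r_passive|."""
    def r(dataset):
        descriptors = [sum(weights.get(c, 0.0) for c in codes) for codes, _ in dataset]
        return pearson(descriptors, [y for _, y in dataset])
    r_a, r_p = r(active), r(passive)
    return r_a + r_p - 0.1 * abs(r_a - r_p)

def optimize(active, passive, epochs, seed=0):
    """Perturb one weight at a time and keep the change only if the target grows."""
    rng = random.Random(seed)
    codes = sorted({c for cs, _ in active for c in cs})
    weights = {c: rng.uniform(-1.0, 1.0) for c in codes}
    start = best = target(weights, active, passive)
    for _ in range(epochs):
        for code in codes:
            old = weights[code]
            weights[code] = old + rng.uniform(-0.5, 0.5)
            trial = target(weights, active, passive)
            if trial > best:
                best = trial
            else:
                weights[code] = old   # reject the move
    return weights, start, best

# Toy data: (list of codes, endpoint value); purely illustrative.
active_set = [(["C", "C", "O"], 1.0), (["C", "N"], 0.5),
              (["C", "O", "O"], 2.0), (["N", "N"], 0.1)]
passive_set = [(["C", "O"], 1.2), (["N", "C", "C"], 0.6), (["O", "O"], 1.9)]
weights, start, best = optimize(active_set, passive_set, epochs=15, seed=1)
```

Because only improving moves are accepted, the final target value can never fall below the initial one. In a full workflow, the calibration set would be monitored after each epoch to decide when to stop.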
Figure 1 shows the history of the optimization process for various options for the optimal descriptor and the objective function. A comparison of the results presented in Figure 1 indicates that the most promising option for obtaining the best predictive potential is the one where IIC, CII, and FLS are used (the third scheme).
Table 8 contains the correlation weights for quasi-SMILES codes for the model (split 1).
Table 9 contains quasi-SMILES, split into active (A) and passive (P) training sets, calibration (C), and validation (V) sets, and the experimental and calculated values of fullerene C60 and C70 solubility in an organic solvent.
Table 10 shows an example of the DCW(3,15) calculation.
4.4. The Applicability Domain
The applicability domain is considered in many studies devoted to QSPR/QSAR analysis [16]. The main question is, “Can the resulting model be applied to a given substance of interest?”. However, the counter-question is also logical: is it not better to determine for which substances the model is intended before developing it [17]? Can the model’s applicability domain change if one changes the distribution of available data into training and validation sets?
It should be noted that, for the approach studied here, the applicability domain changes slightly from split to split.
The applicability domain for the described CORAL models is defined via the so-called statistical defects of the codes used in quasi-SMILES. These defects are calculated as follows:
where P(Sk), P′(Sk), and P″(Sk) are the probabilities of Sk in the active training set, passive training set, and calibration set, respectively; N(Sk), N′(Sk), and N″(Sk) are the frequencies of Sk in the active training set, passive training set, and calibration set, respectively. The statistical defects of quasi-SMILES (Dj) are calculated as follows:
where NA is the number of non-blocked codes in quasi-SMILES.
A quasi-SMILES falls in the applicability domain if
where the reference value is the average statistical defect for the active training set.
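The following Python sketch shows one way such statistical defects could be computed. The specific formulas are assumptions: the defect of a code is taken as the sum, over the three pairs of sets, of the absolute difference of probabilities divided by the sum of frequencies; the defect of a quasi-SMILES is the sum of the defects of its codes; and a quasi-SMILES is accepted if its defect is below twice the average defect for the active training set. The probability of a code is assumed to be the fraction of quasi-SMILES in a set that contain it. All function names and the factor of 2 are hypothetical.

```python
from collections import Counter

def code_stats(code_lists):
    """Frequency N(Sk) and probability P(Sk) of each code in one subset."""
    freq = Counter(c for codes in code_lists for c in set(codes))
    prob = {c: n / len(code_lists) for c, n in freq.items()}
    return freq, prob

def code_defects(active, passive, calibration):
    """Assumed defect of each code, summed over the three pairs of subsets."""
    stats = [code_stats(s) for s in (active, passive, calibration)]
    all_codes = set().union(*(freq.keys() for freq, _ in stats))
    defects = {}
    for code in all_codes:
        d = 0.0
        for a, b in ((0, 1), (0, 2), (1, 2)):
            n = stats[a][0][code] + stats[b][0][code]   # Counter gives 0 if absent
            if n > 0:                                   # skip pairs where the code never occurs
                d += abs(stats[a][1].get(code, 0.0) - stats[b][1].get(code, 0.0)) / n
        defects[code] = d
    return defects

def quasi_smiles_defect(codes, defects):
    """Defect Dj of one quasi-SMILES: sum of the defects of its codes."""
    return sum(defects.get(c, 0.0) for c in codes)

def in_domain(codes, defects, mean_active_defect, factor=2.0):
    """Assumed criterion: Dj below factor times the average active-set defect."""
    return quasi_smiles_defect(codes, defects) < factor * mean_active_defect

# Toy subsets of code lists (hypothetical).
active = [["C", "O"], ["C", "N"]]
passive = [["C", "O"], ["C", "C"]]
calibration = [["C"], ["C", "O"]]
defects = code_defects(active, passive, calibration)
mean_active = sum(quasi_smiles_defect(q, defects) for q in active) / len(active)
```

In this toy case, ‘N’ occurs only in the active training set and therefore carries the whole defect, so the quasi-SMILES containing it falls outside the domain while the other one stays inside.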
4.5. Mechanistic Interpretation
Using the numerical data on the correlation weights of the codes applied in quasi-SMILES, observed in several runs of the Monte Carlo optimization, one can extract three categories of these codes:
Codes that have a positive value of the correlation weight in all runs. These are promoters of endpoint increase;
Codes with a negative correlation weight value in all runs. These are promoters of endpoint decrease;
Codes with negative and positive correlation weight values in different optimization runs. These codes have unclear roles (one cannot classify these features as promoters of increase or decrease for endpoint).
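The three categories above can be assigned mechanically once the correlation weights from several runs are available. The sketch below assumes that each run is stored as a dictionary mapping codes to correlation weights; the function name, the labels, and the example weights are hypothetical, and a code missing from a run or having a zero weight is treated here as having an unclear role.

```python
def classify_codes(runs):
    """Label each code by the sign of its correlation weight across all runs."""
    labels = {}
    for code in set().union(*runs):
        values = [run.get(code) for run in runs]
        if all(v is not None and v > 0 for v in values):
            labels[code] = "promoter of increase"
        elif all(v is not None and v < 0 for v in values):
            labels[code] = "promoter of decrease"
        else:
            labels[code] = "unclear"
    return labels

# Hypothetical correlation weights from two optimization runs.
runs = [
    {"C": 0.5, "O": -0.2, "N": 0.1},
    {"C": 0.7, "O": -0.4, "N": -0.3},
]
labels = classify_codes(runs)
```

For these illustrative weights, ‘C’ is labelled a promoter of increase, ‘O’ a promoter of decrease, and ‘N’ unclear.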
4.6. System of Self-Consistent Models
The reliability of an approach can be assessed by the so-called system of self-consistent models [18,19]. The main idea of such a system is to test the performance of an approach on many random splits of the available data into training and validation subsets. This task can be represented by a matrix of determination coefficients, each obtained by applying the model built using one split (e.g., split 1) to the validation set of another split (e.g., split 2). Suppose some quasi-SMILES allocated to the validation set of split 2 are at the same time present in the training or calibration sets of split 1. In that case, they may improve the statistical quality of model 1 for the split 2 validation set.
In order for the assessment of the statistical quality of model 1 for the validation set of split 2 to be adequate, it is necessary to remove the abovementioned quasi-SMILES from consideration. It can be expressed as the following:
Figure 4 explains the meaning of the asterisks in matrix (11). The principles for selecting quasi-SMILES in the validation set of split 2 to assess the predictive potential of model 1 extend directly to arbitrary pairs of the i-th model vs. the j-th split (i ≠ j).
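The construction of such a matrix can be sketched in Python as follows. Each split is assumed to be stored as a dictionary of four sets of identifiers, each model as a callable returning a predicted endpoint, and the determination coefficient as the squared Pearson correlation. These representations, the function names, the toy data, and the minimum of three points required for an entry are all assumptions made for illustration.

```python
import math

def determination(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mo, mp = sum(observed) / n, sum(predicted) / n
    sop = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    soo = sum((o - mo) ** 2 for o in observed)
    spp = sum((p - mp) ** 2 for p in predicted)
    return (sop * sop) / (soo * spp) if soo > 0 and spp > 0 else 0.0

def filtered_validation(model_split, target_split):
    """Validation items of target_split that model_split never saw while fitting."""
    seen = model_split["active"] | model_split["passive"] | model_split["calibration"]
    return target_split["validation"] - seen

def consistency_matrix(splits, predictors, endpoint):
    """Entry [i][j]: determination coefficient of model i on the filtered validation set of split j."""
    matrix = []
    for split_i, predict in zip(splits, predictors):
        row = []
        for split_j in splits:
            ids = sorted(filtered_validation(split_i, split_j))
            obs = [endpoint[q] for q in ids]
            pred = [predict(q) for q in ids]
            # Too few points give no meaningful correlation; mark the entry as missing.
            row.append(determination(obs, pred) if len(ids) >= 3 else None)
        matrix.append(row)
    return matrix

# Toy data: 12 hypothetical quasi-SMILES identifiers with made-up endpoint values.
endpoint = {f"q{k}": float(k) for k in range(12)}
split_a = {"active": {"q0", "q1", "q2"}, "passive": {"q3", "q4", "q5"},
           "calibration": {"q6", "q7", "q8"}, "validation": {"q9", "q10", "q11"}}
split_b = {"active": {"q0", "q3", "q6"}, "passive": {"q1", "q4", "q9"},
           "calibration": {"q2", "q5", "q7"}, "validation": {"q8", "q10", "q11"}}
model = lambda q: 2.0 * endpoint[q] + 1.0     # stand-in for a fitted model
matrix = consistency_matrix([split_a, split_b], [model, model], endpoint)
```

If every split partitions the same data into the four subsets, the filtered validation set for model i and split j is simply the intersection of the two validation sets, which is why off-diagonal entries are usually based on fewer points than diagonal ones.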
4.7. Comparison with Other Models
The unusual influence of IIC and CII on the simulation process, which improves the statistical quality of the model for the calibration sets, invites a comparison of different models in terms of their quality for the external validation set.
Table 11 contains the comparisons of models for the solubility of fullerenes in various solvents.