Conference Report

Modeling Properties with Artificial Neural Networks and Multilinear Least-Squares Regression: Advantages and Drawbacks of the Two Methods †

by
Jesus Vicente De Julián-Ortiz
1,*,
Lionello Pogliani
1 and
Emili Besalú
2
1
Molecular Topology and Drug Design Research Unit, Departament de Química Física, Facultat de Farmàcia, Universitat de València, 46100 Burjassot, Spain
2
Institut de Química Computacional i Catàlisi (IQCC) and Departament de Química, Universitat de Girona, 17003 Girona, Spain
*
Author to whom correspondence should be addressed.
A preliminary version of this article appeared in: de Julián-Ortiz, J.; Pogliani, L.; Besalú, E. Artificial Neural Networks and Multilinear Least Squares to Model Physicochemical Properties of Organic Solvents. In Proceedings of the MOL2NET, International Conference on Multidisciplinary Sciences, 25 December 2016–25 January 2017; Sciforum Electronic Conference Series, Vol. 2, 2016; doi:10.3390/mol2net-02-03826.
Appl. Sci. 2018, 8(7), 1094; https://doi.org/10.3390/app8071094
Submission received: 24 April 2018 / Revised: 18 June 2018 / Accepted: 29 June 2018 / Published: 5 July 2018

Abstract

The mean molecular connectivity indices (MMCI) proposed in previous studies are used in conjunction with well-known molecular connectivity indices (MCI) to model eleven properties of organic solvents. The MMCI and MCI descriptors selected by the stepwise multilinear least-squares (MLS) procedure were used to perform artificial neural network (ANN) computations, with the aim of detecting the advantages and limits of the ANN approach. The MLS procedure yields results that can be reproduced exactly whenever needed, a characteristic not shared by the ANN methodology, which on the one hand improves the quality of a description but on the other hand is prone to overfitting. The present study also reveals how ANN methods prefer MCI over MMCI descriptors. Four types of ANN computations show that: (i) MMCI descriptors are preferred for properties with a small number of points, (ii) MLS is preferred over ANN when the number of ANN weights is similar to the number of regression coefficients, and (iii) in some cases the MLS modeling quality is similar to that of the ANN computations. A common training set and an external, randomly chosen validation set were used throughout the paper.

1. Introduction

Recently [1], the mean molecular connectivity indices (MMCI) were introduced to model eleven properties of organic solvents. The multilinear least-squares (MLS) procedure used to derive the quantitative structure-property relationships (QSPR) showed that three of the eleven properties, the refractive index (RI), the flash points (FP), and the ultraviolet cutoff values (UV), were modeled with the MMCI, while the remaining properties were modeled with the well-known molecular connectivity indices (MCI). The MMCI indices are also centered on the basic concepts of the delta, valence delta, I- and S-indices that go back to the origins of molecular connectivity theory [2,3,4,5,6,7]. Results from two other recent studies that used semiempirical sets of descriptors [8,9] showed that an artificial neural network (ANN) model with a variable number of hidden neurons chosen by the software improves the quality of a QSPR obtained with the aid of the multilinear least-squares (MLS) methodology, also known as multilinear regression (MLR). Nevertheless, this improvement is somewhat artificial: because more than one hidden neuron is present, the ANN computations for the eleven properties employ a number of weights much greater than the number of weights, or regression coefficients, of the MLS procedure. This can lead to poor results when new data are to be predicted. The phenomenon is called overfitting, and it can be avoided by guiding the training process according to the predictions on a test set, by more general regularization techniques, or by dropout of hidden neurons.
A scheme of the work is depicted in Figure 1. Data consisting of eleven physicochemical properties of solvents were randomly split into training (TR) and evaluation (EV) sets. Molecular descriptors were calculated for every molecule, as explained in the Materials and Methods section. MLS computations performed on the training set selected the best descriptors from the given pool. These best descriptors were then used to perform the multilayer perceptron ANN (ANN-MLP) computations. To avoid overfitting, during its training process the ANN randomly selects test sets (TE) within the original TR set. Finally, the models obtained by each method are applied, for external validation, to the evaluation (EV) set. It should be underlined that the evaluation set is common to every type of computation.
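As an illustration of this protocol, a minimal Python sketch of the random TR/EV split follows; the function name, the fixed seed, and the 30% evaluation fraction (used for most properties below) are illustrative assumptions, not the authors' code.

import numpy as np

rng = np.random.default_rng(42)  # fixed seed makes the split reproducible

def split_train_eval(n_points, eval_fraction=0.30):
    """Reserve ~30% of the compounds as the external evaluation (EV) set."""
    idx = rng.permutation(n_points)
    n_eval = int(round(eval_fraction * n_points))
    return idx[n_eval:], idx[:n_eval]  # TR indices, EV indices

tr_idx, ev_idx = split_train_eval(61)  # e.g., the 61 boiling-point compounds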
The aim of the present work is to pin down the real advantages and drawbacks of the ANN methodology, applying it to the modeling of the eleven properties of [1], where either MCI or MMCI are used as descriptors. Four different types of ANN computations are performed here to detect the level of improvement achieved, if any: (a) with one hidden neuron, (b) with a pre-fixed number of hidden neurons, (c) with a variable number of hidden neurons chosen by the software, and (d) with a reduced number of descriptors in the one-hidden-neuron case. This last case was intended to render the number of ANN weights equal to the number of MLS weights. We also monitored whether the ANN computations preferred MCIs or MMCIs for modeling purposes. The descriptors for the eleven properties are those of [1]; however, whenever a property was not satisfactorily modeled by the given MCI (or MMCI), the second- or third-best MCI (or MMCI) was chosen. The domain of applicability of the models presented here includes substances that have been used as solvents, without any other chemical restrictions.

2. Materials and Methods

2.1. The Properties

The raw material of the present study, the eleven properties of the organic solvents, is given in Table 1. The source for their values is cited in [1].

2.2. Descriptors

Table 2 shows the molecular connectivity χ indices, the molecular pseudoconnectivity ψ indices (pseudo-MCI), and the dual connectivity and pseudoconnectivity indices (dual MCI, pseudo-MCI) used throughout this study. Three new indices were used: Δ = ΣEAnEA, Σ = ΣEA<SEA>, and TΣ/M = Σ3/M1.7 (M = molar mass); Δ encodes the number of electronegative atoms (nEA), and Σ encodes the sum of the S-State indices of the electronegative atoms N, O, F, Cl, and Br (<SEA> is the average value for a specific atom). Table 3 shows the definitions of the MMCI (the first M stands for “mean”), which are based on averages of vertex invariants. The original Stolarsky mean has a minus sign in the denominator, replaced here by a plus sign to avoid a zero denominator when δi = δj, even though the limit of the function as δi tends to δj is known to be finite. The present mean is therefore a kind of pseudo-Stolarsky mean.
These two tables summarize the pool of descriptors used throughout this study: n is the number of atoms in a molecule, i = 1 to n denotes the atoms of a molecule, ij denotes directly σ-bonded atoms, and p is assigned the value n in Table 3. Replacing δ with the valence delta, δv, in Table 2 yields the corresponding valence MCI, {Dv, 0χv, 1χv, χvt, 0χdv, 1χdv, 1χsv}; replacing the intrinsic I-State with the electrotopological S-State index yields the corresponding pseudoconnectivity electrotopological indices, {SψE, 0ψE, 1ψE, TψE, 0ψEd, 1ψEd, 1ψEs} [3,4,5,6,7,8,9]. This subject is further elucidated in Appendix A and Appendix B. Replacing δ in Table 3 with δv, I, and S yields three other subsets of MMCI: the valence, {AMv, GMv, HMv, RMv, SMv, UMv, HoMv, LMv, StMv}, the I-State, {AMI, GMI, HMI, RMI, SMI, UMI, HoMI, LMI, StMI}, and the E-State, {AME, GME, HME, RME, SME, UME, HoME, LME, StME}, MMCI, respectively. Because some S values can be negative (for highly electropositive atoms), the S values are rescaled to avoid imaginary S-State MMCI values, as explained in [1]. Summing up, we have thirty-one MCI and thirty-six MMCI. Every index was obtained with a home-made Visual Basic program, based on both adjacency and distance matrices, that runs on an ordinary PC [6].
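To illustrate how such indices follow from the adjacency matrix, the following Python sketch computes the first-order connectivity index 1χ = Σ(δiδj)−0.5 of Table 2 for the hydrogen-depleted graph of n-butane; it is a minimal stand-in for illustration, not the home-made Visual Basic program used in this study.

import numpy as np

# Hydrogen-depleted graph of n-butane (C1-C2-C3-C4): adjacency matrix A
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])

delta = A.sum(axis=1)                 # vertex degrees δi
i, j = np.triu_indices_from(A, k=1)   # candidate pairs, each counted once
edge = A[i, j] == 1                   # keep only σ-bonded pairs
chi1 = np.sum((delta[i[edge]] * delta[j[edge]]) ** -0.5)
print(chi1)                           # ≈ 1.914, the known 1χ of n-butane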

2.3. Multilinear Least-Squares Regression

The stepwise multilinear least-squares (MLS) procedure of Statistica 8, which searches the whole combinatorial space built by the descriptors, was used to find the best set of indices, either MCI or MMCI, for the training compounds of Table 1. These were then used to evaluate the left-out compounds (EV, those marked (°) in Table 1, ~30% of all compounds, 25% for El). The same best descriptors were also used for the ANN computations. To model the dipole moments, the indices were multiplied by a two-valued symmetry factor, φ = 0, 1, i.e., φ·[MCI or MMCI] = 0 or φ·[MCI or MMCI] = [MCI or MMCI], where zero is used for the symmetric molecules with μ = 0. The number of indices of a relationship was chosen bearing in mind that the ratio of data points to the number of variables should be greater than or equal to five, and that the model should provide a correlation coefficient r > 0.84, i.e., r2 > 0.70 [10]. External validation was performed for all types of model (ANN included) with the set of evaluation points (EV), which were added to check the predictive ability of the overall model. Broadly speaking, the models prove robust when the 25–30% of cases in the EV set can be added without degrading the statistics.
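A minimal sketch of a forward-stepwise selection that respects the two rules above (a points-to-variables ratio of at least five, and a final r2 > 0.70) is given below; Statistica's stepwise algorithm differs in its details, so this Python sketch is only one plausible reading of the procedure.

import numpy as np

def forward_stepwise(X, y, min_ratio=5):
    """Greedy forward selection; stops before n/variables drops below 5."""
    n, p = X.shape
    chosen, best_r2 = [], 0.0
    while (len(chosen) + 1) * min_ratio <= n:
        best_k = None
        for k in set(range(p)) - set(chosen):
            M = np.column_stack([np.ones(n), X[:, chosen + [k]]])
            beta, *_ = np.linalg.lstsq(M, y, rcond=None)
            r2 = 1 - np.sum((y - M @ beta) ** 2) / np.sum((y - y.mean()) ** 2)
            if r2 > best_r2:
                best_r2, best_k = r2, k
        if best_k is None:     # no remaining descriptor improves the fit
            break
        chosen.append(best_k)
    return chosen, best_r2     # accept the model only if best_r2 > 0.70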

2.4. Multi-Layer Perceptron—Artificial Neural Networks

ANN methods [11,12] that can perform regression and data validation carry out both tasks in a non-parametric way, making no assumption about the relationship between y and x, where y = f(x). This means that the function Property = f(indices) is not known a priori. In short, a non-parametric model is a kind of black box that tries to discover a mathematical function that approximates the relationship between the indices and the property well enough. It uses highly flexible transfer functions with adaptable parameters that can model a wide spectrum of functional relationships. The activation functions for both hidden and output nodes used in Statistica 8 are: identity (i), logistic sigmoid (l), hyperbolic tangent (t), sine (s), and exponential (e).
ANN results were obtained with the built-in utility of Statistica 8, the multilayer perceptron neural network (MLP-ANN). This network has a three-layered feedforward architecture with unidirectional full connections between successive layers (Figure 2) and error backpropagation (backprop). The three layers are:
input units → hidden units → output units
Units are also known as neurons or nodes; in our case, the input units correspond to our variables, i.e.,
variables (MCI or MMCI) → hidden units → P
The only output unit here is the targeted property, P. In the present study, the number of input units corresponds to the number of MCI or MMCI descriptors. Each neuron, or node, in a layer connects to every neuron in the next layer. The connections between neurons carry the weights that determine the values assigned to the nodes. Additional weights are assigned to the bias values, which act as node-value offsets; therefore, the resulting number of weights is:
(No. input nodes + 2)·(No. hidden nodes) + 1
From this formula, each hidden node contributes (No. input nodes + 2) weights; with five input nodes every hidden node thus adds seven connections, so a 5-7-1 network has fifty weights. The weights adjusted by the training process are initially random and are handed over to all nodes of the following layer. The training process is iterative, and each iteration is called an epoch. Technically, the number of epochs is not fixed and cannot be treated as an unfailing parameter (it can exceed the given number). The weights are slightly varied in each epoch to minimize the sum-of-squares error function, SOS = Σ(Piclc − Pi)2, i = 1, …, N, where Piclc (clc = calculated) is the ith predicted value (network output) of the property and Pi is the target value. This function is the sum of the squared differences between the predicted outputs and the targets over the entire training set of N points (compounds). Statistica 8 allows setting the number of networks to train and retain (Ntr/Nre). Two sets of values are imposed here: Ntr/Nre = 103/200 and Ntr/Nre = 105/200. In the corresponding tables only Ntr is shown, as Nre is constant. The ANN of Statistica 8 is optimized with the BFGS (Broyden–Fletcher–Goldfarb–Shanno) algorithm to ensure a fast convergence rate [13,14].
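To make the weight count and the SOS objective concrete, here is a minimal Python sketch of a one-hidden-neuron network fitted by minimizing SOS with BFGS; the toy data, the tanh activation, and all names are illustrative assumptions, with SciPy's BFGS standing in for Statistica's optimizer.

import numpy as np
from scipy.optimize import minimize

def n_weights(n_in, n_hidden):
    # (No. input nodes + 2)·(No. hidden nodes) + 1, as in the formula above
    return (n_in + 2) * n_hidden + 1

assert n_weights(5, 7) == 50  # the 5-7-1 example discussed in the text

def sos(w, X, y):
    # SOS = Σ (Pclc − P)² for an n_in-1-1 network with a tanh hidden node
    n_in = X.shape[1]
    w_h, b_h, w_o, b_o = w[:n_in], w[n_in], w[n_in + 1], w[n_in + 2]
    h = np.tanh(X @ w_h + b_h)  # single hidden node
    return np.sum((w_o * h + b_o - y) ** 2)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(45, 5)), rng.normal(size=45)  # toy stand-in data
res = minimize(sos, rng.normal(size=5 + 3), args=(X, y), method="BFGS")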
Statistica 8, as a rule, sets the number of hidden nodes between 3 and 11 by default. Nevertheless, as already mentioned, we performed four procedures (for the fourth, see below): (i) first a single hidden node is used; (ii) then hidden nodes from two to twelve are tried sequentially ‘by hand’ (i.e., the program is not allowed to change the imposed number of hidden nodes); and (iii) finally the program chooses the number of hidden nodes. To come as close as possible to the MLS results, it was decided (iv) to repeat the one-hidden-neuron computation after deleting either one or two indices with the lowest sensitivity values. In this case, for instance, the number of weights for the 4-1-1 case of Tb is 7, which equals the number of regression coefficients of the MLS calculation with six indices. The data required no normalization by the user, since the program performs this automatically.
Since the MLS procedure optimizes a number of regression parameters equal to the number of variables plus one (the bias parameter), a strict comparison between the two methods should only be performed when the ANN uses no hidden neurons; in this case the number of ANN weights equals the number of MLS parameters. One should expect that, with a growing number of hidden neurons, the model of a property constantly improves thanks to the growing number of weights per variable (akin to having a variable with many different weights). With ANN it is usually the case that the model becomes exceedingly good as the number of weights grows, and this frequently results in overfitting, with exceedingly poor predictions for the external values. The choice of training (TR = 80% of the values in Table 1, excluding the externally validated compounds) and test sets (TE = 20% of the values, the bold values in this table) usually prevents overfitting. In fact, the network is repeatedly trained for a number of cycles as long as the test error keeps decreasing; otherwise, the training is halted. This method, known as the ‘early stopping’ procedure [12], avoids the trap of the program always choosing the maximum number of hidden nodes. Each property shows an optimal number of nodes, which rarely corresponds to the maximum.
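The early-stopping safeguard can be reproduced with scikit-learn's MLPRegressor; this is a minimal analogue of the setting described above, not the Statistica implementation, and the parameter values are illustrative.

from sklearn.neural_network import MLPRegressor

mlp = MLPRegressor(hidden_layer_sizes=(7,),   # e.g., a 5-7-1 network
                   early_stopping=True,       # hold out part of TR as a test set
                   validation_fraction=0.20,  # the 20% TE split described above
                   n_iter_no_change=10,       # patience, in epochs
                   max_iter=1000,
                   random_state=0)
# mlp.fit(X_tr, y_tr) would now halt as soon as the TE error stops decreasing.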

3. Results

The results of the five procedures, one MLS and four ANN, are shown in Table 4, Table 5, Table 6, Table 7, Table 8 and Table 9. Table 4 collects the MLS results for the eleven properties. In this table, the errors of the regression coefficients are given in parentheses in vector form (± signs omitted; second line of each cell in the second column). The training set for the elutropic value (El) includes pentane and tetrahydrofuran.
Table 5, Table 6, Table 7 and Table 8 collect the different ANN-MLP results for the set of variables (descriptors, either MMCI or MCI) of Table 4. In these tables, the first column gives the δv type (see Appendix A) and the number of networks to train, Ntr = 103 or 105 (when both numbers gave rise to similar results, Ntr = 103 was preferred), while the number of networks to retain is always 200. The activation functions, together with the neuronal architecture, are in the second column of Table 5, Table 6, Table 7 and Table 8; the third line of this column shows the number of epochs for which the ANN-MLP calculation runs for each property. The third column gives the set of variables together with their statistics; its second line shows the sensitivities, which come from the sensitivity analysis quantifying the importance of the input variables of the models. The r2 and s statistics were obtained with an Excel spreadsheet by plotting the observed property, P, vs. the calculated one, Pclc, once for the training and test compounds, N(aTR + bTE), and a second time for the training + test + evaluated compounds, N(+cEV), where a, b, and c are the numbers of points (i.e., compounds). We remind the reader that the MLS procedure has no test compounds, only training compounds, N(TR). No ANN weights are shown, owing to their excessive number and because every ANN-MLP run produces different weights and sensitivity values.
For comparison purposes, it was decided to exclude in the ANN calculations (see Table 5, Table 6, Table 7 and Table 8) the same number of outliers as in the MLS procedure, where the exclusion was done for residuals greater than 3s. Clearly, the ANN outliers differ from the MLS ones. In Table 5, the ANN results obtained with a single hidden neuron are given. Table 6 and Table 7 display the multiple-neuron cases: Table 6 with an externally imposed number of hidden neurons, cycled from 2 to 12, and Table 7 with the number of hidden neurons chosen by the program (between 3 and 11; for UV, MS, and El the program sets this number between 3 and 10). For those cases where different numbers of hidden nodes achieved similar modeling, the smallest number of nodes was preferred. The subsets of descriptors used to model the properties showed an intercorrelation r lower than 0.93. We remind the reader that a previous study [15] established that indices can be considered strongly correlated when r > 0.98.

4. Plots

Figure 3, Figure 4 and Figure 5 display the normal and residual plots of the properties that give rise to the best models and that also show optimal statistics for the evaluated points (given in the captions). All these plots follow the statistics shown in Table 9 (3rd column, 2nd line). The structure and importance of this type of plot were discussed in [16,17].

5. Discussion

For ease of discussion and interpretation, the most important and detailed statistical results collected in Table 4, Table 5, Table 6 and Table 7 are summarized in Table 9. Table 8 shows a special case that will be discussed later on. While Table 4, Table 5, Table 6 and Table 7 collect the detailed information about the modeling of the eleven properties, especially about the type of indices, the valence deltas, and the structure of the ANN computations, Table 9 gives direct information about the different models.
Looking at the MMCI indices (letter M in Table 9), in MLS they are optimal for three properties: the refractive index RI, the flash points FP, and the cutoff UV.
In ANN computations with one hidden neuron (ANN 1HN, Table 5), they are instead important descriptors for the cutoff UV, the flash points FP, and the elutropic values El. It seems that properties with fewer training points are better modeled by MMCIs. Concerning the statistical results for the training compounds, ANN 1HN (Table 9, first line) improves over MLS for the Tb and El properties, while it lags behind for −χ·106; otherwise, the results are rather similar. With the whole set of compounds (Table 9, second line), i.e., training (and test, with ANN) plus evaluated compounds, ANN 1HN calculations again improve over MLS for Tb and El, while they stay behind for ε, γ, UV, and −χ·106.
As soon as the number of hidden neurons grows, either by external choice (enHN, Table 6) or by software choice (snHN, Table 7), MMCIs remain optimal descriptors only for the elutropic values (silica), El, the property with the lowest number of points.
The multiple-hidden-neuron case shows that, at the training level, ANN enHN (Table 9) improves consistently over the two previous cases (MLS and ANN 1HN) for Tb, ε, d, RI, γ, FP, μ, and UV. For −χ·106, ANN with several hidden neurons improves with respect to ANN 1HN, and for El there is an improvement only in relation to MLS (Table 4). Results for the viscosity, η, are rather similar throughout the three cases. Mostly, the improvement concerns both the r2 and the s statistics. Concerning the whole set of compounds (TR, TE, and EV), the statistics improve in relation to the two previous cases (MLS and ANN 1HN) for Tb, ε, γ, FP, μ, and UV.
In general, the advantage of ANN over the MLS procedure is not striking for the eleven properties. In fact, with the only exception of the training plus test results for the −χ·106 property, it does not achieve any useful improvement.
Normally, for an optimal modeling, the number of hidden neurons chosen externally (ANN enHN, Table 9) is smaller than the number of hidden neurons chosen by the software (ANN snHN, Table 9). In some cases it is much smaller, as for Tb (an extreme case), d, and γ. Furthermore, the ANN snHN statistics are either worse than or similar to the ANN enHN ones. This means that if you intend to let the software choose the number of hidden neurons, then it is better to stick to the MLS modeling. Could this depend on the initial ANN weights considered? Possibly, even though it seems a general trend; i.e., it shows up with nearly all properties.
The MLS results compare rather well with the ANN 1HN results, even though the ANN computations have a number of weights larger (by two) than the number of regression coefficients of the corresponding MLS computations. Thus, we decided to perform ANN calculations after deleting the two indices with the lowest sensitivity values in Table 5. In those cases where the deletion of two indices gave rise to poor modeling, we deleted only one index; in this last case, the number of weights is no longer equal to the number of regression coefficients, or weights, of the MLS case (it is larger by one). Results are shown in Table 8; as the reader can notice, four properties, γ, UV, −χ·106, and El, do not show up, due to poor modeling, while for the properties d, FP, and μ only one index was deleted. We also notice that the dipole moment, μ, does not obey the lowest-sensitivity rule (see Table 5): following this rule, we should have deleted the TΣ/M index, but its deletion gives rise to a poor model for the dipole moment. This confirms that sensitivity values change from run to run, like the weights, and are not a guide to the absolute importance of an index, only to its importance within a given model. The statistics here are usually not as good as in the MLS case (Table 4), with one clear and striking exception: the modeling of the whole set of compounds for the dielectric constant, ε. Looking only at the training plus test modeling, we would simply have discarded this model. Nevertheless, the very good modeling of the evaluated compounds helps to improve the overall model for this property. Thus, (i) before throwing away some training plus test ANN or MLS results, re-evaluate them, and (ii) do not forget that a very good ANN model may be hiding somewhere.
All this goes back to the random assignment of the initial weights in ANN computations, which makes it difficult to reproduce values that seem to show up by chance. Table 5, Table 6, Table 7 and Table 8 also show that there is no fixed preferential value for the parameter Ntr (number of networks to train): usually, different Ntr values give rise to rather similar statistical parameters.
Generally, the addition of the EV set does not greatly affect the overall quality of the models, showing their robustness in most cases. The differences in r2 are, as a rule, not greater than 0.05. Some exceptions are the MLS models for FP and μ, and the ANN ones for ε, γ, and μ.
Concerning the most used values of δv, Table 4, Table 5, Table 6 and Table 7 show that the δvppo configuration is preferred, especially throughout the nHN cases (Table 6 and Table 7). This choice implies a strong dependence on the core electrons for higher-row atoms (see Appendix A). Regarding the exponent of the fractional term in δv (see Appendix A), the most used values are 1 and −0.5, i.e., strong hydrogen atom dependence, and 50, i.e., no hydrogen atom dependence. The strong hydrogen dependence of δv tells us that the hydrogen atoms should not be neglected.
The plots of Figure 3, Figure 4, Figure 5 and Figure 6 exemplify the best models obtained for the given properties. These four properties, Tb, d, RI, and η (Vis), show the best statistics for the set of evaluated points. The residual plots, nevertheless, remind us that the models could be further improved, since the evaluated points are not placed symmetrically around the zero line, as required in a perfect model. In the graphs of Figure 4 and Figure 5, a point appears far from the remaining points and could, in principle, anchor the regression line. The corresponding residual plots show, however, that this is not the case, since its residuals are not insignificant; this is due to the large number of values concentrated in the cloud of the remaining points.

6. Conclusions

The first interesting result of the present ANN-MLP computations is that MCIs are preferred over MMCIs, especially for properties with a relatively high number of points. In fact, only El, with the smallest number of points, is usefully described by MMCI when ANN-MLP with more than one hidden neuron is performed.
The second result suggests that, for the given properties, it is better to impose the number of hidden neurons externally.
The third result shows that, with some exceptions, ANN-MLP improves on MLS calculations, even if the improvement is not dramatic.
One of the great advantages of the MLS computation is that its statistical results are reproducible: no matter how many times the calculations are repeated with the same indices, the same results are obtained. The ANN-MLP results can seem, instead, non-reproducible, since the weights of the ANN-MLP calculations start from random values and the minimization procedure usually ends up with different values from run to run; moreover, as a rule, different ANN-MLP computations end up in different local minima. However, it must be pointed out that, by repeating the training process with the same procedure, the same seed for the randomization algorithm, the same precision, and the same data sets, the resulting model would be the same.
The ANN-MLP results obtained with one hidden neuron, either with the full set of descriptors (Table 5) or with a reduced set of descriptors (Table 8), confirm the validity of the MLS calculations. The asymmetry of the evaluated points around the zero line of the residual plots reminds us that things might be further improved, either with other types of ANN-MLP calculations or with new types of descriptors.
These results indicate that MLS models should be preferred, except when it is necessary to reach a given quality in the predictions that is only achievable with ANN-MLP models.
The present study also tells us that it is worth considering the hydrogen atoms when performing the calculations to derive the MCIs or the MMCIs, as in many cases they help to improve the quality of a model both in the MLS and ANN-MLP computations.

Author Contributions

J.V.d.J.-O., L.P. and E.B. conceived and designed the idea, analyzed the data and wrote the paper; L.P. performed the calculations.

Funding

This research received no external funding.

Acknowledgments

We would like to thank the referees for their suggestions, which helped to improve the paper.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. The Valence Delta

The δv number used in current and previous works is defined as follows [7],
δv = (q + fδn)·δv(ps)/(p·r + 1)  (A1)
δv(ps) is the valence of a vertex in a chemical pseudograph (or general graph), which allows multiple bonds and self-connections (loops). Usually, in chemical graph theory, simple graphs (with no multiple bonds or loops) and pseudographs are hydrogen-depleted. Parameter p is the order of the complete graph, Kp, used to encode the core electrons [7], while r is its regularity (r = p − 1). A complete graph is a graph in which every pair of vertices is adjacent. The first-order complete graph, K1, which encodes the second-row atoms, is just a vertex. Higher values of p encode higher-row atoms. Parameter q in Equation (A1) is two-valued: q = 1 or q = p, where p = 1, 2, 3, 4, 5, 7, ….
Generally, two representations (or configurations) for δv are useful: a Kpo configuration, where q = 1 and p is odd, and a Kppo one, where q = p and, again, p is odd.
The fδ fractional perturbation parameter (or hydrogen perturbation), which encodes the depleted hydrogen atoms, is defined in the following way,
fδ = 1 − δv(ps)/δvm(ps) = nH/δvm(ps)
δvm(ps) is the maximal δv(ps) value a heteroatom (a vertex) can have in a hydrogen-depleted chemical pseudograph, i.e., when all bonded hydrogen atoms are substituted by heteroatoms, and nH equals the number of hydrogen atoms bonded to a heteroatom. For completely substituted heteroatoms, fδ = 0, as δvm(ps) = δv(ps) (i.e., nH = 0). In hydrocarbons, δv(ps) = δ, the delta number of a simple chemical graph with no multiple bonds or loops; in this case δv = (1 + fδn)δ (for p = 1), and for quaternary carbons fδ = 0 and δv = δ. The exponent n in fδn quantifies the importance of the hydrogen perturbation, i.e., the higher the value of n, the lower the importance of the perturbation. Different values of n give rise to different sets of indices; in this study, n = −0.5, 0.5, 1, 2, 5, 50.
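As a worked example of Equation (A1), consider the hydroxyl oxygen of an alcohol under the Kpo configuration (q = 1, p = 1, r = 0); the values δv(ps) = 5 and δvm(ps) = 6 assumed below for oxygen are illustrative, not taken from the paper's tables.

def valence_delta(q, p, dv_ps, dvm_ps, n):
    # Equation (A1): δv = (q + fδ^n)·δv(ps)/(p·r + 1), with r = p − 1
    f_delta = 1 - dv_ps / dvm_ps  # hydrogen perturbation, equals nH/δvm(ps)
    return (q + f_delta ** n) * dv_ps / (p * (p - 1) + 1)

# O bonded to one H (assumed): δv(ps) = 5, δvm(ps) = 6, so fδ = 1/6
print(valence_delta(q=1, p=1, dv_ps=5, dvm_ps=6, n=1))  # (1 + 1/6)·5 ≈ 5.83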

Appendix B. The Intrinsic-I-State and the Electrotopological S-State Indices

The I- and E-State indices (ψI and ψE: I means intrinsic, E electrotopological), known in the literature as the I and S indices, respectively, are related to δv in the following way [4],
I = (δv + 1)/δ, S = I + ΣΔI, with ΔI = (Ii − Ij)/rij2
rij counts the atoms along the minimum path separating atoms i and j and equals the graph distance plus one, dij + 1; ΣΔI incorporates the information about the influence of the rest of the molecular environment and, since it can be negative, S can also be negative for some atoms.
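The following short numerical sketch evaluates these definitions for a three-atom chain meant to resemble the heavy-atom graph of ethanol (C–C–O); the δ and δv values are illustrative inputs, not entries from the descriptor tables of this paper.

import numpy as np

delta  = np.array([1, 2, 1])   # simple-graph degrees δ
deltav = np.array([1, 2, 5])   # valence deltas δv (terminal O taken as 5)
I = (deltav + 1) / delta       # intrinsic state, I = (δv + 1)/δ

D = np.array([[0, 1, 2],       # graph distances d_ij
              [1, 0, 1],
              [2, 1, 0]])
R = D + 1                      # r_ij = d_ij + 1
S = I.copy()
for i in range(3):
    for j in range(3):
        if i != j:
            S[i] += (I[i] - I[j]) / R[i, j] ** 2  # S = I + Σ ΔI
print(S)  # the electronegative terminal atom accumulates a positive shift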

References

  1. Pogliani, L.; de Julián-Ortiz, J.V. Artificial neural networks and multilinear least squares to model physicochemical properties of organic solvents. Int. J. Chem. Mod. 2014, 6, 241–254. [Google Scholar]
  2. Randić, M. On characterization of molecular branching. J. Am. Chem. Soc. 1975, 97, 6609–6615. [Google Scholar] [CrossRef]
  3. Kier, L.B.; Hall, L.H. Molecular Connectivity in Structure-Activity Analysis; Wiley: New York, NY, USA, 1986. [Google Scholar]
  4. Kier, L.B.; Hall, L.H. The Electrotopological State. In Molecular Structure Description; Academic Press: New York, NY, USA, 1999. [Google Scholar]
  5. Todeschini, R.; Consonni, V. Molecular Descriptors for Chemoinformatics, 2nd ed.; Wiley-VCH: Weinheim, Germany, 2000. [Google Scholar]
  6. Pogliani, L. From molecular connectivity indices to semiempirical connectivity terms: Recent trends in graph theoretical descriptors. Chem. Rev. 2000, 100, 3827–3858. [Google Scholar] [CrossRef] [PubMed]
  7. García-Domenech, R.; Gálvez, J.; de Julián-Ortiz, J.V.; Pogliani, L. Some new trends in chemical graph theory. Chem. Rev. 2008, 108, 1127–1169. [Google Scholar] [CrossRef] [PubMed]
  8. Pogliani, L.; de Julián-Ortiz, J.V. Testing selected optimal descriptors with artificial neural networks. RSC Adv. 2013, 3, 14710–14721. [Google Scholar] [CrossRef]
  9. Pogliani, L.; de Julián-Ortiz, J.V. QSPR with descriptors based on averages of vertex invariants. An artificial neural network study. RSC Adv. 2014, 4, 44733–44740. [Google Scholar] [CrossRef]
  10. Topliss, J.G.; Costello, R.J. Chance correlations in structure-activity studies using multiple regression analysis. J. Med. Chem. 1972, 15, 1066–1069. [Google Scholar] [CrossRef] [PubMed]
  11. Zupan, J.; Gasteiger, J. Neural Networks in Chemistry and Drug Design: An Introduction, 2nd ed.; Wiley-VCH: Weinheim, Germany, 1999. [Google Scholar]
  12. Livingstone, D.J.; Manallack, D.T.; Tetko, I.V. Data modelling with neural networks: Advantages and limitations. J. Comput.-Aided Mol. Des. 1997, 11, 135–142. [Google Scholar] [CrossRef]
  13. Castillo, E.; Guijarro-Berdiñas, B.; Fontenla-Romero, O.; Alonso-Betanzos, A. A very fast learning method for neural networks based on sensitivity analysis. J. Mach. Learn. Res. 2006, 7, 1159–1182. [Google Scholar]
  14. Broyden–Fletcher–Goldfarb–Shanno Algorithm. Available online: http://en.wikipedia.org/wiki/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm (accessed on 4 July 2018).
  15. Mihalic, Z.; Nikolic, S.; Trinajstic, N. Comparative study of molecular descriptors derived from the distance matrix. J. Chem. Inf. Comput. Sci. 1992, 32, 28–37. [Google Scholar] [CrossRef]
  16. Besalu, E.; de Julián-Ortiz, J.V.; Pogliani, L. Trends and plot methods in MLR studies. J. Chem. Inf. Model. 2007, 47, 751–760. [Google Scholar] [CrossRef] [PubMed]
  17. Besalú, E.; de Julián-Ortiz, J.V.; Pogliani, L. An overlooked property of plot methods. J. Math. Chem. 2006, 39, 475–484. [Google Scholar] [CrossRef]
Figure 1. Flow chart of the methodology followed throughout the present work.
Figure 2. An ANN scheme with an input node (in), a bias node (b), a hidden node (hn), and an output node (on).
Figure 3. Plot of the Experimental vs. calculated (calc) boiling points, Tb, (a) and the corresponding residual plot (b): (•) training points, () test points, and (×) evaluated points. Statistics of evaluation (EV) points: N = 16, r2 = 0.91, s = 17.
Figure 4. Plot of the Experimental vs. calculated (calc) density, d, (a) and the corresponding residual plot (b): (•) training points, () test points, and (×) evaluated points. Statistics of EV points: N = 15, r2 = 0.88, s = 0.1.
Figure 5. Plot of the Experimental vs. calculated (calc) refractive index, RI, (a) and the corresponding residual plot (b): (•) training points, () test points, and (×) evaluated points. Statistics of EV points: N = 14, r2 = 0.96, s = 0.1.
Figure 6. Plot of the Experimental vs. calculated (calc) viscosity, η, (a) and the corresponding residual plot (b): (•) training points, () test points, and (×) evaluated points. Statistics of EV points: N = 9, r2 = 0.98, s = 0.1.
Table 1. Eleven properties of organic solvents with their molar mass M (g·mol−1): Tb, boiling point (K); ε, dielectric constant; d, density (at 20 °C ± 5 °C relative to water at 4 °C, g/cc); RI, refractive index (20 °C); FP, Flashpoint (K); η, viscosity (Cpoise, 20 °C; 1 at 25 °C, 2 at 15 °C); γ, surface tension (mN/m at 25 °C); UV, Cutoff UV values (nm); μ, dipole moments in debye (1D = 10−18 esu cm = 3.3356 × 10−3 C m); MS, magnetic susceptibility (also, −χ·106, in emu mol−1, 1 emu = 1 cm3, temperatures cover a range from 15 °C to 32 °C); and El, Elutropic value (silica).
Solvents | M | Tb | ε | d | RI | FP | η | γ | UV | μ | MS | El
(°) Acetone58.132920.70.7911.3592560.3223.463302.880.460.43
(°) Acetonitrile41.0535537.50.7861.3442780.3728.661903.920.5340.50
Benzene78.13532.30.841.5012620.6528.2228000.6990.27
Benzonitrile103.146125.21.0101.5283441.24 138.79
1-Butanol74.139117.10.8101.3993082.9524.93215
(°) 2-Butanone72.135318.50.8051.3792700.4023.97330 0.39
Butyl Acetate116.23985.00.8821.3942950.7324.88254
CS276.13192.61.2661.6272400.3731.5838000.532
CCl4153.83502.21.5941.460 0.9726.4326300.6910.14
Cl-Benzene112.64055.61.1071.5242960.8032.99287
1-Cl-Butane92.63517.40.8861.40242670.3523.18225
CHCl3119.43344.81.4921.446 0.5726.672451.010.7400.31
Cyclohexane84.23542.00.7791.4262551.0024.6520000.6270.03
(°) Cyclopentane70.13232.00.7511.4002360.4721.88200 0.629
1,2-diCl-Benzene147.04539.91.3061.5513381.32 2952.500.748
1,2-diCl-Ethane98.9535610.41.2561.4442880.7931.862251.75
diCl-Methane84.93139.11.3251.424 0.4427.202351.600.7330.32
N,N-diMe-Acetamide87.143837.80.9371.438343 2683.8
N,N-diMeFormamide73.142636.70.9441.4313300.92 2683.86
1,4-Dioxane88.13742.21.0341.4222851.5432.752150.450.606
Ether74.13084.30.7081.3532330.2416.952151.15 0.29
Ethyl acetate88.13506.00.9021.3722700.4523.392601.80.5540.45
(°) Ethyl alcohol46.135124.30.7851.3602811.2021.972101.690.575
Heptane100.23711.90.6841.387272 19.65200 0.00
Hexane86.23421.90.6591.3752500.3317.89200 0.00
2-Methoxyethanol76.139816.00.9651.4023191.7230.84220
(°) Methyl alcohol32.033832.70.7911.3292840.6022.072051.700.5300.73
4-Me-2-Pentanone100.239113.10.8001.396286 334
2-Me-1-Propanol74.138117.70.8031.396310
2-Me-2-Propanol74.135610.90.7861.387277 19.96 1.660.534
DMSO78.146246.71.1011.4793682.2442.922683.96
(°) Nitromethane61.037435.91.1271.3823080.6736.533803.460.391
1-Octanol130.246910.30.8271.42935410.6 227.10
(°) Pentane72.153091.80.6261.3582240.2315.49200 0.00 *
3-Pentanone86.137517.00.8531.392279 24.74
(°) 1-Propanol60.137020.10.8041.3842882.2623.32210
(°) 2-Propanol60.135618.30.7851.3772952.3020.93210 0.63
Pyridine79.138812.30.9781.5102930.9436.563052.20.6110.55
Tetra Cl-Ethylene165.83942.31.6231.506 0.90 0.802
(°) Tetra-Hydrofuran72.13407.60.8861.4072560.55 2151.75 0.35 *
Toluene92.13842.40.8671.4962770.5927.932850.360.6180.22
1,1,2-triCl,triF-Ethane187.43212.41.5751.358 0.69 230 0.02
2,2,4-triMe-Pentane114.23721.90.6921.3912660.50 215 0.01
o-Xylene106.24172.60.8701.5053050.8129.76
p-Xylene106.24112.30.8661.4953000.6528.01
(°) Acetic acid60.053916.151.0491.372 27.10 1.20.551
Decalin138.24652.20.8791.476 0.681
diBr-Methane173.83707.81.5422.497 39.05 1.430.935
1,2-diCl-Ethylen(Z)96.93349.21.2841.449 1.900.679
(°) 1,2-diCl-Ethylen(E)96.93212.11.2551.446 00.638
1,1-diCl-Ethylen96.93054.71.2131.425 1.340.635
Dimethoxymethane76.13152.70.8661.356 0.611
(°) Dimethylether46.12495.0
Ethylen Carbonate88.151189.61.3211.425 4.91
(°) Formamide45.04841091.1331.448 57.03 3.730.551
(°) Methylchloride50.524912.60.9161.339 1.87
Morpholine87.14027.31.0051.457 0.631
Quinoline129.25109.01.0981.629 42.59 2.20.729
(°) SO264.126317.61.434 1.6
2,2-tetraCl-Ethane167.84198.21.5781.487 35.58 1.30.856
tetraMe-Urea116.245023.10.9691.449 3.470.634
triCl-Ethylen131.43603.41.4761.480 0.734
(°) externally validated compounds; bold values: test compounds used in Artificial Neural Network Multilayer Perceptron (ANN-MLP) calculations, * for this property these two compounds ∈ {TR} (see Table 4 below) and {TR + TE} (see Table 5 below), Me = Methyl.
Table 2. Definition of the Molecular Connectivity Indices (MCI). Replacing δ with δv and I with S the corresponding valence, χv, I-State, ψI, and E-State, ψE, MCIs are obtained.
MCI | Pseudo-MCI | Dual MCI + Δ + Σ | Dual Pseudo-MCI + TΣ/M
D = Σiδi | SψI = ΣiIi | 0χd = (−0.5)nΠi(δi) | 0ψId = (−0.5)nΠi(Ii)
0χ = Σ(δi)−0.5 | 0ψI = Σ(Ii)−0.5 | 1χd = (−0.5)(n+μ−1)Π(δi + δj) | 1ψId = (−0.5)(n+μ−1)Π(Ii + Ij)
1χ = Σ(δiδj)−0.5 | 1ψI = Σ(IiIj)−0.5 | 1χs = Π(δi + δj)−0.5 | 1ψIs = Π(Ii + Ij)−0.5
χt = (Πδi)−0.5 | TψI = (ΠIi)−0.5 | Δ = ΣEAnEA, Σ = ΣEA<SEA> | TΣ/M = Σ3/M1.7
n is the number of atoms, ij denotes a σ bond, and μ is the cyclomatic number.
Table 3. Definition of the Mean Molecular Connectivity Indices (MMCI). Replacing δ with δv, I, and S, the respective valence (Mv), I-State (MI), and E-State (ME) MMCIs are obtained.
AM = Σiδi/n | GM = Σij(δiδj)1/2 | HM = 2Σij(δi−1 + δj−1)−1
RM = Σij[(δi2 + δj2)/2]1/2 | SM = Σij(δi2 + δj2)/(δi + δj) | UM = Σij[δi − δj + (δi2 − 2δiδj + 5δj2)0.5]/2
HoM = Σij(δip + δjp)1/p/2 | LM = Σij(δip + δjp)/(δip−1 + δjp−1) | StM = Σij[(δip − δjp)/(pδi + pδj)]1/(p−1)
A: arithmetic; G: geometric; H: harmonic; R: root mean square; S: symmetric; U: unsymmetric; Ho: Hölder; L: Lehmer; St: pseudo-Stolarsky.
Table 4. Best set of descriptors for the properties of Table 1 with the multilinear least-squares (MLS) methodology. 1st column: δv type for the valence-dependent indices. 2nd column: set of descriptors and their statistical quality.
δv-Type | Regression Equations
δvpo(1)
Tb = 237.5 + 139.1 0χ + 24.69 Dv + 527.7 0ψI − 25.91 1ψI − 1500 0ψE + 41.53 TΣ/M
(24, 31, 3.5, 69, 21, 222, 10)
N(TR) = 45, r2 = 0.821, s = 22; N(+16EV) = 61, r2 = 0.792, s = 25
Excluded strong outliers: Formamide & SO2 ∈ {EV}
δvpo(50)
ε = 2.804 − 12.05 χvt − 5.99∙10−5 1χvd + 132.7 1χvs + 0.021 1ψId − 421.2 1ψEs +38.12 TΣ/M
(0.9, 4.4, 10−5, 28, 0.005, 124, 2.9)
N(TR) = 43, r2 = 0.858, s = 4.2; N(+16EV) = 59, r2 = 0.896, s = 5.5
Excluded strong outliers: ethylencarbonate & quinoline ϵ {TR}, and MeOH & MeCl ∈ {EV}.
δvppo(−0.5)
d = 0.733 + 0.024 Dv + 0.211 0χv + 1.463 1χvs − 0.022 SψE + 0.148 Δ
(0.06, 0.002, 0.02, 0.3, 0.002, 0.01)
N(TR) = 45, r2 = 0.939, s = 0.07; N(+15EV) = 60, r2 = 0.914, s = 0.08
Excluded outliers: MeCl & MeOH ∈ {EV}
δvppo(1)
RI = 1.573 − 0.156 HM + 0.617 RM + 0.067 RMv − 0.447 SM − 0.086 HoMv − 0.012 SME
(0.03, 0.01, 0.02, 0.01, 0.02, 0.01, 0.02)
N(TR) = 45, r2 = 0.957, s = 0.04; N(+14EV) = 59, r2 = 0.951, s = 0.03
Excluded outliers: MeCl & MeOH ∈ {EV}
δvpo(−0.5)
γ = 8.683 + 0.386 Dv + 397.6 1χvs + 151.9 TψI − 502.4 1ψIs + 3.347 Δ
(2.3, 0.05, 57, 36, 90, 0.7)
N(TR) = 29, r2 = 0.835, s = 3.1; N(+10EV) = 39, r2 = 0.792, s = 3.1
Excluded outliers: formamide & nitromethane ∈ {EV}
δvppo(0.5)
FP = 387.1 + 26.99 HM − 94.38 HMI + 33.03 GME + 114.5 UMI − 83.10 HoME
(26, 6.2, 12, 5.2, 13, 11)
N(TR) = 29, r2 = 0.829, s = 16; N(+11EV)= 40, r2 = 0.764, s = 17
Excluded outlier: Acetone ∈ {EV}
δvpo(−0.5)
η = − 0.216 + 0.001 1χd + 0.486 1ψI + 2.20∙10−5 1ψId − 3.83∙10−6 0ψEd + 0.098 Σ
(0.2, 0.0003, 0.1, 7∙10−6, 10−7, 0.01)
N(TR) = 28, r2 = 0.969, s = 0.4; N(+10EV) = 38, r2 = 0.939, s = 0.4
Excluded outlier: MeOH ∈ {EV}
δvpo(5) φ = 0, 1
μ = 0.038 + 0.002 1χd − 0.189 Dv + 0.078 0χvd + 0.077 SψE + 4.039 TΣ/M
(0.2, 0.0002, 0.04, 0.01, 0.01, 0.4)
N(TR) = 24, r2 = 0.919, s = 0.4; N(+9EV) = 33, r2 = 0.768, s = 0.7
Excluded outliers: formamide & MeOH ∈ {EV}
δvpo(50)
UV = 299.1 + 50.54 SMv − 37.34 LMv − 9.048 HoME + 1.310 StME
(13, 4.9, 3.8, 1.1, 0.2)
N(TR) = 25, r2 = 0.893, s = 15; N(+8EV) = 33, r2 = 0.826, s = 21
Excluded outliers: 4-Me-2-pentanone ∈ {TR}; 2-butanone, MeOH, acetonitrile ∈ {EV}
δvpo(50)
−χ·106 = 0.617 + 0.044 0χd + 2.208 1χvs − 2.212 1ψIs + 0.070 Δ − 0.016 Σ
(0.02, 0.01, 0.4, 0.5, 0.008, 0.003)
N(TR) = 23, r2 = 0.876, s = 0.04; N(+7EV) = 30, r2 = 0.875, s = 0.04
Excluded outliers: nitromethane & MeOH ∈ {EV}
δvppo(1)
El = 0.018 + 0.181 × 10−3 1χd − 0.675∙10−6 1χvd + 0.003 0ψId + 140.8 TΣ/M
(0.02, 0.00006, 10−7, 0.0004, 14)
N(TR) = 15, r2 = 0.934, s = 0.06; N(+3EV) = 18, r2 = 0.931, s = 0.06
Table 5. ANN results with the descriptors of Table 4 and one hidden neuron. 1st column: the δv-type and the Ntr value; 2nd col.: ANN-MLP architecture, abbreviations of the activation functions for the internal layers, the number of epochs, and the training and test errors; 3rd col.: input indices, sensitivity values, and statistical parameters for the training plus test sets, [N(aTR + bTE)], and with the evaluation set added, [N(+cEV)].
δv-TypeANN-MLP(Variables) → Property
δvpo(1), Ntr = 105 | 6-1-1
(e, l) *
41
0.005/0.003
(0χ, Dv, 0ψI, 1ψI, 0ψE, TΣ/M) → Tb
(30.67, 34.22, 41.80, 1.111, 15.76, 2.291)
N(36TR + 9TE) = 45, r2 = 0.850, s = 21; N(+16EV) = 61, r2 = 0.820, s = 23
Excluded outliers: dMe-Ether & SO2 ∈ {EV}
δvpo(50), Ntr = 103 | 6-1-1
(e, s)
8
0.004/0.002
(χvt, 1χvd, 1χvs, 1ψId, 1ψEs, TΣ/M) → ε
(1.209, 1.091, 3.028, 1.108, 1.440, 6.964)
N(34TR + 9TE) = 43, r2 = 0.871, s = 3.8; N(+16EV) = 59, r2 = 0.793, s = 5.1
Excl.out.: ethylencarbonate & quinoline ϵ {TR}, formamide & acetone ∈ {EV}.
δvppo(−0.5), Ntr = 103 | 5-1-1
(t, t)
33
0.002/0.0006
(Dv, 0χv, 1χvs, SψE, Δ) → d
(17.99, 8.653, 2.953, 41.31, 12.37)
N(36TR + 9TE) = 45, r2 = 0.956, s = 0.1; N(+15EV) = 60, r2 = 0.930, s = 0.1
Excluded outliers: MeCl & MeOH ∈ {EV}
δvppo(1), Ntr = 103 | 6-1-1
(i, i)
20
0.001/0.0001
(0χ, Dv, 0χv, 0ψE, Δ, TΣ/M) → RI
(78.50, 212.9, 286.4, 356.0, 9.482, 1.603)
N(35TR + 10TE) = 45, r2 = 0.959, s = 0.03; N(+14EV) = 59, r2 = 0.943, s = 0.04
Excluded outliers: formamide & MeOH ∈ {EV}
δvpo(−0.5), Ntr = 105 | 5-1-1
(e, t)
27
0.005/0.006
(Dv, 1χvs, TψI, 1ψIs, Δ) → γ
(9.086, 34.48, 34.44, 45.45, 2.328)
N(22TR + 7TE) = 29, r2 = 0.841, s = 2.8; N(+10EV) = 39, r2 = 0.705, s = 3.7
Excluded outliers: nitromethane & formamide ∈ {EV}
δvppo(0.5), Ntr = 103 | 5-1-1
(e, e)
39
0.009/0.009
(HM, HMI, GME, UMI, HoME) → FP
(445.1, 1.44·106, 2.65·106, 4.22·106, 17·106)
N(22TR + 7TE) = 29, r2 = 0.801, s = 16; N(+11EV) = 40, r2 = 0.769, s = 16
Excluded outlier: 2Me-Butane ∈ {EV}
δvpo(−0.5), Ntr = 103 | 5-1-1
(e, l)
17
0.001/0.0004
(1χd, 1ψI, 1ψId, 0ψEd, Σ) → η
(1.982, 1.509, 1.060, 12.04, 3.824)
N(22TR + 6TE) = 28, r2 = 0.972, s = 0.3; N(+10EV) = 38, r2 = 0.942, s = 0.4
Excluded outlier: MeOH ∈ {EV}
δvpo(5) [φ = 0, 1], Ntr = 103 | 5-1-1
(e, s)
18
0.002/0.003
(1χd, Dv, 0χvd, SψE, TΣ/M) → μ
(317.2, 43.27, 17.80, 26.95, 8.546)
N(19TR + 5TE) = 24, r2 = 0.926, s = 0.4; N(+9EV) = 33, r2 = 0.768, s = 0.7
Excluded outliers: formamide & MeOH ∈ {EV}
δvpo(50), Ntr = 103 | 4-1-1
(s, i)
16
0.003/0.002
(SMv, LMv, HoME, StME) → UV
(772.2, 543.5, 28.82, 4.862)
N(20TR + 5TE) = 25, r2 = 0.892, s = 14; N(+8EV) = 33, r2 = 0.794, s = 22
Excl. outl.: 4M2-pentanone ϵ {TR}; 2-butanone, MeOH, Acetonitrile, ∈ {EV}
δvpo(50), Ntr = 103 | 5-1-1
(s, s)
15
0.008/0.001
(0χd, 1χvs, 1ψIs, Δ, Σ) → −χ·106 (=MS)
(1.420, 6.413, 3.061, 3.569, 1.792)
N(19TR + 4TE) = 23, r2 = 0.809, s = 0.04; N(+7EV) = 30, r2 = 0.810, s = 0.05
Excluded outliers: nitromethane & MeOH ∈ {EV}
δvppo(1), Ntr = 103 | 4-1-1
(i, i)
20
0.002/0.0003
(AMv, HME, GME, StMI) → El
(52.93, 3072, 3020, 27.81)
N(12TR + 3TE) = 15, r2 = 0.966, s = 0.04; N(+3EV) = 18, r2 = 0.955, s = 0.04
pentane and THF ϵ {TR}; excl. Out.: MeOH & 2-propanol ∈ {EV}
* Activation functions: e = exponential, i = identity, l = logistic, t = tanh, s = sin.
Table 6. ANN-MLP results with the descriptors of Table 4 and an externally imposed number of hidden neurons. The structure is similar to that of Table 5.
δv-TypeANN-MLP(Variables) → Property
δvpo(1), Ntr = 103 | 6-2-1
(t, t)
73
0.004/0.002
(0χ, Dv, 0ψI, 1ψI, 0ψE, TΣ/M) → Tb
(18.17, 50.17, 138.5, 6.414, 93.87, 4.392)
N(36TR + 9TE) = 45, r2 = 0.891, s = 17; N(+16EV) = 61, r2 = 0.871, s = 20
Excluded outliers: SO2 & MeOH ∈ {EV}
δvpo(50), Ntr = 105 | 6-3-1
(t, e)
55
0.002/0.001
(χvt, 1χvd, 1χvs, 1ψId, 1ψEs, TΣ/M) → ε
(2.111, 1.902, 8.790, 3.305, 8.234, 16.43)
N(34TR + 9TE) = 43, r2 = 0.942, s = 2.5; N(+16EV) = 59, r2 = 0.830, s = 4.5
Excl. Out.: ethylencarbonate & quinoline ∈ {TR}, formamide & nitromethane ∈ {EV}
δvppo(−0.5), Ntr = 103 | 5-4-1
(t, l)
58
0.0004/0.0001
(Dv, 0χv, 1χvs, SψE, Δ) → d
(41.54, 29.37, 9.057, 47.73, 29.59)
N(36TR + 9TE) = 45, r2 = 0.990, s = 0.04; N(+15EV) = 60, r2 = 0.966, s = 0.1
Excluded outliers: formamide & MeCl ∈ {EV}.
δvppo(1), Ntr = 103 | 6-2-1
(t, s)
20
0.0001/0.0004
(0χ, Dv, 0χv, 0ψE, Δ, TΣ/M) → RI
(152.0, 450.4, 1447, 596.4, 25.73, 2.743)
N(35TR + 10TE) = 45, r2 = 0.995, s = 0.03; N(+14EV) = 59, r2 = 0.987, s = 0.05
Excluded outliers: formamide & MeOH ∈ {EV}
δvpo(−0.5), Ntr = 105 | 5-4-1
(t, e)
36
0.004/0.002
(Dv, 1χvs, TψI, 1ψIs, Δ) → γ
(1285, 21.98, 2093, 62687, 5.853)
N(22TR + 7TE) = 29, r2 = 0.908, s = 2.1; N(+10EV) = 39, r2 = 0.871, s = 2.4
Excluded outliers: nitromethane & formamide ∈ {EV}
δvpo(1), Ntr = 105 | 5-5-1
(t, l)
35
0.003/0.009
(D, 1ψIs, 0ψEd, Δ, TΣ/M) → FP
(8.683, 2.965, 1.212, 5.431, 5.439)
N(22TR + 7TE) = 29, r2 = 0.919, s = 10; N(+11EV) = 40, r2 = 0.860, s = 13
Excluded outlier: nitromethane ∈ {EV}
δvpo(−0.5), Ntr = 105 | 5-3-1
(e, l)
35
0.0003/0.0003
(1χd, 1ψI, 1ψId, 0ψEd, Σ) → η
(4.609, 5.914, 1.286, 15.86, 6.803)
N(22TR + 6TE) = 28, r2 = 0.982, s = 0.3; N(+10EV) = 38, r2 = 0.975, s = 0.3
Excluded outlier: 2-butanone ∈ {EV}
δvpo(5) [φ = 0, 1], Ntr = 105 | 5-2-1
(t, t)
77
0.001/0.001
(1χd, Dv, 0χvd, SψE, TΣ/M) → μ
(12.41, 109.7, 76.57, 90.85, 34.04)
N(19TR + 5TE) = 24, r2 = 0.970, s = 0.2; N(+9EV) = 33, r2 = 0.874, s = 0.5
Excluded outliers: HAc, and MeOH ∈ {EV}
δvpo(0.5), Ntr = 105 | 4-5-1
(t, e)
142
0.002/0.0006
(Dv, 0χv, 0ψE, Δ) → UV
(604041, 1166, 22291, 18.45)
N(20TR + 5TE) = 25, r2 = 0.970, s = 7.5; N(+8EV) = 33, r2 = 0.895, s = 13
Excl. Out.: 4M2-pentanone ϵ {TR}; nitromethane, MeOH, acetone ∈ {EV}
δvpo(50), Ntr = 103 | 5-3-1
(e, s)
18
0.003/0.0008
(0χd, 1χvs, 1ψIs, Δ, Σ) → −χ·106 (=MS)
(3.148, 21.76, 4.090, 8.594, 3.054)
N(19TR + 4TE) = 23, r2 = 0.907, s = 0.03; N(+7EV) = 29, r2 = 0.871, s = 0.04
Excluded outliers: nitromethane, MeOH ∈ {EV}
δvppo(1), Ntr = 103 | 4-2-1
(t, s)
22
0.001/0.003
(AMv, HME, GME, StMI) → El
(80.08, 3075, 2819, 34.79)
N(12TR + 3TE) = 15, r2 = 0.973, s = 0.03; N(+3EV) = 18, r2 = 0.975, s = 0.03
pentane and THF ϵ {TR}; excluded outliers: acetonitrile & 2-propanol ∈ {EV}
Table 7. ANN-MLP results with the number of hidden neurons chosen by Statistica 8. Descriptors are those of Table 4. The structure is similar to that in Table 5.
δv-TypeANN-MLP(Variables) → Property
δvpo(1), Ntr = 103 | 6-11-1
(t, t)
39
0.005/0.005
(0χ, Dv, 0ψI, 1ψI, 0ψE, TΣ/M) → Tb
(17.98, 45.18, 106.2, 2.556, 72.23, 3.579)
N(36TR + 9TE) = 45, r2 = 0.846, s = 21; N(+16EV) = 61, r2 = 0.826, s = 24
Excluded outliers: MeOH & SO2 ∈ {EV}
δvpo(50), Ntr = 105 | 6-3-1
(t, e)
66
0.002/0.001
(χvt, 1χvd, 1χvs, 1ψId, 1ψEs, TΣ/M) → ε
(2.598, 2.510, 10.40, 3.409, 10.99, 15.65)
N(34TR + 9TE) = 43, r2 = 0.942, s = 2.5; N(+16EV) = 59, r2 = 0.742, s = 5.7
Excl. Out.: ethylencarbonate & quinoline ϵ {TR}, formamide & acetone ∈ {EV}
δvppo(−0.5), Ntr = 105 | 5-8-1
(t, l)
18
0.001/0.001
(Dv, 0χv, 1χvs, SψE, Δ) → d
(20.47, 8.414, 4.606, 49.56, 19.77)
N(36TR + 9TE) = 45, r2 = 0.970, s = 0.05; N(+15EV) = 60, r2 = 0.938, s = 0.07
Excluded outliers: MeCl & MeOH ∈ {EV}
δvppo(1), Ntr = 105 | 6-4-1
(e, i)
66
0.0001/0.0004
(0χ, Dv, 0χv, 0ψE, Δ, TΣ/M) → RI
(447.3, 947.0, 1178, 1152, 39.05, 14.42)
N(35TR + 10TE) = 45, r2 = 0.990, s = 0.02; N(+14EV) = 59, r2 = 0.984, s = 0.02
Excluded outliers: MeCl & MeOH ∈ {EV}
δvpo(−0.5), Ntr = 103 | 5-10-1
(l, s)
42
0.004/0.002
(Dv, 1χvs, TψI, 1ψIs, Δ) → γ
(18.16, 81.96, 74.19, 173.8, 2.809)
N(22TR + 7TE) = 29, r2 = 0.890, s = 2.3; N(+10EV) = 39, r2 = 0.851, s = 2.6
Excluded outliers: nitromethane & formamide ∈ {EV}
δvpo(1), Ntr = 105 | 5-4-1
(l, l)
81
0.003/0.01
(D, 1ψIs, 0ψEd, Δ, TΣ/M) → FP
(6.663, 2.542, 1.105, 4.616, 3.220)
N(22TR + 7TE) = 29, r2 = 0.899, s = 11; N(+11EV) = 40, r2 = 0.840, s = 14
Excluded outlier: 2Me-Butane ∈ {EV}
δvpo(−0.5), Ntr = 103 | 5-3-1
(e, l)
26
0.0003/0.0003
(1χd, 1ψI, 1ψId, 0ψEd, Σ) → η
(6.071, 4.640, 1.164, 14.16, 7.089)
N(22TR + 6TE) = 28, r2 = 0.981, s = 0.3; N(+10EV) = 38, r2 = 0.974, s = 0.3
Excluded outlier: 2-butanone ∈ {EV}
δvpo(5) [φ = 0, 1], Ntr = 105 | 5-4-1
(t, t)
49
0.001/0.0005
(1χd, Dv, 0χvd, SψE, TΣ/M) → μ
(20.13, 174.3, 115.0, 202.7, 62.97)
N(19TR + 5TE) = 24, r2 = 0.977, s = 0.2; N(+9EV) = 33, r2 = 0.835, s = 0.6
Excluded outliers: HAc, and MeOH ∈ {EV}
δvpo(0.5), Ntr = 105 | 4-5-1
(t, e)
108
0.001/0.0003
(Dv, 0χv, 0ψE, Δ) → UV
(2555, 517.8, 51639, 43.21)
N(20TR + 5TE) = 25, r2 = 0.970, s = 7.3; N(+8EV) = 33, r2 = 0.941, s = 10
Excl. Out.: 4M2-pentanone ϵ {TR}; nitromethane, 2-butanone, acetone ∈ {EV}
δvpo(50), Ntr = 105 | 5-4-1
(t, i)
78
0.0004/0.0001
(0χd, 1χvs, 1ψIs, Δ, Σ) → −χ·106 (=MS)
(48.24, 991.4, 1672, 165.9, 112.6)
N(19TR + 4TE) = 23, r2 = 0.989, s = 0.01; N(+7EV) = 29, r2 = 0.852, s = 0.04
Excluded outliers: nitromethane & MeOH ∈ {EV}
δvppo(1), Ntr = 105 | 4-5-1
(t, t)
49
0.002/0.001
(AMv, HME, GME, StMI) → El
(66.31, 355.9, 331.7, 27.55)
N(12TR + 3TE) = 15, r2 = 0.973, s = 0.03; N(+3EV) = 18, r2 = 0.973, s = 0.03
pentane and THF ϵ {TR} and excluded MeOH & 2-propanol ∈ {EV}
Table 8. ANN-MLP results for the set of descriptors of Table 4 with only one hidden neuron, where either one or two indices were deleted, usually those with the lowest sensitivity values shown in Table 5. The structure is similar to that of Table 5. Only the satisfactory results are shown here.
δv-TypeANN-MLP(Variables) → Property
δvpo(1), Ntr = 105 | 4-1-1
(e, e)
25
0.008/0.008
(0χ, Dv, 0ψI, 0ψE) → Tb
(816.3, 863.6, 110900, 7016972)
N(36TR + 9TE) = 45, r2 = 0.758, s = 26; N(+16EV) = 61, r2 = 0.714, s = 29
Excluded outliers: dMe-Ether & SO2 ∈ {EV}
δvpo(50), Ntr = 103 | 4-1-1
(i, s)
8
0.006/0.01
(χvt, 1χvs, 1ψEs, TΣ/M) → ε
(1.033, 1.602, 1.092, 3.781)
N(34TR + 9TE) = 43, r2 = 0.761, s = 5.2; N(+16EV) = 59, r2 = 0.903, s = 5.2
Excl.out.: ethylencarbonate & quinoline ϵ {TR}, nitromethane & HAc ∈ {EV}
δvppo(−0.5), Ntr = 103 | 4-1-1
(l, t)
17
0.004/0.002
(Dv, 0χv, SψE, Δ) → d
(11.01, 7.934, 28.40, 4.905)
N(36TR + 9TE) = 45, r2 = 0.917, s = 0.1; N(+15EV) = 60, r2 = 0.895, s = 0.1
Excluded outliers: SO2 & Formamide ∈ {EV}
δvppo(1), Ntr = 103 | 4-1-1
(i, e)
14
0.0008/0.0008
(0χ, Dv, 0χv, 0ψE) → RI
(1220, 479.3, 31.61, 2.185)
N(35TR + 10TE) = 45, r2 = 0.926, s = 0.05; N(+14EV) = 59, r2 = 0.914, s = 0.05
Excluded outliers: THF & MeCl ∈ {EV}
δvppo(0.5), Ntr = 105 | 4-1-1
(i, l)
26
0.01/0.02
(HMI, GME, UMI, HoME) → FP
(10.65, 14.68, 15.90, 12.16)
N(22TR + 7TE) = 29, r2 = 0.719, s = 19; N(+11EV) = 40, r2 = 0.702, s = 18
Excluded outlier: 2Me-Butane ∈ {EV}
δvpo(−0.5), Ntr = 103 | 3-1-1
(t, i)
67
0.0007/0.0003
(1χd, 0ψEd, Σ) → η
(1.603, 15.54, 10.70)
N(22TR + 6TE) = 28, r2 = 0.965, s = 0.4; N(+10EV) = 38, r2 = 0.917, s = 0.5
Excluded outlier: MeOH ∈ {EV}
δvpo(5) [φ = 0, 1], Ntr = 105 | 4-1-1
(t, e)
22
0.009/0.005
(Dv, 0χvd, SψE, TΣ/M) → μ
(2.582, 4.178, 6.314, 3.371)
N(19TR + 5TE) = 24, r2 = 0.795, s = 0.6; N(+9EV) = 33, r2 = 0.746, s = 0.7
Excluded outliers: HAc & MeOH ∈ {EV}
Table 9. Statistical results, N/r2 (2nd decimal figure)/s, for the eleven properties from Table 4, Table 5, Table 6 and Table 7. 2nd column: MLS results; 3rd column: ANN with one hidden neuron (ANN 1HN); 4th column: ANN with an externally chosen number of hidden neurons (ANN enHN); 5th column: ANN with a software-chosen number of hidden neurons (ANN snHN). In each cell, the first entry gives the statistics for the training (MLS) or training plus test (ANN) compounds, and the second entry the overall statistics inclusive of the evaluated compounds. M stands for MMCIs (otherwise the descriptors are MCIs). In the last two columns, the second entry is preceded by the number of hidden neurons.
P | MLS (Table 4) | ANN 1HN (Table 5) | ANN enHN (Table 6) | ANN snHN (Table 7)
Tb | 45/0.82/22; 61/0.79/25 | 45/0.85/21; 61/0.82/23 | 45/0.89/17; 2/61/0.87/20 | 45/0.85/21; 11/61/0.83/24
ε | 43/0.86/4.2; 59/0.90/5.5 | 43/0.87/3.8; 59/0.79/5.1 | 43/0.94/2.5; 3/59/0.83/4.5 | 43/0.94/2.5; 5/59/0.83/5.7
d | 45/0.94/0.07; 60/0.91/0.08 | 45/0.96/0.1; 60/0.93/0.1 | 45/0.99/0.04; 4/60/0.97/0.1 | 45/0.97/0.05; 8/60/0.94/0.1
RI | 45/0.96/0.04; 59/0.95/0.03 | 45/0.96/0.03; 59/0.94/0.04 | 45/0.995/0.03; 2/59/0.99/0.05 | 45/0.99/0.02; 4/59/0.98/0.02
γ | 29/0.84/3.1; 39/0.79/3.1 | 29/0.84/2.8; 39/0.71/3.7 | 29/0.91/2.1; 4/39/0.87/2.4 | 29/0.89/2.3; 10/39/0.85/2.6
FP | M/29/0.83/16; 40/0.76/17 | M/29/0.80/16; 40/0.77/16 | 29/0.92/10; 5/40/0.86/13 | 29/0.90/11; 4/40/0.84/14
η | 28/0.97/0.4; 38/0.94/0.4 | 28/0.97/0.3; 38/0.94/0.4 | 28/0.98/0.3; 3/38/0.98/0.3 | 28/0.98/0.3; 3/38/0.97/0.3
μ | 24/0.92/0.4; 33/0.77/0.7 | 24/0.93/0.4; 33/0.77/0.7 | 24/0.97/0.2; 2/33/0.87/0.5 | 24/0.98/0.2; 4/33/0.84/0.6
UV | M/25/0.89/15; 33/0.83/21 | M/25/0.89/14; 33/0.79/22 | 25/0.97/7.5; 5/33/0.90/13 | 25/0.97/7.3; 5/33/0.94/10
−χ·106 | 23/0.88/0.04; 30/0.88/0.04 | 23/0.81/0.04; 30/0.81/0.05 | 23/0.91/0.03; 3/29/0.87/0.04 | 23/0.99/0.01; 4/29/0.85/0.04
El | 15/0.93/0.06; 18/0.93/0.06 | M/15/0.97/0.04; 18/0.96/0.04 | M/15/0.97/0.03; 2/18/0.98/0.03 | M/15/0.97/0.03; 5/18/0.97/0.03
