1. Introduction
There is no doubt that cardiovascular diseases are among the most important medical problems globally. Accordingly, monitoring drugs for cardiotoxicity should also be considered a very important medical problem. Data on cardiotoxicity are needed for the development of a wide range of new drugs [1]. The human ether-à-go-go-related gene (hERG), which encodes a cardiac potassium channel, plays an important role in regulating heart rhythm, and data on cardiotoxicity associated with hERG inhibition by drugs and environmental chemicals provide essential information for medicinal chemistry. Expanding cardiotoxicity data through direct experiments is expensive and, at large scale, virtually impossible. Therefore, in silico models can help to meet this need. For instance, to rank molecules in the early stages of new drug development and minimize the risks of using new pharmaceutical agents, computational approaches are used to predict the hERG-blocking potential of new drug candidates. Indeed, quantitative structure–property/activity relationships (QSPRs/QSARs) for the cardiac toxicity of organic hERG blockers are reported in the literature [2,3,4,5].
A very complex model requires too much knowledge and too many skills. Convenient models that follow the principle of economy of thinking, i.e., that do not call for significant intellectual effort, are therefore desirable; such models extract information from data already available, such as effect data associated with chemical structures. However, chemical information can be described and processed in many ways, depending on the chemical structure format. It is interesting to compare the practical applications of InChI (International Chemical Identifier) and SMILES (Simplified Molecular Input Line Entry System), which are common formats used to describe the structure of a substance. According to Scopus, citations of works using InChI for QSPR/QSAR analysis amount to only 2% of those using SMILES for QSPR/QSAR analysis [6,7].
The CORAL software (http://www.insilico.eu/coral, accessed on 11 April 2025) requires only SMILES and numerical data on an endpoint to build a model, so the approach used here is fairly convenient. In addition, two innovations are applied to the construction of the cardiotoxicity models described here. First, the coefficient of conformism of a correlative prediction (CCCP) [8] was used to improve the efficiency of the Monte Carlo method for model generation. Second, the Las Vegas algorithm [8,9] was used to select a prospective split of the available data into training and validation sets. It is likely that this is the first time that these steps have been applied to the construction of a cardiotoxicity model.
The aim of the study is to assess the influence of these new characteristics [10,11] on the predictive potential of cardiotoxicity models, with a view to developing successful models.
3. Discussion
The CORAL software has been used to build QSPR/QSAR models for over ten years by many institutes [28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58]. The basic idea behind this software is the use of SMILES to represent molecular structures. Most of the previously used QSPR/QSAR models were based on the use of a molecular graph to represent molecular structures, in which the vertices represent atoms of different chemical elements and the edges represent covalent bonds [59,60].
Both of the aforementioned options for representing the molecular structure have their advantages and disadvantages; the main point, however, is that these representations are far from identical and can therefore, in principle, complement each other. The CORAL software makes it possible to construct models based on SMILES, on the representation of the molecular structure as a molecular graph, or on a hybrid representation in which molecular features expressed by SMILES attributes are used together with invariants of molecular graphs in the modeling process [61,62,63,64].
In addition to the possibility of constructing hybrid descriptors as indicated above, specific molecular features that characterize the molecular system as a whole were considered for QSPR/QSAR modeling. These are the global attributes of SMILES, such as BOND, NOSP, and HALO. BOND is a code for covalent bonds. NOSP is a code that represents configurations of nitrogen, oxygen, sulfur, and phosphorus in a molecular system.
SMILES can serve as a basis for developing other variants of molecular structure information that can, in principle, reflect the complex biochemical features of molecular structures that determine the biological activity of a substance. These may be combinatorial features, such as contributions of individual atoms or proportions of pairs of atoms [11].
The fragments of local symmetry (FLS) examined in this study are only partially related to mathematical symmetry. On the one hand, they have a significant level of symmetry, but in a topological sense and applied to local situations in the molecule rather than to the whole structure. On the other hand, this means that the molecular fragments represented by FLS are not related to the traditional concept of symmetry, which is global. Nevertheless, FLSs can improve the predictive potential of QSAR models, as this study confirms.
In addition to this kind of representation, there are other ways to influence the results of stochastic optimization. These are the statistical criteria of predictive potential, namely, the index of ideality of correlation (IIC), the correlation intensity index (CII), and the coefficient of conformism of a correlative prediction (CCCP).
The basic principles and expectations when using stochastic modeling with the Monte Carlo approach via the CORAL program (http://www.insilico.eu/coral) are as follows.
Any QSPR/QSAR model can be affected by the presence of certain substances in a given (sub)set; a model is thus a random event. Therefore, considering a single split into training and validation sets is not enough for a robust assessment of the predictive potential of the method used; it is necessary to consider several random (non-identical) splits into training and validation sets.
Even when considering a single split into training and validation sets, multiple runs of the stochastic Monte Carlo simulation process will yield different values of the statistical characteristics of the training and validation sets. In this case, the important and necessary information for the correct assessment of the predictive potential of the method obtained is the dispersion of the statistical values for the training and validation sets.
This dispersion is not necessarily associated with an error (limitation of the predictive potential) of the considered method; there may be cases of special influence of the IIC and CII factors, which divide the correlation cluster into two sub-clusters for the training set, as shown in Figure 1.
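The dispersion over repeated runs can be summarized as a mean and a standard deviation. The following is a minimal sketch only: the function name and the five determination coefficients are hypothetical values chosen for illustration, not results of this study.

```python
import statistics


def summarize_runs(r2_values):
    """Return (mean, sample standard deviation) of R^2 over repeated runs."""
    mean = statistics.mean(r2_values)
    # A single run has no dispersion to report.
    spread = statistics.stdev(r2_values) if len(r2_values) > 1 else 0.0
    return mean, spread


# Hypothetical validation-set R^2 values from five independent Monte Carlo runs
# on the same split (illustrative numbers only).
runs = [0.70, 0.74, 0.73, 0.76, 0.72]
mean, spread = summarize_runs(runs)
print(f"R2 (validation) = {mean:.2f} ± {spread:.2f}")
```

Reporting the spread alongside the mean makes it visible whether a difference between two splits exceeds the run-to-run scatter.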
To strengthen the statistical reliability of the selected divisions in training and validation sets, the Las Vegas algorithm was used to obtain the divisions with minimal statistical defects. The essence of the specified stochastic process (the Las Vegas algorithm) is the construction of structured training and validation sets. The structured training set includes passive and active training sets accompanied by a calibration set. The statistical defect of each of the specified sets is the sum of the statistical defects of the SMILES in the abscissa of the model.
Ideally, a good model yields similar statistical parameters for the training and validation sets.
The results of the computational experiments considered here confirm the relevance of the above points, although this type of study should be replicated for different endpoints.
This study was planned as a means of testing the ability of local symmetry fragments, in cooperation with the CCCP, to improve the predictive potential of the models. The comparison of Table 1 and Table 2 demonstrated that the CCCP results in a significant improvement in the statistical quality of the models. The corresponding computational experiments without the correlation weights of the FLS have shown that without these, the models are inadequate. Additional studies with different endpoints may explore whether this is true in other cases too.
At present, the study of IIC and CII has shown that the potential for using IIC is greater than that for CII, even though the combined use of IIC and CII may be beneficial. It is possible to manage the process of cooperation between IIC and CII by using coefficients similar to F1–F3 used in Equations (3) and (4).
Similar studies can be conducted for the new criterion of predictive potential considered here, the CCCP. The advantages of this criterion in terms of improving the stochastic processes used here for constructing models, both in the Monte Carlo method and in the Las Vegas algorithm, may require additional verification. The study of both the cooperative application of the considered criteria of predictive potential and their individual capabilities is broad in scope. Obviously, this requires not only experimental implementation but also theoretical understanding. From this point of view, it should be noted that the main advantage of the IIC is its ability to take into account both the correlation coefficient and the mean absolute error (MAE) and/or the root mean squared error (RMSE). To assess the correlation intensity (to calculate the CII), other parameters of abstract correlation are used that do not depend on the dispersion of correlation clusters in the “experiment vs. forecast” or “experiment vs. calculation” coordinates (i.e., they do not depend on RMSE and MAE). In fact, optimizing the correlation weights can be interpreted as making a “generalized” decision that affects all compounds (SMILES), not just the training and unseen training sets. This is similar to making a generalized decision through a bicameral legislature in a state parliament. A bicameral legislature is used to avoid a biased decision that is preferred by a particular group of representatives in a parliament. Similarly, two groups of substances that have an unequal influence on the final decision, used separately as training sets and unseen training sets, can help to avoid the biased decision that is preferred by visible substances. The “protests” (substances with an opposite behavior) underlying the calculation of the CII allow us to consider the correlation to be a structure similar to the above-mentioned bicameral legislature. This allows us to compare different correlations using this rule: the smaller the sum of protests [8], the higher the correlation value. The calculations of the CII and the CCCP have a similar basis, which is the so-called “protest” [8]. However, the main difference between the CCCP and the CII is that the CCCP takes into account not only the “protests” but also the opinion of the supporters of the correlation, that is, the compounds (SMILES) that have a negative protest value. This apparently gives the CCCP an advantage over the CII, because taking into account all opinions is more balanced and yields more information about various phenomena.
In silico simulation as a method of cognition functions by feedback, i.e., any model should be verifiable. For the considered models, an interested user can carry out verification conveniently: it is necessary only to download the CORAL program and run it using the splits available from the Supplementary Materials of this study. For QSPR/QSAR simulation by means of stochastic variation in model parameters, a necessary condition is the reproducibility of the results (the values of statistical parameters for the considered partitions). As shown, the predictive potential is reproducible with good statistical quality (0.73 ± 0.03 for the determination coefficients for the validation samples). Thus, the proposed modeling concept can be accepted as a convenient tool for QSPR/QSAR analysis.
In principle, the important points related to the development and use of models are universality and the possibility of standardization. Universality is understood as the ability of the approach to serve various classes of compounds in QSPR/QSAR analysis. Standardization is the ability to determine the essential criteria that guarantee a good level of predictive potential. In terms of universality, the proposed approach can be easily transposed into a modeling tool based on eclectic data using the so-called quasi-SMILES [10]. In terms of standardization, the above-mentioned IIC and CII, as well as the new parameter CCCP, can be used.
To evaluate the proposed approach, a validation of the model obtained for split 1 was performed. For this purpose, the corresponding SMILES-based descriptors were calculated for 200 compounds outlined in [15]. This validation confirmed the predictive potential of the model obtained for split 1. The technical details of this validation are presented in the Supplementary Materials (Table S2).
The above results demonstrate that the new possibilities for constructing the stochastic models discussed here, namely (i) the CCCP and (ii) the Las Vegas algorithm, look quite promising.
We plan to conduct corresponding studies in the future, expanding the study scope by considering further cases, including traditional SMILES, organic and inorganic substances, and peptides and nanomaterials using quasi-SMILES.
4. Materials and Methods
Regression model
Database 1
The first database is for regression models. The numerical data on cardiotoxicity, expressed in logarithmic units (pIC50), were taken from the literature [1]. Twelve duplicates were detected in the database. After removing the duplicates, the total number of compounds was 394. These were randomly divided into three partitions to produce three models. Each partition contained a structured training set comprising the active (A) training, passive (P) training, and calibration (C) sets; in addition, a validation (V) set was used to evaluate the results of the model using new substances (invisible during the construction of the model).
Descriptors
The descriptors applied were calculated as follows:
Sk is a SMILES atom, i.e., a single symbol (‘C’, ‘O’, ‘N’, etc.) or a group of symbols that should be considered a united system (‘Cl’, ‘@@’, ‘%11’, etc.);
SSk and SSSk are two or three connected SMILES atoms;
FLS means fragments of local symmetry [10], i.e., fragments of SMILES that can be represented as XYX, XYYX, or XYZYX, where X ≠ Y and Y ≠ Z;
APP is the matrix of atom pair proportions [28];
T and N are parameters of the Monte Carlo optimization that provide numerical data on the correlation weights (CWs) for the SMILES attributes.
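The notions of a SMILES atom and of a fragment of local symmetry can be illustrated with a short sketch. It rests on assumptions: the tokenizer recognizes only the multi-character groups named in the text plus ‘Br’ (bracket atoms are not handled), the function names are hypothetical, and the scan treats XYX, XYYX, and XYZYX as patterns over consecutive SMILES atoms. The actual rules implemented in CORAL may differ.

```python
import re

# Assumed tokenization: two-letter halogens, '@@', and two-digit ring closures
# such as '%11' are single SMILES atoms; everything else is one character.
SMILES_ATOM = re.compile(r"Cl|Br|@@|%\d{2}|.")


def smiles_atoms(smiles):
    """Split a SMILES string into a list of SMILES atoms."""
    return SMILES_ATOM.findall(smiles)


def count_fls(atoms):
    """Count XYX, XYYX and XYZYX windows (X != Y, Y != Z) in a token list."""
    counts = {"XYX": 0, "XYYX": 0, "XYZYX": 0}
    for i in range(len(atoms)):
        w3, w4, w5 = atoms[i:i + 3], atoms[i:i + 4], atoms[i:i + 5]
        if len(w3) == 3 and w3[0] == w3[2] and w3[0] != w3[1]:
            counts["XYX"] += 1
        if (len(w4) == 4 and w4[0] == w4[3] and w4[1] == w4[2]
                and w4[0] != w4[1]):
            counts["XYYX"] += 1
        if (len(w5) == 5 and w5[0] == w5[4] and w5[1] == w5[3]
                and w5[0] != w5[1] and w5[1] != w5[2]):
            counts["XYZYX"] += 1
    return counts


# Acetic anhydride, used here only as an illustrative input.
tokens = smiles_atoms("CC(=O)OC(C)=O")
print(tokens)
print(count_fls(tokens))
```

Scanning fixed-size windows keeps the check local, which matches the idea that the symmetry applies to a fragment rather than to the whole structure.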
Models
The model generated by the CORAL software was calculated as
C0 and C1 are regression coefficients.
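Since C0 and C1 are regression coefficients, the model can be read as a one-variable linear regression of the endpoint on the optimal descriptor. The sketch below assumes this form and assumes that the descriptor is the sum of the correlation weights of the attributes found in a SMILES; the function names, weights, and endpoint values are hypothetical.

```python
def descriptor(attributes, weights):
    """Sum the correlation weights of the attributes of one SMILES.

    Attributes without a weight contribute zero (an assumption made here).
    """
    return sum(weights.get(a, 0.0) for a in attributes)


def fit_line(x, y):
    """Ordinary least squares for y = c0 + c1 * x; returns (c0, c1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    c1 = sxy / sxx
    c0 = mean_y - c1 * mean_x
    return c0, c1


# Hypothetical correlation weights and (attributes, pIC50) pairs.
weights = {"C": 0.5, "N": 1.2, "O": -0.3}
training = [
    (["C", "C", "N"], 5.4),
    (["C", "O"], 4.1),
    (["N", "N", "C"], 6.0),
    (["O", "O", "C"], 3.8),
]
x = [descriptor(attrs, weights) for attrs, _ in training]
y = [value for _, value in training]
c0, c1 = fit_line(x, y)
print(f"endpoint = {c0:.3f} + {c1:.3f} * descriptor")
```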
Monte Carlo method
The Monte Carlo optimization provides the numerical data on the correlation weights of the SMILES attributes listed above. The optimization aims to maximize the following target functions:
RA and RP are correlation coefficients for the active and passive training sets, respectively; the index of ideality of correlation (IIC) [11], the correlation intensity index (CII) [8], and the coefficient of conformism of a correlative prediction (CCCP) [8] are components of the Monte Carlo optimization; F1 = F2 = 0.5; F3 = 0.3.
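The general idea of optimizing correlation weights against a target function can be sketched as a stochastic hill climb. This is not the CORAL routine: the stand-in target below is simply the sum of the correlation coefficients for the active and passive training sets, the IIC, CII, and CCCP terms weighted by F1 to F3 are omitted, and all names and data are hypothetical.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def descriptor(attributes, weights):
    return sum(weights.get(a, 0.0) for a in attributes)


def target(weights, active, passive):
    """Stand-in target: R(active) + R(passive); extra terms are omitted."""
    r_a = pearson([descriptor(a, weights) for a, _ in active],
                  [v for _, v in active])
    r_p = pearson([descriptor(a, weights) for a, _ in passive],
                  [v for _, v in passive])
    return r_a + r_p


def monte_carlo(weights, score, n_epochs=200, step=0.1, seed=0):
    """Perturb one weight at a time; keep the change if the score grows."""
    rng = random.Random(seed)
    best = dict(weights)
    best_score = score(best)
    for _ in range(n_epochs):
        for key in sorted(best):
            trial = dict(best)
            trial[key] += rng.uniform(-step, step)
            trial_score = score(trial)
            if trial_score > best_score:
                best, best_score = trial, trial_score
    return best, best_score


# Hypothetical (attributes, endpoint) pairs for the two training subsets.
active = [(["C", "N"], 5.1), (["C", "O"], 4.2),
          (["C", "C", "N"], 5.6), (["O", "O"], 3.9)]
passive = [(["N", "N"], 6.0), (["C", "O", "O"], 4.0), (["C", "N", "O"], 4.9)]
start = {"C": 0.0, "N": 0.0, "O": 0.0}
start_score = target(start, active, passive)
best, best_score = monte_carlo(start, lambda w: target(w, active, passive))
print(f"target function: {start_score:.3f} -> {best_score:.3f}")
```

Passing the target as a callable keeps the search loop unchanged when further terms are added to the target function.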
Classification models
Database 2
The second database contains data on 13,846 organic molecules for classification models [17]. The descriptors are as described above.
Models
A unique feature of the approach under consideration is the possibility of constructing so-called semi-correlations [11], which are tools for the representation of binary classifications based on the principle of active versus inactive, represented by 1 and 0 (in principle, the other option, namely active = +1 and inactive = −1, can be used too).
Semi-correlations are applied by means of a regression model in which the values of the optimal descriptor calculated by Equation (1) are plotted along one axis (ordinate), while along the abscissa there are only two values, 0 and 1 (or, as mentioned above, −1 and +1), identifying non-activity and activity, respectively. This binary classification is carried out according to the following scheme:
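A minimal sketch of a binary classification of this kind is given here under stated assumptions: the calculated value is compared with a cut-off of 0.5 (an assumed threshold), and the quality is summarized by sensitivity, specificity, accuracy, and the Matthews correlation coefficient. The function names and the example values are hypothetical.

```python
def classify(calculated, cutoff=0.5):
    """Assign class 1 (active) if the calculated value reaches the cut-off."""
    return [1 if value >= cutoff else 0 for value in calculated]


def confusion(observed, predicted):
    """Return (TP, TN, FP, FN) for 0/1 labels."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == 1 and p == 1)
    tn = sum(1 for o, p in pairs if o == 0 and p == 0)
    fp = sum(1 for o, p in pairs if o == 0 and p == 1)
    fn = sum(1 for o, p in pairs if o == 1 and p == 0)
    return tp, tn, fp, fn


def quality(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy and Matthews correlation (MCC)."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "mcc": mcc}


# Hypothetical observed classes and calculated values for six compounds.
observed = [1, 1, 0, 0, 1, 0]
calculated = [0.9, 0.4, 0.2, 0.6, 0.7, 0.1]
predicted = classify(calculated)
stats = quality(*confusion(observed, predicted))
print(predicted, stats)
```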
Applicability domain
Modeling with the CORAL program is stochastic in several respects. First, it is assumed that a model can be built for any random split into training and validation sets. It is expected that some splits will lead to a good statistical quality of the model and some to a low statistical quality. Second, it is expected that even for “successful” models, there will be a scatter in the statistical characteristics (correlation coefficient and standard deviation). Thus, criteria are needed to select suitable splits for training and validation. Statistical defect values for molecular features extracted from SMILES and statistical defect values for splits have been proposed [10]. Depending on the statistical defects of the molecular features, as well as their average values, the applicability domain is determined, as shown below.
The defects for SMILES features (which represent molecular features) are calculated as follows:
where P(Ak), P′(Ak), and P″(Ak) are the probabilities of Ak in the active training, passive training, and calibration sets, respectively, and N(Ak), N′(Ak), and N″(Ak) are the frequencies of Ak in the active training set, passive training set, and calibration set. The statistical SMILES defects (Dj) are calculated as follows:
where NA is the number of non-blocked SMILES attributes in the SMILES.
A SMILES falls into the domain of applicability if
is average on the list of SMILES attribute defects {Dj}.
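A sketch of an applicability-domain check of this kind is given here under explicit assumptions. The defect of an attribute is taken as the sum of pairwise probability differences between the three sets, each divided by the corresponding sum of frequencies; an attribute absent from any set receives a defect of 1; the SMILES defect is the sum over its non-blocked attributes; and a SMILES is inside the domain when its defect is below twice the average defect. These forms follow what is commonly reported for CORAL models and are assumptions here, as are all names and numbers.

```python
def attribute_defect(counts, set_sizes):
    """Defect of one attribute.

    counts    -- (N, N', N''): frequencies in the active training,
                 passive training and calibration sets
    set_sizes -- sizes of the same three sets, used for the probabilities
    """
    n, n1, n2 = counts
    if min(n, n1, n2) == 0:
        return 1.0  # assumed penalty for an attribute missing from a set
    p, p1, p2 = (c / s for c, s in zip(counts, set_sizes))
    return (abs(p - p1) / (n + n1)
            + abs(p - p2) / (n + n2)
            + abs(p1 - p2) / (n1 + n2))


def smiles_defect(attributes, defects):
    """Sum of defects over the non-blocked attributes of one SMILES."""
    return sum(defects[a] for a in attributes if a in defects)


def in_domain(defect, all_defects, factor=2.0):
    """True if the defect is below factor times the average defect."""
    average = sum(all_defects) / len(all_defects)
    return defect < factor * average


# Hypothetical attribute frequencies in three sets of ten SMILES each.
sizes = (10, 10, 10)
defects = {
    "C": attribute_defect((10, 10, 10), sizes),
    "O": attribute_defect((4, 2, 1), sizes),
    "N": attribute_defect((5, 2, 0), sizes),
}
molecules = {"m1": ["C", "C", "O"], "m2": ["C", "N"], "m3": ["C", "N", "N", "N"]}
d = {name: smiles_defect(attrs, defects) for name, attrs in molecules.items()}
for name, value in d.items():
    print(name, round(value, 3), in_domain(value, list(d.values())))
```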
Las Vegas algorithm
The Las Vegas algorithm is a sequence of tests of different splits into active training, passive training, calibration, and validation sets in the process of the Monte Carlo optimizations [8]. The aim of the algorithm used here is to find the split that provides a determination coefficient for the calibration set that is as large as possible, in the hope that it is accompanied by a large determination coefficient for the external validation set. Table 6 contains an example of how the Las Vegas algorithm functions.
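The selection of a split can be sketched as a loop over random splits that keeps the one with the best calibration score. This is an illustration under assumptions: the four sets are taken to be of roughly equal size, the function names are hypothetical, and the scoring function is a toy stand-in for building a model by Monte Carlo optimization and returning the determination coefficient for the calibration set.

```python
import random


def random_split(items, rng):
    """Shuffle and cut into four parts of (roughly) equal size."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    q = len(shuffled) // 4
    return {
        "active": shuffled[:q],
        "passive": shuffled[q:2 * q],
        "calibration": shuffled[2 * q:3 * q],
        "validation": shuffled[3 * q:],
    }


def las_vegas(items, score_calibration, n_trials=10, seed=0):
    """Try n_trials random splits and keep the best-scoring one."""
    rng = random.Random(seed)
    best_split, best_score = None, float("-inf")
    for _ in range(n_trials):
        split = random_split(items, rng)
        score = score_calibration(split)
        if score > best_score:
            best_split, best_score = split, score
    return best_split, best_score


# Toy stand-in score: prefers calibration sets whose mean endpoint is close
# to the overall mean. A real run would return R^2 of the calibration set.
endpoints = [float(i) for i in range(40)]
overall_mean = sum(endpoints) / len(endpoints)


def toy_score(split):
    cal = split["calibration"]
    return -abs(sum(cal) / len(cal) - overall_mean)


best_split, best_score = las_vegas(endpoints, toy_score)
print({name: len(part) for name, part in best_split.items()}, best_score)
```

The validation set plays no part in the score, so it remains external to the selection of the split.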
Mechanistic interpretation
By comparing several runs of the described Monte Carlo optimization started under the same conditions (same split and same parametrization), one can select a group of SMILES attributes (i.e., a group of molecular features) with positive correlation weights in all runs. These can be considered promoters of an increase in the endpoint under consideration.
Table 7 contains a collection of molecular features with positive correlation weights in several runs of the stochastic optimization. One can see that the absence of the fragments of local symmetry XYYX and XYZYX is a promoter of an increase in hERG-related cardiac toxicity. The same role is played by the presence of certain atoms, such as nitrogen, and by a significant degree of aromaticity (Table 7).
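Selecting attributes whose correlation weights are positive in every run can be sketched as follows. The function name, attribute labels, and weights are hypothetical and serve only to show the selection rule.

```python
def stable_promoters(runs):
    """Attributes present in every run with a positive weight in each."""
    common = set(runs[0])
    for run in runs[1:]:
        common &= set(run)
    return sorted(a for a in common if all(run[a] > 0 for run in runs))


# Hypothetical correlation weights from three optimization runs.
runs = [
    {"N": 0.8, "c": 0.4, "O": -0.2, "Cl": 0.1},
    {"N": 1.1, "c": 0.3, "O": 0.2, "Cl": -0.3},
    {"N": 0.6, "c": 0.5, "O": -0.4},
]
print(stable_promoters(runs))
```

Requiring a positive sign in all runs filters out attributes whose apparent role depends on a particular random start.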