Article

Structure–Activity Relationship (SAR) Modeling of Mosquito Repellents: Deciphering the Importance of the 1-Octanol/Water Partition Coefficient on the Prediction Results

1
CTIS, 69140 Rillieux-La-Pape, France
2
SPO, University Montpellier, INRAE, Institut Agro, 34060 Montpellier, France
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(13), 5366; https://doi.org/10.3390/app14135366
Submission received: 15 May 2024 / Revised: 4 June 2024 / Accepted: 17 June 2024 / Published: 21 June 2024
(This article belongs to the Section Chemical and Molecular Sciences)

Abstract

Repellents play a fundamental role in vector control and prevention to keep mosquitoes away from humans. Because few repellents are available, new ones must be found to limit problems of resistance. QSAR (Quantitative Structure–Activity Relationship) methods are particularly suited for designing molecules with potential repellent activity. These models require that the molecules be described by physicochemical properties, topological indices, and/or structural indicators. In the first case, QSPR (Quantitative Structure–Property Relationship) models are used for calculating physicochemical descriptors. Use of different QSPR models for the same property can lead to different values for the same molecule. In this context, the influence of the 1-octanol/water partition coefficient (log P) calculated according to two different methodologies was statistically evaluated in the modeling of 2171 molecules whose skin repellent activity against Aedes aegypti was available. The two series of supervised artificial neural networks differed only by their input neuron coding for log P. Although both categories of classification models led to overall good statistics, we clearly showed that differences in log P values calculated for a molecule could result in very different prediction results. This was especially true for repellents. The practical implications of these differences are discussed.

1. Introduction

Mosquito-transmitted diseases are the main sources of human illness and death in numerous countries worldwide. Malaria is the most devastating mosquito-borne disease. It is caused by Plasmodium protozoan parasites following the infective bites of females of the Anopheles mosquito [1]. In 2021, there were an estimated 247 million malaria cases in 84 endemic countries and 619,000 malaria deaths [2]. Dengue is a viral infection transmitted to humans through the bite of infected Aedes mosquitoes. It is the most important and fastest-spreading mosquito-borne viral disease worldwide. Roughly half of the world’s population currently lives in areas that are environmentally suitable for dengue transmission [3,4]. Other viral diseases transmitted by mosquitoes include chikungunya, Zika, yellow fever, West Nile fever, Japanese encephalitis, and Rift Valley fever [5].
Different vector control methods are used to stop the transmission of mosquito-borne diseases and to limit the nuisance caused by these dipterans. These methods are based on two main complementary strategies. The first one consists of a reduction in mosquito populations in the environment by using physical, mechanical, biological, genetic, and/or biocidal methods [6,7]. The second strategy relies on the use of personal protections via different devices, among which the repellents occupy a key place because they provide immediate, localized, and personal protection against host-seeking mosquitoes [8]. DEET (N,N-diethyl-m-toluamide) is the most widely used repellent worldwide despite some drawbacks [8,9]. Faced with the limited number of other skin-applied repellents approved by regulatory authorities which are commercially available (e.g., IR3535, picaridin), there is a need to identify original structures with new mechanisms of action. Indeed, the more mosquito repellents available, the more we limit the emergence of resistance phenomena. The reduced effectiveness of DEET against Aedes and Anopheles mosquitoes has already been demonstrated in various situations [10,11,12].
In silico approaches allow us to speed up the discovery process of new mosquito repellents through the use of two main paradigms. The first one relies on the use of odorant-binding proteins (OBPs) that play a key role in mosquito olfaction. Among the OBPs, the crystal structures of OBP1 have been used in docking analysis for the structure-based discovery of new molecules with a potential repellent activity against Aedes or Anopheles mosquitoes [13,14,15,16]. The second one consists in the computation of (Quantitative) Structure–Activity Relationship ((Q)SAR) models [17]. Molecules with their experimental repellent activity are encoded by molecular descriptors and then a linear or nonlinear statistical method [18] is used for establishing formal relationships between the structure of the molecules and their repellent activity. The choice of the statistical approach mainly depends on the complexity of the problem at hand. Structural features (SFs), topological indices (TIs) and/or physicochemical properties are used for describing the molecules and feature selection methods allow the selection of the most information-rich set of descriptors and guarantee their optimal number to address overfitting problems. While the programs for calculating the SFs and TIs do not give different results if the native algorithms are used, the same is not true for the programs aiming at calculating the physicochemical properties of molecules that are QSPR (Quantitative Structure–Property Relationship) models. It is well-known that two different QSPR models built for the same physicochemical property can lead to significantly different prediction results for the same molecule. The intrinsic factors that intervene for explaining these differences are the size and composition of the training set, the quality of the data used to build the model, the relevance of the descriptors, the performances of the modeling algorithm, and the quality of the model validation process [17,19,20,21,22,23].
Different SAR and QSAR models have been proposed for discovering new potential repellents against mosquitoes (reviewed in Devillers [24]) but to our knowledge, no study has been carried out to estimate the impact of a molecular descriptor calculated with different QSPR models on the variability and reliability of the prediction results of the repellent activity of compounds.
Recently, Devillers et al. [25] have proposed a SAR model for predicting the skin repellent activity of molecules against the yellow fever mosquito (Aedes aegypti). The model was computed from a large set of 2171 molecules and among its constitutive parameters, the 1-octanol/water partition coefficient (log P) was found to be a key physicochemical descriptor. In this context, by using the same dataset and methodology, the goal of our study was to statistically investigate the impact of log P calculated according to two different methods on the prediction of repellent activity.

2. Materials and Methods

2.1. Mosquito Repellent Activity

The repellent activity of structurally diverse compounds against Aedes aegypti was measured in cage tests under controlled conditions. The test protocol is fully described in Knipling et al. [26] and summarized in Devillers et al. [25]. The endpoint was the protection time, defined as the time in minutes between the treatment with the tested molecule and the first mosquito bite. Chemicals with a protection time of less than 60 min, from 61 to 120 min, from 121 to 180 min, and more than 180 min were assigned to classes 1, 2, 3, and 4, respectively. To facilitate the search for structure–repellency relationships, compounds belonging to classes 2 and 3 were discarded and only classes 1 and 4 were selected for SAR modeling. Data cleaning was performed as previously described [25]. This led to a total of 2171 chemicals. Among them, 318 belonged to class 4 (repellents) and 1853 to class 1 (non-repellents). Supplementary Materials Table S1 lists the 2171 chemicals with their chemical name and experimental activity value [26]. For convenience, classes 1 and 4 were renamed classes 0 and 1, respectively. The dataset of 2171 molecules was randomly split into training sets and independent test sets of 80% and 20%, respectively.
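The class assignment and the random 80%/20% split described above can be sketched in a few lines of Python. This is only an illustration: the function names, the fixed seed, and the handling of a protection time of exactly 60 min (assigned here to class 1) are assumptions, and the split shown is a plain random split without stratification.

```python
import random


def protection_class(minutes):
    """Map a protection time (min) to classes 1-4 as described in Section 2.1.

    Assumption: a protection time of exactly 60 min falls in class 1.
    """
    if minutes <= 60:
        return 1
    if minutes <= 120:
        return 2
    if minutes <= 180:
        return 3
    return 4


def binary_label(minutes):
    """Return 0 (non-repellent), 1 (repellent), or None for discarded classes 2 and 3."""
    cls = protection_class(minutes)
    if cls == 1:
        return 0
    if cls == 4:
        return 1
    return None


def split_80_20(ids, seed=0):
    """Randomly split identifiers into (training, test) lists of about 80% and 20%."""
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_test = round(0.2 * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]


train_ids, test_ids = split_80_20(range(1, 2172))
```

With 2171 compounds, such a split gives 1737 training and 434 test molecules.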

2.2. Molecular Descriptors

Recently, the 2171 compounds and their repellent activity (0 or 1) listed in Table S1 were successfully modeled from a three-layer perceptron (TLP) [27] including 20 molecular descriptors as input neurons [25]. These descriptors encoded topological information (e.g., size, branching), the presence of specific atoms, functional groups, structural features, and physicochemical properties. The 1-octanol/water partition coefficient (log P), calculated according to the regression equation of Moriguchi et al. [28,29] (MlogP), belonged to the last category of the selected molecular descriptors. All these previous descriptors were considered in the current study. MHlogP based on the atom/fragment contribution method of Meylan and Howard [30] was also added to the set of descriptors. The reason for the choice of MHlogP was threefold. First, MHlogP was ranked among the first descriptors with MlogP during the feature selection process that led to the design of a classification model for predicting repellents for clothing application [31]. Second, the models allowing the calculation of the MHlogP and MlogP descriptors are based on different methodologies. Last, the atom/fragment contribution method of Meylan and Howard [30] is widely used to compute the 1-octanol/water partition coefficients of pesticides and chemicals of environmental concern [23,32,33,34,35]. In addition to the MlogP or MHlogP descriptor and the 19 other descriptors selected in Devillers et al. [25], 13 topological indices [36] and descriptors encoding specific functional groups and structural fragments in the molecules were also considered. These additional descriptors were selected on the basis of our previous results [25,31], with the goal of finding the same optimal set of descriptors for the classification models built with either MlogP or MHlogP.
The QSPR (Quantitative Structure–Property Relationship) models for the calculation of the MlogP and MHlogP descriptors, which are implemented in VEGA 1.1.4 (https://www.vegahub.eu/portfolio-item/vega-qsar/ (accessed on 10 January 2024)), were selected because both QSPR models give an estimate of the reliability of the predicted log P results. Indeed, the reliability of the predicted value is identified as low, moderate, or good depending on the degree of belonging of the predicted molecule to the applicability domain of the model. Similar compounds with predicted and experimental log P values are displayed and various indices are calculated. Consequently, the VEGA platform appeared particularly suited to performing this kind of comparative study. The MlogP and MHlogP descriptor values calculated for the 2171 molecules are listed in Table S2.
It is worth noting that for the calculation of all the other descriptors, the RDKit software (http://www.rdkit.org/) version 2021-03-4 was used.

2.3. Statistical Tools

A three-layer perceptron (TLP) [27] was used for classifying the 2171 chemicals into repellents or non-repellents. The input layer consisted of an optimal set of molecular descriptors. The hidden layer included an adjustable number of neurons that needed to be optimized to avoid overfitting problems. The output layer was of two neurons, one for each activity (0/1).
The sum of squares and the cross-entropy were tested as error functions. The activation functions on the hidden and output neurons were either identity, logistic, tanh, or exponential. It is worth noting that with the cross-entropy error function, the output activation function was always set to softmax. This restriction ensures that the TLP outputs can be interpreted as class membership probabilities, which is known to increase the performance of classification neural networks. All the calculations were performed with the data mining module of the Statistica™ software version 12 (StatSoft, Paris, France). The TLP of this software package includes two biases. Although different algorithms are available for training a TLP, the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm was selected because this second-order optimization algorithm generally converges quickly. Lastly, it is important to note that the molecular descriptors (input neurons) were scaled using a linear transformation such that the original minimum and maximum values of every descriptor were mapped to 0 and 1, respectively.
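To make the scaling and the softmax/cross-entropy pairing concrete, a minimal Python sketch is given below. It does not reproduce the Statistica implementation; the function names and the treatment of a constant descriptor (mapped to 0) are assumptions made for illustration.

```python
import math


def min_max_scale(values):
    """Linearly map a descriptor so that its minimum becomes 0 and its maximum 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant descriptor carries no information and is set to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def softmax(z):
    """Turn the two output-neuron activations into values that sum to 1."""
    m = max(z)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_class):
    """Cross-entropy error for one molecule whose experimental class is true_class (0 or 1)."""
    return -math.log(probs[true_class])


# The extreme MlogP values reported in the Results are mapped to 0 and 1.
scaled = min_max_scale([-2.07, 3.0, 8.72])
probs = softmax([1.0, 3.0])          # hypothetical output activations
error = cross_entropy(probs, true_class=1)
```

The error is small when the probability assigned to the experimental class is close to 1 and grows without bound as that probability approaches 0.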

2.4. Performance Evaluation Metrics

The performances of the classification models were estimated from the calculation of the sensitivity, specificity, and accuracy parameters. The F1 score, the Matthews correlation coefficient (MCC), G-mean, and dominance were also calculated. Lastly, the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) were computed from PRROC package version 1.3.1 (https://cran.r-project.org/). The full definitions of the above classical evaluation parameters for classification problems and the criteria for their interpretation have been thoroughly described in the literature and they are also clearly explained in Devillers et al. [31].

3. Results

Although the coefficient of determination (r²) between the log P values calculated with both QSPR models for the entire set of 2171 molecules equals 0.89, their choice for the comparison exercise was justified. Indeed, both QSPR models behave differently according to the levels of hydrophobicity of the chemicals that are predicted. The MlogP values range from −2.07 to 8.72, while the lowest and highest MHlogP values equal −4.76 and 16.11, respectively (Table S2). Among the 2171 molecules of the dataset, there was an experimental log P value for 155 of them in VEGA 1.1.4. A scatterplot of these 155 experimental log P values versus the corresponding MlogP and MHlogP values is shown in Figure 1.
Even if the comparison is made from a dataset of limited size, inspection of Figure 1 reveals that the more hydrophobic the molecules are, the more the MlogP and MHlogP values diverge. A divergence also occurs for low log P values (Figure 1).
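A high r² does not prevent two series from diverging at the extremes, which is the point made above. The short Python sketch below illustrates this with hypothetical values, assuming that r² is computed as the squared Pearson correlation coefficient (the source does not state how it was obtained).

```python
def r_squared(x, y):
    """Squared Pearson correlation between two series of equal length."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


# Hypothetical log P values: the second series is exactly twice the first,
# so the correlation is perfect while the gap grows with hydrophobicity.
series_a = [1.0, 2.0, 3.0, 4.0]
series_b = [2.0, 4.0, 6.0, 8.0]

r2 = r_squared(series_a, series_b)
largest_gap = max(abs(a - b) for a, b in zip(series_a, series_b))
```

In this toy case the two series are perfectly correlated, yet the most hydrophobic compound differs by four log units.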
The selection of the best set of molecular descriptors for discriminating the repellents from the non-repellents was made by using the Feature Selection option of the data mining module of the Statistica™ software version 12 (StatSoft, Paris, France) and by considering the results obtained in our recent study [25]. The descriptor selection had to respond to two constraints. First, it was necessary to find a unique set of descriptors giving the best possible results with both MlogP and MHlogP descriptors without penalizing one type of classification model compared to the other. Second, the aim of the study being to estimate the potential influence of log P calculated in two different ways on the activity predictions, it was necessary to reduce as much as possible the number of other descriptors to include as input neurons in the TLP models. This was intended to limit the number of possible co-factors that would intervene in the interpretation of the models. Conveniently, the two categories of models derived with MlogP and MHlogP descriptors were named ML and MH, respectively.
On the basis of numerous trials carried out by randomly selecting different training sets and independent test sets of 80%/20% we showed that the input neurons needed to include physicochemical descriptors and topological indices as well as descriptors encoding some structural features and functional groups of molecules. Thus, TLP models with 14 to 22 input neurons (including either MlogP or MHlogP) and 5 to 8 hidden neurons were tested. The cross-entropy with tanh as the hidden activation function and softmax as the output activation function often led to much better results than the sum of squares with identity, logistic, tanh, or exponential as activation functions. Thus, on the basis of hundreds of runs, we found that the most interesting models were 17/7/2 TLPs with the cross-entropy as the error function and tanh and softmax as hidden and output activation functions, respectively. The 17 input neurons included three physicochemical properties, namely the molecular weight (MolWt), the polar surface area of the molecule (TPSA), and the 1-octanol/water partition coefficient MlogP or MHlogP. There were five topological indices [36] viz. the Balaban J index (BalabanJ) [37], which is suited to accounting for small structural changes between similar structures, the Bertz index (BertzCT) [38], which quantifies the complexity of molecules, the molecular connectivity index of first order of valence (Chi1v) [39], which encodes mostly the size of the molecules but also their degree of branching, and the α value (HallKierAlpha) and the Kappa index of order 3 (Kappa3) [36,40], which describe the shape and flexibility of the molecules. Lastly, the input layer included nine descriptors accounting for the presence of specific atoms, functional groups, and structural features in the molecules. 
These descriptors were the number of aliphatic rings (NumAliphaticRings), the number of aromatic rings (NumAromaticRings), the number of aliphatic hydroxyl groups (fr_Al_OH), the number of carbonyl oxygens (fr_C_O), the number of aryl methyl sites for hydroxylation (fr_aryl_methyl), the number of esters (fr_ester), the number of ether oxygens, including phenoxy (fr_ether), the number of ketones (fr_ketone), and lastly, the number of methoxy groups (fr_methoxy).
It is worth noting that MolWt, BalabanJ, BertzCT, TPSA, HallKierAlpha, fr_Al_OH, fr_C_O, fr_ester, fr_ether, fr_ketone, and fr_methoxy also belonged to the input layer of our previously selected 20/6/2 TLP model [25].
Among the different 17/7/2 TLPs, the ML1 and MH1 models were selected as representative of the TLPs computed with MlogP or MHlogP descriptors, respectively. Convergence was obtained in 122 cycles with ML1 and in 89 cycles with MH1. The performance criterion values calculated for the training, test, and whole sets of molecules of the selected ML1 and MH1 models are given in Table 1 and Table 2, respectively. The calculated activity values obtained with both models are listed in Table S1. Inspection of Table 1 and Table 2 shows that both models allow us to obtain about 92% and 90% of good predictions on the training and test sets, respectively. No overfitting can be suspected.
The Matthews correlation coefficient (MCC) is widely used as a measure of the quality of binary classifications. It lies on a scale from −1.0 to 1.0, reached in case of perfect misclassification and perfect classification, respectively, while MCC = 0 corresponds to a random prediction [41]. Table 1 and Table 2 reveal that the MCC values of the ML1 and MH1 models equal 0.733 and 0.742, respectively. These values and those obtained on the corresponding training and test sets are considered to be fully acceptable for real-world data [42,43,44,45,46].
The F1 scores are slightly better than the MCC values (Table 1 and Table 2). F1 ranges in the interval [0, 1], where the minimum is reached when all the positive samples are misclassified, and the maximum is obtained in case of perfect classification [47]. The very high Area Under the Curve (AUC) values (Table 1 and Table 2) reveal the ability of the ML1 and MH1 models to make the distinction between the two classes of activity. In addition, the G-mean and dominance values clearly reveal that both models cope well with the studied unbalanced dataset. Indeed, the Geometric Mean (G-mean) measures the balance between classification performances on both classes of different sizes. G-mean is the square root of the product of sensitivity (true positive rate) and specificity (true negative rate). A low G-mean value provides an indication of poor performance in the binary classification of the positive cases, even if the negative cases are correctly classified as such [48,49]. The dominance varies from −1 to 1. The closer the value is to one of the bounds, the more one class is accurately predicted to the detriment of the other. As a result, the closer the dominance is to 0, the better the model is able to correctly predict both unbalanced classes of activities [48,50]. With G-mean values equal to about 0.9 and dominance values near 0, the ML1 and MH1 models appear to be able to equally predict the repellents and non-repellents. In other words, the unbalanced dataset does not influence the prediction performances of the selected models.
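For reference, the evaluation parameters discussed above can be written in terms of the numbers of true positives (TP), false negatives (FN), true negatives (TN), and false positives (FP), repellents being the positive class. The expressions below are the usual textbook definitions. The form given for the dominance (true positive rate minus true negative rate) is an assumption consistent with the range and interpretation described above, since the source does not spell out the formula.

```latex
\begin{align}
\text{Sensitivity} &= \frac{TP}{TP+FN}, \qquad
\text{Specificity} = \frac{TN}{TN+FP},\\
\text{Accuracy} &= \frac{TP+TN}{TP+TN+FP+FN},\\
F_1 &= \frac{2\,TP}{2\,TP+FP+FN},\\
\text{MCC} &= \frac{TP\cdot TN-FP\cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}},\\
\text{G-mean} &= \sqrt{\text{Sensitivity}\times\text{Specificity}},\\
\text{Dominance} &= \text{Sensitivity}-\text{Specificity}.
\end{align}
```

For example, a model with sensitivity and specificity both equal to 0.9 has a G-mean of 0.9 and a dominance of 0.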
Inspection of Table S1 reveals that 256 compounds (11.79%) have their repellent activity wrongly predicted by at least one model. This number drops to 110 (5.07%) when both models incorrectly allocate the chemicals to their class of activity. Among them, five are repellents predicted as non-repellents. Three belong to the training set and two to the test set (Table S1). Among the 105 non-repellents predicted as repellents by both models, 19 are test set members.
There are 78 molecules having their repellent activity wrongly predicted only by the ML1 model. Among them, 24 belong to the test set (i.e., three repellents and twenty-one non-repellents) and the others are training set compounds (i.e., eight repellents and forty-six non-repellents). On the other hand, 68 molecules are incorrectly classified only by the MH1 model. Among them, 22 are test set molecules (i.e., four repellents and eighteen non-repellents). The remaining compounds belong to the training set, including nine repellents and thirty-seven non-repellents (Table S1). Thus, the MH1 model shows a small advantage over the ML1 model, with slightly fewer wrong predictions.
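The counts given above are sufficient to rebuild the whole-set confusion matrices of both models. The Python sketch below does so under two assumptions: the repellent and non-repellent labels in these counts refer to the experimental classes, and the dominance is taken as sensitivity minus specificity. The variable and function names are illustrative.

```python
import math

N_REPELLENTS, N_NON_REPELLENTS = 318, 1853

# (repellents, non-repellents) wrongly predicted, training + test, from the text
BOTH = (5, 105)
ML1_ONLY = (3 + 8, 21 + 46)
MH1_ONLY = (4 + 9, 18 + 37)


def confusion(only):
    """Whole-set confusion matrix of one model (repellents = positive class)."""
    fn = BOTH[0] + only[0]  # repellents predicted as non-repellents
    fp = BOTH[1] + only[1]  # non-repellents predicted as repellents
    return {"tp": N_REPELLENTS - fn, "fn": fn, "tn": N_NON_REPELLENTS - fp, "fp": fp}


def metrics(tp, fn, tn, fp):
    """Classical evaluation parameters computed from a confusion matrix."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "mcc": (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        "g_mean": math.sqrt(sensitivity * specificity),
        "dominance": sensitivity - specificity,  # assumed definition
    }


ml1 = metrics(**confusion(ML1_ONLY))
mh1 = metrics(**confusion(MH1_ONLY))
wrong_by_at_least_one = sum(BOTH) + sum(ML1_ONLY) + sum(MH1_ONLY)
```

Under these assumptions, the 256 compounds wrongly predicted by at least one model are recovered, and the MCC values round to 0.733 for ML1 and 0.742 for MH1, in agreement with the values quoted for the two models.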
Whatever the differences in behavior between the ML1 and MH1 models, inspection of Table 1, Table 2 and Table S1 shows that both models have good prediction performances and behave similarly for many chemicals. This conclusion is reinforced when the models are used for predicting the repellent activity of the main commercial repellents [51,52]. Indeed, it is satisfying to note that the ML1 and MH1 models correctly allocate DEET, picaridin or icaridin (sec-butyl 2-(2-hydroxyethyl)piperidine-1-carboxylate), IR3535 (ethyl 3-(N-butylacetamido)propanoate), PMD (p-menthane-3,8-diol), AI3-35765 (1-(3-cyclohexen-1-ylcarbonyl)piperidine), and AI3-37220 (1-(3-cyclohexen-1-ylcarbonyl)-2-methylpiperidine) to class 1.
However, it is obvious that the results obtained with the ML1 and MH1 models are not enough to correctly evaluate the potential influence of the log P calculation methods on the prediction of the repellent activity of chemicals. Indeed, while the results obtained with the ML1 and MH1 models allow us to validate the architectures of the TLP models and their good prediction performances, repetitions are required to statistically secure the results obtained with both models.
Consequently, 92 different training sets and independent test sets of 80%/20% were randomly constituted and used to compute (MlogP or MHlogP + 16)/7/2 TLP models with the BFGS training algorithm, the cross-entropy as the error function, and tanh and softmax as hidden and output activation functions, respectively. Each model was selected on the basis of its prediction performances from batches of 500 to 1000 models obtained under the same conditions. To secure the prediction diversity of the chemicals in both activity classes, sometimes more than one modeling result was selected. This led to the constitution of a data matrix of 135 modeling results for each category of models, namely those including MlogP or MHlogP among the input neurons (Table S2).
Among the 135 different ML models including the MlogP descriptor, compound #430 (N-butyl imide 1,2-dicarboxy-3,6-endomethylene-4 cyclohexene) was only allocated twice to a test set, having the lowest occurrence as a test set member. At the opposite end, 2(2(2,4,5,6-tetrachlorophenoxy)ethoxy) ethyl chloride (#2041) was allocated 52 times to a test set and it is the compound with the highest occurrence. Compound #430 was also found with the lowest occurrence as a test set member of the 135 MH models including the MHlogP descriptor, but it was allocated seven times to a test set. On the other hand, chloroacetic acid, cyclohexyl ester (#620), was found 54 times as a test set member (Table S2).
To better highlight the differences between both sets of 135 modeling results, the percentages of good predictions (GPs) obtained with the ML versus MH models for the training and test sets are represented in Figure 2 and Figure 3, respectively.
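The percentage of good predictions of a compound can be obtained by counting, separately for its training set and test set occurrences, how often a category of models allocates it to its experimental class. The Python sketch below shows one possible way to do this. The data layout, the compound identifiers, and the toy predictions are hypothetical and do not come from Table S2.

```python
from collections import defaultdict

SUBSETS = ("train", "test")


def good_prediction_rates(runs, true_class):
    """Percentage of good predictions per compound, for training and test occurrences.

    runs: list of dicts {"train": {compound: predicted class}, "test": {...}},
          one dict per modeling result.
    true_class: dict {compound: experimental class (0 or 1)}.
    A rate is None when the compound never appeared in that kind of set.
    """
    counts = defaultdict(lambda: {s: 0 for s in SUBSETS})
    hits = defaultdict(lambda: {s: 0 for s in SUBSETS})
    for run in runs:
        for subset in SUBSETS:
            for compound, predicted in run[subset].items():
                counts[compound][subset] += 1
                hits[compound][subset] += int(predicted == true_class[compound])
    return {
        compound: {
            s: 100.0 * hits[compound][s] / counts[compound][s] if counts[compound][s] else None
            for s in SUBSETS
        }
        for compound in counts
    }


# Hypothetical example with two compounds and three modeling results.
true_class = {"A": 1, "B": 0}
runs = [
    {"train": {"A": 1, "B": 0}, "test": {}},
    {"train": {"A": 0}, "test": {"B": 0}},
    {"train": {"B": 1}, "test": {"A": 0}},
]
rates = good_prediction_rates(runs, true_class)
```

Plotting the rates obtained with one category of models against those obtained with the other, separately for training and test occurrences, gives scatterplots of the kind shown in Figure 2 and Figure 3.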
There is an important concentration of points in the upper right corner of Figure 2, revealing that many non-repellents (in grey) and repellents (in red) are always correctly allocated to their activity class by the two categories of models regardless of the composition of the training and test sets. More precisely, inspection of Table S2 shows that 66 repellents (20.75%) and 967 non-repellents (52.19%) are always correctly predicted by both the ML and MH models, whether they belong to the training or test sets. Among the former group of repellents, ten (15.15%) show a difference between their MlogP and MHlogP values ≥ 0.7 (absolute value, av), and among them, six have a difference ≥ 1 (av) (Table S2). Thus, chemicals #753, #754, #1288, #1289, #1293, and #1298 are repellents with differences between their MlogP and MHlogP values ranging from 1 to 1.22 (av). In the same way, there are also repellents that are always allocated to class 1 by the ML and MH models and which show no significant differences between their MlogP and MHlogP values. This is the case, for example, of chemicals #913, #1026, #1450, #1451, #1487, #1499, and #1583, which present a difference between their MlogP and MHlogP values < 0.1 (av).
Among the 967 non-repellents always allocated to class 0 by the ML and MH models, 650 (67.22%) show a difference between their MlogP and MHlogP values ≥ 0.7 (av) and 513 (53.05%) present a difference ≥ 1 (av). The highest difference is found for malic acid, dioleyl ester (#1392), having MlogP and MHlogP values equal to 8.1 and 15.13, respectively (Table S2). In contrast, there are also non-repellents with similar MlogP and MHlogP values. This is the case, for example, of compounds #256, #836, #1861, and #1914, for which the differences between their MlogP and MHlogP values are ≤ 0.03 (av).
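The thresholds of 0.7 and 1 log unit used above amount to a simple count on the absolute differences between the two descriptors. As an illustration, the Python sketch below applies them to the ten compounds whose MlogP and MHlogP values are quoted in this Results section; the complete list is in Table S2, and the function names are illustrative.

```python
# Compound number: (MlogP, MHlogP), values quoted in the Results section
LOGP = {
    1392: (8.10, 15.13),
    1976: (2.00, 4.23),
    1985: (1.78, 3.87),
    1990: (2.09, 4.29),
    1987: (3.55, 7.39),
    1982: (2.38, 4.85),
    1088: (2.13, 2.74),
    833: (3.39, 3.48),
    2128: (2.58, 3.13),
    1278: (3.33, 3.22),
}


def abs_difference(pair):
    """Absolute difference between the MlogP and MHlogP values of one compound."""
    mlogp, mhlogp = pair
    return abs(mlogp - mhlogp)


def count_at_least(pairs, threshold):
    """Number of compounds whose absolute log P difference reaches the threshold."""
    return sum(1 for pair in pairs.values() if abs_difference(pair) >= threshold)


n_07 = count_at_least(LOGP, 0.7)
n_10 = count_at_least(LOGP, 1.0)
largest = max(LOGP, key=lambda number: abs_difference(LOGP[number]))
```

Among these ten compounds, six differ by more than two log units and four by less than 0.7, the largest gap being that of compound #1392.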
Inspection of the lower left corner of Figure 2 shows that the number of molecules having their repellent activity very poorly predicted by the ML and MH models is low. These two poles also exist for the test set molecules (Figure 3). However, more often, we observe different behaviors, as shown in the top right corner of Figure 3, with molecules whose repellent activity is always well predicted by one type of model, while for the other one, the percentage of good predictions can be as low as 60% or even less. Inspection of the bottom left corner of Figure 3 shows that the situation is broadly the same regarding the molecules whose repellent activity is poorly predicted by both categories of models. More generally, the comparison of Figure 2 and Figure 3 shows that the cloud of points is less dispersed in the former than the latter. This is due not only to the learning process used by the three-layer perceptron, but also to the great structural diversity of the 2171 molecules. In any case, the most important molecules to study are those whose repellent activities are always or very often correctly predicted by one category of model but not, or only poorly, by the other.
Figure 2 clearly shows that the largest differences in predictions between the ML and MH models concern non-repellent molecules. Thus, for example, the training set molecules #1976 (salicylic acid, allyl ester), #1985 (salicylic acid, ethyl ester), and #1990 (salicylic acid, iso-propyl ester), located in the bottom right of Figure 2, are non-repellents always correctly predicted as such by the MH models, while their activity is correctly predicted in only 25%, 17.44%, and 19.81% of cases by the ML models, respectively (Table S2). The situation is broadly the same when these molecules belong to the test sets of the models (Figure 3, Table S2). There are significant differences in the MlogP and MHlogP descriptor values for these three compounds. Indeed, the MlogP vs. MHlogP values of chemicals #1976, #1985, and #1990 equal 2.00 vs. 4.23, 1.78 vs. 3.87, and 2.09 vs. 4.29, respectively (Table S2). It is interesting to note that an experimental log P value of 2.95 is given for chemical #1985 in VEGA 1.1.4 and the reliability of the MlogP and MHlogP values for chemical #1976 was estimated as low and moderate, respectively. The same difference of ranking was also provided for chemical #1990. Nevertheless, these differences in log P values cannot explain the behavior of the three molecules with the two types of models. Indeed, for example, salicylic acid, menthyl ester (#1987), is a non-repellent always predicted as such by both types of models, whether it belongs to the training or test sets, while its MlogP value equals 3.55 and its MHlogP value is 7.39 (Table S1 and Table S2).
The MH models always correctly predict the non-repellent activity of salicylic acid, n-butyl ester (#1982), as a member of the different training and test sets (located vertically to the 100% unit of the abscissa axis in Figure 2 and Figure 3). In contrast, only 67.96% and 53.12% of good predictions are obtained with the ML models, respectively (Table S2, Figure 2 and Figure 3). Well predicted by the ML1 and MH1 models as a training set member, the MlogP, MHlogP, and experimental log P values of chemical #1982 equal 2.38, 4.85 (Table S2), and 4.63, respectively. The five chemicals located just below chemical #1982 in Figure 2 also have their activity better predicted by the MH models than the ML models. Thus, the non-repellents #1088 (p-ethoxypropiophenone), #955 (5,9-dimethyldecanoic acid), #2123 (2,4,6-trimethyl-1,2,5,6-tetrahydrobenzaldehyde oxime), #1058 (dodecylenic acid), and #1282 (o-hydroxy-iso-butyrophenone) are predicted as such by the MH models from about 92% to 97% as training set members and from 85% to 100% when they belong to the test sets (Table S2). With the ML models, the ranges are from about 53% to 60% and from 21% to 43% when they are training and test set members, respectively (Table S2). The MH1 model also outperforms the ML1 model regarding these five non-repellents. The MHlogP values of these chemicals exceed the corresponding MlogP values by at least one log unit, except for chemical #1088, for which the MlogP value equals 2.13, while the MHlogP value is 2.74 (Table S2), a good reliability being attributed to the former predicted log P and a moderate one to the latter.
In the same way, repellent #833 (dibenzyl ether), located near the middle of the right-hand side of Figure 2, is correctly allocated to class 1 as a training set member in 76.19% and 38.58% of cases by the MH and ML models, respectively. When it belongs to the test sets, it is well classified in 11.11% of cases by the MH models and 0% by the ML models (Table S2). It is noteworthy that the ML1 and MH1 models correctly predict chemical #833 as a training set member (Table S1). Lastly, its MlogP, MHlogP, and experimental log P values equal 3.39, 3.48 (Table S2), and 3.31, respectively.
Chemical #520 (n-butyric acid, 1,2,3,4-tetrahydro-2-naphthyl ester), located to the left of chemical #833 in Figure 2, is a repellent predicted as such in 63.25% of cases by the MH models when it belongs to the training sets and in 32.20% of cases by the ML models under the same conditions (Table S2). Also, as a training set member, its repellent activity is correctly predicted by the MH1 model but not by the ML1 model (Table S1). As a test set member, chemical #520 is never correctly predicted by either category of models. In contrast, the non-repellents #1111 (2-ethylbutyric acid, 1,2,3,4-tetrahydro-2-naphthyl ester) and #2153 (iso-valeric acid, 1,2,3,4-tetrahydro-2-naphthyl ester) are very well predicted by the MH, ML, MH1, and ML1 models, whether they belong to the training or test sets (Table S1 and Table S2).
Repellent #2128 (undecalactone), located just below chemical #520 in Figure 2, has its activity correctly predicted in 66.67% of cases when it belongs to the training sets of the MH models and in only 28.04% of cases by the ML models under the same conditions (Figure 2, Table S2). As a test set member, its activity is never correctly predicted by the ML models and only in 39.39% of cases by the MH models (Table S2). It is therefore located near the abscissa of Figure 3. While the molecule is in the training set of the ML1 and MH1 models, only the latter correctly allocates it in class 1 (Table S1). No large difference is found between the MlogP and MHlogP values that are equal to 2.58 and 3.13, respectively (Table S2). In addition, the reliability of both calculated log P values was ranked as moderate.
Figure 2 and Figure 3 show that there are also molecules for which their repellent activity is better predicted by the ML models than the MH models.
Thus, in the top left part of Figure 2, the non-repellent activity of molecule #1278 (alpha-hydroxy-iso-butyric acid, octyl ester) is correctly predicted by the ML models in 91.38% of cases when it belongs to the training sets. Under the same conditions, only 20% of the MH models correctly predict its activity (Table S2). It is in agreement with the ML1 and MH1 predictions (Table S1). As a test set member (Figure 3), the activity of compound #1278 is appropriately allocated to class 0 in only 52.63% and 20% of cases by the ML and MH models, respectively (Table S2, Figure 3). The MlogP value of chemical #1278 equals 3.33, broadly the same as its MHlogP value, which is 3.22 (Table S2). Inspection of Table S1 and Table S2 shows that there are eight other alpha-hydroxy-iso-butyric acid esters (#1272 to #1277, #1279, #1280). Most of them have their repellency correctly predicted by both categories of models, whatever their class of activity. The unique exception is the non-repellent #1277, which is very poorly predicted as a training set member and never allocated to class 0 when it belongs to the test sets (Table S2).
The non-repellent activity of molecule #537 (capraldehyde glyceryl acetal), which is also located in the upper left part of Figure 2, is correctly allocated to class 0 in 72.55% of cases by the ML models and in only 9.09% of cases by the MH models when it belongs to the training sets (Table S2). The molecule also belongs to the training set of the ML1 and MH1 models, and its activity is only correctly predicted by the former model (Table S1). As a test set compound, its non-repellent activity is correctly predicted in 54.55% and 13.89% of cases by the ML and MH models, respectively (Table S2, Figure 3). The MlogP and MHlogP values of chemical #537 are equal to 2.95 and 2.99, respectively (Table S2). Both values have a low reliability according to VEGA 1.1.4. In contrast, Table S1 and Table S2 show that chemical #536 (capraldehyde diethyl acetal) is always correctly predicted as a non-repellent, whatever the model category and whether it belongs to the training or test sets, while its MlogP value is 4.44 (low reliability) and its MHlogP value is 5.13 (moderate reliability).
Chemical #462 (1-n-butyl-1,2,3,4-tetrahydro-2-naphthol) (to the right of chemical #537 in Figure 2) is a non-repellent correctly classified by the ML and MH models in 69.23% and 35.24% of cases as a training set member and in 19.35% and 13.33% of cases as a test set member, respectively (Table S2). The chemical belongs to the training set of the MH1 and ML1 models and is incorrectly modeled by both (Table S1). In contrast, 100% of good predictions are obtained with the MH, MH1, ML, and ML1 models in the prediction of the activities of repellents #1163 (1-ethyl-1,2,3,4-tetrahydro-2-naphthol) and #1583 (1-methyl-1,2,3,4-tetrahydro-2-naphthol), whether they are located in the training or test sets (Table S2). It is interesting to note that the differences between the MlogP and MHlogP values of chemicals #462, #1163, and #1583 equal −0.66, −0.21, and 0, respectively (Table S2).
The non-repellents #1563 (2-methyl-5-phenyl-2-propyl-1,3-dioxolan-4-one) and #1949 (iso-propyl benzyl ketone), located in the top middle part of Figure 2, are also more correctly classified by the TLPs including MlogP as an input neuron than those with MHlogP. Indeed, the former compound is correctly assigned to class 0 in 84.17% and 44.07% of cases when it belongs to the training sets of the ML and MH models, respectively (Table S2). As a test set member, the difference is higher, namely 93.33% and 35.29%, respectively (Table S2). Regarding chemical #1949, 95.1% versus 50.96% of good predictions were obtained as a training set member and 75.76% versus 67.74% as a test set member with the ML and MH models, respectively (Table S2). It is noteworthy that both chemicals have their activity correctly predicted only by the ML1 model. However, the differences between the MlogP and MHlogP values of each chemical are not important (Table S2).
The same phenomenon, with significantly better results obtained with the ML models than with the MH models, can also be observed with repellents. Thus, chemical #1259 (o-n-hexyloxybenzaldehyde), located below compound #1949 in Figure 2, is a repellent predicted as such in 88.78% of cases by the ML models when it belongs to the training sets, while it is allocated to class 1 in only 55.88% of cases by the MH models. The molecule is detected as a repellent in 59.46% and 48.48% of cases when it belongs to the test sets of the ML and MH models, respectively (Table S2). It is correctly allocated to class 1 by the ML1 and MH1 models as a test set member (Table S1). Interestingly, there is a large difference between the MlogP and MHlogP values, which are equal to 2.69 and 4.25, respectively (Table S2), with a low reliability for the former value and a moderate one for the latter.
Repellent #1465 (4-methoxyvalerophenone), located just below chemical #462 in Figure 2, is predicted as such in 66.97% and 34.58% of cases when it belongs to the training sets of the ML and MH models, respectively. As a test set member, only 3.85% and 0% of correct allocations to class 1 are obtained, respectively (Table S2). The chemical belongs to the training set of the ML1 and MH1 models and it is only correctly predicted by the former model (Table S1). Interestingly, the MlogP and MHlogP values equal 2.41 and 3.23 (Table S2), with a good reliability for the former calculated log P value and low for the latter.
The situation is broadly the same regarding repellent #1618 (5-nonanone oxime), located near chemical #1465 in Figure 2. The ML models correctly allocate this repellent to class 1 in 64.10% of cases and the MH models in 32.26% of cases when it belongs to the training sets. As a test set member, the percentages of good predictions become 44.44% and 0%, respectively (Table S2). This explains the position of chemical #1618 in the left part of Figure 3 near the y-axis. However, even if it is included in the training set of the ML1 and MH1 models, chemical #1618 is predicted as a non-repellent by them (Table S1). The MlogP versus MHlogP values for that compound equal 2.38 versus 4.15 (Table S2), with a low reliability for both calculated log P values.
It is interesting to note that repellent #996 (2,6-dimethyl-1,2,3,6-tetrahydrobenzaldehyde oxime) shows broadly the same behavior as chemical #1618. As a member of the test sets of the ML and MH models, it is correctly assigned to class 1 in 80% and 20% of cases, respectively (Table S2). This explains the location of chemical #996 in the top left corner of Figure 3. When it belongs to the training sets, 95.45% and 86.67% of good predictions are reached, respectively (Table S2). This is in agreement with the correct predictions of the ML1 and MH1 models. Its MlogP value is 1.88 with a low reliability, while its MHlogP equals 3.62 with a moderate reliability (Table S2).
In contrast, the non-repellent #2123 (2,4,6-trimethyl-1,2,5,6-tetrahydrobenzaldehyde oxime) appears better allocated to class 0 by the MH models than the ML models. Indeed, 91.67% of good predictions are obtained with the MH models versus 59% with the ML models when the chemical belongs to the training sets. As a test set member, 84.62% and 42.86% of correct attributions to class 0 are obtained with the MH and ML models, respectively (Table S2). Note that the ML1 and MH1 models correctly predict the non-repellent activity of the chemical as a training set member (Table S1). The MlogP value of chemical #2123 equals 2.18 with a low reliability, while its MHlogP is 4.17 with a moderate reliability (Table S2). It is noteworthy that in the dataset, there are other molecules including a C=N-OH functional group which are correctly predicted by all the models. Indeed, the non-repellents #475, #476, and #1051 as well as the repellent #1553 have their activity very well predicted by the ML1, MH1, ML, and MH models (Table S1 and Table S2).
It is also worth stressing that the two categories of models behave differently in the prediction of the activity of chemical #500 (iso-butyric acid, 2-ethyl-2-hexenal oxime ester). The ML models correctly predict this compound as a non-repellent in 85% of cases when it belongs to the training sets, while the correct predictions total 54.29% when the chemical is included in the test sets. The results are worse with the MH models, with 55% of good allocations to class 0 in the former situation and 11.43% in the latter (Table S2). Lastly, the MlogP value of chemical #500 equals 3.03 and its MHlogP value is 2.02 (Table S2), both log P predictions having a low reliability.
Recently [25], from the same dataset, we showed that the MlogP values of the repellents were distributed over a narrower range than those of the non-repellents. These latter compounds included the greatest log P values [25]. Consequently, the percentages of good predictions obtained with the ML and MH models were plotted versus the MlogP and MHlogP values of the chemicals, as shown in Figure 4 and Figure 5, respectively.
Both figures clearly reveal that the distribution of MHlogP values is much wider than that of MlogP, and that there are non-repellents that cover the whole range of values while being correctly allocated to class 0 by the models. In contrast, the distributions of the calculated log P values of the repellents are much less widespread than those of the non-repellents, even if the dispersion of MHlogP (Figure 5) values is greater than that of the MlogP values (Figure 4). Chemical #2169 (9-hendecenoic acid, 1-trichloro-2-methyl-2-propyl ester) is the repellent showing the highest calculated log P values in Figure 4 and Figure 5. Its MlogP and MHlogP values are equal to 4.67 and 7.3, respectively (Table S2). The calculated log P values have a low and moderate reliability, respectively. As a training set member, chemical #2169 is not correctly predicted by the ML1 and MH1 models (Table S1) and it is only allocated to class 1 in 25.66% of cases by the ML models and 19.47% by the MH models (Table S2). When it belongs to the test sets, chemical #2169 is correctly predicted as a repellent in only 4.55% of cases with both categories of models (Table S2).
Repellent #565 (caprylic anhydride) is also characterized by high MlogP and MHlogP values (Figure 4 and Figure 5) that are equal to 4.16 (low reliability) for the former log P descriptor and 6.57 (good reliability) for the latter (Table S2). The chemical is correctly allocated to class 1 by the ML and MH models as a training set member in only 36.94% and 25.45% of cases, respectively. As a test set member, its repellent activity is wrongly predicted by the ML1 and MH1 models (Table S1), while it is correctly allocated to class 1 in only 25% of cases by the ML models versus 8% by the MH models (Table S2).
Since a significant number of non-repellents have high calculated log P values and their activity is correctly predicted by the two categories of models, it is not surprising that repellent molecules with high log P values have their activity incorrectly predicted by the ML and MH models. However, this rough rule is not clear-cut. Indeed, there are repellents with average log P values that have their activity poorly predicted by the models. Thus, repellent #1880 (pivalic acid, 1,5-pentanediol diester) has its activity correctly predicted by the ML models in only 1.65% of cases when it belongs to the training sets and never as a test set member. The situation is the same with the MH models, namely 6.4% versus 0%, respectively (Table S2). Chemical #1880 is also wrongly predicted by the ML1 and MH1 models (Table S1). Its MlogP and MHlogP values are nonetheless equal to 3.11 and 4.60 (Table S2), with a low reliability for both calculated values. In the same way, chemical #1551 (4-methyl-2-pentenoic acid, ethyl ester) has MlogP and MHlogP values equal to 1.95 and 2.54 (Table S2), and the reliability of the two calculated log P values was estimated as moderate and good, respectively. However, it is only allocated to class 1 in 4.31% and 0% of cases as a training and test set member of the ML models, respectively (Table S2). The situation is nearly the same with the MH models: only 5.79% of correct classifications are obtained when chemical #1551 belongs to the training sets versus 0% when it is included in the test sets (Table S2).
Lastly, because Figure 1 shows that significant differences can exist between the highest or, on the contrary, the lowest values of MlogP and MHlogP for the compounds for which experimental log P values were available, a focus was placed on the corresponding results obtained with both categories of models for these chemicals. The large differences seem to have a limited effect on the prediction results. Thus, for example, acetylacetone (#136), cyclopropyl nitrile (#802), and N,N-dimethyl formamide (#960) are non-repellents always predicted as such by the ML and MH models, whatever their belonging to the training and test sets, while their experimental log P/MlogP/MHlogP values are equal to 0.4/0.08/−1.3, 0.68/0.26/−0.38, and −1.01/0.13/−0.51, respectively. In the same way, linoleic acid, methyl ester (#1380), octadecanoic acid, methyl ester (#1622) and phthalic acid, dicapril ester (#1839), to cite as illustrative examples, are non-repellents always allocated to class 0 by the ML and MH models, while their experimental log P/MlogP/MHlogP values equal 6.82/4.80/7.80, 8.35/5.92/8.23, and 8.10/6.34/8.54, respectively (Table S2).
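The size of these deviations can be checked directly from the values quoted above. The following minimal Python sketch recomputes, for the six chemicals cited, how far each calculated log P lies from the experimental value. The dictionary layout and the function name `deviations` are illustrative choices and are not part of the original study.

```python
# Experimental log P, MlogP, and MHlogP values quoted in the text (Table S2).
# Keys are the chemical numbers used in the article.
LOGP = {
    136: ("acetylacetone", 0.40, 0.08, -1.30),
    802: ("cyclopropyl nitrile", 0.68, 0.26, -0.38),
    960: ("N,N-dimethyl formamide", -1.01, 0.13, -0.51),
    1380: ("linoleic acid, methyl ester", 6.82, 4.80, 7.80),
    1622: ("octadecanoic acid, methyl ester", 8.35, 5.92, 8.23),
    1839: ("phthalic acid, dicapril ester", 8.10, 6.34, 8.54),
}


def deviations(exp, mlogp, mhlogp):
    """Return (MlogP - exp, MHlogP - exp), rounded to two decimals."""
    return round(mlogp - exp, 2), round(mhlogp - exp, 2)


for number, (name, exp, mlogp, mhlogp) in LOGP.items():
    d_m, d_mh = deviations(exp, mlogp, mhlogp)
    print(f"#{number} {name}: MlogP {d_m:+.2f}, MHlogP {d_mh:+.2f}")
```

For the three lipophilic esters, MlogP falls roughly 1.8 to 2.4 log units below the experimental value, whereas MHlogP stays within about one log unit. For acetylacetone the pattern is reversed, with MHlogP 1.7 log units below the experimental value. All six chemicals are nevertheless always allocated to class 0.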

4. Discussion

The 1-octanol/water partition coefficient (log P) of organic molecules is undoubtedly the most important physicochemical property for explaining many of their pharmacological and toxicological effects as well as their behavior in the environment. This explains why log P is a long-established molecular descriptor in SAR and QSAR modeling, leading to thousands of published models in which it is used alone or with other molecular descriptors [17,18,53]. Because the number of experimental log P values available is very small compared to the universe of molecular structures that exist and can be drawn and/or synthesized, various methods have been proposed to reduce the time and cost of obtaining this physicochemical parameter. Among them, the QSPR models for log P occupy a prominent place because their advantages far outweigh their disadvantages. As a result, there exist numerous QSPR models for log P constructed from datasets of various sizes and molecular diversity, with different molecular descriptors, and built with diverse statistical methods. This leads to models with different applicability domains. These QSPR models can be open-source or open-access tools or commercial software, with advantages and disadvantages inherent to their origin. Among these, the ability to estimate whether a prediction falls within the applicability domain of the model is of crucial importance.
In a previous study [25], we showed that log P calculated according to the regression equation of Moriguchi et al. [28,29] (MlogP) was a key descriptor for modeling the repellent activity of structurally diverse chemicals against Ae. aegypti. Consequently, it was tempting to perform a similar study with MHlogP, based on the atom/fragment contribution method of Meylan and Howard [30], to evaluate their respective influence on the prediction of the repellent activity of molecules. Based on two different methodologies, freely available, well documented, and with criteria to assess the reliability of their predictions, the MlogP and MHlogP models were well suited to the objectives of this study.
Thus, our results clearly reveal that the MlogP and MHlogP models can behave differently according to the structure and activity of the molecules. Table S2 shows that about 50% of the molecules are always well allocated to class 0 or 1 by both categories of models, regardless of whether they belong to the training or test sets. It is among the non-repellents that the highest differences between the MlogP and MHlogP values are found, without impacting the prediction results. Thus, for example, 100% of good predictions are always obtained for the non-repellent #1392 (malic acid, dioleyl ester), while the MlogP and MHlogP values of this molecule are equal to 8.10 and 15.13, respectively (Table S2). Nevertheless, Figure 2 and Figure 3 and Table S2 show that TLP models that only vary by one input neuron, namely MlogP or MHlogP, can statistically lead to very different output results. This raises the question of how to interpret correctly the results of a conventionally built SAR or QSAR model, because such a model always relies on a single QSPR model per physicochemical property of interest.
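To make this divergence concrete, the short Python sketch below takes the percentages of good predictions quoted in the Results for a few chemicals and labels each one according to which model category predicts it better. The averaging rule, the 20-point threshold, and the function name `favoured_model` are arbitrary illustrative assumptions; they are not criteria used in the study.

```python
# Percentages of good predictions quoted in the text (Table S2), ordered as
# (ML training, ML test, MH training, MH test).
PCT_GOOD = {
    1982: (67.96, 53.12, 100.0, 100.0),  # non-repellent
    833: (38.58, 0.0, 76.19, 11.11),     # repellent
    1278: (91.38, 52.63, 20.0, 20.0),    # non-repellent
    1563: (84.17, 93.33, 44.07, 35.29),  # non-repellent
    1392: (100.0, 100.0, 100.0, 100.0),  # non-repellent
}


def favoured_model(ml_train, ml_test, mh_train, mh_test, threshold=20.0):
    """Label a chemical by the model category that predicts it better.

    The gap is the mean of the training and test differences (ML minus MH),
    in percentage points. The threshold is a hypothetical cut-off chosen
    only for illustration.
    """
    gap = ((ml_train - mh_train) + (ml_test - mh_test)) / 2
    if gap >= threshold:
        return "ML"
    if gap <= -threshold:
        return "MH"
    return "similar"


for number, values in PCT_GOOD.items():
    print(f"#{number}: {favoured_model(*values)}")
```

With these inputs, chemicals #1982 and #833 are labeled as better predicted by the MH models, #1278 and #1563 by the ML models, and #1392 as equally well predicted, even though the networks differ only in the log P input.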
An attempt was made to estimate which calculated log P values best characterized repellents. To do so, only the molecules always allocated to class 1 by the ML or MH models, whether they were training or test set members, were considered. Thus, among the 318 repellents, 79 (24.84%) are always correctly classified as such by the ML models and their corresponding MlogP values range from 1.08 to 3.12 (Table S2). In the same way, 91 repellents (28.62%) are always correctly predicted by the MH models, with MHlogP values ranging from 0.94 to 4.02 (Table S2). Within these ranges of calculated log P values, 58 molecules (73.42%) have their MlogP value between 1.62 and 2.64 and 51 (56.04%) have their MHlogP value between 1.60 and 2.63. This information is interesting because it is in agreement with the experimental log P of the main repellents currently available on the market. Indeed, IR3535, picaridin (icaridin), and DEET have experimental log P values of 1.7 [54], 2.11 [55], and 2.4 [56], respectively. In addition, among the repellents always predicted as such by the MH models, 19 have their MHlogP ranging between 3 and 4, and one, n-capric acid, has a value of 4.02 (#538 in Table S2). In contrast, only three repellents have their MlogP values ≥ 3, namely 3 (#2085; 2-(m-tolyl)cyclohexanol (trans)), 3.12 (#1163; 1-ethyl-1,2,3,4-tetrahydro-2-naphthol), and 3.12 (#2058; beta-(1,2,3,4-tetrahydro-5-naphthyl) ethanol). More generally, although MHlogP, which is based on the atom/fragment contribution method of Meylan and Howard [30], leads to the calculation of descriptors that are more discriminative than MlogP, which is calculated according to the regression equation of Moriguchi et al. [28,29], it is more prone to giving unrealistic log P values for molecules including long aliphatic carbon chains. Fortunately, these molecules are non-repellents and their activity remains correctly predicted (Table S2).
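The percentages in this paragraph follow directly from the counts given, and the agreement with the marketed repellents can be verified in a few lines. The Python sketch below is only an arithmetic check; the label "core range" and the names `pct` and `in_range` are hypothetical conveniences introduced here, not terms from the study.

```python
# Counts and ranges quoted in the paragraph above (Table S2).
N_REPELLENTS = 318
ALWAYS_CORRECT = {"ML": 79, "MH": 91}    # repellents always allocated to class 1
IN_CORE_RANGE = {"ML": 58, "MH": 51}     # of those, inside the narrower range
CORE_RANGE = {"ML": (1.62, 2.64), "MH": (1.60, 2.63)}  # MlogP / MHlogP bounds

# Experimental log P values of marketed repellents cited in the text.
COMMERCIAL = {"IR3535": 1.7, "picaridin": 2.11, "DEET": 2.4}


def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100 * part / whole, 2)


def in_range(value, bounds):
    """True if value lies within the inclusive (low, high) bounds."""
    low, high = bounds
    return low <= value <= high


for model in ("ML", "MH"):
    share_all = pct(ALWAYS_CORRECT[model], N_REPELLENTS)
    share_core = pct(IN_CORE_RANGE[model], ALWAYS_CORRECT[model])
    inside = [name for name, logp in COMMERCIAL.items()
              if in_range(logp, CORE_RANGE[model])]
    print(model, share_all, share_core, inside)
```

The three experimental values fall inside both narrower ranges, which is the agreement the paragraph refers to.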
Analysis of the literature shows that lipophilicity is a physicochemical property that is found to be more or less important [14,24,57,58,59,60] for describing and modeling the biting deterrence or repellent activity of molecules against mosquitoes. This difference is easily explainable because these studies always involve a very small number of molecules showing specific structural or functional characteristics. Our current study and the previous one [25] based on a large and highly structurally diverse dataset clearly show that lipophilicity, encoded by log P, plays a key role in distinguishing the molecules that are repellents from those that do not repel mosquitoes. Our comparative study also demonstrates that molecules with a repellent activity are restricted to a range of log P values. It is noteworthy that Devillers et al. [31] also showed the prominent role of log P in the prediction of repellent molecules for clothing application. Obviously, it is important to stress that although log P appears important for modeling the repellent activity of molecules, it is not the only parameter involved.

5. Conclusions

The discovery of IR3535 and picaridin from in silico methods [8,61] has boosted the use of structure–activity models for identifying molecules that repel mosquitoes.
Recently [25], we showed that log P was a key descriptor for modeling the repellent activity of structurally diverse chemicals against Ae. aegypti. Using the same methodology and dataset of 2171 molecules, we have statistically evaluated the influence of the choice of MlogP [28,29] or MHlogP [30] on the prediction results, keeping the other descriptors unchanged. This impact clearly exists and must be taken into account when interpreting a SAR or QSAR model on repellents, when identifying a mechanism of action, and when selecting candidate molecules to be tested for their actual repellent activity. These results also apply to other physicochemical properties, such as vapor pressure and boiling point, that intervene in the effectiveness and protection time of repellents [62,63] and for which some QSAR models have been proposed [24,64,65,66].

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app14135366/s1, Table S1. Experimental (ACT) and predicted repellent activity with the ML1 and MH1 models. Table S2. Experimental repellent activity (ACT), percentages of good predictions obtained with the ML and MH models for the 2171 molecules and their MlogP and MHlogP values.

Author Contributions

Conceptualization, J.D.; methodology, J.D. and H.D.; data curation, J.D. and H.D.; software, J.D. and H.D.; formal analysis and validation, J.D. and H.D.; visualization, H.D.; writing—original draft preparation, J.D.; writing—review and editing, J.D. and H.D.; supervision, J.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research project was funded by the French National Research Program for Environmental and Occupational Health of Anses; PNREST 2019/1/041.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data used to generate the models are available in [26]. The other data can be found in the manuscript or the Supplementary Materials.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sato, S. Plasmodium—A brief introduction to the parasites causing human malaria and their basic biology. J. Physiol. Anthropol. 2021, 40, 1, Erratum in J. Physiol. Anthropol. 2021, 40, 3.
  2. WHO. World Malaria Report 2022; World Health Organization: Geneva, Switzerland, 2022. Available online: https://www.who.int/teams/global-malaria-programme/reports/world-malaria-report-2022 (accessed on 13 March 2024).
  3. Bhatt, S.; Gething, P.W.; Brady, O.J.; Messina, J.P.; Farlow, A.W.; Moyes, C.L.; Drake, J.M.; Brownstein, J.S.; Hoen, A.G.; Sankoh, O.; et al. The global distribution and burden of dengue. Nature 2013, 496, 504–507.
  4. Messina, J.P.; Brady, O.J.; Golding, N.; Kraemer, M.U.G.; Wint, G.R.W.; Ray, S.E.; Pigott, D.M.; Shearer, F.M.; Johnson, K.; Earl, L.; et al. The current and future global distribution and population at risk of dengue. Nat. Microbiol. 2019, 4, 1508–1515.
  5. World Health Organization. Vector-Borne Diseases. 2024. Available online: https://www.who.int/news-room/fact-sheets/detail/vector-borne-diseases (accessed on 15 March 2024).
  6. Fontenille, D.; Lagneau, C.; Lecollinet, S.; Lefait-Robin, R.; Setbon, M.; Tirel, B.; Yébakima, A. La Lutte Antivectorielle en France/Disease Vector Control in France; IRD Edition: Marseille, France, 2009.
  7. Onen, H.; Luzala, M.M.; Kigozi, S.; Sikumbili, R.M.; Muanga, C.-J.K.; Zola, E.N.; Wendji, S.N.; Buya, A.B.; Balciunaitiene, A.; Viškelis, J.; et al. Mosquito-borne diseases and their control strategies: An overview focused on green synthesized plant-based metallic nanoparticles. Insects 2023, 14, 221.
  8. Debboun, M.; Frances, S.P.; Strickman, D.A. Insect Repellents Handbook, 2nd ed.; CRC Press: Boca Raton, FL, USA, 2020.
  9. Afify, A.; Betz, J.F.; Riabinina, O.; Lahondère, C.; Potter, C.J. Commonly used insect repellents hide human odors from Anopheles mosquitoes. Curr. Biol. 2019, 29, 3669–3680.e5.
  10. Stanczyk, N.M.; Brookfield, J.F.Y.; Ignell, R.; Logan, J.G.; Field, L.M. Behavioral insensitivity to DEET in Aedes aegypti is a genetically determined trait residing in changes in sensillum function. Proc. Natl. Acad. Sci. USA 2010, 107, 8575–8580.
  11. Deletre, E.; Martin, T.; Duménil, C.; Chandre, F. Insecticide resistance modifies mosquito response to DEET and natural repellents. Parasites Vectors 2019, 12, 89.
  12. Yang, L.; Norris, E.J.; Jiang, S.; Bernier, U.R.; Linthicum, K.J.; Bloomquist, J.R. Reduced effectiveness of repellents in a pyrethroid-resistant strain of Aedes aegypti (Diptera: Culicidae) and its correlation with olfactory sensitivity. Pest Manag. Sci. 2020, 76, 118–124.
  13. da Costa, K.S.; Galúcio, J.M.; da Costa, C.H.S.; Santana, A.R.; Dos Santos Carvalho, V.; do Nascimento, L.D.; Lima e Lima, A.H.; Neves Cruz, J.; Alves, C.N.; Lameira, J. Exploring the potentiality of natural products from essential oils as inhibitors of odorant-binding proteins: A structure- and ligand-based virtual screening approach to find novel mosquito repellents. ACS Omega 2019, 4, 22475–22486.
  14. Portilla-Pulido, J.S.; Castillo-Morales, R.M.; Barón-Rodríguez, M.A.; Duque, J.E.; Mendez-Sanchez, S.C. Design of a repellent against Aedes aegypti (Diptera: Culicidae) using in silico simulations with AaegOBP1 protein. J. Med. Entomol. 2020, 57, 463–476.
  15. Neto, M.F.A.; Campos, J.M.; Cerqueira, A.P.M.; de Lima, L.R.; Da Costa, G.V.; Ramos, R.D.S.; Junior, J.T.M.; Santos, C.B.R.; Leite, F.H.A. Hierarchical virtual screening and binding free energy prediction of potential modulators of Aedes aegypti odorant-binding protein 1. Molecules 2022, 27, 6777.
  16. Liggri, P.G.V.; Pérez-Garrido, A.; Tsitsanou, K.E.; Dileep, K.V.; Michaelakis, A.; Papachristos, D.P.; Pérez-Sánchez, H.; Zographos, S.E. 2D finger-printing and molecular docking studies identified potent mosquito repellents targeting odorant binding protein 1. Insect Biochem. Mol. Biol. 2023, 157, 103961.
  17. Karcher, W.; Devillers, J. Practical Applications of Quantitative Structure-Activity Relationships (QSAR) in Environmental Chemistry and Toxicology; Kluwer Academic Publishers: Dordrecht, The Netherlands, 1990.
  18. Devillers, J.; Karcher, W. Applied Multivariate Analysis in SAR and Environmental Studies; Kluwer Academic Publishers: Dordrecht, The Netherlands, 1991.
  19. Sorkun, M.C.; Koelman, J.M.V.A.; Er, S. Pushing the limits of solubility prediction via quality-oriented data selection. iScience 2020, 24, 101961.
  20. Dearden, J.C.; Rotureau, P.; Fayet, G. QSPR prediction of physico-chemical properties for REACH. SAR QSAR Environ. Res. 2013, 24, 279–318.
  21. Rácz, A.; Bajusz, D.; Héberger, K. Modelling methods and cross-validation variants in QSAR: A multi-level analysis. SAR QSAR Environ. Res. 2018, 29, 661–674.
  22. Jorgensen, W.L.; Duffy, E.M. Prediction of drug solubility from structure. Adv. Drug Deliv. Rev. 2002, 54, 355–366.
  23. Devillers, J. Calculation of octanol/water partition coefficients for pesticides: A comparative study. SAR QSAR Environ. Res. 1999, 10, 249–262.
  24. Devillers, J. 2D and 3D structure-activity modelling of mosquito repellents: A review. SAR QSAR Environ. Res. 2018, 29, 693–723.
  25. Devillers, J.; Larghi, A.; Sartor, V.; Setier-Rio, M.-L.; Lagneau, C.; Devillers, H. Nonlinear SAR modelling of mosquito repellents for skin application. Toxics 2023, 11, 837.
  26. Knipling, E.F.; McAlister, L.C.; Jones, H.A. Results of Screening Tests with Materials Evaluated as Insecticides, Miticides, and Repellents at the Orlando, Fla., Laboratory, April 1942 to April 1947; USDA Publication E-733; United States Department of Agriculture, Agriculture Research Administration, Bureau of Entomology and Plant Quarantine: Orlando, FL, USA, 1947.
  27. Devillers, J. Neural Networks in QSAR and Drug Design; Academic Press: London, UK, 1996.
  28. Moriguchi, I.; Hirono, S.; Liu, Q.; Nakagome, I.; Matsushita, Y. A simple method of calculating octanol/water partition coefficient. Chem. Pharm. Bull. 1992, 40, 127–130.
  29. Moriguchi, I.; Hirono, S.; Nakagome, I.; Hirano, H. Comparison of reliability of log P values for drugs calculated by several methods. Chem. Pharm. Bull. 1994, 42, 976–978.
  30. Meylan, W.M.; Howard, P.H. Atom/fragment contribution method for estimating octanol-water partition coefficients. J. Pharm. Sci. 1995, 84, 83–92.
  31. Devillers, J.; Sartor, V.; Doucet, J.P.; Doucet-Panaye, A.; Devillers, H. In silico prediction of mosquito repellents for clothing application. SAR QSAR Environ. Res. 2022, 33, 239–257.
  32. Pérez-Pereira, A.; Carvalho, A.R.; Carrola, J.S.; Tiritan, M.E.; Ribeiro, C. Integrated approach for synthetic cathinone drug prioritization and risk assessment: In silico approach and sub-chronic studies in Daphnia magna and Tetrahymena thermophila. Molecules 2023, 28, 2899.
  33. Naik, P.K.; Sindhura; Singh, T.; Singh, H. Quantitative structure-activity relationship (QSAR) for insecticides: Development of predictive in vivo insecticide activity models. SAR QSAR Environ. Res. 2009, 20, 551–566.
  34. Sinclair, C.J.; Boxall, A.B.A. Assessing the ecotoxicity of pesticide transformation products. Environ. Sci. Technol. 2003, 37, 4617–4625.
  35. Fang, H.; Tong, W.; Shi, L.M.; Blair, R.; Perkins, R.; Branham, W.; Hass, B.S.; Xie, Q.; Dial, S.L.; Moland, C.L.; et al. Structure-activity relationships for a large diverse set of natural, synthetic, and environmental estrogens. Chem. Res. Toxicol. 2001, 14, 280–294.
  36. Devillers, J.; Balaban, A.T. Topological Indices and Related Descriptors in QSAR and QSPR; Gordon and Breach Science Publishers: Amsterdam, The Netherlands, 1999.
  37. Balaban, A.T. Highly discriminating distance-based topological index. Chem. Phys. Lett. 1982, 89, 399–404.
  38. Bertz, S.H. The first general index of molecular complexity. J. Am. Chem. Soc. 1981, 103, 3599–3601.
  39. Kier, L.B.; Hall, L.H. Molecular connectivity VII: Specific treatment of heteroatoms. J. Pharm. Sci. 1976, 65, 1806–1809.
  40. Kier, L.B.; Hall, L.H. An electrotopological-state index for atoms in molecules. Pharm. Res. 1990, 7, 801–807.
  41. Chicco, D.; Warrens, M.J.; Jurman, G. The Matthews correlation coefficient (MCC) is more informative than Cohen’s Kappa and Brier score in binary classification assessment. IEEE Access 2021, 9, 78368–78381.
  42. Thusberg, J.; Olatubosun, A.; Vihinen, M. Performance of mutation pathogenicity prediction methods on missense variants. Hum. Mutat. 2011, 32, 358–368.
  43. Hu, J.; Li, Y.; Yang, J.-Y.; Shen, H.-B.; Yu, D.-J. GPCR-drug interactions prediction using random forest with drug-association-matrix-based post-processing procedure. Comput. Biol. Chem. 2016, 60, 59–71.
  44. Verheyen, G.R.; Braeken, E.; Van Deun, K.; Van Miert, S. Evaluation of in silico tools to predict the skin sensitization potential of chemicals. SAR QSAR Environ. Res. 2017, 28, 59–73.
  45. Benfenati, E.; Golbamaki, A.; Raitano, G.; Roncaglioni, A.; Manganelli, S.; Lemke, F.; Norinder, U.; Lo Piparo, E.; Honma, M.; Manganaro, A.; et al. A large comparison of integrated SAR/QSAR models of the Ames test for mutagenicity. SAR QSAR Environ. Res. 2018, 29, 591–611. [Google Scholar] [CrossRef] [PubMed]
  46. Hughes, M.G.; Glasby, T.M.; Hanslow, D.J.; West, G.J.; Wen, L. Random forest classification method for predicting intertidal wetland migration under sea level rise. Front. Environ. Sci. 2022, 10, 749950. [Google Scholar] [CrossRef]
  47. Chicco, D.; Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. 2020, 21, 6. [Google Scholar] [CrossRef] [PubMed]
  48. Branco, P.; Torgo, L.; Ribeiro, R.P. A survey of predictive modeling on imbalanced domains. ACM Comput. Surv. 2016, 1, 31. [Google Scholar] [CrossRef]
  49. Akosa, J.S. Predictive Accuracy: A Misleading Performance Measure for Highly Imbalanced Data. In Proceedings of the SAS Global Forum 2017, Orlando, FL, USA, 2–5 April 2017; paper 942. Available online: https://support.sas.com/resources/papers/proceedings17/0942-2017.pdf (accessed on 30 March 2024).
  50. García, V.; Mollineda, R.A.; Sánchez, J.S. A new performance evaluation method for two-class imbalanced problems. In Structural, Syntactic, and Statistical Pattern Recognition; da Vitoria Lobo, N., Kasparis, T., Georgiopoulos, M., Roli, F., Kwok, J.T., Anagnostopoulos, G.C., Loog, M., Eds.; Springer: Berlin/Heidelberg, Germany, 2008; pp. 917–925. [Google Scholar]
  51. Khater, H.F.; Selim, A.M.; Abouelella, G.A.; Abouelella, N.A.; Murugan, K.; Vaz, N.P.; Govindarajan, M. Commercial mosquito repellents and their safety concerns. In Malaria; Kasenga, F.H., Ed.; IntechOpen Limited: London, UK, 2019; Available online: https://www.intechopen.com/chapters/68538 (accessed on 30 March 2024).
  52. Pridgeon, J.W.; Bernier, U.R.; Becnel, J.J. Toxicity comparison of eight repellents against four species of female mosquitoes. J. Am. Mosq. Control Assoc. 2009, 25, 168–173. [Google Scholar] [CrossRef]
  53. Hansch, C.; Leo, A. Exploring QSAR: Fundamentals and Applications in Chemistry and Biology; American Chemical Society: Washington, DC, USA, 1995. [Google Scholar]
  54. World Health Organization. Ethyl Butylacetylaminopropionate Also Known as IR3535®, 3-(N-acetyl-N-butyl)Aminopropionic Acid Ethyl Ester; WHO Specifications and Evaluations for Public Health Pesticides; World Health Organization: Geneva, Switzerland, 2006; p. 26. Available online: https://archive.epa.gov/osa/hsrb/web/pdf/whoir3535evaluationapril2006.pdf (accessed on 13 March 2024).
  55. European Chemicals Agency. Regulation (EU) No 528/2012 Concerning the Making Available on the Market and Use of Biocidal products. Icaridin. Product-Type 19 (Repellents and Attractants). Assessment Report. December 2019; ECHA: Helsinki, Finland, 2019; p. 84. Available online: https://echa.europa.eu/documents/10162/58d77648-e39e-6498-e743-d64df39cdc24 (accessed on 13 March 2024).
  56. European Chemicals Agency. Directive 98/8/EC Concerning the Placing Biocidal Products on the Market. N,N-diethyl-meta-toluamide (DEET). Product-Type 19 (Repellents and Attractants). Assessment Report. 11 March 2010; ECHA: Helsinki, Finland, 2010; p. 43. Available online: https://echa.europa.eu/documents/10162/a9b111f6-37b7-c179-dce4-361b6217484d (accessed on 13 March 2024).
  57. Ali, A.; Cantrell, C.L.; Bernier, U.R.; Duke, S.O.; Schneider, J.C.; Agramonte, N.M.; Khan, I. Aedes aegypti (Diptera: Culicidae) biting deterrence: Structure-activity relationship of saturated and unsaturated fatty acids. J. Med. Entomol. 2012, 49, 1370–1378. [Google Scholar] [CrossRef] [PubMed]
  58. Jahn, A.; Kim, S.Y.; Choi, J.-H.; Kim, D.-D.; Ahn, Y.-J.; Yong, C.S.; Kim, J.S. A bioassay for mosquito repellency against Aedes aegypti: Method validation and bioactivities of DEET analogues. J. Pharm. Pharmacol. 2010, 62, 91–97. [Google Scholar] [CrossRef] [PubMed]
  59. Suryanarayana, M.V.; Pandey, K.S.; Prakash, S.; Raghuveeran, C.D.; Dangi, R.S.; Swamy, R.V.; Rao, K.M. Structure-activity relationship studies with mosquito repellent amides. J. Pharm. Sci. 1991, 80, 1055–1057. [Google Scholar] [CrossRef] [PubMed]
  60. Iovinella, I.; Mandoli, A.; Luceri, C.; D’Ambrosio, M.; Caputo, B.; Cobre, P.; Dani, F.R. Cyclic acetals as novel long-lasting mosquito repellents. J. Agric. Food Chem. 2023, 71, 2152–2159. [Google Scholar] [CrossRef] [PubMed]
  61. Boeckh, J.; Breer, H.; Geier, M.; Hoever, F.P.; Krüger, B.W.; Nentwig, G.; Sass, H. Acylated 1,3-aminopropanols as repellents against bloodsucking arthropods. Pest. Sci. 1996, 48, 359–373. [Google Scholar] [CrossRef]
  62. Leal, W.S. The enigmatic reception of DEET—The gold standard of insect repellents. Curr. Opin. Insect Sci. 2014, 6, 93–98. [Google Scholar] [CrossRef] [PubMed]
  63. Kim, S.-I.; Tak, J.-H.; Seo, J.K.; Park, S.R.; Kim, J.; Boo, K.-H. Repellency of veratraldehyde (3,4-dimethoxy benzaldehyde) against mosquito females and tick nymphs. Appl. Sci. 2021, 11, 4861. [Google Scholar] [CrossRef]
  64. Katritzky, A.R.; Dobchev, D.A.; Tulp, I.; Karelson, M.; Carlson, D.A. QSAR study of mosquito repellents using Codessa Pro. Bioorg. Med. Chem. Lett. 2006, 16, 2306–2311. [Google Scholar] [CrossRef] [PubMed]
  65. Paluch, G.; Grodnitzky, J.; Bartholomay, L.; Coats, J. Quantitative structure–activity relationship of botanical sesquiterpenes: Spatial and contact repellency to the yellow fever mosquito, Aedes aegypti. J. Agric. Food Chem. 2009, 57, 7618–7625. [Google Scholar] [CrossRef]
  66. Wang, Z.; Song, J.; Chen, J.; Song, Z.; Shang, S.; Jiang, Z.; Han, Z. QSAR study of mosquito repellents from terpenoid with a six-member ring. Bioorg. Med. Chem. Lett. 2008, 18, 2854–2859. [Google Scholar] [CrossRef]
Figure 1. Scatterplot of experimental log P values versus the corresponding calculated MlogP and MHlogP values for 155 dataset molecules.
Figure 2. Percentages of good predictions (%GPs) obtained with the MH versus ML models for the training sets.
Figure 3. Percentages of good predictions (%GPs) obtained with the MH versus ML models for the test sets.
Figure 4. Distribution of MlogP values versus the percentages of good predictions obtained for the training sets of the ML models.
Figure 5. Distribution of MHlogP values versus the percentages of good predictions obtained for the training sets of the MH models.
Table 1. Prediction results and performance evaluation metrics of the ML1 model on the total (To), training (Tr) and test (Ts) sets.
Metrics        ML1-To    ML1-Tr    ML1-Ts
TP ¹           302       262       40
TN             1681      1332      349
FP             172       132       40
FN             16        11        5
Sensitivity    0.950     0.960     0.889
Specificity    0.907     0.910     0.897
Accuracy       0.913     0.918     0.896
MCC            0.733     0.756     0.618
F1             0.763     0.786     0.640
AUC            0.948     0.954     0.920
G-Mean         0.928     0.934     0.893
Dominance      0.043     0.050     −0.008
¹ TP = true positive, TN = true negative, FP = false positive, FN = false negative (otherwise, see text).
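As a cross-check on Table 1, the threshold-based metrics can be recomputed from the four confusion-matrix counts. The sketch below is illustrative and is not the authors' code: the function name `confusion_metrics` is hypothetical, the formulas are the usual textbook definitions, and dominance is assumed to be sensitivity minus specificity, an assumption that reproduces the tabulated values. AUC is left out because it cannot be derived from a single confusion matrix.

```python
from math import sqrt


def confusion_metrics(tp, tn, fp, fn):
    """Return threshold metrics computed from a 2x2 confusion matrix.

    Hypothetical helper for illustration; the definitions are the standard ones.
    """
    sensitivity = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn)
    g_mean = sqrt(sensitivity * specificity)
    # Assumption: dominance = sensitivity - specificity.
    dominance = sensitivity - specificity
    return {
        "Sensitivity": sensitivity,
        "Specificity": specificity,
        "Accuracy": accuracy,
        "MCC": mcc,
        "F1": f1,
        "G-Mean": g_mean,
        "Dominance": dominance,
    }


# Counts taken from Table 1 (ML1 model, total and test sets).
ml1_total = confusion_metrics(tp=302, tn=1681, fp=172, fn=16)
ml1_test = confusion_metrics(tp=40, tn=349, fp=40, fn=5)

for name, value in ml1_total.items():
    print(f"{name:<12}{value:.3f}")
```

Calling the same function with the MH1 counts of Table 2 gives the corresponding MH1 values. A negative dominance, as on the ML1 test set, indicates that specificity slightly exceeds sensitivity there.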
Table 2. Prediction results and performance evaluation metrics of the MH1 model on the total (To), training (Tr), and test (Ts) sets.
Metrics        MH1-To    MH1-Tr    MH1-Ts
TP ¹           300       261       39
TN             1693      1341      352
FP             160       123       37
FN             18        12        6
Sensitivity    0.943     0.956     0.867
Specificity    0.914     0.916     0.905
Accuracy       0.918     0.922     0.901
MCC            0.742     0.765     0.619
F1             0.771     0.795     0.645
AUC            0.954     0.958     0.928
G-Mean         0.928     0.936     0.886
Dominance      0.030     0.040     −0.038
¹ TP = true positive, TN = true negative, FP = false positive, FN = false negative (otherwise, see text).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Devillers, J.; Devillers, H. Structure–Activity Relationship (SAR) Modeling of Mosquito Repellents: Deciphering the Importance of the 1-Octanol/Water Partition Coefficient on the Prediction Results. Appl. Sci. 2024, 14, 5366. https://doi.org/10.3390/app14135366

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.