Article

Skin Doctor: Machine Learning Models for Skin Sensitization Prediction that Provide Estimates and Indicators of Prediction Reliability

1 Center for Bioinformatics, Universität Hamburg, 20146 Hamburg, Germany
2 HITeC e.V., 22527 Hamburg, Germany
3 Department of Chemistry, University of Bergen, 5020 Bergen, Norway
4 Computational Biology Unit (CBU), University of Bergen, 5020 Bergen, Norway
5 Front End Innovation, Beiersdorf AG, 20253 Hamburg, Germany
* Author to whom correspondence should be addressed.
Int. J. Mol. Sci. 2019, 20(19), 4833; https://doi.org/10.3390/ijms20194833
Submission received: 27 August 2019 / Revised: 17 September 2019 / Accepted: 18 September 2019 / Published: 28 September 2019
(This article belongs to the Special Issue QSAR and Chemoinformatics Tools for Modeling)

Abstract

The ability to predict the skin sensitization potential of small organic molecules is of high importance to the development and safe application of cosmetics, drugs and pesticides. One of the most widely accepted methods for predicting this hazard is the local lymph node assay (LLNA). The goal of this work was to develop in silico models for the prediction of the skin sensitization potential of small molecules that go beyond the state of the art, with larger LLNA data sets and, most importantly, a robust and intuitive definition of the applicability domain, paired with additional indicators of the reliability of predictions. We explored a large variety of molecular descriptors and fingerprints in combination with random forest and support vector machine classifiers. The most suitable models were tested on holdout data, on which they yielded competitive performance (Matthews correlation coefficients up to 0.52; accuracies up to 0.76; areas under the receiver operating characteristic curves up to 0.83). The most favorable models are available via a public web service that, in addition to predictions, provides assessments of the applicability domain and indicators of the reliability of the individual predictions.

1. Introduction

Repeated exposure to reactive chemicals with skin-sensitizing properties can cause allergic contact dermatitis (ACD) [1], an adverse cutaneous condition with a prevalence of ~20% among the general population [2] and an even higher prevalence among workers with chronic occupational exposure [3]. Understanding the skin sensitization potential of small organic molecules is therefore essential to the development and safe application of chemicals, including cosmetics and drugs.
Historically, animal tests have effectively been the only method for determining the skin sensitization potential and potency of substances. The local lymph node assay (LLNA) is currently considered to be the most advanced animal testing system [4]. In recent years, ethical considerations and regulatory requirements have led to an intensification of the search for alternatives to animal testing, in particular in the cosmetics industry [5]. New in vitro and in chemico methods have been developed and evaluated [6,7,8,9], and computational approaches are starting to be recognized as important alternatives to animal testing [8,9,10,11]. The non-redundant combinatorial use of these methods in defined approaches that assess several key events of the adverse outcome pathway (AOP) for skin sensitization shows promising predictive capacity [12] and is currently being evaluated in risk assessment case studies.
The bottleneck in the development of in silico tools for the prediction of skin sensitization is not related to technology but to the scarcity of high-quality experimental data available for model development. Three strategies have been pursued to address this problem. The first is to increase the amount and coverage of data by employing data mining techniques to retrieve information from various types of assays and sources [13,14]. Although this has been discussed as a promising strategy to increase the applicability of models, it has also prompted controversial discussions regarding the quality and relevance of the data [15,16]. The second strategy is to develop focused models based on small, high-quality data sets [17,18,19,20,21]. The third strategy is to pursue a middle way that aims for a favorable balance between the quantity and quality of the data. The LLNA data available in the public domain are generally regarded as the most suitable source of information for this strategy [22,23,24,25,26,27].
The two largest curated collections of LLNA outcomes in the public domain are the data collections of Alves et al. [28] and Di et al. [22]. The data were obtained from reliable sources and subjected to deduplication procedures that reject discordant records. The data set of Alves et al. includes (mainly) binary LLNA outcomes recorded for 1000 compounds. In addition, it contains human data and outcomes from different types of in vitro and in chemico assays, although for substantially fewer substances. Based on these data, the authors developed machine learning models for different assay types and also a consensus model, all of which are available via an online platform (“PredSkin”) [19]. Their model for the prediction of binary LLNA outcomes reached a correct classification rate (CCR) of 0.77 during five-fold external cross-validation.
The data set published by Di et al. contains 1007 substances annotated with LLNA potency classes [22]. Based on a subset of approximately 400 compounds for which an explicit reaction mechanism could be derived with a structural alerts tool for protein binding implemented in the OECD Toolbox [29], Di et al. developed a variety of models for the binary and ternary prediction of the skin sensitization potential. These models included local models for four reaction domains as well as global models. The best binary global model was reported to obtain an accuracy (ACC) of 0.84 during cross-validation and an ACC of 0.81 on a test set.
Major challenges in the application of machine learning approaches to risk assessment are related to the complexity of the models, which goes along with limited mechanistic interpretability. For these types of models, transparency with respect to the applicability domain as well as the provision of confidence estimates for individual predictions are of utmost importance to risk assessors, who ultimately are the main stakeholders of these methods.
In this context, and building on the works of Alves et al. and Di et al., this study pursues five main objectives to advance in silico capabilities for the prediction of the skin sensitization potential: (i) the development of a detailed understanding of the chemical space covered by the available LLNA data with respect to the chemical space of cosmetics, approved drugs and pesticides, (ii) the identification of the most suitable (sets of) molecular descriptors for modeling, (iii) the maximization of the applicability of the models by increasing the size and coverage of the data set used for model development, (iv) the definition of robust measures of the models’ applicability domain as well as the provision of indicators for the reliability of individual predictions, and (v) the provision of the most suitable models via a public web service.

2. Results

2.1. Characterization of the LLNA Data Sets

In order to develop a detailed understanding of the relevance of the available LLNA data to modeling the skin sensitization potential of xenobiotics, we analyzed the composition and molecular diversity of the LLNA data sets of Alves et al. and Di et al. In addition, we assessed how well the individual LLNA data sets cover the chemical space of cosmetics, approved drugs and pesticides.

2.1.1. Data Set Composition

Whereas the data set compiled by Alves et al. is balanced (481 sensitizers; 519 non-sensitizers), the data set of Di et al. contains almost twice as many non-sensitizers (n = 629) as sensitizers (n = 364; Table 1). Roughly 40% of all compounds (567) are present in both data sets (Table 2). The LLNA data set compiled by Alves et al. contains 7% of all substances listed in the cosmetics data set; coverage is lower for approved drugs and pesticides (4% and 5%, respectively). The percentages are similar for the LLNA data set of Di et al.: 5% overlaps with cosmetics, 3% with approved drugs and 4% with pesticides. Merging the two LLNA data sets increases the number of unique compounds to 1416 and the overlaps with cosmetics, approved drugs and pesticides to 8%, 5% and 5%, respectively.

2.1.2. Coverage of Chemical Space

Whereas only a few of the cosmetics, approved drugs and pesticides listed in the reference data sets are included in the LLNA data sets, principal component analysis (PCA) shows that the areas in chemical space most densely populated with these xenobiotics are actually well covered by the merged LLNA data set (Figure 1). Nevertheless, scattered data points radiating from the area of high data density towards the bottom and the top right corner of the PCA score plot indicate the existence of drugs and cosmetic compounds without closely related substances in the merged data set.
In addition to the PCA, the coverage of cosmetics, approved drugs and pesticides by the merged LLNA data set was quantified based on the distribution of maximum pairwise similarities. As shown in Figure 2, the merged LLNA data set covers cosmetics much better than approved drugs and pesticides: over 30% of all cosmetics are represented by a nearest neighbor in the merged LLNA data set with a minimum Tanimoto coefficient of 0.6, whereas this is the case for only 10% and 13% of all approved drugs and pesticides, respectively.
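As an illustration of how such a coverage analysis can be set up, the following sketch computes, for each query compound, the Tanimoto similarity to its nearest neighbor in a reference set using RDKit Morgan2 fingerprints (2048 bits, radius 2). The variable names in the usage comment (e.g., cosmetics_smiles, llna_smiles) are hypothetical placeholders for the SMILES lists of the respective data sets.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def nearest_neighbor_similarities(query_smiles, reference_smiles):
    """Tanimoto similarity of each query compound to its nearest neighbor in the
    reference set (Morgan2 fingerprints, radius 2, 2048 bits)."""
    ref_fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048)
               for s in reference_smiles]
    similarities = []
    for s in query_smiles:
        fp = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048)
        similarities.append(max(DataStructs.BulkTanimotoSimilarity(fp, ref_fps)))
    return similarities

# Fraction of compounds covered at a Tanimoto coefficient of at least 0.6
# (cosmetics_smiles and llna_smiles are hypothetical SMILES lists):
# sims = nearest_neighbor_similarities(cosmetics_smiles, llna_smiles)
# print(sum(s >= 0.6 for s in sims) / len(sims))
```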
It is important to note that the data set compiled by Di et al. includes many compounds that populate areas in chemical space not (well) covered by the LLNA data set of Alves et al. (Figure 3). It is therefore expected that models trained on the merged data set should be more widely applicable than those based solely on the LLNA data compiled by Alves et al.

2.1.3. Molecular Diversity

The molecular diversity of the merged LLNA data set and the reference data sets was assessed in two different ways: by pairwise comparison of molecular structures and by counting Murcko scaffolds. Pairwise comparisons were again based on Tanimoto coefficients derived from Morgan2 fingerprints with a length of 2048 bits. The cosmetics data set exhibits lower diversity than the other data sets (Figure 4). This can be attributed, to some extent, to the larger size of the cosmetics data set: 23% of all pairs of compounds in the cosmetics data set have fingerprints with a Tanimoto coefficient of 0.8 or higher, whereas this percentage is 11% or lower for the merged LLNA, approved drugs and pesticides data sets. Of all compounds included in the cosmetics data set, 220 have at least one neighbor with an identical molecular fingerprint. These are mostly pairs of molecules with long aliphatic chains, differing only in the length of these chains (note that any duplicate molecules had been removed during data preprocessing).
The merged LLNA data set covers a total of 453 distinct Murcko scaffolds, which is roughly as many as covered by the pesticides data set but only one-third and one-quarter of those covered by the cosmetics and approved drugs data sets, respectively (Table 1). Taking into account the size of the individual data sets, the approved drugs data set is clearly the most diverse. In contrast, the cosmetics data set, which contains more molecular structures than all other data sets taken together, is the least diverse. This is in part related to the fact that approximately 40% of all cosmetics do not include a ring and, as such, do not have a Murcko scaffold.
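A minimal sketch of this scaffold analysis is shown below, assuming the merged data set is available as a list of SMILES strings (the variable merged_llna_smiles is hypothetical). RDKit's MurckoScaffold module returns an empty string for acyclic molecules, which is how ring-free compounds can be recognized.

```python
from collections import Counter
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def murcko_scaffold_counts(smiles_list):
    """Count the occurrences of each distinct Murcko scaffold (as SMILES)."""
    counts = Counter()
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue
        counts[MurckoScaffold.MurckoScaffoldSmiles(mol=mol)] += 1
    return counts

# counts = murcko_scaffold_counts(merged_llna_smiles)
# n_scaffolds = sum(1 for s in counts if s)                         # distinct non-empty scaffolds
# n_singletons = sum(1 for s, n in counts.items() if s and n == 1)  # scaffolds with a single member
```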
Benzene is the most prominent Murcko scaffold across all data sets, with a prevalence of 27%, 28%, 10% and 23% among the merged LLNA, cosmetics, approved drugs and pesticides data sets, respectively. All other scaffolds are represented by only a few instances each (Table S2). Note the high percentages of singleton scaffolds (72% or higher) across all data sets, which, particularly in the case of the LLNA data set, illustrate the scarcity of the data available for modeling.

2.2. Molecular Properties of Skin Sensitizers and Non-Sensitizers

The merged LLNA data set contains 572 skin sensitizers and 844 non-sensitizers. As shown in Figure 5a, non-sensitizers cover a broader chemical space than sensitizers. A substantial number of non-sensitizers are of higher molecular weight than sensitizers and have a stronger aromatic character and larger topological polar surface area (Figure 5a,d). A cluster of skin sensitizers and non-sensitizers with long aliphatic and halogenated chains was identified, observed as a diagonal line in the lower left of the score plot (Figure 5a,c). Interestingly, the compounds of this cluster can only be discriminated in the “MOE 2D” descriptor space but not in the Morgan2 fingerprint space, since molecules with identical halogen substitution but differing chain lengths can result in identical Morgan2 fingerprints.
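The fingerprint degeneracy mentioned above can be reproduced with RDKit: for sufficiently long chains, additional CH2 groups do not introduce new radius-2 atom environments, so the binary Morgan2 fingerprints coincide. The two example molecules below are illustrative choices, not compounds taken from the data sets.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

# Two aliphatic monochlorides differing only in chain length (illustrative examples)
mol_a = Chem.MolFromSmiles("CCCCCCCCCCCl")    # 1-chlorodecane
mol_b = Chem.MolFromSmiles("CCCCCCCCCCCCCl")  # 1-chlorododecane

fp_a = AllChem.GetMorganFingerprintAsBitVect(mol_a, 2, nBits=2048)
fp_b = AllChem.GetMorganFingerprintAsBitVect(mol_b, 2, nBits=2048)

# The extra CH2 groups generate no new radius-2 environments, so the bit vectors
# are identical and the Tanimoto similarity is 1.0.
print(DataStructs.TanimotoSimilarity(fp_a, fp_b))
```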

2.3. Model Development

Prior to model development, the merged LLNA data set was divided into a training (80%) and test (20%) set (Table 3; see Methods for details). All possible combinations of machine learning approaches (random forest (RF) and support vector machine (SVM)) with up to two different sets of molecular descriptors (including molecular fingerprints) were systematically explored (Table 4). One descriptor set to highlight is a new fingerprint derived from the “Protein binding alerts for skin sensitization by OASIS” profiler implemented in the OECD Toolbox [29]. This profiler assigns compounds to eleven mechanistic domains associated with skin sensitization, five of which are represented by more than 20 instances in the training set (i.e., Michael addition, SN2 reaction, Schiff base formation, acylation, and nucleophilic addition). The new fingerprint encodes the presence or absence of alerts matching one or several of these five mechanistic domains.
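A minimal sketch of how such a fingerprint can be encoded is given below, assuming the profiler output has already been exported as a list of matched alert domains per compound (the export step via the OECD Toolbox is not shown, and all names are hypothetical).

```python
MECHANISTIC_DOMAINS = [
    "Michael addition",
    "SN2 reaction",
    "Schiff base formation",
    "acylation",
    "nucleophilic addition",
]

def oasis_fingerprint(matched_domains):
    """Five-bit fingerprint encoding the presence (1) or absence (0) of protein
    binding alerts for the five mechanistic domains listed above."""
    matched = {d.lower() for d in matched_domains}
    return [int(domain.lower() in matched) for domain in MECHANISTIC_DOMAINS]

# Example: a compound flagged only for Michael addition
print(oasis_fingerprint(["Michael addition"]))  # [1, 0, 0, 0, 0]
```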
For each combination of machine learning algorithm and descriptor set(s), optimal hyperparameters were identified via a grid search (Table 5). The grid search was performed within the framework of a 10-fold cross-validation, with the Matthews correlation coefficient (MCC) [40] used as the scoring parameter.
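The grid search can be set up with scikit-learn as sketched below. The grid values shown are illustrative placeholders (the value ranges actually explored are those listed in Table 5), whereas the MCC scorer, the 10-fold cross-validation, the balanced class weights and the RBF kernel follow the setup described in the text and the Methods section.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

mcc_scorer = make_scorer(matthews_corrcoef)

# Illustrative grids; the value ranges actually explored are given in Table 5.
rf_grid = {"n_estimators": [100, 250, 500], "max_features": ["sqrt", "log2", 0.3]}
svm_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1]}

rf_search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=43),
    rf_grid, scoring=mcc_scorer, cv=10)
svm_search = GridSearchCV(
    SVC(kernel="rbf", class_weight="balanced", random_state=43),
    svm_grid, scoring=mcc_scorer, cv=10)

# rf_search.fit(X_train, y_train)   # X_train, y_train: descriptor matrix and binary labels
# print(rf_search.best_params_, rf_search.best_score_)
```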
The outcomes of this grid search are summarized in Table S3. It can be seen that similar hyperparameters tend to be selected by models based on related types and sets of molecular descriptors. No strong preferences for specific hyperparameter values are apparent. This is likely related to the fact that, within a broad value space, the hyperparameters only had a minor impact on model performance.

2.4. Model Performance

2.4.1. Measures for the Evaluation of Model Performance

Eight different measures were applied to describe the performance of the classifiers (a code sketch computing all of them follows the list):
  • Matthews correlation coefficient (MCC), which is regarded as one of the best measures of binary classification performance. It is robust against data imbalance and considers the proportions of all four prediction outcomes (i.e., true positive, false positive, true negative and false negative predictions). Note that MCC values range from −1 to +1. A value of +1 indicates perfect prediction, whereas a value of −1 indicates a prediction that is in total disagreement. A value of 0 indicates performance equal to random guessing.
  • ACC, which has been most commonly used by others to measure the performance of models for the prediction of the skin sensitization potential. It is defined as the proportion of correct predictions within all predictions made.
  • Area under the receiver operating characteristic curve (AUC), which in this case quantifies the ability to correctly rank compounds according to their skin sensitization potential. The AUC does not rely on a decision threshold.
  • Sensitivity (Se), which in this case quantifies the proportion of correctly identified skin sensitizers.
  • Specificity (Sp), which in this case quantifies the proportion of correctly predicted non-sensitizers.
  • Positive predictive value (PPV), which reports the proportion of true positive predictions among all positive predictions.
  • Negative predictive value (NPV), which reports the proportion of true negative predictions among all negative predictions.
  • CCR, which is the mean of Se and Sp.
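A compact sketch computing all eight measures from binary predictions and class probabilities (or decision values) with scikit-learn, assuming labels are encoded as 1 = sensitizer and 0 = non-sensitizer:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             matthews_corrcoef, roc_auc_score)

def performance_measures(y_true, y_pred, y_score):
    """Return the eight measures described above for binary labels (1 = sensitizer)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    se = tp / (tp + fn)   # sensitivity
    sp = tn / (tn + fp)   # specificity
    return {
        "MCC": matthews_corrcoef(y_true, y_pred),
        "ACC": accuracy_score(y_true, y_pred),
        "AUC": roc_auc_score(y_true, y_score),  # y_score: probabilities or decision values
        "Se": se,
        "Sp": sp,
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "CCR": (se + sp) / 2,
    }
```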

2.4.2. Model Performance During Cross-Validation

Depending on the combination of machine learning algorithm (RF or SVM) and descriptor set(s) used, MCC values ranged from 0.27 to 0.55, ACC values from 0.66 to 0.78, and AUC values from 0.63 to 0.84 (Table 4). The machine learning algorithms had only a minor impact on model performance. The average MCC values obtained by RFs and SVMs were 0.45 and 0.48, respectively. Nevertheless, the twelve predictors that obtained the highest MCC values are all based on SVMs. Most of the observed variation in performance stemmed from the use of different descriptor sets.
The best performance during cross-validation was obtained by the SVM_MOE2D+OASIS model. This model yielded an MCC, ACC and AUC of 0.55, 0.78 and 0.83, respectively. The best model based on a single set of descriptors was the SVM_PaDEL model. It reached an MCC, ACC and AUC of 0.50, 0.75 and 0.83, respectively. However, its lead over the corresponding RF model and other models based on a single set of descriptors was small. For example, the best model based on a single type of molecular fingerprint, RF_MACCS, obtained an MCC, ACC and AUC of 0.47, 0.75 and 0.81, respectively. Models based on either machine learning algorithm in combination with “MOE 2D” descriptors or MACCS fingerprints yielded comparable performance. Reduction of the full MOE2D descriptor set to the subset of 53 interpretable MOE descriptors (previously used for analyzing the chemical space coverage) led to a decline in MCC values by a maximum of 0.04. Caution needs to be exercised when interpreting these small differences in performance because of the variance observed during cross-validation. For example, for the SVM_MOE2D_53 model, the standard deviation observed for the MCC during cross-validation was 0.069.
In most cases, the combination of two sets of molecular descriptors was beneficial to model performance. Exceptions include models based on combinations of two sets of descriptors of the same type (e.g., Morgan2 and MACCS fingerprints). These did not outperform the best models based on a single set of descriptors. Also, combinations of 0D/1D/2D molecular descriptors with fingerprints did not consistently outperform models based on a single set of descriptors, although nine of the twelve models with MCC values of at least 0.5 combine non-binary molecular descriptors (i.e., MOE2D or PaDEL) with molecular fingerprints. Tables S4 and S5 provide a comprehensive overview of the impact of different combinations of descriptor sets on model performance.
Good performance was also obtained by models generated using non-commercial software only. For example, the SVM_PaDEL+OASIS model obtained MCC, ACC and AUC values of 0.50, 0.75 and 0.83, respectively. With few exceptions, the OASIS fingerprint contributed positively to the performance of models. For instance, adding the OASIS fingerprint to the SVM_MOE2D model led to an increase of the MCC, ACC and AUC by 0.07, 0.04 and 0.01, respectively. Interestingly, with a total of just 84 bits, the RF_PaDEL-Est+OASIS model reached a level of performance that is comparable with that of more complex models (MCC 0.48; ACC 0.75; AUC 0.80). However, when used on its own, the OASIS fingerprint is not sufficient for good classification performance: the RF_OASIS and SVM_OASIS models obtained the lowest MCC values across all models (i.e., 0.27 and 0.29, respectively).

2.4.3. In-Depth Analysis of Selected Models within the Cross-Validation Framework

Based on the cross-validation results, five of the most interesting models were selected for additional studies:
  • SVM_MOE2D+OASIS: the model with highest MCC.
  • SVM_PaDEL+OASIS: a model performing comparably to SVM_MOE2D+OASIS and based on freely available software only.
  • SVM_PaDEL: the best model based on a single set of molecular descriptors.
  • RF_MACCS: the best model based on a single set of molecular fingerprints.
  • SVM_PaDEL+MACCS: a model with good performance, combining the descriptor sets used by the above two models.
Within the above-mentioned 10-fold cross-validation framework, we first analyzed how the coverage of the query molecules by the training data affects model performance. For this analysis, we calculated the similarity between the individual query molecules and their one, three and five nearest neighbors in the training set. Two similarity measures were explored: Tanimoto coefficients in the MACCS fingerprint space and negative Euclidean distances in the PaDEL descriptor space. The latter did not correlate well with molecular similarity (likely caused by noise related to the large number of molecular descriptors considered in this approach; Figure S2 and Table S6), for which reason we decided to go ahead with the fingerprint-based distance measure.
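A sketch of this similarity calculation is given below, assuming the training set is available as a list of SMILES strings (the variable training_smiles is hypothetical).

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import MACCSkeys

def mean_knn_similarity(query_smiles, training_smiles, k=5):
    """Mean Tanimoto similarity (MACCS keys) between a query compound and its
    k most similar neighbors in the training set."""
    train_fps = [MACCSkeys.GenMACCSKeys(Chem.MolFromSmiles(s)) for s in training_smiles]
    query_fp = MACCSkeys.GenMACCSKeys(Chem.MolFromSmiles(query_smiles))
    sims = sorted(DataStructs.BulkTanimotoSimilarity(query_fp, train_fps), reverse=True)
    return sum(sims[:k]) / k

# Example (phenol as an arbitrary query; training_smiles is a hypothetical list):
# print(mean_knn_similarity("Oc1ccccc1", training_smiles, k=5))
```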
For all five models, a direct linear relationship was observed between MCC values and molecular similarity. The relationship was consistent when considering different numbers of nearest neighbors in the training data but tended to be more robust when taking more (i.e., 5) nearest neighbors into account (Pearson correlation coefficient between 0.92 and 0.96 when considering five nearest neighbors). As shown in Figure 6, for compounds dissimilar to those present in the training data (defined by Tanimoto coefficients averaged over the five nearest neighbors of 0.5 or lower), MCC values were below or around 0.4 for all five models. For compounds structurally related to the training data (defined by Tanimoto coefficients of 0.7 or higher), MCC values were at least 0.5 or higher.
Secondly, we investigated how changes to the decision threshold of the SVM and RF classifiers (i.e., the value above which a compound is predicted to be a sensitizer) affect the sensitivity and specificity of the models. As shown in Figure 7, both metrics strongly depend on the selected decision threshold. This allows users to define context-dependent thresholds. For example, in scenarios where any skin sensitization potential of a compound of interest should be ruled out, users may opt for a lower decision threshold in order to flag any potential hazard. In the case of the RF_MACCS model, lowering the decision threshold to 0.3 results in a sensitivity of 0.84 and a specificity of 0.61 (Figure 7d).
Observing the predicted class probability can be of use for assessing the reliability of a prediction: as shown in Figure 8, the reliability of predictions increases with the absolute distance between the class probability and the decision threshold. For SVM models, predictions with class probabilities more than 0.5 away from the decision threshold had averaged MCC values between 0.63 and 0.67, whereas predictions with class probabilities less than 0.5 away had averaged MCC values of just 0.20 to 0.29. For the RF_MACCS model, predictions with class probabilities more than 0.35 away from the decision threshold had MCC values above 0.6, whereas predictions with class probabilities closer than 0.15 to the decision threshold had MCCs below 0.4. For the five investigated models, the Pearson correlation coefficients for this relationship were between 0.92 and 0.98.
As a further way of analyzing the data, we looked into the reliability of predictions as a function of the number of consecutive nearest neighbors in the training data that are of the same activity class as the one predicted for a compound of interest. From Figure 9, it can be seen that predictions are particularly reliable if the three nearest neighbors in the training data are of the same class as the one predicted for a compound of interest. The strongest correlation is observed for the RF_MACCS model. For this model, the MCC is close to zero for compounds where the predicted class is in conflict with the class assigned to the nearest neighbor. In contrast, the MCC is above 0.6 for compounds where the predicted class and the classes assigned to the three nearest neighbors are identical.
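The neighbor-agreement indicator can be computed as sketched below. The fingerprints are assumed to be RDKit MACCS bit vectors of the training compounds and the query, and the labels and predicted class are assumed to share the same encoding (e.g., 1 = sensitizer, 0 = non-sensitizer); the function name is hypothetical.

```python
from rdkit import DataStructs

def consecutive_agreeing_neighbors(query_fp, train_fps, train_labels, predicted_class, k=3):
    """Number of nearest training neighbors (in order of decreasing MACCS Tanimoto
    similarity) that share the predicted class before the first disagreement."""
    sims = DataStructs.BulkTanimotoSimilarity(query_fp, train_fps)
    ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    count = 0
    for idx in ranked[:k]:
        if train_labels[idx] != predicted_class:
            break
        count += 1
    return count
```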

2.4.4. Performance of Selected Models on the Test Set

The performance of the five selected models was tested on holdout data. All models were stable, with only minor losses in MCC, ACC and AUC when compared to the results from cross-validation (Table 6). The largest losses in performance were observed for the RF_MACCS model, with MCC and ACC values decreased by 0.06 and 0.03, respectively (the AUC, however, increased by 0.01).
By defining the applicability domain of the models to include only compounds with a mean Tanimoto coefficient of at least 0.75 over the five nearest neighbors in the training set (based on MACCS fingerprints), MCC values increased, in the case of the RF_MACCS model from 0.41 to 0.59. However, at the same time, the coverage of the test set is reduced, in the case of RF_MACCS to 28%.
Defining the applicability domain with a cutoff of 0.50 rather than 0.75 led to only minor performance improvements compared to the model without applicability domain definition. This is related to the fact that only approximately 3% of the compounds of the test set are that dissimilar to the compounds in the training data. However, predictions for these compounds are unreliable (MCC values 0.2 or lower). Therefore, it is important to observe the applicability domain of the individual models.
Besides the applicability domain definition, users are advised to consider two additional types of information when judging the reliability of a prediction: (i) the distance of the predicted class probability from the decision threshold and (ii) the number of consecutive nearest neighbors that are of the same activity class as the class predicted for a compound of interest.
Larger distances of the class probability to the decision threshold indicate higher reliability of the prediction. For example, when considering only predictions with class probabilities 0.35 or further away from the decision threshold, the MCC of the RF_MACCS model increases from 0.41 to 0.78 (this covers 23% of the test set; Table 7). Likewise, for the SVM models, MCC values increase from approximately 0.5 to a maximum of 0.78 when considering predictions only if their class probability is 1.25 or further away from the decision threshold (this covers 12% to 37% of the compounds in the test set).
Predictions for query molecules that are consistent with the class assigned to the k-nearest neighbors in the training data are more reliable than for those that are in conflict. This is also confirmed by the results obtained for the test set (Table 8): Predictions that are in disagreement with the activity class of the nearest neighbor resulted in MCC and ACC values no higher than 0.13 and 0.56, respectively. MCC and ACC values increase to a maximum of 0.98 and 0.99 when considering predictions only if they are consistent with three or more nearest neighbors.

2.4.5. Comparison of Model Performance to that of Existing Models

Major caveats must be considered when attempting to directly compare the performance reported for existing models with those presented in this work. Not only do the underlying training and test sets differ substantially, but so do the protocols used for performance evaluation and the definitions of the models’ applicability domains. Roughly summarized, Alves et al. reported their predictor of binary LLNA outcomes to yield a CCR of 0.77 during external cross-validation [28]. Di et al. reported their best global model for the binary prediction of LLNA outcomes, an SVM model based on PaDEL-Ext descriptors (Ext-SVM), to have yielded an ACC of 0.84 during cross-validation and an ACC of 0.81 on their test set (when considering the applicability domain according to their definition) [22]. In comparison, our best model (SVM_MOE2D+OASIS) yielded a CCR of 0.78 and an identical ACC during cross-validation (MCC 0.55), without consideration of the applicability domain. On the test set, the SVM_MOE2D+OASIS model obtained a CCR of 0.76 and an MCC of 0.52. In this case, the consideration of the applicability domain of the model (defined as including any compound with a mean Tanimoto similarity to the five nearest neighbors in the training set of 0.50 or higher) did not yield a further improvement of performance. The SVM_PaDEL and RF_MACCS models, which are available via a public web service, yielded comparable CCR values (0.74 and 0.70 without consideration of the applicability domain; 0.75 and 0.71 with consideration of the applicability domain, respectively). The latter model has the additional benefit of being based on a fingerprint with a length of only 166 bits.

2.5. Skin Doctor Web Service

The final RF_MACCS and SVM_PaDEL models, trained not on the cross-validation data set but on the complete, preprocessed data set (1416 and 1388 compounds, respectively, depending on the number of compounds for which descriptors could be successfully calculated), are provided via the New E-Resource for Drug Discovery (NERDD) [41]. Queries can either be drawn directly or uploaded in different formats. Users may change the default decision threshold to steer the model’s sensitivity and specificity. Results are presented in a tabular overview and can be exported as a CSV file. For each query, they include information on (i) whether or not the query is within the applicability domain of the model, (ii) the predicted activity class, (iii) the distance from the selected decision threshold, (iv) the mean similarity between the query compound and the five nearest neighbors in the training set and (v) the number of consecutive nearest neighbors in the training data whose activity label is consistent with the prediction. The analyses and visualizations presented in this work may serve as guidance for choosing the required level of confidence in a prediction, while being aware of the corresponding effects on the applicability domain and the similarity requirements.
Predictions are flagged with reliability warnings (a) if the mean similarity between the compound of interest and its five nearest neighbors in the training data is less than 0.5, (b) if the prediction is in conflict with the activity of the nearest neighbor in the training data, or (c) if the distance to the decision threshold is less than 0.15 (RF_MACCS model) or 0.5 (SVM_PaDEL model).
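A sketch of this warning logic is given below; the three inputs (mean similarity to the five nearest neighbors, distance of the class probability from the decision threshold, and agreement with the nearest neighbor's class) are assumed to have been computed beforehand as described above, and the function and parameter names are hypothetical.

```python
def reliability_warnings(mean_knn_similarity, threshold_distance,
                         agrees_with_nearest_neighbor, model_type="RF"):
    """Collect the reliability warnings issued for a single prediction (sketch)."""
    min_distance = 0.15 if model_type == "RF" else 0.5   # RF_MACCS vs. SVM_PaDEL cutoffs
    warnings = []
    if mean_knn_similarity < 0.5:
        warnings.append("compound outside the applicability domain (mean 5-NN similarity < 0.5)")
    if not agrees_with_nearest_neighbor:
        warnings.append("prediction conflicts with the class of the nearest training neighbor")
    if threshold_distance < min_distance:
        warnings.append("class probability too close to the decision threshold")
    return warnings
```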

3. Materials and Methods

3.1. Data Preparation

The LLNA data set compiled by Alves et al. was downloaded from Chembench. Binary class labels (i.e., “sensitizer”, “non-sensitizer”) were obtained from the binary property “LLNA result” and not altered. The LLNA data set of Di et al. was obtained from the supporting information associated with their publication [22]. Binary class labels (i.e., “sensitizer”, “non-sensitizer”) were assigned based on the information provided by the property “class”: any compounds with the value “negative” were assigned the label “non-sensitizer”; any compounds with the value “weak”, “moderate”, “strong” or “extreme” were assigned the label “sensitizer”. Reference data sets of cosmetic substances and ingredients (hereafter “cosmetics”), approved drugs and pesticides were obtained from the EU CosIng database, DrugBank and the EU Pesticides Database.
All data sets were processed individually according to the following protocol: Any counterions were removed and the remaining molecular structures neutralized as described in the work of Stork et al. [42]. Tautomers were standardized with the “TautomerCanonicalizer” method implemented in the “tautomer” class of MolVS [43]. This was followed by a deduplication of molecules based on canonicalized SMILES. Stereochemical information was disregarded at this point, leading to conflicting activity labels for one compound (which had different activity labels assigned to the two enantiomers). This compound was removed from the data set.
A merged LLNA data set based on the LLNA data sets of Alves et al. and Di et al. was generated by filtering duplicates based on canonical SMILES and removing any compounds with contradicting class labels.
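The standardization, deduplication and merging steps can be sketched as follows with RDKit and MolVS. Counterion removal and neutralization (performed as described by Stork et al. [42]) are omitted here, and the input dictionaries mapping canonical SMILES to class labels are hypothetical.

```python
from rdkit import Chem
from molvs.tautomer import TautomerCanonicalizer

canonicalizer = TautomerCanonicalizer()

def standardize(smiles):
    """Tautomer-standardized canonical SMILES without stereochemistry."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol = canonicalizer.canonicalize(mol)
    return Chem.MolToSmiles(mol, isomericSmiles=False)

def merge_llna_sets(records_a, records_b):
    """Merge two {canonical SMILES: label} dictionaries, dropping compounds with
    contradicting class labels."""
    merged = dict(records_a)
    for smi, label in records_b.items():
        if smi in merged and merged[smi] != label:
            del merged[smi]          # conflicting labels -> remove the compound
        else:
            merged[smi] = label
    return merged
```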

3.2. Descriptor Calculation

Molecular descriptors were computed with the Molecular Operating Environment (MOE) [36] (“MOE descriptors”), RDKit [39] (Morgan and MACCS fingerprints) and PaDEL [37,38] (“PaDEL descriptors” as well as the molecular fingerprints “PaDEL-Est” and “PaDEL-Ext”). “MOE 2D” descriptors were calculated with default settings. Morgan fingerprints (2048 bits) were calculated with a radius of 2. MACCS fingerprints were calculated with default settings. The PaDEL descriptors were also calculated with default settings, with the exception of a maximum allowed runtime of 1000 s per molecule. Structural alerts were computed with the OECD Toolbox [29] using the “Protein binding alerts for skin sensitization by OASIS” profiler with default settings. All non-binary descriptors were scaled to unit variance and their means shifted to zero prior to model building and data analysis using the StandardScaler of scikit-learn [44].
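For the open-source part of the descriptor pipeline, the fingerprint calculation and the scaling can be sketched as follows. MOE and PaDEL descriptors are assumed to be computed externally and loaded as numerical arrays; the variable names in the comments are hypothetical.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, MACCSkeys
from sklearn.preprocessing import StandardScaler

def fp_to_array(fp):
    """Convert an RDKit bit vector into a NumPy array."""
    arr = np.zeros((fp.GetNumBits(),), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

def fingerprint_matrices(smiles_list):
    """Morgan2 (radius 2, 2048 bits) and MACCS key matrices for a list of SMILES."""
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    morgan = np.vstack([fp_to_array(AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048))
                        for m in mols])
    maccs = np.vstack([fp_to_array(MACCSkeys.GenMACCSKeys(m)) for m in mols])
    return morgan, maccs

# Non-binary descriptors (e.g., a PaDEL descriptor matrix X_train_padel) are scaled
# to zero mean and unit variance; binary fingerprints are left unscaled.
# scaler = StandardScaler().fit(X_train_padel)
# X_train_padel = scaler.transform(X_train_padel)
# X_test_padel = scaler.transform(X_test_padel)
```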

3.3. Data Analysis

PCA was conducted with scikit-learn based on a subset of 53 physically meaningful, scaled “MOE 2D” descriptors (Table S1). RDKit was employed for generating Murcko scaffolds and calculating molecular similarity.

3.4. Compilation of Data Sets for Model Development

The merged LLNA data set was divided into a training set (80%) and a test set (20%) by stratified splitting with the train_test_split function of the model_selection module of scikit-learn (data shuffling prior to data set splitting enabled). This procedure was assigned a random state of 43.
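A sketch of this split, assuming X is the descriptor matrix and y the vector of binary class labels:

```python
from sklearn.model_selection import train_test_split

# X: descriptor matrix, y: binary labels (e.g., 1 = sensitizer, 0 = non-sensitizer)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, shuffle=True, random_state=43)
```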

3.5. Model Generation

Models were generated with scikit-learn and a random_state value of 43. Default settings were applied, with the exception of class_weight set to “balanced” for both RF and SVM. SVMs were used with a radial basis function (RBF) kernel. Optimal settings for n_estimators and max_features (RF models) and C and gamma (SVM models) were derived during grid search.
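The corresponding classifier setup can be sketched as follows; the hyperparameter values shown are placeholders for those actually selected during the grid search (summarized in Table S3).

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Placeholder hyperparameters; the values actually selected are summarized in Table S3.
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            class_weight="balanced", random_state=43)
svm = SVC(kernel="rbf", C=10, gamma=0.01, class_weight="balanced", random_state=43)

# rf.fit(X_train, y_train)
# p_sensitizer = rf.predict_proba(X_test)[:, 1]   # compared against an adjustable decision threshold
# svm.fit(X_train, y_train)
# svm_scores = svm.decision_function(X_test)      # signed distance from the SVM decision boundary
```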

3.6. Hardware and Software

All calculations were performed on Linux workstations running openSUSE Leap 15.0 and equipped with Intel i5 processors (3.2 GHz) and 16 GB of main memory.

4. Conclusions

Building on the works of Alves et al. and Di et al., we have compiled a collection of 1416 compounds annotated with binary LLNA outcomes. To our knowledge, this is the largest LLNA data set that has been used for the development of models predicting the skin sensitization potential of small organic molecules. As we show by chemical space analysis, those areas most densely populated by cosmetics, approved drugs and pesticides are also well covered by this new LLNA data set. The fraction of compounds covered by structurally related compounds in the new LLNA data set is much higher for cosmetics (30%) than for approved drugs (10%) and pesticides (13%). Therefore, the models are applicable to many compounds typically used in cosmetic products. However, there are chemical classes of drugs and cosmetics that are not adequately represented by the available LLNA data. This emphasizes the importance of considering the applicability domain of models.
An interesting observation was that a cluster of skin sensitizers and non-sensitizers with long aliphatic and halogenated chains could only be discriminated in the “MOE 2D” descriptor space but not in the Morgan2 fingerprint space, which should be taken into consideration for model building. The best models derived from the new LLNA data set obtained MCC and ACC values of up to 0.55 and 0.78 during cross-validation and of up to 0.52 and 0.76 on holdout data, respectively. Importantly, some of the models based entirely on free software and/or molecular descriptors of low complexity yielded comparable performance. We identified the RF_MACCS and SVM_PaDEL models as our preferred models, yielding MCC values of 0.41 and 0.47 on the holdout data. Comparison to existing models indicates that our models reach competitive performance. They are trained on a data set consisting of almost 3.5 times as many compounds as the one used by Di et al., and the full data set used for modeling and testing is also 42% larger than that of Alves et al. Given that the data set compiled by Di et al. holds, in particular, a diverse set of non-sensitizers not covered by Alves et al., we expect that our models, as they are based on the amalgamated data set, are more widely applicable and more reliable.
A major aspect of this work is the definition of an applicability domain for the individual models and the elaboration of means to estimate the reliability of predictions. The applicability domain was defined based on the mean similarity of a compound of interest to the five nearest neighbors in the training data (quantified in MACCS fingerprint space). The difference between the predicted class probability and the decision threshold, as well as the number of consecutive nearest neighbors in the training data that have the same activity class as the one predicted for the compound of interest, proved to be useful indicators of the reliability of predictions. We recommend considering predictions as reliable if all of the following conditions are met:
  • The compound of interest is within the applicability domain of the model.
  • The distance between the predicted class probability and the decision threshold is at least 0.15 for RF models and 0.5 for SVM models.
  • The predicted activity class for a compound of interest is in agreement with the class assigned to the nearest neighbor in the training data.
The public web service, available at https://nerdd.zbh.uni-hamburg.de/, provides access to the final RF_MACCS and SVM_PaDEL models (i.e., models trained on the complete LLNA data set). Users are provided detailed information on whether or not a compound of interest fulfills the three criteria itemized above. A warning is issued in case predictions are determined to be unreliable. Users may also adjust the decision threshold, allowing them, e.g., to increase the model’s sensitivity in scenarios where it is desirable to flag even substances with a low likelihood of being skin sensitizers.
We hope that the models will be well received by the scientific community and will make a contribution to the development and application of non-animal methods for the prediction of the skin sensitization potential of small organic molecules.

Supplementary Materials

Supplementary Materials can be found at https://www.mdpi.com/1422-0067/20/19/4833/s1.

Author Contributions

Conceptualization, A.W., J.K. (Jochen Kühnl) and J.K. (Johannes Kirchmair); Methodology, A.W., C.S., C.B., J.K. (Jochen Kühnl) and J.K. (Johannes Kirchmair); Software, A.W. and C.S.; Validation, A.W.; Formal Analysis, A.W.; Investigation, A.W., C.S., C.B., J.K. (Jochen Kühnl) and J.K. (Johannes Kirchmair); Resources, A.S., J.K. (Jochen Kühnl) and J.K. (Johannes Kirchmair); Data Curation, A.W.; Writing – Original Draft Preparation, A.W., C.S., C.B., A.S., J.K. (Jochen Kühnl) and J.K. (Johannes Kirchmair); Writing – Review & Editing, A.W., C.S., C.B., A.S., J.K. (Jochen Kühnl) and J.K. (Johannes Kirchmair); Visualization, A.W.; Supervision, A.S., J.K. (Jochen Kühnl) and J.K. (Johannes Kirchmair); Project Administration, J.K. (Jochen Kühnl) and J.K. (Johannes Kirchmair); Funding Acquisition, A.S., J.K. (Jochen Kühnl) and J.K. (Johannes Kirchmair).

Funding

A.W. is supported by Beiersdorf AG through HITeC e.V. C.B. and J.K. (Johannes Kirchmair) are supported by the Trond Mohn Foundation [BFS2017TMT01].

Conflicts of Interest

J.K. (Jochen Kühnl) and A.S. are employed at Beiersdorf AG and A.W. is funded by Beiersdorf AG through HITeC e.V. A.W., A.S. and J.K. (Jochen Kühnl) were involved in the design of the study, the interpretation of the data, the writing of the manuscript, and the decision to publish the results.

Abbreviations

ACC: accuracy
ACD: allergic contact dermatitis
AUC: area under the receiver operating characteristic curve
CCR: correct classification rate
LLNA: local lymph node assay
MCC: Matthews correlation coefficient
MOE: Molecular Operating Environment
NERDD: New E-Resource for Drug Discovery
NPV: negative predictive value
PCA: principal component analysis
PPV: positive predictive value
RBF: radial basis function
RF: random forest
Se: sensitivity
Sp: specificity
SVM: support vector machine

References

  1. Kimber, I.; Basketter, D.A.; Gerberick, G.F.; Ryan, C.A.; Dearman, R.J. Chemical allergy: Translating biology into hazard characterization. Toxicol. Sci. 2011, 120 (Suppl. 1), S238–S268.
  2. Thyssen, J.P.; Linneberg, A.; Menné, T.; Johansen, J.D. The epidemiology of contact allergy in the general population—prevalence and main findings. Contact Dermat. 2007, 57, 287–299.
  3. Lushniak, B.D. Occupational contact dermatitis. Dermatol. Ther. 2004, 17, 272–277.
  4. Anderson, S.E.; Siegel, P.D.; Meade, B.J. The LLNA: A brief review of recent advances and limitations. J. Allergy 2011, 2011.
  5. Dent, M.; Amaral, R.T.; Da Silva, P.A.; Ansell, J.; Boisleve, F.; Hatao, M.; Hirose, A.; Kasai, Y.; Kern, P.; Kreiling, R.; et al. Principles underpinning the use of new methodologies in the risk assessment of cosmetic ingredients. Comput. Toxicol. 2018, 7, 20–26.
  6. Mehling, A.; Eriksson, T.; Eltze, T.; Kolle, S.; Ramirez, T.; Teubner, W.; van Ravenzwaay, B.; Landsiedel, R. Non-animal test methods for predicting skin sensitization potentials. Arch. Toxicol. 2012, 86, 1273–1295.
  7. Reisinger, K.; Hoffmann, S.; Alépée, N.; Ashikaga, T.; Barroso, J.; Elcombe, C.; Gellatly, N.; Galbiati, V.; Gibbs, S.; Groux, H.; et al. Systematic evaluation of non-animal test methods for skin sensitisation safety assessment. Toxicol. In Vitro 2015, 29, 259–270.
  8. Ezendam, J.; Braakhuis, H.M.; Vandebriel, R.J. State of the art in non-animal approaches for skin sensitization testing: From individual test methods towards testing strategies. Arch. Toxicol. 2016, 90, 2861–2883.
  9. Thyssen, J.P.; Giménez-Arnau, E.; Lepoittevin, J.-P.; Menné, T.; Boman, A.; Schnuch, A. The critical review of methodologies and approaches to assess the inherent skin sensitization potential (skin allergies) of chemicals. Part I. Contact Dermat. 2012, 66 (Suppl. 1), 11–24.
  10. Wilm, A.; Kühnl, J.; Kirchmair, J. Computational approaches for skin sensitization prediction. Crit. Rev. Toxicol. 2018, 48, 738–760.
  11. ECHA (European Chemicals Agency). The Use of Alternatives to Testing on Animals for the REACH Regulation, Third Report under Article 117(3) of the REACH Regulation. Available online: https://echa.europa.eu/documents/10162/13639/alternatives_test_animals_2017_en.pdf (accessed on 10 July 2019).
  12. Kleinstreuer, N.C.; Hoffmann, S.; Alépée, N.; Allen, D.; Ashikaga, T.; Casey, W.; Clouet, E.; Cluzel, M.; Desprez, B.; Gellatly, N.; et al. Non-animal methods to predict skin sensitization (II): An assessment of defined approaches. Crit. Rev. Toxicol. 2018, 48, 359–374.
  13. Luechtefeld, T.; Marsh, D.; Rowlands, C.; Hartung, T. Machine learning of toxicological big data enables read-across structure activity relationships (RASAR) outperforming animal test reproducibility. Toxicol. Sci. 2018, 165, 198–212.
  14. Luechtefeld, T.; Rowlands, C.; Hartung, T. Big-data and machine learning to revamp computational toxicology and its use in risk assessment. Toxicol. Res. 2018, 7, 732–744.
  15. Alves, V.M.; Borba, J.; Capuzzi, S.J.; Muratov, E.; Andrade, C.H.; Rusyn, I.; Tropsha, A. Oy vey! A comment on “Machine learning of toxicological big data enables read-across structure activity relationships outperforming animal test reproducibility”. Toxicol. Sci. 2019, 167, 3–4.
  16. Luechtefeld, T.; Marsh, D.; Hartung, T. Missing the difference between big data and artificial intelligence in RASAR versus traditional QSAR. Toxicol. Sci. 2019, 167, 4–5.
  17. Tung, C.-W.; Lin, Y.-H.; Wang, S.-S. Transfer learning for predicting human skin sensitizers. Arch. Toxicol. 2019, 93, 931–940.
  18. Chilton, M.L.; Macmillan, D.S.; Steger-Hartmann, T.; Hillegass, J.; Bellion, P.; Vuorinen, A.; Etter, S.; Smith, B.P.C.; White, A.; Sterchele, P.; et al. Making reliable negative predictions of human skin sensitisation using an in silico fragmentation approach. Regul. Toxicol. Pharm. 2018, 95, 227–235.
  19. Braga, R.C.; Alves, V.M.; Muratov, E.N.; Strickland, J.; Kleinstreuer, N.; Tropsha, A.; Andrade, C.H. Pred-Skin: A fast and reliable web application to assess skin sensitization effect of chemicals. J. Chem. Inf. Model. 2017, 57, 1013–1017.
  20. Kim, J.Y.; Kim, M.K.; Kim, K.-B.; Kim, H.S.; Lee, B.-M. Quantitative structure–activity and quantitative structure–property relationship approaches as alternative skin sensitization risk assessment methods. J. Toxicol. Environ. Health 2019, 82, 447–472.
  21. Toropov, A.A.; Toropova, A.P.; Selvestrel, G.; Benfenati, E. Idealization of correlations between optimal simplified molecular input-line entry system-based descriptors and skin sensitization. SAR QSAR Environ. Res. 2019, 30, 447–455.
  22. Di, P.; Yin, Y.; Jiang, C.; Cai, Y.; Li, W.; Tang, Y.; Liu, G. Prediction of the skin sensitising potential and potency of compounds via mechanism-based binary and ternary classification models. Toxicol. In Vitro 2019, 59, 204–214.
  23. Alves, V.M.; Muratov, E.; Fourches, D.; Strickland, J.; Kleinstreuer, N.; Andrade, C.H.; Tropsha, A. Predicting chemically-induced skin reactions. Part I: QSAR models of skin sensitization and their application to identify potentially hazardous compounds. Toxicol. Appl. Pharmacol. 2015, 284, 262–272.
  24. Lu, J.; Zheng, M.; Wang, Y.; Shen, Q.; Luo, X.; Jiang, H.; Chen, K. Fragment-based prediction of skin sensitization using recursive partitioning. J. Comput. Aided Mol. Des. 2011, 25, 885–893.
  25. Chaudhry, Q.; Piclin, N.; Cotterill, J.; Pintore, M.; Price, N.R.; Chrétien, J.R.; Roncaglioni, A. Global QSAR models of skin sensitisers for regulatory purposes. Chem. Cent. J. 2010, 4, S5.
  26. Enoch, S.J.; Roberts, D.W. Predicting skin sensitization potency for Michael acceptors in the LLNA using quantum mechanics calculations. Chem. Res. Toxicol. 2013, 26, 767–774.
  27. Hoffmann, S. LLNA variability: An essential ingredient for a comprehensive assessment of non-animal skin sensitization test methods and strategies. ALTEX 2015, 32, 379–383.
  28. Alves, V.M.; Capuzzi, S.J.; Braga, R.C.; Borba, J.V.B.; Silva, A.C.; Luechtefeld, T.; Hartung, T.; Andrade, C.H.; Muratov, E.N.; Tropsha, A. A perspective and a new integrated computational strategy for skin sensitization assessment. ACS Sustain. Chem. Eng. 2018, 6, 2845–2859.
  29. Apt Systems Ltd. OASIS QSAR Toolbox 4.3. Available online: http://oasis-lmc.org/products/software/toolbox.aspx (accessed on 10 July 2019).
  30. Chembench. Available online: https://chembench.mml.unc.edu (accessed on 26 April 2019).
  31. CosIng—Cosmetics—GROWTH—European Commission. Available online: http://ec.europa.eu/growth/tools-databases/cosing/index.cfm?fuseaction=search.simple (accessed on 26 April 2019).
  32. DrugBank Version 5.1.2. Available online: https://www.drugbank.ca (accessed on 7 May 2019).
  33. Wishart, D.S.; Feunang, Y.D.; Guo, A.C.; Lo, E.J.; Marcu, A.; Grant, J.R.; Sajed, T.; Johnson, D.; Li, C.; Sayeeda, Z.; et al. DrugBank 5.0: A major update to the DrugBank database for 2018. Nucleic Acids Res. 2018, 46, D1074–D1082.
  34. EU Pesticides Database—European Commission. Available online: http://ec.europa.eu/food/plant/pesticides/eu-pesticides-database/public/?event=activesubstance.selection&language=EN (accessed on 25 February 2019).
  35. Chemical Identifier Resolver. Available online: https://cactus.nci.nih.gov/chemical/structure (accessed on 25 February 2019).
  36. Chemical Computing Group. Molecular Operating Environment (MOE). Available online: https://www.chemcomp.com/Products.htm (accessed on 12 June 2019).
  37. PaDEL-Descriptor. Available online: http://www.yapcwsoft.com/dd/padeldescriptor/ (accessed on 10 May 2019).
  38. Yap, C.W. PaDEL-Descriptor: An open source software to calculate molecular descriptors and fingerprints. J. Comput. Chem. 2011, 32, 1466–1474.
  39. Landrum, G. RDKit. Available online: http://www.rdkit.org (accessed on 26 April 2019).
  40. Matthews, B.W. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta Protein Struct. 1975, 405, 442–451.
  41. Stork, C.; Embruch, G.; Šícho, M.; de Bruyn Kops, C.; Chen, Y.; Svozil, D.; Kirchmair, J. NERDD: A web portal providing access to in silico tools for drug discovery. Bioinformatics 2019.
  42. Stork, C.; Wagner, J.; Friedrich, N.-O.; de Bruyn Kops, C.; Šícho, M.; Kirchmair, J. Hit Dexter: A machine-learning model for the prediction of frequent hitters. ChemMedChem 2018, 13, 564–571.
  43. MolVS Version 0.1.1. Available online: https://github.com/mcs07/MolVS (accessed on 26 April 2019).
  44. Scikit-Learn: Machine Learning in Python—Scikit-Learn 0.21.0 Documentation. Available online: https://scikit-learn.org/stable/ (accessed on 10 May 2019).
Figure 1. Score plot comparing the chemical space of compounds of the merged LLNA data set, cosmetics, approved drugs and pesticides. The plot is derived from a principal component analysis (PCA) based on 53 intuitive and physically meaningful molecular descriptors such as molecular weight and clogP (see Methods and Table S1 for details). Data points located in the lower parts of the PCA score plot are primarily cosmetics with long aliphatic and often halogenated chains; towards the top right corner of the diagram these are primarily large drug molecules with strong aromatic components. The variance explained by the first two principal components is reported in the axis titles. Four compounds of the cosmetics reference set and eight compounds of the approved drugs reference set are not shown because they are off the chosen limits of the plot (these are complex and large molecules, with a molecular weight of 2800 Da and higher).
Figure 2. Molecular similarity between each compound of the reference data sets (i.e., cosmetics, approved drugs and pesticides data sets) and its nearest neighbor in the merged local lymph node assay (LLNA) data set (similarity quantified as Tanimoto coefficient based on Morgan2 fingerprints with a length of 2048 bits).
Figure 3. Score plot comparing the chemical space of compounds of the local lymph node assay (LLNA) data sets of Alves et al. and Di et al. The score plot was derived from a PCA based on the identical setup described in the caption of Figure 1. Two data points are located outside the displayed intervals.
Figure 4. Pairwise molecular similarity within the individual data sets (similarity quantified as Tanimoto coefficient based on Morgan2 fingerprints with a length of 2048 bits).
Figure 5. Principal component analysis (PCA) of the physicochemical properties of skin sensitizers and non-sensitizers included in the merged local lymph node assay (LLNA) data set. The PCA is based on the identical setup described in the caption of Figure 1. (a) Score plot, with the percentage of variance explained by the individual principal components reported as part of the axis labels. Two data points are located outside the displayed intervals. (b) Loadings plot (an enlarged version is provided in Figure S1; the abbreviations of the individual molecular descriptors are explained in Table S1). (c) Detailed view of the lower left region of the score plot, where mainly sensitizers are observed to form a line of data points. These sensitizers are aliphatic, monohalogenated hydrocarbons that differ primarily by chain length and halogen atom type. (d) Detailed view of the upper right part of the score plot, where mainly non-sensitizing compounds are located, characterized by high molecular weight, aromaticity and a large topological polar surface area.
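The type of principal component analysis shown in Figures 3 and 5 can be sketched with scikit-learn as follows. The descriptor matrix and labels below are random placeholders; the actual analysis is based on the physicochemical descriptors listed in Table S1, which are not reproduced here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder descriptor matrix (compounds x physicochemical descriptors)
# and sensitizer/non-sensitizer labels; the real input would be the
# descriptors of Table S1 computed for the merged LLNA data set.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = rng.integers(0, 2, size=100)

# Standardize descriptors so that each contributes on a comparable scale,
# then project onto the first two principal components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
scores = pca.fit_transform(X_scaled)

print("Explained variance ratio:", pca.explained_variance_ratio_)
# pca.components_ holds the loadings used for plots such as Figure 5b.
```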
Figure 6. Matthews correlation coefficient (MCC) as a function of molecular similarity between the query compounds and the one, three and five nearest neighbors in the training data (calculated as averaged Tanimoto coefficients based on MACCS fingerprints). (a) SVM_MOE2D+OASIS; (b) SVM_PaDEL+OASIS; (c) SVM_PaDEL; (d) RF_MACCS; (e) SVM_PaDEL+MACCS. Pearson correlation coefficients are reported in brackets in the figure legends. The number of compounds in each bin is summarized in Table S7.
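The x-axis values in Figure 6 are average Tanimoto similarities between a query compound and its most similar training compounds. A sketch of that calculation with RDKit MACCS keys is given below; the SMILES lists are placeholders, not the authors' data or code.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import MACCSkeys

# Placeholder training compounds (in practice: the LLNA training set).
training_smiles = ["CC(=O)Nc1ccc(O)cc1", "O=C(O)c1ccccc1O", "CCOC(=O)C", "c1ccccc1"]
training_fps = [MACCSkeys.GenMACCSKeys(Chem.MolFromSmiles(s)) for s in training_smiles]

def mean_knn_similarity(query_smiles, k=5):
    """Mean Tanimoto similarity of a query to its k most similar training compounds."""
    query_fp = MACCSkeys.GenMACCSKeys(Chem.MolFromSmiles(query_smiles))
    sims = sorted(DataStructs.BulkTanimotoSimilarity(query_fp, training_fps), reverse=True)
    top_k = sims[:k]
    return sum(top_k) / len(top_k)

print(mean_knn_similarity("CC(=O)Oc1ccccc1C(=O)O", k=3))
```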
Figure 7. Matthews correlation coefficient (MCC), sensitivity and specificity as a function of the decision threshold, for (a) SVM_MOE2D+OASIS; (b) SVM_PaDEL+OASIS; (c) SVM_PaDEL; (d) RF_MACCS; (e) SVM_PaDEL+MACCS. Note that different X-axis scales are applied to the graphs illustrating the performance of random forest (RF) and support vector machine (SVM) models.
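A curve of the kind shown in Figure 7 can be generated from predicted scores and true labels by sweeping the decision threshold, as in the hedged sketch below. The score and label arrays are placeholders; the scores may be RF class probabilities or SVM outputs on a different scale, which is presumably why Figure 7 uses different x-axis ranges for the two model types.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef

# Placeholder true labels and model scores for a small set of compounds.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
scores = np.array([0.9, 0.2, 0.6, 0.8, 0.4, 0.1, 0.55, 0.35, 0.7, 0.45])

for threshold in np.linspace(0.1, 0.9, 9):
    y_pred = (scores >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    mcc = matthews_corrcoef(y_true, y_pred)
    print(f"t={threshold:.1f}  MCC={mcc:.2f}  Se={sensitivity:.2f}  Sp={specificity:.2f}")
```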
Figure 8. Matthews correlation coefficient (MCC) as a function of the distance between the predicted class probabilities and the decision thresholds, for the (a) support vector machine (SVM) models and (b) random forest (RF) model. The number of compounds in each bin is summarized in Table S8.
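In simplified form, the reliability indicator examined in Figure 8 is the absolute difference between a model's output and its decision threshold. The sketch below bins placeholder predictions accordingly; it is not the authors' implementation.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

# Placeholder true labels and predicted class probabilities.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0])
proba = np.array([0.95, 0.1, 0.55, 0.8, 0.52, 0.2, 0.45, 0.4, 0.9, 0.05, 0.52, 0.48])
threshold = 0.5

y_pred = (proba >= threshold).astype(int)
distance = np.abs(proba - threshold)  # distance of the output from the threshold

# Evaluate predictions separately for "confident" and "borderline" compounds.
for name, mask in [("distance >= 0.15", distance >= 0.15),
                   ("distance < 0.15", distance < 0.15)]:
    print(name, "MCC =", round(matthews_corrcoef(y_true[mask], y_pred[mask]), 2))
```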
Figure 9. Matthews correlation coefficient (MCC) as a function of the number of consecutive nearest neighbors in the training data that belong to the same activity class as the one predicted for the compound of interest (molecular similarity quantified as Tanimoto coefficient based on MACCS fingerprints). The number of compounds in each bin is summarized in Table S9. The graphs for SVM_PaDEL+OASIS and SVM_PaDEL+MACCS are not shown because they are almost identical to that of SVM_PaDEL and would overlap.
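The neighbor-concordance indicator of Figure 9 can be computed roughly as follows: sort the training compounds by similarity to the query and count how many of the nearest ones, without interruption, carry the predicted class. The function name and data below are illustrative.

```python
from itertools import takewhile

def consecutive_concordant_neighbors(similarities, neighbor_labels, predicted_class):
    """Count the leading nearest neighbors whose class matches the predicted class.

    similarities and neighbor_labels refer to the training compounds;
    counting stops at the first neighbor with a different class.
    """
    ordered = sorted(zip(similarities, neighbor_labels), reverse=True)
    return sum(1 for _, label in takewhile(lambda x: x[1] == predicted_class, ordered))

# Toy example: the three most similar training compounds are sensitizers (1),
# matching the predicted class, so the indicator equals 3.
sims = [0.91, 0.85, 0.80, 0.78, 0.60]
labels = [1, 1, 1, 0, 1]
print(consecutive_concordant_neighbors(sims, labels, predicted_class=1))
```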
Table 1. Overview of all data sets used in this work.
| | LLNA Data Set Compiled by Alves et al. | LLNA Data Set Compiled by Di et al. | Merged LLNA Data Set | Cosmetic Substances and Ingredients Data Set | Approved Drugs Data Set | Pesticides Data Set |
|---|---|---|---|---|---|---|
| Data source | Chembench [30] ¹ | Supporting information of Di et al. [22] | LLNA data sets of Alves et al. and Di et al. | CosIng Database [31] | "Approved Drugs" subset of DrugBank [32,33] ² | EU Pesticides Database [34] |
| Number of compounds prior to data preprocessing | 1000 | 1007 | 1993 | 5937 | 2352 | 1383 |
| Number of compounds after data preprocessing | 1000 | 993 ³ | 1416 ⁴ (1132/284) ⁵ | 4643 ⁶ | 2155 ⁷ | 812 ⁸ |
| Number of sensitizers | 481 | 364 | 572 (457/115) ⁵ | n/a | n/a | n/a |
| Number of non-sensitizers | 519 | 629 | 844 (675/169) ⁵ | n/a | n/a | n/a |
| Number of Murcko scaffolds | 312 | 354 | 453 | 856 | 1158 | 329 |
| Proportion of compounds without a Murcko scaffold | 0.32 | 0.29 | 0.31 | 0.42 | 0.13 | 0.24 |
| Proportion of singleton scaffolds | 0.77 | 0.79 | 0.78 | 0.72 | 0.82 | 0.81 |
1 Chapel Hill, NC, United States. 2 Edmonton, Alberta, Canada. 3 Thirteen compounds were removed as part of the deduplication procedure; one compound was removed because of conflicting activity assignments. 4 Five hundred and sixty-seven compounds were removed as part of the deduplication procedure; ten compounds were removed because of conflicting activity assignments. 5 Number of compounds in the training set/test set prior to descriptor calculation. 6 One hundred and four compounds were removed by the salt filter because the main component could not be unambiguously identified; 26 compounds were removed due to invalid input structures; 1164 compounds were removed as part of the deduplication procedure. 7 Thirty-one compounds were removed by the salt filter because the main component could not be unambiguously identified; 166 compounds were removed as part of the deduplication procedure. 8 The SMILES notations of 893 compounds present in the EU Pesticides Database were automatically retrieved with the Chemical Identifier Resolver [35]. Six compounds were removed by the salt filter because the main component could not be identified; 13 compounds were removed due to invalid input structures; 62 compounds were removed as part of the deduplication procedure. Abbreviations: LLNA, local lymph node assay.
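The scaffold statistics in the last three rows of Table 1 can be reproduced along these lines with RDKit. The SMILES list is a placeholder, and the denominator used for the singleton proportion is one plausible reading of the table.

```python
from collections import Counter
from rdkit.Chem.Scaffolds import MurckoScaffold

smiles_list = ["CC(=O)Nc1ccc(O)cc1", "c1ccccc1O", "CCCCCCBr", "CCO"]  # placeholders

# Murcko scaffold SMILES; acyclic molecules yield an empty string.
scaffolds = [MurckoScaffold.MurckoScaffoldSmiles(smiles=s) for s in smiles_list]

no_scaffold = sum(1 for s in scaffolds if s == "")          # compounds without a scaffold
scaffold_counts = Counter(s for s in scaffolds if s != "")
singletons = sum(1 for c in scaffold_counts.values() if c == 1)

print("Murcko scaffolds:", len(scaffold_counts))
print("Proportion of compounds without a Murcko scaffold:", no_scaffold / len(smiles_list))
# One plausible definition: singleton scaffolds relative to all distinct scaffolds.
print("Proportion of singleton scaffolds:", singletons / max(len(scaffold_counts), 1))
```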
Table 2. Overlaps between the compounds contained in the LLNA data sets and the cosmetics, approved drugs and pesticides data sets.
| Reference Data Set | Number of Compounds | Overlap with Data Set Compiled by Alves et al. | Overlap with Data Set Compiled by Di et al. | Overlap with Merged LLNA Data Set |
|---|---|---|---|---|
| Cosmetics | 4643 | 324 | 252 | 387 |
| Approved Drugs | 2155 | 88 | 68 | 97 |
| Pesticides | 812 | 43 | 34 | 44 |
Abbreviations: LLNA, local lymph node assay.
Table 3. Overview of descriptor sets evaluated in this work.
| Descriptor Set | Short Name | Number of Descriptors/Length of the Fingerprint | Calculated with | Successfully Processed Molecules ¹ (Training Set) | Successfully Processed Molecules ¹ (Test Set) |
|---|---|---|---|---|---|
| 0D, 1D and 2D descriptors | MOE2D | 206 | MOE [36]; this set corresponds to all descriptors listed as "2D descriptors" in MOE | 1132 | 284 |
| Selection of 0D, 1D and 2D descriptors | MOE2D_53 | 53 ² | MOE [36] | 1132 | 284 |
| 0D, 1D and 2D descriptors | PaDEL | 1444 | PaDEL [37,38]; this is the complete set of 0D, 1D and 2D descriptors implemented in PaDEL | 1109 | 279 |
| MACCS keys | MACCS | 166 | RDKit [39] | 1132 | 284 |
| Morgan2 fingerprints | Morgan2 | 2048 | RDKit [39] | 1132 | 284 |
| OASIS skin sensitization protein binding fingerprint | OASIS | 5-bit fingerprint | OECD Toolbox [29] | 1128 | 283 |
| PaDEL EState fingerprint | PaDEL_Est | 79 | PaDEL [37,38] | 1132 | 284 |
| PaDEL extended fingerprint | PaDEL_Ext | 1024 | PaDEL [37,38] | 1132 | 284 |
1 Descriptor calculation failed for individual compounds depending on the software used. For this reason, there are marginal differences in the composition of the individual data sets used for model development. 2 Fifty-three manually selected, physically meaningful descriptors. A list of the selected descriptors can be found in Table S1. Abbreviations: MOE, Molecular Operating Environment.
Table 4. Overview of models and their performance during cross-validation.
| Name | Number of Descriptors | Number of Compounds in Training Data | ACC | ACC STDEV | MCC | MCC STDEV | AUC | CCR | Se | Sp | PPV | NPV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SVM_MOE2D+OASIS | 211 | 1128 | 0.78 | 0.054 | 0.55 | 0.109 | 0.83 | 0.78 | 0.77 | 0.78 | 0.71 | 0.83 |
| SVM_PaDEL+MACCS | 1610 | 1108 | 0.76 | 0.035 | 0.51 | 0.069 | 0.83 | 0.76 | 0.75 | 0.76 | 0.69 | 0.82 |
| SVM_PaDEL+Morgan2 | 3492 | 1108 | 0.76 | 0.036 | 0.51 | 0.078 | 0.82 | 0.75 | 0.66 | 0.83 | 0.73 | 0.78 |
| SVM_PaDEL+PaDEL-Ext | 2468 | 1109 | 0.76 | 0.039 | 0.51 | 0.075 | 0.84 | 0.76 | 0.74 | 0.78 | 0.7 | 0.81 |
| SVM_MOE2D+MACCS | 372 | 1132 | 0.76 | 0.047 | 0.5 | 0.096 | 0.81 | 0.74 | 0.68 | 0.81 | 0.71 | 0.79 |
| SVM_MOE2D+Morgan2 | 2254 | 1132 | 0.75 | 0.041 | 0.5 | 0.081 | 0.83 | 0.75 | 0.77 | 0.73 | 0.66 | 0.83 |
| SVM_MOE2D+PaDEL | 1680 | 1109 | 0.76 | 0.039 | 0.5 | 0.079 | 0.83 | 0.75 | 0.74 | 0.77 | 0.69 | 0.81 |
| SVM_MOE2D+PaDEL-Est | 285 | 1132 | 0.76 | 0.039 | 0.5 | 0.081 | 0.81 | 0.75 | 0.68 | 0.81 | 0.71 | 0.79 |
| SVM_MOE2D+PaDEL-Ext | 1230 | 1132 | 0.75 | 0.054 | 0.5 | 0.105 | 0.83 | 0.75 | 0.75 | 0.76 | 0.68 | 0.81 |
| SVM_PaDEL | 1444 | 1109 | 0.75 | 0.038 | 0.5 | 0.075 | 0.83 | 0.75 | 0.75 | 0.75 | 0.68 | 0.81 |
| SVM_PaDEL+OASIS | 1449 | 1109 | 0.75 | 0.038 | 0.5 | 0.075 | 0.83 | 0.75 | 0.75 | 0.75 | 0.68 | 0.81 |
| SVM_PaDEL+PaDEL-Est | 1523 | 1109 | 0.75 | 0.038 | 0.5 | 0.075 | 0.83 | 0.75 | 0.75 | 0.75 | 0.68 | 0.81 |
| RF_PaDEL+MACCS | 1610 | 1108 | 0.76 | 0.018 | 0.49 | 0.037 | 0.82 | 0.73 | 0.62 | 0.85 | 0.74 | 0.77 |
| RF_PaDEL+Morgan2 | 3492 | 1108 | 0.76 | 0.02 | 0.49 | 0.042 | 0.82 | 0.74 | 0.64 | 0.84 | 0.73 | 0.77 |
| RF_PaDEL+OASIS | 1449 | 1109 | 0.76 | 0.02 | 0.49 | 0.043 | 0.82 | 0.74 | 0.62 | 0.85 | 0.74 | 0.77 |
| RF_PaDEL+PaDEL-Ext | 2468 | 1109 | 0.76 | 0.022 | 0.49 | 0.048 | 0.82 | 0.73 | 0.61 | 0.86 | 0.75 | 0.76 |
| SVM_PaDEL-Est+MACCS | 245 | 1132 | 0.75 | 0.051 | 0.49 | 0.106 | 0.81 | 0.74 | 0.69 | 0.8 | 0.7 | 0.79 |
| RF_MOE2D+PaDEL | 1680 | 1109 | 0.75 | 0.034 | 0.48 | 0.072 | 0.83 | 0.73 | 0.62 | 0.84 | 0.73 | 0.77 |
| RF_Morgan2+PaDEL-Est | 2127 | 1132 | 0.76 | 0.033 | 0.48 | 0.071 | 0.82 | 0.73 | 0.63 | 0.84 | 0.73 | 0.77 |
| RF_PaDEL | 1444 | 1109 | 0.75 | 0.015 | 0.48 | 0.033 | 0.82 | 0.73 | 0.62 | 0.84 | 0.73 | 0.76 |
| RF_PaDEL-Est+OASIS | 84 | 1128 | 0.75 | 0.043 | 0.48 | 0.091 | 0.8 | 0.74 | 0.65 | 0.82 | 0.72 | 0.78 |
| SVM_MACCS+OASIS | 171 | 1128 | 0.75 | 0.047 | 0.48 | 0.102 | 0.82 | 0.74 | 0.69 | 0.79 | 0.69 | 0.79 |
| SVM_MOE2D | 206 | 1132 | 0.74 | 0.037 | 0.48 | 0.067 | 0.82 | 0.74 | 0.75 | 0.74 | 0.66 | 0.82 |
| SVM_Morgan2+PaDEL-Ext | 3072 | 1132 | 0.75 | 0.044 | 0.48 | 0.09 | 0.82 | 0.74 | 0.68 | 0.8 | 0.7 | 0.79 |
| RF_MACCS | 166 | 1132 | 0.75 | 0.039 | 0.47 | 0.088 | 0.81 | 0.73 | 0.61 | 0.84 | 0.73 | 0.76 |
| RF_MACCS+OASIS | 171 | 1128 | 0.75 | 0.034 | 0.47 | 0.074 | 0.8 | 0.73 | 0.6 | 0.85 | 0.74 | 0.76 |
| RF_PaDEL+PaDEL-Est | 1523 | 1109 | 0.75 | 0.028 | 0.47 | 0.06 | 0.83 | 0.73 | 0.61 | 0.85 | 0.73 | 0.76 |
| SVM_MACCS | 166 | 1132 | 0.74 | 0.057 | 0.47 | 0.12 | 0.81 | 0.73 | 0.69 | 0.78 | 0.68 | 0.79 |
| SVM_PaDEL-Est+OASIS | 84 | 1128 | 0.74 | 0.048 | 0.47 | 0.099 | 0.8 | 0.74 | 0.71 | 0.76 | 0.67 | 0.8 |
| SVM_PaDEL-Est+PaDEL-Ext | 1103 | 1132 | 0.74 | 0.039 | 0.47 | 0.08 | 0.81 | 0.74 | 0.7 | 0.78 | 0.68 | 0.79 |
| SVM_PaDEL-Ext | 1024 | 1132 | 0.74 | 0.046 | 0.47 | 0.093 | 0.81 | 0.73 | 0.7 | 0.77 | 0.68 | 0.79 |
| SVM_PaDEL-Ext+OASIS | 1029 | 1128 | 0.74 | 0.036 | 0.47 | 0.072 | 0.82 | 0.74 | 0.7 | 0.77 | 0.68 | 0.79 |
| RF_MOE2D+Morgan2 | 2254 | 1132 | 0.74 | 0.033 | 0.46 | 0.071 | 0.81 | 0.72 | 0.62 | 0.82 | 0.71 | 0.76 |
| RF_PaDEL-Est+MACCS | 245 | 1132 | 0.75 | 0.045 | 0.46 | 0.1 | 0.81 | 0.72 | 0.59 | 0.85 | 0.73 | 0.76 |
| RF_Morgan2 | 2048 | 1132 | 0.74 | 0.039 | 0.46 | 0.081 | 0.81 | 0.73 | 0.64 | 0.81 | 0.7 | 0.77 |
| SVM_Morgan2+MACCS | 2214 | 1132 | 0.74 | 0.058 | 0.46 | 0.117 | 0.8 | 0.73 | 0.68 | 0.78 | 0.68 | 0.78 |
| SVM_PaDEL-Ext+MACCS | 1190 | 1132 | 0.74 | 0.047 | 0.46 | 0.097 | 0.81 | 0.73 | 0.68 | 0.77 | 0.68 | 0.78 |
| RF_MOE2D+OASIS | 211 | 1128 | 0.74 | 0.041 | 0.45 | 0.09 | 0.81 | 0.71 | 0.6 | 0.83 | 0.71 | 0.75 |
| RF_MOE2D+PaDEL-Est | 285 | 1132 | 0.74 | 0.032 | 0.45 | 0.07 | 0.81 | 0.72 | 0.6 | 0.84 | 0.72 | 0.75 |
| RF_MOE2D+PaDEL-Ext | 1230 | 1132 | 0.74 | 0.017 | 0.45 | 0.037 | 0.82 | 0.72 | 0.58 | 0.85 | 0.73 | 0.75 |
| RF_MOE2D | 206 | 1132 | 0.73 | 0.036 | 0.44 | 0.078 | 0.81 | 0.71 | 0.59 | 0.83 | 0.71 | 0.75 |
| RF_MOE2D+MACCS | 372 | 1132 | 0.73 | 0.033 | 0.44 | 0.072 | 0.81 | 0.71 | 0.58 | 0.84 | 0.71 | 0.75 |
| RF_Morgan2+MACCS | 2214 | 1132 | 0.73 | 0.039 | 0.44 | 0.086 | 0.8 | 0.72 | 0.63 | 0.8 | 0.68 | 0.76 |
| RF_Morgan2+OASIS | 2053 | 1128 | 0.74 | 0.029 | 0.44 | 0.063 | 0.82 | 0.71 | 0.59 | 0.83 | 0.71 | 0.75 |
| RF_Morgan2+PaDEL-Ext | 3072 | 1132 | 0.73 | 0.036 | 0.44 | 0.081 | 0.81 | 0.71 | 0.56 | 0.85 | 0.72 | 0.74 |
| SVM_MOE2D_53 | 53 | 1132 | 0.71 | 0.037 | 0.44 | 0.069 | 0.78 | 0.72 | 0.76 | 0.68 | 0.62 | 0.81 |
| SVM_PaDEL-Est | 79 | 1132 | 0.72 | 0.037 | 0.44 | 0.073 | 0.77 | 0.72 | 0.71 | 0.73 | 0.64 | 0.79 |
| RF_PaDEL-Est | 79 | 1132 | 0.73 | 0.022 | 0.43 | 0.042 | 0.77 | 0.71 | 0.64 | 0.79 | 0.67 | 0.76 |
| RF_PaDEL-Ext+MACCS | 1190 | 1132 | 0.73 | 0.037 | 0.43 | 0.081 | 0.81 | 0.7 | 0.55 | 0.85 | 0.72 | 0.74 |
| RF_PaDEL-Ext+OASIS | 1029 | 1128 | 0.73 | 0.033 | 0.43 | 0.072 | 0.8 | 0.7 | 0.57 | 0.84 | 0.71 | 0.74 |
| RF_PaDEL-Ext+PaDEL-Est | 1103 | 1132 | 0.73 | 0.034 | 0.43 | 0.074 | 0.8 | 0.7 | 0.56 | 0.85 | 0.72 | 0.74 |
| SVM_Morgan2+OASIS | 2053 | 1128 | 0.73 | 0.038 | 0.43 | 0.089 | 0.8 | 0.69 | 0.51 | 0.88 | 0.75 | 0.73 |
| SVM_Morgan2+PaDEL-Est | 2127 | 1132 | 0.72 | 0.035 | 0.43 | 0.064 | 0.79 | 0.72 | 0.69 | 0.75 | 0.65 | 0.78 |
| RF_MOE2D_53 | 53 | 1132 | 0.73 | 0.039 | 0.42 | 0.086 | 0.78 | 0.7 | 0.58 | 0.83 | 0.69 | 0.74 |
| RF_PaDEL-Ext | 1024 | 1132 | 0.72 | 0.039 | 0.42 | 0.088 | 0.79 | 0.7 | 0.55 | 0.84 | 0.71 | 0.73 |
| SVM_Morgan2 | 2048 | 1132 | 0.72 | 0.031 | 0.39 | 0.072 | 0.8 | 0.68 | 0.49 | 0.87 | 0.72 | 0.71 |
| SVM_OASIS | 5 | 1128 | 0.67 | 0.064 | 0.29 | 0.151 | 0.63 | 0.62 | 0.37 | 0.87 | 0.68 | 0.67 |
| RF_OASIS | 5 | 1128 | 0.66 | 0.054 | 0.27 | 0.122 | 0.64 | 0.63 | 0.43 | 0.82 | 0.62 | 0.68 |
Abbreviations: ACC, accuracy; AUC, area under the receiver operating characteristic curve; CCR, correct classification rate; MCC, Matthews correlation coefficient; NPV, negative predictive value; PPV, positive predictive value; Se, sensitivity; Sp, specificity; STDEV, standard deviation.
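The column abbreviations correspond to standard binary classification metrics. Given pooled predictions and class probabilities, they can be computed as in the sketch below (the arrays are placeholders); CCR is taken here as the mean of sensitivity and specificity, i.e., balanced accuracy.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef, roc_auc_score

# Placeholder cross-validation results: true labels, predicted labels and
# predicted probabilities for the positive (sensitizer) class.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
acc = (tp + tn) / (tp + tn + fp + fn)
se = tp / (tp + fn)                # sensitivity
sp = tn / (tn + fp)                # specificity
ccr = (se + sp) / 2                # correct classification rate (balanced accuracy)
ppv = tp / (tp + fp)               # positive predictive value
npv = tn / (tn + fn)               # negative predictive value
mcc = matthews_corrcoef(y_true, y_pred)
auc = roc_auc_score(y_true, y_prob)

print(f"ACC={acc:.2f} MCC={mcc:.2f} AUC={auc:.2f} CCR={ccr:.2f} "
      f"Se={se:.2f} Sp={sp:.2f} PPV={ppv:.2f} NPV={npv:.2f}")
```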
Table 5. Overview of hyperparameters optimized by grid search.
| Machine Learning Approach | Parameter | Explored Values |
|---|---|---|
| RF | n_estimators ¹ | 10, 50, 100, 250, 500, 1000 |
| RF | max_features ² | 'sqrt', 0.2, 0.4, 0.6, 0.8, None |
| SVM | C ³ | 0.01, 0.1, 1, 10, 100, 1000 |
| SVM | gamma ⁴ | 1, 0.1, 0.01, 0.001, 0.0001, 0.00001 |
1 Number of prediction trees. 2 Number of features considered when searching for the best split. 3 Penalty parameter C of the error term. 4 Coefficient for the radial basis function (rbf) kernel. Abbreviations: RF, random forest; SVM, support vector machine.
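The grids in Table 5 map directly onto a scikit-learn grid search. The sketch below is not the authors' pipeline: the descriptor matrix is random, and the MCC scoring and five-fold cross-validation are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Random placeholder descriptor matrix and labels standing in for the training data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = rng.integers(0, 2, size=200)

rf_grid = {"n_estimators": [10, 50, 100, 250, 500, 1000],
           "max_features": ["sqrt", 0.2, 0.4, 0.6, 0.8, None]}
svm_grid = {"C": [0.01, 0.1, 1, 10, 100, 1000],
            "gamma": [1, 0.1, 0.01, 0.001, 0.0001, 0.00001]}

mcc = make_scorer(matthews_corrcoef)  # assumed scoring function
rf_search = GridSearchCV(RandomForestClassifier(random_state=0), rf_grid,
                         scoring=mcc, cv=5, n_jobs=-1)
svm_search = GridSearchCV(SVC(kernel="rbf", probability=True), svm_grid,
                          scoring=mcc, cv=5, n_jobs=-1)

rf_search.fit(X, y)
svm_search.fit(X, y)
print(rf_search.best_params_)
print(svm_search.best_params_)
```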
Table 6. Performance of selected models on the test set.
| Name | Mean Tanimoto Similarity to the Five Nearest Neighbors | Number of Compounds | ACC | MCC | AUC | CCR | Se | Sp | PPV | NPV |
|---|---|---|---|---|---|---|---|---|---|---|
| RF_MACCS | ≥0 | 284 | 0.72 | 0.41 | 0.82 | 0.70 | 0.57 | 0.82 | 0.69 | 0.74 |
| RF_MACCS | ≥0.5 | 273 | 0.73 | 0.43 | 0.82 | 0.71 | 0.6 | 0.82 | 0.69 | 0.75 |
| RF_MACCS | ≥0.75 | 79 | 0.78 | 0.59 | 0.91 | 0.81 | 0.89 | 0.73 | 0.64 | 0.92 |
| RF_MACCS | <0.5 | 11 | 0.45 | −0.29 | 0.60 | 0.42 | 0.00 | 0.83 | 0.00 | 0.50 |
| SVM_MOE2D+OASIS | ≥0 | 283 | 0.76 | 0.52 | 0.83 | 0.76 | 0.81 | 0.72 | 0.66 | 0.85 |
| SVM_MOE2D+OASIS | ≥0.5 | 273 | 0.76 | 0.53 | 0.84 | 0.77 | 0.82 | 0.72 | 0.67 | 0.86 |
| SVM_MOE2D+OASIS | ≥0.75 | 79 | 0.81 | 0.64 | 0.89 | 0.84 | 0.93 | 0.75 | 0.67 | 0.95 |
| SVM_MOE2D+OASIS | <0.5 | 10 | 0.60 | 0.20 | 0.60 | 0.60 | 0.60 | 0.60 | 0.60 | 0.60 |
| SVM_PaDEL | ≥0 | 279 | 0.74 | 0.47 | 0.82 | 0.74 | 0.76 | 0.72 | 0.65 | 0.82 |
| SVM_PaDEL | ≥0.5 | 269 | 0.74 | 0.49 | 0.83 | 0.75 | 0.77 | 0.73 | 0.65 | 0.83 |
| SVM_PaDEL | ≥0.75 | 79 | 0.80 | 0.63 | 0.89 | 0.83 | 0.93 | 0.73 | 0.65 | 0.95 |
| SVM_PaDEL | <0.5 | 10 | 0.60 | 0.20 | 0.56 | 0.60 | 0.60 | 0.60 | 0.60 | 0.60 |
| SVM_PaDEL+MACCS | ≥0 | 279 | 0.75 | 0.50 | 0.82 | 0.75 | 0.78 | 0.73 | 0.66 | 0.83 |
| SVM_PaDEL+MACCS | ≥0.5 | 269 | 0.75 | 0.51 | 0.83 | 0.76 | 0.79 | 0.73 | 0.66 | 0.84 |
| SVM_PaDEL+MACCS | ≥0.75 | 79 | 0.80 | 0.63 | 0.89 | 0.83 | 0.93 | 0.73 | 0.65 | 0.95 |
| SVM_PaDEL+MACCS | <0.5 | 10 | 0.60 | 0.20 | 0.56 | 0.60 | 0.60 | 0.60 | 0.60 | 0.60 |
| SVM_PaDEL+OASIS | ≥0 | 279 | 0.74 | 0.48 | 0.82 | 0.74 | 0.76 | 0.73 | 0.65 | 0.82 |
| SVM_PaDEL+OASIS | ≥0.5 | 271 | 0.75 | 0.49 | 0.83 | 0.75 | 0.77 | 0.73 | 0.65 | 0.83 |
| SVM_PaDEL+OASIS | ≥0.75 | 79 | 0.80 | 0.63 | 0.89 | 0.83 | 0.93 | 0.73 | 0.65 | 0.95 |
| SVM_PaDEL+OASIS | <0.5 | 10 | 0.60 | 0.20 | 0.56 | 0.60 | 0.60 | 0.6 | 0.60 | 0.60 |
Abbreviations: ACC, accuracy; AUC, area under the receiver operating characteristic curve; CCR, correct classification rate; MCC, Matthews correlation coefficient; NPV, negative predictive value; PPV, positive predictive value; Se, sensitivity; Sp, specificity.
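The stratification in Table 6 amounts to evaluating only those test compounds whose mean similarity to their five nearest training neighbors reaches a given cutoff. A sketch of that filtering step, on placeholder arrays, is given below.

```python
import numpy as np
from sklearn.metrics import accuracy_score, matthews_corrcoef

# Placeholder per-compound values: mean MACCS Tanimoto similarity to the five
# nearest training neighbors, true labels and predicted labels.
mean_sim5 = np.array([0.82, 0.41, 0.77, 0.78, 0.93, 0.35, 0.68, 0.47])
y_true    = np.array([1,    0,    1,    0,    1,    1,    0,    0])
y_pred    = np.array([1,    1,    1,    0,    1,    0,    0,    1])

for cutoff in (0.0, 0.5, 0.75):
    mask = mean_sim5 >= cutoff
    print(f"similarity >= {cutoff}: n={mask.sum()}, "
          f"ACC={accuracy_score(y_true[mask], y_pred[mask]):.2f}, "
          f"MCC={matthews_corrcoef(y_true[mask], y_pred[mask]):.2f}")
```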
Table 7. Test set performance as a function of the distance of predicted class probabilities from the decision threshold.
| Name | Distance to Decision Threshold ¹ | Number of Compounds | ACC | MCC | AUC | CCR | Se | Sp | PPV | NPV |
|---|---|---|---|---|---|---|---|---|---|---|
| RF_MACCS | ≥0.15 | 175 | 0.85 | 0.67 | 0.46 | 0.84 | 0.81 | 0.87 | 0.76 | 0.90 |
| RF_MACCS | ≥0.35 | 66 | 0.91 | 0.78 | 0.42 | 0.89 | 0.85 | 0.93 | 0.85 | 0.93 |
| RF_MACCS | <0.15 | 109 | 0.51 | 0.04 | 0.42 | 0.52 | 0.32 | 0.72 | 0.55 | 0.50 |
| SVM_MOE2D+OASIS | ≥0.5 | 203 | 0.82 | 0.64 | 0.42 | 0.83 | 0.88 | 0.78 | 0.73 | 0.90 |
| SVM_MOE2D+OASIS | ≥1.25 | 106 | 0.89 | 0.76 | 0.41 | 0.89 | 0.89 | 0.88 | 0.81 | 0.94 |
| SVM_MOE2D+OASIS | <0.50 | 80 | 0.60 | 0.20 | 0.52 | 0.60 | 0.62 | 0.58 | 0.50 | 0.70 |
| SVM_PaDEL | ≥0.5 | 183 | 0.80 | 0.61 | 0.48 | 0.81 | 0.86 | 0.76 | 0.71 | 0.89 |
| SVM_PaDEL | ≥1.25 | 34 | 0.88 | 0.78 | 0.45 | 0.91 | 1.00 | 0.82 | 0.75 | 1.00 |
| SVM_PaDEL | <0.50 | 96 | 0.61 | 0.21 | 0.36 | 0.60 | 0.55 | 0.66 | 0.51 | 0.69 |
| SVM_PaDEL+MACCS | ≥0.5 | 183 | 0.80 | 0.62 | 0.49 | 0.82 | 0.88 | 0.75 | 0.71 | 0.90 |
| SVM_PaDEL+MACCS | ≥1.25 | 37 | 0.86 | 0.75 | 0.52 | 0.9 | 1.00 | 0.80 | 0.71 | 1.00 |
| SVM_PaDEL+MACCS | <0.50 | 96 | 0.65 | 0.27 | 0.39 | 0.63 | 0.58 | 0.69 | 0.55 | 0.71 |
| SVM_PaDEL+OASIS | ≥0.5 | 183 | 0.80 | 0.61 | 0.49 | 0.81 | 0.86 | 0.76 | 0.71 | 0.89 |
| SVM_PaDEL+OASIS | ≥1.25 | 34 | 0.88 | 0.78 | 0.45 | 0.91 | 1.00 | 0.82 | 0.75 | 1.00 |
| SVM_PaDEL+OASIS | <0.50 | 96 | 0.62 | 0.22 | 0.37 | 0.61 | 0.55 | 0.67 | 0.52 | 0.70 |
1 Distance of predicted class probabilities from the decision threshold. Abbreviations: ACC, accuracy; AUC, area under the receiver operating characteristic curve; CCR, correct classification rate; MCC, Matthews correlation coefficient; NPV, negative predictive value; PPV, positive predictive value; Se, sensitivity; Sp, specificity.
Table 8. Test set performance as a function of the number of consecutive nearest neighbors with class assignments consistent with the predicted class.
| Name | Number of Concordant Neighbors ¹ | Number of Compounds | ACC | MCC | AUC | CCR | Se | Sp | PPV | NPV |
|---|---|---|---|---|---|---|---|---|---|---|
| RF_MACCS | 0 | 87 | 0.33 | −0.35 | 0.32 | 0.33 | 0.19 | 0.48 | 0.26 | 0.38 |
| RF_MACCS | ≥1 | 197 | 0.89 | 0.77 | 0.97 | 0.87 | 0.81 | 0.94 | 0.89 | 0.89 |
| RF_MACCS | ≥2 | 147 | 0.96 | 0.90 | 1.00 | 0.94 | 0.89 | 0.99 | 0.98 | 0.95 |
| RF_MACCS | ≥3 | 113 | 0.99 | 0.98 | 1.00 | 0.98 | 0.97 | 1.00 | 1.00 | 0.99 |
| SVM_MOE2D+OASIS | 0 | 85 | 0.56 | 0.13 | 0.56 | 0.57 | 0.62 | 0.51 | 0.55 | 0.58 |
| SVM_MOE2D+OASIS | ≥1 | 198 | 0.84 | 0.69 | 0.94 | 0.85 | 0.92 | 0.79 | 0.72 | 0.94 |
| SVM_MOE2D+OASIS | ≥2 | 146 | 0.91 | 0.81 | 0.99 | 0.92 | 0.95 | 0.89 | 0.79 | 0.98 |
| SVM_MOE2D+OASIS | ≥3 | 115 | 0.91 | 0.80 | 0.99 | 0.92 | 0.94 | 0.90 | 0.79 | 0.97 |
| SVM_PaDEL | 0 | 86 | 0.53 | 0.07 | 0.52 | 0.54 | 0.56 | 0.51 | 0.51 | 0.56 |
| SVM_PaDEL | ≥1 | 193 | 0.83 | 0.66 | 0.92 | 0.84 | 0.87 | 0.8 | 0.72 | 0.92 |
| SVM_PaDEL | ≥2 | 147 | 0.89 | 0.78 | 0.96 | 0.91 | 0.96 | 0.86 | 0.76 | 0.98 |
| SVM_PaDEL | ≥3 | 113 | 0.90 | 0.79 | 0.97 | 0.92 | 0.97 | 0.88 | 0.76 | 0.99 |
| SVM_PaDEL+MACCS | 0 | 86 | 0.55 | 0.10 | 0.53 | 0.55 | 0.59 | 0.51 | 0.52 | 0.57 |
| SVM_PaDEL+MACCS | ≥1 | 193 | 0.84 | 0.68 | 0.91 | 0.85 | 0.89 | 0.81 | 0.73 | 0.93 |
| SVM_PaDEL+MACCS | ≥2 | 147 | 0.90 | 0.80 | 0.96 | 0.92 | 0.96 | 0.88 | 0.79 | 0.98 |
| SVM_PaDEL+MACCS | ≥3 | 113 | 0.91 | 0.81 | 0.97 | 0.93 | 0.97 | 0.89 | 0.78 | 0.99 |
| SVM_PaDEL+OASIS | 0 | 86 | 0.53 | 0.07 | 0.52 | 0.54 | 0.56 | 0.51 | 0.51 | 0.56 |
| SVM_PaDEL+OASIS | ≥1 | 193 | 0.83 | 0.67 | 0.92 | 0.84 | 0.87 | 0.81 | 0.73 | 0.92 |
| SVM_PaDEL+OASIS | ≥2 | 147 | 0.9 | 0.79 | 0.96 | 0.91 | 0.96 | 0.87 | 0.77 | 0.98 |
| SVM_PaDEL+OASIS | ≥3 | 113 | 0.91 | 0.81 | 0.97 | 0.93 | 0.97 | 0.89 | 0.78 | 0.99 |
1 Number of consecutive nearest neighbors in the training data having the same activity class assigned as the one predicted for the test compounds. Abbreviations: ACC, accuracy; AUC, area under the receiver operating characteristic curve; CCR, correct classification rate; MCC, Matthews correlation coefficient; NPV, negative predictive value; PPV, positive predictive value; Se, sensitivity; Sp, specificity.
