2.1. Microsomal Stability Assay
The substrate depletion method was used to determine the half-life (t1/2) of compounds by measuring the disappearance of the parent compound over time. Incubations were performed on a Tecan EVO 200 robotic system (Morrisville, NC, USA) equipped with a 96-channel head and an Inheco heating block and controlled by EVOware software (Version 3.5). Mixed-gender human liver microsomes (HLM) were purchased from Xenotech (Kansas City, KS, USA; Catalog: H0610). Gentest NADPH Regenerating Solutions A and B (Catalog: 451220/451200) and Axygen reservoirs (Catalog: RES-SW384-LP/RES-SW384-HP) were purchased from Corning Inc. (Corning, NY, USA). Incubation plates (384-well, 250 µL; Catalog: 186002632) and LC/MS analysis plates (384-well, 100 µL; Catalog: 186002631) were purchased from Waters Inc. (Milford, MA, USA). The compounds used as assay controls and internal standards, and the buffer components, including albendazole, buspirone, propranolol, loperamide, antipyrine, potassium phosphate monobasic, and potassium phosphate dibasic, were purchased from Sigma-Aldrich (St. Louis, MO, USA). An albendazole solution in acetonitrile (ACN/IS) was prepared for use as an internal standard. Each 110 μL reaction mixture contained the test compound (1 μM), HLM (0.5 mg/mL), and the NADPH regenerating system in phosphate buffer (pH 7.4). The samples were incubated in 384-well plates at 37 °C for 0, 5, 10, 15, 30, and 60 min. At each designated time point, 10 μL of the mixture was transferred to another 384-well plate containing cold ACN/IS. The plates were then centrifuged at 3000 rpm for 20 min at 4 °C, and supernatants were collected into a 384-well injection plate. Sample quantification was performed using a Thermo UPLC/HRMS system, and data were analyzed using TraceFinder software (Version 4.1). The data were then extracted, and half-life analysis was performed using our in-house Validator software (Version 1.0) as described previously [8,9].
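The substrate depletion calculation can be illustrated with a minimal sketch. Assuming first-order decay, ln(C/C0) = −k·t, so the elimination rate constant k is the negative slope of a linear fit of ln(% remaining) versus time, and t1/2 = ln(2)/k. The helper below is an illustrative stand-in, not the in-house Validator implementation:

```python
import math

def half_life(times_min, pct_remaining):
    """Estimate t1/2 (min) from substrate depletion data.

    Assumes first-order kinetics: ln(C/C0) = -k*t, so t1/2 = ln(2)/k,
    where k is the negative slope of ln(% remaining) vs. time.
    """
    ys = [math.log(p) for p in pct_remaining]
    n = len(times_min)
    mean_x = sum(times_min) / n
    mean_y = sum(ys) / n
    # Ordinary least-squares slope of ln(% remaining) vs. time
    slope = sum((x - mean_x) * (y - mean_y)
                for x, y in zip(times_min, ys)) / \
            sum((x - mean_x) ** 2 for x in times_min)
    k = -slope
    return math.log(2) / k

# Synthetic example: true k = 0.0231 min^-1 gives t1/2 of about 30 min
times = [0, 5, 10, 15, 30, 60]          # same time points as the assay
remaining = [100 * math.exp(-0.0231 * t) for t in times]
print(round(half_life(times, remaining), 1))  # → 30.0
```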
2.4. Model Building and Validation
Cross-validation using a random split is a common technique for assessing the generalizability of a machine learning model. In this method, the training data are randomly divided into internal training and internal validation subsets multiple times. For each split, the model is trained on the internal training subset and evaluated on the internal validation subset. This process helps ensure that the model’s performance is not overly dependent on a particular partition of the data. The random split method is particularly useful in drug discovery applications where datasets can be limited in size and diverse in nature. By averaging the performance metrics across multiple splits, one can obtain a more reliable estimate of the model’s ability to generalize to unseen data. This technique helps in mitigating overfitting and provides a robust measure of the model’s predictive power. In this study, we performed a 5-fold cross-validation (5-CV). This step is followed by external validation.
External validation involves testing the trained model on an entirely independent dataset that was not used during the training or internal validation phases. This step is crucial in drug discovery to ensure that the model’s predictions are genuinely generalizable and not just a result of overfitting to the training data. External validation provides a stringent test of the model’s performance, as it simulates real-world application scenarios where new, unseen molecules are encountered. The best models from internal validation were validated on the NCATS’s external validation set and the three external datasets. The following metrics were used to measure and compare the performance of the models; they are based on the four elements of a confusion matrix: true positives (TP), i.e., positive class compounds correctly predicted as positive; false positives (FP), i.e., negative class compounds incorrectly predicted as positive; true negatives (TN), i.e., negative class compounds correctly predicted as negative; and false negatives (FN), i.e., positive class compounds incorrectly predicted as negative.
Accuracy: Accuracy is the ratio of correctly predicted instances to the total instances in the dataset [15]. It provides a straightforward measure of the model’s overall performance, but it can be misleading in imbalanced datasets where one class dominates.
Sensitivity (or Recall): Sensitivity, also known as recall or true positive rate, measures the proportion of actual positives that are correctly identified by the model. It is critical in contexts where missing positive cases is particularly costly.
Specificity: Specificity, or true negative rate, measures the proportion of actual negatives that are correctly identified by the model. It is important in contexts where false positives are costly.
Positive Predictive Value (PPV): PPV, or precision, measures the proportion of positive predictions that are actually correct. It indicates the reliability of positive predictions made by the model.
Negative Predictive Value (NPV): NPV measures the proportion of negative predictions that are actually correct. It indicates the reliability of negative predictions made by the model.
Area Under the Receiver Operating Characteristic Curve (AUC-ROC): The AUC-ROC measures the model’s ability to distinguish between positive and negative classes across various threshold settings [16]. It plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) and calculates the area under this curve. A higher AUC indicates better model performance.
Cohen’s Kappa: Cohen’s Kappa [17] is a statistical measure that compares an observed accuracy with an expected accuracy (random chance). It accounts for the possibility of agreement occurring by chance, providing a more robust metric for evaluating classification models, especially with imbalanced datasets.
TP, TN, FP, and FN are the numbers of true positive predictions, true negative predictions, false positive predictions, and false negative predictions, respectively.
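The threshold-based metrics above (all except AUC-ROC, which is computed across thresholds) follow directly from the confusion-matrix counts; the function below applies the standard formulas and is a sketch rather than the exact pipeline used in the study:

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # recall / true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # precision
    npv = tn / (tn + fn)
    # Cohen's kappa: observed agreement vs. agreement expected by chance
    p_o = accuracy
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    kappa = (p_o - p_e) / (1 - p_e)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "ppv": ppv, "npv": npv,
            "kappa": kappa}

# Illustrative counts (not study results)
m = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print(round(m["accuracy"], 2), round(m["kappa"], 2))  # → 0.85 0.7
```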