**In Silico Study Identified Methotrexate Analog as Potential Inhibitor of Drug Resistant Human Dihydrofolate Reductase for Cancer Therapeutics**

**Rabia Mukhtar Rana <sup>1</sup> , Shailima Rampogu <sup>1</sup> , Noman Bin Abid <sup>2</sup> , Amir Zeb <sup>1</sup> , Shraddha Parate <sup>1</sup> , Gihwan Lee <sup>1</sup> , Sanghwa Yoon <sup>1</sup> , Yumi Kim <sup>1</sup> , Donghwan Kim <sup>1</sup> and Keun Woo Lee <sup>1,</sup>\***


Academic Editors: Marco Tutone and Anna Maria Almerico Received: 18 June 2020; Accepted: 30 July 2020; Published: 31 July 2020

**Abstract:** Drug resistance is a core issue in cancer chemotherapy. A known folate antagonist, methotrexate (MTX), inhibits human dihydrofolate reductase (hDHFR), the enzyme that catalyzes the reduction of 7,8-dihydrofolate to 5,6,7,8-tetrahydrofolate, which is required for biosynthesis and cell proliferation. Structural change in the DHFR enzyme is a significant cause of resistance and the subsequent loss of MTX binding. In the current study, wild type hDHFR and the double mutant (engineered variant) F31R/Q35E (PDB ID: 3EIG) were subjected to computational study. Structure-based pharmacophore modeling was carried out for wild type (WT) and mutant (MT) (variant F31R/Q35E) hDHFR structures by generating ten models for each. Two pharmacophore models, WT-pharma and MT-pharma, were selected for further computations, and showed excellent ROC curve quality. Additionally, the selected pharmacophore models were validated by the Guner-Henry decoy test method, which yielded high goodness of fit for WT-hDHFR and MT-hDHFR. Using a SMILES string of MTX in ZINC<sup>15</sup> with the selections of 'clean', in vitro and in vivo options, 32 MTX-analogs were obtained. Eight analogs were retained as drug-like after applying absorption, distribution, metabolism, excretion, and toxicity (ADMET) assessment tests and Lipinski's Rule of five. WT-pharma and MT-pharma were further employed as 3D queries in virtual screening of the drug-like MTX analogs. Subsequently, seven screening hits along with a reference compound (MTX) were subjected to molecular docking in the active site of WT- and MT-hDHFR. Through a clustering analysis and examination of protein-ligand interactions, one compound was found with a ChemPLP fitness score greater than that of MTX (reference compound). Finally, molecular dynamics (MD) simulations identified an MTX analog which exhibited strong affinity for WT- and MT-hDHFR, with stable RMSD, hydrogen bonds (H-bonds) in the binding site and the lowest MM/PBSA binding free energy. In conclusion, we report an MTX analog which is capable of inhibiting hDHFR in its wild type form, as well as in cases where the enzyme acquires resistance to drugs during chemotherapy treatment.

**Keywords:** methotrexate; drug resistance; human dihydrofolate reductase; pharmacophore modeling; virtual screening; molecular docking; molecular dynamics simulation.

#### **1. Introduction**

A major complication in cancer treatment with chemotherapy is the development of resistance to previously effective drugs. Clinically, two main types of drug resistance exist: intrinsic resistance, which is not associated with drug exposure, but rather, with an innate ability of tumor cells; and acquired resistance, which occurs after exposure to the drug [1]. Various mechanisms like increased rates of drug efflux, alterations in drug metabolism, variations in drug targets, increased target expression, activation of survival pathways, increased expression of anti-apoptotic proteins and mutation of drug targets are involved in acquiring resistance to chemotherapeutic agents [2].

Human dihydrofolate reductase (hDHFR) catalyzes the reduction of 7,8-dihydrofolate (DHF) to 5,6,7,8-tetrahydrofolate in a nicotinamide adenine dinucleotide phosphate (NADPH) dependent manner. Tetrahydrofolate is an essential cofactor in several metabolic pathways, such as purine and thymidylate biosynthesis, and plays a vital role in cell division and proliferation [3]. Because of its crucial role in nucleoside biosynthesis, hDHFR has been widely studied and exploited as a drug target [4,5].

Methotrexate (MTX) (C<sub>20</sub>H<sub>22</sub>N<sub>8</sub>O<sub>5</sub>) is an antimetabolite, an analogue of folic acid and an antiproliferative drug derived from aminopterin that inhibits dihydrofolate reductase [6]. The drug primarily enters cells through an active carrier transport mechanism which is shared by reduced folates and mediated by the reduced folate carrier (RFC) [7]. Inside the cell, MTX is polyglutamylated by the enzyme folylpolyglutamate synthetase (FPGS), which adds glutamate residues [8,9]. MTX and polyglutamylated forms of MTX are tight-binding inhibitors of hDHFR and hinder pyrimidine, and hence thymidylate biosynthesis [10–12]. Decreased MTX affinity has been detected in cell lines in which exposure to increasing doses caused mutations in hDHFR [13–18]. hDHFR variants with amino acid substitutions at the folate binding site residues Phe31 [19], Arg70 [20], Leu22 [21,22], Val115 [23] and Phe34 [24] have been detected in MTX-resistant cancer cell lines.

Crystal structures of MTX-resistant point mutants of hDHFR have provided an understanding of the details of decreased binding of MTX or other antifolates [25–30]. Volpato et al. reported a combinatorial MTX-resistant hDHFR variant, F31R/Q35E, which exhibited >650-fold decreased binding to MTX. To reveal the structural details of MTX resistance in this variant, they obtained its crystal structure bound with MTX at 1.7 Å resolution (PDB ID: 3EIG) [31]. This highly MTX-resistant variant is an effective selectable marker for several mammalian cell types, including murine hematopoietic stem cells [32]. Since mutations triggering MTX resistance have not been studied in mammals, and MTX is an approved drug for human treatment, engineered resistant DHFRs provide highly capable ex vivo or in vivo selective markers for humans [33].

In recent decades, advances in computational techniques have led to an acceleration of drug discovery [34]. For example, cheminformatics allows us to understand and characterize the molecular properties and chemical activities of specific compounds and produce huge libraries of small molecules to screen against particular therapeutic processes [35]. After candidate identification, other cheminformatics techniques can be utilized to generate libraries of compounds which are structurally and chemically similar to the identified "hits" in order to improve stability, toxicity and kinetics. Additionally, bioinformatics methodologies can be applied to study the therapeutic activity of candidate drugs predicting interactions between drugs and proteins, to analyze the impact on biological pathways and functions and to determine genomic variants that may vary drug response [36]. Accordingly, several approaches have been developed to reduce the research expense and risk of failure for drug discovery, among which computer-aided drug design (CADD) is one of the most effective [37].

Since drug resistance hinders chemotherapy, there is an urgent need to discover drugs that inhibit hDHFR in its wild type as well as its mutant form, i.e., after it has acquired resistance to MTX. Singh et al. developed a small peptide as an anticancer drug targeting hDHFR which was expected to be effective against MTX-resistant hDHFR because of a larger drug–protein interaction area [38]. Despite its larger interaction area, the peptide was specifically designed to inhibit only wild type hDHFR. We carried out a computational study to identify a candidate molecule capable of inhibiting wild type as well as mutant hDHFR. Structure-based pharmacophore modeling was performed using the hDHFR wild type and drug-resistant F31R/Q35E variant structures in complex with methotrexate to identify important chemical features of protein-ligand interactions. Pharmacophore models, WT-pharma and MT-pharma, with four features involving key residues were selected from ten models generated for each structure. The selected pharmacophore models exhibited the highest area under the receiver operating characteristic (ROC) curve, verifying the sensitivity of the models in retrieving active compounds. WT- and MT-pharma were further subjected to validation by the Guner-Henry decoy test method.

ZINC was initially developed as an open-access database and toolset to support access to compounds for virtual screening. The upgraded version ZINC<sup>15</sup> makes it possible to carry out similarity searches and to explore the analogs of a given structure or part of a structure according to the input line employed [39]. The MTX structure in the SMILES (Simplified Molecular-Input Line-Entry System) format was used in ZINC<sup>15</sup> to retrieve MTX-analog structures. The obtained analogs were filtered through ADMET and Lipinski's Rule of five to identify drug-like compounds. The validated pharmacophores WT-pharma and MT-pharma were then used as 3D queries to screen the drug-like MTX-analogs. The analogs that mapped onto WT- and MT-pharma were subjected to molecular docking, where two compounds were found with a higher docking score than the reference (MTX). Further, molecular dynamics simulation confirmed one compound with a stronger affinity for WT and MT hDHFR, yielding stable RMSD and strong molecular interactions with catalytic active site residues. Additionally, binding free energy calculation through MM/PBSA (Molecular Mechanics/Poisson-Boltzmann Surface Area) demonstrated robust binding affinity of the Hit molecule with WT and MT hDHFR. Accordingly, in this study, we predicted an analog compound of MTX as a potential inhibitor for wild type and drug-resistant hDHFR for cancer therapeutics.

#### **2. Results**

#### *2.1. Generation of Structures Based Pharmacophore Models*

Crystal structures of the wild type and the F31R/Q35E variant of hDHFR in complex with methotrexate were downloaded from the Protein Data Bank and subjected to structure-based pharmacophore modeling. A total of 10 pharmacophore models were generated for each structure, with a ROC curve produced for each model. All the pharmacophores were characterized in terms of the total number of features, types of features, selectivity score and ROC curve quality (Table 1). All ten models for the wild type and for the mutant structures yielded the same selectivity score but differed in the location of pharmacophoric features.


**Table 1.** Receptor-ligand based pharmacophores characteristics details.

The pharmacophore models whose features involved the key residues Glu30, Asn64, Arg70, and Val115 were selected and termed WT-pharma and MT-pharma for the wild type and mutant structures, respectively. WT-pharma consisted of four pharmacophoric features: one hydrogen bond acceptor (HBA), one hydrogen bond donor (HBD), one negative ionizable (NI) and one ring aromatic (RA). MT-pharma comprised one hydrogen bond donor (HBD), two negative ionizable (NI) and one ring aromatic (RA) features (Figure 1).

**Figure 1.** Structure-based pharmacophore generation. (**A**) Residues of wild type hDHFR active site complementing pharmacophoric features are shown as a thin stick. Bound inhibitor (MTX) is shown as a light blue colored thick stick model. HBA, HBD, RA, and NI are colored as green, magenta, orange and blue respectively. (**B**) Residues of mutant hDHFR active site complementing pharmacophoric features are shown as a thin stick. Bound inhibitor (MTX) is shown as a pink-colored thick stick model. HBA, HBD, RA, and NI are colored as green, magenta, orange and blue respectively. (**C**) Interfeature distance illustration of WT-pharma (**D**) Interfeature distance illustration of MT-pharma.

#### *2.2. Pharmacophore Models Validation*

The selected pharmacophore models, WT-pharma and MT-pharma, were assessed for their sensitivity in retrieving active compounds using the receiver operating characteristic (ROC) curve. ROC curves were plotted during the generation of pharmacophore models by using the *Validation* option available in the *Receptor-ligand Pharmacophore* module in DS for structure-based pharmacophore modeling. For this purpose, sets of 46 active and 24 inactive molecules were employed to test model efficacy by creating the ROC curve. A higher area under the ROC curve indicates higher sensitivity of the model. The ROC curve quality was 0.989 for WT-pharma and 0.985 for MT-pharma, corresponding to 98.9% and 98.5% area under the curve, which indicates that both pharmacophore models are highly sensitive in identifying active molecules (Figure 2).

**Figure 2.** Receiver Operating Characteristics curves for validation of selected pharmacophore models between true positive and false-positive rates. (**A**) ROC curve shown in the red line for the WT-pharma model with 0.989 curve quality depicts 98.9% area under the curve. (**B**) ROC curve shown in the red line for the MT-pharma model with 0.985 curve quality depicts 98.5% area under the curve.
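The area under a ROC curve can be illustrated with a minimal sketch. It assumes that each compound receives a numerical fit score from the pharmacophore and that higher scores indicate a better match; the function name `roc_auc` and the example scores are hypothetical and are not taken from the study or from DS.

```python
def roc_auc(active_scores, inactive_scores):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation.

    Returns the fraction of (active, inactive) pairs in which the active
    compound scores higher than the inactive one; ties count as half.
    """
    if not active_scores or not inactive_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for a in active_scores:
        for i in inactive_scores:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5
    return wins / (len(active_scores) * len(inactive_scores))


# Hypothetical fit values for illustration only (not data from the study).
actives = [3.9, 3.5, 3.1, 2.8]
inactives = [2.9, 1.2, 0.7]
print(round(roc_auc(actives, inactives), 3))
```

A value close to 1 means that nearly every active is ranked above every inactive, which is how the reported curve qualities of 0.989 and 0.985 can be read.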

Additionally, decoy set validation was implemented using the *Ligand Pharmacophore Mapping* module in DS. The accuracy of WT-pharma and MT-pharma was evaluated by four factors, i.e., false positives, false negatives, enrichment factor (EF), and goodness of fit (GF). EF and GF were computed from the parameters given in Table 2. Other properties of WT-pharma and MT-pharma, including the percentage yield of actives (%Y), the percent ratio of actives in the hit list (%A), false negatives, and false positives, were also measured (Table 2).

**Table 2.** Decoy set validation for WT & MT hDHFR structure-based pharmacophore models. WT-pharma and MT-pharma obtained the highest goodness of fit score suggesting the suitability of the models for virtual screening.
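The decoy-test quantities named above can be sketched as follows, using the commonly quoted Guner-Henry definitions. The symbols `D`, `A`, `Ht` and `Ha` and the function name `guner_henry` are assumptions introduced here for illustration, and the example counts are hypothetical rather than the values of Table 2.

```python
def guner_henry(D, A, Ht, Ha):
    """Return Guner-Henry decoy-test metrics as a dictionary.

    D  : total number of compounds in the decoy database
    A  : number of known actives in the database
    Ht : number of hits retrieved by the pharmacophore
    Ha : number of actives among the retrieved hits
    """
    if not (0 < A < D and 0 < Ht <= D and 0 <= Ha <= min(A, Ht)):
        raise ValueError("inconsistent counts")
    return {
        "yield_pct": 100.0 * Ha / Ht,        # %Y, yield of actives
        "ratio_pct": 100.0 * Ha / A,         # %A, ratio of actives retrieved
        "false_negatives": A - Ha,
        "false_positives": Ht - Ha,
        "EF": (Ha / Ht) / (A / D),           # enrichment factor
        # goodness of fit (GF); ranges from 0 (null model) to 1 (ideal model)
        "GF": (Ha * (3 * A + Ht)) / (4 * Ht * A) * (1 - (Ht - Ha) / (D - A)),
    }


# Hypothetical counts for illustration only.
print(guner_henry(D=1000, A=50, Ht=55, Ha=48))
```

Under these definitions, a model that retrieves all actives and no decoys gives GF = 1, so a score near 1 supports using the model for virtual screening.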


#### *2.3. Obtaining Methotrexate Analog Structures*

To obtain structures analogous to methotrexate (MTX), we used the MTX structure in SMILES format as a query for a similarity search in ZINC<sup>15</sup>. Consequently, 32 compounds were retrieved (Table S1). These compounds were downloaded in SDF format for visualization in DS and for use in further computations.
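As a rough illustration of what a similarity search compares, the sketch below computes the Tanimoto coefficient between two fingerprints represented as sets of "on" bit indices. This is an assumption for illustration: the text does not state which similarity metric or fingerprint ZINC<sup>15</sup> applied, and the function name `tanimoto` and the bit sets are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits.

    Defined as |A ∩ B| / |A ∪ B|; returns 0.0 when both fingerprints are empty.
    """
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical fingerprints (not derived from MTX or any real analog).
query_bits = {1, 2, 3, 8, 13}
analog_bits = {2, 3, 8, 13, 21}
print(tanimoto(query_bits, analog_bits))
```

In practice, the fingerprints would be generated from the SMILES strings by a cheminformatics toolkit, and analogs would be kept when the coefficient exceeds a chosen threshold.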

#### *2.4. Drug-Likeness of MTX-analogs and Virtual Screening with Pharmacophore Models*

The downloaded compounds were subjected to ADMET and Lipinski's Rule of five assessment tests to identify drug-like MTX analogs. The ADMET assessment test gauged the pharmacokinetic features of the compounds obtained from ZINC<sup>15</sup>. In the ADMET assessment test, compounds were assessed for noninhibition of CYP2D6 and nonhepatotoxicity. The pharmacokinetic properties of blood brain barrier (BBB) penetration, optimal solubility, and good intestinal absorption were evaluated by setting their values to 3, 3, and 0, respectively. Lipinski's Rule of five assessment was carried out after ADMET evaluation. Through Lipinski's Rule of five filtration, compounds with an AlogP value less than 5, number of HBD <5, number of HBA <10, molecular weight <500 Da, and fewer than 10 rotatable bonds were retained [40,41]. Accordingly, 8 compounds were found to have drug-like properties. The drug-like MTX-analogs were subjected to virtual screening against WT-pharma and MT-pharma. All 8 compounds aligned with WT-pharma, but one compound did not map onto MT-pharma. Subsequently, 7 MTX-analogs were recognized as screening Hits for further computations.
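The Rule of five thresholds quoted above can be expressed as a simple filter. The sketch below is a minimal illustration that assumes the descriptors have already been calculated by another tool; the `Descriptors` class, the function name and the two example compounds are hypothetical and do not correspond to entries in Table S1.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Pre-computed molecular descriptors for one compound."""
    name: str
    alogp: float
    hbd: int               # number of hydrogen bond donors
    hba: int               # number of hydrogen bond acceptors
    mol_weight: float      # in Da
    rotatable_bonds: int


def passes_rule_of_five(d):
    """Apply the thresholds as stated in the text (strict inequalities)."""
    return (
        d.alogp < 5
        and d.hbd < 5
        and d.hba < 10
        and d.mol_weight < 500
        and d.rotatable_bonds < 10
    )


# Hypothetical descriptor values for illustration only.
candidates = [
    Descriptors("analog_A", alogp=1.2, hbd=4, hba=9, mol_weight=468.5, rotatable_bonds=9),
    Descriptors("analog_B", alogp=0.4, hbd=6, hba=12, mol_weight=540.6, rotatable_bonds=11),
]
drug_like = [d.name for d in candidates if passes_rule_of_five(d)]
print(drug_like)
```

The strict inequalities follow the wording of this section; other statements of the rule allow up to 5 donors and up to 10 acceptors, so the comparison operators are the first thing to adjust when reproducing a different protocol.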

#### *2.5. Molecular Docking of Screening Hits in Active Site of hDHFR*

To explore the binding mode of the 7 drug-like compounds retrieved from virtual screening against WT-pharma and MT-pharma, molecular docking simulations were carried out using GOLD v 5.2.2. The 3D structures of the wild type and the F31R/Q35E variant of hDHFR in complex with the inhibitor MTX were taken from the Protein Data Bank (PDB ID: 1U72 and 3EIG, respectively). Both structures have a high resolution: 1.9 Å for the wild type and 1.7 Å for the F31R/Q35E variant. The co-crystallized inhibitor (MTX) was docked in the active site of wild type hDHFR in order to optimize the docking protocol. The docked pose of MTX showed a low RMSD value of 0.58 Å with respect to the crystallographic pose of MTX in the active site of the wild type hDHFR, as shown in Figure S1 of the supplementary material. The drug-like (candidate) compounds retrieved by WT- and MT-pharma were docked using the same optimized protocol. Docking results showed that the ChemPLP and ASP fitness scores of MTX as a reference compound were 99.23 and 56.65 for wild type hDHFR and 88.98 and 49.84 for mutant hDHFR, respectively. These docking scores were used as cut-off values in the analysis of the wild type and mutant hDHFR docking results. The candidate compounds for wild type hDHFR were selected based on ChemPLP and ASP fitness scores greater than 99.23 and 56.65, respectively. For mutant hDHFR, compounds yielding ChemPLP and ASP fitness scores higher than 88.98 and 49.84 were selected (Table 3).
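The cut-off logic described above can be sketched as a small filter. The reference scores are those reported for MTX in this section; the function name `beats_reference`, the data layout and the candidate scores are hypothetical and are not the values of Table 3.

```python
# Reference (MTX) fitness scores reported in the text, used as cut-offs.
MTX_CUTOFFS = {
    "WT": {"chemplp": 99.23, "asp": 56.65},
    "MT": {"chemplp": 88.98, "asp": 49.84},
}


def beats_reference(scores, cutoffs=MTX_CUTOFFS):
    """True if a compound exceeds the MTX score for every enzyme form
    (WT and MT) and every scoring function (ChemPLP and ASP)."""
    return all(
        scores[form][fn] > cutoffs[form][fn]
        for form in cutoffs
        for fn in cutoffs[form]
    )


# Hypothetical docking scores for two candidates (illustration only).
candidates = {
    "candidate_1": {"WT": {"chemplp": 101.5, "asp": 58.0},
                    "MT": {"chemplp": 92.3, "asp": 51.1}},
    "candidate_2": {"WT": {"chemplp": 104.0, "asp": 57.2},
                    "MT": {"chemplp": 85.0, "asp": 50.3}},
}
selected = [name for name, s in candidates.items() if beats_reference(s)]
print(selected)
```

Requiring all four scores to exceed the reference is a strict reading of the selection criteria; the interaction analysis described next is a separate, manual step that this sketch does not cover.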

**Table 3.** Comparison of ChemPLP and ASP dock scores of MTX (reference inhibitor) and Hit compound in the active sites of WT and MT hDHFR.


Additionally, the compounds were examined for ligand conformations showing essential interactions in the active site of hDHFR. Finally, one compound which contained the pharmacophoric features of the wild type and mutant hDHFR structures and fulfilled the above-mentioned docking score criteria was selected as the docking Hit. The pharmacophore mapping of the Hit compound (ID: ZINC000013508844) with the WT-pharma and MT-pharma models is shown in Figure 3.

**Figure 3.** Hit compound (MTX-analog) mapping with pharmacophore models. (**A**) Hit compound represented in dark blue colored thick stick model mapping with WT-pharma. (**B**) Hit compound represented in magenta-colored thick stick model mapping with MT-pharma.

#### *2.6. Molecular Dynamic Simulations for Structures Stability Evaluation*

MD simulations were executed to estimate the binding stability of Hit compound after docking in the active site of wild type and mutant hDHFR. Four MD simulation systems were employed as one for each complex i.e., for hit compound and reference compound (MTX) in complex with wild type and mutant hDHFR structures, respectively. The initial details of each system subjected to simulation are given in Table 4.

**Table 4.** The specifications of four systems used for molecular dynamics simulations.


MTX <sup>a</sup> : the reference inhibitor.

The root mean square deviation (RMSD) of the protein-ligand complex was measured for each simulation system to assess ligand binding with hDHFR. Over the 50 ns simulation, the protein-ligand RMSD of the reference compound (MTX) in complex with wild type hDHFR averaged 0.21 nm (Figure 4A). The average RMSD of MTX with mutant hDHFR was 0.21 nm up to 38.9 ns, but afterward it increased significantly to an average of 0.62 nm, indicating loss of MTX binding with MT-hDHFR. Accordingly, the representative structure of each system was taken from the last 8 ns (30–38 ns) before the loss of MTX binding with MT-hDHFR. The Hit compound obtained from the docking results showed stable RMSD in complex with WT- and MT-hDHFR. The average RMSD values of the Hit compound in complex with WT-hDHFR and MT-hDHFR were 0.21 nm and 0.22 nm, respectively, throughout the simulation, indicating that both systems were well converged (Figure 4A). Additionally, the per-residue RMSF (root mean square fluctuation) calculated for each complex was about 0.3 nm for all residues, except for MTX, which showed an RMSF of about 2.3 nm in complex with MT-hDHFR (Figure 4B).

**Figure 4.** RMSD analysis of the reference (MTX) and hit compound (MTX-analog). (**A**) RMSD of the protein-ligand complex of wild type and mutant hDHFR revealed their stability throughout the simulation, with no abnormal behavior in any system except for MTX in complex with MT hDHFR. (**B**) RMSF per residue plot for all the systems, showing that residue fluctuations are stable except for the MT hDHFR ligand (MTX), which showed a high fluctuation level. (**C**) The number of intermolecular hydrogen bonds between protein and ligand during 50 ns MD simulations. Light blue and pink colors represent MTX in wild type and mutant hDHFR, respectively, while dark blue and magenta represent the Hit compound in wild type and mutant hDHFR, respectively.
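The RMSD measure used in this analysis can be illustrated with a minimal sketch. It assumes that the two coordinate sets are already superimposed and list the same atoms in the same order (trajectory tools normally perform a least-squares fit first); the function names, the toy trace and the use of 0.3 nm as a stability threshold in this helper are illustrative assumptions.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally sized lists of (x, y, z) coordinates.

    No superposition is performed: the structures are assumed to be
    pre-aligned and to share the same atom order.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def first_departure(times_ns, rmsd_nm, threshold_nm=0.3):
    """Return the first time point at which the RMSD exceeds the threshold,
    or None if it never does."""
    for t, value in zip(times_ns, rmsd_nm):
        if value > threshold_nm:
            return t
    return None


# Toy RMSD trace (nm) sampled every 10 ns; values are hypothetical.
times = [0, 10, 20, 30, 40, 50]
trace = [0.00, 0.20, 0.21, 0.22, 0.60, 0.63]
print(first_departure(times, trace))
```

A trace that stays below the threshold returns `None`, which corresponds to the behavior reported for the Hit compound, whereas a late jump such as the one in the toy trace mirrors the departure described for MTX in MT-hDHFR.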

The superimposition of representative structures demonstrated that the binding pattern and conformational alignment of Hit in the active site of hDHFR was similar to that of MTX (Figure 5).

**Figure 5.** The binding patterns of the reference inhibitor (MTX) and hit compound in the active site of wild type and mutant hDHFR. Compounds are displayed by their representative structures superimposed (left) and enlarged (right). The protein is shown in white color. (**A**) Light blue and dark blue colors represent MTX and Hit compound in wild type hDHFR. (**B**) Pink and magenta colors represent MTX and Hit compound respectively in mutant hDHFR.

The substrate-binding site of hDHFR mainly comprises Ile7, Glu30, Phe31, Gln35, Ser59, Pro61, Asn64, Arg70 and Val115 [42]. Our results suggested that the reference compound (MTX) could bind with the substrate binding residues of WT-hDHFR but lost its binding with MT-hDHFR, in agreement with Volpato et al. [32]. In contrast with MTX, the Hit compound exhibited strong binding with the active site of both WT- and MT-hDHFR. MTX formed H-bonds with Ile7, Glu30, Val115, Phe31, Asn64 and Arg70, as well as one carbon–hydrogen bond with Pro61 of WT-hDHFR (Figure 6A, Table 5). Furthermore, MTX established π-interactions with Ile7, Ala9, Leu22, Phe34 and Ile60 and showed van der Waals contacts with Val8, Asp21, Phe31, Arg32, Tyr33, Gln35, Thr56, Ser59, Leu67, Lys68, Tyr121 and Thr136 (Table 5).

**Figure 6.** Molecular interactions analyses. The reference inhibitor MTX and Hit compound interacted with essential residues in the active site of hDHFR. MTX in WT hDHFR (**A**), Hit in WT hDHFR (**B**), MTX in MT hDHFR (**C**) and Hit in MT hDHFR (**D**) are depicted as light blue, dark blue, pink, and magenta-colored stick representation. The H-bond forming residues of hDHFR are displayed as a brown stick model. H-bonding and bond distances are represented as green dashed lines and measured in angstrom (Å), respectively.


**Table 5.** Molecular interactions between the ligands (MTX and hit compound) and the active site residues of WT and MT hDHFR.

In the representative structure of MT-hDHFR, which was obtained before the disruption of MTX binding, molecular interactions were examined to analyze how MTX binding differs from that in wild type hDHFR, leading us to speculate about the structural basis of resistance to MTX in mutant hDHFR. Accordingly, MTX was shown to form H-bond interactions with Ile7, Glu30, Arg31, Asn64, Lys68, Val115 and Tyr121 and van der Waals interactions with Asp21, Phe34, Tyr33, Glu35, Thr56, Pro61, Arg70, Phe134 and Thr136 (Figure 6C, Table 5). Our findings also indicate that MTX formed carbon–hydrogen bonds with Val8, Leu67, Ser59, Lys68 and π-interactions with Asp21, Phe34, Tyr33, Glu35, Thr56, Pro61, Arg70, Phe134 and Thr136. The Hit compound in complex with WT-hDHFR formed H-bonds with Ile7, Glu30, Gln35, Asn64 (2), Arg70 and Val115, as well as carbon–hydrogen bonds with Pro61 and Lys68 (Figure 6B, Table 5). Additionally, Hit showed van der Waals interactions with the hydrophobic pocket residues of WT-hDHFR such as Val8, Asp21, Phe31, Tyr33, Phe34, Thr56, Ser59, Leu67 and Thr136, as well as π-interactions with Ile7, Ala9, Leu22 and Ile60 (Table 5). In the case of MT-hDHFR, the Hit compound established H-bonds with Ile7, Glu30, Arg31, Ser59, Asn64, Arg70, Val115 and Tyr121 (Figure 6D, Table 5). The Hit compound showed hydrophobic van der Waals interactions with the Val8, Asp21, Arg28, Arg32, Phe34, Glu35, Thr56, Pro61, Leu67 and Thr136 residues of MT-hDHFR, while π-interactions were formed with Ile7, Ala9, Leu22, Arg31 and Ile60. The residue Ser59 also exhibited a carbon–hydrogen bond with the C11 atom in addition to a conventional H-bond with the O13 atom of the Hit molecule. This conventional H-bond was formed only by Hit in the MT-hDHFR binding site. The total number of intermolecular H-bonds of WT- and MT-hDHFR in complex with MTX and Hit was calculated throughout the simulation period. Our results showed that the Hit compound formed an average number of H-bonds with WT- and MT-hDHFR comparable to that of MTX (reference) in WT-hDHFR. Since MTX binds very weakly to MT-hDHFR, it could not maintain its average number of H-bonds after 38.9 ns (Figure 4C), which corroborated the results obtained from the RMSD plots. Our results suggest that Hit (MTX analog) is capable of binding tightly with the wild type as well as the MTX-resistant F31R/Q35E hDHFR variant.

#### *2.7. Binding Free Energy Calculations for MTX and Hit Compound*

MM/PBSA binding free energies were calculated for MTX and Hit in complex with WT- and MT-hDHFR. The binding free energies of MTX and Hit in complex with WT-hDHFR were −646.76 kJ/mol and −642.12 kJ/mol, respectively. In complex with MT-hDHFR, MTX yielded only −49 kJ/mol, while the Hit compound attained −571.38 kJ/mol. The binding free energy evaluations support our finding that the Hit molecule is tightly bound to WT- and MT-hDHFR, with free energies comparable to that of MTX in complex with WT-hDHFR. The decomposition analysis of the binding free energy indicated that electrostatic and van der Waals forces contribute significantly to hDHFR inhibition (Figure 7, Table 6).
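The effect of the double mutation on each ligand can be made explicit with simple arithmetic on the four reported energies. The sketch below is illustrative only: the dictionary layout and the function name `mutation_penalty` are hypothetical, and the difference is a plain subtraction of the reported MM/PBSA values, not an additional calculation from the study.

```python
# MM/PBSA binding free energies reported in the text (kJ/mol).
BINDING_ENERGY_KJ_MOL = {
    ("MTX", "WT"): -646.76,
    ("MTX", "MT"): -49.0,
    ("Hit", "WT"): -642.12,
    ("Hit", "MT"): -571.38,
}


def mutation_penalty(ligand, energies=BINDING_ENERGY_KJ_MOL):
    """Change in binding free energy on going from WT to MT hDHFR.

    A positive value means that binding is weaker in the mutant.
    """
    return round(energies[(ligand, "MT")] - energies[(ligand, "WT")], 2)


for ligand in ("MTX", "Hit"):
    print(ligand, mutation_penalty(ligand), "kJ/mol")
```

Read this way, the mutation costs MTX almost all of its calculated binding energy, while the Hit compound retains most of it.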


**Figure 7.** Binding free energy analyses. (**A**) Graphical representation of MM/PBSA estimated binding free energy of wild type and mutant hDHFR in complex with MTX (reference) and Hit compound throughout the simulation time. The reference compound is depicted as light blue and dark blue for wild type and mutant hDHFR, respectively. The Hit compound is shown in pink and magenta colors for wild type and mutant hDHFR, respectively. (**B**) The binding free energy decomposition analysis of the final hits in the active site of hDHFR infers that the Hit compound was comparably strongly bound with WT and MT hDHFR, while MTX lost its binding with the mutant structure.



<sup>a</sup> MTX: methotrexate as reference inhibitor. <sup>b</sup> SASA: Solvent accessible surface area.

Altogether, our results verified that the newly identified MTX-analog fitted favorably into the active site of wild type and double mutant hDHFR and formed polar and nonpolar interactions with the catalytic active site residues.

The 2D structure of the Hit compound, which differs from MTX by a carbon-oxygen group (C11-C12-O13) added to the *p*-ABA moiety, is illustrated in Figure 8.

**Figure 8.** (**A**) 2D structure of MTX (**B**) 2D structure of Hit compound (MTX analog, ZINC ID: ZINC000013508844).

#### **3. Discussion**

Chemotherapeutics are very effective in the treatment of cancers, but drug resistance is often a limiting factor. Acquired resistance is the type of drug resistance that can develop through various adaptive responses, such as mutations, increased expression of the therapeutic target and activation of alternative compensatory signaling pathways arising over the course of the treatment of tumors [43]. Human dihydrofolate reductase (hDHFR) is the enzyme that catalyzes the reduction of 7,8-dihydrofolate (DHF) to 5,6,7,8-tetrahydrofolate, which is crucial for DNA synthesis and cell proliferation [44]. Therefore, hDHFR has been widely used as a target for cancer therapeutics for several decades [45]. Methotrexate is a well-known inhibitor that displays a high affinity for hDHFR, but mutation in the active site of hDHFR results in the loss of MTX binding [31].

The present study aimed to identify an analog of methotrexate that was capable of binding tightly to, and hence inhibiting, wild type and doubly mutant hDHFR (F31R/Q35E) by employing several computational methods, including structure-based pharmacophore modeling, virtual screening, molecular docking and molecular dynamics simulation studies [46]. Structure-based pharmacophore models of the crystal structures of wild type (PDB ID: 1U72) and variant (PDB ID: 3EIG) hDHFR in complex with methotrexate were obtained, with four features in each model. The best pharmacophore model for each structure was selected by analyzing the inclusion of key residues in pharmacophoric features and the sensitivity of the models in retrieving true positive compounds, as indicated by the highest area under the ROC curve. The selected pharmacophore models, termed WT-pharma and MT-pharma for the wild type and mutant hDHFR structures, respectively, were rationally assessed for the inclusion of the conserved hydrogen bond residue Glu30 and other key residues, such as Asn64, Arg70 and Val115 [44,47]. Further, for each pharmacophore model, the ROC curve was plotted from the numbers of false positive (FP) and true positive (TP) compounds retrieved by that model from the datasets of 46 active and 24 inactive compounds. Higher AUC values in the ROC curves indicate greater sensitivity of WT-pharma and MT-pharma in retrieving actives, and greater specificity in rejecting inactives [48,49]. Using ZINC<sup>15</sup>, Mayorga et al. found a high number of compounds when they utilized a small fragment of the original structure [50]. In our study, to explore analogs of MTX, we used the full structure of MTX and selected the 'in vitro', 'in vivo' and 'clean' options in the 'subsets' section, which resulted in the retrieval of only 32 analogous compounds (Table S1).
An ADMET assessment test and Lipinski's Rule of five were used to scrutinize the compounds downloaded from ZINC<sup>15</sup> for their drug-like properties, and eight compounds satisfied the required criteria to qualify as lead compounds [51,52]. The validated pharmacophore models of the wild type and mutant structures of hDHFR were applied as 3D queries for virtual screening of the drug-like compounds to find those capable of binding with both wild type and mutant hDHFR [53]. A molecular docking study was employed to inspect the most suitable and consistent binding mode of the molecules in the binding sites of the receptor proteins. Consequently, the best binding modes obtained from docking, based on scoring functions and key interactions with the active site residues of wild type and mutant hDHFR, were used in MD simulations to assess their stability [54]. The RMSD plots indicated that the Hit compound showed similar modes of interaction in the wild type and mutant hDHFR active sites as MTX in the active site of wild type hDHFR. Specifically, the average RMSD profiles (<0.25 nm) obtained for the protein-ligand complexes of Hit with wild type and mutant hDHFR indicated that the systems were uniform and compact, as the stability of a system can be inferred from an RMSD value of less than 0.3 nm [40,41]. The RMSD plot for MTX in complex with mutant hDHFR showed an abrupt fluctuation after 38.9 ns, which indicated the loss of MTX binding with the active site of hDHFR. Furthermore, a high RMSF value (2.34 nm) of MTX (residue 187) indicated a loss of ligand binding with the mutant hDHFR protein only. Our results showed that the Hit compound established stable H-bonds with the active site residues of wild type and mutant hDHFR. Similar to the molecular interactions of MTX (reference compound), most H-bonds with hDHFR active site residues were formed by the pterin moiety and the α-glutamate moiety, while the *p*-aminobenzoic acid (*p*-ABA) moiety formed mainly hydrophobic interactions [31]. The conserved hydrogen bond with the OE1 atom of the catalytic residue Glu30 was formed with the pterin moiety of the Hit molecule [42,55]. The additional oxygen atom in the structure of the Hit compound formed a hydrogen bond with Ser59 in both wild type and mutant hDHFR; Ser59 belongs to the coenzyme NADPH binding site [56]. We speculate that the hydrogen bond between Ser59 and the modified *p*-ABA moiety of the Hit compound contributed to the strong binding of the Hit compound with the mutant structure of hDHFR. Furthermore, the conserved hydrogen bonds formed by the α-carboxylate group of MTX with the side chains of Arg70 and Gln35, and by the *p*-aminobenzoyl keto group with Asn64, were also observed in the Hit compound's interactions with wild type hDHFR [26].

In the mutant structure, the substitution of Gln35 by Glu35 prevented the formation of this hydrogen bond because of the unfavorable close electrostatic contact between the negatively charged Glu35 side chain and the negatively charged glutamate moiety of MTX and the Hit compound. In contrast, Arg31, which replaced Phe31, was observed to form hydrogen bonds through its guanidinium group with the α-glutamate moiety of MTX. The Hit compound displayed a double hydrogen bond with Arg31; double hydrogen bonds are considered to be crucial for strong binding in protein-ligand interactions [57]. Furthermore, the hydrogen bond of the Arg70 side chain with the α-carboxylate group of MTX was lost in MT hDHFR, while Lys68 formed a hydrogen bond with the α-carboxylate group. On the other hand, the Hit compound retained the H-bond of the Arg70 side chain with the α-carboxylate group, as it did in the wild type. Moreover, the only hydrogen bond formed by the *p*-ABA group of MTX, with Asn64 in wild type hDHFR, shifted to an H-bond with the α-carboxylate group of MTX in mutant hDHFR. In contrast, the *p*-ABA group of the Hit compound formed a hydrogen bond with Asn64, as in the wild type, and an additional H-bond was formed between Ser59 and the oxygen atom added to the *p*-ABA group. Accordingly, a comparative analysis of the protein-ligand interactions of MTX and the Hit compound suggested that Hit (MTX-analog) may be capable of retaining its strong binding with WT and MT hDHFR. Additionally, the binding free energy evaluations performed by the MM/PBSA method also indicated that the complexes of WT and MT hDHFR with the Hit compound were comparably stable, like MTX in WT hDHFR; meanwhile, the binding free energy profile clearly reflects the loss of MTX binding in MT hDHFR.

#### **4. Materials and Methods**

#### *4.1. Structure Based Pharmacophore Modeling*

Ligand binding features were assessed from the structures of wild type (PDB ID: 1U72) and drug-resistant (PDB ID: 3EIG) human DHFR in complex with methotrexate, taken from the Protein Data Bank [26,31]. Pharmacophore models were generated using the *Receptor-ligand Pharmacophore Generation* module in Discovery Studio (DS) v.4.5 (Dassault Systèmes, BIOVIA Corp, San Diego, CA, USA). The FAST algorithm was applied for Conformation Generation, while the Fitting Method was set to Flexible. The Validation option was set to *True*, for which a set of 46 active and 24 inactive compounds, downloaded from BindingDB (https://www.bindingdb.org/bind/index.jsp), was used to generate a ROC curve for each pharmacophore model.
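As an illustration of the ROC-based validation described above, the following minimal sketch computes the area under the ROC curve from pharmacophore fit values of known actives and inactives. It assumes that a higher fit value means a compound is more likely to be active; the function name and the example scores are hypothetical, and the sketch does not reproduce the internal procedure of Discovery Studio.

```python
def roc_auc(active_scores, inactive_scores):
    """Area under the ROC curve, computed as the fraction of
    (active, inactive) pairs in which the active scores higher.
    Ties count as half a win (Mann-Whitney U formulation)."""
    if not active_scores or not inactive_scores:
        raise ValueError("need at least one active and one inactive score")
    wins = 0.0
    for a in active_scores:
        for i in inactive_scores:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5
    return wins / (len(active_scores) * len(inactive_scores))


# Hypothetical fit values, for illustration only.
actives = [3.2, 2.9, 2.5, 1.1]
inactives = [1.5, 0.9, 0.4]
print(f"AUC = {roc_auc(actives, inactives):.3f}")
```

An AUC of 1.0 corresponds to perfect separation of actives from inactives, while 0.5 corresponds to random ranking.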

#### *4.2. Decoy Test Validation*

The ability of the pharmacophore models to identify hDHFR inhibitors was assessed by the Guner-Henry method (decoy test method) [44]. A test set was prepared by collecting hDHFR inhibitors whose experimental activities (IC<sub>50</sub> values) were measured by the same biological assays. The test set was composed of active and inactive molecules of hDHFR. The selected pharmacophore models of wild type and mutant structures were employed as a 3D query to obtain the best-fitted molecules from the test set. Screening of the test set was executed by the Ligand Pharmacophore Mapping protocol embedded in DS. Accordingly, several parameters, such as the Guner-Henry (GH) score, enrichment factor (EF), percentage ratio of actives (%A), percentage yield of actives (%Y), false negatives and false positives, were calculated to determine the efficacy of WT-pharma and MT-pharma:

$$\begin{aligned} \text{EF} &= \frac{\text{H}_{\text{a}}/\text{H}_{\text{t}}}{\text{A}/\text{D}}\\ \text{GH} &= \left(\frac{\text{H}_{\text{a}}\left(3\text{A} + \text{H}_{\text{t}}\right)}{4\text{H}_{\text{t}}\text{A}}\right) \times \left[1 - \frac{\text{H}_{\text{t}} - \text{H}_{\text{a}}}{\text{D} - \text{A}}\right] \end{aligned} \tag{1}$$

where D is the total number of molecules in the data set, A specifies the total number of active compounds in the data set, H<sub>t</sub> indicates the total number of hits retrieved and H<sub>a</sub> refers to the number of actives present in the retrieved hits.
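The two metrics in Equation (1) can be evaluated directly from the four counts defined above. The sketch below is a minimal illustration; the function names and the example counts are hypothetical and are not taken from the study.

```python
def enrichment_factor(ha, ht, a, d):
    """EF = (Ha / Ht) / (A / D): how much richer in actives the hit list
    is compared with the full data set."""
    return (ha / ht) / (a / d)


def gh_score(ha, ht, a, d):
    """Guner-Henry score: combines yield and recall of actives and
    penalizes inactives retrieved among the hits."""
    yield_recall = ha * (3 * a + ht) / (4 * ht * a)
    inactive_penalty = 1 - (ht - ha) / (d - a)
    return yield_recall * inactive_penalty


# Hypothetical counts: 100 molecules, 20 actives, 25 hits, 18 true actives.
D, A, Ht, Ha = 100, 20, 25, 18
print(f"EF = {enrichment_factor(Ha, Ht, A, D):.2f}")
print(f"GH = {gh_score(Ha, Ht, A, D):.3f}")
```

A GH score of 1 is obtained only when every retrieved hit is active and every active is retrieved, which is why the score is commonly read as a goodness-of-hit measure between 0 and 1.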

#### *4.3. Methotrexate Analogs Generation*

The methotrexate structure was subjected to a similarity search in ZINC15 using the SMILES string of MTX. For the generation of analogs, the 'clean', in vivo and in vitro options were selected in the available range of Subsets to Check. Subsequently, the structures were downloaded in the SDF (structure-data file) format generated by the webserver for further computations in DS.

#### *4.4. Drug-Likeness Prediction and Virtual Screening*

The molecules retrieved from ZINC15 were evaluated with the ADMET and Lipinski's Rule of five assessment tools embedded in DS to identify drug-like compounds. Subsequently, the compounds exhibiting such properties were subjected to virtual screening with WT-pharma and MT-pharma. The compounds that fitted both pharmacophores were taken forward as screening hits for the molecular docking study.
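The Rule of five part of this filter can be sketched in a few lines. The example below assumes that the four descriptors have already been computed by another tool; the class name, the function names, the tolerated number of violations and the descriptor values are hypothetical, and the ADMET models used in DS are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    mol_weight: float   # molecular weight in Da
    logp: float         # octanol-water partition coefficient
    h_donors: int       # hydrogen bond donors
    h_acceptors: int    # hydrogen bond acceptors


def lipinski_violations(d):
    """Return the list of Rule of five criteria that a compound violates."""
    violations = []
    if d.mol_weight > 500:
        violations.append("molecular weight > 500 Da")
    if d.logp > 5:
        violations.append("logP > 5")
    if d.h_donors > 5:
        violations.append("more than 5 H-bond donors")
    if d.h_acceptors > 10:
        violations.append("more than 10 H-bond acceptors")
    return violations


def is_drug_like(d, max_violations=1):
    # Assumption: at most one violation is tolerated; a stricter
    # protocol could set max_violations=0.
    return len(lipinski_violations(d)) <= max_violations


# Hypothetical descriptor values, for illustration only.
candidate = Descriptors(mol_weight=350.0, logp=2.1, h_donors=2, h_acceptors=5)
print(is_drug_like(candidate), lipinski_violations(candidate))
```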

#### *4.5. Molecular Docking Simulation*

Docking was performed with the Genetic Optimisation for Ligand Docking (GOLD) package v5.2.2 (The Cambridge Crystallographic Data Centre, Cambridge, United Kingdom). GOLD provides full flexibility of ligands and limited flexibility of the protein; hence, it delivers more reliable results in computational biology. The crystal structures of wild type (PDB ID: 1U72) and variant (PDB ID: 3EIG) hDHFR in complex with methotrexate were taken from the Protein Data Bank. The wild type and variant structures of hDHFR were prepared for docking by eliminating water molecules in DS. The Chemistry at HARvard Macromolecular Mechanics (CHARMm) force field was applied to add hydrogen atoms to the structures of hDHFR. The binding sites of wild type and mutant hDHFR were defined within a radius of 9 Å of the bound inhibitor (MTX) using the *Define and Edit Binding Site* module embedded in DS. During docking, the MTX-analogs retrieved from virtual screening, along with methotrexate as reference, were treated as ligand molecules. The ChemPLP (Piecewise Linear Potential) score and the ASP (Astex Statistical Potential) score were used as the scoring and rescoring functions, respectively. ChemPLP is the default scoring function in GOLD and is empirically optimized for pose prediction. It models the steric complementarity between protein and ligand, together with distance- and angle-dependent hydrogen and metal bonding terms as well as heavy-atom clash and torsional potentials. The ASP scoring function measures the atom-atom potential and has similar precision to the Chemscore and Goldscore fitness functions [58,59]. During docking, the genetic algorithm (GA) was run to produce 100 poses for each drug-like molecule. The bound ligand (MTX) was employed as a reference compound throughout the analyses. Cluster analysis was performed to identify hit compounds exhibiting a higher dock score than the cut-off (the dock score of the reference molecule).
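The cut-off step at the end of this protocol can be illustrated with a simplified sketch that keeps only ligands whose best-scoring pose exceeds the best score of the reference compound. This is a deliberate simplification: it ignores the cluster analysis used in the study, and the function name, ligand names and scores are hypothetical.

```python
def select_hits(pose_scores, reference_name="MTX"):
    """Return ligand names whose best pose score is higher than the best
    score of the reference compound, sorted from best to worst.

    pose_scores maps a ligand name to the fitness scores of its poses
    (higher is better, as for ChemPLP fitness values)."""
    best = {name: max(scores) for name, scores in pose_scores.items() if scores}
    cutoff = best[reference_name]
    hits = [n for n, s in best.items() if n != reference_name and s > cutoff]
    return sorted(hits, key=lambda n: best[n], reverse=True)


# Hypothetical scores, for illustration only.
scores = {
    "MTX": [80.0, 85.0],
    "analog_1": [90.0, 70.0],
    "analog_2": [84.0, 79.5],
}
print(select_hits(scores))
```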

#### *4.6. Molecular Dynamics (MD) Simulation*

Molecular dynamics simulations were performed using the CHARMm36 all-atom force field [60] in the GROningen MAchine for Chemical Simulations (GROMACS) v5.0.6 package [61]. For every protein-ligand complex, an independent simulation system was generated. The topology and coordinate files for MTX and the docking hits were generated using SwissParam [62]. The transferable intermolecular potential with three points (TIP3P) water model in a cubic box was used for solvation of each system. Solvent molecules were substituted with sodium ions (Na<sup>+</sup>) to neutralize the total charge of the simulation systems. The energies of the systems were minimized by applying the steepest descent algorithm, where the maximum force was kept below 10 kJ/mol, in order to avoid any bad contacts likely to occur in the production run. The systems were then equilibrated in two steps. First, equilibration at constant number of particles, volume and temperature (NVT) was carried out for 100 ps at 300 K. The temperature of the system was kept constant using the V-rescale thermostat [63]. In the second phase, a 100 ps equilibration was performed at constant number of particles, pressure (1 bar) and temperature (300 K) (NPT) [64]. Following this protocol, all systems were subjected to production runs. In short, bond lengths of heavy atoms were constrained using the Linear Constraint Solver (LINCS) algorithm [65]. The Particle Mesh Ewald (PME) method was employed to calculate the long-range electrostatic interactions [66]. The cut-off for short-range interactions was set to 12 Å. All simulations were performed with periodic boundary conditions to mimic infinite systems. Coordinates were saved every 10 ps. Finally, visualization and analysis of the results were performed using GROMACS and DS.
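When such trajectories are analyzed, the stability criterion used in this work (an average RMSD below 0.3 nm) can be checked with a short script. The sketch below assumes that the RMSD time series has already been exported as two lists; the function name, the dictionary keys and the example values are hypothetical.

```python
from statistics import mean


def summarize_rmsd(times_ns, rmsd_nm, threshold_nm=0.3):
    """Summarize an RMSD time series: its mean, whether the mean is below
    the threshold, and the first time point at which the threshold is
    exceeded (None if it never is)."""
    if len(times_ns) != len(rmsd_nm) or not rmsd_nm:
        raise ValueError("times_ns and rmsd_nm must be non-empty and equally long")
    first_exceeded = next(
        (t for t, r in zip(times_ns, rmsd_nm) if r > threshold_nm), None
    )
    average = mean(rmsd_nm)
    return {
        "mean_rmsd_nm": average,
        "mean_below_threshold": average < threshold_nm,
        "first_time_above_ns": first_exceeded,
    }


# Hypothetical series, for illustration only.
summary = summarize_rmsd([0.0, 10.0, 20.0, 30.0], [0.10, 0.20, 0.15, 0.45])
print(summary)
```

Reporting the first time point above the threshold alongside the mean helps to distinguish a uniformly stable complex from one that drifts late in the run.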

#### *4.7. Binding Free Energy Calculations*

The binding free energy was calculated by employing the Molecular Mechanics/Poisson-Boltzmann Surface Area (MM/PBSA) method [67]. Following the MM/PBSA protocol to compute free energies of the protein-ligand complex, equidistant snapshots of the hDHFR-ligand complex were extracted. The binding free energy of the protein-ligand complex is stated as:

$$
\Delta G_{\text{binding}} = G_{\text{complex}} - \left(G_{\text{protein}} + G_{\text{ligand}}\right) \tag{2}
$$

where *G*<sub>complex</sub> denotes the total free energy of the complex, and *G*<sub>protein</sub> and *G*<sub>ligand</sub> specify the free energies of the protein and the ligand in their unbound states.

Free energy can be defined as:

$$G_X = E_{\text{MM}} + G_{\text{solvation}} \tag{3}$$

where *X* can be a protein, a ligand, or their complex. *E*<sub>MM</sub> signifies the average molecular mechanics potential energy in vacuum, while *G*<sub>solvation</sub> indicates the free energy of solvation.

Accordingly, molecular mechanics potential energy in vacuum can be calculated by implementing the equation:

$$E_{\text{MM}} = E_{\text{bonded}} + E_{\text{nonbonded}} = E_{\text{bonded}} + \left(E_{\text{vdw}} + E_{\text{elec}}\right) \tag{4}$$

*E*<sub>bonded</sub> denotes the bonded interactions, while *E*<sub>nonbonded</sub> denotes the nonbonded interactions. The value of ∆*E*<sub>bonded</sub> is generally treated as zero.

The solvation free energy is the sum of the electrostatic (*G*<sub>polar</sub>) and apolar (*G*<sub>nonpolar</sub>) terms:

$$G_{\text{solvation}} = G_{\text{polar}} + G_{\text{nonpolar}} \tag{5}$$

Here, the Poisson-Boltzmann (PB) equation is used to compute *G*<sub>polar</sub>, while *G*<sub>nonpolar</sub> is calculated from the solvent-accessible surface area (SASA) as:

$$G_{\text{nonpolar}} = \gamma \, \text{SASA} + b \tag{6}$$

where γ represents the coefficient of solvent surface tension and *b* is a fitting parameter, whose values are 0.02267 kJ/mol/Å<sup>2</sup> and 3.849 kJ/mol, respectively.
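Equations (2) to (6) can be combined into a short numerical sketch. The example below uses the γ and *b* values quoted above and treats the bonded term as zero, as stated in the text; the function names and all energy and SASA inputs are hypothetical, and the polar term is assumed to come from a separate Poisson-Boltzmann calculation.

```python
GAMMA = 0.02267  # kJ/mol/Å^2, surface tension coefficient quoted in the text
B = 3.849        # kJ/mol, fitting parameter quoted in the text


def g_nonpolar(sasa):
    """Equation (6): nonpolar solvation energy from SASA (in Å^2)."""
    return GAMMA * sasa + B


def free_energy(e_vdw, e_elec, g_polar, sasa):
    """Equations (3) to (5) with the bonded term treated as zero:
    G_X = (E_vdw + E_elec) + G_polar + G_nonpolar."""
    return (e_vdw + e_elec) + g_polar + g_nonpolar(sasa)


def binding_free_energy(g_complex, g_protein, g_ligand):
    """Equation (2): free energy of binding."""
    return g_complex - (g_protein + g_ligand)


# Hypothetical inputs in kJ/mol and Å^2, for illustration only.
g_c = free_energy(e_vdw=-900.0, e_elec=-3000.0, g_polar=1500.0, sasa=9000.0)
g_p = free_energy(e_vdw=-700.0, e_elec=-2800.0, g_polar=1400.0, sasa=9200.0)
g_l = free_energy(e_vdw=-20.0, e_elec=-100.0, g_polar=60.0, sasa=600.0)
print(f"dG_binding = {binding_free_energy(g_c, g_p, g_l):.2f} kJ/mol")
```

In practice, each term would be averaged over the extracted snapshots before Equation (2) is applied.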

#### **5. Conclusions**

In the current study, structure-based pharmacophore modeling, virtual screening, molecular docking and molecular dynamics simulation methods were utilized to identify a potential inhibitor capable of strong binding with wild type as well as drug-resistant (mutant) hDHFR. Structure-based pharmacophore models for WT and MT hDHFR in complex with MTX were generated and validated by the decoy test and ROC curve. Methotrexate analogs were obtained from ZINC15 using the MTX structure as a query, and were subjected to ADMET and Lipinski's Rule of five assessment tests to evaluate their drug-likeness. The drug-like compounds were used in virtual screening with the validated WT and MT pharmacophore models as 3D queries to identify potential hits of wild type and mutant hDHFR. The compounds obtained from virtual screening were docked in the active sites of WT and MT hDHFR. Subsequently, through analysis of the docking results, one compound was found to have a higher dock score than the reference compound (MTX), displaying essential molecular interactions with key residues of the hDHFR active site. Furthermore, MD simulation and binding free energy calculations for the Hit compound and MTX in complex with WT and MT hDHFR were used to evaluate the stability of the Hit compound with WT and MT hDHFR. Taken together, our findings indicate that the MTX analog (ZINC000013508844) is a potential inhibitor of wild type hDHFR and of the drug-resistant F31R/Q35E variant of hDHFR. In future work, we will try to synthesize the Hit compound and verify our findings through bioassays in collaboration with an experimental lab. These findings can also be extended to assess other drug-resistant hDHFR variants for cancer therapeutics.

**Supplementary Materials:** Available online. Table S1: 2D structures and ZINC IDs of Methotrexate analogs, Figure S1: Superimposition of X-ray structure pose and docked pose of MTX.

**Author Contributions:** Conceptualization, K.W.L.; Data curation, R.M.R., A.Z., S.P.; Formal analysis, N.B.A., S.R., A.Z. and S.P.; Investigation, R.M.R. and Y.K.; Methodology, R.M.R., S.R., G.L., and S.Y.; Project administration, K.W.L.; Supervision, K.W.L.; Validation, S.Y., G.L., D.K.; Visualization, R.M.R.; Writing and editing, R.M.R., N.B.A., S.R. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was supported by the Bio and Medical Technology Development Program of the National Research Foundation (NRF) and funded by the Korean government (MSIT) (No. NRF-2018M3A9A7057263). Also supported by Basic Science Research Program (2017R1D1A3B03035738) through the National Research Foundation of Korea (NRF) funded by the Ministry of Education of Republic of Korea.

**Conflicts of Interest:** The authors declare no competing interests. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **References**


**Sample Availability:** Samples of the compounds are not available from the authors.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

*Article*

### **Structure-Based Discovery of Dual-Target Hits for Acetylcholinesterase and the α7 Nicotinic Acetylcholine Receptors: In Silico Studies and In Vitro Confirmation**

### **Sebastian Oddsson 1,2,3 , Natalia M. Kowal 1,2,3, Philip K. Ahring 2,3, Elin S. Olafsdottir <sup>1</sup> and Thomas Balle 2,3,\***


Academic Editors: Marco Tutone and Anna Maria Almerico Received: 3 June 2020; Accepted: 15 June 2020; Published: 22 June 2020

**Abstract:** Despite extensive efforts in the development of drugs for complex neurodegenerative diseases, treatment often remains challenging or ineffective, and hence new treatment strategies are necessary. One approach is the design of multi-target drugs, which can potentially address the complex nature of disorders such as Alzheimer's disease. We report a method for high-throughput virtual screening aimed at identifying new dual-target hit molecules. One of the identified hits, *N*,*N*-dimethyl-1-(4-(3-methyl-[1,2,4]triazolo[4,3-a]pyrimidin-6-yl)phenyl)ethan-1-amine (Ýmir-2), has dual activity as an acetylcholinesterase (AChE) inhibitor and as an α7 nicotinic acetylcholine receptor (α7 nAChR) agonist. Using computational chemistry methods, parallel and independent screening of a virtual compound library consisting of 3,848,234 drug-like and commercially available molecules from the ZINC15 database resulted in an intersecting set of 57 compounds that potentially possess activity at both protein targets. Based on ligand efficiency as well as scaffold and molecular diversity, 16 of these compounds were purchased for in vitro validation by Ellman's method and two-electrode voltage-clamp electrophysiology. Ýmir-2 was shown to exhibit the desired activity profile (AChE IC<sub>50</sub> = 2.58 ± 0.96 µM; α7 nAChR activation = 7.0 ± 0.9% at 200 µM), making it the first reported compound with this particular profile and providing further evidence of the feasibility of in silico methods for the identification of novel multi-target hit molecules.

**Keywords:** high-throughput virtual screening; dual-target lead discovery; neurodegenerative disorders; Alzheimer's disease; dual mode of action; multi-modal; nicotinic acetylcholine receptor; acetylcholinesterase; molecular docking

#### **1. Introduction**

The development of treatments for neurodegenerative disorders such as Alzheimer's disease (AD), the most prevalent type of dementia, is a pressing matter due to the incurable, progressive, and debilitating nature of the disease [1]. The situation is further aggravated by the ageing of populations worldwide, as reflected by the estimated tripling of the number of people affected by AD from 50 million in 2018 to 152 million in 2050 [2]. Neuropathologically, AD is characterized by its accompanying lesions, most notably but not exclusively by amyloid plaques, neurofibrillary tangles, and neuronal and synaptic loss [3,4], that lead to a variety of neurochemical changes, prominently in the cholinergic system [5]. Taken together, these alterations are the cause of cognitive symptoms such as memory loss, language impairment, visuospatial dysfunction, and executive functioning issues [6], which are often accompanied by other behavioral and psychological symptoms (BPSD) [7,8].

Despite extensive research efforts to understand AD, its causes and the underlying disease mechanisms remain poorly understood. This renders the search for effective drugs difficult. AD treatments have yet to succeed in slowing down disease progression and AD medications currently approved in the US are palliative in nature but nevertheless continue to provide the biggest benefits for patients [9]. The drugs fall into two categories, acetylcholinesterase (AChE) inhibitors (galantamine (**1**), donepezil (**2**), rivastigmine (**3**); Figure 1) that raise synaptic acetylcholine (ACh) levels and memantine (**4**), an NMDA receptor antagonist that regulates glutamatergic activity [10,11].

**Figure 1.** Structures of currently FDA-approved drugs for the treatment of AD, galantamine (**1**), donepezil (**2**), rivastigmine (**3**) and memantine (**4**). Donepezil and memantine are also approved as a combination therapy.

Investigations into the effects of administering memantine in conjunction with AChE inhibitors started about two decades ago [12,13] and led to the approval of Namzaric, which is a combination therapy relying on fixed doses of donepezil and memantine [14]. Other combination therapies are also of interest, e.g., drugs with pro-cognitive effects combined with anti-inflammatory drugs [15]. In fact, treatments that target more than one disease mechanism have become an emergent therapeutic strategy, especially for the treatment of complex, multifactorial diseases [16]. Besides addressing their intricate nature, the combination of two molecules with different activities in one drug formulation can result in drug synergism, as well as prevent unwanted compensatory mechanisms and drug tolerance, for instance [17].

An alternative to combination therapies is multi-modal compounds [18,19]. These compounds have essentially the same advantages as drug combinations but have, analogously to single molecule therapies, an easier therapeutic regimen, fewer drug-drug interactions and no differences in pharmacological properties to be considered [17,19]. However, the search for multi-target ligands also poses new challenges [20]. This conundrum is commonly tackled by the synthesis of bivalent hybrid ligands [21], but computational approaches can also be used to screen for novel ligands, i.e., by exploiting pharmacophore models underlying a set of known reference ligands or models of the corresponding binding sites [22]. Virtual screening (VS) methods allow for the screening of large compound databases against these models within a reasonable timeframe at comparably low cost. A prerequisite is the availability of enough data of adequate quality in order to construct predictive models. With the number of published X-ray structures, ligand datasets, and the ever-increasing size of screening compound databases, these strategies are becoming increasingly more feasible [23]. Nevertheless, VS for multi-target-directed ligands is not as commonly used as might be expected due to the complexity of the task and a lack of established protocols [23].

In previous work, we embarked on the search for AChE inhibitors that concurrently increase receptor activation at pro-cognitive nicotinic acetylcholine receptors (nAChRs) in a subtype-specific manner [24]. In addition to the already established benefits of increasing synaptic ACh levels by AChE inhibitors, the activity profile would thus extend to direct modulation of the cholinergic system. Since the dosage of AChE inhibitors is often limited due to adverse effects, boosting the activity of a pro-cognitive receptor such as the α7 nAChR [25] could increase the overall treatment efficiency. Importantly, we established the feasibility of VS for the identification of compounds targeting AChE as well as the α7 nAChR [24]. Top-scoring VS hits obtained from one target were subsequently screened against the second target, which allowed for the identification of dual-activity compounds. However, this approach introduced a bias towards the first target screened, and none of the screening hits showed agonistic activity at the α7 nAChR. Therefore, in the present study, the size of the screening database was increased ~45-fold by shifting the focus from natural products and their derivatives to drug-like molecules. Furthermore, the entirety of the screening database was screened against both targets individually. Donepezil and a recently published α7 nAChR agonist were selected as reference compounds, and structure-based, parallel and independent VS was performed. Here, we present the new protocol along with in silico and in vitro results, which show that it is feasible to identify multi-target compounds with the desired activity profile.

#### **2. Results**

Our goal was to find new compounds with dual activity: (i) inhibitors of AChE and (ii) agonists of the α7 nAChR, as illustrated in Figure 2. The approach taken was to screen a pre-filtered compound database against an AChE X-ray structure and an α7 nAChR homology model in parallel using identical parameters. The intersecting set between the two independently obtained sets of VS hits was considered to contain potential bimodal compounds, as illustrated by the Venn diagram in Figure 2. A subset of these compounds was selected for in vitro testing at human α7 nAChR expressed in *Xenopus laevis* oocytes by two-electrode voltage-clamp electrophysiology and at the human recombinant AChE using Ellman's colorimetric method [26] to validate the in silico predictions.

**Figure 2.** Workflow of the parallel HTVS, hit selection and in vitro evaluation. Two protein targets were selected for in silico studies and protein models suitable for docking prepared. The same database of compounds was then individually screened against each model using identical parameters. After post-processing, common compounds from the two independent screening hit lists were used to identify compounds destined for in vitro testing.

#### *2.1. Protein Structures and Homology Modeling*

Based on structure-activity considerations for AChE inhibitors and α7 nAChR agonists, an X-ray structure of AChE co-crystallized with donepezil (**2**) determined to a resolution of 2.35 Å (PDB 4EY7) was selected for the high-throughput virtual screening (HTVS) [27]. Since the structure of α7 nAChR has not been determined to date, a homology model was constructed using an (α4)<sub>2</sub>(β2)<sub>3</sub> nAChR structure with a resolution of 3.94 Å (PDB 5KXI) [28] as primary template, augmented with an additional α4 subunit to facilitate modelling of an α7 homopentamer. In addition, an acetylcholine binding protein (AChBP) from *Lymnaea stagnalis* co-crystallized with *N*<sup>4</sup>,*N*<sup>4</sup>-bis[(pyridin-2-yl)methyl]-6-(thiophen-3-yl)pyrimidine-2,4-diamine (Compound 44; α7-pEC<sub>50</sub> = 6.3) from a recently published ligand series (PDB 5J5F) [29] served to define and bias the binding site. This structure was chosen because donepezil is an extended, linear structure like Compound 44 (**5**), as illustrated in Figure 3. We hypothesized that including the AChBP structure with its co-crystallized ligand as an additional template in the model building process would result in an α7 nAChR model in which the binding pocket would be opened up in a way that would facilitate the docking of extended molecules resembling donepezil in size and shape, thus enhancing the probability of identifying dual active molecules.

**Figure 3.** (**A**) Superposition of reference ligands in their bioactive, co-crystallized conformations. AChE inhibitor Donepezil from PDB 4EY7 is shown in blue and α7 nAChR agonist Compound 44 from PDB 5J5F in green. Atoms of the rings shown in bold in Panel B were used for superimposition of the ligands. (**B**) Chemical structures of Donepezil and Compound 44.

We selected the homology model with the lowest Discrete Optimized Protein Energy (DOPE) score [30] and inspected its Φ, Ψ angle distributions (Figure 4). A small fraction of the modelled amino acid residues occupies energetically unfavorable regions of the plot. Since all these residues are in non-conserved loop regions of the protein distant from the binding site, the model was deemed suitable for docking. Both the α7 nAChR model and AChE structure were validated by docking the respective reference ligands (RMSD Compound 44 = 0.87 Å, Donepezil = 0.62 Å).

**Figure 4.** Detailed Ramachandran diagram of chains A and B of the α7 nAChR homology model by residue type. (**A**) The general case of non-glycine, non-proline and non-pre-proline residues is depicted. The special cases have either significantly fewer (**B**; glycine) or more conformational restraints (**C**,**D**; proline and pre-proline). The contours of glycine are twofold symmetrized based on the dataset from Lovell et al. [31].

#### *2.2. Screening Database*

To optimize the chances of success for the VS, we considered the importance of various factors that concern the compounds making up the screening database. The ZINC15 database [32], which is freely available and suited for VS, provides compound annotations, such as different degrees of commercial availability, and the ability to download specific subsets of the database based thereon, and was chosen as starting point. In a first step, a subset of 7,679,852 compounds was downloaded from ZINC15 (Figure 5) that corresponds to compounds which are "anodyne" (lowest reactivity), "in-stock" (highest commercial availability), have molecular weights (MWs) between 250 and 500 Da and a logP within the range of −1 to 5. This puts an emphasis on drug-likeness and on the ability to verify the results in the laboratory. In addition to the property filters built into ZINC15, another set of property and structure filters was applied. The property filters were based on Lipinski's rule of five [33] with the intention of enhancing the drug-likeness of the compounds, while the structural filters were more focused on reducing the probability of identifying reactive, assay-interfering or otherwise problematic ligands, as exemplified by REOS (rapid elimination of swill) [34] or PAINS (pan-assay interference compounds) [35] filters. Roughly half of the initial set was discarded during this extensive filtering process. To finalize the screening database, the remaining 3,848,234 compounds were prepared for docking, resulting in 5,213,053 entities. At this stage, the total charge per molecule was limited to 0 or 1, predicated on the knowledge that nicotinic receptor ligands in most instances contain a basic nitrogen atom crucial for binding.
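The property-based part of this database preparation can be sketched as a simple predicate over precomputed descriptors. The example below encodes only the molecular weight, logP and total charge windows stated above; the class name, the compound identifiers and the descriptor values are hypothetical, and the availability, reactivity, REOS and PAINS filters are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    name: str          # hypothetical identifier
    mol_weight: float  # Da
    logp: float
    charge: int        # total formal charge


def passes_property_filters(c):
    """Molecular weight 250-500 Da, logP between -1 and 5, charge 0 or 1."""
    return (
        250 <= c.mol_weight <= 500
        and -1 <= c.logp <= 5
        and c.charge in (0, 1)
    )


# Hypothetical library, for illustration only.
library = [
    Compound("cmpd_a", 310.4, 2.3, 1),
    Compound("cmpd_b", 520.7, 3.1, 0),   # too heavy
    Compound("cmpd_c", 275.0, 1.0, -1),  # negatively charged
]
kept = [c.name for c in library if passes_property_filters(c)]
print(kept)
```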

**Figure 5.** Schematic overview depicting the number of molecules during database preparation, starting from the compound database ZINC15 (bucket) through various filtering stages (funnel) to the final database used for VS (flask). States refers to the number of different entities for which 3D coordinates were generated and includes protonation states and tautomers.

#### *2.3. Virtual Screening*

The database was docked against both protein targets in parallel using the Glide-based Schrödinger virtual screening workflow and identical parameters. Following three rounds of docking, the number of compounds was narrowed down to approximately 15,000 for each target. After post-processing, the compounds below the 0.5 quantile of the normalized ligand efficiencies were considered reasonable VS hits for each target, and the intersection between the two VS results was then investigated in more detail (Figure 6). The median set of the putative ligands consisted of 4734 and 3824 compounds for AChE and α7 nAChR, respectively, with 57 compounds shared between the two sets. The docking poses and the structural diversity of these compounds were then analyzed in order to select 15 compounds for in vitro testing. Notably, all compounds contained a basic nitrogen and at least one aromatic ring. Several hits were analogous structures; hence, compounds were prioritized for in vitro testing based on the ligand efficiency values from docking and the observed molecular interactions in the docked ligand poses, such as hydrogen bonding to TRP-149 of the α7 nAChR, while also ensuring that no close derivatives were present in the final set.
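The hit selection described above, which keeps the compounds below the 0.5 quantile for each target and then intersects the two sets, can be sketched as follows. The sketch assumes that normalized ligand efficiencies are already available per compound and that more negative values are more favorable; the function names, compound names and values are hypothetical.

```python
from statistics import median


def hits_below_median(le_by_compound):
    """Compounds whose normalized ligand efficiency lies below the
    0.5 quantile (the median) of all values for one target."""
    cutoff = median(le_by_compound.values())
    return {name for name, le in le_by_compound.items() if le < cutoff}


def dual_target_hits(ache_le, a7_le):
    """Intersection of the two independently obtained hit sets."""
    return hits_below_median(ache_le) & hits_below_median(a7_le)


# Hypothetical normalized ligand efficiencies, for illustration only.
ache = {"a": -0.5, "b": -0.4, "c": -0.3, "d": -0.2}
a7 = {"a": -0.2, "b": -0.6, "c": -0.5, "d": -0.1}
print(sorted(dual_target_hits(ache, a7)))
```

Because each target is filtered independently before the intersection is taken, neither target biases the selection of the other, which is the motivation given for the parallel protocol.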

**Figure 6.** (**A**) Venn diagram of AChE and α7 nAChR screening hits in the 0.5 quantile of normalized ligand efficiencies. (**B**) Scatterplot of normalized, strain-corrected ligand efficiencies. Compounds which are in the 0.5 quantile are colored blue and tested compounds are labeled as indicated in Table 1 and shown in magenta.

#### *2.4. AChE and nAChR α7 Activity Testing*

The selected compounds (**6**–**21**, Table 1) were screened for AChE inhibitory activity at 200 µM by employing Ellman's colorimetric analysis [26]. For five poorly soluble compounds, the methanol supernatants were tested. Ellman's colorimetric analysis is based on the breakdown of acetylthiocholine (ATCI) by AChE to thiocholine and acetic acid. Ellman's reagent in turn reacts with the thiol group of thiocholine, resulting in a yellow color indicating the extent of enzymatic activity. Of the eleven compounds soluble in MeOH, ten were shown to inhibit AChE significantly at 200 µM, with inhibition varying from 66.6 to 100%, while one compound (**17**) showed low inhibition (22.9 ± 1.1%, Table 1), which is in agreement with it having the lowest docking score for AChE of all the compounds tested. The five poorly soluble compounds also exhibited significant inhibitory activities between 45.9 and 100%. Overall, fourteen compounds inhibited AChE enzymatic activity by more than 60%, eleven by more than 80% and eight by more than 90%. Compounds that inhibited AChE by more than 80% in the initial screening at 200 µM had IC<sub>50</sub> values in the low micromolar range between 1.10 and 33.47 µM, corresponding to potencies two to three orders of magnitude lower than that of donepezil, the reference ligand, which showed an IC<sub>50</sub> value of 0.06 µM. The IC<sub>50</sub> values of compounds **6** and **15**, the least potent AChE inhibitors, were determined to be 58.47 and 114.70 µM.
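Potencies reported as IC<sub>50</sub> values are often compared on the logarithmic pIC<sub>50</sub> scale. The short sketch below converts a micromolar IC<sub>50</sub> into pIC<sub>50</sub>, using the values quoted in the text as inputs; the function name is hypothetical.

```python
import math


def pic50_from_ic50_um(ic50_um):
    """pIC50 = -log10(IC50 in mol/L), with IC50 given in µM."""
    if ic50_um <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_um * 1e-6)


# IC50 values quoted in the text (µM).
for label, ic50 in [("donepezil", 0.06), ("Ymir-2", 2.58)]:
    print(f"{label}: pIC50 = {pic50_from_ic50_um(ic50):.2f}")
```

A difference of one pIC<sub>50</sub> unit corresponds to a tenfold difference in IC<sub>50</sub>.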

**Table 1.** Docking and in vitro results for tested compounds (**6**–**21**). For AChE, compounds were initially screened at 200 µM followed by IC<sub>50</sub> measurements for soluble compounds with ≥ 80% inhibition. For α7 nAChR agonism, activities were measured at 200 µM concentrations. For determination of modulation of ACh responses at the α7 nAChR, 100 µM compound was co-applied with 30 µM ACh. Measured data represent the mean ± S.E.M of at least three AChE replicates and independent oocyte experiments.


<sup>a</sup> Glide G-score (kcal/mol); <sup>b</sup> normalized ligand efficiency as defined in Formula 1; <sup>c</sup> "-" indicates "value not determined"; <sup>d</sup> % inhibition of supernatant (MeOH); <sup>e</sup> galantamine pIC<sub>50</sub> = 6.64 ± 0.02, donepezil pIC<sub>50</sub> = 7.22 ± 0.02; <sup>f</sup> tested as racemic mixture.
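The relationship between the pIC<sub>50</sub> values in the table footnote and the micromolar IC<sub>50</sub> values quoted in the text can be checked with a short calculation. The following is a minimal sketch, assuming the usual definition pIC<sub>50</sub> = −log<sub>10</sub>(IC<sub>50</sub> in mol/L); the function names are illustrative and not part of the original analysis.

```python
import math


def pic50_to_ic50_um(pic50: float) -> float:
    """Convert pIC50 (-log10 of IC50 in mol/L) to an IC50 in µM."""
    return 10.0 ** (-pic50) * 1e6


def fold_difference(ic50_um: float, reference_ic50_um: float) -> float:
    """How many times higher an IC50 is than the reference IC50."""
    return ic50_um / reference_ic50_um


# Reference ligand from the table footnote (donepezil pIC50 = 7.22).
donepezil_ic50_um = pic50_to_ic50_um(7.22)
print(f"donepezil IC50 = {donepezil_ic50_um:.3f} µM")

# Range of IC50 values reported in the text for the more potent hits.
for ic50_um in (1.10, 33.47):
    fold = fold_difference(ic50_um, donepezil_ic50_um)
    print(f"{ic50_um} µM: {fold:.0f}-fold higher, {math.log10(fold):.1f} log units")
```

The conversion reproduces the donepezil IC<sub>50</sub> of about 0.06 µM given in the text and makes the potency gap between the hits and the reference ligand explicit.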

Subsequently, all compounds except compound **17** were assessed at the α7 nAChR expressed in *Xenopus* oocytes using two-electrode voltage-clamp electrophysiology. Compounds were tested for direct activation of the α7 nAChR in a 0.2–200 µM concentration range. Compounds displaying less than 1% direct activation were further evaluated at 100 µM for their ability to alter currents evoked by 30 µM ACh. Compounds **7** (Ýmir-2) and **15** (Ýmir-10) exhibited activation of α7 nAChR with 7.0 ± 0.9% and 2.3 ± 0.4% at 200 µM, respectively (Figure 7). Attempts to establish their potency were unsuccessful due to limited solubility. However, application of 2, 20, and 200 µM, as evident from Figure 7, established a concentration-dependent effect. The remaining thirteen compounds exhibited less than 1% agonism, indicating that they were either inefficient at mediating receptor activation or inactive at the tested concentrations. When tested as antagonists for their ability to inhibit ACh-evoked currents, all compounds showed inhibition in a range of 47.2–97.3% at 100 µM, with compounds **19** (Ýmir-14) and **21** (Ýmir-16) displaying almost full inhibition of 96% and 97%, respectively, at 100 µM (Figure 8).

**Figure 7.** Evaluation of compounds as agonists. Representative current traces for ACh, Ýmir-2 (**7**) (**A**) and Ýmir-10 (**15**) (**B**) at α7 nAChRs expressed in *Xenopus* oocytes. Cells were subjected to two-electrode voltage-clamp electrophysiology experiments where the oocyte membrane potential was clamped at −60 mV. The representative traces were baseline subtracted. Bars above the traces represent the application periods and the respective test solution concentrations are indicated above them. Note that the majority of the washing periods (3 min) between each trace are omitted in the figure.

**Figure 8.** Evaluation of compounds as antagonists. Representative current traces for ACh and 10 and 100 µM Ýmir-14 (**19**) (**A**) and Ýmir-16 (**21**) (**B**) co-applied with 30 µM ACh at α7 nAChRs expressed in *Xenopus* oocytes. Cells were subjected to two-electrode voltage-clamp electrophysiology experiments where the oocyte membrane potential was clamped at −60 mV. The representative traces were baseline subtracted. Bars above the traces represent the application periods and the respective test solution concentrations are indicated above them. Note that the majority of the washing periods (3 min) between each trace are omitted in the figure.

#### **3. Discussion**

We embarked on the search for bimodal compounds with the help of computational methods. In accordance with the hypothesis from our previous study [24], we searched for hit molecules that target α7 nAChR as agonists and AChE as inhibitors. A drug based on this new activity profile could provide a new strategy for treating AD by the dual modulation of cholinergic signaling.

Despite the requirements of VS for high quality models of binding sites and screening databases, it has proven useful for the identification of new ligands for single targets and many methodological improvements have been made over the past decades [36,37]. Adding a second biological target adds another significant constraint to the problem, which is often addressed by pre-filtering or pre-screening the compound database based on one target before testing the second target [23].

In the current study, we conducted a VS without pre-screening our ligand database and docked the entire dataset to both targets. AChE and α7 nAChR are structurally and functionally distinct proteins but both evolved to accommodate ACh in their respective binding pockets. Sharing the same endogenous ligand and hence pharmacophoric elements should increase the probability of finding a molecule that fits both pockets. In addition, we constrained the search to ligands that are extended and linear based on two reference ligands.

We successfully employed this HTVS approach and identified two compounds (Ýmir-2, Ýmir-10) that showed AChE inhibition and activation of the α7 nAChR, confirming the feasibility of VS in the search for bimodal compounds at these targets as reported previously [24]. We observed a remarkably high hit rate for AChE inhibitors, where all but one (**17**) of the tested compounds showed activity at 200 µM. Moreover, 10 out of the 11 compounds soluble in MeOH showed significant inhibition of the enzymatic activity of AChE, as did all the compounds for which only the supernatants were tested due to low solubility, indicating that these compounds also interact with the active site of AChE. However, further analysis to determine IC<sub>50</sub> values could not be performed for these poorly soluble compounds.

The hit rate at the α7 nAChR was likewise high, with 2/15 compounds displaying direct agonism. Ýmir-2 (**7**) and Ýmir-10 (**15**) showed partial activation of 7.0% and 2.3%, respectively, at 200 µM. Due to solubility issues, the maximal receptor activation could not be determined. The remaining compounds exhibited α7 nAChR antagonism in a range between 47.2% and 97.3% at 100 µM when co-applied with 30 µM ACh. Two compounds, Ýmir-14 (**19**) and Ýmir-16 (**21**), inhibited the response to ACh almost fully and in a concentration-dependent manner (Figure 8). This, and the fact that the agonist-based VS project yielded compounds that are structurally similar to each other as well as to known α7 nAChR agonists, suggests that the investigated compounds interact with the binding pocket, where they likely act as antagonists, although weak partial agonism cannot be ruled out. Hence, as all compounds but one displayed some activity at the nAChR, and as the difference between an antagonist and agonist is often subtle, the VS hit rate in terms of binding to the α7 nAChR orthosteric binding site could be considered as high as 14/15. However, it cannot be excluded that receptor inhibition for some compounds is mediated by a different mechanism, such as direct blockage of the ion channel pore. Further experiments would be required to elucidate the exact mechanism of inhibitory interaction.

All 57 compounds identified in both screenings were extended, linear molecules. Other common properties such as the presence of a basic nitrogen at one end of the molecule and an aromatic moiety at the other were also observed. As the set included derivatives of the same compound but also compounds that differed only in their basic amine (e.g., piperidine and pyrrolidine) we identified these two regions as the main source of variability in the set of putative bimodal compounds. The low diversity of the intersecting set is likely a consequence of the protocol and the constraints applied during docking, but the structural patterns are also in accordance with known nicotinic ligands indicating that this is a general feature of nAChR ligands. The docked poses of Ýmir-2 (**7**) and Ýmir-10 (**15**) display characteristic interactions of AChE inhibitors and nicotinic receptor ligands around the basic amine, i.e., the positively charged amines are coordinated in the aromatic cage of the α7 nAChR and the anionic site of AChE (Figure 9). The lack of hydrogen bonds and strong hydrophobic interactions distal to the basic amines in the α7 nAChR suggests that the activity of these ligands could be further improved in this region. It is also noteworthy that the basic amine of Ýmir-2 (**7**) does not display the hydrogen bond to the backbone carbonyl of TRP-149 which is often considered crucial [38] and observed in multiple co-crystallized complexes between ligands and nicotinic receptors as well as acetylcholine binding proteins.

**Figure 9.** Interactions of Ýmir-2 (**7**) (magenta) and Ýmir-10 (**15**) (blue) in AChE (**A**,**B**) and α7 nAChR (**D**,**E**) and superposition of docking poses to reference ligands donepezil (**C**) and Compound 44 (**5**) (**F**). Hydrogen bonding is indicated by yellow dotted lines, cation-π interactions by green dotted lines, and π–π interactions by blue dotted lines.

#### **4. Materials and Methods**

#### *4.1. Materials*

Galantamine hydrobromide analytical standard of plant origin was purchased from PhytoLab GmbH & Co. KG (Vestenbergsgreuth, Germany). Screened compounds (**6**–**16**, **18**, **20** and **21**) were purchased from Molport (Riga, Latvia), compound **17** from Asinex (Winston Salem, NC, USA) and **19** from AsisChem (Waltham, MA, USA). Restriction enzymes were from New England Biolabs Inc. (Ipswich, MA, USA) and DNA and RNA purification kits were from QIAGEN N.V. (Venlo, The Netherlands). The mMessage mMachine T7 transcription kit was from ThermoFisher Scientific (Waltham, MA, USA). Human recombinant acetylcholinesterase (C1682), acetylcholine chloride, acetylthiocholine iodide, 5,5′-dithiobis-(2-nitrobenzoic acid), bovine serum albumin, kanamycin, theophylline, tricaine, collagenase, HEPES, Trizma, salts and other chemicals not mentioned specifically were purchased from Sigma-Aldrich Co. LLC (St. Louis, MO, USA) and were of analytical grade.

#### *4.2. Protein Models*

#### 4.2.1. nAChR α7 Homology Model

The amino acid sequence of the human α7 nAChR was obtained from the UniProt protein knowledgebase (entry P36544 [39]) and aligned to the sequences of the α4- and β2-subunits of PDB 5KXI [28] (α- and β-subunits, or chains A and B, respectively) and 5J5F [29] (chain A) with the T-Coffee Expresso structural alignment tool [40]. Residues used to build the homology model were then manually selected, ensuring that the binding site was mainly built on the 5J5F template, an AChBP structure in complex with Compound 44 (**5**) (*N*4,*N*4-bis[(pyridin-2-yl)methyl]-6-(thiophen-3-yl)pyrimidine-2,4-diamine), whereas the tertiary and quaternary structure was largely modelled after 5KXI (Figure 10). Since the α7 receptor consists of α-subunits only, a single α-subunit from the (α4)<sub>2</sub>(β2)<sub>3</sub> template was also provided separately as a template at the positions occupied by β-subunits in the 5KXI complex. The α7 nAChR homology models were built with Modeller (Version 9.16, Automodel class, Salilab, UCSF, San Francisco, CA, USA) [41]. Protein secondary structures as well as the pentameric symmetry were supplied as constraints to Modeller. Out of 100 generated models, the model with the best-scoring DOPE potential [30] was selected for further studies.


**Figure 10.** Sequence alignment of the templates used in the homology model of α7 nAChR. The sequence of one α- and one β-subunit of the (α4)<sub>2</sub>(β2)<sub>3</sub> receptor co-crystallized with nicotine (PDB 5KXI) and of an AChBP co-crystallized with Compound 44 (**5**) were aligned using T-Coffee and modified for homology model building. Residues used as templates in homology modelling are shown in bold and conserved amino acids are highlighted in green. The overall sequence identities between the α7 nAChR and the (α4)<sub>2</sub>(β2)<sub>3</sub> nAChR subunits used as primary templates are 45% (62% sequence similarity) and 37% (57% sequence similarity), respectively. The sequence identity between α7 nAChR and AChBP is 28% (46% sequence similarity) overall and 39% (64% sequence similarity) for the part that was used as template.

The resulting receptor model was then prepared with the Schrödinger Protein Preparation Wizard [42] (hydrogens added and H-bonds optimized, protonation according to pH 7.4, restrained minimization (RMSD 0.3 Å)). The quality of the selected model was assessed based on a Ramachandran plot drawn in R [43] using the data published by Lovell et al. [31]. Subsequently, the protein-ligand complex was refined with Schrödinger Prime [44] for the ligand and its surrounding protein residues within 5 Å using Monte Carlo sampling [45], the VSGB solvation model [46], the OPLS3 force field [47], and otherwise default parameters.

#### 4.2.2. AChE Model

The protein structure of AChE (PDB 4EY7 [27]) was downloaded from the Protein Data Bank [48] and all waters, ions, and small molecules except for donepezil were removed. It was then prepared with the Schrödinger Protein Preparation Wizard [42] as described above for the α7 nAChR model.

#### *4.3. Screening Library*

The ZINC15 database [32] was downloaded in SMILES format after application of the following filters: MW 250–500; logP −1–5; in-stock; anodyne. The SMILES strings were then converted to canonical SMILES and duplicates removed using OpenBabel 2.4.0 [49]. REOS [34] and PAINS [35] structure filters were applied using Schrödinger utilities. A set of additional filters was defined and applied to remove "flavonoid-like" structures, to restrict the number and position of halogen atoms, and to limit the number of oxygen and nitrogen atoms to 6 and 1–8, respectively. Finally, physicochemical properties were calculated using Schrödinger software (Release 2018-1, Schrödinger, LLC, New York, NY, USA) and filtered using a custom Perl script (HBD ≤ 5; HBA ≤ 10; PSA ≤ 120; RB ≤ 7).
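The final property-based filtering step can be illustrated with a short sketch. It is not the authors' Perl script: it assumes that the properties have already been calculated, that the listed HBD and HBA values are upper bounds, and that the SMILES strings are already canonical so that duplicates can be detected by string comparison. The `Props` record and the example entries are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Props:
    """Precomputed properties for one library entry (hypothetical record layout)."""
    smiles: str       # canonical SMILES, used here only as a unique key
    mw: float         # molecular weight
    logp: float       # calculated logP
    hbd: int          # hydrogen-bond donors
    hba: int          # hydrogen-bond acceptors
    psa: float        # polar surface area
    rot_bonds: int    # rotatable bonds


def passes_filters(p: Props) -> bool:
    """Thresholds as listed in Section 4.3; HBD/HBA treated as upper bounds (assumption)."""
    return (
        250 <= p.mw <= 500
        and -1 <= p.logp <= 5
        and p.hbd <= 5
        and p.hba <= 10
        and p.psa <= 120
        and p.rot_bonds <= 7
    )


def filter_library(records):
    """Drop duplicate SMILES, then keep entries that pass all property filters."""
    seen = set()
    kept = []
    for p in records:
        if p.smiles in seen:
            continue
        seen.add(p.smiles)
        if passes_filters(p):
            kept.append(p)
    return kept


# Placeholder entries with made-up property values.
library = [
    Props("SMILES_A", 320.4, 2.1, 1, 4, 55.0, 5),
    Props("SMILES_A", 320.4, 2.1, 1, 4, 55.0, 5),   # duplicate, removed
    Props("SMILES_B", 610.7, 4.0, 2, 8, 90.0, 6),   # too heavy
    Props("SMILES_C", 410.5, 3.3, 2, 6, 130.0, 4),  # PSA too high
]
print([p.smiles for p in filter_library(library)])
```

The structure-based REOS, PAINS, and "flavonoid-like" filters require substructure matching and are outside the scope of this sketch.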

#### *4.4. Virtual Screening*

For VS, 14 Å<sup>3</sup> Glide receptor grids [50,51] were centered around donepezil and Compound 44 in the AChE and α7 nAChR protein models, respectively, and generated with the receptor scaling factor set to 0.9, the partial charge cut-off set to 0.3, and three positional restraints along the reference ligands in each of the binding sites. For the α7 nAChR, the areas concerned were in the hydrophobic pocket as well as the area corresponding to the position of the thiophene moiety of the *N*4,*N*4-bis[(pyridin-2-yl)methyl]-6-(thiophen-3-yl)pyrimidine-2,4-diamine ligand in the 5J5F structure, and for AChE they were set at the anionic site as well as the hydrophobic gorge occupied by the indanone moiety of donepezil. These positional restraints were applied after each round of docking, thereby ensuring that only ligand poses occupying crucial areas in the binding site were considered without artificially enriching them. To perform the ligand docking, the Virtual Screening Workflow available in the Schrödinger suite, based on Glide docking [50,51], was used. Ligands were prepared with the OPLS\_2005 force field [52] at pH 7.4, and a maximum of 8 different stereoisomers per ligand was generated. Only neutral and singly charged states were kept (see also Section 4.3). For docking, the ligand scaling factor was set to 0.85 and the partial charge cut-off to 0.2. Moreover, 300,000 compounds were kept after the initial HTVS, 30,000 after the SP (standard precision), and none were removed after the XP (extra precision) docking stage. Docking scores were strain corrected with the help of the Strain Rescore script (OPLS3 [47], aqueous solvation model, energy offset 0 kcal/mol and default Cartesian constraint settings). Subsequently, the dataset was further processed in R [43], where Glide docking scores were used to calculate ligand efficiencies as

$$\text{Ligand efficiency} = (\text{Glide docking score} - \text{strain penalty}) \times (N_{\text{heavy atoms}})^{-1} \tag{1}$$

and the values obtained were normalized. Afterwards, compounds with positive docking scores as well as compounds with strain energy larger than 4 kcal/mol were removed from the dataset. The intersecting set of both obtained lists of docking hits was then considered to contain the putative bimodal compounds. Subsequently, compounds in the lower 0.5 quantile of both normalized ligand efficiencies were excluded. From the resulting list, compounds were selected based on diversity, docking pose, and commercial availability.
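The post-docking selection can be sketched as follows. The original processing was done in R; this Python version is only an illustration under several assumptions that the text does not specify: normalization is taken to be min-max scaling with the most negative (best) ligand efficiency mapped to 1, the 0.5 quantile is computed over the intersecting set, and a compound is excluded only if it falls below the median for both targets. The compound identifiers and scores are invented.

```python
from statistics import median


def ligand_efficiency(glide_score, strain_penalty, n_heavy_atoms):
    """Formula 1, taken as written: (score - strain penalty) / heavy-atom count."""
    return (glide_score - strain_penalty) / n_heavy_atoms


def process_target(hits):
    """hits maps compound id -> (glide_score, strain_penalty, n_heavy_atoms).

    Returns id -> normalized ligand efficiency (1 = best) for compounds that
    have a non-positive docking score and a strain energy of at most 4 kcal/mol.
    """
    le = {cid: ligand_efficiency(*values) for cid, values in hits.items()}
    lo, hi = min(le.values()), max(le.values())
    span = (hi - lo) or 1.0  # avoid division by zero if all values are equal
    normalized = {cid: (hi - value) / span for cid, value in le.items()}
    return {
        cid: normalized[cid]
        for cid, (score, strain, _) in hits.items()
        if score <= 0 and strain <= 4.0
    }


def bimodal_candidates(hits_a, hits_b):
    """Intersect both hit lists, then drop compounds in the lower half for both targets."""
    a, b = process_target(hits_a), process_target(hits_b)
    shared = a.keys() & b.keys()
    if not shared:
        return []
    median_a = median(a[c] for c in shared)
    median_b = median(b[c] for c in shared)
    return sorted(c for c in shared if not (a[c] < median_a and b[c] < median_b))


# Invented docking results: id -> (Glide score, strain penalty, heavy atoms).
ache_hits = {"c1": (-9.0, 1.0, 25), "c2": (-7.0, 0.5, 30), "c3": (-8.0, 5.0, 20),
             "c4": (0.5, 0.0, 22), "c5": (-6.0, 2.0, 28)}
a7_hits = {"c1": (-8.0, 0.5, 25), "c2": (-5.0, 1.0, 30), "c3": (-7.0, 1.0, 20),
           "c5": (-7.5, 0.0, 28), "c6": (-9.0, 0.0, 24)}
print(bimodal_candidates(ache_hits, a7_hits))
```

In this toy data set, c3 is removed for excessive strain and c4 for a positive score at AChE, c6 is absent from the AChE list, and c2 falls in the lower half for both targets, leaving c1 and c5. A stricter reading of the quantile rule (requiring compounds to be above the median for both targets) would only change the final condition.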

#### *4.5. AChE Activity Testing*

In vitro AChE inhibitory activity was studied using the colorimetric method of Ellman [26] with human recombinant AChE enzyme. To a 96-well microplate, test solutions were applied along with 1 mg/mL bovine serum albumin, 0.5 mM 5,5′-dithiobis-(2-nitrobenzoic acid), and 0.5 mM acetylthiocholine iodide. The reaction was initiated by addition of 0.20 units/mL AChE enzyme and followed by colorimetric detection at 405 nm. Experiments were conducted in triplicate. All compounds were dissolved in methanol (maximum 2% methanol at assay conditions, which did not affect the enzyme activity) and screened at 200 µM with galantamine as a positive control.

Compounds that fully solubilized in MeOH and exhibited more than 70% inhibition of ACh degradation were further analyzed and their IC<sub>50</sub> values determined using between five and eight concentrations.

#### *4.6. nAChR* α*7 Activity Testing*

#### 4.6.1. Molecular Biology

Human α7 nAChR subunits were cloned and inserted into expression vectors as described previously [53]. Plasmid cDNAs were linearized using a downstream NotI restriction site and purified. Capped cRNA was prepared from the linearized cDNA using the mMessage mMachine T7 transcription kit according to the manufacturer's protocol. Purified cRNA was aliquoted and stored at a concentration of 0.5 µg·µL<sup>−1</sup> at −80 °C until further use.

#### 4.6.2. Expression of α7 nAChR in *Xenopus laevis* Oocytes

*Xenopus laevis* oocytes were obtained as described previously [54]. Briefly, ovary lobes were removed by surgical incision, sliced into small pieces and defolliculated by collagenase treatment. The protocol for this specific study was approved by the Animal Ethics Committee of the University of Sydney (Protocol number: 2013/5915) and carried out according to these guidelines. Stage V and VI oocytes were injected with a total of ~25 ng of cRNA encoding human α7 nAChR with RIC3 (in 5:1 ratio), a protein enhancing the expression of the receptor. Injected oocytes were incubated for 3–5 days at 18 °C in a saline solution (96 mM NaCl, 2 mM KCl, 1 mM MgCl<sub>2</sub>, 1.8 mM CaCl<sub>2</sub>, 5 mM HEPES (hemisodium, pH 7.4)) supplemented with 2.5 mM sodium pyruvate, 0.5 mM theophylline, and 50 µM gentamycin.

#### 4.6.3. Oocyte Electrophysiology

Electrophysiological recordings from *Xenopus laevis* oocytes were performed using the two-electrode voltage-clamp technique as described previously [54,55]. Briefly, oocytes were placed in a custom-built recording chamber and continuously perfused with a saline solution. The saline solution contained 96 mM NaCl, 2 mM KCl, 1 mM MgCl<sub>2</sub>, 1.8 mM CaCl<sub>2</sub>, and 5 mM HEPES (hemisodium, pH 7.4). Pipettes were backfilled with 3 M KCl and open-pipette resistances ranged from 0.3 to 1.5 MΩ when submerged in the saline solution. Oocytes were voltage clamped at a holding potential of −60 mV using an Axon Geneclamp 500B amplifier (Molecular Devices, LLC, San Jose, CA, USA). Rapid solution exchange in the oocyte vicinity (order of a few seconds) was ensured by application through a 1.5 mm diameter capillary tube placed approximately 2 mm from the oocyte as described previously [54]. The solution flow rate through the capillary was 2.0 mL/min. Experiments were performed as follows: nAChR currents were initially evoked with three ACh<sub>control</sub> applications (~EC<sub>20</sub>, 30 µM), followed by a maximally efficacious concentration of ACh (ACh<sub>max</sub>, EC<sub>100</sub>, 3 mM) and three additional ACh<sub>control</sub> applications. Thereafter, test compounds were applied in increasing concentrations (25 s each); the maximal tested concentration was 200 µM. A wash period of at least 2 min was kept between each application. After the agonist test, new ACh controls were applied (ACh<sub>max</sub> followed by three additional ACh<sub>control</sub> applications) and compounds that displayed <1% activation were tested for their ability to modulate the effect of ACh. In these experiments, 10 and 100 µM concentrations of a test compound were co-applied with 30 µM ACh. Peak current amplitudes were normalized with respect to the amplitude of current elicited by 3 mM or 30 µM ACh for the agonist and antagonist test, respectively. All experiments were conducted at least in triplicate.
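The normalization of peak currents described above can be written out explicitly. This is a minimal sketch with invented current amplitudes; the function names are illustrative and the original analysis was performed in pClamp and GraphPad Prism.

```python
import math
from statistics import mean, stdev


def percent_activation(peak_test, peak_ach_max):
    """Agonist test: response as a percentage of the 3 mM ACh (ACh_max) response."""
    return 100.0 * peak_test / peak_ach_max


def percent_inhibition(peak_coapplied, peak_ach_control):
    """Antagonist test: 100 minus the percentage of the 30 µM ACh control current remaining."""
    return 100.0 - 100.0 * peak_coapplied / peak_ach_control


def mean_sem(values):
    """Mean and standard error of the mean for replicate measurements."""
    return mean(values), stdev(values) / math.sqrt(len(values))


# Hypothetical peak currents in nA from three oocytes (inward currents are negative).
test_peaks = [-70.0, -60.0, -80.0]
ach_max_peaks = [-1000.0, -1000.0, -1000.0]

activations = [percent_activation(t, m) for t, m in zip(test_peaks, ach_max_peaks)]
avg, sem = mean_sem(activations)
print(f"activation: {avg:.1f} ± {sem:.1f} % of ACh max")
print(f"inhibition: {percent_inhibition(-6.0, -200.0):.1f} %")
```

Because both the test and the reference currents are inward, their ratio is positive regardless of the sign convention used for the recorded amplitudes.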

ACh was initially dissolved in milliQ water as 10 mM stock solution. Screened compounds were dissolved as a 50 mM stock solution in DMSO, except for Ýmir-2 and compound **12** which were kept as 10 mM stock solutions. The maximal DMSO concentration in the final dilution did not exceed 2%. This DMSO concentration did not evoke any current from the receptors. Compound dilutions were prepared in a saline solution on the day of the experiment.

#### *4.7. Data Analysis*

Data analysis was performed as reported previously [24]. Electrophysiological data were analyzed using pClamp 10.2 (Molecular Devices, LLC, San Jose, CA, USA). During analysis, traces were baseline subtracted and responses to individual applications quantified as peak-current amplitudes. Statistical calculations were performed using GraphPad Prism 7 (GraphPad Software, GraphPad, San Diego, CA, USA). Activation of the α7 nAChR was calculated as a percentage of the E<sub>max</sub> response to ACh. For evaluation of the inhibitory activity, the percentage of remaining peak-current amplitudes relative to that of the ACh<sub>control</sub> (EC<sub>20</sub>) application was calculated. AChE inhibition data were analyzed using GraphPad Prism 7. Absorption readings from the AChE inhibition assay were plotted versus time and linear regression was performed. From the obtained slopes, the percentages of inhibition were calculated, normalized to the control (galantamine), and IC<sub>50</sub> values were determined by non-linear regression analysis. Data were fitted with the slope set to 1 and the remaining current amplitude at infinitely high compound concentrations set to 0.
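The chain from absorbance readings to an IC<sub>50</sub> value can be sketched in plain Python. This is not the GraphPad Prism procedure itself: it assumes that inhibition is expressed relative to the rate of an uninhibited reaction, uses a one-parameter model with the Hill slope fixed at 1 and complete inhibition at infinite concentration, and replaces Prism's non-linear regression with a simple grid search refined by golden-section minimization. All data values are synthetic.

```python
import math


def slope(times, absorbances):
    """Least-squares slope of absorbance versus time (reaction rate)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_a = sum(absorbances) / n
    sxy = sum((t - mean_t) * (a - mean_a) for t, a in zip(times, absorbances))
    sxx = sum((t - mean_t) ** 2 for t in times)
    return sxy / sxx


def percent_inhibition(rate_test, rate_uninhibited):
    """Inhibition relative to an uninhibited reaction rate (assumed normalization)."""
    return 100.0 * (1.0 - rate_test / rate_uninhibited)


def predicted_inhibition(conc_um, ic50_um):
    """Hill slope fixed at 1; inhibition approaches 100% at high concentration."""
    return 100.0 * conc_um / (conc_um + ic50_um)


def fit_ic50(concs_um, inhibitions):
    """Least-squares estimate of IC50 (µM), searched on a log10 scale."""
    def sse(log_ic50):
        ic50 = 10.0 ** log_ic50
        return sum((y - predicted_inhibition(c, ic50)) ** 2
                   for c, y in zip(concs_um, inhibitions))

    # Coarse grid over log10(IC50/µM) from -4 to 4, then local refinement.
    grid = [x / 10.0 for x in range(-40, 41)]
    best = min(grid, key=sse)
    lo, hi = best - 0.1, best + 0.1
    phi = (math.sqrt(5) - 1) / 2
    for _ in range(60):
        a = hi - phi * (hi - lo)
        b = lo + phi * (hi - lo)
        if sse(a) < sse(b):
            hi = b
        else:
            lo = a
    return 10.0 ** ((lo + hi) / 2)


# Synthetic example: a rate from four readings, then a fit to noise-free data.
rate = slope([0, 1, 2, 3], [0.1, 0.3, 0.5, 0.7])
concs = [0.5, 1.0, 5.0, 10.0, 50.0, 100.0]
observed = [predicted_inhibition(c, 5.0) for c in concs]
print(f"rate = {rate:.2f} per min, IC50 = {fit_ic50(concs, observed):.2f} µM")
```

Fitting on a logarithmic concentration scale keeps the search well behaved across several orders of magnitude, which matters when IC<sub>50</sub> values range from sub-micromolar to above 100 µM as in Table 1.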

#### **5. Conclusions**

It was confirmed that HTVS approaches can be applied in the search for novel bimodal drug hits active at AChE and α7 nAChR with good hit rates. Ýmir-2 and Ýmir-10 represent novel compounds with a dual activity profile, i.e., inhibition of AChE and activation of the α7 nAChR, reported for the first time. The successful identification of two bimodal compounds is an encouraging outcome for VS for AD drug hits. Derivatives of Ýmir-2 could be developed into compounds with improved physicochemical properties and activity profiles of clinical interest.

**Author Contributions:** S.O., T.B., and E.S.O. designed the studies. S.O. designed and performed the in silico and AChE experiments. N.M.K. and P.K.A. designed and performed the α7 nAChR experiments. S.O. and T.B. drafted the manuscript, and all authors contributed to the final version of the manuscript. E.S.O. and T.B. acquired funding and performed supervision of the project. All authors approved the final version of the manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was supported by the Icelandic Centre for Research [grant number: 152604], doctoral grant and financial support from the University of Iceland.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Sample Availability:** Not available.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

*Article*

### **Characterizing Epitope Binding Regions of Entire Antibody Panels by Combining Experimental and Computational Analysis of Antibody: Antigen Binding Competition**

#### **Benjamin D. Brooks 1,2,3,\* , Adam Closmore <sup>4</sup> , Juechen Yang <sup>5</sup> , Michael Holland <sup>5</sup> , Tina Cairns <sup>3</sup> , Gary H. Cohen <sup>3</sup> and Chris Bailey-Kellogg <sup>6</sup>**


Academic Editors: Marco Tutone and Anna Maria Almerico Received: 23 April 2020; Accepted: 28 July 2020; Published: 11 August 2020

**Abstract:** Vaccines and immunotherapies depend on the ability of antibodies to sensitively and specifically recognize particular antigens and specific epitopes on those antigens. As such, detailed characterization of antibody–antigen binding provides important information to guide development. Due to the time and expense required, high-resolution structural characterization techniques are typically used sparingly and late in a development process. Here, we show that antibody–antigen binding can be characterized early in a process for whole panels of antibodies by combining experimental and computational analyses of competition between monoclonal antibodies for binding to an antigen. Experimental "epitope binning" of monoclonal antibodies uses high-throughput surface plasmon resonance to reveal which antibodies compete, while a new complementary computational analysis that we call "dock binning" evaluates antibody–antigen docking models to identify why and where they might compete, in terms of possible binding sites on the antigen. Experimental and computational characterization of the identified antigenic hotspots then enables the refinement of the competitors and their associated epitope binding regions on the antigen. While not performed at atomic resolution, this approach allows for the group-level identification of functionally related monoclonal antibodies (i.e., communities) and identification of their general binding regions on the antigen. By leveraging extensive epitope characterization data that can be readily generated both experimentally and computationally, researchers can gain broad insights into the basis for antibody–antigen recognition in wide-ranging vaccine and immunotherapy discovery and development programs.

**Keywords:** epitope binning; epitope mapping; epitope prediction; antibody:antigen interactions; protein docking; glycoprotein D (gD); herpes simplex virus fusion proteins

#### **1. Introduction**

The utility of antibodies (Abs) for treatment of disease has been recognized for over a century and the clinical realization of this vision in a host of broad therapeutic indications has begun to come to fruition [1,2]. The efficacy of an Ab, whether from vaccination or as a therapeutic, is functionally determined by the specific epitope that it recognizes on its cognate antigen (Ag). Thus, even Abs targeting the same Ag have demonstrated variable efficacy depending on their epitope specificities [3,4]. Recent advances in Ab engineering have enabled researchers to generate large panels of Abs with broad epitope coverage allowing for the selection of Ab–Ag pairings with improved safety and efficacy [5–8]. Abs have demonstrated differential effects in both vaccines and therapies depending on their epitopes [9–11]. Moreover, combination immunotherapies with diverse epitopes have demonstrated synergistic efficacy and have reduced the ability of cancer and infectious disease to develop resistance [12,13]. Furthermore, the use of immune repertoire B cell sequencing is supporting expanded clinical applications not only in immunotherapy, but also in the development of vaccines for emergent infectious diseases, both viral and bacterial [14]. The specificity of Abs in vaccine applications reveals various levels of protection depending on which epitopes are targeted by the immune response [12,13,15–19].

#### *1.1. Understanding the Structures of Epitope:Paratope Interactions Guides the Design of Superior Immunotherapies and Vaccines*

One of the areas for improvement in Ab engineering is the characterization of the binding interactions between an Ab and its specific Ag, as defined by the epitope–paratope interface [5,20–22]. Due to practical limitations, structural data is only available for a relatively small number of epitope–paratope interactions [23,24], since commonly used analytical techniques for obtaining these data, such as X-ray crystallography, NMR spectroscopy, cryo-electron microscopy, and H-D exchange mass spectrometry, are highly resource-intensive and often require artisan skillsets [25–28]. As such, these techniques are feasible only for late-stage lead molecules where they provide confirmation data rather than early-stage predictive tools that influence candidate selection [5]. These limitations highlight the realization that a more sophisticated strategy is required to characterize a panel of antibodies either to screen immunotherapies or to characterize immune repertoires from natural or vaccinated responses [23]. The localization of these interactions would provide information to help understand biochemical mechanisms of action, which is at the core of advancing the discovery and development of new immunotherapeutics and vaccines [20,29,30].

#### *1.2. High-Throughput Epitope Characterization Assays Provide Valuable Epitopic Information*

Epitope characterization has made significant progress in recent years as the throughput of biosensors has improved [5,31]. Foremost amongst emerging techniques for high-throughput epitope characterization is epitope binning, as it merges the speed and functional-site identification capabilities demanded by the biopharmaceutical industry [3–5,32]. Epitope binning is a competitive immunoassay where Abs are tested in a pairwise manner for their simultaneous binding to their specific Ag, thereby generating a blocking profile for each Ab showing how it blocks or does not block the others in the panel [32–35]. Abs with similar blocking profiles can be clustered together into a "bin", or represented as a "community" in a network plot that illustrates the blocking relationships among the Abs. In an oversimplified generalization, Abs in the same community are assumed to recognize the same epitope region and generally block the binding of others in the community. Advances in throughput offered by emerging array-based, label-free methods are allowing epitope binning to be performed early in drug discovery [32–35]. This approach can enable the pipeline to be populated with Abs that are diverse epitopically and, consequently, functionally. A key limitation of epitope binning remains its inability to provide insights into the locations of the epitopes on the target [3–5,32].
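The idea of grouping Abs by their blocking profiles can be illustrated with a small sketch. It is a simplified stand-in, not the analysis software used in binning studies: each Ab is represented by a row of a hypothetical pairwise blocking matrix, and Abs whose profiles differ in at most `max_mismatches` positions are merged into the same bin using single-linkage grouping. The antibody names and matrix entries are invented.

```python
def hamming(profile_a, profile_b):
    """Number of positions at which two blocking profiles differ."""
    return sum(x != y for x, y in zip(profile_a, profile_b))


def bin_antibodies(names, blocking, max_mismatches=0):
    """Group antibodies whose blocking profiles are similar.

    blocking[i][j] is 1 if antibody i blocks antibody j and 0 otherwise.
    Profiles within max_mismatches of each other are linked, and linked
    antibodies end up in the same bin (single linkage via union-find).
    """
    n = len(names)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if hamming(blocking[i], blocking[j]) <= max_mismatches:
                parent[find(i)] = find(j)

    bins = {}
    for i, name in enumerate(names):
        bins.setdefault(find(i), []).append(name)
    return sorted(bins.values())


# Hypothetical four-antibody panel with two mutually exclusive blocking patterns.
names = ["Ab1", "Ab2", "Ab3", "Ab4"]
blocking = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
print(bin_antibodies(names, blocking))
```

Allowing a small number of mismatches makes the grouping tolerant of occasional ambiguous measurements, at the cost of possibly chaining together bins that share only intermediate profiles.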

#### *1.3. Integrating High-Throughput Experiment and Computation Enables Characterization of Ab-Specific Epitopes*

An emerging approach for localizing Ab recognition and characterizing Ab-specific epitopes involves coupling experimental data and computational modeling [9,36]. Purely computational approaches to epitope prediction are quick and inexpensive but are not yet of sufficient accuracy to be relied upon on their own [36–40]. However, the incorporation of even limited experimental data can help close the gap. For example, the EpiScope approach first constructs a homology model of an Ab based on its sequence, then computationally docks that model onto a structure or high-quality homology model of the Ag, and finally, designs focused mutagenesis experiments to test the docking models [9]. While it is generally not clear from computational scoring alone which docking model is the most accurate, it was shown in both retrospective and prospective studies that a small number (generally 3–5) of binding assays for computationally designed Ag variants can reliably enable the various docking models to be confirmed or rejected and thereby identify the general epitope region [9]. Complementarily, experimental data can be used to focus docking and energy minimization to better define binding mode or epitope [41]. In general, combined computational–experimental approaches balance cost and accuracy in characterizing epitope sites.

Here, we take this general idea and scale it up from individual Abs to sets of Abs, presenting a new integrated experimental-computational approach (Figure 1) to characterize epitopes for an entire panel of Abs against an Ag by combining experimental binning with "dock binning," a new computational counterpart based on analysis of docking models for all the Abs. With the application to a model system, glycoprotein D from herpes simplex virus, we show that this combination of powerful experimental and computational methods can help rapidly identify antigenic regions and localize Ab-specific epitopes. The approach promises to enable better understanding of Ab–Ag interactions at a larger scale, and ultimately to improve the design of vaccines and therapeutics.

**Figure 1.** Schematic overview of our method for localizing Ab-specific epitopes by integrated experimental and computational analysis of binding competition. (Step 1) Experimental binning identifies communities of Abs that compete with each other for Ag binding. This grouping allows subsequent analyses to be focused on one or a few representatives from each community, reducing the effort required. (Step 2) For each representative Ab, a homology model of its Fv structure is constructed from its sequence, and Ab:Ag docking models are generated from the Ab model and the Ag structure or homology model. (Step 3) The docking models are computationally clustered, with this dock binning process analogous to experimental epitope binning in identifying patterns of competition. Since the competition is in terms of structural models of Ab:Ag binding, the identified communities correspond to general antigenic regions, and thereby map out potential binding regions on the Ag, summarized across the whole Ab panel as an antigenic heatmap. (Step 4) Experimental data is collected to probe the hypothesized epitope regions, e.g., using site-directed mutagenesis, chimeragenesis, peptide binding, or selection of alternative natural variants, in order to alter a putative epitope and evaluate effects on experimental competition or binding. (Step 5) The experimental data is used to focus docking, redefining bins, better characterizing competition, and better localizing epitope binding regions. This ultimately results in an antigenic heatmap localizing putative binding regions of the different Abs (illustrated by colored patches on the Ag surface).

#### **2. Results**

We demonstrate the power of combined experimental and computational binning in application to an important target with a wealth of data available for our use: herpes simplex virus (HSV) glycoprotein D (gD). gD is a fusion protein found in HSV that has served as the standard by which all other HSV-2 vaccines are evaluated for safety and efficacy [42]. gD subunit vaccines have conferred some measure of protection against viral challenge [42–44], but they fail to prevent infection or latent infection [45]. gD serves as a particularly good target for demonstration of our approach due to the availability of a large panel of Abs as well as variant Ag constructs that may be leveraged for subsequent analyses. The following sections describe our process (Figure 1) to characterize the epitopes of a set of anti-gD Abs.

#### *2.1. Experimental Binning Identified Four Communities of Anti-gD Abs*

Step #1 of the workflow (following the schematic in Figure 1) is to perform experimental epitope binning on a panel of Abs, identifying communities of cross-competing Abs and selecting representatives from each community for further investigation. Figure 2 further overviews the general workflow for such epitope binning experiments. In previous studies, this general approach was applied specifically to gD. In particular, high-throughput SPRi technology in a classical sandwich assay format was used to assess competition between pairs of Abs, from a panel of 46 Abs, against four soluble Ag variants, gD from type 1 HSV truncated to the first 285 or 306 residues, and that from type 2 HSV likewise truncated to 285 or 306 residues [46–48]. Subsequent analyses presented here are all based on the 285 residue truncation of gD2.

**Figure 2.** Schematic overview of experimental epitope binning, identifying cross-competition between Abs for binding an Ag. Sensorgrams are assessed for blocking, inferring whether or not two Abs block each other from binding in a competition assay and thus are assumed to bind the same epitopic region on the Ag. The blocking data is collected in a heatmap that indicates whether or not each Ab pair blocks (red blocked, green sandwiched). The heatmap is processed with hierarchical clustering algorithms based on similarity in blocking profiles, thereby generating a dendrogram. Finally, cutting the dendrogram separates Abs into clusters represented in a community network, which has nodes for the Abs and edges indicating which pairs block each other.
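As a minimal sketch of the bookkeeping behind such a binning heatmap and community network, the Python fragment below turns pairwise blocking calls into a symmetric 0/1 matrix and an edge list. The antibody names, the `calls` dictionary, and the rule that a pair counts as blocking if either assay orientation blocked are illustrative assumptions, not the processing performed by the instrument software used in the study.

```python
from itertools import combinations

def blocking_matrix(antibodies, calls):
    """Build a symmetric 0/1 blocking matrix from directional blocking calls.

    `calls` maps (immobilized Ab, analyte Ab) -> True if blocking was observed.
    Assumption: a pair is treated as blocking if either orientation blocked.
    """
    n = len(antibodies)
    matrix = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for i, j in combinations(range(n), 2):
        a, b = antibodies[i], antibodies[j]
        hit = calls.get((a, b), False) or calls.get((b, a), False)
        matrix[i][j] = matrix[j][i] = int(hit)
    return matrix

def network_edges(antibodies, matrix):
    """List the Ab pairs joined by an edge (i.e., pairs that block each other)."""
    n = len(antibodies)
    return [(antibodies[i], antibodies[j])
            for i, j in combinations(range(n), 2) if matrix[i][j]]

# Hypothetical panel and calls (placeholders, not data from the study).
antibodies = ["Ab1", "Ab2", "Ab3", "Ab4"]
calls = {("Ab1", "Ab2"): True, ("Ab4", "Ab3"): True}

matrix = blocking_matrix(antibodies, calls)
edges = network_edges(antibodies, matrix)
print(edges)
```

Rows of `matrix` are the blocking profiles that a clustering step would compare; the edge list is what a community network plot would draw.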

The studies showed that the Abs covered much of the surface of gD [4], with numerous Abs in each distinct community of cross-blocking Abs [46–48]. Six "sentinel" Abs, namely DL11, MC23, MC2, MC5, MC14, and 1D3, were selected as representatives (generally the most highly connected within each community) [49,50]. Of particular note, there exists a crystal structure of an additional Ab, E317 (PDB ID 3W9E) [51]. Thus, even though E317 is in the same community as the previously selected DL11, we included it here to serve as a structural "control". These seven Abs represent the communities in serving as the subjects of the following computational and experimental analyses.

#### *2.2. Computational Epitope Prediction Characterized the Putative Epitope Binding Regions*

With the representative Abs selected, Step #2 (Figure 1) is to construct Ab homology models and dock them to the structure/model of the Ag. Ab homology modeling is generally very high quality (<1–2 Å level RMSD to native) for everything except the heavy chain CDR 3, which is more variable (more typically 3–6 Å, though it can be better) [52,53]. For the gD study, we used representative state-of-the-art methods, within Schrodinger BioLuminate (BioLuminate, Schrödinger, LLC, New York, NY, USA, 2020) to perform this modeling, but note that many alternative approaches are available and may yield somewhat different Ab and Ab:Ag models. A homology model was constructed for each of the seven Abs; for control purposes, E317 was homology modeled on a different scaffold from its crystal structure (RMSD 1.43 Å). The crystal structure of the Ag gD2 was taken from PDB id 2C36 [54–56], with missing residues homology modeled. Docking models were then generated for each Ab model against the gD2 model using the Piper algorithm within Schrodinger BioLuminate. This yielded roughly thirty representative low-energy docking models per Ab [57].

As is common [9], the docking models were spread over much of the gD2 surface. In general, the quality of Ab:Ag complex models produced by docking has steadily improved [58], e.g., for 95% of the cases in one benchmark, a near-native model was within the top 30 models [40]. Thus, while we could not be confident in the accuracy of any particular model, we could hypothesize that, in aggregate, they included the antigenic sites, setting up the next stage in our analysis.

#### *2.3. Dock Binning Grouped Models and Identified Broadly Antigenic Regions*

In order to identify the "hottest" putative antigenic sites on the protein, worth experimentally probing across the Ab panel, Step #3 (Figure 1) performs dock binning and constructs an antigenic heatmap. In analogy to experimental binning, this step first characterizes competition among Abs for sites on the Ag, here according to the docking models. At this point in the analysis, since there are many (roughly 30 in our gD results) docking models for each Ab, and they may be widely dispersed over the Ag, competition between a pair of docking models from different Abs does not necessarily imply that the Abs themselves compete. However, it does imply that it would be informative to evaluate the binding of those Abs against the Ag sites for which the docking models compete (e.g., mutate such a site and experimentally assess changes in binding/competition). Such a test would provide, with a single experiment, information regarding both Abs' binding sites. More generally, the more evidence there is for interaction with an Ag residue (the "hotter" that residue), the more generally and experimentally informative it should be to probe that residue for binding across the entire Ab panel. This insight is the basis for performing dock binning and constructing an associated antigenic heatmap.

The dock binning workflow (schematically illustrated in Figure 3) is relatively straightforward, mostly following that of experimental epitope binning (see again Figure 2). Now the competition heatmap is based on overlap in docking models, rather than experimental competition. A variety of different methods are possible for assessing this overlap; the results presented here are based on one that we call the "common interaction metric", which considers two Ab docks that contact the same residue(s) to be competing. This score drives the clustering of the docks based on their patterns of competition. Here, the hierarchical clustering method from experimental epitope binning was used [34,35]. A community network map is then generated from the dendrogram.

**Figure 3.** Schematic overview of "dock binning", computational analysis of cross-competition of docking models of Abs against an Ag. Homology models of the Abs are computationally docked against a crystal structure or homology model of the Ag. Docking models are evaluated for "competition" and the extent of competition represented in a heatmap. The competition profiles are subsequently used to cluster docking models, with a community network representing the models (nodes), their competition (edges), and the identified clusters (groups of nodes). In contrast to experimental binning, dock binning is based on structural analysis and thus provides insights into where on the Ag the Abs might be interacting. The antigenic heatmap highlights the most popular Ag residues across the docking models; these are thus most generally informative for subsequent experimental probing.

Figures 4 and 5 illustrate the results from applying this general process to the Ab:gD2 docking models. Figure 4 shows the docking model clusters generated for the seven Abs against gD2 and Figure 5 their community relationships. Note that docks from each Ab are found across all communities.

**Figure 4.** Binned docking models for Abs vs. gD2. Clusters of docking models are arranged in a table format, with each column a different cluster and each row a different perspective. Face #1 is the nectin binding face. Faces #2–4 are rotated by 90, 180, and 270 degrees around the *y*-axis. The rightmost column is an antigenic heat map (see also Figure 6) where residues are colored by community and shaded so that residues with more Ab interactions are colored more darkly.

**Figure 5.** Community map of docking models for Abs vs. gD2. Nodes represent docking models, with different symbols for the different Abs. Edges indicate structural competition between pairs of docking models. Colors and background shading indicate community membership according to the partitioning of a dendrogram. The shape of each marker indicates the antibody in the dock (i.e., each shape is a single antibody).

The last step in dock binning, where it can produce insights beyond those possible with experimental binning alone, leverages the intuition introduced above: analyze the dock bins to identify common putative antigenic sites which may then be subjected to experimental probing. Here computationally identified noncovalent Ab–Ag interactions (hydrogen, electrostatic, pi, and van der Waals) were aggregated across all docks for all Abs to determine a score for each Ag residue in terms of its total number of interactions. Figure 6 (also the last column of Figure 4) shows the resulting "antigenic heatmap," with residues of the protein structure colored according to the dock binning community (Figures 4 and 5) and with darkness proportional to the aggregated scores. This antigenic heatmap highlights the "hot" antigenic residues whose experimental probing is likely to be most informative and can enable the efficient localization of epitopes of the whole Ab panel.

**Figure 6.** Antigenic heat map for Abs vs. gD2. Each residue is colored according to the Ab community (Figure 5) with which it has the most interactions and shaded so that residues with more interactions are colored more darkly. Face #1 is the nectin binding face. Faces #2–4 are rotated by 90, 180, and 270 degrees around the *y*-axis.

#### *2.4. Dock Binning Enabled Selection of Experimental Assays to Evaluate Antigenic Regions*

At this point in the process (Step #4), critical Ag residues according to the antigenic heatmap are probed for Ab recognition (e.g., via point mutagenesis followed by a binding assay, to assess if a mutation away from native disrupts binding). In this study, we were fortunate that a wealth of data regarding Ab–gD2 binding already exists. We cross-referenced the "hot" residues from the antigenic heatmap against the available data (including Ab:gD2 variant binding, peptide binding, and known "monoclonal Ab resistant", or MAR, mutations), considering the sentinel Abs as well as others in their communities (Supplementary Figure S1) [46,48]. Figure 7 highlights the residues from the antigenic heatmap for which such binding data was available [59,60]. We consider these as the experimental evaluation of the hot residues from the antigenic heatmap. We note that, in other settings, mutations (individual or combination) could be computationally designed to evaluate the disruption of binding while preserving antigenic stability [9,61]. Thus, while we used a large number of experimental measurements in this study, we expect that a much smaller number of tests would suffice in practice, e.g., 3–5 variants sufficed to localize individual Ab epitopes in previous computationally directed studies [9].

**Figure 7.** Experimentally probed residues. Residues with associated experimental binding data are labeled; bracketed numbers in the legend indicate primary references in the bibliography. The surface is colored by associated Abs (Supplementary Figure S1), rather than dock binning community. The residues that were also highlighted by dock binning are underlined; this data serves as probes of those particular predicted antigenic hotspots.

#### *2.5. Experimental Data Allowed Re-Docking to Focus Ab:Ag Models Based on Experimental Binding Data, thereby Localizing Each Ab's Epitope Region*

Finally (Step #5 from Figure 1), the experimental data is used to focus docking of each Ab against the Ag. Here, the experimental data (Figure 7/Supplementary Figure S1) was used to focus docking toward (with "affinity") residues confirmed to be important for an Ab's binding and away from (with "repulsion") those determined not to be. For example, DL11 was docked with an affinity towards the residues 213 and 218 and with repulsion away from the residues associated with Abs from other communities, since these Abs were determined from the initial experimental binning not to compete with DL11, and thus it is assumed the epitopes don't overlap.

The focused docks are then subjected to dock binning and antigenic heatmap construction as described for Step #3. Figures 8–10 show the focused-docking results for the anti-gD Abs. In contrast to the initial unconstrained docking, docking models for an Ab are now more concentrated, focused on the experimentally important residues for the Ab. Consequently, the communities are now fairly homogeneous for each Ab, and the localization of each Ab on gD can be fairly well inferred from an associated hot region in the antigenic heatmap. For example, when compared to the crystal structure for our "structural control Ab" E317, 10 antigenic heatmap residues agree with those in the crystal structure, while 12 extend further out and four are missed (Supplementary Table S1).
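The E317 comparison can be restated as simple overlap ratios. The sketch below is only an illustrative reading of the counts quoted above: it assumes that the 12 residues that "extend further out" are heatmap residues absent from the crystal-structure epitope and that the four "missed" residues are crystal-structure epitope residues absent from the heatmap. Precision and recall are summary statistics introduced here for illustration, not quantities reported in this study.

```python
def footprint_agreement(n_shared, n_extra, n_missed):
    """Return (precision, recall) of a predicted footprint against a reference.

    n_shared: residues in both the heatmap footprint and the reference epitope
    n_extra:  heatmap residues not in the reference epitope (assumed reading)
    n_missed: reference epitope residues not in the heatmap (assumed reading)
    """
    precision = n_shared / (n_shared + n_extra)
    recall = n_shared / (n_shared + n_missed)
    return precision, recall

# Counts quoted in the text for the structural control Ab E317.
precision, recall = footprint_agreement(n_shared=10, n_extra=12, n_missed=4)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Under these assumptions, 10 of the 22 heatmap residues lie in the crystallographic epitope, and 10 of the 14 crystallographic epitope residues are recovered.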

**Figure 8.** Binned experimentally-focused docking models for Abs vs. gD2. The same representation as Figure 4, but based on docking models that were focused according to experimental data.

**Figure 9.** Community map of experimentally-focused docking models for Abs vs. gD2. The same representation as Figure 5, but based on docking models that were focused according to experimental data. Note the relative homogeneity of Abs (different symbols) within each community, contrasting with Figure 5. The shape of each marker indicates the antibody in the dock (i.e., each shape is a single antibody).

**Figure 10.** Antigenic heat map for Abs vs. gD based on experimentally-focused docking models. The same representation as Figure 6, but based on docking models that were focused according to experimental data. The resulting hot spots can be interpreted as likely epitope regions of the associated antibody/-ies in the bin. Relative.

While there is no ground truth for the actual epitopes of the Abs other than E317, the agreement between the antigenic heatmap residues and those previously identified by various experiments (Figure 7/Supplementary Figure S1) was quantified in terms of both centroid distances and common residues (table in the middle of Figure 10). Many of the distances are quite small, indicating that the dock binning region was centered in the same general region as the experimentally identified residues used to focus it. However, some of the regions have fairly few experimentally identified residues (e.g., red and blue communities), and the docking models typically expanded to cover a significantly larger region capturing more Ab–Ag surface complementarity. The quantification of agreement between residues in the antigenic heatmap and those experimentally used to focus the docking further illustrates that while docking largely stays in the focused region as intended, it does include some additional residues and omit some others. The omitted residues could indicate, for example, either that the dock binning missed some important residues contributing to recognition or that the experiments overestimated the importance of some residues. Likewise, the additional residues could indicate either that dock binning found residues that had not been discovered by previous experiments or that it was somewhat off-target. Such differences can then be the subject of further experimental investigation.

#### **3. Discussion**

#### *3.1. The Integration of Experimental and Computational Binning Provides Important Epitope Information That Can Inform Discovery*

Here, we demonstrate the utility of combining computational modeling and high-throughput experimental data to characterize epitopes early in discovery and thereby enable more effective drug and vaccine development. Epitope binning can facilitate the identification of functional epitopes and is thus increasingly used as a primary or early second screen [5]. However, while experimental epitope binning can inform competition among Abs for an epitope, it cannot localize the epitope on the Ag [5]. In contrast, while computational methods can identify potential epitopes, they are currently limited in their accuracy [40]. Even in this study, we were confronted with a wide range of seemingly equally reasonable docking models, leading us to consider: (1) how to identify the most accurate dock, (2) how to identify highly immunogenic sites on the target, and (3) how to group the docks into epitope regions in order to compare with experimental binning experiments. We determined that identifying the most accurate dock (question 1) was not realistic based on the current accuracy of the computational methods. However, we realized that docks could help us identify putative highly antigenic sites (question 2). Furthermore, this information would be the next logical step in assessing experimental binning communities and would limit the possibilities to consider when attempting to map communities onto their epitopes (question 3). Thus, this combination of computational and experimental binning (Figure 1) can leverage the advantages of each method, complementarily, in order to localize Ab epitopes across an entire panel.

The initial dock binning step provides not only putative antigenically hot regions but also important information toward the generation of mutants that can be used in epitope mapping and cross-antigen binning to test hypotheses regarding Ab–Ag binding and thereby localize epitopes. Dock binning identifies communities of related Ab docking models, while the antigenic heatmap summarizes possible epitope binding regions on the Ag surface. These two analyses together provide elegant, biologically-centered visualizations characterizing potential binding patterns for an entire Ab panel. Taking the epitope binding regions as hypothesized regions that are generally antigenic allows the design of targeted experiments to evaluate them. The underlying intuition for these experiments is that if an Ab truly recognizes a particular epitope, then a mutation to substantially "re-surface" that epitope should disrupt Ab binding, and thus evaluation of changes to binding can allow us to confirm or reject the epitope for the Ab. By testing a set of variants based on the epitope binding regions, we can thereby simultaneously test all the various hypotheses spanning the Ab panel and potential epitopes.

The results that can be obtained from the dock binning approach are limited by how well the Ab models and the Ab:Ag docking models reflect reality, along with how effectively the selected mutational variants can reveal that reality. We used representative computational approaches here, and overall observed reasonable agreement between the results from our approach and the one available ground truth crystal structure. While the previous purely experimental studies reflect an extensive amount of effort, they also may not fully reflect reality to the same extent as a crystal structure does. Thus, we are limited in our ability to characterize general trends regarding how well docking-based approaches perform and why, and what their particular weaknesses may be. From some previous gD studies, however, we do know that two of the Abs (MC2 and MC5) recognize epitopes that are at least partially obscured until receptor binding allosterically drives a conformational change [46,59]. We thus hypothesize that the epitope region identified for MC2, and presumably that for all members of the community it represents, is not particularly accurate. In general, epitopes, as well as paratopes, in conformationally dynamic or poorly modeled regions may present challenges for this model-driven approach.

#### *3.2. While Binning Is Still Limited, Further Computational Advances Will Improve Its Accuracy and Informativeness*

Significant limitations still exist with the prediction of protein-protein interactions, in particular with Ab-Ag epitope identification. Numerous opportunities exist to improve the process and better support vaccine and drug candidate screening. We here summarize some interesting avenues for computational development, following along the path of our workflow (Figure 1):

1. Experimental epitope binning. The fundamental question here is how to cluster the Abs into communities, and new and emerging techniques from network/community analysis, especially in genomics [62–64], may prove beneficial. These metrics can be used to identify communities that require further refinement and/or changes to the clustering. Furthermore, the clustering need not be discrete, i.e., the partition into communities can be overlapping, and this information can potentially be leveraged throughout the rest of the process. Such analysis could be particularly important in cases of partial or aberrant competition. While the "sentinel" antibodies used here were selected as most representative of clear-cut communities, future approaches could include algorithmic selection of representatives based on properties of the communities (e.g., ensuring adequate coverage of ambiguous clusters) and of the individual antibodies (e.g., based on sequence and structural analysis, favorability for development, etc.).


#### **4. Materials and Methods**

All reagents were previously described [46,49,50,70,71].

#### *4.1. Antibodies*

Monoclonal Abs (mAbs) were used throughout. The monoclonal antibodies were created by the Cohen/Eisenberg group at the University of Pennsylvania. The following anti-gD mAbs were previously published: 1D3 [49,72–74]; DL6, DL11, DL15 [51,71,74–78]; E317 [51]; MC1, MC2, MC4, MC5, MC8, MC9, MC10, MC14, MC15, MC23 [71,79]; A18 [80]; AP7, LP2 [81]; HD1, HD2, HD3, H162, H170, H193 [82,83]; 11S, 12S, 45S, 77S, 97S, 106S, 108S, 110S [73,84]; BD78, BD80 [71,85]; and the human mAb VID [86–88], DL15, A18, HD3, H162, H193, 77S, 97S, 106S, 108S and 3D5, 4E3E, 4G4D, and 11B3AG [46,70].

Sentinel Abs were selected based on past studies as most representative of their respective communities. Sentinel Abs were sequenced (Genscript, Piscataway, NJ, U.S.) from mAb clone for computational studies.

#### *4.2. Proteins*

Proteins were previously described [70]. Again, these constructs existed before this study and were generated by the Cohen/Eisenberg group. HSV type-1 and type-2 gD, truncated to 285 or 306 residues, were harvested from Sf9 cells infected with baculovirus. Protein was then purified using a DL6 immunosorbent column [70,71,89]. Additional proteins used in this study include: C-terminal truncations 250t, 260t, 275t, and 316t [50,54,70]; deletion mutant ∆(222–224) [50]; point mutants Y38A, V231W, and W294A [54,90,91]; and insertion mutants ins34, ins126, and ins243 [46,77].

#### *4.3. Epitope Binning*

Binning experiments were previously described [3,46]. Briefly, a 48-spot microarray of amine-coupled Abs on a CMD200M sensor prism (Xantec GmbH, Kevelaer, Germany) was printed using a Continuous Flow Microspotter (CFM, Carterra, Salt Lake City, UT, U.S.). The array was loaded into the MX96 SPR instrument (Ibis). Data were processed using software from Ibis and Carterra as described previously [3,46].

#### *4.4. Protein Modeling and Docking Prediction*

The crystal structure of gD2 (285 truncation) was taken from PDB id 2C36 [54–56] and Schrodinger BioLuminate was used to homology model the unstructured regions for missing residues including 257–267. Homology models of the Ab Fvs were also constructed using Schrodinger BioLuminate. For E317, the homology modeling was prevented from using the available crystal structure (PDB id 3W9D); instead, the model was based on the anti-HIV neutralizing Ab 4E10 (PDB id 4M62) as the framework template, yielding a model RMSD value of 1.43 Å to the crystal structure.

Docking models were generated for each Ab model against the gD2 model using the Piper algorithm within Schrodinger BioLuminate [57]. This generated 15–30 different docking models for each sentinel Ab. We note that this method performs rigid docking, not accounting for potential Ab and Ag flexibility.

#### *4.5. Dock Binning*

Dock binning takes as input a set of Abs and a set of docking models for each Ab against the same Ag. It assesses docking models for the extent to which they structurally overlap or "compete", and then clusters them based on their profiles of competition against each other in a manner analogous to experimental epitope binning. Finally, it constructs an antigenic heatmap on the Ag surface, highlighting Ag residues according to the frequency with which they contact Ab residues in the docking models comprising a cluster. We now detail these steps.

#### 4.5.1. Competition

Competition between a pair of docking models was assessed with three different scores [21] based on residue-level distances and biophysical interactions across the Ab–Ag interface:

Common interaction metric: the number of Ab–Ag atomic interactions such that one Ag atom has a common interaction with an Ab atom on each Ab. This was computed based on the Interaction Table in BioLuminate, which includes hydrogen bonds, salt bridges, pi stacking, disulfide bonds, and van der Waals interactions.

Cα metric: the number of Cα atoms in one Ab within a fixed distance (here 10 Å) of those in the other.

Centroid metric: the distance between the heavy-atom centroids of the two Abs.

Results shown were based on the common interaction metric; results from the other metrics differed only minimally in terms of clustering results.
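To make the three scores concrete, the following Python sketch computes each of them for a pair of docking models. It is an illustration under stated assumptions rather than the BioLuminate-based implementation: each dock is assumed to be summarized by a set of contacted Ag atom identifiers and lists of Ab coordinates, and all identifiers and coordinates below are hypothetical placeholders.

```python
import math

def common_interaction(ag_atoms_a, ag_atoms_b):
    """Count Ag atoms that interact with both docked Abs."""
    return len(set(ag_atoms_a) & set(ag_atoms_b))

def ca_metric(ca_a, ca_b, cutoff=10.0):
    """Count C-alpha atoms of Ab A within `cutoff` angstroms of any C-alpha of Ab B."""
    return sum(1 for p in ca_a if any(math.dist(p, q) <= cutoff for q in ca_b))

def centroid(coords):
    """Arithmetic mean of a list of (x, y, z) coordinates."""
    n = len(coords)
    return tuple(sum(c[k] for c in coords) / n for k in range(3))

def centroid_metric(atoms_a, atoms_b):
    """Distance between the centroids of the two docked Abs."""
    return math.dist(centroid(atoms_a), centroid(atoms_b))

# Hypothetical inputs: Ag atom identifiers contacted by each dock, and
# Ab coordinates in angstroms (placeholders, not data from the study).
contacts_a = {"res10:N", "res11:O", "res12:CB"}
contacts_b = {"res11:O", "res12:CB", "res40:CA"}
coords_a = [(0.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
coords_b = [(5.0, 0.0, 0.0), (40.0, 0.0, 0.0)]

shared = common_interaction(contacts_a, contacts_b)
close = ca_metric(coords_a, coords_b)
separation = centroid_metric(coords_a, coords_b)
print(shared, close, separation)
```

Note that the first two scores grow with overlap while the centroid distance shrinks, so a clustering step would need to treat the centroid metric as a dissimilarity rather than a similarity.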

#### 4.5.2. Clustering

Competition scores were collected into a symmetric matrix, with each row and column representing a docking model and each cell containing the score for the associated pair of models. In this matrix, the row/column for a docking model collects its competition score against each docking model; we call this its "competition score profile". A docking model heatmap was generated based on this matrix (schematically illustrated in Figure 3, with binary scores). The models were hierarchically clustered based on their competition score profiles, here using NbClust. The resulting dendrogram was partitioned to define clusters, also known as communities [92–94]. Carterra Epitope Binning software was used for dendrogram and community plot generation [46].
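The clustering step can be sketched in a few lines of Python. This is a stand-in for illustration only: it assumes Euclidean distance between competition score profiles, average linkage, and a fixed target number of communities, none of which is specified in the text, and it does not reproduce the R-based software used in the study. The 5 × 5 binary matrix is a made-up example.

```python
import math
from itertools import combinations

def cluster_profiles(matrix, n_clusters):
    """Agglomerative clustering of the rows (profiles) of a competition matrix.

    Assumptions: Euclidean distance between profiles and average linkage.
    Merging stops at `n_clusters` groups, which corresponds to cutting the
    dendrogram at the height that yields that many communities.
    """
    clusters = [[i] for i in range(len(matrix))]

    def linkage(c1, c2):
        # Average pairwise distance between the profiles of two clusters.
        total = sum(math.dist(matrix[i], matrix[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]))
        clusters[a] = clusters[a] + clusters[b]  # merge b into a (a < b)
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical symmetric binary competition matrix for five docking models:
# models 0-2 mostly compete with each other, as do models 3-4.
competition = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
]

communities = cluster_profiles(competition, n_clusters=2)
print(communities)
```

Because models are compared by whole profiles rather than single pairwise scores, model 2 stays with models 0 and 1 even though it also competes with model 4.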

#### 4.5.3. Antigenic Heatmap

The notion of an antigenic heatmap was developed here to visually represent where the different docking model communities were located on the Ag, i.e., with which Ag residues the Abs in a community interacted the most. Each Ag residue was assigned to the docking model community for which the associated docking models had the most interactions according to the interaction tables generated by BioLuminate (see "common interaction metric" above). The "hotness" of the Ag residue was computed as the number of such interactions, normalized by the size of the community, and the heatmap shade was set accordingly.
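The assignment and normalization just described can be sketched as follows. The input format (one `(community, residue)` record per Ab–Ag interaction), the interpretation of community size as the number of docking models in the community, the tie-breaking rule, and all values are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def antigenic_heatmap(interactions, community_sizes):
    """Assign each Ag residue to its dominant community and compute its hotness.

    interactions:    iterable of (community, ag_residue), one per interaction
    community_sizes: community -> number of docking models (assumed meaning
                     of "size of the community")
    Returns ag_residue -> (community, hotness). Ties go to the community
    encountered first for that residue.
    """
    counts = defaultdict(Counter)
    for community, residue in interactions:
        counts[residue][community] += 1

    heatmap = {}
    for residue, per_community in counts.items():
        community, n = max(per_community.items(), key=lambda kv: kv[1])
        heatmap[residue] = (community, n / community_sizes[community])
    return heatmap

# Hypothetical interaction records and community sizes (placeholders).
interactions = [
    ("red", 10), ("red", 10), ("blue", 10),
    ("blue", 42), ("blue", 42), ("blue", 42), ("red", 42),
]
community_sizes = {"red": 4, "blue": 3}

heatmap = antigenic_heatmap(interactions, community_sizes)
print(heatmap)
```

The community determines the color of a residue and the hotness value its shade, so a residue contacted in every model of a small community can appear as dark as one contacted many times by a large community.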

In order to compare an antigenic heatmap "footprint" with an experimental epitope region, corresponding sets of residues were identified—those Ag residues comprising a community in the antigenic heatmap and those from the experimental epitope. The centroid of each set of residues was computed and the centroid distances measured, both using Schrodinger PyMOL. In addition, the sets of residues were directly compared for membership to identify how many residues were in common.

**Supplementary Materials:** The following are available online. Figure S1: Sequenced mAbs and associated experimental data. Table S1: Agreement between epitope residues targeted by our control antibody, E317, according to the crystal structure vs. those identified from our dock binning based approach.

**Author Contributions:** Conceptualization, B.D.B., G.H.C. and C.B.-K.; Methodology, A.C.; Software, J.Y., A.C. and M.H.; Validation, B.D.B., G.H.C. and C.B.-K.; Formal Analysis, B.D.B., G.H.C. and C.B.-K.; Investigation, T.C.; Resources, G.H.C.; Data Curation, M.H. and R.F.; Writing-Original Draft Preparation, B.D.B.; Writing-Review & Editing, B.D.B., G.H.C. and C.B.-K.; Visualization, A.C. and M.H.; Project Administration, B.D.B.; Funding Acquisition, B.D.B., G.H.C. and C.B.-K. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded in part by NIH grant numbers 1R43AI132075-01, 1R43AI132075-01, and 2R01GM098977. In addition, the research was supported by AI-18289, AI-142940, AI-139618, and a grant from BIONTECH, Inc. (to G.H.C.) and NSF grants RUI-1904797/ACI-1429467 and XSEDE MCB and The Wellcome Trust, UK.

**Acknowledgments:** We thank Amanda Brooks for her critical reading of the manuscript, and Wan Ting Saw for helpful discussions. We would like to thank Ron Weed for illustration of the figures.

**Conflicts of Interest:** B.D.B. has financial interest in a commercial company, Carterra, Inc.

#### **References**


**Sample Availability:** Samples of the compounds are available from the authors.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

**Matthew L. Hudson and Ram Samudrala \***

Department of Biomedical Informatics, Jacobs School of Medicine and Biomedical Sciences, University at Buffalo, Buffalo, NY 14203, USA; mlhudson@buffalo.edu

**\*** Correspondence: ram@compbio.org; Tel.: +1-716-888-4858

**Abstract:** Drug repurposing, the practice of utilizing existing drugs for novel clinical indications, has tremendous potential for improving human health outcomes and increasing therapeutic development efficiency. The goal of multi-disease multitarget drug repurposing, also known as shotgun drug repurposing, is to develop platforms that assess the therapeutic potential of each existing drug for every clinical indication. Our Computational Analysis of Novel Drug Opportunities (CANDO) platform for shotgun multitarget repurposing implements several pipelines for the large-scale modeling and simulation of interactions between comprehensive libraries of drugs/compounds and protein structures. In these pipelines, each drug is described by an interaction signature that is compared to all other signatures that are subsequently sorted and ranked based on similarity. Pipelines within the platform are benchmarked based on their ability to recover known drugs for all indications in our library, and predictions are generated based on the hypothesis that (novel) drugs with similar signatures may be repurposed for the same indication(s). The drug-protein interactions used to create the drug-proteome signatures may be determined by any screening or docking method, but the primary approach used thus far has been BANDOCK, our in-house bioanalytical or similarity docking protocol. In this study, we calculated drug-proteome interaction signatures using the publicly available molecular docking method AutoDock Vina and created hybrid decision tree pipelines that combined it with our original bio- and chem-informatic approach, with the goal of assessing and benchmarking their drug repurposing capabilities and performance. 
The hybrid decision tree pipeline outperformed the two docking-based pipelines from which it was synthesized, yielding an average indication accuracy of 13.3% at the top10 cutoff (the most stringent), relative to 10.9% and 7.1% for its constituent pipelines, and a random control accuracy of 2.2%. We demonstrate that docking-based virtual screening pipelines have unique performance characteristics and that the CANDO shotgun repurposing paradigm is not dependent on a specific docking method. Our results also provide further evidence that multiple CANDO pipelines can be synthesized to enhance drug repurposing predictive capability relative to their constituent pipelines. Overall, this study indicates that pipelines consisting of varied docking-based signature generation methods can capture unique and useful signals for accurate comparison of drug-proteome interaction signatures, leading to improvements in the benchmarking and predictive performance of the CANDO shotgun drug repurposing platform.

**Keywords:** drug repurposing; virtual screening; multiscale; multitargeting; polypharmacology; computational biology; drug repositioning; structural bioinformatics; molecular docking; proteomic signature

#### **1. Introduction**

#### *1.1. Drug Repurposing*

Pharmacological innovation reduces human mortality rates and provides substantial improvements to the quality of life [1]. Therapeutic compounds that have been discovered, lab tested preclinically, and evaluated for risks and efficacy in clinical trials are approved by regulatory bodies such as the United States FDA for specific indications [2]. Potential failures impose high opportunity costs, and the realities of market forces and investment distort the types of ailments for which treatments are pursued [3–5]. The rate of novel drug discovery has been slowing as costs have been increasing, illustrating the need for more efficient paradigms [6].

**Citation:** Hudson, M.L.; Samudrala, R. Multiscale Virtual Screening Optimization for Shotgun Drug Repurposing Using the CANDO Platform. *Molecules* **2021**, *26*, 2581. https://doi.org/10.3390/molecules26092581

Academic Editors: Marco Tutone and Anna Maria Almerico

Received: 25 February 2021; Accepted: 19 April 2021; Published: 28 April 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Drug discovery traditionally relies on screening a compound or set of compounds against a biological target, typically a protein for a specific indication. Generally, these approaches incorporate high-throughput in vitro compound screens [7] and/or cell-based assays [8] of candidates drawn from wet laboratory studies or computational screens of virtual representations of compounds and biological targets [9–11]. If promising in vitro leads are found, they undergo in vivo testing, eventually leading to approval for clinical use if they continue to demonstrate relative efficacy and safety [2]. Traditional drug discovery methods tend to be focused on a single target and indication [11,12]. However, drugs and other human-ingested compounds interact promiscuously with many proteins in the body [13,14]. These off-target interactions are responsible for side effects and the fact that one drug may be useful for treating multiple indications [15–19]. Single-target approaches may miss promising leads and potentially beneficial off-target side effects, while drugs that have already been discovered and vetted for safety may be used in novel treatment contexts.

Drug repurposing is the practice of finding new uses for existing drugs, taking advantage of prior safety, efficacy, and pharmacological knowledge and data [15]. Drug repurposing has the potential to arbitrarily increase the utility of the FDA-approved drug library [17,20,21], particularly via innovations such as multitarget drug repurposing [22,23]. Drug repurposing has yielded new uses for multiple drugs [15,17] and has demonstrated potential for the treatment of viral [24,25], bacterial [26], and complex indications such as cancer [17].

#### *1.2. Computational Drug Repurposing Using Molecular Docking*

Computational models that improve drug discovery and repurposing leverage rapidly increasing computer processing power and vast collections of preclinical (in vitro, in vivo) and clinical data [22]. Although there are a variety of computational approaches, the most relevant ones to this study are structure-based. Structure-based approaches focus on modeling/simulating the effects the three-dimensional (3D) structure of a compound may have on one or more macromolecules, typically protein structures [27]. Structure representations are based on data obtained from X-ray diffraction, NMR spectroscopy, cryogenic electron microscopy, and biochemical and biophysical simulation studies. These models may incorporate other features such as predicted protein-compound binding sites, simulations of the surrounding chemical environments, and the functional characteristics of protein structures.

Molecular docking models the three-dimensional (3D) interaction between small molecule compounds and macromolecular protein structures [28–32]. Typically, these simulations algorithmically calculate the optimal position and orientation of a compound structure that interacts with (or binds to) a particular region of a protein structure and its corresponding interaction strength, using physics-based [33] or knowledge-based [34,35] force fields or scoring functions. The characteristics of a correctly modeled compound-protein structure provide researchers insight into the biological implications of the interaction: for example, a researcher may infer that a signaling pathway may be interrupted if a particular protein were to be inhibited by the compound based on the strength of its binding energy [36]. Molecular docking is also useful when researching large sets of compounds and proteins [37]. By comparing the relative differences in interactions between protein-compound pairs, the researcher can rank and organize pairs according to the strength of their interaction score and/or their similarity to identify patterns that are apparent only when examining large sets with many possible combinations, which is difficult and expensive to do in in vitro or in vivo experiments [22,38]. Molecular docking techniques have varying performance advantages and limitations [39,40]; however, provided that docking approaches are used wisely in concert with other experimental techniques, they have the potential to be useful for drug repurposing, particularly in a large-scale context [38].

#### *1.3. Shotgun Multitarget Multi-disease Drug Repurposing Using the CANDO Platform*

The Computational Analysis of Novel Drug Opportunities (CANDO) platform was developed to mitigate endemic problems in drug discovery and enable multitarget approaches to drug repurposing [22,38,41–45]. The CANDO platform is designed to provide insights about the holistic behavior of compounds interacting within complex biological systems, including how a compound behaves relative to other compounds, and is an extensible standardized framework for building and combining drug repurposing, discovery, and design simulation pipelines. The similarity of drug-protein interaction behavior between a small molecule drug/compound and its macromolecular environment is hypothesized to indicate the similarity of drug therapeutic function [22]. In traditional structure-based and ligand-based drug discovery, therapeutic similarity inferences are often based on molecular target similarity and compound similarity [46]. CANDO extends the similarity assumption principle to include holistic multiscale interaction similarity, which characterizes compounds by the nature of their interaction with entire proteomes and (eventually) interactomes [22,38]. Extending the interaction similarity frontier enables CANDO to account for the promiscuous nature of compound interaction within biological systems and characterize previously unconsidered therapeutic functions of existing approved drugs. CANDO pipelines are evaluated by a benchmarking protocol that examines the relative ranking of every drug for every indication with two or more approved drugs. Analyzing the relative ranking of approved drugs for each indication enables the evaluation of the effectiveness of the platform for recovering known information, comparing relative pipeline performance for particular indications, calculating the accuracy and precision for ranking approved drugs, and determining which components of the platform need improvement.

In this study, we set out to extend the CANDO platform with an additional molecular docking pipeline using the popular software AutoDock Vina [30] to determine whether the prior CANDO performance was dependent on a specific molecular docking protocol, how different molecular docking protocols affect CANDO's performance, and whether hybridizing molecular docking pipelines yields improved performance, as we have previously observed combining structure- and ligand-based CANDO pipelines [43].

#### **2. Results**

#### *2.1. Benchmarking Performance of the Different Pipelines*

The performance of two new primary pipelines and a hybrid one in the CANDO platform was investigated and compared to those previously created, including a random control (see Figure 1 and Methods). The first is the Vina pipeline that uses the eponymous molecular docking program to screen the CANDO v1.5 3733 drug/compound library against a 134-protein subset of the full proteome library (Vina-134). Multiple binding sites for each protein were predicted and targeted for docking, and the strongest interaction scores were used to construct the drug-proteome signatures.

The second pipeline used was the default CANDO v1.5 pipeline restricted to the same 134 protein subset (v1.5-134). We generated a hybrid decision tree pipeline drawn from a combination of the Vina-134 and the v1.5-134 pipelines. For comparison, we examined the performance of these pipelines with respect to a random control and the v1.5 pipeline implemented with the full CANDO proteome library consisting of 46,784 protein structures (v1.5-full).

Figure 2 illustrates the relative performance of these different pipelines. At the top10 threshold, the hybrid decision tree yielded 13.3% accuracy, v1.5-134 10.9%, and Vina-134 7.11%. The v1.5-134 and v1.5-full pipelines outperformed the Vina-134 pipeline, but the latter was able to substantially contribute to the superior performance of the hybrid pipeline. Notably, the hybrid decision tree pipeline outperformed the v1.5-full pipeline, which had a top10 accuracy of 12.8%, despite a difference of two orders of magnitude in the number of proteins used in the implementation of the pipelines (134 vs. 46,784).

**Figure 1.** CANDO shotgun drug repurposing platform and pipeline overview. On the left side of the figure is a flow diagram, which indicates the general protocol for implementing a CANDO drug-proteome pipeline. To the right of the flow diagram, each pipeline relevant to this investigation is displayed along with implementation details for each phase of the CANDO protocol. Data curation: The drug-proteome pipelines utilize libraries of protein structure and drug structure representations. Interaction scoring protocol: These pipelines use bioinformatic, cheminformatic, and molecular docking methods to predict the scores between each protein and drug interaction. The set of protein interaction scores for each drug is considered its interaction signature. Each interaction signature can be compared with one another by assessing the root mean squared deviation of their interaction signatures. Drug comparison protocol: Every drug signature is compared with every other drug signature. After every comparison is made, each drug has a list containing the ordered set of every other drug, from most similar signature to least similar signature. Benchmarking protocol: The CANDO benchmarking procedure assesses, for every drug, how many other drugs with the same indication association are found within certain ranking cutoffs. An indication-specific accuracy score is produced by averaging the recovery rate of co-associated drugs for every drug associated with the particular indication for particular ranking cutoffs. The overall pipeline average indication accuracy is the mean of all the individual ones for a particular cutoff. Three pipelines were generated during this investigation: v1.5 (implemented with a subset of the CANDO proteome library), AutoDock Vina (using the same proteome sublibrary), and a hybrid decision tree pipeline derived from the former two pipelines. 
Each of the subset pipelines utilized a small sublibrary (134 proteins) of the original CANDO v1 and v1.5 pipelines. Although the pipelines used different signature generation approaches (BANDOCK bioanalytical or similarity docking and AutoDock Vina molecular docking), their signatures underwent the same similarity assessment and benchmarking protocol. However, there is room for variation via the use of alternate docking, similarity assessment, and benchmarking approaches.
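The benchmarking protocol summarized in the caption above can be sketched in a few lines of Python. This is a minimal illustration, not the CANDO implementation: it assumes a hit-based reading of the protocol (a drug counts as recovered when at least one other drug approved for the same indication appears within the cutoff of its ranked list), and the function names, data layout, and toy data are hypothetical.

```python
from statistics import mean

def indication_accuracy(ranked_lists, approved, cutoff):
    """Percent of approved drugs whose top-`cutoff` most similar drugs
    include at least one other drug approved for the same indication."""
    approved = set(approved)
    hits = 0
    for drug in approved:
        top = ranked_lists[drug][:cutoff]
        if any(other in approved and other != drug for other in top):
            hits += 1
    return 100.0 * hits / len(approved)

def average_indication_accuracy(ranked_lists, indications, cutoff):
    """Mean of the per-indication accuracies, restricted to indications
    with at least two approved drugs (as in the benchmarking protocol)."""
    scores = [indication_accuracy(ranked_lists, drugs, cutoff)
              for drugs in indications.values() if len(drugs) >= 2]
    return mean(scores)

# Toy data (hypothetical): each drug maps to all other drugs,
# ordered from most to least similar signature.
ranked_lists = {
    "A": ["B", "C", "D"],
    "B": ["A", "D", "C"],
    "C": ["D", "A", "B"],
    "D": ["C", "B", "A"],
}
indications = {"ind1": ["A", "B"], "ind2": ["A", "C", "D"]}
```

Under these assumptions, `average_indication_accuracy(ranked_lists, indications, 1)` averages one score per indication, which mirrors the distinction drawn in the caption between indication-specific accuracy and the overall pipeline average.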

**Figure 2.** Benchmarking performance of CANDO pipelines used in this study. Three docking-derived pipelines implemented in CANDO: v1.5-full, using the interaction scores generated by the default CANDO v1.5 bioanalytical or similarity docking protocol (BANDOCK) for the full (46,784) proteome library, v1.5-134, using the same interaction scoring protocol for a 134 protein sublibrary, and Vina-134, based on interaction scores generated using AutoDock Vina for the 134 protein sublibrary are compared with (d) a hybrid decision tree pipeline derived from combining Pipelines (b) and (c), as well as (a) the random control reference pipeline calculated numerically from a hypergeometric distribution [43]. The pipelines were assessed by three CANDO platform benchmarking metrics: average indication accuracy (%), pairwise accuracy (%), and coverage (%). Performance cutoffs are denoted by colored bars from most to least stringent: top10 (dark purple), top25 (light purple), top37/top1% (dark pink), top50 (light pink), top100 (dark blue), top5% (light blue), top10% (dark green), and top50% (light green) for 1439 indications with at least two approved drugs using a leave-one-out benchmarking protocol (see the Methods Section). White dots denote the highest overall accuracy at each threshold. The hybrid decision tree pipeline, which incorporates the highest indication accuracies from the Vina-134 and v1.5-134 pipelines, performed the best at all cutoffs (white dots). Black dots denote high performance in individual pipelines, which was obtained using the two v1.5 pipelines, one based on the 134 proteome sublibrary and the other on the full proteome library. v1.5-134 yielded the highest top50 percent average indication accuracy (85.3%), top100 coverage (52.742%), top5% coverage (64.382%), and top50% coverage (96.064%). Individually, the Vina-134 pipeline significantly outperformed the random control and yielded a significant fraction of the performance of the v1.5 pipelines. 
The hybrid decision tree pipeline performed the best, indicating that diversity in pipeline simulation implementation can be leveraged to increase drug repurposing performance.
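To give a sense of how a random control can be derived from a hypergeometric distribution, the sketch below computes the chance that a randomly ordered ranking places at least one co-approved drug within a cutoff. It is an illustrative assumption about the form of such a control, not the exact calculation used in [43]; the function name and arguments are hypothetical.

```python
from math import comb

def random_hit_probability(total, approved, cutoff):
    """Probability that a random ranking of the other `total - 1` compounds
    places at least one of the `approved - 1` co-approved drugs in the
    top `cutoff` positions (hypergeometric, complement of zero successes)."""
    others = total - 1           # compounds ranked against the query drug
    co_approved = approved - 1   # other drugs approved for the indication
    none_in_top = comb(others - co_approved, cutoff) / comb(others, cutoff)
    return 1.0 - none_in_top
```

For example, with the 3733-compound library and an indication with two approved drugs, the expression reduces to 10/3732 at the top10 cutoff, since there is exactly one co-approved drug and ten favorable positions out of 3732.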

#### *2.2. Divergence in Indication Accuracy at Various Thresholds*

Figure 3 illustrates the similarity and divergence of indication accuracy performance at various thresholds: i.e., instances where the Vina-134 pipeline outperforms the v1.5-134 pipeline, instances where the v1.5-134 pipeline outperforms the Vina-134 pipeline, instances where each pipeline yields the same indication accuracy, and instances where each pipeline yields zero percent accuracy. At the top10 threshold, the Vina-134 pipeline had 191 indications (about 13% of all indications) that outperformed the v1.5 pipeline, which had 363 indications outperform Vina-134 (about 25% of all indications). There were 885 equivalently performing indications (with 828 of them at zero percent accuracy) at the top10 cutoff. Overall, the divergence in relative performance increased as the thresholds became less stringent (the CANDO pipeline outperformance share began to decline slightly after the top5% threshold). v1.5 had a higher number of indications in which it outperformed Vina-134. After the top50 cutoff, the proportion of equivalent indication accuracies that were both zero relative to the total equivalent indication accuracies began to decline rapidly.
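The four categories described above can be tallied directly from two sets of per-indication accuracies. The following is a minimal sketch with hypothetical names and toy values; it assumes both pipelines report an accuracy for the same set of indications.

```python
def divergence_counts(acc_a, acc_b):
    """Count indications where pipeline A scores higher, pipeline B scores
    higher, both are equal, and (as a subset of equal) both are zero."""
    counts = {"a_higher": 0, "b_higher": 0, "equal": 0, "equal_zero": 0}
    for indication, a in acc_a.items():
        b = acc_b[indication]
        if a > b:
            counts["a_higher"] += 1
        elif b > a:
            counts["b_higher"] += 1
        else:
            counts["equal"] += 1
            if a == 0:
                counts["equal_zero"] += 1
    return counts

# Toy per-indication accuracies in percent (hypothetical values).
vina = {"i1": 50.0, "i2": 0.0, "i3": 0.0, "i4": 25.0}
v15 = {"i1": 25.0, "i2": 0.0, "i3": 33.3, "i4": 25.0}
counts = divergence_counts(vina, v15)
```

The first three counts always sum to the number of indications, which is consistent with the top10 figures reported above (191 + 363 + 885 = 1439).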

**Figure 3.** Comparing and contrasting indication accuracies for two CANDO platform pipelines at different cutoffs. Each Venn diagram represents the set of indication accuracies (1439 total) for the Vina-134 and v1.5-134 pipelines at different cutoffs (top10, top25, top37, top50, top100, top5%, top10%, and top50%). Indications that scored higher for the Vina pipeline are in yellow. Indications that scored higher for the v1.5 pipeline are in purple. Indications that scored the same for each pipeline are in gray. The number of indications where both pipelines yield 0% accuracy are provided in parentheses and the total number of equal indication accuracies provided below. At the top10 cutoff, the Vina-134 pipeline yields higher accuracies for 191 indications and the v1.5 pipeline for 363 indications, and each pipeline yields the same accuracy for 885 indications. Although the Vina-134 pipeline produces a substantial number of indication accuracies that are higher or equivalent to v1.5 indication accuracies, the v1.5 pipeline outperforms the Vina-134 pipeline at every cutoff except for top50%. Nonetheless, the orthogonality in the above diagrams indicates that individual pipelines can be synergistically combined into a hybrid pipeline that yields considerable performance improvements.

#### *2.3. Net Differences in Indication Accuracy*

Figure 4 elucidates the net differences in top10 accuracies between two pipelines (and the proportion of approved drugs recovered per indication in the top10) for 700 indications. With some notable exceptions, the v1.5-134 pipeline outperformed the Vina-134 pipeline in frequency and magnitude. On a per indication basis, as the total number of approved drugs decreased, the Vina-134 pipeline had a higher number of the outperforming indications in terms of frequency and magnitude.

**Figure 4.** Comparison of 700 indication accuracies for two CANDO platform pipelines at the top10 cutoff. The top10 indication accuracies for 700 indications produced by the Vina-134 and v1.5-134 pipelines are shown in the left panel, with the v1.5-134 pipeline per indication accuracies in purple on the left side and the Vina-134 pipeline accuracies in black on the right. The net difference in pipeline accuracy for the same indication is shown in the center panel, using the same percentage scale as the left. The number of drugs recovered by the best performing pipeline (in blue on the left side) and the total number of drugs approved per indication (in black on the right side) are shown in the right panel. The number of approved drugs for all three panels ranges from 158 drugs at the top to two drugs at the bottom. Generally, the v1.5-134 pipeline outperforms the Vina-134 pipeline, both by the number of indications and net difference in accuracy per indication.

#### *2.4. Relative Pipeline Indication Accuracy*

The pipelines differed at average indication accuracy thresholds and on a per indication basis. In some cases, a pipeline that performed worse overall may do better for a specific indication. Figures 3–5 illustrate the overall divergence, the magnitude of divergence, and the threshold frequency distributions. On a per-indication basis, there was divergence in the relative indication accuracy at various cutoffs in terms of net difference in accuracy, recovery at a particular threshold, and the frequency of a particular indication being recovered at a particular interval. The divergence between the per-indication performance of each pipeline elucidated by Figures 2–4 suggests that each pipeline should be used in conjunction with one another for maximum indication inclusivity and accuracy.

**Figure 5.** Frequency comparison of indication accuracies for two CANDO platform pipelines at different cutoffs. Shown are four histograms denoting the frequency with which indications fall within particular accuracy ranges (Vina-134 pipeline accuracies are in light yellow and v1.5 accuracies in black). The similarity of each distribution is assessed by the *p*-value using the Kolmogorov–Smirnov test (*p*-values less than 0.05 are considered to be significant). The v1.5 pipeline outperforms Vina-134 overall, but the *p*-values indicate that the accuracy distributions are different for the two pipelines, indicating the utility of combining pipelines to produce synergistic performance.

#### *2.5. Comparison of the Pipeline Distribution of Per Indication Accuracies*

Figure 5 illustrates the distribution of indication accuracies by counting the frequency of each indication that falls within a certain accuracy range. The dissimilarity of pipeline distributions at each cutoff was assessed by applying the Kolmogorov–Smirnov test. The v1.5 pipeline outperformed Vina-134 overall (which yielded a higher frequency of indications exceeding 50% accuracy).
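For readers unfamiliar with the test, the sketch below shows a two-sample Kolmogorov–Smirnov comparison of two accuracy distributions using only the Python standard library. It is an illustration rather than the analysis code used in this study: the *p*-value relies on a common asymptotic approximation (an assumption that is reasonable for large samples such as 1439 indications), and in practice a library routine such as `scipy.stats.ks_2samp` would likely be preferred.

```python
from math import exp, sqrt

def ks_two_sample(xs, ys):
    """Return (D, approximate p-value) for the two-sample KS test, where D
    is the largest gap between the two empirical distribution functions."""
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        # Advance both samples past every value <= v, then compare ECDFs.
        while i < n and xs[i] <= v:
            i += 1
        while j < m and ys[j] <= v:
            j += 1
        d = max(d, abs(i / n - j / m))
    # Asymptotic Kolmogorov distribution with a small-sample correction.
    en = sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.3:
        # The series converges slowly here and the p-value is ~1 anyway.
        return d, 1.0
    p = 2.0 * sum((-1) ** (k - 1) * exp(-2.0 * (k * lam) ** 2)
                  for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)
```

Calling `ks_two_sample(vina_accuracies, v15_accuracies)` on two hypothetical lists of per-indication accuracies at a given cutoff returns the statistic and a *p*-value that can be compared against the 0.05 threshold used in Figure 5.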

#### *2.6. Indication Accuracy Distribution*

Figure 6 examines the distribution of indications to illustrate their relative performance within each pipeline. Pipelines can also be compared with symmetrical accuracy distribution charts, where individual pipeline accuracy is denoted along the x and y axes. Each point can represent a particular indication (e.g., one of the 1439 indications in the CANDO platform), a defined indication class (e.g., all 39 indications with the string "neoplasm"), or some other way of denoting indications (e.g., indications that occupy a particular branch of the Medical Subject Headings (MeSH) classification [47] or those that are ontologically similar [48]). When pipelines reach accuracy consensus (or near consensus) for a particular indication (or indication grouping), the point falls on or close to the 45 degree symmetry line. These figures suggest that different pipelines had varying success in benchmarking performance on a per-indication basis. More rigorous clustering analysis, indication classification, and indication definition will yield deeper insight into the relative strengths of each pipeline.

#### *2.7. Distribution of Individual Drug-Indication Pair Rankings*

Supplementary Figure S1 plots every drug-indication pair and its corresponding rank within each pipeline. These suggest that there is some substantial ranking consensus between each pipeline, as well as substantial divergence. The distribution was plotted at linear and logarithmic scales to illustrate the density of approved drug-indication pair ranking consensus and divergence. There is a high density of drug-indication pairs that have relatively high ranking in each pipeline. There is also a high density of drug-indication pairs that have a significantly higher ranking in the v1.5-134 pipeline than the Vina-134 pipeline. As with pipeline per-indication accuracy divergence, further investigation into drug-indication pair divergence may help improve the performance of individual and hybrid pipelines, particularly in cases where one pipeline ranked a drug-indication pair substantially higher than the other one (e.g., top100 in one and bottom 50% in the other).

**Figure 6.** Comparison of the indication accuracies at various cutoffs and for a defined indication class for two CANDO platform pipelines. The top panel denotes a symmetrical accuracy chart. Each axis measures the indication accuracy for each pipeline, and indications are plotted according to their corresponding accuracies (different cutoffs are distributed in alternate colors). Points that land on the 45 degree red line are indications where the pipelines reached consensus; points that fall closer to a particular axis achieved a relatively higher score with the corresponding pipeline. The bottom panel isolates indications for the defined class "neoplasm" comprising 39 indications with the corresponding string. The asymmetrical distribution of the accuracy plot suggests pipeline accuracy differentiation, i.e., different pipelines have differing performance strengths and weaknesses, on a per-indication and indication class level.

#### **3. Discussion**

#### *3.1. Multiple Large-Scale Virtual Screening Pipelines*

In this investigation, we hypothesized that distinct docking methods would yield distinct drug-proteome interaction signatures due to differing simulation implementation, and correspondingly differing performance for shotgun drug repurposing: BANDOCK, the default bioanalytical or similarity docking protocol in CANDO, is a knowledge-based template/comparative modeling protocol [22,38,41,44,45], and AutoDock Vina is a more traditional molecular docking approach with physics-based force fields [30]. Including other molecular docking pipelines beyond the default pipeline implemented in the platform enabled us to evaluate whether or not CANDO as a platform was specifically dependent on the drug-proteome signature generation methodology implemented in the default pipeline.

Our results demonstrated that the CANDO platform was not dependent on a single pipeline implementation, and that combining different virtual screening pipelines can yield better performance relative to using the individual ones. On a platform level, the drug-proteome signature ranking and indication recovery paradigm was viable using more than one means of signature generation. On a pipeline level, the pipelines (the two large-scale virtual screening pipelines and the combined decision tree pipeline) each demonstrated varying degrees of performance and instances of unique signal capture.

The Vina-134 pipeline implemented in this study was viable in that it performed substantially better than the random control and performed at a significant fraction of the performance of the original default pipeline that utilized a bigger protein set. However, small protein libraries have been shown to perform relatively well, and some subsets of protein libraries performed better than others [44]. As previously demonstrated [43], hybrid pipelines can draw from the strengths of each constituent pipeline. As is the case here, the absolute performance of the Vina-134 pipeline was not the best, yet it substantially contributed to the higher performing hybrid pipeline.

#### *3.2. Limitations and Future Work*

Although some pipelines yielded superior signal over others in specific circumstances, precisely identifying why this occurred warrants further investigation. On a per-indication basis, it is possible to identify the superior performance of one pipeline over another (Figures 3 and 4), but the MeSH indication classes were not precisely defined or had varying levels of specificity to one another. This issue will be addressed in the future through the use of more precisely defined indication mapping, for instance by using a realism-based ontology [48,49]. We are also using mathematical, statistical, and machine learning techniques to rigorously evaluate and enhance CANDO's pipeline performance, as well as to identify clusters of drug-indication pair rankings when comparing different pipelines and methods [43,50], to yield insight into the ability of each pipeline to accurately recover known per-indication association information and make useful predictions for downstream prospective preclinical and clinical validation [42,51].

#### **4. Materials and Methods**

#### *4.1. CANDO Platform and Pipeline Implementation*

Figure 1 provides an overview of the CANDO platform and the particular pipeline implementations relevant to this study. The platform uses drug/compound and protein structure libraries curated from public sources and implements protocols for drug- and compound-proteome interaction signature generation, signature similarity calculation and sorting, assessing whether known drugs are ranked highly for the correct indications for single or hybrid pipelines (benchmarking), and generating novel putative drug candidates for specific indications (prediction). CANDO's drug ranking pipelines have utility in many repurposing research contexts. For example, these pipelines can be used for lead generation for subsequent in vitro/in vivo testing and eventual off-label clinical use by physicians. By assessing the top ranking subset of drugs, a researcher or clinician can efficiently infer promising experimental or clinical drug candidates based on drugs ranked relative to FDA-approved drug treatments and prior experimental evidence. For many clinical indications, CANDO pipelines are able to identify and highly rank FDA-approved drug treatments along with drugs that are FDA approved for other indications. Researchers can also infer associations between clinical indication classes, diseases, and biological pathways through the examination of indication-indication association networks connected by highly ranked drugs they have in common or other features of their respective compound-proteome signature. As illustrative examples of the broad uses of CANDO, Supplementary Figure S2 and Supplementary Table S1 describe the indication-indication associations for a selection of MeSH neoplasm indications based on shared drugs ranked in the top10 in the Vina pipeline.

#### 4.1.1. Drug/Compound, Protein Structure, and Indication Library Curation

The default CANDO pipelines were implemented using bio-/chem-informatic docking protocols, where interactions were predicted from curated drug and protein libraries. The specific implementations and evolution of the libraries were reported extensively in several publications [22,38,41,44,45]. Briefly, the initial versions of CANDO (v1 and v1.5) incorporated 46,784 proteins and 2030 indication associations for 1439 drugs (out of 3733 compounds in total). Much of the data were drawn from the Protein Data Bank [52], the Food and Drug Administration, PubChem [53], the Comparative Toxicogenomics Database [54], DrugBank [55], protein structure modeling [56], and other sources.

The pipelines used in this study relied on curated sublibraries of the structures of 3733 drugs/compounds and 134 proteins and 13,746 drug-indication mappings, obtained from the same sources as above. We used the sublibraries to rapidly evaluate the utility of multiple molecular docking pipelines.

#### 4.1.2. Drug- and Compound-Proteome Interaction Signature Generation

A CANDO virtual screening pipeline simulates the interactions between all of its proteins and drugs/compounds, usually 3D structures, and is not dependent on any particular approach to accomplish this. These simulations generate proteomic similarity signatures (the vector of drug-protein interaction scores). The default CANDO platform pipelines generate drug-proteome interaction signatures using bioinformatic and cheminformatic docking protocols also described elsewhere extensively [22,38,41,44,45]. These signatures were compared for similarity and ranked. CANDO pipeline Version 1.5 [45] is a refinement of the original default pipeline [22,38,41,44] that uses near identical libraries, but improved interaction scoring [45]. We extended the drug- and compound-proteome interaction signature protocols to include the calculated binding energies generated by the program AutoDock Vina [30], as well as created hybrid pipelines combining molecular docking with bioanalytical/similarity docking (further details below).

#### 4.1.3. Drug- and Compound-Proteome Signature Similarity Calculation and Sorting

Broadly, the CANDO platform works by sorting every drug/compound relative to every other one based on their similarity and then uses known drug-indication associations to assess performance (Figure 2). Various pipelines implemented in CANDO generate drug-proteome interaction signatures for similarity sorting [22,38,43–45]. Underlying this platform is the core assumption that the similarity of drug interaction behavior across a proteome may be used to infer similarity in therapeutic function. The similarity between each drug and every other drug/compound is calculated using the root mean squared deviation of the individual interaction scores across a pair of drug-proteome interaction signatures [38].

Combined drug-proteome interaction signatures form an interaction matrix, with drugs along one axis and proteins on the other. These signatures are compared with one another and then ranked on a per-drug basis, and the quality of the resulting ranking is evaluated using the leave-one-out benchmarking protocol described below.
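The similarity calculation and per-drug sorting described above can be illustrated with a minimal sketch. The compound names and interaction scores below are hypothetical toy values, and the function names are assumptions made for illustration; this is not the CANDO implementation itself.

```python
import math


def rmsd(sig_a, sig_b):
    """Root mean squared deviation between two interaction signatures."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must cover the same proteins")
    squared = sum((a - b) ** 2 for a, b in zip(sig_a, sig_b))
    return math.sqrt(squared / len(sig_a))


def rank_similar(matrix, query):
    """Sort every other compound from most to least similar to `query`."""
    others = [name for name in matrix if name != query]
    return sorted(others, key=lambda name: rmsd(matrix[query], matrix[name]))


# Hypothetical toy interaction matrix: one row per compound, one column per protein.
toy_matrix = {
    "drug_a": [0.9, 0.1, 0.4],
    "drug_b": [0.8, 0.2, 0.4],
    "drug_c": [0.1, 0.9, 0.7],
}
ranking = rank_similar(toy_matrix, "drug_a")
```

A lower RMSD means more similar signatures, so the first entry of `ranking` is the compound whose proteome-wide interaction behavior is closest to that of the query.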

#### *4.2. Benchmarking CANDO Platform Pipelines*

Our benchmarking protocol calculates the performance for every indication with at least two approved drugs (1439 out of 3733 total) at various cutoffs, considering only the top10 (abbreviated as "top10"), top25, top37 or 1%, top50, top100, top5%, top10%, and top50% of similarly ranked drugs. For each indication, the accuracy was derived from calculating how many known drugs mapped to that indication were "recovered" and highly ranked at various cutoffs.

We utilized three metrics to benchmark pipeline performance: average indication accuracy, pairwise accuracy, and coverage [22,38,41,44,45], all assessed at the different cutoffs. The average (mean) indication accuracy (%) is the average of all individual indication accuracies. The individual indication accuracy metric was calculated using the formula *c*/*d* × 100, where *c* is a count of the number of times at least one approved drug for the indication was recovered within a particular cutoff and *d* is the total number of drugs approved for that indication. The other two benchmarking metrics were pairwise accuracy (weighted average of indication accuracies using the total number of approved drugs per indication) and coverage (number of indications that have an accuracy greater than zero).
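As a rough illustration of the *c*/*d* × 100 formula, the sketch below assumes that *c* is counted once per approved drug: each approved drug is left out in turn, and it scores a hit if another drug approved for the same indication appears within the cutoff of its similarity ranking. The data structures, names, and toy values are hypothetical.

```python
def indication_accuracy(approved, rankings, cutoff):
    """Percent of approved drugs with another approved drug in their top `cutoff`."""
    approved = set(approved)
    hits = 0
    for drug in approved:
        top = rankings[drug][:cutoff]  # each ranking excludes the drug itself
        if any(other in approved for other in top):
            hits += 1
    return 100.0 * hits / len(approved)


def benchmark(indications, rankings, cutoff):
    """Return (average indication accuracy, coverage) at one cutoff."""
    scores = [
        indication_accuracy(drugs, rankings, cutoff)
        for drugs in indications.values()
        if len(drugs) >= 2  # only indications with at least two approved drugs
    ]
    average = sum(scores) / len(scores)
    coverage = sum(1 for score in scores if score > 0)
    return average, coverage


# Hypothetical toy rankings (most similar first) and indication mappings.
toy_rankings = {
    "a": ["b", "c", "d"],
    "b": ["a", "d", "c"],
    "c": ["d", "a", "b"],
    "d": ["c", "b", "a"],
}
toy_indications = {"ind1": ["a", "b"], "ind2": ["a", "c"]}
average_top1, coverage_top1 = benchmark(toy_indications, toy_rankings, cutoff=1)
```

In the toy data, `ind1` is fully recovered at a cutoff of one while `ind2` is not, so the average accuracy is the mean of the two per-indication values and the coverage counts only `ind1`.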

#### *4.3. New and Hybrid Pipelines*

The pipelines examined in this study were derived from similarity ranking and benchmarking drug-proteome interaction signatures generated by large-scale bioinformatic and molecular docking. The CANDO platform is not limited to using docking-based virtual screening pipelines and has the potential to incorporate many different approaches to pipeline implementation and data sets (for example, ligand centric approaches have proven quite effective [43]).

#### 4.3.1. Virtual Screening Pipeline Using Autodock Vina

We used a small sublibrary (134 proteins) of the full CANDO proteome library to create the new molecular docking virtual screening pipeline due to computational constraints and also because we previously have shown that appropriately selected sublibraries of a similar size from the full library yield similar or better benchmarking performance [44]. We used the popular software AutoDock Vina Version 1.1.2 [30] for molecular docking of each protein structure against 3733 drugs/compounds from the CANDO v1.5 libraries. As with BANDOCK, we used COFACTOR [57] to predict binding sites, for binding search space size optimization [58], and used the strongest interaction score (lowest calculated binding energy) for each simulation from multiple sites. The best interaction score values for a drug-protein pair were used to generate the drug-proteome signatures.

#### 4.3.2. Decision Tree Pipeline

Prior CANDO platform investigations have demonstrated that multiple pipelines can be combined into a hybrid decision tree to maximize indication accuracy by drawing from pipelines that produce the best performance on a per-indication basis [43]. We used a similar approach in this investigation, using the pipeline that had the highest performance at the top10 cutoff.
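A minimal sketch of this per-indication selection is given below. It assumes that each pipeline has already been benchmarked at the top10 cutoff; the pipeline labels, indication labels, and accuracy values are hypothetical placeholders.

```python
def hybrid_selection(per_pipeline):
    """Pick, for each indication, the pipeline with the highest top10 accuracy.

    `per_pipeline` maps pipeline name -> {indication: top10 accuracy (%)}.
    Returns ({indication: (chosen pipeline, accuracy)}, mean hybrid accuracy).
    """
    indications = set()
    for accuracies in per_pipeline.values():
        indications.update(accuracies)

    choice = {}
    for indication in indications:
        best = max(per_pipeline, key=lambda p: per_pipeline[p].get(indication, 0.0))
        choice[indication] = (best, per_pipeline[best].get(indication, 0.0))

    mean_accuracy = sum(acc for _, acc in choice.values()) / len(choice)
    return choice, mean_accuracy


# Hypothetical top10 accuracies for two pipelines and two indications.
toy_results = {
    "pipeline_x": {"ind1": 50.0, "ind2": 0.0},
    "pipeline_y": {"ind1": 25.0, "ind2": 100.0},
}
chosen, hybrid_mean = hybrid_selection(toy_results)
```

By construction, the hybrid accuracy for each indication is at least as high as that of any single contributing pipeline.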

#### *4.4. Controls*

We also compared the benchmarking performance of the pipelines to values obtained using a hypergeometric distribution that estimates the numerical probability of making a correct prediction by chance. This is one of the random control reference benchmarks used in the CANDO platform, the implementation of which was covered in detail in prior publications [43,45]. Benchmarking performance was also compared to the default pipeline implementations in CANDO Version 1 and Version 1.5 using the complete libraries.
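To give a sense of how such a chance estimate can be computed, the following sketch uses the hypergeometric distribution to obtain the probability that at least one other approved drug falls within a cutoff when compounds are ranked at random. The exact setup of the CANDO random control may differ; the function name and the way the population is defined here are assumptions.

```python
from math import comb


def chance_hit_probability(n_compounds, n_approved, cutoff):
    """P(at least one other approved drug in a random top-`cutoff` list)."""
    others = n_compounds - 1   # compounds that can appear in the ranking
    targets = n_approved - 1   # other drugs approved for the same indication
    misses = comb(others - targets, cutoff)  # rankings containing no target
    return 1.0 - misses / comb(others, cutoff)


# Example: 3733 compounds, an indication with two approved drugs, top10 cutoff.
p_top10 = chance_hit_probability(3733, 2, 10)
```

With a single target among 3732 candidates, the probability reduces to the cutoff divided by the number of candidates, which offers a quick sanity check on the formula.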

#### **5. Conclusions**

Our results indicated that the utilization of multiple diverse docking-based virtual screening approaches in drug repurposing contexts such as the CANDO platform improves benchmarking performance. The Vina-134 pipeline performance indicated that the CANDO platform hypothesis of drug behavior similarity is not limited to the original bioanalytical or similarity docking protocol BANDOCK for interaction signature generation. The hybrid decision tree pipeline performance provided further evidence that multiple signature generation pipelines may be combined to yield improved performance. Ongoing and future platform enhancement will incorporate multiple signature generation protocols and pipeline synthesis using AI/machine learning approaches to optimize performance. These improvements in turn will lead to greater predictive power and higher confidence in novel drug candidates generated for specific indications, which will be verified via prospective preclinical and clinical studies.

**Supplementary Materials:** The following are available online at http://compbio.org/data/mc\_cando\_multiscale\_optimization/ (accessed on 24 April 2021). Supplementary Figure S1: Plot of every drug-indication pair and their corresponding rank within each pipeline. Supplementary Figure S2: Indication-indication associations between MeSH neoplasm-associated classes based on the number of compounds predicted in the top10 cutoff by the Vina pipeline. Supplementary Table S1: Raw indication-indication association counts.

**Author Contributions:** M.L.H. conceived the research design, approach and methods, conducted all experiments and analysis, and drafted the manuscript; R.S. conceived the research design, approach and methods, supervised the activities of M.L.H., and edited the manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported in part by a National Institutes of Health Director's Pioneer Award (DP1OD006779), a National Institutes of Health Clinical and Translational Sciences Award (UL1TR001412), a NCATS ASPIRE Design Challenge Award, and startup funds from the Department of Biomedical Informatics at the University at Buffalo.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Publicly available datasets were analyzed in this study. These data can be found at https://github.com/ram-compbio/CANDO and http://compbio.org/data/ (both accessed on 27 April 2021).

**Acknowledgments:** Additional support provided by the Center for Computational Research at the University at Buffalo. The authors thank James Schuler, Will Mangione, Zackary Falls, Liana Bruggemann, and Manoj Mammen for their valuable input during the development of this manuscript.

**Conflicts of Interest:** The authors declare no conflicts of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:

| Abbreviation | Definition |
|---|---|
| CANDO | Computational Analysis of Novel Drug Opportunities |
| FDA | Food and Drug Administration |
| NMR | Nuclear Magnetic Resonance |

#### **References**


*Article*

## **The Performance of Gene Expression Signature-Guided Drug–Disease Association in Different Categories of Drugs and Diseases**

**Xiguang Qi <sup>1</sup> , Mingzhe Shen <sup>1</sup> , Peihao Fan <sup>1</sup> , Xiaojiang Guo <sup>1</sup> , Tianqi Wang <sup>2</sup> , Ning Feng <sup>3</sup> , Manling Zhang <sup>3</sup> , Robert A. Sweet 4,5,\*, Levent Kirisci 1,\* and Lirong Wang 1,\***


Academic Editors: Marco Tutone and Anna Maria Almerico Received: 21 April 2020; Accepted: 5 June 2020; Published: 16 June 2020

**Abstract:** A gene expression signature (GES) is a group of genes that shows a unique expression profile as a result of perturbations by drugs, genetic modification or diseases on the transcriptional machinery. Comparisons between GES profiles have been used to investigate the relationships between drugs, their targets and diseases, with quite a few successful cases reported. In the study of GES-guided drug–disease associations in particular, researchers believe that if a GES induced by a drug is opposite to a GES induced by a disease, the drug may have potential as a treatment of that disease. In this study, we data-mined the crowd extracted expression of differential signatures (CREEDS) database to evaluate the similarity between GES profiles from drugs and their indicated diseases. Our study aims to explore the application domains of GES-guided drug–disease associations through the analysis of the similarity of GES profiles on known pairs of drug–disease associations, thereby identifying subgroups of drugs/diseases that are suitable for GES-guided drug repositioning approaches. Our results supported our hypothesis that the GES-guided drug–disease association method is better suited for some subgroups or pathways such as drugs and diseases associated with the immune system, diseases of the nervous system, non-chemotherapy drugs or the mTOR signaling pathway.

**Keywords:** gene expression signature; drug repositioning approaches; RNA expression regulation

#### **1. Introduction**

A gene expression signature (GES) is a set of comprehensive gene expression profiles that can reveal the difference between stimulated and normal cell states [1]. Current applications of GES analysis are fruitful in cancer-related areas for disease genotype classification and outcome predictions. For example, Ramaswamy, S. et al. created a GES database for diagnosing and categorizing the tumour type with an accuracy rate of 78% [2]. Wright, G. et al. developed a Bayesian rule-based algorithm to classify diffuse large B cell lymphoma into two subgroups which have a significant difference in the five-year survival rate [3]. Although the GES method is more commonly used in diagnosing cancer [2–5] and predicting the outcome of certain medical interventions [6,7], some successful applications in drug development have also been reported [8–10].

Generally, there are two major strategies for applying GES analysis to drug development: drug–drug-based and drug–disease-based. The drug–drug-based method determines the mechanistic actions of drugs by comparing the GES induced by a drug of interest with those of drugs with known mechanisms. If two different drugs have similar GES profiles, then they are considered to have "functional similarity", meaning they work in a similar manner. In contrast, the drug–disease-based method compares the GES of a drug with that of a disease in order to determine its potential as a new therapeutic agent. If the GES profile of the drug is opposite to the expression pattern of the disease, then the drug is considered to have a therapeutic effect on the disease. However, if they have similar patterns, then the drug may exacerbate the disease. Studies aimed at drug repurposing or repositioning based on GES analysis usually use one or both of these strategies [8–10]. In addition, some studies have tried to combine the GES method with other methods, such as machine learning, to increase the accuracy of compound indication prediction [11]. However, because such GES-guided drug repurposing studies usually report only the successfully predicted cases, the true accuracy of these methods still needs to be assessed.

Due to the different and complex mechanisms of disease processes, the idea of an "inverse pattern of a GES between drugs and diseases for therapeutic effect" may not hold, or at least may not be suitable for all categories of drugs and diseases. In other words, a GES may be useful for certain diseases, but not for others. To our knowledge, the application domains of GES-guided drug–disease associations have not been reported. Herein, we conducted a study to validate the power of the GES-guided drug repositioning method and to further explore which specific subgroups of drug–disease pairs are more suitable for this method. Moreover, the most significant subgroup was selected as a case report detailing which genes and/or pathways were more sensitive to the GES-guided drug repositioning method.

#### **2. Results**

#### *2.1. GES Profiles Enrollment and Drug–Disease Pairs*

After removing signatures from non-human assays and signatures of non-FDA (the U.S. Food and Drug Administration)-approved drugs, we found that GSE10432, GSE7036, GSE6264, GSE38713, GSE31773, GSE11393, GSE8157, GSE13887 and GSE11223 were signatures of both drugs and diseases from the same assays. We kept their disease labels except the CREEDS (crowd extracted expression of differential signatures [12]) ID of dz:297 because this case had information mis-specified (wrong disease information with its original experiment). Two GES profiles from mouse (drug:3288 and dz:724) were mis-specified as human and were also excluded from analysis. The relationship between these Gene Expression Omnibus (GEO) series (GSE) and CREEDS IDs is shown in Table 1. The proportion of data that meets the inclusion criteria is shown in Figure 1.


**Table 1.** The Gene Expression Omnibus (GEO) series with crowd extracted expression of differential signatures (CREEDS) IDs excluded.

**Figure 1.** The proportion of data sourced from the crowd extracted expression of differential signatures (CREEDS) database. Numbers of gene signatures are shown in parentheses. "Drug and Disease Signatures Included in the Final Analysis": The proportion of drug or disease gene signatures enrolled in the final analysis. "Drug and Disease Signatures Extracted from Non-Human Assays": The proportion of drug or disease gene signatures extracted from non-human assays. "Signatures with Information Mis-Specified": The proportion of gene signatures with information errors. "Signatures from Same Assays but Labelled as Both": The proportion of gene signatures excluded because of both drug and disease sourcing from the same assay. "Drug and Disease Signatures Because of Indication Not Found": The proportion of gene signatures excluded because no FDA-labelled indication of a relationship was found for the drug or disease (including drugs not approved by FDA).

When the inclusion criteria were applied, and the signatures with no indication relationship were excluded, 230 manual disease signatures and 244 manual drug perturbation signatures from 71 unique diseases and 56 unique drugs, respectively, were enrolled in the final analysis. The average signed Jaccard indexes [12] (SJI) of 3976 unique drug–disease pairs were calculated. Among them, there were 167 pairs with a drug–disease indication from the drug labels. The remaining 3809 unique drug–disease pairs were used as the control group.

#### *2.2. Subgroups Distribution*

Among the 56 unique drugs analysed, 32 unique protein targets with 22 categories of Anatomical Therapeutic Chemical (ATC) classification were assigned. Thirteen drugs are classified as chemotherapy drugs, and 44 drugs are not (methotrexate is counted as both a chemotherapy and a non-chemotherapy drug because its main therapeutic target differs between diseases). For the transcription factor (TF) level, 12 drugs are labelled as "directly", 39 drugs are labelled as "not-directly" and 5 drugs are labelled as "non-Human" (see Section *4.4. Subgroup Classification* for the detailed meanings of labels). Further, the 71 diseases are divided into 11 ICD-11 (International Classification of Diseases 11th Revision) categories. In total, 70 subgroups belonging to five categories were assigned (Figure 2, detailed information in Table S1 and Table S2).

**Figure 2.** The subgroup proportions of the 167 unique indicated drug–disease pairs across different categories. (**a**) Disease classification. NEO: neoplasms, DMSCT: diseases of the musculoskeletal system or connective tissue, DS: diseases of the skin, CIPD: certain infectious or parasitic diseases, DIS: diseases of the immune system, ENMD: endocrine, nutritional or metabolic diseases, DBBO: diseases of the blood or blood-forming organs, DRS: diseases of the respiratory system, DNS: diseases of the nervous system, DDS: diseases of the digestive system, DCS: diseases of the circulatory system. (**b**) Drug target. GLUR: glucocorticoid receptor, DNAtopo: DNA/topoisomerase-human, TYRK: tyrosine kinase, DNAclak: DNA cross-linking/alkylation, CYC: cyclooxygenase, DNAlig: DNA/ligase, TOPOI: topoisomerase-non-human, INTR: interferon receptor, MICROT: microtubules, NUCS: nucleotide synthesis, TNF: tumor necrosis factor. (**c**) TF (transcription factor) level. "Directly": drugs with TFs as their main therapeutic targets. "Not-directly" indicates drugs with main therapeutic targets which are human DNA structures or human proteins but not TFs. "Non-Human" represents drugs interacting with non-human proteins or structures (for example, from viruses or bacteria) as main therapeutic targets. (**d**) Chemotherapy. "YES" or "NO" indicates whether or not the drug is a chemotherapy drug. (**e**) ATC classification. CORTI: corticosteroids for systemic use, plain, OAA: other antineoplastic agents, CYTOANTIB: cytotoxic antibiotics and related substances, ANTIME: antimetabolites, IMMSUP: immunosuppressants, NSAAP: anti-inflammatory and antirheumatic products, non-steroids, HAARA: hormone antagonists and related agents, QUINA: quinolone antibacterial, IMMSTI: immunostimulants, PAAAONP: plant alkaloids and other natural products, ALKA: alkylating agents.

#### *2.3. Overall Score of GES Similarity of Drug-Indicated Disease Pairs Against Random Drug–Disease Pairs*

We observed significantly lower SJI similarity scores for drug–disease indication pairs than for random drug–disease pairs (two-sided t-test [13], p-value = 0.02324). The average similarity score of indicated pairs is −0.00386 with a standard deviation of 0.01794 and that of random control pairs is −0.00072 with a standard deviation of 0.01750, indicating that the GES method can reflect the therapeutic effects of the drugs (the distributions of SJI in both the indication group and the control group are shown in Figure 3).
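As a rough consistency check, the reported means and standard deviations can be fed into a two-sample t-test computed from summary statistics. The sketch below assumes a pooled-variance (Student) test, group sizes of 167 indicated and 3809 control pairs as given in Section 2.1, and a normal approximation for the p-value, which is reasonable for samples of this size. The study does not state which t-test variant was used, so this is an illustration rather than a reproduction.

```python
import math


def pooled_t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample pooled-variance t statistic and approximate two-sided p-value."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t_stat = (mean1 - mean2) / std_err
    # Normal approximation to the t distribution (large degrees of freedom).
    p_value = math.erfc(abs(t_stat) / math.sqrt(2))
    return t_stat, p_value


# Indicated pairs versus random control pairs, using the reported summary values.
t_stat, p_value = pooled_t_from_summary(-0.00386, 0.01794, 167,
                                        -0.00072, 0.01750, 3809)
```

Under these assumptions the statistic is negative and the approximate p-value falls close to the reported 0.02324; small differences are expected because the summary values are rounded.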

#### *2.4. Subgroup Scores of GES Similarity of Drug-Indicated Disease Pairs Against Random Drug–Disease Pairs*

We compared drugs from five different categories of subgroups: (1) disease classifications; (2) drug target; (3) TF level; (4) chemotherapy; and (5) ATC classification. The results are shown in Figure 4, and detailed information is listed in Table S3 and Table S4. Subgroups with important or significant (false discovery rate (FDR) q-value lower than 0.05) results according to least squares mean partitions F tests of a generalized linear model (GLM) [14] are listed in Table 2.


**Table 2.** Subgroups of generalized linear model (GLM) least squares mean partitions F tests results.

Important subgroups or subgroups with false discovery rate (FDR) q-value lower than 0.05 from GLM least squares mean partitions F tests for signed Jaccard index differences between drug-indicated disease pairs and random drug–disease pairs. "———-" indicates that subgroups only have one unique drug–disease pair sample with no standard deviation.

**Figure 4.** The average signed Jaccard index score of unique indicated drug–disease pairs split by different categories of subgroups. \*\* indicates FDR Q < 0.01, \* indicates FDR Q < 0.05. (**a**) ATC classification. ADRI: adrenergics, inhalants, AAPS: anti-acne preparations for systemic use, EIBGLD: blood glucose-lowering drugs, excluding insulins, DAA: direct acting antivirals, ESTR: estrogens, INS: insulins and analogues, LMA: lipid modifying agents, plain, ODP: other dermatological preparations, TET: tetracyclines, VITAD: vitamins A and D, including combinations of the two. For CORTI, OAA, CYTOANTIB, ANTIME, IMMSUP, NSAAP, HAARA, QUINA, IMMSTI, PAAAONP and ALKA, see the Figure 2 legend. (**b**) Chemotherapy. "YES" or "NO" indicates whether or not the drug is a chemotherapy drug. (**c**) Disease classification. See Figure 2 for abbreviations. (**d**) Target. 16S: 16S ribosomal RNA, ACRT: aminoimidazole carboxamide ribonucleotide transformylase, AMPAPK: AMP-activated protein kinase, ADGR: androgen receptor, BETAR: beta adrenergic receptor, CD20: CD20 antigen, CYP: cytochromes P450, DAAD: delta-aminolevulinic acid dehydratase, DNMT: DNA/methyltransferase, DNApo: DNA/polymerase, ESR: estrogen receptor, HMG-CoAR: HMG-CoA reductase, I5MD: inosine-5'-monophosphate dehydrogenase, INSR: insulin receptor, mTOR: kinase mTOR, PPAR: peroxisome proliferator-activated receptors, PSB: proteasome subunit beta, RAR: retinoic acid receptor, B-raf: serine/threonine-protein kinase B-raf, THYS: thymidylate synthase, D3: vitamin D3 receptor. For GLUR, DNAtopo, TYRK, DNAclak, CYC, DNAlig, TOPOI, INTR, MICROT, NUCS and TNF, see the Figure 2 legend. (**e**) TF (transcription factor) level. "Directly": drugs with TFs as their main therapeutic targets. "Not-directly" indicates drugs with main therapeutic targets which are human DNA structures or human proteins but not TFs. "Non-Human" represents drugs interacting with non-human proteins or structures (for example, from viruses or bacteria) as main therapeutic targets.

#### *2.5. Gene and Pathway Analysis on an Example Drug–Disease GES Pair*

Interferon receptor (with the same drug–disease pair content as the immunostimulants subgroup), the subgroup with the lowest q-value, was chosen as a case report for the pathway analysis. The top 5% (93/1898) of genes with a relatively reversed expression probability, ranked by their relative expression probability scores (*GI-R%*, an indicator of the relative difference in the probability of gene expression between the indicated group and the random control group; see Section 4.5 below), are shown in Table 3. The top 10 significant biological pathways identified by Ingenuity Pathway Analysis are shown in Table 4.


**Table 3.** Top 5% of genes by relative expression probability (*GI-R%*).


**Table 4.** Top 10 significant biological pathways according to genes with high relative expression probability.

These 10 pathways are reported to be involved with interferon regulation [15–27] within inflammatory and immune responses (see Table 5).


**Table 5.** Top 10 pathways and their function labels.

#### **3. Discussion**

It is well recognized that genes with similar gene expression patterns have a similar function [34]. From the overall score, we can see that FDA-approved drugs listed in the CREEDS database and their indicated diseases generally have inverse GES patterns compared with the random controls. However, the absolute difference between the indicated group and random control group is small. For example, in a recent study [35], a significant relationship was found between drug–disease GES similarity and drug therapeutic effect using Cmap [36], with a relatively low overall area under curve (AUC) of 0.57, indicating a real, albeit weak, inverse relationship. The treatment effectors of the drugs identified in this study likely work via the interaction of the genes' protein products, with only a moderate correlation between gene expression and levels of the corresponding protein(s) [37]. Thus, an association study between drugs/diseases and gene expression/pharmaceutical effect is necessary. Other mechanisms, for instance microRNA-based therapeutics, might also directly orchestrate the activation/deactivation of gene expression. However, due to the limitation of available sources, we were unable to investigate other mechanisms of action. In addition, the drugs' TF levels were not a significant factor reflecting the indication relationship (although drugs directly interacting with TFs perform slightly better, with a q-value of 0.22309 vs. a q-value of 0.99509). In our analyses, some subgroups of drug–disease pairs with indication associations have positive similarity scores (which means that the drug may exacerbate the disease according to the assumption of gene expression signature similarity) or a score higher than random drug–disease pairs, but these findings were not statistically significant. On the other hand, 7 of 70 subgroups had a significantly lower similarity score when a drug–disease association is indicated.

This study may provide some hints for future studies that compare drug–disease GES similarity for drug repositioning. That is, certain types of drugs may have a stronger ability to reverse the GES of the diseases they treat, and the disease type may also influence this ability. As such, in specific kinds of subgroups, the drug–disease pairs with higher similarities of reversed GES patterns may have greater therapeutic relationships, which means that focusing on certain kinds of diseases or drugs can increase the true positive rate of the GES-guided drug repositioning method. For example, over half (4/7) of the significant subgroups (immunostimulants, interferon receptor, other dermatological preparations, and diseases of the blood or blood-forming organs) are related to diseases associated with the immune system (the disease included in "other dermatological preparations" is atopic dermatitis). This indicates that drugs in drug–disease pairs associated with the immune system tend to have lower similarity scores with the diseases they are indicated for than with random diseases. This means that in a GES-guided drug repositioning analysis, an immune-associated drug is more likely to have a potential therapeutic effect on diseases that have a higher inverse similarity with it.

Chemotherapy drugs may not be as well suited as non-chemotherapy drugs to the GES-guided drug repositioning method (q-values: 0.99509 vs. 0.03937). Unarguably, the high diversity of chemotherapy responses in heterogeneous tumor tissues or even histologically similar tumors has been a challenging problem for a long time [37,38]. The failure to control the process of programmed cell death in tissue, one of the major causes of tumors, can be rectified or even overturned by activating/deactivating different pathways under various conditions [39]. This may be the reason that chemotherapy drugs are not well suited to the GES-guided repositioning approach. On the other hand, non-chemotherapy drugs show a significant result as they interact with cancer cells through more specific mechanisms, such as hormone regulation or mono-target therapy.

For the biological pathway analysis of the interferon receptor subgroup, we found that the genes involved in pathways directly regulated by drugs have the lowest GI-R% scores. It is reasonable that GES-guided drug repositioning methods are more sensitive to drugs directly targeting pathways related to diseases. Furthermore, the significance of mTOR signaling is in accordance with the result in which the kinase mTOR subgroup had a significant difference in SJI between indicated and random drug–disease pairs. This result corroborates, from a second angle, the high sensitivity of the GES-guided drug repositioning method to this pathway.

There are some limitations in this study. First, the tissues used for testing the drug effects may not match the body parts/organs affected by the diseases. Second, some bias may be caused by the limited size of the CREEDS bio-assay collection, which may not fully represent the patterns of all kinds of drugs and diseases. Additionally, it is important to differentiate the types of "treatment effect". Some drugs may cure a particular disease while others may just provide symptomatic relief, thereby resulting in different patterns of GES for the same disease. Also, some indicated subgroups ("kinase mTOR" and "other dermatological preparations") have too few unique drug–disease pairs (n = 1), which may weaken the power of the analyses.

In this study, we systematically analyzed the similarity of gene expression profiles from known drug–disease associations, and we found that indicated pairs have a greater inverse similarity score. We found seven subgroups in which their drugs or diseases may have a greater reversed GES pattern when there is a clear therapeutic effect. These findings suggest that a GES-guided drug repositioning method should be used based on the drug or disease type differences. For example, drugs or diseases associated with the immune system, diseases of the nervous system or non-chemotherapy drugs may be a better choice for drug repositioning. Moreover, our biological pathway enrichment analysis showed that some pathways may be more sensitive to this method, such as the mTOR signaling pathway.

#### **4. Materials and Methods**

#### *4.1. Gene Signature Data Collection and Filtering*

In this study, all gene signature information was collected from a well-calibrated GES repository, the crowd extracted expression of differential signatures (CREEDS) [12] database. The CREEDS database is maintained by the Ma'ayan Lab of Icahn School of Medicine at Mount Sinai. CREEDS utilized GEO2Enrichr [40] to extract GES profiles from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) and applied a characteristic direction (CD) model [41] to identify differentially expressed genes. This database V1.0 includes 10,797 single-gene perturbations, 2258 disease signatures and 5516 drug perturbation gene signatures. Among these signatures, 2176 manual single-gene perturbations, 828 manual disease signatures and 875 manual drug perturbation signatures were considered to be more accurate compared with the automatically generated GES by the machine learning method. The CREEDS database allows users to compare the similarity between the user-specified GES and the GESs processed and stored in the CREEDS.

We first selected the CREEDS manual GES profiles if the assays were from human tissues and/or human cell lines and if the drugs had FDA approval.

Each GES profile includes a list of up- and down-regulated genes. The SJI [12] (see below), a measure of the similarity between two GES profiles from a paired drug and disease, was calculated. When a drug or a disease had multiple GES profiles, we calculated the SJIs of all possible combinations, and the overall score for each unique drug–disease pair was taken as the average of the scores from all profile pairs sharing the same drug–disease combination. All the disease signatures and drug perturbation signatures were requested through the application program interface (API) provided by CREEDS. GES profiles were removed if they were labelled as both a drug treatment and a disease, because this may bias the similarity. Under the criteria that (a) the GES profiles must come from assays of human cells/tissues, and (b) drugs must be approved by the FDA, the remaining signatures were paired between drugs and diseases according to the indication associations. Signatures without any indicated drug–disease relationship were also excluded from further analysis. For example, cocaine was removed because its indication, local anesthesia, was not in the data of disease signatures and could not be paired. The overall data process is shown in Figure 5.
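The averaging step for a drug or disease with multiple GES profiles can be illustrated with a minimal Python sketch. The function name `average_pair_score` and the `score_fn` argument are hypothetical and stand in for whichever similarity score is applied to a single pair of profiles; this is not the code used in the study.

```python
from itertools import product
from statistics import mean


def average_pair_score(drug_profiles, disease_profiles, score_fn):
    """Average a similarity score over all drug-profile x disease-profile combinations.

    drug_profiles / disease_profiles: lists of GES profiles for one drug and one disease.
    score_fn: hypothetical callable returning a similarity score for one profile pair.
    """
    scores = [score_fn(d, s) for d, s in product(drug_profiles, disease_profiles)]
    if not scores:
        raise ValueError("both profile lists must be non-empty")
    return mean(scores)
```

A drug with two profiles paired with a disease with three profiles therefore contributes six individual scores to a single averaged value for that drug–disease pair.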

**Figure 5.** The flow chart of drug and disease gene signature data inclusion process. Numbers of gene signatures left in each step are shown in parentheses: (Number of drug signatures/Number of disease signatures) 1.1. and 1.2. All manual gene signatures retrieved from the CREEDS database. 2. Remove all signatures with assays not labelled as human. 3. Remove all drug signatures not from FDA-approved drugs. 4. Remove signatures with information errors or signatures labelled as both for a drug treatment and for a disease. 5. Remaining drug signatures were paired with each disease signature. 6. Remove signatures with no FDA-labelled indication relationships of drug or disease. 7. Indicated group and control group were divided according to the indication relationship from the FDA drug label. 8. Calculate the signed Jaccard index for each remaining drug–disease pair.

#### *4.2. Similarity Calculation*

In our analysis, SJI, which is based on the Jaccard similarity coefficient [42], was used to compute the similarity between GES profiles from a drug and a disease. The Jaccard similarity coefficient is a statistic used to gauge the similarity between different sample sets. It is defined as the size of the intersection divided by the size of the union of two sample sets. It is calculated as follows:

$$\text{Jaccard Similarity Coefficient} (\text{G}\_1, \text{G}\_2) = \frac{\text{SAME}}{\text{ALL}}$$

where G<sub>1</sub> and G<sub>2</sub> stand for two lists of differentially expressed gene sets, "SAME" represents the number of genes shared by the two given gene sets, and "ALL" stands for all the unique genes that appear in the two gene sets.

SJI, which combines the Jaccard similarity coefficient with the gene regulation direction is calculated as follows:

$$\text{Signed Jaccard}(\text{G}\_1, \text{G}\_2) = \frac{\text{J}\left(\text{G}\_1^{\text{up}}, \text{G}\_2^{\text{up}}\right) + \text{J}\left(\text{G}\_1^{\text{down}}, \text{G}\_2^{\text{down}}\right) - \text{J}\left(\text{G}\_1^{\text{up}}, \text{G}\_2^{\text{down}}\right) - \text{J}\left(\text{G}\_1^{\text{down}}, \text{G}\_2^{\text{up}}\right)}{2}$$

where J means the Jaccard similarity coefficient, and G<sup>up</sup> and G<sup>down</sup> are the up- and down-regulated genes in the given gene set G, respectively. The index ranges from +1 to −1, where +1 and −1 indicate the same pattern and an inverse pattern of two gene sets, respectively. Zero indicates that the two sets have no association, or that the shared part is cancelled out by the inverse part. An unranked score calculation method (SJI) was used to remain consistent with the scoring method used in the CREEDS database. The CREEDS API (application programming interface) offers a function to calculate the SJI automatically. However, we found that the API could not calculate the SJI correctly when two GES profiles overlapped heavily; therefore, all the SJIs in this study were re-calculated.
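The two definitions above translate directly into a short Python sketch. The function names and the gene labels in the usage note are illustrative assumptions, and the handling of two empty sets (returning 0) is a choice made here rather than something specified in the text.

```python
from typing import Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        # Assumption: two empty gene sets are treated as having no similarity.
        return 0.0
    return len(a & b) / len(union)


def signed_jaccard(up1: Set[str], down1: Set[str],
                   up2: Set[str], down2: Set[str]) -> float:
    """Signed Jaccard index of two signatures, each split into up/down gene sets."""
    same = jaccard(up1, up2) + jaccard(down1, down2)
    inverse = jaccard(up1, down2) + jaccard(down1, up2)
    return (same - inverse) / 2
```

With the hypothetical signature up = {A, B}, down = {C}, comparing the signature with itself gives +1, and comparing it with its mirror image (up = {C}, down = {A, B}) gives −1, matching the stated range of the index.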

#### *4.3. Drug-Related Information Collection*

In our analysis, the source of drug-related information is listed as follows:

1. Drug target information was collected from DrugBank [43,44] Release Version 5.1.4 [45] (https://www.drugbank.ca/releases/latest#external-links). Only the targets with the main therapeutic effect in the mechanism of action section were included;

2. The human TF list was collected from the paper published by Samuel A. Lambert et al. [46];

3. ATC classifications on level 3 were collected from the WHO official website (https://www.whocc.no/atc\_ddd\_index/);

4. The drug indication was from section "indications and usage" of FDA label on FDA website (https://labels.fda.gov/);

5. (Drug-indicated) Disease classification was assigned to each disease based on the International Classification of Diseases 11th Revision (ICD-11), level 1.

#### *4.4. Subgroup Classification*

In our analysis, we assessed the following factors that might influence the power of the GES-guided drug repositioning method:


#### *4.5. Statistical Analysis and Pathway Analysis*

The random control group was generated by calculating the average SJI of all possible drug–disease pairs without indicated associations to imitate a GES-guided drug repositioning screening. A t-test [13] was applied to quantify the mean differences of the SJI between drug-indicated disease pairs and random controls.

For subgroup analysis, GLM [14] least squares mean partitions F tests function was applied to estimate the mean difference between the indicated and control group since the data was unbalanced with multiple factors. A significant result of a certain subgroup indicated that the average SJI of this subgroup was significant between two indication levels (Yes/No). False discovery rate (FDR) q-value of the Benjamini–Hochberg procedure [47] was controlled to 0.05 to avoid an inflated experiment-wise type I error rate caused by multiple comparisons among all subgroups.
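The Benjamini–Hochberg adjustment can be sketched in a few lines of Python. This is an illustrative implementation of the standard procedure, not the R or SAS code used in the study, and the function name and the p-values in the usage note are hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted q-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values
```

For the hypothetical subgroup p-values 0.01, 0.04, 0.03 and 0.20, the adjusted q-values are approximately 0.040, 0.053, 0.053 and 0.200, so only the first subgroup would remain significant at a q-value threshold of 0.05.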

Data processing and statistical analysis (Student's t-tests, GLM, FDR calculation) were conducted using R studio 3.6.1 [48] and SAS software version 9.4. Copyright © 2019 SAS Institute Inc. Cary, NC, USA.

Differentially reversed expression genes (top 5% negative score according to the relatively reverse percentage) from the most significant subgroup were chosen as examples to conduct biological pathway enrichment analysis.

The relatively reverse percentage is calculated as

$$\text{Relative expression probability of a gene}\left(G^{I-R\%}\right) = D^{I\%} - D^{R\%}$$

where *D<sup>I%</sup>* and *D<sup>R%</sup>* stand for the percentage of the gene which is differentially expressed in all assays of indicated/random drug–disease pairs. It is calculated as

$$D\% = \frac{\text{NS} - \text{NR}}{\text{Total assays pairs}}$$

where **NS** and **NR** represent the number of times a gene showed a same or reverse regulation direction between assays of drugs and diseases among all drug–disease assays pairs.

The *G<sup>I-R%</sup>* ranges from 100% to −100%. A higher positive score indicates that this gene is more likely to be expressed in the same direction in indicated drug–disease assays compared with random drug–disease assays. Likewise, a lower negative score indicates that this gene has a higher probability of being expressed in the reverse direction in indicated drug–disease assays compared with random drug–disease assays.
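The calculation of the relatively reverse percentage can be made concrete with a small Python sketch. The function names and the counts in the usage note are hypothetical, chosen only to illustrate the arithmetic of the two formulas above.

```python
def direction_percentage(n_same: int, n_reverse: int, total_pairs: int) -> float:
    """D% = (NS - NR) / total assay pairs, expressed as a percentage."""
    if total_pairs <= 0:
        raise ValueError("total_pairs must be positive")
    return 100.0 * (n_same - n_reverse) / total_pairs


def relative_reverse_percentage(indicated, random_pairs) -> float:
    """Difference between D% in indicated pairs and D% in random pairs.

    Each argument is a (NS, NR, total assay pairs) tuple for one gene.
    """
    return direction_percentage(*indicated) - direction_percentage(*random_pairs)
```

For a hypothetical gene that is regulated in the same direction in 2 and in the reverse direction in 12 of 20 indicated assay pairs, and in the same direction in 5 and the reverse direction in 5 of 50 random assay pairs, the indicated percentage is −50%, the random percentage is 0%, and the resulting score is −50%.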

Biological pathway enrichment analysis was conducted by ingenuity pathway analysis (IPA, QIAGEN Inc., https://www.qiagenbioinformatics.com/products/ingenuitypathway-analysis).

**Supplementary Materials:** The following are available online at http://www.mdpi.com/1420-3049/25/12/2776/s1, Table S1: 70 subgroups with four drug categories, Table S2: 70 subgroups with disease category, Table S3: Indicated drug–disease pair results, Table S4: Random drug–disease pair results.

**Author Contributions:** Conceptualization, X.Q., M.S. and L.W.; Data Curation, X.Q. and M.S.; Formal Analysis, X.Q. and P.F.; Funding Acquisition, R.A.S. and L.W.; Investigation, X.Q.; Methodology, X.Q. and L.K.; Project Administration, L.W.; Resources, L.W.; Software, X.Q.; Supervision, L.W.; Validation, X.G. and T.W.; Visualization, X.Q; Writing—original draft, X.Q.; Writing—review and editing, N.F., M.Z., R.A.S. and L.W. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Institutes of Health [grants AG027224, MH116046, AG005133, PDA035778A]. The APC was funded by the National Institutes of Health [grant MH116046].

**Acknowledgments:** The authors would like to acknowledge Zefei Li at School of Pharmacy in Sun Yat-sen University for revising the calculation process.

**Conflicts of Interest:** Dr. Sweet's work has been funded by NIH [grants AG027224, MH116046, AG005133]. Dr. Wang's work has been funded by NIH [grant MH116046]. Other authors declare no competing interests.

#### **References**

1. Alizadeh, A.A.; Eisen, M.B.; Davis, R.E.; Ma, C.; Lossos, I.S.; Rosenwald, A.; Boldrick, J.C.; Sabet, H.; Tran, T.; Yu, X.; et al. Distinct Types of Diffuse Large B-Cell Lymphoma Identified by Gene Expression Profiling. *Nature* **2000**, *403*, 503–511. [CrossRef]


**Sample Availability:** Samples of the compounds ...... are available from the authors.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Is PF-00835231 a Pan-SARS-CoV-2 Mpro Inhibitor? A Comparative Study**

**Mohammad Hassan Baig <sup>1</sup> , Tanuj Sharma <sup>1</sup> , Irfan Ahmad <sup>2</sup> , Mohammed Abohashrh <sup>3</sup> , Mohammad Mahtab Alam <sup>3</sup> and Jae-June Dong 1,\***


**Abstract:** The COVID-19 outbreak continues to spread worldwide at a rapid rate. Currently, the absence of any effective antiviral treatment is the major concern for the global population. Reports of various point mutations within the important therapeutic target proteins of SARS-CoV-2 have aggravated the problem. The SARS-CoV-2 main protease (Mpro) is a major therapeutic target for new antiviral designs. In this study, the efficacy of PF-00835231 (an Mpro inhibitor under clinical trials) was investigated against the Mpro and its reported mutants. Various in silico approaches were used to investigate and compare the efficacy of PF-00835231 and five drugs previously documented to inhibit the Mpro. Our study shows that PF-00835231 is not only effective against the wild type but demonstrates a high affinity against the studied mutants as well.

**Keywords:** SARS-CoV-2; main protease; mutants; inhibitors; PF-00835231

#### **1. Introduction**

SARS-CoV-2 is the etiological agent of COVID-19, a pandemic responsible for claiming over a million human lives [1]. More than a thousand drugs are currently in the COVID-19 treatment pipeline; most are in the discovery stage and many are existing treatments for other conditions that are being evaluated for SARS-CoV-2 [2]. The latest statistics available from ClinicalTrials.gov, a directory of clinical trials funded by the US National Library of Medicine, reveal that 4371 studies are registered worldwide and that the number is growing day by day. Twenty-five trials, including those using the most common drug, hydroxychloroquine, were discontinued [3]. To date only two studies are found in the category of 'Approved for marketing': expanded access to convalescent plasma and the drug molecule remdesivir [4,5]. Given the novelty of SARS-CoV-2 infections and the lack of appropriate drugs, a wide range of techniques and methods are being used to tackle the emerging worldwide COVID-19 pandemic. To date, the therapies suggested are mainly existing drugs that have been repurposed on the basis of the similarity of their initial indication, such as antivirals/antiretrovirals, or the similarity of their mode of action [6,7].

Pp1a and pp1ab polyproteins are formed by SARS-CoV-2 and are processed by two virally encoded cysteine proteases: the main protease (Mpro) and the papain-like protease [8]. The action of the Mpro is vital to viral replication, as it proteolytically processes the pp1a/pp1ab polyproteins at more than 10 junctions to produce a set of non-structural proteins (NSPs) essential for virus replication and transcription, including RdRp, helicase and the Mpro itself [9]. The Mpro is the most explored coronavirus target for drug development because the SARS-CoV-2 enzyme has almost the same mechanism and active site as its counterparts in MERS-CoV (Middle East Respiratory Syndrome Coronavirus) and SARS-CoV [10,11].

**Citation:** Baig, M.H.; Sharma, T.; Ahmad, I.; Abohashrh, M.; Alam, M.M.; Dong, J.-J. Is PF-00835231 a Pan-SARS-CoV-2 Mpro Inhibitor? A Comparative Study. *Molecules* **2021**, *26*, 1678. https://doi.org/10.3390/molecules26061678

Academic Editors: Marco Tutone and Anna Maria Almerico

Received: 3 February 2021 Accepted: 4 March 2021 Published: 17 March 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Despite the relatively low mutation rate of the virus, studies have revealed more than 12,000 SARS-CoV-2 genome mutations. Many mutations do not affect the capacity of the virus to transmit or cause illness because they do not alter the structure of a protein, and mutations that do change proteins are more likely to damage the virus than to strengthen it [12]. A few vaccines have recently been approved and more than 50 are in different phases of trials. Mutations in the structural proteins targeted by the host immune system can impede vaccine efficacy, and mutations in non-structural proteins can give rise to strains that are resistant to antivirals. Strains of the virus that are more transmissible than the wild type SARS-CoV-2 have recently been identified in South Africa. In the United Kingdom and several other regions, including Europe and Brazil, the extensive spread of coronavirus variants has placed the world on alert and sparked new lockdowns [13,14]. SARS-CoV-2 picks up minor changes to its genetic code every time it moves from person to person, and researchers are beginning to find patterns in how the virus mutates. A major research investment is being made towards the development of new therapeutics or the repurposing of old drugs as weapons against COVID-19.

Pfizer has started clinical trials (phase I) with a small molecule, PF-07304814, that targets the Mpro of SARS-CoV-2 [15]. It may prove to be the first antiviral drug to target this protein (Mpro) to combat COVID-19. PF-07304814 contains a phosphate group that renders the compound soluble and is cleaved by alkaline phosphatase enzymes in the tissue to release the active antiviral PF-00835231 [15].

In this study, we evaluate the binding efficacy of six known inhibitors of the Mpro (Figure 1). The binding efficacy of these inhibitors was measured against the WT and the mutant Mpro (Figure 2). Five drugs (bedaquiline, boceprevir, efonidipine, manidipine and lercanidipine) have been reported earlier to inhibit Mpro activity with IC50 values below 40 µM [11]. The sixth inhibitor, PF-00835231, is a powerful inhibitor of the SARS-CoV-2 Mpro with sufficient medicinal properties to merit further research as an intravenous COVID-19 therapy [16]. PF-00835231 was first discovered by Pfizer chemists during the 2002–2003 SARS epidemic to target the SARS-CoV Mpro [16]. Infections petered out, however, and the compound was put on hold along with a collection of other possible coronavirus antivirals. In addition to demonstrating action against two strains of SARS-CoV-2, PF-00835231 was able to kill other coronaviruses in cells as well [15]. PF-00835231 is the active form of PF-07304814, which is currently being tested (phase 1) in patients with a SARS-CoV-2 infection and mild to moderate symptoms [3,15].

This study involves a state-of-the-art computational evaluation to assess the comparative efficacy of PF-00835231 and other reported inhibitors against the wild type and four reported Mpro mutants. We hypothesize here that PF-00835231 might be less competitive against various SARS-CoV-2 virus mutants, although experimental and clinical studies have yet to offer clearer results.

**Figure 1.** Structure of all of the compounds investigated in this study. (**a**) PF-00835231; (**b**) Boceprevir; (**c**) Manidipine; (**d**) Efonidipine; (**e**) Lercanidipine; (**f**) Bedaquiline.

**Figure 2.** The selected mutants investigated in this study.

#### **2. Results and Discussion**

In this study, a comparative analysis of the efficacy of PF-00835231 and five drugs previously documented to inhibit the Mpro (bedaquiline, boceprevir, efonidipine, manidipine and lercanidipine) was performed with the wild type and four reported Mpro mutants (Mutant 1 (Y54C), Mutant 2 (N142S), Mutant 3 (T190I) and Mutant 4 (A191V)).

In all of the modeled structures, no residues lay in the disallowed region of the Ramachandran plot, confirming the quality of the structures (Supplementary Figure S1). An ERRAT analysis of all of the structures was also performed [17] and showed that the overall structural quality of the modeled structures was very good. A VERIFY\_3D [18,19] analysis found that in all of the modelled structures more than 94% of the residues had an average 3D–1D score > 0.2, indicating good compatibility between the primary sequence and the tertiary structure. Before conducting the molecular docking experiments, the docking protocol was validated. Different crystal structures of the inhibitor bound SARS-CoV-2 Mpro were retrieved and the inhibitors were redocked. The binding orientation of the redocked poses was found to be similar to the crystal conformation of the inhibitor (Supplementary Figure S2 and Table S1). Most of the redocked poses of the small molecule inhibitors had a root mean square deviation (RMSD) of less than 1 Å relative to their crystal counterparts (Supplementary Figure S2 and Table S1).
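The RMSD criterion used to compare a redocked pose with its crystal pose can be illustrated with a minimal Python sketch. It assumes that both poses list the same atoms in the same order and share the receptor's coordinate frame, so no superposition is applied; the function name and the coordinates in the usage note are hypothetical, and this is not the tool used in the study.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(pose_a: Sequence[Coord], pose_b: Sequence[Coord]) -> float:
    """Root mean square deviation (in the units of the input, e.g. angstroms)."""
    if len(pose_a) != len(pose_b) or not pose_a:
        raise ValueError("poses must be non-empty and contain the same number of atoms")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(pose_a, pose_b)
    )
    return math.sqrt(squared / len(pose_a))
```

For example, a two-atom pose in which every atom is displaced by exactly 1 Å along one axis has an RMSD of 1 Å, which would sit at the threshold quoted above.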

In order to determine the predictive binding effectiveness of small molecules with receptors, a molecular docking evaluation is usually carried out [20]. Six compounds (PF-00835231, bedaquiline, boceprevir, efonidipine, manidipine and lercanidipine) were minimized and prepared for screening within the active site of the Mpro (WT) and modeled Mpro mutants. In terms of the PLP fitness score using GOLD tools, the binding efficacy score of all six selected compounds was calculated.

PF-00835231 was found to be the most effective against the WT Mpro (PLP fitness score 83.13; Table 1). This compound was found to be very effective against the selected mutants as well (Tables 2–5). Compared with the other selected compounds, PF-00835231 was found to be the most effective inhibitor against Y54C and A191V (Tables 2 and 5) (Figure 3) whereas efonidipine was found to be most effective against N142S and T190I (Tables 3 and 4) (Figure 4). Figures 3 and 4 show the binding of PF-00835231 and efonidipine against the Mpro and the selected mutants. The study also highlighted the important residues playing a crucial role in accommodating the selected compounds within the active site of the Mpro and the mutants. It was also found that a large number of active site residues of the Mpro (WT and mutant) were actively participating in the positioning of all of the molecules. Tables 1–5 give the details of the residues (amino acids) of the Mpro and mutants interacting with all of the selected molecules. L141, S144, H164, E166, Q189 and Q192 were found to be very prominently involved in making hydrogen bonds with all of the selected compounds (Tables 1–5) (Figures 3 and 4). The crucial role of these residues has been discussed earlier as well [21–23]. Other residues found to be playing an important role in the binding were T25, T26, L27, H41, M49, C145, M165, L167, P168, D187 and R188 (Tables 1–5) (Figures 3 and 4 and Supplementary Figures S3–S7).

**Figure 3.** Complex of PF-00835231 within the active site of (**a**) WT; (**b**) Y54C.

**Figure 4.** Complex of Efonidipine within the active site of (**a**) N142S; (**b**) T190I.


**Table 1.** The binding details of all compounds against the WT Mpro.

**Table 2.** The binding details of all compounds against the Mutant 1 (Y54C) Mpro.


**Table 3.** The binding details of all compounds against the Mutant 2 (N142S) Mpro.


**Table 4.** The binding details of all compounds against the Mutant 3 (T190I) Mpro.



**Table 5.** The binding details of all compounds against the Mutant 4 (A191V) Mpro.

Recent studies have shown that all of the compounds (bedaquiline, boceprevir, efonidipine, manidipine and lercanidipine) specified in this study carry the potential to inhibit the Mpro with IC50 values below 40 µM [11]. The binding affinity for PF-00835231 has been reported to be in the nanomolar range [15]. Our in-depth in silico analysis also found PF-00835231 to have a high affinity against the Mpro (WT) compared with the other selected compounds. We also hypothesized that these compounds, including PF-00835231, might prove to be effective against the mutants as well. Boceprevir, which is a protease inhibitor and was originally used to treat hepatitis C, has been well studied for its inhibitory potential against the Mpro [24–26]. This compound was found to carry a very low affinity against the T190I whilst being the most active against the WT. The high binding affinity of this compound against the Mpro (WT) has been reported in several studies [25]. This compound was found to be moderately effective against the WT and the selected mutants. Likewise, other compounds considered in this study showed a moderate affinity against all of the selected mutants. Manidipine [27], which is a calcium channel blocker and is an approved antihypertensive drug, was found to be very effective against the WT. Several studies have reported the potential of manidipine against SARS-CoV-2 [11,28,29]. This compound showed moderate activity against the other selected mutants. Lercanidipine [30], another calcium channel blocker and an approved antihypertensive drug, was also found to show moderate activity against all of the selected targets and was most active against the WT and N142S. Bedaquiline [31], another compound considered in this study, is an approved drug for the treatment of active tuberculosis. This compound was found to be moderately active against all of the selected proteins, with its highest binding affinity against N142S and its lowest against A191V.

Considering the high efficacy of PF-00835231 and efonidipine against all of the selected proteins, we further studied the structural dynamics of the WT and the Mpro mutants in complex with these two inhibitors (Figure 5). Root mean square deviation (RMSD) is an important parameter for exploring protein dynamics in terms of conformational changes within the protein structure. The backbone RMSD plot revealed that in the presence of PF-00835231, the structures of the WT, Y54C and T190I mutants were stable throughout the 100 ns simulation while the structures of N142S and A191V showed fluctuations after 50 ns (Figure 5a). The backbone of WT and N142S was found to be stable in the efonidipine bound structure (Figure 5b) while the other mutants, namely Y54C, A191V and T190I, showed fluctuations in the backbone. Overall, efonidipine caused fewer structural variations in the backbone of WT and N142S while PF-00835231 caused fewer structural variations in the WT, Y54C and T190I mutants. This suggested that the association of PF-00835231 within the active site of the Mpro and its mutants was comparatively more stable. Further, the ligand RMSD plot was also analyzed and it was observed that PF-00835231 was more stable with WT, Y54C and N142S throughout the simulation time period compared with efonidipine (Figure 5c,d). These analyses provide further support that the PF-00835231 bound complexes were very stable. The hydrogen bond analysis also showed that the PF-00835231 bound complexes made comparatively more hydrogen bonds than the efonidipine bound complexes (Figure 5e,f). This finding supports the view that the stability of PF-00835231 within the active site of the Mpro (WT) and its mutants may be due to the greater number of hydrogen bonds it forms. The overall outcome of this study showed that PF-00835231 could be a very potent inhibitor against the Mpro and its other reported mutants.

**Figure 5.** Molecular dynamics results of the PF-00835231 and efonidipine bound complexes of the WT Mpro (black), Y54C (yellow), N142S (green), T190I (blue) and A191V (red). The backbone RMSD of the Mpro (WT) and mutants in complex with (**a**) PF-00835231; (**b**) efonidipine. The ligand RMSD of (**c**) PF-00835231 and (**d**) efonidipine during the 100 ns. The intermolecular hydrogen bond formations of the (**e**) PF-00835231 and (**f**) efonidipine bound complexes.

#### **3. Materials and Methods**

#### *3.1. Protein Structure Preparation*

In this study, the crystal structure of the SARS-CoV-2 (COVID-19) Mpro in complex with inhibitor UAW248 was retrieved from the RCSB protein databank (pdb id: 6xbi) [32,33]. The crystal bound inhibitor and other heteroatoms were removed. The structures of the mutants were prepared by molecular modeling. The amino acid sequences of all of the selected mutants (Y54C, N142S, T190I and A191V) were retrieved from the NCBI protein database (GenBank: QJD23268.1, QJC19621.1, QJA16866.1 and QIZ14843.1) [34]. The mutant structures were modelled with MODELLER 9.23 [35] using 6xbi as the template, and all of the modeled structures were validated using various in silico tools [36,37]. The Ramachandran plot was computed for all of the modeled structures using the PROCHECK module of SAVES [38]. All of the structures were subjected to energy minimization using the steepest descent method for 1000 steps.

#### *3.2. Ligand Structure Preparation*

The 3D structure of bedaquiline (CID: 5388906), boceprevir (CID: 10324367), efonidipine (CID: 119171), lercanidipine (CID: 65866), manidipine (CID: 4008) and PF-00835231 (CID: 11561899) were retrieved from the PubChem structure database [39]. All of the compounds were energy minimized using the conjugate gradient method for 1000 steps in UCSF Chimera [40].

#### *3.3. Redocking: Co-Crystallized Ligand Pose Validation Study*

Several crystal structures of the inhibitor bound SARS-CoV-2 Mpro were retrieved from the RCSB protein databank. The inhibitors were separated and redocked within the structure of the Mpro using CCDC GOLD [41]. The docked conformation of the inhibitors within the active site of the Mpro was compared with the crystal orientations.

#### *3.4. Virtual Screening*

All of the selected compounds were subjected to docking within the active site of WT and selected Mpro mutants using GOLD 2.2 (CCDC, Cambridge, UK) [41]. The selection was made based on their PLP fitness score. The complexes were visualized using PyMol [42] and Discovery Studio Visualizer.

#### *3.5. Molecular Dynamics Simulation*

The selected complexes were subjected to a molecular dynamics simulation to investigate the stability of these molecules (in complex with WT and mutant Mpro). The molecular dynamics simulation was performed with GROMACS [43,44]. The complexes of PF-00835231 and efonidipine with the Mpro (WT) and mutants, prepared by molecular docking, were used as the starting point for the MD study. We used the GROMACS 2020.4 package with the Charmm36 force field to perform the MD simulation [45]. GROMACS is a widely used tool for performing MD simulation studies and its use in protein-ligand simulation has been reported in a large number of studies [46,47]. The parameter files for all the ligands were generated using SwissParam (https://www.swissparam.ch/, accessed on 15 January 2021), which is an online tool for generating parameters for the Charmm force field. The complexes were solvated within a dodecahedron box of explicit TIP3P water with a 0.1 nm margin between the box walls and solute. Na<sup>+</sup> or Cl<sup>−</sup> counterions were added to neutralize the system charge [48,49]. The particle mesh Ewald method (cutoff distance of 0.1 nm) was employed for calculating the long-range electrostatic interactions [50]. The Lennard-Jones 6–12 potential was used for evaluating the van der Waals interactions; for this calculation, the cutoff distance was set to 0.1 nm. The LINCS algorithm was used to constrain the bond lengths while setting the time step to 0.002 ps [51,52]. Energy minimization was then performed using the steepest descent method for 10,000 steps in order to remove the steric clashes between atoms. The whole system was then equilibrated for 1 ns. To maintain the system at 300 K and 1 atm, Berendsen weak coupling systems were utilized [53,54]. A Maxwell Boltzmann distribution was used for randomly generating the initial velocities. The final 100 ns production run was performed at 300 K in an NPT ensemble. Furthermore, xmgrace was used to generate graphs (http://plasmagate.weizmann.ac.il, accessed on 15 January 2021); PyMol and VMD were used for further graphical inspections and analysis.
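As a worked check of the simulation lengths, the number of integration steps implied by the stated 0.002 ps time step can be computed with a short Python sketch. The helper name `n_steps` is hypothetical; only the durations and the time step are taken from the text.

```python
def n_steps(duration_ns: float, timestep_ps: float) -> int:
    """Number of integration steps needed to cover duration_ns at timestep_ps."""
    if timestep_ps <= 0:
        raise ValueError("timestep_ps must be positive")
    # 1 ns = 1000 ps; round() guards against floating-point division error.
    return round(duration_ns * 1000.0 / timestep_ps)


equilibration_steps = n_steps(1, 0.002)    # 1 ns equilibration
production_steps = n_steps(100, 0.002)     # 100 ns production run
```

With a 0.002 ps time step, the 1 ns equilibration corresponds to 500,000 steps and the 100 ns production run to 50,000,000 steps.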

#### **4. Conclusions**

In conclusion, this study predicted that PF-00835231, which is already being tested to target the SARS-CoV-2 Mpro, may also be potent against the specified Mpro mutants. PF-00835231 and five other reported inhibitors were compared for inhibitory efficacy in terms of binding potency against the WT and the Mpro mutants. PF-00835231 was found to be the most efficient inhibitor of the Y54C and A191V Mpro mutants, with fitness scores of 73.17 and 73.61, respectively, relative to the other listed drugs. Although it is too early to draw firm conclusions, our results suggest that PF-00835231 may be effective against the mutant Mpro and could prove to be a sharp weapon in the fight against the COVID-19 pandemic.

**Supplementary Materials:** The following are available online: Figure S1: Validation of the modeled structures of mutants (a) Y54C (b) N142S (c) T190I (d) A191V, Figure S2: The superimposed structure of crystal pose (blue) of inhibitor and redocked pose of (a) X47 (green) (pdb id: 6wco), (b) X77 (yellow) (pdb id: 6w63), and (c) ADRAFINIL (red) (pdb id: 7ans) within the binding site of Mpro, Figure S3: Complex of (a) PF-00835231 (b) Boceprevir (c) Manidipine (d) Efonidipine (e) Lercanidipine (f) Bedaquiline within the active site of WT, Figure S4: Complex of all (a) PF-00835231 (b) Boceprevir (c) Manidipine (d) Efonidipine (e) Lercanidipine (f) Bedaquiline within the active site of Y54C, Figure S5: Complex of all (a) PF-00835231 (b) Boceprevir (c) Manidipine (d) Efonidipine (e) Lercanidipine (f) Bedaquiline within the active site of N142S, Figure S6: Complex of all (a) PF-00835231 (b) Boceprevir (c) Manidipine (d) Efonidipine (e) Lercanidipine (f) Bedaquiline within the active site of T190I, Table S1: The inhibitor bound crystal structure of SARS-CoV-2 Mpro considered for the validation of docking protocol.

**Author Contributions:** Conceptualization, J.-J.D., M.H.B. and I.A.; methodology, T.S., M.H.B., M.A.; software, J.-J.D., M.H.B.; writing—original draft, J.-J.D., M.H.B., I.A. and M.M.A., M.A.; formal analysis, J.-J.D., M.H.B.; investigation, T.S., I.A., M.M.A.; supervision, J.-J.D. All authors have read and agreed to the published version of the manuscript.

**Funding:** The authors are thankful to the Institute of Research and Consulting Studies at King Khalid University for supporting this research through grant number 26-14-S-2020 and the National Research Foundation of Korea, grant: [NRF-2018R1C1B6009531].

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Data is available within the article.

**Acknowledgments:** The authors are thankful to the National Research Foundation of Korea and the Institute of Research and Consulting Studies at King Khalid University for supporting this research.

**Conflicts of Interest:** The authors declare no conflict of interest.

**Sample Availability:** Samples of the compounds are not available from the authors.

#### **References**


### *Article* **Computational Determination of Potential Multiprotein Targeting Natural Compounds for Rational Drug Design Against SARS-COV-2**

**Ziyad Tariq Muhseen 1,2 , Alaa R. Hameed <sup>3</sup> , Halah M. H. Al-Hasani <sup>4</sup> , Sajjad Ahmad <sup>5</sup> and Guanglin Li 1,2,\***


**Abstract:** SARS-CoV-2 caused the current COVID-19 pandemic, and there is an urgent need to explore effective therapeutics that can inhibit enzymes that are imperative in virus reproduction. To this end, we computationally investigated the MPD3 phytochemical database along with a pool of reported natural antiviral compounds with potential to be used as anti-SARS-CoV-2 agents. The docking results identified glycyrrhizin, followed by azadirachtanin, mycophenolic acid, kushenol-w, and 6-azauridine, as potential candidates. Glycyrrhizin showed a very stable binding mode in the active pocket of the Mpro (binding energy, −8.7 kcal/mol), PLpro (binding energy, −7.9 kcal/mol), and Nucleocapsid (binding energy, −7.9 kcal/mol) enzymes. This compound bound several key residues that are critical to natural substrate binding and functionality in all the receptors. To test the docking prediction, the compound in complex with each receptor was subjected to molecular dynamics simulation to characterize the molecule's stability and decipher its possible mechanism of binding. Each simulation indicated that the receptor dynamics are stable (Mpro (mean RMSD, 0.93 Å), PLpro (mean RMSD, 0.96 Å), and Nucleocapsid (mean RMSD, 3.48 Å)). Moreover, binding free energy analyses such as MMGB/PBSA and WaterSwap were run over selected trajectory snapshots to affirm intermolecular affinity in the complexes. On rescoring, glycyrrhizin formed strong-affinity complexes with the virus enzymes, Mpro (MMGBSA, −24.42 kcal/mol and MMPBSA, −10.80 kcal/mol), PLpro (MMGBSA, −48.69 kcal/mol and MMPBSA, −38.17 kcal/mol), and Nucleocapsid (MMGBSA, −30.05 kcal/mol and MMPBSA, −25.95 kcal/mol), which were dominated mainly by van der Waals energy. Further affirmation came from the WaterSwap absolute binding free energy, which indicated that all the complexes are in good equilibrium and stability (Mpro (mean, −22.44 kcal/mol), PLpro (mean, −25.46 kcal/mol), and Nucleocapsid (mean, −23.30 kcal/mol)). These promising findings substantially advance our understanding of how natural compounds could be shaped to counter SARS-CoV-2 infection.

**Keywords:** SARS-CoV-2; COVID-19; multiprotein inhibiting natural compounds; virtual screening; MD simulation

#### **1. Introduction**

**Citation:** Muhseen, Z.T.; Hameed, A.R.; Al-Hasani, H.M.H.; Ahmad, S.; Li, G. Computational Determination of Potential Multiprotein Targeting Natural Compounds for Rational Drug Design Against SARS-COV-2. *Molecules* **2021**, *26*, 674. https://doi.org/10.3390/molecules26030674

Academic Editor: Marco Tutone Received: 6 December 2020 Accepted: 21 January 2021 Published: 28 January 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Coronaviruses (CoVs) cause infection of the upper respiratory tract in higher mammals and humans [1], and several outbreaks have been associated with CoVs in the recent past, first reported in 2002 as SARS, in 2012 as MERS, and in late 2019 as COVID-19 [2–5]. The recent COVID-19 pandemic is caused by a relatively new strain named SARS-CoV-2 [6–8]. The virus origin is thought to be zoonotic, with potential for person-to-person transmission, resulting in an exponential rise in the number of confirmed cases worldwide [9,10]. Through December 2020, more than 220 countries reported the virus, with more than 64 million individuals infected, and thousands are still getting infected each day. The virus has an estimated mortality rate of between 5% and 10% [11,12]. Additionally, due to mandatory lockdowns, isolation, and quarantines, millions of lives have been disrupted. The pandemic has also badly affected global health, society, and the economy, and these sectors are facing significant challenges [13]. Three vaccines (by Pfizer, Moderna, and AstraZeneca) are authorized by WHO for emergency use and are available to very limited populations. No specific anti-SARS-CoV-2 drugs are currently recommended for treatment, making the situation difficult to handle. Supportive therapeutics and preventative measures are being taken and are productive in managing the virus [14,15]. Various efforts to target proteins critical to SARS-CoV-2 pathogenesis, including the Spike receptor-binding domain (RBD) [16–18], main protease (Mpro) [19], Nucleocapsid N-terminal domain (NTD) [20], RNA-dependent RNA polymerase (RdRp) [21], papain-like protease (PLpro) [22], 2′-O-ribose methyltransferase [23], viral ion channel (E protein) [24], and angiotensin-converting enzyme 2 receptor (ACE2) [25], are under way. Targeting multiple pathogenesis-specific proteins within a close network of interaction or dependent functionality could yield effective drugs against SARS-CoV-2 [26].

The SARS-CoV-2 Spike protein is key to the host cell infection pathway, as it mediates ACE2 recognition, attachment, and fusion to the host cell [16]. The RBD of the S1 subunit of the Spike trimer binds specifically to the ACE2 receptor [27]. This RBD region is an attractive target for therapeutics, as it contains conserved residues that are essential in binding to ACE2 [27]. The Mpro of coronaviruses has been studied thoroughly for drug development purposes. These are papainlike proteases involved in processing replicase enzymes [28]. It has 11 cleavage sites in the 790 kD-long replicase 1ab polypeptide, demonstrating its prominent role in proteolytic processing [19,29]. High structural similarity and sequence identity are seen between the Mpro of SARS-CoV-2 and that of SARS-CoV. It comprises two catalytic domains: a chymotrypsin-like domain and a picornavirus 3C protease-like domain. Each contains a six-stranded antiparallel β-barrel, and they contain the active dyad H41 and C145 [30]. These proteases have emerged as essential drug targets, as they have a crucial role in replication. Furthermore, inhibitors of Mpro are found to be significantly less cytotoxic, as the protein shares little similarity with human proteases [31]. Preliminary studies have suggested that the HIV protease inhibitors lopinavir/ritonavir could potentially be used against SARS-CoV-2 [32]. Additionally, the HIV protease inhibitor Darunavir and the HCV protease inhibitor Danoprevir are under clinical studies and in vivo trials for the treatment of SARS-CoV-2 infection [33]. The PLpro enzyme is vital in processing the polypeptide to produce a functional replicase complex and aids in viral spreading [22]. PLpro also plays a role in evading host antiviral immune responses by cleaving post-translational proteinaceous modifications on host proteins [34]. Thus, targeting this enzyme is useful in highlighting therapeutic strategies that can suppress the virus infection and promote antiviral immunity. The N protein is significant in viral RNA replication and its packing into new virions, making this protein a good candidate for identifying newer drugs that are specific and biologically active [20].

In silico screening of drugs using different computer-aided drug design applications greatly accelerates the rational drug design process. Ultimately, this saves the time and extra cost that would otherwise go into experiments on leads that fail in the drug discovery process [35–40]. In this investigation, we performed a blind docking approach, followed by molecular dynamics (MD) simulation coupled with binding free energy techniques that dissect the structural dynamics and energy basis of molecular recognition [41,42]. The MPD3 phytochemical database [43], along with a pool of natural antiviral compounds, was used against multiple SARS-CoV-2 protein targets to understand their binding mechanism and to put forward a hypothesis on how to further optimize these structures to enhance selectivity and maximize anti-SARS-CoV-2 biological potency [44–47]. A schematic summary of the methodology used in this work is provided in Figure 1. The study results might have potential applications in designing new leads against SARS-CoV-2 that can target its multiple proteins, as depicted in this study.

**Figure 1.** Schematic presentation of the methodology used in this current study.

#### **2. Results and Discussion**

#### *2.1. Molecular Docking*

Molecular docking is a modeling approach that investigates how receptors and ligands fit together and how enzymes interact with ligands [48–52]. Docking calculations were performed in triplicate, and the compound conformations were ranked according to the binding energy in kcal/mol. We used remdesivir as the control in docking. Compounds that ranked consistently at the top for each receptor and showed a stronger binding score than remdesivir were selected for downstream analysis. A general overview of the binding energy of the compounds against the receptors used is presented in Figure 2. The top compound complex with each receptor was generated and first subjected to visual inspection to decipher atomic-level interactions and determine the binding conformation. The docking analysis identified glycyrrhizin, followed by azadirachtanin, mycophenolic acid, kushenol-w, and 6-azauridine, as the best binders among the ~5000 compounds used in this study. The 2D structures of these compounds are presented in Figure 3. Glycyrrhizin also showed stable interactions with the hotspot residues of the SARS-CoV-2 spike protein receptor binding domain (RBD) in our previous study [53]. The glycyrrhizin-docked complex of each SARS-CoV-2 protein is described separately below.
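The selection rule described above (triplicate docking, with hits required to outperform the remdesivir control consistently) can be sketched in a few lines of Python. This is only an illustrative reading of the text, not the authors' script: the function name `select_hits`, the strict "every replicate beats the control's best score" criterion, and all numeric values are assumptions.

```python
from statistics import mean


def select_hits(scores, control="remdesivir"):
    """Return compounds whose every replicate beats the control's best score.

    `scores` maps a compound name to its binding energies (kcal/mol),
    one per docking replicate; more negative means stronger binding.
    """
    control_best = min(scores[control])
    hits = {
        name: mean(runs)
        for name, runs in scores.items()
        if name != control and max(runs) < control_best
    }
    # Rank hits from strongest (most negative mean) to weakest.
    return sorted(hits, key=hits.get)


# Hypothetical values for illustration only, not results from this study.
example = {
    "remdesivir": [-6.0, -6.1, -5.9],
    "compound_A": [-8.5, -8.7, -8.6],
    "compound_B": [-6.0, -7.0, -5.5],
}
print(select_hits(example))  # ['compound_A']
```

A looser criterion, such as comparing mean scores only, would retain more compounds; the strict version is used here because the text asks for compounds that rank on top consistently.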

**Figure 2.** AutoDock binding affinity score of the compounds to the SARS-CoV-2 enzymes.

**Figure 3.** Two-dimensional presentation of high affinity binders to the SARS-CoV-2 proteins.

#### 2.1.1. Mpro–Glycyrrhizin Complex

The Mpro of SARS-CoV-2 is a crucial enzyme and an attractive drug target because of its central role in virus transcription and replication [54]. The docking study again reported glycyrrhizin as the best binder among the compounds used to the substrate-binding site of the Mpro (Figure 4). As seen in the binding with other receptors, the (2*S*,3*S*,4*S*,5*R*,6*R*)-6-(((2*S*,3*R*,4*S*,5*S*,6*S*)-6-carboxy-2,4,5-trihydroxytetrahydro-2*H*-pyran-3-yl)oxy)-3,4,5-trihydroxytetrahydro-2*H*-pyran-2-carboxylic acid moiety was found to contribute significant hydrogen bonding and other weak interactions at the active pocket of Mpro. At the binding cavity, the compound engages Asn238 through multiple hydrogen interactions, as well as Asp289. The rest of the compound structure makes a network of hydrophobic interactions mainly dominated by van der Waals contacts. To further elucidate the binding specificity and affinity of glycyrrhizin for the active pocket residues of Mpro, the interaction profile was compared and contrasted with that of the reported cocrystallized N3 inhibitor [55]. Very low similarity was noticed between the binding interaction profiles of the two compounds, owing to differences in compound structure, size, and preferred binding site; however, the pocket residues in contact with glycyrrhizin are close to those contacting N3. This difference in the binding interaction points to a different glycyrrhizin-binding mechanism, in which the active moiety favors binding to the P5 binding pocket, an interaction that is absent in the Mpro–N3 complex. The residues, particularly Asp197 and Thr198, flank the active site, and any molecule that binds these residues interferes with natural substrate binding, thus affecting the enzyme functionality [56]. Additionally, the bulk of the glycyrrhizin structure favors interactions with Domain II and Domain III of the Mpro, in addition to flanking residues of the substrate-binding pocket, thus possibly affecting the dimerization of Domain I and Domain II and rendering the enzyme noncatalytic [57]. Similarly, Zhang et al. reported an Mpro complex with an α-ketoamide inhibitor. The cocrystallized lead binds to the same substrate-binding site reported in this study [28]. Moreover, calpain inhibitors and GC-376 analogs are also confirmed to be accommodated in the same functional pocket [58]. Besides these, many in silico studies have demonstrated the binding affinity of drug molecules to this active site of Mpro [33,59–61].

**Figure 4.** Binding pose of glycyrrhizin (in yellow stick) at the substrate binding pocket of Mpro (in the green surface). A 2D illustration of the glycyrrhizin chemical interactions at the docked site is also provided.

#### 2.1.2. PLpro–Glycyrrhizin Complex

The PLpro enzyme of SARS-CoV-2 is implicated in the processing of viral polyproteins, which generates a replicase complex and assists in virus spreading. The enzyme also plays a fundamental role in cleaving post-translational proteinaceous modifications present on host proteins as a mechanism to avoid antiviral host immune responses [22]. The docked complex between PLpro and glycyrrhizin highlighted the compound binding at the central palm catalytic cavity (Figure 5). Good binding of the electronegative oxygen-rich (2*S*,3*S*,4*S*,5*R*,6*R*)-6-(((2*S*,3*R*,4*S*,5*S*,6*S*)-6-carboxy-2,4,5-trihydroxytetrahydro-2*H*-pyran-3-yl)oxy)-3,4,5-trihydroxytetrahydro-2*H*-pyran-2-carboxylic acid moiety at the docked site results from several strong hydrogen bond interactions with Gln174, Asp179, and Asn128. Besides these residues, the compound's moiety also formed van der Waals interactions, which are critical from a stability perspective. The remainder of the compound structure produced van der Waals contacts at this central cavity. The preferred binding of glycyrrhizin is at the central palm, sandwiching the finger and thumb domains, adjacent to the active substrate-binding pocket, where it interacts strongly with many vital catalytic residues. Although the cocrystallized peptide inhibitor VIR251 has a different conformation and binds to a different substrate cavity site, the glycyrrhizin-binding site is close to the VIR251 site [62]. In terms of interacting binding residues, glycyrrhizin correlates more with the GRL0617 inhibitor of SARS-CoV-2 PLpro [63]. Further, the effect of conformational change of the BL2 loop upon glycyrrhizin binding is important to evaluate in future studies to disclose the glycyrrhizin recognition mechanism.

**Figure 5.** Binding pose of glycyrrhizin (in yellow stick) at the junction binding pocket of PLpro (in blue surface). A 2D illustration of the glycyrrhizin chemical interactions at the docked site is also provided.

In the literature, many inhibitors of coronavirus PLpro are documented, including zinc conjugate inhibitors, naphthalene and thiopurine derivatives, and natural products [64]. These molecules are known to interact with the active site residues reported in this study. Tanshinones are reported to inhibit the deubiquitinase and proteolytic activity of SARS-CoV PLpro [65]; 8-(Trifluoromethyl)-9*H*-purin-6-amine is a reversible noncovalent inhibitor, whereas N-Ethylmaleimide (NEM) modifies SARS-CoV PLpro Cys [63]. Moreover, 6-mercaptopurine (6MP) and 6-thioguanine (6TG) are slow and competitive inhibitors that form hydrogen bonds with catalytic residues of the SARS-CoV PLpro [66]. Several in silico studies have also demonstrated a range of compounds that interfere with the functional site of SARS-CoV-2 PLpro [67–70].

#### 2.1.3. Nucleocapsid–Glycyrrhizin Complex

The SARS-CoV-2 N protein is an RNA binding protein and serves several functions in viral transcription and replication [20]. It plays a particularly pivotal role in helical ribonucleoprotein packing during RNA genome packing, regulating RNA replication, and modulating infected cell metabolism. Blocking this protein could block viral replication, making it an attractive target for drug development. The compound glycyrrhizin was found to prefer docking at loop region 1, at the junction between the β-sheet core and the β-hairpin (Figure 6). The molecule is aligned along the cavity volume, where its (2*S*,3*S*,4*S*,5*R*,6*R*)-6-(((2*S*,3*R*,4*S*,5*S*,6*S*)-6-carboxy-2,4,5-trihydroxytetrahydro-2*H*-pyran-3-yl)oxy)-3,4,5-trihydroxytetrahydro-2*H*-pyran-2-carboxylic acid part is connected to the β3 and β4 sheets of the β-hairpin. Here, this chemical moiety is involved in hydrogen bonding with Thr92, Arg94, and Arg89, and in van der Waals contacts with Arg90 and Ala91. The (2*S*,4a*S*,6a*S*,6b*R*,8a*S*,12a*S*,12b*R*,14b*R*)-2,4a,6a,6b,9,9,12a-heptamethyl-13-oxo-1,2,3,4,4a,5,6,6a,6b,7,8,8a,9,10,11,12,12a,12b,13,14b-icosahydropicene-2-carboxylic acid region of the compound produced hydrogen bonds with Tyr110 and Arg150 and van der Waals contacts with Asn49, Thr50, Ala51, Phe54, Thr55, Tyr112, Pro118, Pro152, and Ala157 of β1, β2, β4, β5, β6, and β7 of the β-sheet core of the protein. Bhowmik et al. reported strong binding of Rutin, Doxycycline, Caffeic acid, Ferulic acid, Simeprevir, and Grazoprevir with several of the functional residues of the SARS-CoV-2 nucleocapsid protein reported in this study [71].

**Figure 6.** Binding pose of glycyrrhizin (in yellow stick) at the junction binding pocket of the Nucleocapsid protein (in deep sky blue surface). A 2D illustration of the glycyrrhizin chemical interactions at the docked site is also provided.

#### *2.2. MD Simulation Analysis*

In computer-aided drug design, MD simulations are essential for providing detailed information on biomolecular structural dynamics and for surfacing a wealth of protein–ligand interaction and energetic data that are central to understanding the structure–function relationship of the target protein in ligand recognition and interaction [37,72,73]. This information has tremendous applications in guiding novel drug design, making MD simulation a successful tool in the modern drug discovery framework.

#### 2.2.1. Root-Mean-Square Deviation (RMSD) Analysis

MD simulation of 50 ns was performed for each receptor with bound glycyrrhizin to elucidate the compound's binding stability and to extract structural information on the receptors and compound that is key to binding and that may be altered to improve the binding conformation and, ultimately, the compound's affinity for the target biomolecules. First, the RMSD of the receptor in each complex was estimated as carbon alpha deviations by superimposing 50,000 snapshots over the initial reference structure versus time (Figure 7A). The receptor RMSDs of the three complexes were as follows: Mpro (maximum, 3.14 Å; mean, 1.97 Å), PLpro (maximum, 2.59 Å; mean, 1.64 Å), and Nucleocapsid (maximum, 2.34 Å; mean, 1.32 Å). All of the receptors are relatively stable in terms of 3D structure, and no flexibility in secondary structures was noticed. As a consequence, the glycyrrhizin binding pose was not altered, reflecting strong and stable complex formation.
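The superposition-based RMSD described above can be illustrated with a short NumPy sketch. It uses the Kabsch algorithm to remove rigid-body motion before measuring deviation; it is not the CPPTRAJ implementation used in the study, and the function name `kabsch_rmsd` and the `(atoms, 3)` array layout are assumptions made for this example.

```python
import numpy as np


def kabsch_rmsd(coords, reference):
    """RMSD (input units) between two (atoms, 3) arrays after optimal superposition."""
    # Remove translation by centering both coordinate sets.
    p = coords - coords.mean(axis=0)
    q = reference - reference.mean(axis=0)
    # Kabsch: the optimal rotation comes from the SVD of the covariance matrix.
    u, _, vt = np.linalg.svd(p.T @ q)
    # Flip the last axis if needed so the result is a proper rotation.
    d = 1.0 if np.linalg.det(u @ vt) >= 0 else -1.0
    rotation = u @ np.diag([1.0, 1.0, d]) @ vt
    diff = p @ rotation - q
    return float(np.sqrt((diff ** 2).sum() / len(q)))


# Hypothetical C-alpha coordinates: a reference and a rotated, shifted copy.
rng = np.random.default_rng(0)
reference = rng.normal(size=(20, 3))
angle = 0.7
rot_z = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                  [np.sin(angle), np.cos(angle), 0.0],
                  [0.0, 0.0, 1.0]])
moved = reference @ rot_z.T + np.array([1.0, 2.0, 3.0])
print(kabsch_rmsd(moved, reference))  # close to zero: only rigid-body motion
```

Applying the function to every snapshot against the initial structure would give an RMSD-versus-time curve of the kind plotted in Figure 7A.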

**Figure 7.** Statistical parameters calculated based on simulation trajectories. Receptor RMSD plots (**A**), glycyrrhizin RMSD plots (**B**), receptor RMSF (**C**), and receptor Rg (**D**).

#### 2.2.2. Glycyrrhizin Conformation Stability

In addition, the MD simulation trajectories were examined to disclose information about the conformational stability of glycyrrhizin with the receptors (Figure 7B). The glycyrrhizin RMSD with the receptors is Mpro (maximum, 2.56 Å; mean, 0.93 Å), PLpro (maximum, 2.14 Å; mean, 0.96 Å), and Nucleocapsid (maximum, 4.20 Å; mean, 3.48 Å). The molecule was highly stable, except for some deviations in the glycyrrhizin binding mode with the Nucleocapsid protein; therefore, the final MD simulation snapshot was superimposed on the initial one to understand the compound dynamics. The (2*S*,3*S*,4*S*,5*R*,6*R*)-6-(((2*S*,3*R*,4*S*,5*S*,6*S*)-6-carboxy-2,4,5-trihydroxytetrahydro-2*H*-pyran-3-yl)oxy)-3,4,5-trihydroxytetrahydro-2*H*-pyran-2-carboxylic acid fragment of glycyrrhizin is flexible in an attempt to establish a more stable conformation. This moiety left its original site of interaction and moved further towards the β-sheet core for binding (Figure 8).

**Figure 8.** Binding mode dynamics of glycyrrhizin in the MD simulation at the initial time (in tan stick) versus at the end time (in plum stick).

#### 2.2.3. Root-Mean-Square Fluctuation (RMSF) Analysis

The residual flexibility and stability of the receptors in the presence of glycyrrhizin were further elucidated (Figure 7C). Mean RMSF for Mpro is 1.4 Å, PLpro is 1.57 Å, and Nucleocapsid is 1.9 Å. These values suggest good agreement on intermolecular stability.
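For readers unfamiliar with the quantity, RMSF measures how far each atom (or residue) fluctuates around its own time-averaged position. A minimal NumPy sketch follows, assuming the trajectory has already been aligned to remove overall rotation and translation; the function name `rmsf` and the `(frames, atoms, 3)` layout are illustrative choices, not the authors' tooling.

```python
import numpy as np


def rmsf(trajectory):
    """Per-atom RMSF for a pre-aligned trajectory of shape (frames, atoms, 3)."""
    trajectory = np.asarray(trajectory, dtype=float)
    mean_positions = trajectory.mean(axis=0)
    # Squared displacement of every atom from its mean position in each frame.
    squared_dev = ((trajectory - mean_positions) ** 2).sum(axis=2)
    return np.sqrt(squared_dev.mean(axis=0))


# Hypothetical two-frame, two-atom trajectory: atom 0 swings by 1 unit, atom 1 is fixed.
toy = np.array([
    [[1.0, 0.0, 0.0], [5.0, 5.0, 5.0]],
    [[-1.0, 0.0, 0.0], [5.0, 5.0, 5.0]],
])
print(rmsf(toy))  # [1. 0.]
```

Averaging the per-residue values gives a single mean RMSF per receptor, comparable in form to the means quoted above.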

#### 2.2.4. Radius of Gyration (Rg) Analysis

Additionally, Rg analysis was performed to evaluate protein compactness and structural equilibrium over the simulation time (Figure 7D). The Rg of the systems follows: Mpro–glycyrrhizin (45.62 Å and 42.28 Å), PLpro–glycyrrhizin (50.29 Å and 46.23 Å), and Nucleocapsid–glycyrrhizin (35.71 Å and 30.70 Å). All three systems are quite stable and remain compact.
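As a brief illustration of the compactness measure, the radius of gyration of a single snapshot is the mass-weighted root-mean-square distance of the atoms from their center of mass. The sketch below is a generic NumPy version with an assumed function name (`radius_of_gyration`) and equal masses by default; it is not taken from the study's workflow.

```python
import numpy as np


def radius_of_gyration(coords, masses=None):
    """Mass-weighted radius of gyration for one snapshot of shape (atoms, 3)."""
    coords = np.asarray(coords, dtype=float)
    if masses is None:
        masses = np.ones(len(coords))  # assumption: equal masses when none are given
    masses = np.asarray(masses, dtype=float)
    center_of_mass = (masses[:, None] * coords).sum(axis=0) / masses.sum()
    squared_dist = ((coords - center_of_mass) ** 2).sum(axis=1)
    return float(np.sqrt((masses * squared_dist).sum() / masses.sum()))


# Two hypothetical atoms placed symmetrically about the origin.
print(radius_of_gyration([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]]))  # 1.0
```

A roughly constant Rg across snapshots is what "remains compact" means in practice; a steady increase would point to unfolding or loosening of the structure.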

#### *2.3. MMGB/PBSA Analysis*

To gain deeper insight into the compound's binding potential with the SARS-CoV-2 enzymes used, binding free energies were estimated using the MMGBSA and MMPBSA techniques. Additionally, per-residue decomposition was performed to highlight residues that contribute most to the compound's stability at the docked position and, ultimately, to the strong intermolecular interactions. To this end, 100 frames were picked at time intervals of 50 ps from the simulation trajectories, discarding the water molecules and counterions. Detailed binding energies of the complexes are listed in Table 1. All of the binding interactions are energetically favorable, resulting in the formation of stable complexes. In all of the complexes, gas-phase energy dominates the system energy, with a significant contribution from van der Waals energy and only a minor role for electrostatic energy. The polar solvation energy plays an unfavorable part in binding, whereas the nonpolar energy seems to be vital in complex equilibration. The MMGBSA net binding-energy-ranked stability of the complexes follows: PLpro–glycyrrhizin > Spike–glycyrrhizin > Nucleocapsid–glycyrrhizin > Mpro–glycyrrhizin. The MMPBSA ranking follows: PLpro–glycyrrhizin > Spike–glycyrrhizin > Mpro–glycyrrhizin > Nucleocapsid–glycyrrhizin.


**Table 1.** Binding free energy components of SARS-CoV-2 enzyme complexes with glycyrrhizin. The energy values are provided in units of kcal/mol.

#### *2.4. Per-Residue Decomposition*

The atomic-level contribution of each residue from the enzymes to the compound binding was elucidated further. Those with an average binding energy of <1 kcal/mol were categorized as hotspot residues because of their significant overall complex stability contribution [74,75]. In the case of Mpro–glycyrrhizin interaction, Asn238 and Asp289 are vital in holding the compound at the docked site. Phe69, Asn128, Gln174, and Asp179 residues are critical in bridging PLpro enzyme with glycyrrhizin compound. The primary hotspot residues in Nucleocapsid–glycyrrhizin complex are Thr92, Arg94, Tyr110, and Arg150. It was further noticed that the van der Waals energy, as noted earlier, dominates the overall binding interaction energy. Hotspot residues of each receptor that are in direct contact and key in the stabilization of glycyrrhizin are presented in Table 2.

#### *2.5. WaterSwap Binding Energy*

WaterSwap uses an explicit solvation system that considers the interaction details of protein–water, protein–water–ligand, and ligand–water. Such information is not provided by MMGB/PBSA, which is therefore not reliable for predicting the role of water molecules in biomolecule–ligand interactions [76]. This is especially important when the ligand is bridged to the receptor through water molecules. The WaterSwap method has been successfully applied to various biological systems and has proved critical in determining absolute binding free energy. For each complex, the WaterSwap energies converged significantly after running 1000 frames. All values also indicated good stability of the intermolecular docked conformation. WaterSwap energies for each complex are shown in Table 3.


**Table 2.** Hotspot residues identified that played a significant role in interaction with the glycyrrhizin.


**Table 3.** WaterSwap absolute binding energy estimation for all four complexes.

#### **3. Materials and Methods**

#### *3.1. Target Proteins Preparation*

The anti-SARS-CoV-2 targets (Mpro PDB code: 7BQY, PLpro PDB code: 6XAA, and Nucleocapsid PDB code: 6M3M) were retrieved and prepared using the AMBER18 program [77]. The ff14SB force field [78] was used for amino acid parameterization. To add complementary hydrogen atoms missed by crystallography, the tleap module of AmberTools18 was employed. Energy minimization of the targeted proteins was done first for 1000 steepest descent steps and then for 500 conjugate gradient steps, with a step size of 0.02 Å. Charge addition was done through the Gasteiger method.

#### *3.2. Compound Preparation*

The MPD3 phytochemical database (https://www.bioinformation.info/), in addition to reported natural antiviral compounds, was used in this study to filter molecules that show the best binding affinity to the selected SARS-CoV-2 targets. The library containing ~5000 natural compounds was imported into PyRx 0.8 software [79], where the compounds were energy-minimized and then converted to pdbqt format for use in virtual screening against the mentioned targets.

#### *3.3. Structure-Based Virtual Screening*

Virtual screening of the compounds against the targets was done using AutoDock Vina in PyRx [80] on a Windows 10-supported Dell system (processor: Intel(R) Core(TM) i7-8550U CPU @ 1.80 GHz with a 64-bit operating system, x64-based processor, and 8.00 GB of memory). First, the docking protocol was validated by docking cocrystallized ligands to the protein, keeping the docking parameters at their defaults except for the sphere around the binding site, which was set to 15 Å. Validation was also done by comparing the conformation of the best-ranked compounds with that of the crystallized ligand by root-mean-square deviation (RMSD) [81]. Docking of the compounds to the targets was accomplished using the same set of parameters described for the validation procedure and was run in triplicate to ensure consistency of the results. The docked solutions were clustered, considering an RMSD value of 1 Å. The binding mode of the compounds with the lowest binding energy in kcal/mol was refined in MD simulations.
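The clustering step with a 1 Å RMSD cutoff can be pictured with a simple greedy scheme: poses are visited from lowest to highest binding energy, and each pose joins the first cluster whose seed lies within the cutoff, or starts a new cluster otherwise. This is an assumed, simplified scheme for illustration, not the clustering routine of PyRx or AutoDock Vina; the function names and the toy coordinates are hypothetical.

```python
import numpy as np


def pose_rmsd(a, b):
    """Plain RMSD between two poses in the same receptor frame (no fitting)."""
    return float(np.sqrt(((a - b) ** 2).sum(axis=1).mean()))


def cluster_poses(poses, energies, cutoff=1.0):
    """Greedy clustering; the lowest-energy unassigned pose seeds each cluster."""
    clusters = []  # each cluster is a list of pose indices, seed first
    for i in np.argsort(energies):
        for members in clusters:
            if pose_rmsd(poses[i], poses[members[0]]) <= cutoff:
                members.append(int(i))
                break
        else:
            # No existing seed is close enough, so this pose starts a new cluster.
            clusters.append([int(i)])
    return clusters


# Hypothetical poses: the second is a 0.1 Å shift of the first, the third is far away.
base = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.5, 0.0]])
poses = [base, base + np.array([0.1, 0.0, 0.0]), base + np.array([5.0, 0.0, 0.0])]
energies = [-8.0, -7.5, -6.0]
print(cluster_poses(poses, energies))  # [[0, 1], [2]]
```

The seed of the first cluster is then the lowest-energy representative, which corresponds to the binding mode carried forward into MD refinement.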

#### *3.4. MD Simulations*

MD simulations of the docked solutions were performed using AMBER18 [77]. Each top complex was explicitly solvated with water molecules, and counter ions were then added to obtain a neutral system. Using the TIP3P solvent model, a water box of thickness 12 Å was created to surround the complex [82]. Simulation of the complex was done under periodic boundary conditions, with electrostatic interactions modeled by the particle–mesh Ewald procedure [74]. A threshold value of 8 Å was defined for nonbonded interactions. Water molecules were minimized for 500 cycles, followed by complete system minimization for 1000 rounds. Then, each system temperature was gradually scaled to 300 K. Equilibration of the systems was achieved under the NPT ensemble for 100 ps. This involved equilibration of both counter ions and water molecules with restraints on the solutes in the first phase for 50 ps; subsequently, the protein side chains were relaxed. MD simulation of 50 ns was performed at 300 K and 1 atm with a time step of 2 fs under the NPT ensemble. Hydrogen and covalent bonds were constrained using the SHAKE algorithm [83], whereas system temperature was controlled through Langevin dynamics [84]. The initial structure was used as a reference, and CPPTRAJ [85] of AMBER was run to generate a root-mean-square deviation (RMSD) plot to check the convergence of the MD simulation [81]. Ligand structural flexibility was calculated from the ligand RMSD. Furthermore, hydrogen bond analysis was performed to investigate hydrogen bonds formed between the compounds and amino acids present within the vicinity of the docked site.
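The run length quoted above translates into a step count and a snapshot spacing by simple arithmetic. The sketch below assumes that the 2 fs figure is the integration time step and that the 50,000 snapshots mentioned in the RMSD analysis were saved at even intervals; the function name `md_schedule` is hypothetical.

```python
def md_schedule(total_ns, timestep_fs, n_snapshots):
    """Return (integration steps, steps between snapshots, ps between snapshots)."""
    steps = round(total_ns * 1_000_000 / timestep_fs)  # 1 ns = 1,000,000 fs
    stride = steps // n_snapshots
    ps_per_snapshot = stride * timestep_fs / 1000  # 1 ps = 1000 fs
    return steps, stride, ps_per_snapshot


print(md_schedule(total_ns=50, timestep_fs=2, n_snapshots=50_000))
# (25000000, 500, 1.0)
```

Under these assumptions a snapshot is written every 1 ps, so the 50 ps spacing used to pick frames for the MMGB/PBSA analysis corresponds to every 50th saved snapshot.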

#### *3.5. MMGB/PBSA Analysis*

The binding free energy (∆G binding) of the complexes was estimated using the AMBER18 MM/PBSA method [42,86]. One hundred snapshots were considered from simulation trajectories at a regular time interval to calculate the free energy difference.

$$
\Delta G_{\text{binding}} = G_{\text{complex}} - (G_{\text{protein}} + G_{\text{ligand}})
$$

$$
\Delta G = \Delta G_{\text{gas}} + \Delta G_{\text{solv}} - T\Delta S
$$

$$
\Delta G_{\text{gas}} = \Delta E_{\text{ele}} + \Delta E_{\text{vdw}}
$$

$$
\Delta G_{\text{solv}} = \Delta G_{\text{GB}} + \Delta G_{\text{SA}}
$$

$$
\Delta G_{\text{SA}} = \gamma \times \text{SASA} + b
$$

In these equations, Gcomplex is delta free energy of the complex, Gprotein is delta free energy of the protein, and Gligand is delta free energy of the ligand; ∆Ggas represents gas-phase energy and can be split into delta electrostatic (∆Eele), and delta van der Waals (∆Evdw) energy; and the ∆Gsolv term stands for solvation free energy, which comprises polar (∆GGB) and nonpolar (∆GSA) energy. In the ∆GGB, the εw value is set to 80, and εp is selected as 1.0. Linear combinations of the pairwise overlap method are used to estimate the solvent-accessible surface area (SASA).
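To make the bookkeeping concrete, the following sketch assembles a binding free energy from its components in the way the equations describe, taking the nonpolar term in its usual linear form (surface tension γ times SASA plus an offset b). The class and function names are hypothetical, the entropy term −TΔS is deliberately left out, and the γ, b, and energy values are placeholders chosen for illustration rather than parameters or results from this study.

```python
from dataclasses import dataclass


@dataclass
class DeltaTerms:
    """Complex minus (protein + ligand) differences for one snapshot."""
    e_ele: float    # electrostatic energy, kcal/mol
    e_vdw: float    # van der Waals energy, kcal/mol
    g_polar: float  # polar (GB or PB) solvation energy, kcal/mol
    sasa: float     # change in solvent-accessible surface area, Å^2


def nonpolar_term(sasa, gamma=0.0072, offset=0.0):
    """Linear SASA model; gamma and offset are illustrative assumptions."""
    return gamma * sasa + offset


def binding_free_energy(delta):
    """Gas-phase plus solvation contributions; entropy is omitted in this sketch."""
    gas = delta.e_ele + delta.e_vdw
    solvation = delta.g_polar + nonpolar_term(delta.sasa)
    return gas + solvation


# Hypothetical snapshot, not a value from Table 1.
snapshot = DeltaTerms(e_ele=-10.0, e_vdw=-40.0, g_polar=25.0, sasa=-500.0)
print(round(binding_free_energy(snapshot), 2))
```

In practice the same sum is evaluated for each of the selected snapshots and then averaged, which is why an unfavorable polar solvation term can coexist with a favorable net binding energy when the van der Waals contribution is large.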

#### *3.6. WaterSwap Analysis*

WaterSwap [76,87] was additionally run over the last 10 ns of the MD simulation for the default total of 1000 iterations, keeping the sample size of the Monte Carlo simulation at 1.6 × 10<sup>9</sup>. The absolute binding energy of each complex was estimated using three algorithms: thermodynamic integration, free energy perturbation, and the Bennett acceptance ratio. An energy value of <1 kcal/mol represents good convergence of the system [75].
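One way to read the convergence criterion is as a check on how closely the three estimators agree with one another. The sketch below takes that reading as an explicit assumption: it reports the mean of the three estimates, their spread, and whether the spread falls under a 1 kcal/mol tolerance. The function name and the numeric inputs are hypothetical.

```python
from statistics import mean


def waterswap_consensus(ti, fep, bar, tolerance=1.0):
    """Return (mean estimate, spread, converged flag) for three estimators.

    ti, fep, and bar are binding free energies (kcal/mol) from thermodynamic
    integration, free energy perturbation, and the Bennett acceptance ratio.
    """
    values = (ti, fep, bar)
    spread = max(values) - min(values)
    return mean(values), spread, spread < tolerance


# Hypothetical estimator outputs, not values from Table 3.
average, spread, converged = waterswap_consensus(-22.0, -22.5, -22.8)
print(f"mean = {average:.2f} kcal/mol, spread = {spread:.2f}, converged = {converged}")
```

If the spread exceeded the tolerance, the natural response would be to extend the number of iterations before quoting a mean value.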

#### **4. Conclusions**

In this study, we found glycyrrhizin to be the most significant natural compound that can act as a double-edged sword and inhibit multiple proteins of SARS-CoV-2. This compound has a high binding affinity for all of the SARS-CoV-2 receptors used in this study and maintained a stable binding mode over the MD simulation time. The compound revealed important interactions with all receptors and thus requires further consideration in future anti-SARS-CoV-2 therapeutic studies. Glycyrrhizin has been previously documented to have therapeutic applications against SARS-CoV, chronic hepatitis C, and HIV-1 [88]. The molecule is clinically useful and has few toxic reactions. One way to overcome toxicity is by allowing a low concentration of the drug in the cells (<100 µg/mL) [89]. Glycyrrhizin has been reported to inhibit viral penetration and to be effective both during viral infection and postinfection [90]. It was previously demonstrated that glycyrrhizin binds with good affinity to human ACE2 and interacts with the Asp30, Gln288, Arg393, and Arg559 residues, which also underlines its potential to target the attachment of the SARS-CoV-2 Spike protein RBD to the human ACE2 receptor [90]. It was also shown that glycyrrhizin can be employed in synergism with other plant-based molecules to treat SARS-CoVs [91]. From a pharmacological perspective, glycyrrhizin prevents the production of intracellular reactive oxygen species, activates interferon production, downregulates proinflammatory cytokines, lowers airway exudate production, and inhibits thrombin [45,92]. The compound was also computationally characterized previously to bind with good affinity to the SARS-CoV-2 main protease [93]. Therefore, additional structural modification to lower the side effects and enhance the clinical efficacy of this compound is of high interest for treating SARS-related infections.

**Author Contributions:** Conceptualization, G.L.; data curation, Z.T.M., A.R.H., and H.M.H.A.-H.; funding acquisition, G.L.; investigation, Z.T.M.; project administration, G.L.; software, S.A.; supervision, G.L.; validation, A.R.H., H.M.H.A.-H., and S.A.; visualization, Z.T.M.; writing—original draft, Z.T.M.; writing—review and editing, A.R.H., H.M.H.A.-H., S.A., and G.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Natural Science Foundation of China (grant numbers: 31770333, 31370329, and 11631012).

**Data Availability Statement:** The data presented in this study are available within the article.

**Acknowledgments:** Authors would like to acknowledge Shaanxi Normal University, Xi'an, China for providing facilities for this study.

**Conflicts of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

**Sample Availability:** Not available.

#### **References**


### *Article* **Combining Different Docking Engines and Consensus Strategies to Design and Validate Optimized Virtual Screening Protocols for the SARS-CoV-2 3CL Protease**

**Candida Manelfi <sup>1</sup> , Jonas Gossen 2,3, Silvia Gervasoni <sup>4</sup> , Carmine Talarico <sup>1</sup> , Simone Albani 2,3 , Benjamin Joseph Philipp 2,3, Francesco Musiani <sup>5</sup> , Giulio Vistoli <sup>4</sup> , Giulia Rossetti 2,6,7 , Andrea Rosario Beccari <sup>1</sup> and Alessandro Pedretti 4,\***


**Abstract:** The 3CL-Protease appears to be a very promising medicinal target to develop anti-SARS-CoV-2 agents. The availability of resolved structures allows structure-based computational approaches to be carried out even though the lack of known inhibitors prevents a proper validation of the performed simulations. The innovative idea of the study is to exploit known inhibitors of SARS-CoV 3CL-Pro as a training set to perform and validate multiple virtual screening campaigns. Docking simulations using four different programs (Fred, Glide, LiGen, and PLANTS) were performed investigating the role of both multiple binding modes (by binding space) and multiple isomers/states (by developing the corresponding isomeric space). The computed docking scores were used to develop consensus models, which allow an in-depth comparison of the resulting performances. On average, the reached performances revealed the different sensitivity to isomeric differences and multiple binding modes between the four docking engines. In detail, Glide and LiGen are the tools that best benefit from isomeric and binding space, respectively, while Fred is the most insensitive program. The obtained results emphasize the fruitful role of combining various docking tools to optimize the predictive performances. Taken together, the performed simulations allowed the rational development of highly performing virtual screening workflows, which could be further optimized by considering different 3CL-Pro structures and, more importantly, by including true SARS-CoV-2 3CL-Pro inhibitors (as learning set) when available.

**Keywords:** SARS-CoV-2; 3CL-Pro; antivirals; virtual screening; docking simulations; drug repurposing; consensus models; binding space; isomeric space

**Citation:** Manelfi, C.; Gossen, J.; Gervasoni, S.; Talarico, C.; Albani, S.; Philipp, B.J.; Musiani, F.; Vistoli, G.; Rossetti, G.; Beccari, A.R.; et al. Combining Different Docking Engines and Consensus Strategies to Design and Validate Optimized Virtual Screening Protocols for the SARS-CoV-2 3CL Protease. *Molecules* **2021**, *26*, 797. https://doi.org/10.3390/molecules26040797

Academic Editors: Marco Tutone and Anna Maria Almerico Received: 13 November 2020; Accepted: 26 January 2021; Published: 4 February 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

Coronaviruses (CoVs, subfamily Coronavirinae, and family Coronaviridae) are enveloped viruses with a single positive-strand RNA genome that can infect humans, in whom they may cause respiratory, gastrointestinal, and neurological disorders. A new coronavirus was identified in Wuhan, China [1], at the end of 2019 and caused the current worldwide pandemic crisis [2]. It is mainly responsible for a pneumonia-like illness that poses severe threats and often requires patient hospitalization, with a global lethality equal to 2.1% (data updated to 16 January 2021). Considering the caused illness and its similarity with the SARS coronavirus (SARS-CoV, genome similarity equal to 82%), the virus was called SARS-CoV-2 (severe acute respiratory syndrome-coronavirus-2), while the induced disease was termed COVID-19 [3].

The SARS-CoV-2 genome [4] encodes structural proteins that are required for viral entry (such as the Spike glycoprotein); non-structural proteins, which comprise enzymes endowed with protease, methyltransferase, helicase, and polymerase activities; and accessory proteins [5], the role of which has yet to be fully clarified [6,7]. The non-structural proteins as well as the proteins required for the viral interaction with the host cells represent potential drug targets [8]. In detail, the SARS-CoV-2 replicase gene encodes two overlapping polyproteins (i.e., pp1a and pp1ab), which are cleaved into the functional proteins required for viral replication and transcription by a 33.8 kDa protease called MPro or 3-chymotrypsin-like protease, 3CL-Pro [9]. Interestingly, 3CL-Pro itself is comprised within the pp1a and pp1ab polyproteins, and indeed, the first enzymatic step involves the autolytic cleavage that liberates 3CL-Pro. This protease is active as a homodimer and shows a peculiar specificity, which differs from that of the closest host-cell proteases, since it shows a unique substrate preference for glutamine at the P1 site [10].

Such a substrate specificity, combined with its key role in the viral life cycle, renders this protein an attractive target to develop anti-SARS-CoV-2 drugs. Thus, a relevant number of publications focused on the identification of potential SARS-CoV-2 3CL-Pro inhibitors appeared in the literature during the last few months [11]. Several studies comprised structure-based virtual screening (VS) campaigns especially targeted at drug repurposing. For example, Gimeno et al. [12] reported a consensus repurposing study that combines three docking programs (Glide, FRED, and AutoDock/Vina). Only the ligands showing favorable and equivalent binding modes in all three programs were considered as actives. Elmezayen et al. combined docking calculations with molecular dynamics (MD) simulations to better investigate the stability of the retrieved hits by evaluating the corresponding free energy using the MM-PBSA method [13]. Again, Meyer-Almes described a method in which molecular docking, ∆G energy calculation, and analysis of the protein–ligand interaction fingerprints (PLIF) are combined to increase the predictive performances [14]. Other studies involved computational protocols based on a single docking tool and extended the virtual screenings to drug-like molecules included within the ZINC database [15] as well as to databases of natural compounds [16]. However, due to the lack of known SARS-CoV-2 3CL-Pro inhibitors, none of the reported VS strategies can be optimized and validated a priori, and thus the reliability of the retrieved hits has to be corroborated by combining different computational approaches.

The novelty of the present study is that the known SARS-CoV 3CL-Pro inhibitors can be added to the screened databases and used as a training set of active molecules to evaluate the performances of the applied computational strategies. Although the efficacy of the SARS-CoV 3CL-Pro inhibitors simulated here was not experimentally confirmed against SARS-CoV-2 3CL-Pro, such a choice is justified by the very high degree of conservation between the two proteases (SARS-CoV and SARS-CoV-2 3CL-Pro differ in only 12 residues out of 306), which is suggestive of a very similar binding mode within the two enzymatic pockets, as also confirmed by a recent computational analysis [17].

The use of the known SARS-CoV 3CL-Pro inhibitors as active molecules in the simulations reported here is also supported by the recently resolved structures of the SARS-CoV-2 3CL-Pro enzyme in complex with known SARS-CoV 3CL-Pro inhibitors, which show ligand arrangements superimposable to those already seen within SARS-CoV 3CL-Pro. A compelling example is offered by the SARS-CoV-2 3CL-Pro structure bound to the potent TG-0205221 inhibitor (PDB Id: 7C8T), which is completely comparable with the corresponding SARS-CoV 3CL-Pro complex (PDB Id: 2GX4, see below). On one hand, such a resolved structure indirectly confirms the potential activity on SARS-CoV-2 3CL-Pro of the most active known inhibitors of SARS-CoV 3CL-Pro. On the other hand, this structure emphasizes the possibility to employ known SARS-CoV 3CL-Pro inhibitors to guide the design and/or selection of promising hits for the SARS-CoV-2 3CL-Pro enzyme.

Hence, the study involves extended docking simulations based on a purposely collected database in which a set of 478 molecules active against SARS-CoV 3CL-Pro with pIC50 values greater than 5.0 were dispersed within a database of about 13,000 safe-in-man molecules. In this way, the performed simulations can be exploited to design, optimize, and validate targeted VS protocols by comparing and combining different docking tools (i.e., Glide, LiGen, PLANTS, and Fred) and different post-docking analyses. Moreover, the best performing approaches can also be used to identify potential hits against SARS-CoV-2 3CL-Pro.

Furthermore, the analysis of the performed VS campaigns involves the combination of the already proposed binding space with the isomeric space developed here. The concept of binding space was recently proposed by some of us to account for the multiple binding modes a ligand can assume within the binding pocket by simultaneously considering more than one pose for each ligand [18]. Similar to property space [19], the binding space can be defined by parameters that encode both the average and the spread of the score values, the latter being described by ranges or standard deviations. Previous studies involving both correlative analyses and VS campaigns revealed that the binding space concept can elicit significant improvements in the predictive power compared to the best score values [20].
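As a minimal sketch of the binding space idea described above, the following Python function summarizes the scores of the poses generated for one ligand by their best value, mean, range, and standard deviation. The function name, the dictionary keys, the example scores, and the convention that lower scores are better are illustrative assumptions and are not taken from the programs used in the study.

```python
from statistics import mean, pstdev


def space_descriptors(scores, lower_is_better=True):
    """Summarize a set of docking scores by best value, average, and spread.

    Hypothetical helper: the key names and the score convention are assumptions.
    """
    if not scores:
        raise ValueError("at least one score is required")
    best = min(scores) if lower_is_better else max(scores)
    return {
        "best": best,                          # canonical "best score" value
        "mean": mean(scores),                  # average over all poses
        "range": max(scores) - min(scores),    # spread expressed as a range
        "std": pstdev(scores),                 # spread as population standard deviation
    }


# Hypothetical scores of three poses of the same ligand (illustration only)
pose_scores = [-9.0, -8.0, -7.0]
descriptors = space_descriptors(pose_scores)
print(descriptors)
```

A ligand whose poses all score similarly gives a small range and standard deviation, whereas a ligand with a single outlying pose gives a large spread even if its best score is the same.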

In this study, the same rationale inspiring the definition of the binding space was applied to define the corresponding isomeric space. In detail, the ligands existing in different tautomers, stereoisomers, or protonation forms are simulated by docking all possible states, and the resulting docking scores are utilized for calculating the corresponding space parameters (i.e., score averages and ranges). As detailed under Methods, the study investigates both the binding and isomeric spaces by considering four possible combinations: (1) without including the parameters of both the binding and isomeric spaces but considering only the best scores for each compound (namely, the canonical conditions); (2) by including the parameters of the sole binding space to confirm the encouraging results already pursued in previous studies; (3) by including the parameters of the sole isomeric space to assess its specific relevance; and (4) by variously combining the parameters of both spaces to investigate their synergistic role. All so-computed score parameters were then utilized to develop consensus models by using the enrichment factor optimization (EFO) approach [21], which generates linear combinations of the considered descriptors and proved particularly efficient in previous VS benchmarking studies [22,23].
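The isomeric space can be illustrated in the same spirit. The sketch below assumes, purely for illustration, that each isomer/state of a molecule is represented by its best pose score and that the space parameters are the average and the range of these values; the exact definition adopted in the study is given under Methods, and all names and numbers here are hypothetical.

```python
from statistics import mean


def isomeric_space(scores_by_isomer):
    """Compute best, mean, and range over the isomers/states of one molecule.

    scores_by_isomer maps a (hypothetical) isomer label to the list of its
    pose scores; lower scores are assumed to be better.
    """
    if not scores_by_isomer:
        raise ValueError("at least one isomer is required")
    # Assumption: each isomer is represented by its best (lowest) pose score
    best_per_isomer = [min(scores) for scores in scores_by_isomer.values()]
    return {
        "best": min(best_per_isomer),
        "mean": mean(best_per_isomer),
        "range": max(best_per_isomer) - min(best_per_isomer),
    }


# Hypothetical molecule docked in three states (illustration only)
ligand = {
    "tautomer_1": [-8.5, -7.9],
    "tautomer_2": [-6.5, -6.0],
    "protomer_1": [-7.5, -7.0],
}
iso = isomeric_space(ligand)
print(iso)
```

Under the canonical conditions only the `best` value would be retained, while the isomeric space adds the `mean` and `range` values as further descriptors for the consensus models.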

#### **2. Results**

#### *2.1. Simulated Dataset*

Docking simulations involved a database of 13,535 unique molecules including 478 known SARS-CoV inhibitors with pIC50 > 5, taken from the literature, which were dispersed within a dataset of 13,057 safe-in-man compounds. Among the 478 known SARS-CoV inhibitors, 193 compounds are endowed with a pIC50 > 6. These two inhibition thresholds (i.e., pIC50 > 5 and pIC50 > 6) allow the definition of two different (albeit partly overlapping) sets of active molecules, and the generation of all analyzed consensus models was repeated by considering the two resulting training sets of SARS-CoV inhibitors. Furthermore, 4515 molecules (i.e., a third of the complete database) required the generation of 14,013 multiple states/isomers with an average of 3.1 isomers per molecule. In this way, the screened database overall included 23,054 ligands.
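For readers less familiar with the pIC50 scale, the two thresholds can be illustrated with a short sketch based on the standard definition pIC50 = −log10(IC50 in mol/L), so that pIC50 > 5 corresponds to IC50 < 10 µM and pIC50 > 6 to IC50 < 1 µM. The function names and the example IC50 values are hypothetical and serve only as an illustration.

```python
import math


def pic50_from_ic50_um(ic50_um):
    """Convert an IC50 expressed in micromolar into pIC50 = -log10(IC50 in mol/L)."""
    if ic50_um <= 0:
        raise ValueError("IC50 must be positive")
    # 1 uM = 1e-6 mol/L, hence pIC50 = 6 - log10(IC50 in uM)
    return 6.0 - math.log10(ic50_um)


def activity_sets(pic50):
    """Return membership in the two active sets (pIC50 > 5, pIC50 > 6)."""
    return pic50 > 5.0, pic50 > 6.0


# Hypothetical IC50 values in uM (illustration only)
for ic50 in (0.5, 5.0, 50.0):
    p = pic50_from_ic50_um(ic50)
    print(f"IC50 = {ic50} uM, pIC50 = {p:.2f}, sets (>5, >6) = {activity_sets(p)}")
```

Every compound with pIC50 > 6 also satisfies pIC50 > 5, which is why the two sets of active molecules partly overlap.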

Due to the different docking protocols and the criteria by which the generated poses were considered as acceptable by each docking tool (as discussed under Methods), the number of docked molecules varies among the docking simulations as summarized in Table 1. In detail, Table 1 shows that the number of simulated molecules by the four tested programs decreases with the following trend: PLANTS > LiGen > Fred > Glide. The number of docked active molecules parallels the same trend except for Glide, which considers a higher number of active molecules while simulating a lower number of ligands (compared to Fred). The same trend is exhibited by the number of molecules simulated by considering multiple states by the four docking engines.


**Table 1.** General characteristics of the databases simulated by each docking program and relative generated poses.

**<sup>a</sup>** The number of both molecules existing in multiple states and generated poses varies among the docking programs.

When considering the number of monitored poses, it should be noted that PLANTS, Fred, and Glide generated 10 poses per ligand, while LiGen provided 10 poses per unique molecule. This implies that for molecules existing in multiple states, LiGen selected the 10 best poses regardless of the involved isomers. Hence, the number of analyzed poses for LiGen is significantly lower than that of the other docking engines.

Despite the described differences, a common dataset composed of those molecules that are simulated by all the utilized docking tools can be extracted. This comprises 10,110 ligands (about 75% of the full database) and includes 353 active compounds (i.e., with pIC50 > 5). Such a common database will be used to develop mixed consensus models by combining the docking results from different tools and to compare the best ranked molecules by the four docking tools.
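Conceptually, the common dataset is the intersection of the sets of molecules successfully docked by each tool. A minimal sketch, with hypothetical molecule identifiers and a hypothetical function name, could be:

```python
def common_dataset(docked_by_tool):
    """Return the identifiers of the molecules docked by every tool."""
    sets = [set(ids) for ids in docked_by_tool.values()]
    return set.intersection(*sets) if sets else set()


# Hypothetical identifiers of docked molecules per tool (illustration only)
docked = {
    "PLANTS": ["m1", "m2", "m3", "m4"],
    "LiGen": ["m1", "m2", "m3"],
    "Fred": ["m1", "m3", "m4"],
    "Glide": ["m1", "m3"],
}
common = common_dataset(docked)
print(sorted(common))
```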

#### *2.2. Results by PLANTS*

Table 2 compiles the best enrichment factors in the top 1% (EF 1%) based on the PLANTS simulations as well as the resulting enhancements (as percentage values) induced by the inclusion of the explored space descriptors and by the generation of linear consensus equations. In all reported analyses (Tables 2–5), the consensus models are developed by linearly combining from two to five docking scores using the EFO algorithm (as detailed under Methods). As mentioned above, the analyses are repeated by considering the two sets of active molecules as collected considering two different inhibition thresholds (i.e., pIC50 > 5 and pIC50 > 6).
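As a reminder of what the EF 1% metric measures, the sketch below uses the commonly adopted definition of the enrichment factor, i.e., the fraction of actives found in the top-ranked x% of the database divided by the fraction of actives in the whole database. It is an illustrative assumption that the study uses this same definition (see Methods); the function name, the toy data, and the convention that lower scores are better are hypothetical.

```python
def enrichment_factor(scores, is_active, fraction=0.01, lower_is_better=True):
    """Enrichment factor in the top `fraction` of a score-ranked database."""
    n = len(scores)
    if n == 0 or n != len(is_active):
        raise ValueError("scores and is_active must be non-empty and of equal length")
    total_actives = sum(bool(a) for a in is_active)
    if total_actives == 0:
        raise ValueError("at least one active molecule is required")
    n_top = max(1, round(n * fraction))
    # Rank molecule indices from best to worst score
    order = sorted(range(n), key=lambda i: scores[i], reverse=not lower_is_better)
    hits = sum(bool(is_active[i]) for i in order[:n_top])
    # (hits / n_top) / (total_actives / n), written to limit rounding errors
    return hits * n / (n_top * total_actives)


# Toy database: 1000 molecules, 20 actives, 5 of which rank in the top 1%
toy_scores = [float(i) for i in range(1000)]
toy_active = [i < 5 or 100 <= i < 115 for i in range(1000)]
ef1 = enrichment_factor(toy_scores, toy_active)
print(ef1)  # 25.0
```

In the toy example, half of the 10 top-ranked molecules are active against a background rate of 2%, which gives an enrichment factor of 25.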

A preliminary consideration, which affects all these analyses, concerns the unavoidably biased comparison between binding and isomeric spaces. Indeed, the former involves all the screened molecules for which different poses were generated, while the latter affects only those molecules existing in multiple states, which represent about 1/3 of the entire dataset (see above). Specifically, the PLANTS simulations involved 31 compounds existing in multiple states out of 193 for the actives with pIC50 > 6 and 136 out of 478 for the molecules with pIC50 > 5.

Thus, one may understand the different effects exerted by the inclusion of the isomeric space parameters for the two considered pIC50 thresholds. Indeed, the number of examined isomers with pIC50 > 5 is high enough and the isomeric space affords an appreciable enhancing effect, which is greater than that exerted by the binding space. In contrast, the limited number of compounds existing in multiple states with pIC50 > 6 reduces the enhancing effect exerted by the isomeric space. The inclusion of the binding space parameters induces significant improvements, especially with pIC50 > 6, where it is more effective than the isomeric space, especially when considering the best obtained models.

**Table 2.** Best EF1% values reached by the various consensus models developed using the PLANTS results plus the relative average values and the corresponding performance enhancements in percentage values. As described under Methods, the consensus equations were generated by linearly combining from two to five docking scores. The EF1% values referring to one variable correspond to the performances reached by single scoring functions.


<sup>a</sup> Here and in the following rows, the performance enhancements are computed in respect to the first column reporting the EF1% values obtained without including space parameters to assess the beneficial role of the inclusion of these parameters and their combinations. The same holds for pIC50 > 6 and Tables 4, 6, and 8. <sup>b</sup> In the last column, the enhancements are computed with respect to the average value obtained by the single scoring functions to evaluate the average effect of the consensus models. The same holds for pIC50 > 6 and Tables 4, 6, and 8.

**Table 3.** Best EF1% values reached by the various consensus models developed using the LiGen results plus the relative average values and the corresponding performance enhancements in percentage values. As described under Methods, the consensus equations were generated by linearly combining from two to five docking scores. The EF1% values referring to one variable correspond to the performances of single scoring functions.



**a** In both analyses, the last two rows report the performances reached by the consensus equations generated by combining five variables and excluding the pharmacophoric distances ("five no PH distances") or considering only these parameters ("five only PH distances").

**Table 4.** Best EF1% values reached by the various consensus models developed using the Fred results plus the relative average values and the corresponding performance enhancements in percentage values. As described under Methods, the consensus equations were generated by linearly combining from two to five docking scores. The EF1% values referring to one variable correspond to the performances of single scoring functions.



**Table 5.** Best EF1% values reached by the various consensus models developed using the Glide results plus the relative average values and the corresponding performance enhancements in percentage values. As described under Methods, the consensus equations were generated by linearly combining from two to five docking scores. The EF1% values referring to one variable correspond to the performances reached by single scoring functions.

Regardless of the adopted strategy, the combination of the two monitored spaces appears to be productive only with pIC50 > 5, where it affords average improvements greater than those reached by the individual spaces. In contrast, the space combination exerts very modest effects with pIC50 > 6, since the resulting best models do not exceed the performances reached by the sole binding space. Overall, PLANTS affords encouraging performances that benefit from both the inclusion of the space descriptors and the generation of consensus linear equations. The latter induces an average improvement of around 39% with the best improvements around 70%.

While avoiding the systematic analysis of the generated consensus equations, attention will be here focused on the occurrence of the various scores in all generated models. The interested reader can find all computed consensus equations with the resulting performances for all docking programs and all performed analyses at http://www.exscalate4cov.network/. Table S1 reports the occurrence of the diverse scores in the 20 best consensus models generated by the EFO algorithm in all the 10 analyses included in Table 2 (for a total of 200 equations). In detail, Table S1 reveals the remarkable role played by the XScore [24] and PLANTS [25] scoring functions (the latter here comprising also the primary scores). Table S1 also highlights the interesting role played by the scores encoding non-polar interactions as exemplified by molecular lipophilic potential (MLP) [26] and the VEGA-based scores [27], which here correspond to the Lennard–Jones interaction energies as computed by using the CHARMM and CVFF force fields. Concerning the type of score values (in the models including the space descriptors), the best values play a prevailing role, while spread and mean values have a lesser and rather similar incidence.

#### *2.3. Results by LiGen*

As described under Methods, the analyses based on the LiGen simulations involved the primary scores, the scoring functions derived by rescoring analyses, and 30 pharmacophoric distances (PH) as generated by the LiGen software. Table 3 reports the performances reached by the LiGen results as well as the performance enhancements due to both the inclusion of the space parameters and the generation of consensus models (as percentage values).

The first key observation concerns the very limited role played by the isomeric space for both pIC50 thresholds. Such a result cannot be ascribed to too low a number of active molecules existing in multiple states, since they roughly correspond to those already considered by PLANTS (i.e., 132 out of 472 with pIC50 > 5 and 29 out of 190 with pIC50 > 6). Instead, this finding can be explained by considering that limiting the analysis to 10 poses per unique molecule does not allow an extensive exploration of the isomeric space. Moreover, the scoring functions and the search algorithm implemented by LiGen might not be very sensitive to the isomeric differences among the simulated ligands. When considering the overall satisfactory performances reached by LiGen (see below), the poor results of the isomeric space can be positively evaluated, since they suggest that this docking program can conveniently perform VS campaigns without requiring excessive isomeric expansions of the screened databases, with a beneficial reduction of computational time and complexity.

In contrast, the inclusion of the binding space descriptors markedly improves the EF1% values with an enhancement effect that is particularly evident with pIC50 > 6, where the binding space leads to almost doubled EF1% values. When considering that the LiGen program produced the lowest number of poses, the remarkable results yielded by the binding space parameters emphasize that this reduced set is still able to capture the significant poses, thus minimizing redundant results and reducing the computational costs. Based on the different effects induced by the two analyzed spaces, one may understand why their combinations are unproductive, since they yield EF1% values that are comparable with those obtained by the sole binding space.

The LiGen results seem to particularly benefit from the linear combination of diverse scoring functions: this effect is noticeable in all the experiments, as confirmed by the average increases around 30%, but is particularly remarkable with pIC50 > 6, where the consensus models lead to the doubling of the corresponding EF1% value. Taken together, LiGen provides satisfactory performances, reaching EF1% values around 25 for pIC50 > 6 and with the inclusion of the binding space descriptors.

Since the LiGen analyses also comprise the pharmacophoric distances, a specific experiment was performed to investigate their specific role. In detail, the analysis involved the generation of two additional five-variable consensus equations for each pIC50 threshold. The first model was based on the pharmacophoric distances only and the second on all computed scoring functions except for the pharmacophoric distances. The resulting EF1% values (see Table 3) highlight an overall synergistic effect between pharmacophoric distances and scoring functions. Nevertheless, Table 3 reveals different behaviors depending on the considered inhibition threshold. The pharmacophoric distances play indeed a prevailing role with pIC50 > 5, while the score values assume a more marked role with pIC50 > 6.

Table S2 compiles the incidence of the various docking scores in all the generated consensus equations and reveals the prevailing role played by the pharmacophoric distances as computed by LiGen. Apart from these distances, only PLANTS scores show a significant relevance, while the scores encoding for specific interactions exhibit negligible roles. As seen above, the best values confirm their prevailing role, but here the spread values reveal an incidence that is almost double compared to the score means.

#### *2.4. Results by Fred*

Table 4 compiles the EF1% values and the corresponding enhancements as derived by the Fred docking results. The first key consideration is that the single scoring functions provide satisfactory results even without space parameters and consensus combinations, as seen with pIC50 > 6. The notable performances elicited by simple docking scores can explain why the inclusion of the descriptors for both explored spaces exerts only limited effects. Similar to what was seen for LiGen, the isomeric space also plays here an almost negligible role, while the binding space parameters yield modest improvements by increasing the corresponding EF1% values by about 10% in the best consensus models. This result can be explained by considering that the beneficial effects of such post-docking procedures (such as rescoring, the inclusion of binding and isomeric spaces, and consensus approaches) depend on the room for improvement that the simple docking simulations show, and they inevitably decrease when the standard docking results already reveal notable predictive powers.

The docking simulations by Fred involved a number of active molecules existing in multiple states roughly comparable to those simulated by PLANTS and LiGen, despite the lowest number of total simulated isomers (i.e., 118 out of 411 with pIC50 > 5 and 27 out of 138 with pIC50 > 6). Hence, the modest relevance of the isomeric space suggests that this docking program does not require significant isomeric expansion of the screened databases to perform reliable VS campaigns. Similarly, the modest effect also exerted by the binding space suggests that the poses generated by Fred show a limited variability and tend to be focused around the best (and reasonably reliable) pose. In this way, the score averages roughly correspond to the best values, and the score spreads lose most of their relevance. As discussed above for the LiGen results, these insensitivities can be seen as positive features, which allow docking simulations to be performed without significant isomeric expansions of the screened datasets and with a reduced number of poses computed per ligand, thus minimizing the computational costs. Despite the modest effects exerted by the single spaces alone, their combinations yield encouraging results, especially with pIC50 > 5.

Table S3 shows the relative incidence of the computed scoring functions, as seen in all the consensus models generated by Fred. The key observation is that here PLANTS and XScore functions show an almost exclusive role, which minimizes the relevance of all other docking scores. While considering the unsatisfactory results afforded by space parameters and consensus strategies (see above), these results indicate that at least the rescoring calculations played a key role in enhancing the predictive power of the Fred simulations. About the type of score values, the best scores represent the most abundant values, while the mean values play a more relevant role here than the score ranges.

#### *2.5. Results by Glide*

Table 5 compiles the EF1% values (and the corresponding enhancements) as derived by using Glide. Despite the lowest number of active compounds simulated in multiple states (i.e., 105 out of 415 with pIC50 > 5 and 24 out of 164 with pIC50 > 6), Table 5 shows the markedly beneficial effect exerted by the isomeric space with both pIC50 thresholds. Although the binding space also elicits encouraging EF1% improvements for both thresholds, the isomeric space affords better results in terms of both reached EF1% values and relative performance enhancements. As already evidenced by previous studies [28], these results suggest that the search algorithms and the scoring functions implemented by Glide are strongly dependent on isomeric differences and invite the exhaustive expansion of the screened databases by considering as many as reasonable isomers/states, even when the number of simulated molecules existing in multiple states is relatively low.

While in the previous analyses the space combinations provided limited enhancements without significant differences between the two applied strategies, here the joint combination of the two spaces yields remarkable performance enhancements compared to the single spaces with both pIC50 thresholds. In contrast, the merging combination approach appears to be consistently unproductive. These results suggest that the Glide-based docking scores are able to properly account for both multiple states (by isomeric space) and multiple binding modes (by binding space). Such a sensitivity implies that the two explored spaces encode different information, and thus their descriptors can be synergistically combined, but cannot be fused into a unique space, the descriptors of which would detrimentally confuse the specific roles of the two spaces (as seen with the merging combination). Taken globally, the Glide simulations provide satisfactory predictive performances, which are, on average, comparable with those offered by LiGen, with best EF1% values ranging between 25 and 30. As discussed above, the obtained performances markedly benefit from both explored spaces (and their joint combination) as well as from the development of consensus linear equations, as assessed by an overall improvement of about 20% (with best results around 25%).

Table S4 reports the incidence of the various scoring functions in all the equations developed by using the Glide results and emphasizes the relevant role played by the 23 different primary Glide scores [29]. Notably, they appear to be particularly abundant in those experiments that afforded the best performances. The equations seem to benefit from the inclusion of scoring functions encoding non-polar interactions. Indeed, the VEGA-based scores, which here correspond to the CHARMM- and CVFF-based Lennard–Jones interaction energies, and the MLP values show an overall incidence of about 20%. Concerning the score types, the best values represent about 50%, and mean and spread values are equally abundant.

#### *2.6. Overall Comparison*

Although the differences in the protocols adopted and in the successfully docked molecules by the four tested docking programs prevent a precise comparison of the reached performances, an overall assessment of the previously discussed performances is here reported to compare the specific relevance of the computed space parameters. To do this, Figure 1 compares the reached AUC values as derived from the ROC curves corresponding to the best developed consensus models in all performed docking experiments and for both inhibition thresholds.

**Figure 1.** Comparison of the AUC values from the ROC curves for the best performing models generated by the four utilized docking programs in the five tested conditions with (**A**) pIC50 > 5 and (**B**) pIC50 > 6.

The analysis of the reported AUC values reveals results in substantial agreement with those previously discussed for EF1% values and allows for some considerations, which can be summarized as follows. The two explored spaces induce similar overall enhancing effects, with the isomeric space appearing more relevant for analyses with pIC50 > 5, reasonably due to the higher number of involved molecules existing in multiple states. The combinations of the two space parameters provide comparable performances, and rarely do they surpass those reached by the individual spaces. LiGen is the program that benefits most from the inclusion of space parameters, while Fred is the least sensitive tool. When considering the best AUC values, Figure 1 reveals that the performances of the four docking programs are overall comparable for the screening campaigns with pIC50 > 5, while PLANTS yields lower AUC values with pIC50 > 6 compared to the other three pieces of software, which in turn afford rather similar performances.

Comparative analyses were also performed by calculating the corresponding consensus models using the common database. The obtained results (Figure S1 and Table S5, Supplementary Materials) are in clear agreement with those previously discussed. The only difference involves the more limited enhancing role played by the isomeric space, which is ascribable to the reduced number of considered compounds existing in multiple states. The best consensus models developed using the common database will be used to compare the resulting rankings from the four tested docking programs (see below).

Figure 2 focuses on the enhancing role played by the development of the consensus models by showing the progressive effect exerted when including from two to five variables (in respect to the single scores). As already seen, the best improvements are reached by PLANTS and LiGen, while Glide and especially Fred show more limited effects. The beneficial role on the LiGen results might be also ascribed to the fact that their analyses involve the largest number of computed score values due to the inclusion of 30 pharmacophoric distances. However, a relation between performances and the number of variables cannot be evidenced for the other three docking programs.

**Figure 2.** Average EF1% enhancements (in percentage values) due to the generation of the consensus models when including two (V2), three (V3), four (V4), or five (V5) variables plus overall averages. The reported values are computed by averaging the EF1% enhancements for both inhibition thresholds.

When considering the progressive contribution when including from two to five variables in the consensus models, Figure 2 shows that the largest improvements are observed when shifting from one to two (around 14%) and from two to three variables (around 10%), while the inclusion of additional variables induces more limited EF1% increases (6% and 2% for four and five variables, respectively). On one hand, these results justify the choice made here of avoiding the calculation of consensus models with more than five variables. On the other hand, Figure 2 suggests that a simpler and faster analysis might be focused on consensus equations, including at most three variables that represent an optimal balance between performances, reliability, and computational costs.

To conclude this general analysis, Table S6 compiles the occurrence of the various scores as obtained by all the computed consensus models and reveals the major role played by the primary scores. Notably, primary scores also include the LiGen pharmacophoric distances, which alone represent more than 50% of all primary score values. Table S6 highlights the overall relevant role of both PLANTS and XScore scoring functions and consequently emphasizes the crucial role of rescoring procedures for enhancing the predictive power of all performed VS campaigns. The scores encoding for specific interactions play a minor role, even though Table S5 underlines the appreciable effect played by scoring functions encoding for non-polar interactions as seen by summing the relevance of both MLP and VEGA-based scores. Concerning score types, Table S5 confirms that best values represent about half of all included values and spread and mean scores show a more limited and similar incidence (around 25%).

Even though the analysis of the computed poses goes beyond the primary objective of the study, which was designed to evaluate the effects of the monitored space parameters in enhancing the performances of the performed VS campaigns, Figure 3 compares the best computed poses for a close TG-0205221 analogue included in the screened database with the recently resolved SARS-CoV-2 3CL-Pro complex [30]. While considering that the performed VS campaigns cannot simulate the formation of the covalent bond between Cys145 and the bound ligand, as seen in the reference structure (Figure 3A), the computed poses are in encouraging agreement with that of TG-0205221. Indeed, in all four shown structures, the 2-oxopyrrolidine ring approaches Asn142; the leucine side chain, which replaces the cyclohexylalanine of TG-0205221, contacts His41; and the benzoyloxy moiety approaches Pro168 and Gln192. The four complexes differ slightly in the arrangement of the electrophilic warhead, even though it always appears to be close enough to Cys145 to yield the Michael adduct.

**Figure 3.** Comparison of the resolved complex of TG-0205221 with the SARS-CoV-2 3CL Pro enzyme (**A**) with the best poses as computed for its close analogue by PLANTS (**B**), LiGen (**C**), Fred (**D**), and Glide (**E**).

#### *2.7. Mixed Consensus Models*

The analyses on the common database were primarily carried out to develop mixed consensus equations by linearly combining scores coming from different docking simulations. For simplicity's sake and considering the observed differences among the four docking runs even for the common database, the development of the mixed models was focused on the docking scores as such, without the space parameters.

Table 6 reports the best EF1% values as obtained by generating consensus equations that linearly combine scores of pairs of docking engines and the corresponding performance improvements (in percentage values). Overall, the reported EF1% values reveal interesting synergistic effects that affect most of the tested combinations with only four out of 12 cases showing no enhancement. On average, the synergistic enhancements are rather similar for both inhibition thresholds (i.e., 12% for pIC50 > 5 and 19% for pIC50 > 6). The best results are afforded by the combination of Glide plus Fred, which yields, for both pIC50 thresholds, EF1% values better than those of the single docking programs, while the best synergistic enhancement is seen when combining LiGen and PLANTS scores with pIC50 > 6 (40%).

**Table 6.** Best EF1% values as obtained by combining the simple score values (without space parameters) of two or three different docking programs and the corresponding synergistic effect (as enhancements in percentage values). For easy comparison and concerning the results for pairs of docking tools, the diagonal cells report the best EF1% value obtained by the single docking tool (in italics).


Given these encouraging results, the next analysis involved the combination of triplets of programs. Table 6 also includes the EF1% values reached by these analyses and reveals an appreciable synergistic effect for these combinations with only one case being ineffective and most cases with an EF1% increase greater than 10%. Notably, these consensus models allow reaching EF1% around 20 for pIC50 > 5 and very close to 30 for pIC50 > 6. Unfortunately, the full combination of the scores coming from all tested docking programs did not yield further improvements (results not shown).

#### *2.8. Analysis of the Best Rankings*

The last section of this study analyzes the rankings obtained by applying the best consensus models for pIC50 > 6 using the common database (see Table S7). The first part of the analysis investigates how the frequency of the molecules shared at the same time by two, three, or four rankings varies when browsing the first half of the ranking positions (Figure S1 and Table S7, from 1 to 5000). Figure S1A shows the computed trends and reveals that the frequency of common molecules found in three rankings increases with a linear trend, while the frequencies of molecules included in all the four rankings or only in two rankings show symmetric and parabolic trends. The former grows when increasing the number of monitored ranking positions, and the latter symmetrically decreases.

Figure S1B illustrates the parabolic trends as computed by considering specific pairs of rankings. This allows a graphical evaluation of the increasing overlap between the results of two docking programs. Figure S1B reveals that the highest frequency of shared compounds is provided by combining the LiGen and Fred rankings, while the other five pairs of programs yield rather similar profiles with the pairs LiGen–PLANTS and PLANTS–Fred showing the lowest frequencies of shared molecules. While being detectable even within the top 100 ranking positions, the differences between the frequencies become appreciable when considering at least the first 500 ranking positions. Hence, the following analyses will focus on the molecules included in the top 500 ranking positions.

The first analysis of the top 500 molecules of each ranking deals with their physicochemical profiling. Table 7 reports the corresponding average values and standard deviations for some key geometrical and physicochemical descriptors and allows for some relevant considerations. Firstly, limited differences are seen for the average values of the number of rotors and H-bonding groups, while molecular size (as encoded by M.W. and SAS averages) and polarity (as parameterized by PSA and log P averages) reveal a more marked variability. In detail, Table 7 shows that Glide and PLANTS select the bulkiest and the smallest set of ligands, respectively, while Fred and LiGen yield intermediate averages. Additionally, there is an expected relation between size and lipophilicity for PLANTS, LiGen, and Fred, while Glide selects ligands with a peculiar profile, since they comprise the bulkiest and the most polar molecules. Finally, concerning the property variability, Table 7 reports modest differences among the four monitored sets of ligands and for all computed descriptors, even though the Glide set shows, on average, the highest standard deviations, thus suggesting a conceivable relation between molecular size and property variability.


**Table 7.** Average values plus standard deviations for some key geometrical and physicochemical properties as computed by considering the Top 500 molecules of each ranking.

The next analysis on the top 500 ligands concerned the overlap between the four sets of selected ligands. Figure 4 shows the resulting Venn diagram with the corresponding frequency values. As seen in Figure S1B, Figure 4 confirms that the pairs Fred–LiGen and, to a minor extent, Glide–PLANTS show the highest frequencies of common molecules, while the pairs PLANTS–LiGen as well as PLANTS–Fred show the lowest degree of overlap. Accordingly, the two triplets including LiGen and Fred (LiGen–Glide–Fred and PLANTS–LiGen–Fred) show the highest degree of overlap, while PLANTS–LiGen–Glide reveals the lowest number of shared ligands. Consequently, LiGen and Fred show the lowest numbers of unshared molecules, while PLANTS has the highest number of unique ligands, which roughly correspond to one half of the analyzed set. Finally, 62 molecules are common to all the analyzed sets, and this is a remarkable result, since they represent about 12.5% of the monitored ranking positions.
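The overlap counts behind such a Venn diagram reduce to set intersections over the top-ranked lists. The following is a minimal sketch, assuming each ranking is available as a set of molecule identifiers; the function names and the toy identifiers are hypothetical and are not data from the study. Note that `shared_counts` reports inclusive intersections (a molecule shared by four rankings is also counted in every pair), whereas the regions of a Venn diagram are mutually exclusive.

```python
from itertools import combinations


def shared_counts(top_sets):
    """Count molecules shared by every pair, triplet, ... of rankings."""
    names = sorted(top_sets)
    counts = {}
    for size in range(2, len(names) + 1):
        for combo in combinations(names, size):
            shared = set.intersection(*(top_sets[name] for name in combo))
            counts[combo] = len(shared)
    return counts


def unique_counts(top_sets):
    """Count molecules found in one ranking only."""
    counts = {}
    for name, members in top_sets.items():
        others = set().union(
            *(s for other, s in top_sets.items() if other != name)
        )
        counts[name] = len(members - others)
    return counts


# Toy identifiers for illustration only.
toy = {
    "Fred": {"m1", "m2", "m3"},
    "Glide": {"m1", "m4"},
    "LiGen": {"m1", "m2", "m5"},
    "PLANTS": {"m1", "m6", "m7"},
}
print(shared_counts(toy)[("Fred", "LiGen")])
print(unique_counts(toy)["PLANTS"])
```

With real data, each set would hold the identifiers of the top 500 molecules of one ranking.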

**Figure 4.** Venn diagram representing the overlapping degree between the Top 500 molecules as derived by applying the best equations developed for each docking software. A table detailing the resulting frequencies is also included (the color code for the four docking tools is the same adopted in Figure 1).

A similar diagram was also obtained when analyzing the common scaffolds, as detected within the screened Top 500 molecules (Figure S2). The scaffold frequencies are in excellent agreement with the ligand frequencies seen in Figure 4 (r<sup>2</sup> = 0.97) and the total number of considered ligand frequencies (i.e., 1284) is slightly higher than that of scaffolds (i.e., 828). This means that each detected scaffold is shared on average by ≈1.5 ligands. Stated differently, these findings emphasize that the selected Top 500 molecules do not include congeneric series, and the chemical spaces covered by the top-ranked molecules of each docking software are rather similar (as also suggested by Table 7) with PLANTS and Glide showing the highest number of unshared scaffolds.

Finally, Table S8 compiles the common molecules shared by at least three rankings. In detail, the common molecules collected in this way are 194 (62 and 132 molecules shared by four and three rankings, respectively), among which 92 belong to the set of active compounds (with pIC50 > 5, i.e., 47%). This finding affords a further validation of the overall predictive power of the reported VS strategies, especially considering that the inhibition activity of several SARS-CoV 3CL-Pro inhibitors against the SARS-CoV-2 3CL-Pro enzyme was experimentally confirmed (as discussed above for TG-0205221 analogues). Among the other 102 molecules, there are 37 compounds that are known inhibitors of SARS-CoV 3CL-Pro but with pIC50 < 5, and 17 known inhibitors of the main proteases of other viruses (such as norovirus and HIV). These compounds represent a further confirmation of the efficacy of the performed VS campaigns, since some of the retrieved hits (rupintrivir [31], saquinavir [32], and lopinavir [33]) were experimentally confirmed as promising inhibitors of SARS-CoV-2 3CL-Pro. Finally, among the other common molecules, cobicistat [34] and galloyl analogues [35] were identified as SARS-CoV-2 3CL-Pro inhibitors.

#### **3. Methods**

#### *3.1. Library and Protein Structure Preparation*

Virtual screening studies were performed on a repurposing library, containing a unique list of 13,057 drugs. They comprise the set of safe in man drugs, commercialized or under active development in clinical phases and retrieved from the Integrity database, plus the Fraunhofer's BROAD Repurposing Library provided by Fraunhofer IME. The screened database also includes a set of 478 molecules, in particular preclinical compounds, identified as "CoV Inhibitors", which were considered as the active training set in the reported optimization/validation analyses. Hence, the screened dataset overall included 13,535 unique molecules and the set of presumably active compounds was composed of 478 inhibitors with pIC50 > 5 of which 193 molecules show a pIC50 > 6.

Even though the database was collected by selecting safe in man molecules so that the obtained results could be used also for repurposing purposes, Figure S4 compares some representative physicochemical properties of active and inactive compounds, which show rather similar distributions. To better assess the reliability of the screened database and to appreciate the role of docking simulations, a ligand-based VS campaign involving about 100 structural and physicochemical descriptors was performed. The best EF1% values as obtained by the consensus models with five variables are very low (EF1% = 5.5 and 3.7 with pIC50 > 5 and pIC50 > 6, respectively). On one hand, these poor results confirm that the simulated database does not show significant differences between active and inactive ligands that could bias the docking results presented here. On the other hand, the very modest performances reached by the ligand descriptors further emphasize the relevance of the structure-based VS workflows described here.

The reported docking simulations were performed by applying procedures as homogeneous as possible to render the obtained results as comparable as possible. Nevertheless, minor differences remain concerning the preparation of the 3D structures of both protein and ligands, primarily due to specific requirements of the docking software. Thus, the common procedures applied to prepare the input structures are here described, while the specific tasks required by each software will be reported in the following sections. All compounds were converted to 3D structures and prepared by using Schrödinger's LigPrep tool. This process generated multiple states for stereoisomers, tautomers, ring conformations (one stable ring conformer by default), and protonation states. In particular, another Schrödinger package, Epik, was used to assign tautomers and protonation states that would be dominant at a selected pH range (pH = 7 ± 1) [36]. Ambiguous chiral centers were enumerated, allowing a maximum of 32 isomers to be produced from each input structure. Then, energy minimization was performed with the OPLS3 force field [37]. In this way, 4515 compounds were characterized by multiple states (with an average of 3.1 states per compound), and a total of 23,654 ligands were generated. Docking simulations involved the monomer A of the first resolved SARS-CoV-2 3CL-Pro structure (PDB ID: 6LU7) in a covalent complex with the N3 inhibitor [9]. The protein structure preparation and the binding site characterization were performed as previously described [38]. Briefly, the protein structure was prepared by removing water molecules, crystallization additives, and the covalently bound N3 ligand. The hydrogen atoms were added by using the VEGA program [27] to remain compatible with physiological pH. The protein structure was then minimized using NAMD2 [39] while keeping the backbone atoms fixed to preserve the resolved folding.

#### *3.2. PLANTS Simulations*

Concerning PLANTS simulations, ligand conformations and atomic charges were further optimized by the semi-empirical PM7 method as implemented by MOPAC [40]. Docking simulations were performed by PLANTS, which is based on ant colony optimization (ACO) [40]. The docking search was focused within an 8 Å radius sphere around the co-crystallized N3 inhibitor and, for each compound, 10 poses were generated and ranked by the ChemPLP scoring function with the speed equal to 1. PLANTS and MOPAC calculations were carried out by exploiting Warpengine, an in-house developed system for distributed computing [41]. For the post-docking analyses, all generated poses having a ChemPLP score > 0 were discarded, and this induced the loss of 75 inactive compounds (as seen in Table 1).

#### *3.3. LiGen Simulations*

The geometrical docking procedure implemented in LiGen™, proprietary software developed by Dompé, was used for the reported docking simulations [42]. In detail, the docking search was focused within a 5.0 Å radius sphere around the co-crystallized ligand. The available void volume of the resulting pocket was defined by determining the free points within a 3D grid, which encompasses the entire binding site. The free points are used by the docking procedures as well as to define the pharmacophore schemes. Specifically, the docking engine follows a specific workflow during which three docking scores are computed: first, the Pacman Score (PS) estimates a geometric fitting by evaluating the interaction between a ligand pose and the pocket, based on shape and volume complementarity. Then, the Chemical Score (CS), which encodes for the ligand binding interaction energy, is calculated by an in-house developed scoring function [43]. The last step involves a rigid body minimization of the docked ligand within the binding site, at the end of which a third score called the optimized chemical score (Csopt) is evaluated. All poses that do not fulfill geometric fitting or threshold values of user-defined specific parameters are discarded, and this induced the loss of 744 compounds (among which were only six active molecules, as seen in Table 1).

With regard to pharmacophore analysis, the program implements three different probe atoms, based on the Tripos Force Field, to explore the binding pocket: (1) a positively charged sp3 nitrogen atom (ammonium cation), describing a hydrogen bond donor; (2) a negatively charged sp2 oxygen atom (as in a carboxyl group), representing a hydrogen bond acceptor; and (3) an sp3 carbon atom (methane), encoding for a hydrophobic group. The representative atom types can be modified, even though this selection produced the best outcomes in previous benchmarking analyses. For each free grid point, the binding energies between the probes and the protein atoms are evaluated by using an in house extended scoring function based on the work of Wang [39]. Every grid point will be identified as donor, acceptor, or hydrophobic according to which probe yields the best score.

The software then filters all the grid points to extract the key interaction sites in three steps. Firstly, the program averages the scores of all the grid points for each probe and selects those points having a score lower than the average. Based on these favorable grid points, LiGenPocket finds the pharmacophoric features. They are identified by clustering the neighboring grid points, which are here defined as grid points with the same definition (donor, acceptor, or hydrophobic), falling at a distance less than 2.0 Å from a given point. The score averages of all points belonging to the clusters identified in this way are computed for each type of grid point, and those points that show a score value lower than the average of their cluster are discarded.

Secondly, the clusters of neighboring grid points that survive the previous filtering process constitute the pharmacophoric features of the binding site. The geometric center of each cluster is thus defined as a pharmacophoric point. Lastly and for each pharmacophore element, the minimum distances between a ligand's atom and the closest compatible pharmacophoric point are calculated for each ligand. Each atom of the ligand is defined by a classification that parallels that of the used probes: four atom types are indeed considered and defined by a letter code (A, D, AD, and H are Acceptor, Donor, Acceptor and Donor, and Hydrophobic, respectively).
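The last two steps (cluster center as pharmacophoric point, then minimum distance to a compatible ligand atom) can be illustrated with a short sketch. This is not LiGen code: the function names, the data layout, and in particular the compatibility map between ligand atom codes and pharmacophoric point types are assumptions made only for illustration.

```python
import math

# Assumed compatibility between ligand atom codes and pharmacophoric point
# types (hypothetical mapping; the text only states that the codes parallel
# the probe definitions).
COMPATIBLE = {
    "A": {"acceptor"},
    "D": {"donor"},
    "AD": {"acceptor", "donor"},
    "H": {"hydrophobic"},
}


def centroid(points):
    """Geometric center of a cluster of (x, y, z) grid points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def min_pharmacophore_distances(ligand_atoms, pharm_points):
    """Minimum distance from each pharmacophoric point to a compatible atom.

    ligand_atoms: list of (code, (x, y, z)) with code in A, D, AD, H.
    pharm_points: list of (kind, (x, y, z)) with kind in
                  acceptor, donor, hydrophobic.
    Returns one value per pharmacophoric point (None if no atom matches).
    """
    distances = []
    for kind, center in pharm_points:
        matching = [
            math.dist(center, xyz)
            for code, xyz in ligand_atoms
            if kind in COMPATIBLE[code]
        ]
        distances.append(min(matching) if matching else None)
    return distances


# Toy coordinates for illustration only.
ligand = [("A", (0, 0, 0)), ("H", (3, 4, 0)), ("AD", (0, 0, 2))]
points = [
    ("acceptor", (0, 0, 1)),
    ("donor", (0, 0, 0)),
    ("hydrophobic", (0, 0, 0)),
]
print(min_pharmacophore_distances(ligand, points))
```

Each returned distance would then act as one descriptor per pharmacophoric point in the later consensus analysis.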

#### *3.4. FRED Simulations*

Ligand conformers were generated using OpenEye OMEGA [44]. Conformers with internal clashes and duplicates were discarded by the software, and the remaining ones were clustered based on the root mean square deviation (RMSD). For this virtual screening, a maximum of 200 conformers per compound, clustered with an RMSD of 0.5 Å, was used. If the number of conformers generated exceeds the specified maximum, only the ones with the lowest energies are retained. For 1780 molecules, the generation of the rotamers was not possible due to stereochemistry issues and/or the presence of large macrocycles, and they were removed from the library. The resulting library consisted of 11,755 molecules. The target protein was processed using UCSF Chimera (v1.14). AMBER ff14SB was used to assign parameters to the standard residues, whereas the Antechamber module was used for the nonstandard residues. The charges for the nonstandard residues were calculated using the AM1-BCC method. The structure was minimized with 100 steps of steepest descent and 10 steps of conjugate gradient, using the MMTK module. Rigid docking was then performed using OpenEye FRED [45] included in the OEDocking 3.4.0.2 suite (OpenEye Scientific Software, Santa Fe, NM. http://www.eyesopen.com). Each docked pose is scored using the Gaussian Shape scoring function. Finally, top scoring poses are converted into density fields to form the final shape potential field. The highest values in this field represent points where molecules can have a high number of contacts, without clashing into the protein structure. In its exhaustive search, FRED translates and rotates the structure of each conformer within the negative image of the active site and scores each pose. The first FRED step has a default translational and rotational resolution of 1.0 and 1.5 Å, respectively. The 100 best scoring poses are then optimized with translational and rotational single steps of 0.5 and 0.75 Å, respectively, exploring all the 729 (six degrees of freedom with three positions = 3<sup>6</sup>) nearby poses. The best scoring pose is retained and assigned to the compound. The binding poses were evaluated by using the Chemgauss4 scoring function implemented in OpenEye FRED [28]. For 24 molecules, the docking algorithm was unable to find a suitable binding pose, and these molecules were thus discarded from the analysis.
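The count of 729 nearby poses follows from taking three positions (one step back, no move, one step forward) along each of the six rigid-body degrees of freedom. The following is a minimal sketch of that enumeration, assuming the 0.5 and 0.75 Å single steps quoted above; the function name and the tuple layout are illustrative and are not part of FRED.

```python
from itertools import product


def neighbor_offsets(trans_step=0.5, rot_step=0.75):
    """Enumerate rigid-body offsets around a pose (illustrative only).

    Each of the six degrees of freedom (three translations, three
    rotations) takes one of three values: -step, 0, or +step.
    """
    trans = (-trans_step, 0.0, trans_step)
    rot = (-rot_step, 0.0, rot_step)
    # Three translational axes followed by three rotational axes.
    return list(product(trans, trans, trans, rot, rot, rot))


offsets = neighbor_offsets()
print(len(offsets))  # 3**6 combinations, including the unmoved pose
```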

#### *3.5. Glide Simulations*

To perform docking experiments with Glide, the protein was preprocessed by the Protein Preparation Wizard from the Schrödinger Suite version 2019-4 with the default parameters [28]. The protonation states of each side chain were generated using Epik for pH = 7 ± 2 [36]. Protein minimization was performed using the OPLS3 force field [37]. All water molecules were removed. Glide software [29] was used for the docking calculations. Internal receptor grid boxes of 10 Å × 10 Å × 10 Å were defined and centered on the ligand atom position. The size of the outer binding box was determined by the ligand size (27 × 27 × 27 Å). A standard precision (SP) Glide docking was carried out, generating 20 poses per docked molecule. H-bond constraints with D166, H163, and H164 were applied. Docking results were analyzed using the Glide docking score, version 5.0 [46]. Here, the Glide score was used to extract the best binding pose for each ligand. This is an empirical scoring function able to reproduce the trends of the binding affinity and is defined by the following equation:

$$GScore = a \cdot vdW + b \cdot Coul + Lipo + HBond + Metal + Rewards + RotB + Site$$

where: *vdW* = van der Waals interaction energy; *Coul* = Coulomb interaction energy; *Lipo* = Lipophilic-contact plus phobic-attractive term; *HBond* = Hydrogen-bonding term; *Metal* = Metal-binding term (usually a reward); *Rewards* = Various reward or penalty terms; *RotB* = Penalty for freezing rotatable bonds; *Site* = Polar interactions in the active site, and the coefficients of vdW and Coul are: *a* = 0.050, *b* = 0.150 for Glide 5.0 (the contribution from the Coulomb term is capped at −4 kcal/mol).
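As a worked illustration of how the terms combine, the sketch below evaluates the sum with the quoted coefficients. It is not Glide code: the function name and the input values are hypothetical, and applying the −4 kcal/mol cap to the weighted Coulomb contribution (b · Coul) is an assumption about how the cap stated above is meant.

```python
def glide_score(vdw, coul, lipo, hbond, metal, rewards, rotb, site,
                a=0.050, b=0.150, coul_cap=-4.0):
    """Sum the GScore terms (illustrative sketch, not the Glide code).

    Assumption: the cap applies to the weighted Coulomb contribution,
    so b * coul cannot contribute less than coul_cap (kcal/mol).
    """
    coul_term = max(b * coul, coul_cap)
    return (a * vdw + coul_term + lipo + hbond
            + metal + rewards + rotb + site)


# Hypothetical term values in kcal/mol.
print(glide_score(vdw=-40.0, coul=-10.0, lipo=-1.0, hbond=-1.0,
                  metal=0.0, rewards=0.0, rotb=0.5, site=0.0))
```

With these inputs the Coulomb contribution (−1.5 kcal/mol) stays above the cap; the cap only matters when b · Coul falls below −4 kcal/mol.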

#### *3.6. Rescoring Calculations*

All the poses generated by the four docking programs were rescored by ReScore+ [47]. The computed scoring functions comprise (a) the various components of PLANTS [25] and XScore [24] scoring functions; (b) a set of scores computed by the VEGA suite, which encodes for polar and non-polar interaction energies [27]; (c) the MLP interactions scores for hydrophobic contacts [26]; (d) the recently proposed Contacts scores [18], which are simply based on the number of surrounding residues; and (e) the APBS score for evaluating ionic interactions [48]. Both the primary scores and the values from rescoring calculations were utilized to calculate binding and isomeric spaces as well as their combinations by applying a joining and a merging strategy (see below). For each considered scoring function, each explored space is defined by the following values: (1) the best scores including both the lowest and the highest values (notice that the best value is not the lowest one for all scores); (2) the average score value; and (3) the score range and the standard deviation to encode for the spread of score values. For each ligand, all the generated poses were utilized to calculate the corresponding space parameters without exceptions.

In detail, the binding space was computed by averaging the computed scores for the poses of a given molecule/isomer. For molecules existing in multiple states, the space parameters corresponding to the isomer with the best primary score were considered. Similarly, the isomeric space was calculated by averaging the computed scores of the pose with the best primary score for all the isomers (clearly only for molecules existing in multiple states). In the so-called merged combination, the space parameters were calculated by averaging together the computed scores of all poses and all isomers. In the so-called joint combination, the consensus equations were developed by simultaneously considering the space parameters as computed for both binding and isomeric spaces. The descriptors for the binding and isomeric spaces were computed by using ad-hoc scripts of the VEGA suite of programs [27].
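The definitions above can be sketched for a single scoring function as follows. This is a simplified illustration and not the VEGA scripts used in the study: the function names and toy scores are hypothetical, one score stands in for both the primary score and the rescored value, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def space_descriptors(scores):
    """Best (lowest and highest), mean, range, and standard deviation."""
    return {
        "min": min(scores),
        "max": max(scores),
        "mean": mean(scores),
        "range": max(scores) - min(scores),
        "std": pstdev(scores),  # population std: an assumption
    }


def compute_spaces(poses_by_isomer, lower_is_better=True):
    """Return (binding, isomeric, merged) descriptors for one molecule.

    poses_by_isomer maps an isomer/state label to the scores of its poses.
    """
    pick = min if lower_is_better else max
    best_per_isomer = {iso: pick(s) for iso, s in poses_by_isomer.items()}
    best_isomer = pick(best_per_isomer, key=best_per_isomer.get)
    # Binding space: all poses of the isomer with the best score.
    binding = space_descriptors(poses_by_isomer[best_isomer])
    # Isomeric space: the best pose of every isomer.
    isomeric = space_descriptors(list(best_per_isomer.values()))
    # Merged combination: all poses of all isomers pooled together.
    merged = space_descriptors(
        [s for scores in poses_by_isomer.values() for s in scores]
    )
    return binding, isomeric, merged


# Toy scores for a molecule with two states (illustrative values).
toy_poses = {"iso1": [-80.0, -70.0, -60.0], "iso2": [-90.0, -50.0]}
binding, isomeric, merged = compute_spaces(toy_poses)
print(binding["mean"], isomeric["mean"], merged["mean"])
```

In this picture, the joint combination corresponds to offering both the `binding` and the `isomeric` descriptors to the consensus search as separate variables, whereas the merged combination replaces them with the single pooled set.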

#### *3.7. Consensus Analyses*

The consensus analyses involved the primary scores and the scoring functions as computed by rescoring procedures. Notably, the analysis of the LiGen results also comprised the pharmacophoric distances as computed by this tool. The consensus analyses were performed by the EFO approach, which generates linear combinations of score values by exhaustively combining all possible variables and by optimizing a quality function based on both the early recognition (as encoded by the corresponding EF1% values) and the entire ranking (as encoded by an asymmetry index applied to the distribution of the active molecules) [33].
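To make the early-recognition term concrete, the sketch below computes an enrichment factor for the top 1% of a ranking and applies it to a linear combination of scores. It uses the commonly adopted definition of the enrichment factor (hit rate in the top fraction divided by the hit rate in the whole set), which is assumed here rather than taken from the EFO code; the asymmetry index is not reproduced, and all names and toy values are hypothetical.

```python
def enrichment_factor(scores, is_active, fraction=0.01, lower_is_better=True):
    """EF at the given fraction of the ranked list (standard definition)."""
    n = len(scores)
    n_top = max(1, round(n * fraction))
    order = sorted(range(n), key=lambda i: scores[i],
                   reverse=not lower_is_better)
    hits_top = sum(is_active[i] for i in order[:n_top])
    total_actives = sum(is_active)
    return (hits_top / n_top) / (total_actives / n)


def consensus_score(rows, coefficients):
    """Linear combination of the score values of each molecule."""
    return [sum(c * x for c, x in zip(coefficients, row)) for row in rows]


# Toy set: 200 molecules, the 10 best-scored ones are the actives.
toy_scores = [float(i) for i in range(200)]
toy_labels = [1] * 10 + [0] * 190
print(enrichment_factor(toy_scores, toy_labels))
```

A consensus model would be evaluated by passing the output of `consensus_score` to `enrichment_factor` in place of a single score.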

Given the high number of descriptors analyzed here, an incremental search algorithm was implemented alongside the already included exhaustive search method. In particular, given n descriptors in the input dataset, the equations with k variables are built by considering only the top ranked m equations with k − 1 variables (m is a user-defined parameter and is set by default equal to 30) and by combining them with the n descriptors avoiding repetitions. Therefore, the models to be evaluated are m(n − k + 1) instead of the number of all possible combinations without repetitions, which is equal to n!/[k!(n − k)!]. A benchmark analysis comparing the consensus models generated by incremental and by exhaustive searches revealed that the former involves a performance decrease of about 10% as assessed by the relative EF1% values when generating the three-variable equations for the simplest case without space parameters (data not shown). Such a performance loss is seen as acceptable by considering that the incremental search algorithm allows the analysis of very extended datasets of descriptors and the development of consensus models including more than three variables. Furthermore, the reduced performances of the incremental algorithm equally affected all the comparative analyses performed here. This new search approach was implemented into a standalone and highly optimized version of EFO, which does not require the full installation of the VEGA program and can be freely downloaded at www.vegazz.net. In detail, 20 consensus models were generated for each analysis by combining all input variables without preliminary filtering processes and by always using the incremental search approach. The consensus equations were developed by including from one to five variables. The predictive power of the resulting equations was assessed by subdividing the dataset into training (70%) and test sets (30%) and repeating this task 10 times to minimize the randomness.
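The saving offered by the incremental search can be checked numerically, and the search itself resembles a beam search over descriptor subsets. The sketch below is a hedged illustration and not the EFO implementation: the function names are hypothetical, `quality` stands for any user-supplied function that scores a subset of descriptor indices, and the fitting of the linear coefficients is left out.

```python
from math import comb


def models_evaluated(n, k, m=30):
    """Models built for k variables: (incremental, exhaustive)."""
    return m * (n - k + 1), comb(n, k)


def incremental_search(n, k_max, quality, m=30):
    """Keep the top m subsets at each size and extend them by one index."""
    # One variable: every single descriptor is evaluated.
    beam = sorted(((i,) for i in range(n)), key=quality, reverse=True)[:m]
    for _ in range(2, k_max + 1):
        candidates = {
            tuple(sorted(subset + (j,)))
            for subset in beam
            for j in range(n)
            if j not in subset
        }
        # Sort first by subset so that ties are resolved reproducibly.
        beam = sorted(sorted(candidates), key=quality, reverse=True)[:m]
    return beam


print(models_evaluated(n=100, k=3))

# Toy quality function: the sum of hypothetical per-descriptor weights.
weights = [5, 1, 4, 2, 3]
best = incremental_search(n=5, k_max=2, m=2,
                          quality=lambda s: sum(weights[i] for i in s))
print(best[0])
```

For 100 descriptors and three variables, the incremental search builds 2940 models against 161,700 for the exhaustive one, which illustrates why the approach scales to extended descriptor sets.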

#### **4. Conclusions**

The study describes and compares a set of VS campaigns performed by using four different docking tools to repurpose an extended set of safe-in-man molecules as potential inhibitors of the SARS-CoV-2 3CL-Pro enzyme. To assess the predictive performance of the docking strategies proposed here, and given the lack of known inhibitors for the considered enzyme, the distinctive idea of the study is to exploit a training set composed of approximately 500 compounds that were reported as effective inhibitors (i.e., pIC50 > 5) of the SARS-CoV 3CL-Pro enzyme, a choice justified by the very high degree of conservation between these two viral proteases.

All docking simulations were carried out by generating more than one pose per ligand and explicitly considering all possible isomers/states for those molecules existing in multiple states. In this way and after a rescoring analysis of all computed poses, the obtained results were utilized to calculate the descriptors of the resulting binding and isomeric space. The so-calculated score values for each utilized docking program were finally employed to develop consensus models by linearly combining them using the EFO approach.

Taken together, the obtained results allow for some concluding considerations, which can be summarized as follows:


Since the SARS-CoV-2 3CL-Pro is functionally active as a homodimer and many X-ray structures were recently resolved, the encouraging results reported here suggest repeating similar docking protocols while simultaneously considering both monomers of different representative resolved structures, in order to evaluate the enhancing effects exerted by binding and isomeric spaces in the resulting ensemble simulations. The same strategies could also be applied to explore representative frames from MD simulations.

Even though the use of the known SARS-CoV 3CL-Pro inhibitors was justified by the very high degree of conservation between these two enzymes, there is no doubt that the developed predictive models could be enhanced by utilizing true SARS-CoV-2 3CL-Pro inhibitors. The simulations described here could be exploited in the near future to develop increasingly accurate predictive models as novel true inhibitors are identified.

To conclude, the VS campaigns reported here emphasize the generally beneficial effects of the applied post-docking procedures, even though their specific roles vary significantly among the software packages used. These observed differences can be ascribed to the different algorithms implemented for docking search and scoring calculations, but they can also be due to the intrinsic geometrical and physicochemical features of the SARS-CoV-2 3CL-Pro binding pocket. This last consideration emphasizes that a robust assessment of the role of binding and isomeric spaces could require extensive benchmarking studies involving wide sets of diverse target proteins.

**Supplementary Materials:** The following are available online: Figure S1 Best EF1% values as obtained by the four tested docking programs using the common databases with pIC50 > 5 (S1A) and pIC50 > 6 (S1B); Figure S2 Trends of the frequency of the molecules shared at the same time by two, three or four rankings (S2A) or by specific pairs of rankings (S2B) when browsing the first half of the ranking positions (from 1 to 5000); Figure S3 Venn diagram showing the frequencies of the scaffolds detected within the Top500 molecules of the four computed rankings; Figure S4 Distribution of some selected physicochemical properties for the simulated molecules; Table S1 Occurrence of the various score values as observed in all the consensus models generated by using the PLANTS docking results; Table S2 Occurrence of the various score values as observed in all the consensus models generated by using the LiGen docking results; Table S3 Occurrence of the various score values as observed in all the consensus models generated by using the Fred docking results; Table S4 Occurrence of the various score values as observed in all the consensus models generated by using the Glide docking results; Table S5 Enrichment factors as obtained by analyzing the common dataset; Table S6 Occurrence of the various score values as observed in all the consensus models generated by all docking programs; Table S7 Rankings as derived by using the best consensus models for the four utilized docking programs; Table S8 Common molecules as found within the top 500 positions of at least three rankings. All the generated consensus models for all performed analyses with the corresponding score values are available at http://www.exscalate4cov.network.

**Author Contributions:** A.R.B., G.R., C.M. and G.V. designed the study; C.M. prepared the screened database, C.M. and C.T. performed LiGen simulations; S.G. performed PLANTS simulations; A.P. developed the EFO approach; G.V. wrote the manuscript; S.A. performed Fred experiments; J.G. performed Glide experiments; S.A., J.G., B.J.P., F.M. and G.R. planned and validated Glide and Fred setup; A.R.B.: project administration and funding acquisition. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was conducted under the project "EXaSCale smArt pLatform Against paThogEns for Corona Virus–Exscalate4CoV" funded by the EU's H2020-SC1-PHE-CORONAVIRUS-2020 call, grant N. 101003551.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Data available in a publicly accessible repository that does not issue DOIs. This data can be found here: http://www.exscalate4cov.network.

**Conflicts of Interest:** The authors declare no conflict of interest.

**Sample Availability:** Not available.

#### **Abbreviations**



#### **References**


**Pierre Laville , Michel Petitjean and Leslie Regad \***

Université de Paris, BFA, UMR 8251, CNRS, ERL U1133, Inserm, F-75013 Paris, France; pierre.laville@u-paris.fr (P.L.); petitjean.chiral@gmail.com (M.P.) **\*** Correspondence: leslie.regad@u-paris.fr; Tel.: +33-1-57-27-82-72

**Abstract:** The use of antiretroviral drugs is accompanied by the emergence of HIV-2 resistances. Thus, it is important to elucidate the mechanisms of resistance to antiretroviral drugs. Here, we propose a structural analysis of 31 drug-resistant mutants of HIV-2 protease (PR2), which is an important target against HIV-2 infection. First, we modeled the structures of each mutant. We then located structural shifts putatively induced by mutations. Finally, we compared wild-type and mutant inhibitor-binding pockets and interfaces to explore the impacts of these induced structural deformations on these two regions. Our results showed that one mutation could induce large structural rearrangements in side-chain and backbone atoms of the mutated residue, in its vicinity, or further away. Structural deformations observed in side-chain atoms are frequent and of greater magnitude, which confirms that, to fight drug resistance, interactions with backbone atoms should be favored. We showed that these observed structural deformations modify the conformation, volume, and hydrophobicity of the binding pocket and the composition and size of the PR2 interface. These results suggest that resistance mutations could alter ligand binding by modifying pocket properties and PR2 stability by impacting its interface. Our results reinforce the understanding of the effects of mutations that occurred in PR2 and the different mechanisms of PR2 resistance.

**Keywords:** drug-resistance mutations; HIV-2 protease; structural characterization; induced structural deformations

#### **1. Introduction**

The human immunodeficiency virus (HIV) infects humans and causes the acquired immunodeficiency syndrome (AIDS). The treatment of HIV-1 infection relies on molecules that target four HIV proteins: the fusion protein, protease (PR), integrase, and reverse transcriptase. The same molecules are used to fight HIV-2 infection, but HIV-2 is naturally resistant to most of these inhibitors [1–9]. Thus, it is important to develop new molecules that inhibit HIV-2 replication, particularly by targeting the HIV-2 protease (PR2).

PR is a homodimer that plays a major role in the virus maturation process: it hydrolyzes the viral polyproteins Gag and Gag-Pol, causing the development of mature virions. There are currently nine protease inhibitors (PIs) clinically recommended for treating HIV-1 infection [10]. These drugs bind to the PR catalytic site at the interface of the two monomers. This binding induces structural changes in the entire PR2, particularly in the flap region, allowing the closing of the binding pocket [10–12]. PR2 is naturally resistant to most of these PIs and only three of them are recommended for the treatment of HIV-2 infection: darunavir (DRV), saquinavir (SQV), and lopinavir (LPV) [1,10]. The natural resistance of PR2 is explained by amino-acid changes between PR1 and PR2 that induce structural changes in the entire structure [13]. Some of these structural changes, located in the binding pocket, modify the properties and conformation of the PI-binding pocket and the internal interactions between PR2 and PIs [3,5,14–20]. Other structural changes, occurring in the elbow and flap regions, alter the transition between the open and closed forms involved in ligand binding [13,21,22].

**Citation:** Laville, P.; Petitjean, M.; Regad, L. Structural Impacts of Drug-Resistance Mutations Appearing in HIV-2 Protease. *Molecules* **2021**, *26*, 611. https://doi.org/10.3390/molecules26030611

Academic Editors: Marco Tutone and Anna Maria Almerico Received: 18 December 2020 Accepted: 19 January 2021 Published: 25 January 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

In addition to its natural resistance, many acquired resistance mutations appear in PR2. These mutations have been identified through genome sequencing studies of HIV-2 virus extracted from infected patients [2,4–6,8,23–28]. For example, it has been shown that the V47A, I50V, I54M, I82F, I84V, and L90M mutations lead to resistance to several PIs [2,4,5,7,8]. Phenotypic susceptibility assays were used to confirm the resistance of some mutants [2,4,5,8,26,29]. For example, genotypic and phenotypic analyses showed that the I82F mutation causes resistance to indinavir (IDV) [26]. Furthermore, this mutation has been identified as causing hypersusceptibility to both DRV and SQV using phenotypic assays [2]. In addition, combinations of several mutations confer high resistance to several PIs [2,4,5,7,8,30]. For example, the I54M and I82F mutations induce resistance to all PIs and the V62A/L99F mutant is resistant to nelfinavir, IDV, and LPV [26]. Few studies have focused on the structural analysis of the impacts of drug-resistance mutations reported in PR2 because no three-dimensional (3D) structure of a PR2 mutant has been solved. Such structural studies could help to understand the atomistic mechanism of resistance mutations. In our previous work, we performed the first structural analysis of the impact of 30 drug-resistant mutants of PR2 based on modeled structures. More precisely, we explored the consequences of drug-resistance mutations on PR2 structural asymmetry, an important property for ligand binding and target deformation [31].
Our findings suggested three possible resistance mechanisms of PR2: (i) mutations that induce structural changes in the binding pocket that could directly alter PI binding, (ii) mutations that could impact the properties and conformation of the binding pocket by inducing structural changes in residues outside of the binding pocket but involved in interactions with pocket residues, and (iii) mutations that could modify the PR2 interface and its stability through structural changes in interface residues. These results were based on PR2 backbone analysis. However, a characterization of the structural impacts of drug-resistance mutations on the PR2 structure that includes side-chain atoms could lead to a better understanding of the different proposed mechanisms.

In this study, we structurally analyzed a set of 31 drug-resistant mutants of PR2 that was updated relative to [31]. The 3D structure of each mutant was modeled and compared to the wild-type PR2 to locate structural rearrangements induced by drug-resistance mutations at backbone and side-chain atoms. This study shows that drug-resistance mutations could impact the flexibility of PR2 and the closing of the binding pocket, the conformation and properties of the PI-binding pocket, and the composition and size of the PR2 interface.

#### **2. Results**

We studied the impact of 22 drug-resistance mutations on PR2 structure. These mutations appeared alone or in combination with others (two or three mutations per mutant), resulting in a set of 31 mutants (Figure 1A). First, mutant structures were modeled using FoldX software [32] and an energetic minimization step using the crystallographic structure of the wild-type PR2 (PDB code: 3EBZ [33]). Five 3D structures were built for each mutant to consider the different possible rotamers per amino acid, as illustrated by Figure A1. This mutagenesis process resulted in a set of 155 mutant structures. The crystallographic structure of the wild-type PR2 (PDB code: 3EBZ [33]) was also energetically minimized with the protocol used for mutant structures. In the following, the minimized wild-type structure is referred to as the wild-type structure, and it was compared to the minimized mutant structures.

**Figure 1.** Description of the 31 drug-resistant mutants studied in this analysis. (**A**) Table listing the 31 drug-resistant mutants studied in this analysis. Single mutants are colored in magenta, double mutants in green and triple mutants in blue. (**B**) Location on PR2 structure of the 22 drug-resistance mutations included in the 31 mutant set. PR2 is represented in cartoon mode and colored according to the 13 regions defined in [34,35]. Mutations are represented in stick mode. (**C**) Amino acid sequence of PR2 presenting the limit of the 13 PR2 regions. All the mutated residues are colored in red in the sequence.

#### *2.1. Identification of Atom Shifts in the Mutant-Structure Set*

We first explored the structural deformations induced by mutations by locating shifted atoms in mutant structures relative to the wild-type structure. The shift of an atom was quantified by *dist<sub>WT−mutant</sub>*, i.e., the distance between its positions in the wild-type and mutant structures. Figure A2 presents the distribution of *dist<sub>WT−mutant</sub>* of all atoms in the mutant structure set, except hydrogen atoms. These distances varied from 0.05 to 0.05, with 95% of atoms exhibiting a *dist<sub>WT−mutant</sub>* smaller than 0.14 Å. These distances were summarized per structure by computing the RMSD between the wild-type and mutant structures (Figure A2). As expected, mutant structures exhibited conformations close to that of the wild-type, resulting in an average RMSD of 0.12 ± 0.09 Å. Only one mutant structure (one structure of the K7R mutant) exhibited an RMSD higher than 0.5 Å. As in Liu et al., 2008 [36], we considered that a shift was significant if it had a magnitude higher than 0.3 Å because of uncertainties in the X-ray and mutant structures. Thus, to identify atom shifts induced by mutations, we retained the 2136 atoms with a *dist<sub>WT−mutant</sub>* higher than 0.3 Å in at least three structures of a given mutant. These selected atoms were denoted as "mutant-conserved shifted atoms" (MCS atoms) and the distribution of their *dist<sub>WT−mutant</sub>* is provided in Figure 2A. Most MCS atoms (70%) exhibited a shift of weak magnitude (<1 Å). However, 17% of MCS atoms exhibited shifts of large magnitude (>2 Å), such as the atom shifts detected in residue K69\_A (Lysine 69 in chain A) in the L99F mutant, illustrated in Figure 2B. Figure 2C presents another type of atom shift that corresponds to a flip of a ring of residue 3\_B (residue 3 in chain B) in the five structures of the V62A/L99F mutant. This rearrangement type does not induce structural deformation and thus it does not seem to be linked to PI resistance.
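The selection rule for MCS atoms (a shift above 0.3 Å in at least three of the five structures of a mutant) can be sketched as follows. This is a minimal illustration and not the authors' code: it assumes that the wild-type and mutant structures are already expressed in the same coordinate frame and that each structure is given as a dictionary mapping a hypothetical atom identifier (chain, residue number, atom name) to its coordinates. The toy coordinates are invented for the example.

```python
from math import dist

SHIFT_CUTOFF = 0.3   # Å, significance threshold quoted in the text
MIN_STRUCTURES = 3   # a shift must be seen in at least 3 of the 5 models


def find_mcs_atoms(wild_type, mutant_models, cutoff=SHIFT_CUTOFF, min_models=MIN_STRUCTURES):
    """Return {atom_id: [shift per model]} for atoms shifted in enough mutant models.

    Atoms missing from a mutant model (e.g., removed by the substitution)
    are simply skipped for that model.
    """
    selected = {}
    for atom_id, wt_xyz in wild_type.items():
        shifts = [dist(wt_xyz, model[atom_id]) for model in mutant_models if atom_id in model]
        if sum(shift > cutoff for shift in shifts) >= min_models:
            selected[atom_id] = shifts
    return selected


# Toy data: two atoms of one residue, five models of one mutant.
wild_type = {("A", 69, "NZ"): (0.0, 0.0, 0.0), ("A", 69, "CA"): (1.0, 0.0, 0.0)}
mutant_models = [
    {("A", 69, "NZ"): (2.5, 0.0, 0.0), ("A", 69, "CA"): (1.05, 0.0, 0.0)},
    {("A", 69, "NZ"): (2.5, 0.0, 0.0), ("A", 69, "CA"): (1.05, 0.0, 0.0)},
    {("A", 69, "NZ"): (2.5, 0.0, 0.0), ("A", 69, "CA"): (1.05, 0.0, 0.0)},
    {("A", 69, "NZ"): (0.1, 0.0, 0.0), ("A", 69, "CA"): (1.05, 0.0, 0.0)},
    {("A", 69, "NZ"): (0.1, 0.0, 0.0), ("A", 69, "CA"): (1.05, 0.0, 0.0)},
]
mcs = find_mcs_atoms(wild_type, mutant_models)
print(list(mcs))
```

In this toy case only the NZ atom is retained, because its shift exceeds the cutoff in three models, whereas the CA atom never moves by more than 0.05 Å.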

**Figure 2.** (**A**) Distribution of *dist<sub>WT−mutant</sub>* distances for MCS atoms extracted from the set of 155 mutant structures. Magenta lines correspond to the cutoffs used to define weak shifts (0.3 Å < *dist<sub>WT−mutant</sub>* < 1 Å), moderate shifts (1 Å ≤ *dist<sub>WT−mutant</sub>* < 2 Å), and large shifts (*dist<sub>WT−mutant</sub>* > 2 Å). (**B**,**C**) Illustration of atom shifts in the L99F (**B**) and V62A/L99F (**C**) mutants. (**B**) Superimposition of the five structures of the L99F mutant and the wild-type structure. The wild-type structure is colored in orange and represented in line and cartoon modes. The five structures of the mutant are represented in line mode and colored in magenta, cyan, blue, green, and pink. The L99F mutated residue is represented in stick mode. (**C**) Illustration of the structural shift occurring at residue 3\_B in the V62A/L99F mutant. The wild-type structure is represented in cartoon and line modes and colored according to its two chains: chain A is colored in purple and chain B is colored in marine blue. The mutant V62A/L99F structure is represented in lines and colored in green. The mutated residue 99\_A and shifted residue 3\_A are represented in stick mode. The arrow represents the *dist<sub>WT−mutant</sub>* of the CE atom of residue 3\_B computed between the wild-type and the first structure of mutant V62A/L99F. *CE2<sub>V62A/L99F</sub>* and *CE2<sub>WT</sub>* correspond to atom CE of residue 3 of chain B in the V62A/L99F and wild-type structures and are represented in sphere mode.

As expected, side-chain atoms were overrepresented in the MCS atom set (Pearson's Chi-squared test *p*-value = 3 × 10<sup>−28</sup>) and they had larger *dist<sub>WT−mutant</sub>* values than backbone atoms (Student's *t*-test *p*-value = 6 × 10<sup>−95</sup>). This means that drug-resistance mutations have more impact on side-chain atoms than on backbone atoms. Of the 2136 MCS atoms, 543 (25%) are atoms of mutated residues. The shifts of these atoms, named direct shifts, were a direct consequence of mutations. In contrast, 1593 (75%) MCS atoms corresponded to indirect shifts, i.e., they occurred in non-mutated residues, and their shifts resulted either from the intrinsic flexibility of atoms or from indirect impacts induced by the mutation through contacts between these atoms and mutated residues. Direct shifts had a larger magnitude than indirect shifts, i.e., they exhibited, on average, higher *dist<sub>WT−mutant</sub>* distances than indirect shifts (Student's *t*-test *p*-value = 3 × 10<sup>−13</sup>). To distinguish structural shifts resulting from mutation from those induced by flexibility, we detected MCS atoms in non-minimized structures, i.e., the mutant structures output by FoldX software. From this set of non-minimized structures, we located 883 MCS atoms, 646 of which were also detected as MCS atoms in the minimized structure set. This means that the shifts of these MCS atoms observed in minimized structures (30% of MCS atoms) were the consequence of the mutation. As expected, these shifts, observed in minimized structures, occurred only in side-chain atoms and in or close to the mutated residue because FoldX software optimizes the configuration of only the side chains in the vicinity of the mutated residue. The remaining detected shifts were explained by the mutation and the intrinsic flexibility of atoms.
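The distinction between direct and indirect shifts, and the proportions quoted above, can be reproduced with a few lines of set arithmetic. The sketch below is illustrative only: the atom identifiers (chain, residue number, atom name) and the toy MCS set are hypothetical, while the counts 543, 1593, 646, and 2136 are those reported in the text.

```python
def split_direct_indirect(mcs_atoms, mutated_residues):
    """Split MCS atoms into those belonging to mutated residues (direct) and the rest (indirect)."""
    direct = {atom for atom in mcs_atoms if (atom[0], atom[1]) in mutated_residues}
    indirect = set(mcs_atoms) - direct
    return direct, indirect


# Toy example for a hypothetical L99F-like mutant (mutation present in both chains).
toy_mcs = {("A", 99, "CZ"), ("A", 69, "NZ"), ("B", 68, "CB")}
toy_mutated = {("A", 99), ("B", 99)}
direct, indirect = split_direct_indirect(toy_mcs, toy_mutated)
print(sorted(direct), sorted(indirect))

# Proportions reported in the text, recomputed from the raw counts.
total_mcs = 2136
pct_direct = round(100 * 543 / total_mcs)              # atoms of mutated residues
pct_indirect = round(100 * 1593 / total_mcs)           # atoms of non-mutated residues
pct_seen_before_minimization = round(100 * 646 / total_mcs)
print(pct_direct, pct_indirect, pct_seen_before_minimization)
```

The same set-based approach applies to the comparison between minimized and non-minimized structures: the 646 atoms are the intersection of the two MCS sets.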

Figure 3A shows that the entire PR2 structure was sampled by MCS atoms. A total of eight regions had few MCS atoms, i.e., fewer than 50 MCS atoms were detected in the 155 mutant structures (Figure 3). In contrast, some regions exhibited many MCS atoms, such as the cantilever and flap regions of the two chains and the *α*-helix region of chain B, with more than 150 MCS atoms. In addition, an asymmetry between the two chains in terms of the number of MCS atoms was observed. Indeed, chain B contained more MCS atoms than chain A (Pearson's Chi-squared test *p*-value = 5 × 10<sup>−24</sup>). For example, the Nter, fulcrum, elbow, and R3 regions of chain A presented few structural shifts, while they exhibited many deformations in chain B. Thus, even though mutations occurred in the two chains, they did not affect the two monomers in the same way.

**Figure 3.** (**A**) Number of MCS atoms per mutant counted in the two PR2 chains. Each PR2 chain was divided into 13 regions according to [13]. Numbers in brackets indicate the size of each region in terms of amino acids. (**B**) Number of MCS atoms (grey) and MCS atoms with large magnitude (*dist<sub>WT−mutant</sub>* ≥ 2 Å, cyan) per mutant. Mutants were sorted according to their number of MCS atoms.

Figure 3B presents the number of MCS atoms per mutant. On average, a mutant contained a moderate number of MCS atoms (16.26 ± 26.09). The I82M mutant was particular because it had no MCS atoms, revealing that the I82M mutation induced few structural changes in the PR2 structure. In contrast, the K7R mutant was the mutant with the most MCS atoms (147 MCS atoms located in 53 different residues, as illustrated in Figure 4). A total of 14 mutants had fewer than 10 MCS atoms, indicating that these mutations had little impact on the PR2 structure (Figure 1A). Except for the I54L/L90M mutant, all these mutants were single mutants. Although these mutants exhibited few deformations, 57% of them had shifts with a large magnitude (*dist<sub>WT−mutant</sub>* ≥ 2 Å). For example, the D30N mutant induced five MCS atoms, one of which exhibited a large shift of 2.9 Å, while the L90M mutant caused four small shifts at residues 90\_A/B and 97\_B with a magnitude varying from 0.35 to 0.48 Å (Figures 3B and A3). In contrast, 17 mutants had many MCS atoms, and all of them had shifts of large magnitude (Figure 3B). Of these mutants, seven were single mutants, revealing that a single mutation could cause large deformations, such as those observed for the K7R and L99F mutants (Figures 4 and A3).

**Figure 4.** Location of the MCS atoms found in some mutants. PR2 is colored in cyan and represented in putty mode. The putty radius is relative to deformations induced by mutations: the higher the radius, the stronger the mutation-induced rearrangement. Mutated residues are colored in blue.

According to the location of MCS atoms, we differentiated three types of mutations (Figure 4). The first type corresponded to four mutations that induced structural rearrangements only in the mutated residues, i.e., having only direct impacts. For example, the I50V mutation caused structural changes in two atoms of residue 50 of the two chains, a residue involved in the binding pocket, the dimer interface, and the flap region (Figure 4). The second type grouped five mutations impacting residues in their vicinity or further away in the structure, i.e., inducing indirect structural changes, such as the I82F mutation (Figure 1). Indeed, the I82F mutation induced many atom rearrangements in five non-mutated residues (8\_A, 8\_B, 21\_B, 27\_B, 49\_B), some of them of large magnitude. In this mutant structure, residue 8\_A is located in the vicinity of mutated residue 82\_A (at less than 5.5 Å), while residues 27\_B and 49\_B are located at more than 6 Å from the mutated residue 82\_A (Figure A8). The last type corresponded to mutations that induced both direct and indirect rearrangements (Figures 1 and 4). This type grouped most of the mutations. Figure 4 shows that the L99F mutation produced large shifts in the mutated residues and also in its neighboring residues 68 and 69.

The location of MCS atoms (Figure A3) in mutant structures highlighted structural rearrangements located in important regions for PR2: in the elbow and flap regions that are involved in the PR2 deformations induced by ligand binding, in its binding pocket, and in its interface. In the following, we explored the impacts of the studied resistance mutations on the PI-binding pocket and the PR2 interface.

#### *2.2. Impact of Mutations on the Properties of PI-Binding Pocket*

Of the 31 mutants, 15 had at least one mutation in the binding pocket (Figures 1A and A4). Except for the I82M mutant, all these mutants presented MCS atoms in the pocket, in mutated or non-mutated residues. Surprisingly, structural rearrangements in the binding pocket were also observed in five mutants without pocket mutations (K7R, I54L, V62A, I54L/L90M, I54M/L90M, Figure 1). A total of 36% of pocket atoms were deformed in at least one mutant, with an overrepresentation of side-chain atoms (Pearson's Chi-squared test *p*-value = 1 × 10<sup>−3</sup>), see Figure A4.

To explore the impacts of these mutations on the conformation and properties of the PI-binding pocket, PI-binding pockets were extracted from the 156 structures (1 wild-type and 155 mutant structures). These 156 pockets were then classified according to their structural similarity quantified by pairwise RMSD (Figures A5 and 5). In addition, their volume, sphericity, and hydrophobicity values were compared to those of the wild-type pocket (Figure 6). First, the five structures of a given mutant were not always grouped together in the classification, or presented some variability in terms of descriptor values. This highlighted the structural diversity of the five structures of a mutant. This is explained by the minimization effects and the fact that several rotamers were possible for some amino acids during the mutagenesis process, as illustrated by Figure A1 for the K45R mutation. Except for structures of the I82M, I54M, L90M, I54M/L99F, and I54M/V71A/L90M mutants, most structures of mutants without MCS atoms in the pocket were close to the wild-type pocket in the hierarchical classification and presented descriptor values similar to those of the wild-type pocket (Figures 5 and 6). Pockets of mutants with the K7R, I54M, I54L, I82F, and I84V mutations were the farthest from the wild-type pocket in the classification, meaning that these mutations had the most impact on the pocket structure (Figure 5). Figure 6 shows that the K7R and I82F mutations also strongly modified pocket properties, like the K45R, V47A, G48V, I82M/F, and L90M mutations. More precisely, the V47A, I82F, and I82M mutations strongly decreased the pocket hydrophobicity. The I50V, I50L, V62A, and I84V mutations also caused a reduction of pocket hydrophobicity but with a weaker magnitude. The I50L and I82F mutations were also responsible for an increase of the pocket volume, in contrast to the I82M and I84V mutations that caused a reduction of the pocket volume.
The G48V and I54M mutations increased the hydrophobicity of the pocket, which was accompanied by a modification of the pocket size: the G48V mutation led to a reduction of the pocket volume, in contrast to the I54M mutation. An increase of the pocket volume was also observed in the K7R, I54L, and L90M mutants with different magnitudes, and a decrease of the pocket volume in the D30N and V47A mutants. The volume modification of the pocket of the K7R and D30N mutants was accompanied by an increase of the sphericity of the pocket. The K45R mutant was distinct because its five structures presented a large diversity in terms of descriptor values (Figure 6). Two of these pockets were bigger and less hydrophobic than the wild-type pocket, while the three others were smaller, more hydrophobic, and more spherical than the wild-type pocket.
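The pairwise RMSD used to compare pocket conformations can be sketched as follows. This is a simplified illustration, not the authors' pipeline: it assumes that all pockets contain the same atoms in the same order and are already superimposed, and the pocket names and coordinates are invented. The resulting distances could then be passed to a hierarchical clustering routine (for example, SciPy's `scipy.cluster.hierarchy.linkage`), although the clustering method actually used is not specified in this section.

```python
from math import sqrt


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered lists of (x, y, z) coordinates."""
    if len(coords_a) != len(coords_b):
        raise ValueError("both pockets must contain the same number of atoms")
    squared = sum((a - b) ** 2 for p, q in zip(coords_a, coords_b) for a, b in zip(p, q))
    return sqrt(squared / len(coords_a))


def pairwise_rmsd(pockets):
    """Return {(name_i, name_j): RMSD} for every unordered pair of pockets."""
    names = list(pockets)
    return {
        (first, second): rmsd(pockets[first], pockets[second])
        for index, first in enumerate(names)
        for second in names[index + 1:]
    }


# Toy pockets made of two atoms each (hypothetical coordinates, in Å).
pockets = {
    "WT": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    "mutA": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    "mutB": [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)],
}
distances = pairwise_rmsd(pockets)
for pair, value in distances.items():
    print(pair, round(value, 2))
```

Here the pocket of the hypothetical mutA is identical to the wild-type pocket (RMSD of 0 Å), whereas every atom of mutB is displaced by 1 Å, giving an RMSD of 1 Å.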

**Figure 5.** Hierarchical classification of wild-type and mutant pockets according to their conformational similarity quantified by pairwise RMSD computed using all pocket atoms. The table provides the description of each mutant structure in terms of mutations. Mutant structures are ranked according to their order of appearance in the classification. Single mutants are colored in magenta, double mutants in green and triple mutants in blue. The orange column locates the wild-type pocket. Mutations colored in purple correspond to mutations located in the binding pocket and mutants colored in red contain an MCS atom located in the binding pocket.

**Figure 6.** Hydrophobicity, volume, and sphericity of pockets extracted from each mutant structure. Red mutants correspond to mutants exhibiting at least one mutation in the binding pocket. Each point corresponds to a mutant structure. For some mutants, fewer than five points appear, meaning that several mutant structures exhibit the same value for a given descriptor.

Figure A4 presents the MCS atoms occurring in the binding pocket in each mutant. We noted that some mutations induce structural rearrangements in residues important for PI binding. For example, the K7R, I50L/V and I54L mutations caused structural deformations in residues 25, 27, 30 that establish hydrogen bonds with PIs [18,33]. The D30N, K7R, I82F, I84V mutations led to atomic displacements in residues involved in van der Waals interactions with PIs, such as in residues 23B, 27A, 28A, 30A, 49B, 48B, 82A and 84A.

Thus, the K7R, K45R, V47A, G48V, I50V/L, I54M, V62A, I82M/F, I84V, and L90M mutations could impact ligand binding by modifying pocket properties or the network of interactions with PIs.

#### *2.3. Impact on Interface*

Of the 31 drug-resistant mutants, 13 had at least one mutation in the PR2 interface (Figure 1A). Except for the I50L and I54M mutants, all these mutants contained MCS atoms in their interface. Five mutants without a mutation in the interface presented structural deformations in their interface. To analyze the impact of these mutations on the PR2 interface, interfaces of the wild-type and mutant structures were extracted and compared in terms of amino acid composition and size. To do so, a hierarchical classification of the 156 interfaces was computed according to their similarity in terms of interface composition (Figure 7). In addition, the Solvent Accessible Surface Area (SASA) value, measuring the interface size, of the two parts of the interface was computed for each mutant structure using NACCESS software [37] (Figure 8). Figure 7 shows that most structures without MCS atoms in the interface were close to the wild-type interface in the classification, revealing that these mutations led to weak changes in the PR2 interface. This was confirmed by the fact that these interfaces had a size similar to that of the wild-type interface (Figure 8). Three mutants (I50L, I54M and I84V/L90M) without MCS atoms in the interface exhibited a different interface composition relative to the wild-type. These differences in terms of interface composition led to an increase of the size of the chain A interface in the I50L and I84V/L90M mutants. The G48V mutation was responsible for the presence in the interface of the two side-chain atoms of residue 48\_A and the absence from the interface of one atom of residues 95\_B and 99\_B relative to the wild-type interface (Figure A6). As a result, the chain A interface of the mutant was larger and that of chain B was smaller than in the wild-type interface (Figure 7). These differences in terms of interface observed in the I50L, I54M, and G48V mutants were explained by supplementary atoms in their interface induced by the mutation (Figure A6).
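Comparing interface compositions amounts to comparing sets of interface atoms. The sketch below shows one possible way to do so; the text does not state which similarity measure was used for the classification, so the Jaccard distance here is an assumption, as are the atom labels, which only loosely mimic the G48V case described above.

```python
def compare_interfaces(wild_type_atoms, mutant_atoms):
    """Return atoms gained and lost by the mutant interface and a Jaccard distance.

    The Jaccard distance (1 - |intersection| / |union|) is used only as an
    illustrative dissimilarity; it is not necessarily the measure used in the study.
    """
    gained = mutant_atoms - wild_type_atoms
    lost = wild_type_atoms - mutant_atoms
    union = wild_type_atoms | mutant_atoms
    shared = wild_type_atoms & mutant_atoms
    distance = 1.0 - len(shared) / len(union) if union else 0.0
    return gained, lost, distance


# Hypothetical interface atom sets written as "residue_chain:atom".
wt_interface = {"48_A:CA", "50_A:CB", "95_B:CD1", "99_B:CD2"}
g48v_interface = {"48_A:CA", "48_A:CG1", "48_A:CG2", "50_A:CB"}

gained, lost, distance = compare_interfaces(wt_interface, g48v_interface)
print(sorted(gained), sorted(lost), round(distance, 3))
```

In this toy example the mutant interface gains the two side-chain atoms of residue 48\_A and loses one atom each of residues 95\_B and 99\_B, which mirrors the qualitative description given for the G48V mutant.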

**Figure 7.** Classification of the 156 structures according to the similarity of their interfaces. The table provides the description of each mutant structure in terms of mutations. Mutant structures are ranked according to their order of appearance in the classification. Single mutants are colored in magenta, double mutants in green and triple mutants in blue. The orange column locates the wild-type interface. Mutations colored in purple are involved in the interface. Residues colored in red correspond to MCS atoms in the interface.

**Figure 8.** SASA values for interface of chain A (**a**) and chain B (**b**) for each mutant. Red mutants correspond to mutants having at least one mutation in PR2 interface.

Among the mutants having MCS atoms in the interface, the I50V, V62A, and I82F mutants induced the fewest modifications in the PR2 interface (Figures 7 and 8). This was expected for the V62A mutant because only one MCS atom was observed in its interface. It was more surprising for the I50V and I82F mutants because large deformations were detected in interface residues 50 and 8 of chains A and B, respectively. In contrast, the K7R, I54L, L90M, and L99F mutations, alone or in combination with others, produced many MCS atoms in the interface that strongly modified its composition and size. The K7R, L90M, and L99F mutations caused an increase in the size of the two parts of the PR2 interface, and the I54L mutation induced a weak increase in the size of the chain B interface.

#### *2.4. Impact of Combining Several Mutations Relative to Single Mutant*

Figure 3 shows that most multiple mutants contained many MCS atoms with large magnitude, i.e., with a *dist*<sub>WT−mutant</sub> distance higher than 2 Å. However, combining several mutations did not significantly increase the average number of MCS atoms per mutant (Student's *t*-test *p*-value = 0.59, Figure 3B). Comparison of MCS atoms in the single and multiple mutants showed that several multiple mutants exhibited specific MCS atoms relative to the corresponding single mutants. For example, we highlighted a displacement of atoms of residue 50\_B in the I54L and I54L/L90M mutants, but the shift was larger in the double mutant (*dist*<sub>WT−mutant</sub> for the CD\_50\_B and CG1\_50\_B atoms >2.5 Å) than in the single mutant (*dist*<sub>WT−mutant</sub> of 0.69 and 0.62 Å for the CD\_50\_B and CG1\_50\_B atoms in the I54L mutant). Combining several mutations could also induce the appearance or loss of MCS atoms relative to the corresponding single mutants. For example, the combination of the I54M and I82F mutations caused structural shifts in pocket residues 50\_A, 81\_A/B, and 82\_A/B, while no shift at these residues was observed in the I54M and I82F single mutants (Figure A4). In contrast, the I82F mutant contained a large shift at residue 8\_B, a residue involved in the PR2 interface and pocket, which was not found in the I54M/I82F mutant (Figure A4). These structural changes in the double mutant relative to the I82F mutant induced a weak decrease of the interface size and an increase of the pocket volume (Figure 6).

#### *2.5. Impact of Using Different Structure Modeling Software*

In this section, we explored the impact of using another structure-modeling software on the detection of structural rearrangements induced by mutations. To do so, we modeled the structure of the 31 mutants using the Robetta webserver (see Appendix G). These mutant structures were denoted as *mutant*<sub>Robetta</sub>. For clarity, the mutant structures built using our initial protocol (based on FoldX software and an energy minimization step) were denoted as *mutant*<sub>FoldX+Mini</sub>. First, we compared the mutant structures generated by each protocol with the wild-type crystallographic structure (Figure A7A,B). We noted that the protocol based on FoldX software plus a minimization step led to a set of structures exhibiting a larger diversity than the Robetta webserver. Then, the two mutant sets were compared by computing the RMSD between each *mutant*<sub>Robetta</sub> structure and the five corresponding *mutant*<sub>FoldX+Mini</sub> structures, denoted as *RMSD*<sub>FoldX+Mini−Robetta</sub> (Appendix G.2). The two modeling protocols led to different structures, with an average RMSD of 0.46 ± 0.04 between *mutant*<sub>Robetta</sub> and *mutant*<sub>FoldX+Mini</sub> structures (Figure A7C). We then compared the number of MCS atoms observed in each structure of the two mutant sets (see Appendix G.3). Figure 9 presents the number of MCS atoms detected in each mutant. We noted that *mutant*<sub>Robetta</sub> structures exhibited substantially more MCS atoms than *mutant*<sub>FoldX+Mini</sub> structures. This was explained by two reasons. First, the determination of the MCS atoms in *mutant*<sub>FoldX+Mini</sub> structures was based on the comparison of the mutant structures with the minimized wild-type structure, while MCS atoms in *mutant*<sub>Robetta</sub> structures were detected by comparing mutant structures with the non-minimized wild-type structure. Second, an atom was detected as an MCS atom in the *mutant*<sub>FoldX+Mini</sub> set only if it had a *dist*<sub>WT−mutant</sub> larger than 0.3 Å in at least three of the five structures of the mutant, whereas a single structure per mutant was available in the *mutant*<sub>Robetta</sub> set.
Figure 9 also presents the number of atoms detected as MCS in both mutant sets. Of the 504 MCS atoms detected in the *mutant*<sub>FoldX+Mini</sub> set, 78% were also detected as MCS atoms in the *mutant*<sub>Robetta</sub> set. Thus, although the protocol based on FoldX software tended to underestimate structural rearrangements, most of the structural deformations extracted using this protocol were also found by a protocol based on another modeling software.

**Figure 9.** Number of MCS atoms in the *mutant*<sub>Robetta</sub> set, in the *mutant*<sub>FoldX+Mini</sub> set, and in both mutant sets.

#### **3. Discussion**

In this paper, we quantified and located the structural deformations at backbone and side-chain atoms of PR2 occurring in an updated set of 31 drug-resistant mutants, based on their modeled structures. We detected a set of atoms presenting a shift in at least one mutant structure relative to the wild-type structure. To identify structural rearrangements resulting from the mutation, we then retained only MCS atoms, i.e., atoms whose positions in the mutant and wild-type structures are separated by more than 0.3 Å in at least three of the five structures of a given mutant. The distance cutoff was set to 0.3 Å to select only significant shifts, considering uncertainties in the X-ray structure, as in Liu et al. [36]. This step allowed us to detect on average 16.26 ± 26.09 MCS atoms per mutant. However, these two cutoffs could lead to an underestimation of the detected structural shifts. This could explain the fact that no MCS atom was detected in the I82M mutant, while the binding pocket extracted from this mutant was smaller and less hydrophobic than the wild-type pocket. Indeed, using a distance cutoff of 0.1 Å, we counted 96 ± 107 MCS atoms per mutant and 17 MCS atoms in the I82M mutant.

The analysis of MCS atoms showed that they occurred in side-chain and backbone atoms of the mutated residues, in their vicinity, or further away in the structure. We noted that structural rearrangements in side-chain atoms were more frequent and of larger magnitude than those observed in backbone atoms. Thus, drug-resistance mutations induced more deformations in side-chains than in backbone atoms. These results suggest that, to combat HIV drug resistance, it would be interesting to develop inhibitors that establish hydrogen-bond interactions with backbone atoms. Favoring backbone interactions between PI and PR is a strategy that was used in the design of DRV to avoid the detrimental effects of resistance mutations [38,39].

Our results showed that the studied drug-resistance mutations impacted all PR2 regions. However, even though mutations occurred in the two PR2 chains, we noted an asymmetry in the impact of mutations on the two chains. This asymmetric behavior of mutations is linked to the fact that PR2 is a structurally asymmetric protein, i.e., its two chains exhibit different conformations in unbound and bound forms [11,12,35,40–43]. This structural asymmetry of PR2 could result from crystal packing, ligand binding, and the intrinsic flexibility of PR2 [11], and may be involved in the structural changes of PR2, particularly upon ligand recognition and binding [11,12,41,42]. Our results are in agreement with previous findings showing that drug-resistance mutations could modify PR2 structural asymmetry [31]. Thus, as PR2 is an asymmetric protein, resistance mutations do not always have the same impact on the two chains.

The location of MCS atoms on the PR2 structure highlighted structural deformations that could be linked to resistance mechanisms. First, we observed, in 19 mutant structures, structural deformations in binding pocket residues, as reported in Laville et al., 2020 [31]. Most of these mutations were located in the binding pocket, except the K7R, I54M and L90M mutations. Our findings revealed that the S43T, V47A, K45R, G48V, I82F, and I82M mutations strongly modified pocket hydrophobicity. The first two decreased the hydrophobicity, while the last four increased it. The I50V, I50L, V62A, and I84V mutations also modified the pocket hydrophobicity, but with a very weak magnitude. The K7R, I50L, I54M, I54L, I82F, and L90M mutations increased the pocket volume, while the D30N, V47A, G48V, I82M, and I84V mutations had the opposite effect. By comparing the PR1 and PR2 binding pockets, we previously observed that amino-acid changes occurring in pocket residues 31, 32, 46, 47, 76, and 82 increased the hydrophobicity of the binding pocket [19]. Chen et al. (2014) reported that these mutations also have an impact on the volume of the binding pocket [22]. In addition, the K7R, D30N, I50L/V, I54L, I82F, and I84V mutations seemed to have a direct impact on PI binding by causing structural rearrangements in pocket residues that establish hydrogen-bond and van der Waals interactions with PIs. Secondly, our findings showed that the K7R, E37D, S43T, K45R, V47A, G48V, I50V, I50L, I54M, I54L, and V62A mutations, occurring alone or in combination with others, directly or indirectly induced structural changes in the elbow and flap regions. These two regions are known to be important in the opening and closing mechanisms of the binding pocket during ligand binding [10]. Thus, these mutations could impact PI binding by modifying the flexibility and movement of the flap region upon PI binding.
Thirdly, we reported that the K7R, I50L, I54L, G48V, L90M, and L99F mutations caused structural displacements that impacted the composition and size of the PR2 interface. This suggested that these mutations may alter the stability of PR2.

Several drug-resistance mutations have been structurally studied in PR1 by comparing crystallographic structures of the wild-type and mutants in bound and unbound forms. These studies showed that drug-resistance mutations alter the conformation of flap residues and flap dynamics, and modify binding pocket properties, the interaction network with PI, and protease stability. For example, it has been shown that some resistance mutations, such as the M46L, G48V, I50V, and I54V/M mutations, alter the conformation of flap residues and flap dynamics in PR1 [36,44–48]. The impact of resistance mutations on flap conformation was also observed in several multi-drug resistant mutants such as the PR20 (including 20 mutations), Flap+ (with L10I, G48V, I54V, and V82A mutations), and MDR-769 (with nine mutations) mutants, which have a more open flap region relative to the wild-type PR1 [49–52]. In addition, Shen et al. (2015) suggested that the E35D, M36I, and S37D mutations in the multi-drug resistant PRS17 mutant induce an increase of the flexibility of the flap region [53]. The impact of amino-acid changes at residues 30, 50, 82, 84, and 90 on the pocket volume has also been observed in PR1 [48,49,52,54–57]. For example, the V82A and I84V mutations lead to an expansion of the active-site cavity [52,54], while the V82F and L90M mutations cause a volume reduction in the binding cavity [48,49,54,57]. The direct impact of mutations occurring at residues 30, 50, 82, and 84 on the network of PR-PI interactions was previously observed in PR1 [51,57–62]. The impact of the L90M mutation on protease stability was previously observed in PR1 using urea denaturation experiments [63,64] or sedimentation equilibrium analysis [65]. However, we noted several disagreements between findings obtained in PR1 and PR2.
For example, Liu et al. (2008) observed structural deformations around the tip of the flap and the 80s loop (residues 78–82) in PR1 mutants G48V, I50V, and I54V/M complexed with SQV and DRV [36]. Our analysis of PR2 mutants detected displacements of atoms located around the flap tip in the I50V, I54M, and G48V mutants and in residue 53 in the G48V mutant. However, structural deformations in the 80s loop were only found in the I54M mutant, with a weak shift in backbone atom 79\_B\_0 (*dist*<sub>WT−mutant</sub> of 0.44 Å). An important structural rearrangement of the main chain of residue 25 in the PR1 mutant L90M, linked to L90M resistance, was observed in several studies [49,57]. Our results did not highlight structural deformations at atoms of residue 25 in the L90M mutant of PR2; atoms of residue 25A/B exhibit an average *dist*<sub>WT−mutant</sub> of 0.07 ± 0.02 Å. In addition, we noted that the I84V and L90M mutations do not have the same effect on the binding pocket in PR1 and PR2. Indeed, our findings showed that, in PR2, the I84V mutation led to a reduction of the pocket volume and the L90M mutation induced an increase of the pocket size, in contrast to PR1 [48,49,52,54,57]. These disagreements in terms of the impacts of drug-resistance mutations in PR1 and PR2 could be explained by the methodological differences between the two approaches. Indeed, our study was based on modeled structures, while PR1 studies used crystallographic structures. However, another reason that could explain these disagreements is the structural differences between PR1 and PR2 [13,18,19,40,42], which lead to different PI-resistance profiles: PR2 is naturally resistant to six of the nine FDA (Food and Drug Administration)-approved PIs available for HIV-1 therapy [1,3,4]. More particularly, our previous work showed that the *α*-helix region (87–95), containing residue 90, presents different conformations in PR1 and PR2, and these structural differences could be partially explained by amino acid substitutions observed between the two PRs [13]. In addition, we previously showed that the two PRs exhibit pockets with different properties: PR2 pockets are smaller and more hydrophobic than PR1 pockets [19,22]. These observations suggest that the same mutation could have different impacts on PR1 and PR2 structures.

In this study, we explored the impact of drug-resistance mutations on PR2 structure. To do so, we modeled the 3D structures of the mutants using as template PR2 complexed with DRV (PDB code: 3EBZ [33]). However, it has been shown that drug-resistance mutations exhibit different sensitivities to the nine FDA-approved drugs [2,4,8,10,26]. For example, the I54M mutation leads to moderate resistance to indinavir, nelfinavir, tipranavir, DRV, and LPV, and high resistance to amprenavir, while remaining susceptible to SQV [26]. Thus, it would be interesting to consider different PIs and to cross-reference the complete resistance profile of each mutant with the detected structural deformations. However, reliable resistance profiles for PIs are difficult to collect for all mutants, as few studies are available [3,4,26,29] and some of these studies have led to opposing results [10].

#### **4. Materials and Methods**

#### *4.1. Data*

In this study, we started from the list of the 30 drug-resistant mutants of PR2 studied in our previous study [31]. This mutant list was updated and a list of 31 drug-resistant mutants of PR2 was selected. This mutant set contains 22 different mutations (Figure 1) and there are 20 single mutants (with one mutation), 8 double mutants (with two mutations), and 3 triple mutants (with three mutations). These mutations sample the entire PR2 (Figure 1).

#### *4.2. Structure Modeling*

We modeled the 3D structure of each mutant using the protocol used in Laville et al., 2020 [31], based on the FoldX suite [32] and GROMACS software [66]. This two-step protocol was applied to the wild-type crystallographic structure of PR2 complexed with DRV (PDB code: 3EBZ [33]). First, the DRV ligand, metal atoms, and water molecules were removed from the crystallographic structure of PR2. Then, the RepairPDB command of the FoldX suite [32] was applied to the PR2 structure to reduce its energy. The in silico mutagenesis was then performed using the BuildModel command of the FoldX suite, based on a side-chain rotamer library [32]. This step was performed five times to consider the several rotamers available for each amino acid [32] and generated five structures per mutant. An energy minimization was then applied to the five modeled structures of each mutant using the protocol developed in our previous study [31]. We applied PROPKA software [67] to monoprotonate the oxygen atom OD2 of Aspartate 25 in chain B. Then, the system was solvated in a truncated octahedron box of explicit solvent (TIP3P water model) with a 12.0 Å buffer in each dimension, and its charge was neutralized using chloride ions. The minimization of the system was performed using a two-step protocol with the AMBER ff99SB force field in GROMACS [66], applying a steepest descent algorithm combined with a conjugate gradient algorithm. A first energy minimization step relaxed water molecules and counterions while applying a harmonic position restraint of 100 kcal·mol<sup>−1</sup>·Å<sup>−2</sup> on the heavy atoms of the protein. Then, restraints on protein atoms were removed in a second energy minimization step. The protocol used the particle mesh Ewald (PME) method to treat the long-range electrostatic interactions [68] and a cutoff distance of 10.0 Å for the long-range electrostatic and van der Waals interactions. This protocol was applied to the 31 mutants, generating 155 mutant structures (5 structures per mutant). The minimization protocol was also applied to the wild-type structure, and the minimized wild-type structure is hereafter referred to as the wild-type structure.

#### *4.3. Identification of Shifted Atoms Induced by Mutations*

First, all shifted atoms in each mutant structure were extracted by comparing the positions of atoms in the wild-type and mutant structures. To do so, we applied the method used in Perrier et al., 2019 [69]. Each mutant structure was superimposed onto the wild-type structure (the minimized wild-type structure) using PyMOL software [70]. Superimposition was based on all atoms. Hydrogen atoms were removed. Euclidean distances between the positions of each atom in the mutant and wild-type structures were computed. These distances were denoted as *dist*<sub>WT−mutant</sub>. The higher the *dist*<sub>WT−mutant</sub> value of an atom, the more the atom is shifted in the mutant structure relative to the wild-type structure. Given the modeling protocol, the detected atom shifts resulted from the mutagenesis and minimization steps. To retain only significant structural rearrangements induced by resistance mutations, only atoms with a *dist*<sub>WT−mutant</sub> value higher than 0.3 Å were retained, as in Liu et al., 2008 [36]. This distance cutoff of 0.3 Å allowed selecting only significant structural displacements and accounting for uncertainties in the X-ray and mutant structures [36]. In addition, for each mutant, only shifted atoms observed in at least three of the five structures of the mutant were retained. These shifted atoms were named mutant-conserved shifted atoms and denoted MCS atoms.
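The selection rule described above can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions, not the implementation used in this study: the function names, the dictionary-based atom representation, and the choice to skip atoms that have no counterpart in the other structure (e.g., side-chain atoms of a mutated residue) are hypothetical, and the coordinates are assumed to be already superimposed with hydrogen atoms removed.

```python
import math
from collections import Counter

DIST_CUTOFF = 0.3    # Å, distance cutoff stated in the text
MIN_STRUCTURES = 3   # an atom must be shifted in at least 3 of the 5 models


def shifted_atoms(wild_type, mutant, cutoff=DIST_CUTOFF):
    """Return the atom keys whose wild-type-to-mutant distance exceeds the cutoff.

    Both arguments map an atom key such as ("CD", 50, "B") to its (x, y, z)
    coordinates. Atoms missing from the mutant are skipped (an assumption).
    """
    shifted = set()
    for key, wt_xyz in wild_type.items():
        mut_xyz = mutant.get(key)
        if mut_xyz is not None and math.dist(wt_xyz, mut_xyz) > cutoff:
            shifted.add(key)
    return shifted


def mcs_atoms(wild_type, mutant_models, cutoff=DIST_CUTOFF,
              min_structures=MIN_STRUCTURES):
    """Return atoms shifted in at least `min_structures` models of one mutant."""
    counts = Counter()
    for model in mutant_models:
        counts.update(shifted_atoms(wild_type, model, cutoff))
    return {key for key, n in counts.items() if n >= min_structures}
```

Calling `mcs_atoms(wild_type, five_models)` with the five modeled structures of one mutant returns the set of atom keys that satisfy both cutoffs; lowering `cutoff` to 0.1 corresponds to the more permissive setting mentioned in the Discussion.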

#### *4.4. Comparison of Wild-Type and Mutant Pockets*

#### 4.4.1. Pocket Estimation

From the 156 structures, we extracted the binding pocket. To take into account the fact that the known PIs are structurally diverse and that resistance mutations induce resistance to one or several PIs, we used the "common-ligand" approach to estimate pockets [11], which consists of defining the binding pocket as all atoms of PR2 capable of binding all co-crystallized ligands. To do so, a virtual ligand was built by superimposing all co-crystallized ligands extracted from a set of PRs. This virtual ligand was then placed in the query structure, and its pocket was estimated as the atoms located within 4.5 Å of the virtual ligand. This protocol was applied to the 156 wild-type and mutant structures using the virtual ligand built from the PR set used in Triki et al., 2018 [11]. The pocket extracted from the wild-type structure was denoted as the wild-type pocket and those extracted from mutant structures were denoted as mutant pockets.
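The distance-based pocket definition can be sketched as follows. This is a hedged illustration rather than the tool used in the study: the function name and data layout are hypothetical, the virtual ligand is assumed to be already placed in the frame of the query structure, and the 4.5 Å value is interpreted as a maximal distance to any virtual-ligand atom.

```python
import math

POCKET_CUTOFF = 4.5  # Å, distance threshold to the virtual ligand


def estimate_pocket(protein_atoms, virtual_ligand_xyz, cutoff=POCKET_CUTOFF):
    """Return the protein atoms lying within `cutoff` of any virtual-ligand atom.

    protein_atoms maps an atom key, e.g. ("CB", 82, "A"), to (x, y, z);
    virtual_ligand_xyz is a list of (x, y, z) coordinates.
    """
    return {
        key
        for key, xyz in protein_atoms.items()
        if any(math.dist(xyz, lig_xyz) <= cutoff for lig_xyz in virtual_ligand_xyz)
    }
```

Because the same virtual ligand is used for every structure, differences between the returned atom sets reflect changes in the protein rather than in the ligand.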

#### 4.4.2. Comparison of the Conformation of Wild-Type and Mutant Pockets

To compare the conformation of the 156 pockets, we computed the root mean square deviation (RMSD) between each pair of pockets (156 × 156). The RMSD calculation was performed using PyMOL software [70] based on all pocket atoms. From these RMSD values, we computed a hierarchical classification of the pockets using the Ward aggregation method.
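As a rough illustration of the pairwise comparison, the sketch below computes an all-against-all RMSD matrix in plain Python. It does not reproduce the PyMOL workflow: it assumes, for simplicity, that the pockets are already superimposed and that each pocket is given as a list of coordinates for the same atoms in the same order; the function names are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally sized, pre-superimposed coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


def rmsd_matrix(pockets):
    """Symmetric matrix of pairwise RMSD values between all pockets."""
    n = len(pockets)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = rmsd(pockets[i], pockets[j])
    return matrix
```

The resulting square matrix is the kind of input from which a hierarchical classification can then be built with a clustering routine of one's choice.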

#### 4.4.3. Comparison of the Properties of Wild-Type and Mutant Pockets

Each pocket was characterized by two geometrical descriptors (VOLUME\_HULL and PSI) and one physicochemical descriptor (hydrophobicity\_kyte) [71]. The VOLUME\_HULL descriptor provides an estimation of the volume of a pocket. PSI measures the sphericity of a pocket. These two descriptors were computed using PCI software [72]. The hydrophobicity\_kyte descriptor quantifies the hydrophobicity of a pocket.

#### *4.5. Interface Comparison*

#### 4.5.1. Interface Extraction

The PPIC (Protein-Protein Interface Computation) program [73,74] was used to determine the atoms involved in the interface of a structure. This program is parameter-free. It takes as input the 3D structure of a complex of two molecules (molecules or macromolecules) A and B. It defines the interface between the two molecules in two parts: the interface of A and the interface of B. Each interface part corresponds to the non-redundant set of atoms of one molecule that are nearest neighbors of atoms of the other molecule. The extraction of neighbor atoms is performed using a simplified variant of the Voronoï tessellation method [75]. Contrary to the Voronoï tessellation method, the PPIC approach does not generate neighbors at long distances in the interface.

We used the PPIC program to extract the interface from the wild-type structure, denoted as the wild-type interface, and the interfaces from the mutant structures, denoted as mutant interfaces.

#### 4.5.2. Comparison of the Interface Composition

We compared the composition of the 156 interfaces by computing the Hamming distance, denoted *D*<sub>interface</sub>, between each pair of interfaces (Equation (1)).

$$D\_{\text{interface}}(I1, I2) = N\_{I1} + N\_{I2} - 2 \times N\_{I12} \tag{1}$$

with *I*1 and *I*2 the interfaces extracted from two structures, *N*<sub>*I*1</sub> and *N*<sub>*I*2</sub> the numbers of atoms in the interfaces *I*1 and *I*2, respectively, and *N*<sub>*I*12</sub> the number of atoms common to the two interfaces *I*1 and *I*2.

The higher *D*<sub>interface</sub>(*I*1, *I*2) is, the more dissimilar the two interfaces *I*1 and *I*2 are. To facilitate the interpretation of these distances, each *D*<sub>interface</sub>(*I*1, *I*2) value was normalized by the total number of atoms in the two interfaces (*N*<sub>*I*1</sub> + *N*<sub>*I*2</sub>), which is the largest value the distance can take, using Equation (2), resulting in the *D*<sub>interface</sub>(*I*1, *I*2)<sup>norm</sup> value for each interface pair.

$$D\_{\text{interface}}(I1, I2)^{\text{norm}} = \frac{D\_{\text{interface}}(I1, I2)}{N\_{I1} + N\_{I2}}.\tag{2}$$

*D*<sub>interface</sub>(*I*1, *I*2)<sup>norm</sup> ranges from 0 to 1, where a value of 0 means that the composition of the two interfaces *I*1 and *I*2 is identical.
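Equations (1) and (2) translate directly into set operations, as the following Python sketch shows. The function names and the use of string atom identifiers are illustrative assumptions, as is the convention of returning 0 when both interfaces are empty.

```python
def interface_distance(i1, i2):
    """Hamming distance of Equation (1): atoms present in exactly one interface."""
    i1, i2 = set(i1), set(i2)
    return len(i1) + len(i2) - 2 * len(i1 & i2)


def normalized_interface_distance(i1, i2):
    """Normalized distance of Equation (2), between 0 (identical) and 1 (disjoint)."""
    i1, i2 = set(i1), set(i2)
    total = len(i1) + len(i2)
    if total == 0:
        return 0.0  # assumption: two empty interfaces are treated as identical
    return interface_distance(i1, i2) / total
```

For two interfaces of three atoms each that share two atoms, the distance is 3 + 3 − 2 × 2 = 2 and the normalized value is 2/6 ≈ 0.33.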

Using the *D*<sub>interface</sub>(*I*1, *I*2)<sup>norm</sup> values of all interface pairs, we computed a hierarchical classification of the 156 structures, grouping structures according to their similarity in interface composition. This classification was computed using the Ward aggregation method.

#### 4.5.3. Comparison of the SASA of Interface

The SASA of the interface of each structure was determined using NACCESS software [37]. First, the two monomers of each structure were separated. Then, the accessible surface area of all atoms of the 312 monomers was computed using NACCESS software [37]. The SASA values of the two parts of the interface (chains A and B) of a structure were obtained by summing the SASA of each atom detected as involved in the interface. The SASA value reflects the size of an interface.
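The final summation step can be sketched as below. The per-atom SASA values are assumed to have been computed beforehand on the separated monomers (the parsing of NACCESS output is not shown), and the function name and the (atom name, residue number, chain) key layout are hypothetical.

```python
def interface_sasa(atom_sasa, interface_atoms):
    """Sum per-atom SASA values over the interface atoms of each chain.

    atom_sasa maps (atom_name, residue_number, chain) to the SASA of that atom
    in the isolated monomer; interface_atoms is an iterable of such keys.
    Returns a dict mapping each chain to the total SASA of its interface part.
    """
    totals = {}
    for key in interface_atoms:
        chain = key[2]
        totals[chain] = totals.get(chain, 0.0) + atom_sasa[key]
    return totals
```

The two returned totals correspond to the chain A and chain B parts of the interface that are compared between mutants and wild-type.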

#### **5. Conclusions**

In this study, we explored the impact of drug-resistance mutations reported in PR2. We first compared the modeled structures of 31 mutants with the wild-type PR2 structure to locate structural rearrangements induced by mutations. Secondly, we studied the impact of these deformations on the conformation and properties of the binding pocket and on the interface. Our findings showed that one mutation could induce large structural deformations located in the mutated residue, in its vicinity, or far away in the structure. These structural deformations occur in both side-chain and backbone atoms, with on average more impact on the former. However, we revealed that resistance mutations do not always have the same behavior in the two monomers of PR2, which is linked to the structural asymmetry of this target.

The analysis of the location of structural rearrangements induced by resistance mutations provided clues to better understand resistance mechanisms. First, some mutations have a direct or indirect impact on PI binding. The K7R, V47A, G48V, I50L/V, I54L/M, V62A, I82F/M, I84V, and L90M mutations induce structural rearrangements in the binding pocket that modify its conformation, volume and/or hydrophobicity. These changes in the binding pocket could have a negative effect on PI binding. In addition, some of these mutations (K7R, D30N, I50L/V, I54L, I82F, and I84V) have a direct impact on PI binding by causing structural displacements of residues establishing interactions with PI. Secondly, resistance mutations also have an impact on the conformation of the elbow and flap regions, which are involved in the transition between the open and closed conformations of PR2 upon ligand binding. Indeed, we reported that the K7R, E37D, S43T, K45R, V47A, G48V, I50V, I50L, I54M, I54L, and V62A mutations led to atom shifts in the elbow and flap regions that could modify the flexibility and movement of the flap region, which is important in the closing of the binding pocket. Thirdly, the K7R, G48V, I50L, I54L, L90M, and L99F mutations induce structural rearrangements in the PR2 interface that modify its composition and size and may alter the stability of PR2. Finally, we highlighted several residues that were never deformed in mutant structures.

In conclusion, this study explored the impact of a large number of PR2 resistance mutations on PR2 structure, particularly on its binding pocket and interface. Our results showed that some structural rearrangements induced by resistance mutations are located in important regions of PR2: the elbow and flap regions involved in the deformation of PR2 upon ligand binding, the PI-binding pocket, and the interface of the two monomers. Some of these deformations modify the properties of the binding pocket and the composition and size of the PR2 interface. These results suggest that the studied resistance mutations could alter PI binding by modifying the properties and flexibility of the pocket or the interaction between PR2 and PI, and/or alter PR2 stability. Our results reinforce the resistance mechanisms proposed in our previous study [31] and lead to a better understanding of the effects of mutations occurring in PR2 and of the different mechanisms of PR2 resistance.

**Author Contributions:** L.R. conceived and designed the experiments; M.P. and L.R. supervised the project; P.L. and L.R. performed the experiments; P.L., M.P. and L.R. developed tools dedicated to analyses; P.L. and L.R. analyzed the data; L.R. wrote the paper. All authors reviewed the manuscript. All authors have read and agreed to the submitted version of the manuscript.

**Funding:** This work was supported by an ANRS Grant. PL is supported by a fellowship from the Ministère de l'Education Nationale de la Recherche et de Technologie (MENRT).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data presented in this study are openly available at https://figshare.com/articles/dataset/Data\_of\_Laville\_et\_al\_2021/13634147.

**Acknowledgments:** We are grateful to D. Flatters for helpful discussions and O. Taboureau for proofreading the manuscript. LR thanks the University of Paris and l'UFR SdV.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **Appendix A**

**Figure A1.** Illustration of the different rotamers for the arginine in position 45. Superimposition of the five structures of the K45R mutant before (**A**) and after the minimization process (**B**). Mutant structures are presented in line and cartoon mode. Residues R45 are presented in sticks.

**Figure A2.** (**A**) Distribution of *dist*<sub>WT−mutant</sub> distances for each atom (hydrogen atoms were removed) in the set of 155 mutant structures. The red line corresponds to the distance cutoff (0.3 Å) used to define an atom as a shifted atom. (**B**) Distribution of RMSD (in Å) computed between wild-type and mutant structures. RMSD values were computed using all atoms.

#### **Appendix C**

**Figure A3.** Location of the MCS atoms found in each mutant. PR2 is colored in cyan and represented in putty mode. The putty radius is relative to the deformation induced by mutations: the higher the radius, the stronger the mutation-induced rearrangement. Mutated residues are colored in blue.

#### **Appendix D**

**Figure A4.** MCS atoms in the pocket. Mutants colored in purple present at least one mutation located in the binding pocket.

**Figure A5.** Comparison between pockets. (**A**) Distribution of RMSD computed between all pocket pairs. (**B**) Distribution of RMSD computed between the wild-type and mutant pockets.

#### **Appendix F**

**Figure A6.** Number of mutant structures in which a given atom is involved in the interface. Mutants colored in green present at least one mutation involved in the interface.

#### **Appendix G**

In this section, mutant structures built using our initial protocol (based on FoldX software and an energy minimization step) are denoted as *mutant*<sub>FoldX+Mini</sub>. The *WT*<sup>∗</sup> abbreviation denotes the crystallographic structure of PR2 complexed with DRV, i.e., the wild-type structure that was not minimized.

#### *Appendix G.1. Protocol to Model Mutant Structures Using Robetta Software*

Robetta webserver (https://robetta.bakerlab.org/) is a protein structure prediction service based on the RosettaCM method [76]. RosettaCM is a comparative modeling method that assembles structures using integrated torsion space-based and Cartesian space template fragment recombination, loop closure by iterative fragment assembly and Cartesian space minimization, and high-resolution refinement [76].

As in our initial modeling protocol, we used the PDB structure 3EBZ (PR2 complexed with DRV) as template. First, the DRV ligand, metal ions and water molecules were removed from the structure. Using this template, the structures of the 31 drug-resistant mutants were built using the Robetta webserver with the "CM" option. Other parameters were set to their defaults. This step resulted in a set of 31 mutant structures, named *mutant*<sub>Robetta</sub>. The RMSD values (based on all atoms) between the wild-type and *mutant*<sub>Robetta</sub> structures were computed using PyMOL software [70]. These RMSD values are referred to as *RMSD*<sub>WT∗−mutantRobetta</sub>.

**Figure A7.** Comparison between *mutant*<sub>FoldX+Mini</sub> and *mutant*<sub>Robetta</sub> structures. (**A**) RMSD computed between the wild-type structure (not minimized, *WT*<sup>∗</sup> structure) and each *mutant*<sub>FoldX+Mini</sub> structure. (**B**) RMSD computed between the wild-type structure (not minimized, *WT*<sup>∗</sup> structure) and each *mutant*<sub>Robetta</sub> structure. (**C**) RMSD computed between each *mutant*<sub>Robetta</sub> structure and the five *mutant*<sub>FoldX+Mini</sub> structures.

#### *Appendix G.2. Structural Comparison between the Mutant FoldX*+*Mini and Mutant Robetta Structures*

First, we compared the two sets of PR2 mutants. For each mutant, its *mutant Robetta* structure was superimposed onto the five *mutant FoldX*+*Mini* structures and the RMSD based on all atoms was computed using PyMOL software [70]. These RMSD values were denoted as *RMSDFoldX*+*Mini*−*Robetta* .
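The all-atom RMSD used in this comparison can be illustrated with a minimal sketch. It assumes the two structures have already been superimposed (the superimposition itself is done in PyMOL in the protocol above) and that their atoms are matched one-to-one; the function names and the coordinate representation are hypothetical and are not part of the original protocol.

```python
import math


def rmsd(coords_a, coords_b):
    """All-atom RMSD between two matched lists of (x, y, z) coordinates.

    Assumption: both structures are already superimposed and the i-th atom
    of coords_a corresponds to the i-th atom of coords_b.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared_sum = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared_sum / len(coords_a))


def rmsd_against_models(robetta_coords, foldx_mini_models):
    """RMSD of one Robetta model against each FoldX+Mini model of the same mutant."""
    return [rmsd(robetta_coords, model) for model in foldx_mini_models]


# Toy coordinates (in Å) for a two-atom "structure", purely illustrative.
robetta = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
foldx_models = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],  # identical model
    [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)],  # second atom displaced by 2 Å
]
values = rmsd_against_models(robetta, foldx_models)
```

In the setting described above, `foldx_mini_models` would hold the five *mutant FoldX*+*Mini* structures of a given mutant, giving five RMSD values per mutant.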

#### *Appendix G.3. Detection of Structural Rearrangements in the Set of Mutant Robetta Structures*

First, each *mutant Robetta* structure was superimposed onto the 3EBZ (PDB code) structure using PyMOL software [70]. Superimposition was based on all atoms. Euclidean distances between the position of each atom in the mutant and 3EBZ structure were computed. These distances were denoted as *dist WT* <sup>∗</sup>−*mutant Robetta* . An atom was considered as a MCS atom in a *mutant Robetta* structure if it had a *dist WT* <sup>∗</sup>−*mutant Robetta* value higher than 0.3 Å.
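The distance criterion above can be sketched as follows. This is a minimal illustration that assumes the mutant has already been superimposed onto the reference and that atoms are identified by a (chain, residue number, atom name) key; restricting the comparison to atoms present in both structures is an assumption made here, since mutated side chains do not share all atoms with the wild type. The helper name and data layout are hypothetical.

```python
import math

SHIFT_THRESHOLD = 0.3  # Å, the cut-off stated in the text


def shifted_atoms(mutant, reference, threshold=SHIFT_THRESHOLD):
    """Return {atom_key: distance} for atoms displaced by more than `threshold`.

    mutant, reference: dicts mapping (chain, residue_number, atom_name) to
    (x, y, z) coordinates of structures that are already superimposed.
    Only atoms present in both structures are compared (an assumption).
    """
    shared_keys = mutant.keys() & reference.keys()
    distances = {key: math.dist(mutant[key], reference[key]) for key in shared_keys}
    return {key: d for key, d in distances.items() if d > threshold}


# Toy coordinates (in Å), purely illustrative.
reference = {("A", 82, "CA"): (0.0, 0.0, 0.0), ("A", 82, "CB"): (1.0, 0.0, 0.0)}
mutant = {("A", 82, "CA"): (0.0, 0.0, 0.2), ("A", 82, "CB"): (1.0, 0.5, 0.0)}
shifted = shifted_atoms(mutant, reference)
```

Here only the CB atom exceeds the 0.3 Å cut-off, so it is the only atom flagged.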

#### **Appendix H**

**Figure A8.** Location of residues exhibiting structural shifts in the I82F mutant. Chains A and B are presented in cartoon mode and colored in orange and blue, respectively. Mutated residues (82) are colored in magenta and presented in stick mode. Residues having shifted atoms are colored in blue and presented in stick mode.

#### **References**


### *Article* **Multiple-Molecule Drug Design Based on Systems Biology Approaches and Deep Neural Network to Mitigate Human Skin Aging**

**Shan-Ju Yeh , Jin-Fu Lin and Bor-Sen Chen \***

Laboratory of Automatic Control, Signal Processing and Systems Biology, Department of Electrical Engineering, National Tsing Hua University, Hsinchu 30013, Taiwan; m793281@gmail.com (S.-J.Y.); sweettofu531@gmail.com (J.-F.L.)

**\*** Correspondence: bschen@ee.nthu.edu.tw

**Abstract:** Human skin aging is affected by various biological signaling pathways, microenvironment factors and epigenetic regulations. With the demand for cosmetics and pharmaceuticals to prevent or reverse skin aging increasing year by year, designing multiple-molecule drugs for mitigating skin aging is indispensable. In this study, we developed strategies for systems medicine design based on systems biology methods and deep neural networks. We constructed the candidate genomewide genetic and epigenetic network (GWGEN) via big database mining. After systems modeling and applying system identification, system order detection and principal network projection methods with real time-profile microarray data, we obtained core signaling pathways and identified essential biomarkers based on the skin aging molecular progression mechanisms. Afterwards, we trained a deep neural network of drug–target interaction in advance and applied it to predict the potential candidate drugs based on our identified biomarkers. To narrow down the candidate drugs, we designed two filters considering drug regulation ability and drug sensitivity. With the proposed systems medicine design procedure, we not only shed light on the skin aging molecular progression mechanisms but also suggested two multiple-molecule drugs for mitigating human skin aging from young adulthood to middle age and from middle age to old age, respectively.

**Keywords:** skin aging; oxidative stress; aging progression mechanism; genome-wide genetic and epigenetic network (GWGEN); systems medicine design; multiple-molecule drug

#### **1. Introduction**

As the largest organ of the human body, the skin shows aging with biological age. Many people, especially women, regularly spend money on cosmetics and pharmaceuticals to prevent or reverse skin aging. Thus, changes in human skin caused by aging are important issues for both the pharmaceutical and cosmetic sectors worldwide [1]. Additionally, increasing life expectancy in developed countries reveals advancing age as the primary risk factor for numerous diseases [2]. Elderly people tend to have dryness, itch, dyspigmentation, wrinkles, as well as benign and malignant tumors on the skin. Under these conditions, they may suffer from sleep deprivation, leading to weakened immunity and infection. Hence, keeping our skin healthy promotes healthy aging [3]. Furthermore, identifying interventions that can ameliorate skin aging progression so as to delay, prevent or lessen age-related diseases is worth studying.

Human skin provides a primary protective barrier, routinely shielding us from allergens, microbes, and other environmental assaults, including solar ultraviolet (UV) irradiation, heat, infection, water loss, and injury. Skin aging is a complex process leading to the decrement of cutaneous functions and structures with time. Impaired epidermal barrier function, decline in resistance to infections and regenerative potential, and impairment of mechanical properties like loss of extensibility and elasticity are the essential biomarkers of human skin aging [4]. In general, skin aging can be regarded as two different processes. The first is intrinsic aging, which is caused by biological age. The second is extrinsic aging, which arises from solar UV exposure. The extrinsic factors include exposure to UV radiation and pollution, as well as poor nutrition, which result in alterations of DNA, RNA and protein in skin cells. The clinical manifestation of intrinsic aging is characterized by age spots, laxity, wrinkles, sagging, dryness, itching, and lower levels of type I and III fibrillar collagens leading to dermal atrophy [5].

**Citation:** Yeh, S.-J.; Lin, J.-F.; Chen, B.-S. Multiple-Molecule Drug Design Based on Systems Biology Approaches and Deep Neural Network to Mitigate Human Skin Aging. *Molecules* **2021**, *26*, 3178. https://doi.org/10.3390/molecules26113178

Academic Editors: Marco Tutone and Anna Maria Almerico

Received: 30 April 2021 Accepted: 24 May 2021 Published: 26 May 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

MicroRNAs (miRNAs) are a group of small noncoding RNAs that negatively regulate gene expression at the post-transcriptional level. They are involved in many biological processes, such as epidermal development, proliferation, differentiation [6–8], inflammatory responses, immune regulation and wound healing in human skin [9,10]. Although miRNAs might be key players in age-associated change, studies of age-related miRNAs in human skin remain limited [11]. Long noncoding RNAs (lncRNAs) are another type of noncoding RNA, with >200 nucleotides. One review paper has summarized age-related lncRNAs and elucidated their roles in different aging processes [12]. Since lncRNAs have versatile functions including gene regulation, chromatin structure modulation, genomic imprinting, cell growth and differentiation, and embryonic development, the dysregulated expression of lncRNAs may cause age-related diseases and disorders [13]. Recently, lncRNAs have been regarded as potential targets for antiaging therapies [14]. Moreover, the well-known epigenetic modifications are DNA methylation and histone post-translational modifications, including methylation, acetylation, ubiquitination, and phosphorylation. The accumulation of epigenetic alterations may not only contribute to skin aging but also promote malignant transformation [15,16]. H19, an epigenetic regulatory RNA, has been demonstrated to positively affect cell growth and proliferation and delay senescence [17]. With epigenetic silencing of LMNA, which is one of the progeroid genes, a corresponding malignant transformation could be observed [18].

To investigate the skin aging process, researchers have tried to identify the influence of the microenvironment and epigenetic change on skin aging, focusing on specific proteins, such as members of the collagen family, or on cellular functions. However, the definitions of young skin and older skin are not fixed; that is, there is no agreed age range for young and old people. Therefore, most studies of skin aging progression compared young and old people with large age differences. Although these studies proposed plenty of credible skin aging-associated theories and experimental results, the genomewide molecular progression mechanism of skin aging remained unknown owing to the limitations of experimental methods and the focus on specific proteins and cellular functions. Moreover, although pharmacological interventions may prove to ameliorate the effect of aging on humans, the prohibitive expense of treating healthy individuals in clinical trials over a long duration is a crucial difficulty in developing new drugs. By contrast, repurposing drugs that have already been approved for specific diseases, or that have passed their safety tests but failed against their original indication, is more feasible than targeting aging itself with new drugs [19–21].

In recent years, pharmaceutical scientists have put considerable effort into novel drug development based on the knowledge of existing drugs [22,23]. By performing in vitro screening for drug discovery, researchers can identify interactions between drugs and targets (e.g., genes). However, because such work is costly and time-consuming, in vitro research is often not feasible. Instead, in silico virtual screening, in which possible candidates are selected first and then verified in the wet laboratory, offers an alternative [24]. In general, docking simulation and machine learning are considered the two main approaches for in silico prediction of drug–target interaction [25]. Docking simulation is limited if the 3D structure of the protein is unknown [26]. To deal with this issue, chemogenomic methods, namely feature-based methods, transform drugs and targets into sets of descriptors (e.g., feature vectors), allowing machine learning models to predict drug–target interactions [27]. Chen et al. reviewed machine learning methods and databases that used chemogenomic approaches for drug–target interaction prediction [28]. Besides traditional machine learning methods, deep neural networks have been employed in drug–target interaction prediction as well, such as deep belief neural networks [29], convolutional neural networks [30], and multilayer perceptrons [31,32].

In this study, we define different age intervals for each stage of skin aging. We build the candidate genome-wide genetic and epigenetic network (GWGEN), containing a candidate protein–protein interaction network (PPIN) and a candidate gene regulatory network (GRN), by big database mining. The candidate GWGEN can be represented by a binary matrix. With the aid of microarray data of human skin, the false positives in the candidate GWGEN are pruned away by the system identification method and system order detection scheme. By doing so, we obtain real GWGENs of young-adult, middle-aged, and elderly skin aging as shown in Figures S1–S3, respectively. However, real GWGENs are still complex. Therefore, we further extract core GWGENs from real GWGENs by the principal network projection (PNP) method. Based on the rank of projection values, we obtain the core signaling pathways with respect to KEGG pathways to investigate skin aging molecular progression mechanisms for each stage of skin aging. To identify essential biomarkers in core signaling pathways, we refer to GenAge [33], which contains genes involved in human aging progression, and the Connectivity Map (CMap) [34] dataset to find overlapping nodes that are drug targets. To explore the potential candidate drugs toward our identified biomarkers, we trained a deep neural network of drug–target interactions in advance. By applying it, we could predict potential candidate drugs that hold a higher interaction probability with the identified biomarkers. Afterwards, we designed two filters considering drug regulation ability and drug sensitivity to narrow down the candidate drugs. Consequently, we propose two potential multiple-molecule drugs: niridazole, liothyronine, decitabine, pinacidil, and allantoin for mitigating skin aging from young adulthood to middle age; and allantoin, diclofenac, mepyramine, resveratrol, and azathioprine for mitigating skin aging from middle age to old age.
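The binary-matrix representation of a candidate network mentioned above can be illustrated with a small sketch. The conventions used here (rows as source nodes, columns as target nodes, symmetric entries for protein–protein interactions, directed entries for regulations) and the function name are assumptions for illustration only; the three nodes and two edges are a toy example, not the mined candidate GWGEN.

```python
def build_binary_matrix(nodes, edges):
    """Encode a candidate network as a binary adjacency matrix.

    nodes: list of node names.
    edges: list of (source, target, kind) tuples, kind being "ppi" or "regulation".
    Convention (assumed here): matrix[i][j] == 1 means node i acts on node j;
    protein-protein interactions are stored symmetrically.
    """
    index = {name: i for i, name in enumerate(nodes)}
    size = len(nodes)
    matrix = [[0] * size for _ in range(size)]
    for source, target, kind in edges:
        i, j = index[source], index[target]
        matrix[i][j] = 1
        if kind == "ppi":
            matrix[j][i] = 1  # undirected interaction
    return matrix


# Toy network: one candidate PPI and one candidate regulation.
nodes = ["LMNA", "SIRT6", "RBBP8"]
edges = [("LMNA", "SIRT6", "ppi"), ("SIRT6", "RBBP8", "regulation")]
candidate = build_binary_matrix(nodes, edges)
```

Under this convention, pruning a false positive amounts to resetting the corresponding entry (or pair of entries, for a PPI) to zero.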

#### **2. Results**

For the purpose of analyzing molecular progression mechanisms in human skin aging, extracting core signaling pathways from each core GWGEN becomes an essential issue. We defined three skin aging stages, namely young-adult, middle-aged, and elderly human skin, as shown in Figure 1. The research flowchart in Figure 2 shows how the candidate GWGEN, real GWGENs, and core GWGENs were constructed so as to extract core signaling pathways and investigate molecular progression mechanisms of human skin aging. By big database mining, the candidate GWGEN containing the candidate PPIN and candidate GRN was constructed. With the help of the corresponding young-adult, middle-aged, and elderly skin microarray data, we applied system identification and system order detection methods to the candidate GWGEN to obtain the real GWGENs shown in Figures S1–S3 (Supplementary Materials), respectively. Table 1 shows the number of nodes (e.g., proteins, TFs, miRNAs, and lncRNAs) as well as the edges standing for the interaction or regulation between two nodes for the candidate GWGEN and real GWGENs. Comparing the nodes in the candidate GWGEN with those in the real GWGENs in Table 1, one can see that the number of nodes is considerably smaller in the real GWGENs, reflecting that the false positives were removed by the system order detection scheme. Since the real GWGENs were still too complicated for investigating molecular progression mechanisms of human skin aging, we applied the principal network projection (PNP) method and selected the top-ranked 4000 nodes with significant projection values that could reflect 85% of the real GWGENs in the three stages of skin aging to obtain core GWGENs (Figure 3a–c), respectively. In addition, for the genes in core GWGENs, we used the Database for Annotation, Visualization and Integrated Discovery (DAVID) Bioinformatics Resources version 6.8 to perform enrichment analyses for each stage of skin aging as shown in Tables S1–S3, respectively. Moreover, to investigate molecular progression mechanisms of skin aging conveniently, we denoted differential core signaling pathways for young-adult to middle-aged and middle-aged to elderly skin aging with respect to KEGG pathways, respectively. Based on skin aging molecular progression mechanisms, we identified essential biomarkers as drug targets for young-adult to middle-aged and middle-aged to elderly skin aging, respectively. To explore candidate drugs toward our identified biomarkers, we trained a deep neural network of drug–target interaction in advance. We applied the trained model to predict the candidate drugs holding a higher interaction probability with the identified biomarkers. In order to narrow down the candidate drugs, we designed two filters considering drug regulation ability and drug sensitivity. More details are discussed in the following sections.

**Figure 1.** Skin aging stages. The figure denotes age intervals for each stage of skin. The young-adult, middle-aged, and elderly stage of skin are defined as 19 to 45 years old, 43 to 65 years old, and 64 to 86 years old, respectively.



**Figure 2.** Research flowchart of systems medicine design for human skin aging. Flowchart of using systems biology methods to construct the candidate GWGEN, real GWGENs, core GWGENs, and core signaling pathways to find skin aging progression mechanisms for identifying essential biomarkers. After obtaining the essential biomarkers, we applied a trained deep neural network of drug–target interactions to predict the potential candidate drugs holding a higher interaction probability. To narrow down the candidate drugs, we considered drug regulation ability by querying the CMap dataset and drug sensitivity by referring to the sensitivity dataset from the DepMap portal. Consequently, we proposed two multiple-molecule drugs to mitigate skin aging from the young-adult to the middle-aged stage and from the middle-aged to the elderly stage.

**Figure 3.** The core genomewide genetic and epigenetic networks (GWGENs). (**a**) The core GWGEN of young-adult skin. The purple lines denote protein–protein interactions (PPIs). The green lines indicate transcriptional regulations by TFs and lncRNAs. The black lines represent post-transcriptional regulations by miRNAs. The total number of receptors, proteins, lncRNAs, TFs and miRNAs are 556, 3166, 136, 134 and 8, respectively. (**b**) The core GWGEN of middle-aged skin. The PPIs are in purple. The regulations from TFs and lncRNAs are in green. The black lines stand for the post-transcriptional regulations by miRNAs. The total number of receptors, proteins, lncRNAs, TFs and miRNAs are 499, 3190, 105, 117 and 10, respectively. (**c**) The core GWGEN of elderly-stage skin. The PPIs are shown in purple lines; regulations by TFs and lncRNAs are denoted in green; regulations from miRNAs are in black. The total number of receptors, proteins, lncRNAs, TFs and miRNAs are 546, 3220, 126, 102 and 6, respectively.

#### *2.1. Differential Core Signaling Pathways from Young-Adult to Middle-Aged Skin Aging*

The differential core signaling pathways from young-adult to middle-aged human skin were selected and analyzed as shown in Figure 4. According to our results, in the core signaling pathways of young-adult skin aging only, receptor ESR1 receives microenvironment factor FASN to activate the TF SIRT6 through signaling transduction proteins PRR4 and LMNA. The TF SIRT6 could not only downregulate target gene *RBBP8*, which was modified by deacetylation, but also activate TF PARP1 to upregulate target gene *XRCC1*, promoting cell proliferation and DNA repair. The receptor ESR1 also regulates TF JUN through signaling transduction proteins GOT1 and CHEK2 to regulate target genes *BRCA2* and *KPNA2*. TF JUN not only activates the target gene *BRCA2*, which was modified by phosphorylation, to promote DNA repair, but also downregulates the target gene *KPNA2* to promote cell proliferation, DNA repair and cell cycle in young-adult skin aging only.

**Figure 4.** The core signaling pathways from young-adult to middle-aged skin aging. The green dotted lines denote the signaling pathways in young-adult skin; the green dashed lines represent the signaling pathways in middle-aged skin; the green solid lines indicate the signaling pathways in both stages; the green lines with arrow heads are upregulation (positive regulation); the green lines with circular heads are downregulation (negative regulation); the black solid lines with arrow heads mean activating cellular function; the black solid lines with circular heads mean inhibiting cellular function; the selected red target gene nodes indicate a higher gene expression in middle-aged skin compared with young-adult skin; the selected green target gene nodes indicate a lower gene expression in middle-aged skin compared with young-adult skin; the blue background shows young-adult skin; the brown background shows the overlap between young-adult and middle-aged skin; the skin color background shows middle-aged skin.

Next, in the core signaling pathways of both young-adult and middle-aged skin aging in Figure 4, the microenvironment factor FASLG was received by receptor FAS to activate TF E2F7 via signaling transduction proteins DAXX, MAP3K5 and AIFM1 to downregulate target gene *APAF1* to promote apoptosis and cell-cycle arrest. In the next pathway, the receptor CXCR2 receives microenvironment factor CXCL1 to regulate TF E2F7, TF JUN, TF FOXM1 and miRNA MIR26B. First, the TF JUN was activated through signaling transduction proteins CENPJ and CDKN1A to activate TF FOXM1. As TF FOXM1, which was modified by phosphorylation, was activated, the target gene *CAT* was upregulated to inhibit ROS accumulation in both young-adult and middle-aged skin aging. The miRNA MIR26B was activated via signaling transduction proteins CENPJ, CDKN1A and AKT1 to inhibit target gene *KPNA2* so as to promote cell proliferation. Note that protein AKT1 was modified by phosphorylation. The TF E2F7 was also regulated by protein AIFM1 to downregulate target gene *APAF1* in both young-adult and middle-aged skin aging.

In the core signaling pathways of middle-aged skin aging only, as shown in Figure 4, the TF FOXM1 was also activated through signaling transduction proteins CENPJ, CDKN1A, CDK4 and DCDC2 when receptor CXCR2 received microenvironment factor CXCL1. With the activation of FOXM1, the target gene *CCNB1* was upregulated to promote cell-cycle arrest and inhibit ROS accumulation. For the next pathway in middle-aged skin aging only, the microenvironment factor IGF1 was received by receptor IGF1R to regulate TF FOXO3 via signaling transduction proteins HMGCS2, ARRB1, PDK1 and AKT1. Additionally, the protein AKT1 and TF FOXO3 were modified by phosphorylation. The TF FOXO3 downregulates not only target gene *SENS3* to promote DNA damage, apoptosis, and ROS accumulation, but also target gene *GADD45A* to promote DNA damage, apoptosis, and ROS accumulation and inhibit cell-cycle arrest.

In summary, as young-adult skin aging turns into middle-aged skin aging, DNA repair ability decreases and the cell cycle starts to be arrested. ROS accumulation thereby increases and further promotes DNA damage and apoptosis in skin cells. Additionally, these molecular progression mechanisms from the young-adult to the middle-aged stage might accelerate the skin aging process in elderly skin. According to the core signaling analysis results and considering the overlapping nodes between the GenAge and CMap datasets, we identified AIFM1, CAT, IGF1R, and LMNA as essential biomarkers for preventing skin aging from young adulthood to middle age.

#### *2.2. Differential Core Signaling Pathways from Middle-Aged to Elderly Skin Aging*

The molecular progression mechanism based on differential core pathways from middle-aged skin to elderly skin aging is represented in Figure 5. In the core pathways of middle-aged skin aging only, the TNF receptor superfamily member 1A (TNFRSF1A) received microenvironment factor TNF to activate TF GATA2 through transduction proteins GABPA and STAT1 to upregulate target gene *MMP9* so as to inhibit collagen stability and skin homeostasis. Note that STAT1 was modified by phosphorylation. In the next pathway, the microenvironment factor NGF was accepted by neurotrophic receptor tyrosine kinase 1 (NTRK1), which then transmitted the signal through transduction proteins EME1, HSPB1, NEDD9 and CPNE2 to activate TF ETS1. TF ETS1 could downregulate target gene *ERRFI1*, which was modified by hypermethylation, through activating miRNA MIR573 to promote homeostasis in middle-aged skin aging only.

Next, we focus on the core pathways of both middle-aged and elderly skin aging. In the first pathway, the receptor NTRK1 receives the microenvironment factor NGF and then transmits the signal through transduction proteins KPNA2, KAT5, CST2 and HRAS to activate TF GATA2. The protein KAT5 was modified by phosphorylation. After GATA2 was activated, target gene *BCL2* was upregulated to inhibit cell-cycle arrest and apoptosis in both middle-aged and elderly skin aging. For the second pathway, the receptor KIT could interact with the microenvironment factor KITLG to trigger TF AR through signaling transduction proteins FAM83H, HSPB1, PAX3 and H2AFB2. TF AR downregulated not only target gene *TYR* through triggering TF MITF to promote melanin synthesis, but also target gene *CDH1*, which was modified by phosphorylation, to promote cell cycle, apoptosis and DNA damage in both middle-aged and elderly skin aging. In the third pathway, receptor LRP1 could receive the microenvironment factor CYR61 to trigger the TF ETS1 via signaling transduction proteins RAMP1, MCM2, GEMIN4 and NOP56. In both middle-aged and elderly skin aging, the TF ETS1 could negatively regulate target gene *COL17A1* to promote melanin synthesis and inhibit collagen stability and skin homeostasis.

**Figure 5.** The core signaling pathways are obtained by projecting core GWGENs to KEGG pathways to investigate the aging progression mechanism from middle-aged to elderly skin aging. The green dotted lines denote the signaling pathways in middle-aged skin; the green dashed lines represent the signaling pathways in elderly skin; the green solid lines indicate the signaling pathways in both stages; the green lines with arrow heads are upregulation; the green lines with circle heads are downregulation; the black solid lines with arrow heads mean activating cellular function; the black solid lines with circle heads mean inhibiting cellular function; the selected red target gene nodes indicate a higher gene expression in elderly skin compared with middle-aged skin; the selected green target gene nodes indicate a lower gene expression in elderly skin compared with middle-aged skin; the blue background shows middle-aged skin; the brown background covers the overlap between middle-aged and elderly skin; the skin color background shows elderly skin.

In the core pathways of elderly skin aging only in Figure 5, CYR61/LRP1 could also trigger TF NDUFS4 through signaling transduction proteins CPNE2, MYH9 and ERCC6. The activated TF NDUFS4 might downregulate target gene *CASP3* to inhibit apoptosis and promote DNA damage. For another pathway, the microenvironment factor IL6 was accepted by receptor IL6R to trigger TF YAP1 through signaling transduction proteins RHOB and CDK20. In the elderly skin aging only, TF YAP1 activated target gene *CDC5L* through inhibiting MIR126 to inhibit cell cycle and promote skin homeostasis and DNA damage.

In summary, for skin aging molecular progression mechanisms from middle age to old age, we found that the promotion of cell cycle process, the inhibition of apoptosis, and the damage of DNA arose in elderly skin. Furthermore, skin homeostasis and collagen stability were destroyed to cause lower immunity and epidermal thinning, that is, the increment of wrinkles. According to core signaling analyses and considering the overlap nodes between the GenAge and CMap datasets, we identified MMP9, IL6, BCL2, and CASP3 as essential biomarkers for preventing skin aging from middle age to old age. Moreover, by extracting differential core signaling pathways from young-adult to elderly skin aging, some cellular dysfunctions including proliferation, DNA repair and damage, cell-cycle arrest, apoptosis, ROS accumulation, collagen stability, skin homeostasis, and melanin synthesis are induced in the skin aging process shown in Figure 6.

#### *2.3. The Application of Deep Neural Network of Drug–Target Interaction Prediction and the Design of Two Filters Considering Drug Regulation Ability and Drug Sensitivity*

To explore the drug–target interaction toward our identified biomarkers, we trained a deep neural network for drug–target interaction prediction. The design framework is shown in Figure S4. The interaction dataset used for training is from BindingDB [35]. In total, there are 80,291 known drug–target interactions between 38,015 drugs and 7292 proteins. The number of unknown drug–target interactions is 19,966,109, which is far greater than the number of known drug–target interactions. To address this class imbalance, we randomly sampled unknown interactions so that their number equaled that of the known interactions. We trained the model using 70% of the data, including 10% of the data as the validation set. The remaining 30% of the data was used as the testing set. For data preprocessing before training, we performed feature scaling by standardization. We then applied principal component analysis (PCA) for dimensionality reduction, retaining 1000 out of 1359 features. For the deep neural network of drug–target interaction, we used Adam as the optimizer (learning rate = 0.003) with binary cross-entropy loss. The input layer had 1000 neurons, followed by hidden layers of 512, 256, 128, and 64 neurons, respectively. The output layer had one neuron. We used the sigmoid function for the output layer and the nonlinear activation function ReLU for each hidden layer. Moreover, dropout rates of 0.5, 0.4, 0.3, and 0.1 were applied to the hidden layers, respectively. For the trained deep neural network of drug–target interaction prediction, the training accuracy, validation accuracy, and testing accuracy were 95.469%, 93.230%, and 93.077%, respectively. We then used the trained network to predict the potential candidate drugs for our identified biomarkers. A candidate drug was selected when its score approached one. In other words, the higher the score, the higher the probability of interaction between the candidate drug and the identified biomarker.
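The data preparation and network dimensions described above can be sketched with the standard library alone. This is an illustrative outline, not the training code: the helper names are hypothetical, the split is read here as 60% training, 10% validation and 30% testing of the balanced data (one possible reading of "70% of data, including 10% as the validation set"), and the parameter count covers only the fully connected layers with biases.

```python
import random

LAYER_SIZES = [1000, 512, 256, 128, 64, 1]  # input, four hidden layers, output
DROPOUT_RATES = [0.5, 0.4, 0.3, 0.1]        # one rate per hidden layer


def balance_by_undersampling(known_pairs, unknown_pairs, seed=0):
    """Label known pairs 1 and an equally sized random sample of unknown pairs 0."""
    rng = random.Random(seed)
    negatives = rng.sample(unknown_pairs, len(known_pairs))
    data = [(pair, 1) for pair in known_pairs] + [(pair, 0) for pair in negatives]
    rng.shuffle(data)
    return data


def split_dataset(data):
    """Split into (train, validation, test) as 60% / 10% / 30% (assumed reading)."""
    n_test = len(data) * 3 // 10
    n_val = len(data) // 10
    test = data[:n_test]
    validation = data[n_test:n_test + n_val]
    train = data[n_test + n_val:]
    return train, validation, test


def dense_parameter_count(layer_sizes):
    """Number of weights and biases in a fully connected network."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Toy drug-target pairs with placeholder names, purely illustrative.
known = [(f"drug_{i}", "protein_a") for i in range(10)]
unknown = [(f"drug_{i}", "protein_b") for i in range(50)]
dataset = balance_by_undersampling(known, unknown)
train, validation, test = split_dataset(dataset)
n_parameters = dense_parameter_count(LAYER_SIZES)
```

With the layer sizes stated in the text, the fully connected layers hold 685,057 trainable parameters, most of them (512,512) in the first hidden layer.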

**Figure 6.** The overview of human skin aging molecular progression mechanisms from young-adult to middle-aged and then elderly skin aging. This figure summarizes the genetic and epigenetic progression mechanisms of skin aging in Figures 4 and 5. The upper horizontal part is the genetic and epigenetic progression mechanism from young-adult skin to middle-aged skin; the middle part indicates the genetic and epigenetic progression mechanisms from middle-aged skin to elderly skin; the red rectangles with orange background represent cellular functions; the yellow ellipses are microenvironment factors; the red dashed lines surround the pathways and biomarkers that appear in two consecutive stages of skin; the black arrow lines represent protein–protein interaction or transcriptional regulation; the black lines with circle heads represent inhibition or downregulation; the red arrow lines represent genes that induce a cellular function; the red lines with circle heads represent genes that repress a cellular function.

In order to narrow down the candidate drugs predicted by the deep neural network framework based on the identified biomarkers, we designed two filters considering drug regulation ability and drug sensitivity. With the help of the CMap dataset, we could determine whether a gene was upregulated or downregulated after treatment with a small-molecule compound. Abnormally high or low gene expression could be found by comparing the gene expression of the identified biomarkers with that in the later skin stage. The goal of the first filter is to select candidate drugs that could reverse the abnormal gene expression. Afterwards, we used the drug sensitivity dataset (PRISM Repurposing Primary Screen) to consider drug sensitivity. The second filter aims to find drugs with sensitivity values around zero, implying that they would not influence the cell line too much, because the aim is neither to kill the cells of the corresponding skin cell line nor to make them proliferate. Consequently, we proposed niridazole, liothyronine, decitabine, pinacidil, and allantoin as a multiple-molecule drug for mitigating skin aging from young adulthood to middle age; and allantoin, diclofenac, mepyramine, resveratrol, and azathioprine as a multiple-molecule drug for mitigating skin aging from middle age to old age. The drug targets with their corresponding drugs are shown in Tables S4 and S5.
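The logic of the two filters can be sketched as follows. This is a simplified illustration under stated assumptions: expression changes and drug effects are encoded as +1 (up), -1 (down) or 0 (no change), the sensitivity value is a single number per drug, and the tolerance defining "around zero" is a hypothetical parameter, since no cut-off is given in the text. Drug and gene names are placeholders, not results.

```python
def reverses_abnormal_expression(stage_change, drug_effect):
    """Filter 1: the drug moves the biomarker opposite to its change in the later stage."""
    return stage_change != 0 and drug_effect == -stage_change


def has_near_zero_sensitivity(value, tolerance):
    """Filter 2: the drug barely affects viability of the corresponding cell line."""
    return abs(value) <= tolerance


def apply_filters(candidates, stage_change, drug_effect, sensitivity, tolerance=0.5):
    """Keep (drug, target) pairs that pass both filters.

    candidates:   list of (drug, target) pairs predicted by the network
    stage_change: {target: +1 or -1} expression change in the later skin stage
    drug_effect:  {(drug, target): +1, -1 or 0} expression change after treatment
    sensitivity:  {drug: float} sensitivity value; tolerance is hypothetical
    """
    kept = []
    for drug, target in candidates:
        effect = drug_effect.get((drug, target), 0)
        if not reverses_abnormal_expression(stage_change[target], effect):
            continue
        if drug not in sensitivity:
            continue
        if not has_near_zero_sensitivity(sensitivity[drug], tolerance):
            continue
        kept.append((drug, target))
    return kept


# Placeholder inputs, purely illustrative.
candidates = [("drug_a", "GENE_X"), ("drug_b", "GENE_X"), ("drug_c", "GENE_X")]
stage_change = {"GENE_X": +1}  # biomarker is higher in the later stage
drug_effect = {("drug_a", "GENE_X"): -1, ("drug_b", "GENE_X"): +1, ("drug_c", "GENE_X"): -1}
sensitivity = {"drug_a": -0.1, "drug_b": 0.0, "drug_c": -2.3}
selected = apply_filters(candidates, stage_change, drug_effect, sensitivity)
```

In this toy case only `drug_a` survives: `drug_b` fails the first filter because it pushes expression in the same direction as aging, and `drug_c` fails the second because its sensitivity value is far from zero.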

#### **3. Discussion**

#### *3.1. Investigating Skin Aging Molecular Progression Mechanisms by Differential Core Signaling Pathways from Young-Adult to Middle-Aged Human Skin Aging*

In the first core pathway of young-adult skin aging only, as shown in Figure 4, the microenvironment factor fatty acid synthase FASN can promote cell proliferation, DNA repair, and cell-cycle arrest and interact with receptor ESR1 via crucial signaling transduction proteins PRR4, LMNA, GOT1, and CHEK2 to regulate TFs SIRT6, PARP1, and JUN. The signaling protein LMNA, which is an endogenous activator of TF SIRT6, could promote SIRT6-mediated downstream functions upon DNA damage. Moreover, protein LMNA could directly bind and activate TF SIRT6 toward histone deacetylation [36]. The TF SIRT6, which could control the longevity and regulation of DNA repair, could promote DNA repair and cell proliferation through the downregulation of the target gene *RBBP8*, which was mediated by deacetylation [37]. TF SIRT6 could also promote DNA repair under oxidative stress by activating TF PARP1 to upregulate target gene *XRCC1* [38]. TF PARP1 serves as a genomic caretaker by participating in several molecular mechanisms such as DNA repair and cell-cycle regulation. Therefore, PARP1 was considered as a longevity assurance and aging-promoting factor [39]. The target gene *XRCC1* upregulated by PARP1 was required for the viable and efficient repair of DNA single-strand breaks [40]. The TF JUN was also activated by the signaling transduction proteins GOT1 and CHEK2. The signaling transduction protein CHEK2 initiated by oxidative stress could regulate target genes *BRCA2* and *KPNA2* through interacting with TF JUN. In human cells, the serine kinase CHEK2 could induce the appropriate cellular response, such as cell cycle checkpoint activation and DNA repair, depending on the extent of damage, the cell type, and other factors. CHEK2 could participate in DNA repair by phosphorylating the target gene *BRCA2* through TF JUN [41]. The expression of karyopherin alpha 2 (*KPNA2*) has been reported to be induced in various proliferative skin disorders such as psoriasis and squamous cell carcinoma [42].
When the target gene *KPNA2* was downregulated by TF JUN, cell proliferation, cell cycle and DNA repair induced by CHEK2 were indirectly promoted.

In the core pathways of both young-adult and middle-aged skin aging, the microenvironment factor FASLG binds receptor FAS to regulate TF E2F7 through signaling transduction proteins DAXX, MAP3K5 and AIFM1. Responding to ROS, the microenvironment factor FASLG was activated and then bound to the death receptor FAS to promote the apoptosis pathway [43]. Signaling transduction protein MAP3K5, known as apoptosis signal-regulating kinase 1 (ASK1), could respond to oxidative stress and be activated [44]. Target gene *APAF1*, which is the core of the apoptosome, was activated by TF E2F7 to trigger the mitochondrial apoptotic pathway. Furthermore, target gene *APAF1* was also involved in the maintenance of genomic stability by the cell-cycle arrest response elicited upon DNA damage and promoted apoptosis [45]. For other pathways in both young-adult and middle-aged skin aging, microenvironment factor CXCL1 bound the G-protein coupled receptor CXCR2 to activate signaling transduction protein CDKN1A through protein CENPJ. Protein CDKN1A, known as CIP1, is a potent cyclin-dependent kinase inhibitor that regulates TFs E2F7, JUN, FOXM1, and miRNA MIR26B. First, protein CDKN1A transmits the signal to AIFM1 so as to activate TF E2F7 to enhance apoptosis and cell-cycle arrest. TF JUN was also activated by protein CDKN1A to regulate FOXM1. Then TF FOXM1 upregulated target gene *CAT*, which is known as a ROS detoxification enzyme and could defend against ROS accumulation [46]. After miRNA MIR26B inhibited the target gene *KPNA2*, cell proliferation could be promoted [47].

In the core pathways of middle-aged skin aging only, microenvironment factor CXCL1 could also regulate TF FOXM1 through signaling transduction proteins CENPJ, CDKN1A, CDK4, and DCDC2 to trigger target gene *CCNB1*, as shown in Figure 4. Cyclin-dependent kinase 4 (CDK4), which was modified by phosphorylation, is a positive regulator of cell cycle entry and can stabilize and activate FOXM1, thereby promoting the cell cycle and suppressing the levels of reactive oxygen species [48]. TF FOXM1 has also been reported to be essential for proper cell cycle progression via activating the cell cycle gene *CCNB1* for propelling a specific cell cycle phase and inhibiting ROS accumulation [49]. For another pathway, the microenvironment factor IGF1 was received by receptor IGF1R to activate FOXO3 via signaling transduction proteins HMGCS2, ARRB1, PDK1 and AKT1. The protein AKT1, which was modified by phosphorylation, could activate TF FOXO3. Protein phosphoinositide-dependent kinase PDK1 was one of the upstream kinases that activate AKT1. After AKT1, which is a key regulator of the PI3K/AKT1 signaling cascade controlling cell growth and survival, was activated and modified by phosphorylation, TF FOXO3 would be activated [50]. Moreover, it has been reported that enhanced ROS production might further activate the signal of the PI3K/AKT pathway, thus establishing a self-perpetuating cycle leading to further aging [51]. TF FOXO3, which was modified by phosphorylation, could downregulate target genes *SESN3* and *GADD45A* [52]. TF FOXO3 could weaken the ROS rescue pathway by downregulating the peroxiredoxin gene *SESN3*, which is responsible for the biphasic ROS accumulation. Therefore, FOXO3-induced ROS was increased and then accelerated apoptosis and DNA damage [53]. Furthermore, phosphorylated FOXO3 also inhibited proapoptotic activity such as cell-cycle arrest by downregulating *GADD45A* [54].
The cause of the pleiotropic action of *GADD45* members, a decreased inducibility, might lead to far-reaching consequences such as DNA damage accumulation and disorder of cellular homeostasis and could eventually contribute to the aging process [55]. Therefore, we suggest that the downregulation of *GADD45A* also promotes ROS accumulation through cell-cycle arrest and the inhibition of proapoptotic activity.

#### *3.2. Investigating Skin Aging Molecular Progression Mechanisms by Differential Core Signaling Pathways from Middle-Aged to Elderly Human Skin Aging*

According to the core pathways of middle-aged skin aging only in Figure 5, the ligand TNF can inhibit collagen stability and skin homeostasis through receptor TNFRSF1A by transmitting the signal through significant signaling transduction proteins GABPA and STAT1 to TF GATA2. The proinflammatory cytokine tumor necrosis factor-alpha (TNF-A) inhibits collagen synthesis and enhances collagen degradation via increasing the production of target gene *MMP9*. It also increases the risk of cutaneous infections in the elderly by reducing skin immunity [56]. The activation of STAT1 is modified by phosphorylation. STAT1 has also been indicated as a potential target in the treatment of psoriasis, which is a chronic skin diseases [57]. TF GATA2 could upregulate target gene *MMP9* to digest collagen type IV, which is an important component of the basement membrane in skin [58].

For the next pathway, ligand NGF can promote skin homeostasis through receptor NTRK1 by transmitting the signal via signaling transduction proteins EME1, HSPB1, NEDD9, and CPNE2 to upregulate TF ETS1. In human skin, proliferating keratinocytes release NGF in an increasing amount. Receptor NTRK1, known as tyrosine kinase receptor (TrkA), is the high-affinity receptor for NGF. At the skin level, NTRK1 could mediate NGF-induced keratinocyte proliferation [59]. Note that protein NEDD9 could be modified by phosphorylation in human skin [60]. TF ETS1, which was regulated through the signaling pathway activated by ligand NGF, has been identified to be associated with skin aging [60]. The expression levels of MIR-573 were found to be lower in melanoma tissues and cell lines when compared to normal skin tissue. Moreover, MIR-573 reduction was demonstrated to be essential in melanoma initiation and progression [61]. Target gene *ERRFI1*, which was modified by hypermethylation, is required for proper epidermal homeostasis [62,63].

Focusing on core pathways in both middle-aged and elderly skin aging in Figure 5, the ligand NGF inhibits apoptosis and promotes cell-cycle arrest when received by receptor NTRK1 to activate TF GATA2 via signaling transduction proteins KPNA2, KAT5, CST2, and HRAS. Protein KAT5, which was modified by phosphorylation, has been presumed to serve as a potential biomarker for melanoma therapeutic target [64]. NGF can not only rescue human epidermal keratinocytes from spontaneous and UVB-induced apoptosis via NTRK1, but also protect keratinocytes from cell death via target gene *BCL2* family of apoptosis inhibitors [59]. Antiapoptotic function of target gene *BCL2* is regulated by phosphorylation. In addition, target gene *BCL2* could not only regulate cell cycle progression, but also act as an antioxidant that may regulate intracellular ROS. Expression of target gene *BCL2* has been observed to increase upon the induction of a senescence-like growth arrest or apoptosis by oxidative stress [65,66].

In the next core pathway of both middle-aged and elderly skin aging, the ligand KITLG promotes melanin synthesis, DNA damage, and inhibits cell-cycle arrest by modulating TFs AR and MITF via signaling transduction proteins FAM83H, HSPB1, PAX3, ATF5, and H2AFB2. The tyrosine kinase receptor KIT, its ligand KITLG, and TF MITF have been reported to play an important role of initiating and regulating signaling systems and transcription factors of melanin production. TF MITF also regulates melanocyte pigmentation by inducing target gene *TYR* [67]. Moreover, a previous study supposed that PAX3 and SOX10 could act together to induce the expression of MITF [68]. Target gene *CDH1*, which is downregulated by TF AR, has been reported to be regulated by phosphorylation [69]. It has been reported that cells lacking target gene *CDH1* have a shortened G1 phase, accumulate DNA damage, and undergo apoptosis [70].

In the final core pathways of both middle-aged and elderly skin aging, the ligand CYR61 could modify skin homeostasis and melanin synthesis through receptor LRP1 to transmit signal by signaling transduction proteins RAMP1, MCM2, GEMIN4 and NOP56 for upregulating TF ETS1. Responding to oxidative stress, CYR61 was elevated in the dermis of chronologically aged human skin, promoting aberrant collagen homeostasis by downregulating collagen members, the major structural protein in skin, to promote collagen degradation [71,72]. The loss of target gene *COL17A1* and MCM2 expression in advanced aged skin has been found to eventually cause epidermal thinning [73].

Focusing on the first core pathway of elderly skin aging only in Figure 5, through the signaling transduction starting from the ligand CYR61, TF NDUFS4 can promote apoptosis and DNA damage through signaling transduction proteins CPNE2, MYH9 and ERRC6. The ligand CYR61 interacting with receptor LRP1 has also been indicated to contribute to CCN1-induced ROS accumulation and CCN1/TNFA-induced apoptosis [74]. With the downregulation of target gene *CASP3* by TF NDUFS4, senescent fibroblasts can resist apoptotic death [75].

In the final core pathway of elderly skin aging only, the ligand IL6 could be accepted by receptor IL6R and then the significant signal is transmitted through signaling transduction proteins RHOB and CDK20 to activate TF YAP1. Proinflammatory cytokine IL6 has been suggested to be a biomarker of health status in the elderly [76]. TF YAP1 has been identified to play a physiological role in skin homeostasis, which can promote cell proliferation in the basal layer [77]. Knockdown of target gene *CDC5L* induces mitotic arrest and DNA damage [78].

#### *3.3. The Genetic and Epigenetic Molecular Progression Mechanisms from Young-Adult to Elderly Human Skin Aging*

The overview of overall skin aging molecular progression mechanisms is shown in Figure 6. Microenvironments trigger corresponding ligand signals to initiate some cellular dysfunctions affecting skin aging progression. Thus, core signaling pathways with the genetic and epigenetic modifications play a significant role in cellular dysfunctions of signaling transductions for each stage of skin aging.

In Figure 6, in the core pathways of young-adult skin aging only, ligand FASN (oxidative stress) binds to receptor ESR1 to mediate two pathways. Responding to the DNA damage signal caused by oxidative stress, LMNA directly binds and activates TF SIRT6 toward histone deacetylation [36]. Activated SIRT6 promotes DNA repair, cell cycle, and proliferation by downregulating gene *RBBP8*, which was modified by deacetylation [37]. In addition, TF SIRT6 also promotes DNA repair and cell cycle under oxidative stress by activating TF PARP1 to upregulate target gene *XRCC1* [38]. Transduction protein CHEK2 also responds to oxidative stress from ligand FASN to activate TF JUN. TF JUN promotes DNA repair through phosphorylating target gene *BRCA2* [41]. Moreover, TF JUN could downregulate target gene *KPNA2* to promote cell proliferation, cell cycle and DNA repair [42].

In the core pathways of both young-adult and middle-aged skin aging, ligand CXCL1 (oxidative stress) binds to receptor CXCR2 to trigger protein CDKN1A. Activated protein CDKN1A not only upregulates target gene *CAT* to defend against ROS accumulation through TFs JUN and FOXM1, but also inhibits target gene *KPNA2* to promote cell proliferation by activating MIR26B [46,47]. Next, responding to ROS induced by DNA damage, the ligand FASLG interacts with FAS to initiate the apoptosis pathway. Signaling transduction protein MAP3K5, activated by oxidative stress, triggers TF E2F7 to downregulate target gene *APAF1*, which is thereby involved in the maintenance of cell-cycle arrest upon DNA damage and in promoting apoptosis [43–45]. Hence, in order to counter the excessive accumulation of ROS as the ability of DNA repair decreases from young adulthood to middle age, the functions of apoptosis and cell-cycle arrest are enhanced.

In the core pathways of middle-aged skin aging only, the ligand CXCL1 (oxidative response) interacts with receptor CXCR2 and also activates signaling transduction protein CDK4. Phosphorylation of CDK4 positively regulates cell cycle entry and can stabilize and activate FOXM1 to upregulate target gene *CCNB1* to promote cell cycle phase and suppress the level of ROS [48,49]. Moreover, the ligand IGF1 (oxidative response) is received by receptor IGF1R to activate PI3K/AKT signaling pathway. Transduction protein AKT1 can activate TF FOXO3 through the modification by phosphorylation. Furthermore, TF FOXO3, which is modified by phosphorylation, downregulates genes *SESN3* and *GADD45A*. With the silence of target gene *SESN3*, ROS rescue pathway is declined, thus accelerating apoptosis with the increment of FOXO3-induced ROS. In addition, TF FOXO3 downregulates target gene *GADD45A* to promote ROS accumulation through cell-cycle arrest. The ligand TNF (proinflammatory cytokine) can inhibit collagen stability and skin homeostasis through activating TNFRSF1A and initiate the corresponding pathway. TF GATA2 was activated by phosphorylated transduction protein STAT1 to upregulate target gene *MMP9*. Increased gene *MMP9* can inhibit collagen synthesis and enhance collagen degradation. The ligand NGF (proliferating keratinocytes) interacts with receptor NTRK1 to activate TF ETS1, which was identified to be associative with skin aging [59,60]. MIR573 is activated by TF ETS1 and then negatively regulates gene *ERRFI1*. Downregulated gene *ERRFI1*, which was modified by hypermethylation, can maintain proper epidermal homeostasis [62,63].

In the core pathways of both middle-aged and elderly skin aging, the ligand NGF/NTRK1 (signal of positive regulation of Ras signaling pathway) activates TF GATA2 so as to regulate gene *BCL2* to protect keratinocytes from cell death [59]. Target gene *BCL2* not only triggers antiapoptotic function through the modification by phosphorylation, but also regulates cell cycle progression to act as an antioxidant of intracellular ROS [65,66]. The ligand KITLG interacts with receptor KIT to promote melanin synthesis and DNA damage and to inhibit cell-cycle arrest by activating TF MITF through signaling transduction. TF MITF regulates melanocyte pigmentation by inducing gene *TYR* [67]. Due to the downregulation by TF AR of gene *CDH1*, which is modified by phosphorylation, the G1 phase is shortened, so that cells accumulate DNA damage and undergo apoptosis [69,70]. Next, the ligand CYR61 (oxidative stress) is received by receptor LRP1 to downregulate collagen members and promote collagen degradation [71,72]. After TF ETS1 is activated by signaling transduction, target gene *COL17A1* can be downregulated to destroy the balance between collagen stability and skin homeostasis and eventually cause epidermal thinning [73].

In the core pathways of elderly skin aging only, the ligand CYR61 (oxidative stress) interacts with receptor LRP1 so as to contribute to the CCN1-induced ROS accumulation. When target gene *CASP3* is downregulated by TF NDUFS4, senescent fibroblasts could resist apoptotic death [74,75]. Moreover, the ligand IL6 interacts with receptor IL6R to activate TF YAP1, which could promote cell proliferation in the basal layer and has been identified to play a physiological role in skin homeostasis [77]. Note that IL6 has been suggested as a biomarker of elderly health status [76]. After TF YAP1 downregulates target gene *CDC5L* by activating MIR126, mitotic arrest and DNA damage are induced [78].

#### *3.4. Two Multiple-Molecule Drugs Based on Identified Biomarkers to Mitigate Human Skin Aging*

For mitigating skin aging from young adulthood to middle age, we proposed a multiple-molecule drug including niridazole, liothyronine, decitabine, pinacidil, and allantoin. The drug targets were AIFM1, CAT, IGF1R, and LMNA, as shown in Table 2. A black dot in Table 2 indicates which identified biomarker (drug target) a proposed small molecule targets. For instance, niridazole has a higher potential to target AIFM1 and CAT. Niridazole, an antiparasitic drug, could suppress delayed dermal hypersensitivity [79]. Studies have shown that combined therapy with liothyronine improved the treatment of hypothyroidism [80,81]. Decitabine, a DNA methyltransferase inhibitor, induced changes in gene expression and cellular behavior associated with a regenerative response. Furthermore, wounds treated with decitabine were able to participate in regeneration [82]. Pinacidil is an effective antihypertensive drug for the treatment of mild to moderate essential hypertension [83]. Meanwhile, according to the findings of one study, pinacidil may be utilized to prevent UV-induced skin aging [84]. It is noted that allantoin, which is found in plants like chamomile, wheat sprouts, sugar beet, and comfrey, has been widely used in anti-aging serum [85,86]. Allantoin is also a well-known anti-irritating and hydrating agent as well as a peeling agent for skin [87,88].


**Table 2.** Drug targets and multiple-molecule drugs for preventing skin aging from young adulthood to middle-age.

•: Proposed small molecules target to the identified biomarkers (drug targets).

For mitigating skin aging from middle age to old age, we proposed a multiple-molecule drug consisting of allantoin, diclofenac, mepyramine, resveratrol, and azathioprine. The drug targets were MMP9, IL6, BCL2, and CASP3, as shown in Table 3. In Table 3, a black dot marks the drug target of each specific small molecule. For example, the drug targets for allantoin are MMP9 and IL6. Diclofenac is a nonsteroidal anti-inflammatory drug. It has been used to treat actinic keratoses developing in fair-skinned individuals with a history of overexposure to ultraviolet light [89]. Mepyramine works by preventing the action of histamine, which is a compound produced by the body when getting venom from insect bites [90]. Moreover, one study mentioned that stimulation by histamine would upregulate matrix metalloproteinase 9 (MMP9), which is also our proposed drug target for mepyramine [91]. Resveratrol is abundant in grape skin and seeds [92]. Responding to infection, stress, injury, bacteria or fungal infections, and UV-irradiation, it is a popular ingredient in skincare products [93]. In the field of dermatology, azathioprine is an effective immunosuppressant that is extremely valuable in treating pemphigoid, generalized eczematous disorders, and actinic dermatitis [94]. Taken together, most of the proposed small-molecule compounds are approved by the U.S. Food and Drug Administration (FDA). Drug repurposing, which identifies new uses of old drugs, combined with the proposed systems biology approaches might provide an alternative way to find effective drugs for mitigating skin aging.

**Table 3.** Drug targets and multiple-molecule drugs for preventing skin aging from middle age to old age.


•: Proposed small molecules target to the identified biomarkers (drug targets).

#### *3.5. The Limitations and Advantages to the Proposed Systems Medicine Design Procedure for Human Skin Aging*

Gene expression has been widely used to infer other molecular type measures, such as proteomics, copy number variation, and mutation. In this study, we used human skin microarray data processed with cubic spline interpolation to help us construct GWGENs by the system identification method via solving a constrained linear least-squares estimation problem. After that, we computed Akaike's information criterion (AIC) for each gene to prune false positives. By increasing samples through data interpolation and computing AIC for detecting the real system order, we addressed the overfitting issue. Even though we applied AIC and performed data interpolation to increase the sample size in each skin aging stage, it is noted that the estimated real GWGENs are near-optimum solutions but not unique solutions. Furthermore, we include the basal level in protein, gene, miRNA, and lncRNA systems modeling. These terms represent unknown interactions, epigenetic modification, or mutation. If we found a basal level change higher than a threshold, we inferred that the corresponding node was influenced by epigenetic modification or mutation. These findings have to be verified by a literature survey. Based on the molecular progression mechanisms in each skin aging stage, we could identify essential biomarkers. For exploring the drug–target interactions of our identified biomarkers, we trained a deep neural network of drug–target interaction in advance. In the drug–target data that we used to train the prediction model, pairs that have not been recorded as known interactions in BindingDB were assigned to the group of negative samples, meaning no interaction. However, a negative sample in our study does not necessarily mean that there is no interaction; it might merely lack experimental evidence or records to date.
Although the proposed systems medicine design procedure has the aforementioned limitations, it still provides another viewpoint to shed light on human skin aging progression at the system level. Moreover, a drug repurposing strategy, giving new uses for old drugs, has been used in this study. Most of the suggested small molecules are approved by the FDA, which could shorten the time of clinical trials. Integrating systems biology approaches, a deep learning framework and the design of two filters, we not only transferred biological knowledge into engineering interpretation but also applied them to drug discovery efficiently.

#### **4. Materials and Methods**

#### *4.1. Overview of Systems Medicine Design Procedure of Human Skin Aging*

In order to further understand skin aging molecular mechanisms from young adulthood to old age, we proposed a research flowchart as shown in Figure 2. At first, we collected several regulation and interaction databases, including DIP [95], IntAct [96], BioGRID [97], BIND [98], MINT [99], HTRIdb [100], ITFP [101], Transfac [102], CircuitDB2 [103], and TargetScan [104], to construct the candidate GWGEN, which is composed of a candidate protein-protein interaction network (PPIN) and a candidate gene regulatory network (GRN). Moreover, the candidate GWGEN is a Boolean matrix: an entry is one if the two corresponding nodes interact and zero otherwise. With three-stage preprocessed microarray data, we then identified real GWGENs by the system identification method and the system order detection scheme. Since real GWGENs are still too complicated for investigating the skin aging progression mechanisms, we applied the principal network projection (PNP) method to extract core GWGENs from real GWGENs based on the projection values. Subsequently, we denoted the core signaling pathways in the style of KEGG pathways. According to the core signaling pathways, we investigated skin aging molecular mechanisms and identified essential biomarkers for young adulthood to middle age and middle age to old age, respectively. After that, we used the trained deep neural network of drug-target interaction to predict potential candidate drugs with a higher probability of interacting with the identified biomarkers. To narrow down the candidate drugs, we designed two filters considering drug regulation ability and drug sensitivity by CMap [34] and the PRISM Repurposing dataset [105]. Consequently, we proposed two multiple-molecule drugs for slowing down human skin aging from young adulthood to middle age and from middle age to old age, respectively.
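As an illustration of the Boolean candidate network described above, the following sketch builds a 0/1 matrix from interaction pairs pooled from several sources. It is only a toy example: the function name `candidate_matrix`, the row and column orientation (row for the regulated node, column for the regulator), and the example pairs are assumptions for illustration and are not taken from the databases listed above.

```python
from typing import Iterable, List, Tuple


def candidate_matrix(nodes: List[str],
                     edges: Iterable[Tuple[str, str]]) -> List[List[int]]:
    """Return a Boolean (0/1) matrix: entry [target][source] is 1 if source acts on target."""
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target in edges:
        # Pairs that mention a node outside the node list are ignored.
        if source in index and target in index:
            matrix[index[target]][index[source]] = 1
    return matrix


# Hypothetical pooled pairs, e.g. the union of records from several databases.
nodes = ["SIRT6", "PARP1", "XRCC1"]
database_1 = [("SIRT6", "PARP1")]
database_2 = [("PARP1", "XRCC1"), ("PARP1", "NOT_IN_LIST")]
pooled = set(database_1) | set(database_2)

for row in candidate_matrix(nodes, pooled):
    print(row)
```

Entries equal to one are the candidate interactions that the subsequent system identification and order detection steps keep or prune.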

#### *4.2. Data Preprocessing of Human Skin Microarray Data*

We obtained human skin microarray data from GSE18876 containing the gene expression levels of male skin. It included 50 ages in the range from 19 to 86 years old with 29,226 probes. One study has shown that *OR52N2*, *SIRT6*, *CPT1B*, *TUBAL3*, *COL1A1* and *MATN4* were significantly regulated with age. Furthermore, it also indicated that gene expressions of *OR52N2*, *SIRT6* and *CPT1B* increased with age and gene expressions of *TUBAL3*, *COL1A1* and *MATN4* decreased with age [106]. Therefore, we plotted the changes of gene expression levels of these typical genes. Based on this line graph and the gene expression trend in the aforementioned study, we defined young-adult, middle-aged and elderly skin as 19 to 45 years old, 43 to 65 years old and 64 to 86 years old, respectively. That is, the averages of gene expressions of *OR52N2*, *SIRT6* and *CPT1B* increased and the averages of gene expressions of *TUBAL3*, *COL1A1* and *MATN4* decreased from the young-adult stage to middle age, and then to old age in human male skin. In an estimation problem, one easily faces an overfitting issue when the sample size is small and the feature size is big [107]. Hence, in this study, firstly, we increased the sample size to 500 for each skin aging stage by performing cubic spline data interpolation via *spline*, a MATLAB function [108–110]. Secondly, we utilized the system order detection scheme by computing the AIC value to prune the false positives in the candidate GWGEN for finding the real GWGENs of the human skin aging systems. More details are discussed in Section 4.5.
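The interpolation step can be sketched in a few lines. The example below is an illustration under stated assumptions rather than a reproduction of our pipeline: it implements a natural cubic spline (zero second derivative at both ends), whereas the MATLAB *spline* function uses not-a-knot end conditions by default, so values near the boundaries can differ. The function names and the toy expression values are hypothetical, and the ages are assumed to be strictly increasing.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple


def natural_spline_second_derivs(x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Second derivatives at the knots of a natural cubic spline (x strictly increasing)."""
    n = len(x)
    sub, diag, sup, rhs = [0.0] * n, [1.0] * n, [0.0] * n, [0.0] * n
    for i in range(1, n - 1):
        h0, h1 = x[i] - x[i - 1], x[i + 1] - x[i]
        sub[i], diag[i], sup[i] = h0, 2.0 * (h0 + h1), h1
        rhs[i] = 6.0 * ((y[i + 1] - y[i]) / h1 - (y[i] - y[i - 1]) / h0)
    # Thomas algorithm: forward elimination, then back substitution.
    for i in range(1, n):
        w = sub[i] / diag[i - 1]
        diag[i] -= w * sup[i - 1]
        rhs[i] -= w * rhs[i - 1]
    m = [0.0] * n
    m[n - 1] = rhs[n - 1] / diag[n - 1]
    for i in range(n - 2, -1, -1):
        m[i] = (rhs[i] - sup[i] * m[i + 1]) / diag[i]
    return m


def spline_eval(x: Sequence[float], y: Sequence[float], m: Sequence[float], t: float) -> float:
    """Evaluate the spline at t using the knot interval that contains t."""
    i = min(max(bisect_right(x, t) - 1, 0), len(x) - 2)
    h = x[i + 1] - x[i]
    a, b = x[i + 1] - t, t - x[i]
    return ((m[i] * a ** 3 + m[i + 1] * b ** 3) / (6.0 * h)
            + (y[i] / h - m[i] * h / 6.0) * a
            + (y[i + 1] / h - m[i + 1] * h / 6.0) * b)


def resample(x: Sequence[float], y: Sequence[float], n_points: int) -> Tuple[List[float], List[float]]:
    """Interpolate (x, y) onto n_points equally spaced positions between x[0] and x[-1]."""
    if len(x) < 2 or n_points < 2:
        raise ValueError("need at least two knots and two output points")
    m = natural_spline_second_derivs(x, y)
    step = (x[-1] - x[0]) / (n_points - 1)
    grid = [x[0] + k * step for k in range(n_points)]
    return grid, [spline_eval(x, y, m, t) for t in grid]


# Hypothetical expression values of one probe at four ages of the young-adult stage.
ages = [19.0, 27.0, 36.0, 45.0]
expression = [7.1, 7.4, 7.3, 7.9]
grid, values = resample(ages, expression, 500)
print(len(values), values[0], values[-1])
```

Interpolated points are not independent measurements, which is one reason the AIC-based order detection is still applied afterwards.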

#### *4.3. Dynamic Systems Modeling for the Candidate GWGEN*

The candidate GWGEN consists of a PPIN and a GRN. It is noted that the GRN also includes the miRNA regulation network and the lncRNA regulation network. In the following, we take the PPIN and GRN as examples; the rest can be found in the Supplementary Materials. The PPIs of human protein *i* in the candidate PPIN can be described by the dynamic equation shown below:

$$\begin{aligned} p_i(t+1) &= p_i(t) + \sum_{j=1}^{I_i} \alpha_{ij}^P p_i(t) p_j(t) - \sigma_i^P p_i(t) + \lambda_i^P g_i(t) + \beta_i^P + \varepsilon_i^P(t), \\ &\quad \text{for } i = 1, \dots, I, \ -\sigma_i^P \le 0 \text{ and } \lambda_i^P \ge 0, \end{aligned} \tag{1}$$

where $p_i(t)$, $p_j(t)$, and $g_i(t)$ indicate the expression levels of the *i*th protein, the *j*th protein, and the *i*th gene at time *t*, respectively; $\alpha_{ij}^P$ denotes the interactive ability between the *i*th protein and the *j*th protein in human skin cells; $\sigma_i^P$ represents the degradation rate of the *i*th protein; $\lambda_i^P$ indicates the translation effect from the corresponding mRNA to the *i*th protein; the basal level $\beta_i^P$ signifies the regulations from other unknown regulators to the *i*th protein; $I_i$ denotes the number of human proteins interacting with the *i*th protein in the candidate GWGENs; and $\varepsilon_i^P(t)$ signifies the noise of the *i*th protein owing to model uncertainty or measurement noise at time *t*.
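To make the role of each term in Equation (1) concrete, the following sketch computes one time step of the protein model for a single protein. The function name `protein_step` and all numerical values are hypothetical and chosen only for illustration; the noise term defaults to zero.

```python
from typing import Sequence


def protein_step(p_i: float, g_i: float, partners: Sequence[float],
                 alpha: Sequence[float], sigma: float, lam: float,
                 beta: float, noise: float = 0.0) -> float:
    """One step of Equation (1): returns p_i(t+1) from the values at time t."""
    if sigma < 0 or lam < 0:
        raise ValueError("Equation (1) requires sigma >= 0 and lambda >= 0")
    # Interaction term: sum over partner proteins of alpha_ij * p_i(t) * p_j(t).
    interaction = sum(a_ij * p_i * p_j for a_ij, p_j in zip(alpha, partners))
    # Degradation (-sigma * p_i), translation (lambda * g_i), basal level, noise.
    return p_i + interaction - sigma * p_i + lam * g_i + beta + noise


# Hypothetical values: protein i with two interaction partners.
p_next = protein_step(p_i=2.0, g_i=1.0, partners=[3.0, 0.5],
                      alpha=[0.1, -0.2], sigma=0.25, lam=0.5, beta=0.1)
print(p_next)
```

With these values the interaction term contributes 0.4, degradation removes 0.5, translation adds 0.5, and the basal level adds 0.1.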

The *k*th gene in the candidate GRN can be represented by the following dynamic equation:

$$\begin{aligned} g_k(t+1) = {} & g_k(t) + \sum_{i=1}^{I_k} a_{ki}^G p_i(t) - \sum_{r=1}^{R_k} b_{kr}^G g_k(t) m_r(t) + \sum_{\ell=1}^{L_k} c_{k\ell}^G o_\ell(t) - \mu_k^G g_k(t) \\ & + \delta_k^G + \omega_k^G(t), \quad \text{for } k = 1, 2, \dots, K, \ -b_{kr}^G \le 0 \text{ and } -\mu_k^G \le 0, \end{aligned} \tag{2}$$

where $g_k(t)$, $p_i(t)$, $m_r(t)$, and $o_\ell(t)$ indicate the expression levels of the *k*th gene, the *i*th transcription factor (TF), the *r*th miRNA, and the $\ell$th lncRNA at time *t*, respectively; $a_{ki}^G$, $-b_{kr}^G$, and $c_{k\ell}^G$ represent the regulatory ability of the *i*th TF, the repression ability of the *r*th miRNA, and the regulatory ability of the $\ell$th lncRNA on the *k*th gene, respectively; $-\mu_k^G$ signifies the degradation rate of the gene expression of the *k*th gene; the basal level $\delta_k^G$ denotes the regulations from other unknown regulators to the *k*th gene, such as phosphorylation; $\omega_k^G(t)$ signifies the noise of the *k*th gene owing to model uncertainty or measurement noise at time *t*; and $I_k$, $R_k$, and $L_k$ are the total numbers of TFs, miRNAs, and lncRNAs in the candidate GRN, respectively. Note that the biological regulatory mechanisms in skin cells in (2) involve TF transcription regulation by $\sum_{i=1}^{I_k} a_{ki}^G p_i(t)$, miRNA repression by $-\sum_{r=1}^{R_k} b_{kr}^G g_k(t) m_r(t)$, lncRNA regulation by $\sum_{\ell=1}^{L_k} c_{k\ell}^G o_\ell(t)$, mRNA degradation by $-\mu_k^G g_k(t)$, the basal level by $\delta_k^G$, and the noise by $\omega_k^G(t)$. In this study, the effect of post-translational modification on skin aging is considered by the basal level term $\delta_k^G$.
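Analogously, one time step of the gene model in Equation (2) can be written out term by term. The sketch below is illustrative only; the function name `gene_step` and the numbers are hypothetical, and the noise term defaults to zero.

```python
from typing import Sequence


def gene_step(g_k: float, tfs: Sequence[float], mirnas: Sequence[float],
              lncrnas: Sequence[float], a: Sequence[float], b: Sequence[float],
              c: Sequence[float], mu: float, delta: float,
              noise: float = 0.0) -> float:
    """One step of Equation (2): returns g_k(t+1) from the values at time t."""
    if mu < 0 or any(b_r < 0 for b_r in b):
        raise ValueError("Equation (2) requires mu >= 0 and b_kr >= 0")
    tf_term = sum(a_i * p_i for a_i, p_i in zip(a, tfs))               # TF regulation
    mirna_term = sum(b_r * g_k * m_r for b_r, m_r in zip(b, mirnas))   # miRNA repression
    lnc_term = sum(c_l * o_l for c_l, o_l in zip(c, lncrnas))          # lncRNA regulation
    return g_k + tf_term - mirna_term + lnc_term - mu * g_k + delta + noise


# Hypothetical values: one TF, one miRNA, and one lncRNA acting on gene k.
g_next = gene_step(g_k=1.0, tfs=[2.0], mirnas=[4.0], lncrnas=[1.0],
                   a=[0.5], b=[0.1], c=[-0.2], mu=0.25, delta=0.05)
print(g_next)
```

Note that the miRNA term is multiplied by the current expression of the gene itself, so repression vanishes when the gene is not expressed, whereas TF and lncRNA regulation enter additively.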

#### *4.4. Systems Identification Approach in the Candidate GWGEN via Microarray Data*

After systems modeling by Equations (1)–(4), we then perform the systems identification by solving the parameter estimation problems. The PPIN in Equation (1) can be rewritten in the following linear regression form:

$$\begin{aligned} p_i(t+1) &= \begin{bmatrix} p_1(t)p_i(t) & \cdots & p_{I_i}(t)p_i(t) & g_i(t) & p_i(t) & 1 \end{bmatrix} \begin{bmatrix} \alpha_{i1}^P \\ \vdots \\ \alpha_{iI_i}^P \\ \lambda_i^P \\ 1 - \sigma_i^P \\ \beta_i^P \end{bmatrix} + \varepsilon_i^P(t) \\ &= \psi_i^P(t)\theta_i^P + \varepsilon_i^P(t), \quad \text{for } i = 1, 2, \dots, I, \end{aligned} \tag{3}$$

where $\psi_i^P(t)$ represents the regression vector that can be obtained from the microarray data and $\theta_i^P$ denotes the unknown parameter vector to be estimated for the *i*th protein in the PPIN.

Furthermore, Equation (3) of the *i*th protein can be augmented for $Y_i$ time points as shown below:

$$\begin{bmatrix} p_i(t_2) \\ p_i(t_3) \\ \vdots \\ p_i(t_{Y_i+1}) \end{bmatrix} = \begin{bmatrix} \psi_i^P(t_1) \\ \psi_i^P(t_2) \\ \vdots \\ \psi_i^P(t_{Y_i}) \end{bmatrix} \theta_i^P + \begin{bmatrix} \varepsilon_i^P(t_1) \\ \varepsilon_i^P(t_2) \\ \vdots \\ \varepsilon_i^P(t_{Y_i}) \end{bmatrix}, \quad \text{for } i = 1, 2, \dots, I, \tag{4}$$

which can also be simplified by

$$P\_i = \Psi\_i^P \theta\_i^P + E\_i^P \;/\; for \; i = 1, 2, \dots, I \tag{5}$$

where

$$P\_{i} = \begin{bmatrix} p\_{i}(t\_{2}) \\ p\_{i}(t\_{3}) \\ \vdots \\ p\_{i}(t\_{Y\_{i}+1}) \end{bmatrix}, \Psi\_{i}^{P} = \begin{bmatrix} \psi\_{i}^{P}(t\_{1}) \\ \psi\_{i}^{P}(t\_{2}) \\ \vdots \\ \psi\_{i}^{P}(t\_{Y\_{i}}) \end{bmatrix}, E\_{i}^{P} = \begin{bmatrix} \varepsilon\_{i}^{P}(t\_{1}) \\ \varepsilon\_{i}^{P}(t\_{2}) \\ \vdots \\ \varepsilon\_{i}^{P}(t\_{Y\_{i}}) \end{bmatrix}.$$
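The stacking in Equations (3)–(5) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_ppi_regression`, the argument layout, and the toy time series are hypothetical and do not come from the original study.

```python
def build_ppi_regression(p_i, g_i, partners):
    """Stack regression rows psi_i^P(t) and targets p_i(t+1), following Eqs. (3)-(5).

    p_i      : expression of protein i over time (list of floats)
    g_i      : expression of the gene encoding protein i over time
    partners : one time series per candidate interacting protein p_j
    """
    n_steps = len(p_i) - 1  # number of usable transitions t -> t+1
    Psi, P = [], []
    for t in range(n_steps):
        # Interaction terms p_j(t) * p_i(t), one per candidate partner.
        row = [series[t] * p_i[t] for series in partners]
        # Translation term g_i(t), self term p_i(t), and constant 1 for the basal level.
        row += [g_i[t], p_i[t], 1.0]
        Psi.append(row)
        P.append(p_i[t + 1])
    return Psi, P


# Toy data (hypothetical values): four time points, two candidate partners.
p_i = [1.0, 1.2, 1.5, 1.4]
g_i = [0.5, 0.6, 0.7, 0.8]
partners = [[2.0, 2.1, 1.9, 2.2], [0.3, 0.4, 0.2, 0.1]]
Psi, P = build_ppi_regression(p_i, g_i, partners)
```

Each row of `Psi` has one column per candidate partner plus three trailing columns, matching the ordering of the parameter vector in Equation (3).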

Therefore, the interaction parameters in the vector $\theta\_i^P$ can be estimated by solving the following constrained least-squares estimation problem:

$$\hat{\theta}\_{i}^{P} = \arg\min\_{\theta\_{i}^{P}} \frac{1}{2} \left\| \Psi\_{i}^{P} \theta\_{i}^{P} - P\_{i} \right\|\_{2}^{2} \text{ subject to } A^{P} \theta\_{i}^{P} \leq b^{P}, \tag{6}$$

where

$$A^P = \begin{bmatrix} 0 & \cdots & 0 & -1 & 0 & 0\\ 0 & \cdots & 0 & 0 & 1 & 0 \end{bmatrix} \in \mathbb{R}^{2 \times (I\_i + 3)}, \ b^P = \begin{bmatrix} 0\\ 1 \end{bmatrix}.$$

To estimate the interaction parameters in (1) by solving the parameter estimation problem in (6), we use the function *lsqlin* in the MATLAB Optimization Toolbox. The constraints ensure that the protein translation rate $\lambda\_i^P$ is always non-negative and the protein degradation rate $-\sigma\_i^P$ is always non-positive; that is, $\lambda\_i^P \geq 0$ and $-\sigma\_i^P \leq 0$.
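As a reading aid, the inequality constraint in (6) can be expanded row by row. The expansion below assumes that the last three entries of $\theta\_i^P$ are ordered as the translation rate $\lambda\_i^P$, the self term $1 - \sigma\_i^P$, and the basal level, as in Equation (3):

$$\begin{aligned} \text{row 1:}\quad & -\lambda\_i^P \leq 0 \;\Longleftrightarrow\; \lambda\_i^P \geq 0, \\ \text{row 2:}\quad & 1 - \sigma\_i^P \leq 1 \;\Longleftrightarrow\; -\sigma\_i^P \leq 0. \end{aligned}$$

Each row therefore bounds a single parameter, and the interaction parameters and the basal level remain unconstrained.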

Similarly, we rewrite the dynamic equation of GRN in the Equation (2) as the following linear regression form:

$$\begin{aligned} g\_k(t+1) &= \begin{bmatrix} p\_1(t) & \cdots & p\_{I\_k}(t) & g\_k(t)m\_1(t) & \cdots & g\_k(t)m\_{R\_k}(t) & o\_1(t) & \cdots & o\_{L\_k}(t) & g\_k(t) & 1 \end{bmatrix} \begin{bmatrix} a\_{k1}^G \\ \vdots \\ a\_{kI\_k}^G \\ -b\_{k1}^G \\ \vdots \\ -b\_{kR\_k}^G \\ c\_{k1}^G \\ \vdots \\ c\_{kL\_k}^G \\ 1 - \mu\_k^G \\ \delta\_k^G \end{bmatrix} + \omega\_k^G(t) \\ &= \psi\_k^G(t)\theta\_k^G + \omega\_k^G(t), \text{ for } k = 1, 2, \dots, K, \end{aligned} \tag{7}$$

where $\psi\_k^G(t)$ represents the regression vector, which can be obtained from the microarray data, and $\theta\_k^G$ signifies the unknown parameter vector to be estimated for the $k$th gene in the GRN. Moreover, Equation (7) can be augmented over $Y\_k$ time points in the following form:

$$\begin{bmatrix} g\_k(t\_2) \\ g\_k(t\_3) \\ \vdots \\ g\_k(t\_{Y\_k+1}) \end{bmatrix} = \begin{bmatrix} \psi\_k^G(t\_1) \\ \psi\_k^G(t\_2) \\ \vdots \\ \psi\_k^G(t\_{Y\_k}) \end{bmatrix} \theta\_k^G + \begin{bmatrix} \omega\_k^G(t\_1) \\ \omega\_k^G(t\_2) \\ \vdots \\ \omega\_k^G(t\_{Y\_k}) \end{bmatrix}, \text{ for } k = 1, 2, \dots, K. \tag{8}$$

Next, we write Equation (8) compactly as follows:

$$G\_{k} = \Psi\_{k}^{G} \theta\_{k}^{G} + \Omega\_{k}^{G}, \text{ for } k = 1, 2, \dots, K, \tag{9}$$

where

$$G\_{k} = \begin{bmatrix} g\_{k}(t\_{2}) \\ g\_{k}(t\_{3}) \\ \vdots \\ g\_{k}(t\_{Y\_{k}+1}) \end{bmatrix}, \Psi\_{k}^{G} = \begin{bmatrix} \psi\_{k}^{G}(t\_{1}) \\ \psi\_{k}^{G}(t\_{2}) \\ \vdots \\ \psi\_{k}^{G}(t\_{Y\_{k}}) \end{bmatrix}, \Omega\_{k}^{G} = \begin{bmatrix} \omega\_{k}^{G}(t\_{1}) \\ \omega\_{k}^{G}(t\_{2}) \\ \vdots \\ \omega\_{k}^{G}(t\_{Y\_{k}}) \end{bmatrix}.$$

Hence, the regulatory parameters in the vector $\theta\_k^G$ can be estimated by solving the following constrained least-squares estimation problem:

$$\hat{\theta}\_{k}^{G} = \arg\min\_{\theta\_{k}^{G}} \frac{1}{2} \left\| \Psi\_{k}^{G} \theta\_{k}^{G} - G\_{k} \right\|\_{2}^{2} \text{ subject to } A^{G} \theta\_{k}^{G} \le b^{G}, \tag{10}$$

where

$$A^{G} = \begin{bmatrix} 0 & \cdots & 0 & 1 & 0 & \cdots & 0 & 0 & \cdots & 0 & 0 & 0 \\ 0 & \cdots & 0 & 0 & 1 & \cdots & 0 & 0 & \cdots & 0 & 0 & 0 \\ \vdots & & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & & \vdots & \vdots & \vdots \\ 0 & \cdots & 0 & 0 & 0 & \cdots & 1 & 0 & \cdots & 0 & 0 & 0 \\ 0 & \cdots & 0 & 0 & 0 & \cdots & 0 & 0 & \cdots & 0 & 1 & 0 \end{bmatrix} \in \mathbb{R}^{(R\_{k} + 1) \times (I\_{k} + R\_{k} + L\_{k} + 2)},$$

$$b^{G} = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ 1 \end{bmatrix} \in \mathbb{R}^{R\_{k} + 1}.$$

By applying the function *lsqlin* in the MATLAB Optimization Toolbox to solve the parameter estimation problem in Equation (10), we can estimate the regulatory parameters of the GRN in Equation (2). The constraints ensure that the miRNA repression rate $-b\_{kr}^G$ and the gene degradation rate $-\mu\_k^G$ are both non-positive for $k = 1, 2, \dots, K$ and $r = 1, 2, \dots, R\_k$.
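The study solves (10) with MATLAB's *lsqlin*. As a rough, dependency-free illustration only, the sketch below swaps in projected gradient descent, which is applicable here because every row of the constraint matrix bounds a single parameter, so the feasible set is a box. The function name `solve_box_least_squares`, the step-size rule, the iteration count, and the toy data are assumptions of this sketch, not part of the original method.

```python
def solve_box_least_squares(Psi, y, lower, upper, iters=5000):
    """Minimize 0.5 * ||Psi @ theta - y||^2 subject to lower <= theta <= upper.

    Uses projected gradient descent (a stand-in for lsqlin, not the paper's solver).
    """
    n = len(Psi[0])
    # Squared Frobenius norm bounds the largest eigenvalue of Psi^T Psi,
    # so its reciprocal is a safe (if conservative) step size.
    step = 1.0 / sum(v * v for row in Psi for v in row)
    # Start from zero, clipped into the feasible box.
    theta = [min(max(0.0, lo), hi) for lo, hi in zip(lower, upper)]
    for _ in range(iters):
        resid = [sum(row[j] * theta[j] for j in range(n)) - yi
                 for row, yi in zip(Psi, y)]
        grad = [sum(Psi[t][j] * resid[t] for t in range(len(Psi)))
                for j in range(n)]
        # Gradient step followed by projection onto the box.
        theta = [min(max(theta[j] - step * grad[j], lower[j]), upper[j])
                 for j in range(n)]
    return theta


# Toy problem (hypothetical): the unconstrained optimum is [1, -1];
# requiring the second parameter to be non-negative moves it to [0.5, 0].
INF = float("inf")
Psi = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, -1.0, 0.0]
theta = solve_box_least_squares(Psi, y, lower=[-INF, 0.0], upper=[INF, INF])
```

For the GRN problem, the entries holding $-b\_{kr}^G$ would receive an upper bound of 0 and the entry holding $1 - \mu\_k^G$ an upper bound of 1, with all other entries left unbounded.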

#### *4.5. Pruning False Positives in Candidate GWGENs to Obtain Real GWGENs by System Order Detection Scheme*

Because the data used to construct the candidate GWGEN come from different databases, differing experimental conditions and noise may leave many false-positive interactions and regulations after system identification. We therefore apply a system order detection scheme that computes the Akaike information criterion (AIC) to detect the real system order of the PPI model in Equation (1) and the GRN model in Equation (2). According to Akaike's theory, the most accurate model has the smallest AIC value [111]. In other words, when the AIC value reaches its minimum, the detected system order approaches the real system order.

For the PPI model in Equation (5), the AIC value of the $i$th protein is defined by the following equation:

$$AIC\_i^P(K\_i) = \log \left\{ \frac{1}{T\_i} \left[ P\_i - \Psi\_i^P \hat{\theta}\_i^P \right]^T \left[ P\_i - \Psi\_i^P \hat{\theta}\_i^P \right] \right\} + \frac{2K\_i}{T\_i} \tag{11}$$

where $\hat{\theta}\_i^P$ denotes the estimated interaction parameters of the $i$th protein obtained by solving the parameter estimation problem in Equation (6), and the covariance of the estimated residual error is $(\varsigma\_i^P)^2 = \frac{1}{T\_i} \left[ P\_i - \Psi\_i^P \hat{\theta}\_i^P \right]^T \left[ P\_i - \Psi\_i^P \hat{\theta}\_i^P \right]$. To find the real system order $K\_i^\*$ of the $i$th protein in the PPI model, at which $AIC\_i^P(K\_i^\*)$ in Equation (11) attains its minimum, we trade off the system order against the estimated residual error. With this system order detection method, PPIs with insignificant interaction abilities, which fall outside $K\_i^\*$, can be regarded as false positives and pruned away.
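The trade-off in Equation (11) can be illustrated with a short Python sketch. It assumes that the model has already been refitted for each candidate order and that the residual vectors are available; the helper names `aic` and `best_order` and the residual values are hypothetical.

```python
import math


def aic(residuals, order):
    """AIC = log(mean squared residual) + 2 * order / T, following Eq. (11)."""
    T = len(residuals)
    mse = sum(r * r for r in residuals) / T
    return math.log(mse) + 2.0 * order / T


def best_order(residuals_by_order):
    """Return the candidate order whose fitted model has the smallest AIC."""
    return min(residuals_by_order,
               key=lambda k: aic(residuals_by_order[k], k))


# Hypothetical residuals from models refitted with 1, 2, and 3 interactions.
residuals_by_order = {
    1: [0.9, -0.8, 1.0, -0.9, 0.8, -1.0, 0.9, -0.8],
    2: [0.2, -0.1, 0.2, -0.2, 0.1, -0.2, 0.2, -0.1],
    3: [0.19, -0.1, 0.2, -0.2, 0.1, -0.19, 0.2, -0.1],
}
chosen = best_order(residuals_by_order)
```

In this toy case the third interaction lowers the residual only marginally, so the penalty term outweighs the gain and the smaller order is preferred.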

For the GRN model in Equation (9), the AIC value of the $k$th gene is defined by the following equation:

$$AIC\_k^G(I\_k, R\_k, L\_k) = \log\left\{\frac{1}{T\_k}\left[G\_k - \Psi\_k^G\hat{\theta}\_k^G\right]^T \left[G\_k - \Psi\_k^G\hat{\theta}\_k^G\right]\right\} + \frac{2(I\_k + R\_k + L\_k)}{T\_k} \tag{12}$$

where $\hat{\theta}\_k^G$ denotes the estimated regulatory parameters of the $k$th gene obtained by solving the parameter estimation problem in Equation (10), and the covariance of the estimated residual error is $(\varsigma\_k^G)^2 = \frac{1}{T\_k} \left[ G\_k - \Psi\_k^G \hat{\theta}\_k^G \right]^T \left[ G\_k - \Psi\_k^G \hat{\theta}\_k^G \right]$. To find the real system orders $I\_k^\*$, $R\_k^\*$, and $L\_k^\*$ of the $k$th gene in the GRN, at which $AIC\_k^G(I\_k^\*, R\_k^\*, L\_k^\*)$ in (12) attains its minimum, we trade off the system order against the estimated residual error. In this way, for the $k$th gene, the regulations with insignificant regulatory abilities, which fall outside $I\_k^\*$, $R\_k^\*$, and $L\_k^\*$, can be treated as false positives and pruned away from the candidate GRN. Note that we apply the same system order detection scheme to the miRNA model and the lncRNA model, as described in Section S1.3 of the Supplementary Materials.

After performing system identification and the system order detection scheme, which pruned away the insignificant interactions and regulations in the candidate GWGEN, we obtained the real GWGENs for the three stages of human skin aging. However, it is still quite difficult to investigate the progression mechanisms of skin aging from these real GWGENs because of their high complexity. To address this issue, we introduce the principal network projection (PNP) method to extract the core networks from the real GWGENs as core GWGENs. The details are described in the following section.

#### *4.6. Extracting Core Networks from Real GWGENs by the Principal Network Projection Method*

The PNP method is a network structure projection approach based on the principal singular values, which reduces the network dimension by deleting insignificant structures. To use the PNP method to extract the core networks from the real GWGENs, we construct a network matrix $H$ consisting of all the estimated interactions and regulations in the real GWGEN (with the $i$th row denoting the interactions or regulations on the $i$th node, i.e., a protein, gene, miRNA, or lncRNA of the real GWGEN) in the following form:

$$H = \begin{bmatrix} \hat{\alpha}\_{11} & \cdots & \hat{\alpha}\_{1I} & 0 & \cdots & 0 & 0 & \cdots & 0 \\ \vdots & \hat{\alpha}\_{ij} & \vdots & \vdots & 0 & \vdots & \vdots & 0 & \vdots \\ \hat{\alpha}\_{I1} & \cdots & \hat{\alpha}\_{II} & 0 & \cdots & 0 & 0 & \cdots & 0 \\ \hat{a}\_{11}^G & \cdots & \hat{a}\_{1I}^G & -\hat{b}\_{11}^G & \cdots & -\hat{b}\_{1R}^G & \hat{c}\_{11}^G & \cdots & \hat{c}\_{1Z}^G \\ \vdots & \hat{a}\_{ki}^G & \vdots & \vdots & -\hat{b}\_{kr}^G & \vdots & \vdots & \hat{c}\_{kz}^G & \vdots \\ \hat{a}\_{K1}^G & \cdots & \hat{a}\_{KI}^G & -\hat{b}\_{K1}^G & \cdots & -\hat{b}\_{KR}^G & \hat{c}\_{K1}^G & \cdots & \hat{c}\_{KZ}^G \\ \hat{a}\_{11}^M & \cdots & \hat{a}\_{1I}^M & -\hat{b}\_{11}^M & \cdots & -\hat{b}\_{1R}^M & \hat{c}\_{11}^M & \cdots & \hat{c}\_{1Z}^M \\ \vdots & \hat{a}\_{ri}^M & \vdots & \vdots & -\hat{b}\_{rr}^M & \vdots & \vdots & \hat{c}\_{rz}^M & \vdots \\ \hat{a}\_{R1}^M & \cdots & \hat{a}\_{RI}^M & -\hat{b}\_{R1}^M & \cdots & -\hat{b}\_{RR}^M & \hat{c}\_{R1}^M & \cdots & \hat{c}\_{RZ}^M \\ \hat{a}\_{11}^L & \cdots & \hat{a}\_{1I}^L & -\hat{b}\_{11}^L & \cdots & -\hat{b}\_{1R}^L & \hat{c}\_{11}^L & \cdots & \hat{c}\_{1Z}^L \\ \vdots & \hat{a}\_{zi}^L & \vdots & \vdots & -\hat{b}\_{zr}^L & \vdots & \vdots & \hat{c}\_{zz}^L & \vdots \\ \hat{a}\_{Z1}^L & \cdots & \hat{a}\_{ZI}^L & -\hat{b}\_{Z1}^L & \cdots & -\hat{b}\_{ZR}^L & \hat{c}\_{Z1}^L & \cdots & \hat{c}\_{ZZ}^L \end{bmatrix} \in \mathbb{R}^{(I+K+R+Z) \times (I+R+Z)} \tag{13}$$

where $\hat{\alpha}\_{ij}$ denotes the interactive ability of the $i$th protein with the $j$th protein in the PPIN, which is obtained from $\hat{\theta}\_i^P$ by solving the parameter estimation problem in Equation (6) and pruning the false positives by AIC in Equation (11); $\hat{a}\_{ki}^G$, $\hat{b}\_{kr}^G$, and $\hat{c}\_{kz}^G$ represent the transcriptional regulatory abilities of the $i$th TF, the $r$th miRNA, and the $z$th lncRNA on the $k$th protein-coding gene, respectively, which are obtained from $\hat{\theta}\_k^G$ by solving the parameter estimation problem in Equation (10) and pruning the false positives by AIC in (12); $\hat{a}\_{ri}^M$, $\hat{b}\_{rr}^M$, and $\hat{c}\_{rz}^M$ indicate the transcriptional regulatory abilities of the $i$th TF, the $r$th miRNA, and the $z$th lncRNA on the gene of the $r$th miRNA, respectively, which are obtained from $\hat{\theta}\_r^M$ by solving the parameter estimation problem in Equation (S6) and pruning the false positives by AIC in Equation (S11); and $\hat{a}\_{zi}^L$, $\hat{b}\_{zr}^L$, and $\hat{c}\_{zz}^L$ indicate the transcriptional regulatory abilities of the $i$th TF, the $r$th miRNA, and the $z$th lncRNA on the gene of the $z$th lncRNA, respectively, which are obtained from $\hat{\theta}\_z^L$ by solving the parameter estimation problem in Equation (S10) and pruning the false positives by AIC in Equation (S12). Note that if an interaction or regulation does not exist in the candidate GWGEN obtained by big data mining, or has already been pruned by AIC, the corresponding component of matrix $H$ is set to zero.

Once $H$ has been constructed, we extract the core GWGEN from the real GWGEN by the PNP method as follows. First, the combined network matrix $H$ is factorized by singular value decomposition (SVD):

$$H = U D V^{T} \tag{14}$$

where $U \in \mathbb{R}^{(I+K+R+Z) \times (I+R+Z)}$, $V \in \mathbb{R}^{(I+R+Z) \times (I+R+Z)}$, and $D = \mathrm{diag}(d\_1, \cdots, d\_{I+R+Z})$. $D$ is composed of the $I+R+Z$ singular values of $H$, with $d\_1 \geq d\_2 \geq \cdots \geq d\_{I+R+Z}$. The eigen expression fraction $E\_h$ is defined by the following energy normalization:

$$E\_h = \frac{d\_h^2}{\sum\_{h=1}^{I+R+Z} d\_h^2} \tag{15}$$

Then, we find the minimum $\gamma$ such that $\sum\_{h=1}^{\gamma} E\_h \geq 0.85$. That is, the top $\gamma$ singular vectors of matrix $H$ contain 85% of the core network structure of the real GWGEN from the energy point of view. Additionally, we define the projections of $H$ onto the top $\gamma$ singular vectors of $V$ as

$$N\_{R}(w, s) = h\_{w,:} \times v\_{:,s}^T, \text{ for } w = 1, 2, \dots, I^\* + R^\* + Z^\* \text{ and } s = 1, 2, \dots, \gamma \tag{16}$$

where $h\_{w,:}$ and $v\_{:,s}^T$ denote the $w$th row of $H$ and the $s$th column of $V$, respectively. Subsequently, for the top $\gamma$ right-singular vectors, we define the 2-norm projection value of the proteins, genes, lncRNAs, and miRNAs (i.e., the nodes) in the real GWGEN as below:

$$D\_{R}(w) = \left[\sum\_{s=1}^{\gamma} \left[N\_{R}(w, s)\right]^2\right]^{1/2}, \text{ for } w = 1, 2, \dots, I^\* + R^\* + Z^\* \text{ and } s = 1, 2, \dots, \gamma \tag{17}$$

If the projection value $D\_R(w)$ approaches zero for the $w$th node, the $w$th node is almost independent of the principal network structure. Conversely, the larger the projection value, the greater the contribution of the corresponding node to the core network. In this way, we can extract the core GWGEN by collecting the nodes with large projection values from the real GWGENs and represent them in the KEGG pathway style to investigate the progression mechanisms of human skin aging.
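The rank selection and node scoring of the PNP method can be sketched as follows. This is a minimal illustration that assumes the singular values and the right-singular matrix $V$ have already been computed by an SVD routine; the function names `energy_rank` and `projection_values` and the toy matrices are hypothetical.

```python
def energy_rank(singular_values, threshold=0.85):
    """Smallest gamma whose leading squared singular values reach the energy threshold."""
    total = sum(d * d for d in singular_values)
    cumulative = 0.0
    for gamma, d in enumerate(singular_values, start=1):
        cumulative += d * d / total  # energy fraction E_h, Eq. (15)
        if cumulative >= threshold:
            return gamma
    return len(singular_values)


def projection_values(H, V, gamma):
    """2-norm projection value of every row of H onto the first gamma columns of V."""
    values = []
    for row in H:
        squared = 0.0
        for s in range(gamma):
            # Projection of row w onto the s-th right-singular vector, Eq. (16).
            n_ws = sum(row[j] * V[j][s] for j in range(len(row)))
            squared += n_ws * n_ws
        values.append(squared ** 0.5)  # Eq. (17)
    return values


# Toy example (hypothetical): a diagonal H, for which V is the identity
# and the singular values are the diagonal entries.
H = [[3.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
gamma = energy_rank([3.0, 2.0, 1.0])
scores = projection_values(H, V, gamma)
```

In this toy case the first two singular values carry about 93% of the energy, so two vectors are kept and the third node receives a projection value of zero, marking it as outside the core structure.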

#### *4.7. Data Preprocessing for Training Deep Neural Network of Drug–Target Interaction in Advance*

The drug–target interaction dataset comes from BindingDB [35]. The descriptors of drugs and targets are computed with PyBioMed [112]. We installed this package and imported its PyMolecule and PyProtein modules to transform drugs and targets into their descriptors in a Python 2.7 environment. The PyMolecule module in PyBioMed computes commonly used structural and physicochemical descriptors, which serve as drug features. The drug features include constitutional and geometrical descriptors. The PyProtein module in PyBioMed calculates widely used descriptors, including structural and physicochemical properties of proteins and peptides derived from amino acid sequences, which serve as target features. Subsequently, by concatenating the drug descriptor and the target descriptor, we describe the properties of a drug and its target by the feature vector shown in (18). The total numbers of drug features and target features are 363 and 996, respectively.

$$v\_{drug-target} = [D, T] = [d\_1, d\_2, \dots, d\_I, t\_1, t\_2, \dots, t\_J] \tag{18}$$

where $v\_{drug-target}$ indicates the feature vector of a drug–target pair; $D$ denotes the feature vector of the drug; $d\_i$ indicates the $i$th drug feature; $T$ represents the feature vector of the target; $t\_j$ is the $j$th target feature; $I$ is the total number of drug features; and $J$ is the total number of target features. We apply the same transformation to all drug–target pairs to obtain their drug–target feature vectors.
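The concatenation in Equation (18) amounts to joining two descriptor lists. The sketch below is illustrative only: it does not call PyBioMed, the function name `drug_target_vector` is hypothetical, and the placeholder descriptor values stand in for real computed features.

```python
N_DRUG_FEATURES = 363    # number of drug features reported in the text
N_TARGET_FEATURES = 996  # number of target features reported in the text


def drug_target_vector(drug_descriptors, target_descriptors):
    """Concatenate drug features D and target features T into [D, T], Eq. (18)."""
    if len(drug_descriptors) != N_DRUG_FEATURES:
        raise ValueError("unexpected number of drug features")
    if len(target_descriptors) != N_TARGET_FEATURES:
        raise ValueError("unexpected number of target features")
    return list(drug_descriptors) + list(target_descriptors)


# Placeholder descriptors (hypothetical values, not real PyBioMed output).
drug = [0.0] * N_DRUG_FEATURES
target = [1.0] * N_TARGET_FEATURES
v = drug_target_vector(drug, target)
```

The resulting vector has 1359 entries, with the drug features first and the target features after them.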

**Supplementary Materials:** The following are available online, Figure S1: The real genome-wide genetic and epigenetic network (GWGEN) of young-stage skin, Figure S2: The real genome-wide genetic and epigenetic network (GWGEN) of middle-stage skin, Figure S3: The real genome-wide genetic and epigenetic network (GWGEN) of elder-stage skin, Figure S4: Deep neural network of drug-target interaction framework, Table S1: The pathway enrichment analysis of proteins through applying the DAVID in the core GWGEN of young-stage skin, Table S2: The pathway enrichment analysis of proteins through applying the DAVID in the core GWGEN of middle-stage skin, Table S3: The pathway enrichment analysis of proteins through applying the DAVID in the core GWGEN of elder-stage skin, Table S4: Drug targets with their corresponding small-molecule compounds, Table S5: Drug targets with their corresponding small-molecule compounds.

**Author Contributions:** Conceptualization, B.-S.C. and S.-J.Y.; methodology, S.-J.Y. and J.-F.L.; software, S.-J.Y. and J.-F.L.; validation, S.-J.Y. and J.-F.L.; formal analysis, S.-J.Y. and J.-F.L.; investigation, J.-F.L.; data curation J.-F.L.; writing—original draft preparation, S.-J.Y. and J.-F.L.; writing—review and editing, B.-S.C. and S.-J.Y.; visualization, J.-F.L.; supervision, B.-S.C.; funding acquisition, B.-S.C. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by Ministry of Science and Technology grant number MOST 107-2221-E-007-112-MY3.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The human skin data are from GSE18876 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE18876) (accessed on 19 May 2021). Drug sensitivity data are from the DepMap portal (https://depmap.org/portal/download/) (accessed on 19 May 2021).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Sample Availability:** Samples of the compounds are not available from the authors.

*Article*

### **Identification of Potential COX-2 Inhibitors for the Treatment of Inflammatory Diseases Using Molecular Modeling Approaches**

**Pedro H. F. Araújo 1,2, Ryan S. Ramos <sup>2</sup> , Jorddy N. da Cruz <sup>2</sup> , Sebastião G. Silva <sup>3</sup> , Elenilze F. B. Ferreira 1,2,4, Lúcio R. de Lima <sup>2</sup> , Williams J. C. Macêdo 1,2,5 , José M. Espejo-Román 6 , Joaquín M. Campos <sup>6</sup> and Cleydson B. R. Santos 1,2,5,\***


#### Academic Editor: Marco Tutone

Received: 25 August 2020; Accepted: 9 September 2020; Published: 12 September 2020

**Abstract:** Non-steroidal anti-inflammatory drugs are inhibitors of cyclooxygenase-2 (COX-2) that were developed in order to avoid the side effects of non-selective inhibitors of COX-1. Thus, the present study aims to identify new selective chemical entities for the COX-2 enzyme via molecular modeling approaches. The best pharmacophore model was used to identify compounds within the ZINC database. The molecular properties were determined and selected with Pearson's correlation for the construction of quantitative structure–activity relationship (QSAR) models to predict the biological activities of the compounds obtained with virtual screening. The pharmacokinetic/toxicological profiles of the compounds were determined, as well as the binding modes through molecular docking compared to commercial compounds (rofecoxib and celecoxib). The QSAR analysis showed a fit with R = 0.9617, R<sup>2</sup> = 0.9250, standard error of estimate (SEE) = 0.2238, and F = 46.2739, with the tetra-parametric regression model. After the analysis, only three promising inhibitors were selected, **Z-964**, **Z-627**, and **Z-814**, with predicted pIC<sub>50</sub> (−log IC<sub>50</sub>) values of **Z-814** = 7.9484, **Z-627** = 9.3458, and **Z-964** = 9.5272. All candidate inhibitors complied with Lipinski's rule of five, which predicts good oral availability, and they can be used in in vitro and in vivo tests in the zebrafish model in order to confirm the in silico data obtained.

**Keywords:** *in silico*; COX-2 inhibitors; molecular modeling

#### **1. Introduction**

Inflammatory processes stand out for their pathophysiological aspect, as they are caused by pathogenic microorganisms, such as viruses, bacteria, fungi, and parasites that invade host cells to reproduce, resulting in a complex and heterogeneous group of diseases that cause morbidity and mortality to those affected by them. At the same time, the need for specific attention to protect the integrity of human organisms from harmful or exogenous agents is emphasized [1].

Inflammatory mediation is regulated by the action of neutrophils, mast cells, eosinophils, macrophages, dendritic cells, and epithelial cells. It is a complex process involving vasodilation, chemotaxis, and increased permeability. Sensors called Toll-like receptors (TLR) recognize the products of pathogens, such as endotoxins and bacterial DNA that are located in the plasma membrane and in the endosomes, and they are capable of detecting intra and extracellular microorganisms [2].

When an injury occurs, platelets release complement proteins, mast cells degranulate, releasing histamine (vasodilation) and serotonin (cell permeability and diapedesis), and neutrophils are activated and migrate to the site of action under the influence of chemokines. Neutrophils phagocytose pathogenic organisms, releasing mediators and attracting macrophages, which increases the release of pro-inflammatory mediators (prostaglandins and leukotrienes) and cytokines (interleukin 1 (IL-1), interleukin 6 (IL-6), and tumor necrosis factor (TNFα)). When cells are activated, the arachidonic acid (AA) of the cell membrane is converted by enzymes for the synthesis of prostaglandins and leukotrienes.

In one of the stages of the inflammatory process, prostaglandins, also called eicosanoids, are synthesized when various stimuli activate cell membrane receptors coupled with a regulatory protein, which results in the activation of phospholipase A2 or in an increase in the concentration of Ca<sup>2+</sup>. This type of enzyme hydrolyzes membrane phospholipids, consequently releasing the cascade of arachidonic acid, which is a substrate for the synthesis of physiopathological and inductive prostaglandins; see Figure 1 below [3].

**Figure 1.** Cascade of arachidonic acid.

Prostaglandin-endoperoxide synthases (PTGS), known as cyclooxygenases, are responsible for the synthesis of prostaglandins, which are precursors of biologically active molecules. Figure 1 shows that the conversion of AA into signaling molecules takes place in two steps. (i) The enzyme cyclooxygenase-1 (COX-1, PTGS1) catalyzes the addition of two free oxygen atoms to form the 1,2-dioxane bridge and a functional group to form prostaglandin G2 (endoperoxide (PGG2)). (ii) COX-2 (PTGS2) reduces the peroxide functional group to a secondary alcohol, forming prostaglandin H2 (PGH2). PGG2 and PGH2 are unstable, but they are precursors for the formation of other prostaglandins (PGE2, PGD2, PGF2), prostacyclin (PGI2), and thromboxane A2 (TXA2), which are commonly called eicosanoids.

Currently, it is known that two genes express two distinct, but similar, isoforms of COX. These enzymes catalyze the biosynthesis of prostaglandins and thromboxanes by reducing arachidonic acid; they are called COX-1 and COX-2, with similar general protein structures that essentially catalyze the same reaction. COX-1 is an enzyme considered to be constitutive; it is part of the homeostatic maintenance of various processes of the human organism, and it is present in most tissues, including stomach, kidney, and coronary arteries [4].

On the other hand, the COX-2 enzyme is considered inductive; that is, it expresses the inflammatory process in cells. The understanding of the COX inhibition process has allowed and still allows the development of new perspectives in relation to the therapeutic targets and the synthetic drugs produced, so that they are as selective and smooth as possible, corroborating their adverse effects [4,5]. Widely prescribed non-steroidal anti-inflammatory drugs (NSAIDs) are a class of drugs used to treat pain, fever, and inflammation. Since 1893, when acetylsalicylic acid was produced, NSAIDs have become the most accepted and prescribed drugs. However, it was not until 1971 that John Vane clarified the mechanism of action of these drugs: they inhibit cyclooxygenase (COX), consequently preventing the synthesis of prostaglandins [6].

Non-selective or traditional NSAIDs inhibit both isoforms (1 and 2), but COX-1 inhibition is the main cause of increased gastrointestinal bleeding or ulcer formation, abdominal pain, and dyspepsia, including indomethacin, naproxen, and ibuprofen. On the other hand, COX-2 inhibition, inducing pro-inflammatory processes, plays a role in pain relief with reduced gastrointestinal effects, which is usually expressed by the coxib class, among them rofecoxib, celecoxib, valdecoxib, and lumiracoxib [7].

The knowledge of the structures of COX-1 and COX-2 and their active sites constitutes the fundamental basis for the development of more specific inhibitors for COX-2 and for the elaboration of studies of the structure–activity relationship of these products. During enzymatic activity, arachidonic acid binds to Arg<sup>120</sup> and to Ser<sup>530</sup>. An electron is transferred from Tyr<sup>385</sup> to an oxidized heme, which is also bound within the enzyme, initiating the cyclooxygenase reaction. Several studies have attempted to elucidate how and where non-steroidal anti-inflammatory drugs act on cyclooxygenase to block prostaglandin synthesis. Within the hydrophobic channel of COX, an amino acid difference at position 523 (isoleucine in COX-1 and valine in COX-2) can be of critical importance in the selectivity of several drugs [8].

One of the main drugs used, with a selective effect to inhibit the pro-inflammatory effect of COX-2, was rofecoxib, which would have the potential for treatment without side effects such as ulcers and gastrointestinal problems. Studies by Bombardier et al. 2000 [5] reported that when compared to a 500 mg dose of naproxen per day for the same period, the efficacy was equivalent, but the administration of rofecoxib had fewer side effects such as bleeding gastric and duodenal ulcers. However, it was observed that the toxicological effect of rofecoxib, with a daily 50 mg dose for 9 consecutive months, doubled the incidence of acute myocardial infarction and stroke [8].

At the current stage of knowledge, in which the binding sites for specific inhibitors in COX-2 have already been described, and the three-dimensional protein structure of the enzyme is clearly established, the use of modern molecular modeling techniques should be able to screen new compounds of high affinity and specificity, but probably without the presence of the sulfonamide and sulfone groups of the second-generation compounds seen previously, thus representing the birth of a third generation of specific COX-2 inhibitors [9].

The process from a biological target project to the discovery of a new drug can take 10 years or more on average, and computational chemistry offers valuable guidance in rational drug design, with numerous success stories involving computer simulations, for example losartan, atorvastatin, and celecoxib.

Mathematical analyses accompany in silico studies in order to enable the reduction of costs and time to obtain positive results, observing the molecular structures and their possible affinities with the therapeutic target, using the quantitative structure–activity relationship (QSAR). This method aims to build parametric models for predicting inhibitory activity (IC50) correlated with dependent variables such as physical–chemical, biological, and toxicological properties [10].

In these analyses, reduction filters are applied to the models used to predict inhibitory activity, correlating the molecule structures with their activity and toxicological potential, and comparing them with a positive control. In 2016, Brick et al. [10] applied QSAR analysis to identify new antimalarial inhibitors from 1H-imidazol-2-yl-pyrimidine-4,6-diamines, with reducing filters to eliminate descriptors that did not show correlation or information relevant to the process of statistical and toxicological analysis, beginning the screening with 107 compounds from the ZINC database and ending with the four most promising compounds.

In this context, one of the main types of studies in progress in the scientific community is the in silico and in vivo study of inhibitors of inflammatory processes to search for new selective molecules. In parallel, the quantitative structure–activity relationship (QSAR) study uses multiparametric models that interrelate biological activity with the physicochemical properties of selected molecules in order to predict their inhibitory capacity against the inflammatory mechanism [10]. Therefore, the objective of this work is the virtual screening of analogs of rofecoxib (Vioxx®), based on pharmacophore and QSAR analysis, comprising approaches and molecular modeling techniques through free software that is easily accessible by the scientific community, in parallel with the prediction of pharmacokinetic properties and toxicity that show the possible effectiveness of the selected structures, according to the methodological scheme presented in Figure 2 (see more details in the Materials and Methods section).

**Figure 2.** General scheme summarizing of the methodological steps.

#### **2. Results and Discussion**

#### *2.1. Molecular Optimization and QSAR Analysis*

The structure of rofecoxib was selected as a pivot given its potential to mitigate the gastrointestinal effects compared to other selective drugs. Although this drug has the unwanted effect of acute myocardial infarction, the objective is to detect essential pharmacophore characteristics through virtual screening so that the selected promising structures have the same effectiveness. In addition, rofecoxib is one of the few ligands with a complexed structure deposited in the Protein Data Bank (PDB) (5KIR, https://www.rcsb.org/) for the *Homo sapiens* organism, which makes the results more relevant to the human organism.

The 32 molecules (rofecoxib as a pivot) for the analysis were selected from the BindingDB database (https://www.bindingdb.org/bind/index.jsp) in increasing order of IC<sub>50</sub>, with specific activity related to COX-2 and the *Homo sapiens* organism, and without repeated inhibitory activity values, which could produce false-positive results in the linear fit of the statistical analysis.

The molecular optimization values are shown in Table S1. The overlapping process was carried out by selecting the molecules with the lowest energy values (PM3), since the optimization of molecular structures aims to bring each structure closer to its minimum-energy conformation and to the observed experimental data. The optimization time was also recorded to quantify the computational cost, as optimization is an expensive and time-consuming process [11].

Later, the molecules were submitted to the PharmaGist software (https://bioinfo3d.cs.tau.ac.il/PharmaGist) for the extraction of physicochemical properties and the construction of structure–activity relationship (QSAR) models. The characteristics were analyzed with the aid of the Statistica® software, and the most relevant ones were used to predict the inhibitory activity as a function of the pIC<sub>50</sub> value, which decreases statistical inconsistencies. This software is capable of predicting the relationship between the inhibitor structure and its inhibitory activity, with a Pearson correlation cut-off of 0.4, obtaining a training set with n = 20 structures (methodology adopted by Santos, Cruz and Santos) [12–14]. Table 1 shows the selected descriptors. The atoms (ATM) characteristic presented the best correlation among all descriptors, with a value of 0.7651, which allows us to infer that the number of atoms significantly influences the pIC<sub>50</sub> responses of the selected molecules. However, it must be noted that the selected regression model is tetra-parametric, so the prediction analysis must take into account the contribution of each descriptor to the predicted inhibitory activity value, as is the case for the aromatic characteristics (ARO), with a p value of 0.7358, and the acceptors (ACC), with a p value of 0.6399, which also contribute to the prediction of the inhibitory activity.
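The descriptor selection step can be illustrated with a short sketch. It converts an IC<sub>50</sub> value to pIC<sub>50</sub> and keeps the descriptors whose Pearson correlation with the activity reaches the 0.4 cut-off. The function names, the use of the absolute value of the correlation, and the toy numbers are assumptions made for illustration; they do not reproduce the Statistica® workflow or the data of this study.

```python
import math


def pic50_from_nm(ic50_nm: float) -> float:
    """pIC50 = -log10(IC50 in mol/L); for an IC50 given in nM this is 9 - log10(IC50)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 9.0 - math.log10(ic50_nm)


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def select_descriptors(descriptors: dict, activity, cutoff: float = 0.4) -> dict:
    """Keep descriptors with |r| >= cutoff (taking the absolute value is an assumption)."""
    selected = {}
    for name, values in descriptors.items():
        r = pearson(values, activity)
        if abs(r) >= cutoff:
            selected[name] = r
    return selected


if __name__ == "__main__":
    # Hypothetical toy data, not taken from Table 1.
    activity = [pic50_from_nm(v) for v in (10.0, 50.0, 200.0, 1000.0)]
    descriptors = {"ATM": [50, 46, 41, 37], "ARO": [4, 4, 3, 2]}
    print(select_descriptors(descriptors, activity))
```

A descriptor that passes the cut-off is only a candidate; as noted above, its weight in the final multi-parameter model still has to be assessed by regression.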

This result is also supported by the hierarchical cluster analysis (HCA) (Figure S1) performed with the aid of the Minitab® Trial software, which allows observation of the similarity between the physicochemical characteristics and the inhibitory activity of the respective molecules, corroborating the data obtained by Pearson's correlation. The characteristics ATM and ARO show greater proximity to the predicted pIC<sub>50</sub>. The descriptor ACC is inversely proportional to the predicted pIC<sub>50</sub> value, which indicates that the presence of hydrophilic groups capable of establishing hydrogen-bonding interactions can increase the inhibitory potential of the selected structures.

The ATM characteristic may not be essential when analyzed individually; however, we observed that the greater the number of atoms present in a structure, the greater its volume and topological polar surface area (TPSA), which are both characteristics that are essential for good oral absorption of the medication in the body, consequently obeying the Lipinski rule. On the other hand, it is not the only relevant characteristic for the prediction of the values of inhibitory activity and must be corroborated with the other characteristics provided by the statistical analysis [15]. Figure S2 represents the PCA for the selected molecules. It correlates their characteristics with the inhibitory activity; the compounds with the lowest activity are in red, and those with the best activity are in blue. Molecule **11** is displaced from the others because it presents a number of hydrogen donors equal to 3, which differs from the other molecules in the selected training group, which present values of 0 and 1, and this is statistically reflected in a decrease in the pIC<sub>50</sub> values of the structures.


**Table 1.** Properties of the screened training molecules obtained from the PharmaGist server.

In addition, the values of the number of atoms provide a better forecast: molecule **11** (37 atoms) shows one of the lowest values for ATM, as do **15, 18, 19,** and **20**. It is observed that for the most active molecules, the ATM characteristics are relevant to the value of inhibitory activity, shifting them to the most active side. All the most active structures have four aromatic groups, except for molecule **4**, which has only two; however, its inhibitory activity is accentuated by the number of atoms in its structure (50).

In parallel, it must be understood that the predicted activity depends on the correlation between the four selected characteristics and their relative weight, and that the objective of the preliminary QSAR analysis is to investigate the most relevant characteristics among the data sampled from structures already reported as selective COX-2 inhibitors. Figure S3 shows the HCA for the selected inhibitors. The HCA gathers the inhibitors into hierarchical groups by similarity; the most active are represented in blue, and the least active ones are in red. It is observed that the cluster data agree with the data obtained previously via QSAR and principal component analysis (PCA).

The tree-like dendrogram relates the structural similarities to the inhibitory activity response. It is noticed that **2** to **9** have four aromatic groups, with the exception of **4**, which has only two, but its activity is enhanced by the number of atoms of the structure. Figure 3 shows the structure of the eight most active molecules. The common group observed in these inhibitors is pyrrole, in addition to the methylsulfone group, with the exception of **4**. The presence of the methylsulfone group resembles the structure of rofecoxib and differs from the structures of the other coxibs (celecoxib and etoricoxib), which have a sulfonamide functional group that is responsible for their toxicological characteristics. Pyrrole gives the appropriate lipophilic character to the molecule, which can help the molecule enter the COX-2 active channel [16].

<sup>\*</sup> Pivotal molecule. [a] Atoms; [b] Aromatics; [c] Donors; [d] Acceptors.


**Figure 3.** The most active molecules.

Table 2 shows the regression data of the descriptors used to verify the best model for predicting the inhibitory activity. Combinations of descriptors were used to evaluate the statistical parameters and to select the parametric prediction equation with the best fit. It is observed that the physicochemical parameters ATM, ACC, and ARO are significant for the final calculated pIC<sub>50</sub> values. The best statistical parameters were obtained for the tri- and tetra-parametric models, with R<sup>2</sup> values of 0.9599 and 0.9617 and variance ratios of 62.6373 and 46.2719, respectively. It is emphasized that the greater the number of equated scores, the greater the quality of the predictor model, although the values are statistically close [10].


**Table 2.** Parametric models and regression analysis values (R: correlation coefficient; R<sup>2</sup>: squared correlation coefficient; R<sup>2</sup><sub>A</sub>: adjusted squared correlation coefficient; SEE: standard error of estimate; F: variance ratio).
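For readers who want to recompute the statistics named in this caption, the following is a minimal sketch using the standard textbook definitions for a least-squares model with an intercept and k descriptors. Whether the statistical package used in this study applies exactly these formulas is an assumption, and the function name is hypothetical.

```python
import math


def regression_stats(observed, predicted, n_descriptors: int) -> dict:
    """Return R^2, adjusted R^2, SEE and F for a fitted linear model.

    observed / predicted: experimental and model pIC50 values.
    n_descriptors: number k of descriptors in the model (e.g. 4 for a
    tetra-parametric model). Assumes an ordinary least-squares fit with intercept.
    """
    n = len(observed)
    k = n_descriptors
    dof = n - k - 1  # residual degrees of freedom
    if dof <= 0:
        raise ValueError("need more observations than parameters")
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - ss_res / ss_tot
    return {
        "R2": r2,
        "R2_adj": 1.0 - (1.0 - r2) * (n - 1) / dof,
        "SEE": math.sqrt(ss_res / dof),
        "F": (r2 / k) / ((1.0 - r2) / dof),
    }
```

Because adding a descriptor can never lower R<sup>2</sup>, the adjusted value and F are the more informative quantities when models with different numbers of descriptors are compared.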

Note that the tetra-parametric prediction values were better adjusted, with R<sup>2</sup> = 0.9617 and a standard error of estimate (SEE) of 0.2238, indicating a notable predictive capacity. This agreement can be compared with the residual values found during the validation step of the equation, with ∆4 differences close to 0.2, which demonstrates the ability to predict the values of inhibitory activity. Table 3 lists the predictions of the models for the molecules, allowing inference of the difference (∆ = residual) between the experimental pIC<sub>50</sub> values and the values predicted statistically by a given parametric model. The equation was chosen considering the highest statistical correlation.


**Table 3.** pIC<sub>50</sub> values calculated using the prediction equations.

\* Pivotal molecule; [a] internal validation.

The best results were obtained for inhibitors **10** (∆4 = 0.0075) and **12** (∆4 = 0.0020), whereas **2** (∆4 = 0.4764), **16** (∆4 = −0.2779), and **18** (∆4 = −0.2625) present residuals greater than 0.2 in absolute value relative to the experimental data. Given the margin of error (SEE = 0.2238), these may still be within the desired residual range, which is, above all, less than 1. The internal validation set revealed anomalous samples, which had initially been excluded from the test set because, with residuals greater than 0.4, they reduced the statistical correlation of the applied parametric method, increasing the estimated error and shifting the predicted values away from the experimental data. However, their reinclusion aims to determine the predictive capacity of the model, and results with residuals less than 1 are significant.
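The residual thresholds discussed in this paragraph (0.2, 0.4, and 1) can be checked with a few lines of code. The sketch below uses only the five ∆4 values quoted in the text; the function name is hypothetical, and the sign convention of ∆ (experimental minus predicted, or the reverse) is not needed because absolute values are compared.

```python
# Delta-4 residuals quoted in the text for the tetra-parametric model,
# keyed by molecule number.
RESIDUALS = {2: 0.4764, 10: 0.0075, 12: 0.0020, 16: -0.2779, 18: -0.2625}


def exceeding(residuals: dict, threshold: float) -> list:
    """Molecule numbers whose absolute residual is larger than the threshold."""
    return sorted(m for m, delta in residuals.items() if abs(delta) > threshold)


if __name__ == "__main__":
    for limit in (0.2, 0.4, 1.0):
        print(limit, exceeding(RESIDUALS, limit))
```

Among these five molecules, only **2** exceeds the 0.4 limit, and none exceeds 1.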

Figure 4 shows the data plotted against the fitted linear regression of the tetra-parametric model (R<sup>2</sup> = 0.9617), showing a good relationship between the experimental and predicted values.

**Figure 4.** Linear correlation graph of the tetra-parametric model.

Table 4 shows the predicted values for the external validation set, applying the equations according to Table 4. For the validation test, a set comprising between 20% and 30% of the original set was used in the predictive model in order to prove its robustness [10]. It is observed that the values have good predictive quality for the molecules selected as an external set, with the closest agreement for **26**, **27**, and **30**, whose residuals are close to 0.1. This shows that the model has a significant correlation between the descriptors used.


**Table 4.** External validation set.

#### *2.2. Virtual Screening and Analysis of Pharmacokinetic and Toxicological Properties*

After the best inhibitors were selected, they were submitted to the Protox II and Molinspiration servers to define the reduction filters. The compounds were then screened against the ZincPharmer database through the Pharmit web server (http://pharmit.csb.pitt.edu/search.html), as shown in Table S2. Maximum and minimum values were set to limit the promising structures to those within the characteristics previously established through the QSAR analysis, with a maximum limit of 2000 structures to be selected. The filters are applied on the online server, together with the pharmacophore coordinates (below), which are reported to allow data reproduction and comparison with the statistical analysis.

The pharmacophore obtained through the PharmaGist server by aligning the twenty selected molecules is shown in Table 5. It is observed that the characteristics of the pharmacophore follow the data obtained through statistical analysis, presenting two aromatic groups, two hydrophilic groups, and hydrogen-bond acceptors. Such pharmacophore characteristics are essential when compared to the pivot molecule, which has two ARO groups and four ACC groups, allowing the tracking of molecules with physicochemical characteristics closer to those of rofecoxib.


**Table 5.** Pharmacophore characteristics.

On the other hand, in the studies by Chakraborty, Sengupta, and Roy (2004) [17], multiple linear regression (MLR) analyses were used to deduce statistically acceptable equations. The variation ratios were 0.675 for COX-1 and 0.842 for COX-2, and three important pharmacophore groups were observed: the methyl sulfonyl portion, the central phenyl ring, and the terminal phenyl ring. These are relevant because of their affinity for the lipophilic channel present in the active sites of the enzymes, corroborating the data obtained in the present study.

After the reduction and selection filters were applied, the 2000 compounds were submitted to a Tanimoto similarity analysis to find out which ones are closest to the characteristics of the pivot molecule used in the process (rofecoxib). Fifty-eight (58) molecules were obtained with a Tanimoto index greater than 0.35 (see Table 6), which is a value that is considered reasonable for the application of in silico toxicological and pharmacological prediction studies. Three promising molecules were selected during these tests, as reported below, and were subsequently submitted to the molecular docking and molecular dynamics tests [18].
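As an illustration of the similarity filter, the Tanimoto index of two binary fingerprints is the number of shared on-bits divided by the number of on-bits present in either fingerprint. The sketch below represents fingerprints as Python sets of bit positions; the fingerprint type used in this study is not specified here, so the bit sets, compound names, and function names are hypothetical.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto index |A & B| / |A | B| for fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def filter_by_similarity(reference: set, candidates: dict, cutoff: float = 0.35) -> dict:
    """Keep candidates whose Tanimoto index to the reference is greater than the cutoff."""
    kept = {}
    for name, fingerprint in candidates.items():
        score = tanimoto(reference, fingerprint)
        if score > cutoff:
            kept[name] = score
    return kept


if __name__ == "__main__":
    # Hypothetical bit sets standing in for the pivot molecule and two screening hits.
    pivot = {1, 2, 3, 4}
    hits = {"hit_a": {2, 3, 4, 5}, "hit_b": {3, 4, 5, 6}}
    print(filter_by_similarity(pivot, hits))
```

In this toy case `hit_a` shares three of five bits with the pivot and passes the 0.35 cut-off, while `hit_b` shares two of six and is discarded.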

**Table 6.** Similarity studies of molecules using the Tanimoto Index.


Table 7 shows the best results of the toxicological tests applied to the three inhibitors selected through the Tanimoto index, together with tests performed on the online server PreADMET (https://preadmet.bmdrc.kr/adme/), in order to screen for those with better absorption, distribution, and metabolism values while limiting the possibility of mutagenicity. The molecules showed high LD<sub>50</sub> values, with the exception of **Z-627**; however, this molecule presents good values in the absorption and distribution tests, which contributed to its selection for the molecular docking tests. Carcinogenicity tests for rats and mice indicate a possibility of mutation for all the selected inhibitors; however, this important side effect was also observed for the control compound, and accordingly, it did not prevent the selection of these molecules. At the same time, the Ames test was used as a cut-off parameter: compounds with a positive result were eliminated from the subsequent steps. This test assesses the mutagenicity of chemical compounds in media with a low histidine concentration, which allows strains of *Salmonella typhimurium* to mutate and revert to a prototrophic state, which directly influences the carcinogenic response.


**Table 7.** Toxicological data of selected inhibitors.

**[a]** Protox (http://tox.charite.de/protox\_II) Class I: fatal if swallowed (LD<sub>50</sub> ≤ 5); Class II: fatal if swallowed (5 < LD<sub>50</sub> ≤ 50); Class III: toxic if swallowed (50 < LD<sub>50</sub> ≤ 300); Class IV: harmful if swallowed (300 < LD<sub>50</sub> ≤ 2000); Class V: may be harmful if swallowed (2000 < LD<sub>50</sub> ≤ 5000); Class VI: non-toxic (LD<sub>50</sub> > 5000).
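The class boundaries in this footnote translate directly into a lookup. The following sketch maps a predicted LD<sub>50</sub> to its toxicity class; the function name is hypothetical, and the unit (mg/kg body weight) is an assumption, since the footnote does not state it.

```python
# Upper LD50 bound of each class, as listed in the footnote (unit assumed: mg/kg).
CLASS_BOUNDS = [(5, "I"), (50, "II"), (300, "III"), (2000, "IV"), (5000, "V")]


def protox_class(ld50: float) -> str:
    """Return the toxicity class (I to VI) for a predicted LD50 value."""
    if ld50 < 0:
        raise ValueError("LD50 cannot be negative")
    for upper, label in CLASS_BOUNDS:
        if ld50 <= upper:
            return label
    return "VI"  # LD50 > 5000: non-toxic


if __name__ == "__main__":
    for value in (3, 250, 1500, 6000):  # hypothetical LD50 values
        print(value, protox_class(value))
```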

The pharmacokinetic data for distribution are shown in Table 8. The plasma protein binding values (PPB) refer to the degree of binding of the inhibitors with the proteins present in the blood and Cbrain/Cblood represents the permeability of the blood–brain barrier. Compounds with Cbrain/Cblood values less than 1 do not have activity on the central nervous system (CNS).


**Table 8.** Distribution data of selected inhibitors.

**[a]** Permeability of the blood–brain barrier; **[b]** Plasma protein binding.

It is observed that **Z-964** shows 100% binding to plasma proteins, suggesting possible bioaccumulation and a consequent increase in its half-life within the organism, since the unbound portion is metabolized and then excreted, while the bound portion is slowly released to maintain the equilibrium of the medium. In parallel, **Z-627** showed 85% binding, indicating that 15% of the compound remains unbound, which increases the efficiency of diffusion and penetration into cell membranes [19]. None of the selected compounds has activity on the central nervous system, as all show values below one. In silico values for absorption are shown in Table 9.



**[a]** Cell permeability; **[b]** Human intestinal absorption; **[c]** Cell permeability in Madin-Darby canine kidney (MDCK) cells; **[d]** in vitro P-glycoprotein inhibition.

The selected drug candidates showed high values for human intestinal absorption (HIA > 94%), which is one of the most important absorption, distribution, metabolism and excretion (ADME) properties [20]. Drug molecules are transported from the gastrointestinal tract to the bloodstream and permeate the gastrointestinal membrane by various mechanisms, among which the activity of P-glycoprotein (P-gp) must be taken into account. P-glycoprotein is a common transporter in the intestinal penetration of drugs, which supports the hypothesis that the inhibitors **Z-964** and **Z-814**, because they inhibit P-gp in vitro, decrease the efflux process, mediated by this protein, during the passive permeation of the inhibitors. However, they have considerable absorption values when compared with those of the other molecules screened in this study.

The P<sub>MDCK</sub> permeability value is significant for the **Z-814** inhibitor (28.3061 nm/sec), being higher than that of the control compound. Values greater than two indicate significant drug efflux. Compound **Z-964** showed a low MDCK permeability (0.0517 nm/sec), and **Z-627** approached the ideal (1.4352 nm/sec) [21,22]. In parallel, the Caco-2 permeability assay measures the flow rate of a compound through Caco-2 cell monolayers to predict in vitro drug absorption, where values greater than two indicate drug efflux. Inhibitors **Z-814** (P<sub>Caco-2</sub> = 12.4185 nm/sec) and **Z-964** (P<sub>Caco-2</sub> = 42.9100 nm/sec) have higher values than rofecoxib (P<sub>Caco-2</sub> = 2.7291 nm/sec), whereas the **Z-627** inhibitor (P<sub>Caco-2</sub> = 0.6460 nm/sec) has a lower value; however, it is not a P-gp inhibitor, which can significantly interfere with intestinal absorption [22,23].

Table 10 presents the predicted biological activity of the selected inhibitors and compares the results against the selected controls rofecoxib and celecoxib; the predictions were obtained from the PASS server (http://www.pharmaexpert.ru/passonline/). Celecoxib is used in everyday clinical practice and is part of the external validation and molecular docking sets of this research. The three selected inhibitors showed Pa > Pi values, indicating the possibility of activity in relation to the reported biological activities, mainly in terms of anti-inflammatory responses. **Z-627** has the best values for anti-arthritic (Pa = 0.985) and anti-inflammatory (Pa = 0.852) activities, the latter being higher than that of the controls (**Z-627** = 0.852, rofecoxib = 0.828, celecoxib = 0.663).


**Table 10.** Prediction of biological activity of selected inhibitors.

**[a]** Pa = Possibility of activity; **[b]** Pi = Possibility of inactivity.

All the candidate inhibitors have the possibility of activity against COX-2, although it is below the value of the reference compounds. Nonetheless, it serves as a reference for possible activities that they may present during the in vivo tests to be performed. In addition, a prediction of the adverse effects that they may have on the organism was performed, verifying that all of them present a propensity similar to that of the selected control compounds (celecoxib and rofecoxib) in the case of extrapyramidal effects. However, they have a lower propensity for the emergence of gastrointestinal problems, such as ulcers, which is the main focus in the development of new selective anti-inflammatory drugs, and the **Z-964** structure did not show any possibility of presenting such an adverse effect. Table 12 shows the physicochemical data of the selected inhibitors.


**Table 11.** Cardiotoxic effects of selected molecules.

[a] Pa = Possibility of activity; [b] Pi = Possibility of inactivity; [c] Metatox web; [d] PreADMET.

Given the possibility that the structures present cardiotoxic risks, the results were compared with celecoxib, a molecule that is still marketed. Celecoxib itself presents risks of myocardial infarction and heart failure, which were analyzed through Metatox (http://way2drug.com/mg2/gen\_meta\_all.php) and through the hERG (human ether-à-go-go-related gene) study on the PreADMET server, which refers to the blocking of the potassium channel that may cause cardiac collateral damage; see Table 11.

In view of the foregoing, this fact has not led to its withdrawal from the market, as follow-up and adequate dosage reduce side effects and toxicological risks, which is valid for every drug currently sold; therefore, its cost–benefit must be evaluated. For now, it is seen that the molecules present a risk similar to or below that of the molecule withdrawn from the market (rofecoxib) and of the one that is still on the market (celecoxib), which does not preclude their evaluation as candidates for specific COX-2 inhibitors. It is observed that the structures **Z-814** and **Z-627** present low and medium hERG risk, respectively, being better than or equal to the molecules already commercialized, which makes their application as candidate inhibitors of cyclooxygenase-2 possible. The **Z-964** structure remains in the study due to its good bioavailability results, in order to evaluate its experimental response, as well as that of the others, in another study.

All candidate inhibitors present physicochemical data within the acceptable range, with no violations (Nv) of Lipinski's rule of five. This rule says that drugs with good oral bioavailability must obey four physicochemical parameters: molecular weight (MW) ≤ 500 g/mol, octanol/water partition coefficient (log *P*) ≤ 5, number of hydrogen-bond donor groups (nHD) ≤ 5, and number of hydrogen-bond acceptor groups (nHA) ≤ 10 (see Table 12).
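The four criteria can be expressed as a simple violation counter, corresponding to the Nv column of Table 12. This is a minimal sketch; the class and function names are hypothetical, and the property values in the usage example are invented for illustration rather than taken from Table 12.

```python
from dataclasses import dataclass


@dataclass
class Properties:
    mw: float    # molecular weight, g/mol
    logp: float  # octanol/water partition coefficient
    n_hd: int    # number of hydrogen-bond donors
    n_ha: int    # number of hydrogen-bond acceptors


def lipinski_violations(p: Properties) -> int:
    """Count how many of the four rule-of-five criteria are violated."""
    criteria = [p.mw <= 500, p.logp <= 5, p.n_hd <= 5, p.n_ha <= 10]
    return sum(1 for ok in criteria if not ok)


if __name__ == "__main__":
    # Hypothetical property values.
    candidate = Properties(mw=314.4, logp=2.1, n_hd=1, n_ha=4)
    print(lipinski_violations(candidate))
```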


**Table 12.** Physicochemical data of the selected inhibitors.

[a] Partition coefficient; [b] Topological Polar Surface Area; [c] Molecular Weight; [d] Number of Hydrogen Acceptors; [e] Number of Hydrogen Donors; [f] Number of violations; [g] Number of Rot bonds.

Table 12 shows that they have good absorption or permeability [24]. The pIC<sub>50</sub> values (nM) were predicted for the selected molecules (see Table 13) according to the equations of Table 4, demonstrating acceptable values. The three inhibitors selected with the Tanimoto index are shown in Figure 5.


**Table 13.** Predicted pIC<sub>50</sub> values of the selected inhibitors and controls.

[a] Atoms; [b] Aromatics; [c] Donors; [d] Acceptors.

**Figure 5.** Selected inhibitors with Tanimoto index. (**A**) **Z-814**; (**B**) **Z-964**; (**C**) **Z-627**.

#### *2.3. Molecular Docking*

Figure 6 shows the calculated poses superimposed on the deposited PDB complexes, with a root mean square deviation (RMSD) of 0.91 Å for rofecoxib (**RCX**; PDB code 5KIR), 0.63 Å for celecoxib (**CEL**; PDB code 3LN1), and 0.71 Å for indomethacin (**IMS**; PDB code 2OYE). Since RMSD values of up to 2 Å are accepted in molecular docking studies, these results validate the protocols used [25,26].

**Figure 6.** Overlapping poses of the crystallographic complexes (in green) with calculated poses (in red): (**a**) rofecoxib (**RCX**) for *Homo sapiens* (PDB 5KIR), (**b**) celecoxib (**CEL**) for *Mus musculus* (PDB 3LN1), and (**c**) indomethacin (**IMS**) for *Ovis aries* (PDB 2OYE).
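The redocking check summarized above compares the docked pose of each ligand with its crystallographic pose. A minimal sketch of that comparison is given below; it assumes that both poses are expressed in the same receptor frame and list the same heavy atoms in the same order (no superposition or symmetry correction is performed), and the function names and coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b) -> float:
    """Root mean square deviation between two equally ordered lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def redocking_ok(crystal, docked, cutoff: float = 2.0) -> bool:
    """True when the docked pose reproduces the crystal pose within the cutoff (in Å)."""
    return rmsd(crystal, docked) <= cutoff


if __name__ == "__main__":
    # Hypothetical three-atom ligand; coordinates in Å.
    crystal_pose = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
    docked_pose = [(0.2, 0.1, 0.0), (1.6, 0.2, 0.1), (1.4, 1.7, 0.0)]
    print(rmsd(crystal_pose, docked_pose), redocking_ok(crystal_pose, docked_pose))
```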

Figure 7 shows the interactions of the selected inhibitors and of the control RCX in *Homo sapiens*. It is known that COX-1 and COX-2 have practically identical tertiary structures; however, the main difference between them is the replacement of the Ile434, His513, and Ile523 residues in COX-1 by Val434, Arg513, and Val523 in COX-2, respectively. This allows an increase of approximately 25% in the active site, which consists of a more accessible pocket with Arg513 as a fundamental bonding site [16,27]. Figure 7a shows the main interactions of rofecoxib within the pocket that provides selective inhibition.


**Figure 7.** Interactions of the inhibitors (**a**) rofecoxib, (**b**) **Z-627**, (**c**) **Z-964**, and (**d**) **Z-814** with the active site of the structure of Vioxx bound to *Homo sapiens* cyclooxygenase-2 (COX-2, PDB 5KIR).

It is observed that the selected molecules **Z-814**, **Z-964**, and **Z-627** present a similarity of interactions with the amino acids of the hydrophobic region of the β-sheet, Ser530 and Val523 for the former, and Ser530, Phe518, Val523, and Leu358 for the latter (Figure 7 and Table S3). The lipophilic channel of the enzyme is also constrained by the presence of Tyr355 and Arg513 on the enzyme surface, with the additional hydrogen-bond interaction between the Ala527 and Val523 phenolic group and the sulfone oxygen atoms of the structures.

On the other hand, in COX-2, some interactions allow greater accessibility to the lipophilic channel than in COX-1, indicating a greater ease of interaction with the active site of COX-2 via Phe518. This effect is also reflected in the negative Gibbs free energy required for the interaction to occur (Figure S6). The selected inhibitors may have an affinity equivalent to that of the selected control compounds. The interaction with Ser353 in **Z-814** demonstrates the possibility of a binding activity associated with low IC<sub>50</sub> values [28–30].

In Figure 8 and Table S4, it is possible to verify the interactions of the selected inhibitors and of the reference drug (**CEL**) against *Mus musculus* COX-2. For **CEL**, hydrophobic interactions are observed with Val509, Phe504, Gly512, Ser339, and Leu338, and hydrogen-bond interactions are observed with Gln178, Phe504, and Ser339. In parallel, we can observe the interactions for the selected inhibitors, where **Z-627**, like the control, shows interactions with the hydrophobic residues Ser339 and Val509 and, in addition, presents a hydrogen bond with Ala513, showing selectivity [31,32].


**Figure 8.** Interactions of the inhibitors (**a**) celecoxib, (**b**) **Z-627**, (**c**) **Z-964**, and (**d**) **Z-814** with the active site of the *Mus musculus* COX-2 (PDB 3LN1).

On the other hand, the molecule **Z-964** shows greater interactions with Ala513, Ser339, and Val509 in the lipophilic region present in the β-sheet of the enzyme and with the hydrophilic residue Leu338, which links to the sulfonic group of both inhibitors (sulfonamide for **CEL** and methylsulfone for **Z-964**). The **Z-814** molecule showed a lower affinity than the others, but it showed relative selectivity with respect to the amino acid residues that are part of the interactions (Ser339, Val509, and Phe504 (fluorine)) in the hydrophilic region of the molecule, which allows for interactions parallel to those of the **CEL** molecule. Furthermore, the data corroborate the QSAR analyses with regard to the connections with hydrogen acceptors, which are mainly influenced by the electronegativity of the selected structures. These interactions have already been observed in other studies, corroborating the affinity data shown in Figure S5, in which **Z-967** shows an energy of 10.00 kcal/mol and **Z-964** shows an energy of 9.50 kcal/mol. These data are considered the most important ones [30,31].

Docking studies corroborate the preliminary QSAR results, as they indicate that the presence of aromatic groups can influence the inhibitory activity of such molecules; nevertheless, chemical changes are necessary in order to decrease the cytotoxic effect of the inhibitors when compared with the reference drugs, such as the replacement of the sulfonamide by a methylsulfone group (rofecoxib analogs) [28]. The QSAR analysis demonstrates a structure–activity relationship, as is the case with the characteristics **ARO, ACC**, and **DONN**, which are closely linked with the possibilities of interaction with the active site of the enzyme. Lipophilicity is intrinsically related to the possibility of permeation and good oral availability, as previously reported in terms of compliance with Lipinski's rule of five (logP ≤ 5), and to the interaction with the side pocket of the enzyme [30,31].

The three structures were subjected to molecular docking tests (Figure 9) to assess the possibility of their being selective for COX-1 as well, which would determine the possibility of undesirable adverse effects, such as gastrointestinal problems. They show a lower affinity than that previously reported for the COX-2 enzyme, with low binding energies (**Z-627** = −8.40 kcal/mol, **Z-964** = −8.60 kcal/mol, and **Z-814** = −6.80 kcal/mol) compared to the selected control compounds and to the indomethacin molecule (−10.70 kcal/mol) deposited in the crystallographic structure of the PDB. The energy ratios (COX-2/COX-1 and COX-1/COX-2, as shown in Figure 10) were evaluated following an adaptation of the methodology adopted by Araújo and collaborators (2005) [7], who verified the influence of the most prescribed COX-2 anti-inflammatory drugs on COX-1.

**Figure 9.** Interactions of the inhibitors (**a**) indomethacin, (**b**) **Z-627**, (**c**) **Z-964**, and (**d**) **Z-814** with the active site of the structure of indomethacin complexed with *Ovis aries* COX-1 (PDB 2OYE).

**Figure 10.** Binding affinity ratio of the selected structures.
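The energy ratios plotted in Figure 10 are simple quotients of the docking energies obtained for the two isoforms. A minimal sketch is given below; the function name is hypothetical, the COX-1 energy is the **Z-627** value quoted in the text, and the COX-2 energy in the usage example is a placeholder, not a result of this study.

```python
def affinity_ratios(e_cox2: float, e_cox1: float) -> tuple:
    """Return (COX-2/COX-1, COX-1/COX-2) ratios of two docking energies in kcal/mol.

    With both energies negative, a COX-2/COX-1 ratio above 1 means that the
    predicted binding to COX-2 is the more favorable (more negative) one.
    """
    if e_cox1 == 0 or e_cox2 == 0:
        raise ValueError("docking energies must be non-zero")
    return e_cox2 / e_cox1, e_cox1 / e_cox2


if __name__ == "__main__":
    e_cox1_z627 = -8.40   # kcal/mol, COX-1 value quoted in the text
    e_cox2_example = -9.0  # kcal/mol, hypothetical placeholder
    print(affinity_ratios(e_cox2_example, e_cox1_z627))
```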

It is observed that the activity ratio on COX-1 is lower when compared to COX-2, suggesting that the structures will not be highly selective for isoform 1; it should be emphasized that all the NSAIDs already prescribed have a small selectivity for this isoform, which decreases the possibility of side effects [7]. On the other hand, the likelihood of adverse effects can be compared with the PASS prediction of biological activity (Table 10), which indicates a probability of few gastrointestinal effects. Furthermore, it is noted that selectivity for COX-2 arises from the presence of a valine at position 523 in place of the isoleucine found in COX-1, which in this case interacts with the phenolic ring of the selected structures. In addition, most COX-2-selective inhibitors lack free carboxylate groups, which contributes to this low affinity for isoform 1, where high affinity is expressed by the interaction with the amino acid residue Arg120 [33].

In this case, the **Z-627** structure presents this interaction with the Arg120 residue through the pyrroline portion of the structure, showing a possible structural rigidity and suggesting a possibility of structural modification in order to further limit the relationship estimate. However, the additional methylene group in the Ile523 residue indicates that access to the COX-1 side pocket is restricted, directly impacting the time-dependent competitive inhibition process in relation to COX-2 [34,35].

#### *2.4. Structure–Activity Relationship of the Most Promising Molecules*

The selected compounds (Figure 5) show a similar structure to that of the pivotal compound, having in their structures the methylsulfone group, showing no cytotoxic effect in relation to the sulfonamide group (**Z-627** and **Z-964**). Small substituents are the best, because they influence the volume of the molecule and possible van der Waals interactions with COX-2, which is a fact observed in docking studies. The introduction of fluorinated groups may show more significant activity.

According to Hayashi et al. 2012 [36], substituted analogs by acceptable hydrogen-bond groups potentiate the inhibitor activity. On the other hand, substitutions in the isoindoline nucleus can contribute to the inhibitor–enzyme stabilization, further demonstrating the fundamental role that the electrostatic and dipole–dipole interactions can play [37,38].

At the same time, the endocyclic nitrogen atoms included in five- or six-membered cycles such as pyrrole, pyridine, and pyrimidine, among others, may produce an increase in selectivity. The five-amino group in the isoindoline ring may favor the inhibitory activity of **Z-627** and, moreover, possible hydrogen-bond interactions through the methylsulfone group. The inhibition mechanism depends on the prostaglandin biosynthesis by means of arachidonic acid (AA), estimating that AA fits into the channel cavity surrounded by amino acid residues with aromatic, aliphatic, and phenolic groups that establish several interactions.

Therefore, competitive or selective inhibitors bind to Val523 in COX-2, interfering with the arachidonic acid cascade and preventing the peroxidase action, as well as the formation of prostaglandins or thromboxanes (pro-inflammatory eicosanoids). In parallel with the studies carried out by Hayashi et al. 2012 [36], the best inhibitors **Z-627** and **Z-814** have two hydrogen-bond donors, as well as low values of TPSA and MW. For the three selected inhibitors, we proposed theoretical synthetic routes (Supplementary Materials, Figures S9–S11).

#### *2.5. Molecular Dynamics Results and Affinity Energy*

Molecular dynamics simulations were carried out to understand in more depth how the selected compounds interact with the target proteins. Such simulations support a detailed evaluation of how the conformations of drug–receptor complexes evolve over time [39–41]. Because the dynamics of a protein are closely related to its biological function, observing these motions is extremely important. We therefore investigated the protein structure over the 100 ns of molecular dynamics simulations using root mean square deviations (RMSD) and root mean square fluctuations (RMSF). To plot the RMSD of the ligands, all the heavy atoms of the molecules were used, while to plot the RMSD and RMSF of the protein backbone, the Cα carbon atoms were used. In Figure 11, the graphs of the compounds that were bound to COX-2 of *Homo sapiens* are plotted, while in Figure 12, the RMSD plot of the complexes established with COX-2 of *Mus musculus* is displayed.

**Figure 11.** Root mean square deviations (RMSD) plot of complexes established with *Homo sapiens* COX-2. The protein backbone plot is colored black, but the ligand plots are colored in different ways. (**A**) RMSDs of the COX-2-rofecoxib system, (**B**) RMSDs of the COX-2-Z627 system, and (**C**) RMSDs of the COX-2-Z814 system.

**Figure 12.** RMSD plot of complexes established with *Mus musculus* COX-2. The protein backbone plot is colored black, but the ligand plots are colored in different ways. (**A**) RMSDs of the COX-2-celecoxib system, (**B**) RMSDs of the COX-2-Z627 system, and (**C**) RMSDs of the COX-2-Z814 system.

Along the trajectories of MD simulations, COX-2 showed differences in the RMSDs of the complexes. The maximum plateau reached by the backbone RMSD was 3 Å, observed in the COX-2-Z814 system, and the smallest fluctuation was observed in the COX-2-rofecoxib system. These fluctuations did not impair the interaction with the complexed ligands. It is important to note that the ligand RMSDs showed low values and low fluctuations, which means that the ligands did not undergo drastic conformational changes after settling at the protein binding site.

Similar phenomena were observed for the complexes established between *Mus musculus* COX-2 and ligands. The backbone RMSD plateau was approximately 3 Å, and the ligands also remained in equilibrium throughout the 100 ns simulations, as observed in the RMSD graphs with small fluctuations.

The evaluations of the regions of the protein that obtained the greatest fluctuations along the trajectories of molecular dynamics were performed using the RMSF plot (see Figure 13). In general, the RMSF graphs showed a similar profile, even in the regions that suffered the greatest fluctuations. The greatest fluctuations were observed in the N-terminal portion of the protein.

This region is exposed to the solvent and is formed by alpha helices and beta sheets that are connected by long loop regions. Structurally, loops are the most flexible regions of the protein, so a region that exhibits many loops has a tendency to be flexible, as was observed in the RMSF plots displayed. Although this region is close to the ligand binding site, its flexibility did not compromise the binding of the compounds, since all the ligands showed favorable affinity energy with the protein, according to the molecular mechanics/generalized Born surface area (MM-GBSA) results obtained. In addition, the fluctuation of this region did not affect the conformational stability of the ligand along the trajectory of molecular dynamics, as the RMSD plots of the ligands showed that they remained stable along the trajectories without drastic changes.
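As an illustration of the RMSF measure discussed above, the following minimal sketch computes the per-atom fluctuation about the time-averaged position. It assumes that the frames have already been superimposed on a common reference and uses a toy two-frame trajectory; the function name `rmsf_per_atom` and the coordinates are hypothetical and are not taken from the simulations reported here.

```python
import math


def rmsf_per_atom(frames):
    """Return the RMSF of each atom over a list of pre-aligned frames.

    frames: list of frames, each frame a list of (x, y, z) tuples,
    one tuple per atom (for a backbone analysis, one per C-alpha atom).
    """
    n_frames = len(frames)
    n_atoms = len(frames[0])
    values = []
    for atom in range(n_atoms):
        # Time-averaged position of this atom.
        mean = [sum(f[atom][k] for f in frames) / n_frames for k in range(3)]
        # Mean squared displacement from the average position.
        msd = sum(
            sum((f[atom][k] - mean[k]) ** 2 for k in range(3)) for f in frames
        ) / n_frames
        values.append(math.sqrt(msd))
    return values


# Toy trajectory (invented): atom 0 is rigid, atom 1 moves along x.
toy_frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
]
rmsf = rmsf_per_atom(toy_frames)
```

In this toy case the rigid atom has an RMSF of zero, whereas the mobile atom, which oscillates by 1 Å around its mean position, has an RMSF of 1 Å; flexible loop residues would show up in the same way as peaks in an RMSF profile.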

**Figure 13.** Root mean square fluctuations (RMSF) plot of the backbone of the proteins that established complexes with the compounds obtained by virtual screening. (**A**) RMSF for the COX-2 *Homo sapiens;* (**B**) RMSF for the COX-2 *Mus musculus.* (**C**) The region of the protein that has undergone the greatest fluctuations is highlighted in red.

In addition to structural analysis of the protein and ligands using RMSD and RMSF, we also evaluated whether the compounds are capable of interacting favorably with molecular targets. For this, we used the MM-GBSA method. The results obtained are summarized in Table 14.


**Table 14.** Affinity energy of COX-2 ligands systems.

All ligands were shown to interact favorably with COX-2. The selected compounds showed high affinity for COX-2 when their affinity energies were compared with the values obtained for the positive controls of the *Homo sapiens* and *Mus musculus* proteins. In the system established with human COX-2, compounds **Z-814** (∆G<sub>bind</sub> = −48.15 kcal/mol) and **Z-627** (∆G<sub>bind</sub> = −45.51 kcal/mol) showed binding affinity values similar to that obtained for rofecoxib (∆G<sub>bind</sub> = −42.76 kcal/mol), which was the positive control.

Rofecoxib interacted through hydrogen bonds with the Arg<sup>513</sup> and His<sup>90</sup> residues, with affinity energies of −2.14 and −1.82 kcal/mol, respectively. Ligand **Z-814** established hydrogen bonds with His<sup>90</sup> and Tyr<sup>385</sup> with energy values of −1.53 and −1.48 kcal/mol, while **Z-627** remained interacting with Phe<sup>518</sup> and Ile<sup>517</sup> with affinity values of −1.87 and −1.92 kcal/mol. With the *Mus musculus* protein, the selected ligands, **Z-627** (∆G<sub>bind</sub> = −41.63 kcal/mol) and **Z-964** (∆G<sub>bind</sub> = −44.27 kcal/mol), also showed affinity values close to that of celecoxib (∆G<sub>bind</sub> = −47.78 kcal/mol), which was used as a positive control. Celecoxib interacted with Phe<sup>504</sup> and Ser<sup>339</sup> through hydrogen bonds with affinity values of −1.42 and −1.95 kcal/mol. Ligand **Z-627** interacted with Arg<sup>499</sup> with an affinity of −1.81 kcal/mol, while **Z-964** interacted with Phe<sup>504</sup> with an affinity value of −1.68 kcal/mol. The affinity energy values obtained with the MM-GBSA method for the compounds selected by QSAR were promising, which suggests that the selected substances can be considered as promising COX-2 inhibitors.

#### **3. Materials and Methods**

#### *3.1. Selection of COX-2 Inhibitors*

The molecule considered pivotal in the process was 4-(4-methylsulfonylphenyl)-3-phenyl-5*H*-furan-2-one (rofecoxib), which is known commercially as Vioxx®. It was taken from the BindingDB database (The Binding Database, https://www.bindingdb.org/bind/index.jsp) alongside twenty-four more molecules (Supplementary Material, Figure S7) to study the anti-inflammatory activity against COX-2 according to the literature data, following an increasing criterion of inhibitory activity, or IC<sub>50</sub> (Table 15). The molecules were aligned using the Discovery Studio® v. 4.0 program [42] for input on PharmaGist Web Server15 (http://bioinfo3d.cs.tau.ac.il/pharma/index.html) [43].


**Table 15.** Molecules selected in ascending order of IC<sub>50</sub>.

\* Pivot; a Internal validation; b External validation.

#### *3.2. Optimization of Molecular Structures and Determination of Pharmacophore Characteristics*

The selected inhibitors were pre-optimized by means of Molecular Mechanics (MM+), followed by Austin Model 1 (AM1) and PM3 calculations in the HyperChem 7® program (Table 2), with the lowest energy value used as the criterion for choosing the best model for the construction of the pharmacophore hypothesis. Subsequently, the structures were submitted to the PharmaGist Web Server 15 to determine the following characteristics: atoms (ATM), spatial characteristics (SF), characteristics (F), aromatic (ARO), hydrophobic (HYD), acceptor (ACC), and hydrogen donor (DONN). The initial set comprised 25 molecules, which were aligned according to their similarity with the selected pivot molecule, allowing the generation of pharmacophore models with the aid of the Discovery Studio® v. 4.0 program, following the methodology developed by us [10,12,14,57–59].

#### *3.3. QSAR and PCA/HCA Studies*

The inhibitory activity values were transformed into pIC<sub>50</sub> (−log IC<sub>50</sub>) in order to reduce the inconsistencies of the experimentally obtained data and homogenize the dataset, following the adopted methodological proposal [10,57,59]. In parallel, the importance of each pharmacophore descriptor was assessed: atoms (ATM), spatial characteristics (SF), characteristics (F), aromatic (ARO), hydrophobic (HYD), acceptor (ACC), and hydrogen donor (DONN). These descriptors were used for prediction, and their relevance to the pIC<sub>50</sub> response was evaluated through the Pearson correlation (p), using the software Statistica 7.0® and Minitab 19®, adapting the methodology adopted by Santos et al. 2015 and Ferreira et al. 2019 [12,59]. Pearson's coefficient (Equation (1)) measures the degree of linearity between two variables, assuming a value between +1 and −1. If one variable tends to increase while the other decreases, the value is negative. On the other hand, if both increase, the coefficient is positive. In Equation (1), x̄ is the sample mean for the first variable; *s<sub>x</sub>* is the standard deviation for the first sample; ȳ is the sample mean for the second variable; *s<sub>y</sub>* is the standard deviation for the second sample; and *n* is the column length.

$$\rho = \frac{\sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})}{(n-1)s_{x}s_{y}} \tag{1}$$
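Equation (1) and the pIC<sub>50</sub> transformation can be sketched in a few lines of Python. This is only an illustrative implementation using the standard library, not the Statistica or Minitab procedure used in this work; the function names are hypothetical, the sample standard deviations use the (n − 1) denominator so that they match Equation (1), and the IC<sub>50</sub> is assumed to be expressed in mol/L.

```python
import math


def pic50(ic50_molar):
    """Convert an IC50 given in mol/L (assumed unit) to pIC50 = -log10(IC50)."""
    return -math.log10(ic50_molar)


def pearson(xs, ys):
    """Pearson coefficient as written in Equation (1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample standard deviations with the (n - 1) denominator.
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / ((n - 1) * s_x * s_y)


# Invented descriptor/activity columns, used only to exercise the functions.
increasing = pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
decreasing = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

For perfectly linear toy data the coefficient reaches the limits of +1 (both variables increase) and −1 (one increases while the other decreases), in line with the interpretation given above.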

The best pharmacophore descriptors were obtained considering the statistical quality relations of multiple linear regression (MLR), such as correlation coefficient (R), correlation coefficient squared (R<sup>2</sup>), explained variance (adjusted R<sup>2</sup>), standard error of estimate (SEE), and variance ratio (F), and they were transformed into parametric models for predicting the inhibitory activity as pIC<sub>50</sub> values. The combinations were obtained using four parameters indicated by Pearson's correlation without repetition [12,59], according to Equation (2), where C = number of combinations, p = model type (p ≠ 0 and p = 4), and n = number of variables (n = 4).

$$C_{p}^{n} = \frac{n!}{p!(n-p)!} \tag{2}$$
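Equation (2) is the standard binomial coefficient, so it can be checked with a short sketch. The descriptor labels below are placeholders chosen for illustration, and the helper name `count_models` is hypothetical.

```python
from itertools import combinations
from math import comb


def count_models(n, p):
    """Number of p-descriptor models that can be built from n descriptors."""
    return comb(n, p)


# Placeholder labels for four descriptors (illustrative only).
descriptors = ["D1", "D2", "D3", "D4"]

# With n = 4 and p = 4 there is a single combination without repetition.
four_parameter_models = list(combinations(descriptors, 4))

# For comparison, the number of models over every non-empty model size.
all_sizes = sum(count_models(len(descriptors), p) for p in range(1, 5))
```

With n = 4 and p = 4 the enumeration contains exactly one model, which agrees with Equation (2); allowing every model size from one to four descriptors would give 15 candidate models.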

For the prediction of the best model, in the internal validation stage, the random correlations between the descriptors and the inhibitory activity were measured to normalize the data obtained, applying the technique of detecting anomalous samples (outliers), in order to obtain a homogeneous set. This subset is considered as internal validation, for analysis of the prediction capacity of the selected model, comparing the data obtained during the two validations (internal and external; Figure S8). Principal component analysis (PCA) together with hierarchical cluster analysis (HCA) were applied in order to verify whether the model obtained corresponds to the degree of similarity, using Pearson's squared distance as a measurement parameter in the latter [60,61]. For the respective analyses, the Minitab v. 19® trial version was used.

#### *3.4. Virtual Screening and Selection of Inhibitor Compounds*

After selecting the best model via QSAR analysis, the selected molecules were superimposed to form a pharmacophore model. After inputting the pharmacophore, the search was performed within the ZINC database, selecting the 2000 most similar molecules. The following properties were used as filters: partition coefficient (log *P*), topological polar surface area (TPSA), number of atoms (Natoms), molar mass (MW), hydrogen acceptors (nHA), hydrogen donors (nHD), number of violations (Nv), number of rotatable bonds (Nrotb), and volume; their values were determined via the Protox II (http://tox.charite.de/protox\_II/) and Molinspiration (https://www.molinspiration.com/cgi-bin/properties) servers. The RMSD value (Equation (3)) was used as a reference parameter; it measures the average distance between the atoms of the overlapping inhibitors, given in Angstroms, and represents the quantitative similarity relationship between them. The lower the RMSD value, the better the model is compared to the target structure. In Equation (3), δ<sub>i</sub> is the distance between atom *i* and either a reference structure or the mean position of the N equivalent atoms.

$$\text{RMSD} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \delta_i^2} \tag{3}$$
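Equation (3) can be illustrated with the minimal sketch below. It assumes that the two structures list the same atoms in the same order and that they have already been superimposed, so no fitting step is performed; the function name `rmsd` and the coordinates are invented for the example.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD (Equation (3)) between two equally ordered, pre-aligned atom lists."""
    if len(coords_a) != len(coords_b):
        raise ValueError("both structures must contain the same number of atoms")
    # delta_i is the distance between equivalent atoms i of the two structures.
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Invented coordinates (in Angstroms): the second set is shifted by 1 along z.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
value = rmsd(reference, shifted)
```

Because every atom in the toy example is displaced by exactly 1 Å, the RMSD is 1 Å; two identical structures would give zero.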

Then, the Tanimoto test was performed via the BindingDB server. The similarity was determined according to the chemotype of the compounds screened with the pivotal molecule of the selection process to reduce and optimize the selection of compounds for determining pharmacokinetic and toxicological characteristics, using a cut-off index of 0.3, applying Equation (4) below [22].

$$J = \frac{M_{11}}{M_{01} + M_{10} + M_{11}} \tag{4}$$

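where M<sub>11</sub> is the total number of attributes for which both A and B have a value of 1; M<sub>01</sub> is the total number of attributes for which A is 0 and B is 1; and M<sub>10</sub> is the total number of attributes for which A is 1 and B is 0. Attributes for which both A and B have a value of 0 (M<sub>00</sub>) do not enter the index.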
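A minimal sketch of Equation (4) on binary fingerprints is given below. The bit vectors are invented, the function name `tanimoto` is hypothetical, and treating the 0.3 cut-off as "keep compounds with an index of at least 0.3" is an assumption made for the example; the BindingDB server uses its own fingerprint definition.

```python
def tanimoto(a_bits, b_bits):
    """Tanimoto (Jaccard) index of two equal-length binary fingerprints."""
    if len(a_bits) != len(b_bits):
        raise ValueError("fingerprints must have the same length")
    m11 = sum(1 for a, b in zip(a_bits, b_bits) if a == 1 and b == 1)
    m10 = sum(1 for a, b in zip(a_bits, b_bits) if a == 1 and b == 0)
    m01 = sum(1 for a, b in zip(a_bits, b_bits) if a == 0 and b == 1)
    denominator = m01 + m10 + m11
    # Two all-zero fingerprints share no set bits; return 0.0 by convention.
    return m11 / denominator if denominator else 0.0


CUTOFF = 0.3  # cut-off index quoted in the text

# Invented fingerprints for a pivot molecule and a screened candidate.
pivot = [1, 1, 0, 1, 0]
candidate = [1, 0, 0, 1, 1]
index = tanimoto(pivot, candidate)
passes = index >= CUTOFF
```

Here the two fingerprints share two set bits and differ in two positions, giving an index of 2/4 = 0.5, which would pass the 0.3 cut-off under the stated assumption.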

#### *3.5. Prediction of Toxicological and Pharmacokinetic Properties*

Pharmacokinetic and toxicological studies were applied to inhibitors extracted from Pharmit via the ZincPharmer server. PreADMET v. 2.0 (https://preadmet.bmdrc.kr/), a web-based application for the prediction of ADME data (Absorption, Distribution, Metabolism, Excretion), was used with the following properties selected: blood–brain barrier (BBB) penetration, in vitro permeability in Caco2 cells, human intestinal absorption (HIA), in vitro permeability in MDCK cells, in vitro P-glycoprotein inhibition, and plasma protein binding (PPB), together with the toxicological predictions Ames\_Test and carcinogenicity for rats and mice. The LD<sub>50</sub> values and the toxicity classes were determined via the Protox II server (http://tox.charite.de/protox\_II/).

#### *3.6. Prediction of Biological Activity of Selected Inhibitors*

Activity predictions were made using the online PASS server (http://www.pharmaexpert.ru/passonline), which predicts the biological effects of compounds from their structural formula using multilevel neighborhoods of atoms (MNA) descriptors, on the premise that a compound's activity is a function of its chemical structure. Molecules with activities reported for anti-inflammatory and cyclooxygenase inhibitor effects were selected [25,61].

#### *3.7. Molecular Docking*

For this step, only the molecules with the best pharmacokinetic, toxicological, and biological parameters were selected for the molecular docking study, in order to evaluate the interactions between the selected inhibitors and their respective targets through the measurement of the free energy of interaction with amino acid residues and the binding affinity. The crystallographic structures were retrieved from the Protein Data Bank web server (PDB; https://www.rcsb.org/): *Homo sapiens* COX-2 complexed with the inhibitor rofecoxib (PDB code 5KIR, resolution 2.697 Å), *Mus musculus* COX-2 complexed with celecoxib (PDB code 3LN1, resolution 2.40 Å), and *Ovis aries* COX-1 complexed with indomethacin (PDB code 2OYE, resolution 2.85 Å); all structures were elucidated through X-ray diffraction analyses.

Docking Study with AutoDock 4.2/Vina 1.1.2 via Graphical Interface PyRx (Version 0.8.30)

The selected inhibitors and proteins were prepared with the aid of Discovery Studio® 4.0 software, and the complexes with the ligands were evaluated using the AutoDock 4.2/Vina 1.1.2 software through the PyRx graphical interface version 0.8.30 (https://pyrx.sourceforge.io), with the software's default exhaustiveness parameter; the best conformation was identified through analysis of the RMSD value. The validation protocol was based on the determination of the x, y, and z coordinates according to the average region of the active site; these values are given in Table 16. The energy function score was used to evaluate the free binding energy (∆G) of the interaction of the receptors with the ligands [25].


**Table 16.** Protocol data used in the validation of molecular docking.

The binding affinity (∆G) was also calculated from experimental data in order to compare the experimental values with those predicted *in silico*, following the methodology adopted by Santos et al., 2020 [14], according to Equation (5).

$$
\Delta G = -RT \ln K_{i} \tag{5}
$$

where R (gas constant) is 1.987 × 10<sup>−3</sup> kcal·mol<sup>−1</sup>·K<sup>−1</sup>, the temperature is 310 K for rofecoxib/celecoxib, and K<sub>i</sub> is 310 × 10<sup>−9</sup> M for rofecoxib and 340 × 10<sup>−9</sup> M for celecoxib [28,32,52].
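The magnitude of the experimental binding energy can be reproduced with the short sketch below, which uses the constants quoted above. One assumption is made explicit: K<sub>i</sub> is treated as a dissociation-type constant, so the function evaluates RT ln K<sub>i</sub> and favorable binding comes out negative; inserting K<sub>i</sub> directly into the expression as printed in Equation (5) yields the same magnitude with the opposite sign. The function name is hypothetical.

```python
import math

R_KCAL = 1.987e-3  # gas constant, kcal mol^-1 K^-1 (value quoted in the text)


def binding_free_energy(ki_molar, temperature_k=310.0):
    """Binding free energy in kcal/mol from an inhibition constant in mol/L.

    Assumption: Ki is treated as a dissociation-type constant, so the
    result is R*T*ln(Ki), which is negative for sub-molar Ki values.
    """
    return R_KCAL * temperature_k * math.log(ki_molar)


dg_rofecoxib = binding_free_energy(310e-9)  # Ki quoted for rofecoxib
dg_celecoxib = binding_free_energy(340e-9)  # Ki quoted for celecoxib
```

With the quoted constants, both reference inhibitors come out at roughly −9.2 kcal/mol, which is the scale against which docking scores can be compared.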

#### *3.8. Molecular Dynamics Protocol*

The initial structure for the system was obtained from molecular docking methods. The restrained electrostatic potential (RESP) protocol with the HF/6-31G\* basis sets was applied to obtain the partial atomic charges of the atoms of each ligand [62–65]. The parameters of the ligand were constructed with the Antechamber module [66] using General Amber Force Field (GAFF) [67].

The amino acid protonation state was characterized using the PDB2PQR server [68]. The systems were built with the tLEaP module of the Amber 16 package [69–71]. The force field used to describe the protein in all simulations was ff14SB [72]. The protein–ligand system was solvated in an octahedron periodic box containing water molecules in the TIP3P model [73]. The partial charges were neutralized by adding counter-ions.

Energy minimization occurred in four stages. First, the water molecules and ions were optimized using 2000 cycles of the steepest descent and 3000 cycles of conjugate gradient. Then, the position of receptor-ligand hydrogen atoms was optimized using 4000 steps of the steepest descent algorithm and 3000 steps of the conjugate gradient. At the third stage, hydrogen atoms, water molecules, and ions were further optimized using 2500 steps of the steepest descent algorithm and 3500 steps of the conjugate gradient. Finally, all atoms were minimized using 3000 steps of the steepest descent algorithm and three steps of the conjugate gradient.

The systems were heated up to 298 K at constant volume. This heating was performed in five steps over a total duration of 1 ns. After heating, 100 ns production runs were performed for each system.

The Particle Mesh Ewald method [74] was used for the calculation of the electrostatic interactions, and the bonds involving hydrogen atoms were constrained with the SHAKE algorithm, which keeps the distance between mass points fixed [75]. The temperature control was performed with the Langevin thermostat [76] with a collision frequency of 2 ps<sup>−1</sup>.

#### 3.8.1. Affinity Energy Calculations

To estimate the binding affinity (∆G<sub>bind</sub>), we used the molecular mechanics/generalized Born surface area (MM-GBSA) method [77–80]. The ∆G<sub>bind</sub> was calculated according to the following equations:

$$
\Delta G_{\text{bind}} = \Delta G_{\text{complex}} - \Delta G_{\text{receptor}} - \Delta G_{\text{ligand}} \tag{6}
$$

$$
\Delta G_{\text{bind}} = \Delta H - T \Delta S \approx \Delta E_{\text{MM}} + \Delta G_{\text{solv}} - T \Delta S \tag{7}
$$

$$
\Delta E_{\text{MM}} = \Delta E_{\text{internal}} + \Delta E_{\text{ele}} + \Delta E_{\text{vdW}} \tag{8}
$$

$$
\Delta G_{\text{solv}} = \Delta G_{\text{GB}} + \Delta G_{\text{NP}} \tag{9}
$$

The binding free energy (∆G<sub>bind</sub>) is the sum of the gas-phase interaction energy between protein and ligand (∆E<sub>MM</sub>), the desolvation free energy (∆G<sub>solv</sub>), and the system entropy term (−T∆S). ∆E<sub>MM</sub> is the sum of the internal energy (∆E<sub>internal</sub>, the sum of the bond, angle, and dihedral energies), the electrostatic contribution (∆E<sub>ele</sub>), and the van der Waals term (∆E<sub>vdW</sub>). ∆G<sub>solv</sub> is the sum of the polar (∆G<sub>GB</sub>) and non-polar (∆G<sub>NP</sub>) contributions. ∆G<sub>NP</sub> was determined from the solvent accessible surface area (SASA) estimated by the linear combination of pairwise overlaps (LCPO) algorithm.
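The bookkeeping of Equations (7)–(9) can be sketched as follows. This is not the Amber MM-GBSA implementation; it only shows how the individual terms add up. The class name, attribute names, and all numerical values are hypothetical placeholders and do not correspond to any result reported in this work.

```python
from dataclasses import dataclass


@dataclass
class MMGBSATerms:
    """Energy differences (kcal/mol) entering Equations (7)-(9)."""

    e_internal: float  # bond + angle + dihedral contribution
    e_ele: float       # electrostatic contribution
    e_vdw: float       # van der Waals contribution
    g_gb: float        # polar solvation term (generalized Born)
    g_np: float        # non-polar solvation term (SASA-based)
    minus_t_ds: float = 0.0  # entropy term, -T*dS (zero if neglected)

    def e_mm(self):
        # Equation (8): gas-phase molecular mechanics energy.
        return self.e_internal + self.e_ele + self.e_vdw

    def g_solv(self):
        # Equation (9): polar plus non-polar solvation terms.
        return self.g_gb + self.g_np

    def g_bind(self):
        # Equation (7): total binding free energy estimate.
        return self.e_mm() + self.g_solv() + self.minus_t_ds


# Placeholder numbers, invented purely to exercise the arithmetic.
example = MMGBSATerms(e_internal=0.0, e_ele=-20.0, e_vdw=-40.0,
                      g_gb=25.0, g_np=-5.0)
total = example.g_bind()
```

In the placeholder example the favorable gas-phase terms (−60 kcal/mol) are partly offset by the desolvation penalty (+20 kcal/mol), giving −40 kcal/mol; this compensation between ∆E<sub>MM</sub> and ∆G<sub>solv</sub> is typical of how the terms in Equation (7) combine.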

#### 3.8.2. Per-Residue Free Energy Decomposition Analysis

Per-residue free energy decomposition was performed using the MM/GBSA approach according to the following equation [14,81,82]:

$$
\Delta G_{\text{MM-GBSA}} = \Delta E_{\text{vdW}} + \Delta E_{\text{elec}} + \Delta E_{\text{pol}} + \Delta E_{\text{np}} \tag{10}
$$

#### **4. Conclusions**

After the pharmacophore-based virtual screening, the QSAR analysis demonstrated a good linear fit with R<sup>2</sup> = 0.96 and an equation with four main prediction parameters for pIC<sub>50</sub> (ATM, ARO, ACC, and DON), where ARO, ACC, and DON report the relationship between the three new and promising compounds selected and the pivot structure (rofecoxib). The multiple linear regression model predicted the following pIC<sub>50</sub> values for the selected compounds: **Z-814** = 7.9484, **Z-627** = 9.3458, and **Z-964** = 9.5272. Database searches for applications that may already have been reported (https://scifinder.cas.org/ and https://zinc.docking.org/) indicated that these substances are not used in specific biological activities.

The analyses of toxicological prediction and bioavailability confirm the possibility of significant activity of the structures with a reduction of possible undesirable effects. **Z-627** was considered the most promising in view of all the tests applied via ADME analysis, without consequences for the CNS; this was corroborated with the main compounds selected. All selected compounds have the methyl sulfone group, unlike coxibs, which have the sulfonamide group. These three molecules do not present toxicological risks and comply with the Lipinski rule of five, which predicts good oral availability. PASS predicts a specific activity with a high probability of promising anti-inflammatory action, in addition to diminished side effects in relation to the selected control compounds. Molecular docking demonstrates strong energy affinity with isoform 2 and low activity with isoform 1, which suggests the possibility of minor side effects. Finally, assays in zebrafish larvae should be performed to assess anti-inflammatory activity in the treatment of inflammatory disorders and to confirm the in silico results.

**Supplementary Materials:** The following are available online, Table S1: Energy values of the optimized molecules, Table S2: Filters applied according to the properties of the selected molecules, Table S3: Distance of Interactions for the structures for the PDB 5KIR, Table S4: Distance of Interactions for the structures for the PDB 3LN1, Table S5: Distance of Interactions for the structures for the PDB 2OYE, Figure S1: Dendrogram representing clustering of pharmacophores, Figure S2: Analysis of the main components for the sorted molecules. Scores (a) and Loading Graph (b), Figure S3: Dendrogram of selected molecules. More active (blue) and less active ones (red), Figure S4: Binding affinity results of compounds, including Vioxx bound (COX-2 – Homo sapiens), Figure S5: Binding affinity results of compounds, including celecoxib (COX-2 Mus musculus), Figure S6: Binding affinity results of compounds, including Indomethacin (COX-1 Ovis aries), Figure S7: Structures used in the molecular modeling, Figure S8: Structures used in the external validation set, Figure S9: Theoretical synthetic route for the preparation of compound A (Z-814), Figure S10: Theoretical synthetic route for the preparation of compound B (Z-964), Figure S11: Theoretical synthetic route for the preparation of compound C (Z-627).

**Author Contributions:** Conceptualization, P.H.F.A., W.J.C.M. and C.B.R.S.; methodology, P.H.F.A. and C.B.R.S.; software, R.S.R. and E.F.B.F.; validation, P.H.F.A., S.G.S., L.R.d.L., J.M.E.-R. and C.B.R.S.; formal analysis, P.H.F.A., R.S.R., J.N.d.C., J.M.C. and C.B.R.S.; investigation, P.H.F.A., R.S.R. and C.B.R.S.; resources, P.H.F.A., W.J.C.M., R.S.R. and C.B.R.S.; data curation, P.H.F.A., R.S.R. and C.B.R.S.; writing—original draft preparation, P.H.F.A. and C.B.R.S.; writing—review and editing, J.M.C. and J.N.d.C.; visualization, P.H.F.A.; supervision, C.B.R.S.; project administration, C.B.R.S.; funding acquisition, J.M.C., C.B.R.S., P.H.F.A., W.J.C.M. and E.F.B.F. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** To MEC/CAPES for the granting of development grants; UNIFAP/UEAP for financial assistance.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

AA—arachidonic acid, ACC—acceptors, ADME—absorption, distribution, metabolism and excretion, ARO—aromatic, ATM—atoms, Cbrain/Cblood—permeability of the brain barrier, CEL—celecoxib, COX-1—cyclooxygenase 1, COX-2—cyclooxygenase 2, DONN—donors, HCA—hierarchical cluster analysis, HIA—human intestinal absorption, IC<sub>50</sub>—inhibitory concentration, MDPI—Multidisciplinary Digital Publishing Institute, DOAJ—directory of open access journals, MilogP—partition coefficient, MW—molar weight, Natoms—number of atoms, nHA—number of hydrogen acceptors, nHD—number of hydrogen donors, Nrotb—number of rotatable bonds, Nv—number of violations, p—Pearson value, PCA—principal components analysis, Pcaco2—cell permeability, PDB—Protein Data Bank, P-gp inhibition—*in vitro* P-glycoprotein inhibition, PMDCK—cell permeability in Madin-Darby canine kidney cells, PPB—plasma protein binding, QSAR—quantitative structure–activity relationships, RCX—rofecoxib, RMSD—root mean square deviation, Ti—Tanimoto Index, TPSA—topological polar surface area, Z-627—ZINC 170592627, Z-814—ZINC 33332814, Z-964—ZINC 225723964.

#### **References**


**Sample Availability:** Samples of the compounds are not available from the authors.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
