*Article* **QSAR, ADMET In Silico Pharmacokinetics, Molecular Docking and Molecular Dynamics Studies of Novel Bicyclo (Aryl Methyl) Benzamides as Potent GlyT1 Inhibitors for the Treatment of Schizophrenia**

**Mohamed El fadili 1, \* , Mohammed Er-Rajy 1 , Mohammed Kara 2, \* , Amine Assouguem <sup>3</sup> , Assia Belhassan 4 , Amal Alotaibi 5 , Nidal Naceiri Mrabti 1 , Hafize Fidan 6 , Riaz Ullah 7 , Sezai Ercisli 8 , Sara Zarougui 1 and Menana Elhallaoui 1**


**Abstract:** Forty-four bicyclo ((aryl) methyl) benzamides, acting as glycine transporter type 1 (GlyT1) inhibitors, are developed using molecular modeling techniques. QSAR models generated by multiple linear and non-linear regressions affirm that the biological inhibitory activity against the schizophrenia disease is strongly and significantly correlated with physicochemical, geometrical and topological descriptors, in particular: Hydrogen bond donor, polarizability, surface tension, stretch and torsion energies and topological diameter. According to in silico ADMET properties, the most active ligands (L6, L9, L30, L31 and L37) are the molecules having the highest probability of penetrating the central nervous system (CNS), but the molecule 32 has the highest probability of being absorbed by the gastrointestinal tract. Molecular docking results indicate that Tyr124, Phe43, Phe325, Asp46, Phe319 and Val120 amino acids are the active sites of the dopamine transporter (DAT) membrane protein, in which the most active ligands can inhibit the glycine transporter type 1 (GlyT1). The results of molecular dynamics (MD) simulation revealed that all five inhibitors remained stable in the active sites of the DAT protein during 100 ns, demonstrating their promising role as candidate drugs for the treatment of schizophrenia.

**Keywords:** GlyT1; QSAR; schizophrenia; ADMET; molecular docking; DAT; MD

#### **1. Introduction**

About 1% of the worldwide population is affected by schizophrenia as a serious neuropsychiatric disease [1]. Despite the current regimens with favorable levels of efficacy and the great advancement in the treatment of schizophrenia, no antipsychotic medication can

**Citation:** El fadili, M.; Er-Rajy, M.; Kara, M.; Assouguem, A.; Belhassan, A.; Alotaibi, A.; Mrabti, N.N.; Fidan, H.; Ullah, R.; Ercisli, S.; et al. QSAR, ADMET In Silico Pharmacokinetics, Molecular Docking and Molecular Dynamics Studies of Novel Bicyclo (Aryl Methyl) Benzamides as Potent GlyT1 Inhibitors for the Treatment of Schizophrenia. *Pharmaceuticals* **2022**, *15*, 670. https://doi.org/10.3390/ ph15060670

Academic Editor: Osvaldo Andrade Santos-Filho

Received: 5 May 2022 Accepted: 21 May 2022 Published: 27 May 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

completely treat the cognitive dysfunction associated with this disorder, because its present treatments are accompanied by undesirable secondary effects. Therefore, the discovery of more clinically effective antipsychotic drugs are still necessary [2]. For this goal, the glycine transporter type 1 (GlyT1) inhibitors approved by the Food and Drug Administration (FDA) are a key therapeutic development strategy to treat a variety of central nervous system (CNS) disorders, in particular schizophrenia and cognitive disorders [3,4]. In this regard, type 1 glycine transporters regulate N-methyl-D-Aspartate (NMDA) receptor function via modulation of glycine concentration at the glutamatergic synapses, but their deficiency may affect the higher central nervous system functions [5,6]. In this paper, a systematic in silico study was performed on 44 GlyT1 inhibitors, which were tested in a locomotor activity assay (LMA) of the MK801 mouse to model the treatment of positive and negative symptoms of schizophrenia [4], by means of the following molecular modeling techniques: first of all, the quantitative structure activity relationships (QSAR) as a technology widely used in drug discovery, indicating ligands with a high affinity for a given macromolecular target and optimizing the quantitative linear and non-linear relationship established between structure and inhibitory activity [7,8]; secondly, in silico ADMET prediction of newly engineered drugs [9]; and third, the molecular docking study as an approach designed in computational chemistry to accelerate drug discovery at the early stages through the detection of typical intermolecular interactions, established between the potent ligands and the responsible protein target [10]. The last step concerns the molecular dynamics (MD) simulation as an efficient technique to investigate the dynamic conformational changes of the selected complexes (active ligands-protein target) [11,12]. In this context, we started our study with a molecular descriptors calculation for each GlyT1 inhibitor, using a quantum chemistry computation with the assistance of the molecular modeling method of MM2 type and the density functional theory (DFT) based on B3LYP/6-31 + G(d,p) level, in order to optimize the molecular configurations of all inhibitors [13]. Then, we reduced the dimension of the molecular descriptors using a principal component analysis (PCA) based on the correlation matrix. Next, two QSAR models were developed using multiple linear regression (MLR) and multiple nonlinear regression (MNLR). The robustness and reliability of the established QSAR models were examined using the external validation technique, followed by Y-randomization test, an applicability domain and a cross-validation technique with the Leave-One-Out process, as one of the decisive steps to assess the confidence of the developed model's predictions for a new data set [14,15]. Moreover, we predicted the molecules having the highest inhibitory activity, based on their adsorption, distribution, metabolism, excretion and toxicity (ADMET) properties and the conditions mentioned in the rules of Lipinski, Ghose, Veber and Egan [16]. Additionally, we studied the intermolecular interactions established between the more active ligands and the dopamine transporter (DAT) membrane protein, encoded 4M48 as a crucial target for schizophrenia, with the assistance of the molecular docking approach [2,17], which was validated using docking validation protocol [18]. Lastly, we performed the molecular dynamics simulations to analyze and elaborate the details of interaction and stability of the potent ligands in the protein targets [19].

#### **2. Results and Discussion**

#### *2.1. Pricipal Component Analysis*

Principal component analysis (PCA) is one of the most widely applied multivariate techniques. It is used to reduce the size of the variables into a limited number of principal components (linear combinations of the original variables) [20]. In this paper, we calculated 40 different descriptors, which were later reduced to 27 descriptors based on the correlation matrix, since descriptors that are strongly correlated with each other (*r pearson* > 0.9) were removed. From this reduced number of variables, we were able to visualize the projection of the new database on the first two principal components (factorial axes), as shown in Figure 1, which clearly indicates that molecules 1 and 32 are poorly explained. Consequently, they are considered as outliers.

**Figure 1.** Data visualization on the first two principal components.

#### *2.2. Statistical Database*

Although observations 1 and 32 are considered as outliers, the new database will be represented by a matrix of 27 descriptors and 42 molecules. Using the k-means method, we randomly divided the database into training and test sets. The first one includes 80% of the total data (35 molecules) and was taken to develop the QSAR models, while the second one contains 20% of the total data (7 molecules) and was used to assess the validity of the developed models [21].

#### *2.3. Multiple Linear Regression*

 The Quantitative structure-activity relationships (QSAR) have the potential to reduce the time and effort of molecular screening using mathematical predictive models [22]. One of these models is obtained by the multiple linear regression (MLR) technique, as a statistical tool for estimating the linear relationship between more than two variables which have cause-effect relations [23]. Thus, the first QSAR model was applied using the MLR technique with stepwise selection, on a training set of thirty-five molecules (*N* = 35), where the process was repeated more than a thousand times based on statistical criteria: in particular, the determination and correlation coefficients, provided that they will be validated in the next stage. Accordingly, the best QSAR model is given by the following equation:

− − α ɣ Log10IC<sup>50</sup> = −10,407−0.279 × *αe* + 0.069 × *γ* + 0.156 × TE + 1.83 × HBD + 1.716 × SE + 1.029 × TD. (1)

> α ɣ This constructed model shows that the biological activity at the log scale is a quantitative variable affected by the following six descriptors: polarizability (αe), surface tension (*γ*), torsion energy (TE), Hydrogen bond donor (HBD), stretch energy (SE) and topological diameter (TD), which have been calculated and presented in Table 1. Moreover, the significance test demonstrates that the slope of each variable has a probability inferior to 5% as shown in Table 2, and so the selected descriptors have a significant weight on the biological inhibitory activity at a 95% confidence interval. Except the polarizability, all five molecular descriptors affect positively the biological activity as shown in Figure 2, where a molecule can be more active if it is less polar and has higher values of surface tension, torsion and stretch energies, hydrogen bond donor and topological diameter.


**Table 1.** The values of selected descriptors for 44 molecules.

\* indicates test set molecules.

**Table 2.** Significance test of the slopes.


ɣ

− − − −

*α* − − − −

Additionally, the null hypothesis (H0) postulated by the Fisher statistical test is rejected, because the calculated Fisher value (*F* = 10.325) is so much higher than its critical value: [*F* (35,6) = 2.37, *p* < 0.0001], as presented in the one Anova test (Table 3). Therefore, the variance between the response (Log10IC50) and the six predictor variables is homogeneous. Moreover, the correlation and determination coefficients of *R* = 0.83 and *R <sup>2</sup>* = 0.69, respectively, confirm that there is a strong relationship between the descriptors and the inhibitory activity. Thus, the first QSAR model generated via MLR technique has a good predictive performance, with a low standard error (*RMSE* = 0.66).


**Table 3.** Variance analysis.

#### *2.4. Multiple Non-Linear Regression*

The multiple non-linear regression (MNLR) technique is applied using a set of adapted algorithms to generate the quantitative predictive models [24]. In the present study, we relied on the programmed function of the type:

$$\mathbf{Y} = \mathbf{a}\mathbf{0} + \sum\_{i=1}^{n} \left( \mathbf{ai} \times \mathbf{X} \mathbf{i} + \mathbf{bi} \times \mathbf{X} \mathbf{i}^2 \right) \tag{2}$$

As:

Y: is the predicted biological activity (Log10IC50) Xi: is the explicative variable a0: is the constant of the QSAR model ai and bi: are the slopes of each descriptor to one and two degrees, respectively. Finally, we arrived at the second QSAR model given by the following equation:

$$\begin{array}{c} \text{LogoIC}\_{50} = -19.699 - 0.009 \times ac - 0.056 \times \gamma - 0.161 \times \text{TE} + 1.466 \times \text{HBD} + 0.5 \times \text{SE} + 2.693 \times \text{TD} \\ \quad - 0.002 \times \text{ae}^2 + 0.001 \times \gamma^2 + 0.013 \times \text{TE}^2 + 0.148 \times \text{SE}^2 - 0.07 \times \text{TD}^2. \end{array} \tag{3}$$

This mathematical model has a good predictive capacity, justified by a strong nonlinear relationship between the biological activity and the six descriptors, as it is defined by a good correlation coefficient (*R* = 0.84) and a good coefficient of determination (*R <sup>2</sup>* = 0.71), in addition to its minimal mean square error (*RMSE* = 0.72).

Y = a0 + (ai × Xi + bi × Xi<sup>ଶ</sup>

 ୀଵ

)

#### *2.5. QSAR Model Validation*

− α ɣ − − α ɣ

#### 2.5.1. Applicability Domain

The applicability domain (AD) of a quantitative structure-activity relationship (QSAR) model is necessary to verify its reliability on new compounds (test set) that were not considered during its development [25]. This technique has been evaluated by an analysis expressed as a Williams diagram (Figure 3), which confirms that the molecules (1 and 32) belonging to the test set are really outliers, because they exceed the warning leverage (*h\** = *0.6*), where: *h\* = 3* × *K/n and K = p + 1, (p = 6, K = 7, n = 35)* as, *n*: is the number of training set, and *p*: is the number of predictor descriptors [26,27]. Next, we noted that the compound 33 from training test is not an outlier because it does not exceed the critical leverage *(h\*)*. Therefore, except for molecules (1 and 32), all the others are well explained because they have in addition a normalized residual included in the 3 times standard deviation interval. Consequently, the 42 remaining molecules are tested in the applicability domain and the QSAR model was predicted correctly.

**Figure 3.** William's diagram of the MLR model established by Equation (1). 1 and 32 are outliers in the test set and 33 is a non-aberrant molecule in the training test.

#### 2.5.2. External Validation

To assess the accuracy of the QSAR predictive model and guarantee its generalizability, it is absolutely needed to validate it on new molecules included in the test set, before its application in clinical practice [28]. Based on a training test (35 molecules), we tested the seven new molecules from the test set and got the results presented in Table 4.


**Table 4.** External validation results of the MLR and MNLR models.

\* Indicates test set molecules.

The results mentioned in Figure 4 indicate that the MLR QSAR model is given by an external validation correlation coefficient (*R 2 ext* = 0.63), and the results noted in Figure 5 indicate that the MNLR QSAR model is characterized by an external validation correlation coefficient of *R 2 ext* = 0.68. According to the Alexander Golbraikh and Alexander Tropsha theory, a QSAR model is externally validated if the correlation coefficient of its external validation is greater than 0.6. Therefore, the mathematical models developed with the help of MLR and MNLR techniques are externally validated.

− − − − − −

− − − − − −

**Figure 4.** Correlation between the observed and predicted activities using MLR technique.

**Figure 5.** Correlation between the observed and predicted activities using MNLR technique.

#### 2.5.3. Internal Validation

To validate internally the QSAR model, we applied the cross-validation technique with the leave-one-out procedure (CVLOO), so that each observation is tested exactly once, by executing a new model each run on thirty-four compounds (*N*-1 = 34) and predicting the biological activity of the removed sample, as shown in Table 5. This technique is based on the calculation of the quadratic coefficient of cross validation (*Q<sup>2</sup> cv*), which is expressed in the following equation [29,30]: *Q*<sup>2</sup> *cv* = 1 − ∑ *n i* (*Ypred*−*Yobs*) 2 ∑ *n i* (*Yobs*−*Ymean*) 2 (4) AS: *Ypred*: is the predicted activity value, *Yobs*: is the observed activity value, *Ymean*: is the mean of the observed activity values. A high value of *Q<sup>2</sup> cv* = 0.57 (superior than 0.5) signifies that the established model is reliable, robust and has better internal predictivity.


**Table 5.** Observed and predicted activity values from the QSAR models.

#### 2.5.4. Validation Using Y-Randomisation Test

The statistical study of Alexander Golbraikh and Alexander Tropsha confirms that the cross-validation technique is necessary but not sufficient, as the internal predictive accuracy of the cross-validation procedure tends to be overestimated and the high value of the quadratic coefficient may be the result of chance correlation. For this reason, the Y-randomisation test is necessary [31]. Using java Platform SE binary, we tested the QSAR model quality by running one hundred randomizations, as presented in Table 6. The results of the Y-randomisation test demonstrate that the (*cR2p* = 0.602) criteria is superior than 0.5; moreover, the *R*, *R <sup>2</sup>* and *R 2 cv* values of the original model are much better than the values obtained by 100 randomizations. Consequently, the biological activity values predicted by the original model are not due to chance.

**Table 6.** Y-randomization test results.




2.5.5. Golbreikh and Tropsha Criteria

The quantitative structure-activity relationship (QSAR) model, defined by the first Equation (1), satisfies the threshold criteria postulated by Golbraikh and Tropsha theory, as shown in Table 7.



**Table 7.** *Cont.*

*Yobs* and *Ycalc*: refer to the observed and calculated/predicted response values. *Yobs* and *Ycal* refer to the mean of the observed and calculated/predicted response values. *N* and *p* refer to the number of data points (compounds) and descriptors.

#### *2.6. In Silico Pharmacokinetics ADMET Prediction*

The most active ligands (L6, L9, L30, L31, L32 and L37), acting as inhibitors of type 1 glycine transporters, were tested based on the rules of Lipinski, Veber, Egan, and Ghose, and the pharmacokinetic properties (ADMET) [32], which were compared to the obtained results for nortriptyline as a co-crystallized ligand bound to the dopamine transporter (DAT) membrane protein encoded 4M48. The results presented in Table 8 indicate that all molecules respect the rules of Lipinski, Veber, Egan and Ghose except the ligand 32, because its molar refractivity index exceeds 130 and its Ghose violation number is equal to 2 (exceed 1). Additionally, the exact predictive model (BOILED-Egg), highly practical in the context of drug discovery and medicinal chemistry, and based on the calculation of lipophilicity given by the logarithm of the partition coefficient between n-octanol and water (Log PO/W) and polarity signaled by the topological polar surface area (TPSA) of small molecules, clearly shows that the molecule (L32) is the only one that does not belong to the yellow Egan-egg, as presented in Figure 6. Therefore, the five ligands (L6, L9, L30, L31 and L37) are the molecules having the highest probability to penetrate the brain. In comparison, the molecule 32 belonging to the white region of the egg has the highest probability of being absorbed by the gastrointestinal tract [33], which is why it was an outlier in the previous QSAR study.

**Table 8.** Prediction of the physicochemical properties of nortriptyline and more active ligands, based on Lipinski, Veber, Egan and Ghose violations.


The pharmacokinetic parameters of adsorption, distribution, metabolism, excretion and toxicity (ADMET) of the most active ligands as presented in Table 9 indicate that the ligands have a good absorption in the human intestine (IAH so higher than 70%), and a good distribution, since their human distribution volumes are estimated to be greater than −0.44 Log L/kg. Their permeability to the blood-brain barrier (BBB) is greater than −1 Log BB, and their permeability to the central nervous system (CNS) outside the interval (of −2 to −3) Log PS. Thus, they all penetrate the central nervous system (CNS) with the exception of ligands L32 and L30. In addition, the molecules are all predicted as inhibitors of cytochrome 2D6 except ligand 32. Consequently, the ligands (L6, L9, L30, L31 and L37) are designed to be agents of the central nervous system due to the highest probability of penetrating the blood-brain barrier (BBB).

**Figure 6.** BOILED-egg predictive model of the most active ligands.

− −

−

−

**Table 9.** Prediction of ADMET pharmacokinetic properties of nortriptyline and more active ligands.


#### *2.7. Molecular Docking*

Molecular docking results are focused on the dopamine transporter (DAT) bound to the tricyclic antidepressant nortriptyline, as a transmembrane protein that removes the neurotransmitter dopamine from the synaptic cleft and transports it into the cytosol of surrounding cells. The crystal structure of this receptor is extracted using the X-ray diffraction method at a resolution of 2.96 Å taken from the protein data base (PDB) [34–36]. In this part of the research, the molecular docking process is started for the following most active molecules (L6, L9, L30, L31 and L37) to predict the type of Intermolecular interactions established with the protein encoded 4M48, compared to the established interactions with the co-crystallized ligand (nortriptyline) pictured in Figure 7, which indicate that Phe43A, Phe325A and Tyr124A amino acids, are the active sites of the target protein, as sourced using the ProteinsPlus online server [37].

The results of molecular docking applied on the more active ligands, presented in Figure 8, show that the ligands L6 and L9 share common molecular interactions as the chemical bonds of type pi-sigma and Pi-Pi T-shaped established between the benzenic cycle and (Val120 and Tyr124) amino acids respectively, in addition to two bonds of alkyl type with Phe325 and Phe43 amino acids. L30 and L31 ligands also form common bonds, like the hydrogen bond linked to the nitrogen atom, with the amino acid Asp46 at the same nuclear distance (5.5 Å), in addition to the alkyl bond with bicyclo group and Tyr124 amino acid. The same type of bond was established between the methyl group and the amino acid Phe325 at a nuclear distance of 5.5 Å, more than Pi-Pi bonds with Phe43 and Phe319 amino acids. Even the ligand L37 formed an alkyl bond with Tyr124 amino acid, and two Pi-Pi chemical bonds with Phe325 and Phe319 amino acids. Therefore, we can conclude that Tyr124, Phe43, Phe325, Asp46, Phe319 and Val120 amino acids are the active sites of the dopamine transporter (DAT) membrane protein, in which the most active ligands can inhibit the glycine transporter type 1 (GlyT1).

**Figure 7.** Experimental pose view of Nortriptyline with the protein's active sites.

**Figure 8.** *Cont*.

**Figure 8.** *Cont*.

**Figure 8.** 2D and 3D visualization of intermolecular interactions between DAT (PDB code: 4M48) and the more active ligands (L6, L9, L30, L31 and L37), with binding energies of −10.74 kcal/mol, −10.22 kcal/mol, −8.46 kcal/mol, −8.78 kcal/mol and −8.59 kcal/mol, respectively.

#### *2.8. Docking Validation Protocol*

The efficiency of the molecular docking algorithms was tested using the re-docking methodology, which is based on the superposition of the docked ligand on the proteinbound ligand, as shown in Figure 9. The superposition result indicates a root mean square deviation smaller than 2 (*RMSD* = 0.022 Å), which explains an exact pose prediction. Additionally, 2D and 3D visualization (Figure 10) of the intermolecular interaction between the docked nortriptyline and the protein target indicates that the chemical bonds formed with Phe43A and Tyr124A amino acids are the same as those observed experimentally. Thus, the molecular docking protocol is successfully validated [18].

**Figure 9.** Re-docking pose with RMSD equal to 0.022 Å (original nortriptyline of cyan color, and docked nortriptyline of green color).

− − **Figure 10.** 2D (**a**) and 3D (**b**) visualization of intermolecular interaction between the docked nortriptyline and the protein target (binding energy of −9.11 kcal/mol).

#### *2.9. Molecular Dynamics Simulations*

The most active ligands (L6, L9, L30, L31 and L37) were chosen for the molecular dynamic's simulation during 100 ns, to examine their stability toward DAT protein, where the conformational changes of one of these ligands are presented in Figure 11, and the others were presented in Figure S1.

The dynamic changes of conformation for (L9-protein) complex shown in Figure 11, indicate that the simulation is well-equilibrated, as the fluctuations of the root mean square deviation (RMSD) of the protein (left *Y*-axis) are around the thermal mean structure throughout the simulation time (100 ns), because the changes of the order of 1–3 Å are perfectly acceptable for small globular proteins. Moreover, the RMSD evolution of the heavy atoms of the ligand (right *Y*-axis) shows its stability with respect to the protein, when the protein-ligand complex is first aligned on the protein backbone of the reference,

because the observed values are significantly smaller than the RMSD of the protein, and so the ligand did not diffuse away from its initial binding site.

**Figure 11.** RMSD and RMSF graphs for the ligand 9 complexed with the dopamine transporter membrane protein during 100 ns.

The root mean square fluctuation (RMSF) values are also computed to examine the impact of the ligand binding on the internal dynamics of the target protein during 100 ns, where the tails (N- and C-terminal) fluctuate more than any other part of the protein and the secondary structure elements like alpha helices and beta strands are usually more rigid than the unstructured part of the protein; for this reason, they fluctuate less than the loop regions. Except for a single fluctuation of 3.2 Å, detected in the loop region of residue 390, all fluctuations were less than 3 Å, indicating the binding strength between the ligand 9 and the DAT protein, and no significant change in the protein conformation resulting from ligand binding.

Additionally, the radius of gyration (r Gyr) values fluctuated in a small interval from 3.45 to 3.76 Å until the end of the simulation, as shown in Figure 12, indicating that there are just some changes in the compactness of the ligand; thus, the protein has a good flexibility after its binding with the ligand 9. Moreover, the solvent accessibility of the protein-ligand 9 complex was evaluated by the solvent accessible surface area (SASA) analysis, which fluctuated between 0 and 15 Å<sup>2</sup> for 100 ns; this graph revealed that the structure of compound 9 was relatively stable during the simulation time. The polar surface area (PSA) is a solvent accessible surface area in a molecule contributed only by oxygen and nitrogen atoms; this parameter varies between 8 and 32 Å<sup>2</sup> , accompanied by some maximal and minimal fluctuations during the simulation time. The contributions of this type of atoms make the ligand relatively unstable. However, the molecular surface area (MolSA) illustrates the molecular surface calculation with a probe radius of 1.4 Å, equivalent to a Van der Waals surface, showing only minimal fluctuations.

Lastly, the graph of the total energy presented in Figure 12 shows a minimal variation about the average −53.8682 kcal.mol−<sup>1</sup> , which means that the energy of the L9- DAT protein complex remained in equilibrium throughout the MD simulation.

The dynamic changes of conformation for other complexes are available in the supplementary material (Figure S1). The protein-ligands interactions fluctuate with a root mean square deviation (RMSD) of 1 to 3 Å along the simulation time (100 ns), except for the L31-protein complex, which oscillates for the first 20 nanoseconds from 1 to 3 Å, then destabilizes until 50 ns, and stabilizes again with a deviation 4 Å of about until the end of the simulation time.

**−**

−

**Figure 12.** Rg, MolSA, SASA and PSA during 100 ns of MD simulation, and the variation of total free energy for the ligand 9 complexed with DAT protein.

− − − − Minimal fluctuations have been observed during the established interactions between the protein and the ligand 6, such as the observed root mean square fluctuations (RMSF) of 3.4 Å, 4.4 Å and 3.9 Å detected in the loop region of 10, 240 and 480 residues, respectively, in addition to a maximal fluctuation of 5.5 Å recorded in the loop area of residue 380. No fluctuation was noticed greater than 3 Å for L30-protein and L37-protein complexes. Three fluctuations have been recorded for L31-protein complex of 3.6 Å, 3.5 Å and 7 Å reported in the loop zone of 10, 480 and 520 residues, respectively.

Overall, we note that the protein has a good flexibility of binding with L6, L30, L31 and L37 inhibitors, as there are just a few changes in the compactness of the ligand, since the r Gyr, SASA, MolSA and PSA parameters are varied with minimal fluctuations about the mean along 100 ns.

Finally, the total energy plots presented in Figure S2 show a minimal variance around the average energy of the total system, which was in Kcal/mol of −47.7003, −48.8821 −49.5236 and −62.5749 for L30, L31, L37 and L6 inhibitors, respectively, indicating that the energies of the ligands-protein complexes have remained in equilibrium over the course of the MD simulation.

We conclude that the molecular dynamics simulations reinforce the previous results obtained by QSAR and docking studies, since the ligands L6, L9, L30, L31 and L37 are the most active inhibitors, forming typical static interactions with some amino acids of the target protein. These interactions form dynamically stable complexes during the 100 ns of the simulation time, as there is no change in their properties, except for the minimal fluctuations that were observed.

#### **3. Materials and Methods**

#### *3.1. Database*

The present study is performed on 44 bicyclo ((aryl) methyl) benzamides as glycine transporter type 1 (GlyT1) inhibitors, whose biological activities are expressed on a logarithmic decimal scale (log10IC50), as illustrated in Table S1.

#### *3.2. Molecular Descriptors Calculation*

To build the quantitative structure-activity relationship (QSAR) models that provide information on the correlation between activities and structure-based molecular descriptors, we calculated various types of molecular descriptors [38], as shown in Table S2. Initially, the constitutional descriptors were calculated using the ACD/chemsketch software [39]. Subsequently, the thermodynamic and physicochemical descriptors were extracted using the MM2 technique via the ChemBio3D software [40]. Lastly, the quantum descriptors are calculated through Gaussian 09 software [41], using the density function

theory (DFT)/B3LYP [42], combined with the 6-31 + G(d,p) basis set, in order to ensure the molecules stability and optimize their three-dimensional geometries.

#### *3.3. Statistical Methods*

The Quantitative structure-activity relationships (QSARs) are developed with the help of XLSTAT 2014 software [43], using different statistical methods such as: the principal component analysis (PCA), multiple linear regression (MLR) and multiple non-linear regression (MNLR). The principal component analysis method is a very important step that serves to minimize the molecular descriptor dimension so as to identify the most predictive variables [44]. This limited number of descriptors is mathematically modeled by the multiple linear regression (MLR) and multiple non-linear regression (MNLR) techniques. Therefore, the two obtained QSAR models were generated to predict the linear and nonlinear relationships established between the biological activity of glycine transporter type 1 (GlyT1) inhibitors and their relevant descriptors. For their applicability, these two models have been evaluated by external and internal validation, as well as the molecules, which have been tested in the applicability domain [45]. Additionally, the Golbreikh and Tropsha criteria and the Y-randomization test were used to verify the robustness and predictive potential of the established QSAR model [31,46].

#### *3.4. Drug Likeness and In Silico Pharmacokinetics ADMET Prediction*

To make the drugs applicable in clinical trials, it is necessary to study their absorption, distribution, metabolism, excretion and toxicity (ADMET) in the human body before starting the investigation protocols [21,47], respecting some important rules such as those of Lipinski [48], Veber [49], Ghose [50] and Egan [51,52]. This technique is also applied to eliminate the compounds with potentially undesirable physiological qualities, taking into account toxicity and pharmacokinetic properties [53]. For this task, we estimated the drug similarity and in silico pharmacokinetic properties of the newly selected molecules as GluT1 inhibitory agents, using the online SwissADMET [54] and pkCSM [55] servers, respectively.

#### *3.5. Molecular Docking Modeling*

The computational technique of molecular docking is an efficient, fast and powerful tool for drug discovery [56]. For this project, we uploaded the three-dimensional coordinates of the target protein from the protein data bank (pdb) using the Discovery Studio 2021 (BIOVIA) software package [57]. To improve the performance of the cavity method, water molecules and suspended ligands bound to the protein were removed and polar hydrogens were added. Accordingly, the prepared protein was docked with the most active ligands, previously optimized by the density functional theory (DFT), with the assistance of AutoDock 4.2 [58]. Moreover, the grid box was centralized on (−42.562 Å, −0.46 Å, −55.066 Å) with the help of AUTOGRID algorithm, by putting the sizes (80, 80, 80) in their three-dimensional structure, and running 10 genetic algorithms with a total of 25 million trials. Finally, the molecular interactions of the protein-ligand were visualized using discovery studio 2021 [59].

#### *3.6. Molecular Dynamics*

Based on QSAR and molecular docking results, the five best-docked ligands, having the highest activity, were chosen for the molecular dynamics simulations in order to identify the molecular recognition between the ligand and the dopamine transporter (DAT) membrane protein. The MD simulations were performed for 100 nanoseconds using Desmond software, a package of Schrödinger LLC [60]. The first stage in the molecular dynamic's simulation of protein-ligand complexes was obtained by docking studies, and preprocessed using Protein Preparation Maestro, which performs optimization and minimization of complexes that have been prepared by the System Builder tool using a solvent model with an orthorhombic box that was chosen as TIP3P (transferable intermolecular interaction potential 3 points), using the OPLS force field [61]. At 300 K temperature and 1 atm pres-

sure, the models were made neutral with the addition of water molecules, and counter ions, such as 0.15 M salt (Na<sup>+</sup> , Cl−) were added to mimic the physiological conditions. Finally, the trajectories were saved after every 10 ps for analysis, and the stability of simulations was evaluated by calculating the root mean square deviation (RMSD) of the protein and ligand over time. In addition, the root-mean-square fluctuation (RMSF), gyration radius (Rg), solvent accessible surface area (SASA), molecular surface area (MolSA) and polar surface area (PSA) were recorded for 100 ns, and the free energies of the inhibitors-protein interactions were evaluated using MM-GBSA approach [62].

#### **4. Conclusions**

A systematic in silico study was applied on 44 bicyclo((aryl)methyl)benzamide derivatives as glycine transporter type 1 (GlyT1) inhibitors to discover effective antipsychotic candidates for the treatment of schizophrenia. Initially, two QSAR models were developed using MLR and MNLR techniques and were examined through external and internal validation, applicability domain (AD), Y-randomization test and Golbreikh and tropsha criteria, indicating a significant effect of hydrogen bond donor, polarizability, surface tension, stretch and torsion energies and topological diameter on the locomotor activity (LMA). Subsequently, ADMET in silico pharmacokinetics prediction revealed a favorable profile of the most active ligands, where L6, L9, L30, L31 and L37 were predicted as non-toxic inhibitors for 2D6 cytochrome, which respect the rules of Lipinski, Veber, Egan and Ghose, with an excellent absorption exceeded 91% and highest probability to penetrate the central nervous system (CNS). In contrast, the ligand L32 as an outlier in the QSAR study has an unfavorable ADMET profile with the highest probability of being absorbed by the gastrointestinal tract. Lastly, the obtained results were further strengthened and qualified using molecular docking and molecular dynamics studies, which confirm that L6, L9, L30, L31 and L37 react specifically with Tyr124, Phe43, Phe325, Asp46, Phe319 and Val120 amino acids of the dopamine transporter (DAT) membrane protein in a way that blocks glycine transporter type 1 (GlyT1) forming dynamically stable complexes during 100 ns of MD simulation time. Therefore, they could be used as therapeutics in medicine to treat schizophrenia. However, they must be subjected to in vitro and in vivo investigations to evaluate their efficacy and safety as anti-schizophrenia drugs.

**Supplementary Materials:** The following are available online at https://www.mdpi.com/article/10 .3390/ph15060670/s1, Table S1: The 44 molecules and their biological activities. Figure S1: RMSD and RMSF graphs for L6, L30, L31, L37 ligands complexed with the dopamine transporter membrane protein during 100 ns. Figure S2: Rg, MolSA, SASA and PSA during 100 ns of MD simulation, and the variation of total free energy for L6, L30, L31 and L37 ligands complexed with DAT protein. Table S2: The calculated molecular descriptors.

**Author Contributions:** Conceptualization, M.E.f., M.E. and M.K.; methodology, M.E.f., A.A. (Amine Assouguem) and M.E.-R.; software, M.E.f. and A.A. (Amal Alotaibi); validation, A.B., N.N.M. and H.F. data curation, R.U., S.E. and S.Z.; writing—original draft preparation, M.E.f. and M.E.-R.; writing—review and editing, M.K. and M.E.; supervision, M.E. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2022R33), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Data is contained within the article and Supplementary Materials.

**Acknowledgments:** Authors wish to thank Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2022R33), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia, for financial support.

### **Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Article* **Structural Elucidation of Rift Valley Fever Virus L Protein towards the Discovery of Its Potential Inhibitors**

**Mubarak A. Alamri 1,† , Muhammad Usman Mirza 2,† , Muhammad Muzammal Adeel 3,† , Usman Ali Ashfaq 4 , Muhammad Tahir ul Qamar 4, \* , Farah Shahid 4 , Sajjad Ahmad 5 , Eid A. Alatawi 6 , Ghadah M. Albalawi 7,8 , Khaled S. Allemailem 7, \* and Ahmad Almatroudi 7**


**Abstract:** Rift valley fever virus (RVFV) is the causative agent of a viral zoonosis that causes a significant clinical burden in domestic and wild ruminants. Major outbreaks of the virus occur in livestock, and contaminated animal products or arthropod vectors can transmit the virus to humans. The viral RNA-dependent RNA polymerase (RdRp; L protein) of the RVFV is responsible for viral replication and is thus an appealing drug target because no effective and specific vaccine against this virus is available. The current study reported the structural elucidation of the RVFV-L protein by in-depth homology modeling since no crystal structure is available yet. The inhibitory binding modes of known potent L protein inhibitors were analyzed. Based on the results, further molecular docking-based virtual screening of Selleckchem Nucleoside Analogue Library (156 compounds) was performed to find potential new inhibitors against the RVFV L protein. ADME (Absorption, Distribution, Metabolism, and Excretion) and toxicity analysis of these compounds was also performed. Besides, the binding mechanism and stability of identified compounds were confirmed by a 50 ns molecular dynamic (MD) simulation followed by MM/PBSA binding free energy calculations. Homology modeling determined a stable multi-domain structure of L protein. An analysis of known L protein inhibitors, including Monensin, Mycophenolic acid, and Ribavirin, provide insights into the binding mechanism and reveals key residues of the L protein binding pocket. The screening results revealed that the top three compounds, A-317491, Khasianine, and VER155008, exhibited a high affinity at the L protein binding pocket. ADME analysis revealed good pharmacodynamics and pharmacokinetic profiles of these compounds. Furthermore, MD simulation and binding free energy analysis endorsed the binding stability of potential compounds with L protein. In a nutshell, the present study determined potential compounds that may aid in the rational design of novel inhibitors of the RVFV L protein as anti-RVFV drugs.

**Keywords:** RVFV; RdRp; structural modeling; virtual screening; docking; MD simulation

**Citation:** Alamri, M.A.; Mirza, M.U.; Adeel, M.M.; Ashfaq, U.A.; Tahir ul Qamar, M.; Shahid, F.; Ahmad, S.; Alatawi, E.A.; Albalawi, G.M.; Allemailem, K.S.; et al. Structural Elucidation of Rift Valley Fever Virus L Protein towards the Discovery of Its Potential Inhibitors. *Pharmaceuticals* **2022**, *15*, 659. https://doi.org/10.3390/ ph15060659

Academic Editor: Marialuigia Fantacuzzi

Received: 23 April 2022 Accepted: 20 May 2022 Published: 25 May 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

#### **1. Introduction**

Rift Valley fever is a hemorrhagic ailment caused by the Rift Valley Viral Infection (RVVI), affecting humans and livestock. Mosquitos are the vectors for this disease. The first case of RVVI was reported in East Africa's Rift Valley in Kenya, which made it known as RVF. In its first instance, a significant population of farmed sheep was killed from 1930 to 1931. *Aedex* and *cules* are the two common vectors (mosquitoes) transmitting this virus [1]. Post RVVI, flu-like common symptoms have lasted up to seven days. Although most human infections are mild, approximately four percent of the cases have been severe and show symptoms such as Hemorrhagic fever, meningoencephalitis, and ocular type [2].

The virus responsible for this disease is Rift Valley Fever Virus (RVFV), a singlestranded RNA virus belonging to the Phenuviridae Family and the genus Phlebovirus. Rift Valley fever (RVF) is caused by RVFV and has a wide range of clinical symptoms. A mild febrile illness is common in humans and can lead to more severe illnesses such as encephalitis, hemorrhagic fever, and liver disease. Neurological issues including dizziness, paralysis, headaches, hallucinations, vertigo, and delirium are also observed [3]. Up to 10% of RVFV-infected humans experience ophthalmologic complications, including retinal hemorrhaging and photophobia. Sheep, cattle, and goats are more susceptible to RVFV, with a 20–30% mortality rate in adult ruminants and 70% in young animals. Spontaneous abortion in infected pregnant animals is alarmingly high, ranging from 40–100 percent [4].

RVF outbreaks are sporadic and linked to meteorological, hydrological, and socioeconomic factors. RVF outbreaks have been reported to produce considerable numbers of infected human cases, which have substantially impacted the healthcare system [5]. According to an analysis by the Center for Disease Control and Prevention (CDC), several RVF outbreaks have occurred in various countries, resulting in hundreds of thousands of human illnesses and human casualties [6]. This disease has spread from the African region to the Arabian to the USA, making it a grave concern to affected regions and the whole world [7]. Epizootics that occur in livestock frequently precede human epidemics. In other words, reducing RVVI in animals is expected to break the transmission cycle and help avoid human sickness [8]. The CDC and the U.S Department of Agriculture (USDA) both classify RVFV as a select agent because of its ability to cause morbidity and mortality and its potential use as a bioterrorism agent [4,9]. Unfortunately, there are no FDA-approved vaccines for humans, but some approved vaccines are available for veterinary use. Similarly, there are no therapeutics available to treat RVFV, requiring additional research in this area.

As per the structure of RVF, it is a high-molecular-weight, single-stranded, linear virion (2–2.5 nm in diameter, 200–300 nm in length, and displaying helical symmetry) enclosed within the ribonucleo-capsid [10]. Three segments of the RVFV genome (S, M, and L-segments) have previously been described as having three separate open reading frames (ORFs) coding for three distinct proteins (viral polymerase, L-protein, and S-protein) [11,12]. The L segment, which encodes only one protein, RNA-dependent RNA polymerase (RdRp), commonly referred to as L protein, is particularly important to our research. The L protein is involved in genomic replication and viral mRNA transcription. The middle region of the L protein contains an RdRp domain, which is found in all RNA viral polymerases and is required for viral RNA production. The RdRp domain retains the six characteristic conserved structural motifs, including PreA/F, A, B, C, D, and E in the central core region [12–15]. Most of these motifs are located in the palm subdomain and characterize the active site chamber formation [14]. These host mRNAs are employed as primers for viral transcription and suppress viral-RNA-induced immune responses [16]. L protein has an endonuclease domain at its N-terminus that is necessary for cap-snatching, in which it cleaves 50 m7G caps from host mRNAs [17,18]. Viruses translate and reproduce their genome in the cell cytoplasm for survival and growth. During the replication, the virus uses a technique termed cap-snatching, which means taking advantage of two known viral L protein functions: the capability to bind cap structures and cut off the cellular mRNA attached to [19].

The rational drug design process is greatly accelerated by using different computeraided drug designing applications for in silico drug screening [20–27]. A virtual screeningbased drug discovery strategy has been identified as one of the most effective ways to discover and develop new drugs [28]. RVFV L protein was selected for the current study as a potential target for which no drug has been reported so far. Besides evolutionarily conserved motifs in the core region, RdRp has channels/tunnels that link the active site chamber with the exterior and therefore emerge as a potential target for developing antiviral inhibitors [29–32], which is evident from the inhibitor design against many deadly viruses such as *Zika virus* (ZIKV) [33–36], *Japanese encephalitis virus* (JEV) [37], *West Nile virus* (WNV) [38], *Dengue virus* (DENV) [39–41], HCV [24,42–44], HIV [45], SARS-CoV-2 [25], and most of the drugs have been reported against *Ebola* polymerase L (EBOV) [26]. The current study aimed to use the virtual screening approach to identify potential RVFV L protein inhibitors and inhibitory binding modes of known potent inhibitors, followed by molecular docking analysis to discover novel inhibitors that could be used as potential leads for *RVVI* treatment.

#### **2. Results and Discussion**

#### *2.1. Protein Structural Analysis*

Homology modeling is a well-established technique in modern drug discovery, and with more advancement in machine learning approaches, it is now possible to build a homology model with high accuracy even from a template with a low identity. Recent advances in homology modeling have proven their effectiveness as an alternative [46,47] and retrospective analysis, and validate the usefulness of homology modeling in SBVS. [48–51]. In the current study, the 3D structure of RVFV L protein was predicted by using the homology modeling approach via Modeller.v9.11. It was observed that the crystal structure and cryo-EM structure of severe fever with thrombocytopenia syndrome virus L (SFTSV L) protein (PDB IDs: 6L42 and 6Y6K) were the best hits based on query coverage and percentage identity. 6L42 showed a 34.41% sequence identity and a 91% query coverage during the sequence to structure alignment, while 6Y6K showed a 34.28% sequence identity and a 91% query coverage against the target sequence (Figure S1). Hence, these models were considered the best templates for homology modeling. Chain "A" of both templates were used for downstream structure prediction steps. Studies have suggested that more than 90% confidence indicates that the core model is precise and correct, deviating 2–4 Å in RMSD from the native protein structure [52]. Moreover, a good percentage of identity with maximum coverage between the template and the query sequence indicates a high level of accuracy in the model. The structural superposition/RMSD of the model with the templates is depicted in (Figure S2).

Overall, the newly predicted structure showed a pretty stable arrangement of aminoacid residues; apparently, no structural distortion was observed, and a total of 82 helices, 40 sheets, and 118 coils were detected (Figure 1A). Deep structural analysis suggested that the computational model of RVFV protein contains seven domains, i.e., the endoN domain (amino acids 25–205), separated by a linker of span 206–295; the PA-C-like domain from 296 to 762 amino acid; the RdRp core from 763 to 1345 amino acid; the PB2-N-like domain with residues 1346–1571 span; the arm domain contains two spans 1615–1696 and 1811–1932 that are separated by a blocker motif of 1811–1852 residues; the next domain was CBD with the residues 1697–1810, and on the C-terminal, a lariat domain (1933–2049 amino acid) was observed as previously reported [53] (Figure 1B). Some of the RVFV protein model domains were reported as structurally similar to SFTSV-L (severe fever with thrombocytopaenia syndrome virus –L protein) [54], which seems evident because these proteins shared the same values as the *Phenuiviridae* protein family. The orientation of each domain determines the functional specificity of protein and facilitates inter-molecular interactions for domain organization [19,53].

– ) RMSD of Cα atoms **Figure 1. RVFV L protein structural analysis:** (**A**) Two-dimensional representation of RVFV L protein showing the helices (blue), sheets (red), and coils (green) information (top to bottom). The horizontal black bar is representing the length of a protein. (**B**) Ribbon representation of 3D-model of RVFV L Protein (residues from 291–1583 are highlighted) with divalent cation, presumed to be a magnesium ion (circle Mg 2+ ). Each structural component is highlighted with a different color. (**C**) Ramachandran plot contains four quadrants. The 1st and 3rd quadrant indicate the allowed region, while 2nd and 4th show the disallowed region. Blue dots are showing the density of amino acid residues. (**D**) Model energy calculation graph showing the local energy estimation of a model. The *X*-axis indicates the sequence length, while *Y*-axis shows the energy values. (**E**) RMSD of Cα atoms of the RVFV L protein over a period of 50 ns.

– – Multiple methods were employed for the validation of the 3D model. The Ramachandran plot indicates that 82.8% of residues were present in the favored region, 15.4% were present in additional allowed regions, and 0.7% were present in disallowed regions (Figure 1C). The presence of 82% of residues in the favored region indicates the high quality of the model. The functional properties of a protein depend mainly on its 3D structure and these properties were analyzed based on secondary structures. It is beneficial to identify secondary structure elements and structural motifs when studying the protein 3D structure [55]. Structural components of L protein were analyzed by PDBsum [56]; results showed the presence of multiple structural components such as 18 Beta-hairpins, 84 helices, 1Psi-loops, 40 strands, 155 Beta-turns, 44 Gamma-turns, 1 Beta-alpha-beta-unit, and 111 helix–helix interactions. Since we detected the 84 helices and 111 helix–helix interactions, such a high number of helices content determines the high chirality of proteins and thus provides structural compactness [57]. The distribution of structural components suggested the potential stability of the newly designed model, which was cross-validated by MD-simulations in subsequent steps. The lowest energy of a model also determines its quality, here we have applied the ProsaWeb server [58] to determine the overall local energy of the RVFV protein model at the amino acid level, results showed that most of the

amino acids were showing the lowest energy below the threshold (Figure 1D). A 50 ns MD simulation further evaluated the generated model, and less than a 1.5 Å average RMSD was found that endorsed the reliability of the predicted RVFV-L protein model. An RMSD trajectory plot depicted the overall backbone structure stability of RVFV's core L protein subdomains (Figure 1E). The backbone Cα-RMSD of RVFV-L fluctuated in the start and converged afterwards. Protein expansion during the start of the simulation probably led to a somewhat larger RMSD of model structure to achieve a more stable conformation. Additionally, the RVFV-L protein homology model was compared with the corresponding template using TM align [59] to identify the likelihood of a similar structural fold (RdRp polymerase chamber) as categorized in SCP/CATH. The TM-algin is the protein structural alignment program for comparing proteins. The TM-align ranked the RVFV-RdRp with up to an 88% similar fold. Reverse template selection by the profunc server [60] (which scans auto-generated templates from the query structure against the most representative structures in PDB using Jess, a fast and accurate 3D-structure search algorithm) also determined SFTSV L (6L42) as the best hits with an E-value of 0.00E+00 and structural similarity of 99.9% (Table S1).

#### *2.2. Binding Site Determination*

The identification and characterization of binding sites of target proteins using various in silico methods are of main consideration for structure-based drug design. For binding site analysis, careful consideration was given to the biological suggestive templates. As RdRps are encoded by a wide range of viruses and play an important role in the replication and transcription of viral RNA [61] and have a conserved RdRp core region including the characteristic conserved finger, palm, and thumb subdomains with conserved 6 structural motifs (motifs A–F), which are essential for polymerase function [31]. Therefore, more in-depth information was extracted from other templates (with low identity) that showed sequence identity in the core region. The overall predicted binding site is diagrammatically illustrated in Figure 2.

Despite the low sequence identity (<35% identity), the RVFV-L RdRp core (residues 769–1358) displayed characteristic features of the polymerase core domain, which included the RNA synthesis chamber configured by representative RdRp conserved structural motifs connected to the exterior by four channels (for template entry and exit, NTP entry, and product exit) (Figure 2). The active site chamber of the RVFV-L RdRp core is formed by conserved RdRp motifs A-F located in the palm domain as reported in other RNA polymerases, notably in *orthobunyavirus* polymerases [14,62]. For molecular docking, the active site residues were further predicted using COACH and 3DLigandSite, and both servers predicted Arg926, Ile928, Asp991, Lys994, Trp995, Asn996, Gln1084, Ser1132, Asp1133, Asp1134 and Ser1175, and Phe1191 and Phe1194 as common binding site residues. Interestingly, the structural superimposition of RdRp core regions of RVFV-L and SFTSV-L demonstrated these residues to be entirely inside the functionally conserved structural motifs A-F. These include premotif A (motif F) (919-QQHGGL**R**E**I**-928), motif A (991-**D**AR**KWN**-996) with a conserved divalent cation binding residue Asp991, motif B (1084-**Q**GILHYTSSLLH-1095), catalytic signature motif C (1132-**SDD**-1134) located between two β-strands, motif D (1173-YP**S**EKST-1179), and motif E (1186-MEYNS**EF**Y**F**-1194). To further evaluate the binding site for docking simulations, poliovirus elongation complex structure (PDB ID: 3OL8) [63] and foot-and-mouth disease virus in complex with RTP (PDB ID: 2E9R) [64] were superimposed and positions of catalytic divalent cations and conformation of RTP were analyzed. The superimposition revealed that the position of divalent cations was in close connection with functionally conserved aspartates of motif C (Asp1133) and motif A (Asp991) [65,66]. The importance of these aspartates is apparent from a number of mutational studies that uncovered altered polymerase activity in several RdRp viruses [63,67–73]. The conservation of core residues in these motifs suggests a conserved evolutionary link between RNA polymerase viruses and pinpoints the potential of exploiting the core architecture with other related segmented (−) ssRNA viruses in order to

predict their structural/functional features even with a low sequence homology [32,74–76]. The docking grid was generated to cover these structural motifs for molecular docking.

– – **Figure 2.** The predicted RVFV-L RdRp active site. (**A**) The RVFV-L RdRp core (residues 769–1358) is highlighted in green with conserved RNA synthesis chamber (active site), organized by conserved structural motifs in distinct colors. The four predicted tunnels are marked with arrows as an entrance (template and NTP entry) into the active site chamber and exit tunnels (template and product exit). The arrangement of structurally conserved RdRp motifs are colored magenta, red, golden, khaki, brown, purple, and cyan for motifs A–F, respectively, and superimposed on SFTSV-L RdRp core (light orange). (**B**) The predicted binding site residues are aligned and highlighted through multiple structure alignment (MSA) in representative motifs. These conserved residues are highlighted in stick representation accordingly. Superposition of the poliovirus elongation complex structure (PDB ID: 3OL8) and foot-and-mouth disease virus in complex with RTP (PDB ID: 2E9R) displays the conformation of RTP (yellow) and positions of the catalytic divalent cations (black spheres). Viral names in MSA are abbreviated as follows: Influenza A (FluA), B (FluB), and C (FluC) virus polymerase, Rift valley fever virus (RVFV), and Severe fever with thrombocytopenia syndrome virus (SFTSV).

#### – *2.3. Molecular Docking*

tween two β

Molecular docking is a modeling technique that examines how ligands and receptors fit together and how enzymes interact with ligands [77,78]. The docking computations were done three times, and the compound conformations were sorted by binding energy in kcal/mol. Initially, three known potent inhibitors with distinct chemical structures, namely, Monensin, Mycophenolic acid, and Ribavirin were docked [79]. Monensin inhibits host cell entry by blocking endocytic organelles' acidification [80]. Mycophenolic acid and Ribavirin act directly by inhibiting the RdRp enzyme or indirectly via the inhibition of cellular enzymes necessary for the biosynthesis of guanine-nucleotide [81,82]. The docking grid-box in Autodock Vina was set up to cover the predicted active site residues within the conserved structural RdRp motifs. As shown in Figure 3, the docked complexes obtained a similar binding orientation within the active chamber RdRp core. For Monensin, it forms

hydrogen bonds with the side chains of Arg 672, Asp 990, Asp 1133 (belong to motif C), and Ala 992 (belong to motif A) residues with a binding energy score of −8.5 kcal/mol (Figure 3B). Mycophenolic acid forms hydrogen bonds with residues Arg 1197 and Asn 984 having a binding score of −7.0 kcal/mol (Figure 3D), while Ribavirin forms hydrogen bonds with residues Asp 1134 (belong to motif C), Tyr 757, Ser 1190 (belong to motif E), Arg 672, Asp 991 (a conserved divalent cation binding residue), and Asp 1133 (functionally conserved aspartates of motif C) having a binding score of −6.7 kcal/mol (Figure 3F). −8 −7 −6

'

exploiting the core architecture with other related segmented (−) ssRNA viruses in order

–

–

**Figure 3. Binding modes and interaction mechanisms of known L protein inhibitors.** Close-up view into 3D binding mode of (**A**) Monensin, (**C**) Mycophenolic acid, and (**E**) Ribavirin. The main residues that involve in the formation of the active site within the structurally conserved RdRp motifs that are show in in stickes. 2D interaction analysis of (**B**) Monensin, (**D**) Mycophenolic acid, and (**F**) Ribavirin.

Subsequently, the structure-based virtual screening of the Selleckchem Nucleoside Analogue Library containing 156 diverse compounds against the L protein was performed. Nucleoside or nucleotide analogue is an important class of antiviral therapeutics [83]. It is a powerful tool for dissecting the mechanisms and functions of viral DNA and RNA polymerases [84,85]. This screening selected top-ranking compounds with better binding scores than control compounds for further analysis. The virtual screening results for the top compounds with a binding affinity > −1.5 kcal/mol are shown in Table S2. Table 1 depicts a general overview of the binding energies and residues of control drugs. The docking analysis revealed that A-317491, followed by VER155008, and Khasianine were the best binders among the docked compounds used in this study. The selected compounds exhibited their binding energies in −8.7 kcal/mol to −6.7 kcal/mol.

Figure 4 shows the binding modes and 2D interaction mechanisms of top docked compounds. The 2D plot shows that A-317491 is involved in hydrogen bonding with several residues, such as His1090 and Gln1086 within motif E, Asp991, Ala992, Trp995, and Asn996 within motif A, and Lys 1177 within motif D, with a binding energy score of −8.7 kcal/mol (Figure 4B). Khasianine formed hydrogen bonding with a Gln997 side chain with a binding score of −7.1 kcal/mol (Figure 4C), while VER155008 was involved in making hydrogen bonds with residues Arg672, Lys779, Asp991, Lys1177, Tyr1188, and Ser1190 with a binding score of −7.8 kcal/mol (Figure 4D).


**Table 1.** Binding energy and binding residues of docked complexes. −8 −6 −8 −6 −8 −6

−

−

−

−

−

−

−8 −6

−8 −6

−8 −7 −8 −7 −8 −7 −8 −7 −8 −7 −8 −7 Notably, both known inhibitors and the identified compounds occupied the binding site pockets within the RdRp core with a similar interaction pattern (Figure 5). Both control drugs and selected compounds formed hydrogen bonds with functionally conserved aspartates of motif C, namely, Asp 1133 and the conserved divalent cation binding residue, Asp991 [24]. Our screened nucleoside compounds displayed higher binding energies and interacting profiles than Monensin, mycophenolic acid, and Ribavirin. Most ligands tended to bind the residues belonging to the PA-C-Like domain and the RdRp core domain. Since both these domains are actively involved in forming active site chamber and vRNA binding. Previously published data have proposed that RdRp and PA-C-Like domains facilitate the template-directed RNA synthesis in SFTSV-L protein by providing the NTP entry into the catalytic chamber [62]. Overall, both these domains and endo N domain are critical for product synthesis and the safest release in the cell [53]. These findings suggested that the compounds identified in our study could be more useful candidates for RVFV drug therapy. Previously, a number of studies on RdRp have identified several leads using various computational and experimental techniques. The compounds revealed to interact with RdRp binding pocket residues reported herein [86,87]. However, RVFV RdRp has not been explored much, and the current study is a pioneer work in this regard.

−7

**Figure 4. Binding modes and interaction mechanisms of novel L protein inhibitors.** (**A**) Close-up view into binding mode of (**A**) A-317491, (**C**) Khasianine, and (**E**) VER155008. The main residues involved in the formation of the active site within the structurally conserved RdRp motifs shown in stickes. 2D interaction analysis of (**B**) A-317491, (**D**) Khasianine, and (**F**) VER155008.

**Figure 5. Binding modes of compounds**. (**A**) Molecular surface representation for the inhibitory binding pattern of all ligands. (**B**) Close-up view into the binding mode of all compounds within the active site of RdRp core. The main residues involved in the formation of the active site within the structurally conserved RdRp motifs that are shown in stickes.

#### *2.4. ADME and Toxicity Prediction Analysis*

' ' Predicting ADME profiles using an in silico approach has long been proven to be a reliable method of determining a compound's pharmacokinetic properties. Evaluating lead compounds' ADME properties is a major challenge in the drug development [88]. Because of poor toxicity and pharmacokinetic properties, most drugs fail to pass the drug development process. The development of high-throughput and fast ADMET profiling assays has aided the detection of active lead compounds during early drug discovery [89]. Swiss ADMET and ADMETlab 2.0 were used to predict the absorption, distribution, metabolism, excretion, and toxicity (ADME) studies of the top three compounds, and their results are

– – −5

–

presented in Table 2. Gastro-intestinal absorption (GI) and blood-brain barrier (BBB) permeation indicate drug absorption and distribution of drug molecules [71]. One of the primary factors optimising drug discovery is information about drug distribution via BBB [90]. According to Table 2 results, all compounds show low gastro-intestinal absorption and no BBB permeation. The compounds cannot cross the blood-brain barrier (blood-brain barrier negative), so their consumption is not linked to the onset of neurological disorders. The absorption of the compounds was further revealed by caco-2 permeability values ranging from –6.019 to –5.727 log unit. A permeability of >−5.15 log unit in the ADMETlab 2.0 server indicates optimal caco-2 absorption. Oral bioavailability is frequently viewed as crucial in determining the drug-likeness of active compounds as therapeutic agents [91]. Furthermore, a variety of cytochromes (CYPs) regulate drug metabolism, with CYP2C19, CYP1A2, CYP2C9, CYP3A4, and CYP2D6 being critical for the biotransformation of drug molecules. The ability of a drug to inhibit or act as a substrate of the cytochrome P450 (CYP450) subfamily determines its therapeutic action [92]. A-317491 is an inhibitor of CYP2C9 while being a non-inhibitor and a non-substrate of other isoforms, while Khasianine is a non-inhibitor and a non-substrate of all isoforms, and VER155008 is an inhibitor of CYP3A4 and a non-inhibitor and non-substrates of other isoforms. Besides, p-glycoprotein inhibitors reduce the bioavailability of drugs that are known to be transported by it [93]. Except for Khasianine, all of the compounds in our analysis are inhibitors and negative substrates of p-glycoprotein, which explains the good absorption profile of the compounds. All of the compounds studied were nontoxic in terms of AMES toxicity. Following that, the safety profile of the three compounds was assessed by conducting toxicity prediction studies with an online tool: ProTox-II. This server classified substances into six toxicity classes (1–6), with class one being the most lethal and toxic, with an estimated lethal dose (LD50) of ≤5, and class six designating non-toxicity of the compound with an LD50 > 5000. A-317491 falls in class five with an LD50 value of 2500 mg/kg, while Khasianine falls in class four with an LD50 value of 500 mg/kg, and VER155008 falls in class six with an LD50 value of 7000 mg/kg. The findings strongly support the ability of the compounds studied to act as drugs against RVFV.


**Table 2.** Predicted ADME and Toxicity properties of identified nucleoside analogs. The probability of each parameter is depicted. **BBB**: blood–brain barrier; **CYP450**: cytochrome P450.


**Table 2.** *Cont.*

#### *2.5. MD Simulation*

To understand the structural–functionality relationship of the target protein, MD simulations are essential in computer-aided drug design (CADD) [21]. MD simulations provide detailed biomolecule dynamical structural information and surface wealth of protein-ligand interactions and energetic data. MD simulations have been very successful in recent years for optimizing the docked hits [24,25,45,94,95] and related studies [23,52,95–97]. This data set can guide novel drug design, making MD simulation a valuable tool in modern drug discovery.

#### 2.5.1. Root Mean Square Deviation (RMSD)

MD simulations of 50 ns were performed for the top three complexes and controls to elucidate compound binding stability and extract receptors/compound structural information that is important in the binding, and that may be altered to improve binding conformation and, ultimately, a compound affinity for the target biomolecule [95]. The RMSD is a frequently applied analysis to measure the structural similarity between superimposed proteins; smaller RMSD corresponds to similar structures and vice versa. The RMSD of each complex was calculated as carbon alpha deviations by superimposing 50,000 snapshots over the initial reference structure versus time. The RMSDs of the top three compounds were: A-317491-L (maximum, 6.76 Å; mean, 5.64 Å), VER155008-L (maximum, 7.82 Å; mean, 6.20 Å), and Khasianine-L (maximum, 7.67 Å; mean, 6.20 Å), while the RMSDs of the control drugs were: Monensin-L (maximum, 7.02 Å; mean, 5.95 Å), Mycophenolic acid-L (maximum, 7.24 Å; mean, 5.80 Å), and Ribavirin-L (maximum, 6.92 Å; mean, 6.0 Å) (Figure 6A). All the systems depicted uniform RMSD patterns, and no prominent peak was observed throughout the length of simulation time. The RMSD from the initial simulation time can be seen continuously for all complexes and converged at 10–15 ns. This is rightly possible because of the sudden exposure of the systems to a dynamic environment. This forces the receptor enzyme loops to acquire a more stable conformation. After 10 ns the systems can be witnessed in a more stable behavior until the end of the simulation time. The RdRp has been explored in SARS-CoV-2 for dynamics in the presence of a variety of drug molecules. The studies have found several leads that experience stable conformational dynamics and showed significant intermolecular stability with enzyme's active pockets [98–100]. However, RdRp from RVFV is not investigated so far, making the findings of the current study interesting for experimentalists to test them in in vivo and in vitro validation.

– – – – – – – **Figure 6. MD-Simulation studies of ligand bounded complexes.** (**A**) Root mean square deviation. (**B**) Root mean square fluctuations (endoN domain 25–205 aa; PA-C-like domain 296–762 aa; RdRp core 763–1345 aa; PB2-N-like domain 1346–1571 aa; arm domain two spans 1615–1696 and 1811 aa; CBD domain 1697–1810 aa; and C-terminal domain 1933–2049 aa). (**C**) Radius of Gyrations. (**D**) Betafactor analysis, each ligand is represented with different color such as A-1317491 (blue), Khasianane (pink), Monensin (Red), Mycophenolic acid (dark green), Ribavirin (Yellow), and Ver155008 (Cyan).

2.5.2. Root Mean Square Fluctuation (RMSF) Analysis

–

ity with enzyme's active pocket –

The residual flexibility and stability of complexes were further computed. The mean RMSF for A-317491-L is 2.38 Å, VER155008-L is 2.37 Å, Khasianine-L is 2.23 Å, Monensin-L is 2.31 Å, mycophenolic acid-L is 2.16 Å, and Ribavirin-L is 1.98 Å. These values indicate a high level of agreement on intermolecular stability. Overall, a high rate of fluctuations was observed starting from residue 1250 to onward, and Khasianine-L and A-317491-L were among the ligands showing a high tendency to fluctuate (Figure 6B). It was observed that part of the RdRp core, PB2-N-like domain, CBD domain, and the C-terminal domain comprise a large percentage of flexible loops, forcing these segments to behave more dynamically. This may be a natural mechanism of the protein to confer some flexibility for mediating the proper accommodation of the substrate/ligand molecule inside the pocket and accomplish the catalytic mechanism. Moreover, the predicted binding site residues and their corresponding structural motifs were stable within a ~2 Å fluctuation in the presence of ligands. The less flexibility in the RdRp core region indicated that the ligands adopted a more stable conformation and established consistent interactions with the important surrounding residues.

#### 2.5.3. Radius of Gyration (Rg)

Furthermore, Rg analysis was used to assess structural equilibrium and protein compactness over the simulation time. Rg is employed to investigate whether the binding of the compounds affects the overall structural stability of the receptor enzyme. Lower Rg indicates the tight packing of the enzyme atoms and less effect of the compounds on the enzyme structure upon binding, thus presenting the stable nature of the subject complex. The Rg of the complexes follows; A-317491-L (maximum, 82.41 Å; mean, 76.76 Å), Khasianine-L (maximum, 76.76 Å; mean,70.16 Å), VER155008-L (maximum, 83.67 Å; mean, 75.05 Å), Monensin-L (maximum, 83.64 Å; mean, 76.43 Å), Mycophenolic acid-L (maximum,

85.40 Å; mean, 77.29 Å), and Ribavirin-L (maximum, 83.76 Å; mean, 74.09 Å) (Figure 6C). All six complexes are quite stable and remain compact. These Rg results complement the RMSD result in interpreting the docked complex stability.

#### 2.5.4. B-Factor Analysis

B-factors were also derived from simulation trajectories to probe highly mobile regions of the complexes. The B-factor monitors the thermal motion of protein atoms, side chains, and whole regions. The B-factor is commonly used to identify internal protein motions, probing rigid and flexible regions important in proteins/enzyme functionality. The average values of the B-factor for the top three complexes A-317491-L, VER155008-L Khasianine-L, and control drugs Monensin-L, Mycophenolic acid-L, and Ribavirin-L were: 201.68 Å<sup>2</sup> , 210.62 Å<sup>2</sup> ,187.20 Å<sup>2</sup> , 183.67 Å<sup>2</sup> , 155.86 Å<sup>2</sup> , and 139.35 Å<sup>2</sup> , respectively (Figure 6D). It demonstrates that these complexes have good stability throughout the 50 ns simulation period. From the simulation results, it can be concluded that all the studied systems are structurally stable, and the intermolecular interactions remained strong throughout the simulation time.

#### *2.6. MMGBA/PBSA Analysis*

Binding free energies were estimated using MMPBSA and MMGBSA techniques to understand better the compounds' binding potential with the L protein. Table 3 shows the detailed binding energies of the complexes [101,102]. As all binding interactions are energetically favorable, stable complexes are formed. Gas-phase energy dominates the system energy in all complexes, with van der Waals playing a significant role, while electrostatic energy plays a minor role. This reflects that both van der Waals and electrostatic interactions are key in intermolecular stability. The polar solvation energy is unfavorable in binding, whereas the nonpolar energy appears to be favourable in complex equilibration. The complexes' MMGBSA net binding-energy rankings were as follows: A-317491-L > VER155008-L > Khasianine-L > Mycophenolic acid-L > Monensin-L > Ribavirin-L. The complexes' MMPBSA net binding-energy rankings were as follows: A-317491-L > Khasianine-L > VER155008-L > Monensin-L > Mycophenolic acid-L > Ribavirin-L. These results indicate that our compounds have high binding free energies than control drugs.


**Table 3.** Binding free energy of docked complexes. The energy values are given in kcal/mol units.


**Table 3.** *Cont.*

#### **3. Materials and Methods**

#### *3.1. Homology Modeling*

Drug discovery relies heavily on homology modeling, with current efforts resulting in unprecedented accuracy models, even with low sequence identity [32,103–107]. Because L protein lacked a crystal structure, homology modeling was required to determine target structure elucidation. The L protein sequence was obtained from the UniProt database (P27316 (L\_RVFVZ)) to perform homology modeling [108]. Sequence to Structure alignment was performed using PSI-BLAST against Protein Data Bank (PDB) to find suitable templates for the three-dimensional (3D) structure prediction [109,110]. The initial alignment between the target and the template was generated using the ALIGN2D module. The final model was built using a restrained-based approach in Modeller.v9.11 with the most-fitted template and secondary structural information obtained by manual curation after superimposition between all generated models and the template [111]. Modeller's secondary structure module was used to model the final structure using the extracted spatial secondary structure restraints.

#### *3.2. Structure Validation*

Accurate evaluation of the 3D model is considered one of the core elements of computational structure prediction. The emergence of rapidly endorsed and highly efficient approaches for structure evaluation has paved new ways to assess the quality of newly designed models [112,113]. Ramachandran plots are a quick and easy way to evaluate the quality of a 3D structure. The Ramachandran plot was used to determine the energetically permissible and prohibited phi (φ) and psi (ψ) dihedral angles of amino acid residues [114]. The 3D structure was further analyzed by PDBsum [56]. PDBsum is a database that summarises the contents of 3D macromolecular structures. In addition, the predicted 3D structure was refined and validated by using 50 ns MD simulations.

#### *3.3. Target Protein Preparation*

3D structure of L protein predicted from Modeller and validated with 50 ns MD simulations were prepared using the AMBER 18 program [115]. The ff14SB force field [116] was utilized to parameterize amino acids. AMBER18's tleap module was used to add complementary hydrogen atoms missed by crystallography. Energy minimization of the target protein was performed first for 1000 steepest descent steps and then for 500 conjugate gradient steps, allowing the step size to be 0.02 Å. After MD clustering, the most stable conformation was selected from the cluster (with RMSD < 1 Å) for further processing.

#### *3.4. Compound Preparation*

The Selleckchem Nucleoside Analogue Library (https://www.selleckchem.com/screening/ Nucleoside-Analogue-Library.html, accessed on 15 February 2021) was used to identify molecules with the highest binding affinity to the L-protein. The library composed of 156 natural compounds was downloaded in 3D SDF format. A recent study reported

Monensin, Mycophenolic acid, and Ribavirin as potent inhibitors of RVFV-L protein [79]. To get an insight into the mechanism of the binding of known inhibitors, these three potent inhibitors were docked as controls in the present study. These drugs were retrieved from the PubChem database (https://pubchem.ncbi.nlm.nih.gov, accessed on 15 February 2021) in 3D SDF format [117].

#### *3.5. Structure-Based Virtual Screening*

The AutoDock Vina implicated in the PyRx (version 0.8) was used to perform the virtual screening of the compounds against the target protein [118,119]. The control drugs and chemical library of 156 compounds in SDF format were imported in the Open-babel program of PyRx for further energy minimization, followed by conversion of all the ligands into AutoDock PDBQT format. Further, the top three ligands were subjected to docking against the L protein of RVFV using AutoDock Vina. The grid box was set to cover the entire binding site within the core chamber of RdRp with box\_size of (x = 20, y = 20, z = 20 Å) and dimensions of (x = 122.005, y = 105.326, and z = 137.824). Dockings were run in triplicates to ensure absolute consistency of the results. The docked solutions were clustered using an RMSD of 1 Å. The 3D structural alignment, visualization and analysis, and docking figures production were carried out using Discovery Studio 2019 software [120], PyMOL [121], and UCSF Chimera 1.14 [122].

#### *3.6. ADME and Toxicity Prediction Analysis*

The analysis of the pharmacological activity of drug candidates is a critical step in drug discovery [93]. Therefore, in silico prediction of pharmacophore properties of active compounds is critical for accelerating the drug development process. Absorption, distribution, metabolism, excretion, and toxicity (ADMET) are the most plausible drug-like properties to evaluate virtual hits. ADME analysis was carried out by submitting the compounds' canonical simplified molecular-input line-entry system (SMILES) to online servers; ADMETlab 2.0 [123] and Swiss ADME [124]. Acute oral toxicity was predicted by the Protox II web server [125].

#### *3.7. MD Simulations*

AMBER18 was used to perform MD simulations of the docked solutions [115]. The same MD simulation protocol was adopted in the current study as described in previous studies [22,94,126–128]. The Antechamber package of AmberTools was employed to generate the general AMBER force field (GAFF) parameters for the studied ligands using AM1–BCC charge definitions. Each top complex was explicitly solvated with water molecules, and then counter ions were added to create a neutral system. A water box with a thickness of 12 Å was created using the TIP3P solvent model to encircle the complex [129,130]. The complex was simulated using periodic boundary conditions, with electrostatic interactions modeled using the particle mesh Ewald procedure [131]. For nonbounded interactions, a threshold value of 8 Å was defined during the procedure. Water molecules were minimized for 500 cycles, and then the entire system was minimized for 1000 rounds. The temperature of each system was then steadily increased to 300 K. Solutes in the first phase were restrained for 50 ps during equilibration of counterions and water molecules, while protein side chains were relaxed afterwards. At 300 K and 1 atm, a 50 ns MD simulation was performed, and coordinate trajectories were collected every 2 ps during the simulation under the NPT ensemble. While the SHAKE algorithm [132] constrained covalent and hydrogen bonds, Langevin dynamics [133] was used to control the system temperature. The original structure was used as a reference, and AMBER's CPP-TRAJ [134,135] was used to generate an RMSD (root mean square deviation) plot to assess the system's MD simulation convergence [136]. The structural flexibility of ligands was calculated using ligand RMSD. The radius of gyration (RoG) was analyzed to determine the complex's compactness and three-dimensional packing. RMSF reflects the root mean square averaged distance between an atom and its average geometric position [137]. The Term

β–factor, which is closely related to the RMSF, measures the spatial displacement of atoms around their mean positions, as well as a thermal and local vibrational movement [138].

#### *3.8. MMGBA/PBSA Analysis*

AMBER18 MM/PBSA and MM/PBSA methods were used to estimate the binding free energy (G binding) of the complexes [101,139]. The method has been well-documented in our previous studies, implemented in AMBER 18. The free energy difference was calculated using 100 snapshots from simulated trajectories at regular intervals. The total binding free energy is calculated as a sum of the molecular mechanics binding energy (∆EMM) and solvation free energy (∆Gsol) as follows:

$$
\Delta \mathbf{E}\_{\rm MM} = \Delta \mathbf{E}\_{\rm int} + \Delta \mathbf{E}\_{\rm ele} + \Delta \mathbf{E}\_{\rm v\rm dw}
$$

$$
\Delta \mathbf{G}\_{\rm sol} = \Delta \mathbf{G}\_{\rm P} + \Delta \mathbf{G}\_{\rm np}
$$

$$
\Delta \mathbf{G}\_{\rm total} = \Delta \mathbf{E}\_{\rm MM} + \Delta \mathbf{G}\_{\rm sol}
$$

$$
\Delta \mathbf{G}\_{\rm bind} = \Delta \mathbf{E}\_{\rm MM} + \Delta \mathbf{G}\_{\rm sol} - \mathbf{T} \Delta \mathbf{S}
$$

In these equations, ∆EMM is further divided into internal energy (∆Eint), electrostatic energy (∆Eele), and van der Waals energy (∆Evdw), and the total solvation free energy (∆Gsol) is contributed by the sum of polar (∆Gp) and non-polar (∆Gnp) components. ∆Gbind is the free energy of binding of the ligand evaluated after entropic calculations (T∆S). We excluded entropic term, T∆S, from the calculation due to its intensive computational cost [101,102,140,141]. These methods have been well documented [101,128,140] and considered reliable end-point binding free energy estimations [102,141].

#### **4. Conclusions**

A combination of virtual screening and molecular docking analysis was used to identify inhibitors of RVFV L protein in the present study. The 3D structure of the RVFV-L enzyme was predicted, and the findings could aid researchers working on RVFV drug development. The Selleckchem Nucleoside Analogue Library (156 compounds) was screened, and the top three compounds, A-317491, VER155008, and Khasianine exhibit a high binding affinity compared to the control drugs chosen that may inhibit RVFV-L enzyme activity and hence virus replication. Compounds also show good pharmacodynamic and pharmacokinetic profiles based on toxicity and ADME studies. Overall, the results reveal that all of the compounds fit into the same binding pocket of the protein, suggesting that they could be good therapeutic candidates for RVFV. However, further in-vivo and in-vitro experiments are required to convert these potential inhibitors into clinical drugs. We anticipate that the findings of this study will be useful in the future for developing and exploring novel natural anti-RVFV therapeutic agents.

**Supplementary Materials:** The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/ph15060659/s1, Figure S1. Sequence alignment of RVFV L protein with the templates (Structure of the alpha-Synuclein (PDB ID: 6I42 and severe fever with thrombocytopenia syndrome virus L protein (PDB ID: 6Y6K). Figure S2. Structural superposition of the RVFV L protein model with the templates. The RMSD values between the model and 6I42 and 6Y6K are 0.179 and 1.165 Å, respectively. Table S1. Reverse template comparison of RVFV L protein model with the representative structures in PDB using ProFunc reverse template search program. Table S2. Virtual Screening result of Selleckchem Nucleoside Analogue Library (compounds with binding affinity > −1.5 kcal/mol) against model of RVFV-L RdRp core.

**Author Contributions:** Conceptualization, U.A.A., M.T.u.Q., S.A. and K.S.A.; data curation, M.T.u.Q., F.S., S.A., E.A.A., G.M.A. and A.A.; formal analysis, M.A.A., M.U.M., M.M.A., U.A.A. and S.A.; funding acquisition, K.S.A.; investigation, M.A.A., M.U.M., M.M.A., F.S., E.A.A. and G.M.A.; methodology, M.A.A., M.U.M., M.M.A., U.A.A., M.T.u.Q., S.A., E.A.A., G.M.A., K.S.A. and A.A.; project administration, M.T.u.Q. and K.S.A.; resources, K.S.A.; software, M.A.A., M.U.M., M.M.A. and S.A.; supervision, U.A.A. and K.S.A.; validation, U.A.A., M.T.u.Q., F.S., S.A., E.A.A., G.M.A., K.S.A. and A.A.; visualization, M.A.A., M.U.M. and M.M.A.; writing—original draft, M.A.A., M.U.M., M.M.A. and F.S.; writing—review and editing, U.A.A., M.T.u.Q., S.A., E.A.A., G.M.A., K.S.A. and A.A. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data presented in this study are available within the article and Supplementary Material.

**Acknowledgments:** The researchers would like to thank the Deanship of Scientific Research, Qassim University for funding the publication of this project.

**Conflicts of Interest:** The author declares no conflict of interest.

#### **References**


### *Review* **Drug Design by Pharmacophore and Virtual Screening Approach**

**Deborah Giordano 1 , Carmen Biancaniello 2 , Maria Antonia Argenio <sup>1</sup> and Angelo Facchiano 1, \***


**Abstract:** Computer-aided drug discovery techniques reduce the time and the costs needed to develop novel drugs. Their relevance becomes more and more evident with the needs due to health emergencies as well as to the diffusion of personalized medicine. Pharmacophore approaches represent one of the most interesting tools developed, by defining the molecular functional features needed for the binding of a molecule to a given receptor, and then directing the virtual screening of large collections of compounds for the selection of optimal candidates. Computational tools to create the pharmacophore model and to perform virtual screening are available and generated successful studies. This article describes the procedure of pharmacophore modelling followed by virtual screening, the most used software, possible limitations of the approach, and some applications reported in the literature.

**Keywords:** structure-based pharmacophore modeling; ligand-based pharmacophore modeling; virtual screening; drug discovery; bioinformatics; computational biology

#### **1. Introduction**

Computer-Aided Drug Discovery (CADD) investigates molecular properties to develop novel therapeutic solutions by way of computational tools and data resources. In its broadest meaning, it includes computational approaches for designing or selecting compounds as potential candidates before they are synthesized and tested for their biological activity [1]. Bioinformatics and computational tools offer an in silico approach to reducing costs and times, i.e., the factors that influence the progress of the research and, in the specific field of drug development, limit the possibilities of fighting more pathologies. To date, in vitro screening is expensive and time-consuming, and alternatives are highly desirable. Virtual Screening (VS) is a CADD method that involves in silico screening of a library of chemical compounds, to identify those that are most likely to bind to a specific target [2]. In this way, it is possible to reduce the impact of these limiting factors on drug discovery, meeting the needs due to health emergencies, as well as the spread of personalized medicine. This process can be speeded up using pharmacophore models used as the query with which compound libraries can be searched to pull out molecules of interest with desired properties. Indeed, pharmacophore-based methods are widely used tools in CADD and are of great interest in the chemo-informatics field [2], since they find many applications in drug discovery projects including not only the virtual screening but also scaffold hopping, lead optimization, ligand profiling, target identification, multi-target drug or de novo drug design.

The concept of a pharmacophore was coined in the 19th century when Langley first suggested that certain drug molecules might act on particular receptors. Only later, with the discovery of Salvarsan by Paul Elrich, the selectivity of drug–target interactions was

**Citation:** Giordano, D.; Biancaniello, C.; Argenio, M.A.; Facchiano, A. Drug Design by Pharmacophore and Virtual Screening Approach. *Pharmaceuticals* **2022**, *15*, 646. https://doi.org/10.3390/ph15050646

Academic Editor: Osvaldo Andrade Santos-Filho

Received: 30 April 2022 Accepted: 21 May 2022 Published: 23 May 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

recognized. Several years of experimentation confirmed its therapeutic effect. This discovery had found support in the statement of Emil Fisher who, following his research, had defined the concept "Lock & Key" in 1894, according to which a ligand and its receptor fits like a key with its lock to interact with the on top of each other through a chemical bond [3].

Hence, historically the term "pharmacophore" was used to indicate the functional or structural capacity of a compound with specific characteristics towards a biological target [4].

Schueler provided the basis for our modern understanding of a pharmacophore, defined by the International Union of Pure and Applied Chemistry (IUPAC) as "the ensemble of steric and electronic features that is necessary to ensure the optimal supra-molecular interactions with a specific biological target structure and to trigger (or to block) its biological response" [3,5,6].

Pharmacophoric modelling is based on the theory that having common chemical functionalities, and maintaining a similar spatial arrangement, leads to biological activity on the same target. The chemical characteristics of a molecule capable of creating interactions with its ligand are represented in the pharmacophoric model as geometric entities such as spheres, planes and vectors. The most important pharmacophoric feature types are: hydrogen bond acceptors (HBAs); hydrogen bond donors (HBDs); hydrophobic areas (H); positively and negatively ionizable groups (PI/NI); aromatic groups (AR); and metal coordinating areas (Figure 1). Additional size restrictions in the form of a shape or exclusion volumes (XVOL)—forbidden areas—can be added to represent the size and the shape of the binding pocket [7].

**Figure 1.** Pharmacophoric features. Main pharmacophoric feature types are represented by geometric entities and include: 1—hydrogen bond acceptor (HBA), 2—hydrogen bond donor (HBD), 3—negative ionizable (NI), 4—positive ionizable (PI), 5—hydrophobic (H), 6—aromatic (AR), 7—exclusion volume (XVOL).

Since the models themselves do not focus on actual atoms, but on chemical functionalities, they are good tools in recognizing similarities between molecules. Pharmacophore activity is independent of the scaffold, and this explains why similar biological events can be triggered by chemically divergent molecules. The use of these models as a query allows performing searches in the large libraries of compounds made available on the computational platform in order to select molecules of interest for the next vs. or in the chemo-informatics field [7].

Pharmacophore models can be generated using two different approaches depending on the input data employed for model construction, which are, namely, "structure-based" and "ligand-based" pharmacophore modelling. The structure-based approach uses the structural information of the target proteins like enzymes or receptors, to identify compounds that can potentially be used as a drug. On the other hand, the ligand-based approach consists of the development of 3D pharmacophore models and modelling quantitative structure-activity relationship (QSAR) or quantitative structure-property relationship (QSPR), using only the physicochemical properties of known ligand molecules for drug development. The choice of the best approach to use depends on several factors such as data availability, data quality, computational resources and also the intended use of the generated pharmacophore models. In the following paragraphs, we describe both strategies and their implementation in virtual screening is provided below to guide the nonexperts on this topic in their application by explaining the basic concepts these methods are based on.

#### **2. Structure-Based Pharmacophore Modelling**

The structure-based approach owns its name to the fact that the three-dimensional structure of a macromolecule target is the essential prerequisite to obtaining a structurebased pharmacophore. The 3D structure of a protein provides significant details at the atomic level that can be very useful for the design or discovery of new drugs. As previously outlined, a pharmacophore is an abstract picture showing the stereo-electronic features, which make a ligand bioactive toward a specific target, and this type of information can be extracted from the 3D structure of the protein target in its holo or apo form. Typically, the workflow of the structure-based approach (Figure 2, left side of the flow) consists of different steps: protein preparation, identification or prediction of ligand binding site, pharmacophore features generation, and the selection of relevant features for ligand activity [8,9].

The 3D structure of the target or the ligand–target complex is the required starting point of this methodology. The RCSB Protein Data Bank (PDB) [www.rcsb.org] (accessed on 20 May 2022) includes thousands of protein structures at high resolution alone or in the presence of a bound ligand, mainly solved by X-ray crystallography or NMR spectroscopy. In case experimental structural data are lacking, computational techniques such as homology modelling [10] may be an alternative strategy to retrieve a 3D model and machine learning-based methods are successfully applied to retrieve protein structures, a powerful example is ALPHAFOLD2 [11]. Molecular docking is another method to study in silico the interaction of a known active compound towards a specific receptor and derive protein-ligand complexes from the most favorable binding poses [12].

A practical tip for those who want to use this approach is to perform a deep analysis of the quality of input data before using the target structure since it directly influences the quality of the pharmacophore model. For this reason, the protein structure preparation is the first step to carry out by evaluating, for instance, the residues' protonation states, the position of hydrogen atoms (that are absent in X-ray solved structures), the presence of non-protein groups which may have or not some functional roles, eventual missing residues or atoms, the stereochemical and energetic parameters accounting for the general quality and biological-chemical sense of the investigated target, also in the case of experimentally solved structures that may contain errors [8].

**Figure 2.** Pharmacophore modeling workflow. A pharmacophore model can be generated using two approaches. In the structure-based approach (on the left side), the structural information of the macromolecular target, alone or bounded to a ligand, is evaluated for its correctness and used to analyze the ligand binding sites in order to derive and select ligand complementary characteristics composing the pharmacophore models. In the ligand-based approach (on the right side), the first step is the construction of a training set made up of known active ligands structures. From each of them, conformers are generated and then used to extrapolate common features that are combined in different pharmacophore models. The resultant models undergo a validation and refinement procedure to obtain the final models exploitable for practical applications.

Once the target structure is identified and critically evaluated, ligand-binding site detection is the next crucial step. This can be manually inferred by analyzing the area including residues suggested to have a key role from experimental data such as sitedirected mutagenesis or X-ray structures of the protein co-crystallized with ligands, but it also requires time and expert knowledge of the target and ligands. Nowadays this purpose can be easily and quickly achieved using bioinformatics tools based on different methods which inspect the protein surface to search for potential ligand-binding sites according to various properties (evolutionary, geometric, energetic, statistical, or a combination of them), and which are particularly useful if the ligand information is not available [13]. Examples of computer programs developed for this purpose are GRID [14] and LUDI [15]. The first one, as suggested by the name, is a grid-based method that uses different molecules or functional groups to sample a specific protein region defined with a regular grid to identify grid points with which they make energetically favorable interactions, thus generating molecular interaction fields [14]. The second one predicts potential interaction sites using the knowledge from distributions of non-bonded contacts in experimental structures or geometric rules [15].

Since the goal of the structure-based method is to generate a pharmacophore model from the interactions between an active molecule (the ligand) and its protein target (enzyme or receptor), the characterization of the ligand-binding site can be used to derive a map of interaction and to build accordingly one or more pharmacophore hypotheses describing the type and the spatial arrangement of the chemical features required for a ligand to interact with the residues of the binding region. Initially, many features are detected with this approach and therefore only those that are essential for ligand bioactivity should be selected and incorporated into the final model to have a more reliable and selective pharmacophore hypothesis [9,16]. This operation can be accomplished in different ways, such as removing features that do not strongly contribute to the energy binding, identifying the most conserved interaction if multiple protein–ligands structures exist, preserving residue with key functions from sequence alignments or variation analysis, and incorporating spatial constraints from the receptor information [8].

If the structure of a protein–ligand complex is available, the pharmacophore features generation and selection can be achieved more accurately. The 3D information of the ligand in its bioactive conformation directs the identification and the spatial disposition of the pharmacophore features in correspondence with its functional groups directly involved in the interactions with the target. Moreover, the presence of the receptor allows accounting for spatial restrictions from the binding site shape through the addition of the exclusion volumes. In this ideal situation, pharmacophore models of high quality can be obtained. In case the available structural data do not include a bound ligand, the pharmacophore modelling depends only on the target structure which can be analyzed to detect all possible ligands interaction points in the binding sites and, therefore, to compute the complementary features that a ligand should match to potentially bind the receptor. However, in the absence of a ligand counterpart, several pharmacophoric features are calculated and approximately positioned resulting in less accurate models that should be manually refined [3,16,17].

#### **3. Ligand-Based Pharmacophore Modelling**

So far, we have mentioned the importance of the availability of the structure for modelling pharmacophores suitable to accommodate the best interaction features. However, even if this still represents one of the best strategies for modelling a pharmacophore, often the target molecule is not available. To overcome this problem, an alternative approach to structural-based modelling is represented by ligand-based pharmacophore modelling.

The starting point of this strategy is the knowledge of active compounds, which bind the same protein target with a similar orientation [4] that could allow the extraction of chemical features in common and, therefore, the pharmacophore construction. Due to the unknown bioactive conformations of the input compounds, a critical step, before the extraction of the shared features, is the generation of ligand conformers, so that, from each set of conformers, at list one should correspond to the bioactive conformation of the ligand [3].

In more detail, two datasets are necessary to generate a ligand-based pharmacophore, i.e., a training set and a test set. The training set composition differs according to the algorithm employed and the data availability, varying from the simplest, composed of at least two active compounds, to the most complex, for which it is possible to set multiple compounds with a different activity. The test set, instead, should contain as many active and inactive structurally different compounds as possible, preferring the inclusion of experimentally confirmed inactive compounds and compounds judged as active by experimentally proven direct interaction and with suitable activity cutoffs (i.e., low EC50/IC50 and high binding affinity values). In the case of unknown inactive ligands, decoy molecules could be used [6], which are available from a specific repository such as DUD (Directory of Useful Decoys) [18] or by generating them by service as DUD-E [19]. In addition to what could be extrapolated from pertinent literature, the source of molecules to be employed in the construction of the dataset could be chemical databases such as OpenPHACTS [20], ChEMBL [21] or Drugbank [22], for compounds with a target-based activity, ChEMBL, PubChem Bioassay [23] and Tox21 [24] for research on the inactive molecules. In principle, all the algorithms appointed to the ligand-based pharmacophore modelling, starting from the active compounds of the training set, extrapolate from them common chemical functionalities, related for instance to the presence of hydrogen bond donors or acceptors, of a positively or negatively charged ionizable group, of aromatic ring systems or hydrophobic areas, and use these features for model design. This step is followed by an improvement of the pharmacophore model including shape features and spatial constrains which take care of specific regions occupied by inactive molecules [25]. The pharmacophore models obtained are ranked by the best ability to fit the active molecules but not the inactive ones. The best models obtained from this first phase are chosen for the validation and refinement using the test set. The purpose of this second phase is to evaluate if the generated models can fit most of the active molecules rejecting the inactive ones, to select only models which show the major sensitivity and specificity. A summary of the described process is described in Figure 2 (right side of the flow).

Despite the accuracy of the pharmacophore construction being highly dependent on the quality of the training set, its final composition is conditioned by the algorithm used for modelling [7]. In general, ligand-based pharmacophore modelling could be performed following two main strategies: the common feature pharmacophore and the QSAR based pharmacophore.

In the common feature approach, generally, an algorithm of alignment is involved, its purpose is basically to extrapolate and align specific and common features: starting from a defined feature for each active element the algorithm combines it with the corresponding features of the other molecules, from this alignment a novel feature, combination of the previous, is generated; reiterating this procedure for each set of overlapping features a pharmacophore of shared feature is built. This is for example the case of the espresso algorithm implemented in LigandScout [26,27], which achieved the pharmacophore modelling by dividing the procedure previously described into six steps exploiting various algorithms and two alignment methods: the first, based on a combination of shared pharmacophore feature and the second, based on a merged pharmacophore feature. Finally, the common features present in all training compounds are included in the common features models and all features of the aligned compounds are summated in a merged feature pharmacophore [28]. Another example of similar functionality is given by the HipHop algorithm provided by DiscoveryStudio [29], which builds several pharmacophore models starting from identifying the common chemical features arrangement of the training set structures. Beginning with a few groups of features, which are extended until no more shared configuration features are found, several models are obtained and ranked based on the fitness of the training compounds. Models are refined by the HipHop Refinement algorithm, which adds space restrains imposed by the structures of inactive compounds.

QSAR (Quantitative structure-activity relationships) based pharmacophore is a mathematical model that tries to find statistical correlations between structures and functions, quantifying the impacts in a biological activity of specific structural modifications in existing or predicted molecules [30]. Based on the idea that molecules with similar physico-chemical properties show a similar binding affinity for a protein target, nowadays the QSAR approach is used in drug discovery to build statistical models derived from these correlations exploiting for the prediction of the bioavailability, the ADMET profile (toxicity), adsorption, distribution, metabolism, binding affinity, biological activity and elimination of novel compounds [31].

There are six classifications of the QSAR approaches, from the 1D to the 6D-QSAR, according to the different dimensions of the method (Figure 3).

**Figure 3.** QSAR classifications. The QSAR approaches are classified from the 1D to the 6D-QSAR according to the different dimensions of the descriptors implemented in the method.

1D-QSAR takes into account a single physico-chemical property of the ligand, for an example the pKa value. In the 2D-QSAR, instead, affinity is correlated with structural patterns, while in the 3D-QSAR with the 3D structure of the ligand and its interactions. The concept extends as dimension rises: 4D-QSAR incorporates an ensemble of ligand configurations in 3D-QSAR, 5D-QSAR adds to 4D-QSAR various induced-fit models, 6D-QSAR implements 5D-QSAR with different solvation models [25].

#### *Pharmacophore Models Validation*

Once one or more pharmacophore models have been computed, a validation step is crucial before their implementation for practical purposes. Pharmacophore validation could be performed by exploiting several methods, such as the goodness of hit list (GH), receiver operating characteristic (ROC) curves construction, Fischer's method, or other statistical analysis, which relies on screening a test set and decoy set (if needed) to evaluate the model ability to distinguish active and inactive molecules and provide an estimation of its quality [9].

Mainly, the quality of a model can be described by four parameters: the sensitivity (capacity to detect active compounds), the specificity (capacity to exclude the inactive molecules), the yield of actives (the ratio between true positives and the number of hits) and the enrichment factor (which relates the yield of actives to the composition of the screening dataset) [7].

The GH scoring method consists of calculating the percentage of sensitivity, the percentage of the yield of actives the enrichment factor and the Güner–Henry score, which gives an evaluation of the efficiency of the screening dataset search and can vary from 0 to 1 where 1 is the value for the ideal model [32]. The ROC curve shows the enrichment power of a model plotting the sensitivity against 1—specificity (the false positive rate). The area under this curve (AUC) gives a measure of the pharmacophore's performance and it is useful for multiple models evaluation. AUC can vary from 0 to 1, where 1 corresponds to an ideal case in which all the active compounds are detected at first, 0 to the collection of the inactive ones at first and 0.5 to random results [33]. Fisher's randomization test validation method is instead used to analyze the significance of the statistical correlation between structure and biological activity [34]. Regression analysis can also be applied to check for the correlation between the compounds predicted as active and those whose bioactivity is experimentally confirmed.

In case the validation process reveals low-quality results, manual refinement could be a way to improve its performance by applying some changes in chemical features' selection or definition settings, training or test sets composition or in the pharmacophore building procedure, but a new model generation with a different approach and algorithm should not be excluded.

#### **4. Pharmacophore-Based Virtual Screening**

Among the strategies for the identification of new bioactive substances, VS techniques play a prominent role. VS tools are important in drug discovery as they increase the speed of the bioactive molecule discovery process through computational simulations by selecting from large libraries the compounds that are most likely to interact with the identified target. In addition, VS identifies compounds that may be toxic or have unfavorable pharmacodynamic (for example, potency, affinity, selectivity) and pharmacokinetic (for example, absorption, metabolism, bioavailability) properties [35].

In this context, pharmacophore models can be successfully applied to filter large collections of compounds to find the so-called hits, i.e., novel molecules matching the pharmacophoric features required to be potentially active against a specific target. Since a pharmacophore does not represent exact chemical groups but chemical functionalities and their spatial relationships, the retrieved hits usually include structurally different compounds, making pharmacophores useful tools for scaffold hopping [7].

The pharmacophore-based screening can be performed directly using some of the software already mentioned for the pharmacophore generation, which allows doing this on a manually created dataset. Other useful open-access platforms for virtual screening are Pharmit (http://pharmit.csb.pitt.edu) (accessed on 20 May 2022) and ZINCPharmer (http://zincpharmer.csb.pitt.edu/) (accessed on 20 May 2022). The first one allows the user to import a predefined pharmacophore query or elucidate pharmacophore and shape queries from the receptor and/or ligand structures, and to use it to screen a desired dataset among a set of provided compounds libraries or a personal one. The results are quickly computed and classified according to different criteria such as energy minimization [36]. ZINCPharmer is a user-friendly online web server, which, exploiting the Pharmer research technology, allows screening of the purchasable molecules present only in the ZINC database using a pharmacophore model imported from other tools or directly derived from ligands or structures [37].

In general, once the model to use has been defined, the first step in virtual screening is to consult a database that contains a large number of compounds annotated with information about the 3D structure, known target, and in some cases the purchasability. Some of the free databases include the Protein Data Bank (PDB) [38], PubChem [39], ChEMBL [21], Zinc [40], and Drugbank [22]. Moreover, there are some commercially available databases such as the MDL Drug Data Report. Usually, due to the presence of a multitude of compounds in these repositories, the strategy adopted by many researchers is to filter the data by applying different parameters to reduce the computational cost of the pharmacophore searching. This may be obtained through the exclusion of compounds that are too big for the ligand pocket, the use of Lipinski's Rule of Five or standard metrics for lead-likeness, and removing compounds deemed to be pan-assay interference compounds (PAINS) [35,41]. In addition, a combination of filtering for desired pharmacological and adsorption, distribution, metabolism, excretion, and toxicological (ADMET) properties is advisable to be applied early in the virtual screening process [42]. These filters should be considered suggestions, not mandatory, and they can be applied depending on the specific case. As an example, the Rule of Five is not an absolute rule, as many drugs go under the "beyond the rule of five" [43]. To efficiently filter a library of compounds against such criteria, several online tools have been developed. For example, the commercial software ACD/LogD Suite (https://www.acdlabs.com/products/percepta-platform/physchem-suite/logd/) (accessed on 20 May 2022) by Advanced Chemistry Development (ACD/Labs) can be used to predict ADME-related properties including hydrophobicity, lipophilicity and pKa, while Pharma Algorithms provided a suite of products via ADME Boxes (the company joined ACD/Labs and the ADME Suite is now available at https://www.acdlabs.com/ products/percepta-platform/adme-suite/) (accessed on 20 May 2022) that addresses issues such as solubility, oral bioavailability, absorption and distribution [1]. It is necessary to underline that in some protocols reported in the literature, the evaluation of toxicity and of the ADME profile is made only after molecular docking. Therefore, it is in the choice of the researcher the use of a protocol that adds a preliminary filter, thus reducing the number of molecules on which to apply further analysis or proceed on a larger number of molecules to identify potential ligands that, if not usable directly because of their toxicity, could serve as the basis for the synthesis of new molecules. If the structure information of the target is available, the retrieved pool of compounds can be further selected by molecular docking with programs such as DOCK [44] or Autodock (http://autodock.scripps.edu/) (accessed

on 20 May 2022) [45], to have a preliminary evaluation of binding affinities and exclude molecules with very low values, as well as to obtain preliminary protein-ligand complexes. Molecular docking is a computational process where ligands are moved in 3D space to find a configuration of the target and ligand that maximizes the scoring function [1]. With docking, therefore, ligand–target complexes are obtained and can be evaluated more in depth, by specific methods, to verify if these compounds can be used subsequently for in vitro experimentation. One of these evaluation methods is the molecular dynamics (MD) approach. It is the pivotal theoretical approach, which can be utilized to gain molecular insight into the stability of the binding pose of the screened molecules in the active site [46], giving a more accurate evaluation of their binding affinity. At the end of the virtual screening, the selected compounds matching the desired properties and giving the best results must undergo an in vitro validation process to verify if indeed the results obtained in silico are reliable. Everything said up to now represents a typical workflow followed in a virtual screening analysis and it is graphically depicted in Figure 4.

**Figure 4.** Virtual screening workflow. Basic filters and methods applicable for a virtual screening process. Scheme of the multistage vs. approach. The application of vs. follows a typical sequence of processes with a cascade of filters able to narrow down and choose a set of hits with potential biological activities.

Recently, MD simulations have been used to perform a thorough conformational search without any help of traditional docking procedures, with accurate all-atom force field and enhanced sampling techniques, obtaining accurate results for both binding modes and binding affinities [47,48].

Finally, it is necessary to also mention the free energy calculations used to obtain protein–ligand binding affinity by MD simulations in conjunction with molecular docking [49–52].

#### **5. Limitations and Possible Solutions**

As shown, pharmacophore modelling is a very useful tool for many applications in drug discovery or drug design. However, despite its strength, this method is not free of limitations that have implications in virtual screening increasing the rate of false positives and/or false negatives, and that should be in mind to find proper solutions, when possible, and to have a critical view of the results.

As explained in the previous paragraphs the quality of a pharmacophore model is strictly related to that of the input dataset and the algorithm used for the modelling. Indeed, the quality of the structures used for pharmacophore building, refinement, and

validation greatly affect its reliability, and to overcome this limit many authors suggest a manual inspection of the structures used and of the annotations regarding the biological data associated with them. For what concerns the choice of the algorithm, it is important to underline that different algorithms can give different results with no overlap starting from the same macromolecular target. This could be attributed to the different screening methodology of the algorithm, but also to the specific screening dataset used among the several publicly available screening platforms [7]. Therefore, an integration of different tools is recommended to cover a larger chemical space and obtain different chemotypes.

Moreover, whether the structure-based or the ligand-based approach is used, both present some intrinsic limitations. In particular, flexibility is the main drawback of the structure-based strategy. As already explained, a structure-based pharmacophore derives from the 3D structure of a macromolecule target, ideally complexed with an active compound. A single structure represents a static image reproducing a single binding mode of a ligand towards its target without taking into account the dynamic nature of the system due to receptor and ligand conformational flexibility [53]. This lacking information results in pharmacophore models that may be defective in some features that could be relevant for the binding of different ligands and, therefore, in retrieving new potential hits. If experimental structures of the same protein target complexed with different ligands exist, it is possible to reduce this limit by using them to extract and merge features responsible for ligands interactions into a more refined model. However, this does not often happen and an alternative way to handle the flexibility consists in employing computational techniques such as flexible docking and/or MD simulations of the investigated system. Dynamic pharmacophores can be computed from protein-ligand MD trajectories and have been successfully applied in recent studies showing a better performance in the virtual screening of bioactive compounds compared to the classic rigid approach [54–56]. Nevertheless, this methodology needs higher computational resources (a problem that is less relevant due to the increasing availability of computing resources, and in case of application to a small set of selected ligands) and expert knowledge since it produces many models that should be averaged or clustered to derive more representative pharmacophores [57].

Another common problem affecting the structure-based approach is the detection of a high number of unprioritized pharmacophoric features, especially when apo-structures are the only available starting point. In this case, the derived pharmacophore model is too restrictive for virtual screening making the identification of structurally different compounds with similar characteristics difficult [17]. Given this, the reduction of chemical features is crucial to obtaining a reasonable and usable model. Some strategies to apply for this purpose have already been discussed, but they depend on available data and an expert manual intervention is often required.

Regarding the ligand-based method, the major limitation is the absence of receptor complementary information related to the interaction pattern with the binding pocket or even just about its shape. By its nature, the ligand-based approach is founded on all possible chemical features and geometric information inferable from the input compounds and their flexible alignment. However, more are the chemical features to be taken into account by the algorithm, more is the time cost of it. Therefore, this approach is limited to small compounds and simple chemical features by computational/time costs, generating a less tailored pharmacophore model [27].

A further complication for the ligand-based modelling is the low availability of inactive compounds, due to the lack in the literature of negative results. The use of a not enough number of molecules known as inactive for a specific protein target has an impact on the selectivity and sensitivity of the pharmacophore model obtained [6].

Above the individual deficiencies, another key point to place is that the pharmacophore represents a model of the chemical features but other variables that can influence the binding of active compounds to a specific target, such as chemical solubility, cell metabolism, membrane permeability, the particular toxicologic effect could not be considered or well-integrated on it; this could have a strong impact on the real capacity of the selected compounds to have a suitable biological effect on the target [7].

In general, even if some of these issues could be addressed by the use of accurate structures, training and test sets, the validity of the model obtained should be always compared to the statistical robustness of the input dataset, because no matter how complete the dataset employed could be, no model can be judged as universal [58].

Because of the things reviewed above, the simultaneous use of both strategies, when it is allowed, is advantageous since the weaknesses of one can be balanced by the strengths of the other. Moreover, their combination with multiple approaches at different stages, such as MD simulation, docking studies and machine learning approaches, could be a valuable resource to overcome the inherent limitations of the pharmacophore modelling, improving the chance to have more complete and reliable results.

#### **6. Software for Pharmacophore Modelling**

Several articles reported lists of software, databases and online tools for CADD applications [1,25,59]. Moreover, online services and catalogues provide links to many resources (see https://www.click2drug.org/ or https://bio.tools/) (accessed on 20 May 2022). In this paragraph, we report the most used and reliable, in our opinion, and findable at the time of the writing of this review. Several programs have been developed to perform the pharmacophore modeling in an automated way using different algorithms, such as LigandScout, Schrodinger-Maestro, MOE, and Discovery Studio. Among them, Ligand-Scout [60] is widely used for pharmacophores generation from a protein–ligand complex and performs a step-by-step ligand topology interpretation including aromatic ring detection, the assignment of functional group patterns, the determination of hybridization state and bond types, followed by ligand and binding site analysis to automatically detect and classify protein–ligand interactions into hydrogen bonding, ionic, aromatic and lipophilic contacts according to geometric constraints. The ligand features composing this set of paired interactions define a pharmacophore model which can be also modified by removing specific features [60]. The Schrodinger–Maestro suite allows the generation of e-pharmacophores, energetically-optimized pharmacophores using the Glide XP scoring function to prioritize protein-ligands interaction sites and shows good speed and performance in virtual screening [61]. MOE is an integrated computer-aided molecular design platform that offers many different services, from drug design to virtual screening and molecular simulations. Drug design by MOE allows exploiting different methods such as the structure-based design, the ligand-based design and the fragment-based discovery (www.chemcomp.com/MOE-Molecular\_Operating\_Environment.htm/) (accessed on 20 May 2022). Another commonly used software is Discovery Studio (BIOVIA, Dassault Systèmes), which allows deriving structure-based and/or ligand-based pharmacophores. In the first case, it offers three methods to build pharmacophores: structure-based, fragmentbased or receptor-ligand pharmacophores according to the available structural data. In absence of a bound ligand, this program implements the Interaction Generation protocol to create LUDI interaction maps for hydrogen and hydrophobic contacts with the binding site that can be manually selected or computationally computed. The interaction map is then converted into pharmacophore models consisting of different combinations of at least three features complementary to those of residue in the binding pocket in terms of chemical nature and location, and the resultant models are scored and ranked according to their target selectivity [62]. For ligand-based pharmacophore generation, besides the use of the HipHop algorithm, DiscoveryStudio gives also the opportunity to explore the QSAR approach by the HypoGen algorithm [63]. The purpose of this algorithm is to find a spatial 3D arrangement of the features shared by the training molecules whose activity on a specific target has been measured. To do so, it needs as input at least 16 compounds covering four orders of magnitudes: H-bond donors, H-bond acceptor, aromatic ring, hydrophobicity. The algorithm works in three phases: the constructive, the subtractive and the optimization. In the first phase, all the pharmacophoric conformations, which preserve at most five default features, are collected and pharmacophores preserving features shared by all the bioactive training compounds are generated; only pharmacophores fitting other active molecules are kept. In the second phase, pharmacophores fitting inactive molecules are removed; in the last, pharmacophore collection is optimized by simulating annealing algorithm. Only the pharmacophore models showing the highest score are given as output [7]. Other examples of 3D QSAR programs, which generate predictive models include PHASE [64], comparative molecular field analysis (CoMFA) [65] and comparative similarity indices analysis (CoMSIA) [66]. For a more exhaustive list, we remand to Gurung et colleagues' review of 2021 [26].

The software tools mentioned above are commercial but there are also academic programs freely accessible. For instance, Pocket v.2 is a structure-based pharmacophore program available either on the web (http://www.pkumdl.cn:8080/pocketv2.html) (accessed on 20 May 2022) or a stand-alone version that uses the Pocket module of LigBuilder (for de novo structure-based ligand design) to analyze the binding site of the target receptor by scored grids. The basic required input is the protein 3D structure in PDB format that is employed to generate pharmacophore models from receptor structures with or without a bound compound [67].

A user-friendly and freely-available web server, which exploited the common feature approach for ligand-based modelling, is represented by PharmaGist [68]. The main window of PharmaGist requires only the uploading of up to 32 input molecules in mol2 format (uploadable in single or in a unique zip file), with bonds angles and length already corrected and hydrogens atoms explicitly specified, the selection of the number of output of pharmacophores that will be showed, and an e-mail address where the link to the results, stored at least for a month, will be sent. Besides this standard research PharmaGist gives also the possibility to choose some advanced options as: i. the selection of a key molecule, by this option all the other molecules will be aligned to the compound selected, otherwise all compounds are selected as key compounds; ii. the minimum number of features in pharmacophore iii. the feature weighting options, by which it is possible to modify the weight of some features (aromatic ring, hydrogen bond donor/acceptor, anion/cation and hydrophobic properties) in the scoring function; iv. User-defined features, by which additional feature types can be defined by a supported feature file format created by the user. The output consists of several tables: the first gives a summary of the input molecules' features and the following summarized the results starting always with the highest scoring pharmacophore.

For those who are researching an easy to use and free available tool which exploits the QSAR method without requiring particular informatics skills, or who simply want to start learning about pharmacophore construction, the web portal 3d-qsar.com (accessed on 20 May 2022) represents a valid opportunity, being a platform hosting several applications which guide the user along the correct workflow necessary to build a pharmacophore model. The first is represented by Py-MolEdit which allows the correct compilation and uploading of the training and the test dataset in any format, giving also the possibility to draw your molecules. After using that tool, the platform gives the discretionary option to align your dataset by exploiting the Py-Align tool. At this point is possible to perform the building of a pharmacophore model with Py-CoMFA, obtained which, can be easily analyzed by Py-ConfSearch tool that performs a conformational analysis. Py-CoMFA allows the building and validation of the 3-D QSAR model with the same style as the original CoMFA software, which seems to reproduce published results with a very good agreement. The platform also provides the opportunity to make docking simulations (by Py-Docking), or to use a different QSAR algorithm (Py-ComBinE). An interface to the Protein Data Bank is also provided (by Py-PDB) [69].

#### **7. Examples and Case Studies**

Many examples in the literature present pharmacophore models obtained by QSAR and common feature pharmacophore approaches. In addition to articles and reviews published in the past that report on successful applications in pharmacophore modeling and virtual screening [70–74], we highlight some specific example of studies.

Schuster and colleagues performed the ligand-based pharmacophore modelling for selective and nonselective 11β-HSD1 inhibitors. Their pharmacophore models, after a virtual screening procedure, led to the in vitro validation of 30 compounds among which seven were found to inhibit more than 70% of the activity of 11β -HSD1 when measured in cell lysates with IC50 values below 10 µM and without cytotoxicity at concentrations up to 40 µM [75]. Pal and colleagues exploited the 3D-QSAR pharmacophore hypothesis and found relevant binding features of novel topoisomerase inhibitors [46].

In a recent study, a pharmacophore model based on the experimental structure of AKT1 protein complexed with an inhibitor has been used for virtual screening on a database of natural compounds. The most promising compounds obtained by the screening have been further investigated by toxicity profile and ADMET analyses, together with molecular docking for better prediction of the potential binding [76]. Experimental studies confirmed the potential inhibitory role of at least one of the selected compounds [76] and, in addition, its synergic effect with quercetin to inhibition of cell growth and induce apoptosis [77].

Due to the pandemic, many research teams explored the opportunity to find novel drugs through time-saving CADD approaches. This is the case of a study oriented to finding novel inhibitors of TMPRSS2, or type II transmembrane serine protease, a human protein involved in the activating processes of SARS-CoV-2 [78]. TMPRSS2 is of particular interest for its involvement also in cancer pathologies. The in silico screening has been performed by a pharmacophore-based approach, starting from camostat mesylate, a known inhibitor of serine protease 2, also approved for drug. The authors selected 10 pharmacophoric features of camostat mesylate and selected 2140 compounds from a public database containing more than 30,000 natural compounds. After a molecular docking analysis, the 2140 compounds were filtered to 85 candidates with docking scores similar to or better than the known inhibitor. The computational study has been further extended to analyze the list of filtered compounds for matching with Lipinski's rule of five and for ADMET properties. The study has been cited many times in literature, due to the interest in the results and the need to verify by experimental approaches the real potential inhibitory activity of filtered compounds.

#### **8. Future Perspectives**

CADD approaches have been successfully developed to improve the ability to discover new drugs. The increasing computational power of hardware opens the perspective of many more applications, also by increasing the availability of protein target structures through deep learning and AI-based prediction tools [79]. The deep learning approaches have been also recently applied to drug-target interaction [80]. However, the experimental steps needed to validate the compound activity and safety remain time-consuming and will always slow down the whole drug discovery process. From this point of view, the pharmacophore modelling followed by the virtual screening of databases of approved drugs, in the drug repurposing perspective, represents the most advanced rational approach, and the most promising way to find novel therapies.

**Author Contributions:** All authors contributed to conceptualization, writing—review and editing. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding. DG is supported by a post-doc fellowship under the project framework "CIR01\_00017-"CNRBiOmics–Centro Nazionale di Ricerca in Bioinformatica per le Scienze "Omiche"–Rafforzamento del capitale umano" funded by MUR, CUP B56J20000960001.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** No new data were created or analyzed in this study. Data sharing is not applicable to this article.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **References**


### *Article* **Virtual Screening Based on Machine Learning Explores Mangrove Natural Products as KRAS G12C Inhibitors**

**Lianxiang Luo 1,2, \* , Tongyu Zheng 3 , Qu Wang 3 , Yingling Liao 3 , Xiaoqi Zheng 3 , Ai Zhong 3 , Zunnan Huang 4,5, \* and Hui Luo 1,2, \***


**Abstract:** Mangrove secondary metabolites have many unique biological activities. We identified lead compounds among them that might target KRAS G12C **.** KRAS is considered to be closely related to various cancers. A variety of novel small molecules that directly target KRAS are being developed, including covalent allosteric inhibitors for KRAS G12C mutant, protein–protein interaction inhibitors that bind in the switch I/II pocket or the A59 site, and GTP-competitive inhibitors targeting the nucleotide-binding site. To identify a candidate pool of mangrove secondary metabolic natural products, we tested various machine learning algorithms and selected random forest as a model for predicting the targeting activity of compounds. Lead compounds were then subjected to virtual screening and covalent docking, integrated absorption, distribution, metabolism and excretion (ADME) testing, and structure-based pharmacophore model validation to select the most suitable compounds. Finally, we performed molecular dynamics simulations to verify the binding mode of the lead compound to KRAS G12C. The lazypredict function package was initially used, and the Accuracy score and F1 score of the random forest algorithm exceeded 60%, which can be considered to carry a strong ability to distinguish the data. Four marine natural products were obtained through machine learning identification and covalent docking screening. Compound **44** and compound **14** were selected for further validation after ADME and toxicity studies, and pharmacophore analysis indicated that they had a favorable pharmacodynamic profile. Comparison with the positive control showed that they stabilized switch I and switch II, and like MRTX849, retained a novel binding mechanism at the molecular level. Molecular dynamics analysis showed that they maintained a stable conformation with the target protein, so compound **44** and compound **14** may be effective inhibitors of the G12C mutant. These findings reveal that the mangrove-derived secondary metabolite compound **44** and compound **14** might be potential therapeutic agents for KRAS G12C .

**Keywords:** mangrove natural products; KRAS G12C; machine learning; molecular docking; drug discovery; virtual screening; molecular dynamics

#### **1. Introduction**

Plants are an important source of drugs and many compounds extracted from plants have been shown to have excellent medicinal properties [1]. In China and India, drugs extracted from plants have been widely put into use. For example, ethyl acetate compounds obtained from the flowers of Cassia fistula [2]. Mangroves are widely distributed in tropical and subtropical beach areas and grow a variety of plants with rich medicinal value, such

**Citation:** Luo, L.; Zheng, T.; Wang, Q.; Liao, Y.; Zheng, X.; Zhong, A.; Huang, Z.; Luo, H. Virtual Screening Based on Machine Learning Explores Mangrove Natural Products as KRASG12C Inhibitors. *Pharmaceuticals* **2022**, *15*, 584. https://doi.org/10.3390/ ph15050584

Academic Editor: Osvaldo Andrade Santos-Filho

Received: 11 April 2022 Accepted: 5 May 2022 Published: 8 May 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

as Scyphiphora and Clerodendruminerme [3]. Additionally, mangroves have unique biochemical properties that produce a large number of novel natural products and complex skeletons, and various extracts of mangrove plants have been used to treat tumors and bacterial infections [4]. Mangrove secondary metabolite extracts contain a large number of medicinal compounds similar to tannins, steroids, triterpenes, saponins, etc. [5]. These medicinal compounds play a unique role in the fight against human and animal pathogens.

KRAS targets have long been considered typically non-targetable targets in drug discovery, circulating in active and inactive GTPase [6–8]. RAS is a GTPase that cycles between GTP-bound (active) state and GDP-bound (inactive) forms through the actions of guanine nucleotide exchange factors (GEFs) and GTPase-activating proteins (GAPs). In the active state, KRAS can maintain affinity with many proteins, among which RAF and PI3K pathways may be the most typical [9–11]. Although the KRAS protein is closely associated with cancer, recent studies have shown that the KRAS protein can cause oncogenic mutations around activation sites (e.g., G12C, G12D, etc.), leading to downstream RAFs and overexpression of PI3K proteins. This leads to a range of cancerous lesions and over-proliferation of cancer cells [12]. In KRAS mutations, mutations in Cys12 cause the KRASG12C protein to lose its inherent catalytic activity and at the same time to lose the GTPase-activated protein (GAP). The enhanced catalytic effect leads to the activation of its structure and disrupts the inactive state of KRAS, causing cancer cells to promote proliferation. Lung cancer is the leading cause of cancer deaths in Western countries. In the course of the current study, although we have made substantial progress in treating genetic subtypes (e.g., patients with EGFR mutations or ALK-translocated lung cancer), the most common (30% genetically defined subtypes) and effective treatment strategies are still lacking. Cys12 mutations caused by codon 12 mutations account for nearly 50% of patients with KRAS mutations [13–15]. Therefore, drugs that target KRASG12C may have important therapeutic effects. Although KRAS is one of the first oncogenes to be discovered, it has two features that make it nearly impossible to be suppressed. 1. KRAS binds to GTP and GDP with a dermo maline affinity, which makes it difficult to develop nucleotide-based inhibitors [16]. 2. The hydrophobic pocket of KRAS is shallow, resulting in the insufficient affinity of the compound with the KRAS protein, resulting in off-target effects and increasing the difficulty of finding high-affinity allosteric inhibitors [17].

In 2013, Shokat and his colleagues used a new strategy for Cys12 for KRASG12C [18]. They suggested that covalent bonds formed with Cys12 residue could interfere with the activity of KRASG12C, thereby locking it into an inactive state bound to GTP, thereby downregulating the downstream signaling pathway. After that, the researchers developed a series of covalent inhibitors (ARS1620, AMG510, MRTX849) [19–22], that passed with Cys12 and switch II (amino acids 60–68) binds and occupies an allogeneic pocket on KRAS, and in these findings, the hydrophobic fraction of the new inhibitor penetrates switch II. In the allogeneic pockets below the ring area (residue His95, Tyr96, Gln99), these inhibitors are highly selective for the state in which KRASG12C binds to GTP and can maintain the inactivity of the KRASG12C protein in the formation of its covalent complex.

To begin with, we performed preliminary virtual filtering based on ligand structures. Then, we constructed training and test sets based on a selection of KRASG12C inhibitors from the ChEMBL library and trained a random forest classifier to prospectively predict a library of candidate compounds that were then predicted to be active by covalent screening. More effective chemical structures than positive compounds were screened by comprehensive evaluation of docking score and MM-GBSA score, and lead compounds were selected by ADME toxicity analysis. A compound adapted to KRASG12C was selected and it can be considered that it may be an inhibitor of KRASG12C. Under this process, we can predict the lead components that may target KRASG12C in the mangrove natural product library under the algorithm that is most suitable for the training set and then select the compounds with better docking effects than the positive control for ADME property analysis to reduce false positives in ADME screening as much as possible. Pharmacophore validation demonstrated

that the lead compound and the positive control had more common features, and kinetics further validated the binding activity of the selected compounds (Figure 1).

**Figure 1.** A virtual screening workflow (VSW) was used to identify molecules that hit KRAS G12C . A workflow overview of machine learning, covalent screening, elimination, and toxicity (ADMET) approaches for pharmacophore validation and MD simulations.

#### **2. Result**

#### *2.1. Candidate Compound Library Data*

The dataset for external use was from the ChEMBL database, and all data excluded compounds that did not have semi-concentration inhibition activity. To obtain the distribution of compound species in the mangrove secondary metabolite library, we calculated the structural similarity score (volume score) between each compound in the dataset. The volume score is between 0 and 1, and a higher value indicates higher structural similarity. The following figure reports the volume score between the two compounds. The two-volume score was obtained by fractional normalization based on the backbone of the first or second compound. Additionally, the result demonstrates the cluster analysis of the corresponding volume score sizes for different compounds in the mangrove natural product library, where most compounds have a similarity score of less than 0.4. Figure 2A reveals three sets, compounds **0**–**50**, compounds **50**–**100**, the high similarity of compounds **125**–**200** may be due to the presence of co-backbone structures in the mangrove secondary metabolite pool (Figure 2B–D).

#### *2.2. Machine Learning Models*

To screen the mangroves in the laboratory to find inhibitors that can better target KRAS G12C, we used machine learning technology to better predict the activity of these compounds on the target protein. Machine learning models were developed using the random forest model and the various algorithms included in the lazypredict package. All data were cross-validated by 10x, and from our result, it is clear that different predictive statistics for all machine learning algorithms were implemented using training data alone. At a threshold of 6.5um, the random forest algorithm performed best among all candidate classification algorithms, with the highest Accuracy and F1 score, indicating that the random forest algorithm had the best-fit value. (Figure 3) Therefore, we chose the random forest classification algorithm in Weka (version 3.8) to train the pubchem molecular fingerprint and analyze its 882 features. However, in real-world analysis, many features are complex and have noise implications for the analysis. Therefore, with the Rank and CfsSubsetEval modules, we removed features that were not related to structure–activity effects and finally analyzed the remaining 404 features, improving the performance of the random forest algorithm. Consequently, from a variety of classifiers, we screened and applied the random forest classifier that best fit the dataset, showing its good discriminative power.

**Figure 2.** Clustering and typical frameworks of clusters in mangrove secondary metabolite library. (**A**) Clustering of clusters in the mangrove secondary metabolite library; (**B**–**D**) typical frameworks in the candidate compound library.

#### *2.3. Random Forest Classification Model*

After identifying the machine learning classifier likely to best fit the ChEMBL dataset, we performed a new round of parameter tuning for the random forest algorithm, which helped to better identify new molecules with similar properties that bind to the target protein. The descriptor set obtained by the feature selection method was used to establish a classification model, and the machine learning algorithm of random forest classification was used to evaluate the molecular descriptor set to be selected in detail. Confusion matrices are visualization tools used in machine learning to show accuracy assessments in supervised learning. The records in the dataset were summarized in matrix form based on the two criteria for the actual category and the classification judgments made by the classification model. The random forest model was suitable for both true positives and true negatives (Figure 4A,B), with a false positive number of 9 and a false negative number of 24 in the training set. The number of false positives in the test set is 3 and the number of false negatives is 8. Compared with true positives and true negatives in the matrix, good classification effects can be shown. A good binary classification model usually has good sensitivity and resolution, accuracy, and a larger area under ROC. If both sensitivity and resolution are high, the accuracy will be biased towards the highest value. Figure 4D shows the area under the ROC curve of the machine learning algorithm. It can be seen that it

has a high ROC value (ROC = 0.965), which partly indicates that it has a strong ability to distinguish between molecular descriptors. Additionally, it is sufficient for the activity differentiation of mangrove natural product libraries. In Tables 1 and 2, the predicted values, recall values, F-scores, and MCC scores of the active and inactive compound classifiers are given, and the values of the average model after the mixture of the two are given. The predicted, recall values, and F-scores of the active and inactive compounds in the training set were all close to 1, showing excellent sensitivity, with an average model MCC value of 0.825. While the average model in the test set has better overall statistics, a better balance is achieved between recall and specificity.

**Figure 3.** The accuracy of each machine learning algorithm trained using the lazypredict package. (**A**) Balanced Accuracy value of each machine learning classifier; (**B**) Accuracy value of each machine learning classifier; (**C**) F1 score value of each machine learning classifier.


**Table 1.** Prediction values for the training set for the random forest classifier.

**Table 2.** Predicted values for the test set of a random forest classifier.


#### *2.4. Chemical Space*

The chemical space of the mangrove secondary metabolite library is highly diverse, consisting of aldehydes, alcohols, and esters. The data set for this study was established using compounds extracted from secondary metabolites of mangroves collected in published papers. The data set was constructed with Schrodinger software, and a total of 281 molecules from different bacterial groups were collected. Principal component analysis was performed on secondary metabolic natural products in mangrove forests (Figure 4C). When analyzed using molecular fingerprint descriptors, it can be seen that KRAS G12C is spatially well-arranged, active and inactive molecules are well-arranged and are located in large clusters with wider distributions. Splitting the test data in the compound library into the 80% training dataset and 20% test dataset shows the considerable overlap between

the two sets (Figure 4C), which shows that the classifier is validated based on COM with similar intervals.

**Figure 4.** Chaos matrix and roc curve of the constructed random forest classifier and the chemical space of the candidate mangrove compound library. (**A**) Confusion matrix of random forest classifier to distinguish the training set; (**B**) confusion matrix of random forest classifier to distinguish the test set; (**C**) chemical space of the candidate compound library; (**D**) ROC curve of the random forest classifier.

#### *2.5. Prediction of Prospects*

Candidate compounds were scored by a random forest model and compounds labeled as active were selected for subsequent docking experiments. From this point of view, compounds are docked based on reliability, which may provide a degree of confidence for these predictions.

#### *2.6. Docking*

Molecular docking can better reveal how compounds bind to targets. The selected lead compounds were covalently docked in the Schrödinger Suite 18.4. After the ligand minimization step, the interaction energy of each compound at each docking position was calculated. The compounds with better performance than the positive control MRTX849 score were selected to show the most favorable conformation. It can be seen that the final selected compound is better combined in the preset binding pocket and wrapped tightly, so

it is judged that its off-target possibility is not high. Figure 5 shows 2D and 3D interaction patterns of compounds **44** and **14** docking. Among them, the 2D interaction diagram of compound **44** can be seen in Figure 5A and the 3D docking mode diagram in Figure 5C. For compound **14**, its two-dimensional interaction diagram and three-dimensional docking model diagram are shown in Figure 5B,D. It can be seen that both compounds are linked to the H95 cryptocodon of KRAS G12C and form an irreversible covalent bond with Cys12. Both compounds are connected to the switch II region, where the KRAS G12C protein can be inactivated by regulation. Compound **14** had a solid docking conformation by interacting with Gly60 to form a hydrogen bond, forming a hydrogen bond with Gly10. Compound **44** is firmly present by forming the ΠΠ interaction with Arg68 and a hydrogen bond interaction and forming hydrogen bonds with Peo34, surrounded by hydrophobic bonds in the pocket. The results of validation and docking are consistent, which is qualitatively expected for an inhibitor of KRAS G12C. The RMSD values of compound **44** and compound **14** with the best docking effect are 7.9196 and 9.1083, respectively, which are in line with the docking agreement among all the compounds with better docking effects, showing that the overlap with MRTX849 fluctuates less (Table 3).

ΠΠ cation interactions **Figure 5.** The most pharmaceutically available compound **44** with compound **14** and the KRAS G12C docking structure diagram. (**A**) Two-dimensional binding mode of compound **44** and protein complex; (**B**) two-dimensional binding mode of compound **14** and protein complex; (**C**) three-dimensional binding mode of compound **44** and protein complex; (**D**) three-dimensional binding modes of compound and protein complexes. The purple sticks are hydrogen bonds and the red sticks are ΠΠ cation interactions.

−

−

−


**Table 3.** Docking RMSD between ligand pose and crystal coordinates for compounds with better docking results than positive controls.

*‐*

*‐*

*‐*

*‐*

*‐*

*‐*

‐

‐

‐

‐

‐

‐

‐

#### *2.7. MM-GBSA*

Binding free energies calculated by molecular mechanics generalized born surface area (MM-GBSA) indicate that the compensation between binding enthalpy and entropy plays a crucial role in drug–protein binding. For the alternative lead compound, calculations were also performed in the MM-GBSA module in the Schrödinger Suite 18.4. The values of MM-GBSA for each compound were obtained, and the compound superior to the positive control MRTX849 was selected (Figure 6). We describe the surrounding kinetic environment by analyzing the conformation of protein ligands, but the calculations are too complex to be easily controlled. We selected conformation by analyzing intermolecular MM-GBSA scores and binding fractions, which not only focus on binding but also visualize it by a fraction. The MRTX849 compound has an MM-GBSA score of −3.58, and compound **127** (score = −55.97), compound **44** (score = −6.35), compound **14** (score = −32.68), compound **31** (score = −25.13), and compound **15** (score = −4.43) are higher than the MRTX849, indicating that these compounds may have a better conformation than the positive control.

**Figure 6.** Compounds that were identified as active by random forest classifiers and were superior to positive controls in covalent screening with KRAS G12C protein docking results and MM-GBSA results.

#### *2.8. ADME*

uble molecules are −4 to −6 and −2~−4, respectively. Based on the results, all molecules are ADME (Absorption, Distribution, Metabolism, and Excretion) is a key aspect for predicting the pharmacodynamics of the molecule under study which could be used as a future lead molecule for drug development. Swiss-ADME is a website (https:// www.swissadme.ch, accessed on 17 December 2021) that allows users to draw individual ligands or drug molecules or contains molecules from pubchem smiles data and provides information such as fat solubility (iLOGP, XLOGP3, WLOGP, MLOGP, SILICOS-IT, Log Po/w), Water Soluble Log S (ESOL, ALI, SILICOS-IT), drug-like rules (Lipinski, Ghose, Veber), and other parameters. ADME prediction studies for design compounds are shown in Table 4. Swiss-ADME is based in part on Lipinski, Ghose, Veber, Egan, and the five different rules identified by Muegge give the physicochemical properties of a possible oral drug candidate [23–26]. The logarithmic S reference values for medium soluble and highly soluble molecules are −4 to −6 and −2~−4, respectively. Based on the results, all molecules are classified as medium soluble and highly soluble. ADME drug capability assessment was performed on four selected compounds, where compounds **127** and **31** violated Lipinsky's rule of five, but compound **44** and compound **14** exhibited good ADME properties (Figure 7) and were reserved for the next evaluation and submitted for bone

− −

− −

35 − − h − 53 − h − toxicity analysis, where the benzene backbone of compound **31** showed stronger toxicity (Figure 8) (https://mcule.com/apps/toxicity-checker/, accessed on 19 January 2022). The next evaluation was skipped. All of these parameters infer that compounds **44** and **14** are close to a drug-like molecule.


**Table 4.** ADME properties of ligands selected from the Marine Natural Products Library.

**Figure 7.** ADME properties of compounds obtained after the covalent screening. (**A**) ADME properties of compound **14**; (**B**) ADME properties of compound **31**; (**C**) ADME properties of compound **44**; (**D**) ADME properties of compound **127**.

**Figure 8.** Toxicity alerts for compounds **44** and **31** are displayed in red font. Toxicity alerts for compounds **44** and **31** are displayed in red font. (**A**) Toxicity alert for compound **44**; (**B**) toxicity alert for compound **31**.

#### *2.9. Pharmacophore Analysis*

During the process of virtual screening, the pharmacophore model can be used to characterize the active conformation of the ligand molecule by conformational search and molecular superposition, and the possible mode of action between the receptor and the ligand molecule can be deduced and explained accordingly. Based on the ranking and scoring results of the pharmacophores given by the platform (Table 5), the best pharmacophores we selected (rank score = 52.937) have four hydrophobic characteristics and three hydrogen bond receptors. The results confirm that both drug candidates match the selected model, with compound **44** being the best match to the pharmacophore, matching the three hydrophobic interaction features and one hydrogen bond acceptor feature (green) of the model (Figure 9). It can be assumed that both compounds are more similar to the known inhibitors in terms of distribution of spatial pharmacodynamic curves.


**Table 5.** Feature composition and ranking score of 10 pharmacophore hypothesis models generated based on common features of positive compounds.

**Figure 9.** The pharmacophore model was generated based on the common features of the positive compounds and the matching status of the two candidate molecules with the model. (**A**) The common feature pharmacophore model; (**B**) the matching status of compound **44** and the model; (**C**) the matching status of compound **14** with the model.

#### *2.10. Root Mean Square Deviation (RMSD) Analysis*

To obtain the equilibration time for each simulated protein–ligand complex during the MD simulation, the RMSD of the skeleton was calculated. RMSD plots are typically used to evaluate the time it takes for a system to reach structural balance and to estimate the duration of running a simulation. RMSD is an important parameter for estimating changes or changes in molecular conformation. Due to sudden changes in structural conditions, the RMSD value of analog complexes, including references, increases suddenly, which is related to protein crystallization. The latter effect is to be expected since, in the crystal structure, the protein is rigid, and when it dissolves in the tank it resumes its dynamic movement.

A complex system with a time frame x should have an *RMSD* that can be calculated from the following equation [27,28].

$$RMSD\_{\mathbf{x}} = \sqrt{\frac{1}{N} \sum\_{i=1}^{N} (r\_i'(t\_{\mathbf{x}})) - r\_i \left(t\_{ref}\right)^2} \tag{1}$$

Here, the *RMSD<sup>x</sup>* is the calculation of *RMSD* for the specific number of frames, *N* is the number of selected atoms; *tre f* is the reference or mentioned time, *r* ′ is the selected atom in the frame *x* after super imposing on the reference frame, and *t<sup>x</sup>* is the recording intervals.

#### *2.11. Root Mean Square Volatility (RMSF) Analysis*

As shown in Figure 10A, the entire KRAS G12C system is in equilibrium in the first 62 ns of the simulated 100 ns (RMSD value is 0.52 nm), and then fluctuates to the RMSD of 0.50 nm after 62 ns and in equilibrium in the remaining 38 ns; the entire system can eventually be in equilibrium in this 100 ns without much fluctuation in the process. For the system of compound **14**, the RMSD of the system is finally stable at 0.23 nm and the RMSD of its ligand is stable at around 0.16 nm. However, in the 100 ns simulation process, it can be seen that there are four relatively long-term fluctuations. Interestingly, the system finally stabilizes and is 0.27 nm lower than the RMSD value of compound **44** (Figure 11A). To determine the deviation of the ligand from the initial posture and the degree of movement of the protein residues, the RMSF values of all sampled conformations during the 30 ns simulation were also calculated. RMSF fluctuates greatly, indicating that the residue is unstable; otherwise, the residue is stable. The RMSF of the residue is *i* ccalculated from the following equation [29].

**Figure 10.** Dynamic simulation results. (**A**) RMSD values extracted from protein fit ligand of the protein-ligand docked complexes and ligand. RMSD plot of KRAS G12C complex (red) and ligand 44 (black). (**B**) The RMSF graph of all complexes along with the protein during 100 ns MD simulation. RMSF plot of the KRAS G12C complex (black). (**C**) Residue-wise decomposition of binding free energies obtained from the MM-PBSA analyses. Red bars indicate CYS-12, ARG-68, and TYR-96. (**D**) Binding energy of binding for the protein complexed with ligand 44.

As shown in Figure 10B, the RMSF range of the entire system is between 0.05 and 0.3 nm. From this numerical range, the flexibility of the entire complex system is relatively low, and each residue does not fluctuate too much. The RMSF value of the residue ranged from 0.05 to 0.15 nm, indicating that the binding site of compound **44** with the target fluctuated significantly and the binding was stable. The RMSF of another system is basically consistent with the overall surface line of RMSF of compound **44**. After visualization, the RMSF of compound **14** is significantly larger on the key residues Cys12, Arg68, and Tyr96. The flexibility of these residues shows that compound **14** is not as effective as compound **44** (Figure 11B).

**Figure 11.** Dynamic simulation results. (**A**) RMSD values extracted from protein fit ligand of the protein-ligand docked complexes and ligand. RMSD plot of the KRAS G12C complex (red) and ligand 14 (black). (**B**) The RMSF graph of all complexes along with protein during 100 ns MD simulation. RMSF plot of the KRAS G12C complex (black). (**C**) Residue-wise decomposition of binding free energies obtained from the MM-PBSA analyses. (**D**) Binding energy of binding for the protein complexed with ligand 14.

#### *2.12. MM/PBSA Analysis*

−1 −1 −1 −1 −1 −1 −1 −2 −1 −1 −2 −1 For molecular mechanics, Poisson–Boltzmann surface area (MM/PBSA) is an efficient and reliable method for calculating the free energy of small inhibitors bound to their protein targets. In general, low binding energy values indicate that the binding between the ligand and the target is good, and the results of the g\_mmpbsa are shown in Figure 10C,D. In Figure 10C we can see that for the key residues Cys12, Arg68, and Tyr96, the energy contribution of Cys12, Arg68, and Tyr96 is −15.05, −15.21, and −18.02 kj/mol, which also corresponds to the interaction forces shown in the molecular docking section results. It is shown that the interaction force acts in this system. In terms of the specific energy contribution of key residues Cys12, Arg68, and Tyr96, the energy value of compound 14 is not very good. These values are −10.03, −11.21, and −12.02 kj/mol, respectively (Figure 11C). In addition, in the energy decomposition of compound 44 and the target (Figure 10B), the total binding energy is −181.601 kj/mol, the energy of van der Waals is −290.5 kj/mol, the energy of electrostatic energy is 12.155 kj/mol, the energy of polarization is 115.96 kj/mol, and the final energy of SASA is −19.209 kj/mol. Overall, compound **44** binds very well to KRAS G12C. The total combined energy of compound **14** is −162.601 kj/mol, van der Waals energy is −290.5 kj/mol, electrostatic energy is 22.155 kj/mol, polarization energy is 105.96kj/mol, and the final energy of Sasa is −19.209 kj/mol (Figure 11D).

#### **3. Discussion**

KRAS is the cancer gene with the most mutations in a single place and is the first to be identified to have a causal relationship with human cancer [30]. Mutations in KRAS are common among the three deadliest cancers: pancreatic, colorectal, and lung cancers [31].

Frequent mutations in the KRAS gene lead to increased demand for drug development, but due to its strong affinity with GTP and lack of deep hydrophobic pockets, the hydrogen bond formed by molecular docking has difficultly accurately anchoring its active pocket, which makes the development of corresponding small molecule inhibitors very difficult. However, recently, after the single mutation KRASG12C was identified as an inhibitor that could be used in clinical trials. There have been an increasing number of studies on KRASG12C, the most common KRAS mutation in lung cancer individuals.

In recent years, due to the low affinity of the binding form formed by the hydrogen bond and the limited binding efficiency of the active pocket, people have gradually begun to focus on covalent docking. Covalent bonds are directly connected to the target residues; thus, providing a more stable and higher affinity than hydrogen bonds. Therefore, in this paper, due to the special form of KRASG12C protein, we chose covalent screening as a way to find lead compounds, hoping to find new inhibitors from the mangrove natural product library through machine learning high-throughput screening.

In this study, 281 published small molecules targeting KRASG12C were selected from the ChEMBL database and all of them were converted into pubchem molecular fingerprints. Molecular descriptors were evaluated with the lazypredict package in Python. The random forest classifier had higher AUC values among all the classifiers used. This means that compared to other machine learning methods, random forest outperforms all methods on the KRASG12C dataset, so the random forest classifier chosen in this study seems suitable for prospective prediction. Therefore, we characterized the data in Weka (version 3.8) to make the algorithm fit the data better, the AUC area under the ROC curve of the random forest classifier indicates that the model has a good degree of discrimination, and the real number of filters in the confusion matrix with positive numbers for the training and test sets also indicate that the model has moderate to high reliability for forward-looking predictions. The mangrove secondary metabolite library in the laboratory contained a large number of molecules with different frames, which showed the diversity of candidate compounds. PCA results showed that the mangrove natural product library had a broader space of chemical properties. After the compounds screened by the random forest classifier were introduced into the covalent screening module of Schrödinger Suite 18.4, the control compounds with higher scores were selected for further analysis by comparison with the positive control MRTX849. From the docking results, it seems that the carbon–carbon double bond and the imine group can form a covalent bond with the Csy12 group of KRASG12C through Michael addition reaction, and these warheads were proved to be feasible in this study. Binding modes and molecular interactions reveal the mode of action of the selected ligands for KRASG12C. The comparison with the positive control showed that the warhead of compound **14** covalently bound to the receptor was similar, which may instruct us to modify it for better effect in the following experiments. Interestingly, the skeleton of compound **44** is similar to that of the positive control MRTX849, which allows compound **44** to have more favorable interactions and better probe itself into the shallow hydrophobic pocket of KRASG12C to interact with the better receptor combination. Toxicity testing shows that the structure of compound **14** has no components that are toxic to humans, but compound **44** has groups that may be harmful to humans, which is very important for future research. We may be able to optimize the functional groups of the lead compounds to provide them with better performance when targeting KRASG12C. In addition, quantum/molecular mechanics (QM/MM) calculations can be performed on the complexes, and finally, the conformation is selected from the docking simulations [32].

After years of development and calibration, the QM/MM hybrid method has become an indispensable tool for studying the kinetics of various chemical and biochemical processes. QM/MM is mainly used to characterize and study the transition states and activation energies of enzymatic reactions. Conformations computed in this way describe the surrounding environment in more detail. However, the calculations become more complex and not easy to control. We chose the conformation by analyzing the interaction between the molecule and the binding moiety, which focuses not only on the binding

mode but also the moiety to be referenced. However, some compounds change mating conformations due to changes in the environment, regardless of environmental influences. Machine learning to resolve inhibitors is an emerging technology that is an important branch of artificial intelligence to extract useful and thematically relevant data when analyzing large samples. In this study, we focused on the classification analysis and principal component analysis of machine learning, allowing the compounds in the data set to pass activity prediction, so that the virtual screening can obtain more drug-like results. At the same time, in the machine learning analysis, if the docking score is more balanced by improving the docking score, it can also accurately filter out potential lead compounds from the virtual screening [33]. Compared with the QSAR analysis under the traditional algorithm [34], the active structure–activity relationship model constructed by machine learning can better fit the data and can have better performance in terms of robustness and prediction accuracy [35]. We, therefore, identified compounds in the mangrove secondary metabolite pool that might target KRASG12C, which exhibited favorable pharmacokinetic properties and docking effects and also received high scores in machine learning models. In the following research, its inhibitory activity can be verified experimentally to better obtain its inhibitory effect in animals and humans.

All in all, in terms of statistical machine learning methods, docking scores, and in silico ADMET studies, the results are satisfactory, indicating that virtual screening strategies combined with machine learning as well as structure-based molecular docking can improve the efficiency and accuracy of screening of target compounds.

#### **4. Materials and Methods**

#### *4.1. Protein Pretreatment*

We used the PDB website (https://www.rcsb.org/, accessed on 1 May 2021) to select and download KRASG12C's structure (PDB id:5F2E) [36], and then imported it into the Schrödinger Suite 18.4 to perform protein processing preparation. Pre-processing took place in the prepwizard module (Schrödinger Inc., New York, NY, USA), flipping pairs of Asn, Gln, and His by 180◦ . The terminal X angle of the residue was sampled to optimize the hydrogen bond network, and the hydrogen on the hydroxyl and thiols was sampled to optimize the hydrogen bond network. After hydrogen bond optimization, we used impact's imperf module and OPLS-2005. It also minimizes the structure of the protein; thus allowing the entire structural system to relax. In a protein minimization protocol, including all atoms and pure hydrogen atoms, the conditional criterion for termination was based on the root mean square deviation of the heavy atoms from their initial positions. At the same time, all water molecules were removed under the premise of optimizing hydrogen bonds and retaining the necessary water molecules at the minimum stage.

#### *4.2. Machine Learning*

#### 4.2.1. Data

The dataset constructed to train the machine learning classifier was extracted and queried separately in the ChEMBL database, and the reference protein we chose was ChEMBL2189121. The activity set was defined as compounds with a molecular weight < 1000, and the activity type (Standard type = "IC50") that detected inhibition had a STAN-DARD\_UNITS value of "NM". After removal of duplicate structures and no experimentally determined definitive IC50 values, the active set included 98 structures. Standard Relation =">" for the inactive compound set, which means that the construct did not show any activity at the concentrations used for screening. The inhibitory activity type of the resulting structure was also IC50, and its STANDARD\_UNITS was also "NM". After deduplication, the inactive set contained 72 structures. The electrical properties of the compounds were restored to electrical neutrality for both test and training sets. All machine learning classifiers use L1 regularization to weed out unimportant descriptors. The area under the curve (AUC) of the receiver operator characteristic (ROC) and the number of true positives (TP),

true negatives (TN), false positives (FP), and false negatives (FN) were used as metrics for the classification model.

#### 4.2.2. Machine Learning Models

Some of the molecular descriptors in the sample were noisy and irrelevant. We needed to remove them without missing too much information to reduce the likelihood of overfitting. From here, a condition was introduced to remove the unwanted descriptor, which measures the correlation between the descriptor and the sample output by the classifier [37]. For this purpose, we used a feature selection project, which selected the appropriate descriptor in a sample that contains a small amount of information without losing a lot of information. This study used the Rank method in the Select Attributes module in Weka [38] to rank each feature in descending order and then delete the lowerranked feature. At the same time, the CfsSubsetEval method was used to predict the degree of complexity between each feature and the predicted feature for classification.

#### 4.2.3. QSAR Modeling

The QSAR classification model can reflect the molecular descriptor as a correspondence between the independent and dependent variables, each representing the category of the corresponding sample (KRASG12C inhibitory activity). Machine learning algorithms can group observations or instances into classes. In structure–activity relationships, it tended to be complex and nonlinear, in which case QSAR modeling had shown excellent performance [39]. Lazypredict Pack (https://github.com/shankarpandala/lazypredict/ tree/master, accessed on 12 January 2022) uses a variety of machine learning algorithms to verify which algorithm is better suited for a dataset in Python. The machine learning software Weka (Waikato Knowledge Analysis Environment) [38] version 3.8 was used to perform a random forest algorithm selected by lazy prediction. Weka implemented 10 cross-validations to get the best fit on the training set.

#### 4.2.4. Principal Component Analysis

Principal component analysis (PCA) was performed on the mangrove secondary metabolite library data set to assess its chemical space. We used the Scikit-Learn 40 (0.22.2) for the PCA algorithm. The pubchem fingerprint reduces the feature dimension to 3. Molecular descriptors and fingerprints were from the Cheminformatics Library Rdkit (1 March 2020).

#### *4.3. Covalent Docking*

To further screen candidate compounds, the resulting candidate compounds were subjected to covalent docking virtual screening (CovDock-VS) based on the KRASG12C structure (PDB ID:5F2E). Molecular docking studies were conducted using the Maestro program. The ray structure of the KRASG12C protein (PDB ID:5F2E) with a resolution of 1.40 Å was selected for covalent docking. The ligands were prepared in Schrödinger's LigPrep module (Schrödinger Inc., New York, NY, USA). Protonation and ionization states of various stereoisomers, tautomers, and ligands were generated at pH 7.4 using an ionizer. Finally, the energy of the ligand was minimized using the OPLS2005 force field. The LigPrep-generated ligands were docked into the receptor grid in a covalent docking manner and the Michael addition reaction was selected as the reaction equation. The active functional group of the ligand was limited to 5 Å of the active amino acid residue, according to the previously obtained binding sites. A receptor grid with X = 14.9, Y = 10.5, and Z = 14.3 was prepared. The energy was minimized after docking, and each ligand outputs up to three optimal poses. Unless otherwise noted, all docking results were visually screened and conformations with the best docking scores were retained. CovDock used Cys12 as a covalently mated nucleophilic residue that conjugates to CovDock's preset alkyne hydrocarbons (carbonyl activation). Ligands with reactive functional groups in the range of 5 Å form covalent bonds specified by the reaction. Ligands were selected and

ranked based on the Glide score of the reaction complex binding pattern [40]. To ensure the accuracy of the docking protocol, we chose the superposition module in maestro to calculate the docking RMSD between the ligand's pose and crystal coordinates, while using MRTX849 as a control.

#### *4.4. ADME*

ADME's analysis of the pharmacodynamics of this pharmaceutically acceptable small molecule was of great significance. The Swiss ADME Server (http://www.swissadme.ch/, accessed on 17 December 2021) evaluates lead compounds retained after machine learning and covalent docking screening. This was described based on the specification SMILES [41]. The ADME properties of the selected compound were calculated by the website. The main relevant parameters such as pharmacokinetic properties and drug solubility were taken into account. The observed attribute values are shown in Table 3.

#### *4.5. Pharmacophore Modeling and Matching Validation*

By using the Discovery Studio platform (Discovery Studio 4.5, Accelrys, Co., Ltd., 175 Wyman Street, 02451 WALTHAM, MA, USA), we spatially aligned three known positive compounds and generated 10 hypothetical pharmacophore models based on molecular common characteristics. We selected hydrogen bond receptors, hydrogen bond donors, and hydrophobicity features as model pharmacodynamic features. The minimum distance was set between pharmacodynamic features within the model to 2.97 Å and the best conformation method was applied to generate the potential conformation of the positive compound. According to the pharmacophore ranking score given by the platform, the optimal pharmacophore was selected to match the two candidate molecules to assess whether the candidate molecules were consistent with the common pharmacodynamic characteristics of known inhibitor molecules.

#### *4.6. Molecular Dynamics (MD) Simulation*

After docking, an MD simulation of the compounds **14** and **44** with KRASG12C was used to check the stability of the compound in the binding bag. Then, the GROMACS 2019.1 package [42], amber 99sb-ildn force field (https://www.gromcs.org/About\_Gromacs, accessed on 26 December 2021), and single point charge (SPC216) model were used for molecular dynamics simulations of 100 ns. To guarantee the total charge neutrality of the simulated system, a corresponding number of sodium ions are added to replace the water molecules in the system to produce a solvent cartridge of appropriate size. Then, the periodic boundary condition (PBC) was applied in the three directions of the system [43]. Using the amber99sb-ildn force field, the force field parameters obtained for the entire atom can be found on the Acpype website [44] (https://www.bio2byte.be/acpype/, accessed on 29 December 2021). The first pass (EM) minimizes the energy of the entire system at 50,000 steps below 300 K. Then, through MD simulations with position constraints, collected by NVT (constant particle count, volume, and temperature), and finally by NPT (constant particle number, pressure, and temperature) [45]. In addition, we balanced enzymes, ligand molecules, and solvents. Among them, we carried out non-standardized residual treatment of covalent bonds in the system.

#### *4.7. MM-PBSA*

Poisson Boltzmann Surface Area is open-source software, and g\_mmpbsa was primarily used to calculate the free energy of binding between the receptor and the inhibitor after MD [46]. As a scoring function, MM-PBSA has been used in computational methods for drug design. In this study, MM-PBSA was used to determine the binding free energy of KRASG12C with molecules 44 and 14, respectively.

The following Equation (2) describes the binding free energy:

$$\mathbf{G}\_{\text{binding}} = \mathbf{G}\_{\text{complex}} - \left(\mathbf{G}\_{\text{protein}} + \mathbf{G}\_{\text{ligand}}\right) \tag{2}$$

The free energy of the protein–inhibitor complex was represented by the *G\_complex*, the free energy of the protein in the solvent is represented by the *G\_protein*, and the free energy of the inhibitor in the solvent is represented by the *G\_ligand*.

#### **5. Conclusions**

In summary, marine natural products are an important source of lead compounds, especially mangroves and their secondary metabolites have many potential antitumor lead compounds. In the present study, we constructed a random forest classifier with excellent discriminatory power and sensitivity and used it to predict mangrove-derived compounds with potential KRASG12C inhibitory activity. Subsequently, further covalent docking and MM-GBSA analysis results confirmed the stable binding ability of two mangrove-derived compounds **14** and **44**. To investigate the commonalities in the potency of our two selected mangrove compounds and previously reported KRASG12C inhibitors, a pharmacophore model based on molecular common features was also used to further extend our study and corroborate the potential of both compounds to inhibit KRASG12C. Next, our work is focused on improving the biochemical and pharmacological profiles of mangrove secondary metabolite compounds **14** and **44** through further medicinal chemistry work and structural studies. Although the development of KRASG12C inhibitors is still considered a challenging task in the field of drug discovery and development, our work expands new horizons for this field.

**Author Contributions:** L.L. conceived and designed the study. X.Z., T.Z., A.Z., Q.W. and Y.L. performed data mining and analysis. X.Z. and L.L. wrote the manuscript. Z.H., H.L. and L.L. reviewed the paper and provided comments. All authors contributed to the article and approved the submitted version. All authors have read and agreed to the published version of the manuscript.

**Funding:** This project was supported by Basic and Applied Basic Research Program of Guangdong Province (2019A1515110201); Discipline Construction Project of Guangdong Medical University (4SG21004G). Science and technology program of Guangdong Province (2019B090905011).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data used to support the findings of this study are included within the article.

**Acknowledgments:** We thank the Public Service Platform of South China Sea for R&D Marine Biomedicine Resources for support.

**Conflicts of Interest:** The authors declare that they have no competing interest.

#### **References**


### *Article* **Chromene Derivatives as Selective TERRA G-Quadruplex RNA Binders with Antiproliferative Properties**

**Roberta Rocca 1,2,† , Francesca Scionti 3,† , Matteo Nadai 4 , Federica Moraca 2,5 , Annalisa Maruca 2,6 , Giosuè Costa 2,6, Raffaella Catalano 2,6 , Giada Juli 1 , Maria Teresa Di Martino 1 , Francesco Ortuso 2,6 , Stefano Alcaro 2,6 , Pierosandro Tagliaferri 1 , Pierfrancesco Tassone 1 , Sara N. Richter 4, \* and Anna Artese 2,6, \***


**Abstract:** In mammalian cells, telomerase transcribes telomeres in large G-rich non-coding RNA, known as telomeric repeat-containing RNA (TERRA), which folds into noncanonical nucleic acid secondary structures called G-quadruplexes (G4s). Since TERRA G4 has been shown to be involved in telomere length and translation regulation, it could provide valuable insight into fundamental biological processes, such as cancer growth, and TERRA G4 binders could represent an innovative strategy for cancer treatment. In this work, the three best candidates identified in our previous virtual screening campaign on bimolecular DNA/RNA G4s were investigated on the monomolecular Tel DNA and TERRA G4s by means of molecular modelling simulations and in vitro and in cell analysis. The results obtained in this work highlighted the stabilizing power of all the three candidates on TERRA G4. In particular, the two compounds characterized by a chromene scaffold were selective TERRA G4 binders, while the compound with a naphthyridine core acted as a dual Tel/TERRA G4-binder. A biophysical investigation by circular dichroism confirmed the relative stabilization efficiency of the compounds towards TERRA and Tel G4s. The TERRA G4 stabilizing *hits* showed good antiproliferative activity against colorectal and lung adenocarcinoma cell lines. Lead optimization to increase TERRA G4 stabilization may provide new powerful tools against cancer.

**Keywords:** G-quadruplex DNA; TERRA; docking; circular dichroism; mass spectrometry; biological assays

#### **1. Introduction**

The dynamic nucleoprotein telomerase plays a central role in cellular senescence by maintaining chromosomal integrity. In particular, it adds TTAGGG sequences to the end of the chromosomes, known as telomeres, preventing chromosomal degradation or end-to-end fusion events that may cause genomic instability [1]. Telomeres are guanine-rich (G-rich) sequences, characterized by non-canonical higher-order structures called G-quadruplexes (G4s). These are characterized by two or more stacked G-tetrads that are constituted by four guanines held in a planar arrangement through a network of Hoogsteen hydrogen

**Citation:** Rocca, R.; Scionti, F.; Nadai, M.; Moraca, F.; Maruca, A.; Costa, G.; Catalano, R.; Juli, G.; Di Martino, M.T.; Ortuso, F.; et al. Chromene Derivatives as Selective TERRA G-Quadruplex RNA Binders with Antiproliferative Properties. *Pharmaceuticals* **2022**, *15*, 548. https://doi.org/10.3390/ ph15050548

Academic Editor: Osvaldo Andrade Santos-Filho

Received: 18 January 2022 Accepted: 22 April 2022 Published: 28 April 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

bonds. Moreover, further stabilization is provided by a monovalent cation coordinating the O6-lone pairs of each guanine [2]. Based on the environmental conditions, G4 can assume multiple folding topologies influenced by the number and orientation of the strands (parallel, antiparallel, and hybrid 1 and 2 types), the loop size, the groove width, and the *syn*/*anti* glycosidic bond orientation of the guanines [2,3]. For the human telomeric sequence (HTS), several G4 DNA topologies have been observed in the presence of K<sup>+</sup> ions. Specifically, the parallel-stranded conformation is the only one found for the wild-type Tel<sup>22</sup> AG3[T2AG3]<sup>3</sup> crystallographic structure [4], while hybrid 1 and hybrid 2 topologies are prevalent in NMR solutions [5,6]. Conversely, antiparallel folding was revealed by NMR studies in the presence of Na<sup>+</sup> [7]. Furthermore, several bimolecular and tetramolecular G4 structures can be obtained in vitro by nucleic acid sequences that include groups of contiguous guanine residues [8–10]. At first, telomeres were considered as transcriptionally silent regions of mammalian chromosomes, but Azzalin et al. pointed out their transcription by detecting non-coding telomeric repeat-containing RNAs (TERRA) into mammalian cells [11]. TERRA can play significant a significant role in different biological processes, such as end protection, telomeric replication, and telomerase recruitment [12,13]. Interestingly, Xu et al. demonstrated that human TERRA molecules folded in G4s similarly to the telomeric DNA and provided direct evidence about the presence of the parallel-stranded TERRA G4s in living cells, through a light-switching pyrene probe [14]. Moreover, several TERRA-binding proteins have been discovered, including telomeric duplex DNA binding proteins TRF1 and TRF2, which are Shelterin key components able to protect telomeres [15]. Balasubramanian and co-workers also demonstrated that TRF2 interacts with the G4 conformation of TERRA for binding more tightly to telomeric DNA (Tel) [16]. TRF2 promotes telomere folding by hiding the 3′ -end overhang, which is not recognized as damaged DNA, thus preventing DNA damage response (DDR) activation [17]. Because the maintenance of telomeres is a key tract of cancer cells, compounds that target telomeres and their transcripts have been investigated as anticancer strategies. The therapeutic evaluation of TERRAmediated telomerase regulation in cancer cells showed promising results. Indeed, previous studies demonstrated TERRA downregulation in advanced stages of various human cancers compared to normal tissues, suggesting that telomeric transcription is downregulated in advanced tumors [18]. Moreover, recent studies demonstrated that the identification of TERRA G4-targeting drugs induced a cytotoxic effect on high TERRA-expressing cells, where they induce a DDR at telomeres, probably by displacing TERRA from telomeres [19]. In light of this, TERRA provides a promising antitumor target as it is necessary for the formation of telomeric heterochromatin in all tumor cells, even those not expressing telomerase (ALT-positive tumors) [20]. The availability of TERRA experimental (NMR or X-ray) structures, together with a deep understanding of their topologies, is extremely helpful for the rational design of their selective ligands. Interestingly, CD, NMR, and X-ray crystallographic studies pointed out the parallel "propeller" topology as the most predominant for TERRA [21], if compared to the wide variety of G4 conformations observed for Tel [4–6,22,23]. Unfortunately, only the bimolecular sequence of TERRA was solved [21,24], while its unimolecular counterpart is still unavailable. While several Tel ligands have been identified and studied [25–30], few compounds have yet been proposed as TERRA binders. Among them, Collie et al. described a naphthalene diimide able to bind bimolecular TERRA with higher selectivity than BRACO-19, a ligand more selective towards Tel. This observation is suggested to be related to the presence of the 2′ -OH groups in the RNA sugars, which reduce groove and loop widths, making important changes in the portion interacting with the ligand sidechains as well [31]. Therefore, the hydrogen bonds (H-bonds) network, involving the O2′ hydroxyl groups of the ribonucleotide sugars in TERRA, can be considered an important structural feature for the drug-design of selective TERRA-ligands [32]. In particular, this H-bond network tunes down the negative electrostatic surface of the target, partially explaining the observed selectivity of carboxypyridostatin (*c*PDS) [33]. A previous computational study by us pointed out that the *c*PDS high electrostatic surface, coupled to a conformational profile able to maximize the solvation contribution, is an important feature

to make it selective against TERRA [34]. Starting from our previous virtual screening (VS) campaign on bimolecular telomeric DNA and RNA G4 structures (named Tel<sup>2</sup> and TERRA<sup>2</sup> G4, respectively) [35], in this work, the three best candidates (Figure 1) were submitted to molecular recognition studies on the 3D coordinates of monomolecular Tel and TERRA. A computational approach, based on molecular dynamics (MD) and docking simulations, was applied to target the monomolecular Tel of the 22-nt telomeric sequence 5 ′ -AGGGTTAGGGTTAGGGTTAGGG-3 ′ (PDB code: 1KF1) [31] and its TERRA counterpart sequence 22-nt 5 ′ -AGGGUUAGGGUUAGGGUUAGGG-3 ′ . In particular, we investigated the binding affinity of the three best candidates **7**, **15**, and **17** previously discovered as dual Tel<sup>2</sup> and TERRA<sup>2</sup> binders [35], towards their Tel and TERRA monomolecular counterparts. With this aim, the homology model of the monomolecular TERRA was built and submitted to MD simulations to assess its geometric stability with respect to the corresponding Tel conformation. Subsequently, our docking simulations allowed us to predict the putative binding mode of all three candidates on TERRA. Furthermore, biophysical assays confirmed their stabilizing power. 5′ 3′ (PDB code: 1KF1) ′ 3′. In particular, we investi-

**Figure 1. The** 2D chemical structures of the three best *hits* found in our previous VS campaign on bimolecular Tel2/TERRA<sup>2</sup> G4s: (**A**) *hit* **7**, characterized by a naphthyridine scaffold with a ((dimethylamino)propyl)acetamide side chain; (**B**) *hit* **15**, exhibiting a furo-chromene structure; and (**C**) *hit* **17**, distinguished by a benzofuran ring [35].

#### **2. Results and Discussion**

#### *2.1. Computational Studies*

template to build the corresponding TERRA G4 structure through the " " The binding capability of the three best *hits*, discovered in a previous VS on bimolecular Tel2/TERRA<sup>2</sup> G4s [35], was computationally investigated on the corresponding monomolecular structures. The G4 crystallographic model of Tel (PDB ID: 1KF1) was used as template to build the corresponding TERRA G4 structure through the "*homology modelling*" approach and was submitted to MD simulations to verify the geometrical stability. MD simulations agree with the experimental data [36,37], showing higher structural stability for TERRA with respect to Tel (Supplementary Figure S1), as demonstrated by the average of the Root Mean Square deviation (RMSd) on all the heavy atoms, which is equal to 0.26 nm and 0.34 nm, respectively, for TERRA and Tel. This observation is further confirmed by the RMSd matrix (Supplementary Figure S2) calculated on the heavy atoms among all MDs conformations, from which it turns out that Tel exhibited a more significant heterogeneity than TERRA, as highlighted by the wider orange and red areas (Supplementary Figure S2A), associated with the higher RMSd values. A cluster analysis performed on all the nucleic acid heavy atoms allowed us to select the most representative conformations of both G4 targets. Specifically, to perform the subsequent docking studies, four and three conformations for Tel and TERRA, respectively, were selected, thus considering the flexibility of both targets. A visual inspection analysis of all the most representative conformations (cluster 1, 2, 3, and 4 of Tel and cluster 1, 2, and 3 of TERRA) (Supplementary Figure S3) retrieved a well-structured G-core with partial coverage of the G-tetrad at the top of the residue DA1 (Deoxyribonucleotide Adenine 1) and RA1 (Ribonucleotide Adenine 1) in Tel and TERRA, respectively. Moreover, the structure of the Tel cluster 4 was exhibited in the top position in the presence of an additional residue, DA7 (Supplementary Figure S3B). Interestingly, TERRA and Tel showed two different behaviors

in the bottom position. While all TERRA representative conformations showed that the G-tetrad at the bottom position is free to interact with any end-stacking ligands, in three Tel structures (cluster 1, cluster 3, and cluster 4), we observed residue DT17 (Deoxyribonucleotide Thymine 17) interacting through stacking interactions with DG16 (Deoxyribonucleotide Guanine 16), partially preventing ligands access (Supplementary Figure S3B). The greatest structural difference between the target-selected conformations of both Tel and TERRA was instead found in the loops, where the highest heterogeneity was observed. In order to investigate the differences in the recognition of the unimolecular telomeric G4s for the three ligands, we clusterized all generated docking poses obtained against all clusters by using the angle descriptor defined by three dummy atoms, as reported in a previous work [38]. As shown in Supplementary Figure S4, *hits* **15** and **17** bound to loop in both targets, while *hit* **7** exhibited a greater number of poses positioned at the bottom of the TERRA molecule. The second favorite binding site for all compounds was the bottom position. The top position was not a favorite binding site for these compounds: we hypothesize that the presence of DA1 and RA1 in Tel and TERRA, respectively, blocks the access to the G-tetrad. In Supplementary Figure S5, we report a deeper analysis for the distribution of the binding poses to the most representative conformations of both targets. Regarding TERRA clusters, we observed heterogeneous behaviour. In fact, although all 3D structures showed that the G-tetrad in the bottom position is free to interact with the three studied compounds, as previously described, only cluster 1 and cluster 2 for *hit* **7**, and cluster 2 and cluster 1 for *hits* **15** and **17,** respectively, seem to favour this binding mode. Conversely, the lateral position appeared to be preferred in the rest of the clusters, with a higher prevalence for *hits* **15** and **17**, suggesting the conformation of the loops more favourable to accommodate a ligand. Interestingly, for cluster 4 of the Tel target, a single binding mode could be observed for all generated docking poses since all compounds bind the nucleic acid in a lateral position. In addition, Tel cluster 2 showed a preference to bind the ligands in a lateral position, especially for *hits* **7** and **17**. Finally, despite the presence at the bottom of residue DT17 that partially covers the G-tetrad, clusters 1 and 3 of all *hits* seemed to slightly prefer this binding site. Then, for each compound, we selected and deeply analyzed the energetically most stable complex with both targets (Supplementary Table S1). All three compounds showed better binding free energies when they were docked against TERRA, with ∆Gbind values ranging from −58.86 to −88.77 kcal/mol. Conversely, Tel complexes were characterized by ∆Gbind values higher than −57.96 kcal/mol. Thus, *hit* **7** potentially maintained its dual Tel/TERRA ligand profile, as previously also observed on the respective bimolecular Tel2/TERRA<sup>2</sup> G4s [35]. Surprisingly, *hit* **15** and *hit* **17**, previously characterized as selective Tel<sup>2</sup> ligands [35], turned out to have a better theoretical affinity towards the monomolecular TERRA G4. An analysis of the related single contributions of ∆Gbind highlighted the better contribution of Van der Waals (∆Gbind\_vdW), lipophilic (∆Gbind\_Lipo), and packing energy (∆Gbind\_Packing), that is, the π–π packing correction, in TERRA complexes compared to Tel ones [39]. We next analyzed the interaction pattern of the best thermodynamic complexes (Figure 2). Specifically, *hit* **7** recognized the bottom of TERRA, confirming this binding site as the most geometrically and energetically favored (Figure 2A). The complex was stabilized by four π–π interactions, established between the naphthyridine moiety of the ligand and the nitrogenous base of the nucleobase RG10 (Ribonucleotide Guanine 10), and a π–cation between *hit* **7** pyridine ring and the potassium coordinating the last G-tetrad. This ligand also engaged three H-bonds with nucleobases RG10, RG4, and RA7 by means of the carbonyl group on the naphtyridine ring, the linear amide, and the quaternary ammonium, respectively. Its positive charge was also involved in three salt bridges with nucleobases RU5 (Ribonucleotide Uracil 5), RA7, and RG9. Conversely, in the Tel complex, *hit* **7** behaved as a loop binder by preventing the formation of the stacking interactions that stabilize its bond with TERRA (Figure 2D). This ligand interacted with Tel through three H-bonds with nucleobases DA14 and DA19 by employing its quaternary ammonium, its amide group, and its pyridinium ring, respectively. Two salt bridges were also engaged between the quaternary ammonium and the pyridinium ring

of the ligand and nucleobases DA19 and DA14, respectively. *Hit* **15** (Figure 2B) appeared to form the most energetically stable complex for both G4 structures by acting as a loop binder, in agreement with the geometrical analysis of the docking binding sites. In the complex between *hit* **15** and TERRA, electrostatic contributions were the driving interactions (Supplementary Table S1). Specifically, a salt bridge, involving the quaternary ammonium of the ligand and the phosphoric group of the RG16 nucleobase, and four H-bonds were established. These latter interactions were engaged between the 1,3-dioxole portion; the furan ring; and the quaternary ammonium of the ligand and nucleobases RU17, RA19, and RG15, respectively. Conversely, in the Tel complex (Figure 2E), *hit* **15** engaged only two H-bonds between the carbonyl group of the chromene moiety and the quaternary ammonium of the ligand with nucleobases DT11 and DG10, respectively. In the complex between *hit* **15** and Tel, we observed four π–π interactions between the chromene ring and the furopyridine moiety with the nucleobases DG8 and DA13, respectively. Although only two π–π interactions were observed in the TERRA complex between the 1,3 dioxole portion and the pyridine ring of *hit* **15** with nucleobases RA19 and RU18, respectively, the ligand was able to establish several hydrophobic interactions, as highlighted by the lipophilic energetic terms in Supplementary Table S1 (∆Gbind\_Lipo and ∆Gbind\_Packing). Finally, the energetic analysis confirmed the behavior of loop binders for *hit* **17** when complexed to TERRA (Figure 2C). At the same time, the bottom position was disclosed as the best binding site on Tel G4 for the same compound (Figure 2F). Moreover, the best energy evaluation for the complex between *hit* **17** and TERRA G4 was related to the higher number of established favorable interactions compared to Tel. Specifically, *hit* **17** was involved in three π–π interactions and two H-bonds with TERRA G4, while in the Tel complex only one π–π interaction and one H-bond were observed. In detail, the TERRA complex showed the phenyl ring and the chromene moiety of the ligand interacting with nucleobases RG8 and RG14, respectively, through π–π interactions. Instead, the amide moiety and the quaternary ammonium engaged two H-bonds with nucleobases RA13 and RU12, respectively. Regarding the Tel complex, we observed π–π interactions between the ligand furan ring and DT17 nucleobase, while the amide portion interacted through an H-bond with DG22. Both complexes of *hit* **17** shared a salt bridge, involving its quaternary ammonium and nucleobases RU12 and DG22 for TERRA and Tel, respectively.

In a second step, we evaluated the ability of the three compounds to stabilize both TERRA and Tel. We performed 200 ns long MDs starting with thermodynamically best complexes and compared the results with respect to the stability of the related cluster conformation for each target. As shown in Supplementary Figure S6, the analysis of the RMSd trend, calculated on the heavy atoms of both targets, showed a better ability to stabilize TERRA for all compounds, compared to Tel complexes. Interestingly, *hit* **7** exhibited the best stabilizing profile on TERRA, with an average RMSd value of 0.17 nm. Moreover, it is the only compound able to form a complex with Tel showing an RMSd trend (with an average RMSd value of 0.26 nm) similar to that of the related unbound target structure (with an average RMSd value of 0.24 nm). Conversely, *hit* **17** exhibited the worst trend of RMSd on Tel (with an average RMSd value of 0.35 nm) since we observed for the entire duration of the MDs higher RMSd values compared to the unbound target, with a further increase in the last 50 ns. Regarding *hit* **15**, although in the first 100 ns of MDs it seemed to stabilize both G4 structures, in the last part of the simulation it showed a gradual increase in the RMSd trend of Tel. Finally, for each complex, the most populated structure during the MDs was selected and deeply analyzed (Supplementary Table S2 and Supplementary Figure S6). As observed in docking simulation, all the three compounds confirmed a better binding free energy if complexed with TERRA, with ∆Gbind values ranging from −38.36 to −85.93 kcal/mol. Conversely, Tel complexes were characterized by ∆Gbind values higher than −35.01 kcal/mol, except for the complex of *hit* **7**, which exhibited ∆Gbind value of −45.85 kcal/mol. Regarding the binding mode and the interaction pattern of the most populated structure during MDs, we observed the maintenance of the binding site only in the TERRA complexes

(Supplementary Figure S7A–C). In particular, *hit* **7** always bonded the bottom portion of TERRA, as in the docking pose, and it established excellent π–π interactions between its naphthyridine moiety and nucleobases RG10 and RG16, as well as a π–cation interaction and an H-bond with RG22 and RG16, respectively. On the other hand, when comparing this structure with the initial docking pose (Figure 2A), we observed the rotation of the ligand with good interactions between its lateral chain and the first loop of TERRA. Specifically, two salt bridges were established between the quaternary ammonium of *hit* **7** and the phosphoric groups of nucleobases RA7 and RU5, while two H-bonds were observed with the sugar and the base of the RG4 (Supplementary Figure S7A). As noted in the docking pose (Figure 2B), *hit* **15** kept its lateral binding mode by establishing two π–π and a π–cation interaction with the RA19 nucleobase (Supplementary Figure S7B). Regarding *hit* **17**, the comparison of the most populated structure during MDs with the docking pose (Figure 2C) highlighted the absence of interactions between the ligand side-chain and the TERRA loop. Conversely, *hit* **17** strengthened π–π interactions between its psoralen portion and nucleobases RG8 and RG14, and it also established an additional H-bond between its carbonyl amide and RA (Supplementary Figure S7C). Interestingly, the most populated structures of MDs for Tel complexes showed a complete change in the binding site (Supplementary Figure S7D–E), except for *hit* **17**. During MDs, *hits* **7** and **15** changed their binding sites, moving from the lateral (Figure 2D–E) to the top position. The new binding mode of *hit* **7** was characterized by a strong interaction between the ligand naphthyridine moiety and nucleobase DG8, thanks to formation of six π–π, two π–cations, and one H-bond (Supplementary Figure S7D). Moreover, we also observed a π–π interaction with nucleobase DA1 and an H-bond between the ammonium group of the ligand and the sugar portion of DG14. Additionally, for *hit* **15**, we beheld several π–π interactions between its furo-chromen ring and nucleobases DG14 and DG8, but, as shown in Supplementary Figure S7E, these events caused the alteration and destabilization of the G-tetrad formed by nucleobases DG2, DG8, DG14, and DG20. Moreover, the ammonium group of the ligand was involved in two H-bonds with nucleobase DG8. Finally, the most populated structure of MDs for *hit* **17** exhibited two salt bridges between its ammonium group and the DG21 and DG22 nucleobases, an H-bond between the amide group and the DT17 and four π–π interactions with DG16 and DG10 (Supplementary Figure S7F). As noted for *hit* **15**, in this case the ligand also appeared able to induce the alteration and destabilization of the interacting G-tetrad.

#### *2.2. In Vitro Analysis*

*Hits* **7**, **15,** and **17** were tested for their ability to stabilize the target Tel and TERRA G4s by means of circular dichroism (CD) melting experiments. This technique is used to obtain information about the G4 topology and stability (melting temperature, Tm) of the G4 structured oligonucleotides. Tel G4 displayed the well-known hybrid 3 + 1 topology and a T<sup>m</sup> of 67.2 ± 0.2 ◦C: when incubated with *hit* **7**, we observed a topological change with the appearance of a peak around 260 nm that could be ascribed to the contribution of the parallel structure, and stabilization by 5.8 ◦C. No significant topological changes and stabilization were observed when the target Tel G4 was incubated with *hits* **15** and **17** (Figures 3 and S8).

TERRA G4 showed a prevalently parallel topology, with a maximum peak of around 260 nm and a shoulder of around 300 nm, and T<sup>m</sup> of 75.2 ± 0.7 ◦C. Incubation with *hits* **7**, **15,** and **17** did not alter its topology but led to great stabilization of the structure, with *hit* **7** being the best stabilizer (∆T<sup>m</sup> 11.2 ◦C), followed by *hit* **15** (∆T<sup>m</sup> 9.5 ◦C) and *hit* **17** (∆T<sup>m</sup> 7.7 ◦C) (Figures 4 and S9). These in vitro data are in perfect agreement with our MD simulations.

π–π and π– **Figure 2.** A docking pose analysis of the best thermodynamic complexes of *hits* **7** (panels (**A**,**D**)), **15** (panels (**B**,**E**)), and **17** (panels(**C**,**F**)) in complex with TERRA and Tel, respectively. For *hits* **7**, **15,** and **17,** the ligand is depicted as orange, green, and cyan carbon sticks, respectively. The nucleic acids are shown as faded blue and grey surfaces for TERRA and Tel, respectively, while the guanine residues, forming the G-tetrads, are shown as lines. Moreover, the residues interacting with the ligands are depicted as faded blue and grey carbon sticks for TERRA and Tel, respectively. K + ions are represented as pink spheres. Hydrogen bonds, salt bridges, and π–π and π–cation interactions are shown as dashed violet, red, cyan, and green lines, respectively.

CD thermal unfolding spectra of the nucleic acid Tel G4 4 μM in 100 mM K 16 μM ( 16 μM ( 16 μM ( **Figure 3.** The CD thermal unfolding spectra of the nucleic acid Tel G4 4 µM in 100 mM K<sup>+</sup> alone (**A**) and in the presence of the *hit* **7** 16 µM (**B**), *hit* **15** 16 µM (**C**), and *hit* **17** 16 µM (**D**).

CD thermal unfolding spectra of the nucleic acid TERRA G4 4 μM in 100 mM K 16 μM ( 16 μM ( 16 μM (

Δ °

–

values were 3.4 ± 0.2 μM, 7.1 ± 1.2 μM and 6.1 ± 0.2 μM

values of 11.0 ± 0.7 μM, 18.0 ± 4.2 μM

**Figure 4.** *Cont*.

Δ

Δ °

5′ 3′

22.9 ± 6.4 μM for binding of

CD thermal unfolding spectra of the nucleic acid TERRA G4 4 μM in 100 mM K 16 μM ( 16 μM ( 16 μM ( **Figure 4.** The CD thermal unfolding spectra of the nucleic acid TERRA G4 4 µM in 100 mM K<sup>+</sup> alone (**A**) and in the presence of *hit* **7** 16 µM (**B**), *hit* **15** 16 µM (**C**), and *hit* **17** 16 µM (**D**).

Δ Δ °

5′ 3′ – values were 3.4 ± 0.2 μM, 7.1 ± 1.2 μM and 6.1 ± 0.2 μM values of 11.0 ± 0.7 μM, 18.0 ± 4.2 μM 22.9 ± 6.4 μM for binding of To measure the binding affinity of the hits to the G4-folded oligonucleotides with unmodified 5 ′ - and 3 ′ -ends (i.e., lacking fluorophores or biotin), we employed ESI-MS analysis, as previously described [40,41]. All *hits* displayed both a 1:1 and 1:2 binding ratio, albeit adducts with 2 bound *hit* molecules were 2–3 times less abundant than those with 1 bound *hit* molecule. For each *hit*, the binding affinity (KD) of the 1:1 complex with Tel and TERRA G4s was measured at three compound concentrations, corresponding to 1:1, 1:2, and 1:4 G4:compound ratios. K<sup>D</sup> values were 3.4 ± 0.2 µM, 7.1 ± 1.2 µM, and 6.1 ± 0.2 µM for binding of *hits* **7**, **15,** and **17**, respectively, to TERRA G4 (Figures 5 and S11). Binding affinity to Tel G4 was in general lower, with K<sup>D</sup> values of 11.0 ± 0.7 µM, 18.0 ± 4.2 µM, and 22.9 ± 6.4 µM for binding of *hits* **7**, **15,** and **17**, respectively (Figure S10).

– **Figure 5.** *Cont*.

–

–

molecule (10 μM) were incubated in MS buffer (HFIP

4

–

–

–

molecule (10 μM) were incubated in MS buffer (HFIP **Figure 5.** The MS spectra of TERRA (blue squares) incubated with the indicated *hits*. Samples containing TERRA oligonucleotide (5 µM) and *hit* molecule (10 µM) were incubated in MS buffer (HFIP 120 mM/TEA pH 7.4, KCl 0.8 mM, isopropanol 20%) overnight before MS analysis. A zoom on the most significant m/z range is shown. The larger m/z range is provided in Figure S11.

#### *2.3. In Cell Assays*

–

–

–

–

–

–

–

–

–

–

–

The cytotoxic activity of the tested compounds was evaluated on a panel of cultured human tumor cell lines: MCF7 (mammary gland adenocarcinoma), HT-29 (colorectal adenocarcinoma), and A549 (lung adenocarcinoma) cells. After 48 h treatment with the compounds, cytotoxicity was assessed by the MTT test and indicated as the concentration able to kill 50% of the cell population (CC50). As reported in Table 1, the colorectal adenocarcinoma HT-29 cell line was the most sensitive to compound treatment, with CC<sup>50</sup> in the low micromolar range for *hits* **7** and **17** and in the nanomolar range for *hit* **15**. A549 cells had intermediate sensitivity to the *hits*, while MCF7 cells were the least affected by compounds' treatment. In these cells, it was not possible to obtain a discrete CC<sup>50</sup> value for of *hit* **7** as it exceeded the compound limit of solubility. *Hit* **15** was the most effective compound against HT-29 and A549 cells.

**Table 1.** Cytotoxicity CC<sup>50</sup> (µM) in human tumor cell lines measured 48 h post administration of *hits* **7**, **15,** and **17**.


The sensitivity of a particular cell line to a given compound is influenced not only by the intrinsic nature of the drug but also by the characteristics of cancer, including mutations, gene expression, and copy number variation. Thus, the different response of the three cell lines observed in this study is probably due to that distinct oncogenic drivers and drug resistance mechanisms that operate in each cell line and create perturbations of the downstream pathways triggered by the compounds.

#### **3. Materials and Methods**

#### *3.1. Target Preparation for In Silico Analysis*

The crystal structure with PDB code 1KF1 and 2.1 Å resolution was selected as a tridimensional model of the parallel stranded Tel G4, featured by the 22-nt human telomeric sequence *d*[AG3(T2AG3)3] [7]. The same structure was used as a template to generate TERRA homology modeling by adding hydroxyl groups to the sugar ring. The TERRA G4 is known to be characterized by a monomorphic nature since only the parallel topology was experimentally observed [36]. K<sup>+</sup> ions, coordinating the G-tetrad O6 atoms and vertically aligned in the internal G-delimited channel, were retained at their respective crystallographic positions, while all the crystallized water molecules were removed. Both

Tel and TERRA structures were submitted to MD simulations, using GROMACS code ver. 4.5.1 [42]. Both nucleic acids were treated with standard *parm99* Amber force field with modified parmbsc0 [43,44] and combined with corrections ε/ζOL1 and χOL4, to improve the description of ε/ζ and χ G4 torsions, respectively [45–47]. For each system, the tleap module of the AmberTools program was employed to generate a topology file, which was converted into a suitable GROMACS file format using the Acpype script [48]. A truncated dodecahedron box with the TIP3P water solvent model [49] was built using periodic boundary conditions, and the global negative charge was neutralized by adding K+ counter-ions. To resolve bad steric contacts, both systems were energy-minimized, using 5000 steps with the steepest descent algorithm; equilibrated at 300 K through 5 ns MD under NVT conditions; and then equilibrated in the isothermal–isobaric (NPT) ensemble at 1 atm. An MD production run (200 ns) was performed in NPT, using a time step of 2 fs. The V-rescale algorithm [50] and Parrinello–Rahman barostat [51] were used to control and monitor temperature and pressure, respectively. Finally, for both targets, all conformations found during MD runs were submitted to cluster analysis with the GROMOS algorithm [52], by the g-cluster tool implemented in the GROMACS package [42]. A cut-off of 2.5 Å was used in the cluster process, with the aim to select different representative structures for the two G4 targets.

#### *3.2. Molecular Docking Protocol*

The most representative structures of the two G4 targets, obtained during MD runs, were used to generate grids by applying default parameters. Each energy grid was built centering the docking box on the G-tetrads centroid and setting its outer box size to 48 × 48 × 48 Å. For each docking run, 10 poses per ligand were generated and the scaling factor for the target Van der Waals radii was set to 1.0. We used the Standard Precision (SP) scoring function of Glide ver. 7.8 software of the Schrödinger suite [53] to perform docking calculations of the most promising compounds found in our previous screening on the bimolecular target [35]. The molecular structures of *hits* **7**, **15,** and **17** were previously built using the Maestro graphical user interface (*Schrödinger Release 2019: Maestro, Schrödinger, LLC., New York, NY, 2019*) [54], while their most probable protonation state at physiological pH 7.4 was computed using LigPrep (LigPrep version 2.5, *Schrödinger, LLC., New York, NY, 2012*) tool [55]. All complexes generated with the docking procedure were further submitted to the Molecular Mechanics Generalized Born/Surface Area (MM-GBSA) method [56], applying molecular mechanics and continuum solvation models, to compute their binding free energies (∆Gbind) [57]. The docking pose of each compound with the best ∆Gbind was selected and further analyzed.

#### *3.3. Molecular Dynamics of the Thermodynamically Best Complexes*

For each compound, the best thermodynamics complex was submitted to MD simulations, using the same MD protocol applied for both targets. To consider a comparable starting point, also the TERRA and Tel clusters that provided the best complex thermodynamics were submitted to MDs to investigate the stabilizing power of each compound. For each ligand, we calculated the electrostatic potential (ESP) by Jaguar ver. 9.3 software [58], using the 6-31G\* basis set at the Hartree–Fock theory level. The restrained electrostatic potential (RESP) [59] was computed using Antechamber [60] and parameterized with General Amber Force Field (GAFF) [61]. Finally, all conformations found during MD runs were submitted to cluster analysis with the GROMOS algorithm [52], by the g-cluster tool implemented in the GROMACS package [42]. MD frames were aligned on the nucleic acid targets, and RMSd values were computed on the heavy atoms of both the target and ligand. In this case, 1.5 Å cut-off was used in the cluster process to select different representative structures of the complex for all ligands.

#### *3.4. Circular Dichroism*

Circular dichroism spectra were recorded on a Chirascan-Plus (Applied Photophysics, Leatherhead, UK) equipped with a Peltier temperature controller using a quartz cell of 5-mm optical path length and a scanning speed of 50 nm/min, with a response time of 4 sec over a wavelength range of 230–320 nm. The reported spectrum of each sample represents the average of 2 scans. Observed ellipticities were converted to the mean residue ellipticity (θ) = deg × cm<sup>2</sup> × dmol−<sup>1</sup> (molar ellipticity). Oligonucleotides were diluted from stock to the final concentration (4 µM) in the Li cacodylate buffer (10 mM, pH 7.4) with 100 mM KCl, annealed by heating at 95 ◦C for 5 min, and gradually cooled to room temperature. Compounds were added at 4 × G4 final concentration (16 µM). CD spectra were recorded after 24 h from 20 ◦C to 95 ◦C, with a temperature increase of 5 ◦C. T<sup>m</sup> values were calculated according to the van't Hoff equation, applied for a two-state transition from a folded to an unfolded state, assuming that the heat capacity of the folded and unfolded states are equal [62].

#### *3.5. Binding Affinity*

To determine the K<sup>D</sup> of *hit* ligands to Tel and TERRA G4s, mass spectrometry (MS) analysis was performed on mixtures of oligonucleotide (5 µM) + *hit* compound (5, 10, and 20 µM). A mixture of 2 µM reference dT6 + 5 µM oligonucleotide + 5 or 10 µM *hits* was also used to check for unspecific *hit* binding. Oligonucleotides were heat denatured on MS buffer (HFIP 120 mM/TEA pH 7.4, KCl 0.8 mM, isopropanol 20%) for 5 min at 95 ◦C and gradually cooled to room temperature to allow the correct folding. After 4 h, *hits* were added, and samples were incubated over night at room temperature. Samples were analyzed by direct infusion electrospray ionization (ESI)-MS on a Xevo G2-XS QTOF mass spectrometer (Waters, Manchester, UK). This is a high-resolution instrument that allowed us to visualize the isotopic pattern, identify the charge state, and therefore unambiguously calculate the neutral mass of the detected species. The injection was automatically performed by an Agilent 1290 Infinity HPLC (Agilent Technologies, Santa Clara, CA, USA) equipped with an autosampler; the carrying buffer was HFIP 120 mM/TEA pH 7.4 with 20% isopropanol. A volume of 5 µL of each sample was typically injected. In all experiments, ESI source settings were electrospray capillary voltage, 1.8 kV; source and desolvation temperatures, 45 ◦C and 65 ◦C, respectively; and sampling cone voltage, 65 V. All these parameters ensured minimal DNA complex fragmentation. The instrument was calibrated using a 2 mg/mL solution of sodium iodide in 50% isopropanol. The additional use of the LockSpray during analysis provided typical <5 ppm mass accuracy. The internal standard LockSpray consisted of a solution of leu-enkephalin (1 µg/mL) in acetonitrile/water (50:50, *v*/*v*) containing 0.1% formic acid. Peak areas were used to calculate the concentration ratios, as previously reported [40,41], using the formulas:

$$[\text{oligo}]\_{\text{free}} = \text{C}\_0 \times \text{A(oligo)}^{\text{n}-} / (\text{A(oligo)}^{\text{n}-} + \text{A(oligo} + hit)^{\text{n}-})$$

$$[\text{oligo} + hit] = \text{C}\_0 \times \text{A(oligo} + hit)^{\text{n}-} / (\text{A(oligo)}^{\text{n}-} + \text{A(oligo} + hit)^{\text{n}-})$$

$$[hit]\_{\text{free}} = [hit]\_{\text{tot}} - [\text{oligo} + hit]$$

$$\text{K}\_{\text{d}} = [hit]\_{\text{free}} \times [\text{oligo}]\_{\text{free}} / [\text{oligo} + hit]$$

where [oligo]free and [*hit*]free are the concentrations of the unbound oligonucleotide and *hit*, respectively; [*hit*]tot is the total concentration of *hit*; [oligo + *hit*] is the concentration of *hit* bound to oligonucleotide; C<sup>0</sup> is the starting oligo (Tel or TERRA G4) concentration; A(oligo)n<sup>−</sup> is the peak area of the oligonucleotide alone at charge state n−; and A(oligo + *hit*) <sup>n</sup><sup>−</sup> is the peak area of *hit* bound to oligo at charge state n−. Peak areas were calculated using MassLynx 4.1 software (Waters), after processing steps consisting of smoothing, background subtraction, and conversion to centroid.

#### *3.6. Compounds' Cytotoxicity*

Cytotoxic effects were determined by MTT assay. Compounds were dissolved and diluted into working concentrations with DMSO. All cell lines were obtained from ATCC (MCF7, human breast adenocarcinoma, cat. # HTB-22, HT-29, human colorectal adenocarcinoma, cat # HTB-38, A549, human lung carcinoma, cat # CCL-185), grown and maintained according to the manufacturer's instructions (https://www.lgcstandards-atcc.org (accessed on 8 June 2020)). Cells were plated into 96-microwell plates to a final volume of 100 µL and allowed to attach overnight. The following day, the tested compounds were added to each well with a 0.5% final concentration of DMSO per well; each concentration was tested in triplicate. Compounds were incubated for 48 h, and control cells (without any compound but with 0.5% DMSO) were treated in the exact same conditions. Cell survival was evaluated by MTT assay: 10 µL of freshly dissolved solution of MTT (5 mg/mL in PBS) was added to each well, and after 4 h of incubation, MTT crystals were solubilized in solubilization solution (10% sodium dodecyl sulphate (SDS) and 0.01 M HCl). After overnight incubation at 37 ◦C, absorbance was read at 540 nm. Data were expressed as mean values of at least three experiments conducted in triplicate. The percentage of cell survival was calculated as follows: cell survival = (Awell − Ablank)/(Acontrol − Ablank) × 100, where blank denotes the medium without cells. Each experiment was repeated at least three times.

#### **4. Conclusions**

In this study, three compounds (*hit* **7**, *hit* **15**, and *hit* **17**), previously identified by means of a virtual screening campaign on bimolecular DNA/RNA G4s, were investigated on the monomolecular Tel DNA and TERRA G4s. Molecular docking calculations indicated all ligands were able to better recognize TERRA with respect to Tel. As previously observed on the bimolecular Tel2/TERRA<sup>2</sup> G4s, *hit* **7** maintained its behavior as a dual Tel/TERRA ligand. Conversely, *hit* **15** and *hit* **17**, previously characterized as selective Tel<sup>2</sup> ligands [35], showed better theoretical affinity towards the monomolecular TERRA G4. Moreover, MDs results highlighted that all the analyzed *hits* better stabilized TERRA G4 folding if compared to Tel, while the naphthyridine *hit* **7** confirmed its dual profile. The in vitro data corroborated our MDs since the relative *hit* stabilization efficiency on TERRA and Tel corresponded to that calculated by MDs. Analysis in cells showed that these compounds have anticancer activity. *Hit* **15**, i.e., the compound that displayed the highest selective stabilization towards TERRA, was the most active compound. Our data indicate that the tested *hits* have enhanced activity towards HT-29 cell, a model line for colorectal adenocarcinoma [63]. In particular, *hit* **15**, being over 200 times more efficient on HT-29 cell that MCF7 cells, could be a promising compound to be further optimized against colorectal cancer.

**Supplementary Materials:** The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ph15050548/s1, Figure S1: the plot of the RMSd values calculated on all heavy atoms during 200 ns of MDs, performed on both the parallel telomeric (Tel) DNA (black line) and TERRA (red line) G-quadruplex (G4) structures; Figure S2: the RMSd matrices calculated on the heavy atoms of all the saved structures throughout all the MDs of the parallel telomeric Tel (A) and TERRA (B) G4 structures; Figure S3: the 3D structure of all the most representative conformations of (A) TERRA and (B) Tel G4. All clusters have been superimposed, while the single cluster and the most interesting residues are shown as surface and carbon sticks, respectively. Cluster 1, cluster 2, cluster 3, and cluster 4 are reported as red, faded-blue, green, and faded-plum surface, respectively; Figure S4: a pie chart showing the distribution of all the generated docking poses of hits 7, 15, and 17, obtained against all clusters, by considering the site analysis towards both TERRA and Tel G4 targets; Figure S5: an analysis of the binding modes of hits 7, 15, and 17 on each single cluster of both G4 targets, according to the geometrical descriptors reported in a previous work [38]; Figure S6: a plot of the RMSd values calculated on all heavy atoms during 200 ns of MDs, performed on the best thermodynamic complexes of the three hits with both Tel and TERRA G4 and on the related cluster structures of both receptors. (A) An RMSd plot of TERRA cluster 2 (red line) and its

related complex with hit 7 (orange line). (B) An RMSd plot of TERRA cluster 3 (red line) and its related complexes with hit 15 and hit 17 (green and cyan lines, respectively). (C) An RMSd plot of Tel cluster 4 (black line) and its related complexes with hit 7 and hit 17 (orange and cyan lines, respectively). (D) An RMSd plot of Tel cluster 1 (black line) and its related complex with hit 15 (green line); Figure S7: a binding pose analysis of the MD-generated, most populated structure of hit 7 (panels (A,D)), hit 15 (panels (B,E)), and hit 17 (panels (C,F)) in complex with TERRA and Tel, respectively. Hit 7, hit 15, and hit 17 are depicted as orange, green, and cyan carbon sticks, respectively. The nucleic acids are shown as faded blue and grey surface for TERRA and Tel, respectively, while the guanine residues, forming the G-tetrads, are shown as lines. Moreover, the residues interacting with the ligands are depicted as faded blue and grey carbon sticks for TERRA and Tel, respectively. K+ ions are represented as pink spheres. Hydrogen bonds, salt bridges, and π–π and π–cation interactions are shown as dashed violet, red, cyan, and green lines, respectively; Figure S8: the CD thermal unfolding analysis of Tel DNA G4 in complex with hits 7, 15, and 17. The melting curves of Tel G4 (4 µM) in the absence and presence of each hit (16 µM), plotted at the wavelength corresponding to the maximum CD signal; Figure S9: the CD thermal unfolding analysis of TERRA G4 in complex with hits 7, 15, and 17. The melting curves of TERRA G4 (4 µM) in the absence and presence of each hit (16 µM), plotted at the wavelength corresponding to the maximum CD signal; and Figure S10: the MS spectra of Tel (grey squares) incubated with the indicated hits. Samples containing Tel DNA oligonucleotide (5 µM) and hit molecule (10 µM) were incubated in MS buffer (HFIP 120 mM/TEA pH 7.4, KCl 0.8 mM, isopropanol 20%) overnight before MS analysis. The relevant m/z range is shown, Figure S11: the MS spectra of TERRA (blue squares) were incubated with the indicated hits. Samples containing TERRA oligonucleotide (5 µM) and hit molecule (10 µM) were incubated in MS buffer (HFIP 120 mM/TEA pH 7.4, KCl 0.8 mM, isopropanol 20%) overnight before MS analysis. The relevant m/z range is shown; Table S1: ∆Gbind and related single contributions of the binding free energy for the best thermodynamic complex of hits 7, 15, and 17 with both Tel and TERRA G4. All thermodynamic values are reported in kcal/mol. In Table S1, we also reported the related cluster for each hit-target most stable complex, Table S2: ∆Gbind and related single contributions of the binding free energy of the most populated cluster structure of hits 7, 15, and 17 complexed with both Tel and TERRA G4. All thermodynamic values are reported in kcal/mol.

**Author Contributions:** Conceptualization, A.A. and R.R.; methodology, G.C.; software, F.O.; validation, R.R., A.M. and R.C.; formal analysis, R.R., F.M. and F.S.; investigation, F.S., G.J., M.T.D.M. and M.N.; resources, S.A., P.T. (Pierosandro Tagliaferri) and P.T. (Pierfrancesco Tassone); data curation, R.R., F.S., and M.N.; writing—original draft preparation, R.R., M.N. and F.M.; writing—review and editing, A.A. and S.N.R.; visualization, all; supervision, S.A., P.T. (Pierosandro Tagliaferri) and P.T. (Pierfrancesco Tassone); project administration, S.A., A.A., P.T. (Pierfrancesco Tassone) and S.N.R.; funding acquisition, S.A., P.T. (Pierfrancesco Tassone) and S.N.R. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the PRIN 2017 research project "Novel anticancer agents endowed with multi-targeting mechanism of action", grant number 201744BN5T to A.M., Italian Association for Cancer Research (AIRC) Investigator Grant "Small molecule-based targeting of lncRNAs 3D structure: a translational platform for treatment of multiple myeloma" Project N. 21588 to R.R. and the Investigator Grant "Characterization and targeting of G-quadruplexes in the MDM2 P2 promoter for the treatment of liposarcomas" Project N. 21850 to S.N.R.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Data is contained in the article and Supplementary Material.

**Acknowledgments:** The authors acknowledge Mu.Ta.Lig. COST ACTION CA15135.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Article* **Improved Database Filtering Technology Enables More Efficient Ab Initio Design of Potent Peptides against Ebola Viruses**

**Thomas Ripperda † , Yangsheng Yu † , Atul Verma, Elizabeth Klug , Michellie Thurman, St Patrick Reid \* and Guangshun Wang \***

> Department of Pathology and Microbiology, College of Medicine, University of Nebraska Medical Center, 985900 Nebraska Medical Center, Omaha, NE 68198-5900, USA; ripperdatj@gmail.com (T.R.); yangshengyu@unmc.edu (Y.Y.); atul.k.verma@hotmail.com (A.V.); liz.klug@unmc.edu (E.K.); michelliethurman19@gmail.com (M.T.)


**Abstract:** The rapid mutations of viruses such as SARS-CoV-2 require vaccine updates and the development of novel antiviral drugs. This article presents an improved database filtering technology for a more effective design of novel antiviral agents. Different from the previous approach, where the most probable parameters were obtained stepwise from the antimicrobial peptide database, we found it possible to accelerate the design process by deriving multiple parameters in a single step during the peptide amino acid analysis. The resulting peptide DFTavP1 displays the ability to inhibit Ebola virus. A deviation from the most probable peptide parameters reduces antiviral activity. The designed peptides appear to block viral entry. In addition, the amino acid signature provides a clue to peptide engineering to gain cell selectivity. Like human cathelicidin LL-37, our engineered peptide DDIP1 inhibits both Ebola and SARS-CoV-2 viruses. These peptides, with broad antiviral activity, may selectively disrupt viral envelopes and offer the lasting efficacy required to treat various RNA viruses, including their emerging mutants.

**Keywords:** antimicrobial peptide database; antiviral peptides; database filtering technology; SARS-CoV-2; Ebola virus; peptide design

#### **1. Introduction**

Emerging viral infections can cause harm to our society. This was made crystal clear in the previous Spanish flu and the current severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic. According to the World Health Organization, COVID-19 has caused 464 million infections and 6.06 million cumulative deaths globally as of March 21, 2022 [1]. SARS-CoV-2 has also added a burden to our economic, educational, and health care systems around the world. Similarly, Ebola viruses have a major impact on our society, causing a significant number of mortalities. Ebola viruses are highly lethal and have an average mortality rate of 50% [2,3]. Fortunately, vaccines have proven to be effective to protect humans from SARS-CoV-2 and Ebola virus (EBOV) infections [4–6]. However, viruses mutate rapidly and have evolved into numerous mutants that could compromise vaccine protection and cause breakthrough coronavirus infections [7]. Hence, there is a need to develop alternatives such as antiviral drugs to better manage viral infections.

Antimicrobial peptides (AMPs) are an important factor of the innate immune defense for both invertebrates and vertebrates, including humans [8–14]. Recently, human defensins have been shown to inhibit SARS-CoV-2 infection [15]. Likewise, human cathelicidin LL-37, another key AMP, also demonstrated antiviral activity against SARS-CoV-2 [16]. These antiviral peptides can work through different mechanisms, ranging from immune regulation to direct inactivation via membrane disruption [17–20]. Hence, natural AMPs

**Citation:** Ripperda, T.; Yu, Y.; Verma, A.; Klug, E.; Thurman, M.; Reid, S.P.; Wang, G. Improved Database Filtering Technology Enables More Efficient Ab Initio Design of Potent Peptides against Ebola Viruses. *Pharmaceuticals* **2022**, *15*, 521. https://doi.org/10.3390/ ph15050521

Academic Editor: Osvaldo Andrade Santos-Filho

Received: 29 March 2022 Accepted: 22 April 2022 Published: 24 April 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

may be engineered into new therapeutics to help control these viruses [11,12,21–23]. As a proof of concept, we previously demonstrated that LL-37 could be engineered to inhibit Ebola virus entry [24].

As an alternative approach, we found it useful to discover novel antimicrobials based on the antimicrobial peptide database (APD) [25]. This was made possible due to our systematic classification of AMPs based on the sequence length, net charge, hydrophobic ratio, post-translational modification, structure, activity, and source organism. In the APD, most of the AMPs had a peptide length of less than 60 amino acids. The net charge of a peptide was calculated at pH 7 and chemical modification was considered. The majority of the peptides in the APD have a net charge in the range of +1 to +7. The hydrophobic amino acids were defined based on the Kyte–Doolittle hydrophobic scale [26], where residues of leucine (L), isoleucine (I), valine (V), alanine (A), methionine (M), cysteine (C), and phenylalanine (F) are hydrophobic. In addition, the APD included tryptophan (W) in the hydrophobic group. The hydrophobic ratio of a peptide was calculated based on the sum of all the hydrophobic amino acids mentioned above divided by the peptide length. The dominant hydrophobic ratios for AMPs are located between 20 and 70%. Peptide structures are separated into four classes based on the presence or absence of the α-helix and β-sheet (i.e., α, β, αβ, and non-αβ). In the current APD, 478 peptides are found to have an α-helical structure based on nuclear magnetic resonance (NMR) spectroscopy and/or circular dichroism (CD). Only 88 peptides are determined to have a β-sheet structure, while 116 entries are known to adopt an αβ fold. Finally, the structures of 22 peptides are found to belong to the non-αβ class (i.e., no α-helix nor β-sheet). Both the peptide sequence and post-translational modification play an important role in determining 3D structure and biological activity (e.g., antibacterial, antiviral, antiparasitic, antifungal, and anticancer) [27].

This study reports the design of novel antiviral peptides by developing an improved version of the database filtering technology [28], making peptide design more efficient. Following up on our initial observation of Ebola virus entry inhibition, we validate the utility of the database filtering approach in designing new antiviral peptides using an Ebola virus pseudo-type system. In addition, we also evaluate the antiviral effects of DDIP1, a database-designed inhibitory peptide 1 [27], using both Ebola virus and SARS-CoV-2.

#### **2. Results and Discussion**

#### *2.1. Peptide Design via an Improved Database Filtering Technology*

The APD, originally established in 2003, was expanded and described in 2009 and 2016 [25,27]. To learn the wisdom of nature, the APD currently focuses on natural peptides with determined antimicrobial activity, known amino acid sequences, and less than 100 amino acids, leading to a widely used core data set. Our careful annotation of peptide activity data (e.g., antibacterial, antiviral, antifungal, antiparasitic, spermicidal, and anticancer) laid a solid foundation for designing peptides with the desired activity. In addition, peptide properties, such as the length, net charge, hydrophobicity, and structure, can be searched for in a systematic manner. Each parameter can be arrayed to identify the optimum. These database features enabled the development of database technology for peptide design. The database filtering technology (DFT) is one such approach that designs new peptides based on the most probable parameters within a set of peptides with common activity [28]. The DFT is an ab initio approach to designing peptides because, unlike de novo approaches, it makes no prior assumptions. The first proposed version of the DFT designed a peptide with activity against methicillin-resistant *Staphylococcus aureus* (MRSA), but did not inhibit Gram-negative bacteria. To design antiviral peptides, the first database filter selected all 190 antiviral peptides in the database (Summer 2021) as templates. These peptides were analyzed to determine the AMP length with the highest abundance. The search was conducted in 10 amino acid increments. The 21–30 residue range had the highest peptide count (Figure 1). This range would be the most probable length for our peptide design. Considering the cost of making longer peptides, however, we selected a peptide length

of 20, which was closest to the lower optimal boundary. We then derived the rest of the peptide parameters based on the antiviral peptides with 11–20 amino acids. To study the impact of the AMP length on antiviral activity, we also designed two shorter peptides with 12 and 16 amino acids. In the original DFT design method, a series of filtering steps were involved in deriving numerous parameters. For example, three separate filters were used to determine the frequency of amino acids, net charge, and the hydrophobicity of the anti-MRSA peptide. Here, we found it possible to derive these three parameters in one step by analyzing the amino acid frequency plot for the 11–20-peptide length group. In this plot (Figure 2A), the 20 standard amino acids were separated into four groups based on their common features: (1) hydrophobic (I, V, L, F, C, M, A, and W), (2) special glycine/proline (G and P), (3) polar and hydrophilic (T, S, Y, Q, and N), and (4) charged (E, D, H, K, and R) [25]. We then selected the most abundant amino acid in each group as a representative. In this manner, 20 amino acids were reduced to four for our peptide design (solid columns in Figure 2A, top panel). In this plot, the four representative amino acids were leucine (L), glycine (G), serine (S), and arginine (R). Interestingly, we were not alone in utilizing this reductionist approach, since nature also uses a small set of amino acids to design peptides such as θ-defensins. While the hydrophobic group contained I, V, F, L, and C, the other three groups consisted of a single amino acid (glycine, threonine, and arginine, respectively) (Figure 2B) [29]. This plot indicated that other amino acids in the special GP, polar, and charged groups were not preferred in the known θ-defensins. Next, in our improved design, the amino acid percentages within each amino acid group were summed (Figure 2A bottom panel) to represent the percentage for each of the four selected amino acids (L, G, S, or R) in the designed peptides. To design a 20-mer peptide, we calculated the numbers of the four amino acids L, G, S, and R for this peptide by multiplying the percentage for each amino acid with the targeted peptide length. Thus, this procedure enabled us to determine the types and contents of the four representative amino acids (including hydrophobic and cationic) in one step, increasing the efficiency of this ab initio method. Next, the most probable structure of the peptide was determined to be α-helical, because this structure had the highest occurrence in the 11–20 residue range (Figure 3). This allowed us to place charged and hydrophobic amino acids in a classic amphipathic pattern, where every two leucines were dispersed with two hydrophilic amino acids. To determine the potential combinations of these amino acids in the sequence, a statistical analysis was performed on the possible sequence motifs (four amino acids) that could be formed. The motifs with the highest abundance in the database were selected to connect the leucine pairs. The amino acid sequence of the first peptide (DFTavP1) designed to inhibit viral replication is given in Table 1. For antiviral assays, we utilized a pseudotyped Ebola virus as described previously [24]. In this study, the designed peptides displayed different degrees of inhibitory effects on pseudo-EBOV VSV-eGP (vesicular stomatitis virus-Ebola glycoprotein) infection. DFTavP1, the 20mer, showed an 18% higher inhibition than LL-37 at 2.5 µM (Figure 4). Also, the inhibition increased as the peptide dosage increased.

– **Figure 1.** Antiviral peptide count in the antimicrobial peptide database at various peptide length ranges [25]. Peptides in the range of 21–30 amino acids were dominant. This length range would be

–

ɵ

–

ɵ

the most probable length based on our previous design idea [28]. To reduce peptide cost, we selected a peptide length of 20 in this study and two even shorter peptides were also designed based on the peptide parameters with 11–20 amino acids. –

–

– ɵ ɵ **Figure 2.** A single-step amino acid analysis of antiviral peptides with 11–20 amino acids to derive multiple most probable parameters for peptide design. (**A**) Four frequent amino acids (L, G, S, and R, solid column) were identified from the four groups of amino acids. The percentages of amino acids for each group were then merged into L, G, S, and R. (**B**) The amino acid analysis of 19 θ-defensins registered in the APD. It is remarkable that in nature, θ-defensins are designed in a similar way by mainly using C, G, T, and R dotted with a few other hydrophobic amino acids. – ɵ ɵ

– – **Figure 3.** Count of antiviral peptides with different structures. Data were obtained from the antimicrobial peptide database [25] for antiviral peptides with 11–20 amino acids. Antiviral peptides with an α-helical structure were dominant with 13 counts. In addition, there were eight antiviral peptides in the selected length group with β-sheet structures. No antiviral peptides in this length group had a determined αβ structure (packed or unpacked). Only one antiviral peptide in this group had a non-αβ structure. In addition, six antiviral peptides in this group were disulfide-linked (S-S), although their 3D structures are unknown. Note that a similar plot was obtained for antiviral peptides with 21–30 amino acids (not shown). Therefore, α-helical structure was decided as the most probable structure for this design.


**Table 1.** Database-designed antiviral peptides and their properties.

<sup>a</sup> All the peptides were C-terminally amidated except for DDIP1. A tryptophan was introduced into the DFT peptides to facilitate peptide quantification (see Methods section). <sup>b</sup> Hydrophobic content. <sup>c</sup> Boman index (originally called protein binding potential in kcal/mol) [8] was renamed by the APD in 2003 [25]. <sup>d</sup> All were D-amino acids. In addition, a disulfide bond exists between the two cysteines C4 and C16 of DDIP1. αβ structure. In addition, six antiviral peptides in this group were disulfide –30 amino acids (not shown). Therefore, α

ual AMPs at 10 μM for 2 h before VSV 10 μM) for 2 h before VSV treated with individual AMPs (at 2.5, 5, or 10 μM) for 2 h before VSV **Figure 4.** Designed AMPs inhibited the infection of pseudo-EBOV virion in Vero cells. (**A**) AMPs inhibited pseudo-EBOV (VSV-eGP) infection in Vero cells. Vero cells were pretreated with individual AMPs at 10 µM for 2 h before VSV-eGP viruses were added. After 24 h of culture, Vero cells were harvested for flow cytometry analysis to measure viral infection. Percentages of GFP-positive cells represent percentages of cells infected with VSV-eGP. (**B**) DFTavP1 and its derivatives inhibited pseudo-EBOV (VSV-eGP) infection in Vero cells in a dose-dependent manner. Since AMPs usually inhibit microbes at micromolar, Vero cells were pretreated with individual AMPs (at 2.5, 5, or 10 µM) for 2 h before VSV-eGP viruses were added. After 24 h of culture, Vero cells were harvested for flow cytometry analysis to measure viral infection. (**C**) DFTavP1, DDIP1, and LL37 inhibited pseudo-EBOV (VSV-eGP) infection in Vero cells in a dose-dependent manner. Vero cells were pretreated with individual AMPs (at 2.5, 5, or 10 µM) for 2 h before VSV-eGP viruses were added. After 24 h of culture, Vero cells were harvested for flow cytometry analysis to measure viral infection.

#### *2.2. Validation of the Most Probable Parameters*

Next, we validated the improved methodology by designing additional peptides with parameters deviated from the optima. A 16 mer (DFTavP2) was designed in the same manner as DFTavP1 (sequence in Table 1). This peptide length decrease caused a 33% decrease in the viral inhibition of DFTavP2 at 5 µM compared to DFTavP1. A further

−

deviation from the most probable length led to DFTavP3 (a 12 mer in Table 1), which did not inhibit the VSV–eGP (Figure 4B). Hence, the peptide close to the most probable peptide length range was more potent than the sequence-shortened counterparts.

To further validate the most probable amino acids, leucine in DFTavP2 was converted to valine, leading to DFTavP4 (sequence in Table 1). The flow cytometry results suggested that DFTavP4 entirely lost the ability to inhibit the Ebola pseudo-virus (Figure 4B). This observation indicated the significance of the most probable hydrophobic leucine in conferring antiviral activity to the peptide. The reason for this might be twofold. First, leucine is more hydrophobic than valine, enabling a better binding to viruses. Second, leucine has a higher potential than valine in forming the helical structure required for target binding [25]. To fully validate this, leucine may be converted to other hydrophobic amino acids (I, F, A, M, C, and W) as well. A previous study revealed the peptide became less soluble when substituted by isoleucine [28]. Thus, we did not make the same change. As alanine is even less hydrophobic than valine, we predicted that alanine substitution would also lead to an inactive peptide. We did not test methionine, since this residue is not favorable for peptide design due to its readiness of being oxidized. While phenylalanine and tryptophan substitutions may be of interest for future studies, we included one, W, here for UV quantification. In a different design below, two leucines were transformed to cysteines.

Additional proof of the most probable principle in peptide design came from our previous database-guided design. The GLK-19 peptide (a 19-residue peptide containing G, L, and K) designed based on the frequently occurring amino acids from amphibian peptides [25] became active to human immunodeficiency virus type 1 (HIV-1) when all lysines were replaced with arginines. These changes were determined based on the arginine/lysine ratios in antibacterial, antifungal, antiviral, and anticancer peptides in the APD, where only in the antiviral peptides was the arginine/lysine ratio greater than one [30]. Since cysteine is also abundant in the hydrophobic group of the antiviral peptide amino acid signature (e.g., see Figure 2A), we changed two leucines in GLR-19 to two cysteines at positions 4 and 16. These changes led to DDIP1 with a disulfide bond. For this study, we created a new version of DDIP1, where all L-amino acids were converted to D-amino acids to gain stability to proteases. Our previous study suggested the importance of peptide stability for inhibiting Ebola viruses [24]. In the inhibition experiment, DDIP1 displayed 7% inhibition at 2.5 µM and 14% inhibition at 5 µM against the virus. Human LL-37, a known antiviral peptide [16,24,30], showed a higher viral inhibition than DDIP1 (Figure 4C).

#### *2.3. Antiviral Efficacy of Peptides Treated before or after Viral Infection*

We then compared the treatment efficacy of DDIP1 and DFTavP1 postinfection. Human LL-37 was included as a positive control. At 0, 2, and 4 h postinfection, a dose-responsive viral inhibition was observed for all the peptides, indicating the effect results from the peptide treatment (Figure 5A). When treated immediately after infection (delay 0 h), DDIP1, DFTavP1, and LL-37 displayed ~15%, 69%, and 9% inhibitions at 2.5 µM, respectively. Hence, DDIP1 and DFTavP1 demonstrated a higher inhibition than LL-37. DFTavP1, in the 4 h postinfection treatment study, showed a major decrease of 54% inhibition at 5 µM. The inhibition of the Ebola pseudovirus was reduced with a longer delay after infection for both DDIP1 and DFTavP1, implying an action on viral entry. The decrease in inhibition was caused by the decrease in viral load in the extracellular fluid because the intracellular viral load was increasing. Interestingly, the effect of human LL-37 was relatively constant. More experiments are required to determine the precise mechanism of DFTavP1 or DDIP1.

We also tested the antiviral activity of DDIP1 against SARS-CoV-2. Like LL-37, DDIP1 showed an inhibitory effect on SARS-CoV-2 in a dose-dependent manner (Figure 5B). At 5 µM, LL-37 inhibited 18% of the virus, while DDIP1 suppressed 42%. At 10 µM, DDIP1 (~55% inhibition) was more potent than LL-37 (~27% inhibition) as well. These results reinforced the antiviral potency of the database-designed antiviral peptides.

AMPs (at 2.5, 5, or 10 μM) at the μM) for 2 h before SARS **Figure 5.** Designed AMPs inhibited EBOV cell entry and SARS-CoV-2 infection. (**A**) AMPs targeted pseudo-EBOV (VSV-eGP) at the early stage of infection. Vero cells were treated with individual AMPs (at 2.5, 5, or 10 µM) at the same time with VSV-eGP (0 h), or at 2 or 4 h after VSV-eGP infection. Vero cells were harvested for flow cytometry analysis of GFP levels at 24 h post infection. (**B**) DDIP1 inhibited SARS-CoV-2 infection. Vero cells were pretreated with individual AMPs (at 2.5, 5, or 10 µM) for 2 h before SARS-CoV-2 viruses were added. After 24 h of culture, Vero cells were harvested for immunofluorescence staining to access infection activity.

#### *2.4. Different Requirements for Antiviral and Antibacterial Properties*

– – As our antiviral peptides were designed based on the antimicrobial peptide database [25], one may wonder whether they are active against bacteria. To obtain a more complete picture, we used two Gram-positive and four Gram-negative bacterial strains and the minimum inhibitory concentrations (MIC) of the peptides are provided in Table 2. In contrast to the antiviral case, DFTavP1 was less active against bacteria than the two shortened peptides DFTavP2 and DFTavP3 (low MIC values). Interestingly, DDIP1 displayed excellent MIC values against all the tested antibiotic-resistant pathogens (e.g., MRSA, *Escherichia coli*, *Pseudomonas aeruginosa*, and *Klebsiella pneumoniae*) in the range of 2–8 µM, comparable to the two shorter DFT peptides (Table 2). These results unveiled different parameter requirements for designing antibacterial and antiviral peptides. We speculate that such a difference primarily resulted from the potential differences in pathogenic targets. It is likely that our designed peptides (e.g., DDIP1) targeted the viral envelope since many AMPs act on bacterial membranes [8–10]. Another in silico study screened antiviral peptides by docking known antiviral peptides to the major protease (MPro ) of SARS-CoV-2 without experimental validation [31]. Future studies on both viruses and bacteria could validate such mechanisms of action. Our speculation was supported by our previous knowledge on human cathelicidin LL-37 peptides (e.g., GF-17 and GI-20), which inhibit both bacteria and viruses [32]. However, the peptide lost its antiviral effects when the helical structure was disrupted by partially incorporating D-amino acids. In terms of the mechanism, the helical structure of LL-37 peptides was not a must for targeting bacterial membranes, but essential to inhibit HIV-1 reverse transcriptase [33]. In the case of Ebola viruses, only the engineered peptides such as 17BIPHE2 were effective at the endosomal cell entry step by impairing the cathepsin-B-mediated processing of the Ebola viral glycoprotein [34].

**Table 2.** Minimum inhibitory concentration (µM) of database-designed antiviral peptides.


#### *2.5. Cytotoxicity of Antiviral Peptides*

For therapeutic use, it is important that the designed peptides had minimal toxic effects on mammalian cells. To evaluate the cytotoxicity of the new peptides, we utilized several cell lines. Vero cells are a model cell for viral infection, derived from the kidney epithelia of the African green monkey. It appeared that Vero cells were highly sensitive to DFTavP1 (TC<sup>50</sup> 2.4 µM), but became less sensitive to DDIP1 (TC<sup>50</sup> 13.0 µM in Table 3). Because SARS-CoV-2 infects lung cells, we also tested their cytotoxicity to Calu3 cells. Both DFTavP1 and DDIP1 showed a TC<sup>50</sup> in the range of 12–14 µM, comparable to the human host defense cathelicidin peptide LL-37 (TC<sup>50</sup> 19.1 µM). These results indicated a direct antiviral effect of our peptides at a low peptide concentration (e.g., 2.5 µM), where its secondary toxic effect on host cells might play a role. The toxic effect of the designed peptides depended on cell types. The 50% hemolytic concentration (HC50) for DDIP1 was greater than 160 µM, the highest concentration we tested (Table 3). However, both DFTavP1 and DFTavP2 were highly hemolytic, with an HC<sup>50</sup> below 12.5 µM, while DFTavP3, the shortest 12-mer peptide, had an HC<sup>50</sup> of 50 µM. Likewise, DDIP1 was poorly hemolytic to murine red blood cells as well (HC<sup>50</sup> > 160 µM), while DFTavP1 was highly hemolytic (Table 3). In the case of skin HaCaT cells, the HC<sup>50</sup> was 25 µM for DFTavP1, but greater than 100 µM for DDIP1. These results confirmed the toxicity of DFT peptides. However, the engineered peptide DDIP1 was much more selective and showed cell-dependent toxicity (Table 3).

**Table 3.** Cytotoxicity comparison of antiviral peptides to different cells.


<sup>a</sup> hRBC, human red blood cells; <sup>b</sup> mRBC, BALB/c mouse red blood cells; <sup>c</sup> TC50, the peptide concentration that killed 50% of human keratinocytes.

#### **3. Materials and Methods**

#### *3.1. The Antimicrobial Peptide Database*

The APD was originally established in 2003. Since then, it has been updated regularly and expanded substantially [25,27]. For scientific rigor, the APD applied a set of criteria for peptide registration (natural peptides, known sequences, known activity, and a size of less than 100 amino acids). Thus, this database currently focuses on natural peptides from six life kingdoms, including bacteria, archaea, protists, fungi, plants, and animals [25]. After over 18 years, 26 types of peptide activities (e.g., antibacterial, antiviral, antifungal, antiparasitic, spermicidal, and anticancer) have been annotated. Such a well-annotated peptide sequence–activity database provides a unique platform for peptide prediction and design. The APD enables a thorough statistical analysis of natural AMPs through a variety of search functions and database filters. Such an analysis identifies key parameters for peptide design [27].

#### *3.2. Database Filtering Technology (DFT) vs. Improved DFT*

The rigorous registration of the data in the APD set the stage for us to develop databaseguided approaches for peptide discovery, ranging from database screening to database filtering technology [27]. The original database filtering technology [28] consisted of multiple database filters that allowed us to derive a family of peptides with desired biological activity, followed by deriving key peptide parameters step by step. In the original DFT, the first filter selected a set of peptides with activity against Gram-positive bacteria, whereas the improved DFT used here selected a group of peptides annotated with antiviral activity. The second filter is common and was used to identify the most probable peptide length. This was achieved by statistically analyzing the peptides in the APD in bins (every 10 s), so that the peptide length with the highest count was found. Subsequently, the original DFT identified the most probable amino acid frequency, net charge, and hydrophobic amino acid in three steps. In

contrast, the improved DFT derived frequency, net charge, and hydrophobic content in one step. The rest of the steps in identifying the most probable structure and motifs are shared by both methods and were detailed elsewhere [28].

#### *3.3. Chemicals and Peptides*

All the chemicals were purchased from established vendors such as Fisher and Sigma. Peptides were created by Genemed Synthesis, Inc. (San Antonio, TX, USA). All the peptides were highly purified and reached over 95% purity based on HPLC. The correct mass of each peptide was validated by mass spectrometry (Shimadzu MALDI-8020, Kyoto, Japan or Thermo Fisher Scientific SALDI-TOF-MS, Waltham, MA, USA). The incorporation of a tryptophan (W) for each peptide at position 2 (in replacement of a leucine in Table 1) facilitated peptide quantification on a UV spectrometer (Ultraspec 1100 pro, Amersham Biosciences) at 280 nm.

#### *3.4. SARS-CoV-2 Safety Statement*

All experiments involving SARS-CoV-2 were conducted in an approved BSL-3 facility of the University of Nebraska Medical Center (UNMC) by dedicated trained personnel.

#### *3.5. Antiviral Assays*

Peptide activity against Ebola pseudo-virus was tested using established lab protocols as described [24]. Inhibitory effects on SARS-CoV-2 were tested as below. Live virus experiments were performed in the BSL3 laboratory at the UNMC (Omaha, NE, USA). Briefly, Vero cells (~10,000 cells/well) were seeded in a 96-well plate and cultured in complete medium overnight. In the prevention experiments, cells were pretreated with compounds for 2 h at 37 ◦C. Treatments were washed off and cells were infected with SARS-CoV-2 WI at a multiplicity of infection (MOI) of 0.1 in complete media. After 24 h, cells were fixed with 4% buffered paraformaldehyde for 30 min at room temperature. The fixed cells were washed with phosphate-buffered saline (PBS), permeabilized with 0.1% (*v*/*v*) Triton X-100 solution for 10 min, then blocked with 3% bovine serum albumin–PBS solution. The cells were incubated with anti-SARS-CoV-2 spike protein rabbit monoclonal antibody (Sino Biological, Beijing, China) at 1:1000 overnight at 4 ◦C, followed by incubation with 1:2000 diluted Alexa Fluor 488 conjugated secondary antibody (Jackson ImmunoResearch) for 1 h at room temperature. Cell nuclei were counterstained using Hoechst 33,342 (Invitrogen, #H3570), and cytoplasmic membranes were stained with CellMask (Invitrogen, #C10046). Noninfected cells and untreated virus-infected cells were included as internal controls. Cells were imaged using a high-content analysis system, Operetta CLS (PerkinElmer Inc., Waltham, MA, USA). Percentage inhibition of viral infection was calculated using Harmony 4.9 software (PerkinElmer Inc.).

#### *3.6. Antibacterial Assays*

Peptide activity against bacteria was tested using established lab protocols as described previously [28]. In brief, a peptide concentration gradient with two-fold dilution was created in the 96-well polystyrene microplates at 10 µL per well. From overnight cultures, six bacteria (Table 2) were grown to the logarithmic phase (i.e., optical density at 600 nm ≈ 0.5), diluted to ~10<sup>5</sup> CFU/mL, and partitioned into the 96-well microplates at 90 µL per well. The microplates were incubated at 37 ◦C overnight and read on a ChroMate 4300 Microplate Reader at 600 nm (GMI, Ramsey, MN, USA).

#### *3.7. Cytotoxicity*

Peptide toxicity was evaluated by using human red blood cells (hRBCs) and other host cells. Hemolysis was conducted as described elsewhere [28]. Briefly, hRBCs, obtained from the UNMC Blood Bank, were washed three times with phosphate-buffered saline (PBS) and diluted to a 2% solution (*v*/*v*). After peptide treatment, incubation at 37 ◦C for one hour, and centrifugation at 13,000 rpm, aliquots of the supernatant were carefully transferred to

a fresh 96-well microplate. The amount of hemoglobin released was measured at 545 nm. The percent lysis was calculated by assuming 100% release when human blood cells were treated with 2% Triton X-100, and 0% release when incubated with PBS buffer. The peptide concentration that caused 50% lysis of hRBCs was defined as HC50.

Vero cells (ATCC, CCL-81) or Calu-3 (ATCC, HTB-55) cells were seeded into 96-well tissue culture plates (Greiner Bio-One, Monroe, NC) and treated with different concentrations of peptides for 24/48 h at 37 ◦C. Cell viability was examined by Vybrant® MTT Cell Proliferation Assay Kit (Thermo Fisher Scientific, Grand Island, NY) following the manufacturer's instructions. Toxicity assays using HaCaT cells were conducted similarly as described elsewhere [34].

#### **4. Conclusions**

It is useful to identify antiviral peptides that can eliminate all kinds of SARS-CoV-2 variants: alpha, beta, delta, omicron, etc. Based on the antimicrobial peptide database [25], we previously developed a database filtering technology [28], which was recently proved to be useful for us in the design of anti-MRSA peptides with systemic efficacy in mice [34]. Here, we improved the original stepwise database filtering method by deriving multiple most probable parameters in a single amino acid composition analysis for designing antiviral peptides, thereby accelerating this peptide design process based on the antimicrobial peptide database. Our designed peptides were indeed inhibitory to the Ebola pseudo-virus. In addition, both human cathelicidin LL-37 and DDIP1 could inhibit SARS-CoV-2. The decrease in or loss of activity of peptides created with deviated most probable parameters further validated the improved method. Our study demonstrated that the designed peptides were inhibitory to the Ebola pseudo-virus in either prevention or treatment experiments. The reduced antiviral effect of the designed peptides after additional delay postinfection implied the blockage of viral entry. Our engineered disulfide-containing peptide DDIP1 with numerous desired properties could be further optimized to remove toxicity and developed into a novel treatment for viral infections such as COVID-19. This should require the study of pharmacokinetics and pharmacodynamics of the optimized peptide in animal models. It is predicted that DDIP1 is effective against a variety of mutated viral strains, since it may damage viral envelopes to prevent infection in a similar manner to disrupt bacterial membranes. Our discovery of novel antiviral peptides may be further accelerated by applying the machine learning/artificial intelligence algorithms [35–43] that enable the prediction of both the antiviral activity and toxicity of peptides with the accumulation of experimental data of AMPs.

**Author Contributions:** G.W. conceived the study; Y.Y., T.R., S.P.R. and G.W. designed the study; T.R., Y.Y., A.V., E.K. and M.T. conducted the experiments; T.R., Y.Y., S.P.R. and G.W. wrote the manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** This study was supported in part by the NIGMS, NIH grant number GM138552 and the state grant to GW.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Publicly available datasets were analyzed in this study. This data can be found here: https://aps.unmc.edu (accessed on 1 June 2021) [44].

**Acknowledgments:** T.R. is a summer student who gratefully appreciates the support of the Summer Undergraduate Research Fellowship provided by the Department of Pathology and Microbiology, University of Nebraska Medical Center.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **References**

