*Article* **Re-Exploring the Ability of Common Docking Programs to Correctly Reproduce the Binding Modes of Non-Covalent Inhibitors of SARS-CoV-2 Protease Mpro**

**Davide Bassani , Matteo Pavan , Giovanni Bolcato, Mattia Sturlese and Stefano Moro \***

Molecular Modeling Section (MMS), Department of Pharmaceutical and Pharmacological Sciences, University of Padova, 35131 Padova, Italy; davide.bassani.1@studenti.unipd.it (D.B.); matteo.pavan.7@phd.unipd.it (M.P.); giovanni.bolcato.1@phd.unipd.it (G.B.); mattia.sturlese@unipd.it (M.S.) **\*** Correspondence: stefano.moro@unipd.it

**Abstract:** In the latest few decades, molecular docking has imposed itself as one of the most used approaches for computational drug discovery. Several docking benchmarks have been published, comparing the performance of different algorithms in respect to a molecular target of interest, usually evaluating their ability in reproducing the experimental data, which, in most cases, comes from X-ray structures. In this study, we elucidated the variation of the performance of three docking algorithms, namely GOLD, Glide, and PLANTS, in replicating the coordinates of the crystallographic ligands of SARS-CoV-2 main protease (Mpro ). Through the comparison of the data coming from docking experiments and the values derived from the calculation of the solvent exposure of the crystallographic ligands, we highlighted the importance of this last variable for docking performance. Indeed, we underlined how an increase in the percentage of the ligand surface exposed to the solvent in a crystallographic complex makes it harder for the docking algorithms to reproduce its conformation. We further validated our hypothesis through molecular dynamics simulations, showing that the less stable protein–ligand complexes (in terms of root-mean-square deviation and root-mean-square fluctuation) tend to be derived from the cases in which the solvent exposure of the ligand in the starting system is higher.

**Keywords:** molecular docking; molecular dynamics; SARS-CoV-2; main protease, Mpro ; docking benchmark

#### **1. Introduction**

In the 1980s, with the first study provided by Kuntz et. al [1], the computational technique of molecular docking had its birth. The efficiency, speed, and robustness of this method make its presence a constant in every structure-based drug-discovery pipeline [2]. To give a brief explanation, molecular docking consists of a multistep computational process that aims to find the best conformation of a molecule to bind to another to form a stable complex [3]. In the field of medicinal chemistry, as is deductible, its main application is finding the best molecules to bind in a firm way to the desired target (a protein, a nucleic acid, etc.). The algorithm starts with the exploration of the conformations space of the ligands (exploiting the so-called "search algorithm"). The conformations (called "poses") are then classified by a "scoring function", which attributes a numeric value to the goodness of the interaction according to energetical and/or geometrical function.

After a series of iterations, the final conformations are presented from the program to the user and ranked by the internal scoring function [4].

In the last 30 years, many docking programs have been developed. Among them, GOLD [5] (a genetic docking algorithm developed by the Cambridge Crystallographic Data Center—CCDC), Glide [6] (a systematic docking program released under license by Schrödinger), AutoDock [7] (a non-commercial genetic algorithm from The Scripps

**Citation:** Bassani, D.; Pavan, M.; Bolcato, G.; Sturlese, M.; Moro, S. Re-Exploring the Ability of Common Docking Programs to Correctly Reproduce the Binding Modes of Non-Covalent Inhibitors of SARS-CoV-2 Protease Mpro . *Pharmaceuticals* **2022**, *15*, 180. https://doi.org/10.3390/ ph15020180

Academic Editor: Osvaldo Andrade Santos-Filho

Received: 3 January 2022 Accepted: 28 January 2022 Published: 31 January 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

Research Institute), AutoDock VINA [8] (created by the same organization and released for non-commercial use), and PLANTS [9] (an algorithm based on an "Ant Colony Optimization" method) have gained popularity.

The performance of molecular docking programs can be measured by evaluating their ability to reproduce the experimental structural data, such as the crystallographic coordinates of a ligand into its binding site [10]. This ability has been evaluated in several benchmarks [11,12] to rank the performance of different algorithms regarding a specific target, usually using as the key parameter the root-mean-square deviation (RMSD) between the coordinates of the different poses given by the program and the crystallographic ones.

The ability to reproduce a crystallographic conformation strongly relies on different factors. First, the geometrical characteristics of the binding site, such as extension and shape, play a very important role; it is known that the performance of the algorithms has been improved to dock molecules in "cavities" or "pockets", rather than surfaces of proteins [13]. Second, the nature and the dimensions of the ligand are also crucial. Indeed, very small ligands may explore different places in a binding site, and the interactions that they can establish are usually few in number, reducing the "synergism" which could induce a molecule to keep a peculiar shape in a pocket [14]. On the other hand, even if drug-like molecules generally have higher conformational freedom, their dimensions force them to be oriented into a site in a more conserved way, so they have less roto-translational freedom.

In this study, we examined the ability of three docking programs characterized by diverse conformational sampling algorithms to efficiently reproduce the crystallographic pose of different ligands bound in different sites of a protein. To accomplish this task, a target in which several crystal structures were solved with the ligands located in different sites of the macromolecule itself was needed. To this scope, we considered a very recent and relevant target in the current pharmaceutical scenario, namely the SARS-CoV-2 main protease (Mpro).

In the last couple of years, with the pandemic spreading of the SARS-CoV-2 virus, the world of medical sciences had found itself fighting a new and dangerous adversary [15,16]. This biological entity, which is part of the coronavirus family, has been demonstrated to cause a pulmonary infection which eventually leads to serious complications, as witnessed by the high number of deaths that have already been linked to it (more than 5 million, at the present day [17]). The replication cycle of this virus strongly relies on the activity of its main protease (known as Mpro or 3CLpro, a crystallographic structure example is reported in Figure 1) [18]. Indeed, this protein is responsible for the cleavage of the propeptide transcribed by the viral genome. In this way, the formation of all the functional proteins for the building of new virions takes place, and so the viral infection can proceed. Even if many molecules have been shown to bind to Mpro [19] and inhibit its activity, and even if a molecule is currently in phase III clinical trial for this purpose (PF-07321332, from Pfizer [20,21]), no drug has already been approved by the European Medicinal Agency for the treatment of SARS-CoV-2 (also called "COVID-19"). Computational methods have already proven to be beneficial in the research for new potential inhibitors for Mpro, as the literature witnesses [22,23]. In this work, we decided to implement a molecular-dockingbased approach relying on the programs GOLD, Glide, and PLANTS. These algorithms are considered "orthogonal" because they are characterized by diverse placing and scoring algorithms to obtain the best solution to the "protein–ligand posing problem". Each of these programs was used to dock each of the different non-covalent ligands of the various crystal structures of Mpro, and this allowed us to evaluate the factors which influence the variability in reproducing the crystallographic poses. A self-docking protocol similar to the one herein reported had already been developed by our laboratory, with the name "DockBench". This program was implemented with success in several workflows, as the literature assesses [24,25]. In this study, a slightly modified version of that tool was used, which exploits only three docking programs at the present moment but can expand the analysis of the results obtained.

**Figure 1.** Representation of the crystal structure of Mpro (PDB:7L10). The two monomers composing the protein are colored differently, while the residues of the catalytic dyad, Cys145, and His41 are labeled in each of the monomers.

Looking at the docking benchmarking protocols on Mpro, we see that a remarkable study has already been conducted and published by Zev et al. [26]. In that specific work, six different docking programs were evaluated in their performance in reproducing the Mpro non-covalent ligands' crystallographic poses, and three of those algorithms have also been compared in their ability to correctly place Mpro covalent ligands into their proper binding site. In our work, we decided to expand the considerations brought by that study, evaluating specifically how docking performance changes in respect of the crystallographic data that have to be reproduced.

Indeed, we considered in our calculations parameters such as the solvent exposure of the ligand and the influence of the crystallographic water molecules in docking calculations, focusing our evaluations just on non-covalent Mpro ligands. We executed the experiment in two different scenarios, one which excluded the crystallographic waters from the calculation (which we will name "Scenario 1"), and one which induced the docking programs to consider them (called "Scenario 2"). After that, we compared the docking results with the percentage of solvent exposure of the crystallographic pose of the ligand, successfully confirming that a higher solvent exposure tendentially reflects a worsening in the ability to reproduce the crystallographic pose by the algorithms (that, as already mentioned, are better trained for "cavities" rather than "surfaces"). To further investigate this aspect, we expanded our computational analysis by performing a molecular dynamics (MD) experiment, in which each crystallographic ligand was left free to move for 5ns (three replicas per system). This approach (known as "MD post-docking") has already become part of our computational protocol [27,28] and is based on the fact that the conformations of the ligands, which are less prone to be displaced from their initial position during the simulation, are related to higher stability and binding strength with the target. In the case presented, this principle was applied directly to the crystallographic conformations of the

ligands rather than to docking poses. This was performed because the goal was not to select the most promising molecules in binding to a specific region of the protein; instead, the objective was to elucidate which are the features of the crystallographic ligands that tend to guarantee a tighter binding with the receptor. Our evaluation demonstrated that the molecules bound to the orthosteric pocket of Mpro keep their position much stronger than the molecules crystallized on other sites, further validating our solvent exposure-based theory. A representation of the Mpro ligands crystallized in the various sites of the protein is given in Figure 2.

**Figure 2.** Representation of all the crystallographic ligands of Mpro superposed. To give a better view, just one protein structure is represented (the one coming from PDB:7L10). The ligands which are crystallized inside the catalytic pocket are colored in magenta, while all the small molecules crystallized outside the orthosteric binding site are colored in cyan.

#### **2. Results and Discussion**

#### *2.1. Scenario 1—Docking Calculations without Considering the Crystallographic Water Molecules*

The results of our docking protocol for this section (which are numerically reported in the Supplementary Materials File "Selfdocking\_scenario1.csv") are graphically represented with colormaps. All the colormaps present in this study are based on a colorimetric scale delineating the RMSD values, starting from 0 Å, which corresponds to a molecular docking pose exactly super posable to the crystallographic one (maximum docking performance, represented by the dark blue color), and reaching values of 5 Å or higher (minimum docking performance, all represented by the dark red color), corresponding to a very low level of overlap between the coordinates of the pose produced and the ones of the crystallographic conformation. The colormaps in Figure 3 show the self-docking results obtained on the different Mpro crystal structures in the case in which water molecules are not considered in the calculation. As is depicted, the RMSD values were far lower for all the complexes in which the crystallographic ligand is located in the orthosteric pocket.

**Figure 3.** Colormaps represent the results of the self-docking experiments in the case in which the crystallographic water molecules are not considered during the docking runs. (**A**) Results coming from the average of the RMSDs of all the poses for each docking run. (**B**) Results derived just from the RMSD between the crystallographic ligand coordinates and the pose classified as the best from the scoring function. (**C**) Results of the self-docking experiments if just the pose showing the best RMSD value between its coordinates and the crystallographic ones are retained. The *x*-axis lists all the different protein–ligand complexes, which are plotted against the different pairs docking program-scoring function used for this study, reported in the *y*-axis.

To give a better resolution of this, we separated each map into two different colormaps, one grouping all the 78 proteins in which the ligand is located into the catalytic pocket, and one including all other cases (41 complexes).

We analyzed the data coming from the calculations, and we determined that, looking at all the complexes with all the different couples docking program-scoring functions, we see that the average values of all the RMSDs obtained was 5.76Å ("RMSD\_average"). Looking at the average of the RMSDs coming from the poses which were scored as the best ones from the algorithms' scoring functions ("RMSD\_scor\_func"), we see that the value was 5.10Å. If the lowest RMSD values only are taken into account for each docking run ("RMSD\_sorted"), the mean of the values was 3.70Å.

The average values were also calculated separately for all the complexes in which the crystallographic ligand is located in the catalytic pocket, and for all other cases. The colormaps for these different conditions are reported in Figures 4 and 5.

First, the analysis focused on the complexes having the crystallographic ligand located within the orthosteric pocket. For this set of systems, we calculated the average RMSD value of all the poses ("RMSD\_average"), which was revealed to be 4.54Å. Then we computed the average of the RMSD values coming from the poses which were ranked with the best score from the scoring functions ("RMSD\_scor\_func"), and its value was 3.43Å. Finally, the average RMSD value of the poses with the lowest RMSD in each run was calculated ("RMSD\_sorted"), and its measure was 2.45Å.

Second, the same steps were performed for the rest of the protein–ligand complexes, which are the ones in which the crystallographic ligand is located outside the orthosteric binding site. Moreover, in this case, the first passage involved the calculation of the average RMSD value of all the poses generated for these systems ("RMSD\_average"), and its measure was 8.08Å. Then, the mean of the RMSD values coming from the poses which received the highest rank from the scoring functions was calculated ("RMSD\_scor\_func") and was revealed to be 8.29Å. In the end, the average value of the lowest RMSDs of each run was computed ("RMSD\_sorted"), and its measure was shown to be 6.08Å.

**Figure 4.** Colormaps represent the results of the self-docking experiments just for the ligands crystallized inside the orthosteric pocket in the situation in which the crystallographic water molecules are not considered during the docking runs. (**A**) Results coming from the average of the RMSDs of all the poses for each docking run. (**B**) Results derived just from the RMSD between the crystallographic ligand coordinates and the pose classified as the best from the scoring function. (**C**) Results of the self-docking experiments if just the pose showing the best RMSD value between its coordinates and the crystallographic ones is retained. The *x*-axis lists all the different protein–ligand complexes, which are plotted against the different pairs docking program-scoring function used for this study, reported in the *y*-axis.

**Figure 5.** Colormaps represent the results of the self-docking experiments just for the ligands crystallized outside the orthosteric pocket in the case in which the crystallographic water molecules are not considered during the docking runs. (**A**) Results coming from the average of the RMSDs of all the poses for each docking run. (**B**) Results derived just from the RMSD between the crystallographic ligand coordinates and the pose classified as the best from the scoring function. (**C**) Results of the self-docking experiments if just the pose showing the best RMSD value between its coordinates and the crystallographic ones is retained. The *x*-axis lists all the different protein–ligand complexes, which are plotted against the different pairs docking program-scoring function used for this study, reported in the *y*-axis.

The results obtained for Scenario 1 are summarized in Table 1.


**Table 1.** Table representing the self-docking results obtained for Scenario 1.

#### *2.2. Scenario 2—Docking Calculations Considering the Crystallographic Water Molecules*

The outcomes of our molecular docking experiment for this section (which are reported in the Supplementary Materials File "Selfdocking\_scenario2.csv") are graphically represented with colormaps, which were created with the same criteria listed in the previous paragraph. The results reported in the colormaps in Figures 6–8 reveal the self-docking performance obtained on the different Mpro crystal structures in the case in which the crystallographic water molecules within 5 Å from the ligand were retained during the calculation. Moreover, in this case, it is easy to notice that the values result in being far better for the complexes in which the small molecule of interest is in the orthosteric binding site.

**Figure 6.** Colormaps represent the results of the self-docking experiments in the case in which the crystallographic water molecules at 5 Å or nearer to the ligand itself are taken into account during the docking runs. (**A**) Results coming from the average of the RMSDs of all the poses for each docking run. (**B**) Results derived just from the RMSD between the crystallographic ligand coordinates and the pose classified as the best from the scoring function.(**C**)Results of the self-docking experiments if just the pose showing the best RMSD value between its coordinates and the crystallographic ones is retained. The *x*-axis lists all the different protein–ligand complexes, which are plotted against the different pairs docking program-scoring function used for this study, reported in the *y*-axis.

**Figure 7.** Colormaps represent the results of the self-docking experiments just for the ligands crystallized inside the orthosteric pocket in the situation in which the crystallographic water molecules at 5 Å or nearer to the ligand itself are taken into account during the docking runs. (**A**) Results coming from the average of the RMSDs of all the poses for each docking run. (**B**) Results derived just from the RMSD between the crystallographic ligand coordinates and the once of the pose classified as the best from the scoring function. (**C**) Results of the self-docking experiments if just the pose showing the best RMSD value between its coordinates and the crystallographic ones are retained. The *x*-axis lists all the different protein–ligand complexes, which are plotted against the different pairs docking program-scoring function used for this study, reported in the *y*-axis.

**Figure 8.** Colormaps represent the results of the self-docking experiments only for the ligands crystallized outside the orthosteric pocket in the situation in which the crystallographic water molecules at 5 Å or nearer to the ligand itself are taken into account during the docking runs. (**A**) Results coming from the average of the RMSDs of all the poses for each docking run. (**B**) Results derived just from the RMSD between the crystallographic ligand coordinates and the once of the pose classified as the best from the scoring function. (**C**) Results of the self-docking experiments if just the pose showing the best RMSD value between its coordinates and the crystallographic ones are retained. The *x*-axis lists all the different protein–ligand complexes, which are plotted against the different pairs docking program-scoring function used for this study, reported in the *y*-axis.

Similar to the first scenario, we divided each colormap into two sets, one with the 78 proteins having the ligand located into the catalytic pocket, and the other including all the remaining cases (41 proteins). Considering all the protein–ligand complexes with all the different pairs docking program-scoring function, the mean values of all the RMSDs obtained ("RMSD\_average") was 5.64Å, but focusing only on the mean of the RMSDs derived from the poses which were given the highest rank from the algorithms ("RMSD\_scor\_func"), the value resulted to be 4.83Å. Looking only at the best RMSDs for each docking run ("RMSD\_sorted"), we see that the average of the values was 3.68 Å.

As already performed for Scenario 1, also in Scenario 2, the analysis was divided between the complexes in having the crystallographic ligand crystallized into the catalytic pocket, and for all other situations.

We reported the colormaps which resulted from this evaluation, and those are represented in Figures 7 and 8.

We started from the complexes in which the ligand is located inside the catalytic pocket in the crystal. For those systems, the mean of the RMSD values coming from all the poses("RMSD\_average") resulted in being 4.22Å. Then, the average of the RMSDs derived from the scoring function highest-ranked poses in all the docking runs ("RMSD\_scor\_func") was computed, and its value was 3.11Å. In the end, also the average value between the lowest of the RMSDs in each docking run was calculated ("RMSD\_sorted") and was revealed to be 2.26Å.

Second, we repeated the analysis for all the complexes in which the crystallographic ligand is located outside the orthosteric pocket. For these systems, the average of the RMSD coming from all the poses collected in the docking runs ("RMSD\_average") was calculated to be 8.32Å. Next, we computed the mean of the RMSD values derived from the poses which received the highest score (from the scoring functions) in each run ("RMSD\_scor\_func"), and this value was 8.11Å. Last, also the average value between the lowest of the RMSDs in each docking run was calculated ("RMSD\_sorted"), giving 6.36 Å.

The results obtained for Scenario 1 are summarized in Table 2.


**Table 2.** Table representing the self-docking results obtained for Scenario 2.

Just analyzing the numbers coming from the average values allows us to see how the performance of the docking programs dramatically increases when the ligand is docked inside the catalytic pocket rather than on the surface of the protein, in line with the fact that the molecules have a limitation in the conformation that they can explore into a binding site. Together with this, the ligands can exploit their accessible surface area to interact with the protein more efficiently, following the principle of "complementarity" [29,30].

#### *2.3. Solvent Exposure Analysis*

The results of the docking calculations were then analyzed in light of the data coming from the solvent exposure analysis. For each docking program-scoring function pair, the best RMSDs given by the docking calculation were evaluated against the solvent exposure of the ligand in its crystallographic pose. The results were reported in different plots, one for each couple docking program-scoring function, also in this case dividing the graphs in respect to the "scenario" from which the docking result was coming. To give an example, we reported in this article the plots for the pair GOLD-goldscore for each of these cases (Figures 9 and 10).

**Figure 9.** Scatter plots showing the different distribution of the RMSD values between the coordinates of the best pose from the GOLD-goldscore docking experiment in respect to the solvent exposure of the corresponding crystallographic ligands. The red dots represent the values having the ligand crystallized inside the catalytic pocket while the blue dots represent the ligands crystallized on the other sites of Mpro. As can be noticed, the molecules showing the best values of RMSD are, in most cases, located inside the orthosteric pocket and characterized by low solvent exposure. This plot depicts part of the results of Scenario 1, and so the crystallographic water molecules are not considered in the docking runs of which the outcomes are here represented.

**Figure 10.** Scatter plots showing the different distribution of the RMSD values between the coordinates of the best pose from the GOLD-goldscore docking experiment in respect to the solvent exposure of the corresponding crystallographic ligands. The red dots represent the ligands that are originally crystallized inside the catalytic pocket, while the blue dots represent the ligands crystallized in the other parts of Mpro. As can be noticed, the molecules showing the best values of RMSD are in most cases located inside the orthosteric pocket and characterized by low solvent exposure. This plot depicts part of the results of Scenario 2, meaning that the crystallographic water molecules at 5 Å or nearer to the ligand are also considered in the docking runs of which the outcomes are here represented.

The plots arising from all other docking program-scoring function pairs, both in Scenario 1 and Scenario 2, are reported in Supplementary Materials Figure S1. From these graphs, we can easily see how the best RMSDs values tended to be derived from protein– ligand complexes in which the solvent exposure of the ligand is low, and, most of the time, this means that the small molecule is crystallized in the binding pocket (indicated with the red dots in the plots). There are some cases in which the mean RMSD values were suboptimal also for this kind of ligands, and this can be due to several reasons. In some situations, of which the complexes 5REH, 5RE9, 5RGK (represented in Figure 11), and 7AVD are an example, the solvent exposure was tendentially higher in respect to the other orthosteric ligands, while, in other cases, the increase in RMSD can be attributable to the small dimensions of the ligand itself, making it harder for the docking algorithms to reproduce the reference pose in a pocket of such considerable volume (the complexes 5R82 and 5RG0 are an example for this) [31].

**Figure 11.** Representation of the crystallographic complex conformation of 5RGK, one of the protein– ligand complexes in which the crystallographic ligand is located inside the orthosteric binding site, but the docking calculation results in high RMSD values. This is mainly due to the high level of solvent exposure that characterizes this ligand, which locates just a small portion of its structure inside the pocket, leaving the rest in an outer zone. The ligand is represented with stick representation (C-atom are colored in magenta), and the catalytic dyad (Cys145 and His41) is highlighted, as well as the His163 and the binding site residue interacting with the ligand. To give a better representation, the surface of the protein in a 5 Å radius from the ligand is represented and colored in blue.

On the other hand, there are also some cases in which the best RMSD given by the protocol was pretty low, even if the crystallographic ligand was not placed inside the orthosteric pocket. This is the case, for example, of 7LFP (the crystallographic pose is reported in Figure 12); the ligand was placed at the interface between the monomers, and so its solvent exposure and RMSDs values were low, even if was marked to be "outside the catalytic pocket". A similar situation is observed on 5RF0, where the ligand, even if not located into the orthosteric pocket, is not situated in the peripheral part of the protease.

**Figure 12.** Representation of the crystallographic pose of 7LFP, which is one of the protein–ligand complexes in which, even if the crystallographic ligand is located outside the orthosteric binding site, the RMSD values between the original coordinates and the ones given from the docking runs are considerably low. The reason for this can be found in the very low solvent exposure of this ligand, which is located in the interface between the monomers, and so is shielded by them. The ligand is represented in orange, and the catalytic dyad (Cys145 and His41) of both monomers is highlighted. To give a better representation, the surface of the protein in a 5 Å radius from the ligand is represented and colored in blue.

#### *2.4. Molecular Dynamics Simulations*

For each of the 119 crystallographic complexes, three different molecular dynamics simulations (MD) of 5 ns each were collected to examine the behavior of the ligands in a dynamic environment. The trajectories were wrapped, aligned to the first frame and the root-mean-square fluctuation (RMSF) of the ligand, as well as the RMSD between its crystallographic and final coordinates ("RMSD\_final"), and were calculated for every single experiment. For each protein, the values coming from the average of the RMSFs and "RMSD\_final" derived from the replicas were considered. Considering all the simulations collected, the average of all the ligand RMSF values was calculated to be 5.28 Å, while the average of the RMSD values between the coordinates of the crystallographic conformation of the ligand and the ones coming from the last frame of the trajectory ("RMSD\_final") was of 8.89 Å.

As already performed for the docking results analysis, we first focused on the complexes in which the crystallographic ligand is originally located inside the orthosteric pocket. For these systems, the average of all the RMSFs coming from the simulations was 2.19 Å. The mean value of the RMSDs of the ligands in the last frame of each trajectory ("RMSD\_final") was instead calculated to be 4.43 Å.

Second, we concentrated on the systems in which the crystallographic position of the ligand (and so its initial location) is outside the catalytic pocket. For these systems, the average value of all the ligand RMSFs during the trajectories was calculated to be 11.15Å. Then the RMSD value between the final coordinates of the ligands and their crystallographic ones ("RMSD\_final") were considered. The average of these values, for all the trajectories

collected for these complexes, was 17.66Å. The output files of the molecular dynamics simulation geometric analysis are available in Supplementary Materials "MD\_data.csv".

As already performed for the docking experiments, also for MD results, the average values of RMSF and "RMSD\_final" were plotted against the percentage of solvent exposure of the crystallographic conformation of the ligand, and the plots that were obtained are reported in Figures 13 and 14.

As expected, the complexes in which the ligand is crystallized in the orthosteric site (marked with the red dots in the scatter plot) tended to fluctuate much less than the ligands which are complexed in the external parts of the protease (represented with the blue dots in the graphs). As depicted, MD analysis confirms that the best values in terms of RMSF and "RMSD\_final", which are correlated to a more energetically stable situation for the protein–ligand complex, come from the systems in which the crystallographic ligand is localized inside the catalytic pocket and are characterized by a low percentage of solvent exposure. These outcomes further support the already-mentioned hypothesis about the correlation between the improvement of the docking performances in the case in which the binding site is a pocket rather than a surface.

The overall results obtained with molecular dynamics simulations are summarized in Table 3. A graphical representation of the molecular dynamics simulations is reported in Supplementary Materials "Video\_S1.mp4". In this video, the ligands crystallized into the catalytic pocket are colored in magenta, while the other ligands are colored in cyan.

**Figure 13.** Scatter plots showing the different distribution of the mean RMSF values between the coordinates of the Mpro ligands compared to crystallographic ones after the molecular dynamics simulations in respect to the solvent exposure of the corresponding crystallographic ligands. The red dots represent the ligands thatwere originally crystallized inside the catalytic pocket, while the blue dots represent the ligands crystallized in the other parts of Mpro. As can be noticed, the molecules showing the best values of RMSF after the analysis of the trajectories are mainly located inside the catalytic pocket and characterized by a low solvent exposure of the original crystallographic pose.

**Figure 14.** Scatter plots showing the different distribution of the mean RMSD values between the final coordinates of the Mpro ligands compared to crystallographic ones after the molecular dynamics simulations in respect to the solvent exposure of the corresponding crystallographic ligands. The red dots represent the ligands thatwere originally crystallized inside the catalytic pocket, while the blue dots represent the ligands crystallized in the other parts of Mpro. As can be noticed, the molecules showing the best values of RMSF after the analysis of the trajectories are mainly located inside the catalytic pocket and characterized by a low solvent exposure of the original crystallographic pose.


**Table 3.** Results of the molecular dynamics experiments.

#### **3. Materials and Methods**

*3.1. Software Overview*

The molecular modeling operations were executed with the Molecular Operating Environment (MOE) suite (version 2019.01) [32]. The molecular docking calculations were carried out with CCDC GOLD (version 2020), Schrodinger Glide (from the Schrödinger suite 2021.3), and PLANTS. The solvent exposure calculation was performed with a series of SVL commands (exploiting "moebatch" of the MOE suite) implemented into a python script. The systems for molecular dynamics simulations were prepared by using tleap [33] and VMD [34]. The simulations were then carried out with ACEMD3 [35](version 3.3.0), a licensed program based upon OpenMM [36] (version 7.4.0). The modeling and docking calculations were performed on a 12 CPU (Intel Xeon E5-1620 3.50 GHz) Linux Workstation, while the MD simulations were carried out on a GPUs-cluster based composed of 20 NVIDIA GPUs.

#### *3.2. Structure Preparation for Docking Calculations*

The different crystal structures of Mpro were collected from the Protein Data Bank [37]. Among these, the proteins which did not present any ligand, or which were complexed with a covalent ligand, were excluded. This way, only the non-covalent protein–ligand complexes were retained, and the complete list of all the 119 complexes used is available in Supplementary Material Table S1. The structures were grouped into a database and were prepared with MOE "QuickPrep" tool. With this tool, each complex was properly prepared to recreate the small missing loops in the structure, assigning the proper conformation to the residues with alternate orientation (based on occupancy) and adding the hydrogens to the system (this last passage was performed with the MOE "Protonate 3D" tool). The hydrogens added this way were then minimized by using the AMBER10:EHT force field implemented in MOE [38].

After these preliminary but crucial steps, each complex was manually examined and treated to eliminate every molecule, except for the crystallographic waters and the main ligand. Each complex was then independently saved.

#### *3.3. Docking Calculations*

For each of the complexes prepared, the crystallographic ligand was separated from the protein and self-docked into its binding site. For each docking program, all the scoring functions available were used for separate runs, and in each run, 5 poses were collected for the ligand. GOLD supports 4 different scoring functions: goldscore, chemscore, asp, and plp; Glide supports two main functions for docking, which are Glide-SP and Glide-XP, while PLANTS implements plp and chemplp.

For each docking-program-scoring function couple, the docking calculation was carried out in two different scenarios: one in which the crystallographic water molecules were not considered (which we refer to as Scenario 1) and one in which also the water molecules 5 Å or nearer from the ligand atoms were taken into account into the computation (which we refer to as Scenario 2).

When all the docking calculations were executed, the ligand root-mean-square deviation (RMSD) between the coordinates of each one of the poses and the crystallographic conformations were computed. The data of major interest were the RMSD in respect to the pose which is marked with the highest score by the program (RMSD\_scor\_func), the lowest RMSD of the docking run (RMSD\_sorted), and the average of the RMSDs of all the poses generated (RMSD\_average). The output files of the self-docking experiments executed are available in the Supplementary Materials ("Selfdocking\_scenario1.csv" and "Selfdocking\_scenario2.csv").

#### *3.4. Solvent Exposure Calculation*

For each Mpro complex, the solvent exposure of the main crystallographic ligand was calculated with an SVL script based on MOE "moebatch". The output of such calculation was the percentage of the ligand surface which is exposed to the solvent in the protein–ligand crystallographic complex. All the percentages obtained are presented in Supplementary Materials Table S2.

#### *3.5. Molecular Dynamics Simulations Setup and Execution*

All the protein–ligand Mpro systems were independently prepared for molecular dynamics simulations. The program tleap was used for the creation of the simulation box, which was set to be cubic and characterized by a 15 Å padding. The solvation model used was the explicit TIP3P, and the neutralization of the system was performed by adding Na<sup>+</sup> and Cl<sup>−</sup> ions until the salt concentration inside the box reached the value of 0.154 M.

The systems then underwent a two-passage equilibration. In the first one, both protein and ligand atoms were subjected to a harmonic position restrain of 5 kcal/mol. The length of this step was set to 0.1 ns, and the ensemble used was the canonical one (NVT). During the second equilibration step, the ensemble was set to NPT (isothermal–isobaric), the length was 0.5 ns, and the harmonic restrains (always 5 kcal/mol) were applied both on the ligand and on the alpha-carbons of the protein backbone.

After these preliminary steps, 3 replicas of 5 ns each were collected for each system; the ensemble was again the NVT one, and no restraints were applied. At the end of the simulations, the average root-mean-square fluctuation (RMSF) of the ligand during the trajectory, as well as the RMSD betweencrystallographic coordinates of the ligand and the ones coming from the last frame of the trajectory, were collected.

#### **4. Conclusions**

In this study, we evaluated the performance of three orthogonal docking algorithms in reproducing the crystallographic pose of several ligands located in different parts of the same target, which, in our case, was the SARS-CoV-2 main protease (Mpro). Our analysis revealed how, even if the docking programs used operate in different ways to give the final conformations to the user, all of them perform much better in the case in which the ligands are located in a binding pocket rather than crystallized outside of it. Specifically, we reported that their performance tends to decrease with the increment of the exposure of the crystallographic pose to the solvent. This was confirmed both from the experiments executed without considering the crystallographic water molecules in the docking calculations and from the ones taking into account the waters 5 Å or nearer to the ligand. Molecular dynamics simulations further give credit to our study, demonstrating how the less-fluctuating ligands (and so the most stable) through the trajectories were the ones crystallized inside the orthosteric binding site of Mpro .

**Supplementary Materials:** The following are available online at https://www.mdpi.com/article/10 .3390/ph15020180/s1: "Supplementary\_material.docx" (containing Table S1, Table S2, and Figure S1), the CSV files "Selfdocking\_scenario1.csv", "Selfdocking\_scenario2.csv" and "MD\_data.csv", "Video\_S1.mp4".

**Author Contributions:** Conceptualization, D.B. and S.M.; methodology, D.B.; formal analysis, D.B., M.S. and S.M.; writing—original draft preparation, D.B.; writing—review and editing, M.P., G.B., M.S. and S.M.; supervision, S.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by Fondazione Cariparo (An Integrated Strategy for the Fast Discovery of SARS-CoV-2 Main Protease (Mpro) Inhibitors, No. 55812).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Data are contained within the article and Supplementary Materials.

**Acknowledgments:** MMS lab is very grateful to Chemical Computing Group, OpenEye, and Acellera for the scientific and technical partnership. MMS lab gratefully acknowledges the support of NVIDIA Corporation with the donation of the Titan V GPU used for this research.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Article* **Evaluation of Docking Machine Learning and Molecular Dynamics Methodologies for DNA-Ligand Systems**

**Tiago Alves de Oliveira 1,2, \* , Lucas Rolim Medaglia 1 , Eduardo Habib Bechelane Maia 2 , Letícia Cristina Assis 1 , Paulo Batista de Carvalho 3 , Alisson Marques da Silva <sup>2</sup> and Alex Gutterres Taranto 1,4, \***


**Abstract:** DNA is a molecular target for the treatment of several diseases, including cancer, but there are few docking methodologies exploring the interactions between nucleic acids with DNA intercalating agents. Different docking methodologies, such as AutoDock Vina, DOCK 6, and Consensus, implemented into Molecular Architect (MolAr), were evaluated for their ability to analyze those interactions, considering visual inspection, redocking, and ROC curve. Ligands were refined by Parametric Method 7 (PM7), and ligands and decoys were docked into the minor DNA groove (PDB code: 1VZK). As a result, the area under the ROC curve (AUC-ROC) was 0.98, 0.88, and 0.99 for AutoDock Vina, DOCK 6, and Consensus methodologies, respectively. In addition, we proposed a machine learning model to determine the experimental ∆T<sup>m</sup> value, which found a 0.84 R 2 score. Finally, the selected ligands mono imidazole lexitropsin (**42**), netropsin (**45**), and *N*,*N*′ - (1H-pyrrole-2,5-diyldi-4,1-phenylene)dibenzenecarboximidamide (**51**) were submitted to Molecular Dynamic Simulations (MD) through NAMD software to evaluate their equilibrium binding pose into the groove. In conclusion, the use of MolAr improves the docking results obtained with other methodologies, is a suitable methodology to use in the DNA system and was proven to be a valuable tool to estimate the ∆T<sup>m</sup> experimental values of DNA intercalating agents.

**Keywords:** computer drug design; molecular docking; molecular dynamic simulation; virtual screening; MolAr; DNA intercalating agents

#### **1. Introduction**

Drugs interacting with DNA are among the most effective anticancer agents [1], but their low selectivity makes them highly toxic, a major drawback that calls for new studies and strategies to develop drugs selective towards DNA in cancerous cells [2].

One of the strategies for the development of new drugs is to identify small molecules through a systematic analysis of large groups of compounds with drug-like properties. An experimental approach commonly used is the high throughput screening (HTS), an automated process using robots for a systematic search. It is a costly technique due to the number of compounds to be acquired, the cost of purchase and operation of sophisticated robots [3], and experimental considerations such as stability and solubility of the compounds.

An alternative to HTS is the virtual high-throughput screening (vHTS or VS), an in silico method to test large groups of compounds, including databases available online

**Citation:** de Oliveira, T.A.; Medaglia, L.R.; Maia, E.H.B.; Assis, L.C.; de Carvalho, P.B.; da Silva, A.M.; Taranto, A.G. Evaluation of Docking Machine Learning and Molecular Dynamics Methodologies for DNA-Ligand Systems. *Pharmaceuticals* **2022**, *15*, 132. https://doi.org/10.3390/ ph15020132

Academic Editor: Osvaldo Andrade Santos-Filho

Received: 29 November 2021 Accepted: 17 January 2022 Published: 22 January 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

containing millions of molecules. This technique also allows the design and virtual testing of theoretical compounds prior to synthesis or acquisition, reducing the cost and time required to find compounds with a high potential for further development [3,4]. VS methods use molecular docking to study the interaction between small molecules and their receptors [5], a method that has been evaluated for protein-ligand systems, and more recently have been used to model DNA-ligand complexes [6–8]. However, most docking programs use algorithms that are not suitable for modeling DNA due to its high charge density [1], prompting the need for a more adequate in silico model for nucleic acids.

Several studies have been done trying to develop a molecular docking software appropriate for DNA modeling. Ricci and Netz [9] developed a method to predict the binding mode of small molecules to DNA using AutoDock 4.0 [10], which used distinct DNA receptors in the most common conformations related to the most common binding poses to suppress the importance of the receptor's flexibility in the algorithm.

Srivastava et al. [11] described a systematic computational analysis of 57 DNA ligands through four docking protocols, with the following root-mean-square deviation (RMSD) for the best ligands: GOLD [12] (1.24 Å), Glide [13] (1.23 Å), CDOCKER [14] (1.44 Å), and AutoDock [10] (1.57 Å). GOLD and GLIDE, with similar values, were shown to have a better performance and being the most suitable for modeling nucleic acid-ligand complexes. Molecular dynamics simulations showed that the DNA duplex skeleton underwent minor deviations in the complex, supporting docking protocols even though the receptor is kept rigid. However, the area under the ROC curve (AUC) of these methodologies was not evaluated. ROC curve is an important metric to check the capacity of methodology to distinguish false positive results. Fong and Wong [15] evaluated four scoring functions (AutoDock [10], ASP@GOLD [16], ChemScore@GOLD [17], and GoldScore@GOLD [12]) for DNA-ligand complexes, and the scoring functions reproduced the experimental crystallographic structure complexes. It is noteworthy that these previous studies improved their results by combining more than one scoring function.

Good RMSD results were obtained in previous studies, but the ranking capacity of these docking methods was not evaluated. Our study used Molecular Architecture (MolAr) [18] software to predict DNA-ligand poses. MolAr is a docking workflow that allows an integrated and automated virtual screening (VS) process, from protein preparation (homology modeling and protonation state) to virtual screening with different methods. MolAr is open access and free of charge (available at http://www.drugdiscovery.com.br, accessed on 20 May 2020), allowing users to perform all the docking steps in a unique interface with a simple and intuitive operation. It uses AutoDock Vina [19], DOCK 6 [20], and Consensus Virtual Screening (CVS) docking protocols. AutoDock Vina uses a hybrid scoring function that combines knowledge-based and empiric scoring function features. DOCK 6 offers physics-based energy score functions based on force fields and score functions (GRID and AMBER scores). CVS is a ranking normalized combination of the results of AutoDock Vina and DOCK 6, reducing the chance of false positive results [18]. Our results were evaluated for DNA-ligand model systems [11] through visual inspection of the RMSD and ROC curves. In addition, docking binding energy and descriptor values of ligands were used as a predictor to calculate the ∆T<sup>m</sup> experimental values of 11 DNA ligands previously reported. Dynamic molecular simulations were also used to clarify their intermolecular interactions with DNA.

#### **2. Results**

A visual inspection was performed on the 1VZK structure to identify the principal forces for molecular recognition. Figure 1 shows the 3D structure of the target 1VZK and a 2D diagram with its crystallographic ligand (D1B). The ligand is complexed into the minor groove of DNA (Figure 1a) through hydrogen bonds between amidinic moieties and the carbonyl oxygen of nitrogenous bases. Hydrophobic interactions can also be observed between the benzimidazole and aromatic and with nitrogenous bases (Figure 1b).

**Figure 1.** (**a**) The crystallographic structure of molecular target under PDB code 1VZK; (**b**) a close view of the intermolecular interactions between ligand (D1B) in the minor DNA groove of the 1VZK. The red circles and ellipses in each plot indicate protein residues. Hydrogen bonds are shown as green dotted lines, while the spoked arcs represent residues making van der Waals interactions with the ligand generated with LigPlot+.

In general, redocking is the first evaluation method to be used in the docking process. This process shows (i) the correct elaboration of grid box parameters; (ii) the capacity of the docking method of reproducing the crystallographic binding pose; (iii) the acquisition of binding energies that can be used to rank the compounds. Usually, the redocking is evaluated by the RMSD value between the crystallographic binding pose and redocking results. The RMSD value between 1VZK and D1B ligand was 0.65 Å by AutoDock Vina, while the threshold value is 2.0 Å [20]. This result was better than previous docking methodologies, GOLD, GLIDE, CDOCKER, and AUTODOCK, which had values ranging from 1.23 Å to 1.57 Å [11].

#### *2.1. Molecular Docking*

The docking performance was evaluated by calculating the AUC-ROC, EF values, and BedROC. AUC-ROC has been used to check if the docking method can distinguish false positives from true positives [21]. The AUC-ROC values for our test compounds were 0.98, 0.88, and 0.99 for AutoDock Vina, DOCK 6 (Amber Score), and CVS, respectively. Moreover, the enrichment factor (EF) value [22] reflects the ability of the docking calculations to find true positives throughout the background database compared to random selection. Thus, it indicates how good the set formed by the top x% ranked compounds is compared to a set of equal size selected randomly from the entire collection of compounds. EFs values are calculated utilizing a percentage of the data set. For example, EF5% represents the value obtained when 5% of the database was screened. The EF value is defined by:

$$\text{EF\%} = \frac{\text{activity\%}}{\text{compounds\%}} \times \frac{\text{total compounds}}{\text{total activities}} \tag{1}$$

Previous reports show CVS having the best EF values, compared to DOCK 6 and AutoDock Vina, which can be explained by the use of AutoDock Vina output as the input for DOCK 6. Consequently, AUC-ROC had values corroborated by the EF values; in other words, the EF validates AUC-ROC results, especially with the performance at EF 1%, showing the CVS method could distinguish 100% of molecules [18].

The BedROC [23] value was calculated to confirm these AUC-ROC and EF results. BedROC uses exponential weighting to give early rankings more weight than the latest rankings of active compounds. The BedROC values were 0.60, 0.52, and 0.83 for AutoDock Vina, DOCK 6 (Amber Score), and CVS, respectively. As in AUC-ROC and EF the values of CVS are better than AutoDock Vina and DOCK6 (Amber Score).

Finally, Machine Learning was used to develop a model to predict ∆Tm experimental values. ∆T<sup>m</sup> represents the change in the melting temperature of DNA upon drug binding, being directly correlated with the binding energy, and is a valuable tool to evaluate the docking results. Six algorithms of linear regression were implemented as follows: (i) Gradient Boosting Regressor [24]; (ii) Random Forest Regressor [25]; (iii) Linear Regressor [26]; (iv) Voting Regressor [27] between algorithms (i), (ii) and (iii), (v) Lasso [28] and (vi) Elastic Net [29]. The results are summarized in Table 1, with Mean Squared Error (MSE) and R<sup>2</sup> score information. The Gradient Boosting Regressor shows the best result, with an R<sup>2</sup> score of 0.84, and the worst is the Random Forest Regressor with an R<sup>2</sup> score of 0.33.

**Table 1.** Regression Linear values calculated for the Prediction of ∆T<sup>m</sup> values.


#### *2.2. Molecular Dynamics*

The five best CVS results and the original ligand were chosen for simulation and presented in Figure 2, where the conformation changes of each of these ligands during the MD simulation were analyzed.

**Figure 2.** Ligands chosen for MD simulations after docking simulation.

− − − − − − The average energy for the total system, in Kcal/mol was −55,067.1, −55,473.3, −55,507.2, −55,137.1, −54,524.8, and −54,988.1 to ligands **51**, **42**, **45**, **15**, **43**, and **44**, respectively. The total energy graph shown in Figure 3 demonstrates an example of how the energy has a minimal variation. All energy graphs (available in the Supplementary Material, Figure S3) remained in equilibrium throughout the entire MD simulation. − − − −

− −

**Figure 3.** Example of Total Energy (Kcal/mol) calculated in Dynamic Molecular Simulation for interactions with DNA and ligand **51**.

The simulations were carried out in 50 ns to observe if there were significant conformational changes during the trajectory, with the results summarized in Figure 4. As can be observed, the ligands 42, 45, and 51 have the best results with an RMSD variation below

1Å. The original ligand and ligands **43** and **44** showed a major variation of approximately 2, 5, and 4 Å, respectively.

**Figure 4.** RMSD graphs for: (**a**) **45**; (**b**) **51**; (**c**) **42**; (**d**) **15**; (**e**) **44**; (**f**) **43** ligands complexed with DNA during 50 ns.

The intermolecular interactions of structures in equilibrium can be observed in Figure 5. As can be observed, the ligands **45** (5a), **51** (5b), **42** (5c), **15** (5d) were able to form hydrogen bonds and van der Waals interactions; whereas the ligands **44** (5e) and **43** (5f) carried out van der Waals interactions with nucleic acids.

**Figure 5.** *Cont*.

**Figure 5.** *Cont*.

2D interaction diagram obtained by LigPlot+ in Dyn

**Figure 5.** 2D interaction diagram obtained by LigPlot+ in Dynamic Molecular Simulation for interactions with DNA and ligand: (**a**) **45**; (**b**) **51**; (**c**) **42**; (**d**) **15**; (**e**) **44**; (**f**) **43**. The red circles and ellipses in each plot indicate protein residues. Hydrogen bonds are shown as green dotted lines, while the spoked arcs represent residues making van der Waals interactions with the ligand.

in Dynamic Molecular Simulation for interactions

Finally, to improve the analysis of RMSD values fluctuation, the heat map was plotted with the best ligands using VMD software. Figure 6 shows the heat map for ligands **45**, **51**, **45**, and **15**, Figure 6a–d, respectively. In general, the DNA structure is kept rigid during the MD trajectory with low variation. However, the highest fluctuation can be observed for all ligands reaching values ranging from 1.2 to 8.27.

**Figure 6.** *Cont*.

**Figure 6.** Heat Map for MD simulations generated with HeatMap plugin in VMD with ligand: (**a**) **45**; (**b**) **51**; (**c**) **42**; (**d**) **15**.

#### **3. Discussion**

#### *3.1. The Best Docking Methodologies to Study the DNA System*

In this study, we use AUC-ROC, EF, and BedROC to evaluate the best docking methodology for DNA intercalating agents, comparing AutoDock Vina, DOCK6, and CVS. AutoDock Vina is an important tool to find the correct pose of ligand into the binding site [19]; however, the ranking among the ligands has not been carried out properly. In addition, the score function of AutoDock Vina does not consider the charges. On the other hand, the score function of DOCK 6 [20], Amber score, includes the AM1-BCC charges of the system. Consequently, the AM1-BCC charges have been determined for a start pose obtained from AutoDock Vina output, improving the accuracy of charge calculations.

AUC-ROC was used to check if the docking method can distinguish false positives from true positives. AUC values close to 1 suggest good discrimination between false and true positives, whereas values closer to 0.5 show a random process, and values higher than 0.7 represent a good distinguishing power [21].

EF indicates how good the set formed by the top x% ranked compounds is when compared to a set of equal size selected randomly from the entire collection of compounds. EF corroborates AUC-ROC [18], yielding even better results with CVS methodology. BedROC calculated values were 0.60, 0.52, and 0.83 for AutoDock Vina, DOCK 6 (Amber Score), and CVS, respectively, confirming that CVS is the best methodology.

The model of linear regression summarized in Table 1 shows the results of the implementation of six linear regressors. The Gradient Boosting Regressor shows the best result, with an R<sup>2</sup> score of 0.84, and the worst is the Random Forest Regressor with an R<sup>2</sup> score of 0.33. Gradient Boosting Regressor (GBR) is a generalization of boosting to arbitrary differentiable loss functions. GBR is an accurate and effective off-the-shelf procedure that can be used for both regression and classification problems in a variety of areas. Our GBR's result is better than Srivastava's studies [11], which used docking information from GOLD, GLIDE, CDOCKER, AUTODOCK 4, Average Information Content level 2, and chemical hardness. Thus, our Machine Learning model was able to estimate ∆T<sup>m</sup> value with more accuracy than previous reports.

#### *3.2. Molecular Dynamics Simulations*

MD simulations were performed to obtain information about the ligands' interaction and stability into the DNA groove. According to the results of total energy, ligand **51** obtained the lower energy; whereas ligand **44** obtained the highest energy value. Figure 3 demonstrates the lower variation of the system in simulation.

Figure 4 shows the RMSD results of MS simulations. Ligands **15**, **42**, **45,** and **51** showed a RMSD average variation below 1 Å. Both ligands **42** and **45** achieved the equilibrium state in the beginning of the process (Figure 4a,c). Noteworthy that ligand **51** (Figure 4b) showed a RMSD average value of 0.5 Å, indicating an absence of conformational changes for this inhibitor, suggesting a better molecular recognition between DNA and ligand in the DNA groove. In addition, **15** showed an RMSD average variation of 1 Å and stabilized in the DNA groove after 20 ns of simulation, as shown in Figure 4d. Visual inspection of the MD simulation path of **15** showed a decrease in the intermolecular interaction forces until 20 ns, followed by complete filling of the binding site during the rest of the process. **43** and **44** showed RMSD values higher than 2 Å, and it was observed that the 2-thienyl-1Hbenzimidazole portion of **43** was more exposed to the solvent, resulting in a higher degree of freedom and consequent adoption of various conformations. **43** presented an RMSD value of 5.0 Å (Figure 4f), but reached equilibrium after 20 ns. **44** behaved similarly to **43**, reaching equilibrium at 35 ns (Figure 4e). **44** had the quinolinium group outside the major DNA groove, obtaining an RMSD value of 3.5 Å.

Summarizing, all ligands achieved equilibrium within 50 ns of simulation, characterizing molecular recognition. Although compounds ligand 43 and ligand 44 presented good docking results at 15 and 28 nanoseconds of dynamic simulation, respectively, the structures presented conformational changes, resulting from parts of the ligands leaving the DNA groove suggesting hydrogen bonding with the solvent.

MD results corroborate the molecular docking results, with compounds **42**, **45**, and **51** interacting and accommodating themselves better in the smaller groove of DNA, presenting themselves as promising compounds for further studies as anticancer drugs. Compounds **43** and **44** can be considered weak DNA intercalators because, despite good docking results, they had a higher variation of RMSD values during MD.

The molecular interactions are depicted on Figure 5. **45**, **51**, **42,** and **15** have hydrogen bonding acceptors and donors and were recognized by DNA through hydrogen bonds and van der Waals interactions (Figure 5a–d). For instance, Figure 4a shows the intermolecular interactions between compound **45** with DNA. This compound carried out hydrogen bonding with CytA:7, ThyA:9, AdeB:22 and hydrophobic interactions with ThyA:8, AdeA:10, and ThyB:20. Compound **51** (Figure 5b) was better recognized by the DNA minor groove by performing a higher number of intermolecular interactions, such as hydrogen bonding with AdeA:10, GuaA:11, and ThyB:20; and van der Waals interactions with ThyA:8, ThyA:9, AdeA:12, GuaB:17, CytB:19, and AdeB:22. Compound **42** (Figure 5c) is able to perform hydrogen bonding with CytA:6, CytA:7, ThyA:9, and AdeB:22; and hydrophobic interactions with ThyA:8, AdeA:10, CytB:19, and ThyB:20. It is noteworthy that the guanidinium groups of the **15** performed hydrogen bonds with CytA:6, CytA:7, CytB:18, AdeB:22, GuaB:23, beyond several hydrophobic interactions, such as GuaA:11, ThyB:20, and ThyB:21 (Figure 5d). These interactions with these nitrogenous bases are essential components for intercalation within the minor DNA groove, which indicates that this inhibitor remained well accommodated in the DNA during the dynamic's simulation.

In contrast, both compounds **44** and **43** were not able to perform hydrogen bonding with DNA. **44** (Figure 5e) carried out hydrophobic interactions, for instance, with ThyA:9, AdeA:10, GuaA:11, AdeA:12, GuaA:13, GuaB:17, CytB:18, CytB:19, ThyB:20 and ThyB:21. Similarly, **43** performed hydrophobic interactions with ThyA:9, AdeA:10, GuaA:11, AdeA:12, GuaB:17, CytB:18, CytB:19 and ThyB:20, as shown in Figure 4f. These missing hydrogen bonding interactions can explain the higher RMSD fluctuations value during MD simulations, once this interaction has an important hole in the molecular recognition and stabilization of the ligand within the DNA groove. These findings highlight the structure-activity relationship of guanidinium groups in the development of antineoplastic compounds.

Through the analysis of RMSD plots using heat map graphs (Figure 6) it is possible to confirm the results generated by RMSD and interactions maps. **51** presented the smallest RMSD variation (max of 2.22 Å), best RMSD graphics and total energy. **42**, **45**, and **15** demonstrated RMSD variation of 2.52 Å, 3.31 Å, and 8.27 Å, respectively.

#### **4. Materials and Methods**

#### *4.1. Molecular Docking*

The three-dimensional structures of 57 ligands were constructed using Marvin Sketch [30] and 150 decoys were generated by the DUD-E platform [31]. The DNA molecular target was obtained from the Protein Data Bank (PDB-1VZK). These ligands and targets were described in a previous report [11] and are available in the Supplementary Material, Figure S1 [31]. The ligands were refined by Run\_Mopac [32] software using Parametric Method 7 (PM7) [33] and Eigenvector Following routine [34].

The target was prepared by Chimera [35] by:


Finally, all compounds were docked against the 1VZK molecular target at the minor groove position using a grid box with 20 × 20 × 26 Å and atomic coordinates centered to 14.44 Å, 20.57 Å, and 8.64 Å, for x, y, and z, respectively. In order to evaluate the best docking methods for DNA docking, three virtual screening simulations using MolAr were performed through AutoDock Vina, DOCK 6 (Amber Score), and CSV. All methodologies were double-checked by redocking, measurement of the area under the Receiver Characteristic Operator (ROC) curve (AUC-ROC), Enrichment Factor (EF), and Boltzmann-Enhanced Discrimination (BedROC). The redocking process consisted of removing the crystallographic ligand, with subsequent docking of the ligands into the same binding site.

In addition, we developed a machine learning model with the docking binding energy results and molecular descriptors of the molecules hereby tested to calculate the ∆T<sup>m</sup> experimental values. The data frame was elaborated using the 57 compounds described previously, from which only 11 had their ∆T<sup>m</sup> values calculated [11]. The descriptors were obtained using the Mordred library [36], a molecular descriptor calculator. Afterward, six Linear Regression algorithms were performed with the following descriptors: Molecular Weight, cLogP, cLogS, Total Surface Area, Relative Polar Surface Area, Polar Surface Area, AutoDock Vina with major groove, AutoDock Vina with minor groove, *Ehomo*, *E* −1 *homo*, *Elumo*, *E* +1 *homo*, DOCK 6 with Amber Score, Consensus with Grid and Amber Score, Structural Information Content level 1, Bond Information Content level 1, and chemical hardness [*η* = (*Elumo* − *Ehomo*)/2)].

#### *4.2. Molecular Dynamics*

The five best-docked ligands (according to MolAr Consensus with Amber score) were chosen for MD simulations among the crystallographic reference ligands (PDB-ID 1VZK) to characterize the molecular recognition between ligands and DNA. The ligands (PDB-IDs 1VZK, 1LEY, 1ZPH, 1ZPI, 261D, and 2GYX) are shown in the Supplementary Material, Figure S2. All ligands and the energy values for all configurations are presented in the Supplementary Material, respectively, in Figure S1 and Table S1.

The ligand-DNA complexes were inserted into a 74.15 × 52.33 × 55.58 Å simulation box and solvated with TIP3P model water molecules [37]. Sodium chloride ions were added to neutralize the system charge. Each system was energetically minimized with 5000 cycles using the Conjugate Gradient algorithm [38]. The nucleic acid atoms had position restraints with an exponent of energy function of 2 and scaling of 1.0 applied to them during the first 4000 cycles and no restraints during the last 1000 cycles. After the energy minimization, the systems were heated to 310 K during a 30 ps equilibration

conducted under an isothermal-isochoric ensemble (NVT), followed by a 500 ps simulation under an isothermal-isobaric ensemble (NPT) using the Langevin piston method [39] to maintain the total pressure to an average of 1 bar. The final production had a total of 50 ns. Water stretching and bending motions were constrained by the SETTLE algorithm [40]. Electrostatic interactions were treated via the Particle-Mesh Ewald method [41,42] with a 12 Å cutoff radius. All simulations were performed using the CHARMM36 [41,43,44] force field implemented into NAMD software [33], version 2.13. Analysis was performed using VMD, version 1.9.3 [45].

#### **5. Conclusions**

Even though DOCK 6 and AutoDock Vina showed different results, the overall result was improved when they were combined and subjected to the MolAr CVS approach. AUC-ROC, BedROC, and EF values showed the combination was able to generate more reliable results and a better prediction of the ligand conformation. MD is a critical methodology to confirm the interactions between ligands and nucleic acids, showing that MolAr CVS virtual screening can rank ligands in the DNA intercalating compounds. It is noteworthy that CVS has a low computational cost when compared with MD simulations.

In this study, two different approaches were carried out to predict the activity of compounds capable of binding to the minor groove of DNA. The first approach, structurebased drug design, was carried out to rank compounds for their ability to dock with the 1VZK molecular target at the minor groove position using docking and MD simulations. The second approach, ligand-based drug design through Machine Learning methods, ranked the six selected structures based on their binding energy. These methods were able to properly describe the intermolecular interactions between intercalating agents and DNA and build a machine learning model able to predict the ∆T<sup>m</sup> experimental values. The application of docking machine learning and molecular dynamics methodologies suggests compounds **51**, **42**, and **45** as leads for the development of improved anticancer compounds.

**Supplementary Materials:** The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/ph15020132/s1, Figure S1: Structure of all ligands with their PDB code and docking results values, Table S1: Docking Results with all configurations with energy values in Kcal/mol, Figure S2: Ligands chosen for MD simulations after docking. Figure S3. Total Energy (Kcal/mol) calculated in Dynamic Molecular Simulation for interactions with DNA and ligands: (a) **45**; (b) **51**; (c) **42**; (d) **15**; (e) **44**; (f) **43**.

**Author Contributions:** Conceptualization, A.G.T. and A.M.d.S.; methodology, A.G.T.; software, T.A.d.O. and E.H.B.M.; validation, A.G.T. and A.M.d.S.; formal analysis, T.A.d.O. and L.C.A.; investigation, T.A.d.O., L.C.A., L.R.M. and E.H.B.M.; resources, A.G.T. and A.M.d.S.; data curation, A.G.T. and A.M.d.S.; writing—original draft preparation, T.A.d.O.; writing—review and editing, A.G.T., A.M.d.S., E.H.B.M. and P.B.d.C.; visualization, T.A.d.O. and L.R.M.; supervision, A.G.T.; project administration, A.G.T.; funding acquisition, A.G.T. and A.M.d.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by PPBE and PPGCF/UFSJ; Research Support Foundation of the State of Minas Gerais-FAPEMIG-Brazil, grant CDS-APQ-02742-17, grant APQ-00855-19, and grant APQ-01733-21; and National Council for Scientific and Technological Development-CNPq-Brazil, grant 305117/2017-3, grant 426261/2018-6; and by fellowship of 2021 (grant 310108/2020-9).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Date is contained in the article and supplementary materials.

**Acknowledgments:** The authors would like to thank the Federal University of Sao Joao del-Rei (UFSJ) and the Federal Center for Technological Education of Minas Gerais (CEFET-MG) for providing the physical infrastructure.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **References**

