**Fragment-Based Ligand Discovery Applied to the Mycolic Acid Methyltransferase Hma (MmaA4) from** *Mycobacterium tuberculosis***: A Crystallographic and Molecular Modelling Study**

**Romain Galy <sup>1</sup> , Stéphanie Ballereau 2 , Yves Génisson 2 , Lionel Mourey 1 , Jean-Christophe Plaquevent 2 and Laurent Maveyraud 1, \***

	- ballereau@chimie.ups-tlse.fr (S.B.); genisson@chimie.ups-tlse.fr (Y.G.)

**Abstract:** The mycolic acid biosynthetic pathway represents a promising source of pharmacological targets in the fight against tuberculosis. In *Mycobacterium tuberculosis*, mycolic acids are subject to specific chemical modifications introduced by a set of eight S-adenosylmethionine dependent methyltransferases. Among these, Hma (MmaA4) is responsible for the introduction of oxygenated modifications. Crystallographic screening of a library of fragments allowed the identification of seven ligands of Hma. Two mutually exclusive binding modes were identified, depending on the conformation of residues 147–154. These residues are disordered in *apo*-Hma but fold upon binding of the S-adenosylmethionine (SAM) cofactor as well as of analogues, resulting in the formation of the short η1-helix. One of the observed conformations would be incompatible with the presence of the cofactor, suggesting that allosteric inhibitors could be designed against Hma. Chimeric compounds were designed by fusing some of the bound fragments, and the relative binding affinities of initial fragments and evolved compounds were investigated using molecular dynamics simulation and generalised Born and Poisson–Boltzmann calculations coupled to the surface area continuum solvation method. Molecular dynamics simulations were also performed on *apo*-Hma to assess the structural plasticity of the unliganded protein. Our results indicate a significant improvement in the binding properties of the designed compounds, suggesting that they could be further optimised to inhibit Hma activity.

**Keywords:** *Mycobacterium tuberculosis*; mycolic acid methyltransferases; fragment-based ligand discovery; binding energies; molecular modelling

#### **1. Introduction**

*Mycobacterium tuberculosis* (Mtb), the causative agent of tuberculosis (TB), remains one of the deadliest infectious agents worldwide: it claimed 1.5 million deaths in 2020, and an estimated 10 million new cases were reported [1]. This remarkable efficacy as a human pathogen relies in part on the structure of its thick, atypical, highly hydrophobic cell wall [2], which limits antibiotic penetration [3], protects Mtb from the host immune system [4,5], and provides important virulence factors [6,7]. This cell wall is formed by the mycomembrane, or mycobacterial outer membrane, which surrounds arabinogalactan and peptidoglycan [2]. The inner leaflet of the mycomembrane comprises mycolic acids (MAs) covalently bound to arabinogalactan, whereas trehalose-bound mycolic acids are found in the outer leaflet [8].

**Citation:** Galy, R.; Ballereau, S.; Génisson, Y.; Mourey, L.; Plaquevent, J.-C.; Maveyraud, L. Fragment-Based Ligand Discovery Applied to the Mycolic Acid Methyltransferase Hma (MmaA4) from *Mycobacterium tuberculosis*: A Crystallographic and Molecular Modelling Study. *Pharmaceuticals* **2021**, *14*, 1282. https://doi.org/10.3390/ph14121282

Academic Editor: Osvaldo Andrade Santos-Filho

Received: 9 November 2021 Accepted: 5 December 2021 Published: 8 December 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

Mycolic acids, long-chain 2-alkyl, 3-hydroxy fatty acids, are an idiosyncrasy of the genus *Mycobacterium* [6], and, as such, their metabolism is a relevant target in the fight against Mtb [9,10]. Indeed, isoniazid, one of the most widely used antitubercular drugs, targets this biosynthetic pathway [11–13]. The biosynthesis of MAs starts with the synthesis of C16–C<sup>18</sup> fatty acids (FAs), by the multifunctional fatty acid synthase (FAS) I enzyme, which are further elongated up to C48–C<sup>62</sup> by the FAS-II multienzyme system, while being decorated at two distinct positions by a set of eight MA S-adenosylmethionine (SAM) dependent methyltransferases (MAMTs). The enzyme Pks13 condensates these long modified FAs, called a meromycolic chain, with a C24–C<sup>26</sup> long FA, also generated by FAS-I [14]. The resulting decorated MAs are translocated to the periplasm by the membrane transporter MmpL3 [15].

The introduction of decorations at the distal and proximal positions on the meromycolic chain necessitates the presence of *cis* double bonds. The exact mechanism that leads to the presence of these double bonds is subject to debate [6]. These *cis* double bonds can be converted at the distal and proximal positions into cyclopropane by MmaA2 [16] and PcaA [17], respectively, into a *trans* double bond with a vicinal methyl by UmaA1 [18], or hydrated into a hydroxylated compound by MmaA4/Hma [19,20]. The resulting hydroxymycolates can be further modified to keto- and methoxy-MAs by MmaA3 [19,21,22]. The catalytic mechanisms of CmaA2, MmaA4, and MmaA1 have been studied by QM/MM steered molecular dynamics [23]. It would begin with the formation of a carbocation at the olefin site that would spontaneously convert into a methyl alcohol in the case of Hma/MmaA4 [23]. Deletion of individual genes encoding SAM-dependent MAMTs is not lethal and affects the mycomembrane structure and/or virulence of Mtb to varying extent [16–18,24]. On the other hand, simultaneous inactivation of all eight genes encoding MAMTs resulted in a viable but highly attenuated and hyperinflammatory Mtb [25]. Furthermore, chemical inhibition of MAMTs was found to be bactericidal [24,26]. All these results suggest that MAMTs are attractive targets in the fight against TB.

Among these, MmaA4/Hma is particularly interesting, as it has been shown that it is necessary and sufficient for the introduction of oxygenated modifications on MAs [19,20,27] and that oxygenated MAs participate in the virulence of Mtb in mice [19], modulate IL12 production in macrophages [28], and trigger the differentiation of macrophages into foamy macrophages in granulomas in vitro [29]. In continuation of our previous work on the 3D structure of Hma in the presence of SAM and of cofactor analogues [26,30], we screened a small library of fragments against Hma using X-ray crystallography. Molecular dynamics simulations were performed for the experimentally observed bound fragments to estimate their binding energies. Based on the observed structures, evolved fragments were designed and their binding energies were also estimated.

#### **2. Results**

#### *2.1. Crystallographic Structures of Fragment-Bound Hma*

Soaking experiments at 20 mM were performed with 126 fragments (average molecular weight 153 ± 29 Da, 0–3 hydrogen bond donors, 0–5 hydrogen bond acceptors, 1–3 cycles, and 0–4 rotatable bonds), providing as many crystals that were flash cooled in a stream of nitrogen gas at 100 K. Diffraction data could be collected for 109 crystals, resulting in 66 datasets with resolution better than 2.5 Å. After a preliminary refinement with dimple, the PanDDA procedure [31,32] identified seven datasets corresponding to possible bound fragments (Figure 1), which were further refined (Table 1). In the case of compound ZT260, as low ligand occupation was observed, a second soaking experiment was performed with 100 mM of compound. Four distinct binding sites were observed (Figure 2): two binding sites are buried in a profound crevice, which has been shown to accommodate the cofactor [30] and the substrate [26], and the other two are on the surface of the protein, involving in one case residues of a neighbouring protein in the crystal.

*σ*

**Figure 1.** Structures of the fragments bound to Hma.

**Figure 2.** Ribbon representation of the Hma protein (PDB ID 2FK8) in the presence of the S-adenosylmethionine cofactor, represented as sticks with black carbon atoms, with all observed bound fragments represented as sticks with coloured carbon atoms. The protein is represented as a ribbon coloured from blue at the N-terminus to red at the C-terminus and secondary structure elements are labelled according to [30].


**Table 1.** Data collection and refinement statistics.

<sup>1</sup> Values in parentheses are for the highest resolution shell.

#### 2.1.1. Fragments ZT218, ZT260, ZT275, ZT320, and ZT585 Bind at the Substrate Binding Site

Five fragments were found to bind Hma at the substrate binding site (Figure 2), where the lipophilic moiety of S-adenosyl-N-decyl-aminoethyl (SADAE) has been observed in Hma [26], as well as didecyldimethylammonium bromide (DDDMAB) and cetyltrimethylammonium bromide (CTAB) in the structures of homologous CmaA1 and CmaA2, respectively [33].

The binding modes of ZT218, ZT260, and ZT585 (Figure 1) share common features: these fragments are buried between residues Ile204, Phe209, Tyr274, and Cys278 on one side and residues Glu149, Ser178, and Leu214 on the other side (Figures 3 and 4). Watermediated hydrogen bonds are observed in all three structures, albeit at longer distance in the case of ZT260: a water molecule, occupying an almost identical position in all three structures, connects the fragments to Glu146OE2 (2.5–2.7 Å), Glu149OE2 (2.5–2.7 Å), and Ser178OG (2.6–2.7 Å). In the case of ZT218, an additional water-mediated hydrogen bond to the imidazole group of His150 is found (Figures 3 and 4). Protein residues interacting with these fragments display a conformation almost identical to that observed in the structures of Hma in the presence of the SAM cofactor or analogues. In the *apo*-Hma structure, residues 151–153 were found to be disordered, resulting in a dramatically different conformation for

residues 147–150: the phenyl group of Phe148 in *apo*-Hma is approximately 12.5 Å from the position it occupies in the structures of these complexes. The conformation of residues 147–150 observed in the *apo*-Hma structure would not be compatible with the binding of these fragments. It is likely that the binding of ZT218, ZT260, and ZT585 fragments leads to structuration of residues 147–153, resulting in folding of the helix η1, as already observed upon binding of the SAM cofactor or analogues [26,30]. <sup>η</sup> η

η **Figure 3.** Detailed representation of binding of ZT218 (**A**, pink), ZT260 (**B**, beige), and ZT585 (**C**, yellow) at the substrate binding site. The protein backbone is represented as a tube, the side chains of residues involved in ligand binding are shown as sticks and labelled, water molecules as red spheres, and hydrogen bonds as blue dotted lines. Residues 148–151 forming helix η1 are coloured orange. η

**Figure 4.** 2D interaction maps of fragments ZT218 (**A**), ZT260 (**B**), ZT585 (**C**), ZT275 (**D**), ZT320 (**E**), and ZT424 (**F**) bound to the Hma protein. Hydrogen bonds are represented with dashed black lines and their lengths are indicated. Residues/atoms involved in van der Waals contacts are represented by notched semicircles (figure adapted from LigPlot+ [34]).

Although ZT275 and ZT320 (Figure 1) also bind to Hma at the substrate binding site (Figure 2), they induce a previously unobserved conformation for residues 147–154. These two fragments establish van der Waals contacts with Phe148, Gly152, Phe209, and Leu214 (Figures 4 and 5). The oxygen atoms of the sulphonamide group of ZT275 form a hydrogen bond with the main-chain nitrogen atom of Phe148 (3.2 Å) and Ser178 (3.0 Å). In the case of ZT320, the nitrogen atom of the amine moiety forms a hydrogen bond with the oxygen atom of the main chain of Phe151 (2.8 Å, Figures 4 and 5). In both cases, binding results in a modified conformation for residues 147–154. The three-residue long helix η1 (Phe148–His150), observed in the presence of the SAM cofactor or the ZT218, ZT260, or ZT585 fragments, is pushed away from the fragment binding site, reorganises, and includes Phe151. In this new position, Phe148 is about 10 Å away from the position it occupies in the structure of the other complexes: the main-chain atoms are in a position similar to that observed in the *apo*-Hma structure, but the position of the side chain is different, as a result of a 110◦ rotation of the χ1 dihedral angle. In addition, Glu149 and His150 are found at the position where the adenine moiety of the cofactor resides when bound to Hma [26,30]. Therefore, binding of ZT275 or ZT320 induces a new conformation that would not be compatible with the presence of the cofactor. η χ

η **Figure 5.** Detailed representation of binding of ZT275 ((**A**), pale green) and ZT320 ((**B**), turquoise) at the substrate binding site. The protein is represented as a ribbon, the side chains of residues involved in ligand biding are shown as sticks and labelled, water molecules as red spheres, and hydrogen bonds as blue dotted lines. Residues 148–151 forming helix η1 are coloured orange.

#### 2.1.2. Fragment ZT424 Binds at the Cofactor Adenine Site

Binding of ZT424 (Figure 1) is observed at the position where the adenine moiety of the SAM cofactor and its analogues were located [26,30] (Figure 2). ZT424 establishes van der Waals contacts with the side chains of Leu104, Trp132, His150, and Phe151 (Figures 4 and 6). The bromine atom of ZT424 makes a weak halogen bond [35] with the main-chain oxygen atom of Leu102 (4.0 Å), while the ring nitrogen atom of the fragment interact with a water molecule (3.3 Å), which is also hydrogen bonded to the main-chain nitrogen atom of Leu104 (3.3 Å). The hydroxyl group of ZT424 makes a weak hydrogen bond with the main-chain nitrogen atom of Trp132 (3.4 Å) and with the carboxylate group of Glu133 (3.4 Å). In this structure, residues 147–154 display the same conformation as that found in the structures of Hma in the presence of the SAM cofactor or analogues, as well as those obtained in the presence of ZT218, ZT260, and ZT585.

η **Figure 6.** Detailed representation of binding of ZT424 (light pink) at the cofactor binding site. The protein is represented as a ribbon, the side chains of residues involved in ligand biding are shown as sticks and labelled, water molecules as red spheres, and hydrogen and halogen bonds as blue and green dotted lines, respectively. Residues 148–151 forming helix η1 are coloured orange. The position of ZT260, coloured beige, in the substrate binding site is also indicated for easier comparison with Figures 4 and 5. η

#### 2.1.3. Fragments ZT260, ZT320, ZT585, and ZT726 Bind at the Protein Surface

Two fragment binding sites are observed on the surface of Hma (Figure 2). The first is delineated by Arg40, Arg111, and Trp84. A second molecule of the ZT260 and ZT320 fragments is located at this position, as well as ZT726 (Figure 1). The planar aromatic ring of ZT260, ZT320, and ZT726 is intercalated between the guanidinium groups of the two arginine residues and forms a perpendicular aromatic–aromatic interaction with the indole moiety of Trp84 (Figure 7). Broad, planar, and ill-defined electron density peaks were observed at this position in several of the structures obtained in this study, but they did not allow the unambiguous positioning of the corresponding fragments.

**Figure 7.** Detailed representation of binding of ZT260 (**A**), ZT320 (**B**), ZT726 (**C**), and ZT585 (**D**) at the protein surface. The protein is represented as a ribbon, the side chains of residues involved in ligand biding are shown as sticks and labelled, water molecules as red spheres, and hydrogen bonds as blue dotted lines. Residues 148–151 forming helix η1 are coloured orange. In D, the symmetry-related molecule is represented in beige and labelled in italics.

A second surface binding site is also observed in the case of ZT585, located between two protein molecules in the crystal, near helix α2 (Figure 7). ZT585 makes van der Waals contacts with the side chain of Tyr184 and hydrogen bonding with the side chain of Glu218. ZT585 is also hydrogen-bonded to the side chain of Glu207 and within van der Waals distance of the main-chain atoms of Asn269 and Arg270, all belonging to a symmetryrelated molecule in the crystal. The binding of ZT585 induces significant conformational changes. Indeed, in all Hma structures determined so far, the α2 helix comprises residues 183–188 and a two-residue long loop (residues 189–190) connects the α2 and α3 (residues 191–208) helices. In the presence of ZT585, the α2 helix is shortened at residues 183–185 and the α3 helix is extended with an additional turn at its N-terminus (188–208). α α α α α α

η

#### *2.2. Chimeric Compounds*

As several fragments bind to the substrate binding site, chimeric compounds were designed to mimic the simultaneous binding of two fragments and thus improve binding affinities. Superimposition of fragments suggested that chimeric compounds could be derived from the fusion of ZT218 with ZT260 or ZT585. Indeed, replacement of the phenyl ring of ZT218 with the aromatic core of ZT260 or ZT585 results in the series 218260x and 218585x (Figure 8). On the other hand, ZT275 could be merged with ZT320 to give the compounds 275320x (Figure 8). Additionally, compound 320sadae was designed by the addition of a lipophilic C<sup>7</sup> chain to the aromatic nucleus of ZT320, in order to mimic the binding of SADAE [26]. The chimeric compounds were manually positioned in the appropriate structure of Hma using Coot.

**Figure 8.** Chimeric compounds derived from merged fragments.

### *2.3. Molecular Dynamics Simulations*

### 2.3.1. *Apo*-Hma

In the crystallographic structure of *apo*-Hma, residues 151–153 were found to be disordered [30] suggesting that the 146–155 loop, connecting strand β4 to helix αD, was mobile, at least partially. In the presence of the SAM cofactor [30] and analogues [26], this loop displays decreased mobility, and residues 148–150 form the short η1 helix. A similar conformation of the η1 helix was also observed in the structures of complexes of Hma with ZT218, ZT260, and ZT585 bound in the substrate binding site and with ZT424 bound in the cofactor binding site. On the other hand, a very different conformation was observed in the presence of ZT275 and ZT320 in the substrate binding site. This new conformation would not be compatible with cofactor binding, as Glu149 and His150 are displaced to the position occupied by the adenine moiety. Molecular dynamics simulations were performed for *apo*-Hma, using the two observed conformations of the loop, for a simulation time of 1.2 µs. The calculations were performed in triplicate. The all-atoms root-mean-square fluctuation (RMSF) per residue monitored throughout the simulation indicates that some parts of the protein are indeed more flexible (Figure S1). In both cases, four regions of higher mobility can be identified. Indeed, the N- and C-terminal extremities (residues 19–28 and 298–301, respectively) and residues 152–155 and 182–195 display an average RMSF greater than 1.5 Å for more than three consecutive residues. Residues 152–155 are part of the disordered loop observed in the *apo*-Hma structure that folds upon cofactor binding and residues 182–195 are part of helices α2 and α3. These residues are displaced upon binding of ZT585 at the protein surface (see above). Although the RMSF profiles are comparable for the two starting conformations, residues 129–136 display greater mobility in the conformation compatible with the presence of the cofactor (average RMSF of 1.5 Å compared to 0.9 Å).

The root-mean-square deviation (RMSD) from the starting conformation was also analysed, after removing the global rotational and translational displacement and ignoring the parts of the protein with the highest RMSF, i.e., residues 19–28, 152–155, 182–195, and 298–301. The variations of the RMSD in function of the simulation time (Figure S2) display a homogeneous behaviour independent of the starting conformation. The RMSD converges to about 1.2–1.7 Å after 0.2 µs in each case.

The possibility of a transition between the two conformations observed for residues 147–154 was also investigated. To this end, some distances were monitored during the simulation. This was the case for the distance between the centre of mass of the aromatic side chains of Phe148 and Phe160, which was measured to be 5.7 and 16.2 Å in the crystallographic structures of the cofactor-compatible and cofactor-incompatible conformation, respectively. Similarly, the distances from the centre of mass of the imidazole group of His150 to the Cα atom of Gly131 or to the centre of mass of the aromatic ring of Tyr42 was also monitored. Initial values were 11.5 Å and 9.3 Å, respectively, in the cofactorcompatible conformation, and 4.7 Å and 14.4 Å, respectively, in the cofactor-incompatible conformation. Representative profiles of the variation of these distances over the course of the simulation are shown in Figure S3.

#### 2.3.2. Hma in the Presence of Fragments or of Chimeric Compounds

Molecular dynamics simulations were also performed with Hma in the presence of fragments, starting from observed crystallographic structures, or chimeric compounds, starting from manually generated structures. In the case of the fragments ZT218, ZT260, and ZT585, for which water molecules mediate hydrogen bonds to the protein, two simulations were performed, with or without these water molecules. Simulations were also performed in the presence of the cofactor or analogues of the cofactor, starting from the coordinates of the complexes found at the PDB [26,30].

#### *2.4. Estimation of Binding Energies for Fragments and Chimeric Compounds*

For the estimation of binding energies, 2000 consecutive frames were selected from the trajectories of the molecular dynamics simulation, after an equilibrium was reached, visualised by the stabilisation of the main-chain RMSD. One frame out of two was used for the calculation of the binding energies, according to the generalised Born method and the Poisson–Boltzmann method coupled to the surface area continuum solvation method (hereafter GBSA and PBSA, respectively). Both approaches approximate the enthalpic term of the binding Gibbs energy and neglect the entropic term, which would be complicated and time consuming to evaluate. Nevertheless, this simplification allows compounds to be ranked in a drug design perspective and the impact of chemical modifications of the ligand on binding to be assessed [36,37]. The PBSA approach is generally considered more accurate in calculating absolute free energies, but it is more time consuming and appears to be more dependent on the system under study, whereas the GBSA approach is better at ranking binding affinities [37]. PBSA and GBSA terms were evaluated for all ligands investigated, including the SAM cofactor and its analogues SAH, SADAE, and sinefungin, which have been shown to bind Hma [26,30]. The binding energies evaluated using the PBSA method are consistently lower than those obtained with the GBSA method (Table 2 and Figure S4). However, as a good correlation was found between the two methods (R<sup>2</sup> = 0.983, Figure S4), the values obtained with the GBSA method will be considered.

**Table 2.** Estimated binding energies and standard deviations (kcal/mol) for all ligands mentioned in this study.


\*An asterisk following the name of the fragment indicates that experimentally observed bridging water molecules were conserved in the molecular dynamics simulations.

As expected, the estimated binding energies for the fragments are significantly higher (−17.9 to −26.8 kcal/mol) than the values obtained for the SAM cofactor and its analogues (−39.6 to −67.7 kcal/mol). SAM, SAH (the reaction product), and sinefungin display comparable values (−45.7, −40.2, and −39.6 kcal/mol, respectively) while SADAE shows an extremely favourable binding energy (−67.7 kcal/mol). The chimeric compounds resulting from the fusion of the fragments exhibit binding energies ranging from −18.9 to −35.7 kcal/mol, intermediate between values found with the original fragments or the cofactor and its analogues. The most favourable chimeric compounds are 320sadae (−35.7 kcal/mol), which combines ZT320 with a C<sup>7</sup> alkyl chain reminiscent of the C<sup>10</sup> alkyl chain of SADAE, and 218585x (−32.3 to −35.1 kcal/mol) resulting from the fusion of ZT218 and ZT585. The chimeric compounds 218260x and 218585x show more favourable binding energies (between −29.9 and –35.1 kcal/mol) than the individual original fragments (−22.8 kcal/mol for ZT218, −18.0 kcal/mol for ZT260, and −26.8 kcal/mol for ZT585), which is not the case for the chimeric compounds 275320x, which exhibit comparable binding energies (between −18.9 and −24.9 kcal/mol) to those found for ZT275 (−17.9 kcal/mol) and ZT320 (−23.0 kcal/mol).

#### **3. Discussion**

#### *3.1. Crystallographic Screening*

X-ray crystallography is a powerful technique for fragment screening, as high concentrations of fragments can be achieved in co-crystallisation or soaking experiments, as long as the crystals are resistant to the treatment [38–40]. High concentrations are required to provide clear electron density for bound fragments, despite the expected low affinity resulting from their low molecular weights [41,42]. Nevertheless, weak binding is often observed, and specific ligand detection procedures have to be used in order to detect those fragments. In this regard, the use of the PanDDA procedure [31,43] was instrumental in this study to visualise the binding of the ZT275, ZT320, ZT424, and ZT726 fragments that were barely visible in conventional electron density maps.

Crystallographic screening of 126 fragments identified 7 bound fragments, corresponding to a hit rate of 5.5%, in the lower range of what is usually observed in a fragment screening [43–46]. Among the seven positive hits, five were found to bind in the substrate binding site and one in the cofactor binding site at the adenine position.

#### *3.2. Molecular Plasticity of Hma*

Comparison of the structure of *apo*-Hma [30] with those of Hma in the presence of the SAM cofactor or analogues [26,30] showed that the 146–155 loop was highly mobile in the absence of ligands and stabilised upon binding of the cofactor or analogues. This is also confirmed by our structures in the presence of the ZT218, ZT260, ZT424, and ZT585 fragments, since the 146–155 loop adopts a similar conformation to that observed in the presence of the cofactor. Surprisingly, while the ZT275 and ZT320 fragments also bind at the substrate binding site, albeit deeper in the crevice, they induce a different conformation of the 146–155 loop. Notably, in this new conformation, Glu149 and His150 occupy the position of the adenine portion of the cofactor. Hence, this new conformation is likely to be incompatible with the presence of the cofactor.

Molecular dynamics simulation of *apo*-Hma, starting with either of the two conformations observed for the 146–155 loop, was run in triplicate for simulation time of 1.2 µs to assess the structural plasticity of each conformation and the possible exchange between them. RMSD analysis along the simulation indicates that both conformations reach an equilibrium state with RMSD values of approximately 1.5 Å relative to the starting conformation (Figure S2). The RMSFs along the protein backbone also display a homogeneous behaviour: in addition to the N- and C-terminal ends, two regions display higher mobility, namely, residues 147–156 and 188–200, as previously observed in the case of the other mycolic acid methyltransferases CmaA2 and CmaA3, which are responsible for the cyclopropanation of MAs [47]. A notable exception occurs for residues 129–137 for which a

greater mobility is observed in the case of the cofactor-compatible conformation (average RMSF of 1.5 Å compared to 0.9 Å) (Figure S1). These residues border the cofactor binding site and interact with its adenine moiety [30]. In the cofactor-compatible conformation of *apo*-Hma, this site is filled with water molecules, and residues 129–136 are not restrained. In the cofactor-incompatible conformation, this site is occupied by His150, which makes a water-mediated hydrogen bond with the carboxylate group of Glu133. An additional hydrogen bond is observed between the main-chain oxygen atom of His153 and the sidechain nitrogen atom of Trp132 (2.9 Å). These interactions likely decrease the mobility of residues 129–136. In the presence of the SAM cofactor or analogues, it is the adenine moiety that similarly limits the mobility of residues 129–136 (Figure S1).

The crystallographic structures presented here indicate that the substrate binding site of Hma is capable of adopting at least two distinct conformations, depending on the ligand bound. Interestingly, one of these conformations is not compatible with the presence of the SAM cofactor in its binding site, which, from the perspective of inhibiting the enzyme activity, appears particularly interesting. It seems that there is no transition between the two conformations, at least during the 1.2 µs of the simulation. However, the evolution of inter-residue distances throughout the simulation of *apo*-Hma (Figure S3) suggests that the cofactor-incompatible conformation displays less structural variability than the cofactor-compatible conformation.

#### *3.3. Computed Binding Energies of Fragments and Chimeric Compounds*

As expected for low molecular weight fragments, the calculated binding energies are rather high (−19.1 kcal/mol on average), suggesting that the interactions are indeed tenuous. The explicit inclusion of experimentally observed water molecules involved in the interactions with the fragment and the protein does not significantly alter the binding energies. Indeed, although the presence of these water molecules might have an effect on the position of the fragment during the dynamics, and thus indirectly affect the estimation of binding energies, they do not directly contribute to the binding energy estimation, as the calculation relies on an implicit solvent model.

Although five of the identified fragments bind to the substrate binding site, they can be divided into two binding modes. Binding of the ZT218, ZT260, or ZT585 fragments induces a conformation for residues 146–155 similar to that observed in the Hma structures obtained in the presence of the SAM cofactor and analogues. Furthermore, from a steric point of view, the binding of these fragments would not prevent cofactor binding. Thus, the inhibitors derived from these fragments would compete with the enzyme substrate. On the other hand, binding of ZT275 and ZT320 fragments induces a different conformation for residues 146–155, resulting in residues 149 and 150 occupying the position where the adenine part of the cofactor is located. Thus, inhibitors derived from these fragments would simultaneously prevent binding of the substrate and cofactor, in a manner similar to that observed for SADAE [26,48]. However, unlike SADAE, which competes for binding with both substrate and cofactor, the inhibitors derived from the ZT275 and ZT320 fragments would act as competitive inhibitors for the substrate but as allosteric inhibitors for the cofactor.

Based on the observed structures, several chimeric compounds were designed by fusion of the identified bound fragments. The 218260x and 218585x series were derived from merging fragments ZT218 with ZT260, and ZT218 with ZT585, respectively, and compounds 275320x resulted from merging ZT275 and ZT320. Binding energies of chimeric compounds were evaluated in the same way as for those of original fragments. Among those chimeric compounds, compounds 2753201 and 2753202 display binding energies of the same order of magnitude as those of the original fragments (−18.9 and −24.9 kcal/mol). This could be related to the low molecular complexity of these compounds, comparable to that of the original fragments. The chimeric compounds 218260x and 218585x display much more favourable binding energies (−31.5 and −33.9 kcal/mol, on average, respectively).

SADAE is a SAM analogue that was shown to inhibit *Escherichia coli* cyclopropane fatty acid synthase (CFAS) both in vivo and in vitro [48], as well as Hma in vitro [26].

Furthermore, SADAE also inhibited the growth of Mtb and *M. smegmatis*, indicating that it is able to cross the cell wall of mycobacteria [26]. The efficacy of SADAE has been attributed partly to the lipophilic C<sup>10</sup> chain, which is thought to mimic the lipophilic chain of CFAS and Hma substrates [26,48]. The calculated binding energy for SADAE is extremely favourable, due to the numerous van der Waals interactions resulting from the presence of the lipophilic chain. It should be noted, however, that for a compound with such a degree of freedom, neglecting the entropic term is likely to lead to significant approximations. The chimeric compound 320sadae was designed by adding a C7-chain to ZT320 to occupy the substrate binding site. It exhibits the lowest binding energy (−35.7 kcal/mol), compared to other chimeric compounds, marginally better than the values obtained for compounds of the 218585x series. Compared to the binding energy of the original ZT320 fragment (−23.0 kcal/mol), the observed gain is, however, important, even considering the uncertainty resulting from neglecting the entropic term.

These results suggest that at least two strategies are conceivable to inhibit Hma, and potentially other mycolic acid methyltransferases, which share high structural similarities [33,49]. First, elaborating from the ZT275 and ZT320 fragments would yield compounds that simultaneously interfere with substrate binding, as they occupy the substrate binding site, and prevent cofactor binding, as they induce structural modifications of the protein that are not compatible with the presence of SAM. Secondly, the addition of a lipophilic moiety would optimise the occupation of the substrate binding site, and would contribute to improve the specificity of the compounds towards methyltransferases acting on long aliphatic compound, such as lipids. In this regard, the functionalisation of the aliphatic chain that would mimic reaction intermediates would further improve the inhibitors' affinity and specificity.

#### **4. Materials and Methods**

#### *4.1. Expression, Purification, and Crystallisation of Hma*

The Hma protein was expressed and purified as previously described [26,48]. In summary, a pET15b plasmid (Novagen) containing the *hma* cDNA was used for transformation of *Escherichia coli* BL21(DE3)pLysS bacteria. This construct exchanges the first three residues of Hma with a 20-residue cleavable His-tag. Expression of the recombinant protein was induced by the addition of 1 mM Isopropyl β-D-1-thiogalactopyranoside (IPTG) at 310 K for 3 h. After sonication and centrifugation, the soluble fraction was loaded onto a nickel affinity column (Amersham Biosciences, Amersham, UK) and the His-tagged protein was eluted with a 5–500 mM imidazole gradient in a buffer consisting of 50 mM MES, pH 6.5, and 300 mM NaCl. A final size exclusion chromatography step, using a Sephadex 75 HiLoad column (Amersham Biosciences), yielded a pure protein for structural studies.

The purified Hma protein was crystallised at 285 K by vapor diffusion using the hanging drop technique. The crystallisation conditions were optimised from those published previously [26,30] to reproducibly provide sufficient quantities of good quality crystals. A 3 µL droplet was prepared by mixing 2 µL of a 3–4 mg/mL of Hma solution (MES 50 mM, NaCl 50 mM, pH 6.5) with 1 µL of reservoir solution (BisTris 50 mM, PEG 3350 4% (*w*/*v*), pH 6.5). Under these conditions, seeding of crushed crystal fragments was necessary because the protein concentration in the drop was not sufficient to allow spontaneous nucleation. This procedure yielded reproducibly 5 to 10 single crystals per drop, bipyramidal in shape, and about 200 µm long in their largest dimension, suitable for soaking experiments.

#### *4.2. Fragments*

A 352-fragment library was acquired from Zenobia Therapeutics. The molecular fragments (average molecular weight 154 ± 29 Da, 0–3 hydrogen bond donors, 0–6 hydrogen bond acceptors, 0–3 cycles, and 0–5 rotatable bonds) were formulated at 200 mM in pure DMSO. Fragments were used without prior purification or characterisation.

#### *4.3. Crystallographic Screening*

Crystallographic screening was performed by soaking Hma crystals overnight in 20 mM fragment solutions in 50 mM BisTris, PEG 3350 4% (*w*/*v*), pH 6.5, at 285 K. Soaked crystals were cryoprotected by immersion for 2 min in the crystallisation solution supplemented with 20% (*v*/*v*) glycerol before cooling in a stream of nitrogen gas at 100 K.

Diffraction data were collected at ALBA (Barcelona, Spain, beamline XALOC), SOLEIL (Saclay, France, beamline PX1), and European Synchrotron Radiation Facility (ESRF, Grenoble, France, beamlines ID14-1, ID14-2, ID23-1, ID23-2, and ID29), and processed with XDS [50] and AutoProc [51]. Preliminary refinement was performed using the dimple pipeline of the CCP4 Program Suite [52] starting with *apo*-Hma coordinates [30] before identifying structures with potentially bound fragments with the PanDDA procedure [31]. These structures were further refined with REFMAC5 [53], Buster [54], and Coot [55]. The fragment dictionaries were generated using MarvinSketch [56] and the Grade Server [57].

#### *4.4. Molecular Dynamics Simulation*

Available crystallographic structures of the apo protein and of complexes [26,30], including those described here, as well as models of complexes generated in the presence of the chimeric compounds, were used as starting point for molecular dynamics simulations. The chimeric compounds were drawn and converted in 3D with MarvinSketch [56].

The tleap module for AMBER-20 [58] was used to generate a periodic cubic box extending 10 Å around the protein, containing the structure of the protein, the ligand if present, water molecules represented with the TIP3P model, and sodium cations to neutralise the system. The GPU version of the PMEMD module available in AMBER-20 was used for energy minimisation and molecular dynamics calculations. An initial energy minimisation was performed, with progressively reduced constraints on protein atom positions, followed by a 150 ps equilibration MD and a 100 ns production run. In the case of *apo*-Hma and of the SAM-Hma complex, the production simulations were extended to 1 µs. Analysis of trajectories, as well as monitoring of interactions and of inter-residue distances along the trajectories were performed using CCPTRAJ [59].

#### *4.5. Relative Binding Affinity Evaluation*

Molecular dynamics (MD) simulation was coupled with the MM-GB/PBSA postprocessing method [60] to estimate the interaction energies of fragments, cofactor and cofactor analogues, and chimeric compounds derived from the identified bound fragments. This procedure relies on frames extracted from an all-atom molecular dynamics simulation of a protein–ligand complex, after removal of solvent molecules, since these methods rely on an implicit solvent model. The enthalpic term of the Gibbs free energy of binding is approximated from the force-field energy, and the entropic term is usually neglected as it is extremely time consuming to calculate [61]. Therefore, this procedure does not provide true binding energies, but it can still estimate relative binding energies between ligands, as the entropy term should be dominated by the protein contribution, which should be comparable for the different ligands. The MMPBSA.py.MPI program [60] was used for the calculations.

#### **5. Conclusions**

The crystallographic screening of a fragment library allowed for the identification of 7 fragments bound to Hma. The presence of bound fragments in the substrate binding site of Hma induced two distinct conformations of residues 147–154. One of these conformations would be incompatible with the presence of the SAM cofactor in its binding site. Second generation chemical compounds were designed based on the observed positions of the fragments. Binding energies of initial fragments, of second generations molecules and of the SAM cofactor and analogues were estimated using MM-GBSA/PBSA methods. Whereas bonding energies of fragments were high, as would be expected for low molecular weight compounds, some of the second generations compounds displayed binding energies close to that found for cofactor analogues. These results suggest that our compounds could be further improved to inhibit Hma, and possibly other MAMTs. Additionally, our findings allow to envision the possibility of allosteric inhibition of cofactor binding.

**Supplementary Materials:** The following are available online at https://www.mdpi.com/article/10 .3390/ph14121282/s1, Figure S1: Root-mean-square fluctuation along the MD simulation of *apo*-Hma. The per residue main-chain atom fluctuations are average of three independent 1.2 µs simulations of *apo*-Hma starting either from the cofactor-compatible conformation (orange) or from the conformation observed in the presence of ZT275 or ZT320 that would be incompatible with the binding of the cofactor (blue). For comparison, the per residue main-chain atom fluctuations of the structure of Hma in complex with SAM (averaged from 2 independent 1.2 µs simulations) is shown in grey. Secondary structures elements are indicated, labelled and coloured as in Figure 2. Figure S2: Evolution of the root-mean-square deviation along the MD simulation of *apo*-Hma. RMS deviations (Å) were computed using main-chain atoms of residues 29–151, 156–181, and 196–297 on the whole trajectory and plotted as a function of time for the 1.2 µs simulation of *apo*-Hma starting either from the cofactorcompatible conformation (blue) or from the cofactor-incompatible, ZT320-bound conformation (orange). For clarity, only one in 10 values is plotted. A single representative curve is displayed for each simulation performed in triplicate. Figure S3: Variations of selected inter-residues distances along the simulation trajectory of *apo*-Hma. Distances were measured between the centre of mass of the aromatic side chain of Phe160 and of Phe148 (blue) and between the centre of mass of the imidazole group of His150 and either the Cα atom of Gly131 (orange) or the centre of mass of the aromatic ring of Tyr42 (grey). Distances are plotted as a function of time for the simulation starting from the cofactor-compatible conformation (top) and from the cofactor-incompatible, ZT320-bound conformation (bottom). Simulations were performed in triplicate, curves from a single simulation are shown. Figure S4: Comparison of binding affinities as evaluated using the GBSA or the PBSA approach. The linear fit is indicated. Initial fragments are represented with red dots, chimeric compounds with blue dots and SAM and analogues with green dots. An asterisk following the name of the fragment indicates that experimentally observed bridging water molecules were conserved in the molecular dynamics simulations. Abbreviation: SIN, sinefungin. A straight line is fitted to all the points, passing at the origin. The square of the Pearson correlation coefficient is indicated.

**Author Contributions:** Conceptualisation, L.M. (Lionel Mourey) and L.M. (Laurent Maveyraud); validation, R.G., L.M. (Lionel Mourey), and L.M. (Laurent Maveyraud); formal analysis, R.G. and L.M. (Laurent Maveyraud); investigation, R.G., Y.G., S.B., J.-C.P., L.M. (Lionel Mourey), and L.M. (Laurent Maveyraud); resources, J.-C.P. and L.M. (Lionel Mourey); writing—original draft preparation, R.G. and L.M. (Laurent Maveyraud); writing—review and editing, L.M. (Lionel Mourey), Y.G., S.B., and L.M. (Laurent Maveyraud); supervision, J.-C.P. and L.M. (Laurent Maveyraud); project administration, L.M. (Lionel Mourey); funding acquisition, L.M. (Lionel Mourey). All authors have read and agreed to the published version of the manuscript.

**Funding:** This project was supported by the Centre National de la Recherche Scientifique (CNRS). RG was supported by a fellowship from the Scientific Council of the Université Paul Sabatier, Université de Toulouse.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The atomic coordinates and crystallographic structure factors of complexes described in this work have been deposited in the Protein Data Bank (www.rcsb.org) with accession codes as follows: Hma-ZT218 7Q2B; Hma-ZT260 7Q2C; Hma-ZT275 7Q2H; Hma-ZT320 7Q2D; Hma-ZT424 7Q2E; Hma-ZT585 7Q2F; and Hma-ZT726 7Q2G.

**Acknowledgments:** We thank the scientific staff at the European Synchrotron Radiation Facility (Grenoble, France) SOLEIL (Saclay, France) and ALBA (Barcelona, Spain) for the use of their excellent data collection facilities. We particularly thank the staff of beamlines ID14-4, ID23-1, and ID29 at the European Synchrotron Radiation Facility and beamline PX1 at SOLEIL where the crystallographic data presented here were collected. The macromolecular crystallography equipment used in this study are part of the Integrated Screening Platform of Toulouse (PICT, IBiSA).

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **References**


### *Article* **A Deep-Learning Proteomic-Scale Approach for Drug Design**

**Brennan Overhoff , Zackary Falls , William Mangione and Ram Samudrala \***

Department of Biomedical Informatics, Jacobs School of Medicine and Biomedical Sciences, University at Buffalo, Buffalo, NY 14203, USA; brennano@buffalo.edu (B.O.); zmfalls@buffalo.edu (Z.F.); wmangion@buffalo.edu (W.M.)

**\*** Correspondence: ram@compbio.org

**Abstract:** Computational approaches have accelerated novel therapeutic discovery in recent decades. The Computational Analysis of Novel Drug Opportunities (CANDO) platform for shotgun multitarget therapeutic discovery, repurposing, and design aims to improve their efficacy and safety by employing a holistic approach that computes interaction signatures between every drug/compound and a large library of non-redundant protein structures corresponding to the human proteome fold space. These signatures are compared and analyzed to determine if a given drug/compound is efficacious and safe for a given indication/disease. In this study, we used a deep learning-based autoencoder to first reduce the dimensionality of CANDO-computed drug–proteome interaction signatures. We then employed a reduced conditional variational autoencoder to generate novel drug-like compounds when given a target encoded "objective" signature. Using this approach, we designed compounds to recreate the interaction signatures for twenty approved and experimental drugs and showed that 16/20 designed compounds were predicted to be significantly (*p*-value ≤ 0.05) more behaviorally similar relative to all corresponding controls, and 20/20 were predicted to be more behaviorally similar relative to a random control. We further observed that redesigns of objectives developed via rational drug design performed significantly better than those derived from natural sources (*p*-value ≤ 0.05), suggesting that the model learned an abstraction of rational drug design. We also show that the designed compounds are structurally diverse and synthetically feasible when compared to their respective objective drugs despite consistently high predicted behavioral similarity. Finally, we generated new designs that enhanced thirteen drugs/compounds associated with non-small cell lung cancer and anti-aging properties using their predicted proteomic interaction signatures. his study represents a significant step forward in automating holistic therapeutic design with machine learning, enabling the rapid generation of novel, effective, and safe drug leads for any indication.

**Keywords:** computational drug design; deep learning; multiscale; polypharmacology; autoencoder; docking; recurrent neural network

#### **1. Introduction**

Drug discovery—identifying chemicals with therapeutic effects against a particular indication/disease that is safe for human use—is a long, laborious, and expensive process. On average, \$3 billion and about 15 years are required to bring a novel chemical entity to the market using traditional approaches [1]. Computational methods are a popular means of identifying potential leads through paradigms such as high-throughput virtual screening [2–5], where simulations are run to assess the binding affinity of a library of compounds against a therapeutic target of interest. The combinatorial explosion of binding poses [6,7] and ligand conformations [6,8,9] and the chaotic nature of such dynamical systems [10] prevent popular virtual screening methods from producing safe and effective therapeutic leads a priori. These issues are exacerbated by the fact that virtual screening studies usually consider a single protein target, whereas drugs ingested by humans go through absorption, dispersion, metabolism, and excretion (ADME) and exert their effects

**Citation:** Overhoff, B.; Falls, Z.; Mangione, W.; Samudrala, R. A Deep-Learning Proteomic-Scale Approach for Drug Design. *Pharmaceuticals* **2021**, *14*, 1277. https://doi.org/10.3390/ph14121277

Academic Editor: Osvaldo Andrade Santos-Filho

Received: 24 September 2021 Accepted: 29 November 2021 Published: 7 December 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

(and side effects or toxicity (T)) via interactions with multiple targets and systems [3,11–15]. Furthermore, the chemical space explored by virtual screening is limited to a relatively small selection of compounds when compared to the vastness of the small molecule space [4,16], thus missing more effective and safer leads.

Computational methods are efficient, accurate, holistic (i.e., take into account the entire interaction space of chemical entities), and have breadth in terms of chemical space exploration necessary to overcome the limitations of traditional approaches [2,6,12,13,17–34]. To expand compound libraries utilized in screening, combinatorial chemistry and machine-learning design pipelines have been developed to generate libraries of compounds likely to bind to a given target [35–37]. Some notable examples in machine learning include Insilico Medicine's Chemistry42 platform, which designs compounds to a binding pocket [38], or a recent transformer-based network that utilized machine translation methods to generate binding ligands for the amino acid sequence of a target protein [39]. However, to take full advantage of these leads, additional screening and in vivo work must be performed to identify off-target binding as these approaches do not address the multitarget nature of drug interactions [11,13].

Various encoder–decoder models [40–42] for conditional [43] molecular generation on multiple properties have been proposed [17,44–46], but in most cases, these properties are limited to physiochemical ones. These models, however, do show great promise in their ability to rapidly generate compounds with desired properties. The most sophisticated conditional molecular generation performed thus far to our knowledge is inducing a differential expression profile of several hundred genes [17]. In all these models, proteins, the functional molecules that are primarily bound by human-ingested drugs to ensure efficacy and ADMET, remain to be considered explicitly on a large scale. Determining interactions between drug candidates and target proteins on a proteomic scale will offer the most comprehensive predictions for bioactivity and safety as many on- and off-targets will be considered simultaneously.

We developed the Computational Analysis of Novel Drug Repurposing Opportunities (CANDO) platform for shotgun multitarget drug discovery, repurposing, and design to overcome the aforementioned limitations of traditional single-target approaches [18–29]. The platform screens and ranks drugs/compounds for every disease/indication (and adverse event) through the large-scale modeling and analytics of the interactions between comprehensive libraries of drugs/compounds and protein structures. CANDO is agonistic to the interaction scoring method used; two primary pipelines within the platform allow for rapid screening and assessment of billions of drug/compound to protein interactions with fast bioanalytic docking and machine learning affinity regression protocols. Machine learning is also used to improve performance in conjunction with preclinical data in an iterative manner. Finally, CANDO implements a variety of benchmarking protocols for shotgun repurposing, i.e., to determine how every known drug is related to every other in the context of the indications/diseases for which they are approved, which enables the evaluation of various pipelines and protocols within and external to the platform for their utility in drug discovery. The multiple fast and accurate interaction scoring/docking protocols, the proteomic scale, and rigorous all-against-all benchmarking used within the platform make it unique and ideal for the design of chemical entities that target a desired proteomic space or objective.

Here, we describe the development and rigorous benchmarking of a multi-step deep learning pipeline for drug design. These pipelines perform conditional drug design given a desired proteomic interaction signature using a generative approach to explore the vastness of the entire small molecule space, while evaluating the functional behavior of candidate designs across the proteomic space. The CANDO platform's benchmarking strategy is used in a modified fashion to determine the performance of the designed compounds relative to an objective drug. We show that the best generated designs were evaluated as being equivalent or better than a variety of controls for twenty objective drugs. Our pipeline represents a significant leap in automating holistic drug design with machine learning, with the ability to rapidly generate effective and safe drug candidates that accurately target multiple proteins within a proteome as desired.

#### **2. Results and Discussion**

#### *2.1. Behavioral Similarity of Designed Compounds to Their Objectives*

We observed excellent performance of our Reduced Conditional Variational Autoencoder (RCVAE) proteome-scale design pipeline in all our benchmarking experiments following training (see the methods Section 3.2). If this performance continues to hold following synthesis and validation, it indicates that this pipeline will greatly enhance the pharmaceutical discovery pipeline for novel treatments against a variety of simple and complex indications. A critical aspect of verifying the utility of the designs generated was to compare their predicted behavior (i.e., proteomic interaction signatures) to their intended behavior, which was input to conditional generation. If the predicted behaviors of designed compounds were highly similar to the conditional objective across objectives relative to the corresponding controls, we concluded that the RCVAE design pipeline may be used to accurately design compounds that possess any desirable bioactivity and subsequent function, given the extensive benchmarking and validation the CANDO paradigm has undergone [18,21,26,29,47–50]. This is the primary motivation and goal for using the CVAE architecture in terms of accelerating drug discovery: design with respect to arbitrary numbers of on-, off-, and anti-targets (Figure 1).

**Figure 1. Deep -learning architecture and pipeline for generative drug design.** We used the CANDO platform to predict interaction signatures for each compound in a training set against a library of nonredundant protein structures representing the human proteome. The interaction signatures have their dimensionality reduced in an autoencoder, which models the underlying correspondence between protein structures as they behave in the proteome. The reduced signatures are then used as labels for each training compound, which the generative conditional variational autoencoder model learns to reconstruct given a target interaction signature. This pipeline allows us to redesign behaviorally similar compounds to existing drugs based on their interaction signatures, as well as to modulate interactions on a proteomic scale as desired to generate behaviorally novel therapeutics.

We evaluated the performance primarily by the median of each distribution, indicated by the horizontal bars in the box plots in Figure 2. Lower root-mean-squared deviation (RMSD) values indicate a greater reproduction of proteomic interaction signatures for the intended and/or predicted behavior of any given compound, i.e., greater behavioral similarity. Every set of redesigns performed significantly better (*p*-value of ≤ 0.05) than a selection of random compounds according to this criterion. Additionally, despite being comparatively close in predicted proteomic behavior, our redesigns maintained high levels of structural diversity, as evidenced by their Tanimoto coefficients to our drug library (average ≤ 0.39). For sixteen of the objectives (excluding sirolimus, cucurbitacin Q1, digoxin, and myriocin), the redesigns significantly outperformed the corresponding top 100 and same indication controls. The top 100 control, which included several "me too" compounds (or structural analogs) for each objective [24,51], is the most rigorous one we could devise and illustrates that just generating 100 designs in many instances produces more behaviorally similar compounds to a desired interaction signature than selecting the most similar 100 compounds from a total of 13,194 (the size of the CANDO drug library). The existence of structural analogs in the top100 control indicates bias in favor of already effective compounds in an effort to break into a new market or retain market dominance by generating new intellectual property. New drugs are often derivatives of existing ones with small changes, which our design pipeline is able to overcome, particularly with a bit of extra effort (see Section 2.4 below). Overall, these results indicate that the RCVAE design pipeline produces compounds that accurately match the behavior of desired proteomic interactions relevant to drug discovery. In other words, interactions related to therapeutic efficacy, ADME, and toxicity are modulated precisely in the designed compounds, particularly for objectives from the rational sources subset.

#### *2.2. Relative Performance Gains of Designs Relative to Controls*

The bottom panel of Figure 2 compares the relative performance gains for proteomic objectives from the rational design and natural sources subsets relative to the controls. Compounds in the natural sources subset were derived from massively parallel evolutionary processes over eons of time. As a result, they exhibit evolutionary drift, resulting in complex behaviors that are suboptimal from a therapeutic discovery perspective, i.e., unnecessary off-target interactions and/or individual interactions drifting away from functional free energy minima [52]. In addition to verifying that our designs behave as intended, our benchmarking also shows that the RCVAE design pipeline accomplishes this replication of behavioral similarity through an abstraction of rational drug design. That is, the similarities of redesigns to objectives from the rational design subset were greater relative to those from the natural sources subset (Figure 2). Adopting an abstraction of rational drug design is optimal for the impact of this platform on drug discovery because if the model was merely replicating the molecular structure of design objectives and not proteomic behavior specifically, the limitations of natural products (low synthetic accessibility, poor ADMET [53]) would present themselves in the designs in addition to indicating that the model may be over-trained. This discrepancy in the similarity of redesigned natural products and rationally designed drugs, therefore, further supports the notion that the RCVAE pipeline is able to intelligently design compounds with desirable bioactivities across multiple targets.

**Figure 2. Performance of our deep learning drug design pipeline.** To evaluate the performance of the RCVAE pipeline for drug design, we compared redesigned compounds to three controls. The root-mean-squared deviations (RMSDs) between redesigned and control interaction signatures for the rational design and natural sources subsets and the corresponding objective signatures were used to evaluate the performance, as a proxy for behavioral similarity. In the top panels, the blue and green box plots denote the distributions of RMSDs between predicted proteomic interaction signatures for 100 redesigns and the ten corresponding objectives from the rational design (left) and the natural sources (right) subsets along the horizontal axis, respectively. For each RMSD distribution box plot, the boxes indicate the first and third quartile ranges, the horizontal bar indicates the median, and the whiskers indicate non-outlier ranges, with outliers plotted as dots. As a naive control, the red box plots in both panels denote the distributions of the RMSDs between predicted proteomic interaction signatures for a set of 100 randomly selected drug-like compounds taken from the ZINC database [54] and the corresponding objectives. Yellow box plots denote the distributions of the CANDO-predicted top 100 most similar compounds by interaction signature and their corresponding RMSDs; this is a more rigorous control, which checks to see if there exists any compound in the CANDO library that could match or exceed the performance of the design pipeline. Purple box plots denote the RMSD distributions of compounds approved for the same indication as the objective; this is a third control based on phenotype. Dots represent RMSD values for outlier points for each distribution. Dashed lines show the average RMSDs of all compounds corresponding to a given color. All redesign–control distribution pairs were significantly different as indicated by Kolmogorov–Smirnov (K-S) tests and outperformed random controls. For 16/20 of the objective compounds, our redesigns were able to perform better than the top 100 and same indication controls (medians and distributions). The exceptions were four compounds from the natural sources subsets (sirolimus, cucurbitacin Q1, digoxin, and myriocin), discussed further in Sections 2.2 and 2.4. The bottom panel displays cumulative frequency graphs of percentage increases of similarities ("proximity") to the objective for rational design and natural sources subsets redesigns (see the Materials and Methods Section 3.3). Redesigns from the former subset were significantly more accurate to the objective compound than the natural sources subset when compared to the naive control, with mean percentage similarity increases of 33.4% and 32.9%, respectively (*p*-value ≤ 7.0 × 10 −6 ). This indicates that not only does the design pipeline generate compounds with the desired proteomic-scale behavior, but that it likely implements a learned abstraction of rational drug design.

#### *2.3. Visualizing and Filtering Using t-SNE Plots*

For all objectives, the redesigns greatly outperformed existing drugs approved for the desired indication when compared to the objective signature. Despite this, some Kolmogorov–Smirnov (K-S) tests indicated weaker statistical significance when differentiating between the RCVAE design and the same indication distributions. To understand the distributions of redesigned and control compounds and to provide an additional filtering mechanism to evaluate the performance of our designs as illustrated in Figure 2, we visualized the interaction signatures of all compounds (objectives, redesigns, and all controls) evaluated in our benchmarking with t-SNE plots [55] (Figure 3).

**Figure 3. t-SNE visualizations for the interaction signatures of objective compounds, redesigns, and various controls with structural comparisons.** t-SNE plots were generated for each of the twenty design objectives. The t-SNE algorithm was run on the interaction signatures of the objectives (black stars) and 100 redesigns (blue and green circles), as well as the 100 random (red squares), top 100 (orange triangles), and same indication (purple hexagons) control compounds. (Color coding is the same as in Figure 2.) Euclidean distances shown in t-SNE visualizations generally corroborate our findings from Figure 2 when considering the RMSD due to the maintenance of proximal points and distributions. The exceptions to this were for benzylpenicillin and paclitaxel, as Figure 2 shows greater proximity for designs than the top 100 control, whereas t-SNE plots show better clustering for the top 100 set around the objective. The generally (14/20) greater proximity of designed compound clusters to the objective point when compared to top 100, same indication, and random control compounds corroborate the behavioral similarity of designed compounds to their objective in the predicted interaction space, i.e., designed compounds are predicted to behave in the manner in which they were designed.

t-SNE plots generally corroborate the relative behavioral similarities between redesigns and controls relative to their objectives. RCVAE pipeline designs tend to cluster densely around the objective compound, as they are designed to do. The top 100 compounds cluster around these designs, with a few structural analogs very close to their objectives. Finally, the same indication and random compounds clustered the farthest from the objective. As the same indication compounds represent a group of diverse drugs approved for an indication that includes the objective, there is a general lack of clustering, and they are at greater distances from their objectives. Comparing the performance of the designs relative to their objectives and controls using both Figures 2 and 3 indicates that the t-SNE plots illustrated in the latter may be used to assess the confidence in and room for improvement of the design pipeline performance for specific objectives.

#### *2.4. Improving Cases with Sub-Optimal Performance*

For eleven objectives, a handful of outliers in the top 100 or same indication controls outperformed the design distributions (Figures 2 and 3). As noted above, structural analogs create a bias when evaluating performance due to them having very similar behavioral interaction signatures. We observed that the top 10 compounds (out of the top 100 controls, covering almost all outliers) yielded an average Tanimoto coefficient of 0.51, in contrast to an average of 0.39 for all designs, relative to their objectives. We further observed that 23/200 top 10 compounds, across all 20 objectives, had a Tanimoto coefficient ≥0.90 with an average of 0.96, demonstrating the "me too" bias with the top 100 control. Unlike the few top 100 outliers, the designs offer a larger and more diverse selection pool for prospective validation.

Regardless, we further investigated the behavior of the RCVAE pipeline for one of the objectives (cucurbitacin Q1) where an outlier was clearly better than the best designed compound by expanding the number of designs generated by the RCVAE pipeline from 100 to 1000 compounds. We found that the RMSDs of the top 100 out of 1000 designs, far less than the 13,194 compounds that were the source for the top 100 most similar compounds control, ranged from 0.073 to 0.09. This placed the RMSD of the best designs well below the lowest RMSD outlier of the top 100 control. Altogether, this indicates that the performance of the RCVAE pipeline may be enhanced by increasing the number of designs generated and selecting for behavioral similarity. Our design pipeline offers structurally diverse lead compounds with the potential to match or exceed the behavioral similarity of the top CANDO predictions from its known drug library for noisy objectives, such as compounds from the natural sources subset.

#### *2.5. Synthetic Feasibility of Designed Compounds*

To viably demonstrate that our 2000 redesigns were synthetically feasible, we utilized a high-throughput machine-learning-based approach to predict synthetic complexity scores, called SCScore. SCScore utilizes a database of known synthetic reaction pathways for training to make predictions of how easy or difficult it is to synthesize a novel compound.

As shown in Figure 4, predicted synthetic complexities for the RCVAE pipeline redesigns are often comparable to their design objective, i.e., the objective compound's scores exist within the distribution of the scores for the redesigns. Low Tanimoto coefficients (average 0.39) between designs and objective compounds indicate that the comparable synthetic complexities of our redesigns are not due to a corresponding high structural similarity. In other words, despite the high structural diversity of the generated designs, the synthetic complexity for objective and designed compounds remains stable. This may be explained by the maintenance of functionally relevant substructures of comparable synthetic complexity that are present in any given objective and redesign that are combined differently to produce similar behaviors and low structural similarity. We are studying this phenomenon more thoroughly with larger datasets by investigating the redundancy of the substructures of all approved drugs and redesigns for publication in a future study.

**Figure 4. Predicted synthetic complexity of designs compared to objective compounds.** Synthetic complexity scores for objective and RCVAE pipeline designed compounds as predicted by SCScore [56] serve as the basis for this comparison. Box plots denote the distributions of predicted synthetic complexity scores for RCVAE redesigns of objective compounds (stars). The scores of the objectives enable us to evaluate the synthetic feasibility of the redesigns relative to their corresponding objective compounds. There is a greater occurrence of low-scoring natural sources objectives in predicted synthetic complexity. This indicates that the SCScore software is biased by what is already well known or it may also highlight a potential reason why RCVAE performs better for objectives from the rational design subset: it is somewhat plausible that evolution optimizes for synthetic accessibility over binding free energies, hence the discrepancies in behavioral similarities between the designs and their objectives for the two benchmarking (rational design and natural sources) subsets (Figure 2). RCVAE designs are of similar synthetic complexity to their corresponding objective compounds as most objective complexities lie within the box plot whiskers (not outliers), with some designs being more accessible than their objectives.

> Additionally, predicted synthetic complexities are typically lower for objectives and redesigns in the natural sources subset relative to the rational design one, indicating that the SCScore software may be biased by what is already well known. As objectives in the natural sources subset score similarly to their redesigns, it is somewhat plausible that evolutionary optimization towards synthetic accessibility, at the expense of macromolecular interaction free energies [52], may account for the diminished behavioral similarity of RCVAE designs between objectives in the rational design and natural sources subsets correspondingly (see Figure 2). A comprehensive analysis of this hypothesis with larger subsets is necessary to validate or falsify this hypothesis.

> It is useful to note that many RCVAE designs are more synthetically accessible than the compound whose behaviors they replicate. As Figure 2 already indicates high behavioral similarity between redesigns and their objectives, the RCVAE design pipeline may serve the additional purpose of designing analogs to existing drugs that are more synthetically feasible and therefore easier and less costly to produce. This is accomplished without adding a synthetic complexity parameter to the condition vector for compound generation (Figure 4).

> Finally, we routinely used several other methods to evaluate RCVAE designs such as the Quantitative Estimation of Drug-likeness (QED) [57], Synthetic Accessibility Score (SAScore) [58], and AiZynthFinder [59] to assess their chemical viability and drug-likeness. We also compared our designs to benchmarks from GuacaMol [60]. These results corroborated the outputs of our benchmarking and/or SCScore; future work will include a rigorous evaluation of drug design technologies, much as we have done for repurposing [29].

#### *2.6. Applications to Aging and Non-Small Cell Lung Cancer*

A fundamental tenet of CANDO is that evaluating all the possible interactions between a human-ingested drug/compound and the macromolecules and systems it encounters on a proteomic/interactomic scale is necessary to determine its safety and efficacy for a given indication [18–29]. The RCVAE design pipeline represents a significant step forward in early drug discovery as it allows for the virtually unlimited generation of novel putative drug candidates to treat any indication/disease by combating it on a proteomic scale. As a precursor to upcoming work, we generated designs for several objective compounds/drugs associated with human or cell longevity ("aging") [61–70] and approved for Non-Small Cell Lung Cancer (NSCLC) [71–78]. We expect our pipeline to produce redesigns that retain the proteome-scale behaviors of these objectives used to treat these complex indications/diseases, while being structurally diverse.

Figure 5 illustrates the top redesigns for the thirteen objectives covering the two indications ranked using two metrics based on the greatest similarity criterion: lowest RMSD between corresponding interaction signatures and highest Tanimoto coefficient between corresponding molecular fingerprints. Two classes of redesigns, all with the greatest behavioral similarity to their objectives, emerged when performing the comparisons illustrated in Figure 5: designs that chemically/structurally resemble their objective compound/indication (metformin, NAD+, resveratrol, curcumin, and RepSox for aging and gefitinib, erlotinib, afatinib, and dacomitinib for NSCLC, all with a Tanimoto coefficient ≥0.39) and those that do not. The former class is intriguing as it implies the existence of highly optimized structures for a given phenotype (indication), as shown by the convergence of redesigns to known drugs. It also demonstrates that the RCVAE design pipeline produces highly similar designs to known drugs to perform specific tasks, only from their proteomic interaction behavior and without being exposed to any similar structures in training. In other words, the proteomic-scale interaction information for a compound is enough information for our design pipeline to reliably reconstruct a chemically/structurally similar compound in some cases. The latter class demonstrates the expanse and diversity of a chemical space not yet charted by medicinal chemists, capable of replicating the interactions of known drugs. We are in the process of synthesizing the top designs from these pipelines for these indications and validating them in corresponding preclinical models via industry partners and collaborators.

#### *2.7. Limitations and Future Work*

To treat an indication in a comprehensive fashion, especially complex ones such as aging or NSCLC, entirely novel drugs will likely need to be developed that go beyond replicating the systemic effects of existing ones. This would require a thorough and accurate description of the interaction networks responsible for disease etiology, as well as compound behavior to ensure optimal efficacy and safety. The creation of these interaction networks may be accomplished by multiscale modeling, literature/database analyses, and/or high-throughput experimental studies, all of which may be incorporated within CANDO. We are currently in the process of conditioning the RCVAE design pipelines and comparing them to ones based on graph neural networks [79] using these more complex interaction networks that go beyond the information present in the linear signatures.

**Figure 5. Analysis of the top designs generated for aging and Non-Small Cell Lung Cancer (NSCLC).** The names, chemical structures, and predicted Synthetic Complexity (SC) scores of the objective compounds for aging (left) and NSCLC (right) are displayed in the top row of each panel (red border). The chemical structures of the most similar redesigns to each objective based on their interaction signatures alongside three metrics for these redesigned compounds relative to their objectives (RMSD of interaction signatures, Tanimoto coefficient, and predicted synthetic complexity) are displayed in the row below (blue border). The chemical structures of the most similar redesigns using molecular fingerprints alongside the same three metrics (RMSD, Tanimoto, and synthetic complexity) for these redesigns relative to their objectives are displayed in the bottom row (green border). The designs generated using the RCVAE pipeline retain a fair amount of structural diversity, as indicated by the fingerprint comparison scores (low Tanimoto coefficients) whilst maintaining high predicted behavioral similarity in terms of interaction signatures (low RMSDs). Other methods for evaluating our designs such as QED corroborated the above results [57]. These designs demonstrate the utility of our pipeline for designing novel compounds to combat complex indications on a proteomic scale and may be pursued further via preclinical validation studies.

We are currently exploring other scoring protocols within the CANDO platform for conditional generation to overcome the limitations of any specific interaction calculation method. For example, the interaction scoring protocol used in this work has been shown to have great utility in the context of evaluating proteomic behavioral similarity based on benchmarking performance [18–29]. However, information on agonism/antagonism, and downstream functional activity upon binding may be obtained via design pipelines that utilize gene and protein expression data, which we are incorporating into CANDO. Public gene expression data, available through the L1000 and Connectivity Map projects [80,81], highlight important genes/proteins that are upregulated and downregulated following exposure to a drug/compound. For example, if gene expression data suggest that certain downstream genes are significantly upregulated in a given pathway following interaction with a compound and that same compound is predicted to have strong interactions with multiple proteins in that pathway, it can be inferred that those are likely to cause activating/agonist behavior. The same can be inferred for downregulated genes and inhibition/antagonism. In addition, we are exploring the use of high-throughput robotic systems such as DESI-MS to generate large-scale interaction and activity data [82–90]. CANDO thus enables and illustrates the benefit of combining heterogeneous sources (gene, as well as protein expression, protein pathway databases, high-throughput binding, and activity data) to create novel types of interaction signatures/networks to produce design objectives that tackle complex indications.

The CANDO platform enables the benchmarking of any arbitrary proteome/protein library for its utility in drug discovery using a similar all-against-all process as described in Section 3.3. The generation of highly accurate modeled protein structures such as those predicted by Deepmind's AlphaFold offer an attractive representation of the full human proteome to perform such benchmarking, which we have completed and will publish separately. This allows for conditional drug design using AlphaFold interaction signatures, especially giving us greater coverage and control over particular proteins and pathways to modulate with expert input for specific indications.

The benchmarking and performance evaluation of our RCVAE drug design pipeline were based on using known data as the ground truth or gold standard. While computational experiments are an important first step and indicate promising, prospective preclinical validation of the pipeline, its designs will require medicinal chemistry synthesis, binding studies, and disease models assays at multiple scales, which we are currently undertaking. Regardless, our combined work to date [18–29], including this study, indicates that novel high-throughput methods for rapidly identifying relationships between compounds, proteins, pathways, and cells are highly desirable for holistic drug discovery. As CANDO is agnostic to the specific methods used for any of its protocols, should such data become available, the RCVAE design pipeline described here would be well poised to take advantage of them for maximum drug discovery efficiency.

#### **3. Materials and Methods**

Figure 1 illustrates our overall methodology to create a new drug design pipeline. We employed the Computational Analysis of Novel Drug Opportunities (CANDO) platform to generate proteomic interaction signatures for the compounds in the training set of our learning-based model. The CANDO interaction signatures had their dimensionality reduced in an autoencoder, which models the underlying correspondence between protein structures as they function in the proteome. The reduced signatures were then used as labels for each training molecule, which the generative Conditional Variational Autoencoder (CVAE) model learns to reconstruct given a target interaction signature. We then used CANDO to benchmark the performance of the designed compounds in the context of their objectives and make predictions of novel designs for two indications for future prospective validation.

#### *3.1. Compound–Proteome Interaction Signature Generation Using the CANDO Platform*

Multiple pipelines for multiscale therapeutic discovery, repurposing, and design have been implemented in the CANDO platform [18–29]. Here, we utilized CANDO to simulate the interactions between a given drug/compound and a library of protein structures to generate the corresponding proteomic interaction signature.

The protein structure library used in this study is a set of 14,606 nonredundant structures derived from the Protein Data Bank (PDB) [91] corresponding to the human proteome fold space ("nrPDB")[92–97]. The compound–protein interaction scores in these signatures are computed using the bioanalytic docking protocol BANDOCK, which compares query compound structures to all ligands that are known or predicted to interact with a protein binding site [22,27,29].

Potential binding sites on a protein are elucidated using the COACH algorithm, which uses three different complementary algorithms and a consensus approach to consider the sequence or substructure similarity to known PDB binding sites [98]. COACH has been utilized extensively within the CANDO platform to accurately predict the binding behavior of numerous compounds against numerous targets, as demonstrated by its benchmarking performance in multiple studies (Section 3.3 and [18–29]). For each potential binding site, the COACH output includes a set of co-crystallized ligands, which are compared to a compound of interest using binary chemical fingerprinting methods that describe the presence or absence of particular molecular substructures [99]. The maximum Tanimoto coefficient between the binary fingerprints of the query compound and the set of all predicted protein binding site ligands becomes the interaction score. The better the score, the higher the likelihood of the interaction being correct due to the inferred homology. Thus, if there are proteins with multiple binding sites and corresponding ligands, the strongest interaction is used. If there are no matches, then the score returned is zero (i.e., no interaction). The final output is a vector of 14,606 scores comprising the interaction signature between a given compound and the nrPDB library. Further detail on the pipelines used to generate and benchmark the interaction signatures is given elsewhere in numerous publications [18–29].

#### *3.2. Model Architecture and Data Generation*

We selected a CVAE [45] architecture for generating novel molecular structures. Training data consisted of SMILES [100] strings labeled with predicted proteomic interaction signatures based on the nrPDB library. The training set consisted of 300,000 compounds selected at random from the ZINC database [54]. The 14,606 protein binding scores were predicted for each compound. The dimensionality of the protein interaction signatures was reduced to a 200-dimensional vector via a conventional autoencoder [101]. Following an input layer with 14,606 neurons, the encoder consisted of 10 sequential, densely connected layers with 10,000, 7750, 5500, 2250, 2000, 1250, 1000, 500, 250, and 200 neurons in each layer, respectively. This was reversed in the decoder, and a final layer with 14,606 neurons was used as the output to the network. The root-mean-squared deviation (RMSD) between the input and reconstructed signatures was used as a loss metric. This model was trained on 250,000 compounds until over-fitting was observed, which occurred after 15 epochs/iterations. Each epoch was validated on another 50,000 randomly selected compounds. The model was then used to reduce the non-redundant signatures of the training compounds. These became the labels for each SMILES string present in the CVAE training data.

Before being input into the CVAE, SMILES strings were one-hot encoded [102,103], resulting in a rank-2 tensor of size (sequence length × vocab length), where sequence length is the maximum number of characters allowed per SMILES string and vocab length is the unique number of characters represented in the input data. The reduced protein signature, *c*, was appended to the end of the one-hot encoding repeated at each sequence position (commonly referred to as time steps in the context of Long Short-Term Memory (LSTM) cells [102,104,105]). Similar to [45], the tensor was fed sequentially through three LSTM cells [104,105] to encode the original input. The encoder outputs to two parallel layers, one representing the mean and one for the standard deviation of the latent vector. The latent vector, *z*, consists of 200 dimensions and is sampled from the encoder output. The latent vector is then repeated for the total number of time steps, and the protein signature, *c*, is re-appended onto the resulting tensor in the prior fashion. This is input into the decoder, which also consists of three LSTM cells. Finally, this is output to a matrix of probabilities for each character in a SMILES string. Taking the maximum probability token for each character slot, one-hot encoded SMILES strings denoting reconstructions of input compounds are generated.

The loss metric used to train the CVAE is as follows:

$$E[\log(P(X|z,\mathcal{c}))] - D\_{\rm KL}[Q(z|X,\mathcal{c})||P(X|z,\mathcal{c})] \tag{1}$$

where *E* denotes the reconstruction error and *DKL* denotes the relative entropy or the Kullback–Leibler divergence [106]. *P*(*X*|*z*, *c*) denotes the probability density function approximated by the decoder for each character in a SMILES string given the latent and conditional vectors. *Q*(*z*|*X*, *c*) denotes the probability density function approximated by the encoder given the input SMILES strings and condition vector. The CVAE was trained on 300,000 compounds until convergence. The Reduced CVAE (RCVAE) model pipeline is depicted visually in Figure 1.

#### *3.3. Benchmarking and Analysis of the RCVAE Design Pipeline Performance*

The fundamental supposition and result of benchmarking the CANDO platform is that compounds with similar interaction signatures will behave similarly. On a proteomic scale, these behaviors take the form of efficacy and ADMET for a given indication. The CANDO platform was benchmarked using known drug-indication associations [18–29] derived from the Comparative Toxicogenomics Database [107] and, more recently, drugadverse events obtained from OFFSIDES [108,109] and SIDER [109]. We recently published the best metrics to use for benchmarking drug repurposing platforms [29]. The results of benchmarking and prospectively validating CANDO indicate that the proteomic-scale interaction modeling of drugs elucidates their behaviors, and these behaviors correspond

to treatments for indications for which these drugs are approved [18,21,26,29,47–50]. The benchmarking of the platform using known associations in this comprehensive all-againstall manner enables us to assess the correctness and utility of other parameters, such as the protein library composition, solved vs. modeled structures, different molecular docking and machine-learning algorithms, etc.

We performed several benchmarks of the RCVAE to verify its utility and robustness beyond that of its performance based on metrics used in training. Our benchmark set consisted of twenty approved or experimental objective drugs, comprised of subsets of ten derived from rational design and natural sources, respectively (Figure 2). These compounds and all related ones with a Tanimoto [110] coefficient ≥0.9 were omitted from training. Proteomic interaction signatures were computed by CANDO, and output SMILES strings were generated by the RCVAE design pipeline as described above. This resulted in SMILES strings corresponding to 100 novel redesigns for each objective drug. One-hundred compounds were selected at random from the curated ZINC database to serve as a naive control. As a second, more rigorous control, the CANDO platform was used to generate the 100 most similar compounds ("top 100") from its 13,194-sized library to each objective drug according to their proteomic interaction signatures. As a third control, the CANDO platform's indication prediction pipeline was used to predict a set of compounds for the indication associated with each objective, i.e., the indication that the objective is approved for ("same indication") [27,107]. Proteomic interaction signatures were generated for all compounds (designs and controls), which were then compared to that of the objective using the RMSDs between them. The RMSD distributions are illustrated using box plots with boxes depicting the first and third quartile ranges, horizontal bars depicting the median, whiskers representing non-outlier ranges, and outliers explicitly plotted (Figure 2, Results Section 2.1).

Kolmogorov–Smirnov (K-S) tests [111,112] were used to demonstrate statistical significance [113] between samples for each redesign–control RMSD distribution pair for each objective. We also compared the performance of the RCVAE design pipeline between objectives from the rational design and natural sources subsets, respectively. To do this, we computed the average RMSD between each naive control and the corresponding objective and compared this to the RMSD of each redesign for an objective. This yielded a percent increase of similarities to the objective for each redesign given by:

$$P = \frac{ -RMSD\_{reddesign}}{} \tag{2}$$

where *P* denotes the percent increase ("proximity"), < *RMSDcontrol* > denotes the average RMSD between the naive control and corresponding objective, and *RMSDredesign* denotes the RMSD of the redesign when compared to the objective proteomic interaction signature. This yielded 1000 total values of percent proximity increases for both the rational design and natural sources subsets. The values for both subsets were then averaged and compared using K-S tests (Figure 2, Results Section 2.1).

To better visualize the distributions of our redesigns and controls, t-distributed Stochastic Network Embedding (t-SNE) plots [55] that show the clustering of similar interaction signatures in two dimensions were generated with the interaction signatures of all objective, redesign, top 100, and same indication compounds (Figure 3, results Section 2.3).

We also computed Tanimoto coefficients for all redesigns relative to their respective objective compounds to determine structural diversity. Finally, we utilized SCScore [56], a machine-learning platform to predict the synthetic complexity of our redesigns in relation to the corresponding objectives (Figure 4, Results Section 2.5).

#### *3.4. Generating Novel Designs for Prospective Validation*

Design objectives for benchmarking were selected from a diverse set of indications and approved/experimental statuses to ensure broad coverage of the proteomic interaction signature space and to mitigate potential bias in the results. Regardless, the benchmark

set was repeatedly used to parameterize the pipeline described here, which has the potential to lead to overtraining. To address this and also to apply our design pipelines to relevant real-world problems of sufficient complexity where the proteomic approach would be relevant, we selected 13 (7 + 6) objective compounds that have shown promise for aging/developmental intervention [114–118] and NSCLC [118–120] to redesign for prospective validation (Figure 5, Results Section 2.6).

#### **4. Conclusions**

We utilized the RCVAE pipeline within the CANDO platform to take advantage of multiscale compound–proteome interaction modeling and develop an attractive approach to holistic drug design. We compared the predicted behaviors of the designed compounds to those of known drugs/compounds and demonstrated that the RCVAE pipeline is capable of generating novel compounds with the desired specificity of binding on a proteomic scale. We additionally demonstrated that compounds designed by our pipeline maintained reasonable predicted synthetic complexities and were structurally diverse. We expect the compounds designed using our pipeline for aging/developmental intervention and NSCLC will serve as novel leads for safe and effective therapeutics following prospective validation. The RCVAE design pipeline generates novel compounds that are synthetically feasible and behaviorally desirable, simultaneously taking efficacy and ADMET into account by examining interactions on a proteomic scale, which is necessary to understand the science of small molecule behavior and apply it to holistic therapeutic discovery.

**Author Contributions:** B.O. conceived of the RCVAE pipeline, research design, approach, and methods, conducted all experiments and analyses, and drafted the manuscript. Z.F. helped with the research design, approach, and methods and editing and proofing of the manuscript. W.M. helped with the research design, approach, and methods and editing and proofing of the manuscript. R.S. conceived of the research design, approach, and methods, supervised the overall project, and edited the manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported in part by National Institutes of Health Director's Pioneer Award (DP1OD006779), National Institutes of Health Clinical and Translational Sciences Award (UL1TR001412), NIH T15 Award (T15LM012495), NCATS ASPIRE Design Challenge Award, NCATS ASPIRE Reduction-to- Practice Award, and startup funds from the Department of Biomedical Informatics at the University at Buffalo.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** All relevant data are presented/illustrated in this manuscript.

**Acknowledgments:** The authors would like to acknowledge the support provided by the Center for Computational Research at the University at Buffalo. In addition to all members of the Samudrala Computational Biology Group, we give special thanks to Liana Bruggemann for providing us with the list of NSCLC objectives and Mira Moukheiber for assistance with proofreading.

**Conflicts of Interest:** The funders had no role in the design of the study; in the collection, analyses, or interpretation of the data; in the writing of the manuscript; nor in the decision to publish the results. The authors have formed multiple startups that seek to commercialize the outputs of the CANDO platform.

#### **References**

