*Article* **Rethinking Protein Drug Design with Highly Accurate Structure Prediction of Anti-CRISPR Proteins**

**Ho-Min Park 1,2 , Yunseol Park 1 , Joris Vankerschaver 1,3 , Arnout Van Messem <sup>4</sup> , Wesley De Neve 1,2 and Hyunjin Shim 1, \***


**Abstract:** Protein therapeutics play an important role in controlling the functions and activities of disease-causing proteins in modern medicine. Despite protein therapeutics having several advantages over traditional small-molecule therapeutics, further development has been hindered by drug complexity and delivery issues. However, recent progress in deep learning-based protein structure prediction approaches, such as AlphaFold2, opens new opportunities to exploit the complexity of these macro-biomolecules for highly specialised design to inhibit, regulate or even manipulate specific disease-causing proteins. Anti-CRISPR proteins are small proteins from bacteriophages that counter-defend against the prokaryotic adaptive immunity of CRISPR-Cas systems. They are unique examples of natural protein therapeutics that have been optimized by the host-parasite evolutionary arms race to inhibit a wide variety of host proteins. Here, we show that these anti-CRISPR proteins display diverse inhibition mechanisms through accurate structural prediction and functional analysis. We find that these phage-derived proteins are extremely distinct in structure, some of which have no homologues in the current protein structure domain. Furthermore, we find a novel family of anti-CRISPR proteins which are structurally similar to the recently discovered mechanism of manipulating host proteins through enzymatic activity, rather than through direct inference. Using highly accurate structure prediction, we present a wide variety of protein-manipulating strategies of anti-CRISPR proteins for future protein drug design.

**Keywords:** in-silico drug design; AlphaFold; anti-CRISPR proteins; prokaryotic defence mechanisms; bacteriophages; structural biology; protein drug

#### **1. Introduction**

Proteins are macromolecules composed of amino-acid residues that perform diverse roles in biological entities, including catalysing biochemical reactions, providing cell/capsid structure, transporting molecules, replicating genetic material, and responding to stimuli. It is estimated there are over 25,000 functionally distinct proteins in the human body [1,2], and mutations or abnormalities in these proteins may result in diseases. Thus, modern medicine has focused on targeting such proteins to alleviate diseases, mostly through small-molecule therapeutic agents acting as competitive or non-competitive inhibitors [3]. However, it is estimated that only ~10% of the human proteome can be targeted with small-molecule drugs [3]. Since the introduction of human insulin as the first recombinant protein therapeutic in the 1980s [4,5], protein-based therapeutics have expanded the scope of "druggable proteins". Compared to small-molecule drugs, the major advantage of protein therapeutics is improved target specificity and reduced immunogenicity due to their proteinaceous nature [5]. Protein therapeutics can also serve complex functions

**Citation:** Park, H.-M.; Park, Y.; Vankerschaver, J.; Van Messem, A.; De Neve, W.; Shim, H. Rethinking Protein Drug Design with Highly Accurate Structure Prediction of Anti-CRISPR Proteins. *Pharmaceuticals* **2022**, *15*, 310. https://doi.org/10.3390/ph15030310

Academic Editor: Osvaldo Andrade Santos-Filho

Received: 27 January 2022 Accepted: 1 March 2022 Published: 4 March 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

that simple chemical compounds cannot achieve, such as replacing a deficient protein or providing a protein of novel function (Figure 1a). Furthermore, protein therapeutics can inhibit disease-related proteins that small-molecule drugs cannot target due to the lack of a cavity to bind. Currently, there are over 130 protein therapeutics commercially available and intense research efforts are ongoing to better design protein therapeutics [5].

**Figure 1.** Mechanisms of protein therapeutics and anti-CRISPR proteins. (**a**) Mechanism of protein **Figure 1.** Mechanisms of protein therapeutics and anti-CRISPR proteins. (**a**) Mechanism of protein therapeutics. The first group consists of prophylactic or therapeutic vaccines that induce immunity against foreign or cancer cells. A highly successful protein vaccine against human papillomavirus (HPV) combining the capsids from four pathogenic HPV strains is given as an example (PDB: 2R5K [6]). The second group consists of protein therapeutics that provide novel functions, replace deficient or abnormal proteins, or augment existing activities. The approval of recombinant insulin in the 1980s to treat diabetes as the first abundant, inexpensive and low immunogenic therapeutic protein is given as an example (PDB: 4F8F [7]). The third group consists of proteins that interfere with target proteins through high binding specificity. Some monoclonal antibodies use antigen recognition sites or receptor-binding domains like Trastuzumab against breast cancer cells (PDB: 6MH2 [8,9]). Some fusion proteins inhibit target proteins by blocking interaction sites like Abatacept against rheumatoid arthritis (PDB: 1DQT [8,9]). (**b**) Mechanisms of anti-CRISPR proteins. Upon successful infection, phage genomes express anti-CRISPR proteins that neutralize host CRISPR-Cas immunity. Anti-CRISPR proteins target various types of both Class I and Class II CRISPR-Cas systems and the inhibitory mechanisms are highly diverse, including direct interference of DNA/RNA binding and cleavage of Cas complexes, enzymatic inhibition of the active site by acetylation, and nuclease activity of degrading the signalling molecule (cyclic nucleotide cA<sup>4</sup> ).

10 31 In this study, we present a group of naturally-occurring protein therapeutics, called anti-CRISPR (Acr) proteins, as a good example of how small proteins are used by invading bacteriophages (phages) in nature to control host proteins. Phages are the most abundant and diverse biological entities in the biosphere (estimated 10 31 existing phages) that infect and replicate within host prokaryotes (such as bacteria or archaea) [10]. High selective pressures between these parasites and hosts drive dynamic coevolution of genomic and proteomic mechanisms and systems [11–13]. In particular, the evolutionary arms race

between phages and prokaryotes has resulted in a vast arsenal of immune systems, including the prokaryotic adaptive immune system known as CRISPR-Cas [14,15]. CRISPR-Cas systems are defence mechanisms against phages (and other mobilomes) through a complex of RNA-guided Cas proteins (Figure 1b). Remarkably, prokaryotic genomes with CRISPR-Cas systems can acquire short fragments of foreign genetic sequences in their CRISPR arrays, which serve as RNA templates to recognize and cleave invading phages through the nuclease complex of Cas proteins [16]. Since the successful application of CRISPR-Cas systems as genome-editing tools [17,18], there has been a burst in the discovery of diverse CRISPR-Cas systems [19], followed by the discovery of Acr proteins that neutralize the activity of this prokaryotic adaptive immune system [20] (Figure 1b). A family of Acr proteins was first identified in the CRISPR-Cas-inactivating prophages of *Pseudomonas* genomes that disable Type I-F and Type I-E CRISPR-Cas systems [20,21]. A number of Acr proteins inhibiting type II CRISPR-Cas systems have since been applied as regulators of gene-editing activities [22]. The Acr protein families are known to have short sequences (<100 amino acids) with no common genetic features, and interact directly with Cas proteins to inhibit target DNA binding, DNA cleavage, CRISPR RNA loading and protein-complex formation [22,23]. A recent study reveals that AcrVA5 proteins inactivate Cas12a of Type V CRISPR-Cas systems enzymatically by acetylation of the active site, with structural similarity to an acetyltransferase protein [24].

In this study, we conducted a comprehensive analysis on the key characteristics of Acr proteins viewed from the perspective of naturally-occurring protein therapeutics that effectively inhibit host protein functions. Motivated by the observation that these Acr proteins are genetically diverse, we examined the protein structure of these diverse proteins using AlphaFold2 [25]. AlphaFold is a state-of-the-art deep learning-based approach that performs protein structure prediction, which takes a protein sequence as an input to predict its 3-D protein structure through an iterative exchange of information between its genetic representation and its structural representation. The recent release of AlphaFold2, the winner of CASP14, which achieves highly accurate protein structure predictions [25,26], is revolutionary for the field of life sciences and medicine, and is expected to accelerate critical research in a large number of fields ranging from structural biology to drug discovery. In this study, we first assessed the performance of AlphaFold2 in predicting the 3-D structures of Acr proteins based on similarity metrics against their experimentally reconstructed 3-D macromolecular structures. Using this performance as a basis, we further examined the Acr proteins without experimental structures with AlphaFold2, to predict the structural diversity of these genetically distinct proteins that are natural inhibitory proteins against prokaryotic CRISPR-Cas systems. We used AlphaFold-predicted structures of Acr proteins to infer a range of inhibition mechanisms through homology search and functional analysis, to demonstrate how bacteriophages exploit diverse strategies to manipulate host immune systems, with the long-term goal of providing a unique opportunity to learn from the evolution-optimized inhibitor proteins for future protein drug design.

#### **2. Results**

#### *2.1. AlphaFold2 Prediction of Anti-CRISPR Protein Structures*

The Acr protein datasets were acquired from various viral and prokaryotic genomes, including *Pseudomonas phage*, *Pseudomonas aeruginosa* and *Escherichia coli* [27], which were categorised into three sets: verified Acr proteins with experimental structure (Set A), verified Acr proteins without experimental structure (Set B), and putative Acr proteins with experimental structure (Set C) (Supplementary Table S1). From the AlphaFold-predicted structures of each set (Figure 2a), we compared the prediction performance between CASP14 (52 AlphaFold2 evaluation results from the CASP14 competition), Set A and Set C (Figure 2b,c), based on TM-scores, relative Z-errors (see Section 4 for details) and root mean square deviation (RMSD) (Supplementary Figure S1) against the true experimental structures (Supplementary Table S2), where Set B without experimental structures was excluded. According to the TM-score, CASP14 had a higher median than Set A and Set C

(0.925 vs. 0.895; 0.843, respectively), but Set A had the highest mean as compared to Set C and CASP14 (0.896 vs. 0.868; 0.882, respectively). Furthermore, Set A and Set C had significantly smaller standard deviations than CASP14 (0.095; 0.092, respectively, vs. 0.120), indicating that, according to the TM-score, their predictions are more accurate. According to the relative Z-error, CASP14 recorded a lower median than Set A and Set C (0.201 vs. 0.217; 0.230, respectively), but Set A recorded the lowest mean as compared to CASP14 and Set C (0.211 vs. 0.237; 0.259, respectively). Like the TM-score, Set A and Set C recorded smaller standard deviations than CASP14 (0.128; 0.129 vs. 0.141, respectively). For RMSD, Set A and Set C have more outliers and higher mean values than CASP 14 (2.206; 2.271 vs. 1.106, respectively). As can be seen in the boxplots (Figure 2b,c and Figure S1), there was no significant difference in prediction performance, validating that AlphaFold2 predicts 3-D structures of Acr proteins as accurately as the CASP14 dataset. Previous studies demonstrated that structures of identical proteins obtained from different experimental techniques had RMSD values of around 2.3 Å [28]. Thus, the average result of 2.271 Å in the RMSD distribution in Set C further validates the prediction performance of AlphaFold2 on Acr proteins.

**Figure 2.** Performance analysis of AlphaFold2 on anti-CRISPR proteins in comparison to the CASP14 dataset. (**a**) Overall workflow to analyse the 3-D macromolecular structures of Acr protein sequences predicted with AlphaFold2. (**b**) The performance of AlphaFold2 on the Acr protein datasets in comparison to the CASP14 dataset using TM-scores. The closer the TM-score is to 1, the more similar the predicted structure is to its true experimental structure. (**c**) The performance of AlphaFold2 on the Acr protein datasets in comparison to the CASP14 dataset using relative Z-errors. The closer the relative Z-error is to 0, the more similar the predicted structure is to its true experimental structure. (**d**) The performance of AlphaFold2 on the Acr protein datasets using plDDT. The closer the plDDT is to 100, the higher the confidence level of prediction by AlphaFold2. (Set A: Verified Acr proteins with experimental structures, Set B: Verified Acr proteins without experimental structures, Set C: Putative Acr proteins with experimental structures).

For Set B without experimental structures, we calculated the predicted local distance difference test (plDDT) to check the confidence level of AlphaFold2 (Figure 2d). The median plDDT of Set B was almost identical to that of Set A and Set C (89.4 vs. 89.2; 89.1, respectively), but Set B had a slightly higher standard deviation (13.48 vs. 8.4; 7.84, respectively). However, as all the plDDT values of Set A, Set B and Set C (excluding a few outliers) were above the lower cut-off value of AlphaFold2 given as 70 [29], we concluded that the prediction quality of Set B was not significantly different from that of Set A and Set C.

#### *2.2. Evolutionary Trees of Anti-CRISPR Proteins*

The Acr proteins are known to be genetically diverse; this raises an intriguing question about the origin and evolution of Acr proteins. We reconstructed evolutionary trees of the verified Acr proteins (Set A + Set B; *n* = 207), using sequence-based and structure-based methods (see Section 4 for details). As expected, the phylogenetic tree built using genetic sequences (Figure 3a) shows high levels of variation, while forming consistent clades with the high bootstrap support. Analysis of the clades reveals some degree of clustering by the Acr family at the shallower nodes; however, this clustering is mostly due to the nearidentical protein sequences. For instance, many of the AcrIIA proteins were derived from AcrIIA1 and AcrIIA2, driven by the technological interest to regulate CRISPR-Cas geneediting activities in different cell types [22]. Otherwise, the phylogenetic tree shows absence of clustering by other biological features such as taxonomy and inhibition mechanism.

**Figure 3.** Evolutionary trees of anti-CRISPR proteins (numbered according to Anti-CRISPRdb [27]). (**a**) The phylogenetic tree of anti-CRISPR proteins reconstructed using sequence-based methods (Set A + Set B; *n* = 207). (**b**) The structural tree of anti-CRISPR proteins reconstructed using structure-based methods (Set A + Set B; *n* = 207). The clades of the two evolutionary trees were coloured by the Acr Family. The inner to outer rings display the Acr verification status, structural verification status, inhibition mechanism and source organism taxonomy. The outer magenta bars represent the genetic sequence length of each protein.

Given the low sequence similarity among the Acr proteins, we built a structurebased tree using AlphaFold-predicted structures, which included 98 Acr proteins without experimental structures (Figure 3b). The structural tree showed an even higher level of diversity in the Acr proteins than the phylogenetic tree. In the structural tree, the Acr proteins share no common ancestor and display deep branches, consistent with earlier

observations of how evolutionary pressure drives immunity-related mechanisms of hosts and parasites to coevolve rapidly [12,13]. The structural tree also shows some degrees of clustering by the Acr family, but the clusters do not always coincide between the two evolutionary trees. The visual analysis of the protein structures show that the branches of the structural tree are placed randomly in terms of representative structural forms and the functions are only related at the clade level (Supplementary Figure S2).

It is evident that the sequence-based and structure-based trees capture different evolutionary relationships between the Acr proteins. The 3-D structures of homologous proteins were previously shown to be better conserved than their corresponding genetic sequences, particularly when the sequence similarity was below 30% [30]. From the multiple sequence alignment, no site of the Acr proteins was conserved at 30% and only very few sites were conserved at 15% (Supplementary Figures S3 and S4). We calculated the congruence among distance matrices of the sequence-based and structure-based trees to be very low according to the measure of congruence (Kendall's coefficient of concordance, W = 6.58 × 10−<sup>1</sup> ), confirming the correlation between these two types of evolutionary trees is poor among highly divergent proteins [30].

#### *2.3. Structural Homology to Predicted Anti-CRISPR Structures*

The characterised Acr proteins use diverse strategies to interfere directly with CRISPR-Cas systems, including inhibiting DNA binding, DNA cleavage, guide loading and ribonuclease activity [22]. To investigate the relation between the 3-D protein structures and their inhibitory functions, we identified homologous structures to the AlphaFold-predicted Acr proteins using structure-based distance measures [31] (see Section 4 for details).

First, we used a subset of the Acr dataset with experimental structures (Set A) to successfully validate the closest structural homologue to each AlphaFold-predicted Acr protein matched with its true experimental structure (Supplementary Table S3).

Second, we analysed the Acr proteins without experimental structures (Set B) by matching the AlphaFold-predicted structures to the closest homologues from the Protein Data Bank archive [31] (Supplementary Table S4). We first validated that Acr proteins retrieved their neighbours in the same clade as the closest homologue, for those cases where their neighbouring proteins had experimental structures in the Protein Data Bank. For example, the Acr proteins labelled '0434' and '0435' are in the same clade, and the closest homologue of the Acr protein '0435' matched with the experimental structure of the Acr protein '0434'. Functional analysis of the closest homologue to each Acr protein revealed a wide variety of protein functions, including polymerase, ligase, nuclease, regulation, and transport (Figure 4c). We acknowledge that some of the closest homologues have low structural similarity (Z-score below 4); however, it is intriguing that these Acr proteins have no close structural homologues in the Protein Data Bank. Some Acr proteins from the families AcrIC6, AcrVIA2, AcrIIA19, and AcrIF8 (Figure 4a) have no structural homologues (significance threshold for similarity: Z-score = 2).

Third, we cross-examined the Acr proteins whose inhibitory mechanism was experimentally characterised to verify that the homologues retrieved were functionally related (indicated as the middle ring in Figure 3b). The homologue functions of this subset were related to a wide variety of functional domains (Supplementary Table S4). For instance, the two Acr proteins in the same clade (labelled '3628' and '3642') were characterised to inhibit the DNA-binding of Cas proteins [32] and their structural homologues have functional domains of ligase. A few closest homologues with functions related to acetyltransferase drew particular attention, as a recent biochemical study revealed an unprecedented mechanism of inhibiting CRISPR-Cas systems through enzymatic activity rather than through direct interaction [24]. According to this study, the closest structural homologue to this Acr protein (labelled '3625') was found to be N-Alpha-Acetyltransferase from *Homo sapiens* (4U9W-C) (Figure 4b), despite their low sequence similarity. We found another homologue (1Y9W-A) with a better similarity score to the AlphaFold-predicted structure of this Acr protein, that had the functional annotation of acetyltransferase from *Bacillus cereus* (Table 1).

In addition, we found several uncharacterised Acr proteins in the same clade of the structural tree (between '0430' and '3681') related to Acetylglucosaminidase from various Acr families, including AcrIC, AcrIE, AcrIF, AcrIIA, and AcrVIB. Intriguingly, several proteins have homologues with the functional annotations of nuclease activity, which is reminiscent of the newly-discovered mechanism of nuclease activity against crRNAs and CRISPR-Cas signalling molecules [33,34]. clade of the structural tree (between ' ' and '368 '

– transferase. The homologue to the Acr protein labelled ' ' **Figure 4.** AlphaFold-predicted 3-D structures of outlier anti-CRISPR proteins. (**a**) The AlphaFoldpredicted structures of the Acr proteins without homologues in the Protein Data Bank archive (Dali Z-score < 2). The 3-D protein structures are coloured according to the b-factor spectrum in PyMol, with a per-residue estimate of the AlphaFold2 confidence on a scale from 0–100 (high pLDDT accuracy in blue, low pLDDT accuracy in red). (**b**) Superimposition of the AlphaFold-predicted structures of Acr proteins and their closest structural homologues retrieved from the Protein Data Bank archive. All the closest structural homologues in grey have functional annotations related to acetyltransferase. The homologue to the Acr protein labelled '3625' has a cofactor (acetyl-CoA) bound, revealing the functionally critical site of the enzyme. (**c**) Functional analysis of the closest homologues to the AlphaFold-predicted Acr proteins without experimental structures (Set B). Only the functional annotations above the significance threshold of Dali Z-score (>4) were included.

**Table 1.** Closest homologue to the AlphaFold-predicted structure of Acr proteins with acetyltransferase annotations.


%ID structure: percentage identity in structure. %ID sequence: percentage identity in sequence.

#### *2.4. New Anti-CRISPR Family of Acetylation Inhibition*

Previously described Acr proteins of the Acr family VA5 disable Type V Cas12a by acetylation, which leads to a complete loss of the DNA-cleavage activity [24]. We found that this AcrVA5 protein (labelled '3625') structurally aligned closer to *Bacillus cereus* acetyltransferase than the previous structural homology of *Homo sapiens* acetyltransferase (Figure 4b). We further identified other Acr proteins that are related to this AcrVA5 protein on the evolutionary trees of the Acr proteins (Figure 3b). Notably, there are two Acr proteins in the same clade as the AcrVA5 protein on the structural tree, one of which was lacking experimentally validated structure or function. Analysis of its function using the structural homologues reveals that this Acr protein (labelled '3666') is related to acetyltransferase. Interestingly, this Acr protein belongs to a different Acr family (AcrIB) than the previously identified AcrVA5. Its superimposition with the closest structural homologue reveals a similar structural alignment at the functionally critical site of the acetyltransferase where acetyl-CoA binds (Figure 4b). The sequence identity of the AcrIB protein to its closest homologue was found to be 17.6% (Supplementary Table S4), while the structural identity between these two proteins in 3-D was found to be higher at 21% (Table 1 and Supplementary Figure S5). On the phylogenetic tree, this AcrIB protein was not placed close to the other two Acr proteins of acetyltransferase function (labelled '3625' and '0557') (Figure 3a), demonstrating that these two types of Acr proteins have close structural similarity but not genetic similarity. This finding suggests that for proteins with low sequence similarity, structure-based trees cluster proteins with most similar biochemical functional properties perform better than sequence-based trees [30]. Using the structural tree, we discovered a new family of Acr proteins belonging to AcrIB that was structurally similar to acetyltransferase from a different organism (gram-negative bacteria *Salmonella enterica*), whereas the previously characterised AcrVA5 matched to acetyltransferase from gram-positive bacteria *Bacillus cereus*.

#### **3. Discussion**

We show that the 3-D structures of Acr proteins predicted with AlphaFold2 achieve high accuracy. The structural tree reconstructed from these AlphaFold-predicted structures display more diversity of Acr proteins with no common evolutionary origin as compared to the phylogenetic tree. On the structural tree, the Acr proteins form small clades by their unique structural similarity, which are also related by the inhibition mechanism. The functional annotations of the Acr protein homologues are extremely diverse, relating to a wide range of enzymatic and regulatory activities from different organisms. Most characterised Acr proteins inhibit host CRISPR-Cas systems by direct interference; we show that this category of Acr proteins displays various functional annotations and unique structural forms in the multiple branches of the structural tree.

Specifically, we found a number of Acr proteins with homologue annotations related to acetylation. A recent discovery of Acr proteins that manipulate CRISPR-Cas systems through enzymatic activities demonstrates extensive phage defence mechanisms driven by the intense host-parasite arms race [24,33]. Through the AlphaFold-predicted structural analysis, we found a novel family of Acr proteins (AcrIB) from the genome of a human pathogen (*Leptotrichia buccalis* C-1013-b) that shows more structural similarity to acetyltransferase than the previously characterised AcrVA protein. Intriguingly, other Acr proteins on the multiple branches of the structural tree have homologues related to different types of acetyltransferase enzymes from heterologous species. As Acr proteins with acetyltransferase activities permanently disable Cas proteins by covalent modification [24], other Acr proteins with enzymatic activities such as acetylglucosaminidase are expected to have similar inactivation functions of biochemically modifying CRISPR-Cas systems. The Acr proteins could evolve independently from various host genomes and mobile genetic elements, exploiting a vast inventory of protein structures as the basis for their counter-defence advantage [10,13,35].

More broadly, Acr proteins are exceptional examples of coevolution dynamics optimizing the phage genomes to manipulate host systems and maximize survival. As phages can only replicate within host cells and are void of metabolic capacity to synthesize small molecules, their counter-defence machinery against the sophisticated and extensive prokaryotic anti-phage systems is protein-based. Nonetheless, phages are the most abundant biological entities in the biosphere [36], and their successful protein-based viral arsenals such as Acr proteins provide an important insight on how to expand the potential of protein therapeutics. Specifically, we could get inspiration from these phagederived protein structures that resemble segments of related target proteins, to design highly-specialised protein inhibitors with diverse protein-manipulating strategies [37]. In future, utilizing the ability of small proteins to engage in indirect interference through enzymatic activities is to be explored against disease-causing proteins (such as in cancer) and disease-causing organisms (such as drug-resistant bacteria) [36].

#### **4. Materials and Methods**

#### *4.1. Curation of Anti-CRISPR Datasets*

The anti-CRISPR dataset contained 443 Acr proteins (207 verified, 236 putative) that inhibit a wide range of CRISPR-Cas systems including I, II, III and VI from the Anti-CRISPRdb [27]. The term 'verified' indicates that the protein was validated as CRISPR-Cas-inactivating Acr (either by the database or other published papers), while 'putative' indicates the protein was predicted to be Acr without sufficient experimental support. The anti-CRISPR dataset was further curated into Set A, Set B, and Set C for AlphaFold2, according to the availability of experimentally reconstructed 3-D macromolecular structures (hereby referred to as "experimental structures") (Supplementary Table S1). Each protein was annotated with the Acr family, type of inhibited CRISPR-Cas systems, NCBI accession, genetic sequence, source organism, taxonomy, and inhibitory mechanism when available (Supplementary Tables S3 and S4).

#### *4.2. Prediction of Anti-CRISPR Protein Structure with AlphaFold2*

We predicted the 3-D protein structures of each set with AlphaFold2, using the Acr protein sequence as the input to AlphaFold2 (Figure 2a). AlphaFold2 creates genetic and structural representations by comparing the protein sequence with several pre-installed databases. Those representations are used as input to five prediction models to generate five candidate 3-D structures. The result with the highest per-residue confidence score (pLDDT: per residue estimate of confidence on a scale from 0 to 100 [29]) among the five results was determined as the final structure and saved in a protein data bank (PDB) format (Supplementary Figure S6). For the Acr datasets, we used the PDB archive until 31 December 2012 as templates in AlphaFold2 to exclude the true experimental structures of the Acr proteins. The details of the experiments related to the hardware specification and to the processor performance are given in the Supplementary Material (Supplementary Table S5 and Supplementary Table S6, respectively).

#### *4.3. Comparison of AlphaFold2 Performance on Anti-CRISPR against CASP14*

To validate the performance of AlphaFold2 for predicting Acr structures, we benchmarked the CASP14 dataset against Set A and Set C of Acr proteins with the corresponding true experimental structures available. We excluded predicted structure and experimental structure pairs for which the TM-score and/or the Z-score were too low. Finally, 52 pairs of CASP14, 99 pairs of Set A, and 207 pairs of Set C were used for the comparison study. We used the TM-score [38,39] and Dali Z-score [40] as similarity measures between the predicted and the experimental structures. Unlike traditional metrics (e.g., root-mean-square deviation), the TM-score is length-independent and more sensitive to the global similarity than to the local variation. The Dali Z-score is the sum of the equivalent residue-wise

intermolecular distances among two proteins, and does not have a fixed upper bound [40]. We then used the following relative Z-error to calculate the relative difference:

$$Z\_{error} = \frac{Z\_{\rm gt} - Z\_{\rm pd}}{Z\_{\rm gt}} \tag{1}$$

where *Zgt* is the self Dali Z-score between experimental structure and itself, and *Zpd* is the Dali Z-score between experimental structure and predicted structure. We obtained the *Zpd* and TM-score for the CASP14 set of AlphaFold2 from the CASP14 assessment scores [41], whereas *Zgt* of CASP14 and *Zgt* and *Zpd* of Set A and Set C were calculated using DaliLite.v5 [40]. Finally, a protein structure comparison and clustering tool called MaxCluster [42] was used to calculate the TM-scores of Set A and Set C. Both distance metrics have values between 0 and 1, with 1 as the best score for TM-scores and 0 as the best score for relative Z-errors.

#### *4.4. Reconstruction of Evolutionary Trees of Anti-CRISPR Proteins*

We reconstructed the evolutionary trees of the anti-CRISPR dataset (Set A and Set B) using sequence-based and structure-based inference. Set C was excluded from the evolutionary analysis due to the absence of functional verification and due to sequence variation (Supplementary Figure S3). The sequence-based tree of the Acr proteins was built by aligning the amino acid sequences using a multiple alignment program, MAFFT (version 7.471, -auto option) [43]. The multiple sequence alignment of the Acr proteins was then visualized using Jalview (version 2.11.1.3) with a conservation visibility of 15% (Supplementary Figure S2) [44]. Subsequently, a phylogenetic tree of the Acr proteins was built with IQ-Tree using ModelFinder (-auto option) to find the best-fit model among the supported range of protein substitution models [45,46] (Supplementary Table S7). Using the best-fit substitution model, 1000 ultrafast bootstrap replicates were run to check bootstrap support of the reconstructed tree topology [47].

The structure-based tree of the Acr proteins was built by calculating the similarity matrix between the Dali Z-scores of the AlphaFold-predicted structures and its corresponding experimental structures. We used the Dali server [40] for generating structural trees from hierarchical clustering of the similarity matrix. The structural tree of the Acr proteins was generated from distance matrices, where the pseudo-distance between two structures Q and T was defined as [48]:

$$D\_{\rm QT} = Z\_{\rm QQ} + Z\_{\rm TT} - 2Z\_{\rm QT} \tag{2}$$

The hierarchical clustering of the similarity matrix was outputted as a Newick formatted dendrogram. The phylogenetic tree and the structural tree of the Acr proteins were visualized with iTOL (version 4) and iTOL annotation editor [49,50] with the following labels: Acr Family, Taxonomy, and Inhibition Mechanism.

#### *4.5. Congruence among Distance Matrices of Sequence-Based and Structure-Based Trees*

We measured the congruence among distance matrices of the reconstructed trees from the sequence-based and structure-based methods using Kendall's coefficient of concordance, W, which ranges from 0 (no congruence) to 1 (complete congruence) [51]. First, we computed the cophenetic value of pairwise distances between the terminals from a phylogenetic tree using its branch lengths with the function cophenetic.phylo from ape-package (version 5.0) [52]. Then, we used the function CADM.global to calculate the coefficient of concordance among the distance matrices of the sequence-based and structure-based trees of the Acr proteins through a permutation test.

#### *4.6. Visualization of Protein Structure Superimposition*

For functional analysis, the AlphaFold-predicted structures with functional annotations of interest were superimposed with their structural homologues using PyMol (version 2.5.2) to visualize the overlap in structure of the functionally active sites. The

inhibitory mechanism of the Acr proteins without experimental structure was inferred through examining functional annotations of the structural homologues to the AlphaFoldpredicted structure, with the significance threshold of Z-score > 4.

#### *4.7. Code Availability*

Protein structures were predicted with AlphaFold2, available under an open-source license at https://github.com/deepmind/alphafold, accessed on 27 September 2021. For protein structure similarity metrics, we used MaxCluster (http://www.sbg.bio.ic.ac.uk/ ~maxcluster/index.html, accessed on 13 October 2021) for TM-score and DaliLite.v5 (http://ekhidna2.biocenter.helsinki.fi/dali/README.v5.html, accessed on 24 October 2021) for the Dali Z-score. For MSA, we used MAFFT.v7 (https://mafft.cbrc.jp/alignment/ server, accessed on 29 October 2021) and Jalview.v2 (https://www.jalview.org, accessed on 29 October 2021) for visualization. For phylogenetic tree reconstruction, we used IQ-Tree (http://www.iqtree.org, accessed on 14 November 2021) with ModelFinder and UFBoot options. For structural tree reconstruction, we used the Dali server (http://ekhidna2 .biocenter.helsinki.fi/dali, accessed on 16 November 2021) for building dendrograms. The 3-D Structure visualizations were created in Pymol v.2.5.2 (https://pymol.org, accessed on 4 November 2021) and Py3DMol v.1.7.0 (https://pypi.org/project/py3Dmol, accessed on 26 October 2021) with Jupyter v.1.0.0 (https://jupyter.org, accessed on 26 October 2021). For data analysis, Python v.3.6.4 (https://www.python.org, accessed on 27 November 2021), NumPy v.1.17.5 (https://github.com/numpy/numpy, accessed on 27 November 2021), SciPy v.1.1.0 (https://www.scipy.org, accessed on 27 November 2021), seaborn v.0.9.0 (https://github.com/mwaskom/seaborn, accessed on 25 November 2021), Matplotlib v.3.3.4 (https://github.com/matplotlib/matplotlib, accessed on 24 November 2021), pandas v.0.22.0 (https://github.com/pandas-dev/pandas, accessed on 24 November 2021) were used.

#### **5. Conclusions**

The high biodegradability issue of protein therapeutics has partially been solved by the recent success of mRNA vaccine delivery using lipid nanoparticles [53], making low risk protein therapeutics ever more attractive to the industry. From the AlphaFold-predicted structures, we accelerated the structural and functional analysis of the Acr proteins whose experimental 3-D structures remain to be resolved. In conclusion, we wonder whether there is a vast repertoire of unexplored protein structural configurations that can be exploited for protein drug design, given the number of Acr proteins without homologues in the current protein structure domain.

**Supplementary Materials:** The following supporting information can be downloaded at: https:// www.mdpi.com/article/10.3390/ph15030310/s1. Figure S1. The performance of AlphaFold2 on the Acr protein datasets in comparison to the CASP14 dataset using RMSD; Figure S2. The structural tree of anti-CRISPR proteins reconstructed using structure-based methods; Figure S3. Multiple sequence alignment of anti-CRISPR proteins using MAFFT, visualized with Jalview; Figure S4. The structural tree of anti-CRISPR proteins reconstructed using structure-based methods; Figure S5. A scatter plot of percentage identity in structure against percentage identity in sequence of the closest homologue to each protein; Figure S6. Outlier AlphaFold structures with bad scores; Table S1. Acr protein types in Anti-CRISPRdb; Table S2. Detailed statistics and scores on AlphaFold prediction performance; Table S3. Closest homologue (highest TM score) to the AlphaFold-predicted structure of Acr proteins with experimentally reconstructed 3-D macromolecular structures (Set A) from the Protein Data Bank archive; Table S4. Closest homologue (highest DALI Z-score) to the AlphaFold-predicted structure of Acr proteins without experimentally reconstructed 3-D macromolecular structures (Set B) from the Protein Data Bank archive; Table S5. Hardware specification for AlphaFold experiments; Table S6. Time to create a 3-D structure from one protein sequence (in seconds); Table S7. The best-fit-model of the phylogenetic tree of each Acr set using IQ-TREE ModelFinder.

**Author Contributions:** Conceptualization, H.S.; methodology, H.-M.P. and H.S.; software, H.-M.P. and H.S.; validation, H.-M.P., Y.P. and H.S.; formal analysis, H.-M.P., Y.P. and H.S.; investigation, H.-M.P., Y.P. and H.S.; resources, J.V. and W.D.N.; data curation, H.-M.P. and H.S.; writing, H.-M.P., Y.P., J.V., A.V.M., W.D.N. and H.S.; visualization, H.-M.P. and H.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by Ghent University Global Campus grant number [RC4].

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Publicly available datasets were analyzed in this study. All the input protein sequences analysed in this study are available in Anti-CRISPRdb at https://doi.org/10.1 093/nar/gkx835 (accessed on 13 November 2021). As well as our project GitHub page which can be found here: https://github.com/powersimmani/ACR\_alphafold, accessed on 28 November 2021. All the 3-D structures used as ground truth for calculating Dali Z-scores and TM-scores; and superimposing structures in this study are available in Protein Data Bank at https://www.rcsb.org/ accessed on 13 November 2021. CASP (Critical Assessment of Structure Prediction) competition datasets were used for measuring AlphaFold performance. This data can be found here: https: //predictioncenter.org/casp14/, accessed on 29 September 2021.

**Acknowledgments:** We thank members of the Center for Biosystems and Biotech Data Science at GUGC for helpful discussions.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Article* **Exploring the Prominent and Concealed Inhibitory Features for Cytoplasmic Isoforms of Hsp90 Using QSAR Analysis**

**Magdi E. A. Zaki 1, \*, Sami A. Al-Hussain 1 , Syed Nasir Abbas Bukhari 2 , Vijay H. Masand 3, \*, Mithilesh M. Rathore 3 , Sumer D. Thakur <sup>4</sup> and Vaishali M. Patil 5**


**Abstract:** Cancer is a major life-threatening disease with a high mortality rate in many countries. Even though different therapies and options are available, patients generally prefer chemotherapy. However, serious side effects of anti-cancer drugs compel us to search for a safer drug. To achieve this target, Hsp90 (heat shock protein 90), which is responsible for stabilization of many oncoproteins in cancer cells, is a promising target for developing an anti-cancer drug. The QSAR (Quantitative Structure–Activity Relationship) could be useful to identify crucial pharmacophoric features to develop a Hsp90 inhibitor. Therefore, in the present work, a larger dataset encompassing 1141 diverse compounds was used to develop a multi-linear QSAR model with a balance of acceptable predictive ability (Predictive QSAR) and mechanistic interpretation (Mechanistic QSAR). The new developed six-parameter model satisfies the recommended values for a good number of validation parameters such as R2tr = 0.78, Q2LMO = 0.77, R2ex = 0.78, and CCCex = 0.88. The present analysis reveals that the Hsp90 inhibitory activity is correlated with different types of nitrogen atoms and other hidden structural features such as the presence of hydrophobic ring/aromatic carbon atoms within a specific distance from the center of mass of the molecule, etc. Thus, the model successfully identified a variety of reported as well as novel pharmacophoric features. The results of QSAR analysis are further vindicated by reported crystal structures of compounds with Hsp90.

**Keywords:** Hsp90; cancer; QSAR; machine learning; pharmacophores

#### **1. Introduction**

Cancer kills; therefore, medicinal chemists are continuously trying to develop therapeutic agents that could retard the growth of cancer cells. In cancer cells, a protein Hsp90 (heat shock protein 90, also known as HSPC) is overexpressed [1]. It is a highly conserved, non-fibrous, and chaperone protein with a key role in many cellular processes like proper folding of other proteins, apoptosis, cell cycle control, cell viability, and degradation and signaling events [1–6]. As the name indicates, Hsp (heat shock proteins) shield cells when stressed by higher temperatures. The number "90" comes from the fact that it weighs about 90 kDa. There are two isoforms of Hsp90: Hsp90α (the inducible form) and Hsp90β (the constitutive form), which are found in cytoplasm and share 85% sequence identity [1–6]. These two isoforms are like flexible biological catalysts and interact with a good number of newly synthesized proteins, such as Akt2, CDKs, PKC, MAP kinases, steroid receptors,

**Citation:** Zaki, M.E.A.; Al-Hussain, S.A.; Bukhari, S.N.A.; Masand, V.H.; Rathore, M.M.; Thakur, S.D.; Patil, V.M. Exploring the Prominent and Concealed Inhibitory Features for Cytoplasmic Isoforms of Hsp90 Using QSAR Analysis. *Pharmaceuticals* **2022**, *15*, 303. https://doi.org/10.3390/ ph15030303

Academic Editor: Osvaldo Andrade Santos-Filho

Received: 21 January 2022 Accepted: 23 February 2022 Published: 1 March 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

BCL-6, CAR, p53, Oct4, etc., to avoid their aggregation or mistakes in their folding [6]. Despite a crucial role, in cancer cells, these are responsible for the stabilization of a number of oncoproteins required for tumor growth, leading to their overexpression [1–6]. Consequently, Hsp90 is an attractive target for developing a drug for cancer.

The majority of Hsp90 inhibitors occupy the ATP (adenosine tri-phosphate) pocket in the N-terminal domain of Hsp90, leading to limited ATPase activity [1–6]. At present, several natural and semi-synthetic Hsp90 inhibitors (see Figure 1) are in different stages of clinical trials for a variety of cancers [2,3,7–9].

**Figure 1.** Different clinical trial candidates as inhibitors of Hsp90.

Unfortunately, several inhibitors have shown hepatotoxicity and ocular toxicity [2,10]; consequently, there is a need to modify them with retention of activity against Hsp90, which could be achieved on knowing the structural features responsible for their Hsp90 inhibitory activity. A simple, cost-effective, and faster yet effective strategy to know crucial pharmacophoric features is to use QSAR (Quantitative Structure–Activity Relationships), a successful, contemporary, and widely used branch of computer assisted-drug designing [11–16].

In QSAR analysis, generally, a good number of inhibitors are analyzed using a suitable technique like machine learning, deep learning, etc. There are two main advantages of using the QSAR approach [11,17,18]: (a) the analysis helps to identify the prominent structural features or patterns that influence the bioactivity profile of molecules (Mechanistic interpretation or Qualitative QSAR), and (b) the analysis could be used to predict the desired bioactivity of a molecule prior to its synthesis and lab testing (Predictive ability or Predictive QSAR). Therefore, many researchers prefer QSAR as a method of choice for drug/lead optimization. Nowadays, a QSAR analysis with a balance of mechanistic interpretation with predictive ability is highly preferred.

The literature survey reveals that QSAR analyses have been reported for Hsp90, but they are either based on a small dataset, lack general applicability, have poor predictive ability, are deficient of a mechanistic interpretation, or a combination of these factors, which limit their use [9,19–22]. Therefore, in the present work, we accomplished QSAR analysis for a larger and diverse dataset of Hsp90 inhibitors, and followed the OECD

(Organization for Economic Cooperation and Development) guidelines while developing a QSAR model to have a balance of mechanistic interpretation with predictive ability.

#### **2. Results**

The exhaustive and heuristic search resulted in the development of a six-descriptorbased QSAR model (see model-A), which was subjected to thorough statistical validation for internal and external validations.

**Model-A:** pIC<sup>50</sup> (M) = 3.903 (± 0.134) + 0.101 (± 0.013) × **com\_ringChyd\_4A** + 0.433 (± 0.058) × **faroCN2B** + 0.714 (± 0.214) × **aroCminus\_sumpc** + 0.065 (± 0.005) × **aroC\_aroN\_5B** + 0.266 (± 0.048) × **fringNsp3C5B** + 0.59 (± 0.082) × **da\_amdN\_6B**

Statistical validation of model-A:

Ntr = 915, Next = 226, R<sup>2</sup> tr = 0.779, R<sup>2</sup> adj. = 0.777, R<sup>2</sup> tr − R 2 adj. = 0.002, LOF = 0.244, Kxx = 0.219, ∆K = 0.122, RMSEtr = 0.487, MAEtr = 0.404, RSStr = 217.321, CCCtr = 0.876, s = 0.489, F = 533.134, R<sup>2</sup> cv (Q<sup>2</sup> loo) = 0.775, R<sup>2</sup> -R<sup>2</sup> cv = 0.004, RMSEcv = 0.491, MAEcv = 0.407, PRESScv = 220.839, CCCcv = 0.874, Q<sup>2</sup> LMO = 0.775, R<sup>2</sup> Yscr = 0.007, Q<sup>2</sup> Yscr = −0.009, RMSEex = 0.474, MAEex = 0.383, PRESSext = 50.675, R<sup>2</sup> ex = 0.779, Q<sup>2</sup> -F<sup>1</sup> = 0.778, Q2 -F<sup>2</sup> = 0.778, Q<sup>2</sup> -F<sup>3</sup> = 0.791, CCCex = 0.876, R<sup>2</sup> -ExPy = 0.779, R'<sup>o</sup> <sup>2</sup> = 0.727, k' = 0.989, 1−(R2/R'<sup>o</sup> 2 ) = 0.066, r'2m = 0.602, R<sup>o</sup> <sup>2</sup> = 0.779, k = 1.005, 1 − (R<sup>2</sup> − ExPy/R<sup>o</sup> 2 ) = 0, r <sup>2</sup>m = 0.766

Different researchers have recommended the above statistical parameters to judge the robustness and external predictive ability of a QSAR model [11–16,23–31]. The formula to calculate them is available in the Supplementary Materials. It is clear that model-A fulfils the recommended threshold for many validation parameters and other criteria. A high value of different parameters like R<sup>2</sup> tr (coefficient of determination), R<sup>2</sup> adj. (adjusted coefficient of determination), and R<sup>2</sup> cv (Q<sup>2</sup> loo, cross-validated coefficient of determination for leaveone-out), R<sup>2</sup> ex (external coefficient of determination), Q2−F n **,** and CCCex (Concordance Correlation Coefficient), etc., and a low value of LOF (lack-of-fit), RMSEtr (root mean square error), MAEtr (mean absolute error), R<sup>2</sup> Yscr (R<sup>2</sup> for Y-scrambling), etc. along with the different graphs (see Figure 2) associated with the model indicate that the model possesses statistical robustness with excellent internal and external predictive ability as well as free from chance correlations. Additionally, the Williams plot specifies that the model is statistically acceptable (see Figure 2d). Therefore, it fulfils all the OECD recommended guidelines for creating a useful QSAR model.

**Figure 2.** *Cont.*

**Figure 2.** Different graphs associated with model-A: (**a**) experimental vs. predicted pIC<sup>50</sup> (the solid line represents the regression line), (**b**) experimental vs. residuals, (**c**) Williams plot for applicability domain (the vertical solid line represents h\* = 0.023 and horizontal dashed lines represent the upper and lower boundaries for applicability domain), and (**d**) Y-randomization.

#### **3. Discussion**

#### *Mechanistic Interpretation of QSAR Model*

A very crucial aspect of a useful QSAR analysis is to gain deep insight into the pharmacophore or structure-oriented linking of molecular descriptors [17,32]. This not only helps throughout the drug discovery process, but also expands the information and understanding of mechanistic aspects of different types of molecules. Though, in the present work, a specific molecular descriptor was used to equate the pIC<sup>50</sup> values of different molecules, but an extending or reverse influence of unknown factors or other molecular descriptors, having a dominant effect in deciding the final pIC<sup>50</sup> value of a molecule, cannot be ignored. To simplify, a single molecular descriptor (in turn structure feature) cannot decide the overall experimental pIC<sup>50</sup> value of a molecule. In other words, the effective use of an appropriately validated QSAR model depends on the synchronous consideration of all constituent molecular descriptors. Interestingly, in model-A, all the molecular descriptors have positive coefficients, which indicates that increasing their value could result in a better Hsp90 inhibitory activity.

The descriptor **com\_ringChyd\_4A** represents the total number of hydrophobic ring carbons, having partial charge in the range ±0.2, within 4Å from the com (center of mass) of the molecule. From this, it appears that mere total number of ring carbons is very important, but replacing **com\_ringChyd\_4A** with **nringC** (number of ring carbon atoms) or **naroC** (number of aromatic carbon atoms) significantly reduced the statistical performance of the model (R<sup>2</sup> = 0.72). To add further, **com\_ringChyd\_4A** has a positive correlation of R = 0.488 with pIC50, whereas **nringC** and **naroC** have a correlation of R = 0.461 and 0.405, respectively. **com\_ringChyd\_3A** and **com\_ringChyd\_5A** represent the total number of ring carbons, having partial charge in the range ±0.2, within 3Å and 5Å from the com (center of mass) of the molecule, respectively. Replacement of **com\_ringChyd\_4A** with **com\_ringChyd\_3A** or **com\_ringChyd\_5A** resulted in slightly reduced performance of the model with R<sup>2</sup> = 0.75 and 0.76, respectively. This indicates that the optimum distance is 4Å.

The importance of hydrophobic ring carbon atoms is supported by the X-ray-resolved structure of a good number of Hsp90 inhibitors because the active site of Hsp90 consists of lipophilic side chains of Leu48, Ile91, Val186, Leu315, Ile388, and Val391 [33,34], which favors the presence of hydrophobic moiety in the inhibitors. For example, a comparison of molecule 988 (pIC<sup>50</sup> = 6.009, **com\_ringChyd\_4A** = 10) with 1007 (pIC<sup>50</sup> = 6.481, **com\_ringChyd\_4A** = 15) highlights the importance of **com\_ringChyd\_4A**. Another pair of molecules, viz. 794 and 814, also supports this observation. The molecular descriptor **com\_ringChyd\_4A** is depicted in Figure 3 for different molecules.

**Figure 3.** Depiction of **com\_ringChyd\_4A** using different molecules: (**a**) molecules 988, 1007 (MMFF94 optimized), and 1007 (X-ray resolved dock pose from pdb 6EY8); (**b**) molecules 794 and 814 (both X-ray-resolved poses from pdb 5XR9 and 4LWE, respectively). The small black sphere represents the com (center of mass) and the bigger transparent sphere represents the distance of 4Å from the center of mass. The dotted yellow line represents the distance (Å) of com from the centers of the different nearest rings.

From Figure 3, it is clear that the lowest energy conformer of molecule 988 has **com\_ringChyd\_4A** = 10 due to the closer presence of com (distance 1.206 Å) to the benzene ring of indazole ring. In case of molecule 1007 (MMFF94-optimized and X-ray-resolved pose from pdb 6EY8), the com is located slightly away from the benzene ring of Indazole ring at a distance > 2.40 due to specific conformation, thereby increasing the value of **com\_ringChyd\_4A** to 15. This could be a plausible reason for the difference in the bioactivity of these two compounds. Similarly, a better Hsp90 inhibitory activity of molecule 794 than 814 could be attributed to difference in their **com\_ringChyd\_4A** values.

Another molecular descriptor that has a positive effect on Hsp90 activity is **faroCN2B**, which signifies the presence of nitrogen exactly at two bonds from aromatic carbon atoms. If the same nitrogen atom is also present at two or less bonds from any other aromatic carbon atom, then it was excluded while calculating **faroCN2B**. This descriptor highlights the

importance of nitrogen atoms separated from aromatic ring (Benzene, etc.) by two bonds. As the majority of nitrogen atoms act as either an H-bond donor or acceptor; therefore, the presence of nitrogen atoms in the vicinity of aromatic rings could be useful in enhancing interactions with the polar residues of receptor (Hsp90). Additionally, the descriptor further points out the crucial role played by the aromatic rings undoubtedly due to their lipophilic nature. Taken together, the descriptor **faroCN2B** signifies the importance of two important structural features: aromatic rings and their vicinal nitrogen atoms.

This observation is confirmed when we compare the X-ray-resolved structures of molecule 727 (pIC<sup>50</sup> = 6.654, **faroCN2B** = 1, pdb = 4O09) with 725 (pIC<sup>50</sup> = 7.137, **faroCN2B** = 2, pdb = 4O05) depicted in Figure 4. The nitrogen atoms responsible for **faroCN2B** are highlighted by blue dotted circles. From Figure 4, it is clear that the aromatic ring B of both the molecules is responsible for hydrophobic interactions with the residue Met98. The nitrogen atom of ring A present in both the molecules is not only a constituent of **faroCN2B,** but also responsible for H-bonding with the residue Asp93. Thus, such a combination of aromatic carbons and nitrogen is highly beneficial to enhance the interactions with the receptor. In case of molecule 733, an additional nitrogen atom is present in ring F, which is a constituent of **faroCN2B**, and responsible for the H-bond interaction with the nearby water molecule. Thus, the present QSAR analysis revealed an important structural feature, which is also visible in X-ray-resolved structures of the same inhibitors with the same target enzyme Hsp90.

**Figure 4.** Depiction of **faroCN2B** using representative examples only.

A comparison of the following pairs of molecules further vindicates the importance of **faroCN2B** in determining the bioactivity: 213 (pIC<sup>50</sup> = 6.523, **faroCN2B** = 2) with 212 (pIC<sup>50</sup> = 6.469, **faroCN2B** = 1) and 758 (pIC<sup>50</sup> = 7.444, **faroCN2B** = 2) with 759 (pIC<sup>50</sup> = 7.569, **faroCN2B** = 3).

The importance of aromatic carbon atoms is further emphasized with the presence of **aroCminus\_sumpc** as a constituent variable of model-A. The molecular descriptor **aroCminus\_sumpc** represents the sum of partial charges on negatively charged aromatic carbon atoms. The positive coefficient for **aroCminus\_sumpc** indicates that the higher the value of this descriptor, the better the activity profile. The sum of partial charges on negatively charged aromatic carbon atoms will always be negative; therefore, in reality, this descriptor actually decreases the pIC<sup>50</sup> value. Further, the replacement of **aroCminus\_sumpc** by **aroCplus\_sumpc** (sum of partial charges on positively charged aromatic carbon atoms) led to a model with almost identical statistical performance (R<sup>2</sup> tr = 0.772, Q<sup>2</sup> LMO = 0.767, R<sup>2</sup> ex = 0.78, CCCex = 0.876). In fact, **aroCplus\_sumpc** has a better correlation (R = 0.33) with pIC<sup>50</sup> than **aroCminus\_sumpc** (R = 0.10). From this it is clear that, if aromatic carbons are positively charged than the molecule possesses better Hsp90 inhibitory activity. Therefore, the best strategy is to attach atoms or groups that enhance lipophilic and mild polar interactions with the receptor (for example -Cl, etc.) to the aromatic carbon atoms. In short, substituted aromatic rings are preferable for better activity. This observation is supported by comparing following pairs of molecules: 2 with 3, 1054 with 1059, and 214 with 212.

**aroC\_aroN\_5B**, which represents the total number of aromatic carbon atoms within five bonds from aromatic nitrogen atoms, again points out the key role played by aromatic carbon atoms in deciding Hsp90 inhibitory activity. It also underlines the usefulness of aromatic nitrogen atoms. This descriptor has a positive correlation with pIC<sup>50</sup> with R = 0.63. Therefore, an increase in number of aromatic carbon atoms within five bonds from aromatic nitrogen atoms leads to better Hsp90 inhibitory activity. The following pairs of the molecules support this observation: 888 (pIC<sup>50</sup> = 7.523, **aroC\_aroN\_5B** = 22) with 887 (pIC<sup>50</sup> = 6.046, **aroC\_aroN\_5B** = 20) and 107 (pIC<sup>50</sup> = 5.953, **aroC\_aroN\_5B** = 13) with 108 (pIC<sup>50</sup> = 4.874, **aroC\_aroN\_5B** = 10), to mention a few. Further, the 50 most active molecules possess relatively higher value of **aroC\_aroN\_5B** (range 8–17) than the 50 least active molecules (range 0–8).

**fringNsp3C5B** stands for the number of sp<sup>3</sup> -hybridized carbon atoms exactly at five bonds from the ring nitrogen atom. If the same sp<sup>3</sup> -hybridized carbon atom is also present at four or less bonds from any other ring nitrogen atom, then it was excluded while calculating **fringNsp3C5B**. It is interesting to note that the 50 most active molecules, except molecule 618, possess at least one or more of such a combination of carbon and ring nitrogen, whereas the 50 least active molecules either lack it or have **fringNsp3C5B** = 1. In the majority of compounds, the sp<sup>3</sup> -hybridized carbon atoms are present either as a linker between two rings or as a substituent, which therefore enhances conformational flexibility of the molecule to adopt a bioactive conformer or lipophilic characters of the molecule. A comparison of 895 (pIC<sup>50</sup> = 7.071, **fringNsp3C5B** = 2) with 896 (pIC<sup>50</sup> = 6.777, **fringNsp3C5B** = 1), 859 (pIC<sup>50</sup> = 7.237, **fringNsp3C5B** = 2) with 896 (pIC<sup>50</sup> = 7.071, **fringNsp3C5B** = 1), 326 (pIC<sup>50</sup> = 6.921, **fringNsp3C5B** = 1) with 328 (pIC<sup>50</sup> = 7.046, **fringNsp3C5B** = 2), and 412 (pIC<sup>50</sup> = 7.155, **fringNsp3C5B** = 1) with 411 (pIC<sup>50</sup> = 6.959, **fringNsp3C5B** = 0) and 410 (pIC<sup>50</sup> = 6.854, **fringNsp3C5B** = 0) confirms the importance of **fringNsp3C5B** in deciding the activity.

A molecular descriptor that identifies the relation of total number amide nitrogen atoms within six bonds from the H-bond donor and acceptor atoms is **da\_amdN\_6B**. In the majority of compounds in the present dataset, the amide group is present as a substituent on aromatic ring or as a linker between two rings. The descriptor **da\_amdN\_6B** suggests the significance of amide group and its correlation with the H-bond donor and acceptor atoms. This observation is confirmed on comparing molecule **A** with molecules **B** and **C** (see Figure 5).

**Figure 5.** Pictorial representation of **da\_amdN\_6B** using representative examples only.

A good number of researchers have also pointed out that the amide group is crucial for Hsp90 inhibitors to establish H-bonding with residues of the active site (see pdb 4AWO). For example, Zhao et al. [4] pointed out that the distance between the nitrogen atoms on the piperidine ring and the amide are important for Hsp90 inhibition. Similarly, Baruchello and co-workers [35] studied a library of 3,4-isoxazole diamides for Hsp90 binding and found that a substantial reduction in Hsp90 binding affinity when the amide was replaced with substituted amines. In addition, a H-bond donor at the C-4 position on the isoxazole is vital for retaining the activity. Davies et al. [36] observed that S-acetamide derivatives of compounds have better bioactivity profile than the S-alkylamines. The importance of **da\_amdN\_6B** was further confirmed by comparing following pair of the molecules: 856 (pIC<sup>50</sup> = 6.848, **da\_amdN\_6B** = 0) with 861 (pIC<sup>50</sup> = 7.114, **da\_amdN\_6B** = 1). The earlier work identified the role of amide group, and in the present work, we successfully identified that a combination of amide group with H-bond donor/acceptor within six bonds is a better strategy to have better Hsp90 inhibitory activity. Therefore, such a combination of the amide nitrogen atom and H-bond donor/acceptor should be retained in future optimizations.

In short, three molecular descriptors emphasize the importance of ring carbon atoms, especially aromatic carbon atoms. This could be attributed to the lipophilic character of the active site of Hsp90. Likewise, four molecular descriptors underline the significance of different types of nitrogen atoms, which are responsible for the establishment of the polar or H-bond interactions with polar residues and water molecules present inside the active site of Hsp90. Hence, the present work is successful in identifying reported as well as novel pharmacophoric features of Hsp90 inhibitors.

#### **4. Materials and Methods**

The OECD (Organization for Economic Cooperation and Development) guidelines and a standard protocol recommended by different researchers [11–13,16,18,25,26,29,30,37] involve the sequential execution of (1) data collection and its curation, (2) structure generation and calculation of molecular descriptors, (3) objective feature selection (OFS), (4) splitting the dataset into training and external validation sets, (5) subjective feature selection involving building a regression model and validation of the developed model, which have all been followed to build a widely applicable QSAR model for Hsp-90 inhibitory activity. This also ensures thorough validation and successful application of the model.

#### *4.1. Data Collection and Its Curation*

The dataset of Hsp-90 inhibitory activity used for building, training, and validating the QSAR model in the present work was downloaded from BindingDB (https://www. bindingdb.org/bind/index.jsp, accessed on 24 December 2021), which is a free and publicly accessible database. Initially, the dataset comprised 1839 molecules. Then, as a part of data curation, entries with ambiguous IC<sup>50</sup> values, duplicates, salts, metal-based inhibitors, etc. were omitted [11–13,16,18,25,26,29,30,37]. The final dataset comprises 1141 structurally diverse molecules with remarkable variation in structural scaffolds, which were tested experimentally for potency in terms of IC<sup>50</sup> (nM) (see the MS Excel file 'SupplementaryMaterial-Final' in the Supplementary Materials). The dataset includes N-terminal inhibitors of Hsp90. The experimental IC<sup>50</sup> values have a sufficient variation ranging from 5 to 350,000 nM. After that, IC<sup>50</sup> values were converted to their negative logarithmic value (pIC<sup>50</sup> = −log10IC50) so that a comparison of their values became easier. In Table 1 and Figure 6, some of the most and least active molecules are included as examples only.

**Table 1.** SMILES notation, IC<sup>50</sup> (nM) and pIC<sup>50</sup> (M) of the five most and least active molecules of the selected dataset.


**Figure 6.** Representative examples from the selected dataset (the five most active and five least active molecules).

#### *4.2. Calculation of Molecular Descriptors and Objective Feature Selection (OFS)*

A crucial step before the calculation of molecular descriptors is to convert the SMILES notations to 3D-optimized structures and partial charge assignment, which was accomplished using OpenBabel 3.1 [38] using MMFF94 force field. In the present work, the X-rayresolved structure of molecule **1007** (pdb 6YE8) was used to identify the parameter tuning in OpenBabel, required to get a better optimized structure, until there was a high similarity between the MMFF94-optimized structure and X-ray-resolved structure. This enhances the chances of getting a bioactive conformer, which in turn is highly beneficial for further optimization of Hsp90 inhibitors in the drug discovery pipeline. A comparison of the X-ray-resolved structures of molecules **1007** and **33** (pdb 2VCJ) and their respective MMFF94-optimized structures are represented in Figure 7.

From Figure 7, it is clear that there is a high similarity between the X-ray-resolved and MMFF94-optimized structure of molecules **1007** and **33**, which indicates that appropriate parameter tuning was achieved to optimize the rest of the molecules. That is, the same parameter tuning in OpenBabel was used to optimize the other molecules of the selected dataset. The parameters are as follows: geometry optimization, steepest descent, number of steps: 1500; cut off: 0.01.

In the next step, the 3D-optimized structures of all molecules in the dataset were used to calculate a good number of molecular descriptors. It is important to note that calculation of diverse molecular descriptors enhances the chances of a successful QSAR analysis and significantly helps in mechanistic interpretation. However, descriptor pruning is very useful as it further strengthens the diminished risk of overfitting from noisy redundant descriptors. To fulfil these objectives, more than 40,000 molecular descriptors were generated using *PyDescriptor* [39]. After that, OFS involved elimination of the near constant (90% molecules) and highly intercorrelated (|R| > 0.90) molecular descriptors. For this, QSARINS-2.2.4 was used. The final set of molecular descriptors comprises 1228 molecular descriptors, which still comprise manifold descriptors (1D- to 3D-), leading to coverage of a broad descriptor space.

#### *4.3. Splitting the Dataset into Training and External Sets and SFS (Subjective Feature Selection)*

Subjective feature selection involves selection of appropriate number and set of molecular descriptors to build a model using suitable algorithm. Prior to SFS, it is essential to divide the dataset into training and test (also known as external or prediction set) sets with a proper composition and proportions to circumvent information leakage and to verify the

predictive ability of a model [11–13,16,18,25,26,29,30,37]. Hence, the dataset was randomly split into training (80% = 915 molecules) and prediction or external (20% = 226 molecules) sets. It is to be noted that the training set was used for the selection of optimum number of molecular descriptors, and the sole purpose of prediction/external set was to validate the external predictive ability of the model (Predictive QSAR). A GA-MLR-based QSAR model is free from over-fitting if it comprises an optimum number of molecular descriptors. Therefore, in the present work, a simple yet effective method of identifying the breaking point was used. Generally, the continuous inclusion of molecular descriptors in the GA-MLR model significantly increases the value of Q<sup>2</sup> LOO, but after the breaking point, the value of Q<sup>2</sup> LOO does not increase significantly [24]. The number of molecular descriptors corresponding to the breaking point was considered optimum for model building. A graph (see Figure 8) was plotted between the number of molecular descriptors involved in the model and Q<sup>2</sup> LOO values, which indicated that the breaking point agreed with the six molecular descriptors. Consequently, QSAR models comprising more than six descriptors were not considered. For SFS, the set of molecular descriptors was selected using the genetic algorithm integrated with multilinear regression (GA-MLR) method available in QSARINS-2.2.4 (generations per size: 10,000; population size: 50; mutation rate: 60; significance level: 0.05; fitness parameter: Q<sup>2</sup> LOO).

**Figure 8.** Plot of number of descriptors against leave-one-out coefficient of determination Q<sup>2</sup> LOO to identify the optimum number of descriptors.

#### *4.4. Building Regression Model and Its Validation*

The GA-MLR approach resulted in the generation of a good number of models having good to excellent statistical performance. Therefore, the following stringent parameters and criteria suggested by different researchers were used to select the best model [11–13,16,18,25,26,29,30,37,40]: R<sup>2</sup> tr ≥ 0.6, Q<sup>2</sup> loo ≥ 0.5, Q<sup>2</sup> LMO ≥ 0.6, R<sup>2</sup> > Q<sup>2</sup> , R 2 ex ≥ 0.6, RMSEtr < RMSEcv, ∆K ≥ 0.05, CCC ≥ 0.80, Q<sup>2</sup> -F<sup>n</sup> ≥ 0.60, r<sup>2</sup> <sup>m</sup> ≥ 0.5, (1-r2/r<sup>o</sup> 2 ) < 0.1, 0.9 ≤ k ≤ 1.1 or (1-r2/r'<sup>o</sup> 2 ) < 0.1, 0.9 ≤ k' ≤ 1.1,| r<sup>o</sup> <sup>2</sup>− r'<sup>o</sup> <sup>2</sup>| < 0.3, RMSEex, MAEex, R 2 ex, Q<sup>2</sup> F1, Q<sup>2</sup> F2, and Q<sup>2</sup> F3, and low R<sup>2</sup> Yscr, RMSE, and MAE. The details of these statistical parameters are available in the Supplementary Materials. An important aspect of validation of a QSAR model is to identify the applicability domain. In the present work, the William's plot was plotted to assess the applicability domain of the QSAR model [11–13,16,18,25,26,29,30,37,41,42].

#### **5. Conclusions**

In the present work, a relatively large and structurally diverse dataset of 1141 Hsp90 inhibitors was used for developing a six-descriptor-based and extensively validated GA–MLR QSAR model with R<sup>2</sup> tr = 0.78, Q<sup>2</sup> LMO = 0.77, R<sup>2</sup> ex = 0.78, and CCCex = 0.88. The inclusion of easily understandable descriptors resulted in identification of important pharmacophoric features that are correlated with Hsp90 inhibitory activity. The present QSAR analysis effectively captured a mixture of reported as well as novel significant structural features. The analysis vindicates that ring and aromatic carbons are important in deciding the activity. In addition, different types of nitrogen atoms in correlation with different types of carbon atoms influence the Hsp90 inhibitory activity. A good balance of external predictive ability and mechanistic interpretations, which are further supported by the reported crystal structures of Hsp90 inhibitors, make the QSAR model useful for the future optimization of molecules in the pipeline as a better Hsp90 inhibitor.

**Supplementary Materials:** The following are available online at https://www.mdpi.com/article/10 .3390/ph15030303/s1.

**Author Contributions:** Conceptualization, V.H.M., M.E.A.Z. and S.A.A.-H.; formal analysis and data curation, V.H.M. and V.M.P.; writing, M.E.A.Z., V.H.M., M.M.R., S.N.A.B. and S.D.T.; Revisions, M.E.A.Z., V.H.M., S.D.T., M.M.R. and S.N.A.B.; editing and proofreading, V.H.M., M.E.A.Z. and S.N.A.B. All authors have read and agreed to the published version of the manuscript.

**Funding:** The authors acknowledge the Deanship of Scientific Research at Imam Mohammad Ibn Saud Islamic University, Riyadh, Saudi Arabia, for its support of this research through research group number RG-21-09-76.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Data is contained within the article and Supplementary Materials.

**Acknowledgments:** The authors acknowledge the Deanship of Scientific Research at Imam Mohammad Ibn Saud Islamic University, Riyadh, Saudi Arabia, for its support of this research through research group number RG-21-09-76. V. H. Masand is thankful to Paola Gramatica (Italy) and her team for providing the free copy of QSARINS 2.2.4.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**


#### **References**

