*Article* **A Novel Network Science and Similarity-Searching-Based Approach for Discovering Potential Tumor-Homing Peptides from Antimicrobials**

**Maylin Romero <sup>1</sup> , Yovani Marrero-Ponce 2,\* , Hortensia Rodríguez <sup>1</sup> , Guillermin Agüero-Chapin 3,4 , Agostinho Antunes 3,4 , Longendri Aguilera-Mendoza <sup>5</sup> and Felix Martinez-Rios <sup>6</sup>**


**Abstract:** Peptide-based drugs are promising anticancer candidates due to their biocompatibility and low toxicity. In particular, tumor-homing peptides (THPs) have the ability to bind specifically to cancer cell receptors and tumor vasculature. Despite their potential to develop antitumor drugs, there are few available prediction tools to assist the discovery of new THPs. Two webservers based on machine learning models are currently active, the TumorHPD and the THPep, and more recently the SCMTHP. Herein, a novel method based on network science and similarity searching implemented in the starPep toolbox is presented for THP discovery. The approach leverages from exploring the structural space of THPs with Chemical Space Networks (CSNs) and from applying centrality measures to identify the most relevant and non-redundant THP sequences within the CSN. Such THPs were considered as queries (Qs) for multi-query similarity searches that apply a group fusion (MAX-SIM rule) model. The resulting multi-query similarity searching models (SSMs) were validated with three benchmarking datasets of THPs/non-THPs. The predictions achieved accuracies that ranged from 92.64 to 99.18% and Matthews Correlation Coefficients between 0.894–0.98, outperforming stateof-the-art predictors. The best model was applied to repurpose AMPs from the starPep database as THPs, which were subsequently optimized for the TH activity. Finally, 54 promising THP leads were discovered, and their sequences were analyzed to encounter novel motifs. These results demonstrate the potential of CSNs and multi-query similarity searching for the rapid and accurate identification of THPs.

**Keywords:** cancer; tumor-homing peptide; in silico drug discovery; complex network; chemical space network; centrality measure; similarity searching; group fusion; motif discovery; starPep toolbox software

#### **1. Introduction**

Cancer is a group of diseases developed in different cell and tissue types, and corresponds to the second leading cause of death globally [1]. It is based on the abnormal growth of cells due to an inherited genetic mutation or induced by the environment [2].

**Citation:** Romero, M.;

Marrero-Ponce, Y.; Rodríguez, H.; Agüero-Chapin, G.; Antunes, A.; Aguilera-Mendoza, L.; Martinez-Rios, F. A Novel Network Science and Similarity-Searching-Based Approach for Discovering Potential Tumor-Homing Peptides from Antimicrobials. *Antibiotics* **2022**, *11*, 401. https://doi.org/10.3390/ antibiotics11030401

Academic Editor: Mire Zloh

Received: 1 February 2022 Accepted: 15 March 2022 Published: 17 March 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

Despite novel therapy development for cancer treatment, improving chemotherapeutic drugs' specificity towards cancer cells remains a challenge [2,3]. Additionally, cancer cells are generating multi-drug resistance (MDR) [4]. Consequently, in the pharmaceutical industry, there is a need to develop new anticancer agents with a different mode of action to tackle the current drug resistance of cancer cells without being cytotoxic to healthy ones [2]. To fill this gap, peptides have emerged as a potential therapeutic alternative against cancer. From 2015 to 2019, 15 peptides or peptide-containing molecules were approved by the FDA as drugs, demonstrating the growing interest of the scientific community [5].

Peptides have different biochemical and therapeutic properties than small molecules and proteins, making them attractive to the pharmaceutical and biotechnological industry [6,7]. Being smaller than proteins allows peptides to penetrate tissues more easily, have low cost, more accessible synthesis, and do not require folding to be biologically active [8]. In contrast to small molecules, they have a higher specificity and efficacy due to representing the smallest functional part of a protein [9]. Moreover, they are not supposed to interact with the immune system, are biocompatible, have tunable bioactivity, and have low cytotoxicity due to their degradation products being amino acids [10–14]. Hence, peptide-based drugs open a new door to an improved cancer diagnosis and treatment.

Tumor blood and lymphatic vasculature differ molecularly and morphologically from normal lymphatic and blood vessels [15]. Tumor-homing peptides (THPs) take advantage of this peculiarity. Thus, they are widely investigated as drug carriers and for imaging purposes on oncology treatments and diagnosis [16]. The first-generation of THPs have RGD (Arg-Gly-Asp) and NGR (Asn-Gly-Arg) motifs. RGD peptides have the characteristic of selectively binding to α integrins expressed in vascular endothelial cells of the tumor and metastatic tumor cells, and NGR to aminopeptidase N (APN) receptors [17,18]. Although, there are neither non-RGD nor NGR peptides that home tumor blood vasculature and cancer cells by interactions with other receptors, such as the endothelial growth factor receptor (EGFR) [19–23].

THPs are discovered by using in vitro and ex vivo/in vivo phage display technology, which is time-consuming, expensive, and may not translate to humans due to differences between the animal models and humans [24–26]. For these reasons, bioinformatics tools such as databases and webservers are being employed for the accurate prediction of novel THPs [26–28]. In this way, short sets of the most promising THPs become the candidates for posterior experimental verification.

To date, the databases available for experimentally validated THPs are TumorHoPe (includes 744 THPs) [27] and starPepDB (includes 659 THPs) [29], and the available TH activity predictors are TumorHPD (https://webs.iiitd.edu.in/raghava/tumorhpd) (accessed on 1 May 2021) [26], THPep (http://codes.bio/thpep) (accessed on 1 May 2021) [28], and SCMTHP (SCMTHP (pmlabstack.pythonanywhere.com) (accessed on 5 January 2022) [30]. TumorHPD uses the supervised ML method Support Vector Machine (SVM) as a classifier with three features: amino acid composition, dipeptide composition, and binary profile patterns, achieving 86.56% as the highest accuracy [26]. The second ML method, THPep, has a Random Forest (RF) classifier with three features: amino acid composition, dipeptide composition, and pseudo amino acid composition, resulting in 90.13% of maximum overall accuracy [28]. However, the datasets used for training and testing both ML models contain peptides with highly similar sequences. On the other hand, SCMTHP is the most recently reported method based on the scoring card method (SCM) [30]. It determines the propensity scores for the amino acids' and dipeptides' composition accounting for THP sequences and applies a threshold value to discriminate between THP and non-THPs. Nonetheless, the performance of SCMTHP is similar to ML-based predictors, achieving a maximum accuracy of 82.7%.

Recently, Marrero-Ponce et al. published a new software named starPep toolbox (http://mobiosd-hub.com/starpep/) (accessed on 2 February 2021), which is aimed to perform network analyses on the integrated graph database called starPepDB, which include the most comprehensive and non-redundant database of antimicrobial peptides

(AMPs) [29,31]. Here, we propose an alternative methodology to identify potential THPs by combining network science with multi-query similarity searching against the AMPs of starPepDB. We used the starPep toolbox software as the main bioinformatics tool and the Chemical Space Network (CSN) to represent the chemical space of peptides as a coordinatefree system. To the best of our knowledge, there are no reported studies where data mining and screening is supported by network science to discover peptides for pharmaceutical purposes [29]. Firstly, we built models of representative and non-redundant THPs using centrality analysis and supervised retrospective similarity searching to perform the TH activity prediction. The outstanding model, named THP1, predicted the TH activity of three benchmarking datasets of THPs/non-THPs achieving accuracies between 92.64–99.18% and Matthews Correlation Coefficient (MCC) between 0.894–0.98, demonstrating the feasibility of this new methodology. Then, we performed a hierarchical screening for drug repurposing using network-based algorithms implemented in the starPep toolbox, the best model THP1, local alignments, and webservers to predict relevant activities related to the TH. Their TH activity was optimized by generating random libraries, where the peptide undergoes amino acid's stochastic substitutions at different positions. Finally, a set of 54 potential THPs from AMPs was proposed, where common motifs were identified.

#### **2. Materials and Methods**

The overall workflow of this report, shown in Figure 1, was based on two steps: (i) generation/selection of the model of representative THPs from starPepDB in starPep toolbox, and (ii) prediction of potential new THPs from AMPs. In the first step, some models of representative THPs from starPepDB were built using different centrality measures to rank the nodes and extract the representative and less similar sequences by applying local alignment. Then, the best multi-query similarity searching model (SSM) was selected by the classification performance and its ability to correctly retrieve THPs from benchmark THPs databases by using group fusion (MAX-SIM rule) similarity searching.

**Figure 1.** General overview of the experimental procedure.

In the second step, the model was used to perform similarity searching to repurpose AMPs as THPs from starPepDB, and their TH activity was optimized using the TumorHPD server. Additionally, sequence motifs were found from the set of potential THPs using multiple sequence alignments [32–35], alignment-free methods [36], and PROSITE server (https://www.genome.jp/tools/motif) (accessed on 15 July 2021).

#### *2.1. StarPep Toolbox Software*

The starPep toolbox uses FASTA files as inputs and includes the starPepDB. Peptides are represented as nodes connected by an edge if they have any relationship. It can perform querying, filtering, visualization of networks, scaffold extractions, single or multiple queries similarity searching, and analysis of peptides by graph networks [29,31].

Networks can be built based on the metadata of peptides or based on the pairwise similarity measures calculated for their respective sequence. In metadata networks, nodes are connected by a specific parameter in common, such as origin; the target against which they are assessed; functionality; the database where they come from; the cross-reference; Nterminus; C-terminus; or amino acid composition. In similarity networks, peptides are codified by descriptors, such as length, net charge, isoelectric point, molecular weight, Boman index, indices based on aggregation operators, hydrophobic moment, average hydrophilicity, hydrophobic periodicity, aliphatic index, and instability index [29,31,37]. Moreover, networks are visualized using different layouts, such as Fruchterman–Reingold [38].

Networks can be clustered, and communities are optimized using the Louvain method [39]. Moreover, the centrality of each node can be particularly measured by harmonic, community hub-bridge, betweenness, and weighted degree. Centrality is crucial to perform scaffold extractions because peptides are ranked according to their centrality score, and then redundant sequences are removed, prioritizing the most central. Thus, scaffold extractions depend on the type of centrality applied.

On the other hand, similarity searching, which is the basis of this study, is performed using a set of queries against a target dataset, where different percentages of identity (or similarity thresholds) can be applied. An identity score is a number between 0–1, and it is calculated using the Smith–Waterman local alignment with BLOSUM 62 substitution matrix [40]. Multiple queries similarity searching works using the group fusion model explained in the following section.

#### *2.2. Model Selection*

The dataset of reported THPs was extracted from starPepDB in the starPep toolbox. All 45120 peptides contained in starPepDB were filtered by the "Tumor Homing" query in the metadata function, where 659 entries were obtained (SI1-A).

#### 2.2.1. Network Analysis

#### Similarity Threshold Analysis

Network analysis of peptides was performed by building the CSN of 659 THPs in the starPep toolbox. To choose the appropriate similarity threshold to build the network of THPs, CSNs were built by varying in 0.05 the cut-off value from 0.10 to 0.90 (17 CSNs in total). Some metrics were retrieved from each CSN using the starPep toolbox, such as density, number of communities, modularity, and number of singletons.

By default, when CSN was built, nodes with higher than 98% of similarity were removed using the local alignment Smith–Waterman algorithm. The similarity metric used to establish the pairwise similarity relationships between nodes was the min–max normalized Euclidean distance. Then, a centrality was calculated and those nodes with 0 as vertex degree were identified as outliers and then removed, leaving the giant (or connected) components of the CSN, i.e., subgraph where all nodes are connected. In this case, community hub-bridge centrality was calculated. However, any centrality measure could have been calculated since singletons always have zero centrality. After that, the network was clustered and the modularity optimized using the modularity optimization algorithm based on the Louvain method [39].

The network was saved as a Graph ML file to be opened in Gephi [41] for subsequent calculation of ACC. Finally, density, modularity, and ACC as a function of similarity threshold were graphed in Origin to decide what similarity threshold is the best.

#### Network Characterization

CSN of the giant components derived from the application of the best similarity threshold was characterized by the number of nodes, edges, outliers, density, number of communities, and modularity. These parameters were obtained from starPep toolbox while ACC, diameter (larger shortest path), average path length, and a total of triangles were

drawn from Gephi. These parameters allow knowing the topology and structural patterns of the CSN.

For network visualization, Force Atlas 2 was used as a layout algorithm where colors represent different clusters, and node size means how central the node is according to the community hub-bridge centrality. Network visualization aims to obtain an aesthetically pleasing and understandable graph where nodes are not overlapped.

On the other hand, CSN of outliers was built with a cut-off of 0.30 to procure an appropriate density; then, it was clustered. Moreover, a subsequent scaffold extraction was applied based on hub-bridge centrality, and on 30% identity from local alignment.

The network of outliers was characterized according to the number of nodes, edges, communities, density, modularity, average degree, ACC, diameter obtained before scaffold extraction, and the number of nodes and edges obtained after scaffold extraction. For network visualization, Fruchterman–Reingold was used as a layout algorithm; colors represent different clusters while node size displays how central it is according to hubbridge measure.

#### 2.2.2. Centrality Analysis

The most influential nodes were used to find the new potential THPs, and centrality is the crucial parameter that provides this information. Thus, the four available centrality types in the starPep toolbox (weighted degree, community hub-bridge, betweenness, and harmonic) were calculated and normalized using the min–max method. Then, redundant peptides were removed by applying the scaffold extraction procedure that is described as follows: peptides were ranked based on the scores obtained after centrality calculation and we used 30% similarity cut-off of local identity from the Smith–Waterman algorithm to retrieve sets of sequences with a maximum of 30% similarity [40]. Subsequently, nodes with 10% lower centrality than the most central node were removed in each metric. The sets obtained after applying this process were named as 30 + 10%.

On the other hand, harmonic and weighted degree were calculated and normalized, and redundant peptides were removed by applying the scaffold extraction procedure using four different similarity cut-offs of local identity: 30, 40, 50, and 60%.

#### 2.2.3. Similarity Searching Model for THPs Prediction

This study's proposed method for discovering potential THPs was based on similarity searching. For that reason, multiple query similarity searching models (SSMs) composed of several queries representing the most important and less redundant nodes of CSN and a similarity threshold were tested against datasets that contain well-known THPs/non-THPs through similarity searching. The recoveries from the similarity searching were statistically evaluated to select the best model for identifying potential THPs within the AMPs.

#### Query Datasets (Reference Sequences)

The retrieved sets after applying scaffold extractions at each centrality measure; the two sets of outliers; combinations of outliers with sets obtained from centrality-based scaffold extractions; and combinations between sets obtained from scaffold extractions performed using different centrality metrics were used as queries (Qs). In total, we tested 22 sets of Qs, where twelve sets resulted from the application of the scaffold extraction procedures as well as two sets of outliers, and eight sets resulted from the combination between sets.

#### Target Databases

Three training datasets that consider well-known THPs and randomly generated non-THPs [42] were used as the target or calibration for the recovery. THPep, TumorHPD, and SCMTHP employed these datasets for training their methods [26,30,42].

• Main dataset: 651 experimentally validated THPs and 651 random non-THPs (SI1-B). They were collected from TumorHoPe [27] and the literature [26].


#### Group fusion

Group fusion is based on the variation in the query (reference peptide), but keeping constant the identity measure [43]. Each peptide's identity score is calculated from the target dataset varying the Qs. The fusion group's algorithm associates a fused score to each target peptide, i.e., the maximum similarity (MAX-SIM) scores from all resulting identity scores against the Qs. Therefore, considering peptide S from the target dataset, reference peptide Q from the Qs, the identity score I(S,Q), and the MAX-SIM score obtained, the algorithm assigns I(S,Q) as the fused score to peptide S. The local identities were calculated with the Smith–Waterman, and is a number between 0–1, with 1 being the maximum identity. The procedure is illustrated in Figure 2.

**Figure 2.** Schematic representation of the group fusion and similarity searching processes. Q is a peptide from a query dataset, n is the number of peptides contained in a query dataset, S is a peptide from the target dataset (Main, Small, or Main90 dataset), m is the number of peptides included in the target dataset (1302, 938, or 619, respectively). The similarity threshold is related to the percentage of identity.

#### Retrospective Similarity Searching

Main Dataset was imported to starPep toolbox. The similarity searching was performed using the "Multiple query sequences" option of the software and the Qs obtained from 30 + 10% similarity cut-offs of local alignment and outliers. The group fusion is applied by default during the similarity searching, and results were ranked according to the fused score (MAX-SIM value). Subsequently, seven different percentages of identity (similarity thresholds), 30, 40, 50, 60, 70, 80, and 90%, were tested, where peptides with identities equal to or higher than the applied threshold were retrieved as predicted THPs. Figure 2 illustrates how the similarity searching works.

The rescued nodes, i.e., predicted THPs, were statistically evaluated to validate the prediction. Thus, it is possible to identify the two centrality measures and percentages of sequence identity with the best performance.

Then, similarity searching was performed using only the sets of the best two centrality measures as Qs: harmonic and weighted degree, and 30, 40, 50, 60, and 70% of identity. In Small and Main90 datasets, only the sets of harmonic and weighted degrees were

used as Qs, applying 40, 50, and 60% of identity for recovery. In total, 98 different SSMs were evaluated.

#### 2.2.4. Statistical Analysis

The ability of the SSMs to predict THPs was validated by the measurement of their accuracy (Ac), kappa (κ), sensitivity (Sn), specificity (Sp), the precision of positives and negatives (Ppos and Pneg, respectively), MCC, and false accept rate (FAR%) using the following formulas.

$$\text{Ac} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}'} \tag{1}$$

$$\kappa = \frac{\text{Po} - \text{Pc}}{1 - \text{Pc}}\text{\textdegree}\tag{2}$$

$$\text{Sn} = \frac{\text{TP}}{\text{TP} + \text{FN}} \text{ \textdegree \tag{3}$$

$$\text{Sp} = \frac{\text{TN}}{\text{TN} + \text{FP}'} \tag{4}$$

$$\mathcal{P}\_{\text{pos}} = \frac{\text{TP}}{\text{TP} + \text{FP}} \; \prime \tag{5}$$

$$P\_{\rm neg} = \frac{\rm TN}{\rm TN + FN} \,\,\,\,\,\tag{6}$$

$$\text{MCC} = \frac{\text{TP} \times \text{TN} - \text{FP} \times \text{FN}}{\sqrt{(\text{TP} + \text{FP}) \times (\text{TP} + \text{FN}) \times (\text{TN} + \text{FP}) \times (\text{TN} + \text{FN})}},\tag{7}$$

$$\text{FAR\%} = \frac{\text{FP}}{\text{FP} + \text{TN}} \times 100 \,\text{\AA} \,\text{\AA} \tag{8}$$

where, TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, FN is the number of false negatives, Po is the relative observed agreement between the observers equal to the Ac, and Pc is the expected chance agreement calculated by the formula Pc = (TP+FP)×(TP+FN)+(FN+TN)×(FP+TN) (TP+TN+FP+FN) <sup>2</sup> .

Finally, the best 9 SSMs were compared and ranked using the Friedman test-based analysis performed in KEEL [44]. The Friedman test identified the best model based on the statistical metrics previously shown [45]. Moreover, it allowed us to compare the models and determine if their difference was statistically significant and not due to chance. The confusion or classification matrix of the best model was constructed. The best models were compared with reported ML models used for THP prediction, TumorHPD, and THPep, using the same three calibration datasets.

#### *2.3. Identification of Potential THPs*

#### 2.3.1. Hierarchical Screening

Drug repurposing is an alternative methodology widely applied to discover drugs because it reduces approval time for their clinical use [46,47]. Thus, firstly, we repurposed AMPs from starPepDB as THPs.


theoretical activities of virtual hits were predicted using webservers TumorHPD [26], THPep [28], AntiCP [48], CellPPD [49], ToxinPred [50], and HemoPI [51], to corroborate their potential as THPs and prioritize those that do not harm healthy cells. The activities of interest were tumor homing, anticancer, cell-penetrating, toxicity, and hemolysis. The SVM thresholds used were 0.30 in servers TumorHPD, AntiCP, and CellPPD, and 0 in server ToxinPred.


#### 2.3.2. Tumor-Homing Activity Optimization

Lead hits detected from hierarchical virtual screening were AMPs from starPepDB with a natural or designed activity different from tumor homing. That is the reason why their tumor-homing action should be enhanced. Lead hits were optimized by punctual amino acid mutations using the "Designing of Tumor Homing Peptides" module of TumorHPD (https://webs.iiitd.edu.in/raghava/tumorhpd/peptide.php) (accessed on 10 September 2021), and the procedure is shown in Figure 3. Both lead and mutated sequences were shortened into fragments of 5, 10, and 15 residues in length using the same server.

**Figure 3.** Procedure to optimize tumor-homing activity of lead hits.

The optimized sequences showing a higher tumor-homing activity score than parent hits were analyzed by CSN in the starPep toolbox using 0.60 as the similarity threshold to build the network. In addition, tumor homing, toxicity, hemolytic, anticancer, and cell penetrability were predicted using servers listed below: THPep (http://codes.bio/thpep), TumorHPD (https://webs.iiitd.edu.in/raghava/tumorhpd) (accessed on 25 September 2021), AntiCP (https://webs.iiitd.edu.in/raghava/anticp2) (accessed on 25 September 2021), CellPPD (https: //webs.iiitd.edu.in/raghava/cppsite1) (accessed on 25 September 2021), ToxinPred https: //webs.iiitd.edu.in/raghava/toxinpred (accessed on 25 September 2021), and HemoPI https: //webs.iiitd.edu.in/raghava/hemopi (accessed on 25 September 2021). Redundant sequences with higher than 50% similarity were removed by scaffold extraction.

The optimized sequences and parent hits were merged, and the corresponding CSN was built using 0.50 of cut-off and clustered. Next, harmonic centrality was calculated. Each cluster was analyzed separately to prioritize the most central, potent, non-toxic, and

non-hemolytic lead THPs. Finally, the heat map and histogram of pairwise sequence identity of lead compounds were constructed to explore their structural diversity.

#### 2.3.3. Motif Discovery

#### Multiple Sequence Alignments

As the resulting potential THPs were hard-to-align sequences because of their short length and variability, they were grouped into seven clusters according to the neighborhood in the CSN. Given that two peptides underrepresented clusters 1 and 5, they were fused in a cluster labeled 1–5. Thus, peptide clusters (2–4, 1–5, and singletons) were aligned independently using multiple sequence alignments (MSA), publicly available at https://www.ebi.ac.uk/Tools/msa/ (accessed on 28 September 2021). Four different MSA algorithms were applied with their default parameters to determine consensus motifs within each cluster: (1) Clustal-Omega v 1.2.4 [32], (2) MAFFT (Multiple Alignment using Fast Fourier Transform) v7.487 with the iterative refinement FFT-NS-i option [33], (3) MUSCLE (Multiple Sequence Comparison by Log-Expectation) v3.8 [34], and T-Coffee (Tree-based Consistency Objective Function for Alignment Evaluation) v1.83 [35].

The resulting MSAs were employed to extract the conserved motifs by considering the consensus sequences estimation from the programs Jalview v2.11.1.4 [52], EMBOSS Cons v6.6.0 (https://www.ebi.ac.uk/Tools/msa/emboss\_cons/) (accessed on 28 September 2021), and Seq2Logov2.1 (http://www.cbs.dtu.dk/biotools/Seq2Logo/) (accessed on 28 September 2021) [53].

#### Alignment-Free Method

Peptides were analyzed in STREME [36] (Sensitive, Thorough, Rapid, Enriched Motif Elicitation) to discover fixed-length patterns (ungapped motifs) that were enriched with respect to a control set generated by shuffling input peptides [52]. The analyses were performed via its webserver (https://meme-suite.org/meme/tools/streme) (accessed on 28 September 2021), by considering both total peptides and by each cluster. The motif width was set between 3–5 amino acids length. STREME applies a statistical test at *p*-value threshold = 0.05 to determine the enrichment of motifs in the input peptides compared to the control set.

#### Motif Search in PROSITE

Peptides were queried by the Motif Search tool (https://www.genome.jp/tools/ motif/) (accessed on 28 September 2021) and integrated into the GenomeNet Suite (https: //www.genome.jp/) (accessed on 28 September 2021). PROSITE Pattern and PROSITE Profile libraries were only considered for the motif search.

#### **3. Results and Discussion**

*3.1. Model Selection*

3.1.1. Network Analysis

Similarity Threshold Analysis

Out of the set of 659 THPs retrieved from starPepDB, 627 peptides (SI1-A-I) were filtered with lower than 98% similarity by local alignment. The adequate similarity threshold was chosen before building CSN with the 627 peptides. This step is non-trivial since it is the parameter that defines the topology and network features [54]. Hence, the appropriate cut-off for building the CSN was determined based on the variability of network parameters such as density, modularity, ACC, and singletons at different cut-off similarity values. Graphml files corresponding to the 17 CSNs are available at SI2. Table S1 shows the obtained parameters at each cut-off.

The graph of density, modularity, and ACC as a function of the similarity threshold is shown in Figure 4. The density is lower at a higher similarity threshold. ACC follows the same pattern until the 0.65 similarity threshold. By contrast, modularity increases as the similarity threshold increases, while the clustering is optimized.

**Figure 4.** Density, modularity, and average clustering coefficient (ACC) as a function of similarity threshold of 627 THPs CSN.

A well-defined network needs a compromise among the density, modularity, and ACC parameters, but also accounts for the number of outlier nodes because they are atypical peptides with particular properties. Networks with very low density display too many outliers (see Table S1), while networks with very high density show a massive connection. In both cases, information is lost and interpretation becomes difficult. According to the literature, the best density percentages are generally around 1% or 2.5% because they generate high modularity but allow an adequate understanding of the network [54]. As modularity indicates the existence of community structures, the ideal value must show an equilibrium between a non-clustered network and an artificially clustered network due to the high modularity value. In this sense, the selected similarity threshold was 0.60, where CSN shows the best trade-off among network parameters and connectivity: 2.3% of density, 0.47 of modularity, 0.428 of ACC, and 99 outliers (15.8% of overall nodes). Therefore, the giant components of the network were 528 nodes (SI1-A-II).

#### Network Characterization

Some parameters such as density, number of clusters, modularity, average degree, ACC, and diameter were calculated and shown in Table 1 to get an overview on the giant component and outliers of the CSNs, which are represented in Figures 5 and 6, respectively.


**Table 1.** Global network properties of CSN of 528 nodes and outliers.

\* Density, number of clusters, and modularity were calculated in the starPep toolbox, while average degree, ACC, and diameter were calculated in Gephi. \*\* Sc.: Scaffold extraction.

**Figure 5.** CSN of giant component conformed by 528 THPs retrieved from starPepDB. Node color represents the community (cluster), and node size symbolizes the centrality values.

Additionally, the degree of distribution of the giant components is shown in Figure 7. It gives some information about the structure of the CSN. In this case, the distribution degree is concentrated in the nodes with low vertex degrees. However, it has a tail associated with the nodes with higher vertex degrees in a lower proportion. The nodes with higher degrees correspond to the most central nodes, which, as can be corroborated in Figure 5, are in the minority.

**Figure 7.** Degree distribution of the 528 giant components, where k is the vertex degree.

Outliers are relevant THPs because they present characteristics regarding 528 nodes that make up the giant component; so, they are unique or atypical sequences. CSN of the 99 singletons (SI1-E) was built using 0.30 of similarity threshold (Figure 6a). Then, sequences with higher similarity than 30% by local alignment were removed based on hub-bridge centrality ranking, where 34 outliers (SI1-E-I) with unique sequences were obtained (Figure 6b).

#### 3.1.2. Centrality Analysis and Similarity Searching

Centrality is the crucial parameter to build the model that will be proposed to identify THPs. It allows the identification of the most influential sequences of the giant components. SI3 (Excel file) shows the normalized centrality measurements of 528 THPs. On the other hand, outliers are nodes with unique properties that enrich the influential sequences model. Therefore, both sets from centrality measurements and sets of outliers represent the chemical space of THPs and will be used as queries to perform the similarity searching against the target datasets. In total, 98 different SSMs were generated based on 22 query sets (FASTA files available at SI4) and similarity thresholds between 0.3 and 0.9.

The predictions and performance of the 98 SSMs are shown in SI5 and SI6-A, respectively, where active and inactive labels indicate predicted THPs and non-THPs, respectively. In general, it is observed that the performance of query datasets followed the following tendency of relevance: weighted degree > harmonic > hub-bridge > betweenness > singletons (outliers). However, the combination of query datasets from different centrality types overperforms the sets selected with only one centrality measure. The addition of the outliers set improved the performance of the combination sets since it generates the complete representation of the chemical space of THPs. Moreover, better performance was obtained using 40, 50, and 60% identity in the similarity searching.

The performance of the best nine SSMs to predict activity in Main, Small, and Main90 datasets are shown in Table 2, Table 3, and Table 4, respectively, where H is the set obtained when harmonic centrality was calculated, W is the set obtained when the weighted degree was calculated, and sing is the set of 99 outliers.


**Table 2.** Statistical analysis for the performance of the best 9 SSMs on the target Main dataset.

\* H is the set obtained when harmonic centrality was calculated, W is the set obtained when the weighted degree was calculated, and sing is the set of 99 outliers.

**Table 3.** Statistical analysis for the performance of the best 9 SSMs on the target Small dataset.


\* H is the set obtained when harmonic centrality was calculated, W is the set obtained when the weighted degree was calculated, and sing is the set of 99 outliers.

**Table 4.** Statistical analysis for the performance of the best 9 SSMs on the target Main90 dataset.


\* H is the set obtained when harmonic centrality was calculated, W is the set obtained when the weighted degree was calculated, and sing is the set of 99 outliers.

It can be noticed that the best statistics were achieved using the query composed of the union of harmonic and weighted degree, both using 60% similarity cut-off of local alignment during scaffold extraction, and the 99 outliers sets, comprising in total 479 query sequences. Moreover, 60% was the best percentage of identity where there was a compromise for all statistical parameters. All statistical parameters showed values higher than 0.88.

The best nine SSMs were compared and ranked using the Friedman test by comparing multiple statistical metrics from each SSM on the three target datasets (details in SI6-B). The best SSM was the set **CSN-TH-0.60Sc-479-H+W+s-0.6-583 (479Q\_0.6)**, named THP1, showing excellent statistical metrics (>0.85) for the model (shown in Tables 2–4). It is

composed of the union of nodes with an identity lower than 60% from the global centrality harmonic with those obtained from applying weighted degree and the set of 99 outliers (479 nodes). The best percentage of identity used to search similarity was 60%. The confusion matrices of THP1 are shown in SI6-C. It can be seen that the prediction of the model was not at random as the MCC was much greater than zero [55].

Finally, the Friedman test of the THP1 versus the reported models used in TumorHPD [26] and THPep [28] servers revealed there is a significant difference between the models, being that the performance of the similarity searching methodology is superior (details in SI6- C and SI6-D). Figure 8 shows the ranking scores from the test, where THP1 is the first ranked method. Finally, Table 5 compares between the model on the three benchmarking datasets. The MCC of predictions using THP1 improved by an average of 28.76% over ML-based models.

**Figure 8.** Ranking scores obtained in the Friedman Test. Friedman statistic (distributed according to chi-square with 2 degrees of freedom): 11.166667. P-value computed by Friedman Test: 0.00376.

**Table 5.** Comparison between the best SSM THP1 and the state-of-the-art ML model to predict tumor-homing activity of benchmarking datasets.


*3.2. Identification of Potential THPs*

3.2.1. Hierarchical Screening

Starting from the 45120 AMPs contained in starPepDB, and after applying the previously explained filters and performing the similarity searching, 43 lead hits were retrieved (SI7-A). Figure 9 shows the step-by-step hierarchical virtual screening. Until today, these repurposed sequences have not reported any tumor-homing activity.

**Figure 9.** Hierarchical virtual screening for repurposing of peptides from starPepDB.

#### 3.2.2. Tumor-Homing Activity Optimization

A library of 180 sequences (SI7-B) was obtained from the optimization of 43 hits in TumorHPD. They have a higher TH score, lower toxicity, and hemolytic activity than the originals. Mutations enriched the sequences with W and C residues. Mainly G and V residues from the originals were mutated to W, while R and K were to C. Studies report the presence of W contributes positively to the intracellular translocation of peptides [56]. Additionally, it was reported that W enhances the stability of peptides in serum and salt [57].

Forty-one peptides from the library were prioritized by studying their CSN, where 50% scaffold extraction by local alignment was accomplished. The sequences were clustered and ranked according to the global harmonic centrality to perform the scaffold extraction. Only the most central sequences with a similarity among them lower than 50% were kept. Forty-one sequences have higher predicted TH activity by TumorHPD than the original peptides, with scores between 0.39 and 1.92. Furthermore, they are anticancer and have

less toxicity and hemolytic activity. 12 out of 41 sequences come from fragments of original sequences of 5, 10, and 15 lengths; 15 resulted from four punctual mutations from the originals; and 14 from fragments of mutated sequences of 5, 10, and 15 lengths. Two out of forty-one peptides, CNGRCGGKLA and WCAMS, are part of reported THPs, validating the novel methodology to discover potential THPs. CNGRCGGKLA is the *N*-end of the CNGRCGGKLAKLAKKLAKLAK peptide containing the NGR TH motif and a disulfide bridge that gives stability. CNGRCGGKLAKLAKKLAKLAK binds to CD13 of tumor cells acting as ACP and THP [58]. At the same time, WCAMS is the *C*-end of the KLWCAMS peptide that homes mouse B16B15b melanoma [59].

We selected the most promising 13 sequences from the 43 lead hits and these were combined with the 41 optimized hits. In total, we proposed 54 peptides (SET 1, FASTA file in SI7-C) with a diverse molecular structure, low toxicity, and hemolytic activity, with most of them also showing potential anticancer activity (SI7-D). Among the 54 lead hits, only one sequence has the well-known NGR motif. Therefore, SET 1 is composed of new structural entities within the known structural space of the THPs.

The sequence diversity of SET 1 was evaluated using all vs. all global alignment where pairwise sequence identities were explored. As shown in Figure 10, the 54 lead hits present a structure singularity sharing pairwise identities with 30%.

**Figure 10.** (**A**) Heat map and (**B**) histogram of pairwise sequence identity of SET 1 (54 lead compounds).

#### 3.2.3. Motif Discovery

As a consequence of the structural diversity of SET 1, the discovery of motifs accounting for the TH activity is not a straightforward task. In this sense, sensitive multiple sequence alignment (MSA) tools and alignment-free (AF) approaches (e.g., STREME) were applied to unravel new TH motifs.

The resulting 54 lead THPs were mapped onto CSN space to identify putative communities and make possible the application of MSA algorithms for motif identification. These networks communities were considered clusters containing related peptides. Finally, six clusters were conformed with 14, 10, 8, 4, 10, and 8 members, respectively (SI7-E). The last cluster grouped the singletons (peptides identified as atypical in the CSN).

Clustal-Omega [32], MAFFT [33], MUSCLE [34], and T-Coffee [35], which are MSA algorithms developed after the classical ClustalW, were applied, so that they can deal with hard-to-align sequences shown in each cluster, and thus to detect any conserved signature or motif. Since each MSA has implemented a different algorithm to improve alignment quality, their consideration for the estimation of consensus regions helped us identify TH motifs by using the Jalview, EMBOSS Cons and Seq2Logo programs (SI8). As the EMBOSS Cons, gives a more legible output, only displaying high scored amino acids/positions (capital letters), less scored but positive residues (lower-case letters), and non-consensus positions (x), were selected as the primary source to set consensus/conserved regions. The non-consensus positions were estimated using default parameters by visual inspection of the corresponding positions in the Jalview program [52] and the Seq2Logo [53]. Table 6 depicts the consensus motifs, unraveled by each MSA algorithm.


**Table 6.** Discovered motifs by Multiple Sequence Alignment (MSA).

\* Taken from TumorHoPe (outside parenthesis), and starPepDB (inside parenthesis).

None of the motifs found by MSA have been reported as TH motifs (Table 6). However, one of the motifs from No. 3 CxxxR, CGGCR, contains the CXXC motif, which is the active site of thioredoxin (Trx), a relevant protein in mammalian cells that acts as an antioxidant and participates in programmed cell death inhibition and cell growth, commonly used as a target for cancer treatments [60,61]. Moreover, CWKG (No. 5) is contained in a nanoscale molecular platform used as a drug delivery system in chemotherapy to enhance the conjugation of mitomycin C to the carrier [62].

On the other hand, the AF approach STREME was used to find unaligned patterns ranging from 3–5 aa length within the overall 54 peptides and each peptide cluster. STREME has been recently reported as the most accurate and sensitive algorithm among its competing state-of-the-art partners [36]. Unlike previous algorithms [63–65], STREME uses a position weight matrix (PWM) to count position matches efficiently for a motif candidate against a Markov model built up from shuffling the same input set (control sequences). Table 7 displays the enriched motifs, discriminating the 54 lead peptides against the control sequences. The same search was also performed by considering each cluster content. Motifs appearing in more than 20% of the query sequences are listed according to their statistical significance (score).


**Table 7.** Discovered Motifs by STREME.

\* Taken from TumorHoPe (outside parenthesis), and starPepDB (inside parenthesis).

One of the motifs discovered by STREME had been reported as tumor-homing, WRP interacting with VEGF-C [66,67]. Other found motifs have been reported but not as TH, such as WRPW, PRW, WKG, and PSHL. WRPW is the binding site of the 7 Enhancer of split E(spl) basic helix–loop–helix (bHLH) protein and the hairy protein to the corepressor protein Groucho-TLE via WD40 domain [68]. PRW is part of a biocatalyst, which is conjugated to a lipid by an ester or amide bond [69]. WKG is a ribosomally synthesized and post-translationally modified peptide [70] and PSHL is a tetrapeptide that affects HIV-1 protease (PR) [71].

Lastly, 54 lead THPs were queried against PROSITE's pattern and profile databases using the search engine Motif Search of the GenomeNet suite [72]. Only two query peptides, which are shown in Table 8, had significant matches with motifs found in gonadotropinreleasing hormones (GnRH) and bombesin-like peptides.



\* Taken from TumorHoPe (outside parenthesis), and starPepDB (inside parenthesis).

These two peptide signatures and their receptors are involved in neuroendocrine signaling pathways associated with physiological states and tumors. GnRH is the hypothalamic decapeptide that plays a crucial role controlling women's reproductive cycle. GnRH binds to specific receptors on the pituitary gonadotrophic cells, but it also is expressed in other reproductive organs, e.g., ovaries, and tumors derived from the ovaries. It has been shown GnRH is involved in ovarian cancer regulation proliferation and metastasis either by the indirect signaling pathway or direct interaction with the GnRH receptors placed at the surface of ovarian cancer cells [73].

Bombesin-like peptides were initially discovered from frog skin, where they are secreted from cutaneous glands as a means of communication and defense. They were later found to be widely distributed in mammalian neural and endocrine cells represented by the neuromedin B (NMB) and the gastrin-releasing peptide (GRP), respectively. Bombesin-like peptide receptors are G-protein-coupled and have seven membrane-spanning domains, so they are involved in signal transduction pathways [74]. Growing evidence shows that bombesin-like peptides and receptors play essential roles in physiological conditions and diseases. An abnormal expression of bombesin receptors has been observed in several types of tumors, which has motivated the development of more specific and safer bombesin derivatives for tumor diagnosis and therapy [75].

The motif search by using different approaches may render a diversity of outcomes. However, some hits shared by different search approaches can support the reliability of the findings. For example, one motif found by the PROSITE search, WSY, was also encountered by STREME, an algorithm that works regardless of database and sequence similarity. Some of the motifs estimated by MSA algorithms were also identified by the AF approach STREME, such as WWW and WKG. All motifs were searched against TH databases, TumorHoPe, and starPepDB to discriminate the possible new signatures from the existing ones. New motifs appear at very low frequency within THPs (last column of Table 6–8), except CNG found by STREME, which appears 34 times in TumorHoPe and 32 in starPepDB. However, CNG has not been reported as a TH motif.

#### **4. Conclusions**

In this study, a novel methodology based on network science and similarity searching was introduced to explore the chemical space of THPs and discover potential THPs from known AMPs. Statistically, the strategy's performance transcended current supervised ML approaches used in THP predictions, demonstrating the potential of this approach. Hence, in silico predictions using the model based on representative THPs, in conjunction with TumorHPD and THPep servers, gave a high reliability to discover potential THPs. As a result, 54 lead compounds were repurposed as potential from AMPs. In the set, novel motifs with promising tumor-homing activity were proposed.

The good performance of the methodology for predicting peptide activity based on similarity searching and network science suggests its application for the prediction of other endpoints in peptides, e.g., antibacterial activity, toxicity, hemolytic, or anticancer. Our models and pipeline are freely available through the starPep toolbox software at http://mobiosd-hub.com/starpep (accessed on 2 February 2021).

**Supplementary Materials:** The following are available online at https://www.mdpi.com/article/10 .3390/antibiotics11030401/s1, Table S1. CSN parameters of similarity threshold analysis.

**Author Contributions:** M.R. worked on the datasets' extraction and curation, designed the experiments, performed SAR analysis, and performed the virtual screening, as well as drafted the initial manuscript. Y.M.-P. worked on the conceptualizing of the complex network and similarity searching methods, supervised the applications, and prepared the manuscript. H.R., G.A.-C., and A.A. worked mainly on the motif discovery analysis and drafted the initial manuscript. L.A.-M. and F.M.-R. worked primarily on the implementation of the complex network and similarity searching module and performed SAR and statistical analysis. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by USFQ's Collaboration Grant 2019–2020 (Project ID16885), Yachay Tech grant number REG-INV-18-03244 and the APC was funded by Deanship of Research at USFQ.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The starPep toolbox software and the respective user manual, as well as *SSMs*, are freely available online at http://mobiosd-hub.com/starpep (accessed on 2 February 2021).

**Acknowledgments:** Yovani Marrero-Ponce (Y.M.-P.) thanks the program *Profesor coinvitado* for a postdoctoral fellowship to work at Valencia University in 2020-21. Y.M.-P. acknowledges the support from Collaboration Grant 2019–2020 (Project ID16885). G.A.-C. and A.A. were supported by national funds through FCT—Foundation for Science and Technology—within the scope of UIDB/04423/2020 and UIDP/04423/2020. Hortensia Rodríguez (H.R.) and Maylin Romero (M.R.) acknowledge the support of Yachay Tech internal project "Therapeutic Peptides with biological activity" (REG-INV-18-03244).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Mining Amphibian and Insect Transcriptomes for Antimicrobial Peptide Sequences with rAMPage**

**Diana Lin <sup>1</sup> , Darcy Sutherland 1,2,3, Sambina Islam Aninta 1, Nathan Louie 1, Ka Ming Nip 1,4 , Chenkai Li 1,4 , Anat Yanai <sup>1</sup> , Lauren Coombe 1, René L. Warren <sup>1</sup> , Caren C. Helbing <sup>5</sup> , Linda M. N. Hoang 2,3 and Inanc Birol 1,2,3,\***


**Abstract:** Antibiotic resistance is a global health crisis increasing in prevalence every day. To combat this crisis, alternative antimicrobial therapeutics are urgently needed. Antimicrobial peptides (AMPs), a family of short defense proteins, are produced naturally by all organisms and hold great potential as effective alternatives to small molecule antibiotics. Here, we present rAMPage, a scalable bioinformatics discovery platform for identifying AMP sequences from RNA sequencing (RNA-seq) datasets. In our study, we demonstrate the utility and scalability of rAMPage, running it on 84 publicly available RNA-seq datasets from 75 amphibian and insect species—species known to have rich AMP repertoires. Across these datasets, we identified 1137 putative AMPs, 1024 of which were deemed novel by a homology search in cataloged AMPs in public databases. We selected 21 peptide sequences from this set for antimicrobial susceptibility testing against *Escherichia coli* and *Staphylococcus aureus* and observed that seven of them have high antimicrobial activity. Our study illustrates how in silico methods such as rAMPage can enable the fast and efficient discovery of novel antimicrobial peptides as an effective first step in the strenuous process of antimicrobial drug development.

**Keywords:** antimicrobial peptide; AMP discovery; genome mining; antimicrobial resistance

#### **1. Introduction**

Due in large part to the overuse and misuse of antibiotics, the prevalence of multidrugresistant bacteria is rapidly growing at a rate that cannot be matched by antibiotic discovery efforts [1]. As a consequence, the world is currently in an arms race and is at the cusp of a post-antibiotic era [1]. The slow pace of new antibiotic drug discovery, development, and regulation, combined with the accelerated emergence of resistance to existing antibiotics creates what is referred to as the "discovery void" [2]. This gap between discovery and emergence of resistance highlights an urgency to develop new antimicrobial therapeutics. One such alternative is formulations based on the antimicrobial peptides (AMPs) [3].

AMPs are short amphipathic host defense peptides that are produced in all multicellular organisms as part of the innate immune system [3]. Many AMPs operate through nonspecific mechanisms [4], such as direct electrostatic interactions with the cell membrane and immunomodulation [3], allowing for a broad spectrum of efficacy against bacteria [5], viruses [6], and fungi [7]. Furthermore, pathogens develop a slower rate of resistance to

**Citation:** Lin, D.; Sutherland, D.; Aninta, S.I.; Louie, N.; Nip, K.M.; Li, C.; Yanai, A.; Coombe, L.; Warren, R.L.; Helbing, C.C.; et al. Mining Amphibian and Insect Transcriptomes for Antimicrobial Peptide Sequences with rAMPage. *Antibiotics* **2022**, *11*, 952. https:// doi.org/10.3390/antibiotics11070952

Academic Editors: Guillermin Agüero-Chapin, Agostinho Antunes and Yovani Marrero-Ponce

Received: 9 June 2022 Accepted: 13 July 2022 Published: 15 July 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

AMPs compared to conventional antibiotics [8]. It is these qualities that position AMPs as attractive alternatives to conventional antibiotics [9].

AMPs are often produced as precursor peptides within cells that consist of an Nterminal signal peptide, followed by an acidic pro-sequence, and a C-terminal basic bioactive mature peptide sequence [3]. The acidic pro-sequence neutralizes the basic mature peptide to keep the AMP in its inactive form and the signal peptide and acidic pro-sequence together are referred to as the prepro domain [3]. AMPs are then activated by proteolytic cleavage of the prepro sequence and the release of the mature peptide [3]. While the signal peptide is often highly conserved, the acidic pro-sequence and mature AMP can be quite variable [10]. However, there is evidence that the prepro sequence can vary across different organisms [3] and even within organisms [11].

Past research has shown that amphibians, such as the American bullfrog *Rana [Lithobates] catesbeiana*, possess a rich diversity of AMPs due to their aquatic and terrestrial life cycle, where the species encounter a wide spectrum of pathogens in these two environments [11]. In amphibians, AMPs such as ranatuerin are secreted at the skin surface upon pathogen exposure and can also stimulate an adaptive immune response [12]. In contrast, insects lack a sophisticated adaptive immune system and yet are highly tolerant to bacterial infection [13,14]. This may be due to the production of AMPs by the innate immune system [14]. In insects, AMPs are found in venom or salivary gland secretions. For example, melittin, a 26 amino acid (AA) peptide is the main component of honeybee venom [13]. While there are many known amphibian AMPs, there are far fewer known insect AMPs. Amphibian AMPs have their own designated database of 1923 peptides in the Database of Anuran Defense Peptides (DADP) [15]. Additionally, they also comprise 34% (1128 sequences) of the curated Antimicrobial Peptide Database 3 (APD3) [16]. Insect AMPs, however, only contribute 10% (325 sequences) in APD3, despite being the next largest non-mammalian classification. Better characterization of these AMP arsenals holds great potential in aiding the discovery of novel AMPs.

Because most AMPs under therapeutic investigation are derived from naturally occurring AMPs in various organisms [2], effective methods to discover natural AMPs would expand the number of potential candidates. Current wet lab screening protocols consist of extraction, isolation, and purification of AMPs through laborious methods such as the collection of skin secretions followed by liquid chromatography and sequence identification using mass spectrometry [17–21]. However, these protocols are costly, time-consuming, and expertise intensive. To resolve this, a scalable, rapid, high throughput in silico methodology built on genomics technologies and able to mine RNA sequencing (RNA-seq) datasets, would greatly aid in the discovery of AMPs funneling into drug development and enhancement processes. There are in silico AMP discovery methodologies presented in earlier studies [22–26], most of which start with processed data such as assembled genomic or protein sequences. Additionally, there are several state-of-the-art tools that perform AMP prediction [27]. Because AMP precursor genes have conserved sequence characteristics, these properties can be leveraged for filtering, and their inferred mature products can be classified as an AMP or not using machine learning methods. With the current unprecedented expansion of data generation and large amounts of sequencing data available in public repositories [28], there exists a rich untapped resource for AMP discovery.

To help fill the antibiotic discovery void, we offer rAMPage: Rapid Antimicrobial Peptide Annotation and Gene Estimation, a homology-based AMP discovery pipeline to mine for putative AMP sequences in publicly available genomic resources. To classify AMP sequences, rAMPage employs AMPlify [27], an attentive deep learning model. Currently, existing AMP databases, (e.g., APD3, DADP) contain less than 4000 validated nonredundant AMP sequences in total. In comparison, we have found over 1000 putative mature AMPs in the present study, with the potential to discover thousands more. Realizing the full potential of such pipelines would require the synthesis and validation of AMP candidates. Herein, we report our results on a select list of 21 peptides we detected using rAMPage.

#### **2. Results**

#### *2.1. Identification of Putative AMPs*

Using rAMPage, we assembled ~53 million transcripts from 84 RNA-seq datasets derived from the transcriptomes of 38 amphibian (33 frogs, five toads; anurans) and 37 hymenopteran insect (eight ants, five bees, 24 wasps) species and flagged 203,758 candidate peptide sequences to be classified (Figure 1). To select a list of high-confidence putative AMPs, we collapsed duplicates from multiple samples and applied three filters: AMPlify prediction score, peptide charge, and peptide length to obtain 1137 peptide sequences. Of these, 795 originate from amphibians, and 342 from insects. Running rAMPage on all 84 datasets took one week, with all datasets (comprising < 1 billion reads) taking less than 24 h (see Supplementary Materials Figure S1 for details on the computational platform and resource usage statistics).

**Figure 1.** Statistics and attrition as the sequencing data are processed by the rAMPage AMP discovery pipeline. rAMPage processes RNA-seq datasets from raw reads to transcripts to putative AMPs. In this case, a putative AMP is defined as a sequence with an AMPlify score ≥10 for amphibians or ≥ 7 for insects, a length ≤ 30 AA, and a charge ≥ 2. Datasets with a reference transcriptome used during assembly are indicated with an asterisk. The total number of putative AMPs (*n* = 1478, including 341 duplicates) are shown in purple, discovered from a total of ~53 million assembled transcripts.

For each sequence, AMPlify [27] reports a prediction score *s* from 0 to 80, where *s* is a log-transformation of the AMPlify probability score *p*

$$s = -10\log\_{10}(1 - p),\tag{1}$$

and 80 represents the highest confidence.

We note that the training data set for the AMPlify model had an over-representation of AMPs from amphibian species [27]; hence, it is biased towards assigning higher scores for amphibian AMPs. To compensate, we have applied separate score cut-offs for the two groups: 10 for amphibians and 7 for insects. Since the majority of AMPs are positively charged, a net charge threshold of ≥+ 2 was applied. As for length, we filtered for sequences that are ≤30 AA, because shorter peptides are more cost-effective to synthesize for downstream validation studies. Figure S2 shows that the length filter used is the most restrictive filter of the three, with only 4.28% and 1.45% of the sequences for amphibians and insects, respectively, meeting this criterion.

Score, charge, length distributions, and AA compositions of the 1137 putative AMPs are characterized in Figures S1 and S4. From this set, 21 AMPs were selected for synthesis

and validation, using three prioritization strategies: "Species Count", "Insect Peptide", and "AMPlify Score" (see Section 4). The peptides have been named after the species they were discovered from (Table S1), then numbered in order using their AMPlify scores.

#### *2.2. Antimicrobial Susceptibility Testing (AST) Results*

A total of 21 of the 1137 putative AMPs (Table S2) were synthesized (Genscript Biotech, Piscataway, NJ, USA) and tested for their antimicrobial activity against *Escherichia coli* ATCC 25922 and *Staphylococcus aureus* ATCC 29213 in a minimum of three independent experiments (see Figure S5 for a full set of experimental results). In these antimicrobial susceptibility tests, AMP activity was assessed using two metrics: minimum inhibitory and bactericidal concentrations (MIC and MBC, respectively). Lower MIC and MBC values are desirable as they indicate that lower AMP concentrations are sufficient for inhibitory or bactericidal activity, respectively. AMP toxicity was measured by HC50 hemolytic concentration values—the concentration required to lyse ≥ 50% of porcine red blood cells. In contrast to MIC/MBC assays, it is desirable to have higher HC50 values. All 21 putative AMPs exhibited minimal to no hemolytic activity with HC50 values of 64 μg/mL or higher.

Of these 21 putative AMPs, three displayed moderate activity (MIC and MBC in the range 8–16 μg/mL) and four displayed high activity (≤4 μg/mL) against *E. coli* and/or *S. aureus*, all with minimal hemolytic activity, as shown in Figure 2. The characteristics of these seven sequences are described in Table 1. All seven AMPs with moderate to high antimicrobial activity have AMPlify scores greater than 25.

**Figure 2.** Antimicrobial susceptibility and hemolysis test results of seven moderately and highly active putative AMPs. AMPs were tested for their bioactivity against *E. coli* and *S. aureus* to determine minimum inhibitory and bactericidal concentrations (MIC and MBC, respectively). AMPs were also tested for their hemolytic activity using pig red blood cells to determine hemolytic concentration (HC50) values. Moderate activity (MIC and MBC in the range of 8–16 μg/mL) and high activity (≤4 μg/mL) thresholds indicated by the dashed lines. AMPs are ordered by increasing MIC values against *E. coli* ATCC 25922.

**Table 1.** Characteristics of putative AMP sequences with moderate to high in vitro bioactivity against *E. coli* or *S. aureus*. Each sequence is separated into the prepro sequence and the predicted mature peptide sequence. Conserved proteolytic cleavage sites are underlined in the prepro sequences.


\* MIC: Minimum inhibitory concentration. † *Escherichia coli* ATCC 25922; *Staphylococcus aureus* ATCC 29213.

#### *2.3. Novelty of Discovered AMPs*

To assess if the putative AMPs discovered using rAMPage are novel, a BLASTp [29] (basic local alignment search tool) protein search was performed using the 1137 sequences that met our selection criteria. Of these, 1024 sequences are reported as novel, providing no antimicrobial characterization or exact match (sequence identity = 100%; query coverage = 100%) within the NCBI non-redundant protein database [29]. The novelty analysis results for the seven moderately to highly active AMPs are presented in Table 2. Four of the queried putative AMPs (AmMa1, OdMa12, PeNi10, and PeNi14) are novel in sequence, aligning with high sequence identity (≥90%) to existing NCBI annotations [29]. Two putative AMPs (PeNi11 and TeBi1) are known and published AMPs, aligning with 100% sequence identity of the precursor protein and across the prepro and mature regions. One putative AMP (TeRu4) aligns with high sequence identity to an uncharacterized protein in the NCBI non-redundant protein database.

**Peptide ID Source Organism Highest Scoring Blastp Match Sequence Identity (%) Precursor Prepro Mature** AmMa1 *Amolops mantzorum* Palustrin-2GN3 (ADM34231.1) [*Amolops granulosus*] 97 100 93 OdMa12 *Odorrana margaretae* Odorranain-F2 (ABG76517.1) [*Odorrana grahami*] 98 100 97 PeNi10 *Leptobrachium boringii Polypedates megacephalus Pelophylax nigromaculatus Rhacophorus dennysi Rhacophorus omeimontis* Pelophylaxin-1 (Q2WCN8.1) [*Pelophylax fukienensis*] Ranatuerin-2N (AEM68233.1) \* [*Pelophylax nigromaculatus*] 82 98 86 97 77 100 PeNi11 *Leptobrachium boringii Polypedates megacephalus Pelophylax nigromaculatus Rhacophorus dennysi Rhacophorus omeimontis* Pelophylaxin-1 (Q2WCN8.1) [*Pelophylax fukienensis*] 100 100 100 PeNi14 *Bufo gargarizans Polypedates megacephalus Pelophylax nigromaculatus Rhacophorus omeimontis* Palustrin-2HB1 (AIU998997.1) [*Pelophylax hubeiensis*] 90 93 86 TeRu4 *Temnothorax rugatulus* Uncharacterized protein (XP\_024884948.1) [*Temnothorax curvispinosus*] Uncharacterized protein (TGZ47385.1) \* [*Temnothorax longispinosus*] 94 91 93 90 97 97 TeBi1 *Tetramorium bicarinatum* M-myrmicitoxin(01)-Tb1a (W8GNV3.1) [*Tetramorium bicarinatum*] 100 - 100

**Table 2.** Comparison of sequence identities (%) of the discovered AMPs with their best-known AMP blastp matches to the NCBI non-redundant (nr) protein database over the entire sequence (precursor), prepro or mature sequences.

\* Highest scoring blastp match when query sequence consists of only the mature sequence instead of the whole precursor. -: no significant alignment.

AmMa1, derived from the Mouping sucker frog, *A. mantzorum*, aligned with 97% sequence identity to Palustrin-2GN3 [30] from a species of the same genus, *A. granulosus*, differing only by two AA in the mature region (Figure S3a). Similarly, OdMa12, found in the green odorous frog, *O. margaretae*, aligned with 98% sequence identity to odorranain-F2 [31] from a species of the same genus, *O. grahami*, differing only by one AA in the mature region (Figure S3b). While these two sequences (AmMa1 and OdMa2) are very similar to known sequences, we have additionally discovered each of them in a different species of the same genus.

PeNi10 was detected in the dark-spotted frog *P. nigromaculatus*, and aligned with 82% identity to pelophylaxin-1 [32] from a species of the same genus, *P. fukienensis* (Figure S3c). We also identified PeNi10 in four other species of frogs: *L. boringii*, *P. megacephalus*, *R. dennysi*, *R. omeimontis* (Figure S4a). Although the PeNi10 precursor aligns best to pelophylaxin-1, the mature region aligns with complete sequence identity to ranatuerin-2N (unpublished).

PeNi14, also derived from the dark-spotted frog, *P. nigromaculatus*, aligned with 90% sequence identity to palustrin-2HB1 [33] from a species of the same genus, *P. hubeiensis* (Figure S6d). PeNi14 was also detected in three other species of frogs: *B. gargarizans*, *P. megacephalus*, *R. omeimontis* (Figure S7b).

Originating from the dark-spotted frog, *P. nigromaculatus*, PeNi11 aligned with 100% sequence identity to pelophylaxin-1 [32] from a species of the same genus, *P. fukienensis*, meaning it is identical to a known AMP precursor (Figure S6e). However, in addition to *P. nigromaculatus*, we also detected PeNi11 in four other species of frogs: *L. boringii*, *P. megacephalus*, *R. dennysi*, *R. omeimontis* (Figure S7c).

Found in the venom of tramp ant, *T. bicarinatum*, TeBi1 aligned with 100% sequence identity with bicarinalin [34,35] of the same species (Figure S6f). In the case of TeBi1, its precursor was partial on the 5 end, accounting for no alignments in the prepro sequence.

TeRu4, discovered in the brain of the small myrmicine ant, *T. rugatulus,* aligned with 100% sequence identity to an uncharacterized protein [36] from a species of the same genus, *T. longispinosus* (Figure S6g). While TeRu4 is not a novel protein, it is a novel mature AMP as it has not been previously characterized to have antimicrobial properties.

Additional annotation of the seven bioactive peptides (five amphibians, two insects) can be found in Table S3. The underrepresentation of insect AMPs in the literature, compared to amphibians, is further demonstrated here; while the amphibian peptides have been annotated with "frog antimicrobial peptide" domains in both InterProScan [37] and Pfam [38], the insect sequences have no protein family annotations. Figure S8 illustrates the sequence identity between AMPs identified by rAMPage and known AMPs for amphibian and insect AMPs. Although the majority of putative AMPs from rAMPage were novel sequences, previously reported AMP sequences were also identified and are a good demonstration and internal validation of the robustness of this methodology.

#### **3. Discussion**

Using rAMPage, we analyzed 84 RNA-seq datasets of 38 amphibian and 37 insect species to discover 1137 putative AMPs, 1024 of which are novel. In the present study we report our validation results on 21 putative AMPs, with over 1000 additional peptide sequences left to investigate. This list is by no means exhaustive; adjusting the described filtering parameters may yield thousands more discoveries (Table S4). Further, the rAMPage pipeline can be readily used on other transcriptome sequencing datasets, though this might call for modifications in experimental designs. For instance, in the case of bacterial RNA-seq datasets with reduced post-transcriptional polyadenylation, RNA-seq data from rRNA depleted libraries would be recommended as input for the pipeline, as opposed to data from poly(A) enriched libraries [39,40].

While the sensitivity (proportion of reference AMPs captured by the three putative AMP filters) of rAMPage is <50% (Table S5 and Figure S9) with the default filtering thresholds, the filters are implemented to select for high confidence predictions that are also easier and more cost-effective to synthesize for validation. However, as more putative AMPs are discovered and the number of reference AMPs increase in public databases, the rAMPage filters can be adjusted accordingly to report more novel AMPs.

Although rAMPage captures most putative AMPs in their complete mature form, their associated precursor sequences may be incomplete, as shown using multiple sequence alignments with Clustal Omega v1.2.4 [41] (Figure S7). However, most partial transcripts are missing sequence on the 5 end. Therefore, while the AMP precursors may be partial, the mature AMPs at the C-termini are more likely to be complete, thereby still detectable by rAMPage.

Because progress is rapid in bioinformatics, rAMPage is designed to be flexible as new technologies are developed. The pipeline is implemented as a Makefile with each step as a separate target, making the pipeline modular and providing analysis checkpoints. The tools for each step can be substituted with newer/improved tools if needed. Similarly, the pipeline is versatile and can be adapted for other sequencing technologies, for instance by assembling RNA/cDNA long reads from Pacific Biosciences of California (Menlo Park, CA, USA) or Oxford Nanopore Technologies Ltd. (Oxford, UK) instruments.

Recently, our group released AMPlify and compared its performance to other state-ofthe-art tools for AMP prediction [27]. Other machine learning methods included iAMP- pred [42], iAMP-2L [43], AMP Scanner Vr. 2 [44], with AMPlify outperforming all previously described AMP prediction tools in metrics of accuracy, sensitivity, and specificity [27]. For this reason, rAMPage employs AMPlify as its AMP prediction step, and will continue to until it is surpassed in performance. Machine learning in AMP discovery is a dynamic study, ranging from AMP sequence prediction and structure classification to de novo AMP sequence generation and design [45–47]. While there are existing methods to mine protein databases [48,49], rAMPage is an all-in-one tool to mine next-generation sequencing data directly from reads to AMP prediction.

While rAMPage can find a substantial number of putative AMPs, its main limitation lies in the fact that it uses homology-based sequence selection and machine learning-based sequence classification steps. These two steps are limited by the quantity and quality of data currently available for training the tools. The homology-based step of rAMPage would be less sensitive when there are more divergent signal sequences in the precursor genes. Similarly, the sequence classification engine in the pipeline, AMPlify, may be biased by known (and limited) classes of AMPs in the databases. However, this limitation is not restricted to only AMPlify, but all approaches dependent on AMP databases for training data sets [48–50].

Despite these limitations, which are expected to resolve over time as curated AMP sequence databases grow, a sizeable number (>1000 from 84 RNA-seq datasets) of AMPs were reported by the pipeline with the filters described herein. In the tested set of 21 peptides, seven demonstrated antimicrobial activity against a defined set of bacteria in vitro and 15 did not. We note that AST experiments can assess activity against the tested pathogens but cannot rule in or out an activity against other targets. Further, AMPs have multiple modes of action, and the AST protocol used in our study only validates direct action and does not test the putative immunomodulatory effects of these peptides, for instance. Of the seven active putative AMPs, three were moderately active, and all three are expressed in multiple amphibian species, potentially signaling the evolutionary significance of these AMPs.

An AMP of particular interest in the present study is TeRu4, due to its novelty and specificity in bioactivity. The precursor sequence of TeRu4 is 234 AA long, indicating that TeRu4 may be a multi-functional protein, such as a histone whose subsequence includes antimicrobial properties [51]. Additionally, TeRu4 showed a 36.84% sequence similarity to the spaetzle protein from the fruit fly *Drosophila melanogaster*, a protein in the insect Toll pathway, which triggers AMP production [52]. TeRu4 is also the most specific of the active putative AMPs we tested. While all the other active peptides tested are active against both *E. coli* and *S. aureus*, TeRu4 is active only against *E. coli*, a Gram-negative bacterium. This specificity may indicate a unique mechanism of action.

Despite the great promise of discovering putative AMPs with rAMPage, AMP-based drug development still faces some biological challenges, such as peptide stability and bacterial resistance. AMPs in their mature form are considered more unstable and more easily degraded by proteases. While synthesizing precursors for testing would increase stability, doing so would drive up the cost of synthesis using conventional synthetic chemistry methods. Although resistance to AMPs emerges at a slower rate compared to resistance to antibiotics, bacteria may develop resistance to AMPs through surface remodeling, modulation of AMP gene expression, proteolytic degradation, trapping, efflux pumps, and biofilms [4,53–55]. To combat specific mechanisms of resistance, targeted AMP discovery methods are being developed. A method to discover AMPs with antibiofilm activity is described in a preprint [26], and a curated 3D structural and functional repository of AMPs relevant to biofilm studies called B-AMP was recently published [26]. Finding solutions to these and other challenges in developing AMPs as replacements for conventional small molecule antibiotics is an active field of research [56–58].

#### **4. Materials and Methods**

rAMPage is an AMP discovery pipeline that takes short RNA-seq reads as input, and outputs candidate putative AMPs for wet lab validation. Since it is a homology-based method to select a list of candidates for classification, a set of reference AMPs is required. Here, we describe how input datasets and reference AMPs are collated, as well as each step of rAMPage.

#### *4.1. Collating Input RNA-Seq Datasets*

The RNA-seq reads from 38 amphibian and 37 insect species were downloaded from the Sequence Read Archive (SRA) [59] using fasterq-dump v2.10.5 (http://ncbi.github.io/ sra-tools/, accessed on 4 November 2019) from the NCBI SRA Tool Kit. Analyzing RNAseq (transcriptomic) reads enables the discovery of expressed putative AMPs. Because some RNA-seq experiments were conducted with multiple tissues or treatments, there are 75 species in total, but 84 datasets are shown in Tables S6 and S7.

#### *4.2. Collating Reference AMP Datasets*

A set of 3306 AMP sequences were collated from two high-quality AMP databases: the Database of Anuran Defense Peptides (DADP; http://split4.pmfst.hr/dadp/, accessed on 6 December 2018) [15] and the Antimicrobial Peptide Database 3 (APD3; https://aps. unmc.edu, accessed on 14 September 2020) [16]. These databases are highly curated, where sequences have been validated for efficacy. To complement this list, 3835 precursor and mature AMP sequences of amphibian and insect origin were downloaded from the NCBI non-redundant (nr) protein database [29]. These sequences are less curated, including partial sequences and sequences with only in silico prediction, etc., accounting for the difference between numbers from DADP/APD3 and NCBI in Table S8.

#### *4.3. rAMPage Pipeline*

rAMPage is implemented as a Makefile and written in bash, Python3, and R. It is publicly available on GitHub (https://github.com/bcgsc/rAMPage, v1.0 accessed on 14 February 2021). The pipeline was tested for the dependencies listed in Tables S9 and S10, and is highly customizable, with its major parameter options listed in Table S4. Command and parameters for each step can be found in Table S11. A flowchart of the rAMPage pipeline is shown in Figure 3.

Because the datasets used for rAMPage originate from publicly available genomic resources and we have no control over the experimental design or protocols used, we performed rigorous quality control. The RNA-seq reads were trimmed to remove adapter sequences using fastp v0.20.0 [60], which does not require the adapter sequences to be known, and instead infers adapter sequences from sequence overlaps between reads. This is particularly convenient when dealing with multiple datasets that possibly have different sequencing protocols.

To assemble the RNA-seq reads into transcripts we used RNA-Bloom v1.3.1 [61], a de novo transcriptome assembler that works with single and paired-end reads. RNA-Bloom is able to assemble transcriptomes without a reference but also allows for reference-guided assembly if a reference is available. It also allows for multi-sample pooling, where, for instance, reads describing multiple tissues from the same individual or different treatments for the same species are assembled together while retaining the tissues or treatment specificity of assembled transcripts.

**Figure 3.** rAMPage workflow. The rAMPage pipeline and downstream selection of putative AMPs for validation.

We note that the transcripts with a smaller number of reads have less reconstruction evidence; thus, assembled sequences with lower measured expression levels may be enriched for misassemblies. To exclude such sequences from downstream analysis, we used Salmon v1.3.0 [62] to quantify assembled transcript expression levels, and filtered out transcripts with less than 1 TPM (transcripts per million) expression.

To obtain translated peptide sequences from the transcripts, TransDecoder v5.5.0 [63] was used to conduct an in silico six-frame open reading frame (ORF) translation, and ORFs that are at least 50 AA were selected for downstream analysis. In the case of nesting ORFs, the longest ORF was chosen.

To select putative AMP precursors from this vast pool of assembled and translated sequences, we conducted a homology search against our curated reference AMP dataset (Table S8) using HMMER v3.3.1 [64] and assigned an Expect (E) value to every sequence. The E-value describes the number of hits expected by chance when searching a database of a particular size [65]. Sequences that share a certain degree of identity, with E-values of less than 10<sup>−</sup>5, were selected as putative AMP precursors.

These putative precursor (or partial precursor) sequences were then cleaved in silico using ProP v1.0c [66] to obtain putative mature AMP sequences, to be further classified. However, cleavage prediction tools only predict where the cleavage occurs, not what each resulting cleaved peptide represents, and the AMP precursor organization shows interand intra-species variability [13,67,68]. While amphibian AMPs are typically cleaved at a lysine–arginine (KR) motif and their precursor structure follows a conserved structure (prepro sequence containing acidic AA residues and a mature bioactive AMP) [67], insect AMPs are typically cleaved at an RXXR motif (two arginine residues surrounding two optional AA) and the precursor structure is not always conserved [68]. Insect AMPs are more variable in structure [13], increasing the difficulty in identifying the putative mature peptide. This difficulty is especially present in precursor structures with multiple acidic regions (UniProtKB P54684.1) or multiple bioactive regions (UniProtKB P35581.1). In such multi-peptide precursors, it is unclear whether each bioactive region is its own isoform or part of a larger mature peptide. To account for this and to possibly discover novel but perhaps not naturally occurring putative AMPs, cleaved peptides were also recombined in a manner similar to alternative splicing (Figure S10). In this procedure, the order and orientation of the cleaved peptides were maintained, and cleaved peptides that originally share cleavage sites were not recombined, with a maximum of three cleaved peptides within recombination. This recombination feature can be turned off in rAMPage's options.

The collected candidate peptide sequences were classified with AMPlify v1.0.3 [27] as AMP or non-AMP sequences. When given a sequence, AMPlify calculates a score between 0 to 80, with the score ≈ 3.0103 corresponding to the classification probability cutoff of 50% through Equation (1).

To facilitate AMP synthesis for the validation experiments, we filtered the putative AMPs by length and charge, in addition to the AMPlify score. A maximum length of 30 AA was imposed to control the cost of peptide synthesis and to reduce the number of spurious hits from recombined sequences. A minimum charge of +2 was imposed as a proxy to assess the effectiveness of an AMP, as past evidence indicates that more positively charged AMPs show higher activity, especially when their mechanism of action is membrane disruption [69]. Because AMPlify was trained on mostly amphibian AMPs, different score thresholds were imposed for amphibian (≥10) and insect (≥7) datasets to compensate for the dearth of insect training AMPs.

To annotate the final set of filtered putative AMPs, ENTAP v0.10.7, Eukaryotic Non-Model Transcriptome Annotation Pipeline [70], were used, along with UniProtKB (release 2020\_06, accessed on 15 December 2020) [71], RefSeq (release 203, accessed on 15 December 2020) [72], and NCBI non-redundant (nr) (v5, accessed on 12 December 2020) [29] protein databases. For AMPs that ENTAP failed to annotate, InterProScan 5 v5.30-69.0 [37] was run separately to annotate protein families, functions, and domains. Exonerate v2.4.0 [73] was used to align the filtered putative AMPs against the reference AMPs to assess how many of the labeled AMPs were already known AMPs. Finally, SABLE v4.0 [74] was optionally used to predict secondary structures of the filtered putative AMPs, for visualization.

#### *4.4. Selecting Filtered Putative AMPs for Validation*

To select peptides to validate from the filtered putative AMPs, we ranked their sequences using AMPlify and chose peptides based upon three selection criteria (Figure 3): "Species Count" (*n* = 7), "Insect Peptide" (*n* = 12), or "AMPlify Score" (*n* = 2), for a final total of 21 AMPs (Table S2). The sequences were first clustered using CD-HIT [75] v4.8.1 with a sequence similarity cutoff of 100%. We chose the longest sequence for each of these clusters, removing duplicate and subsumed sequences to obtain a non-redundant sequence set.

In the first selection strategy of "Species Count", sequences that were present in more than two species were chosen. In the "Insect Peptide" strategy, to balance the training bias

of AMPlify towards AMPs of amphibian origin, we specifically selected insect-originating sequences using a reduced AMPlify score cutoff of >20. In the "AMPlify Score" strategy, the two highest-scoring peptides (AMPlify score = 80.0, 69.9) with the highest charge (+4) were chosen for validation.

#### *4.5. Antimicrobial Susceptibility Testing (AST)*

Twenty-one putative AMP sequences identified using the rAMPage pipeline were validated through a minimum of three AST experiments performed independently on separate days. In these tests, the AMP activity was assessed using two metrics: minimum inhibitory concentration and minimum bactericidal concentration (MIC and MBC, respectively). MIC and MBC values were determined using procedures outlined by the Clinical and Laboratory Standards Institute (CLSI), with the recommended adaptations for the testing of cationic AMPs described previously [76]. "Wild-type" strains of *Escherichia coli* (*E. coli* 25922) and *Staphylococcus aureus* (*S. aureus* 29213) were purchased from the American Type Culture Collection (ATCC; Manassas, VA, USA) and were used for screening of antimicrobial activity. Briefly, putative AMPs were synthesized by Genscript (Piscataway, NJ, USA) and received in lyophilized format. These peptides were suspended using ultrapure water (Life Technologies, Grand Island, NY, USA; Invitrogen cat# 10977-015), and an 11 μL two-fold serial dilution of 1280 to 2.5 μg/mL was prepared in duplicate rows in a 96-well microtiter plate, before being combined with 100 μL standardized bacterial inoculum yielding a final duplicate testing range of 128 to 0.25 μg/mL. The bacterial inoculum was prepared using colonies isolated on non-selective agar and combined with Mueller Hinton Broth. This suspension was measured and adjusted to achieve an optical density of 0.08–0.1, equivalent to a 0.5 McFarland standard (1–2 × <sup>10</sup><sup>8</sup> cfu/mL). The inoculum was then diluted to a target concentration of 5 ± <sup>3</sup> × <sup>10</sup><sup>5</sup> cfu/mL; total viable counts from the final inoculum were routinely performed to confirm the target bacterial density was achieved. MIC values were reported at the concentrations in which no visible growth was detected following 20–24-h incubation at 37 ◦C. The MIC and adjacent wells were plated onto non-selective agar; the concentration in which killed 99.9% of the inoculum following additional overnight incubation was determined to be the MBC.

#### *4.6. Hemolysis Experiments*

The twenty-one putative AMPs were evaluated for toxicity using three independent hemolysis experiments performed on separate days. Whole blood from healthy donor pigs was purchased from Lampire Biological Laboratories (Pipersville, PA, USA). Red blood cells (RBCs) were washed and isolated by centrifugation, using Roswell Park Memorial Institute medium (RPMI) (Life Technologies, Grand Island, NY, USA; Gibco cat# 11835-030). Lyophilized AMPs were suspended and serially diluted from 128–1 μg/mL using RPMI in a 96-well plate, before being combined with 100 μL of a 1% RBC solution. Following a minimum 30 min incubation at 37 ◦C, plates were centrifuged and <sup>1</sup> <sup>2</sup> volume from each supernatant was transferred to a new 96-well plate. The absorbance of these wells was measured at 415 nm. To quantify hemolytic activity and determine the AMP concentration that kills 50% of the RBCs (HC50), absorbance readings from wells containing RBCs treated with 11 μL of a 2% Triton-X100 solution or RPMI (AMP solvent-only) were used to define 100% and 0% hemolysis, respectively. All centrifugation steps were performed at 500× *g* for five minutes in an Allegra-6R centrifuge (Beckman Coulter, CA, USA).

#### **5. Conclusions**

rAMPage is a bioinformatics pipeline for high throughput identification of putative AMPs in RNA-seq datasets. It fills a current void in the AMP discovery process, bridging the gap between in silico and in vitro methods. The pipeline has the potential to accelerate the discovery of novel antibiotics, with the possibility to enrich existing AMP sequence repositories. The easy-to-run pipeline design with various checkpoints and the low computational resources required to run rAMPage increase its accessibility to users. By executing rAMPage on publicly available amphibian and insect transcriptome sequencing data, we have identified over 1000 putative AMPs. Of those, we performed functional tests on twenty-one putative AMPs and demonstrated that seven have moderate to high activity against *E. coli* ATCC 25922 and/or *S. aureus* ATCC 29213. As the number of tested peptides increases, the wet lab validation results can feed back into rAMPage by augmenting the reference AMP datasets, helping refine the underlying homology and machine learning approaches. We expect rAMPage to have broad utility in the discovery of novel antimicrobials from a wide variety of transcriptome sequencing datasets.

#### **6. Patents**

Patent applications pending on the reported novel peptides.

**Supplementary Materials:** The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/antibiotics11070952/s1, Figure S1: Putative AMP filters used for amphibians and insects; Figure S2: Runtime and memory of each dataset through rAM-Page; Figure S3: Score, length, and charge distribution of filtered putative AMPs; Figure S4: Amino acid composition of filtered putative AMPs; Figure S5: Antimicrobial susceptibility and hemolysis testing of 21 putative AMPs; Figure S6: Multiple sequence alignments of moderately to highly active AMPs; Figure S7: Multiple sequence alignments of moderately to highly active AMP precursors; Figure S8: Distribution of alignment of filtered putative AMPs to mature reference AMPs; Figure S9: Distribution of reference mature AMPs; Figure S10: Approach for peptides with multiple cleavage sites; Table S1: Peptide naming convention; Table S2: Subset of 21 putative AMPs synthesized and validated against *E. coli* and *S. aureus*; Table S3: Annotation of moderately to highly active putative mature AMPs; Table S4: Major options for rAMPage; Table S5: Sensitivity of all putative AMP filter combinations; Table S6: Amphibian RNA-seq datasets; Table S7: Insect RNA-seq datasets; Table S8: Breakdown of AMP sequences in AMP databases; Table S9: Shell scripting dependencies of rAMPage; Table S10: Bioinformatic tool dependencies of rAMPage; Table S11: Command and parameters for each step of rAMPage. References [11,27,29,30,41,59–64,73–75,77–117] are cited in the supplementary materials.

**Author Contributions:** Conceptualization, I.B., C.C.H. and L.M.N.H.; methodology, I.B., R.L.W., L.C., D.L. and D.S.; software, D.L., S.I.A., K.M.N. and C.L.; validation, D.S., A.Y. and N.L.; data curation, D.L. and C.L.; writing—original draft preparation, D.L.; writing—review and editing, I.B., R.L.W., L.C., D.S., A.Y., C.L., D.L. and C.C.H., funding acquisition, I.B., C.C.H. and L.M.N.H. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by funds from Genome Canada, and Genome BC as part of the PeptAid (291PEP) and AnnoVis (281ANV) projects. Additional support was provided by the Canadian Agricultural Partnership, a federal–provincial–territorial initiative. The program is delivered by the Investment Agriculture Foundation of BC (INV106). Further funds were received from the Office of the Vice President, Research and Innovation of the University of British Columbia. Opinions expressed in this document are those of the author and not necessarily those of the Governments of Canada and British Columbia or the Investment Agriculture Foundation of BC. The Governments of Canada and British Columbia, and the Investment Agriculture Foundation of BC, and their directors, agents, employees, or contractors will not be liable for any claims, damages, or losses of any kind whatsoever arising out of the use of, or reliance upon, this information.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Accessions for input RNA-seq datasets can be found in Tables S6 and S7. rAMPage code is publicly available at https://github.com/bcgsc/rAMPage (v1.0, accessed on 14 February 2021).

**Acknowledgments:** This work was based on a prototype pipeline created by S. Austin Hammond.

**Conflicts of Interest:** I.B. is the founder of, and a shareholder in, Amphoraxe Life Sciences Inc.

#### **References**

