**Preface to "Computational Approaches in Discovery & Design of Antimicrobial Peptides"**

Antimicrobial resistance continues to be a pressing concern in the field of medicine, especially during the COVID-19 pandemic, where microbial infections were frequently observed as side-complications. To combat antibiotic-resistant pathogens, there has been a renewed interest in the use of antimicrobial peptides (AMPs). The naturally occurring AMPs have shown great promise in the search for new antibiotics. To this end, various computational approaches have been developed to assist in the search for and design of new AMPs. These computational methods range from classical homology-based and machine-learning prediction algorithms to complex similarity networks and evolutionary algorithms that use models of sequence evolution. Moreover, the improvement of high-throughput screening techniques in the discovery of AMPs from biological samples has also led to the evolution of computational approaches that aid in this biodiscovery process. This reprint is aimed at disclosing original research and review papers on in silico approaches used for the rational discovery and design of AMPs, addressed to the problem of antimicrobial resistance. The reprinted content will serve as a reference for researchers dedicated to peptide drug development.

#### **Agostinho Antunes, Guillermin Ag ¨uero-Chapin, and Yovani Marrero-Ponce** *Editors*

## *Editorial* **A 2022 Update on Computational Approaches to the Discovery and Design of Antimicrobial Peptides**

**Guillermin Agüero-Chapin 1,2,\* , Agostinho Antunes 1,2 and Yovani Marrero-Ponce 3,4,\***


The antimicrobial resistance process has been accelerated by the over-prescription and misuse of antibiotics. The World Health Organization (WHO) has listed it as one of the top 10 global public health threats. This worrisome situation has encouraged the search for new classes of antimicrobial agents, leveraging the ability of antimicrobial peptides (AMPs) to overcome resistance, mainly due to their versatile mode of action and multifunctionalities. However, the discovery of promising AMPs with relevant biological activities is a real challenge, considering the great structural diversity of the AMP class and their underrepresentation in terms of non-bioactive peptides. Consequently, several databases and computational approaches have been developed for over two decades to assist in the long development process of peptide-based drugs.

This Special Issue, entitled "*Computational Approaches to the Discovery & Design of Antimicrobial Peptides*," is mainly dedicated to state-of-the-art in silico approaches applied to the discovery and design of AMPs for therapeutic purposes. In this sense, Agüero-Chapin et al. published a comprehensive review article on emerging in silico approaches to the search for/design of bioactive peptides, from new machine learning (ML) algorithms to other non-conventional methodologies, such as complex networks and algorithms simulating peptide sequence evolution. New considerations incorporated into the biodiscovery workflow for unravelling AMPs from omics data were also analyzed [1].

Aligning with the previously mentioned review, Ruiz-Blanco et al. developed a new machine learning (ML)-based classifier for the detection of antibacterial peptides (ABPs) and their putative targets, including multi-drug-resistant (MDR) bacterial strains. The ML model was implemented in a web server called "ABP-Finder", which is one of the most stateof-the-art ABP predictors, with a proven high precision when detecting a promising peptide hit against *P. aeruginosa* during the screening of large databases such as the human urine peptidome [2]. The revision also comprised non-conventional methodologies applied to the field of AMPs. For example, García-Hernández et al. repurposed ROSE (Random Model of Sequence Evolution), an algorithm simulating sequence evolution, to generate diversityoriented libraries of peptides as one of the steps for the de novo design/optimization of antibacterial peptides (ABPs) by inhibiting the E. coli FoF1-ATP synthase [3].

On the other hand, Marrero-Ponce et al. applied network science to study the chemical space of tumor-homing peptides (THPs) by using alignment-free similarity networks and centrality measures to identify the most relevant and non-redundant THPs within the network. Such THPs, representing the original TH chemical space, were considered as references for multi-query similarity searches that apply a group fusion (MAX-SIM rule) model. The resulting multi-query similarity searching models outperformed state-of-the-art

**Citation:** Agüero-Chapin, G.; Antunes, A.; Marrero-Ponce, Y. A 2022 Update on Computational Approaches to the Discovery and Design of Antimicrobial Peptides. *Antibiotics* **2023**, *12*, 1011. https:// doi.org/10.3390/antibiotics12061011

Received: 16 May 2023 Accepted: 30 May 2023 Published: 5 June 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

predictors in the detection of THPs in benchmark datasets. This approach also served to search for THP leads and to discover TH motifs [4].

Related to the previously discussed on the new considerations and tools incorporated into the workflow for AMP biodiscovery [1], Birol et al. developed rAMPage, a scalable bioinformatics tool for identifying AMP sequences from RNA sequencing (RNA-seq) datasets. rAMPage was extensively evaluated on publicly available RNA-seq datasets from amphibian and insect species. It identified 1137 putative AMPs, of which 1024 were considered novel by homology criteria. From these, 21 peptides were tested for antimicrobial susceptibility against two bacteria species, *E. coli* and *S. aureus*, and 7 showed high activity. Thus, rAMPage can be integrated into the workflow for AMP biodiscovery to accelerate the process of antimicrobial drug development [5].

Although transcriptomic and proteomic analyses can streamline the biodiscovery workflow of AMPs by focusing on gene coding and protein expression, such high-throughput screening can be performed at the genomic level to unravel both encoded and cryptic AMPs. Hancock et al. proposed profile hidden Markov models to screen the genomes of four crocodilian species for identifying encoded cathelicidin sequences. Cathelicidins are one of the largest family of host defense peptides, showing a broad-spectrum activity against planktonic bacteria and some biofilm, as well as other beneficial features such as anti-inflammatory properties. Eighteen novel cathelicidin sequences were identified and subsequently synthetized and evaluated in vitro against planktonic and biofilm bacteria. Among the cathelicidins which displayed a broad-spectrum antimicrobial and antibiofilm activity against a range of antibiotic-resistant bacteria, As-CATH8 was highlighted because of its similar profile to the last-resort antibiotics vancomycin and polymyxin B [6]. An alternative method of searching for AMPs at the genomic level involves the in silico detection of corresponding biosynthetic gene clusters (BGCs). Ashraf et al. sequenced the genome of the Streptomyces sp. isolate BR123 and used the online antiSMASH (antibiotics & Secondary Metabolite Analysis Shell) platform to analyze the resulting assembled regions. Multiple BGCs were detected that were involved in the production of antimicrobial, antiparasitic, and anticancer compounds [7].

Two additional review papers were published in this Special Issue. Rivera-Fernández et al. examined the experimental effects of various bioactive peptides on Apicomplexan parasites, which are responsible for a range of dangerous diseases, such as toxoplasmosis, cryptosporidiosis, and malaria. They also discussed some biological and metabolomic generalities of the parasites to explain the mechanisms of action of the peptides on the Apicomplexan targets [8]. The other review paper was written by Prof. Jureti´c, which emphasizes the importance of designing multi-functional peptides that can reach intracellular targets in order to develop more effective peptide drugs. The review ranked known and novel peptides based on their predicted low toxicity to mammalian cells and broad-spectrum activity. The 20 most promising candidates that exhibited optimized cell-penetrating, antimicrobial, anticancer, anti-viral, antifungal, and anti-inflammatory activities were identified. These peptides also have the ability to form an amphipathic structure upon contact with membranes or nucleic acids [9].

Prof. Jureti´c's work also mentioned the urgent need to develop antifungal compounds that target intracellular molecules as a strategy to combat multidrug-resistant (MDR) pathogens such as *Cryptococcus neoformans*, which pose a threat to immunocompromised patients. Consequently, the study by Souza et al. designed and tested anticryptococcal AMPs and provided further information on their mechanism of action against *C. neoformans* using computational and experimental analyses [10].

**Author Contributions:** All authors wrote and reviewed the manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was financially supported by national funds through FCT—Foundation for Science and Technology of Portugal within the scope of UIDB/04423/2020 and UIDP/04423/2020 and by the USFQ Collaboration Grant (Project ID16911).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

## *Review* **Emerging Computational Approaches for Antimicrobial Peptide Discovery**

**Guillermin Agüero-Chapin 1,2,\* , Deborah Galpert-Cañizares <sup>3</sup> , Dany Domínguez-Pérez 1,4 , Yovani Marrero-Ponce <sup>5</sup> , Gisselle Pérez-Machado 6, Marta Teijeira 7,8 and Agostinho Antunes 1,2,\***


**Abstract:** In the last two decades many reports have addressed the application of artificial intelligence (AI) in the search and design of antimicrobial peptides (AMPs). AI has been represented by machine learning (ML) algorithms that use sequence-based features for the discovery of new peptidic scaffolds with promising biological activity. From AI perspective, evolutionary algorithms have been also applied to the rational generation of peptide libraries aimed at the optimization/design of AMPs. However, the literature has scarcely dedicated to other emerging non-conventional in silico approaches for the search/design of such bioactive peptides. Thus, the first motivation here is to bring up some non-standard peptide features that have been used to build classical ML predictive models. Secondly, it is valuable to highlight emerging ML algorithms and alternative computational tools to predict/design AMPs as well as to explore their chemical space. Another point worthy of mention is the recent application of evolutionary algorithms that actually simulate sequence evolution to both the generation of diversity-oriented peptide libraries and the optimization of hit peptides. Last but not least, included here some new considerations in proteogenomic analyses currently incorporated into the computational workflow for unravelling AMPs in natural sources.

**Keywords:** artificial intelligence; machine learning; AMPs; evolutionary algorithms; molecular descriptors; complex networks; proteogenomics

#### **1. Introduction**

The rise of resistance to antimicrobial agents evidenced in the last decades have caused excess healthcare costs worldwide [1]. The microbial natural resistance process, moved by evolutionary events, has been accelerated by the over-prescription and misuse of antibiotics [2]. This worrying situation has encouraged the search of new antibiotics from antimicrobial peptides (AMPs) with the ability to overcome resistance, mainly given by

**Citation:** Agüero-Chapin, G.; Galpert-Cañizares, D.; Domínguez-Pérez, D.; Marrero-Ponce, Y.; Pérez-Machado, G.; Teijeira, M.; Antunes, A. Emerging Computational Approaches for Antimicrobial Peptide Discovery. *Antibiotics* **2022**, *11*, 936. https://doi.org/10.3390/ antibiotics11070936

Academic Editor: Mire Zloh

Received: 14 June 2022 Accepted: 8 July 2022 Published: 13 July 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

their versatile mode of action [3].Indeed, AMPs are not only considered for the development of antibiotics to treat multi-resistant bacterial strains [4,5], but also they are promising for the developing of antitumoral [6], antiviral [7], antifungal agents [8] and so on.

The discovery of peptides with relevant biological activities is a real challenge considering the great diversity of AMPs in terms of origin, structure, mode of action, activity, and, on the other hand by considering the overabundance of natural-occurring non-bioactive peptides [8]. Thus, several AMP databases with associated machine learning (ML)-based classifiers have been developed for over one decade, in order to assist wet-lab researchers in the long development process of peptide-based drugs [9]. AMP databases such as DAMPD [10], CAMPR3 [11], LAMP [12], DRAMP [13], ADAM [14], DBAASP [15] have incorporated ML predictors trained with alignment-free (AF) protein features such as amino acid (aa) and pseudo-aa composition, structural features, word frequency-based features, physicochemical aa properties with influence on the AMP activity, and some others [16,17] (Table 1). Figure 1 illustrates how databases and ML algorithms have been integrated to assist the discovery/design of AMPs for the developing of peptide drugs.

**Figure 1.** Workflow illustrating peptide drug discovery. The strategy involves the screening of query peptides from either natural or synthetic sources by applying ML models trained with the information stored in AMP databases. ML algorithms also assist the optimization/design step of lead peptides by means of a fitness/selection criterion [18,19].

The prediction tools built up with Support Vector Machine (SVM) and Random Forest (RF) based classifiers have been widely applied, but hardly considered the natural imbalance between the AMPs and non-AMPs [18]. On the other hand, emerging ML techniques such as Deep-Learning Neural Networks [18–21] and those based on the Rough Set Theory [22,23] have been applied to improve certain classification pitfalls like the quality in the learning phase and the classification boundaries between AMPs and non-AMPs, respectively. Although most of the classical predictive tools have focused on if a query peptide is an AMP or not, without targeting a specific biological activity among the reported for the AMPs [24], the current tendency is to address a hierarchical multi-level classification

by downstream considering the specific biological activities of the AMPs as labels e.g., the antibacterial, antifungal, antiviral and antitumoral among others.

The most popular hierarchical multi-label classifiers, also listed in Table 1, are the following: (i) the iAMP-2L, a two-level classifier trained with Chou's pseudo amino acid composition (PseACC) [25], aimed at identifying AMPs and their five functional types [26], (ii) the iAMPpred predictor that combines compositional, physicochemical, and structural features into Chou's general PseACC for training a SVM multi-classifier [16], (iii) the MLAMP, a RF-based classifier built up with a non-classical PseACC sequence formulation incorporating a Grey Model that firstly discriminates AMP from non-AMPs, and then subclassify their biological activities into antibacterial, anti-cancer, antifungal, antiviral, and anti-HIV [27], (iv) the Antimicrobial Activity Predictor (AMAP) [28], a hierarchical multi-label classifier targeting 14 biological activities that is built up with SVM and XGboost tree [29] algorithms trained with amino acid composition (ACC) features, (v) the AMPfun webserver containing RF-based models that firstly classify AMPs and non-AMPs and afterwards address the prediction of AMPs functional activities including their possible target types [30], and more recently, the (vi) AMPDiscover [31] and the (vii) ABPFinder webservers (https://protdcal.zmb.uni-due.de/ABP-Finder/index.php; accessed on 7 March 2022) containing hierarchical RF-based classifiers built up with protein descriptors from the ProtDCal software [32] to firstly detect AMPs and antibacterial peptides (ABPs), respectively. While the AMPDiscover uses several downstream RF models to predict AMPs specific functions (antibacterial, antifungal, antiparasitic and antiviral), the ABP-Finder sub-classifies ABPs according to the Gram staining type of the potential targets (Gram-positive, Gram-negative bacteria, or broad-spectrum peptides with expected activity against both types of bacteria) by using a multi-classifier. The high success classification rates of both tools stems from considering the StarPep database [33] which is probably the most comprehensive curated repository of AMPs so far, and from performing an applicability domain (AD) analysis for the proposed ML models [31]. Both, AMPDiscover and ABP-Finder defined ADs for their corresponding RF-based models, however, AMPDiscover perform a rigours AD analysis at applying a consensus-based decision from five different approaches [31].

Despite the great number of reported ML-based tools for AMPs prediction, only few ones have considered the lack of balance among either the specific activities of AMPs or among their putative targets, as well as the AD of their corresponding models. The imbalance among AMPs and non-AMPs as well as the existing one among AMP activities was addressed by applying the synthetic minority over-sampling technique (SMOTE) during the IAMPE and MLAMP building [27,34] while the ABP-Finder addressed the imbalance among the bacterial target types of the ABPs (Gram+, Gram- and Gram+/ bacteria) by training a RF multi-classifier with a cost matrix weighting the different types of misclassified cases according to the imbalance ratio between the two classes (https: //protdcal.zmb.uni-due.de/ABP-Finder/index.php; accessed on 7 March 2022).

On the other side, artificial intelligence (AI)-derived approaches like evolutionary algorithms have been applied to optimize lead candidates retrieved from the high-throughput screening in drug discovery. Evolutionary algorithms are inspired on several evolutionary events occurring in nature; they generally start with a small population of peptides identified as putative leads due to its relevant biological activities. The optimization is carried out by the generation of offspring peptides from these initial peptides by applying several operators simulating natural evolutionary process like cross-over and mutation operators, a parent and survival selection algorithms [40,41]. A parent selection algorithm is firstly applied on the initial peptide population to select the best parent peptides for the offspring generation. The survival aims at selecting a subset of good individuals (new population) from the generated offspring peptides. Then, the new peptide population will be iteratively subjected to the parent selection algorithm, evolutionary operators and the survival selection until finding an offspring peptide meeting a termination condition (selection criteria in Figure 2). The selection criteria can be represented by a fitness function which can be

a ML model scoring peptide bioactivity. This selection process may be accompanied to experimental evaluations against the desired biological activities [42] (Figure 2).

**Table 1.** Summary of the most relevant ML approaches, from the classical to the emerging ones, for assisting the discovery of bioactive peptides from AMPs.


Methods listed in Table 1 are currently active (Accessed on 7 March 2022) \* Multi-label classifiers allowing the prediction of specific biological activities (antibacterial, antifungal, antiviral, antitumoral and others) from AMPs \*\* Hierarchical multi-label classifiers addressing firstly AMPs detection and in the second level their specific biological activities. ACC: amino acid composition, ANN: artificial neural networks, DA: discriminant analysis, DNN: deep neural networks, k-NN: k- nearest neighbours, NMR: nuclear magnetic resonance, PseAAC: pseudo amino acid composition, RF: random forest, SVM: support vector machine.

**Figure 2.** Workflow illustrating the main steps of evolutionary and genetic algorithms. Both approaches are very similar, in fact the use of evolutionary and genetic terms have been interchangeable. Genetic algorithms particularly use a fixed-length binary array to represent peptides as genes into a chromosome-like structure.

The genetic algorithm is the most popular technique among the evolutionary approaches where the peptides with promising biological properties (initial solution) are encoded as binary strings into chromosome-like structures, called genotypes. The optimization process is performed by evolving each chromosome toward optimized solutions by iteratively applying genetic recombination (crossover) operators and survival fitness functions that is somehow similar to the parent selection mechanism [40,42–44]. Optimized solutions in the case of peptides consist in generating structural entities with optimized biological properties e.g., peptides showing a trade-off among their pharmaceutical potency, solubility, haemolytic and toxicity properties [42] (Figure 2).

Despite AI-derived approaches have been largely applied to the rational search and design of bioactive peptides; most of them are represented by classical ML and evolutionary algorithms that frequently also use canonical sequence-based features as peptide descriptors and therefore have been documented in literature [18,45,46]. However, there is a growing number of emerging computational approaches effectively applied to the search/design of bioactive peptides that are comprehensively revisited here (Table 1).

Most of the non-standard approaches are represented by classical ML algorithms which are either trained with non-conventional peptide features [31] or combined with sequence alignment methods [47]. In addition to the singularity of these predictors; preprocessing steps managing the natural imbalance between bioactive and inactive peptides have been hardly applied to the AMPs predictions [27,34] as well as no big data solutions have been implemented yet to address scalability problems. As mentioned before, other lessknown ML algorithms in the field of protein/peptide science like those based on the Rough Sets Theory (RST) are being currently intended for peptide classification/design [22,48]. Moreover, a non-conventional methodology that analyses the known chemical space of bioactive peptides by similarity networks was developed to identify the most relevant ones for each specific biological activity [33]. Such representative peptides were recently used

in multi-query similarity searches against the StarPep database to repurpose AMPs for specific activities such as antiparasitic and tumour homing [49,50].

By other side, evolutionary algorithms that simulate sequence evolution have been recently applied to design/optimize peptides having a pharmaceutical activity [51,52]. Last but not least, computational tools used in proteogenomic analyses are being modified for uncovering cryptic peptides with biological activities in natural sources [53,54]. From now on we go deeper into these emerging approaches in peptide search and design

#### **2. Non-Classical Peptide Features for Bioactivity Prediction**

#### *2.1. Peptide Features Inspired in Molecular Descriptors Used in Cheminformatics*

There is a set of chemoinformatics-derived peptide features considered as "nonconventional" because of its in-house development; however, have been successfully applied in the recognition of bioactive peptides by ML-based classifiers [31,55–58]. The definition of these peptide/protein features is generally inspired on the mathematical formalisms applied to the calculation of molecular descriptors for small organic molecules [59,60], which have been traditionally used to Quantitative-Structure-Activity Relationship (QSAR) studies for drug design/search. Most of them are classified as topological descriptors since they consider the connectivity either between adjacent amino acids (aas) or between aa groups by using both algebraic and statistic invariants [32,61,62].

Those based on algebraic forms express protein/peptide structural topology through the definition of connectivity or adjacency matrices. The elements of these matrices (nij or eij) reflect topological relationships between the aas or aa groups, they are equal to 1 if i and j are adjacent otherwise take the value of 0. Topological indices (TIs) are estimated by applying several algorithms on the connectivity/adjacency matrix. The most common algorithms for the TIs calculation involve the powers of the topological matrix, the multiplication of a property vector by the topological matrix and the multiplication of vector-matrix-vector (Figure 3). Many of the most popular TIs within the cheminformatic have been defined by these algebraic formalisms, such as the Winner index (W) [63], the Randi´c invariant (χ) [64], Broto–Moreau autocorrelation (ATSd) [65], the Balaban index (J) [66], and the spectral moments introduced by Estrada [59]. Thus, many of them were reformulated to describe the spatial topology of aa sequences at different structural levels, e.g., linear sequences (1D), pseudo-secondary structure (2D) and the 3D-dimensional space [61,62] (Figure 3).

**Figure 3.** Workflow for the calculation of topological indices from several representation types of the cyclopentapeptide [CPFVC] with promising antiviral activity against the hantavirus cardiopulmonary

syndrome [67]. Each peptide representation defines a singular topological matrix (TM) encoding structural features at different degrees. In addition to the several ways to represent the topology of a peptide (linear, circular, 2D-Cartesian), several algebraic formalisms/operators can be applied on the TM to calculate different topological indices (TIs) types. n represents the nodes in the peptidic representations (linear, circular, and Cartesian) as well as in their corresponding TMs, which may contain some elements in red font (e.g., n4 and 1) to highlight differences in structural encoding from the cyclopentapeptide. N indicates the number of rows and columns of matrices involved in TI calculation.

On the other hand, there is another set of topological descriptors that also comes from the chemoinformatic field that have been applied to the identification and design of AMPs [31,52,56,58]. They are not formulated by using algebraic forms but rather they rely on descriptive statistics as invariant operators on the aa properties either along the sequence or the 3D protein structure. In this case, the 1D or 3D topology is encoded by the application of classic cheminformatics algorithms that consider the neighborhood such as autocorrelation [65], Kier-Hall's electro-topological state [68], Ivanshiuc-Balaban [69], and Gravitational-like operators [70].

#### 2.1.1. Topological Indices from Algebraic Forms

Among the TIs defined for small molecules, the spectral moments formalism probably is one of the most extended to characterize proteins and peptides structures [61,62,71,72]. The spectral moments may encode peptide structures through the definition of their corresponding topological matrixes and the application of the trace operator on the k-th power of such matrixes (Figure 3).

A sort of stochastic spectral moments applied to the electronic or charge delocalization of the aas within the peptide backbone and the entropy involved on such delocalization, were applied to model the bitter tasting threshold of dipeptides by linear discriminant and regression analyses [57]. These non-standard peptide features provided accuracies higher than 83% in the detection of bitter taste, and the regression models could explain the experimental variance of the bitter tasting threshold in more than 80%. It was shown the non-standard peptide descriptors correlate with the bitter taste as good as or even better than other well-known peptide features like the z-scale [73].

The spectral moments have been also applied to characterize bacteriocins. Bacteriocins are peptidic toxins produced and exported by bacteria as a defense mechanism to kill or inhibit the grow of other strains but the producer. The bacteriocins are very attractive for the development of new antibiotics and anticancer agents, however their high structural diversity represents a challenge for alignment-based predictive tools. Since the hydrophobicity and basicity of bacteriocins are relevant for their antibacterial activity, Agüero-Chapin et al. introduced the 2D-Hydrophobicity and Polarity (2D-HP) maps to pseudo-fold bacteriocin protein sequences in order to derive a set of spectral moments encoding information beyond the linear sequence [74] (Figure 4). These TIs are implemented in the Topological Indices to Biopolymers (TI2BioP) software [75] and were useful to build an AF model based on Linear Discriminant Analysis with a higher sensitivity (66.7%) than the attained by InterProScan (60.2%). In addition, they could detect cryptic bacteriocins, ignored by alignment methods [74].

**Figure 4.** Different structural representations for the channel-forming domain of Colicin E1 (pdb 2I88). **A**—Primary structure, **B**—Pseudo secondary Cartesian map of hydrophobicity (H) and polarity (P) (2D Cartesian (HP) map), **C**—Three-dimensional structure. The 2D Cartesian protein map is an arbitrary bidimensional arrangement (pseudo-folding) of the protein/peptide sequences bearing higher-order useful patterns than contained in linear sequences.

#### 2.1.2. Topological Indices from Descriptive Statistics

The cheminformatic-derived protein descriptors that have been widely applied to the prediction and design of bioactive peptides were developed and implemented by Ruiz-Blanco et al. in the ProtDCal software [32]. ProtDCal provides a great diversity of protein/peptide descriptors thanks to its divide-and-conquer methodology that considers both the aa properties and those estimated for groups, which can be modified by the neighbourhood through the application of classic previously-mentioned chemoinformatics algorithms. The modified properties of the aas or their resulting groups are later aggregated using statistical operators to estimate local or global descriptors either at sequence or 3D structural level. Although a more detailed description of ProtDCal's protein descriptors can be found in [32], the Figure 5 shows an schematic representation of the protein descriptor generation process of ProtDCal. The diversity of ProtDCal's protein descriptors represented by different families stems from combinatorically applying different aa properties, the ways to consider the vicinity to the target aa by several operators, the criteria used to group the aas as well as the invariant operator used for aggregating aa properties within the same array (Figure 5).

ProtDCal's descriptors have been involved in the discovery of antibacterial peptides by developing a non-conventional multi-target QSAR models [56]. Despite the AMPs selected for training were evaluated against multiple targets (Gram-positive bacterial strains), they could be integrated in the same model by modifying their ProtDCal's descriptors through the Box-Jenkins moving average operator. This operator allows modifying the sequencebased descriptors by subtracting the corresponding mean of the descriptors of all AMPs assayed against the same Gram-positive bacterial strain. This is a way to particularize a sequence-based descriptor by incorporating information about the experimental conditions or biological assays. With this kind of descriptors, the multi-target cheminformatic model displayed percentages of correct classification higher than 90.0% in both training and prediction (test) sets [56].

**Figure 5.** Schematic representation of ProtDCal's descriptors calculation. 1D and 3D protein features implies the application of vicinity operators to modify amino acid (aa) properties while 0D features estimation go straightforward to group the original aa properties according to several grouping criteria.

Similarly, the same authors also applied the Box-Jenkins moving average operator to develop non-conventional multi-task QSAR models able to predict simultaneously antibacterial activity and toxicity [58]. This time, the continuous response variables measured on AMPs such as minimum inhibitory concentration (MIC), cytotoxic concentration at 50% (CC50), and haemolytic concentration at 50% (HC50) were transformed in a binary variable labelled as (1) referred to high antibacterial activity/low cytotoxicity, and (−1) assigned to low antibacterial activity/high cytotoxicity. The ProtDCal's descriptors that usually encodes only peptide features were modified by the Box-Jenkins moving average operator in order to consider the variability implying the evaluation of the antimicrobial activity and toxicity on different biological systems. Thus, a multi-task QSAR model displayed an accuracy higher than 96% for classifying/predicting peptides was built by using LDA discriminant [58].

ProtDCal's descriptors have also been involved in the design of new peptides that inhibit the *E. coli* ATP synthase, as putative antibiotics [52,76]. ProtDCal's descriptors, implemented in PPI-Detect [77], were applied to predict interactions between peptides and the main subunits of *E. coli*'s (Ec) and human's (Hs) F1Fo-ATP synthase. Those peptide with a maximum and a minimum interaction likelihood with EcF1Fo and HsF1Fo were selected for in vitro assays. An overall of three peptides resulted attractive for further optimization steps in the design of new antibiotics [52,76].

More recently, ProtDCal's protein descriptors were successfully applied to improve the prediction performance of the existing alignment-free models by using the largest experimentally validated non-redundant peptide dataset reported to date, the StarPepDB [78], together with Random Forest (RF) classifiers [31]. Pinacho-Castellanos at al. not only built RF-based models for identifying AMPs, but also addressed the main biological activities reported for them (antibacterial, antifungal, antiparasitic, and antiviral) as endpoints. The specific functions of AMPs were either directly predicted or by a hierarchical classification that first consider the antimicrobial activity. RF-based models, developed with ProtDCal's descriptors aimed to predict specific activities of AMPs, showed a higher effectivity and reliability than 13 freely available prediction tools. The best reported models were implemented in the AMPDiscover tool [31], publicly available at https://biocom-ampdiscover.cicese.mx/ (accessed on 7 March 2022). Ruiz-Blanco et al. also applied successfully ProtDCal's descriptors to predict antibacterial peptides by using

RF-based models trained with StarPepDB instances and, in a second step they are predicted on what bacterial targets according to their Gram-staining classification could be active by using a multi-classifier. These two RF-based models were implemented in the web server ABP-Finder: https://protdcal.zmb.uni-due.de/ABP-Finder/ (accessed on 7 March 2022) which is freely available but unpublished yet.

#### *2.2. Integration of Peptide Features from Heterogenous Sources*

Considering previous experiences in protein functional classification where protein features from heterogeneous sources have been integrated to improve classification rates; we wonder if this strategy has been applied to peptide classification? In this sense, the integration/combination of alignment-based (AB) and alignment-free (AF) protein features in machine learning models have been evaluated for such purpose. For example, Galpert et al. improved orthologs classification at the twilight zone (<30% of identity) by combining AB and AF protein similarity measures in supervised big data classifiers [79]. It has also been shown that the integration of AB and AF methods gives the best exploration of highly diverse protein classes, such as the nonribosomal peptide synthases (NRPS) represented by their A-domains [80]. Other examples of feature integration methods for remote homology detection can be found in [81], and the one of Borozan's et al. [82], based on weighted aggregation which is a very inclusive approach avoiding the loss of information.

Regarding AMPs classification improvements by integrating AB and AF peptide features, an algorithm applying AB measures and the SVM algorithm trained with AF pairwise measures was published for increasing AMPs prediction sensitivity [47]. The algorithm consists in two stages. Firstly, AMPs are identified by Basic Local Alignment Search Tool (BLAST) scores, and those peptides that cannot be unequivocally identified by pairwise alignments were inputted in an SVM-based classifier built with AF pairwise similarity scores. The AF similarity scores were estimated with the Lempel–Ziv's complexity algorithm [83]. The integrative algorithm achieved higher sensitivity performance for AMPs prediction than the prediction tools implemented within the first version of CAMPR3 database [11] and the integrated method proposed by Wang et al. [84]. Wang and colleagues had previously proposed a similar algorithmic workflow where BLAST is used to firstly classify a query peptide against a training set made up by 870 AMPs and 8661 non-AMPs. Classification label is transferred to the query peptide from the matching with highest similarity score. Query peptides that did not match with any within the training set were encoded by protein features like ACC and PseACC and the aas by five of their physicochemical and biochemical properties. As the number of generated features were relatively high, a rigorous feature selection step was performed by applying both the Maximum Relevance, Minimum Redundancy (mRMR) method [85] and the Incremental Feature Selection method [86] before building a Nearest Neighbour (NN)-based predictor. The NN algorithm assign the label AMP or non-AMP to a query peptide according to the class of the nearest neighbour.

Despite the efforts for integrating AB and AF features in a classification peptide system; they have actually been combined through their corresponding algorithms and have not been included in the same model or function. In this sense, AB and AF similarity scores could be combined to build an unique classifier for AMP prediction, as Galpert et al. did it for ortholog detection [79].

#### *2.3. NMR-Based Features for Peptides*

In 2020, the IAMPE webserver (http://cbb1.ut.ac.ir/; accesed on 17 March 2022) was released for an accurate prediction of AMPs by using classical ML-based classifiers trained with both conventional and 13CNMR-based features. The non-conventional 13CNMR-based features for peptides were defined from the quantitative NMR spectra for 13C isotope of the naturally-occurring aas. Firstly, 13CNMR-based features for each aa were calculated using 13CNMR spectra signals. Secondly the aas were grouped according to their 13CNMRbased features by applying Fuzzy c-means clustering algorithm. The resulting aa clusters

were used to extract feature vectors along the peptide sequences according to classical "composition", "transition" and "distribution" patterns. Despite the new information provided by such non-conventional peptide descriptors, authors suggested their combination with physicochemical features to yield higher accuracy for the prediction of active AMP sequences [34].

#### **3. Breakthroughs of ML Algorithms in the AMP Prediction**

#### *3.1. Data Imbalance and Multi-Label Classification in the Prediction of AMPs—New Algorithm Approaches*

As mentioned in the Introduction, data imbalance is an issue to tackle in the classification of potential peptide sequences. Here, we collected some other reported solutions combining two-level classifiers with imbalance management in both, the first level binary AMP/non-AMP problem, and the second level multi-label functional type problem. For example, the authors of MAMP-Pred [87] proposed two alternative imbalance management methods: (i) under-sampling of the non-AMP class, and (ii) weighting sequences according to the imbalance ratio; the second one being eligible after the experiment process. Then, they used pruned sets and label combinations, considering label correlations, to transform the RF binary classification. For the classification assessment, the Matthew's correlation coefficient was selected for the first level, and the multi-label metrics: Exact-Match Ratio (EMR), Hamming-Loss (H-Loss), Accuracy (Acc), Precision (Precision, Recall), Ranking-Loss (RL), Log-Loss, One-error (OE), F1-Measure (F1-Mic, F1-Mac), for the second level. As they assessed, MAMP-Pred outperformed iAMP-2L (proposed in 2013 as a two-level multi-label classifier) because of the feature extraction process involved ACC and its eight physicochemical selected properties, besides the classification process.

Another example of imbalance management can be found in [88] where the authors tried to identify peptides with dedicated anti-CoV antimicrobial function on an imbalanced dataset with relatively insufficient positive data. They used NearMiss under-sampling and balanced RF to build the classification model, and the sensitivity, specificity and geometric mean for the unbiased evaluation.

Ensemble learning has also been used to cope with class imbalance in the binary AMP/non-AMP prediction tool Ensemble-AMPPred [89]. The prediction model based on ensemble methods (RF, max probability voting, majority voting, adaptive boosting, or extreme gradient boosting) was combined with feature extraction (vectors of 517 numerical descriptors representing peptide sequences), feature engineering (hybrid feature generation by the fusion of various selected features using a logistic regression model) and feature selection to improve classification accuracy after the application of a balancing clustering-based proportionate stratified random sampling that selected peptide sequences representing the positive and negative data. Thus, representative sequences selected from each cluster were used as training data, while the other remaining sequences, as testing data.

A recent report in [90] presents a multi-label framework HMD-AMP to hierarchically annotate peptide sequences into AMP/non-AMP, and then, into eleven functional classes that can be small and extremely imbalanced classes. The classification framework includes an embedding layer of protein sequences, a protein language encoder, a feature transformer and a hierarchical deep forest model. An ablation study and a reduced feature test demonstrate the effectiveness of the framework based on the detailed structural information of AMPs to improve the accuracy of the prediction model and to manage data imbalance problem. At each function prediction level, the model demonstrates a cascade forest structure where each cascade level is an ensemble of decision tree forests, and different types of forests are included to make the model diverse. It's worth noting that deep forest does not rely on backpropagation, so it is suitable for training data with either imbalance labels or small sample sizes, hence preventing the model from overfitting.

#### *3.2. Deep-Learning in the Recognition of AMPs*

The lack of samples in the positive class, as well as, the ambiguity in the negative class are key issues concerning deep learning models in AMP prediction as stated in review [91]. The starting point for knowledge discovery in this rough scenario is the correct representation of raw data. Precisely, deep learning provides a solution to the human expert dependence problem of featurization, which is known as representation learning; but also allows the application of some widely-used features in peptide machine learning by means of unsupervised embeddings (pretrained representations that can be fine-tuned with specific downstream supervised tasks), learned embeddings (usually one-hot or oneletter encoding on the amino-acid level, producing a dimension-reduced dense vector for subsequent layers), or engineered features (physicochemical or evolution-based properties).

In generative approaches for AMP discovery, recently reviewed in [92], the reliance on expertise-engineered features may limit the generation of candidates qualitatively distinct from known AMPs, or the limited number of known structures of the annotated peptides may reduce the effectiveness of structured-based models [93]. On the contrary, those attribute-controlled models based on recurrent neural networks, variational autoencoders, adversarial autoencoders, generative adversarial networks may encourage novelty of designed sequences. That is the case of the specific bidirectional conditional generative adversarial network developed in AMPGAN v2 [94] that learns data driven priors through generator-discriminator dynamics and controls generation using conditioning variables. Thus, a learned encoder mapping data samples into the latent space of the generator implements the bidirectional component that aids iterative manipulation of novel, diverse, and application-tailored candidate peptides.

The diversity target in generative models has been also tackled with a semi-supervised learning approach combined with a variational autoencoder (VAE) that can simultaneously learn from the large unlabelled peptide sequence databases and a limited number of labelled sequences as in PepCVAE [95]. In this case, a controlled generative model is learned from large unlabelled peptide database for the encoder and decoder losses, together with a much smaller labelled dataset (peptides with reported antimicrobial annotation) for the classifier loss, that is, using a large unlabelled corpus to capture the distribution with VAE, and a small labelled corpus to learn a certain controlling attribute code.

Also with VAE generation, the report in [96] used the Giant Repository of AMP Activity (GRAMPA) [97] to apply an improved automated semi-supervised approach based on stochastic long short-term memory (LSTM) encoder-decoder networks for generating promising new sequences and an experimental investigation, resulting in low minimal inhibitory concentration (MIC) AMPs against *Escherichia coli*, *Staphylococcus aureus*, and *Pseudomonas aeruginosa*. In this approach, the decoding from the same point in the latent space may result in a different peptide being generated and is dependent on the random seed set prior to running. Thus, the VAE is trained on a curated AMP dataset followed by the development of a regression model for activity prediction and the subsequent development of the latent space. Then, new AMP sequences are identified from the latent space (by sampling) and, subsequently, the AMPs are produced and characterized with their corresponding MIC values. This method produces peptides with similar MICs as the input reference peptides, but with novel sequences not found in the training set; at the same time, without imposing thresholds on peptide characteristics or otherwise biasing output post-sequence generation. As a result, a list of newly generated active peptides includes non-canonical AMPs of low helicity and low net charge.

An alternative data augmentation method is presented in [98] to improve the recognition of neurotoxic peptides via a convolutional neural network model. Novel potential neurotoxic peptides were discovered from the best performed model in a simulation dataset among the transcriptome of an endemic spider of South Korea, *Callobius koreanus* (*C. koreanus*). The BLAST-based augmentation method was intended to improve the generalization property of the model.

Specifically, for candidate short peptide generation, the authors in [99] combined LSTM generation and bidirectional LSTM classification to design short novel AMP sequences with potential antibacterial activity against *E. coli*. The models were trained using sequences with proven low MICs and tuned with Bayesian hyperparameter optimization.

Some other deep learning methods are reviewed in [100] as a promising approach to meet short-length peptides requirements [101] where they combine deep convolutional neural network with reduced aa composition comprising clustered aas on the basis of evolutionary information, substitution score, hydrophobicity, and contact potential energy. As a result, a short peptide of 20 aa was selected by Deep-AmPEP30 from sequences extracted from the gut commensal fungus *C. glabrata* genome and experimentally validated to have antibacterial activities similar to ampicillin.

In a recent review [39], the authors presented some reasons to select ML approaches over deep learning ones in AMP prediction and design, when a fair balance is required among high accuracy and generalization capability, interpretability and low computational cost. However, some improvements like parameter tuning or model hybridization may lead to more robust deep learning classifiers in this field.

#### *3.3. Rough Sets Theory in the Classification of AMPs*

As an example of model hybridization, the authors in [48] presented a codon-based genetic algorithm combined with rough set theory methods to find a peptide active against *S. epidermidis*. Their rough set theory method provided explicit boundaries between physicochemical properties that active sequences possess and inactive sequences do not possess. Since this method produced explicit decision components, they could test sequences containing multiple components. They were inspired in their previous publication [22] where they tried to reduce false discovery rate with a rough set-based classification method generating similarity rule set boundaries between active and non-active peptides based on their physicochemical properties.

Another example of the rough set theory application can be found in [102] where they implemented a rough set classification framework together with a Rough Set Quick Reduct and Rough Set Relative Reduct based on an improved Harmony Search algorithm to classifyAnti-HIV-1 peptides. Specifically, they hybridized a rough set-based feature selection technique, with population-based meta-heuristic algorithms (Particle Swarm Optimization), to classify the peptide sequences and solve dimensionality problems. Besides, a fuzzy set classification framework [23] was also intended to cope with limited and severely skewed high-dimensional space for short (<30 aa) AMP activity prediction.

#### **4. Other Methodologies Than Classical ML for Identifying and Modelling AMPs**

#### *4.1. Homology-Based Prediction and Modelling of AMPs*

The most popular approaches in addition to classical machine learning algorithms for the identification of AMPs in databases are local alignments which are represented by BLAST and FASTA tools [103,104]. Although local alignments have been successfully applied by using iterative rounds and filters such as the presence of signal peptides, aa patterns and gene vicinity during AMP searches [105–107], they can fail in identifying some AMP sequences [55], if compared to pattern-matching searches [107,108]. There are two main ways for searching for sequences by patterns: hidden markov models (profile-HMM) [109] or regular expressions (REGEX) [110]. Both the REGEX and profile-HMM methodologies work similarly for the identification of AMPs. Firstly, a set of homologous sequences are aligned and the multiple sequence alignment (MSA) is inputted to a specific program such as Pratt [111] or HMMER [112] for the identification of REGEX patterns or profile-HMM, respectively. Currently, instead of building REGEX patterns and profile-HMMs, they are available for many protein families at the Prosite [113] and Pfam [114] databases where a query sequence/peptide can be identified. The pattern/profile-based searches for AMPs can be complemented with the identification of signal peptides and other structural filters. In fact, improved versions of databases have incorporated MSA, profiles-HMM and molecular modelling for AMPs detection [11,115,116]. Even so, when a query peptide could be high-scored against profile-HMMs from different peptide families, it is advisable to use a prediction tool combining different protein signature recognition methods such as InterProScan [117].

As we previously mentioned, the molecular modelling complements AMPs patternbased searches by confirming expected three-dimensional (3D) structural features characterizing them. The 3D structure can be also integrated into homology-based searches to identify homologous sequences sharing low identity but retaining a great structural conservation. Such structural similarities have enabled the detection of AMPs in databases with higher accuracy [9]. When the structures of peptides are not experimentally elucidated, two modelling techniques are suggested: homology-based and *ab initio* modelling. The homology-based modelling uses the structure experimentally-determined from available homologous as template to infer the 3D structure of novel peptides, but rather using structural than sequence similarities, especially if the query and template are remote homologous [118]. By contrast, the *ab initio* method is used to predict the structures of peptides with yet unknown homologs. The prediction of the 3D protein structure starts from scratch requiring an energy model describing the main factors that contribute to the stability of the folding process and an efficient method for the conformational space exploration of the peptide chain [119]. However, homology-based approaches are more suitable for peptides when homologs are identified. In fact, the second release of BACTIBASE [115] incorporated the MODELLER program [120], as a tool for the 3D structure prediction of query peptides by homology to known bacteriocins [115]. Besides, the incorporation of 3D structure prediction tools to AMP databases provide another filter for an accurate identification of query AMPs, the 3D structure can be used for scoring peptide-cellular target interactions which is a crucial step for the *in-silico* design of novel AMPs [121].

Especially, since classical ML algorithms were recently reviewed in [18], we have addressed here, traditional homology-based approaches applied to the search and the modelling of AMPs, and will describe next, the most singular algorithms.

#### *4.2. Emerging ML-Independent Methodologies for AMP Prediction/Design*

In this section, we will address other emerging methodologies regardless of ML approaches and classical homology-based approaches for AMP discovery. Firstly, we want to highlight the AMPA webserver (http://tcoffee.crg.cat/apps/ampa, accessed on 7 March 2022), developed to detect antimicrobial stretches within the protein sequences. The antimicrobial regions detected in proteins can serve as new templates for AMP design, especially those uncovered within proteins no related with the defense function. AMPA algorithm does not depend on homology-based searches since it estimates an antimicrobial index (AI) to each aa, derived from half-maximal inhibitory concentration (IC50) values in high-throughput screening experiments, encoding the propensity of each aa to be present in an AMP sequence. As low IC50 values correspond to high activity, aas with low AIs are more likely to be part of an AMP. By applying a sliding-windows analysis along the protein sequence, AMPA generates an antimicrobial profile based on the AIs. Those regions scored below certain threshold are considered putative antimicrobial domains [122]. The singularity of this approach is that it doesn't either rely on building machine learning models or similarity searches against AMP databases. However, potentially conserved antimicrobial regions can be checked in conjunction with the T-coffee alignment tool [123].

On the other hand, complex networks have been applied to explore the chemical space of AMPs aimed to discover structural entities with promising biological activities that also could serve as template for peptide drugs design/optimization. In this sense, Marrero-Ponce et al. were the pioneers on this topic by publishing a seminal of related works [33,78,124]. Firstly, Marrero-Ponce et al. analyzed both the diversity among 25 AMP databases and the showed within each one. The study revealed some AMP databases contained common sequences showing certain overlapping degree. After removing duplicates among AMP databases, a representative set of 16 990 non-redundant

AMPs was collected, which probably was the most comprehensive and exhaustively curated AMP dataset at that moment [124]. This relevant dataset was further enriched and structured in a graph database called StarPepDB (http://mobiosd-hub.com/starpep/; accessed on 17 March 2022) integrating 45 120 unique peptide sequences from 42 AMPs databases (Figure 6), with their metadata (origin organisms, function, biological target, source database, chemical modifications, cross-referenced entries to UniProt, PDB and PubMed) [78].

**Figure 6.** Chronological listing of AMP databases used in StarPep Database (StarPepDB) compilation. After collecting web pages from a large variety of bioactive peptide databases (see Table 1 in Ref. [78]), their contents were integrated into a graph database that holds total of 71.310 nodes and 348.505 relationships. In this graph structure, there are 45.120 nodes representing peptides (unique sequences) and the rest of the nodes are connected to peptides for the describing metadata.

StarPepDB has a star-like network architecture where a central node represents the peptide sequence and is connected to neighbour nodes labelled with the metadata. The edges depict a relational and unidirectional connection of the central node by a using a set of selection criteria "produced by", "assessed against", "*related to*", "*compiled in*" with its corresponding metadata nodes such as the origin, target, function and database, respectively. Peptide nodes besides the sequence also contain peptide's ID and length, while the metadata nodes have the 'name' property and relationships have the 'db-ref' property (referred as source database) [78]. Finally, different network topologies can be visualized by applying filtering criteria on StarPepDB. For example, it is possible to display a network of those peptides (central nodes) "*related to*" (edges) function "antibacterial" (metadata node) and "*compiled in"* (edges) the ADP database (metadata node).

Thus, the StarPepDB structure together with the StarPep toolbox allows building customized networks and their visualization. The visual and analytics exploration of the network by extracting some centralities measures (e.g., weighted degree or harmonic centralities) allows identifying the most relevant bioactive peptides in the network (Figure 7). Furthermore, peptide subsets can be either retrieved from the graph database by sequence identity searches or by applying filtering criteria such as peptide length, sequence motifs/patterns, physicochemical properties, and other metadata.

More recently, the same research group encoded each peptide sequence with a set of molecular descriptors bearing non-redundant structural information to set alignment-free (AF) pairwise similarity/distance relationships among the peptide nodes of the network by using a general pipeline as show in Figure 7. The resulting chemical space represented by these AF similarity networks are explored by visual inspection in combination with clustering and network science techniques [49,50].

Here, we show the chemical space network (CSN) of 174 non-redundant Anti-Biofilm Peptides (ABPs) (Figure 8) by applying the StarPep Toolbox flowchart represented above. Networks become more interpretable through visual inspection if having a community structure. Note that communities of ABPs may represent some biologically relevant regions

from the chemical space where bioactive compounds reside. Hence, we have explored the CSNs by varying the similarity threshold until a well-defined community structure emerged. In this way, a final CSN has been analyzed by adjusting the similarity threshold to 0.65, at network density of 0.0068, achieving 20 ABP outliers (singletons) with atypical or unique sequences (Figure 8). Also, for each peptide discovered to be a relevant node, additional information (metadata) is available in Supplementary Materials (File S1, SI1-A and B).

**Figure 7.** StarPep Toolbox flowchart. A flow diagram guiding the automatic construction and visual graph mining of similarity networks (see Figure 1 in Ref. [33]). Networks can be clustered, and communities are optimized using the Louvain method [125]. Moreover, the centrality of each node can be particularly measured by harmonic, community hub-bridge, betweenness, and weighted degree. Centrality is crucial to perform scaffold extractions because peptides are ranked according to their centrality score, and then redundant sequences are removed, prioritizing the most central. Thus, scaffold extractions depend on the type of centrality applied.

**Figure 8.** Visualizing the similarity network (Chemical Space Network, CSN) of a set of 174 nonredundant Anti-Biofilm Peptides (ABP\_98% identity) at threshold t = 0.65 and density = 0.068, using the (**A**) three main PCAs as coordinated of each ABPs, and (**B**) Fruchtermann Reingold layout algorithm. Node colour represents the community (e.g., the biggest communities represented by cluster 3, 10 and 12 are in blue, purple and green colours, respectively), and node size symbolizes the centrality values. There are 20 ABP outliers (singletons). This figure has been created using the software starPep toolbox (version 0.8), available at http://mobiosd-hub.com/starpep; accessed on 17 March 2022.

Once a community structure is found, we rank nodes in decreasing order according to the community Harmonic centrality measure for retaining the top-*k* of the ranked list. Particularly, the top 10 exposes densely connected groups of nodes like cliques, which are defined to be complete subgraphs. These related sequences may be forming families in the chemical space of ABPs. These central peptides within each local leading community are given in SI1-B, and they may be representing sequence fragments or naturally occurring peptides that could be identified as starting structures for lead discovery. For instance, the peptide starpep\_00000, starpep\_05561, starpep\_00361 are the most central nodes of the CSN (all in cluster 10). ABPs starpep\_03668, starpep\_04267, starpep\_00004 and starpep\_07895, starpep\_12531, starpep\_012529 are more central inside Communities 3 and 12, respectively (Figure 8 and SI1-C).

As can be observed in Table in SI1-C, some neighbor nodes within the communities may be representing a family of similar ABPs. Another example of closely related sequences can be seen in the 3 members of the Cluster 3 (see all ABPs in Community 3 in SI1-B). The peptides inside this cluster have the same length of 12 aas. So, it is expected that there are many ABPs with similar centrality values in the CSN, and it is advisable to extract some non-redundant ABPs from communities than just selecting the highest-ranked ones. To clearly extract central but non-redundant ABPs from each cluster (scaffold extraction, see Figure 7), we sort ABPs according to the decreasing order of their harmonic values. Then, the redundant sequences are removed at a given % of sequence identity. We have used an identity cutoff of > 35% to consider that a particular sequence is related to already-selected central ABPs and, as a consequence, removed from the CSN. Finally, the non-redundant 44 ABPs were ranked according to their decreasing values of Harmonic measure. The sorted list is given in SI1-D, and the top ranked peptides are those having relatively small similarity paths to all other nodes in the CSN.

This workflow allows the extraction of the most representative nodes/peptides describing the biologically-active chemical space (SI1-D). This representative subset can be used for multi-query similarity searches against peptide databases to retrieve all possible hits (Figure 9). The multi-query similarity search consists in using both the most central/representative nodes of the network communities and also the so-called singletons (isolated peptide nodes) as references/queries to retrieve the most similar peptides from databases by using local alignments. The best matches against the reference/query chemical space are determined by the maximum fusion rule by firstly ranking-down the similarity scores, to retrieve the best match between a query peptide and a target database and afterwards the best similarity scores are ranked for all reference peptides. Some studies have demonstrated that fusion by similarity scores and the maximum fusion rule are the best parameters for these models [126,127].

The integrated collection of 45 120 bioactive peptides registered in StarPepDB (http: //mobiosd-hub.com/starpep/; accessed on 17 March 2022), that probably is the largest and most diverse bioactive peptide database to date, can be used for the discovering of central peptide nodes targeting an specific biological activity in the Chemical Space Networks (CSNs) and for taking advantage of them in multi-query similarity searches [33]. In this sense, Marrero-Ponce et al. explored different similarity networks of antiparasitic peptides (APPs) from StarPepDB to identify the most relevant and non-redundant APPs, that were later used as queries in similarity-based searches to identify potential APPs among non-labelled peptides as such in the StarPepDB. The proposed multi-query similarity search strategy outperformed state-of-the-art machine learning models aimed at APPs prediction like the AMPDiscover (https://biocom-ampdiscover.cicese.mx; accessed on 17 March 2022) and the AMPFun (http://fdblab.csie.ncu.edu.tw/AMPfun/index.html; accessed on 17 March 2022) webservers [30,31]. The methodology will also permit the design of new APPs by using the motifs found among the repurposed APPs [49]. More recently, a similar workflow using CSNs was applied to identify the most relevant tumorhoming peptides (THPs) within the StarPepDB. Such THPs were considered as queries (Qs) for multi-query similarity searches that apply a group fusion (MAX-SIM rule) model. The resulting similarity searching models outperformed state-of-the-art tools for THPs detection, and the best one was applied to repurpose AMPs from the StarPepDB as THPs. Novel THP leads were identified as well as new motifs accounting for their TH activity [50].

**Figure 9.** Schematic representation of the group fusion and similarity searching processes. Qi is a i peptide from a query/reference dataset, n is the number of peptides contained in a query dataset, S is identity coefficient between M and Q obtained by local alignment with Smith-Waterman algorithm, m is the number of peptides included in the target dataset. The similarity threshold is related to the percentage of identity.

#### **5. Models of Sequence Evolution for the Design and Optimization of Bioactive Peptides**

Several *in silico* computational approaches inspired in molecular evolution events have been applied to the design and optimization of a peptide with a promising biological activity, known in medicinal chemistry as a "leading compound". These algorithms are aimed to produce offspring peptides from a parent (hit peptide) until the "desired property" is meet according to selection criteria conducted either by ML prediction models or by biological assays (Figure 2). The offspring generation process can be iterated until reaching optimized peptidic scaffolds showing a trade-off between desirable/undesirable activities. The simulation process for generating offspring have evolved from inducing random mutations within the peptide sequence until guiding such aa substitutions under directed evolution concepts [41,128,129]. Although, algorithms inducing random mutations are commonly applied to generate sequence diversity in the peptide library, they could render unpredictable results that should be carefully analysed with selection algorithms. By contrast, computational algorithms inspired on directed mutagenesis have focused the design and optimization of "leading peptides" by guiding the generation of peptide offspring incorporating secondary structure features that influence positively on the antimicrobial activity such as amphipathic helices, kinked amphipathic helices, and other structures aimed to interact with lipid membranes [130].

Schneider et al. were the pioneers to apply simulated molecular evolution (SME) algorithms as a strategy for a rational peptide design by coupling the *in silico* generation of peptidases cleavage sites of 12 residues long to a selection mechanism represented by trained ANN [131,132]. The design was oriented to this region by generating offspring from a 12-residue sequence/peptide (parent sequence) which was iteratively mutated until meeting the best ANN quality classification metrics, used as a selection criterion of the design. The offspring sequence simulation was performed by introducing random mutations according to Gaussian-distributed probability values around the parent sequence. The mutation degree (small or large) is then conditioned by the estimation of position-specific mutability and the selected aa distance matrix [131,132]. As the position-specific mutability

is averaged resulting the same for every position in the sequence; the aa mutation degree is determined by the aa substitution/scoring matrix type such the Grantham matrix [133], the Myata matrix [134], and the Risler matrix [135].

This SME approach was later applied by the same group to the optimization of anticancer peptides (ACP) aimed at improving their membranolytic activity and cell-type selectivity [51,136]. In [51], a known α-helical ACP served as the parent sequence for the generation of the offspring (ACP-derivatives). So that the generated offspring peptides retained similarity with the initial structural/property space and thus enabling a systematic optimization; the mutation function was controlled. This time the SME approach was accompanied with experimental measurements as a selection criteria or fitness objective within the optimization scheme. They used the half-effective concentration (EC50) on the breast cancer cell line MCF7 and the secondary structure preferences by circular dichroism (CD) spectroscopy as experimental filters. A similar SME protocol was applied in [136] to optimize the cell-type selectivity of the highest-scored candidate toward non-cancer cells and human erythrocytes. This candidate termed AmphiArc2 peptide resulted from the screening of virtual libraries generated by more advanced algorithms incorporating secondary structures features (alpha and amphipathic helices) that influence positively on the membranolytic action [130]. AmphiArc2 was selected as a parent sequence in the SME algorithm in which the mutated sequences (offspring) are generated from it. The offspring was scored according to a fitness function, defined by the anticancer activity and selectivity with respect to non-transformed cells. The best offspring was selected as a parent for the following optimization iteration [136].

Although the SME approach and the generation of oriented libraries toward certain secondary structures, relevant for the interaction with lipid membranes, have represented a step forward in the design and optimization pipeline of AMPs and ACPs [130,136], there still room for improving the simulation of molecular evolution of the offspring peptides. In this sense, algorithms that traditionally have been used for simulating sequence evolution in the field of molecular phylogenetics were recently applied to provide more rationality to the peptide library generation [52]. These algorithms were initially developed to evaluate the accuracy of MSA and phylogenetic reconstruction tools by generating sets of related simulated protein sequences from known phylogenies. The most representative ones are: ROSE (Random Model of Sequence Evolution) [137], SIMPROT (Simulation Protein Evolution) [138], and INDELible (Insertions and Deletions Simulator) [139]. In general, they are controlled by several evolutionary parameters such as tree topology, evolutionary distance matrices, mutation rate, insertion and deletion probabilities to simulate the evolution of offspring from a parent sequence. Ruiz-Blanco et al. incorporated the ROSE algorithm into the *de novo* design pipeline of peptide inhibitors of *E. coli* ATP synthase [52,76]. As parent peptides, both the natural inhibitor (IF1) of the mitochondrial ATP synthase and fragments of interfaces involved in protein—protein interactions between subunits of *E. coli* ATP synthase, were selected to generate peptide libraries. The residue conservation degree on these parent peptides was identified by MSAs within each class. A consensus parent peptide with its corresponding conservation scoring profile was estimated so different mutation rates to each position in the sequence could be assigned. This mutation probability vector together with a user-defined phylogenetic tree with a known topology and branch lengths guided the probabilistic function performing mutations, insertion and deletions on the parent peptide [52,76]. On the other hand, the sequence diversity of the offspring peptides in the library can be controlled by calibrating ROSE parameters against the pairwise identity [81]. A predefined binary phylogenetic tree with 1023 nodes and depth 9 implemented in ROSE was used in [52,76] for the generation of diversity-oriented libraries. The Figure 10 shows a schematic description of the ROSE algorithm.

Peptide libraries were screened by the PPI-Detect [77], an SVM-based model that predicts peptide interactions with both domains of the *E. coli* and human ATP synthases. As selection criterion, the high-scored interacting peptides with the *E. coli* ATP synthase but showing low values with the human's were subsequently evaluated by in vitro inhibition tests. At applying advanced SME algorithms involving more evolutionary models/parameters like ROSE makes easier subsequently screening steps to find lead peptides at high success rate.

**Figure 10.** The binary mutation guide tree used by ROSE to mutate the parent/root peptide. The binary tree topology is determined by the number of nodes (1023), depth (9) and average distance (dav = 5–20 PAMs). Peptide library may be selected either from internal or terminal nodes of the tree. The identity percentage of the offspring peptides respect to the parent/root peptide is colouredillustrated. Red colour means closely-related peptides to the parent while blue colour represents those distantly-related ones.

#### **6. Considerations in the Workflow for the High-Throughput Discovering of Bioactive Peptides**

#### *6.1. Brief Comparisons between High-Throughput (HT) and Classical Methods*

The classical approach for discovery of bioactive peptides has changed from analysing biological extracts/fluids to perform a wide-genomics and proteomics search. In this sense both next-generation sequencing (NGS) technologies and mass spectrometry (MS)–based proteomics combined with bioinformatic tools have provided suitable approaches for the large-scale identification of bioactive peptides outperforming the classical methods. These last ones usually include a purification step combined with bio-guided assays, which require higher amount of biomass from the subject organism. Although they can determine the biological activity of bioactive compounds relatively at high accuracy, are time-consuming and the yield of bioactive compounds is low as well as the coverage of the chemical space [140]. On the contrary, the HT analyses can be performed with around 1 cm3 or 0.5–1 g of fresh or preserved tissues, for genomic/transcriptomic or proteomic purposes, respectively [53,141,142]. Generally, the HT methods allow covering the whole picture for potential bioactive compounds much faster. Despite HT methods usually require of powerful computational resources, both NGS and MS-based proteomics are becoming cheaper and their corresponding workflows are continuously optimized within the discovery process as well, resulting in a long-term sustainable approach [143,144]. Moreover, HT OMICs technologies yield a big amount of free public data, allowing the decentralization of the knowledge for the biodiscovery process.

Hence, the integration of OMICs approaches is more recommendable than the classical ones at the early stage of bioactive peptide discovery. However, bioassays-guided methods are still valid and complementary at advanced phases of the research [140,145].

#### *6.2. Optimized Workflow for the Large-Scale AMPs Discovery from Profiled/Unexplored Organisms*

Despite the advances in the discovery of bioactive peptides, improved protocols are still needed to increase the accuracy in both their large-scale identification and functional characterization, which is a major challenge, nowadays. Figure 11 illustrates the overall steps for the HT bioactive peptide discovery from model and unexplored organisms.

In order to analyze OMICs data released by NGS and MS-based HT proteomics, several computational/bioinformatic tools and platforms have been developed. Among them, for the *de novo* genome/transcriptome assembly we can mention, i.e., MIRA [146], Spades [147], CAP3 [148], OASES [149] and the Trinity package [150] including the *de novo* assembler and the TransDecoder for ORFs prediction (https://github.com/TransDecoder/ TransDecoder/releases; accessed on 17 March 2022). Other ALL-IN-ONE licensed software like the toolbox CLC Genomic Workbench (CLC Bio-Qiagen, Aarhus, Denmark) [151] and OMICsBox (BioBam Bioinformatics, Valencia, Spain) [152], have integrated several tools for the complete workflow, including the *de novo* assemblers, custom/online/cloud functional annotation options with Blast+ [153], eggNOG [154], KEGG [155], providing as well as a set of functional analyses and statistical tests (i.e., Gene Ontology, deferential expression analyses and enrichment).

Among the NGS analyses, the RNA-seq has gained relevance because it can explore the coding regions of the genome by assembling, annotating and comparing expression profiles of the resulting transcripts [141,156]. Since elucidating the transcriptome demands lower computational cost than whole genome, and also provide useful information, its number has increasingly growth in databases. In this sense, transcriptomes from the same or related species are translated, usually with the TransDecoder or Six-Frame Translations Tool (S-FTT) (https://github.com/iracooke/protk; accessed on 17 March 2022), then annotated, and thus considered as reference database for improving protein identification in proteomics analyses from a target organism [157]. These are the grounds of proteogenomic analyses where genomic, transcriptomic and proteomic data are combined to assist the discovery of peptides from MS–based proteomic data, especially if they are not present in protein databases such as UniprotKB and other related ones (i.e, Swiss-Prot, TrEMBL and UniRef), the protein section of NCBI, Mendeley and ProteomExchange consortium [158]. On the other hand, the proteomic data can also be used to confirm gene expression [159].

**Figure 11.** Optimized workflow for the high-throughput (HT) AMPs discovery from profiled and/or unexplored organisms. The figure summarizes the main phases in the AMPs discovery using genomic,

transcriptomic and proteomic data from profiled or underexplored organisms. The figure depicts the pipeline for de novo HT discovery from un(der)explored organisms using OMICs approaches (shown in the top-left panel), and from nucleotide and proteomic information available at public databases (top-right). Genomic information publicly available at NCBI (Genome database https://www.ncbi. nlm.nih.gov/genome/; accessed on 17 March 2022) and transcripts encoding protein sequences under 100 aa length provided by the Transcriptome Shotgun Assembly (TSA) database (https://www.ncbi. nlm.nih.gov/genbank/tsa/; accessed on 17 March 2022), can be screened with the computational tool ampir for fast genome-wide prediction of AMPs [160]. Likewise, the remaining transcripts encoding peptides sequences ranging 5-100 aas length, usually discarded in transcriptomic analyses, can be translated with the six-frame translations tool (S-FTT) [157,161] after ORFs prediction with the TransDecoder. Considering bioactive peptides include animal toxins which are usually rich in cysteine, the aa sequences obtained with S-FFT can be either analyzed by the Proteomic toolkit (https://github. com/iracooke/protk; accessed on 17 March 2022) to identify cysteine-rich regions to discover novel Short Secreted Cysteine Rich (SSCRs) peptides, or by the Machine Learning (ML) tool ToxClassifier, that enables a simple and consistent discrimination of toxins from non-toxin sequences [162]. In addition, new tools like the pypgatk package [163] can recover a significant number of cryptic peptides of biomedical interest from pseudogenes, long non-coding RNAs (lncRNAs) and other non-canonical coding transcripts produced by alternative splicing. These filtering tools can be applied together CD-HIT [164] to screen nucleotide databases before custom and non-redundant peptide databases building for proteogenomic analyses or HT annotation. Finally, the StarPepDB with its associated tools [33] may have several roles within the presented workflow by providing nonredundant bioactive databases and also at reducing custom peptide databases with the identification of the most relevant peptides for proteogenomic analyses. Moreover, bioactive peptides detected in HT screening can be classified and clustered with StarPep in different categories according to their biomedical potential (e.g., AMPs, antitumor, antibacterial, antiparasitic, etc.).

In general, the overall proteomic approach for the discovery of bioactive peptides includes the following steps: (*i*) protein digestion, (*ii*) peptide separation, (*iii*) peptide fragmentation and MS spectra acquisition, (*iv*) peptide identification using MS spectra database by similarity searches or by *de novo* sequencing. In this sense, steps (*i*) and (*ii*) are addressed by several sample preparation protocols which selection determine the best yields/results. Specifically, for bioactive peptide discovery, it is advisable the solid-phase-enhanced sample-preparation (SP3) protocol [165] since it reaches a wider coverage of peptides than the filter-aided sample preparation (FASP) [166]; moreover, is less complicated and faster than the in-solution digestion (ISD) [167].

Besides to protocol improvements in sample processing [161], there have been advances in the peptide identification step by applying several computational strategies that have also refined their bioactivity prediction [159]. In addition to use transcriptomic data to increase peptide detection accuracy, the inclusion of custom databases is being applied to characterize the part of the proteome that remains unannotated. In this sense, composite databases have been explored for a deeper proteomic characterization of the salivary glands from *Octupus vulgaris* looking for revealing underexplored bioactive peptides/toxins from previous studies [54,157,161]. The composite database comprised data from the UniProtKB, built from *de novo* transcriptome assembly of Anterior (ASGs) and Posterior Salivary Glands (PSGs), combined with those retrieved from all transcriptomes available from the cephalopods' PSGs. In addition, a comprehensive non-redundant AMPs database [124] was also included to provide additional insights about bioactive compounds such as putative AMPs [54]. In a previous work the same AMP subset was also considered as custom database to characterize the Ascidian tunic proteome by shotgun proteomics [53]. The computational analysis of the raw data implied searches against the Uniprot database (Bacteria and Metazoan section) and the AMP database. The Ascidian tunic revealed the presence of AMPs from both eukaryotes and prokaryotes and the "Biosynthesis of antibiotics" pathway was among the most significant ones, which support this tissue as an interesting reservoir of bioactive peptides/toxins and its role on the interactions Ascidians

and their associated organisms. The AMP subset integrated in these previous analyses was published by Aguilera-Mendoza and probably was the most comprehensive and nonredundant AMP database reported so far [124], that later was updated in the StarPepDB (http://mobiosd-hub.com/starpep/; accessed on 17 March 2022) [78], as mentioned above.

Other important handicaps in the workflow of proteogenomic analysis are the False Discover Rate generated at analysing large protein/peptide databases [168–170] and the probable loss of information represented by small size transcripts encoding protein fragments < 100 aas that could be discarded by the TransDecoder [54,157,161], the tool dedicated to identify candidate coding regions within transcripts generated by *de novo* RNA-Seq, and such small-sized fragments could account for bioactive peptides. In order to perform a wider proteome analysis looking for uncovered AMPs and peptide toxins in the PSGs of *O. vulgaris,* contigs discarded in previous proteogenomic analyses (<100 aas) were translated with the S-FTT and then included in the protein database [54]. To optimize further proteogenomic analyses (i.e., time of analyses, FDR), or peptide annotation, sequences redundancy should be reduced with the CD-HIT [164] since the S-FTT generates many peptides sharing high similarities that could affect the overall peptide identification when increasing the FDR [170].

Other filters within the computational pipeline to process proteomic data have been applied to refine the search of peptide toxins against both canonical and custom databases. For example, the search can be framed against those toxins/peptides having signal peptides, responsible for their transport and secretion. Signal peptides have shown to contain common features across all life kingdoms [171]. In addition, cysteine-rich secretory proteins (CRISPs), small toxins (<100 aas) commonly found within the secretions of animal venoms, can be extracted from protein databases, to enrich reference databases for increasing proteomic toxin peptides detection [172]. Besides, the custom protein/peptide database can also be screened with ML-based tools e.g., ToxClassifier, that enables simple and consistent discrimination of toxins from non-toxin sequences [162], allowing the discovery of novel toxin-like bioactive peptides. Moreover, the fast genome-wide prediction of AMPs, using the ampir R package [160] can be used in the pipeline to retrieve novel peptides with antimicrobial signatures from public nucleotide databases, *de novo* transcriptomes/genomes assemblies, or as a filtering step before using S-FTT. More recently, new tools for the creation of proteogenomic databases considering the translation of pseudogenes, long non-coding RNAs (lncRNAs) and other non-canonical coding transcripts produced by alternative splicing, have allowed the identification of a significant number of cryptic peptides that may show interesting biological activities [163].

#### **7. Concluding Remarks**

Protein features inspired on molecular descriptors from chemoinformatics have emerged as successful predictors for AMPs activities. Particularly, ProtDCal's descriptors have been recently incorporated in two RF-based webservers (AMPDiscover and ABPFinder) targeting AMPs predictions as well as their specific activities and putative bacterial targets. Moreover, ProtDCal's descriptors have been involved in the design of antibiotic peptides by predicting their interaction to druggable targets from *E. coli*.

Among the recent ML approaches, undoubtedly DNNs have been the algorithm of choice for AMPs prediction in emerging tools. However, recently it has been shown that deep learning models' performance in AMP prediction is comparable to the one of classical ML algorithms being their use mostly advisable when the performance gains justify the associated computational cost.

Currently, the network science implemented in StarPep is being applied as one of the top emerging approaches, regardless of ML, to assist the search and design of bioactive peptides through the identification of lead peptides within the known chemical space. On the other hand, methodologies that simulate sequence evolution in the phylogenetics field have been repurposed to assist the optimization of such peptide leads by generating diversity-oriented libraries which are strictly controlled by evolutionary parameters.

New considerations in analysing genomic, transcriptomic and proteomic data for AMPs discovery from either profiled or underexplored organisms are being also applied. Several filtering steps have been proposed to reduce the FDR in AMPs detection when custom databases are included, but at the same time, to encompass the highest number and diversity of peptides as possible.

#### **8. Future Research Directions**

Despite a great diversity of peptide features (classical and non-classical) that has been used in AMPs prediction/design, most of those features are sequences- or property- based; however, the 3D structural information of AMPs has not been deeply exploited for such aims [173–175]. Although experimental determinate 3D structures of AMPs are used in minor proportion than their sequences, the 3D structure prediction tools are becoming more accessible and less computational demanding when considering new advances in both software and computer architectures [176,177]. These facts will ease the gradually inclusion of 3D structural features in the prediction models.

Another alternative for the inclusion of higher structural information in AMPs encoding is the use of artificial representations, which have been commonly used in comparative analyses of DNA and proteins and in QSAR-type modelling [81]. The integration of peptide features from heterogeneous sources e.g., from pairwise alignments and peptide sequences into the same classifier could be another outlook for improving the classification rates of AMPs. The main problem is to figure out a framework to integrate them (the resulting features, not the source methodologies) into ML models training. As a clue for future research directions, the alignment- based and -free similarity measures were successfully integrated for training bigdata ML-based classifiers for orthologs detection [79]. Bigdata solutions applied to the prediction/optimization of AMPs have not been explored yet in spite of the fact that the number of AMPs has grown in databases as well as the number of features/descriptors that can be derived from them. Bigdata platforms could be applied when performing virtual screening of millions of peptides, especially if they are described with computationally demanding structural descriptors. As previously mentioned, it would be advisable that future ML models for the AMP prediction could consider the natural imbalance ratio between AMPs and non-AMPs as well as the existing one among the AMPs activities. Moreover, the prediction of AMPs activities should be addressed with fuzzy-based models since they generally show overlapping activities which are not evenly-distributed within the AMP population [23]. Therefore, the resulting predictions for AMPs activities may be scored with probability values and not only treated as a binary value. On the other hand, for peptide leads optimization, the offspring generation step is crucial for the overall process. This step generally is carried out by evolutionary algorithms that introduce structural diversity among child peptides somewhat randomly. Although these AI-based algorithms have been continuously evolving to guide such diversity in order to gain optimization efficiency; there is still room for improvements in this direction. Thus, the algorithms commonly used in phylogenetics for simulating sequence evolution could provide more rationality to the generation of offspring peptides since they have been designed with more evolutionary parameters that can be strictly controlled [52,76].

Finally, StarPep is probably the most promising methodology regardless of ML approaches, that has been reported so far. The complex network theory implemented in this tool has provided a different outlook to address several steps in peptide drug discovery process. StarPep bears particular analysis tools that have not previously reported for peptides, such as (*i*) the chemical space analysis of AMP databases by similarity networks, (*ii*) the identification of the most representative and non-redundant subset of AMPs from the original chemical space, (*iii*) the mapping of unlabelled peptide datasets on similarity networks built with the representative AMPs (*iv*) the multi-query similarity searches using representative peptides against target databases. Consequently, StarPep is becoming in a competing tool to the existing ML-based methods since it has being giving clues of improved classification rates [49,50], and because of its great potentialities for the identification and optimization of new peptide leads from either in silico generated peptide libraries or released data by the Omics techniques (Figure 11).

The effort of StarPepDB developers to gather all AMP databases in a non-redundant database [124] has shown a direct impact for the AMPs prediction tools [31]. However, the annotation quality for the reported AMPs must still be improved as well as the information on their biological or molecular targets. It is urgent that AMPs activity evaluations can be harmonized under the same protocols to construct more reliable benchmark datasets for the accuracy sake of the computational analysis tools. The diverse computational methods available for AMPs discovery are a powerful tool for the accurate design of peptide drugs. The growing availability of 3D structural descriptors and scoring functions will allow developing more effective in silico peptide drug design technologies. The assembling of ML methods with peptide-protein docking and molecular dynamics seems to be an effective alternative as well [178]. If all these aspects were considered for the computationalassisted search/design of peptide drugs, the next-generation of AMP leads will be more valuable for developing therapeutic agents to face challenging health problems such as cancer, infectious diseases and more recently, COVID-19.

**Supplementary Materials:** The following are available online at https://www.mdpi.com/article/10 .3390/antibiotics11070936/s1. File S1—Additional data relative to the analysis of the Chemical Space Network built with Anti-Biofilm Peptides (ABPs) from StarPepDB. File SI1-A—Metadata associated to ABPs. SI1-B—Node properties of the 174 ABPs embedded in the similarity network representing the ABPs chemical space. SI1-C—More central ABPs according to network centrality measures. SI1-D—ABPs selected as queries for multi-query similarity searches.

**Author Contributions:** G.A.-C. and A.A. designed and structured the overall review as well as drafted the initial manuscript. D.G.-C. analyzed emerging ML algorithms for the AMP prediction. D.D.-P. gathered the new considerations in OMICs analysis/workflow for the biodiscovery of AMPs. Y.M.-P. worked on the conceptualizing of the complex networks and similarity searching methods applied to AMPs discovery. G.P.-M. and M.T. looked for the non-standard peptide features used for AMPs prediction. G.A-C. reviewed the application of algorithms simulating sequence evolution to peptide drug design. G.A.-C., D.G.-C. and A.A. participated in the review-edition of the manuscript All authors have read and agreed to the published version of the manuscript.

**Funding:** Yovani Marero-Ponce was supported by the USFQ Collaboration Grant (Project ID16885). This research was supported in part by the Strategic Funding UIDB/04423/2020 and UIDP/04423/2020 through national funds provided by FCT and the European Regional Development Fund (ERDF) in the framework of the program PT2020, by the European Structural and Investment Funds (ESIF) through the Competitiveness and Internationalization Operational Program—COMPETE 2020 and by National Funds through the FCT under the project PTDC/CTA-AMB/31774/2017 (POCI-01-0145- FEDER/031774/2017).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **References**


## *Article* **ABP-Finder: A Tool to Identify Antibacterial Peptides and the Gram-Staining Type of Targeted Bacteria**

**Yasser B. Ruiz-Blanco 1,\*, Guillermin Agüero-Chapin 2,3 , Sandra Romero-Molina 1, Agostinho Antunes 2,3 , Lia-Raluca Olari <sup>4</sup> , Barbara Spellerberg <sup>5</sup> , Jan Münch <sup>4</sup> and Elsa Sanchez-Garcia 1,\***


**Abstract:** Multi-drug resistance in bacteria is a major health problem worldwide. To overcome this issue, new approaches allowing for the identification and development of antibacterial agents are urgently needed. Peptides, due to their binding specificity and low expected side effects, are promising candidates for a new generation of antibiotics. For over two decades, a large diversity of antimicrobial peptides (AMPs) has been discovered and annotated in public databases. The AMP family encompasses nearly 20 biological functions, thus representing a potentially valuable resource for data mining analyses. Nonetheless, despite the availability of machine learning-based approaches focused on AMPs, these tools lack evidence of successful application for AMPs' discovery, and many are not designed to predict a specific function for putative AMPs, such as antibacterial activity. Consequently, among the apparent variety of data mining methods to screen peptide sequences for antibacterial activity, only few tools can deal with such task consistently, although with limited precision and generally no information about the possible targets. Here, we addressed this gap by introducing a tool specifically designed to identify antibacterial peptides (ABPs) with an estimation of which type of bacteria is susceptible to the action of these peptides, according to their response to the Gram-staining assay. Our tool is freely available via a web server named ABP-Finder. This new method ranks within the top state-of-the-art ABP predictors, particularly in terms of precision. Importantly, we showed the successful application of ABP-Finder for the screening of a large peptide library from the human urine peptidome and the identification of an antibacterial peptide.

**Keywords:** antibacterial peptide; machine learning; AMPs database; StarPep; Gram staining-based target; peptide library screening; human peptidome

#### **1. Introduction**

Antibiotic resistance is a life-threatening health problem worldwide, and one of the main causes of death in developing countries [1,2]. The potential capability of peptides to overcome resistance [3] has motivated the development of new antibiotics from antimicrobial peptides (AMPs) to combat multi-drug resistant pathogens and the threats of Gram-negative infections [4,5].

AMPs are oligopeptides produced by a great variety of organisms, from prokaryotes to eukaryotes, including humans. Due to their various functions, AMPs are considered a part of the innate immune system of higher eukaryotes. The structural diversity of AMPs allows them to display a broad range of antimicrobial activity against pathogenic agents, including viruses, Gram-positive and Gram-negative bacteria, as well as fungi. Besides, the bacterial

**Citation:** Ruiz-Blanco, Y.B.; Agüero-Chapin, G.; Romero-Molina, S.; Antunes, A.; Olari, L.-R.; Spellerberg, B.; Münch, J.; Sanchez-Garcia, E. ABP-Finder: A Tool to Identify Antibacterial Peptides and the Gram-Staining Type of Targeted Bacteria. *Antibiotics* **2022**, *11*, 1708. https://doi.org/10.3390/ antibiotics11121708

Academic Editor: Jean-Marc Sabatier

Received: 20 October 2022 Accepted: 17 November 2022 Published: 26 November 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

selectivity of AMPs over eukaryotic cells and their different action modes make peptides excellent antibiotic candidates [3,4,6]. A widespread mechanism of antibacterial peptides (ABPs) is the destabilization and destruction of bacterial membranes. However, these peptides can also interfere with intracellular processes such as nucleic acid and protein synthesis, enzymatic modulation, and protein degradation [7–9], which is an advantage over traditional antibiotics [3,10].

Most AMPs are naturally occurring peptides that represent promising candidates for optimization in advanced steps of the drug design process [11]. AMP-based drugs have been clinically approved to treat both topical and systemic infections. For instance, polymyxins and gramicidin S were formulated for the prevention of topical infections caused by *Pseudomonas aeruginosa* and *Acinetobacter baumannii*. Colistin, a polymyxin derivative, is currently used for the systemic treatment of lung infections, especially those caused by *Pseudomonas aeruginosa* [12]. Due to its problematic resistance profile, *Pseudomonas aeruginosa* is often difficult to treat by antibiotics [13]. However, it can be targeted by a variety of different AMPs [13–15] that may be further developed into innovative therapeutics.

The specificity of peptides toward certain targets is usually highlighted as an important benefit for therapeutic intervention. Nonetheless, a downside of this feature is the associated challenge for the drug design process, given that small structural modifications can significantly influence both the activity and pharmacokinetic properties of the peptides. Consequently, optimizing the precision of tools for the screening of large datasets of peptides is of utmost relevance to improve efficiency at the early steps of drug design processes.

For over a decade, growth in the publicly available data of AMPs has been witnessed, with the subsequent development of several machine learning (ML)-based predictors integrated with AMP databases such as DAMP [16], APD3 [17], CAMP [18], CAMPR3 [19], LAMP [20], DRAMP [21], ADAM [22], and DBAASP [23]. However, most of these prediction tools only discriminate between AMPs and non-AMPs. This is a highly ambiguous outcome given the broad scope of antimicrobial activity, which typically refers to more than 20 biological functions, such as the annotations in APD3 [17].

A group of predictors addressed this issue by applying a hierarchical classification scheme where first the peptides are classified as AMPs or not, and the positive cases are then sub-divided into a couple of classes based on selected AMP functions (e.g., antibacterial, antiviral, and antifungal peptides). Examples of such predictors, which include the antibacterial function are AntiBP2 [24], ClassAMP [25], MLAMP [26], *i*AMPpred [27], AMAP [28], AMP Scanner [29,30], and AMPDiscover [31]. However, of them, only AMP Scanner vr.1 predicts a type of bacterial target (*E. coli* or *S. aureus*) for the identified ABP [29].

In this context, we implemented a two-level predictor focused on antibacterial peptides (ABPs), named ABP-Finder, whose inner classifier estimates the Gram staining type of the putative targets. This tool leverages random forest (RF) classifiers trained with peptide data extracted from StarPep, the largest up to date public database of AMPs [32]. ABP-Finder categorizes ABPs and non-ABPs in the first classification level. Subsequently, the peptides identified as ABPs are sub-classified according to the Gram staining type of the potential targets i.e., exclusively Gram-positive, exclusively Gram-negative bacteria, or broad-spectrum peptides with expected activity against both types of bacteria. The ABPs used to develop this predictor show activity against at least one of nine representative bacterial targets (see Dataset section), among which are species with known multi-drug resistance such as *Acinetobacter baumannii*, *Enterococcus faecium*, *Klebsiella pneumonia*, *Pseudomonas aeruginosa*, and *Staphylococcus aureus.* With ABP-Finder, we weigh precision as the main performance feature of the prediction. In this way, we boost the efficiency of the screening step at the early stages of the drug design process aiming at the development of peptide-based antibiotics. Remarkably, we prove the efficacy of ABP-Finder for such screenings with the identification of a peptide from the human urine peptidome, displaying antimicrobial activity against *Pseudomonas aeruginosa*.

#### **2. Materials and Methods**

#### *2.1. Data Collection and Pre-Processing*

The models developed in this study were derived from the StarPep database [32,33]. This resource, as described by the authors, is a non-redundant compendium from 40 publicly available data sources, which encompasses annotations of more than 20 functions in approximately 45,000 AMPs, with nearly 8000 entries labelled as antibacterial peptides.

Before describing the construction of our training and test sets, we point out a shortcoming of several AMP-based predictors found in the literature [16–22], whose models do not obey the first principle dictated by the Organisation for Economic Co-operation and Development (OECD) to build reliable Quantitative Structure–Activity Relationship (QSAR)/ML-based models [34] (https://doi.org/10.1787/9789264085442-en (accessed on 16 November 2022)). This principle is stated as "a defined endpoint". Commonly, AMPs are annotated as such regardless of the target, mechanism, source, the method used to study the activity, to name some characteristics. The lack of such detailed information makes the discrimination between AMPs and non-AMPs a largely ambiguous endpoint for data analysis. In consequence, several criteria must be introduced to better define the modelled data and thus bring reliability to the predicted outcome. Notably, the most recent AMP predictors [24–29,31] have designed their modeling approaches to break down the AMP annotation into three classes (typically antibacterial, antifungal, and antiviral peptides). This strategy is a suitable approach to fulfil the need for a defined endpoint.

Our work focused on the identification of ABPs. To this end, we extracted peptides from the StarPep database ranging between 5 and 50 residues, and whose composition contains only the 20 standard amino acids. To further refine the selection of ABPs, we only extracted those peptides annotated as active against at least one of the following targets: *Acinetobacter baumannii*, *Bacillus subtilis*, *Enterococcus faecium*, *Escherichia coli*, *Klebsiella pneumonia*, *Listeria monocytogenes*, *Pseudomonas aeruginosa*, *Streptococcus agalactiae*, and *Staphylococcus aureus*. In this way, we discarded entries that are annotated as ABPs without information of their targets, and those exclusively reported with activity against underrepresented targets in the entire database. The selected species cover a set of both Gram-positive and Gram-negative bacteria and are examples of relevant targets for therapeutic applications. The peptides labeled as non-ABP for our learning process are not annotated as antibacterial, against any target, in StarPep, but with a different function such as antifungal or anticancer, among others. This approach clearly carries the risk of mislabeling non-ABP in our dataset, due to insufficient annotation of the peptide in the original source. The pseudo-negative cases in the training data lead to a more stringent prediction of positive cases, and consequently lower false-positive rate and higher precision. The downside is the expected lower recall as the true positives can be also diminished. Nonetheless, the favourable precision is aligned with our stated goal of boosting the precision of the classifier instead of its recall or a combined metric such as accuracy or AUC.

Hence, we extracted a total of 22,707 peptides to design our training and testing schemes. This collection was partitioned into four datasets: training, development, validation, and test sets. The two first are intended for the learning process, while the others are meant for testing the models with hold-out data. The development (Dev) set was used to monitor the generalization of the models built during the optimization of the hyperparameters in the learning algorithm. Usually, the terms development and validation set are applied indistinctively to a dataset used for the above-mentioned purpose. In this work, we made a distinction between these nomenclatures and reserved the term validation for a hold-out set, i.e., peptides that are not used in any step of the learning process. The difference between the validation and the strict test set is that we built the validation set in a way that its peptides share high similarity (≥90% identity) with at least one peptide in the training set (excluding identical matches). In turn, the test set was built in a way that its peptides share less than 90% identity among them, and with any peptide in the training data. Consequently, the test set comprises non-redundant peptides that are also not closely represented in our training. Challenging a peptide predictor in both scenarios, one that

closely resembles the training conditions (without strict superposition), and another more distant setup, is important to assess the biasing effect on the generalization of the model due to the characteristics of the training data.

Finally, a production dataset was generated by combining the training and the development sets. The purpose of this set is to perform a final re-training of the model with an augmented dataset, while keeping the selection of descriptors and configuration of hyper-parameters as optimized with the training and development sets. Figure 1 depicts the workflow followed to obtain the four datasets.

**Figure 1.** Workflow for the preparation of the datasets. The peptides extracted from StarPep were clustered with CD-Hit and subsequently distributed among the four sets used for training and testing the predictor. The final panel of the pipeline contains information about the number of peptides in every subset as well as their classification according to StarPep.

Together with the peptide sequences and their classification as ABP or non-ABP, we also extracted, from StarPep, the information about the Gram staining type of their known targets. Accordingly, we further categorized the ABPs into three activity classes: exclusively against Gram-positive targets (Gram+), exclusively against Gram-negative targets (Gram-), and broad-spectrum peptides. The four datasets resulting from the previous splitting were also used to train and assess the secondary classifier based on the Gram staining type of the targets. For this purpose, the non-ABP peptides were removed from such datasets. Table 1 summarizes the number of peptides per type of Gram staining class in the four datasets.

**Table 1.** Number of peptides per type of Gram staining class in the training, development, validation, and test datasets.


#### *2.2. Performance Measures*

In this section, we summarize the formulations of the performance measures used to assess the different models described here. The measures are sensitivity (Sn), precision (Pr), accuracy (Acc), F1 score, and the Mathew Correlation Coefficient (MCC) [35]. All of them are formulated in terms of the elements of a binary confusion matrix: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).

$$\text{Sn} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$

$$\text{Pr} = \frac{\text{TP}}{\text{TP} + \text{FP}}$$

$$\text{Acc} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$

$$\text{F1} = 2 \frac{\text{Sn} \ast \text{Pr}}{\text{Sn} + \text{Pr}} = \frac{\text{TP}}{\text{TP} + \frac{1}{2}(\text{FP} + \text{FN})}$$

$$\text{MCC} = \frac{\text{TP} \ast \text{TN} - \text{FP} \ast \text{FN}}{\sqrt{(\text{TP} + \text{FP})(\text{TP} + \text{FN})(\text{TN} + \text{FP})(\text{TN} + \text{FN})}}$$

Besides, we define an ad-hoc measure named Fitness–Robustness Score (FRS) that is specifically used as a scoring function to tune the values of the hyper-parameters of the learning technique.

$$\text{FRS} = \left(\frac{\text{R}\_{\text{T}} + \text{R}\_{\text{CV}} + \text{R}\_{\text{D}}}{\text{3}}\right)^{2} - \left(\text{R}\_{\text{T}} - \text{R}\_{\text{CV}}\right)^{2} - \left(\text{R}\_{\text{T}} - \text{R}\_{\text{D}}\right)^{2}$$

The FRS is a quality measure that provides a consolidated value for the performance of a particular model considering its goodness-of-fit, generalization, and robustness. The first term corresponds to the average performance in the following assessments: re-substitution (RT, fitting the training data), 10-fold cross-validation (RCV, within the training data), and generalization (RD, using the development set). The other two terms weigh the robustness of the model by measuring the deviations from the performance in training samples when the model is evaluated in hold-out data (cross-validation and development set). We formulated this ad-hoc measure as a function of another base quality measure, labelled as R, which should be evaluated in the different assessment schemes. For this study, we selected the MCC as the base measure to evaluate our fitness-robustness score. In the case of the multi-classifier trained to distinguish between the Gram staining classes, the average MCC value among the three classes was used as the base measure. The average was weighted according to the number of peptides in each class.

The FRS, when computed as a function of the MCC, has an optimum maximum value of one. We leveraged this score to identify optimum values for the hyper-parameters of the random forest [36] algorithm used to develop our models.

#### *2.3. Machine Learning Approach and Software*

The classifiers developed in this work were random forest (RF) [36] predictors, based on the implementation of this technique in the WEKA environment [37]. RF belongs to the family of ensemble methods [38] with base classifiers formed by decision trees. Recently, RF has been compared with deep learning approaches showing comparable performance for modeling AMP datasets [39]. There, the authors conclude that no definitive evidence was found to support using deep-learning approaches for this problem, knowing the increased algorithmic complexity and computational cost of these methods.

Within RF, all the trees provide a prediction for every instance entering the forest, and the unified outcome is obtained as the majority vote among all the predictions. The hyper-parameters optimized during the learning process were the number of trees, the maximum number of descriptors used to build a tree (these descriptors are taken at the beginning of the training process from the global pool of attributes), and the maximum depth of the trees. In addition, the minimum number of instances in the final leaves of the trees was fixed to 10 in the case of the main classifier (ABPnon-ABP), and to five for the multi-classifier (Gram+/Gram−/broad spectrum).

The peptide descriptors fed to the learning algorithm were computed with the ProtDCal-Suite [40] using the configuration files enclosed in the Supplementary Material. The Prot-

DCal module [41] is intended for the calculation of general-purpose and alignment-free descriptors of amino acid sequences and protein structures. These features are descriptive statistics (such as the variance, average, maximum, minimum, percentiles, etc.) of the distribution of amino acid properties (such as hydrophobicity, isoelectric point, molar weight, among others), in multiple groups of residues extracted from a given protein or peptide. The program possesses additional procedures that modify the intrinsic properties of a residue according to its vicinity in the sequence, thus adding connectivity information in the descriptors. The features derived from ProtDCal have been used by us and other authors to develop machine-learning-based predictors of posttranslational modifications [42,43], protein–protein interaction [44], enzyme-like amino acid sequences [45], residues critical for protein functions [46], and antibacterial peptides [47,48], although with smaller databases. The project files enclosed in the Supplementary Material contain the setup used to compute all the descriptors employed in this work.

#### *2.4. Web Servers Available for ABPs Predictions*

In this section, we briefly describe the most relevant state-of-the-art ABP predictors that are available via web server tools. ClassAMP was among the first methods that broke down the AMP family thus allowing the prediction of ABPs specifically [25]. This tool was trained with peptides from the CAMP database [18] and used RF and support vector machine (SVM) [49] algorithms to identify antibacterial, antifungal, and antiviral peptides.

MLAMP, a multi-label classifier of AMPs was developed using a variant of Chou's pseudo amino acid composition (PseACC) features [50] to build an RF-based classifier that firstly distinguishes AMP from non-AMPs, and then subdivides the biological activity into antibacterial, anticancer, antifungal, antiviral, and anti-HIV [26].

Similarly, the *i*AMPpred predictor combines compositional, physicochemical, and structural features into Chou's general PseACC as input variables for an SVM multiclassifier [27]. This work reunited peptides from the databases CAMPR3 [19], APD3 [17], and AntiBP2 [24]. The multi-classifier uses three categories in the outcome variable: antibacterial, antifungal, and antiviral peptides [27].

The Antimicrobial Activity Predictor (AMAP) [28], with a hierarchical multi-label classification scheme, was trained with AMPs annotated with 14 biological activities in the APD3 database and a designed subset of non-AMP. The models used amino acid composition features to feed SVM and XGboost tree [51] algorithms.

The introduction of the AMP-Scanner webserver represented a significant improvement with respect to other predictors. AMP-Scanner vr.1 consists of two RF classifiers, trained with peptides selected from multiple sources [18,52,53]. The first output of the classifier is the identification of ABPs. The second is a classifier trained to distinguish between peptides with Gram-positive or Gram-negative targets, using data of *S. aureus* and *E. coli* as reference targets. The authors refer that peptides predicted with scores within the range [0.4–0.6] for both classes should be considered as active against both types of targets (broad-spectrum peptides) [29]. On the other hand, AMP-Scanner vr.2 is based on a Deep Neural Networks (DNN) classifier fed with ABP data only, obtained from the updated version of the ADP3 database [19,30].

Very recently, AMPDiscover [31] was developed by mining AMP data from StarPep [33]. AMPDiscover encompasses several binary (active/non-active) predictors of functions such as antibacterial, antiviral, antifungal, and antiparasitic peptides. The authors analyzed the performance of RF to model the antibacterial peptides data, which agrees with our choice of this learning scheme for our models.

#### *2.5. Experimental Determination of Antibacterial Activity*

Two batches of chemically synthetized peptides from different providers (KE Biochem and the U-PEP facility at Ulm University) were used to assess antimicrobial effects. Antibacterial activity was evaluated by agar diffusion as previously described [54]. Bacteria were cultured in liquid broth at 37 ◦C overnight, pelleted by centrifugation, and washed in

10 mM sodium phosphate buffer. Following resuspension, optical density was determined at 600 nm and 2 × 107 bacteria were seeded into a Petri dish in 1% agarose. After cooling at 4 ◦C for 30 min, 3–5 mm holes were placed into the 1% agarose. Peptides adjusted to the desired concentration in 10 μL of buffer were filled into the agar-holes. Following incubation at 37 ◦C in ambient air for 3 h, plates were overlaid with 1% agarose, tryptic soy solved in 10 mM phosphate buffer. Inhibition zones in cm were determined after 16–18 h incubation time at 37 ◦C in 5% CO2. LL37 at a concentration of 100 μg/mL served as positive control. Antimicrobial activity was tested on the following bacterial strains: *Bacillus subtilis*, *Streptococcus agalactiae* ATCC 12403, *Staphylococcus aureus* MRSA ATCC 43300, *Klebsiella pneumoniae* Extended Spectrum β-Lactamase (ESBL) ATCC 700603, *Pseudomonas aeruginosa* (ATCC 27853) and *Listeria monocytogenes* (ATCC BAA-679/EGD-e).

#### **3. Results and Discussion**

Below, we summarize the characteristics of the ML-based models developed in this work, as well as their performance relative to the available state-of-the-art ABP predictors. We also introduce a web server, ABP-Finder, which permits the free and user-friendly screening of large peptide libraries. Finally, we present the application of ABP-Finder for the screening of peptides obtained from the human urine peptide. Notably, ABP-Finder permitted to screen and propose a reduced set of eight ABP candidates out of an initial pool of 4696 peptides. From them, one active hit was experimentally validated with activity against *Pseudomonas aeruginosa*.

#### *3.1. Modeling Antibacterial Peptide Data*

*Feature selection:* The feature selection process comprises three steps. (*i*) First, the Information Gain (IG) [55,56] of all the descriptors was calculated with WEKA, retaining only those descriptors whose IG is >5% of the information content of the class variable. This procedure reduced an initial set of 11,298 descriptors to 2746, whose information contents are the most closely related to our end point variable. (*ii*) Secondly, the redundancy in this subset of features was removed, by clustering the descriptors using a quality-thresholdbased [57] clustering algorithm, which employs the Spearman correlation coefficient [58] as the similarity measure to group the descriptors. A correlation cut-off of 0.9 was used to form the clusters. The outcome of these steps is thus a non-redundant and smaller dataset that contains only the central attributes of the formed clusters. This step rendered 1242 attributes. (*iii*) Given the still large set of features, a last selection step was used by employing the Wrapper Evaluator and the Classifier Subset Evaluators of WEKA coupled with a genetic search algorithm [59]. The Wrapper Evaluator used five-fold cross-validation on the training data to assess the models obtained from diverse subsets of descriptors. Such models were built with an RF whose number of trees was limited to 15. Next, the Classifier Subset Evaluator used the performance with the development set to identify the most suitable pool of descriptors to train the RF. For both evaluators, the F1 measure was used to score all the assessed subsets of attributes. The genetic search employed to explore the space of all possible combinations of attributes was configured with 20 chromosomes (subsets of attributes) per population, 500 generations, and probabilities of cross-over and mutation of 0.6 and 0.1 respectively. The optimal subset resulting from these selection steps comprised 281 descriptors. A project file type IDL (Individual Descriptor Labels) is enclosed in the Supplementary Material; this project file can be uploaded directly to ProtDCal-Suite to compute the selected 281 descriptors in new peptide datasets.

*Tuning hyperparameters*: The hyperparameters of the RF were explored using a grid search according to ranges and binning schemes summarized in the top-left panel of Figure 2. The ad hoc FRS function was used to determine the optimum combination of hyperparameters' values, which was obtained with 75 trees each one built from a pool of 40 descriptors and a maximum depth of 14 splits. Such combinations of values rendered the maximum FRS at 0.517.

**Figure 2.** Tuning scheme of the RF's hyperparameters. The top-left panel summarizes the boundaries and binning of the grid search with the three hyper-parameters. This panel also shows the optimum value found for the FRS function and the values of the hyper-parameters in the corresponding solution. The remaining panels show surfaces plotted as heat maps keeping one of the hyperparameters fixed at its optimum value. The dark regions indicate the best solutions. The optimum regions are highlighted with a dashed circle. The plots highlight that the most critical parameter is the depth of the trees, while high-scored models can be obtained with almost any value of the other hyper-parameters; solutions with a depth below 10 are poorly scored.

#### *3.2. Modeling Data of Gram-Staining Types*

This model was trained with the same set of 281 descriptors obtained from the feature selection procedure to discriminate between ABPs and non-ABPs. The training, development, validation, and test sets used for this model were obtained from the splitting described in the Methods section, by removing the non-ABP present in these datasets. The ABPs were then subdivided according to the Gram-staining type of their known targets.

Due to the imbalance in the number of instances from each class, the cost-sensitive RF multi-classifier was trained by applying a cost matrix in the training process with distinct weights for the different types of misclassified cases. The cost matrix takes the form shown in Figure 3.


**Figure 3.** Cost matrix applied during the training process of the multi-classifier based on the Gramstaining types of the targets.

The multi-class classifier was built with a cost-sensitive learning scheme, which aims to balance the effective error between pairs of classes considering their different prevalence in the training data. The costs were defined as the inverse ratio of the imbalance between

the two classes involved in the matrix element, i.e., given the imbalance between Gram+ and broad-spectrum (BS) peptides in the training data is [1:14.197], then the cost of a Gram+ peptide classified as BS was fixed at 14.197 and the cost of a BS peptide classified as Gram+ remained at 1. This approach diminishes the trend towards BS predictions that originates due to the highest representation of this class in the training data.

The costs affect the training process by re-weighting the training samples in the calculation of the different misclassification errors during the training. No re-weighting is applied to the instances in the test datasets.

*Tuning hyperparameters*: Analogous to the previous model, the hyper-parameters of the RF were explored using a grid search with the ranges and binning schemes summarized in the top-left panel of Figure 4. The FRS function rendered a maximum value for a solution with 35 trees, 20 descriptors per tree, and a maximum depth of 7 splits. Such combinations of values rendered the maximum FRS at 0.185. The lower value of the optimum FRS value, compared with the ABP/non-ABP model, indicates the larger difficulty of discriminating between the three classes of Gram-staining types. Such difficulty is a natural consequence of the overlap between the classes, given that the peptides in the broad-spectrum category should gather intrinsic features of the other two classes.

**Figure 4.** Tuning scheme of the RF's hyper-parameters. The top-left panel summarizes the boundaries and binning of the grid search with the three hyper-parameters. This panel also shows the optimum value found for the FRS function and the hyper-parameters' values of the corresponding solution. The remaining panels show surfaces plotted as heat maps keeping one of the hyper-parameters fixed at its optimum value. The dark regions indicate the best solutions. The optimum regions are highlighted with a dashed circle. As in the exploration for the model ABPs/non-ABPs, the plots show that the most critical parameter is the depth of the trees. Nonetheless, the opposite trend is observed because high-scored models are only obtained with low (<8) depth values. The smaller size of the dataset for this model, as compared with the previous one, leads to the occurrence of overfitting when deep trees are trained.

#### *3.3. Applicability Domain*

Following the regulatory principles for QSAR models established by the OECD, we discuss the applicability domain (AD) of our models. Both of our models were built using peptides with lengths between 5 and 50 residues and containing exclusively the 20 standard amino acids. Thus, these length and composition boundaries constitute soft limits of our applicability domain. A quantitative approach for the AD is provided via the range of the descriptors' values in the training or production dataset. In the Supplementary Material, we provide the minimum and maximum values of the descriptors in these datasets. As part of the implementation of these predictors, we automatically evaluate whether any new peptide is found within these ranges or not. If any of the descriptor values of a new peptide falls outside the training ranges, this peptide is labelled as an outlier and the corresponding information is given in the outcome of the program.

#### *3.4. Performance of ABP-Finder in the Context of the State-of-the-Art*

*Predictors of antibacterial peptides*: We compare the performance of our models to five ML-based ABP predictors by employing the hold-out validation and test sets, respectively (Tables 2 and 3). In addition, we employ an external test set originally used by Veltri et al. [30] to assess the performance of AMP-Scanner vr2 (Table 4). We present the performance of our models obtained with the training data only, and with the production dataset. Additionally, we show the performance of our tool considering only those instances that are within the AD of our models.

**Table 2.** Comparison with external predictors in the validation set. The values in bold denote the best performance for a given measure.


AD: only instances within our applicability domain are considered as valid predictions. # AMPScanner\_v1 only considers peptides ≥ 10 AA for the predictions. \* The method was updated on 20.02.2020.


**Table 3.** Comparison with external predictors in the test set. The values in bold denote the best performance for a given measure.

AD: only instances within our applicability domain are considered valid predictions. # AMPScanner\_v1 only considers peptides ≥ 10 AA for the predictions. \* The method was updated on 20 February 2020.


**Table 4.** Comparison with external predictors in the test set built by Veltri et al. [30]. Redundant instances with our training set were removed. The values in bold denote the best performance for a given measure.

AD: only instances within our applicability domain are considered valid predictions. # AMPScanner\_v1 only considers peptides ≥ 10 AA for the prediction. \* Performance based on the model from the original training in Veltri et al. [30], where the cases in this test set are held out of the training process.

Tables 2 and 3 show that our models achieved the best precision and global accuracy in the test and validation sets. Particularly, the precision was significantly higher with ABP-Finder with respect to the other methods. This is a key feature to be leveraged when filtering large peptide libraries because the main aim during the screenings for new hits is to avoid false-positive predictions.

We also challenged our models with an external test set designed by Veltri et al. [30] (Table 4) to further assess the robustness of our predictions. This dataset is qualitatively different from our test set since it is not derived from the StarPep database as all our data, and therefore it was not subjected to any of the curation procedures carried out by the StarPep's developers.

These comparisons confirm that our RF-based models render the most precise predictions, although the sensitivity (and consequently the global accuracy) decays in this case compared with other ABP predictors. Nevertheless, we note the importance of a low false-positive rate in virtual screening analyses, which highlights the higher practical value of our predictors.

*Predictors of Gram-staining types*: Our antibacterial predictor was designed to provide an estimation of against which type of bacteria are the peptides active. Therefore, we tested how our multi-classifier performs for the Gram+, Gram−, and Broad-Spectrum classes compared to AMP-Scanner vr.1. Tables 5 and 6 summarize the comparison with respect to precision and sensitivity of our models and AMP-Scanner vr.1 on the validation and test sets, respectively. The performance measures were computed for the three classes (Gram+, Gram−, and Broad Spectrum).

**Table 5.** Comparison of ABP-Finder with AMP-Scanner\_v1 in the discrimination between Gram-staining classes within the validation set. The values in bold denote the best performance for a given measure.


AD: only instances within our applicability domain are considered valid predictions. # AMPScanner\_v1 only considers peptides ≥ 10 AA for the predictions. \* There are no instances outside the AD of the model.


**Table 6.** Comparison of ABP-Finder with AMP-Scanner\_v1 in the discrimination between Gramstaining classes within the test set. The values in bold denote the best performance for a given measure.
