Article

New Multi-View Feature Learning Method for Accurate Antifungal Peptide Detection

by Sayeda Muntaha Ferdous 1, Shafayat Bin Shabbir Mugdha 1 and Iman Dehzangi 2,3,*

1 Department of Computer Science & Engineering, United International University, Dhaka 1212, Bangladesh
2 Department of Computer Science, Rutgers University, Camden, NJ 08854, USA
3 Center for Computational and Integrative Biology, Rutgers University, Camden, NJ 08103, USA
* Author to whom correspondence should be addressed.
Algorithms 2024, 17(6), 247; https://doi.org/10.3390/a17060247
Submission received: 28 April 2024 / Revised: 23 May 2024 / Accepted: 3 June 2024 / Published: 6 June 2024
(This article belongs to the Special Issue Algorithms for Feature Selection (2nd Edition))

Abstract

Antimicrobial resistance, particularly the emergence of resistant strains in fungal pathogens, has become a pressing global health concern. Antifungal peptides (AFPs) have shown great potential as an alternative therapeutic strategy due to their inherent antimicrobial properties and potential application in combating fungal infections. However, identifying antifungal peptides experimentally is time-consuming and costly. Hence, fast and accurate computational approaches for identifying AFPs are in high demand. This paper introduces AFP-MVFL, a novel multi-view feature learning (MVFL) model for accurate AFP identification. By integrating the sequential and physicochemical properties of amino acids and employing a multi-view approach, the AFP-MVFL model significantly enhances prediction accuracy. It achieves 97.9% accuracy, 98.4% precision, an F1 score of 0.98, and a Matthews correlation coefficient (MCC) of 0.96, outperforming previous studies in the literature.

1. Introduction

Fungal infections pose a significant threat to human health, affecting over one billion people worldwide annually [1]. Unlike bacteria, fungi share similar biological characteristics with mammalian cells as eukaryotes, making it challenging to develop antifungal drugs [2]. Currently, the clinical treatment options for fungal infections are limited to polyenes, azoles, echinocandins, and a few auxiliary drugs like flucytosine, which are constrained by fungal resistance and have drug toxicity side effects [3]. Therefore, there is a critical need to expand the repertoire of antifungal drugs [4].
Antifungal peptides (AFPs) represent a class of naturally occurring peptides produced by organisms as a defense mechanism against fungal pathogens [5]. Typically consisting of 10–100 amino acids, AFPs are amphipathic, exhibit low toxicity, and act with high efficiency. Owing to these favorable characteristics, they have emerged as promising alternatives to chemical antifungal agents [6]. In contrast to traditional antifungal drugs, AFPs exhibit diverse modes of action, such as disrupting fungal cell membranes or inducing the production of reactive oxygen species (ROS) [7]. Identifying AFPs experimentally is time-consuming and costly, especially when pre-screening a large number of AFP candidates. Therefore, there is a pressing need for computational models that can rapidly and accurately predict AFPs [8].
In recent years, a wide range of machine-learning-based approaches have been proposed to predict antifungal peptides (AFPs). For instance, Fang et al. introduced a computational model called AFP-MFL (multi-view feature learning) for accurately identifying AFPs by integrating different feature groups [9]. Agrawal et al. [10] employed a combination of amino acid composition (AAC), dipeptide composition (DPC), split amino acid composition, and binary profiles to characterize peptides, subsequently utilizing a support vector machine (SVM) classifier to construct prediction models [11].
More recently, Ahmad et al. introduced a feature fusion scheme to integrate diverse peptide features, which were then used to train a deep neural network (DNN) for prediction purposes [12]. Later, Ahmad et al. proposed another innovative computational model for AFP prediction using sequential and evolutionary information extracted from peptides and employing a minimum redundancy and maximum relevance (mRMR) based method for feature extraction [13]. Most recently, Zhang et al. proposed a machine-learning-based approach for accurately identifying and classifying AFPs by developing a comprehensive dataset of known AFPs and applying various feature extraction techniques to represent peptide sequences [14].
Most existing studies rely heavily on expert-knowledge-based handcrafted features to characterize intrinsic peptide properties [15]. These approaches struggle to handle short peptide sequences. For instance, descriptors such as AAC, DPC, and reduced amino acid alphabet composition (RAAAC) only consider the frequency of individual amino acid residues, overlooking the sequential order of amino acids in the peptide sequence. The integration of different feature vectors into a high-dimensional feature space has been used to achieve a more expressive feature representation [16]. Nevertheless, this often leads to the curse of dimensionality, introducing redundant information and resulting in heightened computational complexity [17].
To address these issues, we present AFP-MVFL, a new machine learning model based on multi-view feature learning (MVFL) aimed at accurately identifying AFPs. The AFP-MVFL model leverages a diverse range of sequence-based information and physicochemical properties to comprehensively represent peptide characteristics [10]. By incorporating multiple properties of peptides, our model enhances its ability to capture the patterns underlying antifungal activity. AFP-MVFL achieves 97.9% accuracy, 98.4% precision, an F1 score of 0.98, and an MCC of 0.96, outperforming previous studies found in the literature. The AFP-MVFL model and its source code are publicly available at https://github.com/MuntahaMim/AFP-MVFL.git (accessed on 1 May 2024).

2. Materials and Methods

The initial steps include importing the necessary libraries, loading the training and test datasets, and converting labels into binary values (0 for the negative class and 1 for the positive class). The training data are then preprocessed via scaling to ensure uniformity. Next, a random forest classifier with 100 estimators is initialized and fitted to the scaled training data to determine feature importance. At this stage, features with an importance above the mean are selected and extracted. For evaluating classifier performance, stratified 10-fold cross-validation is employed. The general architecture of our model is presented in Figure 1. This section elaborates on the materials and methods used to build AFP-MVFL.
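As an illustration of this workflow, the following Python sketch uses scikit-learn with synthetic placeholder data; it outlines the steps described above (scaling, random-forest-based feature selection with a mean-importance threshold, and stratified 10-fold cross-validation) and is not the exact published implementation.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data standing in for the extracted descriptors and labels
rng = np.random.default_rng(0)
X_train = rng.random((200, 500))      # 200 peptides, 500 descriptor values
y_train = rng.integers(0, 2, 200)     # 0 = non-AFP, 1 = AFP

# Scale the training data for uniformity
X_scaled = StandardScaler().fit_transform(X_train)

# Random forest (100 estimators) ranks features; keep those above mean importance
selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0),
                           threshold="mean")
X_selected = selector.fit_transform(X_scaled, y_train)

# Evaluate the classifier with stratified 10-fold cross-validation
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0),
                         X_selected, y_train, cv=cv, scoring="accuracy")
print(f"Mean 10-fold CV accuracy: {scores.mean():.3f}")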

2.1. Dataset

To ensure a comprehensive evaluation and a fair comparison with previous studies, this study leverages three benchmark datasets, namely, Antifp_DS1, Antifp_DS2, and Antifp_DS3, which have been widely used in the literature [17,18,19,20,21]. These datasets, outlined in Table 1, have distinct characteristics and compositions. In Antifp_DS1, Antifp_DS2, and Antifp_DS3, the positive samples originate from the data repository of antimicrobial peptides (DRAMP) [20], excluding sequences containing unnatural amino acids (BIJOUX). However, the negative samples in each dataset differ. Antifp_DS1 negatives comprise active antimicrobial peptides, while Antifp_DS2 negatives are randomly generated from Swiss-Prot. Notably, the maximum peptide length in these three datasets is 100. Antifp_DS3, on the other hand, encompasses peptides with lengths ranging from 5 to 30. Positive samples in Antifp_DS3 were collected from the CAMP [21], DRAMP [22], and StarPep [23,24] databases, whereas negatives were randomly generated from the Swiss-Prot database. As shown in Table 1, all three datasets are balanced (equal numbers of positive and negative samples). They are also normally distributed.

2.2. Classifiers

To test the efficiency of our extracted features and identify the best classifier to build our model, we investigated eight different classifiers, most of which have been effectively used for similar studies [25]. These eight classifiers are support vector machine (SVM), logistic regression (LR), decision tree (DT), rotation forest (RT), stochastic gradient descent (SGD), AdaBoost, naive Bayes (NB), and random forest (RF).

2.2.1. Support Vector Machine (SVM)

SVM is one of the most extensively used machine learning techniques in this field. It has been shown to outperform other classifiers for similar tasks [26,27,28]. SVM aims to identify the maximum-margin hyperplane between classes to decrease prediction errors and improve generalization. In the case of linearly separable data, it generates a hyperplane with the maximum margin separating the two classes. The SVM technique can employ various kernel functions, such as linear, polynomial, radial basis function (RBF), and sigmoid kernels [29]. In this study, we used a linear kernel with the regularization parameter C = 1, which controls the trade-off between a smooth decision boundary and classifying the training points correctly.
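Assuming a scikit-learn implementation, a minimal sketch of this SVM configuration (linear kernel, C = 1) with synthetic placeholder data is shown below.

import numpy as np
from sklearn.svm import SVC

# Synthetic placeholder data for illustration only
rng = np.random.default_rng(0)
X = rng.random((100, 20))
y = rng.integers(0, 2, 100)

# Linear kernel with C = 1: the regularization parameter balances a smooth
# decision boundary against correctly classifying the training points
svm = SVC(kernel="linear", C=1.0)
svm.fit(X, y)
print(svm.predict(X[:5]))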

2.2.2. Logistic Regression (LR)

Logistic regression estimates class probabilities using log odds. It has been frequently employed for a variety of tasks with promising outcomes [30] and is an effective model for estimating the likelihood of a linear solution to a problem [31]. In this study, we set the random state to 0 to ensure the reproducibility of this method.

2.2.3. Naive Bayes (NB)

In the field of machine learning and data mining, naive Bayes is regarded as one of the most prevalent classifiers [32]. It is based on the assumption of conditional independence between features. In this study, we used a Gaussian naive Bayes classifier instantiated with default parameters, under which the class prior probabilities are estimated from the training data.

2.2.4. AdaBoost

AdaBoost is a boosting-based approach that employs a basic classifier, also known as a weak learner, and improves its performance iteratively. It raises the weight of misclassified samples in each iteration to ensure they are classified correctly in the following iterations [33]. AdaBoost's performance strongly depends on the performance of its weak learner in each iteration. We implemented the AdaBoost classifier using decision trees as weak learners with n_estimators = 50.
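A minimal sketch of this AdaBoost setup, assuming scikit-learn and synthetic placeholder data, is given below; in scikit-learn, the default weak learner is a depth-1 decision tree.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Synthetic placeholder data for illustration only
rng = np.random.default_rng(0)
X = rng.random((100, 20))
y = rng.integers(0, 2, 100)

# 50 boosting iterations; the default base estimator is a depth-1 decision tree
ada = AdaBoostClassifier(n_estimators=50, random_state=0)
ada.fit(X, y)
print(f"Training accuracy: {ada.score(X, y):.3f}")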

2.2.5. Random Forest (RF)

Proposed by Breiman in 2001 [34], random forest builds a powerful ensemble of diverse decision trees trained on numerous random subsets of the data drawn using the bagging technique [34]. Random forest is a flexible technique for large-scale problems and has yielded promising results for various challenges [35].

2.2.6. Stochastic Gradient Descent (SGD)

Stochastic gradient descent (SGD) is a robust optimization algorithm widely used in machine learning and deep learning. It is a variant of the gradient descent method that is particularly well-suited for large datasets and complex models and has obtained promising results for similar studies [36,37].

2.2.7. Decision Tree (DT)

A decision tree is a non-parametric supervised learning approach for classification and regression applications. It has a hierarchical tree structure consisting of a root node, branches, internal nodes, and leaf nodes [38].

2.3. Feature Extraction

In this study, we employed iFeature, a widely used feature extraction tool, to extract informative features from the input data [39]. iFeature provides a comprehensive set of feature descriptors that capture diverse aspects of the data, enabling a more comprehensive analysis. iFeature can compute an extensive array of 18 primary sequence encoding schemes covering 53 diverse feature descriptors. Within various feature categories, users can also extract distinct physicochemical properties of amino acids from the AAindex database [40]. The following commonly used feature descriptors are calculated and extracted using iFeature.

2.3.1. Amino Acid Composition (AAC)

AAC presents the frequency or occurrence of each amino acid residue in the protein sequence. It provides insights into the overall amino acid distribution, which can be indicative of certain functional properties [41]. The frequencies are computed for all 20 natural amino acids, denoted as “ACDEFGHIKLMNPQRSTVWY”.
There are 20 elements in the amino acid composition (AAC) feature vector, each corresponding to one of the 20 standard amino acids. These elements indicate the percentage or frequency of each amino acid in the protein sequence.
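The paper computes this descriptor with iFeature; purely as an illustration of what the 20-dimensional AAC vector contains, a small hand-rolled Python equivalent might look like the following (the example peptide sequence is hypothetical).

from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(sequence: str) -> list[float]:
    """Return the 20-dimensional amino acid composition (fractional frequencies)."""
    counts = Counter(sequence)
    length = len(sequence)
    return [counts.get(aa, 0) / length for aa in AMINO_ACIDS]

# Hypothetical peptide sequence used only for demonstration
print(aac("GIGKFLHSAKKFGKAFVGEIMNS"))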

2.3.2. Composition of Tripeptide (CTDC, CTDT, CTDD)

Tripeptide composition descriptors capture the occurrence frequencies of different combinations of three adjacent amino acids in the protein sequence. CTDC focuses on the composition of tripeptides in the C-terminus, CTDT in the middle, and CTDD in the N-terminus.
The composition of the tripeptide feature vector is an 8000-dimensional vector, with each dimension representing the frequency or occurrence of a specific tripeptide in the protein sequence. Each element in this vector corresponds to a unique tripeptide combination, capturing the information about the presence and distribution of these tripeptides in the protein sequence.

2.3.3. Dipeptide Composition (DPC)

DPC quantifies the occurrence frequencies of different combinations of two adjacent amino acids in the protein sequence. It provides information about local structural patterns and short-range interactions.
The dipeptide composition (DPC) is a 400-dimensional feature vector [42], with each dimension representing the frequency of a specific dipeptide in the protein sequence.
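Again, iFeature produces this descriptor in practice; as an illustrative sketch, the 400-dimensional DPC vector can be obtained by counting overlapping residue pairs, as below (the example peptide is hypothetical).

from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 combinations

def dpc(sequence: str) -> list[float]:
    """Return the 400-dimensional dipeptide composition of a peptide."""
    pairs = Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))
    total = max(len(sequence) - 1, 1)
    return [pairs.get(dp, 0) / total for dp in DIPEPTIDES]

# Hypothetical peptide sequence used only for demonstration
vector = dpc("GIGKFLHSAKKFGKAFVGEIMNS")
print(len(vector))  # 400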

2.3.4. Grouped Amino Acid Composition (GAAC)

The “Grouped Amino Acid Composition” (GAAC) feature in iFeature involves grouping amino acids into predefined categories or classes and then computing the composition of these groups. We have used the basic grouping scheme that divides amino acids into four categories (e.g., hydrophobic, polar, charged, and aromatic). This GAAC feature vector has four features, each representing the composition of one of these groups in the protein sequence.
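A sketch of this grouping idea is shown below; the four-group assignment is an assumption made for illustration and may differ from the exact grouping implemented in iFeature.

# Assumed four-group assignment used only for illustration; the exact grouping
# implemented in iFeature may differ.
GROUPS = {
    "hydrophobic": set("AVLIMPG"),
    "aromatic":    set("FWY"),
    "polar":       set("STCNQ"),
    "charged":     set("DEKRH"),
}

def gaac(sequence: str) -> dict[str, float]:
    """Return the fraction of residues falling into each amino acid group."""
    length = len(sequence)
    return {name: sum(aa in members for aa in sequence) / length
            for name, members in GROUPS.items()}

# Hypothetical peptide sequence used only for demonstration
print(gaac("GIGKFLHSAKKFGKAFVGEIMNS"))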

2.3.5. Grouped Dipeptide Composition (GDPC)

The grouped dipeptide composition (GDPC) feature in iFeature involves grouping dipeptides (two consecutive amino acids) into predefined categories or classes based on the properties of their constituent amino acids and then computing the composition of these groups. It provides a holistic view of the protein's local chemical characteristics. The GDPC is a 25-dimensional feature vector, with each dimension representing the frequency of one grouped dipeptide combination under the chosen classification scheme.

2.3.6. Grouped Tripeptide Composition (GTPC)

GTPC extends the tripeptide composition by grouping tripeptides with similar physicochemical properties. This allows for capturing higher-level structural and functional patterns in the protein sequence.
The GTPC is a 125-dimensional feature vector. Each element represents the frequency or composition of tripeptides grouped into predefined categories or classes based on their physicochemical properties or structural similarities.

2.3.7. Tripeptide Position-Specific Composition (TPC)

TPC captures the position-specific occurrence frequencies of tripeptides in the protein sequence. It provides insights into the specific arrangement and distribution of tripeptides, which can be relevant for understanding functional motifs.
For a protein sequence of length L and using a standard scheme where the tripeptide composition is encoded at each position, the TPC feature vector will be L × 8000 (where 8000 represents the number of possible tripeptides).

2.4. Feature Selection

As explained in Section 2.3, using iFeature, we extract over 20,000 features. This number of features exceeds the number of training samples in our employed datasets by roughly a factor of 20 (fewer than 1200 training samples per class and over 20,000 features). Hence, reducing the number of features is necessary to avoid overfitting. In this study, after extracting the features using iFeature, we performed feature selection to identify the most effective features and filter out redundant features or those with limited discriminatory information. In this way, we aim to use a shorter input feature vector, which consequently enables us to build a more generalizable model. The significance of each feature in the training data is assessed using a random forest with 100 estimators. We investigated several feature selection techniques; among them, RF demonstrated the best performance. RF is considered an effective model for both feature selection and classification. Feature importance is calculated, and features with importance surpassing the mean importance are selected for further analysis. These selected features are then extracted from both the training and test datasets to focus on the most informative aspects of the data [34,35].
The Gini index, widely employed in decision tree-based algorithms such as RF, is a metric to evaluate impurity or purity within a dataset [43]. Specifically, it quantifies the likelihood that a randomly selected element would be misclassified, reflecting the overall impurity of a set of data points. In the context of the RF method, which is an ensemble of decision trees, the Gini index plays a crucial role in assessing the importance of each feature in contributing to the model’s predictive accuracy. Features that lead to nodes with lower impurity are considered more important, as they contribute to more accurate classifications [44].
Here, we focused on the most important features to reduce the model’s dimensionality, which improved computational efficiency. Features with higher Gini importance scores are indicative of their greater contribution to the overall predictive power of the model. This process aided in selecting a subset of features that are not only relevant but also collectively provide meaningful information for the given prediction task.
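The sketch below illustrates this importance-based selection with scikit-learn, where feature_importances_ exposes the Gini-based importance scores and only features above the mean importance are retained; the data here are synthetic placeholders standing in for the iFeature descriptors.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic placeholder data standing in for the extracted iFeature descriptors
rng = np.random.default_rng(0)
X = rng.random((200, 500))
y = rng.integers(0, 2, 200)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

importances = rf.feature_importances_        # Gini-based importance per feature
mask = importances > importances.mean()      # retain features above the mean importance
X_reduced = X[:, mask]
print(X.shape, "->", X_reduced.shape)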

2.5. Performance Evaluation

To assess the performance and generalization capability of the different classifiers, stratified 10-fold cross-validation and independent test sets are used. We ran each experiment 10 times and report the average results.
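Assuming scikit-learn, this evaluation protocol can be sketched with repeated stratified 10-fold cross-validation as follows (synthetic placeholder data for illustration).

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic placeholder data for illustration only
rng = np.random.default_rng(0)
X = rng.random((200, 50))
y = rng.integers(0, 2, 200)

# Stratified 10-fold cross-validation repeated 10 times; the average is reported
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0),
                         X, y, cv=cv, scoring="accuracy")
print(f"Average accuracy over 10 runs of 10-fold CV: {scores.mean():.3f}")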

2.6. Evaluation Metrics

To evaluate our model's performance, we used several metrics, including accuracy (ACC), sensitivity (SN), specificity (SP), precision (PRE), the area under the precision–recall curve (AUPRC), the area under the receiver operating characteristic curve (AUC), the Matthews correlation coefficient (MCC), and the F1 score. These metrics serve as reliable measures to assess the effectiveness and robustness of the model. The calculations for each metric are defined as follows:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN},$$
$$\mathrm{SN} = \frac{TP}{TP + FN},$$
$$\mathrm{SP} = \frac{TN}{TN + FP},$$
$$\mathrm{Pre} = \frac{TP}{TP + FP},$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$
$$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Pre} \times \mathrm{SN}}{\mathrm{Pre} + \mathrm{SN}},$$
where TP represents the count of true positives, TN represents the count of true negatives, FP represents the count of false positives, and FN represents the count of false negatives.
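For reference, these metrics can be computed directly from the confusion-matrix counts; the short function below is a sketch that follows the formulas above, applied to hypothetical counts.

import math

def evaluation_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Compute ACC, SN, SP, precision, MCC, and F1 from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn)        # sensitivity (recall)
    sp = tn / (tn + fp)        # specificity
    pre = tp / (tp + fp)       # precision
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    f1 = 2 * pre * sn / (pre + sn)
    return {"ACC": acc, "SN": sn, "SP": sp, "PRE": pre, "MCC": mcc, "F1": f1}

# Hypothetical confusion-matrix counts used only for demonstration
print(evaluation_metrics(tp=280, tn=285, fp=6, fn=11))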

3. Results and Discussion

This section showcases the experimental outcomes of multiple models employed for AFP prediction, utilizing diverse sequence encoding techniques and machine learning frameworks. A comprehensive comparison between our proposed model and state-of-the-art AFP classifiers is also provided.

3.1. Performance of the Model for Different Classifiers

We conducted a comprehensive comparison using different classifiers to identify the best classifier with which to build AFP-MVFL. The results of this comparison using 10-fold cross-validation and the independent test set on the Antifp_DS1 dataset are presented in Table 2 and Table 3, respectively. Note that for this comparison, we used all the features extracted using iFeature (no feature selection). As shown in these tables, RF performs better than the other classifiers used in this study. Using RF, we achieve an accuracy of 93.5%, an F1 score of 0.92, a precision of 91.2%, and an MCC of 0.89 for 10-fold cross-validation. RF also stands out with the highest accuracy of 93.8%, an F1 score of 0.93, a precision of 96.6%, and an MCC of 0.80 for the independent test dataset.
As shown in Table 4 and Table 5, RF again consistently delivered the top performance among all classifiers for the Antifp_DS2 dataset using 10-fold cross-validation and the independent test set, respectively. It achieves accuracies of 93.5% and 93.1%, F1 scores of 0.93 and 0.93, MCC scores of 0.87 and 0.86, and precision scores of 92.6% and 92.3% for 10-fold cross-validation and the independent test dataset, respectively.
We also conducted this comparison for the Antifp_DS3 dataset and summarize the results in Table 6 and Table 7 for 10-fold cross-validation and the independent test set, respectively. As shown in these tables, the model constructed with RF performs the best among all the classifiers, achieving ACC values of 93.7% and 94.1%, F1 scores of 0.93 and 0.92, MCC scores of 0.82 and 0.87, and precision scores of 94.3% and 95.1%, respectively.
The ROC curves of all models on the independent test datasets—Antifp_DS1, Antifp_DS2, and Antifp_DS3—are displayed in Figure 2, Figure 3 and Figure 4, respectively. As shown in these figures, RF demonstrates better results compared to other classifiers.

3.2. Results Achieved on the Selected Feature Set

As a result of our comparison study, we use RF as the main classifier to build AFP-MVFL. Next, we apply our feature selection method and compare the results of using RF with and without feature selection. The results of this comparison for Antifp_DS1 are presented in Table 8. As shown in this table, the model constructed with feature selection consistently outperforms the alternative model (RF trained on the whole feature set without feature selection) in terms of accuracy (ACC), precision (PRE), F1 score, and MCC.
Specifically, employing the random forest algorithm with n_estimators = 100 in conjunction with feature selection yielded superior predictive capabilities, resulting in ACC values of 97.9% and 97.6%, F1 scores of 0.98 and 0.75, and MCC scores of 0.95 and 0.95 for the Antifp_DS1 dataset, respectively. These results substantiate the advantage of employing feature selection to improve the overall performance of the predictive models.

3.3. Comparison of the Proposed Model with Existing Models

Next, to investigate the effectiveness of our proposed model (AFP-MVFL), we compare its results against other state-of-the-art models found in the literature. The results achieved by AFP-MVFL compared to previous studies for Antifp_DS1 are presented in Table 9. As demonstrated in this table, AFP-MVFL outperforms previous studies, including [9,10,12,14,19,21,23,45], across all evaluation metrics. Compared to AFP-MFL, AFP-MVFL achieves relative improvements of 2.1%, 2.4%, 1.3%, and 0.04 in terms of ACC, F1 score, precision, and MCC, respectively, for the Antifp_DS1 dataset.
We also compare AFP-MVFL's performance to the state-of-the-art methods found in the literature for the Antifp_DS2 and Antifp_DS3 datasets. The experimental results obtained on these datasets are presented in Table 10. AFP-MVFL consistently outperforms the competing approaches across all evaluation metrics on each dataset.
Specifically, when tested on the Antifp_DS2 dataset, AFP-MVFL achieves improved prediction rates with an accuracy of 98.3%, a precision of 99.1%, an F1 score of 0.98, and an MCC of 0.97, a relative increase over the AFP-MFL model. On the Antifp_DS3 dataset, AFP-MVFL achieves an accuracy of 97.4%, a precision of 98.3%, an F1 score of 0.97, and an MCC of 0.95, representing a relative improvement over the previous three models. These results establish that AFP-MVFL consistently outperforms other methods in distinguishing AFPs from non-AFPs across all evaluated datasets.
We also generated t-SNE graphs in Figure 5, Figure 6 and Figure 7 for the Antifp_DS1, Antifp_DS2, and Antifp_DS3 datasets to explore the importance and contribution of different features. Here we choose t-SNE to investigate feature importance since it was demonstrated as a better candidate than principal component analysis (PCA) in similar studies [46]. By plotting the t-SNE, we can visualize the data in a reduced space, which helps us to identify which features are most relevant for distinguishing between different data points [47]. By visualizing the distribution of instances in the reduced space, we can also assess the quality of the feature selection process [48].
The results above highlight the robustness and generalizability of AFP-MVFL. By integrating sequential and physicochemical properties of peptides through multi-view feature learning and random-forest-based feature selection, AFP-MVFL generates more informative features. Consequently, AFP-MVFL exhibits superior performance compared to alternative methods, positioning it as a reliable tool for AFP prediction. The AFP-MVFL model and its source code are publicly available at https://github.com/MuntahaMim/AFP-MVFL.git (accessed on 1 May 2024).

4. Conclusions

The accurate prediction of antifungal peptides is crucial for the advancement of therapeutic peptide design. In this study, we proposed a new machine learning framework called AFP-MVFL to predict AFPs accurately. Our approach employed a multi-view feature learning strategy to extract informative features from diverse perspectives, encompassing sequence composition, local sequence patterns, and physicochemical properties.
AFP-MVFL initially generated comprehensive profiles of peptide features by incorporating a set of sequence-based descriptors. AFP-MVFL achieved accurate AFP prediction based solely on sequence-based input features using the multi-view approach. Through rigorous cross-validation experiments conducted on three benchmark datasets, we demonstrated the superior performance of the AFP-MVFL compared to state-of-the-art methods in AFP prediction. Overall, AFP-MVFL presented a robust tool for accurate AFP prediction based solely on sequence-based information. The AFP-MVFL model and its source code are publicly available at https://github.com/MuntahaMim/AFP-MVFL.git (accessed on 1 May 2024).
One of the main limitations of this study is the limited number of samples available to train our model. As shown in Section 3, the results on the independent test set are similar to or slightly better than those reported using 10-fold cross-validation. This suggests that when we use all the training data to build our model, we achieve better performance; hence, with more samples, we would likely achieve better results. Therefore, as a future direction, we aim to build larger benchmarks to train more complex models and possibly enhance prediction performance. We also aim to investigate more complex classification models to further enhance prediction performance and correctly identify unknown antifungal peptides.

Author Contributions

Conceptualization, S.M.F. and I.D.; methodology, S.M.F. and S.B.S.M.; software, S.M.F. and S.B.S.M.; validation, S.M.F. and S.B.S.M.; formal analysis, S.M.F., I.D. and S.B.S.M.; investigation, S.M.F., I.D. and S.B.S.M.; resources, S.M.F., I.D. and S.B.S.M.; data curation, I.D.; writing—original draft preparation, S.M.F., I.D. and S.B.S.M.; writing—review and editing, S.M.F., I.D. and S.B.S.M.; visualization, S.M.F. and S.B.S.M.; supervision, I.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets used in this study, the AFP-MVFL model, and its source code are publicly available at https://github.com/MuntahaMim/AFP-MVFL.git (accessed on 1 May 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bongomin, F.; Gago, S.; Oladele, R.O.; Denning, D.W. Global and Multi-National Prevalence of Fungal Diseases—Estimate Precision. J. Fungi 2017, 3, 57. [Google Scholar] [CrossRef] [PubMed]
  2. Richardson, M.D. Changing patterns and trends in systemic fungal infections. J. Antimicrob. Chemother. 2005, 56, i5–i11. [Google Scholar] [CrossRef] [PubMed]
  3. Miceli, M.H.; Diaz, J.A.; Lee, S.A. Emerging opportunistic yeast infections. Lancet Infect. Dis. 2011, 11, 142–151. [Google Scholar] [CrossRef] [PubMed]
  4. Brown, G.D.; Denning, D.W.; Gow, N.A.R.; Levitz, S.M.; Netea, M.G.; White, T.C. Hidden Killers: Human Fungal Infections. Sci. Transl. Med. 2012, 4, 165rv13. [Google Scholar] [CrossRef] [PubMed]
  5. Perfect, J.R. The antifungal pipeline: A reality check. Nat. Rev. Drug Discov. 2017, 16, 603–616. [Google Scholar] [CrossRef] [PubMed]
  6. Butts, A.; Krysan, D.J. Antifungal Drug Discovery: Something Old and Something New. PLOS Pathog. 2012, 8, e1002870. [Google Scholar] [CrossRef]
  7. Dhama, K.; Chakrabort, S.; Verma, A.K.; Tiwari, R.; Barathidas, R.; Kumar, A.; Singh, S.D. Fungal/mycotic diseases of poultry-diagnosis, treatment and control: A review. Pak. J. Biol. Sci. 2013, 16, 1626–1640. [Google Scholar] [CrossRef] [PubMed]
  8. Lestrade, P.P.; Bentvelsen, R.G.; Schauwvlieghe, A.F.; Schalekamp, S.; van der Velden, W.J.; Kuiper, E.J.; van Paassen, J.; van der Hoven, B.; van der Lee, H.A.; Melchers, W.J.; et al. Voriconazole resistance and mortality in invasive aspergillosis: A multi-center retrospective cohort study. Clin. Infect. Dis. 2019, 68, 1463–1471. [Google Scholar] [CrossRef]
  9. Fang, Y.; Xu, F.; Wei, L.; Jiang, Y.; Chen, J.; Wei, L.; Wei, D.-Q. AFP-MFL: Accurate identification of antifungal peptides using multi-view feature learning. Brief. Bioinform. 2023, 24, bbac606. [Google Scholar] [CrossRef]
  10. Agrawal, P.; Bhalla, S.; Chaudhary, K.; Kumar, R.; Sharma, M.; Raghava, G.P. In silico approach for prediction of antifungal peptides. Front. Microbiol. 2018, 9, 323. [Google Scholar] [CrossRef]
  11. Fisher, M.C.; Hawkins, N.J.; Sanglard, D.; Gurr, S.J. Worldwide emergence of resistance to antifungal drugs challenges human health and food security. Science 2018, 360, 739–742. [Google Scholar] [CrossRef] [PubMed]
  12. Ahmad, A.; Akbar, S.; Khan, S.; Hayat, M.; Ali, F.; Ahmed, A.; Tahir, M. Deep-AntiFP: Prediction of antifungal peptides using distinct multi-informative features incorporating with deep neural networks. Chemom. Intell. Lab. Syst. 2021, 208, 104214. [Google Scholar] [CrossRef]
  13. Akbar, S.; Mohamed, H.G.; Ali, H.; Saeed, A.; Khan, A.A.; Gul, S.; Ahmad, A.; Ali, F.; Ghadi, Y.Y.; Assam, M. Identifying Neuropeptides via Evolutionary and Sequential Based Multi-Perspective Descriptors by Incorporation With Ensemble Classification Strategy. IEEE Access 2023, 11, 49024–49034. [Google Scholar] [CrossRef]
  14. Yao, L.; Zhang, Y.; Li, W.; Chung, C.; Guan, J.; Zhang, W.; Chiang, Y.; Lee, T. DeepAFP: An effective computational framework for identifying antifungal peptides based on deep learning. Protein Sci. 2023, 32, e4758. [Google Scholar] [CrossRef] [PubMed]
  15. Wang, K.; Dang, W.; Xie, J.; Zhu, R.; Sun, M.; Jia, F.; Zhao, Y.; An, X.; Qiu, S.; Li, X.; et al. Antimicrobial peptide protonectin disturbs the membrane integrity and induces ROS production in yeast cells. Biochim. Biophys. Acta (BBA)-Biomembr. 2015, 1848, 2365–2373. [Google Scholar] [CrossRef]
  16. Landon, C.; Meudal, H.; Boulanger, N.; Bulet, P.; Vovelle, F. Solution structures of stomoxyn and spinigerin, two insect antimicrobial peptides with an α-helical conformation. Biopolym. Orig. Res. Biomol. 2006, 81, 92–103. [Google Scholar] [CrossRef]
  17. Mousavizadegan, M.; Mohabatkar, H. Computational prediction of antifungal peptides via Chou’s PseAAC and SVM. J. Bioinform. Comput. Biol. 2018, 16, 1850016. [Google Scholar] [CrossRef] [PubMed]
  18. Ahmed, S.; Muhammod, R.; Khan, Z.H.; Adilina, S.; Sharma, A.; Shatabda, S.; Dehzangi, A. ACP-MHCNN: An accurate multi-headed deep-convolutional neural network to predict anticancer peptides. Sci. Rep. 2021, 11, 23676. [Google Scholar] [CrossRef] [PubMed]
  19. Fang, C.; Moriwaki, Y.; Li, C.; Shimizu, K. Prediction of antifungal peptides by deep learning with character embedding. IPSJ Trans. Bioinform. 2019, 12, 21–299. [Google Scholar] [CrossRef]
  20. Fan, L.; Sun, J.; Zhou, M.; Zhou, J.; Lao, X.; Zheng, H.; Xu, H. DRAMP: A comprehensive data repository of antimicrobial peptides. Sci. Rep. 2016, 6, 24482. [Google Scholar] [CrossRef]
  21. Ahmad, A.; Akbar, S.; Tahir, M.; Hayat, M.; Ali, F. iAFPs-EnC-GA: Identifying antifungal peptides using sequential and evolutionary descriptors based multi-information fusion and ensemble learning approach. Chemom. Intell. Lab. Syst. 2022, 222, 104516. [Google Scholar] [CrossRef]
  22. Sharma, R.; Shrivastava, S.; Kumar Singh, S.; Kumar, A.; Saxena, S.; Kumar Singh, R. Deep-AFPpred: Identifying novel antifungal peptides using pretrained embeddings from seq2vec with 1DCNN-BiLSTM. Brief. Bioinform. 2022, 3, bbab422. [Google Scholar] [CrossRef] [PubMed]
  23. He, W.; Jiang, Y.; Jin, J.; Li, Z.; Zhao, J.; Manavalan, B.; Su, R.; Gao, X.; Wei, L. Accelerating bioactive peptide discovery via mutual information-based meta-learning. Brief. Bioinform. 2022, 23, bbab499. [Google Scholar] [CrossRef] [PubMed]
  24. Lv, Z.; Ao, C.; Zou, Q. Protein function prediction: From traditional classifier to deep learning. Proteomics 2019, 19, e1900119. [Google Scholar] [CrossRef] [PubMed]
  25. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Available online: https://proceedings.neurips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html (accessed on 1 May 2024).
  26. Bhasin, M.; Raghava, G.P.S. SVM based method for predicting HLA-DRB1* 0401 binding peptides in an antigen sequence. Bioinformatics 2004, 20, 421–423. [Google Scholar] [CrossRef] [PubMed]
  27. Zhang, Y. Support vector machine classification algorithm and its application. In Information Computing and Applications: Proceedings of the Third International Conference, ICICA 2012, Chengde, China, 14–16 September 2012; Part II 3; Springer: Berlin/Heidelberg, Germany, 2012; pp. 179–186. [Google Scholar]
  28. Lata, S.; Mishra, N.K.; Raghava, G.P. AntiBP2: Improved version of antibacterial peptide prediction. BMC Bioinform. 2010, 11, S19. [Google Scholar] [CrossRef] [PubMed]
  29. Ding, X.; Liu, J.; Yang, F.; Cao, J. Random radial basis function kernel-based support vector machine. J. Frankl. Inst. 2021, 358, 10121–10140. [Google Scholar] [CrossRef]
  30. Westreich, D.; Lessler, J.; Funk, M.J. Propensity score estimation: Neural networks, support vector machines, decision trees (CART), and meta-classifiers as alternatives to logistic regression. J. Clin. Epidemiol. 2010, 63, 826–833. [Google Scholar] [CrossRef] [PubMed]
  31. Heinze, G.; Schemper, M. A solution to the problem of separation in logistic regression. Stat. Med. 2002, 21, 2409–2419. [Google Scholar] [CrossRef]
  32. Rish, I. An empirical study of the naive Bayes classifier. In Proceedings of the IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, Seattle, WA, USA, 4 August 2001; Volume 3, pp. 41–46. [Google Scholar]
  33. Cao, J.; Kwong, S.; Wang, R. A noise-detection based AdaBoost algorithm for mislabeled data. Pattern Recognit. 2012, 45, 4451–4465. [Google Scholar] [CrossRef]
  34. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  35. Kirasich, K.; Smith, T.; Sadler, B. Random forest vs logistic regression: Binary classification for heterogeneous datasets. SMU Data Sci. Rev. 2018, 1, 9. [Google Scholar]
  36. Haji, S.H.; Abdulazeez, A.M. Comparison of optimization techniques based on gradient descent algorithm: A review. PalArch’s J. Archaeol. Egypt/Egyptol. 2021, 18, 2715–2743. [Google Scholar]
  37. Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proceedings of the COMPSTAT’2010: 19th International Conference on Computational Statistics, Paris, France, 22–27 August 2010. [Google Scholar]
  38. Wu, C.-C.; Chen, Y.-L.; Liu, Y.-H.; Yang, X.-Y. Decision tree induction with a constrained number of leaf nodes. Appl. Intell. 2016, 45, 673–685. [Google Scholar] [CrossRef]
  39. Chen, Z.; Zhao, P.; Li, F.; Leier, A.; Marquez-Lago, T.T.; Wang, Y.; Webb, G.I.; Smith, A.I.; Daly, R.J.; Chou, K.C.; et al. iFeature: A python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics 2018, 34, 2499–2502. [Google Scholar] [CrossRef] [PubMed]
  40. Kawashima, S.; Pokarowski, P.; Pokarowska, M.; Kolinski, A.; Katayama, T.; Kanehisa, M. AAindex: Amino acid index database, progress report 2008. Nucleic Acids Res. 2007, 36, D202–D205. [Google Scholar] [CrossRef] [PubMed]
  41. Bhasin, M.; Raghava, G.P. Classification of nuclear receptors based on amino acid composition and dipeptide composition. J. Biol. Chem. 2004, 279, 23262–23266. [Google Scholar] [CrossRef]
  42. Saravanan, V.; Gautham, N. Harnessing computational biology for exact linear b-cell epitope prediction: A novel amino acid composition-based feature descriptor. OMICS J. Integr. Biol. 2015, 19, 648–658. [Google Scholar] [CrossRef] [PubMed]
  43. Chang, K.Y.; Yang, J.-R. Analysis and prediction of highly effective antiviral peptides based on random forests. PLoS ONE 2013, 8, e70166. [Google Scholar] [CrossRef]
  44. Schaduangrat, N.; Nantasenamat, C.; Prachayasittikul, V.; Shoombuatong, W. ACPred: A computational tool for the prediction and analysis of anticancer peptides. Molecules 2019, 24, 1973. [Google Scholar] [CrossRef]
  45. Liu, J.; Li, M.; Chen, X. AntiMF: A deep learning framework for predicting anticancer peptides based on multi-view feature extraction. Methods 2022, 207, 38–43. [Google Scholar] [CrossRef] [PubMed]
  46. Pareek, J.; Jacob, J. Data compression and visualization using PCA and T-SNE. In Advances in Information Communication Technology and Computing: Proceedings of AICTC 2019; Springer: Singapore, 2021; pp. 327–337. [Google Scholar]
  47. Charoenkwan, P.; Schaduangrat, N.; Moni, M.A.; Manavalan, B.; Shoombuatong, W. SAPPHIRE: A stacking-based ensemble learning framework for accurate prediction of thermophilic proteins. Comput. Biol. Med. 2022, 146, 105704. [Google Scholar] [CrossRef] [PubMed]
  48. Charoenkwan, P.; Ahmed, S.; Nantasenamat, C.; Quinn, J.M.; Moni, M.A.; Lio’, P.; Shoombuatong, W. AMYPred-FRL is a novel approach for accurate prediction of amyloid proteins by using feature representation learning. Sci. Rep. 2022, 12, 7697. [Google Scholar] [CrossRef] [PubMed]
Figure 1. The overall architecture of AFP-MVFL. The AFP prediction pipeline consists of three modules: (i) a feature extraction module using iFeature; (ii) feature selection using random forest; (iii) a classification module for the prediction task.
Figure 2. ROC curve of the results for various classification models on the independent test of the Antifp_DS1 dataset.
Figure 3. ROC curve of the results for various classification models on the independent test of the Antifp_DS2 dataset.
Figure 4. ROC curve of the results for various classification models on the independent test of the Antifp_DS3 dataset.
Figure 5. Feature visualization of AFP-MVFL on the AntiFP_DS1 dataset. Blue dots correspond to negative instances (label 0) and red dots to positive instances (label 1).
Figure 6. Feature visualization of random forest on the AntiFP_DS2 dataset. Blue dots correspond to negative instances (label 0) and red dots to positive instances (label 1).
Figure 7. Feature visualization of random forest on the AntiFP_DS3 dataset. Blue dots correspond to negative instances (label 0) and red dots to positive instances (label 1).
Table 1. Statistics and properties of the datasets employed in this study, namely, Antifp_DS1, Antifp_DS2, and Antifp_DS3.

Dataset      Split   AFPs   Non-AFPs   Description of AFPs and Non-AFPs
Antifp_DS1   Train   1168   1168       The non-AFPs were chosen at random from the Swiss-Prot database, and antimicrobial peptides other than antifungals were obtained from the DRAMP database.
             Test     291    291
Antifp_DS2   Train   1168   1168       The non-AFPs were antimicrobial peptides other than antifungal peptides.
             Test     291    291
Antifp_DS3   Train   1168   1168       The non-AFPs were chosen at random from the Swiss-Prot database.
             Test     291    291
Table 2. The results of comparing machine learning algorithms based on various performance metrics using 10-fold cross-validation for the Antifp_DS1 dataset.

Algo           Acc (%)   F1     MCC    Precision (%)   SN (%)   SE (%)
SVM            91.5      0.91   0.83   91.2            90.5     91.4
RF             93.5      0.92   0.89   93.1            93.2     93.8
AdaBoost       91.2      0.93   0.90   90.2            89.5     91.1
NB             78.4      0.77   0.57   81.4            80.2     78.8
LR             90.9      0.90   0.81   90.5            90.1     90.5
SGD            89.3      0.88   0.80   89.1            88.3     86.1
Bernoulli NB   87.4      0.87   0.74   88.7            88.3     87.6
DT             91.7      0.91   0.83   90.8            89.5     90.0
RT             91.7      0.90   0.91   92.4            92.3     91.7
Table 3. The results of comparing machine learning algorithms based on various performance metrics using an independent test set for the Antifp_DS1 dataset.

Algo           Acc (%)   F1     MCC    Precision (%)   SN (%)   SE (%)
SVM            91.4      0.91   0.82   91.1            90.5     91.5
RF             93.8      0.93   0.80   96.6            93.2     93.5
AdaBoost       91.7      0.92   0.89   90.8            89.5     91.2
NB             78.8      0.78   0.57   79.7            80.2     78.4
LR             90.5      0.90   0.81   90.4            90.1     90.9
SGD            86.1      0.85   0.71   86.7            88.3     89.3
Bernoulli NB   87.6      0.87   0.75   90.1            88.3     87.4
DT             90.0      0.90   0.80   89.1            89.5     91.7
RT             91.7      0.91   0.90   92.5            92.3     91.7
Table 4. The results of comparing machine learning algorithms based on various performance metrics using 10-fold cross-validation for the Antifp_DS2 dataset.

Algo           Acc (%)   F1     MCC    Precision (%)   SN (%)   SE (%)
SVM            93.5      0.93   0.87   92.6            90.5     91.4
RF             96.4      0.95   0.92   97.7            93.2     93.8
AdaBoost       92.6      0.91   0.90   89.0            89.5     91.1
NB             88.9      0.89   0.82   86.3            80.2     78.8
LR             92.5      0.91   0.85   92.8            90.1     90.5
SGD            90.9      0.91   0.81   90.9            88.3     86.1
Bernoulli NB   91.1      0.90   0.82   93.9            88.3     87.6
DT             91.6      0.91   0.83   90.8            89.5     90.0
RT             92.5      0.92   0.92   92.2            92.3     91.7
Table 5. The results of comparing machine learning algorithms based on various performance metrics using an independent test set for the Antifp_DS2 dataset.

Algo           Acc (%)   F1     MCC    Precision (%)   SN (%)   SE (%)
SVM            93.1      0.93   0.86   92.3            93.5     91.4
RF             96.2      0.96   0.92   96.8            96.4     93.8
AdaBoost       92.3      0.91   0.90   91.4            92.6     91.1
NB             85.6      0.89   0.86   80.5            88.9     78.8
LR             93.6      0.94   0.87   92.9            92.5     90.5
SGD            91.4      0.91   0.83   91.4            90.9     86.1
Bernoulli NB   90.4      0.90   0.80   92.1            91.1     87.6
DT             90.4      0.90   0.80   89.2            91.6     90.0
RT             91.1      0.91   0.92   91.2            92.5     91.7
Table 6. Results of the comparison of machine learning algorithms based on various performance metrics using 10-fold cross-validation for the Antifp_DS3 dataset.

Algo           Acc (%)   F1     MCC    Precision (%)   SN (%)   SE (%)
SVM            91.4      0.91   0.82   90.9            90.5     92.3
RF             93.7      0.92   0.86   94.3            93.2     96.8
AdaBoost       88.3      0.87   0.85   85.2            89.5     87.2
NB             77.2      0.74   0.55   84.2            80.2     80.6
LR             91.7      0.91   0.83   90.8            90.1     92.9
SGD            88.1      0.88   0.88   86.8            88.3     91.4
Bernoulli NB   86.1      0.86   0.72   86.4            88.3     92.1
DT             87.3      0.87   0.74   88.7            89.5     89.2
RT             91.2      0.92   0.90   92.6            92.3     91.3
Table 7. The results of comparing machine learning algorithms based on various performance metrics using an independent test set for the Antifp_DS3 dataset.

Algo           Acc (%)   F1     MCC    Precision (%)   SN (%)   SE (%)
SVM            90.8      0.91   0.81   89.9            90.5     92.3
RF             94.1      0.93   0.87   95.1            93.2     96.8
AdaBoost       83.4      0.82   0.81   82.1            89.5     82.0
NB             78.0      0.75   0.57   85.5            80.2     80.5
LR             90.1      0.90   0.80   89.3            90.1     92.9
SGD            88.3      0.88   0.76   89.6            88.3     91.4
Bernoulli NB   86.9      0.86   0.73   89.6            88.3     92.1
DT             87.6      0.87   0.75   90.1            89.5     89.2
RT             92.2      0.90   0.92   91.8            92.3     91.2
Table 8. Comparison of the AFP-MVFL model with and without feature selection of Antifp_DS1.

Dataset      Model                       ACC (%)   PRE (%)   F1 Score   MCC
Antifp_DS1   Without Feature Selection   93.5      93.1      0.92       0.89
Antifp_DS1   With Feature Selection      97.9      98.4      0.98       0.96
Table 9. Comparison of AFP-MVFL with other antifungal peptide predictors on the independent test dataset of Antifp_DS1.

Dataset      Model               ACC (%)   PRE (%)   F1 Score   MCC
Antifp_DS1   MIMML [23]          91.3      -         -          0.83
             AntiMF [45]         90.2      -         -          0.80
             Antifp [10]         86.3      -         -          0.73
             AFPDeep [19]        90.2      -         -          -
             Deep-AntiFP [12]    89.1      -         -          0.78
             iAFPs-EnC-GA [21]   93.9      -         -          0.90
             AFP-MFL [9]         95.8      97.1      0.96       0.92
             AFP-MVFL            97.9      98.4      0.98       0.96
Table 10. Comparison of AFP-MVFL with other antifungal peptide predictors on Antifp_DS2 and Antifp_DS3 datasets.

Dataset      Model      ACC (%)   PRE (%)   F1 Score   MCC
Antifp_DS2   AFPDeep    93.5      -         -          -
             Antifp     85.9      -         -          0.72
             AFP-MFL    94.4      95.9      0.94       0.88
             AFP-MVFL   98.3      99.1      0.98       0.97
Antifp_DS3   AFPDeep    88.7      -         -          -
             Antifp     90.4      -         -          0.81
             AFP-MFL    96.8      97.6      0.96       0.93
             AFP-MVFL   97.4      98.3      0.97       0.95