Machine Learning Model for Identifying Antioxidant Proteins Using Features Calculated from Primary Sequences
Simple Summary
Abstract
1. Introduction
2. Materials and Methods
2.1. Dataset Preparation
2.2. Feature Extraction
- Amino acid composition: The amino acid composition (AAC) is defined as the number of amino acids of each type normalized by the total number of residues [22]. In this descriptor group, we encoded the frequency of each amino acid in the peptide/protein sequences with the AAC descriptor and combined it with the composition of k-spaced amino acid pairs (CKSAAP), the dipeptide composition (DPC), and the dipeptide deviation from the expected mean (DDE).
- Composition/Transition/Distribution (C/T/D): This descriptor group represents the distribution pattern of amino acids based on specific structural or physicochemical properties of the peptide/protein sequence [23,24]. Seven types of properties were used to calculate these features: hydrophobicity, normalized Van der Waals volume, polarity, polarizability, charge, secondary structures, and solvent accessibility.
- Conjoint triad: This feature extractor explores the properties of each amino acid in a protein sequence together with its vicinal amino acids by treating any three contiguous amino acids as a single unit [25].
- Autocorrelation: The autocorrelation descriptor was first introduced by Moreau and Broto and is mainly based on the distribution of amino acid properties along the sequence [26]. This group contains three autocorrelation descriptors: Geary, Moran, and Normalized Moreau-Broto (NMBroto).
- Pseudo-amino acid composition: The pseudo-amino acid composition (PseAAC) was originally introduced by Chou in 2001 [27] to represent protein samples for improving the prediction of membrane proteins and protein subcellular localization. Compared with the original AAC method, it also characterizes the protein mainly through a matrix of amino acid frequencies, which is useful for peptide/protein sequences without significant sequence homology to other proteins.
- Grouped amino acid composition: In this descriptor group, the amino acids are categorized into five classes according to physicochemical properties such as hydrophobicity, charge, or molecular size, and the features are extracted from this encoding [21].
- Quasi-sequence-order: This promising descriptor copes with the extreme complexity of peptide/protein sequences (the number of possible permutations and combinations) by building on the augmented covariant discriminant algorithm. It also allows higher prediction quality to be approached with various protein features [28].
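To make the simpler descriptors in the list above concrete, the following is a minimal Python sketch of AAC, DPC, and conjoint triad counting. It is not the iFeature implementation: the function names, the example sequence, and the seven-class residue grouping used for the triads are assumptions made for illustration, and the triad function returns raw counts without any normalization. The output lengths match the dimensions listed for AAC (20), DPC (400), and CTriad (343).

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed seven-class grouping for the conjoint triad (7^3 = 343 triads).
TRIAD_CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
CLASS_OF = {aa: str(i) for i, group in enumerate(TRIAD_CLASSES) for aa in group}


def aac(seq):
    """Amino acid composition: 20 frequencies that sum to 1."""
    counts = Counter(seq)
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]


def dpc(seq):
    """Dipeptide composition: 400 frequencies of adjacent residue pairs."""
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = len(seq) - 1
    return [pairs[a + b] / total for a, b in product(AMINO_ACIDS, repeat=2)]


def ctriad_counts(seq):
    """Raw counts of the 343 class triads over three contiguous residues."""
    classes = "".join(CLASS_OF[aa] for aa in seq)
    triads = Counter(classes[i:i + 3] for i in range(len(classes) - 2))
    return [triads["".join(t)] for t in product("0123456", repeat=3)]


example = "MKVLAAGIVGLCDE"  # hypothetical sequence, not taken from the dataset
print(len(aac(example)), len(dpc(example)), len(ctriad_counts(example)))
```

The sketch assumes sequences that contain only the 20 standard residues; non-standard letters would need to be filtered out first.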
2.3. Feature Selection and Attribute Discrimination
2.3.1. Feature Selection
2.3.2. Attribute Discrimination
- Attribute Evaluator (CfsSubsetEval, correlation-based feature subset selection): This evaluator, first introduced in 1998 by Hall [30], assesses the worth of a subset of attributes by considering the individual predictive ability of each feature together with the degree of redundancy between them.
- Searching method (“BestFirst”): This tool examines the space of attribute subsets by greedy hill climbing augmented with a backtracking facility. Note that “BestFirst” can search in either direction: it can start with the empty set of attributes and search forward, start with the full set of attributes and search backward, or start at any point and search in both directions.
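The trade-off between predictive ability and redundancy described above can be illustrated with the correlation-based merit score, k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature correlation of a k-feature subset. The sketch below is a simplified stand-in, not Weka's implementation: it uses Pearson correlation as the correlation measure and a plain greedy forward search without the backtracking that “BestFirst” provides. The toy data, feature names, and the `tol` parameter are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def cfs_merit(subset, columns, labels):
    """Merit of a subset: k * r_cf / sqrt(k + k * (k - 1) * r_ff)."""
    k = len(subset)
    r_cf = sum(abs(pearson(columns[f], labels)) for f in subset) / k
    if k == 1:
        return r_cf
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def forward_select(columns, labels, tol=1e-9):
    """Greedy forward search: add the best feature until merit stops improving."""
    selected, best = [], 0.0
    remaining = list(columns)
    while remaining:
        merit, feature = max(
            (cfs_merit(selected + [f], columns, labels), f) for f in remaining
        )
        if merit <= best + tol:
            break
        selected.append(feature)
        remaining.remove(feature)
        best = merit
    return selected, best


# Hypothetical toy data: one informative feature, an exact copy, and noise.
labels = [1, 1, 1, 0, 0, 0]
columns = {
    "f_signal": [0.9, 0.8, 0.7, 0.2, 0.1, 0.3],
    "f_copy": [0.9, 0.8, 0.7, 0.2, 0.1, 0.3],
    "f_noise": [0.5, 0.1, 0.4, 0.5, 0.1, 0.4],
}
selected, merit = forward_select(columns, labels)
print(selected)
```

On this toy data the redundant copy is rejected because adding it leaves the merit unchanged, and the noise feature is rejected because it lowers the merit.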
2.4. Model Training
- Feature set F1: Hybridization of all nine original feature sets together. This feature set contains 3988 attributes.
- Feature set F2: We applied the “select attributes” function of the Weka software (developed by the University of Waikato, Hamilton, New Zealand) to feature set F1. The resulting hybrid feature set contained 147 attributes.
- Feature set F3: Attributes were reduced by filtering each of the nine feature sets separately and then combining them to create a hybridized set of 317 attributes.
2.4.1. Random Forest (RF)
2.4.2. J48
2.4.3. Sequential Minimal Optimization (SMO)
2.4.4. Deep Learning (DL)
2.5. Model Performance Evaluation
3. Results
3.1. Feature Selection
3.2. Model Performance Among Different Features and Algorithms
- Feature dataset F1: (1:35; 2:1) for SMO and RF; (1:50; 2:1) for J48; (1:4; 2:1) for DL. However, we were unable to balance the results when applying feature set F1 to the SMO algorithm, even though we tried many matrices: (1:25; 2:1), (1:30; 2:1), (1:35; 2:1), (1:35; 3:1), (1:35; 4:1), (1:40; 2:1). These matrices gave similar results, so we chose the matrix (1:35; 2:1) as the representative.
- Feature dataset F2: (1:10; 2:1) for SMO; (1:35; 2:1) for J48, RF, and DL.
- Feature dataset F3: the same matrix (1:35; 2:1) for all algorithms SMO, J48, RF, DL.
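The effect of such matrices can be sketched with a minimum-expected-cost decision rule. This is only one possible reading of the notation: it assumes that (1:35; 2:1) means a penalty of 35 for misclassifying an instance of class 1 and a penalty of 1 for misclassifying an instance of class 2, and that class 1 is the minority (antioxidant) class. Neither assumption is confirmed by the text above, and the function names are hypothetical.

```python
def min_cost_label(p_positive, cost_fn, cost_fp):
    """Pick the label with the lower expected misclassification cost.

    p_positive: predicted probability of the positive (assumed minority) class.
    cost_fn: penalty for labelling a true positive as negative.
    cost_fp: penalty for labelling a true negative as positive.
    """
    expected_cost_if_negative = p_positive * cost_fn
    expected_cost_if_positive = (1.0 - p_positive) * cost_fp
    if expected_cost_if_positive < expected_cost_if_negative:
        return "positive"
    return "negative"


def positive_threshold(cost_fn, cost_fp):
    """Probability above which predicting positive is the cheaper choice."""
    return cost_fp / (cost_fn + cost_fp)


# Assumed reading of the matrix (1:35; 2:1): cost_fn = 35, cost_fp = 1.
print(positive_threshold(35, 1))
print(min_cost_label(0.05, 35, 1), min_cost_label(0.02, 35, 1))
```

Under this reading, a 35:1 penalty ratio moves the decision threshold from 0.5 down to 1/36, so many more instances are labelled positive. This raises sensitivity at the expense of specificity.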
3.3. Comparison with Previous Methodologies
3.3.1. Comparison with Existing Models in the Cross-Validation Test
3.3.2. Comparison with Existing Models in the Independent Dataset 1
3.3.3. Comparison with Existing Models in the Independent Dataset 2
3.4. Replicating the Results Using Weka Toolkit
- (1) Extracting features via the iFeature toolkit [21].
- (2) Selecting the best feature sets provided in Figure 2.
- (3) Using Weka to load our provided best model. To prevent compatibility errors, it is suggested to use Weka version 3.8.
- (4) Testing your protein sequences based on the features generated in step 2.
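Between steps 2 and 3, the selected features have to be written in a format that Weka can read, typically ARFF. The following is a minimal sketch of such a conversion, assuming the features are already available as rows of numbers. The function name, attribute names, relation name, and class labels are hypothetical; the attributes must match, in name and order, those the provided model was trained on.

```python
def to_arff(relation, feature_names, rows, class_values):
    """Build ARFF text from numeric feature rows.

    rows: list of (feature_values, class_label) tuples; use "?" as the label
    for sequences whose class is unknown.
    """
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {name} numeric" for name in feature_names]
    lines.append("@attribute class {" + ",".join(class_values) + "}")
    lines += ["", "@data"]
    for values, label in rows:
        if len(values) != len(feature_names):
            raise ValueError("row length does not match the number of features")
        lines.append(",".join(str(v) for v in values) + "," + label)
    return "\n".join(lines) + "\n"


# Hypothetical two-feature example with an unlabelled query sequence.
arff = to_arff(
    "aop_features",
    ["AAC_A", "AAC_C"],
    [([0.12, 0.03], "?")],
    ["antioxidant", "non_antioxidant"],
)
print(arff)
```

The sketch does not quote attribute names, so names containing spaces or special characters would need additional handling.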
3.5. Validation with Novel Antioxidant Protein Sequences
4. Discussion
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Conflicts of Interest
References
- Lane, N. Oxygen: The Molecule that Made the World; Oxford University Press: Oxford, UK, 2003.
- Lobo, V.; Patil, A.; Phatak, A.; Chandra, N. Free radicals, antioxidants and functional foods: Impact on human health. Pharmacogn. Rev. 2010, 4, 118.
- Valko, M.; Leibfritz, D.; Moncol, J.; Cronin, M.T.; Mazur, M.; Telser, J. Free radicals and antioxidants in normal physiological functions and human disease. Int. J. Biochem. Cell Biol. 2007, 39, 44–84.
- Nimse, S.B.; Pal, D. Free radicals, natural antioxidants, and their reaction mechanisms. RSC Adv. 2015, 5, 27986–28006.
- Butt, A.H.; Rasool, N.; Khan, Y.D. Prediction of antioxidant proteins by incorporating statistical moments based features into Chou’s PseAAC. J. Theor. Biol. 2019, 473, 1–8.
- Bonomini, F.; Rodella, L.F.; Rezzani, R. Metabolic Syndrome, Aging and Involvement of Oxidative Stress. Aging Dis. 2015, 6, 109–120.
- Gilgun-Sherki, Y.; Melamed, E.; Offen, D. The role of oxidative stress in the pathogenesis of multiple sclerosis: The need for effective antioxidant therapy. J. Neurol. 2004, 251, 261–268.
- Guzik, T.J.; Touyz, R.M. Oxidative Stress, Inflammation, and Vascular Aging in Hypertension. Hypertension 2017, 70, 660–667.
- Reuter, S.; Gupta, S.C.; Chaturvedi, M.M.; Aggarwal, B.B. Oxidative stress, inflammation, and cancer: How are they linked? Free Radic. Biol. Med. 2010, 49, 1603–1616.
- Dhalla, S.N.; Temsah, R.M.; Netticadan, T. Role of oxidative stress in cardiovascular diseases. J. Hypertens. 2000, 18, 655–673.
- Gupta, R.K.; Patel, A.K.; Shah, N.; Choudhary, A.K.; Jha, U.K.; Yadav, U.C.; Pakuwal, U. Oxidative stress and antioxidants in disease and cancer: A review. Asian Pac. J. Cancer Prev. 2014, 15, 4405–4409.
- German, J.B. Food Processing and Lipid Oxidation. In Impact of Processing on Food Safety; Jackson, L.S., Knize, M.G., Morgan, J.N., Eds.; Springer: Boston, MA, USA, 1999; pp. 23–50.
- Mirończuk-Chodakowska, I.; Witkowska, A.M.; Zujko, M.E. Endogenous non-enzymatic antioxidants in the human body. Adv. Med. Sci. 2018, 63, 68–78.
- Jin, S.; Wang, L.; Guo, F.; Zou, Q. AOPs-SVM: A Sequence-Based Classifier of Antioxidant Proteins Using a Support Vector Machine. Front. Bioeng. Biotechnol. 2019, 7, 224.
- Feng, P.-M.; Lin, H.; Chen, W. Identification of Antioxidants from Sequence Information Using Naïve Bayes. Comput. Math. Methods Med. 2013, 2013, 567529.
- Feng, P.; Chen, W.; Lin, H. Identifying Antioxidant Proteins by Using Optimal Dipeptide Compositions. Interdiscipl. Sci. Comput. Life Sci. 2016, 8, 186–191.
- Xu, L.; Liang, G.; Shi, S.; Liao, C. SeqSVM: A sequence-based support vector machine method for identifying antioxidant proteins. Int. J. Mol. Sci. 2018, 19, 1773.
- Li, X.; Tang, Q.; Tang, H.; Chen, W. Identifying Antioxidant Proteins by Combining Multiple Methods. Front. Bioeng. Biotechnol. 2020, 8, 858.
- Zhang, L.; Zhang, C.; Gao, R.; Yang, R.; Song, Q. Sequence Based Prediction of Antioxidant Proteins Using a Classifier Selection Strategy. PLOS ONE 2016, 11, e0163274.
- Chen, Z.; Zhao, P.; Li, F.; Marquez-Lago, T.T.; Leier, A.; Revote, J.; Zhu, Y.; Powell, D.R.; Akutsu, T.; Webb, G.I.; et al. iLearn: An integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data. Brief. Bioinform. 2019, 21, 1047–1057.
- Chen, Z.; Zhao, P.; Li, F.; Leier, A.; Marquez-Lago, T.T.; Wang, Y.; Webb, G.I.; Smith, A.I.; Daly, R.J.; Chou, K.; et al. iFeature: A Python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics 2018, 34, 2499–2502.
- Bhasin, M.; Raghava, G.P.S. Classification of Nuclear Receptors Based on Amino Acid Composition and Dipeptide Composition. J. Biol. Chem. 2004, 279, 23262–23266.
- Cai, C.Z.; Han, L.Y.; Ji, Z.L.; Chen, X.; Chen, Y.Z. SVM-Prot: Web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res. 2003, 31, 3692–3697.
- Dubchak, I.; Muchnik, I.; Holbrook, S.R.; Kim, S.H. Prediction of protein folding class using global description of amino acid sequence. Proc. Natl. Acad. Sci. USA 1995, 92, 8700–8704.
- Shen, J.; Zhang, J.; Luo, X.; Zhu, W.; Yu, K.; Chen, K.; Li, Y.; Jiang, H. Predicting protein–protein interactions based only on sequences information. Proc. Natl. Acad. Sci. USA 2007, 104, 4337–4341.
- Horne, D.S. Prediction of protein helix content from an autocorrelation analysis of sequence hydrophobicities. Biopolymers 1988, 27, 451–477.
- Chou, K.-C. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins Struct. Funct. Bioinform. 2001, 43, 246–255.
- Chou, K.-C. Prediction of Protein Subcellular Locations by Incorporating Quasi-Sequence-Order Effect. Biochem. Biophys. Res. Commun. 2000, 278, 477–483.
- Hall, M.; Frank, E.; Holmes, G.; Pfahringer, B.; Reutemann, P.; Witten, I.H. The WEKA data mining software: An update. ACM SIGKDD Explor. Newsl. 2009, 11, 10–18.
- Hall, M.A. Correlation-Based Feature Subset Selection for Machine Learning. Ph.D. Thesis, University of Waikato, Hamilton, New Zealand, 1998.
- Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
- Quinlan, J. C4.5: Programs for Machine Learning; Elsevier: Amsterdam, The Netherlands, 2014.
- Platt, J. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. 1998. Available online: https://www.microsoft.com/en-us/research/publication/sequential-minimal-optimization-a-fast-algorithm-for-training-support-vector-machines/ (accessed on 20 August 2020).
- Lang, S.; Bravo-Marquez, F.; Beckham, C.; Hall, M.; Frank, E. WekaDeeplearning4j: A Deep Learning Package for Weka Based on Deeplearning4j. Knowl.-Based Syst. 2019, 178, 48–50.
- Do, D.T.; Le, T.Q.T.; Le, N.Q.K. Using deep neural networks and biological subwords to detect protein S-sulfenylation sites. Brief. Bioinform. 2020.
- Le, N.Q.K.; Do, D.T.; Chiu, F.Y.; Yapp, E.K.Y.; Yeh, H.Y.; Chen, C.Y. XGBoost Improves Classification of MGMT Promoter Methylation Status in IDH1 Wildtype Glioblastoma. J. Personal. Med. 2020, 10, 128.
Feature Extraction Algorithm | Feature Set (Descriptor) | Dimensions (Number of Attributes) |
---|---|---|
Amino acid composition | AAC: Amino acid composition | 20 |
Amino acid composition | CKSAAP: Composition of k-spaced amino acid pairs | 2400 |
Amino acid composition | DPC: Dipeptide composition | 400 |
Amino acid composition | DDE: Dipeptide deviation from expected mean | 400 |
C/T/D | CTDC: C/T/D Composition | 39 |
C/T/D | CTDD: C/T/D Distribution | 195 |
C/T/D | CTDT: C/T/D Transition | 39 |
Conjoint triad | CTriad: Conjoint triad | 343 |
Autocorrelation | Geary: Geary | 240 |
Autocorrelation | Moran: Moran | 240 |
Autocorrelation | NMBroto: Normalized Moreau-Broto | 240 |
Pseudo-amino acid composition | PAAC: Pseudo-amino acid composition | 50 |
Pseudo-amino acid composition | APAAC: Amphiphilic Pseudo-amino acid composition | 80 |
Grouped amino acid composition | CKSAAGP: Composition of k-spaced amino acid group pairs | 150 |
Grouped amino acid composition | GAAC: Grouped amino acid composition | 5 |
Grouped amino acid composition | GDPC: Grouped dipeptide composition | 25 |
Quasi-sequence-order | QSOrder: Quasi-sequence-order descriptors | 100 |
Predictor | Accuracy |
---|---|
With adjusting imbalance data | |
Without adjusting imbalance data | |
Butt et al. [5] (without adjusting imbalance data) |
Predictor | Sn | Sp | Acc | MCC | AUC |
---|---|---|---|---|---|
With adjusting imbalance data | 0.986 | 0.469 | 0.5515 | 0.341 | 0.983 |
Without adjusting imbalance data | 0.946 | 0.941 | 0.9421 | 0.811 | 0.981 |
Zhang et al., (2016) (without adjusting imbalance data) | 0.878 | 0.860 | 0.863 | 0.617 | 0.948 |
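For reference, the Sn, Sp, Acc, and MCC columns in the table above follow from the counts of a binary confusion matrix. The sketch below uses the standard definitions. The function name and the counts are hypothetical and are not the paper's data. AUC is omitted because it requires ranked prediction scores rather than counts.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity, accuracy, and MCC from counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


# Hypothetical counts for illustration only.
m = binary_metrics(tp=90, fn=10, tn=80, fp=20)
print(m)
```

On imbalanced data, Acc is dominated by the majority class, so a predictor can combine high Sn with low Acc when the negative class is much larger; MCC takes all four counts into account.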
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Ho Thanh Lam, L.; Le, N.H.; Van Tuan, L.; Tran Ban, H.; Nguyen Khanh Hung, T.; Nguyen, N.T.K.; Huu Dang, L.; Le, N.Q.K. Machine Learning Model for Identifying Antioxidant Proteins Using Features Calculated from Primary Sequences. Biology 2020, 9, 325. https://doi.org/10.3390/biology9100325