Article

Alignment-Free Method to Predict Enzyme Classes and Subclasses

by Riccardo Concu * and M. Natália D. S. Cordeiro *
LAQV@REQUIMTE/Department of Chemistry and Biochemistry, Faculty of Sciences, University of Porto, 4169-007 Porto, Portugal
* Authors to whom correspondence should be addressed.
Int. J. Mol. Sci. 2019, 20(21), 5389; https://doi.org/10.3390/ijms20215389
Submission received: 9 September 2019 / Revised: 21 October 2019 / Accepted: 23 October 2019 / Published: 29 October 2019
(This article belongs to the Special Issue QSAR and Chemoinformatics Tools for Modeling)

Abstract: The Enzyme Classification (EC) number is a numerical classification scheme for enzymes, based on the chemical reactions they catalyze and established on the recommendation of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology. Six enzyme classes were recognised in the first Enzyme Classification and Nomenclature List, reported by the International Union of Biochemistry in 1961. However, a new enzyme class was recently added, because the six existing EC classes could not describe enzymes involved in the movement of ions or molecules across membranes; such enzymes are now classified in the new EC class of translocases (EC 7). Several computational methods have been developed to predict EC numbers. However, due to this change, all such methods are now outdated and need updating. In this work, we developed a new multi-task quantitative structure–activity relationship (QSAR) method aimed at predicting all seven EC classes and subclasses. In so doing, we developed an alignment-free model based on artificial neural networks that proved to be very successful.


1. Introduction

By the late 1950s, the International Union of Biochemistry (IUB, now the International Union of Biochemistry and Molecular Biology) had foreseen the need for a unique nomenclature for enzymes. In those years, the number of known enzymes was growing very rapidly and, in the absence of general guidelines, enzyme nomenclature was getting out of hand. In some cases, enzymes with similar names catalyzed different reactions while, conversely, different names were given to the same or similar enzymes. For this reason, during the third International Congress of Biochemistry in Brussels in August 1955, the General Assembly of the IUB decided to establish an International Commission in charge of developing a nomenclature for enzymes. In 1961, the IUB finally released the first version of the Enzyme Classification (EC) and Nomenclature List. This nomenclature assigns a four-number code to each enzyme, with the following meaning: (i) the first number identifies the main enzyme class; (ii) the second digit indicates the subclass; (iii) the third number denotes the sub-subclass; and (iv) the fourth digit is the serial number of the enzyme in its sub-subclass. Six enzyme classes were identified, with the classification based on the type of reaction catalyzed: oxidoreductases (EC 1), transferases (EC 2), hydrolases (EC 3), lyases (EC 4), isomerases (EC 5) and ligases (EC 6) [1]. Although several revisions have been made to the 1961 version, the six original classes remained unchanged until August 2018, when a new class was added. This new class contains the translocases (EC 7), and was introduced to describe those enzymes catalyzing the movement of ions or molecules across membranes or their separation within membranes. Consequently, some enzymes previously classified in other classes (EC 3.6.3, for example) are now included in the EC 7 class.
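For illustration only (this helper is hypothetical and not part of the original study), the four-level EC code can be parsed programmatically:

```python
# Hypothetical helper illustrating the four-level EC code described above.
EC_CLASSES = {
    1: "Oxidoreductases", 2: "Transferases", 3: "Hydrolases",
    4: "Lyases", 5: "Isomerases", 6: "Ligases", 7: "Translocases",
}

def describe_ec(code: str) -> dict:
    """Split an EC number such as '7.1.1.1' into its four levels."""
    main, sub, subsub, serial = (int(x) for x in code.split("."))
    return {
        "main_class": f"EC {main} ({EC_CLASSES[main]})",
        "subclass": f"{main}.{sub}",
        "sub_subclass": f"{main}.{sub}.{subsub}",
        "serial_number": serial,
    }

print(describe_ec("7.1.1.1"))
```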
Predicting enzyme classes or protein function with bioinformatic tools remains a key goal in bioinformatics and computational biology, owing to the prohibitive costs and time-consuming nature of wet-lab functional identification procedures. In fact, there are more than four thousand sequences whose function remains unknown, and this number is still growing [2]. The problem is that our ability to assign a specific function to a sequence lags far behind our ability to isolate and identify sequences. For this reason, significant efforts have been devoted to developing reliable methods for predicting protein function.
Several methodological strategies and tools based on different approaches have been proposed to classify enzymes [3,4,5,6,7,8,9,10]. The Basic Local Alignment Search Tool (BLAST) [11] is probably one of the most powerful and widely used: it finds regions of similarity between biological sequences by comparing nucleotide or protein sequences to sequence databases and calculating the statistical significance of the matches. However, as with all methods, these procedures may fail under certain conditions. In some cases, enzymes with a sequence similarity higher than 90% may belong to different enzyme families and, thus, have different EC annotations [12,13,14]. On the other hand, some enzymes which share the same first EC number may have a sequence similarity below 30%. Some authors have described this situation well and highlighted the need to develop alignment-free methods, which may be used in a complementary way [15,16]. Other relevant tools based on sequence similarity are the UniProtKB database [17], the Kyoto Encyclopedia of Genes and Genomes (KEGG) [18], the PEDANT protein database [19], DEEPre [20], ECPred [21] and EzyPred [22]. DEEPre is a three-level EC number predictor: it predicts whether an input protein sequence is an enzyme and, if so, its main class and subclass. The method was trained on a dataset of 22,198 sequences and achieves an overall accuracy of more than 90%. ECPred is another enzymatic function prediction tool, based on an ensemble of machine learning classifiers; its creators developed it using a dataset of approximately 245,000 proteins, achieving classification scores for the six EC classes and subclasses similar to those reported for DEEPre. EzyPred is a top-down approach for predicting enzyme classes and subclasses, built as a three-layer predictor on the ENZYME [23] dataset (approximately 9800 enzymes when the model was developed), and achieves an overall accuracy above 86%. Other relevant methods with similar classification scores have also been reported [10,15,20,24,25]. All these methods have proved to be robust; however, they are all outdated since they cannot predict the EC 7 class, and should therefore be updated in accordance with the new EC classification.
In light of the above, the main goal of this work was to develop an alignment-free strategy using machine learning (ML) methods to predict the first two digits of the seven EC classes. Previous ML methods have used alignment-free numerical parameters to quantify information about the 2D or 3D structure of proteins [26,27,28,29]. Specifically, Graham, Bonchev, Marrero-Ponce, and others [30,31,32,33,34] used Shannon's entropy measures to quantify relevant structural information about molecular systems. In addition, González-Díaz et al. [35,36,37] introduced the so-called Markov–Shannon entropies (θk) to codify the structural information of large bio-molecules and complex bio-systems or networks. For comparative purposes, we developed different linear and non-linear models, including a linear discriminant analysis (LDA) and various types of artificial neural networks (ANNs). In addition, we focused our work on performing an efficient feature selection (FS). Nowadays, several software packages and tools can calculate thousands of molecular descriptors (MDs); as a result, a proper FS method is essential to develop robust and reliable quantitative structure–activity relationship (QSAR) models. This is particularly true when using ANNs, since QSAR models built on a large set of MDs are highly complex, vulnerable to overfitting and difficult to interpret mechanistically [38,39].

2. Results

2.1. LDA Model

As a first step, we used the LDA algorithm implemented in the software STATISTICA® [40] to derive a linear model able to discriminate all of the enzyme subclasses using a multi-task approach, meaning that a single model was developed to assign each enzyme to a specific class. From the initial pool of more than 200 variables, we selected four that clearly influenced the model using a supervised forward stepwise analysis. In order to validate the model, we split our dataset, assigning 70% of the entries to the training set and the remaining 30% to the validation set; the latter was used to validate the model through a cross-validation procedure. The LDA model had the following overall values: specificity Sp = 99.71%, sensitivity Sn = 98.16% and accuracy Acc = 98.66%. In the training series, the model displayed Sp = 99.71%, Sn = 98.13% and Acc = 98.63%, while in the validation series it reached Sp = 99.71%, Sn = 98.27% and Acc = 98.73%. All of these statistics are reported in Table 1.
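As a rough sketch of this workflow outside STATISTICA® (an assumption of this example, which uses scikit-learn and random placeholder data instead of the actual descriptors), the 70/30 split and LDA fit could look as follows:

```python
# Sketch of the LDA workflow with a 70/30 split; X and y are random
# placeholders standing in for the four selected descriptors and the
# class labels (scikit-learn stands in for STATISTICA here).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))      # placeholder descriptor matrix (4 MDs)
y = rng.integers(0, 2, size=1000)   # placeholder labels (enzyme / non-enzyme)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.30, random_state=0)

lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
print(f"Validation accuracy: {lda.score(X_va, y_va):.4f}")
```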
The linear equation (Equation (1)) for this model is shown below, and information regarding its variables is given in Table 6:
$$ EC = \langle Tr_{3}(srn)\rangle \cdot 0.95 + \langle Tr_{5}(srn)\rangle \cdot 0.80 + D\,Tr_{5}(srn) \cdot 0.80 + D\,Tr_{3}(srn) \cdot 1.01 - 2.05 \qquad (1) $$
Other relevant statistics for the LDA model (both training and validation), such as the Wilk’s lambda and Matthews correlation coefficient (MCC), are reported in Table 2.
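For reference, these statistics follow the standard confusion-matrix definitions; a minimal sketch (with placeholder labels, not the paper's data) is:

```python
# Standard definitions of Sn, Sp, Acc and MCC from a 2x2 confusion matrix;
# y_true / y_pred are placeholder labels for illustration only.
from sklearn.metrics import confusion_matrix, matthews_corrcoef

y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sn = tp / (tp + fn)                    # sensitivity
sp = tn / (tn + fp)                    # specificity
acc = (tp + tn) / (tp + tn + fp + fn)  # accuracy
mcc = matthews_corrcoef(y_true, y_pred)
print(f"Sn={sn:.2f} Sp={sp:.2f} Acc={acc:.2f} MCC={mcc:.2f}")
```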

2.2. ANN Models

We then moved a step forward and developed non-linear models using various neural network architectures. We first investigated ANN models using either the multi-layer perceptron (MLP) algorithm or the radial basis function (RBF) [41,42,43,44,45,46]. To do so, we ran a set of 50 ANN-MLP models in order to identify the best topology and architecture. The best model found had an MLP 4-9-2 topology and was developed using the same four variables as the LDA model. It was able to correctly classify 100% of the cases in both the training and validation series. Table 3 shows the statistical parameters obtained for this model. As can be seen, the MCC value was, as expected, 1.
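A 4-9-2 topology corresponds to four inputs, nine hidden neurons and two output classes; a sketch (assuming scikit-learn in place of STATISTICA's ANN module, with the same placeholder data as the LDA sketch) could be:

```python
# Sketch of an MLP 4-9-2 topology: 4 inputs, one hidden layer of 9 neurons,
# 2 output classes; data setup repeats the placeholder LDA sketch.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = rng.integers(0, 2, size=1000)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.30, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(9,), activation="logistic",
                    max_iter=1000, random_state=0).fit(X_tr, y_tr)
print(f"Validation accuracy: {mlp.score(X_va, y_va):.4f}")
```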
For comparative purposes, Table 4 reports the statistics of the 10 best MLP and RBF models found.
The results reported in Table 4 clearly indicate that the MLP models perform better than the RBF ones. Even though the best MLP model achieved 100% overall accuracy, we performed a quantitative analysis to identify where the remaining MLP models were failing. As can be seen in Table 5, the non-optimal MLP models were particularly problematic in discriminating the EC 6.5 subclass.
Finally, a sensitivity analysis was also performed to assess the influence of the MDs in the model. The results of this analysis are shown in Table 6.
Sensitivity analysis assesses the importance of the predictors in a developed model, with higher sensitivity values assigned to the most important predictors. As seen, the high sensitivity values found for some of the parameters indicate that the model's performance would drop drastically if those parameters were removed. On the other hand, parameters with low sensitivity values may be discarded, since they are not relevant to the performance of the model and may lead to an overfitted model. The variables presented in Table 6 are traces of the connectivity matrices of the amino acid sequences; the numbers 3 and 5 denote the order (power) of the matrix used in the calculation. The angle brackets ("< >") denote the mean value of each subclass, while "D" stands for the difference (or distance) between each amino acid sequence and the mean value of its subclass. In other words, to correctly predict each sequence as an enzyme and then assign it to a specific subclass, the model calculates the distance between each input and the mean of its subclass; this is, in fact, how a multi-target model works.
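One way to approximate such a sensitivity analysis (permutation importance rather than STATISTICA's built-in sensitivity ratios, which is an assumption of this sketch) is to measure how much validation performance drops when a single descriptor is shuffled:

```python
# Permutation-importance sketch of a sensitivity analysis, reusing mlp,
# X_va and y_va from the MLP sketch above; descriptor names follow Table 6.
from sklearn.inspection import permutation_importance

result = permutation_importance(mlp, X_va, y_va, n_repeats=10, random_state=0)
names = ["<Tr3(srn)>", "<Tr5(srn)>", "D Tr3(srn)", "D Tr5(srn)"]
for name, drop in sorted(zip(names, result.importances_mean),
                         key=lambda t: -t[1]):
    print(f"{name}: mean accuracy drop = {drop:.4f}")
```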

3. Discussion

The main aim of this study was to develop a new QSAR-ML model able to predict enzyme subclasses, taking into account the recently introduced EC class 7. We retrieved from the Protein Data Bank (PDB) more than 26,000 enzyme and 55,000 non-enzyme sequences in order to build our dataset. All of the enzyme sequences belonged to one of the 7 main classes and 65 subclasses. Since the EC 7 class was introduced only a few months ago, none of the current models include it; as a result, the classifications or predictions such models perform may be misleading. Hence, the development of new models capable of predicting all enzyme classes and subclasses, including EC 7, is of utmost importance. In view of this, we developed a new machine learning model able to discriminate between enzymes and non-enzymes and, in addition, to assign enzymes to a specific enzyme subclass. We generated linear and non-linear models using alignment-free variables to find the best model for predicting EC classes and subclasses. The results of the linear model were impressive: with only four MDs, the model could discriminate between enzymes and non-enzymes, as well as assign a specific EC class and subclass to each enzyme sequence. We checked the accuracy and robustness of the model, and the results clearly indicate that it is reliable. Regarding validation, we performed a classical cross-validation procedure using 30% of the dataset, which led to almost identical results for the training and validation sets, once more indicating the robustness of the model and the approach.
Although the accuracy of the derived LDA model was near 100%, we decided to further test our approach by developing neural network models, which usually improve on LDA results. The MLP is generally considered one of the best-performing ANN algorithms and, in this case, had the potential to improve our linear model. As reported above, the MLP was able to perfectly discriminate between enzymes and non-enzymes, in addition to assigning each enzyme sequence to a specific subclass. It is also remarkable that the best model needed only nine neurons in the hidden layer. This low number of neurons, considering the number of sequences and variables, suggests that the model does not suffer from an overfitting problem. Mechanistic interpretation of ANN models is always challenging, since these models do not reduce to simple linear equations. A sensitivity analysis may then be used to analyze the influence of each MD on the model. For the ANN model, we carried out such an analysis to evaluate the weight of each variable; this analysis is also useful for identifying redundant variables, assisting in their elimination to avoid overfitting. In the case of the ANN model, we found that the same four variables used in the LDA model were able to perfectly discriminate between enzymes and non-enzymes and assign each enzyme sequence to a specific subclass.
Finally, we also tested RBF models, which gave worse results than the MLP models. In fact, their overall accuracy was lower than that of the MLP models, which usually need fewer neurons to achieve greater accuracy.

4. Materials and Methods

4.1. Dataset

From the PDB, we retrieved a total of 81,486 protein FASTA sequences. Of those, 26,073 were enzymes, while 55,413 were non-enzymes (α-proteins, β-proteins, membrane proteins, and so forth). Each of the 26,073 enzyme sequences belonged to one of the 65 enzyme subclasses. In order to avoid redundant sequences, we selected the enzymes using the specific EC classification query module of the PDB and then double-checked the dataset, eliminating duplicate entries. Regarding the non-enzyme sequences, we randomly downloaded protein sequences belonging to different classes, such as membrane proteins, multi-domain proteins, and α- and β-proteins. The complete list of EC subclasses is reported in Supplemental Material S1, while Table 7 reports the number of entries for each subclass.

4.2. Molecular Descriptor Calculation

The software S2SNet [47] was used to transform each protein sequence into one sequence recurrence network (SRN). The SRN of a protein sequence can be constructed starting from one of two directions: (1) from a sequence graph with linear topology by adding amino acid recurrence information, or (2) from a protein representation graph with star graph (SG) topology by adding sequence information [48,49,50,51,52]. Note that, in both of these SRN representations of a protein sequence, the amino acids are the nodes and are paired (na and nb) in the network (being connected by a link, αab = 1) if they are adjacent and/or neighbour recurrent nodes. This means that αab = 1 if the topological distance between na and nb is d = 1 (chemically bonded amino acids), or if they are the nearest neighbour amino acid of the same type (A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V, X) with minimal topological distance, dab = min(dab), between them. The first node in the sequence (centre of the star graph) is a bias or a dummy non-residue vertex.
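A minimal sketch of this construction for the linear-topology variant, under our reading of the recurrence rule (backbone links plus a link from each residue to the nearest later residue of the same type; the use of networkx is an assumption of this example, not part of S2SNet):

```python
# Sketch of a sequence recurrence network (SRN) with linear topology:
# adjacent residues are linked (topological distance d = 1), and each
# residue is linked to the nearest later occurrence of the same type.
import networkx as nx

def build_srn(seq: str) -> nx.Graph:
    g = nx.Graph()
    g.add_nodes_from(range(len(seq)))
    g.add_edges_from((i, i + 1) for i in range(len(seq) - 1))  # backbone
    for i, aa in enumerate(seq):
        nxt = seq.find(aa, i + 1)   # nearest recurrence of the same type
        if nxt != -1:
            g.add_edge(i, nxt)
    return g

srn = build_srn("MKTAYIAKQR")
print(srn.number_of_nodes(), srn.number_of_edges())
```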
Secondly, we needed to transform the SRN of each sequence into a stochastic matrix 1Π. The elements of 1Π were obtained by considering the probability (pab) of reaching an amino acid (node nb) by walking from another amino acid (node na) through a walk of length dab = 1 (Equation (2)):
$$ p_{ab} = \frac{\alpha_{ab}}{\sum_{n=1}^{L} \alpha_{an}} \qquad (2) $$
Note that the number of amino acids in the sequences was equal to the number of nodes (n) in the SRN graph, and was also equal to the number of rows and columns in 1Π, the length of the sequence (L), and the maximal topological distance in the sequence max(dab). In this work, we quantified the information content of a peptide using the Shannon entropy values (θk) of the k-th natural powers of the Markov matrix 1Π. The same procedure was used to quantify the information of the q-seqs (qθk) and r-seqs (rθk). The formula for the Markov–Shannon entropy qθk is as follows (Equation (3)):
$$ {}^{q}\theta_{k}(seq) = -\sum_{a=0}^{a=L} p_{k}(a) \log p_{k}(a) \qquad (3) $$
where p_k(a) represents the absolute probability of reaching node a through a walk of length k with respect to any node in the spectral graph. Further details of this formula can be found in previous works [35,36,37].
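As a sketch of Equations (2) and (3) (assuming a uniform initial distribution over the nodes, which is an assumption of this example; the trace descriptors Trk(srn) in Table 6 come from the same matrix powers), the computation could be written as:

```python
# Sketch of Eqs. (2)-(3): build the stochastic matrix 1Π from the SRN
# adjacency matrix, raise it to the k-th power, and compute the Shannon
# entropy of the resulting node probabilities; build_srn comes from the
# SRN sketch above.
import networkx as nx
import numpy as np

def markov_shannon_entropy(g: nx.Graph, k: int) -> float:
    a = nx.to_numpy_array(g)                  # adjacency matrix (alpha_ab)
    pi1 = a / a.sum(axis=1, keepdims=True)    # Eq. (2): row-normalized 1Π
    pik = np.linalg.matrix_power(pi1, k)      # k-th natural power of 1Π
    p = np.full(len(a), 1.0 / len(a)) @ pik   # absolute probabilities p_k(a)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())      # Eq. (3): Shannon entropy

srn = build_srn("MKTAYIAKQR")
print(markov_shannon_entropy(srn, 3), markov_shannon_entropy(srn, 5))
# The related trace descriptors would be np.trace of the same matrix powers.
```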
In the Supplemental Material S2, we report the complete list of sequence entries with the respective value of the MD used to develop the models.

4.3. Multi-Target Linear Model

The LDA model was developed using the General Discriminant tool implemented in the software STATISTICA [40]. The model is based on a multi-task approach, meaning it is able to predict whether a sequence belongs to one of the seven EC classes. It starts by identifying the presence of enzyme activity εq(ci) = 1 of subclass ci (or the absence of this activity, εq(ci) = 0) for a query protein with a known amino acid sequence. The linear model is based on a linear equation that directly correlates the dependent variable (enzyme or not) with the independent variables (MDs). The multi-target LDA model was developed as follows: once the MDs were calculated, we computed the mean value of each subclass and then the difference between each sequence and the mean value of its subclass. Because the model incorporates both the mean value of each subclass and the deviation of each sequence from that mean, it is able to achieve a multi-target prediction. For further information regarding this statistical technique, please refer to the bibliography [53,54,55]. The same procedure was also used for the development of the multi-target ANN model. The validation of the model was performed using the cross-validation module implemented in the software, which is aimed at assessing the predictive accuracy of a model. The procedure splits the dataset into a training set and a validation set, ensuring that no entry used for training is also used for validation. The model was thus developed using the cases in the training (learning) sample, which in our study comprised 70% of the dataset, and its predictive accuracy was then assessed on the remaining 30% [56,57]. Standard statistics, such as the specificity (Sp), sensitivity (Sn), probability of error (p), cross-validation, and the Matthews correlation coefficient (MCC) [58], were used to assess the discriminatory power of the model.
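A sketch of this descriptor construction (pandas, with purely illustrative values) makes the multi-target encoding explicit: each sequence carries its subclass mean <MD> and its deviation D MD from that mean:

```python
# Sketch of the multi-target descriptors: per-subclass means (<MD>) and
# per-sequence deviations from those means (D MD); values are illustrative.
import pandas as pd

df = pd.DataFrame({
    "subclass": ["1.1", "1.1", "2.7", "2.7"],
    "Tr3": [10.0, 12.0, 30.0, 34.0],
    "Tr5": [5.0, 7.0, 20.0, 22.0],
})

for md in ("Tr3", "Tr5"):
    df[f"<{md}>"] = df.groupby("subclass")[md].transform("mean")
    df[f"D {md}"] = df[md] - df[f"<{md}>"]

print(df)
```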

4.4. Non-Linear Models

The non-linear models were developed using the neural network tool implemented in the software STATISTICA. In order to identify the best topology and architecture, we ran a large set of 50 models with various topologies; this step is crucial to avoid an (albeit unlikely) overfitting problem. We examined RBF and MLP networks, since these usually perform better than other algorithms. The models were validated using the cross-validation tool implemented in the ANN module of STATISTICA. In this procedure, the software automatically assigns 70% of the dataset to training; once the model is trained, the remaining 30% of the inputs are used for validation. Note that an entry used in the training set cannot be used in the validation series.

5. Conclusions

Developing new, reliable, and robust methods for predicting protein function and enzyme class and subclasses is a key goal for theoreticians, especially in light of the recently introduced EC 7 class. In this work, we developed linear and non-linear models using an alignment-free approach to discriminate between enzymes and non-enzymes, as well as assign each enzyme sequence to a specific EC class. The best LDA model showed an overall accuracy of 98.63%, which is considered a remarkable result. However, we decided to explore further and develop some non-linear models using two different algorithms: MLP and RBF. While the latter was unable to improve the results of the LDA model, the MLP model was able to achieve an overall accuracy of 100%. This means that it was able to perfectly discriminate between enzymes and non-enzymes and identify the EC class of each enzyme.

Supplementary Materials

Supplementary materials can be found at https://www.mdpi.com/1422-0067/20/21/5389/s1.

Author Contributions

Conceptualization, R.C.; Data curation, R.C.; Investigation, R.C.; Methodology, R.C.; Software, R.C.; Supervision, M.N.D.S.C.; Validation, M.N.D.S.C.; Writing—original draft, R.C.; Writing—review & editing, M.N.D.S.C.

Funding

This work was supported by UID/QUI/50006/2019 with funding from FCT/MCTES through national funds.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Enzyme Nomenclature: Recommendations (1972) of the International Union of Pure and Applied Chemistry and the International Union of Biochemistry: Supplement 1: Corrections & Additions (1975). Biochim. Biophys. Acta Enzymol. 1976, 429, 1–45.
  2. Rose, P.W.; Prlić, A.; Altunkaya, A.; Bi, C.; Bradley, A.R.; Christie, C.H.; Di Costanzo, L.; Duarte, J.M.; Dutta, S.; Feng, Z.; et al. The RCSB Protein Data Bank: Integrative view of protein, gene and 3D structural information. Nucleic Acids Res. 2016, 45, D271–D281.
  3. Jensen, L.J.; Gupta, R.; Blom, N.S.; Devos, D.; Tamames, J.; Kesmir, C.; Nielsen, H.; Stærfeldt, H.; Rapacki, K.; Workman, C.; et al. Prediction of human protein function from post-translational modifications and localization features. J. Mol. Biol. 2002, 319, 1257–1265.
  4. Davidson, N.J.; Wang, X. Non-alignment features based enzyme/non-enzyme classification using an ensemble method. In Proceedings of the Ninth International Conference on Machine Learning and Applications, Washington, DC, USA, 12–14 December 2010; pp. 546–551.
  5. Wang, Y.C.; Wang, X.B.; Yang, Z.X.; Deng, N.Y. Prediction of enzyme subfamily class via pseudo amino acid composition by incorporating the conjoint triad feature. Protein Pept. Lett. 2010, 17, 1441–1449.
  6. Concu, R.; Dias Soeiro Cordeiro, M.; Munteanu, C.R.; Gonzalez-Diaz, H. PTML model of enzyme subclasses for mining the proteome of bio-fuel producing microorganisms. J. Proteome Res. 2019, 18, 2735–2746.
  7. Dobson, P.D.; Doig, A.J. Distinguishing enzyme structures from non-enzymes without alignments. J. Mol. Biol. 2003, 330, 771–783.
  8. Che, Y.; Ju, Y.; Xuan, P.; Long, R.; Xing, F. Identification of multi-functional enzyme with multi-label classifier. PLoS ONE 2016, 11, e0153503.
  9. Amidi, A.; Amidi, S.; Vlachakis, D.; Megalooikonomou, V.; Paragios, N.; Zacharaki, E.I. EnzyNet: Enzyme classification using 3D convolutional neural networks on spatial representation. PeerJ 2018, 6, e4750.
  10. Hu, Q.N.; Zhu, H.; Li, X.; Zhang, M.; Deng, Z.; Yang, X.; Deng, Z. Assignment of EC numbers to enzymatic reactions with reaction difference fingerprints. PLoS ONE 2012, 7, e52901.
  11. Cock, P.J.A.; Chilton, J.M.; Grüning, B.; Johnson, J.E.; Soranzo, N. NCBI BLAST+ integrated into Galaxy. GigaScience 2015, 4, 39.
  12. Todd, A.E.; Orengo, C.A.; Thornton, J.M. Evolution of function in protein superfamilies, from a structural perspective. J. Mol. Biol. 2001, 307, 1113–1143.
  13. Tian, W.; Skolnick, J. How well is enzyme function conserved as a function of pairwise sequence identity? J. Mol. Biol. 2003, 333, 863–882.
  14. Rost, B.; Liu, J.; Nair, R.; Wrzeszczynski, K.O.; Ofran, Y. Automatic prediction of protein function. Cell. Mol. Life Sci. 2003, 60, 2637–2650.
  15. Nagao, C.; Nagano, N.; Mizuguchi, K. Prediction of detailed enzyme functions and identification of specificity determining residues by random forests. PLoS ONE 2014, 9, e84623.
  16. Quester, S.; Schomburg, D. EnzymeDetector: An integrated enzyme function prediction tool and database. BMC Bioinform. 2011, 12, 376.
  17. The UniProt Consortium. Ongoing and future developments at the Universal Protein Resource. Nucleic Acids Res. 2011, 39, D214–D219.
  18. Kanehisa, M. From genomics to chemical genomics: New developments in KEGG. Nucleic Acids Res. 2006, 34, D354–D357.
  19. Frishman, D.; Mokrejs, M.; Kosykh, D.; Kastenmüller, G.; Kolesov, G.; Zubrzycki, I.; Gruber, C.; Geier, B.; Kaps, A.; Albermann, K.; et al. The PEDANT genome database. Nucleic Acids Res. 2003, 31, 207–211.
  20. Li, Y.; Wang, S.; Umarov, R.; Xie, B.; Fan, M.; Li, L.; Gao, X. DEEPre: Sequence-based enzyme EC number prediction by deep learning. Bioinformatics 2018, 34, 760–769.
  21. Dalkiran, A.; Rifaioglu, A.S.; Martin, M.J.; Cetin-Atalay, R.; Atalay, V.; Doğan, T. ECPred: A tool for the prediction of the enzymatic functions of protein sequences based on the EC nomenclature. BMC Bioinform. 2018, 19, 334.
  22. Shen, H.B.; Chou, K.C. EzyPred: A top-down approach for predicting enzyme functional classes and subclasses. Biochem. Biophys. Res. Commun. 2007, 364, 53–59.
  23. Bairoch, A. The ENZYME data bank. Nucleic Acids Res. 1993, 21, 3155–3156.
  24. Kumar, C.; Choudhary, A. A top-down approach to classify enzyme functional classes and sub-classes using random forest. EURASIP J. Bioinform. Syst. Biol. 2012, 2012, 1.
  25. Matsuta, Y.; Ito, M.; Tohsato, Y. ECOH: An Enzyme Commission number predictor using mutual information and a support vector machine. Bioinformatics 2013, 29, 365–372.
  26. Agüero-Chapin, G.; González-Díaz, H.; Molina, R.; Varona-Santos, J.; Uriarte, E.; González-Díaz, Y. Novel 2D maps and coupling numbers for protein sequences. The first QSAR study of polygalacturonases; isolation and prediction of a novel sequence from Psidium guajava L. FEBS Lett. 2006, 580, 723–730.
  27. Concu, R.; Dea-Ayuela, M.; Pérez-Montoto, L.G.; Prado-Prado, F.J.; Uriarte, E.; Fernandez, F.B.; Podda, G.; Pazos, A.; Munteanu, C.-R.; Ubeira, F.; et al. 3D entropy and moments prediction of enzyme classes and experimental-theoretic study of peptide fingerprints in Leishmania parasites. Biochim. Biophys. Acta Proteins Proteom. 2009, 1794, 1784–1794.
  28. Concu, R.; Dea-Ayuela, M.A.; Pérez-Montoto, L.G.; Bolas-Fernández, F.; Prado-Prado, F.J.; Podda, G.; Uriarte, E.; Ubeira, F.M.; González-Díaz, H. Prediction of enzyme classes from 3D structure: A general model and examples of experimental-theoretic scoring of peptide mass fingerprints of Leishmania proteins. J. Proteome Res. 2009, 8, 4372–4382.
  29. Bernardes, J.S.; Pedreira, C.E. A review of protein function prediction under machine learning perspective. Recent Pat. Biotechnol. 2013, 7, 122–141.
  30. Barigye, S.J.; Marrero-Ponce, Y.; Pérez-Giménez, F.; Bonchev, D. Trends in information theory-based chemical structure codification. Mol. Divers. 2014, 18, 673–686.
  31. Graham, D.J.; Malarkey, C.; Schulmerich, M.V. Information content in organic molecules: Quantification and statistical structure via Brownian processing. J. Chem. Inf. Comput. Sci. 2004, 35, 44.
  32. Graham, D.J.; Schacht, D. Base information content in organic molecular formulae. J. Chem. Inf. Comput. Sci. 2000, 40, 942.
  33. Graham, D.J. Information content and organic molecules: Aggregation states and solvent effects. J. Chem. Inf. Model. 2005, 45, 1223–1236.
  34. Graham, D.J. Information content in organic molecules: Brownian processing at low levels. J. Chem. Inf. Model. 2007, 38, 376–389.
  35. González-Díaz, H.; Molina, R.; Uriarte, E. Markov entropy backbone electrostatic descriptors for predicting proteins biological activity. Bioorg. Med. Chem. Lett. 2004, 14, 4691–4695.
  36. González-Díaz, H.; Saíz-Urra, L.; Molina, R.; Santana, L.; Uriarte, E. A model for the recognition of protein kinases based on the entropy of 3D van der Waals interactions. J. Proteome Res. 2007, 6, 904–908.
  37. Riera-Fernandez, P.; Munteanu, C.-R.; Escobar, M.; Prado-Prado, F.J.; Martín-Romalde, R.; Pereira, D.; Villalba, K.; Duardo-Sánchez, A.; González-Díaz, H. New Markov–Shannon entropy models to assess connectivity quality in complex networks: From molecular to cellular pathway, parasite–host, neural, industry, and legal–social networks. J. Theor. Biol. 2012, 293, 174–188.
  38. Cherkasov, A.; Muratov, E.N.; Fourches, D.; Varnek, A.; Baskin, I.I.; Cronin, M.; Dearden, J.; Gramatica, P.; Martin, Y.C.; Todeschini, R.; et al. QSAR modeling: Where have you been? Where are you going to? J. Med. Chem. 2014, 57, 4977–5010.
  39. Basak, S.C.; Natarajan, R.; Mills, D.; Hawkins, D.M.; Kraker, J.J. Quantitative structure–activity relationship modeling of juvenile hormone mimetic compounds for Culex pipiens larvae, with a discussion of descriptor-thinning methods. J. Chem. Inf. Model. 2006, 37, 65–77.
  40. Hill, T.; Lewicki, P. Statistics Methods and Applications. In A Comprehensive Reference for Science, Industry and Data Mining; StatSoft: Tulsa, OK, USA, 2006; Volume 1, p. 813.
  41. Shahsavari, S.; Bagheri, G.; Mahjub, R.; Bagheri, R.; Radmehr, M.; Rafiee-Tehrani, M.; Dorkoosh, F.A. Application of artificial neural networks for optimization of preparation of insulin nanoparticles composed of quaternized aromatic derivatives of chitosan. Drug Res. 2014, 64, 151–158.
  42. Tenorio-Borroto, E.; Rivas, C.G.P.; Chagoyán, J.C.V.; Castañedo, N.; Prado-Prado, F.J.; Garcia-Mera, X.; González-Díaz, H. ANN multiplexing model of drugs effect on macrophages; theoretical and flow cytometry study on the cytotoxicity of the anti-microbial drug G1 in spleen. Bioorg. Med. Chem. 2012, 20, 6181–6194.
  43. Honório, K.M.; De Lima, E.F.; Quiles, M.G.; Romero, R.A.F.; Molfetta, F.A.; Da Silva, A.B.F. Artificial neural networks and the study of the psychoactivity of cannabinoid compounds. Chem. Biol. Drug Des. 2010, 75, 632–640.
  44. Jung, E.; Choi, S.H.; Lee, N.K.; Kang, S.K.; Choi, Y.J.; Shin, J.M.; Choi, K.; Jung, D.H. Machine learning study for the prediction of transdermal peptide. J. Comput. Aided Mol. Des. 2011, 25, 339–347.
  45. Erol, R.; Ogulata, S.N.; Sahin, C.; Alparslan, Z.N. A radial basis function neural network (RBFNN) approach for structural classification of thyroid diseases. J. Med. Syst. 2008, 32, 215–220.
  46. Bezerianos, A.; Papadimitriou, S.; Alexopoulos, D. Radial basis function neural networks for the characterization of heart rate variability dynamics. Artif. Intell. Med. 1999, 15, 215–234.
  47. Munteanu, C.-R.; Magalhaes, A.; Duardo-Sánchez, A.; Pazos, A.; González-Díaz, H. S2SNet: A tool for transforming characters and numeric sequences into star network topological indices in chemoinformatics, bioinformatics, biomedical, and social-legal sciences. Curr. Bioinform. 2013, 8, 429–437.
  48. Vazquez, J.; Aguiar, V.; Seoane, J.A.; Freire, A.; Serantes, J.; Dorado, J.; Pazos, A.; Munteanu, C.-R. Star graphs of protein sequences and proteome mass spectra in cancer prediction. Curr. Proteom. 2009, 6, 275–288.
  49. Randić, M.; Zupan, J.; Vikic-Topic, D. On representation of proteins by star-like graphs. J. Mol. Graph. Model. 2007, 26, 290–305.
  50. Fernández-Blanco, E.; Aguiar-Pulido, V.; Munteanu, C.R.; Dorado, J. Random Forest classification based on star graph topological indices for antioxidant proteins. J. Theor. Biol. 2013, 317, 331–337.
  51. Fernandez-Lozano, C.; Cuiñas, R.F.; Seoane, J.A.; Fernández-Blanco, E.; Dorado, J.; Munteanu, C.-R. Classification of signaling proteins based on molecular star graph descriptors using machine learning models. J. Theor. Biol. 2015, 384, 50–58.
  52. Munteanu, C.R.; González-Díaz, H.; Magalhães, A.L. Enzymes/non-enzymes classification model complexity based on composition, sequence, 3D and topological indices. J. Theor. Biol. 2008, 254, 476–482.
  53. Wang, H.; Yan, L.; Huang, H.; Ding, C. From protein sequence to protein function via multi-label linear discriminant analysis. IEEE/ACM Trans. Comput. Biol. Bioinform. 2017, 14, 503–513.
  54. Hendryli, J.; Fanany, M.I. Classifying abnormal activities in exam using multi-class Markov chain LDA based on MODEC features. In Proceedings of the 4th International Conference on Information and Communication Technology (ICoICT), Bandung, Indonesia, 25–27 May 2016; pp. 1–6.
  55. Safo, S.E.; Ahn, J. General sparse multi-class linear discriminant analysis. Comput. Stat. Data Anal. 2016, 99, 81–90.
  56. Beleites, C.; Salzer, R. Assessing and improving the stability of chemometric models in small sample size situations. Anal. Bioanal. Chem. 2008, 390, 1261–1271.
  57. Ion-Mărgineanu, A.; Kocevar, G.; Stamile, C.; Sima, D.M.; Durand-Dubief, F.; Van Huffel, S.; Sappey-Marinier, D. Machine learning approach for classifying multiple sclerosis courses by combining clinical data with lesion loads and magnetic resonance metabolic features. Front. Mol. Neurosci. 2017, 11, 398.
  58. Boughorbel, S.; Jarray, F.; El-Anbari, M. Optimal classifier for imbalanced data using Matthews correlation coefficient metric. PLoS ONE 2017, 12, e0177678.
Table 1. Accuracy for the linear discriminant analysis (LDA) model.

| Observed Class | Training % | Training Pred. −1 | Training Pred. 1 | Validation % | Validation Pred. −1 | Validation Pred. 1 | Overall % | Overall Pred. −1 | Overall Pred. 1 |
|---|---|---|---|---|---|---|---|---|---|
| −1 (Sn) | 98.13 | 40,781 | 778 | 98.27 | 13,613 | 240 | 98.16 | 54,394 | 1018 |
| 1 (Sp) | 99.7 | 57 | 19,498 | 99.71 | 19 | 6498 | 99.71 | 76 | 25,996 |
| Total (Acc) | 98.63 | 40,838 | 20,276 | 98.73 | 13,632 | 6738 | 98.66 | 54,470 | 27,014 |
Table 2. Relevant statistics for the LDA model.

| Eigenvalue | Canonical R | Wilk's Lambda | Chi-Sqr. | df | p-value | MCC |
|---|---|---|---|---|---|---|
| 1.241879 | 0.744275 | 0.446054 | 49,334.99 | 4 | 0.00 | 0.97 |
Table 3. Performance of the best multi-layer perceptron (MLP) model found.

| Series | Obs. Set a | Stat. Param. a | Pred. Stat. a (%) | Predicted 1 | Predicted −1 | nj |
|---|---|---|---|---|---|---|
| Training | 1 | Sp | 100 | 17,500 | 0 | 57,039 |
| Training | −1 | Sn | 100 | 0 | 39,539 | 0 |
| Training | Total | Ac | 100 | 17,500 | 39,539 | 57,039 |
| Validation | 1 | Sp | 100 | 8572 | 0 | 24,445 |
| Validation | −1 | Sn | 100 | 0 | 15,873 | 0 |
| Validation | Total | Ac | 100 | 8572 | 15,873 | 24,445 |
| Overall | 1 | Sp | 100 | 26,072 | 0 | 81,484 |
| Overall | −1 | Sn | 100 | 0 | 55,412 | 0 |
| Overall | Total | Ac | 100 | 26,072 | 55,412 | 81,484 |

a Obs. Set = observed set; Stat. Param. = statistical parameter; Pred. Stat. = predicted statistic; Sp = specificity; Sn = sensitivity; Ac = accuracy.
Table 4. Summary of the 10 best MLP and radial basis function (RBF) models (Tr = training, Va = validation, Ov = overall; −1 = Sn, 1 = Sp).

| Model | Statistic | Tr −1 | Tr 1 | Tr All | Va −1 | Va 1 | Va All | Ov −1 | Ov 1 | Ov All |
|---|---|---|---|---|---|---|---|---|---|---|
| BEST MLP 4-9-2 | Total | 55,412 | 26,072 | 81,484 | 55,412 | 26,072 | 81,484 | 55,412 | 26,072 | 81,484 |
| | Correct | 55,412 | 26,072 | 81,484 | 55,412 | 26,072 | 81,484 | 55,412 | 26,072 | 81,484 |
| | Incorrect | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| | Correct (%) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| | Incorrect (%) | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 1. MLP 4-7-2 | Total | 39,448 | 17,591 | 57,039 | 15,873 | 8572 | 24,445 | 55,412 | 26,072 | 81,484 |
| | Correct | 39,448 | 17,567 | 57,015 | 15,873 | 8562 | 24,435 | 55,412 | 26,034 | 81,446 |
| | Incorrect | 0 | 24 | 24 | 0 | 10 | 10 | 0 | 38 | 38 |
| | Correct (%) | 100 | 99.86 | 99.96 | 100.00 | 99.88 | 99.96 | 100.00 | 99.85 | 99.95 |
| | Incorrect (%) | 0 | 0.14 | 0.04 | 0.00 | 0.12 | 0.04 | 0.00 | 0.15 | 0.05 |
| 2. MLP 4-8-2 | Total | 39,448 | 17,591 | 57,039 | 15,873 | 8572 | 24,445 | 55,412 | 26,072 | 81,484 |
| | Correct | 39,448 | 17,565 | 57,013 | 15,873 | 8563 | 24,436 | 55,412 | 26,037 | 81,449 |
| | Incorrect | 0 | 26 | 26 | 0 | 9 | 9 | 0 | 35 | 35 |
| | Correct (%) | 100 | 99.85 | 99.95 | 100.00 | 99.90 | 99.96 | 100.00 | 99.87 | 99.96 |
| | Incorrect (%) | 0 | 0.15 | 0.05 | 0.00 | 0.10 | 0.04 | 0.00 | 0.13 | 0.04 |
| 3. MLP 4-10-2 | Total | 39,448 | 17,591 | 57,039 | 15,873 | 8572 | 24,445 | 55,412 | 26,072 | 81,484 |
| | Correct | 39,448 | 17,565 | 57,013 | 15,873 | 8563 | 24,436 | 55,412 | 26,037 | 81,449 |
| | Incorrect | 0 | 26 | 26 | 0 | 9 | 9 | 0 | 35 | 35 |
| | Correct (%) | 100 | 99.85 | 99.95 | 100.00 | 99.90 | 99.96 | 100.00 | 99.87 | 99.96 |
| | Incorrect (%) | 0 | 0.15 | 0.05 | 0.00 | 0.10 | 0.04 | 0.00 | 0.13 | 0.04 |
| 4. MLP 4-11-2 | Total | 39,448 | 17,591 | 57,039 | 15,873 | 8572 | 24,445 | 55,412 | 26,072 | 81,484 |
| | Correct | 39,448 | 17,566 | 57,014 | 15,873 | 8563 | 24,436 | 55,412 | 26,037 | 81,449 |
| | Incorrect | 0 | 25 | 25 | 0 | 9 | 9 | 0 | 35 | 35 |
| | Correct (%) | 100 | 99.86 | 99.96 | 100.00 | 99.90 | 99.96 | 100.00 | 99.87 | 99.96 |
| | Incorrect (%) | 0 | 0.14 | 0.04 | 0.00 | 0.10 | 0.04 | 0.00 | 0.13 | 0.04 |
| 5. MLP 4-16-2 | Total | 39,448 | 17,591 | 57,039 | 15,873 | 8572 | 24,445 | 55,321 | 26,163 | 81,484 |
| | Correct | 39,448 | 17,567 | 57,015 | 15,873 | 8572 | 24,445 | 55,321 | 26,139 | 81,460 |
| | Incorrect | 0 | 24 | 24 | 0 | 0 | 0 | 0 | 0 | 0 |
| | Correct (%) | 100 | 99.86 | 99.96 | 100.00 | 100.00 | 100.00 | 100.00 | 99.91 | 99.97 |
| | Incorrect (%) | 0 | 0.14 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 0.09 | 0.03 |
| 6. RBF 4-21-2 | Total | 39,539 | 17,500 | 57,039 | 15,873 | 8572 | 24,445 | 55,412 | 26,072 | 81,484 |
| | Correct | 39,520 | 16,426 | 55,946 | 15,855 | 8059 | 23,914 | 55,375 | 24,485 | 79,860 |
| | Incorrect | 19 | 1074 | 1093 | 18 | 513 | 531 | 37 | 1587 | 1624 |
| | Correct (%) | 99.95 | 93.86 | 98.08 | 99.89 | 94.02 | 97.83 | 99.93 | 93.91 | 98.01 |
| | Incorrect (%) | 0.05 | 6.14 | 1.92 | 0.11 | 5.98 | 2.17 | 0.07 | 6.09 | 1.99 |
| 7. RBF 4-29-2 | Total | 39,539 | 17,500 | 57,039 | 15,873 | 8572 | 24,445 | 55,412 | 26,072 | 81,484 |
| | Correct | 39,165 | 17,475 | 56,640 | 15,714 | 8561 | 24,275 | 54,879 | 26,036 | 80,915 |
| | Incorrect | 374 | 25 | 399 | 159 | 11 | 170 | 533 | 36 | 569 |
| | Correct (%) | 99.05 | 99.86 | 99.3 | 99.00 | 99.87 | 99.30 | 99.04 | 99.86 | 99.30 |
| | Incorrect (%) | 0.95 | 0.14 | 0.7 | 1.00 | 0.13 | 0.70 | 0.96 | 0.14 | 0.70 |
| 8. RBF 4-21-2 | Total | 39,539 | 17,500 | 57,039 | 15,873 | 8572 | 24,445 | 55,412 | 26,072 | 81,484 |
| | Correct | 39,526 | 16,138 | 55,664 | 15,868 | 7873 | 23,741 | 55,394 | 24,011 | 79,405 |
| | Incorrect | 13 | 1362 | 1375 | 5 | 699 | 704 | 18 | 2061 | 2079 |
| | Correct (%) | 99.97 | 92.22 | 97.59 | 99.97 | 91.85 | 97.12 | 99.97 | 92.09 | 97.45 |
| | Incorrect (%) | 0.03 | 7.78 | 2.41 | 0.03 | 8.15 | 2.88 | 0.03 | 7.91 | 2.55 |
| 9. RBF 4-28-2 | Total | 39,539 | 17,500 | 57,039 | 15,197 | 8571 | 23,768 | 53,008 | 26,060 | 81,484 |
| | Correct | 39,489 | 16,000 | 23,489 | 15,197 | 8448 | 23,645 | 53,008 | 25,674 | 78,682 |
| | Incorrect | 50 | 1500 | 1450 | 0 | 123 | 123 | 0 | 386 | 386 |
| | Correct (%) | 99.87 | 91.43 | 95.65 | 100.00 | 98.56 | 99.48 | 100.00 | 98.52 | 99.51 |
| | Incorrect (%) | 0.03 | 7.78 | 4.35 | 0.00 | 1.44 | 0.52 | 0.00 | 1.48 | 0.49 |
| 10. RBF 4-26-2 | Total | 39,539 | 17,500 | 57,039 | 15,873 | 8572 | 24,445 | 55,412 | 26,072 | 81,484 |
| | Correct | 11,880 | 6629 | 18,509 | 4748 | 3170 | 7918 | 16,628 | 9799 | 26,427 |
| | Incorrect | 27,659 | 10,871 | 38,530 | 11,125 | 5402 | 16,527 | 38,784 | 16,273 | 55,057 |
| | Correct (%) | 30.05 | 37.88 | 32.45 | 29.91 | 36.98 | 32.39 | 30.01 | 37.58 | 32.43 |
| | Incorrect (%) | 69.95 | 62.12 | 67.55 | 70.09 | 63.02 | 67.61 | 69.99 | 62.42 | 67.57 |
Table 5. Quantitative analysis of the non-optimal MLP models.

| Model | Class | Fail | Total in Class |
|---|---|---|---|
| 1. MLP 4-7-2 | 6.4 | 1 | 104 |
| | 6.5 | 34 | 36 |
| 2. MLP 4-8-2 | 1.6 | 3 | 4 |
| | 6.4 | 1 | 104 |
| | 6.5 | 34 | 36 |
| 3. MLP 4-10-2 | 1.6 | 3 | 4 |
| | 6.4 | 1 | 104 |
| | 6.5 | 33 | 36 |
| 4. MLP 4-11-2 | 1.6 | 3 | 4 |
| | 6.4 | 1 | 104 |
| | 6.5 | 32 | 36 |
| 5. MLP 4-16-2 | 6.4 | 1 | 104 |
| | 6.5 | 33 | 36 |
Table 6. Sensitivity analysis for the artificial neural network (ANN) model.

| Input Variable | Variable Sensitivity | Variable Name/Details |
|---|---|---|
| <Tr5(srn)> | 15,896,991 | Expected value of the trace of order 5 of the SRN for the sequence |
| D Tr5(srn) | 1,288,626 | Deviation of the trace of order 5 of the SRN from the mean value of the class |
| <Tr3(srn)> | 591,331.9 | Expected value of the trace of order 3 of the SRN for the sequence |
| D Tr3(srn) | 108.7591 | Deviation of the trace of order 3 of the SRN from the mean value of the class |
Table 7. Number of entries for each subclass.

| EC Subclass | Number of Sequences | EC Subclass | Number of Sequences | EC Subclass | Number of Sequences |
|---|---|---|---|---|---|
| 1.1 | 555 | 2.3 | 722 | 4.6 | 120 |
| 1.2 | 250 | 2.4 | 424 | 4.99 | 95 |
| 1.3 | 172 | 2.5 | 291 | 5.1 | 176 |
| 1.4 | 108 | 2.6 | 19 | 5.2 | 74 |
| 1.5 | 5 | 2.7 | 3112 | 5.3 | 247 |
| 1.6 | 4 | 2.8 | 71 | 5.4 | 160 |
| 1.7 | 91 | 2.9 | 10 | 5.5 | 115 |
| 1.8 | 165 | 3.1 | 1559 | 5.6 | 159 |
| 1.9 | 73 | 3.11 | 7 | 5.99 | 3 |
| 1.10 | 555 | 3.13 | 3 | 6.1 | 277 |
| 1.11 | 136 | 3.2 | 700 | 6.2 | 38 |
| 1.12 | 32 | 3.3 | 164 | 6.3 | 291 |
| 1.13 | 123 | 3.4 | 1481 | 6.4 | 104 |
| 1.14 | 244 | 3.5 | 561 | 6.5 | 36 |
| 1.15 | 162 | 3.6 | 417 | 7.1 | 8827 |
| 1.16 | 173 | 3.7 | 69 | 7.2 | 927 |
| 1.17 | 121 | 3.8 | 77 | 7.4 | 189 |
| 1.18 | 45 | 3.9 | 3 | 7.5 | 187 |
| 1.20 | 250 | 4.1 | 486 | 7.6 | 197 |
| 1.21 | 28 | 4.2 | 460 | | |
| 1.23 | 3 | 4.3 | 97 | | |
| 2.1 | 522 | 4.4 | 39 | | |
| 2.2 | 107 | 4.5 | 25 | | |
