Enhancing Machine-Learning Prediction of Enzyme Catalytic Temperature Optima through Amino Acid Conservation Analysis
Abstract
1. Introduction
2. Results
2.1. Phosphatase Data Collection and Processing
2.2. Prediction of Optimal Catalytic Temperature for Phosphatases by Previous Tools
2.3. Feature Extraction and Model Evaluation Based on Topt249
2.4. Removing the Conserved Amino Acid in the Protein Sequence Facilitates the Prediction of Phosphatase Catalytic Temperature Optima by Machine-Learning
2.5. Prediction of Topt of Unmeasured Phosphatases by Optimized Machine Learning Model and Experimental Validation
3. Materials and Methods
3.1. Prediction Model for Optimal Catalytic Temperature of Phosphatases
3.2. Dataset Construction
3.2.1. Construction of the Dataset for Wild-type Sequences of Phosphatases
3.2.2. Construction of Phosphatase Dataset with Conserved Amino Acids Removed
3.3. Training and Testing Set Splitting
3.4. Feature Extraction and Selection
3.5. Construction and Training of Machine Learning Models
3.6. Evaluation of Machine Learning Models
3.7. Enzyme Expression and Purification
3.8. Determination of Enzyme Activity and Topt of Phosphatases
3.9. Determining the EC Number of Phosphatase Enzymes
4. Discussion
5. Availability of Final Model, Data, and Code
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
No. | Feature Group | Number of Descriptors |
---|---|---|
1 | amino acid frequencies | 20 |
2 | dipeptide frequencies | 400 |
3 | amino acid frequencies and dipeptide frequencies | 420 |
4 | amino acid frequencies, protein descriptors (conjoint triad features and distribution) | 468 |
5 | dipeptide frequencies, protein descriptors (conjoint triad features and distribution) | 848 |
6 | amino acid frequencies, dipeptide frequencies, protein descriptors (conjoint triad features and distribution) | 868 |
7 | amino acid frequencies, protein molecular weights | 21 |
8 | dipeptide frequencies, protein molecular weights | 401 |
9 | amino acid frequencies, dipeptide frequencies, protein molecular weights | 421 |
10 | amino acid frequencies, protein molecular weights, protein descriptors (conjoint triad features and distribution) | 469 |
11 | dipeptide frequencies, protein molecular weights, protein descriptors (conjoint triad features and distribution) | 849 |
12 | amino acid frequencies, dipeptide frequencies, protein molecular weights, protein descriptors (conjoint triad features and distribution) | 869 |
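Feature groups 1 to 3 in the table above are built from amino acid frequencies (20 descriptors) and dipeptide frequencies (400 descriptors). The following is a minimal sketch of how such descriptors might be computed. It assumes relative frequencies over the 20 standard residues and over overlapping residue pairs; the normalization, the handling of non-standard residues, and all function names are illustrative assumptions, not the authors' published code.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def amino_acid_frequencies(seq):
    """Return 20 relative frequencies, one per standard amino acid."""
    seq = seq.upper()
    counts = [seq.count(a) for a in AMINO_ACIDS]
    total = sum(counts)  # non-standard symbols (e.g. X) are ignored
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [c / total for c in counts]


def dipeptide_frequencies(seq):
    """Return 400 relative frequencies over overlapping residue pairs."""
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard symbols
            counts[pair] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[k] / total for k in DIPEPTIDES]


def sequence_features(seq):
    """Concatenate both descriptor sets, as in feature group 3 (420 values)."""
    return amino_acid_frequencies(seq) + dipeptide_frequencies(seq)


if __name__ == "__main__":
    example = "MKVLAAGIVG"  # hypothetical toy sequence, not from the dataset
    print(len(sequence_features(example)))
```

The concatenated vector has 20 + 400 = 420 entries, matching the descriptor count listed for feature group 3. The molecular weight and conjoint triad descriptors in the other groups are not covered by this sketch.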
No. | UniProt ID | EC Number | Topt10 Predicted Topt (°C) | Topt_rcaa10 Predicted Topt (°C) | Experimental Topt (°C) |
---|---|---|---|---|---|
1 | O58216 | 3.1.3.102 | 80 | 80 | 78 |
2 | Q5SLK1 | 3.1.3.104 | 78 | 78 | 80 |
3 | E8UU74 | 3.1.3.3 | 40 | 40 | 75 |
4 | Q97C22 | 3.1.3.70 | 15 | 80 | 65 |
5 | Q72I84 | 3.1.3.102 | 78 | 78 | 83 |
6 | Q5SKD4 | 3.1.3.70 | 78 | 65 | 78 |
7 | A0LR15 | 3.1.3.12 | 30 | 65 | 67 |
8 | Q8Z989 | 3.1.3.18 | 65 | 37 | 37 |
9 | O59374 | 3.1.3.104 | 22 | 70 | 78 |
10 | Q5SI65 | 3.1.3.102 | 50 | 65 | 70 |
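To compare the two models on the ten validated enzymes, the predictions in the table above can be summarized with simple error measures. The sketch below computes the mean absolute error (MAE) and root-mean-square error (RMSE) of each model against the experimental Topt. The values it prints are derived here from the ten table rows only; they are not metrics reported by the authors, and the helper names are illustrative.

```python
from math import sqrt

# Rows copied from the table above:
# (UniProt ID, Topt10 prediction, Topt_rcaa10 prediction, experimental Topt), in °C
ROWS = [
    ("O58216", 80, 80, 78),
    ("Q5SLK1", 78, 78, 80),
    ("E8UU74", 40, 40, 75),
    ("Q97C22", 15, 80, 65),
    ("Q72I84", 78, 78, 83),
    ("Q5SKD4", 78, 65, 78),
    ("A0LR15", 30, 65, 67),
    ("Q8Z989", 65, 37, 37),
    ("O59374", 22, 70, 78),
    ("Q5SI65", 50, 65, 70),
]


def mae(predicted, observed):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


def rmse(predicted, observed):
    """Root-mean-square error between two equal-length sequences."""
    return sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))


topt10 = [row[1] for row in ROWS]
topt_rcaa10 = [row[2] for row in ROWS]
experimental = [row[3] for row in ROWS]

if __name__ == "__main__":
    for name, preds in (("Topt10", topt10), ("Topt_rcaa10", topt_rcaa10)):
        print(f"{name}: MAE={mae(preds, experimental):.1f} °C, "
              f"RMSE={rmse(preds, experimental):.1f} °C")
```

On these ten rows the MAE is 23.5 °C for Topt10 and 8.7 °C for Topt_rcaa10. This is consistent with the claim that removing conserved amino acids improves prediction, although ten enzymes is a small sample.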
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Cao, Y.; Qiu, B.; Ning, X.; Fan, L.; Qin, Y.; Yu, D.; Yang, C.; Ma, H.; Liao, X.; You, C. Enhancing Machine-Learning Prediction of Enzyme Catalytic Temperature Optima through Amino Acid Conservation Analysis. Int. J. Mol. Sci. 2024, 25, 6252. https://doi.org/10.3390/ijms25116252