A Machine Learning Approach for the Prediction of Thermostable β-Glucosidases

Mariano, Diego

doi:10.3390/app15094839

by

Diego Mariano

Department of Computer Science (DCC), Universidade Federal de Minas Gerais (UFMG), Belo Horizonte 31270-901, Brazil

Appl. Sci. 2025, 15(9), 4839; https://doi.org/10.3390/app15094839

Submission received: 18 January 2025 / Revised: 14 April 2025 / Accepted: 24 April 2025 / Published: 27 April 2025

(This article belongs to the Topic Computational Intelligence and Bioinformatics (CIB))

Featured Application

The models constructed here can be used to detect patterns related to thermostability and to detect novel enzymes capable of maintaining their activity at high temperatures.

Abstract

Thermostable β-glucosidases (E.C. 3.2.1.21) are essential enzymes used in second-generation biofuel production. However, little is known about the structural characteristics that lead to their thermostability. In this study, I used graph-based structural signatures to represent three-dimensional structures of β-glucosidase enzymes extracted from thermophilic organisms. I collected 1717 structures from thermophilic (n = 890) and non-thermophilic (n = 827) organisms and divided them into two datasets: training (n = 1134) and test (n = 583). I then used seven machine learning algorithms to classify them. In training with 10-fold cross-validation, the best model (logistic regression) achieved 77.1% accuracy; in testing, the best model (CatBoost) achieved 81.6% accuracy. I hypothesize that the signature model proposed here can help understand the structural patterns in thermostable enzymes and shed light on the design of more efficient enzymes for biofuel production.

Keywords:

β-glucosidases; machine learning; graph-based structural signatures

1. Introduction

Biofuels are alternative energy sources that can be obtained from organic biomass. Cellulose is one of the primary sources of biomass on planet Earth, making up half of the weight of plants [1,2]. It can be used in producing renewable fuels and can be found in raw materials such as corn and sugarcane. For example, in the first-generation production process of bioethanol from sugarcane, fermentable sugars can be obtained from sugarcane juice in a process known as saccharification [3]. The sugars are then used in the fermentation process to obtain bioethanol. Sugarcane bagasse is a waste resulting from this process [4]. However, it can be used as a by-product to produce second-generation biofuels.

The sugarcane bagasse is composed of three main lignocellulosic structures: cellulose, hemicellulose, and lignin [5]. In the second-generation biofuel production process, the bagasse undergoes pre-treatment to break down the rigid structure of the lignocellulose, facilitating the release of fermentable sugars for ethanol production [6]. Then, the saccharification process occurs, where sugar is extracted from the bagasse by the kinetic action of sets of enzymes. In this context, endoglucanases (E.C. 3.2.1.4) act on the cellulose structure, generating oligosaccharides of varying sizes. Then, exoglucanases (E.C. 3.2.1.91) cleave these oligosaccharides, releasing disaccharides, such as cellobiose [7]. Finally, β-glucosidases (E.C. 3.2.1.21) hydrolyze cellobiose, releasing two glucose molecules, the primary sugar used in fermentation [8].

It is widely known that β-glucosidases can be inhibited by their own product, glucose [9,10,11]. Glucose inhibition of β-glucosidases, in turn, leads to cellobiose accumulation, and high cellobiose concentrations can inhibit endoglucanases [12,13,14]. Therefore, the class of glucose-tolerant β-glucosidase enzymes has been extensively studied to improve biofuel production. However, other characteristics may be necessary for more efficient industrial biofuel production, such as resistance to high temperatures. The literature has reported that high temperatures can cause the denaturation of β-glucosidases [15,16,17]. Additionally, temperature is a critical factor that depends on the production schemes adopted.

In industrial production, the saccharification and fermentation processes can occur using two main schemes: SSF and SHF [6]. In the simultaneous saccharification and fermentation (SSF) process, hydrolysis and fermentation occur in the same reactor. This process reduces inhibition by accumulated sugars since, as glucose is released into the medium, it is fermented to bioethanol. However, the saccharification and fermentation processes occur at different optimum temperatures. While saccharification occurs at an optimum temperature of 55 °C, fermentation occurs at 30 °C [18]. Thus, when carrying out these two processes simultaneously, the difference in optimal temperatures can impair the efficiency of the process.

In the separate hydrolysis and fermentation (SHF) process, saccharification occurs in a reactor under controlled temperature and pH conditions [6]. The sugars released by hydrolysis are then separated from lignin and other solid residues and transferred to another reactor, where they are fermented by microorganisms such as yeast. This process has the advantage of allowing for the enzymatic hydrolysis and fermentation temperatures to be optimized, thus achieving the best production performance [19]. However, this can lead to increased costs and production time.

As can be seen, SSF and SHF have advantages and disadvantages. However, in both processes, we can see the importance of thermostable enzymes, which can maintain their activity even at high temperatures. Thermostable enzymes are widely applied in industry, playing a key role in different saccharification and fermentation processes [17]. Therefore, it is of great interest to detect enzymes, such as β-glucosidases, capable of operating at high temperatures or even to understand the physicochemical characteristics of structures that lead to greater resistance to high temperatures. Thus, β-glucosidase enzymes from thermophilic bacteria can be a good study target.

Thermophilic bacteria that inhabit high-temperature environments, such as hot springs or volcanoes, produce β-glucosidases capable of performing their catalytic activity at high temperatures [20,21,22,23]. These β-glucosidases have great potential for the enzyme cocktails used in the industrial production of biofuels. However, β-glucosidase enzymes extracted from thermophilic bacteria are rare, and these bacteria are difficult to cultivate outside their native environment (although the enzymes can be expressed in prokaryotic or eukaryotic cell lines and produced industrially).

In this study, I use graph-based structural signatures to represent three-dimensional structures of β-glucosidase enzymes extracted from thermophilic organisms. I then use machine learning algorithms to classify β-glucosidases from thermophilic and non-thermophilic organisms. Furthermore, the rise of deep learning-based algorithms for predicting three-dimensional structures, such as AlphaFold [24], has allowed a wide range of proteins identified by next-generation sequencing (NGS) to be used for structural analysis. Thus, this work proposes using the wide range of structures obtained by these deep learning-based modeling techniques. The models constructed here can be used to detect patterns related to thermostability and to detect novel enzymes capable of maintaining their activity at high temperatures.

2. Related Works

A systematic literature review was performed to detect a dataset of β-glucosidase enzymes resistant to glucose inhibition [25]. Later, residues related to glucose tolerance were detected, and multiple sequence analyses were used to detect other structures that share these characteristics [26]. Molecular dynamics studies demonstrated the importance of residues D238, Y180, W350, and F349 in a mechanism for glucose release in glucose-tolerant β-glucosidases, termed the “slingshot mechanism” [27]. Furthermore, it was demonstrated that the conformational flexibility of β-glucosidase regions was related to enzymes more resistant to glucose inhibition [28].

Then, graph-based signatures were used together with a simple model based on the K-nearest neighbors (KNN) algorithm to detect a structural signature of glucose tolerance [29]. Structural signatures are mathematical representations of three-dimensional structures of proteins and other macromolecules capable of characterizing similarities and differences in vector space [30,31].

A literature review pointed out several factors that could be related to the thermostability of glycoside hydrolase enzymes [15]. These characteristics can be linked to intramolecular contacts, such as hydrophobic effect [32], hydrogen bonds and other electrostatic interactions [33], and aromatic stacking [34], among others [35]. Also, a machine learning model was used to detect thermostable enzymes [36]. However, this model was sequence-based and did not consider three-dimensional structures.

Moreover, a previous study demonstrated the importance of point mutations for the thermostabilizing mechanisms of GH1 β-glucosidases [17]. However, the study did not use a machine learning analysis to detect similar patterns in other structures. To the best of my knowledge, the use of structural signatures to detect thermostability patterns in β-glucosidase enzymes has not been reported in the literature.

3. Materials and Methods

Briefly, I collected data on thermostable and non-thermostable β-glucosidase structures from two public databases: UniProt (sequences) and AlphaFold DB (modeled 3D structures). Then, I calculated the signature matrix and divided it into two balanced datasets: training and test. Figure 1 illustrates the data preprocessing steps adopted in this study. The details are presented in the following subsections.

3.1. Data Collection

I searched the UniProt database [37] for the E.C. Number code of β-glucosidases (E.C. 3.2.1.21). Thus, 76,582 β-glucosidase enzyme sequences were downloaded, 76,362 of which were extracted from TrEMBL (unreviewed) and 220 from Swiss-Prot (reviewed). Then, regular expressions were used to detect thermophilic organisms. To this end, a search for the substring “therm” was performed in lines starting with “>” and having the pattern “OS=”. I hypothesized that this simple strategy could be used to detect thermophilic organisms. For example, the sequence B9K7M5 of the organism Thermotoga neapolitana met these requirements. A total of 1563 sequences were obtained. Finally, I randomly selected 1500 sequences from thermophilic organisms (i.e., I discarded only 63). I also randomly selected another 1500 sequences from likely non-thermophilic organisms (i.e., sequences that did not have the substring “therm” in their header). I then split this dataset (n = 3000) into balanced training (n = 2000) and testing (n = 1000) subsets.
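The header filter described above can be sketched in a few lines of Python. This is a minimal illustration and not the script used in the study: the function names are hypothetical, the case-insensitive match (needed for names such as Thermotoga) and the restriction of the search to the OS= field are assumptions, and the example headers are abbreviated with placeholder entry names and taxonomy identifiers.

```python
import re

# Captures the organism name that follows "OS=" in a UniProt-style FASTA
# header, stopping at the next two-letter field tag (e.g. "OX=") or at the
# end of the line.
OS_FIELD = re.compile(r"OS=(.*?)(?=\s+[A-Z]{2}=|$)")


def organism_name(header):
    """Return the organism name from a FASTA header line, or None."""
    if not header.startswith(">"):
        return None  # sequence line, not a header
    match = OS_FIELD.search(header)
    return match.group(1).strip() if match else None


def looks_thermophilic(header):
    """Assumed rule: case-insensitive 'therm' inside the OS= field."""
    name = organism_name(header)
    return name is not None and "therm" in name.lower()


# Abbreviated example headers; "ENTRY" and "OX=0" are placeholders.
headers = [
    ">tr|B9K7M5|ENTRY Beta-glucosidase OS=Thermotoga neapolitana OX=0",
    ">tr|A0A0B2A6F6|ENTRY Beta-glucosidase OS=Microbacterium mangrovi OX=0",
    ">tr|A0A087E2J5|ENTRY Beta-glucosidase OS=Bifidobacterium thermacidophilum OX=0",
]
for header in headers:
    print(organism_name(header), looks_thermophilic(header))
```

Note that a rule of this kind labels any organism whose name contains the substring, which is why the labels are treated as presumed rather than experimentally confirmed.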

Furthermore, β-glucosidase enzymes of the GH1 family are approximately 400 residues long. A manual analysis of the collected structures revealed that some structures were incomplete, and others were complexed with other structures. Therefore, I removed structures with fewer than 200 or more than 800 amino acid residues. Then, I used the AlphaFold Protein Structure Database [38] API to download the three-dimensional structures corresponding to each sequence. However, some structures could not be found in this database (total instances not found: 537 in the training dataset and 262 in the test dataset).
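The length filter and the lookup step can be sketched as follows. This is a hedged illustration only: the helper names are hypothetical, the inclusive bounds are an assumption, and the URL pattern is my assumption about the AlphaFold DB prediction endpoint, which should be checked against the database documentation before use. No network request is made in this sketch.

```python
MIN_LEN, MAX_LEN = 200, 800  # residue limits stated in the text


def within_length_limits(sequence):
    """Keep sequences with 200 to 800 residues (inclusive bounds assumed)."""
    return MIN_LEN <= len(sequence) <= MAX_LEN


def alphafold_api_url(accession):
    """Assumed endpoint that returns model metadata for a UniProt accession."""
    return f"https://alphafold.ebi.ac.uk/api/prediction/{accession}"


def select_accessions(records):
    """records: dict mapping UniProt accession -> amino acid sequence."""
    return [acc for acc, seq in records.items() if within_length_limits(seq)]


# Synthetic sequences used only to exercise the filter.
toy_records = {"TOO_SHORT": "A" * 150, "B9K7M5": "A" * 444, "TOO_LONG": "A" * 900}
for accession in select_accessions(toy_records):
    print(accession, alphafold_api_url(accession))
```

Accessions for which the database returns no model would simply be skipped, which corresponds to the "not found" instances reported above.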

In conclusion, I collected 1717 structures from thermophilic (n = 890) and non-thermophilic (n = 827) organisms and divided them into two datasets: training and test (Figure 1). The training dataset has n = 1134, with 589 positive (thermophilic) and 545 negative (non-thermophilic). The test dataset has n = 583, with 301 positive (thermophilic) and 282 negative (non-thermophilic).

3.2. Structural Signatures Calculation

Structural signatures were calculated using the SIGNA Python library v1.3. The aCSM-ALL algorithm [39] was used to calculate graph-based signatures with a maximum cutoff of 10 Å and a cutoff step of 0.1 Å (10/0.1 Å). These parameter values were based on previous works [29,40].

The aCSM-ALL algorithm models the structure of a macromolecule as multiple graphs. For each distance cutoff from 0 to 10 Å, in steps of 0.1 Å, the signature constructs a graph where the atoms are the vertices and the pairs of atoms that meet the current cutoff are the edges. The model also classifies each atom into one of eight possible types: acceptor, donor, hydrophobic, positive, negative, neutral, sulfide, and aromatic. Finally, the algorithm calculates the distribution of the most common pairs of atoms in all graphs constructed, considering the distance intervals and atomic types. For each protein, this returns a vector with 3600 numerical values (consistent with 36 unordered pairs of the eight atom types × 100 distance steps), representing its structural signature. Thus, the training dataset matrix has 1134 rows by 3600 columns, while the test dataset matrix has 583 rows by 3600 columns (in addition to the columns with the known labels).
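To make the shape of the signature concrete, the following toy sketch counts atom pairs per type pair and per distance cutoff. It is a simplified illustration and not the SIGNA or aCSM-ALL implementation: the function name is hypothetical, each atom is assumed to carry exactly one type, and the counts are assumed to be cumulative (a pair is counted at every cutoff that reaches its distance).

```python
from itertools import combinations, combinations_with_replacement
from math import ceil, dist

ATOM_TYPES = ["acceptor", "donor", "hydrophobic", "positive",
              "negative", "neutral", "sulfide", "aromatic"]
# Eight types give 36 unordered type pairs (same-type pairs included).
TYPE_PAIRS = list(combinations_with_replacement(ATOM_TYPES, 2))
PAIR_INDEX = {pair: i for i, pair in enumerate(TYPE_PAIRS)}


def toy_signature(atoms, cutoff_max=10.0, step=0.1):
    """atoms: list of (atom_type, (x, y, z)). Returns a flat count vector."""
    n_steps = round(cutoff_max / step)          # 100 cutoffs: 0.1, ..., 10.0
    vector = [0] * (len(TYPE_PAIRS) * n_steps)  # 36 * 100 = 3600 entries
    for (type_a, xyz_a), (type_b, xyz_b) in combinations(atoms, 2):
        d = dist(xyz_a, xyz_b)
        if d > cutoff_max:
            continue                            # pair never becomes an edge
        pair = tuple(sorted((type_a, type_b), key=ATOM_TYPES.index))
        first = max(ceil(d / step) - 1, 0)      # index of first cutoff >= d
        for k in range(first, n_steps):         # cumulative over larger cutoffs
            vector[PAIR_INDEX[pair] * n_steps + k] += 1
    return vector


# Two hypothetical atoms 3.05 Å apart.
signature = toy_signature([("aromatic", (0.0, 0.0, 0.0)),
                           ("sulfide", (3.05, 0.0, 0.0))])
print(len(signature), sum(signature))
```

The layout (type pair as the slow index, cutoff as the fast index) is an arbitrary choice made for this sketch.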

3.3. Machine Learning Models

Machine learning models were built using the Orange Data Mining tool v3.38 [41]. In total, seven models were built using the KNN, random forest, gradient boosting (scikit-learn), SVM, neural network (MLP), logistic regression, and CatBoost (a variation of the gradient boosting algorithm) algorithms. Each model was initially evaluated using stratified cross-validation (K = 10) on the training dataset and then tested with the test dataset. Figure 2 illustrates the methodology adopted to build the classification models using machine learning.
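Stratified cross-validation keeps the class proportions roughly constant in every fold. The sketch below illustrates the idea with the class sizes of the training dataset; it is not how Orange assigns folds internally, and the function name, the round-robin assignment, and the random seed are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign each sample index to one of k folds, class by class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)  # deal out round-robin
    return folds


# Class sizes of the training dataset reported in Section 3.1.
labels = [1] * 589 + [0] * 545
folds = stratified_folds(labels)
for fold_number, held_out in enumerate(folds):
    train = [i for fold in folds if fold is not held_out for i in fold]
    # A model would be fitted on `train` and scored on `held_out` here.
    print(fold_number, len(train), len(held_out))
```

The cross-validation accuracy is then the average of the ten held-out scores.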

The hyperparameters were manually selected, and the parameters with the best results based on training accuracy were chosen (details are available in the Supplementary Material—Supplementary Figures S1–S8). The following parameters were used in the final version:

KNN: k = 5; metric: Euclidean; weight: by distances;
Random Forest: number of trees: 10; do not split subsets smaller than five;
Gradient Boosting (scikit-learn): number of trees: 500; learning rate: 0.9; limit of the depth of individual trees: three; do not split subsets smaller than two;
SVM: cost (C): 1.00; regression loss epsilon: 0.1; kernel: linear; numerical tolerance: 0.001; iteration limit: 100;
Neural Network (MLP): neurons in hidden layers: 100; activation: ReLU; solver: Adam; regularization alpha: 0.0001; maximal number of iterations: 200;
Logistic Regression: regularization type: ridge (L2); strength C = 1;
CatBoost (Gradient Boosting variation): number of trees: 200; learning rate: 0.4; regularization lambda: 0.09; limit of the depth of individual trees: 5.

The results were evaluated using five metrics: accuracy, F1-score, precision, recall, and specificity.
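For reference, the five metrics can be written in terms of the four cells of a binary confusion matrix, taking the thermophilic class as positive. The sketch below uses hypothetical counts and a hypothetical function name; whether the values reported by Orange are computed for the positive class or averaged over both classes is not stated here, so this is the per-class convention only.

```python
def binary_metrics(tp, fn, fp, tn):
    """Metrics for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp)    # fraction of predicted positives that are real
    recall = tp / (tp + fn)       # fraction of real positives recovered
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),  # fraction of real negatives recovered
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts, chosen only to illustrate the formulas.
example = binary_metrics(tp=8, fn=2, fp=1, tn=9)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```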

3.4. Case Studies

Firstly, to compare the method proposed here with other methodologies, the 583 sequences of the test dataset were submitted to the TemStaPro software v0.2.6 [42] using default parameters. TemStaPro (Temperatures of Stability for Proteins) is a machine learning-based method for predicting protein thermostability directly from its sequence. The software receives, as input, a set of sequences and returns a binary value indicating whether the protein will be stable at temperatures of 40, 45, 50, 55, 60, and 65 °C.

Secondly, to better exemplify the results of the model, four structures were selected for further analysis: a thermophilic structure that the model correctly identified (UniProt ID: A0A0B3BP14) and one that the model incorrectly identified (UniProt ID: A0A087E2J5), as well as a non-thermophilic structure that the model correctly identified (UniProt ID: A0A0B2A6F6) and another that the model incorrectly identified (UniProt ID: A0A0G0PX59). Sequences were collected from the UniProt database [37]. The 3D model structures were collected from the AlphaFold Protein Structure Database [38].

Sequence alignments were performed using the Clustal Omega web tool 2025 [43]. Structure alignments were performed using the PyMOL tool v3.1.0 [44].

Analysis of the electrostatic surface charge distribution was performed using the APBS web tool v3.4.1 [45,46]. Initially, the PDB file was converted to PQR format using the PDB2PQR web tool v3.6.1 [47,48], with PROPKA used to assign protonation states at a pH of 7.0 and with the AMBER force field. The results were visualized through the electrostatic surface with a red–white–blue color scheme ranging from −5 kT/e to 5 kT/e.

Fast protein structure flexibility simulations were performed for A0A0B3BP14 (thermophilic) and A0A0B2A6F6 (non-thermophilic) using the CABSflex web tool v2.0 [49,50]. I used the following parameters: restraint mode ss2 (secondary structure only), minimum length restraint of 3.8 Å and maximum of 8 Å, gap of 3, and number of cycles of 50. For each protein, an experiment was performed using the default simulation temperature (1.4) and a higher temperature (3.0).

4. Results and Discussion

I collected 1717 β-glucosidase 3D structures from the AlphaFold Protein Structure Database. Approximately 52% of them (890 of 1717) were presumed to be from thermophilic organisms. Then, I calculated their structural signatures using the aCSM-ALL algorithm. Structural signatures are mathematical methods that extract representative features from three-dimensional structures of macromolecules. The aCSM-ALL signature returns a numerical vector of 3600 values for each protein using the parameters 10/0.1 Å. These numerical vectors were used as input for the machine learning models. I aimed to evaluate whether there is a structural pattern in the enzymes extracted from thermophilic organisms that is not present in the β-glucosidase enzymes of non-thermophilic organisms. Finally, I wanted to verify whether there are enzymes from non-thermophilic organisms with patterns similar to those of thermophilic ones.

A k-fold cross-validation (k = 10) was performed using 2/3 of the entire dataset as the training set. Among the evaluated models, logistic regression obtained the highest accuracy (77.1%), followed by gradient boosting (76.2%), neural network (74.8%), CatBoost (74.1%), random forest (73.0%), K-nearest neighbors (KNN, 72.3%), and support vector machine (SVM, 55.8%). Table 1 summarizes the cross-validation results.

Then, I performed the test validation using approximately 1/3 of the database. For the test, the CatBoost model obtained the highest accuracy of 81.6%, followed by the gradient boosting (scikit-learn) model (80.4%), the neural network model (79.6%), the random forest model (78.2%), the logistic regression model (76.2%), the KNN model (72.6%), and the SVM model (55.6%). Table 2 summarizes the test results (details can be found in Supplementary Table S1).

The logistic regression model achieved the best accuracy in training but ranked fifth in testing. However, its accuracy in cross-validation was less than 1 percentage point higher than in testing (77.1% in training and 76.2% in testing). This suggests that its fall in the ranking in testing is due more to an improvement in the results of the other models than to any deficiency of its own.

During training, the accuracies of the top six models differed by less than 5 percentage points. In testing, this difference was 9 percentage points. Except for the logistic regression and SVM models, whose accuracy dropped slightly, and the KNN model, whose accuracy was essentially unchanged (72.3% vs. 72.6%), all other models performed better in testing than in training. The ROC curve plot (Figure 3) confirms that the top six models achieved similar results in both training and testing. Only the SVM model achieved low accuracies in both training and testing.
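The accuracies reported above can be compared side by side with a short script. The values are copied from the text; the dictionary and function names are chosen here for illustration only.

```python
# Accuracies (%) reported for cross-validation and for the test set.
cv_acc = {"Logistic regression": 77.1, "Gradient boosting": 76.2,
          "Neural network": 74.8, "CatBoost": 74.1,
          "Random forest": 73.0, "KNN": 72.3, "SVM": 55.8}
test_acc = {"CatBoost": 81.6, "Gradient boosting": 80.4,
            "Neural network": 79.6, "Random forest": 78.2,
            "Logistic regression": 76.2, "KNN": 72.6, "SVM": 55.6}


def change(model):
    """Test accuracy minus cross-validation accuracy, in percentage points."""
    return round(test_acc[model] - cv_acc[model], 1)


def top_six_spread(scores):
    """Gap between the best and the sixth-best accuracy."""
    top = sorted(scores.values(), reverse=True)[:6]
    return round(top[0] - top[-1], 1)


for model in cv_acc:
    print(f"{model}: {change(model):+.1f}")
print(top_six_spread(cv_acc), top_six_spread(test_acc))
```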

Briefly, CatBoost was the model that generalized best, outperforming the others in accuracy on the test set. On the other hand, SVM presented generalization challenges and limitations in learning, being less suitable for the structural signature data.

4.1. Thermostability Patterns in Enzymes Obtained from Non-Thermophilic Organisms

Since the best model obtained an accuracy higher than 80%, I assume that β-glucosidases obtained from thermophilic organisms have patterns that differentiate them from others and that the best model was able to generalize these patterns well. Therefore, I analyzed whether there are thermostability patterns in enzymes obtained from non-thermophilic organisms. This can be observed by exploring the false positives.

The confusion matrix (Figure 4) presents real and predicted values of the positive and negative classes for the seven models in both training and testing.

The “positive” class groups the enzymes obtained from thermophilic organisms, while the “negative” class groups the enzymes obtained from non-thermophilic organisms. False positives can be found in the cell present in the “negative” row and “positive” column. In training, the logistic regression model obtained the best result. This model obtained 137 false positives. However, for a better analysis, the data obtained in the test were used. In the test, the CatBoost model obtained 56 false positive instances (Supplementary Table S2). The model based on CatBoost also obtained the lowest number of false negatives (n = 51). Complementary studies need to be carried out to evaluate whether these structures are thermostable or not.

4.1.1. Case Study 1—Comparison to Another Tool

Next, I performed a comparison with another thermostability prediction tool: TemStaPro [42]. TemStaPro is a deep learning method that predicts, based on the sequence, whether the protein will be stable at certain temperatures. For this case study, I used the predictions made for the temperature of 55 °C, which is the optimal temperature at which saccharification occurs [18].

Thus, the 583 sequences from the test database were submitted for thermostability prediction by TemStaPro. Table 3 compares the results for TemStaPro predictions at 55 °C (TemStaPro-t55) with the prediction results using the aCSM-ALL signature and the CatBoost model obtained in this work.

TemStaPro achieved a lower accuracy (66.2%) than the model presented in this study (81.6%). However, this does not directly imply that the competing model was worse. When we evaluate other metrics, such as precision and specificity, we can see that TemStaPro was superior (84% vs. 80.5% in precision and 93.4% vs. 81.4% in specificity). Its lower accuracy stems from a much higher number of false negatives (i.e., entries labeled as positive but classified as negative): while the model built with CatBoost based on the aCSM-ALL signatures obtained 51 false negatives, TemStaPro obtained 177. There are several possible explanations for this, such as an error in the competing method (although TemStaPro-t55 obtained 93.4% accuracy in a test with the SAPPHIRE dataset [42]) or inconsistency in the definition of the labels. The labels originally assigned are based only on the name of the organism that encodes the protein. There is no robust experimental validation indicating that the protein is actually stable at 55 °C. Therefore, the TemStaPro predictions may be closer to the true thermostability. However, it should also be considered that TemStaPro predictions are based only on the sequences. Incomplete sequences or sequences with annotation errors can induce errors in the models. More experiments are necessary to evaluate these results.

Furthermore, the model proposed in this work obtained 56 false positives, while TemStaPro-t55 obtained only 20. False positives are a group of interest for this work since they represent entries labeled as negative that were classified as positive. Assuming that the model can generalize the prediction of thermostable β-glucosidase structures correctly, false positives would indicate structures from non-thermophilic organisms with thermostable characteristics. In total, eight β-glucosidase structures from non-thermophilic organisms were predicted as possibly thermostable by both tools: A0A0G0PX59, A0A0S8EAX1, A0A0U5J538, A0A1C5QUE8, A0A367G1L2, A0A498KCE1, A0A7C3M843, and M1CZQ6. Further investigations into these structures are necessary. More details about TemStaPro results can be found in the Supplementary Material (Table S4).
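As a consistency check, the two accuracies can be recovered from the error counts given in the text and the composition of the test dataset (301 positive and 282 negative entries). The function name below is chosen for illustration.

```python
N_POSITIVE, N_NEGATIVE = 301, 282  # test dataset composition (Section 3.1)


def accuracy_from_errors(false_positives, false_negatives):
    """Accuracy = correctly classified entries / all entries."""
    total = N_POSITIVE + N_NEGATIVE
    return (total - false_positives - false_negatives) / total


catboost = accuracy_from_errors(false_positives=56, false_negatives=51)
temstapro = accuracy_from_errors(false_positives=20, false_negatives=177)
print(f"CatBoost: {catboost:.1%}, TemStaPro-t55: {temstapro:.1%}")
# prints "CatBoost: 81.6%, TemStaPro-t55: 66.2%"
```

The gap in accuracy is therefore driven almost entirely by the difference in false negatives (177 vs. 51), not by false positives.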

4.1.2. Case Study 2—Exploring Some Results in Detail

To further examine the performance of the proposed model and explore the results obtained, I present a case study involving four selected structures (Table 4). The first structure, a β-glucosidase from Microbacterium mangrovi (UniProt ID: A0A0B2A6F6), was originally classified as non-thermostable (real class: negative) and correctly predicted by the model as negative (model hit). The second structure, from Thermoanaerobacter sp. YS13 (UniProt ID: A0A0B3BP14), was defined as thermostable (real class: positive) and correctly predicted as positive (model hit). Conversely, the third structure, from Candidatus Falkowbacteria bacterium (UniProt ID: A0A0G0PX59), was initially defined as negative but was predicted as positive (this may be a model error or, in an optimistic scenario, a newly detected thermostable enzyme). Similarly, the fourth structure, from Bifidobacterium thermacidophilum (UniProt ID: A0A087E2J5), was initially described as positive but classified by the model as negative (probably a model error).

When we analyze the three-dimensional structures, we can see they are very similar. Superimposing them by structural alignment yields an RMSD lower than 0.81 Å in all cases (superimposition of the three structures against A0A0G0PX59 using the PyMOL “align” command). The four structures belong to the GH1 family of glycoside hydrolases from CAZy (Carbohydrate-Active enZYmes Database) [51]. The GH1 family comprises highly conserved TIM barrel-type structures, where eight alpha-helices alternate with eight beta strands that fold into a barrel shape (Figure 5A). This suggests that thermostability is not directly linked to the positioning of the backbone but to other factors.

On the other hand, at the sequence level, the four structures have identities ranging from 31% to 50% (Figure 5B). The negative structure (non-thermophilic), which was incorrectly predicted as positive (thermophilic), has identities ranging from 31% to 36% compared to the others, being the most different. This demonstrates the difficulty of detecting thermostability patterns when considering only similarities between sequences. Structural signatures capture a higher level of detail, as they consider the atoms’ geometric coordinates, the atomic types, and their pharmacophoric properties. This indicates they have greater potential to find less easily perceived structural details.
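Pairwise identity can be computed from two rows of a multiple sequence alignment, as in the sketch below. The convention used here (identical residues divided by the number of columns where neither sequence has a gap) is one common choice and an assumption on my part; the percent identity matrix produced by Clustal Omega may use a different denominator. The function name and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Identity (%) over alignment columns without gaps in either row."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = matches = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue            # skip columns containing a gap
        compared += 1
        matches += res_a == res_b
    return 100.0 * matches / compared if compared else 0.0


# Toy alignment rows: four identical residues in five gap-free columns.
print(percent_identity("ACDE-G", "ACDQWG"))
```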

However, how can we know that the enzymes labeled as non-thermostable are in fact non-thermostable? It is important to remember that they were chosen randomly. The following subsections present some experiments to assess this.

4.1.3. Differences in the Electrostatic Surface

Assessing whether a β-glucosidase enzyme is thermostable is not a trivial task. For instance, Rocha et al. [17] suggest that the thermostability of a β-glucosidase may be related to the distribution of charges on the electrostatic surface. They argue that enzymes with a more equal distribution of positive and negative charges on the surface tend to be more thermostable.

To assess this, I performed electrostatic surface charge distribution analyses of the four case study enzymes using the APBS tool. Figure 6 shows the surface visualization of the four structures in the region close to the active site entrance pocket. The surface was colored on the red–white–blue color scale, i.e., redder regions have a more negative charge, while bluer regions have a more positive charge.

Figure 6 indicates that the electrostatic surfaces of A0A0B3BP14 and A0A087E2J5 (both belonging to the thermophilic group) have more evenly distributed charges when compared to the others (note the distribution of red and blue dots). On the other hand, A0A0B2A6F6 (non-thermophilic group) appears to have more negative charges (more red dots) when compared to the others. However, visually, its surface is not comparable to that of A0A0G0PX59, which also belongs to the non-thermophilic group but was wrongly classified by the model.

A priori, this could be an indicator that A0A0G0PX59 has characteristics that differentiate it from the non-thermophilic group. It is important to emphasize that the non-thermophilic group was obtained by randomly selecting sequences. Therefore, there is no guarantee that its members lack thermostable characteristics. Furthermore, these initial results indicate that the method proposed here could be used to detect new thermostable β-glucosidase enzymes. Also, the results obtained by the TemStaPro software in case study 1 corroborate the idea that A0A0G0PX59 may have thermostable characteristics. However, further experiments are needed to evaluate this.

4.1.4. β-Glucosidases from Thermophilic Organisms Are Less Unstable at High Temperatures

Several studies have described that thermophilic enzymes tend to present reduced flexibility and better compaction of the central regions when compared to the mesophilic ones. This allows them to maintain their function at higher temperatures [52,53,54]. Therefore, one way to assess whether an enzyme has greater resistance to high temperatures is to check its mobility and compaction at low and high temperatures.

The literature has described that several parameters can be used to quantify thermostability, such as evaluating the free energy of unfolding (ΔG_u), the thermal inactivation temperature (T₅₀), and the melting temperature (T_M), among others [52]. Considering in silico experiments, the thermostability evaluation can be made through molecular dynamics simulations, although they can have a high computing cost. Additionally, artificial intelligence techniques and heuristic-based algorithms have shown promise for thermostability prediction.

Thus, I performed fast protein structure flexibility simulations using the CABSflex 2.0 software. Two experiments were performed for each of A0A0B3BP14 (thermophilic) and A0A0B2A6F6 (non-thermophilic). First, I ran CABSflex with default parameters. Then, I set the temperature parameter to 3.0 (the default value is 1.4) and ran a second experiment. Importantly, the temperature parameter used by CABSflex cannot be directly linked to the actual temperature. This parameter controls the total energy of the modeled system. Thus, higher temperature values increase atomic mobility, leading to greater fluctuations [49]. CABSflex returns an RMSF (root mean square fluctuation, in Angstroms) plot that indicates the mobility of each residue during the simulation.
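For readers unfamiliar with the measure, the RMSF of a residue is the square root of the mean squared deviation of its position from its average position over the simulation frames. The sketch below shows this standard definition for a single residue; it is not the CABSflex implementation, the function name is hypothetical, and the frames are assumed to be already superimposed on a common reference.

```python
from math import sqrt


def rmsf(positions):
    """positions: list of (x, y, z) for one residue, one entry per frame."""
    n_frames = len(positions)
    # Average position of the residue over all frames.
    mean = [sum(p[axis] for p in positions) / n_frames for axis in range(3)]
    # Mean squared distance of each frame from that average position.
    mean_sq_dev = sum(
        sum((p[axis] - mean[axis]) ** 2 for axis in range(3))
        for p in positions
    ) / n_frames
    return sqrt(mean_sq_dev)


# Toy trajectory: the residue oscillates 1 Å around its mean along x.
print(rmsf([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]))
```

A rigid residue yields a value near zero, while residues in mobile loops yield larger values.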

Figure 7 summarizes the results. For both proteins, greater flexibility at high temperatures is noticeable (red lines above the blue lines). However, A0A0B2A6F6 (non-thermophilic) presents greater flexibility at higher temperatures than the thermophilic enzyme (Figure 7A), which may indicate greater instability. A peak higher than 16 Å is noticeable between residues 305 and 330. This region is a loop at the entrance to the catalytic pocket. For A0A0B3BP14 (thermophilic), the maximum peak was approximately 8 Å, also in loop regions at the entrance to the active site (Figure 7B). The flexibility of certain regions at the highest temperature was close to that at the initial temperature, which may be in line with what is expected of a thermophilic enzyme. Unexpectedly, in certain regions, the mobility at low temperatures was higher than at high temperatures, such as in the region between residues 176 and 183. This may have occurred due to a software artifact or characteristics of the enzyme itself. Complementary studies need to be carried out to better evaluate this.

In summary, when the variation in RMSF is compared (final RMSF minus the initial one; Figure 7C), the non-thermophilic enzyme shows more flexibility at high temperatures (note the peaks in salmon). Notably, the two sequences differ in length, with the thermophilic enzyme being the shorter one, so their alignment contains gaps. These gaps generally occur in the loops at the entrance to the active site, which may be related to thermostability: the loss of residues in these loops may reduce the instability caused by a higher-temperature environment. Further studies are needed to confirm this.
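
The comparison of RMSF variation between two proteins of different lengths can be sketched as follows. This is only an illustration under stated assumptions: the helper names `delta_rmsf` and `map_to_alignment`, the toy sequences, and the toy RMSF values are hypothetical and are not taken from the study, where the alignment was made manually.

```python
def delta_rmsf(rmsf_default, rmsf_high):
    """Per-residue difference: high-temperature RMSF minus default RMSF."""
    if len(rmsf_default) != len(rmsf_high):
        raise ValueError("profiles must cover the same residues")
    return [h - d for d, h in zip(rmsf_default, rmsf_high)]

def map_to_alignment(values, aligned_seq, gap="-"):
    """Place per-residue values on alignment columns; gap columns become None."""
    if len(values) != len(aligned_seq) - aligned_seq.count(gap):
        raise ValueError("number of values must match number of residues")
    it = iter(values)
    return [None if ch == gap else next(it) for ch in aligned_seq]

# Toy data (hypothetical): two short aligned sequences of different lengths.
aln_a = "MKT-LV"  # protein A, 5 residues, one gap column
aln_b = "MKTGLV"  # protein B, 6 residues
d_a = delta_rmsf([1.0, 1.0, 2.0, 1.0, 1.0], [1.5, 1.0, 3.0, 1.0, 2.0])
d_b = delta_rmsf([1.0] * 6, [2.0, 1.0, 4.0, 6.0, 1.0, 1.0])
cols_a = map_to_alignment(d_a, aln_a)
cols_b = map_to_alignment(d_b, aln_b)
```

Once both profiles share the same alignment columns, they can be plotted on a common axis, and gap columns show where one protein has residues that the other lacks.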

4.2. Limitations and Perspectives

Despite the potential of the models presented here, some limitations and biases of this study need to be clarified. First, the definition of the group of thermophilic enzymes was rather simplistic (only the organism’s name was used). Applying the model to a more curated database would likely yield more reliable results. However, few structures with experimental thermostability data are publicly available, which would limit the experiments proposed here.

Furthermore, the phylogenetic bias in model construction should be considered since most thermophilic organisms used to extract β-glucosidases are bacteria. However, the main objective of model construction is to identify the structural patterns of the enzymes of this group and verify whether they can be found in other enzymes.

Additionally, the experiments were performed with 3D protein models. Despite recent advances in the computational modeling of proteins with new algorithms, such as AlphaFold [24], it is important to emphasize that the models are subject to prediction errors.

In this work, a graph-based structural signature algorithm was used. Graph-based signatures model each atom as a node and each pair of atoms, labeled by atom type, that lies within a pre-established cutoff distance as an edge. One of the challenges when using this algorithm is obtaining explainability for the constructed models. To address this challenge, the Information Gain Ratio (IGR) method from Orange Data Mining was applied to rank the most important features. The five most important attributes are edges composed of the following pairs of atoms: acceptor × aromatic (distance up to 2.5 Å), sulfide × sulfide (distance up to 3.3 Å), aromatic × sulfide (distance up to 3 Å), neutral × sulfide (distance up to 3 Å), and aromatic × neutral (distance up to 2.4 Å). Details can be found in the Supplementary Material (Table S5).
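
The idea of counting typed atom pairs within distance cutoffs can be sketched as below. This is a simplified illustration, not the aCSM-ALL or Signa implementation: the function name `signature`, the atom-type labels, the coordinates, and the cumulative counting scheme (suggested by the "distance up to" wording) are assumptions made for this example.

```python
import math
from itertools import combinations

def signature(atoms, cutoffs):
    """Count atom pairs per (type, type) pair within each cutoff.

    atoms: list of (atom_type, (x, y, z)) tuples.
    cutoffs: distances in angstroms.
    Returns {(type_a, type_b, cutoff): count}; types are sorted so that
    ("aromatic", "acceptor") and ("acceptor", "aromatic") share one key.
    """
    counts = {}
    for (t1, p1), (t2, p2) in combinations(atoms, 2):
        d = math.dist(p1, p2)
        a, b = sorted((t1, t2))
        for c in cutoffs:
            if d <= c:  # cumulative: a close pair counts at every larger cutoff
                counts[(a, b, c)] = counts.get((a, b, c), 0) + 1
    return counts

# Toy atoms with hypothetical type labels and coordinates.
atoms = [
    ("acceptor", (0.0, 0.0, 0.0)),
    ("aromatic", (2.0, 0.0, 0.0)),
    ("sulfide", (0.0, 3.0, 0.0)),
    ("sulfide", (0.0, 6.0, 0.0)),
]
sig = signature(atoms, cutoffs=[2.5, 3.3])
```

Each key of the resulting dictionary corresponds to one attribute of the kind ranked above, such as the number of acceptor and aromatic atoms found within 2.5 Å of each other.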

In summary, according to the IGR, pairs of acceptor and aromatic atoms at a distance of up to 2.5 Å are the most important for classification into thermostable or non-thermostable. Aromatic atoms are found in the aromatic rings of tryptophan, tyrosine, and phenylalanine. Hydrogen acceptor atoms, on the other hand, can be found in the main chains of all amino acids. Because such atoms are widespread in any protein, it is hard to tie this feature to a specific structural mechanism, which illustrates the difficulty of explaining classifications made by structural signatures. This can therefore represent a limitation of this work.
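
For reference, the gain ratio of a discrete feature is its information gain divided by the entropy of the feature itself. The sketch below illustrates this definition only; it is not the Orange Data Mining code. It assumes that continuous signature counts have already been discretized (here, into hypothetical "high" and "low" bins), and the function names and toy data are illustrative.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(feature, labels):
    """Information gain of a discrete feature divided by the feature's entropy."""
    n = len(labels)
    groups = {}
    for f, y in zip(feature, labels):
        groups.setdefault(f, []).append(y)
    # Class entropy that remains after splitting on the feature.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature)
    return info_gain / split_info if split_info > 0 else 0.0

# Toy data (hypothetical): edge counts binarized at an arbitrary threshold.
labels = ["thermo", "thermo", "non", "non"]
perfect = ["high", "high", "low", "low"]  # separates the two classes
useless = ["high", "low", "high", "low"]  # carries no class information
```

Here `gain_ratio(perfect, labels)` evaluates to 1 and `gain_ratio(useless, labels)` to 0, so a high score only says that a feature separates the classes well, not why it does so.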

In the future, I intend to perform more robust molecular dynamics experiments to evaluate flexibility at high temperatures. Furthermore, experimental validation can be performed for enzymes originally defined as non-thermophilic but predicted by the model to be thermophilic (Supplementary Table S2).

5. Conclusions

The study of thermophilic enzymes has great potential for industrial applications. In this work, graph-based structural signatures were used to obtain representative physicochemical features of thermophilic protein structures. Then, machine learning models were built to predict thermophilic β-glucosidases. β-Glucosidases are essential enzymes in the production of second-generation biofuels. The detection of thermostability patterns in β-glucosidases has excellent potential to improve industrial production by enabling, for example, the design of enzymes capable of operating at high temperatures. The best model built here obtained an accuracy of 81.6% on the test dataset using the CatBoost algorithm (based on gradient boosting). In addition, this model can be used to detect enzymes with thermostability-related features obtained from various types of organisms. I hope the results presented here can be used to engineer more efficient enzymes for biofuel production.

Supplementary Materials

The following supporting information can be downloaded at https://github.com/LBS-UFMG/therm-bgl (accessed on 23 April 2025):
Figure S1. Classification accuracy of the KNN model evaluated across different values of the hyperparameter k. The following k-values were tested: 3, 5, 7, 9, 11, 13, and 15.
Figure S2. Classification accuracy of the random forest model for different numbers of trees. The tested values were 3, 5, 10, and 12 trees.
Figure S3. Accuracy of the SVM model using different kernel types. The evaluated kernels were linear, polynomial, radial basis function (RBF), and sigmoid.
Figure S4. Accuracy of the neural network (MLP) model using different numbers of neurons in the hidden layers. The tested configurations included 50, 100, 150, and 200 neurons.
Figure S5. Classification accuracy of the CatBoost model across different numbers of trees. The evaluated values were 50, 100, and 200 trees.
Figure S6. Accuracy of the logistic regression model under different regularization types: Lasso (L1), Ridge (L2), and no regularization.
Figure S7. Training time for the logistic regression model using different regularization methods: Lasso (L1), Ridge (L2), and None.
Figure S8. Comparison of training time among all evaluated models: logistic regression, gradient boosting (scikit-learn), neural network, CatBoost, random forest, KNN, and SVM.
Table S1. Predicted labels for the test dataset using all evaluated models.
Table S2. List of false-positive predictions obtained in the test dataset.
Table S3. Multiple sequence alignment for selected case study enzymes (generated using Clustal Omega).
Table S4. Thermostability prediction results from TemStaPro software for the test dataset.
Table S5. Top-ranked features according to different feature selection metrics: Information Gain, Information Gain Ratio, Gini Index, ReliefF, and Chi-squared (χ²).

Funding

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data, scripts, and models are available at https://github.com/LBS-UFMG/therm-bgl. The Signa Python library is available at https://github.com/LBS-UFMG/signa (accessed on 21 April 2025).

Acknowledgments

I started this work some years ago when I was taking the “Data Science” course on the HarvardX/edX platform. Hence, I would like to thank the lecturers and organizers of the course. Furthermore, I also thank Rafael Oliveira and Rafael Lemos for their valuable contributions and suggestions for future work. Lastly, I thank the Brazilian funding agencies CAPES, CNPq, and Fapemig.

Conflicts of Interest

The author declares no conflicts of interest.

References

1. Uchima, C.A.; Tokuda, G.; Watanabe, H.; Kitamoto, K.; Arioka, M. Heterologous Expression in Pichia Pastoris and Characterization of an Endogenous Thermostable and High-Glucose-Tolerant β-Glucosidase from the Termite Nasutitermes Takasagoensis. Appl. Environ. Microbiol. 2012, 78, 4288–4293.
2. Uchiyama, T.; Yaoi, K.; Miyazaki, K. Glucose-Tolerant β-Glucosidase Retrieved from a Kusaya Gravy Metagenome. Front. Microbiol. 2015, 6, 548.
3. Dunning, J.; Lathrop, E.C. Saccharification of Agricultural Residues. Ind. Eng. Chem. 1945, 37, 24–29.
4. Meghana, M.; Shastri, Y. Sustainable Valorization of Sugar Industry Waste: Status, Opportunities, and Challenges. Bioresour. Technol. 2020, 303, 122929.
5. Badhan, A.; Chadha, B.; Kaur, J.; Saini, H.; Bhat, M. Production of Multiple Xylanolytic and Cellulolytic Enzymes by Thermophilic Fungus Myceliophthora sp. IMI 387099. Bioresour. Technol. 2007, 98, 504–510.
6. Santos, J.R.A.; Souto-Maior, A.M.; Gouveia, E.R.; Medina, C.M. Comparison of SHF and SSF processes from sugar cane bagasse for ethanol production by Saccharomyces cerevisiae. Quím. Nova 2010, 33, 904–908.
7. Kumar, R.; Singh, S.; Singh, O.V. Bioconversion of Lignocellulosic Biomass: Biochemical and Molecular Perspectives. J. Ind. Microbiol. Biotechnol. 2008, 35, 377–391.
8. Cairns, J.R.K.; Esen, A. β-Glucosidases. Cell. Mol. Life Sci. 2010, 67, 3389–3405.
9. Teugjas, H.; Väljamäe, P. Selecting β-Glucosidases to Support Cellulases in Cellulose Saccharification. Biotechnol. Biofuels 2013, 6, 105.
10. Yang, Y.; Zhang, X.; Yin, Q.; Fang, W.; Fang, Z.; Wang, X.; Zhang, X.; Xiao, Y. A Mechanism of Glucose Tolerance and Stimulation of GH1 β-Glucosidases. Sci. Rep. 2015, 5, 17296.
11. Salgado, J.C.S.; Meleiro, L.P.; Carli, S.; Ward, R.J. Glucose Tolerant and Glucose Stimulated β-Glucosidases—A Review. Bioresour. Technol. 2018, 267, 704–713.
12. Zhao, L.; Pang, Q.; Xie, J.; Pei, J.; Wang, F.; Fan, S. Enzymatic Properties of Thermoanaerobacterium Thermosaccharolyticum β-Glucosidase Fused to Clostridium Cellulovorans Cellulose Binding Domain and Its Application in Hydrolysis of Microcrystalline Cellulose. BMC Biotechnol. 2013, 13, 101.
13. Chamoli, S.; Kumar, P.; Navani, N.K.; Verma, A.K. Secretory Expression, Characterization and Docking Study of Glucose-Tolerant β-Glucosidase from B. Subtilis. Int. J. Biol. Macromol. 2016, 85, 425–433.
14. Philippidis, G.P.; Smith, T.K.; Wyman, C.E. Study of the Enzymatic Hydrolysis of Cellulose for Production of Fuel Ethanol by the Simultaneous Saccharification and Fermentation Process. Biotechnol. Bioeng. 1993, 41, 846–853.
15. Ouyang, B.; Wang, G.; Zhang, N.; Zuo, J.; Huang, Y.; Zhao, X. Recent Advances in β-Glucosidase Sequence and Structure Engineering: A Brief Review. Molecules 2023, 28, 4990.
16. Dadwal, A.; Sharma, S.; Satyanarayana, T. Thermostable Cellulose Saccharifying Microbial Enzymes: Characteristics, Recent Advances and Biotechnological Applications. Int. J. Biol. Macromol. 2021, 188, 226–244.
17. Oliveira Rocha, R.; Mariano, D.; Almeida, T.; Sulfierry, L.; Fischer, P.H.; Santos, L.; Caffarena, E.; Da Silveira, C.; Lamp, L.; Fernandez-Quintero, M.; et al. Thermostabilizing Mechanisms of Canonical Single Amino Acid Substitutions at a GH1 β-Glucosidase Probed by Multiple MD and Computational Approaches. Proteins Struct. Funct. Bioinform. 2022, 91, 218–236.
18. Martín, C.; Thomsen, M.H.; Hauggaard-Nielsen, H.; Thomsen, A.B. Wet Oxidation Pretreatment, Enzymatic Hydrolysis and Simultaneous Saccharification and Fermentation of Clover–Ryegrass Mixtures. Bioresour. Technol. 2008, 99, 8777–8782.
19. Olofsson, K.; Bertilsson, M.; Lidén, G. A Short Review on SSF–an Interesting Process Option for Ethanol Production from Lignocellulosic Feedstocks. Biotechnol. Biofuels 2008, 1, 1–14.
20. Akram, F.; ul Haq, I.; Khan, M.A.; Hussain, Z.; Mukhtar, H.; Iqbal, K. Cloning with Kinetic and Thermodynamic Insight of a Novel Hyperthermostable β-Glucosidase from Thermotoga Naphthophila RKU-10T with Excellent Glucose Tolerance. J. Mol. Catal. B Enzym. 2016, 124, 92–104.
21. Bai, A.; Zhao, X.; Jin, Y.; Yang, G.; Feng, Y. A Novel Thermophilic β-Glucosidase from Caldicellulosiruptor Bescii: Characterization and Its Synergistic Catalysis with Other Cellulases. J. Mol. Catal. B Enzym. 2013, 85–86, 248–256.
22. Breves, R.; Bronnenmeier, K.; Wild, N.; Lottspeich, F.; Staudenbauer, W.L.; Hofemeister, J. Genes Encoding Two Different Beta-Glucosidases of Thermoanaerobacter Brockii Are Clustered in a Common Operon. Appl. Environ. Microbiol. 1997, 63, 3902–3910.
23. Cota, J.; Corrêa, T.L.R.; Damásio, A.R.L.; Diogo, J.A.; Hoffmam, Z.B.; Garcia, W.; Oliveira, L.C.; Prade, R.A.; Squina, F.M. Comparative Analysis of Three Hyperthermophilic GH1 and GH3 Family Members with Industrial Potential. New Biotechnol. 2015, 32, 13–20.
24. Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.; Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Žídek, A.; Potapenko, A.; et al. Highly Accurate Protein Structure Prediction with AlphaFold. Nature 2021, 596, 583–589.
25. Mariano, D.C.B.; Leite, C.; Santos, L.H.S.; Marins, L.F.; Machado, K.S.; Werhli, A.V.; Lima, L.H.F.; de Melo-Minardi, R.C. Characterization of Glucose-Tolerant β-Glucosidases Used in Biofuel Production under the Bioinformatics Perspective: A Systematic Review. Genet. Mol. Res. 2017, 16, gmr16039740.
26. Mariano, D.; Pantuza, N.; Santos, L.H.; Rocha, R.E.O.; de Lima, L.H.F.; Bleicher, L.; de Melo-Minardi, R.C. Glutantβase: A Database for Improving the Rational Design of Glucose-Tolerant β-Glucosidases. BMC Mol. Cell Biol. 2020, 21, 50.
27. Costa, L.S.C.; Mariano, D.C.B.; Rocha, R.E.O.; Kraml, J.; da Silveira, C.H.; Liedl, K.R.; de Melo-Minardi, R.C.; de Lima, L.H.F. Molecular Dynamics Gives New Insights into the Glucose Tolerance and Inhibition Mechanisms on β-Glucosidases. Molecules 2019, 24, 3215.
28. de Lima, L.H.F.; Fernandez-Quintéro, M.L.; Rocha, R.E.O.; Mariano, D.C.B.; de Melo-Minardi, R.C.; Liedl, K.R. Conformational Flexibility Correlates with Glucose Tolerance for Point Mutations in β-Glucosidases—A Computational Study. J. Biomol. Struct. Dyn. 2021, 39, 1621–1634.
29. Mariano, D.; Santos, L.H.; Machado, K.D.S.; Werhli, A.V.; de Lima, L.H.F.; de Melo-Minardi, R.C. A Computational Method to Propose Mutations in Enzymes Based on Structural Signature Variation (SSV). Int. J. Mol. Sci. 2019, 20, 333.
30. da Silveira, C.H.; Pires, D.E.V.; Minardi, R.C.; Ribeiro, C.; Veloso, C.J.M.; Lopes, J.C.D.; Meira, W.; Neshich, G.; Ramos, C.H.I.; Habesch, R.; et al. Protein Cutoff Scanning: A Comparative Analysis of Cutoff Dependent and Cutoff Free Methods for Prospecting Contacts in Proteins. Proteins 2009, 74, 727–743.
31. Pires, D.E.; de Melo-Minardi, R.C.; dos Santos, M.A.; da Silveira, C.H.; Santoro, M.M.; Meira, W. Cutoff Scanning Matrix (CSM): Structural Classification and Function Prediction by Protein Inter-Residue Distance Patterns. BMC Genom. 2011, 12, S12.
32. Watanabe, M.; Matsuzawa, T.; Yaoi, K. Rational Protein Design for Thermostabilization of Glycoside Hydrolases Based on Structural Analysis. Appl. Microbiol. Biotechnol. 2018, 102, 8677–8684.
33. Sharma, S.; Vaid, S.; Bhat, B.; Singh, S.; Bajaj, B.K. Thermostable Enzymes for Industrial Biotechnology. In Advances in Enzyme Technology; Elsevier: Amsterdam, The Netherlands, 2019; pp. 469–495.
34. Ahmed, A.; Sumreen, A.; Bibi, A.; Batool, K. In Silico Approach to Elucidate Factors Associated with GH1 β-Glucosidase Thermostability. J. Pure Appl. Microbiol. 2019, 13, 1953–1968.
35. Rahban, M.; Zolghadri, S.; Salehi, N.; Ahmad, F.; Haertlé, T.; Rezaei-Ghaleh, N.; Sawyer, L.; Saboury, A.A. Thermal Stability Enhancement: Fundamental Concepts of Protein Engineering Strategies to Manipulate the Flexible Structure. Int. J. Biol. Macromol. 2022, 214, 642–654.
36. Erickson, E.; Gado, J.E.; Avilán, L.; Bratti, F.; Brizendine, R.K.; Cox, P.A.; Gill, R.; Graham, R.; Kim, D.-J.; König, G.; et al. Sourcing Thermotolerant Poly(Ethylene Terephthalate) Hydrolase Scaffolds from Natural Diversity. Nat. Commun. 2022, 13, 7850.
37. The UniProt Consortium. UniProt: A Hub for Protein Information. Nucleic Acids Res. 2015, 43, D204–D212.
38. Varadi, M.; Bertoni, D.; Magana, P.; Paramval, U.; Pidruchna, I.; Radhakrishnan, M.; Tsenkov, M.; Nair, S.; Mirdita, M.; Yeo, J.; et al. AlphaFold Protein Structure Database in 2024: Providing Structure Coverage for over 214 Million Protein Sequences. Nucleic Acids Res. 2024, 52, D368–D375.
39. Pires, D.E.V.; de Melo-Minardi, R.C.; da Silveira, C.H.; Campos, F.F.; Meira, W. aCSM: Noise-Free Graph-Based Signatures to Large-Scale Receptor-Based Ligand Prediction. Bioinformatics 2013, 29, 855–861.
40. Martins, P.; Mariano, D.; Carvalho, F.C.; Bastos, L.L.; Moraes, L.; Paixão, V.; Cardoso de Melo-Minardi, R. Propedia v2.3: A Novel Representation Approach for the Peptide-Protein Interaction Database Using Graph-Based Structural Signatures. Front. Bioinform. 2023, 3, 1103103.
41. Demšar, J.; Zupan, B.; Leban, G.; Curk, T. Orange: From Experimental Machine Learning to Interactive Data Mining; Springer: Berlin/Heidelberg, Germany, 2004; pp. 537–539.
42. Pudžiuvelytė, I.; Olechnovič, K.; Godliauskaite, E.; Sermokas, K.; Urbaitis, T.; Gasiunas, G.; Kazlauskas, D. TemStaPro: Protein Thermostability Prediction Using Sequence Representations from Protein Language Models. Bioinformatics 2024, 40, btae157.
43. Sievers, F.; Higgins, D.G. Clustal Omega, Accurate Alignment of Very Large Numbers of Sequences. Methods Mol. Biol. 2014, 1079, 105–116.
44. Schrödinger, L. The PyMOL Molecular Graphics System, V2.0.0; Schrödinger, LLC: New York, NY, USA, 2019.
45. Jurrus, E.; Engel, D.; Star, K.; Monson, K.; Brandi, J.; Felberg, L.E.; Brookes, D.H.; Wilson, L.; Chen, J.; Liles, K.; et al. Improvements to the APBS Biomolecular Solvation Software Suite. Protein Sci. 2018, 27, 112–128.
46. Unni, S.; Huang, Y.; Hanson, R.M.; Tobias, M.; Krishnan, S.; Li, W.W.; Nielsen, J.E.; Baker, N.A. Web Servers and Services for Electrostatics Calculations with APBS and PDB2PQR. J. Comput. Chem. 2011, 32, 1488–1491.
47. Dolinsky, T.J.; Nielsen, J.E.; McCammon, J.A.; Baker, N.A. PDB2PQR: An Automated Pipeline for the Setup of Poisson–Boltzmann Electrostatics Calculations. Nucleic Acids Res. 2004, 32, W665–W667.
48. Dolinsky, T.J.; Czodrowski, P.; Li, H.; Nielsen, J.E.; Jensen, J.H.; Klebe, G.; Baker, N.A. PDB2PQR: Expanding and Upgrading Automated Preparation of Biomolecular Structures for Molecular Simulations. Nucleic Acids Res. 2007, 35, W522–W525.
49. Kuriata, A.; Gierut, A.M.; Oleniecki, T.; Ciemny, M.P.; Kolinski, A.; Kurcinski, M.; Kmiecik, S. CABS-Flex 2.0: A Web Server for Fast Simulations of Flexibility of Protein Structures. Nucleic Acids Res. 2018, 46, W338–W343.
50. Jamroz, M.; Kolinski, A.; Kmiecik, S. CABS-Flex: Server for Fast Simulation of Protein Structure Fluctuations. Nucleic Acids Res. 2013, 41, W427–W431.
51. Cantarel, B.L.; Coutinho, P.M.; Rancurel, C.; Bernard, T.; Lombard, V.; Henrissat, B. The Carbohydrate-Active EnZymes Database (CAZy): An Expert Resource for Glycogenomics. Nucleic Acids Res. 2009, 37, D233–D238.
52. Peccati, F.; Alunno-Rufini, S.; Jiménez-Osés, G. Accurate Prediction of Enzyme Thermostabilization with Rosetta Using AlphaFold Ensembles. J. Chem. Inf. Model. 2023, 63, 898–909.
53. Radestock, S.; Gohlke, H. Protein Rigidity and Thermophilic Adaptation. Proteins Struct. Funct. Bioinform. 2011, 79, 1089–1108.
54. Feller, G. Protein Stability and Enzyme Activity at Extreme Biological Temperatures. J. Phys. Condens. Matter 2010, 22, 323101.

Figure 1. Data preprocessing steps. After selecting the 3000 samples, 1500 positive (thermostable) and 1500 negative (probably non-thermostable), I included a label to separate the dataset into the train (n = 2000; 1000 positive and 1000 negative) and test (n = 1000; 500 positive and 500 negative) groups. However, not all structures met the requirements (number of residues higher than 200 and lower than 800) or were available at AlphaFold DB, so I collected only 1134 structures for the training dataset and 583 for the test dataset. In the signature matrix, n corresponds to the number of instances, and m corresponds to the number of attributes detected by the aCSM-ALL algorithm for each instance.

Figure 2. Representation of the constructed machine learning models. The training and test datasets were randomly separated in a previous step using Python scripts. Generated using Orange Data Mining.

Figure 3. ROC curve for training (left) and test (right).

Figure 4. Confusion matrix. Positive = thermostable; negative = probably non-thermostable.

Figure 5. Case study structures. (A) Structural alignment among the 3D structures. Generated using PyMOL. (B) Matrix of sequence identity. Generated using Clustal Omega (details available in Supplementary Table S3). PP: real positive and predicted as positive; NN: real negative and predicted as negative; NP: real negative and predicted as positive; and PN: real positive and predicted as negative.

Figure 6. Charge distribution on the electrostatic surface. Force field: AMBER—pH: 7.0. Generated using the APBS software v3.4.1.

Figure 7. CABSflex result. (A) RMSF for each residue of A0A0B2A6F6 (non-thermophilic) at the initial temperature (blue line) and a higher temperature (red line). (B) RMSF for each residue of A0A0B3BP14 (thermophilic) at the initial temperature (blue line) and a higher temperature (red line). (C) Variation in RMSF (ΔRMSF) between A0A0B2A6F6 (salmon—50% transparent) and A0A0B3BP14 (blue). The higher the peak, the more mobility the region will have at higher temperatures. Dark red regions indicate overlap. As the proteins have different sizes, the sequences were manually aligned (five gap regions are displayed).

Table 1. Results of cross-validation (k = 10)—stratified.

Model	Accuracy	F1-Score	Precision	Recall	Specificity
Logistic Regression	0.771	0.782	0.773	0.791	0.749
Gradient Boosting	0.762	0.774	0.764	0.784	0.738
Neural Network	0.748	0.764	0.742	0.788	0.705
CatBoost	0.741	0.754	0.743	0.766	0.714
Random Forest	0.730	0.729	0.763	0.698	0.765
kNN	0.723	0.738	0.725	0.752	0.692
SVM	0.558	0.610	0.563	0.664	0.444

Table 2. Results for the test dataset.

Model	Accuracy	F1-Score	Precision	Recall	Specificity
CatBoost	0.816	0.812	0.805	0.819	0.814
Gradient Boosting	0.804	0.793	0.811	0.777	0.831
Neural Network	0.796	0.779	0.817	0.745	0.844
Random Forest	0.782	0.771	0.784	0.759	0.804
Logistic Regression	0.762	0.760	0.741	0.780	0.744
kNN	0.726	0.717	0.715	0.720	0.731
SVM	0.556	0.595	0.532	0.674	0.445

Table 3. Results for case study 1 (n = 583; Pos = 282; Neg = 301). Comparison to TemStaPro (55 °C).

	This Study	TemStaPro-t55
Accuracy	0.816	0.662
F1	0.812	0.516
Precision	0.805	0.840
Recall	0.819	0.372
Specificity	0.814	0.934
TP	231	105
TN	245	281
FP	56	20
FN	51	177
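
The metrics in Table 3 follow from the confusion-matrix counts through the standard definitions. The short sketch below recomputes them as a consistency check; the function name `metrics` is illustrative, and the counts are the TP, TN, FP, and FN values listed in the table.

```python
def metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f1": 2 * precision * recall / (precision + recall),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
    }

# Counts taken from Table 3, rounded to three decimals as in the table.
this_study = {k: round(v, 3) for k, v in metrics(231, 245, 56, 51).items()}
temstapro = {k: round(v, 3) for k, v in metrics(105, 281, 20, 177).items()}
```

Both columns of the table are reproduced by these formulas. Note that recall uses TP + FN = 282 as the number of actual positives and specificity uses TN + FP = 301 as the number of actual negatives.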

Table 4. Examples selected for case study 2.

Uniprot ID	Organism	Real Class	Prediction
A0A0B2A6F6	Microbacterium mangrovi	Negative	Negative
A0A0B3BP14	Thermoanaerobacter sp. YS13	Positive	Positive
A0A0G0PX59	Candidatus Falkowbacteria bacterium	Negative	Positive
A0A087E2J5	Bifidobacterium thermacidophilum	Positive	Negative

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
