2. Results
We have analyzed the correlation between protein abundance and base elongation efficiency index (EEI) value for various groups of microorganisms (see the details in the Materials and Methods section) and have investigated how this correlation depends on the following factors:
Base EEI type, i.e., the mode of evolutionary optimization of translation exhibited by a particular genome;
Taxonomical identity of an analyzed genome;
Cell doubling time, i.e., microorganism’s reproduction rate;
Mean (M) and standard deviation (R) of ranks of ribosomal protein genes measured on the base EEI scale.
Taking into account these factors allows us to study the structure of the sample, disentangling their impact on the correlation coefficient value between protein abundance and EEI.
We have also analyzed how various genomic features are associated with the obtained correlation coefficients between base EEI and protein abundance (corr(PA|EEI)). Neither genome length (r = −0.004, p = 0.85), nor the number of genes (r = 0.01, p = 0.84), nor the number of tRNAs (r = 0.36, p = 0.37) correlates significantly with corr(PA|EEI). At the same time, the number of ribosomal protein genes (r = 0.488, p = 1 × 10⁻¹⁶) and GC content (r = −0.394, p = 0.02) do correlate significantly with corr(PA|EEI). The number of ribosomal genes also correlates with the minimal doubling time of a microbe (Spearman’s correlation coefficient r = −0.428, p = 0.046).
To assess how representative the proteomic data are, we have calculated proteome coverage, i.e., the percentage of protein-coding genes present in the proteomic data. Coverage varies among samples: the median coverage per studied organism is 50.8%, with a standard deviation of 24.1%. This means that for most of the analyzed organisms, the data used for analysis do not characterize the entire proteome, but do cover a significant part of it.
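The coverage measure described above can be sketched in a few lines of Python. This is only an illustration: the function name and the gene identifiers are hypothetical, and matching detected proteins to annotated genes by identical identifiers is an assumption, not a description of the pipeline used in this study.

```python
def proteome_coverage(protein_coding_genes, detected_proteins):
    """Return the percentage of protein-coding genes found in the proteomic data."""
    genes = set(protein_coding_genes)
    if not genes:
        raise ValueError("the genome has no protein-coding genes")
    # Count only detected proteins that map to an annotated gene (assumed: same IDs).
    covered = genes & set(detected_proteins)
    return 100.0 * len(covered) / len(genes)


# Hypothetical identifiers, for illustration only.
genome = ["g1", "g2", "g3", "g4"]
proteomics = ["g1", "g3", "gX"]  # "gX" is not annotated and is ignored
coverage = proteome_coverage(genome, proteomics)  # 50.0
```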
The correlation, coverage, minimal doubling time, EEI type, and mean (M) and standard deviation (R) values for each organism are given in Table 1.
Overall, the mean Spearman’s correlation coefficient between protein abundance and EEI calculated for the whole sample equals 0.4 (the boxplot depicting the corresponding descriptive statistics is shown in Figure 1). The majority of the analyzed organisms, with the exception of Neisseria meningitidis, show a significant correlation between protein abundance and base EEI values. However, the correlation coefficient values vary greatly among the organisms.
This result means that predicting protein abundance solely based on elongation translation characteristics, such as those calculated by EloE, will have good accuracy for some organisms and poor accuracy for others. Further analysis aims to reveal the parameters that contribute to the correlation coefficients’ values.
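As a minimal sketch of how a per-organism corr(PA|EEI) value could be obtained, the following Python code computes Spearman’s rank correlation as the Pearson correlation of tie-averaged ranks. The function names are illustrative, and the code assumes that protein abundance and EEI values are supplied as two equally long lists ordered by gene; it is not the implementation used in this study.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(abundance, eei):
    """Spearman's rank correlation between protein abundance and EEI values."""
    if len(abundance) != len(eei) or len(abundance) < 2:
        raise ValueError("need two sequences of equal length (at least 2 genes)")
    return pearson(average_ranks(abundance), average_ranks(eei))
```

Because only ranks enter the calculation, any monotonic relationship between EEI and abundance gives a coefficient of 1 regardless of its shape, which suits abundance data spanning several orders of magnitude.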
2.1. Dependence of the Correlation between Protein Abundance and the EEI on EEI Type
To determine the patterns of the correlation coefficients’ distribution among the organisms depending on their mode of evolutionary optimization of translation, we split the sample into several subsamples according to the genome’s base EEI type established by EloE (see Figure 2).
The highest correlation was obtained for the organisms belonging to the EEI1 type, which relies primarily on codon usage optimization for efficient translation. The correlation coefficients for organisms assigned to the EEI2 and EEI4 types are significantly lower. In these organisms, elongation efficiency is optimized through the number of secondary structures in mRNA (EEI2 type), or through both codon usage and the number of secondary structures in mRNA (EEI4 type).
It is important to note that organisms belonging to types other than the codon-usage-only type (EEI1) do not demonstrate higher correlation coefficients when elongation efficiency indices are calculated taking into account codon usage bias only, i.e., using the EEI1 formula (see Figure 3 and Table A1). For the organisms belonging to the type that minimizes the number of secondary structures (EEI2), the correlation coefficients between the EEI1 indices and protein abundance are significantly lower (p = 0.02, Welch’s t-test) than those between the base EEI indices and protein abundance (Figure 3a). They are also lower for Pseudomonas aeruginosa, which belongs to the type that considers only the energy of secondary structures (EEI3, see Figure 3b), though the sample size is too small to draw any general conclusions. Finally, for the type that considers both codon usage bias and the number of secondary structures in mRNA (EEI4, Figure 3c), corr(PA|EEI4) values are higher than corr(PA|EEI1) values, but only as a trend that does not reach statistical significance (p = 0.24). Thus, an approach that considers different elongation efficiency types improves the accuracy of predictions for organisms that do not demonstrate a clear codon usage optimization pattern.
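The comparisons above rely on Welch’s t-test, which does not assume equal variances in the two groups. The sketch below computes the Welch t statistic and the Welch-Satterthwaite degrees of freedom for two groups of correlation coefficients using only the Python standard library. The function name is illustrative; converting the statistic into a p-value requires the t distribution, which is available in statistical packages (for example, `scipy.stats.ttest_ind` with `equal_var=False`).

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    if na < 2 or nb < 2:
        raise ValueError("each group needs at least two observations")
    # Squared standard errors per group; the variances are not pooled.
    se_a = variance(sample_a) / na
    se_b = variance(sample_b) / nb
    if se_a + se_b == 0:
        raise ValueError("both groups have zero variance")
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (na - 1) + se_b ** 2 / (nb - 1))
    return t, df
```

With only a handful of organisms per EEI type, the degrees of freedom are small, so even sizeable differences in mean correlation may not reach significance.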
2.2. Dependence of the Correlation between Protein Abundance and the EEI on Phylogeny
Phylogenetically distant organisms can differ significantly in the regulation of gene expression. Therefore, the effect of translation elongation factors on the overall level of gene expression may also differ among phylogenetically diverse organisms. Accordingly, the ability to predict protein abundance from translation elongation characteristics can vary greatly between phylogenetic groups.
We have mapped the analyzed strains onto a phylogenetic tree to reflect the diversity of phylogenetic groups represented in the analysis and to determine for which groups the prediction of protein abundance by EloE is most accurate (Figure 4, rendered using iTOL [60]).
As one can see, the tree includes both species for which codon usage bias is known to be a reliable measure of translation elongation efficiency (such as E. coli) and species that have been shown to contravene that pattern (such as H. pylori and representatives of the genus Mycoplasma). Accordingly, the former belong to the EEI1 optimization type, while the latter are distributed among the other elongation efficiency optimization types, which take into account the effect of secondary structures in mRNA. Moreover, there are a number of species that have not been studied in this regard before and are therefore of special interest.
The most represented taxa are phylum Firmicutes (namely, class Bacilli) and phylum Proteobacteria (in particular, class Gammaproteobacteria). The mean correlation coefficients among the studied organisms of these taxa are 0.59 and 0.46, respectively, which are higher than the mean correlation coefficient for the entire dataset (0.4). Notably, most of the bacteria in these classes belong to the EEI1 type, which shows higher correlation coefficients. However, this difference is significant only for class Bacilli compared with the other microorganisms in the analyzed set (Welch test, p = 2.2 × 10⁻⁵). Other taxa are represented by only a couple of species, if any, and their corr(PA|EEI) values vary widely. The differences among correlation coefficients probably arise from the different extents to which codon usage bias and secondary structures influence gene expression in different species.
Thus, when proteomic data are absent for a particular representative of one of the classes that demonstrate a relatively high correlation between protein abundance and base EEI, elongation efficiency indices can be used for a theoretical assessment of the expected protein expression profile, though the biological implications of belonging to a specific elongation efficiency optimization type might vary depending on the particular taxon.
2.3. Dependence of the Correlation between Protein Abundance and the EEI on Minimal Doubling Time
Doubling time, a characteristic reflecting reproduction rate, varies greatly both among bacterial species and within the same species grown under different conditions [61]. Bacterial growth rate is known to correlate with ribosome abundance [62] and, because the active ribosome fraction is reduced during slow growth, with the overall translation rate [63]. However, translation elongation maintains a significant rate even in poor nutrient conditions with slow bacterial growth [63], which enables cells to produce proteins crucial for survival in harsh environments in a timely manner.
The prediction of protein abundance using elongation efficiency indices assumes that the coding sequences of highly expressed genes, such as ribosomal protein genes, are heavily optimized compared to genes with a low level of basal expression. Consequently, if elongation efficiency is optimized more evenly across the genome because elongation is a less essential step in determining protein abundance than, for instance, gene regulation, the quality of protein abundance prediction for such an organism can be reduced. Higgs and Ran [64] found a low correlation between tRNA gene abundance and codon usage for most bacteria with a long doubling time. They proposed that, although translation is the limiting factor of division in fast-growing organisms, this is not the case for slowly multiplying organisms. Although their results could also be explained by a high impact of mRNA secondary structures on translation, this aspect is still worth testing.
It has also been demonstrated [65] that prokaryotic growth rate is highly correlated with codon usage bias. In fast-growing organisms, codon usage bias is more pronounced due to codon usage optimization, which is crucial since the tRNA pool becomes limiting at very high growth rates. Based on the codon usage bias of ribosomal protein genes, Weissman, Hou, and Fuhrman predicted [66] the minimum doubling time for about 200,000 prokaryotes. This estimation of growth rate divides prokaryotes into two groups that fit their ecological roles. The first is copiotrophs, fast-growing microbes that thrive in nutrient-rich environments. The other is oligotrophs, microbes that are adapted to low nutrient levels and tend to grow slowly. Based on these results, the authors defined an oligotroph as an organism for which selection for rapid maximal growth is weak enough that translation efficiency is not optimized through codon usage.
In light of the above, we hypothesize that protein abundance predictions will be less accurate for prokaryotes with a high minimal doubling time.
Indeed, Figure 5a shows an increase in corr(PA|EEI) with a decrease in the minimum doubling time (DT), although bacteria with fast growth and a low correlation coefficient also exist. The Pearson correlation coefficient between corr(PA|EEI) and minimal doubling time for 25 organisms is r = −0.446 (p = 0.025). No relationship was found between the base EEI type and the doubling time. However, it is worth noting that slowly growing bacteria (DT ≥ 5 h) are mostly represented by EEI types that consider secondary structures (only one out of seven organisms belongs to the EEI1 type). Consistent with previous studies, codon usage bias only weakly reflects the gene expression profile of the other six organisms, as shown by calculating corr(PA|EEI) for each of the five EEI types (see Table A1). Considering the secondary structures yields higher (but still quite low) correlation coefficients.
We hypothesize that some prokaryotic species living in harsh environments could demonstrate a similar level of translation efficiency optimization throughout the genome. Such organisms are supposed to show a high minimum doubling time and lower translation elongation efficiency for ribosomal genes than fast-growing species. As mean elongation efficiency of ribosomal proteins is reflected by M values, we have compared them for fast-growing and slow-growing prokaryotes (Figure 5b).
The Welch test comparing M values of fast-growing organisms (minimal doubling time of no more than two hours) and slow-growing organisms (minimal doubling time of more than five hours) shows a significant difference (p = 7.059 × 10⁻⁶). The comparison of medium-growing organisms (minimal doubling time between two and five hours) and slow-growing organisms also shows a significant difference in M values (p = 0.0002).
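The grouping used in this comparison can be written down explicitly. The sketch below assigns an organism to a growth class from its minimal doubling time using the thresholds stated above (two and five hours) and collects M values per class. The function names and the input format are hypothetical, and placing a doubling time of exactly five hours in the medium class is an assumption.

```python
def growth_class(min_doubling_time_h):
    """Classify an organism by its minimal doubling time, given in hours."""
    if min_doubling_time_h <= 0:
        raise ValueError("doubling time must be positive")
    if min_doubling_time_h <= 2:
        return "fast"
    if min_doubling_time_h <= 5:  # assumption: exactly 5 h counts as medium
        return "medium"
    return "slow"


def group_m_values(records):
    """Group M values by growth class; records are (doubling_time_h, m_value) pairs."""
    groups = {"fast": [], "medium": [], "slow": []}
    for doubling_time_h, m_value in records:
        groups[growth_class(doubling_time_h)].append(m_value)
    return groups
```

The resulting groups can then be compared pairwise with Welch’s test, as done for the fast versus slow and medium versus slow comparisons.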
Notably, the lower correlation between protein abundance and elongation efficiency for organisms with a higher minimum doubling time cannot be explained solely by a weaker optimization of ribosomal protein genes in favor of other genes. To test this, one can ignore the elongation efficiency of ribosomal protein genes when selecting the base EEI type and instead select the EEI type that shows a higher correlation coefficient between protein abundance and elongation efficiency, which simulates the use of the optimal group of highly optimized genes. Even then, the correlation coefficients do not necessarily rise. In particular, changing the EEI type greatly increases the correlation coefficient for only two of the seven slow-growing organisms under study (from 0.12 to 0.34 for Acidithiobacillus ferrooxidans, and from 0.36 to 0.46 for Leptospira interrogans; see Table A1). In summary, the prediction of protein abundance is less efficient for slow-growing organisms, which can be explained by less pronounced differences in elongation efficiency optimization throughout the genomes of these organisms. In other words, translation elongation efficiency does not appear to be a limiting factor in determining protein abundance for slow-growing microorganisms.
2.4. Dependence of the Correlation between Protein Abundance and the EEI on Elongation Efficiency of Ribosomal Protein Genes
As mentioned earlier, the ranks of ribosomal protein genes, which determine the M (mean) and R (standard deviation) parameters, are used to establish a genome’s base elongation efficiency index type, i.e., the type that most accurately describes the mode of evolutionary optimization of translation in that genome. Here we examine how the correlation coefficient between the EEI and protein abundance depends on the M and R values for the base EEI type (see Figure 6).
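To make the M and R parameters concrete, the sketch below computes the mean and standard deviation of the ranks of ribosomal protein genes in a gene list sorted by EEI. Several details are assumptions made for illustration and may differ from the EloE implementation: ranks are expressed as percent positions in the list sorted by increasing EEI, ties keep their input order, and R is the sample standard deviation. The function and argument names are hypothetical.

```python
from statistics import mean, stdev


def ribosomal_rank_stats(eei_by_gene, ribosomal_genes):
    """Return (M, R): mean and standard deviation of ribosomal gene ranks.

    eei_by_gene maps gene ID -> EEI value; ribosomal_genes lists gene IDs.
    """
    # Sort genes by increasing EEI; ties keep their input order (stable sort).
    ordered = sorted(eei_by_gene, key=eei_by_gene.get)
    n = len(ordered)
    # Assumed rank scale: percent position, so the top gene has rank 100.
    percent_rank = {gene: 100.0 * (i + 1) / n for i, gene in enumerate(ordered)}
    ranks = [percent_rank[g] for g in ribosomal_genes if g in percent_rank]
    if len(ranks) < 2:
        raise ValueError("need at least two ribosomal genes with EEI values")
    return mean(ranks), stdev(ranks)
```

Under these assumptions, a genome whose ribosomal protein genes cluster at the top of the EEI-sorted list has a high M and a low R, which is the pattern that the selection of the base EEI type favors.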
The Pearson correlation coefficient between M and corr(PA|EEI) for 25 organisms is 0.7344 (p = 2.9 × 10⁻⁵). The Pearson correlation coefficient between R and corr(PA|EEI) is −0.454 (p = 0.022). This reassuring result indicates that the strategy of maximizing M and minimizing R, which EloE uses to determine the base EEI type, not only has a theoretical basis but is also supported by experimental data.
As these parameters are highly correlated with corr(PA|EEI), they could be used for estimating prediction potential (correlation coefficient between the EEI and protein abundance) for an organism under study. Also, these parameters are calculated by the algorithm itself and do not require the involvement of additional data, which makes them convenient enough to assess the efficiency of the algorithm.
For this purpose, a linear regression model has been built. The independent variable is represented by the M parameter only, since M and R parameters are highly correlated.
The coefficient of determination (R²) equals 0.35, and the mean squared error (MSE) equals 0.011. The test for significance of regression gives F > F-critical (10.36 > 4.2793), p = 0.038, which means that the regression model is statistically significant. In summary, these statistics show that the model has predictive power.
Using this formula with caution, and taking into account the observed range of M values, one could predict the expected correlation coefficient for another organism that lacks sufficient data on its protein expression profile.
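For readers who wish to fit a comparable model to their own data, the sketch below performs an ordinary least-squares fit of corr(PA|EEI) on M and reports the coefficient of determination, the mean squared error, and the F statistic for the significance of regression. It is a generic illustration rather than the model fitted here: the function name is hypothetical, the MSE is taken as the residual sum of squares divided by n, and the F statistic has 1 and n − 2 degrees of freedom. For a single predictor, F equals R²(n − 2)/(1 − R²).

```python
def fit_simple_regression(m_values, correlations):
    """Least-squares fit of correlations = intercept + slope * M."""
    n = len(m_values)
    if n != len(correlations) or n < 3:
        raise ValueError("need at least three paired observations")
    mx = sum(m_values) / n
    my = sum(correlations) / n
    sxx = sum((x - mx) ** 2 for x in m_values)
    if sxx == 0:
        raise ValueError("M values must not all be identical")
    sxy = sum((x - mx) * (y - my) for x, y in zip(m_values, correlations))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual and total sums of squares.
    ss_res = sum((y - (intercept + slope * x)) ** 2
                 for x, y in zip(m_values, correlations))
    ss_tot = sum((y - my) ** 2 for y in correlations)
    r2 = 1 - ss_res / ss_tot
    mse = ss_res / n  # assumption: MSE averaged over n, not n - 2
    # F statistic with (1, n - 2) degrees of freedom; infinite for a perfect fit.
    f_stat = float("inf") if ss_res == 0 else (ss_tot - ss_res) / (ss_res / (n - 2))
    return {"slope": slope, "intercept": intercept,
            "r2": r2, "mse": mse, "f": f_stat}
```

A prediction for a new organism is then intercept + slope × M, and it should only be trusted for M values inside the range covered by the fitted data.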
In summary, we can use the EloE for a rough prediction of gene expression at the protein level. Taking into account the EEI type, doubling time, taxonomic identity, as well as the M and R parameters, allows us to derive an approximate estimate of the expected correlation coefficient between base EEI values and actual protein abundance.
3. Discussion
Gene expression is a multi-level process that involves regulation at the transcriptional and translational levels. Protein abundance reflects the overall effect of all the factors contributing to gene expression, and each of these factors has its own share in this cumulative effect. One of the intriguing questions in this context is how to predict basal gene expression from only partial information, in particular, from genomic sequence data. This study focuses on the correlation between translation elongation characteristics and proteomic data. As our analysis indicates, the mean correlation coefficient between protein abundance and the base elongation efficiency index (EEI) calculated for the whole sample is not high. This was expected, since we are trying to predict protein abundance from elongation efficiency alone, while protein yield is also influenced by other stages, including transcription and translation initiation [6,15,67], and by other factors such as the half-lives of the respective protein and mRNA [15,50,52], as well as the protein’s structure and its resistance to proteases [68,69,70]. To the best of our knowledge, this is the first time such an analysis of the correlation between protein abundance and different elongation efficiency measures has been performed on proteomics data for prokaryotes from such a range of taxonomic groups, including non-model organisms and organisms for which codon usage is known to be an ineffective measure of the translation elongation efficiency of their genes.
However, the correlation coefficients between protein abundance and the EEI values vary greatly among the organisms. The bioinformatic assessment of the factors affecting the correlation between protein abundance and elongation efficiency in prokaryotes has shown that several factors are associated with the value of the correlation coefficient. The first is the EEI type: organisms that correspond to the EEI1 type, which takes into account codon bias only, have significantly higher correlation coefficients. Such a difference between the types could be explained by the ambiguous [71] contribution of secondary structures to protein abundance. Although secondary structures in mRNA decrease ribosome velocity, they can protect mRNA from ribonucleases and, therefore, increase mRNA abundance. As a result, protein abundance could either decrease or increase under the influence of secondary structures. Thus, we should expect a lower prediction accuracy for organisms belonging to the optimization types for which secondary structures play a significant role in determining protein abundance (EEI2, EEI3, EEI4, and, probably, EEI5). Unfortunately, among the organisms with available protein profiles, Neisseria meningitidis, the only one belonging to the EEI5 base type, does not show a significant correlation between protein abundance and any EEI values, whether the base ones or the others, including classic codon usage bias. Therefore, we refrain from drawing any definite conclusions about that particular optimization type. It is worth noting, however, that for the organisms under study that fall into one of the optimization types characterized by the role of mRNA secondary structures (EEI2, EEI3, and EEI4), applying their base elongation efficiency index yields higher correlation coefficients than using EEI1, which represents classic codon usage bias. We believe that this indicates the complex nature and role of translation elongation efficiency in determining protein abundance in these classes of organisms.
The second factor is the taxonomic identity of the organism under study: class Bacilli, for example, is among those characterized by the highest correlation coefficients between EEI and protein abundance. Using this information to derive estimates of expected correlation coefficients for organisms that lack proteomic profiles seems to be a promising approach, though more data are needed to improve the quality of such an assessment. The third factor is the microorganism’s reproduction rate. We observe an increase in the correlation coefficient between the EEI and protein abundance with a decrease in the minimum doubling time; that is, fast-growing prokaryotes tend to have a high correlation coefficient. The lower correlation in slow-growing species might be associated with a similar level of elongation efficiency across the genome, which is reflected in ribosomal protein genes not being the most highly optimized group of genes in these species. The fact that genes encoding ribosomal proteins may not be highly efficient at translation elongation was shown for several Mycoplasma species (C. M. haemolamae, M. haemocanis, M. wenyonii, M. haemofelis, M. pneumoniae, C. M. haemominutum, and M. suis). These species demonstrate decreased M values and a reduced number of perfect local inverted repeats (potential hairpins) in the mRNA of both ribosomal and non-ribosomal genes, which makes the translation elongation efficiency of non-ribosomal genes similar to that of ribosomal ones [72]. Thus, there are various situations in which either an organism possesses a quite compact and evenly optimized genome, or translation elongation efficiency does not appear to be a limiting stage in determining protein level. However, we have also demonstrated that, in general, the initial approach used by EloE, which relies on assessing the ranks of ribosomal proteins in the gene list sorted by the base EEI values, is consistent with the experimental data for the organisms under study, especially for organisms with a high number of ribosomal protein genes and low GC content. Therefore, it can be used in the further development of algorithms that would take into account not only translation elongation but also other stages that affect the level of gene expression.
One of the difficulties in studying the relationships between elongation efficiency characteristics and protein abundance at the organism level is the lack of genome-wide protein abundance profiles. Assessing the actual correlation between protein abundance and elongation efficiency indices requires representative datasets that include protein-encoding genes with various expression levels for taxonomically divergent organisms, including non-model ones. However, as more proteomic studies generating a full protein profile of an organism are published, the picture of how particular aspects of the optimization of translation elongation efficiency affect protein abundance in various microorganisms will become clearer and more detailed. We believe that a thorough bioinformatic estimation of factors contributing to protein abundance, such as elongation efficiency, paying attention to the actual biodiversity of prokaryotic species, is an important step towards in silico prediction of protein abundance levels.