Article

Gene-Family Extension Measures and Correlations

Department of Evolutionary and Environmental Biology, University of Haifa, Haifa 3498838, Israel
*
Author to whom correspondence should be addressed.
Submission received: 2 June 2016 / Revised: 18 July 2016 / Accepted: 18 July 2016 / Published: 3 August 2016
(This article belongs to the Special Issue Structure and Evolution of Genome)

Abstract

The existence of multiple copies of genes is a well-known phenomenon. A gene family is a set of sufficiently similar genes, formed by gene duplication. Earlier works, conducted on a limited number of completely sequenced and annotated genomes, found that the size of a gene family and the size of the genome are positively correlated, and that several atypical microbes deviate from this general trend. In this study, we reexamined these associations on a larger dataset of 1484 prokaryotic genomes using several ranking approaches. We applied the ranking methods in such a way that genomes with lower numbers of gene copies would have lower rank. Until now only simple ranking methods were used; we applied the Kemeny optimal aggregation approach as well. Regression and correlation analyses were used to quantify and characterize the relationships between the paralog indices and genome size. In addition, boxplot analysis was employed for outlier detection. We found that, in general, all paralog indices correlate positively with genome size. As expected, different groups of atypical prokaryotic genomes were found for different paralog measures. Mycoplasmataceae and Halobacteria appear to be among the most interesting candidates for further research on evolution through gene duplication.

1. Introduction

The existence of significant gene redundancy (in other words, the existence of multiple copies of protein-coding genes) has been known for a long time. The availability of numerous complete prokaryotic genome sequences confirmed this and provided data to examine various factors that may affect the attributes of gene families [1,2,3,4]. There are several fundamental questions related to the origin and variability of gene copy number. In this study, we do not claim to contribute anything substantial to the discussion of these fundamental questions. Our work is specifically concerned with the association between the number of gene copies and genome size. As a rule, we use the term “gene copy” in the study; however, we sometimes use the term “paralogs” as shorthand for “members of a gene family” or, simply, gene copies. In the literature, one can find different usages of the term “paralog” [3,5]. Walter Fitch introduced this essential term [6] with the following meaning: paralogs are homologous genes that have diverged from each other because of genetic duplication. We hope that the occasional use of the term will not confuse the reader.
Strictly, a gene family is a set of several similar genes, formed by duplication of an original gene. In this study, for all practical purposes, a gene family is a subset of protein-coding genes belonging both to the same cluster of orthologous groups (COG) [7,8,9,10] and to the same genome. Our admittedly oversimplified approach has obvious limitations; yet, statistically, it works as well as other, more rigorous methods of paralog characterization.
Gene-families (see our operational definition above) are of variable size and of varying degree of similarity among their members. We believe that many aspects of gene-family’s attributes and origins require further study. In this study, we concentrate on the gene-family’s attributes, rather than their origins. Specifically, we do not try to distinguish effects of different types of gene duplication and horizontal gene transfer (HGT), since the relative contribution of gene duplication and HGT to genome expansion and variability is unknown [11,12,13,14].
One of the major associations related to gene-family size is that it correlates well with genome size [11,15,16]. Pushker et al. [4] determined these correlations for 127 eubacterial genomes, updating the earlier work of King Jordan et al., which was done on a more limited dataset [3].
Gene duplication and HGT are the processes that can change the size of numerous gene families, and the resulting differences can discriminate even between strains of the same microbe. Expansion of gene families represents an increased cost for a prokaryote. So, what is the evolutionary driving force behind retention of a gene duplicate? A plausible answer to the question has been proposed: adaptation to altered environments. The duplicated genes may serve as a genetic reservoir for coping with fluctuating environmental conditions such as altered salinity or thermal stress [17]. For the gene copy to avoid deletion, it must represent a positive response to environmental stress, e.g., by simply increasing gene dosage as a response to higher demand [11,18]. When the selective pressure is removed, the paralogs may be lost again [17].
What is the role of phylogeny in the process? Pushker et al. [4] wrote: “The relative contribution of these genes [paralogous genes] in each genome seems to be independent of phylogenetic affiliation”, citing [3] in support of the statement. In fact, King Jordan et al. wrote: “… the graph topology recovered from the data on lineage-specific gene expansions reflects a combined effect of phylogenetic relationships, common patterns of gene loss, and horizontal transfer” [3]. A big evolutionary question is whether gene duplication is a random or regulated process. There is an additional question: if a new paralog must evolve to provide a new selectable function, by which gradual evolutionary process would the copy be preserved?
Our study has several goals: (i) to confirm that the number of gene copies correlates positively with genome size and to measure the correlation using the biggest available dataset of prokaryotic genomes; (ii) to present quantitative descriptions of the association between gene-family size and genome size; (iii) to use boxplot analysis for outlier detection; and (iv) to find taxa that have atypical associations between gene-family size and genome size, which makes them good candidates for further genomic studies.

2. Methods

2.1. COGs Database and Input for Ranking

Here we used a very simple approach to paralogs: a gene family is a set of protein-coding genes from the same genome and from the same cluster of orthologous groups. In other words, we used the database of clusters of orthologous groups (COGs) [7,8,9,10] to prepare an input matrix of numbers of gene copies, from which estimates of the gene-family extension level (GFE level) are calculated. Historically, information about completely sequenced and annotated prokaryotic genomes was stored at ftp://ftp.ncbi.nih.gov/genomes/, including tables of protein features, called PTT files. On 2 December 2015, the collection was moved to ftp://ftp.ncbi.nih.gov/genomes/archive/old_refseq/Bacteria/. More than 2000 prokaryotic genomes belong to this frozen collection; however, only part of the collection was COG-annotated. Therefore, only those complete and COG-annotated genomes that were included in the NCBI dataset were considered. There are 1370 bacterial and 114 archaeal complete and COG-annotated genomes in our dataset. Proteins of these genomes are distributed among about 5600 COGs.
We created a combined matrix from this dataset of 1484 prokaryotic genomes. Rows and columns correspond to genomes and COGs, respectively. We indexed the genomes so that the ith genome corresponds to the ith row of the matrix. Every COG has its NCBI index. The entry (i,j) is the number of genes from the ith genome belonging to the jth COG.
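The construction of such a genome-by-COG count matrix can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `build_copy_matrix`, the input format (one `(genome, COG)` pair per annotated protein) and the toy identifiers are all assumptions made for the example.

```python
from collections import Counter

def build_copy_matrix(assignments):
    """Build a genome-by-COG matrix of gene-copy counts.

    assignments: iterable of (genome_id, cog_id) pairs, one pair per
    COG-annotated protein (hypothetical input format).
    Returns (genomes, cogs, matrix), where matrix[i][j] is the number
    of genes of genomes[i] that belong to cogs[j].
    """
    counts = Counter(assignments)
    genomes = sorted({g for g, _ in counts})
    cogs = sorted({c for _, c in counts})
    matrix = [[counts.get((g, c), 0) for c in cogs] for g in genomes]
    return genomes, cogs, matrix

# Toy input with made-up identifiers: genomeA has two genes in COG0001.
toy = [
    ("genomeA", "COG0001"),
    ("genomeA", "COG0001"),
    ("genomeA", "COG0002"),
    ("genomeB", "COG0002"),
]
genomes, cogs, matrix = build_copy_matrix(toy)
```

Here row 0 (genomeA) holds the counts 2 and 1, and row 1 (genomeB) holds 0 and 1; a zero means that the COG is absent from the genome.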
The goal was to rank genomes in such a way that genomes with a lower number of paralogs would have lower rank. The meaning of the expression “lower number of paralogs” is not uniquely defined and can be interpreted in several ways. Even defining an optimal ranking is a nontrivial task. In our review [19] we described several approaches to find a nearly optimal ranking using methods from the field of combinatorial optimization. Until now, rank aggregation methods have not been applied to the problem.

2.2. Kemeny Rank Aggregation Approach

The rank aggregation problem may be formulated as follows: given K partial rankings of N fixed elements, the objective is to find a complete ranking that minimizes the sum of “distances” between itself and each given partial ranking. In other words, the rank aggregation problem is to find a “consensus” ranking that reflects the characteristics of the given rankings. In particular, when the distance is defined as the Kendall tau distance, the optimal ranking is called the Kemeny optimal aggregation [20,21]. Genome ranking then amounts to finding a rating vector x that minimizes the sum of tau distances:
$$x_\tau = \arg\min_{x} \left[ \sum_{k=1}^{K} d_\tau(x, r_k) \right] \qquad (1)$$
where K is the number of all COGs, rk is the “individual” ranking related to COG k, and dτ is the Kendall tau distance between the rating vector x and rk. The Kendall tau distance between two permutations is the total number of pairs of elements for which the orders in the two permutations disagree.
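The objective in Equation (1) can be illustrated with a short sketch that counts pairwise disagreements. It is illustrative only. The function names are hypothetical, and two simplifying assumptions are made: items missing from either ranking are ignored (a simple way to handle partial rankings), and tied pairs are not counted as disagreements.

```python
from itertools import combinations

def kendall_tau_distance(rank_a, rank_b):
    """Count the pairs of items ordered differently by two rankings.

    rank_a, rank_b: dicts mapping item -> rank position.
    Only items present in both rankings are compared (assumption),
    and pairs tied in either ranking are not counted (assumption).
    """
    common = [item for item in rank_a if item in rank_b]
    disagreements = 0
    for u, v in combinations(common, 2):
        if (rank_a[u] - rank_a[v]) * (rank_b[u] - rank_b[v]) < 0:
            disagreements += 1
    return disagreements

def total_distance(candidate, rankings):
    """Value of the sum in Equation (1) for one candidate ranking."""
    return sum(kendall_tau_distance(candidate, r) for r in rankings)

# Toy example with three genomes and two per-COG rankings.
candidate = {"g1": 1, "g2": 2, "g3": 3}
per_cog = [
    {"g1": 3, "g2": 2, "g3": 1},  # fully reversed: 3 disagreements
    {"g1": 1, "g2": 3, "g3": 2},  # one swapped pair: 1 disagreement
]
score = total_distance(candidate, per_cog)
```

For this toy input the total is 4. A Kemeny optimal ranking is a candidate for which this total is smallest over all possible orderings.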
Informally, the rank aggregation problem is to combine many different rank orderings on the same set of objects in order to obtain the “consensus” ordering. In our case, one may say that every COG proposes its own (partial) ordering of genomes, and finding the function xτ (solving Equation (1)) provides the “optimal” ordering. Rank aggregation has been studied in many disciplines, most extensively in the context of social choice theory, where there is a rich literature dating from the latter half of the eighteenth century. By the definition, a Kemeny optimal ranking xτ minimizes the total number of pairwise disagreements within the sum (1) and maximizes sortedness.
Kemeny optimal aggregation has the property of eliminating noise from various different ranking schemes. Furthermore, Kemeny optimal aggregations are essentially the only ones that simultaneously satisfy natural and important properties of rank aggregation functions, called neutrality and consistency in the social choice literature, and the so-called Condorcet property [22]. Indeed, Kemeny optimal aggregations satisfy the extended Condorcet criterion.
It is known that finding a Kemeny optimal ranking is NP-hard [23,24]. This motivates the problem of finding a ranking that approximately minimizes the number of disagreements with the given input rankings. Given that Kemeny optimal aggregation is useful but computationally hard, how do we compute it? The sorting procedure, similar to a procedure described in [25], serves as such an approximation.

2.3. Ranking Methods

There are different methods to measure the number of gene copies (we call these GFE measures, which are estimates of the level of gene-family extension). Genome GFE levels are of interest to us since inter-species variation of genome GFE levels is strongly associated with genome ranking according to the number of paralogs. Ranking (or ordering) of objects may be performed in many different ways. Finding an optimal ordering is a nontrivial task. In our review [19] we described several approaches to find a nearly optimal ranking using methods from the field of combinatorial optimization. In this study, we apply four ranking methods: (i) according to an average number (ave); (ii) according to a fraction of paralogous gene families (p.i.); (iii) according to the sorting procedure (rank); and (iv) according to an index of multi-paralogous families (mp).

2.3.1. Average Ranking Method

If Ai,j is the value of the jth descriptor of the ith object, the average ranking method works in this way: for each object i, the average of all its descriptor values is calculated, which determines the rank of object i relative to other objects. All missing values are ignored. In our case, the objects are genomes, the descriptors are COGs and the descriptor values are the numbers of gene copies.
$$\mathrm{ave}_i = \frac{1}{K'} \sum_{j=1}^{K} A_{i,j} \qquad (2)$$
where K is the number of all COGs, Ai,j is the number of members of the jth COG in the ith genome, and K′ is the number of gene families in the ith genome (the number of Ai,j’s greater than zero).

2.3.2. Paralog Index

The number of gene-families of size larger than one (non-singletons) divided by the total number of gene-families is called “paralog index” (p.i.).
$$\mathrm{p.i.}_i = \frac{P}{K'}$$
where P is the number of non-singletons, and K′ is the number of gene families in the ith genome.

2.3.3. Index of Multi-Paralogous Families

The number of gene-families of size larger than two divided by the number of gene-families with sizes more than one is called “multi-paralog index” (mp).
$$\mathrm{mp} = \frac{P_2}{P}$$
where P is the number of non-singletons, and P2 is the number of gene families with more than two copies.
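The three index-based measures defined above (ave, p.i. and mp) can be computed from one row of the copy-number matrix, as in the following sketch. The function name `gfe_indices` and the toy row are assumptions for illustration; returning NaN for mp when a genome has no non-singleton families is also an assumption, since the text does not say how that case is handled.

```python
def gfe_indices(copy_counts):
    """Return (ave, p.i., mp) for one genome.

    copy_counts: gene-copy numbers of the genome, one entry per COG;
    a zero means the COG is absent from the genome.
    """
    families = [c for c in copy_counts if c > 0]   # the K' gene families
    k_prime = len(families)
    if k_prime == 0:
        raise ValueError("genome has no COG-annotated gene families")
    p = sum(1 for c in families if c > 1)          # non-singletons
    p2 = sum(1 for c in families if c > 2)         # more than two copies
    ave = sum(families) / k_prime
    pi = p / k_prime
    mp = p2 / p if p else float("nan")             # undefined without paralogs
    return ave, pi, mp

# Toy row: three singletons, two families of size 2, one of size 3,
# one of size 5, and one absent COG.
ave, pi, mp = gfe_indices([1, 1, 1, 2, 2, 3, 5, 0])
```

For the toy row there are seven families with 15 genes in total, four non-singletons, and two families with more than two copies, so ave = 15/7, p.i. = 4/7 and mp = 0.5.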

2.3.4. Sort Ranking

We used a procedure similar to the heuristic S-ranking procedure described in [25]. The procedure was applied to the input matrix to rearrange its rows. Since each genome is associated with a row of the matrix, the criterion by which adjacent rows (genomes) g1 and g2 were swapped is as follows: comparing the two rows, we considered only gene families present in both genomes, g1 and g2, and counted which row of the pair has larger values more frequently. In other words, if the genome associated with row i has bigger gene families than the genome associated with row i + 1, then these rows are swapped. We note that this procedure does not necessarily lead to the optimal ordering. Moreover, the resultant ranking depends on the initial ordering of the objects (genomes). Therefore, we performed 10 runs of the S-ranking procedure starting from randomly chosen orderings and calculated rating vectors x (Equation (1)) for each run. After 10 runs, we calculated an averaged rank and its standard deviation for each genome. The standard deviations appeared to be small enough to justify the heuristic S-ranking procedure.
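The adjacent-swap procedure described above can be sketched as follows. This is a hedged reading of the text, not the implementation from [25]: the function names are hypothetical, ties are left unswapped (an assumption), and passes are repeated until no adjacent pair is swapped. Each swap removes one pairwise violation without affecting any other pair, so the loop terminates.

```python
import random

def bigger(row_a, row_b):
    """True if row_a has the larger copy number more often than row_b,
    counting only COGs present (non-zero) in both genomes."""
    wins_a = wins_b = 0
    for a, b in zip(row_a, row_b):
        if a > 0 and b > 0:
            if a > b:
                wins_a += 1
            elif b > a:
                wins_b += 1
    return wins_a > wins_b

def sort_ranking(matrix, order):
    """Swap adjacent genomes until no genome precedes a 'smaller' one."""
    order = list(order)
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(order) - 1):
            if bigger(matrix[order[i]], matrix[order[i + 1]]):
                order[i], order[i + 1] = order[i + 1], order[i]
                swapped = True
    return order

def averaged_ranks(matrix, runs=10, seed=0):
    """Average rank of each genome over several random starting orders."""
    rng = random.Random(seed)
    n = len(matrix)
    sums = [0.0] * n
    for _ in range(runs):
        start = list(range(n))
        rng.shuffle(start)
        for position, genome in enumerate(sort_ranking(matrix, start)):
            sums[genome] += position + 1
    return [s / runs for s in sums]

# Toy matrix: genome 0 has the smallest families, genome 2 the largest.
toy = [[1, 1, 1], [2, 2, 1], [3, 3, 2]]
final_order = sort_ranking(toy, [2, 1, 0])
mean_ranks = averaged_ranks(toy)
```

In the toy matrix the pairwise comparison is consistent, so every starting order converges to the same ranking. With real data the comparison need not be transitive, which is why the result can depend on the starting order and why averaging over several runs is informative.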

2.4. Regression Analysis and Outlier Detection

The relationship between genome sizes and levels of genomic GFE was investigated via the application of correlation and regression analysis. Correlation analysis estimates the statistical significance of the association, whereas regression analysis provides an equation, which precisely describes the relationship. Moreover, this description of the association by equation has predictive value.
In the model selection, two information-based criteria, Bayesian information criterion (BIC) and Akaike information criterion (AIC), were employed to determine the superior model. These criteria balance between goodness of fit and number of parameters in a combined fashion [26]. Minimal scores of AIC determine the best model from a class of models, therefore when fitting a curve to a set of data points, the model with the lowest AIC is chosen. Here, polynomial functions with degrees varying from 1 to 10 were fitted to the data.
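The AIC-based choice of polynomial degree can be sketched in plain Python as follows. Several assumptions are made for the example: the fit uses ordinary least squares through the normal equations, the AIC is computed in its Gaussian least-squares form, n ln(RSS/n) + 2k, which differs from the full log-likelihood version only by a constant that does not change the ranking of models, and the data are synthetic. The function names are hypothetical. For degrees as high as 10, a numerically more stable fit (for example, an orthogonal polynomial basis) would be preferable to the raw normal equations used here.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def polyfit(xs, ys, degree):
    """Least-squares coefficients c[0] + c[1]*x + ... + c[degree]*x**degree."""
    size = degree + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(size)] for i in range(size)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(size)]
    return solve(a, b)

def aic(xs, ys, coeffs):
    """Gaussian least-squares AIC, up to an additive constant."""
    n = len(xs)
    rss = sum((y - sum(c * x ** p for p, c in enumerate(coeffs))) ** 2
              for x, y in zip(xs, ys))
    k = len(coeffs) + 1  # polynomial coefficients plus the error variance
    return n * math.log(rss / n) + 2 * k

def select_degree(xs, ys, max_degree):
    """Return the degree with minimal AIC and the AIC of every degree."""
    scores = {d: aic(xs, ys, polyfit(xs, ys, d)) for d in range(1, max_degree + 1)}
    return min(scores, key=scores.get), scores

# Synthetic data: a quadratic trend plus a small deterministic wiggle.
xs = [i / 10 for i in range(41)]
ys = [1 + 2 * x - 0.5 * x * x + 0.05 * math.sin(7 * i) for i, x in enumerate(xs)]
best, scores = select_degree(xs, ys, 4)
```

On these synthetic data a straight line fits poorly, so its AIC is much higher than that of the quadratic model; higher degrees are kept only if the reduction in residual error outweighs the penalty of two units per added parameter.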
A standard method for detecting outliers is boxplot analysis [27]. The notion of a quartile is an essential part of this method. Let us recall the definition of a quartile. Given a sorted list of numbers, the median is a value that divides the data into two parts so that half of the numbers are smaller than the median and half are greater than the median. Similarly, the quartiles Q1–Q3 split the data into four parts. The second quartile, Q2, is the median [28].
In boxplot analysis the first, second (median) and third quartiles are calculated. From these quantities the interquartile range (IQR), where IQR = Q3 − Q1, is computed, along with two additional values: upper whisker = min(max(x), Q3 + 1.5 × IQR) and lower whisker = max(min(x), Q1 − 1.5 × IQR). All these quantities are represented in a plot which consists of a box with added “T” shaped lines above and below. The box represents the first and third quartile and the T shaped lines are the upper and lower whiskers. The median is represented as a horizontal line within the box.
Outliers are defined as values outside the range defined by the whiskers. Here, we call these outliers atypical genomes. Once a model is fitted to the data, atypical genomes are determined by applying boxplot analysis to the residuals, that is, the differences between the original (response) values and the fitted values. These atypical genomes are marked in the relevant figures as crosses. The analysis was performed with the R statistical computing environment [29].
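The whisker rule above can be written as a small function applied to model residuals. This is a sketch under stated assumptions: quartiles are computed with Python's `statistics.quantiles` using the "inclusive" method, which may differ slightly from the hinges used by R's boxplot routines, and the function name and residual values are invented for the example.

```python
import statistics

def boxplot_outliers(values):
    """Return the indices of values lying outside the boxplot whiskers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_whisker = max(min(values), q1 - 1.5 * iqr)
    upper_whisker = min(max(values), q3 + 1.5 * iqr)
    return [i for i, v in enumerate(values)
            if v < lower_whisker or v > upper_whisker]

# Made-up residuals (observed minus fitted); the last one is far from the rest.
residuals = [-0.1, 0.0, 0.05, -0.05, 0.1, 0.02, -0.03, 1.5]
atypical = boxplot_outliers(residuals)
```

For these residuals Q1 = −0.035 and Q3 = 0.0625, so the upper whisker lies at about 0.209 and only the last value (index 7) is flagged.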

2.5. Correlation between GFE Measures

When a set of variables is related, estimating the correlation between a pair of variables using standard methods, e.g., Kendall’s tau, is uninformative, since standard correlation methods ignore the knowledge that the specific pair of variables is correlated with other variables. Partial and semi-partial correlation methods are modifications of the standard methods that take into account correlations with other variables. Partial correlation is used when a pair of variables, say x and y, are both correlated with a variable z. The coefficient expresses the residual correlation between variables x and y after eliminating the correlations of x and y with variable z. Figure 1, Figure 2, Figure 3 and Figure 4 show that all measures of paralog indices are correlated with genome size.
Therefore, we estimated the correlation between these indices by calculating Kendall’s tau (partial) correlation coefficient using ppcor R package [30].
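For a single controlling variable, the partial correlation can be obtained from the three pairwise coefficients with the standard first-order formula, as in the sketch below. It is a pure-Python illustration, not the ppcor code: the function names and the toy vectors are assumptions, Kendall's tau is computed in its tau-b form with a simple quadratic-time loop, and the formula is undefined when either variable is perfectly correlated with z.

```python
import math
from itertools import combinations

def kendall_tau_b(x, y):
    """Kendall's tau-b from concordant, discordant and tied pairs."""
    conc = disc = tie_x = tie_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue            # tied in both variables: not counted
        elif dx == 0:
            tie_x += 1
        elif dy == 0:
            tie_y += 1
        elif dx * dy > 0:
            conc += 1
        else:
            disc += 1
    return (conc - disc) / math.sqrt((conc + disc + tie_x) * (conc + disc + tie_y))

def partial_kendall(x, y, z):
    """Correlation of x and y after removing their correlation with z."""
    t_xy = kendall_tau_b(x, y)
    t_xz = kendall_tau_b(x, z)
    t_yz = kendall_tau_b(y, z)
    return (t_xy - t_xz * t_yz) / math.sqrt((1 - t_xz ** 2) * (1 - t_yz ** 2))

# Toy data: x and y both follow z (say, genome size) with small deviations.
z = [1, 2, 3, 4, 5, 6]
x = [1, 3, 2, 4, 6, 5]
y = [2, 1, 3, 5, 4, 6]
raw = kendall_tau_b(x, y)
partial = partial_kendall(x, y, z)
```

In this toy example the raw coefficient between x and y is 7/15 (about 0.47), whereas the partial coefficient is −2/13 (about −0.15): the apparent agreement between x and y is explained by their shared dependence on z.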

3. Results

3.1. Number of Paralogs is Correlated with Genome Size

Here we examine associations between gene-family size and genome size using different measures of number of paralogs in order to analyze the universality of the trend and to highlight factors possibly influencing deviations from the common trend. In addition, in some cases we examine associations between gene-family size and number of genes.

3.1.1. Percentage of Paralogous Gene Families is Correlated with Genome Size

We divided all protein-coding genes into two categories: singletons and genes appearing in more than one copy, i.e., belonging to paralogous families. The number of paralogous families divided by the total number of gene families is called the “paralog index” (p.i.). Pushker et al. [4] applied a closely related measure to 127 eubacterial genomes. (Pushker et al. used the following definitions: p.i. is the percentage of paralogs in the genome (genes with at least one local BLAST hit using the cut-offs) among all genes; ave is the average size of paralogous families (singletons are excluded).) Here, we applied p.i. to 1484 prokaryotic genomes and show the results in Figure 1a, where the paralog index is plotted vs. genome size. The correlation between paralog index and genome size is clearly seen, and the values of the correlation coefficients are as follows: the Spearman correlation is equal to 0.896, the Pearson correlation is equal to 0.866 and the Kendall rank correlation is equal to 0.723. We considered the latter coefficient the most relevant when analyzing ranking results; therefore, it was chosen for the analysis herein. We also see that the association of paralog index with genome size is different for small genomes as compared with larger genomes. A “break point” is located somewhere around 2.2 Mbp. The linear regression equation for small genomes is approximately y ≈ 0.1x, while for larger genomes it is y ≈ 0.03x + 0.15. The paralog index for smaller genomes grows faster with an increase of genome size compared to larger genomes. Since the data follow different linear trends over different regions, one could use piecewise linear regression, modeling the regression function in “pieces”. We preferred to apply the polynomial regression approach to all four measures of “genome GFE”.
The presented polynomial regression lines were chosen based on the AIC criterion (see Methods). The regression polynomial function is $0.25 + 2.69x - 0.71x^2 + 0.47x^3 - 0.12x^4$. There are outliers among both small and larger genomes. Interestingly, all outliers related to p.i. are located under the regression line, which means that outliers have a smaller fraction of paralogous gene-families than would be predicted by regression analysis. There are 16 outliers including M. leprae and 6 Vibrio genomes (see Table 1).
There are 15 Vibrio genomes in our dataset. They are shown in Figure 1b. We can see that they all make a cluster, while 6 Vibrio genomes are outliers and 9 genomes are “almost” outliers.

3.1.2. Average Number of Paralogs Correlates with Genome Size

In Figure 2, average size of a gene family (Equation (2)) in a given genome is plotted vs. the size of that genome. Correlation between average number of all gene copies in all COGs and genome size is clearly seen with the Kendall rank correlation equal to 0.767. Interestingly, unlike in Figure 1, here in Figure 2 we observe similar behavior between small and larger genomes. Ranking of objects based on average value across all nonzero attributes is known to be an oversimplified ranking method. Figure 2 is very noisy, indeed. If for p.i. only 16 genomes were detected as outliers, which is about 1% of all the examined genomes, for ave 67 genomes were detected as outliers, which is a larger fraction of the analyzed genomes (~4.5%). Thus, only a partial list of the outliers is shown in Table 2. (The complete list of the ave outliers is in the Supplementary materials Table S1.) There are individual representatives of different taxa among these outliers, including Pirellula, Bordetella, Burkholderia, etc.; however, we decided to show in Figure 2b only two highly represented groups, Mycobacterium genus and Halobacteria class (see also, Table S1).
Some genomes of Mycobacterium genus and Halobacteria have smaller average gene-family sizes than would be predicted by the regression polynomial function but, interestingly, all outliers of these two groups appear above the regression line (Figure 2b). From Table 2 we can make another interesting observation: all four Rhodococcus genomes are among the outliers. We hypothesize that an explanation of an incidence of Rhodococcus occurring in the group of outliers would be the same as for Mycobacteria, because Rhodococcus genus is closely related to Mycobacterium genus.

3.1.3. Ranking of Prokaryotic Genomes Based on Gene-Family Size Confirms Correlation with Genome Size

As we described in Methods, we used a sorting procedure to rank genomes according to their family sizes. In Figure 3, genome rank is plotted vs. the size of that genome. This ranking method results in a genome ordering close to Kemeny optimal [31]. The correlation between genome rank and genome size is clearly seen (Kendall rank correlation equal to 0.78). There are 46 outliers of the regression model constructed for the rank measure. They are listed in Supplementary Materials Table S2. Twenty-four of these 46 outliers belong to the Archaea; half of these 24 Archaea belong to the Halobacteria class and 5 of the remaining 12 Archaea are from Crenarchaeota. A partial list of the outliers is shown in Table 3.
We recognized some genomes of Mycobacterium genus as outliers of the regression model constructed for ave measure. None of them appear in Table S2. However, Mycobacterium leprae, which was not among outliers presented in Table S1, appears in Table 1 and Table S2. Halobacteria were among ave measure outliers in Table S1, and there are 12 Halobacteria in Table S2 as well. We show Halobacteria data in Figure 3b.

3.2. Fraction of Larger Gene-Families

In parallel to a paralog index (Figure 1), we calculated another simple measure of GFE. It is relative frequency of larger gene families:
$$\mathrm{mp} = \frac{\text{number of gene families with more than two gene copies}}{\text{total number of non-singletons}}$$
In Figure 4, the mp fraction is plotted vs. genome size. Interestingly, there is a striking similarity in shape between Figure 1 and Figure 4. In Figure 4, we see that the association of mp with genome size is different for small genomes as compared with larger genomes (as it was for the paralog index; see Figure 1). In the case of mp, a “break point” is located somewhere around 2.3 Mbp, similar to p.i. Small genomes produce a smeared cloud of points with multiple outliers, while larger genomes follow a linear regression line y ≈ 0.02x + 0.32. The regression polynomial function is $0.4 + 3.13x - 1.17x^2 + 0.81x^3 - 0.44x^4 + 0.23x^5 - 0.01x^6$. There are outliers among both small and larger genomes but mainly among the smaller ones. Among larger genomes there are a few genomes of Neisseria and Sulfolobus. Neisseria outliers have a smaller fraction of multiple paralogous gene families than would be predicted by regression analysis, while Sulfolobus show the opposite effect. Altogether, there are 29 outliers including 6 Phytoplasmas and 8 Mycoplasmas (see Table 4 and Table S4). Mycoplasmas are shown in Figure 4b. It seems that there is no correlation between genome size and mp for Mycoplasmas. For some of them, the mp index is predicted well by the regression polynomial function, and some of them are outliers. The latter are listed in Table 4.

4. Discussion

4.1. Number of Gene Copies Is Correlated with Genome Size

Correlation between gene-family size measured by paralog index and number of genes was discovered many years ago [2]. Huynen and van Nimwegen showed that an increase in the number of genes leads not only to an increase in the number of gene copies, but also to a relative increase of the number of large gene families over the number of small families. They obtained these results comparing complete genomes of six bacteria (E. coli, H. influenzae, H. pylori, M. genitalium, M. pneumoniae, and Synechocystis sp. PCC6803) and two Archaea (M. jannaschii and M. thermoautotrophicum). Huynen and van Nimwegen wrote [2] “as more genomes become available; it will be possible to analyze how general the observed trend is”.
In the early 2000s, the following rule was stated several times on the basis of a growing number of sequenced prokaryotic genomes: the numbers of paralogous genes and families are positively correlated with genome size [3,4,11,15,16]. Pushker et al. stated that “the relative contribution of paralogous genes in each genome seems to be independent of phylogenetic affiliation and, for a limited dataset, appears to depend on genome size” [4].
Our calculations, performed on a much larger dataset, confirmed the above-mentioned rules in general. In all the above-mentioned publications from the 2000s, only the simplest ranking methods were applied to the problem. We decided to apply Kemeny optimal aggregation, which is one of the most adequate ranking methods [20,21]. This method produced an ordering of genomes different from those of the simpler methods; however, all measures correlate with one another. The correlation levels are moderate, yet highly significant (p-values < $2.2 \times 10^{-16}$); therefore, it is likely that these different measures highlight the same underlying core phenomenon. This phenomenon is so strong that even the averaging method, which often gives unreliable results, is comparable with the valid Kemeny method in this case. Regarding atypical genomes, which are method-dependent, we propose to put more trust in the results produced by the latter technique (Figure 3, Table 3).

4.2. Atypical Genomes

We detected some genomes as outliers via the application of a boxplot analysis. We referred to these genomes as atypical in a sense that they are “far” from the trend found in Figure 1, Figure 2, Figure 3 and Figure 4. They were marked by red crosses and are listed in their respective complete and partial lists of atypical genomes (Table 1, Table 2 and Table 3, Table 5, Table 7, Table S1, Table S4). Notably, certain taxa are omnipresent or, in other words, they are atypical with respect to all three measures of GFE (e.g., Candidatus Cloacamonas acidaminovorans Evry, Pirellula and Orientia). Other taxa are almost omnipresent (e.g., Mycobacteriaceae family, Halobacteria class). The Mycoplasmas are the predominant family with regard to mp index (Table 4). Likewise, genomes of the Neisseria family are atypical, also with respect to mp index. Taxonomy statistics of outliers (i.e., species combined in taxa with the corresponding number of species within each taxon) were calculated (see Table S4).
Let us compare our outliers with the outliers found by our predecessors. Huynen and van Nimwegen [2], studying a rather small sample of eight prokaryotes, found one outlier: M. pneumoniae showed a relatively high frequency of large gene families. Pushker et al. [4] identified several genomes with atypical mp values: Mycoplasma pneumoniae, Mycoplasma penetrans, and Mycoplasma gallisepticum. Our results also show that Mycoplasmataceae is worth a separate discussion, which is given below. Pushker et al. [4] also mentioned the following outliers: Mycobacterium leprae, Pirellula sp., Shigella flexneri, Bordetella pertussis, B. parapertussis, and B. bronchiseptica. Our results only partly confirmed these observations. M. leprae is discussed below in a separate subsection devoted to the Mycobacteriaceae family. Likewise, a separate subsection is devoted to Pirellula. Shigella flexneri is not an outlier (Tables S3 and S4). Yet two members of the Bordetella genus were found as outliers for the average number of gene copies, B. bronchiseptica RB50 and B. petrii (Table S1).

4.2.1. Mycoplasmas

In Table 5, we show gene-family sizes of Mycoplasmataceae. The column titled 1 gives the number of singletons, and columns 2 and 3 give the numbers of gene families with two and three copies, respectively. Mycoplasmas have small genomes, with the number of COG-annotated proteins (NC) varying from ~250 to 700. The fraction of singletons, “1”/NC, is more or less invariant at about 70%–80%. mp measures the relative frequency of gene families with more than two copies per family: mp = <number of gene-families with more than two gene copies>/<total number of non-singletons>. For Mycoplasma fermentans M64, for example, mp is equal to 0.45, while the expected value is about 0.26. There are 383 singletons, 35 gene families composed of two copies each, 11 gene families of three gene copies, and 18 families with more than three gene copies, so mp = (11 + 18)/(35 + 11 + 18) = 29/64 ≈ 0.45. The total number of non-singletons, 64, is the expected number of paralog families (M. fermentans is not an outlier for the measures p.i., ave and rank), while 29 is a surprisingly high number of gene families with more than two gene copies. We do not yet have an answer to the question of why M. hyopneumoniae has a low mp index while M. bovis Hubei has a high one (study in progress).
Pushker et al. [4] estimated Mycoplasma gallisepticum as an atypical genome according to an average number of gene copies but it is not in our list of outliers (Table S1). Our calculations of ave show that M. gallisepticum has an average number of gene copies equal to 1.2, which is close to an expected value. Probably, differences both in calculations of an average number of gene copies and of outliers result in dissimilar outcomes. Pushker et al. [4] also identified two Mycoplasmas with atypical mp values: Mycoplasma pneumoniae and Mycoplasma penetrans. These two genomes appear in Table 4 as well.

4.2.2. Mycobacterium

General considerations suggested that large genetic diversity should exist among M. leprae strains; however, comparative genomics revealed that genetic variation is exceptionally rare [32,33]. All indices for the two strains of M. leprae are practically identical, so we refer to the species rather than discussing the two genomes separately. M. leprae is an outlier in two categories: p.i. is equal to 0.17, while the expected value is about 0.25; rank is equal to 207, while the expected value is about 740. Its ave is equal to 1.32, which is not close to the expected 1.55, but here it is only an “almost outlier”. Interestingly, in the two categories in which M. leprae is an outlier, all other members of this genus are absent. In the category ave, 10 non-tuberculosis Mycobacteria are outliers (Figure 2b), but ave is the noisiest and least reliable index of GFE; thus we consider only M. leprae a paralog-atypical species. Among mycobacterial species, Mycobacterium leprae has the smallest genome as a result of massive reductive evolution. The differences between M. leprae and all other Mycobacteria in the total number of protein-coding genes and in the number of genes having homologs are striking (Table S5). In fact, all Mycobacteria except M. leprae have rather similar genomic characteristics. There have been several attempts to explain this well-known observation (see [34] and references therein), but the very special reductive evolution of M. leprae still requires additional study before a plausible explanation can be given. Despite over a century of research, we still lack a clear understanding of the pathogenesis and physiology of this pathogen. Even basic epidemiologic and genomic questions are yet to be resolved completely. A reasonable speculation would be that reductive evolution results in a low level of paralogization; but evolution has acted on M. leprae in contradictory ways: a low number of gene copies on the one hand, and the largest proportion of pseudogenes among prokaryotes on the other [32]. About 50% of the M. leprae genome is seemingly devoid of function [32,35]. Comparative genomics of M. leprae is a challenging task.

4.2.3. Halophiles

Sanchez-Perez et al. [36] proposed a very reasonable hypothesis of environmental adaptation. The idea is that the original gene and its paralog (i.e., copy) share the same function, yet the paralog is expressed under abnormal environmental conditions (they named paralogs of this kind ecoparalogs). One example is the hyperhalophilic bacterium Salinibacter ruber, whose halophilic proteins have their optimal activity and stability at high salinity. Sanchez-Perez et al. also found examples of ecoparalogs in other prokaryotes. We are investigating whether ecoparalogization is the main reason why the majority of Halophiles have enlarged gene families (work in progress). Comparative genomics is the right instrument for this kind of analysis.

4.2.4. Pirellula

The marine bacterium Pirellula appears as an outlier for both the ave and S-rank measures (Tables S1 and S2). We are not the first to recognize this species as an outlier: Pushker et al. already noted, “Pirellula has an enormous genome with a surprisingly low relative number of paralogs” [4]. The appearance of Pirellula in Tables S1 and S2 and its absence from Table 1 are due to an overrepresentation of small gene families and the absence of large ones. Pushker et al. suggested that the reason for the reduced gene-family size might be the homogeneity of the marine environment. For instance, Pirellula has a greatly reduced number of transcriptional regulators [37]. There are four genomes even bigger than that of Pirellula with “a surprisingly low relative number of paralogs”. Trichodesmium, also called sea sawdust, is found in tropical and subtropical ocean waters. Hahella chejuensis is a marine microbe. Haliangium ochraceum is a species of moderately halophilic Myxobacteria. Myxococcus fulvus is a species from the Myxococcaceae family. Of these five genomes (Table 6), Pirellula and Trichodesmium are rank-outliers and, as such, appear in Table S2 as well. Both are marine bacteria.
The idea that “Gene duplications in prokaryotes can be associated with environmental adaptation” [38] looks very reasonable. In Halophiles, environmental adaptation results in expanded gene-families, while in big marine bacteria it results in reduced gene-family size.

4.2.5. Orientia tsutsugamushi

Orientia tsutsugamushi (OT), an obligate intracellular bacterium belonging to the family Rickettsiaceae of the subdivision alpha-Proteobacteria, is the causative agent of scrub typhus, or Tsutsugamushi disease. The complete genome sequences of two OT strains were obtained and COG-annotated [39,40]. Both strains have a single circular chromosome and possess no plasmid. The chromosomes are very similar in size (2,008,987 bp in Ikeda and 2,127,051 bp in Boryong) with almost identical average G + C contents (30.5% in both strains). The numbers of rRNA and tRNA genes are identical. The numbers of protein-coding genes and pseudogenes, the coding content, and the repeat content were identified by Nakayama et al. [41].
OT appears as an outlier in all three paralog measures. Orientia tsutsugamushi Ikeda has a surprisingly high average number of gene copies (1.83 instead of expected 1.36). Orientia tsutsugamushi Boryong has a surprisingly low paralog index (0.09 instead of expected 0.2) and low rank (38 vs. 480) (Table S6, Figure 1, Figure 2 and Figure 3). Genomic analysis of the two OT strains revealed that extensive reductive genome evolution as well as explosive and comprehensive amplification of repetitive sequences have occurred in OT. In both strains, repetitive sequences occupy nearly half the genome [40,41].
Nakayama et al. [40,41] defined OT paralogs as genes whose products exhibit at least 90% amino acid sequence identity over 60% of the alignment length. According to this definition, they found 1196 repeated genes that were classified into 85 OT paralogous gene families. Extensive gene decay has taken place in many Boryong-repeated genes, as in those of Ikeda. We used a rather different definition of a gene copy and obtained 772 paralogous genes classified into 115 OT paralogous gene families.
From Table S6 we conclude that all parameters except genome size are quite similar among all Rickettsiaceae. Our hypothesis for why OT is an outlier is that, in the case of OT, genome size is not a relevant genomic characteristic because of the very large number of repetitive sequences.

4.3. Ranking Methods

The objective of the study was to find associations between characteristics of genomic gene-family sizes and other genomic attributes, such as genome size. We believe that ranking genomes according to gene-family size, and then calculating coefficients of association between genome rank and a genome property, is a reasonable approach to revealing hidden driving factors. The goal is to rank genomes so that genomes with lower numbers of gene copies have lower ranks. In this study we ranked genomes by different methods (see Methods): according to (i) the average number of gene copies (ave); (ii) the fraction of paralogous gene families (p.i.); (iii) the sorting procedure (rank); and (iv) the fraction of multi-paralogous families (mp).
To compare the different ranking methods, Kendall tau rank correlation coefficients were calculated. Since all measurements of “GFE levels” of a genome are correlated with genome size (Figure 1, Figure 2, Figure 3 and Figure 4), partial correlations were calculated (that is, controlling for the effect of genome size on the estimated correlations). The coefficients are shown in Table 7.
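A minimal sketch of a partial Kendall correlation is given below. It assumes the standard first-order partial correlation formula applied to Kendall tau-b coefficients; the function names and the toy vectors are hypothetical, and this is not necessarily the exact computation performed by the R package used in this study [30].

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall tau-b between two equally long sequences (O(n^2) pair counting)."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        dx, dy = xi - xj, yi - yj
        if dx == 0 and dy == 0:
            continue                    # tied in both variables: ignored
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denominator = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    if denominator == 0:
        raise ValueError("tau is undefined when a variable is constant")
    return (concordant - discordant) / denominator


def partial_kendall(x, y, z):
    """Correlation of x and y after controlling for z (first-order partial)."""
    t_xy, t_xz, t_yz = kendall_tau_b(x, y), kendall_tau_b(x, z), kendall_tau_b(y, z)
    return (t_xy - t_xz * t_yz) / sqrt((1 - t_xz ** 2) * (1 - t_yz ** 2))


# Toy example: two hypothetical GFE measures (x, y) and genome size (z).
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
z = [1, 3, 2, 5, 4]
tau_partial = partial_kendall(x, y, z)
```

In this reading, x and y would be two GFE measures (for example p.i. and ave) and z the genome size, so that the coefficient reflects the agreement between the two measures that is not explained by their common dependence on genome size.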
Should these three indexes inevitably correlate? Not necessarily. In Table 8 we show an example of imaginary data to illustrate different estimates of the levels of paralogization.
Genome size orders the genomes as B, A, C; p.i. as A, C, B; ave as B, C, A; rank as A, B, C; and mp as B, C, A. The indices p.i., ave and rank characterize the distribution of gene-family sizes in the three genomes A, B, C differently. In our fictitious example none of the indices gives the order B, A, C, the order of genome sizes. We would not say that only one of the indices is correct; instead, we propose to consider all three estimates of GFE. Each estimate produces its own set of outliers, which we discussed above, and only a few genomes belong to the intersections of the outlier sets: Candidatus Cloacamonas acidaminovorans Evry, Pirellula and Orientia are omnipresent; Mycobacterium leprae and many Halobacteria appear in two subsets.

5. Conclusions

In earlier works it was found that the number of paralogs and the size of the genome are positively correlated. That result was obtained using the simplest methods for estimating the number of paralogs in a genome. In this study, we reexamined these associations on a larger dataset consisting of 1484 prokaryotic genomes, using several ranking approaches including the Kemeny optimal aggregation approach. We found that, for every measure of GFE, the association between the measure and genome size follows different, approximately linear trends over different ranges of genome size. Until now, only linear regression models had been applied to the association between gene-family size and genome size. We instead applied polynomial regression to all four measures of “genome GFE”, choosing the polynomial by the Akaike information criterion (AIC). For a more rigorous description, boxplot analysis was used for outlier detection.
We confirmed that the number of gene copies correlates positively with genome size. As expected, different groups of atypical prokaryotic genomes were found for different gene-family-size measures. We confirmed that M. leprae has a substantially lower number of gene copies than would be expected from its genome size. We found that the majority of the members of Mycoplasmataceae possess a surprisingly high number of gene families with more than two gene copies. Our results support the speculation that in Halophiles environmental adaptation results in expanded gene families, while in big marine bacteria it results in reduced gene-family size.
All the above-mentioned results were obtained by applying different measures of the number of gene copies in a genome. We propose using all four estimates of GFE because they may reflect different aspects of GFE. Kendall tau partial rank correlation coefficients were calculated between the different measurements of “GFE levels”. The measures are all pairwise correlated, each correlates with genome size, and all these correlations were found to be statistically significant.
In summary, we corroborated previously found associations between genome size and characteristics of gene families on a considerably larger dataset of prokaryotic genomes; we also used additional ranking methods to describe these associations more accurately, and we highlighted atypical microbes and whole taxonomic groups. Our results suggest that examining the gene-duplication history of these taxa may provide especially valuable insights into the underlying evolutionary processes.

Supplementary Materials

The following are available online at https://www.mdpi.com/2075-1729/6/3/30/s1: Complete table of results, Table S1: Complete list of atypical genomes according to average number of gene copies, Table S2: Complete list of atypical genomes according to S-Rank, Table S3: GFE indices of Shigellas, Table S4: Taxonomy of outliers, Table S5: Distribution of gene-family sizes of Mycobacteriaceae, Table S6: Orientia tsutsugamushi and Rickettsia.

Acknowledgments

Thanks to Nasseem Hanna for the latest version of the Sort ranking software. Thanks to Bilal Salih, Irit Cohen and Tatiana Tatarinova for previous versions of the Sort ranking software.

Author Contributions

A.B. stated the problem; G.C. and A.B. chose the numerical methods; G.C. performed the calculations; G.C. and A.B. analyzed the data; A.B. and G.C. wrote the paper.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
COG: Cluster of Orthologous Groups of proteins
HGT: Horizontal Gene Transfer
GFE: gene-family extension
Mbp: millions of base pairs
OT: Orientia tsutsugamushi
p.i.: the number of protein-coding gene families having more than one copy divided by the total number of COG-annotated protein-coding gene families
ave: average size of protein-coding gene families, including singletons
mp: the number of protein-coding gene families having more than two copies divided by the number of protein-coding gene families having more than one copy

References

  1. Brenner, S.E.; Hubbard, T.; Murzin, A.; Chothia, C. Gene duplications in H. influenzae. Nature 1995, 378, 140.
  2. Huynen, M.A.; van Nimwegen, E. The frequency distribution of gene family sizes in complete genomes. Mol. Biol. Evol. 1998, 15, 583–589.
  3. Jordan, I.K.; Makarova, K.S.; Spouge, J.L.; Wolf, Y.I.; Koonin, E.V. Lineage-specific gene expansions in bacterial and archaeal genomes. Genome Res. 2001, 11, 555–565.
  4. Pushker, R.; Mira, A.; Rodriguez-Valera, F. Comparative genomics of gene-family size in closely related bacteria. Genome Biol. 2004, 5, R27.
  5. Jensen, R.A. Orthologs and paralogs - we need to get it right. Genome Biol. 2001, 2, interactions1002.1001–interactions1002.1003.
  6. Fitch, W.M. Distinguishing homologous from analogous proteins. Syst. Zool. 1970, 19, 99–113.
  7. Tatusov, R.L.; Koonin, E.V.; Lipman, D.J. A genomic perspective on protein families. Science 1997, 278, 631–637.
  8. Tatusov, R.L.; Galperin, M.Y.; Natale, D.A.; Koonin, E.V. The COG database: A tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 2000, 28, 33–36.
  9. Tatusov, R.L.; Natale, D.A.; Garkavtsev, I.V.; Tatusova, T.A.; Shankavaram, U.T.; Rao, B.S.; Kiryutin, B.; Galperin, M.Y.; Fedorova, N.D.; Koonin, E.V. The COG database: New developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res. 2001, 29, 22–28.
  10. Tatusov, R.L.; Fedorova, N.D.; Jackson, J.D.; Jacobs, A.R.; Kiryutin, B.; Koonin, E.V.; Krylov, D.M.; Mazumder, R.; Mekhedov, S.L.; Nikolskaya, A.N.; et al. The COG database: An updated version includes eukaryotes. BMC Bioinform. 2003, 4, 41.
  11. Hooper, S.D.; Berg, O.G. Duplication is more common among laterally transferred genes than among indigenous genes. Genome Biol. 2003, 4, R48.
  12. Snel, B.; Bork, P.; Huynen, M. Genome evolution. Gene fusion versus gene fission. TIG 2000, 16, 9–11.
  13. Snel, B.; Bork, P.; Huynen, M.A. The identification of functional modules from the genomic association of genes. Proc. Natl. Acad. Sci. USA 2002, 99, 5890–5895.
  14. Kunin, V.; Ouzounis, C.A. The balance of driving forces during genome evolution in prokaryotes. Genome Res. 2003, 13, 1589–1594.
  15. Yanai, I.; Camacho, C.J.; DeLisi, C. Predictions of gene family distributions in microbial genomes: Evolution by gene duplication and modification. Phys. Rev. Lett. 2000, 85.
  16. Enright, A.J.; Kunin, V.; Ouzounis, C.A. Protein families and tribes in genome sequence space. Nucleic Acids Res. 2003, 31, 4632–4638.
  17. Gevers, D.; Vandepoele, K.; Simillon, C.; Van de Peer, Y. Gene duplication and biased functional retention of paralogs in bacterial genomes. Trends Microbiol. 2004, 12, 148–154.
  18. Hooper, S.D.; Berg, O.G. On the nature of gene innovation: Duplication patterns in microbial genomes. Mol. Biol. Evol. 2003, 20, 945–954.
  19. Bolshoy, A.; Tatarinova, T. Methods of combinatorial optimization to reveal factors affecting gene length. Bioinform. Biol. Insights 2012, 6, 317–327.
  20. Kemeny, J.G. Mathematics without numbers. Daedalus 1959, 88, 571.
  21. Kemeny, J.G.; Snell, J.L. Mathematical Models in the Social Sciences; The MIT Press: Cambridge, MA, USA, 1972.
  22. Young, H.P.; Levenglick, A. A consistent extension of Condorcet’s election principle. SIAM J. Appl. Math. 1978, 35, 285–300.
  23. Bartholdi, I.; Tovey, C.A.; Trick, M.A. Voting schemes for which it can be difficult to tell who won the election. Soc. Choice Welf. 1989, 6, 157–165.
  24. Dwork, C.; Kumar, R.; Naor, M.; Sivakumar, D. Rank aggregation methods for the web. In Proceedings of the 10th International Conference on World Wide Web (WWW01), Hong Kong, China, 1–5 May 2001; p. 613.
  25. Tatarinova, T.; Salih, B.; Dien Bard, J.; Cohen, I.; Bolshoy, A. Lengths of orthologous prokaryotic proteins are affected by evolutionary factors. BioMed Res. Int. 2015, 2015, 786861.
  26. Zucchini, W. An introduction to model selection. J. Math. Psychol. 2000, 44, 41–61.
  27. Tukey, J.W. Exploratory Data Analysis; Addison-Wesley: Boston, MA, USA, 1977.
  28. DeCoursey, W.J. Statistics and Probability for Engineering Applications with Microsoft Excel; Newnes: Burlington, ON, Canada, 2003.
  29. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2013.
  30. Kim, S. ppcor: An R package for a fast calculation to semi-partial correlation coefficients. Commun. Stat. Appl. Methods 2015, 22, 665–674.
  31. Bolshoy, A.; Salih, B.; Cohen, I.; Tatarinova, T. Ranking of prokaryotic genomes based on maximization of sortedness of gene lengths. J. Data Min. Genom. Proteom. 2014, 5, 151.
  32. Singh, P.; Cole, S.T. Mycobacterium leprae: Genes, pseudogenes and genetic diversity. Future Microbiol. 2011, 6, 57–71.
  33. Singh, P.; Benjak, A.; Schuenemann, V.J.; Herbig, A.; Avanzi, C.; Busso, P.; Nieselt, K.; Krause, J.; Vera-Cabrera, L.; Cole, S.T. Insight into the evolution and origin of leprosy bacilli from the genome sequence of Mycobacterium lepromatosis. Proc. Natl. Acad. Sci. USA 2015, 112, 4459–4464.
  34. Akinola, R.O.; Mazandu, G.K.; Mulder, N.J. A quantitative approach to analyzing genome reductive evolution using protein-protein interaction networks: A case study of Mycobacterium leprae. Front. Genet. 2016, 7, 39.
  35. McGuire, A.M.; Weiner, B.; Park, S.T.; Wapinski, I.; Raman, S.; Dolganov, G.; Peterson, M.; Riley, R.; Zucker, J.; Abeel, T.; et al. Comparative analysis of Mycobacterium and related Actinomycetes yields insight into the evolution of Mycobacterium tuberculosis pathogenesis. BMC Genom. 2012, 13, 120.
  36. Sanchez-Perez, G.; Mira, A.; Nyiro, G.; Pasic, L.; Rodriguez-Valera, F. Adapting to environmental changes using specialized paralogs. TIG 2008, 24, 154–158.
  37. Gloeckner, F.O.; Kube, M.; Bauer, M.; Teeling, H.; Lombardot, T.; Ludwig, W.; Gade, D.; Beck, A.; Borzym, K.; Heitmann, K.; et al. Complete genome sequence of the marine planctomycete Pirellula sp. strain 1. Proc. Natl. Acad. Sci. USA 2003, 100, 8298–8303.
  38. Bratlie, M.S.; Johansen, J.; Sherman, B.T.; Huang da, W.; Lempicki, R.A.; Drablos, F. Gene duplications in prokaryotes can be associated with environmental adaptation. BMC Genom. 2010, 11, 588.
  39. Cho, N.H.; Kim, H.R.; Lee, J.H.; Kim, S.Y.; Kim, J.; Cha, S.; Kim, S.Y.; Darby, A.C.; Fuxelius, H.H.; Yin, J.; et al. The Orientia tsutsugamushi genome reveals massive proliferation of conjugative type IV secretion system and host-cell interaction genes. Proc. Natl. Acad. Sci. USA 2007, 104, 7981–7986.
  40. Nakayama, K.; Yamashita, A.; Kurokawa, K.; Morimoto, T.; Ogawa, M.; Fukuhara, M.; Urakami, H.; Ohnishi, M.; Uchiyama, I.; Ogura, Y.; et al. The whole-genome sequencing of the obligate intracellular bacterium Orientia tsutsugamushi revealed massive gene amplification during reductive genome evolution. DNA Res. 2008, 15, 185–199.
  41. Nakayama, K.; Kurokawa, K.; Fukuhara, M.; Urakami, H.; Yamamoto, S.; Yamazaki, K.; Ogura, Y.; Ooka, T.; Hayashi, T. Genome comparison and phylogenetic analysis of Orientia tsutsugamushi strains. DNA Res. 2010, 17, 281–291.
Figure 1. (a) Fraction of paralogous families (p.i.) plotted versus genome size. The input dataset consists of 1484 prokaryotic genomes. The Kendall rank correlation between p.i. and genome size is equal to 0.72. The regression polynomial function is 0.25 + 2.69x − 0.71x^2 + 0.47x^3 − 0.12x^4. The regression is statistically significant (F statistic = 1790.059, p-value < 2.2 × 10^−16). The green line shows the fitted model and the black lines delimit the confidence interval at the 0.95 level. Atypical genomes are determined by boxplot analysis of the residuals (see text for details) and are marked by red crosses; (b) The same as (a), showing only genomes of species from the Vibrio genus.
Figure 2. (a) Genomic average size of gene-families versus genome size. Kendall rank correlation between average family size and genome size is equal to 0.77. Green line shows the fitted model and black lines delimit confidence interval at level of 0.95. Atypical genomes are determined by boxplot analysis on the residuals (see text for details) and are marked by red crosses. Regression is found to be statistically significant (F statistic = 176.698, p-value < 2.2 × 10^−16). Regression polynomial function is 1.66 + 13.92x + 0.82x^2 + 0.3x^3 − 0.47x^4 − 0.02x^5 + 0.87x^6 + 0.41x^7; (b) Showing genomes of the species from the Mycobacterium genus (black rectangles and rectangles with crosses mark atypical genomes) and genomes of the species from the Halobacteria class (red circles and circles with crosses mark atypical genomes).
Figure 3. (a) Genome ranking versus genome size for the same genomes. Ranking of prokaryotic genomes is performed applying a sorting procedure to the complete input matrix. Kendall rank correlation between a genome rank and its genome size is equal to 0.78. Green line shows the fitted model and black lines delimit confidence interval at level of 0.95. Atypical genomes are determined by boxplot analysis on the residuals (see text for details) and are marked by red crosses. Regression is found to be statistically significant (F statistic = 1672.68, p-value < 2.2 × 10^−16). Regression polynomial function is 741.36 + 14769.57x − 3783.31x^2 − 641.64x^3 + 880.83x^4 − 344.26x^5 + 277.53x^6; (b) Shows (magnifies) the genomes of the species from the Halobacteria class.
Figure 4. Relative frequency of larger gene families, mp = <number of gene-families with more than two gene copies>/<total number of non-singletons>. (a) Relationship between the mp index and genome size in the same prokaryotic genomes. Kendall rank correlation between mp and genome size is equal to 0.66. Green line shows the fitted model and black lines delimit confidence interval at level of 0.95. Atypical genomes are determined by boxplot analysis on the residuals (see text for details) and are marked by red crosses. Regression is found to be statistically significant (F statistic = 722.90, p-value < 2.2 × 10^−16). Regression polynomial function is 0.4 + 3.13x − 1.17x^2 + 0.81x^3 − 0.44x^4 + 0.23x^5 − 0.01x^6; (b) Relationship between the mp index and genome size in Mycoplasmas.
Table 1. Atypical genomes according to a paralog index measure ¹.
| Rank | p.i. | Size (Mb) | Atypical Genomes |
|---|---|---|---|
| 34.8 | 0.072 | 1.516 | Ehrlichia ruminantium Welgevonden |
| 38.3 | 0.094 | 2.127 | Orientia tsutsugamushi Boryong |
| 106 | 0.120 | 2.279 | Treponema pallidum SS14 |
| 379 | 0.158 | 3.168 | Prevotella melaninogenica ATCC25845 |
| 207 | 0.166 | 3.268 | Mycobacterium leprae Br4923 |
| 208 | 0.166 | 3.268 | Mycobacterium leprae TN |
| 611 | 0.178 | 3.286 | Brucella abortus bv 19941 |
| 769 | 0.198 | 3.939 | Vibrio cholera M662 |
| 763 | 0.194 | 4.033 | Vibrio cholera O1 biovar ElTor N16961 |
| 385 | 0.182 | 4.171 | Sodalis glossinidius morsitans |
| 820 | 0.204 | 4.236 | Vibrio cholera MJ1236 |
| 1483 | 0.193 | 4.494 | Candidatus Cloacamonas acidaminovorans Evry |
| 787 | 0.211 | 4.532 | Aliivibrio salmonicida LFI1238 |
| 1072 | 0.225 | 5.008 | Vibrio vulnificus MO624O |
| 1281 | 0.231 | 5.166 | Vibrio parahaemolyticus RIMD2210633 |
| 1293 | 0.234 | 5.969 | Vibrio harveyi ATCCBAA1116 |
¹ p.i.: paralog index; Rank: an averaged rank calculated for multiple runs of the S-ranking procedure. Genomes are sorted by ascending size of genome for easier comparison with Figure 1.
Table 2. Partial list of atypical genomes according to average number of paralogs ¹.
| Rank | Ave | Size (Mb) | Atypical Genomes |
|---|---|---|---|
| 246.8 | 1.521 | 0.853 | Onion yellows phytoplasma OYM uid58015 |
| 1225.1 | 1.915 | 2.809 | Halalkalicoccus jeotgali B3 uid50305 |
| 1233.4 | 1.936 | 2.821 | Halogeometricum borinquense DSM |
| 1235.3 | 2.008 | 2.848 | Haloferax volcanii DS2 uid46845 |
| 1091.1 | 1.878 | 2.914 | Halophilic archaeon DL31 uid72619 |
| 1240.8 | 2.067 | 3.420 | Haloarcula marismortui ATCC 43049 uid57719 |
| 1306.5 | 2.071 | 3.668 | Halopiger xanaduensis SH6 uid68105 |
| 1260.9 | 2.036 | 3.752 | Natrialba magadii ATCC 43099 uid46245 |
| 1419.5 | 2.378 | 3.889 | Haloterrigena turkmenica DSM 5511 |
| 948.4 | 2.228 | 4.644 | Mycobacterium JDM601 uid67369 |
| 1074.7 | 2.277 | 4.830 | Mycobacterium aviumparatuberculosis K10 |
| 1211.8 | 2.293 | 5.067 | Mycobacterium abscessus uid61613 |
| 1074.9 | 2.495 | 5.475 | Mycobacterium avium 104 uid57693 |
| 1275.8 | 2.399 | 5.548 | Mycobacterium gilvum Spyr1 uid61403 |
| 1303.6 | 2.491 | 5.620 | Mycobacterium gilvum PYRGCK uid59421 |
| 1306.9 | 2.483 | 5.705 | Mycobacterium MCS uid58465 |
| 1320.9 | 2.567 | 5.737 | Mycobacterium KMS uid58491 |
| 1319.4 | 2.582 | 6.048 | Mycobacterium JLS uid58489 |
| 1449.2 | 2.938 | 6.988 | Mycobacterium smegmatis MC2155 uid57701 |
| 1477.8 | 3.463 | 10.237 | Amycolatopsis mediterranei U32 uid50565 |
¹ Rank: an averaged rank calculated for multiple runs of the S-ranking procedure; ave: average number of paralogs.
Table 3. Partial list of atypical genomes according to S-Rank.
| Rank | Size (Mb) | Atypical Genomes |
|---|---|---|
| 622.8 | 1.591 | Candidatus Korarchaeum cryptofilum OPF8 |
| 803.4 | 2.001 | Halobacterium salinarum R1 |
| 811.5 | 2.014 | Halobacterium NRC1 |
| 1225.1 | 2.809 | Halalkalicoccus jeotgali B3 |
| 1233.4 | 2.821 | Halogeometricum borinquense DSM11551 |
| 1235.3 | 2.848 | Haloferax volcanii DS2 |
| 1091.1 | 2.914 | Halophilic archaeon DL31 |
| 1186.8 | 3.261 | Halorubrumlacus profundi ATCC 49239 |
| 1240.8 | 3.420 | Haloarcula marismortui ATCC 43049 |
| 1235.0 | 3.484 | Haloarcula hispanica ATCC 33960 |
| 1306.5 | 3.668 | Halopiger xanaduensis SH6 |
| 1260.9 | 3.752 | Natrialba magadii ATCC 43099 |
| 1419.5 | 3.889 | Haloterrigena turkmenica DSM 5511 |
| 1057.6 | 7.750 | Trichodesmium erythraeum IMS101 |
Table 4. List of atypical genomes according to mp ¹.
| Rank | mp | Size (Mb) | Atypical Genomes |
|---|---|---|---|
| 31.2 | 0.32 | 0.580 | Mycoplasma genitalium G37 |
| 166.6 | 0.34 | 0.602 | Candidatus Phytoplasma Mali |
| 21.9 | 0.00 | 0.706 | Candidatus Blochmannia floridanus |
| 246.7 | 0.49 | 0.707 | Aster yellows witches broom phytoplasma AYWB |
| 11.5 | 0.04 | 0.792 | Candidatus Blochmannia pennsylvanicus BPEN |
| 183.6 | 0.31 | 0.799 | Mycoplasma synoviae 53 |
| 31.8 | 0.39 | 0.816 | Mycoplasma pneumoniae M129 |
| 246.8 | 0.49 | 0.853 | Onion yellows phytoplasma OY M |
| 167.2 | 0.48 | 0.880 | Candidatus Phytoplasma australiense |
| 192.8 | 0.37 | 0.948 | Mycoplasma bovis Hubei 1 |
| 199.5 | 0.41 | 0.964 | Mycoplasma pulmonis UAB CTIP |
| 191.8 | 0.34 | 0.978 | Mycoplasma fermentans JER |
| 297.6 | 0.37 | 1.007 | Mycoplasma agalactiae |
| 186.2 | 0.45 | 1.119 | Mycoplasma fermentans M64 |
| 77.1 | 0.09 | 1.161 | Candidatus Ruthia magnifica Cm Calyptogena magnifica |
| 420.9 | 0.45 | 1.317 | Thermosphaera aggregans DSM 11486 |
| 411.0 | 0.48 | 1.580 | Staphylothermus hellenicus DSM 12710 |
| 358.7 | 0.44 | 1.667 | Gardnerella vaginalis ATCC 14019 |
| 481.4 | 0.46 | 1.796 | Streptococcus thermophilus CNRZ1066 |
| 196.2 | 0.20 | 1.887 | Haemophilus influenzae PittGG |
| 156.9 | 0.21 | 2.145 | Neisseria meningitidis alpha14 |
| 158.0 | 0.22 | 2.153 | Neisseria meningitidis 053442 |
| 154.3 | 0.23 | 2.154 | Neisseria gonorrhoeae FA 1090 |
| 160.0 | 0.22 | 2.184 | Neisseria meningitidis Z2491 |
| 166.2 | 0.24 | 2.272 | Neisseria meningitidis MC58 |
| 105.6 | 0.24 | 2.279 | Treponema pallidum SS14 |
| 859.3 | 0.54 | 2.702 | Sulfolobus islandicus Y G 57 14 |
| 1131.5 | 0.55 | 2.992 | Sulfolobus solfataricus P2 |
| 1483.0 | 0.33 | 4.494 | Candidatus Cloacamonas acidaminovorans Evry |
¹ mp = <number of gene-families with more than two gene copies>/<total number of non-singletons>.
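Table 5. Distribution of gene-family sizes of Mycoplasmataceae ¹.
| Genome Name | NP | NO | NC | 1 | 2 | 3 | >3 | mp |
|---|---|---|---|---|---|---|---|---|
| M. agalactiae PG2 | 742 | 267 | 475 | 335 | 42 | 10 | 4 | 14/56 |
| M. agalactiae uid46679 | 813 | 291 | 522 | 332 | 42 | 15 | 10 | 25/67 |
| M. arthritidis 158L3 1 | 631 | 214 | 417 | 347 | 20 | 3 | 3 | 6/26 |
| M. bovis Hubei 1 | 801 | 279 | 522 | 346 | 37 | 11 | 11 | 22/59 |
| M. bovis PG45 | 765 | 239 | 526 | 354 | 43 | 9 | 7 | 16/59 |
| M. capricolum ATCC 27343 | 812 | 236 | 576 | 390 | 58 | 10 | 7 | 17/65 |
| M. conjunctivae | 692 | 272 | 420 | 323 | 39 | 0 | 4 | 4/43 |
| M. crocodyli MP145 | 689 | 199 | 490 | 380 | 37 | 6 | 4 | 10/47 |
| M. fermentans JER | 797 | 247 | 550 | 388 | 38 | 8 | 12 | 20/58 |
| M. fermentans M64 | 1049 | 459 | 590 | 383 | 35 | 11 | 18 | 29/64 |
| M. gallisepticum R low | 763 | 274 | 489 | 357 | 43 | 4 | 6 | 10/53 |
| M. genitalium G37 | 475 | 91 | 384 | 330 | 15 | 4 | 3 | 7/22 |
| M. haemofelis Langford 1 | 1545 | 1258 | 287 | 230 | 16 | 2 | 1 | 3/19 |
| M. hominis ATCC 23114 | 523 | 145 | 378 | 315 | 21 | 1 | 4 | 5/26 |
| M. hyopneumoniae 232 | 691 | 254 | 437 | 331 | 39 | 1 | 3 | 4/43 |
| M. hyopneumoniae 7448 | 657 | 214 | 443 | 333 | 38 | 1 | 4 | 5/43 |
| M. hyopneumoniae J | 657 | 186 | 471 | 344 | 44 | 2 | 4 | 6/50 |
| M. hyorhinis HUB 1 | 658 | 194 | 464 | 339 | 36 | 7 | 2 | 9/45 |
| M. leachii PG50 | 882 | 316 | 566 | 398 | 50 | 9 | 8 | 17/67 |
| M. mobile 163K | 633 | 183 | 450 | 370 | 26 | 6 | 2 | 8/34 |
| M. mycoides capri LC 95010 | 922 | 303 | 619 | 400 | 55 | 6 | 14 | 20/75 |
| M. mycoides SC PG1 | 1017 | 325 | 692 | 397 | 55 | 15 | 16 | 31/86 |
| M. penetrans HF 2 | 1037 | 379 | 658 | 447 | 54 | 10 | 14 | 30/84 |
| M. pneumoniae M129 | 648 | 203 | 445 | 359 | 19 | 6 | 6 | 12/31 |
| M. pulmonis UAB CTIP | 782 | 222 | 560 | 387 | 36 | 8 | 17 | 25/61 |
| M. putrefaciens KS1 | 650 | 176 | 474 | 379 | 34 | 4 | 3 | 7/41 |
| M. suis Illinois | 845 | 592 | 253 | 209 | 14 | 0 | 2 | 2/16 |
| M. suis KI3806 | 794 | 553 | 241 | 212 | 11 | 1 | 1 | 2/13 |
| M. synoviae 53 | 659 | 180 | 479 | 357 | 33 | 10 | 5 | 15/48 |
| U. parvum serovar 3 ATCC 27815 | 609 | 196 | 413 | 346 | 25 | 1 | 2 | 3/28 |
| U. parvum serovar 3 ATCC 700970 | 614 | 173 | 441 | 360 | 29 | 3 | 2 | 5/34 |
| U. urealyticum serovar 10 ATCC 33699 | 646 | 230 | 416 | 342 | 25 | 3 | 2 | 5/30 |

¹ NP: number of proteins; NO: number of ORFans; NC: number of COG-annotated proteins; M.: Mycoplasma; U.: Ureaplasma.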
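The mp column of Table 5 can be recomputed from the family-size counts using the definition given in the footnote of Table 4. The following is a minimal sketch under that reading; the function name `mp_index` and the choice of example rows are illustrative and not taken from the article.

```python
def mp_index(n_size2, n_size3, n_larger):
    """Return mp as a (numerator, denominator) pair.

    numerator   = gene families with more than two copies (sizes 3 and >3)
    denominator = all non-singleton gene families (sizes 2, 3 and >3)
    Returns None when a genome has no non-singleton families.
    """
    multi = n_size3 + n_larger
    non_singletons = n_size2 + multi
    if non_singletons == 0:
        return None
    return multi, non_singletons


# Counts for family sizes 2, 3 and >3, as read from three rows of Table 5.
rows = {
    "M. agalactiae PG2": (42, 10, 4),
    "M. genitalium G37": (15, 4, 3),
    "M. pneumoniae M129": (19, 6, 6),
}

for name, counts in rows.items():
    num, den = mp_index(*counts)
    print(f"{name}: mp = {num}/{den} = {num / den:.2f}")
```

For M. genitalium G37 and M. pneumoniae M129 the resulting ratios, 7/22 and 12/31, round to the mp values 0.32 and 0.39 listed for those genomes in Table 4.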
Table 6. Partial list of atypical genomes according to average number of gene copies.
| Rank | Ave | Size (Mb) | Atypical Genomes |
|---|---|---|---|
| 861 | 1.827 | 6.196 | Pirellula staleyi DSM_6068_uid43209 |
| 1341 | 2.024 | 7.215 | Hahella chejuensis KCTC_2396_uid58483 |
| 1058 | 1.961 | 7.750 | Trichodesmium erythraeum IMS101_uid57925 |
| 1411 | 2.370 | 9.004 | Myxococcus fulvus HW_1_uid68443 |
| 1319 | 2.349 | 9.446 | Haliangium ochraceum DSM_14365_uid41425 |
Table 7. Pairwise partial Kendall correlation between all ranking methods ¹.
| | p.i. | Ave | Rank | mp | Genome Size |
|---|---|---|---|---|---|
| p.i. | - | 0.57 | 0.57 | 0.46 | 0.72 |
| Ave | 0.57 | - | 0.61 | 0.52 | 0.77 |
| Rank | 0.57 | 0.61 | - | 0.38 | 0.78 |
| mp | 0.46 | 0.52 | 0.38 | - | 0.66 |

¹ All correlations were controlled for genome size and are statistically significant (p-value < 2.2 × 10⁻¹⁶).
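As an illustration of what "controlled for genome size" means for a Kendall correlation, the sketch below applies the standard first-order partial correlation formula to pairwise Kendall coefficients. It is not the authors' code: the article does not state which software or tau variant was used, so this example assumes tau-a (no tie correction), and the function names and the toy data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant pairs) / total pairs."""
    n = len(x)
    score = 0
    for i, j in combinations(range(n), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        score += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant, 0 tie
    return score / (n * (n - 1) / 2)


def partial_kendall(x, y, z):
    """Kendall correlation of x and y with the control variable z partialled out."""
    t_xy = kendall_tau(x, y)
    t_xz = kendall_tau(x, z)
    t_yz = kendall_tau(y, z)
    denom = sqrt((1 - t_xz ** 2) * (1 - t_yz ** 2))
    if denom == 0:
        raise ValueError("x or y is perfectly rank-correlated with z")
    return (t_xy - t_xz * t_yz) / denom


# Hypothetical toy data: two index rankings and a control variable for five genomes.
index_a = [1, 2, 3, 4, 5]
index_b = [2, 1, 4, 3, 5]
control = [1, 3, 2, 5, 4]

print(kendall_tau(index_a, index_b))
print(partial_kendall(index_a, index_b, control))
```

With real data, ties between genomes are likely, in which case a tie-corrected variant such as tau-b would be the more usual choice.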
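Table 8. The indices of GFE of fictional data.
| Genome | ORFans | COG 1 | COG 2 | COG 3 | COG 4 | COG 5 | p.i. | Ave | Rank | mp |
|---|---|---|---|---|---|---|---|---|---|---|
| A | 10 | 1 | 1 | 1 | 1 | 16 | 0.2 | 4 | 1 | 1.0 |
| B | 8 | 1 | 1 | 2 | 2 | 4 | 0.6 | 2 | 2 | 0.3 |
| C | 20 | 1 | 1 | 1 | 6 | 6 | 0.4 | 3 | 3 | 0.7 |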
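The p.i. and Ave columns of Table 8 can be reproduced from the per-COG copy counts. The sketch below rests on two assumptions inferred from the tabulated values rather than stated in this section: that p.i. is the fraction of COGs with more than one copy, and that Ave is the mean number of copies per COG. The copy counts are those read from the five COG columns of the table, and the function names are illustrative.

```python
def paralog_index(copies):
    """Assumed definition: fraction of COGs present in more than one copy."""
    return sum(1 for c in copies if c > 1) / len(copies)


def average_copies(copies):
    """Assumed definition: mean number of gene copies per COG."""
    return sum(copies) / len(copies)


# Copy counts for COGs 1 to 5 of the three fictional genomes in Table 8.
genomes = {
    "A": [1, 1, 1, 1, 16],
    "B": [1, 1, 2, 2, 4],
    "C": [1, 1, 1, 6, 6],
}

for name, copies in genomes.items():
    print(name, paralog_index(copies), average_copies(copies))
```

The example shows why the indices can disagree: genome A has the lowest p.i. but the highest Ave, because a single large family dominates its average.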

Share and Cite

MDPI and ACS Style

Carmi, G.; Bolshoy, A. Gene-Family Extension Measures and Correlations. Life 2016, 6, 30. https://doi.org/10.3390/life6030030

