Article

GC Content in Nuclear-Encoded Genes and Effective Number of Codons (ENC) Are Positively Correlated in AT-Rich Species and Negatively Correlated in GC-Rich Species

C. S. Mott Center for Human Growth and Development, Institute for Environmental Health Sciences, Department of Obstetrics and Gynecology, Wayne State University, Detroit, MI 48201, USA
Genes 2025, 16(4), 432; https://doi.org/10.3390/genes16040432
Submission received: 26 February 2025 / Revised: 31 March 2025 / Accepted: 2 April 2025 / Published: 5 April 2025
(This article belongs to the Section Molecular Genetics and Genomics)

Abstract

Background/Objectives: Codon usage bias affects gene expression and translation efficiency across species. The effective number of codons (ENC) and GC content influence codon preference, often displaying unimodal or bimodal distributions. This study investigates the correlation between ENC and GC rankings across species and how their relationship affects codon usage distributions. Methods: I analyzed nuclear-encoded genes from 17 species representing six kingdoms: one bacterium (Escherichia coli), three fungi (Saccharomyces cerevisiae, Neurospora crassa, and Schizosaccharomyces pombe), one archaeon (Methanococcus aeolicus), three protists (Rickettsia hoogstraalii, Dictyostelium discoideum, and Plasmodium falciparum), three plants (Musa acuminata, Oryza sativa, and Arabidopsis thaliana), and six animals (Anopheles gambiae, Apis mellifera, Polistes canadensis, Mus musculus, Homo sapiens, and Takifugu rubripes). Genes in all 17 species were ranked by GC content and ENC, and correlations were assessed. I examined how adding or subtracting these rankings influenced their overall distribution in a new method that I call Two-Rank Order Normalization, or TRON. The equation is TRON = SUM(ABS((GC rank1:GC rankN) − (ENC rank1:ENC rankN)))/(N²/3), where (GC rank1:GC rankN) is a rank-order series of GC rank and (ENC rank1:ENC rankN) is a rank-order series of ENC rank, sorted by the rank-order series of GC rank. The denominator of TRON, N²/3, is the normalization factor because it is the expected value of the sum of the absolute values of GC rank − ENC rank for all genes if GC rank and ENC rank are not correlated. Results: First, ENC and GC rankings are positively correlated (i.e., ENC increases as GC increases) in AT-rich species such as honeybees (R² = 0.60, slope = 0.78) and wasps (R² = 0.52, slope = 0.72) and negatively correlated (i.e., ENC decreases as GC increases) in GC-rich species such as humans (R² = 0.38, slope = −0.61) and rice (R² = 0.59, slope = −0.77).
Second, the GC rank–ENC rank distributions change from unimodal to bimodal as GC content increases in the 17 species. Third, the GC rank+ENC rank distributions change from bimodal to unimodal as GC content increases in the 17 species. Fourth, the slopes of the correlations (GC versus ENC) in all 17 species are negatively correlated with TRON (R² = 0.98) (see Graphical Abstract). Conclusions: The correlation between ENC rank and GC rank differs among species, shaping codon usage distributions in opposite ways depending on whether a species’ nuclear-encoded genes are AT-rich or GC-rich. Understanding these patterns might provide insights into translation efficiency, epigenetics mediated by CpG DNA methylation, epitranscriptomics of RNA modifications, RNA secondary structures, evolutionary pressures, and potential applications in genetic engineering and biotechnology.

Graphical Abstract

1. Introduction

Codon usage bias, the non-random use of synonymous codons encoding the same amino acid, plays a crucial role in gene expression and genome evolution [1]. The effective number of codons (ENC) is a widely used measure of codon usage bias, ranging from 20 (where each amino acid in a gene is encoded by a single codon) to 61 (where all synonymous codons in a gene are used at least one time) [2,3]. Many factors influence codon bias, including mutational pressure, translational efficiency, and selection for optimal tRNA usage [4,5,6]. Similarly, GC content, defined as the proportion of guanine and cytosine nucleotides in each nuclear-encoded gene coding sequence, is an important genomic feature affecting gene regulation, DNA stability, and evolutionary adaptation [7,8,9,10,11]. DNA methylation at CpG sites, both in promoter and enhancer regions and in the coding region, can regulate transcription and alternative mRNA splicing [12,13,14,15,16,17,18,19]. Previous studies have shown that both ENC and GC content exhibit characteristic distributions across species, often following unimodal or bimodal patterns [20,21,22,23,24,25]. However, the relationship between ENC and GC content has been less extensively explored, and its evolutionary and functional implications remain unclear.
Studies on various organisms have reported mixed findings regarding the correlation between ENC and GC content. In some species, ENC and GC content are positively correlated, suggesting that increased GC content promotes codon diversity, possibly due to mutational biases or selection for GC-rich codons in highly expressed genes [26,27,28]. In contrast, other species exhibit a negative correlation between ENC and GC content, implying that highly biased codon usage occurs in GC-rich genes, potentially due to translational selection favoring specific tRNA pools [29,30]. These opposing trends raise fundamental questions about the evolutionary forces shaping codon usage and GC content across different taxa.
Two-Rank Order Normalization (TRON) is a new mathematical method I developed to compare lists of items with different types of characteristics to determine whether the characteristics are correlated. For example, if you have 100 people, you can rank them in terms of height and hair color and see whether these traits are correlated. One might hypothesize, for instance, that taller people tend to have lighter hair, and this can be tested by the TRON method. In this paper, I performed TRON calculations for two of the many characteristics of nuclear-encoded genes, namely GC content and ENC levels. The number of nuclear-encoded genes ranges from several thousand in prokaryotes to several tens of thousands in eukaryotes. One advantage of the TRON method over other correlation methods is that the TRON method provides insights into the two-dimensional relationships of the correlations, such as the bimodal GC and ENC distributions.
Characteristics of each gene, such as GC content or ENC levels, are available in tables such as those in the Codon Statistics Database [31]. Before I began these analyses, I predicted that, as GC content increases, ENC levels for these genes would decrease because there are fewer As and Ts available to make codons. As I describe in this paper, my prediction was true for species with high GC content, such as rice (Oryza sativa), mice (Mus musculus), and humans (Homo sapiens), but, surprisingly, the opposite is true for species with low GC (i.e., high AT) content, such as bees (Apis mellifera) and other species, where ENC levels increase as GC content increases.
In this study, I systematically analyze ENC and GC rank distributions across 17 species with a range of GC content in nuclear-encoded genes. I find that ENC rank and GC rank exhibit species-specific correlations, i.e., a positive correlation in bees and a negative correlation in rice. Moreover, I demonstrate, using the TRON method, that the mathematical interplay between these distributions leads to opposite outcomes—subtraction of ENC from GC (GC-ENC) rank distributions results in a unimodal distribution pattern in bees, and a bimodal distribution pattern in rice. Also, the addition of GC rank and ENC rank (GC+ENC) produces a bimodal distribution pattern in bees and a unimodal distribution pattern in rice. These findings provide new insights into the complex interactions between codon usage and genomic composition, with potential implications for understanding translation efficiency, genome evolution, and species-specific selective pressures.
My results contribute to the broader field of comparative genomics by elucidating how different species maintain distinct codon usage strategies and GC content distributions. I discuss possible evolutionary mechanisms driving these correlations, including mutational biases, selection for translational efficiency, and constraints imposed by tRNA availability. Finally, I explore the practical applications of understanding ENC and GC relationships in evolution, as well as in optimizing gene expression in synthetic biology and biotechnology [32,33,34,35].

2. Materials and Methods

All the GC (G or C content for a nuclear-encoded gene) and ENC (effective number of codons for a nuclear-encoded gene) data used in this paper are from the Codon Statistics Database (http://codonstatsdb.unr.edu, accessed on 15 March 2025) [31]. I performed all of the analyses in this paper in Excel™ (version 16.89.1).
To generate the rank order for ENC, the “gene stats” table for a particular species (e.g., the mouse, Mus musculus) was downloaded into an Excel™ file and sorted from low to high by ENC (lowest possible value, 20; highest possible value, 61), and the genes were numbered 1, 2, 3, …, N; this number is the gene's ENC rank (NENC). Genes with an NA (not applicable) ENC value, which sort to the end of the table, were deleted. These genes presumably do not encode all 20 amino acids, and they generally numbered no more than 10–50 genes.
To generate the rank order for GC while keeping the rank order for ENC, the entire Excel™ table (after removal of the ENC NA rows) was next sorted from low to high by GC content (lowest possible value, 0.00; highest possible value, 1.00), and the genes were numbered 1, 2, 3, …, N; this number is the gene's GC rank (NGC).
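The two sorting steps above can be sketched outside a spreadsheet. The following is a minimal Python illustration, not the workflow used in the paper: the gene names and values are made up, `rank_order` is a hypothetical helper, and breaking ties by original row order is an assumption about how tied values are handled.

```python
def rank_order(values):
    """Return 1-based ranks (low to high); ties keep their original row order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks

# Hypothetical toy table: (gene, GC content, ENC); None marks an NA ENC value.
genes = [
    ("geneA", 0.31, 38.2),
    ("geneB", 0.55, 51.0),
    ("geneC", 0.42, None),
    ("geneD", 0.28, 35.5),
    ("geneE", 0.61, 47.9),
]

kept = [g for g in genes if g[2] is not None]   # drop rows with NA ENC values
gc_rank = rank_order([g[1] for g in kept])      # GC rank for each kept gene
enc_rank = rank_order([g[2] for g in kept])     # ENC rank for each kept gene

for (name, gc, enc), r_gc, r_enc in zip(kept, gc_rank, enc_rank):
    print(name, gc, enc, r_gc, r_enc)
```

Each gene keeps both ranks on the same row, which is the property the later GC rank − ENC rank and GC rank + ENC rank columns rely on.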
To generate correlations between GC rank and ENC values, the entire Excel™ table was sorted by GC rank, and a scatter plot of the ENC values (y-axis) vs. GC rank (x-axis) was plotted in Excel™ with the “insert chart > scatter plot” function (Figure 1). Trendlines were added by right-clicking a point on the scatter plot and clicking “add trendline.” Next, “set intercept” was checked, the y-intercept was entered (=INTERCEPT(GC rank column, ENC rank column)), “display equation on chart” was checked, and “display R² value on chart” was checked.
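For readers who want to reproduce a trendline without a spreadsheet, the slope, intercept, and R² of a straight-line fit can be computed directly. The sketch below is an ordinary least-squares fit in plain Python. It is offered as an approximation of the trendline step, not as the exact Excel™ procedure, and the rank values are hypothetical.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit y = slope * x + intercept, plus R squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared

# Hypothetical toy ranks (not data from the paper).
gc_rank = [1, 2, 3, 4, 5, 6]
enc_rank = [2, 1, 4, 3, 6, 5]
slope, intercept, r2 = linear_fit(gc_rank, enc_rank)
print(round(slope, 3), round(intercept, 3), round(r2, 3))
```

Because both axes are ranks running from 1 to N, a slope near +1 or −1 corresponds to a strong positive or negative rank correlation, and a slope near 0 corresponds to little or no correlation.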
To generate the GC-ENC and GC+ENC histograms, the GC rank − ENC rank (NGC − NENC) and GC rank + ENC rank (NGC + NENC) columns were selected for each species and plotted in Excel™ with the “insert chart > histogram” function.
To perform Two-Rank Order Normalization (TRON) analyses, I used the equation TRON = SUM(ABS((A1:AN) − (B1:BN)))/SUM(ABS((A1:AN) − (R1:RN))), where A1:AN is a rank-order series of one trait (e.g., GC) and B1:BN is a rank-order series of a second trait (e.g., ENC), sorted by the rank-order series of the first trait. For example, if a gene has the lowest GC rank (A = 1) and the 10th lowest ENC rank (B = 10), then the first number in the ((A1:AN) − (B1:BN)) series is −9 (i.e., A − B = −9). The series R1:RN is a randomization of the ranks of the items (e.g., genes), generated with the equation SORTBY(A1:AN, RANDARRAY(N)). Because SORTBY reorders the existing series according to the random sort keys generated by RANDARRAY, each rank appears exactly once in R1:RN. This differs from drawing an independent random number within a range (e.g., 1–1000) for each row, which could repeat some ranks and omit others.
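As an illustration of the TRON ratio, here is a minimal Python sketch. The function names `tron` and `tron_closed_form` are hypothetical, and averaging the denominator over 100 shuffles is an assumption based on the permutation analysis described in Section 3.2 rather than a confirmed detail of the original spreadsheet. `random.shuffle` stands in for SORTBY with RANDARRAY because both produce a random permutation in which each rank appears exactly once.

```python
import random

def tron(rank_a, rank_b, permutations=100, seed=0):
    """TRON with a permutation-based denominator.

    Numerator: sum of |A_i - B_i| over all items.
    Denominator: mean, over random shuffles R of A, of the sum of |A_i - R_i|.
    """
    rng = random.Random(seed)            # fixed seed so the sketch is repeatable
    numerator = sum(abs(a - b) for a, b in zip(rank_a, rank_b))
    shuffled = list(rank_a)
    total = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)            # random permutation, no repeats
        total += sum(abs(a - r) for a, r in zip(rank_a, shuffled))
    return numerator / (total / permutations)

def tron_closed_form(rank_a, rank_b):
    """TRON using the N^2/3 normalization given in the abstract."""
    n = len(rank_a)
    numerator = sum(abs(a - b) for a, b in zip(rank_a, rank_b))
    return numerator / (n * n / 3)

n = 1000
a = list(range(1, n + 1))
identical = tron_closed_form(a, a)        # ranks agree perfectly
opposite = tron_closed_form(a, a[::-1])   # ranks are exactly reversed
print(identical, opposite, tron(a, a[::-1]))
```

Under these definitions the ratio runs from 0 (identical rank orders) through about 1 (unrelated rank orders) to about 1.5 (exactly reversed rank orders), since the largest possible numerator is N²/2.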

3. Results

3.1. Description of GC and ENC Histograms in Bees (Apis mellifera), Rice (Oryza sativa), and Yeast (Saccharomyces cerevisiae)

GC histograms of nuclear-encoded genes are shown for bees (blue), rice (red), and yeast (green) (Figure 1a). They were plotted overlapping on the same graph with Adobe Photoshop™, using 100% opacity for layer 1 (bee) and 50% opacity for layers 2 (rice) and 3 (yeast). These histograms demonstrate a wide range of GC content, from very low GC in bees (the lowest is 20% GC) to very high GC in rice (the highest is 81%). Also, both bees and rice, as well as many other insects and plants, have bimodal GC peaks, as was demonstrated earlier [20,21,22,23,24,25] (see also Figure 2 and Figure 3). Yeast has a unimodal GC peak approximately in the center of the histogram (Figure 1a, green). Including bees, rice, and yeast, I analyzed a total of 17 species for this paper (Figure 2, Figure 3, Figure 4 and Figure 5).
Figure 1. Bees (Apis mellifera), rice (Oryza sativa), and yeast (Saccharomyces cerevisiae) have different patterns of GC content and ENC levels for nuclear-encoded genes. Figure 2d–f shows the three histograms separately for clarity. (a) Bees (blue), rice (red), and yeast (green) GC content (0.00 to 1.00) histograms of all nuclear-encoded genes (x-axis) are plotted against the number of genes with that range of GC content (y-axis). (b) Bees, rice, and yeast ENC level (20–61) histograms of all nuclear-encoded genes (x-axis) are plotted against the number of genes with that range of ENC levels (y-axis). The overlaps of the histograms are shown in different shades, as indicated.
Figure 2. Correlations between GC content and ENC level for nuclear-encoded genes in bees (Apis mellifera, first column), rice (Oryza sativa, second column), and yeast (Saccharomyces cerevisiae, third column). (a–c) Correlations between ENC ranks (y-axis) and GC ranks (x-axis) for bees, rice, and yeast. ENC rank was determined by sorting all columns based on ENC levels (20–61) and then numbering the rows 1–N, where N is the number of genes in that species. GC rank was determined by sorting all columns based on GC levels (0.00–1.00) and then numbering the rows 1–N. When GC levels are sorted and all columns are selected, the original ranks of the ENC levels are maintained. Correlations between ENC levels and GC levels were plotted by selecting the ENC ranked column and making a scatter plot (shown in blue). Trend lines were made by right-clicking (control-clicking) a point on the graph and selecting TRENDLINES (red arrows). Under TRENDLINES, the boxes were selected for set intercept (=INTERCEPT(GCrank:ENCrank)), display equation on chart, and display R-squared value on chart (shown). Notice that bees have a positive correlation, rice has a negative correlation, and yeast has no correlation between ENC rank and GC rank (red arrows). (d–f) GC histograms for bees, rice, and yeast. The GC contents (0–1.00) for all nuclear-encoded genes are on the x-axis and the number of genes with that range of GC values is on the y-axis. Histograms were made by selecting the GC column and selecting histogram chart under the INSERT tab. Notice that bees and rice have bimodal distributions of GC content and yeast has a unimodal distribution. (g–i) ENC histograms for bees, rice, and yeast. The ENC levels (20–61) for all nuclear-encoded genes are on the x-axis and the number of genes with that range of ENC levels is on the y-axis. Notice that bees and rice have bimodal distributions of ENC and yeast has a unimodal distribution. (j–l) GC rank minus ENC rank histograms for bees, rice, and yeast. Notice that GC rank minus ENC rank (GC-ENC) is unimodal in bees and bimodal in rice. (m–o) GC rank plus ENC rank histograms for bees, rice, and yeast. Notice that GC+ENC is bimodal in bees and unimodal in rice. This is the opposite of the pattern in (j–l).
Figure 3. GC and ENC analyses of species with negative correlations between GC rank and ENC rank: mosquito (Anopheles gambiae), pufferfish (Takifugu rubripes), human (Homo sapiens), bread mold (Neurospora crassa), banana (Musa acuminata), and mouse (Mus musculus). (a) Mosquito ENC rank (y-axis) versus GC rank (x-axis) shows a negative correlation. The x-axis is 1 to 12,402 for the rank order of the 12,402 nuclear-encoded mosquito genes based on GC content (0.00 to 1.00). The y-axis is 1 to 12,402 for the rank order of genes based on ENC levels, sorted on GC rank (see Figure 2). (b) Mosquito histogram of GC content (0 to 1.00) versus the number of genes (N) that fall within the indicated range of GC content. (c) Mosquito histogram of ENC levels (20 to 61) versus the number of genes (N) that fall within the indicated range of ENC levels. (d) Mosquito histogram of GC rank − ENC rank versus the number of genes (N) that fall within the indicated range of GC rank − ENC rank. The x-axis is −12,402 to +12,402. (e) Mosquito histogram of GC rank + ENC rank versus the number of genes (N) that fall within the indicated range of GC rank + ENC rank. The x-axis is 1 to 2 × 12,402, which is two times the number of nuclear-encoded genes in mosquitoes. (f–j) Pufferfish analyses (as described in (a–e)) for the 22,104 nuclear-encoded genes in this species. (k–o) Human analyses (as described in (a–e)) for the 19,708 nuclear-encoded genes in this species. (p–t) Bread mold analyses (as described in (a–e)) for the 9728 nuclear-encoded genes in this species. (u–y) Banana analyses (as described in (a–e)) for the 30,700 nuclear-encoded genes in this species. (z–dd) Mouse analyses (as described in (a–e)) for the 22,405 nuclear-encoded genes in this species.
Figure 4. GC and ENC analyses of species with positive correlations between GC rank and ENC rank: wasp (Polistes canadensis), rickettsia (Rickettsia hoogstraalii), slime mold (Dictyostelium discoideum), arabidopsis (Arabidopsis thaliana), and plasmodium (Plasmodium falciparum). (a–e) Wasp analyses (as described in Figure 3) for the 9854 nuclear-encoded genes in this species. (f–j) Rickettsia analyses (as described in Figure 3) for the 1663 nuclear-encoded genes in this species. (k–o) Slime mold analyses (as described in Figure 3) for the 13,078 nuclear-encoded genes in this species. (p–u) Arabidopsis analyses (as described in Figure 3) for the 10,160 nuclear-encoded genes in this species. (v–y) Plasmodium analyses (as described in Figure 3) for the 5321 nuclear-encoded genes in this species.
Figure 5. GC and ENC analyses of species with little or no correlation between GC rank and ENC rank: E. coli (Escherichia coli), pombe (Schizosaccharomyces pombe), and methanobacteria (Methanococcus aeolicus). (a–e) E. coli analyses (as described in Figure 3) for the 10,276 nuclear-encoded genes in this species. (f–j) Pombe analyses (as described in Figure 3) for the 5110 nuclear-encoded genes in this species. (k–o) Methanobacteria analyses (as described in Figure 3) for the 1485 nuclear-encoded genes in this species.
ENC histograms for nuclear-encoded genes are shown overlapping for bees (blue), rice (red), and yeast (green) (Figure 1b). As with the GC histograms, the ENC histograms are bimodal for both bees and rice and unimodal in yeast. ENC levels vary from very low (the lowest is 20, which means that each amino acid in that gene has only one codon) to very high (the highest is 61, the number of codons that encode amino acids; the remaining three codons are stop codons).

3.2. Description of Two-Rank Order Normalization (TRON) Mathematics

To describe this method, I will start with some definitions (Figure 6). Line A (number series 1, 2, …, 1000) and Line B (number series 1000, 999, …, 1) are graphed (blue and red lines, respectively, Figure 6a). The sum of A and the sum of B can be determined with the “Nth triangle number formula” (N(N + 1)/2) (Figure 6a). This equation is so named because of the triangular shape that is made [36]. This equation is also called “Gauss’ trick” because, according to lore, in the 1780s, a German schoolteacher assigned his class the task of summing the first 100 integers, expecting it to take a while. However, young Carl Friedrich Gauss quickly found the answer, 5050, by recognizing a pattern: pairing numbers from opposite ends of the sequence (1 + 100, 2 + 99, etc.), each summing to 101, with 50 such pairs yielding 50 × 101. This insight led to the general formula for summing consecutive numbers: N(N + 1)/2 [36].
Line A+B (number series 1001, 1001, …, 1001) and Line A-B (number series −999, −997, …, 997, 999) are graphed (blue and red lines, respectively) (Figure 6b). The sum of A+B is N(N + 1) (i.e., 2 × N(N + 1)/2), and the sum of the absolute values of A-B, computed with the Excel™ function SUM(ABS(series A − series B)), is 500,000, which is exactly N²/2 (Figure 6b).
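The sums quoted for Line A and Line B are easy to confirm numerically. The short Python check below is only an illustration (the variable names are mine, not the paper's), using N = 1000 as in the text.

```python
n = 1000
line_a = list(range(1, n + 1))     # 1, 2, ..., 1000
line_b = list(range(n, 0, -1))     # 1000, 999, ..., 1

sum_a = sum(line_a)                                   # N(N + 1)/2
sum_b = sum(line_b)                                   # same numbers, reversed
sum_a_plus_b = sum(a + b for a, b in zip(line_a, line_b))
sum_abs_a_minus_b = sum(abs(a - b) for a, b in zip(line_a, line_b))

print(sum_a, sum_b, sum_a_plus_b, sum_abs_a_minus_b)
```

The last value is the largest sum of absolute rank differences that two rank orders of length N can produce, which is why it serves as the upper bound in the later discussion.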
Next, I generated a randomized list of Line A (Random) in Excel™ with the SORTBY and RANDARRAY functions (i.e., SORTBY(A1:A1000, RANDARRAY(1000))). A+Random shows a triangular distribution (Figure 6c). Similarly, A-Random, graphed with the histogram chart, also shows a distribution that resembles a triangular distribution (Figure 6d). Because SORTBY reorders the existing column according to the random sort keys generated by RANDARRAY, each number in the column is used only once; by contrast, drawing an independent random number in the range for each row could repeat some numbers and omit others (Excel™ tutorials).
This was repeated with Line A’ (number series 1, 2, …, 10,000), and Random’ was made with all 10,000 numbers represented once in the column. A’+Random’ shows a triangular distribution (Figure 6e). Similarly, A’-Random’ shows a triangular-like distribution that is much more clearly triangle-shaped than A-Random (1–1000) (Figure 6f).
To calculate the sums of A+Random and A’+Random’, I added the numbers in each column. Each column alone (Line A or Random) sums to 500,500 for N = 1000 and to 50,005,000 for N = 10,000, so A+Random = 1,001,000 and A’+Random’ = 100,010,000. In my example, it does not matter whether the numbers are in order or randomized when they are summed because each number is present only once after randomization with SORTBY and RANDARRAY, and addition does not lead to the loss of any numbers. The solution is twice the Nth triangular number formula (i.e., N(N + 1)) because the series 1, 2, …, N and the random array are added together in A+Random, thus doubling the sum. In summary, SUM(A + R) = N(N + 1), exactly, in all cases (Figure 6e).
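The claim that the total of A+Random does not depend on the shuffle can be checked with a few random permutations. This is a minimal Python sketch under the assumption that `random.shuffle` is an acceptable stand-in for SORTBY with RANDARRAY; `shuffled_copy` is a hypothetical helper name.

```python
import random

def shuffled_copy(values, rng):
    """Random permutation of values: every element appears exactly once."""
    out = list(values)
    rng.shuffle(out)
    return out

rng = random.Random(42)            # fixed seed so the sketch is repeatable
n = 1000
line_a = list(range(1, n + 1))

sums = []
for _ in range(5):
    r = shuffled_copy(line_a, rng)
    sums.append(sum(a + b for a, b in zip(line_a, r)))

print(sums)                        # each entry equals N(N + 1)
```

Because the total is fixed at N(N + 1) for every permutation, a ratio built on sums of A+Random carries no information about correlation, which is consistent with the later decision to normalize only the difference of ranks.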
To calculate the areas of A-Random and A’-Random’, I summed the numbers in each column, using ABS to generate absolute values because approximately half of the numbers are negative. I found that A-Random = 326,926 and A’-Random’ = 33,249,974. These values are approximately N²/3, which would be 333,333 and 33,333,333 for A-Random and A’-Random’, respectively (Figure 6f). This makes logical sense because the sum for A-Random can range from 0 (Line A − Line A, where both are the series 1, 2, …, 1000) to 500,000 (i.e., N²/2). The average of one hundred permutations of A-Random (i.e., (A-R1 + A-R2 + … + A-R100)/100) is approximately N²/3 (Figure 6g,h). I derived a mathematical proof indicating that E[S], the expected value of the sum S, is exactly N(N + 1)(2N − 2)/(6N), which simplifies to (N² − 1)/3. For large N, this is approximately N²/3, as I estimated with my permutation analysis.
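The proof itself is not reproduced in this section. The following is one possible derivation of the stated expectation, assuming that R is a uniformly random permutation of 1, …, N, so that each R_i is equally likely to be any of the N ranks. The symbols S, R_i, and d are introduced here for illustration.

```latex
\begin{aligned}
E[S] &= \sum_{i=1}^{N} E\,\lvert i - R_i \rvert
      = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} \lvert i - j \rvert \\
     &= \frac{2}{N} \sum_{d=1}^{N-1} d\,(N - d)
      = \frac{2}{N} \cdot \frac{(N-1)\,N\,(N+1)}{6} \\
     &= \frac{N^{2} - 1}{3}
      = \frac{N(N+1)(2N-2)}{6N}
      \;\approx\; \frac{N^{2}}{3} \quad \text{for large } N .
\end{aligned}
```

The third step counts the 2(N − d) ordered pairs (i, j) whose ranks differ by exactly d. For N = 1000 the expression gives 333,333, and for N = 10,000 it gives 33,333,333, in line with the single-shuffle values of 326,926 and 33,249,974 reported above.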

3.3. Comparisons Between GC and ENC Distributions

1. GC and ENC distributions are positively correlated in bees (Apis mellifera) and negatively correlated in rice (Oryza sativa). There is little or no correlation between GC and ENC in yeast (Saccharomyces cerevisiae).
Bees display a positive correlation between GC rank and ENC rank, meaning that as GC rank increases, ENC rank also increases (Figure 2a, red arrow; R² = 0.60, Table 1). This indicates that genes with higher GC content tend to use a broader range of codons, while genes with lower GC content use a more restricted set. This might be partly explained by the lower average GC content in bees compared to rice, especially in bee GC peak 1 (Figure 2d).
In contrast to bees, rice, which has a high GC content in nuclear-encoded genes, exhibits a negative correlation between GC rank and the effective number of codons (ENC), meaning that as GC content increases, ENC decreases (Figure 2b; red arrow). This finding suggests that genes with higher GC rank tend to use a more restricted set of codons (lower ENC), possibly due to selection for specific tRNA availability or translational efficiency. Conversely, genes with lower GC rank use a more diverse range of codons (higher ENC), potentially indicating relaxed selection on codon usage bias. The positive correlation between GC and ENC in bees suggests a different evolutionary strategy in bees compared to rice, possibly reflecting differences in genome organization, translational selection, or adaptation to distinct environmental pressures.
There is no correlation between GC rank and ENC rank in yeast (Figure 2c). Also, yeast has unimodal GC and ENC distributions (Figure 2f,i). Yeast also has unimodal GC-ENC rank and GC+ENC rank distributions (Figure 2l,o).
2. Rice (Oryza sativa) has bimodal GC and ENC distributions, GC-ENC is a bimodal distribution, and GC+ENC is a unimodal distribution (2 − 2 = 2; 2 + 2 = 1).
In rice, both GC content and the effective number of codons (ENC) follow a bimodal distribution (Figure 2e,h). However, rice GC+ENC is a unimodal distribution (Figure 2n). If GC rank and ENC rank were not correlated, a unimodal distribution would be expected for GC rank + ENC rank based on the Two-Rank Order Normalization (TRON) method described above. However, I found that the sum of GC-ENC (i.e., SUM(ABS(GC-ENC))) in rice is 48% larger than the expected value, N²/3; that is, SUM(ABS(GC-ENC))/(N²/3) = 1.48 (Figure 6e; Table 1, second to last column). In other words, even though the rice GC+ENC histogram resembles a triangular distribution (i.e., unimodal), the sum of the absolute GC-ENC values for the 28,571 nuclear-encoded genes in rice is 1.48 times the value expected if GC ranks and ENC ranks were randomly associated (Table 1, second to last column). I interpret this as further evidence that GC ranks and ENC ranks are highly correlated.
Rice GC+ENC is a unimodal distribution. However, SUM(GC+ENC) = SUM(A + R) = N(N + 1), so SUM(GC + ENC)/SUM(A + R) = N(N + 1)/N(N + 1) = 1.00 for all 17 species (Table 1, last column). Because of this limitation, Two-Rank Order Normalization (TRON) can only be done when subtracting two series of ranks, such as GC-ENC, and not when adding two series of ranks, such as GC+ENC. Therefore, in this paper, I focus only on the normalization of SUM(ABS(GC-ENC)) to SUM(ABS(A-R)) ≈ N²/3 (Table 1, second to last column).
As a shortcut, I refer to rice having a (2 − 2 = 2; 2 + 2 = 1) pattern, which means that the GC distribution is bimodal and the ENC distribution is bimodal, GC-ENC is bimodal, and GC+ENC is unimodal. One possible explanation is that different functional classes of genes contribute differently to the ENC and GC distributions. While their individual distributions appear continuous, the way they interact—such as certain genes having disproportionately high or low ENC relative to their GC content—creates two separate peaks when subtracted. This may reflect distinct evolutionary pressures acting on different subsets of genes, such as differences in selection for codon usage bias, gene expression levels, or functional constraints.
3. The bee (Apis mellifera) has bimodal GC and ENC distributions; GC-ENC has a unimodal distribution and GC+ENC forms a bimodal distribution (2 − 2 = 1; 2 + 2 = 2).
In bees, both GC content and the effective number of codons (ENC) exhibit bimodal distributions, meaning that each measure clusters into two distinct peaks rather than forming a single continuous distribution (Figure 2m). As expected for random distributions, GC-ENC forms a unimodal peak. However, unexpectedly, GC+ENC forms a bimodal distribution (Figure 3m). Given that both GC and ENC are bimodal, one might expect that adding ENC and GC (GC+ENC) would result in a unimodal distribution. This is because A+Random forms a unimodal triangle distribution (Figure 6f). This result (i.e., GC+ENC is bimodal) suggests that there are underlying relationships between these two variables (GC and ENC) that maintain a bimodal distribution when added.
As a shortcut, I refer to bees having a (2 − 2 = 1; 2 + 2 = 2) pattern, which means that the GC distribution is bimodal and the ENC distribution is bimodal, and GC-ENC is unimodal and GC+ENC is bimodal. The 2 − 2 = 1 pattern could indicate a balancing effect of evolutionary pressures, such as codon usage adaptation and GC content constraints, leading to a more uniform distribution when considering their difference. Such a pattern might reflect underlying genomic organization or selection pressures that maintain a more stable relationship between GC content and codon usage across the genome.
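The triangle-shaped baseline for independent ranks can be verified by exact enumeration. The following is a minimal sketch with a toy N = 6 (an assumption chosen for readability, not a value from the paper): it tallies every possible pairing of GC rank and ENC rank, then contrasts that baseline with the extreme case in which ENC rank equals GC rank for every gene.

```python
from collections import Counter

n = 6  # small illustrative N; the species in the paper have thousands of genes
ranks = range(1, n + 1)

# Independent ranks: every (GC rank, ENC rank) pairing is equally likely,
# so tally all N*N pairings.
sum_counts = Counter(g + e for g in ranks for e in ranks)
diff_counts = Counter(g - e for g in ranks for e in ranks)

# Perfect positive rank correlation: ENC rank equals GC rank for every gene.
sum_counts_corr = Counter(g + g for g in ranks)
diff_counts_corr = Counter(g - g for g in ranks)

print([sum_counts[s] for s in range(2, 2 * n + 1)])   # triangle peaked at N + 1
print([diff_counts[d] for d in range(-(n - 1), n)])   # triangle peaked at 0
print(sorted(sum_counts_corr.items()))                # flat: one gene per even sum
print(dict(diff_counts_corr))                         # every difference is 0
```

In this toy case, independence yields a single-peaked triangle for both the sum and the difference, whereas a perfect positive correlation collapses the difference to zero and flattens the sum. A sum histogram that departs from the triangle is therefore a sign that the two rank series are not independent, although this sketch does not by itself reproduce the bimodal shape seen in bees.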
4.
I repeated the analyses described above for bees, rice, and yeast with 14 other species. Six additional species have a positive correlation between GC and ENC, as I found with bees (Figure 3). Five additional species have a negative correlation between GC and ENC, as I found with rice (Figure 4). Three additional species have little or no correlation between GC and ENC, as I found with yeast (Figure 5).
5.
Two-Rank Order Normalization (TRON) was plotted for the 17 species analyzed in this study (SUM(ABS(GC-ENC))/(N²/3)). Table 1 is a summary of several of the variables that were used to make Figure 7. I compared all the variables with each other and highlighted the strongest correlations in Figure 7. I found a strong inverse correlation between the TRON score and the slope of the GC versus ENC rank correlation across the 17 species (Figure 7a; R² = 0.98). I also found a strong second-order parabolic (ax² − bx) correlation between R² and the slope of the GC versus ENC correlations (Figure 7b; R² = 0.99).
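The near-parabolic relationship between R² and slope may follow from a general property of rank regression: when both variables are tie-free ranks 1, …, N, they have the same variance, so the least-squares slope equals the correlation coefficient and R² equals the slope squared. The sketch below illustrates this with simulated data; the helper names and the simulated GC and ENC-like values are hypothetical and are not the paper's data.

```python
import random


def to_ranks(values):
    """Rank 1..N by ascending value (ties broken by input order)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def rank_regression(x_rank, y_rank):
    """Least-squares slope and R^2 for y_rank regressed on x_rank."""
    n = len(x_rank)
    mean_x = sum(x_rank) / n
    mean_y = sum(y_rank) / n
    sxx = sum((x - mean_x) ** 2 for x in x_rank)
    syy = sum((y - mean_y) ** 2 for y in y_rank)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_rank, y_rank))
    return sxy / sxx, (sxy * sxy) / (sxx * syy)


rng = random.Random(1)  # fixed seed, chosen only for reproducibility
n = 2000
gc = [rng.random() for _ in range(n)]
# Hypothetical ENC-like values: negatively related to gc, plus noise.
enc = [-g + rng.gauss(0, 0.3) for g in gc]

slope, r_squared = rank_regression(to_ranks(gc), to_ranks(enc))
print(round(slope, 3), round(r_squared, 3), round(slope ** 2, 3))
```

The values reported in the abstract are consistent with this identity: for honeybees, 0.78² ≈ 0.61 against a reported R² of 0.60, and for rice, (−0.77)² ≈ 0.59 against a reported R² of 0.59.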
I found a weak correlation between the GC content at peak 1 and the number of nuclear-encoded genes in the 17 species (Figure 7c; R² = 0.24) and a stronger correlation between the GC content at peak 2 and the number of nuclear-encoded genes (Figure 7d; R² = 0.60). Since two peaks can be mostly overlapping, such as the GC peaks in mice, I set peak 1 = peak 2 when a species had a unimodal GC distribution (Table 1). This is an interesting result, but I consider it tentative because of the limited number of species that I analyzed for this paper (17). This result could be influenced by selection bias, for instance, because I purposely chose species with known bimodal GC content distributions. Further analyses with a much larger number of species are needed to validate this result.

4. Discussion

4.1. Interpretation of GC and ENC Distributions Across Species

My findings reveal distinct patterns in the relationships between GC content and the effective number of codons (ENC) across different species, demonstrating unexpected correlations that challenge prior assumptions. In rice, both GC content and ENC exhibit bimodal distributions, and their difference (GC-ENC) forms a bimodal distribution (2 − 2 = 2; Figure 2e). Conversely, in bees, both GC and ENC display bimodal distributions, but their difference (GC-ENC) results in a unimodal distribution (2 − 2 = 1; Figure 2d). Similarly, in rice, both GC and ENC are bimodal, but their sum (GC+ENC) forms a unimodal distribution (2 + 2 = 1; Figure 2h). However, in mice, GC and ENC are bimodal, but their sum (GC+ENC) forms a bimodal distribution (2 + 2 = 2). These contrasting patterns suggest that GC content and codon usage bias interact differently across species, potentially reflecting underlying biological, evolutionary, and functional constraints.

4.2. Correlation Between GC Content and ENC Across Species

A critical insight from my study is that the correlation between GC content and ENC differs among species (Figure 3, Figure 4, Figure 5 and Figure 6). In mice and rice, these two parameters show a negative correlation, whereas in bees and wasps, they are positively correlated. This discrepancy indicates fundamental differences in codon usage optimization strategies across species. A negative correlation, as seen in mice and rice, suggests that genes with high GC content tend to use a more limited subset of codons, possibly reflecting strong selection for translational efficiency. In contrast, the positive correlation in bees implies that genes with high GC content use a broader range of codons, potentially indicating different selection pressures on codon usage.

4.3. Evolutionary and Functional Implications

The observed differences in GC and ENC distributions, as well as their correlation patterns, may be driven by multiple evolutionary forces. In insects such as bees, codon usage bias is strongly influenced by selection for translational efficiency, tRNA availability, and mutational biases [37,38,39,40]. The bimodal GC distribution in bees indicates that they maintain two distinct clusters of genes with differing GC content. The lower GC peak in bees, which falls below the typical GC content of most species, may indicate different selective pressures acting on subsets of the bee genome.
Bees exhibit distinct differences between their two GC peaks. The low GC peak, which can also be represented as “observed over expected” (o/e) [12], contains higher levels of DNA methylation at CpG sites compared to the high GC peak. Notably, “CpG” includes the “p” to distinguish it from GC content, which represents the proportion of guanine and cytosine in a nuclear-encoded gene. This finding is paradoxical because, despite having fewer CpG sites, the low GC peak exhibits greater DNA methylation than the high GC peak. This pattern is reminiscent of CpG islands in mammals, which are typically characterized by high GC content but relatively low levels of CpG DNA methylation (reviewed in [41,42,43]).
I observed a strong enrichment of transcription activators and HOX genes among the top 200 genes in the high GC peak (FDR = 10⁻³⁸ for GO:0000981~DNA-binding transcription factor activity, RNA polymerase II-specific; FDR = 10⁻²⁵ for IPR001356:Homeobox domain). My findings here, and in my previous publication on bee genomics, suggest that the transcriptional activation and/or alternative mRNA splicing of these genes are likely influenced by differential DNA methylation and hydroxymethylation at CpG sites in bees [12].
In plants like rice, the bimodal GC and ENC distributions reflect distinct classes of genes, potentially corresponding to different functional categories. The GC bimodal distribution was first discovered in plants in the 1980s and called “isochores” [44,45,46]. ENC values have also been shown to have bimodal distributions. For example, when ENC (effective number of codons) was plotted against the expected ENC based on GC3 content, the plant Magnolia lotungensis, an extremely endangered endemic tree in China, was found to have a bimodal distribution [47].

4.4. Implications of GC-ENC and GC+ENC Transformations

The unexpected transformation of bimodal distributions into unimodal ones (or vice versa) when GC and ENC are combined suggests that codon usage and GC content exhibit complex, non-independent relationships. In bees, where GC-ENC forms a unimodal distribution despite both components being bimodal, the subtraction operation may reveal an underlying functional relationship that aligns codon usage patterns more uniformly (Figure 3g). This could be due to selection pressures balancing codon usage bias across different genomic regions.
In mice, the unimodal GC and ENC distributions leading to a bimodal GC-ENC distribution suggest that codon usage efficiency varies among gene subsets, potentially reflecting translational optimization (Figure 4, bottom row). The emergence of a bimodal GC-ENC pattern implies that two distinct gene groups exist, possibly corresponding to highly expressed and moderately expressed genes with differing selection pressures.
For rice, the transformation of bimodal GC and ENC distributions into a unimodal GC+ENC distribution suggests that genes with different GC content and codon usage bias contribute to a broad, continuous range when considered together (Figure 2b). This may indicate that codon usage adaptation is influenced by multiple interacting factors, such as transcriptional regulation, GC-biased gene conversion, and evolutionary constraints.

4.5. Experimental Validation of the Possible Importance of the Correlation Between GC Content and ENC Levels

I speculate that the correlations between GC content and ENC levels have important functional consequences that can be experimentally validated. One could test, for instance, whether transgenes in bees are more highly transcribed and/or translated when the GC content is high and the ENC levels are high, versus when GC content is high and ENC levels are low. I predict that ENC-high/GC-high genes would be more highly expressed in bees, which have a positive correlation between ENC and GC, and more poorly expressed in rice, which has a negative correlation between ENC and GC. Conversely, I predict that ENC-high/GC-low genes would be more highly expressed in rice, which has a strong negative correlation between ENC and GC. Experimental validation can be done with reporter genes, such as luciferase or GFP, or with genes that are important in biotechnology, such as Cas9 and its various derivatives. However, such experiments would be difficult because one would need to control for possible RNA secondary structures and RNA binding protein binding sites.

4.6. Further Uses for the Two-Rank Order Normalization (TRON) Approach

I developed the TRON mathematical approach to systematically and efficiently compare the correlations between GC content and ENC levels. These two metrics cannot be directly compared because they use different scales: GC content ranges from 0.00 to 1.00, while ENC ranges from 20 to 61 and represents the effective number of codons used in a gene. Although GC content appears normalized due to its 0 to 1 range, in reality it typically falls between approximately 0.20 and 0.80 because some amino acids, such as methionine (ATG), always contain a G in their codon. ENC, on the other hand, can be rescaled by dividing by 61, creating a normalized range between approximately 0.33 and 1.00. While these normalization techniques yield results similar to TRON, TRON is a more straightforward and versatile method for comparing multiple variables.
Beyond GC content and ENC levels, TRON can be extended to incorporate additional gene characteristics, such as RNA secondary structure rank, GC-clamp rank [48,49], and RNA pseudoknot stability rank [50,51]. Any RNA-related feature can be integrated into TRON, making it highly adaptable. As the field of epitranscriptomics advances and features like m6A modification levels in mRNAs become better characterized, these can also be included in high-dimensional TRON analyses.
In an extended TRON framework, multiple characteristics—such as GC content, ENC levels, and m6A modifications—could be analyzed pairwise or plotted in three-dimensional space. This approach has the potential to reveal previously hidden correlations and clustering patterns among genes, offering new insights into the relationships between nucleotide composition, codon usage bias, and RNA modifications.
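A pairwise extension of TRON could look like the following minimal sketch. Everything in it is illustrative: the feature values are simulated, the "m6A" column is a hypothetical per-gene score generated independently of the other two, and the helper names are assumptions rather than part of the published method. Each feature is converted to ranks, and TRON is then computed for every pair of features.

```python
import random
from itertools import combinations


def to_ranks(values):
    """Rank 1..N by ascending value (ties broken by input order)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def tron(rank_a, rank_b):
    """SUM(ABS(rank_a - rank_b)) / (N^2 / 3) for two paired rank lists."""
    n = len(rank_a)
    return sum(abs(a - b) for a, b in zip(rank_a, rank_b)) / (n * n / 3)


rng = random.Random(2)  # fixed seed, chosen only for reproducibility
n = 2000
gc = [rng.random() for _ in range(n)]
features = {
    "GC": gc,
    # Hypothetical ENC-like values, positively related to GC plus noise.
    "ENC": [g + rng.gauss(0, 0.3) for g in gc],
    # Hypothetical m6A score, simulated as unrelated to GC and ENC.
    "m6A": [rng.random() for _ in range(n)],
}

ranked = {name: to_ranks(values) for name, values in features.items()}
pairwise = {
    (a, b): tron(ranked[a], ranked[b]) for a, b in combinations(ranked, 2)
}

for pair, score in pairwise.items():
    print(pair, round(score, 2))
```

Because ranks remove the original units, features measured on very different scales can be compared in the same table; in this simulated example the GC and ENC pair scores below 1, while pairs involving the unrelated m6A score fall near 1.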

4.7. Broader Implications and Future Directions

My findings provide novel insights into the interplay between GC content and codon usage across species, revealing unexpected distributional transformations and correlation patterns. These results highlight the complexity of genome evolution and suggest that selection for codon usage optimization varies widely among taxa. Future research should investigate the mechanistic basis of these patterns, including the role of tRNA abundance, translation efficiency, and GC-biased gene conversion.
Further studies should also explore additional species to determine whether these trends hold across broader evolutionary lineages. Comparative genomics approaches integrating transcriptome and proteome data could provide deeper insights into how GC content and codon usage influence gene expression and translational efficiency. Moreover, experimental validation using synthetic gene constructs with varying GC content and codon usage biases could help elucidate the functional consequences of these patterns.
One consideration that is not discussed in this paper is the role of RNA secondary and tertiary structures, which are predicted to increase as GC content increases, given the higher melting temperature of GC base pairs compared with AT base pairs. RNA secondary structure is important in the design of RNA viruses, for instance, because double-stranded RNA is more stable than single-stranded RNA. Also, the double-stranded nature of RNA can have both positive and negative effects on translation efficiency of genes [52].
Finally, I did not discuss possible evolutionary reasons for why GC vs ENC correlations increase in low GC-content species such as bees and decrease in high GC-content species such as rice. It could be trivial in that increasing GC content in nuclear-encoded genes in AT-rich species provides more GC codons to use and thereby increases the ENC levels for those genes. Conversely, increasing GC content in nuclear-encoded genes in GC-rich species provides fewer AT codons to use, and thereby decreases the ENC levels for those genes. Even if the explanation for the correlations between GC and ENC is trivial, there could be important emergent properties that occur, such as high GC content having more stable secondary structures.
The correlations that I identified could also be profound and lead to new understandings of evolutionary processes. The evolutionary conservation of olfactory genes in low GC and high ENC regions could reflect their need to be rapidly translated in olfactory neurons, for instance. Similarly, the evolutionary conservation of HOX genes in high GC and low ENC regions, which I observed in both insects and mammals, could reflect their possible complex RNA secondary structures needed for proper translation or RNA modifications [53,54], for instance.

5. Conclusions

In conclusion, my study uncovers intriguing species-specific relationships between GC content and codon usage bias, demonstrating that simple numerical transformations can reveal underlying biological constraints. The unexpected distributional patterns across mice, bees, and rice highlight the complexity of genome evolution and codon usage adaptation. These findings emphasize the importance of considering multiple evolutionary and functional factors when interpreting codon usage bias and suggest promising avenues for future research in comparative genomics and translational regulation.

Funding

This research was funded by the National Institutes of Health, grant numbers 5UG3OD023285, 5P42ES030991, and 1P30ES036084.

Institutional Review Board Statement

Not Applicable.

Informed Consent Statement

Not Applicable.

Data Availability Statement

All of the GC and ENC data used in this paper are from the codon statistics database (https://codonstatsdb.unr.edu, accessed on 15 March 2025) [31]. Data analyses for many species were done using Microsoft Excel, and the data analyses are available upon request.

Conflicts of Interest

The author declares no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
GC        GC content for a nuclear-encoded gene (range 0.00 to 1.00)
ENC       Effective number of codons (range 20–61)
GC-ENC    GC rank minus ENC rank
GC+ENC    GC rank plus ENC rank

References

  1. Plotkin, J.B.; Kudla, G. Synonymous but not the same: The causes and consequences of codon bias. Nat. Rev. Genet. 2011, 12, 32–42.
  2. Wright, F. The ‘effective number of codons’ used in a gene. Gene 1990, 87, 23–29.
  3. Liu, X. A more accurate relationship between ‘effective number of codons’ and GC3s under assumptions of no selection. Comput. Biol. Chem. 2013, 42, 35–39.
  4. Sharp, P.M.; Li, W.H. The codon Adaptation Index—A measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 1987, 15, 1281–1295.
  5. Puigbò, P.; Bravo, I.G.; Garcia-Vallvé, S. E-CAI: A novel server to estimate an expected value of Codon Adaptation Index (eCAI). BMC Bioinform. 2008, 9, 65.
  6. Zaytsev, K.; Bogatyreva, N.; Fedorov, A. Link Between Individual Codon Frequencies and Protein Expression: Going Beyond Codon Adaptation Index. Int. J. Mol. Sci. 2024, 25, 11622.
  7. Gu, X.; Li, W.H. A model for the correlation of mutation rate with GC content and the origin of GC-rich isochores. J. Mol. Evol. 1994, 38, 468–475.
  8. Hurst, L.D.; Williams, E.J. Covariation of GC content and the silent site substitution rate in rodents: Implications for methodology and for the evolution of isochores. Gene 2000, 261, 107–114.
  9. Belle, E.M.; Duret, L.; Galtier, N.; Eyre-Walker, A. The decline of isochores in mammals: An assessment of the GC content variation along the mammalian phylogeny. J. Mol. Evol. 2004, 58, 653–660.
  10. Huttener, R.; Thorrez, L.; In’t Veld, T.; Granvik, M.; Snoeck, L.; Van Lommel, L.; Schuit, F. GC content of vertebrate exome landscapes reveal areas of accelerated protein evolution. BMC Evol. Biol. 2019, 19, 144.
  11. Bohlin, J. A simple stochastic model describing the evolution of genomic GC content in asexually reproducing organisms. Sci. Rep. 2022, 12, 18569.
  12. Cingolani, P.; Cao, X.; Khetani, R.S.; Chen, C.C.; Coon, M.; Sammak, A.A.; Bollig-Fischer, A.; Land, S.; Huang, Y.; Hudson, M.E.; et al. Intronic non-CG DNA hydroxymethylation and alternative mRNA splicing in honey bees. BMC Genom. 2013, 14, 666.
  13. Deng, X.; Fan, G. Tuning up gene transcription via direct crosstalk of DNA and RNA methylation. Mol. Cell 2025, 85, 674–676.
  14. Huang, K.Y.; Feng, Y.Y.; Du, H.; Ma, C.W.; Xie, D.; Wan, T.; Feng, X.Y.; Dai, X.G.; Yin, T.M.; Wang, X.Q.; et al. DNA methylation dynamics in gymnosperm duplicate genes: Implications for genome evolution and stress adaptation. Plant J. 2025, 121, e70006.
  15. Ji, J.; Li, D.; Zhao, X.; Wang, Y.; Wang, B. Genome-wide DNA methylation regulation analysis provides novel insights on post-radiation breast cancer. Sci. Rep. 2025, 15, 5641.
  16. Vollger, M.R.; Korlach, J.; Eldred, K.C.; Swanson, E.; Underwood, J.G.; Bohaczuk, S.C.; Mao, Y.; Cheng, Y.H.H.; Ranchalis, J.; Blue, E.E.; et al. Synchronized long-read genome, methylome, epigenome and transcriptome profiling resolve a Mendelian condition. Nat. Genet. 2025, 57, 469–479.
  17. Huang, C.F.; Zhu, J.K. RNA Splicing Factors and RNA-Directed DNA Methylation. Biology 2014, 3, 243–254.
  18. Ma, J.; Li, S.; Wang, T.; Tao, Z.; Huang, S.; Lin, N.; Zhao, Y.; Wang, C.; Li, P. Cooperative condensation of RNA-DIRECTED DNA METHYLATION 16 splicing isoforms enhances heat tolerance in Arabidopsis. Nat. Commun. 2025, 16, 433.
  19. Shukla, S.; Kavak, E.; Gregory, M.; Imashimizu, M.; Shutinoski, B.; Kashlev, M.; Oberdoerffer, P.; Sandberg, R.; Oberdoerffer, S. CTCF-promoted RNA polymerase II pausing links DNA methylation to splicing. Nature 2011, 479, 74–79.
  20. Tatarinova, T.; Elhaik, E.; Pellegrini, M. Cross-species analysis of genic GC3 content and DNA methylation patterns. Genome Biol. Evol. 2013, 5, 1443–1456.
  21. Clement, Y.; Fustier, M.A.; Nabholz, B.; Glemin, S. The bimodal distribution of genic GC content is ancestral to monocot species. Genome Biol. Evol. 2014, 7, 336–348.
  22. Bowers, J.E.; Tang, H.; Burke, J.M.; Paterson, A.H. GC content of plant genes is linked to past gene duplications. PLoS ONE 2022, 17, e0261748.
  23. Teng, W.; Liao, B.; Chen, M.; Shu, W. Genomic Legacies of Ancient Adaptation Illuminate GC-Content Evolution in Bacteria. Microbiol. Spectr. 2023, 11, e0214522.
  24. Mazumdar, P.; Binti Othman, R.; Mebus, K.; Ramakrishnan, N.; Ann Harikrishna, J. Codon usage and codon pair patterns in non-grass monocot genomes. Ann. Bot. 2017, 120, 893–909.
  25. Jørgensen, F.G.; Schierup, M.H.; Clark, A.G. Heterogeneity in regional GC content and differential usage of codons and amino acids in GC-poor and GC-rich regions of the genome of Apis mellifera. Mol. Biol. Evol. 2007, 24, 611–619.
  26. Scapoli, C.; Bartolomei, E.; De Lorenzi, S.; Carrieri, A.; Salvatorelli, G.; Rodriguez-Larralde, A.; Barrai, I. Codon and aminoacid usage patterns in mycobacteria. J. Mol. Microbiol. Biotechnol. 2009, 17, 53–60.
  27. Gaona-Mendoza, A.S.; Massange-Sánchez, J.A.; Barboza-Corona, J.E.; Abraham-Juárez, M.J.; Casados-Vázquez, L.E. Codon Optimization is Required to Express Fluorogenic Reporter Proteins in Lactococcus lactis. Mol. Biotechnol. 2024. Online ahead of print.
  28. Steindorff, A.S.; Aguilar-Pontes, M.V.; Robinson, A.J.; Andreopoulos, B.; LaButti, K.; Kuo, A.; Mondo, S.; Riley, R.; Otillar, R.; Haridas, S.; et al. Comparative genomic analysis of thermophilic fungi reveals convergent evolutionary adaptations and gene losses. Commun. Biol. 2024, 7, 1124.
  29. Rudolph, K.L.; Schmitt, B.M.; Villar, D.; White, R.J.; Marioni, J.C.; Kutter, C.; Odom, D.T. Codon-Driven Translational Efficiency Is Stable across Diverse Mammalian Cell States. PLoS Genet. 2016, 12, e1006024.
  30. López, J.L.; Lozano, M.J.; Fabre, M.L.; Lagares, A. Codon Usage Optimization in the Prokaryotic Tree of Life: How Synonymous Codons Are Differentially Selected in Sequence Domains with Different Expression Levels and Degrees of Conservation. mBio 2020, 11, 10–1128.
  31. Subramanian, K.; Payne, B.; Feyertag, F.; Alvarez-Ponce, D. The Codon Statistics Database: A Database of Codon Usage Bias. Mol. Biol. Evol. 2022, 39, msac157.
  32. Sabi, R.; Tuller, T. Modelling the efficiency of codon-tRNA interactions based on codon usage bias. DNA Res. 2014, 21, 511–526.
  33. Zhang, Q.; Zhang, Y.; Chai, Y. Optimization of CRISPR/LbCas12a-mediated gene editing in Arabidopsis. PLoS ONE 2022, 17, e0265114.
  34. Bajaj, P.; Bhasin, M.; Varadarajan, R. Molecular bases for strong phenotypic effects of single synonymous codon substitutions in the E. coli ccdB toxin gene. BMC Genom. 2023, 24, 732.
  35. Ando, D.; Rashad, S.; Begley, T.J.; Endo, H.; Aoki, M.; Dedon, P.C.; Niizuma, K. Decoding Codon Bias: The Role of tRNA Modifications in Tissue-Specific Translation. Int. J. Mol. Sci. 2025, 26, 706.
  36. Ding, N.Q. Advanced Algebra; World Scientific: Hackensack, NJ, USA, 2025; Volume XVI, 495p.
  37. Jabbari, K.; Bernardi, G. An Isochore Framework Underlies Chromatin Architecture. PLoS ONE 2017, 12, e0168023.
  38. Karro, J.E.; Peifer, M.; Hardison, R.C.; Kollmann, M.; Von Grünberg, H.H. Exponential decay of GC content detected by strand-symmetric substitution rates influences the evolution of isochore structure. Mol. Biol. Evol. 2008, 25, 362–374.
  39. Matsuo, K.; Clay, O.; Takahashi, T.; Silke, J.; Schaffner, W. Evidence for erosion of mouse CpG islands during mammalian evolution. Somat. Cell Mol. Genet. 1993, 19, 543–555.
  40. Schmegner, C.; Hoegel, J.; Vogel, W.; Assum, G. The rate, not the spectrum, of base pair substitutions changes at a GC-content transition in the human NF1 gene region: Implications for the evolution of the mammalian genome structure. Genetics 2007, 175, 421–428.
  41. Guo, Y.; Zhao, S.; Wang, G.G. Polycomb Gene Silencing Mechanisms: PRC2 Chromatin Targeting, H3K27me3 ‘Readout’, and Phase Separation-Based Compaction. Trends Genet. 2021, 37, 547–565.
  42. Tirado-Magallanes, R.; Rebbani, K.; Lim, R.; Pradhan, S.; Benoukraf, T. Whole genome DNA methylation: Beyond genes silencing. Oncotarget 2017, 8, 5629–5637.
  43. Tse, J.W.T.; Jenkins, L.J.; Chionh, F.; Mariadason, J.M. Aberrant DNA Methylation in Colorectal Cancer: What Should We Target? Trends Cancer 2017, 3, 698–712.
  44. Matassi, G.; Montero, L.M.; Salinas, J.; Bernardi, G. The isochore organization and the compositional distribution of homologous coding sequences in the nuclear genome of plants. Nucleic Acids Res. 1989, 17, 5273–5290.
  45. Salinas, J.; Matassi, G.; Montero, L.M.; Bernardi, G. Compositional compartmentalization and compositional patterns in the nuclear genomes of plants. Nucleic Acids Res. 1988, 16, 4269–4285.
  46. Vogl, C.; Karapetiants, M.; Yıldırım, B.; Kjartansdóttir, H.; Kosiol, C.; Bergman, J.; Majka, M.; Mikula, L.C. Inference of genomic landscapes using ordered Hidden Markov Models with emission densities (oHMMed). BMC Bioinform. 2024, 25, 151.
  47. Shi, C.; Xie, Y.; Guan, D.; Qin, G. Transcriptomic Analysis Reveals Adaptive Evolution and Conservation Implications for the Endangered Magnolia lotungensis. Genes 2024, 15, 787.
  48. Kimsey, I.; Al-Hashimi, H.M. Increasing occurrences and functional roles for high energy purine-pyrimidine base-pairs in nucleic acids. Curr. Opin. Struct. Biol. 2014, 24, 72–80.
  49. Woodside, M.T.; García-García, C.; Block, S.M. Folding and unfolding single RNA molecules under tension. Curr. Opin. Chem. Biol. 2008, 12, 640–646.
  50. Aruda, J.; Grote, S.L.; Rouskin, S. Untangling the pseudoknots of SARS-CoV-2: Insights into structural heterogeneity and plasticity. Curr. Opin. Struct. Biol. 2024, 88, 102912.
  51. Kiliushik, D.; Goenner, C.; Law, M.; Schroeder, G.M.; Srivastava, Y.; Jenkins, J.L.; Wedekind, J.E. Knotty is nice: Metabolite binding and RNA-mediated gene regulation by the preQ(1) riboswitch family. J. Biol. Chem. 2024, 300, 107951.
  52. Chełkowska-Pauszek, A.; Kosiński, J.G.; Marciniak, K.; Wysocka, M.; Bąkowska-Żywicka, K.; Żywicki, M. The Role of RNA Secondary Structure in Regulation of Gene Expression in Bacteria. Int. J. Mol. Sci. 2021, 22, 7845.
  53. Alghoul, F.; Eriani, G.; Martin, F. RNA Secondary Structure Study by Chemical Probing Methods Using DMS and CMCT. Methods Mol. Biol. 2021, 2300, 241–250.
  54. Zhou, Y.; Huang, Q.; Wu, C.; Xu, Y.; Guo, Y.; Yuan, X.; Xu, C.; Zhou, L. m(6)A-modified HOXC10 promotes HNSCC progression via co-activation of ADAM17/EGFR and Wnt/beta-catenin signaling. Int. J. Oncol. 2024, 64, 10.
Figure 6. Combinatorial effects of adding or subtracting GC and ENC ranks. (a) Line A (1, 2, …, 1000) (red) and Line B (1000, 999, …, 1) are plotted. Column A on Excel™ has the numbers for Line A and column B has the numbers for Line B. (b) Line A minus Line B (A−B) (blue) and Line A+B (red) are plotted. A−B was made by selecting column A (rows 1–1000) and subtracting column B (rows 1–1000) and placing the results in column C. A+B was made by selecting column A and adding column B and placing the results in column D. (c) A histogram of Line A minus a randomization of Line A (Random) is plotted (A-Random). Random was generated on Excel™ with the RANDARRAY function, i.e., =SORTBY(A1:A1000,RANDARRAY(1000)). The results were placed in column E. The histogram was made by selecting column E (rows 1–1000) and selecting the histogram chart under the INSERT tab. (d) A histogram of Line A plus a randomization of Line A (Random) is plotted (A+Random). The results of A+Random were inserted into column F. (e) A histogram of Line A’ (1, 2, …, 10,000) (column G) minus a randomization of A’ (column H), with the results placed in column I (A’-Random’). The steps in (c) were repeated using numbers 1–10,000 for Line A’ and a randomization of numbers 1–10,000 for Random’. The area was determined by the equation SUM(ABS(I1:I10000)). ABS (absolute value) was used in this equation because about half of the numbers are negative. The area can also be approximated as N²/3, where N is the number of rows; in this case there are 10,000 rows (see methods). (f) A histogram of Line A’ plus Random’, with the results placed in column J (A’+Random’). The area was determined by the equation SUM(J1:J10000) = N(N + 1). (g) A scatter plot of 100 repetitions of SUM(ABS((A1:A1000) − (R1:R1000))), where R is a randomization of the numbers between 1 and 1000 using the equation =SORTBY(A1:A1000,RANDARRAY(1000)). The red line shows the average = 333,023 +/− 7360, which is equivalent to N²/3 +/− 2%. (h) A histogram of the results in (g), where the x-axis is SUM(ABS((A1:A1000) − (R1:R1000))) and the y-axis is the number of times that range of numbers occurred in 100 repetitions.
Figure 6. Combinatorial effects of adding or subtracting GC and ENC ranks. (a) Line A (1, 2, …, 1000) (red) and Line B (1000, 999, …, 1) are plotted. In Excel™, column A holds the numbers for Line A and column B holds the numbers for Line B. (b) Line A minus Line B (A−B) (blue) and Line A plus Line B (A+B) (red) are plotted. A−B was made by subtracting column B (rows 1–1000) from column A (rows 1–1000) and placing the results in column C. A+B was made by adding column A and column B and placing the results in column D. (c) A histogram of Line A minus a randomization of Line A (Random) is plotted (A−Random). Random was generated in Excel™ with the RANDARRAY function, i.e., =SORTBY(A1:A1000,RANDARRAY(1000)). The results were placed in column E. The histogram was made by selecting column E (rows 1–1000) and choosing the histogram chart under the INSERT tab. (d) A histogram of Line A plus a randomization of Line A (Random) is plotted (A+Random). The results of A+Random were placed in column F. (e) A histogram of Line A′ (1, 2, …, 10,000) (column G) minus a randomization of A′ (Random′, column H), with the results placed in column I (A′−Random′). The steps in (c) were repeated using the numbers 1–10,000 for Line A′ and a randomization of the numbers 1–10,000 for Random′. The area was determined by the equation SUM(ABS(I1:I10000)). ABS (absolute value) was used in this equation because about half of the numbers are negative. The area can also be approximated as N²/3, where N is the number of rows, in this case 10,000 (see Methods). (f) A histogram of Line A′ plus Random′, with the results placed in column J (A′+Random′). The area was determined by the equation SUM(J1:J10000) = N(N + 1), because each of the two columns sums to N(N + 1)/2. (g) A scatter plot of 100 repetitions of SUM(ABS((A1:A1000) − (R1:R1000))), where R is a randomization of the numbers between 1 and 1000 generated with =SORTBY(A1:A1000,RANDARRAY(1000)). The red line shows the average = 333,023 ± 7360, which is equivalent to N²/3 ± 2%. (h) A histogram of the results in (g), where the x-axis is SUM(ABS((A1:A1000) − (R1:R1000))) and the y-axis is the number of times that range of values occurred in the 100 repetitions.
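The randomization described in panels (c), (e), and (g) can also be reproduced outside Excel™. The following is a minimal Python sketch, not the author's workflow: it shuffles the ranks 1 to N with the standard-library `random` module in place of SORTBY/RANDARRAY, and the function names and the seed are illustrative assumptions. For a uniformly random shuffle, the expected value of the sum of absolute differences is (N² − 1)/3, which is why N²/3 is a close approximation for large N.

```python
import random


def abs_rank_difference_sum(n, rng):
    """Sum of |A - Random| where A = 1..n and Random is a shuffle of A."""
    ranks = list(range(1, n + 1))
    shuffled = ranks[:]
    rng.shuffle(shuffled)  # stands in for SORTBY(A1:An, RANDARRAY(n))
    return sum(abs(a - r) for a, r in zip(ranks, shuffled))


def rank_sum_total(n, rng):
    """Sum of A + Random; equals n(n + 1) for any shuffle."""
    ranks = list(range(1, n + 1))
    shuffled = ranks[:]
    rng.shuffle(shuffled)
    return sum(a + r for a, r in zip(ranks, shuffled))


def simulate(n=1000, repetitions=100, seed=0):
    """Repeat the shuffle experiment; the seed is an arbitrary assumption."""
    rng = random.Random(seed)
    return [abs_rank_difference_sum(n, rng) for _ in range(repetitions)]


if __name__ == "__main__":
    n = 1000
    results = simulate(n=n, repetitions=100)
    mean_sum = sum(results) / len(results)
    print("mean of SUM(ABS(A - Random)):", mean_sum)
    print("N^2 / 3:", n ** 2 / 3)
    print("SUM(A + Random):", rank_sum_total(n, random.Random(1)))
```

The mean over 100 repetitions should land close to N²/3, while the sum of A + Random is the same for every shuffle.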
Figure 7. Correlations between GC content, ENC, and the number of nuclear-encoded genes. Data for all graphs are from Table 1. (a) Plot of TRON score (y-axis) versus slope (GC rank vs. ENC rank) (x-axis) for all 17 species. Species with negative slopes between GC ranks and ENC ranks are on the left and species with positive slopes are on the right. The trendline and R-squared value are shown. The TRON score is SUM(ABS((GC1:GCN) − (ENC1:ENCN)))/(N²/3). (b) Plot of R-squared correlation (y-axis) versus slope (GC rank vs. ENC rank) (x-axis) for all 17 species. Species with negative slopes between GC ranks and ENC ranks are on the left and species with positive slopes are on the right. The polynomial trendline and R-squared value are shown. (c) GC content at peak 1 (y-axis) versus number of nuclear-encoded genes (x-axis) for all 17 species. The trendline and R-squared value are shown. (d) GC content at peak 2 (y-axis) versus number of nuclear-encoded genes (x-axis) for all 17 species. The trendline and R-squared value are shown. (e) ENC level at peak 1 (y-axis) versus GC content at peak 1 (x-axis) for all 17 species. The trendline and R-squared value are shown. (f) ENC level at peak 2 (y-axis) versus GC content at peak 1 (x-axis) for all 17 species. The trendline and R-squared value are shown.
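The slope and R-squared values for GC rank versus ENC rank can be illustrated with a short Python sketch. It assumes the trendline is an ordinary least-squares fit, which is an assumption about the spreadsheet trendline and not something the captions state. The function name `rank_regression` and the five-gene example ranks are hypothetical.

```python
def rank_regression(x_ranks, y_ranks):
    """Least-squares fit y = slope * x + intercept, plus R-squared."""
    n = len(x_ranks)
    if n != len(y_ranks) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(x_ranks) / n
    mean_y = sum(y_ranks) / n
    sxx = sum((x - mean_x) ** 2 for x in x_ranks)
    syy = sum((y - mean_y) ** 2 for y in y_ranks)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_ranks, y_ranks))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


if __name__ == "__main__":
    # Hypothetical ranks for five genes (not data from the study).
    gc_ranks = [1, 2, 3, 4, 5]
    enc_ranks = [2, 1, 4, 3, 5]
    slope, intercept, r_squared = rank_regression(gc_ranks, enc_ranks)
    # Slope is 0.8 and R-squared is 0.64, up to floating-point rounding.
    print(slope, intercept, r_squared)
```

When both series are untied ranks 1 to N, the fitted slope equals the Spearman rank correlation and R-squared equals the slope squared. The Table 1 values for rice follow this pattern: a slope of −0.77 alongside R² = 0.59.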
Table 1. The 17 species analyzed in this study. Shown are the number of genes (N), the GC content at peaks 1 and 2, the ENC levels at peaks 1 and 2, the line equation for GC rank versus ENC rank, the slope from this equation (negative for rice and positive for bees), the R² value for the fit, the Two-Rank Order Normalization (TRON) for GC−ENC (i.e., SUM(ABS(GC − ENC))/(N²/3)), and the Two-Rank Order Normalization (TRON) for GC+ENC (i.e., SUM(GC + ENC)/(N(N + 1)) = 1.00).
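| Common Name | Species | Genes (N) | GC Peak 1 | GC Peak 2 | ENC Peak 1 | ENC Peak 2 | Line Equation | Slope | R² | (GC-ENC)/(N²/3) | (GC+ENC)/N(N+1) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Rice | Oryza sativa | 28,571 | 0.48 | 0.7 | 32 | 56 | y = −0.77x + 25253 | −0.77 | 0.59 | 1.48 | 1 |
| Mosquito | Anopheles gambiae | 12,402 | 0.58 | 0.58 | 43 | 53 | y = −0.75x + 10878 | −0.75 | 0.57 | 1.4 | 1 |
| Puffer fish | Takifugu rubripes | 22,107 | 0.54 | 0.54 | 53 | 53 | y = −0.64x + 18180 | −0.64 | 0.42 | 1.5 | 1 |
| Humans | Homo sapiens | 19,708 | 0.46 | 0.6 | 43 | 53 | y = −0.61x + 15954 | −0.61 | 0.38 | 1.39 | 1 |
| Bread mold | Neurospora crassa | 9728 | 0.55 | 0.55 | 57 | 57 | y = −0.60x + 7830 | −0.60 | 0.37 | 1.34 | 1 |
| Banana | Musa acuminata | 30,700 | 0.45 | 0.6 | 41 | 55 | y = −0.53x + 23495 | −0.53 | 0.28 | 1.18 | 1 |
| Mouse | Mus musculus | 22,405 | 0.49 | 0.55 | 54 | 54 | y = −0.45x + 16307 | −0.45 | 0.21 | 1.32 | 1 |
| E. coli bacteria | Escherichia coli | 10,276 | 0.52 | 0.52 | 48 | 48 | y = −0.35x + 6975 | −0.35 | 0.13 | 1.22 | 1 |
| Pombe yeast | pombe | 5110 | 0.4 | 0.4 | 50 | 50 | y = −0.19x + 3063 | −0.19 | 0.04 | 1.1 | 1 |
| Methanobacteria | Methanococcus aeolicus | 1485 | 0.32 | 0.32 | 41 | 41 | y = 0.064x + 696 | 0.064 | 0.004 | 0.97 | 1 |
| Bakers yeast | Saccharomyces | 5958 | 0.39 | 0.39 | 51 | 51 | y = 0.0064x + 2983 | 0.0064 | 0.0008 | 0.97 | 1 |
| Honey bee | Apis mellifera | 9918 | 0.33 | 0.47 | 35 | 56 | y = 0.78x + 1103 | 0.78 | 0.6 | 0.44 | 0.99 |
| Red paper wasp | Polistes canadensis | 9854 | 0.38 | 0.38 | 48 | 48 | y = 0.72x + 1367 | 0.72 | 0.52 | 0.48 | 0.99 |
| Spotted fever parasite | Rickettsia hoogstraalii | 1663 | 0.33 | 0.33 | 43 | 43 | y = 0.41x + 489 | 0.41 | 0.17 | 0.71 | 1 |
| Slime mold | Dictyostelium discoideum | 13,078 | 0.28 | 0.28 | 32 | 32 | y = 0.38x + 4044 | 0.38 | 0.15 | 0.75 | 1 |
| Mustard weed | Arabidopsis thaliana | 10,160 | 0.45 | 0.45 | 53 | 53 | y = 0.20x + 4066 | 0.20 | 0.04 | 0.86 | 1 |
| Malaria parasite | Plasmodium falciparum | 5321 | 0.25 | 0.25 | 38 | 38 | y = 0.20x + 2141 | 0.20 | 0.04 | 0.88 | 1 |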
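The TRON score reported in Table 1, SUM(ABS(GC rank − ENC rank))/(N²/3), can be sketched in a few lines of Python. This is an illustration under stated assumptions and not the author's spreadsheet: both series are ranked in ascending order, ties are broken by input order (the source does not say how ties are handled), and the helper names `rank_order` and `tron` are hypothetical.

```python
def rank_order(values):
    """Assign ranks 1..N in ascending order; ties keep input order (assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def tron(gc_values, enc_values):
    """Two-Rank Order Normalization: SUM(ABS(GC rank - ENC rank)) / (N^2 / 3)."""
    if len(gc_values) != len(enc_values) or not gc_values:
        raise ValueError("need two non-empty series of equal length")
    gc_ranks = rank_order(gc_values)
    enc_ranks = rank_order(enc_values)
    n = len(gc_ranks)
    total = sum(abs(g - e) for g, e in zip(gc_ranks, enc_ranks))
    return total / (n ** 2 / 3)


if __name__ == "__main__":
    n = 1000
    ascending = list(range(1, n + 1))
    descending = ascending[::-1]
    # Identical orderings give 0; fully reversed orderings give about 1.5.
    print(tron(ascending, ascending))
    print(tron(ascending, descending))
```

Under these assumptions the score runs from 0 for perfectly concordant ranks, through roughly 1 for uncorrelated ranks, to about 1.5 for perfectly reversed ranks. That range is consistent with the table, where species with positive slopes score below 1 and species with negative slopes score above 1.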

Ruden, D.M. GC Content in Nuclear-Encoded Genes and Effective Number of Codons (ENC) Are Positively Correlated in AT-Rich Species and Negatively Correlated in GC-Rich Species. Genes 2025, 16, 432. https://doi.org/10.3390/genes16040432
