Article

The Maximal C3 Self-Complementary Trinucleotide Circular Code X in Genes of Bacteria, Archaea, Eukaryotes, Plasmids and Viruses

by
Christian J. Michel
Theoretical Bioinformatics, ICube, University of Strasbourg, CNRS, 300 Boulevard Sébastien Brant, 67400 Illkirch, France
Submission received: 6 February 2017 / Revised: 23 March 2017 / Accepted: 31 March 2017 / Published: 18 April 2017

Abstract

In 1996, a set X of 20 trinucleotides was identified in genes of both prokaryotes and eukaryotes which has on average the highest occurrence in reading frame compared to its two shifted frames. Furthermore, this set X has an interesting mathematical property, as X is a maximal $C^3$ self-complementary trinucleotide circular code. In 2015, by quantifying the inspection approach used in 1996, the circular code X was confirmed in the genes of bacteria and eukaryotes and was also identified in the genes of plasmids and viruses. The method was based on the preferential occurrence of trinucleotides among the three frames at the gene population level. We extend here this definition to the gene level. This new statistical approach considers all the genes, i.e., of large and small lengths, with the same weight for searching the circular code X. As a consequence, the concept of circular code, in particular the reading frame retrieval, is directly associated with each gene. At the gene level, the circular code X is strengthened in the genes of bacteria, eukaryotes, plasmids, and viruses, and is now also identified in the genes of archaea. The genes of mitochondria and chloroplasts contain a subset of the circular code X. Finally, by studying viral genes, the circular code X was found in DNA genomes, RNA genomes, double-stranded genomes, and single-stranded genomes.

1. Introduction

A circular code is a mathematical structure found in genes and genomes. This concept, initially found for genes, has been extended to genomes (non-coding regions of eukaryotes) according to recent results. A circular code X is a set of words such that any motif from X, called an X motif, allows the original (construction) frame to be retrieved, maintained, and synchronized.
The circular code X identified in the genes of bacteria, eukaryotes, plasmids, and viruses [1,2] contains the following 20 trinucleotides
$$X = \{AAC, AAT, ACC, ATC, ATT, CAG, CTC, CTG, GAA, GAC, GAG, GAT, GCC, GGC, GGT, GTA, GTC, GTT, TAC, TTC\} \tag{1}$$
which allows both the reading frame to be retrieved with a window of 13 nucleotides (Figure 3 in [3]) and the following 12 amino acids to be coded
$$\{Ala, Asn, Asp, Gln, Glu, Gly, Ile, Leu, Phe, Thr, Tyr, Val\}. \tag{2}$$
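The correspondence between Equation (1) and Equation (2) can be checked with a short sketch. It assumes the standard genetic code, written here as the usual compact 64-letter string in TCAG order; the names `CODON_TABLE` and `amino_acids` are illustrative and not from the article.

```python
# Standard genetic code in TCAG order (one-letter amino acids, '*' = stop).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# The 20 trinucleotides of the circular code X (Equation (1)).
X = set("AAC AAT ACC ATC ATT CAG CTC CTG GAA GAC "
        "GAG GAT GCC GGC GGT GTA GTC GTT TAC TTC".split())

def amino_acids(code):
    """Set of one-letter amino acids coded by a set of trinucleotides."""
    return {CODON_TABLE[t] for t in code}

print(sorted(amino_acids(X)))
```

The one-letter symbols returned correspond to Ala, Asn, Asp, Gln, Glu, Gly, Ile, Leu, Phe, Thr, Tyr, and Val.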
The current genetic code is not circular. Thus, it cannot retrieve the reading frame. The loss during evolution of this circular code property on the 4-letter alphabet { A , C , G , T } required a complex translation mechanism using 20 amino acids and proteins in current genomes.
X motifs from Equation (1) are identified in (i) genes “universally” [1,4]; (ii) tRNAs of prokaryotes and eukaryotes [3,5]; (iii) rRNAs of prokaryotes (16S) and eukaryotes (18S), in particular in the ribosome decoding center where the universally conserved nucleotides G530, A1492, and A1493 are included in the X motifs [3,6,7]; and (iv) genomes (non-coding regions of eukaryotes) [4,8].
The X motifs of maximal cardinality 20 (composition) in genes, with the circular code, $C^3$, and complementarity properties, allow the two reading frames and the four shifted frames to be retrieved by pairing between DNAs-DNAs, DNAs-mRNAs, mRNAs-rRNAs, mRNAs-tRNAs, and rRNAs-tRNAs, as shown with a 3D visualization of the X motifs in the ribosome [3,6,7].
The X motifs in genomes have a different structure compared to the X motifs in genes [8]. Indeed, their cardinality is not maximal (fewer than 10, as an order of magnitude), they are longer, and their structure contains repeated trinucleotides. Furthermore, the X motifs of minimal cardinality 1 generated with the 20 repeated trinucleotides $t^n$ where $t \in X$ (Equation (1)) are very common in the genomes of eukaryotes (e.g., [8,9,10]). Their length $n$ can be very large (e.g., $n > 6000$, see Figure 1). The repeated trinucleotides are very unstable, with mutation rates up to 100,000 times higher than the genomic average mutation rate. A mutation in a repeat increases its evolutionary stability.
A model of evolution of the X motifs in genes and genomes can be proposed according to the previous works and the recent results [8]. It proposes that the X motifs of maximal cardinality 20 in genes have evolved from the X motifs of minimal cardinality 1 (repeated trinucleotides) in genomes (Figure 1). An X motif of minimal cardinality 1, which is unstable, mutates into an X motif of low cardinality ($< 10$), thus containing different repeated trinucleotides of short lengths. This evolutionary process continues by increasing the cardinality and decreasing the length of the X motifs until it generates the X motifs of high ($\geq 10$) and maximal (20) cardinality coding the 12 amino acids (Equation (2)) in genes. The X motifs of high cardinality have acquired the protein coding function in addition to the reading frame retrieval. This model suggests that the property of reading frame retrieval has preceded the protein coding function.
Since 1996, all the statistical analyses studying the preferential occurrence of trinucleotides among the three frames were done at the gene population level (kingdoms, taxonomic groups, genomes). We extend here the method from [1] to the gene level. This new approach is important as all the genes, i.e., of large and small lengths, are now considered with the same weight in the statistical definition for searching the circular code X. As a consequence, the concept of circular code, in particular the reading frame retrieval, is directly associated with each gene. Thus, at the gene level, the circular code X is searched here in the genes of bacteria, archaea, eukaryotes, plasmids, viruses, and eukaryotic organelles, i.e., mitochondria and chloroplasts. Finally, genes of double-stranded DNA and RNA viruses, and single-stranded DNA and RNA viruses are also analysed with this approach in order to assign a genetic information unit (DNA or RNA, double-stranded or single-stranded) to the circular code X.

2. Method

2.1. Definitions

We recall a few definitions without detailed explanations (i.e., without figures and examples) for understanding the main properties of the trinucleotide circular code X identified in genes [1,2].
Notation 1.
Let us denote the nucleotide 4-letter alphabet $B = \{A, C, G, T\}$ where $A$ stands for Adenine, $C$ stands for Cytosine, $G$ stands for Guanine, and $T$ stands for Thymine. The trinucleotide set over $B$ is denoted by $B^3 = \{AAA, \ldots, TTT\}$. The set of non-empty words (words, respectively) over $B$ is denoted by $B^+$ ($B^*$, respectively).
Notation 2.
Genes have three frames $f$. By convention here, the reading frame $f = 0$ is set up by a start trinucleotide $\{ATG, CTG, GTG, TTG\}$, and the frames $f = 1$ and $f = 2$ are the reading frame $f = 0$ shifted by one and two nucleotides in the 5′-3′ direction (to the right), respectively.
Two biological maps are involved in gene coding.
Definition 1.
According to the complementary property of the DNA double helix, the nucleotide complementarity map $\mathcal{C}: B \to B$ is defined by $\mathcal{C}(A) = T$, $\mathcal{C}(C) = G$, $\mathcal{C}(G) = C$, and $\mathcal{C}(T) = A$. According to the complementary and antiparallel properties of the DNA double helix, the trinucleotide complementarity map $\mathcal{C}: B^3 \to B^3$ is defined by $\mathcal{C}(l_0 l_1 l_2) = \mathcal{C}(l_2)\,\mathcal{C}(l_1)\,\mathcal{C}(l_0)$ for all $l_0, l_1, l_2 \in B$. By extension to a trinucleotide set $S$, the set complementarity map $\mathcal{C}: \mathbb{P}(B^3) \to \mathbb{P}(B^3)$, $\mathbb{P}$ being the set of all subsets of $B^3$, is defined by $\mathcal{C}(S) = \{v : u, v \in B^3, u \in S, v = \mathcal{C}(u)\}$, e.g., $\mathcal{C}(\{CGA, GAT\}) = \{ATC, TCG\}$.
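Definition 1 translates directly into code. The following is a minimal sketch; the function names `complement_trinucleotide` and `complement_set` are illustrative choices, not names used in the article.

```python
# Nucleotide complementarity map of Definition 1.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def complement_trinucleotide(t):
    """Complement each letter and reverse the order (antiparallel strands)."""
    return "".join(COMPLEMENT[l] for l in reversed(t))

def complement_set(s):
    """Set complementarity map: apply the trinucleotide map to every element."""
    return {complement_trinucleotide(t) for t in s}

print(sorted(complement_set({"CGA", "GAT"})))
```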
Definition 2.
The trinucleotide circular permutation map $\mathcal{P}: B^3 \to B^3$ is defined by $\mathcal{P}(l_0 l_1 l_2) = l_1 l_2 l_0$ for all $l_0, l_1, l_2 \in B$. $\mathcal{P}^2$ denotes the 2nd iterate of $\mathcal{P}$. By extension to a trinucleotide set $S$, the set circular permutation map $\mathcal{P}: \mathbb{P}(B^3) \to \mathbb{P}(B^3)$ is defined by $\mathcal{P}(S) = \{v : u, v \in B^3, u \in S, v = \mathcal{P}(u)\}$, e.g., $\mathcal{P}(\{CGA, GAT\}) = \{ATG, GAC\}$ and $\mathcal{P}^2(\{CGA, GAT\}) = \{ACG, TGA\}$.
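The circular permutation map of Definition 2 can be sketched in the same way. The parameter `k` (number of iterations of the map) and the function names are illustrative assumptions.

```python
def permute(t, k=1):
    """Circular permutation applied k times: l0 l1 l2 -> l1 l2 l0 for k = 1."""
    k %= 3
    return t[k:] + t[:k]

def permute_set(s, k=1):
    """Set circular permutation map: permute every trinucleotide of s."""
    return {permute(t, k) for t in s}

print(sorted(permute_set({"CGA", "GAT"})))      # first iterate
print(sorted(permute_set({"CGA", "GAT"}, 2)))   # second iterate
```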
Definition 3.
A set $S \subseteq B^+$ is a code if, for each $x_1, \ldots, x_n, y_1, \ldots, y_m \in S$, $n, m \geq 1$, the condition $x_1 \cdots x_n = y_1 \cdots y_m$ implies $n = m$ and $x_i = y_i$ for $i = 1, \ldots, n$.
Definition 4.
Any non-empty subset of the code $B^3$ is a code and is called a trinucleotide code $C$.
Definition 5.
A trinucleotide code $C \subseteq B^3$ is self-complementary if, for each $t \in C$, $\mathcal{C}(t) \in C$, i.e., $C = \mathcal{C}(C)$.
Definition 6.
A trinucleotide code $X \subseteq B^3$ is circular if, for each $x_1, \ldots, x_n, y_1, \ldots, y_m \in X$, $n, m \geq 1$, $r \in B^*$, $s \in B^+$, the conditions $s x_2 \cdots x_n r = y_1 \cdots y_m$ and $x_1 = rs$ imply $n = m$, $r = \varepsilon$ (empty word), and $x_i = y_i$ for $i = 1, \ldots, n$.
The proofs to decide whether a code is circular or not are based on the flower automaton [2], the necklace 5LDCN (Letter Diletter Continued Necklace) [11], the necklace $n$LDCCN (Letter Diletter Continued Closed Necklace) with $n \in \{2, 3, 4, 5\}$ [12], and graph theory [13].
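As an illustration of the graph-theoretical route, the sketch below tests circularity through an acyclicity criterion. It assumes the following reading of the graph approach: each trinucleotide $l_0 l_1 l_2$ contributes the two arcs $l_0 \to l_1 l_2$ and $l_0 l_1 \to l_2$, and the code is circular if and only if the resulting directed graph has no cycle. The function name `is_circular` is hypothetical, and this is not claimed to be the procedure used in the article.

```python
X = set("AAC AAT ACC ATC ATT CAG CTC CTG GAA GAC "
        "GAG GAT GCC GGC GGT GTA GTC GTT TAC TTC".split())

def is_circular(code):
    """Circularity test under the assumed criterion: associated graph is acyclic."""
    graph = {}
    for t in code:
        graph.setdefault(t[0], set()).add(t[1:])   # arc l0 -> l1 l2
        graph.setdefault(t[:2], set()).add(t[2])   # arc l0 l1 -> l2
    state = {}  # vertex -> "open" (on the DFS stack) or "done"

    def has_cycle(v):
        state[v] = "open"
        for w in graph.get(v, ()):
            if state.get(w) == "open":
                return True
            if w not in state and has_cycle(w):
                return True
        state[v] = "done"
        return False

    return not any(v not in state and has_cycle(v) for v in list(graph))

print(is_circular(X))
```

A set containing a trinucleotide together with one of its circular permutations, such as {ACG, CGA}, produces a cycle and is rejected.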
Definition 7.
A trinucleotide circular code $X \subseteq B^3$ is $C^3$ self-complementary if $X$, $X_1 = \mathcal{P}(X)$, and $X_2 = \mathcal{P}^2(X)$ are trinucleotide circular codes such that $X = \mathcal{C}(X)$ (self-complementary), $\mathcal{C}(X_1) = X_2$, and $\mathcal{C}(X_2) = X_1$ ($X_1$ and $X_2$ are complementary).
The trinucleotide set $X = X_0$ (Equation (1)) coding the reading frame ($f = 0$) in genes is a maximal (20 trinucleotides) $C^3$ self-complementary trinucleotide circular code [2] where the circular code $X_1 = \mathcal{P}(X)$ coding the frame $f = 1$ contains the following 20 trinucleotides
$$X_1 = \{AAG, ACA, ACG, ACT, AGC, AGG, ATA, ATG, CCA, CCG, GCG, GTG, TAG, TCA, TCC, TCG, TCT, TGC, TTA, TTG\} \tag{3}$$
and the circular code $X_2 = \mathcal{P}^2(X)$ coding the frame $f = 2$ contains the following 20 trinucleotides
$$X_2 = \{AGA, AGT, CAA, CAC, CAT, CCT, CGA, CGC, CGG, CGT, CTA, CTT, GCA, GCT, GGA, TAA, TAT, TGA, TGG, TGT\}. \tag{4}$$
The trinucleotide circular codes $X_1$ and $X_2$ are related by the permutation map, i.e., $X_2 = \mathcal{P}(X_1)$ and $X_1 = \mathcal{P}^2(X_2)$, and by the complementary map, i.e., $X_1 = \mathcal{C}(X_2)$ and $X_2 = \mathcal{C}(X_1)$ [14].
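The relations between X, $X_1$, and $X_2$ stated above can be verified mechanically. This sketch recomputes the permuted and complemented sets from Equation (1) and compares them with the lists of Equations (3) and (4); the helper names `P` and `C` are illustrative.

```python
X = set("AAC AAT ACC ATC ATT CAG CTC CTG GAA GAC "
        "GAG GAT GCC GGC GGT GTA GTC GTT TAC TTC".split())
X1_LISTED = set("AAG ACA ACG ACT AGC AGG ATA ATG CCA CCG "
                "GCG GTG TAG TCA TCC TCG TCT TGC TTA TTG".split())
X2_LISTED = set("AGA AGT CAA CAC CAT CCT CGA CGC CGG CGT "
                "CTA CTT GCA GCT GGA TAA TAT TGA TGG TGT".split())

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def P(code, k=1):
    """Set circular permutation map applied k times."""
    return {t[k % 3:] + t[:k % 3] for t in code}

def C(code):
    """Set complementarity map (complement and reverse each trinucleotide)."""
    return {"".join(COMPLEMENT[l] for l in reversed(t)) for t in code}

checks = {
    "X1 = P(X)": P(X) == X1_LISTED,
    "X2 = P^2(X)": P(X, 2) == X2_LISTED,
    "X = C(X)": C(X) == X,
    "C(X1) = X2": C(X1_LISTED) == X2_LISTED,
    "X2 = P(X1)": P(X1_LISTED) == X2_LISTED,
    "X1 = P^2(X2)": P(X2_LISTED, 2) == X1_LISTED,
}
for name, ok in checks.items():
    print(name, ok)
```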
Several classes of methods were developed for identifying the circular code X in genes over the last 20 years: frequency methods [2,15,16], correlation function [17], covering capability function [18], and occurrence probability of a complementary/permutation (CP) trinucleotide set at the gene population level [1].
The class of the 216 $C^3$ self-complementary trinucleotide circular codes (Definition 7; [2]; list given in Tables 4a, 5a, and 6a in [19]; [20]) is included in a larger class of codes $C$, defined in [1] by relaxing the circularity property:
Definition 8.
A trinucleotide code $C \subseteq B^3$ is $C^3$ self-complementary if $C = \mathcal{C}(C)$ (self-complementary), $\mathcal{C}(C_1) = C_2$, and $\mathcal{C}(C_2) = C_1$ ($C_1$ and $C_2$ are complementary) where $C_1 = \mathcal{P}(C)$ and $C_2 = \mathcal{P}^2(C)$.
The statistical approach developed here analyses the $C^3$ self-complementary codes (Definition 8) for searching the particular circular code X.

2.2. Gene Kingdoms

Gene kingdoms $\mathbb{K}$ of bacteria $\mathbb{B}$, archaea $\mathbb{A}$, plasmids $\mathbb{P}$, eukaryotes $\mathbb{E}$, chromosomes of eukaryotes $\mathbb{E}_{chr}$, mitochondria $\mathbb{M}$, chloroplasts $\mathbb{C}$, viruses $\mathbb{V}$, and its five taxonomic groups, i.e., double-stranded DNA viruses $\mathbb{V}_{dsDNA}$, double-stranded RNA viruses $\mathbb{V}_{dsRNA}$, single-stranded DNA viruses $\mathbb{V}_{ssDNA}$, single-stranded RNA viruses $\mathbb{V}_{ssRNA}$, and retro-transcribing viruses $\mathbb{V}_{rt}$, are obtained from the GenBank database (http://www.ncbi.nlm.nih.gov/genome/browse/, May 2016) (Table 1). Computer tests exclude genes when (i) their nucleotides do not belong to the alphabet $B$; (ii) they do not begin with a start trinucleotide $\{ATG, CTG, GTG, TTG\}$; (iii) they do not end with a stop trinucleotide $\{TAA, TAG, TGA\}$; and (iv) their lengths are not multiples of 3. To give an order of magnitude of the data acquired (details in Table 1), the kingdom of bacteria $\mathbb{B}$ contains 15,735,053 genes and 5,222,267,667 trinucleotides (7,851,762 genes and 2,481,566,882 trinucleotides in [1]), i.e., a trinucleotide increase of about 110%, and the kingdom of eukaryotes $\mathbb{E}$ contains 4,356,391 genes and 2,406,844,838 trinucleotides (1,662,579 genes and 824,825,761 trinucleotides in [1]), i.e., a trinucleotide increase of about 192%. The gene kingdoms $\mathbb{M}$, $\mathbb{C}$, $\mathbb{V}_{dsRNA}$, $\mathbb{V}_{ssDNA}$, and $\mathbb{V}_{rt}$ have gene and trinucleotide data that are significantly lower (less than 1 million trinucleotides) than the other gene kingdoms (Table 1).
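The four exclusion criteria can be expressed as a single predicate. This is a minimal sketch of such a filter; the function name `is_valid_gene` is hypothetical and the article does not describe its own implementation.

```python
ALPHABET = set("ACGT")
START = {"ATG", "CTG", "GTG", "TTG"}
STOP = {"TAA", "TAG", "TGA"}

def is_valid_gene(seq):
    """Return True when a gene passes the four tests (i)-(iv)."""
    return (
        set(seq) <= ALPHABET        # (i) only letters of the alphabet
        and seq[:3] in START        # (ii) begins with a start trinucleotide
        and seq[-3:] in STOP        # (iii) ends with a stop trinucleotide
        and len(seq) % 3 == 0       # (iv) length is a multiple of 3
    )

genes = ["ATGAACTAA", "ATGAANTAA", "AAGAACTAA", "ATGAACTAC", "ATGAACCTAA"]
print([g for g in genes if is_valid_gene(g)])
```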

2.3. Preferential Frame of a Trinucleotide in a Gene

The method developed in [1] for identifying the circular code X in genes determined the preferential frame of trinucleotides at the gene population level (kingdoms, taxonomic groups, genomes), i.e., after summing the trinucleotide frequencies of all genes in a kingdom. We extend this method to the gene level, i.e., the preferential frame of trinucleotides among the three frames is determined for each gene. There is no sum of trinucleotide frequencies of all genes in a kingdom. Thus, all the genes, i.e., of large and small lengths, have the same weight with respect to the preferential frame.
Consider a gene kingdom $\mathbb{K}$ listed in Table 1. Let $Pr_f(t, g)$ be the occurrence frequency of a trinucleotide $t \in B^3$ in a frame $f \in \{0, 1, 2\}$ of a gene $g$ belonging to a kingdom $\mathbb{K}$. Thus, there are $3 \times 64 = 192$ trinucleotide occurrence frequencies $Pr_f(t, g)$ in the three frames $f$ of a gene $g$. Then, the preferential frame $F(t, g) \in \{0, 1, 2\}$ of a trinucleotide $t$ in a gene $g$ is the frame of maximal occurrence frequency $Pr_f(t, g)$ among the three frames $f$ of $g$
$$F(t, g) = \arg\max_{f \in \{0, 1, 2\}} Pr_f(t, g). \tag{5}$$
The three frequencies of a given trinucleotide are computed in the three frames 0, 1, and 2 of a gene. Then, the preferential frame of the trinucleotide in this gene is the frame associated to its highest trinucleotide frequency.
Remark 1.
In [1], the three occurrence frequencies $Pr_f(t, \mathbb{K})$ of a trinucleotide $t$ in the three frames $f$ computed in a gene kingdom $\mathbb{K}$ always have different values, thus a unique preferential frame can be assigned to the trinucleotide. At the gene level, particularly for genes $g$ of small lengths, a trinucleotide $t$ may have an identical occurrence frequency $Pr_f(t, g)$ in two or three frames $f$. In this case, two or three preferential frames $F(t, g)$ are assigned to the trinucleotide $t$. If a trinucleotide $t$ is absent in a gene $g$, mainly for genes $g$ of very small lengths, then no preferential frame is attributed to $t$.
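Equation (5) and Remark 1 can be sketched as follows. The sketch assumes that the occurrence frequency in a frame is the count of the trinucleotide divided by the number of trinucleotides read in that frame (the article does not spell out the normalisation), and it returns a set of frames so that ties and absent trinucleotides are handled as in Remark 1. The function names and the toy gene are illustrative.

```python
from collections import Counter
from fractions import Fraction

def frame_frequencies(gene, f):
    """Occurrence frequencies of the trinucleotides read in frame f of a gene."""
    words = [gene[i:i + 3] for i in range(f, len(gene) - 2, 3)]
    counts = Counter(words)
    # Assumption: frequency = count / number of trinucleotides in this frame.
    return {t: Fraction(c, len(words)) for t, c in counts.items()}

def preferential_frames(t, gene):
    """Frames of maximal frequency of t; empty set if t is absent from the gene."""
    freqs = [frame_frequencies(gene, f).get(t, Fraction(0)) for f in range(3)]
    best = max(freqs)
    if best == 0:
        return set()
    return {f for f in range(3) if freqs[f] == best}

toy_gene = "ATGAACAACTAA"  # frame 0: ATG AAC AAC TAA
for t in ("AAC", "ACA", "GGG"):
    print(t, preferential_frames(t, toy_gene))
```

Exact fractions are used instead of floating-point numbers so that equal frequencies in two frames are detected reliably.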
The indicator function $\delta_f(F(t, g)) \in \{0, 1\}$ is 1 if the preferential frame $F(t, g)$ of a trinucleotide $t$ is equal to the frame $f$ of a gene $g$, and 0 otherwise
$$\delta_f(F(t, g)) = \begin{cases} 1 & \text{if } F(t, g) = f \\ 0 & \text{otherwise} \end{cases} \tag{6}$$
where $F(t, g)$ is defined in Equation (5).

2.4. Number of Preferential Frames of a Trinucleotide in a Gene Kingdom

The number $Nb_f(t, \mathbb{K})$ of preferential frames of a trinucleotide $t \in B^3$ for each frame $f \in \{0, 1, 2\}$ in a gene kingdom $\mathbb{K}$ is simply obtained by summing for all genes in $\mathbb{K}$
$$Nb_f(t, \mathbb{K}) = \sum_{g \in \mathbb{K}} \delta_f(F(t, g)) \tag{7}$$
where $\delta_f(F(t, g))$ is defined in Equation (6).
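The summation of Equation (7) amounts to one vote per gene and per preferential frame. The sketch below illustrates this on a toy kingdom of three short genes; the frequency normalisation (count divided by the number of trinucleotides in the frame) and all names are assumptions made for the example.

```python
from collections import Counter
from fractions import Fraction

def preferential_frames(t, gene):
    """Frames in which t has its maximal frequency (empty set if t is absent)."""
    freqs = []
    for f in range(3):
        words = [gene[i:i + 3] for i in range(f, len(gene) - 2, 3)]
        freqs.append(Fraction(Counter(words)[t], len(words)) if words else Fraction(0))
    best = max(freqs)
    return {f for f in range(3) if best > 0 and freqs[f] == best}

def count_preferential_frames(t, genes):
    """Number of genes, per frame, in which that frame is preferential for t."""
    nb = [0, 0, 0]
    for gene in genes:
        for f in preferential_frames(t, gene):
            nb[f] += 1   # each gene contributes with the same weight
    return nb

toy_kingdom = ["ATGAACAACTAA", "ATGACAACATAA", "ATGGGGTAA"]
print(count_preferential_frames("AAC", toy_kingdom))
```

A long gene and a short gene each add exactly one unit to a frame, which is the equal weighting of genes emphasised in the text.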

2.5. Occurrence Probability of a Complementary/Permutation Trinucleotide Set in a Gene Kingdom

In order to study the $C^3$ self-complementary codes $C$ (Definition 8) including the class of circular codes, and in particular the circular code X, Equation (7) for a trinucleotide $t$ is expanded to a set $T$ of six trinucleotides involving the complementarity map $\mathcal{C}$ and the permutation map $\mathcal{P}$ simultaneously, precisely $T = \{T^0, T^1, T^2\}$ with $T^0 = \{t, \mathcal{C}(t)\}$ in frame 0, $T^1 = \mathcal{P}(T^0) = \{\mathcal{P}(t), \mathcal{P}(\mathcal{C}(t))\}$ in frame 1, $T^2 = \mathcal{P}^2(T^0) = \{\mathcal{P}^2(t), \mathcal{P}^2(\mathcal{C}(t))\}$ in frame 2, and $t \in B^3 \setminus \{AAA, CCC, GGG, TTT\}$. $T$ is called a complementary and permutation (CP) trinucleotide set and is completely defined by the trinucleotide $t$.
Remark 2.
$\mathcal{P}(t) = \mathcal{C}(\mathcal{P}^2(\mathcal{C}(t)))$ and $\mathcal{P}^2(t) = \mathcal{C}(\mathcal{P}(\mathcal{C}(t)))$ (proof obvious).
When the trinucleotide $t$ is given, the trinucleotide $\mathcal{C}(t)$ is also known. Thus, there are $60/2 = 30$ CP trinucleotide sets noted $T_1, \ldots, T_{30}$ where $T_i = \{T_i^0, T_i^1, T_i^2\}$ with $T_i^0 = \{t, \mathcal{C}(t)\}_i$ in frame 0, $T_i^1 = \mathcal{P}(T^0)_i = \{\mathcal{P}(t), \mathcal{P}(\mathcal{C}(t))\}_i$ in frame 1, and $T_i^2 = \mathcal{P}^2(T^0)_i = \{\mathcal{P}^2(t), \mathcal{P}^2(\mathcal{C}(t))\}_i$ in frame 2. A maximal (20 trinucleotides) $C^3$ self-complementary code $C$ is identified with the first 10 values of the numbers $Nb(T_1, \mathbb{K}), \ldots, Nb(T_{10}, \mathbb{K})$ (defined below). Precisely, the code $C$ has 20 trinucleotides $C = C_0 = \{T_1^0, \ldots, T_{10}^0\}$ in frame 0, 20 trinucleotides $C_1 = \mathcal{P}(C) = \{T_1^1, \ldots, T_{10}^1\}$ in frame 1, and 20 trinucleotides $C_2 = \mathcal{P}^2(C) = \{T_1^2, \ldots, T_{10}^2\}$ in frame 2 with $C = \mathcal{C}(C)$ (self-complementary), $\mathcal{C}(C_1) = C_2$, and $\mathcal{C}(C_2) = C_1$ ($C_1$ and $C_2$ are complementary). There are $\binom{30}{10} = 30{,}045{,}015$ $C^3$ self-complementary trinucleotide codes, and among them only 216 are circular [2,20].
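The enumeration of the 30 CP trinucleotide sets and the count of candidate codes can be reproduced with a few lines. In this sketch each CP set is represented as a tuple of three frozensets (frames 0, 1, and 2); the function names are illustrative, and the order of the sets does not follow the ordering convention of Table 2.

```python
from itertools import product
from math import comb

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
X = set("AAC AAT ACC ATC ATT CAG CTC CTG GAA GAC "
        "GAG GAT GCC GGC GGT GTA GTC GTT TAC TTC".split())

def C(t):
    """Trinucleotide complementarity map."""
    return "".join(COMPLEMENT[l] for l in reversed(t))

def P(t, k=1):
    """Trinucleotide circular permutation map applied k times."""
    return t[k % 3:] + t[:k % 3]

def cp_sets():
    """Enumerate the CP sets (T^0, T^1, T^2), one per complementary pair."""
    seen, sets = set(), []
    for t in ("".join(p) for p in product("ACGT", repeat=3)):
        if t in ("AAA", "CCC", "GGG", "TTT") or t in seen:
            continue
        pair = (t, C(t))
        seen.update(pair)
        sets.append(tuple(frozenset(P(u, k) for u in pair) for k in range(3)))
    return sets

all_sets = cp_sets()
in_x = [T for T in all_sets if T[0] <= X]
print(len(all_sets), len(in_x), comb(30, 10))
```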
Notation 3.
A CP trinucleotide set $T = \{T^0, T^1, T^2\}$ belongs to the $C^3$ self-complementary trinucleotide circular code X, i.e., $T \in X$, if $T^0 \subset X$, i.e., if the trinucleotide $t$ and its complementary trinucleotide $\mathcal{C}(t)$ belong to X. Ten CP trinucleotide sets $T$ among 30 belong to the $C^3$ circular code X, i.e., such that 10 sets $T^0 \subset X$ with $T^1 = \mathcal{P}(T^0) \subset \mathcal{P}(X) = X_1$ and $T^2 = \mathcal{P}^2(T^0) \subset \mathcal{P}^2(X) = X_2$.
Notation 4.
In order to facilitate the reading of Table 2, the 30 CP trinucleotide sets $T = \{T^0, T^1, T^2\}$ are presented in the following way: (i) the first 10 sets $T_1, \ldots, T_{10}$ belong to the circular code X (with $T^0 = \{t, \mathcal{C}(t)\} \subset X$, $T^1 \subset X_1$ and $T^2 \subset X_2$) and are in lexicographical order with respect to the trinucleotide $t \in X$ (in bold), and (ii) the 20 remaining sets $T_{11}, \ldots, T_{30}$ are in lexicographical order with respect to the trinucleotide $t \in X_1$ (in italics).
The occurrence number $Nb(T, \mathbb{K})$ of a CP trinucleotide set $T = \{T^0, T^1, T^2\}$ in a gene kingdom $\mathbb{K}$ is equal to
$$Nb(T, \mathbb{K}) = Nb_0(t, \mathbb{K}) + Nb_0(\mathcal{C}(t), \mathbb{K}) + Nb_1(\mathcal{P}(t), \mathbb{K}) + Nb_1(\mathcal{P}(\mathcal{C}(t)), \mathbb{K}) + Nb_2(\mathcal{P}^2(t), \mathbb{K}) + Nb_2(\mathcal{P}^2(\mathcal{C}(t)), \mathbb{K}) \tag{8}$$
where $Nb_f(t, \mathbb{K})$ is defined in Equation (7).
In order to normalize the numbers $Nb(T, \mathbb{K})$, which depend on the numbers of genes in a kingdom $\mathbb{K}$, we simply define the occurrence probability $Pb(T, \mathbb{K})$ of a CP trinucleotide set $T = \{T^0, T^1, T^2\}$ in a gene kingdom $\mathbb{K}$ as follows
$$Pb(T, \mathbb{K}) = \frac{Nb(T, \mathbb{K})}{\sum_{i=1}^{30} Nb(T_i, \mathbb{K})} \tag{9}$$
where $Nb(T, \mathbb{K})$ is defined in Equation (8).
The parameter $Rk(T, \mathbb{K}) \in \{1, \ldots, 30\}$ gives the rank of the values $Pb(T, \mathbb{K})$ among the 30 CP trinucleotide sets $T$, the 1st rank being associated with the highest value of $Pb(T, \mathbb{K})$ and the 30th rank with the lowest value of $Pb(T, \mathbb{K})$.
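The normalisation of Equation (9) and the ranking can be sketched as below. The three occurrence numbers are made-up toy values used only to show the mechanics (a real kingdom has 30 CP sets), and the function names are illustrative.

```python
def occurrence_probabilities(nb):
    """Equation (9): divide each occurrence number by the sum over all CP sets."""
    total = sum(nb.values())
    return {name: value / total for name, value in nb.items()}

def ranks(pb):
    """Rank 1 goes to the highest probability, the last rank to the lowest."""
    ordered = sorted(pb, key=pb.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}

toy_nb = {"T1": 30, "T2": 50, "T3": 20}   # hypothetical occurrence numbers
toy_pb = occurrence_probabilities(toy_nb)
print(toy_pb)
print(ranks(toy_pb))
```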

2.6. A Statistical Test to Evaluate the Significance of the Obtained Ranks

In order to evaluate the statistical significance of the ranks $Rk(T, \mathbb{K})$ of the probabilities $Pb(T, \mathbb{K})$ (Equation (9)) of the 30 CP trinucleotide sets $T$ in a given kingdom $\mathbb{K}$, we derive confidence intervals for $Pb(T, \mathbb{K})$. If the confidence intervals of two probabilities $Pb(T, \mathbb{K})$ do not overlap, then their associated ranks $Rk(T, \mathbb{K})$ are assumed to be valid (in the population). Two probabilities $Pb(T, \mathbb{K})$ are compared by using the classical two-sample z-test, which is briefly recalled here.
Let $P(T)$ and $P(T')$ be the populations associated with the CP trinucleotide sets $T$ and $T'$ of probabilities $Pb(T, P)$ and $Pb(T', P)$, respectively. The probabilities $Pb(T, \mathbb{K})$ and $Pb(T', \mathbb{K})$ of $T$ and $T'$ are observed in a given gene kingdom $\mathbb{K}$ (sample) of size $n = \sum_{i=1}^{30} Nb(T_i, \mathbb{K})$ (defined from Equation (8)). The tests carried out in Section 3 are applied on large samples (the size of the smallest sample analysed being $n = 10921$ with the archaea $\mathbb{A}$). Thus, the assumptions of normality for the variables and of the homogeneity for the variances in the two populations are not needed. The equality $H_0: Pb(T, P) = Pb(T', P)$ is tested against the alternative $H_1: Pb(T, P) < Pb(T', P)$ if they are not equal. Under $H_0$ and with large samples ($n > 30$), $\min\{n\,Pb(T, P),\ n(1 - Pb(T, P)),\ n\,Pb(T', P),\ n(1 - Pb(T', P))\} > 5$ (always verified in the tests carried out in Section 3), and $T$ and $T'$ are independent events (realistic hypothesis with kingdoms $\mathbb{K}$ of large sizes), then
$$Z = \frac{Pb(T, P) - Pb(T', P)}{\sqrt{\left(\frac{n\,Pb(T, P) + n\,Pb(T', P)}{n + n}\right)\left(1 - \frac{n\,Pb(T, P) + n\,Pb(T', P)}{n + n}\right)\left(\frac{1}{n} + \frac{1}{n}\right)}} = \frac{\sqrt{2n}\,\left(Pb(T, P) - Pb(T', P)\right)}{\sqrt{\left(Pb(T, P) + Pb(T', P)\right)\left(2 - Pb(T, P) - Pb(T', P)\right)}} \sim N(0, 1).$$
The z-value and the p-value are given for each statistical test carried out in Section 3.
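The z statistic above is straightforward to compute. The sketch below applies the pooled two-sample formula with equal sample sizes and derives a one-sided p-value from the standard normal upper tail; treating the reported p-values as upper-tail probabilities is an assumption of this example, and the function name `two_sample_z` is hypothetical. The inputs are the rounded probabilities reported in Section 3 for the 10th and 11th ranks in bacteria and for $T_5$ versus $T_{15}$ in archaea.

```python
from math import erfc, sqrt

def two_sample_z(p, p_prime, n):
    """Pooled two-sample z statistic for two proportions observed with size n each."""
    pooled = (n * p + n * p_prime) / (n + n)
    z = (p - p_prime) / sqrt(pooled * (1 - pooled) * (1 / n + 1 / n))
    # Assumption: one-sided p-value, i.e., upper tail of the standard normal law.
    p_value = 0.5 * erfc(z / sqrt(2))
    return z, p_value

z_bacteria, p_bacteria = two_sample_z(0.0455, 0.0331, 422598)
z_archaea, p_archaea = two_sample_z(0.0366, 0.0339, 10921)
print(round(z_bacteria, 2), p_bacteria)
print(round(z_archaea, 2), round(p_archaea, 2))
```

Because the inputs are rounded to two decimals, the recomputed values can differ slightly in the last digit from those obtained with the unrounded probabilities.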

2.7. Explained Example of the Statistical Approach Developed

As an example, we explain the definition of the occurrence probability $Pb(T, \mathbb{K})$ (Equation (9)) which takes the value of 6.1% (see Table 2) with the CP trinucleotide set $T_1 = \{T_1^0, T_1^1, T_1^2\}$ with $T_1^0 = \{t, \mathcal{C}(t)\} = \{AAC, GTT\}$ in frame 0, $T_1^1 = \mathcal{P}(T^0)_1 = \{\mathcal{P}(t), \mathcal{P}(\mathcal{C}(t))\} = \{ACA, TTG\}$ in frame 1 and $T_1^2 = \mathcal{P}^2(T^0)_1 = \{\mathcal{P}^2(t), \mathcal{P}^2(\mathcal{C}(t))\} = \{CAA, TGT\}$ in frame 2 in the gene kingdom of bacteria $\mathbb{K} = \mathbb{B}$ (Table 1).
The $3 \times 64 = 192$ occurrence frequencies $Pr_f(t, g)$ of the 64 trinucleotides $t$ are computed in the three frames $f$ of each gene $g$ belonging to $\mathbb{B}$. Then, the preferential frame $F(t, g)$ of each trinucleotide $t$ for each gene $g$ in $\mathbb{B}$ is determined according to Equation (5). For example, with the trinucleotide $t = AAC$ in a gene $g_1$ of $\mathbb{B}$, if the frequency $Pr_0(AAC, g_1)$ of $AAC$ in frame $f = 0$ (reading frame) is greater than the two frequencies $Pr_1(AAC, g_1)$ and $Pr_2(AAC, g_1)$ of $AAC$ in frames $f = 1$ and $f = 2$, i.e., $Pr_0(AAC, g_1) > \max\{Pr_1(AAC, g_1), Pr_2(AAC, g_1)\}$, then the preferential frame of $AAC$ in $g_1$ is 0, i.e., $F(AAC, g_1) = 0$.
The indicator function $\delta_f(F(t, g))$ of each trinucleotide $t$ for each gene $g$ in $\mathbb{B}$ is obtained from Equation (6). With the previous example of $AAC$ in the gene $g_1$ of $\mathbb{B}$, the indicator function is equal to $\delta_0(F(AAC, g_1)) = 1$ for the frame $f = 0$ and $\delta_1(F(AAC, g_1)) = \delta_2(F(AAC, g_1)) = 0$ for the frames $f = 1$ and $f = 2$.
The number $Nb_f(t, \mathbb{B})$ of preferential frames of each trinucleotide $t$ for each frame $f$ in $\mathbb{B}$ is computed according to Equation (7). With the previous example of $AAC$ in $\mathbb{B}$, the following numbers are obtained: $Nb_0(AAC, \mathbb{B}) = 3486$ for the frame $f = 0$, $Nb_1(AAC, \mathbb{B}) = 1742$ for the frame $f = 1$, and $Nb_2(AAC, \mathbb{B}) = 1819$ for the frame $f = 2$. Thus, the preferential frame of $AAC$ in $\mathbb{B}$ is 0.
The occurrence number $Nb(T, \mathbb{B})$ of the 30 CP trinucleotide sets $T_i = \{T_i^0, T_i^1, T_i^2\}$ in $\mathbb{B}$ is determined according to Equation (8). With $T_1$ in $\mathbb{B}$, the following numbers are obtained: $Nb_0(GTT, \mathbb{B}) = 3765$ for the frame $f = 0$, $Nb_1(ACA, \mathbb{B}) = 4002$ and $Nb_1(TTG, \mathbb{B}) = 5650$ for the frame $f = 1$, and $Nb_2(CAA, \mathbb{B}) = 3999$ and $Nb_2(TGT, \mathbb{B}) = 4677$ for the frame $f = 2$. Then, the occurrence number of $T_1$ in $\mathbb{B}$ is equal to $Nb(T_1, \mathbb{B}) = Nb_0(AAC, \mathbb{B}) + Nb_0(GTT, \mathbb{B}) + Nb_1(ACA, \mathbb{B}) + Nb_1(TTG, \mathbb{B}) + Nb_2(CAA, \mathbb{B}) + Nb_2(TGT, \mathbb{B}) = 3486 + 3765 + 4002 + 5650 + 3999 + 4677 = 25579$.
Finally, the occurrence probability $Pb(T, \mathbb{B})$ of the 30 CP trinucleotide sets $T_i = \{T_i^0, T_i^1, T_i^2\}$ in $\mathbb{B}$ is deduced from Equation (9). With $T_1$ in $\mathbb{B}$, the occurrence probability of $T_1$ in $\mathbb{B}$ is equal to
$$Pb(T_1, \mathbb{B}) = \frac{Nb(T_1, \mathbb{B})}{\sum_{i=1}^{30} Nb(T_i, \mathbb{B})} = \frac{Nb(T_1, \mathbb{B})}{Nb(T_1, \mathbb{B}) + \ldots + Nb(T_{30}, \mathbb{B})} = \frac{25579}{25579 + \ldots + 11856} = \frac{25579}{422598} \approx 6.1\%.$$
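The arithmetic of this worked example can be replayed directly from the six numbers quoted above and the total of 422598. The dictionary layout, keyed by (trinucleotide, frame), is an illustrative choice.

```python
# Numbers of preferential frames quoted in the worked example (bacteria).
nb_by_frame = {
    ("AAC", 0): 3486, ("GTT", 0): 3765,   # T_1^0 in frame 0
    ("ACA", 1): 4002, ("TTG", 1): 5650,   # T_1^1 in frame 1
    ("CAA", 2): 3999, ("TGT", 2): 4677,   # T_1^2 in frame 2
}
nb_t1 = sum(nb_by_frame.values())          # Equation (8)
n_total = 422598                           # sum over the 30 CP sets, as reported
pb_t1 = nb_t1 / n_total                    # Equation (9)
print(nb_t1, f"{100 * pb_t1:.1f}%")
```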

3. Results

3.1. Maximal $C^3$ Self-Complementary Circular Code X in Genes

This new statistical approach shows that the same set X of 20 trinucleotides among $\binom{30}{10} = 30{,}045{,}015$ sets occurs preferentially in genes (reading frame) of bacteria $\mathbb{B}$, archaea $\mathbb{A}$, plasmids $\mathbb{P}$, eukaryotes $\mathbb{E}$, and viruses $\mathbb{V}$. This set X is the maximal $C^3$ self-complementary circular code defined in Equation (1).

3.1.1. Circular Code X in Genes of Bacteria

In the genes of bacteria $\mathbb{B}$, the 10 CP trinucleotide sets $T_1, \ldots, T_{10} \in X$ have occurrence probabilities $Pb(T, \mathbb{B})$ (Equation (9)) with the 10 highest ranks $Rk(T, \mathbb{B})$ among 30 (Table 2), i.e., $\{t, \mathcal{C}(t)\} \subset X$, $\{\mathcal{P}(t), \mathcal{P}(\mathcal{C}(t))\} \subset X_1$ and $\{\mathcal{P}^2(t), \mathcal{P}^2(\mathcal{C}(t))\} \subset X_2$, leading to the 20 trinucleotides of $X$ in frame 0, 20 trinucleotides of $X_1$ in frame 1, and 20 trinucleotides of $X_2$ in frame 2. The highest rank with $Pb(T_8, \mathbb{B}) = 8.2\%$ is related to the complementary pair $\{t, \mathcal{C}(t)\} = \{GAC, GTC\} \subset X$. The 10th rank with $Pb(T_5, \mathbb{B}) = 4.55\%$ is very significantly greater than the 11th rank with $Pb(T_{22}, \mathbb{B}) = 3.31\%$ ($n = \sum_{i=1}^{30} Nb(T_i, \mathbb{B}) = 422598$, $z$-value $= 29.33$, $p$-value $= 10^{-189}$). The 20 trinucleotides of the circular code X are identified in the genes of bacteria:
$$X_{\mathbb{B}} = X.$$
The same result is obtained at the gene level and the gene population level [1].

3.1.2. Circular Code X in Genes of Archaea

In the genes of archaea $\mathbb{A}$, the eight CP trinucleotide sets $T_1, T_2, T_4, T_6, \ldots, T_{10} \in X$ (except $T_3$ and $T_5$) have occurrence probabilities $Pb(T, \mathbb{A})$ with the eight highest ranks $Rk(T, \mathbb{A})$ among 30 (Table 2). The highest rank with $Pb(T_8, \mathbb{A}) = 9.7\%$ is also related to the complementary pair $\{GAC, GTC\} \subset X$. The CP set $T_{22} \notin X$ with $Rk(T_{22}, \mathbb{A}) = 9$ explains that the two complementary trinucleotides $\{t, \mathcal{C}(t)\} = \{ACC, GGT\} \subset X$ ($T_3$) do not occur preferentially in $\mathbb{A}$. As the CP set $T_5 \in X$, with rank $Rk(T_5, \mathbb{A}) = 13$ and $Pb(T_5, \mathbb{A}) = 3.66\%$, ranks above $T_{15}$ ($Rk(T_{15}, \mathbb{A}) = 14$, $Pb(T_{15}, \mathbb{A}) = 3.39\%$) and $T_{28}$ ($Rk(T_{28}, \mathbb{A}) = 15$, $Pb(T_{28}, \mathbb{A}) = 2.95\%$), the two complementary trinucleotides $\{t, \mathcal{C}(t)\} = \{CAG, CTG\} \subset X$ occur preferentially in $\mathbb{A}$ compared to $\{AGC, GCT\}$ ($T_{15}$) and $\{GCA, TGC\}$ ($T_{28}$). However, the statistical significance between the ranks $Rk(T_5, \mathbb{A})$ and $Rk(T_{15}, \mathbb{A})$ is not confirmed due to the lack of archaeal gene data (see Section 2.2) ($n = \sum_{i=1}^{30} Nb(T_i, \mathbb{A}) = 10921$, $z$-value $= 1.08$, $p$-value $= 0.14$). Thus, a subset of X of 18 trinucleotides (a non-maximal $C^3$ self-complementary circular code) is identified in the genes of archaea:
$$X_{\mathbb{A}} = X \setminus Y_{\mathbb{A}} \quad \text{with} \quad Y_{\mathbb{A}} = \{ACC, GGT\}.$$
Note that the code $X_{\mathbb{A}} \cup \{CAC, GTG\}$ ($T_{22}$) is the variant X code observed in Deinococcus [1]. The circular code X retrieved in the genes of archaea is a new result which was not found in a study of variant X codes in archaeal genomes [15].

3.1.3. Circular Code X in Genes of Plasmids

In the genes of plasmids $\mathbb{P}$, the 10 CP trinucleotide sets $T_1, \ldots, T_{10} \in X$ have occurrence probabilities $Pb(T, \mathbb{P})$ with the 10 highest ranks $Rk(T, \mathbb{P})$ among 30 (Table 2). The highest rank with $Pb(T_8, \mathbb{P}) = 7.8\%$ is again related to the complementary pair $\{GAC, GTC\} \subset X$. The 10th rank with $Pb(T_5, \mathbb{P}) = 3.93\%$ is very significantly greater than the 11th rank with $Pb(T_{21}, \mathbb{P}) = 3.43\%$ ($n = \sum_{i=1}^{30} Nb(T_i, \mathbb{P}) = 144366$, $z$-value $= 7.14$, $p$-value $= 10^{-13}$). The 20 trinucleotides of the circular code X are identified in the genes of plasmids:
$$X_{\mathbb{P}} = X.$$
The same result is obtained at the gene level and the gene population level [1].

3.1.4. Circular Code X in Genes of Eukaryotes

In the genes of eukaryotes $\mathbb{E}$, the 10 CP trinucleotide sets $T_1, \ldots, T_{10} \in X$ have occurrence probabilities $Pb(T, \mathbb{E})$ with the 10 highest ranks $Rk(T, \mathbb{E})$ among 30 (Table 2). The highest rank with $Pb(T_8, \mathbb{E}) = 9.0\%$ is again related to the complementary pair $\{GAC, GTC\} \subset X$. The 10th rank with $Pb(T_5, \mathbb{E}) = 4.23\%$ is greater than the 11th rank with $Pb(T_{22}, \mathbb{E}) = 3.82\%$ at a significance level of about 6% ($n = \sum_{i=1}^{30} Nb(T_i, \mathbb{E}) = 11401$, $z$-value $= 1.57$, $p$-value $= 0.06$). The 20 trinucleotides of the circular code X are identified in the genes of eukaryotes:
$$X_{\mathbb{E}} = X.$$
The same result is obtained at the gene level and the gene population level [1].
The subset $X_{\mathbb{E}_{\textit{Homo sapiens}}} = X \setminus \{ACC, GCC, GGC, GGT\}$ of X of 16 trinucleotides in the genes of Homo sapiens identified at the gene level is also identical to the subset found at the gene population level [1].

3.1.5. Circular Code X in Genes of Eukaryotic Chromosomes

The statistical analysis in Section 3.1.4 takes the eukaryotic genome as the genetic information unit. Indeed, Equation (7) with $g \in \mathbb{E}$ is achieved with $Card(\mathbb{E}) = 190$ eukaryotic genomes (see Table 1). We complete this classical approach by choosing the eukaryotic chromosome as the genetic information unit. Thus, Equation (7) with $g \in \mathbb{E}_{chr}$ is performed with $Card(\mathbb{E}_{chr}) = 2979$ eukaryotic chromosomes of $Card(\mathbb{E}) = 190$ genomes (see Table 1).
In the genes of eukaryotic chromosomes $\mathbb{E}_{chr}$, the 10 CP trinucleotide sets $T_1, \ldots, T_{10} \in X$ have occurrence probabilities $Pb(T, \mathbb{E}_{chr})$ with the 10 highest ranks $Rk(T, \mathbb{E}_{chr})$ among 30 (Table 2). The highest rank with $Pb(T_8, \mathbb{E}_{chr}) = 9.1\%$ is again related to the complementary pair $\{GAC, GTC\} \subset X$. The 10th rank with $Pb(T_3, \mathbb{E}_{chr}) = 4.74\%$ is very significantly greater than the 11th rank with $Pb(T_{22}, \mathbb{E}_{chr}) = 4.47\%$ ($n = \sum_{i=1}^{30} Nb(T_i, \mathbb{E}_{chr}) = 179136$, $z$-value $= 3.86$, $p$-value $= 10^{-5}$). The 20 trinucleotides of the circular code X are identified in the genes of eukaryotic chromosomes:
$$X_{\mathbb{E}_{chr}} = X.$$
It is a new result which completes the statistical analysis of genes in eukaryotic genomes (Section 3.1.4).

3.1.6. Non-Maximal Circular Code X in Genes of Eukaryotic Organelles

The genes of eukaryotic organelles, i.e., mitochondria and chloroplasts, are investigated with this statistical approach. It should be stressed that the available data are far fewer than for the other gene kingdoms studied (less than 1 million trinucleotides for each class of organelles, see Table 1). However, some statistical trends can already be observed with the trinucleotides in the preferential frame.

Non-Maximal Circular Code X in Genes of Mitochondria

Surprisingly, in the genes of mitochondria $\mathbb{M}$, the four CP trinucleotide sets $T_9, T_7, T_8, T_3 \in X$ have occurrence probabilities $Pb(T, \mathbb{M})$ with the four highest ranks $Rk(T, \mathbb{M})$ among 30 (Table 2). The CP set $T_{28} \notin X$ with $Rk(T_{28}, \mathbb{M}) = 5$ explains that the two complementary trinucleotides $\{CAG, CTG\} \subset X$ ($T_5$) do not occur preferentially in $\mathbb{M}$. The CP set $T_{25} \notin X$ with $Rk(T_{25}, \mathbb{M}) = 6$ determines that the two complementary trinucleotides $\{CTC, GAG\} \subset X$ ($T_6$) do not occur preferentially in $\mathbb{M}$. The CP set $T_{24} \notin X$ with $Rk(T_{24}, \mathbb{M}) = 7$ implies that the two complementary trinucleotides $\{ATC, GAT\} \subset X$ ($T_4$) do not occur preferentially in $\mathbb{M}$. The CP set $T_{17} \notin X$ with $Rk(T_{17}, \mathbb{M}) = 11$ explains that the two complementary trinucleotides $\{AAT, ATT\} \subset X$ ($T_2$) do not occur preferentially in $\mathbb{M}$. Thus, a subset of X of 12 trinucleotides (a non-maximal $C^3$ self-complementary circular code) is identified in the genes of mitochondria $\mathbb{M}$:
$$X_{\mathbb{M}} = X \setminus Y_{\mathbb{M}} \quad \text{with} \quad Y_{\mathbb{M}} = \{AAT, ATC, ATT, CAG, CTC, CTG, GAG, GAT\}.$$
This subset $X_{\mathbb{M}} = \{AAC, ACC, GAA, GAC, GCC, GGC, GGT, GTA, GTC, GTT, TAC, TTC\}$ is very close to the subset $\widetilde{X_{\mathbb{M}}} = \{ACC, ATC, CTC, GAA, GAC, GAT, GCC, GGC, GGT, GTA, GTC, GTT, TTC\}$ of X of 13 trinucleotides previously identified by inspection in mitochondrial genes [21], as $X_{\mathbb{M}} \cap \widetilde{X_{\mathbb{M}}} = \{ACC, GAA, GAC, GCC, GGC, GGT, GTA, GTC, GTT, TTC\}$ has 10 trinucleotides in common.

Non-Maximal Circular Code X in Genes of Chloroplasts

In the genes of chloroplasts $\mathbb{C}$, the highest occurrences of CP trinucleotide sets again belong to the circular code $X$. The three CP trinucleotide sets $T_2, T_9, T_3 \in X$ have occurrence probabilities $Pb(T, \mathbb{C})$ with the three highest ranks $Rk(T, \mathbb{C})$ among 30 (Table 2). The CP set $T_{13} \notin X$ with $Rk(T_{13}, \mathbb{C}) = 4$ explains that the two complementary trinucleotides $\{GAC, GTC\} \subset X$ ($T_8$) do not occur preferentially in $\mathbb{C}$. The CP set $T_{28} \notin X$ with $Rk(T_{28}, \mathbb{C}) = 5$ states that the two complementary trinucleotides $\{CAG, CTG\} \subset X$ ($T_5$) do not occur preferentially in $\mathbb{C}$. The CP set $T_{14} \notin X$ with $Rk(T_{14}, \mathbb{C}) = 8$ implies that the two complementary trinucleotides $\{GTA, TAC\} \subset X$ ($T_{10}$) do not occur preferentially in $\mathbb{C}$. The CP set $T_{18} \notin X$ with $Rk(T_{18}, \mathbb{C}) = 10$ explains that the two complementary trinucleotides $\{ATC, GAT\} \subset X$ ($T_4$) do not occur preferentially in $\mathbb{C}$. The CP set $T_{25} \notin X$ with $Rk(T_{25}, \mathbb{C}) = 12$ implies that the two complementary trinucleotides $\{CTC, GAG\} \subset X$ ($T_6$) do not occur preferentially in $\mathbb{C}$. Thus, a subset of $X$ of 10 trinucleotides (a non-maximal $C^3$ self-complementary circular code) is identified in the genes of chloroplasts $\mathbb{C}$:
$$X_{\mathbb{C}} = X \setminus Y_{\mathbb{C}} \quad \text{with} \quad Y_{\mathbb{C}} = \{ATC, CAG, CTC, CTG, GAC, GAG, GAT, GTA, GTC, TAC\}. \tag{16}$$

3.1.7. Circular Code X in Genes of Viruses

In the genes of viruses $\mathbb{V}$, the nine CP trinucleotide sets $T_1, \ldots, T_4, T_6, \ldots, T_{10} \in X$ (except $T_5$) have occurrence probabilities $Pb(T, \mathbb{V})$ with the nine highest ranks $Rk(T, \mathbb{V})$ among 30 (Table 2). The highest rank with $Pb(T_8, \mathbb{V}) = 7.2\%$ is again related to the complementary pair $\{GAC, GTC\} \subset X$. The CP set $T_{15} \notin X$ with $Rk(T_{15}, \mathbb{V}) = 10$ explains that the two complementary trinucleotides $\{CAG, CTG\} \subset X$ ($T_5$) do not occur preferentially in $\mathbb{V}$. Thus, a subset of $X$ of 18 trinucleotides (a non-maximal $C^3$ self-complementary circular code) is identified in the genes of viruses:
$$X_{\mathbb{V}} = X \setminus Y_{\mathbb{V}} \quad \text{with} \quad Y_{\mathbb{V}} = \{CAG, CTG\}. \tag{17}$$
The statistical analysis of viral genes at the gene population level [1] could not decide between the two codes $X_{18} = X \setminus \{CAG, CTG\}$ and $X_{16} = X \setminus \{CAG, CTG, GTA, TAC\}$. The statistical analysis at the gene level confirms the code $X_{\mathbb{V}} = X_{18}$ of 18 trinucleotides in the genes of viruses.
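The reasoning used for each kingdom can be read as a comparison within a triple of CP sets: a pair {t, C(t)} of X is retained when its occurrence probability exceeds that of the two CP sets containing its circular permutations P(t) and P²(t). The sketch below applies this reading to the viral column of Table 2. It is one interpretation of the argument, not the article's own implementation; the helper names (`pb`, `excluded_from_x`) and the dictionary layout are assumptions made for illustration.

```python
def P(t):
    """Circular permutation l1 l2 l3 -> l2 l3 l1."""
    return t[1:] + t[0]

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def C(t):
    """Complementary trinucleotide, read antiparallel."""
    return t.translate(_COMPLEMENT)[::-1]

# Pb(T, V) in % for the 30 CP sets {t, C(t)} in viral genes (Table 2),
# keyed by the trinucleotide t listed first in each row.
PB_V = {
    "AAC": 6.6, "AAT": 6.6, "ACC": 4.9, "ATC": 6.5, "CAG": 2.9,
    "CTC": 5.6, "GAA": 5.2, "GAC": 7.2, "GCC": 5.9, "GTA": 4.7,
    "AAG": 3.3, "ACA": 1.2, "ACG": 1.8, "ACT": 3.6, "AGC": 4.0,
    "AGG": 2.7, "ATA": 2.3, "ATG": 2.5, "CCA": 2.0, "CCG": 1.5,
    "GCG": 2.9, "GTG": 3.3, "TAG": 1.7, "TCA": 0.7, "TCC": 1.7,
    "TCG": 1.0, "TCT": 1.6, "TGC": 3.0, "TTA": 1.1, "TTG": 2.1,
}

# One representative t per CP set T1..T10 of X.
X_PAIRS = ["AAC", "AAT", "ACC", "ATC", "CAG", "CTC", "GAA", "GAC", "GCC", "GTA"]

def pb(t):
    """Occurrence probability of the CP set that contains t."""
    return PB_V[t] if t in PB_V else PB_V[C(t)]

def excluded_from_x():
    """Pairs of X whose permuted CP sets have a higher occurrence probability."""
    excluded = set()
    for t in X_PAIRS:
        best_rival = max(pb(P(t)), pb(P(P(t))))
        if pb(t) < best_rival:
            excluded |= {t, C(t)}
    return excluded

Y_V = excluded_from_x()
print(sorted(Y_V))
```

With the values of Table 2, only the pair of T5 loses against a permuted CP set (T15, with 4.0% against 2.9%), which matches the set excluded for viruses.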

3.2. Circular Code X Found in DNA and RNA Genomes and in Double-Stranded and Single-Stranded Genomes

The self-complementary property of the circular code $X$ has been related since 1996 to the complementary property of the DNA double helix. To explore this idea further, we searched with this statistical approach for the circular code $X$ in five important sub-classes of viral genes, distinguished by genome type (DNA or RNA) and strandedness (double-stranded or single-stranded), i.e., in the genes of double-stranded DNA viruses $\mathbb{V}_{dsDNA}$, double-stranded RNA viruses $\mathbb{V}_{dsRNA}$, single-stranded DNA viruses $\mathbb{V}_{ssDNA}$, single-stranded RNA viruses $\mathbb{V}_{ssRNA}$, and retro-transcribing viruses $\mathbb{V}_{rt}$.
In the genes of double-stranded DNA viruses $\mathbb{V}_{dsDNA}$, the 10 CP trinucleotide sets $T_1, \ldots, T_{10} \in X$ have occurrence probabilities $Pb(T, \mathbb{V}_{dsDNA})$ with the 10 highest ranks $Rk(T, \mathbb{V}_{dsDNA})$ among 30 (Table 2). Thus, the circular code $X$ is found in $\mathbb{V}_{dsDNA}$:
$$X_{\mathbb{V}_{dsDNA}} = X. \tag{18}$$
In the genes of double-stranded RNA viruses $\mathbb{V}_{dsRNA}$, single-stranded RNA viruses $\mathbb{V}_{ssRNA}$, and retro-transcribing viruses $\mathbb{V}_{rt}$, the nine CP trinucleotide sets $T_1, \ldots, T_4, T_6, \ldots, T_{10} \in X$ (except $T_5$) have occurrence probabilities $Pb(T, \mathbb{V}_{dsRNA})$, $Pb(T, \mathbb{V}_{ssRNA})$, and $Pb(T, \mathbb{V}_{rt})$, respectively, with the highest ranks $Rk(T, \mathbb{V}_{dsRNA})$, $Rk(T, \mathbb{V}_{ssRNA})$, and $Rk(T, \mathbb{V}_{rt})$ among 30 (Table 2): the nine highest ranks in $\mathbb{V}_{dsRNA}$ and $\mathbb{V}_{rt}$, and nine of the 10 highest ranks in $\mathbb{V}_{ssRNA}$, where $T_{15} \notin X$ has rank 9 and $T_{10}$ rank 10. Note that the ranks $Rk(T, \mathbb{V}_{dsRNA})$, $Rk(T, \mathbb{V}_{ssRNA})$, and $Rk(T, \mathbb{V}_{rt})$ for a given CP trinucleotide set are not identical (Table 2). Thus, by using the reasoning mentioned previously ($T_{15} \notin X$ ranks above $T_5$, i.e., $Pb(T_{15}, \cdot) > Pb(T_5, \cdot)$, in each of $\mathbb{V}_{dsRNA}$, $\mathbb{V}_{ssRNA}$, and $\mathbb{V}_{rt}$), a subset of $X$ of 18 trinucleotides is observed in $\mathbb{V}_{dsRNA}$, $\mathbb{V}_{ssRNA}$, and $\mathbb{V}_{rt}$:
$$X_{\mathbb{V}_{dsRNA}} = X \setminus Y_{\mathbb{V}_{dsRNA}} \quad \text{with} \quad Y_{\mathbb{V}_{dsRNA}} = \{CAG, CTG\}, \tag{19}$$
$$X_{\mathbb{V}_{ssRNA}} = X \setminus Y_{\mathbb{V}_{ssRNA}} \quad \text{with} \quad Y_{\mathbb{V}_{ssRNA}} = \{CAG, CTG\}, \tag{20}$$
$$X_{\mathbb{V}_{rt}} = X \setminus Y_{\mathbb{V}_{rt}} \quad \text{with} \quad Y_{\mathbb{V}_{rt}} = \{CAG, CTG\}. \tag{21}$$
In the genes of single-stranded DNA viruses $\mathbb{V}_{ssDNA}$, the eight CP trinucleotide sets $T_1, \ldots, T_4, T_6, \ldots, T_9 \in X$ (except $T_5$ and $T_{10}$) have occurrence probabilities $Pb(T, \mathbb{V}_{ssDNA})$ with eight of the 10 highest ranks $Rk(T, \mathbb{V}_{ssDNA})$ among 30, ranks 7 and 9 being taken by $T_{14}, T_{15} \notin X$ (Table 2). Thus, by using the reasoning mentioned previously ($T_{15} \notin X$ ranks above $T_5$, i.e., $Pb(T_{15}, \mathbb{V}_{ssDNA}) > Pb(T_5, \mathbb{V}_{ssDNA})$, and $T_{14} \notin X$ ranks above $T_{10}$, i.e., $Pb(T_{14}, \mathbb{V}_{ssDNA}) > Pb(T_{10}, \mathbb{V}_{ssDNA})$), a subset of $X$ of 16 trinucleotides is observed in $\mathbb{V}_{ssDNA}$:
$$X_{\mathbb{V}_{ssDNA}} = X \setminus Y_{\mathbb{V}_{ssDNA}} \quad \text{with} \quad Y_{\mathbb{V}_{ssDNA}} = \{CAG, CTG, GTA, TAC\}. \tag{22}$$
All these results show that the circular code $X$ is found almost perfectly in DNA genomes, RNA genomes, double-stranded genomes, and single-stranded genomes. The very few exceptions, either the two trinucleotides $\{CAG, CTG\}$ or, in one case, the four trinucleotides $\{CAG, CTG, GTA, TAC\}$, are related to the CP set or the two CP sets having the lowest occurrence among the 10 CP sets $T_1, \ldots, T_{10} \in X$.

4. Conclusions

The “universal” occurrence in genes of the same set $X$ of 20 trinucleotides, which in addition has the mathematical property of being a circular code, must be confirmed by several statistical approaches and various gene data analyses at different levels: kingdom, taxonomic group, genome, and gene. All the previous approaches have studied and identified the circular code $X$ at the gene population level (kingdom, taxonomic group, and genome) [1,2,15,16,17,21]. The statistical approach at the gene level developed here, for the first time since 1996, analyses the preferential occurrence of trinucleotides among the three frames of each gene. This new methodology allows all genes, i.e., of large and small lengths, to be considered with the same weight. As a consequence, the concept of circular code, in particular the reading frame retrieval, is directly associated with each gene. Thus, $X$ motifs from the circular code $X$ at different locations in a gene may assist the ribosome in maintaining and synchronizing the reading frame. The number, the cardinality, and the length of $X$ motifs in genes may be associated with the length, the function, and the ancestry of genes. This question is currently under investigation.
At the gene level, the circular code $X$ is strengthened in the genes of bacteria, eukaryotes, plasmids, and viruses, and is now also identified in the genes of archaea. In addition to eukaryotic genomes, it is also found in the genes of eukaryotic chromosomes. The genes of mitochondria and chloroplasts contain a subset of the circular code $X$. It should be stressed that some mitochondrial and chloroplast genes lack the stop codon and are excluded from this data acquisition. Such a statistical bias may prevent a proper detection of preferential frames for some trinucleotides in the genes of eukaryotic organelles. The circular code $X$ is searched in the large class of $\binom{30}{10} = 30{,}045{,}015$ $C^3$ self-complementary trinucleotide codes which contains in particular the 216 maximal $C^3$ self-complementary circular codes. Thus, for a basic order of magnitude, the probability to retrieve the same circular code $X$ in four independent gene kingdoms (bacteria $\mathbb{B}$, plasmids $\mathbb{P}$, eukaryotes $\mathbb{E}$, double-stranded DNA viruses $\mathbb{V}_{dsDNA}$) is equal to $1 / \binom{30}{10}^4 \approx 10^{-30}$.
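The order of magnitude quoted above follows from a direct count. The sketch below assumes, as the text does, that a code is chosen uniformly among the ways of selecting 10 of the 30 CP sets, independently in each of the four kingdoms; the variable names are illustrative.

```python
import math

# Number of ways to choose 10 CP sets among 30.
n_codes = math.comb(30, 10)

# Probability of drawing the same code in four independent kingdoms.
p_same_code = 1 / n_codes**4

print(n_codes)
print(f"{p_same_code:.1e}")
```

The count is 30,045,015 and the probability is slightly above 10^-30, consistent with the order of magnitude given in the text.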
In the genes of the bacterial, eukaryotic, and plasmid kingdoms, 14 among the 47 studied gene taxonomic groups (about 30%) have variant $X$ codes [1], i.e., trinucleotide codes which differ from $X$. Seven variant $X$ codes are identified. However, all have at least 16 trinucleotides of $X$. Two variant $X$ codes, $X_A$ (according to the notation in [1]) in cyanobacteria and plasmids of cyanobacteria, and $X_D$ in birds, are self-complementary, without permuted trinucleotides, but are non-circular. Five variant $X$ codes, $X_B$ in Deinococcus, plasmids of chloroflexi and Deinococcus, mammals, and kinetoplasts, $X_C$ in elusimicrobia and apicomplexans, $X_E$ in fishes, $X_F$ in insects, and $X_G$ in basidiomycetes and plasmids of spirochaetes, are $C^3$ self-complementary circular. Thus, two variant $X$ codes, $X_A$ and $X_D$, are not circular and do not belong to the set of the 216 maximal $C^3$ self-complementary circular codes [2] having the strong mathematical structure of the dihedral group [20]. The reason could be related to the gene data or to a biological property which remains to be identified. All these variant $X$ codes in the genes are identified at the taxonomic group level. However, as the circular code $X$ is now also identified at the gene level, variant $X$ codes may also be associated with genes belonging to the same genome but with different protein coding functions. This interesting and open problem should be investigated in the future.
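The three properties distinguished here (self-complementary, no permuted trinucleotides, circular) can be tested mechanically. The sketch below assumes the graph characterisation reported in [13]: a trinucleotide code is circular if and only if the directed graph with edges b1 → b2b3 and b1b2 → b3, for each trinucleotide b1b2b3, has no cycle. The function names and the six-trinucleotide `TOY` code are hypothetical illustrations; `TOY` is not one of the variant codes $X_A$ or $X_D$, whose composition is given in [1].

```python
X = {
    "AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
    "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC",
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def P(t):
    """Circular permutation l1 l2 l3 -> l2 l3 l1."""
    return t[1:] + t[0]

def C(t):
    """Complementary trinucleotide, read antiparallel."""
    return t.translate(_COMPLEMENT)[::-1]

def is_self_complementary(code):
    return {C(t) for t in code} == set(code)

def has_permuted_pair(code):
    """True if the code contains a trinucleotide together with a permutation of it."""
    return any(P(t) in code or P(P(t)) in code for t in code)

def is_circular(code):
    """Acyclicity test of the graph associated with the code (criterion of [13])."""
    edges = {}
    for b1, b2, b3 in code:
        edges.setdefault(b1, set()).add(b2 + b3)
        edges.setdefault(b1 + b2, set()).add(b3)
    state = {}  # node -> 1 while on the DFS stack, 2 once fully explored

    def has_cycle(node):
        state[node] = 1
        for nxt in edges.get(node, ()):
            if state.get(nxt) == 1:
                return True
            if state.get(nxt) is None and has_cycle(nxt):
                return True
        state[node] = 2
        return False

    return not any(state.get(n) is None and has_cycle(n) for n in list(edges))

def is_c3(code):
    """X, P(X) and P^2(X) must all be circular."""
    return all(is_circular(c) for c in (
        set(code), {P(t) for t in code}, {P(P(t)) for t in code}))

# Hypothetical toy code: the circular word ACCGGT reads as ACC.GGT or CCG.GTA.
TOY = {"ACC", "GGT", "CCG", "CGG", "GTA", "TAC"}

print(is_self_complementary(X), has_permuted_pair(X), is_c3(X))
print(is_self_complementary(TOY), has_permuted_pair(TOY), is_circular(TOY))
```

The toy code shows that self-complementarity and the absence of permuted trinucleotides do not by themselves guarantee circularity, which is the situation described for the two non-circular variant codes.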
A probability measure of the reading frame retrieval ($RFR$) of each trinucleotide of $X$ has been introduced in [22] and [23] (Section 2.2 and 1st row of Table 1). The $RFR$ probability $Pr_{RFR}$ of the circular code $X$, i.e., the average $RFR$ probability of the 20 trinucleotides of $X$, is equal to $Pr_{RFR}(X) = 82.5\%$ (Result 5 in [22]; 1st row of Table 1 in [23]). This $RFR$ measure can be applied to the non-maximal $C^3$ self-complementary circular codes, precisely to the excluded trinucleotides $Y_{\mathbb{A}} = \{ACC, GGT\}$ of archaea (Equation (11)), $Y_{\mathbb{M}} = \{AAT, ATC, ATT, CAG, CTC, CTG, GAG, GAT\}$ of mitochondria (Equation (15)), $Y_{\mathbb{C}} = \{ATC, CAG, CTC, CTG, GAC, GAG, GAT, GTA, GTC, TAC\}$ of chloroplasts (Equation (16)), $Y_{\mathbb{V}} = \{CAG, CTG\}$ of viruses (Equation (17)), and $Y_{\mathbb{V}_{ssDNA}} = \{CAG, CTG, GTA, TAC\}$ of single-stranded DNA viruses (Equation (22)). The computation leads to $Pr_{RFR}(Y_{\mathbb{A}}) = 69.0\%$, $Pr_{RFR}(Y_{\mathbb{M}}) = 88.5\%$, $Pr_{RFR}(Y_{\mathbb{C}}) = 87.1\%$, $Pr_{RFR}(Y_{\mathbb{V}}) = 100.0\%$, and $Pr_{RFR}(Y_{\mathbb{V}_{ssDNA}}) = 85.7\%$. Archaeal genes miss two trinucleotides of $X$ which have the lowest $RFR$ values. In contrast, mitochondrial, chloroplast, and viral genes miss trinucleotides of $X$ with high $RFR$ values. Thus, the genes in reduced genomes are more flexible in translation, allowing overlap coding by frameshifting in agreement with [24] (and the cited references). However, it should be stressed that this result may vary as more gene data of eukaryotic organelles become available. The circular code $X$ (20 trinucleotides), with the functions of reading frame retrieval and maintenance in regular RNA transcription, may also have, through its bijective transformation codes, the same functions in nucleotide exchanging RNA transcription in mitochondrial genes [23].
Indeed, as the mitochondrial gamma polymerase has bacterial origins (e.g., [25]), mitochondrial polymerization and its associated bijective transformations might use the circular code $X$. However, at the translational level, the ribosome might follow the non-maximal $C^3$ self-complementary circular code $X_{\mathbb{M}}$ observed in mitochondrial genes (Equation (15)). A similar explanation could be applied to the chloroplast genes, which also have bacterial origins (cyanobacteria).
By a study of viral genes, the circular code $X$ is found in DNA genomes, RNA genomes, double-stranded genomes, and single-stranded genomes. Thus, the reading frame retrieval property of $X$ could operate for translating DNA and RNA genes, in particular for the “primitive” RNA genes. The $C^3$ property of $X$ could be involved in translating the two shifted frames in DNA and RNA genes, in particular for optimizing the genomes of small sizes. The complementarity property of $X$ is naturally associated with the double-stranded DNA and RNA genomes. It could also be used to pair single-stranded DNA genomes with each other and single-stranded RNA genomes with each other. Thus, the $C^3$ and complementary properties of $X$ could be involved in translating the three frames (reading frame and its two shifted frames) in one strand and the three frames in the complementary strand of DNA and RNA genes.
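The relation between the two strands can be illustrated on a short sequence. The sketch below checks that $X$ is self-complementary, that the complement of the shifted code $P(X)$ is the other shifted code $P^2(X)$, and that a concatenation of trinucleotides of $X$ read on the complementary strand is again a concatenation of trinucleotides of $X$. The 12-nucleotide sequence is a made-up example, not taken from any gene, and the helper names are illustrative.

```python
X = {
    "AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
    "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC",
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def P(t):
    """Circular permutation l1 l2 l3 -> l2 l3 l1."""
    return t[1:] + t[0]

def reverse_complement(seq):
    """Complementary strand, read in its own 5'-3' direction."""
    return seq.translate(_COMPLEMENT)[::-1]

def frame0(seq):
    """Trinucleotides of seq read from its first nucleotide."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]

X1 = {P(t) for t in X}       # trinucleotides of X shifted by one nucleotide
X2 = {P(P(t)) for t in X}    # trinucleotides of X shifted by two nucleotides

# Hypothetical sequence built from four trinucleotides of X.
seq = "GACGCCAACGTT"
rc = reverse_complement(seq)

print(frame0(seq))
print(rc, frame0(rc))
print({reverse_complement(t) for t in X} == X)
print({reverse_complement(t) for t in X1} == X2)
```

Both strands of the example decompose into trinucleotides of $X$ in their own reading frame, and the two shifted codes are exchanged by complementation.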
In summary, this new statistical approach at the gene level, which is applied to massive gene data, identifies the maximal $C^3$ self-complementary trinucleotide circular code $X$ in the genes of bacteria, archaea, eukaryotes, plasmids, and viruses, which may be involved in translation coding [3].

Acknowledgments

I thank the three reviewers for their advice, and Denise Besch, Svetlana Gorchkova, Elisabeth Michel, Professor Jacques Streith, and Jean-Marc Vassards for their support.

Conflicts of Interest

The author declares no conflict of interest.

References

1. Michel, C.J. The maximal C3 self-complementary trinucleotide circular code X in genes of bacteria, eukaryotes, plasmids and viruses. J. Theor. Biol. 2015, 380, 156–177.
2. Arquès, D.G.; Michel, C.J. A complementary circular code in the protein coding genes. J. Theor. Biol. 1996, 182, 45–58.
3. Michel, C.J. Circular code motifs in transfer and 16S ribosomal RNAs: A possible translation code in genes. Comput. Biol. Chem. 2012, 37, 24–37.
4. El Soufi, K.; Michel, C.J. Circular code motifs in genomes of eukaryotes. J. Theor. Biol. 2016, 408, 198–212.
5. Michel, C.J. Circular code motifs in transfer RNAs. Comput. Biol. Chem. 2013, 45, 17–29.
6. El Soufi, K.; Michel, C.J. Circular code motifs in the ribosome decoding center. Comput. Biol. Chem. 2014, 52, 9–17.
7. El Soufi, K.; Michel, C.J. Circular code motifs near the ribosome decoding center. Comput. Biol. Chem. 2015, 59, 158–176.
8. El Soufi, K.; Michel, C.J. Unitary circular code motifs in genomes of eukaryotes. Biosystems 2017, in press.
9. Canapa, A.; Cerioni, P.N.; Barucca, M.; Olmo, E.; Caputo, V. A centromeric satellite DNA may be involved in heterochromatin compactness in gobiid fishes. Chromosome Res. 2002, 10, 297–304.
10. Gemayel, R.; Vinces, M.D.; Legendre, M.; Verstrepen, K.J. Variable tandem repeats accelerate evolution of coding and regulatory sequences. Annu. Rev. Genet. 2010, 44, 445–477.
11. Pirillo, G. A characterization for a set of trinucleotides to be a circular code. In Determinism, Holism, and Complexity; Pellegrini, C., Cerrai, P., Freguglia, P., Benci, V., Israel, G., Eds.; Kluwer Academic Publisher: New York, NY, USA, 2003.
12. Michel, C.J.; Pirillo, G. Identification of all trinucleotide circular codes. J. Theor. Biol. 2010, 34, 122–125.
13. Fimmel, E.; Michel, C.J.; Strüngmann, L. n-Nucleotide circular codes in graph theory. Philos. Trans. R. Soc. A 2016, 374, 20150058.
14. Bussoli, L.; Michel, C.J.; Pirillo, G. On conjugation partitions of sets of trinucleotides. Appl. Math. 2012, 3, 107–112.
15. Frey, G.; Michel, C.J. Circular codes in archaeal genomes. J. Theor. Biol. 2003, 223, 413–431.
16. Frey, G.; Michel, C.J. Identification of circular codes in bacterial genomes and their use in a factorization method for retrieving the reading frames of genes. Comput. Biol. Chem. 2006, 30, 87–101.
17. Arquès, D.G.; Michel, C.J. A code in the protein coding genes. Biosystems 1997, 44, 107–134.
18. Gonzalez, D.L.; Giannerini, S.; Rosa, R. Circular codes revisited: a statistical approach. J. Theor. Biol. 2011, 275, 21–28.
19. Michel, C.J.; Pirillo, G.; Pirillo, M.A. A relation between trinucleotide comma-free codes and trinucleotide circular codes. Theor. Comput. Sci. 2008, 401, 17–26.
20. Fimmel, E.; Giannerini, S.; Gonzalez, D.L.; Strüngmann, L. Circular codes, symmetries and transformations. J. Math. Biol. 2015, 70, 1623–1644.
21. Arquès, D.G.; Michel, C.J. A circular code in the protein coding genes of mitochondria. J. Theor. Biol. 1997, 189, 273–290.
22. Ahmed, A.; Frey, G.; Michel, C.J. Essential molecular functions associated with circular code evolution. J. Theor. Biol. 2010, 264, 613–622.
23. Michel, C.J.; Seligmann, H. Bijective transformation circular codes and nucleotide exchanging RNA transcription. Biosystems 2014, 118, 39–50.
24. Seligmann, H. Chimeric mitochondrial peptides from contiguous regular and swinger RNA. Comput. Struct. Biotechnol. J. 2016, 14, 283–297.
25. Wolf, Y.I.; Koonin, E.V. Origin of an animal mitochondrial DNA polymerase subunit via lineage-specific acquisition of a glycyl-tRNA synthetase from bacteria of the Thermus-Deinococcus group. Trends Genet. 2001, 17, 431–433.
Figure 1. Model of evolution of the $X$ circular code motifs (Equation (1)) by increasing its cardinality (composition) and decreasing its length. Evolution begins with $X$ motifs of minimal cardinality 1 (long repeated trinucleotides) in genomes (the examples given are extracted from Table 2 in [8]). Then, the mutations in repeated trinucleotides lead to $X$ motifs of low cardinality $< 10$ (different and short repeated trinucleotides) in genomes (the examples given are extracted from Table 4 in [8]) up to $X$ motifs of high cardinality $\geq 10$ and maximal cardinality 20 coding the 12 amino acids (Equation (2)).
Table 1. Kingdoms $\mathbb{K}$ of genes extracted from the GenBank database (http://www.ncbi.nlm.nih.gov/genome/browse/, May 2016) with their symbol and their numbers of genomes, genes, and trinucleotides.

| Kingdom $\mathbb{K}$ (Symbol) | Number of genomes | Number of genes | Number of trinucleotides |
|---|---|---|---|
| Bacteria $\mathbb{B}$ | 7039 | 15,735,053 | 5,222,267,667 |
| Archaea $\mathbb{A}$ | 182 | 282,802 | 81,460,549 |
| Plasmids $\mathbb{P}$ | 2319 | 575,760 | 159,169,387 |
| Eukaryotes $\mathbb{E}$ | 190 | 4,356,391 | 2,406,844,838 |
| Chromosomes of eukaryotes $\mathbb{E}_{chr}$ | 2979 | 4,356,391 | 2,406,844,838 |
| Mitochondria $\mathbb{M}$ | 228 | 3347 | 862,327 |
| Chloroplasts $\mathbb{C}$ | 39 | 3192 | 925,303 |
| Viruses $\mathbb{V}$ | 5217 | 299,401 | 66,677,580 |
| Double-stranded DNA viruses $\mathbb{V}_{dsDNA} \subset \mathbb{V}$ | 2480 | 259,696 | 59,239,700 |
| Double-stranded RNA viruses $\mathbb{V}_{dsRNA} \subset \mathbb{V}$ | 211 | 1061 | 783,020 |
| Single-stranded DNA viruses $\mathbb{V}_{ssDNA} \subset \mathbb{V}$ | 715 | 3291 | 802,405 |
| Single-stranded RNA viruses $\mathbb{V}_{ssRNA} \subset \mathbb{V}$ | 1257 | 5093 | 4,406,365 |
| Retro-transcribing viruses $\mathbb{V}_{rt} \subset \mathbb{V}$ | 137 | 560 | 289,447 |
Table 2. Identification of the maximal $C^3$ self-complementary trinucleotide circular code $X$ in gene kingdoms $\mathbb{K}$ of bacteria $\mathbb{B}$, archaea $\mathbb{A}$, plasmids $\mathbb{P}$, eukaryotes $\mathbb{E}$, chromosomes of eukaryotes $\mathbb{E}_{chr}$, mitochondria $\mathbb{M}$, chloroplasts $\mathbb{C}$, viruses $\mathbb{V}$, and its five taxonomic groups: double-stranded DNA viruses $\mathbb{V}_{dsDNA}$, double-stranded RNA viruses $\mathbb{V}_{dsRNA}$, single-stranded DNA viruses $\mathbb{V}_{ssDNA}$, single-stranded RNA viruses $\mathbb{V}_{ssRNA}$, and retro-transcribing viruses $\mathbb{V}_{rt}$ (Table 1). Occurrence probability $Pb(T, \mathbb{K})$ (%) of the 30 complementary and permutation (CP) trinucleotide sets $T = \{T^0, T^1, T^2\}$ with $T^0 = \{t, C(t)\}$ in frame 0, $T^1 = P(T^0) = \{P(t), P(C(t))\}$ in frame 1, $T^2 = P^2(T^0) = \{P^2(t), P^2(C(t))\}$ in frame 2, in a gene kingdom $\mathbb{K}$ computed according to Equation (9) and its rank $Rk(T, \mathbb{K})$, the 1st rank being associated to the highest value of $Pb(T, \mathbb{K})$ and the 30th rank, to the lowest value of $Pb(T, \mathbb{K})$. The 20 trinucleotides of the $C^3$ self-complementary circular code $X$ are in bold, the 20 trinucleotides of the circular code $X_1 = P(X)$ are in italics, and the 20 trinucleotides of the circular code $X_2 = P^2(X)$ are both in bold and italics. The first 10 CP sets $T_1, \ldots, T_{10}$ belong to the circular code $X$ ($T^0 = \{t, C(t)\} \subset X$ with $T^1 = P(T^0) \subset P(X) = X_1$ and $T^2 = P^2(T^0) \subset P^2(X) = X_2$) and are in lexicographical order with respect to the trinucleotide $t \in X$ in bold, and the 20 remaining CP sets $T_{11}, \ldots, T_{30}$ are in lexicographical order with respect to the trinucleotide $t \in X_1$ in italics. The numbers in italics occurring with the CP sets $T_1, \ldots, T_{10}$ are associated with the two trinucleotides $T^0 = \{t, C(t)\}$ of $X$ which do not occur preferentially in the gene kingdom.

| $T$ | $t$ | $C(t)$ | $\mathbb{B}$ Pb (Rk) | $\mathbb{A}$ Pb (Rk) | $\mathbb{P}$ Pb (Rk) | $\mathbb{E}$ Pb (Rk) | $\mathbb{E}_{chr}$ Pb (Rk) | $\mathbb{M}$ Pb (Rk) | $\mathbb{C}$ Pb (Rk) | $\mathbb{V}$ Pb (Rk) | $\mathbb{V}_{dsDNA}$ Pb (Rk) | $\mathbb{V}_{dsRNA}$ Pb (Rk) | $\mathbb{V}_{ssDNA}$ Pb (Rk) | $\mathbb{V}_{ssRNA}$ Pb (Rk) | $\mathbb{V}_{rt}$ Pb (Rk) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $T_1$ | **AAC** | **GTT** | 6.1 (6) | 7.4 (4) | 6.0 (4) | 8.5 (3) | 8.4 (3) | 4.4 (9) | 4.9 (9) | 6.6 (3) | 6.8 (4) | 6.4 (3) | 5.9 (1) | 7.0 (3) | 5.0 (5) |
| $T_2$ | **AAT** | **ATT** | 7.4 (2) | 5.3 (8) | 7.3 (2) | 8.7 (2) | 8.7 (2) | *3.4 (14)* | 5.8 (1) | 6.6 (2) | 7.5 (2) | 5.9 (4) | 5.6 (2) | 6.3 (4) | 5.3 (4) |
| $T_3$ | **ACC** | **GGT** | 5.1 (9) | *3.9 (12)* | 5.1 (8) | 5.1 (8) | 4.7 (10) | 5.7 (4) | 5.2 (3) | 4.9 (8) | 4.9 (9) | 4.6 (8) | 4.6 (5) | 5.3 (7) | 3.7 (9) |
| $T_4$ | **ATC** | **GAT** | 6.5 (4) | 6.7 (5) | 6.2 (3) | 8.1 (4) | 8.2 (4) | *3.6 (13)* | *4.6 (14)* | 6.5 (4) | 6.7 (5) | 6.4 (2) | 5.4 (4) | 7.1 (2) | 6.0 (2) |
| $T_5$ | **CAG** | **CTG** | 4.6 (10) | 3.7 (13) | 3.9 (10) | 4.2 (10) | 4.9 (9) | *0.7 (30)* | *0.0 (30)* | *2.9 (15)* | 3.8 (10) | *2.4 (18)* | *2.8 (19)* | *1.5 (24)* | *2.9 (18)* |
| $T_6$ | **CTC** | **GAG** | 6.2 (5) | 7.5 (3) | 5.9 (6) | 7.0 (5) | 7.5 (5) | *2.6 (18)* | *0.4 (27)* | 5.6 (6) | 6.3 (6) | 5.3 (7) | 4.1 (10) | 5.4 (6) | 4.9 (6) |
| $T_7$ | **GAA** | **TTC** | 5.8 (7) | 5.3 (7) | 5.5 (7) | 5.0 (9) | 5.2 (7) | 6.3 (2) | 4.9 (7) | 5.2 (7) | 5.6 (7) | 5.3 (6) | 4.6 (6) | 4.9 (8) | 5.4 (3) |
| $T_8$ | **GAC** | **GTC** | 8.2 (1) | 9.7 (1) | 7.8 (1) | 9.0 (1) | 9.1 (1) | 5.8 (3) | *0.6 (26)* | 7.2 (1) | 8.0 (1) | 7.3 (1) | 5.4 (3) | 7.3 (1) | 6.4 (1) |
| $T_9$ | **GCC** | **GGC** | 6.7 (3) | 8.2 (2) | 6.0 (5) | 5.7 (6) | 5.2 (8) | 7.1 (1) | 5.3 (2) | 5.9 (5) | 7.0 (3) | 5.5 (5) | 4.3 (8) | 5.7 (5) | 4.7 (7) |
| $T_{10}$ | **GTA** | **TAC** | 5.4 (8) | 6.6 (6) | 5.0 (9) | 5.4 (7) | 5.7 (6) | 4.6 (8) | *4.8 (11)* | 4.7 (9) | 5.3 (8) | 4.6 (9) | *4.0 (11)* | 4.5 (10) | 4.3 (8) |
| $T_{11}$ | *AAG* | ***CTT*** | 3.0 (13) | 4.1 (11) | 2.9 (16) | 3.4 (14) | 3.5 (13) | 1.4 (27) | 3.2 (18) | 3.3 (12) | 3.0 (14) | 3.2 (13) | 3.6 (12) | 3.6 (13) | 2.9 (16) |
| $T_{12}$ | *ACA* | ***TGT*** | 1.1 (26) | 1.4 (20) | 1.1 (26) | 0.3 (30) | 0.2 (30) | 4.3 (10) | 2.9 (19) | 1.2 (27) | 0.9 (26) | 1.3 (28) | 1.4 (28) | 1.5 (26) | 1.5 (29) |
| $T_{13}$ | *ACG* | ***CGT*** | 1.6 (22) | 0.3 (28) | 1.8 (21) | 0.6 (26) | 0.6 (25) | 1.9 (24) | 5.0 (4) | 1.8 (22) | 1.4 (23) | 1.6 (25) | 3.2 (16) | 1.6 (22) | 2.1 (24) |
| $T_{14}$ | *ACT* | ***AGT*** | 2.9 (15) | 1.3 (22) | 3.3 (13) | 3.0 (15) | 2.5 (15) | 2.1 (22) | 4.9 (8) | 3.6 (11) | 3.1 (13) | 3.7 (11) | 4.3 (7) | 4.1 (11) | 3.4 (14) |
| $T_{15}$ | *AGC* | ***GCT*** | 2.9 (14) | 3.4 (14) | 3.1 (15) | 3.4 (13) | 3.1 (14) | 3.9 (12) | 4.9 (6) | 4.0 (10) | 3.6 (11) | 4.2 (10) | 4.3 (9) | 4.6 (9) | 3.5 (13) |
| $T_{16}$ | *AGG* | ***CCT*** | 2.3 (19) | 1.5 (19) | 2.5 (19) | 2.4 (16) | 2.0 (17) | 2.2 (21) | 4.8 (13) | 2.7 (17) | 2.5 (17) | 2.6 (17) | 3.5 (13) | 2.7 (17) | 2.4 (22) |
| $T_{17}$ | *ATA* | ***TAT*** | 1.6 (21) | 4.1 (10) | 1.6 (24) | 0.8 (23) | 0.9 (23) | 4.2 (11) | 2.7 (20) | 2.3 (19) | 1.7 (20) | 2.9 (16) | 2.8 (21) | 2.7 (16) | 2.9 (17) |
| $T_{18}$ | *ATG* | ***CAT*** | 3.1 (12) | 2.5 (16) | 3.2 (14) | 1.3 (20) | 1.4 (19) | 1.8 (26) | 4.8 (10) | 2.5 (18) | 2.6 (15) | 2.3 (19) | 3.1 (17) | 1.8 (20) | 2.7 (20) |
| $T_{19}$ | *CCA* | ***TGG*** | 1.6 (23) | 1.3 (21) | 1.8 (20) | 1.1 (22) | 0.8 (24) | 2.8 (16) | 4.5 (15) | 2.0 (21) | 1.6 (22) | 2.3 (20) | 2.6 (22) | 1.8 (19) | 2.8 (19) |
| $T_{20}$ | *CCG* | ***CGG*** | 0.6 (28) | 0.1 (29) | 0.7 (28) | 0.8 (24) | 1.0 (22) | 0.8 (29) | 0.6 (25) | 1.5 (26) | 0.8 (28) | 1.4 (27) | 2.8 (20) | 1.5 (25) | 2.3 (23) |
| $T_{21}$ | *GCG* | ***CGC*** | 2.7 (17) | 1.7 (18) | 3.4 (11) | 3.5 (12) | 3.9 (12) | 2.0 (23) | 4.2 (17) | 2.9 (16) | 2.4 (18) | 3.2 (15) | 3.4 (14) | 3.0 (14) | 3.6 (12) |
| $T_{22}$ | *GTG* | ***CAC*** | 3.3 (11) | 4.7 (9) | 3.4 (12) | 3.8 (11) | 4.5 (11) | 1.8 (25) | 0.3 (28) | 3.3 (13) | 3.5 (12) | 3.2 (14) | 3.2 (15) | 2.9 (15) | 3.7 (10) |
| $T_{23}$ | *TAG* | ***CTA*** | 1.7 (20) | 2.1 (17) | 1.6 (22) | 1.6 (19) | 1.8 (18) | 3.0 (15) | 0.3 (29) | 1.7 (24) | 1.6 (21) | 2.1 (22) | 1.8 (26) | 1.6 (23) | 2.0 (25) |
| $T_{24}$ | *TCA* | ***TGA*** | 0.4 (29) | 0.8 (25) | 0.6 (29) | 0.6 (25) | 0.4 (28) | 4.8 (7) | 0.7 (24) | 0.7 (30) | 0.6 (30) | 1.0 (29) | 0.9 (30) | 0.8 (30) | 0.9 (30) |
| $T_{25}$ | *TCC* | ***GGA*** | 1.5 (24) | 1.0 (24) | 1.6 (23) | 0.6 (27) | 0.6 (26) | 5.0 (6) | 4.8 (12) | 1.7 (23) | 1.1 (25) | 1.9 (23) | 2.4 (23) | 2.0 (18) | 2.7 (21) |
| $T_{26}$ | *TCG* | ***CGA*** | 0.2 (30) | 0.1 (30) | 0.4 (30) | 0.4 (29) | 0.3 (29) | 2.6 (17) | 4.3 (16) | 1.0 (29) | 0.7 (29) | 0.8 (30) | 1.7 (27) | 1.0 (28) | 1.7 (28) |
| $T_{27}$ | *TCT* | ***AGA*** | 1.2 (25) | 0.6 (27) | 1.5 (25) | 1.6 (18) | 1.3 (21) | 2.4 (19) | 2.1 (22) | 1.6 (25) | 1.4 (24) | 1.8 (24) | 2.0 (25) | 1.7 (21) | 1.7 (27) |
| $T_{28}$ | *TGC* | ***GCA*** | 2.5 (18) | 3.0 (15) | 2.8 (18) | 2.4 (17) | 2.0 (16) | 5.2 (5) | 4.9 (5) | 3.0 (14) | 2.6 (16) | 3.4 (12) | 2.9 (18) | 3.7 (12) | 3.2 (15) |
| $T_{29}$ | *TTA* | ***TAA*** | 0.9 (27) | 0.6 (26) | 1.0 (27) | 0.5 (28) | 0.4 (27) | 2.3 (20) | 1.6 (23) | 1.1 (28) | 0.9 (27) | 1.5 (26) | 1.4 (29) | 1.0 (29) | 1.9 (26) |
| $T_{30}$ | *TTG* | ***CAA*** | 2.8 (16) | 1.2 (23) | 2.9 (17) | 1.2 (21) | 1.4 (20) | 1.2 (28) | 2.3 (21) | 2.1 (20) | 2.2 (19) | 2.2 (21) | 2.3 (24) | 1.4 (27) | 3.6 (11) |

Share and Cite

MDPI and ACS Style

Michel, C.J. The Maximal C3 Self-Complementary Trinucleotide Circular Code X in Genes of Bacteria, Archaea, Eukaryotes, Plasmids and Viruses. Life 2017, 7, 20. https://doi.org/10.3390/life7020020
