Article
Peer-Review Record

Are There Seven Symbols for the Nucleotide-Based Genetic Code?

Appl. Sci. 2024, 14(20), 9176; https://doi.org/10.3390/app14209176 (registering DOI)
by Adam Kłóś 1,2, Przemysław M. Płonka 1,* and Krzysztof Baczyński 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 19 August 2024 / Revised: 25 September 2024 / Accepted: 1 October 2024 / Published: 10 October 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Dear Authors, after reading your study and proposed hypothesis I have mainly a single (but maybe crucial) suggestion:

You state that the genetic code is composed of 4 letters (A, T, G, C), and based on this you propose a 7-letter model that is extended by "purines", "pyrimidines", and "any nucleotides". However, according to the IUPAC (see e.g. here: https://www.bioinformatics.org/sms/iupac.html), there are as many as 15 letters (consider e.g. W (weak) for A or T; or S (strong) for G or C).

Overall, the use of the 7-letter model in your analysis should be justified: why not use a 10-letter model, a 12-letter model, or even the full 15-letter model? I am a molecular biologist, not a specialist in natural languages; still, this caught my eye and should be explained in detail.

In addition, I am not sure about the methods used. You state that "The coding sequences from each organism were extracted in random order and merged into one continuous text of size 3Mbp". This means that only a small subset of coding sequences were analyzed e.g. for the higher eukaryotic genomes?

Also, I lack some particular examples pointing to the utility of the proposed approach.

Overall the study is certainly interesting, but it needs to be sold much better. In the current state, I am afraid that the readers will not appreciate the complex results provided. 

 

 

Author Response

Why 7-symbol alphabet?

 

Our 7-symbol alphabet is specifically designed to represent the degeneracy of the genetic code due to wobble base pairing at the third position of a codon during translation. The code table (Tab. 1) we present is a modified version of the standard genetic code table, with the introduction of "purine," "pyrimidine," and "any symbol" characters at the third position. The table structure allows for a clear visualization of codon families that encode the same amino acid, highlighting how changes in the third base (wobble position) often do not affect the amino acid outcome.

 

We have limited our alphabet to just 7 symbols (4 standard nucleotides + 3 additional symbols) as this is sufficient to achieve an accurate representation of codon degeneracy due to wobble base pairing. Expanding the alphabet further would introduce unnecessary complexity without providing additional relevant information for our specific purpose.

 

It's important to note that our approach differs from the IUPAC code in both purpose and implementation:

  • The IUPAC code was created with different objectives in mind, such as representing uncertainty in sequencing data, consensus sequences, single-nucleotide polymorphisms, etc. Our code, in contrast, is specifically tailored to represent codon degeneracy.
  • In our study, the introduction of additional symbols (X, Y, `*`) is context-dependent, based on the first and second nucleotides in the triplet. This context determines the possibilities for the third position, ruling out other options. The IUPAC code does not take into account the position of the nucleotide within a codon, and therefore requires a more extensive set of symbols (10 to 15) to cover all possibilities.

 

By focusing on the specific context of codon degeneracy and wobble base pairing, our 7-symbol alphabet provides a concise yet comprehensive representation of this phenomenon, serving the unique aims of our study.
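
As an illustration only, and not a reproduction of the rule set in Table 1, the context-dependent replacement of the third codon position could be sketched as follows. The sketch assumes that X stands for a purine (A or G), Y for a pyrimidine (T or C), and `*` for any nucleotide, and it derives each replacement from the standard genetic code; the function names `contextual_codon` and `to_contextual` are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def contextual_codon(codon):
    """Rewrite the third base as X, Y or * when the first two bases allow it.

    Assumption of this sketch: X = purine, Y = pyrimidine, * = any base.
    """
    prefix, third = codon[:2], codon[2]
    aa = {b: CODON_TABLE[prefix + b] for b in BASES}
    if len(set(aa.values())) == 1:            # fourfold-degenerate family
        return prefix + "*"
    if third in "AG" and aa["A"] == aa["G"]:  # both purines give the same product
        return prefix + "X"
    if third in "TC" and aa["T"] == aa["C"]:  # both pyrimidines give the same product
        return prefix + "Y"
    return codon                              # third base is informative; keep it


def to_contextual(cds):
    """Translate a coding sequence codon by codon, ignoring a trailing partial codon."""
    usable = len(cds) - len(cds) % 3
    return "".join(contextual_codon(cds[i:i + 3]) for i in range(0, usable, 3))
```

Under these assumptions, GCT becomes GC`*`, GAT becomes GAY, and GAA becomes GAX, while ATG and TGG are left unchanged because their third base alone determines the encoded product.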

 

The size of cds texts

We understand the reviewer's objection regarding the seemingly small size of our analyzed text (3Mbp). However, our choice to use this subset of coding sequences was deliberate and based on several important considerations.

  • The most important thing was to ensure the universality of our findings. Our primary goal was to demonstrate the results are valid across a diverse range of organisms. Focusing solely on organisms with large genomes (primarily eukaryotes) would have limited the scope and applicability of our study. Even with the 3Mbp limit, we encountered challenges in finding sequenced genomes for some families of archaea and bacteria. Therefore, 3Mbp represents a carefully considered compromise between sequence length and taxonomic diversity.
  • Besides, we conducted a comparative analysis to test how genome size influences the appearance of frequency-range plots. Our results showed no visual difference in the distribution patterns between 3Mbp samples and whole genomes in the examined cases. This finding supports the claim that the size of the examined text does not substantially influence the outcomes of our study.
  • Finally, there is a matter of standardization. The use of a uniform 3Mbp size across all studied organisms provides a standardized basis for comparison, eliminating potential biases that could arise from vastly different genome sizes.

 

In light of these factors, we are confident that our choice of 3Mbp coding sequences strikes an appropriate balance between sample size and diversity.
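
For readers who want to picture the sampling step, a minimal sketch is given below. It assumes that the coding sequences are shuffled, concatenated until the target length is reached, and then cut to exactly that length; the truncation rule, the function name `sample_cds_text`, and the `seed` parameter are assumptions of this sketch rather than details taken from the paper.

```python
import random


def sample_cds_text(cds_list, target_len=3_000_000, seed=None):
    """Merge coding sequences in random order into one text of at most target_len bases."""
    rng = random.Random(seed)        # a seeded generator makes the sample reproducible
    order = list(cds_list)
    rng.shuffle(order)
    parts, total = [], 0
    for seq in order:
        if total >= target_len:      # stop once enough sequence has been collected
            break
        parts.append(seq)
        total += len(seq)
    # Assumed rule: cut the merged text to the target length.
    return "".join(parts)[:target_len]
```

Because 3,000,000 is a multiple of three, the cut does not split a codon as long as every coding sequence has a length divisible by three. A genome with fewer than 3Mbp of coding sequence simply yields a shorter text.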

 

The utility of the proposed approach

To address potential practical implications, we have added a section to the article outlining some general propositions for potential applications of our approach.

 

"The presented hypothesis touches on the very foundation of biology and, therefore, may have profound consequences for biological studies. The 7-symbol approach may reshape genetic text analysis, deeply influencing bioinformatics, biophysics, systems biology, and, more broadly, a substantial part of biological science. Several potential applications are envisioned below.

Firstly, the method introduced in this study could be used to characterize genomic sequences. For instance, our results indicate that the frequency-range distribution of genomes significantly differs from that of pseudo-genomes constructed from random text. This difference could potentially be quantified to develop a 'randomness' scale. Estimating the degree of similarity between a sequence and random text may provide valuable insights into the amount of potentially useful information it carries, and thus its functionality and importance. While this study focused on coding sequences (CDS), it is plausible that analyses of other genomic features such as introns, untranslated regions (UTRs), promoters, and others may yield distinct results. These differences could potentially serve as a tool for initial classification of unknown sequences, guiding researchers towards their putative functions and characteristics.

In the realm of bioinformatics and genomic data management, the proposed 7-symbol alphabet (A, T, C, G, X, Y, `*`) could streamline genome analysis by reducing sequence data complexity while preserving its functional significance. This codon compression approach could prove particularly beneficial for large-scale comparative genomics studies. Moreover, it might enhance the performance and accuracy of sequence alignment algorithms, especially for distantly related sequences, and could lead to more efficient compression and storage of genomic data. Consequently, the development of novel bioinformatics tools and algorithms designed to work with this simplified representation of genetic information would be possible.

The rapidly evolving field of pattern recognition and artificial intelligence in genomic studies could also benefit from this method. The extended alphabet could potentially improve the performance of machine learning algorithms in genomics. For instance, capturing essential information about amino acid properties (hidden in code degeneracy) might enhance predictions of protein structure and function from genomic sequences. Furthermore, this genomic representation might reveal patterns or motifs in genomic sequences that are not easily detectable in the standard 4-letter code.

In evolutionary studies, the 7-symbol representation of genomic information could provide novel insights into phylogenetic relationships by focusing on more fundamental aspects of sequence conservation. This approach could be particularly valuable for analyzing highly divergent sequences, where traditional methods may be less effective. On a more fundamental level, this simplified representation might offer new perspectives on the evolution of the genetic code and the underlying principles of its organization. Finally, should the 7-symbol code representation prove useful, its implementation would need to be considered in synthetic biology, potentially leading to the design of artificial genetic systems with reduced complexity but retained functionality.

Importantly, demonstrating the links between linguistic and genomic analysis, as presented in this proposition, could renew interest in alternative approaches to genome examination, such as linguistic genomics, and foster further exchange of ideas and methodologies between these fields."

 

 

The major text improvements:

  • To improve understanding of the experiment design and method used, Figure 2, which illustrates in detail the types of tokenization used, was added.
  • The results were presented in a new version of Table 3 that summarizes observations, calculations, and estimates the goodness of fit for power law distribution in genomic and random texts.
  • The discussion was rewritten to enhance the study's conclusions and provide a clearer explanation of the results.
  • A practical application was added.
  • The random text analysis was repeated for 60 tests per analysis (780 texts in total), to match the exact number of genomic texts, which makes the comparison between genes and random texts more reliable.
  • Figures 3, 5, 6, 7, 8, and 9 were updated to improve their visibility, readability, or to include all texts in the case of random text.

 

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The research presented is very interesting and well defined. I think it is a quality paper. However, there are some elements that should be revised and improved.

The word biosemiotics is used, but there is no reference to this approach throughout the text. Perhaps it would be more appropriate to change this term to “code biology” and to make a brief mention of Marcello Barbieri's proposal. In my opinion, the focus of this research is not biosemiotics; rather, it would fall within the biology of code.

I say this, since the focus of the work seems to be more in line with Barbieri's proposal, which has already been extensively studied. In this sense, I think it would be sufficient to make a brief reference to his approach. Consider reviewing this monograph, for example: https://www.sciencedirect.com/special-issue/10S60V7SHC6. 

The introduction shows us the research context in a clear way and is sufficiently explanatory. However, I consider that the research objectives are not well defined or sufficiently clear. This should be corrected. In addition, it would be good if the authors could explain better the reason for opting for a linguistic analysis to analyze the genome. We know that there are other types of analysis. What would be the reason to consider using the approach they propose?

Second, I think the authors should revise the design of the experiment to make it clearer. I suspect that there may be interested readers who are not familiar with the tokenization process; it should be explained in some depth. Similarly, the design of the experiment is too dense. I think an effort should be made to clarify it. In fact, it is not clear why Tables 1 and 2 are included. Line 223 talks about statistical evaluation and I think it would be more appropriate to talk about statistical analysis.

In the results section, the name of D. melanogaster should be changed, as it is always italicized. In addition, if the fly genome is mentioned in this section, I think it would be convenient to indicate it clearly in the methodological section.

At the beginning of the discussion, reference is made to some conclusions. These conclusions should be in the corresponding section.

Author Response

Keyword: biosemiotics

 

Biosemiotics is an emerging and dynamic field of study, the precise boundaries of which are still being defined. It can be broadly characterized as the discipline that investigates the production, action, and interpretation of signs and codes within biological systems. The second-level representation of genetic information presented in this study could potentially be situated within the domain of genetic semiosis research, a subset of biosemiotic inquiry. However, the reviewer is correct that using 'biosemiotics' as a keyword could be misleading, as the topic focuses on experimental results without presenting them in the context of the broader biosemiotic literature. The authors were concerned that such contextualization could unnecessarily extend the paper and would be more appropriately addressed in a separate article dedicated to that purpose. In light of these considerations, the suggestion to employ the term "code biology" appears to be more accurate and appropriate in this context. To acknowledge Barbieri's seminal contributions to the field of code biology, his work has been cited.

 

Citation:

  • Barbieri, M. The Organic Codes: An Introduction to Semantic Biology; Cambridge University Press,
  • Barbieri, M. Code Biology: A New Science of Life; Springer International Publishing, https://doi.org/10.1007/978-3-319-14535-8

 

Research Objectives

The following section was added to the paper:

 

“The primary objectives of this study are to (1) investigate this alternative representation of nucleotide sequences using a seven-character alphabet derived from genetic code degeneracy, (2) assess whether coding sequences (CDS) rewritten using a 7-symbol alphabet and tokenized accordingly exhibit power law distribution in their frequency-range analysis, indicating meaningful informational structures, (3) evaluate the effectiveness of triplet tokenization and frame tokenization in detecting this semiotic information within genomic sequences by examining the conformity of their texts' frequency distributions to a Zipf plot, and (4) differentiate between genuine genomic patterns and randomly generated texts, thereby validating that the observed results are not due to chance.”
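
To make the notion of a Zipf-type analysis concrete, a minimal sketch follows. It assumes that tokens are non-overlapping triplets, which is only one possible reading of "triplet tokenization" (the definitions used in the study are those of Figure 2), and it estimates the slope of the log-log rank-frequency plot by ordinary least squares, a crude stand-in for a formal goodness-of-fit test. All function names are hypothetical.

```python
import math
from collections import Counter


def triplet_tokens(text):
    """Split a text into non-overlapping triplets, dropping a trailing partial one."""
    usable = len(text) - len(text) % 3
    return [text[i:i + 3] for i in range(0, usable, 3)]


def rank_frequency(tokens):
    """Return (rank, count) pairs, with rank 1 assigned to the most frequent token."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(counts, start=1))


def loglog_slope(pairs):
    """Least-squares slope of log(count) against log(rank); needs at least two ranks."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(count) for _, count in pairs]
    n = len(pairs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

A slope close to -1 on such a plot is the classical Zipf pattern; running the same functions on a genomic text and on a random text of equal length gives one simple way to compare the two.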

 

Adding explanation of tokenization methods

 

In order to facilitate understanding of the experimental design, Figure 2 has been added to explain the types of tokenization used.

 

Why are Tables 1 and 2 included?

  • Table 1 is vital in understanding the methodology of our study. As explained in the paper: "The original texts were translated into contextual texts based on the rules shown in Table 1." In other words, Table 1 provides the sets of rules for transforming nucleotide text to contextual text.

 

  • Table 2 provides explanations for the analysis names used in the study. Due to the complexity of the experiment, the use of shorthand notations such as S0TCAC and SNTTAN was unavoidable. Recognizing that these designations might be overwhelming for readers, we have included Table 2 as a quick reference guide, allowing for easy verification of each analysis's meaning.

 

Discussion should correspond to conclusion.

 

  • The discussion has been rewritten, the conclusion has been completed, and care was taken to ensure that the points in the conclusion correspond to the structure of the discussion.
  • The discussion was supplemented with practical applications.

 

The major text improvements:

  • To improve understanding of the experiment design and method used, Figure 2, which illustrates in detail the types of tokenization used, was added.
  • The results were presented in a new version of Table 3 that summarizes observations, calculations, and estimates the goodness of fit for power law distribution in genomic and random texts.
  • The discussion was rewritten to enhance the study's conclusions and provide a clearer explanation of the results.
  • A practical application was added.
  • The random text analysis was repeated for 60 tests per analysis (780 texts in total), to match the exact number of genomic texts, which makes the comparison between genes and random texts more reliable.
  • Figures 3, 5, 6, 7, 8, and 9 were updated to improve their visibility, readability, or to include all texts in the case of random text.

 

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Dear Authors, thank you for the revised version of the manuscript. I feel it has been significantly improved and will be much more accessible to a broader readership. Good luck with follow-up research!

Reviewer 2 Report

Comments and Suggestions for Authors

The article has been significantly enhanced and is now more coherent and accessible. The objectives are clearly delineated, and the methodology has been substantially refined through the incorporation of visual aids that facilitate a more comprehensive understanding of the process. The results have also been meticulously revised. The discussion has been considerably expanded, imparting greater depth and scientific rigor. Initially, we were examining an intriguing paper; now, we have a highly sophisticated piece of research. 
