*3.4. Detection of Other Context Features of RNA*

Statistical approaches are rather efficient when a set of potential characteristic features has been determined for a pool of sequences and is used as a reference in the analysis. However, specific cis-regulatory sequences in mRNAs that can modulate translation are identified using specialized prediction approaches and/or resources. The examples below illustrate how cis-regulatory sequences are predicted, using internal ribosome entry sites (IRESs) and upstream ORFs (uORFs), first and foremost conserved peptide uORFs (CPuORFs), as case studies.

IRESs are nucleotide sequences that mediate translation initiation of alternative reading frames (aORFs) under stress conditions, when the canonical cap-dependent translation mechanism is inhibited without corresponding changes in gene transcription [40]. In general, the IRESs of plant mRNAs, unlike those of viruses, display considerable diversity in both nucleotide composition and structure [41]. Despite this diversity, characteristic functional features are distinguishable in IRESs, namely, (i) the presence and localization of several start codons and (ii) the fact that some IRESs carry short conserved modules, which are recognized by the plant translational machinery and are directly involved in the immobilization of the small ribosomal subunit [42]. Polypurine blocks residing close to the start codon are an example of such conserved motifs and may contribute directly to this immobilization [43].

The mRNAs potentially carrying IRESs can be selected by analyzing experimental data obtained by polysome or ribosome profiling followed by deep sequencing and/or by mass spectrometry analysis. First and foremost, such mRNAs must retain a high level of translational activity under adverse environmental factors and carry additional alternative start codons. The following strategy is appropriate for further selection and analysis of the mRNAs carrying IRESs: (i) interspecific comparison of the transcript sequences of homologous genes, which identifies conserved regions in the vicinity (30–50 nucleotides) of the alternative start codon, followed by (ii) assessment of the context of the alternative start codon, where an optimal context suggests that translation can potentially start from it. This strategy has been successfully implemented for predicting translation initiation of a short aORF with the involvement of a polypurine block via internal ribosome entry [43]. Note that the commonly accepted confirmation of IRES activity is still its ability to provide coordinated translation of reporter genes within a bicistronic transcript (see below).
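The second step of this strategy, together with the search for polypurine blocks near an alternative start codon, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in [43]: the function names, the minimum block length of 8 purines, and the 50-nucleotide window are hypothetical choices, and the context check is reduced to the widely cited simplified rule of a purine at position -3 and G at position +4 relative to the A of AUG.

```python
import re

PURINES = set("AG")


def polypurine_blocks(seq, min_len=8):
    """Return (start, end) spans of uninterrupted purine runs >= min_len.

    min_len is an illustrative threshold, not a published cutoff.
    """
    pattern = r"[AG]{%d,}" % min_len
    return [(m.start(), m.end()) for m in re.finditer(pattern, seq)]


def start_codon_context(seq, pos):
    """Simplified context check for the ATG whose A is at index `pos`:
    purine at -3 and G at +4 (assumption: reduced Kozak-type rule)."""
    minus3 = seq[pos - 3] if pos >= 3 else ""
    plus4 = seq[pos + 3] if pos + 3 < len(seq) else ""
    return {"purine_at_-3": minus3 in PURINES, "G_at_+4": plus4 == "G"}


def candidate_starts(transcript, window=50, min_len=8):
    """List ATG codons that have a polypurine block ending at most
    `window` nucleotides upstream, with their simplified context."""
    seq = transcript.upper().replace("U", "T")
    blocks = polypurine_blocks(seq, min_len)
    hits = []
    # Lookahead so that overlapping matches are not skipped.
    for m in re.finditer(r"(?=ATG)", seq):
        pos = m.start()
        nearby = [b for b in blocks if b[1] <= pos and pos - b[1] <= window]
        if nearby:
            hit = {"atg": pos, "blocks": nearby}
            hit.update(start_codon_context(seq, pos))
            hits.append(hit)
    return hits


if __name__ == "__main__":
    # Toy sequence: a 10-nt purine run followed by an ATG in a favorable context.
    toy = "CCCCGAAGAAGGAACCACCATGGCTTAA"
    for hit in candidate_starts(toy):
        print(hit)
```

Candidates flagged in this way would still need the interspecific comparison of step (i) and experimental confirmation in a bicistronic reporter assay.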

The advent of RP and high-throughput sequencing technologies made it possible to determine translation start sites and to discover numerous mRNAs with aORFs that (i) may have a putatively inert sequence acting as a mere translation barrier upstream of the main ORF or (ii) may encode short peptides, in which case they are referred to as CPuORFs [44]. The main difference between CPuORFs and other ORFs is their length; although there are no strictly defined limits, ORFs shorter than 200–250 codons are regarded as short. In general, the search for CPuORFs is analogous to the prediction of main ORFs, and in most cases it relies on interspecific comparison of CPuORF sequences to identify conserved regions and to estimate coding potential. This strategy is based on revealing homology between such short peptide sequences and depends to a considerable degree on the range of species selected for comparison. For example, it is quite possible that the CPuORFs preserved within the plant species selected for comparison are insufficient to reveal the conserved regions. In this case, the analyzed CPuORFs will not be identified even though their sequences are sufficiently conserved among other species. Conversely, when CPuORF sequences are compared among closely related species, the observed similarity between short uORF peptide sequences may result from retention of the nucleotide sequence rather than from conservation of the peptides themselves. To overcome the problems associated with selecting species for comparative analysis, a new method for CPuORF identification, BAIUCAS, was developed and tested [45]. BAIUCAS uses EST (expressed sequence tag) data sets for thousands of plant species to search for homologs. The BAIUCAS algorithm consists of six successive stages: (i) exhaustive search for uORFs; (ii) search for homologs of CPuORF amino acid sequences in EST databases using tBLASTn; (iii) selection of CPuORFs based on conservation of the stop codon position; (iv) selection of the CPuORFs conserved in a wide range of species, i.e., the CPuORFs with conserved stop codons detected in each of several taxonomic categories; and (v) and (vi) filtration stages, which exclude the "false" conserved CPuORFs [45]. Using this approach, 16 *A. thaliana* CPuORFs were identified; five of them are new CPuORFs whose involvement in the translational regulation of the main ORF has been experimentally confirmed [46].
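To make stage (i) more concrete, the following Python sketch enumerates every uORF that starts upstream of the main start codon and translates it, so that the resulting peptides could be passed to a homology search such as tBLASTn. It is a minimal illustration and not the BAIUCAS implementation: the function name `find_uorfs`, the `main_start` and `min_codons` parameters, and the returned fields are hypothetical, and the standard genetic code is assumed.

```python
# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def find_uorfs(transcript, main_start, min_codons=2):
    """Enumerate ORFs whose ATG lies upstream of `main_start` (0-based
    index of the A in the main ORF start codon).

    Returns dicts with the start index, the index of the stop codon,
    the translated peptide, and a flag telling whether the uORF runs
    into the main ORF. ORFs without an in-frame stop are skipped.
    """
    seq = transcript.upper().replace("U", "T")
    uorfs = []
    for start in range(main_start):
        if seq[start:start + 3] != "ATG":
            continue
        peptide = []
        for i in range(start, len(seq) - 2, 3):
            aa = CODON_TABLE.get(seq[i:i + 3], "X")  # X for ambiguous codons
            if aa == "*":
                if len(peptide) >= min_codons:
                    uorfs.append({
                        "start": start,
                        "stop": i,
                        "peptide": "".join(peptide),
                        "overlaps_main": i + 3 > main_start,
                    })
                break
            peptide.append(aa)
    return uorfs


if __name__ == "__main__":
    # Toy transcript: leader with one uORF (ATG AAG TGA), main ORF at index 16.
    toy = "GGATGAAGTGATTTCCATGGCCTAA"
    for uorf in find_uorfs(toy, main_start=16):
        print(uorf)
```

Note that an upstream ATG in frame with the main ORF and lacking an intervening stop codon is reported here with `overlaps_main` set, although it would encode an N-terminal extension rather than a separate uORF; a real pipeline would need to treat such cases explicitly.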

The list of computational approaches that have so far been successfully used to decode the specific structural features in the nucleotide composition of plant mRNAs that mediate differential translation control is rather short. Artificial neural networks (ANNs) could be an efficient computational tool for analyzing translatomes and predicting numerous regulatory codes in transcript sequences. This assumption relies on the facts that (i) most neural network architectures are theoretically able to approximate any function, i.e., ANNs can potentially be used to construct a model of almost any biological pattern, and (ii) the capabilities of supercomputers have reached the level needed to model biological processes using neural networks. However, a positive result of the analysis depends, first, on an adequately selected network architecture and, second, on the size and composition of the training sample and on the training strategy. Several recent reports confirm the utility of ANNs in deciphering the molecular mechanisms involved in decoding the eukaryotic genome. For example, ANNs constructed on the basis of RP data have been used to predict the yield of protein products [47]; to search for motifs potentially able to influence translation [48]; to extract biologically important information from omics data [49]; and to simulate the interaction between nucleic acids and different types of ligands (proteins and peptides) [50]. The potential of ANNs is broad, and they can be expected to find wide application in research into the diverse and multilevel mechanisms underlying translation in plants.
