*3.3. Methods for Identification of RNA Motifs*

The methods used to identify potential regulatory motifs in mRNA sequences assess the statistical significance of the representation of candidate motifs in mRNAs with different translation efficiencies. These methods are generally the same as, or very similar to, those used for DNA motifs, except for the methods that address DNA conformational properties. The latter utilize physical parameters of the DNA double helix and can be applied to both prokaryotic [36] and eukaryotic [37] genomes. Given appropriate parameters for converting the letter representation of RNA into a numerical representation, the same methodology could be applied to the analysis of mRNA. Conventional approaches are based on accounting for conserved nucleotides within a short motif. One of the most frequently used programs for detecting motifs in transcript pools with different translation efficiencies is MEME, which is based on maximum likelihood optimization [38]. Its advantages are ease of use and a wide set of accompanying programs for visualization and further searching. The MEME suite comprises four main sections, namely, motif discovery, motif enrichment analysis, motif search, and motif comparison, with 14 different tools altogether. This toolkit allows the researcher both to determine motifs de novo and to scan a dataset of sequences for matches to already known motifs. MEME shows a schematic arrangement of the discovered motifs on the input sequences, constructs a graphical representation for them, and computes their statistical significance (*p*-value).

In particular, the MEME suite has allowed the identification of a nine-nucleotide-long element present in both the 5'UTRs and 3'UTRs of numerous *A. thaliana* and *Gynandropsis gynandra* transcripts; the authors named it MEM2. Later, it was experimentally confirmed that the MEM2 motif residing in the 5'UTR is necessary for preferential protein accumulation in mesophyll cells. It is assumed that this motif can be involved in the mechanism guiding the preferential cellular accumulation of several enzymes necessary for C4 photosynthesis, which provides more efficient carbon capture than the ancestral C3 pathway [39]. The MEME suite has also been used in a comparative analysis of the 5'UTR sequences of steady-state and polysomal *A. thaliana* mRNAs, which led to the discovery of two motifs (TAGGGTTT and AAAACCCT) present in many genes, suggesting a potential contribution to translation efficiency. However, it has been experimentally shown that only one of these motifs, TAGGGTTT, regulates gene expression at the level of translation [33].
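
The kind of enrichment question raised above (is a known motif such as TAGGGTTT more common in one transcript pool than in another?) can be sketched with a one-sided Fisher exact test. The following is a minimal illustration, not the procedure used in the cited studies: the function names are hypothetical, the sequences are made-up toy strings rather than real 5'UTRs, and the test is implemented directly from the hypergeometric distribution using only the Python standard library.

```python
from math import comb

def count_with_motif(seqs, motif):
    """Count sequences that contain the motif at least once (U is read as T)."""
    return sum(motif in s.upper().replace("U", "T") for s in seqs)

def fisher_right_tail(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    a, b: sequences with / without the motif in the first pool
    c, d: sequences with / without the motif in the second pool
    Returns P(X >= a) under the hypergeometric null (no enrichment).
    """
    n = a + b + c + d          # all sequences
    row1 = a + b               # size of the first pool
    col1 = a + c               # all sequences carrying the motif
    upper = min(row1, col1)
    tail = sum(comb(col1, k) * comb(n - col1, row1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, row1)

# Toy data (assumption: invented sequences, for illustration only).
efficient = ["AATAGGGTTTCC", "GGTAGGGTTTAA", "CCCTAGGGTTTG", "ACGTACGTACGT"]
inefficient = ["ACGTACGTACGT", "GGGGCCCCAAAA", "TTTTAAAACCCC", "TAGGGTTTACGT"]

motif = "TAGGGTTT"
a = count_with_motif(efficient, motif)
c = count_with_motif(inefficient, motif)
b, d = len(efficient) - a, len(inefficient) - c

p_value = fisher_right_tail(a, b, c, d)
print(a, b, c, d, round(p_value, 3))  # table 3 1 1 3, p = 17/70, about 0.243
```

With only four sequences per pool the test has almost no power, which is why the toy table is not significant; real analyses compare hundreds or thousands of transcripts.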

However, the search for motifs using this tool also has some limitations. Among the serious disadvantages of this program is its tendency to find very long motifs (over 20 nucleotides) that are present in only a small subgroup of sequences and/or motifs that are repeated frequently in one or just a few sequences. Although the statistical significance of such motifs is very high, the motifs themselves are, as a rule, of little biological or practical interest and represent statistical artifacts.
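
One way to screen out such artifacts is to post-filter reported motifs by width and by how their sites are distributed across sequences. The sketch below is only an illustration of that idea: the function `flag_suspect_motif`, its inputs, and all thresholds other than the 20-nucleotide width mentioned above are assumptions, and the code does not parse actual MEME output.

```python
from collections import Counter

def flag_suspect_motif(width, site_seq_ids, n_sequences,
                       max_width=20, min_seq_fraction=0.1, max_sites_share=0.5):
    """Return the reasons why a reported motif looks like a statistical artifact.

    width:         motif length in nucleotides
    site_seq_ids:  one sequence identifier per reported site of the motif
    n_sequences:   total number of sequences in the input dataset
    The thresholds are hypothetical defaults and should be tuned per dataset.
    """
    reasons = []
    per_seq = Counter(site_seq_ids)  # number of sites in each sequence

    if width > max_width:
        reasons.append("very long motif")
    if len(per_seq) / n_sequences < min_seq_fraction:
        reasons.append("present in a small subgroup of sequences")
    if per_seq:
        top = max(per_seq.values())
        # Only call it a repeat if one sequence holds several sites
        # and those sites dominate the total.
        if top > 1 and top / len(site_seq_ids) > max_sites_share:
            reasons.append("repeated within one sequence")
    return reasons

# Toy examples (assumption: invented site lists).
artifact = flag_suspect_motif(25, ["s1", "s1", "s1", "s2"], n_sequences=100)
plausible = flag_suspect_motif(8, [f"s{i}" for i in range(40)], n_sequences=100)
print(artifact)   # all three reasons are reported
print(plausible)  # empty list
```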

Most likely, these limitations are the main reason why several motif studies failed to yield any positive results [8,32,33]. Accordingly, other computational approaches have been used to search for motifs and to assess whether they are significantly represented in transcripts with different translation efficiencies, for example, by comparing the frequencies of mono-, di-, and trinucleotide sequences. Statistical tests, for example, the Kolmogorov–Smirnov or Fisher test, allow the detection of statistically significant differences in such nucleotide distributions. Moreover, the use of linear or logistic regression makes it possible to detect not only the individual contribution of each sequence but also the effect of their combinations. In particular, partial least squares regression analysis has been applied to detect short regions residing near the 5'-proximal region of the 5'UTR that can play an important role in differential translation in response to heat shock [8]. However, the linear and logistic regression methods are also not free from limitations. For example, it is not practically feasible to analyze motifs four nucleotides or longer, because the frequency of each motif decreases sharply as its length increases and, as a consequence, the computation of statistical characteristics becomes too complicated. In addition, these methods do not take into account the locations of motifs within sequences, which in biological terms implies equal contributions of codons residing far from the translation start codon and of those in its immediate proximity.
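
The frequency-based comparison described above can be sketched as follows: compute the per-sequence frequency of a short k-mer in two transcript pools and measure the distance between the two frequency distributions with the two-sample Kolmogorov–Smirnov statistic. This is a minimal standard-library illustration with hypothetical function names and invented toy sequences. It returns only the statistic; a p-value would additionally require the Kolmogorov distribution or a statistics library (for example, `scipy.stats.ks_2samp`).

```python
from bisect import bisect_right

def kmer_frequency(seq, kmer):
    """Fraction of overlapping windows in seq that equal kmer."""
    k = len(kmer)
    windows = len(seq) - k + 1
    if windows <= 0:
        return 0.0
    hits = sum(seq[i:i + k] == kmer for i in range(windows))
    return hits / windows

def ks_statistic(x, y):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    for v in xs + ys:  # the supremum is attained at an observed value
        fx = bisect_right(xs, v) / len(xs)
        fy = bisect_right(ys, v) / len(ys)
        d = max(d, abs(fx - fy))
    return d

def n_possible_kmers(k):
    """Number of distinct k-mers over a four-letter alphabet."""
    return 4 ** k

# Toy pools (assumption: invented sequences, for illustration only).
pool_a = ["AAAAAC", "AAAACG", "AAACGT"]
pool_b = ["CGTCGT", "GGCCTT", "ACGTAC"]

freq_a = [kmer_frequency(s, "AA") for s in pool_a]
freq_b = [kmer_frequency(s, "AA") for s in pool_b]
print(ks_statistic(freq_a, freq_b))               # 1.0: the distributions do not overlap
print([n_possible_kmers(k) for k in (1, 2, 3, 4)])  # [4, 16, 64, 256]
```

The last line illustrates the limitation noted above: the number of candidate k-mers grows as 4^k, so each individual k-mer becomes rarer and the number of regression predictors grows quickly once k reaches four or more.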
