*2.3. Phylogenetic Analysis, Gene Structure, and Motif Composition of Trihelix Genes*

To better understand the phylogenetic relationships of trihelix genes, a maximum likelihood phylogenetic tree was built from a multiple sequence alignment of the Myb/SANT-LIKE domains of rice and other species, including dicotyledonous plants such as *Arabidopsis*, soybean, tomato, and chrysanthemum and monocotyledonous plants such as maize, wheat, wild rice, and *Brachypodium distachyon*. As shown in Figure 4, the OsMSLs were divided into five subfamilies named SIP1, GTγ, GT, SH4, and GTδ according to the characteristics of their trihelix DNA-binding domains. Genes that had been classified previously, such as *SIGT-4/7/12/18/36* in tomato [8], *CmTH2/6/12/17/19/20* in chrysanthemum [14], and *GmGT-2A* and *GmGT-2B* in soybean [11], were used as classification markers. The GT clade was the largest subfamily, containing 28 trihelix genes, whereas the SH4 clade was the smallest, consisting of 13 members, indicating that trihelix genes were distributed unevenly among the clades. All clades contained genes from both dicot and monocot species. One rice clade corresponds to the subfamily previously named GTδ in tomato, and two tomato trihelix genes, *SIGT-4* and *SIGT-12*, fall into this subfamily.

To demonstrate the evolutionary relationships among the *OsMSL*s, we constructed an unrooted phylogenetic tree using the full-length amino acid sequences of the OsMSLs. Of the 43 transcripts of the 41 rice trihelix genes, nine belonged to SIP1, 10 to GTγ, 11 to GT, five to SH4, and eight to GTδ (Figure 5A). Most of the duplicated genes were present in the GTδ subfamily. A phylogenetic tree of all MSLs from rice and *Arabidopsis* was also constructed and is shown in Figure S1. However, we found that the GTδ subfamily does not contain *Arabidopsis* trihelix genes.
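As a quick consistency check on the subfamily counts reported for the rice-only tree (Figure 5A), the tally can be sketched as follows. The dictionary simply restates the numbers given in the text; the variable names are illustrative.

```python
# Subfamily counts for the 43 OsMSL transcripts, as stated in the text (Figure 5A).
subfamily_counts = {"SIP1": 9, "GTγ": 10, "GT": 11, "SH4": 5, "GTδ": 8}

total = sum(subfamily_counts.values())
largest = max(subfamily_counts, key=subfamily_counts.get)
smallest = min(subfamily_counts, key=subfamily_counts.get)

print(f"total transcripts: {total}")
for name, n in subfamily_counts.items():
    # Share of all OsMSL transcripts that fall into this subfamily.
    print(f"{name}: {n} ({n / total:.1%})")
print(f"largest: {largest}, smallest: {smallest}")
```

The five counts sum to the 43 transcripts, and GT and SH4 are the largest and smallest subfamilies in rice alone, matching the ranking seen in the multi-species tree.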

**Figure 4.** Phylogenetic relationships among 105 trihelix proteins in rice, *Arabidopsis*, soybean, maize, tomato, wheat, chrysanthemum, wild rice and *Brachypodium distachyon*. The maximum likelihood tree was created using MEGA v. 7.0 (bootstrap value = 1000) and the bootstrap value of each branch is displayed. Forty-three OsMSL proteins are marked with black circles and other species are marked with white circles. The phylogenetic tree was clustered into SIP1, GTγ, GT, SH4, and GTδ.

*Int. J. Mol. Sci.* **2018**, *19*, x FOR PEER REVIEW 12 of 29

**Figure 5.** Phylogenetic analysis and gene structure of the rice trihelix family. (**A**) Phylogenetic analysis of the rice trihelix family. The phylogenetic tree was constructed based on the full-length amino acid sequences of the rice trihelix proteins by using MEGA v. 7.0 with the maximum-likelihood method. Bootstrap = 1,000. SIP1, GTγ, GT, SH4, and GTδ are marked with different colors. (**B**) Gene structures of the rice trihelix family. These were analyzed by the Gene Structure Display Server (GSDS v. 2.0). Exons, introns, and untranslated regions are marked by round red rectangles, black lines, and blue rectangles, respectively. The scale bar at the bottom estimates the lengths of the exons, introns, and untranslated regions.

To identify differences among the rice trihelix family genes, we analyzed *OsMSL* gene structure by comparing each coding sequence with its corresponding genomic sequence. As shown in Figure 5B, the number of exons per *OsMSL* gene ranges discontinuously from 1 to 18. Combining the gene structures with the phylogenetic tree, we found that the *OsMSL* exon-intron distribution is related to subfamily classification. Closely related genes are usually homologous, so their gene structures are similar. For example, the *OsMSL* genomic sequences in the SIP1 subfamily have a single exon and no introns, suggesting that the evolution of this subfamily is relatively conserved. The genes in the GTδ subfamily consist only of exons and introns and have no UTRs, except for *OsMSL16* and *OsMSL41*. In contrast, gene structures within the GTγ, GT, and SH4 subfamilies are relatively variable. These results indicate that although the *OsMSL*s are subdivided into five subfamilies, their genes are relatively conserved.
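The exon, intron, and UTR comparison described above can be illustrated with a minimal sketch that summarizes gene structure from GFF3-style feature records. This is not the GSDS workflow used in the study; the gene identifiers, coordinates, and the `summarize_structure` helper are hypothetical and serve only to show the bookkeeping.

```python
from collections import defaultdict

# Hypothetical GFF3-style records: (gene_id, feature_type, start, end).
# Coordinates are invented for illustration and are not from the rice annotation.
records = [
    ("geneA", "exon", 100, 900),           # single-exon, intronless gene
    ("geneB", "five_prime_UTR", 50, 120),
    ("geneB", "exon", 50, 300),
    ("geneB", "exon", 450, 700),
    ("geneB", "exon", 820, 1100),
]

def summarize_structure(records):
    """Return {gene_id: {"exons": n, "introns": n, "has_utr": bool}}."""
    exons = defaultdict(list)
    has_utr = defaultdict(bool)
    for gene_id, feature, start, end in records:
        if feature == "exon":
            exons[gene_id].append((start, end))
        elif feature.endswith("UTR"):
            has_utr[gene_id] = True
    summary = {}
    for gene_id, spans in exons.items():
        spans.sort()
        # An intron is counted wherever two consecutive exons are separated by a gap.
        introns = sum(
            1 for (_, end1), (start2, _) in zip(spans, spans[1:]) if start2 > end1 + 1
        )
        summary[gene_id] = {
            "exons": len(spans),
            "introns": introns,
            "has_utr": has_utr[gene_id],
        }
    return summary

structure = summarize_structure(records)
for gene_id, info in structure.items():
    print(gene_id, info)
```

In this toy input, `geneA` mimics the intronless, single-exon pattern reported for SIP1, while `geneB` shows a multi-exon gene with an annotated UTR.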

To gain insight into the functions of the trihelix family genes, the motif composition of the OsMSL amino acid sequences was analyzed with the MEME program. Ten motifs with E < 1.8 × 10<sup>−45</sup> were identified. These resemble the motifs of the MSLs in chrysanthemum, whose genes were classified by subfamily [14]. As shown in Figure 6, most trihelix family proteins, except for OsMSL01 and OsMSL09, contain motif 1 (Myb-type DNA-binding domain) at the *N*-terminus of the amino acid sequence. Motifs 2, 6, and 8 are different trihelix DNA-binding domains (WWW, WWF, and WWI). These determine OsMSL classification, structure, and function [18]. As in the gene structure analysis, the motifs and their distribution patterns are closely related to the subfamilies. SIP1 members contain motif 8 and GTγ members contain motif 6, with only OsMSL06 containing an extra motif 8. Both GT and SH4 contain motif 2, but that in SH4 is longer than that in GT. OsMSL09 and OsMSL12 in SH4 also contain an additional motif 8. OsMSL02, OsMSL03, OsMSL04, OsMSL26, OsMSL29, and OsMSL38 in the GT subfamily also contain motif 6. The rice-specific GTδ subfamily contains motif 2 together with other functional domains and conserved sequences. Although the functions of these additional domains have yet to be elucidated, they may indicate that the GTδ genes in rice have multiple functions.
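The relationship between subfamily and trihelix-domain motif described above can be expressed as a small lookup. The sketch below encodes only what the text states; the motif sets are deliberately partial (other motifs shown in Figure 6 are omitted), and the helper name `extra_domain_motifs` is hypothetical.

```python
# Trihelix-domain motif reported for each subfamily in the text.
EXPECTED_MOTIF = {"SIP1": 8, "GTγ": 6, "GT": 2, "SH4": 2, "GTδ": 2}
DOMAIN_MOTIFS = {2, 6, 8}

# Only the trihelix-domain motifs explicitly mentioned in the text are listed;
# each protein may carry further motifs that are not represented here.
proteins = {
    "OsMSL02": ("GT", {2, 6}),
    "OsMSL09": ("SH4", {2, 8}),
    "OsMSL12": ("SH4", {2, 8}),
}

def extra_domain_motifs(subfamily, motifs):
    """Return trihelix-domain motifs present beyond the subfamily's expected one."""
    expected = EXPECTED_MOTIF[subfamily]
    if expected not in motifs:
        raise ValueError(f"expected motif {expected} is missing for {subfamily}")
    return sorted((motifs & DOMAIN_MOTIFS) - {expected})

report = {name: extra_domain_motifs(sub, m) for name, (sub, m) in proteins.items()}
for name, extras in report.items():
    print(name, "additional trihelix-domain motifs:", extras)
```

A table like `report` makes it easy to spot the proteins that combine two trihelix-domain motifs, such as the GT members with an added motif 6 and the SH4 members with an added motif 8.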

**Figure 6.** Motif composition of rice trihelix proteins. Motif analysis was performed using the MEME program as described in the methods section. The trihelix proteins are listed on the left. Boxes of different colors represent the various motifs. Their location in each sequence is marked. Motif sequences are shown in Figure S2. The scale bar at the bottom indicates the lengths of the trihelix protein sequences.

*2.4. Cis-Element Analysis of Rice Trihelix Genes*

To understand the genetic functions, metabolic networks, and regulatory mechanisms of rice trihelix genes, the shared *cis*-elements in the promoter regions of the *OsMSL*s were analyzed. The 1500-bp sequence upstream of each *OsMSL* was obtained and treated as a hypothetical promoter. The *cis*-elements of the *OsMSL*s were identified and labeled by different colors in the promoter sequence (Figure 7).

As shown in Figure 7, A (ACGTATERD1) and M (GT1GMSCAM4) are two dehydration-responsive elements, and M is a core element. Therefore, *OsMSL*s probably participate in dehydration (including drought and salt) stress responses. GB (GATABOX), GC (GT1CONSENSUS), and I (INRNTPSADB) are three light-responsive elements. They indicate that the *OsMSL* family potentially consists of light-inducible/repressible genes. Light responsiveness is typical of the GT factors (now known as the trihelix gene family) and was confirmed in our *cis*-element study. To verify whether *OsMSL*s are regulated by light under both normal and stress conditions, a dark treatment was added to the *OsMSL* expression analysis.
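The promoter scan described in this section can be sketched as a search for degenerate consensus sequences on both strands. This is not the tool used in the study. The consensus strings below are those commonly listed for these PLACE entries and should be treated as assumptions to verify against the database; the toy promoter fragment and the `scan` helper are invented for illustration.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

# Assumed consensus sequences for the five elements named in the text.
ELEMENTS = {
    "ACGTATERD1": "ACGT",
    "GT1GMSCAM4": "GAAAAA",
    "GATABOX": "GATA",
    "GT1CONSENSUS": "GRWAAW",
    "INRNTPSADB": "YTCANTYY",
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def scan(promoter, elements=ELEMENTS):
    """Return {element: [(strand, 0-based start), ...]}, including overlapping hits."""
    promoter = promoter.upper()
    rc = reverse_complement(promoter)
    hits = {}
    for name, consensus in elements.items():
        # A lookahead pattern reports overlapping occurrences as well.
        pattern = re.compile("(?=(" + "".join(IUPAC[b] for b in consensus) + "))")
        found = [("+", m.start()) for m in pattern.finditer(promoter)]
        # Convert reverse-strand positions back to forward-strand coordinates.
        found += [
            ("-", len(promoter) - m.start() - len(consensus))
            for m in pattern.finditer(rc)
        ]
        hits[name] = sorted(found, key=lambda h: h[1])
    return hits

# Toy promoter fragment, invented for illustration (a real input would be 1500 bp).
toy = "TTGATAACGTCCGAAAAATT"
hits = scan(toy)
for name, found in hits.items():
    print(name, found)
```

Because short consensus sequences such as ACGT and GATA occur frequently by chance in any 1500-bp window, hit counts from a scan like this are best read as candidates rather than as evidence of regulation.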
