**2. Results**

#### *2.1. Identification of TPS Family Members in Rosaceae*

To explore the distribution of TPSs among Rosaceae species and their evolutionary trajectory within the subfamilies, we selected eight Rosaceae species for TPS identification: three Prunoideae species (*P. persica*, *P. mira*, *P. mume*), three Maloideae species (*P. betulifolia*, *M. domestica*, *M. baccata*), and two Rosoideae species (*F. vesca*, *R. chinensis*) (Table 1). BLASTP and HMM searches were performed against the complete protein set of each species, and the two approaches produced similar numbers of hits, indicating that the TPS family is relatively conserved. We merged the hits and verified them for the presence of the Pfam domains PF03936 (metal-binding domain) and PF01397 (N-terminal TPS domain). The Pfam domain distribution in the TPSs of the eight Rosaceae species is listed in Table S1. As a result, hundreds of complete TPSs containing both domains were identified. A recent study used a similar approach to detect TPSs in peach and found 38 full-length TPSs with both domains [2]; another study reported that cultivated apple contains 55 putative TPS genes [4]. The similar numbers of TPSs obtained here indicate that the approach used in our study is reliable. For each Rosaceae species, we found that a proportion (64.29–90%) of the putative TPSs contained both domains. The reference information and the number of TPS family members are listed in Table 1. The number of TPS family members was highest in the Rosoideae species (65–76), followed by Maloideae (48–56) and Prunoideae (10–45). The Prunoideae species varied in their numbers of TPSs: only 10 putative TPSs were identified in *P. mira*, whereas 30 and 45 putative TPSs were detected in *P. mume* and *P. persica*, respectively. Considering only the complete TPSs with both domains, 38, 36, and 43 TPSs were identified in *P. persica*, *M. domestica*, and *F. vesca*, respectively. *M. domestica* displayed a lower ratio of complete TPSs, at less than 64.29%, while the ratio in Prunoideae was up to 90%. All the putative TPSs were renamed numerically with the abbreviation of the species name as a prefix (Table S2); only complete TPSs that contained both domains were used for the subsequent analysis.
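
The filtering step described above (merge the BLASTP and HMM hits, then keep only candidates carrying both Pfam domains) can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function names, the toy protein IDs, and the dictionary layout of the domain annotations are all hypothetical, and only the two Pfam accessions come from the text.

```python
# Pfam accessions named in the text; everything else here is illustrative.
REQUIRED_DOMAINS = {"PF01397", "PF03936"}


def merge_hits(blastp_hits, hmm_hits):
    """Return the union of candidate protein IDs from the two searches."""
    return set(blastp_hits) | set(hmm_hits)


def complete_tps(candidates, domains_by_protein):
    """Keep candidates annotated with both required Pfam domains."""
    return sorted(
        pid
        for pid in candidates
        if REQUIRED_DOMAINS <= set(domains_by_protein.get(pid, ()))
    )


def complete_ratio(n_complete, n_putative):
    """Percentage of putative TPSs that are complete, rounded to 2 decimals."""
    return round(100.0 * n_complete / n_putative, 2)


# Hypothetical toy input: IDs and annotations are made up for the example.
blastp_hits = ["toy1", "toy2", "toy3"]
hmm_hits = ["toy2", "toy3", "toy4"]
domains_by_protein = {
    "toy1": ["PF01397", "PF03936"],
    "toy2": ["PF03936"],
    "toy3": ["PF03936", "PF01397"],
    "toy4": ["PF01397"],
}

candidates = merge_hits(blastp_hits, hmm_hits)
complete = complete_tps(candidates, domains_by_protein)
print(complete, complete_ratio(len(complete), len(candidates)))
```

With the same helper, `complete_ratio(36, 56)` evaluates to 64.29. This would match the figure reported for *M. domestica* if its putative count is 56, the upper end of the Maloideae range; that pairing is an inference and is not stated explicitly in the text.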


**Table 1.** Summary of genome information and TPSs of sequenced Rosaceae species used in this study.

\* GDR, Genome Database for Rosaceae; CNGB, China National GeneBank DataBase; NCBI, National Center for Biotechnology Information. Species used for the phylogenetic tree are highlighted in bold.

#### *2.2. TPS Classification and Motif/Domain Annotation*

All the TPSs containing both the PF01397 and PF03936 domains were subjected to classification and motif annotation. TPSs from three representative Rosaceae species (*P. persica*, *M. domestica*, and *F. vesca*) were chosen to visualize the classification and the domain/motif distribution. After removing putative TPSs that lacked either domain, a total of 117 TPSs from the three species were used for phylogenetic construction (Figures 1 and S1). The phylogenetic topology showed that the TPSs were distributed among the known TPS clades (TPS-a to TPS-g). TPS-b and TPS-g clustered with the TPS-a clade and formed a large branch. TPS-e and TPS-f formed sister clades and clustered close to the TPS-c clade. No TPSs clustered with the TPS-d clade, which is found only in gymnosperms. Conserved motifs were identified using the online MEME software, and three conserved motifs (motifs 1, 2, and 3) were detected in nearly all of the TPSs (109, 96, and 102 TPSs, respectively). However, the frequency and distribution of these motifs varied among TPSs. For example, *Ma.dom-TPS11* contained only motif 2 and lacked motifs 1 and 3, whereas *Fr.ves-TPS28* contained an extra copy of motif 3. The motif distribution of TPSs in the eight Rosaceae species is listed in Table S1; the significant E-values indicate the reliability of the identified motifs. Comparing clades, we found that the motif composition of the TPS-a/b/g clades is more conserved than that of TPS-c/e/f (Figure 2B): nearly all TPSs in the TPS-a/b/g clades contain all three motifs, whereas many TPSs from TPS-c/e/f lack motif 2, indicating that motif distribution differs among clades. Conserved domain (CD) annotation with the CD-Search tool at NCBI also revealed differences in domain annotation among clades. The CD domains Terpene\_cyclase\_plant\_C1 (accession: cd00684) and Isoprenoid\_Biosyn\_C1 (accession: cd00385), which both belong to the Isoprenoid\_Biosyn\_C1 superfamily (accession: cl00210), were annotated in the TPS-a, TPS-b, and TPS-g clades. The PLN02279 superfamily (ent-kaur-16-ene synthase) was annotated in the TPS-e/f clades, and the PLN02592 superfamily (ent-copalyl diphosphate synthase) was annotated in the TPS-c clade. The Pfam domain annotation verified that each full-length TPS is characterized by two conserved domains, PF01397 (N-terminal) and PF03936 (C-terminal). The protein lengths of the TPSs ranged from 232 to 1726 amino acids (AA), showing a wide distribution. One of the conserved aspartate-rich motifs in the C-terminal domain, which is involved in the coordination of divalent ions and water molecules and in the stabilization of the active site, was characterized based on motif sequence alignment (Figure 2C). The conservation of its amino acid composition varied among TPS clades. The TPS-c subfamily is characterized by the "DXDD" motif rather than the "DDXXD" motif that was detected in the other clades.
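
The distinction between the two aspartate-rich motifs can be illustrated with a simple pattern scan. The sketch below is an assumption-laden simplification: the study characterized the motif from a motif sequence alignment, whereas this example merely searches raw sequences for the "DDXXD" and "DXDD" patterns (X being any residue). The function name and the toy peptide strings are hypothetical and are not real TPS sequences.

```python
import re

# Lookahead patterns so that overlapping occurrences are all reported.
# "." stands for the variable position written as X/x in the text.
MOTIF_PATTERNS = {
    "DDxxD": re.compile(r"(?=DD..D)"),
    "DxDD": re.compile(r"(?=D.DD)"),
}


def aspartate_motifs(sequence):
    """Return 0-based start positions of each aspartate-rich pattern."""
    seq = sequence.upper()
    return {
        name: [m.start() for m in pattern.finditer(seq)]
        for name, pattern in MOTIF_PATTERNS.items()
    }


# Hypothetical toy peptides, chosen only to show the two outcomes.
toy_ddxxd = "MKLDDIYDAYG"  # carries DDIYD
toy_dxdd = "MKLDIDDTAG"    # carries DIDD

print(aspartate_motifs(toy_ddxxd))
print(aspartate_motifs(toy_dxdd))
```

A scan over the full sequence can produce chance matches outside the catalytic region, so in practice the search would be restricted to the relevant domain or to the aligned motif columns.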

**Figure 1.** Phylogenetic relationships and distribution of motifs/domains of TPSs in three Rosaceae species (*P. persica*, *M. domestica*, *F. vesca*). The phylogenetic tree is shown in the left panel, while conserved motifs, conserved domains, and Pfam domains are shown in the three right panels. The phylogenetic tree was constructed from full-length amino acid sequences using MEGA with the maximum likelihood (ML) method. The conserved motifs were assessed using the online MEME software. The conserved domains were annotated based on the conserved domain database at NCBI, whereas the gene structure and Pfam domains were annotated using the PfamScan tool. The conserved motifs and domains are shaded in different colors. The root nodes of the TPS-a, TPS-g, TPS-b, TPS-c, TPS-e, and TPS-f clades are indicated in blue, green, yellow-green, red, benzo, and purple, respectively.

**Figure 2.** The unrooted phylogenetic tree of TPSs and comparison of motifs between TPS clades. (**A**) The maximum-likelihood phylogenetic tree of the TPS proteins in three Rosaceae species (*P. persica*, *M. domestica*, *F. vesca*). The TPS-a, TPS-g, TPS-b, TPS-c, TPS-e, and TPS-f clades are shaded in blue, green, yellow-green, red, benzo, and purple, respectively. (**B**) The frequency of the different motifs among TPS clades. (**C**) The seqLogo of the 'DDxxD' motif in the C-terminal domain of different TPS clades. The bit score represents the information content of each position in the sequence.
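
The per-position bit score in a sequence logo can be illustrated with the standard information-content calculation, log2(20) minus the Shannon entropy of the residues in a column. The sketch below assumes a 20-letter protein alphabet and omits the small-sample correction that some logo tools apply; whether the tool used for Figure 2C applies such a correction is not stated here. The function names and the three toy motif instances are hypothetical.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information content (bits) of one alignment column, ignoring gaps."""
    residues = [c for c in column.upper() if c not in "-."]
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    entropy = -sum((k / n) * math.log2(k / n) for k in counts.values())
    return math.log2(alphabet_size) - entropy


def logo_information(aligned_motifs):
    """Information content for every column of equal-length aligned motifs."""
    return [column_information("".join(col)) for col in zip(*aligned_motifs)]


# Hypothetical aligned motif instances (not taken from the study).
toy_alignment = ["DDIYD", "DDLFD", "DDMLD"]
info = logo_information(toy_alignment)
print([round(bits, 3) for bits in info])
```

Fully conserved columns (here the aspartate positions) reach the maximum of log2(20) bits, while the variable positions score lower, which is the pattern a 'DDxxD' logo is expected to show.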
