*2.3. RNA-Seq and Annotation of Unigenes*

To understand the molecular basis of flower polymorphism in *P. limprichtii*, three distinctly colored flower petals and lips were used for RNA-seq. A total of 80,525 unigenes with a mean length of 856 bp were obtained by de novo assembly. We assessed the quality of the unigenes in the transcriptome library. The length of N50 (sequence length of the shortest contig at 50%) was 1470 bp, the Q20 and Q30 percentages were 98% and 95%, the GC content was 40%, and the unigenes were generally distributed between 200 bp and 3000 bp in length. A total of 69,004,304 base pairs were aligned. These data show that the throughputs and sequencing quality were high enough to ensure further analysis.

According to the BLASTx results, a total of 33,724 unigenes were annotated, accounting for 41.88% of all the unigenes. Among these, 33,459 could be annotated using the Nr database (Non-redundant protein database, 41.55%), 21,177 using the Swiss-prot database (Swiss-Protein protein database, 26.30%), 11,067 using the KEGG database (Kyoto encyclopedia of genes and genomes, 24.68%), and 19,871 using the KOG database (Eukaryotic orthologous groups, 13.74%). In addition, there were 8396, 151, 42, and 23 unigenes annotated only using the Nr, Swiss-protein, KEGG, and KOG databases, respectively (Figure 2).

**Figure 2.** Venn diagram of the number of unigenes annotated by BLASTx with four protein databases.

Statistical analysis of the E-value (The probability due to chance) characteristics of the Nr annotations revealed that 32.05% of the mapped sequences showed strong homology (E-value <sup>&</sup>lt; <sup>1</sup> <sup>×</sup> 10-3), while 29.48% in the Swiss-protein database, 44.77% in the KEGG database, and 30.07% in the KOG database showed strong homology (Figure 3).

**Figure 3.** E-value distribution of top BLASTx hits against four protein databases for each unigene. (**a**) distribution E-values in non-redundant protein database; (**b**) distribution E-values in Swiss-protein database; (**c**) distribution E-values in the Kyoto Encyclopedia of Genes and Genomes database; (**d**) distribution E-values in the Eukaryotic Orthologous Group database.

Based on the Nr annotations, 6495 unigenes were classified into 53 functional categories, belonging to three functional terms: molecular function, cellular component, and biological process. The largest percentages of unigenes identified within each of the three functional terms were metabolic process (3602 unigenes), cell and cell part (2237 and 2233 unigenes), and binding and catalytic activity (2481 and 3409 unigenes), corresponding to biological process, cellular component, and molecular function, respectively (Figure 4).

**Figure 4.** Function classification of the gene ontology of all unigenes based on Nr annotation.

To exhaustively explore the potential functions of the annotated unigenes, 11,067 unigenes were mapped onto 131 KEGG pathways, including metabolic pathways, biosynthesis of secondary metabolites, biosynthesis of antibiotics, and many other important metabolic pathways. Metabolic pathways and biosynthesis of secondary metabolites were enriched in the most unigenes, with 2450 and 1294 unigenes annotating to these two pathways, respectively, while the anthocyanin pathway was enriched in only one unigene. Among these 131 pathways, there are two color formation-related pathways, the flavonoid biosynthesis pathway (ko00941) to which 36 unigenes were mapped, and the ABP (ko00942) to which one unigene was mapped (the pathways are listed in Supplementary Table S3).
