*3.8. Gene Annotations*

The eight strawberry species together contain tens of thousands of genes and proteins, and these sequences carry a large amount of valuable species-specific information that researchers want to query. We have therefore integrated millions of data records into the GDS Gene Search tool, which returns detailed information on target genes. The tool integrates the following types of gene information.

1. Gene family annotation. The ancestral genes of strawberries have undergone genomic duplication and mutation during their long evolutionary history [32], resulting in families of related genes that share conserved sequence motifs. The Pfam database [33] (http://pfam.xfam.org/, accessed on 10 November 2021) is a large collection of protein families, each represented by multiple sequence alignments and a hidden Markov model (HMM) [34]. We analyzed the proteins of the eight strawberry species against the Pfam 34.0 database using hmmscan (version 3.3).

2. KEGG (Kyoto Encyclopedia of Genes and Genomes) annotation. KEGG is a resource for understanding the functions and utilities of biological systems, such as the cell, organism, and ecosystem. It contains molecular-level information, especially large-scale datasets generated by genome sequencing and other high-throughput technologies [35]. KofamKOALA [36] assigns KEGG Orthologs (KOs) to protein sequences by homology search against a database of profile hidden Markov models (KOfam) with precomputed adaptive score thresholds. Installing it requires Ruby (v2.4 or later; v2.7 was used in this study), HMMER (v3.1 or later; v3.3 was used here), and GNU Parallel (the latest version). The GDS uses KofamKOALA (v1.2), which is run through its exec\_annotation script, to analyze the protein files of the eight strawberry species. The resulting KEGG predictions contain the KO IDs, supplemented with more detailed information from the official website (https://www.kegg.jp/, accessed on 10 November 2021).

3. GO annotation. GO [37] is a database established by the Gene Ontology Consortium. It aims to provide a resource that is applicable to all species and that defines and describes the functions of genes and proteins. Using a dynamically updated set of controlled vocabulary terms, GO annotations describe the roles of genes and proteins in cells and organisms. InterPro [38,39] aggregates data resources from multiple functional annotation databases and tools, such as Pfam, PANTHER, SMART, SUPERFAMILY, and TMHMM. It predicts the biological functions of proteins by classifying their sequences into protein families and predicting protein domains. Its Java-based scanning software, InterProScan (v5.5), was used to annotate proteins from the eight strawberry species. The comparison library is included in the download of the latest InterProScan version, and usage instructions are printed when "./interproscan.sh" is entered in the terminal. The final data can be retrieved from the GDS MySQL database.

4. Signal peptide prediction [40]. Signal peptides are short peptide chains (5–30 amino acids) that direct newly synthesized proteins to the secretory pathway. SignalP [41] predicts whether a given amino acid sequence contains a potential signal peptide cleavage site and, if so, where it is located. SignalP (version 5.0) was run with the command "signalp -batch 30000 -org euk -fasta proteins" to analyze proteins from the eight strawberry species. Users can download the results from the "singalP" folder of the download interface.
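
The four annotation runs described above can be sketched as command lines assembled in Python. This is an illustrative sketch only, not the authors' actual pipeline: the SignalP command is taken from the text, whereas the function name `annotation_commands`, the input and output file names, the `Pfam-A.hmm` database path, and the specific hmmscan, exec\_annotation, and InterProScan options shown are assumptions chosen for illustration.

```python
import shlex


def annotation_commands(proteins, outdir, cpus=4):
    """Build one command line per annotation step for a protein FASTA file.

    All paths and most options are illustrative assumptions; only the
    SignalP arguments are taken from the text above.
    """
    return {
        # Pfam: hmmscan takes <hmmdb> <seqfile>; the database must first be
        # prepared with hmmpress. --domtblout writes a per-domain table.
        "pfam": ["hmmscan", "--cpu", str(cpus),
                 "--domtblout", f"{outdir}/pfam.domtblout",
                 "Pfam-A.hmm", proteins],
        # KEGG: exec_annotation reads profile and KO list locations from its
        # config file when they are not given on the command line.
        "kegg": ["exec_annotation", "-f", "mapper", "--cpu", str(cpus),
                 "-o", f"{outdir}/kegg_mapper.txt", proteins],
        # GO: InterProScan with GO term lookup and tab-separated output.
        "go": ["./interproscan.sh", "-i", proteins, "-f", "tsv",
               "-goterms", "-o", f"{outdir}/interpro.tsv"],
        # Signal peptides: the command quoted in the text.
        "signalp": ["signalp", "-batch", "30000", "-org", "euk",
                    "-fasta", proteins],
    }


# Print the commands instead of running them, so the sketch works without
# the tools installed. Pass a list to subprocess.run(cmd, check=True) to
# execute a step.
for step, cmd in annotation_commands("proteins.fasta", "results").items():
    print(f"{step}: {shlex.join(cmd)}")
```

Keeping each command as a list of arguments rather than a single string avoids shell-quoting problems when file names contain spaces, and it makes it straightforward to loop the same steps over the protein file of each species.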

To date, eight nuclear genomes, 436,160 protein sequences, 3,107,804 GO annotations, 27,481 signal peptides, and 1918 transcription factors [42] (Table 2) have been downloaded, analyzed, and organized in the GDS MySQL database.
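
As a hedged illustration of how counts of this kind could be derived from the raw tool outputs, the sketch below tallies GO annotations from InterProScan tab-separated output and signal peptides from a SignalP 5.0 short-format summary. The text does not state how the reported totals were computed, so the counting convention used here (distinct protein and GO term pairs), the function names, and the sample records are all assumptions; the sample data are invented and are not GDS results.

```python
def count_go_annotations(interpro_tsv):
    """Count distinct (protein, GO term) pairs in InterProScan TSV text.

    Assumes the GO terms sit in column 14 as a '|'-separated list, which is
    the TSV layout produced when InterProScan is run with -goterms.
    """
    pairs = set()
    for line in interpro_tsv.splitlines():
        fields = line.split("\t")
        if len(fields) < 14 or fields[13] in ("", "-"):
            continue  # no GO terms reported for this match
        for term in fields[13].split("|"):
            pairs.add((fields[0], term))
    return len(pairs)


def count_signal_peptides(signalp_summary):
    """Count sequences predicted to carry a signal peptide.

    Assumes SignalP 5.0 short-format output: '#' header lines, then one
    tab-separated row per sequence with the prediction in column 2.
    """
    count = 0
    for line in signalp_summary.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) > 1 and fields[1].startswith("SP"):
            count += 1
    return count


# Invented sample records (hypothetical protein IDs), for illustration only.
sample_interpro = "\n".join([
    "prot1\tmd5a\t250\tPfam\tPF00069\tProtein kinase domain\t10\t200\t1.2E-30"
    "\tT\t10-11-2021\tIPR000719\tProtein kinase domain"
    "\tGO:0004672|GO:0005524|GO:0006468",
    "prot1\tmd5a\t250\tSMART\tSM00220\tKinase\t10\t200\t3.0E-20"
    "\tT\t10-11-2021\tIPR000719\tProtein kinase domain\tGO:0005524",
    "prot2\tmd5b\t120\tPfam\tPF00000\tExample\t5\t90\t1.0E-5"
    "\tT\t10-11-2021\t-\t-\t-",
])
sample_signalp = "\n".join([
    "# SignalP-5.0\tOrganism: euk\tTimestamp: 20211110000000",
    "# ID\tPrediction\tSP(Sec/SPI)\tOTHER\tCS Position",
    "prot1\tSP(Sec/SPI)\t0.98\t0.02\tCS pos: 22-23. AXA-XX. Pr: 0.85",
    "prot2\tOTHER\t0.01\t0.99\t",
])

print("GO annotations:", count_go_annotations(sample_interpro))
print("Signal peptides:", count_signal_peptides(sample_signalp))
```

Counting distinct pairs prevents the same GO term from being tallied twice when several member databases report it for one protein; counting every row instead would give a larger total.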


**Table 2.** Statistics of whole datasets in GDS.
