*4.1. Frequency, Distribution, and Characterization of Microsatellites in Three Cucurbita Genomes*

With the development of sequencing technology, the discovery and mining of genomic SSR loci have been successfully applied in many plant species, such as cotton [35,36], foxtail millet [37], cucumber [11], watermelon [13], tobacco [38], and melon [12]. *Cucurbita moschata*, *C. maxima*, and *C. pepo* are important species that are cultivated worldwide, and their draft genomes were released several years ago. However, little information is available on genome-wide SSR markers in *Cucurbita* species, which has strongly limited genetic research on them. In the present study, genome-wide microsatellites were identified and characterized in the three *Cucurbita* species. A total of 34,375, 30,577, and 38,104 SSR loci were detected in the *C. moschata*, *C. maxima*, and *C. pepo* genomes, respectively. *C. pepo* had the smallest genome but the largest number of microsatellites, indicating that there was no direct correlation between genome size and the number of microsatellites. The SSR density in the three *Cucurbita* species was approximately 113–145 SSR/Mb, which is lower than that in cucumber (552 SSR/Mb) but comparable to that in melon (109 SSR/Mb) and watermelon (111 SSR/Mb) [11–13]. In addition to natural differences among genomes, other factors, such as the software and parameters used for microsatellite detection, can also cause SSR density to differ between studies. We suspect that the main reason for the difference in SSR density between *Cucurbita* species and cucumber was the different selection criteria for the SSR loci, e.g., the repeat types (di- to octa-nucleotides versus mono- to penta-nucleotides) and the minimum lengths (18 bp versus 12 bp).
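The effect of selection criteria on SSR counts can be illustrated with a minimal Python sketch. It is not the detection software used in this study: the function names `find_ssrs` and `ssr_density`, the regex-based search for perfect repeats, and the toy sequence are all assumptions made for illustration. Only the two parameter sets (di- to octa-nucleotides with an 18 bp minimum, and mono- to penta-nucleotides with a 12 bp minimum) are taken from the comparison above.

```python
import math
import re


def find_ssrs(seq, min_unit, max_unit, min_len):
    """Return sorted (start, motif, repeats) tuples for perfect tandem repeats.

    Illustrative only: real SSR tools also handle imperfect and compound
    repeats, which this sketch ignores.
    """
    seq = seq.upper()
    hits = []
    for unit in range(min_unit, max_unit + 1):
        # Smallest repeat count whose total length reaches min_len (at least 2).
        min_repeats = max(2, math.ceil(min_len / unit))
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit, min_repeats - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. ATAT),
            # so each locus is reported once under its shortest motif.
            if motif in (motif + motif)[1:-1]:
                continue
            hits.append((m.start(), motif, len(m.group(0)) // unit))
    return sorted(hits)


def ssr_density(n_ssr, genome_bp):
    """SSR loci per megabase of sequence."""
    return n_ssr / (genome_bp / 1_000_000)


# Hypothetical toy sequence: a 14 bp (AT)7 tract and an 18 bp (AAG)6 tract.
toy = "GGGGG" + "AT" * 7 + "CCC" + "AAG" * 6 + "TT"

strict = find_ssrs(toy, min_unit=2, max_unit=8, min_len=18)
relaxed = find_ssrs(toy, min_unit=1, max_unit=5, min_len=12)
```

On this toy input, the stricter 18 bp threshold excludes the 14 bp (AT)7 tract that the 12 bp threshold retains, so the same sequence yields different SSR counts, and therefore different densities, under the two criteria.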

We further analyzed the distribution and frequency of microsatellites in the three *Cucurbita* species (Figures 1 and 2). In most cases, a negative correlation was observed between the microsatellite frequency and the number of repeat units. Consistent with previous studies in watermelon and melon, the di-nucleotide repeats were the most abundant SSRs, followed by tri-, tetra-, penta-, hepta-, hexa-, and octa-nucleotide repeats [12,13]. However, the predominant repeat type varies among species. For example, the density of tetra-nucleotide repeats was highest in *C. sativus* (164.2 SSR/Mb), *Populus trichocarpa* (144.9 SSR/Mb), *Medicago truncatula* (102.8 SSR/Mb), and *Vitis vinifera* (171.3 SSR/Mb), whereas the density of tri-nucleotide repeats was highest in *Arabidopsis thaliana* (146.6 SSR/Mb), *Glycine max* (103.1 SSR/Mb), and *Oryza sativa* (220.1 SSR/Mb) [11]. Some studies have revealed that di-nucleotide motifs with high repeat numbers are more abundant and polymorphic than those with low repeat numbers [39]. This may be because di-nucleotide repeats are much less frequent in coding regions than in non-coding regions [40,41]. It has also been reported that exon regions contain more triplet SSRs than other repeat types, and that triplet SSR motifs may be related to high frequencies of certain amino acids [42,43]. SSRs in coding sequences have the potential to affect many aspects of genetic function, including gene regulation, development, and evolution. However, the function of genes that contain SSRs and the role of these SSR motifs in plant genes have been little studied and remain poorly understood [44]. It is interesting to note that many bacterial SSRs in intergenic regions have regulatory functions [45]; whether SSR motifs in the intergenic regions of *Cucurbita* species play a role in specialization or gene regulation should be further studied.

Motifs with low repeat numbers were predominant, and AT-rich motifs in particular accounted for a large proportion of the di-nucleotide repeats in the *Cucurbita* species (Figure S1). The AT or AAT type is more common in dicots [13], which is consistent with our results. Recently, the characterization of SSR markers in bitter gourd showed that tri-nucleotide repeats were the main type, with A/T, AT/AT, AAT/ATT, and AAAT/ATTT motifs overrepresented within their respective repeat types [46]. This has also been found in other genomes [11,47,48]. In contrast, the frequency of the GC or CCG type was much lower at the genomic level [49,50], and the GC, TC, or GA types have relatively stable structures. Most of the AT types are distributed in non-genic regions, while the TC/GA types are primarily distributed in coding sequences [38].
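Motif classes such as AT/AT or AAT/ATT group a motif with its cyclic shifts and its reverse complement, since all of these describe the same repeat read from a different start position or strand. A minimal Python sketch of one common way to derive such labels is given below. The helper names (`motif_class`, `tally_classes`) are hypothetical, and the approach shown is a general convention rather than the specific procedure used in this study or in the cited works.

```python
from collections import Counter

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def _min_rotation(motif):
    """Lexicographically smallest cyclic shift of a motif."""
    return min(motif[i:] + motif[:i] for i in range(len(motif)))


def motif_class(motif):
    """Return a strand- and phase-independent label such as 'AAT/ATT'."""
    motif = motif.upper()
    revcomp = motif.translate(_COMPLEMENT)[::-1]
    # Normalise both strands, then order the pair so the label is unique.
    first, second = sorted((_min_rotation(motif), _min_rotation(revcomp)))
    return f"{first}/{second}"


def tally_classes(motifs):
    """Count observed motifs per class."""
    return Counter(motif_class(m) for m in motifs)


# Hypothetical motifs as they might be reported by an SSR scan.
observed = ["AT", "TA", "TTA", "ATT", "TC", "GA", "GCC"]
counts = tally_classes(observed)
```

With this normalisation, TA and AT fall into the AT/AT class, TTA and ATT into AAT/ATT, and TC and GA into a single AG/CT class, so motif frequencies can be compared across studies regardless of the strand or phase in which each repeat was reported.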
