**Characteristics of Microsatellites Mined from Transcriptome Data and the Development of Novel Markers in** *Paeonia lactiflora*

## **Yingling Wan 1, Min Zhang 1, Aiying Hong 2, Yixuan Zhang <sup>1</sup> and Yan Liu 1,3,4,5,\***


Received: 28 January 2020; Accepted: 17 February 2020; Published: 19 February 2020

**Abstract:** The insufficient number of available simple sequence repeats (SSRs) inhibits genetic research on and molecular breeding of *Paeonia lactiflora*, a flowering crop with great economic value. The objective of this study was to develop SSRs for *P. lactiflora* with Illumina RNA sequencing and assess the role of SSRs in gene regulation. The results showed that dinucleotides with AG/CT repeats were the most abundant type of repeat motif in *P. lactiflora* and were preferentially distributed in untranslated regions. Significant differences in SSR size were observed among motif types and locations. A large number of unigenes containing SSRs participated in catalytic activity, metabolic processes and cellular processes, and 28.16% of all transcription factors and 21.74% of hub genes for inflorescence stem straightness were found to contain SSRs. Successful amplification was achieved with 89.05% of 960 pairs of SSR primers, 55.83% of which were polymorphic, and most of the 46 tested primers had a high level of transferability to the genus *Paeonia*. Principal component and cluster dendrogram analyses produced results consistent with known genealogical relationships. This study provides a set of SSRs with abundant information for future accession identification, marker-trait association and molecular assisted breeding in *P. lactiflora*.

**Keywords:** herbaceous peony; molecular marker; next-generation sequencing; pedigree

#### **1. Introduction**

Herbaceous peony, which has many varieties with distinct flower types and colors, provides great commercial benefits in the form of cut flowers and potted plants. It has a long juvenile period before flowering, which slows the development of new cultivars with specific and stable characteristics by traditional hybridization breeding [1]. Based on appropriate DNA markers, molecular-assisted breeding can be employed to select a target genotype and detect whether hybrids have the expected trait at an early stage; thus, it improves breeding efficiency and accuracy and saves time, labor and material resources [2]. The molecular breeding of herbaceous peony is not currently well developed due to a lack of foundational research; hence a large number of highly polymorphic and stable molecular markers of herbaceous peony should be developed to further identify associations with target traits.

Microsatellites, also known as simple sequence repeats (SSRs), are widely used for plant fingerprinting, genetic diversity assessment and association analysis between target traits and quantitative trait loci (QTLs) [3–5] due to their abundance in the genome, high polymorphism, codominant inheritance and good reproducibility [6,7]. SSRs can be developed from DNA or complementary DNA (cDNA) reverse transcribed from RNA [8]. In four previous studies on herbaceous peony, there were fewer than 16 SSR primers in each SSR-enriched genomic library or magnetic bead enrichment dataset [9–11]. Previous researchers synthesized 384 pairs of SSR primers from barcoded Illumina sequencing libraries of several species from the genus *Paeonia*; the researchers utilized 12 pairs of these SSRs and nine other SSR pairs to successfully identify 93 genotypes of *Paeonia* [12]. The numbers and repeat motif types of expressed sequence tag SSRs (EST-SSRs) from two sets of transcriptome data were previously reported by bioinformatics analysis, but additional PCR experiments or further validation were not performed [13,14]. Moreover, based on the transferability of SSRs among congeners of dicotyledonous plants [15], several primers from *Paeonia* were selected and used to successfully identify cultivars of *Paeonia lactiflora* [16]. Therefore, neither the number nor the application range of SSRs in herbaceous peony was not sufficient in these previous studies.

SSRs within genes or ESTs are more likely than genic SSRs obtained from SSR-enriched libraries or random DNA sequences to be effectively linked to target traits [17]. In *Populus tomentosa*, the genic SSRs selected from candidate genes related to wood formation were successfully used in family-based linkage mapping [18]. Twenty-four SSR primers of *Pisum sativum* were successfully mapped to several existing linkage groups [19]. Similar methods were also used in raspberry, in which SSRs were associated with several developmental traits [20]. For *Paeonia rockii*, SSR markers were used to perform association mapping, and 2.68–23.97% of flowering trait variance was explained [21]. Moreover, a genetic linkage map of tree peony covering five linkage groups was constructed by 124 EST-SSR primers [22]. Thus, development SSRs from herbaceous peony transcriptome may be effective for future use. Further studies showed the genomic distribution of SSRs is nonrandom. SSRs in genes may influence gene transcription or translation and gene activity [6,23], and recent studies showed a higher abundance of SSRs in response to environmental stress [24]. The polymorphism levels and potential functions of SSRs differ among the 5 untranslated region (5 UTR), the 3 UTR and coding sequences (CDs) are different; SSRs in 5 UTR may affect transcription or translation, SSRs in CDS may inactivate or activate genes, or truncate proteins, and SSRs in 3 UTR may cause silencing or slippage [25]. However, no such information has been reported in herbaceous peony, which is not conducive to developing desirable SSR markers.

In this study, we mined SSRs from herbaceous peony transcriptome data and analyzed the distribution and location of the SSRs and the function of unigenes containing them. Initially, a total of 960 pairs of SSR primers were developed and amplified in eight cultivars from a core collection to initially validate the polymorphism level. Then, 46 pairs of primers were used to analyze transferability among nine species in *Paeonia*. Finally, we constructed a phylogenetic tree containing seven species and 24 varieties. This study provides a number of efficient and informative SSR primers for future molecular-assisted breeding of herbaceous peony.

#### **2. Materials and Methods**

#### *2.1. SSR Identification, Annotation from Transcriptome Data and SSR Primer Design*

All the SSR sequences used in this study were obtained from transcriptome data from inflorescence stems of *P. lactiflora* 'Da Fugui' and 'Chui Touhong' at five developmental stages (i.e., stages representing every seven days from stem elongation to flowering). The transcriptome data have been deposited in the Sequence Read Archive (SRA) database as described previously (accession number: PRJNA528693) [26,27] and were assembled by Trinity 3.0. MISA-web (http://pgrc.ipk-gatersleben.de/misa/) was used to search for SSRs in the unigenes [28]. The parameters were set as follows: dinucleotide (Di-) repeats had to be repeated at least 6 times, trinucleotide (Tri-) repeats had to be repeated at least five times, tetranucleotide (Tetra-) repeats had to be repeated at least four times, pentanucleotide (Penta-) repeats had to be repeated at least four times, and hexanucleotide

(Hexa-) repeats had to be repeated at least four times. Interruptions were set to 100 to merge two SSR sequences into one SSR when the distance was shorter than 100 bp. Notably, mononucleotide repeats were not analyzed in this study. To identify possible SSR functions for future use, unigenes that contained SSRs were mapped to terms in the Gene Ontology (GO) database, gene numbers were calculated for each term, and the Kyoto Encyclopedia of Genes and Genomes (KEGG) database was used for pathway analysis. TransDecoder v3.0.1 (http://transdecoder.github.io/) was used to identify candidate coding regions, dividing transcript sequences into 5 UTR, CDS and 3 UTR sections. Primer 3 (http://bioinfo.ut.ee/primer3) [29] was used to design primers on both sides of the microsatellite sequences, following previous product size, primer length, GC content and annealing temperature principles [8].

#### *2.2. Plant Materials*

To evaluate the specificity and polymorphism of primers, fresh young leaves of *P. lactiflora* 'Qihua Lushuang', 'Jinxing Shanshuo', 'Lian Tai', 'Fu Shi', 'Da Fugui', 'Dongfang Shaonu', 'Yangfei Chuyu' and 'Hong Fushi' were obtained from Caozhou Peony Garden, Heze, Shandong Province, China, in April 2019. To evaluate transferability, fresh leaves of seven species of *Paeonia* were collected from different habitats in China: i.e., *P. lactiflora* was collected from Xilin Gol, Inner Mongolia (115◦13 –117◦06 E, 43◦02 –44◦52 N), *Paeonia emodi* and *Paeonia sterniana* were collected from Tibet (84◦35 –86◦20 E, 28◦3 –29◦3 N), *Paeonia obovata* was collected from Pingquan, Hebei Province (118◦21 –119◦15 E, 40◦24 –40◦40 N), *Paeonia anomala* was collected from Altay city (85◦31 –91◦04 E, 45◦00 –49◦10 N), *Paeonia intermedia* was collected from Yumin, Xinjiang Province (82◦12 –83◦30 E, 45◦24 –46◦3 N), and *Paeonia veitchii* was collected from Lanzhou, Gansu Province (103◦40 E, 36◦03 N). Twenty-four cultivars of *P. lactiflora* used for phylogenetic analysis were also collected from Caozhou Peony Garden. These leaves were bagged with silica gel and transported to Beijing, ground to powder with liquid nitrogen and stored at −80 ◦C in the laboratory of the National Engineering Research Center for Floriculture, Beijing, China. A DNAsecure plant kit (TIANGEN Biotech, Co., Ltd., Beijing, China) was used for DNA extraction. The quality and quantity of total DNA were estimated by a NanoDrop 2000 spectrophotometer (Thermo Scientific, Waltham, MA, USA). DNA was diluted to 30 ng/μL in preparation for polymerase chain reaction (PCR).

#### *2.3. SSR Primer Evaluation in Eight Cultivars*

A total of 960 pairs of primers were selected for synthesis (Ruibiotech, Co., Ltd., Beijing, China). To improve the efficiency of primer fluorescence labeling, the thermocycler amplification protocol was conducted in two rounds. First, the primers synthesized for the DNA of the eight cultivars were used for amplification. The 10 μL PCR mixture consisted of 0.1 μL of 10 μmol/μL forward primer containing the M13(-21) tail at its 5 end and reverse primer, 1 μL of 30 ng/μL DNA, 5 μL of 0.1 U/μL 2×Taq PCR MasterMix (containing 0.05 units/μL Taq DNA polymerase (recombinant), 4 mM MgCl2 and 0.4 mM dNTPs, Aidlab Biotechnologies Co., Ltd., Beijing, China) and 3.8 μL of ddH2O. After an initial denaturation step of 95 ◦C for 5 min, 20 cycles of 95 ◦C for 30 s, 60 ◦C for 30 s and 72 ◦C for 30 s, as well as extension at 72 ◦C for 10 min, were performed. Second, to efficiently and economically analyze the length of PCR products, fluorescently labeled (i.e., FAM, HEX, TAMRA or ROX) M13(-21) universal primers were added to the PCR mix [30]. The 10 μL PCR mixture contained 0.15 μL of 10 μmol/μL M13(-21) universal primer and reverse primer, 2 μL of the PCR product from the first round, 5 μL of 0.1 U/μL 2×Taq PCR MasterMix and 2.7 μL of ddH2O. In the thermocycler, amplification was performed at 95 ◦C for 5 min and followed by 35 cycles of 95 ◦C for 30 s, 52 ◦C for 30 s and 72 ◦C for 30 s, as well as extension at 72 ◦C for 10 min. After 3% agarose gel electrophoresis, the amplified loci of the final PCR product were detected by a 3730xl DNA Analyzer with 96 capillaries (Applied Biosystems, Foster City, CA, USA) and sized with GS500 LIZ. The amplified loci were analyzed by GeneMarker V2.2.0.

#### *2.4. Phylogenetic Analysis of Seven Species of Paeonia and 24 Cultivars of P. lactiflora*

As shown in Table 1, the 46 forward primers showing the most abundant polymorphic loci were resynthesized by adding a fluorescent label to the 5 tail. The 10 μL PCR mixture consisted of 0.2 μL of 10 μmol/μL forward primer and reverse primer, 1 μL of 30 ng/μL DNA template, 5 μL of 0.1 U/μL 2×Taq PCR MasterMix and 3.6 μL of ddH2O. PCR was performed at 95 ◦C for 5 min, followed by 35 cycles of 94 ◦C for 30 s, annealing at an appropriate temperature (as shown in Table S3) for 30 s, and 72 ◦C for 30 s and a final extension at 72 ◦C for 7 min.

**Table 1.** Size range of amplification products, sample size (N), the frequency of null allele at locus (Null allele), number of different alleles (Na), number of effective alleles (Ne), Shannon's information index (I), observed heterozygosity (Ho), expected heterozygosity (He), fixation index (F) and polymorphism information content (PIC)./means we did not calculated the frequency of null allele at the locus because there was no significant difference in the observed and expected value in this locus and it followed Hardy-Weinberg equilibrium (Chi-Square tests, *p* < 0.05).


Polymorphism information content (PIC) was calculated by the Microsatellite Toolkit according to the methods of a previous study [31]. GenAlEx 6.51b2 was used for genetic analysis and principal coordinates analysis (PCoA), and parameters including the number of alleles (*Na*), tests for Hardy-Weinberg equilibrium, the number of effective alleles (*Ne*), Shannon's information index (*I*), heterozygosity (*Ho*), heterozygosity (*He*) and PIC were calculated [32,33]. The frequency of null alleles at loci were estimated by maximum likelihood method [34]. A dendrogram of the accessions of *Paeonia* was generated according to Bruvo's distance with 1000 bootstrap replicates by the R package poppr [35].

#### **3. Results**

#### *3.1. Numbers and Distribution of SSRs in Transcriptome Data*

A total of 122,670 unigenes with a total length of 1.06E+08 bp were searched by MISA-web, and 10,468 SSRs (including 825 compound formations) were found. These SSRs were distributed among 8837 unigenes (7.20%), 1321 of which contained more than one SSR. As shown in Figure 1A, Di- repeats were the most abundant (63.52%) type of repeat motif, followed by Tri- repeats (22.70%), Tetra- repeats (7.73%) and all the other types of repeat motifs (6.05%). Di- repeats were of four types, namely, AG/CT (36.55%), AT/AT (14.78%), AC/GT (11.97%) and CG/CG (0.23%). The number of each type of Di- repeat (except CG/CG) exceeded the number of SSRs. TransDecoder analysis showed that these SSRs involved 4482 CDSs and 3958 unigenes. As shown in Figure 1B, the positions of many motifs were not putative, and in known positions of unigenes, different motifs exhibited distinct preferences. Di- repeats were mostly located in the 5 UTR, followed by the 3 UTR. In CDSs, Tri- repeats were the most abundant motif, and Tetra- repeats were mostly located in the 3 UTR and 5 UTR.

**Figure 1.** Distribution of different repeat motifs and positions of simple sequence repeats (SSRs) in unigenes. (**A**) Proportion and distribution of each type of motif in dinucleotide (Di-), trinucleotide (Tri-), tetranucleotide (Tetra-) and other (i.e., Penta-, Hexa- and compound) repeats. In the legend, 'other Tri-' consists of ACG/CGT (0.26%), ACT/AGT (0.26%) and CCG/CGG (0.42%), and 'Other Tetra-' consists of 26 types of Tetra- repeats, the most abundant of which are AATC/ATTG (0.22%) and AGGG/CCCT (0.22%). (**B**) Abundances of six motifs in different unigene positions. Two types of SSRs were located in multiple regions. One of these SSRs was located across two 5 UTRs, a coding sequences (CDS) and a 3 UTR; another was located in multiple CDSs in one unigene. The SSR locations differed from each other. Unknown refers to the SSRs without matching locations.

SSR size was analyzed as shown in Figures S1 and S2 and Table S1. For each type of repeat motif, the smallest SSRs were the most abundant, with sizes of 12 for Di- repeats, 15 for Tri- repeats, 16 for Tetra- repeats, 20 for Penta- repeats and 24 for Hexa- repeats, all of which had distinct discrete values with the increase in repeat motifs. Different types of repeat motifs exhibited significantly distinct sizes according to pairwise comparisons (Wilcoxon rank sum test, with Bonferroni adjustment). For all the SSRs, the most common size was 12 for Di- repeats (frequency of 1894), 15 for Tri- repeats (frequency of 1309) and 14 for Di- repeats (frequency of 1200). For each location (including the unknown positions) of

SSRs, a large number of discrete values were observed to have high repetitions. Significant differences in the sizes of SSRs among the 5 UTR, 3 UTR and CDSs appeared according to pairwise comparisons (Wilcoxon rank sum test, with Bonferroni adjustment). Only the size of SSRs in the 5 UTR differed from that in multiple regions. The size of SSRs in unknown regions was not obviously different from that in the 3 UTR and multiple regions.

#### *3.2. Annotation of the Unigenes with SSRs*

To understand the potential functions of the unigenes containing SSRs, we classified these unigenes using GO annotation. A total of 8837 unigenes were blasted against a protein database and annotated with GO terms. The results showed that 8522 unigenes (96.44%) were involved in three functional groups and 42 putative processes or functions, as shown in Figure 2. These unigenes participated in 17 types of biological processes, 11 types of molecular functions and 14 types of cellular components. Catalytic activity (972, 11.40% of the total blasted genes), metabolic process (959, 11.25%) and cellular process (930, 10.91%) were the three most abundant terms for putative gene functions. Fewer than ten unigenes were involved in each of the two types of biological processes, six types of molecular functions and four types of cellular components, and rhythmic process (1), transcription factor (TF) activity and protein binding (1) and extracellular region part (1) were the terms assigned the fewest unigenes containing SSRs.

**Figure 2.** Gene Ontology (GO) analysis of unigenes containing SSRs. The lighter color of each bar represents the number of unigenes without matching coding sequences.

To further categorize the unigenes, KEGG annotation was used. As shown in Figure 3, unigenes with SSRs were involved in 126 pathways, which were divided into five classes and 18 subclasses. In total, unigenes participating in metabolism were most abundant. At the subclass level, the global and overview pathway (38.22%) had the largest number of unigenes, followed by the transcription pathway (11.44%), and the carbohydrate metabolism pathway (9.79%). The top five unigenes belonged to global and overview subgroups, and they were involved in metabolic pathways (15.38%), biosynthesis of secondary metabolites (8.63%), biosynthesis of antibiotics (4.11%), microbial metabolism in diverse environments (3.98%) and carbon metabolism (2.73%).

**Figure 3.** Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway annotation of unigenes with embedded SSRs. The numbers outside the circle represent the cumulative number of unigenes beginning at zero and moving in a clockwise manner.

As shown in Table S2, the TFs containing SSRs accounted for 28.16% of the total TFs (1218) in the transcriptome of herbaceous peony, including 51 kinds of TF families (e.g., ERF, MYB, MYB related and ARF). Regarding the hub genes (46) for inflorescence stem straightness, as described in a previous study [27], 21.74% of the genes contained SSRs and were involved in lignin monolignol biosynthesis (*4CL1*, *CCoAOMT2*, *HST* and *CAD2*), xylan synthesis and metabolic process (*IRX-15 L*), auxin signaling transduction (*IAA26*, *IAA31* and *SAUR20*) and lateral organ boundary domain TF (*LBD15* and *LBD36*) pathways.

#### *3.3. Initial Amplification of SSR Primers*

Primer3 was used to design primers for the 9643 SSRs. Appropriate primers could not be designed for 2369 of these SSRs due to short or missing flanking sequences. A total of 7274 pairs of primers were designed, 3721 of which were able to identify CDSs. We further selected 960 pairs of primers considering SSR types and locations for synthesis and used them for amplification in eight distinct cultivars of *P. lactiflora*. As shown in Figure S3, 89.05% of the primers resulted in successful amplification in these cultivars, and 55.83% (i.e., 62.72% of the total with successful amplification) of the SSR marker primers had polymorphic amplification products. The total number of polymorphic loci decreased with an increasing number of alleles per locus. Appropriately 30% of the primers amplified only two or three types of products, and almost 16% of the primers amplified more than five alleles in the eight DNA templates.

#### *3.4. Polymorphism in P. lactiflora*

The 46 primers (listed in Table S3) with the most abundant amplified loci were used to reveal the information and transferability of the primers, as shown in Table 1.

Forty-four pairs of primers were amplified in the accessions; however, TA564 (172–201 bp) and TA566 (167–185 bp) were successful in only 26 and 25 accessions, respectively. The product size range varied among accessions. The products of T852 (209 bp) and TA144 (100 bp) had the maximum size difference among accessions, and T163 (14 bp), T237 (14 bp) and TA464 (14 bp) presented the smallest size differences among accessions. A total of 472 different alleles (Na) were amplified. The Na ranged from 6 to 16, with an average value of 10.26 ± 0.36; the Ne ranged from 2.47 to 9.96, with an average value of 5.26 ± 0.21. I varied from 1.39 to 2.41, with an average value of 1.88. The ranges of Ho and He were 0.13–0.97 (the average was 0.67) and 0.59–0.90 (the average was 0.80), respectively. SSR typing data of seven locus followed Hardy-Weinberg equilibrium, and the frequency of null alleles in other locus were 0.000–0.391 (the average was 0.11). The PIC ranged from 0.57–0.89 with an average value of 0.77.

#### *3.5. Transferability of SSR Markers among Paeonia Species*

PCoA of 31 accessions was conducted according to the amplified alleles, as shown in Figure 4A. Eigen values by axis and sample eigen vectors are shown in Table S4. A total of 30 dimensions were extracted, and dimension 1 (10.7%) and dimension 2 (7.1%) represented 17.8% of the total information. The points representing the 24 cultivars of *P. lactiflora* were close to each other and separated from points of the other seven species. Of these species, *P. lactiflora* was nearest to the 24 cultivars in the PCoA plot, and *P. intermedia* was significantly separated from all the accessions. A cluster dendrogram was drawn according to Bruvo's distances calculated by the amplified alleles, as shown in Figure 4B. All the accessions were divided into two groups, and the species *P. lactiflora* and its cultivars were tightly clustered and separated from the other *Paeonia* species, which was consistent with the results of PCoA. Furthermore, at a height of 0.85, *P. obovata* and *P. emodi* were further clustered and separate from the other four species. Among the cultivars of *P. lactiflora*, 'Yinxian Xiuhongpao' had the maximum distance from the other cultivars and was separated at a height of 0.65.

**Figure 4.** Principal coordinates analysis (PCoA) of amplified loci (**A**) and dendrogram generated by Bruvo's distances (**B**) of 31 accessions, including seven species and 24 cultivars of *Paeonia lactiflora*. The cultivar names are abbreviated with capitalized letters in (**A**), and their full names are shown in (**B**). The UPGMA tree was produced with 1000 bootstrap replicates, and the node values greater than 50 are shown in the tree.

#### **4. Discussion**

Next-generation sequencing makes it possible to develop microsatellites efficiently and inexpensively [8]. In this study, we identified a total of 10,468 SSRs, covering 7.20% of the transcripts assembled using our transcriptome data. This method was significantly more convenient and effective than the use of SSR-enriched genomic libraries or magnetic bead enrichment, which were the primary methods used in previous herbaceous peony studies [9–11,36,37]. The coverage of SSRs in ESTs reported in the present study was higher than that reported in five cereals (average of 3.2%) [38] and similar (6.6%) to that generated by Trinity for *P. lactiflora* 'Hang Baishao' [13]. The distribution of motif types varies among plants. The most frequent motif type for *Parrotia subaequalis* was Di- [39], while the most frequent motif type for *Lychnis kiusiana* and *Dendrocalamus hamiltonii* was Tri- [40,41]. In this study, Di- repeats were the most abundant motif (mononucleotides were not considered) in *P. lactiflora*, and AG/CT accounted for 36.55% of all SSRs, followed by AT/AT (14.78%) and AC/GT (11.97%). The observation that AG/CT was the most frequent repeat was consistent with the finding of a previous study in the genus *Paeonia* regardless of assemble methods [13], while the proportions of TC/GA and AC/GT repeats significantly differed [14,42]. Differences in the transcriptomic SSR motifs can explain the relatively low transferability of SSR primers (approximately 26%) from the genus *Paeonia* to *P. lactiflora* [16].

SSR size varied significantly among unigene locations and motif types in this study. Furthermore, our results suggested distinct preferences in the distribution of motifs among different gene parts; in annotated positions, a large number of Di- repeat motifs were distributed in the 5 UTR and 3 UTR, and most of the Tri- repeat motifs were distributed in CDSs. Polymorphism level is affected by location; notably, a large proportion of SSRs in the 3 UTR were polymorphic in *Hordeum vulgare* [43]. Further studies in grape showed that the most polymorphic SSR position differed at three levels, that is, among cultivars, among cultivars and species, and among species and genera [25]. As shown in Figure S3, in the initial screening of 960 SSR primers in eight cultivars, we found that at the cultivar level, SSRs from the 5 UTR (64.05%) were the most polymorphic, followed by those from the 3 UTR (60.61%). This result suggested that the polymorphism level of SSR locations was related to species.

SSRs from the transcribed sequence may be directly related to phenotypic variation and thus related to functional trait. SSR alleles associated with biotic or abiotic stress, such as heat, cold, salt and resistant to multiple diseases have been reported [44–47]. SSRs from specific organ are likely to associate with corresponding morphological traits; SSRs obtained from flower bud transcriptome in *P. rockii* have been demonstrated significantly associating with flower colors and shapes [21]. In this study, SSRs were investigated from transcriptome that was obtained from two cultivars with distinct straightness of inflorescence stem, and a large quantity of unigenes with SSRs were annotated with the catalytic activity, metabolic process and cellular process terms. Furthermore, 28.16% of all TFs and 21.74% of the hub genes for inflorescence stem straightness contained SSRs. We speculate that these SSRs likely associated with straightness characteristics of the herbaceous peony inflorescence stem, while association analysis and QTL mapping are needed in further study.

In our experiment, 89.05% of 960 pairs of primers were validated by PCR, and 55.83% of the primers were polymorphic, which was approximately the percentage (59.90%) previously reported in peony [10], higher than the percentage (36.67%) in *Amentotaxus argotaenia* [48], and lower than the percentage for SSR markers (77.2%) generated from the soybean genome [49]. These amplification differences may be due to the number of individuals used for amplification or locus mutations (e.g., insertions, deletions and translocations) among species or cultivars [50]. To identify the reason, more individuals should be subjected to PCR amplification, and cloning experiments and sequencing should be carried out.

The mean Na in this study was 10.26, and Ho and He were 0.67 and 0.80, respectively, which were higher than the Ho and He reported in previous studies on tree peony and herbaceous peony [9,10,51]. The mean PIC value was 0.77, showing a high level of high informativeness [52]; compared to mean PIC value (0.4149–0.678) revealed by previous SSR development [9,10,36,37], our results were significantly

higher, suggesting our contribution for future effective genetic analysis or QTL mapping of *Paeonia* with fewer SSRs. Previous studies suggested that the presence of null alleles is common, and it has influence on evaluation of genetic diversity of population, even causes misunderstanding in parentage analysis [53,54]. Literature showed that the frequencies of null alleles were almost fewer than 0.40 and most of them were fewer than 0.20 [55]. Our results showed the frequencies of null alleles of 22 SSRs were less than 0.08 (or no presence), and only that of four SSRs were between 0.20 and 0.40. The future use of these SSRs should carefully consider the influence of null alleles according to research objective and choose the appropriate SSRs.

EST-SSRs developed for one species can be transferred to related species, with transferability varying depending on the plant and SSR source. In *Magnolia wufengensis* and *Elymus sibiricus*, the transferability of EST-SSRs to related species was 50–68.1% and 49.1%, respectively. In SSRs from candidate genes of *Oryza sativa*, transferability ranged from 70.37% to 77.78% according to different complexes [56]. In this study, 52.17% of the 46 pairs of SSR primers selected from the initial screening could be completely transferred to seven species of the genus *Paeonia*, 39.13% of the pairs had high transferability (six or five of seven accessions were successfully amplified), and 4.35% of the pairs had moderate transferability and could be amplified in four of seven species.

The diversity of dimensions extracted from PCoA and the low explanatory power of one dimension suggested that the genetic background of the *Paeonia* accessions involved in this study varied greatly. Combining the PCoA plot and the dendrogram of 31 accessions, the genetic relationships between these accessions were almost consistent with their recognized morphological classification [57,58]. These SSR markers can be used in genetic variance analysis and to initially evaluate the value of breeding parents selected according to genetic distance in the genus *Paeonia*.

#### **5. Conclusions**

In this study, a large quantity of informative SSRs were conveniently identified from transcriptome data of *P. lactiflora*, and the distribution and location of motifs were defined. SSR containing genes associated with TFs and inflorescence stem straightness were identified, providing a foundation for future marker-trait association research. To the best of our knowledge, this is the first study to comprehensively reveal the characteristics and functional annotations of EST-SSRs in *P. lactiflora*. In future studies, more herbaceous peony accessions should be tested to further evaluate the polymorphism of markers, and more functional markers potentially associated with traits should be developed to advance the molecular breeding of *P. lactiflora*.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2073-4425/11/2/214/s1, Figure S1: Size distribution of SSRs with different motifs and different unigene positions. (A) SSR size of different motifs. Considering the data coverage and to improve the readability of the graph, the figure shows 99% of the data. The distribution of all data is shown in Figure S2. Size differences between motifs were analyzed with pairwise comparisons using the Wilcoxon rank sum test (after Bonferroni correction), and all the outputs were less than 2.2 <sup>×</sup> 10<sup>−</sup>16. (B) Sizes of SSRs in different unigene positions. The figure shows 98% of the data. The distribution of all data is shown in Table S1. Size differences between positions were compared using the same method as in (A), and the figure shows combinations with p-values less than 0.05, Figure S2: Size distribution of each type of repeat motif in all data, Figure S3: Polymorphism rate of primers designed for different positions of SSRs in unigenes, Table S1: Numbers of SSRs with distinct sizes in different regions, Table S2: Count of SSRs in TFs, Table S3: Sequence, position, repeat motif, annotation, annealing temperature and amplified product size of 46 SSR primers, Table S4: Eigen values by axis and sample eigen vectors.

**Author Contributions:** Conceptualization, Y.W. and Y.L.; methodology, Y.W. and M.Z.; software, Y.W.; validation, Y.W., M.Z. and Y.Z.; formal analysis, Y.W.; investigation, A.H. and Y.Z.; resources, A.H. and Y.Z.; data curation, M.Z.; writing—original draft preparation, Y.W.; writing—review and editing, M.Z., A.H., Y.Z., and Y.L.; visualization, Y.W.; supervision, Y.L.; project administration, Y.L.; funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Beijing Municipal Science & Technology Commission, grant number D161100001916004.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Genetic Analysis of QTL for Resistance to Maize Lethal Necrosis in Multiple Mapping Populations**

**Luka A. O. Awata 1,2,3, Yoseph Beyene 2, Manje Gowda 2,\*, Suresh L. M. 2, McDonald B. Jumbo 2, Pangirayi Tongoona 3, Eric Danquah 3, Beatrice E. Ifie 3, Philip W. Marchelo-Dragga 4, Michael Olsen 2, Veronica Ogugo 2, Stephen Mugo <sup>2</sup> and Boddupalli M. Prasanna <sup>2</sup>**


Received: 24 October 2019; Accepted: 24 December 2019; Published: 26 December 2019

**Abstract:** Maize lethal necrosis (MLN) occurs when maize chlorotic mottle virus (MCMV) and sugarcane mosaic virus (SCMV) co-infect maize plant. Yield loss of up to 100% can be experienced under severe infections. Identification and validation of genomic regions and their flanking markers can facilitate marker assisted breeding for resistance to MLN. To understand the status of previously identified quantitative trait loci (QTL)in diverse genetic background, F3 progenies derived from seven bi-parental populations were genotyped using 500 selected kompetitive allele specific PCR (KASP) SNPs. The F3 progenies were evaluated under artificial MLN inoculation for three seasons. Phenotypic analyses revealed significant variability (*P* ≤ 0.01) among genotypes for responses to MLN infections, with high heritability estimates (0.62 to 0.82) for MLN disease severity and AUDPC values. Linkage mapping and joint linkage association mapping revealed at least seven major QTL (*qMLN3\_130* and *qMLN3\_142*, *qMLN5\_190* and *qMLN5\_202*, *qMLN6\_85* and *qMLN6\_157 qMLN8\_10* and *qMLN9\_142*) spread across the 7-biparetal populations, for resistance to MLN infections and were consistent with those reported previously. The seven QTL appeared to be stable across genetic backgrounds and across environments. Therefore, these QTL could be useful for marker assisted breeding for resistance to MLN.

**Keywords:** multiple population; linkage mapping; JLAM; QTL; validation; genomic prediction; maize lethal necrosis

#### **1. Introduction**

Maize lethal necrosis (MLN) is a major disease in sub-Saharan Africa (SSA) caused by co-infections of maize chlorotic mottle virus (MCMV) and sugarcane mosaic virus (SCMV) [1]. MCMV can able to interact with any member of the Potyviridae family to cause lethal necrosis in maize [2]. Yield loss due to MLN can reach up to 100% under severe infection and MLN favorable environments [1,3]. Breeding for host resistance to MLN is the most effective means of preventing yield losses in farmer's fields. Application of molecular markers could enhance breeding for resistance to MLN. Although markers are widely used in breeding for crop improvement including maize, the tools are inconsequential unless

the linked markers or quantitative trait loci (QTL) are tested for their effectiveness and reproducibility in different genetic backgrounds. QTL validation adds weight to assess the effectiveness of alleles and their linked markers.

QTL mapping approaches are one of the popular genomic tools to dissect the genetic architecture of complex traits [4]. The presence of QTL conferring resistance to several viral diseases in maize has been investigated in numerous linkage and linkage disequilibrium mapping studies [4–8]. QTL mapping or linkage mapping is known for high QTL detection power. Joint linkage association mapping (JLAM) based on different segregating biparental populations is known to provide both the high QTL detection power and high mapping resolution [9,10]. Application of both linkage mapping and JLAM in multiple bi-parental populations is useful to validate earlier findings and to detect possible new sources of resistance for MLN.

Several studies were conducted to validate QTL effects on traits of economic importance in different crops including maize [11–15]. Sukruth et al [16] validated the markers associated with late leaf spot and rust in groundnut using recombinant inbred line (RIL) population and two backcross populations. Zhou et al [17] validated two major QTLs (LEN-3H and LEN-4H) for kernel length in wild barley using a biparental population derived from Fleet (*Hordeum vulgare* L.) and Awcs276. The authors reported significant association between the two QTL and kernel length. Gowda et al [7] employed four biparental maize populations and adopted linkage mapping and joint linkage mapping options to discover and confirm QTL associated with resistance to MLN. They consistently confirmed in two of the populations that three major QTLs were localized on chromosomes 3, 6, and 9. For effective use of trait linked markers and resources, validation of discovered QTL with large expected impact is crucial, as there are too many QTLs to validate and validation population development and assessment is expensive [18–20]. It is a pre-requisite though that for QTL to be effectively used in crop improvement, it should be confirmed that their effects remain consistent across populations and environments [12,16,21].

QTL validation study greatly contributes towards increased resolutions of some of the target QTLs as well as complementarity to previous findings either from genome wide association studies (GWAS) or JLAM or any other approaches. The CIMMYT Global Maize Program has recently identified a number of QTL across maize chromosomes which are associated with resistance to MLN in multiple mapping populations [7,8]. The validation of these QTL using seven different F3 mapping populations would provide better understanding of QTLs associated with resistance to MLN and justify their application for marker assisted breeding towards improvement of maize lines for resistance to MLN. Further this study also helps to find new source of resistance which might not have been reported in earlier studies. This study was aimed to: (i) evaluate seven different F3 populations for their responses to MLN under artificial inoculation; (ii) conduct individual population-based QTL mapping; (iii) apply JLAM with three biometric models and compare with linkage mapping results to identify stable and/or unique QTL; and (iv) assess the potential of genomic prediction for MLN resistance within biparental populations with low marker density.

#### **2. Materials and Methods**

#### *2.1. Plant Materials*

Seven elite maize parental lines with contrasting response to MLN developed by CIMMYT through pedigree and DH breeding schemes were used in this study. Parents CKDHL120918, CML494, CKLTI0227, and CKDHL312 were tolerant to MLN; CML543 and CKDHL221 were moderately tolerant, and CKDHL0089 was susceptible to MLN. These materials were also known for tolerance to various biotic and abiotic stresses with good agronomic performances. The seven bi-parental populations (F3 pop1–CKDHL120918 × ML494, F3 pop2–CML543 × CML494, F3 pop3–CKDHL120918 × CML543, F3 pop4–CKLTI0227 × CKDHL120918, F3 pop5–CKDHL0089 × CKDHL120918, F3 pop6–CKDHL0221 × CKDHL120312 and F3 pop7–CKDHL0089 × CML494) were used for linkage mapping and JLAM. To develop F3 populations, crossing blocks were established in the nursery in Kiboko, Kenya (37◦75 E; 2◦15 S; 975 m a.s.l.; of 530 mm/year of rain fall and temperature ranges from 14.3 to 35.1 ◦C) during the 2016/2017 cropping season. Seven bi-parental crosses were made among the seven elite parental lines. Single cross (F1) seeds were grown and F1 plants were selfed. About 300 to 350 F2 plants from each population were randomly selfed and F3 seeds were harvested.

#### *2.2. Phenotypic Evaluation*

At least 306 F3 families from each population with their seven parental lines and six commercial checks were evaluated to determine their response to MLN under artificial inoculation in field. Experiments were conducted for three seasons (April 2017, April 2018, and October 2018) in confined MLN facility in Naivasha (36◦26 E; 0◦43 S; 1896 m a.s.l.; 677 mm/year of rain fall and temperature ranges from 12 to 29 ◦C), hereafter seasons are referred as environments. Trials were planted using alpha lattice design in a 1-row plot of 3.0 m long, with spacing of 0.75 m between rows and 0.25 m between plants. Two seeds were planted per hill and thinned to one plant per hill 3 weeks after germination, making a total of 13 plants per row. Standard agronomic practices were adopted.

MLN inoculum was obtained from separate, sealed greenhouses maintained in Naivasha for each of SCMV and MCMV [7]. Maintenance of virus stocks in the greenhouses was described earlier [22]. In brief, MLN infected leaf tissues were collected from the field and ground in grinding buffer solution at 1:10 dilution ratio (10 mM potassium-phosphate, pH 7.0) [1,22]. The resulting sap extract was centrifuged at 12,000 rpm for 2 min. The sap was decanted and celite powder was added at the rate of 0.02 g/mL. To propagate the viruses in the greenhouses, a susceptible maize hybrid (H614) was grown and infected by rubbing the sap on the leaves at two to four leaf stages. A separate, sealed greenhouse was the maintained for each virus as stock for further use.

Three weeks before inoculation of the experimental materials, enzyme linked immunosorbent assay (ELISA) test was conducted on leaf samples, randomly collected from infected plants in the greenhouses, to determine the presence and purity of the MCMV and SCMV [7]. Separate extracts from the SCMV and MCMV infected plants was prepared and the two extracts were then mixed to form MLN inoculum at the ratio of 4 parts of SCMV to 1 part MCMV (weight/weight). Two inoculations were applied at the 4th and 5th week after planting. In order to keep uniform disease pressure across experiments, a motorized, backpack mist blower (Solo 423 Mist Blower, 12 L capacity) with an open nozzle (2-inches diameter) was used to inoculate the experimental plants at a high inoculum delivery pressure of 10 kg/cm. Drip irrigation was used to provide water and fertilizer. All other agronomic practices relating to maize production were followed according to standard procedures for field practices. Spreader rows of susceptible maize hybrid (H614) were planted along the experiment to enhance disease spread and intensity [16,23].

MLN Disease severity (MLN-DS) score started 10 days after the 2nd inoculations and was recorded four times at 10 days interval using standardized qualitative scale of 1 to 9, where 1 = resistant, clean, no symptoms; 2 = fine or no chlorotic specks, but vigorous plants; 3 = mild chlorotic streaks on emerging leaves; 4 = moderate chlorotic streaks on emerging new leaves; 5 = chlorotic streaks and mottling throughout plants; 6 = intense chlorotic mottling throughout plants, necrosis on leaf margins; 7 = excessive chlorotic mottling, mosaic and leaf necrosis, at times dead heart symptoms; 8 = excessive chlorotic mottling, leaf necrosis, dead heart and premature death of plants; and 9 = susceptible (complete plant necrosis and dead plants) [7,8]. After analyzing MLN-DS for each time score, we chose the third score (40 days post-inoculation) for further analysis because of its higher heritability and full expression of disease symptoms. The area under the disease progress curve (AUDPC) was calculated for each plot using SAS 9.4 (SAS Institute Inc., 2015, Cary, NC, USA) so as to understand the trend of development of MLN severity across the score intervals.

#### *2.3. Phenotypic Data Analysis*

All quantitative genetic parameters were estimated based on the performance of the 2142 F3 lines. To check the quality of the data, first, test for normality of distribution of error terms for MLN-DS and AUDPC was determined using the Shapiro-Wilk test [24]. Secondly, analyses of description statistics (mean, range, skewness and kurtosis) and correlation among phenotypic traits were performed using META-R [25]. We assumed that each environment was a representative of a replication. Therefore, statistical analysis of phenotypic data was conducted using linear mixed model:

$$\mathbf{Y}\_{\rm ijk} = \mu + \varrho\_i + \mathbf{l}\_{\parallel} + \mathbf{b}\_{\rm kj} + \varepsilon\_{\rm ijk,} \tag{1}$$

where Yijk was the disease severity of the ith genotype at the jth environment in the kth incomplete block, μ was an intercept term, gi was the genetic effect of the ith genotype, lj was the effect of the jth environment, bkj was the effect of the kth incomplete block at the jth environment, and εijk was the error term confounding with the genotype-by-environment interaction effect. To determine variance components by the restricted maximum likelihood (REML) method, both block and genotype effects were treated as fixed. Significance of variance component estimates was tested by model comparison with likelihood ratio tests where the halved *P*-values were used as approximations [26]. Heritability (*H*2) on an entry-mean basis was estimated as the ratio of genotypic to phenotypic variance. The phenotypic variance comprises genotypic variance and the masking GxE interaction variances divided by the number of environments. Further, the mixed linear model (MLN) established in META-R software (http://hdl.handle.net/11529/10201) was adopted and the best linear unbiased predictor (BLUP) and best linear unbiased estimator (BLUE) for each genotype across environments were generated. The BLUPs were used in linkage mapping and joint linkage association mapping analyses whereas BLUEs were used in genomic prediction studies.

#### *2.4. DNA Extraction, Genotyping, Linkage Map Construction and QTL Analysis*

In this study, SNP markers were first screened on parental lines and 500 markers, which showed polymorphism between at least two of the seven parents were selected and used to genotype all the populations. Leaf samples were collected from 306 individuals per population 2 to 3 weeks after emergence, based on CIMMYT laboratory protocols [27]. The leaf samples were sent to LGC Genomics (https://www.biosearchtech.com/services/genotyping-services, Herts, UK) and were genotyped using 500 SNPs. Genotypic data obtained from LGC was subjected to quality check and SNPs were called and filtered using TASSEL version 5.0 software [28].

Polymorphic markers for each population were selected. Segregation of each SNP was verified for deviation from classic Mendelian inheritance using *x*2-test and SNPs that significantly deviated were discarded [29,30]. Linkage maps were created for the individual populations using the MAP function established in QTL IciMapping v 4.1 software [24,31]. Linkage groups were identified using Group command based on logarithm of odds (LOD) score of 3.0, and recombination rate were converted into centimorgans (cM) using Kosambi mapping function [32]. Ordering of the markers was conducted using the "ordering" instruction with the nnTwoOpt algorithm. Adjustment of the map order was done according to the sum of adjacent recombination frequencies (SARF) and sorted in the "rippling" instruction with a window size of 5 as the amplitude. Instruction generated from the map was used to draw and visualized the map using Map Chart software [33].

The BLUPs obtained from across environments for MLN-DS and AUDPC for each population were subjected to inclusive composite interval mapping (ICIM) analysis so as to determine the QTL linkage. The ICIM is an effective two-step statistical approach that allows separation of co-factor selection from interval mapping process, in order to control the background effects and improve mapping of QTL with additive effects [31]. A LOD threshold of 3.0 with a scanning step of 1 cM were used to declare significant QTL [21,34]. Stepwise regression was adopted to determine the percentages of phenotypic variance explained (*R*2) by individual QTL and additive effects at LOD peaks. QTL nomenclature [35]

was adopted to nominate QTL conferring MLN resistance, where a two or three letter abbreviation of trait, followed by the chromosome number on which the QTL is found and the marker position to distinguish multiple QTLs were employed. The percentages of phenotypic variance explained (% PVE) by individual QTL, and additive and dominant effects at LOD peaks were generated. Sources of favorable alleles were determined depending on signs of the QTL additive effects [7,36]. For each F3 population, estimated additive (a) and dominance (d) effects for each QTL were used to calculate the ratio of dominance level (|*d*/*a*|). This ratio was used to classify the nature of QTL as follow [37]: additive (A; 0 ≤ |*d*/*a*| ≤ 0.2); partially dominant (PD; 0.2 ≤ |*d*/*a*| ≤ 0.8); dominant (D; 0.8 < |*d*/*a*| ≤ 1.2) and over dominant (OD; |*d*/*a*| > 1.2).

Based on the SNP markers shared by different populations, an integrated map was built by using IciMapping software [31]. In brief, SNPs overlapped across genetic maps were selected as anchor markers and used to integrate corresponding linkage groups on individual linkage maps. The marker order and marker positions were calculated after calculating the order and the relative position (within each genetic map) of the anchored markers, followed by integrating of all the detected markers into one map. Then all QTL identified from the seven populations were projected onto the integrated map based on their confidence interval.

#### *2.5. Joint Linkage Association Mapping (JLAM)*

The JLAM method is a combination of linkage mapping and association mapping approaches [38,39]. Here, BLUPs obtained from across seven bi-parental populations and about 420 SNPs with missing values of <5% and minor allele frequency >0.05 were considered for JLAM study. We employed three biometric models to elucidate the QTL-trait relationships [7]. For each model, first, co-factors were selected using stepwise multiple linear regression based on the Schwarz Bayesian Criterion [40], following the Proc GLM SELECT model of SAS 9.2 [41]. Secondly, *P*-value and *F*-test were determined based on the full model (with QTL effects) and the reduced model (without QTL effects), followed by QTL scan using R software (version 3.5.3). The approach for Model A involved the trait as a function of the co-factor + marker. Model B incorporated population effect as an additional factor to adjust for the population structure so that trait = population + co-factor + marker; and Model C involved nesting of the co-factor and marker within population such that trait = pop + co-factor (pop) + marker (pop) [10,42]. Significance of association between QTL and MLN resistance was determined at *P* < 0.05 following Bonferroni-Holm procedure [43]. The adjusted total phenotypic variance explained (*R*2) values for the detected QTL were generated when the significant QTL were simultaneously fitted in linear model. Further principal component analysis (PCA) of all F3 lines was carried out using TASSEL version 5.2 [28], and the CurlyWhirly version 1.15 software [44] was used to construct PCA biplot for the population structure.

#### *2.6. Genomic Prediction*

Due to its advantages over either phenotypic and marker assisted selection, genomic prediction (GP) is becoming a popular approach for breeding, especially for complex traits [45,46]. The technique allows for the selection of superior genotypes without conducting phenotypic evaluation, if a subset of the population has genotypic and phenotypic data. Here, we performed GP analyses using ridge-regression BLUP (RR-BLUP) with five-fold cross-validation. Uniformly distributed, polymorphic, SNPs from each population showing minor allele frequency of less than 5% were used.

Genotypes were sampled from each population and used to constitute both training set and testing set. The sampling of the sets was repeated 100 times [7]. The predictive ability of the GP was computed by dividing the correlation between genomic estimated breeding values (GEBVs) and the observed phenotypic values by the square root of the heritability estimates obtained from the respective populations [46].

#### **3. Results**

#### *3.1. Response of Parents and F3 Populations to MLN Infections*

The response of F3 lines for MLN-DS and AUDPC showed continuous distribution, ranging from highly resistant or tolerant to completely susceptible in each individual population as well as across populations (Figure 1).

**Figure 1.** Phenotypic distribution of MLN disease severity (MLN-DS) and the area under disease progress curve (AUDPC) values recorded in seven individual and across F3 populations. Arrows indicate the performance of parents.

The normality tests indicated that the means were positively and moderately skewed with values ranging from 0.65 to 0.77. Kurtosis values were platykurtotic with maximum of 2.07 for AUDPC. High *W*-test values were observed for all the traits and ranged from 0.95 to 0.97 (data not shown). Individual means across populations ranged from 4.40 to 5.11 with a mean of 4.70 for MLN-DS, and the AUDPC mean values for individual populations ranged from 125.6 to 134.4 with an across population mean of 132.4. The magnitude of genotypic variance for MLN-DS was lowest (0.17) in F3 pop 2 (CML543 × CML494) and highest (0.69) in F3 pop 4 (CKLTI0227 × CKDHL120918). Genotypic variances were highly significant (*P* < 0.01) in all seven F3 populations and across populations for both MLN-DS and AUDPC values. Moderate to high broad–sense heritability estimates of 0.62 to 0.79 were found for MLN-DS and AUDPC values, indicative of high-quality phenotypic data for further genetic analysis (Table 1).


**Table 1.** Analysis of variance for MLN disease severity (MLN-DS) and area under disease progress curve (AUDPC) values using seven segregating F3 populations evaluated for three seasons under MLN inoculated plots.

\* and \*\* indicate significance at *P* < 0.05 and *P* < 0.01, respectively MLN-DS = MLN disease severity 42 dpi; AUDPC = area under disease progress curve; σ*<sup>2</sup> <sup>G</sup>* = genotypic variance; σ*<sup>2</sup> e*\* = error variance confounded with GxE variance; and *H*<sup>2</sup> = broad sense heritability.

Three-dimensional principal component analysis (PCA) of the seven bi-prenatal populations are shown in Figure 2. The results identified five distinct groups including CML543× CML494, CKDHL120918 × CML543, CKDHL0089 × CKDHL120918, CKDHL0221 × CKDHL120312, and CKDHL0089 × CML494. The distinctiveness of the five populations implies their diversity in their genetic backgrounds. However, populations CKDHL120918 × CML494 and CKLTI0227 × CKDHL120918 were overlapping and CKDHL221 × CKDHO120312 is distinct from other populations. This might indicate their relatedness and distinctness in genetic compositions.

#### *3.2. Molecular Analyses*

#### 3.2.1. Linkage Group

Genetic linkage groups consisting of 10 maize chromosomes were constructed for each of the seven F3 populations. Marker density on each map varied among the seven populations. F3 pop 4 carried the largest number of polymorphic markers with 298 SNPs and a total length of 1550.07 cM at an average density of 5.20 cM between the markers. F3 pop 1 had only 112 polymorphic SNPs spanning a total length of 1223.97 cM with a mean spacing of 10.93 cM between markers. Similarly, variation was detected in number of SNPs per linkage group (LG), with marker density per LG ranging from 40 with a mean spacing of 5.71 for LG7 to 153 and an average of 21.86 on LG1. Total linkage map consisted of 389 SNPs with a total length of 2007.85 cM and an average marker interval length of 5.16 cM (Table S1).

**Figure 2.** PCA (principal component analysis) biplot showing structures of the seven F3 populations.

#### 3.2.2. QTL Mapping

QTL analyses across environments revealed 60 and 58 significant QTLs for MLN-DS and AUDPC values for the seven populations, respectively (Table 2), with different sizes of QTL effects. Generally, the study revealed that positive alleles for resistance to MLN were donated either by male or female parent in a cross in each population. Some QTL identified in F3 pop 6 and 7 showed large additive effects. QTL detected in the other F3 populations predominantly showed dominant to over-dominant effects. Majority of the QTLs was detected in F3 pop4 (CKLTI0227 × CKDHL120918), whereby 7% of the QTL showed additive effects for MLN-DS and AUDPC values.


**Table2.**Quantitativetraitloci(QTL)detectedbyintegratedcompositeintervalsmappinganalysisforresistancetoMLNinsevenF3populationsevaluated



#### *Genes* **2020**, *11*, 32


**Table 2.** *Cont*.



resistance is contributing; a QTL name composed by the trait code followed by the chromosome

 number in which the QTL was mapped and a physical position of the QTL.

In F3 pop 1, for MLN-DS and AUDPC values, together 15 QTL were detected, which explained 30.08% and 34.72% of the total PVE, respectively. Most of the detected QTL were of small effects except *qMLN6\_85*, which had AUDPC values explaining 21.6% of PVE. In F3 pop 2, seven and 10 QTLs were detected which together explained 43.5 and 49.8% of the total phenotypic variance for MLN-DS and AUDPC values, respectively. Among all the QTLs identified, only three QTLs are specific to AUDPC values. In F3 pop3, a total of eight QTLs were found for MLN-DS, explaining 47.9% of the total observed phenotypic variability. Among the QTL with major effects, *qMLN3\_142* was detected with 28.8% and 11.1% of PVE for MLN-DS and AUDPC values, respectively. There were 19 and 10 QTLs detected which together explained 25.1 and 28.7% of the PVE for MLN-DS and AUDPC values, respectively in F3 pop4. All the detected QTLs were of small effects. For F3 pop5, seven QTL each were found for MLN-DS and AUDPC values which together explained 54.3% and 57.6% of the total phenotypic variance, respectively. QTL *qMLN3\_142* was found for both MLN-DS and AUDPC values with major effects. F3 pop6 revealed five QTLs explaining 52.1% of the variation for MLN-DS. Whereas for AUDPC values, nine QTLs were found which explained 59.1% of the total phenotypic variance. QTL *qMLN3\_130* consistently explained >25% of the phenotypic variance. F3 pop7 revealed a total of three and four QTL with 46.7% and 50.8% PVE for MLN-DS and AUDPC values, respectively.

Among several genomic regions identified with stable QTL for MLN resistance, QTL on chromosome 3 was highly consistent. We found two strongly associated SNPs (*PHM15449\_10* at 125,077,922 bp and *PZA00920\_1* at 142,821,031 bp) within this region and their allelic effects on MLN resistance in different populations were also prominent. The phenotypic values of the different allele classes of these two major-effect SNPs in five F3 populations and across populations for MLN-DS and AUDPC value were presented in Figure 3.

**Figure 3.** Box plots showing the phenotypic values of the different allele classes of two major-effect SNPs identified in five F3 populations and across populations for MLN-DS and AUDPC value. The SNP names, alleles and the specific F3 population where the effect is witnessed are mentioned above. The black horizontal lines in the middle of the boxes are the median values for the MLN-DS and AUDPC value in the respective allele classes. The vertical size of the boxes represents the inter-quantile range. The upper and lower whiskers represent the minimum and maximum values of data.

#### 3.2.3. Consensus Map Construction

Since all populations were genotyped with the same set of SNPs, we integrated the seven maps into a consensus linkage map based on the markers shared by populations (Figure 4). From the total of 118 QTLs (60 for MLN-DS and 58 for AUDPC), similar QTL detected for both traits were treated as single QTL, as a result 77 QTLs were mapped on the consensus map. Among all the QTLs detected from seven populations, 19 QTLs were consistently found in at least two populations; QTL *qMLN3\_142* on chromosome 3 was consistently identified in six of the seven populations. Two QTLs (*qMLN1\_265* and *qMLN6\_157*) were consistently found in four populations, seven QTLs (*qMLN1\_47*, *qMLN3\_130*, *qMLN4\_150*, *qMLN6\_85*, *qMLN7\_158*, *qMLN8\_10*, and *qMLN10\_9*) were repeatedly identified in three populations and nine QTLs (*qMLN3\_146*, *qMLN4\_30*, *qMLN5\_42*, *qMLN5\_160*, *qMLN5\_190*, *qMLN5\_202*, *qMLN9\_109*, *qMLN9\_142*, and *qMLN10\_114*) were consistently found in two populations. These clustered QTLs from different populations were placed on the consensus map (Figure 4).

**Figure 4.** QTL for MLN resistance in the consensus linkage map of seven bi-parental populations. Different colors represent QTL detected from different populations.

#### 3.2.4. Joint Linkage Association Mapping (JLAM)

The JLAM analyses based on three biometric models together found 27 and 28 main effect QTL for MLN-DS and AUDPC values (Table 3). For MLN-DS, 15, 12 and 18 QTLs were detected, which together explained 34.4%, 27.3% and 29.1% of the total variation in model A, B and C, respectively. Among 27 QTLs, five QTLs were consistently detected in all three models for MLN-DS. For AUDPC values, for model A, B and C, we found 14, 12 and 22 QTL which together explained 33.6%, 29% and 39.8% of the total variance. Seven QTLs were consistently detected in all three models. Most of the QTL detected through JLAM showed minor effects except one QTL *qMLN3\_148* which was also consistent across models as well as across traits. Genomic prediction (GP) was attempted to predict the genetic values in the seven F3 populations for MLN-DS and AUDPC performances. The mean prediction accuracy ranged from as low as 0.12 in F3 pop1 to 0.75 in F3 pop 7 for MLN-DS (Figure 5). The prediction accuracies were similar for AUDPC values with range of 0.13 in F3 pop1 to 0.75 in F3 pop5.


**Table 3.** Joint linkage association mapping depicting allele substitution effects and total phenotypic variance explained (PVE) using segregating F3 progenies derived from seven bi-parental populations evaluated for three seasons under MLN inoculation.

**Figure 5.** Box-whisker-plots for the accuracy of genomic predictions assessed by five-fold cross-validation within each population for MLN-DS (white box) and AUDPC values (grey box).

#### **4. Discussion**

#### *4.1. Response of Parents and F3 Populations to MLN Infections*

A total of 2142 F3 genotypes and seven parental lines were evaluated for MLN response in an artificially inoculated MLN plots for three seasons (2017–2018). Average mean for the F3 populations was lower than parental average for MLN-DS. Superiority of the populations over parents could be due to dominance and over-dominance effects resulting from combinations of favorable alleles from different parents and therefore, had better heterosis for resistance to MLN. The concept of heterosis has been widely reported in maize [47–49]. Significant mid-parent and best-parent heterosis for resistance to MLN across locations were reported in Northern Tanzania [50]. The observed heterosis could also be partially explained by higher vigor in the progenies than the inbred parents. The inbred parents are not as vigorous as the heterozygous progenies and less able to cope with stresses of various kinds including pathogens.

The significant variability observed among the populations for resistance to MLN is possibly due to genetic effects and could be an indication of differences in genetic backgrounds of the genotypes for resistance to MLN (Table 1). Further, the study detected high heritability estimates (0.71 to 0.94) for resistance to MLN in all the seven populations. Previous investigations revealed moderate to high heritability estimates (0.34–0.89) for resistance to MLN [7,22,48,51]. The high heritability estimates detected in this study corroborate earlier reports and strongly suggest that resistance to MLN could be largely influenced by several genes with major effects.

#### *4.2. QTL Analyses*

In this study, seven major QTL were identified for resistance to MLN, localized on chromosomes 3 (*qMLN3\_130* and *qMLN3\_142*), 5 (*qMLN5\_190* and *qMLN5\_202*), 6 (*qMLN6\_85* and *qMLN6\_157*), 8 (*qMLN8\_10*) and 9 (*qMLN9\_142*) and spread across the 7-biparetal populations with the percentage of PVE ranging from 10.54% to 44.50% (Table 2). The results indicate that a large portion of the phenotypic variation in MLN resistance can be explained by few QTL with major effects while the remaining portion is due to QTL with minor effects. Similar observations have been made on various quantitative traits [52], including MLN [7,8]. It was observed in the present study that some QTLs showed major effects in some populations and minor effects in other populations. This variability could be attributed to various factors and gene actions including QTL × QTL interactions or QTL × environment interactions which might impact the size of effect of any given QTL in any given biparental population.

To better understand the above interactions and their cause of variability among the genotypes, it is important to consider each component of gene action (additive, dominance and epistasis) separately. Additive effect is realized when each allele is independently contributing to genetic variability. Dominance effect is a variation caused by interaction between alleles at a locus. Epistasis effect is observed when the PVE is due to interactions between alleles at different loci. In this study, it was observed that QTL had varying levels of additive and dominance effects for resistance to MLN (Table 2), suggesting the importance of both modes of gene action in conditioning resistance to the disease. Similarly, some QTL had smaller %PVE values but had larger additive or dominance effects compared to the QTL with larger %PVE. Generally, when epistasis is involved, QTL with moderate and major additive effects are much more affected and they tend to have reduced %PVE. Therefore, epistasis could also be one of the main causes in some QTL having larger additive effects but with reduced %PVE values for resistance to MLN. QTL detected in F3 pop 1, 2, 3, 4, and 5 predominantly showed dominant to over-dominant effects. The dominance effects observed indicates MLN resistance is possibly due to interactions between individual alleles at specific loci. Other factors may include recombination between QTL and markers, leading to change in number of expected resistant alleles in the progeny hence reduced penetrance in QTL effects [53,54]. For QTL to be useful in a breeding program, it is important to first carry out validation studies to confirm their repeatability in different genetic backgrounds and environments. In line with expectations, the major QTL observed across populations in this study such as qMLN3\_130, qMLN3\_142, qMLN6\_85, qMLN5\_190, etc., indicated their stability across different genetic backgrounds.

It was observed that favorable alleles for each QTL were contributed by either resistant or susceptible parent, which is indicated here by sign of the additive effect of the QTL. Negative (−) additive effect was considered for resistant parent being the source of favorable allele for resistance to MLN. Similarly, positive (+) additive effect implied that the susceptible parent was the donor of favorable allele for the observed phenotypic variability. This suggests the importance of major QTL in reduction of MLN infections and their stability across different genetic backgrounds [54]. Earlier studies [7,8] also detected contributions of favorable alleles from both resistant and/or susceptible parents.

At least seven major QTLs localized across chromosomes and across populations with significantly larger %PVE were detected with total %PVE ranging from 25.13 to 59.07. Three of these QTLs (*qMLN3\_130*, *qMLN6\_85* and *qMLN5\_190*) showed effects (%PVE), chromosome positions and confidence intervals (CI) comparable to the previous study [7]. For example, for *qMLN3\_130*, the current CI is 125 to 169 Mb, whereas in the previous study CI was 125–130 Mb; for *qMLN6\_85* CI is 84–91 Mb and from previous study CI was 85–96 Mb; and for *qMLN5\_190* CI is 23–191 Mb and previous CI was 190–191 Mb. The larger CI observed in this study, compared to earlier study, could be due to few SNPs used in this study, however the results confirms the stability of the QTL in specific genomic regions. Another major QTL (*qMLN6\_157*) detected on chromosome 6 in the current study was also found in the previous investigations [8]. This observation implies that three QTLs are stable across diverge genetic backgrounds, and hence can be useful in marker assisted breeding for resistance to MLN. Another major effect QTL *qMLN3\_142* was detected in six out of the seven populations in this study, which also overlapped with QTL for MLN resistance in DH populations from the earlier study [8]. Two major QTLs (*qMLN5\_202* and *qMLN8\_10*) mapped across populations in this study, have not been reported previously. This implies the novelty of these genomic regions for MLN resistance. Similarly, it could suggest their specificity to the genetic backgrounds used in the current study.

#### *4.3. Joint Linkage Association Mapping (JLAM)*

For JLAM, three biometric models were used to detect the maximum number of QTL linked to genes for MLN resistance. The principle of both models A and B resemble the composite interval mapping approach used in bi-parental populations [55] where co-factors were used to adjust for the population structure and background error within the segregating populations, leading to enhanced capacity to detect QTL. In both MLN-DS and AUDPC, we observed that >65% of QTL detected in model A and B overlapped with slightly higher %PVE in Model A (Table 3). This clearly indicates that control of population stratification by using co-factors alone was moderate, so population effect is also important. The slightly higher number of QTL detected by model A compared to model B could imply that model A was able to more effectively detect the variability among populations, and hence was able to reveal more QTL [10]. Model C with the nested marker effects, exploited the linkage disequilibrium (LD) within segregating populations and assumed population-specific QTL effects [10]. Model C was comparable with Model B in terms of the %PVE by these QTLs. This was well supported by all seven populations with relatively bigger population size (*n* = 306), where the nested model C is able to detect QTL with small to large effect size.

To evaluate the reliability of QTL detected in linkage mapping, we compared the identified 60 and 58 QTLs with 27 and 28 main effect QTLs detected through JLAM for MLN-DS and AUDPC values, respectively. Twenty-three QTLs are common across linkage mapping and JLAM for both the traits (Tables 2 and 3). Comparison of QTL detected through JLAM and linkage mapping revealed five unique QTLs (*qMLN3\_199*, *qMLN4\_235*, *qMLN5\_207*, *qMLN5\_209*, and *qMLN6\_13*) which were detected only through JLAM. QTLs common across methods helped to pinpoint the specific markers compared to QTLs detected with large CI through linkage mapping. Plausible gene candidates underlie some of the detected QTL regions with significant SNPs found in genes involved in plant defense system, like *PZA00447\_8* associated with serine/threonine protein phosphatase, *PZA01313\_2* linked to leucine-rich repeat protein and *PZA00726\_8* linked to zinc ion binding function which has a role in plant defense responses (Table S2). Among the unique QTL detected in JLAM, QTL on chromosome 6 seems to be overlapping with earlier reported major effect QTL for MLN resistance [7]. Major recessive QTL was reported in this region [56] in F2 populations. Similarly, with different association panels new QTL detected for resistance to SCMV in the same location on chromosome 6 [57]. Unique QTL on chromosomes 3 and 6 found in both model A and B but not in linkage mapping in this study indicates the variation across populations has helped to detect these QTL, which supports the use of multiple populations to find the novel source of resistance. On the contrary, the other three unique QTLs on chromosomes 4 and 5 were detected only in model C and were QTLs with population specific expression.

We employed the same populations to evaluate GP by sampling both training and testing sets. Moderate to high accuracy of GP was observed for MLN-DS and AUDPC within populations. Low accuracy in F3 pop1 is also due to low number of polymorphic SNPs compared to other populations. Application of GP for selection for improvement of MLN resistance has been reported [7]. The previous investigations detected high prediction accuracy within the populations for both MLN-DS and AUDPC values. Resistance for MLN is complex and requires considerable resources and multiple selection cycles to identify resistant lines using phenotypic selection alone. Therefore, GP could provide an alternative practical approach for breeding for resistance to MLN because it accounts both major and minor effect QTL for accuracy, enhanced breeding gain (shortened breeding cycle) and reduced costs.

#### **5. Conclusions**

Seven major QTL were identified for resistance to MLN, localized on chromosomes 3 (*qMLN3\_130* and *qMLN3\_142*), 5 (*qMLN5\_190* and *qMLN5\_202*), 6 (*qMLN6\_85* and *qMLN6\_157*), 8 (*qMLN8\_10*) and 9 (*qMLN9\_142*) and spread across the 7-biparetal populations with a percentage of PVE ranging from 10.54% to 44.50%. Two SNPs on chromosome 3 (*PHM15449\_10* at 125,077,922 bp and *PZA00920\_1* at 142,821,031) were strongly associated with MLN resistance and their allelic effects in different populations were also prominent. Several QTL on chromosomes 1, 3, 5, and 6 shared similar CI with those reported previously. These QTL could be adopted for marker assisted breeding for improvement of maize against MLN infection. Additional three major QTL were not reported before further research is warranted to confirm their reproducibility and use in breeding for resistance to MLN. Both linkage mapping and JLAM can be incorporated to confirm earlier reported QTLs and discover new sources of resistance. Incorporation of GP can help to capture both large effect and small effect QTL to improve the level of MLN resistance as a result improved genetic gain. Twenty-three QTL were common across linkage mapping and JLAM however, five unique QTL (*qMLN3\_199*, *qMLN4\_235*, *qMLN5\_207*, *qMLN5\_209*, and *qMLN6\_13*) were unique to JLAM with the QTL on chromosome 6 appearing to be overlapping with an earlier reported major effect QTL for SCMV resistance. Both linkage mapping and JLAM can be incorporated to confirm earlier reported QTL and discover new source of resistance. The incorporation of genomic selection can help to capture both large effect and small effect QTL to improve the level of MLN resistance.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2073-4425/11/1/32/s1, Table S1: genetic linkage groups constructed for resistance to MLN using 500 SNPs in seven F3 populations; Table S2: Twenty-three common QTLs and specific SNP detected in both linkage mapping and joint linkage association mapping across seven F3 populations and their associated candidate genes.

**Author Contributions:** Conceptualization and methodology, L.A.O.A.; software and data analyses, L.A.O.A. and M.G.; validation, M.G., P.T., E.D. and B.E.I.; investigation, L.A.O.A.; resources, B.M.P. and M.G.; writing—original draft, L.A.O.A.; writing—review and editing, B.E.I., P.T., E.D., P.W.M.-D., V.O., Y.B., S.L.M., M.B.J., M.O., S.M., B.M.P. and M.G.; supervision, B.E.I., P.T., E.D., P.W.M.-D. and M.G.; project administration, P.T., S.L.M. and M.G. All authors have read and agreed to the published version of the manuscript.

**Funding:** This study was part of a PhD program funded by Intra-ACP Mobility Scheme. The research was supported by the Bill and Melinda Gates Foundation, the Howard G. Buffett Foundation, and the United States Agency for International Development (USAID) through Stress Tolerant Maize for Africa (STMA, Grant # OPP1134248), and the CGIAR Research Program MAIZE. MAIZE receives W1 and W2 support from the Governments of Australia, Belgium, Canada, China, France, India, Japan, Korea, Mexico, Netherlands, New Zealand, Norway, Sweden, Switzerland, U.K., U.S., and the World Bank. We thank CIMMYT technicians for data collection at experimental sites.

**Acknowledgments:** The study was managed by The Global maize Program, CIMMYT and University of Ghana.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Natural Variation and Domestication Selection of** *ZmPGP1* **A**ff**ects Plant Architecture and Yield-Related Traits in Maize**

#### **Pengcheng Li, Jie Wei, Houmiao Wang, Yuan Fang, Shuangyi Yin, Yang Xu, Jun Liu, Zefeng Yang and Chenwu Xu \***

Jiangsu Key Laboratory of Crop Genetics and Physiology/Key Laboratory of Plant Functional Genomics of the Ministry of Education/Jiangsu Key Laboratory of Crop Genomics and Molecular Breeding/Jiangsu Co-Innovation Center for Modern Production Technology of Grain Crops, Agricultural College of Yangzhou University, Yangzhou 225009, China

**\*** Correspondence: qtls@yzu.edu.cn; Tel.: +86-0514-87979358

Received: 3 July 2019; Accepted: 28 August 2019; Published: 30 August 2019

**Abstract:** *ZmPGP1*, involved in the polar auxin transport, has been shown to be associated with plant height, leaf angle, yield traits, and root development in maize. To explore natural variation and domestication selection of *ZmPGP1*, we re-sequenced the *ZmPGP1* gene in 349 inbred lines, 68 landraces, and 32 teosintes. Sequence polymorphisms, nucleotide diversity, and neutral tests revealed that *ZmPGP1* might be selected during domestication and improvement processes. Marker–trait association analysis in inbred lines identified 11 variants significantly associated with 4 plant architecture and 5 ear traits. SNP1473 was the most significant variant for kernel length and ear grain weight. The frequency of an increased allele T was 40.6% in teosintes, and it was enriched to 60.3% and 89.1% during maize domestication and improvement. This result revealed that *ZmPGP1* may be selected in the domestication and improvement process, and significant variants could be used to develop functional markers to improve plant architecture and ear traits in maize.

**Keywords:** natural variation; maize; nucleotide diversity; domestication selection; *ZmPGP1* gene

#### **1. Introduction**

Maize (*Zea mays* L.) is one of the most widely grown and important cereal crops, which plays a critical role in ensuring food security. Maize was domesticated from the wild grass teosinte more than 8700 years ago [1]. The domestication of maize went through two stages: domestication selection and subsequent genetic improvement (post-domestication selection) [2]. Strong directional selection had profound effects on the morphological structure of maize, and genetic improvement affected its productivity [3]. For example, from 2000 to 2014, the total maize production in the United States and China increased by 31% and 49% respectively, of which half could be attributed to genetic advances [4,5]. Human selection has profound effects on the genetic diversity for the genomic region under selection and target genes [3]. Genetic consequences during the domestication and breeding history will enable us to understand its important role on yield increase in the modern maize breeding.

Grain yield (GY) is a complicated quantitative trait and is mainly determined by three yield components: effective ear number, kernel number, and kernel weight [6]. Maize kernel and ear morphological traits are the most important factors determining grain yield. Kernel weight is mainly affected by kernel size, which is usually measured by kernel length (KL), kernel width (KW), and kernel thickness (KT). Ear length (EL), ear diameter (ED), and kernel row number (KRN) are important traits determining the kernel number [7]. Planting density is a major factor in determining the effective ear number. The increased maize productivity is predominantly due to higher planting density, resulting from the domestication and improvement of plant shoot architecture [4]. Plant architecture is influenced by aboveground phenotypes, such as plant height (PH), ear height (EH), and leaf number (LN). From 1930 to 2001 in the United States, maize ear height was reduced by 3 cm per decade, leaf angle became more upright, tassel branch numbers became averaged 2.5 fewer branches per decade, and leaf number increased from 12.2 in the 1930s to 13.8 in the 1970s [8]. Identification of genes associated with grain yield and plant architecture traits will be helpful for maize yield improvement.

Most plant and ear traits are quantitative traits, which are controlled by a large number of small effect quantitative trait locus (QTLs). Many QTLs related to yield components and plant architecture traits have been identified in several maize linkage populations. A total of 163 QTLs were detected for four ear traits in 10 different RIL populations, accounting for 55.4–82% of phenotypic variation [9]. In the same panel, approximately 800 QTLs with major and minor effects were identified for 10 plant architecture-related traits [10]. Martinez et al. [11] assembled a yield QTLome database, and 808 QTLs for GY and seven additional GY components of common interest in maize breeding from 32 mapping populations were used for meta-QTL analysis. A total of 84 meta-QTLs were projected on the 10 maize chromosomes [11]. A number of genes that affect plant and ear traits have been identified, such as *fea2*, *fea3*, the ramose genes and *KRN4* for kernel row number, *df3*, *df8*, *df9*, and *br2* for plant height, and *td1*, *bif2*, *ba1*, and *tsh4* for tassel morphology [12–21]. Numerous kernel and morphological traits have changed during maize domestication and improvement, and some key genes have been cloned. *Tb1* has been shown to be associated with maize branching [22], the teosinte allele *gt1* confers multiple ears per branch [23], and *tga1* was associated with kernel structure [24]. In addition, genome-wide selection signals during maize domestication and improvement were assessed, 484 domestication and 695 improvement selective sweeps were detected, and a number of genes with stronger signals for selection underlie major morphological changes [3].

Indole-3-acetic acid (IAA), an active form of auxin, is a key regulator of plant growth and development. *ZmPGP1* (*ABCB1* or *br2*) was firstly cloned using a Mu element, the mutant was characterized by compact lower stalk internodes and the plants showed semi-dwarf stalks [16]. *ZmPGP1* was an adenosine triphosphate (ATP) binding cassette (ABC) transporter which belonged to the multidrug resistant (MDR) class of P-glycoproteins (PGPs), and functioned as an efflux carrier in polar auxin transport. The protein had two transmembrane domains that provide the translocation pathway of auxin and two cytoplasmic nucleotide-binding domains that hydrolyse ATP and drive the transport reaction [25,26]. Different alleles of *ZmPGP1* have been shown to be associated with plant height, ear height, leaf angle, ear length, yield traits, and root development under aluminum stress [16,27–31]. Although several mutations of *ZmPGP1* have also been identified, the sequence polymorphism and natural variations of the gene have not been investigated. It is also unclear whether *ZmPGP1* exists as a signal of selection during maize domestication and improvement. In the present study, we re-sequenced *ZmPGP1* in 349 inbred lines, 68 landraces, and 32 teosintes, and aimed to: (1) examine the *ZmPGP1* nucleotide diversity between maize inbred lines, landraces, and teosintes, (2) identify natural variations in candidate genes associated with grain yield and plant architecture traits, and (3) examine the significant associations for their involvement in maize domestication and improvement.

#### **2. Materials and Methods**

#### *2.1. Plant Materials and the Phenotypic Evaluation*

A total of 349 inbred lines, 68 landraces, and 32 wild relatives were selected in this study [32]. The inbred lines were grown in the field in a randomized block design with two replicates in 2016, 2017, and 2018 in Sanya, Hainan Province (18◦23 N, 109◦44 E). Each inbred line was grown in a single row with 13 plants, 3 m in length, and 0.5 m between adjacent rows. Fifteen days after pollination, 6 plants in the middle of each row were selected to measure leaf number above the topmost ear (LNAE), plant height (PH), tassel branch number (TBN), and tassel main axis length (TMAL), the first leaf above

the ear position leaf was selected to measure leaf angle (LA, the angle between the horizontal and the midrib of the leaf) and leaf width (LW). The measure method of plant architecture traits referred to are as described in Pan et al. (2017) [10]. After harvesting and drying, 3 well-developed ears were selected to measure ear traits, including ear grain weight (EGW), 100-kernel weight (HKW), ear diameter (ED), ear weight (EW), ear length (EL), kernel length (KL), kernel width (KW), kernel thickness (KT), and kernel number per row (KNR). The root and shoot traits at the seedling stage were determined by Li et al. [32] in a hydroponic system.

#### *2.2. DNA Isolation and ZmPGP1 Resequencing*

A modified cetyl trimethylammonium bromide (CTAB) method was used to exact genomic DNA from young leaves of each line at the seedling stage. The sequences of the *ZmPGP1* gene were sequenced by BGI (Beijing Genomics Institute) Life Tech Co. China using targeted sequence capture technology on the NimbleGen platform [33]. The genomic sequence of *ZmPGP1* (GRMZM2G315375) of the B73 inbred line was used as a reference for target sequence capture.

#### *2.3. Analysis of Sequence Data*

Multiple sequence alignment of the maize *ZmPGP1* gene was performed using MAFFT software and was further edited manually [34]. Using DNASP5.0 software [35], we analyzed single nucleotide polymorphisms (SNP) and allelic diversities across all tested lines. Two parameters, π and θ, were used to estimate the degree of polymorphism within the tested population. Tajima's D [36], Fu and Li's D\*, as well as Fu and Li's F\* [37] statistical tests were used to test for neutral evolution within each group and each defined region. The sequence data and markers were shown in Dataset 1–2.

#### *2.4. Marker–Trait Association Analysis in Inbred Lines*

Genotyping-by-sequencing (GBS) was used to identify the genotypes of 349 inbred lines [32]. A total of 163,931 SNPs were obtained by filtering out markers with more than 20% of missing data and below 1% minor allele frequency. Three models were used to conduct marker–trait associations: (1) the K model, controlling for kinship, (2) the PCA + K model, controlling for both population structure (principal component, PC) and kinship, and (3) the Q + K model, controlling for both population structure (Q) and kinship. Principal component analysis (PCA) and kinship were calculated using Tessel5.0, and Q was calculated by admixture. A total of 499 *ZmPGP1*-based markers with minor allele frequency (MAF) ≥0.05 were selected for association analysis in 349 inbred lines, and the *p* value threshold was set at 2.00 <sup>×</sup> <sup>10</sup>−<sup>3</sup> (0.5/499).

#### **3. Results**

#### *3.1. Sequence Polymorphisms of ZmPGP1*

The *ZmPGP1* sequence alignment of 349 inbred lines, 68 landraces, and 32 teosintes spanned 9710 bp, which covered a 1762 bp upstream region, a 182 bp 5 UTR region, a 6821 bp coding region containing five exons and four introns, a 400 bp 3 UTR region, and a 545 bp downstream region (Table 1). Sequence polymorphisms, including SNPs and InDels, at *ZmPGP1* were identified, and 1070 variations were detected, including 878 SNPs and 192 InDels. On average, SNPs and InDels were found every 11.06 bp and 50.57 bp, respectively. The highest frequency of SNPs and InDels were found in the 3 UTR (5.86 bp) and 5 UTR (14 bp). The overall nucleotide diversity (π) of the *ZmPGP1* locus was 0.007. Among five regions of the *ZmPGP1*, and when the coding regions were less diverse than other regions (0.006), the downstream and 3 UTR showed high nucleotide diversity (0.016 and 0.015, respectively).



indicates a statistical significance at *p* < 0.05 level, \*\* indicates a statistical significance at *p* < 0.01 level. "UTR" indicated untranslated region.

#### *3.2. Nucleotide Diversity and Selection of ZmPGP1 in Inbred Lines, Landraces and Teosinte*

To investigate the genetic diversity of *ZmPGP1* in inbred lines, landrace, and teosinte, the sequence conservation (C) and nucleotide diversity (π) were analyzed and compared. For all test lines, the values of C and π × 1000 were 0.793 and 7.110, respectively (Figure 1a). Compared with teosintes, landraces and inbred lines showed higher conservation (CT = 0.845, CL = 0.920 and CI = 0.923) and lower diversity (π × 1000T = 20.724, π × 1000L = 9.970 and π × 1000I = 6.558). The highest divergence between inbred lines and teosintes was observed in the upstream and downstream regions (Figure 1b). A divergence peak was found in the fourth intron by comparing landraces to inbred lines. To investigate the involvement in maize domestication and improvement of *ZmPGP1*, the entire sequence was tested by the neutral test, including Tajima's D and the D\* and F\* of Fu and Li. The values for Tajima's D and the D\* and F\* of Fu and Li of *ZmPGP1* were significantly less than 0, indicating this gene maybe selected in the domestication and improvement process (Figure 1a).

**Figure 1.** Nucleotide diversity in inbred lines, landraces, and teosinte. (**a**) Summary of nucleotide polymorphisms and neutrality test of ZmPGP1, Hd represents haplotype diversity, Dens denotes number of single nucleotide polymorphisms (SNP) per 1000 bp, C represents sequence conservation, and D\* and F\* represent Fu and Li's D\*and F\*. (**b**) Nucleotide diversity (π) of inbred lines, landraces, and teosinte. π was calculated using the sliding windows method with a window size of 100 bp and a step length of 25 bp. \* indicates a statistical significance at *p* < 0.05 level, \*\* indicates a statistical significance at *p* < 0.01 level.

#### *3.3. Association Analysis of Phenotypic Traits with ZmPGP1*

To identify significant variants associated with phenotypic traits, association analysis was performed using 499 variants, including 269 SNPs and 230 InDels with minor allele frequency (MAF) ≥0.05 in 349 inbred lines. Three mixed linear models (MLM), MLM + K, MLM + Q + K, and MLM + PCA + K, were employed to perform marker-traits association analysis. Comparing the Quantile-Quantile (QQ) plots generated for these models, we selected MLM + PCA + K to minimize both false positives and false negatives (Figure 2a).

**Figure 2.** *ZmPGP1*-based association mapping. (**a**) QQ plot for the association analysis under three models, red, green and blue dots denote MLM + K, MLM + PCA + K, and MLM + Q + K, respectively. (**b**) Manhattan plot by using the MLM + PCA + K model. Triangles and dots represent InDels and SNPs, respectively. Abbreviations for traits are as follows: ED, ear diameter; EGW, ear grain weight; EW, ear weight; HKW, 100-kernel weight; KL, kernel length; LA, leaf angle; PH, plant height; RDW, root dry weight; TMAL, tassel main axis length.

Using a Bonferroni correction based on 499 *ZmPGP1*-based markers, the *P*-value thresholds were set at 0.001 (0.5/499). A total of 24 significant marker–trait associations involved 15 variants (12 SNPs and 3 InDels) were identified for 9 traits using the MLM + PCA + K model (Table S1). Among these 24 associations, 9 and 15 sites were associated with 4 plant architecture (PH, LA, TMAL, and RDW [root dry weight]) and 5 ear traits (ED, EGW, EW, HKW and KL), respectively (Figure 2b; Table S2). A total of 3, 4, 6, and 2 variants were distributed in the upstream, exon, intron, and 3 UTR regions, respectively. The SNP at site 1708 in exon 3, which was associated with ED, EGW, and KL, caused synonymous changes. SNPs at sites 438, 453, and 555 in exon 1 caused non-synonymous changes in the amino acid sequence. Three high LD SNPs 438, 453, and 555 were associated with PH (Table S1). All significant variants could explain 2.98–6.91% of the phenotypic variation. Most of the associations were small effect variants and could explain less than 4% of the phenotypic variation. SNP1473, associated with KL (*<sup>p</sup>* <sup>=</sup> 9.34 <sup>×</sup> <sup>10</sup><sup>−</sup>7), explained the most phenotypic variation, up to 6.93%. In addition, 4 pleiotropic sites, including SNP1473, SNP1708, SNP7213, and InDel3387, were significantly associated with ED, EGW, KL, EGW, LA, and RDW (Figure 3). SNP1473 in intron 2 was associated with four ear traits (ED, EGW, KL, and EW), SNP1708 was associated with three ear traits (ED, EGW, and KL), SNP7213 in the 3 UTR was associated with ED, LA, and RDW, and the InDel at site 3387 in intron 4 was associated with ED, EGW, and KL (Table 2).

Linkage disequilibrium (LD) analysis found that SNP438, SNP453, SNP555, SNP628 and SNP706 showed strong LD (*r*<sup>2</sup> > 0.95) with each other in inbred lines. After the clumping of variants, 11 significant sites were identified. Six major haplotypes which contained more than 10 lines emerged from the 11 significant sites across inbred lines, and a significant phenotypic difference was observed between haplotypes in 8 traits, except for TMAL (Table S2). Four significant variants were significantly associated with KL, including InDel-970, SNP1473, SNP1708, and InDel3387. Three major haplotypes, which contained more than 20 lines, emerged from the 4 significant sites across 349 inbred lines (Figure 4c). The phenotypic differences in KL between the three major haplotypes were compared, and a significant difference was detected by ANOVA (*p* = 6 <sup>×</sup> 10−10) between haplotypes. Hap1, carrying all increased alleles, had the longest kernel length, followed by Hap2, which included the majority of tested inbred lines. Hap3, carrying all decreased alleles, had the shortest kernel length. SNP1473 was the most significant sites, the allele T group had a significantly longer KL than the allele C group (*p* = 6.9 <sup>×</sup> 10−8, Figure 4d). Further, we analyzed the allele frequencies among the three populations. The results showed that the frequency of the SNP1473T in teosintes was 40.6%, and in landraces and inbred lines, the frequency increased to 60.3% and 89.1%, respectively (Figure 4e). These results suggested that SNP1473 might have been selected during domestication and improvement of maize. Three variants at sites 1473, 1708, and 3387 were significantly associated with EGW, which could divide the tested inbred lines into 2 major haplotypes (Figure S1). A significant difference between haplotypes was observed for EGW (*p* = 1.3 <sup>×</sup> 10−4). The SNP at site 1473 also had the most significant association with EGW. Three variants were identified for HKW that divided the inbred lines into four groups (Figure S2). The HKW of Hap1 was higher than the other three haplotypes (*<sup>p</sup>* <sup>=</sup> 8.3 <sup>×</sup> <sup>10</sup><sup>−</sup>9). The most significant site was SNP-769, and the frequency of the increased allele, SNP-769T, increased from 8.3% in teosintes to 33.3% in inbred lines. Five SNPs with high LD (*r*<sup>2</sup> > 0.95) were significantly associated with PH. The plant height in the inbred lines carrying allele SNP453G was higher than those containing the C allele (Figure S3). The frequency of the G allele decreased from 50.0% in teosintes to 16.1% in inbred lines. Two SNPs significantly associated with RDW divided the tested inbred lines into 3 major haplotypes (Figure S4). The frequency of increased allele SNP7137C increased from 0 in teosintes to 74.1% in inbred lines.

**Figure 3.** The network between pleiotropic site and associated traits. Yellow, green, and orange circle indicated the variations were in exon, intron, and 3 UTR, and the lines represent *p* value.


**Table 2.** Significant markers associated with phenotypic traits.

<sup>a</sup> The position of the start codon (ATG) is labelled as "0".

**Figure 4.** Natural variations in *ZmPGP1* were significantly associated with KL. (**a**) *ZmPGP1*-based association mapping for kernel length (KL). (**b**) Linkage disequilibrium (LD) heatmap for six significant variants associated with KL. (**c**) Haplotypes of *ZmPGP1* among natural variations in inbred lines. (**d**) Comparison of kernel length between different alleles of SNP1473. (**e**) The allele frequency of SNP1473 in teosinte, landraces, and inbred lines.

#### **4. Discussion**

The process of maize domestication and improvement has been studied with population genetics–genomics [3], QTL mapping [38], and gene expression assays [39]. During domestication and improvement, the plant morphology and productivity of maize have changed dramatically. Maize plants typically have one or two short branches and only two ears, each with several hundred kernels [38]. These changes involved artificial selection of specific genes controlling key morphological and agronomic traits [40], resulting in reduced genetic diversity. Previous studies have identified several genes underlying maize evolution: 484 domestication and 695 improvement regions were identified from population genetics analyses [3]. It is estimated that approximately 2–4% of genes have been selected during maize domestication and improvement [40]. Here, we examined DNA sequence variation in *ZmPGP1*, which is involved in the polar movement of indole-3-acetic acid (IAA). Plant hormones, such as auxins, play a key role in plant growth, development, defenses, and stress tolerance [41]. A previous study reported that an auxin response factor might contribute to the morphological difference between maize and teosinte [40]. We found that the level of nucleotide diversity (π × 1000) in teosintes is 20.724, decreased to 9.970 in landraces and 6.558 in maize inbred lines (Figure 1a), suggesting that approximately half of the genetic diversity has been lost during domestication process. Similar results were observed in several plants, such as soybeans and cucumbers [42,43]. Many previous studies employed only a limited number of teosinte, landraces, and maize to identify the domestication signals. For example; a total of 14 inbred lines, 16 landraces, and 16 teosinte accessions were chosen to artificial selection of 1095 genes. 28 inbred lines, 16 landraces, and 16 teosinte accessions were used to investigate the involvement of 32 MADS-box genes during maize domestication and improvement [44,45]. In this study, a larger population including 349 inbred lines, 68 landraces and 32 wild relatives were used to re-sequence *ZmPGP1* with high sequencing depth (more than 100×), which could help us to identify the selection signals with larger effective and high accurate.

Plant architecture and kernel and ear traits, the key factors affecting grain yield, were the main traits targeted of maize breeding. The identification of the natural variations in these traits could help to improve the efficiency of breeding selection. Although hundreds of QTLs related to these traits have been identified [10,11], few genes have been cloned from the natural germplasm. *ZmPGP1* (ABCB1 or br2), involved in auxin polar transport, has been shown to be associated with plant height, stalk diameter, leaf length and leaf angle [28]. Three Mu insertions were detected in the exon and intron of *ZmPGP1* [16]. These mutations dramatically affected height reduction but were rare variations in natural accessions. Natural germplasm with a broad genetic base could be a potential resource for improving yield [46]. Natural variations of *ZmPGP1* have also been identified [16,27–31], and some alleles have great potential in maize improvement. One rare SNP variant in the exon could reduce plant height without affecting yield [47]. A new 241-bp deletion in the last exon of PGP1 also had no negative effect on yield, but significantly reduced plant height and ear height and increased stalk diameter and erected leaves. The deletion was a rare allele that could be detected in only one line of 311 diverse maize accessions [28]. The result revealed that *ZmPGP1* has good potential to reshape plant architecture without the loss of yield in maize breeding. Candidate gene association analysis can identify the elite variation and the best haplotype for target traits. The elite variations of more than 30 genes involved in flowering time, kernel composition, drought tolerance, and root development were detected by candidate gene association analysis [48]. For example, *crtRB1* was proved to be associated with β carotene concentration and conversion in maize kernels, and the most favorable alleles were developed to inexpensive markers to use for crop provitamin A biofortification [49]. In this study, to identify the natural variations and favorable haplotypes of *ZmPGP1*, 1070 variations were detected from 9710 bp re-sequenced genomic region of *ZmPGP1*. In total, 11 variants were identified for 5 yield-related traits and 4 plant architecture (Figure 2; Table 2). However, two previously rare variations [28–31,47] were not found in our study. SNP1473 was the most significant variant for KL and EGW. The frequency of the increased allele T was 40.6% in teosintes and was enriched to 60.3% and 89.1% during maize domestication and improvement (Figure 4; Figure S1). The selection patterns were similar with the 1.2-Kb presence-absence variant of *KRN4*, which is likely responsible for increased kernel row number in maize [15]. In conclusion, we re-sequenced the *ZmPGP1* gene in 349 inbred lines, 68 landraces, and 32 teosintes, sequence polymorphisms, nucleotide diversity and neutral tests revealed that *ZmPGP1* might be selected during domestication and improvement processes. A total of 11 variants significantly associated with 4 plant architecture and 5 ear traits were identified by marker–trait association analysis in inbred lines. The significant variants could be used to develop new markers to improve plant architecture and ear traits in maize.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2073-4425/10/9/664/s1, Table S1: All significant markers associated with phenotypic traits. Table S2: Phenotypic differences among different haplotypes. Figure S1: Natural variations in *ZmPGP1* were significantly associated with EGW. Figure S2: *Genes* **2019**, *10*, 664

Natural variations in *ZmPGP1* were significantly associated with HKW. Figure S3: Natural variations in *ZmPGP1* were significantly associated with PH. Figure S4: Natural variations in *ZmPGP1* were significantly associated with RDW. Dataset 1: The sequence of ZmHKT1.fa. Dataset 2: The variaton of ZmPGP1.fa.

**Author Contributions:** P.L., J.W., Z.Y. and C.X. conceived and designed research. J.W., Y.F., S.Y., and J.L. conducted experiments. P.L., J.W. and Y.X. analyzed data. P.L., H.W., Z.Y. and C.X. wrote the manuscript. All authors read and approved the manuscript.

**Funding:** This research was funded by This work was supported by grants from the National Key Technology Research and Development Program of MOST (2016YFD0100303), the Natural Science Foundations of Jiangsu Province (BK20180920), the National Natural Science Foundations (31601810, 31801028), the Innovative Research Team of Universities in Jiangsu Province, a project funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD) and Research and innovation program for graduate students of Yangzhou university (XKYCX18\_082).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## **Transferability and Polymorphism of SSR Markers Located in Flavonoid Pathway Genes in** *Fragaria* **and** *Rubus* **Species**

**Vadim G. Lebedev 1,2, Natalya M. Subbotina 1,2, Oleg P. Maluchenko 3, Tatyana N. Lebedeva 4, Konstantin V. Krutovsky 5,6,7,8,9,\* and Konstantin A. Shestibratov <sup>2</sup>**


Received: 30 November 2019; Accepted: 19 December 2019; Published: 21 December 2019

**Abstract:** Strawberry (*Fragaria*) and raspberry (*Rubus*) are very popular crops, and improving their nutritional quality and disease resistance are important tasks in their breeding programs that are becoming increasingly based on use of functional DNA markers. We identified 118 microsatellite (simple sequence repeat—SSR) loci in the nucleotide sequences of flavonoid biosynthesis and pathogenesis-related genes and developed 24 SSR markers representing some of these structural and regulatory genes. These markers were used to assess the genetic diversity of 48 *Fragaria* and *Rubus* specimens, including wild species and rare cultivars, which differ in berry color, ploidy, and origin. We have demonstrated that a high proportion of the developed markers are transferable within and between *Fragaria* and *Rubus* genera and are polymorphic. Transferability and polymorphism of the SSR markers depended on location of their polymerase chain reaction (PCR) primer annealing sites and microsatellite loci in genes, respectively. High polymorphism of the SSR markers in regulatory flavonoid biosynthesis genes suggests their allelic variability that can be potentially associated with differences in flavonoid accumulation and composition. This set of SSR markers may be a useful molecular tool in strawberry and raspberry breeding programs for improvement anthocyanin related traits.

**Keywords:** *Fragaria*; *Rubus*; microsatellites; transferability; polymorphism; introns; exons; flavonoid biosynthesis pathway; transcription factor genes; chitinase

#### **1. Introduction**

The Rosaceae family comprises approximately 3000 species and includes very important fruit, berry, and ornamental plants. This family has been relatively recently reorganized into three subfamilies: Dryadoideae, Spiraeoideae, and Rosoideae. The latter one includes cultivated berries in the genera *Fragaria* (strawberry) and *Rubus* (raspberry and blackberry) [1]. Strawberry and raspberry are in especially high demand among consumers due to their appearance, taste, and aroma [2,3]. They are also rich in antioxidants and other bioactive compounds beneficial for human health. The strawberry is the most consumed berry worldwide—more than 9 million tons were harvested in 2017, while the production of raspberry and blackberry increased by 50% for the period 2010–2017 and exceeded 800,000 tons [4]. It is suggested that the consistent demand for healthy and delicious berryfruit observed in the current decade will increase in the nearest future [2].

Recently, the high interest in berry crops has led not only to increased production levels, but also to the expansion of *Rubus* and *Fragaria* breeding programs; several dozens of them are now known [3,5]. For a long period of time, the main directions in breeding were crop yield and disease resistance, but in the past years, fruit sensorial and nutritional qualities have become major objectives [6]. However, *Rubus* and *Fragaria* breeding is complicated because of several genetic problems including polyploidy and the highly heterozygous nature of the germplasm [3,5]. The genus *Rubus* consisted of several hundred species, and the ploidy level can widely vary among them. Raspberries are mainly diploids (2*n* = 14), while blackberries may vary from diploids to dodecaploids (2*n* = 84), whereas the hybrids between them can be hexaploid (loganberry) or septaploid (boysenberry) [7]. A total of 22 wild species of *Fragaria* have been described. Almost half of them are diploids (2*n* = 14), while tetra-, hexa-, octo-, and decaploid species are also known [8]. The main cultivated strawberry crop, *F*. × *ananassa*, is a hybrid between *F. chiloensis* and *F. virginiana*, but the origin of its octoploid genome from four diploid ancestors has long been unknown. In addition to *F. vesca* and *F. iinumae*, contribution of different species was assumed, but only in 2019 the phylogenetic analyses of Edger et al. [9] provided a strong genome-wide support that *F. iinumae*, *F. nipponica*, *F. viridis*, and *F. vesca* are diploid progenitor species. *F. vesca* (the wild strawberry) and *F. viridis* (the green strawberry) can be used in strawberry breeding as donors of abiotic and biotic stress resistance and fruit aroma [10] and firm flesh, remontant flowering habit, and an acidic apple-like aroma [11], respectively.

Developing a new cultivar by traditional methods is a very long process that can take up to 15 years for raspberry [5] or 10 years for strawberry [12]. Molecular markers, however, can be used at any stage of plant growth and can increase the speed and accuracy of germplasm assessment. A good choice for breeding purposes is a simple sequence repeat (SSR) or microsatellite markers consisting of tandem repeats (1–6 nucleotides). Due to their codominant inheritance, high level of polymorphism, and abundance in genome, they play an important role in identifying genomic regions associated with the traits of economic importance [13]. SSR markers for the *Rubus* and *Fragaria* species were first developed in the early 2000s [14,15] and in subsequent years, a number of studies were carried out, including evaluation of marker transferability. SSR transferability depends on genetic distance between individual specimens. SSRs are more transferable, overall, within the species of the same genus or among related genera within families than between remote genera and different families [16]. The *Fragaria* and *Rubus* species from Rosoideae subfamily have both the same basic number of chromosomes (*x* = 7) and close phylogenetic relationships based on their chloroplast and nuclear DNA markers, as well as similar morphology. These facts suggest collinearity between *Fragaria* and *Rubus* genomes [17]. In raspberry breeding, interspecific hybridization is widely used, and development of molecular genetic markers that can be transferred between different species, especially with different ploidy, becomes an important task. However, very few molecular genetic markers are known for *Rubus*, and fewer are transferable between species [13]. All strawberry cultivars now available at the market have been produced using traditional breeding methods [3]. Among raspberry cultivars, there are currently only two productive cultivars with root rot resistance that have been produced using the marker-assisted selection (MAS), but they are still in the commercial trial stage [18]. The development

of new molecular genetic markers that could be used for molecular genetic characterization of both wild species and germplasm would accelerate the breeding of new cultivars [2]. The transferability is very important for their wide use.

The SSR markers can be developed using either genomic or expressed sequence tag (EST) nucleotide sequences. The EST-based markers are more transferable and more useful in MAS as they are linked with expressed genes and could be associated with important agronomical traits [19]. The strawberry studies have shown that EST-SSR markers were more transferable compared to random genomic nongenic noncoding SSR (ncSSR) markers [20,21]. At the same time, it is known that ESTSSR markers are generally less polymorphic than ncSSRs that mostly represent noncoding genomic regions because of a greater DNA sequence conservation of transcribed regions [22]. There are no markers that would be universal or ideal for all practical applications and tasks. For some tasks such as analyses of population structure, genetic drift, migration, gene flow, mating system, and individual identification of clones, cultivars, etc. selectively neutral ncSSRs would be the most appropriate markers, but for functional analysis of adaptive variation, MAS, QTL mapping, etc., ESTSSRs that represent functional alleles and haplotypes and link to important adaptive and breeding traits would be more appropriate and useful markers. Among SSR markers, those that are located in introns and untranslated regions (UTRs) are more polymorphic and potentially can combine advantages of both ncSSR and EST-SSR markers, while those that are located in exons are less polymorphic, but more likely to be under direct selection or represent selectively different alleles, and therefore are more useful for MAS because they might better represent functional traits important for breeding. We studied diploid *Rubus* species and confirmed that SSR markers located in introns were more variable in comparison to EST-SSR located in exons [23]. However, development of such markers complete nucleotide sequences of important adaptive and breeding trait related genes.

Many MAS studies are aimed at developing markers representing key genes, but only few studies have focused on the developing markers representing structural and regulatory genes of metabolic pathways in cultivars with contrasting phenotypic traits of interest [24]. We are especially interested in developing SSR markers representing flavonoid pathway genes because the main polyphenols in fruits are flavonoids (Figure 1). Complete nucleotide sequences are already available for many of these genes. Thus, breeding of new berry cultivars with the improved nutritional value using MAS and SSR markers representing flavonoid biosynthesis-related genes seems to be a highly promising approach. The combinatorial interactions of the regulatory genes with structural genes that act to control the flux of various branches of the pathway ultimately determines the flavonoid composition [25]. However, as far as we know, SSR markers representing *Fragaria* and *Rubus* transcription factors (TFs) were not developed before our study. Main aims of our study were to: (1) develop SSR markers using coding (CDS) and non-coding (NCDS) sequences of structural and regulatory genes of flavonoid biosynthesis pathways, (2) evaluate them in *Rubus* and *Fragaria* species of different ploidy, (3) test cross-species transferability within and among *Rubus* and *Fragaria* genera, (4) assess the relationship of transferability and polymorphism of SSR markers with location of primer binding sites for polymerase chain reaction (PCR) and SSR loci in respective genes. Some of these genes could control anthocyanin content that correlates strongly with color. Therefore, strawberry and red raspberry cultivars with a wide range of berry colors were used in our study. In addition to nutritional value, disease resistance is also considered to be a valuable trait for the berry crops. The key factor for developing pathogen resistance in new cultivars is the identification and introgression of genes from cultivated varieties or their wild relatives [26]. Strawberry and raspberry production suffers from a number of agriculturally important diseases, and, therefore, we included in our study developing of SSR markers in strawberry genes encoding pathogenesis-related (PR) proteins – β-1,3-glucanases (PR-2 family) and chitinases (PR-3, 4, 8, 11 families). Since the sequences for raspberry chitinase genes were absent in the NCBI GenBank database, we sequenced the fragments of chitinase III genes in raspberry cultivars.

**Figure 1.** Schematic representation of flavonoid biosynthesis in strawberry and raspberry. PAL phenylalanine lyase; C4H—cinnamate 4-hydroxylase; 4CL—4-coumarate CoA ligase; CHS—chalcone synthase; CHI—chalcone isomerase; F3'H—flavonoid 3'-hydroxylase; FHT—flavonone 3-hydroxylase; FLS—flavonol synthase; DFR—dihydroflavonol 4-reductase; LAR—leuco-anthocyanidin reductase; ANS—anthocyanidin synthase; ANR—anthocyanidin reductase; F3GT—flavonoid 3-O-glycosyl transferase. Pink color indicates strawberry genes and dark-red color indicates *Rubus* genes used in this study.

#### **2. Materials and Methods**

#### *2.1. Plant Materials*

Sixteen *Fragaria* specimens, including *F*. × *ananassa*, *F. vesca*, *F. viridis* and (*F*. × *ananassa*) × *Comarum palustre*, and 32 *Rubus* cultivars, including red raspberry (*R. idaeus*; Idaeobatus subgenus), black raspberry (*R. occidentalis*; Idaeobatus subgenus), blackberry (*Rubus* subgenus), cloudberry (*R. chamaemorus*; Chamaemorus subgenus), arctic bramble (*R. arcticus*; Cyclastis subgenus), and hybrid arctic bramble (*R.* × *stellarcticus*; Cyclastis subgenus) were chosen to genotype newly developed SSR loci located in the flavonoid biosynthesis and pathogenesis-related genes. These cultivars have a wide range of fruit color, ploidy and various geographic and genetic origins, but mostly of Russian origin (Table 1). The plants used in this study were kindly provided by I.A. Pozdniakov (OOO Microklon, Pushchino, Russia). Each cultivar represented a microclonally vegetatively propagated line containing genetically identical plants. Therefore, a single specimen per culture was used for further DNA isolation and genotyping.


**Table 1.** List of 48 *Fragaria* and *Rubus* specimens genotyped in the study.

#### *2.2. Simple Sequence Repeat (SSR) Marker and Polymerase Chain Reaction (PCR) Primer Development*

The WebSat software [27] was used to detect SSR loci in the nucleotide sequences of *F*. × *ananassa* and *Rubus* genes available in the NCBI GenBank database (http://www.ncbi.nlm.nih.gov) (Table S1). To search for SSRs, the following threshold criteria were used: ten for mononucleotide repeats, five for dinucleotide motifs, four for tri-, three for tetra-, and two for penta-, and hexanucleotide repeats. The Primer 3 software (http://primer3.org) was used to design appropriate (PCR) primers based on the sequences flanking the SSR loci.

Primers were designed using the following criteria: primer length of 18–27 bp (optimally 22 bp), GC content of 40–80%, annealing temperature of 57–68 ◦C (optimally 60 ◦C), and expected amplified product size of 100–400 bp. Primers for the *RiG001* locus were as in [28]. Primers were synthesized by Syntol Comp. (Moscow, Russia) and are presented in Table S1.

#### *2.3. DNA Isolation, PCR Amplification and Fragment Analysis*

A single DNA sample per each specimen was produced from young expanding leaves representing a single plant per each sample. Total genomic DNA was extracted using the STAB method [29]. The quality and quantity of extracted DNA were determined by the NanoDrop 2000 spectrophotometer (Thermo Fisher Scientific Inc., Waltham, MA, USA). The final concentration of each DNA sample was adjusted to 50 ng/μL in TE buffer before the PCR amplification.

For genotyping, PCR was performed separately for each primer pair using a forward primer labeled with the fluorescent dye 6-FAM and an unlabeled reverse primer (Syntol Comp., Moscow, Russia). The PCR amplification was performed in a total volume of 20 μL consisted of 50 ng of genomic DNA, 10 pmol of the labeled forward primer, 10 pmol of an unlabeled reverse primer, and PCR Mixture Screenmix (Evrogen JSC, Moscow, Russia). After an initial denaturation at 95 ◦C for 3 min, DNA was amplified during 33 cycles in a gradient thermal cycler (Bio-Rad Laboratories, Inc., Hercules, CA, USA) programmed for a 30 s denaturation step at 95 ◦C, a 20 s annealing step at the optimal annealing temperature of the primer pair, and a 35 s extension step at 72 ◦C. A final extension step was done at 72 ◦C for 5 min.

The PCR generating clear, stable, and specific DNA fragments within an expected length (200–400 bp) were considered as successful PCR amplifications. If a primer pair failed three times to amplify template DNA that was amplified with other primers, then it was scored as a null genotype.

Separation of amplified DNA fragments was performed in an ABI 3130xl Genetic Analyzer using S450 LIZ size standard (Syntol Comp.). Peak identification and fragment sizing were done using the Gene Mapper v4.0 software (Applied Biosystems, Foster, CA, USA).

#### *2.4. Genetic Data Analysis*

Genetic statistics were calculated for each polymorphic microsatellite marker. The number of alleles, observed (*Ho*) and expected (*He*) heterozygosities, and polymorphic information content (PIC) for 32 diploid *Rubus* cultivars were calculated using the PowerMarker 3.25 software [30]. Analogous parameters for 13 octoploid *Fragaria* cultivars were calculated using the GenoDive 3.0 software [31]. Principal component analysis (PCA) and construction of the box plots were performed with the PCORD 5 software [32].

#### *2.5. Chitinase Gene Sequencing and Sequence Alignment*

Based on the expected homology between *Fragaria* and *Rubus* species, the following two primers were used for PCR amplification of a 528 bp long fragment homologous to the strawberry chitinase III gene in three raspberry cultivars: Ch-Up1 5 -GAAGATGCCCGCCAAGTTG and Ch-Low2S 5 -TTGATGGAGGAGCTGTATC. The amplification reaction mixture (25 μL) contained ~0.15 μg of genomic DNA, ScreenMix-HS buffer (Evrogen JSC, Moscow, Russia), 0.8 mM of each primer, and Milli-Q water. The PCR protocol included an initial denaturation step at 95 ◦C for 5 min followed by 31 cycles consisting of 45 s at 95 ◦C, 30 s at 59 ◦C, and 60 s at 72 ◦C each. A final step of 10 min at 72 ◦C ended the cycles followed by a hold at 4 ◦C. The PCR products were purified and sequenced by Evrogen JSC. The chitinase III sequences were aligned, visualized and manually inspected using the MView 1.63 software (www.ebi.ac.uk/Tools/msa/mview).

#### **3. Results**

#### *3.1. Development and Characterization of SSR Markers*

A total of 118 SSR loci were detected in 21 gene sequences (45.6 Kb). The number of SSRs ranged from one to 13 per gene (5.6 on average). One SSR was found per every 387 bp on average; less frequent in exons with one SSR per every 628 bp, but more frequent in introns with one SSR per every 263 bp on average. In our SSR analysis, loci with pentanucleotide motifs were detected at the highest frequency (45%), followed by loci with hexa-(25%) and dinucleotide (13%) motifs. Loci with tetra-, mono- and trinucleotide motifs were relatively less frequent—8%, 5%, and 4%, respectively. All loci with mononucleotide repeats consisted of only T nucleotide and contained from 10 to 12 T nucleotides. Among 17 loci with dinucleotide repeats, the AT/TA motif was the most frequent (47%), followed by CT/GA (24%) and GT/CA (18%). On average, number of repeats were 9.7 for loci with dinucleotide motifs and 7.5 for loci with trinucleotide motifs ranging from 5 to 32 and 4 to 14 motifs per locus, respectively. Loci with tetranucleotides motifs contained only three repeats, and loci with penta- and hexanucleotide motifs contained only two repeats.

Location of microsatellite loci in CDS (exons) and NCDS (introns, 5 and 3 UTRs, and upstreams upfront regions further than 500 bp from the first exon) was determined, with majority of them (35%) being located in introns, 28% in exons, 14% in 5 UTRs and upstreams, and 9% in 3 -UTRs (Table 2). Loci with pentanucleotide repeats prevailed everywhere (41–55%), except in upstreams (29%), where the proportion of hexanucleotide SSRs was higher (41%). A relatively high proportion of hexanucleotide SSRs was also in exons (33%). The nucleotide distribution was approximately equal in the exons (21, 24, 26, and 28% for T, G, A, and C, respectively), while in NCDS T (46%) and A (34%) prevailed over C (12%) and G (8%).


**Table 2.** Number of microsatellite loci with different nucleotide repeat motifs and their location in gene.

However, not all microsatellite loci could be developed into useful SSR markers. For example, some microsatellite loci were located at the end of the sequenced DNA fragment. Microsatellite loci mononucleotide repeats were also not used for developing SSR markers. We also tried to use various combinations of the location of SSR loci and annealing sites for PCR primer pairs. This analysis resulted to selection of 24 sequences ranging from 122 to 400 bp long and harboring 43 microsatellites. Finally, 24 primer pairs were successfully designed (Table S1). In addition to the newly developed markers, we used the *RiG001* marker from *R. idaeus* [28]. The developed SSR loci were in all gene regions except 3 UTRs: ten were in introns, eight in exons, two in upstreams, and one was in 5 UTR (Table S1). In addition, four loci were located at the junction of CDS and NCDS (two loci at the junction of 5 UTR and exon, and intron and exon each). Eleven markers contained more than one SSR locus. The *FaFS01*, *FaAR01*, *RhUF01*, *FaMY02*, and *FaFG01* markers contained two SSR loci, the *FaCH01*, *FaCH02*, *FaF3H01*, *FaBG01*, and *RiMY01* markers contained three SSR loci, and the marker *FaMY01* contained five SSR loci. Among the new SSR markers developed, 15 were developed using *F*. × *ananassa* sequences whereas nine were developed using *Rubus* species sequences.

#### *3.2. Cross-Specific Transferability of SSR Markers*

The 24 genic SSRs developed in this study and one published SSR from *R. idaeus* (Table S1) were evaluated for cross-amplification in two important genera of the subfamily Rosoideae, *Fragaria* and *Rubus*. These 25 SSR loci represented 18 structural and regulatory flavonoid biosynthesis genes and three PR protein genes. A total of 48 specimens belonging to 11 species and hybrids with a wide range of ploidy (di-, tetra-, hexa-, hepta- and octoploids) were used for marker validation (Table 1). The collection of selected specimens included samples from different breeding programs worldwide, including specimens from Finland, Germany, The Netherlands, Poland, Russia, Sweden, Switzerland, UK, Ukraine, and USA. Cross-amplification results and allele sizes are presented in Table S1. Twenty

two of the 25 primer pairs amplified a PCR product or products of approximately the size expected for a homologous gene. Only primer pairs for the *RiMY01* locus generated multiple bands of approximately the expected size in *R.* × *stellarcticus* hybrids. In total, 10 primer pairs, representing nine out of 21 genes, amplified a product of the expected size in all two genera, indicating that primer binding sites were conserved across two rosaceous genera screened.

All primer pairs (15) from *F*. × *ananassa* amplified a PCR product in each of the 14 strawberry specimens including *Fragaria* × *Comarum* hybrid that showed 100% transferability despite the various genetic origin. More than two (up to eight) fragments were amplified in the *F*. × *ananassa* specimens, which was expected because of their polyploid nature. Ten SSR markers (67.7%) revealed genetic differentiation among strawberry specimens, while polymorphism was not detected in five SSR markers. Among them, two loci (*FaAR01* and *FaCH01*) each had two amplified fragments, but they were identical in all tested specimens. In order to evaluate the transferability of SSR markers in the diploid *Fragaria* species, the developed PCR primer pairs were used to genotype *F. vesca* and *F. viridis*.

Transferability of SSRs within *Fragaria* was high. Eleven of these primers (73.3%) amplified fragments in *F. vesca* and *F. viridis*. However, four strawberry primer pairs for the *FaDR01*, *FaLR01*, *FaMY01*, and *FaMY02* loci failed to amplify in two *Fragaria* species, suggesting that either these sequences have diverged between octoploid and diploid species or are not present in *F. vesca* and *F. viridis*, but are present in two other ancestors of the octoploid genome of *F*. × *ananassa*.

The transferability of *Rubus* markers within the genus was lower than that of *Fragaria—*50–70% for 10 markers, depending on the species. The maximum transfer was in red raspberry, the minimum—in hybrid arctic bramble: four markers worked in all six hybrid arctic bramble cultivars, but *RcFH01*—only in two of them. All of the five *Rubus idaeus*-derived SSR markers successfully amplified fragments in all red raspberry cultivars. In addition, a marker from *R. coreanus* and only one markers from blackberry (*RiMY01*) were amplified in red raspberry cultivars. Six *Rubus* SSRs amplified PCR products in all *Rubus* species and five in all six hybrid arctic bramble cultivars. In addition, the *RiG001* marker was amplified only in red raspberry. Typically, markers are amplified within the same species, but three markers *RhDR01*, *RhDR02*, and *RhAR01* developed originally in a blackberry (cultivar Arapoho, NCBI GenBank accession number JF764809) did not produce a product in our blackberry cultivars and hybrids. Moreover, these three markers were not amplified in any tested specimen.

Transferability from *F*. × *ananassa* to *Rubus* species was demonstrated for 5–7 out of the 15 primer pairs (33.3–46.7%). Thus, the transferability of strawberry markers decreased as cultivars become less related: all 15 markers were amplified in *F*. × *ananassa*, 11 markers in two diploid *Fragaria* species, and 5–7 in *Rubus* species. Six out of the 15 *F*. × *ananassa* primer pairs (40%) amplified fragment of the expected size in red raspberry and hybrid cultivars such as Loganberry, Tayberry, and Boysenberry. Successful cross amplification in other *Rubus* species ranged from 33.3% in black raspberry and blackberry to 46.7% in Nordic species (*R. chamaemorus*, *R. arcticus*, and *R.* × *stellarcticus*), where the *FaLR01* marker was also amplified, while it did not produce any PCR product in other *Rubus* species. The majority of *Rubus* cultivars had one or two amplified fragments per primer pair, however, for some polyploidy cultivars, as well as the blackberry and hybrids, there were more than two fragments (up to four) amplified by some primer pairs. The octoploid species *R. chamaemorus* almost always showed the presence of only one or two fragments.

Transferability of *Rubus* SSRs to the *Fragaria* species was relatively low: only three out of 10 markers (all from *R. idaeus*) were amplified, and there were no differences among species. In total, four out of 14 SSR markers had amplified fragments in *F. vesca* and *F. viridis* in the same range of size as that in *F*. × *ananassa*, and the rest can be used to separate diploid species and octoploid strawberry. Six SSRs had *F*. × *ananassa* and unique fragments amplified, and four markers only unique fragments: the same for two diploid species (*FaFH01*) and different (*FaCH01*, *FaBG01*, and *RiMY01*). Only the *FaFS01* marker amplified two fragments unique to *Fragaria* × *Comarum* hybrid.

Twelve out of 18 SSR markers (66.7%) were found to be polymorphic in the 13 strawberry specimens. In addition, three markers (*FaAR01*, *RiAS01*, and *FaCH01*) had the same two fragments

amplified in all specimens. Thus, these markers were monomorphic in these specimens, but may be polymorphic in a wider collection. The majority of the polymorphic SSR markers in *Fragaria* genes (7 out of 10) had more than two fragments amplified in strawberry specimens, probably, originating from four genomes of octoploid strawberry. In total, 11 out of 14 markers (78.6%) were polymorphic in the genus *Rubus*. However, only one of them (*RhUF01*) was polymorphic in all species. There are two reasons for this, firstly, not all markers were amplified in all species. Secondly, a small number of specimens were used for most species. Most of the markers were polymorphic in species with a large number of tested specimens and/or having a hybrid origin. In red raspberry (14 cultivars), hybrid Arctic bramble (six specimens), and *Rubus* hybrid (four specimens) the proportion of polymorphic markers was 61.5%, 58.3% and 66.7%, respectively. This approximately corresponds to the proportion of polymorphic markers in strawberry. In black raspberry and arctic bramble, each having two specimens tested, 27.3% and 30.8% markers were polymorphic, respectively.

Monomorphic markers of strawberry *FaFS02*, *FaTG01* and *FaCH01* produced alleles of the same size in *Fragaria* and *Rubus* species—271, 324, and 324 + 329 bp, respectively. In addition, the polymorphic strawberry markers *FaFS01*, *FaLR01*, and *FaFH01* had one allele amplified in northern *Rubus* species, which was almost the same (different by only one or two nucleotides) as one of the main strawberry alleles. Interestingly, the alleles of *Rubus* species for seven markers from *Fragaria* had almost the same size as expected or were different by no more than 10–20 nucleotides in both directions, while the alleles in strawberry for *Rubus* markers (*RiAS01*, *RiMY01*, and *RiHL01*) were always less than the expected size by about 40–100 nucleotides. The most multi-allelic SSR markers (17–19 alleles; *FaFS01*, *FaMY0*2, and *RiMY01*) contained dinucleotide repeats in introns and presented series of consecutive alleles in 2-bp steps. In addition, the *RiAS01* marker with exon-located locus showed a number of alleles with a step of 24: 261, 285, 309, 333, and 357 in *Rubus* species; 261, 285, and 333 in *Fragaria* species.

#### *3.3. Allelic Polymorphism and Genetic Diversity*

Number of alleles, expected (*He*) and observed (*Ho*) heterozygosities, and polymorphism information content (PIC) were calculated for 12 polymorphic SSR markers in 13 octoploid strawberry specimens (Table 3) and in 24 diploid *Rubus* cultivars (Table 4). The number of alleles in strawberry specimens varied widely among these markers ranging from two in *RiMY01* to 14 in *FaFS01*, with 6.7 on average, respectively. The *Ho* and *He* values ranged from 0.37 to 0.90 and 0.35 to 0.90, with 0.63 and 0.66 on averages, respectively. The *FaCH02* locus, located in exon, demonstrated the lowest heterozygosity. The PIC ranged from 0.34 to 0.89 with an average of 0.66. The most markers (nine out of 12) had PIC values higher than 0.5, suggesting that these markers can efficiently measure genetic diversity in strawberry.


**Table 3.** Diversity of 12 polymorphic SSR markers in 13 *F*. × *ananassa* specimens.


**Table 4.** Diversity of 10 polymorphic SSR markers in 24 diploid *Rubus* cultivars.

From two alleles in the *RiHL01* marker to 13 alleles in the *RiMY01* marker were amplified in 24 diploid *Rubus* cultivars, with a mean number of 4.6 alleles per locus for 10 polymorphic markers (Table 4). Observed (*Ho*) and expected (*He*) heterozygosity ranged from 0 to 0.5 and 0.04 to 0.84 with mean values of 0.23 and 0.51, respectively.

Unexpectedly, the lowest heterozygosity was observed in the intron-located *RiHL01*marker. The PIC ranged from 0.04 to 0.82 with a mean of 0.46, which was noticeably lower than in strawberries (0.66), and only six markers had PIC values higher than 0.50 suggesting their high potential to measure high genetic diversity. The other four markers had *PIC* ranging from 0.25 to 0.50 showing their rather moderate potential to measure genetic diversity, and only the *RiHL01* marker was slightly informative (<0.25).

Principal Component Analysis (PCA) was used to reveal genetic relations among *Fragaria* (Figure 2) and *Rubus*(Figure 3) species and cultivars based on SSR markers representing the flavonoid biosynthesis pathway genes. The first three PCs explained 22.5%, 16.3%, and 9.8% of the total variance in *Fragaria*, respectively (Figure 2). All cultivars of *F*. × *ananassa* formed a compact group and were completely separated from diploid *Fragaria* species and (*F*. × *ananassa*) × *C. palustre* hybrid. *F. vesca* and *F. viridis* were grouped along the PC1, whereas *F. virids* and *F*. × *ananassa* along the PC2. Interestingly, *Fragaria* × *Comarum* hybrid (*F*. × *ananassa*) × *C. palustre* was very distant from the rest *Fragaria* cultivars and species, which is in agreement with a pink color of flowers in this hybrid, which makes it also very different from other cultivars. The PC3 did not contribute much in delineation of cultivars, therefore plots with PC3 are not presented here.

The PCA in Figure 3 represents the relationships between 32 individual *Rubus* cultivars and species. The first three PCs explained 20.3%, 11.4%, and 9.7% of the total variance, respectively. The PC3 did not contribute much in delineation of cultivars, therefore plots with PC3 are not presented here. In general, the grouping was as expected, and a good discrimination was observed between four *Rubus* subgenera—all of them were well-separated along PC1 and PC2. Unexpectedly, arctic bramble (*R. arcticus*; Cyclastis subgenus) was very distant from other *Rubus* species. There was a clear overlap between two *R. arcticus* and *R.* × *stellarcticus* clusters. Despite belonging to different subgenera black raspberry (*R. occidentalis*; Idaeobatus subgenus) cultivars were closer to blackberry (*Rubus* subgenus), which is in agreement with having also a common black color of their berries. The *Rubus* hybrids cluster coincided with the red raspberry (*R. idaeus*; Idaeobatus subgenus) cluster and was clearly distinguished from blackberry group, which suggests closer relationship of hybrids with red raspberry. All these hybrids have berries in different shades of red color.

**Figure 2.** PCA of genotyped strawberry (*Fragaria*) species and specimens based on 13 polymorphic SSR markers.

**Figure 3.** PCA of *Rubus* cultivars based on 10 polymorphic SSR markers.

#### *3.4. Genetic Data Analysis*

To determine the relationship between transferability and polymorphism of SSR markers and the location of loci and of primer pairs, the data were grouped in the Table 5. The identification of the location of the primer binding sites and the separation of the SSR markers into four groups based on this trait showed their clear connection with transferability level. When both primers are located in the conserved exons, the complete transferability is observed both within *Fragaria* and *Rubus* species and the cross-amplificatons between the *Fragaria* and *Rubus* genera. In the opposite direction (from *Rubus* to *Fragaria*), only one out of three markers (*RiMY01*) was amplified.


**Table 5.** Relationships between transferability and polymorphism of SSR markers and location of primer pairs and SSR loci in gene.

<sup>1</sup> in—intron, ex—exon; up—upstream; the number is not provided in case, if gene has only a single or no intron.

<sup>2</sup> *R. idaeus*/*R. occidentalis* and blackberry/hybrids/*R. chamaemorus* and *R. arcticus*/*R.* × *stellarcticus.* <sup>3</sup> *R. idaeus*/

*R. occidentalis*/blackberry/hybrids/*R. arcticus*/*R.* × *stellarcticus.* n—no amplification.

When one of the primer binding site is located in more variable NCDS (introns or 5 UTR), the transferability level decreases. With intrageneric transferability, two of the *F*. × *ananassa* markers, *FaMY01* and *FaLR01*, were no amplified product in the diploid *F. vesca* and *F. viridis*, while out of the three *Rubus* markers, only the *RiHL01* (representing TF) marker was amplified in all *Rubus* cultivars. The *RiG001* marker was amplified only in red raspberry, and *RcF3H* was transferred in Anna and Beata, but not in Linda, Sofia, Astra, and Aura hybrid arctic bramble cultivars. The transferability between the genera was also much lower. Out of the five *Fragaria* markers, only *FaLR01* was amplified in the Nordic species: cloudberry, Arctic bramble, and hybrid Arctic bramble, and *FaCH01*—in some cultivars of red raspberry, hybrids and Nordic species, but was not amplified in black raspberry and blackberry at all. A similar situation was observed when both primer binding sites were located in NCDS: the amplification of some markers failed in *Fragaria* diploid species, and none of the *Fragaria* markers was amplified in *Rubus*.

The location of one of the binding sites across intron-exon junction had the worst effect on transferability. Three out of four markers were not amplified in any specimen, even when the second primer site was located in exon and in the same species (blackberry). Only the *RiMY01* marker amplified some fragments, but it was inconsistent; multiple fragments were generated in Anna and Beata, but no fragments in Linda, Sofia, Astra, and Aura hybrid Arctic bramble cultivars.

We also observed a clear relationship between the location of loci and polymorphism level. When loci were located in introns or 5 UTR, the larger number of alleles, up to 14 alleles in *F*. × *ananassa* and 17 alleles in *Rubus*, was observed, and all markers were polymorphic. On the other hand, the allele number in exon-located loci did not exceed four in *F*. × *ananassa* (*FaCH02*) and five in *Rubus* species (*RiAS01*, *RhUF01*) and many markers were monomorphic. For example, all three markers that amplified one fragment in strawberry specimens were located in exons. Markers that contained more than one microsatellite locus located both in CDS and NCDS demonstrated high polymorphism (*FaFS01*, *FaFG01*) as well as monomorphism (*FaCH01*). In total, nine out of 12 polymorphic markers in strawberry were in NCDS, two in exons + introns, and only one in exon (*FaCH02*). More than half (six out of 11) of polymorphic markers in *Rubus* were also located in introns. With regard to the relationship between polymorphism and SSR motif type, the highest allelic variation was revealed in markers with dinucleotide motifs and large number of repeats, such as in the *FaMY02*, *FaFS01*, and *RiMY01* markers. It should be noted that two of these markers represented the *MYB10* TF genes in raspberry and strawberry.

#### *3.5. Sequence Analysis of Chitinase III Genes*

Based on the *F*. × *ananassa* chitinase III (chi3) sequence (GenBank accession number AF134347), we designed a pair of primers and amplified cDNA fragments from three raspberry cultivars with yellow-, orange- and red-colored berries (Zolotaya Osen, Oranzhevoe Chudo, and Babye Leto II, respectively). Sequencing confirmed that these fragments were composed of 528 nucleotides within the full length of open reading frame (ORF) and encoded 176 amino acids. The red-fruited Babye Leto II cultivar differed from the other two cultivars by two synonymous nucleotide substitutions. The nucleotide sequences of three chitinase III gene fragments were deposited in the NCBI GenBank (accession numbers MK333194, MK333195, and MK333196, respectively). The translated amino acid sequences of the raspberry chitinase III were aligned with published sequences of strawberry (cv. Chandler) and raspberry of unknown origin (Figure S1) [33]. The identity between amino acid sequences of the amplified chitinase III gene was 93.1% for all three Russian raspberry cultivars vs. unknown raspberry and 86.9% vs. strawberry. Twenty amino acid substitutions were the same for all three Russian and one unknown raspberry cultivars compared to strawberry sequence. In addition, eight substitutions were unique only for the unknown raspberry cultivar, and two substitutions were unique for our three cultivars.

#### **4. Discussion**

Modern plant breeding, including also berry crop breeding, seems to be almost impossible without modern genomic methods. They are needed to develop DNA based molecular genetic functional markers, such as single nucleotide polymorphisms (SNPs) and SSRs in adaptive and breeding trait related genes. The SSRs are highly variable in length due to insertion-deletion of the entire repeat units or motifs, mainly as a result of recombination errors or DNA polymerase slippage [34]. Genic SSRs could be even more informative than SNPs because unlike biallelic SNPs they usually have multiple alleles that can mark multiple alleles and haplotypes in these genes. It makes them very suitable for the MAS [35]. In comparison to random noncoding genomic SSRs the genic and EST-SSR markers are more transferable [21,36] and are better amplified [37] because of more conservative primer annealing sites. In addition, in silico development of genic and EST-SSR markers can now be relatively easily done using publicly available nucleotide sequence databases. Genic SSRs in functional genes can be used as "functional genetic markers", and they have a much higher transferability across different taxa than random genomic SSRs [38]. Sargent et al. [39] used primer pairs based on the binding sites in the *Fragaria* exons flanking polymorphic introns and found that their transferability was significantly higher compared to the random genomic SSRs. However, genic SSRs are usually less polymorphic than random genomic SSRs, which can limit their use in MAS [34]. Thus, the most optimal marker could be those that have a SSR locus in a variable gene region such as intron and the primer annealing sites in the conserved exons flanking this intron. In case of long introns, it would be important to have at least one annealing site in an exon. However, information on the exon-intron structure is not available in the

EST databases, more genome sequence data become available allowing to design reliable, consistent, polymorphic, functional, and transferable genic SSRs.

#### *4.1. Choice of Genes and Genic SSR Marker Development*

The flavonoid pathway is initiated by chalcone synthase (CHS) and involves more than 10 enzymes that act at early and late stages leading to the biosynthesis of different compounds such as flavonols, condensed tannins (proanthocyanidins) and anthocyanins (Figure 1). It is well known that pelargonidin-3-glucoside is a major anthocyanin in strawberry [40], whereas cyanidin-3-sophoroside is a major anthocyanin in red raspberry, followed by other cyanidin glycosides [41]. The late structural genes are regulated at transcriptional level by a ternary protein complex named MBW, which is formed by R2R3-MYB TFs, basic helix-loop-helix (bHLH) proteins, and WD40-repeat proteins [42].

For development of new SSR markers we used 13 structural and four regulatory flavonoid biosynthesis genes from GenBank (9 *F*. × *ananassa* and 8 *Rubus* species genes) (Table S1). Particular attention was paid to the flavonol synthase (*FLS*) and dihydroflavonol 4-reductase (*DFR*) genes, for which two SSR markers were developed. These enzymes competed for common substrates, dihydroflavonols (Figure 1), in order to direct the biosynthesis to colorless flavonols or colored anthocyanins, respectively, and may determine color phenotype [43]. For comparison, we also used a pair of primers designed for the *RiG001* locus from the *R. idaeus* aromatic polyketide synthase (*RiPKS3*) gene [28]. Unlike the RiPKS1, the typical naringenin chalcone synthase (CHS), the *RiPKS3* produced predominantly p-coumaryltriacetic acid lactone [44], but the sequences of both genes amplified by their PCR primer pairs are almost identical [23]. Among the *MYB TFs* genes we used the *MYB10* orthologs involved in the anthocyanin biosynthesis during ripening in more than 20 commercially important *Rosaceae* species [45]. Assuming its important role in flavonoid pathway regulation, we developed two markers for this gene. The *bHLH* gene from red raspberry is very similar to the *MdbHLH33* gene that is closely associated with anthocyanin production in apple [46]. In addition, we included in the study genes encoding PR proteins—chitinase (*FaChi2-1*) and ß-1,3-glucanases (*FaBG2-2* and *ToyoGluIII*). Previous studies demonstrated that expression of the *FaChi2-1* and *FaBG2-2* genes in strawberry are induced in response to pathogen inoculation [47].

A total of 118 SSR loci were identifies in 21 genes, and pentanucleotide motifs were the most abundant. Our result is in contrast to previous findings identifying trinucleotide [21,37] as the most frequent genic repeats in Rosaceae plant species, unlike dinucleotide motifs that were the most frequent among non-genic random genomic repeats [36,48,49]. The difference can be also explained, at least partly, by less stringent search parameters that were used in our study compared to those that are typically used in searches for random genomic SSRs, but they were similar with those that are usually used for the search of SSRs in coding regions. For example, Park et al. [37] found that the most profound allelic variation was revealed by the primer pairs flanking the penta-repeats (91%), whereas authors noted no significant difference among di-, tri-, and tetra-repeat (61–66%) motifs.

Identified SSRs were categorized by location in exons, introns, 5 UTRs, 3 UTRs, and upstreams. SSRs were located mainly in NCDS in genes: although exons occupied 45% of the gene length, but they contained only 28% of the SSR loci. It is known that genic SSRs located mainly in variable NCDS, but not in conserved exons. For example, development of genome-wide SSR markers in such different species as papaya and chickpea showed a similar distribution of SSRs across genome: 78–87% were in the intergenic regions, 9–10% in introns, and 2–3% in exons [34,50]. It has been repeatedly reported, that tri- and hexanucleotide motifs were more abundant in exons since they do not lead to frame shift and do not effect protein function and property as much as mutation of other repeats that could be under mutation pressure. Meanwhile, di-, tetra-, and pentanucleotide motifs are abundant in NCDS [35,36,50]. Such selection also reduces the variability of SSRs in exons. This is not consistent with our data, where pentanucleotide SSRs were the most common in the exons. However, apparently selection nevertheless occurred, and we did not find SSRs with mono-, di-, and tetranucleotide motifs in exons in our study, although they made up a significant part of all SSRs in introns and UTRs (Table 2). We found also that A and T nucleotides prevailed both in general in the studied genes (72%), and to a greater extent in NCDS (80%). It is known that some motifs, such as AT/TA, showed a greater abundance in most species [51].

When developing markers, we took into account the location of loci in genes (so that they were in different gene regions), as well as the location of several loci in the same marker. In addition, preference was given to dinucleotide motifs and a greater number of motif repeats, because such SSRs are more polymorphic [28,51]. As a result, we designed 15 and 9 primer pairs for SSR markers based on nucleotide sequences of genes with known function in *F*. × *ananassa* and *Rubus* species, respectively. These markers included all flavonoid biosynthesis genes available for *Fragaria* and *Rubus* in the NCBI GenBank database. Earlier, sets of SSRs for poplar genes involved in wood formation [52] or stress related genes in peanut [53] were developed, but not on flavonoid biosynthesis genes. We also developed SSR markers for the TF regulatory genes, which were not previously reported for *Fragaria* and *Rubus*.

#### *4.2. Choice of Cultivars and Transferability of SSR Markers*

It was shown that total anthocyanin content correlated with color of berry in both strawberry [54] and raspberry [55], and cultivars were selected primarily to have a broad variety of berry colors. In addition, we took into account their commercial value and application in the breeding programs. Other *Rubus* and *Fragaria* cultivars were chosen due to their diverse ploidy (black raspberry, blackberry, loganberry, and boysenberry) and as wild potential donors of traits of interest (*F. vesca* and *F. viridis*). The rare cultivated species of *Rubus* were also included in the study, such as cloudberry (*R. chamaemorus*), arctic bramble (*R. arcticus*), and hybrid arctic bramble (*R.* × *stellarcticus*). Only a few rare reports on the SSR markers are available for these boreal species [56,57] that are rich in ellagic acid and are regionally extremely important and valuable crops. Moreover, *R. arcticus* is used to develop new cultivars. The hybrids of the octoploid *F*. × *ananassa* and the hexaploid *C. palustre*, which unlike the white-flowered strawberry have red and pink flowers and are usually grown for ornamental purposes, are also of high interest for the flavonoids biosynthesis research.

The amplification of the SSR markers in the strawberry specimens using primer pairs based on the *F*. × *ananassa* sequences was 100% successful, and this is in agreement with data obtained by many other researchers. The transferability of the strawberry SSR markers to *F. vesca* and *F. viridis* was noticeably lower (73.3%): *FaMY01*, *FaMY02*, *FaDR01*, and *FaLR01* were not amplified using primer pairs developed from *F*. × *ananassa* genes. It is known that rapid genomic changes can occur after polyploidization, and loss of homologous copies of many duplicated genes was often observed [21]. Hirakawa et al. [58] calculated that the size of the octoploid *Fragaria* genome (698 Mb) approximately equalled 80% of the combined genomes of its four diploid wild relatives, *F. iinumae*, *F. nipponica*, *F. nubicola*, and *F. orientalis* (~200 Mb each). It is also possible that primer binding sites in *F. vesca* and *F. viridis* for these SSR markers have mismatches with sites that were used to design primers. It cannot be also excluded that the prolonged selection (with the purpose to change the quality of strawberries) contributed to the discrepancy between binding sites. The alleles unique for some diploid species were identified at a number of loci. The marker transferability to (*F*. × *ananassa*) × *C. palustre* hybrid is not different from the strawberry specimens. This happened due to the fact that *C. palustre* is a close relative of *Fragaria*—they are members of the same subtribe Fragariinae [1]. Twelve out 15 markers (80%) were polymorphic in *F*. × *ananassa* (produced two and more fragments). This is consistent with other authors who reported 81% [59] and 91% [21] of polymorphic loci.

The transference among the *Rubus* species was not as successful as among *Fragaria* species. The transferability of loci from the *Rubus* flavonoid pathway genes was 70% for red raspberry, 50% for hybrid arctic bramble, and 60% for the other *Rubus* species. This fact suggests a close genetic relationship among the studied *Rubus* species and a high conservativeness of the flavonoid biosynthesis genes. Mnejja et al. [16] showed that transferability negatively correlated with genetic distance between the genera in the Rosaceae family and between species within the *Prunus* genus. However, many

authors reported similar and high (80–100%) cross-species amplification of raspberry loci (*Idaeobatus* subgenus) in other species, such as black raspberry (*Idaeobatus* subgenus) [60,61], blackberry (*Rubus* subgenus) [28,60,61], and Arctic bramble (*Cyclastis* subgenus) [57]. The *RiG001* marker was amplified only in red raspberry. Castillo et al. [28] found that this marker was not amplified in blackberry and hybrids. We demonstrated it also in black raspberry earlier [23] and in cloudberry, arctic bramble, and hybrid Arctic bramble in this study. This marker appears to be a good identifier for red raspberry among cultivated *Rubus* species.

The transferability of *F*. × *ananassa* loci to *Rubus* species was moderate: 46.7% in Nordic species, 40% in red raspberry and hybrids, and 33.3% in black raspberry and blackberry (33.3%). Other authors [20,62] reported even a lower transferability from *F*. × *ananassa* to diploid red and black raspberries (*Idaeobatus* subgenus, 8–23%) than to tetraploid blackberry (*Rubus* subgenus, 26–36%). Only two of the 15 strawberry markers were polymorphic in raspberries (*FaFS01* and *FaCH01*) and five markers if all *Rubus* species were taken into account. This is consistent with other data that demonstrated the difficulty of identifying polymorphic loci in *Rubus*—only 10% of the markers from *Fragaria* amplified a polymorphic product in *Rubus* [17].

There are no published data on transferability of markers from *Rubus* to *Fragaria*. In our study, only three of 10 *Rubus* primer pairs amplified in all *Fragaria* specimens. All of them were from *R. idaeus* (the *ANS* locus and two TF genes—*MYB10* and *bHLH*). This level is slightly higher than the transferability from *Fragaria* to *Rubus*. It should be noted that the primers for the similar genes in *Fragaria*, *ANS* and *MYB10*, did not amplified in *Rubus*. The amplification of *RiMY01* showed the presence of unique alleles in *F. vesca* and *F. viridis*.

The moderate transferability level between *Fragaria* and *Rubus* suggests a remote relationship between these two genera. Potter et al. [1] reported that the phylogenetic relationship between *Fragaria* and *Rosa* is closer than between *Fragaria* and *Rubus*. Qi et al. [63] evaluated the SSR primer pairs using in silico PCR and demonstrated that strawberry is the closest to the rose followed by the raspberry. We found an interesting regularity in the size of alleles during cross-species amplification: alleles in *Rubus* species amplified using *Fragaria* primers had sizes similar to the expected, while alleles in *Fragaria* species amplified using *Rubus* primers were significantly smaller than expected. This was not previously reported, since there was no work on the transferability of *Rubus* markers to *Fragaria*. Perhaps this is due to the fact that the genus *Rubus* is evolutionary older than the *Fragaria* genus [1,64].

The SSR transferability to the poorly studied northern *Rubus* species did not differ from raspberries and blackberries, except for the amplification of the *FaLR01* marker in them. Kostamo et al. [57] reported amplification of all seven markers in arctic bramble cultivars using primer pairs developed for raspberry SSR loci, but the annealing temperature was lowered to 50 ◦C from the original 60 ◦C. It is known that lowering the annealing temperature may increase transferability, but nonspecific amplification can occur [21]. We have repeatedly observed null-alleles among hybrid arctic bramble cultivars for the *FaFS01*, *RcFH01*, *FaLR01*, *FaFH01*, and *FaCH01* markers and only once for cultivars of other *Rubus* species (for FaCH01 marker). This is probably due to the high heterozygosity in primer binding sites that results in mismatch between primers and binding sites in the *R.* × *stellarcticus* cultivars. Closer similarity of allele sizes to strawberry markers in northern *Rubus* species compared to raspberries and blackberries, as well as the amplification of the *FaLR01* marker in them, suggests their closer relationship with *Fragaria* compared to other tested *Rubus* species.

#### *4.3. Genetic Diversity of Fragaria and Rubus*

The number of alleles varied among 12 SSR markers genotyped in 13 strawberry specimens and ranged from two to 14 alleles (6.67 alleles on average) (Table 3). This is slightly higher than previously reported for *F*. × *ananassa* (5.6) [65]. The low number of alleles per locus can reflects a poor choice of microsatellites or low levels of genetic diversity. Hilmarsson et al. [10] found a mean number of alleles in *F. vesca* was only 4.5 and the authors believed that this was due to low levels of genetic diversity in the species, but the most polymorphic marker had 16 alleles. Within the strawberry collection, the

mean *Ho* and *He* were 0.63 and 0.66, respectively, and the mean polymorphism information content (PIC) was 0.66. Our *He* value is similar to 0.66 obtained by Yoon et al. [65], but *Ho* and PIC significantly exceed their values of 0.51 and 0.45, respectively. The mean *Ho* and *He* vary significantly across different *Fragaria* species: from 0.08 and 0.17 in *F. vesca* [10] up to 0.75 and 0.86 in *F. virginiana* [66], respectively. Nine informative markers having high polymorphism information content (PIC) values (0.52–0.89) can be used for efficient evaluation of large collections of strawberry samples.

The parameters of polymorphism were also calculated for 24 diploid *Rubus* cultivars based on 10 SSR markers (Table 4). From two to 13 (mean 4.6) alleles were observed for polymorphic markers in diploid *Rubus* cultivars. These values were rather similar with published results obtained in other diploid *Rubus* species. From two to 5 (mean 3) alleles per polymorphic locus were observed in 21 *R. occidentalis* cultivars [67] and 2–15 (7.5) in 24 *R. idaeus* cultivars [28]. In our study, the mean values of *Ho*, *He*, and PIC were 0.23, 0.51, and 0.46, respectively. These values were lower than those observed in *R. idaeus* [28] and *R. coreanus* [49]. This may be because the gene sequences, from which our SSR markers are derived, are more conserved compared to random genomic SSRs used in these published reports. However, average *He* and PIC in our study were higher than in the published study in *R. occidentalis* [67], while *Ho* was lower. The most polymorphic SSR loci based on high *Ho*, *He*, and PIC were *RiMY01* and *FaFS01*. *FaFS01* was also the most polymorphic marker for strawberry. These loci contained long dinucleotide repeats, which usually have higher levels of polymorphism compared to other repeats [21]. The *RiG001* marker demonstrated the PIC value, that was very similar with results, reported by Castillo et al. [28]—0.43 vs. 0.46, respectively. Our results demonstrated that the SSR markers developed in this study might be useful for the genetic assessment of *Fragaria* and *Rubus* species.

PCA of 16 *Fragaria* specimens genotyped with 13 SSR markers demonstrated clear separation between *F*. × *ananassa*, *F. vesca*, *F. viridis*, and *Fragaria* × *Comarum* hybrid (Figure 2). Biswas et al. [68] also demonstrated a clear separation of 26 *F*. × *ananassa* and 7 *F. vesca* specimens by SSR. Vallarino et al. [69] showed a clear separation between domesticated (*F*. × *ananassa*) and wild (*F. moschata*, *F. vesca*, and *F. chiloensis*) specimens using profiles of primary and secondary metabolites, but wild species formed a common cluster. Unlike our study, Sanchez-Sevilla et al. [70] showed no differences between (*F*. × *ananassa*) × *C. palustre* hybrid (Pink Panda cultivar) and strawberry cultivars using РCA based on DArT markers, although hybrid and cultivars were most diverse according to the phylogenetic dendrogram.

PCA of 32 *Rubus* cultivars genotyped with 10 SSR markers demonstrated, in general, a good separation of the four *Rubus* subgenera, except *R. occidentalis*, which clustered separately from *R. idaeus* despite belonging to the same *Idaeobatus* subgenus (Figure 3). Earlier, PCA of 50 *Rubus* cultivars based on genomic SSRs showed that black raspberry is the most distant from other *Rubus* species [60]. Interestingly, in report of Simlat et al. [71], Jewel cultivar was not separated from red raspberry using PCA based on SSRs. According to our data, *Rubus* hybrids clustered together with red raspberry, while in Graham et al. [60] they were located approximately in the middle between red raspberry and blackberry cultivars.

#### *4.4. Relationship between Properties of SSR Markers and Their Location*

The more transferable SSR markers are characterized by a lower frequency of null alleles. The most likely reason for null alleles are mutations in one or two primer binding sites creating mismatch between primer and binding site sequences [22]. The chances for mismatch depend on location of these sites in gene, and, therefore, location of binding sites is important for amplification. We showed a clear relationship between the location of primer binding sites and transferability of SSR markers within and between *Fragaria* and *Rubus* genera (Table 5). The most transferability demonstrated markers that were amplified by primer with binding sites located in exons. The location of binding sites in NCDS resulted in a significant decrease of transferability. It should be noted that location only one of two binding sites in variable sequences was sufficient to drastically reduce transferability, especially between genera. It is known, that chances of mutations in binding sites and their mismatch with primers will increase in taxa that are phylogenetically distant [48]. In our study, all null alleles in some *R.* × *stellarcticus* hybrids were observed only for loci with primer binding sites located in NCDS.

The lowest transferability was found also when one of the two primer binding sites were located across the intron/exon junction. Three out of four markers with such binding sites were not amplified in any of the 48 tested specimens, even in the blackberry which sequences were used to develop these markers. Thus, we experimentally confirmed the assumption of Vidal et al. [34] that PCR amplification failures can occur in the case of designing primers with binding sites across exon-intron junctions. It should be noted that for two SSR loci detected in the *RhDFR* gene and one SSR in the *RhANR* gene all computer programs used in this study to design primers selected the intron-exon junctions for binding sites, and these two markers were not amplified. Thus, either parameters for designing primers for these markers should be changed or primers should be designed manually to develop successful SSR markers for these genes.

The polymorphism of SSR markers in transcribed regions can affect transcription, translation, and/or gene function. SSR polymorphisms within exons can result in amino acid change that can lead to a gain or loss of function, in the 5 UTR they can regulate gene expression by affecting transcription and translation, in the 3 UTR they can be responsible for gene silencing or transcription slippage, and in introns they can affect gene transcription, mRNA splicing, or export to cytoplasm [35,52]. Ultimately, all these polymorphisms can affect phenotypes. However, the likelihood of polymorphism in different gene regions may vary. Our studies have shown a clear association between polymorphism of SSRs and their location in CDS and NCDS (Table 5). Highly polymorphic loci were located in introns, 5 UTRs, and upstreams, whereas loci with moderate polymorphism and monomorphic ones were located in exons. This relationship has been observed for both *Fragaria* and *Rubus*. For example, the *FaFS01* marker representing SSR in the second intron of the flavonol synthase gene was one of the most polymorphic in strawberries (14 alleles) and *Rubus* (eight alleles), while the *FaFS02* marker representing SSR in the first exon of the same gene was monomorphic in both genera. In general, SSRs located in introns had 5.5 and 6.0 alleles per locus on average in *Fragaria* and *Rubus*, respectively, whereas in SSRs located in exons—only 1.8 and 3.0 alleles per locus, respectively. Our results confirm the data of Du et al. [52] that intronic SSRs of *Populus tomentosa* were more variable (3.7 alleles/locus) than exonic SSRs (2.4 alleles/locus), likely due to higher selection pressure on CDS than on NCDS.

#### *4.5. SSR Markers Representing Transcription Factor (TF) Genes*

The TFs of MYB and bHLH families are widespread in plants. Using the lily transcriptome Biswas et al. [72] developed 71 SSRs in genes of 31 different TF families and most of them represented the bHLH TF (10 SSRs) and MYB (nine SSRs) families. The SSR markers in the TF gene in plants were first developed by Kujur et al. [73], which showed the influence of the SSR polymorphism on secondary structure of proteins and on such traits as seed weight and number of pods and seeds. Since then, there have been few similar studies in Rosaceae where similar markers have been developed for Japanese plum [24], but no such markers have been reported for *Rubus* and *Fragaria* before our first study, where we developed a SSR marker representing the *RiMYB10* TF gene, an activator of transcription of flavonoid biosynthesis [23]. The association between the *MYB10* gene and fruit color has been demonstrated for apple [45] and peach [74]. Gonzalez et al. [24] found three allele variants in the EST–SSRs designed for the *PsMYB10* TF gene. In our study of 21 varieties of red and black raspberries presented here, we showed that the *RiMY01* marker representing the *R. idaeus MYB10* gene had a significantly greater polymorphism compared to six markers representing structural genes of flavonoid biosynthesis—9 vs. 2–4 alleles per locus and *PIC* of 0.82 vs. 0.05–0.35 [23]. This is consistent with high PCR amplification efficiency and high polymorphism found for SSR markers representing TF genes in a number of crops [72].

In this study, we developed five SSRs using sequences of all publicly available TF genes of *Rubus* and *Fragaria*, which are part of the MBW transcriptional complex. As a result, two of the three most multiallelic markers were the *RiMY01* and *FaMY02* from TF genes. We assume that their high variability is caused by a combination of several factors: arrangement in introns, long dinucleotide repeats, and a noticeable AT enrichment. It is interesting that according to Biswas et al. [72] the SSRs in most TF genes were GC-rich, but genes in the bHLH and MYB families have higher AT content than other GC-rich genes. It is possible that this is a feature of these TF families, causing their increased variability. High level polymorphism of these SSR markers suggests association of alleles in flavonoid biosynthesis TF genes with variation in flavonoid accumulation and composition. Thus, we believe that SSR markers developed in this study and representing the TF genes can serve as useful tools for MAS of *Fragaria* and *Rubus* species.

#### *4.6. Chitinase III as Allergen in Raspberry*

Chitinases are able to degrade the chitin, a major component of fungal cell walls, and they play key roles in plant defense system from fungal pathogens. Plant chitinases are also a well-known group of food allergens. It is a relatively small group, but it is present in highly consumed fruits [75]. Chitinases can be divided into five classes (I-V), and plant class III chitinases (PR-8 proteins) have an additional lysozyme activity, which is not found in other classes of chitinases [76]. Marzban et al. [33] identified four allergen proteins including class III chitinase in raspberry fruits. They showed high sequence identity of raspberry proteins to different PR protein families in Rosaceous species (especially to Fra a strawberry proteins) and suggest that the consumption of raspberries might be responsible for adverse reactions in sensibilized individuals. We sequenced fragments of chitinase III genes in three Russian raspberry cultivars that have yellow-, orange- and red-colored berries and compared them with the chitinase III genes in Chandler strawberry (GenBank AF134347) and unknown raspberry [33]. These sequences matched both raspberry and strawberry sequences, but there were nonsynonymous substitutions in the raspberry sequences. Two nonsynonymous substitutions differed both strawberry and previous published raspberry chitinase III sequences. The amino acid sequences of all three raspberry cultivars with different berry colors were identical.

It is known that fruit color correlates with allergen content in strawberry. Hjerno et al. [77] demonstrated that the strawberry Fra a 1 allergen (a homolog of the major birch pollen allergen Bet v 1) is synthesized in red ripe fruits of *F.* × *ananassa*, but not in white (colorless) cultivars. Proteomic analyses have shown that Fra a 1 allergen and several enzymes of the flavonoid biosynthesis pathway were down-regulated. Later, suppression of three Fra a genes using RNAi approach alters phenolic compound levels in strawberry and led to decreased accumulation of main anthocyanins responsible for the red color of fruits [78]. Finally, Casanal et al. [79] suggested that Fra a proteins may play an important role in the control of flavonoid pathways by binding to metabolic intermediates. It can be assumed that a similar connection exists in raspberry, and that future studies in this direction are necessary. If this hypothesis is confirmed, yellow-colored raspberry cultivars will have less allergenic potential and are suitable for feeding children with increased allergenic sensitivity.

#### **5. Conclusions**

In this study, we developed a set of SSR markers representing structural and regulatory flavonoid biosynthesis genes of berry crops. These genic SSRs have been successful to identify allelic variations in *Fragaria* and *Rubus* species with contrasting color berry phenotypes. The results suggest that boreal *Rubus* species are more related to *Fragaria* compared to raspberry and blackberry. We demonstrated a clear relationship between transferability of SSR markers within and between the *Fragaria* and *Rubus* genera and location of primer binding sites and between the marker polymorphism and its location in coding and non-coding sequences. The SSR markers representing TF genes showed high allelic variability and may be good candidates for MAS in berry species. The genic SSRs developed in this study may be used for future genetic diversity and population genetics studies in *Fragaria* and *Rubus* species, as well as may be considered as candidate markers in breeding programs for improvement anthocyanin related traits.

*Genes* **2020**, *11*, 11

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2073-4425/11/1/11/s1, Figure S1: Alignment of the chitinase III amino acid sequences from strawberry and raspberry cultivars *F*. × *ananassa* cv. Chandler (AF134347), *R. idaeus* of unknown origin (Marzban et al. [33]), *R. idaeus* cvs. Zolotaya Osen (MK333194), Oranzhevoe Chudo (MK333195), and Babye Leto II (MK333196). Identical amino acid positions are highlighted by the same color, Table S1: Data on 25 SSR loci and their PCR primer pairs used to genotype *Fragaria* and *Rubus* specimens.

**Author Contributions:** Conceptualization, V.G.L. and K.A.S.; Methodology, V.G.L. and K.A.S.; Investigation, V.G.L., N.M.S., O.P.M. and K.A.S.; Formal Analysis, V.G.L., O.P.M. and T.N.L.; Visualization, T.N.L.; Data curation, V.G.L., K.V.K. and K.A.S.; Funding Acquisition, V.G.L., K.V.K. and K.A.S.; Project Administration, V.G.L. and K.A.S.; Resources, V.G.L. and K.A.S.; Supervision, V.G.L. and K.A.S.; Writing, V.G.L., K.V.K. and K.A.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** The work was financially supported by the Ministry of Education and Science of the Russian Federation (grant No. 075-15-2019-1205 from 04.06.2019, unique project identifier RFMEFI57417X0149).

**Acknowledgments:** We thank I. A. Pozdniakov (OOO Microklon, Pushchino, Russia) for providing us with plant materials used in this study.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

*Article*

## **The Molecular Determination of Hybridity and Homozygosity Estimates in Breeding Populations of Lettuce (***Lactuca sativa* **L.)**

#### **Alice Patella, Fabio Palumbo \*, Giulio Galla and Gianni Barcaccia**

Department of Agronomy, Food, Natural Resources, Animals and Environment, University of Padova, 35020 Legnaro PD, Italy; alice.patella@phd.unipd.it (A.P.); giulio.galla@unipd.it (G.G.); gianni.barcaccia@unipd.it (G.B.)

**\*** Correspondence: fabio.palumbo@unipd.it

Received: 23 September 2019; Accepted: 7 November 2019; Published: 9 November 2019

**Abstract:** The development of new varieties of horticultural crops benefits from the integration of conventional and molecular marker-assisted breeding schemes in order to combine phenotyping and genotyping information. In this study, a selected panel of 16 microsatellite markers were used in different steps of a breeding programme of lettuce (*Lactuca sativa* L., 2 *n* = 18). Molecular markers were first used to genotype 71 putative parental lines and to plan 89 controlled crosses designed to maximise recombination potentials. The resulting 871 progeny plants were then molecularly screened, and their marker allele profiles were compared with the profiles expected based on the parental lines. The average cross-pollination success rate was 68 ± 33%, so 602 F1 hybrids were completely identified. Unexpected genotypes were detected in 5% of cases, consistent with this species' spontaneous out-pollination rate. Finally, in a later step of the breeding programme, 47 different F3 progenies, selected by phenotyping for a number of morphological descriptors, were characterised in terms of their observed homozygosity and within-population genetic uniformity and stability. Ten of these populations had a median homozygosity above 90% and a median genetic similarity above 95% and are, therefore, particularly suitable for pre-commercial trials. In conclusion, this study shows the synergistic effects and advantages of conventional and molecular methods of selection applied in different steps of a breeding programme aimed at developing new varieties of lettuce.

**Keywords:** pure lines; F1 hybrids; microsatellite markers; marker-assisted breeding; crop improvement; varieties

### **1. Introduction**

Lettuce (*Lactuca sativa* L.) is a self-pollinating leafy vegetable species (2 *n* = 2 *x* = 18) of the Asteraceae family. It is cultivated on a large scale throughout the world for consumption as a fresh vegetable on its own or in combination with other ready-to-eat vegetable products [1]. Its growing economic importance has led seed companies to regularly develop new varieties with ever higher agronomic traits. However, breeding programmes are highly limited by the reproductive system of this species. The flower structure of lettuce determines a reproductive strategy known as cleistogamy, in which anther dehiscence and subsequent pollination take place before flower opening, resulting in a very high rate of self-pollination, very often equal to or close to 100% [2]. According to recent estimates, out-cross rates are limited to 1–6% [3]. These reproductive barriers mean that in natural conditions the species spontaneously constitutes pure lines, characterised by phenotypic uniformity and genotypic stability, due to their very high homozygosity. In conventional breeding programmes, developing segregating and recombinant F2 populations traditionally requires crosses to be hand pollinated while self-pollination is prevented by emasculating the flowers. The most popular emasculation and

hand-pollination technique is that described by Olivier [4]. Known as the "wash method", it involves hand-spraying the inflorescence with water during pistil emergence to remove the pollen attached to the female part of the flower. The inflorescence is then left to dry for a short period, after which it is rubbed with a ripe flower of the pollinating variety [5]. A slightly different, but also widely used, technique is the "clip-and-wash method", which involves clipping the tip of the corolla before spraying with water. This guarantees more efficient pollen removal and cross rates close to 100% from the subsequent manual pollination [2]. However, these breeding techniques are time-consuming and technically highly demanding, and are only really effective if coupled with molecular analyses aimed at screening progeny plants and assessing their hybridity.

In recent years, many seed firms have begun using molecular markers to carry out assisted selection schemes and to speed up varietal development programmes [3]. Simple Sequence Repeat (SSR) markers are, so far, the most commonly used markers for these purposes [6–8] as they are codominant, have high reproducibility and multi-allelism, and can be detected at any stage of plant development, without being influenced by the environment [9]. There are a considerable number of SSR markers for lettuce in the literature [10]. Truco, et al. [11] produced an integrated genome map from 7 different linkage maps, which included 130 SSR loci organised in 9 linkage groups. Rauscher and Simko [10] augmented this genomic map with 54 genomic SSR and 52 EST-SSR (Expressed Sequence Tag) loci. Finally, with the publication of the *L. sativa* genome draft [12], tens of thousands of new SSR regions have become available for testing and use.

Given the availability of markers in lettuce, Marker-Assisted Selection (MAS) has started to be adopted in plant breeding programmes for various purposes, including identification of resistance genes [13,14] or Quantitative Trait Loci (QTLs) of phytopathogens [15,16], the study of QTLs controlling complex traits [17,18], and investigation of the genetic identity and purity of either experimental or commercial lines [19]. On the other hand, very few attempts have been made to prove the efficiency of molecular markers in Marker-Assisted Breeding (MAB) activities, where the genotypic background is molecularly investigated to complement traditional phenotypic selection [20].

In this work, SSR markers were used in three different steps of a conventional breeding scheme aimed at developing new varieties characterised by distinctiveness, uniformity, and stability (Figure 1).

**Figure 1.** Simplified overview of a lettuce breeding scheme in which selection is based on both plant phenotyping and genotyping.

We first examined the genetic background of a number of superior pure lines in order to plan experimental matings to produce F1 hybrids and then derived F2 progenies manifesting morphological variability as a result of genetic segregation and recombination (Figure 1). Each offspring in the F1 generation was analysed to distinguish the individuals resulting from planned out-crosses from those resulting from accidental selfing (Figure 1). After genotyping, the S1 individuals were discarded, and the F1 individuals were self-pollinated. In the F3 generations (Figure 1); experimental populations, previously selected according to their morphological traits, were also characterised by molecular markers due to the need to assess their stability and uniformity in order to run pre-commercial trials.

#### **2. Materials and Methods**

#### *2.1. Plant Materials and Breeding Techniques*

Plant materials were developed and provided by Blumen Group SpA, Italy and belonged to five different lettuce cultivar types (Table S1). Seventy-one parental lines (germplasm composed of experimental, pre-commercial and commercial lines) were involved in 89 combinations of crosses, in which each progeny consisted of 6–12 individuals (871 progeny samples). Parental lines were grown in the spring of 2015, and the 89 programmed crosses were carried out in the summer using the clip-and-wash method [2]. This involved making an incision in the calyx and corolla and washing the anthers in the early morning before the pollen grains could settle naturally on the outermost stigmatic surface of pistils. The plants were then manually pollinated by rubbing anthers of the pollen donor on the stigma of the seed parent. For each planned cross, a bulk of 4/5 flowers from a pollinator parent was used to pollinate as many flowers of a seed plant. Seeds were collected from the seed plant and sown in early autumn for genotyping selection and agronomic evaluation (spring 2016).

Finally, to assess the uniformity of the 47 experimental lines, previously chosen for morpho-phenological traits and pathogen resistances, 940 samples belonging to the 47 F3 populations (labelled 1 to 47) were collected in the spring of 2018. Each experimental line comprised 20 individuals.

#### *2.2. DNA Isolation*

A total of 100 mg of fresh leaves was collected from each of the 1882 lettuce samples (71 parents, 871 progeny and 940 F3) and ground to a fine powder using Tissue Lyser II (Qiagen, Valencia, CA, USA). Genomic DNA (gDNA) was extracted with the Dneasy® 96 Plant Kit (Qiagen), according to the manufacturer's protocols. After extraction, the integrity of the gDNA was assessed by electrophoresis on 1% (*w*/*v*) agarose gel stained with SYBR Safe® <sup>1</sup> <sup>×</sup> DNA Gel Stain (Life Technologies, Carlsbad, CA, USA) in Tris-Acetate-EDTA (TAE) running buffer. Both the yield and purity of the extracted gDNA samples were evaluated using a NanoDrop 2000c UV-Vis Spectrophotometer (Thermo Scientific, Carlsbad, CA, USA). Following DNA quantification, all DNA samples were diluted to a final concentration of 20 ng/μL.

#### *2.3. Primer Design and Testing of SSR Marker Amplification*

Sixteen SSR marker loci were selected from those available in the scientific literature [10,21], according to (i) chromosomal location; (ii) polymorphism rate, expressed as PIC (Polymorphism Information Content); (iii) allele size range; (iv) annealing temperature of the locus-specific primers. Amplifications were performed according to the method previously described by Schuelke [22], with some minor modifications. Briefly, three primers were used to amplify each microsatellite locus: a pair of locus-specific primers, one with an oligonucleotide tail at the 5 end (M13, PAN-1, PAN-2 or PAN-3, Table S2), and a third universal primer complementary to the tail and labelled with a fluorescent dye (6-FAM, VIC, NED, or PET). Primer pair efficiency was tested in silico using the PRaTo [23] web-tool and were organised in three multiplex reactions, as shown in Table 1.


**Table 1.** Microsatellite loci information. For each primer pair, the original simple sequence repeat (SSR) name, ID used in this work, linkage group [10,21], SSR motif, primer sequences (PAN1, PAN2, PAN3, or M13 tails at the 5' end are indicated in square brackets; for further details see Table S2), dye and the multiplex to which the SSR marker locus belongs is shown.

The 16 primer pairs were first tested individually (singleplex reactions) using three randomly chosen lettuce gDNA to evaluate primer efficiency and to check the correspondence between expected and actual size of the bands; they were then evaluated in multiplex PCRs to assess possible primer interactions.

All amplification reactions (both singleplex and multiplex) were performed in a 10 μL reaction volume containing 1<sup>×</sup> Platinum® Multiplex PCR Master Mix (Thermo Scientific), 10% GC Enhancer (Thermo Scientific), 0.25 μM of non-tailed primer, 0.75 μM of tailed primer, 0.50 μM of fluorophore-labelled primer (universal primer) and 20 ng of genomic DNA. Thermal cycling conditions were as follows for multiplex 1 and 2:94 ◦C for 5 minutes followed by 6 cycles of 94 ◦C for 30 seconds, 61 ◦C for 30 seconds, 72 ◦C for 45 seconds, with a 1 ◦C annealing temperature stepdown per cycle (from 61 ◦C to 56 ◦C). The annealing temperature for the following 35 cycles was set at 56 ◦C, with denaturation and extension phases as above and a final extension at 60 ◦C for 30 minutes. The multiplex 3 thermal cycling conditions were instead: 94 ◦C for 5 minutes followed by 6 cycles of 94 ◦C for 30 seconds, 56 ◦C for 30 seconds, 72 ◦C for 45 seconds, with a 1 ◦C annealing temperature stepdown per cycle (from 56 ◦C to 51 ◦C). The annealing temperature for the following 35 cycles was set at 51 ◦C with denaturation and extension phases as above and a final extension at 60 ◦C for 30 minutes. All amplifications were performed in a GeneAmp® PCR 9700 thermal cycler (Applied Biosystems, Carlsbad, CA, USA). PCR products were first checked on gel electrophoresis (2% Ultrapure™ Agarose in TAE 1×, SYBR Safe® 1×, Life Technologies) then run on capillary electrophoresis with ABI 3730 DNA Analyzer (Applied Biosystem), using LIZ500 as the molecular weight standard. The size of each peak was determined with the Peak Scanner 1.0 software (Applied Biosystems).

#### *2.4. Genotyping and Data Analysis*

The 71 potential parents were genotyped at 16 SSR loci and statistical analyses were performed using NTSYS (Numerical Taxonomy and Multivariate Analysis System) version 2.2 (Exeter Software) [24]. Rohlf's (or the simple matching) coefficient was used to calculate pairwise genetic similarity (GS) in all possible comparisons and to construct a genetic similarity matrix, according to the formula:

$$\rm CS\_{ij} = m/(m+n) \tag{1}$$

where "i" and "j" are two different individuals, while "m" and "n" represent the number of matching and non-matching attributes, respectively. An unweighted pair group method with an arithmetic mean (UPGMA) dendrogram and a Principal Coordinates Analysis (PCoA) of parental lines were carried out using the Jaccard coefficient in the PAST software v. 3.14 with 10,000 bootstrap repetitions [25]. The genetic structure of the lines was modelled using a Bayesian clustering algorithm implemented in STRUCTURE v. 2.2 [26]. Since no *a priori* knowledge of the origin of the populations under study was available, the admixture model and then the correlated allele frequencies model were used. Ten replicate simulations were conducted for each value of K, with the number of founding groups ranging from 1 to 8, using a burn-in of 200,000 and 1,000,000 iterations. The most likely K Estimates were determined using the method described by Evanno et al. [26]. Estimates of membership were plotted as a histogram in an Excel spreadsheet. Finally, observed homozygosity (Ho) was determined with the POPGENE software [27].

The 89 subsequent crosses were planned according to the following criteria: (i) high genetic dissimilarity values among parents within the same lettuce cultivar type and between them, (ii) availability of informative loci able to distinguish between the resulting offspring and individuals resulting from accidental self-pollitated. Only homozygous loci for different alleles were considered informative, whereas heterozygous loci were taken into account only if the origin of the parental alleles could be clearly discerned in the progenies. The resulting offspring (871 samples) were then screened, with the analysis restricted to those SSR loci which had previously proven to be informative for hybrid detection. This made it possible to determine whether individuals belonging to a given F1 population originated from cross-pollination or self-pollination. The successful crosses (S.C.) rate of 89 was calculated as follows:

$$\text{S.C.} = (\text{No } \text{F1} \times 100) \text{(No } \text{Tot} - \text{No } \text{U.G.)} \tag{2}$$

where "No F1" is the number of hybrid individuals, "No Tot" is the number of all individuals in a progeny population (No tot = No F1 + No U.G. + No SP) and "No U.G." is the number of unexpected genotypes deriving from unplanned crosses.

Finally, 940 samples from the 47 F3 populations were genotyped using the previously-described panel of SSR markers. The POPGENE software [27] was used to compute the mean values of observed homozygosity for each population (3), where n is the total number of samples). In addition, the median of genetic similarity between the 47 lines was calculated using Rohlf's coefficient, which was designed for codominant molecular markers [28,29]. Comparison of genetic similarity among ten selected populations was instead calculated using the Jaccard coefficient, in accordance with the literature [30]. Genetic similarity matrices were generated in the NTSYS software [24].

$$\overline{\text{Ho}} = \sum\_{\mathbf{n}} \text{Ho/n} \tag{3}$$

#### **3. Results**

#### *3.1. Parental Lines*

The 16 SSR markers, organised in three multi-locus PCRs, were used firstly to amplify and score the 71 parental lines. Fourteen of the 16 SSR markers proved to be polymorphic among plant accessions. The similarity matrix constructed using Rohlf's coefficient revealed genetic similarity values ranging from 53% to 100% (Figure S1). The resulting unweighted pair group method with an arithmetic mean (UPGMA) dendrogram showed the samples clustering into two main sub-groups. Eighteen parental lines were not fully distinguishable, while the remaining 53 had unique genotypic profiles. The first principal coordinate from the PCoA accounted for 22% of the total variation and divided the samples into two groups, analogous to the clustering in the tree. The second principal coordinate accounted for 12% of the total variation. These results were confirmed by investigation of the genetic structure of the 71 parental lettuce lines based on allele frequencies; the best estimate of population size was *K* = 2 (Figure S2), such that the samples were grouped into two genetically distinct clusters (Figure 2). The lettuce cultivar types were reported in Table S1, but they did not correspond to different clusters in the UPGMA tree.

**Figure 2.** (**a**) Genetic similarity-based unweighted pair group method with an arithmetic mean (UPGMA) dendrogram of 71 parental lines calculated using the Jaccard coefficient. Bootstrap estimates ≥30% are reported next to the nodes (red and blue branches indicate the two clusters identified). (**b**) Principal coordinate analysis (PCoA). The 71 samples are shown in red or blue according to the clustering shown in the UPGMA tree. (**c**) The population genetic structure of the 71 lines as estimated by STRUCTURE. Each sample is represented by a vertical histogram partitioned into *K* = 2 coloured segments (red or blue, in accordance with (**a**,**b**)) representing the estimated membership. The proportion of subgroup membership (%) is reported on the ordinate axis, and the identification number of each accession is reported below each histogram.

The mean observed homozygosity was 82%, with a minimum value of 69% and a maximum of 100%. It is worth noting that 19 of the 71 parental lines (27%) had observed homozygosity values greater than 90%, while 30 of the 71 (42%) had a medium-high observed homozygosity (Ho) between 81% and 90%. Fourteen of the 71 parental lines (20%) had observed homozygosity ranging from 71% to 80%, and only 8 individuals had values lower than 70% (Figure 3a and Figure S1).

**Figure 3.** (**a**) Observed homozygosity of 71 lettuce parental individuals belonging to as many pure lines. (**b**) Histogram of discriminating loci in 89 cross combinations (in percentages). (**c**) Histogram of the percentages of pollination success in 89 programmed lettuce crosses.

#### *3.2. Determination of Hybridity*

Using a combination of genotypic and phenotypic data, 89 cross combinations were planned (Table S3). Before proceeding, we also checked the availability of informative loci able to distinguish between offspring resulting from out-cross and those obtained by accidental self-pollination. Screening identified 1 discriminant locus in 16% of cases, 2 informative loci in 36% of cases, and 3 to 7 informative molecular markers in 48% of the crosses (Figure 3b). The three most informative loci were Lsat3, Lsat7, and Lsat6, while Lsat4 and Lsat13 were monomorphic in almost all parental groups. It is worth noting that the Lsat8 marker was in a heterozygous state in all but four parental genotypes (7, 45, 58, and 60).

We were able to take advantage of these informative loci to calculate the success rate of each cross. In 30% of manual pollinations (27 out of 89), a success rate of 100% was achieved (i.e., all the offspring were hybrids), whereas in 18% of crosses (16 out of 89), the S.C. ranged from 71% to 90%. A hybridity rate fluctuating between 51% and 70% was reported in 15% of cases (13 out of 89), while 26% of the crosses produced fewer than 50% hybrids each. Finally, in only 7% of crosses (6 out of 89) were all the offspring the result of self-pollination (Figure 3c and Table S3). Overall, the mean hybridization rate (the average number of hybrids per crosses) was 68 ± 33%, and out of a total of 871 individuals, 602 (69%) were hybrids, and 556 were derived from programmed crosses. The remaining 46 individuals (5% of the total) had a unexpected genotypes (U.G.) compared with their putative parents (Table S3).

#### *3.3. Lettuce Breeding Populations*

The 47 F3 experimental lines were genotyped using the same set of 16 SSR loci as for the previous analyses. The homozygosity estimates of all samples (940) ranged from 67% to 93% (Figure 4a). Ten experimental populations had a median observed homozygosity ≥90%. Outliers—with homozygosity values consistently deviating from the median—were present in only three experimental populations (11, 14, and 32).

The median genetic similarity observed within each line was always greater than 90% (Figure 4b), and 37 experimental populations had a median genetic similarity ≥95%. Outliers were present in 21 of the 47 lines (Figure 4b).

After assembling the data, we found 10 breeding populations, belonging to butterhead type (Table S4), to have Ho values ≥ 90%, and a median genetic similarity ≥ 95%; the box-plots of these populations are labelled in red in Figure 4a,b. Finally, in the genetic similarity matrix calculated from all pairwise comparisons of these ten populations, the Jaccard's index ranged from 44% ± 3% (between populations 3 and 18) to 96 ± 5% (between populations 45 and 47, Figure S3). Moreover, the populations called 45 and 47 were constituted starting from the same parents (2 × 6, Table S4).

**Figure 4.** Statistics relating to the observed homozygosity and genetic similarity among lines. (**a**) Box-plot of the median observed homozygosity (in percentages) in each of the 47 populations. The red dotted line represents the homozygosity threshold set at 90%. (**b**) Box plot of the median genetic similarity in each experimental population (in percentages). The red dotted line represents the genetic similarity threshold set at 95%. The red box-plots represent the ten best experimental populations (observed homozygosity ≥90% and genetic similarity values ≥95%). The second and third quartiles are marked inside the square and are divided by a bold bar (median). Dots show outlier samples.

#### **4. Discussion**

The last decade has seen major advances in the acquisition of knowledge concerning the genetics of lettuce and, in particular, the development of molecular markers [1,11,21]. This has facilitated marker-assisted selection programmes, especially those aimed at countering the onset of new diseases. For example, several studies have dealt with identifying the QTLs associated with biotic and abiotic stress resistance [17,31,32]. Molecular markers have also been extensively used to assess genetic variation and relationships in lettuce germplasm [19] and to identify possible duplicate varieties [33]. However, although the benefits derived from exploitation of these molecular tools have also been discussed in marker-assisted breeding programmes [34] and demonstrated in several species [7,35], there are only a few studies on this type in lettuce [20]. The aim of our work, therefore, was to integrate conventional and biotechnological methods in three different steps of a breeding programme to show that this strategy is also effective in *L. sativa* (Figure 1). This is of pivotal importance if we consider the economic impact of lettuce (the world production of lettuce and chicory in 2017 was 26.8 million tons [36]) and the need to regularly develop new varieties.

Commercial lettuce varieties are usually characterised by pure lines due to the autogamous nature of this species. In order to introduce variability, manual pollination is usually carried out to cross genetically stable parent lines with agronomic traits of interest. Progeny selection is a crucial step, but despite the efficiency of some emasculation and hand pollination methods developed over the years [2], a major problem—distinguishing unequivocally and rapidly F1 individuals from self-pollinated progeny—still remains. The use of molecular assays to quickly and accurately screen progeny makes it possible to overcome most of the conventional breeding limits in this species.

In this context, our SSR-based analysis has (i) facilitated selection of the best parents to cross in order to maximise the variability of the progeny both within the same cultivar type and among them, (ii) allowed accurate evaluation of the resulting offspring, and (iii) sped up the screening of experimental F3 lines for their stability and uniformity.

The first part of our work focused on pre-screening 71 parental lines for crossing with the aim of maximising the gains obtained from each out-pollination within cultivar type and, in some cases, among them. As expected, the similarity matrix and the unweighted pair group method with an arithmetic mean (UPGMA) dendrogram showed varying levels of similarity among the different parental genotypes. Parental germplasm appeared to divide into two different groups, as revealed by the Principal Coordinates Analysis (PCoA) results and particularly by the genetic structure analysis. However, samples did not separate in UPGMA tree and PCoA according to the cultivar type, but we may assume that increasing the number of markers it could be possible to clarify this clustering. Although 53 parental lines were found to be fully distinguishable, with similarity values ranging from 53% to 98% and characterised by a unique genotypic profile, it was impossible to identify unequivocally the remaining 18. This is not surprising if we consider that some of the parental lines were closely related. We may speculate that increasing the number of SSRs would allow us to address these remaining issues. Given the aim of this study, these data were useful to avoid crosses between parents with 100% similarity. To introduce variability according to the phenotype and the lowest similarity values, we carried out 89 crossing combinations. Another aspect that needs to be considered when planning crosses is the stability of the parental line in terms of homozygosity. In our study, the median observed homozygosity of the parental lines was lower than expected (82%), especially in light of the strictly autogamous reproductive system of lettuce [37]. Overall, the fact that only one individual in four had homozygosity values greater than 90% showed that some of these lines were not entirely stable. However, it must be borne in mind that, although the observed homozygosity was not optimal, some of these lines, experimental lines, were chosen to produce F1 partly because they displayed resistance to multiple pathogens and had interesting phenotypic traits.

Before proceeding with hand pollination, in order to distinguish between F1 individuals resulting from cross-pollination and those resulting from self-pollination, we first examined the informative loci among the parental lines used in the crosses. Only homozygous loci for different alleles in parental lines were considered informative. Our analysis showed at least 2 informative loci in 84% of the programmed crosses. It is worth pointing out that restricting the analysis to the informative loci brought us considerable savings in terms of time and costs.

Overall, the molecular determination of hybridity was successful: F1 individuals represented at least 51% of the offspring in 67% of the manual crosses, and 100% of the offspring in 30% of the crosses (100% success rate), in agreement with the estimates originally reported by the developer of the pollination technique [2]. Unexpected genotypes (U.G.) were identified in 5% of the individual progeny. In these cases, the progeny genotypes appeared to diverge from what would be expected given the parents. This percentage is consistent with the spontaneous or undesired occurrences of cross-pollination (1–6%) reported in the literature for this species [3], mainly due to pollinator insects. However, we cannot exclude human error during manual pollination or seed collection.

Finally, at an advanced step of the breeding programme, we genetically assessed 47 different experimental F3 populations (940 samples), previously selected for their morpho-phenotypic traits and resistances of interest (Figure 1). Interestingly, the findings in terms of both homozygosity and intra-line similarity were very good. This would suggest that in strictly autogamous species, such as lettuce, three cycles of self-pollination may already be sufficient to reach desired outcomes in terms of genetic uniformity and homozygosity. This also confirms that the use of molecular markers could speed up the process by making it possible to select the best individuals on the basis of their genotype, thereby reducing the number of generations needed to develop new varieties. The ten experimental populations with the highest homozygosity estimates (≥90%) and the highest intra-genetic similarity values (≥95%) were considered suitable for pre-commercial trials (red box plot, Figure 4). However, a pairwise comparison of two of them (identified as 45 and 47) showed them to be genetically too similar (96% ± 5% genetic similarity, Figure S3), in agreement with phenotypic data and

their common origin (Table S4), to be registered and marketed as distinct varieties. According to the most recent guidelines concerning the protection of new plant varieties, the similarity threshold to define two lettuce varieties as distinct is set at 96% [30]. The next step will be to integrate molecular data and morphological observations in order to the select the best genotypes (positive selection) for evaluation as pre-commercial varieties. In particular, the eligible genotypes will be self-pollinated to multiply the seed so that their agronomic performance can be compared in different locations and periods of the year, and with the best commercial varieties already on the market.

For the remaining experimental populations (white box blot, Figure 4), an attempt could be made to increase their genetic uniformity through negative selection to remove the most genetically divergent individuals (i.e., outlier samples). Moreover, if necessary, the remaining genotypes can undergo a further selfing cycle aimed at reaching optimum values of homozygosity.

#### **5. Conclusions**

The results of this study demonstrate the advantages of mutual integration of traditional and biotechnological methods and show the added value that molecular markers can give to breeding programmes. We used microsatellite markers in three different steps of a conventional lettuce breeding program (see Figure 1) and demonstrated, firstly, the efficiency of SSR markers not only in selecting the best parental plants for crossing based on their observed homozygosity and dissimilarity values, but also in screening the resulting F1 progeny to distinguish between the offspring resulting from cross-pollination and those resulting from self-pollination. Furthermore, using the same SSR panel, we were able to act downstream of the breeding scheme to assess the uniformity of some pre-commercial cultivars. Our molecular assay could therefore also be used by seed firms to assess newly developed varieties for distinctiveness, uniformity and stability (DUS test), three major requirements for registering plant materials [6]. Finally, molecular characterisation of a new variety could also be used to register it in national or international varietal catalogues. In fact, the genotype or molecular profile of a registered variety can be crucial in solving cases of fraudulent practices, and in curbing plagiarism and unfair free-riding on the original plant breeder's time and investment [30].

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2073-4425/10/11/916/s1: Figure S1: Pairwise genetic similarity matrix of the 71 individuals analysed (in percentages) based on Rohlf's genetic similarity coefficient. High genetic similarity values are labelled in green, the low values in red, and intermediate values are coloured on a scale from green to red. The observed homozygosity values of the 71 putative parental lines are reported to the left of each ID name. Figure S2: Definition of the subgroup number of parental lines based on the SSR marker dataset. Mean ΔK is calculated as |L" (K)|/(SD(L(K)), following Evanno et al. [23]. The blue line represents the ΔK values; Figure S3: Pairwise genetic similarity matrix of ten selected populations (in percentages) based on the Jaccard coefficient. The high genetic similarity values are labelled in green, the low values in red, and intermediate values are coloured on a scale from green to red; Table S1: Lettuce parental lines information, including ID of accessions, cultivar type of materials and subpopulation classification based on STRUCTURE analysis (1 = blue and 2 = red).; Table S2: SSR primer tails and dyes. List of the primer tails used with their sequences and corresponding dyes; Table S3: Lettuce plant material information, including ID of accessions used in the crosses, total number of plants per programmed cross, number of informative marker loci, hybrid plants, selfed plants and unexpected genotypes, and the mean hybridisation values (in percentages) for all the programmed crosses; Table S4: Information about ten F3 selected populations, including ID of parental lines used in the crosses.

**Author Contributions:** Conceptualization, F.P. and G.B.; formal analysis, A.P. and G.G.; investigation, A.P. and G.G.; methodology, F.P. and G.B.; resources, A.P.; software, A.P. and G.G.; supervision, G.B.; Writing—Original Draft, A.P. and F.P.; Writing—Review and Editing, F.P. and G.B.

**Funding:** This project was financially supported by Blumen Group SpA (Bologna, Italy), within a Research Contract stipulated with DAFNAE, University of Padova (Italy): "Development of new varieties in horticultural species using molecular marker-assisted breeding methods" PI: Gianni Barcaccia.

**Acknowledgments:** This research was carried out in partial fulfillment of the Ph.D. Program of Alice Patella by taking advantage of a Doctoral Research Fellowship provided by Blumen Group SpA (Bologna, Italy).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
