**Molecular Marker Technology for Crop Improvement**

Printed Edition of the Special Issue Published in *Agronomy*, Edited by José Miguel Soriano

www.mdpi.com/journal/agronomy

## **Molecular Marker Technology for Crop Improvement**


Editor

**José Miguel Soriano**

MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade • Manchester • Tokyo • Cluj • Tianjin

*Editor* José Miguel Soriano, Institute for Food and Agricultural Research and Technology (IRTA), Spain

*Editorial Office* MDPI St. Alban-Anlage 66 4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal *Agronomy* (ISSN 2073-4395) (available at: https://www.mdpi.com/journal/agronomy/special_issues/molecular-marker).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. *Journal Name* **Year**, *Volume Number*, Page Range.

**ISBN 978-3-03943-863-1 (Hbk) ISBN 978-3-03943-864-8 (PDF)**

Cover image courtesy of Sustainable Field Crops Programme (IRTA).

© 2020 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license, which allows users to download, copy and build upon published articles, as long as the author and publisher are properly credited, which ensures maximum dissemination and a wider impact of our publications.

The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons license CC BY-NC-ND.

## **Contents**


Reprinted from: *Agronomy* **2020**, *10*, 144, doi:10.3390/agronomy10010144


## **About the Editor**

**José Miguel Soriano** is a researcher in the Sustainable Field Crops Programme at IRTA. He has a PhD in Molecular and Evolutionary Genetics. His research has been focused on the development and identification of molecular markers linked to traits of interest in crop plants, the construction of linkage maps, and the use of genomic technologies in plant breeding. His early research focused on fruit breeding to incorporate pathogen resistance assisted by molecular techniques, and in recent years his focus has shifted to cereal research for drought adaptation in Mediterranean environments.

### *Editorial* **Molecular Marker Technology for Crop Improvement**

#### **Jose Miguel Soriano**

Sustainable Field Crops Programme, IRTA (Institute for Food and Agricultural Research and Technology), 25198 Lleida, Spain; josemiguel.soriano@irta.cat

Received: 3 September 2020; Accepted: 23 September 2020; Published: 24 September 2020

**Abstract:** Since the 1980s, agriculture and plant breeding have changed with the development of molecular marker technology. In recent decades, different types of molecular markers have been used for different purposes: mapping, marker-assisted selection, characterization of genetic resources, etc. These have produced effective genotyping, but the results have been costly and time-consuming, due to the small number of markers that could be tested simultaneously. Recent advances in molecular marker technologies such as the development of high-throughput genotyping platforms, genotyping by sequencing, and the release of the genome sequences of major crop plants open new possibilities for advancing crop improvement. This Special Issue collects sixteen research studies, including the application of molecular markers in eleven crop species, from the generation of linkage maps and diversity studies to the application of marker-assisted selection and genomic prediction.

**Keywords:** crop breeding; genetic maps; QTL mapping; GWAS; marker assisted selection; genomic selection; DNA sequencing

#### **1. Introduction**

Classical breeding was the main approach used by breeders to increase crop productivity during the 20th century. It involves selecting cultivars with the desired characteristics for the target trait, usually morphological or visual characteristics. The best genotypes were selected and used as parents in a backcross scheme with a recurrent parent to dilute the irrelevant or undesired traits [1]. However, the long time needed to obtain a commercial cultivar, together with limitations related to traits that are highly dependent on the environment or have low heritability, makes complementary approaches necessary to assist the breeding process. The development of molecular biology gave rise to a new type of marker based on polymorphisms in the DNA sequence, the molecular marker, which broadens the possibilities for addressing new challenges in plant breeding. Molecular markers are widely distributed in the genome, they are not affected by the environment, and they can be identified in any tissue and at any developmental stage. Since their development, their use in agriculture has increased through the construction of genetic maps in crop species, the association between molecular markers and important agronomic traits, the dissection of quantitative traits, and the positional cloning of genes of interest. Besides the estimation of genetic distances and molecular cloning, molecular markers provide the most suitable tool for the evaluation of genetic diversity, allowing for the selection of the most suitable parental lines in breeding programs, the management of germplasm collections, and varietal identification [2].

Once the association between a marker and a trait is detected, it can be deployed into a breeding program through marker-assisted selection (MAS). The success of this technique relies on the identification of markers tightly linked with the genome region of a target trait. MAS improves the efficiency of the selection of Mendelian traits, facilitating the introgression of single genes with the desired alleles into elite cultivars and removing the undesirable genome of the donor parent in a backcrossing program, and allows for the identification and protection of commercial cultivars through fingerprinting [3]. Although MAS has been used effectively for Mendelian traits or those regulated by a low number of genes, many agronomic traits are quantitative in nature and are influenced by the environment [4]. In the last decade, the development of high-throughput genotyping platforms has allowed for the screening of whole genomes for the selection of desired traits. In addition, novel statistical approaches for the use of large amounts of genetic and phenotypic data have been developed. Altogether, this has allowed for the development of new selection strategies, such as genomic prediction or genomic selection (GS), which attempts to overcome the limitations of MAS [5].

Coupled with the rapid development of high-throughput genotyping technology, next-generation DNA sequencing technologies and their applications in genetic and physical mapping have made significant progress in accelerating plant breeding at a lower cost.

#### **2. Overview of the Special Issue**

This Special Issue of *Agronomy*, entitled "Molecular marker technology for crop improvement", publishes 16 articles providing insights into the different applications of molecular markers in plant breeding. Eleven crop species were analyzed with six different approaches.

Although the development of linkage maps is the first step for gene identification and many maps have been developed in several crop species, novel marker approaches involving high-throughput genotyping are evolving for the construction of highly saturated maps. A high-density single nucleotide polymorphism (SNP) linkage map in potato was developed using a recently developed strategy for the discovery of SNPs, the specific length amplified fragment sequencing (SLAF-seq) approach [6].

Biparental quantitative trait loci (QTL) mapping is a classical approach to identify the loci underlying multi-genic traits. The success of detecting a QTL depends on several factors: (1) marker density, (2) population size, and (3) trait heritability. As a classical approach in crop breeding, within the Special Issue it was applied for the identification of the loci controlling milling yield in rice [7], resistance to *Striga hermonthica* in maize [8], and leaf rust and stem rust resistance in wheat [9]. These studies identified new loci for important traits in breeding and will be the starting point for a deeper analysis of candidate gene identification.

Genome-wide association studies (GWASs) have become a valuable tool in recent years as a complementary approach to biparental mapping, providing broader allelic coverage and higher mapping resolution. In this issue, GWASs were performed for the analysis of seminal roots in landraces of wheat and durum wheat from the Mediterranean basin [10,11], agronomic and quality traits in elite durum wheat [12], and for flowering time in maize inbred lines [13]. The studies of Roselló et al. [10] and Rufo et al. [11] pointed out the usefulness of the old germplasm to be used as genetic resources for improving drought-related traits in the breeding programs to broaden the genetic variability. Merida-García et al. [12] combined a GWAS with a candidate gene approach to successfully identify gene clusters involved in important traits for wheat breeding. The study of Maldonado et al. [13] revealed that the use of a GWAS based on haplotype blocks was more efficient than the standard approach to identify major effect loci, and the network-assisted gene prioritization used identified four genes influencing flowering time in tropical maize.

Genetic diversity is crucial for crop improvement, as it allows breeders to identify appropriate parents to be included in breeding programs for broadening genetic variability. Within this Special Issue, the genetic diversity of wheat, avocado, and raspberry was assessed by high-resolution melting (HRM) of insertion site-based polymorphism (ISBP) markers, simple sequence repeats (SSRs) developed from single-molecule long-read sequences, and SSRs from flavonoid biosynthesis genes, respectively [14–16]. Merida-Garcia et al. [14] developed ISBP markers for the wheat genome as an alternative to SSRs and SNPs. The authors concluded that these HRM-ISBPs are a cost-effective and efficient marker approach for wheat breeding programs and are also useful for gene tagging. The studies of Ge et al. [15] and Lebedev et al. [16] demonstrated the power of SSR markers. Although they were developed three decades ago [17], they are still commonly used because of their codominant and multi-allelic nature and high reproducibility.

The abovementioned studies in this Special Issue have taken into account the use of molecular markers for the development of linkage maps, the analysis of genetic diversity, and the mapping of quantitative traits, by means of the classic biparental QTL mapping or GWASs. The rest of the studies represent direct applications of molecular markers after these previous steps of development, location in a linkage map, and genotype–phenotype association: MAS and GS.

Marker-assisted selection was applied in four studies for (1) the selection of parental germplasm in sugarcane breeding programs [18], (2) improving blast resistance and salt tolerance in rice [19], (3) the selection of the pollination constant non-astringent (PCNA) type in Spanish germplasms of persimmon [20], and (4) the selection of resistance to plum pox virus (PPV) in apricot by allele-specific PCR [21]. Wu et al. [18] identified two groups among the 150 most widely used sugarcane parental clones. Based on these results, the authors could identify the most appropriate cultivars to broaden the genetic base of breeding germplasm. Thanasilungura et al. [19] improved the rice cultivar RD6 for salt tolerance with the QTL "Saltol" and blast resistance with four different QTLs by marker-assisted backcrossing and phenotypic selection. The authors found that one of the introgression lines showed superior salt tolerance and blast resistance, maintaining higher quality and agronomic performance than RD6. In the study of the selection of the PCNA type of persimmon, Blasco et al. [20] identified in the Spanish germplasm of persimmon the previously developed markers DlSx-AF4, linked to the production of male flowers, and AST, linked to fruit astringency. The screening of these markers in different progenies of backcrosses demonstrated a very low rate of selection of both traits together and is thus a valuable tool in a breeding program. The last example of MAS in this Special Issue corresponds to the selection of PPV resistance in the apricot breeding program at IVIA (Valencia, Spain) [21]. In this study, the authors present a high-throughput method for a rapid test of PPV resistance, thus improving the efficiency of apricot breeding programs at a low cost. MAS is of special interest in fruit trees due to the long time needed to obtain a new generation.

Finally, GS was used to estimate the breeding values for grain composition in sorghum [22]. Although GS was developed initially for animal breeding, its use in plant breeding has been extended in recent years. GS emerged as a valuable tool for improving complex traits controlled by QTLs with small effects. Together with high-throughput phenotyping techniques, it has brought a revolution in breeding by enhancing the accuracy of selection. Sapkota et al. [22] report the use of GS for grain compositional traits. The authors found that the prediction accuracy for single trait prediction was moderate to high with respect to the phenotypic measurements obtained from near infra-red spectroscopy (NIRS) prediction.

#### **3. Concluding Remarks**

The Special Issue covers the use of different types of molecular markers, from SSR markers developed in 1989 to the newest high-throughput marker technology, as well as different approaches for genetic mapping and the use of molecular markers to assist crop breeding as a single marker with MAS or at genome level with GS.

To meet the needs of a growing world population, crop yields must be increased under the climate change scenario predicted for the coming decades and the threat of the emergence of new pathogens. However, the information obtained by genome sequencing and its availability at low cost, the continuous development of new molecular markers, the implementation of high-throughput phenotyping tools, and speed breeding techniques will make it possible to face these new challenges. Although not directly related to the molecular marker technology of this Special Issue, the new systems based on gene editing will play an important role in the future of agriculture.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The author declares no conflict of interest.

#### **References**


© 2020 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Genetic Dissection of the Seminal Root System Architecture in Mediterranean Durum Wheat Landraces by Genome-Wide Association Study**

#### **Martina Roselló <sup>1</sup>, Conxita Royo <sup>1</sup>, Miguel Sanchez-Garcia <sup>2</sup> and Jose Miguel Soriano <sup>1,</sup>\***


Received: 19 June 2019; Accepted: 8 July 2019; Published: 9 July 2019

**Abstract:** Roots are crucial for adaptation to drought stress. However, phenotyping root systems is a difficult and time-consuming task due to the nature of the traits being analyzed. Correlations between root system architecture (RSA) at the early stages of development and in adult plants have been reported. In this study, the seminal RSA was analyzed in a collection of 160 durum wheat landraces from 21 Mediterranean countries and 18 modern cultivars. The landraces showed large variability in RSA, and differences in root traits were found between previously identified genetic subpopulations. Landraces from the eastern Mediterranean region, which is the driest and warmest within the Mediterranean Basin, showed the largest seminal root size in terms of root length, surface, and volume and the widest root angle, whereas landraces from eastern Balkan countries showed the lowest values. Correlations were found between RSA and yield-related traits in a very dry environment. The identification of molecular markers linked to the traits of interest detected 233 marker-trait associations for 10 RSA traits and grouped them in 82 genome regions named marker-trait association quantitative trait loci (MTA-QTLs). Our results support the use of ancient local germplasm to widen the genetic background for root traits in breeding programs.

**Keywords:** durum wheat; landraces; marker-trait association; root system architecture

#### **1. Introduction**

Wheat is estimated to have been first cultivated around 10,000 years before present (BP) in the Fertile Crescent region. It spread to the west of the Mediterranean Basin and reached the Iberian Peninsula around 7000 years BP [1]. During this migration, both natural and human selection resulted in the development of local landraces considered to be very well adapted to the regions where they were grown and containing the largest genetic diversity within the species [2]. From the middle of the 20th century, as a consequence of the Green Revolution, the cultivation of local landraces was progressively abandoned and replaced by the improved, more productive, and genetically uniform semi-dwarf cultivars. However, scientists are convinced that local landraces may provide new alleles to improve commercially valuable traits [3]. Introgression of these alleles into modern cultivars can be very useful, especially in breeding for suboptimal environments.

Drought is the most important environmental factor limiting wheat productivity in many parts of the world. Therefore, improving yield under water-limited conditions is one of the major challenges for wheat production worldwide. Breeding for adaptation to drought is extremely challenging due to the complexity of the target environments and the stress-adaptive mechanisms adopted by plants to withstand and mitigate the negative effects of a water deficit [4]. These mechanisms allow the plant to escape (e.g., early flowering date), avoid (e.g., root system), and/or tolerate (e.g., osmolyte accumulation) the negative effects of drought, which plays a role in determining final crop performance [5]. The crop traits to be considered as selection targets under drought conditions must be genetically correlated with yield and should have a greater heritability than yield itself [6,7]. Among these traits, early vigour, leaf area duration, crop water status, radiation use efficiency, and root architecture have been identified to be associated with yield under rainfed conditions (reviewed by Reference [8]).

Root system architecture (RSA) is crucial for wheat adaptation to drought stress. Roots exhibit a high level of morphological plasticity in response to soil conditions, which allows plants to adapt better, particularly under drought conditions. However, evaluating root architecture in the field is very difficult, expensive, and time-consuming, especially when a large number of plants need to be phenotyped. Several studies have reported a correlation of RSA in the early stages of development with RSA in adult plants [9]. Manschadi et al. [10] reported that adult root geometry is strongly related to seminal root angle (SRA). Wasson et al. [11] described a relationship of root vigor between plants grown in the field and under controlled conditions. Several systems have been adopted to enable early screening of the RSA in wheat [12].

Identifying quantitative trait loci (QTLs) and using marker-assisted selection is an efficient way to increase selection efficiency and boost genetic gains in breeding programs. However, while numerous studies have reported QTLs for RSA in bi-parental crosses [13], very few of them were based on association mapping [12,14–18]. Association mapping is a complementary approach to bi-parental linkage analysis and provides broader allelic coverage with higher mapping resolution. Association mapping is based on linkage disequilibrium, defined as the non-random association of alleles at different loci, and is used to detect the relationship between phenotypic variation and genetic polymorphism.

The main objectives of the present study were a) to identify differences in RSA among genetic subpopulations of durum wheat Mediterranean landraces, b) to find correlations of RSA with yield-related traits in different rainfed Mediterranean environments, and c) to identify molecular markers linked to RSA in the old Mediterranean germplasm through a genome-wide association study.

#### **2. Materials and Methods**

#### *2.1. Plant Material*

The germplasm used in the current study consisted of a set of 160 durum wheat landraces from 21 Mediterranean countries and 18 modern cultivars from a previously structured collection [2,19]. The landraces were classified into four genetic subpopulations (SPs) that matched their geographical origin as follows: the eastern Mediterranean (19 genotypes), the eastern Balkans and Turkey (20 genotypes), the western Balkans and Egypt (31 genotypes), the western Mediterranean (71 genotypes), and 19 genotypes that remained as admixed (Supplementary Materials Table S1).

#### *2.2. Phenotyping*

Eight uniform seeds per genotype were cultured following the paper roll method [20,21] in two replicates of four seeds. The seeds were placed at the top of a filter paper (420 × 520 mm) with the embryo facing down and sprayed with a 0.4% sodium hypochlorite solution. Subsequently, the papers were folded in half to obtain a 210 × 520 mm rectangle with the seeds fixed at the top. The papers were misted with deionized water and rolled by hand. The rolls were placed in plastic pots with deionized water at the bottom that was regularly checked to ensure it did not evaporate. The experiment was conducted in a growth chamber at 25 °C in darkness. One week after sowing, the seeds were transferred to a black surface to take digital images that were processed by SmartRoot software [22] (Figure 1). Nine traits for the seminal root system architecture (RSA) were measured: total root number (TRN), primary root length (PRL, cm), total lateral root length (LRL, cm), primary root surface (PRS, cm<sup>2</sup>), total lateral root surface (LRS, cm<sup>2</sup>), primary root volume (PRV, cm<sup>3</sup>), total lateral root volume (LRV, cm<sup>3</sup>), primary root diameter (PRD, cm), and mean lateral root diameter (LRD, cm).

**Figure 1.** Experimental setup for root system architecture analysis. First, seeds were placed on humid filter paper (**1**) and rolled. Paper rolls were placed in plastic pots with deionized water at the bottom for root growth (**2**). One week after sowing, the seeds were transferred to a black surface for digital imaging (**3**), and the images were processed by SmartRoot software [22] (**4**). The seminal root angle was measured using the clear pots (**5**,**6**).

Additionally, the SRA (°) was measured at the facilities of the International Center for Agricultural Research in the Dry Areas (ICARDA) in Rabat (Morocco) using the clear pot method described by Richard et al. [23] (Figure 1). Using a randomized complete block design, eight seeds per genotype were grown in 4 L clear pots filled with peat. The seeds were placed with the embryo facing down and close to the pot wall to facilitate root growth along the transparent wall. The pots were then watered, placed inside 4 L black pots, and kept at 20 °C in darkness in a growth chamber. Five days after sowing, digital images were taken and processed with ImageJ software [24].

Data from field experiments conducted under rainfed conditions during two years of contrasting water input from sowing to physiological maturity (285 mm in 2008 and 104 mm in 2014) in Lleida, North-eastern Spain [25] were used to assess the relationships between RSA traits and yield-related traits.

The experiments were carried out in a non-replicated modified augmented design with three replicated checks (the cultivars 'Claudio', 'Simeto', and 'Vitron') and plots of 6 m<sup>2</sup> (8 rows, 5 m long with a 0.15 m spacing). Sowing density was adjusted to 250 viable seeds m<sup>−2</sup> and the plots were maintained free of weeds and diseases.

#### *2.3. Statistical Analysis*

Combined analyses of variance (ANOVA) were performed for the RSA traits of the structured accessions (141 landraces and 18 modern cultivars), considering the accessions and the replicate as random effects. The sum of squares of the cultivar effect was partitioned into differences between SPs and differences within them. The Kenward-Roger correction was used due to the unbalanced number of genotypes within the SPs. Since the experiment was divided into six sets with one check, least squares means were calculated using Simeto as a check and compared using the Tukey test [26] at *p* < 0.01.

Raw field data were fitted to a linear mixed model with the check cultivars as fixed effects and the row number, column number, and genotype as random effects [27]. Restricted maximum likelihood was used to estimate the variance components and to produce the best linear unbiased predictors (BLUPs) for yield and yield components. The relationships between RSA traits and yield-related traits were assessed through correlation analyses. All calculations were carried out using the SAS statistical package [28].

#### *2.4. Genotyping*

DNA isolation was performed from leaf samples following the method reported by Doyle and Doyle [29]. High throughput genotyping was performed at Diversity Arrays Technology Pty Ltd. (Canberra, Australia) (http://www.diversityarrays.com) with the genotyping by sequencing (GBS) DArTseq platform [30]. A total of 46,161 markers were used to genotype the association mapping panel, including 35,837 presence/absence variants (PAVs) and 10,324 single nucleotide polymorphisms (SNPs). Markers were ordered according to the consensus map of wheat v4 available at https://www.diversityarrays.com/.

#### *2.5. Linkage Disequilibrium*

Linkage disequilibrium (LD) among markers was calculated for the A and B genomes using markers with a map position on the wheat v4 consensus map and a minor allele frequency greater than 5%, using TASSEL 5.0 [31]. Pair-wise LD was measured using the squared allele frequency correlations (*r*<sup>2</sup>), and the values for genomes A and B were plotted against the genetic distance to determine how fast the LD decays. A LOESS curve was fitted to the plot using the JMP v12Pro statistical package (SAS Institute Inc, Cary, NC, USA).
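The squared allele frequency correlation used as the LD measure can be illustrated with a minimal Python sketch. It assumes biallelic markers coded 0/1 and treats each accession as a single haplotype, which is a simplification introduced here for illustration. The function name `ld_r2` is hypothetical, and the code does not reproduce the TASSEL implementation.

```python
def ld_r2(a, b):
    """Squared allele frequency correlation between two biallelic markers.

    a, b: lists of calls coded 0/1, one entry per accession; None marks a
    missing call. Accessions missing at either marker are skipped.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n = len(pairs)
    if n == 0:
        raise ValueError("no accessions with calls at both markers")
    p_a = sum(x for x, _ in pairs) / n        # frequency of allele 1 at marker A
    p_b = sum(y for _, y in pairs) / n        # frequency of allele 1 at marker B
    p_ab = sum(x * y for x, y in pairs) / n   # frequency of the 1-1 combination
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("at least one marker is monomorphic")
    return d * d / denom


# Toy calls invented for illustration (not data from the study).
print(ld_r2([0, 0, 1, 1], [0, 0, 1, 1]))  # markers in complete LD
print(ld_r2([0, 0, 1, 1], [0, 1, 0, 1]))  # independent markers
```

Values close to 1 indicate that the alleles at the two loci are almost always inherited together, whereas values close to 0 indicate no association.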

#### *2.6. Genome-Wide Association Study*

A genome-wide association study (GWAS) was performed with 160 landraces for the mean of measured traits with TASSEL 5.0 software [31]. A mixed linear model was fitted using the population structure determined by Soriano et al. [19] as the fixed effect and a kinship (K) matrix as the random effect (Q + K) at the optimum compression level. A false discovery rate threshold [32] was established at −log<sub>10</sub>*p* > 4.6 (*p* < 0.05), using 2135 markers according to the results of the LD decay, to consider a marker-trait association (MTA) significant. Moreover, a second, less restrictive threshold was established at −log<sub>10</sub>*p* > 3. To simplify the MTA information, those associations located within LD blocks were considered to belong to the same QTL and were named marker-trait association quantitative trait loci (MTA-QTLs). Graphical representation of the genetic position of MTA-QTLs was carried out using MapChart 2.3 [33].
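As a quick arithmetic check, the restrictive threshold of 4.6 is numerically what one obtains by dividing the 0.05 significance level by the 2135 markers mentioned above. The Python sketch below only illustrates that arithmetic; the helper names are hypothetical, and it is not claimed to be the exact procedure applied in the study.

```python
from math import log10


def neg_log10_threshold(alpha, n_markers):
    """-log10 of a significance level divided by the number of markers."""
    return -log10(alpha / n_markers)


def is_significant(p_value, threshold):
    """True when the -log10 of the p-value exceeds the threshold."""
    return -log10(p_value) > threshold


strict = neg_log10_threshold(0.05, 2135)
relaxed = 3.0  # second, less restrictive threshold stated in the text

print(round(strict, 2))  # 4.63

# Hypothetical p-value chosen for illustration only.
p = 1e-4
print(is_significant(p, strict), is_significant(p, relaxed))  # False True
```

An association with *p* = 10<sup>−4</sup> would therefore be retained only under the less restrictive threshold.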

#### *2.7. Gene Annotation*

Gene annotation for the target region of significant MTAs was performed using the gene models for high-confidence genes reported for the wheat genome sequence [34] available at https://wheat-urgi.versailles.inra.fr/Seq-Repository/.

#### **3. Results**

#### *3.1. Phenotypic Analyses*

The ANOVA showed that, for all traits, the phenotypic variability was mainly explained by the cultivar effect, since it accounted for 63.41% (PRD) to 90.57% (LRD) of the total sum of squares (Table 1). A summary of the genetic variation of the RSA traits is shown in Supplementary Materials Table S2. The partitioning of the sum of squares of the cultivar effect into differences between and within SPs revealed that the variability induced by the genotype was mainly explained by differences within SPs, ranging from 70.1% for TRN to 91.5% for PRV (Table 1). Differences between SPs were statistically significant for all traits, accounting for 8.5% (PRV) to 30.5% (TRN) of the sum of squares of the genotype effect (Table 1). Western Mediterranean landraces showed the highest number of seminal roots and the narrowest root angle, whereas the eastern Balkans and Turkey SP showed the widest angle (Table 2). The highest values for root size–related traits (length, surface and volume) in both primary and lateral roots were recorded in the eastern Mediterranean landraces. The western Balkans and Egypt subpopulation showed the largest root diameter (Table 2). The comparison of mean values of eastern Balkans and Turkish landraces revealed that the Turkish ones had higher values for all traits except TRN, LRL, and root diameter (Supplementary Materials Table S3). The modern cultivars showed intermediate values for all RSA traits (Table 2).

**Table 1.** Percentage of the sum of squares of the ANOVA model for the seminal root system architecture traits in a set of 159 Mediterranean durum wheat genotypes structured into five genetic subpopulations by Soriano et al. [19].


TRN, total root number. SRA, seminal root angle. PRL, primary root length. LRL, total lateral root length. PRS, primary root surface. LRS, total lateral root surface. PRV, primary root volume. LRV, total lateral root volume. PRD, primary root diameter. LRD, mean lateral root diameter. \* *p* < 0.05. \*\* *p* < 0.01. \*\*\* *p* < 0.001.

**Table 2.** Means comparison of seminal root system architecture traits measured in a set of 159 Mediterranean durum wheat genotypes structured into five genetic subpopulations [19]. Means within columns with different letters are significantly different at *p* < 0.01 following a Tukey test.


TRN, total root number. SRA, seminal root angle. PRL, primary root length. LRL, total lateral root length. PRS, primary root surface. LRS, total lateral root surface. PRV, primary root volume. LRV, total lateral root volume. PRD, primary root diameter. LRD, mean lateral root diameter. EM, Eastern Mediterranean. EB + T, Eastern Balkans and Turkey. WB + E, Western Balkans and Egypt. WM, Western Mediterranean.

Correlation coefficients between RSA traits and yield-related traits were calculated for two field experiments with contrasting water input (285 and 104 mm of rainfall from sowing to physiological maturity). For the rainiest environment, only the relationship between SRA and number of spikes per square meter (NSm<sup>2</sup>) was statistically significant (*p* = 0.043, *r*<sup>2</sup> = 0.16), whereas for the driest environment, 14 correlations involving all the yield-related traits and RSA traits except root diameter were statistically significant (Figure 2) (*r*<sup>2</sup> from 0.17 for NSm<sup>2</sup> with PRL and PRS to 0.30 for TKW with TRN). Most of the significant correlations were positive. Only the relationship between SRA and thousand kernel weight (TKW) was negative.

**Figure 2.** Correlations between seminal root system architecture traits and yield-related traits determined in field experiments receiving high (density ellipse in red, **A**) and low (density ellipse in green, **B**) water input from sowing to physiological maturity. Significant correlation coefficients (*p* < 0.05) are indicated with red and green points. TRN, total root number. SRA, seminal root angle. PRL, primary root length. LRL, total lateral root length. PRS, primary root surface. LRS, total lateral root surface. PRV, primary root volume. LRV, total lateral root volume. PRD, primary root diameter. LRD, mean lateral root diameter. GY, grain yield. NSm2, number of spikes per square meter. NGm2, number of grains per square meter. TKW, thousand kernel weight.

#### *3.2. Marker-Trait Associations*

A total of 46,161 DArTseq markers, including PAVs and SNPs, were used to genotype the set of 160 durum wheat landraces. To reduce the risk of false positives, markers and accessions were analyzed for the presence of duplicated patterns and missing values. Of 35,837 PAVs, 24,188 were placed on the wheat v4 consensus map. Of these, those with more than 30% of missing data and those with a minor allele frequency lower than 5% were removed from the analysis, leaving 19,443 PAVs. A total of 6957 SNPs were mapped, leaving a total of 4686 SNPs after marker filtering as before. Additionally, 413 markers were duplicated between PAVs and SNPs, so the corresponding PAVs were eliminated. A total of 23,716 markers remained for the subsequent analysis.

Linkage disequilibrium was estimated for locus pairs in genomes A and B using a sliding window of 50 cM. A total of 471,319 and 681,389 possible pair-wise loci were observed for genomes A and B, respectively. Of these locus pairs, 52% and 43% showed significant linkage disequilibrium at *p* < 0.01 and *p* < 0.001, respectively. Mean *r*<sup>2</sup> was 0.12 for genome A and 0.11 for genome B. These means were used as a threshold for estimating the intercept of the LOESS curve to determine the distance at which LD decays in each genome. Markers were in LD in a range from less than 1 cM in genome B to 1 cM in genome A (Supplementary Materials Table S4).
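
As a rough illustration of the quantities involved, the sketch below computes *r*<sup>2</sup> between two markers and estimates the distance at which LD falls below a threshold. Two simplifications are assumed and should be read as such: markers are coded 0/1 in homozygous lines, so that *r*<sup>2</sup> is the squared Pearson correlation, and the LOESS curve used in the study is replaced by plain binned means. All names and values are hypothetical.

```python
# Hypothetical sketch; binned means stand in for the LOESS fit used in the study.

def r_squared(a, b):
    """Squared Pearson correlation between two 0/1 marker score vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / n
    var_b = sum((y - mean_b) ** 2 for y in b) / n
    if var_a == 0 or var_b == 0:
        return 0.0  # a monomorphic marker carries no LD information
    return cov ** 2 / (var_a * var_b)


def decay_distance(pairs, threshold, bin_width=1.0):
    """Lower edge (cM) of the first distance bin whose mean r2 is below threshold."""
    bins = {}
    for distance, r2 in pairs:
        bins.setdefault(int(distance // bin_width), []).append(r2)
    for index in sorted(bins):
        if sum(bins[index]) / len(bins[index]) < threshold:
            return index * bin_width
    return None  # LD never dropped below the threshold in the observed range


# Toy (distance in cM, r2) pairs; 0.12 is the genome A mean r2 from the text.
toy_pairs = [(0.2, 0.6), (0.7, 0.4), (1.5, 0.10), (2.5, 0.05)]
distance_a = decay_distance(toy_pairs, threshold=0.12)
```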

Results of the GWAS are reported in Figure 3 and in Supplementary Materials Table S5. Using a restrictive threshold based on a false discovery rate at *p* < 0.05 (−log10*p* > 4.6) and the LD decay, only 12 MTAs corresponding to seven markers were significant. Using a common threshold of −log10*p* > 3, as previously reported by other authors [35–38], a total of 233 MTAs involving 176 markers were identified. MTAs were equally distributed in both genomes (50.2% in the A genome and 49.8% in the B genome). Chromosomes 2B and 7A harbored the highest number of MTAs (39 and 32 respectively), carrying 30% of the total number of MTAs, whereas chromosomes 4B and 7B harbored the lowest number of MTAs, 8 and 6 MTAs, respectively (Figure 3A). Root volume was the trait showing the highest number of MTAs (77), followed by root surface (46), root diameter (37), root length and number (26), and SRA (21) (Figure 3B). The mean percentage of phenotypic variance explained (PVE) per MTA was similar for all traits, ranging from 0.09 to 0.11 (Figure 3C). Most of the MTAs showed low PVE, in agreement with the quantitative nature of the analyzed traits. The percentage of MTAs with a PVE lower than 0.1 was 71%, whereas that of MTAs with a PVE lower than 0.15 was 98% (Figure 3D).
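
The text does not state which false discovery rate procedure was applied. Assuming the widely used Benjamini-Hochberg step-up procedure, a threshold of this kind can be sketched as follows; the function name and the toy *p*-values are hypothetical.

```python
# Hypothetical sketch, assuming the Benjamini-Hochberg procedure.
import math


def bh_threshold(p_values, alpha=0.05):
    """Largest p-value declared significant at FDR level alpha, or None."""
    m = len(p_values)
    cutoff = None
    for rank, p in enumerate(sorted(p_values), start=1):
        if p <= alpha * rank / m:
            cutoff = p  # keep the largest p-value that satisfies the criterion
    return cutoff


toy_p = [0.001, 0.008, 0.039, 0.041, 0.6]
cutoff = bh_threshold(toy_p)

# Convert the reported -log10(p) thresholds back to the p-value scale.
restrictive_p = 10 ** -4.6   # about 2.5e-5
common_p = 10 ** -3          # 0.001
```

On the *p*-value scale, the restrictive threshold of −log10*p* > 4.6 corresponds to roughly *p* < 2.5 × 10<sup>−5</sup>, compared with *p* < 0.001 for the common threshold.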

**Figure 3.** Summary of marker trait associations (MTA). (**A**) Number of MTAs per chromosome. (**B**) Number of MTAs per trait. (**C**) Mean PVE per trait. (**D**) PVE. TRN, total root number. SRA, seminal root angle. PRL, primary root length. LRL, total lateral root length. PRS, primary root surface. LRS, total lateral root surface. PRV, primary root volume. LRV, total lateral root volume. PRD, primary root diameter. LRD, mean lateral root diameter.

To simplify the MTA information, those MTAs located within a region of 1 cM, as reported by the LD decay, were considered part of the same QTL. Thus, the 233 associations were restricted to 82 MTA-QTLs (Figure 4 and Table 3). Of the 82 MTA-QTLs, 33 had only one MTA, whereas, for the remaining 49, the number of MTAs per MTA-QTL ranged from 2 in 19 MTA-QTLs to 15 in mtaq-7A.1. When several consecutive pairs of MTAs were each within 1 cM of each other, the whole block was considered as the same MTA-QTL. The genomic distribution of MTA-QTLs showed that chromosomes 1A, 4A, and 5B harbored 8 MTA-QTLs each, chromosomes 1B, 3A, 3B, and 5A harbored 7 each, 6A harbored 6, 7A harbored 5, chromosomes 2A, 2B, 4B, and 6B harbored 4 each, and chromosome 7B harbored 3. Of the 49 MTA-QTLs with more than one MTA, 10 were related to a single trait. Of these, mtaq-1A.5, mtaq-3A.1, mtaq-4A.4, and mtaq-4A.5 carried associations related to root volume, mtaq-2B.1 and mtaq-3B.7 to root diameter, mtaq-3B.1 and mtaq-7A.5 to root number, and mtaq-4A.3 and mtaq-6A.5 to root angle.
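
The grouping rule can be read as single-linkage chaining along each chromosome: an MTA joins the current block when it lies within 1 cM of the previous MTA, so a block may span more than 1 cM in total. The sketch below illustrates that reading; the function name and the toy positions are hypothetical, and the authors' exact procedure may differ.

```python
# Hypothetical sketch of chaining MTAs into MTA-QTLs on one chromosome.

def group_mtas(positions_cm, max_gap=1.0):
    """Chain sorted map positions: a gap <= max_gap extends the current block."""
    blocks = []
    for position in sorted(positions_cm):
        if blocks and position - blocks[-1][-1] <= max_gap:
            blocks[-1].append(position)
        else:
            blocks.append([position])
    return blocks


# Toy positions (cM) for a single chromosome.
toy_positions = [30.9, 10.0, 10.6, 11.5, 14.0, 30.2]
qtls = group_mtas(toy_positions)
# The first block spans 1.5 cM because consecutive MTAs are chained.
```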

**Figure 4.** MTA-QTL map. MTA-QTLs are indicated in bold on the left side of the chromosome and traits involved in each MTA-QTL are on the right side. The rule on the left indicates genetic distance in cM. TRN, total root number. SRA, seminal root angle. PRL, primary root length. LRL, total lateral root length. PRS, primary root surface. LRS, total lateral root surface. PRV, primary root volume. LRV, total lateral root volume. PRD, primary root diameter. LRD, mean lateral root diameter.

**Table 3.** MTA-QTLs.




TRN, total root number. SRA, seminal root angle. PRL, primary root length. LRL, total lateral root length. PRS, primary root surface. LRS, total lateral root surface. PRV, primary root volume. LRV, total lateral root volume. PRD, primary root diameter. LRD, mean lateral root diameter.

Among all significant MTAs, markers with different alleles between extreme genotypes for each trait (i.e., the upper and lower 10th percentile) were identified for all traits except PRL (Table 4, Figure 5). The frequency of the most common allele among genotypes from the upper 10th percentile ranged from 67% for LRD to 90% for PRV, whereas, for the lower 10th percentile, it ranged from 74% for TRN to 93% for LRD (Figure 5).
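
A minimal sketch of this comparison is given below: genotypes are ranked by trait value, the two 10% tails are extracted, and the frequency of the most common allele is computed in each tail. Function names, genotype labels, and allele codes are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical sketch of comparing allele frequencies between phenotypic tails.
from collections import Counter


def extreme_groups(phenotypes, fraction=0.10):
    """Return (lower, upper) lists of genotype names in each tail."""
    ranked = sorted(phenotypes, key=phenotypes.get)
    size = max(1, round(len(ranked) * fraction))
    return ranked[:size], ranked[-size:]


def most_common_allele_frequency(alleles):
    """Return (allele, frequency) for the most frequent allele in a group."""
    allele, count = Counter(alleles).most_common(1)[0]
    return allele, count / len(alleles)


# Toy data: 30 genotypes, trait value equal to the index, one marker.
trait = {f"g{i}": float(i) for i in range(1, 31)}
marker = {name: ("A" if value > 15 else "G") for name, value in trait.items()}
marker["g30"] = "G"  # one discordant genotype in the upper tail

lower, upper = extreme_groups(trait)
lower_freq = most_common_allele_frequency([marker[g] for g in lower])
upper_freq = most_common_allele_frequency([marker[g] for g in upper])
```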


**Table 4.** Selected significant markers from the GWAS with different allele composition for the upper (UP) and lower (LOW) 10th percentile of genotypes. Different letters on the UP and LOW 10th phenotype indicate that means are significantly different at *p* < 0.01 following a Tukey test.

TRN, total root number. SRA, seminal root angle. PRL, primary root length. LRL, total lateral root length. PRS, primary root surface. LRS, total lateral root surface. PRV, primary root volume. LRV, total lateral root volume. PRD, primary root diameter. LRD, mean lateral root diameter.

**Figure 5.** Marker allele frequency means from landraces within the upper and lower 10th percentile for the analyzed traits. All significant markers shown in Table 4 are included. TRN, total root number. SRA, seminal root angle. PRL, primary root length. LRL, total lateral root length. PRS, primary root surface. LRS, total lateral root surface. PRV, primary root volume. LRV, total lateral root volume. PRD, primary root diameter. LRD, mean lateral root diameter.

#### *3.3. Gene Annotation*

Of the 176 markers showing significant associations, 31 were identified in the reference sequence of the wheat genome [34] (Table 5). Eight of them were positioned within gene models, whereas, for the rest, the closest gene model to the corresponding marker was taken into consideration. The gene models described in Table 5 included molecules related to abiotic stress resistance, seed formation, carbohydrate remobilization, disease resistance proteins, and other genes involved in different cellular metabolic pathways.

**Table 5.** Gene models within MTA-QTL positions. Only MTAs with markers mapped against the genome sequence are included. Genome position of the gene model is indicated in Mb.


\* Markers located within gene models.

#### **4. Discussion**

Roots exhibit a high level of morphological plasticity in response to soil conditions, which allows plants to better adapt, particularly under drought conditions. Several authors have reported the role of RSA traits in response to drought stress [39,40]. Wasson et al. [11] suggested that a deep root system with the appropriate density along the soil profile would confer an advantage on wheat grown in rainfed agricultural systems. Therefore, identifying new alleles for improving root architecture under drought conditions and introgressing them into adapted phenotypes is a desirable approach for breeding purposes. The current study analyzed a collection of durum wheat landraces representative of the variability existing within the Mediterranean Basin in an attempt to broaden the genetic background present in commercial cultivars.

Evaluating root architecture in the field is a difficult, expensive, and time-consuming assignment, especially when a large number of plants need to be phenotyped. It has been reported that the root geometry of adult plants is strongly related to the seminal root angle (SRA), with deeply rooted wheat genotypes showing a narrower SRA [10]. Different systems have been adopted to enable early screening of the root system architecture in wheat, assuming that genotypes that differ in root architecture at an early developmental stage would also differ in the field at stages when nutrient and/or water capture become critical for grain yield [12].

#### *4.1. Phenotypic Variation*

The germplasm analyzed in the present study, including mostly durum wheat landraces from the Mediterranean Basin, showed wide variability in RSA traits. The variability found was higher than that observed in other studies using elite accessions [12,14] or even landraces, as reported by Ruiz et al. [41] analyzing a collection of Spanish durum wheat landraces. These results, and the intermediate values obtained for all traits in modern cultivars, support the use of ancient local germplasm for widening the genetic background in breeding programs.

Means comparison of phenotypic traits revealed large differences among SPs associated with their geographical origin. Eastern Mediterranean landraces, collected in the area closest to the origin of tetraploid wheat, showed the largest root size in terms of length, surface, and volume, and the widest root angle. The wheat-growing areas of this region, which comprises Syria, Jordan, Israel, and Egypt, are the warmest and driest within the Mediterranean Basin [42]. In addition, when RSA traits were analyzed separately for the two components of the eastern Balkans and Turkey subpopulation, large differences appeared between them, with Turkish landraces being much more similar to the eastern Mediterranean ones than to the eastern Balkan ones, since the latter showed the lowest values for root length, surface, and volume. Turkish landraces also showed a wide root angle, as did the eastern Mediterranean ones. The differences found in SRA between the eastern Balkans and Turkish landraces are supported by two lines of evidence. One is the contrasting environmental conditions of the wheat-growing areas of northern Balkan countries and Turkey, since the analysis of long-term climate data demonstrated less rainfall and higher temperatures and solar radiation in the latter [42]. The other is that the northern Balkan landraces likely originated in the steppes of southern Russia and the Volga region [2,43], which also suggests contrasting environmental conditions in the zones of origin of the eastern Balkan and Turkish landraces. The phenotypic analysis carried out in the current study revealed that landraces from regions where drought stress is prevalent have a larger root size and a wider root angle. This architecture should allow a larger proportion of the soil to be covered for more efficient water capture, and this hypothesis is supported by correlations between RSA and yield traits.
Although the correlations were low, likely due to the very early stage at which the root traits were measured, the number of significant correlations differed between the two environments with the highest and lowest water input reported by Roselló et al. [25]. Root size–related traits were positively correlated with the number of grains and spikes per unit area (primary roots) and with grain yield and grain weight (lateral roots) in the driest environment. SRA was negatively correlated with TKW, as reported previously by Canè et al. [12], who concluded that it was due to the influence of the root angle on the distribution of roots in the soil layers, which affects the water uptake from deeper layers. In our study, the genotypes with the narrowest angle corresponded to those from the western Mediterranean countries, which Royo et al. [42] and Soriano et al. [19] reported to have heavier grains.

#### *4.2. Marker-Trait Associations*

The current study attempts to dissect the genetic architecture controlling the seminal root system in a collection of landraces from the Mediterranean Basin by association analysis. A mixed linear model accounting for the genetic relatedness between cultivars (random effect) and their population structure (fixed effect) (K + Q model) was used in order to reduce the number of spurious associations.
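
The text does not write the model out. In the notation commonly used for the unified K + Q mixed model, which is assumed here rather than taken from the study, it can be expressed as:

```latex
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Q}\mathbf{v} + \mathbf{Z}\mathbf{u} + \mathbf{e},
\qquad
\mathbf{u} \sim N\!\left(\mathbf{0},\, \sigma_g^{2}\mathbf{K}\right),
\qquad
\mathbf{e} \sim N\!\left(\mathbf{0},\, \sigma_e^{2}\mathbf{I}\right)
```

Here $\mathbf{y}$ is the vector of phenotypes, $\boldsymbol{\beta}$ contains the fixed effect of the tested marker, $\mathbf{v}$ the fixed effects of population structure (Q matrix), $\mathbf{u}$ the random polygenic effects with covariance proportional to the relatedness matrix $\mathbf{K}$, and $\mathbf{e}$ the residuals. All symbols are illustrative conventions and are not defined in the original text.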

A total of 233 significant associations were identified for the 10 RSA traits, underscoring the complex genetic control of RSA. However, in order to simplify this information and to integrate closely linked MTAs in the same QTL, those MTAs located within LD blocks were considered as belonging to the same MTA-QTL. As a result, the number of genome regions involved in RSA was reduced to 82. The relationship between RSA and yield-related traits was also suggested by the presence of pleiotropic MTA-QTLs. The comparison of the genome regions identified in the current study with those related to yield and yield components by Roselló et al. [44] showed that 45% of the RSA MTA-QTLs were co-located with yield-related trait MTA-QTLs. These results are in agreement with the findings of Canè et al. [12], who found that 30% of the RSA-QTLs affected agronomic traits, which provided evidence of the implications of RSA in field performance of durum wheat at early growth stages.

In the last few years, GWAS for RSA have been limited in comparison with QTL mapping for root traits based on bi-parental populations (see Soriano and Álvaro [13] for a review). A comparison with previous studies reporting MTAs for RSA resulted in several common regions with the current study. Three common regions were found with the study of Canè et al. [12], but different traits were included for MTAs in those QTLs (mtaq-3A.3, mtaq-3A.5, mtaq-3A.6, and mtaq-6B.2). Two MTAs were in common with those reported by Ayalew et al. [15], who identified five significant associations with root length under stress (2) and non-stress (3) conditions. The MTA reported under stress conditions in chromosome 2B may correspond with mtaq-2B.2, which also shows an association with LRL. However, the association on chromosome 3B, although in a common region with mtaq-3B.4, differed in RSA. When MTA-QTLs were compared with QTLs from bi-parental populations, twelve genomic regions were located within the meta-QTL positions defined by Soriano and Álvaro [13] after the compilation of 754 QTLs from 30 studies.

Candidate genes at the MTA peak were sought using the high-confidence gene annotation from the wheat genome sequence [34]. Among these genes, those involved in plant growth and development as well as tolerance to abiotic stresses may be of special interest. On chromosome 1A, the marker 1210090\_SNP in mtaq-1A.7 is located close to a cellulose synthase gene. This type of gene is involved in plant cell growth and structure [45]. A trichome birefringence (TB) protein was identified in mtaq-1B.1. According to Zhu et al. [46], the TB-like27 protein mutants in *Arabidopsis* increased aluminium accumulation in cell walls, which inhibited root elongation through structural and functional damage. Three peaks corresponded with F-box domains located in mtaq-1B.7, mtaq-3A.1, and mtaq-6A.6. According to Hua et al. [47], this is the protein subunit of E3 ubiquitin ligases involved in the response to abiotic stresses. Li et al. [48] overexpressed the F-box *TaFBA1* in transgenic tobacco to improve heat tolerance, and one of the results was increased root length in the transgenic plants. 9-cis-epoxycarotenoid dioxygenase (NCED) is a key enzyme in the biosynthesis of ABA in higher plants, which regulates the response to various environmental stresses [49]. This enzyme is located within mtaq-2A.3. In mtaq-2A.4, the marker 1117775\_PAV corresponded with a late embryogenesis abundant (LEA) hydroxyproline-rich glycoprotein. These proteins play a role in the response to abiotic stresses. They are mainly accumulated in seeds, but have been found in roots during the whole developmental cycle [50]. The marker 1098568\_PAV, in mtaq-6B.4, is located within a gene coding a bZIP transcription factor family protein. This type of transcription factor is involved in abiotic stresses [51]. Zhang et al. [52] observed that the root growth of transgenic plants overexpressing the gene *TabZIP14-B* was hindered more severely than that of the control plants. 
Another gene involved in abiotic stress tolerance is the mitochondrial pyruvate carrier (MPC) located in mtaq-7A.1 [53]. This gene is involved in cadmium tolerance in *Arabidopsis*, which prevents its accumulation. Roots are the predominant plant tissue for cadmium absorption or exclusion. He et al. [53] found that the root length of mutant plants of *Arabidopsis* for MPC genes was substantially shorter than the wild-type plants. A protein related to WAT1 (WALLS ARE THIN1) involved in secondary cell wall thickness [54] is located in the peak of mtaq-7A.1.

#### **5. Conclusions**

Including local landraces in breeding programs is a useful approach to broadening the genetic variability of crops [3]. The variability for root system architecture traits found in Mediterranean landraces and the high number of genome regions controlling them, most of them not reported previously, make this germplasm a valuable source for root architecture improvement. The identification of extreme genotypes for root architecture traits can help identify parents for the development of new mapping populations to tackle a map-based cloning approach to the genes of interest. In the present study, we identified molecular markers with different allele composition between these extreme genotypes, which can facilitate the introgression of the corresponding traits through marker-assisted breeding.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2073-4395/9/7/364/s1, Table S1: Accessions included in the study, Table S2: Statistics of the seminal RSA traits, Table S3: Means comparison of seminal root system architecture traits for the eastern Balkans (EB) and Turkish durum wheat landraces. Means within columns with different letters are significantly different at *p* < 0.05 following a Tukey test, Table S4: Linkage disequilibrium decay plots. (A) Genome A. (B) Genome B. The LOESS curve is represented in blue. The horizontal red line corresponds to the *r*<sup>2</sup> mean for each genome, Table S5: Significant markers associated with seminal root system architecture traits obtained in 160 durum wheat Mediterranean landraces.

**Author Contributions:** Conceptualization, M.R. and J.M.S.; Methodology, M.R., M.S.-G., and J.M.S.; Formal Analysis, M.R.; Investigation, M.R., M.S.-G., J.M.S.; Resources, C.R.; Data Curation, M.R., C.R.; Writing—Original Draft Preparation, M.R., J.M.S.; Writing—Review & Editing, C.R., M.S.-G., and J.M.S.; Visualization, J.M.S.; Supervision, C.R., M.S.-G., and J.M.S.; Project Administration, C.R. and J.M.S.; Funding Acquisition, C.R. and J.M.S.

**Funding:** This research was funded by the Spanish Ministry of Science, Innovation and Universities (http://www.ciencia.gob.es/), grant numbers AGL-2012-37217 (C.R.) and AGL2015-65351-R (J.M.S. and C.R.).

**Acknowledgments:** Projects AGL-2012-37217 (C.R.) and AGL2015-65351-R (J.M.S. and C.R.) of the Spanish Ministry of Science, Innovation and Universities (http://www.ciencia.gob.es/) funded this study. The authors acknowledge the contribution of the CERCA Program (Generalitat de Catalunya). M.R. is a recipient of a PhD grant from the Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA). J.M.S. was hired by the INIA-CCAA program funded by INIA and the Generalitat de Catalunya.

**Conflicts of Interest:** The authors declare no conflicts of interest.

#### **Abbreviations**


#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **SSR Marker-Assisted Management of Parental Germplasm in Sugarcane (***Saccharum* **spp. hybrids) Breeding Programs**

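**Jiantao Wu <sup>1,\*,†</sup>, Qinnan Wang <sup>1,†</sup>, Jing Xie <sup>1</sup>, Yong-Bao Pan <sup>2</sup>, Feng Zhou <sup>1</sup>, Yuqiang Guo <sup>1</sup>, Hailong Chang <sup>1</sup>, Huanying Xu <sup>1</sup>, Wei Zhang <sup>1</sup>, Chuiming Zhang <sup>1</sup> and Yongsheng Qiu <sup>1,\*</sup>**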


Received: 11 July 2019; Accepted: 10 August 2019; Published: 14 August 2019

**Abstract:** Sugarcane (*Saccharum* spp. hybrids) is an important sugar and bioenergy crop with high aneuploidy, complex genomes and extreme heterozygosity. A good understanding of genetic diversity and population structure among sugarcane parental lines is a prerequisite for sugarcane improvement through breeding. In order to understand genetic characteristics of parental lines used in sugarcane breeding programs in China, 150 of the most popular accessions were analyzed with 21 fluorescence-labeled simple sequence repeats (SSR) markers and high-performance capillary electrophoresis (HPCE). A total of 226 SSR alleles of high-resolution capacity were identified. Among the series obtained from different origins, the YC-series, which contained eight unique alleles, had the highest genetic diversity. Based on the population structure analysis, the principal coordinate analysis (PCoA) and phylogenetic analysis, the 150 accessions were clustered into two distinct sub-populations (Pop1 and Pop2). Pop1 contained the majority of clones introduced to China (including 28/29 CP-series accessions) while accessions native to China clustered in Pop2. The analysis of molecular variance (AMOVA), fixation index (*Fst*) value and gene flow (*Nm*) value all indicated very low genetic differentiation between the two groups. This study illustrated that fluorescence-labeled SSR markers combined with high-performance capillary electrophoresis (HPCE) could be a very useful tool for genotyping of polyploid sugarcane. The results provided valuable information for sugarcane breeders to better manage the parental germplasm, choose the best parents to cross, and produce the best progeny to evaluate and select for new cultivar(s).

**Keywords:** sugarcane; parental line; population structure; plant breeding; genetic diversity; simple sequence repeats (SSR)

#### **1. Introduction**

Sugarcane cultivars are allopolyploids with highly heterozygous and complex genomes, which slows progress in breeding. To date, most commercial sugarcane varieties can be traced back to a limited number of popular cultivars belonging to either the POJ- or Co-series, which represent a very narrow genetic base [1]. Therefore, it is important for sugarcane breeders to fully understand the genetic relationship among parental lines and to choose elite parents of different genetic backgrounds for crossing in order to broaden the genetic diversity of the sugarcane population [2].

Hainan sugarcane breeding station (HSBS) is the primary sugarcane crossing facility in Mainland China. It produces nearly all the seeds for sugarcane breeders in China every year [3]. HSBS has more than 2000 germplasm materials. Currently, thousands of new elite sugarcane genotypes are created by breeders each year. The utilization of these ever-increasing germplasm materials is a daunting challenge. Parental selection is a crucial step for good quality cross-breeding. Therefore, breeding materials should be adequately evaluated by different analytical methods to ensure their genetic suitability.

In the past, sugarcane breeders studied the genetic differences among parents mainly in terms of genetic relationship, geographical origin and morphology. However, pedigree cannot reliably reflect the genetic differences among sugarcane parents because of mixed pollen, selfing and seed admixture [4]. Although morphological traits can be evaluated, these traits are easily influenced by the environment and may not reflect the real genetic diversity of sugarcane germplasm resources [5]. DNA molecular markers, which are stable, abundant and highly polymorphic, are more suitable for evaluating sugarcane germplasm collections [1]. With the rapid development of biotechnology, sugarcane researchers have utilized different types of DNA molecular markers, including amplified fragment length polymorphisms (AFLP) [1,5], restriction fragment length polymorphisms (RFLP) [6,7], random amplification of polymorphic DNAs (RAPD) [8,9], single nucleotide polymorphism (SNP) [10], simple sequence repeats (SSRs) [11], inter simple sequence repeats (ISSRs) [12,13], expressed sequence tag-simple sequence repeats (EST-SSRs) [14–16], 5S rRNA intergenic spacers [17], start codon targeted (SCoT) [18], target region amplification polymorphism (TRAP) [5,19,20], and cleaved amplified polymorphism sequences (CAPS) [21] for evaluating sugarcane germplasm.

Among PCR-based markers, SSR (microsatellite) markers are considered one of the most efficient markers for plant breeding due to their abundance, low dosage, co-dominance, reliability and multi-allelic detection [22]. SSR markers have been used widely to study sugarcane genetic diversity and population structure [22–24], variety identity [25], genetic maps [26,27], and genetic association [28–30]. Furthermore, fluorescence-labeled SSR markers combined with high-performance capillary electrophoresis (HPCE) have shown better performance in genotyping of polyploid sugarcane, due to higher accuracy and better detection power [22–24,31–37].

This paper reports a study that was designed to manage the parental germplasm of the sugarcane breeding programs in China through microsatellite (SSR) DNA fingerprinting using fluorescence-labeled SSR primers and the high-performance capillary electrophoresis (HPCE) system. The results will help sugarcane breeders better manage the parental germplasm, choose cross parents, design cross combinations, and produce high quality seedlings for the selection and development of elite varieties.

#### **2. Materials and Methods**

#### *2.1. Plant Materials*

One hundred and fifty parental clones were chosen for this study, based on the number of lines used most often in crossing from 2014 to 2018 in all Chinese sugarcane breeding programs (Table 1 and Table S1). These included 32 clones of foreign origin, 109 clones from the China Mainland, and nine ROC-series clones from China Taiwan. Among the 32 foreign clones, one was from India (Co-series), 29 were from the U.S. (CP-series) and two were from Thailand (K-series). Among the 109 clones from China Mainland, four were from the Dehong Sugarcane Research Institute, Yunnan Province (DZ-series); 11 were from the Fujian Agriculture and Forestry University, Fujian Province (FN-series); two were from the Jiangxi Sugarcane Research Institute, Jiangxi Province (GN-series); 21 were from the Guangxi Academy of Agricultural Sciences, Guangxi Province (GT-series); six were from the Liucheng Academy of Agricultural Sciences, Guangxi Province (LC-series); six were from the Neijiang Academy of Agricultural Sciences, Sichuan Province (NJ-series); 18 were from the Hainan Sugarcane Breeding Station of Guangzhou Sugarcane Industry Research Institute, Hainan Province (YC-series); 29 were from the Guangzhou Sugarcane Industry Research Institute, Guangdong Province (YT-series); 10 were from the Yunnan Academy of Agricultural Sciences, Yunnan Province (YZ-series) and two were from other breeding units in China Mainland (one from Sichuan Research Institute of Sugar Crops, Sichuan Province and one from the Guangdong Academy of Agricultural Sciences, Guangdong Province).


**Table 1.** The 150 sugarcane accessions used in the experiment.

#### *2.2. SSR Genotyping*

Young leaf tissues were collected from three individual clones, rinsed with 75% ethanol, and kept at −80 °C prior to DNA extraction. The genomic DNA was extracted from leaf tissues using the cetyl trimethyl ammonium bromide (CTAB) method [38] with minor modifications. The quality and concentration of DNA were measured using the UV-Vis Spectrophotometer Q5000 of Quawell (Quawell Technology, Inc. San Jose, CA, USA), and the DNA was diluted to 20 ng/μL. A set of 21 SSR primer pairs (Table 2) with stable and clear amplification was selected from previous reports [3,11,33,39–42]. All forward primers were labeled with a fluorescence dye, 6-carboxy-fluorescein (FAM) or Hexachlorofluorescein (HEX). PCR reactions were performed with the following cycling conditions: 95 °C for 2 min, followed by 40 cycles of 94 °C for 30 s, the primer-specific annealing temperature (Tm) for 90 s and 65 °C for 30 s, followed by one cycle at 65 °C for 10 min. The annealing temperatures for the 21 primer pairs were optimized separately, ranging from 49 °C to 62 °C (Table 2). The amplified PCR products were checked by 3% agarose gel electrophoresis. High-performance capillary electrophoresis (HPCE) was conducted on the ABI 3730XL DNA analyzer (Applied Biosystems, Inc. Foster City, CA, USA) to generate GeneScan files. The GeneScan files were analyzed using the GeneMarker V2.2 software (SoftGenetics, LLC. State College, PA, USA) to show SSR DNA fragments (alleles), and the sizes of these fragments were calibrated automatically against the GeneScan500 size standards. Due to the polyploid nature of sugarcane, the SSR alleles had to be manually called first and the score sheet was manually rechecked according to Pan [43]. The presence of an allele was scored as "1" and its absence scored as "0". SSR alleles were named using a combination of primer name and allele size.
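
The presence/absence scoring and allele naming described above can be sketched as follows. The input layout, the underscore used to join primer name and fragment size, and the clone names and fragment sizes are all assumptions made for illustration; the primer names are taken from the text but the calls are invented toy values.

```python
# Hypothetical sketch of building a 0/1 allele matrix from called fragment sizes.

def score_alleles(fragments):
    """fragments: {clone: {primer: [sizes in bp]}} -> (allele names, 0/1 matrix)."""
    names = sorted(
        {
            f"{primer}_{size}"
            for calls in fragments.values()
            for primer, sizes in calls.items()
            for size in sizes
        }
    )
    matrix = {}
    for clone, calls in fragments.items():
        present = {f"{p}_{s}" for p, sizes in calls.items() for s in sizes}
        # 1 = allele present in this clone, 0 = allele absent.
        matrix[clone] = [1 if name in present else 0 for name in names]
    return names, matrix


# Toy calls for two hypothetical clones and two primer pairs.
toy_calls = {
    "cloneA": {"SMC569CS": [168, 170]},
    "cloneB": {"SMC569CS": [170], "mSSCIR36": [142]},
}
allele_names, scores = score_alleles(toy_calls)
```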


**Table 2.** The 21 simple sequence repeat (SSR) markers used in this study.



<sup>a</sup> G-SSR: SSR primer pair designed from genomic sequence; E-SSR: SSR primer pair designed from UniGene or cDNA sequences.

#### *2.3. Genetic Diversity Analysis*

A qualitative allelic data matrix was constructed and formatted using the DataFormatter software [44]. The PowerMarker v3.25 software [45] was used to calculate allele frequency, number of alleles per locus, polymorphism information content (PIC), the gene diversity index (*h*), Shannon's information index (*I*), and percentage of polymorphic loci (*PPL*) of each marker. The resolving power of the primer (*Rp*) [46] was calculated using allele frequencies. The probability of identity (*PI*) [23] was computed using the CERVUS v3.0 software [47]. Unique (series-specific) alleles were estimated using GenAlEx v6.502 [48,49].
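
For orientation, the sketch below gives textbook forms of these indices when each SSR allele is treated as a binary presence/absence locus with band frequency *p*. Whether the software cited above uses exactly these expressions is an assumption, and the function names are hypothetical. The resolving power follows the commonly used definition *Rp* = Σ*Ib*, with *Ib* = 1 − 2|0.5 − *p*|.

```python
# Hypothetical sketch; textbook formulas for a two-state (presence/absence) locus.
import math


def gene_diversity(p):
    """h = 1 - sum of squared state frequencies; maximum 0.5 at p = 0.5."""
    return 1.0 - p ** 2 - (1.0 - p) ** 2


def shannon_index(p):
    """I = -sum(x * ln x) over the two states, skipping zero frequencies."""
    return -sum(x * math.log(x) for x in (p, 1.0 - p) if x > 0)


def pic(p):
    """Polymorphism information content for two states; maximum 0.375."""
    q = 1.0 - p
    return 1.0 - p ** 2 - q ** 2 - 2.0 * p ** 2 * q ** 2


def resolving_power(band_frequencies):
    """Rp = sum of band informativeness Ib = 1 - 2 * |0.5 - p| over all alleles."""
    return sum(1.0 - 2.0 * abs(0.5 - p) for p in band_frequencies)


# Toy band frequencies for the alleles amplified by one primer pair.
toy_frequencies = [0.5, 1.0, 0.0]
rp = resolving_power(toy_frequencies)
```

Under this two-state treatment, *h* cannot exceed 0.5, which is consistent with the upper bound of the range reported in the Results.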

#### *2.4. Population Structure Analysis*

The model-based program Structure v2.3.4 [50] was used to analyze the population structure involving the 226 alleles amplified by the 21 SSR primer pairs. The number of populations (*K*) was set from one to 10, and at each *K* value, ten runs were conducted separately with a burn-in of 50,000 iterations followed by 50,000 Markov Chain Monte Carlo (MCMC) iterations. Then, the best *K* value was estimated using Evanno's Δ*K* method [51] with an online tool, Structure Harvester [52]. An individual Q matrix was generated by CLUMPP v1.1.2 [53]. Parental clones with membership probabilities greater than 0.5 were assigned to the same group [54]. A Principal Coordinate Analysis (PCoA) map was generated based on the genetic distances between pairs of clones by GenAlEx v6.502 [48,49]. An unrooted phylogenetic tree was constructed based on the neighbor-joining (NJ) method and the genetic distance matrix using PowerMarker v3.25 [45] and adjusted with MEGA v6.06 [55].
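
The idea behind the Δ*K* statistic can be sketched in a few lines: for each interior *K*, the absolute second difference of the mean log-likelihood is divided by the standard deviation of the log-likelihood across runs at that *K*. The version below works on run means, which is an assumption about how the calculation is organized, and the log-likelihood values are invented toy numbers, not results from the study.

```python
# Hypothetical sketch of the delta-K calculation on toy log-likelihoods.
from statistics import mean, stdev


def delta_k(log_likelihoods):
    """log_likelihoods: {K: [L(K) for each run]} -> {K: delta K} for interior K."""
    ks = sorted(log_likelihoods)
    result = {}
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        second_difference = abs(
            mean(log_likelihoods[next_k])
            - 2 * mean(log_likelihoods[k])
            + mean(log_likelihoods[prev_k])
        )
        result[k] = second_difference / stdev(log_likelihoods[k])
    return result


# Toy values: three runs per K instead of the ten runs used in the study.
toy_runs = {
    1: [-1000, -1002, -998],
    2: [-800, -801, -799],
    3: [-790, -795, -785],
    4: [-785, -780, -790],
}
dk = delta_k(toy_runs)
best_k = max(dk, key=dk.get)
```

Note that Δ*K* is undefined for the smallest and largest *K* tested, so *K* = 1 can never be selected by this criterion.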

#### *2.5. Differentiation Analysis and Genetic Diversity Indices*

Analysis of Molecular Variance (AMOVA) was conducted to assess the genetic differentiation within and among subpopulations using GenAlEx v6.502 [48,49]. From AMOVA, the fixation index (*Fst*) and gene flow (*Nm*) were also obtained. In addition, genetic diversity indices, including number of different alleles (*Na*), number of effective alleles (*Ne*), Shannon's information index (*I*), observed heterozygosity (*Ho*), expected heterozygosity (*He*), unbiased expected heterozygosity (*uHe*), and percentage of polymorphic loci (*PPL*) of different sub-groups were also calculated using GenAlEx v6.502 [48,49].
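
The text does not give the expression linking the two statistics. A relationship commonly used for this purpose, derived from Wright's island model and assumed here rather than confirmed by the source, is:

```latex
N_m = \frac{1}{4}\left(\frac{1}{F_{st}} - 1\right)
```

Under this expression, a small $F_{st}$ implies a large $N_m$: for example, $F_{st} = 0.05$ would give $N_m = 4.75$. The value 0.05 is purely illustrative and is not a result of this study.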

#### **3. Results**

#### *3.1. Polymorphism Revealed by SSR Genotyping*

The 21 SSR primer pairs amplified a total of 226 alleles with an average of 10.8 alleles per primer pair (Table 2). Of the 226 alleles, 220 alleles were polymorphic and the other six alleles could be amplified in each clone. The number of alleles amplified by one primer pair ranged from five by MCSA176C01 to 25 by SCM4. The mean PIC value of each SSR primer pair ranged from 0.15 to 0.29 with an average of 0.23. The probability of identity (*PI*) values of the 21 markers were all very low, ranging from 0.000001 (mSSCIR36) to 0.071332 (SMC569CS) with an average of 0.015532. For the 21 primer pairs, the resolving power of the primer (*Rp*) was relatively high, ranging from 3.68 (SMC569CS) to 21.01 (mSSCIR36) with an average of 9.14. The mean number of alleles and the mean PIC value of genomic SSRs were 10.6 and 0.23, and were 9.8 and 0.23 for EST SSRs, respectively (Table 3).

**Table 3.** Genetic diversity parameters of 150 of the most popular parental clones from sugarcane hybrid breeding programs.


<sup>a</sup> PIC: Polymorphism information content; <sup>b</sup> *PI*: Probability of identity; <sup>c</sup> *RP*: Resolving power.

#### *3.2. Genetic Diversity*

The gene diversity (*h*) of the polymorphic alleles ranged from 0.013 to 0.500 with an average of 0.282. The Shannon's information index (*I*) of the polymorphic alleles ranged from 0.010 to 0.534 with an average of 0.261. Among the different series of sugarcane parental lines, the highest values of both gene diversity (*h*) and Shannon's information index (*I*) were found in the YC-series (0.261, 0.397), followed by the YT-series (0.254, 0.386) and the GT-series (0.251, 0.376) (Table 3), indicating that the YC-series is genetically more diverse than the other series. The average percentages of polymorphic alleles for the YT-, YC-, and CP-series were 0.814, 0.805 and 0.743, respectively. Alleles were identified that were unique to the 12 distinct germplasm groups (Table 4).


**Table 4.** Gene diversity, Shannon's information index, percentage of polymorphic loci and series-specific alleles of different series.

<sup>a</sup> *h*, Gene diversity; <sup>b</sup> *I*, Shannon's information index; <sup>c</sup> *PPL*, percentage of polymorphic loci.

#### *3.3. Population Structure and Phylogeny*

The *K*-value was used to estimate the number of clusters of the clones based on the genotypic data. A continuous gradual increase was observed in the log-likelihood of *K*-value (LnP(*K*)) with the increase of *K*-value (Figure 1B and Table S2). The number of clusters (*K*) was plotted against Delta *K* (Δ*K*), which revealed a sharp peak at *K* = 2 (Figure 1A and Table S2). The optimal *K*-value was therefore *K* = 2, indicating the highest probability for the presence of two sub-populations (Pop1 and Pop2) among the 150 sugarcane clones (Figure 1C); Pop1 consisted of 50 clones and Pop2 contained 100 clones (Table S3). Pop1 clones were mainly introduced accessions and most of the Pop2 clones were from Mainland China.

In accordance with the population structure results, PCoA also showed two clusters, with the first three axes together explaining 20.04% of the cumulative variation. In the PCoA plot, the first and second principal coordinates accounted for 8.41% and 6.71% of the total variation, respectively (Figure 2). Furthermore, the unrooted neighbor-joining phylogenetic tree (Figure 3) also showed two clusters. One cluster contained most of the clones of Pop1; the other cluster contained most of the clones of Pop2. However, admixture of clones between the two sub-populations does exist. A few accessions (YC98-27, GT03-2112 and FN0717) native to China were clustered into Pop1, while several others (HoCP01-517, ROC10, ROC16, K5, ROC25, ROC22, ROC1) introduced to China Mainland were grouped into Pop2.

**Figure 1.** (**A**) Delta *K* (Δ*K*) for different numbers of subpopulations (*K*); (**B**) average log-likelihood *K*-value (LnP(*K*)) against the number of *K*; (**C**) the population structure of 150 most popular parental clones in the hybrid breeding programs in China based on the distribution of 226 SSR alleles among these clones. Pop1 clones are coded in red and Pop2 clones in green.

**Figure 2.** Principal coordinates analysis (PCoA) scatter plots. Red circles represent the Pop1 clones and green triangles the Pop2 clones.

**Figure 3.** A neighbor-joining phylogenetic tree based on the pair-wise genetic distance between 150 most popular parental clones from hybrid breeding programs in China. Red circles represent the Pop1 clones and green triangles the Pop2 clones.

#### *3.4. Genetic Differentiation and Allelic Pattern Across Populations*

The two sub-populations Pop1 and Pop2 identified by the Structure analysis were subjected to the GeneALEx analysis to calculate the values of Analysis of Molecular Variance (AMOVA), *Nei*'s genetic distance and genetic diversity indices (Table 5). The variation value within the sub-populations (95% of total variation) was significantly higher than that between the sub-populations (5% of total variation). In addition, a high gene flow (*Nm* = 4.981) and a low fixation index value (*Fst* = 0.048) were obtained on the basis of *Nei*'s genetic distance analysis.

**Table 5.** Analysis of molecular variance (AMOVA) of SSR-based genetic variation between and within two sub-populations of Pop1 and Pop2.


The mean values of the number of different alleles (*Na*) and effective alleles (*Ne*) of the two sub-populations were 1.885 ± 0.015 and 1.462 ± 0.017, respectively. The mean values for *I*, *He* and *uHe* among the 150 parental clones were 0.413 ± 0.011, 0.272 ± 0.008 and 0.274 ± 0.009, respectively. Pop2 (*I* = 0.423 ± 0.016, *He* = 0.278 ± 0.012, and *uHe* = 0.278 ± 0.012) showed higher levels of genetic diversity than Pop1 (*I* = 0.403 ± 0.017, *He* = 0.267 ± 0.012, and *uHe* = 0.269 ± 0.012). The percentage of polymorphic loci per population (*PPL*) ranged from 83.63% (Pop1) to 93.36% (Pop2) with an average of 88.50% (Figure 4).

**Figure 4.** Allelic pattern of SSR across the two sub-populations Pop1 and Pop2. (**A**) Number of SSR alleles (*Na*); (**B**) number of effective SSR alleles (*Ne*); (**C**) Shannon's information index (*I*); (**D**) expected heterozygosity (*He*); (**E**) expected unbiased heterozygosity (*uHe*); and (**F**) percentage of polymorphic loci (*PPL*).

#### **4. Discussion**

Cross hybridization has become the main breeding method for sugarcane variety improvement. In the traditional sugarcane cross-breeding process, selecting parental clones for crossing is the most important step. Only parental clones sharing a high level of genetic diversity and complementarity can generate high quality seedling populations [56,57]. Since the 1950s, some sugarcane cultivars from America and China Taiwan have played a very important role in China's sugarcane cross-breeding programs [3]. Meanwhile, some new elite sugarcane parents are being created and utilized by the breeders every year. To make informed crossing choices, the genetic relationship among the parental clones involved in the latest sugarcane cross-breeding programs should be clarified.

In this study, we used 21 pairs of SSR primers to investigate the genetic diversity and population structure of 150 of the most commonly used parental clones. These primer pairs amplified 226 alleles, of which 97.3% were polymorphic. The mean PIC and the gene diversity (*h*) of the polymorphic alleles were 0.23 and 0.28, respectively, which were lower than the values reported for the "World Collections of Sugarcane and Related Grasses" (WCSRG) (PIC = 0.2568, *h* = 0.310) [23]. This may be largely due to the number of accessions involved in the world collection study. The WCSRG study involved 1002 highly diverse accessions, belonging to nine species, whereas only 150 clones were used in this study. Since 2000, a large number of genomic SSR and EST-SSR markers have been developed and applied effectively in estimating genetic diversity in sugarcane [16,35,39,41,58]. After extensive screening and identification (unpublished), we selected the best 21 primer pairs from these reports, including eight EST-SSR and 13 genomic SSR primer pairs. We found that the number and mean PIC value of the EST-SSR alleles were lower than those of the genomic SSR alleles (Table 2). This may be because the EST-SSR alleles are located in more conserved regions of the genome [16].

The probability of identity (*PI*) is an individual identification estimator that shows the probability of two different accessions sharing the same genotype at one specific locus in a population [23]. In this study, the *PI* values of all SSR primer pairs were very low, ranging from 0.000001 (mSSCIR36) to 0.071332 (SMC569CS) (Table 2). The combined *PI* value for all markers was 9.04 × 10<sup>−57</sup>, indicating that these 21 SSR primer pairs are able to distinguish the 150 parental clones. The resolving power of the primer pair (*Rp*) is an index that describes the primer pair's ability to identify different genotypes. *Rp* is related to the distribution of alleles within the sampled genotypes [46] and has been found to correlate strongly with the ability to distinguish genotypes in an evaluation of 34 potato cultivars using four primers [46]. The mean *Rp* value (9.135) of the 21 SSR primer pairs is much higher than those reported in other studies, such as 2.37 by [59] and 2.2 by [12], indicating that these primer pairs are more informative and could identify more cultivars.
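
To show how a combined *PI* of this magnitude can arise from individually small per-locus values, the sketch below assumes that the loci are independent, so that the combined value is the product of the per-locus *PI* values. The product is accumulated on a log scale to stay well within floating-point range. The function name and the three example values are hypothetical and are not the values from Table 2.

```python
import math


def combined_pi(per_locus_pi):
    """Product of per-locus PI values, returned as (mantissa, exponent) in base 10.

    Assumes independent loci; sums log10 values instead of multiplying directly.
    """
    log10_total = sum(math.log10(p) for p in per_locus_pi)
    exponent = math.floor(log10_total)
    mantissa = 10 ** (log10_total - exponent)
    return mantissa, exponent


# Hypothetical per-locus PI values for three markers.
example_pi = [0.05, 0.002, 0.0003]
mantissa, exponent = combined_pi(example_pi)
print(f"combined PI = {mantissa:.2f} x 10^{exponent}")
```

With 21 markers, each additional locus multiplies the combined value by a number well below one, which is why the overall probability of two clones sharing a multilocus genotype becomes vanishingly small.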

Based on geographic origin, the 150 clones were sorted into 15 series. Among these series, the genetic diversity (*h*) indices ranged from 0 to 0.261 and the Shannon's information index (*I*) ranged from 0 to 0.397. At the series level, the YC-series had the highest genetic diversity (*h* = 0.261, *I* = 0.397), which was similar to the previous results reported by You et al. [35,60]. The YC-series clones are from the Hainan Sugarcane Breeding Station of Guangzhou Sugarcane Industry Research Institute in Sanya city, Hainan province, where the primary sugarcane crossing facility of China is located. The YC-series clones were selected from crosses involving indigenous clones, foreign clones, and clones of closely related *Saccharum* species and genera [35]. Furthermore, the YC-series also had the greatest number of series-specific alleles (eight). Only four, two, one, and one unique alleles were found in the CP-series, YT-series, ROC-series, FN-series and NJ-series clones, respectively. Series-specific alleles are alleles found only in a single population among a broader collection of populations [61,62]. These alleles have been proven to be informative for population genetic studies [63,64], and they may be used for variety identification and marker-assisted selection.

The 150 parental clones were classified into two groups (Pop1 and Pop2) based on the PCoA, phylogenetic analysis and population structure analysis. Pop1 contained the majority of foreign accessions with membership probabilities of >0.5, while most accessions from Mainland China were assigned to Pop2. Certain specific target traits intentionally selected by different germplasm collectors or breeders might also contribute to the population structure [54]. However, admixture of clones between the two sub-populations does exist (Figures 1–3). For example, one out of the 29 CP-series clones, nine ROC-series clones and two K-series clones clustered into Pop2, but the majority of introduced clones clustered into Pop1. Likewise, one out of four DZ-series, five out of 11 FN-series, four out of 21 GT-series, two out of six LC-series, seven out of 29 YT-series, and two out of 10 YZ-series clones clustered into Pop1, while the majority of the clones from Mainland China clustered into Pop2. This might be due to genetic exchange among different series, or to the similar thresholds (Pop1: 0.5098, Pop2: 0.4902) (Table S3), which resulted in several clones being assigned entirely to one group (Pop1 or Pop2) while others were assigned to both groups.

The utilization data were based on the 150 most widely used parental clones of sugarcane breeding programs in China during the most recent five years. These included 32 clones of foreign origin, 109 clones from China Mainland, and nine ROC-series clones from China Taiwan. Among the 32 foreign clones, only one was from India (Co1001) and two were from Thailand (K5 and K86-110), while the majority (29/32) were from the US (CP-series). Co1001 has been used extensively as a parental line in sugarcane breeding programs worldwide. Some sugarcane cultivars, including the CP-series and China Mainland clones, were the progenies of Co-series varieties. Compared to clones from China Mainland, the CP-series clones may have a closer genetic distance to the Co-series, so the CP-series clones and the Co-series clone can be clustered into Pop1. K5 and K86-110, which were from Thailand, were two of the most widely used parental clones in China. Some clones from China Mainland were the progenies of K5 and K86-110. Clones from China Mainland may therefore have a closer genetic distance to these two clones, so that they clustered together into Pop2. The ROC-series varieties have been used as major cultivars in China Mainland, accounting for greater than 80% of sugarcane planting areas [24]. In addition, the ROC-series accessions were also the most widely used parents in China Mainland during the most recent five years (Table S1). In our study, the ROC-series accessions were clustered into Pop2 because of their closer genetic distance to China Mainland's clones. It is therefore suggested that less emphasis be placed on the continued utilization of ROC-series accessions in China Mainland's sugarcane breeding programs.

The fixation index (*Fst*) measures the genetic distance between populations. An *Fst* value of zero indicates no differentiation between the sub-populations, while a value of one indicates complete differentiation [65]. An *Fst* value less than 0.05 is considered to indicate no differentiation, while an *Fst* value greater than 0.15 is considered significant in differentiating populations [66]. In this study, the *Fst* value between the two sub-populations was 0.048 (Table 5), which was low and would indicate a very low genetic differentiation. This is consistent with the results obtained from the AMOVA, where the genetic variation within sub-populations (95%) was significantly higher than between sub-populations (5%). Gene flow (*Nm*) is the transfer of genetic variation from one population to another. If the value is less than one, then the gene exchange would be limited between sub-populations [67]. In this study, the *Nm* value was high (4.981), suggesting that a high level of genetic exchange may have occurred, which can result in a low genetic differentiation between the two sub-populations. Since the genetic diversity indices of Pop2, such as the number of different alleles (*Na*), effective alleles (*Ne*), *I*, *He* and *uHe*, were all higher than those of Pop1, Pop2 is more diverse than Pop1.
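
The inverse relationship between *Fst* and *Nm* can be illustrated with a short sketch. It assumes the widely used island-model approximation *Nm* = (1/*Fst* − 1)/4; the text does not state which estimator the software applied, so both the formula and the function name are assumptions made for illustration.

```python
def nm_from_fst(fst):
    """Approximate gene flow (Nm) from a fixation index, assuming an island model."""
    if not 0.0 < fst < 1.0:
        raise ValueError("Fst must lie strictly between 0 and 1")
    return (1.0 / fst - 1.0) / 4.0


# Fst reported in Table 5, rounded to three decimals.
print(round(nm_from_fst(0.048), 3))
# A hypothetical, more strongly differentiated pair of populations.
print(round(nm_from_fst(0.20), 3))
```

Under this approximation, the rounded *Fst* of 0.048 gives an *Nm* of about 4.96, close to the reported 4.981; the small gap is consistent with *Fst* having been rounded before reporting. An *Fst* of 0.20 would correspond to an *Nm* of exactly one, the boundary below which gene exchange is regarded as limited.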

Selecting genetically distant accessions from Pop1 and Pop2 for crossing parents in sugarcane breeding programs will potentially lead to elite varieties with broadened genetic bases. Almost all the CP-series clones from the US were clustered into Pop1. These clones have been used extensively as parental lines in the sugarcane breeding programs in China; some have become or are elite progenitors of Chinese cultivars [67]. In addition, this study shows that several YC-series clones are also good crossing parents with a high level of genetic diversity.

#### **5. Conclusions**

Using a high-performance capillary electrophoresis (HPCE) detection system, the 150 most widely used sugarcane parental clones from 15 different series were fingerprinted with 21 SSR primer pairs. A total of 226 SSR alleles were identified, and the distribution of these SSR alleles was subjected to genetic variation, phylogeny, population structure, and principal coordinate analyses. The results showed that the parental lines were clustered into two distinct groups, Pop1 and Pop2. Pop1 contained the majority of foreign clones, while Pop2 consisted of the majority of accessions from Mainland China. Genetic differentiation between the two groups was low. The YC-series clones of Pop2 displayed a high level of genetic diversity and the CP-series clones were elite parents of several Chinese cultivars. The introduction and utilization of more clones of the YC- and CP-series into China's sugarcane breeding programs will broaden the genetic base of breeding germplasm and produce high quality seedlings for selection and development of elite varieties.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2073-4395/9/8/449/s1, Table S1: Utilization data of the most widely used 150 parental clones from sugarcane hybrid breeding programs in China during the recent five years. Table S2: Tabulated *K* values of 150 most popular parental clones from sugarcane hybrid breeding programs in China at *K* = 1 to 10. Table S3: Sub-population assignment of the 150 most popular parental clones from the sugarcane breeding programs in China based on the *Q* values.

**Author Contributions:** Methodology, J.W.; Validation, J.W. and Q.W.; Formal Analysis, J.W. and Y.-B.P.; Investigation, J.X., H.X. and J.W.; Resources, F.Z., C.Z., and W.Z.; Data Curation, J.X., Y.G. and H.C.; Writing—Original Draft Preparation, J.W., Y.-B.P. and J.X.; Writing—Review and Editing, Q.W. and Y.-B.P.; Funding Acquisition, Q.W. and Y.Q.

**Funding:** This research was funded by the National Natural Science Foundation of China (31701488), the Earmarked Fund for China Agriculture Research System (CARS-170107), the Science and Technology Project of Guangdong Province (2017A030303049) and the Guangdong Provincial Team of Technical System Innovation for Sugarcane Sisal Industry (2019KJ104-02).

**Acknowledgments:** We thank Perng-Kuang Chang, James Todd and Yunlin Jia for their review comments and language editing.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

*Article*

## **Single-Molecule Long-Read Sequencing of Avocado Generates Microsatellite Markers for Analyzing the Genetic Diversity in Avocado Germplasm**

#### **Yu Ge, Xiaoping Zang, Lin Tan, Jiashui Wang, Yuanzheng Liu, Yanxia Li, Nan Wang, Di Chen, Rulin Zhan \* and Weihong Ma \***

Haikou Experimental Station, Chinese Academy of Tropical Agricultural Sciences, Haikou 570102, China

**\*** Correspondence: zhanrulin@catas.cn (R.Z.); zjwhma@catas.cn (W.M.); Tel.: +86-898-6679-4563 (R.Z.); +86-898-6677-3067 (W.M.)

Received: 9 July 2019; Accepted: 1 September 2019; Published: 5 September 2019

**Abstract:** Avocado (*Persea americana* Mill.) is an important fruit crop commercially grown in tropical and subtropical regions. Despite the importance of avocado, there is relatively little available genomic information regarding this fruit species. In this study, we functionally annotated the full-length avocado transcriptome sequence based on single-molecule real-time sequencing technology, and predicted the coding sequences (CDSs), transcription factors (TFs), and long non-coding RNA (lncRNA) sequences. Moreover, 76,777 simple sequence repeat (SSR) loci detected among the 42,096 SSR-containing transcript sequences were used to develop 149,733 expressed sequence tag (EST)-SSR markers. A subset of 100 EST-SSR markers was randomly chosen for an analysis that detected 15 polymorphic EST-SSR markers, with an average polymorphism information content of 0.45. These 15 markers were able to clearly and effectively characterize 46 avocado accessions based on geographical origin. In summary, our study is the first to generate a full-length transcriptome sequence and develop and analyze a set of EST-SSR markers in avocado. The application of third-generation sequencing techniques for developing SSR markers is a potentially powerful tool for genetic studies.

**Keywords:** *Persea americana*; SMRT sequencing; simple sequence repeat; genetic relationship

#### **1. Introduction**

Avocado (*Persea americana* Mill.), which belongs to the family Lauraceae of the order Laurales, is native to Mexico and Central and South America, and is one of the most economically important subtropical/tropical fruit crops worldwide [1]. Taxonomic treatments differ considerably in terms of the circumscription and definition of infraspecific avocado entities [2–5]. Additionally, researchers have long considered that geographical isolation has likely resulted in the following three ecological races of avocado: Mexican (*P. americana* var. *drymifolia*), Guatemalan (*P. americana* var. *guatemalensis*), and West Indian (*P. americana* var. *americana*) [1]. The Mexican race adapted to a Mediterranean climate, whereas the Guatemalan race originated in a tropical highland climate, and the West Indian race adapted to humid tropical lowland conditions [1].

Avocado is rich in lipids, sugars, proteins, minerals, vitamins, and other active ingredients [6–8]. Moreover, avocado production has increased worldwide [1]. One factor contributing to the increases in production and consumption is the expansion of avocado products into new global markets where avocado was previously unknown or scarce, including China, which is an emerging market for the production and consumption of avocado [1,9]. After avocado was first introduced and cultivated in China in the late 1950s, selective breeding by some national scientific research bodies and other state farms has resulted in the development of more than 10 superior avocado accessions [9,10]. Additionally, natural crosses among avocado accessions have generated new hybrids on state and private farms, and some native accessions are increasingly produced in somewhat remote areas with distinct local environmental conditions [9,10]. Avocado is broadly grown and exploited in some provinces in southern China, including Hainan, Guangxi, Yunnan, and Taiwan [9,10]. The climatic conditions in these provinces are subtropical to tropical, which are ideal conditions for the cultivation of avocado [9,10].

The avocado germplasm should be precisely characterized to maximize its utility to breeders worldwide [1]. Specifically, a molecular characterization is required for analyses of the genetic relationships among avocado germplasm. Over the past two decades, studies involving various types of molecular markers have examined the genetic relationships among avocado germplasm [11–20]. Of the many available DNA markers, simple sequence repeats (SSRs) are commonly used for investigating plant genetics and breeding because they are widely distributed and abundant in plant genomes. They are also genetically codominant, highly reproducible, multi-allelic, and perfectly suitable for high-throughput genotyping [21–25]. Expressed sequence tag (EST)-derived markers in the genomic coding regions have an advantage over genomic DNA-derived markers, and can be efficiently amplified to reveal conserved sequences among related species [26]. There has recently been increasing interest in developing EST-SSR markers via high-throughput transcriptome sequencing. Thus, there has been rapid progress in the development of EST-SSR markers based on transcriptome data produced with second-generation sequencing technology for *Lilium brownii* var. *viridulum* Baker [27], *Crataegus pinnatifida* Bunge [28], *Acer miaotaiense* P. C. Tsoong [29], and *Rosa hybrida* hort. ex Lavalle [30]. Among the third-generation sequencing platforms, PacBio RS II, which is regarded as the first commercialized third-generation sequencer, is based on single-molecule real-time (SMRT) technology [31]. The PacBio RS II system can produce much longer reads than second-generation sequencing platforms, and has been applied to effectively capture full-length transcript sequences for EST-derived marker development [32]. However, there are few reports regarding the application of EST-SSR markers developed with SMRT technology for crop breeding.

Single-molecule real-time technology has the following three main advantages over second-generation sequencing options: it generates longer reads, it has higher consensus accuracy, and it is less biased [33]. A previous study revealed that SMRT technology can precisely ascertain alternative polyadenylation sites and full-length splice isoforms, and also detect a higher isoform density than that for the reference genome [34]. The application of SMRT technology for nearly 3 years has helped to elucidate the complexity of the transcriptome and molecular mechanism underlying the metabolite synthesis in safflower [31], *Zanthoxylum bungeanum* Maxim. [32], *Trifolium pratense* L. [34], *Saccharum officinarum* L. [35], *Panicum virgatum* L. [36], *Medicago sativa* L. [37], *Zanthoxylum planispinum* Sieb. [38], *Cynodon dactylon* L. Pers. [39], *Camellia sinensis* L. O. Ktze. [40], and *Cassia obtusifolia* L. [41].

In a previous study, we generated the first full-length transcriptome sequence of avocado based on SMRT technology, and the short reads obtained in that study by second-generation transcriptome sequencing were used to correct the transcripts obtained with SMRT technology [42]. In this study, we functionally annotated the sequences and completed SSR mining of the SMRT-derived transcripts from avocado mesocarp. We also predicted the coding sequences (CDSs), transcription factors (TFs), and long non-coding RNA (lncRNA) sequences. Furthermore, we identified a set of EST-SSR markers, and assessed their utility for determining the genetic diversity among 46 selected avocado accessions from various locations in southern China. The generated data enabled the broad and distinct visualization of the genetic diversity in the analyzed avocado germplasm. The results of this study represent useful genetic and transcriptome information to support future research on avocado.

#### **2. Materials and Methods**

#### *2.1. Sample Collection, DNA Extraction, and RNA Extraction*

For transcriptome analyses, avocado fruits (cultivar 'Hass') were harvested from April to September 2018 from six 10-year-old trees (grafted onto Zutano clonal rootstock) growing at the Chinese Academy of Tropical Agricultural Sciences (CATAS; Danzhou, Hainan, China; latitude 19°31′ N, longitude 109°34′ E, and 20 m above sea level). Each biological replicate comprised samples from two trees. Specifically, fruits that developed during the main flowering season (i.e., February 2018) were marked, after which samples were collected at five time-points (75, 110, 145, 180, and 215 days after full bloom) until the fruits reached physiological maturity (i.e., able to ripen after harvest). The fruits were randomly collected for each biological replicate during each developmental stage. Fruits were quickly brought to the laboratory, after which the mesocarp (pulp) was separated from the seed and then immediately frozen at −80 °C for subsequent transcriptome analyses. Total RNA was extracted with a Plant RNA Kit (OMEGA Bio-Tek, Norcross, GA, USA).

For Kompetitive allele-specific PCR (KASP) genotyping and EST-SSR detection, seven commercial cultivars and 39 native accessions were selected. These native accessions were obtained from the CATAS (Danzhou, Hainan, China; latitude 19°31′ N, longitude 109°34′ E, and 20 m above sea level), Daling State Farm (DLSF; Baisha, Hainan, China; latitude 19°14′ N, longitude 109°14′ E, and 60 m above sea level), Mengmao State Farm (MMSF; Ruili, Yunnan, China; latitude 24°00′ N, longitude 97°50′ E, and 240 m above sea level), and Guangxi Vocational and Technical College (GVTC; Nanning, Guangxi, China; latitude 22°29′ N, longitude 108°11′ E, and 79 m above sea level). Details regarding the avocado germplasm are provided in Table S1. Genomic DNA was extracted from fresh leaves as described by Ge [43].

#### *2.2. PacBio cDNA Library Construction and Sequencing*

Poly-T oligo-attached magnetic beads were used to purify the mRNA from the total RNA extracted from 15 mesocarp (pulp) samples collected at each analyzed developmental stage. The mRNA from all five developmental stages was combined to serve as the template to synthesize cDNA with the SMARTer PCR cDNA Synthesis Kit (Clontech, Mountain View, CA, USA). After a PCR amplification, quality control check, and purification, full-length cDNA fragments were acquired according to the BluePippin Size Selection System protocol, ultimately resulting in the construction of a cDNA library (1–6 kb). Selected full-length cDNA sequences were ligated to the SMRT bell hairpin loop. The concentration of the cDNA library was then determined with the Qubit 2.0 fluorometer, whereas the quality of the cDNA library was assessed with the 2100 Bioanalyzer (Agilent). Finally, one SMRT cell was sequenced with the PacBio RSII system (Pacific Biosciences, Menlo Park, CA, USA).

#### *2.3. Illumina cDNA Library Construction and Sequencing*

Oligo-(dT) magnetic beads were used to purify the mRNA from the total RNA extracted from 15 mesocarp (pulp) samples from five developmental stages, with three biological replicates analyzed per stage. Each sample underwent an RNA-sequencing analysis. The fragmentation step was completed with divalent cations in heated 5× NEBNext First Strand Synthesis Reaction Buffer. First-strand cDNA was synthesized with a series of random hexamer primers and reverse transcriptase, after which the second-strand cDNA was generated with DNA polymerase I and RNase H. The cDNA libraries were constructed by ligating the cDNA fragments to sequencing adapters and amplifying the fragments by PCR. The libraries were then sequenced with the Illumina HiSeq 2000 platform (Nanxin Bioinformatics Technology Co., Ltd., Guangzhou, China).

#### *2.4. Quality Filtering and Correction of PacBio Long-Reads*

Raw reads were processed into error-corrected reads of insert (ROIs) using an isoform sequencing pipeline, with minimum full pass = 0.00 and minimum predicted accuracy = 0.80. Next, full-length, non-chimeric transcripts were detected by searching for the poly-A tail signal and the 5′ and 3′ cDNA primer sequences in the ROIs. Iterative clustering for error correction was used to obtain high-quality consensus isoforms, which were then polished with Quiver (version 1.0). The low-quality full-length transcript isoforms were corrected based on Illumina short reads with the default settings of the Proovread program. High-quality and corrected low-quality transcript isoforms were confirmed as nonredundant with the CD-HIT software.

#### *2.5. Functional Annotation*

Genes were functionally annotated based on a BLASTX search (*E*-value threshold of 10<sup>−5</sup>) of the following databases: Clusters of Orthologous Groups of proteins (KOG/COG) (available online: http://www.ncbi.nlm.nih.gov/KOG/; http://www.ncbi.nlm.nih.gov/COG/), Non-supervised Orthologous Groups (eggNOG) (available online: http://eggnogdb.embl.de/#/app/home), Swiss-Prot (a manually annotated and reviewed protein sequence database, available online: http://www.uniprot.org/), Pfam (assigned with the HMMER3.0 package, available online: https://pfam.xfam.org/), and NCBI nonredundant protein sequence (Nr) (available online: http://www.ncbi.nlm.nih.gov/). Additionally, the KEGG Automatic Annotation Server [44] was used to assign these genes to Kyoto Encyclopedia of Genes and Genomes (KEGG) metabolic pathways (available online: http://www.genome.jp/kegg/). The unigenes were annotated with gene ontology (GO) terms (available online: http://www.geneontology.org/) with the Blast2GO (version 2.5) program [45] based on the BLASTX matches in the Pfam and Nr databases (*E*-value threshold of 10<sup>−6</sup>).

#### *2.6. Mining of EST-SSR Markers*

The MISA (version 1.0) program was used to locate SSRs with the following default settings: a minimum of 10 repeats for mononucleotide motifs, 6 repeats for dinucleotide motifs, and 5 repeats for tri- and hexanucleotide motifs.
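The repeat thresholds above can be illustrated with a short regular-expression search. This is only a sketch of the general idea, not the MISA implementation; the function name `find_ssrs` is hypothetical, and the thresholds of 5 for tetra- and pentanucleotide motifs are an assumption, since the text lists only the mono-, di-, tri-, and hexanucleotide values.

```python
import re

# Minimum number of tandem repeats per motif length.
# Values for lengths 4 and 5 are assumed; the text does not state them.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return a sorted list of (start, motif, repeat_count) for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for size in sorted(min_repeats):
        n = min_repeats[size]
        # A motif of `size` bases followed by at least n - 1 further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "ATAT"),
            # because those loci belong to the shorter motif class.
            if any(motif == motif[:k] * (size // k)
                   for k in range(1, size) if size % k == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // size))
    return sorted(hits)
```

For instance, a sequence containing twelve consecutive A residues, seven AG units, and five AAG units would yield one mono-, one di-, and one trinucleotide locus. Compound and imperfect SSRs, which MISA can also report, are not handled here.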

#### *2.7. Analyses of Detected Coding Sequences, Transcription Factors, and Long Non-Coding RNA Features*

The open reading frames (ORFs) detected with the TransDecoder (version 3.0.0) program were designated as putative CDSs if they satisfied the following criteria: (1) An ORF was detected in a transcript sequence; (2) the log-likelihood score was >0, and was similar to what was calculated with the GeneID software; (3) the score was higher when the ORF was in the first reading frame than when the ORF was in the other five reading frames; (4) if a candidate ORF was within another candidate ORF, the longer one was reported. However, a single transcript could be associated with multiple ORFs (because of operons and chimeras); and (5) the putative encoded peptide matched a Pfam domain.
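Criteria (1) and (4) can be illustrated with a plain six-frame scan that reports the longest ATG-to-stop ORF. This sketch does not reproduce the log-likelihood scoring or the Pfam check that TransDecoder applies; the function names are hypothetical and only the standard stop codons are assumed.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame):
    """Yield (start, end) of ATG...stop ORFs in one forward frame (0, 1, or 2)."""
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            yield start, i + 3  # end is exclusive and includes the stop codon
            start = None


def longest_orf(seq):
    """Return (strand, frame, start, end) of the longest complete ORF, or None."""
    seq = seq.upper()
    best = None
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame):
                if best is None or end - start > best[3] - best[2]:
                    best = (strand, frame, start, end)
    return best
```

Coordinates on the minus strand refer to the reverse-complemented sequence. ORFs lacking a stop codon (partial CDSs) are ignored in this simplified version.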

Transcription factor gene families were identified based on categorically defined TF families and criteria from the KO, KOG, GO, Swiss-Prot, Pfam, Nr, and Nt databases. Specifically, the default parameters of the iTAK (version 1.2) program were used. The methods used to identify and classify TFs were previously described by Perez-Rodriguez [46].

The following four computational tools were combined to sort non-protein-coding RNA candidates from putative protein-coding RNAs among the transcripts: the Coding Potential Calculator (CPC), Coding-Non-Coding Index (CNCI), Coding Potential Assessment Tool (CPAT), and Pfam database. Transcripts longer than 200 nt, with more than two exons, were selected as lncRNA candidates and were further screened with CPC/CNCI/CPAT/Pfam, which distinguished the protein-coding genes from the non-coding genes.
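The screening logic can be sketched as a length and exon filter followed by a set intersection. The sketch assumes that a transcript is retained only when all four tools class it as non-coding, which the text implies but does not state explicitly; the function name and input layout are hypothetical.

```python
def lncrna_candidates(transcripts, noncoding_calls, min_length=200, min_exons=3):
    """Return IDs that pass the length/exon filter and are non-coding for every tool.

    transcripts: {transcript_id: (length_nt, exon_count)}
    noncoding_calls: {tool_name: set of transcript IDs that the tool classed as
        non-coding (for Pfam, transcripts without a protein domain hit)}
    """
    # "Longer than 200 nt, with more than two exons" as described in the text.
    eligible = {
        tid for tid, (length, exons) in transcripts.items()
        if length > min_length and exons >= min_exons
    }
    # Keep only transcripts called non-coding by all tools (assumed rule).
    for called in noncoding_calls.values():
        eligible &= called
    return eligible
```

Requiring agreement among all tools is conservative: it lowers the number of candidates but reduces the chance of retaining protein-coding transcripts.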

#### *2.8. Assignment of the Native Avocado Accessions with an Unknown Race*

To validate the origins of the 33 native accessions with an unknown race, primers for six race-specific single nucleotide polymorphism (SNP) loci (listed in Table S2) were used for KASP genotyping [47]. The primer mix, which was prepared and used as described by KBioscience (http://www.kbioscience.co.uk), comprised 46 μL dH2O, 30 μL common primer (100 μM), and 12 μL each tailed primer (100 μM). The SNPs were amplified by PCR in a thermal cycler in a 5-μL reaction consisting of 1× KASP Master mix, 10 ng genomic DNA, and the SNP-specific KASP assay mix. The same PCR amplification conditions were used for each SNP assay: 94 °C for 15 min; 10 touchdown cycles of 94 °C for 20 s and 58–61 °C for 60 s (decreasing by 0.8 °C per cycle); and 35 cycles of 94 °C for 20 s and 57 °C for 60 s. The resulting data were analyzed with the Roche LightCycler 480 (version 1.50.39) program.
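The primer concentrations in the assay mix follow from a simple dilution calculation. The sketch below assumes two tailed (allele-specific) primers per assay, as is usual for KASP; the component names are hypothetical labels.

```python
def primer_mix(volumes_ul, stock_um=100.0):
    """Return (total_volume_ul, {primer: concentration_um}) for a primer mix.

    volumes_ul: {component: volume in microlitres}; "dH2O" is treated as diluent.
    stock_um: stock concentration of every primer in micromolar.
    """
    total = sum(volumes_ul.values())
    conc = {
        name: stock_um * vol / total
        for name, vol in volumes_ul.items()
        if name != "dH2O"
    }
    return total, conc


# Volumes from the text; two tailed primers per assay are assumed.
total, conc = primer_mix({"dH2O": 46, "common": 30, "tailed_1": 12, "tailed_2": 12})
```

Under these assumptions the mix has a total volume of 100 μL, with the common primer at 30 μM and each tailed primer at 12 μM.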

#### *2.9. Identification of EST-SSR Markers*

To screen the EST-SSR loci, primers based on the sequences flanking the selected microsatellite loci were designed with the Primer3 program; the expected PCR products ranged from 100 to 300 bp. All assigned marker names included Pa-eSSR to indicate their association with *P. americana* and EST-SSRs. A subset of 100 EST-SSR primer pairs was randomly selected for validation by PCR amplification under the same conditions as those described by Ge [43]. The PCR products were analyzed with the 96-capillary 3730xl DNA Analyzer (Applied Biosystems, Foster City, CA, USA). The detection system included 8.9 μL Hi-Di formamide (Applied Biosystems), 0.1 μL LIZ size standard (Applied Biosystems), and 1 μL PCR products (1:10 dilution). A lack of amplification was considered indicative of a null allele.

#### *2.10. Data Analysis*

The number of observed alleles (Na), effective number of alleles (Ne), observed heterozygosity (Ho), expected heterozygosity (He), and polymorphism information content (PIC) of each EST-SSR were assessed with the POPGEN (version 1.32) program [48]. A cluster analysis was performed with PowerMarker (version 3.25) [49]. The cophenetic correlation coefficient was computed for the dendrogram after the construction of a cophenetic matrix to measure the goodness of fit between the original similarity matrix and the dendrogram. Bootstrap support values were obtained from 1000 replicates. A neighbor-joining tree was constructed based on shared alleles and visualized with the MEGA6.0 software [50].
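For readers unfamiliar with these parameters, the sketch below computes them for a single locus from diploid genotypes. It uses the common textbook definitions (Ne = 1/Σp², He = 1 − Σp², and the PIC formula of Botstein et al.); the exact estimators applied by the software, for example any small-sample correction of He, may differ, and the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def diversity(genotypes):
    """Return Na, Ne, Ho, He, and PIC for one locus.

    genotypes: list of (allele_a, allele_b) tuples, diploid, no missing data.
    """
    n = len(genotypes)
    counts = Counter(allele for g in genotypes for allele in g)
    freqs = [c / (2 * n) for c in counts.values()]
    homozygosity = sum(p * p for p in freqs)  # expected homozygosity, sum of p_i^2
    return {
        "Na": len(counts),                     # observed number of alleles
        "Ne": 1 / homozygosity,                # effective number of alleles
        "Ho": sum(a != b for a, b in genotypes) / n,
        "He": 1 - homozygosity,                # uncorrected expected heterozygosity
        "PIC": 1 - homozygosity
               - sum(2 * p * p * q * q for p, q in combinations(freqs, 2)),
    }
```

With two equally frequent alleles, for example, He is 0.5 while PIC is 0.375, which shows that PIC is always at most He for the same allele frequencies.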

#### **3. Results**

#### *3.1. General Properties and Functional Annotations Based on Public Databases of Single-Molecule Long-Reads*

Figure 1 presents the length distribution of 651,260 reads of insert in avocado mesocarp, and the classification of the reads of insert in avocado mesocarp is shown in Figure 2. The SMRT and Illumina HiSeq 2000 sequencing data were deposited in the GenBank database (accession numbers PRJNA551932 and PRJNA541745, respectively). Gene annotations according to a BLASTX algorithm indicated that 71,627 avocado transcripts significantly matched sequences in the COG, GO, KEGG, KOG, Pfam, Swiss-Prot, eggNOG, and Nr databases (Table S3). The species with the most matches for the transcripts were *Nelumbo nucifera* Gaertn. (41.18% of transcripts), *Vitis vinifera* L. (10.76% of transcripts), *Elaeis guineensis* Jacq. (8.88% of transcripts), and *Phoenix dactylifera* L. (6.90% of transcripts). The homology with the other species was relatively low (1.14%–2.54% of transcripts; Figure 3). To further predict and classify the functions of the annotated transcripts, we analyzed their matching GO terms, eggNOG classifications, and KEGG pathway assignments. A total of 45,134 transcripts were assigned to 51 subcategories of the three main GO functional categories as follows: 106,390 transcripts for biological processes, 45,931 transcripts for cellular components, and 69,120 for molecular functions (Figure 4a, Table S4). Next, 70,205 transcripts were functionally classified into 25 eggNOG categories (Figure 4b, Table S5). Among these categories, the most heavily represented group was posttranslational modification, protein turnover, chaperones (6410 transcripts, 8.94%), followed by signal transduction mechanisms (4189 transcripts, 5.84%) and transcription (3868 transcripts, 5.39%). Only 20 and 6 transcripts belonged to the cell motility and nuclear structure categories, respectively. Finally, 33,310 transcripts were assigned to 129 KEGG pathways (Table S6). The most represented pathways were related to carbon metabolism (1678 transcripts), protein processing in endoplasmic reticulum (1649 transcripts), and biosynthesis of amino acids (1503 transcripts).

**Figure 1.** Length distribution of 651,260 reads of insert in avocado mesocarp.

**Figure 2.** Classification of reads of insert in avocado mesocarp.

**Figure 3.** Species most closely related to avocado based on the NCBI nonredundant protein sequence database.

**Figure 4.** Functional classification of transcripts. The predicted functions were based on Gene Ontology (**a**) and Non-supervised Orthologous Groups (**b**) databases.

#### *3.2. Predictions of ORFs, TFs, and lncRNAs*

A total of 73,946 ORFs were predicted, 61,523 of which were complete CDSs. The number and length distribution of proteins encoded by the CDS regions are presented in Figure 5 and Additional file 1. A total of 7969 putative avocado TFs distributed in 203 families were identified (Table S7). The most abundant TF categories included RLK-Pelle\_DLSV (241) and C3H (240). Additionally, the CPC, CNCI, CPAT, and Pfam database were combined to distinguish lncRNA candidates from putative protein-coding RNAs among the unannotated transcripts. Analyses with the CPC, CNCI, CPAT, and Pfam database identified 7869, 6444, 16,464, and 15,579 transcripts, respectively, longer than 200 nt with more than two exons as lncRNA candidates. A total of 3596 lncRNA transcripts were predicted (Figure 6).

**Figure 5.** Distribution of 61,523 complete coding sequences for the avocado open reading frames.

**Figure 6.** The number of long non-coding RNA transcripts predicted in avocado based on the Coding Potential Calculator, Coding-Non-Coding Index, Coding Potential Assessment Tool, and Pfam database.

#### *3.3. Frequency and Distribution of Various Types of EST-SSR Loci*

The 75,946 transcript sequences comprising 170,959,769 bp detected in this study included 42,096 sequences containing 76,777 SSR loci (Table 1). Of these SSR-containing transcript sequences, 19,825 harbored more than one SSR locus. Mononucleotide motifs were the most abundant (44,800; 58.35%), followed by di- (18,903; 24.62%), tri- (11,724; 15.27%), tetra- (788; 1.03%), hexa- (321; 0.42%), and pentanucleotide (241; 0.31%) motif repeats (Table 2).
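The share of each motif class can be recomputed directly from the counts reported above, which is a useful consistency check. The following lines are a minimal sketch; the dictionary keys are just labels.

```python
# SSR counts per motif class, as reported in the text
counts = {
    "mono": 44800,
    "di": 18903,
    "tri": 11724,
    "tetra": 788,
    "hexa": 321,
    "penta": 241,
}

total = sum(counts.values())  # should equal the 76,777 SSR loci reported

# Percentage of all SSR loci contributed by each motif class
shares = {motif: round(100 * n / total, 2) for motif, n in counts.items()}

for motif, pct in shares.items():
    print(f"{motif:>5}: {counts[motif]:>6} loci, {pct:.2f}%")
```

The six counts sum to the reported total of 76,777 loci, so the percentages add up to 100% within rounding.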





The number of repeats per SSR locus ranged from 5 to 1343. Moreover, SSRs with more than 10 repeats were the most abundant, followed by those with 10, 6, and 5 repeats. Among the 139 different repeat types, (A/T)n was the most common (56.63%). The five other main motif types were (AG/CT)n (19.14%), (AAG/CTT)n (5.97%), (AT/AT)n (3.35%), (AGC/CTG)n (2.18%), and (AC/GT)n (2.04%) (Table S8).

#### *3.4. Development of Polymorphic EST-SSR Markers, Analysis of Genetic Diversity, and KASP genotyping*

Using Primer3, we developed 149,733 EST-SSR markers from the 49,911 SSR loci (Table S9). To verify the amplification of the EST-SSR markers, a subset of 100 EST-SSR markers was randomly chosen and tested with seven accessions from various regions in southern China (Table S10). The primers for 30 of the tested markers generated polymorphic amplification products, whereas 37 primer pairs amplified nonpolymorphic products and 33 did not produce clear amplicons. The 30 polymorphic EST-SSR markers, which included 15 di-, 5 tri-, 5 tetra-, 2 penta-, and 3 hexanucleotide motif-based markers, were further verified with 46 avocado accessions. Finally, 15 polymorphic EST-SSR markers, with missing allele frequencies <10% for all 46 avocado accessions, were selected for subsequent analyses of genetic diversity (Table S11). A total of 71 alleles were detected at the 15 polymorphic EST-SSR loci in the 46 avocado accessions. Eight of these alleles were considered to be accession-specific, and the other 63 alleles were found in multiple accessions (Table S11). The eight accession-specific alleles were from the following accessions: Renong No. 4, Renong No. 5, Renong No. 6, Guiyan No. 8, Daling No. 5, Daling No. 6, RL chang, and RL yuan.

The 15 polymorphic EST-SSRs were applied to evaluate diversity parameters (Table 3). The Na amplified per SSR locus varied from 2 to 10, with a mean of 4.73. The Ne varied from 1.04 to 4.39, with an average of 2.31, and Ho ranged from 0.04 to 0.93, with an average of 0.49. The He ranged from 0.04 to 0.77, with an average of 0.50, and PIC values ranged from 0.04 to 0.74, with an average of 0.45.


**Table 3.** Diversity parameters associated with 15 polymorphic EST-SSRs analyzed in 46 avocado accessions.

<sup>1</sup> Number of observed alleles; <sup>2</sup> effective number of alleles; <sup>3</sup> observed heterozygosity; <sup>4</sup> expected heterozygosity; <sup>5</sup> polymorphism information content.

Six race-specific KASP markers were used to determine the race of 33 avocado accessions with an unknown race. The KASP genotyping results demonstrated that all 33 avocado accessions were Guatemalan × West Indian hybrids based on the corresponding genotype of each avocado race (Table S2).

#### *3.5. Analyses of Genetic Relationships Based on Polymorphic EST-SSRs from SMRT Sequencing Data*

A cluster analysis grouped the 46 accessions into two major clusters (Figure 7). The dendrogram revealed a clear separation between the native avocado accessions from Hainan province and those from Guangxi and Yunnan provinces. In cluster I, 19 Guatemalan × West Indian hybrids were grouped into two sub-clusters. Sub-cluster I-I consisted of 13 native Guatemalan × West Indian hybrids from Guangxi province. Sub-cluster I-II contained two native Guatemalan × West Indian hybrids from Yunnan province. Cluster II comprised 27 Guatemalan × West Indian hybrids from Hainan province. Among these hybrids, 15 and 6 were obtained from the CATAS and DLSF, respectively.

**Figure 7.** Neighbor-joining consensus tree of 1000 bootstrap replicates revealing the phylogenetic relationships among the 46 analyzed avocado accessions based on the shared alleles for the 15 EST-SSR markers. GVTC, native avocado accessions from Guangxi Vocational and Technical College; MMSF, native avocado accessions from Mengmao State Farm; CATAS, native avocado accessions from the Chinese Academy of Tropical Agricultural Sciences; and DLSF, native avocado accessions from Daling State Farm. The native avocado accessions labeled with an asterisk originated from other regions.

Figure 8 presents the distribution of the 46 avocado accessions for the first two principal coordinates of a principal coordinate analysis (PCoA). On the basis of the first coordinate, which accounted for 21.71% of the total variation, the accessions were generally distributed in two groups. The native avocado accessions from Hainan and Yunnan provinces were largely grouped separately from the native avocado accessions from Guangxi province. The second coordinate accounted for 10.06% of the total variation. Finally, we observed that the native avocado accessions were generally grouped according to their geographical origins.

**Figure 8.** Principal coordinate analysis of 46 avocado accessions based on the 15 EST-SSR markers. POP1, avocado accessions from Florida, USA; POP2, native avocado accessions from the Chinese Academy of Tropical Agricultural Sciences; POP3, native avocado accessions from Mengmao State Farm; POP4, native avocado accessions from Daling State Farm; and POP5, native avocado accessions from Guangxi Vocational and Technical College.

#### **4. Discussion**

Transcriptome sequencing is a useful technique for obtaining a large number of transcripts for organisms lacking a reference sequence, at least partly because it is inexpensive and can be completed rapidly [51–53]. To date, several short-read next-generation sequencing (NGS) transcriptome databases have been developed for avocado mesocarp samples [54,55] and avocado mixed tissue samples [18,56]. However, both the number and length of the transcript sequences derived from these short-read NGS studies have hampered their application in genetics and molecular biology research [41]. One of the advances in sequencing technology has been the development of the long-read SMRT sequencing technique, which enables researchers to obtain a substantial number of full-length sequences from a cDNA library [32]. In the current study, we applied the PacBio SMRT system to generate and analyze the full-length transcriptome of avocado mixed mesocarp samples collected at various developmental stages. The 25.79 Gb SMRT data produced in this study provide the first comprehensive insights into the avocado mesocarp, which is the most economically valuable organ of this fruit species, and might serve as the genetic basis for future research on avocado. Interestingly, the full-length transcriptome sequence described herein is also the first such sequence for a plant species from the family Lauraceae.

In this study, 93.82% (71,627 of 76,345) of the nonredundant transcripts were annotated based on similarities with sequences in public databases. Thus, a greater proportion of transcripts were annotated in this study than in previous investigations involving NGS data for various avocado races (49.00%) [18] and for avocado mesocarp samples (57.50%) [55]. We determined that the mean length of the avocado nonredundant transcripts was 2330 bp, implying that our sequences were long enough to represent full-length transcripts. Additionally, this mean length was within the range of the mean lengths obtained for other species, including *Z. bungeanum* (3414 bp) [32], *T. pratense* (2789 bp) [34], *M. sativa* (1706 bp) [37], *Z. planispinum* (1781 bp) [38], *C. sinensis* (1781 bp) [40], and *Arabidopsis pumila* (2194 bp) [57]. Moreover, the 76,345 nonredundant transcripts derived from the 25.79 Gb clean PacBio SMRT data produced in this study may facilitate future research on the physiology, biochemistry, and molecular genetics of avocado and related species.

A previous study indicated that lncRNAs may be important for gene regulation in eukaryotic cells, especially during some key biological processes [58]. However, the number of lncRNAs encoded in genomes as well as their characteristics remain largely unknown [59]. Predicting and functionally annotating lncRNAs is valuable but challenging because they are not orthologous and there is a lack of homologous sequences between closely related species [38]. Hence, the lncRNA information for one species is not suitable for predicting the lncRNAs in another species. Unfortunately, very few of the lncRNA functions have been elucidated [60,61]. In this study, 3596 avocado transcript sequences (accounting for 4.71% of the total number of nonredundant transcripts) were putatively predicted as lncRNAs. This almost completely uncharacterized gene pool may include genes associated with agronomically relevant traits related to the most economically valuable organ (mesocarp).

The accurate identification of avocado germplasm races is needed to ensure that germplasm collections are optimally used by plant breeders and farmers worldwide [1]. The traditional assignment of avocado races based on morphological traits is imprecise because of environmental effects and a limited number of applicable characteristics [17]. Molecular-based characterizations are more consistent and valid for assigning avocado genotypes. We previously confirmed the universality of six race-specific KASP markers [47]. These markers were used in the current study to identify avocado accessions with an unknown race, with implications for the application of available avocado germplasms for breeding and resource conservation. Interestingly, the KASP genotyping results revealed that all of the native avocado accessions included in this study are Guatemalan × West Indian hybrids. The reason for this observation might be related to the introduction of avocado cultivars and the climates of the sample collection regions. First, the major avocado cultivars grown commercially are typically hybrids of three races (i.e., mainly Guatemalan × West Indian and Guatemalan × Mexican hybrids) [1]. Since the late 1950s, Guatemalan × West Indian and Guatemalan × Mexican hybrids have been brought into China from other countries for cultivation in Southern China [9]. Second, the native avocado accessions included in the present study are mainly from three geographical regions, namely Nanning located in the central and southern region of Guangxi province, Danzhou and Baisha located in the central and western region of Hainan province, and Ruili located in the western region of Yunnan province. The central and southern region of Guangxi province and the central and western region of Hainan province are characterized by a warm and humid oceanic climate and a relatively low altitude. Although Ruili is located in the western region of Yunnan province and far from the ocean, it still has a subtropical monsoon climate. The climates of these three regions resemble that of the areas in which the West Indian races originated, and are favorable for the growth of Guatemalan × West Indian hybrids. Therefore, Guatemalan × West Indian hybrids may have gradually become the dominant native avocado accessions because of artificial selection or via naturally occurring crosses.

The 100 EST-SSR markers randomly selected for validation in the present study had an amplification rate of 67%, and 30 were determined to be polymorphic. This polymorphism level is generally consistent with that of our previous study [18]. In subsequent analyses of the genetic diversity of these polymorphic EST-SSR markers among 46 avocado accessions, 15 markers produced 4.73 alleles per locus, which was fewer than the 6.13 alleles per locus of Ge [18], the 11.40 alleles per SSR locus of Gross-German and Viruel [17], the 18.8 alleles per SSR locus of Schnell [16], and the 9.75 alleles per SSR locus of Alcaraz and Hormaza [62]. Additionally, a PIC value > 0.5 is generally considered to represent a high polymorphism rate [63]. In this study, 7 of 15 polymorphic EST-SSRs had a PIC value < 0.5. This result may have been because the 46 avocado accessions in this study are genotypically the same (Guatemalan × West Indian hybrids), with relatively low genetic diversity.

In this study, a cluster analysis and a PCoA grouped the native avocado accessions according to where they originated. However, some of the native avocado accessions derived from different regions were included in the same sub-cluster. For example, Renong No. 13 from Hainan province clustered with the native accessions from Guangxi province. One factor leading to this mixed clustering is the fact that avocado germplasm resources have been exchanged among researchers and breeders since the late 1980s. The CATAS, which is a national scientific research unit, was commissioned to popularize superior avocado accessions among breeders at adjacent state farms or at other national scientific research units. Some superior native accessions from the CATAS may be the male or female parent of other native accessions from various state farms or other national scientific research units, which is consistent with our study results. Furthermore, the cluster analysis grouped two native avocado accessions from Yunnan province with the native avocado accessions from Guangxi province. In contrast, our PCoA indicated that these two native avocado accessions from Yunnan province belong to the same group as the native avocado accessions from Hainan province. We speculate that the small number of native avocado accessions from Yunnan province (i.e., two) may have led to these contradictory results between the two statistical analyses. At many avocado plantations in Yunnan province, the local avocado accessions have been replaced by "Hass," which is the most economically valuable avocado cultivar, ultimately making it difficult to collect local avocado accessions. Thus, the challenge of maximizing the economic benefits of cultivating specific avocado cultivars while ensuring that avocado genetic resources are conserved will need to be addressed.

#### **5. Conclusions**

We annotated SMRT sequencing data based on the COG, GO, KEGG, KOG, Pfam, Swiss-Prot, eggNOG, and Nr databases. Among 71,627 transcripts, 45,134, 52,125, and 33,310 were annotated according to GO, eggNOG, and KEGG classifications, respectively. We detected 76,777 SSR loci in 42,096 transcript sequences and used them to develop 149,733 EST-SSR markers. From a randomly selected subset comprising 100 EST-SSR markers, we finally identified 15 polymorphic EST-SSR markers with a total of 71 alleles (2–10 alleles per locus). A cluster analysis and a PCoA separated the 46 avocado accessions according to their geographical origins. These 15 newly developed EST-SSR markers may be useful for future analyses of avocado accessions and may contribute to the improved management of avocado resources for germplasm conservation and breeding programs.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2073-4395/9/9/512/s1. Table S1. Sources of the 46 avocado accessions evaluated in this study. Table S2. KASP primer information and KASP genotyping results. Table S3. Gene annotations of the 71,627 avocado transcripts. Table S4. Characteristics of the GO annotation of avocado transcripts. Table S5. Characteristics of eggNOG classifications of avocado transcripts. Table S6. Characteristics of KEGG pathways of avocado transcripts. Table S7. Transcription factors identified in the avocado transcripts. Table S8. Frequencies of different repeat motifs in EST-SSRs from avocado. Table S9. Characteristics of avocado EST-SSR markers in this study. Table S10. Summary of 100 EST-SSR markers used for amplification. Table S11. Summary of 15 EST-SSRs in 46 avocado accessions. Additional file 1. Coding sequences predicted with TransDecoder.

**Author Contributions:** Y.G., R.Z., and W.M. conceived and designed the experiments; J.W., Y.L. (Yuanzheng Liu), and N.W. performed the experiments; L.T. and D.C. analyzed the data; Y.L. (Yanxia Li) helped complete the experiments; X.Z. contributed materials; and Y.G. wrote the manuscript.

**Funding:** This research was funded by the National Natural Science Foundation of China (grant number 31701883) and the Natural Science Foundation of Hainan Province of China (grant number 319QN266).

**Acknowledgments:** We gratefully acknowledge Pingzhen Lin from the Haikou Experimental Station of the Chinese Academy of Tropical Agricultural Sciences for supporting the collection of avocado resources. We thank Yajima for editing the English text of a draft of this manuscript.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## **Assessment of Genetic Diversity in Differently Colored Raspberry Cultivars Using SSR Markers Located in Flavonoid Biosynthesis Genes**

#### **Vadim G. Lebedev 1,2, Natalya M. Subbotina 1,2, Oleg P. Maluchenko 3, Konstantin V. Krutovsky 4,5,6,7,8,\* and Konstantin A. Shestibratov <sup>2</sup>**


Received: 4 August 2019; Accepted: 4 September 2019; Published: 6 September 2019

**Abstract:** Raspberry is a valuable berry crop containing a large amount of antioxidants, which correlates with the color of the berries. We evaluated the genetic diversity of differently colored raspberry cultivars using microsatellite markers developed from the flavonoid biosynthesis structural and regulatory genes. Among nine tested markers, seven were polymorphic. In total, 26 alleles were found at seven loci in 19 red (*Rubus idaeus* L.) and two black (*R. occidentalis* L.) raspberry cultivars. The most polymorphic marker was *RiMY01*, located in the MYB10 transcription factor intron region. Its polymorphic information content (PIC) equalled 0.82. The *RiG001* marker that previously failed to amplify in blackberry also failed in black raspberry. The raspberry cultivar clustering in the UPGMA dendrogram was unrelated to geographical and genetic origin, but significantly correlated with the color of berries. The black raspberry cultivars had a higher homozygosity and clustered separately from other cultivars, while at the same time they differed from each other. In addition, some of the raspberry cultivars with a yellow-orange color of berries formed a separate cluster. This suggests that there may not be a single genetic mechanism for the formation of yellow-orange berries. The data obtained can be used in future breeding programs to improve the nutritional qualities of raspberry fruits.

**Keywords:** flavonoid biosynthesis; fruit coloration; marker-assisted selection; microsatellites; *Rubus*

#### **1. Introduction**

The genus *Rubus* L. (Rosaceae, Rosoideae) is one of the most diverse in the plant kingdom and contains between 600 and 800 species grouped in 12 subgenera, which are widely distributed throughout the world from the lowland tropics to subarctic regions [1]. Among these species, red raspberry (*Rubus idaeus* L.) and blackberry (several species in the genus *Rubus*), grown worldwide, and black raspberry (*R. occidentalis* L.), grown mainly in the United States, are of the greatest economic importance. Their berries are in great demand due to their flavor, color, and taste. In addition, they are very healthy, providing a good source of antioxidants, including phenolic acids, flavonoids, anthocyanins, and carotenoids [2]. Berries contain four times more antioxidants than non-berry fruits, 10 times more than vegetables, and 40 times more than cereals [3]. For this reason, berries and their products (i.e., berry juice and jam) are very often recognized as "superfoods" [4]. The popularity of raspberry is indicated by the fact that its harvest increased 1.5 times from 2010 to 2017 worldwide and exceeded 800,000 tons [5]. Russia consistently ranks first in the world for raspberry production. The growing interest in raspberry has led not only to an increase in its production, but also to the expansion of breeding programs for the development of new cultivars. However, classical selection takes a lot of time: in red raspberry, it can take up to 15 years to develop and release a new cultivar [6]. Moreover, a specific feature of *Rubus* spp. breeding is that multiple species are often utilized in breeding programs [7]. Scientific achievements in molecular biology, and the use of molecular markers in particular, can accelerate the selection process, as they allow seedlings with valuable traits to be assessed at a much earlier stage. Molecular genetic markers provide more reliable cultivar identification of *Rubus* species than morphological markers [8].

In order to speed up the breeding process, it is useful to have genetic linkage maps containing information about the markers associated with the most important traits, including disease and pest resistance, plant habitus, nutritional and sensory fruit quality, and plant architecture. The first genetic linkage map of *Rubus* was constructed from a cross between two *Rubus* subspecies, *R. idaeus* (cv. Glen Moy) × *R. strigosus* (cv. Latham), in 2004 [9]. After that, other molecular maps for red raspberry [10–12], black raspberry [13], and tetraploid blackberry [14] appeared. Quantitative trait loci (QTL) have been identified for important traits including resistance to diseases [11,15] and pests [10], fruit anthocyanin content [16], growth characteristics [10,17], and fruit color and quality traits [18]. Currently, molecular markers are routinely used in breeding raspberries for resistance to Phytophthora root rot at the James Hutton Institute (UK), and two promising genotypes are under commercial trials, while markers for the quality of berries are in the process of validation [19]. Whereas the first reports used a combination of various types of molecular markers such as AFLP and simple sequence repeat (SSR) [9,10], RAPD, and RGAP [11], the most recent molecular maps were produced using only molecular markers designed from sequenced DNA, such as microsatellites or SSR markers [13,20]. SSRs are tandem repeats of 1–6-nucleotide-long motifs that are very frequent in genomes. They are highly polymorphic with high information content, co-dominant inheritance, locus specificity, extensive genome coverage, and simple detection using labelled primers that flank the microsatellite [9,21], and their ability to distinguish even closely related individuals is particularly important for many crop species [21]. Raspberry researchers have noted the benefits of SSR markers, but very few molecular markers exist for *Rubus* so far [7,22]. It should also be acknowledged that the breeding process can be accelerated using genomic selection (e.g., [23]), an approach under rapid adoption in many species, which is based on multiple marker–trait associations and does not require linkage maps.

The color of the berries not only affects their attractiveness but also serves as an indicator of the content of biologically active compounds. For example, the content of anthocyanins in raspberry berries varies widely from 2 to 325 mg/100 g depending on the color of the berries [24]. Flavonols and anthocyanins are synthesized in the flavonoid pathway, and its enzymes are well characterized. Kassim et al. [16] mapped QTLs for individual anthocyanin pigments in raspberry. The genes of various enzymes of flavonoid biosynthesis were also identified in red [18] and black [25] raspberry and blackberry [26]. Besides the structural genes, regulatory genes are important in the biosynthesis of flavonoids. The late flavonoid biosynthetic genes are activated by the ternary transcriptional MYB-bHLH-WD40 (MBW) complex comprising three classes of regulatory proteins including R2R3-MYBs, bHLHs, and TTG1 (WD40) [27]. Transcription factor genes, such as *MYB10*, *bHLH* and *bZIP*, have also been identified in the *Rubus* species [18,26].

Several studies have used random genomic SSR markers to assess genetic diversity in cultivars within [8,28] and between [29] different species. However, we are unaware of studies in which genetic diversity was assessed using markers located in genes of any metabolic pathway, and of flavonoid biosynthesis in particular. In this study, we developed SSR markers using nucleotide sequences of structural and regulatory genes of flavonoid biosynthesis in *Rubus* and *Fragaria* (strawberry) available at the National Center for Biotechnology Information (NCBI) GenBank database to test whether genetic variation associated with these genes correlates with variation in berry color. These markers were genotyped in 19 raspberry cultivars from different geographic regions (Russia, Poland, Italy, Switzerland, UK, and USA) and two cultivars of black raspberry. If alleles at these loci correlate with the content of biologically active substances, they could subsequently be used to optimize selection for valuable traits associated with color and, indirectly, with the content of flavonoids, by accelerating selection via screening genotypes at early stages.

#### **2. Materials and Methods**

#### *2.1. Plant Materials*

Nineteen cultivars of red raspberry (Amira, Anne, Babye Leto II, Beglyanka, Brilliantovaya, Bryanskoe Divo, Gerakl, Glen Ample, Marosejka, Meteor, Oranzhevoe Chudo, Pingvin, Polka, Poranna Rosa, Solnyshko, Sugana, Tarusa, Zheltyj Gigant, and Zolotaya Osen) and two cultivars of black raspberry (Cumberland and Jewel) were chosen to genotype SSR loci located in the flavonoid biosynthesis genes. These cultivars cover a wide range of fruit color, from yellow to black, and have various geographic and genetic origins, although cultivars of Russian origin from two raspberry breeding centers (Bryansk and Moscow) dominated the list (Table 1). Raspberry plants used in this study were kindly provided by Dr. I. A. Pozdniakov (OOO Microklon, Pushchino, Russia). Each cultivar was represented by a micropropagated (vegetatively propagated) line consisting of practically genetically identical plants. Therefore, a single specimen per cultivar was used for further DNA isolation and genotyping.


**Table 1.** Parentage and fruit color of the *Rubus* cultivars used in the study.

#### *2.2. Simple Sequence Repeat (SSR) Marker and Polymerase Chain Reaction (PCR) Primer Development*

The WebSat software [30] was used to detect SSR loci in the nucleotide sequences of *Rubus* and *Fragaria* × *ananassa* (the garden strawberry or simply strawberry, a widely grown hybrid species of the genus *Fragaria*) flavonoid biosynthesis genes available at the NCBI GenBank database (http://www.ncbi.nlm.nih.gov) (Table 2). The Primer 3 software (http://primer3.org) was used to design appropriate polymerase chain reaction (PCR) primers based on the sequences flanking the SSR loci. The minimum number of repeats used to select an SSR locus was nine for mono-nucleotide motifs, five for di-nucleotide motifs, three for tri- and tetra-nucleotide motifs, and two for penta- and hexa-nucleotide motifs. Primers were designed using the following criteria: primer length of 18–27 bp (optimally 22 bp), GC content of 40%–80%, annealing temperature of 57–68 °C (optimally 60 °C), and expected amplified product size of 100–400 bp. Primers for the *RiG001* locus were as in [8]. Primers were synthesized by Syntol Company (Moscow, Russia) and are summarized in Table 2.
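The repeat-count thresholds above can be illustrated with a short script. This is only a minimal sketch of perfect-repeat detection using the stated minimum repeat numbers; it is not the WebSat algorithm, and the function name `find_ssrs` and the test sequence are hypothetical.

```python
import re

# Minimum number of repeats per motif length, as stated in the text
MIN_REPEATS = {1: 9, 2: 5, 3: 3, 4: 3, 5: 2, 6: 2}

def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for motif_len, min_rep in MIN_REPEATS.items():
        # One motif copy followed by at least (min_rep - 1) further copies
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are themselves repeats of a shorter unit
            # (e.g. "AGAG" is reported under the di-nucleotide motif "AG")
            if any(motif == motif[:k] * (motif_len // k)
                   for k in range(1, motif_len) if motif_len % k == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // motif_len))
    return sorted(hits)

# Hypothetical sequence with an (AG)6 and a (GAG)4 repeat
example = "TTGC" + "AG" * 6 + "CCTA" + "GAG" * 4 + "T"
print(find_ssrs(example))
```

Scanning each motif length separately keeps the thresholds explicit; imperfect or compound repeats, which dedicated tools may also report, are not handled here.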

#### *2.3. DNA Isolation, PCR Amplification and Fragment Analysis*

A single DNA sample per cultivar was prepared from young expanding leaves of a single plant. Total genomic DNA was extracted using the CTAB method [31]. The quality and quantity of extracted DNA were determined with a NanoDrop 2000 spectrophotometer (ThermoFisher). The final concentration of each DNA sample was adjusted to 50 ng/μL in TE buffer before the PCR amplification.

For genotyping, PCR was performed separately for each primer pair using a forward primer labeled with the fluorescent dye 6-FAM and an unlabeled reverse primer (Syntol, Russia). The PCR amplification was performed in a total volume of 20 μL consisting of 50 ng of genomic DNA, 10 pmol of the labeled forward primer, 10 pmol of the unlabeled reverse primer, and PCR Mixture Screenmix (Eurogen, Russia). After an initial denaturation at 95 °C for 3 min, DNA was amplified for 33 cycles in a gradient thermal cycler (Bio-Rad, Hercules, CA, USA) programmed for a 30 s denaturation step at 95 °C, a 20 s annealing step at the optimal annealing temperature of the primer pair and a 35 s extension step at 72 °C. A final extension step was done at 72 °C for 5 min.

PCR reactions that generated clear, stable, and specific DNA fragments within the expected length range (200–400 bp) were considered successful amplifications. If a primer pair failed three times to amplify template DNA that was amplified with other primers, it was scored as a null genotype.

Separation of amplified DNA fragments was performed in an ABI 3130xl Genetic Analyzer using the S450 LIZ size standard (Syntol Company, Moscow, Russia). Peak identification and fragment sizing were done using the GeneMapper v4.0 software (Applied Biosystems, Foster City, CA, USA).

#### *2.4. Genetic Data Analysis*

Genetic parameters were calculated for 21 raspberry cultivars based on seven SSR polymorphic loci. The allele frequencies, number of alleles, observed (*Ho*) and expected (*He*) heterozygosities, and polymorphic information content (PIC) were calculated using the PowerMarker v.3.25 software [32]. This software was also used to estimate pairwise Nei's standard genetic distances between each pair of cultivars and to generate a UPGMA dendrogram, which was visualized using the Statistica software (TIBCO Software Inc., Palo Alto, CA, USA).
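As an illustration of the per-locus statistics named above, the sketch below computes observed heterozygosity, expected heterozygosity as 1 − Σp², and PIC from diploid genotypes. It assumes the simple (uncorrected) estimators and is not PowerMarker's implementation; the function name `locus_stats` and the allele sizes are hypothetical.

```python
from collections import Counter
from itertools import combinations

def locus_stats(genotypes):
    """genotypes: list of (allele_a, allele_b) tuples, one per diploid cultivar."""
    n = len(genotypes)
    counts = Counter(allele for pair in genotypes for allele in pair)
    freqs = [c / (2 * n) for c in counts.values()]
    # Observed heterozygosity: share of individuals carrying two different alleles
    ho = sum(1 for a, b in genotypes if a != b) / n
    # Expected heterozygosity (gene diversity), uncorrected for sample size
    he = 1.0 - sum(p * p for p in freqs)
    # PIC = 1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2
    pic = he - sum(2 * p * p * q * q for p, q in combinations(freqs, 2))
    return {"n_alleles": len(freqs), "Ho": ho, "He": he, "PIC": pic}

# Hypothetical fragment sizes (bp) for four cultivars at one SSR locus
print(locus_stats([(100, 100), (100, 102), (102, 102), (100, 102)]))
```

For a locus with two equally frequent alleles this gives *He* = 0.5 and PIC = 0.375, which shows why PIC is always somewhat lower than *He*.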



Optimal annealing temperature. Two or three SSRs in these loci were amplified simultaneously by a single pair of primers.

#### *Agronomy* **2019**, *9*, 518

#### **3. Results**

#### *3.1. Polymorphism and Genetic Diversity Analysis*

Nine SSR markers (six based on *Rubus* and three on *Fragaria* nucleotide sequences of the flavonoid biosynthesis genes) were used to estimate genetic diversity in 19 raspberry (*R. idaeus*) and two black raspberry (*R. occidentalis*) cultivars. All PCR primer pairs amplified one or two alleles. In raspberries, two loci (*RiTT01* and *FaAR01*) were monomorphic, and the other seven were polymorphic. In black raspberry cultivars, the *RiG001* locus was not amplified at all, six loci were monomorphic and only two were polymorphic (Table 2). In total, 26 alleles were found in seven polymorphic microsatellite loci. The number of alleles per locus varied from two per locus (*FaFS02* and *FaFL01*) to nine per locus (*RiMY01*) with an average number of 3.7 alleles per locus (Table 3). The *RiMY01* locus was the most polymorphic. In general, the SSR loci located in introns were more polymorphic than loci in exons.



There were cultivar-specific alleles, such as a unique allele 358 at the *RiMY01* locus found only in black raspberry, and alleles 267 and 269 at the *RhUF01* locus found only in the Meteor (red raspberry) and Jewel (black raspberry) cultivars, respectively. Meteor also contained a unique allele 333 at the *RiMY01* locus.

Parameters of genetic variation for seven polymorphic SSR loci in 21 *Rubus* cultivars are presented in Table 3. Expected heterozygosity (*He*) ranged from 0.05 in the *FaFS02* locus up to 0.84 in the *RiMY01* locus with an average value of 0.36. Observed heterozygosity was zero in the *RhUF01* locus and ranged from 0.05 in the *FaFS02* locus to 0.57 in the *RiMY01* locus with an average value of 0.29. The observed heterozygosity was lower than expected in four microsatellite loci and on average (Table 3). On average, the expected and observed heterozygosities were higher for the SSRs in introns (0.49 and 0.44, respectively) compared to the SSRs in exons (0.20 and 0.08, respectively). The average PIC was 0.332 and varied from 0.05 in the *FaFS02* locus to 0.82 in the *RiMY01* locus (Table 3).

#### *3.2. Cluster Analysis*

A UPGMA dendrogram was constructed for 21 raspberry cultivars based on seven SSR markers located in the flavonoid biosynthesis genes (Figure 1). The dendrogram clearly separates red and black raspberries. Among the red raspberry cultivars, a group of cultivars with yellow-orange colored berries (Anne, Poranna Rosa, Oranzhevoe Chudo, and Zolotaya Osen) forms a separate cluster. The same group also includes the Bryanskoe Divo cultivar with light red berries. At the same time, Zheltyj Gigant (yellow berries) and Beglyanka (orange berries) were not included in this group. Separation of cultivars did not follow their genetic origin. The cultivars Beglyanka, Solnyshko, and Meteor, which share the same genetic origin from the Kostinbrodskaya × Novost Kuzmina cross, were completely separated from each other. In addition, Babye Leto II, which also has this cross in its ancestry (Autumn Bliss × (September × (Kostinbrodskaya × Novost Kuzmina))), differed most from the other raspberry cultivars. Gerakl and Sugana, which both also have Autumn Bliss as a parent, were clearly separated. At the same time, close similarities were observed for cultivars from different geographic regions. No genetic differences were found between the Oranzhevoe Chudo (Russia) and Poranna Rosa (Poland) cultivars, or between the Amira (Italy) and Tarusa (Russia) cultivars, although they have different genetic origins. The Brilliantovaya and Pingvin cultivars, both obtained with the use of interspecific hybrids, were also identical.

**Figure 1.** The UPGMA dendrogram of the 21 *Rubus* cultivars based on pairwise Nei's standard genetic distances calculated using seven SSR markers located in the flavonoid biosynthesis genes. Left column shows the colors of the cultivar berries. Only bootstrap values larger than 50% are presented. See Table 1 for the full cultivar names.

#### **4. Discussion**

SSR markers (microsatellites) are widely used in genetic diversity studies, QTL and genetic mapping, marker-assisted selection (MAS), and cultivar identification, because they are multi-allelic, co-dominant, highly informative, relatively accurate and easily detected [33]. SSR markers have often been used for mapping different types of *Rubus* [9,13], for fingerprinting germplasm [34], and in studies of the genetic diversity and population structure within [28] and among [29] *Rubus* species. However, genetic diversity has not previously been studied in terms of any specific metabolic pathway genes that determine valuable breeding traits.

In this study, we report on the evaluation of a number of red and black raspberry cultivars using SSR loci representing known sequences of the flavonoid biosynthesis pathway genes, whose products synthesize biologically active substances with high antioxidant activity (flavonols and anthocyanins). Among these microsatellite loci, six (*RcFH01, FaFS01, FaFS02, RiAS01, FaAR01,* and *RhUF01*) were located in the structural genes of flavonoid biosynthesis (*F3H*, *FLS*, *ANS*, *ANR*, and *UFGT*) and two (*RiMY01* and *RiTT01*) in the regulatory genes (*MYB10* and *TTG1*). Flavanone-3-hydroxylase (F3H) is a key enzyme in flavonoid biosynthesis in plants, as it catalyzes formation of 3-hydroxy flavonol, a common precursor of anthocyanins, flavanols, and proanthocyanidins [35]. Particular attention was paid to the flavonol synthase gene, for which two loci were used. Flavonol synthase (FLS) is an important enzyme of the flavonoid pathway that catalyzes the formation of flavonols from dihydroflavonols, and thus may influence anthocyanin levels, as dihydroflavonols are intermediates in the production of both colored anthocyanins and colorless flavonols [36]. Anthocyanidin synthase (ANS) catalyzes the synthesis of anthocyanidin, the first colored compound in the anthocyanin biosynthetic pathway, from which anthocyanidin reductase (ANR) catalyzes the formation of proanthocyanidins (condensed tannins) [37]. The last common step in the production of stable anthocyanins is glycosylation by the enzyme UDP-glucose/flavonoid 3-O-glucosyl transferase (UFGT) [38].

In addition, we used loci in the sequences of two transcription factor genes (*MYB10* and *TTG1*) whose products belong to the MBW complex, which regulates the expression of the late biosynthetic genes [27]. For comparison, we also used a pair of primers designed for the *RiG001* locus using the sequence of the *R. idaeus* aromatic polyketide synthase (*RiPKS3*) gene, which was not amplified in blackberry cultivars [8]. The *RiPKS3* gene differed from the *RiPKS1* gene, encoding a typical chalcone synthase (CHS) catalyzing the first step of flavonoid biosynthesis, in four amino acid positions, and its product yielded in vitro predominantly p-coumaryltriacetic acid lactone and low levels of chalcone [39]. Within the PCR fragment amplified by the primers for the *RiG001* locus, the sequence of the *RiPKS3* gene (NCBI GenBank AF292369) differed from the *RiPKS1* gene sequence (AF292367) by a 2 bp deletion and a single nucleotide insertion. Three alleles (349, 350, and 351 bp) were obtained for this locus (Table 2).

In addition to the gene sequences of *Rubus* plants (*R. idaeus*, *R. coreanus*, and *R. hybrid*), we used gene sequences from *Fragaria* × *ananassa*, which is a close relative of *Rubus* from the same sub-family, Rosoideae. *Rubus* and *Fragaria* have the same base chromosome number (*x* = 7), similar morphology, and similar chloroplast and nuclear DNA phylogenies [13].

Of the economically important types of raspberry, we used 19 cultivars of red raspberry with a wide range of berry color from various breeding centers worldwide and two cultivars of black raspberry. Both species, red (*R. idaeus*) and black (*R. occidentalis*) raspberry, belong to the same subgenus *Idaeobatus* (raspberries) and are diploids (2*n* = 2*x* = 14), while blackberry species vary greatly in ploidy [34].

In our study, the average number of alleles for seven polymorphic SSR loci in the flavonoid biosynthesis genes was 3.71, the mean *Ho* and *He* were 0.286 and 0.360, respectively, and the mean PIC was 0.332. These values were generally lower than previously reported for *R. idaeus* [8] and *R. coreanus* [29], but quite comparable with the data for black raspberry cultivars [28]. Perhaps this is because red raspberry cultivars are, for the most part, complex hybrids with a limited genetic pool [34], and selection for berry quality has further reduced their diversity. The expected heterozygosity (*He*) was higher than the observed heterozygosity (*Ho*) both on average and in most individual loci. These data differ from other studies of *Rubus* species, where these parameters were approximately equal [8,29], or the observed heterozygosity was even higher [28]. However, unlike those studies, where population samples were used, this study used a collection of different cultivars, which is not a population sample but a mixture of genotypes with different genetic backgrounds and origins. Therefore, an excess of expected over observed heterozygosity is to be expected due to the Wahlund effect.
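The Wahlund effect mentioned above can be shown with a small numerical sketch. It assumes a biallelic locus and equally sized subpopulations that are each in Hardy-Weinberg equilibrium; the function name `wahlund` and the allele frequencies are hypothetical and are not taken from the study.

```python
def wahlund(p_subpops):
    """p_subpops: frequency of one allele in each equally sized subpopulation."""
    k = len(p_subpops)
    p_bar = sum(p_subpops) / k
    # Observed heterozygosity of the pooled sample: mean of within-group 2pq
    ho = sum(2 * p * (1 - p) for p in p_subpops) / k
    # Expected heterozygosity computed from the pooled allele frequency
    he = 2 * p_bar * (1 - p_bar)
    # Variance of allele frequency among subpopulations
    var_p = sum((p - p_bar) ** 2 for p in p_subpops) / k
    return ho, he, var_p

ho, he, var_p = wahlund([0.2, 0.8])
# The heterozygote deficit equals twice the variance in allele frequency
print(ho, he, he - ho, 2 * var_p)
```

Pooling groups with different allele frequencies therefore lowers *Ho* relative to *He* even when every group mates at random, which is the pattern expected for a mixed collection of cultivars.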

Only the *RiMY01* locus was highly polymorphic (PIC = 0.82). This locus had three SSR regions, two of which represent dinucleotide repeats. These data coincide with the results of Castillo et al. [8], in which all three highly informative markers (PIC = 0.78–0.82) represented dinucleotide repeats. In *R. coreanus*, among five highly polymorphic markers (PIC > 0.7), four represented dinucleotide repeats, and one trinucleotide repeats [29]. The high variation of the *RiMY01* locus can be explained by its location in the first intron of the transcription factor gene *MYB10*. SSR markers located in introns were more variable than those located in exons (expected and observed heterozygosities averaged 0.49 and 0.44 vs. 0.20 and 0.08, respectively). Our results are in agreement with those of Garcia-Gomez et al. [40], who showed that SSRs in introns had a higher level of heterozygosity than SSRs in exons in *Prunus* species (0.65 vs. 0.17, respectively). Similar results were also obtained in maize [41]. Significantly higher variation was also observed for SNPs in noncoding regions compared to coding ones [42]. In general, introns are more variable than exons, as they are under less selection pressure during the evolutionary process [43].

The lengths of most alleles at the *RiMY01* locus differ from each other by two nucleotide-long steps, which is consistent with the dinucleotide repeats of the SSR motifs in this locus. However, imperfect repeats also often occur in raspberry SSR loci. For instance, Fernandez et al. [34] have previously reported alleles whose lengths differ by consecutive one nucleotide-long steps in the *Rubus57a* and *Rub5a* markers. This single nucleotide stepwise variation is expected for *Rub5a*, which is an SSR marker with a mononucleotide motif, but *Rubus57a* is an SSR marker with a dinucleotide motif. We also observed a few alleles with imperfect repeats, such as the unique allele 267 of the *RhUF01* locus in the Meteor cultivar, for which the perfect allele size would be 270 following the stepwise allelic variation of the trinucleotide motif GAG.

The black raspberry cultivars were highly homozygous: six out of eight loci were monomorphic (Table 2). High homozygosity in black raspberry was also found earlier by Lewers and Weber [44], who noted that the level of homozygosity was 80% for black raspberry but only 40% for red raspberry. In an earlier study, 21 SSR loci were unable to distinguish between six of the black raspberry cultivars [28]. In this study, however, despite the small number of loci used, the black raspberry cultivars Cumberland and Jewel were well discriminated by two loci: *RcFH01* and *RhUF01*. The red raspberry cultivars were easily discriminated from the black raspberry cultivars by a unique black raspberry specific allele 358 at the *RiMY01* locus and by the allele 309 at the *RiAS01* locus, which occurred almost exclusively in the black raspberry cultivars, the only exception being the red raspberry cultivar Babye Leto II. In addition, the *RiG001* locus was not amplified in black raspberry. The same was observed in 48 previously tested blackberry cultivars [8]. Thus, with respect to this locus, the black raspberry is closer to the wild blackberry than to the red raspberry, although they belong to different subgenera. The lack of amplification of *RiG001* and the unique allele 358 at the *RiMY01* locus can be used to separate the black raspberry cultivars from the red ones.

Cluster analysis of the SSR markers located in the flavonoid biosynthesis genes showed a clear separation of the black raspberry (*R. occidentalis*) cultivars with black colored berries from the red raspberry (*R. idaeus*) cultivars with berries colored from yellow to dark red (Figure 1). It is also important to note that five cultivars with berries of similar light shades (three with yellow berries, one with orange, and one with light red) and of completely different origin still clustered together into one sub-group. Perhaps gene-targeted markers [45], such as SSR loci in the flavonoid biosynthesis genes, reflect genetic similarity for traits likely controlled or affected by these genes, such as berry color, better than random genomic SSR markers do.

Castillo et al. [8] found that the primocane fruiting (fall fruiting) raspberry cultivars were grouped into a separate cluster. Fernandez et al. [34] showed that the majority of primocane-fruiting material from various breeding programs, as well as some very early ripening floricane-fruiting genotypes, grouped into one cluster. This shows that cultivars can be grouped according to a particular trait regardless of their origin. At the same time, two cultivars with yellow and orange-colored fruits (Zheltyj Gigant and Beglyanka) fell into another group, together with cultivars with red-colored fruits. Perhaps a clearer separation requires additional, more polymorphic markers, including markers in other flavonoid biosynthesis genes not represented in this study.

Moreover, it is possible that the yellow color of raspberry fruits can arise through two or more mechanisms. For example, primocane fruiting cultivars were also distributed in two different groups [34]. The genetic mechanisms for the formation of yellow color in raspberry fruit have not yet been fully studied. Although assumptions on this topic were made back in the 1930s, it was not until 2016 that an inactive anthocyanidin synthase (ANS) allele was identified in yellow raspberry [46]. A 5 bp insertion in the coding region of the gene creates a premature stop codon, resulting in a truncated amino acid sequence of the defective ANS protein. However, other mechanisms are also possible, such as combinations of recessive and dominant alleles, or transcription factors, which may lead to a huge variety of berry colors in raspberry.

The clustering based on the flavonoid pathway genes also showed a lack of association between cultivars of related origin. This is the opposite of the results of analyses carried out with randomly selected SSR markers evenly distributed across the genome. For example, Fernandez et al. [34] demonstrated that one cluster is almost entirely composed of cultivars from the Scottish raspberry breeding program or cultivars based on their germplasm. From the point of view of MAS, the use of gene-targeted markers to assess genotypes for particular breeding traits is preferable to the use of random SSR markers. Graham et al. [9] suggested in 2004 that *Rubus idaeus*, due to its diploid set of chromosomes (2*n* = 2*x* = 14) and very small genome (275 Mb), may be used as a model species for the Rosaceae. For many years, this was impeded by the lack of a full-genome *Rubus* sequence, although the genomes of other Rosaceae species had already been sequenced, such as apple in 2010, strawberry in 2011, and pear and peach in 2013 [47]. However, the situation is changing, with the genomes of *R. occidentalis* [48] and *R. idaeus* [49] having recently been published. This will facilitate the development of gene-targeted markers that can advance breeding of *Rubus* for important traits, including those related to the nutritional value of the berries.

#### **5. Conclusions**

In this study, we demonstrated that a set of gene-targeted SSR markers representing structural and regulatory genes of flavonoid biosynthesis could potentially allow more informative and meaningful evaluation of the genetic relationship between different cultivars of red and black raspberries that reflect the color of their berries and possibly also their nutritional value. However, the study did not compare this set of gene-targeted markers with an analysis of the same germplasm set using neutral markers. A comparative analysis using a set of neutral SSR markers would seem to be important to support this particular conclusion. The developed primer set can be potentially used for MAS in the *Rubus* breeding programs for improving the nutritional quality of fruits. This first requires confirmation that the SSR alleles identified correlate with differences in the content of flavonoids. Additional studies and further development of these gene-targeted markers are needed to validate this approach.

**Author Contributions:** Conceptualization, V.G.L. and K.A.S.; Data curation, V.G.L., K.V.K. and K.A.S.; Formal Analysis, V.G.L. and O.P.M.; Funding Acquisition, V.G.L., K.V.K. and K.A.S.; Investigation, V.G.L., N.M.S., O.P.M. and K.A.S.; Methodology, V.G.L. and K.A.S.; Project Administration, V.G.L. and K.A.S.; Resources, V.G.L. and K.A.S.; Supervision, V.G.L. and K.A.S.; Writing, V.G.L., K.V.K. and K.A.S.

**Funding:** The work was financially supported by the Ministry of Education and Science of the Russian Federation (grant No. 14.574.21.0149 from 26.09.2017, unique project identifier RFMEFI57417X0149).

**Acknowledgments:** We thank I. A. Pozdniakov (OOO Microklon, Pushchino, Russia) for providing us with raspberry plants used in this study.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

#### *Communication*

## **SNP- and Haplotype-Based GWAS of Flowering-Related Traits in Maize with Network-Assisted Gene Prioritization**

#### **Carlos Maldonado 1, Freddy Mora 1,\*, Filipe Augusto Bengosi Bertagna 2, Maurício Carlos Kuki <sup>2</sup> and Carlos Alberto Scapim <sup>3</sup>**


Received: 18 September 2019; Accepted: 4 November 2019; Published: 7 November 2019

**Abstract:** Maize (*Zea mays* L.) is one of the most crucial crops for global food security. For this reason, many efforts have been undertaken to address the efficient utilization of germplasm collections. In this study, 322 inbred lines were used to link genotypic variations (53,403 haplotype blocks (HBs) and 290,973 single nucleotide polymorphisms (SNPs)) to corresponding differences in flowering-related traits in two locations in Southern Brazil. Additionally, network-assisted gene prioritization (NAGP) was applied in order to better understand the genetic basis of flowering-related traits in tropical maize. According to the results, the linkage disequilibrium (LD) decayed rapidly within 3 kb, with a cut-off value of r<sup>2</sup> = 0.11. In total, 45 and 44 marker-trait associations (for SNPs and HBs, respectively) were identified. Another important finding was the identification of HBs explaining more than 10% of the total variation. NAGP identified 44, 22, and 34 genes related to female flowering time, male flowering time, and anthesis-silking interval, respectively. The co-functional network approach identified four genes directly related to female flowering time (*p* < 0.0001): *GRMZM2G013398*, *GRMZM2G021614*, *GRMZM2G152689*, and *GRMZM2G117057*. NAGP provided new insights into the genetic architecture and mechanisms underlying flowering-related traits in tropical maize.

**Keywords:** gene prioritization; linkage disequilibrium; marker-trait association; tropical maize

#### **1. Introduction**

Maize (*Zea mays* L.) plays an important role in the human diet and accounts for a large proportion of the global cereal demand. Together, maize, rice, and wheat account for more than 40% and 35% of the world's calorie and protein supply, respectively [1,2]. Maize is among the few crops grown on almost every continent and has diverse uses, including food, animal feed, and ethanol production [3]. The United States, China, and Brazil are the three largest maize-producing countries in the world, representing more than 70% of total maize production [4].

Since maize is one of the most important crops for global food security, several efforts have been undertaken addressing the efficient utilization of germplasm materials. In fact, the development of maize germplasm collections has been beneficial to capture and maintain the high levels of genetic diversity that exist locally and globally [5–9]. These efforts have allowed the methodical exploration of the genetic architecture of complex traits in maize, which benefit from the high diversity [8]. Liu [10], for instance, performed a genome-wide association study (GWAS; a standard forward genetic technique) using a population comprised of a global core collection of maize inbred lines, and found several candidate genes associated with starch synthesis, of which one gene (*Glucose-1-phosphate adenylyltransferase*) is known as an important regulator of kernel starch content. Li et al. [11] identified several genetic variants associated with maize flowering time using an extremely large multigenetic background population (>8000 maize lines). The associated single nucleotide polymorphisms (SNPs) detected in this large panel exhibited high accuracy for predicting flowering time.

Forward and reverse genetic techniques have certain limitations: candidate genes derived from forward genetics studies often lack functional clues, and reverse genetics approaches lack in silico strategies for selecting candidate genes for targeted mutagenesis. In an effort to overcome these limitations, Lee et al. [12] recently presented a network-assisted gene prioritization system (MaizeNet), which facilitates genetic analysis by supporting candidate genes based on network neighbors with known traits or functions, and aids in identifying potential candidate genes that are highly likely to be causal to the phenotype of interest. This network-based resource provides new insights into the genetic architecture and mechanisms underlying complex traits in maize and promises to accelerate the discovery of trait-associated genes for crop improvement. In this study, an integrated approach using GWAS (based on 53,403 haplotype blocks (HBs) and 290,973 SNPs) and network-assisted gene prioritization was applied in order to better understand the genetic basis of flowering-related traits in tropical maize. To this end, marker-trait association analyses were performed using a multigenetic background population comprising 322 inbred lines of field corn, popcorn, and sweet corn.

#### **2. Materials and Methods**

#### *2.1. Trial Conditions and Phenotyping*

A total of 322 inbred lines of tropical maize were used in this genome-wide association study, which were derived from three genetic backgrounds collected in Brazil: Field corn (178), popcorn (128), and sweet corn (16). This maize panel was evaluated during the growing season of 2017–2018 in two locations (Cambira and Sabaudia) situated in Southern Brazil, Parana State. The experimental design was an alpha-lattice with 24 incomplete blocks and 3 replications per line. Female and male flowering time (FF and MF, respectively) were measured in each line as the number of days from sowing to anther extrusion from the tassel glumes (MF) or to visible silks (FF). Additionally, the anthesis-silking interval (ASI) was calculated as the difference between MF and FF.

#### *2.2. Population Structure, Linkage Disequilibrium (LD), and Haplotype Blocks*

Genomic DNA was isolated from young leaves of five plants from each inbred line of tropical maize (319 in Cambira and 293 in Sabaudia), approximately 30 days after germination. DNA was extracted with the cetyltrimethylammonium bromide (CTAB) method according to the protocol established by Chen and Ronald [13]. DNA quality and quantity were assessed using a 1% agarose gel and a NanoDrop spectrophotometer, respectively. The DNA samples were sent to the University of Wisconsin-Madison Biotechnology Center for SNP discovery via genotyping by sequencing (GBS), which is described in Elshire et al. [14] and Glaubitz et al. [15]. The raw database was filtered considering a minor allele frequency (MAF) > 0.05, resulting in a genotype file of 291,633 high-quality SNPs. The LD kNNi imputation (linkage disequilibrium k-nearest neighbor imputation) was performed to impute missing data in the dataset [16]. Finally, SNPs with a MAF < 0.01 and a proportion of missing data per location >90% were eliminated from the imputed dataset [17]. After this filtering for MAF and missing data, 290,973 SNPs were retained.
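The final filtering step can be sketched as follows. This is a minimal illustration, assuming genotypes coded as allele dosages (0, 1, 2) with `None` for missing calls, and assuming that a SNP is dropped when either criterion is violated; the function names and the toy data are hypothetical and do not reproduce the pipeline used in the study.

```python
def maf(genotypes):
    """Minor allele frequency from dosages (0, 1, 2); None marks missing calls."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)

def missing_rate(genotypes):
    """Proportion of lines without a genotype call."""
    return sum(g is None for g in genotypes) / len(genotypes)

def filter_snps(snps, min_maf=0.01, max_missing=0.90):
    """Keep SNPs with MAF >= min_maf and missing rate <= max_missing."""
    return {name: g for name, g in snps.items()
            if maf(g) >= min_maf and missing_rate(g) <= max_missing}

# Hypothetical toy data: one informative, one monomorphic, one mostly missing SNP
snps = {
    "snp1": [0, 0, 2, 2, None],
    "snp2": [0, 0, 0, 0, 0],
    "snp3": [None] * 19 + [1],
}
print(sorted(filter_snps(snps)))
```

Keeping the thresholds as parameters makes it easy to rerun the filter with the stricter MAF cut-off (0.05) applied to the raw data.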

The population structure was inferred using the model-based Bayesian clustering approach implemented in the program InStruct [18]. For each K value (where K is the number of genetically differentiated groups, K = 1–6), 10 runs were performed separately, each with 100,000 Markov chain Monte Carlo (MCMC) replicates and a burn-in period of 10,000 iterations. The optimal K value was determined as the one with the highest ΔK [19] and the lowest deviance information criterion (DIC).
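A ΔK calculation might look like the sketch below. It assumes that ΔK refers to the commonly used statistic based on the second-order rate of change of the log-likelihood across K, computed here from the run means and divided by the standard deviation across replicate runs; the function name `delta_k` and the log-likelihood values are hypothetical.

```python
from statistics import mean, stdev

def delta_k(lnp):
    """lnp: dict mapping K to a list of log-likelihoods from replicate runs."""
    out = {}
    for k in sorted(lnp):
        # Delta K is undefined for the smallest and largest K tested
        if k - 1 in lnp and k + 1 in lnp:
            second_diff = abs(mean(lnp[k + 1]) - 2 * mean(lnp[k]) + mean(lnp[k - 1]))
            sd = stdev(lnp[k])
            if sd > 0:
                out[k] = second_diff / sd
    return out

# Hypothetical log-likelihoods for two replicate runs per K
lnp = {1: [-1000, -1002], 2: [-900, -902], 3: [-890, -892], 4: [-885, -887]}
dk = delta_k(lnp)
print(max(dk, key=dk.get), dk)
```

Because the statistic needs both neighbors of K, it cannot be evaluated at the ends of the tested range, which is one reason to complement it with a criterion such as the DIC.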

The extent of LD was estimated using the squared correlation coefficient of the allelic frequencies (r<sup>2</sup>), considering all possible combinations of alleles. The critical r<sup>2</sup> value was calculated according to the method used by Breseghello and Sorrells [20].
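For a single pair of biallelic loci, r<sup>2</sup> can be computed as in the sketch below. The 0/1 coding per inbred line (treating each line as carrying one haplotype) and the function name `ld_r2` are assumptions for illustration.

```python
import numpy as np

def ld_r2(a, b):
    """r^2 between two biallelic loci coded 0/1 per inbred line (assumed coding)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    p_a, p_b = a.mean(), b.mean()
    d = (a * b).mean() - p_a * p_b          # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    # r^2 is undefined when either locus is monomorphic.
    return d * d / denom if denom > 0 else float("nan")
```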

The haplotype blocks (HBs) were constructed for each chromosome according to the confidence interval algorithm developed by Gabriel et al. [21], implemented in the software Haploview v.4.2 [22]. This method considers the 95% confidence intervals of the disequilibrium coefficient (D') values and builds a haplotype block if the LD is classified as "strong LD" (D' higher than 0.98 and a lower interval limit of >0.7). Finally, the HBs were transformed into multiallelic markers, considering the allelic combinations within each block to be independent alleles [5,23].
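The conversion of a haplotype block into a multiallelic marker can be sketched as follows. The input format (one string of SNP alleles per line and block) and the function name `haplotype_alleles` are hypothetical; the sketch only shows the recoding idea, in which each distinct SNP combination within a block is treated as a separate allele.

```python
def haplotype_alleles(block_genotypes):
    """Map each line's SNP combination within one block to an allele label.

    block_genotypes: list of per-line strings (or tuples) of SNP alleles
    inside one haplotype block, e.g. "AG".
    Returns (labels per line, dictionary haplotype -> label).
    """
    codes = {}
    labels = []
    for hap in block_genotypes:
        key = tuple(hap)
        if key not in codes:
            # Label haplotypes in order of first appearance: 1, 2, 3, ...
            codes[key] = len(codes) + 1
        labels.append(codes[key])
    return labels, codes
```

A block of two biallelic SNPs can therefore yield up to four alleles, which is what makes the block multiallelic.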

#### *2.3. SNP- and Haplotype-Based GWAS*

The HB- and SNP-based association analyses were performed using a mixed linear model (MLM) in TASSEL 3.0 and TASSEL 5.2, respectively [24], which accounts for the effects of the population structure (Q) and the genetic relationships, or kinship matrix (K), among inbred lines. The kinship matrix was calculated based on identity by state (IBS) [25] in TASSEL. The adjusted entry means from the general linear model (experimental design) were used as the adjusted phenotypes, according to Contreras-Soto et al. [26] and Arriagada et al. [27]. Correlations between each pair of traits were calculated using a Bayesian bi-trait model [28–30]. The statistical analysis was performed using the MCMCglmm package in R (version 3.6.1; https://www.r-project.org) [31].
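To make the idea of an IBS-based kinship matrix concrete, the sketch below computes a simple pairwise IBS similarity. This is one common definition and is not claimed to be the calculation performed by TASSEL; the genotype coding (0, 1, 2 allele counts without missing values) and the function name `ibs_kinship` are assumptions.

```python
import numpy as np

def ibs_kinship(genotypes):
    """Pairwise identity-by-state similarity between lines.

    genotypes: lines x SNPs array of allele counts (0, 1, 2), no missing
    values (assumed coding). Each entry is the mean over SNPs of
    1 - |g_i - g_j| / 2, so identical lines score 1.
    """
    g = np.asarray(genotypes, dtype=float)
    diff = np.abs(g[:, None, :] - g[None, :, :])  # lines x lines x SNPs
    return 1.0 - diff.mean(axis=2) / 2.0
```

The three-dimensional intermediate array is convenient for a small example but would be too large for hundreds of thousands of SNPs, where the computation would need to be done in chunks.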

#### *2.4. Prioritization of GWAS Candidate Genes and Inference of Co-Functional Networks for Flowering Traits in Maize*

The candidate genes were chosen from the genes around the significant loci (SNPs or haplotype blocks) identified by GWAS. To this end, a window of twice the distance indicated by the LD analysis was established, with the marker placed at the center of the window. Gene prioritization was performed using MaizeNet [12], based on the connections of the candidate genes to the genes in an estimated network of genes previously associated with flowering time in *Zea mays*. New candidate genes were then ranked by their closeness to the "guide genes" (derived from the estimated network in MaizeNet), measured for each candidate gene (derived from GWAS) as the sum of network edge scores from that gene to the guide genes [12]. The co-functional network was estimated by associating genes (candidate genes and genes identified by the MaizeNet prioritization) with subnetworks enriched for gene ontology annotations related to the biological processes (GOBP) of flowering in MaizeNet. Finally, a given gene was considered to be related to flowering time if the MaizeNet subnetworks were significantly associated with that gene and were also enriched for the relevant GOBP term for flowering.
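The window definition and the ranking by summed edge scores can be sketched as below. This is not MaizeNet code: the gene names, the edge-score dictionary format, and all function names are hypothetical, and the window is read as extending one LD-decay distance on each side of the marker (a total width of twice that distance).

```python
def candidate_window(marker_pos_bp, ld_decay_bp):
    """Window of total width 2 x LD-decay distance, centred on the marker."""
    start = max(0, marker_pos_bp - ld_decay_bp)
    return start, marker_pos_bp + ld_decay_bp

def genes_in_window(genes, window):
    """genes: dict name -> (start_bp, end_bp); keep genes overlapping the window."""
    lo, hi = window
    return [name for name, (s, e) in genes.items() if s <= hi and e >= lo]

def rank_by_guide_links(candidates, edges, guide_genes):
    """Score each candidate by the sum of edge scores linking it to guide genes.

    edges: dict frozenset({gene_a, gene_b}) -> score (hypothetical network format).
    """
    guide = set(guide_genes)
    scores = {}
    for c in candidates:
        scores[c] = sum(
            w for pair, w in edges.items()
            if c in pair and (pair - {c}) & guide
        )
    # Highest summed score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Edges that connect a candidate to a gene outside the guide set do not contribute to its score, so a candidate with many unrelated neighbours is not favoured.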

#### **3. Results and Discussion**

#### *3.1. Genetic Structure*

The Bayesian clustering analysis (InStruct) of the population structure indicated that the 322 inbred lines from the Brazilian germplasm represent two main genetic clusters (K = 2; Figure S1A), inferred from both the lowest DIC value and the second-order rate of change of the probability function with respect to K (ΔK) [19]. Cluster I contained 221 lines (68.6%), over 80% (177/221) of which were genotypes of field corn, while all sweet corn lines (16) were within this cluster. On the other hand, cluster II consisted of 101 lines, over 99% (100/101) of which were genotypes of popcorn (Figure S1). Similar results were obtained by the PCA method for this association mapping panel, as shown in Figure S1B. The first component explained 12.1% of the total variation, and most of the inbred lines were separated into the same genetically differentiated groups (Figure S1B). These results are in accordance with the previous findings of Maldonado et al. [5] and Coan et al. [6], in which tropical maize inbred lines were grouped into two genetically differentiated clusters, separating field corn and popcorn lines.

#### *3.2. Linkage Disequilibrium*

The genome-wide LD decay pattern is shown in Table 1 and Figure S2. The LD statistic r<sup>2</sup> showed a clear nonlinear trend with physical distance. According to these results, the LD decayed rapidly within 3 kb, with a cut-off value of r<sup>2</sup> = 0.11. The average LD on all chromosomes (Chr) was r<sup>2</sup> = 0.09. On the other hand, 0.63% of the total pairs of linked SNPs were in complete LD (r<sup>2</sup> = 1), and 4.4% had an r<sup>2</sup> value >0.5 (strong LD). The LD of Chr 3 and 7 decayed faster than that of the other chromosomes, within ~2.2 kb for a cut-off value of r<sup>2</sup> = 0.11. Past studies have found that this LD pattern (i.e., rapid decay with increasing physical distance) is typical in tropical maize germplasms [6,9,32]. The LD decay pattern in this study was similar to the findings of Yan et al. [33] and Coan et al. [6], who reported that LD in tropical maize germplasms decreases rapidly in the range of 0.1–10 kb.

**Table 1.** Summary of information on linkage disequilibrium (LD) and haplotype blocks (HBs) determined in inbred lines of tropical maize. Chr corresponds to the chromosome number; N° SNP indicates the number of single nucleotide polymorphisms (SNPs) detected on each chromosome; N° HB is the number of haplotype blocks; SizeHB and Max(kb) correspond to the maximum number of SNPs forming a haplotype block and the maximum size (in kb) of a haplotype block, respectively.


#### *3.3. Haplotype Blocks*

A total of 53,403 and 52,377 HBs were identified across all chromosomes for Cambira and Sabaudia, respectively (Table 1), of which approximately 47%, 20%, and 33% contained two, three, and four (or more) SNPs, respectively (Figure S3). These HBs were constructed considering 319 and 293 inbred lines in Cambira and Sabaudia, respectively, and 290,973 SNPs distributed on all chromosomes (Table 1). An average of ~20,000 SNPs per chromosome satisfied the criteria of the 95% confidence interval proposed by Gabriel et al. [21]. In particular, the largest number of HBs in both locations was formed by combinations of SNPs located on Chr 1, while the smallest number was formed by SNPs located on Chr 10 (Table 1). In this study, several genomic regions were detected in strong disequilibrium, extending up to ~0.1 Mb. Because these regions are in strong LD, they are likely to be inherited together across generations. Moreover, about 2.3% of the HBs formed in both locations extended over 0.1 Mb, with a D' value between 0.7 and 0.98 [21]. Analysis of the LD pattern enabled the identification and characterization of several HBs (or strongly linked genomic regions), because there is strong LD among the SNPs that compose them. This indicates that recombination events within these HBs are unlikely; thus, these HBs should be inherited together across generations.

#### *3.4. Genome-Wide Association Study and Network-Assisted Gene Prioritization*

A total of 45 SNP and 44 HB associations were identified for the studied traits, distributed across all maize chromosomes (Table 2 and Table S1). Four SNPs were jointly associated with the FF and MF traits. In Cambira, four haplotype blocks (two loci on Chr 8, bin 8.03, and two on Chr 9, bin 9.06) were jointly associated with FF and MF. In turn, Chr 3 presented two genomic regions associated with FF in Cambira (one SNP and one HB) and three in Sabaudia (one SNP and two HBs), while several SNPs (five) and HBs (three) on Chr 9 (bin 9.06) were associated with some of the three traits. Interestingly, all associations were environment-specific, confirming the existence of a significant and complex genotype-by-environment interaction. The results from the Bayesian bi-trait analyses showed a high correlation between FF and MF, which was significantly different from zero in both locations (r = 0.94 and 0.92), consistent with the fact that FF and MF share significant loci. In accordance with our findings, Xu et al. [34] found a large number of significant quantitative trait loci (QTL) on bins 1.03, 8.05, and 9.06 for photoperiod sensitivity and flowering time (traits highly correlated in maize [35]), while Chardon et al. [36], through a meta-analysis, detected hot-spot QTL regions for flowering time on bins 8.03 and 8.05. In addition, 64 QTLs related to maize flowering time were identified by Liu et al. [37], which were distributed on chromosomal bins 1.01, 1.03, 1.1, 2.02, 3.02, 3.04, 4.05, 6.06, 7.02, 7.03, 7.04, 8.03, 8.05, 9.01, and 9.07. Like these previous studies, this study also identified significant marker-trait associations on bins 1.01, 1.03, 1.1, 2.02, 3.02, 3.04, 4.05, 6.06, 7.02, 7.03, 7.04, 8.03, 8.05, 9.01, and 9.07. This result suggests that these regions contain important genes controlling flowering time in maize. In addition, chromosomes 8 and 9 had the main associations for all three traits, which is consistent with studies that considered other environments and genetic materials [34,36–38].

**Table 2.** Summary of the associations detected by a genome-wide association study (GWAS), based on haplotype blocks and SNPs, for the traits of female/male flowering time (FF and MF, respectively) and anthesis–silking interval (ASI) measured in inbred lines of tropical maize.


PV%: Percentage of the phenotype variation explained by the marker; NM: Number of significant marker-trait associations.

In Cambira, the proportion of the phenotypic variance (PV%) explained by SNP markers was ~6%, while haplotype blocks explained 6–17%, 5–13%, and 6–8% of the phenotypic variation of FF, MF, and ASI, respectively (Table 2 and Table S1). In Sabaudia, in contrast, the PV% explained by SNPs was similar to that explained by HBs. In Sabaudia, the PV% values were moderate for both SNPs and HBs, varying between 5 and 10%, while in Cambira, HBs showed higher PV% values (>10%) than SNPs. Moreover, the HB HChr9B2943 (in Cambira) was jointly associated with FF and MF, accounting for 17% and 13% of the total variation of FF and MF, respectively (Table 2 and Table S1). Several studies have reported PV% values for flowering time smaller than 10% [34,37,39,40]. In fact, numerous QTLs with small effects appear to contribute to genetic variation in flowering time across diverse maize germplasms [34,37,41]. In accordance with this, 93% (41/44) of the significant HB associations and all SNP associations explained no more than 10% of the total variation. Importantly, three HBs had PV% values higher than 10%, indicating the potential effectiveness of haplotypes over individual SNP analysis, an aspect emphasized by Maldonado et al. [5] and Contreras-Soto et al. [26]. Twenty-five of the 45 SNPs detected by GWAS (i.e., 56%) were found to be part of a haplotype block that was in turn significantly associated with a given trait. Moreover, 14 HBs contained at least 1 significant SNP, and 9 HBs contained 2 or 3 SNPs significantly associated with some trait. On the other hand, 68% (30/44) of the HBs detected did not contain any associated SNPs, which suggests that haplotype blocks are useful for discovering genomic regions that are not detected by SNP markers. In addition, the use of haplotype blocks in GWAS reduces the number of multiple tests compared with SNP-based association analysis [5]. Moreover, the use of haplotype blocks as multiallelic markers might improve marker-trait association analyses, compensating for the biallelic limitation of SNP markers [5,26].

Based on the physical positions in the maize reference genome (http://www.maizegdb.org//), 51 candidate genes were identified neighboring the significant SNPs and HBs (Table S1), of which 11 were associated with more than one trait (FF and MF) (Table S1). The network-assisted gene prioritization performed by MaizeNet [12] identified 100 additional genes based on biological processes of flowering and reproduction. Forty-four, 22, and 34 genes that were identified by MaizeNet are related to FF, MF, and ASI, respectively (Table S1). Co-functional networks determined by MaizeNet [12] are shown in Figure 1 and Figure S4. The co-functional networks identified the following genes directly related to FF: *GRMZM2G013398*, *GRMZM2G021614*, *GRMZM2G152689*, and *GRMZM2G117057*, with a statistical significance of *p* < 0.0001 (Figure 1). On the other hand, the co-functional networks of MF presented significances of 2.2 × 10<sup>−11</sup> and 2.3 × 10<sup>−5</sup> using HBs and SNPs, respectively. The gene *GRMZM2G013398* has an ortholog in *Arabidopsis thaliana* that encodes CONSTANS-LIKE 9 (COL9), which has light-controlled functions and is crucial to inducing the day-length specific expression of the *FLOWERING LOCUS T* (*FT*) gene in leaves [42]. The FT protein is the main component of florigen, which strongly influences the timing of flowering [43]. Notably, the CONSTANS protein strongly influences maize flowering time in response to photoperiod, and directly induces the transcription of *FT* genes in *Arabidopsis* [42,43]. On the other hand, the genes *GRMZM2G021614*, *GRMZM2G152689*, and *GRMZM2G117057* encode phosphatidylethanolamine-binding proteins (pebp9, pebp10, and pebp11, respectively), which play important roles in floral transition in angiosperms [44]. Moreover, Kikuchi et al. [45] and Wickland and Hanzawa [46] showed that the presence and structure of these genes, together with their roles in the regulation of flowering, are well conserved among cereal plants.

**Figure 1.** Co-functional networks estimated using genes identified by SNP- and haplotype-based GWAS (**A**,**B**, respectively), genes identified by network-assisted gene prioritization (in MaizeNet) for flowering time, and subnetworks enriched by gene ontology annotations related to the biological processes of female flowering (FF) in MaizeNet. (**A**) Gene *GRMZM2G013398* identified by the prioritization analysis (orange node highlighted with a bold border) connected with all the subnetwork genes of MaizeNet (white nodes) and genes associated with the ontology annotations related to flowering time (orange nodes). (**B**) Genes *GRMZM2G117057*, *GRMZM2G021614*, *GRMZM2G059358*, and *GRMZM2G152689* identified by the prioritization analysis (orange nodes highlighted with a bold border) connected with all the subnetwork genes of MaizeNet (white nodes).

#### **4. Conclusions**

In the present study, we identified several loci (SNPs and haplotype blocks) with variable contributions to phenotypic expression, which were located in regions that play important roles in the control of flowering time in maize. Compared with the SNP-based GWAS, the GWAS based on haplotype blocks was better at identifying loci with major effects. The co-functional network approach identified four genes that strongly influence the timing of flowering in tropical maize. In general, network-assisted gene prioritization provides new insights into the genetic architecture and mechanisms underlying flowering-related traits in tropical maize.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2073-4395/9/11/725/s1: Figure S1. Inferred population structure in a collection of maize germplasm (322 inbred lines). (**A**) Genetic structure inferred by a Bayesian clustering model using InStruct and a dendrogram carried out using the neighbor-joining (Nei's genetic distances). The light gray and dark gray indicate the proportion of the genome extracted from the two main genetic clusters estimated by InStruct. (**B**) Principal components analysis (PCA) with two major groups identified, which correspond closely to InStruct results. Values in parentheses indicate the percentage of variation explained by each main component; Figure S2. Linkage disequilibrium (LD) decay pattern in all chromosomes of maize. Chromosomes 3 and 7 decayed faster than the other chromosomes, while chromosome 4 presented the slowest decay (lower and upper margins, respectively); Figure S3. Frequency distribution of the size of haplotype blocks consisting of two or more SNPs, in the locations Cambira and Sabaudia; Figure S4. Co-functional networks estimated using genes identified by SNP- and Haplotype-based GWAS (**A** and **B**, respectively), genes identified in prioritization of MaizeNet for flowering time and subnets enriched by gene ontology annotations related to the biological processes of male flowering (MF) in MaizeNet. White nodes represent all the subnetwork genes of MaizeNet, orange nodes are genes associated with the ontology annotations related to flowering time, and nodes highlighted with bold borderline correspond to genes identified by GWAS or the prioritization analysis; Table S1. Details of the associations and candidate genes detected in SNP- and Haplotype-based GWAS for the traits of Female/Male Flowering time (FF and MF, respectively) and Anthesis–Silking Interval (ASI) measured in inbred lines of tropical maize in two locations (Cambira and Sabaudia).

**Author Contributions:** Conceptualization, F.M., C.M., M.C.K., and F.A.B.B.; methodology, F.M., C.A.S., and C.M.; software, C.M., M.C.K., and F.A.B.B.; validation, F.M., C.M., M.C.K., F.A.B.B., and C.A.S.; formal analysis, F.M., C.A.S., and C.M.; investigation, M.C.K., F.A.B.B., and C.A.S.; resources, F.M. and C.M.; writing—original draft preparation, F.M. and C.M.; writing—review and editing, F.M. and C.M.; visualization, M.C.K., F.A.B.B., and C.A.S.; supervision, F.M.; project administration, C.A.S.; funding acquisition, C.A.S.

**Funding:** This research was funded by the National Council of Technological and Scientific Development (CNPq) and the Coordination for the Improvement of Higher Education Personnel (CAPES).

**Acknowledgments:** Freddy Mora thanks FONDECYT (grant number 1170695). Carlos Maldonado thanks CONICYT-PCHA/Doctorado Nacional/2017-21171466.

**Conflicts of Interest:** The authors declare no conflicts of interest. The funders had no role in the design of the study, in the collection, analyses, or interpretation of data, in the writing of the manuscript, or in the decision to publish the results.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Identification and Verification of Quantitative Trait Loci Affecting Milling Yield of Rice**

#### **Hui Zhang 1,2,3, Yu-Jun Zhu 2, An-Dong Zhu 2, Ye-Yang Fan 2, Ting-Xu Huang 3, Jian-Fu Zhang 3,\*, Hua-An Xie 1,3,\* and Jie-Yun Zhuang 2,\***


Received: 25 November 2019; Accepted: 2 January 2020; Published: 5 January 2020

**Abstract:** Rice is generally consumed in the form of milled rice. The yield of total milled rice and head milled rice is affected by both the paddy rice yield and the milling efficiency. In this study, three recombinant inbred line (RIL) populations and one F4:5 population derived from a residual heterozygous (RH) plant were used to determine quantitative trait loci (QTLs) affecting the milling yield of rice. Seven traits were analyzed, including recovery of brown rice (BR), milled rice (MR) and head rice (HR); grain yield (GY); and the yield of brown rice (BRY), milled rice (MRY) and head rice (HRY). A total of 77 QTLs distributed over 35 regions were detected in the three RIL populations. Four regions, where *qBR5*, *qBR7*, *qBR10*, and *qBR12* were located, were validated in the RH-derived F4:5 population. In the three RIL populations, all 11 QTLs detected for GY were accompanied by QTLs for two or all three milling yield traits. For these QTLs, not only was the allele direction for the milling yield traits unchanged, but the effects were also consistent with those for GY. In the RH-derived F4:5 population, regions controlling GY also affected all three milling yield traits. The results indicated that variations in BRY and MRY were mainly ascribed to GY, whereas HRY was determined by both GY and HR. The results also showed that the regions covering the *GW5*–*Chalk5* and *Wx* loci had major effects on the milling quality and milling yield of rice. These two regions, which are known to affect multiple traits determining the grain quality and yield of rice, provide good candidates for the improvement of milling yield.

**Keywords:** brown rice recovery; milled rice recovery; head rice recovery; milling yield traits; QTL mapping; rice (*Oryza sativa* L.)

#### **1. Introduction**

As a major cereal crop, rice (*Oryza sativa* L.) provides the staple food for at least half of the global population. Rice is mainly consumed in the form of cooked milled rice. Farmers' incomes are based on both the paddy rice yield and the milling efficiency. In postharvest processing, paddy grains are first de-hulled into brown rice and then milled into milled rice. Milled rice is separated into head rice (also called whole rice) and broken rice. Milled rice whose length is longer than or equal to 3/4 of its unbroken length falls into the category of head rice, and the rest is called broken rice. Head rice has a higher price than broken rice [1]. Three parameters, recovery of brown rice (BR), milled rice (MR) and head rice (HR), are used to evaluate rice milling quality and the efficiency of the milling process [2–4]. Generally, BR is defined as the percentage of brown rice to grains, MR the percentage of milled rice to grains, and HR the percentage of head rice to grains. Alternatively, MR is measured as the percentage of milled rice to brown rice, and HR the percentage of head rice to milled rice [5]. In some reports, these traits are called percent milling yield, brown rice yield, milled rice yield, total milled yield, or head rice yield [3,6–8].

In the past two decades, analysis of quantitative trait loci (QTLs) was employed to study the genetic basis of rice milling quality. A large number of QTLs were identified using various populations developed from crosses within the same subspecies [3,9], between the *indica* and *japonica* subspecies [6,10–12], or between different species [2]. In these studies, clustering of QTLs for different milling traits was commonly observed. Some of them detected only one cluster. For example, Tan et al. [9] located QTLs for MR and HR in the C1087–RZ403 region on chromosome 3; Aluko et al. [2] detected *br8* and *hr8* in RM126–RM137 on chromosome 8; and Lou et al. [12] found *qMRR-3* and *qHRR-3* in RM3204–RM6283 on chromosome 3. Other studies detected more clusters. Li et al. [6] located *qBR-4* and *qMR-4* in the C975–C734 interval on chromosome 4, *qBR-9* and *qMR-9* in R1751–R2272 on chromosome 9, *qBR-10* and *qMR-10* in C488–R716 on chromosome 10, and *qBR11* and *qMR11* in R728–G202 on chromosome 11. Zheng et al. [11] detected *QBr6* and *QMr6* in the *Wx* region on chromosome 6, and *QBr7* and *QMr7* in RM505–RM118 on chromosome 7.

When QTL mapping for milling quality traits and components of grain yield was performed using the same population, a proportion of genomic regions were found to be associated with both types of traits. Using 240 backcross introgression lines derived from the Ce258/IR75862 cross, three QTL regions were found to simultaneously affect components of milling quality and grain yield [5]. In the RM71–RM300 interval harboring *qBR2*, the Ce258 allele increased BR and panicle weight. In RM348–RM349 harboring *qBR4*, the Ce258 allele increased BR and grain weight but decreased spikelet number. In RM250–RM482 harboring *qHR2*, the Ce258 allele increased HR, grain number and spikelet number but decreased grain weight. Using 205 recombinant inbred lines (RILs) developed from the L-204/01Y110 cross, two QTL regions were found to simultaneously affect milling traits and grain weight [8]. For QTLs located in RM5638–RM1361 on chromosome 1, the L-204 allele increased BR and HR but decreased grain weight. For QTLs linked to RM3283 on chromosome 10, the L-204 allele increased HR but decreased grain weight. These results provide evidence for a genetic association between milling quality and grain yield in rice, but it remains unknown whether this association has a consequence for the yield of milled and head rice.

In the present study, QTL analysis was employed to determine the dependence of milled and head rice yield on milling quality and grain yield. Firstly, three RIL populations were used to detect QTLs for brown, milled and head rice recovery; grain yield; and brown, milled and head rice yield. Then, QTL validation was performed using one secondary population derived from a residual heterozygote (RH) identified from one of the RIL populations.

#### **2. Materials and Methods**

#### *2.1. Plant Materials*

Four populations of *indica* rice (*Oryza sativa* subsp. *indica*) were used, including three RIL populations and one RH-derived F4:5 population.

The three RIL populations were previously used to detect QTLs for components of appearance quality and physicochemical traits for eating and cooking quality [13–15]. The TI population, consisting of 204 lines, was constructed from crosses between Teqing (TQ) and IRBB lines. The IRBB lines are near-isogenic lines carrying different bacterial blight resistance genes [16] in the background of IR24, including IRBB50, IRBB51, IRBB52, IRBB54, IRBB55, and IRBB59. The ZM population, consisting of 230 lines, was constructed from a cross between Zhenshan 97 and Milyang 46 (MY46). The XM population, consisting of 209 lines, was constructed from a cross between Xieqingzao and MY46. All the parental lines of these populations have been widely used in the breeding and production of three-line hybrid rice in China.

The RH-derived F4:5 population was previously used to validate minor QTL for gel consistency [15]. It was derived from one RH plant that was an F7 progeny of the cross TQ/IRBB52. Of the 135 polymorphic markers included in the TI map, 33 were heterozygous and 102 were homozygous in the RH plant. This plant was selfed to produce an F2-type population consisting of 250 individuals. Single seed descent was applied to advance the population to F4. Seeds of the F4 plants were harvested and a population consisting of 250 F4:5 families was constructed.

#### *2.2. Field Experiment and Trait Measurement*

All the populations were planted in the middle rice growing season (from May to October) at the China National Rice Research Institute (CNRRI), Hangzhou (30°04′ N, 119°54′ E), China. The three RIL populations were tested for 2 years, including 2008 and 2009 for TI, 2009 and 2010 for ZM, and 2003 and 2009 for XM. The RH-derived F4:5 population was tested for 1 year in 2017. The experiments followed a randomized complete block design. Twelve plants per line were transplanted in a single row with 16.7 cm between plants and 26.7 cm between rows. Field management followed common practice in rice production.

At maturity, five plants from the 10 middle plants of each line were randomly sampled and harvested. The grains were dried and weighed to calculate grain yield per plant (GY, g). Dried grains were stored at room temperature for three months. Then, two replicates of filled grains were processed independently. A 100 g sample of filled grain was de-husked using a Satake Rice Machine (Suzhou, China). The brown rice was milled using a JNMJ3 rice miller (Taizhou, China). Head rice, whose length was longer than or equal to 3/4 of the full length, was separated from broken rice. Three traits for milling quality, i.e., BR, MR and HR, were calculated as the percentage of the weight of brown, milled and head rice to grain weight, respectively. Three traits for milling yield, i.e., brown, milled and head rice yield per plant, were calculated as follows:

$$\text{Brown rice yield per plant (BRY, g)} = \text{BR} \times \text{GY} \tag{1}$$

$$\text{Milled rice yield per plant (MRY, g)} = \text{MR} \times \text{GY} \tag{2}$$

$$\text{Head rice yield per plant (HRY, g)} = \text{HR} \times \text{GY} \tag{3}$$
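Equations (1)–(3) can be illustrated with a short sketch. It assumes that the recovery traits are expressed as percentages of grain weight and are therefore divided by 100 before being multiplied by GY; the function names and the example values in the usage note are hypothetical.

```python
def recovery_pct(fraction_weight_g, grain_weight_g=100.0):
    """Recovery (%) of a milling fraction relative to the grain sample weight."""
    return 100.0 * fraction_weight_g / grain_weight_g

def milling_yields(gy_g, br_pct, mr_pct, hr_pct):
    """Return (BRY, MRY, HRY) in g per plant from GY (g) and recoveries (%)."""
    return tuple(gy_g * pct / 100.0 for pct in (br_pct, mr_pct, hr_pct))
```

For instance, a hypothetical line with GY = 30 g and recoveries of 80%, 70% and 50% would be evaluated with `milling_yields(30.0, 80.0, 70.0, 50.0)`.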

#### *2.3. Marker Data and Genetic Maps*

Marker data and genetic maps of the four populations have been reported previously [15]. The TI, ZM and XM maps were 1345.3, 1814.7 and 2080.4 cM in length, consisting of 135, 256 and 240 DNA markers, respectively. Genomic coverage and distances between neighboring markers are satisfactory for primary QTL mapping in the ZM and XM populations. A number of large homozygous segments remain in the TI map due to low polymorphism between the female and male parents of the TI population. For the RH-derived F4:5 population, the map consisted of 35 markers, including 28 simple sequence repeats, six InDels and one single nucleotide polymorphism.

#### *2.4. Data Analysis*

For the three RIL populations that were tested for 2 years, phenotypic data averaged over the 2 years were used for computing the descriptive statistics, plotting the frequency distribution and calculating the Pearson correlation coefficient, and the data of each year were used for QTL mapping. QTL analysis was performed using the default setting of the MET (multi-environmental trials) approach in IciMapping V4.1 [17], taking the 2 years for each population as two environments. *LOD* thresholds for a genome-wide type I error of *p* < 0.05 were calculated with 1000 permutation tests and used to claim a putative QTL. QTL effects and the proportions of phenotypic variance explained (*R*<sup>2</sup>) were estimated. When a QTL was shown to have a significant genotype-by-environment (GE) interaction, the effect and *R*<sup>2</sup> due to GE interaction were also measured. QTLs were designated as proposed by McCouch and CGSNL [18].
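The logic of a permutation-derived genome-wide threshold can be sketched as follows. This is not the procedure implemented in IciMapping: for simplicity, the sketch substitutes a single-marker regression LOD for the mapping model, and the function names, genotype coding and default arguments are assumptions. Phenotypes are shuffled relative to genotypes, the maximum LOD across markers is recorded for each shuffle, and the threshold is the upper quantile of these maxima.

```python
import numpy as np

def single_marker_lod(genotype, phenotype):
    """LOD for a simple regression of phenotype on one marker: -(n/2) log10(1 - r^2)."""
    r = np.corrcoef(genotype, phenotype)[0, 1]
    n = len(phenotype)
    return -(n / 2.0) * np.log10(1.0 - r * r)

def permutation_lod_threshold(genotypes, phenotype, n_perm=1000, alpha=0.05, seed=0):
    """Genome-wide LOD threshold: the (1 - alpha) quantile of the maximum LOD
    across markers over n_perm shuffles of the phenotype."""
    rng = np.random.default_rng(seed)
    g = np.asarray(genotypes, dtype=float)   # lines x markers, polymorphic markers only
    y = np.asarray(phenotype, dtype=float)
    max_lods = np.empty(n_perm)
    for i in range(n_perm):
        y_perm = rng.permutation(y)          # break the genotype-phenotype link
        max_lods[i] = max(
            single_marker_lod(g[:, j], y_perm) for j in range(g.shape[1])
        )
    return float(np.quantile(max_lods, 1.0 - alpha))
```

Taking the maximum across markers within each permutation is what makes the resulting threshold genome-wide rather than per-marker.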

For the RH-derived F4:5 population that was tested for 1 year, QTLs were determined with the BIP (bi-parental populations) approach in IciMapping V4.1 [17]. *LOD* > 2.0 was used as the threshold to claim a putative QTL. QTLs were designated as proposed by McCouch and CGSNL [18].

#### **3. Results**

#### *3.1. Phenotypic Performance of the Three RIL Populations*

Descriptive statistics of the seven traits in the three RIL populations are presented in Table S1. Two of the seven traits, BR and MR, showed similar coefficients of variation (CV) among the three populations, ranging from 0.0116 to 0.0119 and 0.0140 to 0.0180, respectively. The CV of the five other traits were higher in the ZM and XM populations than in the TI population, ranging from 0.1266 to 0.2642 in ZM, 0.1288 to 0.2942 in XM, and 0.0967 to 0.1871 in TI (Table S1). Continuous distributions were observed for all the traits in the three populations (Figure S1), suggesting polygenic inheritance of these traits.
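For reference, the coefficient of variation reported here is a unitless ratio of the standard deviation to the mean. A minimal sketch is given below; the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not state which estimator was used.

```python
import numpy as np

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (unitless)."""
    v = np.asarray(values, dtype=float)
    return float(np.std(v, ddof=1) / np.mean(v))
```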

Correlations between the seven traits were either non-significant or positive and highly significant (*p* < 0.01) (Table 1). Regarding the three traits for milling quality, the correlation was strong between BR and MR but weak between these two traits and HR. The correlation coefficients (*r*) between BR and MR ranged from 0.764 to 0.815 in the three populations, but their correlations with HR were either non-significant or had low *r* values (0.230–0.363). These results suggest that the control of properties for maintaining whole milled rice may differ greatly from that for achieving high brown and milled rice recovery.


**Table 1.** Simple correlation coefficients between seven traits in three RIL populations of rice.

TI = Teqing/IRBB lines; ZM = Zhenshan 97/Milyang 46; XM = Xieqingzao/Milyang 46; BR = Brown rice recovery; MR = Milled rice recovery; HR = Head rice recovery; GY = Grain yield per plant; BRY = Brown rice yield per plant; MRY = Milled rice yield per plant; HRY = Head rice yield per plant; \*\*, *p* < 0.01.

Regarding the four yield traits, GY, BRY, MRY and HRY, not only were the correlations all significant, but the coefficients were also all high. Near-perfect correlation was observed between GY, BRY and MRY, with the *r* values ranging from 0.996 to 0.999. Lower *r* values were found between these three traits and HRY, ranging from 0.816 to 0.826, 0.902 to 0.907 and 0.900 to 0.907 in the TI, ZM and XM populations, respectively. These results suggest that GY is the main source of variation for BRY, MRY and HRY, and that postharvest processing has a more significant influence on HRY than on BRY and MRY.
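The near-perfect correlation between GY and BRY follows from the product structure of the yield traits: when one factor (BR) varies very little relative to the other (GY), the product tracks GY almost exactly. The simulation below illustrates this point only; the means, standard deviations, independence between traits, and the function name are all assumptions, chosen so that the coefficients of variation are of the same order as those reported above.

```python
import numpy as np

def simulate_yield_correlations(n_lines=200, seed=0):
    """Simulate GY, BR and HR independently and return (r[GY,BRY], r[GY,HRY])."""
    rng = np.random.default_rng(seed)
    gy = rng.normal(25.0, 5.0, n_lines)    # grain yield (g), CV about 0.20 (assumed)
    br = rng.normal(80.0, 0.95, n_lines)   # brown rice recovery (%), CV about 0.012 (assumed)
    hr = rng.normal(55.0, 8.0, n_lines)    # head rice recovery (%), CV about 0.15 (assumed)
    bry = br / 100.0 * gy
    hry = hr / 100.0 * gy
    r_gy_bry = float(np.corrcoef(gy, bry)[0, 1])
    r_gy_hry = float(np.corrcoef(gy, hry)[0, 1])
    return r_gy_bry, r_gy_hry
```

Under these assumptions the correlation between GY and BRY stays very close to 1, while the correlation between GY and HRY is clearly lower, because HR contributes a much larger share of the variation in the product.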

Regarding the three pairs of corresponding traits for milling quality and yield, the correlation was stronger between HR and HRY than between the other two pairs of traits. The *r* values ranged from 0.505 to 0.710 between HR and HRY in the three populations, compared with 0.055–0.292 between BR and BRY and 0.136–0.425 between MR and MRY. These results also suggest that postharvest processing has a greater influence on HRY than on BRY and MRY.
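The simple correlation coefficients discussed in this section are Pearson's *r*. The sketch below shows how *r* and its associated *t* statistic (with *n* − 2 degrees of freedom) could be computed for a pair of traits. The data are hypothetical, and the function names are illustrative rather than taken from the software used in the study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length >= 3")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r = 0; compare against t with n - 2 df."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Hypothetical per-line values: grain yield and head rice yield (g per plant).
gy = [21.4, 25.0, 19.8, 23.1, 27.6, 22.3]
hry = [11.0, 13.9, 10.7, 11.2, 14.8, 12.5]
r = pearson_r(gy, hry)
print(f"r = {r:.3f}, t = {t_statistic(r, len(gy)):.2f}")
```

A near-perfect correlation such as that between GY and BRY makes the denominator `1 - r * r` very small, so the *t* statistic becomes very large; this matches the high significance reported for those trait pairs.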

#### *3.2. QTL Detected in the Three RIL Populations*

In the TI, ZM and XM populations, a total of 27, 33 and 17 QTLs were detected for the seven traits analyzed, of which one, four and none showed significant GE interactions, respectively (Figure 1; Tables 2–4). Based on the physical position of DNA markers, it was found that all the five QTLs having significant GE effects were located in the region covering the *Wx* locus on the short arm of chromosome 6.

**Figure 1.** Genomic distribution of QTLs for seven traits detected in three RIL populations. TI = Teqing/IRBB lines; ZM = Zhenshan 97/Milyang 46; XM = Xieqingzao/Milyang 46. Marker positions in each chromosome are indicated by solid lines and the distances are in proportion to the physical length. Solid rectangles refer to the approximate positions of centromeres. QTLs are drawn on the left side of the corresponding interval. Significant genotype-by-environment interaction is indicated by the number "1".


**Table 2.** QTLs for seven traits detected in the TI population.

QTLs are designated as proposed by McCouch and CGSNL [18]. *A*: additive effect of replacing a maternal with a paternal allele; *ge*: effect due to genotype-by-environment interaction; *R*<sup>2</sup>: percentage of phenotypic variance explained by the additive or GE effect.


**Table 3.** QTLs for seven traits detected in the ZM population.



QTLs are designated as proposed by McCouch and CGSNL [18]. *A*: additive effect of replacing a maternal with a paternal allele; *ge*: effect due to genotype-by-environment interaction; *R*<sup>2</sup>: percentage of phenotypic variance explained by the additive or GE effect.


**Table 4.** QTLs for seven traits detected in the XM population.

QTLs are designated as proposed by McCouch and CGSNL [18]. *A*: additive effect of replacing a maternal with a paternal allele; *R*<sup>2</sup>: percentage of phenotypic variance explained by the additive effect.

#### *3.3. QTLs Detected in the TI Population*

The 27 QTLs identified in the TI population were distributed across nine of the 12 rice chromosomes (Figure 1, Table 2). Numbers of QTLs detected for BR, MR, HR, GY, BRY, MRY, and HRY were 8, 3, 2, 4, 4, 4, and 2, having overall *R*<sup>2</sup> of 59.55%, 21.93%, 23.40%, 25.05%, 26.13%, 25.49%, and 12.93%, respectively. Twenty-two of these QTLs formed six clusters distributed on chromosomes 2, 3, 4, 5, 6, and 8.

The largest cluster consisted of five QTLs, followed by three clusters of four QTLs. The 14 QTLs detected for grain yield and the three milling yield traits were all included in these four clusters. In the RM437–RM18189 region on chromosome 5, the TQ allele increased BR, MR, GY, BRY, and MRY by 0.51%, 0.39%, 1.30 g, 1.25 g, and 1.13 g, respectively (Table 2). In the RM6–RM240 region on chromosome 2, the TQ allele increased GY, BRY, MRY, and HRY by 1.89 g, 1.54 g, 1.30 g, and 1.21 g, respectively. In the RM6992–RM349 region on chromosome 4, the TQ allele decreased GY, BRY, MRY, and HRY by 1.44 g, 1.14 g, 0.99 g, and 0.89 g, respectively. In the RM547–RM22755 region on chromosome 8, the TQ allele increased BR, GY, BRY, and MRY by 0.20%, 1.10 g, 0.98 g, and 0.79 g, respectively.

The fifth cluster consisted of three QTLs, which were located in the RM190–RM587 region covering the *Wx* locus [19] on chromosome 6. The TQ allele increased BR, MR and HR by 0.30%, 0.28% and 1.40%, respectively. The sixth cluster consisted of two QTLs, which were located in the RM15139–RM15303 region covering the *GS3* locus [20] on chromosome 3. The TQ allele decreased BR by 0.41% but increased HR by 2.51%.

In the other five regions, one QTL was detected in each region. Included were *qBR3.2* located in the interval RM16048–RM16184 on chromosome 3, *qBR5.2* in RM274–RM334 on chromosome 5, *qBR7* in RM70–RM18 on chromosome 7, *qBR10* in RM6100–RM3773 on chromosome 10, and *qMR12* in RM20–RM27610 on chromosome 12.

#### *3.4. QTLs Detected in the ZM Population*

The 33 QTLs identified in the ZM population were distributed across 10 of the 12 rice chromosomes (Figure 1, Table 3). Numbers of QTLs detected for BR, MR, HR, GY, BRY, MRY, and HRY were 5, 1, 5, 5, 5, 6, and 6, having overall *R*<sup>2</sup> of 19.55%, 3.25%, 24.34%, 23.57%, 23.08%, 26.20%, and 28.39%, respectively. Twenty-four of these QTLs formed six clusters distributed on chromosomes 1, 3, 5, 6, and 10.

Two clusters on chromosome 1 and one on chromosome 6 were the three largest clusters, each consisting of five QTLs. Each of them affected one milling quality trait and all four yield traits. In the RG532–RM5359 region on the short arm of chromosome 1, the MY46 allele increased HR, GY, BRY, MRY, and HRY by 2.21%, 0.89 g, 0.74 g, 0.69 g, and 0.70 g, respectively (Table 3). In the RZ730–RG381 region on the long arm of chromosome 1, the MY46 allele decreased BR by 0.24% but increased GY, BRY, MRY, and HRY by 1.12 g, 0.86 g, 0.76 g, and 0.67 g, respectively. In the RZ516–RM197 region covering the *Wx* locus on chromosome 6, the MY46 allele decreased HR, GY, BRY, MRY, and HRY by 1.56%, 0.79 g, 0.63 g, 0.58 g, and 0.66 g, respectively.

The other three clusters consisted of four, three and two QTLs, respectively. The RZ613–RG418A region on chromosome 3 affected all four yield traits, with the MY46 allele decreasing GY, BRY, MRY, and HRY by 0.76 g, 0.61 g, 0.56 g, and 0.65 g, respectively. The RZ811–RZ583 region on chromosome 10 affected three yield traits, with the MY46 allele decreasing GY, BRY and MRY by 0.83 g, 0.73 g and 0.68 g, respectively. The CDO348–RG480 region on chromosome 5 affected the recovery and yield of head rice, with the MY46 allele decreasing HR and HRY by 1.61% and 0.79 g, respectively.

Two other QTLs, *qHRY4* and *qHR4*, were mapped in close positions on chromosome 4 (Figure 1). The MY46 allele increased HRY and HR by 0.76 g and 1.31%, respectively (Table 3). In the other seven regions, one QTL was detected in each region. Included were *qHR2* located in the interval A5–RM71 on chromosome 2, *qBR3* in RM251–RG393 on chromosome 3, *qMR6* in RM276–RZ667 on chromosome 6, *qBR7* in RG650–RZ395 on chromosome 7, *qMRY9* in RG667–RM201 on chromosome 9, *qBR11.1* in RZ816–RM332, and *qBR11.2* in RM187–RM254 on chromosome 11.

#### *3.5. QTLs Detected in the XM Population*

The 17 QTLs identified in the XM population were distributed on four of the 12 rice chromosomes (Figure 1, Table 4). Numbers of QTLs detected for BR, MR, HR, GY, BRY, MRY, and HRY were 3, 3, 2, 2, 2, 2, and 3, having overall *R*<sup>2</sup> of 11.68%, 13.11%, 8.86%, 13.62%, 13.21%, 13.80%, and 13.98%, respectively. Twelve of these QTLs formed four clusters distributed on chromosomes 3, 5, 6, and 10.

The clusters on chromosomes 6 and 10 were the two largest clusters consisting of four QTLs, both of which affected all four yield traits. In the RM190–RM204 region covering the *Wx* locus on chromosome 6, the MY46 allele decreased GY, BRY, MRY, and HRY by 1.74 g, 1.43 g, 1.32 g, and 0.99 g, respectively (Table 4). In the RM1859–RM184 region on chromosome 10, the MY46 allele decreased GY, BRY, MRY, and HRY by 1.36 g, 1.11 g, 1.04 g, and 1.11 g, respectively.

The other two clusters each consisted of two QTLs. In the RM6849–RM14629 region on the short arm of chromosome 3, the MY46 allele increased BR and MR by 0.21% and 0.28%, respectively. In the RM163–RG470 region on the long arm of chromosome 5, the MY46 allele decreased HR and HRY by 1.62% and 0.80 g, respectively. Two other QTLs, *qBR5* and *qMR5*, were mapped in close positions on the short arm of chromosome 5 (Figure 1). The MY46 allele increased BR and MR by 0.19% and 0.43%, respectively (Table 4). The remaining three QTLs were loosely linked on the long arm of chromosome 3, including *qBR3.2* for brown rice recovery, *qMR3.2* for milled rice recovery, and *qHR3* for head rice recovery.
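The per-trait QTL counts reported in Sections 3.3–3.5 can be tallied to check that they agree with the population totals of 27, 33 and 17 given in Section 3.2. The short sketch below does only that arithmetic; the counts are copied from the text, while the variable names are illustrative.

```python
TRAITS = ["BR", "MR", "HR", "GY", "BRY", "MRY", "HRY"]

# QTL counts per trait, in the order of TRAITS, as reported for each population.
qtl_counts = {
    "TI": [8, 3, 2, 4, 4, 4, 2],
    "ZM": [5, 1, 5, 5, 5, 6, 6],
    "XM": [3, 3, 2, 2, 2, 2, 3],
}

# Total number of QTLs per population and across the three RIL populations.
totals = {pop: sum(counts) for pop, counts in qtl_counts.items()}
grand_total = sum(totals.values())

# Number of QTLs per trait summed over the three populations.
per_trait = {
    trait: sum(qtl_counts[pop][i] for pop in qtl_counts)
    for i, trait in enumerate(TRAITS)
}

print(totals)       # {'TI': 27, 'ZM': 33, 'XM': 17}
print(grand_total)  # 77
print(per_trait)
```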

#### *3.6. Validation of Five QTL Regions in an RH-Derived F4:5 Population*

In the Ti52-3 population, which was derived from an RH plant of TQ/IRBB52, correlations between the seven traits (Table S2) were much the same as in the three RIL populations. Regarding the three traits for milling quality, the correlation between BR and MR (*r* = 0.773) was much stronger than between these two traits and HR (*r* = 0.243 and 0.266). Regarding the four yield traits, near-perfect correlations were observed between GY, BRY and MRY (*r* values ranging from 0.993 to 0.998), and their correlations with HRY were slightly weaker (*r* values ranging from 0.923 to 0.929). Regarding the three pairs of traits for milling quality and yield, the correlation between HR and HRY (*r* = 0.440) was much stronger than between the two others (*r* = 0.065 and 0.159).

Among the 16 segregating regions distributed on 12 chromosomes in the Ti52-3 population, QTLs were detected in 10 regions across nine chromosomes (Figure 2; Table 5). A total of 26 QTLs were found, including 6, 6, 1, 3, 3, 3, and 4 for BR, MR, HR, GY, BRY, MRY, and HRY, which had overall *R*<sup>2</sup> of 33.39%, 45.99%, 6.01%, 22.75%, 22.93%, 23.73%, and 23.27%, respectively.

**Figure 2.** Genomic distribution of QTLs for seven traits detected in the RH-F4:5 population. BR = Brown rice recovery; MR = Milled rice recovery; HR = Head rice recovery; GY = Grain yield per plant; BRY = Brown rice yield per plant; MRY = Milled rice yield per plant; HRY = Head rice yield per plant. Markers within the blue rectangle are flanking markers of QTLs detected in the TI population.


**Table 5.** QTLs for seven traits detected in the RH-F4:5 population.

QTLs are designated as proposed by McCouch and CGSNL [18]. *A*: additive effect of replacing a maternal with a paternal allele; *D*: dominance effect; *R*<sup>2</sup>: proportion of the phenotypic variance explained by the QTL.

Of the QTLs detected in the TI population, four were covered by the segregating regions of the Ti52-3 population (Figure 2), including *qBR5.2* located in the interval RM274–RM334 on chromosome 5, *qBR7* in RM70–RM18 on chromosome 7, *qBR10* in RM6100–RM3773 on chromosome 10, and *qMR12* in RM20–RM27610 on chromosome 12. They were all well validated. The TQ alleles consistently increased BR in the *qBR5.2* and *qBR10* regions, decreased BR in the *qBR7* region, and decreased MR in the *qMR12* region (Tables 2 and 5). Additionally, significant effects were newly detected on MR in the *qBR5.2*, *qBR7* and *qBR10* regions, and on BR and HRY in the *qMR12* region. The QTL region *qBR8*/*qGY8*/*qBRY8*/*qMRY8* found in the TI population overlapped with the segregating region RM22755–RM23001 in the Ti52-3 population (Figure 2). These QTLs were not detected in Ti52-3. Since one side of this putative QTL region was homozygous in the new population, it is possible that the QTLs did not segregate in Ti52-3.

The other six QTL regions found in the Ti52-3 population were not detected in the TI population. One of them, Tw31911–Tw32437 on chromosome 2, showed significant effects on four traits. In the neighboring region RM6–RM240, QTLs for the same four traits were detected in the TI population. However, the QTL directions were opposite between the two regions. It is noted that RM6–RM240 segregated in the TI population was homozygous in the Ti52-3 population (Figure 2). The gene underlying this QTL cluster may be located between RM6–RM240 and Tw31911–Tw32437, and crossover may have occurred between the gene and Tw31911.

Five other QTL regions detected in the Ti52-3 population included three QTL clusters and two regions affecting a single trait. The RM14302–RM14383 region on chromosome 3 affected two traits, in which the TQ allele increased BR and MR by 0.13% and 0.31%, respectively. The RM107 region on chromosome 9 affected four traits, in which the TQ allele increased GY, BRY, MRY, and HRY by 0.93 g, 0.76 g, 0.73 g, and 0.62 g, respectively. The RM167–RM287 region on chromosome 11 affected five traits, in which the TQ allele decreased BR, GY, BRY, MRY, and HRY by 0.10%, 0.76 g, 0.63 g, 0.55 g, and 0.36 g, respectively. The remaining two QTLs were *qHR6* and *qMR9*, of which the TQ allele decreased HR and MR by 1.34% and 0.20%, respectively.
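Table 5 reports an additive effect (*A*) and a dominance effect (*D*) for each QTL, which is possible because heterozygotes are present in the RH-derived population. As a rough illustration of what these two quantities measure, the sketch below computes textbook single-marker estimates from the phenotypic means of the three genotype classes. This is an assumption made for illustration only: the mapping software used in the study is likely to estimate effects by interval mapping rather than by simple class means, and the input values are hypothetical.

```python
from statistics import mean


def additive_dominance(maternal_hom, heterozygote, paternal_hom):
    """Single-marker estimates of additive (A) and dominance (D) effects.

    A is half the difference between the two homozygote means, signed so
    that a positive value means the paternal allele increases the trait.
    D is the deviation of the heterozygote mean from the midparent value.
    """
    m_mm = mean(maternal_hom)
    m_mp = mean(heterozygote)
    m_pp = mean(paternal_hom)
    midparent = (m_mm + m_pp) / 2
    a = (m_pp - m_mm) / 2
    d = m_mp - midparent
    return a, d


# Hypothetical grain yield values (g per plant) for the three marker classes.
a, d = additive_dominance(
    maternal_hom=[20.1, 21.0, 19.5],
    heterozygote=[21.6, 22.0, 21.2],
    paternal_hom=[22.4, 23.1, 22.0],
)
print(f"A = {a:.2f} g, D = {d:.2f} g")
```

The sign convention follows the table footnote, in which *A* is the effect of replacing a maternal with a paternal allele.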

#### **4. Discussion**

Milled and head rice yield, two of the most important commercial traits in rice production, are determined by grain yield and milling quality. Understanding the genetic relationship among these traits is critical for the improvement of milled and head rice yield in breeding. In this study, QTL analysis for seven traits—brown, milled and head rice recovery, grain yield, and brown, milled and head rice yield—was performed using three RIL populations and one RH-derived F4:5 population. New knowledge on the genetic basis underlying the control of brown, milled and head rice yield is provided.

In the four populations investigated in this study, correlations between the four yield traits were all highly significant. Near-perfect correlations were observed between GY, BRY and MRY, and their correlations with HRY were slightly weaker. These results were supported by the QTLs detected for the four traits. Four, five, two, and three QTLs were detected for grain yield in the TI, ZM, XM, and Ti52-3 populations, respectively. It is worth noting that each of these QTL regions had significant effects on two or all of the three milling yield traits. For QTLs located in the same region, the allelic direction remained unchanged across traits and the effects were consistent. Of the four regions controlling GY in TI, the *qGY2* and *qGY4* regions controlled all four traits, but the *qGY5* and *qGY8* regions were non-significant for HRY (Table 2). No other QTLs for these traits were detected in TI. Of the five regions controlling GY in ZM, the *qGY1.1*, *qGY1.2*, *qGY3,* and *qGY6* regions controlled all four traits, but the *qGY10* region was non-significant for HRY (Table 3). One more QTL for MRY, *qMRY9*, was detected alone. One more QTL for HRY, *qHRY4*, was detected together with a QTL for HR, *qHR4*. In XM, the two regions controlling GY both affected all four traits (Table 4). One more QTL for HRY, *qHRY5*, was detected together with a QTL for HR, *qHR5*. Similarly, all three regions controlling GY in Ti52-3 affected all four traits (Table 5). These results have two implications. Firstly, variation in paddy grain yield might be the only main source of variation for brown and milled rice yield. Secondly, variation in paddy grain yield and variation in head rice recovery both make important contributions to the variation in head rice yield.

Significant correlations between different quality traits in rice have been commonly observed [7,8,12,21], which could be partly ascribed to the influence of a QTL region on multiple traits [7,12]. By comparing the locations of genes or QTLs reported for various grain quality traits in rice, it is found that some regions harboring QTLs for milling quality are associated with other traits that determine appearance quality or eating and cooking characteristics. Two typical examples are the *GW5*–*Chalk5* region on chromosome 5 [22,23] and the *Wx* region on chromosome 6 [19]. In the *GW5*–*Chalk5* region, QTLs having major effects for MR were detected in the TI and XM populations. In a previous study reported by Zheng et al. [11], one QTL for MR, *QMr5*, was also detected in this region, having an additive effect of 1.10% and *R*<sup>2</sup> of 11.5%. In addition, this region was reported to affect various traits for grain chalkiness, endosperm transparency and grain size in the TI and XM populations [13,14].

The *Wx* gene not only plays a key role in controlling eating and cooking quality of rice, but also influences other traits including protein content, head rice recovery, grain chalkiness, and grain weight [2,3,8,9,24]. The *Wx* locus segregated in the TI, ZM and XM populations, having major effects on amylose content and gel consistency [15]. The *Wx* region also showed significant effects on grain chalkiness, grain width and endosperm transparency in TI; on grain chalkiness and grain length in ZM; and on grain chalkiness, grain length and endosperm transparency in XM [13,14]. In the present study, this region was found to have significant effects on BR, MR and HR in TI; on HR, GY, BRY, MRY, and HRY in ZM; and on GY, BRY, MRY, and HRY in XM.

In conclusion, the *GW5*–*Chalk5* and *Wx* regions are good targets for studying the genetic control of multiple traits determining grain yield, appearance quality, eating and cooking quality, milling quality, and milling yield.

#### **5. Conclusions**

A total of 77 QTLs for seven traits affecting milling yield in rice were detected using three RIL populations. All the regions harboring QTLs for grain yield were found to affect two or all three milling yield traits. QTLs for head rice yield were usually accompanied by QTLs for grain yield and head rice recovery. Variations of brown and milled rice yield were mainly ascribed to grain yield, but head rice yield was determined by both grain yield and head rice recovery. Two regions, covering the *GW5*–*Chalk5* and *Wx* loci, respectively, made a major contribution to milling quality and milling yield of rice.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2073-4395/10/1/75/s1, Figure S1: Frequency distribution of seven traits in three RIL populations. Table S1: Phenotypic performance of seven traits in the three RIL populations. Table S2: Simple correlation coefficients between seven traits in the RH-F4:5 population.

**Author Contributions:** Conceptualization, J.-Y.Z., H.-A.X. and J.-F.Z.; investigation, H.Z., Y.-J.Z., A.-D.Z., Y.-Y.F. and T.-X.H.; writing—original draft preparation, H.Z.; writing—review and editing, J.-Y.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the Science and Technology Innovation Program of the Fujian Academy of Agricultural Sciences (Grant No. STIT 2017-1-1), the National Research and Development Program (Grant No. 2016YFD0101801), the Special Foundation of Non-Profit Research Institutes of Fujian Province (Grant No. 2018R1101013-4) and the National Natural Science Foundation of China (Grant No. 31521064).

**Acknowledgments:** The authors would like to thank D.-P. Li for his assistance in field work. We acknowledge Y.-F. Sun and H.-Z. Lin for their technical assistance in laboratory works.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

*Article*

## **An SNP-Based High-Density Genetic Linkage Map for Tetraploid Potato Using Specific Length Amplified Fragment Sequencing (SLAF-Seq) Technology**

#### **Xiaoxia Yu** †**, Mingfei Zhang** †**, Zhuo Yu \*, Dongsheng Yang, Jingwei Li, Guofang Wu and Jiaqi Li**

Agricultural College, Inner Mongolia Agricultural University, Hohhot 010000, China;

yuxiaoxia1985@sina.com (X.Y.); zhangmingfei0207@163.com (M.Z.); yangdongsheng007@163.com (D.Y.);

ljw2016409@163.com (J.L.); wuguofang25@163.com (G.W.); lijiaqi1127@sina.com (J.L.)

**\*** Correspondence: yuzhuo58@sina.com; Tel.: +86-155-9806-8434

† These authors contributed equally to this work.

Received: 30 December 2019; Accepted: 9 January 2020; Published: 13 January 2020

**Abstract:** Specific length amplified fragment sequencing (SLAF-seq) is a recently developed high-resolution strategy for large-scale de novo discovery and genotyping of single nucleotide polymorphism (SNP) markers. In the present research, in order to facilitate genome-guided breeding in potato, this strategy was used to develop a large number of SNP markers and construct a high-density genetic linkage map for tetraploid potato. Genomic DNA extracted from 106 F1 individuals derived from a cross between two tetraploid potato varieties, YSP-4 × MIN-021, and from their parents was used for SLAF library construction and high-throughput sequencing. A total of 556.71 Gb of data, containing 2269.98 million paired-end reads, were obtained after preprocessing. According to bioinformatics analysis, a total of 838,604 SLAF labels were developed, with an average sequencing depth per SLAF of 26.14-fold for the parents and 15.36-fold for the offspring. In total, 113,473 polymorphic SLAFs were obtained, of which 7638 were successfully classified into four segregation patterns. After filtering, a total of 7329 SNP markers were retained for genetic map construction. The final integrated linkage map of tetraploid potato included 3001 SNP markers on 12 linkage groups and covered 1415.88 cM, with an average distance of 0.47 cM between adjacent markers. To our knowledge, the integrated map described herein has the best coverage of the potato genome and the highest marker density for tetraploid potato. This work provides a foundation for further quantitative trait loci (QTL) location, map-based gene cloning of important traits and marker-assisted selection (MAS) of potato.

**Keywords:** tetraploid potato; SNP markers; SLAF-seq technology; high-density genetic linkage map

#### **1. Introduction**

Potato, *Solanum tuberosum* L., is the fourth most important food crop in the world behind maize, wheat, and rice, with a total production of more than 388 million tons in 2017 [1]. Nevertheless, cultivated potato is a highly heterozygous outcrossing autotetraploid (2*n* = 4x = 48), which causes complexities in genetic or genomic studies, and provides many challenges for breeding. Therefore, more breeding efforts have been focused on improving important traits, such as processing quality, nutritional value, as well as disease/pest resistance.

A high-density genetic linkage map can provide a large amount of information that facilitates map-based cloning, QTL identification, and comparative genomic research, establishing a general tool for marker-assisted selection (MAS) breeding. However, the construction of linkage maps in autopolyploids is always more difficult than in diploid and allopolyploid species, due to their complicated segregation patterns and chromosomal pairing [2–5]. Over the past two decades, multiple linkage maps have been constructed for potato (both diploid and autotetraploid potato) for the purpose of better understanding the potato genome, facilitating map-based cloning, and developing markers for MAS [6–11]. Gebhardt et al. (1991) [6] reported the first potato map in the world, including 135 restriction fragment length polymorphism (RFLP) molecular markers and defining 12 distinct linkage groups, which was drawn from segregation data derived from the interspecific cross of diploid potato (2*n* = 2x = 24), *S. phureja* × (*S. tuberosum* × *S. chacoense*). Yamanaka et al. (2005) [10] constructed an integrated genetic linkage map of diploid potato, using 106 F1 individuals from a cross of two wild and landrace germplasm 86.61.26 × 84.194.30 as the mapping population. This map included 13 newly developed P450-based analogue (PBA), 27 random amplified polymorphic DNA (RAPD), 4 inter-simple sequence repeat (ISSR), 22 simple sequence repeat (SSR), 9 restriction fragment length polymorphism-sequence-tagged sites (RFLP-STSs), and 7 RFLP markers, with a coverage of 673 cM and an average marker distance of 8.2 cM. Van Os et al. (2006) [11] constructed an ultradense map of potato with more than 10,000 amplified fragment length polymorphism (AFLP) markers from a heterozygous diploid potato population. It is also the densest meiotic recombination map ever constructed.

With the rapid development of next-generation sequencing technologies, single nucleotide polymorphism (SNP) markers have been developed to construct high-density genetic linkage maps for many important crop species, such as maize [12,13], rice [14,15], and wheat [16,17]. For potato, Xun et al. (2011) [18] used a homozygous double-monoploid potato clone to sequence and assemble 86% of the 844-megabase genome, which bridged the gap between genomics and applied breeding with an in-depth understanding of the structure and function of the potato genome, and provided an effective tool and data to develop potato SNP markers. To date, several high-density genetic linkage maps based on SNP markers have been reported following the completion and subsequent development of the potato whole-genome sequence. Felcher et al. (2012) [19] first used SNP markers and two diploid potato populations to create two linkage maps, where over 4400 markers were mapped, including 787 markers common to both populations, and the map sizes were 965 and 792 cM, respectively. Hackett et al. (2013) [20] constructed a high-density SNP map of tetraploid potato based on Infinium 8300 Potato SNP Array data, which included 1130 markers with a coverage of 1087.5 cM, using a mapping population of 190 progenies from a cross between the breeding clone 12601ab1 and the cultivar Stirling. Endelman et al. (2016) [21] first used a diploid inbred line-based F2 population to construct a genetic linkage map of diploid potato with 2264 SNP markers. To sum up, most potato linkage maps are generated from diploid populations of wild species and primitive cultivars. Linkage mapping in tetraploid potato species is still a challenge despite the recent advances in mapping methodology, genotyping, and molecular marker technology.

Due to the advances in next-generation sequencing (NGS) technologies, new high-throughput genotyping methods hold promise for the detection of a large number of SNPs in a short time. These include genotyping-by-sequencing (GBS) [22], complexity reduction of polymorphic sequences (CroPSs) [23], restriction site-associated DNA sequencing (RAD-seq) [24,25], and specific length amplified fragment sequencing (SLAF-seq) [26]. SLAF-seq technology, reported by Sun et al. (2013) [26], is an efficient strategy for de novo SNP discovery and genotyping of large populations based on an enhanced reduced representation library (RRL) sequencing method. The advantages of SLAF-seq technology are: (i) deep sequencing to ensure genotyping accuracy; (ii) a lower sequencing cost; (iii) a pre-designed RRL scheme to optimize marker efficiency; and (iv) a double-barcode multiplexed sequencing system for large populations and large numbers of loci. To date, this strategy has been applied to various species for high-density SNP genetic mapping, such as cucumber [27], *Agropyron* Gaertn. [28], and orchardgrass [29], due to its advantages of optimized marker efficiency, accurate genotyping, affordable price, and applicability for large populations. In the present research, an F1 mapping population of 106 individuals was created from the cross between two tetraploid potato varieties, YSP-4 × MIN-021. We used the SLAF-seq approach to construct a high-density integrated SNP genetic linkage map of tetraploid potato, which will expedite map-based cloning efforts, QTL location for important traits, as well as marker-assisted selection breeding for tetraploid potato.

#### **2. Materials and Methods**

#### *2.1. Plant Materials*

The F1 mapping population consisted of 106 individuals from a cross between two tetraploid potato varieties, YSP-4 (female) and MIN-021 (male). YSP-4 is a wild tetraploid potato material, which has a short growth period, moderate tuber numbers per plant, high commodity potato rate, and high starch content (ca. 18%). This material is also highly resistant to early blight and virus disease. MIN-021 is a tetraploid potato material, which has a short growth period, high yield, and high starch content (ca. 19%). All the materials were planted in the potato breeding base of Inner Mongolia Agricultural University. The field trial was arranged in a randomized complete block design (RCBD) with three replications per plot. Each plot contained 20 plants, which were grown in 2 rows with a spacing of 30 cm within rows and 90 cm between rows, and the planting depth was about 12 cm. The experimental field had sandy soil with pH 7.8 to 8.2 and good irrigation conditions, with annual precipitation from 300 to 400 mm. Its geographic position is 111°42′ E, 45°57′ N, with an altitude of 1063 m.

#### *2.2. DNA Extraction*

At the potato squaring stage, the genomic DNA of all parents and 106 progenies was extracted from young fresh leaf tissue by the Plant Genomics DNA Kit (Tiangen, Beijing, China). Then, the quality of DNA was determined by electrophoresis on a 1% (*w*/*v*) agarose gel stained with ethidium bromide, and the concentration was quantified by an ND-1000 Spectrophotometer (Nano Drop, Wilmington, DE, USA) and adjusted to a concentration of 50 ng/μL.

#### *2.3. SLAF Library Preparation and Sequencing*

According to the genome size and GC (guanine-cytosine) content of the tested materials, the potato genome (http://solanaceae.plantbiology.msu.edu/pgsc\_download.shtml) was selected as the reference genome for in silico prediction of restriction enzyme digestion, and the enzyme combination of *Rsa*I and *Hae*III was finally chosen to digest the genomic DNA of the 106 F1 individuals and their two parents. The read length used for sequencing ranged from 264 to 394 bp. The SLAF labels (the length of fragments ranged from 314 to 364 bp) were selected for paired-end sequencing (125 bp per end) on an Illumina HiSeq 2500 sequencing platform, performed by the Beijing Biomarker Technologies Corporation (http://Biomarker.com.cn/). The SLAFs with a sequence depth of less than 10-fold were considered low-depth SLAFs and filtered out. Several steps were defined to process the SLAF-seq data: distinguishing samples by barcodes and grouping data by sequence similarity; evaluating sequence errors using control data; minor allele frequency (MAF) filtering and SLAF definition; correction of sequence errors; and definition and evaluation of genotypes. In addition, a quality score algorithm was developed to evaluate the quality of SNP discovery and genotyping, which can help researchers balance accuracy and cost during heterozygote detection using high-throughput sequencing technology. The Q30 (a quality score of 30, indicating 99.9% base-calling accuracy) was used to evaluate the sequencing quality of reads, and examination of the base distribution was used to detect the GC content of the raw data for data quality control. The raw sequence reads were deposited in the NCBI Sequence Read Archive (SRA) database (accession: PRJNA597429).
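The two quality-control measures named above, the Q30 ratio and GC content, are simple to compute from read data. A Phred quality score Q corresponds to a base-calling error probability of 10^(−Q/10), so Q30 means an error probability of 0.001. The sketch below is a minimal illustration and not the pipeline used in the study; it assumes FASTQ quality strings in the Phred+33 encoding, and the function names and example read are hypothetical.

```python
def phred_error_probability(q):
    """Base-calling error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)


def q30_ratio(quality_string, offset=33):
    """Fraction of bases with Phred quality >= 30 (Phred+33 assumed)."""
    if not quality_string:
        raise ValueError("empty quality string")
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(s >= 30 for s in scores) / len(scores)


def gc_content(sequence):
    """Fraction of G and C among called bases (N and other symbols ignored)."""
    called = [b for b in sequence.upper() if b in "ACGT"]
    if not called:
        raise ValueError("no called bases in sequence")
    return sum(b in "GC" for b in called) / len(called)


# Hypothetical read and quality string, for illustration only.
read = "ATGCATTAGCNATTA"
qual = "IIIIIIII5555III"
print(f"P(error) at Q30: {phred_error_probability(30):.3f}")
print(f"Q30 ratio: {q30_ratio(qual):.2%}")
print(f"GC content: {gc_content(read):.2%}")
```

In practice these statistics are aggregated over all reads of a sample rather than computed per read.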

#### *2.4. SLAF Data Analysis and Development of SNP Markers*

The approach of clustering among reads was used to develop and search for polymorphic SLAF labels from 106 F1 individuals and their parents. All paired-end reads generated from SLAF-seq raw reads were compared according to their sequence similarity as detected by the BLAST-like alignment tool (BLAT) [30]. The F1 individual sequence reads were aligned to the potato reference genome using Burrows–Wheeler Aligner (BWA) software [31]. Identical reads from different samples were clustered, and fragments with over 90% sequence identity were defined as an SLAF label. The SLAF labels with differences in high-depth fragments were also considered as SNP or indel markers. According to the differences among sequences or allele numbers, the SLAF labels were divided into three categories, including NoPoly (non-polymorphic), Poly (polymorphic), and Repeat (repetitive). After comparing the sequence differences on SLAFs from each sample, the polymorphic SLAF labels were screened for further analysis. Both Sequence Alignment/Map tools (SAMtools) [32] and Genome Analysis Toolkit (GATK) [33] were used to identify SNPs, and their intersection was identified as the candidate SNP dataset. Only biallelic SNPs were retained as the final SNPs. The SNP loci were confirmed from the polymorphic SLAF labels, with the screening criteria of MAF > 0.5.
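The SNP screening logic described above (keep sites called by both SAMtools and GATK, keep only biallelic sites, then apply a MAF cutoff) can be sketched in a few lines. The example below is a hedged illustration on toy data, not the authors' pipeline: site coordinates, allele counts and function names are hypothetical, and the MAF cutoff is left as a parameter whose value in the example is purely illustrative.

```python
def consensus_sites(samtools_sites, gatk_sites):
    """Sites reported by both callers (set intersection)."""
    return set(samtools_sites) & set(gatk_sites)


def passes_filters(allele_counts, min_maf):
    """True if the site is biallelic and its minor allele frequency
    exceeds min_maf. allele_counts maps allele -> observed count."""
    observed = [c for c in allele_counts.values() if c > 0]
    if len(observed) != 2:
        return False  # monomorphic or multiallelic
    maf = min(observed) / sum(observed)
    return maf > min_maf


# Hypothetical call sets: (chromosome, position) tuples.
samtools_calls = {("chr01", 1050), ("chr01", 2210), ("chr02", 77)}
gatk_calls = {("chr01", 1050), ("chr02", 77), ("chr03", 9)}

# Hypothetical allele counts at the candidate sites.
counts = {
    ("chr01", 1050): {"A": 150, "G": 62},
    ("chr02", 77): {"C": 100, "T": 60, "G": 52},  # three alleles
}

candidates = consensus_sites(samtools_calls, gatk_calls)
final_snps = sorted(
    site for site in candidates if passes_filters(counts[site], min_maf=0.05)
)
print(final_snps)  # [('chr01', 1050)]
```

Taking the intersection of two callers trades some sensitivity for a lower false-positive rate, which matters when marker genotypes feed directly into linkage analysis.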

#### *2.5. Construction of High-Density Linkage Map*

The HighMap software was used to construct a high-density genetic linkage map of tetraploid potato [34]. The single-linkage clustering algorithm was used to cluster the SNP markers, which were then ordered within linkage groups. The MLOD values among high-quality SLAF labels were calculated and used for linkage grouping. Genotyping errors were corrected using the genotyping error correction module of the HighMap software.

#### **3. Results**

#### *3.1. SLAF Library Construction and SLAF Labels Development*

The in silico restriction enzyme combination of *Rsa*I and *Hae*III was used for genome DNA digestion and the prediction of the potato reference genome. A total of 334,787 SLAF labels were predictably obtained, which were evenly distributed on the genome. The rice genome (*Oryza sativa*) was used as a control for the restriction enzyme digestion control trial, in order to indirectly monitor the progress of the potato SLAF library construction. Compared with the control, the ratio of paired-end mapping reads was 89.20%, and the digestion efficiency of the RsaI and HaeIII restriction enzymes was 90.91%, which indicated that the potato SLAF library was constructed normally and suitable for high-throughput sequencing.

After SLAF library construction and high-throughput sequencing, a total of 2269.98 million paired-end reads (556.71 Gb data) with a length of 100 bp were obtained. The Q30 ratio was 95.05%, and the average GC (guanine-cytosine) content was 35.51%. Of all the high-quality data, 48,849,737 reads were from the male parent MIN-021, 41,510,213 reads were from the female parent YSP-4, and an average of 90,562,465 reads were from the 106 offspring of the F1 mapping population (Table 1). According to bioinformatics analysis, a total of 838,604 SLAF labels were developed, with an average depth of 26.14-fold and 15.36-fold for each SLAF of the parents and offspring, respectively. Of the 838,604 high-quality SLAFs, 282,838 were polymorphic, of which 113,473 could be used for map construction.


**Table 1.** Basic statistics of the SLAF-seq data in tetraploid potato.

#### *3.2. SNP Marker Detection*

A total of 7638 SLAF labels were screened from the 113,473 polymorphic SLAFs and successfully classified into four segregation patterns: hk × hk, lm × ll, nn × np, and ef × eg (Table 2). All patterns except aa × bb are suitable for an F1 population and were used for later genetic map construction, because the potato F1 population was not obtained from a cross between two fully homozygous parents with genotypes aa and bb. After filtering out the SNP markers with sequence depths of 4-fold or less, a total of 7329 SNP markers were detected from the 7638 SLAFs for map construction.


**Table 2.** The number of SLAFs in different types of segregation patterns for map construction.

#### *3.3. Construction of the Genetic Linkage Map*

After four quality control steps, the 7329 screened SNPs were used to calculate the modified logarithm of odds (MLOD) values between two markers [35]. The markers with an MLOD value of less than three were filtered, and the remaining markers were grouped into 12 linkage groups (LGs). The HighMap software was used to analyze the linear arrangements of all the grouped SNPs and the genetic distance between adjacent SNP markers within each LG. An integrated map as well as two separate linkage maps for the female and male parents were constructed, including 12 linkage groups.
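
To illustrate the grouping step, the sketch below performs single-linkage grouping with a union-find structure: any two markers whose pairwise score reaches the threshold of three end up in the same group. It is a sketch under stated assumptions, not HighMap's implementation; the MLOD calculation itself is not reproduced, and the pairwise scores are assumed to be supplied as a dictionary.

```python
def group_markers(markers, pairwise_lod, threshold=3.0):
    """Single-linkage grouping: join markers whose pairwise score >= threshold."""
    parent = {m: m for m in markers}

    def find(m):
        # Follow parent pointers to the group representative, halving the path.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for (a, b), lod in pairwise_lod.items():
        if lod >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for m in markers:
        groups.setdefault(find(m), []).append(m)
    return sorted(sorted(g) for g in groups.values())
```

In this sketch, a marker with no score at or above the threshold forms a group of its own; in the study such markers were filtered out instead.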

In YSP-4, the maternal linkage map contained 1638 SNP markers, which covered a total length of 1383.86 cM, with an average marker distance of 0.83 cM. The number of markers in the linkage groups ranged from 43 to 341, with an average of 137 markers. The length of LGs ranged from 32.82 to 282.89 cM, with an average size of 115.32 cM (Table A1). In MIN-021, the paternal linkage map consisted of 1402 SNP markers and covered a total length of 1203.94 cM, with an average marker distance of 0.87 cM. The number of mapped markers in the LGs ranged from 542 to 243, with an average of 117 markers. The length of LGs ranged from 26.05 to 170.2 cM, with an average size of 100.33 cM (Table A2).

The integrated genetic map included 3001 SNP markers, which covered a total length of 1415.88 cM, and the average distance between adjacent markers was 0.47 cM. The number of markers in the linkage groups ranged from 43 to 341 markers, with an average of 137 markers. The length of LGs ranged from 45.02 to 282.89 cM, with an average size of 117.99 cM (Table 3, Figures 1 and A1). LG chr10 was not only the shortest but also the densest group, with 440 loci spanning 33.47 cM, which had an average marker density of 0.08 cM. LG chr2 was the longest group, with 225 loci spanning 205.09 cM. The largest gap on this map was 25.19 cM, located in LG chr7 (Table 3; Figure 1).
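
The summary figures for the integrated map follow directly from the totals. A minimal sketch, assuming that the average marker distance is the total length divided by the number of markers (rather than by the number of intervals) and using a hypothetical helper name:

```python
def map_summary(total_length_cm, n_markers, n_groups):
    """Average marker spacing and average linkage-group length, both in cM."""
    return {
        "avg_marker_distance": round(total_length_cm / n_markers, 2),
        "avg_group_length": round(total_length_cm / n_groups, 2),
    }
```

For the integrated map, `map_summary(1415.88, 3001, 12)` reproduces the reported 0.47 cM between adjacent markers and the average LG size of 117.99 cM.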

**Figure 1.** The high-density integrated genetic map of tetraploid potato. A total of 3001 SNP markers were distributed in 12 linkage groups, covering 1415.88 cM, with an average interval of 0.47 cM between markers.


**Table 3.** The integrated genetic map for tetraploid potato.

The average depth of the SNP markers on the integrated map was 85.63-fold in the paternal parent MIN-021, 65.10-fold in the maternal parent YSP-4, and 40.34-fold in the offspring of the F1 population. Segregation distortion occurs when the segregation ratio deviates from the expected Mendelian ratio, and is considered a powerful driving force for organic evolution [36]. The Chi-square (χ2) test (α = 0.05) was used to analyze the goodness-of-fit to the expected segregation ratios for all the SNP markers. A total of 80 out of 3001 markers (2.7%) did not fit the expected segregation ratios at a level of α ≤ 0.05. The distorted SNP markers were mainly located on LG chr 3, chr 5, chr 7, chr 8, chr 11, and chr 12 (Table 4).
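
The goodness-of-fit test can be sketched as below. The expected ratios used in the usage note (1:1 for markers segregating in one parent, 1:2:1 for hk × hk) are the usual diploid-type expectations and are an assumption here, as are the function names; the critical values are the standard chi-square values at α = 0.05.

```python
# Critical values of the chi-square distribution at alpha = 0.05, by degrees of freedom.
CHI2_CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815}


def chi_square_stat(observed, expected_ratio):
    """Pearson chi-square statistic for observed counts against a ratio."""
    total = sum(observed)
    ratio_sum = sum(expected_ratio)
    expected = [total * r / ratio_sum for r in expected_ratio]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def is_distorted(observed, expected_ratio):
    """True if the counts deviate from the expected ratio at alpha = 0.05."""
    df = len(observed) - 1
    return chi_square_stat(observed, expected_ratio) > CHI2_CRITICAL_05[df]
```

With 106 offspring, hypothetical counts of 70 and 36 against a 1:1 expectation would be flagged as distorted, whereas counts of 25, 55, and 26 against 1:2:1 would not.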

**Table 4.** The distorted SNP markers on integrated genetic map of tetraploid potato.


#### *3.4. Evaluation of the Genetic Map*

The quality of this genetic map was evaluated by haplotype maps and heat maps, which directly revealed the recombination relationship among SNP markers in the 12 LGs. Haplotype maps were created to reflect the crossover events. The recombination events of the 12 LGs are shown on the haplotype maps (Figure A2). The haplotype maps from 12 LGs showed that all LGs had an extremely low double crossover rate, which indicated the genetic map had a high quality.

Heat maps were also constructed to evaluate the quality of the genetic map by using pair-wise recombination values for the 3001 SNP markers (Figure A3). It showed that most of the heat maps for 12 LGs performed well in visualization, which indicated that the markers were well-ordered, and the genetic distances of adjacent markers were accurate in each LG.

#### **4. Discussion**

#### *4.1. The Development of SNP Markers Using SLAF-seq Technology*

A genetic linkage map is the basis for QTL identification of important traits, map-based gene cloning, and molecular marker-assisted breeding of crops. Different types and numbers of polymorphic markers were used to construct genetic maps. For potato, most genetic linkage maps were mainly based on several conventional low-throughput molecular markers, including RFLP [6–10], AFLP [9,11], as well as RAPD, ISSR, and SSR markers [10]. However, it is time-consuming and costly to construct a high-density genetic map for potato using conventional molecular marker technologies. SNP markers are the most frequent polymorphisms and are suitable for high-throughput genotyping. In addition, many SNP markers are located within transcribed regions, which can generate more links between the genetic and physical maps. To date, high-density polymorphic SNP markers have been used in potato for large-scale genotyping and high-density genetic map construction [19–21].

SNP markers can be rapidly developed on a large scale by different high-throughput sequencing technologies and genotyping methods, such as genotyping by sequencing (GBS) [22], restriction site-associated sequencing (RAD-seq) [24,25], and SLAF-seq. The SLAF-seq technology, a combination of locus-specific amplification and high-throughput sequencing, provides a high-resolution strategy with a shorter period of time and lower cost for large-scale genotyping and can be generally applicable to various species and populations [26].

In the present research, we used SLAF-seq technology in potato for the first time to develop SNP markers and construct a high-density genetic map. A total of 2269.98 million paired-end reads were obtained based on high-throughput sequencing. According to bioinformatics analysis, a total of 838,604 SLAF labels were generated, of which 282,838 were polymorphic. Finally, a total of 7329 polymorphic SNP markers were developed for high-density genetic map construction. The present study extends the utility of SLAF-seq technology to potato. The results showed that SLAF-seq was an effective tool to rapidly develop large-scale SNP markers, which met the requirements for high-density genetic map construction of tetraploid potato.

#### *4.2. Mapping Population and Strategies*

In general, F2, backcross (BC), doubled haploid (DH), or recombinant inbred line (RIL) populations are used as mapping populations to construct genetic linkage maps [5,37,38]. However, as for most autopolyploid species, it is very difficult to obtain such a typical family-based population in potato because of its high heterozygosity. To date, most of the reported potato linkage maps have been established by applying a double pseudo-testcross strategy on an F1 population. The double pseudo-testcross strategy was first put forward by Grattapaglia and Sederoff (1994) [37] to construct genetic linkage maps for genetically heterozygous species of forest trees. An F1 population obtained by crossing two unrelated and highly heterozygous parents was used as the mapping population, and the gene segregation patterns were treated as those of a backcross. Since then, this strategy has been widely used to construct linkage maps for heterozygous species, such as danshen [39], pineapple [40], rhodesgrass [41], and sweet potato [5].

In the present research, an F1 segregation population from a cross YSP-4 × MIN-021 was created, of which 106 individuals were randomly selected and used for SNP genotyping and map construction based on the double pseudo-testcross strategy. In the pseudo-testcross, a total of 7638 polymorphic SLAF markers were classified into four segregation patterns, which were hk × hk, lm × ll, nn × np, and ef × eg. The 7329 SNP markers screened and confirmed from 7638 polymorphic SLAFs were then used to construct a genetic linkage map. In our study, among the 838,604 high-quality SLAFs, 282,838 were polymorphic, with a polymorphic rate of 33.7%. It indicates that there is considerable genetic difference between YSP-4 and MIN-021. Therefore, it is suitable to use them as mapping parents, and the F1 population derived from the cross between them conforms to the requirements of the mapping population for high-density map construction.
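
The segregation patterns named above can be illustrated with a simplified classifier. This is a sketch under stated assumptions: each parent is represented by a diploid-like pair of alleles (tetraploid dosage is ignored), the maternal parent is assumed to be written first in each pattern, and the function name is hypothetical.

```python
def segregation_pattern(maternal, paternal):
    """Classify a locus from two-allele parental genotypes, e.g. ("A", "G")."""
    m_het = maternal[0] != maternal[1]
    p_het = paternal[0] != paternal[1]
    n_alleles = len(set(maternal) | set(paternal))

    if m_het and p_het:
        # Both parents heterozygous: 2, 3 or 4 distinct alleles in total.
        return {2: "hk x hk", 3: "ef x eg", 4: "ab x cd"}[n_alleles]
    if m_het:
        return "lm x ll"  # segregates in the maternal parent only
    if p_het:
        return "nn x np"  # segregates in the paternal parent only
    # Both parents homozygous: no segregation expected in the F1.
    return "aa x bb" if n_alleles == 2 else "non-polymorphic"
```

Loci classified as aa × bb do not segregate in an F1 population, which is why that pattern is not used for a pseudo-testcross map.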

#### *4.3. The High-Density Genetic Map of Tetraploid Potato*

Segregation distortion is a common phenomenon that has been observed in many studies [42–44]. It may arise from cytological attributes, genetic drift, gametophyte selection, or other biological causes [45,46]. Segregation distortion can alter the estimation of recombination and cause spurious linkage [47]. Therefore, distorted markers may affect the accuracy of genetic maps. In our study, only 2.7% of the SNP markers located on the integrated map were distorted, which indicates high map accuracy.

To our knowledge, only one high-density SNP genetic linkage map for tetraploid potato has been reported [20], owing to the high heterozygosity of autotetraploid potato. In the present research, we used the SLAF-seq method in potato for the first time for genotyping and developing SNP markers, and constructed high-density genetic maps of tetraploid potato. The integrated map included 3001 SNP markers and had a genetic length of 1415.88 cM, with an average distance between markers of 0.47 cM. Compared with the map obtained by Hackett et al. (2013) [20], the integrated map had more SNP markers (3001 vs. 1130), higher marker density (0.47 cM vs. 1.60 cM), and larger total length (1415.88 cM vs. 1087.5 cM). Thus, our map has better coverage of the potato genome and higher marker density.

#### **5. Conclusions**

In the present study, SLAF-seq technology was successfully used for the first time for the development of large-scale SNP markers and the construction of high-density linkage maps in tetraploid potato. The integrated high-density genetic linkage map generated here has the best coverage of the potato genome and the highest marker density reported for tetraploid potato to date. This work represents an important step forward in genomics and marker-assisted breeding of tetraploid potato. It also provides a foundation for QTL mapping and map-based gene cloning of important traits for potato, such as tuber yield, starch content, and protein content. In addition, the SLAF-seq strategy and the mapping population used in our study will provide valuable references for other tetraploid plants.

**Author Contributions:** Z.Y. and X.Y. conceived and designed the research. Z.Y., X.Y. and M.Z. performed the experiments. X.Y. and M.Z. completed the writing of the article. D.Y., J.L. (Jingwei Li), G.W., and J.L. (Jiaqi Li) assisted in the completion of the experiments. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by grants from the Inner Mongolia Major Science and Technology Project (zdzx2018019), Inner Mongolia Youth Science and Technology Talents Project for Colleges and Universities (NJYT-18-B04), Inner Mongolia Natural Science Foundation (2018MS03040), Major Project for Inner Mongolia Science and Technology Program (201602048), and Animal and Plant Breeding Project for Transformation of Scientific Achievements in Inner Mongolia Agricultural University (YZGC2017006).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**


**Table A1.** The maternal genetic map for tetraploid potato.


**Table A2.** The paternal genetic map for tetraploid potato.

**Figure A1.** The SNP distribution on the potato genome. The *x*-axis represents the chromosome length and the *y*-axis indicates the chromosome code. Each band represents a chromosome, and the genome is divided into windows of 1 Mb. The more SNPs in a window, the darker the color; the fewer SNPs, the lighter the color. The darker areas in the figure are the areas where SNPs are concentrated.

**Figure A2.** Haplotype maps for 12 linkage groups of the integrated genetic map for tetraploid potato. The haplotype maps consist of 12 maps from LG chr 1 to LG chr 12. Each two columns represent the genotype of an individual. Blank columns are used between two individuals. The first and second columns represent the paternal and maternal chromosome, respectively. Rows correspond to genetic markers. Green and blue boxes indicate one chromatid from parents, and gray boxes indicate missing data.

**Figure A3.** Heat maps for 12 linkage groups of the integrated genetic map for tetraploid potato. The heat maps consist of 12 maps from LG chr 1 to LG chr 12. Markers of each row and column are ranked according to the map order. Each small square represents the rate of recombination between two markers. Yellow color represents highly tight linkage; red color represents relatively weak linkage, the darker the red color, the less tight linkage; and blue color represents no linkage.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Mapping Agronomic and Quality Traits in Elite Durum Wheat Lines under Differing Water Regimes**

#### **Rosa Mérida-García <sup>1</sup>, Alison R. Bentley <sup>2</sup>, Sergio Gálvez <sup>3</sup>, Gabriel Dorado <sup>4</sup>, Ignacio Solís <sup>5</sup>, Karim Ammar <sup>6</sup> and Pilar Hernandez <sup>1,</sup>\***


Received: 17 December 2019; Accepted: 14 January 2020; Published: 19 January 2020

**Abstract:** Final grain production and quality in durum wheat are affected by biotic and abiotic stresses. The association mapping (AM) approach is useful for dissecting the genetic control of quantitative traits, with the aim of increasing final wheat production under stress conditions. In this study, we used AM analyses to detect quantitative trait loci (QTL) underlying agronomic and quality traits in a collection of 294 elite durum wheat lines from CIMMYT (International Maize and Wheat Improvement Center), grown under different water regimes over four growing seasons. Thirty-seven significant marker-trait associations (MTAs) were detected for sedimentation volume (SV) and thousand kernel weight (TKW), located on chromosomes 1B and 2A, respectively. The QTL loci found were then confirmed with several AM analyses, which revealed 12 sedimentation index (SDS) MTAs and two additional loci for SV (4A) and yellow rust (1B). A candidate gene analysis of the identified genomic regions detected a cluster of 25 genes encoding blue copper proteins in chromosome 1B, with homoeologs in the two durum wheat subgenomes, and an ubiquinone biosynthesis O-methyltransferase gene. On chromosome 2A, several genes related to photosynthetic processes and metabolic pathways were found in proximity to the markers associated with TKW. These results are of potential use for subsequent application in marker-assisted durum wheat-breeding programs.

**Keywords:** durum wheat; genome wide association study; GWAS; water use; agronomic traits; MTAs; candidate genes; TKW; sedimentation volume; SDS; YR

#### **1. Introduction**

Wheat is one of the most widely grown crops worldwide (FAO, 2015), and is essential for the human diet [1]. Its importance and worldwide dominance are due, in part, to its agronomic adaptability. Durum wheat (*Triticum durum*) is a tetraploid wheat species (AABB genomes) mainly grown in the Mediterranean basin, in the Northern Plains (between the USA and Canada), in the arid areas of South Western USA and in Northern Mexico [2]. Durum wheat is well-adapted to a broad range of climatic conditions (including dry environments) and marginal soils, and has low water requirements [3,4]. Climatic conditions, such as temperature and water availability, together with biotic stresses, can strongly affect durum wheat development and production [3–6]. Crop adaptation is a central objective for breeding progress, driving improvement in final production, quantity and quality [7,8]. For over two decades, CIMMYT (International Maize and Wheat Improvement Center) has had an intensive breeding and improvement programme focused on the acceleration of durum productivity in developing countries.

Grain quality is an important breeding aim determining product end-use linked to financial returns. It is influenced by both genetic and environmental conditions [9], and biotic and abiotic stresses during growth and at key development stages [10]. Temperature, water availability and soil properties, especially nitrogen content, influence the final quality and protein content of wheat and its end-products [11–13].

There is a growing need to increase wheat yield without losing grain quality [14,15]. Key end-use grain quality traits include grain protein content (GPC), gluten strength, kernel size and vitreousness [7,16] and are all influenced by climatic conditions [17]. A number of agronomic components influence final productivity, including phenology (maturity) and plant architecture (plant height and lodging resistance). The majority of important agronomic traits, including yield, are controlled or influenced by multiple genes and are quantitatively inherited [18]. In addition, most are influenced by the environment and interactions between environmental and genetic (GxE) effects [19–23]. One of the most common methods currently used for dissection of quantitative agronomic and quality traits is the association mapping (AM) approach [24].

AM, originating in human genetics, was initially combined with linkage disequilibrium (LD) to identify the role of genes and linked markers for the determination of disease loci [25]. It is now widely used in plant and crop genetics. Some of the first studies based on LD mapping applied in plants were done in maize [26], rice [27] and oat [28]. AM has the main objective of determining, based on LD, correlations between genotypes and phenotypes in a panel of selected individuals [29]. It can support the development of new genetic markers for use in marker-assisted plant breeding [30]. It also facilitates the analysis of genetic variation underlying traits for further characterisation of the loci of interest [31].

Single nucleotide polymorphism (SNP) markers are commonly used in quantitative trait loci (QTL) mapping experiments [32,33] and genome-wide association studies (GWAS) for the detection of marker-trait associations (MTAs) in wheat [34–38]. DArTseq, a variant of the microarray-based DArT technology, has also been widely used in QTL mapping [39–41]. It reduces the complexity of the genome, using combinations of restriction enzymes [42] and next-generation sequencing. Several studies have assessed MTAs in durum wheat. The analyzed traits include grain yield and yield components [6,37,43,44], heading date [6], and grain quality traits (thousand kernel weight, vitreousness, protein content, sedimentation index [17,45–47], yellow colour [48,49]).

In this study, three panels of elite durum wheat lines from CIMMYT were assessed in field trials conducted over multiple seasons and with differing water regimes. The AM approach was used to detect SNP and DArT markers associated with heterogeneous agronomic and quality trait data in order to test the approach as a tool for marker discovery within a live and ongoing breeding programme.

#### **2. Material and Methods**

#### *2.1. Plant Material, Phenotyping and Genotyping*

Panels of elite durum lines from CIMMYT wheat preliminary yield trials (PYT), comprising a total of 294 accessions (Supplementary Materials Table S1), were used for agronomic and quality assessment. PYT trials consisted of the best advanced breeding lines which were promoted to unreplicated trials, including one or two repeated checks. The trials were sown, assessed and analysed according to their specific statistical designs [50] and consisted of two blocks with different water treatments, one with full irrigation (FI) and the other with reduced irrigation (RI). In the FI treatment four to five irrigations were applied during the field season to maintain the optimal soil moisture conditions, whilst in the RI block a single irrigation was applied at planting, in order to ensure establishment. In both water treatments the irrigation was applied by gravity in furrows. The rainfall data (https://www.meteoblue.com/en/weather/historyclimate/weatherarchive/ciudad-obreg%c3%b3n\_mexico\_4013704) for the four field seasons is included in Supplementary Materials Figure S1. The agronomic and quality assessment of the panels over seasons is summarised in Table 1.


**Table 1.** Agronomic and quality assessment of wheat field trials. The number of lines, year, location and water regime applied is shown.

YAQ: Yaqui (Mexico); FI: Full irrigation; RI: Reduced irrigation.

Field experiments were conducted at CIMMYT's experimental station in the Yaqui Valley, Mexico (27.282848° N; 109.923878° W) over four field seasons (2012 to 2015 harvest years, inclusive). Wheat panels 2 and 3 were phenotypically assessed across years, while panel 1 was only grown in 2012. The experimental plots (1.6 × 3 m) were sown in November/December of each year and harvested in May of the following year. Data for yellow rust were assessed in semi-controlled conditions at CIMMYT's experimental station in Toluca (Mexico).

Plant material for genetic analysis was harvested for each line at the 4th leaf stage (growth stage 14 on the Zadoks scale [51]) and immediately frozen in dry ice. Samples were stored at −80 °C until DNA extraction. Approximately 100 mg of frozen tissue was used for DNA isolation with a DNeasy Plant Mini Kit from Qiagen, following the manufacturer's protocol. DNA sample quality and concentration were assessed using electrophoresis on a 0.8% agarose gel and the restriction enzyme Tru1I (ThermoFisher) was used to check for the absence of nucleases in DNA prior to genotyping.

Samples were genotyped by Diversity Arrays Technology Pty Ltd. (Montana St, University of Canberra, Bruce ACT 2617, Australia) (DArT) using DArTseq™. A total of 35,509 polymorphic dominant DArT markers and 9142 biallelic SNP markers were generated. Both datasets were thinned by removing one marker from each pair with a correlation coefficient of >0.95. The final dataset consisted of 14,588 DArT markers (of which 8411 were mapped) and 5716 SNP markers (4142 mapped markers). DArTseq™ genotyping and mapping of the corresponding markers to the wheat reference genome sequence RefSeq v1 from the International Wheat Genome Sequencing Consortium (IWGSC, http://www.wheatgenome.org/) was performed by DArT (diversityarrays.com), as described by Sukumaran et al. [43]. The distribution of markers across the A and B subgenomes is given in Table 2.
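
The thinning step can be sketched as a greedy filter: markers are visited in order, and a marker is dropped if it is too strongly correlated with any marker already kept. The function names and the use of the absolute correlation (so that inversely coded duplicates are also removed) are assumptions for illustration, not details confirmed by the source.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic marker carries no correlation information
    return sxy / sqrt(sxx * syy)


def thin_markers(genotypes, max_r=0.95):
    """Greedily keep markers whose |r| with every kept marker is <= max_r.

    genotypes maps marker name -> list of numeric genotype codes per line.
    """
    kept = []
    for name, calls in genotypes.items():
        if all(abs(pearson(calls, genotypes[k])) <= max_r for k in kept):
            kept.append(name)
    return kept
```

Because the filter is greedy, the result depends on marker order; the first marker of each redundant pair is the one retained.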


**Table 2.** Molecular markers distribution across the wheat genome. The distribution of DArT (Diversity Arrays Technology, left) and single nucleotide polymorphism (SNP, right) markers across the durum wheat A and B subgenomes and the number of unmapped markers are shown.

Chr: chromosome; A: wheat A subgenome; B: wheat B subgenome; Un: unmapped; 1: total markers by group; 2: total markers by genome.

#### *2.2. Phenotypic Data*

Ten agronomic and quality traits were assessed for three durum wheat elite line panels: days to heading (days, DTH); plant height (cm, PH); lodging (%, LOD); yellow rust (%, YR); yellow colour (YC); sedimentation index (cm<sup>3</sup>, SDS); sedimentation volume (cm<sup>3</sup>, SV); grain protein content (%, GPC); thousand kernel weight (g, TKW); and grain yield (kg/ha, GY). Agronomic traits (DTH, PH, LOD, YR, TKW and GY) were assessed under both water treatments (FI and RI), and quality traits (YC, SV, SDS and GPC) were only evaluated under FI conditions. Visual disease evaluation and phenology assessments were made in the field, while quality parameters were evaluated on grain samples post-harvest. DTH, PH and LOD were visually assessed at the field trials in Yaqui, while YR was assessed at Toluca. To assess DTH, heading date was recorded as the time when 50% of the spikes had emerged from the flag leaf sheath (stage 59 on the Zadoks scale [51]); PH was recorded by measuring the distance between the stem's base and the top of the spike (awns not included); LOD was assessed as the percentage of the plot that had lodged; and YR was assessed as the percentage of leaves with rust pustules. YC was assessed by a rapid colorimetric method with a Minolta color meter following CEN/TS 15465:2008 [52–54]; SDS was evaluated following UNE 34903:2014 [55,56]; SV and GPC were assessed by near-infrared spectroscopy (NIRS) [57]; TKW was measured by weighing two samples of 100 whole kernels, randomly chosen and previously dried at 70 °C for 48 h.

The correlation between the assessed traits was analysed using the 'cor' function in R [58–60]. Then, an analysis of variance (ANOVA) was undertaken, using the 'aov' function in R [61], to obtain the descriptive statistics for each trait.

Traits were analysed using a Q + K linear mixed-model [62,63] which follows the model equation:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{S}\boldsymbol{\alpha} + \mathbf{Q}\mathbf{v} + \mathbf{Z}\boldsymbol{\mu} + \boldsymbol{\varepsilon} \tag{1}$$

where y is a vector of observed phenotypes; X, S and Z are incidence matrices relating y to β, α and μ, respectively; β is a vector of fixed effects; α is a vector of marker effects; Qv is a vector of population effects; μ is a vector of polygenic effects (with covariance proportional to a kinship or relationship matrix); and ε is a vector of residuals.
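
For reference, the random terms of a Q + K model are usually assumed to follow the distributions below. This is the standard formulation rather than a setting confirmed by the source; the symbols K (kinship matrix), I (identity matrix), σ<sub>g</sub><sup>2</sup> (genetic variance, absorbing any constant scaling of K) and σ<sub>e</sub><sup>2</sup> (residual variance) are introduced here for illustration.

$$\boldsymbol{\mu} \sim N\left(\mathbf{0}, \sigma_g^2 \mathbf{K}\right), \qquad \boldsymbol{\varepsilon} \sim N\left(\mathbf{0}, \sigma_e^2 \mathbf{I}\right)$$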

These analyses were carried out using GenStat (14th Edition) to generate the best linear unbiased estimates (BLUEs) of variety performance in different ways: (i) across years and blocks; (ii) across years for each block (FI and RI); (iii) across a reduced dataset (years 2013 and 2014) and blocks; and (iv) across the reduced dataset for each block. The resulting datasets (available in Supplementary Material Table S2) were then used in different association mapping analyses.

#### *2.3. Population Structure and Linkage Disequilibrium*

Population structure was assessed using principal component analysis (PCA) based on the combined DArT and SNP genotyping datasets. Euclidean distances were calculated using the R package 'ggfortify' [64] and the PCA was visualised with 'ggplot2' [65].

The pattern of linkage disequilibrium (LD) was assessed between each pair of SNP markers on the same chromosome across the two constitutive genomes with the allele frequency correlation (r<sup>2</sup>) using the 'popgen' package in R [66]. A heatmap was obtained with the *D'* and r<sup>2</sup> values for each chromosome, together with a scatterplot to determine LD decay (genetic distance in cM).
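
The two LD measures can be computed from haplotype counts as sketched below. This is an illustrative calculation, not the 'popgen' implementation; it assumes phased biallelic data coded 0/1 (a reasonable approximation for highly inbred lines, but an assumption nonetheless) and a hypothetical function name.

```python
def ld_stats(haplotypes):
    """D, D' and r^2 for two biallelic loci.

    haplotypes is a list of (allele_at_locus_1, allele_at_locus_2) pairs,
    with alleles coded 0 or 1. Both loci must be polymorphic.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n  # frequency of allele 1 at locus 1
    p_b = sum(b for _, b in haplotypes) / n  # frequency of allele 1 at locus 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2
```

Two loci that always co-occur give *D'* = 1 and r<sup>2</sup> = 1, whereas loci with all four haplotypes at equal frequency give zero for both.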

#### *2.4. Association Mapping (AM)*

The AM analyses were performed on the BLUEs obtained above using an additive model with the 'rrBLUP' [67] and 'GWASpoly' [68] packages in R in different ways. Two marker-based kinship matrices (K-matrices), created from a subset of 14,588 DArT and 5716 SNP markers, respectively, were used for the adjustment based on relatedness of individuals (Supplementary Materials Tables S3 and S4). A minor allele frequency (MAF) threshold of 0.05 was used. To establish a *p*-value detection threshold for statistical significance of associations, the Bonferroni correction, which employs a threshold of α/m (where m is the number of tests) to ensure a genome-wide type I error of 0.05, was applied with a total of 1000 permutations.
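
The Bonferroni threshold is simple to compute. The sketch below uses a hypothetical helper name and returns both the per-test *p*-value threshold and its −log10 value, the scale normally used on Manhattan plots.

```python
from math import log10


def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value threshold (alpha / m) and its -log10 value."""
    threshold = alpha / n_tests
    return threshold, -log10(threshold)
```

As an illustration only, taking m as the 5716 SNP markers gives a threshold of about 8.75 × 10<sup>−6</sup> (−log10 ≈ 5.06); the number of tests actually used after MAF filtering is not stated here and may differ.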

Associated DArTseq™ and SNP markers were blasted against the wheat reference assembly RefSeq v1 [69] with no indels or mismatches allowed, using an ad hoc Java program, to confirm their physical mapping location on the A or B genomes. The molecular markers were also mapped against the durum wheat genome (https://www.interomics.eu/durum-wheat-genome) to confirm their physical positions. In addition, to identify candidate genes, results were filtered, selecting the hits with the best *e*-value for each molecular marker, and candidate genes were manually selected based on their annotations.

#### **3. Results**

#### *3.1. Phenotypic Assessment*

Results from the ANOVA for all the traits across years and water treatments are shown in Table 3. The mean phenotypic values across years were calculated for each block and panel to evaluate the influence of water conditions on the assessed traits (Table 3). Days to heading during the field seasons assessed ranged from 63 to 94 days. In plots with lower water availability (RI block), spike emergence from the flag leaf took place approximately 11 days earlier than in FI plots. Plant height ranged from 39 to 110 cm and differed between water regimes, with a decrease of 25–30 cm under RI conditions. Likewise, as a result of the RI treatment, GY (ranging from 1.35 to 10.63 t/ha) and TKW (from 29.6 to 63.2 g) also varied, being reduced by 4–5 t/ha and 7–10 g, respectively, in the low water availability RI treatment. This strong RI treatment resulted in very low heritability values for DTH, PH and LOD.

Several significant phenotypic correlations were observed between the analysed traits (Figure 1 and Supplementary Materials Table S5). The most correlated traits were PH and GY (r = 0.90, *p*-value < 2.2 × 10<sup>−16</sup>), followed by DTH and GY (r = 0.87, *p*-value < 2.2 × 10<sup>−16</sup>), SDS and SV (r = 0.85, *p*-value < 2.2 × 10<sup>−16</sup>) and also DTH and PH (r = 0.82, *p*-value < 2.2 × 10<sup>−16</sup>). Other traits showed important positive correlations too, including YC and DTH (r = 0.69, *p*-value = 2.82 × 10<sup>−9</sup>), GY and TKW (r = 0.66, *p*-value < 2.2 × 10<sup>−16</sup>), PH and TKW (r = 0.62, *p*-value < 2.2 × 10<sup>−16</sup>), GY and YC (r = 0.53, *p*-value = 0.039) and DTH and TKW (r = 0.49, *p*-value < 2.2 × 10<sup>−16</sup>). Negative correlations were also observed for SDS and GPC (r = −0.38, *p*-value < 2.2 × 10<sup>−16</sup>), and for GPC and YC (r = −0.08, *p*-value < 2.2 × 10<sup>−16</sup>) (Figure 1).

**Figure 1.** Phenotypic correlations. YR: yellow rust; DTH: days to heading; PH: plant height; GY: grain yield; TKW: thousand kernel weight; YC: yellow colour; SV: sedimentation volume; SDS: sedimentation index; and GPC: grain protein content.


#### *3.2. Population Structure and Linkage Disequilibrium*

The PCA used a total of 14,588 DArT and 5716 SNP markers. The first and second principal components explained 3.91% of the genetic variation (Figure 2). No underlying genetic structure was detected within or between the panels assessed. LD was estimated using the mapped SNP markers dataset. LD decay was determined within 20–30 cM for all the chromosomes (Figure 3). Using the classification defined by Maccaferri et al. [70], the markers presented loose linkage (Class 2), with distance values of 21 to 50 cM.

**Figure 2.** Principal components analysis (PCA) of the genotypic data.

The LD decay was established between 20–30 cM.

#### *3.3. AM Analysis*

Thirty-seven significant marker-trait associations (MTAs) were detected for TKW and SV across all years and water treatments with most of the significant markers located on chromosome 2A (Table 4). Twenty DArT and seven SNP markers were found in association with TKW on chromosome 2A (with additive effects ranging from −3.41 to 3.46). In addition, eight unmapped DArT and one SNP marker were also associated with TKW (additive effects ranged from −3.39 to 3.46 g). Most of these MTAs showed a negative additive effect, reducing the final weight value (ranging from −2.84 to −3.19 g), and only two MTAs were found to increase TKW (values of 2.97 and 3.09 g). Finally, a single SNP associated with SV was located on chromosome 1B (showing a positive effect increasing the final value by 1.26 mL). The resulting Manhattan and QQ-plots from this AM analysis are included in Supplementary Materials Figures S2–S5.

The AM analyses on partitioned subsets of the data consistently detected the QTLs for TKW and SV. Nevertheless, the individual assessment of the water treatments significantly reduced the number of MTAs found, due in part to less available data for the RI block (Supplementary Material Table S6). The initial dataset of 294 durum wheat elite lines was reduced to 200 lines (assessed during the 2013 and 2014 seasons) to give a dataset balanced across assessment years. Using this reduced dataset for AM analysis, the results confirmed the QTLs previously found for the full dataset (Supplementary Materials Table S6). The analysis also detected an additional locus for SV on chromosome 4A, and a locus for YR on chromosome 1B (with additive effects of −0.84 and 2.79, respectively).



#### *3.4. Candidate Genes Analysis*

The marker *SNP620*, located on chromosome 1B and detected in association with SV, was found within a cluster of 12 genes encoding blue copper proteins (BCP), with homoeologs in the two durum wheat subgenomes (Figure 4, Table 5 and Supplementary Material Figure S6). In the hexaploid wheat genome, this set of genes forms a cluster of homoeolog triads [72] with a total of 31 genes (Supplementary Materials Table S7 and Figure S7). Additionally, another interesting gene (*TraesCS1B01G568400LC.1*), coding for the ubiquinone biosynthesis *O*-methyltransferase, was found close to this marker.

Several markers located on chromosome 2A were in close proximity to some interesting genes. Markers *SNP1183*, *SNP1184* and *DArT3165* were found next to several genes encoding reductase-1 (Figure 4 and Table 5). In addition, the marker *SNP8395*, also located on the same chromosome, was found in proximity to the gene *TraesCS2A01G309700.1*, which encodes a type A response regulator 1 (Figure 4 and Table 5).

Significant MTAs from the partitioned analysis also allowed the identification of potentially interesting genes. On chromosome 1B, marker *DArT1744*, previously described by Mérida-García et al. [73] related to high molecular-weight glutenin subunits, was found in proximity to genes encoding isocitrate dehydrogenase kinase/phosphatase G and leucine-rich repeat receptor-like protein kinase family protein. These genes participate in the carbohydrate metabolism during the Krebs cycle and play a crucial role in plant development and stress responses, respectively [74,75]. On this chromosome, another marker (*SNP809*) was found in proximity to some interesting genes encoding sugar transporter proteins. Additionally, some markers located on chromosome 2A were found in proximity to Acyl-CoA *N*-acyltransferase genes (*SNP1206*, *SNP8395* and *DArT3180*) and chloroplastic zeaxanthin epoxidase (*SNP1189*) (Supplementary Materials Table S8).
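The proximity criterion used for candidate genes (a ±50 kb window around each marker, as stated for Table 5) can be sketched as below. The coordinates, gene identifiers and function names are hypothetical, and measuring distance to the nearest gene boundary is an assumption of this sketch; the text does not specify how distance was computed.

```python
from typing import List, Tuple

Gene = Tuple[str, str, int, int]  # (gene_id, chromosome, start_bp, end_bp)


def distance_to_gene(pos: int, start: int, end: int) -> int:
    """Distance in bp from a marker position to a gene interval (0 if inside)."""
    if start <= pos <= end:
        return 0
    return min(abs(pos - start), abs(pos - end))


def genes_near_marker(chrom: str, pos: int, genes: List[Gene],
                      window: int = 50_000) -> List[Tuple[str, int]]:
    """Return (gene_id, distance) pairs for genes within `window` bp of the marker."""
    nearby = []
    for gene_id, gene_chrom, start, end in genes:
        if gene_chrom != chrom:
            continue
        dist = distance_to_gene(pos, start, end)
        if dist <= window:
            nearby.append((gene_id, dist))
    return sorted(nearby, key=lambda item: item[1])


# Hypothetical coordinates for illustration only.
annotation = [
    ("geneX", "2A", 1_000_000, 1_004_000),
    ("geneY", "2A", 1_060_000, 1_065_000),
    ("geneZ", "1B", 1_010_000, 1_012_000),
]
print(genes_near_marker("2A", 1_020_000, annotation))
```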

**Figure 4.** Location of candidate genes on chromosomes 1B and 2A. Markers are indicated with the symbol '-'; blue copper proteins are shown in blue colour; ubiquinone biosynthesis *O*-methyltransferase in purple colour; the regulator response gene in brown colour; and reductase 1 genes in green colour.


**Table 5.** Candidate genes for markers with significant marker-trait associations. Genes located in proximity of markers found in association with TKW and SV across years and water treatments (within a ±50 kb window). Values for physical position and distance are indicated in base pairs (bp). Chr: chromosome position. Blue copper proteins are shown in blue colour; ubiquinone biosynthesis *O*-methyltransferase in purple colour; the regulator response gene in brown colour; and reductase 1 genes in green colour.




#### **4. Discussion**

The maintenance of crop production is a current and pressing need given growing populations and reduced availability of arable land [76]. There is an increasing need for breeding programs to improve yield potential and the adaptation of new varieties to different biotic and abiotic stresses [77]. Abiotic stresses, including drought and heat, are impacting productivity across the millions of hectares of wheat grown worldwide each year [78]. Detailed molecular and phenotypic characterizations are valuable tools in the dissection of complex traits [79], especially those that are influenced by water availability [14].

The improvement of key traits is essential for better end-use production quantity and quality in wheat [80]. In this study, we analysed a set of 10 agronomic and quality traits under full irrigation conditions (FI), with an additional six traits also assessed under low water availability (RI) in order to understand trait variation under contrasting water regimes in the CIMMYT durum wheat breeding programme. Irrigation conditions influenced some important yield and yield-related traits such as GY and TKW, as well as adaptive traits including DTH and PH (Table 3). The RI treatment decreased final yields, in line with previous observations [6,81]. Previous reports have also shown TKW to be reduced by high temperatures [17], most likely in relation to water availability. Groos et al. [82] assessed the genetic basis of grain yield and protein content in a segregating population of wheat RILs grown at six locations and also identified effects from GxE interactions involving protein content and yield. Our mean trait values corroborated this, with the highest values for GY recorded for FI blocks across panels (see Table 3). A similar trend was shown for DTH, PH and TKW, which decreased under low-water regimes.

Correlations between the assessed traits showed that GY was positively correlated with two different phenology traits (PH and DTH). This is in agreement with Maccaferri et al. [6], who showed important positive and negative correlations for GY and DTH, and also positive correlations for GY and PH in several environments with different water regimes. DTH and PH were also positively correlated (Supplementary Materials Table S5), with taller plants having a longer time period to the emergence of the tip of the spike stage.

Wheat TKW is an important yield component with a direct effect on grain yield [83,84]. In line with this, our results showed a significant and positive correlation between TKW and GY. However, the previously reported negative correlation between TKW and DTH [6] was not observed, potentially as a result of temperatures and water availability from emergence to heading, and also from heading to harvest. Rharrabti et al. [17] previously highlighted a positive correlation between protein content and TKW, which is in agreement with the results obtained in the present study. They highlighted that warm temperatures during grain filling could negatively affect this correlation.

Significant associations between endosperm proteins (gliadin and glutenin subunits) and SV have been previously highlighted [85,86]. Here we found a positive correlation (r = 0.15) between SV and GPC. This correlation is thought to be the result of grain protein content influencing the sedimentation volume value [87], which depends on the degree of protein hydration and oxidation [88]. Finally, for sedimentation index (SDS) analysis, we observed a negative correlation with protein content, in agreement with results presented by Rharrabti et al. [17,45]. This is also in agreement with Oelofse et al. [89] who highlighted the significant influence of protein content on the SDS sedimentation test [90–92].

The SNP and DArT markers used to analyze patterns of genetic structure (Figure 2) and LD (Figure 3) for the durum wheat lines revealed no detectable genetic structure and consistent patterns of LD across chromosomes (LD was estimated to extend ~20 cM). These results support the suitability of durum elite line sets currently in use in breeding programmes for the practical application of GWAS analysis. The rate of unmapped markers was lower for SNP than for DArT markers (27.5% vs. 42.0%, Table 2), indicating higher precision in genetically mapping SNP markers, probably as a result of co-dominance.

In the assessment of MTAs for quality and yield-related traits, different AM analyses were performed on subsets of the dataset. Several MTAs for SV and TKW were detected across years and water regimes, located on chromosomes 1B and 2A, respectively. All GWAS analyses corroborated the major QTLs previously detected, and reported two new QTLs, one for YR on chromosome 1B, and another for SV on chromosome 4A.

Associations on chromosome 1B were significant for wheat quality. There are known loci including Gli-B1/Glu-B3 on this chromosome, which are of great importance for some gliadin and glutenin subunits [93]. In fact, the candidate gene analysis reported the presence of a high molecular-weight glutenin subunit (HMW-GS) gene in the proximity of marker *DArT1744* (found to be significantly associated with SV and SDS), which was previously described by Mérida-García et al. [73] in association with specific weight. In line with this, Pogna et al. [93] highlighted the importance of Glu-B3 for determining protein quality with these endosperm proteins showing significant effects on SV, which also showed a high and positive correlation with SDS in our study (Figure 1 and Supplementary Materials Table S5). Likewise, Blanco et al. [86] reported three QTLs on chromosomes 1B, 6A and 7B (based on the analysis of 259 polymorphic markers) associated with SDS and SV in a recombinant inbred line population. In the present study, we found a SNP marker (*SNP620*) associated with SV, showing a positive additive effect of 1.26 (see Table 4) and also with SDS (marker effect of 0.11) (Supplementary Materials Table S6). This marker was previously placed on chromosome 1B, in the same location as MTAs for gluten index and sedimentation index [73]. Other previous studies, such as Reif et al. [94] and Fiedler et al. [95], also reported markers associated with SV on chromosome 1B, but with differing genetic positions. The additional locus for SV found on chromosome 4A (marker *DArT9459*) has not been previously reported.

Marker *SNP620*, significantly associated with SV, is located within a cluster of homoeolog gene triads coding for blue copper proteins (Table 5 and Supplementary Materials Figure S6). These proteins have been described as containing a copper atom and participating in redox processes [96], with a crucial role in the electron shuttle in plants [97]. In addition, Yao et al. [98] described the blue copper protein genes as the targets of miR408 in wheat, which is involved in the regulation of gene transcription required for heading time [99]. In our study, *SNP620* was also found in proximity to a gene coding for a ubiquinone biosynthesis *O*-methyltransferase. Liu et al. [100] highlighted the crucial role of ubiquinone as an electron transporter in the electron transport chain of the aerobic respiratory chain. This ubiquinone gene is involved in plant growth and development, is implicated in the biosynthesis and metabolism of chemical compounds involved in plant responses to stress, and also participates in gene expression regulation and cell signal transduction [100].

On chromosome 1B we also found a significant MTA for yellow rust, in agreement with previous studies in durum and bread wheat, which placed different markers significantly associated with this trait at differing genetic positions [101–104]. The candidate gene analysis revealed the proximity of this marker (*SNP809*) to sugar transporter protein genes. Sugars are formed during the photosynthetic process and are essential for plant nutrition. Sucrose transport has been considered of great importance for plant productivity [105]. In line with this, sucrose is involved in the regulation of gene expression in the putative sugar-sensing pathway [106,107].

The majority of MTAs for TKW were located on chromosome 2A, showing both positive and negative effects. Previous studies have reported different markers in association with this quality trait, including a number mapped on chromosome 2A [38,108–110]. One of the markers found by Yao et al. [38] (SSR marker *gwm445* on chromosome 2A, 68.2 cM) belongs to the same QTL found in this study for the marker *SNP1153* (chromosome 2A, 68.6 cM), which was also found by Juliana et al. [111] in bread wheat lines from CIMMYT's first-year yield trials. Sukumaran et al. [43] analysed a durum wheat panel of 208 lines under yield potential, heat and drought stress conditions, and identified markers on chromosome 2A with a similar position to those detected in this study (4 markers at 70 cM and 6 markers at 69 cM) under heat stress conditions. They highlighted that several SNP markers were related to transmembrane proteins or to uncharacterized proteins. We found several candidate genes for this important TKW QTL (Table 5), among which the most striking feature is the presence of four reductase 1 genes (NADPH-dependent 6′-deoxychalcone synthase) and a type A response regulator 1 (Figure 4). These genes are both related to photosynthesis. Hu et al. [112] highlighted that NADPH plays a crucial role in biological processes in plants, such as the regulation of the production of reactive oxygen species (ROS) for stress tolerance [113,114]. Additional GWAS analyses using reduced datasets revealed other interesting genes for this QTL (chromosome 2A, Supplementary Materials Table S8), encoding Acyl-CoA *N*-acyltransferase and chloroplastic zeaxanthin epoxidase. The first gene has several functions in signaling and metabolic pathways [115]. Zeaxanthin epoxidase plays an important role in the xanthophyll cycle and abscisic acid (ABA) biosynthesis. The main functions of the xanthophyll cycle are to dissipate excess light energy and to increase the stability of the photosynthetic system [116].

The proposed approach has successfully detected genetic markers with significant associations with TKW, SV, SDS and YR. These are of potential use in durum wheat breeding programs, and can be further interrogated to the candidate gene level using the RefSeqv1 bread wheat genome reference [69] and the durum wheat genome reference [71].

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2073-4395/10/1/144/s1: Figure S1: Rainfall data for Ciudad Obregon (Mexico) for the growing seasons 2012–2015; Figure S2: Manhattan plots for durum wheat mapped DArT markers. DTH: days to heading; PH: plant height; GY: grain yield; TKW: thousand kernel weight; YC: yellow color; SV: sedimentation volume; SDS: sedimentation index; and GPC: grain protein content; Figure S3: Manhattan plots for durum wheat mapped SNP markers. DTH: days to heading; PH: plant height; GY: grain yield; TKW: thousand kernel weight; YC: yellow color; SV: sedimentation volume; SDS: sedimentation index; and GPC: grain protein content; Figure S4: Quantile quantile-plots from genome-wide association studies (GWAS) analysis for durum wheat DArT markers (mapped and unmapped). DTH: days to heading; PH: plant height; LOD: lodging; GY: grain yield; TKW: thousand kernel weight; YC: yellow color; SV: sedimentation volume; SDS: sedimentation index; and GPC: grain protein content; Figure S5: Quantile quantile-plots from GWAS analysis for durum wheat SNP markers (mapped and unmapped). DTH: days to heading; PH: plant height; LOD: lodging; GY: grain yield; TKW: thousand kernel weight; YC: yellow color; SV: sedimentation volume; SDS: sedimentation index; and GPC: grain protein content; Figure S6: Blue copper protein gene cluster on durum wheat chromosome 1B. High confidence genes are shown in green colour, low confidence genes are shown in yellow; Figure S7: Cluster tree of blue copper protein gene homoeologs in bread wheat (RefSeqv1 [69]). 
For chromosome 1A, high confidence (HC) and low confidence (LC) genes are shown in brown and orange colour, respectively; for chromosome 1B, HC and LC genes are shown in dark and light green colour, respectively; for chromosome 1D, HC and LC genes are shown in dark and light blue colour, respectively; Table S1: Durum wheat elite lines assessed; Table S2: Best Linear Unbiased Estimates (BLUEs) outputs for all assessed traits and the association mapping analyses performed in durum wheat: [i] across years and blocks; [ii] across years for each block (FI and RI); [iii] across years and blocks for a reduced dataset (years 2013 and 2014); and [iv] across the reduced dataset for each block. DTH: days to heading; GPC: grain protein content; GY: grain yield; PH: plant height; SDS: sedimentation index; SV: sedimentation volume; TKW: thousand kernel weight; and YC: yellow colour; YR: yellow rust; LOD: lodging; Table S3: Kinship matrix for durum wheat DArT markers; Table S4: Kinship matrix for durum wheat SNP markers; Table S5: Phenotypic correlations between the assessed traits in durum wheat and their corresponding p values. YR: yellow rust; DTH: days to heading; PH: plant height; LOD: lodging; GY: grain yield; TKW: thousand kernel weight; YC: yellow color; SV: sedimentation volume; SDS: sedimentation index; and GPC: grain protein content; Table S6: Marker-trait associations found for the association mapping analyses performed in durum wheat: [i] across years and blocks; [ii] across years for each block (FI and RI); [iii] across years and blocks for a reduced dataset (years 2013 and 2014); and [iv] across the reduced dataset for each block. 
SV: sedimentation volume; TKW: thousand kernel weight; SDS: sedimentation index; YR: yellow rust; "-": unmapped marker; Table S7: Homoeolog triads for blue copper protein genes mapped in the wheat reference assembly RefSeqv1 [69]; Table S8: Candidate genes for GWAS analyses performed in durum wheat: [i] across years for each block (FI and RI); [ii] across years and blocks for a reduced dataset (years 2013 and 2014); and [iii] across the reduced dataset for each block. Molecular markers mapping positions are shown both in the durum wheat genome [71] and the wheat reference assembly RefSeqv1 [69]; Supplementary Material S1. R script used to perform the GWAS analysis; Supplementary Material S2. R script used to perform the LD analysis.

**Author Contributions:** P.H., I.S. and K.A. conceived the experiment. K.A. and I.S. selected the plant materials and agronomic traits. K.A. managed the field trials and contributed the phenotypic data. I.S., P.H., G.D., S.G., R.M.-G. and A.R.B. analysed the genotypic and phenotypic data. R.M.-G. isolated the DNA. R.M.-G., A.R.B. and P.H. carried out the AM analyses. R.M.-G., A.R.B. and P.H. drafted the manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by project P12-AGR-0482 from Junta de Andalucía (Andalusian Regional Government), Spain (Co-funded by FEDER). ARB is supported by the UK Biotechnology and Biological Sciences Research Council (BBSRC) 'Designing Future Wheat' cross-institute strategic programme (BB/P016855/1). PH is supported by the Spanish Ministry of Economy, Industry and Competitiveness (MINECO) project AGL2016-77149-C2-1-P.

**Conflicts of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

*Article*

## **Exploring the Genetic Architecture of Root-Related Traits in Mediterranean Bread Wheat Landraces by Genome-Wide Association Analysis**

#### **Rubén Rufo <sup>1</sup>, Silvio Salvi <sup>2</sup>, Conxita Royo <sup>1</sup> and José Miguel Soriano <sup>1,</sup>\***


Received: 3 April 2020; Accepted: 23 April 2020; Published: 25 April 2020

**Abstract:** Background: Roots are essential for drought adaptation because of their involvement in water and nutrient uptake. As the study of the root system architecture (RSA) is costly and time-consuming, it is not generally considered in breeding programs. Thus, the identification of molecular markers linked to RSA traits is of special interest to the breeding community. The reported correlation between the RSA of seedlings and adult plants simplifies its assessment. Methods: In this study, a panel of 170 bread wheat landraces from 24 Mediterranean countries was used to identify molecular markers associated with the seminal RSA and related traits: seminal root angle, total root number, root dry weight, seed weight and shoot length, and grain yield (GY). Results: A genome-wide association study identified 135 marker-trait associations explaining 6% to 15% of the phenotypic variances for root related traits and 112 for GY. Fifteen QTL hotspots were identified as the most important for controlling root trait variation and were shown to include 31 candidate genes related to RSA traits, seed size, root development, and abiotic stress tolerance (mainly drought). Co-location for root related traits and GY was found in 17 genome regions. In addition, only four out of the fifteen QTL hotspots were reported previously. Conclusions: The variability found in the Mediterranean wheat landraces is a valuable source of root traits to introgress into adapted phenotypes through marker-assisted breeding. The study reveals new loci affecting root development in wheat.

**Keywords:** drought stress; association mapping; root system architecture; QTL hotspot; seminal root

#### **1. Introduction**

Wheat is the most widely cultivated crop in the world, covering around 219 million ha (Faostat 2017, http://www.fao.org/faostat/). It is a staple food for humans, as it provides 18% of daily human intake of calories and 20% of protein (http://www.fao.org/faostat/). Global wheat demand is estimated to increase by 60% by the year 2050 [1], so wheat production will need to rise by 1.7% per year until then. Achieving this objective is a great challenge under the current climate change scenario, as the prediction models estimate a precipitation decrease of 25% to 30% and a temperature increase of 4 °C to 5 °C for the Mediterranean region [2]. It is well known that wheat production is greatly affected by environmental stresses such as drought and heat [3] that negatively affect yield and grain quality [4]. Drought is considered the greatest environmental constraint to yield and yield stability in rainfed production systems [5]. Environmental effects on yield in the Mediterranean Basin have been estimated at 60% for bread wheat [6] and 98% for durum wheat [7]. The expected effects of climate change and the declining availability of water and chemical fertilizers will require the release of cultivars with an enhanced genetic capacity to maintain acceptable yield levels and yield stability under harmful environmental conditions [8,9]. To cope with the challenges of climate change, breeders are particularly challenged to stretch the adaptability and performance stability of new cultivars, so many improvement programs are focussing on breeding for adaptation [10].

Plants respond and adapt to water deficit using various strategies that have evolved at several levels of function and are components of the conceptual framework developed by Reynolds et al. [11], which defines drought resistance in terms of dehydration escape, tolerance, and avoidance. Traits defining root system architecture (RSA) are critical for wheat adaptation to drought environments and non-optimal nutritional supply conditions [12]. In addition, water-use efficiency (WUE) can be significantly increased by optimizing the anatomy and growth features of roots [13]. Root traits are critical for drought tolerance due to their role in plant performance and the acquisition of nutrients and water from dry soils [14]. The wheat plant includes two types of roots: seminal (embryonal) and nodal (crown or adventitious or adult root system). The seminal roots are the first to penetrate the soil and remain functional during the whole plant cycle [9,15]. A correlation between seminal and adult roots in terms of size, dry weight, or even specific architectural features has been reported [9,13]. Since the evaluation of RSA features in the field is very difficult, expensive, and time-consuming when a large number of genotypes need to be phenotyped, several studies have been carried out at early growth stages to allow an optimal screening of RSA traits [8,12,16–18]. Maccaferri et al. [9] observed that among RSA traits, those involving the root structure and related to the uptake of nutrients and water are root length, surface area and volume, and the number of roots, while root diameter is significantly associated with drought tolerance. Another RSA trait of interest in wheat is the seminal root angle (SRA), whose features suggest that narrow angles could lead to deeper root growth to obtain water from deeper soil layers and hence maintain higher yields [5,13].

Identifying quantitative trait loci (QTLs) and applying marker-assisted selection is of particular interest for RSA because the trait is important but difficult to phenotype. In the last few years, genome-wide association studies (GWAS) have become very popular because of their use of germplasm collections with wider variability than the classical bi-parental crosses. These collections allow many recombination events to be detected, making the association between genotype and phenotype more accurate. Collections of landraces are an ideal subject of GWAS [19] since they are genetically diverse repositories of unique traits that have evolved in local environments characterized by a wide range of biotic and abiotic conditions. Several studies have shown that Mediterranean wheat landraces possess a wide genetic background for root architecture, yield formation, stress tolerance, and quality traits [17–22]. In the current study, a GWAS for three RSA traits and two related traits was performed on a panel of 170 bread wheat (*Triticum aestivum* L.) landraces from 24 Mediterranean countries with the following goals: (1) to detect differences in RSA among genetic subpopulations previously distinguished in the panel, (2) to identify correlations among RSA and grain yield under rainfed conditions, and (3) to identify molecular markers and candidate genes linked to root-related traits and candidate gene models for the associations.

#### **2. Materials and Methods**

#### *2.1. Plant Material*

A germplasm collection of 170 bread wheat (*Triticum aestivum* L.) genotypes from the MED6WHEAT IRTA panel described by Rufo et al. [23] was used in this study. The panel was genotyped and characterized using the Illumina Infinium 15K Wheat SNP Chip at Trait Genetics GmbH (Gatersleben, Germany), and markers were ordered according to the SNP map developed by Wang et al. [24]. The collection was previously structured into three subpopulations (SPs) matching their geographical origin [23]: western (SP1, WM), northern (SP2, NM), and eastern Mediterranean (SP3, EM) (Supplementary Materials, Table S1). Additionally, the cultivars 'Arthur Nick', 'Anza', 'Soissons', and 'Chinese Spring' were included as checks.

#### *2.2. Root Morphology and Statistical Analysis*

Root analysis was performed following the protocol described by Canè et al. [8], which was slightly modified in the current study (Figure 1). Ten representative seeds were randomly chosen from each genotype, weighed, sterilized in a 10% sodium hypochlorite solution for 5–10 min, washed thoroughly in distilled water and placed on hydrated filter paper in a 140 mm Petri dish at 28 °C for 24 h. Subsequently, five seedlings were selected on the basis of a normal seminal root emergence and were spaced 8 cm from each other on a filter paper sheet placed on a vertical black rectangular polycarbonate plate (42.5 × 38.5 cm). Finally, each plate was covered with another wet sheet of filter paper. Distilled water was used for the plantlets' growth. The plantlets were grown in a growth chamber for 14 days at 22 °C under a 16-h light photoperiod. In addition to the ten-seed weight (SW), four other traits were scored for each genotype: total root number (TRN); shoot length (SL), measured from the seed to the tip of the longest leaf; SRA, obtained from digital camera images following the methodology described in Canè et al. [8]; and root dry weight (RDW). The images were processed with ImageJ software [25]. The angle between the two external roots of each plantlet was measured at a distance of 3.5 cm from the tip of the seed. Finally, the roots were desiccated at 70 °C for 24 h to obtain the RDW.
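The geometry of the seminal root angle measurement can be illustrated with a short Python sketch. It assumes that three image points are available for each plantlet: the seed tip and the points where the two external roots are located at the 3.5 cm reference distance. The measurements in the study were made in ImageJ; the function and coordinates below are hypothetical.

```python
import math
from typing import Tuple

Point = Tuple[float, float]  # (x, y) in cm, in image coordinates


def seminal_root_angle(seed: Point, left_root: Point, right_root: Point) -> float:
    """Angle in degrees at the seed between the rays to the two external roots."""
    v1 = (left_root[0] - seed[0], left_root[1] - seed[1])
    v2 = (right_root[0] - seed[0], right_root[1] - seed[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    # atan2 of |cross| and dot is numerically stable for small and wide angles.
    return math.degrees(math.atan2(abs(cross), dot))


# Hypothetical plantlet: roots cross the reference line 3.5 cm below the seed,
# 3.5 cm to either side, which gives a right angle between them.
angle = seminal_root_angle((0.0, 0.0), (-3.5, -3.5), (3.5, -3.5))
print(round(angle, 1))
```

A narrower angle, which the introduction associates with deeper rooting, corresponds to the two reference points lying closer together.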

The experimental design followed a randomized complete block with two replications in time. Means of five observational units for each genotype were used for TRN, RDW, and SL, while only three observational units were used for SRA because the two external ones were considered as border plantlets for root angle.

**Figure 1.** Experimental setup for the analysis of seminal root traits. Seeds were placed 8 cm apart on moist filter paper (**A**) and kept in a box with distilled water in a growth chamber for 14 days at 22 °C under a 16-h light photoperiod (**B**). (**C**) Example of seminal root angle measurement, using ImageJ software.

#### *2.3. Grain Yield*

Field experiments were carried out in the 2016, 2017, and 2018 harvesting seasons in Gimenells, Lleida, north-east Spain (41°38′ N, 0°22′ E, 260 m a.s.l.) under rainfed conditions. The experiments followed a non-replicated augmented design with two replicated checks (the cultivars 'Anza' and 'Soissons') and plots of 3.6 m<sup>2</sup>. The experimental design is shown in Supplementary Materials, Figure S1. Sowing density was adjusted to 250 germinable seeds m<sup>−2</sup>. Weeds and diseases were controlled following standard practices at the site. The anthesis date was determined in each plot. Grain yield (GY, t ha<sup>−1</sup>) was determined by mechanically harvesting the plots at maturity and expressed at a 12% moisture level.
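As an illustration of how a plot weight might be expressed in t ha<sup>−1</sup> at 12% moisture, the sketch below applies the usual dry-matter-preserving adjustment. The text does not state the exact formula used, so this conversion is an assumption, and the function name and parameters are hypothetical.

```python
def plot_yield_t_ha(grain_kg, moisture_pct, plot_area_m2=3.6, target_moisture_pct=12.0):
    """Convert a harvested plot weight (kg) to t/ha at a target grain moisture.

    Assumes the standard adjustment that keeps dry matter constant:
    weight_target = weight_measured * (100 - M_measured) / (100 - M_target).
    """
    dry_matter_kg = grain_kg * (100.0 - moisture_pct) / 100.0
    grain_at_target_kg = dry_matter_kg * 100.0 / (100.0 - target_moisture_pct)
    # kg per plot -> kg per m2 -> kg per ha -> t per ha
    return grain_at_target_kg / plot_area_m2 * 10_000.0 / 1_000.0


# Hypothetical plot: 1.8 kg of grain harvested at 14% moisture from 3.6 m2.
gy = plot_yield_t_ha(1.8, 14.0)
```

Grain harvested wetter than 12% is scaled down, and drier grain is scaled up, so that plots are compared on the same moisture basis.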

#### *2.4. Statistical Analysis*

Phenotypic data for GY were fitted to a linear mixed model with the check cultivars as fixed effects and the row number, column number, and cultivar as random effects following the SAS PROC MIXED procedure:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\varepsilon} \tag{1}$$

where β is an unknown vector of fixed-effects parameters with known design matrix X, γ is an unknown vector of random-effects parameters with known design matrix Z, and ε is an unknown random error vector whose elements are no longer required to be independent and homogeneous.

Restricted maximum likelihood was used to estimate the variance components and to produce the best linear unbiased predictors (BLUPs) for the traits of each cultivar and year with the SAS-STAT statistical package (SAS Institute Inc, Cary, NC, USA).
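To make the link between Equation (1) and the BLUPs more concrete, the following sketch solves Henderson's mixed model equations for a deliberately simplified case. It is not the SAS PROC MIXED analysis used in the study: it assumes a single random factor with independent effects, independent homogeneous residuals, and a variance ratio `lam` = σ²ε/σ²γ that is taken as known (in practice it would come from the REML estimates). All names are hypothetical.

```python
import numpy as np


def henderson_blup(y, X, Z, lam):
    """Solve Henderson's mixed model equations for y = X b + Z u + e.

    lam is the assumed ratio of residual to random-effect variance.
    Returns (fixed-effect estimates, BLUPs of the random effects).
    """
    XtX, XtZ = X.T @ X, X.T @ Z
    ZtZ = Z.T @ Z + lam * np.eye(Z.shape[1])
    lhs = np.block([[XtX, XtZ], [XtZ.T, ZtZ]])
    rhs = np.concatenate([X.T @ y, Z.T @ y])
    sol = np.linalg.solve(lhs, rhs)
    p = X.shape[1]
    return sol[:p], sol[p:]


# Toy data: three cultivars with two observations each, intercept-only fixed part.
y = np.array([4.0, 6.0, 5.0, 5.0, 8.0, 8.0])
X = np.ones((6, 1))
Z = np.kron(np.eye(3), np.ones((2, 1)))  # cultivar indicator matrix
beta_hat, blups = henderson_blup(y, X, Z, lam=2.0)
```

In this balanced toy case the fixed intercept equals the grand mean, and each BLUP is the cultivar's deviation from that mean shrunk by the factor n/(n + `lam`), which shows why BLUPs are more conservative than raw cultivar means.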

Analyses of variance (ANOVA) were performed for the root traits, considering the genotypes and the replication as random effects in the model. Additionally, for a subset of 55 of the 141 structured landraces, selected as having an SP membership *q* > 0.8 (WM, 17; NM, 15; EM, 23), the sum of squares of the cultivar effect in the ANOVAs was partitioned into differences between SPs and differences within them. ANOVA for grain yield was performed for the complete collection, considering genotype, year, and the genotype × year interaction as the sources of variation. Least squares means were calculated and compared using the Tukey HSD test at *P* < 0.05. Pearson correlation coefficients among root traits were computed. Repeatability (H) was calculated on a mean basis across two replications following the formula described by Harper [26], *r* = (*B* − *W*) / (*B* + (*n* − 1)*W*), where *n* is the number of genotypes and *B* and *W* are the two variances from the ANOVA table: between (*B*) and within (*W*). Frequency distributions, ANOVAs, the Tukey test, and the Pearson correlation coefficients were calculated using the JMP v13.1.0 statistical package (SAS Institute Inc, Cary, NC, USA).
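The repeatability formula above is easy to evaluate once the between and within variances have been read from the ANOVA table. The sketch below is a direct transcription of that formula; the function name is hypothetical, the input values are invented for illustration, and *n* is passed in as defined in the text.

```python
def repeatability(between, within, n):
    """Repeatability r = (B - W) / (B + (n - 1) * W)."""
    return (between - within) / (between + (n - 1) * within)


# Invented values: B = 10, W = 2, n = 2.
r = repeatability(10.0, 2.0, 2)
```

When the between and within variances are equal, the numerator vanishes and the repeatability is zero; it approaches one as the within variance becomes negligible relative to the between variance.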

#### *2.5. Genome-Wide Association Analysis*

A GWAS was performed using the means of the measured root traits and the BLUPs for GY per year and across years with TASSEL 5.0 software [27]. A mixed linear model (MLM) was fitted using the information on the genetic structure reported in Rufo et al. [23] as the fixed effect and a kinship (K) matrix, calculated using Haploview [28], as the random effect (Q + K model) at the optimum compression level. In addition, the anthesis date was incorporated as a cofactor in the analysis. As reported in other studies [29–32], an adjusted −log10 *P* > 3 was established as the threshold for considering a marker-trait association (MTA) statistically significant. A moderate threshold of −log10 *P* > 2.5 was also established for GY. Confidence intervals (CI) for MTAs were calculated for each chromosome according to the linkage disequilibrium (LD) decay reported by Rufo et al. [23]. In order to simplify the MTA information, the associations were grouped into QTL hotspots when the CIs of at least two MTAs belonging to different traits overlapped. Circular Manhattan plots were drawn using the R package "CMplot" (http://www.r-project.org).
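The thresholding and hotspot-grouping rules described above can be sketched as follows. This is only an illustration, not the procedure implemented in TASSEL or by the authors: the CI of each MTA is assumed to be its genetic position plus or minus the chromosome-specific LD decay, and the `MTA` class, function names, and example markers are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class MTA:
    marker: str
    trait: str
    chrom: str
    pos_cm: float  # genetic position in cM


def is_significant(p_value, threshold=3.0):
    """True when -log10(P) exceeds the chosen threshold."""
    return -math.log10(p_value) > threshold


def find_hotspots(mtas, ld_decay_cm):
    """Group MTAs with overlapping CIs; keep groups covering >= 2 traits.

    ld_decay_cm maps chromosome -> LD decay (cM); CI = position +/- decay
    is an assumption made for this sketch.
    """
    by_chrom = {}
    for m in mtas:
        by_chrom.setdefault(m.chrom, []).append(m)
    hotspots = []
    for chrom, items in by_chrom.items():
        half = ld_decay_cm[chrom]
        group, group_end = [], None
        for m in sorted(items, key=lambda x: x.pos_cm):
            start, end = m.pos_cm - half, m.pos_cm + half
            if group and start <= group_end:
                group.append(m)
                group_end = max(group_end, end)
            else:
                if len({g.trait for g in group}) >= 2:
                    hotspots.append(group)
                group, group_end = [m], end
        if len({g.trait for g in group}) >= 2:
            hotspots.append(group)
    return hotspots


example = [
    MTA("m1", "RDW", "1B", 10.0),
    MTA("m2", "SW", "1B", 11.5),
    MTA("m3", "SW", "1B", 40.0),
    MTA("m4", "SL", "2A", 5.0),
    MTA("m5", "SL", "2A", 6.0),
]
hotspots = find_hotspots(example, {"1B": 1.0, "2A": 2.0})
```

In this toy example only `m1` and `m2` form a hotspot: `m3` is too far away on the same chromosome, and `m4` and `m5` overlap but belong to the same trait, so they remain outside any hotspot under the rule stated in the text.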

#### *2.6. Gene Annotation*

Gene annotation within the CIs of the QTL hotspots was performed using the gene models for high-confidence genes reported for the wheat genome sequence [33], available at https://wheat-urgi.versailles.inra.fr/Seq-Repository/Assemblies. Markers flanking the CIs were used to estimate physical distances from genetic distances.

#### **3. Results**

#### *3.1. Phenotypic Data of Root Traits*

A summary of the genetic variation of the root traits is shown in Table 1. The genotypes showed a low coefficient of variation (CV) with a narrow range of variation among traits, from 10.4 for SW to 18.8 for RDW and repeatability (H) ranging from 48.5% for RDW to 75.4% for SW.


**Table 1.** Statistics of the seminal root traits.

SD, standard deviation; CV, coefficient of variation; H, repeatability; TRN, total root number; RDW, root dry weight; SRA, seminal root angle; SW, seed weight; SL, shoot length.

The ANOVA (Table 2) for cultivars with a high membership coefficient (q > 0.8) showed that for all traits the total variability was mainly explained by the genotype effect, in a range from 63.5% for SL to 88.8% for SW. When the sum of squares of the genotype effect was partitioned into differences between and within SPs, the results revealed that the genetic variability was mainly explained by differences within SPs in a range from 47.8% for TRN to 71.8% for SRA (Table 2). Differences between SPs were statistically significant for SRA, TRN, SW, and SL, in a range from 6.0% of the genotype effect for SL to 25.3% for TRN (Table 2). The sum of squares within SPs was partitioned into western (WM), northern (NM), and eastern (EM) effects, being statistically significant for SRA (40.8%), TRN (28.3%), SW (38.3%), and SL (26.8%) in the western SP, SRA (34.4%) in the northern SP and RDW (53.6%) and SW (55.4%) in the eastern SP.
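The partition of the genotype sum of squares into between-SP and within-SP components can be illustrated with a small sketch. It works on one value per genotype and ignores the replication and error terms of the full ANOVA, so it is a simplification of the analysis reported in Table 2; the function name and the data are invented for illustration.

```python
import numpy as np


def partition_ss(values, groups):
    """Split the total sum of squares of genotype values into the
    percentage between groups and the percentage within groups."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    grand_mean = values.mean()
    ss_between = 0.0
    ss_within = 0.0
    for g in np.unique(groups):
        v = values[groups == g]
        ss_between += len(v) * (v.mean() - grand_mean) ** 2
        ss_within += ((v - v.mean()) ** 2).sum()
    total = ss_between + ss_within
    return 100.0 * ss_between / total, 100.0 * ss_within / total


# Invented trait values for six genotypes in two subpopulations.
pct_between, pct_within = partition_ss(
    [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],
    ["WM", "WM", "WM", "EM", "EM", "EM"],
)
```

The two percentages always add up to 100, so a large within-SP share, as reported here for most traits, means that most of the genetic variability lies among genotypes of the same subpopulation.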

**Table 2.** Percentage of the sum of squares of the ANOVA in a set of 55 bread wheat landraces structured into three genetic subpopulations with membership coefficient *q* > 0.8.


WM, western Mediterranean; NM, northern Mediterranean; EM, eastern Mediterranean; TRN, total root number; RDW, root dry weight; SRA, seminal root angle; SW, seed weight; SL, shoot length. \* *P* < 0.05, \*\*\* *P* < 0.001.

The ANOVA for grain yield revealed that the genotype effect was the most important in the phenotypic expression of traits, accounting for 59% of the total phenotypic variation, whereas the year effect accounted only for 5%. The interaction accounted for almost 36% of the phenotypic variation although it was not significant (Table 3).


**Table 3.** Percentage of the sum of squares for grain yield of the ANOVA in the collection of 170 bread wheat landraces.

The landraces from northern Mediterranean countries showed the highest number of seminal roots, with a root angle not statistically different from that of the western Mediterranean ones. On the other hand, eastern Mediterranean landraces showed the lowest number of roots but the widest angle. These landraces also showed the lowest SW and the longest shoots. No differences were found for RDW among the three SPs (Table 4).

**Table 4.** Means comparison of seminal root traits measured in a set of 55 Mediterranean wheat landraces structured into three genetic subpopulations [23] with *q* > 0.8. Means within columns with different letters are significantly different at *P* < 0.05 following a Tukey test.


TRN, total root number; RDW, root dry weight; SRA, seminal root angle; SW, seed weight; SL, shoot length.

Correlation coefficients between root traits were calculated, showing highly significant correlation coefficients between RDW and SW and RDW and SL (*r* = 0.47 and 0.45 respectively; *P* < 0.0001). Moderate significant correlations were reported for TRN with RDW, SW and SRA (*r* = 0.20, 0.28 and 0.28, respectively), and for SW with SL (*r* = 0.27). Finally, a negative correlation coefficient (*r* = −0.12) was found between SRA and SW (Figure 2). GY showed a moderate significant correlation with TRN and SW (*r* = 0.28 and 0.29, respectively; *P* < 0.0005).

**Figure 2.** Correlations between seminal root traits and grain yield. The values of the correlation coefficients (*r*) are shown on the right side. SL, shoot length; RDW, root dry weight; SW, seed weight; TRN, total root number; SRA, seminal root angle; GY, grain yield.

#### *3.2. Marker-Trait Associations*

After filtering for duplicated patterns, missing values, and minor frequency alleles, a total of 10,458 SNPs were used to genotype the panel of 170 wheat landraces [23].

The results of the GWAS for root related traits are reported in Figure 3 and Supplementary Materials, Table S2. Using a common threshold of −log10 *P* > 3, as reported by other authors [29–32], a total of 135 MTAs were identified for the analyzed traits. Of these, 50 MTAs corresponded to SW, 39 to RDW, 18 to SL, 17 to SRA, and 11 to TRN. The A and B genomes harbored 46% and 48% of MTAs, respectively, whereas the D genome harbored only 6% of MTAs. The number of MTAs per chromosome ranged from 1 in chromosomes 4D, 5D, and 6D to 14 in chromosome 1B, with a mean of 7 MTAs per chromosome. Most of the MTAs (88%) showed a phenotypic variance explained (PVE) by each MTA in a range of 5% to 10%, and only 2% showed a PVE higher than 15%. Among traits, the PVE mean was stable in a range of 7% (SL) to 9% (RDW).

**Figure 3.** GWAS for root-related traits (left circle) and grain yield for 3 years and across years (right circle). From the inside out, the root trait tracks correspond to RDW, SW, TRN, SRA, and SL, whereas the GY tracks correspond to the 2016, 2017, and 2018 harvesting seasons and the mean across years.

In order to identify and summarize the genomic regions most involved in trait variation, QTL hotspots were defined when two or more MTAs from different traits were grouped together within the same LD block. LD was previously estimated for locus pairs in each chromosome, and its decay was set to 1 to 10 cM depending on the chromosome [23]. Using this approach, 15 QTL hotspots grouping 43 MTAs were identified (Table 5), while 92 MTAs remained as singletons.

The results of the GWAS for GY are reported in Figure 3 and Supplementary Materials, Table S3. A common threshold of −log10 *P* > 3 detected a total of 40 MTAs; thus, a moderate threshold of −log10 *P* > 2.5 was applied, increasing the number of significant associations to 112. Of these, 32 MTAs corresponded to the year 2016, 30 to 2017, 18 to 2018, and 32 across years. The A and B genomes harbored 43% and 38% of MTAs, respectively, whereas the D genome harbored only 18% of MTAs. The number of MTAs per chromosome ranged from 1 in chromosomes 3D and 6B to 16 in chromosome 1D. Chromosomes 1A, 4D, 5D, and 7D did not show any association. All MTAs showed a phenotypic variance explained (PVE) in a range from 5% to 11%. Most of the MTAs with a PVE > 8% were located on chromosome 1D (76%, 13 out of 17), whereas the percentage increased to 80% among MTAs with a PVE > 10% (4 out of 5).

In order to identify and summarize the genomic regions with a pleiotropic effect for root traits and grain yield, QTL hotspots were defined as previously but including the MTAs for GY. Using this approach, 17 QTL hotspots grouping 81 MTAs were identified (Table 6). Of these, five were in common with those reported with root traits only (rootQTL1B.3, rootQTL2A.2, rootQTL3B.2, rootQTL6A.1, and rootQTL6A.2). GY shared 8 genomic regions with SW, 9 with RDW, 4 with SL, and 3 with SRA, whereas no regions were in common with TRN. In 59% of these genomic regions, GY co-localized with only one root trait, whereas in the other 41% it co-localized with two different root traits.


**Table 5.** Root QTL hotspots. Positions are indicated in cM.

TRN, total root number; RDW, root dry weight; SRA, seminal root angle; SW, seed weight; SL, shoot length.



TRN, total root number; RDW, root dry weight; SRA, seminal root angle; SW, seed weight; SL, shoot length; GY, grain yield.

In order to identify the most useful markers for selecting for the root traits, extreme phenotypes were identified in the upper and lower 10th percentile of genotypes within the collection for each trait (Figure 4). Among the most significant MTAs for each trait, markers with different alleles between extreme genotypes were identified (Table 7, Figure 5). The frequency of the most common allele among genotypes from the upper 10th percentile ranged from 78% for RDW to 88% for SW, while for the lower 10th percentile it ranged from 65% for TRN and SRA to 92% for RDW (Figure 5).
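The selection of extreme phenotypes and the calculation of the most common allele frequency within each tail can be sketched as below. The function names, the use of NumPy's default percentile interpolation, and the example data are assumptions made for illustration and are not taken from the study.

```python
import numpy as np


def extreme_groups(values, pct=10):
    """Return the indices of genotypes in the lower and upper pct tails."""
    values = np.asarray(values, dtype=float)
    low_cut, high_cut = np.percentile(values, [pct, 100 - pct])
    lower = np.where(values <= low_cut)[0]
    upper = np.where(values >= high_cut)[0]
    return lower, upper


def major_allele_frequency(alleles):
    """Return the most common allele and its frequency in a group."""
    alleles = list(alleles)
    counts = {}
    for a in alleles:
        counts[a] = counts.get(a, 0) + 1
    allele = max(counts, key=counts.get)
    return allele, counts[allele] / len(alleles)


# Invented trait values for 20 genotypes and marker calls for one tail.
trait = list(range(1, 21))
lower_idx, upper_idx = extreme_groups(trait)
allele, freq = major_allele_frequency(["A", "A", "G", "A"])
```

A marker is informative for selection when the most common allele in the upper tail differs from the most common allele in the lower tail, which is the comparison summarized in Table 7 and Figure 5.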

**Figure 4.** Extreme phenotypes for SRA and TRN. The means correspond to 3 observational units of the genotype for SRA and 5 observational units of the genotype for TRN.

**Figure 5.** Marker allele frequency means from landraces within the upper and lower 10th percentile for the analyzed traits. All significant markers shown in Table 5 are included. TRN, total root number; RDW, root dry weight; SRA, seminal root angle; SW, seed weight; SL, shoot length.



 chromosome. TRN, total root number. RDW, root dry weight. SRA, seminal root angle. SW, seed weight. SL, shoot length.

#### *Agronomy* **2020**, *10*, 613

#### *3.3. Gene Annotation*

As reported in Supplementary Materials, Table S4, a total of 1489 gene models were identified within the 15 QTL hotspots using the high-confidence gene annotation from the wheat genome sequence [33]. Genetic distances were converted into physical distances using the position of common flanking markers on the genetic map [24] and the genome sequence. The number of gene models ranged from 224 in rootQTL\_2A.2 to 9 in rootQTL\_5B.1. Based on the high number of gene models, a selection was made according to gene families involved in root traits, growth and development, and abiotic stress resistance (Table 8). Thus, 31 gene families with a total of 96 gene models remained for subsequent analysis. Among them, F-box and zinc finger family proteins were identified in 12 of the 15 QTL hotspots, whereas 10 gene families were present in only one QTL hotspot. Among chromosomes with QTL hotspots, chromosome 2A had the highest number of gene models (22), whereas chromosomes 5A and 5B had the lowest number (4).



#### **4. Discussion**

Breeding for drought adaptation is one of the main challenges to be addressed in the coming years in order to increase wheat production and ensure sufficient food supply in the current scenario of climate change. Roots are crucial in this adaptation, as they are responsible for water and nutrient uptake. The wide morphological plasticity of the root system to different soil conditions and the role of root traits in drought environments are well known [34,35]. Wheat roots reduce their growth in water-limited conditions but increase the water uptake rate, extracting the water from deep soil layers [36]. The shape and spatial arrangement of the RSA can provide a growth advantage and increase yield performance during periods of water scarcity [37]. Thus, it is necessary to increase the knowledge of the genetics of root architecture in order to improve wheat yield stability under stress conditions by introgressing favorable alleles through breeding programs.

The current study evaluated root-related traits in a collection of Mediterranean bread wheat landraces representative of the variability existing for the species in the Mediterranean Basin [23] with the aim of providing QTL information for these traits regarding seminal roots. Seminal roots are important for early vigor and crop establishment in dryland areas because they explore the soil for nutrients and water [38]. Moreover, it has been reported that under drought stress, seminal root activity is more important than that of nodal roots [39]. Additionally, field phenotyping of hundreds of genotypes is a complex and expensive task. As the root geometry of adult plants is strongly related to the SRA [5], it may be assumed that genotypes that differ in root architecture at an early developmental stage would also differ in the field at later growth stages, when nutrient and/or water capture becomes critical for yield performance [8].

The range of variation for the traits analyzed in the present study (from 10.9% for TRN to 18.8% for RDW) is in agreement with those reported for elite durum wheat cultivars by Canè et al. [8], who explained this variability as an adaptive value for the environmental conditions of the region of origin of the cultivars. Moreover, the high repeatability found for the traits supports the approach followed to analyze the seminal roots under controlled conditions.

Landraces from the eastern Mediterranean Basin showed the widest SRA, the lowest SW, the longest SL, and the lowest number of roots. According to previous studies in durum wheat [18,40], landraces from southeastern Mediterranean countries, corresponding to the warmest and driest areas of the Mediterranean Basin, produced more grains per unit area and lighter grains than those developed in cooler and wetter zones of the region. Although it has been reported that in water-limited environments a vigorous root system could have benefits at the beginning of the growing season because it offers a more efficient water capture [41], no significant differences were observed for RDW among the SPs in the current study. Moreover, our results for SRA are in agreement with those reported by Roselló et al. [18], who found that durum wheat landraces from the eastern Mediterranean have the widest root angle, which probably allows them to cover a larger soil area and be more efficient in water uptake than landraces that originated in wetter areas.

Although not significant, probably due to the very early stage when the root traits were measured, the correlation between SRA and SW was negative. The same result was also reported by Canè et al. [8], who suggested that it could be due to the influence of the root angle on the distribution of the roots in soil layers and, therefore, on the water uptake from deeper layers. On the other hand, the correlation between RDW and SW was positive, in agreement with the findings of Fang et al. [42], thus indicating the effectiveness of greater root mass for obtaining more soil water for plant growth and grain filling under drought. Seedling growth has also been related to SW in wheat [43]. The vertical distribution of the root system can have a strong effect on yield [44], so root mass concentrated in upper layers can be more effective for resource capture, while roots in deeper layers have more access to deep water.

The complexity of the genetic control of root traits was confirmed by the 135 marker-trait associations identified in the current study. Their distribution was similar in the A and B genomes (46% and 48%, respectively), leaving only 6% of MTAs in the D genome. These results agree with the lower genetic diversity and higher LD found in the D genome, as reported previously [23]. According to Chao et al. [45], the different levels of diversity in wheat genomes could be due to different rates of gene flow from the ancestors of wheat, since the polyploidy bottleneck resulting from speciation reduced diversity and increased the levels of LD in the D genome in comparison with the A and B genomes.

In order to simplify and to integrate closely linked MTAs in a consensus region, QTL hotspots were identified based on the results of LD decay reported in [23]. LD decay was used to define the CIs for the QTL hotspots. Following this approach, 43 MTAs were grouped in 15 QTL hotspots. The genomic position of QTL hotspots was compared with previous studies reporting meta-QTLs for root traits [46] and MTAs from GWAS studies in order to detect previously identified regions controlling root traits. Among the 15 QTL hotspots, only rootQTL6A.3 was located in the same region as a previously mapped meta-QTL, RootMQTL74 [46]. When compared with the MTA-QTLs reported by Roselló et al. [18] in durum wheat Mediterranean landraces, the QTL hotspot rootQTL6A.3 corresponded to the MTA-QTLs mtaq-6A.3 and mtaq-6A.6. This hotspot was also in the same region as a major SRA QTL identified by Alahmad et al. [47] and a QTL controlling root growth angle identified by Maccaferri et al. [9], who also found a QTL for grain weight located in a common region with the hotspot rootQTL2A.2, which includes an MTA for SW. rootQTL3B.1 shared a common position with an MTA reported by Ayalew et al. [48] on chromosome 3B under stress conditions. rootQTL7A.1, including an MTA for RDW, was located in a similar position to MLM-RDWB-10, reported by Li et al. [49] and associated with RDW at the booting stage. Finally, no genomic regions were shared with the study carried out by Beyer et al. [50]. Only four of the 15 QTL hotspots identified in this work had been detected previously, suggesting the importance of wheat Mediterranean landraces for the identification of new loci controlling root-related traits.

As reported in previous studies at early developmental stages [8,18], the co-location of MTAs for grain yield and root-related traits within the same QTL hotspot suggests a pleiotropic effect; however, deeper analyses would be necessary to confirm it. In durum wheat elite cultivars, Canè et al. [8] found that 30% of the QTLs affecting root system architecture were included within QTLs for agronomic traits. More recently, Roselló et al. [18], using a collection of Mediterranean durum wheat landraces, found that 45% of QTL hotspots for root-related traits were mapped in similar regions to yield-related traits reported for the same collection of landraces.

From a breeding standpoint, exploiting genetic diversity from local landraces is a valuable approach for recovering and broadening allelic variation for traits of interest [19]. Therefore, identifying the genotypes showing the extreme phenotypes within the pool of Mediterranean landraces and the associated markers provides the opportunity for introgressing suitable traits into elite cultivars by marker-assisted breeding, using the most recent technologies to speed up the process.

The availability of a high-quality reference wheat genome sequence [33] enabled us to quickly identify gene models corresponding to QTLs. Thus, the genetic position of the CIs of the QTL hotspots was projected into physical distances on the reference sequence to search for putative candidate gene models. To narrow the number of candidates, only gene models involved in the development and abiotic stress according to the literature were taken into consideration. Therefore, of 1489 gene models identified within the 15 QTL hotspots, only 31 gene families were selected.

F-box and zinc finger family proteins were the most represented, each one appearing in 12 hotspots. F-box proteins play important roles in plant development and abiotic stress responses via the ubiquitin pathway [51] and the ABA signaling pathway [52]. In wheat, the F-box protein *TaFBA1* is involved in plant hormone signaling and response to abiotic stresses and is expressed in all plant organs, including roots [53]. The overexpression of *TaFBA1* in transgenic tobacco reported by Li et al. [54] to improve heat tolerance resulted in increased root length in the transgenic plants. Zinc finger proteins are involved in several processes, such as regulation of plant growth and development, and response to abiotic stresses [46]. In *Arabidopsis* and rice, they play a role in tolerance to drought and salt stresses [55], while in wheat the overexpression of *TaZFP34* enhances root-to-shoot ratio during plant adaptation to drying soil [56].

Other kinds of gene models found in a high number of QTL hotspots were *MYB* transcription factors and *NAC* domain-containing proteins, each present in 8 hotspots. *MYB* domain-containing transcription factors are involved in salt and drought stress adaptation in wheat. Some examples are the genes *TaMyb1*, *TaMYBsdu1*, and *TaMYB33*. The expression of *TaMyb1* in roots is strongly related to responses to abiotic stresses [57]. The gene *TaMYBsdu1* was found to be upregulated in leaves and roots of wheat under long-term drought stress [58]. Finally, the overexpression of *TaMYB33* in *Arabidopsis* enhances tolerance to drought and salt stresses [59]. *NAC* domain-containing proteins have been described to play many important roles in abiotic stress adaptation [46]. Xie et al. [60] reported that *NAC1* promoted the development of lateral roots. Similarly, He et al. [61] found that the expression of *AtNAC2* in response to salt stress led to an increase in the development of lateral roots. Xia et al. [62] demonstrated that the gene *TaNAC4* is a transcriptional activator involved in wheat's response to biotic and abiotic stresses.

Proteins belonging to the cytochrome *P450* family and *bZIP* transcription factors were present in five QTL hotspots. The first class of proteins belongs to one of the largest families of plant proteins, with genes affecting important traits for crop improvement such as *TaCYP78A3,* which is involved in the control of seed size [63]. *bZIP* transcription factors are involved in abiotic stress response [64]. In *Arabidopsis*, it has been observed that the overexpression of *TabZIP14-B*, involved in salt and freezing tolerance, hindered root growth in transgenic plants in comparison with the control plants [65].

Other proteins involved in root growth and development are the peroxidases and *ABC* transporters, which were identified in four QTL hotspots. Extracellular peroxidases are involved in plant defense reactions against biotic and abiotic stresses through the generation of reactive oxygen species in wounded root cells [66]. In *Arabidopsis*, the *ABC* transporter *AtPGP4* is expressed mainly during early root development, and its loss of function enhances lateral root initiation and root hair development [67]. Gaedeke et al. [68] reported a new member of the *ABC* transporter superfamily of *Arabidopsis thaliana*, *AtMRP5*. Using reverse genetics, these authors found that the recessive allele *mrp5* exhibited decreased root growth and increased lateral root formation. In addition to peroxidases and *ABC* transporters, other proteins identified in four QTL hotspots were the ethylene-responsive transcription factors (ERFs), which are involved in the response to abiotic stresses. In wheat, the ERF *TaERFL1a* is induced in seedlings in response to salt, cold, and water deficiency [69].

Other protein families involved in drought stress, seed size, or early development were represented in a lower number of QTL hotspots. Among them, aquaporins are known to affect drought tolerance by influencing the capacity of roots to take up soil water [70]. The expansins were suggested to be involved in root development, as the overexpression of the wheat expansin *TaEXPB23* improved drought tolerance by stimulating the growth of the root system in tobacco [71].

#### **5. Conclusions**

The exploitation of unexplored genetic variation present in local landraces can potentially contribute to breeding programs aimed at enhancing drought tolerance in wheat. Roots are crucial for adaptation to drought stress because they are the plant organ responsible for water and nutrient uptake and interaction with soil microbes. Thus, designing and developing novel root system ideotypes could be one of the targets of wheat breeding for the coming years. The variability found in the Mediterranean wheat landraces, together with the newly identified QTL hotspots, shows landraces to be a valuable source of favorable root traits to introgress into adapted phenotypes through marker-assisted breeding. Among the different marker-trait associations, those reported in extreme genotypes could serve as a starting point to develop new mapping populations to fine map the corresponding traits.

**Supplementary Materials:** Supplementary materials can be found at http://www.mdpi.com/2073-4395/10/5/613/s1. Figure S1: Field scheme of the experimental design for grain yield, Table S1: List of accessions, Table S2: Significant GWAS results for root related traits, Table S3: Significant GWAS results for grain yield, Table S4: Gene models identified within the 15 root QTL hotspots.

**Author Contributions:** Conceptualization, S.S. and J.M.S.; Data curation, R.R.; Formal analysis, R.R.; Funding acquisition, C.R. and J.M.S.; Investigation, R.R., S.S. and J.M.S.; Methodology, R.R., S.S. and J.M.S.; Project administration, C.R. and J.M.S.; Supervision, S.S. and J.M.S.; Writing—original draft, R.R.; Writing—review & editing, S.S., C.R. and J.M.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This study was funded by projects AGL2015-65351-R and RTA2015-00072-C03-01 of the Spanish Ministry of Economy and Competitiveness. R.R. is a recipient of a PhD grant from the Spanish Ministry of Economy and Competitiveness.

**Acknowledgments:** The authors acknowledge the contribution of the CERCA programme (Generalitat de Catalunya). Thanks are given to the group of Agricultural Genetics of University of Bologna for technical support.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**


#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## **Improvement of a RD6 Rice Variety for Blast Resistance and Salt Tolerance through Marker-Assisted Backcrossing**

#### **Korachan Thanasilungura <sup>1</sup>, Sukanya Kranto <sup>2</sup>, Tidarat Monkham <sup>1</sup>, Sompong Chankaew <sup>1</sup> and Jirawat Sanitchon <sup>1,</sup>\***


Received: 24 July 2020; Accepted: 30 July 2020; Published: 1 August 2020

**Abstract:** RD6 is one of the most favorable glutinous rice varieties consumed throughout the north and northeast of Thailand because of its aroma and softness. However, blast disease and salt stress cause decreases in both yield quantity and quality during cultivation. Here, gene pyramiding via marker-assisted backcrossing (MAB) using combined blast resistance QTLs (*qBl 1, 2, 11*, and *12*) and the *Saltol* QTL was employed to address this problem. To pursue our goal, the RD6 introgression line (RGD07005-12-165-1), containing four blast-resistant QTLs, was crossed with the salt-tolerant Pokkali variety. Blast resistance evaluation was thoroughly carried out in the fields, from BC2F2:3 to BC4F4, using the upland short-row and natural field infection methods. Additionally, salt tolerance was validated in both greenhouse and field conditions. We found that the RD6 "BC4F4 132-12-61" resulting from our breeding programme successfully resisted blast disease and tolerated salt stress, while it maintained the desirable agronomic traits of the original RD6 variety. This finding may provide a new improved rice variety to overcome blast disease and salt stress in Northeast Thailand.

**Keywords:** gene pyramiding; aroma; QTL; chromosome; selection; introgression line

#### **1. Introduction**

Rice (*Oryza sativa* L.) is consumed as a staple food in Asia, especially in the southeast region. In Thailand, the indica rice variety RD6, developed from KDML105 through gamma irradiation, is one of the most favorable glutinous rice varieties consumed throughout the northeast of Thailand [1,2]. Because of its cooking quality, aroma, and softness, production demand has increased over time. However, its yield of 4.16 ton/ha fails to meet its potential due to biotic and abiotic stress.

Rice blast disease caused by the fungus *Pyricularia grisea* (Cooke) Sacc. leads to crop losses of up to 85% of total yield [3]. Disease symptoms occur in all stages of plant growth, beginning with blast discoloration and wilting of the foliage [4]. Neck blast can be found at the flowering stage, accelerating plant death [5]. Severe damage has also been observed within areas of intensive planting with high doses of nitrogen application [6]. Development of new rice varieties resistant to blast fungus is an alternative approach to diminish or control the invasion of this pathogen. Resistance quantitative trait loci (QTLs) have been investigated to identify parental varieties, which are then used for gene pyramiding in breeding programmes. Currently, more than 100 blast-resistant genes have been identified, of which 22 have been cloned [7]. In Thailand, few studies of blast-resistant genes have been conducted. Noenplab et al. [8] studied the relationship between leaf blast and neck blast resistance in the Jao Hom Nil (JHN) variety, in which resistant QTLs were detected on chromosomes 1 and 11. These QTLs conferred resistance to both leaf blast and neck blast. Suwannual et al. [9] pyramided four blast-resistant QTLs, located on chromosomes 2 and 12 in the P0489 variety and on chromosomes 1 and 11 in the JHN variety, resulting in the creation of new RD6 introgression lines. Their results demonstrated that the RD6 introgression lines carrying a higher number of QTLs (achieved through pyramiding) had a broader spectrum of resistance to the blast pathogens prevalent in the region.

In addition to rice blast fungus, salt stress is a crucial constraint for RD6 production. Thailand's northeast region is the country's largest rice-producing area, and it comprises two basins: Sakon Nakhon and Nakhon Ratchasima. In those basins, which are underlain by accumulated rock salt, the salt-affected area covers approximately 1.84 Mha [10]. Evaporation during the dry season tends to draw salt from the subsoil to the surface, raising soil salinity from 2–4 dS/m to 8–16 dS/m [10–12]. Rice is a salt-sensitive crop, capable of tolerating salinity only at moderate levels of electrical conductivity (4–8 dS/m) [13]. Therefore, rice produced under rain-fed, lowland conditions is usually exposed to high levels of soil salinity. The well-known RD6 variety was identified as a geographical indication (GI) within the Tung Gula Rong Hai of Northeast Thailand. Specifically, RD6 requires optimal soil salinity to enhance rice seed aroma [14]. However, excessive salinity can reduce rice plant growth, tiller number, and seed set [15], and the resulting stress can significantly reduce total crop yield and result in plant death [16].

The Pokkali variety, derived from the International Rice Research Institute (IRRI), has become a well-known source of salinity tolerance worldwide, attributed to the salt-tolerant QTL located on rice chromosome 1 (*Saltol*) [17–20]. Therefore, several researchers have attempted to develop salt-tolerant rice varieties using the *Saltol* QTL [21–26].

The marker-assisted backcrossing (MAB) method has been employed to introgress beneficial QTLs for qualitative and quantitative traits from donor parents, including landraces and wild relatives [27], because foreground and background selection make it precise and shorten the breeding time frame. MAB provides effective selection of genes and/or QTLs for pyramiding multiple genes/QTLs within a rice population. These benefits further support breeding practices for improved resistance and tolerance [28–32]. The objective of this study was to pyramid four blast-resistant QTLs and one salt-tolerant QTL into the RD6 rice variety and to determine the blast resistance and salt tolerance levels of the resulting RD6 introgression lines in both greenhouse and field conditions.

#### **2. Materials and Methods**

#### *2.1. Plant Materials and Marker-Assisted Backcrossing Selection (MABS)*

Three parental varieties/lines were used to generate the BC4F4 population through a pseudo-backcrossing approach, which increases the recurrent genetic background of a pyramiding population: RD6 (recurrent parent), Pokkali (donor of the *Saltol* QTL on chromosome 1), and RGD07005-12-165-1 (the RD6 near-isogenic line obtained from the Rice Gene Discovery Unit, Kasetsart University, Thailand). RGD07005-12-165-1 carries four blast-resistant QTLs, from the JHN variety on chromosomes 1 and 11 and from the P0489 variety on chromosomes 2 and 12. The breeding programme proceeded through MAB by crossing RGD07005-12-165-1 with the Pokkali variety to improve salt tolerance. The F1 was then backcrossed with RGD07005-12-165-1 to produce the BC1F1, and the BC1F2 was subjected to marker-assisted selection (MAS) for the blast-resistant and salt-tolerant QTLs. In this step, the flanking markers RM3412/RM10748 were used to select the *Saltol* QTL [33], RM319/RM212 and RM114/RM224 were used to select the blast-resistant QTLs on chromosomes 1 and 11, respectively [8], and RM48/RM207 and RM313/RM277 were used to select the blast-resistant QTLs on chromosomes 2 and 12, respectively [9], as shown in Figure 1.

**Figure 1.** Breeding schematics for the development and validation of the RD6 NILs populations.
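The foreground selection step described above can be pictured as a simple filter over marker calls. The following is a minimal sketch, not the authors' pipeline: the marker pairs are those named in the text, while the genotype codes (`D`, `H`, `R`), the function names, and the example plants are hypothetical and introduced only for illustration.

```python
# Flanking SSR marker pairs per target QTL, as listed in the text.
FLANKING_MARKERS = {
    "Saltol (chr 1)": ("RM3412", "RM10748"),
    "qBl1": ("RM319", "RM212"),
    "qBl11": ("RM114", "RM224"),
    "qBl2": ("RM48", "RM207"),
    "qBl12": ("RM313", "RM277"),
}

# Hypothetical genotype codes (an assumption of this sketch):
# "D" = homozygous donor allele, "H" = heterozygous, "R" = homozygous RD6 allele.
CARRIER_CODES = {"D", "H"}


def qtls_carried(calls, require_homozygous=False):
    """Return the QTLs for which BOTH flanking markers show the donor allele."""
    accepted = {"D"} if require_homozygous else CARRIER_CODES
    return [
        qtl
        for qtl, (left, right) in FLANKING_MARKERS.items()
        if calls.get(left) in accepted and calls.get(right) in accepted
    ]


def select_plants(genotypes, require_homozygous=False):
    """Keep only the plants that carry all five target QTLs."""
    return [
        plant
        for plant, calls in genotypes.items()
        if len(qtls_carried(calls, require_homozygous)) == len(FLANKING_MARKERS)
    ]


# Made-up marker calls for three illustrative plants.
ALL_MARKERS = [m for pair in FLANKING_MARKERS.values() for m in pair]
example = {
    "plant_01": {m: "D" for m in ALL_MARKERS},
    "plant_02": {**{m: "D" for m in ALL_MARKERS}, "RM10748": "R"},
    "plant_03": {m: "H" for m in ALL_MARKERS},
}

print(select_plants(example))                           # carriers (D or H)
print(select_plants(example, require_homozygous=True))  # fixed lines only
```

Requiring the donor allele at both flanking markers reduces the chance of keeping a plant in which recombination has separated a single marker from the QTL; the homozygous filter corresponds to fixing the QTL in selfed generations.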

Total genomic DNA from young leaves of individual plants, lines, and their parents was extracted according to the method described by Dellaporta et al. [34] with slight modifications. The PCR reactions for SSR markers were carried out in a volume of 10 μL, containing 25 ng of genomic DNA, 1× PCR buffer, 1.8 mM MgCl2, 0.2 mM dNTP, 0.2 μM forward and reverse primer, and 0.05 unit Taq DNA polymerase. DNA amplification was performed in a DNA thermal cycler for five minutes at 95 °C, followed by 35 cycles of 30 s at 95 °C, 30 s at 55 °C, and two minutes at 72 °C, with a final extension of seven minutes at 72 °C. The amplification products were separated via 4.5% polyacrylamide gel electrophoresis [32]. The selected lines in the BC1, BC2, and BC3 generations were backcrossed with the RD6 variety, using MAS for *Saltol* and blast-resistant QTL selection. Quality traits, such as glutinous type, aromatic character, and gelatinization temperature (GT), were fixed in the BC2F2 populations through MAS using the glutinous 23 primer on chromosome 6, badh2 on chromosome 8, and RM190 on chromosome 6 (Table S1). Each backcross generation up to the BC4F4 populations was evaluated for salt tolerance and blast resistance (Figure 1).
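As a quick check on the thermal profile above, the programmed hold time can be summed directly. This is only a sketch: the step durations come from the text, but the variable names are illustrative and instrument ramp times between temperatures are ignored (an assumption), so the real run takes longer.

```python
# Thermal profile from the text; durations in seconds.
INITIAL_DENATURATION_S = 5 * 60          # 95 °C
CYCLE_STEPS_S = [
    ("denaturation, 95 °C", 30),
    ("annealing, 55 °C", 30),
    ("extension, 72 °C", 2 * 60),
]
N_CYCLES = 35
FINAL_EXTENSION_S = 7 * 60               # 72 °C


def total_hold_time_s():
    """Sum of programmed hold times, excluding ramping between steps."""
    per_cycle = sum(seconds for _, seconds in CYCLE_STEPS_S)
    return INITIAL_DENATURATION_S + N_CYCLES * per_cycle + FINAL_EXTENSION_S


print(total_hold_time_s() / 60)  # 117.0 minutes of programmed holds
```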

#### *2.2. Evaluation of Salt Tolerance and Blast Resistance in the BC2F2:3 Populations (Exp. 1)*

The evaluation of salt tolerance and blast resistance of the BC2F2:3 lines, as well as the parental and check varieties, was conducted at the Department of Agronomy, Faculty of Agriculture, Khon Kaen University, Khon Kaen, Thailand. The salt tolerance evaluation was performed through two methods, salt solution and artificial soil salinity. The salt solution method was laid out in a completely randomized design (CRD), with four replications. Seedlings were transplanted at seven days to 50 × 57 cm<sup>2</sup> Styrofoam sheets with 1.5 cm diameter holes [17]. Fertilizer was applied with Yoshida nutrient solution [35] at three to twenty-one days of age. NaCl was then added to reach an electrical conductivity (EC) of 4 dS/m. EC was subsequently increased at three-day intervals, reaching EC 8 and 12 dS/m. When the susceptible check (IR29) presented salt injury symptoms, salt tolerance data were recorded following the Standard Evaluation System (SES) [36]. The artificial soil salinity method was also laid out in CRD with four replications. Seedlings were transplanted at seven days and transferred from the spent soil trays to the water trays. At 14 days after transplanting, the experimental trays were treated with NaCl to adjust the water solution to EC 8 dS/m, which was increased to EC 12 dS/m after two days. Salt tolerance data were recorded in the same way in both methods.

Blast resistance was also evaluated via the upland short-row method at the Sakon Nakhon Rice Research Center, Sakon Nakhon, Thailand. The experiment was laid out in CRD with three replications. Seeds of each BC2F2:3 line were sown in rows (approximately) 50 cm long and 10 cm apart. A susceptible KDML105 variety was planted alternately with every two testing varieties. Blast resistance scores were recorded following the SES method [36].

#### *2.3. Evaluation of Salt Tolerance in the BC3F4 Populations (Exp. 2)*

The BC3F4 lines and parental varieties were evaluated for salt tolerance in field conditions. The experiment was conducted at the Ban Daeng Village, Ban Fhang, Khon Kaen, Thailand. The experiment was laid out in a randomized complete block design (RCBD) with three replications. Germinated seeds were sown on seedbeds; then, at thirty days, the seedlings were transplanted to the field. Plot sizes were 1 × 1.5 m<sup>2</sup>, in three rows, spaced 25 × 25 cm between and within rows. The RD6 variety was planted between every five plots within the test lines to ensure that salinity occurred uniformly. Fertilizer (23.44 kg/ha of N, P2O5, and K2O) was applied at four days after transplanting, and hand weeding and chemical application for disease and insect control were performed as needed. When the susceptible check (RD6) presented salt injury symptoms, salt tolerance data were recorded following SES [37]. Moreover, the agronomic traits including 1000/seed weight, seed length, seed width, and seed shape were recorded.

#### *2.4. Evaluation of Salt Tolerance and Blast Resistance in the BC4F3 Populations (Exp. 3)*

The salt tolerance evaluation of the BC4F3 lines was conducted in greenhouse conditions at the Department of Agronomy, Faculty of Agriculture, Khon Kaen University, Khon Kaen, Thailand. The experiment was laid out in CRD with three replications, as described in the salt solution method of Exp. 1. The experiment for blast resistance of the BC4F3 lines was conducted at the Department of Agronomy, Faculty of Agriculture, Khon Kaen University, Khon Kaen, Thailand through field and upland short-row experiments. The upland short-row method was conducted similarly to Exp. 1. The field experiment was laid out in RCBD with three replications, in plots of 0.5 × 1 m<sup>2</sup>, in three rows, spaced 25 × 25 cm between and within rows. Seedlings were transplanted at 30 days of age. Symptoms of natural blast infection were scored once seedlings began to show signs of infection, following the protocol of the SES [37], and both leaf and neck blast symptoms were recorded.

Agronomic trait data, such as plant height (PH) and panicle length (PL), were recorded at pre-harvest, whereas the data recorded post-harvest were 4/panicle seed weight (SW4P), 1000/seed weight (1000SW), total dry weight (TDW), total seed weight (TSW), harvest index (HI), seed length (SL), seed width (SW), and seed shape (SS) (ratio of SL/SW). Additionally, seed qualities of the BC4F4 seeds, such as seed morphology (SL, SW, SS, and seed color of brown and paddy rice) and aromatic traits, were evaluated and compared with the RD6 variety. The seed aroma of each line was evaluated through the quantitative determination of 2-acetyl-1-pyrroline (2AP) content using automated headspace gas chromatography, following the methods prescribed by Sriseadka et al. [38]. In brief, polished seeds (1.00 g) were ground and then placed in a 20 mL headspace vial. The headspace vials were immediately sealed with PTFE/silicone septa and aluminum caps prior to analysis by static headspace-gas chromatography. A static headspace sampler (Model 7697A, Agilent Technologies, Santa Clara, CA, USA) coupled to an Agilent 7890B Series GC system equipped with an Agilent 5977B GC/MSD system was used. A series of 2AP standard solutions with concentrations of 1.22, 2.45, 4.90, 9.79, and 19.67 ppm in isopropanol was prepared and added to headspace vials containing 1.00 g of non-aromatic rice seed (cv. Chai Nat 1) to serve as the external standard. The optimum headspace operating conditions were oven temperature 110 °C, loop temperature 120 °C, transfer line temperature 130 °C, vial equilibration time 10 min with high speed shaking, pressurizing time 0.15 min, loop equilibration time 0.40 min, and inject time 0.50 min. The headspace volatiles were separated using an HP-5 (25 m × 250 μm × 0.25 μm film thickness) column (J&W Scientific, Folsom, CA, USA). The optimum GC conditions were achieved using an HP-5 column with a splitless injection at 210 °C. The column temperature programme began at 50 °C and increased to 200 °C at 10 °C/min. Purified helium was used as the GC carrier gas at a flow rate of 1.2 mL/min. A calibration curve for 2AP analysis by headspace was generated by spiking known concentrations of 2AP into a non-fragrant rice variety (Chai Nat 1). Samples were run in triplicate, and the concentration of 2AP was calculated based upon the relative peak area of the external standard.
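The external-standard quantification described above amounts to fitting a straight line through the standards and inverting it for each sample. The sketch below assumes an ordinary least-squares fit of peak area against concentration; the standard concentrations are those given in the text, whereas the peak areas, the function names, and the triplicate sample are invented purely to show the arithmetic.

```python
# 2AP standard concentrations (ppm) from the text.
STANDARDS_PPM = [1.22, 2.45, 4.90, 9.79, 19.67]


def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def concentration_from_area(area, slope, intercept):
    """Invert the calibration line to get a concentration in ppm."""
    return (area - intercept) / slope


# Hypothetical peak areas for the standards (perfectly linear on purpose).
standard_areas = [1000.0 * c + 50.0 for c in STANDARDS_PPM]
slope, intercept = fit_line(STANDARDS_PPM, standard_areas)

# Hypothetical triplicate peak areas for one sample; the mean is used.
triplicate_areas = [3050.0, 3060.0, 3040.0]
mean_area = sum(triplicate_areas) / len(triplicate_areas)
sample_ppm = concentration_from_area(mean_area, slope, intercept)
print(round(sample_ppm, 2))
```

With real chromatograms the points will not fall exactly on a line, so the fit quality should be checked before the curve is used.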

#### *2.5. Evaluation of Salt Tolerance in the BC4F4 Populations (Exp. 4)*

The salt tolerance evaluations of the BC4F4 lines were conducted in greenhouse conditions at the Department of Agronomy, Faculty of Agriculture, Khon Kaen University, Khon Kaen, Thailand, via the salt solution method. The experiment was laid out in CRD with four replications, and the planting methods and experimental protocol were similar to those described in Exp. 1, except that the EC in Exp. 4 was adjusted to 18 dS/m. The leaf, stem, root, and total dry weight data were recorded. Dry weight of the seedlings was determined by oven-drying the seedlings at 80 °C for 3 days. In addition, the percentages of Na+ and K+ in the leaves, stems, and roots were determined according to the flame photometric method [39]. In brief, 0.5 g of pounded sample of each tissue was digested with 10 mL nitric acid (HNO3) and 5 mL perchloric acid (HClO4), then incubated at 200 °C. The contents were covered to reflux the acid fumes generated during digestion until the digest appeared translucent. After cooling, 100 mL deionized water was added to each digestion tube. The contents were then vortexed and passed through qualitative cellulose filter paper (Whatman No.1, Sigma-Aldrich®, St. Louis, MO, USA), and K+ (768 nm wavelength) and Na+ (589 nm wavelength) were measured with a flame photometer (Model 410 Flame Photometer, Sherwood Scientific Limited, Cambridge, UK). The K+ and Na+ concentrations of the samples were determined against known standard solutions of 0.0, 5.0, 10.0, 15.0, and 20.0 ppm, using a calibration curve with a coefficient of determination (r2) = 0.999. Finally, the amounts of K+ and Na+ were expressed as percentages of the dry weight of the raw material.
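The final conversion from a photometer reading to a percentage of dry weight can be sketched as follows. This is an illustration under stated assumptions rather than the authors' worksheet: the 0.5 g sample mass comes from the text, but the final digest volume, the dilution factor, and the example reading are hypothetical, and the function name is introduced here.

```python
def percent_of_dry_weight(reading_ppm, final_volume_ml, sample_mass_g,
                          dilution_factor=1.0):
    """Convert a flame-photometer reading (ppm = mg/L) to % of tissue dry weight.

    dilution_factor > 1 applies when the digest was diluted further to bring
    the reading inside the range of the calibration standards (assumption).
    """
    ion_mg = reading_ppm * dilution_factor * (final_volume_ml / 1000.0)
    sample_mg = sample_mass_g * 1000.0
    return ion_mg / sample_mg * 100.0


# Hypothetical example: 10 ppm reading, digest made up to 100 mL (assumed),
# 0.5 g tissue (from the text), and a further 10-fold dilution (assumed).
print(percent_of_dry_weight(10.0, 100.0, 0.5, dilution_factor=10.0))
```

The dilution factor matters in practice because tissue Na+ at the percent level would otherwise read far above a 20 ppm top standard.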

#### *2.6. Data Analysis*

Salt tolerance scores, blast resistance scores, and agronomic trait data were analyzed via the Statistix 10 program (1985–2013) (Analytical Software, Tallahassee, FL, USA). Means were compared by the least significant difference (LSD) at *p* < 0.05.
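For orientation, the LSD criterion can be written in its standard textbook form for treatments with equal replication. The symbols are introduced here and are not the authors' notation: $\mathrm{MS}_E$ is the error mean square from the analysis of variance, $\mathrm{df}_E$ its degrees of freedom, and $r$ the number of replications per line. The software may apply a variant of this formula when replication is unequal.

$$
\mathrm{LSD}_{0.05} = t_{0.025,\,\mathrm{df}_E}\,\sqrt{\frac{2\,\mathrm{MS}_E}{r}}
$$

Two line means $\bar{y}_i$ and $\bar{y}_j$ are declared different when $\lvert \bar{y}_i - \bar{y}_j \rvert > \mathrm{LSD}_{0.05}$, which is what the different letters after the means in the tables summarize.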

#### **3. Results**

#### *3.1. Development of Populations Through MABS*

The F1 population (RGD07005-12-165-1 × Pokkali) was backcrossed with RGD07005-12-165-1, with MAS applied through the BC2F2, including markers for the glutinous type, aroma, and gelatinization temperature (GT); the BC2F2:3 populations were then evaluated for salt tolerance and blast resistance. One BC2F2:3 line (no. 74) carried all the QTLs for the target traits and was chosen for generation advancement (Table S2). In developing the BC3 generation, BC2F2:3 (no. 74) was crossed back to RD6 to produce the BC3F1, which was advanced to the BC3F4 population under MAS. Thirty-one BC3F4 lines were evaluated for salt tolerance and agronomic traits within the salted field. Among the BC3F4 lines, the BC3F4 (no. 132) demonstrated salt tolerance and agronomic characteristics similar to those of the RD6 variety and was subsequently selected for the development of the BC4F1 through BC4F3. The BC4F3-132-12-61 line was successfully developed via the MABS and carries one QTL for salt tolerance and four QTLs for blast resistance. BC4F3 lines with varied QTL combinations were also validated in the tests.
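Why four backcross cycles are enough to recover most of the recurrent background follows from the standard expectation that each backcross halves the donor share of the genome. The sketch below applies that textbook expectation to the Pokkali donor; it ignores selection and linkage drag around the target QTL (assumptions), treats backcrosses to RGD07005-12-165-1 and to RD6 alike because both carry the RD6 background, and uses a function name introduced here.

```python
def expected_donor_fraction(n_backcrosses):
    """Expected genome-wide donor fraction after n backcrosses of an F1.

    The F1 (n = 0) carries 1/2 donor genome and each backcross to the
    recurrent background halves it; selfing does not change the expectation.
    """
    if n_backcrosses < 0:
        raise ValueError("number of backcrosses must be non-negative")
    return 0.5 ** (n_backcrosses + 1)


for n in range(5):
    label = "F1" if n == 0 else f"BC{n}"
    donor = expected_donor_fraction(n)
    print(f"{label}: {100 * donor:.3f}% donor, {100 * (1 - donor):.3f}% recurrent")
```

These are averages over unlinked regions; the segments flanking a selected QTL remain donor-derived by design, which is where background selection and phenotypic selection for RD6 type do their work.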

#### *3.2. Evaluation of Salt Tolerance and Blast Resistance in the BC2F2:3 Populations*

Eight of the BC2F2:3 lines were evaluated for salt tolerance in the seedling stage through the salt solution and artificial soil salinity methods. The BC2F2:3 lines differed highly significantly at EC 12 dS/m. Three lines, BC2F2:3 nos. 23, 67, and 74, demonstrated tolerance (T) in both methods, similar to that of the tolerant check (Pokkali), while the remaining five BC2F2:3 lines showed moderate tolerance (MT) under both the salt solution and artificial soil salinity tests (Table 1). The BC2F2:3 lines were also evaluated for blast resistance via the upland short-row method, in which their resistant reaction (R) was similar to that of the resistant checks (JHN and P0489). Owing to the presence of the blast-resistant QTLs, their resistance was greater than that of the susceptible check, the original RD6 (Table 1).

**Table 1.** Salt tolerance scores at EC 12 dS/m in salt solution, artificial soil salinity, and blast resistance within the BC2F2:3 populations.


T = tolerance, MT = moderate tolerance, S = susceptible, HS = highly susceptible, R = resistant, MR = moderate resistance, MS = moderately susceptible. \*\* = significantly different at *p* < 0.01. Different letters after the mean within a column show a significant difference. CV = the coefficient of variation.

#### *3.3. Evaluation of Salt Tolerance in the BC3F4 Populations*

BC2F2:3 (no. 74) was selected for the next backcross cycle with the RD6, and MAS was performed through the BC3F4 population. Thirty-one BC3F4 lines were screened for salt tolerance and agronomic performance. The results showed that 21 BC3F4 lines were salt-tolerant (T), whereas 10 BC3F4 lines proved only moderately tolerant (MT). Agronomic traits of the BC3F4 lines were similar to those of the RD6 variety, which was in accordance with our objectives (Table 2). Five BC3F4 lines (nos. 22, 36, 115, 129, and 132) demonstrated superior tolerance (T) and agronomic traits (1000/SW, SL, SW, and SS) similar to those of the original RD6, and were selected as candidate donor parents for the development of the BC4. Of these, BC3F4 (no. 132) produced the best performance for pollination and was therefore selected for development of the BC4 population.



STS = salt tolerance score, 1000SW = 1000 seeds weight, SL = seed length, SW = seed width, SS = seed shape (ratio of SL/SW). \*\* = significantly different at *p* < 0.01. Different letters after the mean within a column show a significant difference. CV = the coefficient of variation.

#### *3.4. Evaluation of Salt Tolerance and Blast Resistance in the BC4F3 Populations*

This experiment evaluated eight BC4F3 lines carrying the *Saltol* QTL selected through MAS. The QTL combination lines, the donor parents, and the tolerant and susceptible checks were screened for salt tolerance via the salt solution method (EC 12 dS/m). The differences were highly significant: the BC4F3 lines and the tolerant check (Pokkali) had salt tolerance scores below 5.0 and were classed as tolerant (T), whereas the recurrent parent (RD6) presented moderate tolerance (MT) with a score of 6.5 (Table 3). These results indicate that the BC4F3 lines carrying the *Saltol* QTL were more salt-tolerant than the recurrent parent (RD6).
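The mapping from a mean SES score to a tolerance class can be sketched as a small lookup. Only two points are fixed by the text (a mean below 5.0 is reported as T, and 6.5 as MT); the remaining cut-offs at 7 and 9, the treatment of means as continuous values on the 1 to 9 scale, and the function name are assumptions made for illustration.

```python
def tolerance_class(mean_score):
    """Classify a mean salt tolerance score on the 1-9 SES scale.

    Cut-offs other than "< 5.0 -> T" are assumptions of this sketch.
    """
    if not 1 <= mean_score <= 9:
        raise ValueError("SES scores range from 1 to 9")
    if mean_score < 5:
        return "T"    # tolerant
    if mean_score < 7:
        return "MT"   # moderately tolerant
    if mean_score < 9:
        return "S"    # susceptible
    return "HS"       # highly susceptible


for score in (3.0, 4.9, 6.5, 7.5, 9.0):
    print(score, tolerance_class(score))
```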

The BC4F3 populations were evaluated for blast resistance in field conditions through natural infection. Since blast infection in the field was not severe, leaves with severe symptoms from the surrounding trap plants were collected and stored in a bag, under dark conditions, for twelve hours to induce spore formation. The natural inoculum was suspended in water, and the spore suspension was then sprayed over the field. The BC4F3 populations and their donor parents showed a high level of blast resistance. A total of eight BC4F3 lines demonstrated high resistance (HR), similar to that of both donor parents, whereas the recurrent parent (RD6) presented moderate susceptibility (MS) (Table 3). Importantly, some introgression lines were more resistant than the recurrent parent (RD6). The second peak of bimodal rain, which occurred in the flowering stage, initiated significant signs of neck blast. Eight BC4F3 lines showed resistance (R) similar to that of the donor parent, whereas the RD6 was moderately susceptible (MS) to neck blast (Table 3).

Additionally, blast resistance was evaluated via the upland short-row method in seven BC4F3 lines, which differed highly significantly in blast resistance. The BC4F3 lines showed resistance similar to that of both donor parents, whereas the RD6 recurrent parent presented moderate resistance (MR). Notably, the blast-resistant genes in the BC4F3 populations provided blast resistance in the seedling stage similar to that in the tillering and grain filling stages, and greater resistance than that of the RD6 recurrent parent (Table 3).

The agronomic traits were also evaluated in the blast field experiments. The differences among the BC4F3 populations and their parents were highly significant for ten traits: PH, PL, SW4P, 1000/SW, TDW, TSW, HI, SL, SW, and SS (Table 4). The BC4F3 132-12 maintained agronomic traits similar to those of the recurrent parent (RD6) for nine traits; the exception was the 1000/SW (1000 seed weight), because the RD6 was moderately susceptible (MS) to leaf and neck blast, which resulted in poor grain filling (Table 4). The results indicate that the newly developed MAB population had greater resistance than the original RD6 variety while maintaining its desirable agronomic traits and satisfying consumer demand. The BC4F3 lines also showed improvements in seed length (SL), seed width (SW), and seed shape (SS), being of the long and slender type. Additionally, the color of paddy rice was straw yellow, similar to the original RD6 variety (Figure 2).



HR = high resistance, R = resistance, MR = moderate resistance, MS = moderately susceptible, S = susceptible. \*\* = significantly different at *p* < 0.01. Different letters after the mean within a column show a significant difference. CV = the coefficient of variation.

#### *Agronomy* **2020** , *10*, 1118




**Figure 2.** Seed quality, seed length, seed shape, and seed color of 10 seeds of the BC4F3 populations compared with the RD6 variety. (**a**) RD6, (**b**) BC4F3 133-12-61, (**c**) BC4F3 133-25-18, (**d**) BC4F3 132-14, (**e**) BC4F3 132-98, (**f**) BC4F3 132-174, (**g**) BC4F3 132-51, (**h**) BC4F3 132-167, (**i**) BC4F3 132-276. The top row is brown rice and the bottom row is paddy rice with straw color. The small scale is in millimeters.

Eight BC4F3 lines and the RD6 (recurrent parent) were evaluated for seed aroma through the determination of 2AP content via the automated headspace gas chromatography method. The results showed a significant difference among the BC4F3 lines and RD6 variety, with mean values of 2AP content of the BC4F3 exceeding 3.00 ppm (Table 4). BC4F3 132-98-87 presented the highest 2AP content (4.68 ppm), similar to that of the RD6 (Table 4). The BC4F3 lines and RD6 were determined to be similar in fragrance.

#### *3.5. Evaluation of Salt Tolerance in the BC4F4 Population*

Two BC4F4 lines, BC4F4 132-12 and BC4F4 132-167, presented as tolerant (T) in the salt tolerance evaluations, similar to the Pokkali, whereas the remaining five BC4F4 lines and the recurrent parent (RD6) showed moderate tolerance (MT) (Table 5). The seven BC4F4 lines were tested at EC values up to 18 dS/m. The dry weights of the seven BC4F4 lines, their parents, and the KDML105 check variety differed highly significantly: the tolerant check (Pokkali) presented the highest LDW, SDW, RDW, and TDW, whereas the BC4F4 lines were similar to the recurrent parent (RD6) (Table 5). The results indicate that salinity significantly affected dry weight.


**Table 5.** Leaf, stem, root, and total dry weights in Experiment 4.

LDW = leaf dry weight, SDW = stem dry weight, RDW = root dry weight, TDW = total dry weight. \*\* = significantly different at *p* < 0.01. Different letters after the mean within a column show a significant difference. CV = the coefficient of variation.

Additionally, the percentages of Na+ and K+ in leaves, stems, and roots were recorded, and the differences proved highly significant for all traits. The tolerant check (Pokkali) presented the lowest Na+ in leaves (1.94), stems (1.75), and roots (4.94), whereas K+ in leaves and stems did not differ significantly among the BC4F4 lines. Moreover, Pokkali also displayed low Na+-to-K+ ratios in rice shoots. The BC4F4 132-12 also showed low levels of Na+ in both stems and leaves, as well as low Na+-to-K+ ratios in rice shoots, comparable to those of the Pokkali variety and lower than those of the RD6. We may infer that transferring the *Saltol* QTL from the Pokkali line conferred greater tolerance on the resulting breeding lines than that of the RD6 recurrent parent (Figure 3). Interestingly, the BC4F4 lines presented Na+ levels in leaves and stems similar to those in Pokkali and significantly different from those in RD6. This supports the ability of the Pokkali *Saltol* QTL to exclude Na+ from the leaf blade, expressing salt tolerance through the salt-exclusion mechanism. Moreover, the salt tolerance scores were negatively correlated with the leaf, stem, root, and total dry weights (r = −0.7899 \*\*, r = −0.8136 \*\*, r = −0.8140 \*\*, r = −0.8065 \*\*, respectively) and positively correlated with leaf and stem Na+ (r = 0.7670 \*\* and r = 0.8917 \*\*, respectively) (Table 6). The results indicate that salt stress decreased plant growth in susceptible varieties, while growth was maintained in salt-tolerant varieties, and that the accumulation of Na+ in leaves and stems was related to salt susceptibility (Table 6).
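The correlations reported above are presumably Pearson coefficients, which is an assumption since the text does not name the statistic. A minimal sketch of the calculation follows; the salt scores and dry weights in it are invented for illustration and are not the study's data.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical entries: higher salt scores (more injury) with lower dry weight.
salt_scores = [3.0, 5.0, 5.0, 7.0, 9.0]
total_dry_weight_g = [2.0, 1.6, 1.5, 1.1, 0.7]
r = pearson_r(salt_scores, total_dry_weight_g)
print(f"r = {r:.4f}")  # negative, as expected for injury score vs. biomass
```

A negative coefficient here has the same reading as in Table 6: lines scoring worse for visual salt injury accumulate less biomass.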

**Table 6.** The correlation between leaf dry weight, stem dry weight, root dry weight, total dry weight, leaf Na+, leaf K+, stem Na+, stem K+, root Na+, root K+, and the salt score of seeds in the BC4F4 populations and check varieties in Experiment 4.


LDW = leaf dry weight, SDW = stem dry weight, RDW = root dry weight, TDW = total dry weight. \* = significantly different at *p* < 0.05. \*\* = significantly different at *p* < 0.01.

**Figure 3.** Na+ and K+ in leaves, stems, and roots in Experiment 4. Different letters within a color bar show a significant difference between the means of the lines.

#### **4. Discussion**

Since its release in 1977, the RD6 glutinous rice variety has remained a staple food crop for domestic consumption in Thailand's north and northeast regions. It comprises 83% of total glutinous rice production in these areas, where consumers have developed a preference for its superior characteristics. However, the RD6 variety suffers from several production constraints, including biotic stresses such as rice blast [40] and bacterial blight disease [41]. Current research has attempted to establish sustainable, infection-resistant production practices by pyramiding multiple resistance genes [42]. To date, an RD6 introgression line capable of resisting both biotic and abiotic stress has yet to be developed. In Thailand's salt rock basins of Sakon Nakhon and Nakhon Ratchasima, consistent levels of salinity have been shown to enhance the fragrance of the RD6 rice variety [14] and increase production.

This study reports the successful introgression of blast-resistant QTLs (*qBL 1, 2, 11*, and *12*) from RGD07005-12-165-1 and the *Saltol* QTL (Pokkali, chromosome 1) to improve the RD6 rice variety through the MAB method within BC4F4 populations. In each advanced population, trait evaluations were used to validate progenies with desirable traits, based on the introgression of the genetic foreground and the maintenance of the genetic background.

Soil salinity varies across areas of northeast Thailand, depending on NaCl levels [43] and on such factors as precipitation, soil type, and field management. In past research, salt tolerance was typically evaluated through salt screening, hydroponic culture, and soil culture, as well as through pot and field methods [33]. The current study assessed salt tolerance within the breeding populations through salt solution, artificial soil salinity, and field condition evaluations. Based on the results, evaluation under field conditions discriminated least among the tested rice lines (Tables 1–3 and 5), due to the inherent difficulties and uncertainties present under field conditions. Kranto et al. [33] reported that effective alternative screening approaches must be shown to correlate with results from the early phases of growth in both greenhouse and field conditions. Within the present study, visual symptom scoring of salt stress under the salt solution method proved to be the most appropriate way to confirm tolerance within a breeding population (Table 5), suggesting that the salt solution method could substitute for salt tolerance scoring in field conditions. However, we acknowledge the necessity of evaluating RD6 plant type, yield performance, and agronomic traits in the field. The introgression lines developed within our study were evaluated for similarity with the original RD6 in agronomic traits, namely plant height, panicle length, 4/panicle seed weight, 1000/seed weight, total dry weight, total seed weight, harvest index, seed length, seed width, and seed shape, as well as in seed qualities, such as seed morphology (Tables 2 and 4).

The *Saltol* QTL on chromosome 1 from the Pokkali rice variety has been commonly used for rice improvement in several studies [21–26]. In our results, the *Saltol* QTL from the Pokkali variety produced the greatest salt tolerance within the RD6 introgression lines (Tables 1–3 and 5). This *Saltol* QTL also contributed to the maintenance of low Na+, high K+, and a low Na+/K+ ratio in rice stems, further resulting in increased salt tolerance [24,44] (Figure 3). The Pokkali variety has been characterized as balancing the influx of Na+ and K+ through a dilution mechanism, which gives it the ability to exclude Na+ from leaf blades and stems [45,46]. As the water uptake mechanisms in rice take in nutrients and salt together, the Pokkali variety demonstrated the highest, and significantly different, leaf, stem, root, and total dry weights when compared with the other breeding lines (Table 5). However, the BC4F4 lines presented the agronomic traits (above) more closely matched to the RD6 than to the Pokkali (Table 5), due to generation advancement and visual selection for trait performance (Tables 2 and 4). RD6 performance is very important for farmer acceptance and crop adaptation in our test areas. For example, excessively tall RD6 rice plants present problems in the grain filling stages as a result of heavy wind or rain [47]. Visual selection may explain the differences in the percentages of Na+ between the RD6 introgression lines and Pokkali (Table 5, Figure 3).

As a photosensitive rice variety, the RD6 is grown once a year, during Thailand's rainy season from late May to November [48]. The bimodal rain pattern of this season produces favorable conditions for the occurrence of blast disease, causing damage in all stages of growth. Leaf blast generally occurs during the seedling and tillering stages, whereas neck blast usually occurs during the reproductive phase [4]. In our study, introgression lines were evaluated for blast disease in both the field and upland short-row evaluations.

In this study, the upland short-row method displayed greater incidences of blast disease, due to the favorable microclimate and moisture contents around the experimental plots (Tables 1 and 3) [49]. The experimental field was influenced by bimodal rain, capable of inducing leaf and neck blast symptoms (Table 3), further indicating the resistance of the QTLs [42,50]. Noenplab et al. [8] also reported that the blast QTL on chromosome 11 in the JHN variety successfully contributed to leaf and neck blast resistance. Pyramiding of four blast-resistant QTLs through MAS achieved high levels of blast resistance and broad-spectrum resistance to pathogens prevalent in the region [9]. Moreover, when RD6 introgression lines were tested for durable blast resistance, no yield penalties were observed [42]. The results, herein, further demonstrated that neck blast disease caused direct yield loss during the grain filling phase [51], as well as lower 1000/SW within the original RD6 variety compared with those of the RD6 introgression lines (Table 4).

The resistance/tolerance abilities of the RD6 introgression lines represent the foreground genetics capable of enhancing plant breeding programs. However, maintaining the background of the original RD6 variety is also desirable; therefore, the quality and performance of the RD6 within the QTL introgression was also a consideration. The BC4F4 populations, herein, were achieved through the introgression of blast-resistant QTLs (*qBL 1, 2, 11*, and *12*) from RGD07005-12-165-1 and *Saltol* QTL (Pokkali) and improved the RD6 rice variety through MAB. Consequently, the performance of the RD6 introgression lines was similar to that of the original RD6 variety (Table 4, Figure 2). The results indicate that foreground and background selection, together with visual selection, accurately depicts the efficiency of MAB.

#### **5. Conclusions**

Improvement of the RD6 rice variety for salt tolerance and blast resistance was successfully achieved utilizing the *Saltol* QTL and *qBl* (*1*, *2*, *11*, and *12*) through marker-assisted backcrossing, together with phenotypic selection. The resulting BC4F4 132-12 introgression line exhibited superior salt tolerance, blast resistance, and reduced neck blast and was capable of maintaining higher qualities and agronomic performances than that of the original RD6 variety.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2073-4395/10/8/1118/s1, Table S1: QTL traits and primer sequences of the SSR markers for blast resistance and salt tolerance, Table S2: Genotype of the BC2F2 populations derived from MAB.

**Author Contributions:** K.T. and S.K. conceived the study. T.M., S.C., and J.S. designed the experiments. K.T. and S.K. performed the experiments. J.S. and S.C. supervised the study. K.T., T.M., S.C., and J.S. wrote the manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** This research was supported by The Plant Breeding Research Center for Sustainable Agriculture and The Salt-Tolerance Rice Research Group, Khon Kaen University, Khon Kaen, Thailand. Our gratitude is also extended to the Sakon Nakhon Rice Research Center for their support in our field experiments and Ubon Ratchathani Rice Research Center for their support in 2AP analysis.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Identification of QTLs Controlling Resistance/Tolerance to** *Striga hermonthica* **in an Extra-Early Maturing Yellow Maize Population**

#### **Baffour Badu-Apraku \*, Samuel Adewale, Agre Patern, Melaku Gedil and Robert Asiedu**

International Institute of Tropical Agriculture (IITA), PMB 5320, Ibadan 200001, Nigeria;

S.Adewale@cgiar.org (S.A.); P.Agre@cgiar.org (A.P.); M.Gedil@cgiar.org (M.G.); R.Asiedu@cgiar.org (R.A.)

**\*** Correspondence: b.badu-apraku@cgiar.org; Tel.: +234-8108482590

Received: 12 July 2020; Accepted: 6 August 2020; Published: 10 August 2020

**Abstract:** *Striga hermonthica* parasitism is a major constraint to maize production in sub-Saharan Africa with yield losses reaching 100% under severe infestation. The application of marker-assisted selection is highly promising for accelerating breeding for *Striga* resistance/tolerance in maize but requires the identification of quantitative trait loci (QTLs) linked to *Striga* resistance/tolerance traits. In the present study, 194 F2:3 families of TZEEI 79 × TZdEEI 11 were screened at two *Striga*-endemic locations in Nigeria, to identify QTLs associated with *S. hermonthica* resistance/tolerance and underlying putative candidate genes. A genetic map was constructed using 1139 filtered DArTseq markers distributed across the 10 maize chromosomes, covering 2016 cM, with a mean genetic distance of 1.70 cM. Twelve minor and major QTLs were identified for four *Striga* resistance/tolerance adaptive traits, explaining 19.4%, 34.9%, 14.2% and 3.2% of observed phenotypic variation for grain yield, ears per plant, *Striga* damage and emerged *Striga* plants, respectively. The QTLs were found to be linked to candidate genes which may be associated with plant defense mechanisms in *S. hermonthica* infested environments. The results of this study provide insights into the genetic architecture of *S. hermonthica* resistance/tolerance indicator traits which could be employed for marker-assisted selection to accelerate the efficient transfer of host plant resistance genes to susceptible genotypes.

**Keywords:** maize (*Zea mays* L.); *Striga* resistance/tolerance; QTL mapping; F2:3 biparental mapping; Marker-assisted selection

#### **1. Introduction**

Maize (*Zea mays* L.) is the most widely grown staple food crop in sub-Saharan Africa (SSA), and accounts for a large proportion of carbohydrates, proteins, lipids and vitamins for millions of people in the sub-region [1,2]. The root hemi-parasitic plant *Striga hermonthica* is an important biotic constraint limiting maize production in SSA. The *S. hermonthica* problem in SSA is the result of a shift away from the traditional cereal-based farming system, in which long fallow periods kept the soil *Striga* seed bank at levels that plants could tolerate [1]. Pressure on agricultural land has necessitated land use intensification, cereal mono-cropping and reduced fallow periods, resulting in an increased *Striga* seed bank and infestation levels that threaten the livelihoods of millions of farmers [3,4]. De Groote et al. [5] reported that over six million hectares of agricultural land in Western, Eastern and Southern Africa are seriously affected by *Striga*. Reduction in grain yield due to *S. hermonthica* parasitism may be up to 100% under severe infestation and unfavorable environmental conditions such as low soil fertility, erratic rainfall patterns and low-input conditions [6–8]. Subsistence farmers are usually the most severely affected. The extent to which *S. hermonthica* affects the growth of its host varies tremendously, depending on the level of host plant resistance/tolerance, the extent of infestation, and the prevailing environmental conditions [9,10]. Resistance to *Striga* denotes the capability of the host plant to induce the germination of *Striga* seeds but to prevent the parasite from attaching to the roots of the maize plant, or to kill the attached parasitic plants. Under *S. hermonthica* infestation, the resistant genotype supports considerably fewer *Striga* plants and produces a greater yield than the susceptible genotype [11–13]. In contrast, a *Striga* tolerant genotype supports as many *Striga* plants as the sensitive or susceptible genotype [14] but produces more dry matter and shows fewer damage symptoms [15]. *Striga* damage in maize is used as the indicator of tolerance, while the number of emerged *Striga* plants is the indicator of resistance. The identification of maize genotypes that combine outstanding levels of resistance and tolerance is a promising breeding strategy and has been recommended for *Striga* resistance breeding in several studies [12–14,16,17]. In selecting for tolerance or resistance and high grain yield under *Striga* infestation, the primary traits of interest are *Striga* damage and the number of emerged *Striga* plants. Presently, maize genotypes with combined resistance and tolerance to *S. hermonthica* (possessing both low *Striga* damage and low *Striga* emergence counts) as well as high grain yield have been identified in the International Institute of Tropical Agriculture - Maize Improvement Program (IITA-MIP). *Striga* damage rating score is positively associated with *Striga* emergence counts, and the two traits are negatively associated with yield under *S. hermonthica* infested conditions. Similarly, a large positive additive genetic correlation was recorded between grain yield and ears per plant, as well as moderately large negative genetic correlations between grain yield and flowering traits [18]. Comparable results were reported by earlier workers [19,20]. Nevertheless, the genotypic correlation between *S. hermonthica* damage rating and *S. hermonthica* emergence counts has been found to be low, implying that different genes control the inheritance of the two traits [18,21].

*Striga* infestation depends on the *Striga* seed bank in the soil, which builds up under continuous cropping of host plants; the accumulated *Striga* seeds can remain dormant in the soil for more than a decade [22]. The germination of *Striga* seed is induced by strigolactones, plant hormones produced in the roots of the maize plant and released when the plant is under stress [23]. For the germinated *Striga* seed to survive as an obligate parasite, it must produce haustoria that attach to the roots of the maize plant, through which it draws water and photosynthates [24]. Even though *Striga* possesses chlorophyll for photosynthesis, it still depends on its host for survival [25], as a result of its inability to accumulate enough photosynthates for autotrophic growth. Therefore, *Striga* establishes direct xylem links with the root system of the host [26] to obtain nutrients from its xylem sap [27]. Furthermore, a higher rate of transpiration in *Striga* compared to the host plant creates a water potential gradient from the host to the *Striga* plant [28]. *Striga* phytotoxic effects on the maize plant direct the partitioning of assimilates into the roots rather than the shoots for grain filling, thereby resulting in plant biomass and yield reduction [29].

Presently, management strategies for *S. hermonthica* include cultural, chemical, and biological approaches, which are uneconomical and/or knowledge-intensive for subsistence farmers [6]. Planting of *Striga*-resistant maize varieties is presently considered the best control strategy and the easiest to adopt or deploy, particularly when combined with other management practices [30–33]. Resistance to *S. hermonthica* parasitism is mainly attributed to low production of *Striga* germination stimulants by the host plant, attachment of few *Striga* plants to the roots of the host plant, as well as reduced *Striga* emergence [13,34]. When breeding for *S. hermonthica* resistance in maize, a combination of these resistance mechanisms is desirable in achieving effective and durable resistance [33]. The slow rate of development and deployment of *Striga* resistant genotypes is largely attributable to the complex genetics of resistance as well as limited knowledge of the specific mechanisms associated with resistance to *Striga* [35]. The resistance to *S. hermonthica* in maize is regulated by many genes or quantitative trait loci (QTL) with small additive effects and it is significantly influenced by the environment [13,36]. Therefore, conventional breeding for *Striga*-resistant cultivars, which relies on selecting maize cultivars with enhanced resistance through evaluation across multiple locations and years, has been time-consuming and less effective [27].

Marker-assisted breeding makes use of genotypic data in the identification of genotypes possessing desirable alleles, using linked genetic markers. Breeders employ marker-assisted selection (MAS) when an important trait that is difficult to assess phenotypically is tightly linked to a molecular DNA marker that can be scored quickly and precisely [37]. QTL mapping approaches are important genomic tools employed in dissecting the genetic architecture of complex traits [38,39] as well as in identifying genetic linkage through wide genotyping of a panel of germplasm displaying contrasting phenotypes across different environments [40]. For QTL identification, next-generation sequencing technology has become a practicable technique to rapidly identify large numbers of single nucleotide polymorphisms (SNPs) throughout the genome [41]. Unlike second-generation molecular markers, which result in low-quality mapping, SNP markers provide new insights to rapidly identify QTL of interest.

Information on map positions of genes and linked markers on a chromosome is crucial for efficient determination of the genetic architecture of polygenic traits in crop plants [42]. Several QTLs and candidate genes controlling resistance to *Striga* have been reported in cereals. For instance, Swarbrick et al. [32] identified three *S. hermonthica*-resistant QTLs in Kasalath–Koshihikari rice backcross inbreds; two of these QTLs originated from the Kasalath allele and one from the Koshihikari allele. The largest-effect QTL (Kasalath-derived allele) explained 16% phenotypic variance in the mapping population and was located on linkage group 4. Haussmann et al. [43] detected molecular markers associated with *S. hermonthica*-resistant QTLs, with the most significant QTL corresponding to the major-gene locus low germination stimulant (LGS) in linkage group I. Five genomic regions (QTLs) linked to stable *S. hermonthica*-resistant alleles from resistant variety N13 were detected through evaluation across a large number of field trials in Mali and Kenya. However, limited reports are available on the QTLs and genes controlling *Striga* resistance in maize. In a recent study by Adewale et al. [44] to identify molecular markers linked to *S. hermonthica* resistance in maize, 24 SNPs significantly associated with *S. hermonthica* resistance indicator traits under artificial *S. hermonthica* infestation were detected. The authors also identified four candidate genes on chromosomes 3, 5, 9 and 10, with functions closely associated with maize plant defense mechanisms against *S. hermonthica* parasitism. Identification of QTLs linked to *S. hermonthica* resistance, followed by gene introgression into elite genetic backgrounds, has the potential to reduce yield losses due to *Striga* and will ultimately provide a solid foundation for improving *Striga* resistance [45].

The objectives of this study were to identify QTLs and underlying candidate genes conferring resistance/tolerance to *S. hermonthica* in maize using an F2:3 biparental mapping population derived from a cross between a *Striga* resistant inbred line, TZEEI 79, and a *Striga* susceptible inbred line, TZdEEI 11.

#### **2. Materials and Methods**

#### *2.1. Germplasm and Phenotyping*

Based on the reports of previous studies, two extra-early maturing yellow inbred lines, TZEEI 79 (*Striga* resistant/tolerant) and TZdEEI 11 (*Striga* susceptible) were selected as parents to generate the F2:3 progenies used in the present study [1]. TZEEI 79 is an outstanding *S. hermonthica* resistant/tolerant, drought and low-soil N tolerant inbred line developed in the IITA-MIP from the broad-based *S. hermonthica* resistant/tolerant as well as drought and low-soil N tolerant population, TZEE-Y Pop STR C0. TZEEI 79 has significant positive GCA (general combining ability) for grain yield as well as significant negative GCA effects for *Striga* damage and *Striga* emergence counts under *Striga* infestation and has been extensively used in the IITA hybrid program as an important resource for developing high-yielding, multiple stress tolerant hybrids as well as an efficient tester for classifying other inbreds into heterotic groups [46]. Crosses were made between TZEEI 79 and TZdEEI 11 designated as P1 and P2 respectively, to generate 220 F1 progenies. The F1 progenies and the parental lines were planted, and leaf samples were collected at 3 weeks after planting. Verification of the parental type alleles (quality control analysis) was carried out on the F1 progenies prior to advancement to F2. The F1 progenies were screened using two SSR primers (bnlg 182 and umc 1568) which were found to be polymorphic between the two parents. The analysis identified 170 true-to-type F1 hybrids which were advanced to F2. The 170 true-to-type F2 ears were planted ear-to-row and 194 F2 individuals which were randomly selected and selfed were used in the present study.

The F2:3 progenies and the two parental lines were screened under artificial *S. hermonthica* infestation at Mokwa (9°18′ N, 5°4′ E, 210 m above sea level, 1100 mm yearly rainfall, luvisol soil) and Abuja (9°16′ N, 7°20′ E, 445 m above sea level, 1500 mm yearly rainfall, ferric-luvisol soil) in the Southern Guinea savanna of Nigeria in 2018. At each experimental site, the trial was laid out using a randomized incomplete block design (14 × 14 lattice) with two replicates. The experimental units were 3 m long single-row plots, with an inter-row spacing of 0.75 m and within-row spacing of 0.4 m, to achieve a target population density of 66,666 plants/ha. The fields for artificial *S. hermonthica* infestation at Mokwa and Abuja were treated with ethylene gas at 2 weeks before planting to eliminate any potential *Striga* seeds present in the soil. The *S. hermonthica* seeds used for the experiment were obtained from sorghum farms around the test locations at Abuja and Mokwa in 2017. The artificial *S. hermonthica* infestation was carried out as proposed by the IITA Maize Program [16]. Briefly, about a week before inoculation, the *S. hermonthica* seeds were carefully mixed with finely sieved sand at the ratio 1:99 by weight to ensure rapid and uniform infestation. A standard scoop calibrated to deliver approximately 5000 germinable seeds per hill was utilized for the artificial infestation. Three maize seeds were planted per infested hill and the seedlings were later thinned to two plants per stand at 2 weeks after emergence. Fertilizer application on the maize plots was delayed till about 30 days after planting, in order to subject the maize plants to stress, a condition that was expected to enhance strigolactone production. This ensured good germination of *Striga* seeds and attachment of *Striga* plants to the roots of host plants. At this plant growth stage, 20–30 kg N ha<sup>−1</sup> and 30 kg each of P and K were applied as NPK 15-15-15, taking into consideration the fertility status of the soil. Reduction in the fertilizer application rate was important because *Striga* emergence decreases at high N rates [16]. At 10 weeks after planting (WAP), typical symptoms of *Striga* infestation on the host plants were observed, such as chlorosis, leaf scorching (firing) and blotching, stunting, decrease in ear and tassel size, brown necrotic spots, leaf wilting and rolling, stalk lodging, open-tip of ears at late growing stage, and premature death of host plants. Host plant *Striga* damage severity was scored on a scale of 1 to 9, where 1 = normal plant growth with no obvious symptoms, and 9 = all leaves completely scorched, collapse of host plants and no ear formation [16]. Ratings of 1–5 indicated resistance while ratings of 6–9 indicated susceptibility. In addition, data were collected on *Striga* emergence count at 10 WAP as the number of *Striga* plants thriving on the maize root system, and on ears per plant (EPP), obtained by dividing the total number of ears harvested per plot by the number of plants in the plot at harvest. Grain moisture was determined using a Kett moisture tester PM-450, and grain yield (kg/ha) was calculated from the field weight of harvested ears per plot, adopting a shelling percentage of 80 and adjusting to 15% moisture content [47].
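The plot arithmetic described above can be illustrated with a short sketch. The source states only the inputs (field weight, a shelling percentage of 80, adjustment to 15% moisture, 3 m × 0.75 m plots, two plants per hill at 0.4 m), so the exact yield formula below is an assumption based on the conventional moisture adjustment, and all function and parameter names are hypothetical.

```python
def plants_per_ha(row_spacing_m, hill_spacing_m, plants_per_hill):
    """Target plant density from the spacing between rows and between hills."""
    hills_per_ha = 10_000.0 / (row_spacing_m * hill_spacing_m)
    return plants_per_hill * hills_per_ha


def ears_per_plant(ears_harvested, plants_at_harvest):
    """EPP: ears harvested in a plot divided by plants in the plot at harvest."""
    return ears_harvested / plants_at_harvest


def grain_yield_kg_ha(field_weight_kg, moisture_pct, plot_area_m2=2.25,
                      shelling_fraction=0.80, target_moisture_pct=15.0):
    """Grain yield in kg/ha (assumed formula, not taken from the source).

    field_weight_kg: weight of harvested ears per plot.
    moisture_pct: grain moisture measured at harvest.
    plot_area_m2: 3 m row length x 0.75 m row spacing = 2.25 m2.
    """
    # Rescale the grain weight to the target (15%) moisture content.
    moisture_factor = (100.0 - moisture_pct) / (100.0 - target_moisture_pct)
    grain_kg_per_plot = field_weight_kg * shelling_fraction * moisture_factor
    # Convert from kg per plot to kg per hectare.
    return grain_kg_per_plot * 10_000.0 / plot_area_m2


if __name__ == "__main__":
    print(int(plants_per_ha(0.75, 0.4, 2)))        # truncated plants/ha
    print(round(grain_yield_kg_ha(1.0, 15.0), 1))  # 1 kg of ears at 15% moisture
```

With 0.75 m rows, 0.4 m hills and two plants per hill, the density function reproduces the 66,666 plants/ha target quoted in the text.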

#### *2.2. DArTseq Genotyping and SNP Data Filtering*

Young and healthy leaf samples from single plants of the F2 individuals and bulk samples from the parental lines were collected and frozen immediately after harvesting using liquid nitrogen and thereafter stored at −80 °C. Genomic DNA extraction was carried out following the DArT protocol (www.diversityarrays.com/files/DArT\_DNA\_isolation/). The extracted DNA was assessed for quality by visualization on agarose gel (2% w/v) and the quantity was estimated on NanoDrop-1000 spectrophotometer (NanoDrop, Wilmington, DE, USA) using the absorbance ratio A260/A280 to determine the concentration (ng/μL) and purity level of the DNA.

Genotyping of the 194 F2 individuals plus the two parents was carried out using DArTseq technology [48,49]. Genome complexity reduction which involved the use of a combination of two restriction enzymes (*Pst*I–*Mse*I) was used to create a genome representation of the analyzed samples. All fragments generated were amplified and sequenced to identify the single nucleotide polymorphisms (SNPs) using a proprietary analytical pipeline developed by DArT P/L. After a strict quality control process, which included parameters such as call rate, data reproducibility (~20% of samples replicated), and rate of monomorphism to eliminate monomorphic markers, 9951 SNPs were extracted from the evaluated germplasm. The 9951 SNPs were filtered for unmapped markers, duplicate markers and markers segregating between the *Striga* resistant and *Striga* susceptible parents. A total of 1139 high-quality DArTseq markers distributed across the 10 maize chromosomes were retained for the construction of genetic linkage map as well as the QTL mapping.

#### *2.3. Data Analysis*

The data recorded on emerged *Striga* counts and *Striga* damage severity scores were subjected to natural logarithm transformation. Thereafter, data collected on grain yield, ears per plant, *Striga* damage as well as *Striga* emergence counts were tested for normality using the Shapiro–Wilk (*W*) test [39,50] before analysis of variance. Box plots were made to visualize the distributions of grain yield and other traits under each research environment using the ggplot2 library [51]. Analysis of variance was conducted across research environments using the general linear model procedure (PROC GLM) implemented in the Statistical Analysis System (SAS), version 9.3 [52]. In the analysis, environment, replications (environments), blocks (replications × environments) were considered as random and the F2:3 families (genotypes) as fixed effects. Estimates of broad sense heritability of the traits (*H*<sup>2</sup>) across research environments were computed on a family-mean basis as proposed by Holland et al. [53], using the following formula:

$$H^2 = \frac{\sigma_g^2}{\sigma_g^2 + \frac{\sigma_{ge}^2}{e} + \frac{\sigma_e^2}{re}}$$

where σ<sup>2</sup><sub>*g*</sub> = variance component due to the genotypes, σ<sup>2</sup><sub>*ge*</sub> = genotype × environment variance, σ<sup>2</sup><sub>*e*</sub> = experimental error variance; *e* = number of environments, and *r* = number of replications within environment.
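As a minimal sketch of how the family-mean heritability is evaluated, the function below divides the genotypic variance by the phenotypic variance of a family mean, i.e., the genotypic variance plus the genotype × environment variance divided by *e* plus the error variance divided by *r* × *e*. The variance components in the usage lines are made-up illustrative numbers, not estimates from this study, and the function name is hypothetical.

```python
def broad_sense_heritability(var_g, var_ge, var_e, n_env, n_rep):
    """Family-mean broad-sense heritability.

    var_g:  genotypic variance component
    var_ge: genotype x environment variance component
    var_e:  experimental error variance
    n_env:  number of environments (e)
    n_rep:  number of replications within each environment (r)
    """
    # Phenotypic variance of a family mean across environments and replicates.
    var_p = var_g + var_ge / n_env + var_e / (n_rep * n_env)
    return var_g / var_p


if __name__ == "__main__":
    # Illustrative components only; two environments and two replicates
    # mirror the trial layout described in Section 2.1.
    h2 = broad_sense_heritability(var_g=2.0, var_ge=1.0, var_e=4.0,
                                  n_env=2, n_rep=2)
    print(round(h2, 3))
```

Because the interaction and error terms are divided by the number of environments and replicates, adding environments or replicates raises the family-mean heritability even when the underlying variance components are unchanged.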

Correlation coefficients were estimated among the traits with the adjusted means of the F2:3 families using the Ggally function implemented in GGally package [54]. Furthermore, the mixed linear model (MLM) established in META-R software [55] was used to compute the best linear unbiased estimates (BLUEs) for each genotype in each and across environments, which were used for the QTL analysis. The R/qtl package was used to construct a linkage map [56]. Markers that were identical across all genotypes were identified and eliminated as duplicates. Furthermore, a χ<sup>2</sup>-test for goodness-of-fit (*p* ≤ 0.0001) was used to identify markers with distorted segregation patterns [50,57,58]. Markers with significant deviation from the expected Mendelian segregation ratio (1:2:1) for the F2:3 population were excluded from the analysis, resulting in a total of 1139 SNP markers used for the genetic map construction and QTL analysis.
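The segregation-distortion filter can be sketched as a goodness-of-fit test of the three genotype classes against the 1:2:1 expectation. This is an illustrative stand-in, not the R/qtl routine used by the authors; the function names and the genotype counts in the usage lines are hypothetical. With three classes the test has 2 degrees of freedom, for which the χ<sup>2</sup> upper-tail probability has the closed form exp(−χ<sup>2</sup>/2), so no statistics library is needed.

```python
import math


def segregation_chi2(n_aa, n_ab, n_bb, ratio=(1, 2, 1)):
    """Chi-square goodness-of-fit of genotype counts to an expected ratio.

    Returns (chi2, p_value). The p-value uses the closed form for
    2 degrees of freedom, so it is valid only for three genotype classes.
    """
    observed = (n_aa, n_ab, n_bb)
    total = sum(observed)
    ratio_sum = sum(ratio)
    expected = [total * part / ratio_sum for part in ratio]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.exp(-chi2 / 2.0)  # survival function of chi-square, df = 2
    return chi2, p_value


def is_distorted(n_aa, n_ab, n_bb, alpha=1e-4):
    """Flag a marker whose segregation deviates from 1:2:1 at p <= alpha."""
    _, p_value = segregation_chi2(n_aa, n_ab, n_bb)
    return p_value <= alpha


if __name__ == "__main__":
    # Hypothetical counts for 194 individuals.
    print(is_distorted(48, 98, 48))   # close to 1:2:1
    print(is_distorted(120, 60, 14))  # strongly skewed toward one parent
```

The very strict threshold (*p* ≤ 0.0001) removes only markers with pronounced distortion, which keeps marker density high while still excluding the loci most likely to bias map distances.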

#### *2.4. QTL Analysis and Candidate Gene Identification*

Quantitative trait loci (QTL) mapping for each and across environments was carried out for four *Striga* resistance/tolerance adaptive traits (grain yield, *Striga* damage at 10 WAP, emerged *Striga* counts at 10 WAP and ears per plant (EPP)), using the R/qtl package with the composite interval mapping (CIM) algorithm as proposed by Wang et al. [50]. The statistical significance of the QTL was assessed using permutation tests (1000 replications) for all traits. A logarithm of odds (LOD) of 3.0 was set through the permutation test to identify significant QTLs for the traits [59]. The additive effects and proportion of phenotypic variance explained (PVE) by each QTL were estimated using the "fitqtl" function of R/qtl (R version 3.3.4). The sign of the effect of each QTL was used to identify the origin of the favorable alleles [60]. The potential locations of the QTLs were described according to their LOD peaks and their surrounding regions. Identified QTLs were named following the convention described by Bo et al. [61]. For example, *qepp-2* represented the QTL identified for number of ears per plant on chromosome 2. Putative candidate genes were searched within a 2.0 Mb interval downstream and upstream of the significantly associated SNPs using the MaizeGDB database (RefGen\_v4).
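The idea behind a permutation-derived LOD threshold can be sketched as follows. For brevity, the sketch replaces composite interval mapping with single-marker regression on allele dosage (0, 1, 2), so it illustrates the permutation logic only and is not the R/qtl procedure used in this study. All names are hypothetical and the data in the usage lines are simulated.

```python
import math
import random


def marker_lod(genotypes, phenotypes):
    """LOD score from linear regression of phenotype on allele dosage."""
    n = len(phenotypes)
    mean_x = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in genotypes)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(genotypes, phenotypes))
    rss_null = sum((y - mean_y) ** 2 for y in phenotypes)
    if sxx == 0 or rss_null == 0:
        return 0.0  # monomorphic marker or constant trait: no evidence
    rss_full = rss_null - sxy ** 2 / sxx
    if rss_full <= 0:
        return float("inf")  # perfect fit
    return (n / 2.0) * math.log10(rss_null / rss_full)


def permutation_threshold(markers, phenotypes, n_perm=1000, alpha=0.05, seed=0):
    """Genome-wide LOD threshold: the (1 - alpha) quantile of the maximum
    LOD obtained after shuffling phenotypes relative to genotypes."""
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    max_lods = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any genotype-phenotype association
        max_lods.append(max(marker_lod(m, shuffled) for m in markers))
    max_lods.sort()
    index = min(n_perm - 1, math.ceil((1.0 - alpha) * n_perm) - 1)
    return max_lods[index]


if __name__ == "__main__":
    rng = random.Random(1)
    n = 100
    # Five simulated F2-like markers; only the first one affects the trait.
    markers = [[rng.choice([0, 1, 1, 2]) for _ in range(n)] for _ in range(5)]
    trait = [3.0 * g + rng.gauss(0.0, 1.0) for g in markers[0]]
    threshold = permutation_threshold(markers, trait, n_perm=200)
    print(marker_lod(markers[0], trait) > threshold)
```

Taking the maximum LOD across all markers in each permutation is what makes the threshold genome-wide: it controls the chance of a false positive anywhere on the map rather than at a single locus.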

#### **3. Results**

#### *3.1. Phenotypic Analysis of Grain Yield and Other Striga Resistance Adaptive Traits*

The 194 F2:3 families and the two parental lines were evaluated under artificial *Striga* infestation to assess variation in their level of resistance. The distributions of grain yield, *Striga* damage, number of emerged *Striga* plants as well as ears per plant in the F2:3 population are displayed in Figure 1. Significant variation was detected among the genotypes under each and across research environments (Figure 1, Table S1). The performance of the genotypes (F2:3 families and the parental lines) for *Striga* emergence count and ears per plant were not significantly influenced by the environment whereas grain yield and *Striga* damage displayed significant genotype × environment interactions. The two parental lines TZEEI 79 and TZdEEI 11 differed significantly and consistently in their performance under artificial *Striga* infestation, and phenotypic values for each trait of segregating population displayed wide ranges (Table 1). Transgressive segregation was observed for all traits in that some of the F2:3 families showed higher and lower levels of grain yield, *Striga* damage, number of emerged *Striga* plants and ears per plant compared to the parental lines (Table 1). The *Striga* resistant inbred line TZEEI 79 exhibited high grain yield and ears per plant as well as reduced *Striga* damage and *Striga* emergence count whereas the *Striga* susceptible line TZdEEI 11 showed significantly lower grain yield and ears per plant as well as increased *Striga* damage and emergence count. Grain yield (kg/ha) across the F2:3 population varied from 1070.1 to 4113.9, with a mean of 2439.2 (Table 1). In addition, individual means varied from 0.5 to 1.4, 2.4 to 7.5, 0.4 to 3.7 for ears per plant, *Striga* damage and *Striga* emergence count, respectively. Broad-sense heritability estimates of the traits derived from the variance components varied from 0.47 for *Striga* damage to 0.70 for ears per plant. 
The Shapiro–Wilk (*W*) normality tests revealed that the grain yield and *Striga* damage phenotypic data were normally distributed while those of *Striga* emergence counts and ears per plant were not (Table S2). High *W*-test values were obtained for all studied traits, ranging from 0.97 to 0.99 (Table S2). Correlation analysis revealed significant and positive correlations between number of ears per plant and grain yield, whereas negative and significant correlations were observed between ears per plant and *Striga* damage as well as between *Striga* damage and grain yield (Figure 2).

**Figure 1.** Box plots showing the distribution of (**A**) grain yield (YIELD, t/ha), (**B**) emerged *Striga* plants (ESP), (**C**) *Striga* damage rating (SDR) and (**D**) number of ears per plant (EPP) under artificial *Striga* infestation at Mokwa (MK) and Abuja (AB) in 2018. The points represent the F2:3 families and the parental genotypes.

**Table 1.** Descriptive statistics of *Striga* resistance indicator traits of parents and F2:3 population derived from the cross between TZEEI 79 × TZdEEI 11 across *Striga*-infested environments.


H2—broad sense heritability, CV—coefficient of variation.

**Figure 2.** Correlation between grain yield and other *Striga* resistance indicator traits in an F2:3 mapping population derived from TZEEI 79 × TZdEEI 11 under artificial *Striga* infestation. The axes display the range of values obtained for each trait; black circles represent the most predominant values for each trait among the genotypes while the red lines represent the direction of the relationship between two traits. \*\* indicates significance at the 0.01 probability level (highly significant).

#### *3.2. Linkage Map*

A genetic linkage map containing 1139 SNPs mapped on the 10 maize chromosomes was constructed (Figure S1, Table 2). The resulting map spanned a total genetic distance of 2016 cM, with mean interlocus distance of 1.70 cM. The average genetic distances between successive markers ranged from 0.86 cM to 11.86 cM for chromosomes 1 and 7 respectively.



#### *3.3. QTL Detection and Identification of Potential Candidate Genes*

Through the QTL analysis, a total of 12 QTLs with significant LOD scores ≥ 3.0 were identified for the four *Striga* resistance indicator traits using the integrated genetic map and mean phenotypic data across research environments (Table 3). The 12 QTLs identified included three for grain yield, five for ears per plant, three for *Striga* damage and one for *Striga* emergence counts. The proportion of phenotypic variation explained by the QTLs varied from 2.0% for *qepp-2.1* to 13.5% for *qepp-1*. Three QTLs, *qgy-1.1*, *qgy-2.1* and *qgy-7*, detected for grain yield explained 5.6, 10.3 and 2.3% of the phenotypic variation, respectively. Furthermore, five QTLs, *qepp-1*, *qepp-2.1*, *qepp-3*, *qepp-7*, and *qepp-8.1*, were identified for ears per plant, explaining phenotypic variation ranging from 2.0 to 13.5%. Similarly, QTLs *qsd-2*, *qsd-5.1* and *qsd-7* detected for *Striga* damage explained 8.0, 3.0 and 3.2% of the phenotypic variation, respectively. The only QTL (*qsc-3.1*) identified for *Striga* emergence count had a PVE of 3.1%. The QTLs *qsd-7* and *qepp-7* were detected at the same position for *Striga* damage and ears per plant. Similarly, *qgy-2.1* and *qsd-2*, detected for grain yield and *Striga* damage, respectively, were consistently identified at the same position in each of the two locations. Three major QTL genomic regions were detected on chromosomes 1, 2, and 8 with flanking marker intervals 216–226 cM, 134–156 cM and 35–38 cM, respectively (Figure S2). In all cases, favorable alleles for *Striga* resistance/tolerance were contributed by the *Striga* resistant inbred line TZEEI 79.

A total of 116 protein coding genes were identified within 2.0 Mb interval downstream and upstream of the significantly associated SNPs (Table S3). Of the 116 candidate genes, 17 key candidate genes associated with the identified QTL for *Striga* resistance/tolerance indicator traits under artificial *Striga* infested environments are presented in Table 4. For grain yield, the *qgy-1.1* was found associated with GRMZM2G408305 which encodes ARM repeat superfamily protein as well as GRMZM2G072376 which encodes bHLH-transcription factor 56. The QTL *qgy-7, qepp-7* and *qsd-7* were linked to GRMZM6G199466 (hsp3—heat shock protein3), GRMZM2G008234 (ereb114—AP2-EREBP-transcription factor 114) as well as GRMZM2G044194 (phytosulfokine peptide precursor1). Similarly, for ears per plant, *qepp-1* was found associated with GRMZM2G324999 which encodes the WRKY-transcription factor 25; *qepp-2.1* was associated with GRMZM2G174784 (EREB197—putative AP2-EREBP transcription factor superfamily protein), GRMZM2G174917 (ereb47—AP2-EREBP-transcription factor 47) and GRMZM2G131961 (bzip27—bZIP-transcription factor 27); *qepp-3* was linked to Zma-MIR167g which promotes lateral root development in plants. On chromosome 8, QTL *qepp-8.1* detected for ears per plant was linked to the genes GRMZM2G051528 which encodes myb transcription factor95 and GRMZM2G053503 which encodes ethylene-responsive factor-like protein. For *Striga* damage, QTL *qsd-5.1* was associated with genes GRMZM2G059851 which encodes the heat shock factor protein as well as GRMZM2G099334 which encodes myb3—WD40 repeat protein. The QTL *qsc-3.1* detected for *Striga* emergence count was found associated with the genes GRMZM2G054050 which encodes multicopper oxidase protein, GRMZM2G162709 (MYB-transcription factor 137), and GRMZM2G340342 which encodes the ARM repeat superfamily protein.
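The window-based candidate gene search can be sketched as an interval overlap query. The sketch assumes that "a 2.0 Mb interval downstream and upstream" means 2.0 Mb on each side of the SNP, and it uses a hypothetical in-memory gene list with made-up identifiers and coordinates instead of a MaizeGDB query.

```python
def genes_in_window(snp_chrom, snp_pos_bp, genes, flank_bp=2_000_000):
    """Return IDs of genes overlapping the window around a SNP.

    genes: iterable of (gene_id, chrom, start_bp, end_bp) tuples.
    flank_bp: window half-width; 2.0 Mb on each side is an assumption.
    """
    window_start = max(0, snp_pos_bp - flank_bp)
    window_end = snp_pos_bp + flank_bp
    hits = []
    for gene_id, chrom, start_bp, end_bp in genes:
        # A gene is kept if any part of it falls inside the window.
        if chrom == snp_chrom and end_bp >= window_start and start_bp <= window_end:
            hits.append(gene_id)
    return hits


if __name__ == "__main__":
    # Hypothetical annotation records, not real maize gene models.
    annotation = [
        ("gene_A", "3", 1_000_000, 1_005_000),
        ("gene_B", "3", 3_500_000, 3_510_000),
        ("gene_C", "3", 6_000_001, 6_010_000),
        ("gene_D", "5", 3_000_000, 3_001_000),
    ]
    print(genes_in_window("3", 4_000_000, annotation))
```

A fixed physical window is a simple proxy for linkage; in practice the list it returns is usually narrowed further by functional annotation, as was done here when 17 key candidates were selected from the 116 genes.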


**Table 3.** Summary of quantitative trait loci (QTLs) mapped in the F2:3 population derived from TZEEI 79 × TZdEEI 11 under artificial *Striga* infestation.

AB—Abuja, MK—Mokwa, Across—across the two *Striga* infested environments; Add—Additive effect, LOD—Logarithm of odds, PVE—proportion of phenotypic variance explained by single QTL. Grain yield (kg/ha), ears per plant (number of ears per plant), *Striga* damage (based on rating scale 1–9) and *Striga* emergence count (number of emerged *Striga* plants).



\* Linkage group start and end positions within 2.0 Mb interval downstream and upstream of the significant associated SNPs.

#### **4. Discussion**

Marker-assisted selection (MAS) is an efficient approach for increasing the accuracy and efficiency of selection using markers tightly linked to genes or QTLs of interest, to complement phenotypic selection [40,62]. The identification of QTLs associated with *Striga* resistance/tolerance would facilitate rapid development of *Striga* resistant/tolerant maize genotypes using MAS, due to the polygenic nature of host–parasite relationship and its interaction with environmental factors [63]. The normal distribution observed for grain yield and *Striga* damage in the present study is a result of the highly diverse genotypes segregating in the mapping population [64]. The selection of parental lines with varying levels of resistance to *Striga* allowed sufficient segregation of the traits in the population. The distribution of measured traits in the F2:3 population indicated the existence of transgressive segregation (i.e., progenies performing outside the range of the parental genotypes). Transgressive segregation has been observed in populations screened under low N [65] and *Striga* infestation [66,67]. This phenomenon results from the accumulation of favorable and unfavorable alleles resulting from both parents. Moderate-to-high broad sense heritability estimates (0.47–0.70) observed for grain yield and other *Striga* resistance/tolerance indicator traits confirmed that high-quality phenotypic data were used for the genetic analysis. The moderate-to-high heritability estimates obtained in the present study implied that the observed genetic variation among the genotypes was strongly influenced by genetic factors, and that the *Striga* resistance indicator traits could be effectively improved in *Striga* resistance breeding programs. Previous studies reported moderate-to-high heritability values, ranging from 0.53–0.84 for *Striga* resistance/tolerance adaptive traits [68,69].

Linkage map density and resolution largely depend on population size and type, marker density, and the accuracy of genotyping [70,71]. The F2:3 mapping population, developed from the cross between inbred TZEEI 79 (*Striga* resistant) and inbred TZdEEI 11 (*Striga* susceptible), was used to investigate the inheritance of *Striga* resistance/tolerance. QTL mapping in the TZEEI 79 × TZdEEI 11 F2:3 population identified twelve QTLs for the *Striga* resistance indicator traits across the two research environments. Each QTL explained a low-to-moderate share of the phenotypic variation, ranging from 2.0% for *qepp-2.1* to 13.5% for *qepp-1*. This finding confirmed the complexity of the genetic basis of *S. hermonthica* resistance [13,44]. The 12 QTLs comprised three for grain yield, five for ears per plant, three for *Striga* damage and one for *Striga* emergence count. The identified QTLs were located on chromosomes 1, 2, 3, 5, 7 and 8. Similarly, Adewale et al. [44] identified markers linked to *Striga* resistance indicator traits in maize on chromosomes 1, 3, 5, 7 and 8. Samayoa et al. [72] found QTLs associated with Mediterranean corn borer resistance in maize on chromosomes 1, 5 and 6 using a recombinant inbred line (RIL) population obtained from the cross B73 × CML103. In a study by Haussmann et al. [43], five genomic regions (QTLs) linked to *Striga* resistance in sorghum were reported on chromosomes 1, 2, 5 and 6. Overall, the QTLs mapped in the present study provide further information on the genetic basis of *Striga* resistance/tolerance in maize and indicate that resistance/tolerance to *Striga* is quantitatively inherited. The signs of the QTL additive effects indicated that favorable alleles at each QTL were contributed by either the resistant or the susceptible parent.
The resistant parental inbred TZEEI 79 contributed favorable alleles for resistance/tolerance to *Striga* for most of the identified QTLs. Estimates of genetic effects of the QTL indicated that additive gene action was preponderant in most cases for *Striga* resistance indicator traits.

The QTL analysis in the F2:3 mapping population identified three major QTL genomic regions (*qepp-1*, *qgy-2.1* and *qepp-8.1*) on chromosomes 1, 2 and 8, with flanking marker intervals of 216–226 cM, 134–156 cM and 35–38 cM, respectively. Interestingly, the QTLs detected on chromosome 2 were consistent across environments for grain yield and *Striga* damage. The QTL *qgy-2.1*, identified for grain yield on chromosome 2 at 141.0 cM, was found to be pleiotropic with the QTL *qsd-2* detected for *Striga* damage. Similarly, the QTL *qgy-7*, detected for grain yield on chromosome 7 at 28.4 cM, was pleiotropic with the QTLs for *Striga* damage and number of ears per plant. The co-localization of QTLs for these traits may reflect the high correlation coefficients observed among the different *Striga* resistance indicator traits. These two QTL regions would be valuable genomic resources for fine mapping and candidate gene discovery. The validation of a common QTL region in different environments and/or genetic backgrounds is important for application in MAS to improve breeding efficiency [73]. The QTLs *qgy-2.1* and *qsd-2* from TZEEI 79, located on chromosome 2 (133.9–156.4 cM) and identified in the two test locations in this study, have not been previously reported. This region could be a hot spot of genes for the genetic improvement of *Striga* resistance in maize. Overlapping QTL regions on chromosome 7 (28.1–39.9 cM) for *Striga* damage and ears per plant were also identified in the present study. This common region could likewise provide better prospects for breeders to enhance resistance to *Striga* parasitism in maize using MAS.
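The co-localization statements above amount to interval checks on the genetic map. The following is a minimal sketch using only the positions quoted in the text; the helper names `within` and `intervals_overlap` are illustrative. Sharing a map interval is consistent with pleiotropy but also with tight linkage, so a check of this kind does not by itself separate the two.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start_cM, end_cM)


def within(position_cm: float, interval: Interval) -> bool:
    """True if a QTL peak position lies inside a flanking-marker interval."""
    start, end = interval
    return start <= position_cm <= end


def intervals_overlap(a: Interval, b: Interval) -> bool:
    """True if two map intervals on the same chromosome share any segment."""
    return a[0] <= b[1] and b[0] <= a[1]


# Positions quoted in the text (cM).
chr2_region = (133.9, 156.4)   # qgy-2.1 / qsd-2 region on chromosome 2
chr7_region = (28.1, 39.9)     # overlapping region on chromosome 7
qgy_2_1_peak = 141.0
qgy_7_peak = 28.4

print(within(qgy_2_1_peak, chr2_region))
print(within(qgy_7_peak, chr7_region))
```

Both peaks fall inside their respective regions, in line with the co-localization reported for chromosomes 2 and 7.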

Putative candidate genes associated with some of the identified QTLs for *Striga* resistance indicator traits are presented in Table 4. The gene model GRMZM2G054050 (*qsc-3.1*), associated with *Striga* emergence count, encodes a multicopper oxidase Lpr-2 (low phosphate root 2) protein, whose homolog Lpr-1 has been found to regulate primary root length under Pi (inorganic phosphate) deficient conditions [74,75]. Lpr-1 and Lpr-2 play important roles in Pi sensing at root tips [75]. Similarly, the gene model GRMZM2G044194, linked to the QTLs for grain yield (*qgy-7*), ears per plant (*qepp-7*) and *Striga* damage (*qsd-7*), was associated with the psk1 (phytosulfokine peptide precursor1) gene. PSK genes have been reported to promote cell growth, especially in the quiescent centre cells of the root apical meristem [76]. Likewise, the gene model GRMZM2G408305 (*qgy-1.1*), associated with grain yield, encodes an ARM family protein; such proteins promote lateral root growth in plants [77]. The QTL *qepp-3*, located on chromosome 3, was associated with MIR167g. In Arabidopsis, soybean and maize, miR167 has been reported to play important roles in lateral root growth and architecture [78]. Under nutrient deprivation, such as that caused by *Striga* parasitism, plants alter their root systems to explore heterogeneous soil regions for nutrients. The branching of secondary roots from primary roots is one of the processes through which plants efficiently obtain nutrients from the soil. The QTL *qsd-5.1* was associated with the gene GRMZM2G059851, encoding the heat-shock factor protein HSF 6. Heat-shock proteins are ubiquitous proteins responsible for protein folding, assembly, translocation and degradation in response to biotic stresses, depending on the nature of the causal organism, the plant genotype (susceptible or resistant) and the plant's growth stage [64,79]. In addition, Ng et al.
[80] identified AP2/ERF, MYB, bHLH, WRKY and bZIP as major transcription factor families involved in plant defense signalling. In a recent study, Adewale et al. [44] identified four putative candidate genes, GRMZM2G060216, GRMZM2G103085, GRMZM2G057243 and GRMZM2G164743, located on chromosomes 3, 5, 9 and 10, with functions related to plant defense mechanisms under *Striga* infested conditions. The candidate genes identified in the present study differ from those reported earlier. The candidate genes identified from the dissection of *qsc-3.1*, *qgy-1.1*, *qepp-3* and *qsd-5.1* are suggestive of *Striga* resistance response mechanisms. The QTLs identified in the present study should be validated in different genetic backgrounds and environments to verify their reproducibility before effective use in MAS breeding for resistance to *Striga*. Overall, the QTLs/markers significantly associated with *S. hermonthica* resistance adaptive traits would be useful as potential candidate loci for enhancing *Striga* resistance in maize. Applying these markers in selection would eliminate the bulk of *Striga* susceptible genotypes, which in turn may significantly reduce the number and cost of screenings required to improve maize for *Striga* resistance. Based on our results, we have initiated a program aimed at developing extra-early mapping populations from different genetic backgrounds so that the putative markers identified in our studies can be validated and deployed in maize breeding programs through MAS.

#### **5. Conclusions**

A total of 12 QTLs associated with *S. hermonthica* resistance/tolerance traits in maize were identified across *Striga* infested environments in the present study. The identified QTLs displayed varying contributions to phenotypic expression and are in regions that play roles which may be associated with plant defense response under *Striga* infestation in maize. The co-localization of QTL for grain yield and other traits indicated strong associations between the traits. The QTLs mapped in this study could be candidates for marker-assisted introgression of *Striga* resistance/tolerance genes in maize, after validation in different genetic backgrounds and in different environments.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2073-4395/10/8/1168/s1, Table S1: Mean squares of F2:3 mapping population evaluated under artificial *Striga* infestation at both Abuja and Mokwa in 2018 growing season. Table S2: Shapiro-Wilk's normality tests for *Striga* resistance/tolerance indicator traits for F2:3 population derived from the cross between TZEEI 79 (*Striga* resistant) and TZdEEI 11 (*Striga* susceptible). Table S3: Candidate genes associated with the identified QTL for key Striga resistance/tolerance indicator traits under artificial *Striga* infestation. Figure S1: Linkage map of F2:3 mapping population based on 1139 DArTseq markers. Left bar of the linkage map indicates cM distance while right bar of linkage map displayed the marker names. Red bars and letters indicate QTL identified across *Striga* infested environments. Figure S2. Major QTL identified for *Striga* resistance in the extra-early yellow mapping population. A likelihood of odds (LOD) scan showing the QTL identified on chromosomes 1, 2, and 8 explaining ≥ 10% phenotypic variation.

**Author Contributions:** Conceptualization, B.B.-A.; Resource, B.B.-A.; Methodology, B.B.-A.; Formal Analysis, S.A., A.P.; Data Curation, A.P., S.A.; Writing—Original Draft Preparation, S.A., B.B.-A.; Writing—Review & Editing, B.B.-A., M.G., A.P., and R.A.; Funding Acquisition, B.B.-A., M.G. and R.A. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the Bill & Melinda Gates Foundation [OPP1134248] as well as the Integrated Genotyping Service and Support (IGSS) platform grant (ref. number PJ-002507) of BecA-ILRI, Kenya.

**Acknowledgments:** The authors are grateful to Ana Luisa Garcia-Oliveira and Clay Sneller for their roles in funding acquisition and technical contributions, as well as the IITA Maize Improvement Program, (particularly A. Talabi and V. Oladipo) and the Bioscience Center staff (particularly N. Unachukwu, Q. Obi and Y. Ilesanmi) for technical assistance during the evaluation of field trials and DNA extraction, respectively.

**Conflicts of Interest:** The authors declare no conflicts of interest.

**Data Availability:** The DArTseq datasets used/analyzed in the manuscript have been deposited at the IITA-CKAN repository. doi:10.25502/aabs-rc02/d; doi:10.25502/8dkd-0h42/d.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

*Article*

## **Molecular Assisted Selection for Pollination-Constant and Non-Astringent Type without Male Flowers in Spanish Germplasm for Persimmon Breeding**

#### **Manuel Blasco <sup>1</sup>, Francisco Gil-Muñoz <sup>2</sup>, María del Mar Naval <sup>1</sup> and María Luisa Badenes <sup>2,</sup>\***


Received: 14 July 2020; Accepted: 7 August 2020; Published: 11 August 2020

**Abstract:** Persimmon (*Diospyros kaki* Thunb.) is a hexaploid species with a morphologically polygamous gynodioecious sexual system. *D. kaki* bears unisexual flowers, and the presence of male flowers leads to seeded fruits. Persimmon fruits are classified according to their astringency and to the pollination events that produce seeds and modify the level of astringency in the fruit. In the astringent types, namely pollination variant astringent (PVA), pollination variant non-astringent (PVNA) and pollination constant astringent (PCA), the presence of seeds makes the fruits unmarketable. Molecular markers that identify flower type at the plantlet stage would therefore allow selection of seedless varieties. In this study, DlSx-AF4, a marker developed in *D. lotus* by bulk segregant analysis (BSA) and amplified fragment length polymorphism (AFLP) markers, was validated in a persimmon germplasm collection; the results agreed with the phenotype data. A second important trait in persimmon is astringency in ripe fruits. Varieties whose fruits are non-astringent when ripe, named pollination constant non-astringent (PCNA), are the objective of many breeding programs because they do not need a postharvest treatment to remove astringency. Astringency in hexaploid persimmon is a dominant trait: the presence of at least one astringent allele confers astringency to the fruit. In this paper we also tested the marker developed for the region linked to the AST gene. Our goal was to validate both markers in germplasm from different origins and to test their usefulness in a breeding program.

**Keywords:** persimmon; sex determination; fruit astringency; molecular markers

#### **1. Introduction**

Persimmon (*Diospyros kaki* Thunb.) is a hexaploid species with a morphologically polygamous gynodioecious sexual system [1]. *D. kaki* bears unisexual flowers, as do other *Diospyros* species. There are genotypes that bear only female flowers and genotypes that bear both male and female flowers [2]. Furthermore, varieties bearing only male flowers have been described in China [3], and occasional male flower formation has been reported in varieties that usually bear only female flowers [4]. In addition to unisexual flowers, some varieties or genotypes bear hermaphrodite flowers; however, these flowers do not function fully as female flowers [5]. Most commercial varieties have only female flowers [6]; however, male flowers are important in two scenarios: first, when the production of seeded fruits is desirable, and second, in breeding activities that require crosses.

The fruits of persimmon are classified according to their astringency and the pollination events that result in different types of fruit. PCNA (pollination constant non-astringent) varieties are always non-astringent at maturity, regardless of pollination and of the presence or absence of seeds in the fruit. Male flowers are advantageous in these varieties because they allow pollination and produce seeded fruits, which are larger and heavier. In Japan, the presence of seeds in the fruit does not affect consumer demand [7]. However, in Europe and other western countries, consumers prefer seedless fruits.

Three additional variety types can be distinguished: the PVNA type (pollination variant non-astringent), which is non-astringent when seeds are present; the PVA type (pollination variant astringent), which is astringent in most of the fruit and non-astringent around the seeds, if present; and the PCA type (pollination constant astringent), which is astringent regardless of the presence of seeds. The loss of astringency in these types is associated with the ability of the seeds to produce acetaldehyde. This production browns the flesh around the seeds (Figure 1) and interferes with the postharvest treatment for removing astringency from the fruit. As a result, PVA and PCA fruits are unmarketable if they are pollinated and contain seeds [8,9]. In PVA and PCA varieties it is therefore crucial to avoid pollination, so male flowers should be avoided both in the variety itself and in the vicinity of the crop. The capacity to produce male flowers is a genetic trait that should be determined in each variety, either to avoid seeds in astringent varieties or to promote them in non-astringent varieties.

**Figure 1.** Phenotypes of the traits selected. (**A**,**B**) Flowers from the variety 'Cal Fuyu', female and male, respectively, and (**C**,**D**) results of pollination on astringent fruits from 'Rojo Brillante', a pollination variant astringent (PVA) variety: (**C**) parthenocarpic non-pollinated fruits and (**D**) pollinated fruits in which the presence of seeds resulted in unmarketable fruits.

Elucidating the genetic and molecular basis of sex expression in *D. kaki*, and developing molecular markers from it, would allow varieties to be selected according to flower type, a major contribution to persimmon production and *Diospyros* breeding. The hexaploidy of *D. kaki* makes this question harder to address than in diploid genotypes. Since the genus includes more than 700 species with different levels of polyploidy, the diploid *Diospyros lotus* was used to investigate sex expression in the genus [10]. These authors described the model of inheritance and developed molecular markers associated with sex expression. Later, a small RNA acting as a sex determinant was identified [11]. The markers were developed using bulk segregant analysis (BSA) and amplified fragment length polymorphism (AFLP). An AFLP marker identified as DlSx-AF4 was sequence-characterized and converted into a sequence characterized amplified region (SCAR) marker [10]. In this study, the marker was tested in a selection of germplasm varieties phenotyped for sex expression and in a backcross population obtained at the Instituto Valenciano de Investigaciones Agrarias (IVIA). The results provide evidence of the usefulness of molecular marker assistance in identifying the genetic potential to produce male flowers in persimmon, an important trait in breeding.

PCNA varieties are highly desired because their mature fruits are not astringent, as they stop accumulating tannins at early stages of fruit development [12]. In Japanese varieties, the PCNA trait is recessive to the non-PCNA trait [13] and is controlled by a single locus, AST [14]. Because persimmon is a hexaploid, the PCNA type should carry six recessive ast alleles [15]. In breeding programs aimed at obtaining PCNA cultivars, the hexaploidy of persimmon, together with the recessive inheritance of the non-astringency trait, led breeders to make crosses involving only PCNA genotypes. Consequently, several generations of crosses among PCNA genotypes, along with the low genetic diversity of this group, resulted in families with a high rate of inbreeding and the problems derived from it. To avoid inbreeding, programs need to use non-PCNA cultivars in the crosses, but the rate of PCNA offspring obtained can be very low, depending on the number of dominant AST alleles carried by the selected parents. In a BC1 backcross, the expected proportion of PCNA offspring from a non-PCNA F1 parent carrying one, two or three dominant AST alleles is 50, 20 or 5%, respectively, under an autohexaploid model. In this context, it is of great interest to be able to distinguish PCNA from non-PCNA types in the resulting families at the plantlet stage. The alternative is to evaluate fruit type in the field after a juvenile period of at least four years, which is extremely costly and inefficient.
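The 50, 20 and 5% figures follow from counting gametes. As a minimal sketch, assume an autohexaploid parent that forms triploid gametes by random chromosome segregation without double reduction, and a PCNA recurrent parent that contributes only recessive alleles. Under these assumptions, the share of PCNA offspring equals the probability that a gamete from the non-PCNA parent carries no AST allele, C(6 − k, 3) / C(6, 3) for a parent with k dominant alleles. The function name `pcna_fraction` is illustrative and is not taken from the cited studies.

```python
from math import comb


def pcna_fraction(n_dominant: int, ploidy: int = 6) -> float:
    """Expected fraction of PCNA offspring when a parent carrying
    `n_dominant` AST alleles is crossed with a PCNA (all-recessive) parent.

    Assumes random chromosome segregation (no double reduction): a gamete
    is a random draw of ploidy/2 alleles out of the parent's `ploidy`.
    """
    gamete_size = ploidy // 2
    n_recessive = ploidy - n_dominant
    # math.comb returns 0 when fewer recessive alleles exist than the
    # gamete needs, so parents with more than three AST alleles give 0.
    return comb(n_recessive, gamete_size) / comb(ploidy, gamete_size)


for k in (1, 2, 3, 4):
    print(f"{k} AST allele(s): {pcna_fraction(k):.0%} PCNA offspring")
```

For one, two and three AST alleles this gives 10/20, 4/20 and 1/20, matching the proportions quoted above; with four or more AST alleles no PCNA offspring are expected under this model.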

Many efforts have been made to target the region linked to AST [14,16–18]. The most promising results were obtained in a study that identified a region tightly linked to the AST gene [19]. These authors developed a highly reliable multiplex PCR method, based on primers designed from the identified region, that detects both recessive and dominant alleles. These primers have been used to test a group of varieties [15]. The region contains microsatellites that distinguish 12 different alleles among 14 non-PCNA genotypes. More than 200 accessions and several crosses between PCNA and non-PCNA genotypes were analyzed [20]. Based on the number of fragments detected per individual, these authors were able to determine the dominant (AST) and recessive (ast) alleles in the hexaploid persimmon germplasm.

Using this methodology, in this paper we applied molecular assisted selection to discriminate PCNA cultivars and seedlings from different segregated populations obtained within the IVIA breeding program. The markers for both traits were developed from Japanese varieties. Our goal is to validate them in a set of germplasm of different origins and astringency types and to apply them in the IVIA breeding program, in which varieties of Mediterranean origin play a relevant role.

#### **2. Materials and Methods**

#### *2.1. Plant Materials*

#### 2.1.1. Validation of AST and DlSx-AF4S Markers

Molecular markers developed for sex expression and type of astringency in persimmon were studied in a set of 42 accessions (Table 1) from the persimmon germplasm collection maintained at IVIA, Moncada, Spain (39.588741, −0.394848). The accessions were phenotyped regarding the presence of male flowers. The phenotype of astringency type was known from previous germplasm characterization [21,22].


**Table 1.** Plant material studied, origin, astringency type, genotype of AST marker, flower type and genotype of the DlSx-AF4S marker.

 Data from phenotyping.

 Data from genotyping.

#### 2.1.2. Marker Assisted Selection

Marker assisted selection was carried out on 12 segregated populations obtained from the backcross ('Rojo Brillante' × 'Cal Fuyu') × 'Cal Fuyu' in 2016. The backcross was made using 'Rojo Brillante', a high-quality astringent (PVA) variety, and 'Cal Fuyu', a PCNA variety with male flowers. Both parents were selected based on agronomic characteristics and adaptability to the Mediterranean environment [21].

Segregated populations screened and individuals per population are described in Table 2. All progenies and seedlings obtained were maintained in orchards at CANSO's Experimental Station, L'Alcudia, Valencia, Spain (39.189086, −0.542067).

**Table 2.** Results of genotypes analyzed by molecular markers. Number of offspring with the astringent allele (AST), with the DlSx-AF4S allele, with AST + DlSx-AF4S (not selected) and number of offspring with the absence of both markers (genotypes selected).


Numbers in parentheses are the rate (%) of corresponding offspring in each progeny.

#### *2.2. Methods*

#### 2.2.1. DNA Isolation

Young fully expanded leaves were collected from trees and kept at −20 °C until DNA isolation. DNA was isolated according to the CTAB method described in [23] with minor modifications [24].

#### 2.2.2. Molecular Markers Analysis

The capacity to produce male flowers was checked with the sequence characterized amplified region (SCAR) marker 'DlSx-AF4S' [10]. The primers used were: forward (DlSx-AF4-3F; 5′-ACA TCC AAA GTT CTG GAG AAT CA-3′) and reverse (DlSx-AF4-3R; 5′-ATT GGT GCT TGG TCA AAC ATA TC-3′).

Determination of PCNA genotypes used the primers described in [19] PCNA-F (CCCCTCAGTGGCAGTGCTGC) and 5R3R (GAAACACTCATCCGGAGACTTC).

Polymerase chain reactions (PCRs) were performed in a final volume of 20 μL containing 1× DreamTaq Buffer (Thermo Fisher Scientific, Vilnius, Lithuania), 0.1 mM of each dNTP (Promega, Madison, WI, USA), 20 ng of genomic DNA and 1 U of DreamTaq polymerase (Thermo Fisher Scientific, Vilnius, Lithuania). The PCR program consisted of pre-denaturation at 94 °C for 2 min; 35 cycles at 98 °C for 15 s, 60 °C for 20 s and 72 °C for 1 min; followed by a final extension at 72 °C for 10 min. PCR products were separated by electrophoresis on 1.5% agarose gels in 0.5× TAE buffer and visualized with GelRED® (Sigma-Aldrich, St. Louis, MO, USA).
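For planning purposes, the hold times of the program above can be summed. This is a rough sketch that uses only the durations stated in the text and ignores ramping between temperatures, so the actual run time on a given thermocycler will be longer; the variable and function names are illustrative.

```python
# Hold times from the PCR program described in the text, in seconds.
PRE_DENATURATION_S = 2 * 60           # 94 °C, 2 min
CYCLES = 35
CYCLE_STEPS_S = {
    "denaturation (98 °C)": 15,
    "annealing (60 °C)": 20,
    "extension (72 °C)": 60,
}
FINAL_EXTENSION_S = 10 * 60           # 72 °C, 10 min


def total_hold_seconds(pre_s: int, cycles: int, steps_s: dict, final_s: int) -> int:
    """Sum of programmed hold times; temperature ramping is not included."""
    return pre_s + cycles * sum(steps_s.values()) + final_s


total_s = total_hold_seconds(PRE_DENATURATION_S, CYCLES,
                             CYCLE_STEPS_S, FINAL_EXTENSION_S)
print(f"{total_s} s, about {total_s / 60:.1f} min of programmed hold time")
```

The programmed holds add up to 4045 s, a little over an hour before ramp times are counted.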

#### **3. Results and Discussion**

#### *3.1. Marker Assisted Selection Validation: Production of Male Flowers*

A set of accessions from the persimmon germplasm bank was phenotyped for the presence of male flowers and later genotyped with the marker DlSx-AF4S [10] to test the accuracy of the marker for molecular assisted selection (MAS). The genotyping results agreed with the phenotype (Table 1); no discrepancies were observed. The capacity to develop male flowers was clearly indicated by the presence of the amplified band (320 bp). Figure 2A shows, on an agarose gel, the band associated with male flowers in the genotypes 'Agakaki' and 'Cal Fuyu' and in the selection 'F-1.34' from the IVIA breeding program. Complete agreement between the phenotype and the presence/absence of the band was obtained for all the genotypes studied (Table 1). This marker is a great advantage in breeding programs involving astringent and non-astringent genotypes. The presence of male flowers in astringent varieties (PVNA, PVA and PCA) results in seeded fruits. The loss of astringency in these types of persimmon is associated with the ability of the seeds to produce acetaldehyde (Figure 3). Acetaldehyde production is a quantitative trait: low production by the seed leaves the pulp highly astringent (PCA), high production gives flesh with little or no astringency (PVNA), and PVA is intermediate. This acetaldehyde production browns the flesh around the seeds, which interferes with the postharvest treatment for removing astringency from the fruit [25]. Altogether, pollinated PVA and PCA fruits are unmarketable. For PVNA types, in which the seeds brown the flesh completely (Figure 3), there are specific markets in which these varieties are accepted. However, in most markets PVNA fruits are accepted only without seeds and after astringency has been removed by postharvest treatment.
In all breeding programs that use astringent varieties, MAS for male flowers is very important for avoiding self-pollination and mixtures of cultivars that can cross-pollinate and produce seeds. In breeding programs that involve non-astringent varieties, discriminating the presence of male flowers is necessary too. Some Japanese programs look for varieties with male flowers and seeds, which increase fruit size and fruit set, but in western countries, where seeded fruits are not acceptable, genotypes with male flowers are discarded, as in astringent varieties. Because persimmon has a four-year juvenile period, early selection for this trait is of great interest: it avoids keeping plants in the field that will later be eliminated and, additionally, prevents undesired pollination in the breeding plots.

**Figure 2.** PCR results by electrophoresis in an agarose gel (1.5%) vs. phenotype data; (**A**) DlSx-AF4S PCR results. The (+) presence and (−) absence of male flowers from phenotype data; the DlSx-AF4S marker is present in varieties 'Agakaki', 'Cal Fuyu' and F.1-34 in agreement with the phenotype; (**B**) AST PCR results; (+) astringent fruits according to phenotype data and (−) non-astringent fruits (pollination constant non-astringent (PCNA)) according to phenotype data. The AST marker was present in all astringent varieties and absent in 'Cal Fuyu', a PCNA variety.

**Figure 3.** Three types of astringent persimmon according to the amount of acetaldehyde produced by the seed; (**A**) fruits of pollination constant astringent (PCA), PVA and pollination variant non-astringent (PVNA; from right to left) and (**B**) distribution of condensed tannins, visualized by precipitation of blue ferric chloride impregnated in a paper [25,26]. Fruits of PCA, PVA and PVNA (from right to left).

#### *3.2. Marker Assisted Selection Validation: Selection of PCNA*

In Japanese cultivars, the PCNA trait is recessive to the non-PCNA trait [13] and is controlled by a single locus, AST [14]. Because persimmon is a hexaploid, the PCNA type should carry six recessive ast alleles [15]. Detection of at least one AST allele indicates that the fruit is astringent. Selection of non-astringent (PCNA) fruits is the objective of most persimmon breeding programs currently active in the world [27–32]. In this study we validated the AST marker developed in [19] for discriminating between PCNA genotypes and the different astringent types.

The AST marker was validated in a set of cultivars from the germplasm collection with known astringency type (Table 1). Complete agreement was obtained between the phenotypic astringency data and the marker genotypes of the accessions analyzed. Figure 2B shows the PCR products of a set of accessions. Two PVA cultivars ('Rojo Brillante' and 'Tone Wase'), three PVNA cultivars ('Agakaki', 'Castellani' and 'Edoichi') and the selection 'F-1.34' showed a clear band for the AST marker. The PCNA cultivar 'Cal Fuyu' showed no amplified product.

#### *3.3. Marker Assisted Selection of Both Traits in the IVIA Breeding Program*

After validation of the DlSx-AF4S and AST markers in a set of accessions phenotyped, we applied both markers in the breeding program for selecting the individuals of several segregated populations (Figure 4).

A total of 441 individuals belonging to 12 segregated populations obtained from a backcross of the type ((PVA × PCNA) × PCNA) were evaluated (Table 2). The (PVA × PCNA) cross was made using 'Rojo Brillante', a high-quality astringent variety, and 'Cal Fuyu', a PCNA variety with male flowers.

**Figure 4.** PCR results from agarose gel electrophoresis of 16 backcross (BC) individuals genotyped for the AST and DlSx-AF4S markers. (**A**) Presence of the DlSx-AF4S marker indicates the ability to develop male flowers; (**B**) absence of the AST marker means absence of any astringent allele, which corresponds to a PCNA genotype. The BC progenies are selected based on the absence of both markers, which corresponds to PCNA types without the capability to develop male flowers. Individuals 1, 4, 9 and 14 have been selected.
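The selection rule in Figure 4 (keep a seedling only when both bands are absent) can be written as a small classifier. This is an illustrative sketch: the seedling identifiers and band calls below are hypothetical and do not reproduce the gel in Figure 4 or the counts in Table 2, and the class labels simply mirror the four columns described in the Table 2 caption.

```python
from collections import Counter
from typing import Dict, Tuple


def classify(has_ast: bool, has_dlsx: bool) -> str:
    """Assign a seedling to one of four marker classes."""
    if has_ast and has_dlsx:
        return "AST + DlSx-AF4S"
    if has_ast:
        return "AST"
    if has_dlsx:
        return "DlSx-AF4S"
    return "selected"  # no AST band and no DlSx-AF4S band


# Hypothetical band calls: seedling id -> (AST band, DlSx-AF4S band).
calls: Dict[str, Tuple[bool, bool]] = {
    "BC-01": (False, False),
    "BC-02": (True, False),
    "BC-03": (False, True),
    "BC-04": (True, True),
    "BC-05": (False, False),
    "BC-06": (True, False),
}

classes = {sid: classify(*bands) for sid, bands in calls.items()}
selected = sorted(sid for sid, cls in classes.items() if cls == "selected")
summary = Counter(classes.values())

print(selected)
print(dict(summary))
```

Because both markers are scored as presence/absence, a failed PCR would look the same as a desirable genotype; in practice a positive amplification control per sample would be a sensible safeguard, although the text does not state whether one was used.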

Astringency behaves as a dominant trait and, given the hexaploidy of persimmon, the number of PCNA genotypes obtained in crosses involving astringent types depends on the number of AST alleles present in the astringent parent. The AST marker was validated in different segregating families by [33]. Different AST alleles were identified in a set of cultivars by crossing them with PCNA varieties, analyzing the resulting segregation and sequencing the genomic region [15]. If a non-PCNA parent has a single A allele (Aaaaaa) and is crossed with a PCNA individual (aaaaaa), 50% of the offspring will be astringent. In the program, the F1 seedlings that were backcrossed to the PCNA 'Cal Fuyu' yielded differing percentages of astringent genotypes. It has been demonstrated that the number of PCNA offspring obtained depends on the allelic dose of the backcrossed F1 [17]. In this study, the number of individuals per progeny was too low to study segregation ratios and infer the number of A alleles in the F1 mothers. However, taking all the tested BC1 seedlings together, the rate of astringent genotypes was around 50%, which suggests that the F1 maternal parents carry one AST allele on average.
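The dependence of the PCNA fraction on allele dose can be made concrete with a small calculation. The sketch below is not part of the original analysis: it assumes hexasomic inheritance with random chromosome segregation and no double reduction, and the function name `pcna_fraction` is hypothetical. Under those assumptions it reproduces the 50% figure quoted above for a single-dose (Aaaaaa) parent.

```python
from math import comb

PLOIDY = 6  # persimmon is hexaploid, so each gamete carries 3 of the 6 homologues


def pcna_fraction(n_ast_alleles: int) -> float:
    """Expected fraction of PCNA (aaaaaa) offspring when a non-PCNA parent
    carrying `n_ast_alleles` copies of A is crossed with a PCNA parent.

    Assumption (not taken from the study): hexasomic inheritance with random
    chromosome segregation and no double reduction.
    """
    if not 0 <= n_ast_alleles <= PLOIDY:
        raise ValueError("allele dose must be between 0 and 6")
    gamete_size = PLOIDY // 2
    # A PCNA seedling needs an all-recessive gamete from the astringent parent;
    # the PCNA parent can only contribute 'aaa' gametes.
    n_recessive = PLOIDY - n_ast_alleles
    return comb(n_recessive, gamete_size) / comb(PLOIDY, gamete_size)


for dose in range(1, 5):
    print(f"{dose} A allele(s): expected PCNA fraction = {pcna_fraction(dose):.2f}")
```

Under this model, parents with four or more A alleles cannot produce an all-recessive gamete, so no PCNA offspring are expected from them.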

Flower gender analysis revealed that around half of the genotypes analyzed (47.4%) have the capacity to produce male flowers. This proportion is as expected, since every cross requires a parent bearing male flowers. In the IVIA breeding program, the presence of male flowers is grounds for discarding a genotype regardless of fruit type. In astringent types such as PCA, PVA and PVNA, male flowers lead to pollinated fruits and the production of seeds, which brown the flesh and complicate the postharvest treatment for removing astringency. PCNA types bearing male flowers are discarded as well because consumers do not accept seeds in the fruits and because their male flowers can pollinate astringent fruits, reducing their quality and marketability.

Combining the results of both markers, AST positive (astringent fruit) and DlSx-AF4S positive (presence of male flowers), resulted in a high number of discarded genotypes. The column (AST- and DlSx-AF4S) of Table 2 indicates the genotypes that will be selected according to our breeding objectives. Only non-astringent genotypes (PCNA type) without male flowers, i.e., those lacking both markers, will be selected: a total of 118 out of 441 (26.8%). It is important to point out that the markers segregated independently. According to the published genome of *Diospyros oleifera* [34], identified as the diploid ancestor of *D. kaki*, the DNA fragments from which the markers were derived are located on different chromosomes. Therefore, the AST and DlSx-AF4S markers are expected to segregate independently.
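As a rough consistency check on independent segregation, the observed selection rate can be compared with the product of the two marginal rates reported in the text. This is only a back-of-the-envelope sketch: the astringent rate is given as "around 50%" and is treated as exactly 0.50 here, which is an assumption.

```python
# Values reported in the text
n_total = 441
n_selected = 118        # AST-negative and DlSx-AF4S-negative seedlings
p_male = 0.474          # fraction carrying the male-flower marker
p_astringent = 0.50     # "around 50%" in the text; treated as exact (assumption)

observed = n_selected / n_total
# Product rule: valid only if the two markers segregate independently
expected = (1 - p_astringent) * (1 - p_male)

print(f"observed: {observed:.3f}, expected under independence: {expected:.3f}")
```

The two values differ by less than one percentage point, which is in line with independent segregation, although a formal test would require the full two-way table of marker counts.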

The low rate of genotypes selected from the populations generated indicates the usefulness of MAS in persimmon breeding. Genotypes to be planted in the field for further agronomic selection can be chosen at the plantlet stage, which avoids keeping plants that would later be rejected in the experimental fields for 4 years. In our case, close to 75% of the genotypes obtained can be discarded at the seedling stage in the greenhouse, indicating the high effectiveness of MAS in persimmon breeding.

#### **4. Conclusions**

The markers DlSx-AF4S, linked to the production of male flowers, and AST, linked to fruit astringency, have been validated in a germplasm collection of persimmon. Although the markers were developed from Japanese cultivars, the correlation between phenotype and genotype was 100% in germplasm of different origin, which demonstrates the usefulness of the markers for selecting these important traits. Both markers have been screened in different progenies from a backcross that includes an astringent parent of non-Japanese origin. The results demonstrated that selecting for both traits combined resulted in a very low rate of selection. In breeding programs that involve astringent cultivars, MAS applied to discriminate PCNA genotypes is highly valuable.

**Author Contributions:** Conceptualization, M.B. and M.L.B.; Methodology, M.B. and F.G.-M.; Investigation, M.B. and M.d.M.N.; Writing—Original Draft, M.L.B.; Writing—Review and Editing, M.B., F.G.-M. and M.d.M.N.; Funding Acquisition, M.L.B. and M.B.; Resources, M.L.B.; Supervision, M.L.B. and M.B. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by IVIA grant 51914 and partially funded by FEDER. FGM was funded by European Social Fund and Generalitat Valenciana grant ACIF/2016/115.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Multi-Trait Regressor Stacking Increased Genomic Prediction Accuracy of Sorghum Grain Composition**

#### **Sirjan Sapkota 1,2,\*, J. Lucas Boatwright 1,2, Kathleen Jordan 1, Richard Boyles 2,3 and Stephen Kresovich 1,2**


Received: 4 July 2020; Accepted: 14 August 2020; Published: 19 August 2020

**Abstract:** Genomic prediction has enabled plant breeders to estimate breeding values of unobserved genotypes and environments. The use of genomic prediction will be extremely valuable for compositional traits for which phenotyping is labor-intensive and destructive for most accurate results. We studied the potential of Bayesian multi-output regressor stacking (BMORS) model in improving prediction performance over single trait single environment (STSE) models using a grain sorghum diversity panel (GSDP) and a biparental recombinant inbred lines (RILs) population. A total of five highly correlated grain composition traits—amylose, fat, gross energy, protein and starch, with genomic heritability ranging from 0.24 to 0.59 in the GSDP and 0.69 to 0.83 in the RILs were studied. Average prediction accuracies from the STSE model were within a range of 0.4 to 0.6 for all traits across both populations except amylose (0.25) in the GSDP. Prediction accuracy for BMORS increased by 41% and 32% on average over STSE in the GSDP and RILs, respectively. Prediction of whole environments by training with remaining environments in BMORS resulted in moderate to high prediction accuracy. Our results show regression stacking methods such as BMORS have potential to accurately predict unobserved individuals and environments, and implementation of such models can accelerate genetic gain.

**Keywords:** genomics; genomic selection; genomic prediction; marker-assisted selection; whole genome regression; grain quality; near infra-red spectroscopy; cereal crop; sorghum; multi-trait

#### **1. Introduction**

Cereal grains provide more than half of total human caloric consumption globally, and over 80% in some of the poorest nations of the world [1]. Sorghum [*Sorghum bicolor* (L.) Moench], a drought-tolerant cereal crop, is a dietary staple for over half a billion people in the semi-arid tropics, a region inhabited by some of the most food-insecure and malnourished populations [2]. In industrialized countries, such as the United States and Australia, grain sorghum is primarily grown for animal feed, but in recent years its uses have expanded to baking, malting, brewing, and biofortification [3–5]. Therefore, genetic improvement of sorghum grain composition is crucial to mitigate the global malnutrition crisis, to increase the efficiency of feed grains used in animal production, and to serve evolving niche markets for gluten-free grains.

In the last two decades, the use of genome-wide markers to predict the genetic merit of individuals has revolutionized plant and animal breeding. Genomic prediction (GP) uses statistical models to estimate marker effects in a training population with phenotypic and genotypic data; these estimates are then used to predict breeding values of individuals solely from genetic markers [6,7]. Training population size, genetic relatedness between individuals in the training and testing populations, marker density, span of linkage disequilibrium and genetic architecture of traits are some of the factors that can affect the predictive ability of the models [8–10]. Genomic prediction models are routinely studied and applied by breeding programs around the world in several crops. Novel statistical methods capable of incorporating pedigree, genomic, and environmental covariates into statistical-genetic prediction models have emerged as a result of extensive computational research [11].

One of the main advantages of GP is that breeders can use phenotypic values from some lines in some environments to make predictions for new lines and environments. Genomic best linear unbiased prediction (GBLUP), proposed by VanRaden [12], is probably the most widely used genomic prediction model in both plant and animal breeding. The GBLUP model has since been extended to include G × E interactions, resulting in improved prediction accuracy for unobserved lines in environments [13,14]. Burgueño et al. [13] found that a multi-environment GBLUP model increased prediction ability for unobserved wheat genotypes by about 20% compared to a single-environment model. In a further extension of GBLUP, Jarquín et al. [14] proposed a reaction norm model that incorporates the main and interaction effects of markers and environmental covariates through high-dimensional random variance-covariance structures. While most genomic prediction studies have focused on individual traits, breeding programs use selection indices based on several traits to make breeding decisions. To address this, expanded genomic prediction models that perform joint analysis of multiple traits have been studied using empirical and simulated data [15,16]. The improvement in prediction accuracy of a multi-trait model over a single-trait model depends on trait heritability and the correlation between the traits involved [15,17].
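To illustrate the GBLUP idea, the following sketch builds a genomic relationship matrix in the style of VanRaden's first method and uses it to predict held-out lines. It is not the code used in this study: the function names, the simulated genotypes and phenotypes, and the variance ratio `lam` (residual over genetic variance, assumed known) are all assumptions made for illustration.

```python
import numpy as np


def vanraden_g(genotypes: np.ndarray) -> np.ndarray:
    """Genomic relationship matrix from an (individuals x SNPs) array coded 0/1/2."""
    p = genotypes.mean(axis=0) / 2.0        # allele frequency of each SNP
    z = genotypes - 2.0 * p                 # centre each SNP column
    return z @ z.T / (2.0 * np.sum(p * (1.0 - p)))


def gblup_predict(g, y_train, train_idx, test_idx, lam=1.0):
    """Predict test-line values as mu + G_test,train (G_train,train + lam*I)^-1 (y - mu)."""
    g_tt = g[np.ix_(train_idx, train_idx)]
    g_st = g[np.ix_(test_idx, train_idx)]
    mu = y_train.mean()
    alpha = np.linalg.solve(g_tt + lam * np.eye(len(train_idx)), y_train - mu)
    return mu + g_st @ alpha


# Simulated toy data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(20, 50)).astype(float)
phenotypes = rng.normal(size=20)

g = vanraden_g(genotypes)
train_idx, test_idx = np.arange(15), np.arange(15, 20)
predictions = gblup_predict(g, phenotypes[train_idx], train_idx, test_idx)
print(predictions.shape)
```

In practice the variance ratio would be estimated from the data (for example by REML or within a Bayesian model) rather than fixed in advance.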

Data generated in breeding programs span multiple environments and are recorded for multiple traits for each individual. While multi-environment and multi-trait models had been implemented separately, a single model accounting for the complexity of the variance-covariance structure in a combined multi-trait multi-environment (MTME) setting was lacking until Montesinos-López et al. [18] developed a Bayesian whole genome prediction model to analyze multiple traits and multiple environments simultaneously. Montesinos-López et al. [18] also developed a computationally efficient Markov Chain Monte Carlo (MCMC) method that produces a full conditional distribution of the parameters leading to an exact Gibbs sampling for the posterior distribution. Another MTME model that employs a completely different method was proposed by Montesinos-López et al. [19]. This method, called Bayesian multi-output regression stacking (BMORS), is a Bayesian version of multi-target regressor stacking (MTRS) originally proposed by Spyromitros-Xioufis et al. [20,21]. It consists of training in two stages: first, multiple learning algorithms are trained on the same dataset; then their predictions are combined to obtain the final predictions.
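The two-stage structure of regressor stacking can be sketched as follows. This is a deliberately simplified stand-in, not the BMORS implementation: plain ridge regression replaces the Bayesian models, the stage-2 models are fitted on in-sample stage-1 predictions, and all names and simulated data are hypothetical.

```python
import numpy as np


def ridge_fit(x, y, lam=1.0):
    """Closed-form ridge coefficients (no intercept; inputs assumed centred)."""
    return np.linalg.solve(x.T @ x + lam * np.eye(x.shape[1]), x.T @ y)


def stacked_predict(x_train, y_train, x_test, lam=1.0):
    """Two-stage multi-target regressor stacking.

    x_*: (lines x markers) arrays; y_train: (lines x traits) array.
    """
    x_mean, y_mean = x_train.mean(axis=0), y_train.mean(axis=0)
    xc_train, xc_test = x_train - x_mean, x_test - x_mean
    yc_train = y_train - y_mean

    # Stage 1: one marker-based model per trait
    beta1 = ridge_fit(xc_train, yc_train, lam)      # (markers x traits)
    z_train = xc_train @ beta1                      # stage-1 predictions, training lines
    z_test = xc_test @ beta1                        # stage-1 predictions, test lines

    # Stage 2: each trait is regressed on the stage-1 predictions of ALL traits,
    # which lets correlated traits inform one another
    beta2 = ridge_fit(z_train, yc_train, lam)       # (traits x traits)
    return y_mean + z_test @ beta2


# Simulated toy data (hypothetical): 40 lines, 100 markers, 3 traits
rng = np.random.default_rng(1)
x = rng.integers(0, 3, size=(40, 100)).astype(float)
y = x @ rng.normal(size=(100, 3)) + rng.normal(size=(40, 3))

pred = stacked_predict(x[:30], y[:30], x[30:], lam=10.0)
print(pred.shape)
```

Fitting stage 2 on in-sample predictions keeps the sketch short; using out-of-fold stage-1 predictions instead would reduce the risk of the second stage over-trusting the first.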

Genomic prediction for grain quality traits has previously been reported in crops such as wheat [22–24], rye [25], maize [26], and soybean [27]. Hayes et al. [28] and Battenfield et al. [23] used near-infrared derived phenotypes in genomic prediction of protein content and end-use quality in wheat. Multi-trait genomic prediction models can simultaneously improve grain yield and protein content even though these traits are negatively correlated [24,29]. In sorghum, grain macronutrients have been shown to be inter-correlated [30], which suggests that multi-trait models may increase the predictive ability for individual grain quality traits. The ability to assess the genetic merit of unobserved selection candidates across environments is promising for reducing evaluation cost and generation interval in the sorghum breeding pipeline, where parental lines of commercial hybrids are currently selected on the basis of extensive progeny testing [31]. In order to extend capacities to performance index selection for multiple environments, we need to study and effectively implement MTME genomic prediction models in our breeding programs. In this study, we report the first implementation of genomic prediction for grain composition in sorghum. The objective was to assess the potential for improvement in prediction accuracy of a multi-trait regressor stacking model over a single-trait model for five grain composition traits: amylose, fat, gross energy, protein and starch.

#### **2. Results**

#### *2.1. Phenotypic Variation*

A single calibration curve for near infra-red spectroscopy (NIRS) was used for the two populations studied. Table 1 outlines the summary statistics of NIRS predictions and the phenotypic distribution and heritability of the grain composition traits. The cross validation accuracy (*R*<sup>2</sup>) of the NIRS calibration curve was moderately high to high, except for fat, which had a moderate *R*<sup>2</sup> value (0.41). We had a total of three environments (three years in one location) for the GSDP and four environments (two years in two locations) for the RILs. Traits were normally distributed except amylose in the two 2014 environments in the RILs, which had a bimodal distribution (Figures S1 and S2). All traits showed significant variation in distribution across the environments, except starch in the GSDP.

**Table 1. Summary statistics of near infra-red spectroscopy (NIRS) calibration and phenotypic distribution.** *R*<sup>2</sup> is the prediction accuracy and SECV is the standard error of cross validation for the NIRS calibration curve. Mean represents the phenotypic mean of the trait with its standard deviation (SD). *h*<sup>2</sup> is the estimate of genomic heritability.


The genomic heritabilities of all traits except gross energy were significantly higher (*p* < 0.05) in the RILs than in the GSDP (Table 1). Trait heritabilities were high in the RILs, with protein and gross energy having the highest and lowest heritabilities, respectively. In the GSDP, genomic heritability was moderately high for fat and gross energy, moderate for protein and starch, and low for amylose. The poor genomic heritability (0.24) of amylose in the GSDP was expected because only a very small proportion (1%) of accessions have low amylose as a result of the *waxy* gene (a Mendelian trait).

Figure 1 shows the correlation between the adjusted phenotypic means for each trait and environment combination. Starch was negatively correlated (*p* < 0.001) with all other traits in both populations except for amylose in the RILs. Fat, protein and gross energy were significantly positively (*p* < 0.001) correlated with each other across environments in both populations. The strongest positive correlation was between gross energy and fat, whereas the strongest negative correlations were found between starch and gross energy and between starch and protein. Moderate (0.4) to high (0.73) positive correlation was observed between years for all traits except for amylose (r = 0.08) between 2014 and 2017 in the GSDP (Figure 1). We conducted a principal component analysis (PCA) of the correlation matrix for the traits in each environment. In both populations, the first component separated amylose and starch from the other three traits, whereas the second component separated amylose from starch and gross energy from protein and fat (Figure S3). The first component explained 78.8% and 75.9% of the variation, and the second component explained 6.3% and 9.8% of the variation, in the GSDP and RILs, respectively. The third principal component in the RILs separated protein from fat and explained about 7.6% of the variation.

**Figure 1. Correlation between traits across year and location combination for the two populations.** Ams: amylose, GE: gross energy, Prt: protein, Sta: starch, SC: South Carolina, TX: Texas, and numbers in x and y-axes represent the year.

#### *2.2. Prediction Performance in Single and Multiple Environment*

We first implemented a GBLUP prediction model for single-trait single-environment (STSE) analysis. Prediction accuracies of the STSE model varied across environments in both populations (Figure 2). The environments 2014 in the GSDP and TX2014 in the RILs had the highest average prediction accuracy but were not always the best predicted environment for all traits. While poorly predicted for amylose, the environments 2017 in the GSDP and TX2015 in the RILs had higher prediction accuracy for starch compared to all or most environments. Despite variation across environments and populations, the average prediction accuracies from the STSE model were within the range of 0.4 to 0.6 for all traits except amylose (0.25) in the GSDP. The average prediction accuracy of the STSE model in the GSDP was positively correlated (r = 0.86) with the genomic heritability of the traits. In the RILs, there was a positive correlation (r = 0.77) between average prediction accuracy and genomic heritability for amylose, fat and gross energy, but the traits with the highest heritabilities (protein and starch) had relatively lower average prediction accuracies.

**Figure 2. Prediction accuracy for single-trait single-environment (STSE) model.** The y-axis shows prediction accuracy calculated as Pearson's correlation between observed values and predicted values of phenotypes. Legend represents the environment/years. SC: South Carolina, TX: Texas, GSDP: Grain sorghum diversity panel, RILs: recombinant inbred lines. Pale blue dots represent the mean of prediction accuracy.

We did not see substantial improvement in the Bayesian multi-environment (BME) model over the STSE prediction accuracies (Figure S4). In the GSDP, the multi-environment models resulted in a decline in average prediction accuracy compared to the STSE model for fat (21%), amylose (10%) and protein (4%); no significant change was observed for gross energy and starch (Table 2, Figure S5). The prediction accuracy in the RILs increased by an average of 3% in the BME compared to the STSE, but the overall trend of prediction accuracy for traits and environments remained unchanged (Table 3, Figure S5). The environment SC2014 showed a consistent increase in accuracy for the BME over the STSE model across all traits, with about a 10% increase for protein (Table 3). Amylose in the TX2015 environment had the single greatest increase (12%) in average prediction accuracy in the BME among all trait-environment combinations for the RILs (Table 3).


**Table 2. Percent change in mean prediction accuracy (r) over the single trait single environment (STSE) model in the diversity panel (GSDP).** BME: Bayesian multi-environment, and BMORS: Bayesian multi-output regressor stacking. Values were rounded to the nearest whole number.

**Table 3. Percent change in mean prediction accuracy (r) over the single trait single environment (STSE) model in the recombinant inbred lines (RILs).** BME: Bayesian multi-environment, BMORS: Bayesian multi-output regressor stacking, SC: South Carolina, and TX: Texas. Values were rounded to the nearest whole number.


#### *2.3. Bayesian Multi-Output Regression Stacking*

We tested two different prediction schemes in the BMORS prediction model using the two functions *BMORS()* and *BMORS\_Env()* as described in Montesinos-López et al. [19]. The *BMORS()* function was used for a five-fold CV as described in the methods section, whereas *BMORS\_Env()* was used to assess the prediction performance for whole environments while using the remaining environments for training. In *BMORS\_Env*, one environment was left out during training, and the correlation between the predicted values (obtained from training with the remaining environments) and the observed values for the test environment was taken as the prediction accuracy for that environment.
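The leave-one-environment-out scheme and the accuracy measure can be sketched as below. This is an illustration of the evaluation logic only, not of the *BMORS\_Env()* function itself; the helper names are hypothetical, and the environment labels are those of the RILs as they appear in the text.

```python
from math import sqrt


def pearson_r(observed, predicted):
    """Pearson correlation, used here as the measure of prediction accuracy."""
    n = len(observed)
    mean_o, mean_p = sum(observed) / n, sum(predicted) / n
    s_op = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    s_oo = sum((o - mean_o) ** 2 for o in observed)
    s_pp = sum((p - mean_p) ** 2 for p in predicted)
    return s_op / sqrt(s_oo * s_pp)


def leave_one_environment_out(environments):
    """Yield (test_env, training_envs) pairs, holding out one environment at a time."""
    for test_env in environments:
        yield test_env, [env for env in environments if env != test_env]


rils_envs = ["SC2014", "SC2015", "TX2014", "TX2015"]
splits = list(leave_one_environment_out(rils_envs))
for test_env, train_envs in splits:
    print(f"test: {test_env}  train: {train_envs}")
```

For each split, a model would be trained on the records from `train_envs`, and `pearson_r` would be applied to the observed and predicted values of the held-out environment.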

#### 2.3.1. Five-Fold CV

The prediction accuracy from five-fold CV in BMORS increased by 41% and 32% on average over the STSE model in GSDP and RILs, respectively. Figure 3 shows the prediction accuracy of BMORS for each trait and environment combination across the two populations. While the percent change in accuracy varied across environments, the BMORS model nonetheless had higher average prediction accuracy than the STSE and BME models for all traits (Figure S4). The increase in average accuracy in BMORS over STSE ranged from 11% (amylose, 2014) to 66% (amylose, 2013) in the GSDP with exception of amylose in 2017 (13% decrease), and 15% (fat, SC2015) to 60% (protein, TX2014) in the RILs (Tables 2 and 3). The increase in average prediction accuracy was higher (35%) for both locations in 2014 for the RILs, whereas, the year 2013 in the GSDP increased the most. Among the traits, protein (54%) had the greatest average increase in prediction accuracy in the GSDP, whereas in the RILs, protein and starch (42%) both showed the greatest increase.
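For readers who want to reproduce this kind of comparison, the percent change reported here can be read as a relative change in mean accuracy with respect to the STSE baseline. The snippet below is a sketch under that interpretation; the accuracy values in it are hypothetical and are not taken from Tables 2 and 3.

```python
def percent_change(r_model: float, r_baseline: float) -> float:
    """Relative change (in %) of a model's mean accuracy over a baseline."""
    return 100.0 * (r_model - r_baseline) / r_baseline


# Hypothetical mean accuracies, for illustration only
r_stse = 0.50
r_bmors = 0.70
r_bme = 0.45

# Rounded to the nearest whole number, as in the tables
print(round(percent_change(r_bmors, r_stse)))   # gain of BMORS over STSE
print(round(percent_change(r_bme, r_stse)))     # negative value = decline
```

Because the change is relative, the same absolute gain in correlation yields a larger percentage when the baseline accuracy is low.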

#### 2.3.2. Prediction of Whole Environment

Predicting a whole environment using the BMORS model usually yielded higher accuracy than the mean prediction accuracy from the STSE or BME model, where only portions of each environment were tested rather than the whole environment as in the *BMORS\_Env* model (Figures 2 and 4, Figure S5, Table 4). This shows that the BMORS model can be used to predict an unobserved environment with accuracy at least comparable to that of STSE or BME models trained within the environments. The distribution of prediction accuracy across trait and environment combinations was, however, similar to the results from the STSE model. In the GSDP, little variation in prediction accuracy was observed across environments for gross energy, starch and protein, whereas amylose and fat showed greater variability between environments. In the RILs, prediction accuracy for all traits except protein had high variability across the environments (Table 4).

In order to assess predictability by location or year in the RILs, we tested one location or year by training the BMORS model using the other location or year, respectively (Table 4). The Texas location had higher prediction accuracy for fat (+0.11) and gross energy (+0.1) than South Carolina, but the rest of the traits had similar prediction accuracy (difference < 0.02). Prediction accuracy of whole years varied across traits: amylose (+0.09) and fat (+0.04) were higher in 2014, protein was higher (+0.05) in 2015, and starch and gross energy were similar.

**Figure 4. Prediction accuracy of whole environment predicted using the Bayesian multi-output regressor stacking (BMORS\_Env) in the diversity panel (GSDP).** The y-axis shows prediction accuracy calculated as Pearson's correlation between observed values and predicted values of phenotypes. Values on top of the bar represent the height of the bar.

**Table 4. Prediction accuracy of the test environments predicted using the Bayesian multi-output regressor stacking (BMORS\_Env) in the recombinant inbred lines (RILs).** SC: South Carolina, TX, Texas. Prediction accuracy was calculated as Pearson's correlation between observed values and predicted values of phenotypes.


#### **3. Discussion**

Phenotyping for grain compositional traits is (1) challenging and labor-intensive, (2) destructive for the most accurate results, and (3) only possible after plants reach physiological maturity and are harvested. The use of genomic prediction for compositional traits will be extremely valuable because it increases selection intensity and decreases generation interval by overcoming these phenotyping challenges. Moreover, these traits are complex and quantitatively inherited, so they will benefit from the ability of genomic prediction to account for many small-effect QTLs in estimating breeding values.

#### *3.1. Trait Architecture and Prediction Accuracy*

While the accuracy of NIRS calibration for traits in this study ranged from moderate to high, there was prediction error associated with NIRS prediction. However, it is unclear whether, and to what extent, NIRS prediction error affected genomic prediction. No direct correlation was observed between the genomic prediction accuracy and NIRS statistics for the traits studied. The trait with the lowest NIRS *R*<sup>2</sup>, fat, was predicted as well as or better than starch, protein and gross energy, which had NIRS *R*<sup>2</sup> > 0.7. Despite the varying strength of correlations between traits across the two populations studied, the nature of the relationship was similar for a given pair of traits, which is also in agreement with previous studies [30,32,33]. The strong negative relationship of starch and amylose to protein, fat and gross energy was further elucidated by the PCA of the phenotypic correlation matrix (Figure S3). Since starch, protein and fat were measured on a percent dry matter basis, the strong correlation between them is expected.

Genetic relatedness and trait architecture are known to affect the accuracy of genomic prediction [8,34]. The genetic relatedness between individuals and the heritability of the traits were higher in the RILs than in the GSDP (Figure S6, Table 1). These factors could be contributing to the higher average prediction accuracy in the RILs. However, the average prediction accuracies for gross energy and starch were comparable between the GSDP and RILs (Figure S4). Prediction accuracy in the GSDP could have been boosted by greater genetic diversity despite lower genetic relatedness [35]. Heffner et al. [22] observed a prediction accuracy of 0.5–0.6 for wheat flour protein in two biparental populations. Guo et al. [26] reported prediction accuracies of 0.44 and 0.8 for protein and amylose, respectively, in a rice diversity panel. Similar results were observed in our STSE models for protein content (Figure 2), whereas the lower prediction accuracy of amylose in our diversity panel is probably due to the lack of sufficient low-amylose lines with the *waxy* gene [30]. While genomic prediction for starch, fat and gross energy has not been reported in sorghum, these are among the most nutritionally important traits of any cereal grain. The moderate to high prediction accuracy observed suggests that implementation of genomic selection can improve genetic gain for these grain quality traits.

#### *3.2. Multi-Trait Regressor Stacking*

One of the daunting tasks of genomic prediction is estimating the effects of unobserved individuals and environments. As multiple traits are analyzed across several environments, the ability to combine information from multiple traits and environments can be crucial in increasing the accuracy of prediction [13,15,16]. When the correlations among traits are high, prediction accuracies of complex traits can be increased by using a multivariate model that takes this correlation into account [15,18]. We fit a Bayesian multi-environment (BME) model (2) that takes the genotype × environment effects into consideration. In the GSDP, where the environments were three years at the same location, the BME model showed a slight decline (7%) in average prediction accuracy, which was mostly due to two traits, amylose and fat (Table 2). The RILs showed a slight increase (2–3%) in prediction accuracy of traits when averaged over the environments, but there was variability across the environments (Table 3).

We implemented two functions [*BMORS*() and *BMORS\_Env*()] which are not only used to evaluate prediction accuracy but are also computationally efficient [19]. The BMORS model (3) performs two-stage training by stacking the multi-environment models from all the traits. The five-fold cross validation conducted for BMORS was similar to the CV1 strategy of Montesinos-López et al. [18]. The use of multi-trait models has been consistently shown to increase prediction accuracy over single-trait models across different crops and traits [15–17,36]. The multi-target regressor stacking increased average prediction accuracy by 41% and 32% in the GSDP and RILs, respectively, as compared to the STSE prediction accuracy. Average prediction accuracy of all traits improved in BMORS over STSE and BME across both populations (Figure S4). The consistent improvement in accuracy of BMORS is a result of its ability to use not only the correlation between traits but also the correlation between environments in model training [18,19]. The ability to accurately estimate the genetic merit of lines in unobserved environments is of tremendous value in plant breeding. Our results show the potential of the *BMORS\_Env()* function for predicting a whole environment. Testing a whole environment by training the BMORS model on all other environments resulted in higher prediction accuracy for that trait-environment combination than using the STSE or BME model. Prediction accuracy was 0.5 or higher in all environments, with the exception of amylose in the GSDP, the reason for which we have discussed above (Figure 4, Table 4).

#### *3.3. Application for Breeding*

Grain quality traits such as starch and protein content have been under selection since the inception of phenotypic selection in modern breeding practices. More recently, the total energy content of grain has gained attention for increasing feed efficiency in animal production, and a need exists for increasing total calories for human nutrition in the wake of the global malnutrition crisis. Despite high correlations among these traits, the genetic variation underlying starch, protein and fat can be decoupled. Boyles et al. [30] showed that major and minor effect QTLs underlying the three traits are distributed across the genome and are segregating in biparental populations. However, in practice, selection would be conducted simultaneously for these traits using a selection index rather than for individual traits. Velazco et al. [31] observed an increase in predictive ability by using a multi-trait model for grain yield and stay green in sorghum, and argued that such an exercise would allow a selection index to be used when implementing genomic selection for correlated traits. Increased prediction accuracy, improved selection index, and precise estimation of genetic, environmental and residual covariances make multi-trait multi-environment models preferable to univariate models [18]. The multi-trait regression stacking model we tested shows large-scale improvement in model prediction and can be used in tandem with the Bayesian multi-trait multi-environment (BMTME) model for parameter estimation and assessing prediction accuracy. The ability to estimate genetic effects and breeding values in unobserved environments will be of great advantage for predicting performance in diverse environments and for implementation of selection theory.

#### **4. Materials and Methods**

#### *4.1. Plant Material*

#### 4.1.1. Grain Sorghum Diversity Panel

A grain sorghum diversity panel (GSDP) of 389 diverse sorghum accessions was planted in a randomized complete block design with two replications in the 2013, 2014, and 2017 field seasons at the Clemson University Pee Dee Research and Education Center in Florence, SC. The GSDP included a total of 332 accessions from the original United States sorghum association panel (SAP) developed by Casa et al. [37]. The details on experimental field design and agronomic practices are described in Boyles et al. [38] and Sapkota et al. [35]. Briefly, the experiments were planted in two-row plots, each 6.1 m long, separated by a row spacing of 0.762 m, with an approximate planting density of 130,000 plants ha<sup>−1</sup>. Fields were irrigated only when signs of drought stress were seen across the field.

#### 4.1.2. Recombinant Inbred Population

A biparental population of 191 recombinant inbred lines (RILs) segregating for grain quality traits was studied along with the GSDP. The parents of the RIL population were BTx642, a yellow-pericarp, drought-tolerant line, and BTxARG-1, a white-pericarp, waxy-endosperm (low amylose) line. The population was planted in two replicated plots in a randomized complete block design across two years (2014 and 2015) in Blackville, SC and College Station, TX. Field design and agronomic practices have previously been described in detail in Boyles et al. [30].

#### *4.2. Phenotyping*

The primary panicles of three plants selected from each plot were harvested at physiological maturity. The plants at the beginning and end of the row were excluded to account for border effects. Panicles were air dried to a constant moisture (10–12%) and threshed. A 25 g subsample of cleaned and homogenized grain, ground to 1-mm particle size with a CT 193 Cyclotec Sample Mill (FOSS North America), was used in near-infrared spectroscopy (NIRS) for compositional analysis.

Grain composition traits such as total fat, gross energy, crude protein, and starch content can be measured using NIRS. Previous studies have shown high NIRS predictability of the traits used in feed analysis [39,40]. We used a DA 7250™ NIR analyzer (Perten Instruments). The ground sample was packed in a gradually rotating Teflon dish positioned under the instrument's light source, and predicted phenotypic values were generated based on a calibration curve for the spectral measurements. The calibration curve was built using wet chemistry values from a subset of samples. The wet chemistry was performed by Dairyland Laboratories, Inc. (Arcadia, WI, USA) and the Quality Assurance Laboratory at Murphy-Brown, LLC (Warsaw, NC, USA). The details of the prediction curves and wet chemistry can be found in Boyles et al. [30].

#### *4.3. Genotypic Data*

Genotyping-by-sequencing (GBS) was used for genetic characterization of the GSDP and RIL populations [30,38,41]. Sequenced reads were aligned to the BTx623 v3.1 reference assembly (Phytozome) using the Burrows-Wheeler aligner [42]. SNP calling, imputation, and filtering were done using the TASSEL 5.0 pipeline [43]. The TASSEL plugins FILLIN (for the GSDP) and FSFHap (for the RIL population) were used to impute missing genotypes. Following imputation, SNPs with minor allele frequency (MAF) < 0.01, and sites missing in more than 10% and 30% of the genotypes in the GSDP and RILs, respectively, were filtered out. The number of genotypes studied for each population represents those with at least 70% of SNP sites. The genotype matrices with 224,007 SNPs from the GSDP and 56,142 SNPs from the RIL population were used for whole genome regression.

#### *4.4. Statistical Analysis*

The statistical software environment 'R' was used for model building and analysis [44]. The phenotypic values of the traits were adjusted for random effects of replications within environment using the 'lme4' package in R [45]. Principal component analysis was done using the R package 'factoextra' [46]. Marker-based estimates of narrow sense (genomic) heritabilities were calculated from the SNP genotype matrix and phenotypic values using the R package 'heritability' [47]. A matrix with dummy variables '1' and '0' representing combinations of environmental variables (replication and year for the GSDP, and replication, year, and location for the RILs) was used as a covariate in heritability estimation.

#### 4.4.1. Single-Trait Single-Environment (STSE) Model:

The following genomic best linear unbiased prediction (GBLUP) model was used to assess prediction performance of an individual trait from a single environment:

$$y_j = \mu + g_j + e_j, \tag{1}$$

where $y_j$ is the adjusted phenotypic mean of the $j$th line ($j = 1, 2, \ldots, J$), $\mu$ is the overall mean, which is assigned a flat prior, and $g_j$ is the random genomic effect of the $j$th line, with $g = (g_1, \ldots, g_J)^T \sim N(0, G\sigma_1^2)$, where $\sigma_1^2$ is the genomic variance. $G$ is the genomic relationship matrix of order $J \times J$, calculated [12] as $G = \frac{ZZ^T}{2\sum p_j q_j}$, where $q_j$ and $p_j$ denote the major and minor allele frequencies, respectively, of the $j$th marker (in this sum, $j$ indexes markers rather than lines), and $Z$ is the design matrix for markers of order $J \times p$ ($p$ is the total number of markers). Further, $e_j$ is the residual error, with $e \sim N(0, I\sigma_e^2)$, where $I$ is the identity matrix and $\sigma_e^2$ is the residual variance with a scaled-inverse chi-square density.
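The genomic relationship matrix in Equation (1) can be sketched in a few lines of Python. This is only an illustration and not the code used in the study (which relied on R packages); the 0/1/2 genotype coding, the centring of each marker column by twice its allele frequency (as in VanRaden's formulation), and the function name `genomic_relationship_matrix` are assumptions made for the example.

```python
import numpy as np

def genomic_relationship_matrix(M):
    """Compute G = Z Z^T / (2 * sum(p * q)).

    M is a (lines x markers) genotype matrix coded 0/1/2
    (assumed coding: number of copies of one allele).
    """
    M = np.asarray(M, dtype=float)
    p = M.mean(axis=0) / 2.0      # allele frequency at each marker
    q = 1.0 - p
    Z = M - 2.0 * p               # assumption: centre each marker column
    return Z @ Z.T / (2.0 * np.sum(p * q))

# Toy genotype matrix: 3 lines x 4 markers (made-up values)
M = np.array([[0, 1, 2, 0],
              [1, 1, 0, 2],
              [2, 0, 1, 1]])
G = genomic_relationship_matrix(M)
print(G.shape)
```

Because the marker columns are centred, each row of `G` sums to zero in this sketch; real pipelines usually also handle missing genotypes and monomorphic markers before this step.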

#### 4.4.2. Bayesian Multi-Environment (BME) GBLUP Model

Because genotype × environment interaction can contribute a substantial amount of the phenotypic variance in complex traits, we fit the following univariate linear mixed model to account for environmental effects in prediction performance:

$$y_{ij} = E_i + g_j + gE_{ij} + e_{ij}, \tag{2}$$

where $y_{ij}$ is the adjusted phenotypic mean of the $j$th line in the $i$th environment ($i = 1, 2, \ldots, I$; $j = 1, 2, \ldots, J$). $E_i$ represents the effect of the $i$th environment and $g_j$ represents the genomic effect of the $j$th line as described in Equation (1). The term $gE_{ij}$ represents the random interaction between the genomic effect of the $j$th line and the $i$th environment, with $gE = (gE_{11}, \ldots, gE_{IJ})^T \sim N(0, \sigma_2^2 I_I \otimes G)$, where $\sigma_2^2$ is the interaction variance, and $e_{ij}$ is a random residual associated with the $j$th line in the $i$th environment, distributed as $N(0, \sigma_e^2)$, where $\sigma_e^2$ is the residual variance.
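To make the Kronecker structure of the interaction term concrete, the following Python sketch builds the covariance matrix $\sigma_2^2 I_I \otimes G$ for a toy case. The values of `G`, `sigma2_gE`, and the number of environments are made up for illustration; the ordering (lines nested within environments) follows the indexing of $gE_{ij}$ in Equation (2).

```python
import numpy as np

n_env = 2                                  # hypothetical number of environments (I)
G = np.array([[1.0, 0.5],
              [0.5, 1.0]])                 # toy genomic relationship matrix (J = 2)
sigma2_gE = 0.3                            # made-up interaction variance

# Covariance of gE = (gE_11, gE_12, gE_21, gE_22)^T
cov_gE = sigma2_gE * np.kron(np.eye(n_env), G)
print(cov_gE)
```

The result is block diagonal: interaction effects of related lines covary within an environment, while effects in different environments are independent under this model.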

#### 4.4.3. Bayesian Multi-Output Regressor Stacking (BMORS)

BMORS is the Bayesian version of the multi-trait (or multi-target) regressor stacking method [48]. Multi-target regressor stacking (MTRS) was proposed by Spyromitros-Xioufis et al. [20,21] based on the multi-label classification approach of Godbole and Sarawagi [49]. In BMORS or MTRS, training is done in two stages. First, *L* univariate models are implemented using the multi-environment GBLUP model given in Equation (2); then, instead of using these models for prediction, MTRS performs a second stage of training using a second set of *L* meta-models, one for each of the *L* traits. The following model is used to implement each meta-model:

$$y_{ij} = \beta_1 \hat{Z}_{1ij} + \beta_2 \hat{Z}_{2ij} + \dots + \beta_L \hat{Z}_{Lij} + e_{ij}, \tag{3}$$

where the covariates $\hat{Z}_{1ij}, \hat{Z}_{2ij}, \ldots, \hat{Z}_{Lij}$ represent the scaled predictions from the first-stage training with the GBLUP model for the $L$ traits, and $\beta_1, \ldots, \beta_L$ are the regression coefficients for each covariate in the model. Each prediction was scaled by subtracting its mean ($\mu_{lij}$) and dividing by its corresponding standard deviation ($\sigma_{lij}$), that is, $\hat{Z}_{lij} = (\hat{y}_{lij} - \mu_{lij})\sigma_{lij}^{-1}$, for each $l = 1, \ldots, L$. The scaled predictions of the response variables yielded by the first-stage models are thus used as predictor information by the BMORS model. Simply put, the multi-trait regression stacking model is based on the idea that a second-stage model is able to correct the predictions of a first-stage model using information about the predictions of other first-stage models [20,21].
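The two-stage idea can be illustrated with a small Python sketch. It is not the BMORS implementation from the 'BMTME' package: ordinary least squares is swapped in for the Bayesian second-stage regression, and the first-stage predictions are simulated random numbers rather than GBLUP output. The function names and all numeric values are hypothetical.

```python
import numpy as np

def scale_predictions(y_hat):
    """Standardise each trait's first-stage predictions: (y_hat - mean) / sd."""
    y_hat = np.asarray(y_hat, dtype=float)
    return (y_hat - y_hat.mean(axis=0)) / y_hat.std(axis=0, ddof=1)

def fit_meta_model(Z_hat, y):
    """Second-stage regression of one trait on all scaled predictions.

    Assumption: plain least squares (no intercept, mirroring Equation (3))
    stands in for the Bayesian regression used by BMORS.
    """
    beta, *_ = np.linalg.lstsq(Z_hat, y, rcond=None)
    return beta

rng = np.random.default_rng(0)
first_stage = rng.normal(size=(50, 3))   # simulated predictions for L = 3 traits
Z_hat = scale_predictions(first_stage)

true_beta = np.array([0.8, 0.1, -0.2])   # made-up coefficients
y = Z_hat @ true_beta                    # noise-free response for a clean check
beta = fit_meta_model(Z_hat, y)
print(beta)
```

In this toy case the target trait borrows a little information from the other two traits through the coefficients on their scaled predictions, which is the mechanism the stacking approach relies on.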

#### 4.4.4. Performance of Prediction Model:

All prediction models were fit using a Bayesian approach in the statistical program 'R'. The STSE model (1) was fit using the R package 'BGLR' [50]; the BME model (2) and BMORS model (3) were fit using the R package 'BMTME' [19]. A minimum of 20,000 iterations with 10,000 burn-in steps was used for each Bayesian run.

The prediction performance of the models was evaluated using five-fold cross validation (CV), which means that 80% of the samples were used as the training set and testing was done on the remaining 20% for each cross-validation fold. The individuals were randomly assigned to five mutually exclusive folds. Four folds were used to train the prediction models and to predict the genomic estimated breeding values (GEBVs) of the individuals in the fifth fold (validation/test set). The accuracy of prediction for each fold was calculated as Pearson's correlation coefficient (r) between the predicted values and the adjusted phenotypic means of the individuals in the validation set. Each cross-validation run, therefore, resulted in five estimates of prediction accuracy. The same set of individuals was assigned to training and validation across the different traits and models tested by using the *set.seed()* function in R. To avoid bias due to sampling, we performed 10 different cross-validation runs to calculate the mean and dispersion of the prediction accuracies.
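The cross-validation scheme described above can be sketched as follows. This Python example is illustrative only: ridge regression on simulated markers is a placeholder for the Bayesian models actually used, the fixed `seed` argument plays the role of `set.seed()` in R, and all function names and simulated data are assumptions.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation between predicted and observed values."""
    return float(np.corrcoef(a, b)[0, 1])

def repeated_kfold_accuracy(X, y, fit_predict, k=5, n_repeats=10, seed=1):
    """Return k * n_repeats fold-level prediction accuracies."""
    rng = np.random.default_rng(seed)     # fixed seed: same folds for every model
    n = len(y)
    accuracies = []
    for _ in range(n_repeats):
        folds = np.array_split(rng.permutation(n), k)   # mutually exclusive folds
        for test_idx in folds:
            train_idx = np.setdiff1d(np.arange(n), test_idx)
            y_pred = fit_predict(X[train_idx], y[train_idx], X[test_idx])
            accuracies.append(pearson_r(y_pred, y[test_idx]))
    return np.array(accuracies)

def ridge_fit_predict(X_train, y_train, X_test, lam=1.0):
    """Placeholder predictor (ridge regression), not the GBLUP/BMORS models."""
    mu = y_train.mean()
    A = X_train.T @ X_train + lam * np.eye(X_train.shape[1])
    beta = np.linalg.solve(A, X_train.T @ (y_train - mu))
    return mu + X_test @ beta

# Simulated data: 100 individuals, 20 markers (made-up values)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 20))
y = X @ rng.normal(size=20) + rng.normal(scale=0.5, size=100)

acc = repeated_kfold_accuracy(X, y, ridge_fit_predict)
print(len(acc), acc.mean(), acc.std())
```

Passing the same `seed` when evaluating different traits or models keeps the fold assignments identical, so differences in accuracy reflect the models rather than the sampling.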

#### **5. Conclusions**

Phenotyping of grain compositional traits using near-infrared spectroscopy is labor-intensive, generally destructive, and time-limiting. Therefore, the use of genomic selection for these traits will be extremely valuable. This study establishes the potential to improve genomics-assisted selection of grain composition traits by using a multi-trait multi-environment model. The phenotypic measurements obtained from NIRS prediction were amenable to genomic selection, as shown by moderate to high prediction accuracy for single-trait prediction. While the multi-environment model alone did not lead to much improvement over the single-environment model, stacking of regressions from multiple traits showed substantial improvement in prediction accuracy. The prediction accuracy increased by 32% and 41% in the RILs and GSDP, respectively, when using the Bayesian multi-output regressor stacking (BMORS) model compared to a single-trait single-environment model. The ability to predict line performance in an unobserved environment is of great importance to breeding programs, and the results show high accuracy for predicting whole environments using BMORS.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2073-4395/10/9/1221/s1. The supplementary file contains six figures: Figure S1. Phenotypic distribution of grain composition traits in the RILs. In the x-axes, SC: South Carolina, TX: Texas, numbers represent years. Values are percentage dry basis for protein, fat and starch; gross energy is in KCal/lb; and amylose is in percent of starch. Figure S2. Phenotypic distribution of grain composition traits in the GSDP. Numbers in x-axes represent years. Values are percentage dry basis for protein, fat and starch; gross energy is in Cal/g; and amylose is in percent of starch. Figure S3. PCA analysis of correlation matrix between traits. a. GSDP, and b. RILs. Ams: amylose, GE: gross energy, Prt: protein, Sta: starch, SC: South Carolina, TX: Texas. The numbers in the text represent years of the environment. Figure S4. Overall prediction accuracy of traits across all the environments for the three prediction methods in the two populations. The y-axis shows prediction accuracy calculated as Pearson's correlation between observed values and predicted values of phenotypes. Legend represents the environment/years. SC: South Carolina, TX: Texas, GSDP: Grain sorghum diversity panel, RILs: recombinant inbred lines. Figure S5. Prediction accuracy using five-fold CV in Bayesian multi-environment (BME) model. a. GSDP, and b. RILs. Legend represents the environment/years. SC: South Carolina, TX: Texas. Pale blue dots represent the mean of prediction accuracy. Figure S6. Heatmap for genomic relationship matrix calculated using VanRaden (2008). a. GSDP, b. RILs. Trees show hierarchical clustering using Euclidean distance.

**Author Contributions:** S.S. conceptualized the study, performed data analysis, and wrote the manuscript; R.B. helped with experimental design and field phenotyping; J.L.B. helped in computation and data analysis; K.J. helped in near infra-red phenotyping; S.K. helped with conception of the study, acquisition of fund, and management of the study. All authors have read and agreed to the published version of the manuscript.

**Funding:** This study was funded by United States Department of Energy TERRA grant: DE-AR0001134.

**Acknowledgments:** The authors would like to thank William L. Rooney and Brian K. Pfeiffer for their contributions to phenotyping of the recombinant inbred population at College Station, TX. Our appreciation goes to the Wade Stackhouse Fellowship, and Robert and Lois Coker Endowment for their support during the study.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

**Data Availability:** The codes and phenotypic data used in the study can be accessed through github at sirjansapkota/GrainComp\_GS.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## **QTL Mapping for Seedling and Adult Plant Resistance to Leaf and Stem Rusts in Pamyati Azieva** × **Paragon Mapping Population of Bread Wheat**

**Yuliya Genievskaya <sup>1</sup>, Saule Abugalieva <sup>1,2</sup>, Aralbek Rsaliyev <sup>3</sup>, Gulbahar Yskakova <sup>3</sup> and Yerlan Turuspekov <sup>1,4,</sup>\***


Received: 6 August 2020; Accepted: 22 August 2020; Published: 30 August 2020

**Abstract:** Leaf rust (LR) and stem rust (SR) pose serious challenges to wheat production in Kazakhstan. In recent years, the susceptibility of local wheat cultivars has substantially decreased grain yield and quality. Therefore, local breeding projects must be adjusted toward the improvement of LR and SR disease resistances, including genetic approaches. In this study, a spring wheat segregating population of Pamyati Azieva (PA) × Paragon (Par), consisting of 98 recombinant inbred lines (RILs), was analyzed for resistance to LR and SR at the seedling and adult plant-growth stages. In total, 24 quantitative trait loci (QTLs) for resistance to rust diseases at the seedling and adult plant stages were identified, including 11 QTLs for LR and 13 QTLs for SR resistance. Fourteen QTLs were in similar locations to QTLs and major genes detected in previous linkage mapping and genome-wide association studies. The remaining 10 QTLs are potentially new genetic factors for LR and SR resistance in wheat. Overall, the QTLs revealed in this study may play an important role in the improvement of wheat resistance to LR and SR through marker-assisted selection.

**Keywords:** *Triticum aestivum*; QTL; mapping population; leaf rust; stem rust; pathogen races; disease resistance

#### **1. Introduction**

Bread wheat (*Triticum aestivum* L.) is one of the major cereal crops in the world. In 2018/2019, the global production of wheat was 734.7 million metric tons, ranking second place amongst the grains after maize [1]. It is used mostly as flour for the production of a large variety of leavened and flat breads and the manufacturing of a wide range of other baked products [2]. In 2018/2019, Kazakhstan was ranked the 12th largest wheat producer in the world [3]. In Kazakhstan, wheat is cultivated on about 13 million hectares annually. The country produces up to 20–25 million tons of bread wheat per year and exports up to 5–7 million tons of the grain [4]. The primary goals of modern wheat breeding programs worldwide include enhancing grain yield and quality and increasing resistance to biotic and abiotic stresses to ensure global food security [5]. Biotic stresses include dangerous fungal diseases and particularly the most common representatives of the *Puccinia* genus: *Puccinia recondita* Rob. ex Desm f. sp. *tritici*, causing leaf rust (LR), and *Puccinia graminis* Pers. f. sp. *tritici* Eriks. & Henn., which is responsible for stem rust (SR) of wheat.

LR generally causes light to moderate yield losses ranging from 1% to 20% over a large area, but when the disease is severe prior to heading time, it may destroy up to 90% of the wheat crop [6]. For example, in Kazakhstan, epiphytotic development of the pathogen on spring wheat during 2000–2001 resulted in 50–100% LR severity on commercial cultivars in the Akmola region (northern Kazakhstan), which is the main wheat-growing region in the country [7].

SR is another important rust disease that is often considered the most devastating of the wheat rust diseases because it can cause complete crop loss over a large area within a short period of time [8]. In 2016, northern Kazakhstan was subjected to an epiphytotic outbreak of SR, resulting in 50% disease development severity in the field, decreasing wheat yield and grain quality [7]. Nowadays, local farmers prefer the usage of fungicides to protect wheat fields from LR and SR; however, this method is harmful to the environment and more expensive than breeding and growing genetically resistant wheat cultivars [9].

LR and SR resistances are controlled by a diverse group of genes, designated as *Lr* and *Sr*, respectively [10]. In the last 100 years, approximately 80 *Lr* resistance genes have been identified and described in bread wheat, durum wheat, and diploid wheat species [10], and the list is still growing. For SR, nearly 60 *Sr* genes have been identified to date in wheat and its wild relatives [10]. Generally, resistance to rust diseases can be broadly categorized into two types. The first is resistance at all growth stages (called seedling resistance), detected at the seedling stage and expressed until the plant dies. This type of resistance is controlled by the R type of genes, and the majority of *Lr* and *Sr* genes belong to this group. The efficacy of the R gene is pathogen-strain-dependent [11]. The second type of resistance is adult plant resistance (APR), where genes are ineffective during the seedling stage but provide robust resistance at maturity [11]. For example, LR resistance genes *Lr12*, *Lr13*, *Lr22a*, *Lr22b*, *Lr34*, *Lr35*, *Lr46*, *Lr48*, *Lr49*, and *Lr67*, and SR resistance genes *Sr2* and *Sr57* are well-characterized APR genes [10]. Durable rust resistance is more likely to be the APR type rather than the seedling type [12]; both types are important for wheat breeding [11].

Two of the most effective methods of quantitative trait locus (QTL) mapping are based on association panels and biparental segregating populations [13]. Both of these methods provide the means to investigate the genome and describe the etiology of complex quantitative traits, including disease resistance [14–17]. Genetic maps are a key tool enabling genetic linkage studies and searches for novel loci responsible for traits. Modern high-throughput sequencing technologies allow for the high-accuracy genotyping of large collections with genetically diverse germplasms [18,19] and segregating mapping populations, such as doubled haploids (DHs), recombinant inbred lines (RILs), F2, and backcross (BC) populations [20]. Linkage maps were successfully used for the QTL analyses of wheat yield components [21], grain quality traits [22], abiotic [23], and biotic stress factors, including pests [24].

The primary goal of this study was to identify QTLs involved in seedling and adult plant resistance of bread wheat to LR and SR under environmental conditions in southern and southeastern Kazakhstan. To meet this goal, the Pamyati Azieva × Paragon (PA × Par) RILs mapping population (MP) was studied in field and greenhouse (GH) conditions. Previously, this population was successfully used for the analysis of yield-related traits [25] and adult plant resistance to LR and SR in south-east and northern Kazakhstan in 2018 [26,27]. Hence, the current study adds the investigation of seedling resistance in the MP to LR and SR races. In addition to one-year studies in south-east and north Kazakhstan, this work covers the analysis of LR and SR resistance in the MP in southern Kazakhstan in 2018 and 2019.

#### **2. Materials and Methods**

#### *2.1. Plant Material and Genotyping*

The biparental mapping population PA × Par, composed of 98 RILs, was developed in greenhouse conditions at the John Innes Centre (Norwich, UK) during 2011–2015 under the ADAPTAWHEAT project [28]. The RIL population was obtained via the single-seed descent method using two parental cultivars: Paragon (an elite spring cultivar originating from the U.K.) and Pamyati Azieva (a commercial spring cultivar originating from Russia and registered in Kazakhstan) [29]. Both cultivars were chosen due to their diverse genetic backgrounds and different manifestations of yield traits as well as resistance to diseases. The RIL population was further advanced to the F8 generation in the fields of southeastern Kazakhstan.

The RILs and two parental cultivars were genotyped using the Illumina iSelect 20K single nucleotide polymorphism (SNP) array at the TraitGenetics Company (TraitGenetics GmbH, Gatersleben, Germany). Markers with >10% missing data or with a minor allele frequency <0.1 were removed from the genotypic data, leaving 4595 polymorphic SNP markers.
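The marker filtering step can be sketched in Python as below. This is a hedged illustration rather than the pipeline used in the study: the 0/1 allele coding for inbred lines, the use of `np.nan` for missing calls, the function name `filter_markers`, and the toy genotype matrix are all assumptions; only the two thresholds come from the text.

```python
import numpy as np

def filter_markers(geno, max_missing=0.10, min_maf=0.10):
    """Drop markers with too much missing data or too low a minor allele frequency.

    geno: (lines x markers) array, alleles coded 0/1 (assumed coding for
    inbred lines), with np.nan marking missing calls.
    """
    geno = np.asarray(geno, dtype=float)
    missing_rate = np.isnan(geno).mean(axis=0)
    freq = np.nanmean(geno, axis=0)           # frequency of the allele coded 1
    maf = np.minimum(freq, 1.0 - freq)
    keep = (missing_rate <= max_missing) & (maf >= min_maf)
    return geno[:, keep], keep

# Toy data: 10 lines x 4 markers (made-up values)
geno = np.array([
    [0, 0, np.nan, 1],
    [0, 0, np.nan, 1],
    [0, 0, np.nan, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 1, 0],
])
filtered, keep = filter_markers(geno)
print(keep, filtered.shape)
```

Here the second marker is monomorphic and the third has 30% missing calls, so only the first and fourth markers pass both filters.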

#### *2.2. Phenotyping of Seedling Resistance in Greenhouse*

For the comprehensive study of the PA × Par MP response to LR and SR pathogens, the resistance was evaluated at the seedling and adult plant-growth stages. Race-specific resistance at the seedling stage was assessed in a greenhouse of the Research Institute of Biological Safety Problems (RIBSP, Gvardeisky, Zhambyl region, southern Kazakhstan). For the inoculation of RIL seedlings (7–10 days after sowing) in greenhouse conditions, three races of *P. graminis* and three races of *P. recondita* with different levels of virulence to *Sr* and *Lr* genes, respectively, were used (Table 1). Inoculated plants were placed in the boxes of the greenhouse with appropriate temperature conditions (22 ± 2 °C for SR, 18 ± 2 °C for LR) and illumination (10,000–15,000 lux, 16 h light period) [30–32]. RIL reaction was assessed on the 14th day after inoculation, according to the scale reported by Stakman [33]. The experiment was performed in two independent replicates.


**Table 1.** Virulence/avirulence pattern of pathogen races used in the study.

LR, leaf rust; SR, stem rust.

Races of *P. graminis* were differentiated in 2018 [34] using the North American nomenclature [35] with the assistance of five sets of SR-differentiating wheat cultivars. Races of *P. recondita* were also identified in 2018 using 20 Thatcher near-isogenic lines (NILs) sets of *Lr* genes [36–38]. For the nomenclature of *P. recondita* races, Virulence Analysis Tools [39] were used.

#### *2.3. Adult Plant Resistance and Yield Components in Field Conditions*

APR in the field was tested in two environments: the RIBSP and the Kazakh Research Institute of Agriculture and Plant Industry (KRIAPI, Almalybak, Almaty region, southeastern Kazakhstan) (Table 2).


**Table 2.** Meteorological data on average temperature and precipitations during the vegetation period in the fields of Research Institute of Biological Safety Problems (RIBSP, southern Kazakhstan) and Kazakh Research Institute of Agriculture and Plant Industry (KRIAPI, southeastern Kazakhstan).

In RIBSP fields, mixed races of LR and SR urediniospores common in Kazakhstan were applied as inoculum. The inoculum was activated at a temperature of 37–40 °C for 30 min, followed by watering in a humid chamber at a temperature of 18–22 °C for 2 h. At the booting stage, individual plants were treated with an aqueous suspension of leaf and stem rust urediniospores prepared with Tween 80 detergent. After inoculation, the plots were covered with plastic wrap for 16–18 h. In KRIAPI fields, infection occurred with local LR and SR pathogen populations in uncontrolled natural conditions.

Thus, experiments were conducted in three independent environments, including the study of seedling resistance in greenhouse conditions, the study of APR at RIBSP (controlled inoculation), and APR at KRIAPI (uncontrolled inoculation). In both field conditions, phenotyping of APR to LR and SR was performed in two independent replicates at the stage of grain ripening with the maximum level of disease manifestation. Disease assessment was performed using the scale of Stakman for SR [33] and the scale of Mains and Jackson for LR [40]. The severity of rust infection on leaf and stem surfaces was evaluated using the modified Cobb scale [41,42]. To meet the data format required for linkage analysis, the results of LR and SR evaluations at both seedling and adult plant-growth stages were converted to the 0–9 linear disease scale as described by Zhang et al. [43].

To identify the influence of LR and SR severity on the productivity of the studied population, two important yield-related components, thousand kernel weight (TKW, g) and kernel yield per plot (YP, g/m<sup>2</sup>), were also evaluated.

#### *2.4. Statistical Analysis of Phenotypic Data and QTL Mapping*

Phenotypic data processing, descriptive statistics, and one-tailed correlation tests were performed with SPSS Statistics v. 22 (SPSS Inc., Chicago, IL, USA).

The composite interval mapping (CIM) method with the Kosambi mapping function was used for the detection of QTLs in Windows QTL Cartographer v2.5 [44]. The threshold value for the logarithm of odds (LOD) score was calculated based on 1000 permutations and was 3.0 for all experiments, with a walking step of 1 cM. QTLs were detected for each environment and replication separately (seedling resistance in GH, APR at KRIAPI, and APR at RIBSP). QTLs identified in individual environments and/or replications that overlapped within 20 cM intervals and were associated with the same trait were considered identical [45]. Genetic maps with QTLs were drawn using MapChart v. 2.32 software [46]. For markers with the same position, only one single nucleotide polymorphism (SNP) marker was selected for the map.
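Two of the steps above lend themselves to a short Python sketch: the Kosambi mapping function, which converts a recombination fraction into a map distance, and the grouping of QTL peaks that fall within 20 cM of each other. The grouping rule shown here (peaks for the same trait and chromosome within 20 cM of the first peak of a group are merged) is one possible reading of the criterion and is an assumption, as are the function names and the example peaks.

```python
import math

def kosambi_cm(r):
    """Kosambi map distance in cM for recombination fraction r (0 <= r < 0.5)."""
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))

def kosambi_r(d_cm):
    """Inverse Kosambi function: recombination fraction for a distance in cM."""
    return 0.5 * math.tanh(2.0 * d_cm / 100.0)

def group_qtls(peaks, window_cm=20.0):
    """Group QTL peaks given as (trait, chromosome, position_cM) tuples.

    Assumed rule: a peak joins the current group if it shares trait and
    chromosome with it and lies within window_cm of the group's first peak.
    """
    groups = []
    for trait, chrom, pos in sorted(peaks):
        last = groups[-1] if groups else None
        if last and last[0][:2] == (trait, chrom) and pos - last[0][2] <= window_cm:
            last.append((trait, chrom, pos))
        else:
            groups.append([(trait, chrom, pos)])
    return groups

print(round(kosambi_cm(0.1), 2))

# Hypothetical peaks from different environments/replications
peaks = [("LR", "1A", 10.0), ("LR", "1A", 25.0), ("LR", "1A", 60.0), ("SR", "1A", 12.0)]
groups = group_qtls(peaks)
print(len(groups))
```

In the toy example, the two LR peaks at 10 and 25 cM on chromosome 1A would be treated as one QTL, while the LR peak at 60 cM and the SR peak remain separate.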

All genes present within the interval of 500 kb to the left and 500 kb to the right (1 Mb in total) from the peak marker were identified using the Ensembl Plant database [47]. As a reference, the genome of *T. aestivum* RefSeq v1 was used. The exact position of the peak SNP in the genome was determined using a BLAST tool [48]. Proteins and RNA gene products were identified using the UniProt database [49] via cross-reference from Ensembl Plant.
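The ±500 kb search window around a peak marker can be expressed as a simple interval query. The Python sketch below assumes gene records are available as dictionaries with `id`, `chrom`, `start`, and `end` fields; these field names, the helper functions, and the example coordinates are hypothetical and do not reflect the Ensembl Plants interface.

```python
def flanking_window(peak_bp, flank_bp=500_000):
    """Return (start, end) of the window around a peak marker position in bp."""
    return max(0, peak_bp - flank_bp), peak_bp + flank_bp

def genes_in_window(genes, chrom, peak_bp, flank_bp=500_000):
    """List IDs of genes on `chrom` that overlap the window around the peak."""
    start, end = flanking_window(peak_bp, flank_bp)
    return [g["id"] for g in genes
            if g["chrom"] == chrom and g["end"] >= start and g["start"] <= end]

# Hypothetical gene annotations
genes = [
    {"id": "geneA", "chrom": "1A", "start": 100_000, "end": 120_000},
    {"id": "geneB", "chrom": "1A", "start": 900_000, "end": 950_000},
    {"id": "geneC", "chrom": "1A", "start": 2_000_000, "end": 2_010_000},
    {"id": "geneD", "chrom": "1B", "start": 600_000, "end": 610_000},
]
hits = genes_in_window(genes, "1A", peak_bp=1_000_000)
print(hits)
```

Clamping the window start at zero avoids negative coordinates for peaks close to the beginning of a chromosome.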

#### **3. Results**

#### *3.1. Phenotyping Variations of Seedling and Adult Plant Resistance in Mapping Population*

The values of the resistance to target diseases in parents and 98 RILs are summarized in Table 3.


**Table 3.** Descriptive statistics for leaf rust (LR) and stem rust (SR) resistance at two plant-growth stages in the Pamyati Azieva × Paragon mapping population.

1—parents IT scores are given in 0–9 numeric scale, traditional IT scores are given in parentheses. Env., environment; GH, greenhouse conditions; PA, Pamyati Azieva; Par, Paragon; IT, infection type; R, percentage of resistant lines (0–1 on 9-point scale); MR, percentage of moderately resistant lines (2–4 on 9-point scale); MS, percentage of moderately susceptible lines (5–7 on 9-point scale); S, percentage of susceptible lines (8–9 on 9-point scale).
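The resistance classes defined in the footnote of Table 3 can be written as a small classification function. The Python sketch below uses the class boundaries from the footnote (R: 0–1, MR: 2–4, MS: 5–7, S: 8–9); how non-integer values such as population means are assigned to classes is an assumption of this example, as are the function names and the example scores.

```python
from collections import Counter

def resistance_class(score):
    """Map a 0-9 disease score to R, MR, MS, or S.

    Assumption: non-integer scores (e.g., means) fall into the class of the
    next higher integer boundary, so 1.5 is MR and 7.5 is S.
    """
    if not 0 <= score <= 9:
        raise ValueError("score must be between 0 and 9")
    if score <= 1:
        return "R"
    if score <= 4:
        return "MR"
    if score <= 7:
        return "MS"
    return "S"

def class_percentages(scores):
    """Percentage of lines in each resistance class."""
    counts = Counter(resistance_class(s) for s in scores)
    n = len(scores)
    return {c: 100.0 * counts.get(c, 0) / n for c in ("R", "MR", "MS", "S")}

# Hypothetical scores for five lines
pct = class_percentages([0, 3, 6, 6, 9])
print(resistance_class(5.9), pct)
```

With these boundaries, an average score of 5.9 falls in the moderately susceptible class.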

The average seedling resistance of RILs to LR races was between 5.9 and 6.1 points, corresponding to the moderately susceptible (MS) level. The major part of the population belonged to the MS group, with only several lines observed in the susceptible (S) group. Several lines were also in the resistant (R) group to the races TQKHT and TRTHT. Parental cultivars demonstrated an MS level of resistance to studied LR races, except for PA, which was susceptible to the race TQKHT. As for seedling resistance to SR, the average level in the RILs population was MS to all three SR races. However, unlike in the case of LR, the distribution of lines among resistance groups was different. Races TKRTF and RKRTF were divided between MS and S groups with a dominance of the S reaction to TKRTF and MS reaction to RKRTF. R and moderately resistant (MR) levels were detected only in the race PKCTC. Levels of resistance in parental cultivars were similar to races TKRTF (S) and RKRTF (MS), but PA demonstrated higher resistance to the race PKCTC than Par.

At the adult plant stage, the reactions of parents and RILs to LR and SR were significantly different between the studied environments. At RIBSP, the average reaction of RILs to LR was MR, with almost even distribution among all possible reactions observed in the population. At KRIAPI, the average level of resistance was MS with a dominance of MS and S reactions in the population. The parental cultivars demonstrated the same reaction to LR at KRIAPI, but at RIBSP, Par was more resistant than PA. For SR at the adult plant stage at RIBSP, the majority of RILs were in the S group, and the average level was MS. At KRIAPI, the largest part of the population was in the R group and the average level was MR. Parental cultivars also were in the S group at RIBSP and the R group at KRIAPI.

The analysis of variance showed that the resistance of RILs to LR at the seedling growth stage was significantly affected by the RIL genotype (*p* < 0.01) and the race of LR pathogen (*p* < 0.05), but not by genotype × race interaction (Table 4). For SR resistance, all factors had a significant influence (*p* < 0.001) on the resistance at the seedling stage (Table 4).


**Table 4.** ANOVA of plant genotype (Geno), pathogen race (Race), and plant genotype × pathogen race (Geno: Race) effects on seedling resistance to leaf rust (LR) and stem rust (SR) in the Pamyati Azieva × Paragon mapping population.

df, degree of freedom; MeanS, mean square.

#### *3.2. Correlations among SR and LR Seedling and Adult Plant Resistance and Influence of APR on the Yield-Related Traits*

Significant positive correlations were found among the reactions to all three LR races at the seedling stage, as well as between APR to LR at KRIAPI and seedling resistance to LR races TQTMQ and TRTHT (Table 5). APR to LR at KRIAPI was also positively correlated with APR to SR at KRIAPI and seedling resistance to SR race TKRTF. For the other SR races, race PKCTC had a positive correlation with APR to SR at KRIAPI, and race RKRTF was negatively correlated with LR race TQTMQ. The only significant correlation of APR to SR at RIBSP was associated with LR race TRTHT.



APR, adult plant resistance; ns not significant; \* *p* < 0.05; \*\* *p* < 0.01.

The negative influence of LR and SR severities at the adult plant stage on the wheat YP and TKW was confirmed by significant negative correlations (*p* < 0.01) between these traits at RIBSP (Table 6). At KRIAPI, the severity of LR at the adult plant stage was negatively correlated with YP only.

**Table 6.** Correlations between leaf rust (LR) and stem rust (SR) resistance at the adult plant stage and yield-related traits in the Pamyati Azieva × Paragon mapping population.


TKW, thousand kernel weight (g); YP, yield per plot (g/m<sup>2</sup>); ns not significant; \* *p* < 0.05; \*\* *p* < 0.01.

#### *3.3. Identification of QTLs for Seedling and Adult Plant Resistance to LR in the RIL Population*

A total of 11 QTLs for resistance to LR were identified at the seedling and adult plant-growth stages. Out of these 11 QTLs, eight QTLs were detected for different LR races at the seedling stage, two QTLs were for APR, and one QTL was observed for both seedling and adult plant resistance (Table 7, Figure 1). QTLs for LR resistance were located on 10 chromosomes of the A, B, and D genomes. The phenotypic variations explained by an individual QTL ranged from 11.6% to 25.7%. Because all QTLs for LR resistance identified in this study explained more than 10% of the phenotypic variation, they were considered major QTLs [50]. The LOD score of QTLs for LR resistance was in the range of 3.2–8.6.

For the LR race TQTMQ, three QTLs were identified on chromosomes 4A, 5B, and 7B. They explained 12.1–15.7% of the phenotypic variation. The alleles of all three QTLs associated with increased resistance to LR originated from PA. For the second LR race, TQKHT, two QTLs on chromosomes 6A and 7D explained 25.7% and 11.6% of the phenotypic variation, respectively. The alleles of both QTLs associated with higher resistance to race TQKHT originated from Paragon. For the third LR race, TRTHT, three QTLs were observed on chromosomes 3A, 4D, and 6A. These QTLs explained 12.3–16.2% of the variation in resistance to race TRTHT. The alleles of QTLs *QLr.ipbb-3A.2* and *QLr.ipbb-6A.5* increasing resistance were from Paragon, and the allele of *QLr.ipbb-4D.1* was from PA.

**Figure 1.** Pamyati Azieva × Paragon genetic map with quantitative trait loci (QTLs) for adult plant resistance (APR) to leaf rust (LR) in two regions and seedling resistance to three LR races. The region containing the QTL is indicated by a vertical bar on the right and followed by the name of the QTL. Single nucleotide polymorphism (SNP) markers are shown on the right and their genetic positions (cM) on the left. The peak marker for each QTL is highlighted in color and bolded. Colors of QTL indicate APR or race-specific seedling resistance.


The list of quantitative trait loci (QTLs) for adult plant resistance (APR) and race-specific seedling resistance to leaf rust (LR) identified in the

the effective allele is taken from the other parent (negative for PA and positive for Paragon).

*Agronomy* **2020** , *10*, 1285

Two QTLs for APR to LR at RIBSP were identified on chromosomes 1B and 7A and explained 13.6% and 11.7% of phenotypic variation, respectively. For both APR QTLs, alleles associated with increased LR resistance were from Paragon. One QTL for APR to LR at KRIAPI was also detected at the seedling stage for the resistance to LR race TQKHT on chromosome 1A. It explained 16.0% of LR resistance variation. In both cases, alleles associated with higher resistance originated from Paragon.

#### *3.4. QTLs for SR Resistance at Seedling and Adult Plant Stages Identified in PA* × *Par Mapping Population*

A total of 13 QTLs were detected in this study for SR resistance at the seedling and adult plant-growth stages. Among them, seven race-specific QTLs were identified at the seedling stage (three QTLs for race TKRTF, three QTLs for race PKCTC, and one QTL for race RKRTF), three QTLs were observed for APR (two QTLs at KRIAPI and one QTL at RIBSP), and three QTLs were revealed in both the seedling and adult stages (Table 8, Figure 2). The identified QTLs for SR resistance were distributed among nine chromosomes of A, B, and D genomes and explained from 8.9% to 39.1% of the variation in the resistance to SR. In total, 11 out of 13 QTLs for SR resistance had *R*<sup>2</sup> > 10% and could be considered major QTLs. The LOD score for the detected QTL varied from 3.0 to 6.8.

**Figure 2.** Pamyati Azieva × Paragon genetic map with quantitative trait loci (QTLs) for adult plant resistance (APR) to stem rust (SR) in two regions and seedling resistance to three SR races. The region containing the QTL is indicated by a vertical bar on the right and followed by the name of the QTL. Single nucleotide polymorphism (SNP) markers are positioned on the right and their genetic positions (cM) are shown on the left. The peak marker for each QTL is highlighted in color and bold. Colors of QTL indicate APR or race-specific seedling resistance.



increased expression of the trait is undesired, and the effective allele is taken from the other parent (negative for PA and positive for Paragon).


Three QTLs for resistance to SR race TKRTF were identified, including two QTLs mapped on chromosome 6B and one QTL on 2D. These QTLs explained 10.0–19.2% of the variation in SR resistance to race TKRTF. QTLs *QSr.ipbb-6B.6* and *QSr.ipbb-6B.7* had alleles increasing SR resistance to race TKRTF carried by PA, whereas resistance allele *QSr.ipbb-2D.2* originated from Paragon. For the second SR race PKCTC, two QTLs for resistance to this race were located on chromosome 2B and one QTL was on 5B. The phenotypic variance conditioned by these QTLs varied from 8.9% to 14.7%. The third SR race RKRTF allowed the identification of one race-specific QTL on chromosome 5A explaining 24.7% of the phenotypic variation. Its allele, associated with an increase in SR resistance, originated from Paragon.

Three QTLs for APR to SR were identified on chromosomes 1D, 3D, and 5A, and explained from 10.8% to 18.0% of SR resistance variation. The two QTLs identified at KRIAPI had alleles increasing resistance to SR originating from Paragon, and the allele of the QTL at RIBSP was from PA. The last three QTLs for SR resistance were detected more than once in the experiment and are located on chromosomes 1B, 2A, and 6B. The QTL *QSr.ipbb-1B.4* was detected as race-specific to TKRTF at the seedling stage and as an APR QTL at KRIAPI. It explained 15.7% of SR resistance variation and had alleles increasing resistance originating from Paragon at the seedling stage and from PA at the adult plant stage. The QTL *QSr.ipbb-2A.2* was identified as effective against SR race RKRTF at the seedling stage and as an APR QTL at RIBSP. This QTL explained 39.1% of the phenotypic variation, and alleles increasing SR resistance at both growth stages were inherited from PA. The QTL *QSr.ipbb-6B.5* was detected at the seedling stage for race TKRTF and at the adult plant stage at RIBSP. This QTL explained 17.3% of SR resistance variation. Its alleles increasing resistance originated from PA in the case of seedling resistance and from Paragon at the adult growth stage.
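As a minimal illustration of the threshold used in this study (QTLs with *R*<sup>2</sup> > 10% treated as major), the following Python sketch classifies the four SR QTLs whose explained variance is quoted in the text. The label "minor" and the function name `classify_qtl` are assumptions introduced here for illustration only; they are not part of the original analysis.

```python
# Phenotypic variance explained (%) for the SR QTLs named in the text.
pve_percent = {
    "QSr.ipbb-1B.4": 15.7,
    "QSr.ipbb-2A.2": 39.1,
    "QSr.ipbb-2B.4": 8.9,
    "QSr.ipbb-6B.5": 17.3,
}

MAJOR_THRESHOLD = 10.0  # percent; cut-off for "major" QTLs used in this study


def classify_qtl(pve: float, threshold: float = MAJOR_THRESHOLD) -> str:
    """Return 'major' if the QTL explains more than `threshold` % of variance.

    'minor' is a label assumed here for QTLs at or below the threshold.
    """
    return "major" if pve > threshold else "minor"


classes = {name: classify_qtl(pve) for name, pve in pve_percent.items()}
for name in sorted(classes):
    print(f"{name}: {pve_percent[name]:.1f}% -> {classes[name]}")
```

Under this rule only *QSr.ipbb-2B.4* (8.9%) falls below the threshold among the four QTLs listed.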

#### *3.5. Comparison of Identified QTLs with Previous Works and Gene Identification*

The QTLs identified in this study were analyzed in comparison with previously reported QTLs for LR and SR resistance in the PA × Par RILs population [27] and with QTLs for LR and SR resistance at RIBSP identified using genome-wide association study (GWAS) [51]. The location of each identified QTL was compared to the genetic positions of known *Lr* and *Sr* genes (Table 9). In total, four candidate *Lr* genes and four QTLs were found for five QTLs associated with LR resistance in this study. In the analysis of QTLs for SR resistance, we found similarities with the genetic locations of eight previously identified QTLs and/or candidate *Sr* genes.


**Table 9.** Comparison of quantitative trait loci (QTLs) for leaf rust (LR) and stem rust (SR) resistance identified in this study in a Pamyati Azieva × Paragon mapping population with previously described QTLs and candidate *Lr* and *Sr* genes.

The region of each QTL was analyzed for the presence of protein-coding genes in the interval 500 kb upstream and 500 kb downstream from the most significant SNP (Table S1). The analysis of LR-associated QTL regions suggested the presence of 158 genes ranging from 6 (*QLr.ipbb-7B.3* and *QLr.ipbb-7D.1*) to 22 (*QLr.ipbb-1B.4* and *QLr.ipbb-6A.4*) genes per interval. A similar search for SR-associated QTL regions indicated the presence of 226 genes ranging from 5 (*QSr.ipbb-2B.4*) to 29 (*QSr.ipbb-1B.4*) genes per interval. Among these 158 genes identified for QTLs associated with LR, 48.9% coded for proteins with functions known in *T. aestivum*, 48.6% for uncharacterized proteins, and 2.5% for RNAs. For QTLs associated with SR, 56.6% of genes coded for proteins uncharacterized in *T. aestivum,* 41.2% described protein-coding genes, and 2.2% coded for RNAs. Among genes coding for uncharacterized proteins, sequences similar to the 24 QTL regions for LR and 39 QTL regions for SR were identified in other grass species (Table S1). Orthologous genes with their sequence similarity level higher than 70% were selected and are listed.
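The interval search described above (500 kb on each side of the most significant SNP) can be sketched as follows. This is a simplified illustration, not the procedure actually used to build Table S1: the gene records, coordinates, and the choice to count any gene that overlaps the window (rather than only genes fully inside it) are assumptions.

```python
from dataclasses import dataclass

WINDOW_BP = 500_000  # 500 kb upstream and downstream of the peak SNP


@dataclass(frozen=True)
class Gene:
    gene_id: str
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def genes_in_window(genes, chrom, peak_pos, window=WINDOW_BP):
    """Return genes on `chrom` that overlap [peak_pos - window, peak_pos + window]."""
    lo = max(1, peak_pos - window)
    hi = peak_pos + window
    return [g for g in genes if g.chrom == chrom and g.start <= hi and g.end >= lo]


# Hypothetical annotation records, for illustration only.
annotation = [
    Gene("geneA", "1B", 9_600_000, 9_603_500),    # inside the window
    Gene("geneB", "1B", 10_499_000, 10_502_000),  # straddles the upper bound
    Gene("geneC", "1B", 10_600_000, 10_604_000),  # beyond the window
    Gene("geneD", "2A", 10_000_000, 10_002_000),  # different chromosome
]

hits = genes_in_window(annotation, "1B", peak_pos=10_000_000)
print([g.gene_id for g in hits])
```

In practice the gene list would come from a genome annotation file; only the windowing logic is shown here.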

#### **4. Discussion**

#### *4.1. General Resistance of RILs in Studied Environments*

At the seedling stage, the majority of RILs and parental cultivars showed MS and S levels of resistance to all races of LR and SR, except for the SR race PKCTC, where several lines were identified as R and MR (Table 3). The ANOVA test showed a more significant influence of the pathogen genotype (race) on the resistance of RILs than of the genotype of the wheat lines (Table 4). This result indicated that the genetic factors associated with resistance are race-specific. Worldwide and in Kazakhstan, breeding programs are mostly focused on the combination of seedling resistance and APR in new cultivars. Pyramiding of seedling gene(s) with slow rusting APR gene(s) usually results in higher resistance of the crop. This agrees with wheat R genes conferring resistance to LR (*Lr1*, *Lr10*, *Lr21*) and SR (*Sr22*, *Sr33*, *Sr35*, *Sr45*, *Sr50*) being cloned and widely used in wheat breeding [55]. However, the significant positive correlations among LR races observed in this study (Table 5) suggested the involvement of genetic factors that are effective against all three races. The presence of strong positive correlations between APR to LR and SR at KRIAPI also indicated that genes conferring LR resistance are either closely linked or may have a pleiotropic effect on genes that control SR resistance [26,56]. Positive correlations were also observed between seedling resistance to LR races TQTMQ and TRTHT and APR to LR at KRIAPI, as well as between seedling resistance to SR race PKCTC and APR to SR at KRIAPI (Table 5). The relationship between race-specific seedling resistance and broad adult plant resistance could be influenced by the presence of these LR and SR races in the fields at KRIAPI. This also suggested that the wheat germplasm growing in this region could be effectively and rapidly screened for resistance to LR and SR at the seedling stage in a greenhouse [57].

LR and SR resistances are complex traits [58]; this was confirmed by the range of reactions to pathogens and the presence of transgressive segregations. Even when parents demonstrated the same level of resistance, such as APR to SR, RILs still showed transgressive phenotypes in the direction of either resistance (RIBSP) or susceptibility (KRIAPI) (Table 3). This phenomenon is not rare; it was previously described for many other quantitatively inherited wheat traits; for example, in studies of grain quality traits [22], grain Zn and Fe concentrations [59], grain yield and plant height [60], and rust diseases [61,62].

#### *4.2. QTL Mapping for Leaf Rust Resistance*

Alleles conferring increased resistance of QTLs for LR race TQKHT and APR at RIBSP originated from Par (Table 7). The higher LR resistance of Par in comparison with PA indicated that the U.K. cultivar is a promising source for wheat breeding programs in Kazakhstan. PA was simultaneously found to be a source for QTLs with increased LR resistance to race TQTMQ.

The 11 QTLs for resistance to LR at the seedling and adult plant-growth stages can be divided into two categories: (1) QTLs similar to those previously detected for LR resistance and (2) presumably novel QTLs. The first category consisted of 6 out of 11 QTLs for LR resistance (Table 9). Four of the QTLs for LR resistance with similar genetic positions, *QLr.ipbb-1A.2* (APR at KRIAPI), *QLr.ipbb-6A.4* (seedling resistance to TQKHT), *QLr.ipbb-6A.5* (seedling resistance to TRTHT), and *QLr.ipbb-7B.3* (seedling resistance to TQTMQ), were previously identified in a GWAS study performed at RIBSP in 2018/2019 at the adult plant stage [50]. Hence, multiple occurrences of QTLs associated with resistance to LR in different conditions and environments indicated the broad stability of these loci. *QLr.ipbb-7B.3* may be associated with the gene *Lr14* located in a similar region of the genome (Table 9). The effectiveness of allele *Lr14a* was described for northern Kazakhstan and that of *Lr14b* for eastern and western Kazakhstan [7]. *Lr14* was also described as an effective resistance factor to TQTMQ (Table 1). The APR QTL *QLr.ipbb-1B.4* is associated with the gene *Lr21*, positioned in close proximity to the peak of the QTL (Table S1). This gene was described as effective in southeastern Kazakhstan [7]. The last QTL from the first group, *QLr.ipbb-3A.2*, is probably associated with genes *Lr63* and *Lr66* (Table 9). Unfortunately, information is lacking about the role of these genes in the wheat-growing areas of Kazakhstan. However, *Lr63* and *Lr66* are known to condition low to intermediate infection types to most *P. recondita* isolates [63]. The remaining five QTLs identified for LR resistance are presumably novel genetic factors, since there were no reliable matches between their positions in the genome and previously identified QTLs or genes.

#### *4.3. QTLs for Stem Rust*

In 13 QTLs for the resistance to SR identified in this study, alleles presumably increasing resistance originated from both PA and Par (Table 8). Similar to LR resistance, SR-resistance-associated QTLs could be divided into two loci groups, where the first group has similar genetic positions with previously reported QTLs for SR resistance (Table 9), and the second group has none of those matches. The first group includes 8 out of 13 QTLs identified for SR resistance. For three of them—*QSr.ipbb-2B.4* (seedling resistance to PKCTC), *QSr.ipbb-6B.5* (seedling resistance to TKRTF and APR in RIBSP), and *QSr.ipbb-6B.7* (seedling resistance to TKRTF)—QTLs for SR resistance with similar positions in the genome were identified in a previous work involving the PA × Par mapping population [27] and in a GWAS study using resistance data obtained from RIBSP [51]. Similar to the LR study, these findings may indicate the stability of identified QTLs. In addition to the information with QTL similarities, several specific *Sr* genes seem to be associated with QTLs from this study (Table 9). One of the most interesting findings was the identification of three QTLs on distal ends of chromosomes 2A, 2B, and 2D responsible for seedling resistance to SR races RKRTF, PKCTC, and TKRTF, respectively. These QTLs could be associated with the gene *Sr32*, which was mapped in these regions of chromosomes 2A [64], 2B [65], and 2D [66]. The gene was previously reported as effective against Ug99 and related SR races [66]. The other *Sr* genes involved in resistance to SR races in the Ug99 lineage and possibly associated with QTL from this study are *Sr31* (resistant to TTKSF and TTKSP), *Sr36* (all Ug99 lineage races, except TTTSK), and *Sr39* (all Ug99 lineage races) [67]. Among the SR races used in this study, *Sr36* was described as effective against PKCTC (Table 1). The resistance pattern is similar to *QSr.ipbb-2B.4*, which is located in a nearby region of the chromosome. 
The second group of the genetic factors consisted of the remaining five QTLs that could be novel QTLs associated with resistance to SR.

#### *4.4. QTLs Cluster on Chromosome 1B*

QTLs associated with several traits are common in wheat. This may be due to a pleiotropic effect or to tight linkage. For resistance to wheat fungal diseases, the pleiotropic APR genes *Lr34*/*Yr18*/*Pm38*/*Sr57* [68], *Lr46*/*Yr29*/*Pm39*/*Sr58* [69], and *Lr67*/*Yr46*/*Pm46*/*Sr55* [70] were previously described. Among the QTLs identified for LR and SR resistance in this study, two QTLs (*QLr.ipbb-1B.4* and *QSr.ipbb-1B.4*) occupy the same interval on chromosome 1B (Tables 7 and 8, Figures 1 and 2). In addition, the *QLr.ipbb-1B.4* interval contains the resistance gene *Lr21* less than 500 kb from the significant peak, whereas the interval of *QSr.ipbb-1B.4* has genes for disease resistance proteins and resistance-related kinases next to the peak marker (Table S1). *Lr21* was described as effective for southeastern Kazakhstan [7]. The common markers in these intervals suggest that these QTLs could be useful in marker-assisted breeding and gene pyramiding to develop wheat cultivars with durable rust resistance [11].

#### **5. Conclusions**

Overall, 24 QTLs for the resistance to rust diseases at the seedling and adult plant stages were identified in this study, including 11 QTLs for LR and 13 QTLs for SR. Among the QTLs associated with LR, eight QTLs were race-specific and detected at the seedling stage, two QTLs were at the stage of the adult plant, and one QTL was identified in both stages. The QTLs for LR-resistance explained from 11.6% (*QLr.ipbb-7D.1*) to 25.7% (*QLr.ipbb-6A.4*) of the phenotypic variation and were detected on 10 chromosomes. The increased resistance to LR in TQTMQ race-specific QTLs originated from PA; in QTLs specific for the race TQKHT and APR, alleles were from Par. For TRTHT, the origin of resistance alleles in identified QTLs was both parental cultivars. For SR resistance, seven QTLs were race-specific and detected at the seedling stage, three QTLs were identified at the adult plant stage, and three QTLs were identified at both growth stages. SR-associated QTLs explained from 8.9% (*QSr.ipbb-2B.4*) to 39.1% (*QSr.ipbb-2A.2*) of variation in SR resistance and were mapped on nine chromosomes. The alleles increasing resistance to SR originated from both parents: effective alleles in six QTLs were from Par, in five QTLs from PA, and two QTLs had a different origin of resistance at the seedling and adult plant stages. Among the QTLs from this study, 10 QTLs were putative and 14 matching QTLs were found in previous works involving the PA × Par population, a GWAS study at RIBSP, and possible candidate resistance genes. The cluster of QTLs associated with both LR and SR resistances was identified on chromosome 1B. Thus, the QTLs revealed in this study may play an essential role in the improvement of wheat resistance to LR and SR via marker-assisted selection.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2073-4395/10/9/1285/s1, Table S1. The list of protein- and RNA-coding genes 500 kb upstream and 500 kb downstream from the most significant SNP of the QTL.

**Author Contributions:** Conceptualization, S.A. and Y.T.; methodology, A.R. and Y.T.; formal analysis, Y.G. and G.Y.; investigation, Y.G., A.R., and S.A.; resources, Y.T. and A.R.; data curation, Y.G., G.Y., and Y.T.; writing—original draft preparation, Y.G.; writing—review and editing, Y.G., S.A., A.R., G.A., and Y.T.; supervision, Y.T.; project administration, S.A.; funding acquisition, A.R. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Ministry of Agriculture of the Republic of Kazakhstan, grant number BR06249329. The APC was also funded by grant number BR06249329.

**Acknowledgments:** This work was conducted within the framework of the project "Development of new DNA markers associated with the resistance of bread wheat to the most dangerous fungal diseases in Kazakhstan" in Program "Development of the innovative systems for increasing the resistance of wheat varieties to especially dangerous diseases in the Republic of Kazakhstan": BR06249329 supported by the Ministry of Agriculture of the Republic of Kazakhstan.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

*Article*

## **Cost-Effective and Time-Efficient Molecular Assisted Selection for Ppv Resistance in Apricot Based on** *ParPMC2* **Allele-Specific PCR**

#### **Ángela Polo-Oltra <sup>1</sup>, Carlos Romero <sup>2</sup>, Inmaculada López <sup>1</sup>, María Luisa Badenes <sup>1</sup>**


Received: 31 July 2020; Accepted: 25 August 2020; Published: 31 August 2020

**Abstract:** Plum pox virus (PPV) is the most important limiting factor for apricot (*Prunus armeniaca* L.) production worldwide, and development of resistant cultivars has been proven to be the best solution in the long term. However, just like in other woody species, apricot breeding is highly demanding in time and space, and this is particularly true for PPV resistance phenotyping. Therefore, marker-assisted selection (MAS) may be very helpful to speed up breeding programs. The tightly linked *ParPMC1* and *ParPMC2*, meprin and TRAF-C homology (MATH)-domain-containing genes, have been proposed as host susceptibility genes required for PPV infection. A contribution of additional genes to PPV resistance cannot be ruled out, but all available studies show a strong correlation between *ParPMC2*-resistant alleles (*ParPMC2res*) and PPV resistance. The *ParPMC2res* allele was shown to carry a 5-bp deletion (*ParPMC2*-del) within the second exon that has been characterized as a molecular marker suitable for MAS (PMC2). Based on this finding, we propose here a method for PPV resistance selection in apricot that combines high-throughput DNA extraction of 384 samples in 2 working days with allele-specific genotyping of PMC2 on agarose gel. Moreover, the PMC2 genotype has been determined by PCR or by using whole-genome sequences (WGS) in 175 apricot accessions. These results were complemented with phenotypic and/or genotypic data available in the literature to reach a total of 325 apricot accessions. As a whole, we conclude that this is a time-efficient, cost-effective and straightforward method for PPV resistance screening that can be highly useful for apricot breeding programs.

**Keywords:** apricot; MAS; breeding; MATH; PPV resistance; agarose; *ParPMC*; *ParPMC2*-del

#### **1. Introduction**

Most cultivated apricots belong to the *Prunus armeniaca* L. species, a member of the Rosaceae family, *Prunus* genus and section Armeniaca (Lam.) Koch [1]. World apricot production reached 3.84 million tonnes in 2018, with Turkey, Uzbekistan and Iran as the main producers (http://www.fao.org/faostat/). This represents an increase of about 45% since 1998, mainly due to Asian countries. By contrast, European production in this period has increased only slightly while the cultivated area declined by up to 19%. Despite its wide geographical spread, apricot has very specific ecological requirements. Consequently, each region usually grows locally adapted cultivars. For this reason, significant breeding efforts have been undertaken since the first apricot breeding program started in 1925 at the Nikita Botanical Garden in Yalta (Crimea, Ukraine) [2]. However, apricot breeding based on biparental controlled crosses and subsequent selection of the best new allelic combinations is severely limited by the capacity to evaluate trees in the field [3]. On the one hand, fruit trees need a lot of space to be grown. On the other, their juvenile phase is quite long and reliable pomological phenotyping requires several cropping seasons, which means that at least ten years are needed to release a new variety. Therefore, the implementation of marker-assisted selection (MAS) has great potential to improve breeding efficiency in fruit trees, including apricot.

Sharka disease, caused by *Plum pox* virus (PPV), is currently the most important viral disease affecting stone fruit trees (*Prunus* spp.) [4]. To date, nine PPV strains (D, M, C, EA, W, Rec, T, CR and An) have been identified [5]. However, PPV genetic diversity may be even greater, as observed by Chirkov et al. [6], who recently described the new Tat isolates affecting sour cherry (*Prunus cerasus*). PPV-D and M are the most widespread and economically important strains [5,7]. A clear host preference is observed: PPV-D/plum/apricot and PPV-M/peach. However, the underlying genetic determinants are still unknown [8].

Particularly in apricot, PPV-D has severely hindered production in the last three decades, especially in endemic areas. In this context, development of PPV-resistant varieties is the main objective of apricot breeding programs. However, resistant sources are scarce. Just a handful of North American PPV-resistant cultivars have been identified to date, and they are commonly used as donors in all apricot resistance breeding programs currently in progress [9]. Several independent works aimed at dissecting the genetic control of PPV resistance in apricot have identified the major dominant *PPVres* locus in the upper part of linkage group 1 [10–17]. According to the pedigree and fine mapping data, a single common ancestor carrying *PPVres* has been suggested for all PPV-resistant cultivars [16,18–20]. Moreover, other minor loci contributing to PPV resistance have been suggested [13–16], but their role has not yet been well defined. More recently, transcriptomic and genomic analyses of *PPVres* locus have pointed out *ParPMC1* and *ParPMC2,* two members of a cluster of meprin and TRAF-C homology domain (MATHd)-containing genes, as host susceptibility paralogous genes required for PPV infection [21]. The *ParPMC2* allele linked in coupling with PPV resistance (*ParPMC2res*) accumulates 15 variants, including a 5 nt deletion (*ParPMC2*-del) that results in a premature stop codon. Moreover, cultivars carrying the *ParPMC2res* allele show that *ParPMC2* and especially *ParPMC1* genes are downregulated. As a result, this *ParPMC2res* was proposed to be a pseudogene that confers PPV resistance by silencing functional homologs, the non-mutated *ParPMC2* allele and/or *ParPMC1*. Another plausible scenario involves epigenetic modifications to explain *ParPMC* silencing in the resistant cultivars [22].

In spite of evidence supporting linkage with the *PPVres* locus, some genotype-phenotype incongruencies (GPIs) have been detected in biparental populations segregating for PPV resistance [17,23,24]. In other words, some phenotypically susceptible individuals carrying *ParPMC2res* were classified as genetically resistant. Possible causes underlying these discrepancies, including other loci contributing to PPV resistance, are still unresolved. However, the potential benefit of using a *ParPMC2* allele-specific marker (PMC2) for MAS is still very high since sharka resistance phenotyping is a major bottleneck in apricot breeding programs. The most reliable method for apricot PPV resistance phenotyping is based on a biological test that uses GF-305 peach rootstocks as woody indicators and graft-inoculation with PPV [25]. This procedure is time-consuming and requires visual inspection during two to four growing seasons in several replicates per genotype followed by ELISA [26] and RT-PCR tests [27]. It should be noted that the plant to be tested must be of a significant size in order to have enough buds for grafting replicates, so it takes a couple of years from the time of crossing. As a result of a genetic mapping approach, Soriano et al. [18] reported the first successful MAS application for PPV resistance using 3 SSRs within the *PPVres* locus resolved by capillary electrophoresis. Afterwards, these SSRs were combined with a single sequence length polymorphism marker (ZP002) interrogating the *ParPMC2*-del resolved by capillary or acrylamide electrophoresis [24] and by high resolution melting [28]. However, specialized DNA testing services are needed to adopt these MAS approaches, and together with the economic costs, this could be a challenge [29].

Here, we report a method combining high-throughput DNA extraction of 384 samples in 2 days and PMC2 genotyping by allele-specific PCR amplification and agarose gel electrophoresis. This method proved to be an easily implemented tool for MAS of PPV-resistant seedlings in almost any apricot breeding program. Bioassays for PPV resistance evaluation will subsequently be needed to confirm the phenotype in the selected materials. Moreover, the PMC2 genotype has been determined and/or revised for 325 apricot accessions cultivated worldwide, providing useful information for breeders to select parental genotypes.

#### **2. Materials and Methods**

#### *2.1. High-Throughput DNA Isolation in 96-Well Plate*

The genomic DNA extraction protocol was optimized from the original Doyle and Doyle method [30] to manage 384 samples per isolation using 8-well 1.2-mL strip tubes (VWR International). For each accession, 2 leaf discs were collected and placed into a tube with 3 glass beads (VWR International). The strips were frozen in liquid N<sub>2</sub> and stored at −20 °C before DNA isolation. Frozen tissue was ground for 1 min at a frequency of 26/s using a Qiagen TissueLyser 85210 (Qiagen, Hilden, Germany). Then, 340 μL of preheated CTAB isolation buffer (with 0.2% 2-mercaptoethanol) was added to the ground tissue and incubated at 65 °C for 40 min, shaking gently every 10 min. After a short spin, 340 μL of chloroform-isoamyl alcohol (24:1) was added and mixed by inverting the plates. Tubes were centrifuged for 10 min at 3000 rpm and 4 °C. The clean aqueous phase was transferred to new strip tubes, and 1.5 vol of 100% ethanol and 15 mM ammonium acetate were added and mixed gently. After overnight incubation at −20 °C, tubes were centrifuged for 10 min at 3000 rpm and 4 °C. The supernatant was discarded by inverting the tubes, and 300 μL of 70% ethanol was added. After centrifugation for 10 min at 3000 rpm and 4 °C, the supernatant was discarded and finally 75 μL of TE was added. DNA at 1:10 dilution was used for PCR. Some random DNA samples from each plate were subjected to quality control. DNA integrity was checked on an agarose gel, and quantification was performed using a Nanodrop ND-1000 spectrophotometer (Nanodrop Technologies, Wilmington, DE, USA).

#### *2.2. PMC2 Genotype by Allele-Specific PCR Assay*

PMC2 marker genotyping was performed using the allele-specific forward primer (PMC2-F-alleleR: 5'-GTCATTTTCATTGATGTCATTCA-3' or PMC2-F-alleleS: 5'-GTCATTTTCATTGATGTCATTCA-3') and one common reverse primer (PMC2-R: 5'-GTCATTTTCATTGATGTCATTCA-3'), as described by Zuriaga et al. [21]. PCRs were performed in a final volume of 20 μL containing 1× DreamTaq buffer, 0.2 mM of each dNTP, 5 μM of each primer, 1 U of DreamTaq DNA polymerase (Thermo Fisher) and 2 μL of DNA extraction (diluted 1:10). Cycling conditions were as follows: an initial denaturing step of 95 °C for 5 min; 35 cycles of 95 °C for 30 s, 55 °C for 45 s and 72 °C for 45 s; and a final extension of 72 °C for 10 min. PCR products were electrophoresed in 1% (w/v) agarose gels.
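The reaction set-up and cycling program above can be turned into pipetting volumes and a programmed run time with a short calculation. The following Python sketch is illustrative only: the final concentrations and hold times follow the text, whereas every stock concentration (10× buffer, 10 mM dNTPs, 100 μM primers, 5 U/μL polymerase) is an assumption that should be replaced by the values of the reagents actually in use. Ramping time between temperatures is ignored.

```python
FINAL_VOLUME_UL = 20.0

# (final concentration, stock concentration) in matching units.
# Final values follow the text; every stock value is an assumption.
components = {
    "DreamTaq buffer (x)": (1.0, 10.0),
    "dNTPs, each (mM)": (0.2, 10.0),
    "forward primer (uM)": (5.0, 100.0),
    "reverse primer (uM)": (5.0, 100.0),
}
POLYMERASE_UNITS = 1.0            # per reaction, from the text
POLYMERASE_STOCK_U_PER_UL = 5.0   # assumed stock concentration
DNA_UL = 2.0                      # 1:10 diluted extract, from the text


def volume_ul(final_conc, stock_conc, final_volume=FINAL_VOLUME_UL):
    """Solve C1 * V1 = C2 * V2 for the stock volume V1."""
    return final_conc * final_volume / stock_conc


mix = {name: volume_ul(f, s) for name, (f, s) in components.items()}
mix["DreamTaq polymerase"] = POLYMERASE_UNITS / POLYMERASE_STOCK_U_PER_UL
mix["template DNA (1:10)"] = DNA_UL
mix["water"] = FINAL_VOLUME_UL - sum(mix.values())  # fill up to 20 uL

# Programmed hold times in seconds.
initial_s, final_s, cycles = 5 * 60, 10 * 60, 35
per_cycle_s = 30 + 45 + 45
total_min = (initial_s + cycles * per_cycle_s + final_s) / 60

for name, vol in mix.items():
    print(f"{name:24s}{vol:6.2f} uL")
print(f"programmed run time: {total_min:.0f} min")
```

With these assumed stocks, the programmed hold times add up to 85 min per run, so several plates can be amplified within one working day on a single thermocycler.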

Available DNA samples from 120 apricot cultivars and accessions were PCR screened in this work. Part of this collection is currently kept at the collection of the Instituto Valenciano de Investigaciones Agrarias (IVIA) in Valencia (Spain), while other samples were provided by the Departamento de Mejora y Patología Vegetal del CEBAS-CSIC in Murcia (Spain), the University of St. Istvan (Budapest, Hungary) or by SharCo project (FP7-KBBE-2007-1) partners.

#### *2.3. WGS Mapping and PMC2 Screening*

WGSs of 73 cultivars were used in this study. Twenty-four of these WGSs and the 454 sequenced BAC clones belonging to the "Goldrich" *PPVres* locus R-haplotype were already screened in our previous works [20–31]. The other 49 WGSs were downloaded from the SRA repository (https://www.ncbi.nlm.nih.gov/sra). All raw reads were processed using the "run\_trimmomatic\_qual\_trimming.pl" script from the Trinity software [32]. After removing the low-quality regions as well as vector and adaptor contaminants, cleaned reads were aligned to the peach genome v.2.0.a1 [33] using Bowtie2 v.2.2.4 software [34]. The presence/absence of the *ParPMC2*-del was visually inspected using IGV v.2.4.16 [35].
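In this work the deletion was scored by eye in IGV. As a complementary illustration, and not as the method used here, the same question can be asked programmatically of the alignment records: does a read carry a 5-bp deletion starting at the expected reference position? The sketch below parses CIGAR strings with the Python standard library only. The coordinate `DEL_START` and the example reads are placeholders, and real aligners may shift an indel within a repeat, so a tolerance around the position may be needed.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference


def has_deletion(read_start, cigar, del_start, del_len=5):
    """True if the alignment has a deletion of `del_len` bases whose first
    deleted reference base lies at 1-based position `del_start`."""
    ref_pos = read_start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "D" and length == del_len and ref_pos == del_start:
            return True
        if op in REF_CONSUMING:
            ref_pos += length
    return False


# Hypothetical alignments: (1-based leftmost position, CIGAR string).
DEL_START = 1_050  # placeholder, not the real ParPMC2-del coordinate
reads = [
    (1_000, "50M5D50M"),  # carries the 5-bp deletion at 1,050
    (1_000, "100M"),      # spans the site without a deletion
    (1_020, "30M5D70M"),  # same deletion seen from another read
    (1_000, "40M5D60M"),  # 5-bp deletion at a different position
]

support = sum(has_deletion(pos, cigar, DEL_START) for pos, cigar in reads)
print(f"{support} of {len(reads)} reads support the deletion")
```

A roughly even mix of supporting and non-supporting reads would be expected for a heterozygous accession, but any such threshold would have to be calibrated against accessions of known genotype.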

#### **3. Results and Discussion**

#### *3.1. High-Throughput DNA Extraction and ParPMC2-del Genotyping for MAS*

MAS offers great advantages over traditional seedling selection based only on phenotypic evaluations in fruit breeding [36]. DNA tests in segregating populations can improve the cost efficiency and/or the genetic gain for each seedling selection cycle [29], making it possible to identify, from among many thousands, the few seedlings that have the genetic potential for desired performance levels [37]. As a result, agronomical evaluation in field trials is restricted to the promising selected materials. Implementation of MAS is especially valuable for traits that are difficult and/or expensive to phenotype, such as PPV resistance. As previously explained, the most reliable PPV resistance phenotyping is based on a biological test that uses graft-inoculated GF-305 peach seedlings [25] (Figure 1A). This protocol requires several replicates per genotype and visual inspection of symptoms during 2–4 growing seasons, which makes it the main bottleneck in apricot breeding programs. For instance, following this method at the IVIA's greenhouse and cold chamber facilities, we can phenotype no more than 3000 plants per year, which equals 500 seedlings (i.e., 6 replicates are needed for each seedling).
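The capacity figures quoted in this article can be put side by side with a few lines of arithmetic. The sketch below uses only numbers stated in the text (3000 plants per year, 6 replicates per seedling, 384 samples per 2 working days); it covers the DNA extraction step alone, so PCR and electrophoresis time would have to be added for a full comparison.

```python
PLANTS_PER_YEAR = 3000       # phenotyping capacity stated in the text
REPLICATES_PER_SEEDLING = 6  # replicates per seedling stated in the text
SAMPLES_PER_BATCH = 384      # four 96-well plates per extraction batch
DAYS_PER_BATCH = 2           # working days per extraction batch

# Seedlings that can be phenotyped by the biological test in one year.
seedlings_per_year = PLANTS_PER_YEAR // REPLICATES_PER_SEEDLING

# Extraction batches needed for the same number of seedlings (ceiling division).
batches_needed = -(-seedlings_per_year // SAMPLES_PER_BATCH)
extraction_days = batches_needed * DAYS_PER_BATCH

print(f"seedlings phenotyped per year: {seedlings_per_year}")
print(f"extraction batches needed:     {batches_needed}")
print(f"working days of extraction:    {extraction_days}")
```

In other words, DNA from a full year of phenotyping capacity could be extracted in two batches.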

In this work, we present a new strategy to speed up the current application of MAS for PPV resistance in apricot [18,24,28] while reducing its costs. Here, we combine a high-throughput DNA extraction protocol that does not need sophisticated robotic systems and can be implemented in any regular laboratory, with PMC2 allele-specific PCR amplification using previously described primers [21] and agarose electrophoresis (Figure 1B). The two forward primers differ at the 3'-end, which makes it easy to discriminate the presence/absence of the 5-bp *ParPMC2*-del (Figure 2). With this DNA extraction method, one person can easily process up to 384 samples (four 96-well sample plates) in 2 working days, enabling high-throughput sample preparation. This is 4 times more samples than a standard CTAB method using individual tubes, while the cost of reagents and consumables is similar in both cases (around 0.29–0.30 € per sample) (Table S2). The DNA obtained has enough quantity and quality for subsequent regular PCRs. A 1:10 dilution of the DNA obtained was directly used for PCR amplification, without any additional purification step. In contrast, commercial kits are much more expensive in terms of reagents and consumables, with costs around 4 € per sample. Then, using this DNA, 3 different methods could be applied for PPV MAS in apricot: the fluorescent labelling of PCR fragments that are resolved using capillary electrophoresis [18], the high-resolution melting (HRM) approach [28], and the use of standard PCR resolved by agarose gel electrophoresis [21]. It should be noted that the first two methods require special equipment that may not be available in some laboratories and that also makes the protocol more expensive. For instance, the capillary electrophoresis alone costs around 1.5–2 € per sample (PCR not included) and the fluorescently labelled primers needed for PCR (136 €, 10 nm) are much more expensive than the non-labelled ones (4 €, 20 nm). On the other hand, commercial kits for HRM are not very expensive (around 1 € per sample) but require real-time PCR machines specially calibrated for this type of experiment, as well as the analysis software. In summary, although prices differ between laboratories and countries, our rough estimate points to the first and second approaches being 13 and 8 times more expensive, respectively, in terms of reagents and consumables than the protocol proposed in this work (Table S2).

Practical advantages of PMC2 genotyping over classical phenotyping may be illustrated by the following example (Figure 1). The estimated time needed for evaluating 1000 samples at the IVIA's facilities using bioassays is about 16 months (500 samples/8 months), taking into account that plants should be big enough to be ready-to-graft (approximately 2 years old). In contrast, just about 4 weeks are needed to conduct PMC2 genotyping just after seed germination. This estimated time was calculated assuming a 40-h workweek. As 1000 samples could be distributed into 10.4 96-well plates, ideally the DNA extraction would need 5.2 days (4 plates each 2 days), the 2 allele-specific PCRs would need 7.8 days (3 h each plate) and the agarose electrophoresis would last 2.6 days (2 PCR 96-well plates and 2 h per gel). In total, we would need 15.6 working days to genotype 1000 samples. This improvement removes the phenotyping bottleneck since all seedlings obtained from a particular cross can be PCR screened that same year. Hence, this quick and high-throughput method for DNA testing is expected to have an important effect on the cost efficiency of MAS, as suggested by Edge-Garza et al. [37].
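The time estimate above can be reproduced with a short calculation. The following Python sketch is illustrative only: it assumes an 8-h working day (a 40-h workweek spread over 5 days) and uses the per-step figures quoted in the text; the function and variable names are hypothetical.

```python
def genotyping_days(samples, wells_per_plate=96, hours_per_day=8.0):
    """Estimate working days needed to genotype `samples` seedlings.

    Per-step figures are taken from the text; the 8-h working day is an
    assumption derived from the 40-h workweek.
    """
    plates = samples / wells_per_plate            # 96-well sample plates
    extraction = plates / 4 * 2                   # 4 plates every 2 days
    pcr = plates * 2 * 3 / hours_per_day          # 2 allele-specific PCRs, 3 h per plate
    gels = plates * 2 / 2                         # each gel holds 2 PCR plates
    electrophoresis = gels * 2 / hours_per_day    # 2 h per gel
    return {
        "plates": plates,
        "extraction": extraction,
        "pcr": pcr,
        "electrophoresis": electrophoresis,
        "total": extraction + pcr + electrophoresis,
    }


estimate = genotyping_days(1000)
for step, value in estimate.items():
    print(f"{step}: {value:.1f}")
```

Rounded to one decimal, the values match those given in the text (10.4 plates; 5.2, 7.8 and 2.6 days per step; 15.6 working days in total).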

**Figure 1.** Comparison between traditional Plum pox virus (PPV) resistance phenotyping (**A**) and high-throughput marker-assisted selection (MAS) based on PMC2 allele-specific PCR (**B**). (\*) Estimated duration based on Instituto Valenciano de Investigaciones Agrarias (IVIA) facilities.

**Figure 2.** PMC2 genotyping by allele-specific PCR using forward primers differing at the 3'-end (**A**): R-allele (**B**) and S-allele (**C**) amplifications in 1% agarose gel electrophoresis for 46 apricot accessions (1: Goldrich, 2: Harlayne, 3: Henderson, 4: Lito, 5: Orange Red, 6: Pandora, 7: SEO, 8: Stella, 9: Veecot, 10: Bebeco, 11: Bergeron, 12: Canino, 13: Currot, 14: Ginesta, 15: Katy, 16: Mitger, 17: Palau, 18: Tyrinthos, 19: Piera, 20: Selene, 21: Colorao, 22: Moixent, 23: Perla, 24: Dama Vermella, 25: Maravilla, 26: Ninfa, 27: Palabras, 28: Sublime, 29: Dorada, 30: Castlebrite, 31: Martinet, 32: Corbató, 33: Gandía, 34: Cristalí, 35: Manri, 36: Gavatxet, 37: Pisana, 38: Xirivello, 39: Velazquez, 40: Mirlo Rojo, 41: Rojo Carlet, 42: Bulida, 43: ASP, 44: Silvercot, 45: Bora and 46: Roxana).

#### *3.2. ParPMC2-del Highly Correlates with PPV Resistance in Apricot Germplasm*

One of the main pillars of plant breeding is skilful parental selection to create new genetic variation by controlled crossing. Breeders usually associate DNA-informed breeding only with the use of molecular markers for seedling selection, but it can also be very helpful for parental selection [36]. This is the case in apricot breeding for PPV resistance. Two decades ago, Martínez-Gómez et al. [9] reviewed phenotypic information regarding apricot cultivar behaviour against PPV. Similarly, here, we compile the PMC2 genotype of a wide set of apricot accessions to facilitate parental selection, also incorporating their resistance phenotype, pedigree and origin data from the literature when available. The PPV strain used for phenotyping was also included because differences in the severity of the induced symptoms have been observed [10,16]. As a result, after screening 120 accessions by PCR and another 49 by WGS and reviewing the available literature, the PMC2 genotype was determined in a total of 325 apricot cultivars or accessions that represent a wide range of geographic origins (Figure 3). A significant part of the materials comes from European countries directly involved in PPV resistance research during the last decades, such as Italy (20.9%), Spain (15.7%) or France (14.8%) [38–42]. Regarding the viral strain, PPV-M was the one most frequently used for phenotyping, except in Spain (PPV-D) and Turkey (PPV-T) (Figure 3), in agreement with the prevalence of these two strains in each country [5,43].

**Figure 3.** Geographic distribution of apricot accessions: PMC2 genotypes (RR: homozygous for the resistant allele; SS: homozygous for the susceptible allele; and RS: heterozygous) and PPV strain used for phenotyping are also indicated.

In total, 110 accessions were considered phenotypically resistant (Table 1), 108 were susceptible (Table 2) and 11 showed an uncertain phenotype against the same or different PPV strains (Table 3). *ParPMC2*-del highly correlates with PPV resistance, as evidenced by its presence in 92.8% of the resistant accessions (Table 1) and its absence in 92.6% of the susceptible accessions (Table 2). Only 16 out of 219 (7.3%) accessions phenotypically classified as resistant or susceptible showed genotype-phenotype incongruences (GPIs). GPIs were previously reported mainly in segregating populations [18,23,24,28,44], but the reasons underlying them have proved difficult to clarify, as quite different factors may be involved. These factors include complex phenotyping protocols, loci other than *PPVres* contributing to PPV resistance, environmental conditions and/or gene–environment interactions. Additionally, putative misclassifications could also explain some genotypic discrepancies observed in this work. For instance, Sunglo, the resistant donor parent of Goldrich, has been phenotyped as resistant by several authors using PPV-M [15,45,46] and PPV-D [47], and it carries the SSR-resistant alleles targeting the *PPVres* locus [18]. However, WGS data (SRR2153157) supposedly corresponding to this accession do not have the *ParPMC2*-del. Something similar occurs with Mirlo Naranja, classified as resistant [48], which was found to carry one copy of the *ParPMC2*-del by PCR in this work but not in that of Passaro [49]. Detailed accession documentation may help to resolve these discrepancies, but 13 of the 16 identified GPIs have no pedigree data available. This information would be very valuable to increase the efficiency of apricot breeding programs and germplasm management.


**Table 1.** Apricot PPV-resistant accessions genotyped for PMC2.

Table columns: Name; Country <sup>a</sup>; Origin; Pedigree; PPV Resistance Phenotype <sup>b</sup>; PPV Strain Used; First Phenotype Ref; PMC2 Genotype <sup>c</sup>; PMC2 Genotype Ref.

*Agronomy* **2020**, *10*, 1292

**Table 3.** Apricot accessions with uncertain PPV resistance phenotype genotyped for *ParPMC2*-del.

M \*: strain likely used for phenotyping by the Phytosanitary Service, Emilia-Romagna (Italy). <sup>a</sup> Countries: C: Canada, CA: Central Asia, CR: Czech Republic, FR: France, GR: Greece, [...]. <sup>b</sup> R: Resistant and S: Susceptible. <sup>c</sup> Genotype: RR: homozygous for PMC2 resistant allele, SS: homozygous for PMC2 susceptible allele and RS: heterozygous.

Accurate evaluation of PPV resistance is a complex process, and results obtained by different researchers are sometimes contradictory, as exemplified by Farbaly and Pieve (Table 3), which may lead to GPIs. This problem is also observed in well-known accessions. For instance, Goldrich, usually classified as resistant against both PPV-D and M strains, has also been classified as uncertain or even as susceptible at least once (Table 3). Moreover, an effect of the PPV strain used [9,24] has also been observed, as at least 5 accessions showed different behaviour against PPV-M, D or T infection (Table 3). In addition, the environmental effect on symptoms and the different PPV detection techniques employed could also be involved in GPIs [9].

On the other hand, PPV resistance has been related to the downregulation of both *ParPMC2* and, especially, *ParPMC1*, putatively due to an RNA silencing mechanism triggered by the pseudogenization of *ParPMC2res* [21]. Nevertheless, epigenetic changes have also been suggested as a possible cause [22]. In any case, resistant cultivars show residual expression levels that could be influenced by environmental conditions. This might explain sporadic symptoms that eventually lead to GPI classification. Moreover, additional PPV resistance loci or genes may also contribute to GPIs. In this sense, Gallois et al. [105] pointed out that a large part of the resistant phenotype conferred by a given QTL depends on the genetic background, due to frequent epistatic effects between resistance genes. In fact, other minor loci, linked or not to *PPVres*, have been suggested to underlie PPV resistance in apricot [13–16]. Altogether, the identification and/or confirmation of GPIs in this work paves the way for future studies to unravel the PPV resistance mechanism.

The handful of North American cultivars originally described as PPV resistant [9] have been extensively used as donors in all breeding programs currently in progress. As a result, the *PPVres* locus has been introduced into different genetic backgrounds. In order to complete our survey, genotypic information was compiled from another 96 accessions without available PPV phenotype data (Table S1, [107–113]). In summary, 152 accessions (46.8%) have at least one copy of the *ParPMC2*-del (Figure 3) and 15 of them are homozygous for *ParPMC2*-del, including the North American PPV-resistant cultivar Stella [114]. The materials derived from crosses with North American PPV-resistant cultivars represent an opportunity to accelerate the development of new varieties better adapted to the conditions of the Mediterranean basin [9]. In this context, it should be highlighted that MAS improves cost efficiency and/or genetic gain in apricot breeding programs aimed at selecting PPV-resistant seedlings. This improvement is highly significant even if some PPV-susceptible individuals carrying *ParPMC2*-del are carried along, since they will be identified later by PPV phenotyping. Similarly, Tartarini et al. [115] underlined the advantage of identifying homozygous *Rvi6* scab-resistant plants using MAS, despite segregating progenies showing at least 5% of GPIs.

#### **4. Conclusions**

Here, we present a high-throughput method to quickly perform DNA testing for PPV resistance that may greatly improve the efficiency of apricot breeding programs. The lengthy PPV phenotyping process will only need to be performed on advanced selections showing promising agronomic behaviour, to guarantee the selection of PPV-resistant individuals. Additionally, a wide survey of over 300 accessions has been made to identify PPV-resistant sources that could also be useful in apricot breeding programs.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2073-4395/10/9/1292/s1, Table S1. PMC2 genotyped apricot accessions without phenotypic data against PPV infection; Table S2. Estimation cost of DNA extraction and PMC genotyping for PPV MAS in apricot.

**Author Contributions:** Conceptualization: C.R., M.L.B. and E.Z.; experimental procedures: Á.P.-O., I.L. and E.Z.; bioinformatics: E.Z.; funding acquisition: M.L.B.; writing—original draft, E.Z.; writing—review and editing, Á.P.-O., C.R., I.L., M.L.B. and E.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA)-FEDER (RTA2017-00011-C03-01). Á.P.O. was funded by a fellowship cofinanced by the Generalitat Valenciana and European Social Fund (2014–2020) (DOCV 8426/19.11.2018).

**Acknowledgments:** The authors would like to express their gratitude to Bassi (University of Milan, Italy) for providing pedigree information from their apricot breeding program.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

*Article*

## **High Resolution Melting and Insertion Site-Based Polymorphism Markers for Wheat Variability Analysis and Candidate Genes Selection at Drought and Heat MQTL Loci**

**Rosa Mérida-García <sup>1,\*</sup>, Sergio Gálvez <sup>2</sup>, Etienne Paux <sup>3</sup>, Gabriel Dorado <sup>4</sup>, Laura Pascual <sup>5</sup>, Patricia Giraldo <sup>5</sup> and Pilar Hernandez <sup>1,\*</sup>**


Received: 30 July 2020; Accepted: 27 August 2020; Published: 1 September 2020

**Abstract:** The practical use of molecular markers is facilitated by cost-effective detection techniques. In this work, wheat insertion site-based polymorphisms (ISBP) markers were set up for genotyping using high-resolution melting analysis (HRM). Polymorphic HRM-ISBP assays were developed for wheat chromosomes 4A and 3B and used for wheat variability assessment. The marker sequences were mapped against the wheat genome reference sequence, targeting interesting genes. Those genes were located within or in proximity to previously described quantitative trait loci (QTL) or meta-quantitative trait loci (MQTL) for drought and heat stress tolerance, and also yield and yield related traits. Eighteen of the markers used tagged drought related genes and, interestingly, eight of the genes were differentially expressed under different abiotic stress conditions. These results confirmed HRM as a cost-effective and efficient tool for wheat breeding programs.

**Keywords:** high resolution melting; ISBP markers; drought; candidate genes; QTL; MQTL; wheat variability

#### **1. Introduction**

Wheat is among the most important and widely grown crops worldwide [1] and one of the most important grain food crops in the human diet (https://www.fao.org). Wheat development and yield can be affected by abiotic stresses such as drought [2–5] and heat [6,7], whose frequency is expected to increase with the predicted climate change and global warming [8–10]. In fact, drought is considered one of the most limiting environmental factors [3,11,12], strongly affecting the growth [13,14] and production of crops, with significant reductions in the final yield of cereals, including wheat [13]. Heat stress usually affects crops during the post-anthesis period, with negative effects on final production [15] and end-use quality [16]. These abiotic stresses are important challenges for plant researchers and breeders, and plant breeding efforts have been focused on improving final crop production under these limiting conditions [17].

Drought tolerance is considered a complex [18,19] and quantitative trait [14], which interacts with the environment and has an additive polygenic nature [3]. Even though most of the genes involved have minor contributions, they are of great importance in the genetic improvement of drought tolerance [5]. Drought can also affect gene expression [20,21], both under controlled [22–25] and field conditions [26,27]. Heat stress tolerance involves plant mechanisms that act at several levels, through which the plant acquires thermotolerance to cope with high temperatures [28]. The plant heat shock response results from the reprogramming of gene expression [29] and is of great interest for studies focused on plant stress tolerance and gene expression regulation [28]. Wheat breeding programmes are necessary to ensure an improved selection of favorable alleles for agronomic traits of interest, such as yield and quality, and also for tolerance to biotic and abiotic stresses [30,31].

Molecular markers can be developed and successfully applied to identify important genomic regions and major genes [32] closely related to target traits such as drought tolerance [3]. In addition, public resources such as Wheat Expression (http://www.wheat-expression.com) facilitate the identification of interesting candidate genes, as well as their validation [22,27]. Recent advances in genomics, and the available fully annotated wheat reference genome (IWGSC RefSeq v1) [33], allow the accurate identification of marker positions and their chromosome locations [5]. The available gene models have been used, through appropriate bioinformatic pipelines, for the identification of differentially expressed genes during drought and heat stress treatments (e.g., [22,27]).

Insertion site-based polymorphism (ISBP) markers are PCR markers designed from the sequences flanking transposable element (TE) insertions, with one primer placed in the transposable element and the other in the flanking DNA sequence [34]. TEs are very abundant and nested in the wheat genome, with unique (genome-specific) insertion sites that are highly polymorphic [30]. ISBP markers were developed for wheat genomic and genetic studies [34,35], and later used in marker-assisted selection (MAS) and as a selection tool for new varieties in plant breeding programs [30]. The ISBP technique is a rapid and efficient way to develop single-copy chromosome-specific markers from incomplete genomic sequences [30,34]. ISBP markers represent a valuable source of polymorphism, which is mostly genome specific in wheat [35], and are therefore very convenient for wheat mapping applications [36]. These markers have been used to improve wheat genome saturation [30,33,37], for genotyping and single-nucleotide polymorphism (SNP) detection [35,38], genetic diversity assessment [39], identification of micro RNA (miRNA) coding sequences [40], or sequence composition analysis [41]. They were also successfully applied to develop physical or genetic maps, and to locate important agronomic traits [42,43] or resistance genes [36,44].

There are different techniques to detect ISBP markers, such as fluorescent polymerase chain reaction (PCR) and capillary electrophoresis, allele-specific PCR or melting curve analysis [35]. High-resolution melting (HRM) analysis has been described as a versatile and powerful analytical tool in molecular biology, characterized by its ease of use, simplicity, flexibility, low cost, sensitivity, and specificity [45–47]. Briefly, this technique analyzes the melting of the PCR product by recording the fluorescence emitted by an intercalating dye in response to a defined increasing temperature ramp [48]. ISBP detection by melting curve analysis was first carried out by Paux et al. [30] using wheat chromosome 3B markers, as an alternative to agarose gel electrophoresis. ISBP marker detection using HRM analysis was later performed for resistance loci assessment in bread wheat [49]. The HRM technique has also been used to detect other molecular markers (e.g., SNP, expressed-sequence tag (EST), simple-sequence repeat (SSR), or insertion-deletion (InDel)). HRM applications in wheat and closely related species such as barley and *Aegilops* include the characterization of InDel and SNP markers involved in drought and salt tolerance [50]; the detection of SNP markers [51,52] and mutations [53–56]; and the mapping of markers linked to resistance genes [57–59].

ISBP markers have been developed for all wheat chromosomes [33,35,40,41,43]. Chromosome 3B [30] contains loci for grain yield, kernel length, plant height, and related traits [60]. Chromosome 4A [33,61] harbors several QTLs related to tolerance to biotic and abiotic stresses, agronomic traits such as grain yield and quality, and regulation of physiological traits such as plant height, maturity, or dormancy [62–70]. This chromosome represents an important target for plant breeding, marker design for variability analysis, and candidate gene assessment. In this study, ISBP markers from wheat chromosomes 3B [30] and 4A were used to develop and validate HRM assays, and to assess the genetic variability in a wheat collection. These markers were also used to target meta-quantitative trait loci (MQTL) [71,72] related to drought and heat stresses, as well as yield and yield-related traits. Candidate gene analyses were performed for the ISBP markers, and the genes were validated by gene expression analyses carried out under different drought and heat stress conditions.

#### **2. Materials and Methods**

#### *2.1. Plant Material and DNA Isolation*

Two wheat panels were used to perform the analyses: (i) panel 1: a collection of 62 wheat lines (37 *Triticum aestivum* L., 11 *Triticum turgidum* ssp. *durum* (Desf.) Husn., 11 *Triticum monococcum* L., 2 *Triticum turgidum* ssp. *turgidum* L. and one *Triticum urartu* Thumanian ex Gandilyan) from different sources (Supplementary Materials Table S1); (ii) panel 2: a collection of 76 durum wheat (*Triticum turgidum* L.) landraces, provided by the Spanish National Plant Genetic Resources Center (CRF-INIA) (Supplementary Materials Table S2). This panel comprised genotypes of three subspecies: 8 *dicoccon* (Schrank) Thell., 21 *turgidum,* and 45 *durum* (Desf.) Husn.

Genomic DNA was isolated from young leaf tissue according to the cetyl trimethyl ammonium bromide (CTAB) method of Murray and Thompson [73], as optimized by Hernández et al. [74]. The quality and concentration of samples were assessed by electrophoresis in a 0.8% agarose gel.

#### *2.2. Insertion Site-Based Polymorphism Markers Development*

ISBP markers were initially developed based on the wheat chromosome 4A survey sequencing [61] and confirmed in the bread wheat reference genome sequence RefSeq v1 [33].

The assemblies corresponding to the 4A wheat chromosome survey sequencing were generated using the "Newbler v2.7" software package (Roche Diagnostics Corporation, Basel, Switzerland) using default parameters. IsbpFinder [30] was run on the assemblies obtained, and ISBP markers were located on the 4AS and 4AL chromosome arms assemblies. The corresponding 45 ISBP primers were designed using Primer3 (http://primer3.sourceforge.net) and mapped to the bread wheat reference genome RefSeq v1 [33]. Marker set up was carried out using 6 durum and bread wheat lines representative of variability (Supplementary Materials Table S1).

ISBP amplicons obtained by standard PCR (55 °C annealing, [30]) using 5 *T. aestivum* varieties (Supplementary Materials Table S1) were purified by Exonuclease I (Exo I, New England Biolabs, Inc., Ipswich, MA, USA) and SAP treatment (5 μL DNA, 1 U Exo I, 1× SAP buffer, 1 U SAP in 9 μL at 37 °C for 1 h). The purified fragments were then sequenced on an ABI PRISM® 3730XL (Applied Biosystems, Foster City, CA, USA) genetic analyzer using the forward and reverse ISBP primers [30] and the ABI PRISM BigDye Terminator v3.1 Cycle Sequencing Kit (Applied Biosystems, Foster City, CA, USA). HRM analyses were carried out for the same 5 *T. aestivum* varieties (Supplementary Materials Table S1) using 6 ISBP primer pairs previously developed for wheat chromosome 3B [30]: *HRM3B\_273339424*, *HRM3B\_609364064, HRM3B\_124761338, HRM3B\_203288704, HRM3B\_465802537,* and *HRM3B\_331497483*. This analysis was carried out in a Rotor-Gene™ 6000, model 5-Plex real-time PCR instrument (Corbett Research, Mortlake, NSW, Australia). The PCR reaction volume was 15 μL and the mixture contained the Master Mix of the Type-it HRM PCR Kit (Qiagen, CA, USA), 0.7 μM of each primer and 2 μL of genomic DNA (30 ng). The PCR protocol consisted of an initial denaturation step at 95 °C for 10 min, followed by 45 amplification cycles of denaturation at 95 °C for 10 s, annealing at 55 °C for 15 s and extension at 72 °C for 20 s. The high-resolution melting was performed by ramping from 65 to 95 °C, with fluorescence data acquisition at 0.1 °C increments (waiting for 2 s at every acquisition). HRM results were compared with the amplicon sequencing for the same five samples.

The high-resolution melting analysis was later performed to assign HRM pattern types to the ISBP markers, using 19 durum and bread wheat lines (Supplementary Materials Table S1). The PCR protocol consisted of an initial denaturation step at 95 °C for 5 min; 50 cycles of denaturation at 95 °C for 20 s, annealing at 60 °C for 20 s, and extension at 72 °C for 20 s. HRM analysis was undertaken once amplification was completed by ramping from 65 to 95 °C, with fluorescence data acquisition at 0.1 °C increments (waiting for 2 s at every acquisition). Results were analyzed using the Rotor-Gene software version 1.7 (Qiagen, the Netherlands), and HRM curves were normalized according to the manufacturer's instructions.
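To illustrate what curve normalization and difference plots involve, the following Python sketch scales a raw melting curve between a pre-melt and a post-melt baseline and subtracts a reference curve. It is a generic illustration, not the algorithm implemented in the Rotor-Gene software: the baseline windows, the synthetic sigmoid curves and all function names are assumptions made for the example.

```python
import math


def normalize_melt_curve(temps, fluor, pre=(66.0, 68.0), post=(92.0, 94.0)):
    """Rescale fluorescence so the pre-melt baseline is 100 and the post-melt baseline is 0.

    `pre` and `post` are hypothetical temperature windows (in degrees C).
    """
    def window_mean(low, high):
        values = [f for t, f in zip(temps, fluor) if low <= t <= high]
        if not values:
            raise ValueError("no data points inside the baseline window")
        return sum(values) / len(values)

    top = window_mean(*pre)
    bottom = window_mean(*post)
    if top == bottom:
        raise ValueError("flat curve: baselines are identical")
    return [100.0 * (f - bottom) / (top - bottom) for f in fluor]


def difference_curve(sample_norm, reference_norm):
    """Point-by-point difference between a sample and a reference genotype."""
    return [s - r for s, r in zip(sample_norm, reference_norm)]


# Synthetic example: 65 to 95 degrees C in 0.1 degree steps (301 acquisitions),
# with two amplicons whose melting temperatures differ by 1 degree.
temps = [65 + 0.1 * i for i in range(301)]
reference_raw = [1000 / (1 + math.exp((t - 80.0) / 0.5)) + 50 for t in temps]
sample_raw = [1000 / (1 + math.exp((t - 81.0) / 0.5)) + 50 for t in temps]

reference_norm = normalize_melt_curve(temps, reference_raw)
sample_norm = normalize_melt_curve(temps, sample_raw)
diff = difference_curve(sample_norm, reference_norm)
```

In such a difference plot, samples sharing the reference genotype stay close to zero, whereas variants deviate around the melting transition.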

#### *2.3. Candidate Genes and Gene Expression Analyses*

ISBP markers were mapped in wheat chromosomes 4A and 3B, and then were compared to the wheat MQTLs described in Acuña-Galindo et al. [71]. To obtain the MQTLs positions in wheat chromosomes, marker sequences [71] were extracted from "Graingenes" (graingenes.org) and "Wheat SSR DB markers" (wheatssr.nig.ac.jp), and then located by mapping flanking markers (using BLAST) against RefSeq v1 [33]. Only the markers with a corresponding amplicon shorter than 500 bp and a perfect BLAST match (no gap, no mismatch) were considered.

Marker sequences in chromosomes 4A and 3B were blasted against RefSeq v1 [33] using the parameters "-task", "blastn-short" and "-ungapped". The resulting hits were then processed to pair forward and reverse sequence hits defining an amplicon <1000 base pairs (bp). For subsequent analysis, paired sequences were ordered by number of mismatches, so that each marker position was inferred from the position of the pair with the lowest number of mismatches (0 in most cases). To identify candidate genes associated with each marker, the results were filtered and the hits with the best e-value were selected for each molecular marker. The candidate genes were manually selected within a window of +/−20 kb of the marker's hit in the pseudomolecule [33] gene model annotation. Due to the reduced gene density found for wheat chromosome 3B ISBP markers (only three genes were found), that window was extended to +/−300 kb for this chromosome. Genes described as "uncharacterized protein" were then manually annotated. Sequences were obtained from Ensembl Plants (*T. aestivum* RefSeq v1.1) (https://plants.ensembl.org/), and then searched in UniProt (https://www.uniprot.org/). The annotated hits with e-value 0.0 and a score >2000 were selected, except for the gene *TraesCS4A01G410700,* which encodes a short protein (207 aa).
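The pairing and window steps described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: hits are given as already parsed records, strand orientation is ignored, ties in mismatches are broken by amplicon size, and all names and example coordinates (`Hit`, `best_pair`, `genes_in_window`, `geneA`, etc.) are hypothetical rather than taken from the original pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    chrom: str
    start: int        # leftmost coordinate of the primer hit
    end: int          # rightmost coordinate of the primer hit
    mismatches: int


def best_pair(fwd_hits, rev_hits, max_amplicon=1000):
    """Return (forward hit, reverse hit, amplicon size) with the fewest mismatches.

    Only pairs on the same chromosome spanning less than `max_amplicon` bp
    are considered; returns None when no pair qualifies.
    """
    candidates = []
    for f in fwd_hits:
        for r in rev_hits:
            if f.chrom != r.chrom:
                continue
            size = max(f.end, r.end) - min(f.start, r.start) + 1
            if size < max_amplicon:
                candidates.append((f.mismatches + r.mismatches, size, f, r))
    if not candidates:
        return None
    candidates.sort(key=lambda c: (c[0], c[1]))
    _, size, f, r = candidates[0]
    return f, r, size


def genes_in_window(genes, chrom, position, flank=20_000):
    """List genes overlapping position +/- flank; genes are (name, chrom, start, end)."""
    return [name for name, g_chrom, g_start, g_end in genes
            if g_chrom == chrom
            and g_end >= position - flank
            and g_start <= position + flank]


# Hypothetical example data
fwd = [Hit("4A", 1000, 1020, 0), Hit("4A", 500000, 500020, 1)]
rev = [Hit("4A", 1400, 1420, 0), Hit("3B", 1300, 1320, 0)]
genes = [("geneA", "4A", 15000, 18000),
         ("geneB", "4A", 30000, 32000),
         ("geneC", "3B", 1000, 2000)]

pair = best_pair(fwd, rev)
near = genes_in_window(genes, "4A", pair[0].start)
wide = genes_in_window(genes, "4A", pair[0].start, flank=300_000)
```

Widening `flank` to 300 kb, as done for chromosome 3B in the text, simply enlarges the search interval around the marker position.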

To provide an overview of the gene expression results, heatmaps were drawn using the data retrieved from Wheat Expression (www.wheat-expression.com/) and the R package NMF 0.21.0 [75]. The information used was generated by Liu et al. [22], Ma et al. [26], and Galvez et al. [27]. The Liu et al. [22] experimental seedling samples, grown in controlled conditions, correspond to NCBI SRA ID SRP045409 (control (IS), heat and drought (PEG-induced drought) stress for 1 and 6 h (PEG1 and PEG6), respectively). The Ma et al. [26] experimental samples, grown in a shelter, correspond to NCBI SRA ID SRP102636 (anther stage irrigated leaf phenotype (AD\_C), anther stage drought-stressed leaf phenotype (AD\_S), tetrad stage irrigated developing spike phenotype (T\_C), and tetrad stage drought-stressed developing spike phenotype (T\_S)). The Galvez et al. [27] flag leaf samples from field-grown plants correspond to NCBI SRA ID SRP119300 (irrigated (IF), mild stress (MS), and severe stress (SS) flag leaf samples). The transcripts per kilobase million (TPM) of each gene under every condition were calculated as the mean TPM of the experiments belonging to that condition. A differential gene expression (DGE) analysis was performed using RefSeq v1 [33] gene models through two bioinformatic pipelines: Kallisto (version 0.43.0) with the R library "sleuth" (version 0.28.1), and STAR with the R library DESeq2 (version 1.14.1). A consensus threshold for the two pipelines (|log2FC, β| > 1 and *p*-adjust, Q-value < 0.05 [27]) was used.
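The thresholding and averaging steps can be illustrated with a minimal Python sketch. It assumes that per-gene results from both pipelines have already been exported as simple mappings, and that a gene is retained only when it passes the threshold in both pipelines; this reading of "consensus", as well as the function names and example values, are assumptions for illustration.

```python
from statistics import mean


def passes(effect, adjusted_p, min_effect=1.0, max_p=0.05):
    """Apply the threshold |effect| > 1 and adjusted p (or q-value) < 0.05."""
    return abs(effect) > min_effect and adjusted_p < max_p


def consensus_de(sleuth_results, deseq2_results):
    """Genes passing the threshold in both pipelines.

    sleuth_results: gene -> (beta, q-value)
    deseq2_results: gene -> (log2 fold change, adjusted p-value)
    """
    shared = sleuth_results.keys() & deseq2_results.keys()
    return sorted(g for g in shared
                  if passes(*sleuth_results[g]) and passes(*deseq2_results[g]))


def mean_tpm(tpm_by_experiment):
    """Mean TPM of a gene across the experiments of one condition."""
    return mean(tpm_by_experiment)


# Hypothetical example values
sleuth = {"g1": (1.5, 0.01), "g2": (0.5, 0.001), "g3": (-2.0, 0.2)}
deseq2 = {"g1": (1.2, 0.03), "g2": (2.0, 0.001), "g3": (-2.5, 0.01)}

selected = consensus_de(sleuth, deseq2)
condition_tpm = mean_tpm([10.0, 14.0, 12.0])
```

A stricter variant could additionally require the two effect estimates to share the same sign; this is not done here.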

#### *2.4. Wheat Variability Assessment*

The PCR amplification protocol was the one previously described for HRM pattern type assessment. Sample genotyping was performed using the Melt and HRM analysis options of the Rotor-Gene™ 6000 software. PCR was repeated three times to confirm the amplifications. Results were then corroborated with the Rotor-Gene™ ScreenClust HRM™ Software. In those cases where the number of genotypes assigned was unclear, ScreenClust HRM™ Software was also used for the final decision. A binary matrix of the genotyping results was created and then analyzed using two software packages, PhylTools v.1.32 (Wageningen Agricultural University, The Netherlands) and PowerMarker v.3.25 [76]. The first was used to calculate the genetic distances for haploid data with "individuals" as hierarchy level and the Nei index [77]. PowerMarker was used to obtain the mean allele number, mean gene diversity, and Polymorphism Information Content (PIC). The unweighted pair group method with arithmetic mean (UPGMA) clustering was carried out using the NEIGHBOR module in the Phylip 3.695 package [78] with default parameters. The final dendrogram was drawn from the genotyping results using MEGA v.6.0 [79].
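For reference, gene diversity and PIC can be computed from the allele frequencies of a marker as sketched below in Python. The code uses the textbook definitions (gene diversity as 1 − Σp², and PIC as gene diversity minus Σ 2p<sub>i</sub>²p<sub>j</sub>² over allele pairs); PowerMarker may apply additional sample-size corrections, so this is an illustration and not a reimplementation of that software. Function names and the example calls are hypothetical.

```python
from collections import Counter


def allele_frequencies(calls):
    """Allele frequencies for one marker; missing calls (None) are ignored."""
    observed = [c for c in calls if c is not None]
    if not observed:
        raise ValueError("no genotype calls available for this marker")
    counts = Counter(observed)
    return [n / len(observed) for n in counts.values()]


def gene_diversity(freqs):
    """Expected heterozygosity: 1 minus the sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def pic(freqs):
    """Polymorphism information content for one marker."""
    cross = sum(2 * freqs[i] ** 2 * freqs[j] ** 2
                for i in range(len(freqs))
                for j in range(i + 1, len(freqs)))
    return gene_diversity(freqs) - cross


# Hypothetical HRM classes assigned to eight lines for one marker
calls = ["A", "A", "B", "B", "A", "B", None, None]
freqs = allele_frequencies(calls)
```

For two equally frequent alleles, as in this example, the gene diversity is 0.5 and the PIC is 0.375.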

The goodness of fit of the UPGMA tree was assessed with the Cophenetic Correlation Coefficient (CCC) using a Visual Basic program for Microsoft Excel 2000 [80]. The CCC was calculated from the linear regression between the corresponding values of the original distance matrix and the cophenetic matrix derived from the UPGMA tree.
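The CCC described above can be reproduced with a few lines of code. The sketch below is a minimal pure-Python illustration, not the Visual Basic program cited in the text: it builds the UPGMA tree from a symmetric distance matrix, records the height at which each pair of items is first joined (the cophenetic distance), and returns the Pearson correlation between original and cophenetic distances. The function names and the example matrix are hypothetical.

```python
from itertools import combinations
from math import sqrt

def upgma_cophenetic(dist):
    """Run UPGMA on a symmetric distance matrix and return the cophenetic matrix."""
    n = len(dist)
    clusters = {i: [i] for i in range(n)}  # cluster id -> member items
    d = {frozenset((i, j)): dist[i][j] for i, j in combinations(range(n), 2)}
    coph = [[0.0] * n for _ in range(n)]
    next_id = n
    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        a, b = min(combinations(sorted(clusters), 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))]
        # Items joined for the first time get this merge height.
        for i in clusters[a]:
            for j in clusters[b]:
                coph[i][j] = coph[j][i] = height
        # Average linkage: size-weighted mean of the distances to the merged pair.
        size_a, size_b = len(clusters[a]), len(clusters[b])
        for c in clusters:
            if c not in (a, b):
                d[frozenset((next_id, c))] = (
                    size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
                ) / (size_a + size_b)
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return coph

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def cophenetic_correlation(dist):
    """CCC: correlation between original and cophenetic pairwise distances."""
    coph = upgma_cophenetic(dist)
    pairs = list(combinations(range(len(dist)), 2))
    return pearson([dist[i][j] for i, j in pairs], [coph[i][j] for i, j in pairs])

# Hypothetical distance matrix for three lines.
example = [[0, 1, 4], [1, 0, 2], [4, 2, 0]]
print(round(cophenetic_correlation(example), 3))
```

A CCC close to 1 indicates that the tree preserves the pairwise distances well; lower values mean that the dendrogram distorts part of the original distance structure.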

#### **3. Results**

#### *3.1. Markers Sequence Validation and HRM Pattern Types Assignment*

To validate the marker sequences, difference plots and normalized HRM curves from the high-resolution melting analyses were obtained for each of the six ISBP markers for wheat chromosome 3B and compared with their corresponding amplicon sequences. The HRM profiles were successfully validated. Sequence polymorphisms of one to five nucleotides were detected by the HRM analyses. These included transitions (examples are shown in Figure 1a,c,f), transversions (Figure 1e), and both (Figure 1b,d).
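The distinction between transitions and transversions used above follows from the chemical class of the two bases: a transition exchanges two purines (A, G) or two pyrimidines (C, T), while a transversion exchanges a purine for a pyrimidine. A minimal Python sketch of this classification is given below; the function name is hypothetical, and the example uses the five variations reported for marker *HRM3B\_124761338*.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(ref, alt):
    """Return 'transition' or 'transversion' for a single-base change."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not ({ref, alt} <= (PURINES | PYRIMIDINES)):
        raise ValueError(f"not a valid substitution: {ref}/{alt}")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

# Variations reported for marker HRM3B_124761338.
variations = [("C", "T"), ("G", "A"), ("A", "G"), ("T", "A"), ("C", "A")]
counts = Counter(classify_substitution(ref, alt) for ref, alt in variations)
print(dict(counts))
```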

**Figure 1.** Sequence validation for insertion site-based polymorphisms (ISBP) markers using high-resolution melting (HRM) analysis. Normalized curves (on the right) and difference plots (on the left) are shown for 6 ISBP markers for wheat chromosome 3B, with different nucleotide variations: (**a**) marker *HRM3B\_273339424*: single nucleotide transition (A/G); (**b**) marker *HRM3B\_124761338*: five nucleotide variations, three transitions (C/T, G/A, and A/G) and two transversions (T/A and C/A); (**c**) marker *HRM3B\_203288704*: three nucleotide transitions (A/G, G/A, and A/G); (**d**) marker *HRM3B\_465802537*: two nucleotide variations, one transversion (A/T) and one transition (G/A); (**e**) marker *HRM3B\_609364064*: one transversion (T/G); and (**f**) marker *HRM3B\_331497483*: two transitions (T/C and C/T).

The HRM pattern type assignment was based on the pattern of the normalized high-resolution melting curves obtained and on their potential to genotype a high number of varieties. The ISBP HRM patterns were classified into four different types (Table 1, Figure 2): (i) pattern type A, excellent markers to genotype a high number of wheat varieties simultaneously, whose HRM curves are very different and easily separated into classes (Figure 2a); (ii) pattern type B, good markers to differentiate several groups of varieties (genotypes) in the same run, whose HRM curves can be easily differentiated (Figure 2b); (iii) pattern type C, good HRM markers, but not recommended for genotyping a broad variety of samples, because the differentiation between HRM curves is not clear in all cases (Figure 2c); and (iv) pattern type D, assigned to markers which are not recommended for HRM genotyping, due to a low amplification efficiency or to a gradient of HRM curves that hinders precise classification into classes (Figure 2d). Twelve ISBP markers were classified as HRM pattern type A, 4 markers showed HRM pattern type B, 10 markers had HRM pattern type C, and 13 were assigned pattern type D (Table 1). Four of the 45 wheat chromosome 4A ISBP primer pairs designed (Table 2) were discarded from the rest of the analyses due to unreliable PCR amplification.

**Figure 2.** High resolution melting pattern type assessment for wheat chromosome 4A ISBP markers. Normalized HRM curves for 19 samples are shown for (**a**) marker *HRM4A\_67413676* (pattern type A); (**b**) marker *HRM4A\_618105078* (pattern type B); (**c**) marker *HRM4A\_317085557* (pattern type C); and (**d**) marker *HRM4A\_291420130* (pattern type D).





wheat varieties simultaneously; **C**: good marker but not recommended for a broad variety of wheat samples; **D**: marker not recommended for HRM genotyping; and **n**/**s** for markers non-suitable for HRM); Amp. size: amplicon size in base pairs (bp); **\***: ISBP markers used in the durum wheat collection variability assessment.

**Table 2.** Wheat chromosome 4A ISBP markers linked to genes which alter their expression in response to water stress treatments. DE genes are shown in bold. The positive and negative values in "*Dist*" column indicate if the corresponding gene is downstream or upstream of the marker.




Chr: chromosome location; Dist (bp): distance from the gene to the marker in base pairs (bp); TPMs: Transcripts Per Kilobase Millions; and \*: genes manually annotated.

#### *3.2. Candidate Gene Analysis*

After mapping the ISBP markers and comparing them with the MQTL positions previously described in Acuña-Galindo et al. [71], some markers for the wheat chromosome 4A were found in the proximity of interesting QTLs or within MQTLs related to drought and heat stress tolerance, as well as QTLs for yield components (Figure 3a). Two ISBP markers, *HRM4A\_317085557* and *HRM4A\_460238681*, were located within MQTL30 [71], related to the physiological drought trait and root vigor. Markers *HRM4A\_617938526* and *HRM4A\_618105078* were placed in MQTL31 [71], in proximity to QTLs related to drought and heat stresses. The marker *HRM4A\_660524139* was placed close to two of the QTLs located within MQTL31, associated with traits for yield component and coleoptile vigor, and also heat stress. Marker *HRM4A\_583704598* was placed between MQTL30 and MQTL31, in proximity to two QTLs related to drought and heat stresses. Finally, markers *HRM4A\_681664894* and *HRM4A\_683608822* were placed in MQTL32 [71], close to QTLs related to drought; and markers *HRM4A\_702156718*, *HRM4A\_714743756*, and *HRM4A\_716986193* were found close to a QTL located in MQTL32, related to drought tolerance (Figure 3a).

**Figure 3.** Physical and genetic maps for markers located on chromosomes 4A (**a**) and 3B (**b**), including the location of meta-quantitative trait loci (MQTLs) [71] associated with heat and drought stress tolerance. Markers used in the variability assessment are shown in blue. MQTLs are indicated using green lines.

Two of the wheat chromosome 3B ISBP markers (*HRM3B\_465802537* and *HRM3B\_609364064*) were mapped within MQTL26 [71], in proximity to QTLs controlling yield component and biomass traits, and also related to heat stress (Figure 3b).

Additionally, as a result of the candidate gene analysis, the developed ISBP markers for wheat chromosome 4A mapped in the proximity of 61 genes (Supplementary Materials Table S3 and Figure S1). The chromosome positions of the ISBP markers and their closest genes are shown in Figure 4a.

**Figure 4.** Physical maps for the wheat chromosomes 4A (**a**) and 3B (**b**) showing the molecular markers location and their nearest genes found within a window of +/−20 kb and +/−300 kb, respectively. ISBP markers used in wheat variability assessment are highlighted in blue.
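The window-based candidate gene search can be illustrated with a short Python sketch. It is not the procedure used in the study: the gene coordinates, the marker position, and the function names are hypothetical, and the sign convention (positive when the gene lies at higher coordinates than the marker) is an assumption made only for this example. The window size is a parameter, so the same function covers the +/−20 kb window used for chromosome 4A and the +/−300 kb window used for chromosome 3B.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Gene:
    gene_id: str
    start: int  # 1-based, inclusive
    end: int

def signed_distance(marker_pos, gene):
    """0 if the marker lies inside the gene; otherwise the gap in bp.

    Sign convention (an assumption of this sketch): positive when the gene
    lies at higher coordinates than the marker, negative otherwise.
    """
    if gene.start <= marker_pos <= gene.end:
        return 0
    if gene.start > marker_pos:
        return gene.start - marker_pos
    return gene.end - marker_pos

def genes_in_window(marker_pos, genes, window_bp):
    """Genes whose nearest edge is within window_bp of the marker, closest first."""
    hits = [(g.gene_id, signed_distance(marker_pos, g)) for g in genes]
    hits = [(gene_id, dist) for gene_id, dist in hits if abs(dist) <= window_bp]
    return sorted(hits, key=lambda hit: abs(hit[1]))

# Hypothetical coordinates, for illustration only.
genes = [
    Gene("geneA", 100_500, 102_000),
    Gene("geneB", 70_000, 79_000),
    Gene("geneC", 85_000, 95_000),
]
print(genes_in_window(100_000, genes, window_bp=20_000))
```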

Based on gene expression differences under different drought stress conditions, we filtered 23 genes (37.7% of the total genes) with a TPM value above 2.5 (Table 2 and Figure 5). An expression heatmap built from all the available public RNASeq data on wheat drought responses [22,26,27] is shown in Figure 5.

**Figure 5.** Gene expression analysis under different water stress conditions for candidate genes located within a window of +/−20 kb to the wheat chromosome 4A ISBP markers. Differentially expressed genes (DEGs) are shown in bold. IF: irrigated field conditions; MS: mild stress field condition; SS: severe stress field condition [27]; IS: seedling control; PEG1: seedling 1 h PEG stress; PEG6: seedling 6 h PEG stress [22]; AD\_C: anther stage irrigated shelter phenotype; AD\_S: anther stage drought stressed shelter phenotype; T\_C: tetrad stage irrigated shelter phenotype; and T\_S: tetrad stage drought shelter phenotype [26].

The wheat chromosome 3B ISBP markers mapped in the proximity of 49 genes (Supplementary Materials Table S4 and Figure S2). The closest genes are shown next to the corresponding marker in Figure 4b. Seventeen of these genes (34.69% of the total) showed TPM values above 2.5 (Table 3 and Figure 6). The drought responsive genes mapped by ISBP markers located in proximity or within QTLs or MQTLs are shown in Table 4.

**Figure 6.** Gene expression analysis under different stress conditions for candidate genes located within a +/−300 kb window and in proximity to wheat chromosome 3B ISBP markers. Differentially expressed genes (DEGs) are shown in bold. IF: irrigated field conditions; MS: mild stress field conditions; SS: severe stress field conditions [27]; IS: seedling PEG shock control; PEG1: seedling 1 h PEG stress; PEG6: seedling 6 h PEG stress [22]; AD\_C: anther stage irrigated shelter phenotype; AD\_S: anther stage drought stressed shelter phenotype; T\_C: tetrad stage irrigated shelter phenotype; and T\_S: tetrad stage drought shelter phenotype [26].


**Table 4.** Drought responsive candidate genes tagged by the developed HRM-ISBP markers. DE genes are shown in bold and differentially expressed genes are indicated with "\*". Gene expression responses to drought treatments are shown in Figures 5 and 6, and Tables 3 and 4. The positive and negative "Dist" column values indicate if the corresponding gene is downstream or upstream of the marker.



**Table 3.** Wheat chromosome 3B ISBP markers linked to genes which alter their expression in response to water stress treatments. DE genes are shown in bold.



After the differential gene expression analysis, we obtained 5 DE genes on chromosome 4A and 3 on chromosome 3B, which were up- or down-regulated by the PEG drought treatment [22]. The chromosome 4A DE genes *TraesCS4A01G003500* and *TraesCS4A01G043500* were up- and down-regulated under PEG6 drought treatment, respectively. Genes *TraesCS4A01G047000* and *TraesCS4A01G410700* were up-regulated, while the gene *TraesCS4A01G069200* was down-regulated under PEG6 treatment. Chromosome 3B gene *TraesCS3B01G221100* was down-regulated under PEG6 treatment, while genes *TraesCS3B01G290200* and *TraesCS3B01G290300* were up-regulated (Supplementary Materials Table S5).

#### *3.3. Wheat Variability Assessment by High Resolution Melting Analysis*

To assess the polymorphism levels in a wheat collection, the 45 ISBP markers developed for the wheat chromosome 4A (Table 1) were evaluated. Thirteen of them were selected based on their reproducibility and polymorphism (Table 1) and were used in HRM analyses to study the genetic diversity among durum and bread wheat lines in panel 1 (Supplementary Materials Table S1).

The number of alleles detected for these markers ranged from 2 to 7 (mean = 3.38), and the polymorphism index content varied between 0.24 and 0.68 (mean = 0.52) (Table 5).


**Table 5.** Genetic parameters for the wheat chromosome 4A ISBP markers used in the variability assessment.

HRM type: high resolution melting pattern type; PIC: polymorphism index content; \*: one of the alleles was described as "null genotype".

The cluster analysis shows three clearly differentiated clusters (Figure 7). The first one contains 30 *T. aestivum* lines, while the remaining 8 bread wheat lines (TaesIN-13, TaesLI-06, TaesLI-07, TaesIF-06, TaesIF-08, TaesIF-07, TaesLI-05, and TspeBO-01) were placed within the second cluster. This cluster also contains *Triticum durum* and *Triticum dicoccoides* accessions, while *Triticum monococcum* and *Triticum urartu* were placed in the third cluster (Figure 7). The cophenetic correlation coefficient obtained for the UPGMA tree was 0.86.

**Figure 7.** Phylogenetic unweighted pair group method with arithmetic mean (UPGMA) tree showing the relationships among 62 wheat lines genotyped with 13 ISBP markers. *Triticum aestivum* lines are shown in blue, *T. durum* and *T. dicoccoides* in green, and *T. monococcum* in orange.

Seven of these ISBP markers (Table 1), selected for their efficiency and polymorphism in durum wheat, were used for the variability assessment of durum wheat lines in panel 2 (Supplementary Materials Table S2). The UPGMA tree for the wheat panel 2 is shown in Supplementary Materials Figure S3a,b. This cluster analysis resulted in 9 differentiated clusters. Five lines (BGE002866, BGE013055, BGE020464, BGE013652, and BGE013722) were not placed in any of these clusters. The observed distribution of durum wheat lines across the clusters could be associated in some cases with geographical location and agroclimatic areas (Supplementary Materials Figure S3c). There are two clear clusters (clusters 4 and 6) where lines from southern Spain are represented in a larger proportion. Furthermore, wheat lines placed in cluster 1 are mainly located in northern temperate zones without a dry season and with a temperate summer, while cluster 4 contains landraces mainly located in southern temperate areas with a dry and cool summer. The CCC obtained for the UPGMA tree was 0.67. The number of alleles detected in this analysis ranged from 3 to 6 (mean allele number = 3.86), and the PIC mean value was 0.486 (Supplementary Materials Table S6).

#### **4. Discussion**

Insertion site-based polymorphism markers have been described as a useful tool for wheat genomic studies [35] and an attractive alternative to markers such as SSRs or SNPs, due to the high repetitive content of some cereal genomes [81,82]. Owing to their straightforward design and high polymorphism, previous studies have developed and used specific ISBP markers for several wheat chromosomes and for different purposes: Barabaschi et al. [83] for the bread wheat chromosome 5A, focused on polymorphism assessment; Lucas et al. [84], who used markers derived from the wheat chromosome 1A to map this chromosome and for marker-assisted breeding; Li et al. [36], who applied wheat chromosome 3B ISBP markers linked to mildew resistance genes in durum wheat; and Sehgal et al. [41], who used chromosome 3A ISBP markers for gene discovery and physical mapping. In this work, new ISBP markers were developed for the wheat chromosome 4A, which contains interesting genes related to biotic and abiotic stresses, such as drought tolerance [65,68,69]. It also harbors loci related to essential agronomic traits such as yield and grain quality [62,64,66]. The ISBP markers proved highly polymorphic (Table 5). Thirteen of the developed ISBP markers were used for the wheat variability assessment and showed melting curve polymorphisms, seven of them with a PIC value higher than 0.50. Thus, they proved to be high-resolution tools for wheat variability assessment using a cost-effective technique such as high resolution melting (HRM) analysis. This technique is described as an optimized methodology for melting curve assessment, which allows the determination of the melting temperature and profile of an amplicon [85]. Some studies have highlighted advantages of the HRM technique, such as its reduced cost per sample in comparison with other techniques used for SNP detection [86–88]. It is worth noting that the required system consists of standard and affordable real-time PCR equipment, which is suitable for in-house genotyping and adequate for small and medium-sized breeders. Other advantages of HRM are its excellent results in the detection of homozygous and heterozygous variants [46,89–91]; its use for gene mapping and for SNP and mutation detection [46,90,92]; and its efficiency in the identification of species and closely related varieties [88,93–97]. In this regard, our results from the ISBP marker sequence validation, as well as from the HRM genotyping, support the efficiency of HRM analysis in differentiating wheat varieties. Our results are in agreement with Dong et al. [53], who highlighted that HRM does not require any digestion or gel electrophoresis, so it provides a worthwhile approach for SNP/indel genotyping of different varieties without the prior sequence knowledge required by other methods. Results from the wheat genetic diversity assessment confirmed that HRM is a convenient first screening to determine variability groups, prior to resequencing only representative varieties as the basis to develop other SNP platforms. Nevertheless, some HRM limitations have been pointed out. Distefano et al. [88] highlighted that HRM profiles can sometimes be similar, preventing the differentiation of some genotypes. Wu et al. [98], however, proposed that this issue can be solved using mixed strategies. Accordingly, our results show that the combined use of different ISBP markers can differentiate all the wheat lines studied.

Twenty ISBP markers for wheat chromosome 4A and six for chromosome 3B mapped to interesting candidate genes, mainly related to drought and heat stresses and yield components (Tables 2 and 3). These genes were validated using data from publicly available RNASeq studies on wheat drought responses [22,26,27], as they showed differences in their expression under different stress conditions.

In chromosome 4A, the ISBP marker *HRM4A\_109848074* mapped next to (97 bp from) the gene *TraesCS4A01G098300*, which encodes a xylosyltransferase 1 and participates in carbohydrate metabolism during cell wall development [99]. This process is markedly affected by water stress [100,101], as can be observed in Figure 5, where this gene decreases its expression under severe stress field conditions. This result is also in agreement with Abebe et al. [102], who analyzed spikes of barley grown under controlled drought conditions and also found that this gene was down-regulated. The marker *HRM4A\_2791416* mapped 814 bp upstream of the gene *TraesCS4A01G003600*, which encodes an alpha/beta-hydrolase superfamily protein with functional adaptability in plants [103]. This gene reduces its expression as drought stress increases under field conditions (Figure 5). Marker *HRM4A\_67413676* mapped 975 bp from the gene *TraesCS4A01G069200*, which encodes an armadillo repeat-only protein. This kind of repeat protein participates in the coordination of protein interactions during stress and hormonal signalling in plants [104]. In agreement with this, the gene was downregulated under PEG6 drought treatment (Supplementary Materials Table S5). The marker *HRM4A\_683608822*, which was located within the drought stress tolerance MQTL32 [71] (Table 5 and Figure 3a), mapped 2685 bp upstream of the gene *TraesCS4A01G410700*, which encodes a ras-related protein RABC2a. The function of this gene has been related to ABA-induced stress tolerance in barley [105]. Consistent with a role in the drought stress response, this gene was upregulated under PEG6 drought treatment (Supplementary Materials Table S5). The marker *HRM4A\_36371442* mapped 2736 bp upstream of the gene *TraesCS4A01G043500*, which was downregulated under PEG6 drought treatment (Supplementary Materials Table S5). In contrast, this gene increases its expression under severe stress field conditions (Figure 5). This gene encodes a STAS domain-containing protein, which plays a role in the membrane attachment of many anion transporters and in transport activity and regulation in plants [106]. In fact, its crucial role in the activity of the sulfate transporter in *Arabidopsis thaliana* has been demonstrated [107], as it provides key amino acids for the sulfate transport activity [108]. The importance of this activity should be noted, since sulfate has been described as an essential component in the structure of plant enzymes and reserve proteins in grain [109]. Further, marker *HRM4A\_716986193*, which was located close to a QTL within MQTL32 [71] (Table 5 and Figure 3a), mapped in proximity (3576 bp upstream) to the gene *TraesCS4A01G671200LC*. This gene encodes a peptidase M20/M25/M40 family protein, which is involved in drought stress responses [110]. In fact, proteolysis under drought conditions allows a reorganization of the plant's metabolism and also increases plant drought tolerance [20,111,112]. Accordingly, the results showed increased expression of this gene under different drought stress treatments (Figure 6). This is also in agreement with the results shown by Simova-Stoilova et al. [113], who assessed wheat leaves under severe soil drought and found an increase in peptidase activity. Thus, this drought-responsive gene represents an interesting candidate for a known drought and heat stress tolerance MQTL.

Additionally, within the window of +/−20 kb used for the candidate gene analysis in chromosome 4A, there were two interesting genes: *TraesCS4A01G003500* (5141 bp upstream from marker *HRM4A\_2791416*) and *TraesCS4A01G047000* (14,885 bp upstream from marker *HRM4A\_38654555*). The first gene encodes a thionin-like protein, which plays an important role in the growth and development of the plant and in its defense against pathogens [114]. It was found to be differentially expressed under PEG drought treatment, being upregulated under PEG6 drought treatment (Supplementary Materials Table S5). The gene *TraesCS4A01G047000* encodes a formin-like protein, which plays a primary role in the organization of the plant's structure [115]. In agreement with our results, in which this gene was found to be upregulated under PEG6 drought treatment (Supplementary Materials Table S5), formin-like proteins have shown variations in their expression under drought conditions in wheat [115].

Some HRM/ISBP markers mapped to previously described MQTLs [71] (Table 5 and Figure 3a), which were mainly associated with drought and heat stress tolerance. Among these markers, *HRM4A\_618105078* can be highlighted: it tags MQTL31 [71] and mapped close (2442 bp upstream) to the gene *TraesCS4A01G497800LC*. This gene encodes a receptor-like protein kinase, which is involved in abiotic stress responses [116], matching the description assigned to the MQTL.

Additionally, within the window of +/−300 kb, four wheat chromosome 3B ISBP markers can be highlighted. The marker *HRM3B\_124761338* mapped 6356 bp downstream of the gene *TraesCS3B01G138700*, which encodes a ribonucleoside-diphosphate reductase, an essential enzyme for DNA synthesis [117,118]. This gene increases its expression under severe field stress conditions and decreases it in response to PEG drought treatment (Figure 6). Marker *HRM3B\_609364064* mapped 31,936 bp downstream of the gene *TraesCS3B01G575000LC*, encoding myosin-1. Plant myosins have a functional role in organelle movement in response to biotic and abiotic stresses [119]. This response is shown in Figure 6, where this gene significantly increases its expression under PEG1 and PEG6 drought treatments. Marker *HRM3B\_465802537* mapped to two interesting genes, *TraesCS3B01G290200* (83,072 bp downstream) and *TraesCS3B01G290300* (125,805 bp downstream), which were both found to be upregulated under PEG6 drought treatment (Supplementary Materials Table S5) but, in contrast, decreased their expression under severe stress field conditions (Figure 6). The gene *TraesCS3B01G290200* encodes a glycosyltransferase, an enzyme with a main role in plant stress tolerance and defense [120], which, in agreement with our results, has previously been found to be upregulated in wheat leaves under drought conditions [121]. The gene *TraesCS3B01G290300* encodes an ABC transporter B family protein, which is significantly involved in organ growth, plant nutrition and development, and plant responses to abiotic stresses [122]. In fact, as Rampino et al. [123] highlighted, and in agreement with our results (Supplementary Materials Table S6), the up-regulation of this gene in wheat under heat and drought conditions confirms the implication of this gene family in drought responses. Moreover, this protein family has been related to grain formation in wheat [124], which is consistent with the location of marker *HRM3B\_465802537* within MQTL26 [71], mainly related to yield components. Thus, this marker can be useful in wheat breeding for the marker-assisted selection of this interesting gene and MQTL. Finally, the marker *HRM3B\_273339424* mapped 89,674 bp upstream of the gene *TraesCS3B01G221100*, which encodes a protein kinase superfamily protein. This protein family is involved in plant responses to abiotic stresses and in plant development [125]. Accordingly, this gene was found to be downregulated under PEG6 drought treatment (Supplementary Materials Table S5), and it also shows differences in its expression across different stress conditions (Figure 6). These results agree with Wei et al. [126], who highlighted that kinase proteins are involved in various responses depending on the duration of drought exposure.

According to our results, the developed HRM-ISBP markers can be used in wheat breeding programs to genotype interesting genomic regions in a cost-effective manner. These markers can be useful resources for marker-assisted selection (MAS) focused on abiotic stress responses and yield components, to tag interesting known QTLs and MQTLs related to drought and heat stress tolerance, and also to yield-related traits.

#### **5. Conclusions**

In this work, highly polymorphic ISBP markers for wheat chromosome 4A were successfully developed and applied in a genetic variability assessment of a collection of durum and bread wheats, using the high-resolution melting analysis technique. These HRM-ISBP markers represent cost-effective and efficient tools for wheat breeding programs focused on variability assessments. The results obtained provide an interesting framework for wheat genetic studies and variety selection. These HRM-ISBP markers have also proved useful for tagging interesting genes associated with drought and heat stress tolerance, some of which showed differential expression patterns under stress conditions. In addition, some of these markers can be applied in breeding through marker-assisted selection of QTLs and MQTLs related to abiotic stresses such as drought and heat, and also to yield and yield-related traits. The resources and results presented here can also facilitate the understanding of important traits in other species with large genomes.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2073-4395/10/9/1294/s1; Table S1. List of durum and bread wheat lines used for variability analysis by high resolution melting. 1: marker set up analysis; 2: marker's sequence validation by HRM analysis; 3: HRM pattern types obtained; IFAPA: Instituto Andaluz de Investigación y Formación Agraria, Pesquera, Alimentaria y de la Producción ecológica; INRA: Institut National de la Recherche Agronomique; and WGRC: Wheat Genetics Resource Center; Table S2. Durum wheat lines used for variability analysis by high resolution melting (HRM); Table S3. Candidate genes tagged by the developed HRM-ISBP markers for wheat chromosome 4A. Genes with expression differences are shown in bold and DE genes are indicated with "\*". The positive and negative values in the "Dist" column indicate if the corresponding gene is downstream or upstream of the respective marker. Chr: chromosome location. Dist (bp): distance from the gene to the marker in base pairs; Table S4. Candidate genes tagged by the HRM-ISBP markers for wheat chromosome 3B. Genes with expression differences are shown in bold and DE genes are indicated with "\*". The positive and negative values in the "Dist" column indicate if the corresponding gene is downstream or upstream of the respective marker. Chr: chromosome location. Dist (bp): distance from the gene to the marker in base pairs; Table S5. Differential expression (DE) analysis significance parameters for DE genes. Significant values (|lg2FC, β|> 1 and *p*-adjust, Q-value < 0.05) are shown in bold; Table S6. Genetic parameters for wheat chromosome 4A ISBP markers used in durum wheat variability assessment. HRM type: high resolution melting pattern type; No. alleles: number of alleles found with the marker; PIC: polymorphism index content; and \*: one of the genotypes was described as "null genotype"; Figure S1. 
Gene expression analysis under different water stress conditions for all candidate genes located within a +/−20 kb window to the wheat chromosome 4A ISBP markers. Genes with differences on their expression are shown in bold and DE genes are indicated with "\*". IF: irrigated field conditions; MS: mild stress field conditions; SS: severe stress field conditions; IS: seedling PEG shock control; PEG1: seedling 1 h PEG stress; PEG6: seedling 6 h PEG stress; AD\_C: anther stage irrigated shelter phenotype; AD\_S: anther stage drought stressed shelter phenotype; T\_C: tetrad stage irrigated shelter phenotype; and T\_S: tetrad stage drought shelter phenotype; Figure S2. Gene expression analysis under different water stress conditions for all candidate genes located within a +/−300 kb window to the wheat chromosome 3B ISBP markers. Genes with differences on their expression are shown in bold and DE genes are indicated with "\*". IF: irrigated field conditions; MS: mild stress field conditions; SS: severe stress field conditions; IS: seedling PEG shock control; PEG1: seedling 1 h PEG stress; PEG6: seedling 6 h PEG stress; AD\_C: anther stage irrigated shelter phenotype; AD\_S: anther stage drought stressed shelter phenotype; T\_C: tetrad stage irrigated shelter phenotype; and T\_S: tetrad stage drought shelter phenotype; Figure S3. Phylogenetic UPGMA tree showing the relationships between 76 durum wheat lines genotyped with 7 wheat chromosome 4A ISBP markers. a) wheat lines are colored based on their geographic zone (Supplementary Materials Table S2) (Center: green; North: blue; North-east: light blue; North-west: dark blue; South: red; South-east: orange; South-west: maroon; and East: purple); b) wheat lines are colored for species (*T. turgidum* subsp *durum* appears in green, *T. turgidum* subsp *turgidum* in orange and *T. turgidum* subsp. *dicoccon* in purple). 
The colored and vertical lines indicate the differentiated clusters (cluster 1—yellow; cluster 2—blue; cluster 3—dark blue; cluster 4—pink; cluster 5—purple; cluster 6—red; cluster 7—green; cluster 8—grey; cluster 9—light blue; and black lines show the wheat lines that have not been included within any cluster); and c) each dot represents a wheat line, the colors are the same in "a)".

**Author Contributions:** P.H., P.G., and E.P. conceived the experiments. R.M.-G. and L.P. isolated the DNA. E.P. sequenced ISBP amplicons and designed ISBP markers. R.M.-G. performed the PCR and HRM analyses. S.G. performed the bioinformatics analyses. R.M.-G., S.G., E.P., G.D., L.P., P.G., and P.H. analyzed the results. R.M.-G. and P.H. drafted the manuscript. R.M.-G., S.G., E.P., G.D., L.P., P.G., and P.H. have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by project P18-RT-992 from Junta de Andalucía (Andalusian Regional Government), Spain (Co-funded by FEDER), and by the Spanish Ministry of Science and Innovation project PID2019-109089RB-C32.

**Acknowledgments:** The authors gratefully acknowledge Agrovegetal S. A. (Spain), Instituto Andaluz de Investigación y Formación Agraria (IFAPA, Spain), Royal Botanic Gardens of Cordoba (Spain), INRA-Clermont-Ferrand and INRA-Versailles (Institut National de la Recherche Agronomique, France), Limagrain (France and Spain), the Wheat Genetics Resource Center (WGRC Kansas State University, USA), and CFR-INIA (Spanish National Plant Genetic Resources Center) for providing the seeds resources.

**Conflicts of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

#### **References**

1. Lawlor, D.W. Wheat and Wheat Improvement. *Soil Sci.* **1988**, *146*, 292–293. [CrossRef]


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

MDPI St. Alban-Anlage 66 4052 Basel Switzerland Tel. +41 61 683 77 34 Fax +41 61 302 89 18 www.mdpi.com

*Agronomy* Editorial Office E-mail: agronomy@mdpi.com www.mdpi.com/journal/agronomy
