Article
Peer-Review Record

Genotyping-by-Sequencing to Unlock Genetic Diversity and Population Structure in White Yam (Dioscorea rotundata Poir.)

Agronomy 2020, 10(9), 1437; https://doi.org/10.3390/agronomy10091437
by Ranjana Bhattacharjee 1,*, Paterne Agre 1, Guillaume Bauchet 2, David De Koeyer 3, Antonio Lopez-Montes 1,4, P. Lava Kumar 1, Michael Abberton 1, Patrick Adebola 5, Asrat Asfaw 5 and Robert Asiedu 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4:
Submission received: 7 August 2020 / Revised: 12 September 2020 / Accepted: 14 September 2020 / Published: 22 September 2020
(This article belongs to the Section Crop Breeding and Genetics)

Round 1

Reviewer 1 Report

Dear authors,

I have several main criticisms of the article ‘Genotyping-by-sequencing to unlock genetic diversity and population structure in white yam (Dioscorea rotundata Poir.)’.

1. There are a lot of data but, paradoxically, few striking results. There is a population structure, but its biological explanation is not clear, and I have some criticisms of the methodology (see below).

2. Despite the filtering of SNPs according to their distribution across the chromosomes (with no detail given on the method), the SNP distribution across the 21 chromosomes is very heterogeneous. How do you explain this heterogeneity? Several regions contain more than 97 SNPs (red). Most of them are probably in linkage disequilibrium (LD). Did you estimate this LD? What was the impact of using SNPs in LD on the inference of population structure? I would suggest using SNPs with reduced LD to study population structure, in order to avoid over-representing genomic regions with a particular life history.
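
The suggestion to use SNPs with reduced LD can be sketched as a greedy pruning step. This is only an illustration under stated assumptions: LD is approximated by the squared correlation (r2) between genotype dosages (0/1/2), the 0.2 cutoff is arbitrary, the toy genotypes are invented, and the helper names `r_squared` and `ld_prune` are hypothetical. Dedicated tools usually prune within sliding windows along each chromosome rather than across all SNPs.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors (0/1/2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic SNP: correlation undefined, treated here as no LD
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def ld_prune(snps, max_r2=0.2):
    """Greedy pruning: keep a SNP only if its r2 with every kept SNP is <= max_r2."""
    kept = []
    for name, dosages in snps:
        if all(r_squared(dosages, kept_dosages) <= max_r2 for _, kept_dosages in kept):
            kept.append((name, dosages))
    return [name for name, _ in kept]


# Invented toy genotypes for six individuals (not data from the manuscript).
toy_snps = [
    ("snp1", [0, 1, 2, 0, 1, 2]),
    ("snp2", [0, 1, 2, 0, 1, 2]),  # identical to snp1, so r2 = 1 and it is dropped
    ("snp3", [0, 0, 0, 2, 2, 2]),  # uncorrelated with snp1, so it is kept
]
pruned = ld_prune(toy_snps)
print(pruned)
```

Running population-structure analyses on the pruned set and on the full set would show directly whether dense, linked regions drive the clusters.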

3. For population structure analysis, only one member of each clonal group should be kept, whereas all the genotypes were used in the manuscript. With methods like DAPC, which try to maximise the diversity between populations and minimise the diversity within populations, some clusters may appear only because of the presence of several clonal individuals.

4. The definition of the number of clusters in DAPC and ADMIXTURE is not convincing. As is often the case with BIC, there is no clear ‘elbow’. You justify the choice of K=3 by ‘a rapid decline from K=1 to K=3’ (line 195). However, there is a rapid decline from K=1 (BIC=7100) to K=2 (BIC=6550?), but the decline from K=2 to K=3 is comparatively slower (BIC=6525? for K=3). Therefore I would suggest studying K=2 (and maybe K=12, as there is a sudden drop in BIC between K=12 and K=13). As for the ADMIXTURE software, the ‘maximum delta K was detected at K=3’, but no evidence is shown, since it is the DAPC BIC plot (Fig. 3A) that is cited. How was the number of clusters determined in the ADMIXTURE analysis?
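
The point about the missing elbow can be checked numerically. A minimal sketch, using the approximate BIC values quoted above (the readings for K=2 and K=3 are uncertain, as the question marks indicate) and a hypothetical helper `bic_drops`:

```python
def bic_drops(bic_by_k):
    """Return the decrease in BIC between each pair of consecutive K values."""
    ks = sorted(bic_by_k)
    return {(a, b): bic_by_k[a] - bic_by_k[b] for a, b in zip(ks, ks[1:])}


# Approximate values read from Figure 3A; for illustration only.
bic = {1: 7100, 2: 6550, 3: 6525}
drops = bic_drops(bic)
for (a, b), drop in drops.items():
    print(f"K={a} -> K={b}: BIC decreases by {drop}")
```

With these readings the decrease from K=1 to K=2 is about twenty times larger than the decrease from K=2 to K=3, which is why K=2 looks like the better-supported elbow.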

5. There is a discrepancy between the Results and the Discussion. The Discussion focuses on admixture between groups, although this was not the main topic of the Results. The sentence ‘Higher admixed ancestry among breeding lines’ (line 312) is not even clearly related to any part of the Results. The Discussion should therefore be rewritten with this in mind.

 

Some other important concerns:

Abstract (Lines 28-29): Mentioning the genetic groups (breeding lines, genebank landraces and market varieties, as used in the rest of the manuscript) in the same sentence as DAPC (which describes clusters) is confusing. Please rewrite this sentence.

 

Lines 116-199: Variant calling is sequential: 8 subfiles of 12 samples are created for each 96-well plate, then the files for each 96-well plate are merged to create the final file. It is not clear what each bash script does (will the scripts be made available?), but GATK generally uses population information for variant calling. Is the script only a way to put mapping files together, or is it directly part of the variant calling step? In the latter case, what is the impact of the subfiles on variant calling?

 

Lines 121-122: Filtering criteria should be described in detail: read depth, missing data thresholds, SNP distribution? This would help the reader to understand the brief passage on the reduction of the SNP data from 137,800 to 9,266 (Results; lines 157-160), which should also be better explained.

 

Line 134: How was the ‘critical distance threshold’ determined? And which genetic distance was used?

 

Figure 1: D. rotundata has 20 chromosomes and an additional chromosome for unmapped SNPs. I do not understand how unmapped SNPs can be mapped to a chromosome! Moreover, in Figure 1 the chromosomes are numbered from 1 to 21, so we do not know which one is the additional chromosome. The legend is probably misleading and should be modified, as should the corresponding part of the main text (lines 158-160). In reference [38] (the genome paper), the scaffolds were anchored to 21 linkage groups (despite the 20 chromosomes in this species); the 21 chromosomes in this paper probably correspond to those same linkage groups, with no way to identify an additional chromosome composed of unmapped SNPs.

 

Line 164: Some SNPs were penta-allelic and hexa-allelic. How is this possible with only four bases? Did you include short indels among the SNP alleles? Moreover, Table S1 also includes a ‘multi-allelic’ category that is missing from the main text. The definition of ‘Multi-allelic’ should also be added to Table S1.

 

Lines 190-193: The organization of the sentences is misleading. If a genetic distance of 0.02 is used as the lower limit to consider that two genotypes are not clones, then how can the genetic distance among the 803 genotypes be between 0.02 and 0.37 (so only genotypes with distance = 0.02 should be considered as clones) while only 595 unique genotypes were recorded? This must be clarified.
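
One reading that would reconcile the two numbers is that genotypes closer than the threshold are merged into clonal groups, and that each group is then counted once. The sketch below illustrates that reading only; it assumes single-linkage grouping, which the manuscript does not necessarily use, and the distances and the helper name `count_unique_genotypes` are hypothetical.

```python
def count_unique_genotypes(n, distances, threshold=0.02):
    """Count clonal groups among n genotypes.

    `distances` maps a pair (i, j) to its genetic distance. Pairs closer than
    `threshold` are merged (single linkage) using a union-find structure.
    Pairs absent from `distances` are treated as non-clonal.
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), d in distances.items():
        if d < threshold:
            parent[find(i)] = find(j)

    return len({find(x) for x in range(n)})


# Invented example: genotypes 0, 1 and 2 form one clonal group; 3 and 4 are distinct.
toy_distances = {(0, 1): 0.01, (1, 2): 0.015, (0, 2): 0.025, (2, 3): 0.20, (3, 4): 0.37}
n_unique = count_unique_genotypes(5, toy_distances)
print(n_unique)
```

Note that under single linkage genotypes 0 and 2 end up in the same group even though their own distance (0.025) exceeds the threshold, which is one reason the grouping rule needs to be stated explicitly.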

 

Lines 209-210: You should give an interpretation of the group comparisons based on these indexes.

 

Figure 4: What is the order of the genotypes on the x-axis? As the main conclusion of the manuscript is that there is population structure according to genetic groups (breeding lines, etc.), the genotypes should be sorted by genetic group to support the statement.

 

Lines 257-260: The same colors (red, green, blue) were used for the DAPC and ADMIXTURE results. To what extent are the groups of the same color identical between the two analyses? A contingency table could help to understand this. The same remark applies to the description of the hierarchical clustering (lines 262-267).

The composition of the ADMIXTURE clusters should also be described in more detail (lines 257-260).
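
The contingency table suggested above could be built as follows. This is a sketch with invented cluster assignments for six accessions; the helper name `contingency_table` is hypothetical, and for ADMIXTURE each accession is assumed to have been assigned to its majority-ancestry cluster beforehand.

```python
from collections import Counter


def contingency_table(labels_a, labels_b):
    """Cross-tabulate two cluster assignments of the same accessions."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both assignments must cover the same accessions")
    counts = Counter(zip(labels_a, labels_b))
    rows = sorted(set(labels_a))
    cols = sorted(set(labels_b))
    return {r: {c: counts[(r, c)] for c in cols} for r in rows}


# Invented assignments (not data from the manuscript).
dapc = [1, 1, 2, 2, 3, 3]
admixture = [1, 1, 2, 3, 3, 3]
table = contingency_table(dapc, admixture)
for cluster, row in table.items():
    print(f"DAPC cluster {cluster}: {row}")
```

If the two methods agree, most counts fall in one cell per row; rows spread over several columns show where same-colored groups are not actually the same.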

 

Table S3. According to the main text, the diagonals correspond to ‘Genetic diversity among accession within each cluster’. This should be specified in Table S3. The way the pairwise Fst values and the diagonals were calculated should be specified in Materials and Methods. Why is the diversity negative on the diagonal?

 

The authors should include AMOVA (analysis of molecular variance) in order to test for the role of genetic groups on diversity.

 

Minor suggestions/concerns:

Line 83: Table S2 is cited before Table S1 (line 165). The tables should be renamed according to their citation in the Main text.

Line 96: Please replace Invirtogen by Invitrogen

Line 109: Please replace rotunda by rotundata

Line 142: Please replace complimented by complemented

Line 166: “which represent only 92.9% of total number of SNPs used in this study” could be removed as it is a repetition of Line 163.

Line 195: Please replace inidcating by indicating.

Table 2: He is 105 for Breeding lines while this index should be between 0 and 1. Please correct it.
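
The bound can be seen from the usual definition of expected heterozygosity at a locus, one minus the sum of squared allele frequencies. The sketch below assumes that definition (the manuscript may use a sample-size-corrected estimator); the helper name is hypothetical.

```python
def expected_heterozygosity(allele_freqs):
    """Expected heterozygosity at one locus: 1 minus the sum of squared allele frequencies."""
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in allele_freqs)


he_biallelic = expected_heterozygosity([0.5, 0.5])  # maximum for a biallelic SNP
he_fixed = expected_heterozygosity([1.0])           # a fixed allele gives zero
print(he_biallelic, he_fixed)
```

Because the squared frequencies are positive and sum to at most 1, the result always lies between 0 and 1, so a value of 105 cannot be an expected heterozygosity.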

Line 119-220: It should be specified that even if the population divergence between genetic groups is significant, it is weak, especially by comparison with the DAPC clusters.

Line 220: A bracket is missing.

Line 262: Please remove ‘the’ in ‘Identity by the State’.

Line 285: Please replace similat by similar

Table S2. I suppose for example that TDr9505576/TDr98:103 means that the corresponding genotype was obtained from the cross between TDr9505576 and TDr98:103. However, this is not obvious for the reader and this should be clarified.

Figure S1. It is almost identical to Figure 3B. What does this figure add?

Author Response

The authors deeply appreciate the reviewer's comments and valuable suggestions, which have allowed us to significantly improve the manuscript. We have provided a response to each of the reviewer's comments in the attached document.

Author Response File: Author Response.pdf

Reviewer 2 Report

Introduction

Line 39 you could indicate here the yearly economic value in $ of this species.

I also suggest specifying somewhere in the introduction the ploidy, the number of chromosomes (2n=2x=20 ??), the estimated genome size in Gb and whether the species is allogamous or autogamous.

Line 40. It is not clear whether the 600 species belong to the Dioscoreaceae family or to section Enantiophyllum. Please clarify.

Line 42 …Yam belt…accounts for

Line 45 “genotypes that suited their needs”. Please provide some examples of these “needs”

Lines 49-62. This paragraph is a bit confusing. As far as I understand, the topic of the previous section is D. rotundata. In this paragraph you state that D. rotundata and D. cayenensis form a complex. Why is that? Perhaps you should first introduce this issue and describe the two species in more depth (are both edible? Can they interbreed? Do they share the same environment? Why are they considered a complex?). Moreover (line 61), you say that “very few studies used morphological or molecular markers to assess genetic diversity”. But references [10] to [22] all appear to be studies based on morphological or molecular markers, so they are not very few.

Materials and Methods

Line 82 From the introduction it seems that D. rotundata and D. cayenensis form a complex and that it is quite difficult to distinguish the two species, both morphologically and genetically. Did the 803 accessions all belong to the species D. rotundata, or only to the genus Dioscorea? This is important information. From line 209 it seems that all the accessions are from D. rotundata.

Line 83 “selected from the core collection.” seems incomplete. Which core collection are you referring to?

Line 84 Table S2 should be Table S1 since it is the first supplementary table cited in the manuscript. Please be consistent, reordering/renaming all the supplementary materials in numerical order as they appear in text. Moreover, fix this table with the correct formatting (e.g. see “DAPC clusters”: it is inexplicably split in 4 rows, in the last column of the table)

Line 92 “Qiagen DNeasy DNA extraction kit” does not exist. Maybe “DNeasy Plant Mini Kit”?

Line 92 information about Qiagen supplier (city, country) must be given.

Line 103 Please indicate at least the reads length (e.g. 100bp?)

Line 108 double check/standardize the font and size font throughout the entire manuscript

Line 109 the link provided is not working. Please provide a working link

Line 113 Please provide the minimum number of reads set for SNP calling (for example, a minimum of 4 or 8 reads per sample per position) and for homozygous and heterozygous calls. This is extremely important information, and its absence can compromise the entire study.

Results

Line 156 Nearly half of the SNPs (46.8%) did not align to the D. rotundata reference genome. This is surprising and should be discussed at length in the Discussion section. Can you find an explanation?

Lines 158-159 A total of […] was obtained and were distributed? Double check the subject-verb agreement

Lines 158-159. From Figure 1 it seems that the 9,266 SNPs did not map uniformly along the chromosomes but, instead, map predominantly to telomeric regions. What is your opinion about this? Perhaps you could add a couple of sentences about it to the Discussion.

Line 161 The highest number of SNPs were mapped to chromosome 5: Double check the subject-verb agreement. Moreover, I would rephrase the sentence because it is not very clear.

Line 162 The lowest number of SNPs were mapped to …. Double check the subject-verb agreement. Moreover, I would rephrase the sentence because it is not very clear

Lines 163-165 Could you explain how a SNP (A, T, C, G) can be penta- or hexa-allelic? In Table S1 (which should be named Table S2), you even mention “multi-allelic” SNPs. What does this mean?

Line 160, Figure 1 and Table 1. The fictitious construction of an extra (21st) chromosome, grouping all the unmapped SNPs, doesn’t make any sense. Please find an alternative. If the 21st chromosome is not real, how did you calculate its length (in Figure 1, ~5 Mb)? How did you calculate the distances among unmapped SNPs? How did you decide the physical order of the unmapped SNPs within this additional chromosome?

Figure 2. What does “highcharts.com” mean (bottom right, Figure 1?)? If it is software used to produce this picture, it must be cited in the Methods and the label must be removed from the figure.

Figure 2 caption. “percent transition and transversion”. I don’t see any percentage in this picture.

Line 190 “data not shown” should be avoided. Please provide this information as supplementary material

Line 191 rephrase this sentence. You could simplify “two accessions were considered identical or representative of the same clone if their pairwise genetic distance was lower than 0.02”. However, you should clarify why you set this specific threshold.

Line 195 typo: indicating.

Line 195 “a rapid decline from K = 1 to K = 3”. Honestly, this rapid decline is not evident at all from Figure 3A. In fact, assuming that each circle is a cluster and that the first circle represents K=1, a rapid decline can be observed from K=1 to K=2 (not from K = 1 to K = 3). This part should be clarified because it is crucial for the entire work.

Supplementary Table S2. If I understand this table correctly, several accessions are derived from the same two parental lines. For example, there are 42 accessions (breeding lines) all descending from the same two parents (TDr9700973/TDr9501932). How is it possible that these 42 accessions belong to different clusters (e.g. TDr_09_00001 belongs to cluster 3 while TDr_09_00374 belongs to cluster 2)? Can you explain this?

Table 2. Please format the caption properly and move the footnotes into the caption, because it is illegible. Please standardize the number of decimal digits throughout the manuscript: in some cases there are four decimal digits (Simpson index), in others three (MAF), and in others two (He).

Figure 3. Panel A and Panel B have different fonts. Please standardize them.

Figure 4 axes number and axes titles are unreadable. Moreover, please, standardize also the font of all the Figures throughout the manuscript according to the text.

Lines 253-260 This paragraph is confusing. The content of the first sentence (“three clusters primarily corresponded to their cultivar (genebank landraces/market varieties/breeding lines) origin”) does not match what is reported in lines 256-260, where it is stated that each cluster includes genotypes of different origins (e.g. cluster 1 includes breeding lines and landraces, whereas cluster 2 includes breeding lines, landraces and commercial varieties).

Line 263 It is not clear from Figure S2 which clusters are 1, 2 and 3. You should use arrows to indicate the correspondence between nodes and clusters. The resolution of this figure must be improved.

Lines 273-277 read as discussion rather than as a description of the results.

Lines 278-280 This sentence must definitely be clarified. What does “one of the popular market variety in Nigeria is Hembakwase which corresponds to several breeding lines” mean? Basically, you are stating that some breeding lines are identical to a market variety. Is this supported by molecular data? You also state that “they are associated with other market varieties such as Makakuasa and Omi_efun”. Does this mean that these breeding lines (which are identical to Hembakwase) are also sold as market varieties under other names? Is this statement again based on molecular data? And does “Makakuasa” correspond to “TDr_Makakusa” (Table S2; check for a typo)?

280-282 This is your personal opinion and not a result (it should be moved to the Discussion). Moreover, “indicating the use of genebank landraces in the yam breeding program to generate the selected breeding lines” is a very strong statement. You have no proof of it; it is just a hypothesis.

282-284 “This can be further elucidated from Supplementary Table 2 wherein the pedigree information of the breeding lines have been provided”. In this sense, Table S2 does not elucidate anything. In the previous sentence you state that breeding lines may derive from landraces, but Table S2 contains no pedigree information that links breeding lines to landraces. Remove this sentence or clarify it.

285 similat is similar

Discussion

This section is too short, and it largely repeats the results. The data should be discussed more carefully. For example, the geographical origin of the genotypes analyzed in this work is never mentioned, and it would be very interesting to discuss any possible correlation between molecular clustering and geographical origin. Another aspect that should be elucidated (and that could be extremely useful for future breeding programs) is the presence of cases of synonymy and homonymy. For example, how many market varieties turned out to be identical but are sold under different names? How many landraces turned out to be identical to market varieties?

305-308 it is a repetition of the results and it can be removed

312-314 reference is needed

315 You acknowledge that the admixed ancestry could be due to complex domestication patterns between D. rotundata and D. cayenensis. So (as I already pointed out in my previous comments), how do you know that your 803 genotypes are all from D. rotundata (from line 209 it seems that all the accessions are from D. rotundata)? This is crucial because nowhere in the manuscript is it clear whether you are analyzing 803 samples from the single species D. rotundata or whether some of them could be inter-specific hybrids (D. rotundata x D. cayenensis). This aspect deserves an in-depth discussion.

320-321 “we have successfully unraveled the underlying genetic relationships and population structure” overstates what your data show.

329 majorly is not the right adverb. (use mainly instead)

339-342 this statement is not supported by data

Author Response

The authors deeply appreciate the reviewer's comments and valuable suggestions, which have allowed us to significantly improve the manuscript. We have provided a response to each of the reviewer's comments in the attached document.

Author Response File: Author Response.pdf

Reviewer 3 Report

White yam is an important staple tuber crop in West Africa. Although a previous population study used 94 accessions to understand its genetic diversity and evolution, a larger collection may be needed to understand white yam further. In this manuscript, Bhattacharjee et al. studied 803 white yam samples (landraces, breeding lines, and market varieties) using GBS to reveal the genetic diversity and population structure. The methods used are relatively standard and the findings are interesting. Before acceptance, I have some concerns that the authors may need to address.

Major:

  1. Would it be possible to add a phylogenetic tree to show the clusters of the samples?
  2. From Figure 3 (A), it is difficult to tell which K is the most suitable. Do the authors know why? It might also be good to show several values of K, for instance K=2 and K=3, in the subsequent ADMIXTURE plot to show the patterns.
  3. Line 275, Page 8: Since the naming systems are different, could the collection (the pre-grouping of the samples) introduce some bias and thus make the result less accurate?

Minor:

  1. Line 72, Page 2: Consider giving the full name of SNPs at line 64, where the term is first used.
  2. Line 107, Page 3: I'm curious why the authors used a very old GATK (v2.4) instead of version 3 or 4. Will this affect the accuracy of SNP calling?
  3. Line 121, Page 3: It may be better to list all settings here.
  4. Line 128, Page 3: Similarly, consider listing all parameter settings.
  5. Line 175, Page 4: Unmapped SNPs? Do you mean SNPs that are on unanchored contigs/scaffolds?
  6. Line 190, Page 5: “data not shown”. Is there a reason for this?
  7. Table 2: Why is the HE value for the breeding lines so high compared to the others?
  8. Figure 4: The distribution of each subpopulation cannot be seen from this figure. Consider adding this information to the legend or labeling each sample/group in the figure.

Author Response

The authors deeply appreciate the reviewer's comments and valuable suggestions, which have allowed us to significantly improve the manuscript. We have provided a response to each of the reviewer's comments in the attached document.

Author Response File: Author Response.pdf

Reviewer 4 Report

Dear authors,
I really appreciate your manuscript from several points of view. First of all, the amount of biological material studied is impressive, as is the multitude of statistical analyses that support your conclusions. I also appreciated the use of state-of-the-art molecular analysis techniques in the breeding program of a crop plant.
I noticed that the bibliographic references on white yam (Dioscorea rotundata Poir.) breeding programs are quite limited, which made me appreciate your initiative all the more.
In terms of English language and style I did not find any flaws, but overall the manuscript is quite difficult to read because of the way the data are presented. In Materials and Methods, Section 2.3 (Processing of Illumina Raw Sequence Read Data, SNP Calling and filtering) states that some of the biological material was excluded from the analysis (“Genotypes with more than 20% missing data were removed from further analysis”). How many genotypes were excluded, and from which analyses? This does not emerge from the Results or the Discussion.
The presentation quality of Figure S2 (Hierarchical clustering depicting the genetic relationship among 803 genotypes based on Jaccard’s genetic distances) needs to be improved; even when the figure is enlarged, little can be made out from it.

Author Response

The authors deeply appreciate the reviewer's comments and valuable suggestions, which have allowed us to significantly improve the manuscript. We have provided a response to each of the reviewer's comments in the attached document.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

I congratulate the authors on their rapid revisions, which have improved the quality of the manuscript. My only regret now is that we could not find an interpretation for the genetic clusters highlighted. Neither genetic group (genebank landraces, breeding lines, market varieties) nor geographical origin seems to explain this structure. Nevertheless, this manuscript is a step forward and will provide input for future work. Therefore, I will not oppose its publication. You will, however, find some comments below. Note that the line numbers given below refer to the ‘Track change’ document.

 

Line 26 : You should write Admixture in upper case letters if you want to talk about the software ADMIXTURE.

Line 44 : Please add a space between « allogamous,polyploid »

Lines 86-94 : There are now 473 genebank landraces used (line 86) but the origin is still described for only 462 genebank landraces (lines 91-94). Please reconcile these numbers.

Line 162 : Is it ‘from one to 100’ rather than ‘from two to 50’ ?

Line 171 : Are you talking about MEGA X (https://doi.org/10.1093/molbev/msy096) ? If so, please cite the article. Which method was used to build the tree, and how did you delineate the four clusters reported in the Results (lines 320-322) ?

Figure 1. The legend is complicated. A sentence like ‘Distribution and density of filtered SNPs across 21 pseudo-chromosomes of D. rotundata, as suggested by Tamiru et al. [38]. The horizontal axis displays the chromosome length. The number of SNPs in a given region is indicated at the bottom right.’ would be simpler.

Figure 2. In legend, please replace ‘thiamine’ by ‘thymine’.

Lines 212, 221, 286, 289, 304, 321, 329 : There is an inconsistency between the number of accessions (803) and the one indicated at the beginning of Materials and Methods (814 ; line 86).

Lines 217-219 : I do not understand the logic of the sentence ‘Being clonally propagated, the progenies of bi-parental crosses in white yam represented a segregating population (F2) and were genetically different from each other.’. Progenies of bi-parental crosses always represent a segregating population, even if the species is not clonally propagated.

Lines 230-231 : I am not opposed to the authors looking at results other than K=2, if it seems biologically relevant for example. However, it cannot always be justified by the sentence ‘Based on BIC curve’. For example, BIC is quite similar for K=11 and K=12 (Figure 3A). Therefore, what is the justification to study K=12 rather than K=11 « based on BIC curve » ?

Line 264 : ‘Table 33’ must be replaced by ‘Table 3’

Lines 266-267 : In the sentence ‘The AMOVA analysis among revealed [...]’, ‘among’ should be removed. The sentence would even be clearer if it was ‘The AMOVA analysis revealed that the variability is divided into 96% within genetic groups and 4% between the three genetic groups (Table 4).’.

Figure 4 : How were the accessions sorted on the x-axis of the ADMIXTURE results for each K ? If several plots are presented vertically, the reader will try to interpret them by comparing the results for several Ks, for example to look for structures (at low K values) and sub-structures (at high K values). You should specify in the legend whether the accessions are sorted in the same order in all plots of Figure 4.

Lines 290-298 : It is not correct to talk about ‘admixed population’. Accessions can be admixed between two or more genetic clusters but a group of admixed accessions does not form a distinct population. Please correct it. Moreover, you should give a definition of ‘admixed accession’ (thresholds?) as I think it is not present in the manuscript.

Line 294 : ‘[…] found to be K=4 and K=10. , which [...]’. Please remove ‘. ‘.

Figure S2. Please revise the grammar of the legend (consisting [...]).

Line 329 : This sentence was not updated in the revised submission. You should say that it is a phylogenetic tree and that four, not three, genetic groups were obtained.

Lines 363 and 430 : Please remove space in ‘gene (s)’

 

 

Author Response

The authors would like to express their immense gratitude to the reviewer for the valuable suggestions and comments, which have significantly improved the quality of the manuscript.

Author Response File: Author Response.pdf

Reviewer 2 Report

Line 39 reference needed for 15 billion

Line 88: 462 genebank landraces vs. line 83: 473 genebank landraces

Line 94 (Qiagen, Germantown, MD, USA)

Line 104 Again, as I asked in my previous report, the length of the sequencing reads needs to be specified. You stated that “The reads length used was 64bp, which is already mentioned in M&M section under 2.3”, but 64 bp is the read length after trimming and not the original length of the reads (i.e. the output of the sequencing; at this point I guess it is 1x100 bp). I really hope that this basic difference is now clear.

Moreover, in my previous report I asked for “the minimum number of reads set for SNP calling (or read depth)”. Your reply is “the minimum number of reads set used was 64bp”. This does not make any sense. You cannot confuse read length with read depth.

However I do appreciate that, in the end, you reported (line 119) the requested information (read depth>5).
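
The distinction between read length and read depth can be made concrete with a small sketch. The reads below are invented (start, length) pairs on a single reference sequence, and the helper name `depth_at` is hypothetical; only the 64 bp trimmed length and the depth > 5 filter come from the discussion above.

```python
def depth_at(reads, position):
    """Number of reads covering `position`; each read is (start, length), 0-based."""
    return sum(1 for start, length in reads if start <= position < start + length)


# Six invented reads, each 64 bp long (read length is a property of a single read).
reads = [(0, 64), (10, 64), (20, 64), (30, 64), (40, 64), (100, 64)]

depth_50 = depth_at(reads, 50)    # read depth is a property of a position
depth_110 = depth_at(reads, 110)
passes_filter = depth_50 > 5      # the read depth > 5 criterion mentioned above
print(depth_50, depth_110, passes_filter)
```

Every read here has the same length, yet the depth differs from one position to the next, so reporting a length of 64 bp says nothing about how many reads support a SNP call.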

Line 195 As I asked in my previous report, the pairwise genetic distances among the 803 genotypes (previously reported as “data not shown”) need to be presented as a supplementary file. This is not a big issue; you can provide them as an Excel file.

Author Response

The authors would like to express their immense gratitude to the reviewer for the valuable suggestions and comments, which have significantly improved the quality of the manuscript.

Author Response File: Author Response.pdf

Reviewer 3 Report

I think the authors have addressed my previous concerns and I have no more questions.

Author Response

The authors would like to express their immense gratitude to the reviewer for the valuable suggestions and comments, which have significantly improved the quality of the manuscript.

Author Response File: Author Response.pdf
