Article
Peer-Review Record

Decomposition of Individual SNP Patterns from Mixed DNA Samples

Forensic Sci. 2022, 2(3), 455-472; https://doi.org/10.3390/forensicsci2030034
by Gabriel Azhari 1, Shamam Waldman 2, Netanel Ofer 1, Yosi Keller 1, Shai Carmi 2 and Gur Yaari 1,*
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 1 November 2021 / Revised: 17 May 2022 / Accepted: 22 June 2022 / Published: 5 July 2022

Round 1

Reviewer 1 Report

This manuscript describes a new method for mixture deconvolution of SNP data using whole genome sequencing. Mixture deconvolution is one of the most interesting topics in forensic genetics due to its difficulty; therefore, I encourage the publication of this study. I have a couple of comments that I would like the authors to address.

Line 31- The authors mean SNaPshot and not SNP array as listed in reference 7.

Lines 31-33- The referred assay is based on next generation sequencing. The usage of these methods has been increasing in forensic science but it is still not the standard. In any case, it is not related to the assay mentioned previously, which is based on capillary electrophoresis - minisequencing or SNaPshot.

Line 33. I am confused by what the authors mean by "Alternatively, as the cost of sequencing decreases, whole genome sequencing (WGS) of the sample can be conducted and then analyzed for all relevant markers." In forensic casework, sequencing of the entire genome is not allowed for ethical reasons. Therefore, although I understand that much more information can be derived from the entire genome, this is still not widely accepted in the community.

Figure 1. An "e" is missing in "sequenced reads".

Page 4 has no text.

Table 1 is not mentioned in the text.

Discussion comments:

In my impression, forensic scientists would only turn to a method as expensive as whole genome sequencing when no further investigative leads exist. In other cases, scientists might prefer cheaper SNP typing alternatives such as AmpliSeq panels or the ForenSeq panel for the MiSeq. Would AH-HA help in these cases? Would it be applicable to amplicon-based sequencing? If so, the authors should definitely include this information in the discussion.

The idea of mixture deconvolution is worthy of investigation and one of the major challenges in forensic analysis. I would be interested to know whether the authors could comment on situations where the DNA of the suspect is not known, or where there is no suspect at all. In these cases, how would the forensic scientist predict the ancestry? And if that is not possible, could we run AH-HA with different possible reference populations and estimate the most probable one? If so, the authors should also include this information in the discussion as a suggestion for such cases.

Finally, I understand this study has been performed in silico. Have the authors considered applying this method to real-mixture whole genome sequencing data? This would give an overview of how the software deals with real case scenarios.

Author Response

This manuscript describes a new method for mixture deconvolution of SNP data using whole genome sequencing. Mixture deconvolution is one of the most interesting topics in forensic genetics due to its difficulty; therefore, I encourage the publication of this study. I have a couple of comments that I would like the authors to address.

>>We are grateful for your encouragement to publish our manuscript and for your helpful and thoughtful comments. We certainly agree this is an important and fascinating problem. We address all your concerns point by point below.

Line 31- The authors mean SNaPshot and not SNP array as listed in reference 7.

>>1.1 We apologize for this. We clarified in the manuscript that we meant a customized SNP array as a strategy, with SNaPshot being one of the leading applications (lines 32-38).

Lines 31-33- The referred assay is based on next generation sequencing. The usage of these methods has been increasing in forensic science, but it is still not the standard. In any case, it is not related to the assay mentioned previously, which is based on capillary electrophoresis - minisequencing or SNaPshot.

>>1.2 You are correct, there is a mix up between the two techniques. We rephrased the text to clarify it (lines 32-38).

Line 33. I am confused by what the authors mean by "Alternatively, as the cost of sequencing decreases, whole genome sequencing (WGS) of the sample can be conducted and then analyzed for all relevant markers." In forensic casework, sequencing of the entire genome is not allowed for ethical reasons. Therefore, although I understand that much more information can be derived from the entire genome, this is still not widely accepted in the community.

>>1.3 This is a good point. WGS has been suggested for forensic use in the past, and there is increasing use of available WGS data in forensics. Relevant citations have been added, and the need for an ethical discussion has been mentioned (lines 32-38).

Figure 1. Misses an "e" in sequenced reads.

>>1.4 Fixed.

Page 4 has no text.

>>1.5 Fixed.

Table 1 has no mention in the text.

>>1.6 All tables are now mentioned in the text in the order in which they appear.

Discussion comments:

In my impression, forensic scientists would only turn to a method as expensive as whole genome sequencing when no further investigative leads exist. In other cases, scientists might prefer cheaper SNP typing alternatives such as AmpliSeq panels or the ForenSeq panel for the MiSeq. Would AH-HA help in these cases? Would it be applicable to amplicon-based sequencing? If so, the authors should definitely include this information in the discussion.

>> 1.7 This is a good point. Currently, AH-HA cannot help with such targeted SNP panel data, but it may be extended in the future. We added this to the discussion (lines 423-427).

The idea of mixture deconvolution is worthy of investigation and one of the major challenges in forensic analysis. I would be interested to know whether the authors could comment on situations where the DNA of the suspect is not known, or where there is no suspect at all. In these cases, how would the forensic scientist predict the ancestry? And if that is not possible, could we run AH-HA with different possible reference populations and estimate the most probable one? If so, the authors should also include this information in the discussion as a suggestion for such cases.

>> 1.8 We agree with this comment, and have included in the discussion a suggestion to predict ancestry by searching for unique markers in the mixture sample, as has been previously suggested by others (line 377-380).

Finally, I understand this study has been performed in silico. Have the authors considered applying this method to real-mixture whole genome sequencing data? This would give an overview of how the software deals with real case scenarios.

>> 1.9 That is a good idea. Unfortunately, it was too complicated to obtain such samples, since forensic data are not publicly available. It is thus out of the scope of this study. Hopefully it can be tried in future studies.

Reviewer 2 Report

The authors have presented several methods for deconvolving two-person DNA mixtures involving one known contributor, and the method proposed is among the first to consider SNPs in linkage disequilibrium. The authors employ a Li and Stephens model, making the approach notably similar to popular phasing- and imputation-based approaches that are common in the genomics literature. Much of the science, computational and otherwise, I find to be novel and refreshing, and importantly, I think their AH-HA method is worthy of further investigation. That said, the article’s presentation is sorely lacking. At a high level, a major reorganization is needed—the Methods section is given last in the paper, yet terminology and necessary details that first arise in the Methods are then used in the Results. Thus, to a casual reader, experiments and terminologies are presented without first being introduced in a way that can be understood, and as such, the paper is unreadable unless it is read out of its presented order. Second, while the Methods section is largely clear, when results are given (e.g., Figure 2A), not enough detail is provided to say what dataset and experiment a result stems from. Likewise, not all works investigated in the study need to find their way into the publication. For example, the authors made several deconvolution methods, including several naïve methods—naïve methods are a very good idea, as they provide a baseline. However, having several naïve methods detracts from the content: only one baseline is needed, and having several adds to the information but not to the substantive content of the paper. Further, the intended point of the dummy methods is lost when they are not given alongside the primary results. E.g., I see BYS (a term only meaningfully introduced in the Methods, towards the end of the paper, by the way) in Figure 2A, but not in Figures 6 and 7, and the true “dummy” methods (Lines 97-107) are only presented in reference to one table.

Similarly, the authors spend a fair amount of time talking about the algorithm's speed (a fact I would not highlight quite so much, as the run-times are close to prohibitive), and very little time talking about the results beyond the F1 score. There are some very interesting cases to consider—for example, in your sampling strategy it could be that a true heterozygote presents (by the reads sampled) as a homozygote. This certainly happens; however, what does AH-HA do? Can it “fix” genotyping errors (because of LD, the heterozygote call can be recovered even though a naïve variant caller would never make such a call)? Or is it constrained to call genotypes that are compatible with the observed alleles? Can AH-HA be made better if simple thresholds are considered—e.g., if all sites with <5 reads are omitted? In short, reporting just an F1 score is a bit unsatisfying, and key properties of the approach need to be made clear so they can be discussed (e.g., the forensic audience will want and need to know whether AH-HA will impute/fix genotypes, as in the previous example, or not).
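For reference, the F1 score referred to throughout is the standard harmonic mean of precision and recall (general background, not a definition specific to this manuscript):

\[ F_1 = 2\,\frac{\mathrm{precision}\cdot\mathrm{recall}}{\mathrm{precision}+\mathrm{recall}}, \qquad \mathrm{precision} = \frac{TP}{TP+FP}, \quad \mathrm{recall} = \frac{TP}{TP+FN}. \]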

Additionally, your model assumptions need to be presented clearly, perhaps as a table. You start to do this at line 70 in the document, but this needs to be described more exhaustively. Your assumptions appear to include: no indels, biallelic variants, no genotyping error (in the suspect), known population ancestry, no recurrent mutation, and a known genetic map. Please also describe how violations of these assumptions are likely to impact the analysis.

 

There are also a fair number of typographic errors in the document, including unusual capitalizations. Throughout the first paragraph of the introduction in particular, words are capitalized when they need not be. For example line 21: … short tandem repeats (STRs), not Short Tandem Repeats (STR), ditto with external visible characteristics, … etc. and later on with Microhaplotypes. Please double-check all capitalizations and use a spellchecker.

 

Once these issues have been resolved I think the manuscript could be ready for publication, in my estimation at least. I have some in-line comments as well (below) that also need to be addressed.

 

Small notes:

Line 23: However, STRs do not store valuable information regarding forensically relevant traits such as ancestry and phenotype inference.

 

This statement overstates and is untrue. Some microvariants have very clear population associations; further, STR alleles can be, and are, associated with phenotype.

 

Line 43: STR based methods, such as STRmix [10], LRmix Studio [11] and EurForMix [12], are accurate only for 2-3 person mixtures

 

This statement also overstates and lacks a citation for the accuracy claim. It is clear that these programs lose power as the number of contributors increases, but accuracy is a different matter.

 

Line 51: “It should be noted that since most SNPs are typed as bi-allelic”. I know what the authors are trying to say, but more care with the language is needed. First, it would be unusual for WGS approaches to “type” SNPs—that’s genotyping, not variant calling. Second, the number of alleles observed is a function of the number of individuals. A great many SNPs (say in dbSNP) are multiallelic, but it’s also the case that for a pair of individuals, the vast majority of segregating sites are biallelic.

Line 55: There’s nothing pseudo about microhaplotypes being multiallelic. An allele corresponds to a locus of arbitrary size (otherwise we can argue that STRs are “pseudo” as well).

 

Line 58: The authors are walking a fine line by saying the problem is overlooked. There are applications that deconvolve DNA mixtures using SNPs (https://doi.org/10.1093/bioinformatics/btx530), and the algorithm of DEploid has some strong similarities to what is proposed herein.

 

Line 60: What does AH-HA stand for?

 

Line 65: Given the audience the Li and Stephens model should be introduced more thoroughly. I also wouldn’t describe it as an “ancestral based coalescent model”.

 

Line 70: Typo in the Figure 1N

 

Line 83: Typo with “shapIT” and the chosen citation is odd—perhaps shapeit4 is a better citation to consider?

 

Line 85: From the presentation I do not see any use of a coalescence process. I do see similarities to a Li and Stephens process however—perhaps the authors mean that their algorithm approximates the coalescence process. Clarity on this subject is needed.

 

Figure 2:

A.: There are many datasets used in your study, including different population groups and different mixture proportions. What is presented here?

 

B:

Where is the 5x data for BYS?

And 5x refers to coverage, not depth. Depth is a property of a single site in the genome; coverage is (effectively) the average depth, taken over the genome.
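As a reminder of the usual relationship (general background): for a genome of length G sequenced with N reads of length L,

\[ \text{coverage} \approx \overline{\text{depth}} = \frac{N \cdot L}{G}. \]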

 

C:

Presenting accuracy as a function of chromosome number is a bit bewildering, as chromosome number is predictive of very little—chromosome size and the mean rate of recombination are about all that come to mind. The latter may matter for the approach, but if that is the case then the units (e.g., Morgans, not chromosome number) should be made appropriate.

 

Line 91: “To evaluate the performance of the different algorithms…”
Your algorithms (in the plural) need to be presented, if only at a relatively high level, and then your experimental design needs to be described. The experiment should provide enough information to be reproducible. Having a Results section before a Methods section is always a challenge for this kind of work, and I don’t believe it is required by the journal. I would highly recommend placing your Methods before your Results—as is, the transition from the Introduction to the Results is too jarring and cannot be followed.

 

Line 104: The choice of dummy predictors needs some motivation. The most reasonable null/dummy classifier would be to choose the most likely genotype based on the allele frequencies in the relevant populations. Why are your choices (reference sequence, or most likely allele) the most appropriate?

 

Line 108: Bayesian, not bayesian

 

Line 148: Your 5Mbp segments are nonoverlapping but adjacent, and as such they are in LD and are not independent. Further, the 250bp buffer statement does not sound correct. From the paper cited they claim to use a 500kb overlap between adjacent segments, and if 250 anything is used, it would be 250 SNPs, not bp. The whole idea of phasing each part of the genome separately simply doesn’t work if the parts cannot be stitched back together correctly. Just because IMPUTE proposes stitching things together in some ad hoc way does not mean that you need to do so.

 

157: Genetic distance (Morgans) is not the same thing as physical distance (bp).

 

Section 2.1.2, Computational runtimes:

I like that the authors have given an estimate of asymptotic runtime. Given how slow their approach is, it makes me wonder why the authors did not consider more efficient algorithms (https://doi.org/10.1093/bioinformatics/bty735) instead of multithreading. Since multithreading is used, I would also consider a more appropriate threading strategy—e.g., as the blocks are not stitched together, there is no reason for any thread to wait; when a thread finishes its analysis it simply needs to write the information to disk and be done. In your case (more 5 Mb chunks × replicates than CPU cores) the apparent inefficiency will have little effect. As well, the typical forensic reader will have limited interest in such conversations. I would strongly consider shortening this section to the relevant pieces of information, and if the threading strategy is inefficient, I would consider mentioning Amdahl’s law and little else.
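For concreteness, the suggested strategy amounts to something like the sketch below (hypothetical function and file names, not the authors' code): each 5 Mb block is analysed independently and written to disk as soon as it finishes, with no stitching step and no idle waiting. Amdahl's law (speedup ≤ 1 / ((1 − p) + p/N) for parallel fraction p over N cores) then bounds what further threading tweaks can gain.

from concurrent.futures import ProcessPoolExecutor, as_completed
import json

def deconvolve_chunk(chunk_path):
    """Hypothetical per-block deconvolution; stands in for running the HMM on one 5 Mb block."""
    # ... run the Viterbi solver on this block only ...
    return {"chunk": chunk_path, "f1": None}  # placeholder result

def run_all(chunk_paths, out_prefix="results", workers=8):
    # Blocks are independent, so each finished block is written immediately;
    # no worker ever waits on another.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(deconvolve_chunk, p): p for p in chunk_paths}
        for fut in as_completed(futures):
            with open(f"{out_prefix}_{futures[fut]}.json", "w") as fh:
                json.dump(fut.result(), fh)

if __name__ == "__main__":
    run_all([f"chr22_block{i}" for i in range(8)])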

 

Section 2.2.1:

This section claims to check whether the inferred values for the Li and Stephens model perform optimally—all we are given numerically, however, is a figure. Please provide an argument (at the least) that says why the inferred parameters are considered optimal, and include in this argument some quantified statements. As well, please include what the inferred parameters are (Ne, at the least, has an expected value in human populations) and be clear about which dataset was used.

 

Throughout the remainder of the Results:

You included several dummy methods and you did so for good reason. Pick one of them and then use it in the figures/results/discussion, otherwise showing the performance of AH-HA cannot be understood as there is no baseline.

 

Discussion:

Paragraph 1:
Your discussion section needs considerable work. The first paragraph makes grand claims (“It outperforms other methods”, “is superior compared …”) and it does so without referring the reader back to the Results section. Likewise, while these claims may be true, not enough information is given to substantiate them—having one F1 score point estimate be higher than another does not make a method “superior”. You have chosen a single objective function (F1) and have neglected the role of chance. Formal hypothesis testing is needed, and the null hypothesis needs to be acknowledged, at a minimum, before such claims can be made. Additionally, this information needs to be woven into the body of the Discussion.

 

Paragraph2:

You cite Voskoboinik  in your paper but not here. Why? It is one of the few methods that appears to apply in your case (2 person mixture, 1 known) and I would be surprised if the discrimination power for that would be much different than the method proposed. Giving a review paper (9) alone does not provide enough information to the reader and it is unclear what “these algorithms” refer to. Please name them, cite them, and discuss! Notably, it should be mentioned that even if your algorithm deconvolves the unknown contributor correctly, it will not match the whole genome sequence of the unknown/suspect (if such information were available). Simply put, you’re throwing out sites that are not in your reference panel, which includes all variants that are individual-specific. In my mind that is fine, but it needs to be abundantly clear to the reader.

 

Section 4.1:

With your sequence data, please present the basic properties—eg, what chemistry (e.g., Illumina HiSeq), and what read lengths (2x150bp reads). Similarly for the variant calling—how was this done?

Line 297: It’s nitpicking, but polymorphic doesn’t imply biallelic. I.e., you filter to only consider biallelic SNPs, not SNPs with two polymorphisms. Additionally, as presented, triallelic variants appear to be removed only for the YRI data.

 

Line 305: It is unclear how you take ~50x AJ genomes and make 500x mixtures. How were PCR/optical duplicate reads treated?

 

Line 315: “SNPs that were not observed” is an odd turn of phrase. Formats like VCF describe variants, differences to the reference. In high coverage genomes, sites that are reference-consistent are not in the VCF file simply because they are assumed to be identical to the reference. These sites (ie, sites that are variable in one dataset but not the other) are informative of LD to the sites that are segregating in both datasets and I would strongly consider including them in your analyses. Projects like the 1000 Genomes do have masks that discriminate between uncallable and reference-consistent regions as well, as both phenomena are associated with a lack of a VCF record.

 

Line 321: Be clear where your genetic map comes from—you can’t estimate cM in any meaningful way from the human reference genome, which is how your sentence reads. Common maps include those inferred from pedigrees (decode genetics) as well as those inferred in population studies (e.g., from HapMap).

 

Line 352: Their model is used … (All uses require citations, and the meaning of “ancestral studies” is unclear).

 

Line 366: It is unclear how a pair of haplotypes (over a chromosome) is a genotype. Is diplotype the right word here?

 

Line 376: “Effective population size” is a fundamental population genetic parameter—do not use quotation marks here, and it’s traditionally presented as N or Ne, not Neff. Further Ne and r (the recombination rate) are independent— one describes the magnitude of genetic drift while the other describes the probability of recombination, typically between pairs of sites, regardless of the number of individuals in the population. Perhaps the authors are referring to rho (4Ne*r, which can be inferred by a Li and Stephens model as well), but clarification is needed.
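For clarity (standard population-genetics background, not taken from the manuscript): the population-scaled recombination rate that a Li and Stephens model can infer is

\[ \rho = 4 N_e r, \]

where N_e is the effective population size and r is the per-generation recombination rate between the sites in question.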

 

Section 4.4: Saying you’re using CHROMOPAINTER is fine, but for a Method section you need to also say what CHROMOPAINTER is doing, ie how it’s inferring Ne, and present that. I’m also unclear what theta is supposed to represent here. Calling it a mutation / mutation event is unclear.

 

Table 2: This does not look like it belongs in the Methods section.

Author Response

The authors have presented several methods for deconvolving DNA two-person mixtures involving 1 known contributor, and the method proposed is among the first to consider SNPs in linkage disequilibrium. The authors employ a Li and Stephens model, making the approach notably similar to popular phasing and imputation-based approaches that are common in the genomics literature. Much of the science, computational and otherwise, I find to be novel and refreshing, and importantly, I think their AH-HA method is worthy of further investigation. 

2.1 Thank you very much for these kind words. 

That said, the article’s presentation is sorely lacking. At a high level, a major reorganization is needed—the Methods section is given last in the paper, yet terminology and necessary details that first arise in the Methods are then used in the Results. Thus, to a casual reader, experiments and terminologies are presented without first being introduced in a way that can be understood, and as such, the paper is unreadable unless it is read out of its presented order.

2.2 We agree with you on this, and moved the Methods section to between the Introduction and Results sections.

Second, while the Methods section is largely clear, when results are given (e.g., Figure 2A), not enough detail is provided to say what dataset and experiment a result stems from. Likewise, not all works investigated in the study need to find their way into the publication. For example, the authors made several deconvolution methods, including several naïve methods—naïve methods are a very good idea, as they provide a baseline. However, having several naïve methods detracts from the content: only one baseline is needed, and having several adds to the information but not to the substantive content of the paper. Further, the intended point of the dummy methods is lost when they are not given alongside the primary results. E.g., I see BYS (a term only meaningfully introduced in the Methods, towards the end of the paper, by the way) in Figure 2A, but not in Figures 6 and 7, and the true “dummy” methods (Lines 97-107) are only presented in reference to one table.

2.3 First, since we changed the structure of the paper by moving the Methods section to come before the Results, we think it is clearer now. Second, we added clarification in the figure description. Third, in our study we tested different methods for the problem, eventually incorporating their principles into a unified algorithm (AH-HA). We find it useful to show the intermediate algorithms at least once in the paper, as part of the introduction of AH-HA and its initial evaluation. As for including the “dummy methods” in other parts of the paper, the whole point of mentioning them was for the F1 discussion (Table 1) and setting a baseline for these scores. In all other comparisons, these methods performed substantially worse than the methods we compared (BYS, NS-HMM, SH-HMM, and AH-HA) and did not even fall within the scale presented in the graphs. We tried to show them, but it makes the manuscript less focused, so we kept it as is and mention in the text that in all cases the other dummy methods were outperformed by AH-HA.

 

Similarly, the authors spend a fair amount of time talking about the algorithm's speed (a fact I would not highlight quite so much, as the run-times are close to prohibitive), and very little time talking about the results beyond the F1 score. There are some very interesting cases to consider—for example, in your sampling strategy it could be that a true heterozygote presents (by the reads sampled) as a homozygote. This certainly happens; however, what does AH-HA do? Can it “fix” genotyping errors (because of LD, the heterozygote call can be recovered even though a naïve variant caller would never make such a call)? Or is it constrained to call genotypes that are compatible with the observed alleles? Can AH-HA be made better if simple thresholds are considered—e.g., if all sites with <5 reads are omitted? In short, reporting just an F1 score is a bit unsatisfying, and key properties of the approach need to be made clear so they can be discussed (e.g., the forensic audience will want and need to know whether AH-HA will impute/fix genotypes, as in the previous example, or not).

2.4 Very interesting point. The reason for the “advanced” Viterbi solver we developed is to weigh these two sides of genotyping—one based only on the per-SNP read counts (without any LD consideration) and the other leaning heavily on LD. That means that when the read count for the observed allele is low, a higher weight is given to the LD model, essentially imputing the genotype. We mention this when discussing Figure 2B, and added another clarification to the Discussion (lines 351-355).
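As a purely schematic illustration of this weighting (not necessarily the exact formulation used in AH-HA), the per-site Viterbi score for a candidate genotype g_i could combine a read-count likelihood with an LD-based term,

\[ S(g_i) = \log P_{\text{reads}}(k_i, n_i \mid g_i) + \lambda \, \log P_{\text{LD}}(g_i \mid g_{i-1}, \text{reference panel}), \]

where k_i of the n_i reads at site i carry the alternative allele and \lambda controls how strongly the LD model can override sparse read evidence, so that a low n_i lets LD effectively impute the call.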

Additionally, your model assumptions need to be presented clearly, perhaps as a table. You start to do this at line 70 in the document, but this needs to be described more exhaustively. Your assumptions appear to include: no indels, biallelic variants, no genotyping error (in the suspect), known population ancestry, no recurrent mutation, and a known genetic map. Please also describe how violations of these assumptions are likely to impact the analysis.

2.5  We agree that this should be better presented. We transferred the description from the end of the Introduction section to the beginning of the Methods section, and elaborated as required (lines 76-86).

There are also a fair number of typographic errors in the document, including unusual capitalizations. Throughout the first paragraph of the introduction in particular, words are capitalized when they need not be. For example line 21: … short tandem repeats (STRs), not Short Tandem Repeats (STR), ditto with external visible characteristics, … etc. and later on with Microhaplotypes. Please double-check all capitalizations and use a spellchecker.

2.6 We thank the reviewer for pointing this out. We corrected the capitalizations in the manuscript.

 

Once these issues have been resolved I think the manuscript could be ready for publication, in my estimation at least. I have some in-line comments as well (below) that also need to be addressed.

 

Small notes:

Line 23: However, STRs do not store valuable information regarding forensically relevant traits such as ancestry and phenotype inference.

This statement overstates and is untrue. Some microvariants have very clear population associations; further, STR alleles can be, and are, associated with phenotype.

 2.7 We agree with this and modified the sentence (line 23-25).

Line 43: STR based methods, such as STRmix [10], LRmix Studio [11] and EurForMix [12], are accurate only for 2-3 person mixtures

 This statement also overstates and lacks a citation for the accuracy claim. It is clear that these programs lose power as the number of contributors increases, but accuracy is a different matter.

2.8 We rephrased this line and added a citation (lines 46-48).

Line 51: “It should be noted that since most SNPs are typed as bi-allelic”. I know what the authors are trying to say, but more care with the language is needed. First, it would be unusual for WGS approaches to “type” SNPs—that’s genotyping, not variant calling. Second, the number of alleles observed is a function of the number of individuals. A great many SNPs (say in dbSNP) are multiallelic, but it’s also the case that for a pair of individuals, the vast majority of segregating sites are biallelic.

2.9 We modified the text and added a citation (line 55).

Line 55: There’s nothing pseudo about microhaplotypes being multiallelic. An allele corresponds to a locus of arbitrary size (otherwise we can argue that STRs are “pseudo” as well).

2.10 Thanks for pointing this out. We removed the word “pseudo”.

Line 58: The authors are walking a fine line by saying the problem is overlooked. There are applications that deconvolve DNA mixtures using SNPs (https://doi.org/10.1093/bioinformatics/btx530), and the algorithm of DEploid has some strong similarities to what is proposed herein.

2.11  We modified the text to say that this problem is overlooked in the context of forensics (line 62).

Line 60: What does AH-HA stand for?

2.12 We added an explanation to the first mention of AH-HA in the abstract.

Line 65: Given the audience the Li and Stephens model should be introduced more thoroughly. I also wouldn’t describe it as an “ancestral based coalescent model”.

2.13  We provided a more detailed introduction to the Li and Stephens model, and removed the description of it as an ancestral based coalescent model (line 68-74).

Line 70: Typo in the Figure 1N

 2.14  Fixed.

Line 83: Typo with “shapIT” and the chosen citation is odd—perhaps shapeit4 is a better citation to consider?

2.15 This is correct. We fixed the name and reference in the four places it was mentioned. 

Line 85: From the presentation I do not see any use of a coalescence process. I do see similarities to a Li and Stephens process however—perhaps the authors mean that their algorithm approximates the coalescence process. Clarity on this subject is needed.

 2.16 As part of the revision we removed this sentence completely.

Figure 2:

A.: There are many datasets used in your study, including different population groups and different mixture proportions. What is presented here?

 2.17 We added the description to the figure legend.

 

B:

Where is the 5x data for BYS?

And 5x refers to coverage, not depth. Depth is a property of a single site in the genome; coverage is (effectively) the average depth, taken over the genome.

2.18 The phrasing was fixed. The 5x BYS result lies outside the plotted scale (omitted for resolution purposes); this is now clarified in the figure description.

C:

Presenting accuracy as a function of chromosome number is a bit bewildering, as chromosome number is predictive of very little—chromosome size and the mean rate of recombination are about all that come to mind. The latter may matter for the approach, but if that is the case then the units (e.g., Morgans, not chromosome number) should be made appropriate.

2.19 The idea of showing the analysis results at the chromosome level was to show that the accuracy obtained for chromosome 22 is consistent with the other chromosomes, and thereby to justify presenting the rest of the results for chromosome 22 only.

Line 91: “To evaluate the performance of the different algorithms…”

Your algorithms (in the plural) need to be presented, if only at a relatively high level, and then your experimental design needs to be described. The experiment should provide enough information to be reproducible. Having a Results section before a Methods section is always a challenge for this kind of work, and I don’t believe it is required by the journal. I would highly recommend placing your Methods before your Results—as is, the transition from the Introduction to the Results is too jarring and cannot be followed.

2.20  This is an important point. The METHODS have been moved to before the RESULTS.

Line 104: The choice of dummy predictors needs some motivation. The most reasonable null/dummy classifier would be to choose the most likely genotype based on the allele frequencies in the relevant populations. Why are your choices (reference sequence, or most likely allele) the most appropriate?

2.21  We clarified in the text (line 236-237) that in the first dummy predictor we indeed meant the genotype based on allele frequencies.

Line 108: Bayesian, not bayesian

2.22 We thank the reviewer for pointing this out. We fixed all mentions of this error. 

Line 148: Your 5Mbp segments are nonoverlapping but adjacent, and as such they are in LD and are not independent. Further, the 250bp buffer statement does not sound correct. From the paper cited they claim to use a 500kb overlap between adjacent segments, and if 250 anything is used, it would be 250 SNPs, not bp. The whole idea of phasing each part of the genome separately simply doesn’t work if the parts cannot be stitched back together correctly. Just because IMPUTE proposes stitching things together in some ad hoc way does not mean that you need to do so.

2.23 You are correct that there are more accurate methods for splitting and stitching chromosomes if one is interested in phasing a genome into haplotype blocks. The point here was to show that the F1 score is hardly affected by this procedure, and indeed we show that the results are not very sensitive to the chunk size. You are also correct about the 250 bp/250 kbp overlap; we tested different splitting strategies and did not find significant variation in the final results. Hence, we corrected the 250 bp mention and removed IMPUTE’s statement about LD.

 

157: Genetic distance (Morgans) is not the same thing as physical distance (bp).

2.24  Thanks for pointing this out. In the revised manuscript we removed this sentence (see answer to the next comment).

Sections: 2.1.2. Computational runtimes:

I like that the authors have given an estimate of asymptotic runtime. Given how slow their approach is, it makes me wonder why the authors did not consider more efficient algorithms (https://doi.org/10.1093/bioinformatics/bty735) instead of multithreading. Since multithreading is used, I would also consider a more appropriate threading strategy—e.g., as the blocks are not stitched together, there is no reason for any thread to wait; when a thread finishes its analysis it simply needs to write the information to disk and be done. In your case (more 5 Mb chunks × replicates than CPU cores) the apparent inefficiency will have little effect. As well, the typical forensic reader will have limited interest in such conversations. I would strongly consider shortening this section to the relevant pieces of information, and if the threading strategy is inefficient, I would consider mentioning Amdahl’s law and little else.

2.25 Thanks for the suggestion; it is a good idea. We use a modified Li-Stephens method for modeling NGS emissions, which is not taken into account in https://doi.org/10.1093/bioinformatics/bty735. Adapting this algorithm to our case is beyond the scope of the current manuscript but will certainly be a subject for future research. We mention this in the revised text (lines 413-415). We also shortened the paragraph on computational run-times.

Section 2.2.1:

This section claims to check whether the inferred values for the Li and Stephens model perform optimally—all we are given numerically, however, is a figure. Please provide an argument (at the least) that says why the inferred parameters are considered optimal, and include in this argument some quantified statements. As well, please include what the inferred parameters are (Ne, at the least, has an expected value in human populations) and be clear about which dataset was used.

2.26  We believe that in the current version of the manuscript, in which the Methods section precedes the Results section, this issue is resolved. We also added a reference to the section describing the inferred values’ estimation (lines 293-294).

Throughout the remainder of the Results:

You included several dummy methods and you did so for good reason. Pick one of them and then use it in the figures/results/discussion, otherwise showing the performance of AH-HA cannot be understood as there is no baseline.

2.27 After showing that AH-HA outperforms the dummy methods at the beginning of the Results, we found it unnecessary to show such comparisons for all of the remaining analyses, as the difference in performance between AH-HA and the other methods remains significant.

Discussion:

Paragraph 1:

Your discussion section needs considerable work. The first paragraph makes grand claims (“It outperforms other methods”, “is superior compared …”) and it does so without referring the reader back to the Results section. Likewise, while these claims may be true, not enough information is given to substantiate them—having one F1 score point estimate be higher than another does not make a method “superior”. You have chosen a single objective function (F1) and have neglected the role of chance. Formal hypothesis testing is needed, and the null hypothesis needs to be acknowledged, at a minimum, before such claims can be made. Additionally, this information needs to be woven into the body of the Discussion.

2.28 We moderated the first paragraph, added references to the figures where needed, and calculated p-values for the comparisons between the methods. Similar p-values were obtained for other measures such as accuracy and precision.
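To illustrate the kind of paired comparison involved (with made-up numbers for illustration only, not our actual results, and not necessarily the exact test applied in the manuscript), per-replicate F1 scores of two methods on the same simulated mixtures can be compared with a paired non-parametric test:

from scipy.stats import wilcoxon
import numpy as np

# Hypothetical per-replicate F1 scores for two methods on the same simulated mixtures.
f1_ahha = np.array([0.91, 0.93, 0.90, 0.92, 0.94, 0.92])
f1_baseline = np.array([0.84, 0.86, 0.83, 0.85, 0.87, 0.85])

# Paired Wilcoxon signed-rank test on the per-replicate differences.
stat, p_value = wilcoxon(f1_ahha, f1_baseline)
print(f"Wilcoxon signed-rank: W={stat}, p={p_value:.4f}")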

Paragraph2:

You cite Voskoboinik  in your paper but not here. Why? It is one of the few methods that appears to apply in your case (2 person mixture, 1 known) and I would be surprised if the discrimination power for that would be much different than the method proposed. Giving a review paper (9) alone does not provide enough information to the reader and it is unclear what “these algorithms” refer to. Please name them, cite them, and discuss! 

Notably, it should be mentioned that even if your algorithm deconvolves the unknown contributor correctly, it will not match the whole genome sequence of the unknown/suspect (if such information were available). Simply put, you’re throwing out sites that are not in your reference panel, which includes all variants that are individual-specific. In my mind that is fine, but it needs to be abundantly clear to the reader.

2.29  We elaborated in the Discussion on the methods described in the review paper. We also added to the Discussion that even if our algorithm deconvolves the unknown contributor correctly it will not match the whole sequence of the unknown (lines 358-371).

Section 4.1:

With your sequence data, please present the basic properties—eg, what chemistry (e.g., Illumina HiSeq), and what read lengths (2x150bp reads). Similarly for the variant calling—how was this done?

2.30 We added the necessary details to the Methods section (lines 98-99, 107-108).

Line 297: It’s nitpicking, but polymorphic doesn’t imply biallelic. I.e., you filter to only consider biallelic SNPs, not SNPs with two polymorphisms. Additionally, as presented, triallelic variants appear to be removed only for the YRI data.

2.31  We modified the sentence to say “Non-biallelic SNPs were filtered out”.

Line 305: It is unclear how you take ~50x AJ genomes and make 500x mixtures. How were PCR/optical duplicate reads treated?

2.32 The AJ reference panel is constructed from the TAGC128 data, which is ~50x. However, the mixture data come from the heavily sequenced AJ_Trio data, in which each individual is ~275x. Combined, the two contributors can reach 500x without duplicating reads. The text was rephrased to clarify this point (line 124).

Line 315: “SNPs that were not observed” is an odd turn of phrase. Formats like VCF describe variants, differences to the reference. In high coverage genomes, sites that are reference-consistent are not in the VCF file simply because they are assumed to be identical to the reference. These sites (ie, sites that are variable in one dataset but not the other) are informative of LD to the sites that are segregating in both datasets and I would strongly consider including them in your analyses. Projects like the 1000 Genomes do have masks that discriminate between uncallable and reference-consistent regions as well, as both phenomena are associated with a lack of a VCF record.

2.33  We did not mention VCF format, but we understand the terminology that you mentioned might be confusing. We therefore edited the text to be less ambiguous.

Line 321: Be clear where your genetic map comes from—you can’t estimate cM in any meaningful way from the human reference genome, which is how your sentence reads. Common maps include those inferred from pedigrees (decode genetics) as well as those inferred in population studies (e.g., from HapMap).

2.34  We agree with this comment. The genetic map comes from the HapMap project, and not from the human reference genome. We corrected this mistake in the text (line 140).

Line 352: Their model is used … (All uses require citations, and the meaning of “ancestral studies” is unclear).

2.35 We added citations to all uses.

Line 366: It is unclear how a pair of haplotypes (over a chromosome) is a genotype. Is diplotype the right word here?

2.36 We modified the text to clarify this point (line 179).

Line 376: “Effective population size” is a fundamental population genetic parameter—do not use quotation marks here, and it’s traditionally presented as N or Ne, not Neff. Further Ne and r (the recombination rate) are independent— one describes the magnitude of genetic drift while the other describes the probability of recombination, typically between pairs of sites, regardless of the number of individuals in the population. Perhaps the authors are referring to rho (4Ne*r, which can be inferred by a Li and Stephens model as well), but clarification is needed.

2.37  We removed the quotation marks, and changed Neff to Ne.

Section 4.4: Saying you’re using CHROMOPAINTER is fine, but for a Method section you need to also say what CHROMOPAINTER is doing, ie how it’s inferring Ne, and present that. I’m also unclear what theta is supposed to represent here. Calling it a mutation / mutation event is unclear.

2.38  We elaborated as requested on CHROMOPAINTER (lines 190-192) and theta (lines 194-195).

Table 2: This does not look like it belongs in the Methods section.

2.39 We agree that this table could be moved to the Results section. However, it is important for the F1 score explanation later in the Methods section. Since we moved the Methods to before the Results, we believe it is best to leave this table in the Methods section.

Reviewer 3 Report

Dear Authors, 

This is a great paper demonstrating the efficiency of a new tool, the AH-HA software (love the name!), to deconvolute mixtures of SNPs. This is a useful and alternative approach for complex mixtures in forensics. The experiments to assess the utility and specificity of the software were appropriate, and the conclusions were supported by the results. I only have one question: will you make this software freely available after the publication of the paper, or will you address some of the issues you referred to in the discussion before releasing the software?

Thank you very much. 

Author Response

This is a great paper demonstrating the efficiency of a new tool, the AH-HA software (love the name!), to deconvolute mixtures of SNPs. This is a useful and alternative approach for complex mixtures in forensics. The experiments to assess the utility and specificity of the software were appropriate, and the conclusions were supported by the results. I only have one question: will you make this software freely available after the publication of the paper, or will you address some of the issues you referred to in the discussion before releasing the software?

Thank you very much. 

>> We thank you very much for your kind words. AH-HA is indeed freely available through the GitHub link provided in the paper. We will address the issues referred to in the discussion and update the software as progress is made.

Round 2

Reviewer 2 Report

I thank the authors for their revisions. I found a few additional wrinkles to work out, as highlighted below. I would add that there are still a handful of grammatical issues—I would strongly encourage the authors to reread the manuscript a few more times to improve the overall clarity.

 

Line 38: It sounds like WGS is not used solely for ethical reasons (which does not sound correct).

 

Line 40: [13] refers to mixtures, but the sentence mentions both mixtures and degraded samples.

 

Lines 59 and 60: Please remove the quotation marks (both sets). A microhaplotype is truly multiallelic.

 

Line 62: Adding “in the context of forensics” isn’t really a big enough change and the statement is not entirely true. Deploid reconstructs SNP profiles from mixtures and the deploid algorithm has been applied to mitochondrial mixtures (https://doi.org/10.3390/genes12020128) in forensics.

 

Line 69: You have added a description of a Li and Stephens model, for which I am thankful, but the last paragraph of an introduction is not the right place to include this information.

 

Line 116: As before, it appears that non-biallelic SNPs were only removed from the YRI panel. Is this truly the case?

 

Line 122: This is the same issue as before--

Line 315: “SNPs that were not observed” is an odd turn of phrase. Formats like VCF describe variants, differences to the reference. In high coverage genomes, sites that are reference-consistent are not in the VCF file simply because they are assumed to be identical to the reference. These sites (ie, sites that are variable in one dataset but not the other) are informative of LD to the sites that are segregating in both datasets and I would strongly consider including them in your analyses. Projects like the 1000 Genomes do have masks that discriminate between uncallable and reference-consistent regions as well, as both phenomena are associated with a lack of a VCF record.

2.33  We did not mention VCF format, but we understand the terminology that you mentioned might be confusing. We therefore edited the text to be less ambiguous.

I’m not suggesting that you change the analyses, but if:

“Base pairs that were not included both in the AJ-trio and the TAGC128 panel were filtered out.” Means what I think it means (if a SNP was called in one dataset but not the other you removed it) you should strongly consider keeping such sites. They’re not missing data and they contain information on LD relevant to your analyses and they’re especially relevant when the reference population is mis-specified (which in practice it always will be).

 

Section 2.5:

Line 186: the term “coalescence-based” appears here—I believe it should be removed.

Line 189: Be careful with the language—Ne and r (recombination rate) are independent; the probability of observing a recombinant allele is a function of the population recombination rate (commonly notated as rho).

 

Line 194: Theta appears to be the probability of recurrent mutation (the meaning of “mutation rate” when you’re selecting a site that’s already mutated in humans is a bit hard to understand), but honestly the definition I just gave does not sound correct. Can the term “mutation” be made more clear?

Figure 2C. Your explanation as to why chromosomes are important was helpful. To the reader it would be helpful to highlight which chromosome’s F1 scores are most meaningful.

 

Line 252: That AH-HA is better is not completely clear from the plot. It would be helpful to provide (in the text) what some values are; as is, there’s too much overplotting.

 

Line 262: Use of the word: “coalescent”

 

Line 279: As with before, I’m not convinced that IMPUTE2 uses nonoverlapping segments. Consider striking the IMPUTE reference as it does not add to the sentence.

 

Line 323: Here and elsewhere; if CHROMOPAINTER is inferring theta and Ne, please include what values it is inferring.

 

Line 344: It is highly unusual to introduce a statistical test in a Discussion section. This is a Result.

Author Response

I thank the authors for their revisions. I found a few additional wrinkles to work out, as highlighted below. I would add that there are still a handful of grammatical issues—I would strongly encourage the authors to reread the manuscript a few more times to improve the overall clarity.

Thanks again for your comments. The paper has gone through a significant re-write in order to improve clarity. All changes are marked in red. Please find a point by point reply below.

Line 38: It sounds like WGS is not used solely for ethical reasons (which does not sound correct).

We agree and rephrased the sentence (line 37).

Line 40: [13] refers to mixtures, but the sentence mentions both mixtures and degraded samples.

We added a citation that refers to degraded samples.

Lines 59 and 60: Please remove the quotation marks (both sets). A microhaplotype is truly multiallelic.

Removed.

Line 62: Adding “in the context of forensics” isn’t really a big enough change and the statement is not entirely true. Deploid reconstructs SNP profiles from mixtures and the deploid algorithm has been applied to mitochondrial mixtures (https://doi.org/10.3390/genes12020128) in forensics.

 We rephrased the sentence again and cited the paper on mitochondrial mixtures (line 64).

Line 69: You have added a description of a Li and Stephens model, for which I am thankful, but the last paragraph of an introduction is not the right place to include this information.

We moved the model description to the Methods section (line 181) and added a cross-ref in the Introduction section (line 73). 

Line 116: As before, it appears that non-biallelic snps were only removed from the YRI panel. Is this truly the case?

To clarify this point, we added in the text that for every data processing step non-bi-allelic SNPs were removed (line 104).

Line 122: This is the same issue as before–

Line 315: “SNPs that were not observed” is an odd turn of phrase. Formats like VCF describe variants, differences to the reference. In high coverage genomes, sites that are reference-consistent are not in the VCF file simply because they are assumed to be identical to the reference. These sites (ie, sites that are variable in one dataset but not the other) are informative of LD to the sites that are segregating in both datasets and I would strongly consider including them in your analyses. Projects like the 1000 Genomes do have masks that discriminate between uncallable and reference-consistent regions as well, as both phenomena are associated with a lack of a VCF record.

2.33  We did not mention VCF format, but we understand the terminology that you mentioned might be confusing. We therefore edited the text to be less ambiguous.

I’m not suggesting that you change the analyses, but if:

“Base pairs that were not included both in the AJ-trio and the TAGC128 panel were filtered out.” Means what I think it means (if a SNP was called in one dataset but not the other you removed it) you should strongly consider keeping such sites. They’re not missing data and they contain information on LD relevant to your analyses and they’re especially relevant when the reference population is mis-specified (which in practice it always will be).

We understand the point the reviewer has made. To save computational time, we only considered variants that are polymorphic in the reference panel. Some of the sites left out could be interesting, but most are monomorphic and do not add information. Also, from a performance analysis standpoint, we preferred, at this stage, to work with known data throughout the study.
We referenced this in the data processing section (line 104) and in the discussion (line 400).

Section 2.5:

Line 186: the term “coalescence-based” appears here—I believe it should be removed.

Removed.

Line 189: Be careful with the language—Ne and r (recombination rate) are independent; the probability of observing a recombinant allele is a function of the population recombination rate (commonly notated as rho).

We added a reference to Appendix B, which explains our mathematical logic in more detail.

Line 194: Theta appears to be the probability of recurrent mutation (the meaning of “mutation rate” when you’re selecting a site that’s already mutated in humans is a bit hard to understand), but honestly the definition I just gave does not sound correct. Can the term “mutation” be made more clear?

We rephrased the explanation on the meaning of Theta (line 215).

Figure 2C. Your explanation as to why chromosomes are important was helpful. To the reader it would be helpful to highlight which chromosome’s F1 scores are most meaningful.

Added scoring analysis in the results section (line 294).

 

Line 252: That AH-HA is better is not completely clear from the plot. It would be helpful to provide (in the text) what some values are; as is, there’s too much overplotting.

 Added values (line 284).

Line 262: Use of the word: “coalescent”

We removed the word “coalescent” (line 287).

Line 279: As with before, I’m not convinced that IMPUTE2 uses nonoverlapping segments. Consider striking the IMPUTE reference as it does not add to the sentence.

Deleted the IMPUTE reference.

Line 323: Here and elsewhere; if CHROMOPAINTER is inferring theta and Ne, please include what values it is inferring.

We inserted the values in the parameter estimation section. 

Line 344: It is highly unusual to introduce a statistical test in a Discussion section. This is a Result.

This is correct. We moved it to the Results section (line 269) and referenced it in the Discussion section (line 376).
