Whole-Genome Alignment: Methods, Challenges, and Future Directions
Abstract
1. Introduction
2. Classification of Whole-Genome Alignment Algorithms
2.1. Suffix Tree-Based Alignment Methods
2.1.1. MUMmer Technique
- (i)
- Perform a maximal unique match (MUM) decomposition of the two genomes. The algorithm identifies all MUMs between the pair of genomes, that is, the maximal matching subsequences that occur exactly once in genome A and exactly once in genome B. To detect these MUMs, the two genomes are represented by a suffix tree, and the common substrings detected on the tree that meet the uniqueness condition are the MUMs between the two genomes (Figure 2). MUMs represent regions of high similarity or conserved regions between genomes;
- (ii)
- Apply filtering techniques to remove spurious matches and improve the accuracy of the alignment results. Sort the MUMs and identify the longest set of matches that occurs in the same order in both genomes;
- (iii)
- Close the gaps within the alignment by detecting significant insertions, repeats, altered regions, and single-nucleotide variations (SNVs);
- (iv)
- Perform a Smith–Waterman alignment for the regions between the MUMs and construct the final alignment;
- (v)
- Output alignments, allowing visualization and analysis to understand the evolutionary relationships and structural similarities between the input genomes.
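Steps (i) and (ii) can be illustrated with a minimal Python sketch. It is not MUMmer's implementation: it assumes a brute-force search in place of the suffix tree and a quadratic dynamic program in place of an efficient longest-increasing-subsequence routine, so it is only practical for toy sequences. The names `find_mums`, `chain_mums`, and `min_len` are hypothetical.

```python
def count_occurrences(text, pattern):
    """Count possibly overlapping occurrences of pattern in text."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(a, b, min_len=3):
    """Return (pos_a, pos_b, length) for every maximal unique match."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip matches that could be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            if length < min_len:
                continue
            match = a[i:i + length]
            # Keep the match only if it is unique in both sequences.
            if count_occurrences(a, match) == 1 and count_occurrences(b, match) == 1:
                mums.append((i, j, length))
    return mums


def chain_mums(mums):
    """Pick the heaviest set of MUMs whose start positions increase in both genomes."""
    mums = sorted(mums)
    if not mums:
        return []
    best = [m[2] for m in mums]      # best total length of a chain ending at each MUM
    prev = [-1] * len(mums)          # back-pointers for reconstructing the chain
    for k in range(len(mums)):
        for p in range(k):
            ordered = mums[p][0] < mums[k][0] and mums[p][1] < mums[k][1]
            if ordered and best[p] + mums[k][2] > best[k]:
                best[k] = best[p] + mums[k][2]
                prev[k] = p
    k = max(range(len(mums)), key=best.__getitem__)
    chain = []
    while k != -1:
        chain.append(mums[k])
        k = prev[k]
    return chain[::-1]
```

For example, `find_mums("ACGTTGCA", "ACGTAGCA")` reports the two matches on either side of the single mismatch, and `chain_mums` discards any MUM that would break the shared order, which is how rearranged matches are kept out of the backbone alignment in this sketch.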
MUMmer 2.1
MUMmer 3.0
MUMmer 4.0
2.1.2. Other Sequence Comparison Approaches Based on the Suffix Tree Method
2.2. Anchor-Based Methods
2.2.1. Lagan
- (i)
- Generation of Local Alignments: The Lagan algorithm uses the CHAOS method [29] to detect local homologies between the two genomes and chains them into a rough global map. The first step of CHAOS is chaining short exact matches (seeds) shared by the two genomes. Seeds that lie close together are grouped into the same anchor. The gaps between the seeds are aligned using a dynamic programming alignment method;
- (ii)
- Construction of a Rough Global Map: The Lagan algorithm uses the local alignments to build a rough global map. Each local alignment has a similarity score. The optimal rough global map is the highest-scoring chain of local alignments, which can be computed using sparse dynamic programming [30].
- (iii)
- Computation of Global Alignment: To compute the final global alignment, the Lagan method uses the Needleman–Wunsch algorithm to perform an alignment of the limited area between the anchors.
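Step (iii) can be sketched as follows. This is a simplified illustration rather than Lagan's code: it assumes the anchors are exact matches given as `(start_a, start_b, length)` tuples that are already collinear and non-overlapping, it uses a linear gap penalty, and the scoring values and the function names `needleman_wunsch` and `align_with_anchors` are assumptions made for the example.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-2):
    """Global alignment of x and y; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return score[n][m], "".join(reversed(ax)), "".join(reversed(ay))


def align_with_anchors(a, b, anchors):
    """Copy the anchors unchanged and run the DP only on the regions between them."""
    out_a, out_b = [], []
    pos_a = pos_b = 0
    for start_a, start_b, length in anchors:
        _, gap_a, gap_b = needleman_wunsch(a[pos_a:start_a], b[pos_b:start_b])
        out_a += [gap_a, a[start_a:start_a + length]]
        out_b += [gap_b, b[start_b:start_b + length]]
        pos_a, pos_b = start_a + length, start_b + length
    # Align whatever remains after the last anchor.
    _, gap_a, gap_b = needleman_wunsch(a[pos_a:], b[pos_b:])
    return "".join(out_a + [gap_a]), "".join(out_b + [gap_b])
```

Because the quadratic dynamic program only sees the short inter-anchor regions, its cost is bounded by the sizes of those regions rather than by the product of the genome lengths, which is the motivation for anchoring.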
Multi-Lagan
- (i)
- Creation of initial global maps: the approximate global map between each sequence pair is determined using the Lagan algorithm;
- (ii)
- Progressive multiple alignment with anchors:
- (1)
- Alignment between the two closest sequences according to the binary phylogenetic tree is performed using the Lagan algorithm;
- (2)
- Each alignment is a new multi-sequence. The rough global maps of this multi-sequence and the closest sequence are identified;
- (3)
- Steps (1) and (2) are repeated, and global alignment is carried out between the multi-sequences and the closest sequence;
- (4)
- Step 2 is repeated until a multiple alignment of the entire set of sequences has been performed (Figure 4).
2.2.2. Mauve
- (i)
- Finding multi-MUMs: When the algorithm detects anchors in a comparison of multiple genomes, some repetitive regions can occur several times in each genome as duplications. The more genomes are aligned, the more difficult it becomes to place each anchor correctly within the global alignment. To resolve this problem, Mauve uses multiple maximal unique matches (multi-MUMs) with a minimal length k as anchors. These multi-MUMs are exactly matching subsequences shared by two or more genomes that occur only once in each genome and that are bounded on either side by mismatched nucleotides. In addition, to detect anchors shorter than k, Mauve uses an anchoring technique that reduces k while looking for smaller anchors in the remaining unmatched regions.
- (ii)
- Calculating a phylogenetic guide tree: Mauve uses the genomes’ similar regions indicated by the subset of multi-MUMs as a binary distance metric to build a phylogenetic guide tree using neighbor joining [25].
- (iii)
- Selecting a set of anchors: This step consists in the detection of homologous subsequences. These regions are called locally collinear blocks (LCBs). Each locally collinear block is a homologous region shared by two or more genomes and does not contain any rearrangement of similar blocks.
- (iv)
- Recursive anchoring and gapped alignment: The previous step may not detect all the regions of homology between the genomes, as a minimum length k is required for a region to be considered as an LCB. To resolve this problem, two recursive anchoring techniques are performed. The first technique consists in the detection of similar regions outside of LCBs to extend the number of LCBs and identify new ones. The second technique consists of detecting unanchored regions within LCBs. For that reason, new LCBs of the minimum length k may be identified as outside LCB regions.
- (v)
- Regions that are not unique in the entire genome may be unique in regions outside it.
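The notion of a locally collinear block in step (iii) can be made concrete with a small sketch. It assumes only two genomes, forward-strand anchors given as `(pos_a, pos_b)` pairs with unique positions, and no weighting or filtering of blocks, so it is far simpler than what Mauve does for multiple genomes with inversions. The function name `collinear_blocks` is hypothetical.

```python
def collinear_blocks(anchors):
    """Group anchors into runs that are consecutive in both genomes."""
    by_a = sorted(anchors)
    # Rank of every anchor when the anchors are ordered by their position in genome B.
    rank_b = {anchor: r for r, anchor in enumerate(sorted(anchors, key=lambda t: t[1]))}
    blocks = []
    for anchor in by_a:
        # An anchor extends the current block only if it is also the next anchor in B.
        if blocks and rank_b[anchor] == rank_b[blocks[-1][-1]] + 1:
            blocks[-1].append(anchor)
        else:
            blocks.append([anchor])
    return blocks
```

With the anchors `[(0, 50), (10, 60), (20, 0), (30, 10), (40, 80)]`, the sketch returns three blocks: the first two anchors, the next two (which are translocated to the start of genome B), and the final anchor on its own. Each boundary between blocks corresponds to a rearrangement breakpoint.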
2.2.3. BLASTZ
2.2.4. LASTZ
- (i)
- This algorithm starts by detecting short near matches (seeds) between target and query sequences;
- (ii)
- Next, these seeds are extended into longer alignments using a heuristic approach. An adaptive score threshold is calculated taking into account various factors such as the scoring matrix, gap penalties, and statistical significance;
- (iii)
- A dynamic programming approach is adopted to align the non-extended blocks and construct the final alignment result, similar to the Lagan method.
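Steps (i) and (ii) follow the general seed-and-extend pattern, which the sketch below illustrates. It makes several simplifying assumptions that differ from LASTZ itself: seeds are exact k-mers rather than near matches, extension is ungapped and only to the right, and the fixed `x_drop` cutoff stands in for the adaptive score threshold. The function names and scoring values are hypothetical.

```python
def find_seeds(target, query, k):
    """Return (target_pos, query_pos) for every shared exact k-mer."""
    index = {}
    for i in range(len(target) - k + 1):
        index.setdefault(target[i:i + k], []).append(i)
    return [(t, q)
            for q in range(len(query) - k + 1)
            for t in index.get(query[q:q + k], [])]


def extend_right(target, query, t, q, match=1, mismatch=-1, x_drop=2):
    """Ungapped extension from (t, q); returns (length, score) of the best prefix."""
    score = best = 0
    length = best_len = 0
    while t + length < len(target) and q + length < len(query):
        score += match if target[t + length] == query[q + length] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            # The score has fallen too far below the best seen so far: stop.
            break
    return best_len, best
```

The extension keeps the highest-scoring prefix and abandons the diagonal once the running score drops more than `x_drop` below that maximum, so a few mismatches are tolerated but a long divergent stretch terminates the extension.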
2.2.5. DIALIGN
- (i)
- The algorithm starts by identifying anchor genes that are conserved across the genomic sequences being compared. These anchor genes serve as reference points for aligning and extending the gene neighborhoods;
- (ii)
- Then, it extends the gene neighborhoods by searching for additional genes that are located nearby in the genomic sequences and exhibit sequence similarity to the anchor genes;
- (iii)
- It uses a scoring system to evaluate the similarity between genes and determine the quality of the alignments. The scoring criteria may include sequence identity, gap penalties, and other parameters that reflect the degree of conservation between genes;
- (iv)
- As the algorithm extends the gene neighborhoods and identifies conserved gene pairs, it clusters these genes together based on their proximity and sequence similarity.
2.2.6. AnchorWave
2.2.7. Minimap2
2.3. Graph-Based Homology Mapping Methods
2.3.1. Mercator
2.3.2. Mugsy
2.3.3. BubbZ
2.3.4. SibeliaZ
2.3.5. Progressive Cactus
- Initial Pairwise Alignment:
- The process begins with aligning two closely related genomes using a pairwise alignment algorithm;
- This initial alignment is used to create a baseline graph structure, where nodes represent sequences and edges represent alignment matches, identifying regions of sequence similarity and dissimilarity;
- Additional Genome Integration:
- As additional genomes are integrated, the alignment graph is expanded iteratively, one genome at a time;
- Each new genome is incorporated into the existing graph by identifying sequences and aligning these to the graph’s nodes, updating the graph structure to include new alignment paths;
- This step ensures optimal alignment quality across the entire set of genomes, leveraging the graph structure to maintain comprehensive genomic relationships;
- Progressive Refinement:
- The existing graph undergoes iterative refinement as new genomes are added, optimizing the graph structure based on sequence homologies and structural constraints;
- This refinement process adjusts the graph’s edges and nodes to enhance alignment accuracy, maintaining both sequence similarity and structural integrity across the graph;
- Iterative Interaction:
- Progressive Cactus iterates through the alignment process multiple times, refining the alignment graph at each step;
- Each interaction improves the graph’s accuracy and consistency, ensuring that the final graph represents the most accurate genomic relationships;
- Parallelization:
- To enhance computational efficiency, Progressive Cactus employs parallelization techniques;
- The graph manipulation and alignment processes are distributed across multiple computing nodes or cores, enabling faster processing of the alignment graph for large genomic datasets.
2.3.6. GraphAligner
- Graph Construction:
- The reference genome is first constructed into a graph where nodes represent sequence blocks and edges represent possible transitions between these blocks. This can include linear sequences as well as alternative paths representing variations or repeating elements;
- Seed Finding:
- The algorithm begins with identifying seed matches between the read and sub-sequences in the graph. Seeds are typically exact matches that are used as anchor points for more detailed alignment;
- Seed Extension:
- Using dynamic programming, seeds are extended along the paths of the graph. The extension process respects the graph’s topology, exploring different paths where the read might align with segments of the reference;
- Path Scoring and Selection:
- After extension, paths are scored based on the alignment’s quality, and the best-scoring path is selected as the optimal alignment. The scoring system considers mismatches, gaps, and the overall coherence of the alignment within the graph structure;
- Handling Complexity:
- GraphAligner is specifically adept at handling complex graph structures with varying paths, such as those introduced by repeats or structural variants, which are often challenging for linear alignment methods.
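The idea of extending an alignment along the paths of a graph can be sketched with a plain dynamic program. The example below is an illustration under strong assumptions and is not GraphAligner's algorithm: the graph must be acyclic, every node must carry a non-empty sequence, the cost is unit edit distance, there is no seeding or banding, and the alignment may start and end anywhere in the graph. The function name `graph_edit_distance` is hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def graph_edit_distance(nodes, edges, read):
    """Minimum edit distance between read and any path (or sub-path) of a DAG.

    nodes maps a node id to its sequence; edges is a list of (source, target).
    """
    preds = {n: set() for n in nodes}
    for u, v in edges:
        preds[v].add(u)
    m = len(read)
    start_row = list(range(m + 1))  # cost of read prefixes placed before the path begins
    last_row = {}                   # node id -> DP row after the node's final character
    best = m                        # aligning the read against an empty path
    for node in TopologicalSorter(preds).static_order():
        # Rows that may precede the first character of this node.
        incoming = [start_row] + [last_row[p] for p in preds[node]]
        for ch in nodes[node]:
            row = [min(r[0] for r in incoming) + 1]
            for j in range(1, m + 1):
                substitute = min(r[j - 1] for r in incoming) + (read[j - 1] != ch)
                skip_graph_char = min(r[j] for r in incoming) + 1
                skip_read_char = row[j - 1] + 1
                row.append(min(substitute, skip_graph_char, skip_read_char))
            best = min(best, row[m])
            # Inside a node, the only predecessors are the previous character or a fresh start.
            incoming = [start_row, row]
        last_row[node] = row
    return best
```

With `nodes = {1: "ACG", 2: "T", 3: "A", 4: "GGC"}` and `edges = [(1, 2), (1, 3), (2, 4), (3, 4)]`, the graph encodes a single-base variant; a read such as `"CGAGG"` matches one branch exactly, while `"CGCGG"` matches neither branch and is reported at distance 1.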
2.4. Hashing Techniques in Genome Alignment
2.4.1. Tools Utilizing Hashing Techniques
- SOAP: SOAP (short oligonucleotide alignment program) uses a straightforward hashing strategy to align short reads against a reference genome [51]. It is particularly efficient in handling high-throughput sequencing data, making it a popular choice for SNP discovery and genotyping analyses (Figure 10). The complexity of SOAP depends on the size and complexity of the input sequences. It typically has a worst-case time complexity of O(n²), where n is the length of the input sequences;
- Stampy: While primarily relying on the Burrows–Wheeler transform for alignment, Stampy incorporates hashing to improve alignment accuracy in regions with high sequence variability [52]. This makes it exceptionally useful for aligning reads from organisms with significant genetic diversity or in studies focusing on evolutionary differences;
- BLAST: The basic local alignment search tool (BLAST) is a widely used algorithm for comparing a query sequence against a database or reference sequence [34]. BLAST employs hashing to rapidly find regions of local similarity, facilitating a broad range of analyses from gene identification to comparative genomics;
- GMAP: Optimized for spliced alignments, such as those needed in RNA-seq data analysis, GMAP uses hashing to index the genome [53], allowing efficient identification of splice junctions and exons (Figure 11). Its capability to handle long reads makes it suitable for transcriptome studies where reads span multiple exons. GMAP’s complexity is O(n), where n is the length of the reference sequence.
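The hashing strategy shared by these tools can be illustrated with a generic sketch. It does not reproduce the index layout of SOAP, Stampy, BLAST, or GMAP: it assumes a plain in-memory hash table of exact k-mers and places each read by a simple vote over candidate offsets. The names `build_index` and `map_read` are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(reference, k):
    """Hash every k-mer of the reference to the list of positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read, index, k):
    """Return the reference offset supported by the most k-mer hits, or None."""
    votes = Counter()
    for j in range(len(read) - k + 1):
        # Each hit at reference position i implies the read starts at offset i - j.
        for i in index.get(read[j:j + k], ()):
            votes[i - j] += 1
    if not votes:
        return None
    offset, _ = votes.most_common(1)[0]
    return offset
```

Building the table takes one pass over the reference, and each read then costs only a handful of constant-time lookups, which is why hashing suits large volumes of short reads. A read containing a mismatch is still placed as long as at least one of its k-mers is error-free.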
2.4.2. Efficiency and Applications
- Efficiency in SNP discovery: Tools like SOAP leverage hashing for rapid identification of potential SNP locations by quickly mapping reads to the reference genome [54], a process integral to variant calling and genotyping studies;
- Handling genetic diversity: Stampy’s use of hashing to accommodate high sequence variability allows researchers to study populations with significant genetic differences, making it invaluable for evolutionary biology and conservation genetics [55];
- Comparative genomics with BLAST: BLAST’s hashing-based search mechanism is efficient in identifying homologous sequences across species, aiding in the annotation of novel genomes and the discovery of evolutionarily conserved regions [56].
3. Algorithmic Aspects of WGA Algorithms
3.1. Performance Characteristics
3.2. Methodological Underpinnings
4. Recent Advancements in WGA Algorithms
5. Comprehensive Analysis of Genomic Comparison Tools: Human and Diverse Genomes
5.1. Analyses of Human Genomes
- MUMmer4 demonstrated a remarkable ability to detect near-identical sequences in human genomes, reporting a similarity of almost 100%. This high similarity index indicates that MUMmer4 is particularly efficient at aligning genomes with minimal genetic variations, making it an ideal tool for studies where the genomes are closely related;
- SibeliaZ presented a different perspective, reporting a much lower similarity percentage. SibeliaZ’s methodology, which focuses on k-mer pattern analysis, allows it to uncover subtle variations that direct sequence comparison methods might miss. This attribute is particularly beneficial for research that delves into genomic diversity, mutation analysis, and evolutionary biology;
- Dialign2 encountered limitations with the human genomes, primarily due to their size. This outcome underscores the necessity of considering genomic data scale when selecting analytical tools, particularly for large and complex genomes;
- Mirroring MUMmer4 in effectiveness, Minimap2 showed a 100% mapping rate, aligning the human genomes completely. It demonstrated additional capabilities in managing complex mapping situations, indicated by the presence of secondary and supplementary alignments, thus offering a broader scope in genomic analysis;
- The alignment process using GraphAligner failed, resulting in an empty output GAM file;
- Progressive Cactus encountered challenges in achieving the anticipated alignments across whole human genomes, succeeding only at the level of individual chromosomes. This outcome underscores the necessity for refinement in accommodating the diversity and complexity of genomic datasets. The alignment between human Chromosome 1 and Chromosome 2 was effectively visualized using Circos, based on the output generated by Progressive Cactus. This visualization highlighted the comparative genomic architecture and facilitated the identification of conserved and divergent regions between these two chromosomes (Figure 12).
5.2. Analysis of Human and Mouse Genomes
- MUMmer4: During the execution of the “nucmer” command, we encountered an error related to the construction of the suffix tree, indicating that the input sequence length exceeded the maximum allowable limit. As a result, the alignment process failed to proceed, preventing the comparison between the human and mouse genomes using MUMmer4;
- SibeliaZ: The analysis indicated that the human and mouse genomes shared a substantial number of junctions, highlighting common genomic features and evolutionary conservation between the two species. The identification of 1,784,620 blocks suggested regions of significant sequence similarity, potentially corresponding to conserved genes, regulatory elements, or functional domains across the genomes. The coverage level of 15% suggested that a considerable portion of the genomes aligned with each other, indicating a substantial level of evolutionary conservation and shared ancestry between humans and mice.
- Progressive Cactus: This method faced obstacles in achieving the expected alignments between entire human and mouse genomes, mirroring its performance limitations observed in human-to-human genome comparisons. This suggests that the tool may be better suited for shorter sequence comparisons, such as chromosome-level alignments, rather than whole-genome analyses spanning across species.
- Dialign2: When applied to the alignment of human versus mouse genomes, Dialign2 failed to produce an alignment, largely because of the substantial size of these genomes. The same failure occurred in the human vs. human genome alignment, which underscores the importance of refining alignment methodologies to handle such large and complex datasets effectively.
- GraphAligner: The alignment process using GraphAligner failed, resulting in an empty output GAM file;
- Minimap2: Out of the total reads processed (2589), 98.45% were successfully mapped. The high mapping rate of 98.45% indicated a close evolutionary relationship between the human and mouse genomes, facilitating robust alignment of sequencing reads between these species.
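A mapping rate such as the one reported for Minimap2 can be derived from SAM output by inspecting the FLAG field. The sketch below is one possible way to do so and is not necessarily how the figures above were obtained: it assumes that secondary (0x100) and supplementary (0x800) records are excluded so that each query is counted once, and the function name `mapping_rate` is hypothetical.

```python
def mapping_rate(sam_lines):
    """Fraction of primary SAM records that are mapped (FLAG bit 0x4 not set)."""
    total = mapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])
        if flag & 0x900:
            continue  # secondary or supplementary alignment of an already counted query
        total += 1
        if not flag & 0x4:
            mapped += 1
    return mapped / total if total else 0.0
```

Tools that include secondary and supplementary records in the denominator will report a different percentage for the same file, so the counting convention should be stated alongside any mapping rate.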
5.3. Analysis of C. elegans and Baker’s Yeast Genomes
- In stark contrast to its performance with human genomes, MUMmer4 yielded no output for the C. elegans and baker’s yeast genomes. This indicated MUMmer4’s limitations in aligning genomes that are not closely related, thereby suggesting its niche application in genomic studies;
- SibeliaZ revealed only 12% coverage in conserved regions between C. elegans and baker’s yeast, identifying 21,816 distinct blocks. This low coverage indicated a substantial evolutionary distance between these species. SibeliaZ’s strength lies in its ability to analyze and compare genomes with significant structural and evolutionary variations, making it a versatile tool for comparative genomics across diverse species;
- Like its performance with human genomes, Dialign2 was unsuccessful in processing the data from C. elegans and baker’s yeast, further highlighting its limitations in handling diverse genomic data.
- Unlike its effective mapping of human genomes, Minimap2 failed to map the C. elegans and baker’s yeast genomes, indicating challenges in aligning significantly divergent genomes.
5.4. Execution Times and Tool Complexity
5.5. Methodological Insights
6. Challenges in Whole-Genome Alignment
6.1. Computational Challenges
6.2. Biological Relevance
6.3. Future Directions
7. Discussion
8. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Guerfali, F.; Laouini, D.; Boudabous, A.; Tekaia, F. Designing and running an advanced Bioinformatics and genome analyses course in Tunisia. PLoS Comput. Biol. 2019, 15, e1006373.
- Goldfeder, R.L.; Wall, D.P.; Khoury, M.J.; Ioannidis, J.P.A.; Ashley, E.A. Human Genome Sequencing at the Population Scale: A Primer on High-Throughput DNA Sequencing and Analysis. Am. J. Epidemiol. 2017, 186, 1000–1009.
- Needleman, S.B.; Wunsch, C.D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 1970, 48, 443–453.
- Smith, T.F.; Waterman, M.S. Identification of common molecular subsequences. J. Mol. Biol. 1981, 147, 195–197.
- Tørresen, O.K.; Star, B.; Mier, P.; Andrade-Navarro, M.A.; Bateman, A.; Jarnot, P.; Gruca, A.; Grynberg, M.; Kajava, A.V.; Promponas, V.J. Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases. Nucleic Acids Res. 2019, 47, 10994–11006.
- Medina-Medina, N.; Broka, A.; Lacey, S.; Lin, H.; Klings, E.; Baldwin, C.; Steinberg, M.; Sebastiani, P. Comparing Bowtie and BWA to align short reads from a RNA-Seq experiment. In Proceedings of the 6th International Conference on Practical Applications of Computational Biology & Bioinformatics, Salamanca, Spain, 28–30 March 2012; Springer: Berlin/Heidelberg, Germany, 2012.
- Nakano, K.; Shiroma, A.; Shimoji, M.; Tamotsu, H.; Ashimine, N.; Ohki, S.; Shinzato, M.; Minami, M.; Nakanishi, T.; Teruya, K. Advantages of genome sequencing by long-read sequencer using SMRT technology in medical area. Hum. Cell 2017, 30, 149–161.
- Pinese, M.; Lacaze, P.; Rath, E.M.; Stone, A.; Brion, M.-J.; Ameur, A.; Nagpal, S.; Puttick, C.; Husson, S.; Degrave, D.; et al. The Medical Genome Reference Bank contains whole genome and phenotype data of 2570 healthy elderly. Nat. Commun. 2020, 11, 435.
- Anderson, W.; Aretz, A.; Barker, A.D.; Bell, C.; Bernabé, R.R.; Bhan, M.; Calvo, F.; Eerola, I.; Gerhard, D.S.; Guttmacher, A. International network of cancer genome projects. Nature 2010, 464, 993–998.
- Blake, J.A.; Baldarelli, R.; Kadin, J.A.; Richardson, J.E.; Smith, C.L.; Bult, C.J.; Mouse Genome Database Group. Mouse Genome Database (MGD): Knowledgebase for mouse–human comparative biology. Nucleic Acids Res. 2021, 49, D981–D987.
- Abascal, F.; Acosta, R.; Addleman, N.J.; Adrian, J.; Afzal, V.; Ai, R.; Aken, B.; Akiyama, J.A.; Jammal, O.A.; Amrhein, H.; et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 2020, 583, 699–710.
- Morgenstern, B. DIALIGN 2: Improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 1999, 15, 211–218.
- Delcher, A.L.; Phillippy, A.; Carlton, J.; Salzberg, S.L. Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res. 2002, 30, 2478–2483.
- Gusfield, D. Algorithms on strings, trees, and sequences: Computer science and computational biology. ACM Sigact News 1997, 28, 41–60.
- Farruggia, A.; Gagie, T.; Navarro, G.; Puglisi, S.J.; Siren, J. Relative Suffix Trees. Comput. J. 2018, 61, 773–788.
- Tian, Y.; Tata, S.; Hankins, R.A.; Patel, J.M. Practical methods for constructing suffix trees. VLDB J. 2005, 14, 281–299.
- Delcher, A.L.; Kasif, S.; Fleischmann, R.D.; Peterson, J.; White, O.; Salzberg, S.L. Alignment of whole genomes. Nucleic Acids Res. 1999, 27, 2369–2376.
- Marcais, G.; Delcher, A.L.; Phillippy, A.M.; Coston, R.; Salzberg, S.L.; Zimin, A. MUMmer4: A fast and versatile genome alignment system. PLoS Comput. Biol. 2018, 14, e1005944.
- Kurtz, S.; Phillippy, A.; Delcher, A.L.; Smoot, M.; Shumway, M.; Antonescu, C.; Salzberg, S.L. Versatile and open software for comparing large genomes. Genome Biol. 2004, 5, R12.
- Yang, T.; Liu, R.; Luo, Y.; Hu, S.; Wang, D.; Wang, C.; Pandey, M.K.; Ge, S.; Xu, Q.; Li, N.; et al. Improved pea reference genome and pan-genome highlight genomic features and evolutionary characteristics. Nat. Genet. 2022, 54, 1553–1563.
- Soares, I.; Goios, A.; Amorim, A. Sequence comparison alignment-free approach based on suffix tree and L-words frequency. Sci. World J. 2012, 2012, 450124.
- Navarro, G.; Mäkinen, V. Compressed full-text indexes. ACM Comput. Surv. (CSUR) 2007, 39, 2-es.
- Su, W.; Liao, X.; Lu, Y.; Zou, Q.; Peng, S. Multiple sequence alignment based on a suffix tree and center-star strategy: A linear method for multiple nucleotide sequence alignment on spark parallel framework. J. Comput. Biol. 2017, 24, 1230–1242.
- Zou, Q.; Guo, M.Z.; Wang, X.K.; Zhang, T.T. An Algorithm for DNA Multiple Sequence Alignment Based on Center Star Method and Keyword Tree. Acta Electronica Sin. 2009, 37, 1746–1750.
- Chatzou, M.; Magis, C.; Chang, J.-M.; Kemena, C.; Bussotti, G.; Erb, I.; Notredame, C. Multiple sequence alignment modeling: Methods and applications. Brief. Bioinform. 2016, 17, 1009–1023.
- Thompson, J.D.; Linard, B.; Lecompte, O.; Poch, O. A comprehensive benchmark study of multiple sequence alignment methods: Current challenges and future perspectives. PLoS ONE 2011, 6, e18093.
- Darling, A.C.; Mau, B.; Blattner, F.R.; Perna, N.T. Mauve: Multiple alignment of conserved genomic sequence with rearrangements. Genome Res. 2004, 14, 1394–1403.
- Brudno, M.; Do, C.B.; Cooper, G.M.; Kim, M.F.; Davydov, E.; Green, E.D.; Sidow, A.; Batzoglou, S.; NISC Comparative Sequencing Program. LAGAN and Multi-LAGAN: Efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. 2003, 13, 721–731.
- Wan, X.; Karniadakis, G.E. An adaptive multi-element generalized polynomial chaos method for stochastic differential equations. J. Comput. Phys. 2005, 209, 617–642.
- Eppstein, D.; Galil, Z.; Giancarlo, R.; Italiano, G.F. Sparse dynamic programming I: Linear cost functions. J. ACM 1992, 39, 519–545.
- Popendorf, K.; Tsuyoshi, H.; Osana, Y.; Sakakibara, Y. Murasaki: A fast, parallelizable algorithm to find anchors from multiple genomes. PLoS ONE 2010, 5, e12651.
- Darling, A.E.; Mau, B.; Perna, N.T. progressiveMauve: Multiple genome alignment with gene gain, loss and rearrangement. PLoS ONE 2010, 5, e11147.
- Thompson, J.D.; Higgins, D.G.; Gibson, T.J. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22, 4673–4680.
- Tatusova, T.A.; Madden, T.L. BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiol. Lett. 1999, 174, 247–250.
- Ma, B.; Tromp, J.; Li, M. PatternHunter: Faster and more sensitive homology search. Bioinformatics 2002, 18, 440–445.
- Schwartz, S.; Kent, W.J.; Smit, A.; Zhang, Z.; Baertsch, R.; Hardison, R.C.; Haussler, D.; Miller, W. Human–mouse alignments with BLASTZ. Genome Res. 2003, 13, 103–107.
- Harris, R.S. Improved Pairwise Alignment of Genomic DNA; The Pennsylvania State University: State College, PA, USA, 2007.
- Bu, L.; Wang, Q.; Gu, W.; Yang, R.; Zhu, D.; Song, Z.; Liu, X.; Zhao, Y. Improving read alignment through the generation of alternative reference via iterative strategy. Sci. Rep. 2020, 10, 18712.
- Minkin, I.; Medvedev, P. Scalable multiple whole-genome alignment and locally collinear block construction with SibeliaZ. Nat. Commun. 2020, 11, 6327.
- Al Ait, L.; Yamak, Z.; Morgenstern, B. DIALIGN at GOBICS—Multiple sequence alignment using various sources of external information. Nucleic Acids Res. 2013, 41, W3–W7.
- Subramanian, A.R.; Kaufmann, M.; Morgenstern, B. DIALIGN-TX: Greedy and progressive approaches for segment-based multiple sequence alignment. Algorithms Mol. Biol. 2008, 3, 6.
- Song, B.; Marco-Sola, S.; Moreto, M.; Johnson, L.; Buckler, E.S.; Stitzer, M.C. AnchorWave: Sensitive alignment of genomes with high sequence diversity, extensive structural polymorphism, and whole-genome duplication. Proc. Natl. Acad. Sci. USA 2022, 119, e2113075119.
- Li, H. New strategies to improve minimap2 alignment accuracy. Bioinformatics 2021, 37, 4572–4574.
- Li, H. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics 2018, 34, 3094–3100.
- Dewey, C.N. Aligning multiple whole genomes with Mercator and MAVID. Comp. Genom. 2008, 221–235.
- Angiuoli, S.V.; Salzberg, S.L. Mugsy: Fast multiple alignment of closely related whole genomes. Bioinformatics 2010, 27, 334–342.
- Minkin, I.; Medvedev, P. Scalable pairwise whole-genome homology mapping of long genomes with BubbZ. IScience 2020, 23, 101224.
- Dabbaghie, F.; Ebler, J.; Marschall, T. BubbleGun: Enumerating bubbles and superbubbles in genome graphs. Bioinformatics 2022, 38, 4217–4219.
- Armstrong, J.; Hickey, G.; Diekhans, M.; Fiddes, I.T.; Novak, A.M.; Deran, A.; Fang, Q.; Xie, D.; Feng, S.; Stiller, J.; et al. Progressive Cactus is a multiple-genome aligner for the thousand-genome era. Nature 2020, 587, 246–251.
- Rautiainen, M.; Marschall, T. GraphAligner: Rapid and versatile sequence-to-graph alignment. Genome Biol. 2020, 21, 253.
- Li, R.; Li, Y.; Kristiansen, K.; Wang, J. SOAP: Short oligonucleotide alignment program. Bioinformatics 2008, 24, 713–714. [Google Scholar] [CrossRef]
- Lunter, G.; Goodson, M. Stampy: A statistical algorithm for sensitive and fast mapping of Illumina sequence reads. Genome Res. 2011, 21, 936–939. [Google Scholar] [CrossRef]
- Wu, T.D.; Watanabe, C.K. GMAP: A genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 2005, 21, 1859–1875. [Google Scholar] [CrossRef]
- Cui, Y.; Liao, X.; Peng, S.; Lu, Y.; Yang, C.; Wang, B.; Wu, C. Large-scale neo-heterogeneous programming and optimization of SNP detection on Tianhe-2. In Proceedings of the High Performance Computing: 30th International Conference, ISC High Performance 2015, Frankfurt, Germany, 12–16 July 2015; Proceedings 30. Springer: Berlin/Heidelberg, Germany, 2015. [Google Scholar]
- Capblancq, T.; Butnor, J.R.; Deyoung, S.; Thibault, E.; Munson, H.; Nelson, D.M.; Fitzpatrick, M.C.; Keller, S.R. Whole-exome sequencing reveals a long-term decline in effective population size of red spruce (Picea rubens). Evol. Appl. 2020, 13, 2190–2205. [Google Scholar] [CrossRef]
- Kuznetsov, A.; Bollin, C.J. NCBI genome workbench: Desktop software for comparative genomics, visualization, and GenBank data submission. Mult. Seq. Alignment Methods Protoc. 2021, 261–295. [Google Scholar] [CrossRef]
- Saada, B.; Zhang, J. DNA sequences compression algorithm based on extended-ASCII representation. In Proceedings of the World Congress on Engineering and Computer Science, San Francisco, CA, USA, 21–23 October 2015. [Google Scholar]
- Silva, M.; Pratas, D.; Pinho, A.J. Efficient DNA sequence compression with neural networks. GigaScience 2020, 9, giaa119. [Google Scholar] [CrossRef]
- Corbett, R.D.; Eveleigh, R.; Whitney, J.; Barai, N.; Bourgey, M.; Chuah, E.; Johnson, J.; Moore, R.A.; Moradin, N.; Mungall, K.L. A distributed whole genome sequencing benchmark study. Front. Genet. 2020, 11, 612515. [Google Scholar] [CrossRef]
- Marco-Sola, S.; Eizenga, J.M.; Guarracino, A.; Paten, B.; Garrison, E.; Moreto, M. Optimal gap-affine alignment in O(s) space. Bioinformatics 2023, 39, btad074. [Google Scholar] [CrossRef]
- Alser, M.; Rotman, J.; Deshpande, D.; Taraszka, K.; Shi, H.; Baykal, P.I.; Yang, H.T.; Xue, V.; Knyazev, S.; Singer, B.D.; et al. Technology dictates algorithms: Recent developments in read alignment. Genome Biol. 2021, 22, 249. [Google Scholar] [CrossRef]
- Rhie, A.; Nurk, S.; Cechova, M.; Hoyt, S.J.; Taylor, D.J.; Altemose, N.; Hook, P.W.; Koren, S.; Rautiainen, M.; Alexandrov, I.A.; et al. The complete sequence of a human Y chromosome. Nature 2023, 621, 344–354. [Google Scholar] [CrossRef]
- Zhou, Y.; Zheng, J.; Wu, Y.; Zhang, W.; Jin, J. A completeness-independent method for pre-selection of closely related genomes for species delineation in prokaryotes. BMC Genom. 2020, 21, 183. [Google Scholar] [CrossRef]
- Gardner, S.N.; Hiddessen, A.L.; Williams, P.L.; Hara, C.; Wagner, M.C.; Colston, B.W., Jr. Multiplex primer prediction software for divergent targets. Nucleic Acids Res. 2009, 37, 6291–6304. [Google Scholar] [CrossRef]
- Dewey, C.N. Whole-Genome Alignment. In Evolutionary Genomics: Statistical and Computational Methods, Volume 1; Anisimova, M., Ed.; Humana Press: Totowa, NJ, USA, 2012; pp. 237–257. [Google Scholar]
- Löytynoja, A. Alignment methods: Strategies, challenges, benchmarking, and comparative overview. In Volutionary Genomics: Statistical and Computational Methods, Volume 1; Springer: Berlin/Heidelberg, Germany, 2012; pp. 203–235. [Google Scholar]
- Couronne, O.; Poliakov, A.; Bray, N.; Ishkhanov, T.; Ryaboy, D.; Rubin, E.; Pachter, L.; Dubchak, I. Strategies and tools for whole-genome alignments. Genome Res. 2003, 13, 73–80.
- Govek, K.W.; Yamajala, V.S.; Camara, P.G. Clustering-independent analysis of genomic data using spectral simplicial theory. PLoS Comput. Biol. 2019, 15, e1007509.
- Wu, Y.; Johnson, L.; Song, B.; Romay, C.; Stitzer, M.; Siepel, A.; Buckler, E.; Scheben, A. A multiple alignment workflow shows the effect of repeat masking and parameter tuning on alignment in plants. Plant Genome 2022, 15, e20204.
- Kille, B.; Balaji, A.; Sedlazeck, F.J.; Nute, M.; Treangen, T.J. Multiple genome alignment in the telomere-to-telomere assembly era. Genome Biol. 2022, 23, 182.
- Huang, C.; Li, R.; Li, A. Parallel Implementation of Key Algorithms for Intelligent Processing of Graphic Signal Data of Consumer Digital Equipment. Mob. Netw. Appl. 2023.
- Nolle, T.; Seeliger, A.; Thoma, N.; Mühlhäuser, M. DeepAlign: Alignment-based process anomaly correction using recurrent neural networks. In Proceedings of the International Conference on Advanced Information Systems Engineering, Grenoble, France, 8–12 June 2020; Springer: Berlin/Heidelberg, Germany, 2020.
- Peltzer, A.; Jäger, G.; Herbig, A.; Seitz, A.; Kniep, C.; Krause, J.; Nieselt, K. EAGER: Efficient ancient genome reconstruction. Genome Biol. 2016, 17, 60.
- Song, B.; Buckler, E.S.; Stitzer, M.C. New whole-genome alignment tools are needed for tapping into plant diversity. Trends Plant Sci. 2024, 29, 355–369.
- Earl, D.; Nguyen, N.; Hickey, G.; Harris, R.S.; Fitzgerald, S.; Beal, K.; Seledtsov, I.; Molodtsov, V.; Raney, B.J.; Clawson, H. Alignathon: A competitive assessment of whole-genome alignment methods. Genome Res. 2014, 24, 2077–2089.
- Schadt, E.E.; Linderman, M.D.; Sorenson, J.; Lee, L.; Nolan, G.P. Computational solutions to large-scale data management and analysis. Nat. Rev. Genet. 2010, 11, 647–657.
- Ye, C.; Hill, C.M.; Wu, S.; Ruan, J.; Ma, Z. DBG2OLC: Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies. Sci. Rep. 2016, 6, 31900.
- Kshemkalyani, A.D.; Singhal, M. Distributed Computing: Principles, Algorithms, and Systems; Cambridge University Press: Cambridge, UK, 2011.
- Volozonoka, L.; Miskova, A.; Gailite, L. Whole genome amplification in preimplantation genetic testing in the era of massively parallel sequencing. Int. J. Mol. Sci. 2022, 23, 4819.
- Uffelmann, E.; Huang, Q.Q.; Munung, N.S.; de Vries, J.; Okada, Y.; Martin, A.R.; Martin, H.C.; Lappalainen, T.; Posthuma, D. Genome-wide association studies. Nat. Rev. Methods Primers 2021, 1, 59.
- Girisha, M.N.; Badiger, V.P.; Pattar, S. A comprehensive review of global alignment of multiple biological networks: Background, applications and open issues. Netw. Model. Anal. Health Inform. Bioinform. 2022, 11, 9.
- Hennig, A.; Nieselt, K. Efficient merging of genome profile alignments. Bioinformatics 2019, 35, i71–i80.
- Armstrong, J.; Fiddes, I.T.; Diekhans, M.; Paten, B. Whole-genome alignment and comparative annotation. Annu. Rev. Anim. Biosci. 2019, 7, 41–64.
- Macaulay, I.C.; Voet, T. Single cell genomics: Advances and future perspectives. PLoS Genet. 2014, 10, e1004126.
- Shi, L.; Wang, Z. Computational strategies for scalable genomics analysis. Genes 2019, 10, 1017.
- Ryva, B.; Zhang, K.; Asthana, A.; Wong, D.; Vicioso, Y.; Parameswaran, R. Wheat germ agglutinin as a potential therapeutic agent for leukemia. Front. Oncol. 2019, 9, 100.
- Taylor, J.; Yudkowsky, E.; LaVictoire, P.; Critch, A. Alignment for advanced machine learning systems. Ethics Artif. Intell. 2016, 342–382.
MUMmer Version | Complexity | Main Steps | Main Improvement |
---|---|---|---|
MUMmer 1 | O(n²) | Construct suffix trees, identify MEMs, extend MEMs, cluster MUMs | Initial implementation of suffix tree-based matching |
MUMmer 2 | O(n²) | Construct suffix trees, identify MEMs, extend MEMs, cluster MUMs | Improved optimization and refinement of suffix tree algorithms |
MUMmer 3.1 | O(n lg n) | Construct enhanced suffix arrays, efficient MEM identification, extend MEMs, cluster MUMs | Introduction of enhanced suffix arrays for improved efficiency |
MUMmer 4 | O(n lg n) | Construct suffix trees, efficient MEM identification, extend MEMs, cluster MUMs | Optimization of suffix tree usage for enhanced algorithm performance |
Approach | Method | Type |
---|---|---|
Suffix tree-based methods | MUMmer | Local alignment |
Suffix tree-based methods | MUMmer 4.0 | Global multiple genome alignment |
Suffix tree-based methods | Suffix tree & Lword | Global multiple genome alignment |
Suffix tree-based methods | Multiple Sequence Alignment (MSA) | Local alignment |
Anchor-based methods | LAGAN/Multi-LAGAN | Global multiple genome alignment |
Anchor-based methods | ProgressiveMauve | Hierarchical WGA mapping |
Anchor-based methods | BlastZ | Local alignment |
Anchor-based methods | STELLAR | Local alignment |
Anchor-based methods | LASTZ | Local alignment |
Anchor-based methods | DIALIGN | Global multiple genome alignment |
Graph-based methods | AnchorWave | Global alignment |
Graph-based methods | MERCATOR | Homology mapping |
Graph-based methods | Mugsy | Hierarchical WGA mapping |
Graph-based methods | BubbZ | Homology mapping |
Graph-based methods | SibeliaZ | Hierarchical WGA mapping |
Graph-based methods | Progressive Cactus | Global multiple genome alignment |
Graph-based methods | GraphAligner | Global multiple genome alignment |
Hash-based alignment algorithms | SOAP | Hash-based mapping |
Hash-based alignment algorithms | Stampy | Hybrid local alignment |
Hash-based alignment algorithms | BLAST | Local similarity search |
Hash-based alignment algorithms | GMAP | Spliced alignment mapping |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Saada, B.; Zhang, T.; Siga, E.; Zhang, J.; Magalhães Muniz, M.M. Whole-Genome Alignment: Methods, Challenges, and Future Directions. Appl. Sci. 2024, 14, 4837. https://doi.org/10.3390/app14114837