Why Assembling Plant Genome Sequences Is So Challenging
Abstract
1. Introduction
2. From Sanger Technology to NGS: Getting Plants off the Ground
3. Challenging Features of Plant Genomes
3.1. Sampling
3.2. Genome Size and Complexity
3.3. Transposable Elements
3.4. Heterozygosity
3.5. Polyploidy
3.6. Gene Content and Gene Families
3.7. Non-Coding RNAs
3.8. Widely Distributed Repetitive Sequences (Low Complexity Sequences)
- Repetitions among chromosomes: Duplications occur both within chromosomes (e.g., ~250 tandem duplications, each of ~10 kbp, on Chromosome 2 of Arabidopsis) and between chromosomes (e.g., ~4 Mbp long regions between Chromosomes 2 and 4, or 700 kbp long regions between Chromosomes 1 and 2 in Arabidopsis; ~3 Mbp at the termini of the short arms of Chromosomes 11 and 12 in rice, as well as of Chromosomes 5 and 8 in sorghum) [62,74].
- rDNA units: These contain the rRNA genes, which are present in hundreds of copies. Each unit is typically ~10 kbp long in plants and, taken together, the units represent up to 10% of the genome (for example, 8% in Arabidopsis [75]). They have not yet been resolved by any sequencing technology.
- Satellites: These are arrays of tens to thousands of identical or nearly identical copies of a repeated unit. They are abundant at centromeres and in constitutive heterochromatin. For example, a total of 3% of the Arabidopsis genome consists of the 180 bp centromeric repeat [76]. As a result of satellites, most sequenced chromosomes are split into two sequences, the right arm and the left arm, since the repetitive centromeric sequence between them is unknown.
- Microsatellites or SSRs (simple sequence repeats): These are short tandem repeats (in the range of kbp) of short motifs (1–5 bp) repeated a few hundred times or fewer, with different microsatellites having different motifs. They are often highly polymorphic with regard to the number of repeat units per array [77]. Microsatellites are mainly located in the subtelomeric region, which forms a border between the distally positioned structural genes and the telomeres, but they can also be found elsewhere.
- Telomeric sequences: These consist of a short sequence motif similar to TTTAGGG repeated in tandem arrays many hundreds of units long at the physical end of each chromosome arm. The length of the telomeric array is a species-specific characteristic, ranging from 2–5 kbp in Arabidopsis to 60–160 kbp in tobacco [62]. Moreover, the number of copies of the repeat motif differs among the chromosome arms of the same genome, and may even vary from cell to cell and from tissue to tissue [78]. Telomeres remain unknown at the sequence level in most species sequenced to date, since they are nearly impossible to assemble.
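The tandem organization of microsatellites and telomeric arrays described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not a method from this review: the function names `find_ssrs` and `longest_telomere_array`, the `min_units` threshold, and the toy sequences are hypothetical. Only perfect repeats on the given strand are detected, so degenerate copies and the reverse-complement motif (CCCTAAA) are ignored.

```python
import re


def find_ssrs(seq, min_units=4, max_motif=5):
    """Return (start, motif, units) for perfect tandem repeats of short motifs.

    The motif group is lazy, so the shortest repeating unit (1..max_motif bp)
    is reported; the back-reference then requires at least min_units copies.
    """
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_motif, min_units - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        units = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, units))
    return hits


def longest_telomere_array(seq, motif="TTTAGGG"):
    """Return the number of units in the longest uninterrupted run of motif."""
    best = 0
    for match in re.finditer("(?:%s)+" % re.escape(motif), seq.upper()):
        best = max(best, len(match.group(0)) // len(motif))
    return best


if __name__ == "__main__":
    # Toy sequence (hypothetical): a (CA)5 microsatellite flanked by unique bases.
    print(find_ssrs("GGCACACACACATT"))
    # Toy chromosome end (hypothetical): three telomeric units, then a single one.
    print(longest_telomere_array("ACGT" + "TTTAGGG" * 3 + "CC" + "TTTAGGG"))
```

When an array of this kind is longer than the reads, many reads consist entirely of repeat units and carry no flanking unique sequence, which is one way to see why such regions cannot be placed unambiguously during assembly.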
4. Confounding Factors for Plant Genome Assembly
4.1. Repetitive Nature of Plant Genomes
4.2. DNA Contamination
4.3. Sequencing Errors
4.4. Read Length
4.5. Quality Values
4.6. Number of Reads and Coverage
5. Seeking for the Best Assembly
6. Concluding Remarks
Acknowledgments
References
- Paterson, A.H.; Freeling, M.; Tang, H.; Wang, X. Insights from the comparison of plant genome sequences. Annu. Rev. Plant Biol. 2010, 61, 349–372. [Google Scholar] [CrossRef]
- Sterck, L.; Rombauts, S.; Vandepoele, K.; Rouze, P.; van de Peer, Y. How many genes are there in plants (... and why are they there)? Curr. Opin. Plant Biol. 2007, 10, 199–203. [Google Scholar] [CrossRef]
- Gregory, T.R. The C-value enigma in plants and animals: A review of parallels and an appeal for partnership. Ann. Bot. 2005, 95, 133–146. [Google Scholar] [CrossRef]
- Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 2000, 408, 796–815. [Google Scholar] [CrossRef]
- Feuillet, C.; Leach, J.E.; Rogers, J.; Schnable, P.S.; Eversole, K. Crop genome sequencing: Lessons and rationales. Trends Plant Sci. 2011, 16, 77–88. [Google Scholar] [CrossRef]
- International Rice Genome Sequencing Project. The map-based sequence of the rice genome. Nature 2005, 436, 793–800. [CrossRef]
- Ming, R.; Hou, S.; Feng, Y.; Yu, Q.; Dionne-Laporte, A.; Saw, J.H.; Senin, P.; Wang, W.; Ly, B.V.; Lewis, K.L.; et al. The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus). Nature 2008, 452, 991–996. [Google Scholar] [CrossRef]
- Schnable, P.S.; Ware, D.; Fulton, R.S.; Stein, J.C.; Wei, F.; Pasternak, S.; Liang, C.; Zhang, J.; Fulton, L.; Graves, T.A.; et al. The B73 maize genome: Complexity, diversity, and dynamics. Science 2009, 326, 1112–1115. [Google Scholar]
- Duvick, J.; Fu, A.; Muppirala, U.; Sabharwal, M.; Wilkerson, M.D.; Lawrence, C.J.; Lushbough, C.; Brendel, V. PlantGDB: A resource for comparative plant genomics. Nucleic Acids Res. 2008, 36, D959–D965. [Google Scholar]
- Varshney, R.K.; Close, T.J.; Singh, N.K.; Hoisington, D.A.; Cook, D.R. Orphan legume crops enter the genomics era! Curr. Opin. Plant Biol. 2009, 12, 202–210. [Google Scholar] [CrossRef]
- Armstead, I.; Huang, L.; Ravagnani, A.; Robson, P.; Ougham, H. Bioinformatics in the orphan crops. Brief. Bioinform. 2009, 10, 645–653. [Google Scholar] [CrossRef]
- Imelfort, M.; Edwards, D. De novo sequencing of plant genomes using second-generation technologies. Brief. Bioinform. 2009, 10, 609–618. [Google Scholar] [CrossRef]
- Goodstein, D.M.; Shu, S.; Howson, R.; Neupane, R.; Hayes, R.D.; Fazo, J.; Mitros, T.; Dirks, W.; Hellsten, U.; Putnam, N.; et al. Phytozome: A comparative platform for green plant genomics. Nucleic Acids Res. 2012, 40, D1178–D1186. [Google Scholar] [CrossRef]
- Hamilton, J.P.; Buell, C.R. Advances in plant genome sequencing. Plant J. 2012, 70, 177–190. [Google Scholar] [CrossRef]
- Proost, S.; Pattyn, P.; Gerats, T.; van de Peer, Y. Journey through the past: 150 million years of plant genome evolution. Plant J. 2011, 66, 58–65. [Google Scholar] [CrossRef]
- Ossowski, S.; Schneeberger, K.; Clark, R.M.; Lanz, C.; Warthmann, N.; Weigel, D. Sequencing of natural strains of Arabidopsis thaliana with short reads. Genome Res. 2008, 18, 2024–2033. [Google Scholar] [CrossRef]
- Springer, N.M.; Ying, K.; Fu, Y.; Ji, T.; Yeh, C.T.; Jia, Y.; Wu, W.; Richmond, T.; Kitzman, J.; Rosenbaum, H.; et al. Maize inbreds exhibit high levels of copy number variation (CNV) and presence/absence variation (PAV) in genome content. PLoS Genet. 2009, 5, e1000734. [Google Scholar] [CrossRef]
- Morgante, M.; de Paoli, E.; Radovic, S. Transposable elements and the plant pan-genomes. Curr. Opin. Plant Biol. 2007, 10, 149–155. [Google Scholar] [CrossRef]
- Plant Genomes Central. Available online: http://www.ncbi.nlm.nih.gov/genomes/PLANTS/PlantList.html (accessed on 14 September 2012).
- List of Sequenced Plant Genomes. Available online: http://en.wikipedia.org/wiki/List_of_sequenced_plant_genomes (accessed on 14 September 2012).
- Sanger, F.; Nicklen, S.; Coulson, A.R. DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. USA 1977, 74, 5463–5467. [Google Scholar] [CrossRef]
- Bräutigam, A.; Gowik, U. What can next generation sequencing do for you? Next generation sequencing as a valuable tool in plant research. Plant Biol. (Stuttg) 2010, 12, 831–841. [Google Scholar] [CrossRef]
- Goff, S.A.; Ricke, D.; Lan, T.H.; Presting, G.; Wang, R.; Dunn, M.; Glazebrook, J.; Sessions, A.; Oeller, P.; Varma, H.; et al. A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 2002, 296, 92–100. [Google Scholar]
- Yu, J.; Hu, S.; Wang, J.; Wong, G.K.; Li, S.; Liu, B.; Deng, Y.; Dai, L.; Zhou, Y.; Zhang, X.; et al. A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 2002, 296, 79–92. [Google Scholar] [CrossRef]
- Shendure, J.; Ji, H. Next-generation DNA sequencing. Nat. Biotechnol. 2008, 26, 1135–1145. [Google Scholar] [CrossRef]
- Mardis, E.R. Next-generation DNA sequencing methods. Annu. Rev. Genomics Hum. Genet. 2008, 9, 387–402. [Google Scholar] [CrossRef]
- Ansorge, W.J. Next-generation DNA sequencing techniques. N. Biotechnol. 2009, 25, 195–203. [Google Scholar] [CrossRef]
- Kircher, M.; Kelso, J. High-throughput DNA sequencing—Concepts and limitations. Bioessays 2010, 32, 524–536. [Google Scholar] [CrossRef]
- Zhou, X.; Ren, L.; Meng, Q.; Li, Y.; Yu, Y.; Yu, J. The next-generation sequencing technology and application. Protein Cell 2010, 1, 520–536. [Google Scholar] [CrossRef]
- Niedringhaus, T.P.; Milanova, D.; Kerby, M.B.; Snyder, M.P.; Barron, A.E. Landscape of next-generation sequencing technologies. Anal. Chem. 2011, 83, 4327–4341. [Google Scholar] [CrossRef]
- Pareek, C.S.; Smoczynski, R.; Tretyn, A. Sequencing technologies and genome sequencing. J. Appl. Genet. 2011, 52, 413–435. [Google Scholar] [CrossRef]
- Finotello, F.; Lavezzo, E.; Fontana, P.; Peruzzo, D.; Albiero, A.; Barzon, L.; Falda, M.; di Camillo, B.; Toppo, S. Comparative analysis of algorithms for whole-genome assembly of pyrosequencing data. Brief. Bioinform. 2012, 13, 269–280. [Google Scholar] [CrossRef]
- Alkan, C.; Sajjadian, S.; Eichler, E.E. Limitations of next-generation genome sequence assembly. Nat. Methods 2011, 8, 61–65. [Google Scholar] [CrossRef]
- Barthelson, R.; McFarlin, A.J.; Rounsley, S.D.; Young, S. Plantagora: Modeling whole genome sequencing and assembly of plant genomes. PLoS One 2011, 6, e28436. [Google Scholar]
- Wang, L.; Li, P.; Brutnell, T.P. Exploring plant transcriptomes using ultra high-throughput sequencing. Brief. Funct. Genomics 2010, 9, 118–128. [Google Scholar] [CrossRef]
- Vandepoele, K.; Quimbaya, M.; Casneuf, T.; de Veylder, L.; van de Peer, Y. Unraveling transcriptional control in Arabidopsis using cis-regulatory elements and coexpression networks. Plant Physiol. 2009, 150, 535–546. [Google Scholar] [CrossRef]
- He, F.; Zhou, Y.; Zhang, Z. Deciphering the Arabidopsis floral transition process by integrating a protein-protein interaction network and gene expression data. Plant Physiol. 2010, 153, 1492–1505. [Google Scholar] [CrossRef]
- Alvarez, J.M.; Vidal, E.A.; Gutierrez, R.A. Integration of local and systemic signaling pathways for plant N responses. Curr. Opin. Plant Biol. 2012, 15, 185–191. [Google Scholar] [CrossRef]
- Proost, S.; van Bel, M.; Sterck, L.; Billiau, K.; van Parys, T.; van de Peer, Y.; Vandepoele, K. PLAZA: A comparative genomics resource to study gene and genome evolution in plants. Plant Cell 2009, 21, 3718–3731. [Google Scholar] [CrossRef] [Green Version]
- Wegrzyn, J.L.; Lee, J.M.; Tearse, B.R.; Neale, D.B. TreeGenes: A forest tree genome database. Int. J. Plant Genomics 2008, 412875. [Google Scholar]
- Fernandez-Pozo, N.; Canales, J.; Guerrero-Fernandez, D.; Villalobos, D.P.; Diaz-Moreno, S.M.; Bautista, R.; Flores-Monterroso, A.; Guevara, M.A.; Perdiguero, P.; Collada, C.; et al. EuroPineDB: A high-coverage web database for maritime pine transcriptome. BMC Genomics 2011, 12, 366. [Google Scholar] [CrossRef] [Green Version]
- Rengel, D.; San Clemente, H.; Servant, F.; Ladouce, N.; Paux, E.; Wincker, P.; Couloux, A.; Sivadon, P.; Grima-Pettenati, J. A new genomic resource dedicated to wood formation in Eucalyptus. BMC Plant Biol. 2009, 9, 36. [Google Scholar] [CrossRef]
- Gonzalez-Ibeas, D.; Blanca, J.; Roig, C.; Gonzalez-To, M.; Pico, B.; Truniger, V.; Gomez, P.; Deleu, W.; Cano-Delgado, A.; Arus, P.; et al. MELOGEN: An EST database for melon functional genomics. BMC Genomics 2007, 8, 306. [Google Scholar]
- Goff, S.A.; Vaughn, M.; McKay, S.; Lyons, E.; Stapleton, A.E.; Gessler, D.; Matasci, N.; Wang, L.; Hanlon, M.; Lenards, A.; et al. The iPlant collaborative: Cyberinfrastructure for plant biology. Front. Plant Sci. 2011, 2, 34. [Google Scholar]
- Katari, M.S.; Nowicki, S.D.; Aceituno, F.F.; Nero, D.; Kelfer, J.; Thompson, L.P.; Cabello, J.M.; Davidson, R.S.; Goldberg, A.P.; Shasha, D.E.; et al. VirtualPlant: A software platform to support systems biology research. Plant Physiol. 2010, 152, 500–515. [Google Scholar] [CrossRef]
- Lapitan, N.L.V. Organization and evolution of higher plant nuclear genome. Genome 1992, 35, 171–181. [Google Scholar] [CrossRef]
- Janicki, M.; Rooke, R.; Yang, G. Bioinformatics and genomic analysis of transposable elements in eukaryotic genomes. Chromosome Res. 2011, 19, 787–808. [Google Scholar] [CrossRef]
- Wicker, T.; Sabot, F.; Hua-Van, A.; Bennetzen, J.L.; Capy, P.; Chalhoub, B.; Flavell, A.; Leroy, P.; Morgante, M.; Panaud, O.; et al. A unified classification system for eukaryotic transposable elements. Nat. Rev. Genet. 2007, 8, 973–982. [Google Scholar] [CrossRef]
- Bousios, A.; Darzentas, N.; Tsaftaris, A.; Pearce, S.R. Highly conserved motifs in non-coding regions of Sirevirus retrotransposons: The key for their pattern of distribution within and across plants? BMC Genomics 2010, 11, 89. [Google Scholar] [CrossRef]
- Treangen, T.J.; Salzberg, S.L. Repetitive DNA and next-generation sequencing: Computational challenges and solutions. Nat. Rev. Genet. 2012, 13, 36–46. [Google Scholar]
- Schatz, M.C.; Delcher, A.L.; Salzberg, S.L. Assembly of large genomes using second-generation sequencing. Genome Res. 2010, 20, 1165–1173. [Google Scholar] [CrossRef]
- Hochholdinger, F.; Hoecker, N. Towards the molecular basis of heterosis. Trends Plant Sci. 2007, 12, 427–432. [Google Scholar] [CrossRef]
- Tuskan, G.A.; Difazio, S.; Jansson, S.; Bohlmann, J.; Grigoriev, I.; Hellsten, U.; Putnam, N.; Ralph, S.; Rombauts, S.; Salamov, A.; et al. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 2006, 313, 1596–1604. [Google Scholar] [CrossRef]
- Jaillon, O.; Aury, J.M.; Noel, B.; Policriti, A.; Clepet, C.; Casagrande, A.; Choisne, N.; Aubourg, S.; Vitulo, N.; Jubin, C.; et al. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 2007, 449, 463–467. [Google Scholar]
- Kelley, D.R.; Salzberg, S.L. Detection and correction of false segmental duplications caused by genome mis-assembly. Genome Biol. 2010, 11, R28. [Google Scholar] [CrossRef]
- Comai, L. The advantages and disadvantages of being polyploid. Nat. Rev. Genet. 2005, 6, 836–846. [Google Scholar] [CrossRef]
- Meyers, L.A.; Levin, D.A. On the abundance of polyploids in flowering plants. Evolution 2006, 60, 1198–1206. [Google Scholar]
- Bento, M.; Gustafson, J.P.; Viegas, W.; Silva, M. Size matters in Triticeae polyploids: Larger genomes have higher remodeling. Genome 2011, 54, 175–183. [Google Scholar] [CrossRef]
- Tang, H.; Bowers, J.E.; Wang, X.; Ming, R.; Alam, M.; Paterson, A.H. Synteny and collinearity in plant genomes. Science 2008, 320, 486–488. [Google Scholar] [CrossRef]
- Potato Genome Sequencing Consortium; Xu, X.; Pan, S.; Cheng, S.; Zhang, B.; Mu, D.; Ni, P.; Zhang, G.; Yang, S.; Li, R.; et al. Genome sequence and analysis of the tuber crop potato. Nature 2011, 475, 189–195. [Google Scholar]
- Shulaev, V.; Sargent, D.J.; Crowhurst, R.N.; Mockler, T.C.; Folkerts, O.; Delcher, A.L.; Jaiswal, P.; Mockaitis, K.; Liston, A.; Mane, S.P.; et al. The genome of woodland strawberry (Fragaria vesca). Nat. Genet. 2011, 43, 109–116. [Google Scholar]
- Heslop-Harrison, J.S. Comparative genome organization in plants: From sequence and markers to chromatin and chromosomes. Plant Cell 2000, 12, 617–636. [Google Scholar]
- Giussani, L.M.; Cota-Sanchez, J.H.; Zuloaga, F.O.; Kellogg, E.A. A molecular phylogeny of the grass subfamily Panicoideae (Poaceae) shows multiple origins of C4 photosynthesis. Am. J. Bot. 2001, 88, 1993–2012. [Google Scholar] [CrossRef]
- Sappl, P.G.; Heazlewood, J.L.; Millar, A.H. Untangling multi-gene families in plants by integrating proteomics into functional genomics. Phytochemistry 2004, 65, 1517–1530. [Google Scholar] [CrossRef]
- Duarte, J.M.; Cui, L.; Wall, P.K.; Zhang, Q.; Zhang, X.; Leebens-Mack, J.; Ma, H.; Altman, N.; dePamphilis, C.W. Expression pattern shifts following duplication indicative of subfunctionalization and neofunctionalization in regulatory genes of Arabidopsis. Mol. Biol. Evol. 2006, 23, 469–478. [Google Scholar]
- Fernández-Pozo, N.; Guerrero-Fernández, D.; Bautista, R.; Claros, M.G. Full‑LengtherNext: A tool for fine-tuning de novo assembled transcriptomes of non-model organisms. Departamento de Biología Molecular y Bioquímica, Facultad de Ciencias, Universidad de Málaga, 29071 Málaga, Spain, and Plataforma Andaluza de Bioinformática, Centro de Supercomputación y Bioinformática, Edificio de Bioinnovación, Universidad de Málaga, 29590 Málaga, Spain. Unpublished work, to be submitted for publication. 2012. [Google Scholar]
- Phillippy, A.M.; Schatz, M.C.; Pop, M. Genome assembly forensics: Finding the elusive mis-assembly. Genome Biol. 2008, 9, R55. [Google Scholar] [CrossRef]
- Lai, J.; Li, Y.; Messing, J.; Dooner, H.K. Gene movement by Helitron transposons contributes to the haplotype variability of maize. Proc. Natl. Acad. Sci. USA 2005, 102, 9068–9073. [Google Scholar] [CrossRef]
- Freeling, M.; Lyons, E.; Pedersen, B.; Alam, M.; Ming, R.; Lisch, D. Many or most genes in Arabidopsis transposed after the origin of the order Brassicales. Genome Res. 2008, 18, 1924–1937. [Google Scholar] [CrossRef]
- Lindbo, J.A.; Silva-Rosales, L.; Proebsting, W.M.; Dougherty, W.G. Induction of a highly specific antiviral state in transgenic plants: Implications for regulation of gene expression and virus resistance. Plant Cell 1993, 5, 1749–1759. [Google Scholar]
- Huang, R.; Jaritz, M.; Guenzl, P.; Vlatkovic, I.; Sommer, A.; Tamir, I.M.; Marks, H.; Klampfl, T.; Kralovics, R.; Stunnenberg, H.G.; et al. An RNA-Seq strategy to detect the complete coding and non-coding transcriptome including full-length imprinted macro ncRNAs. PLoS One 2011, 6, e27288. [Google Scholar]
- Carninci, P.; Kasukawa, T.; Katayama, S.; Gough, J.; Frith, M.C.; Maeda, N.; Oyama, R.; Ravasi, T.; Lenhard, B.; Wells, C.; et al. The transcriptional landscape of the mammalian genome. Science 2005, 309, 1559–1563. [Google Scholar] [CrossRef]
- Gore, M.A.; Chia, J.M.; Elshire, R.J.; Sun, Q.; Ersoz, E.S.; Hurwitz, B.L.; Peiffer, J.A.; McMullen, M.D.; Grills, G.S.; Ross-Ibarra, J.; et al. A first-generation haplotype map of maize. Science 2009, 326, 1115–1117. [Google Scholar] [CrossRef]
- Wang, X.; Tang, H.; Bowers, J.E.; Paterson, A.H. Comparative inference of illegitimate recombination between rice and sorghum duplicated genes produced by polyploidization. Genome Res. 2009, 19, 1026–1032. [Google Scholar] [CrossRef]
- Pruitt, R.E.; Meyerowitz, E.M. Characterization of the genome of Arabidopsis thaliana. J. Mol. Biol. 1986, 187, 169–183. [Google Scholar] [CrossRef]
- Murata, M.; Ogura, Y.; Motoyoshi, F. Centromeric repetitive sequences in Arabidopsis thaliana. Jpn. J. Genet. 1994, 69, 361–371. [Google Scholar] [CrossRef]
- Horáková, M.; Fajkus, J. TAS4—A dispersed repetitive sequence isolated from subtelomeric regions of Nicotiana tomentosiformis chromosomes. Genome 2000, 43, 273–284. [Google Scholar]
- Kilian, A.; Stiff, C.; Kleinhofs, A. Barley telomeres shorten during differentiation but grow in callus culture. Proc. Natl. Acad. Sci. USA 1995, 92, 9555–9559. [Google Scholar] [CrossRef]
- Schatz, M.C.; Witkowski, J.; McCombie, W.R. Current challenges in de novo plant genome sequencing and assembly. Genome Biol. 2012, 13, 243. [Google Scholar]
- Tomato Genome Consortium. The tomato genome sequence provides insights into fleshy fruit evolution. Nature 2012, 485, 635–641. [Google Scholar] [Green Version]
- Garcia-Mas, J.; Benjak, A.; Sanseverino, W.; Bourgeois, M.; Mir, G.; González, V.M.; Hénaff, E.; Cámara, F.; Cozzuto, L.; Lowy, E.; et al. The genome of melon (Cucumis melo L.). Proc. Natl. Acad. Sci. USA 2012, in press. [Google Scholar]
- SeqTrimNext. Available online: http://www.scbi.uma.es/seqtrimnext (accessed on 14 September 2012).
- Falgueras, J.; Lara, A.J.; Fernandez-Pozo, N.; Canton, F.R.; Perez-Trabado, G.; Claros, M.G. SeqTrim: A high-throughput pipeline for pre-processing any type of sequence read. BMC Bioinformatics 2010, 11, 38. [Google Scholar] [CrossRef]
- Guerrero-Fernández, D.; Falgueras, J.; Claros, M.G. SCBI_MAPREDUCE: A task-farm, practical solution in Ruby for distribution of new and legacy bioinformatics software. IEEE Trans. Parallel. Distr. Syst. 2012. submitted for publication. [Google Scholar]
- Paszkiewicz, K.; Studholme, D.J. De novo assembly of short sequence reads. Brief. Bioinform. 2010, 11, 457–472. [Google Scholar] [CrossRef]
- Nakamura, K.; Oshima, T.; Morimoto, T.; Ikeda, S.; Yoshikawa, H.; Shiwa, Y.; Ishikawa, S.; Linak, M.C.; Hirai, A.; Takahashi, H.; et al. Sequence-specific error profile of Illumina sequencers. Nucleic Acids Res. 2011, 39, e90. [Google Scholar] [CrossRef]
- Minoche, A.E.; Dohm, J.C.; Himmelbauer, H. Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and genome analyzer systems. Genome Biol. 2011, 12, R112. [Google Scholar] [CrossRef]
- Hoffmann, S.; Otto, C.; Kurtz, S.; Sharma, C.M.; Khaitovich, P.; Vogel, J.; Stadler, P.F.; Hackermuller, J. Fast mapping of short sequences with mismatches, insertions and deletions using index structures. PLoS Comput. Biol. 2009, 5, e1000502. [Google Scholar] [CrossRef]
- Gilles, A.; Meglecz, E.; Pech, N.; Ferreira, S.; Malausa, T.; Martin, J.F. Accuracy and quality assessment of 454 GS-FLX Titanium pyrosequencing. BMC Genomics 2011, 12, 245. [Google Scholar] [CrossRef]
- Rasko, D.A.; Webster, D.R.; Sahl, J.W.; Bashir, A.; Boisen, N.; Scheutz, F.; Paxinos, E.E.; Sebra, R.; Chin, C.S.; Iliopoulos, D.; et al. Origins of the E. coli strain causing an outbreak of hemolytic-uremic syndrome in Germany. N. Engl. J. Med. 2011, 365, 709–717. [Google Scholar] [CrossRef]
- Balzer, S.; Malde, K.; Jonassen, I. Systematic exploration of error sources in pyrosequencing flowgram data. Bioinformatics 2011, 27, i304–i309. [Google Scholar] [CrossRef]
- Miller, J.R.; Koren, S.; Sutton, G. Assembly algorithms for next-generation sequencing data. Genomics 2010, 95, 315–327. [Google Scholar]
- Medvedev, P.; Pham, S.; Chaisson, M.; Tesler, G.; Pevzner, P. Paired de bruijn graphs: A novel approach for incorporating mate pair information into genome assemblers. J. Comput. Biol. 2011, 18, 1625–1634. [Google Scholar] [CrossRef]
- Compeau, P.E.; Pevzner, P.A.; Tesler, G. How to apply de Bruijn graphs to genome assembly. Nat. Biotechnol. 2011, 29, 987–991. [Google Scholar] [CrossRef]
- Earl, D.; Bradnam, K.; St. John, J.; Darling, A.; Lin, D.; Fass, J.; Yu, H.O.; Buffalo, V.; Zerbino, D.R.; Diekhans, M.; et al. Assemblathon 1: A competitive assessment of de novo short read assembly methods. Genome Res. 2011, 21, 2224–2241. [Google Scholar] [CrossRef]
- Huang, X.; Madan, A. CAP3: A DNA sequence assembly program. Genome Res. 1999, 9, 868–877. [Google Scholar] [CrossRef]
- Benzekri, H.; Bautista, R.; Guerrero-Fernández, D.; Claros, M.G. Departamento de Biología Molecular y Bioquímica, Facultad de Ciencias, Universidad de Málaga, 29071 Málaga, Spain, and Plataforma Andaluza de Bioinformática, Centro de Supercomputación y Bioinformática, Edificio de Bioinnovación, Universidad de Málaga, 29590 Málaga, Spain. Unpublished work. 2012. [Google Scholar]
- Lander, E.S.; Waterman, M.S. Genomic mapping by fingerprinting random clones: A mathematical analysis. Genomics 1988, 2, 231–239. [Google Scholar]
- Aird, D.; Ross, M.G.; Chen, W.S.; Danielsson, M.; Fennell, T.; Russ, C.; Jaffe, D.B.; Nusbaum, C.; Gnirke, A. Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol. 2011, 12, R18. [Google Scholar] [CrossRef]
- Li, Z.; Chen, Y.; Mu, D.; Yuan, J.; Shi, Y.; Zhang, H.; Gan, J.; Li, N.; Hu, X.; Liu, B.; et al. Comparison of the two major classes of assembly algorithms: Overlap-layout-consensus and de Bruijn-graph. Brief. Funct. Genomics 2012, 11, 25–37. [Google Scholar] [CrossRef]
- FullLengtherNext. Available online: http://www.scbi.uma.es/fulllengthernext (accessed on 14 September 2012).
- Loblolly Pine Genome Project. Available online: http://dendrome.ucdavis.edu/NealeLab/lpgp/ (accessed on 14 September 2012).
- Díaz-Sala, C.; Cervera, M. Promoting a functional and comparative understanding of the conifer genome-implementing applied aspects for more productive and adapted forests (ProCoGen). BMC Proceedings 2011, 5, P158. [Google Scholar]
- Kumar, S.; Blaxter, M.L. Comparing de novo assemblers for 454 transcriptome data. BMC Genomics 2010, 11, 571. [Google Scholar] [CrossRef]
- Sommer, D.D.; Delcher, A.L.; Salzberg, S.L.; Pop, M. Minimus: A fast, lightweight genome assembler. BMC Bioinformatics 2007, 8, 64. [Google Scholar] [CrossRef]
- Zheng, Y.; Zhao, L.; Gao, J.; Fei, Z. iAssembler: A package for de novo assembly of Roche-454/Sanger transcriptome sequences. BMC Bioinformatics 2011, 12, 453. [Google Scholar] [CrossRef]
- Iorizzo, M.; Senalik, D.A.; Grzebelus, D.; Bowman, M.; Cavagnaro, P.F.; Matvienko, M.; Ashrafi, H.; van Deynze, A.; Simon, P.W. De novo assembly and characterization of the carrot transcriptome reveals novel genes, new markers, and genetic diversity. BMC Genomics 2011, 12, 389. [Google Scholar] [CrossRef]
- Martin, J.; Bruno, V.M.; Fang, Z.; Meng, X.; Blow, M.; Zhang, T.; Sherlock, G.; Snyder, M.; Wang, Z. Rnnotator: An automated de novo transcriptome assembly pipeline from stranded RNA-Seq reads. BMC Genomics 2010, 11, 663. [Google Scholar] [CrossRef]
- Gnerre, S.; Maccallum, I.; Przybylski, D.; Ribeiro, F.J.; Burton, J.N.; Walker, B.J.; Sharpe, T.; Hall, G.; Shea, T.P.; Sykes, S.; et al. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc. Natl. Acad. Sci. USA 2011, 108, 1513–1518. [Google Scholar]
- Simpson, J.T.; Wong, K.; Jackman, S.D.; Schein, J.E.; Jones, S.J.; Birol, I. ABySS: A parallel assembler for short read sequence data. Genome Res. 2009, 19, 1117–1123. [Google Scholar] [CrossRef]
© 2012 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).
Claros, M.G.; Bautista, R.; Guerrero-Fernández, D.; Benzerki, H.; Seoane, P.; Fernández-Pozo, N. Why Assembling Plant Genome Sequences Is So Challenging. Biology 2012, 1, 439-459. https://doi.org/10.3390/biology1020439