Algorithms for Sequence Analysis and Storage

A special issue of Algorithms (ISSN 1999-4893).

Deadline for manuscript submissions: closed (15 March 2013) | Viewed by 49678

Special Issue Editor

Department of Computer Science, University of Helsinki, P.O. Box 68, FI-00014 Helsinki, Finland
Interests: algorithms and data structures; computational molecular biology; sequence analysis; string algorithms; data compression; algorithm engineering

Special Issue Information

Dear Colleagues,

Analysis of high-throughput sequencing data has become a crucial component of genome research. For example, methods based on the latest developments in compressed data structures, namely index structures exploiting the Burrows-Wheeler transform, are widely deployed in the discovery of disease-causing mutations. Such approaches succeed because they resolve the dilemma of an index requiring more space than the data itself when the data is already enormous. Parallel computation and the use of special hardware such as GPUs have also proven to be important paradigms for providing scalable analysis methods. With our increased knowledge of the genomic structure of the whole human population, and with the development of sequencing techniques and their applications in studying RNAs, metapopulations, and epigenetics, the field seeks new, innovative, and universal algorithmic approaches that scale to current and future needs in the analysis and storage of biological sequences. This special issue is dedicated to approaches to biological sequence analysis that have algorithmic novelty and potential for fundamental impact on the methods used for genome research. Theoretical studies that increase our understanding of the limits of indexing, compression, and parallel computation in this context are also welcome.

Prof. Dr. Veli Mäkinen
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once registered, authors can proceed to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Algorithms is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.


Keywords

  • high-throughput sequencing
  • compressed data structures
  • parallel computation
  • sequence alignment
  • fragment assembly
  • genomics
  • transcriptomics
  • metagenomics
  • epigenomics

Published Papers (7 papers)

Editorial

71 KiB  
Editorial
Editorial: Special Issue on Algorithms for Sequence Analysis and Storage
by Veli Mäkinen
Algorithms 2014, 7(1), 186-187; https://doi.org/10.3390/a7010186 - 25 Mar 2014
Viewed by 4783
Abstract
This special issue of Algorithms is dedicated to approaches to biological sequence analysis that have algorithmic novelty and potential for fundamental impact in methods used for genome research.
(This article belongs to the Special Issue Algorithms for Sequence Analysis and Storage)

Research

613 KiB  
Article
Modeling Dynamic Programming Problems over Sequences and Trees with Inverse Coupled Rewrite Systems
by Robert Giegerich and Hélène Touzet
Algorithms 2014, 7(1), 62-144; https://doi.org/10.3390/a7010062 - 07 Mar 2014
Cited by 12 | Viewed by 9986
Abstract
Dynamic programming is a classical algorithmic paradigm, which often allows the evaluation of a search space of exponential size in polynomial time. Recursive problem decomposition, tabulation of intermediate results for re-use, and Bellman’s Principle of Optimality are its well-understood ingredients. However, algorithms often lack abstraction and are difficult to implement, tedious to debug, and delicate to modify. The present article proposes a generic framework for specifying dynamic programming problems. This framework can handle all kinds of sequential inputs, as well as tree-structured data. Biosequence analysis, document processing, molecular structure analysis, comparison of objects assembled in a hierarchic fashion, and generally, all domains come under consideration where strings and ordered, rooted trees serve as natural data representations. The new approach introduces inverse coupled rewrite systems. They describe the solutions of combinatorial optimization problems as the inverse image of a term rewrite relation that reduces problem solutions to problem inputs. This specification leads to concise yet translucent specifications of dynamic programming algorithms. Their actual implementation may be challenging, but eventually, as we hope, it can be produced automatically. The present article demonstrates the scope of this new approach by describing a diverse set of dynamic programming problems which arise in the domain of computational biology, with examples in biosequence and molecular structure analysis.
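As a small illustration of the ingredients named in the abstract above (recursive decomposition, tabulation of intermediate results, and the Principle of Optimality), the following is a conventional textbook edit-distance table in Python. It is only a sketch of hand-written dynamic programming of the kind the article aims to abstract away; it does not implement inverse coupled rewrite systems, and the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance between a and b via a tabulated recurrence."""
    # prev[j] is the distance between the already processed prefix of a and b[:j].
    # Before any character of a is read, turning "" into b[:j] costs j insertions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # Turning a[:i] into the empty prefix of b costs i deletions.
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # match (cost 0) or substitute (cost 1)
            ))
        prev = cur
    return prev[-1]
```

Only two rows of the table are kept at a time, since each cell depends solely on the current and the previous row.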

479 KiB  
Article
Sublinear Time Motif Discovery from Multiple Sequences
by Bin Fu, Yunhui Fu and Yuan Xue
Algorithms 2013, 6(4), 636-677; https://doi.org/10.3390/a6040636 - 14 Oct 2013
Cited by 2 | Viewed by 5835
Abstract
In this paper, a natural probabilistic model for motif discovery has been used to experimentally test the quality of motif discovery programs. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet, Σ. A motif G = g_1 g_2 ... g_m is a string of m characters. In each background sequence is implanted a probabilistically-generated approximate copy of G. For a probabilistically-generated approximate copy b_1 b_2 ... b_m of G, every character, b_i, is probabilistically generated, such that the probability for b_i ≠ g_i is at most α. We develop two new randomized algorithms and one new deterministic algorithm. They make advancements in the following aspects: (1) The algorithms are much faster than those before. Our algorithms can even run in sublinear time. (2) They can handle any motif pattern. (3) The restriction for the alphabet size is a lower bound of four. This gives them potential applications in practical problems, since gene sequences have an alphabet size of four. (4) All algorithms have rigorous proofs about their performances. The methods developed in this paper have been used in the software implementation. We observed some encouraging results that show improved performance for motif detection compared with other software.
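The probabilistic model described in the abstract above can be sketched as a small generator in Python. This is an illustrative reading of the model, not the authors' software: the function name `implant_motif`, the uniform choice of background characters and implant position, and the use of exactly α as the per-character mutation probability (which satisfies "at most α") are all assumptions made here. The sketch also assumes the sequence length is at least the motif length.

```python
import random

def implant_motif(motif, k, length, alpha, alphabet="ACGT", seed=0):
    """Return k (sequence, position) pairs, each hiding a noisy copy of motif."""
    rng = random.Random(seed)
    result = []
    for _ in range(k):
        # Background: every character drawn independently from the alphabet.
        background = [rng.choice(alphabet) for _ in range(length)]
        # Approximate copy: each motif character is changed with probability alpha.
        copy = []
        for g in motif:
            if rng.random() < alpha:
                copy.append(rng.choice([c for c in alphabet if c != g]))
            else:
                copy.append(g)
        # Implant the copy at a random position (assumed uniform here).
        start = rng.randrange(length - len(motif) + 1)
        background[start:start + len(motif)] = copy
        result.append(("".join(background), start))
    return result
```

A motif discovery program would receive only the sequences; the returned positions are kept so that its output can be checked against the implanted copies.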

165 KiB  
Article
Efficient in silico Chromosomal Representation of Populations via Indexing Ancestral Genomes
by Niina Haiminen, Filippo Utro, Claude Lebreton, Pascal Flament, Zivan Karaman and Laxmi Parida
Algorithms 2013, 6(3), 430-441; https://doi.org/10.3390/a6030430 - 30 Jul 2013
Cited by 5 | Viewed by 7839
Abstract
One of the major challenges in handling realistic forward simulations for plant and animal breeding is the sheer number of markers. Due to advancing technologies, the requirement has quickly grown from hundreds of markers to millions. Most simulators are lagging behind in handling these sizes, since they do not scale well. We present a scheme for representing and manipulating such realistic size genomes, without any loss of information. Usually, the simulation is forward and over tens to hundreds of generations with hundreds of thousands of individuals at each generation. We demonstrate through simulations that our representation can be two orders of magnitude faster and handle at least two orders of magnitude more markers than existing software on realistic breeding scenarios.

427 KiB  
Article
Filtering Degenerate Patterns with Application to Protein Sequence Analysis
by Matteo Comin and Davide Verzotto
Algorithms 2013, 6(2), 352-370; https://doi.org/10.3390/a6020352 - 22 May 2013
Cited by 4 | Viewed by 6091
Abstract
In biology, the notion of a degenerate pattern plays a central role in describing various phenomena. For example, protein active site patterns, like those contained in the PROSITE database, e.g., [FY]DPC[LIM][ASG]C[ASG], are, in general, represented by degenerate patterns with character classes. Researchers have developed several approaches over the years to discover degenerate patterns. Although these methods have been exhaustively and successfully tested on genomes and proteins, their outcomes often far exceed the size of the original input, making the output hard to manage and to interpret in refined analyses that require manual inspection. In this paper, we discuss a characterization of degenerate patterns with character classes, without gaps, and we introduce the concept of pattern priority for comparing and ranking different patterns. We define the class of underlying patterns for filtering any set of degenerate patterns into a new set that is linear in the size of the input sequence. We present some preliminary results on the detection of subtle signals in protein families. Results show that our approach drastically reduces the number of patterns in output for a tool for protein analysis, while retaining the representative patterns.
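To make the notion of a degenerate pattern with character classes and no gaps concrete, the following Python sketch parses a pattern written in the bracket notation quoted in the abstract above and reports where it occurs in a sequence. It illustrates pattern matching only, not the filtering or pattern-priority method of the article; the function names `parse_pattern` and `find_occurrences` are hypothetical, and only plain letters and bracketed classes are assumed to appear in the pattern.

```python
def parse_pattern(pattern: str) -> list[frozenset[str]]:
    """Turn a pattern such as 'A[CG]T' into one set of allowed characters per position."""
    classes = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "[":
            # A bracketed class: any one of the enclosed characters may occur.
            j = pattern.index("]", i)
            classes.append(frozenset(pattern[i + 1:j]))
            i = j + 1
        else:
            # A solid character is a class of size one.
            classes.append(frozenset(pattern[i]))
            i += 1
    return classes

def find_occurrences(pattern: str, text: str) -> list[int]:
    """Return all start positions where every text character falls in its class."""
    classes = parse_pattern(pattern)
    m = len(classes)
    return [
        s for s in range(len(text) - m + 1)
        if all(text[s + k] in classes[k] for k in range(m))
    ]
```

This naive scan takes time proportional to the product of pattern length and text length, which is adequate for short active-site patterns.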

560 KiB  
Article
Practical Compressed Suffix Trees
by Andrés Abeliuk, Rodrigo Cánovas and Gonzalo Navarro
Algorithms 2013, 6(2), 319-351; https://doi.org/10.3390/a6020319 - 21 May 2013
Cited by 24 | Viewed by 7948
Abstract
The suffix tree is an extremely important data structure in bioinformatics. Classical implementations require much space, which renders them useless to handle large sequence collections. Recent research has obtained various compressed representations for suffix trees, with widely different space-time tradeoffs. In this paper we show how the use of range min-max trees yields novel representations achieving practical space/time tradeoffs. In addition, we show how those trees can be modified to index highly repetitive collections, obtaining the first compressed suffix tree representation that effectively adapts to that scenario.

288 KiB  
Article
Multi-Sided Compression Performance Assessment of ABI SOLiD WES Data
by Tommaso Mazza and Stefano Castellana
Algorithms 2013, 6(2), 309-318; https://doi.org/10.3390/a6020309 - 21 May 2013
Cited by 2 | Viewed by 6569
Abstract
Data storage has been a major and growing part of IT budgets for research for many years. Especially in biology, the amount of raw data products is growing continuously, and the advent of the so-called "next-generation" sequencers has made things worse. Affordable prices have pushed scientists to massively sequence whole genomes and to screen large cohorts of patients, thereby producing tons of data as a side effect. The need to fit as much data as possible into the available storage volumes has encouraged and welcomed new compression algorithms and tools. We focus here on state-of-the-art compression tools and measure their compression performance on ABI SOLiD data.