A Survey on Shortest Unique Substring Queries
Abstract
1. Introduction
2. Preliminaries
2.1. Definitions
2.2. Data Structures
3. Position-SUS Queries
3.1. Motivation
3.2. Suffix Trees Based Approach
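- Find the leaf node corresponding to the suffix T[k..n].
- If the label of the leaf edge is $, the shortest unique substring starting at position k does not exist and we return null; otherwise, we continue.
- Let l be the length of the label of the leaf edge (excluding $).
- The shortest unique substring starting at position k is T[k..n − l + 1].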
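The steps above can be sketched without building a suffix tree. The following is a minimal illustration, not the construction used in the cited works: it replaces the suffix tree with a naively sorted list of suffixes, and the function names `common_prefix_len` and `shortest_unique_from` are hypothetical. The string depth of the parent of a leaf equals the longest prefix that the suffix shares with any other suffix, and that maximum is always attained by one of the two neighbours in sorted order. Positions are 0-based here.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def shortest_unique_from(text: str, k: int):
    """Return (start, end), 0-based inclusive, of the shortest unique
    substring of `text` that starts at position k, or None if none exists.

    Assumption: a naive O(n^2 log n) suffix sort stands in for the suffix tree.
    """
    n = len(text)
    # Suffix start positions in lexicographic order (a naive suffix array).
    order = sorted(range(n), key=lambda i: text[i:])
    r = order.index(k)
    # Longest prefix shared with a neighbouring suffix; this plays the role
    # of the string depth of the leaf's parent in the suffix tree.
    shared = 0
    for nb in (r - 1, r + 1):
        if 0 <= nb < n:
            shared = max(shared, common_prefix_len(text[k:], text[order[nb]:]))
    end = k + shared  # one character past the shared prefix
    if end >= n:
        # Every prefix of this suffix repeats elsewhere (leaf edge is only "$").
        return None
    return (k, end)
```

For example, on the string "banana" the call `shortest_unique_from("banana", 1)` reports the substring "anan", while position 3 has no unique substring starting there because "a", "an" and "ana" all occur elsewhere.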
3.3. Linear Time Approaches
3.3.1. Ileri et al.’s Framework
3.3.2. Tsuruta et al.’s Framework
3.4. In Place and Compact Data Structures’ Approaches
4. Interval-SUS Queries
4.1. Linear Time Approaches
- 1. if it is unique. This can be checked in constant time and linear space.
- 2. : This can be computed in constant time using linear space by Lemma 4.
- 3. : This can be computed in constant time using linear space by Lemma 4.
- 4. The shortest containing : It remains to show their structure for computing this candidate.
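To make the semantics of an interval query concrete, here is a brute-force reference sketch. It is not the linear-space structure discussed in this section: it simply tries all substrings that cover the query interval in order of increasing length and returns the first one that occurs exactly once. The function names `count_occurrences` and `interval_sus` are hypothetical, positions are 0-based and inclusive, and ties are broken towards the leftmost candidate.

```python
def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    pos = text.find(pattern)
    while pos != -1:
        count += 1
        pos = text.find(pattern, pos + 1)
    return count


def interval_sus(text: str, a: int, b: int):
    """Return (start, end) of a shortest unique substring covering [a, b].

    Brute force for illustration only; assumes 0 <= a <= b < len(text).
    """
    n = len(text)
    for length in range(b - a + 1, n + 1):
        # A window of this length covers [a, b] iff start <= a and end >= b.
        first = max(0, b - length + 1)
        last = min(a, n - length)
        for s in range(first, last + 1):
            if count_occurrences(text, text[s:s + length]) == 1:
                return (s, s + length - 1)
    return None
```

Because the whole string occurs exactly once in itself, the search always succeeds for a non-empty input. A baseline of this kind is mainly useful for testing a faster implementation against small inputs.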
4.2. RLE-Based Approaches
4.3. Compact Data Structures’ Approaches
5. Approximate-SUS Queries
5.1. Motivation
5.2. Hon et al.’s Framework
5.3. Allen et al.’s Framework
5.4. GPU Based Approach
6. SUPS Queries
6.1. Motivation
6.2. Optimal Approaches
6.3. RLE-Based Approaches
7. Range-SUS
8. Discussion and Future Work
- We discussed the solutions for approximate SUS queries in Section 5. However, there is no efficient in-place algorithm which can find s to get s afterward. Another technique that can be applied to approximate SUS queries is to consider the run-length encoded (RLE) representation of the input string. Section 4.2 shows this technique for solving interval-SUS queries. To our knowledge, an RLE-based approach for solving approximate SUS queries has not been studied. In addition, no work considers the standard external memory model for solving the approximate SUS problem. As I/O-efficient constructions of the suffix array and LCP array exist [45,46,47,48], it seems possible to change the RAM model algorithm for the construction of these arrays to the external memory model.
- In Section 4.2, the in the query time of Theorem 6 is , which is actually the time for performing dynamic predecessor/successor queries using space [8]. To make the query faster within the same space, the question is whether there exists a data structure of size that can efficiently answer Problem 3 without using predecessor/successor queries.
- As we discussed in Section 6, palindromic substrings are well motivated in computational biology. All the reviewed works find exact SUPSs. As with the approximate SUS problem, the approximate SUPS query is also worth studying in order to account for errors and mutations. Besides the definition of Problem 5, the following definition is well motivated in bioinformatics: a nucleotide sequence is considered a palindrome if the reverse of its complementary strand is equal to the original sequence [49]. The question is whether the methods discussed in Section 6 can be applied to solve this problem efficiently.
- The last topic that we discussed was the range-SUS problem. According to Theorem 12, queries can be solved in time using a data structure of size word. The question is whether we can design an efficient -word data structure for the problem. In addition, the approximate version of range-SUS queries has not been studied. It may be possible to combine the technique discussed in Section 7 with the framework of Thankachan et al. [50] to provide an efficient algorithm for the approximate range-SUS problem.
- Besides shortest unique substrings, Maximal Unique Matches is an important concept in computational biology for aligning two long genome sequences [51]. Ganguly et al. [18] applied a technique similar to the one discussed in Section 3.4 to find maximal unique matches of two strings. As far as we are aware, the dynamic version (when mismatches are allowed) of this problem has not been studied yet. We believe that, by modifying the techniques for the dynamic longest common substring problem ( after k mismatches) [52,53,54], the approximate Maximal Unique Matches problem can be solved in subquadratic time.
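The biological notion of a palindrome mentioned in the discussion above (a sequence equal to the reverse of its complementary strand) differs from the textual one and is easy to state in code. The following is a small sketch under the assumption that the sequence is a DNA string over the uppercase alphabet A, C, G, T; the function name `is_reverse_complement_palindrome` is hypothetical.

```python
# Watson-Crick base pairing: A <-> T and C <-> G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def is_reverse_complement_palindrome(seq: str) -> bool:
    """True if seq equals the reverse of its complementary strand.

    Assumes seq contains only the uppercase letters A, C, G and T.
    """
    return seq == seq.translate(COMPLEMENT)[::-1]
```

For instance, "GAATTC" satisfies this definition although it is not a palindrome in the usual string sense, whereas "AAAA" is a textual palindrome that does not satisfy it.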
Author Contributions
Funding
Conflicts of Interest
References
- Pei, J.; Wu, W.C.H.; Yeh, M.Y. On Shortest Unique Substring Queries. In Proceedings of the 2013 IEEE 29th International Conference on Data Engineering (ICDE), Brisbane, Australia, 8–11 April 2013; pp. 937–948.
- Hu, X.; Pei, J.; Tao, Y. Shortest Unique Queries on Strings. In Proceedings of the String Processing and Information Retrieval-21st International Symposium, SPIRE 2014, Ouro Preto, Brazil, 20–22 October 2014; Lecture Notes in Computer Science. de Moura, E.S., Crochemore, M., Eds.; Springer: Cham, Switzerland, 2014; Volume 8799, pp. 161–172.
- Hon, W.; Thankachan, S.V.; Xu, B. In-place algorithms for exact and approximate shortest unique substring problems. Theor. Comput. Sci. 2017, 690, 12–25.
- Inoue, H.; Nakashima, Y.; Mieno, T.; Inenaga, S.; Bannai, H.; Takeda, M. Algorithms and combinatorial properties on shortest unique palindromic substrings. J. Discrete Algorithms 2018, 52, 122–132.
- Abedin, P.; Ganguly, A.; Pissis, S.P.; Thankachan, S.V. Range Shortest Unique Substring Queries. In Proceedings of the International Symposium on String Processing and Information Retrieval, Segovia, Spain, 7–9 October 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 258–266.
- Ileri, A.M.; Külekci, M.O.; Xu, B. Shortest unique substring query revisited. In Symposium on Combinatorial Pattern Matching; Springer: Berlin/Heidelberg, Germany, 2014; pp. 172–181.
- Ileri, A.M.; Külekci, M.O.; Xu, B. A simple yet time-optimal and linear-space algorithm for shortest unique substring queries. Theor. Comput. Sci. 2015, 562, 621–633.
- Mieno, T.; Inenaga, S.; Bannai, H.; Takeda, M. Shortest Unique Substring Queries on Run-Length Encoded Strings. In Proceedings of the 41st International Symposium on Mathematical Foundations of Computer Science, MFCS 2016, Kraków, Poland, 22–26 August 2016; LIPIcs, Faliszewski, P., Muscholl, A., Niedermeier, R., Eds.; Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik: Dagstuhl, Germany, 2016; Volume 58, pp. 69:1–69:11.
- Allen, D.R.; Thankachan, S.V.; Xu, B. A Practical and Efficient Algorithm for the k-mismatch Shortest Unique Substring Finding Problem. In Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2018, Washington, DC, USA, 29 August–1 September 2018; Shehu, A., Wu, C.H., Boucher, C., Li, J., Liu, H., Pop, M., Eds.; ACM: New York, NY, USA, 2018; pp. 428–437.
- Allen, D.R.; Thankachan, S.V.; Xu, B. An Ultra-Fast and Parallelizable Algorithm for Finding k-Mismatch Shortest Unique Substrings. IEEE/ACM Trans. Comput. Biol. Bioinform. 2020.
- Watanabe, K.; Nakashima, Y.; Inenaga, S.; Bannai, H.; Takeda, M. Shortest Unique Palindromic Substring Queries on Run-Length Encoded Strings. In Proceedings of the Combinatorial Algorithms-30th International Workshop, IWOCA 2019, Pisa, Italy, 23–25 July 2019; pp. 430–441.
- Watanabe, K.; Nakashima, Y.; Inenaga, S.; Bannai, H.; Takeda, M. Fast Algorithms for the Shortest Unique Palindromic Substring Problem on Run-Length Encoded Strings. Theory Comput. Syst. 2020.
- Tsuruta, K.; Inenaga, S.; Bannai, H.; Takeda, M. Shortest Unique Substrings Queries in Optimal Time. In Proceedings of the SOFSEM 2014: Theory and Practice of Computer Science-40th International Conference on Current Trends in Theory and Practice of Computer Science, Nový Smokovec, Slovakia, 26–29 January 2014; Lecture Notes in Computer Science. Geffert, V., Preneel, B., Rovan, B., Stuller, J., Tjoa, A.M., Eds.; Springer: Berlin/Heidelberg, Germany, 2014; Volume 8327, pp. 503–513.
- Mieno, T.; Köppl, D.; Nakashima, Y.; Inenaga, S.; Bannai, H.; Takeda, M. Compact Data Structures for Shortest Unique Substring Queries. In Proceedings of the International Symposium on String Processing and Information Retrieval, Segovia, Spain, 7–9 October 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 107–123.
- Schultz, D.W.; Xu, B. On k-Mismatch Shortest Unique Substring Queries Using GPU. In Proceedings of the Bioinformatics Research and Applications-14th International Symposium, ISBRA 2018, Beijing, China, 8–11 June 2018; pp. 193–204.
- Schultz, D.W.; Xu, B. Parallel Methods for Finding k-Mismatch Shortest Unique Substrings Using GPU. IEEE/ACM Trans. Comput. Biol. Bioinform. 2019.
- Hon, W.; Thankachan, S.V.; Xu, B. An In-place Framework for Exact and Approximate Shortest Unique Substring Queries. In Proceedings of the Algorithms and Computation-26th International Symposium, ISAAC 2015, Nagoya, Japan, 9–11 December 2015; pp. 755–767.
- Ganguly, A.; Hon, W.K.; Shah, R.; Thankachan, S.V. Space-time trade-offs for the shortest unique substring problem. In Proceedings of the 27th International Symposium on Algorithms and Computation (ISAAC 2016), Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, Sydney, Australia, 12–14 December 2016.
- Haubold, B.; Pierstorff, N.; Möller, F.; Wiehe, T. Genome comparison without alignment using shortest unique substrings. BMC Bioinform. 2005, 6, 123.
- Tarhio, J.; Peltola, H. String matching in the DNA alphabet. Software Pract. Exp. 1997, 27, 851–861.
- Adas, B.; Bayraktar, E.; Faro, S.; Moustafa, I.E.; Külekci, M.O. Nucleotide Sequence Alignment and Compression via Shortest Unique Substring. In Proceedings of the Bioinformatics and Biomedical Engineering-Third International Conference, IWBBIO 2015, Granada, Spain, 15–17 April 2015; Lecture Notes in Computer Science. Guzman, F.M.O., Rojas, I., Eds.; Springer: Berlin/Heidelberg, Germany, 2015; Volume 9044, Part II, pp. 363–374.
- Kim, H.; Han, Y.S. OMPPM: Online multiple palindrome pattern matching. Bioinformatics 2016, 32, 1151–1157.
- Kolpakov, R.; Kucherov, G. Searching for gapped palindromes. Theor. Comput. Sci. 2009, 410, 5365–5373.
- Amir, A.; Apostolico, A.; Landau, G.M.; Levy, A.; Lewenstein, M.; Porat, E. Range LCP. J. Comput. Syst. Sci. 2014, 80, 1245–1253.
- Abedin, P.; Ganguly, A.; Hon, W.K.; Matsuda, K.; Nekrich, Y.; Sadakane, K.; Shah, R.; Thankachan, S.V. A linear-space data structure for range-LCP queries in poly-logarithmic time. Theor. Comput. Sci. 2020, 163, 245–251.
- Kociumaka, T.; Radoszewski, J.; Rytter, W.; Waleń, T. Internal pattern matching queries in a text and applications. In Proceedings of the twenty-sixth annual ACM-SIAM symposium on Discrete algorithms, Portland, OR, USA, 5–7 January 2014; SIAM: Philadelphia, PA, USA, 2014; pp. 532–551.
- Weiner, P. Linear Pattern Matching Algorithms. In Proceedings of the 14th Annual Symposium on Switching and Automata Theory (SWAT 1973), Iowa City, IA, USA, 15–17 October 1973; IEEE Computer Society: Washington, DC, USA, 1973; pp. 1–11.
- Manber, U.; Myers, G. Suffix arrays: A new method for online string searches. SIAM J. Comput. 1993, 22, 935–948.
- Kärkkäinen, J.; Sanders, P. Simple linear work suffix array construction. In Proceedings of the International Colloquium on Automata, Languages, and Programming, Eindhoven, The Netherlands, 30 June–4 July 2003; Springer: Berlin/Heidelberg, Germany, 2003; pp. 943–955.
- Fischer, J.; Heun, V. Space-Efficient Preprocessing Schemes for Range Minimum Queries on Static Arrays. SIAM J. Comput. 2011, 40, 465–492.
- Willard, D.E. Log-Logarithmic Worst-Case Range Queries are Possible in Space Theta(N). Inf. Process. Lett. 1983, 17, 81–84.
- Rubinchik, M.; Shur, A.M. EERTREE: An efficient data structure for processing palindromes in strings. In International Workshop on Combinatorial Algorithms; Springer: Berlin/Heidelberg, Germany, 2015; pp. 321–333.
- Ukkonen, E. On-line construction of suffix trees. Algorithmica 1995, 14, 249–260.
- Pei, J.; Wu, W.C.; Yeh, M. On shortest unique substring queries. In Proceedings of the 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, 8–12 April 2013; Jensen, C.S., Jermaine, C.M., Zhou, X., Eds.; IEEE Computer Society: Washington, DC, USA, 2013; pp. 937–948.
- Aggarwal, A.; Vitter, J.S. The input/output complexity of sorting and related problems. Commun. ACM 1988, 31, 1116–1127.
- Tamakoshi, Y.; Goto, K.; Inenaga, S.; Bannai, H.; Takeda, M. An opportunistic text indexing structure based on run length encoding. In Proceedings of the International Conference on Algorithms and Complexity, Paris, France, 20–22 May 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 390–402.
- Ulitsky, I.; Burstein, D.; Tuller, T.; Chor, B. The average common substring approach to phylogenomic reconstruction. J. Comput. Biol. 2006, 13, 336–350.
- Hooshmand, S.; Tavakoli, N.; Abedin, P.; Thankachan, S.V. On computing average common substring over run length encoded sequences. Fundam. Informaticae 2018, 163, 267–273.
- Thankachan, S.V.; Chockalingam, S.P.; Liu, Y.; Apostolico, A.; Aluru, S. ALFRED: A practical method for alignment-free distance computation. J. Comput. Biol. 2016, 23, 452–460.
- Bannai, H.; Gagie, T.; Inenaga, S.; Kärkkäinen, J.; Kempa, D.; Piątkowski, M.; Puglisi, S.J.; Sugimoto, S. Diverse palindromic factorization is NP-complete. In Proceedings of the International Conference on Developments in Language Theory, Liverpool, UK, 27–30 July 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 85–96.
- Borozdin, K.; Kosolobov, D.; Rubinchik, M.; Shur, A.M. Palindromic length in linear time. In Proceedings of the 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017), Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, Warsaw, Poland, 4–6 July 2017.
- Mali, P.; Esvelt, K.M.; Church, G.M. Cas9 as a versatile tool for engineering biology. Nat. Methods 2013, 10, 957–963.
- Manacher, G. A New Linear-Time “On-Line” Algorithm for Finding the Smallest Initial Palindrome of a String. J. ACM (JACM) 1975, 22, 346–351.
- Chan, T.M.; Larsen, K.G.; Patrascu, M. Orthogonal Range Searching on the RAM, Revisited. In Proceedings of the 27th Annual Symposium on Computational Geometry 2011, Paris, France, 13–15 June 2011; pp. 1–10.
- Kärkkäinen, J.; Kempa, D.; Puglisi, S.J. Parallel external memory suffix sorting. In Annual Symposium on Combinatorial Pattern Matching; Springer: Berlin/Heidelberg, Germany, 2015; pp. 329–342.
- Kärkkäinen, J.; Kempa, D.; Puglisi, S.J.; Zhukova, B. Engineering external memory induced suffix sorting. In Proceedings of the Nineteenth Workshop on Algorithm Engineering and Experiments (ALENEX), Barcelona, Spain, 17–18 January 2017; pp. 98–108.
- Kärkkäinen, J.; Kempa, D. Faster external memory LCP array construction. In Proceedings of the 24th Annual European Symposium on Algorithms (ESA 2016), Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, Aarhus, Denmark, 22–24 August 2016.
- Kärkkäinen, J.; Kempa, D. LCP array construction using O(sort(n)) (or less) I/Os. In Proceedings of the International Symposium on String Processing and Information Retrieval, Beppu, Japan, 18–20 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 204–217.
- Anjana, R.; Shankar, M.; Vaishnavi, M.K.; Sekar, K. A method to find palindromes in nucleic acid sequences. Bioinformation 2013, 9, 255.
- Thankachan, S.V.; Aluru, C.; Chockalingam, S.P.; Aluru, S. Algorithmic framework for approximate matching under bounded edits with applications to sequence analysis. In Proceedings of the International Conference on Research in Computational Molecular Biology, Paris, France, 21–24 April 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 211–224.
- Delcher, A.L.; Kasif, S.; Fleischmann, R.D.; Peterson, J.; White, O.; Salzberg, S.L. Alignment of whole genomes. Nucleic Acids Res. 1999, 27, 2369–2376.
- Kociumaka, T.; Radoszewski, J.; Starikovskaya, T. Longest common substring with approximately k mismatches. Algorithmica 2019, 81, 2633–2652.
- Abedin, P.; Hooshmand, S.; Ganguly, A.; Thankachan, S.V. The heaviest induced ancestors problem revisited. In Proceedings of the Annual Symposium on Combinatorial Pattern Matching (CPM 2018), Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, Qingdao, China, 2–4 July 2018.
- Flouri, T.; Giaquinta, E.; Kobert, K.; Ukkonen, E. Longest common substrings with k mismatches. Inf. Process. Lett. 2015, 115, 643–647.
Variant SUS Queries | | | | |
---|---|---|---|---|
Position-SUS | Interval-SUS | Approximate-SUS | Palindromic-SUS | Range-SUS |
Pei et al. 2013 [1] | Hu et al. 2014 [2] | Hon et al. 2017 [3] | Inoue et al. 2018 [4] | Abedin et al. 2019 [5] |
Ileri et al. 2014, 2015 [6,7] | Mieno et al. 2016 [8] | Allen et al. 2018, 2020 [9,10] | Watanabe et al. 2019, 2020 [11,12] | |
Tsuruta et al. 2014 [13] | Mieno et al. 2019 [14] | Schultz et al. 2018, 2020 [15,16] | | |
Hon et al. 2015, 2017 [3,17] | | | | |
Ganguly et al. 2016 [18] | | | | |
Mieno et al. 2019 [14] | | | | |
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Abedin, P.; Külekci, M.O.; Thankachan, S.V. A Survey on Shortest Unique Substring Queries. Algorithms 2020, 13, 224. https://doi.org/10.3390/a13090224