1. Introduction
A shortest unique substring (SUS) of a given text
is a substring containing a given position
and occurring only once in
T, such that every shorter substring containing
occurs at least twice in
T. For example, if
and
, then
is an SUS: it contains
, it occurs only once in
T, and each of the shorter substrings containing
—i.e.,
,
and
—occurs at least twice in
T. The problem of finding SUSs has attracted significant attention recently and many variants have been proposed—interval-SUSs,
k-mismatch SUSs, palindromic-SUSs and range-SUSs—so what we refer to simply as SUSs here are sometimes also called position-SUSs. In this paper, we are interested only in position-SUSs and
k-mismatch SUSs, which we describe shortly. We refer readers to Abedin et al.’s very recent survey [
1] for a more detailed discussion.
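To make the definition concrete, here is a small brute-force checker (our own illustration, not any of the algorithms surveyed below); it treats positions as 0-based and counts occurrences allowing overlaps, since str.count would miss overlapping repeats:

```python
def occurrences(T, s):
    """Number of (possibly overlapping) occurrences of s in T."""
    count, i = 0, T.find(s)
    while i != -1:
        count += 1
        i = T.find(s, i + 1)
    return count

def find_sus(T, q):
    """Start and length of a shortest unique substring of T containing
    position q (0-based), by brute force over lengths and start positions."""
    n = len(T)
    for length in range(1, n + 1):
        # every window of this length whose span covers position q
        for start in range(max(0, q - length + 1), min(q, n - length) + 1):
            if occurrences(T, T[start:start + length]) == 1:
                return start, length
    return None  # unreachable for nonempty T: T itself occurs only once
```

For instance, find_sus("abaaba", 2) returns (2, 2): the substring "aa" occurs only once, while every shorter substring covering position 2 repeats.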
Finding exact and approximate SUSs has several applications in bioinformatics, including alignment-free genome comparison, PCR primer design, and the identification of DNA signatures distinguishing closely related organisms [
2,
3,
4]. Pei, Wu, and Yeh [
3] gave the definition of SUSs above, along with an
-time and
-space algorithm for finding an SUS for
given
q, and an
-time,
-space construction algorithm for an
-space data structure that in
time returns the endpoints of an SUS for
given
q. Hu, Pei, and Tao [
5], İleri, Külekci and Xu [
6], and Tsuruta, Inenaga, Bannai, and Takeda [
7] independently improved the construction time to
. Belazzougui and Cunial [
8] reduced both the construction space and the space of the data structure to
at the cost of increasing the construction time to
, where
is the size of the alphabet, while keeping the query time constant. Ganguly et al. [
9] gave the following time-space tradeoffs for finding SUSs, assuming we have random access to
T and
is a parameter:
given q, we can find an SUS for in time and space;
in time and words plus bits of space, we can build a -bit data structure answering SUS queries in time; and,
we can change the running time in both cases to at the cost of increasing the space by an additive and allowing for a low probability that the substrings are not shortest.
Recently, Senanayaka [
10] gave a simple low-memory randomized algorithm based on Karp–Rabin pattern matching, but did not give theoretical bounds for it.
Hon, Thankachan, and Xu [
11] generalized the problem by defining a
k-mismatch SUS to be a shortest substring containing
that is not only unique, but also not within Hamming distance
k of any other substring of
T. For example, if
and
, then
is a one-mismatch SUS, because no other substring is within Hamming distance 1 of it and each of the shorter substrings containing
—i.e.,
,
and
—is within Hamming distance 1 of some other substring. On the other hand,
is not a 1-mismatch SUS, although it also has length 3, because the Hamming distance between ACA and ADA is 1.
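The k-mismatch condition can likewise be checked by brute force. The sketch below is an illustration of the Hamming-distance definition above, not Hon et al.'s data structure: a substring is k-mismatch unique when every other equal-length substring lies at Hamming distance greater than k.

```python
def hamming(u, v):
    """Hamming distance between equal-length strings."""
    return sum(a != b for a, b in zip(u, v))

def find_k_mismatch_sus(T, q, k):
    """Start and length of a shortest substring covering 0-based position q
    whose Hamming distance to every other equal-length substring of T
    exceeds k. With k = 0 this reduces to the plain SUS."""
    n = len(T)
    for length in range(1, n + 1):
        for start in range(max(0, q - length + 1), min(q, n - length) + 1):
            s = T[start:start + length]
            if all(hamming(s, T[j:j + length]) > k
                   for j in range(n - length + 1) if j != start):
                return start, length
    return None
```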
Hon et al. gave an
-time,
-space construction of an
-space data structure, which, given
q, in
time returns the endpoints of a
k-mismatch SUS for
. Allen, Thankachan, and Xu [
12] reduced the construction time to
at the cost of increasing the construction space to
, and Schultz and Xu [
13] gave a GPU algorithm that is fast in practice.
In
Section 2, we show how Senanayaka’s approach can be extended, such that, given
q and
m words of workspace and random access to
T, with high probability we can find an SUS containing
in
time. In
Section 3, we show that, replacing Karp–Rabin pattern matching by a result on sketching by Golan and Porat [
14], we can use
sequential passes over
T instead of random access, at the cost of increasing the time to
and requiring
. Replacing Golan and Porat’s result by one by Gawrychowski and Starikovskaya [
15], for constant
k, we can find a
k-mismatch SUS for
in
time using
sequential passes over
T, now requiring
. Although the sketching results that we rely on are too sophisticated for us to explain here, and too recent for us to be able to refer to any source other than Golan and Porat’s and Gawrychowski and Starikovskaya’s papers themselves, we use them only as black boxes, without relying on the details of how they work.
In
Section 4, we describe a deterministic algorithm that makes use of directed acyclic word graphs (DAWGs) [
16], the Crochemore-Perrin pattern matching algorithm [
17], and suffix trees [
18,
19,
20], to compute an SUS in
time using
words of workspace, improving Ganguly et al.’s result when
. Finally, in
Section 5 we discuss some possible directions for future work.
Table 1 summarizes known bounds for finding SUSs and
k-mismatch SUSs, including those that we give in this paper.
2. Tradeoffs with Karp–Rabin Pattern Matching
If we know an SUS for
has length at most
L, then we can search in
T for repetitions of the substrings of
in
T, using
words of workspace and
time. To do this, we build a suffix tree [
18,
19,
20] for
and scan
T, always descending in the suffix tree as much as possible and then following a suffix link. Suppose that, at some point, we have just read
and we are at string depth
d in the suffix tree. If
is not completely contained in
and
d is the largest string depth we have reached along the edge we are currently descending, then we mark with
d the node below that edge. After we have scanned
T, we can extract from the marked suffix tree the shortest string, such that:
its locus is a leaf (meaning it occurs only once in ),
that leaf’s label is at most q and its label plus the string’s length is at least q (so an occurrence of the string contains ), and
that leaf is not marked with a number greater than or equal to the string's length (meaning we have not seen a copy of the string elsewhere in T).
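As a much simpler (and slower) stand-in that illustrates what this suffix-tree scan computes, the following sketch assumes the bound L is known and tests only the candidate windows covering q (0-based), without building any index:

```python
def sus_with_bound(T, q, L):
    """Naive stand-in for the marked-suffix-tree scan: find an SUS for
    0-based position q assuming its length is at most L; return None
    if every candidate of length <= L repeats elsewhere in T."""
    n = len(T)
    for length in range(1, min(L, n) + 1):
        for start in range(max(0, q - length + 1), min(q, n - length) + 1):
            s = T[start:start + length]
            # count occurrences of s (overlaps allowed), stopping at 2
            count, i = 0, T.find(s)
            while i != -1 and count < 2:
                count += 1
                i = T.find(s, i + 1)
            if count == 1:
                return start, length
    return None
```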
If we do not know L, then we can find it via exponential search, still using words of workspace, but time and sequential passes over T. Since finding an SUS is relatively easy if we can use workspace proportional to its length, even if that length is unknown, in this paper we assume that we have less workspace.
Obviously, we can find an SUS containing
while using
words of workspace if we are willing to spend
time, where
L is the length of that SUS. For example, we can use the simple Algorithm 1. This algorithm can easily be improved to take
time by replacing the linear search for
L with an exponential search. It can then be further improved to take
time with high probability, while still using
words of workspace, by replacing naïve pattern matching with Karp–Rabin pattern matching. The resulting randomized algorithm can be either Monte Carlo if we do not verify matches or Las Vegas if we do, and the probability of failure can be made inversely proportional to any fixed polynomial of
n without changing the asymptotic bounds.
Algorithm 1 An O(nL³)-time algorithm to find a shortest unique substring (SUS) of T containing q, where L is the length of that SUS.

for ℓ from 1 to n
    % check each length ℓ in increasing order
    for i from max(1, q − ℓ + 1) to min(q, n − ℓ + 1)
        % check all substrings of length ℓ that include position q
        unmatchedflag ← true
        % we have not yet seen a repetition of T[i..i + ℓ − 1]
        for j from 1 to n − ℓ + 1
            if j ≠ i
                % for j ≠ i, compare T[j..j + ℓ − 1] to T[i..i + ℓ − 1]
                matchflag ← true
                % we have not yet seen a mismatch between T[i..i + ℓ − 1] and T[j..j + ℓ − 1]
                for k from 0 to ℓ − 1
                    if T[i + k] ≠ T[j + k]
                        % we have found a mismatch
                        matchflag ← false
                        break
                    end if
                end for
                if matchflag = true
                    % we have found a repetition of T[i..i + ℓ − 1]
                    unmatchedflag ← false
                    break
                end if
            end if
        end for
        if unmatchedflag = true
            % there is no repetition of T[i..i + ℓ − 1]
            print i, ℓ
            return
            % do not check longer substrings
        end if
    end for
end for
If we allow ourselves m words of workspace, then we can make the algorithm run in time with high probability. To do this, when searching for repetitions of the ℓ substrings of length ℓ that contain , we process them in batches of size , where depends on the ratio of the word-size to and the power of n we want in the denominator in the probability of failure. We note that we compute the hashes of the substrings in the same batch by rolling them, in total time, rather than computing each of them from scratch, which would take total time.
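A minimal sketch of the Karp–Rabin ingredient (our own illustration): counting the occurrences of a single window T[start..start+ℓ−1] with a rolling hash in one left-to-right pass. Hash collisions are not verified, so the count is Monte Carlo; processing several windows per pass, as in the batching described above, is a straightforward extension.

```python
def kr_occurrences(T, start, length, base=256, mod=(1 << 61) - 1):
    """Monte-Carlo count of (possibly overlapping) occurrences of
    T[start:start+length] in T, via a Karp-Rabin rolling hash."""
    n = len(T)
    target = 0
    for c in T[start:start + length]:
        target = (target * base + ord(c)) % mod
    top = pow(base, length, mod)  # weight of the character leaving the window
    h, count = 0, 0
    for i in range(n):
        h = (h * base + ord(T[i])) % mod
        if i >= length:                      # slide: drop T[i - length]
            h = (h - ord(T[i - length]) * top) % mod
        if i >= length - 1 and h == target:  # window T[i-length+1..i] matches
            count += 1
    return count
```

The window is unique in T exactly when the (collision-free) count is 1.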
Theorem 1. With m words of workspace, with high probability we can find an SUS for in time.
3. Tradeoffs with Sketching
For Karp–Rabin pattern matching, we must keep track of characters leaving a sliding window, for which we need either enough memory to store the contents of the sliding window—which, by assumption, we do not have—or random access to
T. However, Golan and Porat [
14] gave a Monte-Carlo randomized sketching algorithm that takes
d patterns of maximum length
ℓ, scans
T one character at a time, and, for each position, reports the longest pattern with an occurrence ending at that position with probability of failure inversely proportional to any fixed polynomial in
n. Their algorithm uses
time per character of
T and
space and does not use a sliding window, so it needs only sequential access to
T. Replacing Karp–Rabin pattern matching with Golan and Porat’s result and searching for substrings of length
ℓ in batches of size
, so that we stay within our workspace bound
m, we obtain the following result:
Theorem 2. With words of workspace, with high probability we can find an SUS for in time using sequential passes over T.
Because Golan and Porat’s algorithm is Monte Carlo, so is our result; unlike Theorem 1, we cannot easily make it Las Vegas, since verifying matches requires random access. Again, our batch size depends on the ratio of the word-size to and the power of n we want in the denominator in the probability of failure. The requirement that means the probability of failure can still be made inversely proportional to any fixed polynomial of n.
Gawrychowski and Starikovskaya [
15] considered a harder version of the problem Golan and Porat studied, in which we are given a distance
k and, for each position in
T, we should report all of the patterns within Hamming distance
k of substrings of
T ending at that position. They gave a Monte-Carlo randomized sketching algorithm that searches for
d patterns of length at most
ℓ using
time per character of
T, where
is the number of matches reported ending at that character, and
space. Replacing Golan and Porat’s algorithm with Gawrychowski and Starikovskaya’s and searching for substrings of length
ℓ in batches of size
for some positive constant
, so that we stay within our workspace bound
m for constant
k, we obtain the following result:
Theorem 3. For constant k and with words of workspace, with high probability we can find a k-mismatch SUS for in time using sequential passes over T.
Like Theorem 2, this result is Monte Carlo and the probability of failure is inversely proportional to any fixed polynomial in n.
4. A Deterministic Algorithm
Lemma 1. Given a length ℓ and a position p in a text with alphabet size σ, there exists a deterministic algorithm that can determine whether there is a substring of length ℓ that covers position p and is unique in T in time and words of workspace.
Proof of Lemma 1. There are at most ℓ possible substrings of length ℓ covering p that could be unique. We separate these substrings into batches of m substrings with adjacent starting positions. Note that there may be one remainder batch with fewer than m substrings.
Consider a batch B containing k substrings of length ℓ, where . Let denote the substring in B with the adjacent starting position. Then is the leftmost substring in B and is the rightmost substring in B.
Consider the following substrings:
It is useful to think of x as the prefix of that does not overlap with in T. Similarly, z is the suffix of that does not overlap with . Finally, y is the suffix of and the prefix of that overlap with each other in T. Note that any substring in B is equal to , where is some suffix of x, is some prefix of z and · denotes concatenation. For each batch, we will use the substrings x, y, and z to determine whether any of the k substrings in the batch occur in T.
We can scan T to enumerate all occurrences of suffixes of x, all occurrences of y, and all occurrences of prefixes of z. Using this information we can determine if any of the k substrings in the batch occur elsewhere in T and are therefore not unique. Suppose that we search text T and we find an occurrence of a suffix q of x, an occurrence r of y, and an occurrence of a prefix s of z. If , and for some i, , then we have found an occurrence of a substring in batch B in T. Specifically, if for some , then is not unique in T. □
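The decomposition in the proof is easy to verify in code. In the sketch below (0-based indices and our own naming: the batch holds the k length-ℓ windows starting at i, i+1, …, i+k−1, with k ≤ ℓ), every window is a suffix of x, followed by y, followed by a prefix of z:

```python
def batch_parts(T, i, k, length):
    """The strings x, y, z of Lemma 1 for the batch of k windows of the
    given length starting at i, i+1, ..., i+k-1 (requires k <= length)."""
    x = T[i:i + k - 1]                    # prefix of the leftmost window
    y = T[i + k - 1:i + length]           # overlap shared by all k windows
    z = T[i + length:i + length + k - 1]  # suffix of the rightmost window
    return x, y, z

# every window in the batch equals suffix(x) + y + prefix(z)
T = "abracadabra"
i, k, L = 2, 3, 5
x, y, z = batch_parts(T, i, k, L)
for j in range(k):
    assert T[i + j:i + j + L] == x[j:] + y + z[:j]
```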
4.1. Finding Occurrences of Suffixes of X
To enumerate all occurrences of suffixes of
x in
T, we construct the Directed Acyclic Word Graph (DAWG) [
16]
with suffix links.
is the smallest deterministic automaton which accepts all suffixes of
and it is known to have the following properties:
Each edge is labeled by a single symbol, and the labels of all outgoing edges from a given node are distinct. The total number of nodes and edges is linear in the length of .
For any node
u, let
denote the set of strings that can be created by concatenating the labels of any path from the root to
u. Then
, and for any strings
, the set of ending positions of occurrences of
and
in
are equivalent, i.e.,
where
. This implies that
for some
(the longest element of
) and
.
The suffix link of a node u points to a node v such that the longest element of is , where is the shortest element of .
For technical convenience, we can also consider an auxiliary node that the suffix link of the root points to, which has outgoing edges for all symbols to the root.
can be built in
time and
space [
16]. Because the suffix links form a tree, we can also process the suffix link tree in linear time so that each node
u holds a pointer
to the deepest ancestor
v of
u (possibly
) that has
$ as an outgoing edge, i.e., the longest element
of
is the longest suffix of any
that is a suffix of
x.
Using , we can incrementally compute, for each , the position in the DAWG that corresponds to the longest suffix of that is a substring of x, in overall time. Initially, , and we start at the root of . For each character for and current node u, we traverse by following the edge labelled from u or, if that edge does not exist, we try again after following the suffix link to . When the suffix link is traversed, i is incremented, so that the length of matches the length of the longest element in ( for the auxiliary node, and 0 for the root node).
If, upon reading character , we arrive at a node u, then points to the (possibly empty) longest suffix of that is a suffix of x, so we have detected suffixes of x with lengths starting at positions and ending at position of T. In this manner, we can find occurrences of all suffixes of x in T in time.
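The core of this scan, computing for every position of T the length of the longest substring of x ending there, can be sketched with a standard suffix-automaton construction (the DAWG's transition and suffix-link structure); the extra pointers to the nearest suffix-of-x ancestor are omitted here for brevity, and the names build_dawg and longest_match_per_position are our own:

```python
def build_dawg(s):
    """Suffix automaton of s: a list of states, each a dict with the
    longest-string length 'len', suffix link 'link', and transitions 'next'."""
    sa = [{'len': 0, 'link': -1, 'next': {}}]
    last = 0
    for c in s:
        cur = len(sa)
        sa.append({'len': sa[last]['len'] + 1, 'link': -1, 'next': {}})
        p = last
        while p != -1 and c not in sa[p]['next']:
            sa[p]['next'][c] = cur
            p = sa[p]['link']
        if p == -1:
            sa[cur]['link'] = 0
        else:
            q = sa[p]['next'][c]
            if sa[p]['len'] + 1 == sa[q]['len']:
                sa[cur]['link'] = q
            else:  # split state q by cloning it
                clone = len(sa)
                sa.append({'len': sa[p]['len'] + 1, 'link': sa[q]['link'],
                           'next': dict(sa[q]['next'])})
                while p != -1 and sa[p]['next'].get(c) == q:
                    sa[p]['next'][c] = clone
                    p = sa[p]['link']
                sa[q]['link'] = clone
                sa[cur]['link'] = clone
        last = cur
    return sa

def longest_match_per_position(x, T):
    """For each position i of T, the length of the longest substring of x
    ending at T[i], by one left-to-right scan over the DAWG of x."""
    sa = build_dawg(x)
    out, v, l = [], 0, 0
    for c in T:
        while v != 0 and c not in sa[v]['next']:
            v = sa[v]['link']  # shorten the current match
            l = sa[v]['len']
        if c in sa[v]['next']:
            v = sa[v]['next'][c]
            l += 1
        out.append(l)
    return out
```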
4.2. Finding Occurrences of Y
To enumerate all the occurrences of
y in
T, we preprocess
y in
time and constant space using the Crochemore–Perrin preprocessing algorithm [
17]. We can then find all occurrences of
y in
T in
time and constant space.
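Crochemore–Perrin matching is what gives the constant-space bound and is too intricate to reproduce here; as a simple stand-in that reports the same set of occurrences, but with O(|y|) rather than O(1) extra space, one can use Knuth–Morris–Pratt:

```python
def kmp_occurrences(y, T):
    """All starting positions of y in T (y nonempty); a KMP
    failure-function matcher, standing in for the constant-space
    Crochemore-Perrin algorithm."""
    m = len(y)
    fail = [0] * m  # fail[i]: length of the longest proper border of y[:i+1]
    k = 0
    for i in range(1, m):
        while k and y[i] != y[k]:
            k = fail[k - 1]
        if y[i] == y[k]:
            k += 1
        fail[i] = k
    out, k = [], 0
    for i, c in enumerate(T):
        while k and c != y[k]:
            k = fail[k - 1]
        if c == y[k]:
            k += 1
        if k == m:  # full match ending at position i
            out.append(i - m + 1)
            k = fail[k - 1]
    return out
```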
4.3. Finding Occurrences of Prefixes of Z
To enumerate all occurrences of prefixes of
z in
T we construct the suffix tree [
18,
19,
20]
with suffix links. Note that
contains exactly one node
, such that
. For each explicit node in
, we add a special pointer to the closest ancestor that is on the path from the root to
. This takes
time and
space. We start at the root of
and for
we follow the edge labelled
until we reach a node
u that has no outgoing edges labelled
available. We then use the special pointer at node
u to find the closest ancestor
v of
u that is on the path to
.
then yields the longest prefix of
z at position
of
T. Furthermore, every prefix of
is also a prefix of
z, so we have detected prefixes of
z with lengths
all starting at position
i. To find the longest prefix of
z at position
of
T, follow the suffix link of the node
u we ended on for input
and repeat this process. In this manner, we can enumerate all the occurrences of all prefixes of
z in
T in
time.
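The quantity computed here, the length of the longest prefix of z occurring at each position of T, can also be obtained from the Z-array of z concatenated with T. This stand-in uses working space linear in n rather than in |z|, so it does not match the bounds of the suffix-tree scan, but it shows what that scan produces; the separator character is assumed to occur in neither string:

```python
def z_array(s):
    """z[i] = length of the longest common prefix of s and s[i:]."""
    n = len(s)
    z = [0] * n
    if n:
        z[0] = n
    l = r = 0
    for i in range(1, n):
        if i < r:
            z[i] = min(r - i, z[i - l])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z

def prefix_of_z_lengths(z_str, T, sep="\x00"):
    """For each position i of T, the length of the longest prefix of
    z_str occurring at T[i]; sep must occur in neither string."""
    return z_array(z_str + sep + T)[len(z_str) + 1:]
```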
4.4. Putting the Occurrences Together
We now determine which substrings in the batch B are not unique in T. Notice that, for any , occurrences of and which share the same occurrence of y imply an occurrence of also sharing that occurrence of y. Therefore, we maintain an integer array R of size , where all elements are initially 0, in order to record the start and end of ranges in B that have been found to occur in T.
We use the DAWG , the Crochemore–Perrin algorithm for y, and the Suffix Tree , and maintain three parallel scans on T, shifted so that the three parts are detected in sync. More precisely, at position i, we maintain the following:
the longest suffix of that is also a suffix of x using the modified DAWG ,
whether using the Crochemore–Perrin algorithm, and
the longest prefix of that is also a prefix of z while using the modified suffix tree .
If , then no substrings in B occur here. On the other hand, if , then certainly one or more substrings in B occur here. Specifically, if , then substrings through in B occur here. We increment and decrement .
After the scan, we process R and compute, for , , i.e., and , which represents the total number of occurrences of in T (including those that contain q). Thus, the total time is . At any moment, space is used.
We have shown that we can determine the uniqueness of all the substrings in a batch B in T in time and space. Because there are batches of possible substrings of length ℓ covering p, we can determine the uniqueness of all substrings in all batches in time and using at most words at any moment. □
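The difference-array bookkeeping of this step is standard but easy to get wrong by one; a minimal sketch with 1-based batch indices, where ranges is our stand-in for the per-position matches found during the scan:

```python
def count_batch_occurrences(ranges, k):
    """Given, for each matching scan position, the 1-based range [lo, hi]
    of batch substrings occurring there, return how many times each of
    the k substrings was seen (the difference-array trick of Section 4.4)."""
    R = [0] * (k + 2)
    for lo, hi in ranges:
        R[lo] += 1      # a run of occurrences starts at index lo
        R[hi + 1] -= 1  # and stops after index hi
    counts, total = [], 0
    for j in range(1, k + 1):
        total += R[j]   # prefix sum recovers the per-index count
        counts.append(total)
    return counts
```

A substring of the batch is unique in T exactly when its total count, including the occurrence that contains q itself, is 1.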
Theorem 4. There is a deterministic algorithm that computes the shortest unique substring (SUS) of a text that covers some query position p chosen at runtime in words of workspace and time.
Proof of Theorem 4. If there is a unique substring of T with length ℓ that covers p, then there is a unique substring of T with length greater than ℓ that covers p. This property lets us use exponential search over the length ℓ of the shortest unique substring (SUS), with Lemma 1 as a sub-algorithm, in order to find the SUS that covers a query position p in words and time, where L is the length of the SUS. Setting , this yields a time complexity of . □
Algorithm 2 shows the pseudo-code.
Algorithm 2 An -time algorithm to find an SUS of T containing q, where L is the length of the SUS.

for next choice of ℓ in binary search for L
    % If there exists a unique substring of length ℓ that contains position q, then .
    % Otherwise, all of them are repeating so
    for i from to q step m
        % check all substrings of length ℓ that include q in batches of size k (m except possibly the last)
        
        , ,
        Construct the DAWG
        Preprocess y for the Crochemore–Perrin algorithm
        Construct the suffix tree
        for j from 1 to
            ,
        end for
        for j from 1 to n
            Compute the longest common suffix of x and using
            Compute whether , using the Crochemore–Perrin algorithm
            Compute the longest common prefix of z and using
            if then
                
                
            end if
        end for
        
        for j from 2 to k
            
        end for
        % is unique if and only if
    end for
end for
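The search over ℓ in Theorem 4 and Algorithm 2 can be sketched as follows (our own illustration, with a naive uniqueness test standing in for Lemma 1; positions are 0-based). The monotonicity used is that any superstring of a unique substring is itself unique.

```python
def has_unique_covering(T, q, length):
    """Naive stand-in for Lemma 1: does some unique substring of T of
    the given length cover 0-based position q?"""
    n = len(T)
    for start in range(max(0, q - length + 1), min(q, n - length) + 1):
        s = T[start:start + length]
        count, i = 0, T.find(s)
        while i != -1 and count < 2:  # overlapping count, capped at 2
            count += 1
            i = T.find(s, i + 1)
        if count == 1:
            return True
    return False

def sus_length(T, q):
    """Length of the SUS covering q: exponential search for an upper
    bound, then binary search, as in the proof of Theorem 4."""
    n = len(T)
    lo, hi = 1, 1
    while hi < n and not has_unique_covering(T, q, hi):
        lo, hi = hi + 1, min(2 * hi, n)  # the predicate fails below lo
    while lo < hi:                       # T itself is always unique
        mid = (lo + hi) // 2
        if has_unique_covering(T, q, mid):
            hi = mid
        else:
            lo = mid + 1
    return hi
```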