1. Introduction
Let $w = w[1]w[2]\cdots w[n]$ be an arbitrary word of length n. A fragment $w[i]\cdots w[j]$ of w, where $1 \le i \le j \le n$, is called a factor of w and is denoted by $w[i..j]$. Note that this factor can be considered either as a word itself or as a fragment of w. Thus, for factors, we have two different notions of equality: factors can be equal as the same fragment of the word w or as the same word. To avoid this ambiguity, we use two different notations: if two factors u and v of w are the same word (the same fragment of w), we will write $u = v$ ($u \equiv v$). For any $i = 1, \ldots, n$, the factor $w[1..i]$ ($w[i..n]$) is called a prefix (a suffix) of w. By positions in w, we mean the order numbers of letters of the word w. For any factor $v \equiv w[i..j]$ of w, the positions i and j are called the start position of v and the ending position of v and are denoted by $\mathrm{beg}(v)$ and $\mathrm{end}(v)$, respectively. For any two factors u, v of w, the factor u is contained in v if $\mathrm{beg}(v) \le \mathrm{beg}(u)$ and $\mathrm{end}(u) \le \mathrm{end}(v)$. Two factors u and v of w such that $\mathrm{beg}(u) \le \mathrm{beg}(v)$ are called overlapped if $\mathrm{end}(u) \ge \mathrm{beg}(v) - 1$. The intersection of the overlapped factors u and v is the factor $w[\mathrm{beg}(v)..\min(\mathrm{end}(u), \mathrm{end}(v))]$ (if $\mathrm{end}(u) = \mathrm{beg}(v) - 1$, the intersection of u and v is assumed to be an empty word). The length of the intersection of the overlapped factors u and v is called the overlap of u and v. The union of the overlapped factors u and v is the factor $w[i..j]$, where $i = \mathrm{beg}(u)$, $j = \max(\mathrm{end}(u), \mathrm{end}(v))$. If some word u is equal to a factor v of w, then v is called an occurrence of u in w.
A positive integer p is called a period of w if $w[i] = w[i + p]$ for each $i = 1, \ldots, n - p$. We denote by $p(w)$ the minimal period of w and by $e(w)$ the ratio $|w|/p(w)$, which is called the exponent of w. Further, we use the following well-known fact, which is usually called the periodicity lemma.
Lemma 1. If a word w has two periods p, q, and $|w| \ge p + q$, then $\gcd(p, q)$ is also a period of w.
The periodicity lemma is actually a weaker version of the Fine and Wilf theorem (see [1,2]). Using the periodicity lemma, it is easy to obtain the following.
Proposition 1. Let q be a period of a word w such that $|w| \ge 2q$. Then, q is divisible by $p(w)$.
We will also use the following evident fact.
Proposition 2. If two overlapped factors of a word have the same period p and the overlap of these factors is not less than p, then p is a period of the union of these factors.
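The notions of period, minimal period and exponent can be checked on small words with a brute-force sketch such as the one below. It is only an illustration and not part of the proposed algorithm; the function names are hypothetical, and the code uses 0-based Python indexing instead of the 1-based positions of the text.

```python
from fractions import Fraction
from math import gcd


def periods(w):
    """All periods p of a non-empty word w, i.e. w[i] == w[i + p] for every valid i."""
    n = len(w)
    return [p for p in range(1, n + 1)
            if all(w[i] == w[i + p] for i in range(n - p))]


def minimal_period(w):
    """The smallest period of w; the length of w is always a (trivial) period."""
    return periods(w)[0]


def exponent(w):
    """The exponent |w| / p(w), kept exact as a fraction."""
    return Fraction(len(w), minimal_period(w))


def lemma_holds(w):
    """Check the periodicity lemma on w: for periods p, q with p + q <= |w|,
    gcd(p, q) must also be a period."""
    ps = set(periods(w))
    return all(gcd(p, q) in ps for p in ps for q in ps if p + q <= len(w))
```

For instance, the word abaababaab of length 10 has periods 5 and 8, but 5 + 8 > 10, so the lemma does not apply, and indeed gcd(5, 8) = 1 is not a period of this word.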
A word is called primitive if its exponent is not an integer greater than 1. For primitive words, the following well-known fact holds (see, e.g., [3]).
Lemma 2 (primitivity lemma). If u is a primitive word, then the word $uu$ has no occurrences of u that are neither a prefix nor a suffix of $uu$.
Words r such that $e(r) \ge 2$ are called repetitions. A repetition in a word is called maximal if this repetition cannot be extended to the left or to the right in the word by at least one letter while preserving its minimal period. More precisely, a repetition r in a word w is called maximal if it satisfies the following maximality conditions:
if $\mathrm{beg}(r) > 1$, then $w[\mathrm{beg}(r) - 1] \ne w[\mathrm{beg}(r) - 1 + p(r)]$,
if $\mathrm{end}(r) < n$, then $w[\mathrm{end}(r) + 1] \ne w[\mathrm{end}(r) + 1 - p(r)]$.
For example, word has 6 maximal repetitions: , , , , and . Maximal repetitions are usually called runs in the literature. By , we denote the set of all maximal repetitions in the word w. For any natural n, we define also and .
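The maximality conditions can be made concrete with a naive enumeration of all maximal repetitions. The sketch below is a quadratic-or-worse brute force written only for illustration; it is not one of the linear-time algorithms cited in this paper, and the names `min_period` and `maximal_repetitions` are hypothetical. Repetitions are reported as triples (start, end, minimal period) with 1-based inclusive positions.

```python
def min_period(s):
    """Smallest p such that s[i] == s[i + p] for all valid i (s is non-empty)."""
    n = len(s)
    return next(p for p in range(1, n + 1)
                if all(s[i] == s[i + p] for i in range(n - p)))


def maximal_repetitions(w):
    """Brute-force list of maximal repetitions of w as (start, end, period)."""
    n = len(w)
    runs = []
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            if w[i] != w[i + p]:
                i += 1
                continue
            # Extend to the right while the period p is preserved.
            j = i
            while j + p < n and w[j] == w[j + p]:
                j += 1
            # w[i : j + p] is a stretch with period p that cannot be extended
            # in either direction; keep it if its length is at least 2p and
            # p is its minimal period.
            if j - i >= p and min_period(w[i:j + p]) == p:
                runs.append((i + 1, j + p, p))
            i = j + 1
    return sorted(runs)
```

On the illustrative word aabaab (not the example word of the text), this returns the repetitions aa at positions 1..2 and 4..5 with period 1 and the whole word with period 3.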
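The possible number of maximal repetitions was actively investigated in the literature. In [4], a linear upper bound on the maximal sum of exponents of all maximal repetitions in a word of length n is proven, which obviously implies a linear upper bound on the maximal number of maximal repetitions in such a word. Due to a series of papers (see, e.g., [5]), more precise upper bounds on these two quantities have been obtained. A breakthrough in this direction was made in [6], where it is proven that the number of maximal repetitions in a word of length n is less than n and the sum of their exponents is less than 3n. To our knowledge, the best upper bound for binary words w of length n is shown in [7], and the best lower bounds on the maximal number of maximal repetitions and on the maximal sum of their exponents are obtained, respectively, in [8,9]. Some results on the average number of runs in arbitrary words are obtained in [10,11].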
In this paper, we consider words over a (polynomially bounded) integer alphabet, i.e., words over an alphabet that consists of nonnegative integers bounded by some polynomial of the length of words. Thus, further, by w, we will mean an arbitrary word of length n over the integer alphabet.
The problem of finding all maximal repetitions in words was also actively investigated in the literature. An $O(n)$-time algorithm for finding all runs in a word of length n was proposed in [4] for the case of a constant-size alphabet. This result was generalized to the case of words over the integer alphabet in [12]. Another $O(n)$-time algorithm for the case of the integer alphabet, based on a different approach, has been proposed in [6]. An algorithm for solving the problem in the case of an unbounded, linearly ordered alphabet was proposed in [13]. This algorithm was improved in [14,15]. Finally, a linear time algorithm for this case was proposed in [16]. The obtained results can be summarized in the following two theorems.
Theorem 1. The number of maximal repetitions in w is $O(n)$, and all these repetitions with their minimal periods can be computed in $O(n)$ time.
Theorem 2. The sum of exponents of all maximal repetitions in w is $O(n)$.
Let r be a repetition in w. We call any factor of w that has the length $p(r)$ and is contained in r a cyclic root of r. The cyclic root that is the prefix of r is called the prefix cyclic root of r. It follows from the minimality of the period that any cyclic root of r is a primitive word. Thus, the following proposition can be easily obtained from Lemma 2.
Proposition 3. Two cyclic roots $u'$, $u''$ of a repetition r are equal if and only if $\mathrm{beg}(u') \equiv \mathrm{beg}(u'') \pmod{p(r)}$.
Since all roots of any repetition are primitive, any repetition r has different cyclic roots, which are cyclic rotations of each other. The lexicographically minimal root among these cyclic roots is called the Lyndon root of the repetition r. Let x be the leftmost occurrence of the Lyndon root in the repetition r. Then, the difference is denoted by . Two repetitions with the same minimal period are called repetitions with the same cyclic roots if they have the same set of distinct cyclic roots. Note that repetitions with the same cyclic roots have the same Lyndon root.
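The Lyndon root and the position of its leftmost occurrence can be illustrated with the following minimal sketch. It takes the repetition r as a standalone string and returns a 0-based offset from the start of r, which is assumed here to correspond to the difference defined above; the function names are hypothetical, and the quadratic computation is not the linear-time method of [17].

```python
def min_period(s):
    """Smallest p such that s[i] == s[i + p] for all valid i (s is non-empty)."""
    n = len(s)
    return next(p for p in range(1, n + 1)
                if all(s[i] == s[i + p] for i in range(n - p)))


def lyndon_root(r):
    """Return (Lyndon root, offset of its leftmost occurrence in r).

    r is assumed to be a repetition, i.e. len(r) >= 2 * min_period(r),
    so the p cyclic roots starting at offsets 0..p-1 all fit inside r.
    """
    p = min_period(r)
    roots = [r[i:i + p] for i in range(p)]  # the p distinct cyclic roots
    root = min(roots)                       # lexicographically minimal one
    return root, roots.index(root)
```

For the repetition baabaaba with minimal period 3, the cyclic roots are baa, aab and aba, so the Lyndon root is aab and its leftmost occurrence starts one position after the start of the repetition.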
Let be maximal repetitions with the same cyclic roots, p be the minimal period of and , and be cyclic roots of repetitions , respectively. Note that and are the starting positions of the leftmost Lyndon roots of repetitions , respectively. Denote by () the residue of () in modulo p. Using Proposition 3, it is easy to see that if and only if . Thus, we obtain the following fact.
Proposition 4. Let be maximal repetitions with the same cyclic roots, p be the minimal period of and , and be cyclic roots of repetitions , respectively. Then, if and only if Further, we will use double-linked lists of all maximal repetitions with the same cyclic roots in the order of their increasing of starting positions. According to [
17], these lists can be computed for the word
w in
time. It is also shown in [
17] that, for all maximal repetitions
r in
w, the values
can be computed in
total time.
We will also use the following facts for overlaps of maximal repetitions (see, e.g., [18]).
Proposition 5. The overlap of any two different maximal repetitions with the same minimal period p is smaller than p.
Proposition 6. The overlap of any two different maximal repetitions and is smaller than .
A natural generalization of repetitions are factors with exponents strictly less than 2. We will call such factors subrepetitions. More precisely, a factor r is called a subrepetition if . Analogously to repetitions, a subrepetition r in w is called maximal if r satisfies the maximality conditions, i.e., if r cannot be extended to the left or to the right in w by at least one letter while preserving its minimal period. For any such that (In the paper, for convenience, we assume actually that for some fixed ), a subrepetition r is called a (δ-subrepetition) if . It is shown below that the number of maximal -subrepetitions in a word of length n is . In this paper, the following problem is investigated.
Problem 1. For a given value δ, find in a given word w of length n all maximal δ-subrepetitions.
Previously, in [
19], two algorithms for solving Problem 1 were proposed: the first algorithm has
time complexity and the second algorithm has
expected time complexity. In [
20], the expected time of the second algorithm was improved to the linear bound
. Using the results of [
21,
22], this time can be further improved to
. In [
23], it is shown that all subrepetitions with the largest exponent (over all subrepetitions) in an overlap-free string can be found in O(n) time for a constant-size alphabet. In this paper, we propose an alternative deterministic algorithm for solving Problem 1 in
time.
In the following sections, we give a detailed description of the proposed algorithm. In Section 2, we consider regular structures in a word that are called maximal (gapped or overlapped) repeats. These repeats are closely related to maximal subrepetitions. We show that there exists a one-to-one correspondence between maximal subrepetitions and some maximal gapped repeats, which are called principal; so, to find all maximal subrepetitions, it suffices to find all principal gapped repeats in the word. In Section 3, we introduce the notion of covering of maximal repeats and show that a gapped repeat is principal if and only if it is not covered by another maximal repeat. Thus, to find all maximal subrepetitions, we find in the word all maximal gapped repeats that are not covered by other maximal repeats. In Section 4, Section 5 and
Section 6, some auxiliary notions that are used in the proposed algorithm—in particular, the notions of generated, periodic, represented, birepresented and
-periodic repeats—are introduced and several useful statements related to these notions are proven. Finally, a detailed description of the proposed algorithm is contained in
Section 7.
2. Repeats
Another regular structure in a word that is closely related to repetitions and subrepetitions is repeats. In the general case, a repeat σ in the word w is a pair $(u', u'')$ of equal factors $u' = u''$ of the word w, where $\mathrm{beg}(u') < \mathrm{beg}(u'')$. The factors $u'$, $u''$ are called copies of the repeat σ; in particular, $u'$ is called the left copy of σ and $u''$ is called the right copy of σ. The length of copies of σ is denoted by $c(\sigma)$. The difference $\mathrm{beg}(u'') - \mathrm{beg}(u')$ is called the period of the repeat σ and is denoted by $p(\sigma)$. The minimal factor containing both copies of σ will be denoted by $\mathrm{cr}(\sigma)$. Note that for different repeats $\sigma'$ and $\sigma''$, we can actually have $\mathrm{cr}(\sigma') \equiv \mathrm{cr}(\sigma'')$. Note also that $p(\sigma)$ is a period of $\mathrm{cr}(\sigma)$, but the minimal period of $\mathrm{cr}(\sigma)$ can be less than $p(\sigma)$. By the starting position (the ending position) of σ, we will mean the starting position (the ending position) of $\mathrm{cr}(\sigma)$. We will say that a maximal repeat σ is contained in some factor of the word if the factor $\mathrm{cr}(\sigma)$ is contained in this factor. A repeat is called overlapped if its copies are overlapped factors; otherwise, the repeat is called gapped. In other words, the repeat σ is gapped if $\mathrm{cr}(\sigma)$ can be represented in the form $uvu$, where u is the word equal to both copies of σ and v is a nonempty word called the gap of the repeat σ. For any $\alpha > 1$, a gapped repeat σ is called α-gapped if $p(\sigma) \le \alpha c(\sigma)$.
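Analogously to repetitions and subrepetitions, we can also introduce the notion of maximal repeats. A repeat σ with left and right copies $u'$, $u''$ in w is called maximal if it satisfies the following conditions:
if $\mathrm{beg}(u') > 1$, then $w[\mathrm{beg}(u') - 1] \ne w[\mathrm{beg}(u'') - 1]$,
if $\mathrm{end}(u'') < n$, then $w[\mathrm{end}(u') + 1] \ne w[\mathrm{end}(u'') + 1]$.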
In other words, a repeat in a word is called maximal if its copies cannot be extended to the left or to the right in the word by at least one letter while preserving its period. Note that any repeat can be extended to a uniquely defined maximal repeat with the same period. In particular, any α-gapped repeat can be extended to a uniquely defined maximal α-gapped or overlapped repeat. In [21,22], the following fact on maximal α-gapped repeats was obtained.
Theorem 3. The number of maximal α-gapped repeats in w is $O(\alpha n)$, and all these repeats can be computed in $O(\alpha n)$ time.
In [22], a more precise upper bound on the number of maximal α-gapped repeats in w was actually obtained. A tighter bound on this number was obtained later in [24]. An algorithm that finds, in each position of a word, the longest gapped repeats satisfying additional restrictions is proposed in [25].
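The extension of an arbitrary repeat to the uniquely defined maximal repeat with the same period, mentioned above, can be sketched as follows. This is an illustrative brute-force helper, not the procedure used in [21,22]. A repeat is encoded here, as an assumption of this sketch, by a triple (1-based start of the left copy, period, copy length), and the α-gapped test assumes that a gapped repeat is α-gapped when its period is at most α times its copy length.

```python
def extend_to_maximal(w, start, period, copy_len):
    """Extend the repeat (start, period, copy_len) in w to the maximal repeat
    with the same period; start is the 1-based start of the left copy."""
    n = len(w)
    left = start - 1                       # 0-based start of the left copy
    right_end = left + period + copy_len   # 0-based exclusive end of the right copy
    assert w[left:left + copy_len] == w[left + period:right_end], "copies differ"
    # Extend both copies to the left by one letter at a time.
    while left > 0 and w[left - 1] == w[left - 1 + period]:
        left -= 1
    # Extend both copies to the right by one letter at a time.
    while right_end < n and w[right_end] == w[right_end - period]:
        right_end += 1
    return left + 1, period, right_end - left - period


def is_alpha_gapped(period, copy_len, alpha):
    """Gapped (nonempty gap, i.e. copy_len < period) and period <= alpha * copy_len."""
    return copy_len < period <= alpha * copy_len
```

In the word xabcyzabcq, the two occurrences of the letter b at positions 3 and 8 form a repeat with period 5; extending it yields the maximal repeat with copies abc starting at positions 2 and 7, which is 2-gapped but not 1.5-gapped.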
Let
be a maximal overlapped repeat with left and right copies
in
w. Note that, in this case, the period
of
is not greater than half of
, so
is a repetition. Let
p be the minimal period of
. By Proposition 1, we have that
p is a divisor of
. Let
. Since
is a maximal repeat, we have that
On the other hand, since the period
p of
is a divisor of
, we have that
. Thus, we obtain that
. In an analogous way, we can obtain that if
, then
. Thus, we conclude that
is a maximal repetition whose minimal period is a divisor of
. We will denote this repetition by
. Now, let
r be a maximal repetition in
w. Then, we can consider the repeat
with left and right copies
and
. It is easy to note that
is a maximal overlapped repeat such that
. We will call the repeat
the
principal repeat of the repetition
r. Principal repeats of maximal repetitions will be also called
reprincipal repeats. Note that any reprincipal repeat
is the principal repeat of the repetition
, so, for any maximal repetition, we have that this repetition and the principal repeat of this repetition are uniquely defined by each other. Therefore, we have one-to-one correspondence between all maximal repetitions and all reprincipal repeats in a word. Thus, in any word, the number of reprincipal repeats is equal to the number of maximal repetitions. We have the following fact for reprincipal repeats.
Proposition 7. The number of reprincipal repeats in w is $O(n)$, and all these repeats can be computed in $O(n)$ time.
Proof. Recall that the number of reprincipal repeats in w is equal to the number of maximal repetitions in w, so, by Theorem 1, this number is $O(n)$. To compute all reprincipal repeats in w, first, we find all maximal repetitions in w. By Theorem 1, this can be done in time $O(n)$. Then, for each found maximal repetition r, we compute the principal repeat of r as the repeat with copies $u'$, $u''$, where $\mathrm{beg}(u') = \mathrm{beg}(r)$, $\mathrm{end}(u') = \mathrm{end}(r) - p(r)$, $\mathrm{beg}(u'') = \mathrm{beg}(r) + p(r)$, $\mathrm{end}(u'') = \mathrm{end}(r)$. It can be computed in constant time. Therefore, the total time of computing all reprincipal repeats from the found maximal repetitions is proportional to the number of these repetitions, so this time is $O(n)$. Thus, the total time of computing all reprincipal repeats in w is $O(n)$. □
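The constant-time step of the proof can be written out as a short sketch. It assumes that the left copy of the principal repeat is the repetition without its last p letters and the right copy is the repetition without its first p letters, where p is the minimal period; the function names and the encoding of factors as pairs of 1-based inclusive positions are hypothetical.

```python
def principal_repeat(beg, end, p):
    """Copies of the principal repeat of a maximal repetition w[beg..end]
    with minimal period p, as ((left_beg, left_end), (right_beg, right_end))."""
    left = (beg, end - p)
    right = (beg + p, end)
    return left, right


def factor(w, span):
    """The factor of w given by a pair of 1-based inclusive positions."""
    i, j = span
    return w[i - 1:j]
```

For the maximal repetition abababa with minimal period 2, this gives the copies at positions 1..5 and 3..7, both equal to ababa; the copies overlap, as expected for the principal repeat of a repetition.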
Now, let
r be a maximal
-subrepetition in
w. Then, we can consider a gapped repeat
with the left copy
and the right copy
. It is easy to see that
is a maximal gapped repeat with the period
. Moreover, we have
so
. Thus,
is actually a maximal
-gapped repeat. We call the repeat
the
principal repeat of the subrepetition
r. Thus, for any maximal
-subrepetition, there exists the principal repeat of this subrepetition. On the other hand, a maximal gapped repeat may not be the principal repeat of any maximal subrepetition. For example, in the word shown in
Figure 1, we can consider the maximal subrepetition
r with the minimal period 7. Note that the gapped repeat
is the principal repeat of
r, while the gapped repeat
is not the principal repeat of
r, so the repeat
is not the principal repeat of any maximal subrepetition. Thus, the repeat
is a principal repeat and the repeat
is not a principal repeat. Note that for any maximal subrepetition, we have that this subrepetition and the principal repeat of this subrepetition are uniquely defined by each other, and we have one-to-one correspondence between all maximal
-subrepetitions and all principal maximal
-gapped repeats in a word (According to this observation, maximal subrepetitions can be identified with maximal repeats
such that
is equal to the minimal period of
(see Proposition 9)). Thus, Problem 1 can be reformulated in the following way.
Problem 2. For a given value δ, find in a given word w of length n all principal maximal $1/\delta$-gapped repeats.
We obtain also that, in any word, the number of maximal δ-subrepetitions is not greater than the number of maximal $1/\delta$-gapped repeats. Thus, Theorem 3 implies the following upper bound on the number of maximal δ-subrepetitions in a word.
Proposition 8. The number of maximal δ-subrepetitions in w is $O(n/\delta)$.
Note also that, for any principal overlapped or gapped repeats, we have the following obvious fact.
Proposition 9. A maximal repeat σ is principal if and only if the minimal period of the factor $\mathrm{cr}(\sigma)$ spanned by the copies of σ is equal to the period $p(\sigma)$ of σ.
Let M be a set of maximal repeats in the word w. Note that a maximal repeat is uniquely defined by its starting position together with its period. Thus, we can represent the set M by lists for , where is the list containing all these repeats with the starting position t in the order of their increasing periods. Using bucket sorting, all the lists can be computed from the set M in total time. Moreover, all the lists can be traversed in total time. We will call the lists start position lists for the set M.
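The construction of the start position lists by bucket sorting can be sketched as a two-pass stable distribution. This is a minimal illustration under the assumption that each repeat is given as a pair (starting position, period) with values between 1 and n; the function name and the list-of-lists layout are hypothetical.

```python
def start_position_lists(repeats, n):
    """Group repeats (start, period) by start, each group ordered by period.

    Two stable bucket passes: first by period, then by starting position.
    The running time is proportional to n plus the number of repeats.
    """
    by_period = [[] for _ in range(n + 1)]
    for rep in repeats:
        by_period[rep[1]].append(rep)
    lists = [[] for _ in range(n + 1)]   # lists[t] holds the repeats starting at t
    for bucket in by_period:             # periods visited in increasing order
        for rep in bucket:
            lists[rep[0]].append(rep)
    return lists
```

Because the second pass visits the repeats in order of increasing period and appends them stably, every list comes out already sorted by period, so no comparison sort is needed.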
3. Covering of Maximal Repeats
Now, we reformulate Problem 2 in terms of covering maximal repeats. We say that a maximal repeat is covered by another maximal repeat if is contained in and . For example, in the word , the gapped repeat with copies is covered by the gapped repeat with copies . Note that the introduced notion of covering of repeats satisfies the transitivity property: if some maximal repeat is covered by some maximal repeat and the maximal repeat is covered by some maximal repeat , then is covered by . The following auxiliary facts can be easily checked.
Proposition 10. A maximal α-gapped repeat can be covered only by α-gapped or overlapped repeats.
Proposition 11. If a maximal repeat σ is covered by a maximal repeat , then the left copy of σ is contained in the left copy of and the right copy of σ is contained in the right copy of .
In an analogous way, a maximal repeat is covered by a maximal repetition r if is contained in r and . For example, the repetition with the minimal period 5 covers the gapped repeat with copies . Note that the repeat is covered by r if and only if is covered by the principal repeat of r. Note also that any maximal overlapped repeat coincides as a factor with some maximal repetition whose period is a divisor of the period of this repeat. Thus, any maximal repeat covered by a maximal overlapped repeat is covered also by the maximal repetition coinciding with this overlapped repeat. Thus, we have the following fact.
Proposition 12. Any maximal repeat covered by a maximal overlapped repeat is covered also by the principal repeat of some maximal repetition.
Let a maximal repeat be covered by a maximal repeat . Then, the factor has the period , so, by Proposition 9, the repeat is not principal. Hence, principal repeats are not covered by other maximal repeats. Now, let a maximal repeat be not principal, i.e., is contained in some repetition or subrepetition r such that . In this case, is covered by the principal repeat of r. Therefore, if is not covered by any other maximal repeat, then is principal. Thus, we obtain the following fact.
Proposition 13. A maximal repeat is principal if and only if it is not covered by any other maximal repeat.
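The covering relation can be expressed as a simple predicate on pairs of repeats. The sketch below rests on an assumption about the definition, since the exact inequality is not restated here: a repeat is taken to be covered by another repeat if the factor spanned by its copies is contained in the factor spanned by the copies of the other repeat and the period of the other repeat is strictly smaller. The `Repeat` type and its field names are hypothetical.

```python
from typing import NamedTuple


class Repeat(NamedTuple):
    start: int      # 1-based starting position of the left copy
    period: int     # distance between the starts of the two copies
    copy_len: int   # length of each copy

    @property
    def end(self):
        """1-based ending position of the right copy."""
        return self.start + self.period + self.copy_len - 1


def is_covered_by(sigma, tau):
    """True if sigma is covered by tau (assumed definition, see the lead-in)."""
    return (tau.start <= sigma.start
            and sigma.end <= tau.end
            and tau.period < sigma.period)
```

For example, in the word abcabcabcabc, the repeat with period 6 and copies abcabc is covered by the repeat with period 3 spanning the same factor, so, in line with Proposition 13, the former is not principal.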
Using Propositions 10 and 13, Problem 2 can be reformulated in the following way.
Problem 3. For a given value δ, find in a given word w of length n all maximal $1/\delta$-gapped repeats that are not covered by other maximal repeats.
In the paper, we actually propose an algorithm for solving Problem 3. The main idea of the proposed algorithm is the following. First, using already known algorithms, we find in w all maximal $1/\delta$-gapped repeats, and then we remove from the found repeats all repeats that are covered by other maximal repeats. Thus, the crucial part of the proposed algorithm is the procedure of removing maximal $1/\delta$-gapped repeats covered by other maximal repeats. This procedure is described below in the text.
4. Periodic and Generated Repeats
For resolving Problem 3, we initially introduce the notions of periodic and generated repeats. Repeat is called periodic if the copies of are repetitions with a minimal period not greater than ; otherwise, is called nonperiodic. Let be a maximal periodic repeat with a left copy and a right copy . Since and are repetitions, these repetitions can be extended to some maximal repetitions and with the same cyclic roots. Let and be different maximal repetitions. Then, we say that is represented by the pair of maximal repetitions or, more briefly, is birepresented. The pair of maximal repetitions will be called left if ; otherwise, it is called right. We will also call the repeat left (right) birepresented if is represented by a left (right) pair of maximal repetitions. If the maximal repetitions and are the same repetition r, we say that is represented by the maximal repetition r.
A maximal repeat σ is said to be generated by a maximal repetition r if $\mathrm{beg}(\sigma) = \mathrm{beg}(r)$, $\mathrm{end}(\sigma) = \mathrm{end}(r)$, and $p(\sigma)$ is divisible by $p(r)$. For maximal periodic repeats represented by maximal repetitions, we have the following fact.
Proposition 14. Any maximal periodic repeat represented by a maximal repetition is generated by this maximal repetition.
Proof. Let be a maximal periodic repeat represented by a maximal repetition r, and , be, respectively, the left and right copies of . Note that, since is periodic, the length of and is greater than . Denote by , the cyclic roots of r, which are prefixes of and , respectively. Since , by Proposition 3, we have , so . Thus, is divisible by . Let . Then, both symbols and are contained in r and the difference between the positions of these symbols is divisible by . Thus, , which contradicts the notion that is a maximal repeat. Thus, . In an analogous way, we have that . Thus, is generated by r. □
Note that any maximal repetition r generates no more than $e(r)$ gapped repeats, and, knowing the values $\mathrm{beg}(r)$, $\mathrm{end}(r)$ and $p(r)$, we can compute all these repeats in $O(e(r))$ time. We can check each of these repeats in constant time if this repeat is α-gapped. Thus, we can compute all maximal α-gapped repeats generated by r in $O(e(r))$ time. Therefore, we have the following simple procedure for computing all maximal α-gapped repeats generated by maximal repetitions in w. First, we find all maximal repetitions in w. According to Theorem 1, this can be done in $O(n)$ time, and the total number of these repetitions is $O(n)$. Then, for each found repetition r, we compute all maximal α-gapped repeats generated by r in $O(e(r))$ time. The total time of this procedure is $O(n + \sum_{r} e(r))$, where the sum is taken over all maximal repetitions r in w, so, by Theorem 2, this time is $O(n)$. Note also that, by Theorem 2, the number of computed repeats is $O(n)$. Thus, we obtain the following fact.
Proposition 15. The number of maximal α-gapped repeats generated by maximal repetitions in w is $O(n)$, and all these repeats can be computed in $O(n)$ time.
Proposition 15 is used further in the proposed algorithm for the exclusion of covered repeats generated by maximal repetitions.
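The enumeration of gapped repeats generated by a single maximal repetition can be sketched as a loop over the multiples of its minimal period. The sketch assumes that a generated repeat spans exactly the repetition and has a period divisible by the minimal period of the repetition, and that a gapped repeat is α-gapped when its period is at most α times its copy length; the function name and the triple encoding (start, period, copy length) are hypothetical.

```python
def generated_gapped_repeats(beg, end, p, alpha):
    """alpha-gapped repeats generated by the maximal repetition w[beg..end]
    with minimal period p, as triples (start, period, copy_len).

    The loop runs over fewer than |r| / p multiples of p, i.e. its number of
    iterations is bounded by the exponent of the repetition.
    """
    length = end - beg + 1
    result = []
    k = 1
    while k * p < length:
        period = k * p
        copy_len = length - period   # both copies lie inside the repetition
        # Gapped means a nonempty gap, i.e. copy_len < period.
        if copy_len < period <= alpha * copy_len:
            result.append((beg, period, copy_len))
        k += 1
    return result
```

For the repetition ababababab of length 10 with minimal period 2, the generated repeats with periods 2 and 4 are overlapped, the one with period 6 is 2-gapped, and the one with period 8 is 4-gapped but not 2-gapped.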
5. Birepresented Gapped Repeats
Now, we prove several properties of birepresented gapped repeats, which are used in the proposed algorithm. Let the maximal repeat
with left and right copies
and
be represented by a left pair of maximal repetitions
with the same cyclic roots, and
p be the minimal period of
and
. Assume that
and
. Then,
and
are contained in
and
, respectively, so
Thus, from , we have , which contradicts the notion that is maximal. Therefore, we have or . Thus, we have the following separate cases for and :
- a1.
and ;
- a2.
and ;
- a3.
and .
Analogously, it can be shown that or . Thus, we have the following separate cases for and :
- b1.
and ;
- b2.
and ;
- b3.
and .
Thus, for , we have in total 9 cases aibj when there are cases ai and bj simultaneously. Note that, since , the case of and implies that and , so cases a1b3, a3b1 and a3b3 are impossible. The remaining cases can be grouped into the three following cases:
, and (case a3b2);
, and (cases a1b2 and a2b2);
and (cases a1b1, a2b1 and a2b3).
We will call the repeat the repeat of the first type in case 1, repeat of the second type in case 2, and repeat of the third type in case 3. For example, consider the word , which has the left pair of the maximal repetitions , with the same cyclic roots of length 3. This pair represents the gapped repeat of the first type with copies , the gapped repeat of the second type with copies and the gapped repeat of the third type with copies .
First, consider case 1. Let
,
be two different maximal repeats of the first type represented by the left pair of maximal repetitions
. If
, then it is easy to note that
and
are the same repeat, so
. Let
,
be the prefixes of length
p in the left copies of repeats
and
, respectively. Note that
,
are cyclic roots of
that are equal to the prefix cyclic root of
, so
. Thus, by Proposition 3, we have that the difference
is divisible by
p. Moreover, by the definition of repeats of the first type, we have that both values
are in the segment
Thus, the starting positions of all maximal repeats of the first type represented by the pair
form in the segment
an arithmetic progression of numbers with the step
p. Let
be the maximal number in this progression and
be the number of all maximal repeats of the first type represented by the pair
. Then, we can consider the numbers
of this progression in descending order as the starting positions of the corresponding repeats. In this way, we consider the set of all maximal repeats of the first type represented by the pair
as
, where
for
. Note that
,
and
for
.
Now, consider case 2. Let
,
be two different maximal repeats of the second type represented by the left pair of maximal repetitions
, and
,
be the right copies of repeats
,
, respectively. If
, then
and
are the same repeat, so
. Let
,
be the prefixes of length
p in
and
, respectively. Note that
,
are cyclic roots of
, which are equal to the prefix cyclic root of
, so
. Thus, by Proposition 3, we have that the difference
is divisible by
p. Moreover, by the definition of repeats of the second type, we have that both values
are in the segment
Thus, the starting positions of the right copies of all maximal repeats of the second type represented by the pair
form in the segment
an arithmetic progression of numbers with the step
p. Let
be the minimal number in this progression and
be the number of all maximal repeats of the second type represented by the pair
. Then, we can consider the numbers
of this progression in ascending order as the starting positions of the right copies of the corresponding repeats. In this way, we consider the set of all maximal repeats of the second type represented by the pair
as
, where the starting position of the right copy of
is
for
. Note that
,
and
for
.
Finally, consider case 3. Let , be two different maximal repeats of the third type represented by the left pair of maximal repetitions , and , be the right copies of repeats , , respectively. Analogously to case 2, it can be shown that , and the difference is divisible by p. Thus, the starting positions of the right copies of all maximal repeats of the third type represented by the pair form in the segment an arithmetic progression of numbers with the step p. Let be the minimal number in this progression and be the number of all maximal repeats of the third type represented by the pair . Then, we can consider the numbers of this progression in ascending order as the starting positions of the right copies of the corresponding repeats. In this way, we consider the set of all maximal repeats of the third type represented by the pair as , where the starting position of the right copy of is for . Note that , and for . The repeat will be called dominating, and other repeats will be called nondominating.
Consider additionally the repeats
and
. Let
and
be left and right copies of
, and
and
be left and right copies of
. Consider in
the prefix
of length
p, which is a cyclic root of
. Since
is a prefix of
, we have also in
the prefix
of length
p, which is a cyclic root of
and is equal to
. Note that
is a factor of
, so we can also consider in
the factor
corresponding to
. Note also that
is a cyclic root of
such that
. Moreover,
has to be in
the leftmost cyclic root to the right of
, so, by Proposition 3,
. We have also that
and
, so
. Thus,
, and
Consider also the repeats
and
. Let
and
be left and right copies of
, and
and
be left and right copies of
. Consider in
the prefix
of length
p, which is a cyclic root of
. Since
is a prefix of
, we have in
the prefix
of length
p, which is a cyclic root of
and is equal to
. Since
is a prefix of
, we have also in
the prefix
of length
p, which is a cyclic root of
and is equal to
. Thus,
and
are two equal cyclic roots of
. Note that
has to be in
the leftmost cyclic root to the right of
, so, by Proposition 3,
. Therefore,
Now, we can join all the repeats represented by the pair of repetitions into the sequence of repeats , where the repeats of the first type, the repeats of the second type and the repeats of the third type are inserted consecutively, i.e., for , for , and for . From the above observations, we have that for , for , for , for , and for . Note also that . Since for , any repeat cannot be covered by repeats from with greater indexes. Moreover, since for , any repeat for cannot be covered by repeats from with smaller indexes. Thus, any repeat for cannot be covered by other repeats from . On the other hand, all repeats for that are actually nondominating repeats of the third type are covered by the repeat , which is actually the dominating repeat of the third type. Thus, we have the following fact.
Proposition 16. A maximal repeat σ represented by a left pair of maximal repetitions is covered by another repeat represented by the same pair of repetitions if and only if σ is a nondominating repeat of the third type.
From the left pair of maximal repetitions , one can compute effectively all maximal -gapped periodic repeats represented by the pair .
Lemma 3. Let be a left pair of maximal repetitions with the same cyclic roots such that , , , , , , be known. Then, all maximal α-gapped periodic repeats represented by the pair can be computed in time , where s is the number of the maximal α-gapped periodic repeats represented by the pair .
Proof. First, we compute in constant time a sequence of repeats, where, by computing , we mean computing formulas by which any repeat from can be computed in constant time. Since any maximal repeat is defined uniquely by the values , and , we will compute actually for any repeat from the values , and .
Let
p be the minimal period of
and
, and
be the prefix cyclic root of
. Denote by
the starting position of the leftmost cyclic root of
, which is equal to
and is not a prefix of
. Taking into account Proposition 4, it can be checked that
Note that
has to be actually the starting position of the repeat
from
, and in this case,
. Thus, if
, we can conclude that
; otherwise,
.
Let
, i.e.,
. Denote by
the starting position of
. Note that
is the starting position of the rightmost cyclic root of
, which is equal to
, and such that
. Thus, taking into account Proposition 3, we have that
is the greatest number such that
is divisible by
p and
. Note that
can be computed in constant time. Then,
,
and
. Note also that
so
and for any
, we have
,
,
and
. Denote
. As shown above,
is the greatest number such that
, so
and
. If
, from the above observations for
, we have
,
,
and
. Moreover, the repeat
from
has to satisfy the conditions
,
,
. Note that in this case,
Thus, if
we conclude that
; otherwise,
. Let
. Note that
is the greatest number such that
i.e.,
Thus,
, and for any
, we have
,
,
and
.
Now, consider the case
, i.e.,
and the pair
represents no repeats of the first type. Since all repeats of the second or third type represented by the pair
must have starting position
, the equality
must be held for
. Denote by
the repeat with the starting position
and the period
. We proved above that
. In the same way, one can prove that
must be equal to
. In this case, we obtain that
. Therefore, if
, we conclude that
is empty. Let
. Note that
, so
. Thus,
. Note also that
has to be equal to
, so
has to be less than
. Thus, we have
. First, consider the case
, i.e.,
is a repeat of the second type, so
. Then,
can be computed in constant time as the greatest number such that
, and for any
, we have
,
,
and
. Moreover, we obtain that
must satisfy the conditions
,
and
, so
Thus, if
, we conclude that
. Otherwise, taking into account
, we obtain that
and
,
and
. Finally, consider the case
, i.e.,
and
is a repeat of the third type, so
. Taking into account
, we obtain in this case that
is a unique repeat in
. Thus, in any case,
can be computed in constant time.
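Several of the constant-time steps above reduce to two elementary computations: finding the greatest position not exceeding a bound that differs from a given position by a multiple of p, and finding the contiguous range of indices of an arithmetic progression whose terms fall into an interval. The following Python sketch illustrates both under stated assumptions; the names `base`, `bound`, `first`, `step`, `count`, `low` and `high` are illustrative and are not the paper's notation.

```python
from typing import Optional, Tuple


def greatest_congruent_at_most(base: int, bound: int, p: int) -> int:
    """Return the greatest x <= bound such that x - base is divisible by p."""
    if p <= 0:
        raise ValueError("p must be positive")
    # Python's % returns a value in [0, p) even for negative operands.
    return bound - ((bound - base) % p)


def progression_segment(first: int, step: int, count: int,
                        low: int, high: int) -> Optional[Tuple[int, int]]:
    """Indices (l, m), 1-based, of the terms first + (j - 1) * step, 1 <= j <= count,
    that lie in [low, high]; None if no term does. Runs in constant time."""
    if step <= 0:
        raise ValueError("step must be positive")
    # Smallest j with first + (j - 1) * step >= low (ceiling division).
    l = max(1, -((first - low) // step) + 1)
    # Greatest j with first + (j - 1) * step <= high (floor division).
    m = min(count, (high - first) // step + 1)
    return (l, m) if l <= m else None
```

Both functions use only a fixed number of arithmetic operations, which is the property the proof relies on when it states that the indexes can be computed in constant time.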
Now, we need to select from
all
α-gapped repeats, i.e., all gapped repeats
such that
. If
, i.e.,
, as shown above, we have a trivial case when
contains no more than one repeat. Thus, further, we assume that
. First, we consider the case when
is a gapped repeat. Note that for any repeats
and
from
, the ending position of the left copy of
is not greater than the ending position of the left copy of
and the starting position of the right copy of
is not less than the right copy of
. Thus, if
is a gapped repeat, then all repeats
for
are also gapped repeats. Therefore, in this case, all repeats from
are gapped repeats. Further, we note that for any
j such that
, we have
and
, so
Now, consider a repeat
from
such that
. Since
is a gapped repeat, we have
and, as shown above,
,
. Hence,
Thus, we have
From inequalities (
2) and (
3), we conclude that all the
α-gapped repeats from
have to form a continuous segment
in
. Thus, to efficiently find all these repeats, we need to compute the indexes
l and
m.
First, we compute l. Let and . Then, we obviously obtain .
Now, let
and
. Then, taking into account (
3), we have that
l is the smallest number such that
. Since this number cannot be greater than
, as shown above, we have also
and
. Thus,
l is the smallest number such that
. It can be easily computed that
Now, let
or
. In this case, if there exists the repeat
in
and
, then we have
; otherwise, from (
2), we obtain that there are no
α-gapped repeats in
.
Now, we compute
m. If
, then we obviously have
. Thus, further, we assume that
. Let
and
. Then, taking into account (
2), we have that
m is equal to the greatest number
for
such that
. As shown above, we have also
and
. Thus,
m is equal to the greatest number
for
such that
. It can be easily computed that
Now, let and . Then, we have obviously that .
Now, we can assume that
and, if
,
. Let
and
. Then, taking into account (
2), we have that
m is equal to the greatest number
for
such that
. From the above observations, we obtain that
and
. Thus,
m is equal to the greatest number
for
such that
. It can be easily computed that
Now, we assume additionally that, if , In this case, if and , then ; otherwise, we can conclude that has no α-gapped repeats. Thus, l and m can be computed in constant time, so all α-gapped repeats in can be computed in the required time.
Finally, we consider the case when
is an overlapped repeat. Denote for any
by
and
the left and right copies of the repeat
. Let
, i.e.,
is a repeat of the first type. Note that for any repeat
of the first type, we have
and
. Therefore, if
is an overlapped repeat, then any repeat
of the first type is also an overlapped repeat. Thus, in the considered case, we obtain that all repeats of the first type from
are overlapped repeats, i.e., only repeats of the second or third types from
can be gapped repeats. Now, consider in
a repeat
such that
. As shown above, we have
On the other hand, we have
. By Proposition 5, the overlap of repetitions
and
is less than
p, so
. Thus, we obtain that
and
, so
is a gapped repeat. Hence, all repeats
from
such that
are gapped repeats. Thus, all gapped repeats from
are repeats
, where
if
is a gapped repeat and
otherwise. Since one can check in constant time if the repeat
is gapped, the value
l can be computed in constant time. Taking into account inequalities (
2), we conclude that
Thus, if
, then there are no
α-gapped repeats in
; otherwise, all
α-gapped repeats in
form a continuous segment
, where the index
m can be computed in constant time, analogously to the case considered above, when
is a gapped repeat. Therefore, all
α-gapped repeats in
can be computed in the required time in any case. □
In a similar way, from the left pair of maximal repetitions , one can efficiently compute all maximal α-gapped nondominating repeats of the third type represented by the pair .
Lemma 4. Let be a left pair of maximal repetitions with the same cyclic roots such that , , , , , , be known. Then, all maximal α-gapped nondominating repeats of the third type represented by the pair can be computed in time , where s is the number of the maximal α-gapped nondominating repeats of the third type represented by the pair .
Proof. According to the proof of the previous lemma, all maximal α-gapped periodic repeats represented by the pair form a continuous segment in , where the indexes can be computed in constant time. Note that a repeat from is a nondominating repeat of the third type if and only if . Thus, in order to find all maximal α-gapped nondominating repeats of the third type represented by the pair , one needs to select from repeats all repeats such that . This can obviously be done in time , where s is the number of the selected repeats. □
Let
be a left pair of maximal repetitions representing some
α-gapped repeat
with left and right copies
and
and gap
v. Note that the distance
between the repetitions
and
cannot be greater than the gap length
, which is not greater than
. Thus,
, so
If
, we will say that the repetition
is
α-close to the repetition
from the right. Thus, we obtain the following fact.
Proposition 17. If a left pair of maximal repetitions represents at least one maximal α-gapped repeat, then is α-close to from the right.
We also use the following proposition.
Proposition 18. Any maximal repetition has no more than maximal repetitions that are α-close to from the right.
Proof. Let
be all maximal repetitions that are
α-close to
from the right in the increasing order of their starting positions:
For convenience, denote
by
and consider two consecutive repetitions
and
for
. Since
, by Proposition 5, the overlap of
and
is less than
, so
Thus, we obtain that
so
, i.e.,
. □
Further, by computing a periodic repeat, we will mean that the minimal period of copies of this repeat is additionally computed. Now, we show how all left birepresented maximal α-gapped periodic repeats in a word can be efficiently computed.
Lemma 5. All left birepresented maximal α-gapped periodic repeats in w can be computed in time .
Proof. First, we find all maximal repetitions in
w. According to Theorem 1, this can be done in
time, and the total number of these repetitions is
. For each maximal repetition
r in
w, we compute the values
,
,
,
. Moreover, we divide all maximal repetitions in
w into subsets of all repetitions with the same cyclic roots, and represent each of these subsets as a doubly linked list
of the subset repetitions in the order of their starting positions. According to [
17], this can be done in
time. We also rearrange the repetitions in each list
in the order of nondecreasing length (the repetitions of the same length are arranged in the order of increasing starting position). The rearranged list
will be denoted by
. Using bucket sorting, all the lists
can be computed from the lists
in total
time.
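The rearrangement step can be illustrated by a stable two-pass bucket sort. The sketch below is only an illustration under assumptions: the `Repetition` record and its `cls` field (an identifier of the class of repetitions with the same cyclic roots) are hypothetical names, and positions are assumed to lie in the range 1..n.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple


class Repetition(NamedTuple):
    start: int  # starting position in w
    end: int    # ending position in w (inclusive)
    cls: int    # hypothetical identifier of the cyclic-root class

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def bucket_sort(items: List[Repetition], key: Callable[[Repetition], int],
                max_key: int) -> List[Repetition]:
    """Stable sort by an integer key in [0, max_key], in O(len(items) + max_key) time."""
    buckets: List[List[Repetition]] = [[] for _ in range(max_key + 1)]
    for item in items:
        buckets[key(item)].append(item)
    return [item for bucket in buckets for item in bucket]


def lists_by_length(reps: List[Repetition], n: int) -> Dict[int, List[Repetition]]:
    """For each class, list its repetitions by nondecreasing length,
    breaking ties by increasing starting position."""
    by_start = bucket_sort(reps, lambda r: r.start, n)
    by_length = bucket_sort(by_start, lambda r: r.length, n)  # stability keeps the start order
    result: Dict[int, List[Repetition]] = defaultdict(list)
    for r in by_length:
        result[r.cls].append(r)
    return dict(result)
```

Sorting all repetitions at once and then distributing them into their classes keeps the total cost linear in n plus the number of repetitions, instead of paying for an array of n buckets per list.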
For each j, we compute separately all maximal α-gapped periodic repeats represented by left pairs of repetitions from . The computation is performed as follows. We consider consecutively all repetitions from . For each repetition from , we compute all maximal α-gapped periodic repeats represented by left pairs of repetitions where is a repetition from . Before the computation, we assume that all repetitions that precede the repetition in the list are already removed from the current list . Note that, in order to compute the required repeats for the repetition , we need to consider all the repetitions from such that the pair of maximal repetitions is left and represents at least one maximal α-gapped periodic repeat. According to Proposition 17, such repetitions have to be α-close to from the right. Thus, we actually need to consider only repetitions from that are α-close to from the right. Note that, since the lengths of all these repetitions are not less than the length of and the starting positions of all these repetitions are greater than the starting position of , all these repetitions follow in the list , so they are present in the current list . Moreover, they follow in . Recall that the current list contains no repetitions of length less than the length of . Thus, in the current list between the repetition and the repetitions that are α-close to from the right, there are no other repetitions. Thus, all the repetitions that are α-close to from the right form in the list some continuous segment , which immediately follows , i.e., . Therefore, proceeding from the repetition , each of the repetitions that are α-close to from the right can be found in the current list in constant time. After finding each repetition from these repetitions, we compute all maximal α-gapped periodic repeats represented by the left pair . According to Lemma 3, this can be done in time , where s is the number of computed repeats. 
The minimal periods of copies of all these computed repeats are defined as the minimal period of . Thus, the treatment of each of the repetitions that are α-close to from the right can be done in time . Since, according to Proposition 18, the number of maximal repetitions that are α-close to from the right is not greater than , the total treatment of all these repetitions can be done in time , where is the total number of computed repeats for all these repetitions. Then, we remove the considered repetition from the doubly linked list . This can be done in constant time. Thus, the consideration of the repetition can be performed in time . Therefore, since the number of all maximal repetitions is , the total time of considering all maximal repetitions from all lists is , where is the total number of computed repeats. Since each computed repeat is a maximal α-gapped repeat, the number is not greater than the total number of maximal α-gapped repeats in w, which is by Theorem 3, so . Thus, the total time of considering all maximal repetitions from all lists (which is actually the time of computing all left birepresented maximal α-gapped periodic repeats in w) is . □
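The traversal used in this proof can be sketched as follows for a single list of repetitions with the same cyclic roots. This is a minimal illustration under assumptions, not the paper's implementation: repetitions are modelled as `(start, end)` tuples, `is_close` is a hypothetical predicate standing in for the α-closeness condition, and Python's built-in `sorted` replaces the bucket sorting used to obtain the order of nondecreasing length.

```python
from typing import Callable, List, Optional, Tuple

Rep = Tuple[int, int]  # (starting position, ending position); illustrative model


class Node:
    """Node of a doubly linked list of repetitions ordered by starting position."""
    __slots__ = ("rep", "prev", "next")

    def __init__(self, rep: Rep) -> None:
        self.rep = rep
        self.prev: Optional["Node"] = None
        self.next: Optional["Node"] = None


def unlink(node: Node) -> None:
    """Remove a node from its list in constant time."""
    if node.prev is not None:
        node.prev.next = node.next
    if node.next is not None:
        node.next.prev = node.prev
    node.prev = node.next = None


def close_pairs(reps_by_start: List[Rep],
                is_close: Callable[[Rep, Rep], bool]) -> List[Tuple[Rep, Rep]]:
    """Collect pairs (r1, r2) where r2 starts after r1, is not shorter than r1,
    and satisfies the (hypothetical) closeness predicate."""
    nodes = [Node(r) for r in reps_by_start]
    for a, b in zip(nodes, nodes[1:]):
        a.next, b.prev = b, a
    # Process repetitions by nondecreasing length, ties by starting position.
    order = sorted(nodes, key=lambda nd: (nd.rep[1] - nd.rep[0], nd.rep[0]))
    pairs: List[Tuple[Rep, Rep]] = []
    for nd in order:
        cur = nd.next
        # Shorter repetitions are already unlinked, so the close ones follow nd directly.
        while cur is not None and is_close(nd.rep, cur.rep):
            pairs.append((nd.rep, cur.rep))
            cur = cur.next
        unlink(nd)
    return pairs
```

Each step of the inner loop either reports a pair or stops, so the cost per repetition is proportional to the number of close repetitions found plus a constant, which mirrors the accounting in the proof.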
Analogously to the proof of Lemma 5, from Lemma 4, we can also prove the following lemma.
Lemma 6. All left birepresented maximal α-gapped nondominating repeats of the third type in w can be computed in time .
Now, consider the case of periodic repeats represented by right pairs of maximal repetitions. Let the maximal repeat be represented by the right pair of maximal repetitions with the same cyclic roots and the minimal period p. Analogously to the case of repeats represented by left pairs of repetitions, we can show that can satisfy one of the three following cases:
, and ;
, and ;
and .
We will call the repeat a repeat of the first type in case 1, repeat of the second type in case 2, and repeat of the third type in case 3.
Analogously to the case of repeats represented by left pairs of repetitions, we can also show that the starting positions of the right copies of all maximal repeats of the third type represented by the pair form an arithmetic progression , where is the number of all maximal repeats of the third type represented by the pair . We will call the repeat of the third type with the starting position of its right copy a dominating repeat; all the other repeats of the third type will be called nondominating. Analogously to Proposition 16, we can prove that a maximal repeat represented by a right pair of maximal repetitions is covered by another repeat represented by the same pair of repetitions if and only if is a nondominating repeat of the third type. Taking this into account, together with Proposition 16, we obtain the following fact.
Corollary 1. A maximal periodic repeat σ represented by a pair of maximal repetitions is covered by another periodic repeat represented by the same pair of repetitions if and only if σ is a nondominating repeat of the third type.
Analogously to Lemmas 5 and 6, we can also prove similar facts for repeats represented by right pairs of maximal repetitions.
Lemma 7. All right birepresented maximal α-gapped periodic repeats in w can be computed in time .
Lemma 8. All right birepresented maximal α-gapped nondominating repeats of the third type in w can be computed in time .
From Lemmas 6 and 8, we obtain the following corollary.
Corollary 2. All birepresented maximal α-gapped nondominating repeats of the third type in w can be computed in time .
Further, we use Corollary 2 in the proposed algorithm for the exclusion of birepresented maximal α-gapped nondominating covered repeats. We also show how all maximal α-gapped periodic repeats in a word can be efficiently computed.
Lemma 9. All maximal α-gapped periodic repeats in w can be computed in time .
Proof. By Lemmas 5 and 7, we can find in time all birepresented maximal α-gapped periodic repeats in w, so for finding all maximal α-gapped periodic repeats in w, we need to compute additionally in w all maximal α-gapped periodic repeats represented by maximal repetitions. By Proposition 14, maximal α-gapped periodic repeats represented by maximal repetitions are generated by these maximal repetitions. Thus, for finding all these repeats, we can compute initially all maximal α-gapped repeats generated by maximal repetitions in w. By Proposition 15, the number of such repeats is , and all these repeats can be computed in time. Note that a maximal repeat generated by a maximal repetition r is a periodic repeat represented by r if and only if . Moreover, in this case, is the minimal period of copies of . Thus, for each of the computed repeats, we can check in constant time if this repeat is a periodic repeat represented by a maximal repetition and can define in this case the minimal period of copies of this repeat. Thus, all maximal α-gapped periodic repeats represented by maximal repetitions in w can be computed in time, so the total time of computing all maximal α-gapped periodic repeats in w is . □
6. α-Periodic Repeats
Further, we introduce the notion of α-periodic repeats, which is crucial for the proposed algorithm. A repeat is called α-periodic if the minimal period of copies of is not greater than ; otherwise, is called α-nonperiodic. Note that for any α-periodic maximal repeat that is α-gapped or overlapped, we have , where p is the minimal period of copies of . Thus, any α-periodic maximal α-gapped or overlapped repeat is a periodic repeat. Hence, all α-periodic maximal α-gapped or overlapped repeats can be classified as repeats represented by single maximal repetitions or repeats represented by pairs of maximal repetitions. We will use the following property of reprincipal repeats.
Proposition 19. Let r be a maximal repetition such that . Then, the principal repeat of r is α-nonperiodic.
Proof. Let be the principal repeat of r. Assume that is α-periodic, i.e., the copies of are repetitions with minimal period not greater than , so . Since , i.e., , we have that the length of copies of is not less than . Therefore, since both and are periods of copies of , by the periodicity lemma, we obtain that is also a period of copies of . Since is the minimal period of copies of , we conclude that , i.e., is a divisor of . In this case, we obtain that is also a period of r, which contradicts the fact that is the minimal period of r. □
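The periodicity lemma invoked in this proof can be checked on small words. The sketch below assumes the usual weak form of the lemma (if p and q are periods of w and p + q ≤ |w|, then gcd(p, q) is a period of w); the function names are illustrative, and the brute-force checks are meant only for experimentation, not for efficiency.

```python
from math import gcd


def is_period(w: str, p: int) -> bool:
    """True if p is a period of w, i.e. w[i] == w[i + p] wherever both are defined."""
    return p > 0 and all(w[i] == w[i + p] for i in range(len(w) - p))


def minimal_period(w: str) -> int:
    """Smallest period of a nonempty word w (brute force, quadratic time)."""
    return next(p for p in range(1, len(w) + 1) if is_period(w, p))


def gcd_period_holds(w: str, p: int, q: int) -> bool:
    """Check the conclusion of the periodicity lemma for periods p and q of w."""
    assert is_period(w, p) and is_period(w, q)
    return is_period(w, gcd(p, q))
```

For example, the word `"ab" * 5` has periods 4 and 6 with 4 + 6 ≤ 10, and 2 = gcd(4, 6) is indeed a period. The word `"aabaa"` has periods 3 and 4, but 3 + 4 exceeds its length and 1 is not a period, which shows that the length condition cannot simply be dropped.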
Let be an α-periodic maximal repeat represented by some maximal repetition r. Note that, in this case, is covered by r, i.e., it is covered by the principal repeat of r. Since the repeat is a periodic repeat with the minimal period of its copies, i.e., the length of its copies is not less than , and r contains copies of , we have that . Thus, we have the following corollary from Proposition 19.
Corollary 3. Any α-periodic maximal repeat represented by some maximal repetition is covered by the α-nonperiodic principal repeat of this repetition.
Our algorithm is based on the following lemma.
Lemma 10. Let a maximal α-gapped repeat σ be covered by some α-periodic maximal repeat and not covered by any α-nonperiodic principal repeat of maximal repetition. Then, σ is a periodic birepresented nondominating repeat of the third type.
Proof. Let
be covered by a
α-periodic maximal repeat
. By Proposition 10,
is a
α-gapped or overlapped repeat, so
is a periodic repeat. If
is represented by a maximal repetition, then, by Corollary 3,
is covered by the
α-nonperiodic principal repeat of this repetition, so
is also covered by the
α-nonperiodic principal repeat of this repetition, which contradicts the conditions of the lemma. Thus,
is represented by a pair of maximal repetitions
. Let
be the minimal period of these repetitions. Note that, by Proposition 11, the left copy of
is contained in the left copy of
and the right copy of
is contained in the right copy of
. Hence, the left copy of
is contained in
and the right copy of
is contained in
, so
is a period of copies of
. Moreover, we have that
Thus,
is a periodic repeat. Denote by
p the minimal period of copies of
. Let
. Since
, by the periodicity lemma, we have that
is also a period of copies of
, which has to be equal to
p. Thus,
p is a divisor of
, so, in the case
, we obtain that the cyclic roots of repetitions
are not primitive. Therefore,
, i.e.,
is the minimal period of copies of
, so
is represented by the pair of repetitions
. Thus, we have that
is a periodic repeat represented by the pair
and is covered by the periodic repeat
represented by the same pair
. Thus, by Corollary 1,
is a nondominating repeat of the third type. □
We also show that all maximal α-gapped α-periodic repeats in a word can be computed in the following way.
Lemma 11. All maximal α-gapped α-periodic repeats in w can be computed in time .
Proof. Recall that any maximal α-gapped α-periodic repeat is a periodic repeat, so, to find all maximal α-gapped α-periodic repeats, we can compute initially all maximal α-gapped periodic repeats in w. By Lemma 9, this can be done in time . Moreover, by Theorem 3, the number of computed repeats is . Recall that for each computed repeat, we compute additionally the minimal period of copies of this repeat, so we can check in constant time if this repeat is α-periodic. Thus, in time, we can select from the computed repeats all maximal α-gapped α-periodic repeats in w. The total time of the proposed procedure for computing the required repeats is . □
Lemma 11 is used further in the proposed algorithm for the identification of all maximal α-gapped α-periodic repeats.
Further, we show how to efficiently compute all reprincipal α-periodic repeats in a word. Note that, as shown before, these repeats are actually maximal overlapped periodic repeats, which can be birepresented or represented by maximal repetitions. Thus, among reprincipal α-periodic repeats, we consider separately birepresented repeats and repeats represented by maximal repetitions. We consider initially birepresented reprincipal α-periodic repeats. These repeats are maximal overlapped periodic repeats represented by left or right pairs of maximal repetitions. Note that such repeats can be represented only by pairs of overlapped maximal repetitions. First, we consider maximal overlapped periodic repeats represented by left pairs of maximal repetitions.
Lemma 12. Let be a left pair of maximal overlapped repetitions with the same cyclic roots such that , , , , , , be known. Then, the number of maximal overlapped periodic repeats represented by the pair is less than , and all these repeats can be computed in time .
Proof. It is shown in the proof of Lemma 3 that in the set
, all repeats
such that
are gapped repeats, so only repeats
can be overlapped repeats, and, moreover, all these
repeats can be computed in
time. It is also shown that repeats
are overlapped if and only if repeat
is overlapped. Thus, in constant time, we can select all overlapped repeats from repeats
. It follows from Equation (
1) that
. Thus, the number of the selected repeats is not greater than
, and the total time of computing these repeats is
. □
From Lemma 12, we obtain the following fact.
Lemma 13. The number of maximal overlapped periodic repeats represented by left pairs of maximal repetitions is , and all these repeats can be computed in time.
Proof. Analogously to the proof of Lemma 5, first, we find all maximal repetitions in w; for each maximal repetition r in w, we compute the values , , , , and, moreover, we divide all maximal repetitions in w into subsets of all repetitions with the same cyclic roots and represent each of these subsets as a list of the subset repetitions in the order of increasing starting position. As shown in the proof of Lemma 5, this can be done in time. Then, we consider each list . Let consist of consecutive repetitions . According to Proposition 5, for each repetition , the overlaps of with repetitions and are less than , so repetitions and cannot be overlapped. Thus, only pairs can be overlapped pairs of repetitions representing maximal overlapped repeats. Therefore, we traverse the list for finding all overlapped left pairs of repetitions. Note that the total number of maximal repetitions in w is , so the total time of traversing all lists is . For each found pair , we compute all maximal overlapped periodic repeats represented by this pair. By Lemma 12, the number of these repeats is less than , and all these repeats can be computed in time . For the computed repeats, the minimal periods of copies of these repeats are defined as the minimal period of . Note that for each maximal repetition r in w, there can be only one left pair of repetitions such that , so the total number of maximal overlapped periodic repeats represented by all left pairs of repetitions in all lists is less than ; thus, by Theorem 2, this number is . For the same reason, the total time of computing all maximal overlapped periodic repeats represented by all left pairs of repetitions in all lists is , so, by Theorem 2, this time is . Thus, the total time of the considered procedure for computing the required repeats is . □
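The list traversal in this proof inspects only consecutive repetitions. A minimal sketch of that scan, assuming repetitions are modelled as `(start, end)` tuples already ordered by starting position; the additional test that a pair is a left pair is omitted here, since it depends on definitions given earlier in the paper.

```python
from typing import List, Tuple

Rep = Tuple[int, int]  # (starting position, ending position); illustrative model


def overlapped_adjacent_pairs(reps_by_start: List[Rep]) -> List[Tuple[Rep, Rep]]:
    """Return the consecutive pairs (r_i, r_{i+1}) that share at least one position."""
    pairs: List[Tuple[Rep, Rep]] = []
    for a, b in zip(reps_by_start, reps_by_start[1:]):
        if b[0] <= a[1]:  # the next repetition starts before the current one ends
            pairs.append((a, b))
    return pairs
```

The scan is linear in the length of the list, so traversing all lists costs time proportional to the total number of maximal repetitions.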
We can also prove the analogous lemma for right pairs of maximal repetitions.
Lemma 14. The number of maximal overlapped periodic repeats represented by right pairs of maximal repetitions is , and all these repeats can be computed in time.
From Lemmas 13 and 14, we directly obtain the corollary.
Corollary 4. The number of maximal overlapped periodic birepresented repeats is , and all these repeats can be computed in time.
Now, we show that all maximal overlapped α-periodic birepresented repeats can be easily selected from maximal overlapped periodic birepresented repeats.
Lemma 15. The number of maximal overlapped α-periodic birepresented repeats is , and all these repeats can be computed in time.
Proof. Recall that any maximal overlapped α-periodic repeat is a periodic repeat, so, to find all maximal overlapped α-periodic birepresented repeats, we can compute initially all maximal overlapped periodic birepresented repeats in w. By Corollary 4, the number of such repeats is , and all these repeats can be computed in time. Since, for each computed repeat, we compute additionally the minimal period of copies of this repeat, we can check in constant time if this repeat is α-periodic. Thus, in time, we can select from the computed repeats all maximal overlapped α-periodic birepresented repeats in w. The total time of the proposed procedure for computing the required repeats is . □
Finally, we prove that all reprincipal α-periodic birepresented repeats can be efficiently selected from maximal overlapped α-periodic birepresented repeats.
Lemma 16. The number of reprincipal α-periodic birepresented repeats is , and all these repeats can be computed in time.
Proof. Recall that all reprincipal repeats are maximal overlapped repeats, so for computing all reprincipal α-periodic birepresented repeats in w, we can select them from all maximal overlapped α-periodic birepresented repeats in w. By Lemma 15, the number of all maximal overlapped α-periodic birepresented repeats is , and these repeats can be computed in time. By Proposition 7, the number of reprincipal repeats in w is also , and all these repeats can be computed in time. In order to select all reprincipal repeats from maximal overlapped α-periodic birepresented repeats, we represent the set of all the computed maximal overlapped α-periodic birepresented repeats by start position lists . In the same way, we represent the set of all the computed reprincipal repeats by start position lists . All the lists and can be computed in time , where S is the total size of all lists and , so, since this total size is , the time of computing all the lists and is . Then, in order to select the required reprincipal repeats, we traverse simultaneously lists and for . This can also be done in time . Thus, the total time of the proposed procedure for computing the required repeats is . □
Now, we consider reprincipal α-periodic repeats represented by maximal repetitions. We show that, in fact, no such repeats exist.
Proposition 20. The principal repeat of a maximal repetition cannot be represented by another maximal repetition.
Proof. Let be the principal periodic repeat of a maximal repetition r. Assume that is represented by some other maximal repetition . Note that, in this case, , and r is contained in , i.e., the length of the overlap of the repetitions r and is not less than . This contradicts Proposition 6. □
Corollary 5. Any reprincipal α-periodic repeat is a birepresented repeat.
Proof. Let be the principal α-periodic repeat of some maximal repetition r. Note that, as shown above, is a periodic repeat. Assume that is represented by some maximal repetition. By Proposition 20, can be represented only by repetition r. In this case, we have that and the minimal period of copies of is , so the minimal period of copies of is greater than , which contradicts the assumption that is an α-periodic repeat. □
From Lemma 16 and Corollary 5, we obtain immediately the following fact.
Corollary 6. The number of reprincipal α-periodic repeats is , and all these repeats can be computed in time.
Corollary 6 is used further in the proposed algorithm for the identification of all reprincipal α-periodic repeats.
7. Algorithm for Solving Problem 3
In this section, we describe the proposed algorithm in detail. The pseudocode of the proposed algorithm is presented in Algorithm 1. Let . We initially compute the set of all maximal α-gapped repeats in w and the set of all reprincipal repeats in w. By Theorem 3, we have that and can be computed in time. Moreover, by Proposition 7, we have that and can be computed in time. Recall that our goal is to exclude from all repeats that are covered by other maximal repeats. Using Propositions 10 and 12, we conclude that we actually need to exclude from all repeats that are covered by other maximal gapped repeats from or reprincipal repeats. Denote by the set of all repeats from that are not covered by other repeats from or reprincipal repeats. In these terms, our goal is to compute the set .
Note that all maximal repeats generated by a maximal repetition, except the principal repeat of this repetition, are covered by this repetition and so are covered by the principal repeat of this repetition. Recall also that the principal repeats of maximal repetitions are overlapped repeats, so gapped repeats cannot be reprincipal repeats. Thus, maximal gapped repeats generated by a maximal repetition are covered by the principal repeat of this repetition and so have to be excluded from . At the first stage, we exclude from all maximal α-gapped repeats generated by maximal repetitions. By Proposition 15, the number of these repeats is , and all these repeats can be computed in time. Denote the set of all the computed repeats by . In order to exclude from the repeats of the set , we represent by start position lists . These lists can be computed in time . In the same way, we represent the set by start position lists . These lists can be computed in time . Then, all the computed repeats that are contained in can be excluded from by simultaneously traversing lists and in time .
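The exclusion by simultaneous traversal can be sketched as a two-pointer pass over sorted lists. The sketch rests on assumptions: a repeat is modelled as a tuple `(start, period, copy_length)`, the start position list of position t holds the repeats starting at t in sorted tuple order, and Python's `sorted` stands in for the linear-time construction of the lists referred to in the text.

```python
from typing import List, Tuple

Repeat = Tuple[int, int, int]  # (start of left copy, period, copy length); illustrative model


def start_position_lists(repeats: List[Repeat], n: int) -> List[List[Repeat]]:
    """lists[t] holds the repeats starting at position t (1 <= t <= n), in sorted order."""
    lists: List[List[Repeat]] = [[] for _ in range(n + 1)]
    for rep in sorted(repeats):
        lists[rep[0]].append(rep)
    return lists


def exclude(lists_a: List[List[Repeat]],
            lists_b: List[List[Repeat]]) -> List[List[Repeat]]:
    """Remove from lists_a every repeat that also occurs in lists_b,
    traversing the two lists for each position simultaneously."""
    result: List[List[Repeat]] = []
    for la, lb in zip(lists_a, lists_b):
        kept: List[Repeat] = []
        j = 0
        for rep in la:
            while j < len(lb) and lb[j] < rep:
                j += 1  # skip entries of lb that precede rep
            if j == len(lb) or lb[j] != rep:
                kept.append(rep)
        result.append(kept)
    return result
```

Each element of either list is visited a bounded number of times, so the traversal is linear in the total size of the two families of lists plus n.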
Denote by
the resulting set of all repeats from
that remain after the first stage. Recall that all repeats from
that are removed at the first stage are covered by reprincipal repeats. Therefore, any repeat from
that is covered by a repeat
removed at the first stage is covered also by some reprincipal repeat covering the repeat
. Thus, in order to compute the set
, we can remove from
all repeats that are covered by other repeats from
or reprincipal repeats. Denote by
the set of all repeats from
together with all reprincipal repeats in
w. In these terms, in order to compute the set
, we remove from
all repeats that are covered by other repeats from
.
Algorithm 1 Algorithm for solving Problem 3
compute the set of all maximal α-gapped repeats in w
compute the set of all reprincipal repeats in w
compute the set of all repeats from generated by maximal repetitions
exclude from all repeats from
compute the set of all α-periodic repeats from
compute the set of all α-periodic reprincipal repeats in w
compute the set of all α-nonperiodic reprincipal repeats in w
mark all α-nonperiodic repeats from as α-nonperiodic by using the set
remove from all reprincipal repeats and all repeats that are covered by α-nonperiodic repeats
compute the set of all periodic birepresented maximal α-gapped nondominating repeats of third type in w
remove from all repeats from
output
At the second stage, we remove from all repeats that are covered by α-nonperiodic repeats from . For this purpose, first, we compute the set of all α-periodic repeats from and the set of all α-periodic reprincipal repeats in w. By Lemma 11, the set can be computed in time, and, by Corollary 6, the set can be computed in time. Note that after performing the first stage, the set is represented by its start position lists . The set can also be represented by start position lists , which can be computed in time. Then, by simultaneously traversing lists and , we mark all α-nonperiodic repeats from as α-nonperiodic (each repeat from that is not contained in is marked as α-nonperiodic). Moreover, we compute the set of all α-nonperiodic reprincipal repeats in w by removing from the set all repeats from . To perform this removal, we also represent the sets and by their start position lists and . All these lists can be computed in time . Then, we compute all repeats from by simultaneously traversing lists and in total time . Note that the computed set is also represented by its start position lists . Then, we compute the set by merging the lists and into the start position lists for this set. This can be done by simultaneously traversing lists and in time . During the merging of repeats into the lists , we also mark these repeats as gapped or as reprincipal. Note also that .
For , we denote by the subset of all repeats from such that , i.e., .
Proposition 21. Let a maximal gapped repeat σ be covered by a maximal gapped repeat . Then, .
Proof. Note that , so . The inequality is obvious. □
Corollary 7. Let a gapped repeat σ from be covered by a gapped repeat from . Then, is contained in or .
Proposition 22. Let a gapped repeat σ from be covered by a reprincipal repeat from . Then, .
Proof. It is obvious that . Assume that . Note that for any from and any from , we have , so, in this case, . Hence, . Let , be left and right copies of . Since is covered by , the repeat is contained in the repetition with the minimal period , so both copies , are contained in and . Consider the prefixes of length in and . Since these prefixes are equal cyclic roots of , by Proposition 3, we have , so is divisible by , i.e., is divisible by the minimal period of . Moreover, if , then both symbols and are contained in , so , i.e., can be extended to the left, which contradicts the notion that is a maximal repeat. Thus, . It can be analogously proven that . Thus, the repeat is generated by the repetition , which contradicts the notion that . □
Summarizing Corollary 7 and Proposition 22, we obtain the following fact.
Corollary 8. Let a gapped repeat σ from be covered by a repeat from . Then, .
For finding all repeats that have to be removed at the second stage, for each starting position , we compute consecutively such repeats starting at position t. Note that such repeats starting at position t can be covered only by -nonperiodic repeats from that start at a position not greater than t and end at a position greater than t. Denote the set of all these -nonperiodic repeats by and write for . Note that if, for some repeat from , there exists a repeat in such that and , then can be excluded from consideration. The remaining repeats from form a sequence such that and . In order to perform an efficient search in this sequence, we represent as an AVL tree . For each i, we also compute the value , which is the maximum of the ending positions of the last repeats in sequences , , …, .
Let . For any repeat , we define . From and , we have .
Lemma 17. For each , the inequality is valid.
Proof. Denote, for convenience, by , , the left copies of , , , and by , , the right copies of , , . For , denote , , , and for , denote also , , .
Assume that , i.e., . Consider separately the following three cases.
1. Let be contained in . Thus, contains the factor corresponding to the factor in such that . Thus, since , we obtain that has the period , which contradicts the notion that is -nonperiodic.
2. Now, let be not contained in , i.e., , and , which implies . Let be the intersection of and . Since is a factor of and , for , there are corresponding factors in and in . Since and , both factors and have the period . Note that and are less than . Thus, . Therefore, the intersection of and has a length greater than , so, by Proposition 2, the union of factors and has also the period . Note that is contained in , so has also the period , which contradicts the notion that is -nonperiodic.
3. Finally, let be not contained in and , which implies . As in case 2, we define the factors and , and show that both these factors have period . Note that is a prefix of , so is a prefix of . Thus, . Consider the symbol following the copy . Since is maximal, we have . Moreover, since , the symbol is contained in , so contains the corresponding symbol , which is equal to . Thus, we obtain that . Let be not contained in , i.e., . In this case, we have that , which contradicts that . Thus, is contained in . Therefore, since the symbol is contained in and , the symbol is also contained in . Thus, contains both unequal symbols and . Therefore, contains the corresponding unequal symbols and , which can be also represented as and . Let . Note that , so . Thus, in this case, we have , i.e., both unequal symbols and are contained in , which contradicts the notion that has period . Thus, , so . By relation (4), we have . Thus, , which contradicts the notion that .
Since we obtained contradictions in all considered cases, the lemma is proven. □
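Case 2 of the proof above rests on Proposition 2: two overlapped factors with a common period p whose overlap is at least p have a union with period p. A small check of this fact on concrete words may help. This is only an illustrative sketch: the helper `has_period` is hypothetical, and it uses 0-based Python slices rather than the 1-based positions of the paper.

```python
def has_period(word, p):
    """Return True if the positive integer p is a period of `word`,
    i.e., word[i] == word[i + p] for every valid index i."""
    return all(word[i] == word[i + p] for i in range(len(word) - p))


# Overlap of length 3 >= p = 3: the union inherits the period.
w = "abcabcabcab"
u, v = w[0:6], w[3:10]          # "abcabc" and "abcabca", overlapping in w[3:6]
union = w[0:10]                 # "abcabcabca"
print(has_period(u, 3), has_period(v, 3), has_period(union, 3))

# Overlap of length 1 < p = 2: the conclusion may fail.
w2 = "abaca"
u2, v2 = w2[0:3], w2[2:5]       # "aba" and "aca", overlapping in w2[2:3]
print(has_period(u2, 2), has_period(v2, 2), has_period(w2, 2))
```

The second pair shows why the proof needs the overlap bound before invoking Proposition 2: both factors have period 2, but their union does not.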
Lemma 18. .
Proof. Assume by contradiction that . Note that in this case, , so is an overlapped repeat. Since , for the same reason, is also an overlapped repeat. Note that the intersection of repetitions and contains the factor , so the overlap of these repetitions is greater than , which contradicts Proposition 6, i.e., the lemma is proven. □
Using Lemma 18, we have that , and, by Lemma 17, for each , we have . Thus, we obtain that , so . Here, we state this fact.
Corollary 9. for any i.
The procedure of the algorithm for a starting position t is as follows. First, we remove from trees all repeats ending at position t if such repeats exist. To perform this removal efficiently, for each starting position t, we can maintain a doubly linked list containing all repeats ending at position t from trees . In this case, all repeats that have to be removed from trees can be found in time proportional to the number of these repeats. Then, we consider consecutively all repeats in list . Let be a current considered gapped repeat from , and . By Corollary 7, can be covered only by repeats from , where . Note that is covered by a repeat from such that if and only if . Thus, in this case, we remove from . Otherwise, we check if is covered by a repeat from . For this purpose, we search in the maximal k such that . If such k exists and , then is covered by , so, in this case, we also remove from . Otherwise, if is a -nonperiodic repeat, we insert in between and . After this, we remove from all such that and . If it is necessary, we update the values , ,…, , which can depend on the value . Now, let be a current considered reprincipal repeat from . In this case, we remove from and insert in , where , in the same way as inserting it above a -nonperiodic gapped repeat, which is checked to be not covered. Then, we proceed to the next repeat in .
Note that during the described procedure for each repeat from , we perform at most one search in some tree , at most one insertion of in and at most one deletion of from . All these operations can be performed in time. Since, by Corollary 9, , all these operations can be performed in time. Moreover, after the insertion of , no more than values can be updated. These updates can also be performed in time. All the other operations over required for the described procedure can be performed in constant time. Thus, each repeat from can be treated in time, so the total time of the described procedure for all starting positions t is . Recall that , so the total time of the described procedure is .
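The bookkeeping behind the described procedure, namely keeping for each position the repeats that end there so that expired repeats are found in time proportional to their number, can be sketched as a left-to-right sweep. This is a simplified illustration only: the function name `sweep_active_repeats` is hypothetical, repeats are reduced to hypothetical `(start, end)` pairs, dictionaries of lists stand in for the doubly linked lists, and a plain set stands in for the AVL trees, so the covering tests and the logarithmic-time searches of the actual procedure are not modelled.

```python
from collections import defaultdict


def sweep_active_repeats(repeats, n):
    """For each position t in 1..n, return the sorted list of repeats that
    start at a position not greater than t and end at a position greater
    than t.

    `repeats` is an iterable of (start, end) pairs with start < end;
    this encoding is an assumption of the sketch.
    """
    starting_at = defaultdict(list)
    ending_at = defaultdict(list)
    for r in repeats:
        starting_at[r[0]].append(r)
        ending_at[r[1]].append(r)

    active = set()
    snapshots = {}
    for t in range(1, n + 1):
        # First drop the repeats ending at t; the bucket gives them directly,
        # so the cost is proportional to the number of removed repeats.
        for r in ending_at.get(t, []):
            active.discard(r)
        # Then take in the repeats starting at t.
        for r in starting_at.get(t, []):
            active.add(r)
        snapshots[t] = sorted(active)
    return snapshots
```

Each repeat is inserted once and deleted once over the whole sweep, which matches the accounting above of at most one insertion and one deletion per repeat.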
Note that after the second stage, we removed from all reprincipal repeats and all repeats from that are covered by -nonperiodic repeats from , so, after the second stage, the set consists of all repeats from that are not covered by -nonperiodic repeats from . Note also that .
At the third stage, we compute the set by removing from the set all repeats that are not contained in . Let be a repeat from that is not contained in , i.e., is covered by some other repeat besides . Since consists of repeats that are not covered by -nonperiodic repeats from , the repeat can be covered only by a -periodic repeat from . Moreover, cannot be covered by any -nonperiodic reprincipal repeat, since, otherwise, has to be removed from at the second stage. Thus, is covered by some -periodic repeat from and is not covered by any -nonperiodic reprincipal repeat. Therefore, by Lemma 10, is a periodic birepresented nondominating repeat of the third type. On the other hand, if is a periodic birepresented nondominating repeat of the third type from , then, by Corollary 1, is covered by another periodic repeat represented by the same pair of repetitions, so is not contained in . Thus, a repeat from is not contained in if and only if this repeat is a periodic birepresented nondominating repeat of the third type. Hence, for computing the set , we remove from all periodic birepresented nondominating repeats of the third type. Recall that consists of maximal -gapped repeats, so we need actually to remove from all periodic birepresented maximal -gapped nondominating repeats of the third type. For this purpose, we compute the set of all periodic birepresented maximal -gapped nondominating repeats of the third type in w. By Corollary 2, this set can be computed in time . Moreover, since the set is a subset of the set , we have . Then, we represent the set by its start position lists in time . Finally, we remove from all repeats from by the simultaneous traversing of lists and in time . As a result, we obtain the required set . The total time of the third stage procedure is .
Summarizing the times of all the procedures of the proposed algorithm, we obtain that the total time of the algorithm is . Thus, Problem 3 can be resolved in time. Since, as shown above, Problem 3 is equivalent to Problem 1, we conclude the following main result of our paper.
Theorem 4. All maximal δ-subrepetitions in a given word of length n over an integer alphabet can be found in time .
8. Conclusions
In the paper, we proposed an algorithm for finding all maximal -subrepetitions in a given word of length n in time . By Proposition 8, the number of all maximal -subrepetitions in a word of length n is , so the considered problem could be presumably resolved in time. Thus, finding all maximal -subrepetitions in a given word of length n in time is still an open problem.