1. Introduction
The goal of grammar-based compression is to represent a word w by a small context-free grammar that produces exactly $\{w\}$. Such a grammar is called a straight-line program (SLP) for w. In the best case, one gets an SLP of size $\mathcal{O}(\log n)$ for a word of length n, where the size of an SLP is the total length of all right-hand sides of the rules of the grammar. A grammar-based compressor is an algorithm that produces an SLP for a given word
w. Many grammar-based compressors have been proposed in the literature. A well-known example is the classic LZ78 compressor of Lempel and Ziv [1]. Although it was not introduced as a grammar-based compressor, it is straightforward to compute from the LZ78-factorization of w an SLP for w of roughly the same size. Other examples include BISECTION [2] and SEQUITUR [3]. In this work, we study the global grammar-based compressors Greedy [4, 5, 6], RePair [7] and LongestMatch [8], which we also refer to as global algorithms. Global algorithms are important in practice because they achieve excellent compression results in various fields. For example,
Greedy is used in [5] to compress DNA sequences. Among all global algorithms, RePair is probably the most widely used one. Examples include compressing web graphs [9], searching compressed text [10], suffix array compression [11] and compressing XML [12]. A key concept of global compressors is that of maximal strings. A maximal string of an SLP $\mathbb{A}$ is a word $\gamma$ that has length at least two and occurs at least twice without overlap as a factor of the right-hand sides of the rules of $\mathbb{A}$. Furthermore, no strictly longer word appears at least as many times without overlap as a factor of the right-hand sides of $\mathbb{A}$. For an input word w, a global grammar-based compressor starts with the SLP that has a single rule $S \to w$, where S is the start nonterminal of the grammar. The SLP is then recursively updated by choosing a maximal string $\gamma$ of the current SLP and replacing a maximal set of pairwise nonoverlapping occurrences of $\gamma$ by a new nonterminal X. Additionally, a new rule $X \to \gamma$ is introduced. The algorithm stops when the obtained SLP has no maximal string. In the case of
Greedy the chosen maximal string minimizes the size of the SLP in each round, while RePair selects in each round a most frequent maximal string, and LongestMatch chooses a longest maximal string. Please note that the Greedy algorithm as originally presented in [4, 5, 6] is different from the version studied in this work as well as in [13]: The original Greedy algorithm only considers the right-hand side of the start rule for the choice and the replacement of the maximal string. In particular, all other rules do not change after they are introduced.
In [
13] the worst-case approximation ratio of grammar-based compressors is studied. For a grammar-based compressor
that computes an SLP
for a given word
w, one defines the approximation ratio of
on
w as the quotient of the size of
and the size
of a smallest SLP for
w. The approximation ratio
is the maximal approximation ratio of
among all words of length
n. In [
13] the authors provide upper and lower bounds for the approximation ratios of several grammar-based compressors (among them are all compressors mentioned so far), but for none of the compressors the lower and upper bounds match. For
LZ78 and
BISECTION these gaps were closed in [
14]. For all global algorithms the best upper bound on the approximation ratio is
[
13], while the best known lower bounds so far are
for
RePair [
15],
for
LongestMatch and
for
Greedy [
13]. In general, the achieved bounds “leave a large gap of understanding surrounding the global algorithms”, as the authors of [13] conclude.
Unary words have the form $a^n$ for some symbol a and integer $n \ge 1$. Grammar-based compression on unary words is strongly related to the field of addition chains, which has been studied for decades (see [16] (Chapter 4.6.3) for a survey) and still is an active topic due to the strong connection to public key cryptosystems (see [17] for a review from that point of view). An addition chain for an integer n of size m is a sequence of integers $1 = k_1, k_2, \dots, k_m = n$ such that for each d ($2 \le d \le m$), there exist $i, j$ ($1 \le i, j < d$) such that $k_i + k_j = k_d$. It is straightforward to compute from an addition chain for an integer n of size m an SLP for $a^n$ of size $2m - 2$. Vice versa, an SLP for $a^n$ of size m yields an addition chain for n of size m. Therefore, grammar-based compressors on unary inputs can also be thought of as addition chain solvers, i.e., algorithms that find a (small) addition chain for a given integer.
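The correspondence between addition chains and SLPs for unary words can be illustrated with a minimal Python sketch. The function names `is_addition_chain`, `chain_to_slp` and `expand`, the nonterminal names `X1, X2, ...` and the representation of an SLP as a dictionary are assumptions made for this illustration only. The sketch assumes a chain of size at least two and creates one rule with a right-hand side of length two for every chain element except the first.

```python
def is_addition_chain(chain, n):
    """Check that chain starts at 1, ends at n, and every later entry
    is the sum of two (not necessarily distinct) earlier entries."""
    if not chain or chain[0] != 1 or chain[-1] != n:
        return False
    for d in range(1, len(chain)):
        earlier = chain[:d]
        if not any(chain[d] - x in earlier for x in earlier):
            return False
    return True


def chain_to_slp(chain, symbol="a"):
    """Turn an addition chain of size >= 2 into rules {nonterminal: right-hand side}.
    Assumes the chain is valid; the last chain element becomes the start rule S."""
    name = {1: symbol}  # the value 1 is represented by the terminal itself
    rules = {}
    for d in range(1, len(chain)):
        earlier = chain[:d]
        x = next(x for x in earlier if chain[d] - x in earlier)
        head = "S" if d == len(chain) - 1 else f"X{d}"
        rules[head] = [name[x], name[chain[d] - x]]
        name[chain[d]] = head
    return rules


def expand(rules, head="S"):
    """Produce the word generated from the nonterminal head."""
    return "".join(expand(rules, s) if s in rules else s for s in rules[head])


chain = [1, 2, 3, 6, 12, 15]
rules = chain_to_slp(chain)
print(rules)
print(len(expand(rules)), sum(len(rhs) for rhs in rules.values()))
```

For this chain of size 6, the sketch yields five rules of length two each, so the resulting SLP for $a^{15}$ has size 10.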
The worst-case approximation ratio of global algorithms is difficult to analyze in general. Unary words are a natural starting point for such an analysis because of their simple structure. Although unary words are of little practical interest for compression, the behavior of global algorithms on them is still instructive. The improved upper bound we show for Greedy uses unary words and is the first improvement in 15 years.
We show the worst-case approximation ratio of RePair and LongestMatch for unary words to be . Both algorithms are basically identical to the binary method that produces an addition chain for n by creating powers of two using repeated squaring, and then the integer n is represented as the sum of those powers of two that correspond to a one in the binary representation of n. Based on that information, we show that for any unary input w the produced SLPs of RePair and LongestMatch have size at most and we also provide a lower bound.
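The binary method mentioned above can be sketched in a few lines of Python. The function name `binary_method` is an assumption for this illustration, and the sketch only produces the addition chain itself; it does not simulate RePair or LongestMatch.

```python
def binary_method(n):
    """Addition chain for n >= 1: first the powers of two up to n by doubling,
    then the partial sums of those powers that correspond to one-bits of n."""
    chain = [1]
    while chain[-1] * 2 <= n:
        chain.append(chain[-1] * 2)  # doubling the exponent = squaring the word
    powers = list(chain)
    total = powers[-1]               # the leading one-bit of n
    for p in reversed(powers[:-1]):
        if n & p:                    # this bit is set in the binary representation
            total += p
            chain.append(total)
    return chain


print(binary_method(15))
print(binary_method(16))
```

In this sketch the chain has one entry per binary digit of n plus one entry for every one-bit after the leading one, i.e., `n.bit_length() + bin(n).count("1") - 1` entries.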
We improve the upper bound for the approximation ratio of
Greedy on unary words to
. In [
18], which is the previous version of this article, the authors showed an upper bound for the approximation ratio of
Greedy of
for unary inputs, by only analyzing the first three rounds. Here, we present a more in-depth analysis that makes use of every round. We can prove that
Greedy produces an SLP of size
on input
, which together with the fact that a smallest SLP for
has size
then yields the improved upper bound of
. To prove the size bound on the SLP produced by
Greedy, we distinguish unary and nonunary nonterminals. A nonterminal
X is called unary if its right-hand side is of the form
when it is first introduced. Otherwise, it is nonunary. We bound the total number of occurrences of all unary and nonunary nonterminals in the grammar produced by
Greedy separately. For the unary nonterminals, we bound their total number to be
, while each of them contributes a size of
, which yields a total size contribution of
. We then bound the number of occurrences of nonunary nonterminals using the already established number of unary nonterminals, which comes out to be
. Thus, we obtain the desired upper bound on the size of the grammar.
We also show the lower bound of
for the approximation ratio of
Greedy. The key to achieve this bound is the sequence
with
, which has been studied in [
19] (among other sequences), where it is shown that
for
. To prove the lower bound, we show that the SLP produced by
Greedy on input
has size
, while a smallest SLP for
has size
(this follows from a construction used to prove the lower bound for
Greedy in [
13]).
This paper is an extended version of our paper published in the proceedings of SPIRE 2019 [18].
2. Preliminaries
For
we write
for
and
otherwise. For
we denote by
the integer division of
m and
n. We denote by
the modulo of
m and
n, i.e.,
and
If or is used, then this refers to the standard division over . Please note that and .
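An alphabet $\Sigma$ is a finite set of symbols. For a word or string $w = a_1 \cdots a_n$ over $\Sigma$ with $n \ge 0$ and $a_1, \dots, a_n \in \Sigma$ we write $|w| = n$ to denote w’s length. The set $\Sigma^*$ consists of all words over $\Sigma$ and $\Sigma^+ = \Sigma^* \setminus \{\varepsilon\}$, where $\varepsilon$ is the word of length 0. A unary word is a word of the form $a^n$ with $a \in \Sigma$ and $n \ge 1$. All other words are called nonunary. For words $v, w \in \Sigma^*$ we say that v is a factor of w if there are $x, y \in \Sigma^*$ such that $w = xvy$.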
A context-free grammar is a tuple , where N is the finite set of nonterminals, is the alphabet with , P is the set of productions of the form , where and , and is the start symbol. An SLP is a context-free grammar , where
This way, for every
there exists a unique word
with
. We say that
produces w if
. Please note that some authors require SLPs to be in Chomsky Normal Form (CNF), i.e., every production is of the form
, where
or
, where
. We do not make this assumption here because, in general, grammar-based compressors produce SLPs that are not in CNF. Furthermore, every SLP can easily be transformed into an SLP that is in CNF, produces the same word and has roughly the same size. A grammar-based compressor
is an algorithm that given an input
outputs an SLP
that produces
w. The size of an SLP is defined as
. For a word
we write
for the size of a smallest SLP that produces
w. The
worst-case approximation ratio of
is the maximal approximation ratio over all words of length
n over an alphabet of size
k:
For a given SLP $\mathbb{A}$, a word $\gamma \in (N \cup \Sigma)^+$ is called a maximal string of $\mathbb{A}$ if
$|\gamma| \ge 2$,
$\gamma$ appears at least twice without overlap as a factor of the right-hand sides,
and no strictly longer word appears at least as many times as a factor of the right-hand sides without overlap.
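The three conditions can be checked directly by brute force. The following Python sketch is an illustration only: the names `count_nonoverlapping` and `maximal_strings`, and the representation of an SLP as a dictionary from nonterminals to right-hand sides, are assumptions. It reads the third condition literally, i.e., it compares a candidate with every strictly longer word, and it counts nonoverlapping occurrences by scanning each right-hand side from left to right.

```python
def count_nonoverlapping(rhs, word):
    """Count pairwise nonoverlapping occurrences of word in rhs, scanning left to right."""
    count, i, k = 0, 0, len(word)
    while i + k <= len(rhs):
        if tuple(rhs[i:i + k]) == word:
            count += 1
            i += k
        else:
            i += 1
    return count


def maximal_strings(rules):
    """Return {word: number of nonoverlapping occurrences} for all maximal strings."""
    words = set()
    for rhs in rules.values():
        for i in range(len(rhs)):
            for j in range(i + 2, len(rhs) + 1):   # factors of length at least two
                words.add(tuple(rhs[i:j]))
    counts = {w: sum(count_nonoverlapping(rhs, w) for rhs in rules.values())
              for w in words}
    frequent = {w: c for w, c in counts.items() if c >= 2}
    # keep w only if no strictly longer word occurs at least as often
    return {w: c for w, c in frequent.items()
            if not any(len(v) > len(w) and cv >= c for v, cv in frequent.items())}


print(maximal_strings({"S": "aaaaa"}))
print(maximal_strings({"S": "abcabc"}))
```

The enumeration of all factors is cubic in the grammar size, which is acceptable for small examples but is not meant as an efficient implementation.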
Example 1. Let such that P contains
The maximal strings of are , and . The factors and occur four times on the right-hand sides without overlap and occurs twice without overlap.
A global grammar-based compressor (or simply global algorithm) starts on input w with the SLP . In each round , the algorithm selects a maximal string of and updates to by replacing a largest set of pairwise nonoverlapping occurrences of in by a new nonterminal X. Additionally, the algorithm introduces the rule in . The algorithm stops when no maximal string occurs. Please note that the replacement is not unique, e.g., the word has a unique maximal string , which yields SLPs with rules , or , or , . We assume the first variant here, i.e., maximal strings are replaced from left to right. The compressor Greedy that we study in this work chooses a maximal string in each round such that the size of is minimal.
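The round structure of a global algorithm can be sketched as follows. This is a naive simulation under stated assumptions, not the implementation used in the literature: the function names, the nonterminal names `X1, X2, ...`, the `strategy` parameter and the tie-breaking rule (the smallest candidate in sorted order) are all chosen for this illustration, and the input is assumed to consist of single-character symbols other than `S`.

```python
def occurrences(rhs, word):
    """Start positions of nonoverlapping occurrences of word, chosen from left to right."""
    pos, i, k = [], 0, len(word)
    while i + k <= len(rhs):
        if tuple(rhs[i:i + k]) == word:
            pos.append(i)
            i += k
        else:
            i += 1
    return pos


def maximal_strings(rules):
    words = {tuple(r[i:j]) for r in rules.values()
             for i in range(len(r)) for j in range(i + 2, len(r) + 1)}
    count = {w: sum(len(occurrences(r, w)) for r in rules.values()) for w in words}
    count = {w: c for w, c in count.items() if c >= 2}
    return {w: c for w, c in count.items()
            if not any(len(v) > len(w) and cv >= c for v, cv in count.items())}


def replace(rules, word, new):
    """Replace occurrences of word from left to right by new and add the rule new -> word."""
    updated = {}
    for head, rhs in rules.items():
        out, i, starts = [], 0, set(occurrences(rhs, word))
        while i < len(rhs):
            if i in starts:
                out.append(new)
                i += len(word)
            else:
                out.append(rhs[i])
                i += 1
        updated[head] = out
    updated[new] = list(word)
    return updated


def size(rules):
    return sum(len(rhs) for rhs in rules.values())


def expand(rules, head="S"):
    return "".join(expand(rules, s) if s in rules else s for s in rules[head])


def global_compress(w, strategy="greedy"):
    rules, round_no = {"S": list(w)}, 0
    while True:
        cands = maximal_strings(rules)
        if not cands:                      # no maximal string left: stop
            return rules
        round_no += 1
        new = f"X{round_no}"
        if strategy == "repair":           # a most frequent maximal string
            word = max(sorted(cands), key=lambda v: cands[v])
        elif strategy == "longestmatch":   # a longest maximal string
            word = max(sorted(cands), key=len)
        else:                              # greedy: minimize the size of the next SLP
            word = min(sorted(cands), key=lambda v: size(replace(rules, v, new)))
        rules = replace(rules, word, new)


for strategy in ("greedy", "repair", "longestmatch"):
    grammar = global_compress("a" * 27, strategy)
    print(strategy, size(grammar), grammar)
```

Because ties are broken arbitrarily here, the grammars printed by this sketch need not coincide with the ones obtained under a different tie-breaking rule.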
Example 2(Greedy). Let . We have
,
, ,
, , ,
, , , ,
, , , , .
Please note that in the first round, instead of the maximal string the algorithm could also choose the maximal string , because both choices yield SLPs of minimal size 15. In the second round, instead of the algorithm could also choose , because both choices yield SLPs of size 14. Finally, the order of the choices (round 3) and (round 4) could be swapped because both choices yield SLPs of unchanged size 14.
The following lemma from [13] provides a lower bound on the size of an SLP for a word of length n.
Lemma 1 ([13] (Lemma 1)). For every word of length n, we have .
3. Upper Bound for Greedy
To show our improved upper bound for the approximation ratio of Greedy on unary words, we are first going to prove that the size of the SLP produced by Greedy for the input is upper-bounded by .
First, we need to prove several lemmas that are fulfilled for any global algorithm. When we apply specific arguments to Greedy, we draw attention to it. For better readability, we will use , i.e., the input is . Furthermore, let be the SLP obtained by the global algorithm on input after i rounds. Please note that until the algorithm stops, we have since exactly one new nonterminal is introduced in each round. If we quantify over the rounds of the algorithm, we always implicitly mean that the statements hold until the algorithm stops. If i is mentioned without a quantification, then the statement holds for any constructed after some round i of the algorithm.
Lemma 2. For every i, there is a fixed order of the nonterminals in such that every right-hand side of a rule satisfies
Proof. We prove this property by induction. Initially, the property holds for the SLP since and the only rule satisfies . Now assume the claim is true for , i.e., each right-hand side of a rule in is a word from . Please note that any nonempty factor of such a right-hand side is a word from for some . Therefore, assume the global algorithm chooses a maximal string in round and is the corresponding new rule. We show that for all rules , i.e., the order of the nonterminals after round is obtained by inserting the new nonterminal X directly before in the previous order. First, this is obviously true for the new rule as well as for all rules that have not been modified during round . It remains to check the rules that are obtained from a rule by replacing a largest set of pairwise nonoverlapping occurrences of in by the new nonterminal X. If () is unary and is the single maximal -block that occurs in , then replacing occurrences of from left to right yields as the new maximal blocks of X and in w. It follows that . If otherwise is not a unary word, i.e., for , then has exactly one occurrence of as a factor. It follows that and thus v satisfies the claim. This finishes the induction. □
In other words, there is at most one maximal block for each symbol on each right-hand side and the order of these blocks is the same for all rules. Similar to the case distinction in the last steps of the proof of Lemma 2, we will distinguish two types of nonterminals. Let be the introduced rule in some round of the algorithm. If is unary, then we call X a unary nonterminal. Otherwise, we call X nonunary. We categorize as a unary nonterminal, although formally is not a nonterminal. Please note that the type of a nonterminal is decided when it is introduced and does not change later, i.e., even if the right-hand side of a unary nonterminal becomes nonunary during the execution of the algorithm, the type of the nonterminal stays the same. Our strategy to prove Theorem 1 is to bound the total number of occurrences of unary nonterminals and nonunary nonterminals on right-hand sides independently. It follows from Lemma 2 that every factor that occurs more than once on the right-hand side of a single rule is unary. The following lemma is a direct consequence of that fact.
Lemma 3. Every nonunary nonterminal occurs at most once on the right-hand side of each rule at any time of the algorithm.
Corollary 1. If a unary nonterminal X is introduced and with is the corresponding rule for X, then Z is a unary nonterminal.
Next, we bound the number of rules that contain a unary nonterminal
X on the right-hand side. For a nonterminal
X, including
, let
be the number of rules of
where
X occurs on the right-hand side, or more formally
The next two lemmas describe how evolves depending on the type of the introduced nonterminal.
Lemma 4. If a nonunary nonterminal X is introduced in some round , then for every (including ), we have .
Proof. To prove this point, we use that all rules
satisfy
(Lemma 2). Since
X is nonunary, the chosen maximal string
satisfies
for
. If a nonterminal
does not occur in
, then
. So, assume the nonterminal
occurs in
, i.e.,
for
. Note first that
occurs on the right-hand side of the new rule
. It follows that to prove the claimed result, we must show that for at least one rule
such that
occurs in
v, all occurrences of
must disappear, i.e.,
does not occur in
for
. Let
be the set of nonterminals of
where the corresponding rule is modified in round
. Please note that
since a maximal string occurs at least twice on all right-hand sides and
occurs at most once as factor of each rule since
X is nonunary (Lemma 3). If
then for all
and
, the nonterminal
does not occur in
anymore since the complete
-block (among other symbols) has been replaced. This means that
since one new rule contains
on the right-hand side, while for at least two rules the occurrences of
have been removed in round
. If otherwise
or
, then the same argument fails since for
and
, the right-hand side
could still contain
since it is not necessarily true that the complete
-block has been replaced. However, due to the properties of a maximal string, we show that
does not occur in
for at least one
and
. Towards a contradiction, assume
occurs in
for all
and
. This means that for all
and
, the length of the maximal
-block that occurs in
v is strictly larger than the length of the maximal
-block that occurs in
. If
, it follows that
is a factor of
v for all
and
, and symmetrically, if
then
is a factor of
v for all
and
. This contradicts the property that no strictly longer string than
occurs at least as often on the right-hand sides of the rules. It follows that in this case
, which finishes the proof. □
Lemma 5. If a unary nonterminal X is introduced in some round and with is the corresponding rule, then and .
Proof. Both points are straightforward: A rule only contains X on the right-hand side if is a factor of for , which shows that . For the second point, note that if does not contain Z on the right-hand side, then the same is true for (the unchanged rule) . The only rule where Z occurs new is the new rule , and thus . □
So far, we have shown that when a unary nonterminal X is introduced and with is the corresponding rule, then Z is a unary nonterminal as well (Corollary 1). Furthermore, we argued that introducing a nonunary nonterminal does not increase the number of rules where a unary nonterminal occurs on the right-hand side (Lemma 4). It follows that we can upper-bound the number of unary nonterminals and the number of rules where those nonterminals occur on right-hand sides independently of the nonunary nonterminals.
To do so, we inductively define a binary tree that describes how the unary nonterminals evolve until is reached. All nodes in the tree are labeled with , where X is a unary nonterminal and k is an upper bound on for some j.
- (1) Initially, the tree only contains a single node that is labeled with .
- (2) If a rule is introduced in round for some , then we update to by adding two children to the unique leaf that is labeled with for some k. The new left child is labeled with and the new right child is labeled with as depicted on the left of Figure 1.
- (3) If otherwise a nonunary nonterminal is introduced in round , then , i.e., nonunary nonterminals are ignored.
The initial tree reflects that the only unary nonterminal of is and . If the tree is modified according to point (2) of the definition, this refers to Lemma 5, where and is shown when a rule for is introduced.
The level of a node is the length of the path from the node to the root. For a unary nonterminal , we denote by leveli the level of the unique leaf of that is labeled with for some k.
Example 3. Assume that the first three rules introduced by a global algorithm are in the first round, in the second round and in the third round for . The tree that corresponds to these introduced rules is depicted on the right of Figure 1. The indices for the introduced nonterminals are chosen such that the ordering of the nonterminals in (see Lemma 2) is , i.e., all right-hand sides of rules are contained in . The corresponding SLPs , , and are depicted next, where we simply use ∗ instead of the exact exponents of the symbols for better readability. Please note that in this example, we have , , and , which is exactly the information contained in the second components of the leaf labels in . Furthermore, we have level3 for in this example.
The following lemma is a direct consequence of the fact that the maximal k that occurs for some label is incremented from one level to the next level (as described in Lemma 5).
Lemma 6. For each node of at level m that is labeled with for some unary nonterminal , we have . Also, let be a unary nonterminal. Then we have .
So far, we provided information about the number of rules where a unary nonterminal occurs. Next, we move on to the total number of occurrences of a unary nonterminal on all right-hand sides. We denote by the total number of occurrences of X on right-hand sides of rules in . We have by the definition of both functions and for a nonunary nonterminal X, we have due to Lemma 3.
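The two counting functions used in this section can be stated operationally. In the following Python sketch, the names `rules_containing` and `total_occurrences` are hypothetical stand-ins for the paper's notation, and the small grammar for $a^{9}$ is an example chosen only for illustration.

```python
def rules_containing(rules, x):
    """Number of rules whose right-hand side contains the symbol x at least once."""
    return sum(1 for rhs in rules.values() if x in rhs)


def total_occurrences(rules, x):
    """Total number of occurrences of the symbol x on all right-hand sides."""
    return sum(rhs.count(x) for rhs in rules.values())


# Illustrative SLP for a^9: S -> X2 X2 a, X2 -> X1 X1, X1 -> a a
rules = {"S": ["X2", "X2", "a"], "X2": ["X1", "X1"], "X1": ["a", "a"]}
for symbol in ("a", "X1", "X2"):
    print(symbol, rules_containing(rules, symbol), total_occurrences(rules, symbol))
```

Since every rule that contains a symbol contributes at least one occurrence, the first count can never exceed the second; the two counts coincide exactly when the symbol occurs at most once per right-hand side.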
Lemma 7. Let be the rule that is introduced in some round and let . We have
- (1)
for all , and
- (2)
.
Proof. Point (1) is straightforward: For let be the maximal Y-block that occurs as a factor in for some . Replacing on right-hand sides yields that at least two occurrences of are eliminated while only is added as a part of the new rule . If otherwise , then because the occurrences of Y are not affected by the new rule.
Point (2) is also based on a simple observation. Please note that describes the part of the SLP that is affected by the replacement of in round , and is the size of that part in after the occurrences of are replaced by X plus the new occurrences in the introduced rule. All other parts of are not affected by the new rule. Now the properties of a maximal string ensure that . The extreme case where has length two and occurs only twice without overlap on the right-hand sides of rules in satisfies . All other cases even satisfy . Point (2) directly follows. □
Our next goal is to bound depending on for a unary nonterminal X. To do so, we now apply arguments specific to Greedy. Recall that Greedy selects a maximal string that minimizes the size of the obtained SLP in each round.
Lemma 8. Let be a unary nonterminal and assume a rule for some is introduced by Greedy in round . We have
Proof. If a unary nonterminal
X and a rule
with
are introduced in round
, then the choice of
d only depends on the maximal
Z-blocks occurring on all right-hand sides of rules in
since the remaining part of
does not change. Assume that
and let
be the lengths of the maximal
Z-blocks occurring on right-hand sides of
, i.e.,
. Then
Greedy minimizes
, where
d is the size of the new rule
and for each
a maximal block
on the right-hand side of a rule in
is transformed into
. Due to the greedy nature of the algorithm, the following equation holds for all
:
Please note that the chosen maximal string has length at least 2, but the upper bound also holds for
since in this case we have
due to Lemma 7 (point (2)). If we apply
, we get
Together with this proves the lemma. □
The following lemma is essential for the proof of Theorem 1 since we bound the total number of occurrences of a unary nonterminal depending on its level.
Lemma 9. Let be a unary nonterminal with . We have
Proof. We prove the lemma by induction on
and we start with
. The only SLP
that contains a unary nonterminal
X such that
is the initial SLP
and the unary nonterminal is
. Please note that the maximal string
chosen by any global algorithm in the first round on input
trivially satisfies
and thus the two unary nonterminals of
have level one. We have
and this is exactly what we obtain when
is used on the right side of Equation (
1) (the empty product is considered to be 1).
Now assume any unary nonterminal that has level
m satisfies the claimed bound and we consider a unary nonterminal
X such that
for some
i. It follows from the definition that there is a leaf node at level
in
that is labeled with
for some
k. There are two cases that need to be distinguished. Either this leaf is a left child or a right child of its parent node. Assume that
is the label of a right child and let
be the label of the left sibling of that node. To prove both cases simultaneously, we prove the upper bound for
X and for
Z, i.e., we use
Z to cover the second case where the node is a left child. The parent node of
and
is labeled with
(see
Figure 1 on the left). Let
be the maximal
such that
, i.e.,
for
is the introduced rule in round
and
is the label of a leaf at level
m in
. By induction, we have
. Now by Lemma 8, we have
Together with
(Lemma 6), this yields
Using the fact that (there are at least two nonoverlapping occurrences of a maximal string) and (the new rule contains Z at least twice) yields the claimed upper bound on and . Finally, this upper bound holds for due to Lemma 7 (point (1)). □
Corollary 2. Let be a unary nonterminal with . We have
Proof. Please note that
for all
. We upper-bound the right side of Equation (
1) (Lemma 9) as follows:
□
What we achieved so far is to bound the total size
that a unary nonterminal
X contributes on right-hand sides of the rules depending on
. Next, we bound the size that nonunary nonterminals contribute to
depending on the levels of all unary nonterminals. To do so, we need the following definitions. Let
be the number of distinct right neighbors of
X (which are not equal to
X) on right-hand sides plus the number of occurrences of
X as the last symbol of a right-hand side in
, i.e.,
Let
be the number of distinct left neighbors of
X (which are not equal to
X) on right-hand sides plus the number of occurrences of
X as the first symbol of a right-hand side in
, i.e.,
Also, let and . Please note that and since for each right-hand side of a rule there is at most one right (respectively, left) neighbor for some occurrence of X due to Lemma 2 and each right-hand side can contain X at most once as the last (respectively, first) symbol. Also, means that all maximal X-blocks on right-hand sides are either at the end of the right-hand side or are followed by a distinct symbol. Similarly, means that all maximal X-blocks on right-hand sides are either at the beginning of the right-hand side or are preceded by a distinct symbol. The following lemmas describe how the functions and evolve.
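The verbal definitions of the two neighbor counts translate directly into code. In this Python sketch, the names `right_count` and `left_count` are hypothetical, and the three right-hand sides are an arbitrary illustration rather than the rules of an actual SLP produced by a global algorithm.

```python
def right_count(rules, x):
    """Distinct right neighbors of x (other than x itself) on right-hand sides,
    plus the number of occurrences of x as the last symbol of a right-hand side."""
    neighbors, at_end = set(), 0
    for rhs in rules.values():
        for i, s in enumerate(rhs):
            if s != x:
                continue
            if i + 1 == len(rhs):
                at_end += 1
            elif rhs[i + 1] != x:
                neighbors.add(rhs[i + 1])
    return len(neighbors) + at_end


def left_count(rules, x):
    """Distinct left neighbors of x (other than x itself) on right-hand sides,
    plus the number of occurrences of x as the first symbol of a right-hand side."""
    neighbors, at_start = set(), 0
    for rhs in rules.values():
        for i, s in enumerate(rhs):
            if s != x:
                continue
            if i == 0:
                at_start += 1
            elif rhs[i - 1] != x:
                neighbors.add(rhs[i - 1])
    return len(neighbors) + at_start


# Three illustrative right-hand sides, each with a single maximal X-block
rules = {"S": ["A", "X", "X", "B"], "T": ["C", "X", "B"], "U": ["X", "X"]}
print(right_count(rules, "X"), left_count(rules, "X"))
```

In this example two X-blocks share the right neighbor B, so the right count (2) is smaller than the number of rules containing X (3), whereas all left neighbors are distinct and the left count equals 3.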
Lemma 10. If then . If a nonunary, maximal string is selected in round for some , then .
Proof. Let be the introduced rule in round . If X does not occur in , then it is straightforward to see that since and . The new nonterminal Y could be a new right neighbor for some occurrences of X, but all occurrences of X which have this new right neighbor Y in shared the same right neighbor in (the first symbol of ).
If otherwise X occurs in , then first assume that for some . Please note that replacing an occurrence of on the right-hand side of a rule either removes all occurrences of X on this right-hand side (in case u contains a maximal X-block of length for some integer ) or the right neighbor of the maximal X-block in u does not change in the modified rule since occurrences of are replaced from left to right. It follows that the only way to obtain is to remove all occurrences of X on a right-hand side, but then decreases by the same value. Additionally, the new rule adds a new right-hand side to , but since X is the last symbol on this right-hand side it follows that is incremented as well. Together this yields in this case.
The case remains where is nonunary and X occurs in . Here, we have due to Lemma 4 and occurs at most once on each right-hand side due to Lemma 2. However, again, the only way to reduce compared to is to remove all occurrences of X on a right-hand side, but then again decreases by the same value. This yields .
Assume now that a nonunary, maximal string for some is selected in round . We show that . If X only occurs in the new rule in after occurrences of are replaced on all right-hand sides in , i.e., all modified rules do not contain X anymore, then because at least two right-hand sides do not contain X as a factor anymore while only the new rule adds a new right-hand side which contains X to . Also, we have in this case and thus because all rules where the maximal X-block is removed shared the same right neighbor due to the fact that is the chosen maximal string. However, still occurs in the new rule of and thus . If otherwise at least one of the modified rules still contains X on the right-hand side, then is a factor of this right-hand side in after the replacement of . It follows that in this case and thus because each distinct right neighbor of X in is still a right neighbor of X in as argued above, but additionally is new since Y is a new nonterminal. □
The same result does not hold for . In particular, is possible when a rule is introduced in round for some due to the assumption that global algorithms replace occurrences of the maximal string from left to right. For example, assume that , and are the maximal X-blocks on right-hand sides of including distinct left neighbors for each X-block (A, B and C). Therefore, we have , and thus in this example. If now a rule is introduced, then this yields , and after replacing . Hence we have , and thus . We show in the following lemma that this is the only case where occurs.
Lemma 11. Let . If a rule is introduced in round such that , then . If a nonunary, maximal string is selected in round for some , then
Proof. The arguments are similar to the corresponding cases in Lemma 10. Let be the introduced rule in round . If X does not occur in , then since and . The new nonterminal Y could be a new left neighbor for some occurrences of X in , but all these occurrences of X shared the same left neighbor in (the last symbol of ).
If otherwise is nonunary and contains X, then we have due to Lemma 4 and occurs at most once on each right-hand side due to Lemma 2. The only way to obtain is again to remove all occurrences of X on a right-hand side, but then decreases by the same value. Please note that due to the assumption that is nonunary, it is not possible to modify two (or more) rules such that the maximal X-blocks have different left neighbors in and after the replacement these X-blocks share the same left neighbor in . This yields .
Assume now that a nonunary, maximal string is selected in round for some word . We show that . If X only occurs in the new rule in , i.e., X does not occur in the modified rules, then we have because at least two right-hand sides do not contain X as a factor anymore while only the new rule adds a new right-hand side which contains X to . Moreover, we have in this case because all rules where the maximal X-block is removed shared the same left neighbor since is the selected nonunary, maximal string. However, still occurs on the right-hand side of the new rule of . It follows that . If otherwise at least one of the modified rules still contains X on the right-hand side, then is a factor of this right-hand side in after the replacement of . It follows that and thus because each distinct left neighbor of X in is still a left neighbor of X in for some occurrence of X. Additionally, Y is a new left neighbor. □
In the following proof, we use the notation
for all unary nonterminals that appear in
and
for all nonunary nonterminals that appear in
.
Proof. Let be the total size that all nonunary nonterminals contribute to the size of . We first bound the number of rounds where the function s increases, i.e., we bound . If a unary nonterminal is introduced in some round , then , i.e., we can ignore these rules. So, consider some round where a nonunary nonterminal X is introduced and let be the introduced rule. Let be the set of nonterminals that occur at least once in . We first show that if , then . In other words, if two nonunary nonterminals occur in , then . Let r be the number of rules such that is a factor of v. Recall that nonunary factors and nonterminals occur at most once on the right-hand side of a single rule (Lemma 3). We have because the new nonunary nonterminal X occurs now on r right-hand sides, contains k nonunary nonterminals which occur exactly once in each, and the replacement of on right-hand sides deletes these k nonterminals on r right-hand sides. We have (due to the properties of a maximal string) which together with yields . Hence we can assume that . The maximal string has length and is not unary, so the first and the last symbol of are different and at least one of them is unary due to our assumption that at most one nonunary nonterminal occurs in . Let Y be this unary nonterminal and assume that Y is the first symbol, i.e., the nonunary, maximal string is for some (nonempty) v. Afterwards we discuss the case where Y is the last symbol, i.e., .
We bound the number of rounds where a nonunary, maximal string
is selected for some
v. Let
be the round where the unary nonterminal
Y has been introduced. We have
due to Lemma 6 and
for all
. We also have
for
by Lemma 10. In addition to that, if
is the selected nonunary, maximal string in round
for some
v, then we have
again by Lemma 10. It follows that after at most
many rounds where the chosen maximal string is nonunary and has the form
for some (nonempty)
v, we have
. In this case, all maximal
Y-blocks have distinct right neighbors or occur at the end of a right-hand side. Hence there is no possibility to select a nonunary, maximal string
anymore.
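The stopping condition used above (all maximal Y-blocks have distinct right neighbors or end a right-hand side) can be checked mechanically. The following is a minimal sketch, not part of the original proof. It assumes each right-hand side is modeled as a Python string whose characters are the grammar symbols, and the helper names `right_neighbors_of_blocks` and `blocks_have_distinct_right_neighbors` are hypothetical.

```python
def right_neighbors_of_blocks(rules, y):
    """Return the right neighbor of every maximal block of symbol y.

    Assumption: each right-hand side is a string and each character is
    one grammar symbol. A block that ends its right-hand side yields None.
    """
    neighbors = []
    for rhs in rules:
        i = 0
        while i < len(rhs):
            if rhs[i] == y:
                j = i
                # Extend to the end of the maximal block of y.
                while j < len(rhs) and rhs[j] == y:
                    j += 1
                neighbors.append(rhs[j] if j < len(rhs) else None)
                i = j
            else:
                i += 1
    return neighbors


def blocks_have_distinct_right_neighbors(rules, y):
    """True if no two maximal y-blocks share the same right neighbor.

    Blocks at the end of a right-hand side (neighbor None) are ignored,
    mirroring the "or occur at the end of a right-hand side" clause.
    """
    inner = [n for n in right_neighbors_of_blocks(rules, y) if n is not None]
    return len(inner) == len(set(inner))
```

For instance, under these assumptions the right-hand sides `"YYaYb"` and `"cYY"` pass the check, whereas `"YaYYa"` fails it because two Y-blocks are both followed by `a`.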
Now we similarly bound the number of rounds such that a nonunary, maximal string
is selected for some
v. However, care must be taken in this case, because it is possible that
when a rule
for
is introduced in round
as explained above. Fortunately, rules of this form (the selected maximal string is from
) are introduced at most
many times up to round
i by the definition of
. Let
be the round where the unary nonterminal
Y has been introduced. We have
for each
due to Lemma 6 and
for all
. Furthermore, if the selected maximal string in round
(
) is not from
, we have
due to Lemma 11. Moreover, if the maximal string
is nonunary and
for some (nonempty)
v, then
. It follows that between two rounds where maximal strings from
are selected, there are at most
many rounds where a nonunary, maximal string of the form
is chosen because then
is reached (for some
j) and thus all maximal
Y-blocks have distinct left neighbors or occur at the beginning of a right-hand side. Hence no nonunary string of the form
for some
v occurs twice on right-hand sides. Since maximal strings from
are chosen at most
many times up to round
i, it follows that the number of rounds where a nonunary, maximal string of the form
for some
v is selected is at most
.
Furthermore, the maximal increase in a single round is at most , because the new nonunary nonterminal occurs in on at most many right-hand sides of rules for any and the total number of occurrences of all other (nonunary) nonterminals does not increase (Lemma 7, point (1)).
We conclude that for each unary nonterminal
Y, at most
many rules are introduced such that the nonunary, maximal string
satisfies
or
for some
v and each of those rules increases the total size that nonunary nonterminals contribute by at most
. In all other cases, we showed that the size that nonunary nonterminals contribute does not increase. □
Now we are able to prove Proposition 1.
Proof of Proposition 1. Let
be the final SLP obtained by
Greedy, i.e., after
f rounds the algorithm stops because
has no maximal string. First, we want to bound the level of unary nonterminals occurring in
. Assume there is a unary nonterminal
X such that
after some round
of the algorithm. By Corollary 2, we have
Consider the unique leaf node
in the tree
which has level
and label
for some
k. If in some round
two children with labels
and
are attached to
, i.e., the introduced rule in round
j is
for some
, then we have
by Lemma 7. To be more specific, if the length of the chosen maximal string
is exactly
and this maximal string
occurs exactly twice without overlap in
, then we have
(and
). Please note that in this case, there does not exist a maximal string
or
of
for all
(since
Y occurs only twice and
does not occur on right-hand sides of
), i.e., the children of the node
in
are leaves for
. Otherwise, if the maximal string has length
or occurs at least three times without overlap, then we have
since
holds in this setting. This means that when a new branch occurs in the tree
for some
j, then the new children of the branching node are either leaves of the final tree
or the corresponding nonterminals contribute strictly less to the size of the current SLP than the nonterminal which corresponds to the parent node did before the branch. We can iterate this argument for the children of the children of
and so on, i.e., if we consider the subtree rooted at
in
, then from level to level the size that the nonterminals contribute decreases until only leaves occur at some level. Since
, it follows that the subtree of
rooted at
has depth at most
and thus the maximal level of any unary nonterminal in
is bounded by
Consequently, the number of unary nonterminals (the number of leaves of ) is bounded by since is a binary tree of depth at most . Furthermore, each unary nonterminal X in satisfies . If there is a round such that we obtain by Corollary 2 and Lemma 7, point (1). Otherwise, let . Then , because there is at most one nonoverlapping occurrence of on right-hand sides of (otherwise there would exist a maximal string of ) and the number of rules where X occurs on the right-hand side is by Lemma 6. To be more precise, a single right-hand side of could have a maximal X-block of length 3 and all other right-hand sides must have at most one occurrence of X since two different right-hand sides where X-blocks of length 2 occur as well as one right-hand side where an X-block of length 4 occurs would contradict the fact that has no maximal string. It follows that the size which unary nonterminals contribute to is . By Lemma 12, we can bound the size that nonunary nonterminals contribute by since there are at most many unary nonterminals and each has level at most as argued above. It follows that , which proves the proposition. □
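The case distinction on X-blocks in the proof above can be illustrated with a small check for whether a set of right-hand sides still admits a maximal string. This is a minimal sketch under stated assumptions, not the authors' procedure: right-hand sides are modeled as strings with one character per symbol, and the function names are hypothetical. It relies on the observation that a maximal string exists exactly when some factor of length two occurs at least twice without overlap, since the length-two prefix of any repeated longer factor is itself repeated.

```python
def count_nonoverlapping(rhs, factor):
    """Count pairwise nonoverlapping occurrences of factor in rhs.

    A greedy left-to-right scan yields the maximum number of
    nonoverlapping occurrences within a single string.
    """
    count, i = 0, 0
    while True:
        i = rhs.find(factor, i)
        if i < 0:
            return count
        count += 1
        i += len(factor)


def has_maximal_string(rules):
    """True if some length-2 factor occurs at least twice without overlap
    across all right-hand sides (assumption: one character per symbol)."""
    for rhs in rules:
        for start in range(len(rhs) - 1):
            factor = rhs[start:start + 2]
            total = sum(count_nonoverlapping(r, factor) for r in rules)
            if total >= 2:
                return True
    return False
```

With this sketch, a single X-block of length 3 (`["aXXXb", "cXd"]`) leaves no maximal string, while an X-block of length 4 (`["aXXXXb"]`) or two X-blocks of length 2 on different right-hand sides (`["aXXb", "cXXd"]`) each produce one, which matches the three cases named in the proof.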
The following theorem follows directly from Proposition 1 and Lemma 1, where is shown for words w of length n.