Practical Evaluation of Lyndon Factors via Alphabet Reordering

Albertini, Marcelo K.; Louza, Felipe A.

doi:10.3390/math11010139

by Marcelo K. Albertini ¹,† and Felipe A. Louza ²,*,†

¹ Faculdade de Computação, Universidade Federal de Uberlândia, Uberlândia 38400-902, Brazil

² Faculdade de Engenharia Elétrica, Universidade Federal de Uberlândia, Uberlândia 38400-902, Brazil

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Mathematics 2023, 11(1), 139; https://doi.org/10.3390/math11010139

Submission received: 24 November 2022 / Revised: 20 December 2022 / Accepted: 22 December 2022 / Published: 27 December 2022

(This article belongs to the Special Issue Analysis of One-Dimensional Regularities)

Abstract:

We evaluate the influence of different alphabet orderings on the Lyndon factorization of a string. Experiments with Pizza&Chili datasets show that for most alphabet reorderings, the number of Lyndon factors is usually small, and the length of the longest Lyndon factor can be as large as the input string, which is unfavorable for algorithms and indexes that depend on the number of Lyndon factors. We present results with randomized alphabet permutations that can be used as a baseline to assess the effectiveness of heuristics and methods designed to modify the Lyndon factorization of a string via alphabet reordering.

Keywords:

Lyndon factorization; alphabet reordering; algorithms

MSC:

68W32

1. Introduction

A non-empty primitive string $s$ of length $n$ over an ordered alphabet $\Sigma$ is a Lyndon word if $s$ is smaller than all its proper suffixes in alphabetic order, or alternatively, smaller than all its conjugates [1]. The Lyndon factorization [2] partitions a string $s$ uniquely into substrings (factors) $w_{1}, w_{2}, \dots, w_{k}$ such that $s = w_{1} \cdot w_{2} \dots w_{k}$, and each $w_{i}$ is a Lyndon word with $w_{1} \geq w_{2} \geq \dots \geq w_{k}$.

Lyndon factorization has important applications in combinatorics and string algorithms (see [1,3,4]). In the context of suffix sorting, the relative order of suffixes of a string inside each Lyndon factor $w_{i}$ is the same as their order in the original string [5], which allows the design of divide-and-conquer algorithms (e.g., [5,6]) for computing the suffix array [7]. Lyndon factors can also be used to compute indexes (e.g., [8]) based on the Bijective Burrows–Wheeler transform [9]. The efficiency of these algorithms, however, depends on the number and length of the Lyndon factors and may deteriorate when the longest factor is large [5]. In Section 4, we evaluate the Lyndon factorization of strings from Pizza&Chili datasets [10] and we show in Table 1 that this unfavorable situation is common in practice.

Although the Lyndon factorization of a string s is fixed and unique, it depends on the given alphabet ordering. Recent approaches (e.g., [11,12,13]) have investigated how to modify (increase or reduce) the number of Lyndon factors by reordering the symbols in the alphabet (considering a new alphabetic order for string comparisons). Very recently, Gibney and Thankachan [14] showed that the problem of finding an alphabet ordering to either minimize or maximize the number of factors is NP-complete.

In this paper, we evaluate in practice different alphabet orderings to modify the Lyndon factorization of a string.

In Section 3, we present two variants of a simple heuristic for reordering the alphabet based on the most frequent symbols of the input string. In Section 4, we compare our heuristic with results by Clare and Daykin [11] and we show that, although simpler, ours produces results close to those of Clare and Daykin’s method in most cases (even though not satisfactorily, as we discuss next). Unfortunately, we were not able to execute methods based on the evolutionary search techniques presented in [12,13] due to their high running time.

In Section 4, we conclude that no method consistently modifies the Lyndon factorization via alphabet reordering in practice. Additionally, we draw alphabet orderings uniformly at random and evaluate the resulting Lyndon factorizations. Based on these results, we provide a baseline that can be used to assess the effectiveness of heuristics and optimization methods that aim to modify the number of Lyndon factors and the length of the longest Lyndon factor via alphabet reordering.

2. Background

Let $s = s[1] s[2] \dots s[n]$ be a string of size $|s| = n$ over an alphabet $\Sigma = \{\alpha_{1}, \alpha_{2}, \dots, \alpha_{\sigma}\}$, with the ordering $\alpha_{1} < \alpha_{2} < \dots < \alpha_{\sigma}$, and $\sigma = |\Sigma|$. We denote the set of all strings of symbols in $\Sigma$ by $\Sigma^{*}$.

A substring of $s$ is defined as $s[i, j] = s[i] \dots s[j]$, with $1 \leq i \leq j \leq n$. In particular, the substring $s[1, i]$ is a prefix and $s[i, n]$ is a suffix of $s$. A prefix or suffix of $s$ is called proper if it is not equal to $s$. We say that $u \in \Sigma^{+}$ is a factor of $s$ if $u$ is equal to some non-empty substring $s[i, j]$.

A string $s \in \Sigma^{*}$ is smaller than or equal to another string $v \in \Sigma^{*}$, denoted by $s \leq v$, if either $s$ is a prefix of $v$ or $s = w \alpha z_{1}$ and $v = w \beta z_{2}$ with $w, z_{1}, z_{2} \in \Sigma^{*}$ (possibly empty) and the symbols $\alpha, \beta \in \Sigma$, with $\alpha < \beta$.

We denote by $u^{k}$ the repeated concatenation of a string $u$, $k > 1$ times. A non-empty string $u$ is a repetition if there exist a string $v$ and some integer $k > 1$ such that $u = v^{k}$; otherwise, $u$ is primitive. If a string is primitive, all its conjugates (circular rotations) are distinct.

A primitive string $s$ is a Lyndon word if it is lexicographically smaller than all its proper suffixes [2], or smaller than all its conjugates. For example, $s = abcacb$ is a Lyndon word. A Lyndon factor of $s$ is a factor $w$ that is a Lyndon word itself.
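
The two characterizations above can be checked directly on small strings. The following is a minimal Python sketch, not taken from the authors' code; the function names `is_primitive`, `conjugates` and `is_lyndon` are hypothetical, and the comparison relies on Python's built-in string order, which coincides with the standard alphabet order for lowercase ASCII letters.

```python
def is_primitive(u):
    """True if u is non-empty and not of the form v^k with k > 1."""
    # u is a repetition exactly when it occurs inside u + u at a
    # position strictly between 0 and len(u).
    return len(u) > 0 and (u + u).find(u, 1) == len(u)


def conjugates(u):
    """All circular rotations of u, starting with u itself."""
    return [u[i:] + u[:i] for i in range(len(u))]


def is_lyndon(u):
    """True if u is strictly smaller than each of its proper suffixes."""
    return len(u) > 0 and all(u < u[i:] for i in range(1, len(u)))


print(is_lyndon("abcacb"))  # True
print(is_lyndon("abab"))    # False: "abab" is a repetition of "ab"
```

For a primitive string, `is_lyndon(u)` agrees with the test `min(conjugates(u)) == u`, which is the conjugate-based definition.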

Theorem 1 (Lyndon Factorization). Any string $s \in \Sigma^{*}$ has a unique factorization $s = w_{1} w_{2} \dots w_{k}$, such that $w_{1} \geq w_{2} \geq \dots \geq w_{k}$ is a non-increasing sequence of Lyndon words.

Herein, we denote the number of Lyndon factors as $k$, and the length of the longest factor as $m = \max_{1 \leq i \leq k} (|w_{i}|)$. The Lyndon factorization can be computed in linear time by using Duval’s algorithm [15].

For example, given $s = alohomora$ over $\Sigma = \{a, h, l, m, o, r\}$, its Lyndon factorization is $alohomor \cdot a$ with $k = 2$ Lyndon factors and longest factor length $m = 8$.
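
As an illustration, the example above can be reproduced with a short Python sketch of Duval's algorithm [15]. This is a textbook-style version written for this example, not the authors' implementation; the function name `lyndon_factorization` is an assumption, and symbols are compared with Python's default character order, which matches the standard ordering of this alphabet.

```python
def lyndon_factorization(s):
    """Return the list of Lyndon factors of s (Duval's algorithm)."""
    factors = []
    n = len(s)
    i = 0
    while i < n:
        j, p = i + 1, i
        # Scan while s[i:j+1] remains a prefix of a power of a Lyndon word.
        while j < n and s[p] <= s[j]:
            p = i if s[p] < s[j] else p + 1
            j += 1
        # The Lyndon word found has length j - p; emit every full copy.
        period = j - p
        while i <= p:
            factors.append(s[i:i + period])
            i += period
    return factors


factors = lyndon_factorization("alohomora")
num_factors = len(factors)                # k in the text
longest = max(len(w) for w in factors)    # m in the text
print(factors, num_factors, longest)      # ['alohomor', 'a'] 2 8
```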

It is easy to see that, for an alphabet $\Sigma = \{a, b\}$, with $\sigma = 2$, the minimum number of Lyndon factors for a string $s$ of length $n$ is $k = 1$ with $m = n$, which is achieved by $s = a^{n-1} b$, while the maximum number of Lyndon factors is $k = n$ with $m = 1$, achieved by $s = b a^{n-1}$.

In this paper, we are interested in evaluating these values for long text datasets and ASCII alphabets.

3. Reordering the Alphabet

The problem of reordering (permuting) the symbols of a given alphabet $\Sigma = \{\alpha_{1}, \alpha_{2}, \dots, \alpha_{\sigma}\}$, with the original ordering $\alpha_{1} < \alpha_{2} < \dots < \alpha_{\sigma}$, is to create another alphabet $\Sigma_{\pi} = \{\alpha_{\pi[1]}, \alpha_{\pi[2]}, \dots, \alpha_{\pi[\sigma]}\}$ with the same $\sigma$ symbols of $\Sigma$ permuted in a different order, such that $\alpha_{\pi[1]} < \alpha_{\pi[2]} < \dots < \alpha_{\pi[\sigma]}$.

As a consequence, two strings in $\Sigma^{*}$ may have a different relative lexicographic order when considered as strings in $\Sigma_{\pi}^{*}$. For example, given the strings $u = aloho$ and $v = mora$, we have that $u < v$ when $u, v \in \Sigma^{*}$, while $u > v$ when $u, v \in \Sigma_{\pi}^{*}$, with $\Sigma_{\pi} = \{r, o, m, l, h, a\}$, where $r < o < m < l < h < a$.
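
One simple way to compare strings under a permuted alphabet is to map each symbol to its rank in the new ordering and compare the rank sequences. The Python sketch below illustrates this on the example above; the helper name `order_key` is hypothetical, and each ordering is passed as a string listing the symbols from smallest to largest.

```python
def order_key(order):
    """Build a key function that ranks symbols by their position in `order`."""
    rank = {symbol: r for r, symbol in enumerate(order)}
    # Python compares lists lexicographically, and a proper prefix is
    # smaller than the longer list, matching the definition in Section 2.
    return lambda text: [rank[c] for c in text]


standard = order_key("ahlmor")   # a < h < l < m < o < r
permuted = order_key("romlha")   # r < o < m < l < h < a

u, v = "aloho", "mora"
print(standard(u) < standard(v))  # True: u < v under the standard ordering
print(permuted(u) > permuted(v))  # True: u > v under the permuted ordering
```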

The Lyndon factorization of $s$ depends on the ordering of symbols in $\Sigma$. Therefore, the same string can have different Lyndon factorizations under different alphabet orderings.

For example, suppose that $s = alohomora$ was written over the alphabet permutation $\Sigma_{\pi} = \{r, o, m, l, h, a\}$. The Lyndon factorization of $s \in \Sigma_{\pi}^{*}$ is $a \cdot l \cdot oh \cdot om \cdot o \cdot ra$ with $k = 6$ and $m = 2$.

In contrast, another possible alphabet ordering, $\Sigma_{\pi'} = \{m, r, a, h, l, o\}$, would result in the following Lyndon factorization for $s \in \Sigma_{\pi'}^{*}$: $aloho \cdot mora$, with $k = 2$ and $m = 5$.
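
The factorizations under the three orderings discussed so far can be reproduced by running Duval's algorithm on symbol ranks instead of on the characters themselves. The following Python sketch is an illustration under that assumption, not the authors' implementation; the `order` parameter (a string listing symbols from smallest to largest) is a convention chosen for this example.

```python
def lyndon_factorization(s, order):
    """Lyndon factors of s when symbols are compared by position in `order`."""
    rank = {symbol: r for r, symbol in enumerate(order)}
    t = [rank[c] for c in s]          # compare ranks instead of characters
    factors, n, i = [], len(t), 0
    while i < n:                      # Duval's algorithm on the rank sequence
        j, p = i + 1, i
        while j < n and t[p] <= t[j]:
            p = i if t[p] < t[j] else p + 1
            j += 1
        while i <= p:                 # emit factors of length j - p
            factors.append(s[i:i + j - p])
            i += j - p
    return factors


for order in ("ahlmor", "romlha", "mrahlo"):
    print(order, lyndon_factorization("alohomora", order))
# ahlmor ['alohomor', 'a']
# romlha ['a', 'l', 'oh', 'om', 'o', 'ra']
# mrahlo ['aloho', 'mora']
```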

A natural question is whether it is possible to design an efficient method for reordering (permuting) the alphabet to modify the Lyndon factorization, increasing or decreasing $k$ and $m$ based on the input string $s$. For small alphabets, such as DNA, the number of possible orderings is small ($4! = 24$) and all of them can be tried, but for larger alphabets, such as ASCII, such a brute-force approach becomes infeasible.
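
For very small alphabets, the brute-force search can be sketched in a few lines of Python. The code below is only illustrative: the names `lyndon_stats` and `brute_force` are hypothetical, and the search enumerates all $\sigma!$ orderings, so it is practical only when $\sigma$ is tiny.

```python
from itertools import permutations


def lyndon_stats(s, order):
    """Return (k, m): number of Lyndon factors and longest factor length."""
    rank = {symbol: r for r, symbol in enumerate(order)}
    t = [rank[c] for c in s]
    n, i, count, longest = len(t), 0, 0, 0
    while i < n:                      # Duval's algorithm on symbol ranks
        j, p = i + 1, i
        while j < n and t[p] <= t[j]:
            p = i if t[p] < t[j] else p + 1
            j += 1
        while i <= p:
            count += 1
            longest = max(longest, j - p)
            i += j - p
    return count, longest


def brute_force(s):
    """Map every ordering of the symbols occurring in s to its (k, m)."""
    symbols = sorted(set(s))
    return {order: lyndon_stats(s, order) for order in permutations(symbols)}


results = brute_force("alohomora")    # 6! = 720 orderings
ks = [k for k, _ in results.values()]
print(len(results), min(ks), max(ks))
```

For `alohomora`, the minimum is $k = 2$ for every ordering-independent reason: the string starts and ends with the same symbol, so it can never be a single Lyndon word.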

This problem has been recently investigated by Clare et al. in [11,12,13] with greedy and evolutionary algorithms (on short sequences in the range of thousands of symbols).

In the next section, we present a straightforward heuristic for reordering the alphabet based on the frequencies of symbols in $s$, which produces results similar to those obtained by Clare and Daykin’s method [11].

3.1. Most/Least Frequent Symbol

Let us consider the Parikh vector, $p(s)$, for the string $s$, where $p(s)$ gives the number of occurrences of each $\alpha_{i} \in \Sigma$ in $s$. The most frequent symbol (MFS) method assigns to $\pi[i]$ the $i$-th most frequent symbol in $p(s)$, whereas the least frequent symbol (LFS) method assigns to $\pi[i]$ the $i$-th least frequent symbol in $p(s)$.

A new alphabet ordering $\Sigma_{\pi} = \{\alpha_{\pi[1]}, \alpha_{\pi[2]}, \dots, \alpha_{\pi[\sigma]}\}$ is created accordingly, such that $\alpha_{\pi[1]} < \alpha_{\pi[2]} < \dots < \alpha_{\pi[\sigma]}$.

When different symbols have the same value in $p(s)$, we order them by their original ranks in $\Sigma$.

For example, given $s = alohomora$ over $\Sigma = \{a, h, l, m, o, r\}$, we have $p(s) = [2, 1, 1, 1, 3, 1]$, which gives us the following alphabet reorderings: $\Sigma_{MFS} = \{o, a, h, l, m, r\}$ and $\Sigma_{LFS} = \{h, l, m, r, a, o\}$.

The Lyndon factorization of $s \in \Sigma_{MFS}^{*}$ is $al \cdot ohomora$ with $k = 2$ and $m = 7$, whereas the factorization of $s \in \Sigma_{LFS}^{*}$ is $a \cdot lo \cdot homora$ with $k = 3$ and $m = 6$.
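
The MFS and LFS orderings of this example can be derived with a short Python sketch. It is written from the description above rather than from the authors' source code; the function name `mfs_lfs_orders` is hypothetical, and the alphabet is passed as a string in its original order so that ties can be broken by original rank.

```python
from collections import Counter


def mfs_lfs_orders(s, alphabet):
    """Return (MFS ordering, LFS ordering) of `alphabet` for the string s."""
    counts = Counter(s)   # Parikh vector, stored as a symbol -> count map
    original_rank = {symbol: r for r, symbol in enumerate(alphabet)}
    # Symbols with equal counts keep their original relative order.
    mfs = sorted(alphabet, key=lambda c: (-counts[c], original_rank[c]))
    lfs = sorted(alphabet, key=lambda c: (counts[c], original_rank[c]))
    return "".join(mfs), "".join(lfs)


s, alphabet = "alohomora", "ahlmor"
print([Counter(s)[c] for c in alphabet])  # [2, 1, 1, 1, 3, 1]
print(mfs_lfs_orders(s, alphabet))        # ('oahlmr', 'hlmrao')
```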

4. Experiments

We evaluate the Lyndon factorization of strings from Pizza&Chili datasets (https://pizzachili.dcc.uchile.cl/, accessed on 1 November 2022). The source code of the MFS and LFS methods (Section 3.1) is freely available on GitHub (https://github.com/felipelouza/remap/, accessed on 1 November 2022).

Table 1 shows the alphabet size (Column 2), string length in MB (Column 3), number of Lyndon factors (Column 4) and the longest Lyndon factor length in percentage to the string length (Column 5) of each dataset. We considered the standard alphabet ordering in this first experiment.

Table 1. Experiments with Pizza&Chili datasets considering the standard alphabet ordering. The datasets einstein-de, kernel, fib41 and cere are highly repetitive texts. The dataset english.1GB is the first 1 GB of the original english dataset. Column 2 shows the alphabet size.

Dataset	σ	Size (MB)	Number of Factors	Longest Factor (% of string length)
`sources`	230	201	30	52.00%
`dblp`	97	282	17	37.93%
`dna`	16	385	18	74.75%
`english.1GB`	239	1047	30	57.28%
`proteins`	27	1129	24	80.71%
`einstein-de`	117	88	44	40.39%
`kernel`	160	246	33	41.73%
`fib41`	2	256	21	61.68%
`cere`	5	440	22	79.98%

Notice that the number of Lyndon factors is small even for larger strings (english.1GB and proteins) and the longest Lyndon factor can be as large as the input string. This situation is particularly unfavorable for divide-and-conquer algorithms that take advantage of the input string partitioned into Lyndon factors to compute fundamental data structures for string processing, such as the suffix array (e.g., [5,6]).

4.1. Alphabet Reordering

We compared the methods MFS and LFS (Section 3.1) with the greedy algorithm (with and without backtracking) proposed in [11]. Unfortunately, we were not able to run the evolutionary methods proposed in [12,13] due to their high time complexity (the authors presented experiments only with very small inputs). We also included results obtained from 100 samples drawn from a uniform distribution of permutations. These random results are presented as box plots in the figures.

Figure 1 shows the number of Lyndon factors produced by each method compared to the number of factors presented in Table 1 (standard alphabet ordering). Results for dataset cere were omitted: MFS and random generated more than 7000 factors, while LFS, greedy and greedy with backtracking generated about 2000 factors. Apart from this case, the resulting number of factors is still small for all alphabet orderings, at most twice the number created by the original alphabet order.

Figure 2 shows the lengths of the longest Lyndon factors as a percentage of the total length of the input string for each method. Notice that the length of the longest factor can be as large as the length of the input string and that, in most cases, the longest factor is not smaller than $33\%$ of the input string. Additionally, results of the straightforward methods MFS and LFS were close to the greedy algorithms proposed by Clare and Daykin [11].

We remark that no method consistently increased (or decreased) either the number of Lyndon factors or their maximum lengths.

4.2. Randomized Alphabet

Table 2 and Table 3 summarize the results of the 100 samples of randomized alphabet permutations (method random). Based on these results, we present a baseline that can be used to assess heuristics and optimization methods for increasing or decreasing the number of Lyndon factors via alphabet reordering. The histograms of these results can be found in Figure 3 and Figure 4.

We state that any maximization method can be considered effective if it consistently selects an alphabet permutation $\Sigma_{\pi}$ that results in a number of Lyndon factors (or a longest factor length) larger than Q3 (the third quartile), that is, within the top 25% of the random permutations. Similarly, a minimization method can be considered effective if the selected alphabet permutation yields a number of Lyndon factors (or a longest factor length) smaller than Q1 (the first quartile), that is, within the bottom 25% of the random permutations.
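
A baseline of this kind could be computed along the following lines. This Python sketch is an assumption-laden illustration, not the procedure used to produce Tables 2 and 3: the function names, the fixed seed, and the use of `statistics.quantiles` with its default quartile convention are all choices made for this example, and other quartile conventions may give slightly different cut points.

```python
import random
import statistics


def num_lyndon_factors(s, order):
    """Number of Lyndon factors of s under the symbol ordering `order`."""
    rank = {symbol: r for r, symbol in enumerate(order)}
    t = [rank[c] for c in s]
    n, i, count = len(t), 0, 0
    while i < n:                      # Duval's algorithm on symbol ranks
        j, p = i + 1, i
        while j < n and t[p] <= t[j]:
            p = i if t[p] < t[j] else p + 1
            j += 1
        while i <= p:
            count += 1
            i += j - p
    return count


def random_baseline(s, samples=100, seed=0):
    """Quartiles (Q1, median, Q3) of k over uniformly random orderings."""
    rng = random.Random(seed)         # fixed seed only for reproducibility
    symbols = sorted(set(s))
    counts = []
    for _ in range(samples):
        order = symbols[:]
        rng.shuffle(order)            # uniform random permutation
        counts.append(num_lyndon_factors(s, order))
    q1, median, q3 = statistics.quantiles(counts, n=4)
    return q1, median, q3


def is_effective(k, q1, q3, goal):
    """Apply the criterion from the text; goal is 'max' or 'min'."""
    return k > q3 if goal == "max" else k < q1


q1, median, q3 = random_baseline("alohomora")
k_candidate = num_lyndon_factors("alohomora", "romlha")
print(q1, median, q3, is_effective(k_candidate, q1, q3, "max"))
```

The same structure applies to the longest factor length by replacing the counting function.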

5. Conclusions

In this paper, we evaluated the Lyndon factorization of strings from Pizza&Chili [10] for different alphabet reorderings. We showed that in practice, the number of Lyndon factors is usually small, and the length of the longest factor can be as large as the input string. This suggests that, without alphabet reordering, the Lyndon factorization may not be an effective way of breaking down strings into manageable pieces, whereas an effective reordering could potentially lead to more efficient ways of working with large strings. We also evaluated randomized alphabet permutations, which can be used as a baseline to assess the effectiveness of heuristics and methods designed to modify the Lyndon factorization via alphabet reordering.

Author Contributions

Conceptualization, M.K.A. and F.A.L.; Software, F.A.L.; Validation, M.K.A.; Formal analysis, M.K.A. and F.A.L.; Writing—original draft preparation, M.K.A. and F.A.L.; Writing—review and editing, M.K.A. and F.A.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by CNPq (grant number 406418/2021-7) and FAPEMIG (grant number APQ-01217-22).

Data Availability Statement

The source code is freely available at https://github.com/felipelouza/remap/ (accessed on 1 November 2022).

Acknowledgments

The authors thank the anonymous reviewers and Marinella Sciortino and Giovanni Manzini for comments that improved the presentation of this work.

Conflicts of Interest

The authors declare no conflict of interest.

References

[1] Smyth, W. Computing Patterns in Strings; Pearson Education: Harlow, UK, 2003.
[2] Chen, K.T.; Fox, R.H.; Lyndon, R.C. Free differential calculus. IV—The quotient groups of the lower central series. Ann. Math. 1958, 68, 81–95.
[3] Lothaire, M. Combinatorics on Words, 2nd ed.; Cambridge Mathematical Library, Cambridge University Press: Cambridge, UK, 1997.
[4] Bona, M. Handbook of Enumerative Combinatorics; Discrete Mathematics and Its Applications, CRC Press: Hoboken, NJ, USA, 2015.
[5] Mantaci, S.; Restivo, A.; Rosone, G.; Sciortino, M. Suffix array and Lyndon factorization of a text. J. Discret. Algorithms 2014, 28, 2–8.
[6] Sunita; Garg, D. Extended suffix array construction using Lyndon factors. Sadhana—Acad. Proc. Eng. Sci. 2018, 43, 133.
[7] Manber, U.; Myers, G. Suffix arrays: A new method for on-line string searches. SIAM J. Comput. 1993, 22, 935–948.
[8] Bannai, H.; Kärkkäinen, J.; Köppl, D.; Piatkowski, M. Indexing the Bijective BWT. In Proceedings of the CPM; Schloss Dagstuhl—Leibniz-Zentrum für Informatik: Saarbrücken, Germany; Wadern, Germany, 2019; Volume 128, pp. 17:1–17:14.
[9] Kufleitner, M. On Bijective Variants of the Burrows-Wheeler Transform. In Proceedings of the PSC; Holub, J., Zdárek, J., Eds.; Prague Stringology Club: Prague, Czech Republic, 2009; pp. 65–79.
[10] Ferragina, P.; Navarro, G. Pizza&Chili. Available online: https://pizzachili.dcc.uchile.cl/ (accessed on 1 November 2022).
[11] Clare, A.; Daykin, J.W. Enhanced string factoring from alphabet orderings. Inf. Process. Lett. 2019, 143, 4–7.
[12] Clare, A.; Mills, T.; Daykin, J.W.; Zarges, C. Evolutionary Search Techniques for the Lyndon Factorization of Biosequences; Association for Computing Machinery: New York, NY, USA, 2019; pp. 1543–1550.
[13] Major, L.; Clare, A.; Daykin, J.W.; Mora, B.; Gamboa, L.J.P.; Zarges, C. Evaluation of a Permutation-Based Evolutionary Framework for Lyndon Factorizations. In Proceedings of the PPSN; Springer: Cham, Switzerland, 2020; Volume 12269, pp. 390–403.
[14] Gibney, D.; Thankachan, S.V. Finding an Optimal Alphabet Ordering for Lyndon Factorization is Hard. In Proceedings of the STACS; Schloss Dagstuhl—Leibniz-Zentrum für Informatik: Saarbrücken, Germany; Wadern, Germany, 2021; Volume 187, pp. 35:1–35:15.
[15] Duval, J.P. Factorizing words over an ordered alphabet. J. Algorithms 1983, 4, 363–381.

Figure 1. Number of factors for each method. Results for dataset cere were omitted because of its much larger scale.

Figure 2. Longest Lyndon factor length for each method. Grey points are outliers in the random distributions.

Figure 3. Number of factors for random permutations of the alphabet. Shapiro–Wilk normality tests, with $p = 0.05$, rejected normality for all distributions except that of the kernel dataset.

Figure 4. Longest length of factors for random permutations of the alphabet. Shapiro–Wilk normality tests, with $p = 0.05$, rejected normality for all distributions.

Table 2. Number of Lyndon factors for random permutations of the alphabet. Column 2 shows the number of factors for the original alphabet. Columns 3 and 7 show the minimum and maximum number of factors obtained with 100 random permutations of the alphabet. Columns 4, 5 and 6 show the first quartile (Q1), the median and the third quartile (Q3) of the distribution.

Dataset	Original	Min	Q1	Median	Q3	Max
`sources`	30	15	21	26	31	54
`dblp`	17	13	17	21	25	43
`dna`	18	17	20	23	24	35
`english.1GB`	30	11	7	13	19	28
`proteins`	24	13	22	25	29	43
`einstein-de`	44	21	43	57	71	123
`kernel`	33	10	18	22	28	39
`fib41`	21	22	22	22	41	41
`cere`	22	16	2192	7718	7724	7728

Table 3. Longest Lyndon factors (as a percentage of the total length of the string) for random permutations of the alphabet. Column 2 shows the longest factor for the original alphabet. Columns 3 and 7 show the minimum and maximum longest factors obtained with 100 random permutations of the alphabet. Columns 4, 5 and 6 show the first quartile (Q1), the median and the third quartile (Q3) of the distribution.

Dataset	Original	Min	Q1	Median	Q3	Max
`sources`	52.00%	27.72%	49.19%	60.20%	72.50%	93.15%
`dblp`	37.93%	20.98%	42.96%	60.50%	71.91%	95.91%
`dna`	74.75%	31.11%	60.68%	60.82%	87.91%	95.11%
`english.1GB`	57.28%	34.66%	62.43%	79.23%	89.45%	97.43%
`proteins`	80.71%	31.22%	46.65%	59.46%	76.16%	99.46%
`einstein-de`	40.39%	18.37%	50.11%	58.25%	79.93%	99.93%
`kernel`	41.73%	9.74%	79.51%	86.36%	90.74%	96.09%
`fib41`	61.68%	38.12%	38.12%	38.12%	61.68%	61.68%
`cere`	79.98%	41.89%	41.89%	79.98%	83.18%	90.74%

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
