1. Introduction
Let $\mathbb{A}$ be a fixed alphabet, and let U be a word over $\mathbb{A}$. A subword of U is any word obtained by deleting some (possibly zero) letters of U. For example, the word mats is a subword of the word mathematics. A factor of U is a subword consisting of consecutive letters of U. The fact that F is a factor of U can be written as U = PFS, where P and S are (possibly empty) words, called a prefix and a suffix of U, respectively. For instance, thema is a factor of mathematics. A square is a word of the form UU for some nonempty word U, and a shuffle square is a word that can be split into two identical disjoint subwords. For example, the word hotshots is a square, while tuteurer is a shuffle square, but not a square. Clearly, every letter in a shuffle square must occur an even number of times. We will call any word with this property a tangram.
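The definition of a shuffle square can be tested mechanically. The following is a minimal sketch (the function name `is_shuffle_square` is ours, not taken from [1]); it scans the word once and keeps track of the letters that one copy has already read and the other copy still has to match.

```python
def is_shuffle_square(word: str) -> bool:
    """Return True if `word` splits into two identical disjoint subwords."""
    if not word or len(word) % 2:
        return False
    # A state is the "backlog": letters already taken by the leading copy
    # that the trailing copy still has to match, in this order.
    states = {""}
    for i, letter in enumerate(word):
        remaining = len(word) - i - 1
        next_states = set()
        for backlog in states:
            # Option 1: the leading copy takes the letter
            # (only if the backlog can still be cleared later).
            if len(backlog) + 1 <= remaining:
                next_states.add(backlog + letter)
            # Option 2: the trailing copy matches the oldest pending letter.
            if backlog and backlog[0] == letter:
                next_states.add(backlog[1:])
        states = next_states
    return "" in states


print(is_shuffle_square("hotshots"), is_shuffle_square("tuteurer"))
```

The number of states may grow exponentially with the length of the word, so this sketch is meant for short words only.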
Shuffle squares were introduced by Henshall, Rampersad, and Shallit in [1]. The main question posed there concerned the enumeration of shuffle squares of a fixed length over a given alphabet. Even in the simplest, binary case, no satisfactory answer is known. It was only recently proven by He, Huang, Nam, and Thaper [2] that the number of binary shuffle squares of length $2n$ is at least $\binom{2n}{n}$, for $n \geq 3$. An intriguing conjecture presented there states that almost every binary tangram is a shuffle square (in the sense that the probability of picking a shuffle square uniformly at random from all tangrams of length $2n$ tends to 1 as $n$ tends to infinity).
Many challenging problems concern the avoidability of squares and their various relatives. A word U is square-free if none of its factors is a square. For instance, the word combinatorics is square-free, while the word repetitive is not. By saying that squares are avoidable, we mean that there exist arbitrarily long square-free words over some finite alphabet. In 1906, Thue [3] proved that squares are avoidable by constructing an infinite family of ternary square-free words (three letters is the best possible, since every binary word of length four contains a square). This result is considered the starting point of combinatorics on words, an important area with many connections to other branches of mathematics and computer science.
Similarly, one may investigate the avoidability of shuffle squares. Using the probabilistic method, Currie [4] proved that shuffle squares are avoidable over alphabets with more than $e^{95}$ letters (see [5]). Later, the alphabet size was lowered to 10 by Müller [5], and then to 7 by Guégan and Ochem [6] and, independently, by Grytczuk, Kozik, and Zaleski [7]. Recently, Bulteau, Jugé, and Vialette [8] proved that shuffle squares are avoidable over an alphabet of size six, which is currently the best known bound.
In the present paper, we consider the avoidability of more general shuffle squares, introduced in [9]. An anagram of a word U is any word V obtained by rearranging the letters of U. For instance, the words braze and zebra are anagrams of each other. More formally, let $\sigma$ be a permutation of the set $\{1, 2, \ldots, n\}$ denoted as a sequence $\sigma = \sigma_1\sigma_2\cdots\sigma_n$. If $U = u_1u_2\cdots u_n$ is a word of length n, then $\sigma(U) = u_{\sigma_1}u_{\sigma_2}\cdots u_{\sigma_n}$ is a σ-anagram of U (the word obtained by rearranging the letters in U according to the permutation $\sigma$). For instance, if U = sword and $\sigma = 23451$, then words is a σ-anagram of U. We also say that two words, U and V, are σ-similar if $V = \sigma(U)$ (or $U = \sigma(V)$). For example, the words U = braze and V = zebra are σ-similar with $\sigma = 45123$. A σ-square is just a word of the form $UV$, where U and V are σ-similar. A word W is a shuffle σ-square if it can be split into two subwords that are σ-similar.
Of course, every tangram is a shuffle σ-square for some permutation σ. Moreover, it is not hard to demonstrate that tangrams are not avoidable. Actually, any k-ary word of length $2^k$ contains a tangram as a factor (by the pigeonhole principle applied to the parity vectors of its $2^k + 1$ prefixes). This leads to a natural question: For which families of permutations are the corresponding shuffle σ-squares avoidable? Our main result gives a partial answer in the form of a sufficient condition on the size of the allowed permutation families.
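The unavoidability of tangrams can be illustrated with a short prefix-parity search. This is a sketch under our own naming (`find_tangram_factor` and `all_words` are hypothetical helpers): two prefixes with the same parity vector of letter counts differ by a factor in which every letter occurs an even number of times.

```python
from itertools import product


def find_tangram_factor(word, alphabet):
    """Return a nonempty tangram factor of `word`, or None if there is none."""
    index = {a: i for i, a in enumerate(alphabet)}
    parity = 0          # bit i is the parity of the count of alphabet[i]
    seen = {0: 0}       # parity mask -> length of the first prefix having it
    for pos, letter in enumerate(word, start=1):
        parity ^= 1 << index[letter]
        if parity in seen:
            # Prefixes of lengths seen[parity] and pos have equal parities.
            return word[seen[parity]:pos]
        seen[parity] = pos
    return None


def all_words(alphabet, length):
    """All words of the given length over the alphabet (exhaustive helper)."""
    return ("".join(t) for t in product(alphabet, repeat=length))


# Exhaustive check for k = 3: every ternary word of length 2**3 has a tangram factor.
print(all(find_tangram_factor(w, "abc") for w in all_words("abc", 8)))
```

For k = 2, the word aba of length 3 has no tangram factor, so the length $2^k$ cannot be lowered in that case.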
Theorem 1. Let $C \geq 1$ be a constant, and, for every $n$, let $S_n$ be a set of permutations of length n satisfying $|S_n| \leq C^n$. Then, there exist an integer $k = k(C)$ and arbitrarily long k-ary words avoiding all shuffle σ-squares, with $\sigma \in \bigcup_{n} S_n$.
A cyclic shuffle square is a shuffle σ-square for a cyclic permutation σ. The theorem above implies that cyclic shuffle squares are avoidable, though the resulting upper bound on the alphabet size is most likely far from optimal. A similar conclusion holds for sets of permutations consisting of all permutations avoiding a fixed pattern. This follows from the famous Stanley–Wilf Conjecture proven by Marcus and Tardos [10] (see Theorem 3).
In [9], we proved that every binary tangram is a cyclic shuffle square. We also made the following stronger conjecture, which says that every binary tangram can be shifted cyclically to a shuffle square.
Conjecture 1. Let T be an arbitrary binary tangram. Then, there exists a factorization $T = XY$ such that the word $YX$ is a shuffle square.
The conjecture has been verified for all binary tangrams of length at most 20. The statement is not true for larger alphabets, as, for instance, none of the cyclic shifts of the word 011022 is a shuffle square. Perhaps the following weaker, but more general, conjecture is true.
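Both claims can be checked by brute force on short words. The sketch below uses our own helper names (`is_shuffle_square`, `shuffle_square_shifts`, `is_tangram`) and simply tests every cyclic shift; it is an illustration, not the program used for the verification up to length 20.

```python
def is_shuffle_square(word):
    """Return True if `word` splits into two identical disjoint subwords."""
    if not word or len(word) % 2:
        return False
    states = {""}  # letters read by the leading copy, not yet matched by the other
    for i, letter in enumerate(word):
        remaining = len(word) - i - 1
        next_states = set()
        for backlog in states:
            if len(backlog) + 1 <= remaining:
                next_states.add(backlog + letter)
            if backlog and backlog[0] == letter:
                next_states.add(backlog[1:])
        states = next_states
    return "" in states


def shuffle_square_shifts(word):
    """All cyclic shifts YX of word = XY that are shuffle squares."""
    shifts = {word[i:] + word[:i] for i in range(len(word))}
    return sorted(s for s in shifts if is_shuffle_square(s))


def is_tangram(word):
    """Every letter occurs an even number of times."""
    return all(word.count(a) % 2 == 0 for a in set(word))


print(shuffle_square_shifts("011022"))  # no shift of this ternary tangram works
print(shuffle_square_shifts("0110"))
```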
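Conjecture 2. For every integer $k \geq 2$, there exists an integer $m = m(k)$ such that every k-ary tangram T can be factorized as $T = X_1X_2\cdots X_m$, so that the word $X_{\sigma(1)}X_{\sigma(2)}\cdots X_{\sigma(m)}$ is a shuffle square, for some permutation $\sigma$ of $\{1, 2, \ldots, m\}$.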
The above conjecture can be stated using a more general notion of the cutting distance, which could be of independent interest. A formal definition together with some results and computer experiments are contained in Section 3.
Section 2 contains the proof of our main result stated in a more abstract setting based on the idea of a constraint graph. The last section contains a short discussion of possible directions for future research.
2. Avoiding Abstract Squares
In this section, we present a proof of Theorem 1, which is derived as an immediate consequence of a more-general result. We start by introducing a general setting of abstract shuffle squares.
Consider any undirected graph G (loops are allowed) on the set of all finite words. A word U is called a G-square if $U = XY$, where the factors X and Y have the same length and are joined by an edge in G. In the classical definition of a square, where $X = Y$, the graph G consists of loops only (at every vertex) and has no other edges. So, one may think of a G-square as an abstract repetitive structure in which there may be no visible similarity between the first part and its “repetition”. Similarly, a word U is a shuffle G-square if U can be split into two disjoint subwords of the same length, which form an edge of G.
Theorem 2. For every real number , there exists an integer such that, for any graph G on the set of k-ary words, in which every word of length n has at most neighbors, there exist arbitrarily long k-ary words avoiding shuffle G-squares.
To see that the above theorem implies Theorem 1, it suffices to define a graph G by joining each word U to its -anagram , for every permutation . Clearly, every word of length n will have at most neighbors in G (by the symmetry of the relation of the -similarity of the words).
We apply the following version of a powerful tool from the probabilistic method, the Lovász Local Lemma (see [11]).
Lemma 1 (The Local Lemma; Multiple Versions [11]).
Let be events in any probability space with dependency graph . Let be a partition such that every event in has the same probability . Suppose that the maximum number of vertices from adjacent to a vertex from is at most . If there exist real numbers such that , then .
The events in the above lemma are typically considered as “bad” events that we want to avoid. In our case, we pick a random word W and a bad event will mean that a shuffle G-square occurs as a factor of W. Then, the lemma guarantees that the probability that none of the bad events occur is positive, provided that all the assumptions are satisfied. As a consequence, there must exist a word avoiding shuffle G-squares.
Proof of Theorem 2. Let be a fixed real number, and let k be a sufficiently large natural number to be specified later. Suppose that there is a fixed graph G on the set of k-ary words satisfying the assumption of the theorem.
Consider a random word of arbitrary, but fixed length N. For a fixed interval of length , let be the event that the factor is a shuffle G-square. Let denote the family of all such events. Clearly, the probability of events in depends only on r, so we may denote it as and estimate it from above as follows.
First, notice that the two parts of the shuffle
G-square determine a partition of the segment
R into two parts, each of size
r. The number of such partitions is equal to
One part of the partition may be occupied by any word of length
r, while the other part contains a neighboring word in the graph
G. Hence,
To estimate , it is enough to notice that two events and are independent whenever the segments R and S are disjoint. Hence, .
Now, put
, and notice that
. So, we may write the following inequalities:
We need two well-known formulas:
Both series are convergent for
; hence, by putting
, we obtain
and
. Thus,
Putting the last equality to (2.3), we obtain
Hence, by Lemma 1, we obtain the desired conclusion, provided that
which can be transformed to
So, the assertion of the theorem holds whenever . □
In the proof above, we did not attempt to optimize the multiplicative constant in the lower bound for the alphabet size k. Most probably, one may obtain a better estimate by a more careful manipulation of the constants or by using some other method, like entropy compression or Rosenfeld counting (see [12, 13]). However, the resulting bound on the size of the alphabet will still be linear in
.
Let us now formulate the mentioned consequence concerning shuffle squares based on permutations avoiding a fixed pattern. Recall that, for a given permutation $\pi$, a permutation $\sigma$ is said to avoid $\pi$ as a pattern if no subsequence of $\sigma$ is order-isomorphic to $\pi$. Stanley and Wilf (see [10]) independently formulated a conjecture stating that the number of permutations of length n avoiding $\pi$ grows at most exponentially. The conjecture was proven by Marcus and Tardos in [10]. Let $S_n(\pi)$ denote the set of permutations of length n avoiding $\pi$
. Let
.
Theorem 3 (Marcus and Tardos [10]). For every fixed permutation π, there exists a constant $c = c(\pi)$ such that, for every n, the number of permutations of length n avoiding π is at most $c^n$.
This theorem together with Theorem 1 immediately implies the following.
Corollary 1. Let π be any fixed permutation. Then, there exist an integer k and an infinite family of k-ary words avoiding shuffle σ-squares for all permutations σ that avoid π as a pattern.
We conclude this section with a demonstration of the Rosenfeld counting argument in the case of ordinary G-squares.
Theorem 4. Let be a real number, and let be an integer. Let G be any graph on the set of k-ary words, in which every word of length n has at most neighbors. Then, there exist arbitrarily long k-ary G-square-free words. Moreover, the number of such words of length n is at least .
Proof. Denote by
the set of all
k-ary words not containing any
G-square as a factor. Let
. We prove by induction that
for all
. It is not hard to confirm that the inequality holds for
. So, assuming that it holds for all
, we prove that it also holds for
.
Let
denote the set of all
k-ary words of length
that are
not G-square-free, but whose prefix of length
N is
G-square-free. In other words, if
and
, then the word
is
G-square-free, but there is at least one suffix of
U that is a
G-square. Clearly, we have
Let
, with
, denote the family of all words in
having a
G-square at the last
positions. Every word
has the form
, where
is a
G-square of length
, with
X joined to
Y in the graph
G. Clearly, the word
is a
G-square-free word of length
, which can be extended to a word in
by appending at most
words
Y of length
t (by the assumption on the graph
G). It follows that
for all
. On the other hand, by the induction assumption, we have
for each
. Hence, taking
, we obtain
Since
, we obtain
where the last inequality follows from the assumption
. This completes the proof. □
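The counting step of this argument can be written out as follows. This is only a sketch under assumed notation: we write $F_N$ for the set of G-square-free k-ary words of length N, $B$ for the set of words of length $N+1$ that are not G-square-free but whose prefix of length N is, and $B_t$ for those words of $B$ ending with a G-square of length $2t$. We further assume that every word of length n has at most $C^n$ neighbors, that $k \geq 4C$, and we take $\beta = 2C$ in the induction hypothesis $|F_{M+1}| \geq \beta|F_M|$. These symbols and constants are our choice for illustration and may differ from those in the statement of Theorem 4.

```latex
% Sketch with assumed constants: degree bound C^n, k >= 4C, beta = 2C.
\begin{align*}
|F_{N+1}| &= k\,|F_N| - |B|, \qquad |B| \le \sum_{t \ge 1} |B_t|,\\
|B_t| &\le C^{t}\,|F_{N+1-t}| \le C^{t}\beta^{\,1-t}\,|F_N|
   \quad \text{(since } |F_N| \ge \beta^{\,t-1}|F_{N+1-t}| \text{ by induction)},\\
|B| &\le |F_N| \sum_{t \ge 1} C^{t}(2C)^{1-t}
   = C\,|F_N| \sum_{t \ge 1} 2^{\,1-t} = 2C\,|F_N|,\\
|F_{N+1}| &\ge (k - 2C)\,|F_N| \ge 2C\,|F_N| = \beta\,|F_N|.
\end{align*}
```

Under these assumptions, the induction yields $|F_n| \geq (2C)^n$ for every n, so the number of G-square-free words grows exponentially.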
To illustrate the above theorem, consider the following scenario. Let $\mathbb{A}$ = {1, 2, 3, 4, 5, 6, 7, 8} be an alphabet with eight letters. For every $n \geq 1$, let $\sigma_n$ be an arbitrary involution of the alphabet $\mathbb{A}$, i.e., a permutation that is a product of disjoint transpositions. We may assume that $\sigma_n$ does not have fixed points, so it can be described as a perfect matching, i.e., a set of disjoint pairs of letters covering the whole alphabet. For instance, $\sigma_1$ = {12, 34, 56, 78}, $\sigma_2$ = {15, 26, 37, 48}, and $\sigma_3$ = {18, 27, 36, 45} are three such involutions. It should be stressed that, for every n, the involution $\sigma_n$ may be chosen completely freely.
Two words of length n are related if one of them can be obtained from the other by choosing a subset of positions and applying the involution $\sigma_n$ to the letters occupying these positions. For instance, using the above-mentioned three involutions $\sigma_1$, $\sigma_2$, and $\sigma_3$, we can see that the word 1 is related to two words, 1 and 2, the word 12 is related to four words, 12, 52, 16, 56, while the word 123 has the following eight siblings: 123, 823, 173, 126, 873, 826, 176, 876.
A word of the form $UW$ is an involutive square if U and W are related under the involution $\sigma_n$, where n is the common length of U and W. Now, by Theorem 4, we obtain that there exist arbitrarily long words over $\mathbb{A}$ not containing any involutive squares. Indeed, for every word U of length n, there are exactly $2^n$ related words, so the neighbor bound required in Theorem 4 holds with an exponential bound of base 2.
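The relation between words can be enumerated directly. The following sketch hard-codes the three example involutions for lengths 1, 2, and 3 (the dictionary `INVOLUTIONS` and the function names are ours, chosen for illustration only).

```python
from itertools import product

# Example involutions from the text, written as matchings; key = word length n.
INVOLUTIONS = {1: "12 34 56 78", 2: "15 26 37 48", 3: "18 27 36 45"}


def involution_map(matching):
    """Turn a matching such as '12 34 56 78' into a letter -> letter map."""
    mapping = {}
    for a, b in matching.split():
        mapping[a] = b
        mapping[b] = a
    return mapping


def related_words(word):
    """All words obtained by applying sigma_n on any subset of positions."""
    sigma = involution_map(INVOLUTIONS[len(word)])
    choices = [(letter, sigma[letter]) for letter in word]
    return {"".join(p) for p in product(*choices)}


def is_involutive_square(word):
    """Check whether word = UW with |U| = |W| and W related to U."""
    half = len(word) // 2
    return len(word) % 2 == 0 and word[half:] in related_words(word[:half])


print(sorted(related_words("12")))
print(sorted(related_words("123")))
```

Since the involutions have no fixed points, each of the $2^n$ subsets of positions gives a different word, which is why a word of length n has exactly $2^n$ related words.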
3. The Cutting Distance between Words
In this section, we focus on shuffle squares in the context of cutting words. Let U and W be two distinct words that are anagrams of each other, i.e., every letter occurs the same number of times in both words. Then, it is possible to cut the word U into pieces that can be rearranged to give the word W. For instance, the word below can be cut into three pieces, b|el|ow, from which we may compose the word elbow, (el|b|ow). It is not hard to check that the number two (of cuts needed to transform below to elbow) is actually minimal. We may thus say that the cutting distance between these two words is equal to 2.
More formally, for any pair of distinct anagrams U and W, there must exist a factorization $U = U_0U_1\cdots U_q$, $q \geq 1$, and a permutation $\sigma$ of $\{0, 1, \ldots, q\}$ such that $W = U_{\sigma(0)}U_{\sigma(1)}\cdots U_{\sigma(q)}$. We call the minimum number q in such a factorization the cutting distance between the words U and W and denote it as $d(U, W)$. We may extend this definition to pairs of words for which such a cutting is not possible. We then adopt the convention that $d(U, W) = \infty$. Notice also that $d(U, W) = d(W, U)$ and $d(U, W) = 0$ if and only if $U = W$. Slightly less obvious is the triangle inequality, but it is also not hard to verify, as demonstrated below. So, the cutting distance satisfies all three requirements from the definition of a metric, making the set of all finite words a metric space.
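For short words, the cutting distance can be computed by exhaustive search. The sketch below (the name `cutting_distance` is ours) tries all ways of placing q cuts for increasing q and all rearrangements of the resulting pieces.

```python
from itertools import combinations, permutations


def cutting_distance(u, w):
    """Minimum number of cuts of `u` whose pieces can be rearranged into `w`."""
    if sorted(u) != sorted(w):
        return float("inf")      # not anagrams: no cutting is possible
    if u == w:
        return 0
    for cuts in range(1, len(u)):
        for places in combinations(range(1, len(u)), cuts):
            bounds = (0, *places, len(u))
            pieces = [u[a:b] for a, b in zip(bounds, bounds[1:])]
            if any("".join(p) == w for p in permutations(pieces)):
                return cuts
    return len(u) - 1            # cutting into single letters always works


print(cutting_distance("below", "elbow"))
```

The search is exponential in the number of cuts, so it is only practical for words of modest length.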
Proposition 1. The cutting distance is a metric on the set of all finite words.
Proof. As noticed above, it remains to demonstrate that every triple of words
satisfies the triangle inequality:
The case when one of the two distances on the right-hand side is infinite is clear. So, assume without loss of generality that , , and . Then, the word X has two factorizations, and , whose parts and can be rearranged to give U and W, respectively. Assuming the worst-case scenario (there is no pair of factors starting or ending at the same position), we obtain a more-fragmented factorization such that each and each is a product of some number of consecutive factors . Therefore, one may produce the word U, as well as the word W out of these factors by appropriate substitutions. This proves that , which completes the proof. □
Given a set of words and a word U, one may define the cutting distance between these objects as . For instance, if denotes the family of shuffle squares, then is the least number of cuts needed to turn the word U into a shuffle square.
Let denote the set of all shuffle squares over a k-letter alphabet. Using this terminology, we may now restate Conjecture 2 as follows.
Conjecture 3. Every k-ary tangram T satisfies , for some finite constant depending only on k.
Let us denote by is a k-ary tangram of length n} the maximum of all possible cutting distances between shuffle squares and k-ary tangrams of fixed length n. The above conjecture states that is finite for every . Based on computer experiments and known results, we dare to state the following (risky) conjecture.
Conjecture 4. We have and for all .
Below, we collect some (weak) evidence in favor of the above statement. Firstly, we checked using a computer that there is no counterexample to the equality among binary tangrams of length up to 20.
Proposition 2. Every binary tangram T of length at most 20 satisfies , i.e., , for all .
Let us stress, however, that we do not even know if is finite. But, even if is not, then the speed of growth of the function would be interesting to study.
One theoretical fact supporting our belief is the famous necklace-splitting theorem of Goldberg and West [14] (see also [15, 16]). The theorem states that every necklace (a word) with an even number of beads in each of k kinds can be fairly split between two thieves by cutting it at no more than k places. This means that the resulting pieces of the split necklace can be partitioned into two collections so that every kind of bead is equally represented in both collections. It is easy to see that this result is optimal by considering necklaces with beads of the same kind grouped into connected segments.
We state the necklace-splitting theorem below in the terminology of anagrams, squares, and the cutting distance. Denote by the family of all k-ary anagram squares, i.e., words of the form , with U being an anagram of W. We may also define the analogous functions and .
Theorem 5 (Goldberg and West [14]).
Every binary tangram B satisfies . For every , every k-ary tangram T satisfies . As a consequence, and , for all .
Perhaps only the binary case should be explained. The original necklace-splitting theorem asserts that two cuts are sufficient for every binary tangram B, while we wrote . The reason is that, in the necklace-splitting problem, we must obtain two separate anagrams, while the cutting distance is measured to one word that is an anagram square. So, one cut can always be saved. For instance, consider the word 0000111111. In the necklace problem, we need two cuts, namely 00|00111|111. But, to make an anagram square, only one of them is sufficient; the first cut, 00|00111111, gives the word 0011111100, while the second cut, 0000111|111, gives the word 1110000111, and both resulting words are anagram squares.
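The one-cut phenomenon for binary tangrams is easy to explore by computer. The sketch below (with the hypothetical helper `anagram_square_cuts`) lists all cut positions i for which moving the prefix of length i to the end produces an anagram square; position 0 means that no cut is needed.

```python
from collections import Counter


def anagram_square_cuts(word):
    """Positions i such that word[i:] + word[:i] is an anagram square."""
    half = len(word) // 2
    positions = []
    for i in range(len(word)):
        shifted = word[i:] + word[:i]
        # An anagram square has two halves with identical letter counts.
        if Counter(shifted[:half]) == Counter(shifted[half:]):
            positions.append(i)
    return positions


print(anagram_square_cuts("0000111111"))
```

For the word 0000111111, the list contains the positions 2 and 7, corresponding to the two single cuts discussed above.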
This cut saving is not always possible over larger alphabets. Curiously, there also exist examples of tangrams with correspondingly larger cut distances to shuffle squares.
Proposition 3. We have .
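Proof. We show that there exists a ternary tangram T whose cutting distance to the set of ternary shuffle squares equals three. Consider the word T = 011120002221. It can be cut by three cuts as 0111|20|00222|1 and rearranged to the word S = 00222|20|0111|1. Since every run in S is of even length, S is a shuffle square: the first half of every run forms one copy of the word 022011, and the remaining letters form the other copy.
Now, by running a computer program, we checked that two cuts are not sufficient to make a shuffle square out of T. Hence, the cutting distance from T to the set of ternary shuffle squares is exactly three. □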
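A brute-force search of the kind used in this proof can be sketched as follows. This is our own illustrative code, not the program used by the authors; the names `is_shuffle_square` and `min_cuts_to_shuffle_square` are hypothetical.

```python
from itertools import combinations, permutations


def is_shuffle_square(word):
    """Return True if `word` splits into two identical disjoint subwords."""
    if not word or len(word) % 2:
        return False
    states = {""}  # letters read by the leading copy, not yet matched by the other
    for i, letter in enumerate(word):
        remaining = len(word) - i - 1
        next_states = set()
        for backlog in states:
            if len(backlog) + 1 <= remaining:
                next_states.add(backlog + letter)
            if backlog and backlog[0] == letter:
                next_states.add(backlog[1:])
        states = next_states
    return "" in states


def min_cuts_to_shuffle_square(word, max_cuts):
    """Least number of cuts (up to max_cuts) turning `word` into a shuffle square."""
    n = len(word)
    for cuts in range(max_cuts + 1):
        for places in combinations(range(1, n), cuts):
            bounds = (0, *places, n)
            pieces = [word[a:b] for a, b in zip(bounds, bounds[1:])]
            if any(is_shuffle_square("".join(p)) for p in permutations(pieces)):
                return cuts
    return None  # more than max_cuts cuts are needed


pieces = ["0111", "20", "00222", "1"]            # T = 0111|20|00222|1
rearranged = "".join(["00222", "20", "0111", "1"])
print("".join(pieces), rearranged, is_shuffle_square(rearranged))
print(min_cuts_to_shuffle_square("011120002221", 3))
```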
By computer experiments, we found that the word 011120002221 is the shortest ternary tangram with the cutting distance of three to shuffle squares. There are only two such tangrams of length 12 (up to renaming the letters). Table 1 contains a collection of data including the number of ternary tangrams of a given length and a given cutting distance to shuffle squares.
In another experiment, we considered the following variant of shuffle squares introduced in [1]. The reverse of a word $U = u_1u_2\cdots u_n$ is the word $U^R = u_n\cdots u_2u_1$, i.e., U written backwards. A reverse square is a word of the form $UU^R$; for instance, the word rattar is an example of a reverse square. Analogously, a word that can be split into two subwords that are the reverse of each other is called a reverse shuffle square. Denote by
the set of all
k-ary reverse shuffle squares.
As before, we are interested in the cutting distance between tangrams and reverse shuffle squares. Thus, we define and .
It is known (and easy to prove; see [1]) that every reverse shuffle square must at the same time be an anagram square. In the binary case, the converse is also true; a binary tangram is a reverse shuffle square if and only if it is an anagram square (see [1]). By these remarks and by the necklace-splitting theorem, we immediately obtain the following.
Proposition 4. Every binary tangram B satisfies . As a consequence, .
Proof. Let B be a binary tangram. By the necklace-splitting theorem, we may write $B = UXV$, where X is an anagram of $UV$. So, cutting the word B as $U|XV$ and moving U to the end gives an anagram square $S = XVU$. This completes the proof, since S is also a reverse shuffle square. □
It is natural to wonder what happens for larger alphabets. The discussion above together with computer experiments suggests that the following analog to Conjecture 3 may be true.
Conjecture 5. Every k-ary tangram T satisfies , for some finite constant depending only on k.
This time, we know that the statement of the conjecture is true for , but again, for , even the finiteness of is not obvious. Nevertheless, analogous to the case of shuffle squares, we formulate the following (risky) conjecture.
Conjecture 6. For every , we have .
One tiny fact in favor of this supposition is given below.
Proposition 5. We have .
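Proof. We show that there exists a ternary tangram T whose cutting distance to the set of ternary reverse shuffle squares equals three. Consider the word T = 0001201222. It can be cut by three cuts as 0|0|01|201222 and rearranged to the word S = 0|201222|01|0. The word S is a reverse shuffle square, which can be seen, for instance, by taking the subword 02210 formed by the letters at positions 1, 2, 5, 9, 10 and its reverse 01220 formed by the remaining letters.
Now, by running a computer program, we checked that two cuts are not sufficient to make a reverse shuffle square out of T. Hence, the cutting distance from T to the set of ternary reverse shuffle squares is exactly three. □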
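Reverse shuffle squares of small length can be recognized by trying all splits. The following sketch (the function name is ours) assumes, without loss of generality, that the first letter of the word belongs to the subword read forwards.

```python
from itertools import combinations


def is_reverse_shuffle_square(word):
    """Return True if `word` splits into a subword and its reverse."""
    n = len(word)
    if n == 0 or n % 2:
        return False
    # Position 0 is assigned to the first subword; choose its other positions.
    for chosen in combinations(range(1, n), n // 2 - 1):
        first = {0, *chosen}
        u = "".join(word[i] for i in range(n) if i in first)
        v = "".join(word[i] for i in range(n) if i not in first)
        if u == v[::-1]:
            return True
    return False


t_pieces = ["0", "0", "01", "201222"]             # T = 0|0|01|201222
s_word = "".join(["0", "201222", "01", "0"])       # S = 0|201222|01|0
print("".join(t_pieces), s_word, is_reverse_shuffle_square(s_word))
```

The number of splits grows like a central binomial coefficient, so this check is intended for short words such as the examples above.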
By computer experiments, we found that the word 0001201222 is the unique (up to renaming the letters) shortest ternary tangram with the cutting distance of three to reverse shuffle squares. Table 2 contains a collection of data including the number of ternary tangrams of a given length and a given cutting distance to reverse shuffle squares.
4. Final Comments
Let us conclude the paper with some comments on possible directions for future research. Of course, the most intriguing is the resolution of the stated conjectures and providing the answer to the following question: How many cuts are needed to turn a tangram into a shuffle (or reverse shuffle) square?
A similar question can be asked for other types of squares or actually any imaginable sets of words. For instance, consider the set of all decimal words, i.e., words over the alphabet {0, 1, …, 9}, representing all positive integers written in the usual decimal notation. Let P = {2, 3, 5, 7, 11, 13, 17, 19, 23, …} be the subset of prime words representing all the prime numbers. How many cuts are needed to turn a given decimal word into a prime word? Is it true that there is an absolute constant c such that, for any , we have either or ?
Another possible direction is to study how much the structure of a constraint graph affects the avoidability of the abstract squares. In our main result, we presented one property (an exponential upper bound for vertex degrees corresponding to words of given length) guaranteeing avoidability. Perhaps other graph-theoretic properties may be of use here. For which classes of graphs are the corresponding abstract squares avoidable? Is it true, for instance, that G-squares are avoidable for any planar graph G?
Finally, besides shuffle squares, one may consider shuffle cubes, shuffle bi-squares, or any shuffle m-powers, for arbitrary $m \geq 2$. For instance, the word 010011110100 is a shuffle cube consisting of three shuffled copies of the word 0110. In [5], Müller proved that shuffle cubes are avoidable over a six-letter alphabet. Notice that a shuffle cube need not contain any shuffle square as a factor (unlike for ordinary powers), so this result is independent of the theorem in [8] establishing the avoidability of shuffle squares over a six-letter alphabet. In [7], we proved that, for every
, shuffle
m-powers are avoidable on an alphabet of size
. Is it true that, for a sufficiently large
m, one may avoid shuffle
m-powers on a binary alphabet? Is there a finite number
k such that all shuffle
m-powers,
, are
simultaneously avoidable on an alphabet of size
k?
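Shuffle m-powers of short words can be recognized by a memoized search. The sketch below (the name `is_shuffle_power` is ours) assigns the letters one by one to m copies and keeps the common base word built so far; it is meant as an illustration for experiments, not as an efficient algorithm.

```python
from functools import lru_cache


def is_shuffle_power(word, m):
    """Return True if `word` splits into m identical disjoint subwords."""
    n = len(word)
    if n == 0 or n % m:
        return False
    size = n // m  # length of each copy

    @lru_cache(maxsize=None)
    def search(i, base, progress):
        # base: the common word as far as the most advanced copy has read it;
        # progress: sorted tuple with the number of letters matched by each copy.
        if i == n:
            return True
        letter = word[i]
        for p in set(progress):
            if p == size:
                continue                   # this copy is already complete
            if p < len(base):
                if base[p] != letter:
                    continue               # the letter does not fit this copy
                new_base = base
            else:
                new_base = base + letter   # the most advanced copy extends the base
            rest = list(progress)
            rest.remove(p)
            rest.append(p + 1)
            if search(i + 1, new_base, tuple(sorted(rest))):
                return True
        return False

    return search(0, "", (0,) * m)


print(is_shuffle_power("010011110100", 3))
```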