1. Introduction
Binary sequences (strings or chains) appear naturally in physical systems for describing growing processes generated from an initial state. Examples are ubiquitous, from the spin up-down systems in physics to codification in information theory and its reflection in biomolecules such as DNA and RNA (where two sets of nucleotides exist: purines and pyrimidines). They are also common objects in mathematics, representing, for example, sets of words formed from letters of an alphabet or symbolic representations of dynamical systems, where a mapping can be defined to transform real trajectories into two-state series [
1]. Binary sequences can also codify fractal sets obtained by an iterative process, as occurs in the classical Cantor set [
2,
3].
In order to study the properties of systems with fractal geometry, Mandelbrot defined some random sets that are generated recursively from a given initial set as a percolation process [
2,
4]. These sets are formed by the successive application of a defined set of rules that either divide the initial set and discard some subsets in each partition, or enlarge the initial set by substituting the existing subsets with multiple replicas of themselves. These sets can be embedded in 1D, 2D and 3D spaces, which conditions some of their characteristics, in particular percolation.
In this paper, we study the properties of the sequences formed by performing random substitutions of constant length (see, for instance, [
5]). Specifically, we consider a binary alphabet, which we denote as
, and substitutions of length
. As in the classical percolation process, we assume that, once a void site is formed, i.e., a 0 occurs at this site, subsequent substitutions of it yield
k zeros, always. On the other hand, filled sites, i.e., sites holding a 1, are substituted by random words of
k letters with a (uniform) probability of inserting a 1 defined by a parameter
p. In this substitution, the probability of inserting a 0 is
.
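As an illustration, the generation process just described can be simulated in a few lines (a sketch of our own, not part of the model's formal definition; function names are ours):

```python
import random

def substitute(seq, k, p, rng=random):
    """One substitution step: a 0 becomes k zeros; a 1 becomes k random
    digits, each being 1 with probability p (illustrative sketch)."""
    out = []
    for digit in seq:
        if digit == 0:
            out.extend([0] * k)  # void sites stay void
        else:
            out.extend(1 if rng.random() < p else 0 for _ in range(k))
    return out

def generate(i, k, p, seed=None):
    """Sequence after i iterations, starting from a single 1."""
    rng = random.Random(seed)
    seq = [1]
    for _ in range(i):
        seq = substitute(seq, k, p, rng)
    return seq
```

After i iterations the sequence has length k**i; with p = 1 every site stays filled, and with p = 0 the first substitution already produces a null sequence.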
The process defined in this way gives rise to sequences of increasing size. The number of ones (filled sites) in the sequence at iteration
i is a stochastic variable that depends on the probability
p and the length
k. We study the probabilistic properties of these sequences. Specifically, we derive the probability distribution of the number of ones at iteration
i. We also obtain the expected values of
X and
, the corresponding variance
, and obtain some properties of them. As an alternative measure of uncertainty, we calculate the mean entropy of these sequences and compare it with their variance [
6,
7]. Finally, we find the parametric curves that relate both measures for each iteration and substitution length. These curves present two regimes: for low values of
p the entropy-variance dependence is concave down, whereas for large values of
p this dependence is practically lost. Therefore, for each iteration and substitution length, we can generate sequences with the same entropy but with different variance and vice versa.
2. Sequence Generation
The generation of the sequences studied in this work is a one-dimensional example of the well-known classical Mandelbrot percolation process [
4]. Contrary to the classical formulation, where an initial segment of length
L is subdivided into
k subsegments of length
and, with probability
p, some of them are chosen to be further divided, the model we study here considers an initial digit, which we denote by 1, that is substituted by a word of length
k formed by digits
As in the classical model, we assume that digits denoted by 0 are substituted by a word of
k zeros (see
Figure 1) and sites occupied by ones are substituted by a word of
k letters with an independent probability
p. This procedure is applied successively to generate a binary sequence of finite length, after
i iterations or, in the limit, infinite length. When
, this procedure can be viewed as a probabilistic version of the Cantor set for
[
2].
These procedures are called substitutions, as they are obtained by replacing a digit by a word of digits [
1,
8,
9]. In this context, we study the following map, applied at each random substitution of length
k:
where
is a Bernoulli random variable with parameter
, i.e.,
p is the (uniform) probability of inserting a 1 at each position, independently for all
j. We note that an alternative description of this substitution process can be defined by using the Kronecker product [
10,
11,
12,
13].
The deterministic version of this procedure brings about well-known sequences. For the classical Cantor set, this map reduces to:
which, starting from a unique 1, yields the sequences:
whose limit under the iterative process is the infinite sequence that corresponds to the classical Cantor set. For this particular case, the length of the
i-th vector is
. In general, for substitutions of constant length
k, the
i-th iterated vector has length
.
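The deterministic Cantor substitution can be written compactly with the Kronecker product mentioned above (a sketch with NumPy; the pattern [1, 0, 1] encodes the rule 1 → 101, 0 → 000):

```python
import numpy as np

pattern = np.array([1, 0, 1])  # Cantor rule: 1 -> 101, 0 -> 000

seq = np.array([1])
for _ in range(3):
    # kron replaces every 1 by `pattern` and every 0 by a block of zeros
    seq = np.kron(seq, pattern)

print("".join(map(str, seq)))  # third iterate, length 3**3 = 27
```

The third iterate is 101000101000000000101000101, with 2**3 = 8 ones.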
There are other well-known sequences generated iteratively by substitutions of different length. For instance, the Fibonacci sequence is generated applying the following map [
14,
15]:
Starting from 0, after six iterations we obtain the binary vector:
. Here, the number of ones in the
i-th sequence represents the fertile population of rabbits. Thus, the number of mature pairs of rabbits after six generations is 5. Notice that, in principle, the position in the sequence does not have a spatial meaning. It is also worth remarking that this sequence can be generated by different maps [
14].
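The Fibonacci substitution can be sketched as follows; since the sequence can be generated by different maps [14], the rabbit-population form 0 → 1, 1 → 10 used below is one common choice, not necessarily the one written above:

```python
def fib_step(seq):
    """One common Fibonacci substitution: 0 (immature pair) -> 1 (mature
    pair), 1 (mature pair) -> 10 (mature pair plus newborn pair)."""
    return "".join("1" if d == "0" else "10" for d in seq)

iterates = ["0"]
for _ in range(6):
    iterates.append(fib_step(iterates[-1]))

print([len(s) for s in iterates])        # Fibonacci numbers: 1 1 2 3 5 8 13
print([s.count("1") for s in iterates])  # mature pairs:      0 1 1 2 3 5 8
```

The lengths of the iterates are the Fibonacci numbers, and the number of ones (mature pairs) follows the same sequence one step behind.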
Another self-similar sequence generated by a recursive process is the so-called Morse-Thue (MT) sequence, which is obtained by the following map:
starting from 0 (although it could also be initiated by 1) [
14,
15]. For this example too, there is an alternative way of generating the sequence: simply take the sequence of the preceding step and append its complement. Besides, this sequence is aperiodic, although completely deterministic. Nevertheless, it can be proven that there are short- and long-range correlations [
14].
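A short sketch of the MT substitution and of the complement construction just mentioned (our code, assuming the standard map 0 → 01, 1 → 10):

```python
def mt_step(seq):
    """Morse-Thue substitution: 0 -> 01, 1 -> 10."""
    return "".join("01" if d == "0" else "10" for d in seq)

complement = str.maketrans("01", "10")

seq = "0"
for _ in range(5):
    nxt = mt_step(seq)
    # equivalent construction: previous step followed by its complement
    assert nxt == seq + seq.translate(complement)
    seq = nxt
print(seq)  # 01101001100101101001011001101001
```

Both constructions produce the same iterates, which is why the substitution and the append-the-complement rule are interchangeable for this sequence.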
3. Main Statistical Properties
In this section, we study the statistical properties of the sequences generated by applying the rule (
1) recurrently to the starting sequence (1). A classical issue to be considered is the probability distribution of each of the letters of the alphabet. In particular, in the case of binary sequences, this problem reduces to knowing the probability distribution of, for instance, the number of ones in any population of sequences generated from a set of random rules, e.g., random substitutions.
As a starting point, the classical Bernoulli process could be illustrative. This process can be also viewed as a random substitution of constant length,
k, where both letters, e.g., 0 and 1, are identically substituted according to the rule:
being
and
Bernoulli processes with parameter
. In each iteration,
i, the length of the sequence is
, and the probability distribution of the number of ones,
X, is given by the binomial distribution:
By applying this distribution function, the expected values of the number of ones in a sequence of length
n,
, and its square, respectively
and
, can be calculated:
for
. Using these expressions, the variance of
X reads:
Note that for the binomial distribution, the variance-to-mean ratio equals 1 − p
for any
n.
Similarly, we can compute these moments for the sequences generated from the random substitutions defined in the previous section. In this case, a recurrence formula provides this distribution for any iteration i and for any value of the substitution length k.
Let us present first the results for a substitution of length
. As stated before, we start the generation of the binary sequences from a sequence with a unique one. Thus, the probability of having this initial sequence is assumed to be 1:
In the first step, a sequence of length two is formed whose distribution of ones depends on the probability
p (see
Figure 2). To obtain the probability distribution of ones in this first step,
we multiply from the left
by the substitution matrix:
giving rise to the vector:
where
.
In the next substitution, a four-digit sequence is formed and the probability distribution of ones can be computed from the previous one as follows:
where
. The third iteration confirms the formula (see
Figure 2):
with
where
, and
The general form of the recurrence formula for the probability distribution of the number of ones in the
i-th iteration is:
being the vector
and the general substitution matrix,
, whose coefficients are given by:
for
and
and
. For
, the coefficients of matrix
(
12) are given by:
By recurrence, Formula (
18) can be converted into a matrix product chain:
Figure 3 depicts the probability distributions for different values of
p for the iteration
.
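The recursion behind this matrix product chain can be sketched in plain Python rather than with explicit matrices (our reading of the recursion: a sequence with j ones produces a Binomial(k·j, p) number of ones in the next iteration; function names are ours):

```python
from math import comb

def ones_distribution(i, k, p):
    """P(number of ones = m) after i substitutions, starting from a single 1.
    A sequence with j ones yields Binomial(k*j, p) ones in the next step."""
    q = 1.0 - p
    dist = [0.0, 1.0]  # iteration 0: exactly one 1, with certainty
    for _ in range(i):
        new = [0.0] * (k * (len(dist) - 1) + 1)
        for j, pj in enumerate(dist):
            if pj == 0.0:
                continue
            for m in range(k * j + 1):
                new[m] += pj * comb(k * j, m) * p**m * q**(k * j - m)
        dist = new
    return dist
```

For instance, `ones_distribution(1, 2, 0.5)` returns `[0.25, 0.5, 0.25]`, and the first component of the returned vector is the probability of a null sequence.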
Figure 4 shows both the analytical (black continuous curves) and the corresponding numerical distributions obtained from 1000 realizations. As can be seen, the distributions exhibit a peak at
whose height decreases as
p increases. Besides this peak, the distributions also present other maxima whose heights depend on
p, being smaller for lower values of
p. This multimodality could be a consequence of the generation rule that fosters the emergence of clusters of scalable sizes. As a matter of fact, the fractal properties of these sequences are the main subject of a forthcoming paper.
The probability of generating a null sequence, i.e., one formed entirely of zeros, after
i substitutions,
, corresponds to the first component of
. The following recurrence formula can be obtained [
16]:
for
. Note that
represents the probability of yielding a non-null sequence after a substitution that includes at least one 1 in the previous non-null sequence and then,
represents the probability of yielding a null sequence after a substitution of length 1. This expression can be rewritten as:
Because the substitution has length
, and it is formed by independent digits, this expression must be multiplied by itself, resulting in Equation (
22).
The first term in this sequence is:
which yields the second one:
It is worth remarking that the expansions of
as polynomials of
p and
have the
triangle read by rows (OEIS A202019) [
17] as coefficients.
Assuming that this sequence of probabilities converges to
, the following equation holds:
whose lowest solution:
depends on
p as depicted in
Figure 5. Note that, for
p below 1/2
the probability of generating a null infinite sequence is one, i.e.,
for
.
Equation (
22) can be generalized straightforwardly to any substitution length
k using the same argument as above (see [
16]). Now, the probability of having a null sequence at iteration
is obtained after multiplying
k times the factor
:
As before, in the limit this equation converges to:
that also exhibits a critical value of
p:
As can be observed in
Figure 5, as the substitution length
k increases, the probability of generating a null sequence decreases for
; so does the range of p where this probability is one, i.e.,
as
.
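The recursion for the null-sequence probability and its critical behaviour can be checked numerically. A sketch of our own, writing the recursion as Q_{i+1} = ((1 − p) + p·Q_i)^k with Q_0 = 0, which matches the argument above (each of the k digits independently leads to a null subtree):

```python
def null_probability(i, k, p):
    """Probability that the sequence is entirely zeros after i substitutions."""
    q_i = 0.0
    for _ in range(i):
        q_i = ((1.0 - p) + p * q_i) ** k
    return q_i

# below the critical value p = 1/k extinction is certain; above it is not
print(null_probability(200, 2, 0.4))  # close to 1 (p < 1/2)
print(null_probability(200, 2, 0.8))  # ~0.0625, lowest root of Q = (q + p*Q)**2
```

For k = 2 and p = 0.8, the limit agrees with the lowest root of the fixed-point equation, ((1 − p)/p)² = 0.0625.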
Expected Values and Variance
Having obtained the probability distributions of the number of ones, it is straightforward to compute the expected value after
i iterations:
This equation can be written in matrix form:
Now, by applying Formula (
18)
It turns out that
where by
we refer to a vector whose entries are the expectation of a Binomial random variable
for successive iterations. Thus:
Then, the following recurrence formula holds:
for
and
. This recurrence yields the expression:
for
.
This formula can be straightforwardly generalized to any length
k:
The mean of the distribution of 0 in the sequence formed after
i substitutions of length
k is:
that yields a ratio
at the
i-th iteration:
This ratio tends to 0 as
for all
and for all
k. Furthermore, the limit of
as
is infinite for all values of
. Note that this ratio is one for the value of
, independently of
k.
Similarly, we can calculate the expected value of
,
. First, for the case
we obtain:
For the second iteration, it is straightforward to calculate this value from the definition:
Using the recurrence of the probability distributions for successive iterations, the expected value of the next iteration is also computed:
that simplifies to:
Here we have used the expected value of
for the binomial distribution for the number of ones in a sequence of length
n.
In order to calculate the expected values of
for successive iterations, we find the following recursive formula:
that can be derived using Equation (
8) for the expected value of
. Indeed,
where
As before,
represents a vector with entries
where
for
and, consequently, is given by:
Then, since this vector can be split into two vectors:
Equation (
45) holds.
It can be proven that the expected value of
for the
i-th iteration depends on
p as follows:
for
. Evaluating the geometric sum, this becomes:
This expression can be applied to calculate the variance of the distribution of the number of ones in a sequence generated after
i iterations, i.e.,
:
that simplifies to:
This expression can be further compacted if we use the expression of the mean (
37):
These computations can be equally performed for any length
. A general formulation for any value of
k yields the following functions of
p for the expected values of
X and
at the
i-th iteration:
Then, the variance is straightforwardly calculated and yields:
The limits of the mean and the variance as i tends to infinity depend on the value of p. For p < 1/k, these limits equal 0, whereas for p > 1/k, the limits are infinite.
Contrary to the Bernoulli sequences, the variance reaches a maximum for an intermediate value of
p, that depends on both
k and
i (see
Figure 6).
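The mean and variance can also be obtained from a two-line moment recursion (our derivation, consistent with the expressions above: conditioning on X_i, the next count is Binomial(k·X_i, p), which gives E_{i+1} = kp·E_i and S_{i+1} = kp(1 − p)·E_i + (kp)²·S_i for the second moment S):

```python
def moments(i, k, p):
    """Mean and variance of the number of ones after i substitutions."""
    e, s = 1.0, 1.0  # E[X_0] = E[X_0**2] = 1: we start from a single 1
    for _ in range(i):
        e, s = k * p * e, k * p * (1 - p) * e + (k * p) ** 2 * s
    return e, s - e * e

# mean equals (k*p)**i; with m = k*p != 1, the variance matches the
# closed form k*p*(1-p) * m**(i-1) * (m**i - 1) / (m - 1)
```

For i = 1 this reduces to the Binomial(k, p) mean and variance, as it should.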
An interesting index that could provide information about the stochasticity of the generation process is the variance-to-mean ratio, also known as the dispersion index,
D. For iteration
i, it can be written as:
By comparison with the Poisson distribution, which possesses a dispersion value equal to 1, it is said that a process that has a
D-value lower than 1 is “under-dispersed”; on the contrary, having a
D-value larger than 1 reflects an “over-dispersed” stochasticity. Since
and the derivative of
is positive at
for all
and furthermore,
then there exists a value of
such that
. This value of
p is unique because
exhibits a unique maximum in
. Moreover, it is very close to 1 for all
k and
i and tends to 1 as
i tends to infinity, meaning that under-dispersion only occurs when the probability of inserting 0 is very low.
As observed in
Figure 6, the standard deviation of
X exhibits a maximum at
. Besides, this maximum exists for any
k and
i and it is achieved for the value of
p such that:
which can be written as follows:
which implies that:
for any
k and for each iteration
i. Evaluating the finite sum on the left-hand side of this equation yields:
This equation cannot be explicitly solved for
p for any value of
k and
i. Nonetheless, the roots of the polynomial:
can be obtained numerically. For each substitution length,
, we are able to fit the roots of polynomial
to the function of
i:
The values of the fitting parameters
and
, as well as the corresponding residual sum of squares, are shown in
Table 1. As can be observed, as the
k-value increases, it seems that
and
, that is:
This suggests that the roots of the polynomials, i.e., the values of
p where the variance of
X achieves its maximum, tend to 1 as
i tends to infinity. In addition, the variance of
X at the maximum tends to infinity, although, by definition, this variance is null at
. We do not have a rigorous explanation for the near-independence from
k of the roots of the polynomial
. It is likely related to the self-similar properties of the generation process.
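The variance-maximizing p can be located numerically from the closed-form variance (a sketch under our assumptions: the branching-process form of the variance, consistent with the moment recursion above; the grid search is our own device, not the paper's fitting procedure):

```python
def variance(i, k, p):
    """Closed-form variance of the number of ones after i substitutions."""
    m = k * p
    if abs(m - 1.0) < 1e-12:
        return k * p * (1 - p) * i  # limit case k*p = 1
    return k * p * (1 - p) * m ** (i - 1) * (m ** i - 1) / (m - 1)

def variance_argmax(i, k, steps=20000):
    """Grid search for the p in (0, 1) that maximizes the variance."""
    return max(range(1, steps), key=lambda j: variance(i, k, j / steps)) / steps
```

For i = 1 the variance reduces to k·p·(1 − p), maximized at p = 1/2; as i grows, the maximizing p drifts towards 1, in line with the fits reported in Table 1.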
4. Entropy
In this section, we study the mean entropy of a population of sequences generated from the substitution rules defined in the previous section. As an illustration, we first consider the entropy of the Bernoulli process of probability
p:
This function is symmetric and exhibits a maximum at
. If we now consider the sequences formed by
n stochastic variables that follow a Bernoulli process, we can compute the entropy of a sequence of length
n with a number of ones
j in terms of the frequency
, which coincides, on average, with
p:
Obviously, the sequences with minimum information content,
and
, have null entropy, i.e.,
. In contrast, the maximum entropy is achieved for
and takes the value:
.
For a population of Bernoulli sequences of probability
p, the mean entropy can be calculated applying the binomial distribution to the entropy of the sequences generated with a probability
p:
where
is the row vector entropy for a sequence of length
n whose components,
are the corresponding entropies Equation (
67) for all possible combinations of ones. For instance, for
, this formula reduces to:
where
are the entropies of sequences with 0, 1 and 2 ones:
and
. Similarly, the mean entropy for a length
can be calculated as follows:
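The mean-entropy computation for Bernoulli sequences can be sketched as follows (our code; we assume, consistently with the definitions above, that the entropy of a sequence with j ones out of n is the binary entropy h(j/n) of the frequency of ones):

```python
from math import comb, log2

def h(f):
    """Binary Shannon entropy of a frequency f of ones."""
    return 0.0 if f in (0.0, 1.0) else -f * log2(f) - (1 - f) * log2(1 - f)

def mean_entropy_bernoulli(n, p):
    """Mean entropy of length-n Bernoulli(p) sequences: the entropy h(j/n)
    of each ones-count j, weighted by its binomial probability."""
    q = 1.0 - p
    return sum(comb(n, j) * p**j * q**(n - j) * h(j / n) for j in range(n + 1))

print(mean_entropy_bernoulli(2, 0.5))  # 0.5: only j = 1 contributes, h(1/2) = 1
```

As expected for the Bernoulli case, this mean entropy is symmetric in p and 1 − p.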
To compute the mean entropy of a population of sequences randomly generated by substitutions according to the rules defined in
Section 2, we have to know the probability of appearance of each of the frequencies of ones in order to weight the sequence entropies adequately. As an example, let us start with the substitution length
. In this case, the probability distributions of 1 for the
i-th iteration are given by Equation (
18), and this yields the mean entropy:
Here, the components of the entropy vector
are given by:
for a sequence of length
, with a number of ones
j, such that
.
Using the matrix expression for
, (Equation (
21)), this equation reduces to:
The entropy in the first iteration, i.e., after substituting the initial 1 by applying the rule (
1) with
is given by:
which yields:
In the next iteration, the entropy can be calculated by the formula:
where
Multiplying the matrices, we get:
In the same way, we can calculate the entropy of the 8-digit binary sequences obtained after three substitutions:
In
Figure 7, we depict the entropies calculated analytically for the first three iterations, as well as their corresponding numerical estimates. Note that, contrary to the Bernoulli sequences, the entropies of these sequences are not symmetric and achieve a maximum at a value of
p that depends on the iteration. Furthermore, it can be shown that the entropy "per digit", i.e.,
, tends to 0 as
for all values of
p.
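The mean entropy of the substitution ensemble can be computed by weighting the entropy of each ones-count with its probability (a self-contained sketch reusing our reading of the distribution recursion; as before, we take the entropy of a sequence with j ones out of k**i digits to be the binary entropy h(j/k**i)):

```python
from math import comb, log2

def h(f):
    """Binary Shannon entropy of a frequency f of ones."""
    return 0.0 if f in (0.0, 1.0) else -f * log2(f) - (1 - f) * log2(1 - f)

def mean_entropy_substitution(i, k, p):
    """Mean entropy after i random substitutions of length k: the entropy
    h(j / k**i) of each ones-count j, weighted by its probability."""
    q = 1.0 - p
    dist = [0.0, 1.0]  # iteration 0: a single 1
    for _ in range(i):
        new = [0.0] * (k * (len(dist) - 1) + 1)
        for j, pj in enumerate(dist):
            for m in range(k * j + 1):
                new[m] += pj * comb(k * j, m) * p**m * q**(k * j - m)
        dist = new
    n = k ** i
    return sum(pj * h(j / n) for j, pj in enumerate(dist))
```

Unlike the Bernoulli case, this mean entropy is not symmetric in p and 1 − p, in agreement with Figure 7.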
The differential entropy for successive iterations:
presents an interesting behaviour, as can be seen in
Figure 7b. There is a critical value that separates two regimes: for
,
is negative, whereas for
,
is positive. As can be seen,
depends on the iteration: as
,
. As a consequence, the
p-interval where
is positive shrinks to 0. Besides,
tends to 0 for all
p, which means that the entropy converges to the function
, for the infinite sequence.
Because both variance and entropy are measures of uncertainty, it is interesting to look for a functional dependence between the two (see
Figure 7 and
Figure 8). The curves we see, for the case
, can be parametrized either by
p, for each iteration
i, or by
i, for each
p. The parametric equations for each
p and
i are given by Equations (
53) and (
71), for the variance and the entropy, respectively. As can be observed, the relationship between
H and
depends on
p for each iteration
i. It is important to remark that, contrary to what is known for classical probability distributions [
6,
7], the dependence of
H on
is not the graph of a function. Instead, the variation of
brings about a closed curve. Certainly, for
, this functional dependence coincides with that of the classical probability distributions [
6]. On the contrary, for
, where both variables decrease with
p, we find an almost linear dependence between them.
Figure 8 depicts the complete parametric curves for the first iterations obtained analytically.
An interesting consequence of this dependence is that, for the same variance, we can find a value of p that maximizes the entropy. This can be done for any iteration (remember that for , for all ). Note that the dispersion index, D, of these sequences would be different because the mean would change at both values of p. As an example, we find two values of p with a similar variance, in particular, but with mean entropy values and . Obviously, we can also find values of p that give rise to sequences with the same entropy but different variance. For example, a similar mean entropy is achieved for and , whereas the variances of these populations are very distant, and .
5. Concluding Remarks
Motivated by the classical percolation process defined by Mandelbrot [
4], we have studied a class of random substitutions of constant length that maps
and
, with
a Bernoulli process with parameter
. Contrary to the classical Bernoulli process, this is an asymmetric rule that yields infinite sequences with an average ratio
that tends to 0 for any
. By applying this substitution map, we have randomly generated binary sequences whose length increases with the substitution length
k as k^i for iteration
i. Specifically, we have analyzed the combinatorial, statistical and entropic properties of this substitution process starting from a sequence formed by the digit 1.
This asymmetry is reflected in the mass distribution of the number of ones, specifically with regards to its uncertainty. We have computed the variance of this distribution and the entropy of the ensemble of sequences generated at each iteration as a function of
p. The relationship between these two magnitudes is depicted in the H-VAR curves of
Figure 8. It is interesting to remark on the two regimes that appear in the H-VAR curves, i.e., concavity down for
and up for
that, to the best of our knowledge, have not been shown before [
6]. This allows us to compare sequences generated with different values of
H but with the same variance and, conversely, sequences that have the same entropy and different variance. We leave for a forthcoming paper the study of other properties of the generated sequences, for instance, the distribution of the sizes of the sets of ones and their fractality, measured by the Hausdorff dimension.
The same kind of substitution rules can also be applied in the plane or in space to generate sets with specific properties. In particular, random substitutions have been applied to study percolation properties of 2D fractal objects [
2,
4]. Using an iterated function system [
3], random fractals are generated as a function of a probabilistic control parameter, e.g., the probability of any site being occupied, and their percolation characteristics are quantified in terms of this parameter.