**2. Results**

For a binary string *X* = *X*<sub>1</sub>*X*<sub>2</sub>...*X*<sub>*n*</sub>, where the *X*<sub>*i*</sub> (*i* = 1, ..., *n*) are independent and identically distributed random variables, we assume that **P**(*X*<sub>*i*</sub> = 1) = *p*, **P**(*X*<sub>*i*</sub> = 0) = *q* = 1 − *p*, and *p* > *q*. We define the *k*th Subword Complexity, *X*<sub>*n*,*k*</sub>, to be the number of distinct substrings of length *k* that appear in a random string *X* generated under these assumptions. In this work, we obtain the first-order asymptotics for the average and the second factorial moment of *X*<sub>*n*,*k*</sub>. The analysis is done in the range *k* = Θ(log *n*). We rewrite this range as *k* = *a* log *n*, and by performing a saddle point analysis, we will show that

$$1/\log q^{-1} < a < 1/\log p^{-1} \tag{1}$$
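The definition of the *k*th Subword Complexity can be made concrete with a short simulation. The following is a minimal sketch, not part of the original analysis: the function names `bernoulli_string` and `subword_complexity` are hypothetical, and the string is drawn with Python's standard pseudo-random generator.

```python
import random


def bernoulli_string(n, p, rng):
    """Return a length-n binary string whose symbols are i.i.d. with P('1') = p."""
    return "".join("1" if rng.random() < p else "0" for _ in range(n))


def subword_complexity(x, k):
    """Count the distinct length-k substrings of x.

    Returns 0 when k is outside 1..len(x), since no such substring exists.
    """
    if k <= 0 or k > len(x):
        return 0
    return len({x[i:i + k] for i in range(len(x) - k + 1)})


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed (an arbitrary choice) for repeatability
    x = bernoulli_string(4000, 0.9, rng)
    for k in (4, 8, 16):
        print(k, subword_complexity(x, k))
```

The count can never exceed min(2^*k*, *n* − *k* + 1), which is why the interesting regime is *k* of order log *n*.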

In the first step, we compare the *k*th Subword Complexity to an independent model constructed in the following way: we store in a trie a set of *n* strings generated independently by a memoryless source. This means that each string is a sequence of independent and identically distributed Bernoulli random variables over the binary alphabet A = {0, 1}, with **P**(1) = *p* and **P**(0) = *q* = 1 − *p*. We denote the number of distinct prefixes of length *k* in the trie by *X̂*<sub>*n*,*k*</sub>, and we call it *the kth Prefix Complexity*. Before proceeding any further, we recall that the factorial moments of a random variable are defined as follows.

**Definition 1.** *The jth factorial moment of a random variable X is defined as*

$$\mathbb{E}[(X)\_j] = \mathbb{E}[(X)(X-1)(X-2)...(X-j+1)],\tag{2}$$

*where j* = 1, 2, ....

We will show that the first and second factorial moments of *X*<sub>*n*,*k*</sub> are asymptotically comparable to those of *X̂*<sub>*n*,*k*</sub> when *k* = Θ(log *n*). We have the following theorems.

**Theorem 1.** *For large values of n, and for k* = Θ(log *n*)*, there exists M* > 0 *such that*

$$\mathbb{E}[X\_{n,k}] - \mathbb{E}[\hat{X}\_{n,k}] = O(n^{-M}).$$

We also prove a similar result for the second factorial moments of the *k*th Subword Complexity and the *k*th Prefix Complexity:

**Theorem 2.** *For large values of n, and for k* = Θ(log *n*)*, there exists c* > 0 *such that*

$$\mathbb{E}[(X\_{n,k})\_2] - \mathbb{E}[(\hat{X}\_{n,k})\_2] = O(n^{-c}).$$
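Theorems 1 and 2 can be checked informally by simulation. The sketch below is an illustration under stated assumptions, not the method of the proofs: it estimates the first two factorial moments of both models by Monte Carlo, uses a set of length-*k* prefixes in place of an explicit trie (the number of distinct prefixes is the same), and all function names and parameter values are hypothetical.

```python
import random


def bernoulli_bits(length, p, rng):
    """Return `length` i.i.d. symbols as a string, with P('1') = p."""
    return "".join("1" if rng.random() < p else "0" for _ in range(length))


def subword_count(n, k, p, rng):
    """Distinct length-k substrings of one random string of length n."""
    x = bernoulli_bits(n, p, rng)
    return len({x[i:i + k] for i in range(n - k + 1)})


def prefix_count(n, k, p, rng):
    """Distinct length-k prefixes among n independent random strings.

    Only the first k symbols of each string matter, so only those are drawn.
    """
    return len({bernoulli_bits(k, p, rng) for _ in range(n)})


def factorial_moments(samples):
    """Empirical first and second factorial moments of a list of integers."""
    m1 = sum(samples) / len(samples)
    m2 = sum(x * (x - 1) for x in samples) / len(samples)
    return m1, m2


def compare_models(n, k, p, trials, seed=0):
    """Return ((E[X], E[(X)_2]), (E[X^], E[(X^)_2])) estimated over `trials` runs."""
    rng = random.Random(seed)
    sub = [subword_count(n, k, p, rng) for _ in range(trials)]
    pre = [prefix_count(n, k, p, rng) for _ in range(trials)]
    return factorial_moments(sub), factorial_moments(pre)


if __name__ == "__main__":
    print(compare_models(n=200, k=6, p=0.7, trials=200))
```

For finite *n* the two estimates differ by sampling noise and by boundary effects (a string of length *n* has only *n* − *k* + 1 positions), so close agreement should only be expected as *n* grows.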

In the second part of our analysis, we derive the first-order asymptotics of the *k*th Prefix Complexity. The methodology used here is analogous to the analysis of the profile of tries [19]. The rate of the asymptotic growth depends on the location of the value *a* within the range (1). For instance, for the average *k*th Subword Complexity, **E**[*X*<sub>*n*,*k*</sub>], we have the following observations.


The above observations will be discussed in depth in the proofs of the following theorems.

**Theorem 3.** *The average of the kth Prefix Complexity has the following asymptotic expansion.*

*i. For a* ∈ *I*<sub>1</sub>*,*

$$\mathbb{E}[\hat{X}\_{n,k}] = 2^k - \Phi\_1((1 + \log p)\log\_{p/q} n) \frac{n^\nu}{\sqrt{\log n}} \left( 1 + O\left(\frac{1}{\sqrt{\log n}}\right) \right),\tag{3}$$

*where ν* = −*r*<sub>0</sub> + *a* log(*p*<sup>−*r*<sub>0</sub></sup> + *q*<sup>−*r*<sub>0</sub></sup>)*, and*

$$\Phi\_1(x) = \frac{(p/q)^{-r\_0/2} + (p/q)^{r\_0/2}}{\sqrt{2\pi}\,\log(p/q)} \sum\_{j \in \mathbb{Z}} \Gamma(r\_0 + it\_j)\, e^{-2\pi i j x}$$

*is a bounded periodic function.*

*ii. For a* ∈ *I*<sub>2</sub>*,*

$$\mathbb{E}[\hat{X}\_{n,k}] = \Phi\_1((1+\log p)\log\_{p/q} n) \frac{n^\nu}{\sqrt{\log n}} \left(1 + O\left(\frac{1}{\sqrt{\log n}}\right)\right).$$

*iii. For a* ∈ *I*<sub>3</sub>*,*

$$\mathbb{E}[\hat{X}\_{n,k}] = n + O(n^{\nu\_0}),$$

*for some ν*<sub>0</sub> < 1*.*
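For numerical comparison with Theorem 3, the mean of the *k*th Prefix Complexity can also be evaluated exactly. The sketch below relies on an elementary identity that is not stated in the text above: a word *w* of length *k* is a prefix of at least one of the *n* independent strings with probability 1 − (1 − **P**(*w*))<sup>*n*</sup>, and **P**(*w*) = *p*<sup>*j*</sup>*q*<sup>*k*−*j*</sup> when *w* contains *j* ones. The function name is hypothetical, and the code assumes 0 < *p* < 1 and *k* ≥ 1.

```python
import math


def expected_prefix_complexity(n, k, p):
    """Exact E[X^_{n,k}] for n independent Bernoulli(p) strings.

    Sums, over the number j of ones in a word, the probability that the word
    occurs as a prefix of at least one string. Assumes 0 < p < 1 and k >= 1.
    """
    if not (0.0 < p < 1.0) or k < 1 or n < 0:
        raise ValueError("requires 0 < p < 1, k >= 1, n >= 0")
    log_p, log_q = math.log(p), math.log(1.0 - p)
    total = 0.0
    for j in range(k + 1):
        word_prob = math.exp(j * log_p + (k - j) * log_q)
        # 1 - (1 - word_prob)^n, computed stably for very small word_prob
        appears = -math.expm1(n * math.log1p(-word_prob))
        total += math.comb(k, j) * appears
    return total


if __name__ == "__main__":
    for k in (4, 8, 12, 16, 24):
        print(k, expected_prefix_complexity(4000, k, 0.9))
```

When *k* is small relative to log *n* the value is close to 2<sup>*k*</sup>, and when *k* is large it is close to *n*, in line with the leading terms of cases *i* and *iii*.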

**Theorem 4.** *The second factorial moment of the kth Prefix Complexity has the following asymptotic expansion.*

*i. For a* ∈ *I*<sub>1</sub>*,*

$$\mathbb{E}[(\hat{X}\_{n,k})\_2] = \left(2^k - \Phi\_1(\log\_{p/q} n (1 + \log p)) \frac{n^\nu}{\sqrt{\log n}} \left(1 + O\left(\frac{1}{\sqrt{\log n}}\right)\right)\right)^2.$$

*ii. For a* ∈ *I*<sub>2</sub>*,*

$$\mathbb{E}[(\hat{X}\_{n,k})\_2] = \Phi\_1^2(\log\_{p/q} n (1 + \log p)) \frac{n^{2\nu}}{\log n} \left( 1 + O\left(\frac{1}{\log n}\right) \right).$$

*iii. For a* ∈ *I*<sub>3</sub>*,*

$$\mathbb{E}[(\hat{X}\_{n,k})\_2] = n^2 + O(n^{2\nu\_0}).$$
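The second factorial moment of the *k*th Prefix Complexity also admits an exact finite sum that can be compared with Theorem 4. The sketch below uses an inclusion-exclusion identity that is an assumption of this illustration rather than a formula from the text: for distinct words *w* and *w*′ with probabilities *a* and *b*, both occur as prefixes with probability 1 − (1 − *a*)<sup>*n*</sup> − (1 − *b*)<sup>*n*</sup> + (1 − *a* − *b*)<sup>*n*</sup>, and the factorial moment is the sum of this quantity over ordered pairs of distinct words. The function name is hypothetical.

```python
import math


def second_factorial_moment_prefix(n, k, p):
    """Exact E[(X^_{n,k})_2] for n independent Bernoulli(p) strings.

    Words are grouped by their number of ones; `pairs` counts ordered pairs
    of distinct words in groups (j, i). Assumes 0 < p < 1 and k >= 1.
    """
    q = 1.0 - p
    probs = [p**j * q**(k - j) for j in range(k + 1)]
    counts = [math.comb(k, j) for j in range(k + 1)]
    total = 0.0
    for j in range(k + 1):
        for i in range(k + 1):
            # ordered pairs (w, w') with w != w'
            pairs = counts[j] * counts[i] - (counts[j] if i == j else 0)
            if pairs == 0:
                continue
            a, b = probs[j], probs[i]
            neither = max(0.0, 1.0 - a - b) ** n  # clamp guards against rounding
            both = 1.0 - (1.0 - a) ** n - (1.0 - b) ** n + neither
            total += pairs * both
    return total


if __name__ == "__main__":
    for k in (2, 4, 6, 8):
        print(k, second_factorial_moment_prefix(4000, k, 0.9))
```

The direct evaluation of `both` suffers from floating-point cancellation when the word probabilities are very small, so this sketch is only suitable for moderate *k*.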

The periodic function Φ<sub>1</sub>(*x*) in Theorems 3 and 4 is shown in Figure 2.

**Figure 2. Left**: Φ<sub>1</sub>(*x*) at *p* = 0.90, and various levels of *r*<sub>0</sub>. The amplitude increases as *r*<sub>0</sub> increases. **Right**: Φ<sub>1</sub>(*x*) at *r*<sub>0</sub> = 1, and various levels of *p*. The amplitude tends to zero as *p* → 1/2<sup>+</sup>.

The results in Theorem 4 also hold for the second moment of the *k*th Subword Complexity, since the analysis extends easily from the second factorial moment to the second moment. The variance of the *k*th Prefix Complexity, however, as seen in Figure 3, does not show the same asymptotic behavior as the variance of the *k*th Subword Complexity.

**Figure 3.** Approximated second moments (**left**), and variances (**right**) of the *k*th Subword Complexity (**red**), and the *k*th Prefix Complexity (**blue**), for *n* = 4000, at different probability levels, averaged over 10,000 iterations.
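The step from the second factorial moment to the second moment and the variance uses only Definition 1: since (*X*)<sub>2</sub> = *X*(*X* − 1), we have **E**[*X*<sup>2</sup>] = **E**[(*X*)<sub>2</sub>] + **E**[*X*] and Var(*X*) = **E**[*X*<sup>2</sup>] − **E**[*X*]<sup>2</sup>. A minimal sketch of these conversions on empirical samples follows; the function names are hypothetical.

```python
def falling_factorial(x, j):
    """Return (x)_j = x (x - 1) ... (x - j + 1), with (x)_0 = 1."""
    result = 1
    for i in range(j):
        result *= x - i
    return result


def moment_summary(samples):
    """Empirical mean, second factorial moment, second moment and variance."""
    count = len(samples)
    mean = sum(samples) / count
    fact2 = sum(falling_factorial(x, 2) for x in samples) / count
    second = fact2 + mean            # E[X^2] = E[(X)_2] + E[X]
    variance = second - mean ** 2    # Var(X) = E[X^2] - E[X]^2
    return {"mean": mean, "fact2": fact2, "second": second, "variance": variance}


if __name__ == "__main__":
    print(moment_summary([2, 4]))
```

Because the variance is a difference of two quantities whose leading terms coincide, agreement of the second moments of two models to first order does not by itself determine whether their variances agree.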

#### **3. Proofs and Methods**
