#### *3.1. Groundwork*

We first introduce some terminology and lemmas concerning overlaps of patterns and the number of their occurrences in texts. Some of the notation used in this work is borrowed from [18] and [21].

**Definition 2.** *For a binary word $w = w\_1 \ldots w\_k$ of length $k$, the autocorrelation set $\mathcal{S}\_w$ of the word $w$ is defined as*

$$\mathcal{S}\_{w} = \{w\_{i+1}...w\_{k} \mid w\_{1}...w\_{i} = w\_{k-i+1}...w\_{k}\}.\tag{4}$$

*The autocorrelation index set is*

$$\mathcal{P}(w) = \{i \mid w\_1...w\_i = w\_{k-i+1}...w\_k\},\tag{5}$$

*and the autocorrelation polynomial is*

$$S\_w(z) = \sum\_{i \in \mathcal{P}(w)} \mathbf{P}(w\_{i+1}...w\_k) z^{k-i}.\tag{6}$$
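Definition 2 can be made concrete with a short computation. The following is a minimal sketch, assuming a memoryless binary source in which the symbol `1` has probability `p` and `0` has probability `1 - p`; the function names and the string encoding of words are illustrative choices, not notation from the text.

```python
from fractions import Fraction

def word_prob(u, p):
    """Probability of the binary word u when P('1') = p and P('0') = 1 - p."""
    result = Fraction(1)
    for symbol in u:
        result *= p if symbol == "1" else 1 - p
    return result

def autocorrelation_indices(w):
    """Indices i in 1..k with w_1...w_i == w_{k-i+1}...w_k, as in (5)."""
    k = len(w)
    return [i for i in range(1, k + 1) if w[:i] == w[k - i:]]

def autocorrelation_polynomial(w, p):
    """Coefficients of S_w(z) from (6); entry j is the coefficient of z**j."""
    k = len(w)
    coeffs = [Fraction(0)] * k
    for i in autocorrelation_indices(w):
        # The overlap of length i contributes P(w_{i+1}...w_k) z^{k-i}.
        coeffs[k - i] += word_prob(w[i:], p)
    return coeffs

w = "101"
p = Fraction(1, 2)
print(autocorrelation_indices(w))        # [1, 3]
print(autocorrelation_polynomial(w, p))  # [Fraction(1, 1), Fraction(0, 1), Fraction(1, 4)]
```

For $w = 101$ with $p = 1/2$ this gives $S\_w(z) = 1 + z^2/4$: the index $i = k$ always belongs to $\mathcal{P}(w)$ and contributes the constant term $1$.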

**Definition 3.** *For distinct binary words $w = w\_1 \ldots w\_k$ and $w' = w'\_1 \ldots w'\_k$, the correlation set $\mathcal{S}\_{w,w'}$ of the words $w$ and $w'$ is*

$$\mathcal{S}\_{w,w'} = \{w'\_{i+1}...w'\_k \mid w'\_1...w'\_i = w\_{k-i+1}...w\_k\}.\tag{7}$$

*The correlation index set is*

$$\mathcal{P}(w, w') = \{i \mid w'\_1...w'\_i = w\_{k-i+1}...w\_k\},\tag{8}$$

*and the correlation polynomial is*

$$S\_{w,w'}(z) = \sum\_{i \in \mathcal{P}(w,w')} \mathbf{P}(w'\_{i+1}...w'\_k)z^{k-i}.\tag{9}$$
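The correlation of two words is not symmetric: $\mathcal{P}(w, w')$ matches prefixes of $w'$ against suffixes of $w$, so in general $S\_{w,w'}(z) \neq S\_{w',w}(z)$. The sketch below illustrates this under the same assumptions as Definition 3, with a memoryless source where `1` has probability `p`; the helper names are hypothetical.

```python
from fractions import Fraction

def word_prob(u, p):
    """Probability of the binary word u when P('1') = p and P('0') = 1 - p."""
    result = Fraction(1)
    for symbol in u:
        result *= p if symbol == "1" else 1 - p
    return result

def correlation_indices(w, w_prime):
    """Indices i with w'_1...w'_i == w_{k-i+1}...w_k, as in (8)."""
    k = len(w)
    return [i for i in range(1, k + 1) if w_prime[:i] == w[k - i:]]

def correlation_polynomial(w, w_prime, p):
    """Coefficients of S_{w,w'}(z) from (9); entry j is the coefficient of z**j."""
    k = len(w)
    coeffs = [Fraction(0)] * k
    for i in correlation_indices(w, w_prime):
        # A suffix of w of length i equals a prefix of w'; the remaining
        # k - i symbols of w' contribute their probability times z^{k-i}.
        coeffs[k - i] += word_prob(w_prime[i:], p)
    return coeffs

p = Fraction(1, 3)
print(correlation_indices("110", "101"))  # [2]
print(correlation_indices("101", "110"))  # [1]
```

With $w = 110$, $w' = 101$ and $p = 1/3$, the two polynomials are $S\_{w,w'}(z) = z/3$ and $S\_{w',w}(z) = 2z^2/9$.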

The following two lemmas present the probability generating functions for the number of occurrences of a single pattern and of a pair of distinct patterns, respectively, in a random text of length *n*. For a detailed derivation of these generating functions, refer to [18].

**Lemma 1.** *The occurrence probability generating function for a single pattern $w$ in a binary text over a memoryless source is given by $F\_w(z, x - 1)$, where*

$$F\_w(z,t) = \frac{1}{1 - A(z) - \frac{t\mathbf{P}(w)z^k}{1 - t(S\_w(z) - 1)}}.\tag{10}$$

*The coefficient $[z^n x^m] F\_w(z, x - 1)$ is the probability that a random binary string of length $n$ has $m$ occurrences of the pattern $w$.*
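Lemma 1 can be checked numerically for small cases. The sketch below expands $F\_w(z, x - 1)$ as a power series in $z$ whose coefficients are polynomials in $x$, and compares them with a brute-force enumeration of all binary strings. It relies on two assumptions that are not stated in this section: that $A(z) = z$ for a binary alphabet whose symbol probabilities sum to one, and that occurrences are counted with overlaps. All function names are illustrative.

```python
from fractions import Fraction
from itertools import product

def word_prob(u, p):
    """Probability of the binary word u when P('1') = p and P('0') = 1 - p."""
    result = Fraction(1)
    for symbol in u:
        result *= p if symbol == "1" else 1 - p
    return result

def autocorr_poly(w, p):
    """Coefficients of S_w(z); entry j is the coefficient of z**j."""
    k = len(w)
    coeffs = [Fraction(0)] * k
    for i in range(1, k + 1):
        if w[:i] == w[k - i:]:
            coeffs[k - i] += word_prob(w[i:], p)
    return coeffs

def poly_mul(a, b):
    """Product of two polynomials in x given as coefficient lists."""
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def poly_sub(a, b):
    """Difference a - b of two polynomials in x."""
    size = max(len(a), len(b))
    a = a + [Fraction(0)] * (size - len(a))
    b = b + [Fraction(0)] * (size - len(b))
    return [u - v for u, v in zip(a, b)]

def pgf_coefficients(w, p, n_max):
    """Return f with f[n][m] = [z^n x^m] F_w(z, x - 1), assuming A(z) = z."""
    k = len(w)
    s = autocorr_poly(w, p)
    t = [Fraction(-1), Fraction(1)]            # t = x - 1 as a polynomial in x
    zero, one = [Fraction(0)], [Fraction(1)]
    # F_w = num / den with num = 1 - t (S_w(z) - 1) and
    # den = (1 - z) num - t P(w) z^k; both are indexed by the power of z.
    num = [one] + [poly_mul([-s[j]], t) for j in range(1, k)] + [zero]
    den = [poly_sub(num[j], num[j - 1] if j else zero) for j in range(k + 1)]
    den[k] = poly_sub(den[k], poly_mul([word_prob(w, p)], t))
    f = []
    for n in range(n_max + 1):
        term = num[n] if n <= k else zero
        for j in range(1, min(n, k) + 1):      # den[0] = 1, so solve for f[n]
            term = poly_sub(term, poly_mul(den[j], f[n - j]))
        f.append(term)
    return f

def brute_force_distribution(w, p, n):
    """Map m -> probability that a random text of length n has m occurrences of w."""
    k = len(w)
    dist = {}
    for bits in product("01", repeat=n):
        text = "".join(bits)
        m = sum(text[i:i + k] == w for i in range(n - k + 1))
        dist[m] = dist.get(m, Fraction(0)) + word_prob(text, p)
    return dist

def agrees_with_enumeration(w, p, n_max):
    """True if the series coefficients match the enumeration for all n <= n_max."""
    f = pgf_coefficients(w, p, n_max)
    for n in range(n_max + 1):
        dist = brute_force_distribution(w, p, n)
        for m in range(max(len(f[n]), n + 1)):
            coef = f[n][m] if m < len(f[n]) else Fraction(0)
            if coef != dist.get(m, Fraction(0)):
                return False
    return True

print(pgf_coefficients("11", Fraction(1, 3), 2)[2][:2])  # [Fraction(8, 9), Fraction(1, 9)]
```

For $w = 11$ and $p = 1/3$, the coefficient of $z^2$ is $8/9 + x/9$: a text of length two contains $w$ once with probability $p^2 = 1/9$ and not at all otherwise.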

**Lemma 2.** *The occurrence PGF for two distinct patterns $w$ and $w'$ of length $k$ in a Bernoulli random text is given by $F\_{w,w'}(z, x\_1 - 1, x\_2 - 1)$, where*

$$F\_{w,w'}(z,t\_1,t\_2) = \frac{1}{1 - A(z) - M(z,t\_1,t\_2)},\tag{11}$$

*and*

$$M(z, t\_1, t\_2) = \left(\mathbf{P}(w)z^k t\_1 \quad \mathbf{P}(w')z^k t\_2\right) \left(\mathbb{I} - \begin{pmatrix} (S\_{w}(z) - 1)t\_1 & S\_{w, w'}(z)t\_2\\ S\_{w', w}(z)t\_1 & (S\_{w'}(z) - 1)t\_2 \end{pmatrix}\right)^{-1} \begin{pmatrix} 1\\1 \end{pmatrix}.$$

*The coefficient $[z^n x\_1^{m\_1} x\_2^{m\_2}] F\_{w,w'}(z, x\_1 - 1, x\_2 - 1)$ is the probability that there are $m\_1$ occurrences of $w$ and $m\_2$ occurrences of $w'$ in a random string of length $n$.*
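The quantity described by these coefficients can be tabulated directly for small $n$, which gives a reference against which any expansion of the generating function may be compared. The following is a brute-force sketch rather than an evaluation of the formula itself; it assumes a memoryless source in which `1` has probability `p` and counts occurrences with overlaps, and its function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def word_prob(u, p):
    """Probability of the binary word u when P('1') = p and P('0') = 1 - p."""
    result = Fraction(1)
    for symbol in u:
        result *= p if symbol == "1" else 1 - p
    return result

def joint_distribution(w, w_prime, p, n):
    """Map (m1, m2) -> probability of m1 occurrences of w and m2 of w'."""
    k = len(w)
    dist = {}
    for bits in product("01", repeat=n):
        text = "".join(bits)
        windows = [text[i:i + k] for i in range(n - k + 1)]
        key = (windows.count(w), windows.count(w_prime))
        dist[key] = dist.get(key, Fraction(0)) + word_prob(text, p)
    return dist

dist = joint_distribution("11", "10", Fraction(1, 2), 3)
print(dist[(1, 1)])  # 1/8
```

Here only the text `110` contains both `11` and `10` exactly once, so the entry for $(m\_1, m\_2) = (1, 1)$ is $1/8$.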

These results are used in the following section to derive the generating functions for the first two factorial moments of the *k*th subword complexity.

#### *3.2. Derivation of Generating Functions*

**Lemma 3.** *For the generating functions $H\_k(z) = \sum\_{n \geq 0} \mathbf{E}[X\_{n,k}] z^n$ and $G\_k(z) = \sum\_{n \geq 0} \mathbf{E}[(X\_{n,k})\_2] z^n$, we have*

*i.*

$$H\_k(z) = \sum\_{w \in \mathcal{A}^k} \left( \frac{1}{1 - z} - \frac{S\_w(z)}{D\_w(z)} \right),\tag{12}$$

*where $D\_w(z) = \mathbf{P}(w) z^k + (1 - z) S\_w(z)$, and*

*ii.*

$$G\_{k}(z) = \sum\_{\substack{w, w' \in \mathcal{A}^{k} \\ w \neq w'}} \left( \frac{1}{1 - z} - \frac{S\_{w}(z)}{D\_{w}(z)} - \frac{S\_{w'}(z)}{D\_{w'}(z)} + \frac{S\_{w}(z)S\_{w'}(z) - S\_{w,w'}(z)S\_{w',w}(z)}{D\_{w,w'}(z)} \right), \tag{13}$$

*where*

$$\begin{split} D\_{w,w'}(z) &= (1-z)(S\_{w}(z)S\_{w'}(z) - S\_{w,w'}(z)S\_{w',w}(z)) \\ &\quad + z^k \left( \mathbf{P}(w)(S\_{w'}(z) - S\_{w,w'}(z)) + \mathbf{P}(w')(S\_{w}(z) - S\_{w',w}(z)) \right) . \end{split} \tag{14}$$
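Formulas (12) to (14) can be sanity-checked for small parameters. Each summand of (12) is the generating function of the probability that $w$ occurs at least once, and each summand of (13) is the inclusion-exclusion expression for both $w$ and $w'$ occurring. The sketch below expands both sums as power series with exact rational arithmetic and compares them with a direct enumeration. It assumes that $X\_{n,k}$ denotes the number of distinct subwords of length $k$ in a random binary text of length $n$, that $(X)\_2 = X(X-1)$, and that `1` has probability `p`; the function names are illustrative.

```python
from fractions import Fraction
from itertools import product

def word_prob(u, p):
    """Probability of the binary word u when P('1') = p and P('0') = 1 - p."""
    result = Fraction(1)
    for symbol in u:
        result *= p if symbol == "1" else 1 - p
    return result

def corr_poly(u, v, p, size):
    """Coefficients of S_{u,v}(z), padded to `size`; u == v gives S_u(z)."""
    k = len(u)
    coeffs = [Fraction(0)] * size
    for i in range(1, k + 1):
        if v[:i] == u[k - i:]:
            coeffs[k - i] += word_prob(v[i:], p)
    return coeffs

def mul(a, b, size):
    """Product of two coefficient lists, truncated to `size` terms."""
    out = [Fraction(0)] * size
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j < size:
                out[i + j] += ai * bj
    return out

def series_div(num, den, size):
    """Power series num / den up to z**(size - 1); requires den[0] == 1."""
    out = []
    for n in range(size):
        out.append(num[n] - sum(den[j] * out[n - j] for j in range(1, n + 1)))
    return out

def moment_series(k, p, n_max):
    """Coefficients of H_k and G_k up to z**n_max via (12)-(14); needs n_max >= k."""
    size = n_max + 1
    one_minus_z = [Fraction(1), Fraction(-1)] + [Fraction(0)] * (size - 2)
    words = ["".join(b) for b in product("01", repeat=k)]
    S = {(u, v): corr_poly(u, v, p, size) for u in words for v in words}
    absent = {}                                # series of S_w(z) / D_w(z)
    for w in words:
        d = mul(one_minus_z, S[w, w], size)
        d[k] += word_prob(w, p)
        absent[w] = series_div(S[w, w], d, size)
    # [z^n] 1/(1 - z) = 1 for every n.
    H = [sum(1 - absent[w][n] for w in words) for n in range(size)]
    G = [Fraction(0)] * size
    for u in words:
        for v in words:
            if u == v:
                continue
            prod_auto = mul(S[u, u], S[v, v], size)
            prod_cross = mul(S[u, v], S[v, u], size)
            det = [a - b for a, b in zip(prod_auto, prod_cross)]
            d = mul(one_minus_z, det, size)    # first line of (14)
            for j in range(size - k):          # z^k times the bracket in (14)
                d[j + k] += (word_prob(u, p) * (S[v, v][j] - S[u, v][j])
                             + word_prob(v, p) * (S[u, u][j] - S[v, u][j]))
            neither = series_div(det, d, size)
            for n in range(size):
                G[n] += 1 - absent[u][n] - absent[v][n] + neither[n]
    return H, G

def brute_force_moments(k, p, n):
    """E[X] and E[X(X-1)] for X = number of distinct length-k subwords."""
    m1 = m2 = Fraction(0)
    for bits in product("01", repeat=n):
        text = "".join(bits)
        x = len({text[i:i + k] for i in range(n - k + 1)})
        m1 += x * word_prob(text, p)
        m2 += x * (x - 1) * word_prob(text, p)
    return m1, m2

def agrees_with_enumeration(k, p, n_max):
    """True if both series match the enumeration for every n <= n_max."""
    H, G = moment_series(k, p, n_max)
    return all((H[n], G[n]) == brute_force_moments(k, p, n)
               for n in range(n_max + 1))

print(agrees_with_enumeration(2, Fraction(1, 3), 6))
```

Exact fractions are used so that agreement is tested by equality rather than within a floating-point tolerance. The enumeration grows as $2^n$, so the check is only practical for short texts.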
