**Central Approximation.**

Over the main path, the integrals are of the form

$$\begin{split} S\_j^{(0)}(z) &= -\frac{1}{2\pi} \int\_{|t-t\_j| \le k^{-2/5}} \Gamma(r\_0 + it) z^{-r\_0 - it} (p^{-r\_0 - it} + q^{-r\_0 - it})^k dt \\ &= -\frac{1}{2\pi} \int\_{|t-t\_j| \le k^{-2/5}} \Gamma(r\_0 + it) e^{-kh(t)} dt. \end{split}$$

We have

$$h''(t\_j) = \frac{\log^2 p/q}{((p/q)^{-r\_0/2} + (p/q)^{r\_0/2})^2},\tag{124}$$

and

$$p^{-r\_0 - it\_j} + q^{-r\_0 - it\_j} = p^{-it\_j}(p^{-r\_0} + q^{-r\_0}).\tag{125}$$
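As a quick numerical sanity check of (124) and (125), the sketch below evaluates both sides at the points $t\_j = 2\pi j/\log(p/q)$. It assumes (the definition of $h$ lies outside this passage) that $h(t)$ differs from $-\log(p^{-r\_0-it}+q^{-r\_0-it})$ only by a term linear in $t$, so that $h''$ depends on $g(t) = p^{-r\_0-it}+q^{-r\_0-it}$ alone. The function names and the values of `p` and `r0` are illustrative choices, not taken from the paper.

```python
import cmath
import math

def h_second_derivative(p, r0, t):
    """h''(t) for h(t) = -log g(t) + (linear term), g(t) = p^(-r0-it) + q^(-r0-it)."""
    q = 1.0 - p
    lp, lq = math.log(p), math.log(q)
    P = cmath.exp(-(r0 + 1j * t) * lp)    # p^(-r0-it)
    Q = cmath.exp(-(r0 + 1j * t) * lq)    # q^(-r0-it)
    g = P + Q
    dg = -1j * (lp * P + lq * Q)          # g'(t)
    d2g = -(lp * lp * P + lq * lq * Q)    # g''(t)
    return -(d2g * g - dg * dg) / (g * g)  # -(log g)''

def closed_form_124(p, r0):
    """Right-hand side of (124)."""
    ratio = p / (1.0 - p)
    return math.log(ratio) ** 2 / (ratio ** (-r0 / 2) + ratio ** (r0 / 2)) ** 2

def check(p=0.7, r0=-0.4, js=(-2, -1, 0, 1, 2)):
    """Return (j, error in (124), error in (125)) for each saddle index j."""
    q = 1.0 - p
    results = []
    for j in js:
        tj = 2 * math.pi * j / math.log(p / q)
        lhs125 = p ** (-r0 - 1j * tj) + q ** (-r0 - 1j * tj)
        rhs125 = p ** (-1j * tj) * (p ** (-r0) + q ** (-r0))
        err124 = abs(h_second_derivative(p, r0, tj) - closed_form_124(p, r0))
        err125 = abs(lhs125 - rhs125)
        results.append((j, err124, err125))
    return results
```

Both errors should be at the level of floating-point rounding for every $j$, reflecting that $(p/q)^{-it\_j} = 1$ at each saddle point.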

Therefore, by Laplace's theorem (see [22]), we obtain

$$\begin{split} S\_j^{(0)}(z) &= \frac{1}{\sqrt{2\pi k h''(t\_j)}} \Gamma(r\_0 + it\_j) e^{-kh(t\_j)} (1 + O(k^{-1/2})) \\ &= \frac{(p/q)^{-r\_0/2} + (p/q)^{r\_0/2}}{\sqrt{2\pi} \log p/q} \\ &\quad \times z^{-r\_0} (p^{-r\_0} + q^{-r\_0})^k \Gamma(r\_0 + it\_j) z^{-it\_j} p^{-ikt\_j} k^{-1/2} \left( 1 + O\left(\frac{1}{\sqrt{k}}\right) \right) . \end{split} \tag{126}$$

We finally sum over all *j* (|*j*| < *j*<sup>∗</sup>), and we get

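$$\begin{split} E\_{k}(z) &= \frac{(p/q)^{-r\_{0}/2} + (p/q)^{r\_{0}/2}}{\sqrt{2\pi}\log p/q} \\ &\times \sum\_{|j| < j^\*} z^{-r\_0 - it\_j} (p^{-r\_0} + q^{-r\_0})^k \Gamma(r\_0 + it\_j) p^{-ikt\_j} k^{-1/2} \left( 1 + O\left(\frac{1}{\sqrt{k}}\right) \right). \end{split} \tag{127}$$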

We can rewrite $\tilde{E}\_k(z)$ as

$$E\_k(z) = \Phi\_1((1 + a \log p) \log\_{p/q} n) \frac{z^\nu}{\sqrt{\log n}} \left( 1 + O\left(\frac{1}{\sqrt{\log n}}\right) \right),\tag{128}$$

where $\nu = -r\_0 + a \log(p^{-r\_0} + q^{-r\_0})$, and

$$\Phi\_1(x) = \frac{(p/q)^{-r\_0/2} + (p/q)^{r\_0/2}}{\sqrt{2a\pi}\log p/q} \sum\_{|j| < j^\*} \Gamma(r\_0 + it\_j) e^{-2\pi i jx}.\tag{129}$$

For part *ii*, we move the line of integration to $r\_0 \in (0, \infty)$. Note that in this range, we must account for the contribution of the pole at $s = 0$. We have

$$E\_k(z) = \text{Res}\_{s=0} \mathcal{E}\_k^\*(s) z^{-s} + \int\_{r\_0 - i\infty}^{r\_0 + i\infty} \mathcal{E}\_k^\*(s) z^{-s} ds. \tag{130}$$

Computing the residue at *s* = 0, and following the same analysis as in part *i* for the above integral, we arrive at

$$E\_k(z) = 2^k - \Phi\_1((1 + a \log p) \log\_{p/q} n) \frac{z^\nu}{\sqrt{\log n}} \left( 1 + O\left(\frac{1}{\sqrt{\log n}}\right) \right). \tag{131}$$

For part *iii* of Theorem 3, we shift the line of integration to $c\_0 \in (-2, -1)$; then we have

$$\begin{split} E\_k(z) &= \text{Res}\_{s=-1} E\_k^\*(s) z^{-s} + \int\_{c\_0-i\infty}^{c\_0+i\infty} E\_k^\*(s) z^{-s} ds \\ &= z + O\left( z^{-c\_0} (p^{-c\_0} + q^{-c\_0})^k \right) \\ &= z^{a\log 2} + O(z^{\nu\_0}), \end{split} \tag{132}$$

where $\nu\_0 = -c\_0 + a \log(p^{-c\_0} + q^{-c\_0}) < 1$.
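The two extreme regimes, a leading term of $2^k$ for short words and of $z$ for long words, can be observed directly from the exact Poisson transform. The sketch below assumes $\tilde{E}\_k(z) = \sum\_{w \in \mathcal{A}^k} (1 - e^{-\mathbf{P}(w)z})$ over a binary memoryless source, so that a word with $i$ occurrences of the first symbol has probability $p^i q^{k-i}$. The function name and the parameter values are illustrative only.

```python
import math

def poisson_mean(k, z, p):
    """Exact value of the sum over w in {0,1}^k of (1 - exp(-P(w) z)).

    Words are grouped by the number i of symbols with probability p,
    so only k + 1 terms are needed instead of 2^k.
    """
    q = 1.0 - p
    total = 0.0
    for i in range(k + 1):
        prob = p ** i * q ** (k - i)
        # -expm1(-x) equals 1 - exp(-x) and stays accurate when x is tiny.
        total -= math.comb(k, i) * math.expm1(-prob * z)
    return total

small_k = poisson_mean(k=3, z=1e4, p=0.6)    # short words: close to 2**3
large_k = poisson_mean(k=60, z=50.0, p=0.6)  # long words: close to z
```

With these values, `small_k` is essentially $2^3$ because every word is almost surely seen, while `large_k` is essentially $z$ because almost every observed prefix is distinct.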

#### **Step four: Asymptotic depoissonization**

To show that both conditions in (15) hold for $\tilde{E}\_k(z)$, we extend the real values $z$ to complex values $z = ne^{i\theta}$, where $|\theta| < \pi/2$. To prove (103), we note that

$$|e^{-i\theta(r\_0+it)}\Gamma(r\_0+it)| = O(|t|^{r\_0-1/2}e^{t\theta-\pi|t|/2}),\tag{133}$$

and therefore

$$\tilde{E}\_k(ne^{i\theta}) = \frac{1}{2\pi} \int\_{-\infty}^{\infty} e^{-i\theta(r\_0+it)} n^{-r\_0-it} \Gamma(r\_0+it) (p^{-r\_0-it} + q^{-r\_0-it})^k dt \tag{134}$$

is absolutely convergent for $|\theta| < \pi/2$. The same saddle point analysis applies here and we obtain

$$|\tilde{E}\_k(z)| \le B \frac{|z^\nu|}{\sqrt{\log n}}, \tag{135}$$

where $B = |\Phi\_1((1 + a \log p) \log\_{p/q} n)|$, and $\nu$ is as in (128). Condition (103) is therefore satisfied. To prove condition (104), we see that for a fixed $k$,

$$\begin{split} |\mathcal{E}\_k(z)e^z| &\leq \sum\_{w\in\mathcal{A}^k} |e^z - e^{z(1-\mathbb{P}(w))}| \\ &\leq 2^{k+1} e^{|z|\cos(\theta)}. \end{split} \tag{136}$$

Therefore, we have

$$\mathbb{E}[\hat{X}\_{n,k}] = \tilde{E}\_k(n) + O\left(\frac{n^{\nu-1}}{\sqrt{\log n}}\right). \tag{137}$$

This completes the proof of Theorem 3.
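For small $k$, the bound (136) used in the second depoissonization condition can be checked by brute force. The following sketch enumerates all words of $\{0,1\}^k$ under a binary memoryless source with symbol probabilities $p$ and $1-p$ (an assumption about the model, which is defined earlier in the paper) and compares the left-hand side with $2^{k+1} e^{|z|\cos\theta}$ at a few sample points of the cone $|\theta| < \pi/2$. All function names are hypothetical.

```python
import cmath
import itertools
import math

def word_probs(k, p):
    """Probabilities P(w) of all 2^k binary words of length k."""
    q = 1.0 - p
    return [p ** sum(w) * q ** (k - sum(w))
            for w in itertools.product((0, 1), repeat=k)]

def lhs_136(k, p, z):
    """|E_k(z) e^z| with E_k(z) = sum over w of (1 - exp(-P(w) z))."""
    return abs(sum(cmath.exp(z) - cmath.exp(z * (1 - pw))
                   for pw in word_probs(k, p)))

def rhs_136(k, z):
    """2^(k+1) exp(|z| cos(theta)); note that Re z = |z| cos(theta)."""
    return 2 ** (k + 1) * math.exp(z.real)

def bound_holds(k=4, p=0.6, radii=(0.5, 5.0, 50.0),
                thetas=(-1.5, -0.7, 0.0, 0.7, 1.5)):
    """Check the inequality on a small grid of z = r e^{i theta}."""
    return all(lhs_136(k, p, cmath.rect(r, th)) <= rhs_136(k, cmath.rect(r, th))
               for r in radii for th in thetas)
```

The check is only a spot test on a finite grid; the inequality itself follows from the triangle inequality applied to the $2^k$ summands.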

**On the Second Factorial Moment:** We poissonize the sequence $(\mathbb{E}[(\hat{X}\_{n,k})\_2])\_{n \geq 0}$ as well. By the analysis in (27),

$$\mathbb{E}[(\hat{X}\_{n,k})\_2] = \sum\_{\substack{w,w' \in \mathcal{A}^k \\ w \neq w'}} \left(1 - (1 - \mathbf{P}(w))^n - (1 - \mathbf{P}(w'))^n + (1 - \mathbf{P}(w) - \mathbf{P}(w'))^n\right),$$

which gives the following poissonized form

$$\begin{split} \tilde{G}\_k(z) &= \sum\_{n \geq 0} \mathbb{E}[(\hat{X}\_{n,k})\_2] \frac{z^n}{n!} e^{-z} \\ &= \sum\_{\substack{w,w' \in \mathcal{A}^k \\ w \neq w'}} \left(1 - e^{-\mathbf{P}(w)z} - e^{-\mathbf{P}(w')z} + e^{-(\mathbf{P}(w) + \mathbf{P}(w'))z}\right) \\ &= \sum\_{\substack{w,w' \in \mathcal{A}^k \\ w \neq w'}} \left(1 - e^{-\mathbf{P}(w')z}\right) \left(1 - e^{-\mathbf{P}(w)z}\right) \\ &= \left(\sum\_{w \in \mathcal{A}^k} \left(1 - e^{-\mathbf{P}(w)z}\right)\right)^2 - \sum\_{w \in \mathcal{A}^k} \left(1 - e^{-\mathbf{P}(w)z}\right)^2 \\ &= \left(\tilde{E}\_k(z)\right)^2 - \sum\_{w \in \mathcal{A}^k} \left(1 - e^{-\mathbf{P}(w)z}\right)^2 \\ &= \left(\tilde{E}\_k(z)\right)^2 - \sum\_{w \in \mathcal{A}^k} \left(1 - 2e^{-\mathbf{P}(w)z} + e^{-2\mathbf{P}(w)z}\right). \end{split} \tag{138}$$
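The algebra in (138) can be confirmed numerically for a small word length. This sketch assumes a binary memoryless source and compares the double sum over ordered pairs $w \neq w'$ with the square of the single sum minus the diagonal terms. The function names and test values are illustrative.

```python
import itertools
import math

def word_probs(k, p):
    """Probabilities P(w) of all 2^k binary words of length k."""
    q = 1.0 - p
    return [p ** sum(w) * q ** (k - sum(w))
            for w in itertools.product((0, 1), repeat=k)]

def factorial_moment_transform(k, p, z):
    """Double sum over ordered pairs of distinct words, as in line 3 of (138)."""
    probs = word_probs(k, p)
    return sum((1 - math.exp(-pa * z)) * (1 - math.exp(-pb * z))
               for a, pa in enumerate(probs)
               for b, pb in enumerate(probs) if a != b)

def square_minus_diagonal(k, p, z):
    """Square of the single sum minus the diagonal sum, as in the last line."""
    probs = word_probs(k, p)
    e_k = sum(1 - math.exp(-pw * z) for pw in probs)
    l_k = sum(1 - 2 * math.exp(-pw * z) + math.exp(-2 * pw * z) for pw in probs)
    return e_k ** 2 - l_k
```

The two functions agree up to rounding for any $z > 0$, since the diagonal terms $w = w'$ are exactly what the square of the single sum adds to the pair sum.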

We show that in all ranges of $a$ the leftover sum in (138) has a lower order contribution to $\tilde{G}\_k(z)$ compared to $(\tilde{E}\_k(z))^2$. We define

$$\tilde{L}\_k(z) = \sum\_{w \in \mathcal{A}^k} \left( 1 - 2e^{-\mathbf{P}(w)z} + e^{-2\mathbf{P}(w)z} \right). \tag{139}$$

In the first range for $k$, we take the Mellin transform of $\tilde{L}\_k(z)$, which is

$$\begin{split} L\_k^\*(s) &= -2\Gamma(s) \sum\_{w \in \mathcal{A}^k} \mathbf{P}(w)^{-s} + \Gamma(s) \sum\_{w \in \mathcal{A}^k} (2\mathbf{P}(w))^{-s} \\ &= -2\Gamma(s)(p^{-s} + q^{-s})^k + \Gamma(s)2^{-s}(p^{-s} + q^{-s})^k \\ &= 2\Gamma(s)(p^{-s} + q^{-s})^k(2^{-s-1} - 1), \end{split} \tag{140}$$

and we note that the fundamental strip of this Mellin transform is $\langle -2, 0 \rangle$ as well. The inverse Mellin transform for $c \in (-2, 0)$ is

$$\begin{split} \tilde{L}\_{k}(z) &= \frac{1}{2\pi i} \int\_{c-i\infty}^{c+i\infty} L\_{k}^{\*}(s) z^{-s} ds \\ &= \frac{1}{\pi i} \int\_{c-i\infty}^{c+i\infty} \Gamma(s) (p^{-s} + q^{-s})^{k} (2^{-s-1} - 1) z^{-s} ds. \end{split} \tag{141}$$
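The passage from the first to the second line of (140) relies on the sum over words factorizing, $\sum\_{w \in \mathcal{A}^k} \mathbf{P}(w)^{-s} = (p^{-s} + q^{-s})^k$. A brute-force check at an arbitrary complex $s$, assuming a binary memoryless source (function names are illustrative):

```python
import itertools

def word_probs(k, p):
    """Probabilities P(w) of all 2^k binary words of length k."""
    q = 1.0 - p
    return [p ** sum(w) * q ** (k - sum(w))
            for w in itertools.product((0, 1), repeat=k)]

def sum_over_words(k, p, s):
    """Direct evaluation of the sum over w of P(w)^(-s)."""
    return sum(pw ** (-s) for pw in word_probs(k, p))

def closed_form(k, p, s):
    """(p^(-s) + q^(-s))^k, the factorized form used in (140)."""
    q = 1.0 - p
    return (p ** (-s) + q ** (-s)) ** k

def gamma_free_factor(k, p, s):
    """First line of (140) divided by Gamma(s)."""
    probs = word_probs(k, p)
    return (-2 * sum(pw ** (-s) for pw in probs)
            + sum((2 * pw) ** (-s) for pw in probs))
```

Since every base is a positive real number, Python's complex power uses the principal logarithm consistently, and `gamma_free_factor(k, p, s)` should match `(2 ** (-s) - 2) * closed_form(k, p, s)` up to rounding.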

We note that this range of $r\_0$ corresponds to

$$\frac{2}{\log q^{-1} + \log p^{-1}} < a < \frac{p^2 + q^2}{q^2 \log q^{-1} + p^2 \log p^{-1}}.\tag{142}$$
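The endpoints in (142) can be recovered by evaluating the saddle point relation at $r\_0 = 0$ and $r\_0 = -2$. The sketch below assumes that this relation reads $1/a = \frac{p^{-r\_0}\log p^{-1} + q^{-r\_0}\log q^{-1}}{p^{-r\_0} + q^{-r\_0}}$, which is what one obtains by setting the derivative of $s \log z - k \log(p^{-s} + q^{-s})$ to zero with $k = a \log z$. Equation (109) itself lies outside this passage, so the formula should be read as a reconstruction. The helper names and the bisection bracket are hypothetical choices.

```python
import math

def a_of_r(r, p):
    """Value of a for which the real saddle point sits at r (assumed relation)."""
    q = 1.0 - p
    wp, wq = p ** (-r), q ** (-r)
    return (wp + wq) / (wp * math.log(1 / p) + wq * math.log(1 / q))

def r_of_a(a, p, lo=-50.0, hi=50.0, iters=200):
    """Invert a_of_r by bisection; a_of_r is monotone in r when p != 1/2."""
    decreasing = a_of_r(lo, p) > a_of_r(hi, p)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (a_of_r(mid, p) > a) == decreasing:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def nu(a, p):
    """Exponent nu = -r0 + a * log(p^(-r0) + q^(-r0)) at the saddle point r0."""
    q = 1.0 - p
    r0 = r_of_a(a, p)
    return -r0 + a * math.log(p ** (-r0) + q ** (-r0))
```

Under this assumption, `a_of_r(0.0, p)` and `a_of_r(-2.0, p)` reproduce the left and right endpoints of (142), and the limits $r\_0 \to \pm\infty$ give $1/\log q^{-1}$ and $1/\log p^{-1}$ when $p > q$.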

The integrand in (141) is quite similar to the one seen in (107). The only difference is the extra factor $2^{-s-1} - 1$. However, we notice that $2^{-s-1} - 1$ is analytic and bounded. Thus, we obtain the same saddle points, with the same real part as in (109) and the same imaginary parts of the form $\frac{2\pi i j}{\log p/q}$, $j \in \mathbb{Z}$. Thus, the same saddle point analysis as for the integral in (107) applies to $\tilde{L}\_k(z)$ as well. We avoid repeating similar steps and skip to the central approximation, where by Laplace's theorem (see [22]), we get

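$$\begin{split} \tilde{L}\_{k}(z) &= \frac{(p/q)^{-r\_{0}/2} + (p/q)^{r\_{0}/2}}{\sqrt{2\pi}\log p/q} \\ &\times \sum\_{|j| < j^\*} z^{-r\_0 - it\_j} (p^{-r\_0} + q^{-r\_0})^k (2^{-r\_0 - 1 - it\_j} - 1) \Gamma(r\_0 + it\_j) p^{-ikt\_j} k^{-1/2} \left( 1 + O\left(\frac{1}{\sqrt{k}}\right) \right), \end{split} \tag{143}$$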

which can be represented as

$$\tilde{L}\_k(z) = \Phi\_2((1 + a \log p) \log\_{p/q} n) \frac{z^\nu}{\sqrt{\log n}} \left( 1 + O\left(\frac{1}{\sqrt{\log n}}\right) \right),\tag{144}$$

where

$$\Phi\_2(x) = \frac{(p/q)^{-r\_0/2} + (p/q)^{r\_0/2}}{\sqrt{2a\pi}\log p/q} \sum\_{|j| < j^\*} (2^{-r\_0 - 1 - it\_j} - 1)\Gamma(r\_0 + it\_j)e^{-2\pi i jx}.\tag{145}$$

This shows that $\tilde{L}\_k(z) = O\left(\frac{z^\nu}{\sqrt{\log n}}\right)$ when $\frac{2}{\log q^{-1} + \log p^{-1}} < a < \frac{p^2 + q^2}{q^2 \log q^{-1} + p^2 \log p^{-1}}$. Subsequently, for $\frac{1}{\log q^{-1}} < a < \frac{2}{\log q^{-1} + \log p^{-1}}$, we get

$$\tilde{L}\_k(z) = 2^k - \Phi\_2((1 + a \log p) \log\_{p/q} n) \frac{z^\nu}{\sqrt{\log n}} \left( 1 + O\left(\frac{1}{\sqrt{\log n}}\right) \right), \tag{146}$$

and for $\frac{p^2 + q^2}{q^2 \log q^{-1} + p^2 \log p^{-1}} < a < \frac{1}{\log p^{-1}}$, we get

$$\tilde{L}\_k(z) = O(n^2). \tag{147}$$

It is not difficult to see that for each range of $a$ as stated above, $\tilde{L}\_k(z)$ has a lower order contribution to the asymptotic expansion of $\tilde{G}\_k(z)$, compared to $(\tilde{E}\_k(z))^2$. This leads us to Theorem 4, which is proved below.

**Proof of Theorem 4.** It only remains to show that the two depoissonization conditions hold. For condition (103) in Theorem 15, from (135) we have

$$|\tilde{G}\_k(z)| \le B^2 \frac{|z^{2\nu}|}{\log n}, \tag{148}$$

and for condition (104), we have, for fixed *k*,

$$\begin{split} \left| \tilde{G}\_{k}(z)e^{z} \right| &\leq \sum\_{\substack{w, w' \in \mathcal{A}^{k} \\ w \neq w'}} \left| e^{z} - e^{(1 - \mathbf{P}(w))z} - e^{(1 - \mathbf{P}(w'))z} + e^{(1 - (\mathbf{P}(w) + \mathbf{P}(w')))z} \right| \\ &\leq 4^{k+1} e^{|z| \cos \theta}. \end{split} \tag{149}$$

Therefore both depoissonization conditions are satisfied and the desired result follows.

#### **Corollary. A Remark on the Second Moment and the Variance**

For the second moment we have

$$\begin{split} \mathbb{E}\left[\left(\hat{X}\_{n,k}\right)^{2}\right] &= \sum\_{\substack{w,w' \in \mathcal{A}^{k} \\ w \neq w'}} \mathbb{E}\left[\hat{X}\_{n,k}^{(w)} \hat{X}\_{n,k}^{(w')}\right] + \sum\_{w \in \mathcal{A}^{k}} \mathbb{E}\left[\hat{X}\_{n,k}^{(w)}\right] \\ &= \sum\_{\substack{w,w' \in \mathcal{A}^{k} \\ w \neq w'}} \left(1 - (1 - \mathbf{P}(w))^{n} - (1 - \mathbf{P}(w'))^{n} + (1 - \mathbf{P}(w) - \mathbf{P}(w'))^{n}\right) \\ &\quad + \sum\_{w \in \mathcal{A}^{k}} \left(1 - (1 - \mathbf{P}(w))^{n}\right). \end{split} \tag{150}$$
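Identity (150) can be verified exactly for tiny parameters. The sketch below models the trie setting as $n$ independent draws of a length-$k$ prefix, each prefix $w$ having probability $\mathbf{P}(w)$ under a binary memoryless source (an assumption about the model), enumerates every outcome, and compares the resulting second moment of the number of distinct prefixes with the right-hand side of (150). All names are illustrative.

```python
import itertools
import math

def word_probs(k, p):
    """Probabilities P(w) of all 2^k binary words of length k."""
    q = 1.0 - p
    return [p ** sum(w) * q ** (k - sum(w))
            for w in itertools.product((0, 1), repeat=k)]

def second_moment_by_enumeration(n, k, p):
    """E[X^2], where X counts distinct prefixes among n independent draws."""
    probs = word_probs(k, p)
    total = 0.0
    # Each outcome assigns one of the 2^k prefixes to each of the n strings.
    for outcome in itertools.product(range(len(probs)), repeat=n):
        weight = math.prod(probs[i] for i in outcome)
        distinct = len(set(outcome))
        total += weight * distinct ** 2
    return total

def second_moment_by_formula(n, k, p):
    """Right-hand side of (150)."""
    probs = word_probs(k, p)
    off_diag = sum(1 - (1 - pa) ** n - (1 - pb) ** n + (1 - pa - pb) ** n
                   for a, pa in enumerate(probs)
                   for b, pb in enumerate(probs) if a != b)
    diag = sum(1 - (1 - pw) ** n for pw in probs)
    return off_diag + diag
```

The enumeration has $2^{kn}$ outcomes, so it is only practical for very small $n$ and $k$, for example $n = 4$ and $k = 2$.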

Therefore, by (105) and (138), the Poisson transform of the second moment, which we denote by $\tilde{G}\_k^{(2)}(z)$, is

$$\tilde{G}\_k^{(2)}(z) = (\tilde{E}\_k(z))^2 + \tilde{E}\_k(z) - \sum\_{w \in \mathcal{A}^k} \left( 1 - 2e^{-\mathbf{P}(w)z} + e^{-2\mathbf{P}(w)z} \right),\tag{151}$$

which results in the same first order asymptotic as the second factorial moment. Also, it is not difficult to extend the proof in Chapter 6 to show that the second moments of the two models are asymptotically the same. For the variance we have

$$\begin{split} \text{Var}\left[\hat{X}\_{n,k}\right] &= \mathbb{E}\left[\left(\hat{X}\_{n,k}\right)^{2}\right] - \left(\mathbb{E}\left[\hat{X}\_{n,k}\right]\right)^{2} \\ &= \sum\_{\begin{subarray}{c}w,w'\in\mathcal{A}^{k}\\w\neq w' \end{subarray}} \left(1 - \left(1 - \mathbf{P}(w)\right)^{n} - \left(1 - \mathbf{P}(w')\right)^{n} + \left(1 - \mathbf{P}(w) - \mathbf{P}(w')\right)^{n}\right) \\ &\quad + \sum\_{w\in\mathcal{A}^{k}} \left(1 - \left(1 - \mathbf{P}(w)\right)^{n}\right) \\ &\quad - \sum\_{\begin{subarray}{c}w,w'\in\mathcal{A}^{k}\\w\neq w' \end{subarray}} \left(1 - \left(1 - \mathbf{P}(w)\right)^{n} - \left(1 - \mathbf{P}(w')\right)^{n} + \left(1 - \mathbf{P}(w) - \mathbf{P}(w')\right)^{n}\right) \\ &\quad - \sum\_{w\in\mathcal{A}^{k}} \left(1 - \left(1 - \mathbf{P}(w)\right)^{n} - \left(1 - \mathbf{P}(w)\right)^{n} + \left(1 - \mathbf{P}(w)\right)^{2n}\right) \\ &= \sum\_{w\in\mathcal{A}^{k}} \left(\left(1 - \mathbf{P}(w)\right)^{n} - \left(1 - \mathbf{P}(w)\right)^{2n}\right). \end{split} \tag{152}$$

Therefore the Poisson transform, which we denote by $\tilde{G}\_k^{\text{var}}(z)$, is

$$\tilde{G}\_k^{\text{var}}(z) = \sum\_{w \in \mathcal{A}^k} \left( e^{-\mathbf{P}(w)z} - e^{-(2\mathbf{P}(w) - (\mathbf{P}(w))^2)z} \right). \tag{153}$$

The Mellin transform of the above function has the following form

$$\tilde{G}\_k^{\*\text{var}}(s) = \Gamma(s)(p^{-s} + q^{-s})^k(-1 + O(\mathbf{P}(w))).\tag{154}$$

This is quite similar to what we saw in (106), which indicates that the variance has the same asymptotic growth as the expected value. However, the variances of the two models do not behave in the same way (cf. Figure 2).
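The Poisson transform (153) is obtained from (152) termwise, using the rule that the sequence $a^n$ has Poisson transform $\sum\_{n \geq 0} a^n \frac{z^n}{n!} e^{-z} = e^{-(1-a)z}$. A short numerical check of this rule follows; the values of `P` and `z` are arbitrary illustrative choices.

```python
import math

def poisson_transform_of_powers(a, z, terms=200):
    """Truncated sum over n >= 0 of a^n z^n / n!, multiplied by exp(-z)."""
    total, term = 0.0, 1.0   # term holds (a z)^n / n!
    for n in range(terms):
        total += term
        term *= a * z / (n + 1)
    return total * math.exp(-z)

def closed_form(a, z):
    """exp(-(1 - a) z), the claimed Poisson transform of a^n."""
    return math.exp(-(1.0 - a) * z)

P, z = 0.05, 12.0
single = (poisson_transform_of_powers(1 - P, z), closed_form(1 - P, z))
double = (poisson_transform_of_powers((1 - P) ** 2, z), closed_form((1 - P) ** 2, z))
```

For $a = 1 - P$ the rate is $1 - a = P$, and for $a = (1 - P)^2$ it is $1 - a = 2P - P^2$.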

## **4. Summary and Conclusions**

We studied the first-order asymptotic growth of the first two (factorial) moments of the *k*th Subword Complexity. We recall that the *k*th Subword Complexity of a string of length *n* is denoted by $X\_{n,k}$, and is defined as the number of distinct subwords of length *k* that appear in the string. We are interested in the asymptotic regime in which *k* grows as a function of the string's length. More specifically, we conduct the analysis for *k* = Θ(log *n*) as *n* → ∞.

The analysis is inspired by the earlier work of Jacquet and Szpankowski on the analysis of suffix trees, where they are compared to independent tries (cf. [14]). In our work, we compare the first two moments of the *k*th Subword Complexity to the *k*th Prefix Complexity over a random trie built over *n* independently generated binary strings. We recall that we define the *k*th Prefix Complexity as the number of distinct prefixes that appear in the trie at level *k* and lower.

We obtain the generating functions representing the expected value and the second factorial moments as their coefficients, in both settings. We prove that the first two moments have the same asymptotic growth in both models. For deriving the asymptotic behavior, we split the range for *k* into three intervals. We analyze each range using the saddle point method, in combination with residue analysis. We close our work with some remarks regarding the comparison of the second moment and the variance to the *k*th Prefix Complexity.

## **5. Future Challenges**

The endpoints of the intervals for *a* in Theorems 3 and 4 are not investigated in this work. The asymptotic analysis at these endpoints can be studied using the van der Waerden saddle point method [24].

The analogous results are not (yet) known in the case where the underlying probability source has Markovian dependence or in the case of dynamical sources.

**Author Contributions:** This paper is based on a Ph.D. dissertation conducted by L.A. under the supervision of M.D.W. All authors have read and agreed to the published version of the manuscript.

**Funding:** M.D. Ward's research is supported by FFAR Grant 534662, by the USDA NIFA Food and Agriculture Cyberinformatics and Tools (FACT) initiative, by NSF Grant DMS-1246818, by the NSF Science & Technology Center for Science of Information Grant CCF-0939370, and by the Society of Actuaries.

**Acknowledgments:** The authors thank Wojciech Szpankowski and Mireille Régnier for insightful conversations on this topic.

**Conflicts of Interest:** The authors declare no conflict of interest.
