*1.1. Part I*

This part of the analysis is inspired by the earlier work of Jacquet and Szpankowski [14] on the analysis of suffix trees by comparing them to independent tries. A trie, first introduced by René de la Briandais in 1959 (see [15]), is a search tree that stores *n* strings according to their prefixes. A suffix tree, introduced by Weiner in 1973 (see [16]), is a trie in which the strings are suffixes of a given string. Examples of both data structures are given in Figure 1.

**Figure 1.** The suffix tree in (**a**) is built over the first four suffixes of string *X* = 101110..., and the trie in (**b**) is built over strings *X*<sub>1</sub> = 111..., *X*<sub>2</sub> = 101..., *X*<sub>3</sub> = 100, and *X*<sub>4</sub> = 010....
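As a rough illustration of the two structures, the following Python sketch builds a trie as nested dictionaries and a suffix trie as a trie over suffixes. It is a simplification, not the construction used in the cited works: every string is stored along its full path (no path compression, no stopping once a string is isolated), and the finite strings in the usage lines are hypothetical truncations of the infinite strings in Figure 1. The names `build_trie`, `suffix_trie`, and `nodes_at_level` are chosen here for illustration only.

```python
def build_trie(strings):
    """Build an uncompressed trie as nested dicts: one dict per node,
    keyed by the next character. Assumption: each string is stored
    along its full path."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root


def suffix_trie(x, count):
    """Trie over the first `count` suffixes of the finite string x."""
    return build_trie(x[i:] for i in range(count))


def nodes_at_level(trie, k):
    """Number of nodes at depth exactly k (the root is depth 0)."""
    level = [trie]
    for _ in range(k):
        level = [child for node in level for child in node.values()]
    return len(level)


# Hypothetical finite truncations of the strings in Figure 1.
independent = build_trie(["111", "101", "100", "010"])
suffixes = suffix_trie("101110", 4)
print([nodes_at_level(independent, k) for k in range(4)])
print([nodes_at_level(suffixes, k) for k in range(4)])
```

Nested dictionaries keep the sketch short; a production suffix tree would normally use path compression and a linear-time construction.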

A direct asymptotic analysis of the moments is a difficult task, as patterns in a string are not independent of one another. However, we note that each pattern in a string can be regarded as a prefix of a suffix of the string. Therefore, the number of distinct patterns of length *k* in a string is actually the number of nodes of the suffix tree at level *k* and lower. I. Gheorghiciuc and M. D. Ward [17] showed that the expected value of the *k*-th Subword Complexity of a Bernoulli string of length *n* is asymptotically comparable to the expected value of the number of nodes at level *k* of a trie built over *n* independent strings generated by a memoryless source.
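The correspondence between patterns and prefixes of suffixes can be checked numerically for very small parameters. The sketch below is only an illustration under stated assumptions: "level *k*" is read as depth exactly *k*, the source is binary with `p` the probability of the symbol 1, and a trie node for a word *w* is assumed to exist at depth *k* whenever at least one of the *n* independent strings begins with *w* (under a convention where branching stops once a string is isolated, the count would differ). All function names are hypothetical.

```python
from itertools import product


def subword_complexity(s, k):
    """Number of distinct substrings of length k in s."""
    return len({s[i:i + k] for i in range(len(s) - k + 1)})


def suffix_prefix_count(s, k):
    """Number of distinct length-k prefixes of suffixes of s, i.e. the
    nodes at depth k of an uncompressed suffix trie of s."""
    return len({s[i:i + k] for i in range(len(s)) if len(s) - i >= k})


def word_prob(w, p):
    """Probability of the binary word w when each symbol is 1 with
    probability p, independently (memoryless source)."""
    ones = w.count("1")
    return p ** ones * (1 - p) ** (len(w) - ones)


def expected_complexity_exact(n, k, p):
    """Exact E[k-th subword complexity] by enumerating all 2^n strings.
    Only feasible for small n."""
    return sum(
        word_prob(s, p) * subword_complexity(s, k)
        for s in ("".join(bits) for bits in product("01", repeat=n))
    )


def expected_trie_nodes(n, k, p):
    """E[nodes at depth k] for n independent strings, assuming a node
    for w exists iff at least one string starts with w."""
    return sum(
        1 - (1 - word_prob("".join(w), p)) ** n
        for w in product("01", repeat=k)
    )


print(subword_complexity("101110", 2), suffix_prefix_count("101110", 2))
print(expected_complexity_exact(10, 3, 0.5), expected_trie_nodes(10, 3, 0.5))
```

For such small *n* the two expectations need not be close; the comparison in the text is asymptotic, and a string of length *n* contains only *n* − *k* + 1 windows of length *k*.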

We extend this analysis to the desired range for *k*, and we prove that the result holds when *k* grows logarithmically with *n*. Additionally, we show that, asymptotically, the second factorial moment of the *k*-th Subword Complexity can also be estimated by adopting the same independent model generated by a memoryless source. The proof of this theorem relies heavily on the characterization of the overlaps of the patterns with themselves and with one another. Autocorrelation and correlation polynomials describe these overlaps explicitly, and their analytic properties are key to understanding repetitions of patterns in large Bernoulli strings. Together with Cauchy's integral formula (used to compare the generating functions in the two models) and the residue theorem, this establishes that the second factorial moment of the Subword Complexity behaves the same as in the independent model.
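To make the notion of overlap concrete, the sketch below computes the coefficients of a correlation polynomial under one common convention, which may differ in detail from the one used in the proofs: whenever the last *i* symbols of *w* equal the first *i* symbols of *u*, the polynomial receives the term P(remaining symbols of *u*) · *z*<sup>|*u*| − *i*</sup>. The autocorrelation polynomial is the special case *u* = *w*. The function names and the binary alphabet with parameter `p` (probability of the symbol 1) are assumptions made for illustration.

```python
def word_prob(w, p):
    """Probability of binary word w under a memoryless source where
    each symbol is 1 with probability p."""
    ones = w.count("1")
    return p ** ones * (1 - p) ** (len(w) - ones)


def correlation_coeffs(w, u, p):
    """Coefficient list c, where c[j] multiplies z^j.
    For every overlap length i with w[-i:] == u[:i], add P(u[i:])
    to the coefficient of z^(len(u) - i). Convention assumed here."""
    coeffs = [0.0] * len(u)
    for i in range(1, min(len(w), len(u)) + 1):
        if w[-i:] == u[:i]:
            coeffs[len(u) - i] += word_prob(u[i:], p)
    return coeffs


def autocorrelation_coeffs(w, p):
    """Autocorrelation polynomial of w: correlation of w with itself."""
    return correlation_coeffs(w, w, p)


print(autocorrelation_coeffs("101", 0.5))    # overlaps of 101 with itself
print(autocorrelation_coeffs("111", 0.5))    # every shift is an overlap
print(correlation_coeffs("10", "01", 0.5))   # 10 followed by an overlapping 01
```

A word such as 111 overlaps itself at every shift, so all of its coefficients are nonzero, whereas 101 overlaps itself only trivially and at shift 2.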

To make this comparison, we derive the generating functions of the first two factorial moments in both settings. In a 2012 paper [18], F. Bassino, J. Clément, and P. Nicodème provide a multivariate probability generating function *f*(*z*, *x*) for the number of occurrences of patterns in a finite Bernoulli string. That is, given a pattern *w*, the coefficient of the term *z*<sup>*n*</sup>*x*<sup>*m*</sup> in *f*(*z*, *x*) is the probability in the Bernoulli model that a random string of size *n* has exactly *m* occurrences of the pattern *w*. Following their technique, we derive the exact expression for the generating functions of the first two factorial moments of the *k*-th Subword Complexity. In the independent model, the generating functions are obtained from basic probability arguments.
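The meaning of a single coefficient of *f*(*z*, *x*) can be illustrated by brute force for small *n*. The sketch below does not implement the generating-function technique of [18]; it simply enumerates all binary strings of length *n* and tabulates the probability of each occurrence count, assuming occurrences may overlap and that `p` is the probability of the symbol 1. The function names are hypothetical.

```python
from itertools import product


def word_prob(w, p):
    """Probability of binary word w when each symbol is 1 with
    probability p, independently."""
    ones = w.count("1")
    return p ** ones * (1 - p) ** (len(w) - ones)


def count_occurrences(s, w):
    """Number of (possibly overlapping) occurrences of w in s."""
    return sum(s.startswith(w, i) for i in range(len(s) - len(w) + 1))


def occurrence_distribution(w, n, p):
    """Map m -> P(a random length-n string has exactly m occurrences
    of w). This is what the coefficient of z^n x^m is described as
    representing; computed here by exhaustive enumeration."""
    dist = {}
    for bits in product("01", repeat=n):
        s = "".join(bits)
        m = count_occurrences(s, w)
        dist[m] = dist.get(m, 0.0) + word_prob(s, p)
    return dict(sorted(dist.items()))


print(occurrence_distribution("11", 3, 0.5))
```

Enumeration costs 2<sup>*n*</sup> steps, so it is useful only as a sanity check on small cases.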
