**1. Introduction**

Analyzing and understanding occurrences of patterns in a character string helps extract useful information about the nature of the string. We classify strings as low-complexity or high-complexity according to their level of randomness. For instance, consider the binary string *X* = 10101010..., which is constructed by repeating the pattern *w* = 10. This string is periodic and therefore has low randomness. Such periodic strings are classified as low-complexity strings, whereas strings that exhibit no periodicity are considered to have high complexity. An effective way of measuring a string's randomness is to count all distinct patterns that appear as contiguous subwords in the string. This quantity is called the Subword Complexity. The name was coined by Ehrenfeucht, Lee, and Rozenberg [1]; the notion itself was introduced by Morse and Hedlund in 1938 [2]. The higher the Subword Complexity, the more complex the string is considered to be.
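The counting step above is straightforward to make concrete. The following is a minimal sketch (the function name is ours, not from the paper) that counts distinct contiguous subwords of a fixed length; it shows that the periodic string 1010...10 has at most 2 distinct subwords of each length, while a less regular string has more.

```python
def kth_subword_complexity(s, k):
    # Count the distinct contiguous subwords (factors) of length k in s.
    return len({s[i:i + k] for i in range(len(s) - k + 1)})

periodic = "10" * 8          # 1010101010101010, period 2
irregular = "1011001011"     # no short period

print([kth_subword_complexity(periodic, k) for k in range(1, 5)])
# -> [2, 2, 2, 2]: a string of period 2 never has more than 2 subwords per length
print([kth_subword_complexity(irregular, k) for k in range(1, 5)])
```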

Information about the distribution of the Subword Complexity enables us to better characterize strings, and to identify atypically random or periodic strings whose complexities lie far from the average [3]. This type of string classification has applications in fields such as data compression [4], genome analysis (see [5–9]), and plagiarism detection [10]. For example, in data compression, a data set is considered compressible if it has low complexity, as it consists of repeated subwords. In computational genomics, Subword Complexity (where subwords of length *k* are known as *k*-mers) is used to detect repeated sequences and in DNA barcoding [11,12]. *k*-mers are composed of the nucleotides A, T, G, and C. For instance, the number of distinct 7-mers of the DNA sequence GTAGAGCTGT is four, meaning that there are 4 distinct substrings of length 7 in the given sequence. Counting *k*-mers becomes challenging for longer DNA sequences. Our results extend easily to the alphabet {*A*, *T*, *G*, *C*} and apply directly to the theoretical analysis of genomic *k*-mer distributions under the Bernoulli probabilistic model, particularly as the length *n* of the sequence approaches infinity.
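The 7-mer count quoted above can be checked directly; a short sketch (function name ours) that enumerates the distinct *k*-mers of a DNA sequence:

```python
def count_kmers(seq, k):
    # Distinct k-mers = distinct contiguous substrings of length k.
    return len({seq[i:i + k] for i in range(len(seq) - k + 1)})

# GTAGAGCTGT has length 10, hence 10 - 7 + 1 = 4 windows of length 7,
# all of which happen to be distinct here.
print(count_kmers("GTAGAGCTGT", 7))  # -> 4
```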

There are two variants of the definition of the Subword Complexity: one counts all distinct subwords of a given string (also known as the Complexity Index or Sequence Complexity [13]), and the other counts only the distinct subwords of a given length, say *k*, that appear in the string. In this work, we analyze the latter, and we call it the *k*th Subword Complexity to avoid confusion.
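The distinction between the two variants is simply whether the count is taken over one length or summed over all lengths. A minimal sketch (function names ours) contrasting the two on the periodic string from the introduction:

```python
def kth_subword_complexity(s, k):
    # Variant analyzed in this work: distinct subwords of one fixed length k.
    return len({s[i:i + k] for i in range(len(s) - k + 1)})

def complexity_index(s):
    # Other variant (Complexity Index): distinct non-empty subwords
    # of every length, summed over k = 1, ..., len(s).
    return sum(kth_subword_complexity(s, k) for k in range(1, len(s) + 1))

s = "10101010"
print(kth_subword_complexity(s, 3))  # -> 2   (only 101 and 010 occur)
print(complexity_index(s))           # -> 15  (2 per length for k=1..7, plus s itself)
```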

Throughout this work, we consider the *k*th Subword Complexity of a random binary string of length *n* generated by a memoryless source, and we denote it by $X_{n,k}$. We analyze the first and second factorial moments of $X_{n,k}$ for the range $k = \Theta(\log n)$, as $n \to \infty$. More precisely, we divide the analysis into three ranges as follows.

$$i. \quad \frac{1}{\log q^{-1}} \log n < k < \frac{2}{\log q^{-1} + \log p^{-1}} \log n,$$

$$ii. \quad \frac{2}{\log q^{-1} + \log p^{-1}} \log n < k < \frac{1}{q \log q^{-1} + p \log p^{-1}} \log n, \text{ and}$$

$$iii. \quad \frac{1}{q \log q^{-1} + p \log p^{-1}} \log n < k < \frac{1}{\log p^{-1}} \log n.$$
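The three ranges are delimited by four constants multiplying $\log n$. A small numerical sketch (function name ours; we assume $p > q = 1 - p$, under which the four constants are strictly increasing and the ranges are non-empty — for the unbiased case $p = q = 1/2$ all four coincide at $1/\log 2$):

```python
import math

def range_boundaries(p):
    # The four constants c1 < c2 < c3 < c4 that multiply log n,
    # delimiting ranges i, ii, iii (assumes p > q = 1 - p).
    q = 1 - p
    lp, lq = math.log(1 / p), math.log(1 / q)
    c1 = 1 / lq
    c2 = 2 / (lq + lp)
    c3 = 1 / (q * lq + p * lp)   # reciprocal of the source entropy (in nats)
    c4 = 1 / lp
    return c1, c2, c3, c4

c1, c2, c3, c4 = range_boundaries(0.7)
print(c1 < c2 < c3 < c4)  # -> True: the three ranges tile (c1 log n, c4 log n)
```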

Our approach involves two major steps: first, we choose a suitable model for the asymptotic analysis; then, we derive the asymptotic expansions of the first two factorial moments.
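Before any asymptotic analysis, the two factorial moments $\mathbb{E}[X_{n,k}]$ and $\mathbb{E}[X_{n,k}(X_{n,k}-1)]$ can be estimated empirically under the Bernoulli model, which is useful as a sanity check against derived expansions. A minimal Monte Carlo sketch (function names, parameters, and trial counts are ours, not from the paper):

```python
import random

def kth_subword_complexity(s, k):
    # X_{n,k}: number of distinct contiguous subwords of length k in s.
    return len({s[i:i + k] for i in range(len(s) - k + 1)})

def factorial_moments(n, k, p, trials=2000, seed=1):
    # Monte Carlo estimates of E[X_{n,k}] and E[X_{n,k}(X_{n,k} - 1)]
    # for a memoryless binary source with P(1) = p, P(0) = q = 1 - p.
    rng = random.Random(seed)
    m1 = m2 = 0.0
    for _ in range(trials):
        s = "".join("1" if rng.random() < p else "0" for _ in range(n))
        x = kth_subword_complexity(s, k)
        m1 += x
        m2 += x * (x - 1)
    return m1 / trials, m2 / trials

# When k is well below log n / log 2, almost every string of length n
# contains all 2^k possible subwords, so E[X_{n,k}] is close to 2^k:
m1, m2 = factorial_moments(n=200, k=3, p=0.5, trials=500)
print(m1, m2)
```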
