1. Introduction
Random numbers play an important role in cryptography, gambling, Monte Carlo methods and many other applications. Nowadays, random numbers are generated by so-called random number generators (RNGs), and the "quality" of the generated numbers is evaluated using special statistical tests [1]. This problem is so important for applications that there are special standards for RNGs and for so-called test batteries, that is, sets of tests. The current practice for using an RNG is to verify the sequences it generates with tests from some battery (such as those recommended by [2,3] or other standards).
Many statistical tests are designed to detect particular deviations from randomness, described as classes of random processes (e.g., a Bernoulli process with unequal probabilities of 0 and 1, Markov chains with unknown parameters, stationary ergodic processes, etc.) [1,2,3,4,5].
A natural question is: how do we compare different tests and, in particular, create a suitable battery of tests? Currently, this question is mostly addressed experimentally: candidate tests are applied to a set of known RNGs, and the tests that reject more ("bad") RNGs are suitable candidates for the battery. In addition, researchers try to choose independent tests (i.e., tests that reject different RNGs) and take into account other natural properties (e.g., testing speed) [1,2,3,4]. Obviously, such an approach depends significantly on the sets of tests and RNGs pre-selected for consideration. It is worth noting that at present there are dozens of RNGs and tests, and their number is growing fast, so the recommended batteries of tests are rather unstable (see [4]).
It is clear that increasing the number of tests in a battery increases the total testing time. Conversely, if the testing time is limited, increasing the number of tests forces the length of the binary sequences being examined to decrease, which reduces the power of every test in the battery. It is therefore highly desirable to include in a battery powerful tests designed for different deviations from randomness.
The goal of this paper is to develop a theoretical framework for test comparison and to illustrate it by comparing some popular tests. The main idea of the proposed approach is based on the definition of randomness developed in algorithmic information theory (AIT). It is natural to use this theory, since it is the only mathematically rigorous theory that formally defines what a random binary sequence is, and by definition any RNG should generate such sequences. Following AIT, we extend the notion of a "random sequence" to any statistical test $T$ and then compare the "sizes" of the sets of random sequences corresponding to different tests. More precisely, let $R_{T_1}$ and $R_{T_2}$ be the sets of sequences that are random according to the tests $T_1$ and $T_2$, respectively. Then, if $\dim_H(R_{T_1} \setminus R_{T_2}) > 0$, the test $T_1$ accepts a large set of sequences as random, whereas $T_2$ rejects these sequences as non-random. So, in this sense, the test $T_1$ cannot replace $T_2$ in a battery of tests (here $\dim_H$ is the Hausdorff dimension).
Based on this approach, we give some practical recommendations for building test batteries. In particular, we recommend including in test batteries a test based on a dictionary data compressor, like the Lempel–Ziv codes [6], grammar-based codes [7] and some others.
The rest of this paper is organized as follows. The next section contains definitions and preliminary information, the third section compares the performance of tests on Markov processes with different memories and on general stationary processes, and the fourth section investigates tests based on Lempel–Ziv data compressors. The fifth section is a brief conclusion; some of the concepts used in this paper are described in the Appendix A.
2. Definitions and Preliminaries
2.1. Hypothesis Testing
In hypothesis testing, there is a main hypothesis $H_0$ = {the sequence $x$ is random} and an alternative hypothesis $H_1 = \bar H_0$. (In the probabilistic approach, $H_0$ means that the sequence is generated by a Bernoulli source with equal probabilities of 0 and 1.) A test is an algorithm whose input is a prefix $x_1 \dots x_n$ (of an infinite sequence $x = x_1 x_2 \dots$) and whose output is one of two possible words: random or non-random (meaning that the sequence is random or non-random, respectively).
Let there be a hypothesis $H_0$ and some alternative $H_1$, let $T$ be a test and let $\tau$ be a statistic, that is, a function which is applied to a binary sequence $x_1 \dots x_n$. Here and below, $\{0,1\}^n$ is the set of all $n$-bit binary words and $\{0,1\}^\infty$ is the set of all infinite words $x = x_1 x_2 \dots$.
By definition, a Type I error occurs if $H_0$ is true and $H_0$ is rejected; the significance level is defined as the probability of the Type I error. Denote by $C_T(\alpha)$ the critical region of the test $T$ for the significance level $\alpha$, that is, the set of words for which $T$ rejects $H_0$. Recall that, by definition, $H_0$ is rejected if and only if $x_1 \dots x_n \in C_T(\alpha)$ and, hence,

$$|C_T(\alpha) \cap \{0,1\}^n| \le \alpha\, 2^n, \qquad (1)$$

see [8]. We also apply another natural limitation: we consider only tests $T$ for which (1) holds for all $n$ and $\alpha \in (0,1)$. (Here and below, $|X|$ is the number of elements of $X$ if $X$ is a set, and the length of $X$ if $X$ is a word.)
A finite sequence $x_1 \dots x_n$ is considered random for a given test $T$ and significance level $\alpha$ if it belongs to $\{0,1\}^n \setminus C_T(\alpha)$.
2.2. Batteries of Tests
Let us consider a situation where the randomness testing is performed by conducting a battery of statistical tests for randomness. Suppose that the battery contains a finite or countable set of tests $T_1, T_2, \dots$ and that $\alpha_i$ is the significance level of the $i$-th test, $i = 1, 2, \dots$. If the battery is applied in such a way that the hypothesis $H_0$ is rejected when at least one test in the battery rejects it, then the significance level $\hat\alpha$ of this battery satisfies the following inequality:

$$\hat\alpha \le \sum_i \alpha_i, \qquad (2)$$

because $P(A \cup B) \le P(A) + P(B)$ for any events $A$ and $B$. (This inequality is a simple extension of the so-called Bonferroni correction, see [9].)
It will be convenient to formulate this inequality in a different way. Suppose there is some $\alpha \in (0,1)$ and a sequence $q_1, q_2, \dots$ of non-negative reals such that $\sum_i q_i \le 1$. For example, we can define the following sequence:

$$q_i = 2^{-i}, \quad i = 1, 2, \dots \qquad (3)$$

If the significance level $\alpha_i$ of the $i$-th test equals $\alpha\, q_i$, then the significance level of the battery is not greater than $\alpha$. (Indeed, from (2) we obtain $\hat\alpha \le \sum_i \alpha_i = \alpha \sum_i q_i \le \alpha$.) Note that this simple observation makes it possible to treat a test battery as a single test.
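The following minimal sketch (in Python, which we use for all illustrations below; the test functions and their names are hypothetical placeholders, not constructions from this paper) shows how a battery with the levels $\alpha_i = \alpha q_i$ of (3) behaves as a single test:

```python
from typing import Callable, List

# A test maps a binary word and a significance level to True
# iff it rejects H0, i.e., the word falls into its critical region.
Test = Callable[[str, float], bool]

def run_battery(tests: List[Test], x: str, alpha: float) -> bool:
    """Treat a battery as a single test with significance level <= alpha.

    By (2) and (3), giving the i-th test the level alpha_i = alpha * 2**-i
    makes the sum of all alpha_i at most alpha, so a truly random word
    is rejected with probability at most alpha.
    """
    for i, test in enumerate(tests, start=1):
        alpha_i = alpha * 2.0 ** -i        # q_i = 2^-i, see (3)
        if test(x, alpha_i):
            return True                    # H0 rejected by the i-th test
    return False                           # x is accepted as random

# Toy test (hypothetical, for illustration only): reject a constant
# word, whose probability 2^(1-n) under H0 must not exceed the level.
def constant_word_test(x: str, alpha: float) -> bool:
    return len(set(x)) == 1 and 2.0 ** (1 - len(x)) <= alpha

print(run_battery([constant_word_test], "0" * 64, alpha=0.01))  # True
```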
2.2.1. Random and Non-Random Infinite Sequences
Kolmogorov complexity is one of the central notions of algorithmic information theory (AIT), see [10,11,12,13,14,15,16,17,18]. We will consider the so-called prefix-free Kolmogorov complexity $K(u)$, which is defined on finite binary words $u$ and is closely related to the notion of randomness. More precisely, an infinite binary sequence $x = x_1 x_2 \dots$ is random if there exists a constant $C$ such that

$$n - K(x_1 \dots x_n) < C \qquad (4)$$

for all $n$, see [19]. Conversely, the sequence $x$ is non-random if

$$\limsup_{n \to \infty} \big(n - K(x_1 \dots x_n)\big) = \infty. \qquad (5)$$

In some sense, Kolmogorov complexity is the length of the shortest lossless prefix-free code, that is, for any (algorithmically realisable) code $f$ there exists a constant $C_f$ for which $K(u) \le |f(u)| + C_f$ for all words $u$ [10,11,12,13,14,15,16]. Recall that a code $f$ is lossless if there is a mapping $f^{-1}$ such that $f^{-1}(f(u)) = u$ for any word $u$, and $f$ is prefix-free (or unprefixed) if for any words $u \ne v$, $f(u)$ is not a prefix of $f(v)$ and $f(v)$ is not a prefix of $f(u)$.
Let $f$ be a lossless prefix-free code defined for all finite words. Similarly to (4), we call a sequence $x$ random with respect to $f$ if there is a constant $C_f$ such that $n - |f(x_1 \dots x_n)| < C_f$ for all $n$. We call this difference the statistic corresponding to $f$ and define

$$\rho_f(x_1 \dots x_n) = n - |f(x_1 \dots x_n)|. \qquad (6)$$

Similarly, the sequence $x$ is non-random with respect to $f$ if

$$\limsup_{n \to \infty} \rho_f(x_1 \dots x_n) = \infty. \qquad (7)$$
Informally, $x$ is random with respect to $f$ if the statistic $\rho_f$ is bounded by some constant on all prefixes $x_1 \dots x_n$ and, conversely, $x$ is non-random if $\rho_f$ is unbounded as the prefix length grows.
Based on these definitions, we can reformulate the concepts of randomness and non-randomness in a manner similar to what is customary in mathematical statistics. Namely, for any $\alpha \in (0,1)$ we define the set $C_f(\alpha) = \{u : \rho_f(u) \ge \log(1/\alpha)\}$. It is easy to see that (1) is valid (indeed, since $f$ is lossless and prefix-free, the Kraft inequality shows that the number of $n$-bit words $u$ with $|f(u)| \le n - \log(1/\alpha)$ does not exceed $2^{n - \log(1/\alpha)} = \alpha 2^n$) and, therefore, this set represents the critical region of the test $T_f$, which is as follows: reject $H_0$ for $x_1 \dots x_n$ if and only if $\rho_f(x_1 \dots x_n) \ge \log(1/\alpha)$.
Based on these considerations, (6) and the definitions of randomness (4) and (5), we give the following definition of randomness and non-randomness for the statistic $\rho_f$ and the corresponding test $T_f$. An infinite sequence $x = x_1 x_2 \dots$ is random according to the test $T_f$ if there exists an $\alpha \in (0,1)$ such that for any integer $n$ and this $\alpha$ the word $x_1 \dots x_n$ is random (according to the $T_f$ test). Otherwise, the sequence $x$ is non-random.
Note that we can use the statistic $\rho_f$ with the critical value $t_\alpha = \log(1/\alpha)$, $\alpha \in (0,1)$, see [20,21]. So, there is no need to use a distribution density formula; this greatly simplifies the use of the test and makes it possible to use this test with any data compressor $f$.
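For illustration, here is a sketch of such a test with an off-the-shelf compressor standing in for the code $f$ (we use zlib; a general-purpose compressor only approximates the lossless prefix-free codes discussed here, and its header overhead makes the test conservative):

```python
import math
import os
import zlib

def rho(u: bytes) -> float:
    """Statistic (6): rho_f(u) = |u| - |f(u)|, measured in bits."""
    return 8 * len(u) - 8 * len(zlib.compress(u, 9))

def compression_test(u: bytes, alpha: float = 0.01) -> str:
    """Reject H0 ('u is random') iff rho_f(u) >= log(1/alpha)."""
    t_alpha = math.log2(1.0 / alpha)       # critical value, see [20,21]
    return "non-random" if rho(u) >= t_alpha else "random"

print(compression_test(os.urandom(10_000)))   # expected: random
print(compression_test(b"\x00" * 10_000))     # expected: non-random
```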
It is important to note that there are tests developed within the AIT framework that can be used for RNG testing [22,23].
2.2.2. Test Performance Comparison
For a test $T$, let us define $R_T$ as the set of all infinite sequences that are random for $T$.
We use this definition to compare the "effectiveness" of different tests as follows: the test $T_1$ is more efficient than $T_2$ if the size of the difference $R_{T_2} \setminus R_{T_1}$ is not equal to zero, where the size is measured by the Hausdorff dimension.
Informally, the "smallest" set of random sequences corresponds to a test based on Kolmogorov complexity (4) (the corresponding set $R_K$ contains the "truly" random sequences). For a given test $T_f$ we cannot calculate the difference $R_{T_f} \setminus R_K$, because the statistic in (4) is non-computable; but in the case of two tests $T_{f_1}$ and $T_{f_2}$ with $\dim_H(R_{T_{f_1}} \setminus R_{T_{f_2}}) > 0$, we can say that the set of sequences that are random according to $T_{f_1}$ contains clearly non-random sequences. So, in some sense, $T_{f_2}$ is more efficient than $T_{f_1}$. (Recall that we only consider computable tests.)
The definition of the Hausdorff dimension is given in the Appendix A, but here we briefly note how it is used: for any binary sequence $x = x_1 x_2 \dots$ we define the real number $0.x_1 x_2 \dots \in [0,1]$, and for any set $S$ of infinite binary sequences we denote by $\dim_H(S)$ the Hausdorff dimension of the corresponding set of reals. So, a test $T_1$ is more efficient than $T_2$ (formally, $T_1 \succ T_2$) if $\dim_H(R_{T_2} \setminus R_{T_1}) > 0$. Obviously, information about a test's effectiveness can be useful to developers of test batteries.
Also note that the Hausdorff dimension is widely used in information theory. Perhaps the first such use was due to Eggleston [24] (see also [25,26]), and later the Hausdorff dimension found numerous applications in AIT [27,28,29].
2.2.3. Shannon Entropy
In RNG testing, one of the popular alternative hypotheses ($H_1$) is that the considered sequence is generated by a Markov process of memory (or connectivity) $m$, $m \ge 0$, with unknown transition probabilities; we denote this class of processes by $M_m$. ($M_0$, i.e., $m = 0$, corresponds to the Bernoulli processes.) Another popular, and perhaps the most general, alternative is that the sequence is generated by a stationary ergodic process; this class is denoted by $M_\infty$ (excluding the truly random process of $H_0$).
Let us consider a Bernoulli process $\mu$ for which $\mu(x_i = 1) = p$ and $\mu(x_i = 0) = 1 - p$. By definition, the Shannon entropy $h(\mu)$ of this process is defined as $h(\mu) = -(p \log p + (1-p) \log (1-p))$, where $\log \equiv \log_2$ [30]. For any stationary ergodic process $\mu$, the entropy of order $k$ is defined as follows:

$$h_k(\mu) = -E_\mu \Big( \sum_{a \in \{0,1\}} P(x_{i+1} = a \mid x_{i-k+1} \dots x_i) \log P(x_{i+1} = a \mid x_{i-k+1} \dots x_i) \Big), \qquad (8)$$

where $E_\mu$ is the mathematical expectation according to $\mu$ and $P(x_{i+1} = a \mid x_{i-k+1} \dots x_i)$ is the conditional probability; it does not depend on $i$ due to stationarity [30].
It is known in information theory that, for stationary ergodic processes (including $M_m$ and $M_\infty$), $h_0(\mu) \ge h_1(\mu) \ge h_2(\mu) \ge \dots$ and there exists the limit Shannon entropy $h_\infty(\mu) = \lim_{k \to \infty} h_k(\mu)$. Besides, $h_k(\mu) = h_\infty(\mu)$ for $k \ge m$ if $\mu \in M_m$ [30].
Shannon entropy plays an important role in data compression, because for any lossless prefix-free code the average codeword length (per letter) is at least as large as the entropy, and this limit can be reached. More precisely, let $f$ be a lossless, prefix-free code defined on $\{0,1\}^n$, and let $\mu \in M_\infty$. Then, for any such $f$, $E_\mu(|f(x_1 \dots x_n)|) \ge n\, h_\infty(\mu)$. In addition, there are codes $f$ such that $\lim_{n \to \infty} E_\mu(|f(x_1 \dots x_n)|)/n = h_\infty(\mu)$ [30].
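For concreteness, the order-$k$ entropy can be estimated from word frequencies; the following sketch uses the standard plug-in estimator (an illustration we add here, not a construction from the paper):

```python
import math
import random
from collections import Counter

def empirical_hk(x: str, k: int) -> float:
    """Plug-in estimate of the order-k entropy h_k of a binary string:
    the empirical conditional entropy of the next bit given k bits."""
    total = len(x) - k
    ctx = Counter(x[i:i + k] for i in range(total))        # k-words
    ext = Counter(x[i:i + k + 1] for i in range(total))    # (k+1)-words
    h = 0.0
    for w, cnt in ext.items():
        p_cond = cnt / ctx[w[:k]]                # P(next bit | context)
        h -= (cnt / total) * math.log2(p_cond)
    return h

x = "".join(random.choice("01") for _ in range(100_000))
print(empirical_hk(x, 0), empirical_hk(x, 3))    # both close to 1.0
```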
2.2.4. Typical Sequences and Universal Codes
The sequence $x = x_1 x_2 \dots$ is typical for the measure $\mu$ if, for any word $u$, $\lim_{n \to \infty} \nu_{x_1 \dots x_n}(u)/n = \mu(u)$, where $\nu_v(u)$ is the number of occurrences of the word $u$ in the word $v$.
Let us denote the set of all typical sequences by $T(\mu)$ and note that $\mu(T(\mu)) = 1$ [30]. This notion is deeply related to information theory. Thus, Eggleston proved the equality $\dim_H(T(\mu)) = h(\mu)$ for Bernoulli processes ($M_0$) [24], and later this was generalized to $M_\infty$ [26,28].
By definition, a code $\varphi$ is universal for a set of processes $S$ if for any $\mu \in S$ and any typical $x \in T(\mu)$

$$\lim_{n \to \infty} |\varphi(x_1 \dots x_n)|/n = h_\infty(\mu). \qquad (9)$$

In 1968, R. Krichevsky [31] proposed a code $K_m$, where $m \ge 0$ is an integer, whose redundancy, i.e., the average difference between the code length and the Shannon entropy, is asymptotically minimal. This code and its generalisations are described in the Appendix A, but here we note the following main property: for any stationary ergodic process $\mu$, that is, $\mu \in M_\infty$, and typical $x$,

$$\lim_{n \to \infty} |K_m(x_1 \dots x_n)|/n = h_m(\mu), \qquad (10)$$

see [32].
Currently there are many universal codes which are based on different ideas and approaches, among which we note the PPM universal code [33], the arithmetic code [34], the Burrows–Wheeler transform [35], which is used along with the book-stack (or MTF) code [36,37,38], and some others [39,40,41].
The most interesting for us is the class of grammar-based codes suggested by Kieffer and Yang [7,42], which includes the Lempel–Ziv (LZ) codes [6] (note that perhaps the first grammar-based code was described in [43]).
The point is that all of them are universal codes; hence, they "compress" sequences generated by stationary processes asymptotically to the entropy and, therefore, these codes cannot be distinguished on $M_\infty$. On the other hand, we show that grammar-based codes can distinguish "large" sets of sequences as non-random beyond $M_\infty$.
2.2.5. Two-Faced Processes
The so-called two-faced processes are described in [20,21], and their definitions are given in Appendix A. Here, we note some of their properties: the set of two-faced processes of order $s$, $s \ge 1$, and probability $p$, $p \in (0,1)$, contains measures $\Lambda_{s,p}$ from $M_s$ such that

$$h_k(\Lambda_{s,p}) = 1 \ \text{for} \ k = 0, 1, \dots, s-1, \qquad h_s(\Lambda_{s,p}) = h(p). \qquad (11)$$

Note that they are called two-faced because they appear to be truly random if we look at word frequencies whose length is less than $s$, but are "completely" non-random if the word length is equal to or greater than $s$ (and $p$ is far from $1/2$).
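To give a feeling for this behaviour, here is a sketch of one simple process with such properties (a hypothetical construction chosen for illustration, in which the next bit is the XOR of the previous $s$ bits, flipped with probability $1-p$; the paper's exact definition of two-faced processes is the one in Appendix A):

```python
import random

def two_faced_like(n: int, s: int, p: float, seed: int = 0) -> str:
    """Sample n bits of an order-s Markov chain whose words of length
    up to s are uniformly distributed: the next bit equals the XOR of
    the previous s bits with probability p, its complement otherwise.
    """
    rng = random.Random(seed)
    bits = [rng.randint(0, 1) for _ in range(s)]     # uniform start
    for _ in range(n - s):
        x = 0
        for b in bits[-s:]:
            x ^= b                                   # XOR of last s bits
        bits.append(x if rng.random() < p else 1 - x)
    return "".join(map(str, bits))

# With the estimator empirical_hk from Section 2.2.3:
# for x = two_faced_like(200_000, s=4, p=0.8), empirical_hk(x, k) is
# close to 1 for k < 4 (low-order frequency tests see nothing), while
# empirical_hk(x, 4) is close to h(0.8) ~ 0.72.
```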
3. Comparison of the Efficiency of Tests for Markov Processes with Different Memories and General Stationary Processes
We now describe the statistical tests for Markov processes and stationary ergodic processes. According to (6), the corresponding statistics are defined as follows:

$$\rho_{\psi_m}(x_1 \dots x_n) = n - |\psi_m(x_1 \dots x_n)|, \qquad \rho_{\psi_\infty}(x_1 \dots x_n) = n - |\psi_\infty(x_1 \dots x_n)|, \qquad (12)$$

where $\psi_m$ and $\psi_\infty$ are universal codes for $M_m$ and $M_\infty$ defined in the Appendix A, see (A4) and (A5). We also denote the corresponding tests by $T_{\psi_m}$ and $T_{\psi_\infty}$. The following statement compares the performance of these tests.
Theorem 1. For any integers $m$ and $s$ with $0 \le m < s$,

$$\dim_H(R_{T_{\psi_m}} \setminus R_{T_{\psi_s}}) = 1.$$

Moreover, $\dim_H(R_{T_{\psi_m}} \setminus R_{T_{\psi_\infty}}) = 1$.
Proof. First, let us say a few words about the scheme of the proof. If we apply the test $T_{\psi_m}$ to typical sequences of a two-faced process $\Lambda_{s,p}$ with $s > m$, they will appear random, since $h_m(\Lambda_{s,p}) = 1$. So, the set $R_{T_{\psi_m}}$ of random sequences (i.e., sequences that are random according to the $T_{\psi_m}$ test) contains the set $T(\Lambda_{s,p})$ of typical sequences, whose Hausdorff dimension equals the limit Shannon entropy $h_\infty(\Lambda_{s,p}) = h(p)$. On the other hand, typical sequences of the two-faced process $\Lambda_{s,p}$ are not random according to $T_{\psi_s}$, since $h_s(\Lambda_{s,p}) = h(p) < 1$, see (11), and the test "compresses" them to the Shannon entropy $h(p)$. So, $T(\Lambda_{s,p}) \cap R_{T_{\psi_s}} = \emptyset$. Then $\dim_H(R_{T_{\psi_m}} \setminus R_{T_{\psi_s}}) \ge \dim_H(T(\Lambda_{s,p})) = h(p)$.

More formally, consider a typical sequence $x$ of $\Lambda_{s,p}$, $p \ne 1/2$. So,

$$\lim_{n \to \infty} |\psi_s(x_1 \dots x_n)|/n = h_s(\Lambda_{s,p}) = h(p) < 1,$$

see (11), where the first equality is due to typicality, and the second to the property of two-faced processes (11). Hence, $\rho_{\psi_s}(x_1 \dots x_n) = n - |\psi_s(x_1 \dots x_n)|$ grows linearly in $n$ and, by (7), $x$ is non-random according to $T_{\psi_s}$. Now consider the test $T_{\psi_m}$, $m < s$. From here and (A1), (A4) we obtain $|\psi_m(x_1 \dots x_n)| \ge n\, \hat h_m(x_1 \dots x_n)$, where $\hat h_m$ is the empirical entropy of order $m$. From this and typicality we can see that $\lim_{n \to \infty} \hat h_m(x_1 \dots x_n) = h_m(\Lambda_{s,p}) = 1$ and, taking the redundancy term in (A4) into account, that $\rho_{\psi_m}(x_1 \dots x_n)$ is bounded from above. Hence, there exists an $n_0$ such that $\rho_{\psi_m}(x_1 \dots x_n) < 0$ if $n > n_0$. So, if we take $\alpha < 2^{-\max_{n \le n_0} \rho_{\psi_m}(x_1 \dots x_n)}$, we can see that $\rho_{\psi_m}(x_1 \dots x_n) - \log(1/\alpha)$ is negative for all $n$. From this and the definition of randomness for the test $T_{\psi_m}$, we can see that typical sequences from $T(\Lambda_{s,p})$ are random according to $T_{\psi_m}$, i.e., $T(\Lambda_{s,p}) \subset R_{T_{\psi_m}}$. From this and (A6), we obtain $\dim_H(R_{T_{\psi_m}} \setminus R_{T_{\psi_s}}) \ge h(p)$; since $p$ can be chosen so that $h(p)$ is arbitrarily close to 1, this dimension equals 1. The same reasoning applies to $T_{\psi_\infty}$, because, by (A5), $\lim_{n \to \infty} |\psi_\infty(x_1 \dots x_n)|/n = h_\infty(\Lambda_{s,p}) = h(p)$. □
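To make the tests of (12) concrete, here is a sketch of the statistic $\rho_{\psi_m}$, assuming the order-$m$ Krichevsky–Trofimov mixture code as the universal code for $M_m$ (it plays the role of (A4), although the Appendix A construction may differ in details):

```python
import math
from collections import defaultdict

def kt_code_length(x: str, m: int) -> float:
    """Ideal code length (bits) of the order-m Krichevsky-Trofimov
    mixture code: -log2 of the product of predictive probabilities
    (c_bit + 1/2) / (c_0 + c_1 + 1) over all positions."""
    counts = defaultdict(lambda: [0, 0])
    length = float(m)                  # the first m bits are sent as-is
    for t in range(m, len(x)):
        ctx, bit = x[t - m:t], int(x[t])
        c0, c1 = counts[ctx]
        length -= math.log2((counts[ctx][bit] + 0.5) / (c0 + c1 + 1.0))
        counts[ctx][bit] += 1
    return length

def rho_psi(x: str, m: int) -> float:
    """Statistic (12): n - |psi_m(x_1...x_n)|."""
    return len(x) - kt_code_length(x, m)

# For x sampled from the two-faced-like process of order s = 4 above:
# rho_psi(x, m) stays bounded (negative) for m < 4, so T_psi_m accepts,
# while rho_psi(x, 4) grows roughly as n * (1 - h(p)), so T_psi_4 rejects.
```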
4. Effectiveness of Tests Based on Lempel–Ziv Data Compressors
In this part, we describe a test that is more effective than $T_{\psi_m}$ and $T_{\psi_\infty}$ for any $m$.
First, we will briefly describe the LZ77 code, based on the definition in [44]. Suppose there is a binary string $x_1 x_2 \dots x_n$ that is encoded using the code LZ77. This string is represented by a list of pairs $(p_1, l_1), \dots, (p_t, l_t)$. Each pair represents a string, and the concatenation of these strings is $x_1 x_2 \dots x_n$. In particular, if $l_i = 0$, then the pair represents the string $p_i$, which is a single terminal. If $l_i \ne 0$, then the pair represents a portion of the prefix of $x_1 \dots x_n$ that is represented by the preceding $i - 1$ pairs, namely, the $l_i$ terminals beginning at position $p_i$; see ([44], part 3.1). The length of the codeword depends on the encoding of the sub-words $p_i, l_i$, which are integers. For this purpose, we will use a prefix code $C$ for integers for which, for any integer $m$,

$$|C(m)| = \log m + 2 \log \log m + O(1). \qquad (13)$$

Such codes are known in information theory; see, for example, ([30], part 7.2). Note that $C$ is a prefix code and, hence, for any $m_1, m_2$ the concatenation $C(m_1) C(m_2)$ can be decoded to $m_1, m_2$. There is the following upper bound for the length of the LZ77 code [30,44]: for any word $x_1 \dots x_n$,

$$|\mathrm{LZ77}(x_1 \dots x_n)| \le n(1 + o(1))$$

as $n \to \infty$.
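A sketch of this encoding (a greedy longest-match parser; the integer code length uses the Elias-style bound (13), so the computed value is an estimate of $|\mathrm{LZ77}(x)|$ rather than an exact codeword length):

```python
import math

def elias_len(m: int) -> float:
    """Approximate |C(m)| = log m + 2 log log m + O(1), see (13)."""
    if m < 2:
        return 1.0
    return math.log2(m) + 2.0 * max(math.log2(math.log2(m)), 0.0) + 1.0

def lz77_pairs(x: str):
    """Greedy LZ77 parsing into pairs (p_i, l_i): l_i = 0 marks a single
    terminal p_i; otherwise the pair copies the l_i symbols that begin
    at (1-based) position p_i of the already-parsed prefix."""
    pairs, i = [], 0
    while i < len(x):
        best_p = best_l = 0
        for p in range(i):                    # O(n^2): fine for a sketch
            l = 0
            while i + l < len(x) and p + l < i and x[p + l] == x[i + l]:
                l += 1
            if l > best_l:
                best_p, best_l = p + 1, l
        if best_l == 0:
            pairs.append((x[i], 0)); i += 1   # single terminal
        else:
            pairs.append((best_p, best_l)); i += best_l
    return pairs

def lz77_length(x: str) -> float:
    """Estimated total LZ77 code length of x, in bits."""
    total = 0.0
    for p, l in lz77_pairs(x):
        if l == 0:
            total += 1.0 + elias_len(1)       # flag plus the literal bit
        else:
            total += elias_len(p) + elias_len(l)
    return total
```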
We will now describe sequences that, on the one hand, are not typical for any stationary ergodic measure and, on the other hand, are non-random and are recognized as such by the suggested test based on LZ77. Thus, the proposed approach allows us to detect non-random sequences that are not typical for any stationary process. To do this, we take any random sequence $x = x_1 x_2 \dots$ (that is, one for which (4) is valid) and define a new sequence $y$ as follows: for $k = 0, 1, 2, \dots$, let $u_k = x_{2^k} x_{2^k+1} \dots x_{2^{k+1}-1}$ and let $y = u_0 u_0\, u_1 u_1\, u_2 u_2 \dots$; that is, each successive block of $x$ is written twice. For example, $y_1 y_2 = x_1 x_1$, $y_3 \dots y_6 = x_2 x_3 x_2 x_3$, $y_7 \dots y_{14} = x_4 x_5 x_6 x_7 x_4 x_5 x_6 x_7$, ….
The idea behind this sequence is quite clear. Firstly, it is obvious that the word $y$ cannot be typical for a stationary ergodic source and, secondly, when $y$ is encoded, the second copy of each subword $u_k$ will be encoded by a very short word (of about $O(\log |u_k|)$ bits), since it coincides with the previous word $u_k$. So, for large $k$ the length of the encoded word $y_1 \dots y_n$, $n = 2(2^{k+1} - 1)$, will be about $n/2$; that is, $|\mathrm{LZ77}(y_1 \dots y_n)| \approx n/2$ for such $n$. Hence, it follows that

$$\limsup_{n \to \infty} \big(n - |\mathrm{LZ77}(y_1 \dots y_n)|\big) = \infty. \qquad (14)$$

(Here, we took into account that $x$ is random and, hence, $K(x_1 \dots x_n) \ge n - C$ for all $n$, see [28].) So, having taken into account the definitions of non-randomness (6) and (7), we can see that $y$ is non-random according to the statistic $\rho_{\mathrm{LZ77}}$. Denote this test by $T_{\mathrm{LZ77}}$.
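A quick numerical check of this effect (using the LZ77-based zlib compressor and the statistic rho from the compression-test sketch above; os.urandom stands in for the random sequence $x$, and blocks are bytes rather than bits to keep the example short):

```python
import os
import zlib

def doubled_bytes(total: int) -> bytes:
    """Build a prefix of y = u0 u0 u1 u1 ... from random blocks u_k
    of length 2^k bytes (os.urandom stands in for the random x)."""
    y, k = b"", 0
    while len(y) < total:
        u = os.urandom(2 ** k)
        y += u + u                     # each block is written twice
        k += 1
    return y[:total]

y = doubled_bytes(10_000)
rho_lz = 8 * len(y) - 8 * len(zlib.compress(y, 9))
print(rho_lz > 0)    # True: the statistic grows roughly like n/2 bits,
                     # so the LZ77-based test rejects y, although the
                     # word frequencies in y look like random data.
```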
Let us now consider the tests $T_{\psi_m}$, where $m \ge 0$ is an integer. Having taken into account that the sequence $x$ is random, we can see that the frequencies of all words of any fixed length in $y$ are asymptotically the same as in a truly random sequence; hence,

$$\lim_{n \to \infty} |\psi_m(y_1 \dots y_n)|/n = 1. \qquad (15)$$

So, from (A4) we can see that for any $n$ the statistic $\rho_{\psi_m}(y_1 \dots y_n)$ is bounded from above, and $y$ is random according to $T_{\psi_m}$. The same reasoning is true for the code $\psi_\infty$.
We can now compare the sizes of the sets of random sequences for different tests as follows. Let $Y$ denote the set of all sequences $y$ obtained by the described construction from random sequences $x$. Taking into account (15), we can see that $Y \subset R_{T_{\psi_m}} \setminus R_{T_{\mathrm{LZ77}}}$ for every $m$. Likewise, the same is true for the $T_{\psi_\infty}$ test. From the latter inclusion, we obtain the following.
Theorem 2. For any random (according to (4)) sequence $x$, the sequence $y$ is non-random for the test $T_{\mathrm{LZ77}}$, whereas this sequence is random for the tests $T_{\psi_m}$ and $T_{\psi_\infty}$. Moreover, $T_{\mathrm{LZ77}} \succ T_{\psi_m}$ and $T_{\mathrm{LZ77}} \succ T_{\psi_\infty}$ for any $m$.

Comment. The sequence $y$ is constructed by duplicating parts of $x$. This construction can be slightly modified as follows: instead of duplication (as $u u$), we can use $u u'$, where $u'$ contains the first $\lceil \delta |u| \rceil$ letters of $u$, $\delta \in (0,1)$. In this case, the duplicated share of the constructed sequences becomes smaller and, therefore, the Hausdorff dimension of the corresponding set of sequences grows (up to $1/(1+\delta)$) as $\delta$ decreases, while the sequences remain non-random for $T_{\mathrm{LZ77}}$.
5. Conclusions
Here, we describe some recommendations for the practical testing of RNGs, based on the proposed method of comparing the power of different statistical tests. Based on Theorem 1, we can recommend using several tests $T_{\psi_s}$ based on the analysis of the occurrence frequencies of words of different lengths $s$. In addition, we recommend using tests for which $s$ depends on the length $n$ of the sequence under consideration, for example, $s = O(\log \log n)$, $s = O(\log n)$, etc. They can be included in the test battery directly or as a "mixture" with several non-zero coefficients; see (A2) in the Appendix A.
Theorem 2 shows that it is useful to include tests based on dictionary data compressors, such as the Lempel–Ziv codes. In such a case, we can use the statistic $\rho_f(x_1 \dots x_n) = n - |f(x_1 \dots x_n)|$ with the critical value $t_\alpha = \log(1/\alpha)$, $\alpha \in (0,1)$, see [20,21]. Note that in this case there is no need to use a distribution density formula, which greatly simplifies the use of the test and makes it possible to use a similar test with any grammar-based data compressor.
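Putting these recommendations together, a small battery along the suggested lines might look as follows (a sketch reusing rho_psi and lz77_length from the earlier examples; the choice of orders and weights is illustrative only):

```python
import math

def battery_verdict(x: str, alpha: float = 0.01) -> str:
    """Battery of frequency-based tests T_psi_s for several orders s
    (including s = O(log log n)) plus the compressor-based test T_LZ77,
    combined with the weights q_i = 2^-i of (3), so that the overall
    significance level does not exceed alpha."""
    n = len(x)
    s_loglog = max(1, round(math.log2(max(math.log2(n), 2.0))))
    statistics = [
        rho_psi(x, 1),                 # word statistics of order 1
        rho_psi(x, 2),                 # ... and order 2
        rho_psi(x, s_loglog),          # order growing with n
        n - lz77_length(x),            # dictionary-compressor statistic
    ]
    for i, stat in enumerate(statistics, start=1):
        alpha_i = alpha * 2.0 ** -i
        if stat >= math.log2(1.0 / alpha_i):   # critical value log(1/alpha_i)
            return "non-random"
    return "random"
```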