1. Introduction
Information theory is mainly concerned with stationary random processes $\{X_n\}$, where each $X_n$ takes values in a set $\mathcal{X}$ with finite cardinality $|\mathcal{X}|$. The strong convergence of the entropy at time $n$ of a random process, divided by $n$, to a constant limit called the entropy rate of the process is known as the ergodic theorem of information theory or the asymptotic equipartition property (AEP) [1]; it is the convergence, in some sense, of the expression
$$-\frac{1}{n}\log p\big(X_0, X_1, \ldots, X_{n-1}\big).$$
Its original version, proven in the 1950s for ergodic stationary processes, is known as the Shannon–McMillan theorem for convergence in mean and as the Shannon–McMillan–Breiman theorem [2,3,4] for almost everywhere convergence. Since then, generalized versions of the Shannon–McMillan–Breiman limit theorem have been developed by many authors [1,2,4,5]. Extensions have been made in the direction of weakening the assumptions on the reference measure, the state space, the index set, and the required properties of the process. For the general development, please see Girardin [6] and the references therein.
In statistics, smoothing data means constructing an approximating function that attempts to capture the important patterns in the data while leaving out noise. One of the most widely used smoothing methods is the moving average (MA). A number of authors have studied the question of almost everywhere convergence of moving averages for an invertible measure-preserving transformation of a measure space $X$, e.g., Akcoglu and del Junco [7]; Bellow, Jones, and Rosenblatt [8]; del Junco and Steele [9]; Schwartz [10]; and Haili and Nair [11]. Recently, Wang and Yang [12,13] proposed a new concept of the generalized entropy density and established generalized entropy ergodic theorems for time-nonhomogeneous Markov chains and for non-null stationary processes. Shi, Wang et al. [14] studied the generalized entropy ergodic theorem for nonhomogeneous Markov chains indexed by a binary tree.
Motivated by the work above, in this paper we give a moving average version of the Shannon–McMillan–Breiman theorem. The results in this paper generalize those in [2]. It is worth noting that, in some sense, the two index sequences $\{a_n\}$ and $\{\phi(n)\}$ introduced in Section 2 play symmetric roles. In this paper, we discuss the so-called forward moving average; if the growth rate of $a_n$ with respect to the integer $n$ is slow enough, all conclusions in this article still hold true, i.e., the corresponding backward moving average results remain valid.
The method used in proving the main results is the “sandwich” approximation approach of Algoet and Cover [2], which depends strongly on the strong law of large numbers for moving averages: the sample entropy is asymptotically sandwiched between two functions whose limits can be determined from the moving SLLN.
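Schematically, in the classical case $a_n \equiv 0$ and $\phi(n) = n$, and with the process extended to a two-sided stationary sequence, the sandwich of [2] can be summarized as follows (here $p^{(m)}$ denotes the order-$m$ Markov approximation introduced in Definition 1 below, and $H^{(m)}$, $H^{\infty}$ are the limits appearing in Lemma 2; the moving average version proved below replaces the block $X_0^{n-1}$ by a window of length $\phi(n)$ starting at $a_n$):
$$H = H^{\infty} = \lim_{n\to\infty}\frac{1}{n}\log\frac{1}{p\big(X_0^{n-1}\mid X_{-\infty}^{-1}\big)} \;\le\; \liminf_{n\to\infty}\frac{1}{n}\log\frac{1}{p\big(X_0^{n-1}\big)} \;\le\; \limsup_{n\to\infty}\frac{1}{n}\log\frac{1}{p\big(X_0^{n-1}\big)} \;\le\; \lim_{n\to\infty}\frac{1}{n}\log\frac{1}{p^{(m)}\big(X_0^{n-1}\big)} = H^{(m)} \quad \text{a.e.},$$
and Lemma 3 closes the gap by letting $m \to \infty$.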
This paper is organized as follows. In Section 2, we introduce the necessary preparatory knowledge; to keep it separate from the statements of the main results, we collect the required preliminaries and three lemmas there. In Section 3, we give the main results and study some of their properties; we also give examples of applications.
2. Preliminaries
Throughout this section, let $(\Omega, \mathcal{F}, P)$ denote a fixed probability space and let $\{X_n\}_{n=-\infty}^{+\infty}$ be a stationary sequence taking values from a finite set $\mathcal{X}$. For the sequence $\{X_n\}$, denote the partial sequence $(X_i, X_{i+1}, \ldots, X_j)$ by $X_i^j$ and $(x_i, x_{i+1}, \ldots, x_j)$ by $x_i^j$ for $i \le j$. Likewise, we write $X_{-\infty}^{j}$ and $x_{-\infty}^{j}$ for the corresponding sequences of random variables and of realizations extending infinitely into the past, respectively. Let
$$p\big(x_i^j\big) = P\big(X_i^j = x_i^j\big)$$
and
$$p\big(x_j \mid x_i^{j-1}\big) = P\big(X_j = x_j \mid X_i^{j-1} = x_i^{j-1}\big)$$
wherever the conditioning event has positive probability. Define the random variables $p\big(X_i^j\big)$ and $p\big(X_j \mid X_i^{j-1}\big)$ by setting $x_k = X_k$ in the corresponding definitions. Since $P\big(X_i^{j-1} = X_i^{j-1}(\omega)\big) > 0$ for almost every $\omega$, the conditional probability makes sense (i.e., it is well defined almost everywhere under the measure $P$).
Definition 1 (see, e.g., [2]). The canonical Markov approximation of order $m$ to the probability $p\big(x_i^j\big)$ is defined, for $j - i \ge m$, as
$$p^{(m)}\big(x_i^j\big) = p\big(x_i^{i+m-1}\big)\prod_{k=i+m}^{j} p\big(x_k \mid x_{k-m}^{k-1}\big).$$
We will prove a new version of the AEP for a stationary process $\{X_n\}$. Before developing the main theme of the paper, we shall need to derive some basic lemmas. Let $\{a_n\}$ and $\{\phi(n)\}$ be two sequences of positive integers such that $\phi(n) \to \infty$ as $n \to \infty$ and, for every $\epsilon > 0$, $\sum_{n=1}^{\infty} 2^{-\epsilon\,\phi(n)} < \infty$.
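For instance, with $m = 1$, the canonical Markov approximation of Definition 1 is simply the first-order Markov measure built from the one-step conditional probabilities of the process:
$$p^{(1)}\big(x_i^j\big) = p(x_i)\prod_{k=i+1}^{j} p\big(x_k \mid x_{k-1}\big),$$
which agrees with $p\big(x_i^j\big)$ exactly when $\{X_n\}$ is itself a first-order Markov chain.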
Lemma 1. Let $\{X_n\}$ be a stationary process with values from a finite set $\mathcal{X}$; then, we have
$$\limsup_{n\to\infty} \frac{1}{\phi(n)}\log\frac{p^{(m)}\big(X_{a_n}^{a_n+\phi(n)-1}\big)}{p\big(X_{a_n}^{a_n+\phi(n)-1}\big)} \le 0 \quad \text{a.e.}$$
and
$$\limsup_{n\to\infty} \frac{1}{\phi(n)}\log\frac{p\big(X_{a_n}^{a_n+\phi(n)-1}\big)}{p\big(X_{a_n}^{a_n+\phi(n)-1} \mid X_{-\infty}^{a_n-1}\big)} \le 0 \quad \text{a.e.},$$
where the base of the logarithm is taken to be 2.
Proof. Let
A be the support set of
; then,
where
indicates taking expectation under measure
.
Similarly, let
denote the support set of
. Then, we have
By Markov’s inequality and Equation (
4), we have, for any
,
Noting that
, we see by the Borel–Cantelli lemma that the event
By the arbitrariness of
, we have
Applying the same arguments using Markov’s inequality to Equation (
3), we obtain
This proves the lemma. □
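To make the Markov inequality and Borel–Cantelli steps of the proof explicit, here is a sketch under the window convention $X_{a_n}^{a_n+\phi(n)-1}$ and the summability condition on $\phi(n)$ assumed in Section 2. Since the expectation of the likelihood ratio $p^{(m)}\big(X_{a_n}^{a_n+\phi(n)-1}\big)\big/p\big(X_{a_n}^{a_n+\phi(n)-1}\big)$ is at most one, Markov's inequality gives, for any $\epsilon > 0$,
$$P\left(\frac{1}{\phi(n)}\log\frac{p^{(m)}\big(X_{a_n}^{a_n+\phi(n)-1}\big)}{p\big(X_{a_n}^{a_n+\phi(n)-1}\big)} \ge \epsilon\right) = P\left(\frac{p^{(m)}\big(X_{a_n}^{a_n+\phi(n)-1}\big)}{p\big(X_{a_n}^{a_n+\phi(n)-1}\big)} \ge 2^{\phi(n)\epsilon}\right) \le 2^{-\phi(n)\epsilon},$$
and $\sum_n 2^{-\epsilon\,\phi(n)} < \infty$, so by the Borel–Cantelli lemma the event inside occurs only finitely often almost everywhere; letting $\epsilon \downarrow 0$ along a countable sequence yields the first limsup bound.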
Lemma 2. (SLLN for MA): For a stationary stochastic process $\{X_n\}$,
$$\lim_{n\to\infty} -\frac{1}{\phi(n)}\log p^{(m)}\big(X_{a_n}^{a_n+\phi(n)-1}\big) = H^{(m)} \quad \text{a.e.}$$
and
$$\lim_{n\to\infty} -\frac{1}{\phi(n)}\log p\big(X_{a_n}^{a_n+\phi(n)-1} \mid X_{-\infty}^{a_n-1}\big) = H^{\infty} \quad \text{a.e.},$$
where $H^{(m)} = E\big[-\log p\big(X_m \mid X_0^{m-1}\big)\big]$, $H^{\infty} = E\big[-\log p\big(X_0 \mid X_{-\infty}^{-1}\big)\big]$. Proof. It is not difficult to verify that
. An argument similar to the one used in Lemma 1 shows that
Let
, and define
Since
It is straightforward to show that
Note that
By Equations (8) and (9) and the property of superior limits, we have
Setting
, dividing both sides of Equation (
10) by
s, we obtain
Using the inequalities
and
,
.
It follows from Equation (
11) that
By the fact that
, we have
From Equations (11)–(13), we have
Putting
in Equation (
14), we obtain
Replacing
by
in the above argument, we can obtain
These imply that
Note that
; therefore, we have, by Equation (
15),
Since
, Equation (
5) follows immediately from Equations (7) and (16).
The remainder of the argument is analogous to that in proving Equation (
5) and is left to the reader. □
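As a concrete illustration of Lemma 2, consider the i.i.d. case with the window convention used above: then $p^{(m)}\big(X_{a_n}^{a_n+\phi(n)-1}\big) = p\big(X_{a_n}^{a_n+\phi(n)-1} \mid X_{-\infty}^{a_n-1}\big) = \prod_{i=a_n}^{a_n+\phi(n)-1} p(X_i)$, and both limits in the lemma reduce to the strong law of large numbers for moving averages of the bounded i.i.d. variables $-\log p(X_i)$:
$$\lim_{n\to\infty}\frac{1}{\phi(n)}\sum_{i=a_n}^{a_n+\phi(n)-1}\big(-\log p(X_i)\big) = E\big[-\log p(X_0)\big] = H(p) \quad \text{a.e.},$$
so that $H^{(m)} = H^{\infty} = H(p)$ for every $m$.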
Lemma 3. (No gap): $H^{(m)} \searrow H$ as $m \to \infty$ and $H^{\infty} = H$.
Proof. We know that $H^{(m)} \searrow H$ for stationary processes, so it remains to show that $H^{\infty} = H$.
Let
. Since
is integrable. Now, since all random variables are discrete, we may write
Therefore,
and
is measurable relative to
field
;
is a non-negative supermartingale, hence converges
to an integrable limit function for all
.
Note that, for any $m$,
$$H^{(m)} = E\big[-\log p\big(X_m \mid X_0^{m-1}\big)\big] = E\big[-\log p\big(X_0 \mid X_{-m}^{-1}\big)\big],$$
where the last equation follows from stationarity.
Since $\mathcal{X}$ is finite and $-p\log p$ is bounded and continuous in $p$ for all $0 \le p \le 1$, the bounded convergence theorem allows interchange of expectation and limit, yielding
$$\lim_{m\to\infty} H^{(m)} = E\big[-\log p\big(X_0 \mid X_{-\infty}^{-1}\big)\big] = H^{\infty}.$$
Thus, $H = H^{\infty}$. □
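As a simple illustration of Lemma 3 in the notation above, consider a stationary first-order Markov chain: then $p\big(X_0 \mid X_{-m}^{-1}\big) = p\big(X_0 \mid X_{-1}\big)$ for every $m \ge 1$, so
$$H^{(m)} = E\big[-\log p\big(X_0 \mid X_{-m}^{-1}\big)\big] = E\big[-\log p\big(X_0 \mid X_{-1}\big)\big] = H \quad \text{for all } m \ge 1,$$
and the sequence $\{H^{(m)}\}$ is constant from $m = 1$ onward; there is literally no gap between the upper and lower limits of the sandwich in this case.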
3. Main Results
With the preliminaries accounted for, we wish to use Lemma 1 to conclude that
$$\lim_{n\to\infty} -\frac{1}{\phi(n)}\log p\big(X_{a_n}^{a_n+\phi(n)-1}\big) = H \quad \text{a.e.}$$
It is not easy to prove Equation (17). However, the closely related quantities $-\frac{1}{\phi(n)}\log p^{(m)}\big(X_{a_n}^{a_n+\phi(n)-1}\big)$ and $-\frac{1}{\phi(n)}\log p\big(X_{a_n}^{a_n+\phi(n)-1} \mid X_{-\infty}^{a_n-1}\big)$ are easily identified as entropy rates.
Recall that the entropy rate is given by
$$H = \lim_{n\to\infty}\frac{1}{n} H\big(X_0^{n-1}\big) = \lim_{n\to\infty} H\big(X_n \mid X_0^{n-1}\big).$$
Of course, $H^{(m)} \ge H^{(m+1)} \ge \cdots \ge H$ by stationarity and the fact that conditioning does not increase entropy. It will be crucial that $H^{(m)} \to H^{\infty} = H$.
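For instance, for a stationary Markov chain with transition matrix $(P_{ij})$ and stationary distribution $(\mu_i)$ on $\mathcal{X}$ (a standard example, stated here only for illustration), the entropy rate equals
$$H = \lim_{n\to\infty} H\big(X_n \mid X_0^{n-1}\big) = H\big(X_1 \mid X_0\big) = -\sum_{i\in\mathcal{X}}\sum_{j\in\mathcal{X}} \mu_i P_{ij}\log P_{ij}.$$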
With the help of the preceding lemmas, we can now prove the following theorem:
Theorem 1. (AEP) If H is the entropy rate of a finite-valued stationary process $\{X_n\}$, then it holds that
$$\lim_{n\to\infty} -\frac{1}{\phi(n)}\log p\big(X_{a_n}^{a_n+\phi(n)-1}\big) = H \quad \text{a.e.}$$
Remark 1. In the case $a_n \equiv 0$ and $\phi(n) = n$, Theorem 1 reduces to the famous Shannon–McMillan–Breiman theorem, which is the fundamental theorem of information theory. Let $a_n = n$; then Theorem 1 gives a delayed average version of the AEP.
Proof. We argue that the sequence of random variables $-\frac{1}{\phi(n)}\log p\big(X_{a_n}^{a_n+\phi(n)-1}\big)$ is asymptotically sandwiched between the upper bound $H^{(m)}$ and the lower bound $H^{\infty}$ for all $m$. The AEP will follow since $H^{(m)} \searrow H$ and $H^{\infty} = H$. □
From Lemma 1, we have
which we rewrite, taking the existence of the
into account, as
for
Also, from Lemma 1, we have
which we rewrite as
From the definition of
in Lemma 2, we have, by putting together Equations (6) and (7),
for all
m.
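Putting the pieces together, and writing $f_n = -\frac{1}{\phi(n)}\log p\big(X_{a_n}^{a_n+\phi(n)-1}\big)$ for brevity (the window convention is the one used in our examples below), the bounds obtained above combine into
$$H = H^{\infty} \le \liminf_{n\to\infty} f_n \le \limsup_{n\to\infty} f_n \le H^{(m)} \quad \text{a.e. for every } m,$$
and letting $m \to \infty$ and applying Lemma 3 closes the sandwich, giving $\lim_{n\to\infty} f_n = H$ almost everywhere, as claimed in Theorem 1.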
Now, we give some interesting applications of our main results in the next examples.
Example 1. Let $X_0, X_1, X_2, \ldots$ be independent, identically distributed random variables drawn from the probability mass function $p(x)$; then,
$$\lim_{n\to\infty} -\frac{1}{\phi(n)}\log p\big(X_{a_n}^{a_n+\phi(n)-1}\big) = H(p) = -\sum_{x\in\mathcal{X}} p(x)\log p(x) \quad \text{a.e.}$$
Example 2. Let
Let $X_0, X_1, X_2, \ldots$ be drawn i.i.d. according to this distribution; then,
Example 3. Let $X_0, X_1, X_2, \ldots$ be independent, identically distributed random variables drawn according to the probability mass function $p(x)$. Thus, $p\big(X_{a_n}^{a_n+\phi(n)-1}\big) = \prod_{i=a_n}^{a_n+\phi(n)-1} p(X_i)$. Let $q\big(X_{a_n}^{a_n+\phi(n)-1}\big) = \prod_{i=a_n}^{a_n+\phi(n)-1} q(X_i)$, where q is another probability mass function on $\mathcal{X}$; then,
$$\lim_{n\to\infty} -\frac{1}{\phi(n)}\log q\big(X_{a_n}^{a_n+\phi(n)-1}\big) = D(p\|q) + H(p) \quad \text{a.e.},$$
where $D(p\|q) = \sum_{x\in\mathcal{X}} p(x)\log\frac{p(x)}{q(x)}$ is the informational divergence between two probability distributions $p$ and $q$ on a common alphabet $\mathcal{X}$.
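To see where the limit in Example 3 comes from, here is a short derivation under the i.i.d. assumption and the window convention used above (assuming also that $q(x) > 0$ for every $x$ in the support of $p$). Since the coordinates are independent,
$$-\frac{1}{\phi(n)}\log q\big(X_{a_n}^{a_n+\phi(n)-1}\big) = \frac{1}{\phi(n)}\sum_{i=a_n}^{a_n+\phi(n)-1}\big(-\log q(X_i)\big),$$
and the strong law of large numbers for moving averages (the same tool behind Lemma 2) gives, almost everywhere,
$$\lim_{n\to\infty}\frac{1}{\phi(n)}\sum_{i=a_n}^{a_n+\phi(n)-1}\big(-\log q(X_i)\big) = E_p\big[-\log q(X_0)\big] = \sum_{x\in\mathcal{X}}p(x)\log\frac{p(x)}{q(x)} - \sum_{x\in\mathcal{X}}p(x)\log p(x) = D(p\|q) + H(p).$$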
Since convergence almost everywhere implies convergence in probability, Theorem 1 has the following implication:
Definition 2. The typical set $A_\epsilon^{(n)}$ with respect to $p(x)$ is the set of sequences $x_{a_n}^{a_n+\phi(n)-1} \in \mathcal{X}^{\phi(n)}$ with the property
$$2^{-\phi(n)(H+\epsilon)} \le p\big(x_{a_n}^{a_n+\phi(n)-1}\big) \le 2^{-\phi(n)(H-\epsilon)}.$$
As a consequence of Theorem 1, we can show that the set $A_\epsilon^{(n)}$ has the following properties:
Proposition 1. Let $X_0, X_1, X_2, \ldots$ be independent, identically distributed random variables drawn from the probability mass function $p(x)$; then,
- (1). If $x_{a_n}^{a_n+\phi(n)-1} \in A_\epsilon^{(n)}$, then $H - \epsilon \le -\frac{1}{\phi(n)}\log p\big(x_{a_n}^{a_n+\phi(n)-1}\big) \le H + \epsilon$.
- (2). $P\big(A_\epsilon^{(n)}\big) > 1 - \epsilon$ for sufficiently large n.
- (3). $\big|A_\epsilon^{(n)}\big| \le 2^{\phi(n)(H+\epsilon)}$, where $|A|$ denotes the number of elements in set A.
- (4). $\big|A_\epsilon^{(n)}\big| \ge (1-\epsilon)\,2^{\phi(n)(H-\epsilon)}$ for sufficiently large n.
Proof. Property (1) is immediate from the definition of $A_\epsilon^{(n)}$.
Property (2) follows directly from Theorem 1, since the probability of the event $\big\{X_{a_n}^{a_n+\phi(n)-1} \in A_\epsilon^{(n)}\big\}$ tends to 1 as $n \to \infty$.
Thus, for any
, there exists an
such that for all
, we have
Setting , we have the following:
To prove property (3), noticing that
where the second inequality follows from Equation (
19),
Finally, for sufficiently large
n,
,
where the second inequality follows from Definition 2. Therefore,
This completes the proof of the proposition. □
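For the reader's convenience, the counting device behind properties (3) and (4) can be sketched as follows, using the two-sided bound of Definition 2 and summing over sequences of length $\phi(n)$:
$$1 = \sum_{x_{a_n}^{a_n+\phi(n)-1}} p\big(x_{a_n}^{a_n+\phi(n)-1}\big) \ge \sum_{x_{a_n}^{a_n+\phi(n)-1}\in A_\epsilon^{(n)}} p\big(x_{a_n}^{a_n+\phi(n)-1}\big) \ge \big|A_\epsilon^{(n)}\big|\, 2^{-\phi(n)(H+\epsilon)},$$
which yields property (3); and, for sufficiently large $n$,
$$1-\epsilon < P\big(A_\epsilon^{(n)}\big) \le \big|A_\epsilon^{(n)}\big|\, 2^{-\phi(n)(H-\epsilon)},$$
which yields property (4) after multiplying through by $2^{\phi(n)(H-\epsilon)}$.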
Example 4. Let be i.i.d. . Let . Let and . Then we have the following:
(1) ;
(2) ;
(3) , for all n;
(4) , for sufficiently large n.
Proof. (1) By Theorem 1, the probability that the sample sequence is typical goes to 1.
(2) By the Strong Law of Large Numbers for moving average, we have
. So there exists
and
such that
for all
, and there exists
such that
for all
. So for all
,
So for any
there exists
such that
for all
; therefore,
.
(3) By the law of total probability
. Also, for
, from Theorem 1,
. Combining these two equations gives
Multiplying through by gives the result .
(4) Since from (2)
, there exists
N such that
for all
. From Theorem 1, for
,
. So, combining these two gives
Multiplying through by gives the result for sufficiently large n. □
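The pattern of the argument in parts (3) and (4) is the same counting device as in Proposition 1. Schematically, writing $A$ for the typical set and $B$ for the second set appearing in Example 4, and keeping the window length $\phi(n)$ as above, the elementary union bound
$$P(A\cap B) = 1 - P\big(A^c \cup B^c\big) \ge 1 - P\big(A^c\big) - P\big(B^c\big)$$
shows that $P(A\cap B)$ is eventually bounded below by a fixed positive constant, say $\tfrac{1}{2}$, and then
$$\tfrac{1}{2} \le P(A\cap B) = \sum_{x\in A\cap B} p(x) \le \big|A\cap B\big|\, 2^{-\phi(n)(H-\epsilon)} \quad\Longrightarrow\quad \big|A\cap B\big| \ge \tfrac{1}{2}\, 2^{\phi(n)(H-\epsilon)} \quad \text{for sufficiently large } n.$$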