1. Introduction
As noticed by Dębowski [1,2,3], a generalized calculus of Shannon information measures for arbitrary fields—initiated by Gelfand et al. [4] and later developed by Dobrushin [5], Pinsker [6], and Wyner [7]—is useful in particular for studying the ergodic decomposition of stationary processes and its links with statistical modeling of natural language. Fulfilling this need, Dębowski [1] developed the calculus of Shannon information measures for arbitrary fields, relaxing the requirement of regular conditional probability, which was assumed implicitly by Dobrushin [5] and Pinsker [6]. He did so unaware of the classical paper by Wyner [7], which pursued exactly the same idea, with some differences stemming from an independent motivation.
Compared to the exposition [7], the added value of the paper [1] was the consideration of continuity and invariance of Shannon information measures with respect to completion of fields. Unfortunately, the proof of Theorem 2 in [1], which establishes this invariance and the generalized chain rule, contains some mistakes and gaps, which we have discovered recently. For this reason, in this article, we would like to provide a correction and a few new auxiliary results which may be of independent interest. In this way, we will complete the full generalization of Shannon information measures and their properties, which was developed step by step by Gelfand et al. [4], Dobrushin [5], Pinsker [6], Wyner [7], and Dębowski [1]. Along the way, we will also revisit the linguistic motivations of our results.
The preliminaries are as follows. Fix a probability space. Fields are set algebras closed under finite Boolean operations, whereas σ-fields are assumed to be closed also under countable unions and intersections. A field is called finite if it has finitely many elements. A finite partition is a finite collection of disjoint events whose union equals the whole space. The definition proposed by Wyner [7] and Dębowski [1] independently reads as follows:
Definition 1. For finite partitions α and β and a probability measure P, the entropy and mutual information are defined by the usual finite sums. Subsequently, for an arbitrary field and finite partitions α and β, we define the pointwise conditional entropy and mutual information by substituting, into these sums, the conditional probabilities of events with respect to the smallest complete σ-field containing the conditioning field. Subsequently, for arbitrary fields, the (average) conditional entropy and mutual information are defined as the suprema, taken over all finite subpartitions, of the expectations of the respective pointwise quantities. Finally, we define the unconditional entropy and mutual information, as it is generally done in information theory. When the probability measure P is clear from the context, we omit the subscript P from all the above notations.
Although the above measures, called Shannon information measures, have usually been discussed for σ-fields, the defining equations (3) also make sense for fields. We observe a number of identities linking these quantities. It is important to stress that Definition 1, in contrast to the earlier expositions by Dobrushin [5] and Pinsker [6], is simpler—as it applies one Radon–Nikodym derivative less—and does not require regular conditional probability, i.e., it does not demand that the conditional distribution be a probability measure almost surely. In fact, the expressions on the right-hand sides of the equations in (3) are defined for arbitrary fields. No problems arise when conditional probability is not regular, since the conditional distribution restricted to a finite field is a probability measure almost surely [8] (Theorem 33.2).
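For the reader's convenience, the formulas behind Definition 1 presumably take the following standard form, which matches the verbal description above; the calligraphic symbols for the fields are our own choice of notation:
$$H_P(\alpha) := -\sum_{A\in\alpha} P(A)\log P(A), \qquad I_P(\alpha;\beta) := \sum_{A\in\alpha}\sum_{B\in\beta} P(A\cap B)\log\frac{P(A\cap B)}{P(A)P(B)},$$
$$H_P(\alpha\mid\mathcal{C}) := -\sum_{A\in\alpha} P(A\mid\mathcal{C})\log P(A\mid\mathcal{C}), \qquad I_P(\alpha;\beta\mid\mathcal{C}) := \sum_{A\in\alpha}\sum_{B\in\beta} P(A\cap B\mid\mathcal{C})\log\frac{P(A\cap B\mid\mathcal{C})}{P(A\mid\mathcal{C})\,P(B\mid\mathcal{C})},$$
$$H_P(\mathcal{A}\mid\mathcal{C}) := \sup_{\alpha\subset\mathcal{A}} \mathbf{E}_P\, H_P(\alpha\mid\mathcal{C}), \qquad I_P(\mathcal{A};\mathcal{B}\mid\mathcal{C}) := \sup_{\alpha\subset\mathcal{A},\,\beta\subset\mathcal{B}} \mathbf{E}_P\, I_P(\alpha;\beta\mid\mathcal{C}),$$
where $P(\,\cdot\mid\mathcal{C})$ denotes conditional probability with respect to the smallest complete σ-field containing $\mathcal{C}$, $\mathbf{E}_P$ denotes expectation, and the suprema range over finite subpartitions of the respective fields.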
We should admit that in the context of statistical language modeling, the respective probability space is countably generated, so regular conditional probability is guaranteed to exist. Thus, for linguistic applications, one might think that the expositions [5,6] are sufficient, although for didactic reasons the approaches proposed by Wyner [7] and Dębowski [1] are still preferable, since they lead to a simpler and more general calculus of Shannon information measures. Yet, there is a more important reason for Definition 1. Namely, to discuss the ergodic decomposition of the entropy rate and excess entropy—some highly relevant results for statistical language modeling, developed in [1] and briefly recalled in Section 3—we need the invariance of Shannon information measures with respect to completion of fields. But within the framework of Dobrushin [5] and Pinsker [6], such invariance of completion does not hold for strongly nonergodic processes, which seem to arise quite naturally in statistical modeling of natural language [1,2,3]. Thus, the approach proposed by Wyner [7] and Dębowski [1] is in fact indispensable.
Thus, let us inspect the problem of invariance of Shannon information measures with respect to completion of fields. A σ-field is called complete, with respect to a given probability measure P, if it contains all sets of outer P-measure 0. For a class of events, let its completion denote the intersection of all complete σ-fields containing this class, i.e., the completion of the generated σ-field. Moreover, let the join of two fields denote the intersection of all fields that contain both of them. Assuming Definition 1, the following statement was claimed to hold by Dębowski [1] (Theorem 2):
Theorem 1. For arbitrary subfields of the underlying σ-field, the following properties hold:
- 1. (invariance of completion);
- 2. (chain rule).
The property stated in Theorem 1.1 will be referred to as the invariance of completion. It was not discussed by Wyner [7]. The property stated in Theorem 1.2 is usually referred to as the chain rule or the polymatroid identity. It was proved independently by Wyner [7].
As we have mentioned, the invariance of completion is crucial for proving the ergodic decomposition of the entropy rate and excess entropy of stationary processes. But the proof of the invariance of completion given by Dębowski [1] contains a mistake in the order of quantifiers, and the respective proof of the chain rule is too laconic and contains a gap. For this reason, we would like to supply the corrected proofs in this article. As we have mentioned, the chain rule was proved by Wyner [7], using an approximation result by Dobrushin [5] and Pinsker [6]. For completeness, we would like to provide a different proof of this approximation result—which follows easily from the invariance of completion—and to supply proofs of both parts of Theorem 1.
The corrected proofs of Theorem 1, to be presented in Section 2, are much longer than the original proofs by Dębowski [1]. In particular, for the sake of proving Theorem 1, we will discuss a few other approximation results, which seem to be of independent interest. To provide more context for our statements, in Section 3, we will also recall the ergodic decomposition of excess entropy and its application to statistical language modeling.
2. Proofs
For an increasing sequence of fields, let us introduce notation for its limit, understood as the union of all fields in the sequence. (Such a limit need not be a σ-field.) Our proof of Theorem 1 will rest on a few approximation results and on the following statement by Dębowski [1] (Theorem 1):
Theorem 2. For arbitrary subfields of the underlying σ-field, the following properties hold:
- 1. ;
- 2. , with the equality if and only if the respective condition holds almost surely;
- 3. ;
- 4. , under the stated inclusion of fields;
- 5. , for the stated sequence of fields.
For events A and B, let us denote the symmetric difference A △ B = (A \ B) ∪ (B \ A). The symmetric difference satisfies a number of identities, which will be used below. Moreover, we will apply the Bonferroni inequalities and one further elementary inequality.
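Although the exact displays are not reproduced here, the facts presumably intended are the following standard ones, stated for completeness:
$$A\triangle C \subseteq (A\triangle B)\cup(B\triangle C), \qquad A^{c}\triangle B^{c} = A\triangle B, \qquad \Big(\bigcup_i A_i\Big)\triangle\Big(\bigcup_i B_i\Big) \subseteq \bigcup_i (A_i\triangle B_i),$$
the Bonferroni inequalities
$$\sum_i P(A_i) - \sum_{i<j} P(A_i\cap A_j) \;\le\; P\Big(\bigcup_i A_i\Big) \;\le\; \sum_i P(A_i),$$
and the inequality $|P(A)-P(B)| \le P(A\triangle B)$.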
In the following, we will derive the necessary approximation results. Our point of departure is the following folklore fact.
Theorem 3 (approximation of σ-fields). For any field and any event A in the completion of the σ-field generated by this field, there is a sequence of events in the field whose symmetric differences with A have probabilities tending to 0, as stated in condition (10).
Proof. Consider the class of sets G that satisfy (10). It is sufficient to show that this class is a complete σ-field that contains the given field. Clearly, all elements of the field satisfy (10), so the field is contained in the class. Now, we verify the conditions for the class to be a σ-field.
The whole space belongs to the field and hence to the class. For an event in the class, consider its approximating events in the field. Then the complements of these events also belong to the field and approximate the complement of the given event in the same sense, since the symmetric difference of two sets equals the symmetric difference of their complements. Hence, the complement of the given event belongs to the class.
For a countable sequence of events in the class, consider their union. Each of these events can be approximated by events from the field, and a finite union of the initial events approximates the full union arbitrarily well. Moreover, the symmetric difference of two unions is contained in the union of the symmetric differences. Hence, combining both approximations, we obtain events of the field whose symmetric difference with the full union has probability which tends to 0 for n going to infinity. Since the field is closed under finite unions, we thus obtain that the countable union belongs to the class.
Completeness of the constructed σ-field is straightforward since, for any event that differs from A by a set of outer measure zero, we obtain (10) using the same sequence of approximating events in the field as for event A. □
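Since the displayed computation is not reproduced above, here is a minimal sketch of the countable-union step under the natural reading of the argument; the symbols $A_k$, $G_{k,n}$, $H_n$, and $m$ are ours. For $A=\bigcup_{k} A_k$ with each $A_k$ in the class, choose events $G_{k,n}$ in the field with $P(A_k\triangle G_{k,n})$ small, and put $H_{n} := \bigcup_{k\le m} G_{k,n}$, which belongs to the field as a finite union. Then
$$P(A\triangle H_{n}) \;\le\; P\Big(A\triangle \bigcup_{k\le m} A_k\Big) + P\Big(\Big(\bigcup_{k\le m} A_k\Big)\triangle \bigcup_{k\le m} G_{k,n}\Big) \;\le\; P\Big(\bigcup_{k> m} A_k\Big) + \sum_{k\le m} P(A_k\triangle G_{k,n}),$$
which can be made arbitrarily small by first taking $m$ large and then the approximation of each $A_k$ tight enough.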
The second approximation result is the following bound:
Theorem 4 (continuity of entropy). Fix an ε > 0 and a field. For finite partitions α and β whose corresponding cells have symmetric differences of probability at most ε, the difference between the conditional entropies of α and β given the field can be bounded explicitly in terms of ε.
Proof. We first bound the relevant expectation. Hence, by the Markov inequality we obtain a corresponding bound in probability. Denote by B the event on which the conditional probabilities of corresponding cells are close. From the Bonferroni inequality, we obtain a lower bound on the probability of B. Subsequently, we observe that a pointwise bound holds almost surely. Hence, the pointwise conditional entropies can be compared cell by cell. The function defining the entropy terms is subadditive and increasing for small arguments. In particular, its increments are bounded by its value at the increment of the argument, for arguments in the appropriate range. Thus, on the event B we obtain bound (18). Plugging (18) into (17) yields the claim. □
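The function invoked in this proof, and again in the proof of Theorem 6, is presumably $\eta(x) = -x\log x$; for reference, the standard properties used are
$$\eta(x+y) \le \eta(x)+\eta(y) \quad\text{for } x,y\ge 0,\ x+y\le 1, \qquad \eta \text{ is increasing on } [0, e^{-1}],$$
and, consequently, $|\eta(x)-\eta(y)| \le \eta(|x-y|)$ for $x,y\in[0,1]$ with $|x-y|\le 1/2$.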
Now, we can prove the invariance of completion.
Proof of Theorem 1.1 (invariance of completion): Consider some fields contained in the underlying σ-field. We are going to demonstrate the asserted chain of equalities. One of the equalities is straightforward since, by Definition 1, the conditional probabilities with respect to a field and with respect to its completion coincide almost surely for all events. It remains to prove the other equality. For this goal, it suffices to show that for any ε > 0 and any finite partitions contained in the completed fields there exists a finite partition contained in the respective original field such that the two partitions are close cell by cell.
Fix then some ε > 0 and finite partitions as above. Invoking Theorem 3, we know that for each cell there exists an approximating event in the original field, so we obtain a class of sets, which need not be a partition, such that the probability of each symmetric difference is at most ε. Let us turn this class into a partition by taking successive set differences of the approximating events and adding the remainder of their union as the last cell. In this way, we obtain a finite partition contained in the original field.
The next step of the proof is showing an analogue of bound (22) for the original partition and the constructed one. To begin, we bound the symmetric difference of each constructed cell and the corresponding original cell. Now, we observe that, for distinct indices, the pairwise intersections of the approximating events are contained in the unions of the corresponding symmetric differences. Hence, by the Bonferroni inequality we derive a bound on the probability of the remainder cell. Resuming our bounds, we obtain that all corresponding cells of the two partitions differ by events of small probability. Then, invoking Theorem 4 yields the required closeness of the respective information measures. Taking ε sufficiently small, we obtain (21), which is the desired claim. □
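For concreteness, a plausible reconstruction of the construction and of the bounds indicated above, under the assumption that it follows the standard disjointification pattern (the symbols $C_i$, $G_i$, $D_i$, and $k$ are ours): given a finite partition $\{C_1,\dots,C_k\}$ of the completed field and events $G_i$ of the original field with $P(C_i\triangle G_i)\le\varepsilon$, put
$$D_i := G_i\setminus\bigcup_{j<i}G_j \quad (1\le i\le k), \qquad D_{k+1} := \Omega\setminus\bigcup_{j\le k}G_j,$$
so that $\{D_1,\dots,D_{k+1}\}$ is a finite partition contained in the original field. Since the $C_i$ are disjoint, $G_i\cap G_j\subseteq (C_i\triangle G_i)\cup(C_j\triangle G_j)$ for $i\ne j$, whence $P(G_i\cap G_j)\le 2\varepsilon$. Consequently,
$$P(C_i\triangle D_i)\;\le\;P(C_i\triangle G_i)+\sum_{j<i}P(G_i\cap G_j)\;\le\;\varepsilon+2(k-1)\varepsilon,$$
and, by the Bonferroni inequality,
$$P(D_{k+1})\;=\;1-P\Big(\bigcup_{i\le k}G_i\Big)\;\le\;1-\sum_{i\le k}P(G_i)+\sum_{i<j}P(G_i\cap G_j)\;\le\;k\varepsilon+k(k-1)\varepsilon,$$
since $P(G_i)\ge P(C_i)-\varepsilon$ and $\sum_i P(C_i)=1$.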
A consequence of the above result is the following approximation result, proved by Dobrushin [5] and Pinsker [6] and used by Wyner [7] to demonstrate the chain rule. Applying the invariance of completion, we supply a different proof than Dobrushin [5] and Pinsker [6].
Theorem 5 (split of join). Let four fields be subfields of the underlying σ-field. Then the conditional mutual information whose argument is the join of two of these fields equals the supremum, taken over all finite subpartitions of the two joined fields, of the conditional mutual information with the join of those subpartitions in place of the joined fields.
Proof. Define the class of finite unions of intersections of events taken from the two joined fields. It can be easily verified that this class is a field whose completion coincides with the completion of the join of the two fields. Thus, for all finite partitions of the two joined fields, the join of these partitions is a finite partition contained in this class. Moreover, by the definition of this class, for each finite partition contained in it there exist finite partitions of the two joined fields whose join is finer than the given partition. Hence, by Theorem 2.4, we obtain in this case the required comparison of the respective mutual informations. In consequence, by Theorem 1.1, we obtain the claim. □
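In the notation of Definition 1, the identity asserted by Theorem 5 presumably reads as follows; this is our reconstruction, with $\mathcal{A}$, $\mathcal{B}$, $\mathcal{C}$, $\mathcal{D}$ denoting the four fields and $\vee$ denoting the join:
$$I(\mathcal{A};\mathcal{B}\vee\mathcal{C}\mid\mathcal{D}) \;=\; \sup_{\beta\subset\mathcal{B},\ \gamma\subset\mathcal{C}} I(\mathcal{A};\beta\vee\gamma\mid\mathcal{D}),$$
where the supremum ranges over finite subpartitions $\beta$ and $\gamma$ and $\beta\vee\gamma$ is their coarsest common refinement.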
The final approximation result which we need to prove the chain rule is as follows:
Theorem 6 (convergence of conditioning). Let a finite partition and a field be given. For each ε > 0, there exists a finite partition such that, for any partition finer than it, the conditional entropy of the given finite partition with respect to the finer partition approximates, to within ε, the conditional entropy with respect to the field.
Proof. Fix an ε > 0. For each event of the given finite partition and each n, a certain auxiliary partition is finite and belongs to the appropriate field. If we consider the join of these partitions over all events of the given finite partition, it remains finite and still satisfies the same property. Let a partition be finer than this join. Then, the relevant conditional probabilities are suitably close almost surely for all events of the given finite partition. We also observe a corresponding pointwise bound on the entropy terms. We recall that the function defining the entropy terms is subadditive and increasing for small arguments. In particular, its increments are bounded by its value at the increment of the argument, for arguments in the appropriate range. Hence, for the partitions under consideration we obtain the required bound almost surely. Taking n so large that the approximation error is sufficiently small yields the claim. □
Taking the above into account, we can demonstrate the chain rule. Our proof essentially follows the ideas of Wyner [7], except for invoking Theorem 6.
Proof of Theorem 1.2 (chain rule): Let four arbitrary fields and four corresponding finite partitions be given. The point of our departure is the chain rule (38) for finite partitions [9] (Equation 2.60). By Definition 1 and Theorems 1.1, 5, and 6, conditional mutual information between fields can be approximated by conditional mutual information between finite partitions, where we take appropriate limits of refined finite partitions with a certain care. In particular, by Theorems 1.1, 5, and 6, taking sufficiently fine finite partitions of the arbitrary fields, the chain rule (38) for finite partitions implies its analogue in which some arguments are fields, where all expressions are finite. Hence, we also obtain a further analogue, where all expressions are finite. Having established the above claim for a finite partition in the remaining argument, we generalize it to an arbitrary field in that argument, taking its appropriately fine finite partitions. □
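For reference, the chain rule (38) for finite partitions, i.e., [9] (Equation 2.60) restated for partitions, presumably takes the following form; the assignment of roles to the four partitions is our guess:
$$I(\alpha;\beta\vee\gamma\mid\delta) \;=\; I(\alpha;\gamma\mid\delta) + I(\alpha;\beta\mid\gamma\vee\delta),$$
where $\vee$ denotes the coarsest common refinement of finite partitions.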
3. Applications
This section borrows its statements largely from Dębowski [1,2,3] and is provided only to sketch some context for our research and to justify its applicability to statistical language modeling. Let a two-sided infinite stationary process over a countable alphabet be given on a probability space. We denote the random blocks of consecutive variables and the complete σ-fields generated by them. By the generalized calculus of Shannon information measures, i.e., Theorems 1 and 2, we can define the entropy rate and the excess entropy of the process; see [10] for more background.
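Although the defining formulas are not reproduced above, the entropy rate and the excess entropy of a stationary process $(X_i)_{i\in\mathbb{Z}}$ are standardly given by (our notation)
$$h \;=\; \lim_{n\to\infty}\frac{H(X_1,\dots,X_n)}{n}, \qquad E \;=\; \lim_{n\to\infty} I(X_{-n+1},\dots,X_0;\;X_1,\dots,X_n),$$
where both limits exist by stationarity, the first by subadditivity and the second by monotonicity.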
Let the shift operation and the shift-invariant σ-field be given. By the Birkhoff ergodic theorem [11], the invariant σ-field is contained, up to sets of measure zero, in the tail σ-fields of the past and of the future. Hence, by Theorems 1 and 2 we further obtain expressions for the entropy rate and the excess entropy that involve the invariant σ-field. Denoting the conditional probability given the invariant σ-field, which is a random stationary ergodic measure by the ergodic decomposition theorem [12], we notice the resulting identifications and consequently obtain the ergodic decomposition of the entropy rate and excess entropy. The respective formulae (45) and (46) were derived by Gray and Davisson [13] and Dębowski [1]. The ergodic decomposition of the entropy rate (45) states that a stationary process is asymptotically deterministic, i.e., its entropy rate vanishes, if and only if almost all its ergodic components are asymptotically deterministic, i.e., their entropy rates vanish almost surely. In contrast, the ergodic decomposition of the excess entropy (46) states that a stationary process is infinitary, i.e., its excess entropy is infinite, if some of its ergodic components are infinitary, i.e., their excess entropies are infinite with a nonzero probability, or if an additional term in (46) is infinite, which happens in particular if the process is strongly nonergodic; see [14,15].
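For orientation, the entropy-rate part of this decomposition, due to Gray and Davisson [13], presumably takes the form (our notation, with $F$ denoting the random ergodic component, i.e., the conditional probability given the invariant σ-field)
$$h_P \;=\; \mathbf{E}_P\, h_F,$$
i.e., the entropy rate of a stationary process equals the average of the entropy rates of its ergodic components.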
The linguistic interpretation of the above results is as follows. There is a hypothesis by Hilberg [16] that the excess entropy of natural language is infinite. This hypothesis can be partly confirmed by the original estimates of conditional entropy by Shannon [17], by the power-law decay of the estimates of the entropy rate given by the PPM compression algorithm [18], by the approximately power-law growth of vocabulary called Heaps' or Herdan's law [2,3,19,20], and by some other experiments applying neural statistical language models [21,22]. In parallel, Dębowski [1,2,3] supposed that the very large excess entropy of natural language may be caused by the fact that texts in natural language describe some relatively slowly evolving and very complex reality. Indeed, it can be mathematically proved that if the abstract reality described by random texts is unchangeable and infinitely complex, then the resulting stochastic process is strongly nonergodic [1,2,3]. Consequently, its excess entropy is infinite by formula (46). We suppose that a similar mechanism may work for natural language; see [23,24,25,26] for further examples of abstract stochastic mechanisms leading to infinitary processes.