**1. Introduction**

As noticed by Dębowski [1–3], a generalized calculus of Shannon information measures for arbitrary fields, initiated by Gelfand et al. [4] and later developed by Dobrushin [5], Pinsker [6], and Wyner [7], is useful in particular for studying the ergodic decomposition of stationary processes and its links with statistical modeling of natural language. Fulfilling this need, Dębowski [1] developed the calculus of Shannon information measures for arbitrary fields, relaxing the requirement of regular conditional probability assumed implicitly by Dobrushin [5] and Pinsker [6]. He did so unaware of the classical paper by Wyner [7], which pursued exactly the same idea, with some differences stemming from an independent motivation.

Compared to the exposition [7], the added value of paper [1] was its treatment of the continuity and invariance of Shannon information measures with respect to the completion of fields. Unfortunately, the proof of Theorem 2 in [1], which establishes this invariance and the generalized chain rule, contains some mistakes and gaps, which we discovered recently. For this reason, in this article we provide a correction, along with a few new auxiliary results that may be of independent interest. In this way, we complete the full generalization of Shannon information measures and their properties, developed step by step by Gelfand et al. [4], Dobrushin [5], Pinsker [6], Wyner [7], and Dębowski [1]. Along the way, we also revisit the linguistic motivations of our results.

The preliminaries are as follows. Fix a probability space $(\Omega, \mathcal{J}, P)$. Fields are set algebras closed under finite Boolean operations, whereas $\sigma$-fields are assumed to be closed also under countable unions and intersections. A field is called finite if it has finitely many elements. A finite partition is a finite collection of events $\{B_j\}_{j=1}^{J} \subset \mathcal{J}$ that are disjoint and whose union equals $\Omega$. The definition proposed independently by Wyner [7] and Dębowski [1] reads as follows:

**Definition 1.** *For finite partitions $\alpha = \{A_i\}_{i=1}^{I}$ and $\beta = \{B_j\}_{j=1}^{J}$ and a probability measure $P$, the entropy and mutual information are defined as*

$$H_P(\alpha) := \sum_{i=1}^{I} P(A_i) \log \frac{1}{P(A_i)}, \qquad\qquad I_P(\alpha; \beta) := \sum_{i=1}^{I} \sum_{j=1}^{J} P(A_i \cap B_j) \log \frac{P(A_i \cap B_j)}{P(A_i)P(B_j)}. \tag{1}$$

*Subsequently, for an arbitrary field $\mathcal{C}$ and finite partitions $\alpha$ and $\beta$, we define the pointwise conditional entropy and mutual information as*

$$H_P(\alpha \| \mathcal{C}) := H_{P(\cdot|\mathcal{C})}(\alpha), \qquad\qquad I_P(\alpha; \beta \| \mathcal{C}) := I_{P(\cdot|\mathcal{C})}(\alpha; \beta), \tag{2}$$

*where $P(E|\mathcal{C})$ is the conditional probability of event $E \in \mathcal{J}$ with respect to the smallest complete $\sigma$-field containing $\mathcal{C}$. Subsequently, for arbitrary fields $\mathcal{A}$, $\mathcal{B}$, and $\mathcal{C}$, the (average) conditional entropy and mutual information are defined as*

$$H_P(\mathcal{A}|\mathcal{C}) := \sup_{\alpha \subset \mathcal{A}} \mathbb{E}_P H_P(\alpha \| \mathcal{C}), \qquad\qquad I_P(\mathcal{A}; \mathcal{B}|\mathcal{C}) := \sup_{\alpha \subset \mathcal{A},\, \beta \subset \mathcal{B}} \mathbb{E}_P I_P(\alpha; \beta \| \mathcal{C}), \tag{3}$$

*where the suprema are taken over all finite partitions contained in the respective fields and $\mathbb{E}_P X := \int X \, dP$ is the expectation. Finally, we define the unconditional entropy $H_P(\mathcal{A}) := H_P(\mathcal{A}| \{\emptyset, \Omega\})$ and mutual information $I_P(\mathcal{A}; \mathcal{B}) := I_P(\mathcal{A}; \mathcal{B}| \{\emptyset, \Omega\})$, as is generally done in information theory. When the probability measure $P$ is clear from the context, we omit subscript $P$ in all the above notations.*
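To make Definition 1 concrete in the simplest setting, here is a minimal sketch in Python, assuming a finite sample space and a conditioning field generated by a finite partition, so that $P(\cdot|\mathcal{C})$ reduces to elementary conditional probabilities; the function names are ours, introduced for illustration only.

```python
import math

def entropy(P, alpha):
    """H_P(alpha) = sum_i P(A_i) log(1 / P(A_i)) for a finite partition alpha.
    P is a dict mapping outcomes to probabilities; alpha is a list of sets."""
    H = 0.0
    for A in alpha:
        p = sum(P[w] for w in A)
        if p > 0.0:
            H -= p * math.log(p)
    return H

def mutual_information(P, alpha, beta):
    """I_P(alpha; beta) = sum_{i,j} P(A_i & B_j) log[P(A_i & B_j) / (P(A_i) P(B_j))]."""
    I = 0.0
    for A in alpha:
        pA = sum(P[w] for w in A)
        for B in beta:
            pB = sum(P[w] for w in B)
            pAB = sum(P[w] for w in A & B)
            if pAB > 0.0:
                I += pAB * math.log(pAB / (pA * pB))
    return I

def conditional_entropy(P, alpha, gamma):
    """E_P H_P(alpha || C) for C generated by a finite partition gamma:
    P(.|C) is constant on each cell of gamma, so the expectation in (3)
    becomes a finite sum over the cells."""
    H = 0.0
    for C in gamma:
        pC = sum(P[w] for w in C)
        if pC > 0.0:
            P_cond = {w: (P[w] / pC if w in C else 0.0) for w in P}
            H += pC * entropy(P_cond, alpha)
    return H

# Example: two independent fair coins, Omega = {0,1}^2.
P = {w: 0.25 for w in [(0, 0), (0, 1), (1, 0), (1, 1)]}
alpha = [{(0, 0), (0, 1)}, {(1, 0), (1, 1)}]  # partition by the first coin
beta = [{(0, 0), (1, 0)}, {(0, 1), (1, 1)}]   # partition by the second coin
print(entropy(P, alpha))                      # log 2 ~ 0.6931
print(mutual_information(P, alpha, beta))     # 0.0, by independence
print(conditional_entropy(P, alpha, beta))    # log 2: conditioning does not help
```

For fields with infinitely many elements, Definition 1 replaces such finite sums by the suprema in (3), which is where the subtleties addressed in this paper begin.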

Although the above measures, called Shannon information measures, have usually been discussed for $\sigma$-fields, the defining equations (3) also make sense for fields. We observe a number of identities, such as $H(\mathcal{A}) = I(\mathcal{A}; \mathcal{A})$ and $H(\mathcal{A}|\mathcal{C}) = I(\mathcal{A}; \mathcal{A}|\mathcal{C})$. It is important to stress that Definition 1, in contrast to the earlier expositions by Dobrushin [5] and Pinsker [6], is simpler, as it applies one Radon–Nikodym derivative less, and does not require regular conditional probability, i.e., it does not demand that the conditional distribution $(P(E|\mathcal{C}))_{E \in \mathcal{J}}$ be a probability measure almost surely. In fact, the expressions on the right-hand sides of the equations in (3) are defined for all $\mathcal{A}$, $\mathcal{B}$, and $\mathcal{C}$. No problems arise when the conditional probability is not regular, since the conditional distribution $(P(E|\mathcal{C}))_{E \in \mathcal{E}}$ restricted to a finite field $\mathcal{E}$ is a probability measure almost surely [8] (Theorem 33.2).
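As a sanity check of the identity $H(\mathcal{A}) = I(\mathcal{A}; \mathcal{A})$ at the level of finite partitions, note that $P(A_i \cap A_j) = \delta_{ij} P(A_i)$, so definition (1) yields

$$I_P(\alpha; \alpha) = \sum_{i=1}^{I} \sum_{j=1}^{I} P(A_i \cap A_j) \log \frac{P(A_i \cap A_j)}{P(A_i)P(A_j)} = \sum_{i=1}^{I} P(A_i) \log \frac{1}{P(A_i)} = H_P(\alpha).$$

The general identity then follows by taking suprema in (3), since for finite partitions $\alpha, \beta \subset \mathcal{A}$ the common refinement satisfies $\alpha \vee \beta \subset \mathcal{A}$ and $I_P(\alpha; \beta) \le H_P(\alpha \vee \beta) = I_P(\alpha \vee \beta; \alpha \vee \beta)$.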

We should admit that, in the context of statistical language modeling, the respective probability space is countably generated, so a regular conditional probability is guaranteed to exist. Thus, for linguistic applications, one might think that the expositions [5,6] are sufficient, although for didactic reasons the approaches proposed by Wyner [7] and Dębowski [1] lead to a simpler and more general calculus of Shannon information measures. Yet there is a more important reason for Definition 1. Namely, to discuss the ergodic decomposition of the entropy rate and excess entropy, results highly relevant for statistical language modeling that were developed in [1] and are briefly recalled in Section 3, we need the invariance of Shannon information measures with respect to the completion of fields. Within the framework of Dobrushin [5] and Pinsker [6], such invariance does not hold for strongly nonergodic processes, which seem to arise quite naturally in statistical modeling of natural language [1–3]. Thus, the approach proposed by Wyner [7] and Dębowski [1] is in fact indispensable.

Thus, let us inspect the problem of the invariance of Shannon information measures with respect to the completion of fields. A $\sigma$-field is called complete, with respect to a given probability measure $P$, if it contains all sets of outer $P$-measure $0$. Let $\sigma(\mathcal{A})$ denote the intersection of all complete $\sigma$-fields containing a class $\mathcal{A}$, i.e., $\sigma(\mathcal{A})$ is the completion of the generated $\sigma$-field. Let $\mathcal{A} \wedge \mathcal{B}$ denote the intersection of all fields that contain both $\mathcal{A}$ and $\mathcal{B}$. Assuming Definition 1, the following statement was claimed true by Dębowski [1] (Theorem 2):
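To see that completion genuinely enlarges the generated $\sigma$-field, consider $P$ being the Lebesgue measure on $\Omega = [0,1]$: the completion of the Borel $\sigma$-field is the strictly larger Lebesgue $\sigma$-field, which contains all subsets of Borel sets of measure zero.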

**Theorem 1.** *Let $\mathcal{A}$, $\mathcal{B}$, $\mathcal{C}$, and $\mathcal{D}$ be subfields of $\mathcal{J}$. Then:*

*1. $I(\mathcal{A}; \mathcal{B}|\mathcal{C}) = I(\mathcal{A}; \sigma(\mathcal{B})|\mathcal{C}) = I(\mathcal{A}; \mathcal{B}|\sigma(\mathcal{C}))$ (invariance of completion);*

*2. $I(\mathcal{A}; \mathcal{B} \wedge \mathcal{C}|\mathcal{D}) = I(\mathcal{A}; \mathcal{B}|\mathcal{D}) + I(\mathcal{A}; \mathcal{C}|\mathcal{B} \wedge \mathcal{D})$ (chain rule).*

The property stated in Theorem 1.1 will be referred to as the invariance of completion; it was not discussed by Wyner [7]. The property stated in Theorem 1.2 is usually referred to as the chain rule or the polymatroid identity; it was proved independently by Wyner [7].
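When all the fields involved are generated by finite partitions of a finite sample space, the chain rule of Theorem 1.2 reduces to the familiar identity $I(X; (Y,Z)|W) = I(X; Y|W) + I(X; Z|(Y,W))$ for discrete random variables. The following self-contained Python sketch (the helper `cond_mi` is ours, not from any library) verifies this reduction numerically on a random joint distribution.

```python
import itertools
import math
import random

def cond_mi(p, X, Y, Z):
    """I(X;Y|Z) = sum_{x,y,z} p(x,y,z) log[p(z) p(x,y,z) / (p(x,z) p(y,z))]
    for a joint pmf p over outcome tuples; X, Y, Z are coordinate index tuples."""
    def marg(idx):
        # Marginal pmf of the selected coordinates.
        m = {}
        for w, pw in p.items():
            key = tuple(w[i] for i in idx)
            m[key] = m.get(key, 0.0) + pw
        return m
    pXYZ = marg(X + Y + Z)
    pXZ, pYZ, pZ = marg(X + Z), marg(Y + Z), marg(Z)
    nx, ny = len(X), len(Y)
    total = 0.0
    for key, pxyz in pXYZ.items():
        if pxyz > 0.0:
            x, y, z = key[:nx], key[nx:nx + ny], key[nx + ny:]
            total += pxyz * math.log(pZ[z] * pxyz / (pXZ[x + z] * pYZ[y + z]))
    return total

# A random joint distribution of four binary coordinates (A, B, C, D).
random.seed(0)
outcomes = list(itertools.product(range(2), repeat=4))
weights = [random.random() for _ in outcomes]
s = sum(weights)
p = {w: wt / s for w, wt in zip(outcomes, weights)}

A, B, C, D = (0,), (1,), (2,), (3,)
lhs = cond_mi(p, A, B + C, D)                        # I(A; B ^ C | D)
rhs = cond_mi(p, A, B, D) + cond_mi(p, A, C, B + D)  # I(A; B | D) + I(A; C | B ^ D)
assert abs(lhs - rhs) < 1e-9                         # the chain rule holds
```

The content of Theorem 1.2 is, of course, that this identity persists for arbitrary fields, where the suprema in (3) need not be attained by any finite partitions.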

As we have mentioned, the invariance of completion is crucial for proving the ergodic decomposition of the entropy rate and excess entropy of stationary processes. However, the proof of the invariance of completion given by Dębowski [1] contains a mistake in the order of quantifiers, and the respective proof of the chain rule is too laconic and contains a gap. For this reason, we supply corrected proofs in this article. As we have mentioned, the chain rule was proved by Wyner [7], using an approximation result by Dobrushin [5] and Pinsker [6]. For completeness, we provide a different proof of this approximation result, which follows easily from the invariance of completion, and we supply proofs of both parts of Theorem 1.

The corrected proofs of Theorem 1, presented in Section 2, are much longer than the original proofs by Dębowski [1]. In particular, on the way to proving Theorem 1, we discuss a few other approximation results, which seem to be of independent interest. To provide more context for our statements, in Section 3 we also recall the ergodic decomposition of excess entropy and its application to statistical language modeling.
