1. Introduction
Let P be a continuous probability distribution on the real axis with density p(x). Its entropy is defined as
h(P) = −∫ p(x) ln p(x) dx.     (1)
What is the substantive sense of h(P)? More precisely, does there exist a mathematical object whose natural quantitative magnitude (e.g., volume) is a certain function of the entropy?
Traditionally, entropy is treated as a measure of disorder. However, this explanation does not answer the question stated above because it does not establish a relationship between entropy and any other quantitative characteristic of disorder that could be defined and measured independently of the entropy.
To illustrate the problem, consider the entropy of a discrete distribution P = (p_1, …, p_k),
H(P) = −∑_i p_i ln p_i.     (2)
Its substantive meaning is well known. Namely, let A = {a_1, …, a_k} be a finite alphabet. Then, the set of those words of length n in which every letter a_i occurs with mean frequency close to p_i has cardinality of order e^{nH(P)} (this follows from the Shannon–McMillan–Breiman theorem (see [1,2])). Thus, the entropy of a discrete distribution determines the exponential rate for the number of those words of length n in which letters occur with prescribed frequencies.
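For instance, for a two-letter alphabet with p_1 = p_2 = 1/2 we get H(P) = ln 2, so that almost all of the 2^n words of length n are "typical"; for the biased distribution p_1 = 0.9, p_2 = 0.1 the entropy drops to H(P) ≈ 0.325, and only about e^{0.325 n} of the 2^n words have the prescribed frequencies.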
Can we say anything of that sort about the entropy of a continuous distribution? It turns out—yes. Indeed, from Theorem 3 stated below, it follows that entropy (1) determines the exponential rate for the Lebesgue measure of the set of sequences x_1, …, x_n of length n that generate empirical measures on the real axis close to P. The proximity of distributions should be understood here in the sense of a fine topology, which is defined in the same way as the weak topology, but with the use of integrable functions instead of bounded ones.
For example, if P is the exponential distribution with density p(x) = λe^{−λx}, x ≥ 0, then h(P) = 1 − ln λ, and so the set of sequences x_1, …, x_n of length n that generate empirical measures close to P (in the fine topology) has Lebesgue measure of order e^{n(1 − ln λ)} = (e/λ)^n.
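This value of the entropy is obtained by a direct computation: since ln(λe^{−λx}) = ln λ − λx,
h(P) = −∫_0^∞ λe^{−λx} (ln λ − λx) dx = −ln λ + λ·(1/λ) = 1 − ln λ,
because the exponential distribution has total mass 1 and mean 1/λ.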
Another example: for the Gaussian distribution P with density p(x) = (1/(σ√(2π))) e^{−(x−a)²/(2σ²)}, we get h(P) = ln(σ√(2πe)), and the set of sequences x_1, …, x_n of length n that generate empirical measures close to P (in the fine topology) has Lebesgue measure of order (σ√(2πe))^n.
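Again, this is a direct computation: ln p(x) = −ln(σ√(2π)) − (x − a)²/(2σ²), hence
h(P) = ln(σ√(2π)) + E(x − a)²/(2σ²) = ln(σ√(2π)) + 1/2 = ln(σ√(2πe)).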
These examples are based on the presentation of entropy (1) in the form h(P) = −ρ(P, Q), where Q is the Lebesgue measure on the real axis and ρ(P, Q) is the Kullback–Leibler information function
ρ(P, Q) = ∫ ln (dP/dQ) dP,
as well as on a certain generalization of the so-called local large deviation principle.
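Indeed, when Q is the Lebesgue measure, the density dP/dQ is just p(x), so that ρ(P, Q) = ∫ p(x) ln p(x) dx = −h(P); in the two examples above, ρ(P, Q) equals ln λ − 1 and −ln(σ√(2πe)), respectively.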
Let P and Q be two probability distributions on a space X. Roughly speaking, the local large deviation principle asserts that the measure of the set of sequences of length n that generate empirical measures close to P has exponential order e^{−nρ(P, Q)}, provided ρ(P, Q) is finite.
As far as we know, this principle was first proven by Sanov for a pair of continuous probability distributions on the real axis in [3]. Later, it was extended to general metric spaces (see, for example, [4,5,6,7]), abstract measurable spaces (see [8,9,10]), and spaces of trajectories of various stochastic processes (see [11,12,13,14,15,16,17,18,19]).
It should be mentioned that different authors have called the function ρ(P, Q) by different names: the Kullback–Leibler information function [4], the relative entropy [6], the rate function [5,7,15], the Kullback–Leibler divergence, the action functional [16], and the Kullback–Leibler distance [20] (though, of course, it is nonsymmetric and hence not a metric at all). For brevity, in the sequel, we prefer the term "Kullback action" to any of those listed above.
Until recently, the Kullback action and the local large deviation principle were studied only in the case when both arguments P, Q were probability distributions. Only recently, in the papers [9,10], was the measure Q allowed to be merely finite and positive, while the measure P was allowed to be finitely additive and, moreover, real-valued. Unfortunately, this is still insufficient for the interpretation of entropy (1) because the Lebesgue measure on the real axis is infinite. Therefore, it is highly desirable to define the Kullback action properly and to obtain a generalization of the local large deviation principle for infinite measures Q. Our main result is the solution of this problem.
It turns out that at least two different ways of generalization are possible. The first approach is based on the use of the fine topology in the space of probability distributions; it is presented in Theorem 3. In the second approach, the whole space X is replaced by a suitable subset Y of finite measure Q, and the distribution P is replaced by its conditional distribution on Y. Thereby, the problem reduces to the case of finite measures. This approach is implemented in Theorems 4 and 5.
In fact, it makes sense to consider finitely additive probability distributions P as well, since some sequences of empirical measures may converge to finitely additive distributions. In such a case, the Kullback action can take only the values −∞ or +∞ (Theorem 6). The corresponding versions of the large deviation principle for finitely additive measures P are presented in Theorems 7 and 8.
First results on the large deviation principle for infinite measures were obtained in [21,22], where a countable set X and the "counting" measure Q (such that Q({x}) = 1 for all x ∈ X) were considered. In such a case, the Kullback action ρ(P, Q) coincides (up to the sign) with entropy (2). It was revealed in [21,22] that, for the "counting" measure Q on the countable space X, the ordinary form of the large deviation principle, formulated in terms of the weak topology, fails, and so one should use the fine topology instead.
The paper is organized as follows. In the next section, we recall the local large deviation principle for finite measures (Theorem 1). In Section 3, we define the Kullback action ρ(ν, μ) as the Legendre dual functional to the so-called spectral potential λ(g, μ) and formulate two variants of the large deviation principle for the case of a σ-finite measure μ (Theorems 3–5). These theorems are proven in Section 4, Section 5, Section 6 and Section 7. In Section 8, we formulate two variants of the large deviation principle for σ-finite measures μ and finitely additive probability distributions ν (Theorems 7 and 8). Theorem 6 states that, in fact, ρ(ν, μ) turns into −∞ or +∞ if the measure ν has no density with respect to μ. It is proven in Section 9. The final Section 10 contains proofs of Theorems 7 and 8.
2. The Kullback Action for Finite Measures
Let us consider an arbitrary set X supplied with a σ-field of its subsets. In what follows, by "measures" we mean only nonnegative measures on this measurable space.
We will use the following notation:
B(X) — all bounded measurable functions f: X → ℝ;
M(X) — all finite measures on X;
P(X) — all probability measures (distributions) on X;
M_σ(X) — all σ-finite measures on X.
Suppose that ν, μ are measures on X and the measure ν is absolutely continuous with respect to μ. Then, by the Radon–Nikodym theorem, ν can be presented in the form ν = φμ, where φ is a nonnegative measurable function, which is called the density of ν with respect to μ and is denoted as dν/dμ. This function is uniquely defined up to a set of zero measure μ.
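For example, if μ is the Lebesgue measure on the real axis and ν is an absolutely continuous probability distribution with density p (as in the Introduction), then dν/dμ = p.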
The Kullback action is a function of a probability measure ν ∈ P(X) and a finite measure μ ∈ M(X) defined in the following way: if ν is absolutely continuous with respect to μ, then
ρ(ν, μ) = ∫_X φ ln φ dμ,  where φ = dν/dμ,     (4)
and ρ(ν, μ) = +∞ otherwise. In (4), we set φ ln φ = 0 for φ = 0. Therefore, ρ(ν, μ) belongs to the interval [−ln μ(X), +∞].
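As a simple illustration, let X = {1, 2}, let μ be the measure with μ({1}) = μ({2}) = 1/2, and let ν = (q, 1 − q). Then φ(1) = 2q, φ(2) = 2(1 − q), and ρ(ν, μ) = q ln(2q) + (1 − q) ln(2(1 − q)); this expression vanishes only at q = 1/2 (i.e., at ν = μ), is positive otherwise, and equals ln 2 at q = 0 and q = 1.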
With each finite sequence x_1, …, x_n ∈ X, we associate an empirical measure δ_{x_1,…,x_n} that is supported on the set {x_1, …, x_n} and assigns to each point x_i the measure 1/n. The expectation of any function f: X → ℝ with respect to this empirical measure looks like
∫_X f dδ_{x_1,…,x_n} = (f(x_1) + ⋯ + f(x_n))/n.
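In particular, for f(x) = x this expectation is the ordinary sample mean (x_1 + ⋯ + x_n)/n.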
Let us fix any probability measure μ ∈ P(X). If the points x_1, …, x_n are treated as independent random variables with common distribution μ, then the empirical measure δ_{x_1,…,x_n} becomes a random variable itself, taking values in P(X). We will be interested in the asymptotics of its distribution. It turns out that, at a first approximation, this asymptotics is exponential with the exponent −nρ(ν, μ).
To describe the asymptotics of the empirical measures distribution, we need two topologies on the space P(X). The first one is the weak topology generated by neighborhoods of the form
O(ν) = { ν′ ∈ P(X) : |∫ f_i dν′ − ∫ f_i dν| < ε, i = 1, …, m },     (5)
where ε > 0 and f_1, …, f_m ∈ B(X). The second topology is generated by neighborhoods of the same form (5) but with arbitrary measurable functions f_1, …, f_m therein that are integrable with respect to ν. In addition, it is supposed in this case that O(ν) contains only those measures ν′ for which all the integrals ∫ f_i dν′ do exist. This topology will be referred to as the fine topology. It is useful because it enables us to formulate the usual law of large numbers in the next form: for any probability distribution μ ∈ P(X), the sequence of empirical measures δ_{x_1,…,x_n} converges to μ in probability in the fine topology. On the other hand, a shortcoming of the fine topology is the fact that, with respect to it, the affine map t ↦ (1 − t)ν_0 + tν_1, where ν_0, ν_1 ∈ P(X), may be discontinuous at the ends of the segment [0, 1].
It is easy to see that the fine topology on P(X) contains the weak one, but the converse, in general, does not hold.
For any nonnegative measure μ on X, denote by μ^n its n-th Cartesian power supported on X^n. The next theorem describes the asymptotics of the empirical measures distribution.
Theorem 1 (the local large deviation principle for finite measures)
. For any measures ν ∈ P(X), μ ∈ M(X), and number ε > 0, there exists a weak neighborhood O(ν) such that
μ^n{ (x_1, …, x_n) ∈ X^n : δ_{x_1,…,x_n} ∈ O(ν) } ≤ e^{−n(ρ(ν,μ) − ε)}.     (6)
On the other hand, for any measures ν ∈ P(X), μ ∈ M(X), number ε > 0, and any fine neighborhood O(ν), the following estimate holds for all large enough n:
μ^n{ (x_1, …, x_n) ∈ X^n : δ_{x_1,…,x_n} ∈ O(ν) } ≥ e^{−n(ρ(ν,μ) + ε)}.     (7)
In the case of a metric space X supplied with a Borel σ-field, the neighborhood O(ν) in (6) can be chosen from the weak topology generated by bounded continuous functions.

Remark 1. When ρ(ν, μ) = +∞, the difference ρ(ν, μ) − ε in (6) should be replaced by 1/ε.

Remark 2. So long as each weak neighborhood in P(X) belongs to the fine topology, estimates (6) and (7) complement each other: the coefficient ρ(ν, μ) − ε cannot be increased in (6), and ρ(ν, μ) + ε cannot be decreased in (7).

Remark 3. Theorem 1 is also true for finitely additive probability distributions ν on the space X if we set ρ(ν, μ) = +∞ in such a case (see [9]).

It is worth mentioning that, until recently, the absolute majority of papers on the large deviation principle dealt with random variables in a Polish space (i.e., a complete separable metric space), and only a few of them treated random variables in a topological space (see, for example, [4]), or in a measurable space in which the σ-field is generated by open balls and does not necessarily contain Borel sets (see [7], Section 7). In addition, only countably additive probability distributions ν and μ were considered as arguments of the Kullback action. Theorem 1 for an arbitrary measurable space X, finitely additive measures ν, and nonnormalized measures μ was first proven in [9], and its generalization for finitely additive measures μ was proven in [10].
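The exponential rates in (6) and (7) are easy to observe numerically. The following short Python sketch (an illustration of ours; the distributions and parameter values are chosen purely for demonstration) samples i.i.d. points from the fair Bernoulli distribution μ on the two-point space X = {0, 1} and estimates the μ^n-measure (here, a probability) of the event that the empirical measure falls into an ε-window around ν = Bernoulli(0.6).

import numpy as np

# Numerical illustration of Theorem 1 on the two-point space X = {0, 1}:
# mu is the fair Bernoulli distribution, nu is Bernoulli(p_nu), and the
# neighborhood O(nu) consists of empirical measures whose mean lies within
# eps of p_nu.  All parameter values are chosen for demonstration only.
rng = np.random.default_rng(0)
p_mu, p_nu, eps, trials = 0.5, 0.6, 0.02, 200_000

def rho(q, p):
    # Kullback action between Bernoulli(q) and Bernoulli(p)
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

print("rho at the window edge:", rho(p_nu - eps, p_mu), " rho(nu, mu):", rho(p_nu, p_mu))

for n in (100, 200, 400):
    # the number of ones among n i.i.d. draws from mu determines the
    # empirical measure completely
    ones = rng.binomial(n, p_mu, size=trials)
    prob = np.mean(np.abs(ones / n - p_nu) < eps)   # mu^n-measure of the event
    print(n, -np.log(prob) / n)

# For a fixed eps the printed decay rate slowly decreases toward the smallest
# Kullback action over the window [p_nu - eps, p_nu + eps] (about 0.0129 here);
# as eps -> 0 this limit becomes rho(0.6, 0.5) ~ 0.0201, in accordance with
# estimates (6) and (7).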
3. The Kullback Action for σ-Finite Measures
The shortcoming of Theorem 1 is that it does not cover the case of an infinite measure μ. In particular, it does not explain any sense of entropy (1) of an absolutely continuous probability distribution on the real axis. Unfortunately, the direct extension of Theorem 1 to infinite measures μ is wrong. The next example demonstrates this.
Example ([22]). Let X be a countable set supplied with the discrete σ-field and let μ be the counting measure on X (such that μ({x}) = 1 for every x ∈ X). Consider the topology on the space of probability distributions P(X) generated by the neighborhoods
O(ν) = { ν′ ∈ P(X) : ∑_{x∈X} |ν′({x}) − ν({x})| < ε },  ε > 0     (8)
(in other words, the topology of ℓ¹(X)). Then, for any neighborhood (8) and any number C > 0, there exists a finite subset Y ⊂ X such that, for all n large enough,
μ^n{ (x_1, …, x_n) ∈ Y^n : δ_{x_1,…,x_n} ∈ O(ν) } ≥ e^{nC}.     (9)
The topology on P(X) under consideration contains the weak topology generated by functions from B(X). It follows that, for ρ(ν, μ) > −∞, estimate (9) contradicts (6), and hence the latter cannot take place.

It turns out that, to extend Theorem 1 to σ-finite measures μ, it is enough to replace the weak neighborhood in (6) with a fine one. This is the main result of the paper. Its exact formulation is given in Theorem 3 below.
We also propose one more approach to extending Theorem 1, using only the weak topology. Its idea is to replace the space X in estimates (6) and (7) by a large enough subset Y ⊂ X of finite measure μ(Y), and to replace the probability measure ν by its conditional distribution on Y. The corresponding results are stated in Theorems 4 and 5 below.
In order to describe the asymptotics of the empirical measures distribution correctly in the case of a σ-finite measure μ, the definition of the Kullback action should be modified. To this end, we have to introduce the notion of a spectral potential.
Denote by B⁻(X) the set of all bounded above measurable functions on the measurable space X. The spectral potential is the nonlinear functional
λ(g, μ) = ln ∫_X e^{g(x)} μ(dx),   g ∈ B⁻(X),  μ ∈ M_σ(X).
If the integral in this formula diverges, then we set λ(g, μ) = +∞. Thus, λ(g, μ) can take values in the interval (−∞, +∞].
For brevity, let us introduce the notation
ν[g] = ∫_X g dν,
where g ∈ B⁻(X) and ν ∈ P(X). If the integral diverges, then we put ν[g] = −∞.
Now, we define the Kullback action as a function of the pair of arguments ν ∈ P(X) and μ ∈ M_σ(X) as follows:
ρ(ν, μ) = sup { ν[g] − λ(g, μ) : g ∈ B⁻(X) }.     (10)
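To see how definition (10) works, take X = {1, 2} with the counting measure μ (μ({1}) = μ({2}) = 1) and ν = (q, 1 − q), 0 < q < 1. For g = (g_1, g_2) we have ν[g] − λ(g, μ) = qg_1 + (1 − q)g_2 − ln(e^{g_1} + e^{g_2}); by the Gibbs inequality this expression is maximal at g_1 = ln q, g_2 = ln(1 − q), where it equals q ln q + (1 − q) ln(1 − q). Thus ρ(ν, μ) = q ln q + (1 − q) ln(1 − q), which agrees both with (4) and with the remark in the Introduction that, for a counting measure, the Kullback action coincides with minus the entropy (2).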
The next theorem shows, in particular, that in the case of a finite measure μ this definition coincides with the previous one (4).
Theorem 2. If a probability distribution ν ∈ P(X) is absolutely continuous with respect to μ ∈ M_σ(X) and φ = dν/dμ, then
ρ(ν, μ) = ∫_X ln φ dν  or  ρ(ν, μ) = +∞.     (11)
In particular, for the finite measure μ, the alternative (11) takes place.

The following theorem is our main result for the case of countably additive distributions.
Theorem 3 (the local large deviation principle for infinite measures)
. For any measures ν ∈ P(X), μ ∈ M_σ(X), and number ε > 0, there exists a fine neighborhood O(ν) such that
μ^n{ (x_1, …, x_n) ∈ X^n : δ_{x_1,…,x_n} ∈ O(ν) } ≤ e^{−n(ρ(ν,μ) − ε)}.     (13)
On the other hand, for any measures ν ∈ P(X), μ ∈ M_σ(X), number ε > 0, and any fine neighborhood O(ν), the following estimate holds for all large enough n:
μ^n{ (x_1, …, x_n) ∈ X^n : δ_{x_1,…,x_n} ∈ O(ν) } ≥ e^{−n(ρ(ν,μ) + ε)}.     (14)
If ρ(ν, μ) = +∞, then the difference ρ(ν, μ) − ε in (13) should be replaced by 1/ε, and if ρ(ν, μ) = −∞, then the sum ρ(ν, μ) + ε in (14) should be replaced by −1/ε.

Let us also formulate the local large deviation principle in terms of weak neighborhoods.
For any probability measure ν ∈ P(X) and any measurable subset Y ⊂ X with ν(Y) > 0, define a conditional measure ν_Y according to the formula
ν_Y(A) = ν(A ∩ Y)/ν(Y).
It is easily seen that the measure ν can be approximated by the conditional measures ν_Y, where ν(Y) → 1, in the fine topology (and all the more in the weak one). Therefore, it can make sense to replace fine neighborhoods of ν in Theorem 3 by weak neighborhoods of close conditional measures ν_Y.
We will say that the Kullback action ρ(ν, μ) is well-defined if ν has a density φ = dν/dμ and, in addition, at least one of the two integrals
∫_X (ln φ)_+ dν,   ∫_X (ln φ)_− dν     (15)
is finite (here (ln φ)_+ and (ln φ)_− denote the positive and the negative parts of ln φ). In all other cases (i.e., when both integrals (15) are infinite or the measure ν has no density with respect to μ), we will say that the Kullback action is ill-defined.
Theorem 4. Suppose that, for some measures ν ∈ P(X) and μ ∈ M_σ(X), the Kullback action ρ(ν, μ) is well-defined. Then, for any number ε > 0, there exists a set Y_0 ⊂ X with μ(Y_0) < ∞ such that, for any Y ⊂ X containing Y_0 and having a finite measure μ(Y):
(a) there exists a weak neighborhood O(ν_Y) satisfying the estimate
μ^n{ (x_1, …, x_n) ∈ Y^n : δ_{x_1,…,x_n} ∈ O(ν_Y) } ≤ e^{−n(ρ(ν,μ) − ε)};     (16)
(b) for any fine neighborhood O(ν_Y) and all large enough n,
μ^n{ (x_1, …, x_n) ∈ Y^n : δ_{x_1,…,x_n} ∈ O(ν_Y) } ≥ e^{−n(ρ(ν,μ) + ε)}.     (17)
In addition, for any ε > 0 and any fine neighborhood O(ν), there exists a set Y ⊂ X with μ(Y) < ∞ such that, for all large enough n,
μ^n{ (x_1, …, x_n) ∈ Y^n : δ_{x_1,…,x_n} ∈ O(ν) } ≥ e^{−n(ρ(ν,μ) + ε)}.     (18)

Theorem 5. Suppose that, for some measures ν ∈ P(X) and μ ∈ M_σ(X), the Kullback action ρ(ν, μ) is ill-defined. Then, there exists a set Y_0 ⊂ X with μ(Y_0) < ∞, such that, for any Y ⊂ X containing Y_0 and having a finite measure μ(Y), and any ε > 0, there exists a weak neighborhood O(ν_Y) satisfying the estimate
μ^n{ (x_1, …, x_n) ∈ Y^n : δ_{x_1,…,x_n} ∈ O(ν_Y) } ≤ e^{−n/ε}.     (19)

It is worth mentioning that, under the conditions of Theorem 5, the equality ρ(ν, μ) = −∞ may take place. In such a case, estimates (19) and (14) have opposite senses. Nevertheless, there is no contradiction here because the sets in these estimates are different.
4. Proof of Theorem 2
Recall that, under the conditions of Theorem 2, the measure ν is absolutely continuous with respect to μ and has a density φ = dν/dμ. First of all, we will prove that, for any function g ∈ B⁻(X),
ν[g] − λ(g, μ) ≤ ∫_X ln φ dν.     (20)
If at least one of the expressions ν[g] or λ(g, μ) takes the infinite value allowed to it, then the left-hand side of (20) turns into −∞, and so the inequality is true. Thus, it is enough to consider the case of finite ν[g] and λ(g, μ).
Suppose first that
∫_X ln φ dν > −∞.
For any C > 0, define the set A_C = { x ∈ X : ln φ(x) ≥ −C } and the conditional distribution ν_C on it:
ν_C(A) = ν(A ∩ A_C)/ν(A_C).     (21)
Evidently, ν_C has the density
dν_C/dμ = φ 1_{A_C}/ν(A_C),     (22)
where 1_{A_C} is the characteristic function of A_C. From elementary properties of integrals, it follows that the chain of relations (23)–(25) holds (in the passage from (23) to (24), Jensen's inequality is used). If C → +∞, the expression in (25) converges to the right-hand side of (20). Therefore, (23)–(25) imply inequality (20) in the limit.
Now, suppose that ν[g] and λ(g, μ) are finite and
∫_X ln φ dν = −∞.     (26)
Consider the same sets A_C. As before, define the conditional distributions ν_C and densities dν_C/dμ by means of (21) and (22). Then, calculations (23)–(25) still hold, but the expression in (25) converges now to the limit (27). In the situation under consideration, the first and the third summands in (27) are finite, while the second one turns into −∞. Therefore, from (23)–(25), it follows that ν[g] − λ(g, μ) = −∞, which contradicts the assumption about the finiteness of ν[g] and λ(g, μ). Thus, in the situation when both ν[g] and λ(g, μ) are finite, equality (26) cannot take place. Thereby, inequality (20) is completely proven.
To finish the proof of Theorem 2, it is enough to verify the equality
sup { ν[g] − λ(g, μ) : g ∈ B⁻(X) } = ∫_X ln φ dν.     (28)
By virtue of (20), the left-hand side of (28) does not exceed the right-hand one. If the right-hand side of (28) equals −∞, then the equality is trivial. Consider the case when the right-hand side of (28) is greater than −∞. By σ-finiteness of μ, there exists a function g_0 ∈ B⁻(X) such that the integral ∫_X e^{g_0} dμ is finite. Consider the family of functions
g_t = min{ max(ln φ, g_0 − t), t },   t > 0.
Obviously, g_t ∈ B⁻(X), and if t goes to +∞, then
ν[g_t] − λ(g_t, μ) → ∫_X ln φ dν.
It follows that the supremum on the left-hand side of (28) coincides with the right-hand side. ☐
6. Proof of the Second Part of Theorem 3
Now let us proceed to estimate (14). It is trivial if ρ(ν, μ) = +∞. Thus, in the sequel, we may suppose that ρ(ν, μ) < +∞. Then, (10) implies that ν is absolutely continuous with respect to μ and has a density φ = dν/dμ.
First, consider the case of finite ρ(ν, μ). Then, Theorem 2 implies
∫_X ln φ dν = ρ(ν, μ).
Fix any ε > 0 and any fine neighborhood O(ν). Consider the sets
W_n = { (x_1, …, x_n) ∈ X^n : δ_{x_1,…,x_n} ∈ O(ν),  ln φ(x_1) + ⋯ + ln φ(x_n) ≤ n(ρ(ν, μ) + ε) }
(in the latter inequality, it is supposed that each element of the sequence x_1, …, x_n satisfies the condition φ(x_i) > 0). Note that, for (x_1, …, x_n) ∈ W_n,
dν^n/dμ^n (x_1, …, x_n) = φ(x_1) ⋯ φ(x_n) ≤ e^{n(ρ(ν,μ) + ε)}.
Hence,
μ^n(W_n) ≥ e^{−n(ρ(ν,μ) + ε)} ν^n(W_n).     (32)
By the law of large numbers, ν^n(W_n) → 1. Thus, (32) implies (14).
Now, suppose that ρ(ν, μ) = −∞. Then, by Theorem 2,
∫_X ln φ dν = −∞.     (33)
Divide the whole space X into two parts: X = X′ ∪ X″, where
X′ = { x ∈ X : φ(x) ≥ 1 },   X″ = { x ∈ X : φ(x) < 1 }.
Set c = ∫_{X′} ln φ dν. Evidently, c is finite and
∫_{X″} ln φ dν = −∞.     (34)
Then, construct a sequence of embedded sets Y_1 ⊂ Y_2 ⊂ ⋯ ⊂ X with μ(Y_k) < ∞, such that ν(Y_k) → 1, and, at the same time,
∫_{Y_k} ln φ dν → −∞.     (35)
Such a construction is possible due to (33) and (34). Evidently, ν(Y_k) > 0 for all large enough k, each Y_k is of finite measure μ, and their union gives the whole X.
Denote by ν_k the conditional distribution of ν on Y_k:
ν_k(A) = ν(A ∩ Y_k)/ν(Y_k).
It has the density
dν_k/dμ = φ 1_{Y_k}/ν(Y_k),
where 1_{Y_k} is the characteristic function of Y_k. Evidently, the sequence ν_k converges to ν in the fine topology, and (35) implies that ρ(ν_k, μ) → −∞. In addition, the condition μ(Y_k) < ∞ implies that ρ(ν_k, μ) > −∞.
Fix an arbitrary ε > 0 and an arbitrary fine neighborhood O(ν). Choose k so large that ν_k ∈ O(ν) and simultaneously ρ(ν_k, μ) < −2/ε. In the case of a finite Kullback action, estimate (14) is already proven. Apply it to the pair of measures ν_k and μ, the neighborhood O(ν) of the measure ν_k, and the number 1/ε:
μ^n{ (x_1, …, x_n) ∈ X^n : δ_{x_1,…,x_n} ∈ O(ν) } ≥ e^{−n(ρ(ν_k,μ) + 1/ε)} ≥ e^{n/ε},
provided n is large enough. This is exactly estimate (14) for the case ρ(ν, μ) = −∞. ☐
8. The Case of Finitely Additive Probability Distributions ν
The necessity of considering finitely additive probability distributions ν is caused by the fact that they may happen to be accumulation points of some sequences of empirical measures. Thus, to make the description of the empirical measures distribution complete, we should obtain estimates similar to (13) and (14) for finitely additive probability distributions ν as well.
In fact, this can be done, and the principal result is that Theorems 3 and 5 still hold true for finitely additive probability distributions ν, provided the Kullback action is defined by (10). In addition, in that case, ρ(ν, μ) may take only the values −∞ or +∞, and both are possible.
The transition from countably additive distributions to merely finitely additive ones is not trivial. First of all, we should adapt some previous definitions to the new setting.
Denote by PA(X) the set of all finitely additive probability measures on X. Each ν ∈ PA(X) is naturally identified with a positive normalized linear functional on the space of bounded measurable functions B(X) (i.e., a functional that takes nonnegative values on nonnegative functions and the unit value on the unit function). Using this identification, we denote the integral of f ∈ B(X) with respect to ν ∈ PA(X) as ν[f]. In addition, for bounded above functions g, let us define ν[g] as
ν[g] = lim_{t→+∞} ν[max(g, −t)].
Thus, for g ∈ B⁻(X), the value ν[g] belongs to the interval [−∞, +∞). Similarly, for a measurable function f that is bounded from below, put
ν[f] = lim_{t→+∞} ν[min(f, t)].
Now, we define the Kullback action ρ(ν, μ) for the case when ν ∈ PA(X) and μ ∈ M_σ(X):
ρ(ν, μ) = sup { ν[g] − λ(g, μ) : g ∈ B⁻(X) }.     (36)
Obviously, this definition just duplicates (10).
Theorem 6. If ν ∈ PA(X) has no density with respect to μ ∈ M_σ(X), then ρ(ν, μ) turns into −∞ or +∞. In particular, if μ is finite or ν is countably additive, then ρ(ν, μ) = +∞.
Let us introduce a fine topology on PA(X) by means of neighborhoods of the form
O(ν) = { ν′ ∈ PA(X) : |ν′[g_i] − ν[g_i]| < ε, i = 1, …, m },     (37)
where ε > 0 and the functions g_1, …, g_m ∈ B⁻(X) are such that all ν[g_i] are finite. Clearly, this definition is analogous to (5). Note that the bounded above functions in (37) may be replaced by bounded below or even nonnegative ones. This will not change the collection of neighborhoods (37).
Now, we reformulate Theorems 3 and 5 for the case of finitely additive distributions ν (note that Theorem 4 cannot be reformulated, since in it the Kullback action ρ(ν, μ) is well-defined, and hence ν is countably additive).
Theorem 7. For any measures ν ∈ PA(X), μ ∈ M_σ(X), and number ε > 0, there exists a fine neighborhood O(ν) such that
μ^n{ (x_1, …, x_n) ∈ X^n : δ_{x_1,…,x_n} ∈ O(ν) } ≤ e^{−n(ρ(ν,μ) − ε)}.     (38)
On the other hand, for any measures ν ∈ PA(X), μ ∈ M_σ(X), number ε > 0, and any fine neighborhood O(ν), the following estimate holds for all large enough n:
μ^n{ (x_1, …, x_n) ∈ X^n : δ_{x_1,…,x_n} ∈ O(ν) } ≥ e^{−n(ρ(ν,μ) + ε)}.     (39)
If ρ(ν, μ) = +∞, then the difference ρ(ν, μ) − ε in (38) should be replaced by 1/ε, and if ρ(ν, μ) = −∞, then the sum ρ(ν, μ) + ε in (39) should be replaced by −1/ε.

A measure ν ∈ PA(X) will be called proper with respect to a measure μ ∈ M_σ(X) if, for any ε > 0, there exists a set Y ⊂ X such that μ(Y) < ∞ and ν(Y) > 1 − ε. If, on the contrary, there exists an ε > 0 such that the inequality μ(Y) < ∞ implies ν(Y) ≤ 1 − ε, then the measure ν will be called improper with respect to μ. Obviously, in the case of finite μ, all measures ν ∈ PA(X) are proper, and, in the case of σ-finite μ, all countably additive measures are proper.
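For instance, let X be the set of natural numbers and let μ be the counting measure; then any finitely additive probability measure ν that vanishes on every finite subset of X (such measures exist, e.g., limits along a nonprincipal ultrafilter) is improper, since every set of finite measure μ is finite and therefore ν(Y) = 0 whenever μ(Y) < ∞.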
Theorem 8. Suppose that, for some measures ν ∈ PA(X) and μ ∈ M_σ(X), the Kullback action ρ(ν, μ) is ill-defined, and the measure ν is proper with respect to μ. Then, there exists a set Y_0 ⊂ X with μ(Y_0) < ∞, such that, for any Y ⊂ X containing Y_0 and having a finite measure μ(Y), and any ε > 0, there exists a weak neighborhood O(ν_Y) satisfying the estimate
μ^n{ (x_1, …, x_n) ∈ Y^n : δ_{x_1,…,x_n} ∈ O(ν_Y) } ≤ e^{−n/ε}.     (40)

10. Proof of Theorems 7 and 8
The proof for the first part of Theorem 7 is exactly the same as for the first part of Theorem 3, so we omit it. If ν is countably additive, then the second part of Theorem 7 follows from the second part of Theorem 3. Thus, it remains to consider the case of ν that is not countably additive.
Let F be some σ-field of subsets of X. We will call it discrete if it is generated by a countable or finite partition of X.
Lemma 10. For any measure ν ∈ PA(X) and any fine neighborhood O(ν) of it, there exists a discrete σ-subfield F such that
(a) the restriction of ν to F is countably additive;
(b) there exists a fine neighborhood O′(ν) ⊂ O(ν) generated by F-measurable functions;
(c) if the measure ν is proper with respect to μ, then the σ-field F mentioned above can be chosen in such a way that each of its atoms has a finite measure μ.
Proof. A base for the fine topology on PA(X) is formed by the neighborhoods
O(ν) = { ν′ ∈ PA(X) : |ν′[g_i] − ν[g_i]| < ε, i = 1, …, m },
where g_1, …, g_m are measurable nonnegative functions on X with finite ν[g_i]. Let us prove the Lemma for a neighborhood of this sort.
Define the step-functions h_i = (ε/3)[3g_i/ε], where [·] denotes the integer part of a number, and the neighborhood
O′(ν) = { ν′ ∈ PA(X) : |ν′[h_i] − ν[h_i]| < ε/3, i = 1, …, m }.
Evidently, 0 ≤ g_i − h_i ≤ ε/3, and, for each ν′ ∈ O′(ν), we have
|ν′[g_i] − ν[g_i]| ≤ |ν′[h_i] − ν[h_i]| + ν′[g_i − h_i] + ν[g_i − h_i] < ε.
It follows that O′(ν) ⊂ O(ν).
To each integer vector k = (k_1, …, k_m), assign the set
X_k = { x ∈ X : h_i(x) = (ε/3)k_i, i = 1, …, m }.
These sets form a countable measurable partition of X and generate the desired discrete σ-subfield F. The functions h_i are F-measurable.
Note that, for any C > 0, we have
ν{ x ∈ X : g_1(x) + ⋯ + g_m(x) ≥ C } ≤ (ν[g_1] + ⋯ + ν[g_m])/C.
Thus, when C goes to +∞, the ν-measure of the union of those atoms of F on which some of the functions g_i exceed C tends to zero. It follows that the restriction of ν to the σ-field F is countably additive.
Assume that the measure ν is proper with respect to μ. In this case, we can construct a countable partition of X into subsets Z_1, Z_2, … such that μ(Z_j) < ∞ and ν(Z_1 ∪ ⋯ ∪ Z_j) → 1 as j → ∞. The latter condition implies the equality ∑_j ν(Z_j) = 1. Therefore, the restriction of ν to the σ-field generated by the atoms X_k ∩ Z_j is countably additive. This σ-field may be treated as F. By construction, its atoms have finite measure μ. ☐
Let us finish the proof of Theorem 7. It remains to obtain estimate (39) for measures ν that are not countably additive. In this situation, the measure ν has no density with respect to μ, and, according to Theorem 6, we have the alternative: either ρ(ν, μ) = +∞ or ρ(ν, μ) = −∞. In the first case, estimate (39) is trivial. Thus, it is enough to consider the second case ρ(ν, μ) = −∞.
Suppose the measure ν is proper with respect to μ and ρ(ν, μ) = −∞. We can apply Lemma 10 to ν and construct the corresponding discrete σ-subfield F and fine neighborhood O′(ν) ⊂ O(ν). Denote by ν_F and μ_F the restrictions of ν and μ to F. By Lemma 10, they are countably additive. From definition (36), it follows that if μ(A) = 0 for some A ∈ F, then ν(A) = 0 as well (since otherwise ρ(ν, μ) = +∞). Thus, the distribution ν_F on F is absolutely continuous with respect to μ_F.
Recall that, by definition,
ρ(ν_F, μ_F) = sup { ν[g] − λ(g, μ) : g ∈ B⁻_F(X) },
where B⁻_F(X) is the set of all bounded above F-measurable functions. The same supremum, taken over all bounded above measurable functions, gives ρ(ν, μ), and hence ρ(ν_F, μ_F) ≤ ρ(ν, μ) = −∞ as well. Since ν_F is absolutely continuous with respect to μ_F, the second part of Theorem 7 for the pair ν_F and μ_F is already proven. It implies the estimate
μ^n{ (x_1, …, x_n) ∈ X^n : δ_{x_1,…,x_n} ∈ O′(ν) } ≥ e^{n/ε}
for all large enough n. Due to the inclusion O′(ν) ⊂ O(ν), we obtain (39).
Consider the case of improper ν. We can apply Lemma 10 and construct the corresponding discrete σ-subfield F and a fine neighborhood O′(ν) ⊂ O(ν) generated by F-measurable functions. The field F is generated by a certain denumerable partition X_1, X_2, … . Change the numeration of the sets X_j so that ν(X_j) > 0 for all j. Put Y_k = X_1 ∪ ⋯ ∪ X_k and denote by ν_k the conditional distribution of ν on Y_k. Due to the countable additivity of the restriction of ν to F, ν(Y_k) → 1 and ν_k ∈ O′(ν) for all large enough k. In addition, the improperness of ν implies that μ(Y_k) = ∞ for all large enough k.
Fix such a large k that ν_k ∈ O′(ν) and, at the same time, μ(Y_k) = ∞. The latter implies μ(X_j) = ∞ for at least one j ≤ k. Without loss of generality, we may assume that ν(X_j) > 0 for all j ≤ k. Obviously, for any large enough n, there exists a sequence x_1, …, x_n such that the empirical measure δ_{x_1,…,x_n} is so close to ν_k that δ_{x_1,…,x_n} ∈ O′(ν) and each of the sets X_1, …, X_k contains at least one of the points x_i. Define positive integers n_1, …, n_k in such a way that n_j is the number of the points x_i lying in X_j, for j = 1, …, k. Then, every other sequence with the same occupation numbers n_1, …, n_k also generates an empirical measure belonging to O′(ν) (since the functions generating O′(ν) are constant on the atoms X_j), and μ(X_j) = ∞ for at least one j. Therefore,
μ^n{ (x_1, …, x_n) ∈ X^n : δ_{x_1,…,x_n} ∈ O′(ν) } ≥ (n!/(n_1! ⋯ n_k!)) μ(X_1)^{n_1} ⋯ μ(X_k)^{n_k} = +∞,
and thereby estimate (39) is completely proven. ☐
Proof of Theorem 8. If ν is countably additive, then the assertion of Theorem 8 follows from Theorem 5.
Let ν be not countably additive. Then, ν is not absolutely continuous with respect to μ.
Since ν is proper, by Lemma 9, there exists an ε_0 > 0 such that, for any positive integer n, there exists a set Z_n ⊂ X satisfying μ(Z_n) ≤ 2^{−n} and ν(Z_n) ≥ ε_0. Set Y_0 = Z_1 ∪ Z_2 ∪ ⋯ .
Suppose a set Y ⊂ X with μ(Y) < ∞ contains Y_0. Then, the conditional distribution ν_Y is not absolutely continuous with respect to μ. On the other hand, (36) and the conditions μ(Y) < ∞ and ν(Y) > 0 imply the inequality ρ(ν_Y, μ) > −∞. Hence, ρ(ν_Y, μ) = +∞ by Theorem 6. In this case, estimate (40) follows from estimate (6) of Theorem 1. ☐