Next Article in Journal
Empirical Convergence Theory of Harmony Search Algorithm for Box-Constrained Discrete Optimization of Convex Function
Next Article in Special Issue
On Small Deviation Asymptotics in the L2-Norm for Certain Gaussian Processes
Previous Article in Journal
A Two-Stage Mono- and Multi-Objective Method for the Optimization of General UPS Parallel Manipulators
Previous Article in Special Issue
Asymptotically Exact Constants in Natural Convergence Rate Estimates in the Lindeberg Theorem
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Statistical Estimation of the Kullback–Leibler Divergence

by
Alexander Bulinski
1,* and
Denis Dimitrov
2
1
Steklov Mathematical Institute of Russian Academy of Sciences, 119991 Moscow, Russia
2
Faculty of Mechanics and Mathematics, Lomonosov Moscow State University, 119234 Moscow, Russia
*
Author to whom correspondence should be addressed.
Mathematics 2021, 9(5), 544; https://doi.org/10.3390/math9050544
Submission received: 25 January 2021 / Revised: 24 February 2021 / Accepted: 28 February 2021 / Published: 4 March 2021
(This article belongs to the Special Issue Analytical Methods and Convergence in Probability with Applications)

Abstract

:
Asymptotic unbiasedness and L 2 -consistency are established, under mild conditions, for the estimates of the Kullback–Leibler divergence between two probability measures in R d , absolutely continuous with respect to (w.r.t.) the Lebesgue measure. These estimates are based on certain k-nearest neighbor statistics for pair of independent identically distributed (i.i.d.) due vector samples. The novelty of results is also in treating mixture models. In particular, they cover mixtures of nondegenerate Gaussian measures. The mentioned asymptotic properties of related estimators for the Shannon entropy and cross-entropy are strengthened. Some applications are indicated.

1. Introduction

The Kullback–Leibler divergence introduced in [1] is used for quantification of similarity of two probability measures. It plays important role in various domains such as statistical inference (see, e.g., [2,3]), metric learning [4,5], machine learning [6,7], computer vision [8,9], network security [10], feature selection and classification [11,12,13], physics [14], biology [15], medicine [16,17], finance [18], among others. It is worth to emphasize that mutual information, widely used in many research directions (see, e.g., [19,20,21,22,23]), is a special case of the Kullback–Leibler divergence for certain measures. Moreover, the Kullback–Leibler divergence itself belongs to a class of f-divergence measures (with f ( t ) = log t ). For comparison of various f-divergence measures see, e.g., [24], their estimates are considered in [25,26].
Let P and Q be two probability measures on a measurable space ( S , B ) . The Kullback–Leibler divergence between P and Q is defined, according to [1], by way of
D ( P | | Q ) : = S log d P d Q d P if P Q , otherwise ,
where d P d Q stands for the Radon–Nikodym derivative. The integral in (1) can take values in [ 0 , ] . We employ the base e of logarithms since a constant factor is not essential here.
If ( S , B ) = ( R d , B ( R d ) ) , where d N , and (absolutely continuous) P and Q have densities, p ( x ) and q ( x ) , x R d , w.r.t. the Lebesgue measure μ , then (1) can be expressed as
D ( P | | Q ) = R d p ( x ) log p ( x ) q ( x ) d x ,
where we write d x instead of μ ( d x ) to simplify notation. One formally sets 0 / 0 : = 0 , a / 0 : = if a > 0 , log 0 : = , log ( ) : = and 0 log 0 : = 0 . Then log ( p ( x ) q ( x ) ) is a measurable function with values in [ , ] . So, the right-hand sides of (1) and (2) coincide. Formula (2) is justified by Lemma A1, see Appendix A.
Denote by S ( f ) : = { x R d : f ( x ) > 0 } the support of a (version of) probability density f. The integral in (2) is taken over S ( p ) and does not depend on the choice of p and q versions.
The following two functionals are closely related to the Kullback - Leibler divergence. For probability measures P and Q on ( R d , B ( R d ) ) having densities p ( x ) and q ( x ) , x R d , w.r.t. the Lebesgue measure μ , one can introduce, according to [27], p. 35, entropy H (also called the Shannon differential entropy) and cross-entropy C as follows
H ( P ) : = R d p ( x ) log p ( x ) d x , C ( P , Q ) : = R d p ( x ) log q ( x ) d x .
In view of (2), D ( P | | Q ) = C ( P , Q ) H ( P ) whenever the right-hand side is well defined.
Usually one constructs statistical estimates of some characteristics of a stochastic model under consideration relying on a collection of observations. In the pioneering paper [28] the estimator of the Shannon differential entropy was proposed, based on the nearest neighbor statistics. In a series of papers this estimate was studied and applied. Moreover, estimators of the Rényi entropy, mutual information and the Kullback–Leibler divergence have appeared (see, e.g., [29,30,31]). However, the authors of [32] indicated the occurrence of gaps in the known proofs concerning the limit behavior of such statistics. Almost all of these flaws refer to the lack of proved correctness of using the (reversed) Fatou lemma (see, e.g., [28], inequality after the statement (21), or [31], inequality (91)) or the generalized Helly–Bray lemma (see, e.g., [30], page 2171). One can find these lemmas in [33], p. 233, and [34], p. 187. Paper [32] has attracted our attention and motivated study of the declared asymptotic properties. Furthermore, we would like to highlight the important role of the papers [28,30,31,32]. Thus, in a recent work [35] the new functionals were introduced to prove asymptotic unbiasedness and L 2 -consistency of the Kozachenko–Leonenko estimators of the Shannon differential entropy. We used the criterion of uniform integrability, for different families of functions, to avoid employment of the Fatou lemma since it is not clear whether one could indicate the due majorizing functions for those families. The present paper is aimed at extension of our approach to grasp the Kullback–Leibler divergence estimation. Instead of the nearest neighbor statistics we employ the k-nearest neighbor statistics (on order statistics see, e.g., [36]) and also use more general forms of the mentioned functionals.
Note in passing that there exist investigations treating important aspects of the entropy, Kullback–Leibler divergence and mutual information estimation. The mixed models and conditional entropy estimation are studied, e.g., in [37,38]. The central limit theorem (CLT) for the Kozachenko–Leonenko estimates is established in [39]. In [40], deep analysis of efficiency of functional weighted estimates was performed (including CLT). The limit theorems for point processes on manifolds are employed in [41] to analyze behavior of the Shannon and the Rényi entropy estimates. The convergence rates for the Shannon entropy (truncated) estimates are obtained in [42] for one-dimensional case, see also [43] for multidimensional case. A kernel density plug-in estimator of the various divergence functionals is studied in [25]. The principal assumptions of that paper are the following: the densities are smooth and have common bounded support S, they are strictly lower bounded on S, moreover, the set S is smooth with respect to the employed kernel. Ensemble estimation of various divergence functionals is studied in [25]. Profound results for smooth bounded densities are established in recent work [44]. The mutual information estimation by the local Gaussian approximation is developed in [45]. Note that various deep results (including the central limit theorem) were obtained for the Kullback–Leibler estimates under certain conditions imposed on derivatives of unknown densities (see, e.g., the recent papers [25,46]). In a series of papers the authors demand boundedness of densities to prove L 2 -consistency for the Kozachenko–Leonenko estimates of differential Shannon entropy (see, e.g., [47]).
Our goal is to provide wide conditions for the asymptotic unbiasedness and L 2 -consistency of the specified Kullback–Leibler divergence estimates without such smoothness and boundedness hypotheses. Furthermore, we do not assume that densities have bounded supports. As a byproduct we obtain new results concerning Shannon differential entropy and cross-entropy.
We employ probabilistic and analytical techniques, namely, weak convergence of probability measures, conditional expectations, regular probability distributions, k-nearest neighbor statistics, probability inequalities, integration by parts in the Lebesgue–Stieltjes integral, analysis of integrals depending on certain parameters and taken over specified domains, criterion of the uniform integrability of various families of functions, slowly varying functions.
The paper is organized as follows. In Section 2, we introduce some notation. In Section 3 we formulate main results, i.e., Theorems 1 and 2. Their proofs are provided in Section 4 and Section 5, respectively. Section 6 contains concluding remarks and perspectives of future research. Proofs of several lemmas are given in Appendix A.

2. Notation

Let X and Y be random vectors taking values in R d and having distributions P X and P Y , respectively, (below we will take P = P X and Q = P Y ). Consider random vectors X 1 , X 2 , and Y 1 , Y 2 , with values in R d such that l a w ( X i ) = l a w ( X ) and l a w ( Y i ) = l a w ( Y ) , i N . Assume also that { X i , Y i , i N } are independent. We are interested in statistical estimation of D ( P X | | P Y ) constructed by means of observations X n : = { X 1 , , X n } and Y m : = { Y 1 , , Y m } , n , m N . All the random variables under consideration are defined on a probability space ( Ω , F , P ) , each measure space is assumed complete.
For a finite set E = { z 1 , , z N } R d , where z i z j ( i j ) , and a vector v R d , renumerate points of E as z ( 1 ) ( v ) , , z ( N ) ( v ) in such a way that v z ( 1 ) v z ( N ) , · is the Euclidean norm in R d . If there are points z i 1 , , z i s having the same distance from v then we numerate them according to the indexes i 1 , , i s increase. In other words, for k = 1 , , N , z ( k ) ( v ) is the k-nearest neighbor of v in a set E. To indicate that z ( k ) ( v ) is constructed by means of E we write z ( k ) ( v , E ) . Fix k { 1 , , n 1 } , l { 1 , , m } and (for each ω Ω ) put
R n , k ( i ) : = X i X ( k ) ( X i , X n \ { X i } ) , V m , l ( i ) : = X i Y ( l ) ( X i , Y m ) , i = 1 , , n .
We assume that X and Y have densities p = d P X d μ and q = d P Y d μ . Then with probability one all the points in X n are distinct as well as points of Y m .
Following [31] (see Formula (17) there) introduce an estimate of D ( P X | | P Y )
D ˜ n , m ( K n , L n ) : = 1 n i = 1 n ψ ( k i ) ψ ( l i ) + log m n 1 + d n i = 1 n log V m , l i ( i ) R n , k i ( i ) ,
where ψ ( t ) = d d t log Γ ( t ) = Γ ( t ) Γ ( t ) is the digamma function, t > 0 , K n : = { k i } i = 1 n , L n : = { l i } i = 1 n are collections of integers and, for some r N and all i N , k i r , l i r . Note that (3) is well-defined for n max i = 1 , , n k i + 1 , m max i = 1 , , n l i . If k i = k and l i = l , i = 1 , , n , then, for n k + 1 and m l , we write
D ^ n , m ( k , l ) : = ψ ( k ) ψ ( l ) + log m n 1 + d n i = 1 n log V m , l ( i ) R n , k ( i ) .
If k = l then
D ^ n , m ( k ) = log m n 1 + d n i = 1 n log V m , k ( i ) R n , k ( i )
and we come to formula (5) in [31]. For an intuitive background of the proposed estimates one can address [31] (Introduction, Parts B and C).
We write B ( x , r ) : = { y R d : x y r } for x R d , r > 0 , and V d = μ ( B ( 0 , 1 ) ) is the volume of the unit ball in R d . Similar to (3) with the same notation and the same conditions for k i and l i , i = 1 , , n , one can define the Kozachenko - Leonenko type estimates of H ( P X ) and C ( P X , P Y ) , respectively, by formulas
H ˜ n ( K n ) : = 1 n i = 1 n ψ ( k i ) + log V d + log ( n 1 ) + d n i = 1 n log R n , k i ( i ) ,
C ˜ n , m ( L n ) : = 1 n i = 1 n ψ ( l i ) + log V d + log m + d n i = 1 n log V m , l i ( i ) .
In [28], an estimate (5) was proposed for k i = 1 , i = 1 , , n . If k i = k , l i = l , i = 1 , , n , n k + 1 and m l , then one has
H ^ n ( k ) : = 1 n i = 1 n log V d R n , k d ( i ) ( n 1 ) e ψ ( k ) , C ^ n , m ( l ) : = 1 n i = 1 n log V d V m , l d ( i ) m e ψ ( l ) .
Remark 1.
All our results are valid for statistics (3). To simplify notation we consider estimates (4) since the study of D ˜ n , m ( K n , L n ) follows the same lines. For the same reason, as in the case of Kullback–Leibler divergence, we will only deal with (7) since (5) and (6) can be analyzed in quite the same way.
Some extra notation is necessary. As in [35], given a probability density f in R d , we consider the following functions of x R d , r > 0 and R > 0 , that is, define integral functionals (depending on parameters)
I f ( x , r ) : = B ( x , r ) f ( y ) d y r d V d ,
M f ( x , R ) : = sup r ( 0 , R ] I f ( x , r ) , m f ( x , R ) : = inf r ( 0 , R ] I f ( x , r ) .
Some properties of function B ( x , r ) f ( y ) d y are demonstrated in [48]. By virtue of Lemma 2.1 [35], for each probability density f, the function I f ( x , r ) introduced above is continuous in ( x , r ) on R d × ( 0 , ) . Hence on account of Theorem 15.84 [49] the functions m f ( · , R ) and M f ( · , R ) for any R > 0 have to be upper semicontinuous and lower semicontinuous, respectively. Therefore, Borel measurability of these nonnegative functions ensues from Proposition 15.82 [49]. On the other hand, the function m f ( x , · ) is evidently nonincreasing whereas M f ( x , · ) is nondecreasing for each x in R d . Notably, changing sup r ( 0 , R ] to sup r ( 0 , ) transforms the function M f ( x , R ) into the famous Hardy–Littlewood maximal function M f ( x ) well-known in Harmonic analysis.
Set e [ 1 ] : = 0 and e [ N ] : = exp { e [ N 1 ] } , N Z + . Introduce a function log [ 1 ] ( t ) : = log t , t > 0 . For N N , N > 1 , set log [ N ] ( t ) : = log ( log [ N 1 ] ( t ) ) . Evidently, this function (for each N N ) is defined if t > e [ N 2 ] . For N N , consider the continuous nondecreasing function G N : R + R + , given by formula
G N ( t ) : = 0 , t [ 0 , e [ N 1 ] ] , t log [ N ] ( t ) , t ( e [ N 1 ] , ) .
In other words we employ the function having the form t r ( t ) where a function r ( t ) , taken as N iterations of log t , is slowly varying for large t.
For probability densities p , q in R d , N N and positive constants ν , t , ε , R , introduce the functionals taking values in [ 0 , ]
K p , q ( ν , N , t ) : = x , y R d , x y > t G N | log x y | ν p ( x ) q ( y ) d x d y ,
Q p , q ( ε , R ) : = R d M q ε ( x , R ) p ( x ) d x ,
T p , q ( ε , R ) : = R d m q ε ( x , R ) p ( x ) d x .
Set K p , q ( ν , N ) : = K p , q ( ν , N , e [ N ] ) .
Remark 2.
We have stipulated that 1 / 0 : = (thus m q ε ( x , R ) : = whenever m q ( x , R ) = 0 ). One can write in (12), (13) the integrals over the support S ( p ) instead of integrating over R d , whatever the versions of p and q are taken.

3. Main Results

Theorem 1.
Let, for some positive ε , R and N N , the functionals K p , f ( 1 , N ) , Q p , f ( ε , R ) , T p , f ( ε , R ) be finite if f = p and f = q . Then D ( P X | | P Y ) < and
lim n , m E D ^ n , m ( k , l ) = D ( P X | | P Y ) .
Consider 3 kinds of conditions (labeled A,B,C, possibly with indices, and involving parameters indicated in parentheses) on probability densities.
( A ; p , f , ν ) For probability densities p , f in R d and some positive ν
L p , f ( ν ) : = R d × R d | log x y | ν p ( x ) f ( y ) d x d y < .
As usual, A g ( z ) Q ( d z ) = 0 whenever g ( z ) = (or ) for z A and Q ( A ) = 0 , where Q is a σ -finite measure on ( R d , B ( R d ) ) . Condition (15) with ν > 1 is used, e.g., in [28,31,47].
( B 1 ; f ) A version of f is upper bounded by a positive number M ( f ) ( 0 , ) :
f ( x ) M ( f ) , x R d .
( C 1 ; f ) A version of f is lower bounded by a positive number m ( f ) ( 0 , ) :
f ( x ) m ( f ) , x S ( f ) .
Corollary 1.
Let, for some ν > 1 , condition ( A ; p , f , ν ) be satisfied when f = p and f = q . Then the statements of Theorem 1 are true, provided that ( B 1 ; f ) and ( C 1 ; f ) are both valid for f = p and f = q . Moreover, if the latter assumption involving ( B 1 ; f ) and ( C 1 ; f ) holds then conditions of Theorem 1 are satisfied whenever p and q have bounded supports.
Next we formulate conditions to guarantee L 2 -consistency of estimates (4).
Theorem 2.
Let the requirement K p , f ( 1 , N ) < in conditions of Theorem 1 be replaced by K p , f ( 2 , N ) < , given f = p and f = q . Then D ( P X | | P Y ) < and, for any fixed k , l N , the estimates D ^ n , m ( k , l ) are L 2 -consistent, i.e.,
lim n , m E D ^ n , m ( k , l ) D ( P X | | P Y ) 2 = 0 .
Corollary 2.
For some ν > 2 , let condition ( A ; p , f , ν ) be satisfied if f = p and f = q . Assume that ( B 1 ; f ) and ( C 1 ; f ) are both valid for f = p and f = q . Then the statements of Theorem 2 are true. Moreover, if the latter assumption involving ( B 1 ; f ) and ( C 1 ; f ) holds then conditions of Theorem 2 are satisfied whenever p and q have bounded supports.
Currently we dwell on a modification of condition ( C 1 ; f ) introduced in [35] that allows us to work with densities that need not have bounded supports.
( C 2 ; f ) There exist a version of density f and R > 0 such that, for some c > 0 ,
m f ( x , R ) c f ( x ) , x R d .
Remark 3.
If, for some positive ε, R and c, condition ( C 2 ; q ) is true and
R d q ( x ) ε p ( x ) d x < ,
then T p , q ( ε , R ) is finite. Hence we could apply, for f = p and f = q in Theorems 1 and 2, condition ( C 2 ; f ) and presume, for some ε > 0 , validity of (17) and finiteness of R d p 1 ε ( x ) d x instead of the corresponding assumptions T p , q ( ε , R ) < and T p , p ( ε , R ) < . An illustrative example to this point is provided with a density having unbounded support.
Corollary 3.
Let X, Y be Gaussian random vectors in R d with E X = μ X , E Y = μ Y and nondegenerate covariance matrices Σ X and Σ Y , respectively. Then relations (14) and (16) hold where
D ( P X | | P Y ) = 1 2 tr Σ Y 1 Σ X + μ Y μ X T Σ Y 1 μ Y μ X d + log det Σ Y det Σ X .
The latter formula can be found, e.g., in [2], p. 147, example 6.3. The proof of Corollary 3 is discussed in Appendix A.
Similarly to condition ( C 2 ; f ) let us consider the following one.
( B 2 ; f ) There exist a version of density f and R > 0 such that, for some C > 0 ,
M f ( x , R ) C f ( x ) , x S ( f ) .
Remark 4.
If, for some positive ε, R and c, condition ( B 2 ; q ) is true and
R d q ( x ) ε p ( x ) d x < ,
then obviously Q p , q ( ε , R ) < . Thus, in Theorems 1 and 2 one can employ, for f = p and f = q , condition ( B 2 ; f ) and exploit, for some ε > 0 , the validity of (18) and finiteness of R d p 1 + ε ( x ) d x instead of the assumptions Q p , q ( ε , R ) < and Q p , p ( ε , R ) < , respectively.
Remark 5.
D.Evans applied “positive density condition” in Definition 2.1 of [48] assuming the existence of constants β > 1 and δ > 0 such that r d β B ( x , r ) q ( y ) d y β r d for all 0 r δ and x R d . Consequently m q ( x , δ ) 1 β V d : = m > 0 , x R d . Then T p , q ( ε , δ ) m ε R d p ( x ) d x = m ε < for all ε > 0 . Analogously, M q ( x , δ ) β V d : = M , M > 0 , x R d , and Q p , q ( ε , δ ) M ε R d p ( x ) d x = M ε < for all ε > 0 . The above mentioned inequalities from Definition 2.1 of [48] are valid, provided that density f is smooth and its support in R d is a convex closed body, see proof in [50]. Therefore, if p and q are smooth and their supports are compact convex bodies in R d , the relations (14) and (16) are valid.
Moreover, as a byproduct of Theorems 1 and 2, we obtain the new results indicating both the asymptotic unbiasedness and L 2 -consistency of the estimates (7) for the Shannon differential entropy and cross-entropy.
Theorem 3.
Let Q p , q ( ε , R ) < and T p , q ( ε , R ) < for some positive ε and R. Then C ( P X , P Y ) is finite and the following statements hold for any fixed l N .
(1) 
If, for some N N , K p , q ( 1 , N ) < , then E C ^ n , m ( l ) C ( P X , P Y ) , n , m .
(2) 
If, for some N N , K p , q ( 2 , N ) < , then E ( C ^ n , m ( l ) C ( P X , P Y ) ) 2 0 , n , m .
In particular, one can employ L p , q ( ν ) with ν > 1 instead of K p , q ( 1 , N ) , and with ν > 2 instead of K p , q ( 2 , N ) , where N N .
The first claim of this Theorem follows from the proof of Theorem 1. In a similar way one can infer the second statement from the proof of Theorem 2. If we take q = p in conditions of Theorem 3 then we get the statement concerning the entropy since C ( P X , P X ) = H ( P X ) .
Now we consider the case when p and q are mixtures of some probability densities. Namely,
p ( x ) : = i = 1 I a i p i ( x ) , q ( x ) : = j = 1 J b j q j ( x ) ,
where p i ( x ) , q j ( x ) are probability densities (w.r.t. the Lebesgue measure μ ), positive weights a i , b j are such that i = 1 I a i = 1 , j = 1 J b j = 1 , i = 1 , , I , j = 1 , , J , x R d . Some applications of models described by mixtures are treated, e.g., in [51].
Corollary 4.
Let random vectors X and Y have densities of the form (19). Assume that, for some positive ε, R and N N , the functionals K f , g ( 1 , N ) , Q f , g ( ε , R ) , T f , g ( ε , R ) are finite, whenever f { p 1 , , p I } and g { p 1 , , p I , q 1 , , q J } . Then D ( P X | | P Y ) < and, for any fixed k , l N , (14) holds. Moreover, if the requirement K f , g ( 1 , N ) < is replaced by K f , g ( 2 , N ) < then (16) is true.
The proof of this Corollary is given in Appendix A. Thus, due to Corollaries 3 and 4 one can guarantee the validity of (14) and (16) for any mixtures of nondegenerate Gaussian densities. Note also that in a similar way we can claim the asymptotic unbaisedness and L 2 -consistency of estimates (7) for mixtures satisfying conditions of Corollary 4.
Remark 6.
Let us compare our new results with those established in [35]. Developing the approach of [35] to analysis of asymptotic behavior of the Kozachenko–Leonenko estimates of the Shannon differential entropy we encounter new complications due to dealing with k-nearest neighbor statistics for k N (not only for k = 1 ). Accordingly, in the framework of the Kullback–Leibler divergence estimation, we propose a new way to bound the function 1 F m , l , x ( u ) playing the key role in the proofs (see Formula (28)). Furthermore, instead of the function G ( t ) = t log t (for t > 1 ), used in [35] for the Shannon entropy estimates, we employ a regularly varying function G N ( t ) = t log [ N ] ( t ) where (for t large enough) log [ N ] ( t ) is the N-fold iteration of the logarithmic function and N N can be large. Whence in the definition of integral functional K p , q ( ν , N , t ) by formula (11) one can take a function G N ( z ) having, for z > 0 , the growth rate close to that of function z. Moreover, this entails a generalization of paper [35] results. Now we invoke convexity of G N (see Lemma 6) to provide more general conditions for asymptotic unbiasedness and L 2 -consistency of the Shannon differential entropy as opposed to [35].

4. Proof of Theorem 1

Note that the general structure of this proof, as well as that of Theorem 2, is similar to the one originally proposed in [28] and later used in various papers (see, e.g., [30,31,47]). Nevertheless in order to prove both theorems correctly we employ new ideas and conditions (such as uniform integrability of a family of random variables) in our reasoning.
Remark 7.
In the proof, for certain random variables α , α 1 , α 2 , (depending on some parameters), we will demonstrate that E α n E α , as n (and that all these expectations are finite). To this end, for a fixed R d -valued random vector τ and each x A , where A is a specified subset of R d , we will prove that
E ( α n | τ = x ) E ( α | τ = x ) , n .
It turns out that E ( α n | τ = x ) = E ( α n , x ) and E ( α | τ = x ) = E α x , where the auxiliary random variables α n , x and α x can be constructed explicitly for each x R d . Moreover, it is possible to show that, for each x A , one has α n , x l a w α x , n . Thus, to prove (20) the Fatou lemma is not used, it is not evident whether there exists a random variable majorizing those under consideration. Instead we verify, for each x A , the uniform integrability (w.r.t. measure P ) of a family ( α n , x ) n n 0 ( x ) . Here we employ the necessary and sufficient conditions of uniform integrability provided by de la Vallée–Poussin theorem (see, e.g., Theorem 1.3.4 in [52]). After that, to prove the desired relation E α n E α , n , we have a new task. Namely, we check the uniform integrability of a family ( E ( α n | τ = x ) ) n k 0 , where x A , w.r.t. the measure P τ , i.e., the law of τ, and k 0 does not depend of x. Then we can prove that
E α n = A E ( α n | τ = x ) P τ ( d x ) A E ( α | τ = x ) P τ ( d x ) = E α , n .
Further we will explain a number of nontrivial details concerning the proofs of uniform integrability of various families, the choice of the mentioned random variables (vectors), the set A, n 0 ( x ) and k 0 .
The first auxiliary result explains why without loss of generality (w.l.g.) we can consider the same parameters ε , R , N for different functionals in conditions of Theorems 1 and 2.
Lemma 1.
Let p and q be any probability densities in R d . Then the following statements are valid.
(1) 
If K p , q ( ν 0 , N 0 ) < for some ν 0 > 0 and N 0 N then K p , q ( ν , N ) < for any ν ( 0 , ν 0 ] and each N N 0 .
(2) 
If Q p , q ( ε 1 , R 1 ) < for some ε 1 > 0 and R 1 > 0 then Q p , q ( ε , R ) < for any ε ( 0 , ε 1 ] and each R > 0 .
(3) 
If T p , q ( ε 2 , R 2 ) < for some ε 2 > 0 and R 2 > 0 then T p , q ( ε , R ) < for any ε ( 0 , ε 2 ] and each R > 0 .
In particular one can take q = p and the statements of Lemma 1 still remain valid. The proof of Lemma 1 is given in Appendix A.
Remark 8.
The results of Lemma 1 allow us to ensure (14) by demanding the finiteness of the functionals K p , q ( 1 , N 1 ) , Q p , q ( ε 1 , R 1 ) , T p , q ( ε 2 , R 2 ) , K p , p ( 1 , N 2 ) , Q p , p ( ε 3 , R 3 ) , T p , p ( ε 4 , R 4 ) , for some ε i > 0 , R j > 0 and N j N , where i = 1 , 2 , 3 , 4 and j = 1 , 2 . Moreover, if we assume the finiteness of K p , q ( 2 , N 3 ) and K p , p ( 2 , N 4 ) , for some N 3 N , N 4 N , instead of the finiteness of K p , q ( 2 , N 1 ) and K p , p ( 2 , N 2 ) then (16) holds.
According to Remark 2.4 of [35] if, for some positive ε , R , the integrals Q p , q ( ε , R ) , T p , q ( ε , R ) , Q p , p ( ε , R ) , T p , p ( ε , R ) are finite then
R d p ( x ) | log q ( x ) | d x < , R d p ( x ) | log p ( x ) | d x < .
Therefore D ( p | | q ) < (and thus P X P Y in view of Lemma A1).
For n N such that n > 1 , for fixed k N and m N , where 1 k n 1 , 1 l m and i = 1 , , n , set ϕ m , l ( i ) : = m V m , l d ( i ) , ζ n , k ( i ) : = ( n 1 ) R n , k d ( i ) . Then we can rewrite the estimate D ^ n , m ( k , l ) as follows:
D ^ n , m ( k , l ) = ψ ( k ) ψ ( l ) + 1 n i = 1 n log ϕ m , l ( i ) log ζ n , k ( i ) .
It is sufficient to prove the following two assertions.
Statement 1.
For each fixedl, allmlarge enough and any i N , E | log ϕ m , l ( i ) | is finite.
Moreover,
E 1 n i = 1 n log ϕ m , l ( i ) = E log ϕ m , l ( 1 ) ψ ( l ) log V d R d p ( x ) log q ( x ) d x , m .
Statement 2.
For each fixed k, all nlarge enough and any i N , E | log ζ n , k ( i ) | is finite.
Moreover,
E 1 n i = 1 n log ζ n , k ( i ) = E log ζ n , k ( 1 ) ψ ( k ) log V d R d p ( x ) log p ( x ) d x , n .
Then in view of (2) and (21)–(24)
E D ^ n , m ( k , l ) R d p ( x ) log q ( x ) d x + R d p ( x ) log p ( x ) d x = D ( P X | | P Y ) , n , m ,
and Theorem 1 will be proved.
Recall that, as explained in [35], for a nonnegative random variable V (thus 0 E V ) and any random R d -valued vector, one has
E V = R d E ( V | X = x ) P X ( d x ) .
This signifies that both sides of (25) coincide, being finite or infinite simultaneously. Let F ( u , ω ) be a regular conditional distribution function of a nonnegative random variable U given X where u R and ω Ω . Let h be a measurable function such that h : R [ 0 , ) . It was also explained in [35] that, for P X -almost all x R d , it follows (without assuming E h ( U ) < )
E ( h ( U ) | X = x ) = [ 0 , ) h ( u ) d F ( u , x ) .
This means that both sides of (26) are finite or infinite simultaneously and coincide.
By virtue of (25) and (26) one can establish that E | log ϕ m , l ( i ) | < , for all m large enough, fixed l and for all i, and that (23) holds. To perform this take U = ϕ m , l ( i ) , X = X i , h ( u ) = | log u | , u > 0 (we use h ( u ) = log 2 u in the proof of Theorem 2) and V = h ( U ) . If h : R R and E | h ( U ) | < then (26) is true as well. To avoid increasing the volume of this paper we will only examine the evaluation of E log ϕ m , l ( i ) as all the steps of the proof will be the same when treating E | log ϕ m , l ( i ) | .
The proof of Statement 1 is partitioned into 4 steps. The first three demonstrate that there is a measurable A S ( p ) , depending on p and q versions, such that P X ( S ( p ) \ A ) = 0 and, for any x A , i N , the following relation holds:
E ( log ϕ m , l ( i ) | X i = x ) = E ( log ϕ m , l ( 1 ) | X 1 = x ) ψ ( l ) log V d log q ( x ) , m .
The last Step 4 justifies the desired result (23). Finally Step 5 validates Statement 2.
Step 1. Here we establish the distribution convergence for the auxiliary random variables. Fix any i N and l { 1 , , m } . To simplify notation we do not indicate the dependence of functions on d. For x R d and u 0 , we identify the asymptotic behavior (as m ) of the function
F m , l , x i ( u ) : = P ϕ m , l ( i ) u | X i = x = P m V m , l d ( i ) u | X i = x = 1 P V m , l ( i ) > u m 1 d | X i = x = 1 P x Y ( l ) ( x , Y m ) > u m 1 d = 1 s = 0 l 1 m s W m , x ( u ) s 1 W m , x ( u ) m s : = P ξ m , l , x u ,
where
W m , x ( u ) : = B ( x , r m ( u ) ) q ( z ) d z , r m ( u ) : = u m 1 d , ξ m , l , x : = m x Y ( l ) ( x , Y m ) d .
We take into account in (28) that random vectors Y 1 , , Y m , X i are independent and condition that Y 1 , , Y m have the same law as Y. We also noted that an event x Y ( l ) ( x , Y m ) > r m ( u ) is a union of pair-wise disjoint events A s , s = 0 , , l 1 . Here A s means that exactly s observations among Y m belong to the ball B ( x , r m ( u ) ) and other m s are outside this ball (probability that Y belongs to the sphere { z R d : z x = r } equals 0 since Y has a density w.r.t. the Lebesgue measure μ ). Formulas (28) and (29) show that F m , l , x i ( u ) is the regular conditional distribution function of ϕ m , l ( i ) given X i = x . Moreover, (28) means that ϕ m , l ( i ) , i { 1 , , n } are identically distributed and we may omit the dependence on i. So, one can write F m , l , x ( u ) instead of F m , l , x i ( u ) .
According to the Lebesgue differentiation theorem (see, e.g., [49], p. 654) if q L 1 ( R d ) , for μ -almost all x R d , one has
lim r 0 + 1 μ ( B ( x , r ) ) B ( x , r ) | q ( z ) q ( x ) | d z = 0 .
Let Λ ( q ) denote the set of Lebesgue points of a function q, namely the points in R d satisfying (30). Evidently it depends on the choice of version within the class of functions in L 1 ( R d ) equivalent to q, and, for an arbitrary version of q, we have μ ( R d \ Λ ( q ) ) = 0 .
Clearly, for each u 0 , r m ( u ) 0 as m , and μ ( B ( x , r m ( u ) ) ) = V d r m ( u ) d = V d u m . Therefore by virtue of (30), for any fixed x Λ ( q ) and u 0 ,
W m , x ( u ) = V d u m q ( x ) + α m ( x , u ) ,
where α m ( x , u ) 0 , m . Hence, for x Λ ( q ) S ( q ) (thus q ( x ) > 0 ), due to (28)
F m , l , x ( u ) 1 s = 0 l 1 ( V d q ( x ) u ) s s ! e V d q ( x ) u : = F l , x ( u ) , m .
Relation (31) means that
ξ m , l , x l a w ξ l , x , x Λ ( q ) S ( q ) , m ,
where ξ l , x has the Gamma distribution Γ ( α , λ ) with parameters α = V d q ( x ) and λ = l .
For any x S ( q ) , one can assume w.l.g. that the random variables ξ l , x and { ξ m , l , x } m l are defined on a probability space ( Ω , F , P ) . Indeed, by the Lomnicki–Ulam theorem (see, e.g., [53], p. 93) the independent copies of Y 1 , Y 2 , and { ξ l , x } x S ( q ) exist on a certain probability space. The convergence in distribution of random variables survives under continuous mapping. Thus, for any x Λ ( q ) S ( q ) , we see that
log ξ m , l , x l a w log ξ l , x , m .
We have employed that ξ l , x > 0 a.s. for each x Λ ( q ) S ( q ) and Y has a density, so it follows that P ( ξ m , l , x > 0 ) = P ( x Y ( l ) ( x , Y m ) > 0 ) = 1 . More precisely, we take strictly positive versions of ξ l , x and ξ m , l , x for each x Λ ( q ) S ( q ) .
Step 2. Now we show that, instead of (27) validity, one can verify the following assertion. For μ-almost every x Λ ( q ) S ( q )
E log ξ m , l , x E log ξ l , x , m .
Note that if η Γ ( α , λ ) , where α > 0 and λ > 0 , then E log η = ψ ( λ ) log α , where ψ is a digamma function. Set α = V d q ( x ) for x S ( q ) (then α > 0 ) and λ = l . Hence E log ξ l , x = ψ ( l ) log ( V d q ( x ) ) = ψ ( l ) log V d log q ( x ) . By virtue of (26), for each x R d ,
E log ξ m , l , x = ( 0 , ) log u d F m , l , x ( u ) = ( 0 , ) log u d P ( ϕ m , l ( 1 ) u | X 1 = x ) = E ( log ϕ m , l ( 1 ) | X 1 = x ) .
Hence, for x Λ ( q ) S ( q ) , the relation E ( log ϕ m , l ( 1 ) | X 1 = x ) ψ ( l ) log V d log q ( x ) holds if and only if (33) is true.
According to Theorem 3.5 [54] we would have established (33) if relation (32) could be supplemented, for μ -almost all x Λ ( q ) S ( q ) , by the condition of uniform integrability of a family { log ξ m , l , x } m m 0 ( x ) . Note that, for each N N , a function G N ( t ) introduced by (10) is nondecreasing on ( 0 , ) and G N ( t ) t , as t . By the de la Vallée–Poussin theorem (see, e.g., Theorem 1.3.4 [52]), to ensure, for μ -almost every x Λ ( q ) S ( q ) , the uniform integrability of { log ξ m , l , x } m m 0 ( x ) , it suffices to prove the following statement. For the indicated x, a positive C 0 ( x ) and m 0 ( x ) N , one has
sup m m 0 ( x ) E G N ( | log ξ m , l , x | ) C 0 ( x ) < ,
where G N appears in conditions of Theorem 1. Moreover, it is possible to find m 0 N that does not depend on x R d as we will show further.
Step 3. This step is devoted to proving validity of (34). It is convenient to divide this step into its own parts (3a), (3b), etc. For any N N , set
g N ( t ) = 1 t log [ N ] ( log t ) + 1 j = 1 N 1 log [ j ] ( log t ) , t 0 , 1 e [ N ] , 0 , t 1 e [ N ] , e [ N ] , 1 t log [ N ] ( log t ) + 1 j = 1 N 1 log [ j ] ( log t ) , t e [ N ] , ,
where the product over empty set (when N = 1 ) is equal to 1.
The proof of the following result is placed at Appendix A.
Lemma 2.
Let F ( u ) , u R , be a distribution function such that F ( 0 ) = 0 . Then, for each N N , one has
( 1 ) 0 , 1 e [ N ] G N ( | log u | ) d F ( u ) = 0 , 1 e [ N ] F ( u ) ( g N ( u ) ) d u ,
( 2 ) e [ N ] , G N ( | log u | ) d F ( u ) = e [ N ] , ( 1 F ( u ) ) g N ( u ) d u .
Fix N appearing in conditions of Theorem 1. Observe that, for u 1 e [ N ] , e [ N ] , one has G N ( | log u | ) = 0 . Therefore, according to Lemma 2, for x Λ ( q ) S ( q ) and m l , we get E G N ( | log ξ m , l , x | ) : = I 1 ( m , x ) + I 2 ( m , x ) where
I 1 ( m , x ) : = 0 , 1 e [ N ] F m , l , x ( u ) ( g N ( u ) ) d u , I 2 ( m , x ) : = ( e [ N ] , ) ( 1 F m , l , x ( u ) ) g N ( u ) d u .
For convenience sake we write I 1 ( m , x ) and I 2 ( m , x ) without indicating their dependence on N , l and d (these parameters are fixed).
Part (3a). We provide bounds for I 1 ( m , x ) . Take R > 0 appearing in conditions of Theorem 1 and any u 0 , 1 e [ N ] . Introduce m 1 : = max 1 e [ N ] R d , l , where, for a R , a : = inf { m Z : m a } . Then r m ( u ) = u m 1 / d 1 e [ N ] m 1 / d R if m m 1 . Note also that we can consider only m l everywhere below, because the size of sample Y m is not less than the number of neighbors l (see, e.g., (28)). Thus, for R > 0 , u 0 , 1 e [ N ] , x R d and m m 1 ,
W m , x ( u ) μ ( B ( x , r m ( u ) ) ) = B ( x , r m ( u ) ) q ( y ) d y r m d ( u ) V d sup r ( 0 , R ] B ( x , r ) q ( y ) d y r d V d = M q ( x , R ) ,
and we arrive at the inequality
W m , x ( u ) M q ( x , R ) μ ( B ( x , r m ( u ) ) ) = M q ( x , R ) V d u m .
If γ ( 0 , 1 ] and t [ 0 , 1 ] then, for all m 1 , invoking the Bernoulli inequality, one has
1 ( 1 t ) m ( m t ) γ .
Recall that we assume Q p , q ( ε , R ) < for some ε > 0 , R > 0 . By virtue of Lemma 1 one can take ε < 1 . So, due to (36) and since W m , x ( u ) [ 0 , 1 ] for all x R d , u > 0 and m l , we get
1 ( 1 W m , x ( u ) ) m ( m W m , x ( u ) ) ε .
Thus in view of (28), (35) and (37) we have established that, for all x Λ ( q ) S ( q ) , u ( 0 , 1 e [ N ] ] and m m 1 ,
F m , l , x ( u ) = 1 s = 0 l 1 m s W m , x ( u ) s 1 W m , x ( u ) m s 1 ( 1 W m , x ( u ) ) m m M q ( x , R ) V d u m ε = ( M q ( x , R ) ) ε V d ε u ε .
Therefore, for any x Λ ( q ) S ( q ) and m m 1 , one can write
I 1 ( m , x ) ( M q ( x , R ) ) ε V d ε 0 , 1 e [ N ] u ε ( g N ( u ) ) d u ( M q ( x , R ) ) ε V d ε 0 , 1 e [ N ] log [ N ] ( log u ) + 1 u 1 ε d u = U 1 ( ε , N , d ) ( M q ( x , R ) ) ε ,
where U 1 ( ε , N , d ) : = V d ε L N ( ε ) , L N ( ε ) : = [ e [ N ] , ) ( log [ N ] ( t ) + 1 ) e ε t d t < . We took into account that ( g N ( u ) ) 1 u ( log [ N ] ( log u ) + 1 ) whenever u 0 , 1 e [ N ] .
Part (3b).We give bounds for I 2 ( m , x ) . Since g N ( u ) log [ N + 1 ] ( u ) + 1 u if u ( e [ N ] , ) , we can write, for m max { e [ N ] 2 , l } ,
I 2 ( m , x ) ( e [ N ] , m ] ( 1 F m , l , x ( u ) ) log [ N + 1 ] ( u ) + 1 u d u + ( m , m 2 ] ( 1 F m , l , x ( u ) ) log [ N + 1 ] ( u ) + 1 u d u + m 2 , ( 1 F m , l , x ( u ) ) g N ( u ) d u : = J 1 ( m , x ) + J 2 ( m , x ) + J 3 ( m , x ) .
Evidently,
1 F m , l , x ( u ) = r = m l + 1 m m r P m , x ( u ) r 1 P m , x ( u ) m r = P ( Z m l + 1 ) ,
where P m , x ( u ) = 1 W m , x ( u ) and Z Bin ( m , P m , x ( u ) ) .
By Markov’s inequality P ( Z t ) e λ t E e λ Z for any λ > 0 and t > 0 . One has
E e λ Z = j = 0 m e λ j m j P m , x ( u ) j 1 P m , x ( u ) m j = j = 0 m m j P m , x ( u ) e λ j 1 P m , x ( u ) m j = 1 P m , x ( u ) + e λ P m , x ( u ) m .
Consequently, for each λ > 0 ,
1 F m , l , x ( u ) e λ ( m l + 1 ) 1 P m , x ( u ) + e λ P m , x ( u ) m = e λ ( m l + 1 ) W m , x ( u ) + e λ ( 1 W m , x ( u ) ) m = e λ ( l 1 ) 1 1 1 e λ W m , x ( u ) m .
To simplify bounds we take λ = 1 and set S 1 = S 1 ( l ) : = e l 1 , S 2 : = 1 1 e (recall that l is fixed). Thus, S 1 1 and S 2 < 1 . Therefore,
1 F m , l , x ( u ) S 1 1 S 2 W m , x ( u ) m S 1 exp S 2 m W m , x ( u ) ,
where we have used simple inequality 1 t e t , t [ 0 , 1 ] .
For R > 0 appearing in conditions of the Theorem and any u e [ N ] , m , one can choose m 2 : = max 1 R 2 d , e [ N ] 2 , l such that if m m 2 then r m ( u ) = u m 1 / d 1 m 1 / d R . Due to (29) and (41), for u ( e [ N ] , m ] and m m 2 , one has
1 F m , l , x ( u ) S 1 exp S 2 m V d u m W m , x ( u ) V d u m = S 1 exp S 2 V d u B ( x , r m ( u ) ) q ( z ) d z μ ( B ( x , r m ( u ) ) ) S 1 exp S 2 V d u m q ( x , R ) ,
by definition of m f (for f = q ) in (9). Now we use the following Lemma 3.2 of [35].
Lemma 3.
For a version of a density q and each R > 0 , one has μ ( S ( q ) \ D q ( R ) ) = 0 where D q ( R ) : = { x S ( q ) : m q ( x , R ) > 0 } and m q ( · , R ) is defined according to (9).
It is easily seen that, for any t > 0 and each δ ( 0 , e ] , one has e t t δ . Thus, for x D q ( R ) , m m 2 , u ( e [ N ] , m ] and ε > 0 , we deduce from conditions of the Theorem (in view of Lemma 1 one can suppose that ε ( 0 , e ] ) that
1 F m , l , x ( u ) S 1 S 2 V d u m q ( x , R ) ε .
We also took into account that m q ( x , R ) > 0 for x D q ( R ) and applied relation (42). Thus, for all x Λ ( q ) S ( q ) D q ( R ) and any m m 2 ,
J 1 ( m , x ) S 1 ( S 2 V d ) ε ( m q ( x , R ) ) ε ( e [ N ] , ) log [ N + 1 ] ( u ) + 1 u 1 + ε d u = U 2 ( ε , N , d , l ) ( m q ( x , R ) ) ε ,
where U 2 ( ε , N , d , l ) : = S 1 ( l ) L N ( ε ) ( S 2 V d ) ε .
Part (3c). We provide the bound for J 2 ( m , x ) . For all x Λ ( q ) S ( q ) D q ( R ) and any m m 2 , in view of (43), it holds 1 F m , l , x ( m ) S 1 S 2 V d m q ( x , R ) m ε . Hence (as m 2 2 )
J 2 ( m , x ) m , m 2 ( 1 F m , l , x ( u ) ) log [ N + 1 ] ( u ) + 1 u d u 1 F m , l , x ( m ) m , m 2 log [ N + 1 ] ( u ) + 1 d log u S 1 ( S 2 V d ) ε m q ( x , R ) ε m ε 2 log [ N ] ( 2 log m ) + 1 3 2 log m .
Then, for all x Λ ( q ) S ( q ) D q ( R ) and any m m 2 ,
J 2 ( m , x ) U 3 ( m , ε , N , d , l ) m q ( x , R ) ε ,
where U 3 ( m , ε , N , d , l ) : = 3 2 S 1 ( l ) ( S 2 V d ) ε m ε 2 log m log [ N ] ( 2 log m ) + 1 0 , m .
Part (3d).To indicate bounds for J 3 ( m , x ) we employ several auxiliary results.
Lemma 4.
For each N N and any ν > 0 , there are a : = a ( d , ν ) 0 , b : = b ( N , d , ν ) 0 such that, for arbitrary x , y R d ,
G N | log x y d | ν a G N | log x y | ν + b .
The proof is given in Appendix A.
On the one hand, by (29), for any w 0 , we get
W m , x ( m w ) = B ( x , w 1 / d ) q ( z ) d z = W 1 , x ( w ) .
On the other hand, by (28), one has F 1 , 1 , x ( w ) = 1 1 W 1 , x ( w ) = W 1 , x ( w ) . Consequently, for any m N , w 0 and all x R d ,
W m , x ( m w ) = F 1 , 1 , x ( w ) .
Moreover, F 1 , 1 , x ( w ) = P ( Y x d w ) . So, ξ 1 , 1 , x = l a w Y x d . Thus, due to Lemmas 2 and 4 (for ν = 1 )
e [ N ] , ( 1 F 1 , 1 , x ( w ) ) g N ( w ) d w = e [ N ] , G N ( log w ) d F 1 , 1 , x ( w ) = E G N log ξ 1 , 1 , x I ξ 1 , 1 , x > e [ N ] = E [ G N ( log Y x d ) I { Y x d > e [ N ] } ] = y R d , x y > e [ N ] 1 / d G N ( log x y d ) q ( y ) d y a ( d , 1 ) y R d , x y > e [ N ] 1 / d G N ( | log x y | ) q ( y ) d y + b ( N , d , 1 ) = a ( d , 1 ) y R d , x y > e [ N ] G N ( log x y ) q ( y ) d y + b ( N , d , 1 ) ,
since G N ( t ) = 0 for t [ 0 , e [ N 1 ] ] , N N .
Now we will estimate 1 F m , l , x ( u ) in a way different from (40). Fix any δ > 0 . Note that, for all m ( l 1 ) 1 + 1 δ and s { 0 , , l 1 } , it holds m m s m m l + 1 1 + δ . Then, for all x R d , u 0 and m max { l , ( l 1 ) 1 + 1 δ } , in view of (28) one can write
1 F m , l , x ( u ) = 1 W m , x ( u ) s = 0 l 1 m 1 s m m s W m , x ( u ) s 1 W m , x ( u ) ( m 1 ) s ( 1 + δ ) 1 W m , x ( u ) s = 0 l 1 m 1 s W m , x ( u ) s 1 W m , x ( u ) ( m 1 ) s
( 1 + δ ) 1 W m , x ( u ) .
We are going to employ the following statement as well.
Lemma 5.
For each N N , a function log [ N ] ( t ) , t > e [ N 1 ] , is slowly varying at infinity.
The proof is elementary and thus is omitted.
Part (3e). Now we are ready to get the bound for J 3 ( m , x ) . Set u = m w . Then one has
J 3 ( m , x ) = m 2 , ( 1 F m , l , x ( u ) ) 1 u log [ N ] ( log u ) + 1 j = 1 N 1 log [ j ] ( log u ) d u = m , ( 1 F m , l , x ( m w ) ) 1 w log [ N + 1 ] ( m w ) + 1 j = 2 N log [ j ] ( m w ) d w .
Given w > m , Lemma 5 implies that log [ N + 1 ] ( m w ) log [ N + 1 ] ( w 2 ) = log [ N ] ( 2 log w ) 2 log [ N + 1 ] ( w ) for w large enough, namely for all w W , where W = W ( N ) . Take δ > 0 and set m 3 : = max l , ( l 1 ) 1 + 1 δ , W ( N ) , e [ N ] . Let further m m 3 . Then
J 3 ( m , x ) 2 m , ( 1 F m , l , x ( m w ) ) 1 w log [ N + 1 ] ( w ) + 1 j = 2 N log [ j ] ( w ) d w .
By virtue of (46) and (48) one has
1 F m , l , x ( m w ) ( 1 + δ ) 1 W m , x ( m w ) = ( 1 + δ ) 1 F 1 , 1 , x ( w ) .
Hence it can be seen that
J 3 ( m , x ) 2 ( 1 + δ ) m , ( 1 F 1 , 1 , x ( w ) ) g N 1 ( w ) d w .
Introduce
R N ( x ) : = y R d , x y > e [ N ] G N ( log x y ) q ( y ) d y , A p ( G N ) : = { x S ( p ) : R N ( x ) < } .
Let us note that (1) P X ( S ( p ) \ A p ( G N ) ) = 0 as K p , q ( 1 , N ) < ;
(2) P X ( S ( p ) \ S ( q ) ) = 0 as P X P Y (see Lemma A1);
(3) μ S ( q ) \ ( Λ ( q ) D q ( R ) ) = 0 due to Lemma 3.
Since P X μ we conclude that P X S ( q ) \ ( Λ ( q ) D q ( R ) ) = 0 . Hence, one has P X S ( p ) \ ( Λ ( q ) D q ( R ) ) = 0 in view of 2) and because B \ C ( B \ A ) ( A \ C ) for any A , B , C R d . Set further A : = Λ ( q ) S ( q ) D q ( R ) S ( p ) A p ( G N ) . It follows from (1), (2) and (3) that P X ( S ( p ) \ A ) = 0 , so P X ( A ) = 1 . We are going to consider only x A .
Then, by virtue of (47) and (50), for all m m 3 and x A , we come to the inequality
J 3 ( m , x ) 2 ( 1 + δ ) a ( d , 1 ) R N ( x ) + b ( N , d , 1 ) = A ( δ , d ) R N ( x ) + B ( δ , d , N ) ,
where A ( δ , d ) : = 2 ( 1 + δ ) a ( d , 1 ) , B ( δ , d , N ) : = 2 ( 1 + δ ) b ( N , d , 1 ) .
Part (3f). Here we get the upper bound for E G N ( | log ξ m , l , x | ) . For m max { m 1 , m 2 , m 3 } and each x A , taking into account (39), (44), (45) and (51) we can claim that
E G N ( | log ξ m , l , x | ) I 1 ( m , x ) + J 1 ( m , x ) + J 2 ( m , x ) + J 3 ( m , x ) U 1 ( ε , N , d ) ( M q ( x , R ) ) ε + U 2 ( ε , N , d , l ) ( m q ( x , R ) ) ε + U 3 ( m , ε , N , d , l ) m q ( x , R ) ε + A ( δ , d ) R N ( x ) + B ( δ , d , N ) .
For any κ > 0 , one can take m 4 = m 4 ( κ , ε , N , d , l ) N such that U 3 ( m , ε , N , d , l ) κ if m m 4 . Then by virtue of (52), for each x A and m m 0 : = max { m 1 , m 2 , m 3 , m 4 } ,
E G N ( | log ξ m , l , x | ) U 1 ( ε , N , d ) ( M q ( x , R ) ) ε + U 2 ( ε , N , d , l ) + κ ( m q ( x , R ) ) ε + A ( δ , d ) R N ( x ) + B ( δ , d , N ) : = C 0 ( x ) < .
Hence, for each x A , we have established uniform integrability of the family log ξ m , l , x m m 0 .
Step 4. Now we verify (23). It was checked, for each x A (thus, for P X -almost every x belonging to S ( p ) ) that E ( log ϕ m , l ( 1 ) | X 1 = x ) ψ ( l ) log V d log q ( x ) , m . Set Z m , l ( x ) : = E ( log ϕ m , l ( 1 ) | X 1 = x ) = E log ξ m , l , x . Consider x A and take any m m 0 . We use the following property of G N which is shown in Appendix A.
Lemma 6.
For each N N , a function G N is convex on R + .
By the Jensen inequality a function G N is nondecreasing and convex.
G N ( | Z m , l ( x ) | ) = G N ( | E log ξ m , l , x | ) G N ( E | log ξ m , l , x | ) E G N ( | log ξ m , l , x | ) .
Relation (53) guarantees that, for all m m 0 ,
R d G N ( | Z m , l ( x ) | ) p ( x ) d x U 1 ( ε , N , d ) Q p , q ( ε , R ) + U 2 ( ε , N , d , l ) + κ T p , q ( ε , R ) + A ( δ , d ) K p , q ( 1 , N ) + B ( δ , d , N ) < .
Now we know that the family { Z m , l ( x ) } m m 0 , x A , is uniformly integrable w.r.t. measure P X . Thus, for i N ,
E log ϕ m , l ( i ) = R d E ( log ϕ m , l ( 1 ) | X 1 = x ) P X 1 ( d x ) = R d Z m , l ( x ) p ( x ) d x ψ ( l ) log V d R d p ( x ) log q ( x ) d x , m ,
and we come to relation (23) establishing Statement 1.
Step 5. Here we prove Statement 2. Similar to F m , l , x ( u ) , one can introduce, for n , k N , n k + 1 , x R d and u 0 , the following function
F ˜ n , k , x ( u ) : = P ζ n , k ( i ) u | X i = x = 1 P x X ( k ) ( x , X n \ { x } ) > r n 1 ( u ) = 1 s = 0 k 1 n 1 s V n 1 , x ( u ) s 1 V n 1 , x ( u ) n 1 s : = P ξ ˜ n , k , x u ,
where r n ( u ) was defined in (29), and
V n , x ( u ) : = B ( x , r n ( u ) ) p ( z ) d z , ξ ˜ n , k , x : = ( n 1 ) x X ( k ) ( x , X n \ { x } ) d .
Formulas (54) and (55) show that F ˜ n , k , x ( u ) is the regular conditional distribution function of ζ n , k ( i ) given X i = x . Moreover, for any fixed u 0 and x Λ ( p ) S ( p ) (thus p ( x ) > 0 ),
F ˜ n , k , x ( u ) 1 s = 0 k 1 ( V d p ( x ) u ) s s ! e V d p ( x ) u : = F ˜ k , x ( u ) , n .
Hence, ξ ˜ n , k , x l a w ξ ˜ k , x , x Λ ( p ) S ( p ) , n . Set A ˜ p ( G N ) : = { x S ( p ) : R ˜ N ( x ) < } , where N N and
R ˜ N ( x ) : = y R d , x y > e [ N ] G N ( log x y ) p ( y ) d y .
Take A ˜ : = Λ ( p ) S ( p ) D p ( R ) A ˜ p ( G N ) . Then P X ( A ˜ ) = 1 and, for x A ˜ , one can verify that E G N ( | log ξ ˜ n , k , x | ) C ˜ 0 ( x ) < , for all n n 0 , and therefore E log ξ ˜ n , k , x E log ξ ˜ k , x as n . Thus, E ( log ζ n , k ( 1 ) | X 1 = x ) ψ ( k ) log V d log p ( x ) , n . Set Z ˜ n , k ( x ) : = E ( log ζ n , k ( 1 ) | X 1 = x ) . One can see that, for all n n 0 , R d G N ( | Z ˜ n , k ( x ) | ) p ( x ) d x < . Hence similar to Steps 1–4 we come to relation (24).
So, (14) holds and the proof of Theorem 1 is complete.

5. Proof of Theorem 2

We will follow the general scheme described in Remark 7. However now this scheme is more involved.
First of all note that, in view of Lemma 1, the finiteness of K p , q ( 2 , N ) and K p , p ( 2 , N ) implies the finiteness of K p , q ( 1 , N ) and K p , p ( 1 , N ) , respectively. Thus, the conditions of Theorem 2 entail validity of Theorem 1 statements. Consequently under the conditions of Theorem 2, for n and m large enough, one can claim that D ^ n , m ( k , l ) L 1 ( Ω ) and E D ^ n , m ( k , l ) D ( P X | | P Y ) , as n , m .
We will show that D ^ n , m ( k , l ) L 2 ( Ω ) for all n and m large enough. Then one can write
E D ^ n , m ( k , l ) D ( P X | | P Y ) 2 = var D ^ n , m ( k , l ) + E D ^ n , m ( k , l ) D ( P X | | P Y ) 2 .
Therefore to prove (16) we will demonstrate that var D ^ n , m ( k , l ) 0 , n , m .
Due to (28) the random variables log ϕ m , l ( 1 ) , , log ϕ m , l ( n ) are identically distributed (and log ζ n , k ( 1 ) , , log ζ n , k ( n ) are identically distributed as well). The variables ϕ m , l ( i ) and ζ n , k ( i ) are the same as in (22). We will demonstrate that log ϕ m , l ( 1 ) and log ζ n , k ( 1 ) belong to L 2 ( Ω ) . Hence (22) yields
var D ^ n , m ( k , l ) = 1 n 2 i , j = 1 n cov log ϕ m , l ( i ) log ζ n , k ( i ) , log ϕ m , l ( j ) log ζ n , k ( j ) = 1 n var log ϕ m , l ( 1 ) + 2 n 2 1 i < j n cov log ϕ m , l ( i ) , log ϕ m , l ( j ) + 1 n var log ζ n , k ( 1 ) + 2 n 2 1 i < j n cov log ζ n , k ( i ) , log ζ n , k ( j ) 2 n 2 i , j = 1 n cov log ϕ m , l ( i ) , log ζ n , k ( j ) .
We mainly follow the notation employed in the above proof of Theorem 1, except the possibly different choice of the sets A R d , A ˜ R d , positive U j , C j ( x ) , C ˜ j ( x ) and integers m j , n j , where j Z + and x R d . The following Theorem 2 proof is also subdivided in 5 parts. Steps 1–3 deal with the demonstration of relation 1 n var ( log ϕ m , l ( 1 ) ) 0 as n , m . Step 4 validates the relation 2 n 2 1 i < j n cov ( log ϕ m , l ( i ) , log ϕ m , l ( j ) ) 0 as n , m . At Step 5 we establish that
2 n 2 1 i < j n cov ( log ζ n , k ( i ) , log ζ n , k ( j ) ) 0 , n ,
This step is rather involved. Step 6 justifies the desired statement var D ^ n , m ( k , l ) 0 , n , m .
Step 1. We study E log 2 ϕ m , l ( 1 ) , as m . For x R d and N N , introduce
R N , 2 ( x ) : = x y e [ N ] G N ( log 2 x y ) q ( y ) d y .
Set A p , 2 ( G N ) : = { x S ( p ) : R N , 2 ( x ) < } . Then P X ( S ( p ) \ A p , 2 ( G N ) ) = 0 since K p , q ( 2 , N ) < . Consider
A : = Λ ( q ) S ( q ) D q ( R ) S ( p ) A p , 2 ( G N ) ,
where the first four sets appeared in Theorem 1 proof, R and N are indicated in conditions of Theorem 2. It is easily seen that P X ( A ) = 1 . The reasoning is exactly the same as in the proof of Theorem 1.
Recall that, for each x A , one has log ξ m , l , x l a w log ξ l , x , m , where ξ m , l , x : = m x Y ( l ) ( x , Y m ) d and ξ l , x has Γ ( V d q ( x ) , l ) distribution. Convergence in law of random variables is maintained by continuous transformations. Thus, for each x A , we get
log 2 ξ m , l , x l a w log 2 ξ l , x , m .
For any x A , according to (28),
E log 2 ξ m , l , x = ( 0 , ) log 2 u d F m , l , x ( u ) = ( 0 , ) log 2 u d P ( ϕ m , l ( 1 ) u | X 1 = x ) = E ( log 2 ϕ m , l ( 1 ) | X 1 = x ) .
Note that if η Γ ( α , λ ) , where α > 0 and λ > 0 , then it is not difficult to verify that
E log 2 η = Γ ( λ ) Γ ( λ ) 2 ψ ( λ ) log α + log 2 α .
Since ξ l , x Γ ( V d q ( x ) , l ) , for x S ( q ) , one has
E log 2 ξ l , x = Γ ( l ) Γ ( l ) 2 ψ ( l ) log ( V d q ( x ) ) + log 2 ( V d q ( x ) ) = log 2 q ( x ) + h 1 log q ( x ) + h 2 ,
where h 1 : = h 1 ( l , d ) and h 2 : = h 2 ( l , d ) depend only on fixed l and d.
We prove now that, for x A , one has
E ( log 2 ϕ m , l ( 1 ) | X 1 = x ) log 2 q ( x ) + h 1 log q ( x ) + h 2 , m .
Taking into account (60) and (61) we can claim that relation (62) is equivalent to the following one: E log 2 ξ m , l , x E log 2 ξ l , x , m . So, in view of (59) to prove (62) it is sufficient to show that, for each x A , a family log 2 ξ m , l , x m m 0 ( x ) is uniformly integrable for some m 0 ( x ) N . Then, following Theorem 1 proof, one can certify that, for all x A and some nonnegative C 0 ( x ) ,
sup m m 0 ( x ) E G N ( log 2 ξ m , l , x ) C 0 ( x ) < .
Step 2. Now we will prove (63). For each N N , introduce ρ ( N ) : = exp { e [ N 1 ] } and
h N ( t ) : = 0 , t 1 ρ ( N ) , ρ ( N ) , 2 log t t log [ N ] ( log 2 t ) + 1 j = 1 N 1 log [ j ] ( log 2 t ) , t 0 , 1 ρ ( N ) ρ ( N ) , .
As usual, a product over an empty set (if N = 1 ) equals to 1. To show (63) we refer to the next lemma.
Lemma 7.
Let F ( u ) , u R , be a distribution function such that F ( 0 ) = 0 . Fix an arbitrary N N . Then
( 1 ) 0 , 1 ρ ( N ) G N ( log 2 u ) d F ( u ) = 0 , 1 ρ ( N ) F ( u ) ( h N ( u ) ) d u ,
( 2 ) ρ ( N ) , G N ( log 2 u ) d F ( u ) = ρ ( N ) , ( 1 F ( u ) ) h N ( u ) d u .
The proof of this lemma is omitted, being quite similar to one of Lemma 2. By Lemma 7 and since G N ( log 2 u ) = 0 , for u 1 ρ ( N ) , ρ ( N ) , one has
E G N ( log 2 ξ m , l , x ) = 0 , 1 ρ ( N ) F m , l , x ( u ) ( h N ( u ) ) d u + ρ ( N ) , ( 1 F m , l , x ( u ) ) h N ( u ) d u : = I 1 ( m , x ) + I 2 ( m , x ) .
To simplify notation we do not indicate the dependence of I i ( m , x ) ( i = 1 , 2 ) on fixed N, l and d.
For clarity, further implementation of Step 2 is divided into several parts.
Part (2a).At first we consider I 1 ( m , x ) . As in Theorem 1 proof, for fixed R > 0 and ε > 0 appearing in the conditions of Theorem 2, an inequality F m , l , x ( u ) ( M q ( x , R ) ) ε V d ε u ε holds for any x A , u 0 , 1 ρ ( N ) and m m 1 : = max 1 ρ ( N ) R d , l . Taking into account that 0 ( h N ( u ) ) ( 2 log u ) log [ N ] ( log 2 u ) + 1 u if u 0 , 1 ρ ( N ) , we get, for m m 1 ,
I 1 ( m , x ) ( M q ( x , R ) ) ε V d ε 0 , 1 ρ ( N ) ( 2 log u ) log [ N ] ( log 2 u ) + 1 u 1 ε d u = U 1 ( ε , N , d ) ( M q ( x , R ) ) ε .
Here U 1 ( ε , N , d ) : = V d ε L N , 2 ( ε ) , L N , 2 ( ε ) : = e [ N 1 ] , 2 t log [ N ] ( t 2 ) + 1 e ε t d t < for each ε > 0 and any N N .
Part (2b).Consider I 2 ( m , x ) . Following the previous theorem proof we at first observe that h N ( u ) 2 log u u log [ N ] ( log 2 u ) + 1 for u ( ρ ( N ) , ) . So, for all m max { ρ 2 ( N ) , l } ,
I 2 ( m , x ) ( ρ ( N ) , m ] ( 1 F m , l , x ( u ) ) 2 log u log [ N ] ( log 2 u ) + 1 u d u + ( m , m 2 ] ( 1 F m , l , x ( u ) ) 2 log u log [ N ] ( log 2 u ) + 1 u d u + ( m 2 , ) ( 1 F m , l , x ( u ) ) h N ( u ) d u : = J 1 ( m , x ) + J 2 ( m , x ) + J 3 ( m , x ) ,
where we do not indicate the dependence of J j ( m , x ) ( j = 1 , 2 , 3 ) on N, l and d.
For R > 0 and ε > 0 appearing in the conditions of Theorem 2, one can show (see Theorem 1 proof), that inequality
1 F m , l , x ( u ) S 1 S 2 V d u m q ( x , R ) ε
holds for any x A , u ρ ( N ) , m and all m m 2 : = max 1 R 2 d , ρ 2 ( N ) , l . Here S 1 : = S 1 ( l ) and S 2 are the same as in the proof of Theorem 1. For all x A and m m 2 , we come to the relations
J 1 ( m , x ) S 1 ( S 2 V d ) ε ( m q ( x , R ) ) ε ( ρ ( N ) , ) 2 log u log [ N ] ( log 2 u ) + 1 u 1 + ε d u = U 2 ( ε , N , d , l ) ( m q ( x , R ) ) ε ,
where U 2 ( ε , N , d , l ) : = 2 S 1 ( l ) L N , 2 ( ε ) ( S 2 V d ) ε .
Part (2c). Let us consider J 2 ( m , x ) . Take δ > 0 . Then, due to (65), for all x A and any m m 2 ,
J 2 ( m , x ) 2 1 F m , l , x ( m ) m , m 2 log u log [ N ] ( log 2 u ) + 1 d log u 4 S 1 ( S 2 V d ) ε m ε 2 m q ( x , R ) ε log [ N ] ( 4 log 2 m ) + 1 log 2 m = U 3 ( m , ε , N , d , l ) m q ( x , R ) ε ,
where U 3 ( m , ε , N , d , l ) : = 4 S 1 ( S 2 V d ) ε m ε 2 log 2 m log [ N ] ( 4 log 2 m ) + 1 0 , m .
Part (2d). Now we turn to J 3 ( m , x ) . Take u = m w . Then J 3 ( m , x ) has the form
m , ( 1 F m , l , x ( m w ) ) 2 log ( m w ) w log [ N ] ( log 2 ( m w ) ) + 1 j = 1 N 1 log [ j ] ( log 2 ( m w ) ) d w .
Due to Lemma 5 there exists T ( N ) > ρ ( N ) such that
log [ N ] ( log 2 ( w 2 ) ) = log [ N ] ( 4 log 2 w ) 2 log [ N ] ( log 2 w ) , w T ( N ) .
Pick some δ > 0 and set m 3 : = max l , ( l 1 ) 1 + 1 δ , T ( N ) , ρ ( N ) , where T ( N ) was introduced in (68). Consider m m 3 . In view of Lemma 4 (for ν = 2 ), (49), (68) and since w > m ,
J 3 ( m , x ) m , ( 1 F m , l , x ( m w ) ) 2 log ( w 2 ) w log [ N ] ( log 2 ( w 2 ) ) + 1 j = 1 N 1 log [ j ] ( log 2 w ) d w 4 ( 1 + δ ) m , ( 1 F 1 , 1 , x ( w ) ) 2 log w w log [ N ] ( log 2 w ) + 1 j = 1 N 1 log [ j ] ( log 2 w ) d w = 4 ( 1 + δ ) m , ( 1 F 1 , 1 , x ( w ) ) h N ( w ) d w 4 ( 1 + δ ) ρ ( N ) , ( 1 F 1 , 1 , x ( w ) ) h N ( w ) d w = 4 ( 1 + δ ) ρ ( N ) , G N ( log 2 w ) d F 1 , 1 , x ( w ) = 4 ( 1 + δ ) E [ G N ( log 2 ξ 1 , 1 , x ) I { ξ 1 , 1 , x > ρ ( N ) } ] = 4 ( 1 + δ ) E [ G N ( ( log Y x d ) 2 ) I { Y x d > ρ ( N ) } ] = 4 ( 1 + δ ) y R d , x y > ( ρ ( N ) ) 1 / d G N ( ( log x y d ) 2 ) q ( y ) d y 4 ( 1 + δ ) a ( d , 2 ) y R d , x y > ρ ( N ) 1 / d G N ( log 2 x y ) q ( y ) d y + b ( N , d , 2 )
4 ( 1 + δ ) a ( d , 2 ) R N , 2 ( x ) + G N ( e [ N 1 ] 2 ) + b ( N , d , 2 ) = A ( δ , d ) R N , 2 ( x ) + B ( δ , d , N ) ,
A ( δ , d ) : = 4 ( 1 + δ ) a ( d , 2 ) , B ( δ , d , N ) : = 4 ( 1 + δ ) a ( d , 2 ) G N ( e [ N 1 ] 2 ) + b ( N , d , 2 ) , R N , 2 ( x ) is defined in (57). Here we have also used, for any N N , ν , t , u > 0 , t < u , the following estimates
K p , q ( ν , N , u ) K p , q ( ν , N , t ) K p , q ( ν , N , u ) + max { G N ( | log t | ν ) , G N ( | log u | ν ) } .
Part (2e). We now examine E G N ( log 2 ξ m , l , x ) . For each x A and m max { m 1 , m 2 , m 3 } , taking into account (64), (66), (67) and (69), we can claim that
E G N ( log 2 ξ m , l , x ) I 1 ( m , x ) + J 1 ( m , x ) + J 2 ( m , x ) + J 3 ( m , x ) U 1 ( ε , N , d ) ( M q ( x , R ) ) ε + U 2 ( ε , N , d , l ) ( m q ( x , R ) ) ε + U 3 ( m , ε , N , d , l ) m q ( x , R ) ε
+ A ( δ , d ) R N , 2 ( x ) + B ( δ , d , N ) .
Moreover, for any κ > 0 , one can choose m 4 : = m 4 ( κ , ε , N , d , l ) N such that, for m m 4 , it holds U 3 ( m , ε , N , d , l ) κ . Then by (70), for each x A and m m 0 : = max { m 1 , m 2 , m 3 , m 4 } ,
E G N ( log 2 ξ m , l , x ) U 1 ( ε , N , d ) ( M q ( x , R ) ) ε + U 2 ( ε , N , d , l ) + κ ( m q ( x , R ) ) ε + A ( δ , d ) R N , 2 ( x ) + B ( δ , d , N ) : = C 0 ( x ) < .
Hence we have proved the uniform integrability of the family log 2 ξ m , l , x m m 0 for each x A . Therefore, for any x A (thus for P X -almost every x S ( p ) ), relation (62) holds.
Step 3. Now we can return to E log 2 ϕ m , l ( 1 ) . Set Δ m , l ( x ) : = E ( log 2 ϕ m , l ( 1 ) | X 1 = x ) = E log 2 ξ m , l , x . Consider x A and take any m m 0 . The function G N is nondecreasing and convex according to Lemma 6. By the Jensen inequality,
G N ( Δ m , l ( x ) ) = G N ( E log 2 ξ m , l , x ) E G N ( log 2 ξ m , l , x ) .
Relation (72) guarantees that, for each x A and all m m 0 ,
R d G N ( Δ m , l ( x ) ) p ( x ) d x U 1 ( ε , N , d ) Q p , q ( ε , R ) + U 2 ( ε , N , d , l ) + κ T p , q ( ε , R ) + A ( δ , d ) K p , q ( 2 , N ) + B ( δ , d , N ) < .
Uniform integrability of the family { Δ m , l ( · ) } m m 0 (w.r.t. the measure P X ) is thus established. Hence one can claim that
E log 2 ϕ m , l ( 1 ) R d p ( x ) log 2 q ( x ) d x + h 1 R d p ( x ) log q ( x ) d x + h 2 , m .
It is easily seen that finiteness of integrals Q p , q ( ε , R ) , T p , q ( ε , R ) implies that
R d p ( x ) log 2 q ( x ) d x < , R d p ( x ) | log q ( x ) | d x < .
Thus, E log 2 ϕ m , l ( 1 ) τ 2 < and var log ϕ m , l ( 1 ) = E log 2 ϕ m , l ( 1 ) E log ϕ m , l ( 1 ) 2 τ 2 τ 1 2 < , m , where τ 1 : = ψ ( l ) log V d R d p ( x ) log q ( x ) d x according to (23). Consequently, 1 n var log ϕ m , l ( 1 ) 0 as n , m .
Step 4. Now we consider cov ( log ϕ m , l ( i ) , log ϕ m , l ( j ) ) for i j , where i , j { 1 , , n } . For x , y R d , define the conditional distribution function
Φ m , l , x , y i , j ( u , w ) : = P ( ϕ m , l ( i ) u , ϕ m , l ( j ) w | X i = x , X j = y ) , u , w 0 .
For x , y R d , u , w 0 , i j ,
Φ m , l , x , y i , j ( u , w ) = 1 P ( ϕ m , l ( i ) > u | X i = x , X j = y ) P ( ϕ m , l ( j ) > w | X i = x , X j = y ) + P ( ϕ m , l ( i ) > u , ϕ m , l ( j ) > w | X i = x , X j = y ) = 1 P x Y ( l ) ( x , Y m ) > r m ( u ) P y Y ( l ) ( y , Y m ) > r m ( w ) + P x Y ( l ) ( x , Y m ) > r m ( u ) , y Y ( l ) ( y , Y m ) > r m ( w ) .
Here r m ( a ) = a m 1 d for all a 0 , as previously. One can write Φ m , l , x , y ( u , w ) instead of Φ m , l , x , y i , j ( u , w ) , because the right-hand side of (73) does not depend on i and j.
Set A 1 : = ( x , y ) : x A , y A , x y and A 2 : = ( x , y ) : x A , y A , x = y , where A is introduced in (58). Evidently, P X P X ( A 1 ) = 1 and P X P X ( A 2 ) = 0 . Consider ( x , y ) A 1 . Obviously, for any a > 0 , r m ( a ) 0 , as m . For ( x , y ) A 1 we take m 5 = m 5 ( u , w , x y ) : = 1 + 2 x y d max u , w . Then r m ( u ) < x y 2 and r m ( w ) < x y 2 for all m m 5 . Thus, B ( x , r m ( u ) ) B ( y , r m ( w ) ) = if m m 5 . Consequently, for m m 6 ( u , w , x y ) : = max m 5 , 2 ( l 1 ) ,
P x Y ( l ) ( x , Y m ) > r m ( u ) , y Y ( l ) ( y , Y m ) > r m ( w ) = s 1 = 0 l 1 s 2 = 0 l 1 m ! s 1 ! s 2 ! ( m s 1 s 2 ) ! W m , x ( u ) s 1 W m , y ( w ) s 2 1 W m , x ( u ) W m , y ( w ) m s 1 s 2 .
In view of (28), (73) and (74), one has for Φ m , l , x , y ( u , w ) the following representation
1 s 1 = 0 l 1 m s 1 W m , x ( u ) s 1 1 W m , x ( u ) m s 1 s 2 = 0 l 1 m s 2 W m , y ( w ) s 2 1 W m , y ( w ) m s 2 + s 1 = 0 l 1 s 2 = 0 l 1 m ! s 1 ! s 2 ! ( m s 1 s 2 ) ! W m , x ( u ) s 1 W m , y ( w ) s 2 1 W m , x ( u ) W m , y ( w ) m s 1 s 2 .
For any fixed ( x , y ) A 1 and u , w 0 , we get, as m ,
m ! s 1 ! s 2 ! ( m s 1 s 2 ) ! W m , x ( u ) s 1 W m , y ( w ) s 2 ( V d u q ( x ) ) s 1 s 1 ! ( V d w q ( y ) ) s 2 s 2 ! , 1 W m , x ( u ) W m , y ( w ) m s 1 s 2 e V d u q ( x ) + w q ( y ) .
Then, according to (31), (75) and (76), for all fixed u , w 0 , ( x , y ) A 1 , one has
Φ m , l , x , y ( u , w ) 1 s 1 = 0 l 1 ( V d u q ( x ) ) s 1 s 1 ! e V d u q ( x ) s 2 = 0 l 1 ( V d w q ( y ) ) s 2 s 2 ! e V d w q ( y ) + s 1 = 0 l 1 s 2 = 0 l 1 ( V d u q ( x ) ) s 1 s 1 ! ( V d w q ( y ) ) s 2 s 2 ! e V d u q ( x ) + w q ( y ) = 1 s 1 = 0 l 1 ( V d u q ( x ) ) s 1 s 1 ! e V d u q ( x ) 1 s 2 = 0 l 1 ( V d w q ( y ) ) s 2 s 2 ! e V d w q ( y ) = F l , x ( u ) F l , y ( w ) : = Φ l , x , y ( u , w ) , m .
Thus, Φ l , x , y ( · , · ) is identified as a distribution function of a vector η l , x , y : = ( ξ l , x , ξ l , y ) having independent components such that ξ l , x Γ ( V d q ( x ) , l ) , ξ l , y Γ ( V d q ( y ) , l ) . Observe also that Φ m , l , x , y ( · , · ) is a distribution function of a random vector η m , l , x , y : = ( ξ m , l , x , ξ m , l , y ) . Consequently, we have shown that η m , l , x , y l a w η l , x , y as m . Hence, for any ( x , y ) A 1 ,
log ξ m , l , x log ξ m , l , y l a w log ξ l , x log ξ l , y , m .
Here we take strictly positive versions of random variables under consideration. Note that, for all i , j N , i j ,
E ( log ξ m , l , x log ξ m , l , y ) = ( 0 , ) × ( 0 , ) log u log w d Φ m , l , x , y ( u , w ) = E log ϕ m , l ( i ) log ϕ m , l ( j ) | X i = x , X j = y .
One has E ( log ξ l , x log ξ l , y ) = E log ξ l , x E log ξ l , y = a l , d ( x ) a l , d ( y ) because ξ l , x and ξ l , y are independent, here a l , d ( z ) : = ψ ( l ) log V d log q ( z ) , z R d .
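The limiting Gamma law appearing here (that is, ξ l , x distributed as Γ ( V d q ( x ) , l ) , a Gamma distribution with shape l and rate V d q ( x ) ) is easy to probe numerically. The following is a minimal Monte Carlo sketch; the choices d = 1, q equal to the standard normal density, x = 0, the sample sizes and the use of scipy.stats are illustrative assumptions only, not part of the argument above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
d, l, m, trials = 1, 2, 2000, 3000
V_d = 2.0                                  # volume of the unit ball in R^1
q_x0 = 1.0 / np.sqrt(2.0 * np.pi)          # q = standard normal density, evaluated at x = 0

# xi_{m,l,x} = m * (distance from x to its l-th nearest neighbor among Y_1,...,Y_m)^d
samples = rng.standard_normal((trials, m))             # each row is an independent copy of Y_m
dists = np.sort(np.abs(samples), axis=1)[:, l - 1]     # l-th nearest neighbor distance to x = 0
xi = m * dists ** d

rate = V_d * q_x0                          # limiting law: Gamma(shape=l, rate=V_d*q(x))
print(xi.mean(), l / rate)                 # empirical mean vs. mean of the limiting law
print(stats.kstest(xi, "gamma", args=(l, 0.0, 1.0 / rate)))
```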
Now we intend to verify that, for any ( x , y ) A 1 ,
E log ϕ m , l ( 1 ) log ϕ m , l ( 2 ) | X 1 = x , X 2 = y a l , d ( x ) a l , d ( y ) .
Equivalently, one can prove that E ( log ξ m , l , x log ξ m , l , y ) E ( log ξ l , x log ξ l , y ) for each ( x , y ) A 1 , as m .
Part (4a). We will prove the uniform integrability of the family { log ξ m , l , x log ξ m , l , y } m m 0 for ( x , y ) A 1 . The convex function G N ( · ) is nondecreasing. Thus, following the proof of Step 2, for any ( x , y ) A 1 one can find m 0 (the same as at Step 2) such that, for all m m 0 ,
E G N ( | log ξ m , l , x log ξ m , l , y | ) E G N 1 2 log 2 ξ m , l , x + 1 2 log 2 ξ m , l , y 1 2 E G N ( log 2 ξ m , l , x ) + E G N ( log 2 ξ m , l , y ) U 1 2 ( M q ( x , R ) ) ε + ( M q ( y , R ) ) ε + U 2 + κ 2 ( m q ( x , R ) ) ε + ( m q ( y , R ) ) ε + A 2 R N , 2 ( x ) + R N , 2 ( y ) + B : = C ˜ 0 ( x , y ) .
Here we used (71). It is essential that U 1 , U 2 , κ , A , B do not depend on x and y. Hence, for any ( x , y ) A 1 , a family { log ξ m , l , x log ξ m , l , y } m m 0 is uniformly integrable. So, we establish (78) for ( x , y ) A 1 .
Part (4b). We return to cov ( log ϕ m , l ( i ) , log ϕ m , l ( j ) ) for i j , i , j { 1 , , n } . Set T m , l ( x , y ) : = E log ϕ m , l ( 1 ) log ϕ m , l ( 2 ) | X 1 = x , X 2 = y where ( x , y ) A 1 . Then (78) means that T m , l ( x , y ) = E ( log ξ m , l , x log ξ m , l , y ) a l , d ( x ) a l , d ( y ) for any ( x , y ) A 1 , as m . Note that
G N ( | T m , l ( x , y ) | ) = G N ( | E log ξ m , l , x log ξ m , l , y | ) G N ( E | log ξ m , l , x log ξ m , l , y | ) E G N ( | log ξ m , l , x log ξ m , l , y | ) .
As P X P X ( A 1 ) = 1 , one can conclude due to (79) and (80) that, for all m m 0 ,
R d × R d G N ( | T m , l ( x , y ) | ) p ( x ) p ( y ) d x d y = ( x , y ) A 1 G N ( | T m , l ( x , y ) | ) p ( x ) p ( y ) d x d y U 1 R d M q ε ( x , R ) p ( x ) d x + U 2 + κ R d m q ε ( x , R ) p ( x ) d x + A R d R N , 2 ( x ) p ( x ) d x + B = U 1 Q p , q ( ε , R ) + ( U 2 + κ ) T p , q ( ε , R ) + A K p , q ( 2 , N ) + B < .
Therefore, for ( x , y ) A 1 , the family T m , l ( x , y ) m m 0 is uniformly integrable w.r.t. P X P X . Consequently,
R d × R d T m , l ( x , y ) p ( x ) p ( y ) d x d y R d × R d a l , d ( x ) a l , d ( y ) p ( x ) p ( y ) d x d y , m .
Thus
E log ϕ m , l ( 1 ) log ϕ m , l ( 2 ) ψ ( l ) log V d R d log q ( x ) p ( x ) d x 2 , m .
On the other hand, taking also into account (23), we come to the relation
E log ϕ m , l ( 1 ) E log ϕ m , l ( 2 ) ψ ( l ) log V d R d log q ( x ) p ( x ) d x 2 .
Hence (81) and (82) imply that
2 n 2 1 i < j n cov log ϕ m , l ( i ) , log ϕ m , l ( j ) = n 1 n cov ( log ϕ m , l ( 1 ) , log ϕ m , l ( 2 ) ) 0 , n , m .
Step 5. Now we consider cov ( log ζ n , k ( i ) , log ζ n , k ( j ) ) for i j , where i , j { 1 , , n } .
Similarly to Step 4, for x , y R d and u , w 0 , introduce a conditional distribution function
Φ ˜ n , k , x , y i , j ( u , w ) : = P ( ζ n , k ( i ) u , ζ n , k ( j ) w | X i = x , X j = y )
= P x X ( k ) ( x , X n i , j { y } ) r n 1 ( u ) , y X ( k ) ( y , X n i , j { x } ) r n 1 ( w )
: = P ( η ˜ n , k , x y , i , j u , η ˜ n , k , y x , i , j w ) , u , w 0 ,
where X n i , j = X n \ { X i , X j } , η ˜ n , k , x y , i , j : = ( n 1 ) x X ( k ) ( x , X n i , j { y } ) d . We write Φ ˜ n , k , x , y ( u , w ) , η ˜ n , k , x y and η ˜ n , k , y x instead of Φ ˜ n , k , x , y i , j ( u , w ) , η ˜ n , k , x y , i , j , η ˜ n , k , y x , i , j , respectively, (because X 1 , X 2 , are i.i.d. random vectors). Moreover, Φ ˜ n , k , x , y ( u , w ) is the distribution function of a random vector η ˜ n , k , x , y : = ( η ˜ n , k , x y , η ˜ n , k , y x ) and the regular conditional distribution function of a random vector ( ζ n , k ( i ) , ζ n , k ( j ) ) given ( X i , X j ) = ( x , y ) . One has
Φ ˜ n , k , x , y ( u , w ) = 1 P x X ( k ) ( x , X n i , j { y } ) > r n 1 ( u ) P y X ( k ) ( y , X n i , j { x } ) > r n 1 ( w ) + P x X ( k ) ( x , X n i , j { y } ) > r n 1 ( u ) , y X ( k ) ( y , X n i , j { x } ) > r n 1 ( w ) .
Introduce
R ˜ N , 2 ( x ) : = x y e [ N ] G N ( log 2 x y ) p ( y ) d y ,
A ˜ p , 2 ( G N ) : = { x S ( p ) : R ˜ N , 2 ( x ) < } and A ˜ : = Λ ( p ) S ( p ) D p ( R ) A ˜ p , 2 ( G N ) , where the first three sets appeared in the proof of Theorem 1 (Step 5). Then P X ( S ( p ) \ A ˜ p , 2 ( G N ) ) = 0 since K p , p ( 2 , N ) < . It is easily seen that P X ( A ˜ ) = 1 .
Take A ˜ 1 : = ( x , y ) : x A ˜ , y A ˜ , x y and A ˜ 2 : = ( x , y ) : x A ˜ , y A ˜ , x = y . Evidently, P X P X ( A ˜ 1 ) = 1 and P X P X ( A ˜ 2 ) = 0 . For any a > 0 , r m ( a ) 0 , as m . Hence, for ( x , y ) A ˜ 1 , one can find n ˜ 5 = n ˜ 5 ( u , w , x y ) = 2 + 2 x y d max u , w such that r n 1 ( u ) < x y 2 , r n 1 ( w ) < x y 2 if n n ˜ 5 . Then B ( x , r n 1 ( u ) ) B ( y , r n 1 ( w ) ) = if n n ˜ 5 ( u , w , x y ) . Thus, for n n ˜ 6 : = max n ˜ 5 , 2 k , one has
Φ ˜ n , k , x , y ( u , w ) = 1 s 1 = 0 k 1 n 2 s 1 V n 1 , x ( u ) s 1 1 V n 1 , x ( u ) n 2 s 1
s 2 = 0 k 1 n 2 s 2 V n 1 , y ( w ) s 2 1 V n 1 , y ( w ) n 2 s 2
+ s 1 = 0 k 1 s 2 = 0 k 1 ( n 2 ) ! s 1 ! s 2 ! ( n 2 s 1 s 2 ) ! V n 1 , x ( u ) s 1 V n 1 , y ( w ) s 2 1 V n 1 , x ( u ) V n 1 , y ( w ) n 2 s 1 s 2 .
Therefore, for each fixed ( x , y ) A ˜ 1 , u , w 0 , we get, as n ,
Φ ˜ n , k , x , y ( u , w ) 1 s 1 = 0 k 1 ( V d u p ( x ) ) s 1 s 1 ! e V d u p ( x ) s 2 = 0 k 1 ( V d w p ( y ) ) s 2 s 2 ! e V d w p ( y ) + s 1 = 0 k 1 s 2 = 0 k 1 ( V d u p ( x ) ) s 1 s 1 ! ( V d w p ( y ) ) s 2 s 2 ! e V d u p ( x ) + w p ( y ) = 1 s 1 = 0 k 1 ( V d u p ( x ) ) s 1 s 1 ! e V d u p ( x ) 1 s 2 = 0 k 1 ( V d w p ( y ) ) s 2 s 2 ! e V d w p ( y ) = F ˜ k , x ( u ) F ˜ k , y ( w ) : = Φ ˜ k , x , y ( u , w ) .
Here Φ ˜ k , x , y ( · , · ) denotes the distribution function of a vector η ˜ k , x , y : = ( ξ ˜ k , x , ξ ˜ k , y ) . The components of η ˜ k , x , y are independent, ξ ˜ k , x Γ ( V d p ( x ) , k ) and ξ ˜ k , y Γ ( V d p ( y ) , k ) . Consequently, for each fixed ( x , y ) A ˜ 1 , we have shown that η ˜ n , k , x , y l a w η ˜ k , x , y as n . Therefore, for such ( x , y ) ,
log η ˜ n , k , x y log η ˜ n , k , y x l a w log ξ ˜ k , x log ξ ˜ k , y , n .
Here we take strictly positive versions of the random variables under consideration. In a way similar to (77), for i , j { 1 , , n } , i j , we write
E ( log η ˜ n , k , x y log η ˜ n , k , y x ) = ( 0 , ) × ( 0 , ) log u log w d Φ ˜ n , k , x , y ( u , w ) = E log ζ n , k ( i ) log ζ n , k ( j ) | X i = x , X j = y .
Since ξ ˜ k , x and ξ ˜ k , y are independent, write E ( log ξ ˜ k , x log ξ ˜ k , y ) = E log ξ ˜ k , x E log ξ ˜ k , y = b k , d ( x ) b k , d ( y ) , where b k , d ( z ) : = ψ ( k ) log V d log p ( z ) , z R d .
For any fixed M > 0 , consider A ˜ 1 , M : = ( x , y ) A ˜ 1 : x y > M . Now our aim is to verify that, for each ( x , y ) A ˜ 1 , M ,
E log ζ n , k ( 1 ) log ζ n , k ( 2 ) | X 1 = x , X 2 = y b k , d ( x ) b k , d ( y ) .
Equivalently, we can prove, for each ( x , y ) A ˜ 1 , M , that
E log η ˜ n , k , x y log η ˜ n , k , y x E log ξ ˜ k , x log ξ ˜ k , y , n .
The restriction to pairs ( x , y ) A ˜ 1 , M is essential for the rest of the proof.
Part (5a). We are going to establish that, for ( x , y ) A ˜ 1 , M , the family { log η ˜ n , k , x y log η ˜ n , k , y x } n n ˜ 0 is uniformly integrable, where n ˜ 0 N does not depend on x , y but might depend on M. Then, due to (83), relation (85) will be valid for such ( x , y ) as well. As we have seen, the function G N ( · ) is nondecreasing and convex. Hence
E G N ( | log η ˜ n , k , x y log η ˜ n , k , y x | ) 1 2 E G N ( log 2 η ˜ n , k , x y ) + E G N ( log 2 η ˜ n , k , y x ) .
Let us consider, for instance, E G N ( log 2 η ˜ n , k , x y ) . Similarly to Step 2, we can write
E G N ( log 2 η ˜ n , k , x y ) = 0 , 1 ρ ( N ) F ˜ n , k , x y ( u ) ( h N ( u ) ) d u + ρ ( N ) , ( 1 F ˜ n , k , x y ( u ) ) h N ( u ) d u : = I 1 ( n , x , y ) + I 2 ( n , x , y ) ,
where
F ˜ n , k , x y ( u ) : = P η ˜ n , k , x y u = 1 P x X ( k ) ( x , X n i , j { y } ) > r n 1 ( u ) = I x y > r n 1 ( u ) 1 s = 0 k 1 n 2 s V n 1 , x ( u ) s 1 V n 1 , x ( u ) n 2 s + I x y r n 1 ( u ) 1 s = 0 k 2 n 2 s V n 1 , x ( u ) s 1 V n 1 , x ( u ) n 2 s ,
As usual, a sum over the empty set is equal to 0 (this is relevant for k = 1 ).
If u 0 , 1 ρ ( N ) , where ρ ( N ) : = exp { e [ N 1 ] } and n n ˜ 1 : = 1 ρ ( N ) M d + 1 , then r n 1 ( u ) M . Thus, r n 1 ( u ) < x y if ( x , y ) A ˜ 1 , M . In view of (87) and similarly to (38), one has
F ˜ n , k , x y ( u ) n 2 n 1 ε M p ( x , R ) V d u ε M p ( x , R ) ε V d ε u ε
for ( x , y ) A ˜ 1 , M , u 0 , 1 ρ ( N ) , n max { n ˜ 1 ( M ) , n ˜ 2 ( R ) } , here n ˜ 2 ( R ) : = max { 1 ρ ( N ) R d , k + 1 } . So, I 1 ( n , x , y ) U 1 ( ε , N , d ) M p ( x , R ) ε for ( x , y ) A ˜ 1 , M and n max { n ˜ 1 ( M ) , n ˜ 2 ( R ) } . Moreover, for all u 0 , in view of (87) it holds
1 F ˜ n , k , x y ( u ) s = 0 k 1 n 2 s V n 1 , x ( u ) s 1 V n 1 , x ( u ) n 2 s .
The same reasoning as was used in the proof of Theorem 1 (Step 3, Part (3b)) leads to the inequalities
1 F ˜ n , k , x y ( u ) S 1 ( k ) 1 S 2 V n 1 , x ( u ) n 2 S 1 exp S 2 ( n 2 ) V n 1 , x ( u ) S 1 exp n 2 n 1 S 2 V d u m p ( x , R ) S 1 S 2 2 V d u m p ( x , R ) ε
for all n max n ˜ 3 ( R ) , 3 . Then similarly to (70), the relation
E G N ( log 2 η ˜ n , k , x y ) U 1 ( M p ( x , R ) ) ε + U ˜ 2 + κ ( m p ( x , R ) ) ε + A R ˜ N , 2 ( x ) + B : = C ˜ 1 ( x ) <
is valid for all ( x , y ) A ˜ 1 , M and n n ˜ 0 ( M ) : = max n ˜ 1 ( M ) , n ˜ 2 , n ˜ 3 , n ˜ 4 ( κ ) , 3 . Here U 1 , U ˜ 2 , κ , A , B do not depend on x and y. The term E G N ( log 2 η ˜ n , k , y x ) can be treated in the above manner. Thus, in view of (86), one has
E G N ( | log η ˜ n , k , x y log η ˜ n , k , y x | ) U 1 2 ( M p ( x , R ) ) ε + ( M p ( y , R ) ) ε + U 2 + κ 2 ( m p ( x , R ) ) ε + ( m p ( y , R ) ) ε + A 2 R ˜ N , 2 ( x ) + R ˜ N , 2 ( y ) + B : = C ˜ 1 ( x , y ) .
Therefore, for any ( x , y ) A ˜ 1 , M , a family { log η ˜ n , k , x y log η ˜ n , k , y x } n n ˜ 0 is uniformly integrable. Thus, we come to (84) for ( x , y ) A ˜ 1 , M .
Part (5b). Now we return to the upper bound for cov ( log ζ n , k ( 1 ) , log ζ n , k ( 2 ) ) . Set
T ˜ n , k ( x , y ) : = E log ζ n , k ( 1 ) log ζ n , k ( 2 ) | X 1 = x , X 2 = y = E log η ˜ n , k , x y log η ˜ n , k , y x
for all ( x , y ) A ˜ 1 . Validity of (84) is equivalent to the following relation: for any ( x , y ) A ˜ 1 , M , T ˜ n , k ( x , y ) b k , d ( x ) b k , d ( y ) , as n . Take any ( x , y ) A ˜ 1 . For each M > 0 , it was shown that
T ˜ n , k ( x , y ) I { x y > M } b k , d ( x ) b k , d ( y ) I { x y > M } , n .
Note that
G N ( | T ˜ n , k ( x , y ) | I { x y > M } ) G N ( | T ˜ n , k ( x , y ) | ) = G N ( | E log η ˜ n , k , x y log η ˜ n , k , y x | ) G N ( E | log η ˜ n , k , x y log η ˜ n , k , y x | ) E G N ( | log η ˜ n , k , x y log η ˜ n , k , y x | ) .
Due to (88) and (89) one can conclude that, for all n n ˜ 0 ,
R d × R d G N ( | T ˜ n , k ( x , y ) | I { x y > M } ) p ( x ) p ( y ) d x d y U 1 R d M p ε ( x , R ) p ( x ) d x + U ˜ 2 + κ R d m p ε ( x , R ) p ( x ) d x + A R d R ˜ N , 2 ( x ) p ( x ) d x + B = U 1 Q p , p ( ε , R ) + ( U ˜ 2 + κ ) T p , p ( ε , R ) + A K p , p ( 2 , N ) + B < .
Therefore, for ( x , y ) A ˜ 1 , the family T ˜ n , k ( x , y ) I { x y > M } n n ˜ 0 is uniformly integrable w.r.t. P X P X . Hence, by virtue of (84), for each M > 0 ,
D ( M ) T ˜ n , k ( x , y ) p ( x ) p ( y ) d x d y D ( M ) b k , d ( x ) b k , d ( y ) p ( x ) p ( y ) d x d y , n ,
where D ( M ) : = { ( x , y ) ∈ R d × R d : x y > M } . Now we turn to the case x y M . One has $\bigcap_{s=1}^{\infty}\{\|X_1-X_2\|\le \tfrac{1}{s}\}=\{X_1=X_2\}$ and P ( X 1 = X 2 ) = 0 , as X 1 and X 2 are independent and have the density p ( x ) w.r.t. the Lebesgue measure μ . Then, in view of the continuity of a probability measure, P X 1 X 2 M 0 , as M 0 . Taking into account that, for an integrable function h, C h d P 0 as P ( C ) 0 , we get
E ( log ζ n , k ( 1 ) log ζ n , k ( 2 ) I { X 1 X 2 M } ) 0 , M 0 ,
since E log ζ n , k ( 1 ) log ζ n , k ( 2 ) 1 2 E log 2 ζ n , k ( 1 ) + E log 2 ζ n , k ( 2 ) < (the proof is similar to that of the bound E log 2 ϕ m , l ( 1 ) < ). Thus, for any γ > 0 , one can find M 1 = M 1 ( γ ) > 0 such that, for all M ( 0 , M 1 ] and n n ˜ 0 ,
| R 2 d \ D ( M ) T ˜ n , k ( x , y ) p ( x ) p ( y ) d x d y | = | E log ζ n , k ( 1 ) log ζ n , k ( 2 ) I { X 1 X 2 M } | < γ 3 .
Also there exists M 2 = M 2 ( γ ) > 0 such that, for all M ( 0 , M 2 ] ,
| R 2 d \ D ( M ) b k , d ( x ) b k , d ( y ) p ( x ) p ( y ) d x d y | < γ 3 .
Take M ( γ ) : = min { M 1 ( γ ) , M 2 ( γ ) } . Due to (90) there is n ˜ 7 ( M ( γ ) , γ ) such that n max { n ˜ 0 , n ˜ 7 ( M ( γ ) , γ ) } entails the following inequality
| D ( M ) T ˜ n , k ( x , y ) p ( x ) p ( y ) d x d y D ( M ) b k , d ( x ) b k , d ( y ) p ( x ) p ( y ) d x d y | < γ 3 .
So, in view of (91)–(93), for any γ > 0 , there is M ( γ ) > 0 such that, for all n large enough, i.e., n max { n ˜ 0 , n ˜ 7 ( M ( γ ) , γ ) } , one has
| R d × R d T ˜ n , k ( x , y ) p ( x ) p ( y ) d x d y R d × R d b k , d ( x ) b k , d ( y ) p ( x ) p ( y ) d x d y | < γ .
By virtue of the formula
R d × R d b k , d ( x ) b k , d ( y ) p ( x ) p ( y ) d x d y = ψ ( k ) log V d R d ( log p ( x ) ) p ( x ) d x 2 ,
and taking into account (94) we deduce the limit relation, for n ,
E log ζ n , k ( 1 ) log ζ n , k ( 2 ) ψ ( k ) log V d R d ( log p ( x ) ) p ( x ) d x 2 .
Moreover, in view of (24) (see Step 5 of Theorem 1 proof), it follows that
E log ζ n , k ( 1 ) E log ζ n , k ( 2 ) ψ ( k ) log V d R d ( log p ( x ) ) p ( x ) d x 2 .
Therefore,
2 n 2 1 i < j n cov log ζ n , k ( i ) , log ζ n , k ( j ) = n 1 n cov ( log ζ n , k ( 1 ) , log ζ n , k ( 2 ) ) 0 , n .
Step 6. Here we complete the analysis of summands in (56). Reasoning as at Steps 1–3 shows that 1 n 2 i = 1 n var log ζ n , k ( i ) = 1 n var log ζ n , k ( 1 ) 0 since
var ( log ζ n , k ( i ) ) = var ( log ζ n , k ( 1 ) ) v k <
for each i N , as n . It remains to prove that 2 n 2 i , j = 1 n cov ( log ϕ m , l ( i ) , log ζ n , k ( j ) ) 0 , as n , m .
For i = 1 , , n , one has | cov log ϕ m , l ( i ) , log ζ n , k ( i ) | var ( log ϕ m , l ( 1 ) ) var ( log ζ n , k ( 1 ) ) 1 2 < for all n , m large enough. So, it suffices to show that
1 n 2 i , j = 1 , , n ; i j cov ( log ϕ m , l ( i ) , log ζ n , k ( j ) ) 0 , n , m .
For i , j = 1 , , n , i j , u , w 0 , x , y R d , let us introduce a conditional distribution function
P ϕ m , l ( i ) u , ζ n , k ( j ) w | X i = x , X j = y = P x Y ( l ) ( x , Y m ) r m ( u ) , y X ( k ) ( y , X n i , j { x } ) r n 1 ( w ) = P x Y ( l ) ( x , Y m ) r m ( u ) P y X ( k ) ( y , X n i , j { x } ) r n 1 ( w ) = 1 s 1 = 0 l 1 m s 1 ( W m , x ( u ) ) s 1 ( 1 W m , x ( u ) ) m s 1 · ( I x y > r n 1 ( w ) 1 s = 0 k 1 n 2 s V n 1 , y ( w ) s 1 V n 1 , y ( w ) n 2 s + I x y r n 1 ( w ) 1 s = 0 k 2 n 2 s V n 1 , y ( w ) s 1 V n 1 , y ( w ) n 2 s ) .
Here we used that { X n , Y m } is a collection of independent vectors. Now we combine the estimates obtained at Steps 4 and 5 of the proof of Theorem 2 to verify that, for i , j { 1 , , n } and i j , cov ( log ϕ m , l ( i ) , log ζ n , k ( j ) ) = cov ( log ϕ m , l ( 1 ) , log ζ n , k ( 2 ) ) → 0 as n , m → ∞ .
Thus, we have established that var ( D ^ n , m ( k , l ) ) 0 as n , m , hence (16) holds. The proof of Theorem 2 is complete.

6. Conclusions

The aim of this paper is to provide wide conditions ensuring the asymptotic unbiasedness and mean square consistency of the statistical estimates of the Kullback–Leibler divergence proposed in [31]. We do not impose restrictions on the smoothness of the densities under consideration and do not assume that the densities have bounded supports. Thus, in particular, one can apply our results to various mixtures of distributions, for instance, to mixtures of nondegenerate normal laws in R d (Corollary 4). As a byproduct, we relax the conditions of our recent analysis of the Kozachenko–Leonenko type estimators of the Shannon differential entropy [35] and use these conditions in estimating the cross-entropy as well. Observe that the integral functional K p , q appearing in Theorems 1–3 involves the function G N ( t ) , which is close to the function t when the parameter N is large enough. Thus, we impose an essentially less restrictive condition than the one requiring the function G ( t ) = t 1 + ν , for some ν > 0 , instead of G N ( t ) . Even for the latter choice of G, our results provide the first valid proof that does not appeal to the Fatou lemma (the long-standing problem of obtaining correct proofs was discussed in the Introduction). An interesting and hard problem for future research is to find the class of functions φ : R + → R + such that one can replace G N ( t ) in the expression of K p , q by G ( t ) = t φ ( t ) , where φ ( t ) → ∞ as t → ∞ , and keep the validity of the established theorems. Here one can see an analogy with the investigation of fluctuations of sums of random variables or of the Brownian motion by G. H. Hardy, H. D. Steinhaus, A. Ya. Khinchin, A. N. Kolmogorov, I. G. Petrovski, W. Feller and other researchers. The increasing precision in describing the upper and lower functions led to the law of the iterated logarithm and its generalizations. Another deep problem is to provide sharp conditions for the validity of the CLT for estimates of the Kullback–Leibler divergence.
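To make the discussion above more tangible, here is a minimal sketch (in Python) of a k-nearest-neighbor estimate of the Kullback–Leibler divergence in the spirit of [31]; the function name kl_knn_estimate, the particular digamma corrections and the toy samples below are our illustrative choices and are not claimed to coincide exactly with the statistic D ^ n , m ( k , l ) analyzed in Theorems 1–3.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def kl_knn_estimate(x, y, k=1, l=1):
    """Sketch of a k-NN estimate of D(P||Q) from samples x ~ P (n x d) and y ~ Q (m x d)."""
    n, d = x.shape
    m = y.shape[0]
    # rho_k(i): distance from X_i to its k-th nearest neighbor among the remaining X's
    # (query k+1 neighbors, because the closest point found is X_i itself)
    rho = cKDTree(x).query(x, k=k + 1)[0][:, k]
    # nu_l(i): distance from X_i to its l-th nearest neighbor among the Y's
    nu = cKDTree(y).query(x, k=l)[0]
    if l > 1:
        nu = nu[:, l - 1]
    return (d * np.mean(np.log(nu) - np.log(rho))
            + np.log(m / (n - 1)) + digamma(k) - digamma(l))

# Toy check: P = N(0,1), Q = N(1,1) in dimension one; the true divergence equals 1/2
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(5000, 1))
y = rng.normal(1.0, 1.0, size=(5000, 1))
print(kl_knn_estimate(x, y, k=3, l=3))
```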
Besides purely theoretical aspects, the estimates of entropy and related functionals have diverse applications. In [5], the estimates of the Kullback–Leibler divergence are applied to change-point detection in time series. That issue is important, e.g., in the analysis of stochastic financial models. Moreover, it is interesting to study the spatial variant of this problem. Namely, in [55,56] statistical estimates of entropy and scan-statistics (see, e.g., [57]) were employed for the identification of inhomogeneities of fiber materials. In [58], the Kullback–Leibler divergence estimators are used to identify multivariate spatial clusters in the Bernoulli model. A modification of the idea of the latter paper can also be applied to the analysis of fiber structures. Such structures in R d can be modeled by a spatial stochastic point process specifying the locations of the centers of fibers (segments). A certain law on the unit sphere of R d can be used to model their directions. The length of the fibers can be fixed or follow some distribution on R + . Since various scan domains could contain a random number of observations, the development of the present results will have to be combined with the theory of random sums of random variables; the latter theory (see, e.g., [59]) is essential in this case. Moreover, we intend to employ the studied estimators in feature selection theory, actively used in genome-wide association studies (GWAS), see, e.g., [16,17,22]. In this regard, statistical estimates of the mutual information have been proposed, see, e.g., [12]. We also note the important problem of analyzing the stability of constructing, by means of statistical estimates of the mutual information, a sub-collection of relevant (in a certain sense) factors determining a random response. The above-mentioned applications will be considered in separate publications, supplemented with computer simulations and illustrative graphs.

Author Contributions

Conceptualization, A.B. and D.D.; validation, A.B. and D.D.; writing—original draft preparation, A.B. and D.D.; writing—review and editing, A.B. and D.D.; supervision, A.B.; project administration, A.B.; funding acquisition, A.B. All authors have read and agreed to the published version of the manuscript.

Funding

The work of the first author is supported by the Russian Science Foundation under grant 14-21-00162 and performed at the Steklov Mathematical Institute of Russian Academy of Sciences.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors are grateful to Professor A. Tsybakov for useful discussions. We also thank the Reviewers for remarks and suggestions improving the exposition.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Proofs of Lemmas 1–3 are similar to the proofs of Lemmas 2.5, 3.1 and 3.2 in [35]. We provide them for the sake of completeness.
Proof of Lemma 1. 
(1) Note that log x y > e [ N 1 ] 1 if x y > e [ N ] and N N . Hence, for such x , y , one has ( log x y ) ν ( log x y ) ν 0 if ν ( 0 , ν 0 ] . If N N 0 then G N ( u ) G N 0 ( u ) for u e [ N 1 ] e [ N 0 1 ] . Thus, K p , q ( ν , N ) K p , q ( ν 0 , N 0 ) < for ν ( 0 , ν 0 ] and any integer N N 0 .
(2) Assume that Q p , q ( ε 1 , R 1 ) < . Consider Q p , q ( ε 1 , R ) where R > 0 . If 0 < R R 1 then, for each x R d , in accordance with the definition of M q one has M q ( x , R ) M q ( x , R 1 ) . Consequently, Q p , q ( ε 1 , R ) Q p , q ( ε 1 , R 1 ) < . Let now R > R 1 . One has
M q ( x , R ) max M q ( x , R 1 ) , sup R 1 < r R B ( x , R 1 ) q ( x ) d x + B ( x , r ) \ B ( x , R 1 ) q ( x ) d x μ ( B ( x , r ) )
max M q ( x , R 1 ) , M q ( x , R 1 ) + 1 μ ( B ( x , R 1 ) ) = M q ( x , R 1 ) + 1 μ ( B ( x , R 1 ) ) .
Therefore,
Q p , q ( ε 1 , R ) = R d ( M q ( x , R ) ) ε 1 p ( x ) d x R d M q ( x , R 1 ) + 1 R 1 d V d ε 1 p ( x ) d x max { 1 , 2 ε 1 1 } Q p , q ( ε 1 , R 1 ) + ( R 1 d V d ) ε 1 < .
Suppose now that Q p , q ( ε 1 , R ) < for some ε 1 > 0 and R > 0 . Then, for each ε ( 0 , ε 1 ] , the Lyapunov inequality leads to the estimate Q p , q ( ε , R ) ( Q p , q ( ε 1 , R ) ) ε ε 1 < .
(3) Let T p , q ( ε 2 , R 2 ) < . Take 0 < R R 2 . Then, for any x R d , according to the definition of m q we get 0 m q ( x , R 2 ) m q ( x , R ) . Hence T p , q ( ε 2 , R ) T p , q ( ε 2 , R 2 ) < . Consider R > R 2 . For any x R d and every a > 0 , the function I q ( x , r ) is continuous in r on ( 0 , a ] . Next fix an arbitrary x S ( q ) Λ ( q ) . We see that there exists lim r 0 + I q ( x , r ) = q ( x ) . For such x, set I q ( x , 0 ) : = q ( x ) . Thus, I q ( x , · ) is continuous on any segment [ 0 , a ] . Hence, one can find R ˜ 2 in [ 0 , R 2 ] such that m q ( x , R 2 ) = I q ( x , R ˜ 2 ) and there exists R 0 in [ 0 , R ] such that m q ( x , R ) = I q ( x , R 0 ) . If R 0 R 2 then m q ( x , R ) = m q ( x , R 2 ) (since m q ( x , R ) m q ( x , R 2 ) for R > R 2 and m q ( x , R ) = I q ( x , R 0 ) m q ( x , R 2 ) as R 0 [ 0 , R 2 ] ). Assume that R 0 ( R 2 , R ] . Obviously R 0 > 0 as R 2 > 0 . One has
m q ( x , R ) = I q ( x , R 0 ) = B ( x , R 2 ) q ( y ) d y + B ( x , R 0 ) \ B ( x , R 2 ) q ( y ) d y μ ( ( x , R 0 ) ) B ( x , R 2 ) q ( y ) d y μ ( B ( x , R 0 ) ) = μ ( B ( x , R 2 ) ) μ ( B ( x , R 0 ) ) I q ( x , R 2 ) μ ( B ( x , R 2 ) ) μ ( B ( x , R 0 ) ) m q ( x , R 2 )
= R 2 R 0 d m q ( x , R 2 ) R 2 R d m q ( x , R 2 ) .
Thus, in any case ( R 0 [ 0 , R 2 ] or R 0 ( R 2 , R ] ) one has m q ( x , R ) R 2 R d m q ( x , R 2 ) as R 2 < R . Taking into account that μ ( S ( q ) \ ( S ( q ) Λ ( q ) ) ) = 0 we deduce the inequality
T p , q ( ε 2 , R ) R R 2 ε 2 d T p , q ( ε 2 , R 2 ) < .
Assume now that T p , q ( ε 2 , R ) < for some ε 2 > 0 and R > 0 . Then, for any ε ( 0 , ε 2 ] , the Lyapunov inequality entails T p , q ( ε , R ) ( T p , q ( ε 2 , R ) ) ε ε 2 < . This completes the proof. □
Proof of Lemma 2. 
We begin with relation (1). Observe that if a function g is measurable and bounded on a finite interval ( a , b ] and ν is a finite measure on the Borel subsets of ( a , b ] , then ( a , b ] g ( x ) ν ( d x ) is finite. So, applying the integration-by-parts formula (see, e.g., [33], p. 245), for each a 0 , 1 e [ N ] , we get
a , 1 e [ N ] F ( u ) g N ( u ) d u = a , 1 e [ N ] F ( u ) d G N ( log u ) = G N ( log a ) F ( a ) + a , 1 e [ N ] G N ( log u ) d F ( u ) .
Assume now that 0 , 1 e [ N ] G N ( log u ) d F ( u ) < . Then by the monotone convergence theorem
lim a 0 + ( 0 , a ] G N ( log u ) d F ( u ) = 0 .
Given a > 0 the following lower bound is obvious
( 0 , a ] G N ( log u ) d F ( u ) G N ( log a ) ( 0 , a ] d F ( u ) = G N ( log a ) ( F ( a ) F ( 0 ) ) = G N ( log a ) F ( a ) 0 .
Therefore (A2) implies that
G N ( log a ) F ( a ) 0 , a 0 + .
By the Lebesgue monotone convergence theorem, letting a → 0 + in (A1) yields the desired relation (1) of the Lemma. Now we assume that
0 , 1 e [ N ] F ( u ) g N ( u ) d u < .
Hence, from the equality 0 , 1 e [ N ] F ( u ) g N ( u ) d u = 0 , 1 e [ N ] F ( u ) d ( G N ( log u ) ) we get lim b 0 + ( 0 , b ] F ( u ) d ( G N ( log u ) ) = 0 by the monotone convergence theorem. Therefore, for any c ( 0 , b ) , we come to the inequalities
( 0 , b ] F ( u ) d ( G N ( log u ) ) ( c , b ] F ( u ) d ( G N ( log u ) )
= F ( b ) G N ( log b ) + F ( c ) G N ( log c ) + ( c , b ] G N ( log u ) d F ( u )
F ( c ) G N ( log c ) F ( b ) G N ( log b ) + ( F ( b ) F ( c ) ) G N ( log b )
= F ( c ) G N ( log c ) 1 G N ( log b ) G N ( log c ) .
Let c = b 2 ( b 1 e [ N ] < 1 ). Then, for all positive b small enough,
1 G N ( log b ) G N ( log c ) = 1 G N ( log b ) G N ( 2 log b ) = 1 1 2 log [ N ] ( log b ) log [ N ] ( 2 log b ) 1 2 .
Thus ( 0 , b ] F ( u ) d ( G N ( log u ) ) 1 2 F ( b 2 ) G N ( log ( b 2 ) ) 0 , so F ( b 2 ) G N ( log b 2 ) 0 as b 0 . Consequently we come to (A3) taking a = b 2 . Then (A1) implies relation (1).
If one of the (nonnegative) integrals in (1) were infinite while the other one is finite, we would come to a contradiction. Thus, (1) is established. In quite the same manner one can verify relation (2); therefore, further details are omitted. □
Proof of Lemma 3. 
Take x S ( q ) Λ ( q ) and R > 0 . Suppose that m q ( x , R ) = 0 . Since the function I q ( x , r ) defined in (8) is continuous in ( x , r ) R d × ( 0 , ) , there exists R ˜ [ 0 , R ] ( R ˜ = R ˜ ( x , R ) ) such that m q ( x , R ) = I q ( x , R ˜ ) ( I q ( x , 0 ) : = lim r 0 + I q ( x , r ) = q ( x ) for any x Λ ( q ) by continuity). If R ˜ = 0 then m q ( x , r ) = q ( x ) > 0 as x S ( q ) Λ ( q ) . Hence we have to deal with R ˜ ( 0 , R ] . If I q ( x , R ˜ ) = 0 then B ( x , r ) q ( y ) d y = 0 for any 0 < r R ˜ . Thus, (30) ensures that q ( x ) = 0 . However, x S ( q ) Λ ( q ) . So m q ( x , R ) > 0 for x S ( q ) Λ ( q ) . Thus, S ( q ) Λ ( q ) D q ( R ) : = { x S ( q ) : m q ( x , R ) > 0 } . It remains to note that S ( q ) \ Λ ( q ) R d \ Λ ( q ) and μ ( R d \ Λ ( q ) ) = 0 . Therefore μ ( S ( q ) \ D q ( R ) ) = 0 . □
Proof of Lemma 4. 
We will check that, for given N N and τ > 0 , there exist a : = a ( τ ) 0 and b : = b ( N , τ ) 0 such that, for any c 0 ,
G N ( τ c ) a G N ( c ) + b .
For c = 0 the statement is obviously true. Let c > 0 . It is easily seen that log [ N ] ( τ c ) / log [ N ] ( c ) → 1 as c → ∞ . Hence one can find c 0 ( N , τ ) such that, for all c ≥ c 0 ( N , τ ) , the inequality log [ N ] ( τ c ) / log [ N ] ( c ) ≤ 2 is valid. Consequently, for c ≥ c 0 ( N , τ ) ,
G N ( τ c ) G N ( c ) = τ c log [ N ] ( τ c ) c log [ N ] ( c ) 2 τ : = a ( τ ) .
For all 0 c c 0 ( N , τ ) we write G N ( τ c ) G N ( τ c 0 ( N , τ ) ) : = b ( N , τ ) . Therefore, for any c 0 , we come to (A5). Thus, for any ν > 0 and x , y R d , x y , one has
G N ( | log ( x y d ) | ν ) = G N ( d ν | log ( x y ) | ν ) a ( d ν ) G N ( | log ( x y ) | ν ) + b ( N , d ν ) .
Proof of Lemma 6. 
For $t\in[0,e_{[N-1]}]$, the function $G_N(t)\equiv 0$ is convex. We show that $G_N$ is convex on $(e_{[N-1]},\infty)$. Consider $t>e_{[N-1]}$. Here and below a product over the empty set is set equal to 1 and a sum over the empty set to 0. Then, for $N\in\mathbb{N}$,
$$G_N'(t)=\log_{[N]}(t)+\prod_{j=1}^{N-1}\frac{1}{\log_{[j]}(t)}.$$
Obviously, $\Bigl(\frac{1}{\log_{[k]}(t)}\Bigr)'=-\frac{1}{t\,\log_{[k]}^{2}(t)}\prod_{s=1}^{k-1}\frac{1}{\log_{[s]}(t)}$, $k\in\mathbb{N}$. Thus, for $t>e_{[N-1]}$, we get
$$G_N''(t)=\frac{1}{t}\prod_{j=1}^{N-1}\frac{1}{\log_{[j]}(t)}\Bigl(1-\sum_{k=1}^{N-1}\prod_{s=1}^{k}\frac{1}{\log_{[s]}(t)}\Bigr).$$
For $N=1$ and $t>0$, we have $G_1''(t)=\frac{1}{t}>0$. Take now $N>1$. Clearly, for $t>e_{[N-1]}$, one has $\frac{1}{t}\prod_{j=1}^{N-1}\frac{1}{\log_{[j]}(t)}>0$ because $\log_{[j]}(t)>\log_{[j]}(e_{[N-1]})=e_{[N-1-j]}\ge 1>0$ when $1\le j\le N-1$. Observe also that
$$\sum_{k=1}^{N-1}\prod_{s=1}^{k}\frac{1}{\log_{[s]}(t)}<\sum_{k=1}^{N-1}\prod_{s=1}^{k}\frac{1}{e_{[N-1-s]}}\le\sum_{k=1}^{N-1}\frac{1}{e_{[N-2]}}=\frac{N-1}{e_{[N-2]}}\le 1.$$
The last inequality is established by induction in $N$. Thus, in view of (A6), we have proved that, for all $t>e_{[N-1]}$ and $N\in\mathbb{N}$, the inequality $G_N''(t)>0$ holds. Hence, the function $G_N(t)$ is (strictly) convex on $(e_{[N-1]},\infty)$.
Let $h:[a,\infty)\to\mathbb{R}$ be a continuous nondecreasing function. If the restrictions of $h$ to $[a,b]$ and $(b,\infty)$ (where $a<b$) are convex functions, then, in general, it is not true that $h$ is convex on $[a,\infty)$. However, we can show that $G_N$ is convex on $[0,\infty)$. Note that $G_N$ is convex on $[e_{[N-1]},\infty)$ since it is convex on $(e_{[N-1]},\infty)$ and continuous on $[e_{[N-1]},\infty)$. Take now any $z\in[0,e_{[N-1]}]$, $y\in(e_{[N-1]},\infty)$ and $s\in[0,1]$. Then
$$G_N(sz+(1-s)y)\le G_N(se_{[N-1]}+(1-s)y)\le sG_N(e_{[N-1]})+(1-s)G_N(y)=(1-s)G_N(y)=sG_N(z)+(1-s)G_N(y),$$
as $G_N(z)=0$. Thus, for each $N\in\mathbb{N}$, the function $G_N(\cdot)$ is convex on $\mathbb{R}_+$. □
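As a quick numerical companion to Lemma 6, one may differentiate G N symbolically for a small N. The snippet below is only a sketch, assuming (as in the computation above) that G N ( t ) = t log [ N ] ( t ) for t > e [ N 1 ] ; it treats the case N = 2, where e [ 1 ] = e.

```python
import sympy as sp

t = sp.symbols("t", positive=True)
G2 = t * sp.log(sp.log(t))                  # G_2(t) = t * log(log t) on (e, infinity)
G2_second = sp.simplify(sp.diff(G2, t, 2))  # second derivative
print(G2_second)                            # expected to simplify to (log(t) - 1)/(t*log(t)**2)
print(G2_second.subs(t, sp.exp(2)))         # strictly positive value at t = e**2
```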
Proof of Corollary 3. 
The proof (i.e., checking the conditions of both Theorem 1 and 2) is quite similar to the proof of Corollary 2.11 in [35]. □
Proof of Corollary 4. 
Take f ( x ) = i = 1 k γ i f i ( x ) , where f i ( x ) is a density, γ i > 0 , i = 1 , , k , i = 1 k γ i = 1 , x R d . Then according to (9) and (10), for any x R d , r > 0 and R > 0 , one has I f ( x , r ) = i = 1 k γ i I f i ( x , r ) , M f ( x , R ) i = 1 k γ i M f i ( x , R ) , m f ( x , R ) i = 1 k γ i m f i ( x , R ) . We will apply these relations for f = p and f = q . It is well-known that, for any ε > 0 , c i 0 , i = 1 , , k , k N , the following inequality is valid ( i = 1 k c i ) ε max { 1 , k ε 1 } i = 1 k c i ε . Moreover, this inequality is obviously satisfied for all ε R as for ε 0 it holds ( i = 1 k c i ) ε i = 1 k c i ε . Therefore
Q p , q ( ε , R ) max { 1 , J ε 1 } i = 1 I j = 1 J a i b j ε Q p i , q j ( ε , R ) < ,
T p , q ( ε , R ) i = 1 I j = 1 J a i b j ε T p i , q j ( ε , R ) < .
The same reasoning leads to bounds Q p , p ( ε , R ) < and T p , p ( ε , R ) < . Now in view of (13), for ν > 0 , t > 0 and N N , we can write K p , q ( ν , N , t ) = i = 1 I j = 1 J a i b j K p i , q j ( ν , N , t ) . In this manner we can also represent K p , p ( ν , N , t ) . □
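For the reader's convenience, here is a short derivation (a standard convexity argument) of the elementary power inequality used in the proof of Corollary 4 above. For $\varepsilon\ge 1$, the Jensen inequality applied to the convex function $t\mapsto t^{\varepsilon}$ gives
$$\Bigl(\sum_{i=1}^{k} c_i\Bigr)^{\varepsilon}=k^{\varepsilon}\Bigl(\frac{1}{k}\sum_{i=1}^{k} c_i\Bigr)^{\varepsilon}\le k^{\varepsilon}\cdot\frac{1}{k}\sum_{i=1}^{k} c_i^{\varepsilon}=k^{\varepsilon-1}\sum_{i=1}^{k} c_i^{\varepsilon},$$
while for $0<\varepsilon\le 1$ the subadditivity of $t\mapsto t^{\varepsilon}$ yields $(\sum_{i=1}^{k} c_i)^{\varepsilon}\le\sum_{i=1}^{k} c_i^{\varepsilon}$; combining the two cases produces the factor $\max\{1,k^{\varepsilon-1}\}$.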
Lemma A1.
Let probability measures P , Q and a σ-finite measure μ (e.g., the Lebesgue measure) be defined on ( R d , B ( R d ) ) . Assume that P and Q have densities p ( x ) and q ( x ) , x R d , w.r.t. the measure μ. Then the following statements are true.
(1) 
P ≪ Q if and only if P ( S ( p ) \ S ( q ) ) = 0 ;
(2) 
formula (2) holds.
Proof of Lemma A1. 
(1) Let P Q . Obviously Q ( R d \ S ( q ) ) = 0 . Therefore P ( R d \ S ( q ) ) = 0 . Since S ( p ) \ S ( q ) R d \ S ( q ) , one has P ( S ( p ) \ S ( q ) ) = 0 .
Now let P ( S ( p ) \ S ( q ) ) = 0 . Assume that P is not absolutely continuous w.r.t. Q . Then there exists a set A such that Q ( A ) = 0 and P ( A ) > 0 . Consequently μ ( A ) > 0 as P μ . We can write A = A 1 A 2 , where A 1 : = A ( R d \ S ( q ) ) , A 2 : = A S ( q ) . We get Q ( A ) = Q ( A 1 ) + Q ( A 2 ) as A 1 A 2 = . Note that Q ( A 1 ) = 0 since q 0 on A 1 , so Q ( A 2 ) = 0 . Relation Q ( A 2 ) = A 2 q ( x ) μ ( d x ) yields μ ( A 2 ) = 0 ( q > 0 on A 2 and μ is a σ -finite measure). One has P ( A 2 ) = 0 because P μ . Thus, P ( A ) = P ( A 1 ) + P ( A 2 ) = P ( A 1 ) > 0 . Clearly, A 1 R d \ S ( q ) . Hence P ( S ( p ) \ S ( q ) ) = P ( S ( p ) ( R d \ S ( q ) ) ) P ( S ( p ) A 1 ) = P ( A 1 ) > 0 . We come to the contradiction. Therefore P Q .
In such a way we have proved that if P μ and Q μ , the relation P Q holds if and only if P ( S ( p ) \ S ( q ) ) = 0 . Obviously we can take as p and q any versions of d P d μ and d Q d μ .
(2) Suppose that P Q . We know that P , Q are probability measures, Q μ where μ is a σ -finite measure. Then, in view of [33], statement (b) of Lemma on p. 273, the following equality d P d Q = d P d μ / d Q d μ holds Q -a.s. and consequently P -a.s. too (on the set B : = { x : d Q d μ = 0 } having Q ( B ) = 0 a density d P d Q can be taken equal to zero). So, d P d Q ( x ) = p ( x ) q ( x ) for P -almost all x R d . One has
R d p ( x ) log p ( x ) q ( x ) μ ( d x ) = R d log p ( x ) q ( x ) d P = R d log d P d Q d P ,
where all integrals converge or diverge simultaneously. Indeed, if h is a measurable function with values in [ −∞ , ∞ ] then A h ( x ) ν ( d x ) : = 0 whenever ν ( A ) = 0 ( ν being a finite or a σ -finite measure). We also employed [33], statement (a) of the Lemma on p. 273, when passing from integration w.r.t. μ to integration w.r.t. P .
Now assume that P is not absolutely continuous w.r.t. Q , i.e., P ( S ( p ) \ S ( q ) ) > 0 in view of part (1) of the present Lemma. As usual, for any measurable B R d , B 0 μ ( d x ) = 0 . Then
R d p ( x ) log p ( x ) q ( x ) μ ( d x ) = S ( p ) \ S ( q ) p ( x ) log p ( x ) q ( x ) μ ( d x ) + S ( p ) S ( q ) p ( x ) log p ( x ) q ( x ) μ ( d x ) .
Evidently
$$\int_{S(p)\setminus S(q)} p(x)\log\frac{p(x)}{q(x)}\,\mu(dx)=\int_{S(p)\setminus S(q)}\log\frac{p(x)}{q(x)}\,P(dx)=\infty\cdot P(S(p)\setminus S(q))=\infty$$
as P ( S ( p ) \ S ( q ) ) > 0 . Since $\log t\le t-1$ for $t>0$, we have, for all $x\in S(p)\cap S(q)$, $-\log\frac{p(x)}{q(x)}=\log\frac{q(x)}{p(x)}\le\frac{q(x)}{p(x)}-1$. Thus, $\int_{S(p)\cap S(q)}p(x)\log\frac{p(x)}{q(x)}\,\mu(dx)\ge\int_{S(p)\cap S(q)}p(x)\bigl(1-\frac{q(x)}{p(x)}\bigr)\mu(dx)=\int_{S(p)\cap S(q)}p(x)\,\mu(dx)-\int_{S(p)\cap S(q)}q(x)\,\mu(dx)=P(S(p)\cap S(q))-Q(S(p)\cap S(q))\ge 0-1=-1$. Consequently, $\int_{\mathbb{R}^d}p(x)\log\frac{p(x)}{q(x)}\,\mu(dx)=\infty$. The proof is complete. □
Remark A1.
Note that formula (2) can give an infinite value of D ( P | | Q ) also when P ≪ Q . It is enough to take $p(x)=\frac{1}{\pi(1+x^{2})}$ and $q(x)=\frac{1}{\sqrt{2\pi}}\exp\{-\frac{x^{2}}{2}\}$, $x\in\mathbb{R}$.
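Indeed, for this pair of densities a short computation (a sketch; we use the standard fact that the Cauchy law has finite differential entropy) shows where the infinite value comes from:
$$D(P\|Q)=\int_{\mathbb{R}}p(x)\log p(x)\,dx+\frac{1}{2}\log(2\pi)+\frac{1}{2}\int_{\mathbb{R}}x^{2}p(x)\,dx=\infty,$$
since the first integral is finite, whereas the Cauchy density has no finite second moment, so the last integral diverges; at the same time P ≪ Q because both densities are strictly positive on the whole real line.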

References

1. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
2. Moulin, P.; Veeravalli, V.V. Statistical Inference for Engineers and Data Scientists; Cambridge University Press: Cambridge, UK, 2019.
3. Pardo, L. New developments in statistical information theory based on entropy and divergence measures. Entropy 2019, 21, 391.
4. Ji, S.; Zhang, Z.; Ying, S.; Wang, L.; Zhao, X.; Gao, Y. Kullback–Leibler divergence metric learning. IEEE Trans. Cybern. 2020, 1–12.
5. Noh, Y.K.; Sugiyama, M.; Liu, S.; du Plessis, M.C.; Park, F.C.; Lee, D.D. Bias reduction and metric learning for nearest-neighbor estimation of Kullback–Leibler divergence. Neural Comput. 2018, 30, 1930–1960.
6. Claici, S.; Yurochkin, M.; Ghosh, S.; Solomon, J. Model Fusion with Kullback–Leibler Divergence. In Proceedings of the 37th International Conference on Machine Learning, Online, 12–18 July 2020; Daumé, H., III, Singh, A., Eds.; PMLR: Brookline, MA, USA, 2020; Volume 119, pp. 2038–2047.
7. Póczos, B.; Xiong, L.; Schneider, J. Nonparametric Divergence Estimation with Applications to Machine Learning on Distributions. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, Barcelona, Spain, 14–17 July 2011; AUAI Press: Arlington, VA, USA, 2011; pp. 599–608.
8. Cui, S.; Luo, C. Feature-based non-parametric estimation of Kullback–Leibler divergence for SAR image change detection. Remote Sens. Lett. 2016, 11, 1102–1111.
9. Deledalle, C.-A. Estimation of Kullback–Leibler losses for noisy recovery problems within the exponential family. Electron. J. Stat. 2017, 11, 3141–3164.
10. Yu, X.-P.; Chen, S.-X.; Peng, M.-L. Application of partial least squares algorithm based on Kullback–Leibler divergence in intrusion detection. In Proceedings of the International Conference on Computer Science and Technology (CST2016), Shenzhen, China, 8–10 January 2016; Cai, N., Ed.; World Scientific: Singapore, 2017; pp. 256–263.
11. Li, J.; Cheng, K.; Wang, S.; Morstatter, F.; Trevino, R.P.; Tang, J.; Liu, H. Feature Selection: A Data Perspective. ACM Comput. Surv. 2017, 50, 1–45.
12. Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238.
13. Vergara, J.R.; Estévez, P.A. A review of feature selection methods based on mutual information. Neural Comput. Appl. 2014, 24, 175–186.
14. Granero-Belinchón, C.; Roux, S.G.; Garnier, N.B. Kullback–Leibler divergence measure of intermittency: Application to turbulence. Phys. Rev. E 2018, 97, 013107.
15. Charzyńska, A.; Gambin, A. Improvement of the k-NN entropy estimator with applications in systems biology. Entropy 2016, 18, 13.
16. Wang, M.; Jiang, J.; Yan, Z.; Alberts, I.; Ge, J.; Zhang, H.; Zuo, C.; Yu, J.; Rominger, A.; Shi, K.; et al. Individual brain metabolic connectome indicator based on Kullback–Leibler Divergence Similarity Estimation predicts progression from mild cognitive impairment to Alzheimer’s dementia. Eur. J. Nucl. Med. Mol. Imaging 2020, 47, 2753–2764.
17. Zhong, J.; Liu, R.; Chen, P. Identifying critical state of complex diseases by single-sample Kullback–Leibler divergence. BMC Genom. 2020, 21, 87.
18. Li, J.; Shang, P. Time irreversibility of financial time series based on higher moments and multiscale Kullback–Leibler divergence. Phys. A Stat. Mech. Appl. 2018, 502, 248–255.
19. Beraha, M.; Betelli, A.M.; Papini, M.; Tirinzoni, A.; Restelli, M. Feature selection via mutual information: New theoretical insights. arXiv 2019, arXiv:1907.07384v1.
20. Carrara, N.; Ernst, J. On the estimation of mutual information. Proceedings 2019, 33, 31.
21. Lord, W.M.; Sun, J.; Bollt, E.M. Geometric k-nearest neighbor estimation of entropy and mutual information. Chaos Interdiscip. J. Nonlinear Sci. 2018, 28, 033114.
22. Moon, K.R.; Sricharan, K.; Hero, A.O., III. Ensemble estimation of generalized mutual information with applications to Genomics. arXiv 2019, arXiv:1701.08083v2.
23. Suzuki, J. Estimation of Mutual Information; Springer: Singapore, 2021.
24. Sason, I.; Verdú, S. f-divergence inequalities. IEEE Trans. Inf. Theory 2016, 62, 5973–6006.
25. Moon, K.R.; Sricharan, K.; Greenewald, K.; Hero, A.O., III. Ensemble estimation of information divergence. Entropy 2018, 20, 560.
26. Rubenstein, P.K.; Bousquet, O.; Djolonga, J.; Riquelme, C.; Tolstikhin, I. Practical and Consistent Estimation of f-Divergences. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; Advances in Neural Information Processing Systems; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates Inc.: Red Hook, NY, USA, 2019; Volume 32, pp. 4070–4080.
27. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423.
28. Kozachenko, L.F.; Leonenko, N.N. Sample estimate of the entropy of a random vector. Probl. Inf. Transm. 1987, 23, 9–16.
29. Kraskov, A.; Stögbauer, H.; Grassberger, P. Estimating mutual information. Phys. Rev. E 2004, 69, 066138.
30. Leonenko, N.N.; Pronzato, L.; Savani, V. A class of Rényi information estimators for multidimensional densities. Ann. Stat. 2010, 36, 2153–2182.
31. Wang, Q.; Kulkarni, S.R.; Verdú, S. Divergence estimation for multidimensional densities via k-nearest-neighbor distances. IEEE Trans. Inf. Theory 2009, 55, 2392–2405.
32. Pál, D.; Póczos, B.; Szepesvári, C. Estimation of Rényi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs. In Proceedings of the 23rd International Conference on Neural Information Processing Systems (NIPS 2010), Vancouver, BC, Canada, 6–9 December 2010; Advances in Neural Information Processing Systems; Lafferty, J., Williams, C., Shawe-Taylor, J., Zemel, R., Culotta, A., Eds.; Curran Associates Inc.: Red Hook, NY, USA, 2010; Volume 23, pp. 1849–1857.
33. Shiryaev, A.N. Probability—1, 3rd ed.; Springer: New York, NY, USA, 2016.
34. Loève, M. Probability Theory, 4th ed.; Springer: New York, NY, USA, 1977.
35. Bulinski, A.; Dimitrov, D. Statistical estimation of the Shannon entropy. Acta Math. Sin. Ser. 2019, 35, 17–46.
36. Biau, G.; Devroye, L. Lectures on the Nearest Neighbor Method; Springer: Cham, Switzerland, 2015.
37. Bulinski, A.; Kozhevin, A. Statistical estimation of conditional Shannon entropy. ESAIM Probab. Stat. 2019, 23, 350–386.
38. Coelho, F.; Braga, A.P.; Verleysen, M. A mutual information estimator for continuous and discrete variables applied to feature selection and classification problems. Int. J. Comput. Intell. Syst. 2016, 9, 726–733.
39. Delattre, S.; Fournier, N. On the Kozachenko–Leonenko entropy estimator. J. Stat. Plan. Inference 2017, 185, 69–93.
40. Berrett, T.B.; Samworth, R.J. Efficient two-sample functional estimation and the super-oracle phenomenon. arXiv 2019, arXiv:1904.09347.
41. Penrose, M.D.; Yukich, J.E. Limit theory for point processes in manifolds. Ann. Appl. Probab. 2013, 6, 2160–2211.
42. Tsybakov, A.B.; Van der Meulen, E.C. Root-n consistent estimators of entropy for densities with unbounded support. Scand. J. Stat. 1996, 23, 75–83.
43. Singh, S.; Póczos, B. Analysis of k-nearest neighbor distances with application to entropy estimation. arXiv 2016, arXiv:1603.08578v2.
44. Ryu, J.J.; Ganguly, S.; Kim, Y.-H.; Noh, Y.-K.; Lee, D.D. Nearest neighbor density functional estimation from inverse Laplace transform. arXiv 2020, arXiv:1805.08342v3.
45. Gao, S.; Steeg, G.V.; Galstyan, A. Efficient Estimation of Mutual Information for Strongly Dependent Variables. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, San Diego, CA, USA, 9–12 May 2015; Lebanon, G., Vishwanathan, S.V.N., Eds.; PMLR: Brookline, MA, USA, 2015; Volume 38, pp. 277–286.
46. Berrett, T.B.; Samworth, R.J.; Yuan, M. Efficient multivariate entropy estimation via k-nearest neighbour distances. Ann. Stat. 2019, 47, 288–318.
47. Goria, M.N.; Leonenko, N.N.; Mergel, V.V.; Novi Inverardi, P.L. A new class of random vector entropy estimators and its applications in testing statistical hypotheses. J. Nonparametr. Stat. 2005, 17, 277–297.
48. Evans, D. A computationally efficient estimator for mutual information. Proc. R. Soc. A Math. Phys. Eng. Sci. 2008, 464, 1203–1215.
49. Yeh, J. Real Analysis: Theory of Measure and Integration, 3rd ed.; World Scientific: Singapore, 2014.
50. Evans, D.; Jones, A.J.; Schmidt, W.M. Asymptotic moments of near-neighbour distance distributions. Proc. R. Soc. A Math. Phys. Eng. Sci. 2002, 458, 2839–2849.
51. Bouguila, N.; Wentao, F. Mixture Models and Applications; Springer: Cham, Switzerland, 2020.
52. Borkar, V.S. Probability Theory. An Advanced Course; Springer: New York, NY, USA, 1995.
53. Kallenberg, O. Foundations of Modern Probability; Springer: New York, NY, USA, 1997.
54. Billingsley, P. Convergence of Probability Measures, 2nd ed.; Wiley & Sons: New York, NY, USA, 1999.
55. Alonso Ruiz, P.; Spodarev, E. Entropy-based inhomogeneity detection in fiber materials. Methodol. Comput. Appl. Probab. 2018, 20, 1223–1239.
56. Dresvyanskiy, D.; Karaseva, T.; Makogin, V.; Mitrofanov, S.; Redenbach, C.; Spodarev, E. Detecting anomalies in fibre systems using 3-dimensional image data. Stat. Comput. 2020, 30, 817–837.
57. Glaz, J.; Naus, J.; Wallenstein, S. Scan Statistics; Springer: New York, NY, USA, 2009.
58. Walther, G. Optimal and fast detection of spatial clusters with scan statistics. Ann. Stat. 2010, 38, 1010–1033.
59. Gnedenko, B.V.; Korolev, V.Yu. Random Summation: Limit Theorems and Applications; CRC Press: Boca Raton, FL, USA, 1996.