Next Article in Journal
Empirical Convergence Theory of Harmony Search Algorithm for Box-Constrained Discrete Optimization of Convex Function
Next Article in Special Issue
On Small Deviation Asymptotics in the L2-Norm for Certain Gaussian Processes
Previous Article in Journal
A Two-Stage Mono- and Multi-Objective Method for the Optimization of General UPS Parallel Manipulators
Previous Article in Special Issue
Asymptotically Exact Constants in Natural Convergence Rate Estimates in the Lindeberg Theorem
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Statistical Estimation of the Kullback–Leibler Divergence

by
Alexander Bulinski
1,* and
Denis Dimitrov
2
1
Steklov Mathematical Institute of Russian Academy of Sciences, 119991 Moscow, Russia
2
Faculty of Mechanics and Mathematics, Lomonosov Moscow State University, 119234 Moscow, Russia
*
Author to whom correspondence should be addressed.
Mathematics 2021, 9(5), 544; https://doi.org/10.3390/math9050544
Submission received: 25 January 2021 / Revised: 24 February 2021 / Accepted: 28 February 2021 / Published: 4 March 2021
(This article belongs to the Special Issue Analytical Methods and Convergence in Probability with Applications)

Abstract

:
Asymptotic unbiasedness and L 2 -consistency are established, under mild conditions, for the estimates of the Kullback–Leibler divergence between two probability measures in R d , absolutely continuous with respect to (w.r.t.) the Lebesgue measure. These estimates are based on certain k-nearest neighbor statistics for pair of independent identically distributed (i.i.d.) due vector samples. The novelty of results is also in treating mixture models. In particular, they cover mixtures of nondegenerate Gaussian measures. The mentioned asymptotic properties of related estimators for the Shannon entropy and cross-entropy are strengthened. Some applications are indicated.

1. Introduction

The Kullback–Leibler divergence introduced in [1] is used for quantification of similarity of two probability measures. It plays important role in various domains such as statistical inference (see, e.g., [2,3]), metric learning [4,5], machine learning [6,7], computer vision [8,9], network security [10], feature selection and classification [11,12,13], physics [14], biology [15], medicine [16,17], finance [18], among others. It is worth to emphasize that mutual information, widely used in many research directions (see, e.g., [19,20,21,22,23]), is a special case of the Kullback–Leibler divergence for certain measures. Moreover, the Kullback–Leibler divergence itself belongs to a class of f-divergence measures (with f ( t ) = log t ). For comparison of various f-divergence measures see, e.g., [24], their estimates are considered in [25,26].
Let P and Q be two probability measures on a measurable space ( S , B ) . The Kullback–Leibler divergence between P and Q is defined, according to [1], by way of
D ( P | | Q ) : = S log d P d Q d P if P Q , otherwise ,
where d P d Q stands for the Radon–Nikodym derivative. The integral in (1) can take values in [ 0 , ] . We employ the base e of logarithms since a constant factor is not essential here.
If ( S , B ) = ( R d , B ( R d ) ) , where d N , and (absolutely continuous) P and Q have densities, p ( x ) and q ( x ) , x R d , w.r.t. the Lebesgue measure μ , then (1) can be expressed as
D ( P | | Q ) = R d p ( x ) log p ( x ) q ( x ) d x ,
where we write d x instead of μ ( d x ) to simplify notation. One formally sets 0 / 0 : = 0 , a / 0 : = if a > 0 , log 0 : = , log ( ) : = and 0 log 0 : = 0 . Then log ( p ( x ) q ( x ) ) is a measurable function with values in [ , ] . So, the right-hand sides of (1) and (2) coincide. Formula (2) is justified by Lemma A1, see Appendix A.
Denote by S ( f ) : = { x R d : f ( x ) > 0 } the support of a (version of) probability density f. The integral in (2) is taken over S ( p ) and does not depend on the choice of p and q versions.
The following two functionals are closely related to the Kullback - Leibler divergence. For probability measures P and Q on ( R d , B ( R d ) ) having densities p ( x ) and q ( x ) , x R d , w.r.t. the Lebesgue measure μ , one can introduce, according to [27], p. 35, entropy H (also called the Shannon differential entropy) and cross-entropy C as follows
H ( P ) : = R d p ( x ) log p ( x ) d x , C ( P , Q ) : = R d p ( x ) log q ( x ) d x .
In view of (2), D ( P | | Q ) = C ( P , Q ) H ( P ) whenever the right-hand side is well defined.
Usually one constructs statistical estimates of some characteristics of a stochastic model under consideration relying on a collection of observations. In the pioneering paper [28] the estimator of the Shannon differential entropy was proposed, based on the nearest neighbor statistics. In a series of papers this estimate was studied and applied. Moreover, estimators of the Rényi entropy, mutual information and the Kullback–Leibler divergence have appeared (see, e.g., [29,30,31]). However, the authors of [32] indicated the occurrence of gaps in the known proofs concerning the limit behavior of such statistics. Almost all of these flaws refer to the lack of proved correctness of using the (reversed) Fatou lemma (see, e.g., [28], inequality after the statement (21), or [31], inequality (91)) or the generalized Helly–Bray lemma (see, e.g., [30], page 2171). One can find these lemmas in [33], p. 233, and [34], p. 187. Paper [32] has attracted our attention and motivated study of the declared asymptotic properties. Furthermore, we would like to highlight the important role of the papers [28,30,31,32]. Thus, in a recent work [35] the new functionals were introduced to prove asymptotic unbiasedness and L 2 -consistency of the Kozachenko–Leonenko estimators of the Shannon differential entropy. We used the criterion of uniform integrability, for different families of functions, to avoid employment of the Fatou lemma since it is not clear whether one could indicate the due majorizing functions for those families. The present paper is aimed at extension of our approach to grasp the Kullback–Leibler divergence estimation. Instead of the nearest neighbor statistics we employ the k-nearest neighbor statistics (on order statistics see, e.g., [36]) and also use more general forms of the mentioned functionals.
Note in passing that there exist investigations treating important aspects of the entropy, Kullback–Leibler divergence and mutual information estimation. The mixed models and conditional entropy estimation are studied, e.g., in [37,38]. The central limit theorem (CLT) for the Kozachenko–Leonenko estimates is established in [39]. In [40], deep analysis of efficiency of functional weighted estimates was performed (including CLT). The limit theorems for point processes on manifolds are employed in [41] to analyze behavior of the Shannon and the Rényi entropy estimates. The convergence rates for the Shannon entropy (truncated) estimates are obtained in [42] for one-dimensional case, see also [43] for multidimensional case. A kernel density plug-in estimator of the various divergence functionals is studied in [25]. The principal assumptions of that paper are the following: the densities are smooth and have common bounded support S, they are strictly lower bounded on S, moreover, the set S is smooth with respect to the employed kernel. Ensemble estimation of various divergence functionals is studied in [25]. Profound results for smooth bounded densities are established in recent work [44]. The mutual information estimation by the local Gaussian approximation is developed in [45]. Note that various deep results (including the central limit theorem) were obtained for the Kullback–Leibler estimates under certain conditions imposed on derivatives of unknown densities (see, e.g., the recent papers [25,46]). In a series of papers the authors demand boundedness of densities to prove L 2 -consistency for the Kozachenko–Leonenko estimates of differential Shannon entropy (see, e.g., [47]).
Our goal is to provide wide conditions for the asymptotic unbiasedness and L 2 -consistency of the specified Kullback–Leibler divergence estimates without such smoothness and boundedness hypotheses. Furthermore, we do not assume that densities have bounded supports. As a byproduct we obtain new results concerning Shannon differential entropy and cross-entropy.
We employ probabilistic and analytical techniques, namely, weak convergence of probability measures, conditional expectations, regular probability distributions, k-nearest neighbor statistics, probability inequalities, integration by parts in the Lebesgue–Stieltjes integral, analysis of integrals depending on certain parameters and taken over specified domains, criterion of the uniform integrability of various families of functions, slowly varying functions.
The paper is organized as follows. In Section 2, we introduce some notation. In Section 3 we formulate main results, i.e., Theorems 1 and 2. Their proofs are provided in Section 4 and Section 5, respectively. Section 6 contains concluding remarks and perspectives of future research. Proofs of several lemmas are given in Appendix A.

2. Notation

Let X and Y be random vectors taking values in R d and having distributions P X and P Y , respectively, (below we will take P = P X and Q = P Y ). Consider random vectors X 1 , X 2 , and Y 1 , Y 2 , with values in R d such that l a w ( X i ) = l a w ( X ) and l a w ( Y i ) = l a w ( Y ) , i N . Assume also that { X i , Y i , i N } are independent. We are interested in statistical estimation of D ( P X | | P Y ) constructed by means of observations X n : = { X 1 , , X n } and Y m : = { Y 1 , , Y m } , n , m N . All the random variables under consideration are defined on a probability space ( Ω , F , P ) , each measure space is assumed complete.
For a finite set E = { z 1 , , z N } R d , where z i z j ( i j ) , and a vector v R d , renumerate points of E as z ( 1 ) ( v ) , , z ( N ) ( v ) in such a way that v z ( 1 ) v z ( N ) , · is the Euclidean norm in R d . If there are points z i 1 , , z i s having the same distance from v then we numerate them according to the indexes i 1 , , i s increase. In other words, for k = 1 , , N , z ( k ) ( v ) is the k-nearest neighbor of v in a set E. To indicate that z ( k ) ( v ) is constructed by means of E we write z ( k ) ( v , E ) . Fix k { 1 , , n 1 } , l { 1 , , m } and (for each ω Ω ) put
R n , k ( i ) : = X i X ( k ) ( X i , X n \ { X i } ) , V m , l ( i ) : = X i Y ( l ) ( X i , Y m ) , i = 1 , , n .
We assume that X and Y have densities p = d P X d μ and q = d P Y d μ . Then with probability one all the points in X n are distinct as well as points of Y m .
Following [31] (see Formula (17) there) introduce an estimate of D ( P X | | P Y )
D ˜ n , m ( K n , L n ) : = 1 n i = 1 n ψ ( k i ) ψ ( l i ) + log m n 1 + d n i = 1 n log V m , l i ( i ) R n , k i ( i ) ,
where ψ ( t ) = d d t log Γ ( t ) = Γ ( t ) Γ ( t ) is the digamma function, t > 0 , K n : = { k i } i = 1 n , L n : = { l i } i = 1 n are collections of integers and, for some r N and all i N , k i r , l i r . Note that (3) is well-defined for n max i = 1 , , n k i + 1 , m max i = 1 , , n l i . If k i = k and l i = l , i = 1 , , n , then, for n k + 1 and m l , we write
D ^ n , m ( k , l ) : = ψ ( k ) ψ ( l ) + log m n 1 + d n i = 1 n log V m , l ( i ) R n , k ( i ) .
If k = l then
D ^ n , m ( k ) = log m n 1 + d n i = 1 n log V m , k ( i ) R n , k ( i )
and we come to formula (5) in [31]. For an intuitive background of the proposed estimates one can address [31] (Introduction, Parts B and C).
We write B ( x , r ) : = { y R d : x y r } for x R d , r > 0 , and V d = μ ( B ( 0 , 1 ) ) is the volume of the unit ball in R d . Similar to (3) with the same notation and the same conditions for k i and l i , i = 1 , , n , one can define the Kozachenko - Leonenko type estimates of H ( P X ) and C ( P X , P Y ) , respectively, by formulas
H ˜ n ( K n ) : = 1 n i = 1 n ψ ( k i ) + log V d + log ( n 1 ) + d n i = 1 n log R n , k i ( i ) ,
C ˜ n , m ( L n ) : = 1 n i = 1 n ψ ( l i ) + log V d + log m + d n i = 1 n log V m , l i ( i ) .
In [28], an estimate (5) was proposed for k i = 1 , i = 1 , , n . If k i = k , l i = l , i = 1 , , n , n k + 1 and m l , then one has
H ^ n ( k ) : = 1 n i = 1 n log V d R n , k d ( i ) ( n 1 ) e ψ ( k ) , C ^ n , m ( l ) : = 1 n i = 1 n log V d V m , l d ( i ) m e ψ ( l ) .
Remark 1.
All our results are valid for statistics (3). To simplify notation we consider estimates (4) since the study of D ˜ n , m ( K n , L n ) follows the same lines. For the same reason, as in the case of Kullback–Leibler divergence, we will only deal with (7) since (5) and (6) can be analyzed in quite the same way.
Some extra notation is necessary. As in [35], given a probability density f in R d , we consider the following functions of x R d , r > 0 and R > 0 , that is, define integral functionals (depending on parameters)
I f ( x , r ) : = B ( x , r ) f ( y ) d y r d V d ,
M f ( x , R ) : = sup r ( 0 , R ] I f ( x , r ) , m f ( x , R ) : = inf r ( 0 , R ] I f ( x , r ) .
Some properties of function B ( x , r ) f ( y ) d y are demonstrated in [48]. By virtue of Lemma 2.1 [35], for each probability density f, the function I f ( x , r ) introduced above is continuous in ( x , r ) on R d × ( 0 , ) . Hence on account of Theorem 15.84 [49] the functions m f ( · , R ) and M f ( · , R ) for any R > 0 have to be upper semicontinuous and lower semicontinuous, respectively. Therefore, Borel measurability of these nonnegative functions ensues from Proposition 15.82 [49]. On the other hand, the function m f ( x , · ) is evidently nonincreasing whereas M f ( x , · ) is nondecreasing for each x in R d . Notably, changing sup r ( 0 , R ] to sup r ( 0 , ) transforms the function M f ( x , R ) into the famous Hardy–Littlewood maximal function M f ( x ) well-known in Harmonic analysis.
Set e [ 1 ] : = 0 and e [ N ] : = exp { e [ N 1 ] } , N Z + . Introduce a function log [ 1 ] ( t ) : = log t , t > 0 . For N N , N > 1 , set log [ N ] ( t ) : = log ( log [ N 1 ] ( t ) ) . Evidently, this function (for each N N ) is defined if t > e [ N 2 ] . For N N , consider the continuous nondecreasing function G N : R + R + , given by formula
G N ( t ) : = 0 , t [ 0 , e [ N 1 ] ] , t log [ N ] ( t ) , t ( e [ N 1 ] , ) .
In other words we employ the function having the form t r ( t ) where a function r ( t ) , taken as N iterations of log t , is slowly varying for large t.
For probability densities p , q in R d , N N and positive constants ν , t , ε , R , introduce the functionals taking values in [ 0 , ]
K p , q ( ν , N , t ) : = x , y R d , x y > t G N | log x y | ν p ( x ) q ( y ) d x d y ,
Q p , q ( ε , R ) : = R d M q ε ( x , R ) p ( x ) d x ,
T p , q ( ε , R ) : = R d m q ε ( x , R ) p ( x ) d x .
Set K p , q ( ν , N ) : = K p , q ( ν , N , e [ N ] ) .
Remark 2.
We have stipulated that 1 / 0 : = (thus m q ε ( x , R ) : = whenever m q ( x , R ) = 0 ). One can write in (12), (13) the integrals over the support S ( p ) instead of integrating over R d , whatever the versions of p and q are taken.

3. Main Results

Theorem 1.
Let, for some positive ε , R and N N , the functionals K p , f ( 1 , N ) , Q p , f ( ε , R ) , T p , f ( ε , R ) be finite if f = p and f = q . Then D ( P X | | P Y ) < and
lim n , m E D ^ n , m ( k , l ) = D ( P X | | P Y ) .
Consider 3 kinds of conditions (labeled A,B,C, possibly with indices, and involving parameters indicated in parentheses) on probability densities.
( A ; p , f , ν ) For probability densities p , f in R d and some positive ν
L p , f ( ν ) : = R d × R d | log x y | ν p ( x ) f ( y ) d x d y < .
As usual, A g ( z ) Q ( d z ) = 0 whenever g ( z ) = (or ) for z A and Q ( A ) = 0 , where Q is a σ -finite measure on ( R d , B ( R d ) ) . Condition (15) with ν > 1 is used, e.g., in [28,31,47].
( B 1 ; f ) A version of f is upper bounded by a positive number M ( f ) ( 0 , ) :
f ( x ) M ( f ) , x R d .
( C 1 ; f ) A version of f is lower bounded by a positive number m ( f ) ( 0 , ) :
f ( x ) m ( f ) , x S ( f ) .
Corollary 1.
Let, for some ν > 1 , condition ( A ; p , f , ν ) be satisfied when f = p and f = q . Then the statements of Theorem 1 are true, provided that ( B 1 ; f ) and ( C 1 ; f ) are both valid for f = p and f = q . Moreover, if the latter assumption involving ( B 1 ; f ) and ( C 1 ; f ) holds then conditions of Theorem 1 are satisfied whenever p and q have bounded supports.
Next we formulate conditions to guarantee L 2 -consistency of estimates (4).
Theorem 2.
Let the requirement K p , f ( 1 , N ) < in conditions of Theorem 1 be replaced by K p , f ( 2 , N ) < , given f = p and f = q . Then D ( P X | | P Y ) < and, for any fixed k , l N , the estimates D ^ n , m ( k , l ) are L 2 -consistent, i.e.,
lim n , m E D ^ n , m ( k , l ) D ( P X | | P Y ) 2 = 0 .
Corollary 2.
For some ν > 2 , let condition ( A ; p , f , ν ) be satisfied if f = p and f = q . Assume that ( B 1 ; f ) and ( C 1 ; f ) are both valid for f = p and f = q . Then the statements of Theorem 2 are true. Moreover, if the latter assumption involving ( B 1 ; f ) and ( C 1 ; f ) holds then conditions of Theorem 2 are satisfied whenever p and q have bounded supports.
Currently we dwell on a modification of condition ( C 1 ; f ) introduced in [35] that allows us to work with densities that need not have bounded supports.
( C 2 ; f ) There exist a version of density f and R > 0 such that, for some c > 0 ,
m f ( x , R ) c f ( x ) , x R d .
Remark 3.
If, for some positive ε, R and c, condition ( C 2 ; q ) is true and
R d q ( x ) ε p ( x ) d x < ,
then T p , q ( ε , R ) is finite. Hence we could apply, for f = p and f = q in Theorems 1 and 2, condition ( C 2 ; f ) and presume, for some ε > 0 , validity of (17) and finiteness of R d p 1 ε ( x ) d x instead of the corresponding assumptions T p , q ( ε , R ) < and T p , p ( ε , R ) < . An illustrative example to this point is provided with a density having unbounded support.
Corollary 3.
Let X, Y be Gaussian random vectors in R d with E X = μ X , E Y = μ Y and nondegenerate covariance matrices Σ X and Σ Y , respectively. Then relations (14) and (16) hold where
D ( P X | | P Y ) = 1 2 tr Σ Y 1 Σ X + μ Y μ X T Σ Y 1 μ Y μ X d + log det Σ Y det Σ X .
The latter formula can be found, e.g., in [2], p. 147, example 6.3. The proof of Corollary 3 is discussed in Appendix A.
Similarly to condition ( C 2 ; f ) let us consider the following one.
( B 2 ; f ) There exist a version of density f and R > 0 such that, for some C > 0 ,
M f ( x , R ) C f ( x ) , x S ( f ) .
Remark 4.
If, for some positive ε, R and c, condition ( B 2 ; q ) is true and
R d q ( x ) ε p ( x ) d x < ,
then obviously Q p , q ( ε , R ) < . Thus, in Theorems 1 and 2 one can employ, for f = p and f = q , condition ( B 2 ; f ) and exploit, for some ε > 0 , the validity of (18) and finiteness of R d p 1 + ε ( x ) d x instead of the assumptions Q p , q ( ε , R ) < and Q p , p ( ε , R ) < , respectively.
Remark 5.
D.Evans applied “positive density condition” in Definition 2.1 of [48] assuming the existence of constants β > 1 and δ > 0 such that r d β B ( x , r ) q ( y ) d y β r d for all 0 r δ and x R d . Consequently m q ( x , δ ) 1 β V d : = m > 0 , x R d . Then T p , q ( ε , δ ) m ε R d p ( x ) d x = m ε < for all ε > 0 . Analogously, M q ( x , δ ) β V d : = M , M > 0 , x R d , and Q p , q ( ε , δ ) M ε R d p ( x ) d x = M ε < for all ε > 0 . The above mentioned inequalities from Definition 2.1 of [48] are valid, provided that density f is smooth and its support in R d is a convex closed body, see proof in [50]. Therefore, if p and q are smooth and their supports are compact convex bodies in R d , the relations (14) and (16) are valid.
Moreover, as a byproduct of Theorems 1 and 2, we obtain the new results indicating both the asymptotic unbiasedness and L 2 -consistency of the estimates (7) for the Shannon differential entropy and cross-entropy.
Theorem 3.
Let Q p , q ( ε , R ) < and T p , q ( ε , R ) < for some positive ε and R. Then C ( P X , P Y ) is finite and the following statements hold for any fixed l N .
(1) 
If, for some N N , K p , q ( 1 , N ) < , then E C ^ n , m ( l ) C ( P X , P Y ) , n , m .
(2) 
If, for some N N , K p , q ( 2 , N ) < , then E ( C ^ n , m ( l ) C ( P X , P Y ) ) 2 0 , n , m .
In particular, one can employ L p , q ( ν ) with ν > 1 instead of K p , q ( 1 , N ) , and with ν > 2 instead of K p , q ( 2 , N ) , where N N .
The first claim of this Theorem follows from the proof of Theorem 1. In a similar way one can infer the second statement from the proof of Theorem 2. If we take q = p in conditions of Theorem 3 then we get the statement concerning the entropy since C ( P X , P X ) = H ( P X ) .
Now we consider the case when p and q are mixtures of some probability densities. Namely,
p ( x ) : = i = 1 I a i p i ( x ) , q ( x ) : = j = 1 J b j q j ( x ) ,
where p i ( x ) , q j ( x ) are probability densities (w.r.t. the Lebesgue measure μ ), positive weights a i , b j are such that i = 1 I a i = 1 , j = 1 J b j = 1 , i = 1 , , I , j = 1 , , J , x R d . Some applications of models described by mixtures are treated, e.g., in [51].
Corollary 4.
Let random vectors X and Y have densities of the form (19). Assume that, for some positive ε, R and N N , the functionals K f , g ( 1 , N ) , Q f , g ( ε , R ) , T f , g ( ε , R ) are finite, whenever f { p 1 , , p I } and g { p 1 , , p I , q 1 , , q J } . Then D ( P X | | P Y ) < and, for any fixed k , l N , (14) holds. Moreover, if the requirement K f , g ( 1 , N ) < is replaced by K f , g ( 2 , N ) < then (16) is true.
The proof of this Corollary is given in Appendix A. Thus, due to Corollaries 3 and 4 one can guarantee the validity of (14) and (16) for any mixtures of nondegenerate Gaussian densities. Note also that in a similar way we can claim the asymptotic unbaisedness and L 2 -consistency of estimates (7) for mixtures satisfying conditions of Corollary 4.
Remark 6.
Let us compare our new results with those established in [35]. Developing the approach of [35] to analysis of asymptotic behavior of the Kozachenko–Leonenko estimates of the Shannon differential entropy we encounter new complications due to dealing with k-nearest neighbor statistics for k N (not only for k = 1 ). Accordingly, in the framework of the Kullback–Leibler divergence estimation, we propose a new way to bound the function 1 F m , l , x ( u ) playing the key role in the proofs (see Formula (28)). Furthermore, instead of the function G ( t ) = t log t (for t > 1 ), used in [35] for the Shannon entropy estimates, we employ a regularly varying function G N ( t ) = t log [ N ] ( t ) where (for t large enough) log [ N ] ( t ) is the N-fold iteration of the logarithmic function and N N can be large. Whence in the definition of integral functional K p , q ( ν , N , t ) by formula (11) one can take a function G N ( z ) having, for z > 0 , the growth rate close to that of function z. Moreover, this entails a generalization of paper [35] results. Now we invoke convexity of G N (see Lemma 6) to provide more general conditions for asymptotic unbiasedness and L 2 -consistency of the Shannon differential entropy as opposed to [35].

4. Proof of Theorem 1

Note that the general structure of this proof, as well as that of Theorem 2, is similar to the one originally proposed in [28] and later used in various papers (see, e.g., [30,31,47]). Nevertheless in order to prove both theorems correctly we employ new ideas and conditions (such as uniform integrability of a family of random variables) in our reasoning.
Remark 7.
In the proof, for certain random variables α , α 1 , α 2 , (depending on some parameters), we will demonstrate that E α n E α , as n (and that all these expectations are finite). To this end, for a fixed R d -valued random vector τ and each x A , where A is a specified subset of R d , we will prove that
E ( α n | τ = x ) E ( α | τ = x ) , n .
It turns out that E ( α n | τ = x ) = E ( α n , x ) and E ( α | τ = x ) = E α x , where the auxiliary random variables α n , x and α x can be constructed explicitly for each x R d . Moreover, it is possible to show that, for each x A , one has α n , x l a w α x , n . Thus, to prove (20) the Fatou lemma is not used, it is not evident whether there exists a random variable majorizing those under consideration. Instead we verify, for each x A , the uniform integrability (w.r.t. measure P ) of a family ( α n , x ) n n 0 ( x ) . Here we employ the necessary and sufficient conditions of uniform integrability provided by de la Vallée–Poussin theorem (see, e.g., Theorem 1.3.4 in [52]). After that, to prove the desired relation E α n E α , n , we have a new task. Namely, we check the uniform integrability of a family ( E ( α n | τ = x ) ) n k 0 , where x A , w.r.t. the measure P τ , i.e., the law of τ, and k 0 does not depend of x. Then we can prove that
E α n = A E ( α n | τ = x ) P τ ( d x ) A E ( α | τ = x ) P τ ( d x ) = E α , n .
Further we will explain a number of nontrivial details concerning the proofs of uniform integrability of various families, the choice of the mentioned random variables (vectors), the set A, n 0 ( x ) and k 0 .
The first auxiliary result explains why without loss of generality (w.l.g.) we can consider the same parameters ε , R , N for different functionals in conditions of Theorems 1 and 2.
Lemma 1.
Let p and q be any probability densities in R d . Then the following statements are valid.
(1) 
If K p , q ( ν 0 , N 0 ) < for some ν 0 > 0 and N 0 N then K p , q ( ν , N ) < for any ν ( 0 , ν 0 ] and each N N 0 .
(2) 
If Q p , q ( ε 1 , R 1 ) < for some ε 1 > 0 and R 1 > 0 then Q p , q ( ε , R ) < for any ε ( 0 , ε 1 ] and each R > 0 .
(3) 
If T p , q ( ε 2 , R 2 ) < for some ε 2 > 0 and R 2 > 0 then T p , q ( ε , R ) < for any ε ( 0 , ε 2 ] and each R > 0 .
In particular one can take q = p and the statements of Lemma 1 still remain valid. The proof of Lemma 1 is given in Appendix A.
Remark 8.
The results of Lemma 1 allow us to ensure (14) by demanding the finiteness of the functionals K p , q ( 1 , N 1 ) , Q p , q ( ε 1 , R 1 ) , T p , q ( ε 2 , R 2 ) , K p , p ( 1 , N 2 ) , Q p , p ( ε 3 , R 3 ) , T p , p ( ε 4 , R 4 ) , for some ε i > 0 , R j > 0 and N j N , where i = 1 , 2 , 3 , 4 and j = 1 , 2 . Moreover, if we assume the finiteness of K p , q ( 2 , N 3 ) and K p , p ( 2 , N 4 ) , for some N 3 N , N 4 N , instead of the finiteness of K p , q ( 2 , N 1 ) and K p , p ( 2 , N 2 ) then (16) holds.
According to Remark 2.4 of [35] if, for some positive ε , R , the integrals Q p , q ( ε , R ) , T p , q ( ε , R ) , Q p , p ( ε , R ) , T p , p ( ε , R ) are finite then
R d p ( x ) | log q ( x ) | d x < , R d p ( x ) | log p ( x ) | d x < .
Therefore D ( p | | q ) < (and thus P X P Y in view of Lemma A1).
For n N such that n > 1 , for fixed k N and m N , where 1 k n 1 , 1 l m and i = 1 , , n , set ϕ m , l ( i ) : = m V m , l d ( i ) , ζ n , k ( i ) : = ( n 1 ) R n , k d ( i ) . Then we can rewrite the estimate D ^ n , m ( k , l ) as follows:
D ^ n , m ( k , l ) = ψ ( k ) ψ ( l ) + 1 n i = 1 n log ϕ m , l ( i ) log ζ n , k ( i ) .
It is sufficient to prove the following two assertions.
Statement 1.
For each fixedl, allmlarge enough and any i N , E | log ϕ m , l ( i ) | is finite.
Moreover,
E 1 n i = 1 n log ϕ m , l ( i ) = E log ϕ m , l ( 1 ) ψ ( l ) log V d R d p ( x ) log q ( x ) d x , m .
Statement 2.
For each fixed k, all nlarge enough and any i N , E | log ζ n , k ( i ) | is finite.
Moreover,
E 1 n i = 1 n log ζ n , k ( i ) = E log ζ n , k ( 1 ) ψ ( k ) log V d R d p ( x ) log p ( x ) d x , n .
Then in view of (2) and (21)–(24)
E D ^ n , m ( k , l ) R d p ( x ) log q ( x ) d x + R d p ( x ) log p ( x ) d x = D ( P X | | P Y ) , n , m ,
and Theorem 1 will be proved.
Recall that, as explained in [35], for a nonnegative random variable V (thus 0 E V ) and any random R d -valued vector, one has
E V = R d E ( V | X = x ) P X ( d x ) .
This signifies that both sides of (25) coincide, being finite or infinite simultaneously. Let F ( u , ω ) be a regular conditional distribution function of a nonnegative random variable U given X where u R and ω Ω . Let h be a measurable function such that h : R [ 0 , ) . It was also explained in [35] that, for P X -almost all x R d , it follows (without assuming E h ( U ) < )
E ( h ( U ) | X = x ) = [ 0 , ) h ( u ) d F ( u , x ) .
This means that both sides of (26) are finite or infinite simultaneously and coincide.
By virtue of (25) and (26) one can establish that E | log ϕ m , l ( i ) | < , for all m large enough, fixed l and for all i, and that (23) holds. To perform this take U = ϕ m , l ( i ) , X = X i , h ( u ) = | log u | , u > 0 (we use h ( u ) = log 2 u in the proof of Theorem 2) and V = h ( U ) . If h : R R and E | h ( U ) | < then (26) is true as well. To avoid increasing the volume of this paper we will only examine the evaluation of E log ϕ m , l ( i ) as all the steps of the proof will be the same when treating E | log ϕ m , l ( i ) | .
The proof of Statement 1 is partitioned into 4 steps. The first three demonstrate that there is a measurable A S ( p ) , depending on p and q versions, such that P X ( S ( p ) \ A ) = 0 and, for any x A , i N , the following relation holds:
E ( log ϕ m , l ( i ) | X i = x ) = E ( log ϕ m , l ( 1 ) | X 1 = x ) ψ ( l ) log V d log q ( x ) , m .
The last Step 4 justifies the desired result (23). Finally Step 5 validates Statement 2.
Step 1. Here we establish the distribution convergence for the auxiliary random variables. Fix any i N and l { 1 , , m } . To simplify notation we do not indicate the dependence of functions on d. For x R d and u 0 , we identify the asymptotic behavior (as m ) of the function
F m , l , x i ( u ) : = P ϕ m , l ( i ) u | X i = x = P m V m , l d ( i ) u | X i = x = 1 P V m , l ( i ) > u m 1 d | X i = x = 1 P x Y ( l ) ( x , Y m ) > u m 1 d = 1 s = 0 l 1 m s W m , x ( u ) s 1 W m , x ( u ) m s : = P ξ m , l , x u ,
where
W m , x ( u ) : = B ( x , r m ( u ) ) q ( z ) d z , r m ( u ) : = u m 1 d , ξ m , l , x : = m x Y ( l ) ( x , Y m ) d .
We take into account in (28) that random vectors Y 1 , , Y m , X i are independent and condition that Y 1 , , Y m have the same law as Y. We also noted that an event x Y ( l ) ( x , Y m ) > r m ( u ) is a union of pair-wise disjoint events A s , s = 0 , , l 1 . Here A s means that exactly s observations among Y m belong to the ball B ( x , r m ( u ) ) and other m s are outside this ball (probability that Y belongs to the sphere { z R d : z x = r } equals 0 since Y has a density w.r.t. the Lebesgue measure μ ). Formulas (28) and (29) show that F m , l , x i ( u ) is the regular conditional distribution function of ϕ m , l ( i ) given X i = x . Moreover, (28) means that ϕ m , l ( i ) , i { 1 , , n } are identically distributed and we may omit the dependence on i. So, one can write F m , l , x ( u ) instead of F m , l , x i ( u ) .
According to the Lebesgue differentiation theorem (see, e.g., [49], p. 654) if q L 1 ( R d ) , for μ -almost all x R d , one has
lim r 0 + 1 μ ( B ( x , r ) ) B ( x , r ) | q ( z ) q ( x ) | d z = 0 .
Let Λ ( q ) denote the set of Lebesgue points of a function q, namely the points in R d satisfying (30). Evidently it depends on the choice of version within the class of functions in L 1 ( R d ) equivalent to q, and, for an arbitrary version of q, we have μ ( R d \ Λ ( q ) ) = 0 .
Clearly, for each u 0 , r m ( u ) 0 as m , and μ ( B ( x , r m ( u ) ) ) = V d r m ( u ) d = V d u m . Therefore by virtue of (30), for any fixed x Λ ( q ) and u 0 ,
W m , x ( u ) = V d u m q ( x ) + α m ( x , u ) ,
where α m ( x , u ) 0 , m . Hence, for x Λ ( q ) S ( q ) (thus q ( x ) > 0 ), due to (28)
F m , l , x ( u ) 1 s = 0 l 1 ( V d q ( x ) u ) s s ! e V d q ( x ) u : = F l , x ( u ) , m .
Relation (31) means that
ξ m , l , x l a w ξ l , x , x Λ ( q ) S ( q ) , m ,
where ξ l , x has the Gamma distribution Γ ( α , λ ) with parameters α = V d q ( x ) and λ = l .
For any x S ( q ) , one can assume w.l.g. that the random variables ξ l , x and { ξ m , l , x } m l are defined on a probability space ( Ω , F , P ) . Indeed, by the Lomnicki–Ulam theorem (see, e.g., [53], p. 93) the independent copies of Y 1 , Y 2 , and { ξ l , x } x S ( q ) exist on a certain probability space. The convergence in distribution of random variables survives under continuous mapping. Thus, for any x Λ ( q ) S ( q ) , we see that
log ξ m , l , x l a w log ξ l , x , m .
We have employed that ξ l , x > 0 a.s. for each x Λ ( q ) S ( q ) and Y has a density, so it follows that P ( ξ m , l , x > 0 ) = P ( x Y ( l ) ( x , Y m ) > 0 ) = 1 . More precisely, we take strictly positive versions of ξ l , x and ξ m , l , x for each x Λ ( q ) S ( q ) .
Step 2. Now we show that, instead of (27) validity, one can verify the following assertion. For μ-almost every x Λ ( q ) S ( q )
E log ξ m , l , x E log ξ l , x , m .
Note that if η Γ ( α , λ ) , where α > 0 and λ > 0 , then E log η = ψ ( λ ) log α , where ψ is a digamma function. Set α = V d q ( x ) for x S ( q ) (then α > 0 ) and λ = l . Hence E log ξ l , x = ψ ( l ) log ( V d q ( x ) ) = ψ ( l ) log V d log q ( x ) . By virtue of (26), for each x R d ,
E log ξ m , l , x = ( 0 , ) log u d F m , l , x ( u ) = ( 0 , ) log u d P ( ϕ m , l ( 1 ) u | X 1 = x ) = E ( log ϕ m , l ( 1 ) | X 1 = x ) .
Hence, for x Λ ( q ) S ( q ) , the relation E ( log ϕ m , l ( 1 ) | X 1 = x ) ψ ( l ) log V d log q ( x ) holds if and only if (33) is true.
According to Theorem 3.5 [54] we would have established (33) if relation (32) could be supplemented, for μ -almost all x Λ ( q ) S ( q ) , by the condition of uniform integrability of a family { log ξ m , l , x } m m 0 ( x ) . Note that, for each N N , a function G N ( t ) introduced by (10) is nondecreasing on ( 0 , ) and G N ( t ) t , as t . By the de la Vallée–Poussin theorem (see, e.g., Theorem 1.3.4 [52]), to ensure, for μ -almost every x Λ ( q ) S ( q ) , the uniform integrability of { log ξ m , l , x } m m 0 ( x ) , it suffices to prove the following statement. For the indicated x, a positive C 0 ( x ) and m 0 ( x ) N , one has
sup m m 0 ( x ) E G N ( | log ξ m , l , x | ) C 0 ( x ) < ,
where G N appears in conditions of Theorem 1. Moreover, it is possible to find m 0 N that does not depend on x R d as we will show further.
Step 3. This step is devoted to proving validity of (34). It is convenient to divide this step into its own parts (3a), (3b), etc. For any N N , set
g N ( t ) = 1 t log [ N ] ( log t ) + 1 j = 1 N 1 log [ j ] ( log t ) , t 0 , 1 e [ N ] , 0 , t 1 e [ N ] , e [ N ] , 1 t log [ N ] ( log t ) + 1 j = 1 N 1 log [ j ] ( log t ) , t e [ N ] , ,
where the product over empty set (when N = 1 ) is equal to 1.
The proof of the following result is placed at Appendix A.
Lemma 2.
Let F ( u ) , u R , be a distribution function such that F ( 0 ) = 0 . Then, for each N N , one has
( 1 ) 0 , 1 e [ N ] G N ( | log u | ) d F ( u ) = 0 , 1 e [ N ] F ( u ) ( g N ( u ) ) d u ,
( 2 ) e [ N ] , G N ( | log u | ) d F ( u ) = e [ N ] , ( 1 F ( u ) ) g N ( u ) d u .
Fix N appearing in conditions of Theorem 1. Observe that, for u 1 e [ N ] , e [ N ] , one has G N ( | log u | ) = 0 . Therefore, according to Lemma 2, for x Λ ( q ) S ( q ) and m l , we get E G N ( | log ξ m , l , x | ) : = I 1 ( m , x ) + I 2 ( m , x ) where
I 1 ( m , x ) : = 0 , 1 e [ N ] F m , l , x ( u ) ( g N ( u ) ) d u , I 2 ( m , x ) : = ( e [ N ] , ) ( 1 F m , l , x ( u ) ) g N ( u ) d u .
For convenience sake we write I 1 ( m , x ) and I 2 ( m , x ) without indicating their dependence on N , l and d (these parameters are fixed).
Part (3a). We provide bounds for I 1 ( m , x ) . Take R > 0 appearing in conditions of Theorem 1 and any u 0 , 1 e [ N ] . Introduce m 1 : = max 1 e [ N ] R d , l , where, for a R , a : = inf { m Z : m a } . Then r m ( u ) = u m 1 / d 1 e [ N ] m 1 / d R if m m 1 . Note also that we can consider only m l everywhere below, because the size of sample Y m is not less than the number of neighbors l (see, e.g., (28)). Thus, for R > 0 , u 0 , 1 e [ N ] , x R d and m m 1 ,
W m , x ( u ) μ ( B ( x , r m ( u ) ) ) = B ( x , r m ( u ) ) q ( y ) d y r m d ( u ) V d sup r ( 0 , R ] B ( x , r ) q ( y ) d y r d V d = M q ( x , R ) ,
and we arrive at the inequality
W m , x ( u ) M q ( x , R ) μ ( B ( x , r m ( u ) ) ) = M q ( x , R ) V d u m .
If γ ( 0 , 1 ] and t [ 0 , 1 ] then, for all m 1 , invoking the Bernoulli inequality, one has
1 ( 1 t ) m ( m t ) γ .
Recall that we assume Q p , q ( ε , R ) < for some ε > 0 , R > 0 . By virtue of Lemma 1 one can take ε < 1 . So, due to (36) and since W m , x ( u ) [ 0 , 1 ] for all x R d , u > 0 and m l , we get
1 ( 1 W m , x ( u ) ) m ( m W m , x ( u ) ) ε .
Thus in view of (28), (35) and (37) we have established that, for all x Λ ( q ) S ( q ) , u ( 0 , 1 e [ N ] ] and m m 1 ,
F m , l , x ( u ) = 1 s = 0 l 1 m s W m , x ( u ) s 1 W m , x ( u ) m s 1 ( 1 W m , x ( u ) ) m m M q ( x , R ) V d u m ε = ( M q ( x , R ) ) ε V d ε u ε .
Therefore, for any x Λ ( q ) S ( q ) and m m 1 , one can write
I 1 ( m , x ) ( M q ( x , R ) ) ε V d ε 0 , 1 e [ N ] u ε ( g N ( u ) ) d u ( M q ( x , R ) ) ε V d ε 0 , 1 e [ N ] log [ N ] ( log u ) + 1 u 1 ε d u = U 1 ( ε , N , d ) ( M q ( x , R ) ) ε ,
where U 1 ( ε , N , d ) : = V d ε L N ( ε ) , L N ( ε ) : = [ e [ N ] , ) ( log [ N ] ( t ) + 1 ) e ε t d t < . We took into account that ( g N ( u ) ) 1 u ( log [ N ] ( log u ) + 1 ) whenever u 0 , 1 e [ N ] .
Part (3b).We give bounds for I 2 ( m , x ) . Since g N ( u ) log [ N + 1 ] ( u ) + 1 u if u ( e [ N ] , ) , we can write, for m max { e [ N ] 2 , l } ,
I 2 ( m , x ) ( e [ N ] , m ] ( 1 F m , l , x ( u ) ) log [ N + 1 ] ( u ) + 1 u d u + ( m , m 2 ] ( 1 F m , l , x ( u ) ) log [ N + 1 ] ( u ) + 1 u d u + m 2 , ( 1 F m , l , x ( u ) ) g N ( u ) d u : = J 1 ( m , x ) + J 2 ( m , x ) + J 3 ( m , x ) .
Evidently,
1 F m , l , x ( u ) = r = m l + 1 m m r P m , x ( u ) r 1 P m , x ( u ) m r = P ( Z m l + 1 ) ,
where P m , x ( u ) = 1 W m , x ( u ) and Z Bin ( m , P m , x ( u ) ) .
By Markov’s inequality P ( Z t ) e λ t E e λ Z for any λ > 0 and t > 0 . One has
E e λ Z = j = 0 m e λ j m j P m , x ( u ) j 1 P m , x ( u ) m j = j = 0 m m j P m , x ( u ) e λ j 1 P m , x ( u ) m j = 1 P m , x ( u ) + e λ P m , x ( u ) m .
Consequently, for each λ > 0 ,
1 F m , l , x ( u ) e λ ( m l + 1 ) 1 P m , x ( u ) + e λ P m , x ( u ) m = e λ ( m l + 1 ) W m , x ( u ) + e λ ( 1 W m , x ( u ) ) m = e λ ( l 1 ) 1 1 1 e λ W m , x ( u ) m .
To simplify bounds we take λ = 1 and set S 1 = S 1 ( l ) : = e l 1 , S 2 : = 1 1 e (recall that l is fixed). Thus, S 1 1 and S 2 < 1 . Therefore,
1 F m , l , x ( u ) S 1 1 S 2 W m , x ( u ) m S 1 exp S 2 m W m , x ( u ) ,
where we have used simple inequality 1 t e t , t [ 0 , 1 ] .
For R > 0 appearing in conditions of the Theorem and any u e [ N ] , m , one can choose m 2 : = max 1 R 2 d , e [ N ] 2 , l such that if m m 2 then r m ( u ) = u m 1 / d 1 m 1 / d R . Due to (29) and (41), for u ( e [ N ] , m ] and m m 2 , one has
1 F m , l , x ( u ) S 1 exp S 2 m V d u m W m , x ( u ) V d u m = S 1 exp S 2 V d u B ( x , r m ( u ) ) q ( z ) d z μ ( B ( x , r m ( u ) ) ) S 1 exp S 2 V d u m q ( x , R ) ,
by definition of m f (for f = q ) in (9). Now we use the following Lemma 3.2 of [35].
Lemma 3.
For a version of a density q and each R > 0 , one has μ ( S ( q ) \ D q ( R ) ) = 0 where D q ( R ) : = { x S ( q ) : m q ( x , R ) > 0 } and m q ( · , R ) is defined according to (9).
It is easily seen that, for any t > 0 and each δ ( 0 , e ] , one has e t t δ . Thus, for x D q ( R ) , m m 2 , u ( e [ N ] , m ] and ε > 0 , we deduce from conditions of the Theorem (in view of Lemma 1 one can suppose that ε ( 0 , e ] ) that
1 F m , l , x ( u ) S 1 S 2 V d u m q ( x , R ) ε .
We also took into account that m q ( x , R ) > 0 for x D q ( R ) and applied relation (42). Thus, for all x Λ ( q ) S ( q ) D q ( R ) and any m m 2 ,
J 1 ( m , x ) S 1 ( S 2 V d ) ε ( m q ( x , R ) ) ε ( e [ N ] , ) log [ N + 1 ] ( u ) + 1 u 1 + ε d u = U 2 ( ε , N , d , l ) ( m q ( x , R ) ) ε ,
where U 2 ( ε , N , d , l ) : = S 1 ( l ) L N ( ε ) ( S 2 V d ) ε .
Part (3c). We provide the bound for J 2 ( m , x ) . For all x Λ ( q ) S ( q ) D q ( R ) and any m m 2 , in view of (43), it holds 1 F m , l , x ( m ) S 1 S 2 V d m q ( x , R ) m ε . Hence (as m 2 2 )
J 2 ( m , x ) m , m 2 ( 1 F m , l , x ( u ) ) log [ N + 1 ] ( u ) + 1 u d u 1 F m , l , x ( m ) m , m 2 log [ N + 1 ] ( u ) + 1 d log u S 1 ( S 2 V d ) ε m q ( x , R ) ε m ε 2 log [ N ] ( 2 log m ) + 1 3 2 log m .
Then, for all x Λ ( q ) S ( q ) D q ( R ) and any m m 2 ,
J 2 ( m , x ) U 3 ( m , ε , N , d , l ) m q ( x , R ) ε ,
where U 3 ( m , ε , N , d , l ) : = 3 2 S 1 ( l ) ( S 2 V d ) ε m ε 2 log m log [ N ] ( 2 log m ) + 1 0 , m .
Part (3d).To indicate bounds for J 3 ( m , x ) we employ several auxiliary results.
Lemma 4.
For each N N and any ν > 0 , there are a : = a ( d , ν ) 0 , b : = b ( N , d , ν ) 0 such that, for arbitrary x , y R d ,
G N | log x y d | ν a G N | log x y | ν + b .
The proof is given in Appendix A.
On the one hand, by (29), for any w 0 , we get
W m , x ( m w ) = B ( x , w 1 / d ) q ( z ) d z = W 1 , x ( w ) .
On the other hand, by (28), one has F 1 , 1 , x ( w ) = 1 1 W 1 , x ( w ) = W 1 , x ( w ) . Consequently, for any m N , w 0 and all x R d ,
W m , x ( m w ) = F 1 , 1 , x ( w ) .
Moreover, F 1 , 1 , x ( w ) = P ( Y x d w ) . So, ξ 1 , 1 , x = l a w Y x d . Thus, due to Lemmas 2 and 4 (for ν = 1 )
e [ N ] , ( 1 F 1 , 1 , x ( w ) ) g N ( w ) d w = e [ N ] , G N ( log w ) d F 1 , 1 , x ( w ) = E G N log ξ 1 , 1 , x I ξ 1 , 1 , x > e [ N ] = E [ G N ( log Y x d ) I { Y x d > e [ N ] } ] = y R d , x y > e [ N ] 1 / d G N ( log x y d ) q ( y ) d y a ( d , 1 ) y R d , x y > e [ N ] 1 / d G N ( | log x y | ) q ( y ) d y + b ( N , d , 1 ) = a ( d , 1 ) y R d , x y > e [ N ] G N ( log x y ) q ( y ) d y + b ( N , d , 1 ) ,
since G N ( t ) = 0 for t [ 0 , e [ N 1 ] ] , N N .
Now we will estimate 1 F m , l , x ( u ) in a way different from (40). Fix any δ > 0 . Note that, for all m ( l 1 ) 1 + 1 δ and s { 0 , , l 1 } , it holds m m s m m l + 1 1 + δ . Then, for all x R d , u 0 and m max { l , ( l 1 ) 1 + 1 δ } , in view of (28) one can write
1 F m , l , x ( u ) = 1 W m , x ( u ) s = 0 l 1 m 1 s m m s W m , x ( u ) s 1 W m , x ( u ) ( m 1 ) s ( 1 + δ ) 1 W m , x ( u ) s = 0 l 1 m 1 s W m , x ( u ) s 1 W m , x ( u ) ( m 1 ) s
( 1 + δ ) 1 W m , x ( u ) .
We are going to employ the following statement as well.
Lemma 5.
For each N N , a function log [ N ] ( t ) , t > e [ N 1 ] , is slowly varying at infinity.
The proof is elementary and thus is omitted.
Part (3e). Now we are ready to get the bound for J 3 ( m , x ) . Set u = m w . Then one has
J 3 ( m , x ) = m 2 , ( 1 F m , l , x ( u ) ) 1 u log [ N ] ( log u ) + 1 j = 1 N 1 log [ j ] ( log u ) d u = m , ( 1 F m , l , x ( m w ) ) 1 w log [ N + 1 ] ( m w ) + 1 j = 2 N log [ j ] ( m w ) d w .
Given w > m , Lemma 5 implies that log [ N + 1 ] ( m w ) log [ N + 1 ] ( w 2 ) = log [ N ] ( 2 log w ) 2 log [ N + 1 ] ( w ) for w large enough, namely for all w W , where W = W ( N ) . Take δ > 0 and set m 3 : = max l , ( l 1 ) 1 + 1 δ , W ( N ) , e [ N ] . Let further m m 3 . Then
J 3 ( m , x ) 2 m , ( 1 F m , l , x ( m w ) ) 1 w log [ N + 1 ] ( w ) + 1 j = 2 N log [ j ] ( w ) d w .
By virtue of (46) and (48) one has
1 F m , l , x ( m w ) ( 1 + δ ) 1 W m , x ( m w ) = ( 1 + δ ) 1 F 1 , 1 , x ( w ) .
Hence it can be seen that
J 3 ( m , x ) 2 ( 1 + δ ) m , ( 1 F 1 , 1 , x ( w ) ) g N 1 ( w ) d w .
Introduce
R N ( x ) : = y R d , x y > e [ N ] G N ( log x y ) q ( y ) d y , A p ( G N ) : = { x S ( p ) : R N ( x ) < } .
Let us note that (1) P X ( S ( p ) \ A p ( G N ) ) = 0 as K p , q ( 1 , N ) < ;
(2) P X ( S ( p ) \ S ( q ) ) = 0 as P X P Y (see Lemma A1);
(3) μ S ( q ) \ ( Λ ( q ) D q ( R ) ) = 0 due to Lemma 3.
Since P X μ we conclude that P X S ( q ) \ ( Λ ( q ) D q ( R ) ) = 0 . Hence, one has P X S ( p ) \ ( Λ ( q ) D q ( R ) ) = 0 in view of 2) and because B \ C ( B \ A ) ( A \ C ) for any A , B , C R d . Set further A : = Λ ( q ) S ( q ) D q ( R ) S ( p ) A p ( G N ) . It follows from (1), (2) and (3) that P X ( S ( p ) \ A ) = 0 , so P X ( A ) = 1 . We are going to consider only x A .
Then, by virtue of (47) and (50), for all m m 3 and x A , we come to the inequality
J 3 ( m , x ) 2 ( 1 + δ ) a ( d , 1 ) R N ( x ) + b ( N , d , 1 ) = A ( δ , d ) R N ( x ) + B ( δ , d , N ) ,
where A ( δ , d ) : = 2 ( 1 + δ ) a ( d , 1 ) , B ( δ , d , N ) : = 2 ( 1 + δ ) b ( N , d , 1 ) .
Part (3f). Here we get the upper bound for E G N ( | log ξ m , l , x | ) . For m max { m 1 , m 2 , m 3 } and each x A , taking into account (39), (44), (45) and (51) we can claim that
E G N ( | log ξ m , l , x | ) I 1 ( m , x ) + J 1 ( m , x ) + J 2 ( m , x ) + J 3 ( m , x ) U 1 ( ε , N , d ) ( M q ( x , R ) ) ε + U 2 ( ε , N , d , l ) ( m q ( x , R ) ) ε + U 3 ( m , ε , N , d , l ) m q ( x , R ) ε + A ( δ , d ) R N ( x ) + B ( δ , d , N ) .
For any κ > 0 , one can take m 4 = m 4 ( κ , ε , N , d , l ) N such that U 3 ( m , ε , N , d , l ) κ if m m 4 . Then by virtue of (52), for each x A and m m 0 : = max { m 1 , m 2 , m 3 , m 4 } ,
E G N ( | log ξ m , l , x | ) U 1 ( ε , N , d ) ( M q ( x , R ) ) ε + U 2 ( ε , N , d , l ) + κ ( m q ( x , R ) ) ε + A ( δ , d ) R N ( x ) + B ( δ , d , N ) : = C 0 ( x ) < .
Hence, for each x A , we have established uniform integrability of the family log ξ m , l , x m m 0 .
Step 4. Now we verify (23). It was checked, for each x A (thus, for P X -almost every x belonging to S ( p ) ) that E ( log ϕ m , l ( 1 ) | X 1 = x ) ψ ( l ) log V d log q ( x ) , m . Set Z m , l ( x ) : = E ( log ϕ m , l ( 1 ) | X 1 = x ) = E log ξ m , l , x . Consider x A and take any m m 0 . We use the following property of G N which is shown in Appendix A.
Lemma 6.
For each N N , a function G N is convex on R + .
By the Jensen inequality a function G N is nondecreasing and convex.
G N ( | Z m , l ( x ) | ) = G N ( | E log ξ m , l , x | ) G N ( E | log ξ m , l , x | ) E G N ( | log ξ m , l , x | ) .
Relation (53) guarantees that, for all m m 0 ,
R d G N ( | Z m , l ( x ) | ) p ( x ) d x U 1 ( ε , N , d ) Q p , q ( ε , R ) + U 2 ( ε , N , d , l ) + κ T p , q ( ε , R ) + A ( δ , d ) K p , q ( 1 , N ) + B ( δ , d , N ) < .
Now we know that the family { Z m , l ( x ) } m m 0 , x A , is uniformly integrable w.r.t. measure P X . Thus, for i N ,
E log ϕ m , l ( i ) = R d E ( log ϕ m , l ( 1 ) | X 1 = x ) P X 1 ( d x ) = R d Z m , l ( x ) p ( x ) d x ψ ( l ) log V d R d p ( x ) log q ( x ) d x , m ,
and we come to relation (23) establishing Statement 1.
Step 5. Here we prove Statement 2. Similar to F m , l , x ( u ) , one can introduce, for n , k N , n k + 1 , x R d and u 0 , the following function
F ˜ n , k , x ( u ) : = P ζ n , k ( i ) u | X i = x = 1 P x X ( k ) ( x , X n \ { x } ) > r n 1 ( u ) = 1 s = 0 k 1 n 1 s V n 1 , x ( u ) s 1 V n 1 , x ( u ) n 1 s : = P ξ ˜ n , k , x u ,
where r n ( u ) was defined in (29), and
V n , x ( u ) : = B ( x , r n ( u ) ) p ( z ) d z , ξ ˜ n , k , x : = ( n 1 ) x X ( k ) ( x , X n \ { x } ) d .
Formulas (54) and (55) show that F ˜ n , k , x ( u ) is the regular conditional distribution function of ζ n , k ( i ) given X i = x . Moreover, for any fixed u 0 and x Λ ( p ) S ( p ) (thus p ( x ) > 0 ),
F ˜ n , k , x ( u ) 1 s = 0 k 1 ( V d p ( x ) u ) s s ! e V d p ( x ) u : = F ˜ k , x ( u ) , n .
Hence, ξ ˜ n , k , x l a w ξ ˜ k , x , x Λ ( p ) S ( p ) , n . Set A ˜ p ( G N ) : = { x S ( p ) : R ˜ N ( x ) < } , where N N and
R ˜ N ( x ) : = y R d , x y > e [ N ] G N ( log x y ) p ( y ) d y .
Take A ˜ : = Λ ( p ) S ( p ) D p ( R ) A ˜ p ( G N ) . Then P X ( A ˜ ) = 1 and, for x A ˜ , one can verify that E G N ( | log ξ ˜ n , k , x | ) C ˜ 0 ( x ) < , for all n n 0 , and therefore E log ξ ˜ n , k , x E log ξ ˜ k , x as n . Thus, E ( log ζ n , k ( 1 ) | X 1 = x ) ψ ( k ) log V d log p ( x ) , n . Set Z ˜ n , k ( x ) : = E ( log ζ n , k ( 1 ) | X 1 = x ) . One can see that, for all n n 0 , R d G N ( | Z ˜ n , k ( x ) | ) p ( x ) d x < . Hence similar to Steps 1–4 we come to relation (24).
So, (14) holds and the proof of Theorem 1 is complete.

5. Proof of Theorem 2

We will follow the general scheme described in Remark 7. However now this scheme is more involved.
First of all note that, in view of Lemma 1, the finiteness of K p , q ( 2 , N ) and K p , p ( 2 , N ) implies the finiteness of K p , q ( 1 , N ) and K p , p ( 1 , N ) , respectively. Thus, the conditions of Theorem 2 entail validity of Theorem 1 statements. Consequently under the conditions of Theorem 2, for n and m large enough, one can claim that D ^ n , m ( k , l ) L 1 ( Ω ) and E D ^ n , m ( k , l ) D ( P X | | P Y ) , as n , m .
We will show that D ^ n , m ( k , l ) L 2 ( Ω ) for all n and m large enough. Then one can write
E D ^ n , m ( k , l ) D ( P X | | P Y ) 2 = var D ^ n , m ( k , l ) + E D ^ n , m ( k , l ) D ( P X | | P Y ) 2 .
Therefore to prove (16) we will demonstrate that var D ^ n , m ( k , l ) 0 , n , m .
Due to (28) the random variables log ϕ m , l ( 1 ) , , log ϕ m , l ( n ) are identically distributed (and log ζ n , k ( 1 ) , , log ζ n , k ( n ) are identically distributed as well). The variables ϕ m , l ( i ) and ζ n , k ( i ) are the same as in (22). We will demonstrate that log ϕ m , l ( 1 ) and log ζ n , k ( 1 ) belong to L 2 ( Ω ) . Hence (22) yields
var D ^ n , m ( k , l ) = 1 n 2 i , j = 1 n cov log ϕ m , l ( i ) log ζ n , k ( i ) , log ϕ m , l ( j ) log ζ n , k ( j ) = 1 n var log ϕ m , l ( 1 ) + 2 n 2 1 i < j n cov log ϕ m , l ( i ) , log ϕ m , l ( j ) + 1 n var log ζ n , k ( 1 ) + 2 n 2 1 i < j n cov log ζ n , k ( i ) , log ζ n , k ( j ) 2 n 2 i , j = 1 n cov log ϕ m , l ( i ) , log ζ n , k ( j ) .
We mainly follow the notation employed in the above proof of Theorem 1, except the possibly different choice of the sets A R d , A ˜ R d , positive U j , C j ( x ) , C ˜ j ( x ) and integers m j , n j , where j Z + and x R d . The following Theorem 2 proof is also subdivided in 5 parts. Steps 1–3 deal with the demonstration of relation 1 n var ( log ϕ m , l ( 1 ) ) 0 as n , m . Step 4 validates the relation 2 n 2 1 i < j n cov ( log ϕ m , l ( i ) , log ϕ m , l ( j ) ) 0 as n , m . At Step 5 we establish that
2 n 2 1 i < j n cov ( log ζ n , k ( i ) , log ζ n , k ( j ) ) 0 , n ,
This step is rather involved. Step 6 justifies the desired statement var D ^ n , m ( k , l ) 0 , n , m .
Step 1. We study E log 2 ϕ m , l ( 1 ) , as m . For x R d and N N , introduce
R N , 2 ( x ) : = x y e [ N ] G N ( log 2 x y ) q ( y ) d y .
Set A p , 2 ( G N ) : = { x S ( p ) : R N , 2 ( x ) < } . Then P X ( S ( p ) \ A p , 2 ( G N ) ) = 0 since K p , q ( 2 , N ) < . Consider
A : = Λ ( q ) S ( q ) D q ( R ) S ( p ) A p , 2 ( G N ) ,
where the first four sets appeared in Theorem 1 proof, R and N are indicated in conditions of Theorem 2. It is easily seen that P X ( A ) = 1 . The reasoning is exactly the same as in the proof of Theorem 1.
Recall that, for each x A , one has log ξ m , l , x l a w log ξ l , x , m , where ξ m , l , x : = m x Y ( l ) ( x , Y m ) d and ξ l , x has Γ ( V d q ( x ) , l ) distribution. Convergence in law of random variables is maintained by continuous transformations. Thus, for each x A , we get
log 2 ξ m , l , x l a w log 2 ξ l , x , m .
For any x A , according to (28),
E log 2 ξ m , l , x = ( 0 , ) log 2 u d F m , l , x ( u ) = ( 0 , ) log 2 u d P ( ϕ m , l ( 1 ) u | X 1 = x ) = E ( log 2 ϕ m , l ( 1 ) | X 1 = x ) .
Note that if η Γ ( α , λ ) , where α > 0 and λ > 0 , then it is not difficult to verify that
E log 2 η = Γ ( λ ) Γ ( λ ) 2 ψ ( λ ) log α + log 2 α .
Since ξ l , x Γ ( V d q ( x ) , l ) , for x S ( q ) , one has
E log 2 ξ l , x = Γ ( l ) Γ ( l ) 2 ψ ( l ) log ( V d q ( x ) ) + log 2 ( V d q ( x ) ) = log 2 q ( x ) + h 1 log q ( x ) + h 2 ,
where h 1 : = h 1 ( l , d ) and h 2 : = h 2 ( l , d ) depend only on fixed l and d.
We prove now that, for x A , one has
E ( log 2 ϕ m , l ( 1 ) | X 1 = x ) log 2 q ( x ) + h 1 log q ( x ) + h 2 , m .
Taking into account (60) and (61) we can claim that relation (62) is equivalent to the following one: E log 2 ξ m , l , x E log 2 ξ l , x , m . So, in view of (59) to prove (62) it is sufficient to show that, for each x A , a family log 2 ξ m , l , x m m 0 ( x ) is uniformly integrable for some m 0 ( x ) N . Then, following Theorem 1 proof, one can certify that, for all x A and some nonnegative C 0 ( x ) ,
sup m m 0 ( x ) E G N ( log 2 ξ m , l , x ) C 0 ( x ) < .
Step 2. Now we will prove (63). For each N N , introduce ρ ( N ) : = exp { e [ N 1 ] } and
h N ( t ) : = 0 , t 1 ρ ( N ) , ρ ( N ) , 2 log t t log [ N ] ( log 2 t ) + 1 j = 1 N 1 log [ j ] ( log 2 t ) , t 0 , 1 ρ ( N ) ρ ( N ) , .
As usual, a product over an empty set (if N = 1 ) equals to 1. To show (63) we refer to the next lemma.
Lemma 7.
Let F ( u ) , u R , be a distribution function such that F ( 0 ) = 0 . Fix an arbitrary N N . Then
( 1 ) 0 , 1 ρ ( N ) G N ( log 2 u ) d F ( u ) = 0 , 1 ρ ( N ) F ( u ) ( h N ( u ) ) d u ,
( 2 ) ρ ( N ) , G N ( log 2 u ) d F ( u ) = ρ ( N ) , ( 1 F ( u ) ) h N ( u ) d u .
The proof of this lemma is omitted, being quite similar to one of Lemma 2. By Lemma 7 and since G N ( log 2 u ) = 0 , for u 1 ρ ( N ) , ρ ( N ) , one has
E G N ( log 2 ξ m , l , x ) = 0 , 1 ρ ( N ) F m , l , x ( u ) ( h N ( u ) ) d u + ρ ( N ) , ( 1 F m , l , x ( u ) ) h N ( u ) d u : = I 1 ( m , x ) + I 2 ( m , x ) .
To simplify notation we do not indicate the dependence of I i ( m , x ) ( i = 1 , 2 ) on fixed N, l and d.
For clarity, further implementation of Step 2 is divided into several parts.
Part (2a).At first we consider I 1 ( m , x ) . As in Theorem 1 proof, for fixed R > 0 and ε > 0 appearing in the conditions of Theorem 2, an inequality F m , l , x ( u ) ( M q ( x , R ) ) ε V d ε u ε holds for any x A , u 0 , 1 ρ ( N ) and m m 1 : = max 1 ρ ( N ) R d , l . Taking into account that 0 ( h N ( u ) ) ( 2 log u ) log [ N ] ( log 2 u ) + 1 u if u 0 , 1 ρ ( N ) , we get, for m m 1 ,
I 1 ( m , x ) ( M q ( x , R ) ) ε V d ε 0 , 1 ρ ( N ) ( 2 log u ) log [ N ] ( log 2 u ) + 1 u 1 ε d u = U 1 ( ε , N , d ) ( M q ( x , R ) ) ε .
Here U 1 ( ε , N , d ) : = V d ε L N , 2 ( ε ) , L N , 2 ( ε ) : = e [ N 1 ] , 2 t log [ N ] ( t 2 ) + 1 e ε t d t < for each ε > 0 and any N N .
Part (2b).Consider I 2 ( m , x ) . Following the previous theorem proof we at first observe that h N ( u ) 2 log u u log [ N ] ( log 2 u ) + 1 for u ( ρ ( N ) , ) . So, for all m max { ρ 2 ( N ) , l } ,
I 2 ( m , x ) ( ρ ( N ) , m ] ( 1 F m , l , x ( u ) ) 2 log u log [ N ] ( log 2 u ) + 1 u d u + ( m , m 2 ] ( 1 F m , l , x ( u ) ) 2 log u log [ N ] ( log 2 u ) + 1 u d u + ( m 2 , ) ( 1 F m , l , x ( u ) ) h N ( u ) d u : = J 1 ( m , x ) + J 2 ( m , x ) + J 3 ( m , x ) ,
where we do not indicate the dependence of J j ( m , x ) ( j = 1 , 2 , 3 ) on N, l and d.
For R > 0 and ε > 0 appearing in the conditions of Theorem 2, one can show (see Theorem 1 proof), that inequality
1 F m , l , x ( u ) S 1 S 2 V d u m q ( x , R ) ε
holds for any x A , u ρ ( N ) , m and all m m 2 : = max 1 R 2 d , ρ 2 ( N ) , l . Here S 1 : = S 1 ( l ) and S 2 are the same as in the proof of Theorem 1. For all x A and m m 2 , we come to the relations
J 1 ( m , x ) S 1 ( S 2 V d ) ε ( m q ( x , R ) ) ε ( ρ ( N ) , ) 2 log u log [ N ] ( log 2 u ) + 1 u 1 + ε d u = U 2 ( ε , N , d , l ) ( m q ( x , R ) ) ε ,
where U 2 ( ε , N , d , l ) : = 2 S 1 ( l ) L N , 2 ( ε ) ( S 2 V d ) ε .
Part (2c). Let us consider J 2 ( m , x ) . Take δ > 0 . Then, due to (65), for all x A and any m m 2 ,
J 2 ( m , x ) 2 1 F m , l , x ( m ) m , m 2 log u log [ N ] ( log 2 u ) + 1 d log u 4 S 1 ( S 2 V d ) ε m ε 2 m q ( x , R ) ε log [ N ] ( 4 log 2 m ) + 1 log 2 m = U 3 ( m , ε , N , d , l ) m q ( x , R ) ε ,
where U 3 ( m , ε , N , d , l ) : = 4 S 1 ( S 2 V d ) ε m ε 2 log 2 m log [ N ] ( 4 log 2 m ) + 1 0 , m .
Part (2d). Now we turn to J 3 ( m , x ) . Take u = m w . Then J 3 ( m , x ) has the form
m , ( 1 F m , l , x ( m w ) ) 2 log ( m w ) w log [ N ] ( log 2 ( m w ) ) + 1 j = 1 N 1 log [ j ] ( log 2 ( m w ) ) d w .
Due to Lemma 5 there exists T ( N ) > ρ ( N ) such that
log [ N ] ( log 2 ( w 2 ) ) = log [ N ] ( 4 log 2 w ) 2 log [ N ] ( log 2 w ) , w T ( N ) .
Pick some δ > 0 and set m 3 : = max l , ( l 1 ) 1 + 1 δ , T ( N ) , ρ ( N ) , where T ( N ) was introduced in (68). Consider m m 3 . In view of Lemma 4 (for ν = 2 ), (49), (68) and since w > m ,
J 3 ( m , x ) m , ( 1 F m , l , x ( m w ) ) 2 log ( w 2 ) w log [ N ] ( log 2 ( w 2 ) ) + 1 j = 1 N 1 log [ j ] ( log 2 w ) d w 4 ( 1 + δ ) m , ( 1 F 1 , 1 , x ( w ) ) 2 log w w log [ N ] ( log 2 w ) + 1 j = 1 N 1 log [ j ] ( log 2 w ) d w = 4 ( 1 + δ ) m , ( 1 F 1 , 1 , x ( w ) ) h N ( w ) d w 4 ( 1 + δ ) ρ ( N ) , ( 1 F 1 , 1 , x ( w ) ) h N ( w ) d w = 4 ( 1 + δ ) ρ ( N ) , G N ( log 2 w ) d F 1 , 1 , x ( w ) = 4 ( 1 + δ ) E [ G N ( log 2 ξ 1 , 1 , x ) I { ξ 1 , 1 , x > ρ ( N ) } ] = 4 ( 1 + δ ) E [ G N ( ( log Y x d ) 2 ) I { Y x d > ρ ( N ) } ] = 4 ( 1 + δ ) y R d , x y > ( ρ ( N ) ) 1 / d G N ( ( log x y d ) 2 ) q ( y ) d y 4 ( 1 + δ ) a ( d , 2 ) y R d , x y > ρ ( N ) 1 / d G N ( log 2 x y ) q ( y ) d y + b ( N , d , 2 )
4 ( 1 + δ ) a ( d , 2 ) R N , 2 ( x ) + G N ( e [ N 1 ] 2 ) + b ( N , d , 2 ) = A ( δ , d ) R N , 2 ( x ) + B ( δ , d , N ) ,
A ( δ , d ) : = 4 ( 1 + δ ) a ( d , 2 ) , B ( δ , d , N ) : = 4 ( 1 + δ ) a ( d , 2 ) G N ( e [ N 1 ] 2 ) + b ( N , d , 2 ) , R N , 2 ( x ) is defined in (57). Here we have also used, for any N N , ν , t , u > 0 , t < u , the following estimates
K p , q ( ν , N , u ) K p , q ( ν , N , t ) K p , q ( ν , N , u ) + max { G N ( | log t | ν ) , G N ( | log u | ν ) } .
Part (2e). We now examine E G N ( log 2 ξ m , l , x ) . For each x A and m max { m 1 , m 2 , m 3 } , taking into account (64), (66), (67) and (69), we can claim that
E G N ( log 2 ξ m , l , x ) I 1 ( m , x ) + J 1 ( m , x ) + J 2 ( m , x ) + J 3 ( m , x ) U 1 ( ε , N , d ) ( M q ( x , R ) ) ε + U 2 ( ε , N , d , l ) ( m q ( x , R ) ) ε + U 3 ( m , ε , N , d , l ) m q ( x , R ) ε
+ A ( δ , d ) R N , 2 ( x ) + B ( δ , d , N ) .
Moreover, for any κ > 0 , one can choose m 4 : = m 4 ( κ , ε , N , d , l ) N such that, for m m 4 , it holds U 3 ( m , ε , N , d , l ) κ . Then by (70), for each x A and m m 0 : = max { m 1 , m 2 , m 3 , m 4 } ,
E G N ( log 2 ξ m , l , x ) U 1 ( ε , N , d ) ( M q ( x , R ) ) ε + U 2 ( ε , N , d , l ) + κ ( m q ( x , R ) ) ε + A ( δ , d ) R N , 2 ( x ) + B ( δ , d , N ) : = C 0 ( x ) < .
Hence we have proved the uniform integrability of the family log 2 ξ m , l , x m m 0 for each x A . Therefore, for any x A (thus for P X -almost every x S ( p ) ), relation (62) holds.
Step 3. Now we can return to E log 2 ϕ m , l ( 1 ) . Set Δ m , l ( x ) : = E ( log 2 ϕ m , l ( 1 ) | X 1 = x ) = E log 2 ξ m , l , x . Consider x A and take any m m 0 . The function G N is nondecreasing and convex according to Lemma 6. By the Jensen inequality,
G N ( Δ m , l ( x ) ) = G N ( E log 2 ξ m , l , x ) E G N ( log 2 ξ m , l , x ) .
Relation (72) guarantees that, for each x A and all m m 0 ,
R d G N ( Δ m , l ( x ) ) p ( x ) d x U 1 ( ε , N , d ) Q p , q ( ε , R ) + U 2 ( ε , N , d , l ) + κ T p , q ( ε , R ) + A ( δ , d ) K p , q ( 2 , N ) + B ( δ , d , N ) < .
Uniform integrability of the family { Δ m , l ( · ) } m m 0 (w.r.t. the measure P X ) is thus established. Hence one can claim that
E log 2 ϕ m , l ( 1 ) R d p ( x ) log 2 q ( x ) d x + h 1 R d p ( x ) log q ( x ) d x + h 2 , m .
It is easily seen that finiteness of integrals Q p , q ( ε , R ) , T p , q ( ε , R ) implies that
R d p ( x ) log 2 q ( x ) d x < , R d p ( x ) | log q ( x ) | d x < .
Thus, E log 2 ϕ m , l ( 1 ) τ 2 < and var log ϕ m , l ( 1 ) = E log 2 ϕ m , l ( 1 ) E log ϕ m , l ( 1 ) 2 τ 2 τ 1 2 < , m , where τ 1 : = ψ ( l ) log V d R d p ( x ) log q ( x ) d x according to (23). Consequently, 1 n var log ϕ m , l ( 1 ) 0 as n , m .
Step 4. Now we consider cov ( log ϕ m , l ( i ) , log ϕ m , l ( j ) ) for i j , where i , j { 1 , , n } . For x , y R d , define the conditional distribution function
Φ m , l , x , y i , j ( u , w ) : = P ( ϕ m , l ( i ) u , ϕ m , l ( j ) w | X i = x , X j = y ) , u , w 0 .
For x , y R d , u , w 0 , i j ,
Φ m , l , x , y i , j ( u , w ) = 1 P ( ϕ m , l ( i ) > u | X i = x , X j = y ) P ( ϕ m , l ( j ) > w | X i = x , X j = y ) + P ( ϕ m , l ( i ) > u , ϕ m , l ( j ) > w | X i = x , X j = y ) = 1 P x Y ( l ) ( x , Y m ) > r m ( u ) P y Y ( l ) ( y , Y m ) > r m ( w ) + P x Y ( l ) ( x , Y m ) > r m ( u ) , y Y ( l ) ( y , Y m ) > r m ( w ) .
Here r m ( a ) = a m 1 d for all a 0 , as previously. One can write Φ m , l , x , y ( u , w ) instead of Φ m , l , x , y i , j ( u , w ) , because the right-hand side of (73) does not depend on i and j.
Set A 1 : = ( x , y ) : x A , y A , x y and A 2 : = ( x , y ) : x A , y A , x = y , where A is introduced in (58). Evidently, P X P X ( A 1 ) = 1 and P X P X ( A 2 ) = 0 . Consider ( x , y ) A 1 . Obviously, for any a > 0 , r m ( a ) 0 , as m . For ( x , y ) A 1 we take m 5 = m 5 ( u , w , x y ) : = 1 + 2 x y d max u , w . Then r m ( u ) < x y 2 and r m ( w ) < x y 2 for all m m 5 . Thus, B ( x , r m ( u ) ) B ( y , r m ( w ) ) = if m m 5 . Consequently, for m m 6 ( u , w , x y ) : = max m 5 , 2 ( l 1 ) ,
P x Y ( l ) ( x , Y m ) > r m ( u ) , y Y ( l ) ( y , Y m ) > r m ( w ) = s 1 = 0 l 1 s 2 = 0 l 1 m ! s 1 ! s 2 ! ( m s 1 s 2 ) ! W m , x ( u ) s 1 W m , y ( w ) s 2 1 W m , x ( u ) W m , y ( w ) m s 1 s 2 .
In view of (28), (73) and (74), one has for Φ m , l , x , y ( u , w ) the following representation
1 s 1 = 0 l 1 m s 1 W m , x ( u ) s 1 1 W m , x ( u ) m s 1 s 2 = 0 l 1 m s 2 W m , y ( w ) s 2 1 W m , y ( w ) m s 2 + s 1 = 0 l 1 s 2 = 0 l 1 m ! s 1 ! s 2 ! ( m s 1 s 2 ) ! W m , x ( u ) s 1 W m , y ( w ) s 2 1 W m , x ( u ) W m , y ( w ) m s 1 s 2 .
For any fixed ( x , y ) A 1 and u , w 0 , we get, as m ,
m ! s 1 ! s 2 ! ( m s 1 s 2 ) ! W m , x ( u ) s 1 W m , y ( w ) s 2 ( V d u q ( x ) ) s 1 s 1 ! ( V d w q ( y ) ) s 2 s 2 ! , 1 W m , x ( u ) W m , y ( w ) m s 1 s 2 e V d u q ( x ) + w q ( y ) .
Then, according to (31), (75) and (76), for all fixed u , w 0 , ( x , y ) A 1 , one has
Φ m , l , x , y ( u , w ) 1 s 1 = 0 l 1 ( V d u q ( x ) ) s 1 s 1 ! e V d u q ( x ) s 2 = 0 l 1 ( V d w q ( y ) ) s 2 s 2 ! e V d w q ( y ) + s 1 = 0 l 1 s 2 = 0 l 1 ( V d u q ( x ) ) s 1 s 1 ! ( V d w q ( y ) ) s 2 s 2 ! e V d u q ( x ) + w q ( y ) = 1 s 1 = 0 l 1 ( V d u q ( x ) ) s 1 s 1 ! e V d u q ( x ) 1 s 2 = 0 l 1 ( V d w q ( y ) ) s 2 s 2 ! e V d w q ( y ) = F l , x ( u ) F l , y ( w ) : = Φ l , x , y ( u , w ) , m .
Thus, Φ l , x , y ( · , · ) is identified as a distribution function of a vector η l , x , y : = ( ξ l , x , ξ l , y ) having independent components such that ξ l , x Γ ( V d q ( x ) , l ) , ξ l , y Γ ( V d q ( y ) , l ) . Observe also that Φ m , l , x , y ( · , · ) is a distribution function of a random vector η m , l , x , y : = ( ξ m , l , x , ξ m , l , y ) . Consequently, we have shown that η m , l , x , y l a w η l , x , y as m . Hence, for any ( x , y ) A 1 ,
log ξ m , l , x log ξ m , l , y l a w log ξ l , x log ξ l , y , m .
Here we take strictly positive versions of random variables under consideration. Note that, for all i , j N , i j ,
E ( log ξ m , l , x log ξ m , l , y ) = ( 0 , ) × ( 0 , ) log u log w d Φ m , l , x , y ( u , w ) = E log ϕ m , l ( i ) log ϕ m , l ( j ) | X i = x , X j = y .
One has E ( log ξ l , x log ξ l , y ) = E log ξ l , x E log ξ l , y = a l , d ( x ) a l , d ( y ) because ξ l , x and ξ l , y are independent, here a l , d ( z ) : = ψ ( l ) log V d log q ( z ) , z R d .
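The limiting Gamma law appearing here (that is, ξ l , x distributed as Γ ( V d q ( x ) , l ) , a Gamma distribution with shape l and rate V d q ( x ) ) is easy to probe numerically. The following is a minimal Monte Carlo sketch; the choices d = 1, q equal to the standard normal density, x = 0, the sample sizes and the use of scipy.stats are illustrative assumptions only, not part of the argument above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
d, l, m, trials = 1, 2, 2000, 3000
V_d = 2.0                                  # volume of the unit ball in R^1
q_x0 = 1.0 / np.sqrt(2.0 * np.pi)          # q = standard normal density, evaluated at x = 0

# xi_{m,l,x} = m * (distance from x to its l-th nearest neighbor among Y_1,...,Y_m)^d
samples = rng.standard_normal((trials, m))             # each row is an independent copy of Y_m
dists = np.sort(np.abs(samples), axis=1)[:, l - 1]     # l-th nearest neighbor distance to x = 0
xi = m * dists ** d

rate = V_d * q_x0                          # limiting law: Gamma(shape=l, rate=V_d*q(x))
print(xi.mean(), l / rate)                 # empirical mean vs. mean of the limiting law
print(stats.kstest(xi, "gamma", args=(l, 0.0, 1.0 / rate)))
```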
Now we intend to verify that, for any ( x , y ) A 1 ,
E log ϕ m , l ( 1 ) log ϕ m , l ( 2 ) | X 1 = x , X 2 = y a l , d ( x ) a l , d ( y ) .
Equivalently, one can prove that E ( log ξ m , l , x log ξ m , l , y ) E ( log ξ l , x log ξ l , y ) for each ( x , y ) A 1 , as m .
Part (4a). We will prove the uniform integrability of the family { log ξ m , l , x log ξ m , l , y } m m 0 for ( x , y ) A 1 . The convex function G N ( · ) is nondecreasing. Thus, following the proof of Step 2, for any ( x , y ) A 1 one can find m 0 (the same as at Step 2) such that, for all m m 0 ,
E G N ( | log ξ m , l , x log ξ m , l , y | ) E G N 1 2 log 2 ξ m , l , x + 1 2 log 2 ξ m , l , y 1 2 E G N ( log 2 ξ m , l , x ) + E G N ( log 2 ξ m , l , y ) U 1 2 ( M q ( x , R ) ) ε + ( M q ( y , R ) ) ε + U 2 + κ 2 ( m q ( x , R ) ) ε + ( m q ( y , R ) ) ε + A 2 R N , 2 ( x ) + R N , 2 ( y ) + B : = C ˜ 0 ( x , y ) .
Here we used (71). It is essential that U 1 , U 2 , κ , A , B do not depend on x and y. Hence, for any ( x , y ) A 1 , a family { log ξ m , l , x log ξ m , l , y } m m 0 is uniformly integrable. So, we establish (78) for ( x , y ) A 1 .
Part (4b). We return to cov ( log ϕ m , l ( i ) , log ϕ m , l ( j ) ) for i j , i , j { 1 , , n } . Set T m , l ( x , y ) : = E log ϕ m , l ( 1 ) log ϕ m , l ( 2 ) | X 1 = x , X 2 = y where ( x , y ) A 1 . Then (78) means that T m , l ( x , y ) = E ( log ξ m , l , x log ξ m , l , y ) a l , d ( x ) a l , d ( y ) for any ( x , y ) A 1 , as m . Note that
G N ( | T m , l ( x , y ) | ) = G N ( | E log ξ m , l , x log ξ m , l , y | ) G N ( E | log ξ m , l , x log ξ m , l , y | ) E G N ( | log ξ m , l , x log ξ m , l , y | ) .
As P X P X ( A 1 ) = 1 , one can conclude due to (79) and (80) that, for all m m 0 ,
R d × R d G N ( | T m , l ( x , y ) | ) p ( x ) p ( y ) d x d y = ( x , y ) A 1 G N ( | T m , l ( x , y ) | ) p ( x ) p ( y ) d x d y U 1 R d M q ε ( x , R ) p ( x ) d x + U 2 + κ R d m q ε ( x , R ) p ( x ) d x + A R d R N , 2 ( x ) p ( x ) d x + B = U 1 Q p , q ( ε , R ) + ( U 2 + κ ) T p , q ( ε , R ) + A K p , q ( 2 , N ) + B < .
Therefore, for ( x , y ) A 1 , the family T m , l ( x , y ) m m 0 is uniformly integrable w.r.t. P X P X . Consequently,
R d × R d T m , l ( x , y ) p ( x ) p ( y ) d x d y R d × R d a l , d ( x ) a l , d ( y ) p ( x ) p ( y ) d x d y , m .
Thus
E log ϕ m , l ( 1 ) log ϕ m , l ( 2 ) ψ ( l ) log V d R d log q ( x ) p ( x ) d x 2 , m .
On the other hand, taking also into account (23), we come to the relation
E log ϕ m , l ( 1 ) E log ϕ m , l ( 2 ) ψ ( l ) log V d R d log q ( x ) p ( x ) d x 2 .
Hence (81) and (82) imply that
2 n 2 1 i < j n cov log ϕ m , l ( i ) , log ϕ m , l ( j ) = n 1 n cov ( log ϕ m , l ( 1 ) , log ϕ m , l ( 2 ) ) 0 , n , m .
Step 5. Now we consider cov ( log ζ n , k ( i ) , log ζ n , k ( j ) ) for i j , where i , j { 1 , , n } .
Similarly to Step 4, for x , y R d and u , w 0 , introduce a conditional distribution function
Φ ˜ n , k , x , y i , j ( u , w ) : = P ( ζ n , k ( i ) u , ζ n , k ( j ) w | X i = x , X j = y )
= P x X ( k ) ( x , X n i , j { y } ) r n 1 ( u ) , y X ( k ) ( y , X n i , j { x } ) r n 1 ( w )
: = P ( η ˜ n , k , x y , i , j u , η ˜ n , k , y x , i , j w ) , u , w 0 ,
where X n i , j = X n \ { X i , X j } , η ˜ n , k , x y , i , j : = ( n 1 ) x X ( k ) ( x , X n i , j { y } ) d . We write Φ ˜ n , k , x , y ( u , w ) , η ˜ n , k , x y and η ˜ n , k , y x instead of Φ ˜ n , k , x , y i , j ( u , w ) , η ˜ n , k , x y , i , j , η ˜ n , k , y x , i , j , respectively, (because X 1 , X 2 , are i.i.d. random vectors). Moreover, Φ ˜ n , k , x , y ( u , w ) is the distribution function of a random vector η ˜ n , k , x , y : = ( η ˜ n , k , x y , η ˜ n , k , y x ) and the regular conditional distribution function of a random vector ( ζ n , k ( i ) , ζ n , k ( j ) ) given ( X i , X j ) = ( x , y ) . One has
Φ ˜ n , k , x , y ( u , w ) = 1 P x X ( k ) ( x , X n i , j { y } ) > r n 1 ( u ) P y X ( k ) ( y , X n i , j { x } ) > r n 1 ( w ) + P x X ( k ) ( x , X n i , j { y } ) > r n 1 ( u ) , y X ( k ) ( y , X n i , j { x } ) > r n 1 ( w ) .
Introduce
R ˜ N , 2 ( x ) : = x y e [ N ] G N ( log 2 x y ) p ( y ) d y ,
A ˜ p , 2 ( G N ) : = { x S ( p ) : R ˜ N , 2 ( x ) < } and A ˜ : = Λ ( p ) S ( p ) D p ( R ) A ˜ p , 2 ( G N ) , where the first three sets appeared in the proof of Theorem 1 (Step 5). Then P X ( S ( p ) \ A ˜ p , 2 ( G N ) ) = 0 since K p , p ( 2 , N ) < . It is easily seen that P X ( A ˜ ) = 1 .
Take A ˜ 1 : = ( x , y ) : x A ˜ , y A ˜ , x y and A ˜ 2 : = ( x , y ) : x A ˜ , y A ˜ , x = y . Evidently, P X P X ( A ˜ 1 ) = 1 and P X P X ( A ˜ 2 ) = 0 . For any a > 0 , r m ( a ) 0 , as m . Hence, for ( x , y ) A ˜ 1 , one can find n ˜ 5 = n ˜ 5 ( u , w , x y ) = 2 + 2 x y d max u , w such that r n 1 ( u ) < x y 2 , r n 1 ( w ) < x y 2 if n n ˜ 5 . Then B ( x , r n 1 ( u ) ) B ( y , r n 1 ( w ) ) = if n n ˜ 5 ( u , w , x y ) . Thus, for n n ˜ 6 : = max n ˜ 5 , 2 k , one has
Φ ˜ n , k , x , y ( u , w ) = 1 s 1 = 0 k 1 n 2 s 1 V n 1 , x ( u ) s 1 1 V n 1 , x ( u ) n 2 s 1
s 2 = 0 k 1 n 2 s 2 V n 1 , y ( w ) s 2 1 V n 1 , y ( w ) n 2 s 2
+ s 1 = 0 k 1 s 2 = 0 k 1 ( n 2 ) ! s 1 ! s 2 ! ( n 2 s 1 s 2 ) ! V n 1 , x ( u ) s 1 V n 1 , y ( w ) s 2 1 V n 1 , x ( u ) V n 1 , y ( w ) n 2 s 1 s 2 .
Therefore, for each fixed ( x , y ) A ˜ 1 , u , w 0 , we get, as n ,
Φ ˜ n , k , x , y ( u , w ) 1 s 1 = 0 k 1 ( V d u p ( x ) ) s 1 s 1 ! e V d u p ( x ) s 2 = 0 k 1 ( V d w p ( y ) ) s 2 s 2 ! e V d w p ( y ) + s 1 = 0 k 1 s 2 = 0 k 1 ( V d u p ( x ) ) s 1 s 1 ! ( V d w p ( y ) ) s 2 s 2 ! e V d u p ( x ) + w p ( y ) = 1 s 1 = 0 k 1 ( V d u p ( x ) ) s 1 s 1 ! e V d u p ( x ) 1 s 2 = 0 k 1 ( V d w p ( y ) ) s 2 s 2 ! e V d w p ( y ) = F ˜ k , x ( u ) F ˜ k , y ( w ) : = Φ ˜ k , x , y ( u , w ) .
Here Φ ˜ k , x , y ( · , · ) denotes the distribution function of a vector η ˜ k , x , y : = ( ξ ˜ k , x , ξ ˜ k , y ) . The components of η ˜ k , x , y are independent, ξ ˜ k , x Γ ( V d p ( x ) , k ) and ξ ˜ k , y Γ ( V d p ( y ) , k ) . Consequently, for each fixed ( x , y ) A ˜ 1 , we have shown that η ˜ n , k , x , y l a w η ˜ k , x , y as n . Therefore, for such ( x , y ) ,
log η ˜ n , k , x y log η ˜ n , k , y x l a w log ξ ˜ k , x log ξ ˜ k , y , n .
Here we take strictly positive versions of the random variables under consideration. In a way similar to (77), for i , j { 1 , , n } , i j , we write
E ( log η ˜ n , k , x y log η ˜ n , k , y x ) = ( 0 , ) × ( 0 , ) log u log w d Φ ˜ n , k , x , y ( u , w ) = E log ζ n , k ( i ) log ζ n , k ( j ) | X i = x , X j = y .
Since ξ ˜ k , x and ξ ˜ k , y are independent, write E ( log ξ ˜ k , x log ξ ˜ k , y ) = E log ξ ˜ k , x E log ξ ˜ k , y = b k , d ( x ) b k , d ( y ) , where b k , d ( z ) : = ψ ( k ) log V d log p ( z ) , z R d .
For any fixed M > 0 , consider A ˜ 1 , M : = ( x , y ) A ˜ 1 : x y > M . Now our aim is to verify that, for each ( x , y ) A ˜ 1 , M ,
E log ζ n , k ( 1 ) log ζ n , k ( 2 ) | X 1 = x , X 2 = y b k , d ( x ) b k , d ( y ) .
Equivalently, we can prove, for each ( x , y ) A ˜ 1 , M , that
E log η ˜ n , k , x y log η ˜ n , k , y x E log ξ ˜ k , x log ξ ˜ k , y , n .
The restriction to pairs ( x , y ) A ˜ 1 , M is essential for the rest of the proof.
Part (5a). We are going to establish that, for ( x , y ) A ˜ 1 , M , the family { log η ˜ n , k , x y log η ˜ n , k , y x } n n ˜ 0 is uniformly integrable, where n ˜ 0 N does not depend on x , y but might depend on M. Then, due to (83), relation (85) will be valid for such ( x , y ) as well. As we have seen, the function G N ( · ) is nondecreasing and convex. Hence
E G N ( | log η ˜ n , k , x y log η ˜ n , k , y x | ) 1 2 E G N ( log 2 η ˜ n , k , x y ) + E G N ( log 2 η ˜ n , k , y x ) .
Let us consider, for instance, E G N ( log 2 η ˜ n , k , x y ) . Similarly to Step 2, we can write
E G N ( log 2 η ˜ n , k , x y ) = 0 , 1 ρ ( N ) F ˜ n , k , x y ( u ) ( h N ( u ) ) d u + ρ ( N ) , ( 1 F ˜ n , k , x y ( u ) ) h N ( u ) d u : = I 1 ( n , x , y ) + I 2 ( n , x , y ) ,
where
F ˜ n , k , x y ( u ) : = P η ˜ n , k , x y u = 1 P x X ( k ) ( x , X n i , j { y } ) > r n 1 ( u ) = I x y > r n 1 ( u ) 1 s = 0 k 1 n 2 s V n 1 , x ( u ) s 1 V n 1 , x ( u ) n 2 s + I x y r n 1 ( u ) 1 s = 0 k 2 n 2 s V n 1 , x ( u ) s 1 V n 1 , x ( u ) n 2 s ,
As usual, a sum over the empty set is equal to 0 (this is relevant for k = 1 ).
If u 0 , 1 ρ ( N ) , where ρ ( N ) : = exp { e [ N 1 ] } and n n ˜ 1 : = 1 ρ ( N ) M d + 1 , then r n 1 ( u ) M . Thus, r n 1 ( u ) < x y if ( x , y ) A ˜ 1 , M . In view of (87) and similarly to (38), one has
F ˜ n , k , x y ( u ) n 2 n 1 ε M p ( x , R ) V d u ε M p ( x , R ) ε V d ε u ε
for ( x , y ) A ˜ 1 , M , u 0 , 1 ρ ( N ) , n max { n ˜ 1 ( M ) , n ˜ 2 ( R ) } , here n ˜ 2 ( R ) : = max { 1 ρ ( N ) R d , k + 1 } . So, I 1 ( n , x , y ) U 1 ( ε , N , d ) M p ( x , R ) ε for ( x , y ) A ˜ 1 , M and n max { n ˜ 1 ( M ) , n ˜ 2 ( R ) } . Moreover, for all u 0 , in view of (87) it holds
1 F ˜ n , k , x y ( u ) s = 0 k 1 n 2 s V n 1 , x ( u ) s 1 V n 1 , x ( u ) n 2 s .
The same reasoning as was used in the proof of Theorem 1 (Step 3, Part (3b)) leads to the inequalities
1 F ˜ n , k , x y ( u ) S 1 ( k ) 1 S 2 V n 1 , x ( u ) n 2 S 1 exp S 2 ( n 2 ) V n 1 , x ( u ) S 1 exp n 2 n 1 S 2 V d u m p ( x , R ) S 1 S 2 2 V d u m p ( x , R ) ε
for all n max n ˜ 3 ( R ) , 3 . Then similarly to (70), the relation
E G N ( log 2 η ˜ n , k , x y ) U 1 ( M p ( x , R ) ) ε + U ˜ 2 + κ ( m p ( x , R ) ) ε + A R ˜ N , 2 ( x ) + B : = C ˜ 1 ( x ) <
is valid for all ( x , y ) A ˜ 1 , M and n n ˜ 0 ( M ) : = max n ˜ 1 ( M ) , n ˜ 2 , n ˜ 3 , n ˜ 4 ( κ ) , 3 . Here U 1 , U ˜ 2 , κ , A , B do not depend on x and y. The term E G N ( log 2 η ˜ n , k , y x ) can be treated in the above manner. Thus, in view of (86), one has
E G N ( | log η ˜ n , k , x y log η ˜ n , k , y x | ) U 1 2 ( M p ( x , R ) ) ε + ( M p ( y , R ) ) ε + U 2 + κ 2 ( m p ( x , R ) ) ε + ( m p ( y , R ) ) ε + A 2 R ˜ N , 2 ( x ) + R ˜ N , 2 ( y ) + B : = C ˜ 1 ( x , y ) .
Therefore, for any ( x , y ) A ˜ 1 , M , a family { log η ˜ n , k , x y log η ˜ n , k , y x } n n ˜ 0 is uniformly integrable. Thus, we come to (84) for ( x , y ) A ˜ 1 , M .
Part (5b). Now we return to the upper bound for cov ( log ζ n , k ( 1 ) , log ζ n , k ( 2 ) ) . Set
T ˜ n , k ( x , y ) : = E log ζ n , k ( 1 ) log ζ n , k ( 2 ) | X 1 = x , X 2 = y = E log η ˜ n , k , x y log η ˜ n , k , y x
for all ( x , y ) A ˜ 1 . Validity of (84) is equivalent to the following relation: for any ( x , y ) A ˜ 1 , M , T ˜ n , k ( x , y ) b k , d ( x ) b k , d ( y ) , as n . Take any ( x , y ) A ˜ 1 . For each M > 0 , it was shown that
T ˜ n , k ( x , y ) I { x y > M } b k , d ( x ) b k , d ( y ) I { x y > M } , n .
Note that
G N ( | T ˜ n , k ( x , y ) | I { x y > M } ) G N ( | T ˜ n , k ( x , y ) | ) = G N ( | E log η ˜ n , k , x y log η ˜ n , k , y x | ) G N ( E | log η ˜ n , k , x y log η ˜ n , k , y x | ) E G N ( | log η ˜ n , k , x y log η ˜ n , k , y x | ) .
Due to (88) and (89) one can conclude that, for all n n ˜ 0 ,
R d × R d G N ( | T ˜ n , k ( x , y ) | I { x y > M } ) p ( x ) p ( y ) d x d y U 1 R d M p ε ( x , R ) p ( x ) d x + U ˜ 2 + κ R d m p ε ( x , R ) p ( x ) d x + A R d R ˜ N , 2 ( x ) p ( x ) d x + B = U 1 Q p , p ( ε , R ) + ( U ˜ 2 + κ ) T p , p ( ε , R ) + A K p , p ( 2 , N ) + B < .
Therefore, for ( x , y ) A ˜ 1 , the family T ˜ n , k ( x , y ) I { x y > M } n n ˜ 0 is uniformly integrable w.r.t. P X P X . Hence, by virtue of (84), for each M > 0 ,
D ( M ) T ˜ n , k ( x , y ) p ( x ) p ( y ) d x d y D ( M ) b k , d ( x ) b k , d ( y ) p ( x ) p ( y ) d x d y , n ,
where D ( M ) : = { ( x , y ) ∈ R d × R d : x y > M } . Now we turn to the case x y M . One has $\bigcap_{s=1}^{\infty}\{\|X_1-X_2\|\le \tfrac{1}{s}\}=\{X_1=X_2\}$ and P ( X 1 = X 2 ) = 0 , as X 1 and X 2 are independent and have the density p ( x ) w.r.t. the Lebesgue measure μ . Then, in view of the continuity of a probability measure, P X 1 X 2 M 0 , as M 0 . Taking into account that, for an integrable function h, C h d P 0 as P ( C ) 0 , we get
E ( log ζ n , k ( 1 ) log ζ n , k ( 2 ) I { X 1 X 2 M } ) 0 , M 0 ,
since E log ζ n , k ( 1 ) log ζ n , k ( 2 ) 1 2 E log 2 ζ n , k ( 1 ) + E log 2 ζ n , k ( 2 ) < (the proof is similar to that of the bound E log 2 ϕ m , l ( 1 ) < ). Thus, for any γ > 0 , one can find M 1 = M 1 ( γ ) > 0 such that, for all M ( 0 , M 1 ] and n n ˜ 0 ,
| R 2 d \ D ( M ) T ˜ n , k ( x , y ) p ( x ) p ( y ) d x d y | = | E log ζ n , k ( 1 ) log ζ n , k ( 2 ) I { X 1 X 2 M } | < γ 3 .
Also there exists M 2 = M 2 ( γ ) > 0 such that, for all M ( 0 , M 2 ] ,
| R 2 d \ D ( M ) b k , d ( x ) b k , d ( y ) p ( x ) p ( y ) d x d y | < γ 3 .
Take M ( γ ) : = min { M 1 ( γ ) , M 2 ( γ ) } . Due to (90) there is n ˜ 7 ( M ( γ ) , γ ) such that n max { n ˜ 0 , n ˜ 7 ( M ( γ ) , γ ) } entails the following inequality
| D ( M ) T ˜ n , k ( x , y ) p ( x ) p ( y ) d x d y D ( M ) b k , d ( x ) b k , d ( y ) p ( x ) p ( y ) d x d y | < γ 3 .
So, in view of (91)–(93), for any γ > 0 , there is M ( γ ) > 0 such that, for all n large enough, i.e., n max { n ˜ 0 , n ˜ 7 ( M ( γ ) , γ ) } , one has
| R d × R d T ˜ n , k ( x , y ) p ( x ) p ( y ) d x d y R d × R d b k , d ( x ) b k , d ( y ) p ( x ) p ( y ) d x d y | < γ .
By virtue of the formula
R d × R d b k , d ( x ) b k , d ( y ) p ( x ) p ( y ) d x d y = ψ ( k ) log V d R d ( log p ( x ) ) p ( x ) d x 2 ,
and taking into account (94) we deduce the limit relation, for n ,
E log ζ n , k ( 1 ) log ζ n , k ( 2 ) ψ ( k ) log V d R d ( log p ( x ) ) p ( x ) d x 2 .
Moreover, in view of (24) (see Step 5 of Theorem 1 proof), it follows that
E log ζ n , k ( 1 ) E log ζ n , k ( 2 ) ψ ( k ) log V d R d ( log p ( x ) ) p ( x ) d x 2 .
Therefore,
2 n 2 1 i < j n cov log ζ n , k ( i ) , log ζ n , k ( j ) = n 1 n cov ( log ζ n , k ( 1 ) , log ζ n , k ( 2 ) ) 0 , n .
Step 6. Here we complete the analysis of summands in (56). Reasoning as at Steps 1–3 shows that 1 n 2 i = 1 n var log ζ n , k ( i ) = 1 n var log ζ n , k ( 1 ) 0 since
var ( log ζ n , k ( i ) ) = var ( log ζ n , k ( 1 ) ) v k <
for each i N , as n . It remains to prove that 2 n 2 i , j = 1 n cov ( log ϕ m , l ( i ) , log ζ n , k ( j ) ) 0 , as n , m .
For i = 1 , , n , one has | cov log ϕ m , l ( i ) , log ζ n , k ( i ) | var ( log ϕ m , l ( 1 ) ) var ( log ζ n , k ( 1 ) ) 1 2 < for all n , m large enough. So, it suffices to show that
1 n 2 i , j = 1 , , n ; i j cov ( log ϕ m , l ( i ) , log ζ n , k ( j ) ) 0 , n , m .
For i , j = 1 , , n , i j , u , w 0 , x , y R d , let us introduce a conditional distribution function
P ϕ m , l ( i ) u , ζ n , k ( j ) w | X i = x , X j = y = P x Y ( l ) ( x , Y m ) r m ( u ) , y X ( k ) ( y , X n i , j { x } ) r n 1 ( w ) = P x Y ( l ) ( x , Y m ) r m ( u ) P y X ( k ) ( y , X n i , j { x } ) r n 1 ( w ) = 1 s 1 = 0 l 1 m s 1 ( W m , x ( u ) ) s 1 ( 1 W m , x ( u ) ) m s 1 · ( I x y > r n 1 ( w ) 1 s = 0 k 1 n 2 s V n 1 , y ( w ) s 1 V n 1 , y ( w ) n 2 s + I x y r n 1 ( w ) 1 s = 0 k 2 n 2 s V n 1 , y ( w ) s 1 V n 1 , y ( w ) n 2 s ) .
Here we used that { X n , Y m } is a collection of independent vectors. Now we combine the estimates obtained at Steps 4 and 5 of the proof of Theorem 2 to verify that, for i , j { 1 , , n } and i j , cov ( log ϕ m , l ( i ) , log ζ n , k ( j ) ) = cov ( log ϕ m , l ( 1 ) , log ζ n , k ( 2 ) ) → 0 as n , m → ∞ .
Thus, we have established that var ( D ^ n , m ( k , l ) ) 0 as n , m , hence (16) holds. The proof of Theorem 2 is complete.

6. Conclusions

The aim of this paper is to provide wide conditions ensuring the asymptotic unbiasedness and mean square consistency of the statistical estimates of the Kullback–Leibler divergence proposed in [31]. We do not impose restrictions on the smoothness of the densities under consideration and do not assume that the densities have bounded supports. Thus, in particular, one can apply our results to various mixtures of distributions, for instance, to mixtures of nondegenerate normal laws in R d (Corollary 4). As a byproduct, we relax the conditions of our recent analysis of the Kozachenko–Leonenko type estimators of the Shannon differential entropy [35] and use these conditions in estimating the cross-entropy as well. Observe that the integral functional K p , q appearing in Theorems 1–3 involves the function G N ( t ) , which is close to the function t when the parameter N is large enough. Thus, we impose an essentially less restrictive condition than the one requiring the function G ( t ) = t 1 + ν , for some ν > 0 , instead of G N ( t ) . Even for the latter choice of G, our results provide the first valid proof that does not appeal to the Fatou lemma (the long-standing problem of obtaining correct proofs was discussed in the Introduction). An interesting and hard problem for future research is to find the class of functions φ : R + → R + such that one can replace G N ( t ) in the expression of K p , q by G ( t ) = t φ ( t ) , where φ ( t ) → ∞ as t → ∞ , and keep the validity of the established theorems. Here one can see an analogy with the investigation of fluctuations of sums of random variables or of the Brownian motion by G. H. Hardy, H. D. Steinhaus, A. Ya. Khinchin, A. N. Kolmogorov, I. G. Petrovski, W. Feller and other researchers. The increasing precision in describing the upper and lower functions led to the law of the iterated logarithm and its generalizations. Another deep problem is to provide sharp conditions for the validity of the CLT for estimates of the Kullback–Leibler divergence.
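To make the discussion above more tangible, here is a minimal sketch (in Python) of a k-nearest-neighbor estimate of the Kullback–Leibler divergence in the spirit of [31]; the function name kl_knn_estimate, the particular digamma corrections and the toy samples below are our illustrative choices and are not claimed to coincide exactly with the statistic D ^ n , m ( k , l ) analyzed in Theorems 1–3.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def kl_knn_estimate(x, y, k=1, l=1):
    """Sketch of a k-NN estimate of D(P||Q) from samples x ~ P (n x d) and y ~ Q (m x d)."""
    n, d = x.shape
    m = y.shape[0]
    # rho_k(i): distance from X_i to its k-th nearest neighbor among the remaining X's
    # (query k+1 neighbors, because the closest point found is X_i itself)
    rho = cKDTree(x).query(x, k=k + 1)[0][:, k]
    # nu_l(i): distance from X_i to its l-th nearest neighbor among the Y's
    nu = cKDTree(y).query(x, k=l)[0]
    if l > 1:
        nu = nu[:, l - 1]
    return (d * np.mean(np.log(nu) - np.log(rho))
            + np.log(m / (n - 1)) + digamma(k) - digamma(l))

# Toy check: P = N(0,1), Q = N(1,1) in dimension one; the true divergence equals 1/2
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(5000, 1))
y = rng.normal(1.0, 1.0, size=(5000, 1))
print(kl_knn_estimate(x, y, k=3, l=3))
```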
Besides purely theoretical aspects, the estimates of entropy and related functionals have diverse applications. In [5], the estimates of the Kullback–Leibler divergence are applied to change-point detection in time series. That issue is important, e.g., in the analysis of stochastic financial models. Moreover, it is interesting to study the spatial variant of this problem. Namely, in [55,56] statistical estimates of entropy and scan-statistics (see, e.g., [57]) were employed for the identification of inhomogeneities of fiber materials. In [58], the Kullback–Leibler divergence estimators are used to identify multivariate spatial clusters in the Bernoulli model. A modification of the idea of the latter paper can also be applied to the analysis of fiber structures. Such structures in R d can be modeled by a spatial stochastic point process specifying the locations of the centers of fibers (segments). A certain law on the unit sphere of R d can be used to model their directions. The length of the fibers can be fixed or follow some distribution on R + . Since various scan domains could contain a random number of observations, the development of the present results will have to be combined with the theory of random sums of random variables; the latter theory (see, e.g., [59]) is essential in this case. Moreover, we intend to employ the studied estimators in feature selection theory, actively used in genome-wide association studies (GWAS), see, e.g., [16,17,22]. In this regard, statistical estimates of the mutual information have been proposed, see, e.g., [12]. We also note the important problem of analyzing the stability of constructing, by means of statistical estimates of the mutual information, a sub-collection of relevant (in a certain sense) factors determining a random response. The above-mentioned applications will be considered in separate publications, supplemented with computer simulations and illustrative graphs.

Author Contributions

Conceptualization, A.B. and D.D.; validation, A.B. and D.D.; writing—original draft preparation, A.B. and D.D.; writing—review and editing, A.B. and D.D.; supervision, A.B.; project administration, A.B.; funding acquisition, A.B. All authors have read and agreed to the published version of the manuscript.

Funding

The work of the first author is supported by the Russian Science Foundation under grant 14-21-00162 and performed at the Steklov Mathematical Institute of Russian Academy of Sciences.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors are grateful to Professor A. Tsybakov for useful discussions. We also thank the Reviewers for remarks and suggestions improving the exposition.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Proofs of Lemmas 1–3 are similar to the proofs of Lemmas 2.5, 3.1 and 3.2 in [35]. We provide them for the sake of completeness.
Proof of Lemma 1. 
(1) Note that log x y > e [ N 1 ] 1 if x y > e [ N ] and N N . Hence, for such x , y , one has ( log x y ) ν ( log x y ) ν 0 if ν ( 0 , ν 0 ] . If N N 0 then G N ( u ) G N 0 ( u ) for u e [ N 1 ] e [ N 0 1 ] . Thus, K p , q ( ν , N ) K p , q ( ν 0 , N 0 ) < for ν ( 0 , ν 0 ] and any integer N N 0 .
(2) Assume that Q p , q ( ε 1 , R 1 ) < . Consider Q p , q ( ε 1 , R ) where R > 0 . If 0 < R R 1 then, for each x R d , in accordance with the definition of M q one has M q ( x , R ) M q ( x , R 1 ) . Consequently, Q p , q ( ε 1 , R ) Q p , q ( ε 1 , R 1 ) < . Let now R > R 1 . One has
M q ( x , R ) max M q ( x , R 1 ) , sup R 1 < r R B ( x , R 1 ) q ( x ) d x + B ( x , r ) \ B ( x , R 1 ) q ( x ) d x μ ( B ( x , r ) )
max M q ( x , R 1 ) , M q ( x , R 1 ) + 1 μ ( B ( x , R 1 ) ) = M q ( x , R 1 ) + 1 μ ( B ( x , R 1 ) ) .
Therefore,
Q p , q ( ε 1 , R ) = R d ( M q ( x , R ) ) ε 1 p ( x ) d x R d M q ( x , R 1 ) + 1 R 1 d V d ε 1 p ( x ) d x max { 1 , 2 ε 1 1 } Q p , q ( ε 1 , R 1 ) + ( R 1 d V d ) ε 1 < .
Suppose now that Q p , q ( ε 1 , R ) < for some ε 1 > 0 and R > 0 . Then, for each ε ( 0 , ε 1 ] , the Lyapunov inequality leads to the estimate Q p , q ( ε , R ) ( Q p , q ( ε 1 , R ) ) ε ε 1 < .
(3) Let T p , q ( ε 2 , R 2 ) < . Take 0 < R R 2 . Then, for any x R d , according to the definition of m q we get 0 m q ( x , R 2 ) m q ( x , R ) . Hence T p , q ( ε 2 , R ) T p , q ( ε 2 , R 2 ) < . Consider R > R 2 . For any x R d and every a > 0 , the function I q ( x , r ) is continuous in r on ( 0 , a ] . Next fix an arbitrary x S ( q ) Λ ( q ) . We see that there exists lim r 0 + I q ( x , r ) = q ( x ) . For such x, set I q ( x , 0 ) : = q ( x ) . Thus, I q ( x , · ) is continuous on any segment [ 0 , a ] . Hence, one can find R ˜ 2 in [ 0 , R 2 ] such that m q ( x , R 2 ) = I q ( x , R ˜ 2 ) and there exists R 0 in [ 0 , R ] such that m q ( x , R ) = I q ( x , R 0 ) . If R 0 R 2 then m q ( x , R ) = m q ( x , R 2 ) (since m q ( x , R ) m q ( x , R 2 ) for R > R 2 and m q ( x , R ) = I q ( x , R 0 ) m q ( x , R 2 ) as R 0 [ 0 , R 2 ] ). Assume that R 0 ( R 2 , R ] . Obviously R 0 > 0 as R 2 > 0 . One has
m q ( x , R ) = I q ( x , R 0 ) = B ( x , R 2 ) q ( y ) d y + B ( x , R 0 ) \ B ( x , R 2 ) q ( y ) d y μ ( ( x , R 0 ) ) B ( x , R 2 ) q ( y ) d y μ ( B ( x , R 0 ) ) = μ ( B ( x , R 2 ) ) μ ( B ( x , R 0 ) ) I q ( x , R 2 ) μ ( B ( x , R 2 ) ) μ ( B ( x , R 0 ) ) m q ( x , R 2 )
= R 2 R 0 d m q ( x , R 2 ) R 2 R d m q ( x , R 2 ) .
Thus, in any case ( R 0 [ 0 , R 2 ] or R 0 ( R 2 , R ] ) one has m q ( x , R ) R 2 R d m q ( x , R 2 ) as R 2 < R . Taking into account that μ ( S ( q ) \ ( S ( q ) Λ ( q ) ) ) = 0 we deduce the inequality
T p , q ( ε 2 , R ) R R 2 ε 2 d T p , q ( ε 2 , R 2 ) < .
Assume now that T p , q ( ε 2 , R ) < for some ε 2 > 0 and R > 0 . Then, for any ε ( 0 , ε 2 ] , the Lyapunov inequality entails T p , q ( ε , R ) ( T p , q ( ε 2 , R ) ) ε ε 2 < . This completes the proof. □
Proof of Lemma 2. 
We begin with relation (1). Observe that if a function g is measurable and bounded on a finite interval ( a , b ] and ν is a finite measure on the Borel subsets of ( a , b ] , then ( a , b ] g ( x ) ν ( d x ) is finite. So, applying the integration-by-parts formula (see, e.g., [33], p. 245), for each a 0 , 1 e [ N ] , we get
a , 1 e [ N ] F ( u ) g N ( u ) d u = a , 1 e [ N ] F ( u ) d G N ( log u ) = G N ( log a ) F ( a ) + a , 1 e [ N ] G N ( log u ) d F ( u ) .
Assume now that 0 , 1 e [ N ] G N ( log u ) d F ( u ) < . Then by the monotone convergence theorem
lim a 0 + ( 0 , a ] G N ( log u ) d F ( u ) = 0 .
Given a > 0 the following lower bound is obvious
( 0 , a ] G N ( log u ) d F ( u ) G N ( log a ) ( 0 , a ] d F ( u ) = G N ( log a ) ( F ( a ) F ( 0 ) ) = G N ( log a ) F ( a ) 0 .
Therefore (A2) implies that
G N ( log a ) F ( a ) 0 , a 0 + .
By the Lebesgue monotone convergence theorem, letting a → 0 + in (A1) yields the desired relation (1) of the Lemma. Now we assume that
0 , 1 e [ N ] F ( u ) g N ( u ) d u < .
Hence, from the equality 0 , 1 e [ N ] F ( u ) g N ( u ) d u = 0 , 1 e [ N ] F ( u ) d ( G N ( log u ) ) we get lim b 0 + ( 0 , b ] F ( u ) d ( G N ( log u ) ) = 0 by the monotone convergence theorem. Therefore, for any c ( 0 , b ) , we come to the inequalities
( 0 , b ] F ( u ) d ( G N ( log u ) ) ( c , b ] F ( u ) d ( G N ( log u ) )
= F ( b ) G N ( log b ) + F ( c ) G N ( log c ) + ( c , b ] G N ( log u ) d F ( u )
F ( c ) G N ( log c ) F ( b ) G N ( log b ) + ( F ( b ) F ( c ) ) G N ( log b )
= F ( c ) G N ( log c ) 1 G N ( log b ) G N ( log c ) .
Let c = b 2 ( b 1 e [ N ] < 1 ). Then, for all positive b small enough,
1 G N ( log b ) G N ( log c ) = 1 G N ( log b ) G N ( 2 log b ) = 1 1 2 log [ N ] ( log b ) log [ N ] ( 2 log b ) 1 2 .
Thus ( 0 , b ] F ( u ) d ( G N ( log u ) ) 1 2 F ( b 2 ) G N ( log ( b 2 ) ) 0 , so F ( b 2 ) G N ( log b 2 ) 0 as b 0 . Consequently we come to (A3) taking a = b 2 . Then (A1) implies relation (1).
If one of the (nonnegative) integrals in (1) were infinite while the other one is finite, we would come to a contradiction. Thus, (1) is established. In quite the same manner one can verify relation (2); therefore, further details are omitted. □
Proof of Lemma 3. 
Take x S ( q ) Λ ( q ) and R > 0 . Suppose that m q ( x , R ) = 0 . Since the function I q ( x , r ) defined in (8) is continuous in ( x , r ) R d × ( 0 , ) , there exists R ˜ [ 0 , R ] ( R ˜ = R ˜ ( x , R ) ) such that m q ( x , R ) = I q ( x , R ˜ ) ( I q ( x , 0 ) : = lim r 0 + I q ( x , r ) = q ( x ) for any x Λ ( q ) by continuity). If R ˜ = 0 then m q ( x , r ) = q ( x ) > 0 as x S ( q ) Λ ( q ) . Hence we have to deal with R ˜ ( 0 , R ] . If I q ( x , R ˜ ) = 0 then B ( x , r ) q ( y ) d y = 0 for any 0 < r R ˜ . Thus, (30) ensures that q ( x ) = 0 . However, x S ( q ) Λ ( q ) . So m q ( x , R ) > 0 for x S ( q ) Λ ( q ) . Thus, S ( q ) Λ ( q ) D q ( R ) : = { x S ( q ) : m q ( x , R ) > 0 } . It remains to note that S ( q ) \ Λ ( q ) R d \ Λ ( q ) and μ ( R d \ Λ ( q ) ) = 0 . Therefore μ ( S ( q ) \ D q ( R ) ) = 0 . □
Proof of Lemma 4. 
We will check that, for given N N and τ > 0 , there exist a : = a ( τ ) 0 and b : = b ( N , τ ) 0 such that, for any c 0 ,
G N ( τ c ) a G N ( c ) + b .
For c = 0 the statement is obviously true. Let c > 0 . It is easily seen that log [ N ] ( τ c ) / log [ N ] ( c ) → 1 as c → ∞ . Hence one can find c 0 ( N , τ ) such that, for all c ≥ c 0 ( N , τ ) , the inequality log [ N ] ( τ c ) / log [ N ] ( c ) ≤ 2 is valid. Consequently, for c ≥ c 0 ( N , τ ) ,
G N ( τ c ) G N ( c ) = τ c log [ N ] ( τ c ) c log [ N ] ( c ) 2 τ : = a ( τ ) .
For all 0 c c 0 ( N , τ ) we write G N ( τ c ) G N ( τ c 0 ( N , τ ) ) : = b ( N , τ ) . Therefore, for any c 0 , we come to (A5). Thus, for any ν > 0 and x , y R d , x y , one has
G N ( | log ( x y d ) | ν ) = G N ( d ν | log ( x y ) | ν ) a ( d ν ) G N ( | log ( x y ) | ν ) + b ( N , d ν ) .
Proof of Lemma 6. 
For $t\in[0,e_{[N-1]}]$, the function $G_N(t)\equiv 0$ is convex. We show that $G_N$ is convex on $(e_{[N-1]},\infty)$. Consider $t>e_{[N-1]}$. Here and below a product over the empty set is set equal to 1 and a sum over the empty set to 0. Then, for $N\in\mathbb{N}$,
$$G_N'(t)=\log_{[N]}(t)+\prod_{j=1}^{N-1}\frac{1}{\log_{[j]}(t)}.$$
Obviously, $\Bigl(\frac{1}{\log_{[k]}(t)}\Bigr)'=-\frac{1}{t\,\log_{[k]}^{2}(t)}\prod_{s=1}^{k-1}\frac{1}{\log_{[s]}(t)}$, $k\in\mathbb{N}$. Thus, for $t>e_{[N-1]}$, we get
$$G_N''(t)=\frac{1}{t}\prod_{j=1}^{N-1}\frac{1}{\log_{[j]}(t)}\Bigl(1-\sum_{k=1}^{N-1}\prod_{s=1}^{k}\frac{1}{\log_{[s]}(t)}\Bigr).$$
For $N=1$ and $t>0$, we have $G_1''(t)=\frac{1}{t}>0$. Take now $N>1$. Clearly, for $t>e_{[N-1]}$, one has $\frac{1}{t}\prod_{j=1}^{N-1}\frac{1}{\log_{[j]}(t)}>0$ because $\log_{[j]}(t)>\log_{[j]}(e_{[N-1]})=e_{[N-1-j]}\ge 1>0$ when $1\le j\le N-1$. Observe also that
$$\sum_{k=1}^{N-1}\prod_{s=1}^{k}\frac{1}{\log_{[s]}(t)}<\sum_{k=1}^{N-1}\prod_{s=1}^{k}\frac{1}{e_{[N-1-s]}}\le\sum_{k=1}^{N-1}\frac{1}{e_{[N-2]}}=\frac{N-1}{e_{[N-2]}}\le 1.$$
The last inequality is established by induction in $N$. Thus, in view of (A6), we have proved that, for all $t>e_{[N-1]}$ and $N\in\mathbb{N}$, the inequality $G_N''(t)>0$ holds. Hence, the function $G_N(t)$ is (strictly) convex on $(e_{[N-1]},\infty)$.
Let $h:[a,\infty)\to\mathbb{R}$ be a continuous nondecreasing function. If the restrictions of $h$ to $[a,b]$ and $(b,\infty)$ (where $a<b$) are convex functions, then, in general, it is not true that $h$ is convex on $[a,\infty)$. However, we can show that $G_N$ is convex on $[0,\infty)$. Note that $G_N$ is convex on $[e_{[N-1]},\infty)$ since it is convex on $(e_{[N-1]},\infty)$ and continuous on $[e_{[N-1]},\infty)$. Take now any $z\in[0,e_{[N-1]}]$, $y\in(e_{[N-1]},\infty)$ and $s\in[0,1]$. Then
$$G_N(sz+(1-s)y)\le G_N(se_{[N-1]}+(1-s)y)\le sG_N(e_{[N-1]})+(1-s)G_N(y)=(1-s)G_N(y)=sG_N(z)+(1-s)G_N(y),$$
as $G_N(z)=0$. Thus, for each $N\in\mathbb{N}$, the function $G_N(\cdot)$ is convex on $\mathbb{R}_+$. □
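As a quick numerical companion to Lemma 6, one may differentiate G N symbolically for a small N. The snippet below is only a sketch, assuming (as in the computation above) that G N ( t ) = t log [ N ] ( t ) for t > e [ N 1 ] ; it treats the case N = 2, where e [ 1 ] = e.

```python
import sympy as sp

t = sp.symbols("t", positive=True)
G2 = t * sp.log(sp.log(t))                  # G_2(t) = t * log(log t) on (e, infinity)
G2_second = sp.simplify(sp.diff(G2, t, 2))  # second derivative
print(G2_second)                            # expected to simplify to (log(t) - 1)/(t*log(t)**2)
print(G2_second.subs(t, sp.exp(2)))         # strictly positive value at t = e**2
```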
Proof of Corollary 3. 
The proof (i.e., checking the conditions of both Theorem 1 and 2) is quite similar to the proof of Corollary 2.11 in [35]. □
Proof of Corollary 4. 
Take f ( x ) = i = 1 k γ i f i ( x ) , where f i ( x ) is a density, γ i > 0 , i = 1 , , k , i = 1 k γ i = 1 , x R d . Then according to (9) and (10), for any x R d , r > 0 and R > 0 , one has I f ( x , r ) = i = 1 k γ i I f i ( x , r ) , M f ( x , R ) i = 1 k γ i M f i ( x , R ) , m f ( x , R ) i = 1 k γ i m f i ( x , R ) . We will apply these relations for f = p and f = q . It is well-known that, for any ε > 0 , c i 0 , i = 1 , , k , k N , the following inequality is valid ( i = 1 k c i ) ε max { 1 , k ε 1 } i = 1 k c i ε . Moreover, this inequality is obviously satisfied for all ε R as for ε 0 it holds ( i = 1 k c i ) ε i = 1 k c i ε . Therefore
Q p , q ( ε , R ) max { 1 , J ε 1 } i = 1 I j = 1 J a i b j ε Q p i , q j ( ε , R ) < ,
T p , q ( ε , R ) i = 1 I j = 1 J a i b j ε T p i , q j ( ε , R ) < .
The same reasoning leads to bounds Q p , p ( ε , R ) < and T p , p ( ε , R ) < . Now in view of (13), for ν > 0 , t > 0 and N N , we can write K p , q ( ν , N , t ) = i = 1 I j = 1 J a i b j K p i , q j ( ν , N , t ) . In this manner we can also represent K p , p ( ν , N , t ) . □
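For the reader's convenience, here is a short derivation (a standard convexity argument) of the elementary power inequality used in the proof of Corollary 4 above. For $\varepsilon\ge 1$, the Jensen inequality applied to the convex function $t\mapsto t^{\varepsilon}$ gives
$$\Bigl(\sum_{i=1}^{k} c_i\Bigr)^{\varepsilon}=k^{\varepsilon}\Bigl(\frac{1}{k}\sum_{i=1}^{k} c_i\Bigr)^{\varepsilon}\le k^{\varepsilon}\cdot\frac{1}{k}\sum_{i=1}^{k} c_i^{\varepsilon}=k^{\varepsilon-1}\sum_{i=1}^{k} c_i^{\varepsilon},$$
while for $0<\varepsilon\le 1$ the subadditivity of $t\mapsto t^{\varepsilon}$ yields $(\sum_{i=1}^{k} c_i)^{\varepsilon}\le\sum_{i=1}^{k} c_i^{\varepsilon}$; combining the two cases produces the factor $\max\{1,k^{\varepsilon-1}\}$.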
Lemma A1.
Let probability measures P , Q and a σ-finite measure μ (e.g., the Lebesgue measure) be defined on ( R d , B ( R d ) ) . Assume that P and Q have densities p ( x ) and q ( x ) , x R d , w.r.t. the measure μ. Then the following statements are true.
(1) 
P ≪ Q if and only if P ( S ( p ) \ S ( q ) ) = 0 ;
(2) 
formula (2) holds.
Proof of Lemma A1. 
(1) Let P Q . Obviously Q ( R d \ S ( q ) ) = 0 . Therefore P ( R d \ S ( q ) ) = 0 . Since S ( p ) \ S ( q ) R d \ S ( q ) , one has P ( S ( p ) \ S ( q ) ) = 0 .
Now let P ( S ( p ) \ S ( q ) ) = 0 . Assume that P is not absolutely continuous w.r.t. Q . Then there exists a set A such that Q ( A ) = 0 and P ( A ) > 0 . Consequently μ ( A ) > 0 as P μ . We can write A = A 1 A 2 , where A 1 : = A ( R d \ S ( q ) ) , A 2 : = A S ( q ) . We get Q ( A ) = Q ( A 1 ) + Q ( A 2 ) as A 1 A 2 = . Note that Q ( A 1 ) = 0 since q 0 on A 1 , so Q ( A 2 ) = 0 . Relation Q ( A 2 ) = A 2 q ( x ) μ ( d x ) yields μ ( A 2 ) = 0 ( q > 0 on A 2 and μ is a σ -finite measure). One has P ( A 2 ) = 0 because P μ . Thus, P ( A ) = P ( A 1 ) + P ( A 2 ) = P ( A 1 ) > 0 . Clearly, A 1 R d \ S ( q ) . Hence P ( S ( p ) \ S ( q ) ) = P ( S ( p ) ( R d \ S ( q ) ) ) P ( S ( p ) A 1 ) = P ( A 1 ) > 0 . We come to the contradiction. Therefore P Q .
In such a way we have proved that if P μ and Q μ , the relation P Q holds if and only if P ( S ( p ) \ S ( q ) ) = 0 . Obviously we can take as p and q any versions of d P d μ and d Q d μ .
(2) Suppose that P Q . We know that P , Q are probability measures, Q μ where μ is a σ -finite measure. Then, in view of [33], statement (b) of Lemma on p. 273, the following equality d P d Q = d P d μ / d Q d μ holds Q -a.s. and consequently P -a.s. too (on the set B : = { x : d Q d μ = 0 } having Q ( B ) = 0 a density d P d Q can be taken equal to zero). So, d P d Q ( x ) = p ( x ) q ( x ) for P -almost all x R d . One has
R d p ( x ) log p ( x ) q ( x ) μ ( d x ) = R d log p ( x ) q ( x ) d P = R d log d P d Q d P ,
where all integrals converge or diverge simultaneously. Indeed, if h is a measurable function with values in [ −∞ , ∞ ] then A h ( x ) ν ( d x ) : = 0 whenever ν ( A ) = 0 ( ν being a finite or a σ -finite measure). We also employed [33], statement (a) of the Lemma on p. 273, when passing from integration w.r.t. μ to integration w.r.t. P .
Now assume that P is not absolutely continuous w.r.t. Q , i.e., P ( S ( p ) \ S ( q ) ) > 0 in view of part (1) of the present Lemma. As usual, for any measurable B R d , B 0 μ ( d x ) = 0 . Then
R d p ( x ) log p ( x ) q ( x ) μ ( d x ) = S ( p ) \ S ( q ) p ( x ) log p ( x ) q ( x ) μ ( d x ) + S ( p ) S ( q ) p ( x ) log p ( x ) q ( x ) μ ( d x ) .
Evidently
$$\int_{S(p)\setminus S(q)} p(x)\log\frac{p(x)}{q(x)}\,\mu(dx)=\int_{S(p)\setminus S(q)}\log\frac{p(x)}{q(x)}\,P(dx)=\infty\cdot P(S(p)\setminus S(q))=\infty$$
as P ( S ( p ) \ S ( q ) ) > 0 . Since $\log t\le t-1$ for $t>0$, we have, for all $x\in S(p)\cap S(q)$, $-\log\frac{p(x)}{q(x)}=\log\frac{q(x)}{p(x)}\le\frac{q(x)}{p(x)}-1$. Thus, $\int_{S(p)\cap S(q)}p(x)\log\frac{p(x)}{q(x)}\,\mu(dx)\ge\int_{S(p)\cap S(q)}p(x)\bigl(1-\frac{q(x)}{p(x)}\bigr)\mu(dx)=\int_{S(p)\cap S(q)}p(x)\,\mu(dx)-\int_{S(p)\cap S(q)}q(x)\,\mu(dx)=P(S(p)\cap S(q))-Q(S(p)\cap S(q))\ge 0-1=-1$. Consequently, $\int_{\mathbb{R}^d}p(x)\log\frac{p(x)}{q(x)}\,\mu(dx)=\infty$. The proof is complete. □
Remark A1.
Note that formula (2) can give an infinite value of D ( P | | Q ) also when P ≪ Q . It is enough to take $p(x)=\frac{1}{\pi(1+x^{2})}$ and $q(x)=\frac{1}{\sqrt{2\pi}}\exp\{-\frac{x^{2}}{2}\}$, $x\in\mathbb{R}$.
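Indeed, for this pair of densities a short computation (a sketch; we use the standard fact that the Cauchy law has finite differential entropy) shows where the infinite value comes from:
$$D(P\|Q)=\int_{\mathbb{R}}p(x)\log p(x)\,dx+\frac{1}{2}\log(2\pi)+\frac{1}{2}\int_{\mathbb{R}}x^{2}p(x)\,dx=\infty,$$
since the first integral is finite, whereas the Cauchy density has no finite second moment, so the last integral diverges; at the same time P ≪ Q because both densities are strictly positive on the whole real line.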

References

1. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
2. Moulin, P.; Veeravalli, V.V. Statistical Inference for Engineers and Data Scientists; Cambridge University Press: Cambridge, UK, 2019.
3. Pardo, L. New developments in statistical information theory based on entropy and divergence measures. Entropy 2019, 21, 391.
4. Ji, S.; Zhang, Z.; Ying, S.; Wang, L.; Zhao, X.; Gao, Y. Kullback–Leibler divergence metric learning. IEEE Trans. Cybern. 2020, 1–12.
5. Noh, Y.K.; Sugiyama, M.; Liu, S.; du Plessis, M.C.; Park, F.C.; Lee, D.D. Bias reduction and metric learning for nearest-neighbor estimation of Kullback–Leibler divergence. Neural Comput. 2018, 30, 1930–1960.
6. Claici, S.; Yurochkin, M.; Ghosh, S.; Solomon, J. Model Fusion with Kullback–Leibler Divergence. In Proceedings of the 37th International Conference on Machine Learning, Online, 12–18 July 2020; Daumé, H., III, Singh, A., Eds.; PMLR: Brookline, MA, USA, 2020; Volume 119, pp. 2038–2047.
7. Póczos, B.; Xiong, L.; Schneider, J. Nonparametric Divergence Estimation with Applications to Machine Learning on Distributions. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, Barcelona, Spain, 14–17 July 2011; AUAI Press: Arlington, VA, USA, 2011; pp. 599–608.
8. Cui, S.; Luo, C. Feature-based non-parametric estimation of Kullback–Leibler divergence for SAR image change detection. Remote Sens. Lett. 2016, 11, 1102–1111.
9. Deledalle, C.-A. Estimation of Kullback–Leibler losses for noisy recovery problems within the exponential family. Electron. J. Stat. 2017, 11, 3141–3164.
10. Yu, X.-P.; Chen, S.-X.; Peng, M.-L. Application of partial least squares algorithm based on Kullback–Leibler divergence in intrusion detection. In Proceedings of the International Conference on Computer Science and Technology (CST2016), Shenzhen, China, 8–10 January 2016; Cai, N., Ed.; World Scientific: Singapore, 2017; pp. 256–263.
11. Li, J.; Cheng, K.; Wang, S.; Morstatter, F.; Trevino, R.P.; Tang, J.; Liu, H. Feature Selection: A Data Perspective. ACM Comput. Surv. 2017, 50, 1–45.
12. Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238.
13. Vergara, J.R.; Estévez, P.A. A review of feature selection methods based on mutual information. Neural Comput. Appl. 2014, 24, 175–186.
14. Granero-Belinchón, C.; Roux, S.G.; Garnier, N.B. Kullback–Leibler divergence measure of intermittency: Application to turbulence. Phys. Rev. E 2018, 97, 013107.
15. Charzyńska, A.; Gambin, A. Improvement of the k-NN entropy estimator with applications in systems biology. Entropy 2016, 18, 13.
16. Wang, M.; Jiang, J.; Yan, Z.; Alberts, I.; Ge, J.; Zhang, H.; Zuo, C.; Yu, J.; Rominger, A.; Shi, K.; et al. Individual brain metabolic connectome indicator based on Kullback–Leibler Divergence Similarity Estimation predicts progression from mild cognitive impairment to Alzheimer’s dementia. Eur. J. Nucl. Med. Mol. Imaging 2020, 47, 2753–2764.
17. Zhong, J.; Liu, R.; Chen, P. Identifying critical state of complex diseases by single-sample Kullback–Leibler divergence. BMC Genom. 2020, 21, 87.
18. Li, J.; Shang, P. Time irreversibility of financial time series based on higher moments and multiscale Kullback–Leibler divergence. Phys. A Stat. Mech. Appl. 2018, 502, 248–255.
19. Beraha, M.; Betelli, A.M.; Papini, M.; Tirinzoni, A.; Restelli, M. Feature selection via mutual information: New theoretical insights. arXiv 2019, arXiv:1907.07384v1.
20. Carrara, N.; Ernst, J. On the estimation of mutual information. Proceedings 2019, 33, 31.
21. Lord, W.M.; Sun, J.; Bollt, E.M. Geometric k-nearest neighbor estimation of entropy and mutual information. Chaos Interdiscip. J. Nonlinear Sci. 2018, 28, 033114.
22. Moon, K.R.; Sricharan, K.; Hero, A.O., III. Ensemble estimation of generalized mutual information with applications to Genomics. arXiv 2019, arXiv:1701.08083v2.
23. Suzuki, J. Estimation of Mutual Information; Springer: Singapore, 2021.
24. Sason, I.; Verdú, S. f-divergence inequalities. IEEE Trans. Inf. Theory 2016, 62, 5973–6006.
25. Moon, K.R.; Sricharan, K.; Greenewald, K.; Hero, A.O., III. Ensemble estimation of information divergence. Entropy 2018, 20, 560.
26. Rubenstein, P.K.; Bousquet, O.; Djolonga, J.; Riquelme, C.; Tolstikhin, I. Practical and Consistent Estimation of f-Divergences. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; Advances in Neural Information Processing Systems; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates Inc.: Red Hook, NY, USA, 2019; Volume 32, pp. 4070–4080.
27. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423.
28. Kozachenko, L.F.; Leonenko, N.N. Sample estimate of the entropy of a random vector. Probl. Inf. Transm. 1987, 23, 9–16.
29. Kraskov, A.; Stögbauer, H.; Grassberger, P. Estimating mutual information. Phys. Rev. E 2004, 69, 066138.
30. Leonenko, N.N.; Pronzato, L.; Savani, V. A class of Rényi information estimators for multidimensional densities. Ann. Stat. 2010, 36, 2153–2182.
31. Wang, Q.; Kulkarni, S.R.; Verdú, S. Divergence estimation for multidimensional densities via k-nearest-neighbor distances. IEEE Trans. Inf. Theory 2009, 55, 2392–2405.
32. Pál, D.; Póczos, B.; Szepesvári, C. Estimation of Rényi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs. In Proceedings of the 23rd International Conference on Neural Information Processing Systems (NIPS 2010), Vancouver, BC, Canada, 6–9 December 2010; Advances in Neural Information Processing Systems; Lafferty, J., Williams, C., Shawe-Taylor, J., Zemel, R., Culotta, A., Eds.; Curran Associates Inc.: Red Hook, NY, USA, 2010; Volume 23, pp. 1849–1857.
33. Shiryaev, A.N. Probability—1, 3rd ed.; Springer: New York, NY, USA, 2016.
34. Loève, M. Probability Theory, 4th ed.; Springer: New York, NY, USA, 1977.
35. Bulinski, A.; Dimitrov, D. Statistical estimation of the Shannon entropy. Acta Math. Sin. Ser. 2019, 35, 17–46.
36. Biau, G.; Devroye, L. Lectures on the Nearest Neighbor Method; Springer: Cham, Switzerland, 2015.
37. Bulinski, A.; Kozhevin, A. Statistical estimation of conditional Shannon entropy. ESAIM Probab. Stat. 2019, 23, 350–386.
38. Coelho, F.; Braga, A.P.; Verleysen, M. A mutual information estimator for continuous and discrete variables applied to feature selection and classification problems. Int. J. Comput. Intell. Syst. 2016, 9, 726–733.
39. Delattre, S.; Fournier, N. On the Kozachenko–Leonenko entropy estimator. J. Stat. Plan. Inference 2017, 185, 69–93.
40. Berrett, T.B.; Samworth, R.J. Efficient two-sample functional estimation and the super-oracle phenomenon. arXiv 2019, arXiv:1904.09347.
41. Penrose, M.D.; Yukich, J.E. Limit theory for point processes in manifolds. Ann. Appl. Probab. 2013, 6, 2160–2211.
42. Tsybakov, A.B.; Van der Meulen, E.C. Root-n consistent estimators of entropy for densities with unbounded support. Scand. J. Stat. 1996, 23, 75–83.
43. Singh, S.; Póczos, B. Analysis of k-nearest neighbor distances with application to entropy estimation. arXiv 2016, arXiv:1603.08578v2.
44. Ryu, J.J.; Ganguly, S.; Kim, Y.-H.; Noh, Y.-K.; Lee, D.D. Nearest neighbor density functional estimation from inverse Laplace transform. arXiv 2020, arXiv:1805.08342v3.
45. Gao, S.; Steeg, G.V.; Galstyan, A. Efficient Estimation of Mutual Information for Strongly Dependent Variables. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, San Diego, CA, USA, 9–12 May 2015; Lebanon, G., Vishwanathan, S.V.N., Eds.; PMLR: Brookline, MA, USA, 2015; Volume 38, pp. 277–286.
46. Berrett, T.B.; Samworth, R.J.; Yuan, M. Efficient multivariate entropy estimation via k-nearest neighbour distances. Ann. Stat. 2019, 47, 288–318.
47. Goria, M.N.; Leonenko, N.N.; Mergel, V.V.; Novi Inverardi, P.L. A new class of random vector entropy estimators and its applications in testing statistical hypotheses. J. Nonparametr. Stat. 2005, 17, 277–297.
48. Evans, D. A computationally efficient estimator for mutual information. Proc. R. Soc. A Math. Phys. Eng. Sci. 2008, 464, 1203–1215.
49. Yeh, J. Real Analysis: Theory of Measure and Integration, 3rd ed.; World Scientific: Singapore, 2014.
50. Evans, D.; Jones, A.J.; Schmidt, W.M. Asymptotic moments of near-neighbour distance distributions. Proc. R. Soc. A Math. Phys. Eng. Sci. 2002, 458, 2839–2849.
51. Bouguila, N.; Wentao, F. Mixture Models and Applications; Springer: Cham, Switzerland, 2020.
52. Borkar, V.S. Probability Theory. An Advanced Course; Springer: New York, NY, USA, 1995.
53. Kallenberg, O. Foundations of Modern Probability; Springer: New York, NY, USA, 1997.
54. Billingsley, P. Convergence of Probability Measures, 2nd ed.; Wiley & Sons: New York, NY, USA, 1999.
55. Alonso Ruiz, P.; Spodarev, E. Entropy-based inhomogeneity detection in fiber materials. Methodol. Comput. Appl. Probab. 2018, 20, 1223–1239.
56. Dresvyanskiy, D.; Karaseva, T.; Makogin, V.; Mitrofanov, S.; Redenbach, C.; Spodarev, E. Detecting anomalies in fibre systems using 3-dimensional image data. Stat. Comput. 2020, 30, 817–837.
57. Glaz, J.; Naus, J.; Wallenstein, S. Scan Statistics; Springer: New York, NY, USA, 2009.
58. Walther, G. Optimal and fast detection of spatial clusters with scan statistics. Ann. Stat. 2010, 38, 1010–1033.
59. Gnedenko, B.V.; Korolev, V.Yu. Random Summation: Limit Theorems and Applications; CRC Press: Boca Raton, FL, USA, 1996.