Article

Finite-Sample Bounds on the Accuracy of Plug-In Estimators of Fisher Information

by Wei Cao, Alex Dytso, Michael Fauß and H. Vincent Poor
1 National Key Lab of Science and Technology on Communications, University of Electronic Science and Technology of China, Chengdu 611731, China
2 Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, NJ 07102, USA
3 Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ 08544, USA
* Author to whom correspondence should be addressed.
Entropy 2021, 23(5), 545; https://doi.org/10.3390/e23050545
Submission received: 15 March 2021 / Revised: 14 April 2021 / Accepted: 21 April 2021 / Published: 28 April 2021
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract:
Finite-sample bounds on the accuracy of Bhattacharya’s plug-in estimator for Fisher information are derived. These bounds are further improved by introducing a clipping step that allows for better control over the score function. This leads to superior upper bounds on the rates of convergence, albeit under slightly different regularity conditions. The performance bounds on both estimators are evaluated for the practically relevant case of a random variable contaminated by Gaussian noise. Moreover, using Brown’s identity, two corresponding estimators of the minimum mean-square error are proposed.

1. Introduction

This work considers the problem of estimating the Fisher information for the location of a univariate probability density function (PDF) $f$ based on $n$ random samples $Y_1,\ldots,Y_n$ independently drawn from $f$. To clarify, the Fisher information of a differentiable density function $f$ is given by
I(f) = \int_{\{f(t)>0\}} \frac{\left(f'(t)\right)^2}{f(t)}\,dt, \qquad (1)
where $f'$ is the derivative of $f$. For the remainder of the paper, it is assumed that $\{t : f(t)>0\} = \mathbb{R}$, but an extension to the general case is not difficult. The paper considers plug-in estimators based on kernel density estimates of $f$. That is, the Fisher information is estimated by plugging a kernel density estimate of $f$ into the right-hand side of (1).
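As a simple illustration of (1) (this worked example is not part of the original text), the Gaussian location family can be evaluated in closed form:

```latex
% For f(t) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(t-\mu)^2/(2\sigma^2)}, the score is
% f'(t)/f(t) = -(t-\mu)/\sigma^2, so
\begin{align*}
  I(f) = \int_{\mathbb{R}} \frac{\left(f'(t)\right)^2}{f(t)}\,dt
       = \int_{\mathbb{R}} \frac{(t-\mu)^2}{\sigma^4}\, f(t)\,dt
       = \frac{1}{\sigma^2}.
\end{align*}
```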
Estimation of the Fisher information in (1) via a plug-in estimator based on kernel density estimates was first considered by Bhattacharya in [1]. Bhattacharya showed that, under mild conditions on f, the plug-in estimator is consistent for a large class of kernels, and he provided bounds on its accuracy in the large (asymptotic) sample regime. These bounds were later revised and improved by Dmitriev and Tarasenko in [2]. However, to the best of our knowledge, no finite-sample regime bounds on the accuracy of Bhattacharya’s estimator can be found in the literature. The paper aims at closing this gap.
Bounds on the accuracy of plug-in estimators rely on bounds on the accuracy of the underlying density estimators. For kernel-based density estimators, such bounds have received considerable attention in the literature. For example, Schuster [3] showed that, under mild regularity conditions, the estimation error for higher-order derivatives can be controlled by the estimation error for the corresponding cumulative distribution function (CDF). The interested reader is referred to [4,5,6,7,8] and the references therein. In this paper, as a preliminary result for the analysis of Bhattacharya’s estimator, the bounds in [3] are further tightened by replacing some suboptimal constants with the optimal ones.
A problem that arises in the performance analysis of plug-in estimators for Fisher information is that the score function of the estimated density, that is, the ratio of the derivative of the PDF and the PDF itself, is hard to bound, especially in the tails. Bhattacharya worked around this problem by truncating the integration range in (1), thus avoiding evaluation of the estimated score function on these critical regions. However, in order for the estimator to stay consistent, this truncation has to be done rather aggressively so that the error introduced by ignoring the tails can outweigh the approximation error introduced by the density estimate. In this paper, we propose a simple remedy that allows for a much less aggressive truncation of the integration range and, in turn, for significantly tighter bounds on the approximation error. Namely, we propose clipping the score function whenever it exceeds a suitably chosen upper bound. In the vast majority of cases, the corresponding clipped estimates of the Fisher information are identical to their non-clipped counterparts, meaning that the clipping has a negligible influence on the estimation accuracy. However, the knowledge that extreme values of the score would have been clipped, had they occurred, allows for much-improved performance guarantees.
It should be explicitly stated that this paper does not address the question of how best to estimate Fisher information. Although this question is highly interesting and relevant, it is far beyond the scope of this work. In addition, it is not the aim of the paper to compare the plug-in estimator to alternative estimators for the Fisher information or to claim that it provides superior results. A variety of well-motivated parametric and nonparametric Fisher information estimators have been proposed in the literature; see, for example, [9,10,11] and the references therein. However, comparing and contrasting these estimators in a fair manner is not straightforward and arguably constitutes a research question in its own right. Finally, the problem of obtaining estimator-independent bounds on the sample complexity of Fisher information estimation falls under the umbrella of estimation of nonlinear functionals; see, for example, [12]. Most of the commonly used information measures, such as entropy, relative entropy, and mutual information, are nonlinear functionals, and their estimation has recently received considerable attention; the interested reader is referred to [13,14,15,16,17] and the references therein.
Despite its limited scope, we are convinced that the work presented in this paper is useful in a wider context. First, from a theoretical point of view, it strengthens some classic results in nonparametric estimation and, as explained above, provides bounds for the finite-sample regime, thus filling a gap in the literature. Second, from a practical perspective, the Fisher information typically provides useful bounds or limits on the estimation error (e.g., the well-known Cramér–Rao lower bound), but is not in itself the quantity of interest; an exception is the case of estimating a random signal in additive Gaussian noise, where the minimum mean square error (MMSE) and other relevant quantities can be expressed in terms of the Fisher information. The problem of estimating Fisher information also arises in image processing, model selection, experimental design, and many more areas. Applications of our results include, for example, evaluating the Cramér–Rao bound and, for the case of a random variable in additive Gaussian noise, addressing the power allocation problem [18]. These connections will be discussed in more detail in Section 4. Most often, however, Fisher information plays the role of side information, and its estimation does not warrant investing large computational resources. This prevents the use of sophisticated estimators, which require solving non-trivial optimization problems. In contrast, kernel density estimates are relatively easy to compute and have been widely used in nonparametric statistics so that efficient implementations in software or even hardware [19] are readily available. Hence, for the foreseeable future, plug-in estimators are bound to remain a common and often the only viable option for estimating Fisher information in practice.
The paper is organized as follows: Section 2 revisits Bhattacharya’s estimator. In particular, Theorem 1 provides explicit and tighter non-asymptotic bounds on its convergence rate, improving the results in [1,2]. Furthermore, Theorem 2 provides an alternative bound under the additional assumption that the density function is upper bounded within any given interval. The explicit non-asymptotic results enable us to see that the sample complexity of Bhattacharya’s estimator is considerable and that the potentially unbounded score function is a critical bottleneck for tighter bounds. Section 3 proposes a “harmless” modification of Bhattacharya’s estimator, namely, a clipping of the estimated score function, which is shown to be sufficient to remedy its large sample complexity. In particular, Theorem 3 shows that the clipped estimator has significantly better bounds on rates of convergence, albeit with slightly different assumptions on the PDF. Section 4 evaluates the convergence rates of the two estimators for the practically relevant case of a random variable contaminated by additive Gaussian noise. Moreover, using Brown’s identity, which relates the Fisher information and the MMSE, consistent estimators for the MMSE are proposed and their rates of convergence are evaluated in Proposition 1. Section 5 concludes the paper.

Notation

The expected value and variance of a random variable $X$ are denoted by $\mathbb{E}[X]$ and $\operatorname{Var}(X)$, respectively. The gamma function is denoted by $\Gamma(\cdot)$. Estimators of a PDF $f$ based on $n$ samples are denoted by $f_n$. No notational distinction is made between an estimator, which is a random variable, and its realizations (estimates), which are deterministic. However, the difference will be clear from the context or will be highlighted explicitly otherwise. The $n$th derivative of a function $F: \mathbb{R}\to\mathbb{R}$ is denoted by $F^{(n)}$; the first-order derivative is also denoted by $F'$ to improve readability.

2. Bhattacharya’s Estimator

In this section, we revisit the asymptotically consistent estimator proposed by Bhattacharya in [1] and produce explicit and non-asymptotic bounds on its accuracy.
Bhattacharya’s estimator is given by
I_n = \int_{-k_n}^{k_n} \frac{\left(f_n'(t)\right)^2}{f_n(t)}\,dt, \qquad (2)
where $k_n \ge 0$ determines the integration interval as a function of the sample size $n$ and the unknown functions $f$ and $f'$ are replaced by their kernel estimates, that is,
f_n(t) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{a_0}\,K\!\left(\frac{t-Y_i}{a_0}\right), \qquad (3)
f_n'(t) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{a_1^2}\,K'\!\left(\frac{t-Y_i}{a_1}\right). \qquad (4)
Here, $a_0, a_1 > 0$ are bandwidth parameters, and $K: \mathbb{R}\to\mathbb{R}$ denotes the kernel, which is assumed to satisfy certain regularity conditions that will be discussed later in this section.
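A minimal numerical sketch of (2)-(4) with a Gaussian kernel is given below. It is an illustration only (the bandwidths, truncation interval, and integration grid are arbitrary choices), not the authors' reference MATLAB implementation [22].

```python
# Minimal sketch of the plug-in estimator (2) with the kernel estimates (3) and (4),
# using a standard Gaussian kernel K and a simple Riemann sum over [-k_n, k_n].
import numpy as np

def kernel_density(t, samples, a0):
    """f_n(t) as in (3): average of Gaussian kernels with bandwidth a0."""
    u = (t[:, None] - samples[None, :]) / a0
    return np.mean(np.exp(-0.5 * u**2), axis=1) / (a0 * np.sqrt(2 * np.pi))

def kernel_density_derivative(t, samples, a1):
    """f_n'(t) as in (4), using K'(u) = -u K(u) for the Gaussian kernel."""
    u = (t[:, None] - samples[None, :]) / a1
    return np.mean(-u * np.exp(-0.5 * u**2), axis=1) / (a1**2 * np.sqrt(2 * np.pi))

def fisher_plugin(samples, a0, a1, k_n, grid_size=2001):
    """Bhattacharya's estimate I_n in (2), integrated numerically on [-k_n, k_n]."""
    t = np.linspace(-k_n, k_n, grid_size)
    f = kernel_density(t, samples, a0)
    fp = kernel_density_derivative(t, samples, a1)
    return np.sum(fp**2 / f) * (t[1] - t[0])

# Toy example: Y = X + Z with X, Z independent standard normal, so I(f_Y) = 1/2.
rng = np.random.default_rng(0)
y = rng.standard_normal(10_000) + rng.standard_normal(10_000)
print(fisher_plugin(y, a0=0.3, a1=0.3, k_n=4.0))
```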

2.1. Estimating a Density and Its Derivatives

In order to analyze plug-in estimators, it is necessary to obtain rates of convergence for $f_n$ and $f_n'$, that is, the kernel estimators of the density and its derivative. The following result, which is largely based on the proof by Schuster in [3], provides such rates. The proof in [3] makes use of the Dvoretzky–Kiefer–Wolfowitz (DKW) inequality for the empirical CDF. The next lemma refines the result in [3] by using the best possible constant for the DKW inequality shown in [20].
Lemma 1.
Let $r\in\{0,1\}$ and
v_r = \int \left|K^{(r+1)}(t)\right| dt, \qquad (5)
\delta_{r,a_r} = \sup_{t\in\mathbb{R}} \left|\mathbb{E}\!\left[f_n^{(r)}(t)\right] - f^{(r)}(t)\right|. \qquad (6)
Then, for any $\epsilon > \delta_{r,a_r}$ and any $n \ge 1$, the following bound holds:
\mathbb{P}\!\left(\sup_{t\in\mathbb{R}}\left|f_n^{(r)}(t) - f^{(r)}(t)\right| > \epsilon\right) \le 2\,e^{-2na_r^{2r+2}\left(\epsilon-\delta_{r,a_r}\right)^2/v_r^2}. \qquad (7)
Proof. 
See Appendix A. □
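The right-hand side of (7) is elementary to evaluate numerically; the sketch below does so for illustrative (hypothetical) parameter values.

```python
# Evaluate the tail bound (7) of Lemma 1 for given n, r, bandwidth a_r, accuracy eps,
# bias term delta_r and kernel constant v_r. The numbers below are placeholders.
import math

def lemma1_bound(n, r, a_r, eps, delta_r, v_r):
    """Right-hand side of (7): 2*exp(-2*n*a_r^(2r+2)*(eps - delta_r)^2 / v_r^2)."""
    assert eps > delta_r, "Lemma 1 requires eps > delta_{r,a_r}"
    return 2.0 * math.exp(-2.0 * n * a_r ** (2 * r + 2) * (eps - delta_r) ** 2 / v_r ** 2)

# Example: r = 0 with a Gaussian kernel, for which v_0 = sqrt(2/pi) (see Lemma 2).
print(lemma1_bound(n=10_000, r=0, a_r=0.3, eps=0.05, delta_r=0.02, v_r=math.sqrt(2 / math.pi)))
```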

2.2. Analysis of Bhattacharya’s Estimator

The following theorem is a non-asymptotic refinement of the result obtained by Bhattacharya in Theorem 3 of [1] and Dmitriev and Tarasenko in Theorem 1 of [2].
Theorem 1.
Assume that there exists a function $\phi: \mathbb{R}\to\mathbb{R}$ such that
\sup_{|t|\le x} \frac{1}{f(t)} \le \phi(x), \quad x\in\mathbb{R}. \qquad (8)
Then, provided that
\sup_{|t|\le k_n} \left|f_n^{(r)}(t) - f^{(r)}(t)\right| \le \epsilon_r, \quad r\in\{0,1\}, \qquad (9)
and
\epsilon_0\,\phi(k_n) < 1, \qquad (10)
the following bound holds:
\left|I(f) - I_n\right| \le \frac{4\epsilon_1 k_n \rho_{\max}(k_n) + 2\epsilon_1^2 k_n \phi(k_n) + \epsilon_0\phi(k_n)\,I(f)}{1-\epsilon_0\phi(k_n)} + c(k_n), \qquad (11)
where
\rho_{\max}(k_n) = \sup_{|t|\le k_n} \left|\frac{f'(t)}{f(t)}\right|, \qquad (12)
c(k_n) = I(f) - \int_{-k_n}^{k_n} \frac{\left(f'(t)\right)^2}{f(t)}\,dt. \qquad (13)
Proof. 
See Appendix B. □
The bound in (11) is an improvement of the original bound in [1,2], which contains terms of the form $\epsilon_0\phi^4(k_n)$.
Note that $\phi(k_n)$ in (8) can be rapidly increasing with $k_n$. For example, as will be shown later, $\phi(k_n)$ increases super-exponentially with $k_n$ for a random variable contaminated by Gaussian noise. This implies that, while Bhattacharya's estimator converges, the rate of convergence guaranteed by the bound in (11) is extremely slow. A modified bound is proposed in the subsequent theorem.
Theorem 2.
Assume that $f(t)$ is bounded on the interval $t\in[-k_n,k_n]$, i.e.,
\sup_{|t|\le k_n} f(t) \le f_0, \qquad (14)
for some $f_0\in\mathbb{R}$. If the assumptions in (8), (9), and (10) hold, then
\left|I(f) - I_n\right| \le \Bigl[\epsilon_1\bigl(4 + d_f(k_n) + d_{f_n}(k_n)\bigr) + \epsilon_0\bigl(2 + d_{f_n}(k_n)\bigr)\rho_{\max}(k_n)\Bigr]\,\psi(\epsilon_0,k_n) + c(k_n), \qquad (15)
where $\rho_{\max}$ and $c$ are given by (12) and (13), respectively,
\psi(\epsilon_0,k_n) = \max\left\{\log\left(f_0+\epsilon_0\right),\ \log\frac{\phi(k_n)}{1-\epsilon_0\phi(k_n)}\right\}, \qquad (16)
and $d_g(k_n)$ denotes the number of zeros of the derivative of the function $g$ on the interval $[-k_n,k_n]$, i.e.,
d_g(k_n) = \left|\left\{t\in[-k_n,k_n] : g'(t) = 0\right\}\right|. \qquad (17)
Proof. 
See Appendix C. □
Remark 1.
Note that $\psi$ in (15) is on the order of $\log(\phi(k_n))$, which typically increases much more slowly with $k_n$ than $\phi$ in (11). As a result, the bound in Theorem 2 can lead to a better bound on the convergence rate than that in Theorem 1, given appropriate upper bounds on $d_f$ and $d_{f_n}$. Since Gaussian blurring of a univariate density function never creates new maxima, we have that $d_{f_Y} \le d_{f_X}$, which is a constant. However, to the best of our knowledge, the only known upper bound on $d_{f_n}$ is given by $d_{f_n} \le n$ [21] (Theorem 2), which is not useful in practice. Despite this drawback, we include Theorem 2 for the sake of completeness and in the hope that tighter bounds on $d_{f_n}$ might be established in the future.
The main problem in the convergence analysis of the estimator in (2) is that $1/f_n(t)$ is only bounded if $f(t) > \epsilon_0$. For distributions with sub-Gaussian tails, this implies that the interval $[-k_n,k_n]$, on which this is guaranteed to be the case, grows sub-logarithmically (compare Theorem 4), causing the required number of samples to grow super-exponentially. In the next section, we propose an estimator that has better guaranteed rates of convergence.

3. The Clipped Bhattacharya Estimator

In order to remedy the slow guaranteed convergence rates of Bhattacharya's estimator, we dispense with the tail assumption in (8), but introduce the new assumption that the unknown true score function $\rho(t) = f'(t)/f(t)$ is bounded (in absolute value) by a known function $\bar\rho$. This allows us to clip $f_n'(t)/f_n(t)$ and, in turn, $1/f_n(t)$ without affecting the consistency of the estimator.
Theorem 3.
Assume that there exists a function $\bar\rho: \mathbb{R}\to\mathbb{R}$ such that
|\rho(t)| \le |\bar\rho(t)|, \quad t\in\mathbb{R}, \qquad (18)
and let
I_n^c = \int_{-k_n}^{k_n} \min\left\{|\rho_n(t)|,\,|\bar\rho(t)|\right\}\left|f_n'(t)\right| dt, \qquad (19)
where
\rho_n(t) = \frac{f_n'(t)}{f_n(t)}. \qquad (20)
Under the assumptions in (9), it holds that
\left|I(f) - I_n^c\right| \le \max\left\{4\epsilon_1\Phi_1(k_n) + 2\epsilon_0\Phi_2(k_n) + c(k_n),\ 3\epsilon_1\Phi_{\max,1}(k_n) + \epsilon_0\Phi_{\max,2}(k_n)\right\} \qquad (21)
\le 4\epsilon_1\Phi_{\max,1}(k_n) + 2\epsilon_0\Phi_{\max,2}(k_n) + c(k_n), \qquad (22)
where $c(k_n)$ is defined in (13) and
\Phi_m(x) = \int_{-x}^{x} |\rho(t)|^m\,dt, \qquad (23)
\Phi_{\max,m}(x) = \int_{-x}^{x} |\bar\rho(t)|^m\,dt. \qquad (24)
In addition, if $f(t)$ is bounded as in (14), then
\Phi_m(k_n) \le \min\left\{\left(2+d_f\right)\bar\rho^{\,m-1}(k_n)\,\psi(0,k_n),\ \Phi_{\max,m}(k_n)\right\}, \qquad (25)
where $\psi$ and $d_f$ are defined in (16) and (17), respectively.
Proof. 
See Appendix D. □
For the upper-bound function $\bar\rho(t)$ in assumption (18), in practice, we can set $\bar\rho(k_n) = \rho_{\max}(k_n)$ if the latter is available. Although $\rho_{\max}(k_n)$ also increases with $k_n$, it usually increases much more slowly than $\phi(k_n)$. For example, as shown later, $\rho_{\max}(k_n)$ is linear in $k_n$ in the Gaussian noise case. As a result, better bounds on the convergence rate can be shown for the clipped estimator.
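The clipping changes only one line relative to the plain plug-in estimator. The sketch below is an illustration (not the reference implementation in [22]); it reuses the kernel estimates from the sketch in Section 2, and the envelope rho_bar is a user-supplied function.

```python
# Sketch of the clipped estimator I_n^c in (19)-(20): the estimated score f_n'/f_n is
# clipped at the envelope |rho_bar| before being integrated against |f_n'|.
import numpy as np

def fisher_plugin_clipped(samples, a0, a1, k_n, rho_bar, grid_size=2001):
    """Clipped estimate: integrate min(|rho_n|, |rho_bar|) * |f_n'| over [-k_n, k_n]."""
    t = np.linspace(-k_n, k_n, grid_size)
    f = kernel_density(t, samples, a0)                # from the sketch in Section 2
    fp = kernel_density_derivative(t, samples, a1)    # from the sketch in Section 2
    clipped_score = np.minimum(np.abs(fp / f), np.abs(rho_bar(t)))
    return np.sum(clipped_score * np.abs(fp)) * (t[1] - t[0])

# Example envelope for the Gaussian-noise setting of Section 4 (cf. (30)), here with
# snr = 1 and Var(X) = 1; these values are illustrative assumptions.
rho_bar = lambda t: np.sqrt(3.0) + 3.0 * np.abs(t)
```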

4. Estimation of the Fisher Information of a Random Variable in Gaussian Noise

This section evaluates the results of Section 2 and Section 3 for the important special case of a random variable contaminated by additive Gaussian noise. To this end, we let $f_Y$ denote the PDF of a random variable
Y = \sqrt{\mathsf{snr}}\,X + Z, \qquad (26)
where $\mathsf{snr} > 0$ is a signal-to-noise-ratio parameter, $X$ is an arbitrary random variable, $Z$ is a standard Gaussian random variable, and $X$ and $Z$ are independent. We are interested in estimating the Fisher information of $f_Y$. We only make the very mild assumption that $X$ has a finite second moment, but otherwise, it is allowed to be an arbitrary random variable. We further assume that $\mathsf{snr}$ is known and that Gaussian kernels are used in the density estimators, i.e.,
K(t) = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{t^2}{2}}. \qquad (27)
The following lemma provides explicit expressions for the quantities appearing in Section 2 and Section 3 that are needed to evaluate the error bounds for the Bhattacharya and the clipped estimator.
Lemma 2.
Let K be as in (27). Then,
\delta_{r,a_r} \le a_r\cdot\begin{cases}\dfrac{1}{\pi\sqrt{e}}, & r=0,\\[4pt] \dfrac{2e^{-1}+1}{\pi}, & r=1,\end{cases} \qquad (28)
v_r = \begin{cases}\sqrt{\dfrac{2}{\pi}}, & r=0,\\[4pt] 2\sqrt{\dfrac{2}{e\pi}}, & r=1,\end{cases} \qquad (29)
\rho_{\max}(k_n) \le \sqrt{3\,\mathsf{snr}\operatorname{Var}(X)} + 3k_n, \qquad (30)
I(f_Y) \le 1, \qquad (31)
\phi(t) = \sqrt{2\pi}\,e^{t^2+\mathsf{snr}\,\mathbb{E}[X^2]}. \qquad (32)
Proof. 
See Appendix F. □
We now bound $c(k_n)$. To this end, we need the notion of sub-Gaussian random variables: A random variable $X$ is said to be $\alpha$-sub-Gaussian if
\mathbb{E}\!\left[e^{tX}\right] \le e^{\frac{\alpha^2t^2}{2}}, \quad \forall\, t\in\mathbb{R}. \qquad (33)
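As a quick worked example (added here for illustration, not part of the original text), a centered Gaussian random variable meets this definition with equality:

```latex
% For X ~ N(0, \alpha^2), the moment generating function is
\begin{align*}
  \mathbb{E}\!\left[e^{tX}\right]
  = \int_{\mathbb{R}} e^{tx}\,\frac{1}{\sqrt{2\pi}\,\alpha}\,e^{-\frac{x^2}{2\alpha^2}}\,dx
  = e^{\frac{\alpha^2 t^2}{2}}, \qquad t\in\mathbb{R},
\end{align*}
% so X is \alpha-sub-Gaussian in the sense of (33).
```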
Lemma 3.
Suppose that $\mathbb{E}[X^2] < \infty$. Then,
c(k_n) \le \inf_{v>0}\ 2\,\Gamma^{\frac{1}{1+v}}\!\left(v+\tfrac{1}{2}\right)\left(\frac{1}{\pi}\right)^{\frac{1}{2(1+v)}}\left(\frac{\mathsf{snr}\,\mathbb{E}[X^2]+1}{k_n^2}\right)^{\frac{v}{1+v}}. \qquad (34)
In addition, if $|X|$ is $\alpha$-sub-Gaussian, then
c(k_n) \le \inf_{v>0}\ 2\,\Gamma^{\frac{1}{1+v}}\!\left(v+\tfrac{1}{2}\right)\left(\frac{1}{\pi}\right)^{\frac{1}{2(1+v)}}\left(2\,e^{\frac{\alpha^2\mathsf{snr}-k_n^2}{2}}\right)^{\frac{v}{1+v}}. \qquad (35)
Proof. 
See Appendix G. □

4.1. Convergence of Bhattacharya’s Estimator

By combining the results in Lemma 1, Theorem 1, Lemma 2, and Lemma 3, we have the following theorem.
Theorem 4.
Let $K$ be as in (27). Choose the parameters of Bhattacharya's estimator as follows: $a_0 = n^{-w_0}$, where $w_0\in\left(0,\tfrac{1}{4}\right)$; $a_1 = n^{-w_1}$, where $w_1\in\left(0,\tfrac{1}{6}\right)$; and $k_n = \sqrt{u\log(n)}$, where $u\in\left(0,\min(w_0,w_1)\right)$. Then, for $n^{w_0-u} > c_5$,
\mathbb{P}\!\left(\left|I_n - I(f_Y)\right| \ge \varepsilon_n\right) \le 2e^{-c_1n^{1-4w_0}} + 2e^{-c_2n^{1-6w_1}}, \qquad (36)
where
\varepsilon_n \le \frac{n^{-w_1}\sqrt{u\log(n)}\left(4c_3 + 12\sqrt{u\log(n)} + 2c_5n^{u-w_1}\right) + c_5n^{u-w_0}}{1-c_5n^{u-w_0}} + \frac{c_4}{\sqrt{u\log(n)}}, \qquad (37)
and where the constants are given by
c_1 = \pi\left(1-\frac{1}{\pi\sqrt{e}}\right)^2, \qquad (38)
c_2 = \frac{e\pi}{4}\left(1-\frac{2e^{-1}+1}{\pi}\right)^2, \qquad (39)
c_3 = \sqrt{3\,\mathsf{snr}\operatorname{Var}(X)}, \qquad (40)
c_4 = 2\,\Gamma^{\frac{1}{2}}\!\left(\tfrac{3}{2}\right)\sqrt{\mathsf{snr}\,\mathbb{E}[X^2]+1}\left(\frac{1}{\pi}\right)^{\frac{1}{4}}, \qquad (41)
c_5 = \sqrt{2\pi}\,e^{\mathsf{snr}\,\mathbb{E}[X^2]}. \qquad (42)
In addition, if $|X|$ is $\alpha$-sub-Gaussian, then
\varepsilon_n \le \frac{n^{-w_1}\sqrt{u\log(n)}\left(4c_3 + 12\sqrt{u\log(n)} + 2c_5n^{u-w_1}\right) + c_5n^{u-w_0}}{1-c_5n^{u-w_0}} + c_6\,n^{-\frac{u}{4}}, \qquad (43)
where
c_6 = 2^{\frac{3}{2}}\,\Gamma^{\frac{1}{2}}\!\left(\tfrac{3}{2}\right)e^{\frac{\alpha^2\mathsf{snr}}{4}}\left(\frac{1}{\pi}\right)^{\frac{1}{4}}. \qquad (44)
Proof. 
See Appendix H. □
Note that the parameters $k_n$, $a_0$, and $a_1$ are chosen so as to guarantee the convergence of $I_n$ to $I(f_Y)$ with probability 1. For the details, please refer to the proof in Appendix H.
The parameters $u$, $w_0$, and $w_1$ in the above theorem are auxiliary variables that couple the bandwidths of the kernel density estimators in (3) and (4) with the integration range of the Fisher information estimator in (2). Choosing them according to Theorem 4 results in a trade-off between precision, $\varepsilon_n$, and confidence, i.e., the probability of the estimation error exceeding $\varepsilon_n$. On the one hand, small values of $u$ and large values of $w_0$ and $w_1$ result in better precision (i.e., small $\varepsilon_n$) at the cost of a lower confidence (i.e., a larger probability of exceeding $\varepsilon_n$). On the other hand, large values of $u$ and small values of $w_0$ and $w_1$ improve the confidence but deteriorate the precision. In turn, this also affects the convergence rates, meaning that faster convergence of the precision can be achieved at the expense of a slower convergence of the confidence and vice versa.
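This trade-off can be explored numerically. The sketch below evaluates the precision bound (37) and the confidence bound (36) with the constants from (38)-(42); the values of snr, Var(X), E[X^2], and the parameter choices are illustrative assumptions, not numbers taken from the paper.

```python
# Evaluate the bounds of Theorem 4 for given (n, w0, w1, u). Returns (eps_n, P_err).
import math

def theorem4_bounds(n, w0, w1, u, snr=1.0, var_x=1.0, ex2=1.0):
    c1 = math.pi * (1 - 1 / (math.pi * math.sqrt(math.e))) ** 2
    c2 = (math.e * math.pi / 4) * (1 - (2 / math.e + 1) / math.pi) ** 2
    c3 = math.sqrt(3 * snr * var_x)
    c4 = 2 * math.sqrt(math.gamma(1.5)) * math.sqrt(snr * ex2 + 1) * (1 / math.pi) ** 0.25
    c5 = math.sqrt(2 * math.pi) * math.exp(snr * ex2)
    if c5 * n ** (u - w0) >= 1:          # Theorem 4 requires n^(w0 - u) > c5
        return math.inf, 1.0
    kn = math.sqrt(u * math.log(n))      # truncation parameter k_n = sqrt(u log n)
    num = n ** -w1 * kn * (4 * c3 + 12 * kn + 2 * c5 * n ** (u - w1)) + c5 * n ** (u - w0)
    eps_n = num / (1 - c5 * n ** (u - w0)) + c4 / kn
    p_err = 2 * math.exp(-c1 * n ** (1 - 4 * w0)) + 2 * math.exp(-c2 * n ** (1 - 6 * w1))
    return eps_n, p_err

# Larger w0, w1 tighten eps_n but loosen the confidence bound, and vice versa.
print(theorem4_bounds(n=10**12, w0=0.2, w1=0.15, u=0.1))
```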

4.2. Convergence of the Clipped Estimator

From the evaluation of Bhattacharya's estimator in Theorem 4, it is apparent that the bottleneck term is the truncation parameter $k_n = \sqrt{u\log(n)}$, which results in slow precision decay of the order $\varepsilon_n = O\!\left(\tfrac{1}{\sqrt{u\log(n)}}\right)$. Next, it is shown that the clipped estimator results in improved precision over Bhattacharya's estimator. Specifically, the precision will be shown to decay polynomially in $n$ instead of logarithmically. Another benefit of the clipped estimator is that its error analysis holds for every $n\ge 1$.
By utilizing the results in Lemma 1, Lemma 2, and Lemma 3, we specialize the result in Theorem 3 to the Gaussian noise case.
Theorem 5.
Let $K$ be as in (27). Choose the parameters of the clipped estimator as follows: $a_0 = n^{-w_0}$, where $w_0\in\left(0,\tfrac{1}{4}\right)$; $a_1 = n^{-w_1}$, where $w_1\in\left(0,\tfrac{1}{6}\right)$; and $k_n = n^{u}$, where $u\in\left(0,\min\left(\tfrac{w_0}{3},\tfrac{w_1}{2}\right)\right)$. Then, for $n\ge 1$,
\mathbb{P}\!\left(\left|I_n^c - I(f_Y)\right| \ge \varepsilon_n\right) \le 2e^{-c_1n^{1-4w_0}} + 2e^{-c_2n^{1-6w_1}}, \qquad (45)
where
\varepsilon_n \le 4n^{3u-w_0}\left(c_3^2n^{-2u} + 3\right) + 4n^{2u-w_1}\left(2c_3n^{-u} + 3\right) + c_4n^{-u}, \qquad (46)
and the constants $c_1$ to $c_4$ are as in Theorem 4. In addition, if $|X|$ is $\alpha$-sub-Gaussian, then
\varepsilon_n \le 4n^{3u-w_0}\left(c_3^2n^{-2u} + 3\right) + 4n^{2u-w_1}\left(2c_3n^{-u} + 3\right) + c_6\,e^{-\frac{n^{2u}}{4}}, \qquad (47)
where $c_6$ is given by (44).
Proof. 
See Appendix I. □
Again, the parameters $k_n$, $a_0$, and $a_1$ are chosen to guarantee the consistency of the estimator. For further details, please refer to Appendix I.

4.3. Applications to the Estimations of the MMSE

As discussed in the introduction, the Fisher information is often merely a proxy for the actual quantity of interest. One accuracy measure that is typically of interest is the MMSE, which is defined as
\mathrm{mmse}(X|Y) = \mathbb{E}\!\left[\left(X - \mathbb{E}[X|Y]\right)^2\right]. \qquad (48)
In additive Gaussian noise, the MMSE can not only be bounded by the Fisher information, but both are related via Brown’s identity:
I(f_Y) = 1 - \mathsf{snr}\cdot\mathrm{mmse}(X|Y). \qquad (49)
Based on this relation, we propose the following estimators for the MMSE:
\mathrm{mmse}_n(X,\mathsf{snr}) = \frac{1-I_n}{\mathsf{snr}} \qquad (50)
and
\mathrm{mmse}_n^c(X,\mathsf{snr}) = \frac{1-I_n^c}{\mathsf{snr}}. \qquad (51)
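In code, the MMSE estimators (50) and (51) are a one-line transformation of any Fisher information estimate via Brown's identity (49). The sketch below is illustrative and builds on the earlier sketches in this article.

```python
# MMSE estimate from a Fisher information estimate via Brown's identity (49):
# I(f_Y) = 1 - snr * mmse, hence mmse = (1 - I) / snr.
def mmse_from_fisher(fisher_estimate, snr):
    return (1.0 - fisher_estimate) / snr

# Example: plug in either I_n (fisher_plugin) or I_n^c (fisher_plugin_clipped).
# mmse_hat = mmse_from_fisher(fisher_plugin_clipped(y, 0.3, 0.3, 4.0, rho_bar), snr=1.0)
```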
The results for the estimators of Fisher information in Theorem 4 and Theorem 5 can be immediately extended to the MMSE estimators as follows.
Proposition 1.
Let $K$ be as in (27), and let $w_0$, $w_1$, and $n$ be such that they satisfy the conditions in Theorem 4. It then holds that
\mathbb{P}\!\left(\left|\mathrm{mmse}_n(X,\mathsf{snr}) - \mathrm{mmse}(X,\mathsf{snr})\right| \ge \frac{\varepsilon_n}{\mathsf{snr}}\right) \le 2e^{-c_1n^{1-4w_0}} + 2e^{-c_2n^{1-6w_1}}, \qquad (52)
where $\varepsilon_n$, $c_1$, and $c_2$ are given in Theorem 4.
Proposition 2.
Let $K$ be as in (27), and let $w_0$ and $w_1$ be such that they satisfy the conditions in Theorem 5. It then holds that
\mathbb{P}\!\left(\left|\mathrm{mmse}_n^c(X,\mathsf{snr}) - \mathrm{mmse}(X,\mathsf{snr})\right| \ge \frac{\varepsilon_n}{\mathsf{snr}}\right) \le 2e^{-c_1n^{1-4w_0}} + 2e^{-c_2n^{1-6w_1}}, \qquad (53)
where $\varepsilon_n$, $c_1$, and $c_2$ are given in Theorem 5.

4.4. Sample Complexity

Finally, we demonstrate the difference in the bounds on the convergence rates between Bhattacharya’s estimator and its clipped version by comparing the sample complexity of the two estimators, that is, the required number of samples to guarantee a given accuracy with a given confidence. MATLAB implementations of both estimators, as well as the code used to generate the figures below, can be found in [22].
To this end, we consider the simple example of estimating the Fisher information of a Gaussian random variable in additive Gaussian noise. More precisely, we assume that $X$ and $Z$ in (26) are independent and identically distributed according to the standard normal distribution $\mathcal{N}(0,1)$, and that $\mathsf{snr} = 1$. This trivially implies that $X$ is $\alpha$-sub-Gaussian with $\alpha = 1$. In order to make the comparison as fair as possible, the parameters of the kernel estimators, $a_0$, $a_1$, and $k_n$, are not chosen according to Theorem 4 or Theorem 5, but are calculated by numerically minimizing the required number of samples; see [22] for details.
Let $P_{\mathrm{err}} = \mathbb{P}\left(\left|I_n - I(f_Y)\right| \ge \varepsilon_n\right)$. The left-hand plot in Figure 1 shows the corresponding bounds on the sample complexities of the two estimators with $P_{\mathrm{err}} = 0.2$ and $\varepsilon_n$ varying from 0.1 to 0.9. Note that the results with larger $\varepsilon_n$ are not shown because $I(f_Y) \le 1$, as shown in Lemma 2. Moreover, the right-hand plot in Figure 1 shows the sample complexities for $\varepsilon_n = 0.5$ with $P_{\mathrm{err}}$ varying from 0.1 to 0.9. By inspection, the clipped estimator reduces the sample complexity by several orders of magnitude; note that the y-axes are scaled logarithmically. As discussed before, this does not imply that the clipped estimator is more accurate in general. However, it does imply that the clipped estimator provides significantly better worst-case performance, i.e., it requires significantly fewer samples to guarantee a certain precision or confidence. Finally, note that this improvement comes at a low cost in terms of complexity and regularity assumptions. The complexity of both algorithms is almost identical, with the clipped estimator only requiring an additional evaluation of $\bar\rho$. The regularity conditions are identical for bounded density functions, and slightly stronger for the clipped estimator for unbounded density functions.
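A sample-complexity comparison of this kind amounts to inverting the bounds numerically: find the smallest n for which a given precision/confidence pair is met. The minimal sketch below uses the theorem4_bounds helper from the earlier sketch and illustrative parameter choices, not the numerically optimized ones of [22].

```python
# Smallest n (up to n_max) for which a bound n -> (eps_n, P_err) meets the targets,
# found by a simple doubling search over n.
def sample_complexity(bound_fn, eps_target, perr_target, n_max=10**80):
    n = 2
    while n <= n_max:
        eps_n, p_err = bound_fn(n)
        if eps_n <= eps_target and p_err <= perr_target:
            return n
        n *= 2
    return None  # targets not reachable below n_max

n_req = sample_complexity(lambda n: theorem4_bounds(n, w0=0.2, w1=0.15, u=0.1),
                          eps_target=0.5, perr_target=0.2)
print(n_req)
```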

5. Conclusions

This work focused on the estimation of the Fisher information for the location of a univariate random variable using plug-in estimators based on estimators of the PDF and its derivative. Two estimators of the Fisher information were considered. The first estimator is the estimator due to Bhattacharya, for which new, sharper convergence results were shown. The paper also proposed a second estimator, termed a clipped estimator, which provides better bounds on the convergence rates. The accuracy bounds on both estimators were specialized to the practically relevant case of a random variable contaminated by additive Gaussian noise. Moreover, using special properties of the Gaussian noise case, two estimators for the MMSE were proposed, and their convergence rates were analyzed. This was done by using Brown's identity, which connects the Fisher information and the MMSE. Finally, using a numerical example, it was demonstrated that the proposed clipped estimator can achieve a significantly lower sample complexity at little or no additional cost.

Author Contributions

Investigation, Writing—original draft: W.C.; Investigation, Writing - original draft: A.D.; Investigation, Writing—original draft: M.F.; Supervision, Resources, Funding acquisition, Writing—review & editing: H.V.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the U.S. National Science Foundation under Grants CCF-0939370 and CCF-1908308, and in part by the German Research Foundation (DFG) under Grant 424522268.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data sharing not applicable.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Appendix A. A Proof of Lemma 1

Our starting point is the following bound due to [3] (p. 1188):
\sup_{t\in\mathbb{R}}\left|\mathbb{E}\!\left[f_n^{(r)}(t)\right] - f_n^{(r)}(t)\right| \le \frac{v_r}{a_r^{r+1}}\,\sup_{t\in\mathbb{R}}\left|F_n(t) - F(t)\right|, \qquad (A1)
where $F$ is the CDF of $f$, $F_n$ is the empirical CDF, and $v_r$ is defined in (5). Now, let $\delta_{r,a_r}$ be as in (6), and consider the following sequence of bounds:
\mathbb{P}\!\left(\sup_{t\in\mathbb{R}}\left|f_n^{(r)}(t) - f^{(r)}(t)\right| > \epsilon\right) \le \mathbb{P}\!\left(\sup_{t\in\mathbb{R}}\left|f_n^{(r)}(t) - \mathbb{E}\!\left[f_n^{(r)}(t)\right]\right| > \epsilon - \delta_{r,a_r}\right) \qquad (A2)
\le \mathbb{P}\!\left(\sup_{t\in\mathbb{R}}\left|F_n(t) - F(t)\right| > \frac{a_r^{r+1}\left(\epsilon-\delta_{r,a_r}\right)}{v_r}\right) \qquad (A3)
\le 2\,e^{-2na_r^{2r+2}\left(\epsilon-\delta_{r,a_r}\right)^2/v_r^2}, \qquad (A4)
where (A2) follows by using the triangle inequality; (A3) follows by using the bound in (A1); and (A4) follows by using the sharp DKW inequality [20]:
\mathbb{P}\!\left(\sup_{t\in\mathbb{R}}\left|F_n(t) - F(t)\right| > \epsilon\right) \le 2\,e^{-2n\epsilon^2}. \qquad (A5)
This concludes the proof.

Appendix B. A Proof of Theorem 1

First, using the triangle inequality, we have that
\left|I(f) - I_n\right| \le \int_{-k_n}^{k_n}\left|\frac{\left(f_n'(t)\right)^2}{f_n(t)} - \frac{\left(f'(t)\right)^2}{f(t)}\right| dt + c(k_n). \qquad (A6)
Next, we bound the first term in (A6):
\int_{-k_n}^{k_n}\left|\frac{\left(f_n'(t)\right)^2}{f_n(t)} - \frac{\left(f'(t)\right)^2}{f(t)}\right| dt
= \int_{-k_n}^{k_n}\left|\frac{f(t)\left(f_n'(t)\right)^2 - f_n(t)\left(f'(t)\right)^2}{f_n(t)f(t)}\right| dt \qquad (A7)
\le \int_{-k_n}^{k_n}\left|\frac{f(t)\left(f_n'(t)\right)^2 - f(t)\left(f'(t)\right)^2}{f_n(t)f(t)}\right| dt + \int_{-k_n}^{k_n}\left|\frac{f(t)\left(f'(t)\right)^2 - f_n(t)\left(f'(t)\right)^2}{f_n(t)f(t)}\right| dt \qquad (A8)
= \int_{-k_n}^{k_n}\frac{\left|\left(f_n'(t)\right)^2 - \left(f'(t)\right)^2\right|}{f_n(t)}\,dt + \int_{-k_n}^{k_n}\left|f_n(t) - f(t)\right|\frac{\left(f'(t)\right)^2}{f_n(t)f(t)}\,dt \qquad (A9)
\le \sup_{|t|\le k_n}\left\{\frac{\left|f_n'(t) + f'(t)\right|}{f_n(t)}\left|f_n'(t) - f'(t)\right|\right\}2k_n + \sup_{|t|\le k_n}\left\{\frac{\left|f_n(t) - f(t)\right|}{f_n(t)}\right\}\int_{-k_n}^{k_n}\frac{\left(f'(t)\right)^2}{f(t)}\,dt \qquad (A10)
\le \sup_{|t|\le k_n}\left\{\frac{\left|f_n'(t) + f'(t)\right|}{f_n(t)}\right\}\epsilon_1\,2k_n + \sup_{|t|\le k_n}\left\{\frac{1}{f_n(t)}\right\}\epsilon_0\,I(f), \qquad (A11)
where the last bound follows from the assumptions in (9). Now, consider the first term in (A11):
\sup_{|t|\le k_n}\frac{\left|f_n'(t) + f'(t)\right|}{f_n(t)} \le \sup_{|t|\le k_n}\frac{2\left|f'(t)\right| + \epsilon_1}{f_n(t)} \qquad (A12)
\le \sup_{|t|\le k_n}\frac{2\left|f'(t)\right| + \epsilon_1}{f(t) - \left|f(t) - f_n(t)\right|} \qquad (A13)
\le \sup_{|t|\le k_n}\frac{2\left|f'(t)\right| + \epsilon_1}{f(t) - \epsilon_0} \qquad (A14)
= \sup_{|t|\le k_n}\frac{2\frac{\left|f'(t)\right|}{f(t)} + \frac{\epsilon_1}{f(t)}}{1 - \frac{\epsilon_0}{f(t)}} \qquad (A15)
\le \frac{2\sup_{|t|\le k_n}\left|\frac{f'(t)}{f(t)}\right| + \epsilon_1\phi(k_n)}{1 - \epsilon_0\phi(k_n)}, \qquad (A16)
where the bound in (A14) follows from the assumptions in (9) and the properties of $\phi$, which imply
\epsilon_0\phi(k_n) < 1 \;\Longrightarrow\; \epsilon_0 < f(t), \quad |t|\le k_n; \qquad (A17)
and the bound in (A16) follows from the definition of $\phi$ in (8). Now, consider the second term in (A11):
\sup_{|t|\le k_n}\frac{1}{f_n(t)} = \sup_{|t|\le k_n}\frac{1}{f(t) - \left(f(t) - f_n(t)\right)} \qquad (A18)
\le \sup_{|t|\le k_n}\frac{1}{f(t) - \epsilon_0} \qquad (A19)
= \sup_{|t|\le k_n}\frac{1}{1 - \frac{\epsilon_0}{f(t)}}\cdot\frac{1}{f(t)} \qquad (A20)
\le \frac{\phi(k_n)}{1 - \epsilon_0\phi(k_n)}, \qquad (A21)
where (A20) follows by using similar steps, leading to the bound in (A14), and (A21) follows from the definition of $\phi$.
Combining the bounds in (A6), (A11), (A16), and (A21) concludes the proof.

Appendix C. A Proof of Theorem 2

Using (9), (14), and (A21), it holds that
\left|\log\left(f_n\right)\right| \le \max\left\{\log\left(f_n\right),\ \log\frac{1}{f_n}\right\} \qquad (A22)
\le \max\left\{\log\left(f_0 + \epsilon_0\right),\ \log\frac{\phi(k_n)}{1 - \epsilon_0\phi(k_n)}\right\} \qquad (A23)
= \psi(\epsilon_0, k_n). \qquad (A24)
Next, we bound the first term on the right-hand side of (A6). Starting with (A9) above, it holds that
\int_{-k_n}^{k_n}\left|\frac{\left(f_n'(t)\right)^2}{f_n(t)} - \frac{\left(f'(t)\right)^2}{f(t)}\right| dt
\le \epsilon_1\int_{-k_n}^{k_n}\frac{\left|f_n'(t)\right| + \left|f'(t)\right|}{f_n(t)}\,dt + \epsilon_0\int_{-k_n}^{k_n}\frac{\left(f'(t)\right)^2}{f_n(t)f(t)}\,dt \qquad (A25)
\le \epsilon_1\int_{-k_n}^{k_n}\left|\frac{f_n'(t)}{f_n(t)}\right| dt + \epsilon_1\int_{-k_n}^{k_n}\frac{\left|f'(t)\right|}{f_n(t)}\,dt + \epsilon_0\,\rho_{\max}(k_n)\int_{-k_n}^{k_n}\frac{\left|f'(t)\right|}{f_n(t)}\,dt, \qquad (A26)
where the inequality in (A25) follows again from (9), and the last bound follows from the triangle inequality together with the definition of $\rho_{\max}$.
Consider the integral in the first term in (A26):
\int_{-k_n}^{k_n}\left|\frac{f_n'(t)}{f_n(t)}\right| dt
= \int_{-k_n}^{k_n}\left|\frac{d}{dt}\log\left(f_n(t)\right)\right| dt \qquad (A27)
= \int_{-k_n}^{k_n}\operatorname{sign}\!\left(\tfrac{d}{dt}\log\left(f_n(t)\right)\right)\cdot\frac{d}{dt}\log\left(f_n(t)\right) dt \qquad (A28)
= \operatorname{sign}\!\left(\tfrac{d}{dt}\log\left(f_n(t)\right)\right)\log\left(f_n(t)\right)\Big|_{-k_n}^{k_n} - \int_{-k_n}^{k_n}\log\left(f_n(t)\right)\frac{d}{dt}\operatorname{sign}\!\left(\tfrac{d}{dt}\log\left(f_n(t)\right)\right) dt, \qquad (A29)
where the last equality follows from integration by parts. The first term in (A29) can be upper bounded as
\left|\operatorname{sign}\!\left(\tfrac{d}{dt}\log\left(f_n(t)\right)\right)\log\left(f_n(t)\right)\Big|_{-k_n}^{k_n}\right| \le 2\,\psi(\epsilon_0,k_n), \qquad (A30)
where the inequality in (A30) follows from (A24). The second term in (A29) is given by
\left|\int_{-k_n}^{k_n}\log\left(f_n(t)\right)\frac{d}{dt}\operatorname{sign}\!\left(\tfrac{d}{dt}\log\left(f_n(t)\right)\right) dt\right| = \left|\sum_{t\in[-k_n,k_n]:\,f_n'(t)=0}\log\left(f_n(t)\right)\right| \qquad (A31)
\le d_{f_n}(k_n)\,\psi(\epsilon_0,k_n). \qquad (A32)
By substituting (A30) and (A32) into (A29), one obtains
\int_{-k_n}^{k_n}\left|\frac{f_n'(t)}{f_n(t)}\right| dt \le \left(2 + d_{f_n}\right)\psi(\epsilon_0, k_n). \qquad (A33)
Next, we consider the integral in the second and third terms in (A26):
\int_{-k_n}^{k_n}\frac{\left|f'(t)\right|}{f_n(t)}\,dt \le \int_{-k_n}^{k_n}\frac{\left|f'(t)\right|}{f(t) - \epsilon_0}\,dt \qquad (A34)
= \int_{-k_n}^{k_n}\left|\frac{d}{dt}\log\left(f(t) - \epsilon_0\right)\right| dt \qquad (A35)
= \int_{-k_n}^{k_n}\operatorname{sign}\!\left(\tfrac{d}{dt}\log\left(f(t)-\epsilon_0\right)\right)\cdot\frac{d}{dt}\log\left(f(t)-\epsilon_0\right) dt \qquad (A36)
= \operatorname{sign}\!\left(\tfrac{d}{dt}\log\left(f(t)-\epsilon_0\right)\right)\log\left(f(t)-\epsilon_0\right)\Big|_{-k_n}^{k_n} - \int_{-k_n}^{k_n}\log\left(f(t)-\epsilon_0\right)\frac{d}{dt}\operatorname{sign}\!\left(\tfrac{d}{dt}\log\left(f(t)-\epsilon_0\right)\right) dt \qquad (A37)
\le \left(2 + d_f(k_n)\right)\max\left\{\left|\log\left(f_0 - \epsilon_0\right)\right|,\ \log\frac{\phi(k_n)}{1-\epsilon_0\phi(k_n)}\right\} \qquad (A38)
\le \left(2 + d_f(k_n)\right)\psi(\epsilon_0,k_n), \qquad (A39)
where the inequality in (A34) follows from the assumptions in (9) and (A17), and the bound in (A38) follows by using steps similar to those leading to the bound in (A33).
Combining the bounds in (A6), (A26), (A33), and (A38) concludes the proof.

Appendix D. A Proof of Theorem 3

The difficulty in bounding the error of a clipped estimator is in showing that the clipping is strict enough to avoid gross overestimation, yet permissive enough to avoid gross underestimation. The proof presented here is based on two auxiliary estimators that are constructed to under- and overestimate I n c ( f n ) in a controlled manner.
Let
\underline{I}_n = \int_{-k_n}^{k_n}\frac{\left(\lfloor f_n'(t)\rfloor_{\epsilon_1}\right)^2}{f_n(t) + \epsilon_0}\,dt, \qquad (A40)
where $\lfloor\,\cdot\,\rfloor_{\epsilon}$ denotes an "$\epsilon$-compression" operator, i.e.,
\lfloor f(t)\rfloor_{\epsilon} = \begin{cases} f(t) - \epsilon, & f(t) > \epsilon,\\ 0, & -\epsilon \le f(t) \le \epsilon,\\ f(t) + \epsilon, & f(t) < -\epsilon.\end{cases} \qquad (A41)
Next, consider the estimator
\bar{I}_n = \int_{-k_n}^{k_n}\frac{\left(f_n'(t) - \gamma_{1,n}(t)\right)^2}{f_n(t) + \gamma_{0,n}(t)}\,dt, \qquad (A42)
where the functions $\gamma_{i,n}: \mathbb{R}\to[0,\epsilon_i]$, $i=0,1$, are chosen as follows: If it holds that
\left|\rho_n(t)\right| \le \left|\bar\rho(t)\right|, \qquad (A43)
then $\gamma_{0,n}(t) = \gamma_{1,n}(t) = 0$. If, on the other hand,
\left|\rho_n(t)\right| > \left|\bar\rho(t)\right|, \qquad (A44)
then $\gamma_{0,n}(t)$ and $\gamma_{1,n}(t)$ are chosen such that
\frac{\left|f_n'(t) - \gamma_{1,n}(t)\right|}{f_n(t) + \gamma_{0,n}(t)} = \left|\bar\rho(t)\right|. \qquad (A45)
Note that because
\frac{\left|\lfloor f_n'(t)\rfloor_{\epsilon_1}\right|}{f_n(t) + \epsilon_0} \le \left|\rho(t)\right| \le \left|\bar\rho(t)\right|, \qquad (A46)
this is always possible.
In Appendix E, it is shown that the following relations hold between the estimators defined above:
\underline{I}_n \le I(f), \qquad (A47)
\underline{I}_n \le I_n^c, \qquad (A48)
I_n^c \le \bar{I}_n + \epsilon_1\Phi_{\max,1}(k_n), \qquad (A49)
I(f) - \underline{I}_n \le 4\epsilon_1\Phi_1(k_n) + 2\epsilon_0\Phi_2(k_n) + c(k_n), \qquad (A50)
\bar{I}_n - \underline{I}_n \le 2\epsilon_1\Phi_{\max,1}(k_n) + \epsilon_0\Phi_{\max,2}(k_n). \qquad (A51)
The bound in Theorem 3 can now be obtained by bounding the under- and overestimation errors separately. For $I_n^c \le I(f)$, it holds that
I(f) - I_n^c \le I(f) - \underline{I}_n \qquad (A52)
\le 4\epsilon_1\Phi_1(k_n) + 2\epsilon_0\Phi_2(k_n) + c(k_n). \qquad (A53)
For $I_n^c > I(f)$, it holds that
I_n^c - I(f) \le \bar{I}_n - \underline{I}_n + \epsilon_1\Phi_{\max,1}(k_n) \qquad (A54)
\le 3\epsilon_1\Phi_{\max,1}(k_n) + \epsilon_0\Phi_{\max,2}(k_n). \qquad (A55)
The bound in (22) follows. Furthermore, following the same steps as those leading to the bound in (A33), the bound in (25) follows.

Appendix E. A Proof of the Estimator Relations in Theorem 3

The bound in (A47) follows directly from the fact that under the assumptions in (9)
\frac{\left(\lfloor f_n'(t)\rfloor_{\epsilon_1}\right)^2}{f_n(t) + \epsilon_0} \le \frac{\left(f'(t)\right)^2}{f(t)}. \qquad (A56)
Analogously, (A48) follows from
\frac{\left(\lfloor f_n'(t)\rfloor_{\epsilon_1}\right)^2}{f_n(t) + \epsilon_0} \le \left|\rho(t)\right|\left|\lfloor f_n'(t)\rfloor_{\epsilon_1}\right| \le \left|\rho(t)\right|\left|f_n'(t)\right|. \qquad (A57)
In order to show (A50), note that under the assumptions in (9), it holds that
\left|\lfloor f_n'(t)\rfloor_{\epsilon_1}\right| \le \left|f'(t)\right|, \qquad (A58)
\left|\left(f_n(t) + \epsilon_0\right) - f(t)\right| \le 2\epsilon_0, \qquad (A59)
\left|\lfloor f_n'(t)\rfloor_{\epsilon_1} - f'(t)\right| \le 2\epsilon_1. \qquad (A60)
Hence, in analogy to Theorem 1, the estimation error of $\underline{I}_n$ can be written as
I(f) - \underline{I}_n = \int_{-k_n}^{k_n}\left(\frac{\left(f'(t)\right)^2}{f(t)} - \frac{\left(\lfloor f_n'(t)\rfloor_{\epsilon_1}\right)^2}{f_n(t) + \epsilon_0}\right) dt + c(k_n). \qquad (A61)
Using the same arguments as in the proof of Theorem 1, the integral term on the right-hand side of (A61) can be bounded by
\int_{-k_n}^{k_n}\left(\frac{\left(f'(t)\right)^2}{f(t)} - \frac{\left(\lfloor f_n'(t)\rfloor_{\epsilon_1}\right)^2}{f_n(t)+\epsilon_0}\right) dt
= \int_{-k_n}^{k_n}\frac{\left(f'(t)\right)^2\left(f_n(t)+\epsilon_0\right) - \left(\lfloor f_n'(t)\rfloor_{\epsilon_1}\right)^2 f(t)}{f(t)\left(f_n(t)+\epsilon_0\right)}\,dt \qquad (A62)
\le \int_{-k_n}^{k_n}\frac{\left|\lfloor f_n'(t)\rfloor_{\epsilon_1} - f'(t)\right|\left(\left|\lfloor f_n'(t)\rfloor_{\epsilon_1}\right| + \left|f'(t)\right|\right)}{f_n(t)+\epsilon_0}\,dt + \int_{-k_n}^{k_n}\frac{\left|f(t) - \left(f_n(t)+\epsilon_0\right)\right|\left(f'(t)\right)^2}{f(t)\left(f_n(t)+\epsilon_0\right)}\,dt \qquad (A63)
\le 2\epsilon_1\int_{-k_n}^{k_n}\frac{\left|\lfloor f_n'(t)\rfloor_{\epsilon_1}\right| + \left|f'(t)\right|}{f_n(t)+\epsilon_0}\,dt + 2\epsilon_0\int_{-k_n}^{k_n}\frac{\left(f'(t)\right)^2}{f(t)\left(f_n(t)+\epsilon_0\right)}\,dt \qquad (A64)
\le 2\epsilon_1\int_{-k_n}^{k_n}\frac{2\left|f'(t)\right|}{f(t)}\,dt + 2\epsilon_0\int_{-k_n}^{k_n}\left(\frac{f'(t)}{f(t)}\right)^2 dt \qquad (A65)
\le 4\epsilon_1\int_{-k_n}^{k_n}\left|\rho(t)\right| dt + 2\epsilon_0\int_{-k_n}^{k_n}\rho^2(t)\,dt \qquad (A66)
= 4\epsilon_1\Phi_1(k_n) + 2\epsilon_0\Phi_2(k_n). \qquad (A67)
Using the same steps, it is not difficult to show (A51), where the factor 2 does not arise because, in contrast to (A59) and (A60),
\left|\left(f_n(t) + \epsilon_0\right) - \left(f_n(t) + \gamma_{0,n}(t)\right)\right| \le \epsilon_0, \qquad (A68)
\left|\left(f_n'(t) - \gamma_{1,n}(t)\right) - \lfloor f_n'(t)\rfloor_{\epsilon_1}\right| \le \epsilon_1, \qquad (A69)
and $c(k_n)$ does not arise because both estimators are defined on $[-k_n,k_n]$.
In order to show (A49), first note that for $\left|\rho_n(t)\right| \le \left|\bar\rho(t)\right|$, it holds that
\frac{\left(f_n'(t) - \gamma_{1,n}(t)\right)^2}{f_n(t) + \gamma_{0,n}(t)} = \frac{\left(f_n'(t)\right)^2}{f_n(t)} = \left|\rho_n(t)\right|\left|f_n'(t)\right|, \qquad (A70)
i.e., $\bar{I}_n = I_n^c$ on such regions. Hence, $I_n^c > \bar{I}_n$ implies $\left|\rho_n(t)\right| > \left|\bar\rho(t)\right|$ on some region of $[-k_n,k_n]$. On this region, it holds that
\frac{\left(f_n'(t) - \gamma_{1,n}(t)\right)^2}{f_n(t) + \gamma_{0,n}(t)} = \frac{\left|f_n'(t) - \gamma_{1,n}(t)\right|}{f_n(t) + \gamma_{0,n}(t)}\left|f_n'(t) - \gamma_{1,n}(t)\right| \qquad (A71)
= \left|\bar\rho(t)\right|\left|f_n'(t) - \gamma_{1,n}(t)\right|. \qquad (A72)
Because
\left|f_n'(t)\right| - \left|f_n'(t) - \gamma_{1,n}(t)\right| \le \gamma_{1,n}(t) \le \epsilon_1, \qquad (A73)
it follows that
I_n^c - \bar{I}_n \le \int_{-k_n}^{k_n}\left|\bar\rho(t)\right|\epsilon_1\,dt \qquad (A74)
\le \epsilon_1\Phi_{\max,1}(k_n). \qquad (A75)

Appendix F. A Proof of Lemma 2

We begin by bounding $v_r$ and $\delta_{r,a_r}$. First,
v_0 = \int \left|t\right| K(t)\,dt = \sqrt{\frac{2}{\pi}}, \qquad (A76)
v_1 = \int \left|t^2 - 1\right| K(t)\,dt = 2\sqrt{\frac{2}{e\pi}}. \qquad (A77)
Second,
\delta_{r,a_r} = \sup_{t\in\mathbb{R}}\left|\mathbb{E}\!\left[f_n^{(r)}(t)\right] - f_Y^{(r)}(t)\right| \qquad (A78)
= \sup_{t\in\mathbb{R}}\left|\int \frac{1}{a_r}K\!\left(\frac{t-y}{a_r}\right)\left(f_Y^{(r)}(y) - f_Y^{(r)}(t)\right) dy\right| \qquad (A79)
= \sup_{t\in\mathbb{R}}\left|\int K(y)\left(f_Y^{(r)}(t + a_r y) - f_Y^{(r)}(t)\right) dy\right| \qquad (A80)
\le \sup_{t\in\mathbb{R}}\left|f_Y^{(r+1)}(t)\right|\int K(y)\,a_r\left|y\right| dy \qquad (A81)
= a_r\sqrt{\frac{2}{\pi}}\,\sup_{t\in\mathbb{R}}\left|f_Y^{(r+1)}(t)\right|. \qquad (A82)
Now, for r = 0 ,
\left|f_Y^{(1)}(t)\right| = \left|\mathbb{E}\!\left[\left(t - \sqrt{\mathsf{snr}}\,X\right)\frac{1}{\sqrt{2\pi}}\,e^{-\frac{\left(t-\sqrt{\mathsf{snr}}X\right)^2}{2}}\right]\right| \qquad (A83)
\le \frac{1}{\sqrt{2\pi}}\cdot\frac{1}{\sqrt{e}}, \qquad (A84)
where we have used the bound $|t|\,e^{-\frac{t^2}{2}} \le \frac{1}{\sqrt{e}}$. For $r = 1$,
\left|f_Y^{(2)}(t)\right| = \left|\mathbb{E}\!\left[\left(\left(t - \sqrt{\mathsf{snr}}\,X\right)^2 - 1\right)\frac{1}{\sqrt{2\pi}}\,e^{-\frac{\left(t-\sqrt{\mathsf{snr}}X\right)^2}{2}}\right]\right| \qquad (A85)
\le \frac{2e^{-1} + 1}{\sqrt{2\pi}}, \qquad (A86)
where we have used the bound $t^2e^{-\frac{t^2}{2}} \le \frac{2}{e}$.
Next, we bound the score function ρ Y as follows:
\left|\rho_Y(t)\right| = \left|\frac{f_Y'(t)}{f_Y(t)}\right| \qquad (A87)
= \left|\sqrt{\mathsf{snr}}\,\mathbb{E}[X\,|\,Y=t] - t\right| \qquad (A88)
\le \sqrt{\mathsf{snr}}\,\mathbb{E}\!\left[\left|X\right|\,\middle|\,Y=t\right] + \left|t\right| \qquad (A89)
\le \sqrt{\mathsf{snr}}\sqrt{\mathbb{E}\!\left[X^2\,\middle|\,Y=t\right]} + \left|t\right| \qquad (A90)
\le \sqrt{3\,\mathsf{snr}\operatorname{Var}(X) + 4t^2} + \left|t\right| \qquad (A91)
\le \sqrt{3\,\mathsf{snr}\operatorname{Var}(X)} + 3\left|t\right|, \qquad (A92)
where the equality in (A88) follows by using the identity $\frac{f_Y'(t)}{f_Y(t)} = \sqrt{\mathsf{snr}}\,\mathbb{E}[X|Y=t] - t$ [23], the inequality in (A90) follows from Jensen's inequality, and the inequality in (A91) follows from the bound in [24] (Proposition 1.2). Using the bound in (A92), it follows that
\rho_{\max}(k_n) = \max_{|t|\le k_n}\left|\rho_Y(t)\right| \le \sqrt{3\,\mathsf{snr}\operatorname{Var}(X)} + 3k_n. \qquad (A93)
Using the relation between the Fisher information and the MMSE, we have that
I(f_Y) = 1 - \mathsf{snr}\cdot\mathrm{mmse}(X,\mathsf{snr}) \le 1. \qquad (A94)
Finally, the function $\phi$ is obtained by observing that
f_Y(t) = \mathbb{E}\!\left[\frac{1}{\sqrt{2\pi}}\,e^{-\frac{\left(t-\sqrt{\mathsf{snr}}X\right)^2}{2}}\right] \qquad (A95)
\ge \frac{1}{\sqrt{2\pi}}\,e^{-\mathbb{E}\left[\frac{\left(t-\sqrt{\mathsf{snr}}X\right)^2}{2}\right]} \qquad (A96)
\ge \frac{1}{\sqrt{2\pi}}\,e^{-\left(t^2 + \mathsf{snr}\,\mathbb{E}[X^2]\right)}, \qquad (A97)
where we used Jensen's inequality and the fact that $(a+b)^2 \le 2(a^2+b^2)$. This concludes the proof.

Appendix G. A Proof of Lemma 3

Choose some $v > 0$. Then,
c(k_n) = \mathbb{E}\!\left[\rho_Y^2(Y)\,\mathbb{1}\{|Y|\ge k_n\}\right] \qquad (A98)
\le \mathbb{E}^{\frac{1}{1+v}}\!\left[\left|\rho_Y(Y)\right|^{2(1+v)}\right]\mathbb{E}^{\frac{v}{1+v}}\!\left[\mathbb{1}^{\frac{1+v}{v}}\{|Y|\ge k_n\}\right] \qquad (A99)
= \mathbb{E}^{\frac{1}{1+v}}\!\left[\left|\rho_Y(Y)\right|^{2(1+v)}\right]\mathbb{E}^{\frac{v}{1+v}}\!\left[\mathbb{1}\{|Y|\ge k_n\}\right] \qquad (A100)
= \mathbb{E}^{\frac{1}{1+v}}\!\left[\left|\rho_Y(Y)\right|^{2(1+v)}\right]\mathbb{P}^{\frac{v}{1+v}}\!\left(|Y|\ge k_n\right) \qquad (A101)
= \mathbb{E}^{\frac{1}{1+v}}\!\left[\left|\mathbb{E}[Z\,|\,Y]\right|^{2(1+v)}\right]\mathbb{P}^{\frac{v}{1+v}}\!\left(|Y|\ge k_n\right) \qquad (A102)
\le \mathbb{E}^{\frac{1}{1+v}}\!\left[\left|Z\right|^{2(1+v)}\right]\mathbb{P}^{\frac{v}{1+v}}\!\left(|Y|\ge k_n\right) \qquad (A103)
= 2\,\Gamma^{\frac{1}{1+v}}\!\left(v+\tfrac{1}{2}\right)\left(\frac{1}{\pi}\right)^{\frac{1}{2(1+v)}}\mathbb{P}^{\frac{v}{1+v}}\!\left(|Y|\ge k_n\right) \qquad (A104)
\le 2\,\Gamma^{\frac{1}{1+v}}\!\left(v+\tfrac{1}{2}\right)\left(\frac{1}{\pi}\right)^{\frac{1}{2(1+v)}}\left(\frac{\mathsf{snr}\,\mathbb{E}[X^2]+1}{k_n^2}\right)^{\frac{v}{1+v}}, \qquad (A105)
where (A99) follows from Hölder’s inequality, (A102) follows by using the identity
\rho_Y(t) = \sqrt{\mathsf{snr}}\,\mathbb{E}[X\,|\,Y=t] - t = -\mathbb{E}[Z\,|\,Y=t], \qquad (A106)
and (A105) follows from Markov’s inequality.
Now, if $\mathbb{E}[X^2] < \infty$, then using Markov's inequality,
\mathbb{P}\!\left(|Y|\ge k_n\right) \le \frac{\mathbb{E}[Y^2]}{k_n^2} = \frac{\mathsf{snr}\,\mathbb{E}[X^2]+1}{k_n^2}. \qquad (A107)
Moreover, using the Chernoff bound,
\mathbb{P}\!\left(|Y|\ge k_n\right) \le e^{-k_nt}\,\mathbb{E}\!\left[e^{t|Y|}\right] \qquad (A108)
\le 2\,e^{-k_nt+\frac{t^2}{2}}\,\mathbb{E}\!\left[e^{t\sqrt{\mathsf{snr}}|X|}\right] \qquad (A109)
\le 2\,e^{-k_nt+\frac{t^2}{2}}\,e^{\frac{\alpha^2\mathsf{snr}}{2}}. \qquad (A110)
Therefore,
c(k_n) \le \inf_{t>0}\inf_{v>0}\ 2\,\Gamma^{\frac{1}{1+v}}\!\left(v+\tfrac{1}{2}\right)\left(\frac{1}{\pi}\right)^{\frac{1}{2(1+v)}} 2^{\frac{v}{1+v}}\,e^{\frac{v}{1+v}\left(-k_nt+\frac{t^2}{2}+\frac{\alpha^2\mathsf{snr}}{2}\right)} \qquad (A111)
\le \inf_{v>0}\ 2\,\Gamma^{\frac{1}{1+v}}\!\left(v+\tfrac{1}{2}\right)\left(\frac{1}{\pi}\right)^{\frac{1}{2(1+v)}} 2^{\frac{v}{1+v}}\,e^{\frac{v}{1+v}\cdot\frac{\alpha^2\mathsf{snr}-k_n^2}{2}}. \qquad (A112)
This concludes the proof.

Appendix H. A Proof of Theorem 4

Let
\varepsilon_n = \frac{4\epsilon_1k_n\rho_{\max}(k_n) + 2\epsilon_1^2k_n\phi(k_n) + \epsilon_0\phi(k_n)}{1-\epsilon_0\phi(k_n)} + c(k_n), \qquad (A113)
which is obtained from (11) by bounding $I(f)$ by 1 according to (31). In order to apply the bounds in Lemma 1 and Theorem 1, the following equalities/inequalities must hold for $r\in\{0,1\}$:
\epsilon_r > \delta_{r,a_r},
\frac{a_r^{2r+2}\left(\epsilon_r - \delta_{r,a_r}\right)^2}{v_r^2} \gg \frac{1}{n},
\lim_{n\to\infty}\epsilon_1 k_n\rho_{\max}(k_n) = 0,
\lim_{n\to\infty}\epsilon_1^2 k_n\phi(k_n) = 0,
\lim_{n\to\infty}\epsilon_0\phi(k_n) = 0,
\lim_{n\to\infty}c(k_n) = 0. \qquad (A114)
To satisfy (A114), we choose
a_0 = \epsilon_0 = n^{-w_0}, \quad w_0\in\left(0,\tfrac{1}{4}\right), \qquad (A115)
a_1 = \epsilon_1 = n^{-w_1}, \quad w_1\in\left(0,\tfrac{1}{6}\right), \qquad (A116)
k_n = \sqrt{u\log(n)}, \quad u\in\left(0,\min(w_0,w_1)\right). \qquad (A117)
Then, together with the bounds in Lemma 2, the relevant quantities in (A114) are as follows:
\frac{a_0^2\left(\epsilon_0-\delta_{0,a_0}\right)^2}{v_0^2} \ge \frac{c_1}{2}\,n^{-4w_0},
\frac{a_1^4\left(\epsilon_1-\delta_{1,a_1}\right)^2}{v_1^2} \ge \frac{c_2}{2}\,n^{-6w_1},
\epsilon_1k_n\rho_{\max}(k_n) \le \left(c_3\sqrt{u\log(n)} + 3u\log(n)\right)n^{-w_1},
\epsilon_1^2k_n\phi(k_n) \le c_5\sqrt{u\log(n)}\,n^{u-2w_1},
\epsilon_0\phi(k_n) \le c_5\,n^{u-w_0},
c(k_n) \le \frac{c_4}{\sqrt{u\log(n)}}, \qquad (A118)
which yields (37). Now, if $|X|$ is $\alpha$-sub-Gaussian, the bound in (43) can be obtained from Lemma 3 with $v = 1$.
Because (9) leads to (11), one obtains
\mathbb{P}\!\left(\left|I_n - I(f_Y)\right| \ge \varepsilon_n\right)
\le \mathbb{P}\!\left(\sup_{|t|\le k_n}\left|f_n(t) - f_Y(t)\right| \ge \epsilon_0\right) + \mathbb{P}\!\left(\sup_{|t|\le k_n}\left|f_n'(t) - f_Y'(t)\right| \ge \epsilon_1\right) \qquad (A119)
\le \mathbb{P}\!\left(\sup_{t\in\mathbb{R}}\left|f_n(t) - f_Y(t)\right| > \epsilon_0\right) + \mathbb{P}\!\left(\sup_{t\in\mathbb{R}}\left|f_n'(t) - f_Y'(t)\right| > \epsilon_1\right) \qquad (A120)
\le 2\,e^{-n\pi a_0^2\left(\epsilon_0 - \frac{a_0}{\pi\sqrt{e}}\right)^2} + 2\,e^{-\frac{ne\pi a_1^4}{4}\left(\epsilon_1 - a_1\frac{2e^{-1}+1}{\pi}\right)^2} \qquad (A121)
= 2\,e^{-\pi\left(1-\frac{1}{\pi\sqrt{e}}\right)^2n^{1-4w_0}} + 2\,e^{-\frac{e\pi}{4}\left(1-\frac{2e^{-1}+1}{\pi}\right)^2n^{1-6w_1}} \qquad (A122)
= 2\,e^{-c_1n^{1-4w_0}} + 2\,e^{-c_2n^{1-6w_1}}, \qquad (A123)
where the inequality in (A121) follows from Lemma 1, and the last step follows from (A115), (A116), and (A117). This concludes the proof.

Appendix I. A Proof of Theorem 5

Let
\varepsilon_n = 4\epsilon_1\Phi_{\max,1}(k_n) + 2\epsilon_0\Phi_{\max,2}(k_n) + c(k_n). \qquad (A124)
To apply the bounds in Theorem 3 and Lemma 2, the following equalities/inequalities must hold for $r\in\{0,1\}$:
\epsilon_r > \delta_{r,a_r},
\frac{a_r^{2r+2}\left(\epsilon_r - \delta_{r,a_r}\right)^2}{v_r^2} \gg \frac{1}{n},
\lim_{n\to\infty}\epsilon_1\Phi_{\max,1}(k_n) = 0,
\lim_{n\to\infty}\epsilon_0\Phi_{\max,2}(k_n) = 0,
\lim_{n\to\infty}c(k_n) = 0. \qquad (A125)
To satisfy (A125), we choose
a_0 = \epsilon_0 = n^{-w_0}, \quad w_0\in\left(0,\tfrac{1}{4}\right), \qquad (A126)
a_1 = \epsilon_1 = n^{-w_1}, \quad w_1\in\left(0,\tfrac{1}{6}\right), \qquad (A127)
k_n = n^{u}, \quad u\in\left(0,\min\left(\tfrac{w_0}{3},\tfrac{w_1}{2}\right)\right). \qquad (A128)
Then, together with the bounds in Lemma 2, the relevant quantities in (A125) are as follows:
\frac{a_0^2\left(\epsilon_0-\delta_{0,a_0}\right)^2}{v_0^2} \ge \frac{c_1}{2}\,n^{-4w_0},
\frac{a_1^4\left(\epsilon_1-\delta_{1,a_1}\right)^2}{v_1^2} \ge \frac{c_2}{2}\,n^{-6w_1},
\epsilon_1\Phi_{\max,1}(k_n) = n^{u-w_1}\left(2c_3 + 3n^{u}\right),
\epsilon_0\Phi_{\max,2}(k_n) = n^{u-w_0}\left(2c_3^2 + 6n^{2u}\right),
c(k_n) \le c_4\,n^{-u}, \qquad (A129)
which yields (46). Moreover, if $|X|$ is $\alpha$-sub-Gaussian, the bound in (47) can be obtained from Lemma 3.
Following the same steps leading to (A123), we have that
\mathbb{P}\!\left(\left|I_n^c - I(f_Y)\right| \ge \varepsilon_n\right) \le 2\,e^{-\pi\left(1-\frac{1}{\pi\sqrt{e}}\right)^2n^{1-4w_0}} + 2\,e^{-\frac{e\pi}{4}\left(1-\frac{2e^{-1}+1}{\pi}\right)^2n^{1-6w_1}} \qquad (A130)
= 2\,e^{-c_1n^{1-4w_0}} + 2\,e^{-c_2n^{1-6w_1}}. \qquad (A131)
This concludes the proof.

References

  1. Bhattacharya, P. Estimation of a probability density function and its derivatives. Sankhyā: Indian J. Stat. Ser. A 1967, 29, 373–382.
  2. Dmitriev, Y.G.; Tarasenko, F. On the estimation of functionals of the probability density and its derivatives. Theory Probab. Appl. 1974, 18, 628–633.
  3. Schuster, E.F. Estimation of a probability density function and its derivatives. Ann. Math. Stat. 1969, 40, 1187–1195.
  4. Rüschendorf, L. Consistency of estimators for multivariate density functions and for the mode. Sankhyā: Indian J. Stat. Ser. A 1977, 39, 243–250.
  5. Silverman, B.W. Weak and strong uniform consistency of the kernel estimate of a density and its derivatives. Ann. Stat. 1978, 6, 177–184.
  6. Roussas, G.G. Kernel estimates under association: Strong uniform consistency. Stat. Probab. Lett. 1991, 12, 393–403.
  7. Wertz, W.; Schneider, B. Statistical density estimation: A bibliography. Int. Stat. Rev./Rev. Int. Stat. 1979, 47, 155–175.
  8. Tsybakov, A.B. Introduction to Nonparametric Estimation; Springer: Paris, France, 2009.
  9. Donoho, D.L. One-sided inference about functionals of a density. Ann. Stat. 1988, 16, 1390–1420.
  10. Berisha, V.; Hero, A.O. Empirical non-parametric estimation of the Fisher information. IEEE Signal Process. Lett. 2014, 22, 988–992.
  11. Spall, J.C. Monte Carlo computation of the Fisher information matrix in nonstandard settings. J. Comput. Graph. Stat. 2005, 14, 889–909.
  12. Birgé, L.; Massart, P. Estimation of integral functionals of a density. Ann. Stat. 1995, 23, 11–29.
  13. Cao, W.; Dytso, A.; Fauß, M.; Poor, H.V.; Feng, G. On Nonparametric Estimation of the Fisher Information. In Proceedings of the 2020 IEEE International Symposium on Information Theory (ISIT), Los Angeles, CA, USA, 21–26 June 2020.
  14. Sricharan, K.; Raich, R.; Hero, A.O. Estimation of nonlinear functionals of densities with confidence. IEEE Trans. Inf. Theory 2012, 58, 4135–4159.
  15. Wu, Y.; Yang, P. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Trans. Inf. Theory 2016, 62, 3702–3720.
  16. Han, Y.; Jiao, J.; Weissman, T.; Wu, Y. Optimal rates of entropy estimation over Lipschitz balls. arXiv 2017, arXiv:1711.02141.
  17. Verdú, S. Empirical Estimation of Information Measures: A Literature Guide. Entropy 2019, 21, 720.
  18. Lozano, A.; Tulino, A.M.; Verdú, S. Optimum power allocation for parallel Gaussian channels with arbitrary input distributions. IEEE Trans. Inf. Theory 2006, 52, 3033–3051.
  19. Gramacki, A.; Sawerwain, M.; Gramacki, J. FPGA-based bandwidth selection for kernel density estimation using high level synthesis approach. Bull. Pol. Acad. Sci. Tech. Sci. 2016, 64, 821–829.
  20. Massart, P. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. Ann. Probab. 1990, 18, 1269–1283.
  21. Carreira-Perpiñán, M.Á.; Williams, C.K. On the Number of Modes of a Gaussian Mixture. Available online: http://www.cs.utoronto.ca/~miguel/papers/ps/sst03.pdf (accessed on 24 April 2021).
  22. Cao, W.; Dytso, A.; Fauß, M.; Poor, H.V. MATLAB Codes for Nonparametric Estimation of the Fisher Information. Available online: https://github.com/mifauss/Fisher_Information_Estimation (accessed on 24 April 2021).
  23. Esposito, R. On a relation between detection and estimation in decision theory. Inf. Control 1968, 12, 116–120.
  24. Fozunbal, M. On regret of parametric mismatch in minimum mean square error estimation. In Proceedings of the 2010 IEEE International Symposium on Information Theory, Austin, TX, USA, 13–18 June 2010.
Figure 1. Sample complexity with Gaussian input. Left: number of samples required versus error of the estimators $I_n$ and $I_n^c$ given $P_{\mathrm{err}} = 0.2$. Right: number of samples required versus confidence of the estimators with given $\varepsilon_n = 0.5$.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
