Article

Two New Families of Local Asymptotically Minimax Lower Bounds in Parameter Estimation

by
Neri Merhav
The Viterbi Faculty of Electrical and Computer Engineering, Technion—Israel Institute of Technology, Technion City, Haifa 3200003, Israel
Entropy 2024, 26(11), 944; https://doi.org/10.3390/e26110944
Submission received: 19 September 2024 / Revised: 29 October 2024 / Accepted: 4 November 2024 / Published: 4 November 2024
(This article belongs to the Collection Feature Papers in Information Theory)

Abstract:
We propose two families of asymptotically local minimax lower bounds on parameter estimation performance. The first family of bounds applies to any convex, symmetric loss function that depends solely on the difference between the estimate and the true underlying parameter value (i.e., the estimation error), whereas the second is more specifically oriented to the moments of the estimation error. The proposed bounds are relatively easy to calculate numerically (in the sense that their optimization is over relatively few auxiliary parameters), yet they turn out to be tighter (sometimes significantly so) than previously reported bounds that are associated with similar calculation efforts, across many application examples. In addition to their relative simplicity, they also have the following advantages: (i) Essentially no regularity conditions are required regarding the parametric family of distributions. (ii) The bounds are local (in a sense to be specified). (iii) The bounds provide the correct order of decay as functions of the number of observations, at least in all the examples examined. (iv) At least the first family of bounds extends straightforwardly to vector parameters.

1. Introduction

The theory of parameter estimation offers a very large variety of lower bounds (as well as upper bounds), which characterize the fundamental performance limits of any estimator in a given parametric model. In this context, it is common to distinguish between Bayesian bounds (see, e.g., the Bayesian Cramér–Rao bound [1], the Bayesian Bhattacharyya bound, the Bobrovsky–Zakai bound [2], the Bellini–Tartara bound [3], the Chazan–Zakai–Ziv bound [4], the Weiss–Weinstein bound [5,6], and more; see [7] for a comprehensive overview) and non-Bayesian bounds, where in the former, the parameter to be estimated is considered a random variable with a given probability law, as opposed to the latter, where it is assumed to be an unknown deterministic constant. The category of non-Bayesian bounds is further subdivided into two subclasses: one is associated with local bounds that hold for classes of estimators with certain limitations, such as unbiased estimators (see, e.g., the Cramér–Rao bound [8,9,10,11,12], the Bhattacharyya bound [13], the Barankin bound [14], the Chapman–Robbins bound [15], the Fraser–Guttman bound [16], the Kiefer bound [17], and more), and the other is the subclass of minimax bounds (see, e.g., Ziv and Zakai [18], Hajek [19], Le Cam [20], Assouad [21], Fano [22], Lehmann (Sections 4.2–4.4 in [23]), Nazin [24], Yang and Barron [25], Guntuboyina [26,27], Kim [28], and many more).
In this paper, we focus on the minimax approach, and more concretely, on the local minimax approach. According to the minimax approach, we are given a parametric family of probability density functions (or probability mass functions, in the discrete case), { p ( x 1 , , x n | θ ) , ( x 1 , , x n ) R n , θ Θ } , where θ is a d-dimensional parameter vector, Θ R d is the parameter space, n is a positive integer designating the number of observations, and we define a loss function, l ( θ , θ ^ n ) , where θ ^ n is an estimator, which is a function of the observations x 1 , , x n only. The minimax performance is defined as
R n ( Θ ) = inf θ ^ n ( · ) sup θ Θ E θ { l ( θ , θ ^ n ) } ,
where E θ denotes expectation w.r.t. p ( · | θ ) . As customary, we consider here loss functions with the property that l ( θ , θ ^ n ) depends on θ and θ ^ n only via their difference, that is, l ( θ , θ ^ n ) = ρ ( θ θ ^ n ) , where the function ρ ( · ) satisfies certain assumptions (see Section 2). The local asymptotic minimax performance at the point θ Θ is defined as follows (see also, e.g., [19]). Let { ζ n * , n 1 } be a positive sequence, tending to infinity, with the property that
r ( θ ) = lim δ 0 lim inf n inf θ ^ n ( · ) sup { θ : θ θ δ } ζ n * · E θ { ρ ( θ ^ n θ ) }
is a strictly positive finite constant. Then, we say that r ( θ ) is the local asymptotic minimax performance with respect to (w.r.t.) { ζ n * } at the point θ Θ . Roughly speaking, the significance is that the performance of a good estimator, θ ^ n , at θ is about E θ { ρ ( θ θ ^ n ) } r ( θ ) / ζ n * . For example, in the scalar mean square error (MSE) case, where ρ ( ε ) = ε 2 , and where the observations are Gaussian, i.i.d., with mean θ and known variance σ 2 , it is actually shown in Example 2.4, p. 257 in [23] that r ( θ ) = σ 2 w.r.t. ζ n * = n , for all θ R , which is attained by the sample mean estimator, θ ^ n = 1 n i = 1 n x i .
Our focus in this work is on the derivation of some new lower bounds that are as follows: (i) essentially free of regularity conditions on the smoothness of the parametric family { p ( · | θ ) , θ Θ } , (ii) relatively simple and easy to calculate, at least numerically, which amounts to the property that the bound contains only a small number of auxiliary parameters to be numerically optimized (typically, no more than two or three parameters), (iii) tighter than earlier reported bounds that are associated with similar calculation efforts as described in (ii), and (iv) lend themselves to extensions that yield even stronger bounds (albeit with more auxiliary parameters to be optimized), as well as extensions to vector parameters. We propose two families of lower bounds on R n ( Θ ) , along with their local versions, of bounding r ( θ ) , with the four above-described properties. The first applies to any convex, symmetric loss function ρ ( · ) , whereas the second is more specifically oriented to the moments of the estimation error, ρ ( ε ) = | ε | t , where t is a positive real, not necessarily an integer, with special attention devoted to the MSE case, t = 2 . For the sake of simplicity and clarity of the exposition, in the first two main sections of the paper (Section 3 and Section 4), our focus is on the case of a scalar parameter, as most of our examples are associated with the scalar case. In Section 5, we extend some of our findings to the vector case.
To put this work in the perspective of earlier work on minimax estimation, we next briefly review some of the basic approaches in this problem area. Admittedly, considering the vast amount of literature on the subject, our review below is by no means exhaustive. For a more comprehensive review, the reader is referred to Kim [28].
First, observe the simple fact that the minimax performance is lower bounded by the Bayesian performance of the same loss function (see, e.g., [1,2,3,4,5,6,7]) for any prior on the parameter, θ , and so, every lower bound on the Bayesian performance is automatically a valid lower bound also on the minimax performance (see Section 4.2 in [23]). Indeed, in Section 2.3 in [26], it is argued that the vast majority of existing minimax lower bounding techniques are based upon bounding the Bayes risk from below w.r.t. some prior. Many of these Bayesian bounds, however, are subjected to certain restrictions and regularity conditions concerning the smoothness of the prior and the family of densities, { p ( · | θ ) , θ Θ } .
The oldest idea in this spirit dates back to Ziv and Zakai's 1969 article [18] on parameter estimation, applied mostly in the context of time-delay estimation: the prior puts all its mass equally on two values, θ0 and θ1, of the parameter θ, and one considers an auxiliary hypothesis testing problem of distinguishing between the two hypotheses, H0: θ = θ0 and H1: θ = θ1, with equal priors (see Section 2 for exact definitions). A simple argument regarding the sub-optimality of a decision rule that is based on estimating θ and deciding on the hypothesis with the closer value of θ, combined with Chebyshev's inequality, yields a simple lower bound on the corresponding Bayes risk, and hence also on the minimax risk, in terms of the probability of error of the optimal decision rule. Five years later, Bellini and Tartara [3], and then independently, Chazan, Zakai, and Ziv [4], improved the bound of [18] using somewhat different arguments and obtained Bayesian bounds that apply to the uniform prior. These bounds are also given in terms of the error probability pertaining to the optimal maximum a posteriori (MAP) decision rule of binary hypothesis testing with equal priors, but this time in an integral form. These bounds were demonstrated to be fairly tight in several application examples, but they are rather difficult to calculate in most cases. Shortly before the Bellini–Tartara and the Chazan–Zakai–Ziv articles were published, Le Cam [20] proposed a minimax lower bound, which is also given in terms of the error probability associated with binary hypothesis testing, or equivalently, the total variation between p(·|θ0) and p(·|θ1), under the postulate that the loss function l(·,·) is a metric. We will refer to Le Cam's bound in a more detailed manner later, in the context of our first proposed bound.
A decade later, Assouad [21] extended Le Cam's two-point testing bound to multiple points, where instead of just two test points, θ0 and θ1, as before, there are, more generally, m test points, θ0, θ1, …, θm−1, and correspondingly, the auxiliary hypothesis testing problem consists of m hypotheses, Hi: θ = θi, i = 0, 1, …, m−1, with certain priors (again, to be defined in Section 2). Based on those test points, Assouad devised the so-called hypercube method. Another related bounding technique that is based on multiple test points, and referred to as Fano's method, amounts to further bounding from below the error probability of multiple hypotheses using Fano's inequality (see Section 2.10 in [29]). Considering the large number of auxiliary parameters to be optimized when multiple hypotheses are present, these bounds demand heavy computational efforts. Also, Fano's inequality is often loose, even though it is adequate for the purpose of proving converse-to-coding theorems in information theory [29]. In later years, Le Cam [30] and Yu [31] extended Le Cam's original approach to apply to testing mixtures of densities. More recently, Yang and Barron [25] related the minimax problem to the metric entropy of the parametric family, {p(·|θ), θ ∈ Θ}, and Cai and Zhou [32] combined Le Cam's and Assouad's methods by considering a larger number of dimensions. Guntuboyina [26,27] pursued a different direction by deriving minimax lower bounds using f-divergences.
The outline of this article is as follows. In Section 2, we define the problem setting, provide a few formal definitions along with background, and establish the notation. In Section 3, we develop the first family of bounds, and in Section 4, we present the second family, both for the scalar parameter case. Finally, in Section 5, we extend some of our findings to the vector case.

2. Problem Setting, Background, Definitions, and Notation

As outlined in the last paragraph of the Introduction, Section 3 and Section 4 are on the scalar parameter case, whereas in Section 5, we extend some of the results to the case of a vector parameter with dimension d. In order to avoid repetition, we formalize the problem and define the notation here for the more general vector case, with the understanding that for the scalar case, all definitions remain the same, except that they are confined to the special case of d = 1 .
Consider a family of probability density functions (PDFs), {p(·|θ), θ ∈ Θ}, where θ is a parameter vector of dimension d to be estimated, and Θ ⊆ R^d is the parameter space. We denote by E_θ{·} the expectation operator w.r.t. p(·|θ). Let X = (X1, …, Xn) be a random vector of observations, governed by p(·|θ) for some θ ∈ Θ. The support of p(·|θ) is assumed to be X^n, the nth Cartesian power of the alphabet X of each component Xi, i = 1, …, n. The alphabet X may be a finite set, a countable set, a finite interval, an infinite interval, or the entire real line. In the first two cases, the PDFs should be understood to be replaced by probability mass functions, and integrations over the observation space should be replaced by summations. A realization of X will be denoted by x = (x1, …, xn) ∈ X^n.
An estimator, θ ^ n , is given by any function of the observation vector, g n : X n Θ , that is, θ ^ n = g n [ x ] . Since X is random, so is the estimate g n [ X ] , as well as the estimation error, ε n [ X ] = θ ^ n θ = g n [ X ] θ . We associate with every possible vector value, ϵ , of ε n [ X ] a certain loss (or “cost”, or “price”), ρ ( ϵ ) , where ρ ( · ) is a non-negative function with the following properties: (i) monotonically non-increasing in each component, ϵ i , of ϵ , wherever ϵ i < 0 , i = 1 , 2 , , d , (ii) monotonically non-decreasing in each component, ϵ i , of ϵ , wherever ϵ i 0 , i = 1 , 2 , , d , and (iii) ρ ( 0 ) = 0 .
Referring to Section 3 and Section 4, which deal with the scalar case (d = 1), we further adopt the following assumptions regarding the loss function ρ. In Section 3, we assume that ρ(·) is as follows: (iv) convex, and (v) symmetric, i.e., ρ(ϵ) = ρ(−ϵ) for every ϵ. In Section 4, we assume, more specifically, that ρ(ε) = |ε|^t, where t is a positive constant, not necessarily an integer. This is a special case of the class of loss functions considered in Section 3, except when t ∈ (0,1), in which case |ε|^t is a concave (rather than a convex) function of ε. In Section 5, for the vector extension of the results of Section 3, the above-mentioned symmetry assumption (v) is extended to become radial symmetry; namely, ρ(ϵ) will be assumed to depend on ϵ only via its Euclidean norm, ‖ϵ‖.
The expected cost of an estimator g_n at a point θ ∈ Θ is defined as
$$R_n(\theta, g_n) = E_\theta\{\rho(g_n[X] - \theta)\}.$$
The global minimax performance is defined as
$$R_n(\Theta) = \inf_{g_n}\,\sup_{\theta\in\Theta} R_n(\theta, g_n).$$
Another related notion is that of local asymptotic minimax performance, defined in Section 1, and repeated here for the sake of completeness. Let { ζ n * , n 1 } be a positive sequence, tending to infinity, with the property that r ( θ ) , defined as in (2), is a strictly positive finite constant. Then, we say that r ( θ ) is the local asymptotic minimax performance w.r.t. { ζ n * } at θ Θ . The sequence { 1 / ζ n * } is referred to as the convergence rate of the minimax estimator.
Similarly, as in earlier articles on minimax lower bounds, many of our proposed bounds are given in terms of the error probability pertaining to certain auxiliary hypothesis testing problems that are associated with two or more test points in the parameter space, and the choice of those test points is subjected to optimization. We therefore provide a few definitions, notation conventions, and background associated with elementary hypothesis testing. For further details, the reader is referred to any one of many textbooks that cover the topic, for example, Van Trees (see Sections 2.2 and 2.3 in [1]), Helstrom (see Chapter III in [33]), or Whalen (see Chapter 5 in [34]).
For given θ 0 and θ 1 , both in Θ , consider the problem of deciding between two possible hypotheses regarding the probability distribution that governs the given vector of observations, X. Under hypothesis H 0 , X is governed by p ( · | θ 0 ) , and under hypothesis H 1 , X is governed by p ( · | θ 1 ) . Suppose also that the a priori probability of H 0 to be the actual underlying hypothesis is Pr { H 0 } = q , where 0 q 1 is given, and so the a priori probability of H 1 is the complementary probability, Pr { H 1 } = 1 q . When q = 1 2 , we say that the priors are equal; otherwise, the priors are unequal. A decision rule Ω = ( Ω 0 , Ω 1 ) is a partition of the observation space, X n , into two disjoint decision regions, Ω 0 X n and its complementary region, Ω 1 = Ω 0 c : Given that X = x Ω 0 , we decide in favor of H 0 ; otherwise, we decide in favor of H 1 . The probability of error, associated with the hypotheses H 0 and H 1 , referring to test points θ 0 and θ 1 , respectively, with priors q and 1 q , respectively, when using the decision rule Ω , is defined as
$$P_e(q,\theta_0,\theta_1,\Omega) = q\cdot\int_{\Omega_1} p(x|\theta_0)\,\mathrm{d}x + (1-q)\cdot\int_{\Omega_0} p(x|\theta_1)\,\mathrm{d}x.$$
A well-known elementary result in decision theory asserts that the optimal decision rule, Ω * = ( Ω 0 * , Ω 1 * ) , in the sense of minimizing P e ( q , θ 0 , θ 1 , Ω ) , is given by
$$\Omega_0^* = \{x\in\mathcal{X}^n:\; q\cdot p(x|\theta_0) \ge (1-q)\cdot p(x|\theta_1)\}$$
$$\Omega_1^* = \{x\in\mathcal{X}^n:\; q\cdot p(x|\theta_0) < (1-q)\cdot p(x|\theta_1)\},$$
where it should be pointed out that attributing the case of a tie, q·p(x|θ0) = (1−q)·p(x|θ1), to Ω0* is completely arbitrary, and it could have been attributed alternatively to Ω1* without affecting the probability of error. Here and in the sequel, the minimum probability of error, associated with Ω*, namely, P_e(q, θ0, θ1, Ω*), will be denoted more simply by P_e(q, θ0, θ1). On substituting Ω* into the above definition of the probability of error, we obtain
$$\begin{aligned}
P_e(q,\theta_0,\theta_1) &= q\int_{\{x:\,q p(x|\theta_0) < (1-q)p(x|\theta_1)\}} p(x|\theta_0)\,\mathrm{d}x + (1-q)\int_{\{x:\,q p(x|\theta_0) \ge (1-q)p(x|\theta_1)\}} p(x|\theta_1)\,\mathrm{d}x \\
&= \int_{\{x:\,q p(x|\theta_0) < (1-q)p(x|\theta_1)\}} q\, p(x|\theta_0)\,\mathrm{d}x + \int_{\{x:\,q p(x|\theta_0) \ge (1-q)p(x|\theta_1)\}} (1-q)\, p(x|\theta_1)\,\mathrm{d}x \\
&= \int_{\{x:\,q p(x|\theta_0) < (1-q)p(x|\theta_1)\}} \min\{q\, p(x|\theta_0),\, (1-q)\, p(x|\theta_1)\}\,\mathrm{d}x + \int_{\{x:\,q p(x|\theta_0) \ge (1-q)p(x|\theta_1)\}} \min\{q\, p(x|\theta_0),\, (1-q)\, p(x|\theta_1)\}\,\mathrm{d}x \\
&= \int_{\mathcal{X}^n} \min\{q\, p(x|\theta_0),\, (1-q)\, p(x|\theta_1)\}\,\mathrm{d}x \\
&= 1 - \int_{\mathcal{X}^n} \max\{q\, p(x|\theta_0),\, (1-q)\, p(x|\theta_1)\}\,\mathrm{d}x.
\end{aligned}\tag{8}$$
The expression of the second to the last line will appear in some of our lower bounds in the sequel and will be recognized and interpreted as P e ( q , θ 0 , θ 1 ) . The expression of the last line is the one that extends to multiple hypothesis testing, as will be detailed below.
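As a concrete illustration of the last two integral expressions, the following short Python sketch (illustrative only; the paper's own calculations were carried out in MATLAB) numerically evaluates P_e(q, θ0, θ1) as the integral of min{q p(x|θ0), (1−q) p(x|θ1)} for a single Gaussian observation with unit variance, where the means 0 and 1 are arbitrary illustrative choices, and then maximizes it over the prior q.

```python
# Illustrative sketch (not from the paper): P_e(q, theta_0, theta_1) as the
# integral of min{q p(x|theta_0), (1-q) p(x|theta_1)} for one Gaussian sample.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def p0(x):
    return norm.pdf(x, loc=0.0, scale=1.0)   # hypothetical density under H0

def p1(x):
    return norm.pdf(x, loc=1.0, scale=1.0)   # hypothetical density under H1

def pe(q):
    # P_e(q, theta_0, theta_1) = int min{q p0(x), (1-q) p1(x)} dx
    val, _ = quad(lambda x: np.minimum(q * p0(x), (1.0 - q) * p1(x)), -10.0, 10.0)
    return val

# P_e is concave in q and vanishes at q = 0 and q = 1, so an interior maximum exists.
res = minimize_scalar(lambda q: -pe(q), bounds=(1e-6, 1.0 - 1e-6), method="bounded")
print("maximizing q:", res.x, " max P_e:", -res.fun)
```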
The auxiliary problem of binary hypothesis testing extends from two hypotheses to a general number, m, of hypotheses, associated with m test points, θ 0 , θ 1 , , θ m 1 , in the following manner. Under hypothesis H i , the observation vector X is governed by p ( · | θ i ) and the a priori probability of H i being the underlying true hypothesis, Pr { H i } , is denoted q i , i = 0 , 1 , , m 1 , where q 0 , q 1 , , q m 1 are given non-negative numbers summing to unity. Here, a decision rule Ω = ( Ω 0 , Ω 1 , , Ω m 1 ) is a partition of X n into m disjoint regions such that if x Ω i , we decide in favor of H i , i = 0 , 1 , , m 1 . The probability of error associated with Ω , θ 0 , θ 1 , , θ m 1 and q = ( q 0 , q 1 , , q m 1 ) is defined as
$$P_e(q,\theta_0,\theta_1,\ldots,\theta_{m-1},\Omega) = \sum_{i=0}^{m-1} q_i \int_{\Omega_i^c} p(x|\theta_i)\,\mathrm{d}x,$$
where Ω i c is complementary to Ω i . The optimal MAP decision rule Ω * selects the hypothesis H i whose index i maximizes the product q i · p ( x | θ i ) among all i { 0 , 1 , , m 1 } , for the given x, where ties are broken arbitrarily. The probability of error, associated with Ω * , denoted P e ( q , θ 0 , θ 1 , , θ m 1 ) , is well known to be given by
$$P_e(q,\theta_0,\theta_1,\ldots,\theta_{m-1}) = 1 - \int_{\mathcal{X}^n} \max\{q_0\, p(x|\theta_0),\, q_1\, p(x|\theta_1),\, \ldots,\, q_{m-1}\, p(x|\theta_{m-1})\}\,\mathrm{d}x.$$
Note that for m > 2 , this is different from the expression
$$\int_{\mathcal{X}^n} \min\{q_0\, p(x|\theta_0),\, q_1\, p(x|\theta_1),\, \ldots,\, q_{m-1}\, p(x|\theta_{m-1})\}\,\mathrm{d}x,$$
as the latter can be interpreted as the probability that the index i of the true hypothesis H i minimizes (rather than maximizes) the product q i p ( x | θ i ) over i { 0 , 1 , , m 1 } for the given x. Imagine an observer that, upon observing a realization x of the random vector X, creates a list of indices { i } with the k largest values of q i p ( x | θ i ) for some k < m , and an error is defined as the event where the correct i is not in that list. This is referred to as a list error, which is a term borrowed from the fields of coded communication and information theory. The last expression is the probability of list error for k = m 1 . We will encounter this expression later in certain versions of our lower bound. This completes the background needed about hypothesis testing.
As described in Section 1, our objective in this work is to derive relatively simple and easily computable lower bounds to r ( θ ) , which are as tight as possible. While many existing lower bounds in the literature are satisfactory in terms of yielding the correct rate of convergence, 1 / ζ n * , here we wish to improve the bound on the constant factor, r ( θ ) . Many of our examples involve numerical calculations which include optimization over auxiliary parameters and occasionally also numerical integrations. All these calculations were carried out using MATLAB R2019a (9.6.0.1072779) 64-bit (win64).

3. Lower Bounds for Convex Symmetric Loss Functions

As explained earlier, here and in Section 4, we consider a scalar parameter, namely, d = 1 .
Theorem 1.
Let the assumptions of Section 2 be satisfied for d = 1 and let ρ ( · ) be a symmetric convex loss function. Then,
$$R_n(\Theta) \ge \sup_{\theta_0,\theta_1\in\Theta} 2\cdot\rho\!\left(\frac{\theta_1-\theta_0}{2}\right)\cdot \sup_{0\le q\le 1} P_e(q,\theta_0,\theta_1),$$
where P e ( q , θ 0 , θ 1 ) is defined as in Equation (8).
Proof of Theorem 1.
For every θ 0 , θ 1 Θ and q [ 0 , 1 ] ,
$$\begin{aligned}
R_n(\Theta) &\ge q\, E_{\theta_0}\{\rho(g_n[X]-\theta_0)\} + (1-q)\, E_{\theta_1}\{\rho(g_n[X]-\theta_1)\} \\
&\overset{(a)}{=} \int_{\mathbb{R}^n} \big[q\cdot p(x|\theta_0)\,\rho(g_n[x]-\theta_0) + (1-q)\cdot p(x|\theta_1)\,\rho(\theta_1-g_n[x])\big]\,\mathrm{d}x \\
&\ge 2\int_{\mathbb{R}^n} \min\{q\cdot p(x|\theta_0),\, (1-q)\cdot p(x|\theta_1)\}\times\left[\frac{1}{2}\rho(g_n[x]-\theta_0) + \frac{1}{2}\rho(\theta_1-g_n[x])\right]\mathrm{d}x \\
&\overset{(b)}{\ge} 2\int_{\mathbb{R}^n} \min\{q\cdot p(x|\theta_0),\, (1-q)\cdot p(x|\theta_1)\}\times\rho\!\left(\frac{g_n[x]-\theta_0}{2} + \frac{\theta_1-g_n[x]}{2}\right)\mathrm{d}x \\
&= 2\cdot\rho\!\left(\frac{\theta_1-\theta_0}{2}\right)\cdot\int_{\mathbb{R}^n} \min\{q\cdot p(x|\theta_0),\, (1-q)\cdot p(x|\theta_1)\}\,\mathrm{d}x \\
&= 2\cdot\rho\!\left(\frac{\theta_1-\theta_0}{2}\right)\cdot P_e(q,\theta_0,\theta_1),
\end{aligned}$$
where (a) is due to the assumed symmetry of ρ ( · ) and (b) is by its assumed convexity. Since the inequality,
$$R_n(\Theta) \ge 2\cdot\rho\!\left(\frac{\theta_1-\theta_0}{2}\right)\cdot P_e(q,\theta_0,\theta_1)\tag{11}$$
applies to every θ 0 , θ 1 Θ and q [ 0 , 1 ] , it applies, in particular, also to the supremum over these auxiliary parameters. This completes the proof of Theorem 1. □
Before we proceed, two comments are in order:
  • Note that P e ( q , θ 0 , θ 1 ) is a concave function of q for fixed ( θ 0 , θ 1 ) , as it can be presented as the minimum among a family of affine functions of q, given by min Ω [ q · P ( Ω | θ 0 ) + ( 1 q ) · P ( Ω c | θ 1 ) ] , where Ω runs over all possible subsets of the observation space, X n . Another way to see why this is true is by observing that P e ( q , θ 0 , θ 1 ) is given in the second to the last line of (8) by an integral, whose integrand, min { q · p ( x | θ 0 ) , ( 1 q ) · p ( x | θ 1 ) } , is concave in q. Clearly, P e ( q , θ 0 , θ 1 ) = 0 whenever q = 0 or q = 1 . Thus, P e ( q , θ 0 , θ 1 ) is maximized by some q between 0 and 1. If P e ( q , θ 0 , θ 1 ) is strictly concave in q, then the maximizing q is unique.
  • Note that the lower bound (11) is tighter than the lower bound ρ((θ1−θ0)/2)·P_e(1/2, θ0, θ1), which was obtained in Equations (6)–(9a) in [18], both because of the factor of 2 and because of the freedom to optimize q rather than setting q = 1/2. In a further development of [18], the factor of 2 was obtained too, but at the price of assuming that the density of the estimation error is symmetric about the origin (see the discussion after (10) therein), which limits the class of estimators to which the bound applies. The factor of 2 and the degree of freedom q are also the two ingredients that make the difference between (11) and the lower bound due to Le Cam [20] (see also [26,28]). In Chapter 2 in [26], Guntuboyina reviews standard bounding techniques, including those of Le Cam, Assouad, and Fano. In particular, in Example 2.3.2 therein, Guntuboyina presents a lower bound in terms of the error probability associated with general priors, given by (η/2)·P_e(q, θ0, θ1), where in the case of two hypotheses, η = min_ϑ{ρ(θ0−ϑ) + ρ(θ1−ϑ)}, in our notation. Now, if ρ is symmetric and monotonically non-decreasing in the absolute error, then the minimizing ϑ is (θ0+θ1)/2, which yields η/2 = ρ((θ1−θ0)/2), and so, again, the resulting bound is of the same form as (11), except that it lacks the prefactor of 2.
Our first example demonstrates Theorem 1 on a somewhat technical but simple model, with an emphasis on the point that the optimal q may differ from 1 / 2 and that it is therefore useful to maximize w.r.t. q in order to improve the bound relative to the choice q = 1 / 2 .
Example 1.
Let X be a random variable distributed exponentially according to
$$p(x|\theta) = \theta e^{-\theta x}, \qquad x\ge 0,$$
and Θ = {1, 2}, so that the only possibility to select two different values of θ in the lower bound is θ0 = 1 and θ1 = 2. In terms of the hypothesis testing problem pertaining to the lower bound, the likelihood ratio test (LRT) is by comparison of q e^{−x} to (1−q)·2e^{−2x}. Now, if 2(1−q) ≤ q, or equivalently, 2/3 ≤ q ≤ 1, the decision is always in favor of H0, and then P_e(q,1,2) = 1−q. For 0 ≤ q < 2/3, the optimal LRT compares X to x0(q) = ln[2(1−q)/q]. If X > x0(q), one decides in favor of H0; otherwise, one decides in favor of H1. Thus,
$$P_e(q,1,2) = q\int_0^{x_0(q)} e^{-x}\,\mathrm{d}x + (1-q)\int_{x_0(q)}^\infty 2e^{-2x}\,\mathrm{d}x = q\big[1-e^{-x_0(q)}\big] + (1-q)e^{-2x_0(q)} = q\left[1-\frac{q}{2(1-q)}\right] + (1-q)\left[\frac{q}{2(1-q)}\right]^2 = \frac{q(4-5q)}{4(1-q)}.$$
In summary,
$$P_e(q,1,2) = \begin{cases} \dfrac{q(4-5q)}{4(1-q)}, & 0\le q < \tfrac{2}{3} \\[2mm] 1-q, & \tfrac{2}{3}\le q\le 1. \end{cases}$$
It turns out that for q = 1/2, P_e(1/2,1,2) = 3/8 = 0.375, whereas the maximum is 0.382, attained at q = 1 − √5/5 ≈ 0.5528. Thus,
$$R_n(\Theta) \ge 2\cdot 0.382\cdot\rho\!\left(\frac{2-1}{2}\right) = 0.764\cdot\rho\!\left(\frac{1}{2}\right).$$
This concludes Example 1.
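The closed-form expression above is easy to check numerically. The following sketch (illustrative only; the paper's calculations were carried out in MATLAB) maximizes P_e(q, 1, 2) over q and recovers the values 0.375, 0.382, and 0.5528 quoted in Example 1.

```python
# Numerical check of Example 1 (a sketch, not code from the paper).
import numpy as np
from scipy.optimize import minimize_scalar

def pe(q):
    # piecewise expression derived above for p(x|theta) = theta*exp(-theta*x)
    if q >= 2.0 / 3.0:
        return 1.0 - q
    return q * (4.0 - 5.0 * q) / (4.0 * (1.0 - q))

res = minimize_scalar(lambda q: -pe(q), bounds=(0.0, 1.0), method="bounded")
print(pe(0.5))           # 0.375
print(res.x, -res.fun)   # approx 0.5528 and approx 0.382
print(2 * -res.fun)      # prefactor 2*0.382 = 0.764 multiplying rho(1/2)
```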
In the above example, we considered just one observation, n = 1 . From now on, we will refer to the case where n 1 . In particular, the following simple corollary to Theorem 1 yields a local asymptotic minimax lower bound.
Corollary 1.
For a given θ Θ and a constant s, let { ξ n } n 1 denote a positive sequence tending to zero with the property that
$$\lim_{n\to\infty}\,\max_q P_e(q, \theta, \theta + 2s\xi_n)$$
exists and is given by a strictly positive constant, which will be denoted by P e ( θ , s ) . Also, let
$$\omega(s) = \lim_{u\to 0} \frac{\rho(s\cdot u)}{\rho(u)}.\tag{19}$$
Then, the local asymptotic minimax performance w.r.t. ζ n = 1 / ρ ( ξ n ) is lower bounded by
$$r(\theta) \ge \sup_{s\in\mathbb{R}} 2\,\omega(s)\cdot P_e(\theta, s).\tag{20}$$
Corollary 1 is readily obtained from Theorem 1 by substituting θ 0 = θ and θ 1 = θ + 2 s ξ n in Equation (11), then multiplying both sides of the inequality by ζ n = 1 / ρ ( ξ n ) , and finally, taking the limit inferior of both sides.
Next, we study a few examples of the use of Corollary 1. As in Example 1, we emphasize again in Example 2 below the importance of having the degree of freedom to maximize over the prior q rather than to fix q = 1 2 . Also, in all the examples that were examined, the rate of convergence, 1 / ζ n = ρ ( ξ n ) , is the same as the optimal rate of convergence, 1 / ζ n * . In other words, it is tight in the sense that there exists an estimator (for example, the maximum likelihood estimator) for which R n ( θ , g n ) tends to zero at the same rate. In some of these examples, we compare our lower bound to r ( θ ) to those of earlier reported results on the same models.
Example 2.
Let X1, …, Xn be independently, identically distributed (i.i.d.) random variables, uniformly distributed in the range [0, θ]. In the corresponding hypothesis testing problem of Theorem 1, the hypotheses are θ = θ0 and θ = θ1 > θ0 with priors q and 1−q. There are two cases: If q/θ0^n < (1−q)/θ1^n, or equivalently, q < θ0^n/(θ0^n + θ1^n), one always decides in favor of H1, and so the probability of error is q. If, on the other hand, q/θ0^n > (1−q)/θ1^n, namely, q > θ0^n/(θ0^n + θ1^n), we decide in favor of H1 whenever max_i Xi > θ0, and then an error occurs only if H1 is true, yet max_i Xi < θ0, which happens with probability (1−q)(θ0/θ1)^n. Thus,
$$P_e(q,\theta_0,\theta_1) = \begin{cases} q, & q < \dfrac{\theta_0^n}{\theta_0^n+\theta_1^n} \\[2mm] (1-q)\left(\dfrac{\theta_0}{\theta_1}\right)^{\!n}, & q \ge \dfrac{\theta_0^n}{\theta_0^n+\theta_1^n} \end{cases} \;=\; \min\left\{q,\; (1-q)\left(\frac{\theta_0}{\theta_1}\right)^{\!n}\right\},$$
which is readily seen to be maximized by q = θ0^n/(θ0^n + θ1^n), and then
$$\max_q P_e(q,\theta_0,\theta_1) = \frac{\theta_0^n}{\theta_0^n+\theta_1^n}.$$
Now, to apply Corollary 1, we let θ 0 = θ and θ 1 = θ 0 ( 1 + 2 σ / n ) , which amounts to s = θ 0 σ = θ σ and ξ n = 1 / n . Then,
$$P_e(\theta, s) = \lim_{n\to\infty} \frac{1}{1 + [1 + 2s/(\theta n)]^n} = \frac{1}{1 + e^{2s/\theta}}.$$
In the case of the MSE criterion, ρ ( ε ) = ε 2 , we have ω ( s ) = s 2 , and so,
$$r(\theta) \ge \sup_{s\ge 0} \frac{2s^2}{1 + e^{2s/\theta}} = \theta^2\cdot\sup_{u\ge 0} \frac{u^2}{2(1 + e^u)} \approx 0.2414\,\theta^2$$
w.r.t. ζ n = 1 / ρ ( ξ n ) = 1 / ( 1 / n ) 2 = n 2 . This bound will be further improved upon in Section 4.
If instead of maximizing w.r.t. q, we select q = 1 / 2 , then
$$P_e\!\left(\frac{1}{2},\, \theta,\, \theta\Big(1 + \frac{2\sigma}{n}\Big)\right) = \frac{1}{2}\cdot\left[\frac{\theta}{\theta(1 + 2\sigma/n)}\right]^n \to \frac{1}{2}\cdot e^{-2\sigma} = \frac{1}{2}\cdot e^{-2s/\theta},$$
and then the resulting bound would become
$$r(\theta) \ge \sup_{s\ge 0} s^2 e^{-2s/\theta} = \theta^2\cdot\sup_{u\ge 0} \frac{u^2 e^{-u}}{4} \approx 0.1353\,\theta^2$$
w.r.t. ζ n = n 2 . Therefore, the maximization over q plays an important role here in terms of tightening the lower bound to r ( θ ) .
More generally, for ρ ( ε ) = | ε | t ( t 1 ), ω ( s ) = | s | t , and we obtain
$$r(\theta) \ge \theta^t\cdot\sup_{u\ge 0} \frac{2u^t}{1 + e^{2u}},$$
w.r.t. ζ n = n t , where the supremum, which is in fact a maximum, can always be calculated numerically for every given t. For large t, the maximizing u is approximately t / 2 , which yields
$$r(\theta) \ge \frac{(t\theta)^t}{2^{t-1}(1 + e^t)}.$$
On the other hand, for q = 1 / 2 , we end up with
$$r(\theta) \ge \sup_{s\ge 0} s^t e^{-2s/\theta} = \left(\frac{t\theta}{2e}\right)^{\!t}.$$
For large t, the bound of q = 1 / 2 is inferior to the bound with the optimal q, by a factor of about 1 / 2 . This concludes Example 2.
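For completeness, the two MSE constants quoted in Example 2, 0.2414 (optimized prior) and 0.1353 (q = 1/2), can be verified with a few lines of one-dimensional numerical optimization; the following sketch is illustrative and not taken from the paper.

```python
# Illustrative sketch verifying the constants of Example 2 for t = 2 (MSE).
import numpy as np
from scipy.optimize import minimize_scalar

opt_q  = minimize_scalar(lambda u: -(u**2) / (2.0 * (1.0 + np.exp(u))),
                         bounds=(0.0, 20.0), method="bounded")
half_q = minimize_scalar(lambda u: -(u**2) * np.exp(-u) / 4.0,
                         bounds=(0.0, 20.0), method="bounded")
print(-opt_q.fun)   # approx 0.2414  (bound r(theta) >= 0.2414 theta^2, optimized q)
print(-half_q.fun)  # approx 0.1353  (bound for q = 1/2)
```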
Example 3.
Let X 1 , X 2 , , X n be i.i.d. random variables, uniformly distributed in the interval [ θ , θ + 1 ] . For the hypothesis testing problem, let θ 1 be chosen between θ 0 and θ 0 + 1 . Clearly, if min i X i < θ 1 , the underlying hypothesis is certainly H 1 . Likewise, if max i X i > θ 0 + 1 , the decision is in favor of H 0 with certainty. Thus, an error can occur only if all { X i } fall in the interval [ θ 1 , θ 0 + 1 ] , an event that occurs with probability ( θ 0 + 1 θ 1 ) n . In this event, the best to do is to select the hypothesis with the larger prior with a probability of error given by min { q , 1 q } . Thus,
$$P_e(q,\theta_0,\theta_1) = (\theta_0 + 1 - \theta_1)^n\cdot\min\{q,\, 1-q\},$$
and so,
$$\max_q P_e(q,\theta_0,\theta_1) = \frac{1}{2}\,\big[1 - (\theta_1-\theta_0)\big]^n,$$
achieved by q = 1 / 2 . Now, let us select ξ n = 1 / n , which yields
$$\lim_{n\to\infty}\,\max_q P_e(q,\theta_0,\theta_1) = \lim_{n\to\infty} \frac{1}{2}\cdot\left(1 - \frac{2s}{n}\right)^{\!n} = \frac{1}{2}\cdot e^{-2s}.$$
For ρ ( ε ) = | ϵ | t , ( t 1 ), we have
$$r(\theta) \ge \sup_{s\ge 0} 2s^t\cdot\frac{1}{2}e^{-2s} = \sup_{s\ge 0} s^t e^{-2s} = \left(\frac{t}{2e}\right)^{\!t}$$
w.r.t. ζn = 1/(1/n)^t = n^t. For the case of the MSE, t = 2, r(θ) ≥ e^{−2} ≈ 0.1353. The constant 0.1353 should be compared with (1 − 1/√2)/128 ≈ 0.0023 (see Example 4.9 in [35]), which is two orders of magnitude smaller. This concludes Example 3.
Example 4.
Let X i = θ + Z i , where { Z i } are i.i.d., Gaussian random variables with zero mean and variance σ 2 . Here, for the corresponding binary hypothesis testing problem, the optimal value of q is always q * = 1 2 . This can be readily seen from the concavity of P e ( q , θ 0 , θ 1 ) in q and its symmetry around q = 1 / 2 , as P e ( q , θ 0 , θ 1 ) = P e ( 1 q , θ 0 , θ 1 ) . Since
$$P_e\!\left(\frac{1}{2},\theta_0,\theta_1\right) = \Pr\left\{\sum_{i=1}^n Z_i \ge \frac{n(\theta_1-\theta_0)}{2}\right\} = \Pr\left\{\frac{1}{\sigma\sqrt{n}}\sum_{i=1}^n Z_i \ge \frac{\sqrt{n}\,(\theta_1-\theta_0)}{2\sigma}\right\} = Q\!\left(\frac{\sqrt{n}\,(\theta_1-\theta_0)}{2\sigma}\right),$$
where
$$Q(t) = \int_t^\infty \frac{e^{-u^2/2}\,\mathrm{d}u}{\sqrt{2\pi}},$$
we select ξn = 1/√n, which yields θ1 − θ0 = 2s/√n and
$$P_e(\theta, s) = Q\!\left(\frac{s}{\sigma}\right),$$
and then for the MSE case, ω ( s ) = s 2 ,
$$r(\theta) \ge \sup_{s\ge 0} 2s^2\, Q\!\left(\frac{s}{\sigma}\right) = \sigma^2\cdot\sup_{u\ge 0}\{2u^2 Q(u)\} \approx 0.3314\,\sigma^2$$
w.r.t. ζn = 1/(1/√n)² = n, and so, the asymptotic lower bound to R_n(θ, g_n) is 0.3314σ²/n.
We now compare this bound (which will be further improved in Section 4) with a few earlier reported results. In one of the versions of Le Cam's bound (see Example 4.7 in [35]) for the same model, the lower bound to r(θ) turns out to be σ²/24 ≈ 0.0417σ², namely, an order of magnitude smaller. Also, in Example 3.1 in [28], another version of Le Cam's method yields r(θ) ≥ (1 − 1/√2)σ²/8 ≈ 0.0366σ². According to Corollary 4.3 in [36], r(θ) ≥ σ²/(8e) ≈ 0.046σ². Yet another comparison is with Theorem 5.9 in [37], where we find an inequality, which in our notation reads as follows:
$$\sup_{\theta\in\Theta} P_\theta\!\left((g_n[X]-\theta)^2 \ge \frac{2\alpha\sigma^2}{n}\right) \ge \frac{1}{2} - \alpha, \qquad \alpha\in\left[0,\frac{1}{2}\right].$$
Combining it with Chebychev’s inequality yields
$$\sup_{\theta\in\Theta} E_\theta(g_n[X]-\theta)^2 \ge \frac{\sigma^2}{n}\cdot\max_{0\le\alpha\le 1/2} \alpha(1-2\alpha) = \frac{0.125\,\sigma^2}{n}.$$
In [23] (p. 257), it is shown that when Θ = R , the minimax estimator for this model is the sample mean, and so, in this case, the correct constant in front of σ 2 is actually 1. This concludes Example 4.
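The constant sup_u 2u²Q(u) ≈ 0.3314 appearing in Example 4 (and again in Examples 5 and 6) is a one-dimensional maximization that can be checked as follows; the snippet is an illustrative sketch (not from the paper), with the Gaussian tail function Q taken as scipy's norm.sf.

```python
# Illustrative sketch for Example 4: sup_u 2 u^2 Q(u) ~ 0.3314.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

res = minimize_scalar(lambda u: -2.0 * u**2 * norm.sf(u),
                      bounds=(0.0, 10.0), method="bounded")
print(res.x, -res.fun)  # u* approx 1.19, value approx 0.3314
```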
The next example is related to Example 4, as it is based on the use of the central limit theorem (CLT), which means that the Gaussian tail distribution is used here too.
Example 5.
Consider an exponential family,
$$p(x|\theta) = \prod_{i=1}^n p(x_i|\theta) = \prod_{i=1}^n \frac{e^{\theta T(x_i)}}{Z(\theta)} = \frac{\exp\left\{\theta\sum_{i=1}^n T(x_i)\right\}}{Z^n(\theta)},$$
where T ( · ) is a given function and Z ( θ ) is a normalization function given by
$$Z(\theta) = \int_{\mathbb{R}} e^{\theta T(x)}\,\mathrm{d}x,$$
assuming that the integral converges. In the auxiliary binary hypothesis testing problem, the test statistic is ∑_{i=1}^n T(X_i). If q = 1/2, ξn = 1/√n and θ1 − θ0 = 2s/√n, the LRT amounts to examining whether ∑_{i=1}^n [T(X_i) − E_θ{T(X_i)}] is larger than
$$\frac{s}{\sqrt{n}}\cdot n\,\frac{\mathrm{d}^2\ln Z(\theta)}{\mathrm{d}\theta^2} = s\sqrt{n}\cdot\frac{\mathrm{d}^2\ln Z(\theta)}{\mathrm{d}\theta^2}.$$
In this case, the probability of error can be asymptotically assessed using the CLT, which after a simple algebraic manipulation, becomes:
$$P_e(\theta, s) = Q\!\left(\frac{s\,\mathrm{d}^2\ln Z(\theta)/\mathrm{d}\theta^2}{\sqrt{\mathrm{d}^2\ln Z(\theta)/\mathrm{d}\theta^2}}\right) = Q\!\left(s\sqrt{\frac{\mathrm{d}^2\ln Z(\theta)}{\mathrm{d}\theta^2}}\right) = Q\!\left(s\sqrt{I(\theta)}\right),$$
where I ( θ ) is the Fisher information. Thus, for the MSE,
$$r(\theta) \ge \sup_{s\ge 0} 2s^2\, Q\!\left(s\sqrt{I(\theta)}\right) \approx \frac{0.3314}{I(\theta)}$$
w.r.t. ζ n = 1 / ( 1 / n ) 2 = n . This concludes Example 5.
In several earlier examples, our bound was shown to outperform (sometimes significantly so) earlier reported bounds for the corresponding models. However, to be honest and fair, we should not ignore the fact that there are also situations where our bound may not be tighter than earlier bounds applied to the same model. Such a case is demonstrated in Example 6 below, where our result is compared to those of Ziv and Zakai [18] and Chazan, Zakai, and Ziv [4] in the context of estimating the delay of a known continuous-time signal corrupted by additive white Gaussian noise (for further developments and applications of the Ziv–Zakai and the Chazan–Zakai–Ziv bounds, see, e.g., [38,39,40,41,42,43,44,45] and references therein). Having said that, it should also be kept in mind that our emphasis in this work is on bounds that are relatively simple and easy to calculate (at least numerically), whereas the Chazan–Zakai–Ziv bound, although very tight in many situations, is notoriously difficult to calculate. Indeed, in this setting, the explicit behavior of the resulting complicated bound of [4] is clearly transparent only at high values of the signal-to-noise ratio (SNR)—see Equations (11), (12), and (14) in [4].
Example 6.
Let X ( t ) = s ( t , θ ) + N ( t ) , t [ 0 , T ] , where N ( t ) is additive white Gaussian noise (AWGN) with double-sided spectral density N 0 and s ( t , θ ) is a deterministic signal that depends on the unknown parameter, θ. It is assumed that the signal energy, E = 0 T s 2 ( t , θ ) d t , does not depend on θ (which is the case, for example, when θ is a delay parameter of a pulse fully contained in the observation interval, or when θ is the frequency or the phase of a sinusoidal waveform). We further assume that s ( t , θ ) is at least twice differentiable w.r.t. θ, and that the energies of the first two derivatives are also independent of θ. Then, as shown in Appendix A, for small | θ 1 θ 0 | ,
$$\varrho(\theta_0,\theta_1) = \frac{1}{E}\int_0^T s(t,\theta_0)\,s(t,\theta_1)\,\mathrm{d}t = 1 - \frac{(\theta_1-\theta_0)^2}{2E}\int_0^T \left[\frac{\partial s(t,\theta)}{\partial\theta}\bigg|_{\theta=\theta_0}\right]^2\mathrm{d}t + o(|\theta_1-\theta_0|^2) = 1 - \frac{(\theta_1-\theta_0)^2\,\dot{E}}{2E} + o(|\theta_1-\theta_0|^2),$$
where Ė is the energy of ṡ(t,θ) = ∂s(t,θ)/∂θ.
The optimal LRT in deciding between the two hypotheses is based on the comparison between the correlations, 0 T X ( t ) s ( t , θ 0 ) d t and 0 T X ( t ) s ( t , θ 1 ) d t . Again, the optimal value of q is q * = 1 / 2 . Thus,
$$P_e\!\left(\frac{1}{2},\theta_0,\theta_1\right) = Q\!\left(\sqrt{\frac{E}{2N_0}\big(1-\varrho(\theta_0,\theta_1)\big)}\right) \approx Q\!\left(\sqrt{\frac{E\dot{E}(\theta_1-\theta_0)^2}{4EN_0}}\right) = Q\!\left(\sqrt{\frac{\dot{E}}{4N_0}}\cdot|\theta_1-\theta_0|\right) = Q\!\left(\sqrt{\frac{\dot{P}}{4N_0}}\cdot\sqrt{T}\,|\theta_1-\theta_0|\right),$$
where Ṗ = Ė/T is the power of ṡ(t,θ). Since we are dealing here with continuous time, instead of a sequence ξn, we use a function, ξ(T), of the observation time, T, which in this case would be ξ(T) = 1/√T. Let θ0 = θ and θ1 = θ + 2s/√T. Then,
$$P_e(\theta, s) = Q\!\left(\sqrt{\frac{\dot{P}}{N_0}}\cdot|s|\right),$$
which, for the MSE case, yields
$$r(\theta) \ge \sup_{s\ge 0} 2s^2\, Q\!\left(s\sqrt{\frac{\dot{P}}{N_0}}\right) = \frac{N_0}{\dot{P}}\,\sup_{u\ge 0}\{2u^2 Q(u)\} \approx 0.3314\,\frac{N_0}{\dot{P}}$$
w.r.t. ζ(T) = 1/(1/√T)² = T, which means that the minimax loss is lower bounded by r(θ)/T ≈ 0.3314 N0/Ė. This has the same form as the Cramér–Rao lower bound (CRLB), except that the multiplicative factor is 0.3314 rather than 1. In Equation (20) in [18], the bound is of the same form, but with a multiplicative constant of 0.16 for a high signal-to-noise ratio (SNR). However, in [4], the constant of proportionality was improved to 1 in the high SNR limit, just like in the Cramér–Rao lower bound for the same model. The constant 0.3314 will be improved later to 0.4549 (under Example 9 in the sequel), but it will still be below 1.
The case where s ( t , θ ) is not everywhere differentiable w.r.t. θ can be handled in a similar manner, but some caution should be exercised. For example, consider the model,
$$X(t) = s(t-\theta) + N(t),$$
where −∞ < t < ∞, θ ∈ [0, T], N(t) is AWGN as before, and s(·) is a rectangular pulse with duration Δ and amplitude √(E/Δ), E being the signal energy. Here, ϱ(θ, θ+δ) = 1 − |δ|/Δ; namely, it also includes a linear term in |δ|, not just the quadratic one. This changes the asymptotic behavior of the resulting lower bound to r(θ), which turns out to be 0.7544(N0Δ/P)² w.r.t. T² (namely, a minimax lower bound of 0.7544(N0Δ/E)²). It is interesting to compare this bound to the Chapman–Robbins bound for the same model, which is a local bound of the same form but with a multiplicative constant of 0.1602 instead of 0.7544, and which is limited to unbiased estimators. The Chazan–Zakai–Ziv bound [4] for this case is difficult to calculate, but in the high SNR regime, it behaves like 3(N0Δ/P)² w.r.t. T².
It is conceivable that the Chazan–Zakai–Ziv bound for estimating the delay of non-differentiable signals in Gaussian white noise has been improved even further ever since its publication in 1975, but the point of this particular example remains: our bound may not always be the best available bound. Still, since our bound is easy to calculate, it is worth comparing it to other bounds for every given model. This concludes Example 6.

4. Bounds Based on the Minimum Expected Loss over Some Test Points

In this section, we derive our second family of bounds for the scalar case, d = 1 .

4.1. Two Test Points

The following generic, yet conceptually very simple, lower bound assumes neither symmetry nor convexity of the loss function ρ ( · ) . For a given ( x , q , θ 0 , θ 1 ) R n × [ 0 , 1 ] × Θ 2 , let us define
$$\psi(x, q, \theta_0, \theta_1) = \min_\vartheta\{q\, p(x|\theta_0)\,\rho(\vartheta-\theta_0) + (1-q)\, p(x|\theta_1)\,\rho(\vartheta-\theta_1)\}.$$
Then,
$$\begin{aligned}
R_n(\Theta) &\ge \sup_{\theta_0,\theta_1,q}\big[q\, E_{\theta_0}\{\rho(g_n[X]-\theta_0)\} + (1-q)\, E_{\theta_1}\{\rho(g_n[X]-\theta_1)\}\big] \\
&= \sup_{\theta_0,\theta_1,q}\int_{\mathbb{R}^n}\big[q\, p(x|\theta_0)\,\rho(g_n[x]-\theta_0) + (1-q)\, p(x|\theta_1)\,\rho(g_n[x]-\theta_1)\big]\,\mathrm{d}x \\
&\ge \sup_{\theta_0,\theta_1,q}\int_{\mathbb{R}^n}\psi(x, q, \theta_0, \theta_1)\,\mathrm{d}x.
\end{aligned}\tag{50}$$
If we further assume the symmetry of ρ , then it is easy to see that the minimizer ϑ * , which achieves ψ ( x , q , θ 0 , θ 1 ) , is always within the interval [ θ 0 , θ 1 ] . This is because the objective increases monotonically as we move away from the interval [ θ 0 , θ 1 ] in either direction. Of course, this simple idea can easily be extended to apply to the weighted sums of more than two points, in principle, but it would become more complicated—see the next subsection for three such points.
If ρ is concave, then the minimizing ϑ is either θ 0 or θ 1 , depending on which of q p ( x | θ 0 ) and ( 1 q ) p ( x | θ 1 ) is smaller, and the bound becomes
$$R_n(\Theta) \ge \sup_{\theta_0,\theta_1,q}\rho(\theta_1-\theta_0)\, P_e(q,\theta_0,\theta_1).$$
Minimality at the edge points may also happen for some loss functions that are not concave, like the loss function ρ(u) = 1{|u| ≥ Δ}.
The generic lower bound (50) is more general than our first bound in the sense that it does not require convexity or symmetry of ρ , but the downside is that the resulting expressions are harder to deal with directly, as will be seen shortly. For loss functions other than the MSE or general moments of the estimation error, it may not be a trivial task even to derive a closed-form expression of ψ ( x , q , θ 0 , θ 1 ) (i.e., to carry out the minimization associated with the definition of ψ ).
For the case of the MSE, ρ ( ε ) = ε 2 , the calculation of ψ ( x , q , θ 0 , θ 1 ) is straightforward, and it readily yields
$$\psi(x,q,\theta_0,\theta_1) = (\theta_1-\theta_0)^2\cdot\frac{q\, p(x|\theta_0)\cdot(1-q)\, p(x|\theta_1)}{q\, p(x|\theta_0) + (1-q)\, p(x|\theta_1)}.$$
However, it may not be convenient to integrate this function of x due to the summation at the denominator. One way to alleviate this difficulty is to observe that
$$\psi(x,q,\theta_0,\theta_1) \ge (\theta_1-\theta_0)^2\cdot\frac{q\, p(x|\theta_0)\cdot(1-q)\, p(x|\theta_1)}{2\max\{q\, p(x|\theta_0),\, (1-q)\, p(x|\theta_1)\}} = \frac{1}{2}\cdot(\theta_1-\theta_0)^2\cdot\min\{q\, p(x|\theta_0),\, (1-q)\, p(x|\theta_1)\},\tag{53}$$
which after integration yields again
$$R_n(\Theta) \ge \sup_{\theta_0,\theta_1,q}\frac{1}{2}(\theta_1-\theta_0)^2\cdot P_e(q,\theta_0,\theta_1),$$
exactly as in Theorem 1 in the special case of the MSE. This indicates that the bound (50) is at least as tight as the bound of Theorem 1 for the MSE.
It turns out, however, that we can do better than bounding the denominator, q p ( x | θ 0 ) + ( 1 q ) p ( x | θ 1 ) , by 2 · max { q p ( x | θ 0 ) , ( 1 q ) p ( x | θ 1 ) } for the purpose of obtaining a more convenient integrand. To this end, we invoke the following lemma, whose proof appears in Appendix B.
Lemma 1.
Let k be a positive integer and let a 1 , , a k be positive reals. Then,
$$\sum_{i=1}^k a_i = \inf_{(r_1,\ldots,r_k)\in\mathcal{S}}\max\left\{\frac{a_1}{r_1},\ldots,\frac{a_k}{r_k}\right\},$$
where S is the interior of the k-dimensional simplex, namely, the set of all vectors ( r 1 , , r k ) with strictly positive components that sum to unity.
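A quick numerical illustration of Lemma 1 (a sketch, not from the paper): the infimum is attained at r_i = a_i/∑_j a_j, where all the ratios a_i/r_i are equal to the sum.

```python
# Illustrative check of Lemma 1 for arbitrary positive reals.
import numpy as np

a = np.array([0.3, 1.2, 2.5])          # arbitrary positive reals
r_star = a / a.sum()                   # weights attaining the infimum
print(a.sum(), np.max(a / r_star))     # both equal 4.0
r = np.array([0.2, 0.3, 0.5])          # any other point of the simplex
print(np.max(a / r))                   # larger than 4.0, as the lemma asserts
```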
Applying Lemma 1 with the assignments k = 2 , a 1 = q p ( x | θ 0 ) , and a 2 = ( 1 q ) p ( x | θ 1 ) , we have
$$q\, p(x|\theta_0) + (1-q)\, p(x|\theta_1) = \inf_{r\in(0,1)}\max\left\{\frac{q\, p(x|\theta_0)}{r},\ \frac{(1-q)\, p(x|\theta_1)}{1-r}\right\}.$$
Thus,
$$\begin{aligned}
\psi(x,q,\theta_0,\theta_1) &= (\theta_1-\theta_0)^2\cdot\frac{q\, p(x|\theta_0)\cdot(1-q)\, p(x|\theta_1)}{\inf_{r\in(0,1)}\max\{q\, p(x|\theta_0)/r,\ (1-q)\, p(x|\theta_1)/(1-r)\}} \\
&= (\theta_1-\theta_0)^2\cdot\sup_{r\in(0,1)}\frac{q\, p(x|\theta_0)\cdot(1-q)\, p(x|\theta_1)}{\max\{q\, p(x|\theta_0)/r,\ (1-q)\, p(x|\theta_1)/(1-r)\}} \\
&= (\theta_1-\theta_0)^2\cdot\sup_{r\in(0,1)}\begin{cases} r(1-q)\, p(x|\theta_1), & r\le r^* \\ (1-r)\, q\, p(x|\theta_0), & r\ge r^* \end{cases} \\
&= (\theta_1-\theta_0)^2\cdot\sup_{r\in(0,1)}\min\{r(1-q)\, p(x|\theta_1),\ (1-r)\, q\, p(x|\theta_0)\},
\end{aligned}$$
where
$$r^* = \frac{q\, p(x|\theta_0)}{q\, p(x|\theta_0) + (1-q)\, p(x|\theta_1)}.$$
Thus, the bound becomes
$$\begin{aligned}
R_n(\Theta) &\ge \sup_{(\theta_0,\theta_1,q)\in\Theta^2\times(0,1)} (\theta_1-\theta_0)^2\int_{\mathbb{R}^n}\sup_{r\in(0,1)}\min\{r(1-q)\, p(x|\theta_1),\ (1-r)\, q\, p(x|\theta_0)\}\,\mathrm{d}x \\
&\ge \sup_{(\theta_0,\theta_1,q)\in\Theta^2\times(0,1)}\ \sup_{r\in(0,1)} (\theta_1-\theta_0)^2\int_{\mathbb{R}^n}\min\{r(1-q)\, p(x|\theta_1),\ (1-r)\, q\, p(x|\theta_0)\}\,\mathrm{d}x \\
&= \sup_{(\theta_0,\theta_1,q,r)\in\Theta^2\times(0,1)^2} (\theta_1-\theta_0)^2\cdot(q+r-2qr)\int_{\mathbb{R}^n}\min\left\{\frac{r(1-q)}{q+r-2qr}\, p(x|\theta_1),\ \frac{(1-r)q}{q+r-2qr}\, p(x|\theta_0)\right\}\mathrm{d}x \\
&= \sup_{(\theta_0,\theta_1,q,r)\in\Theta^2\times(0,1)^2} (\theta_1-\theta_0)^2\cdot(q+r-2qr)\cdot P_e\!\left(\frac{(1-r)q}{q+r-2qr},\theta_0,\theta_1\right).
\end{aligned}$$
The bound of Theorem 1 for the MSE is obtained as a special case of r = 1 / 2 . Therefore, after the optimization over the additional degree of freedom, r, the resulting bound cannot be worse than the MSE bound of Theorem 1. In fact, it may strictly improve as we will demonstrate shortly. The choice r = q gives a prior of 1 / 2 in the error probability factor, and then the maximum of the external factor, q + r 2 r q = 2 q ( 1 q ) , is maximized by q = 1 / 2 .
Example 7.
To demonstrate the new bound for the MSE, let us revisit Example 2 and see how it improves the multiplicative constant. In that example,
$$P_e\!\left(\frac{(1-r)q}{q+r-2qr},\theta_0,\theta_1\right) = \min\left\{\frac{(1-r)q}{q+r-2qr},\ \frac{r(1-q)}{q+r-2qr}\cdot\left(\frac{\theta_0}{\theta_1}\right)^{\!n}\right\}.$$
Let us denote α = (θ0/θ1)^n and recall that α ∈ (0,1), provided that we select θ1 > θ0. Then,
$$(q+r-2qr)\cdot P_e\!\left(\frac{(1-r)q}{q+r-2qr},\theta_0,\theta_1\right) = \min\{(1-r)q,\ r(1-q)\alpha\}.$$
The maximum w.r.t. q is attained when (1−r)q = r(1−q)α, namely, for q = q* = αr/[1 − (1−α)r], which yields
$$\max_q\ (q+r-2qr)\cdot P_e\!\left(\frac{(1-r)q}{q+r-2qr},\theta_0,\theta_1\right) = \frac{\alpha r(1-r)}{1-(1-\alpha)r}.$$
Let us denote β = 1 − α ∈ (0,1). The maximum of r(1−r)/(1−βr) is attained for
$$r = r^* = \frac{1-\sqrt{1-\beta}}{\beta} = \frac{1-\sqrt{\alpha}}{1-\alpha} = \frac{1}{1+\sqrt{\alpha}},$$
which yields
$$\max_{q,r}\ (q+r-2qr)\cdot P_e\!\left(\frac{(1-r)q}{q+r-2qr},\theta_0,\theta_1\right) = \sup_{r\in(0,1)}\frac{\alpha r(1-r)}{1-(1-\alpha)r} = \frac{\alpha(1-\sqrt{\alpha})^2}{(1-\alpha)^2} = \frac{\alpha}{(1+\sqrt{\alpha})^2}.$$
To obtain a local bound in the spirit of Corollary 1, take θ0 = θ and θ1 = θ(1 + s/(nθ)), which yields α → e^{−s/θ} in the limit of large n, and so,
$$r(\theta) \ge \sup_{s\ge 0}\frac{s^2 e^{-s/\theta}}{(1 + e^{-s/(2\theta)})^2} = \theta^2\cdot\sup_{u\ge 0}\frac{u^2 e^{-u}}{(1 + e^{-u/2})^2} \approx 0.3102\,\theta^2,$$
w.r.t. ζn = n², which improves on our earlier bound in Example 2, r(θ) ≥ 0.2414θ². This concludes Example 7.
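The constant 0.3102 of Example 7 is again a one-dimensional maximization; the following illustrative sketch verifies it.

```python
# Illustrative sketch verifying the constant of Example 7:
# sup_u u^2 e^{-u} / (1 + e^{-u/2})^2, reported above as approximately 0.3102.
import numpy as np
from scipy.optimize import minimize_scalar

f = lambda u: -(u**2) * np.exp(-u) / (1.0 + np.exp(-u / 2.0))**2
res = minimize_scalar(f, bounds=(0.0, 20.0), method="bounded")
print(res.x, -res.fun)  # u* approx 2.5, value approx 0.3102
```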
More generally, for general moments of the estimation error, a similar derivation yields the following:
Theorem 2.
For ρ ( ε ) = | ε | t , t 1 (not necessarily an integer),
$$R_n(\Theta) \ge \sup_{\theta_0,\theta_1,q,r}|\theta_1-\theta_0|^t\,\big[(1-r)^{t-1}q + r^{t-1}(1-q)\big]\cdot P_e\!\left(\frac{(1-r)^{t-1}q}{(1-r)^{t-1}q + r^{t-1}(1-q)},\theta_0,\theta_1\right).$$
Applying the local version of Theorem 2 to Example 7, we obtain
$$r(\theta) \ge \theta^t\cdot\sup_{s>0}\frac{s^t e^{-s}}{(1 + e^{-s/t})^t}$$
w.r.t. ζ n = n t . Changing the optimization variable from s to σ = s / t , we end up with
$$r(\theta) \ge (\theta t)^t\cdot\sup_{\sigma>0}\left(\frac{\sigma}{e^\sigma + 1}\right)^{\!t} \approx (0.2785\, t\theta)^t.$$
The factor of (0.2785t)^t should be compared with the corresponding factor (t/(2e))^t = (0.1839t)^t of Example 2, pertaining to the choice q = r = 1/2. The gap increases exponentially with t. For the maximum likelihood estimator pertaining to Example 2, which is g_n[X] = max_i X_i, it is easy to show that whenever t is an integer,
$$E_\theta\{|\hat{\theta}-\theta|^t\} = \frac{n!\,t!\,\theta^t}{(n+t)!} = \frac{t!\,\theta^t}{(n+1)(n+2)\cdots(n+t)},$$
and so, the asymptotic gap is between t! and (0.2785t)^t. Considering the Stirling approximation, the ratio between the upper bound and the lower bound is about √(2πt)·(1.3211)^t.
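The constant 0.2785 is the maximum of σ/(e^σ + 1) over σ > 0; a short sketch verifying it, together with the comparison constant 1/(2e) ≈ 0.1839 from the choice q = r = 1/2, is given below (illustrative only).

```python
# Illustrative sketch for the general-moment constant sup_sigma sigma/(e^sigma + 1).
import numpy as np
from scipy.optimize import minimize_scalar

res = minimize_scalar(lambda s: -s / (np.exp(s) + 1.0), bounds=(0.0, 10.0), method="bounded")
print(res.x, -res.fun)     # sigma* approx 1.2785, value approx 0.2785
print(1.0 / (2.0 * np.e))  # 0.1839, the q = r = 1/2 constant from Example 2
```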

4.2. Three Test Points

The idea behind the bounds of Section 4.1 can be conceptually extended to be based on more than two test points, but the resulting expressions become cumbersome very quickly as the number of test points grows. For three points, however, this is still manageable and can provide improved bounds. Let us select the three test points to be θ 0 Δ , θ 0 , and θ 0 + Δ for some θ 0 and Δ , and let us assign weights q, r, and w = 1 q r . Consider the bound
$$\begin{aligned}
R_n(\Theta) &\ge q\, E_{\theta_0-\Delta}\{\rho(g_n[X]-\theta_0+\Delta)\} + r\, E_{\theta_0}\{\rho(g_n[X]-\theta_0)\} + w\, E_{\theta_0+\Delta}\{\rho(g_n[X]-\theta_0-\Delta)\} \\
&= \int_{\mathbb{R}^n}\big[q\cdot p(x|\theta_0-\Delta)\,\rho(g_n[x]-\theta_0+\Delta) + r\cdot p(x|\theta_0)\,\rho(g_n[x]-\theta_0) + w\cdot p(x|\theta_0+\Delta)\,\rho(g_n[x]-\theta_0-\Delta)\big]\,\mathrm{d}x \\
&\ge \int_{\mathbb{R}^n}\Psi(x,\theta_0,\Delta,q,r)\,\mathrm{d}x,
\end{aligned}$$
where
$$\Psi(x,\theta_0,\Delta,q,r) = \min_\vartheta\{q\cdot p(x|\theta_0-\Delta)\,\rho(\vartheta-\theta_0+\Delta) + r\cdot p(x|\theta_0)\,\rho(\vartheta-\theta_0) + w\cdot p(x|\theta_0+\Delta)\,\rho(\vartheta-\theta_0-\Delta)\}.$$
Considering the case of the MSE,
$$\Psi(x,\theta_0,\Delta,q,r) = \min_\vartheta\{q\cdot p(x|\theta_0-\Delta)(\vartheta-\theta_0+\Delta)^2 + r\cdot p(x|\theta_0)(\vartheta-\theta_0)^2 + w\cdot p(x|\theta_0+\Delta)(\vartheta-\theta_0-\Delta)^2\}$$
can be found in closed-form. Denoting temporarily a = q p ( x | θ 0 Δ ) , b = r p ( x | θ 0 ) , and c = w p ( x | θ 0 + Δ ) , Ψ ( x , θ 0 , Δ , q , r ) is attained by
$$\vartheta = \vartheta^* = \frac{a(\theta_0-\Delta) + b\theta_0 + c(\theta_0+\Delta)}{a+b+c} = \theta_0 + \frac{(c-a)\Delta}{a+b+c}.$$
On substituting ϑ * into the sum of squares, we end up with
$$\begin{aligned}
\Psi(x,\theta_0,\Delta,q,r) &= \frac{\big[a(b+2c)^2 + b(c-a)^2 + c(2a+b)^2\big]\,\Delta^2}{(a+b+c)^2} = \frac{(ab+bc+4ac)\,\Delta^2}{a+b+c} \\
&= \frac{\big[qr\, p(x|\theta_0-\Delta)p(x|\theta_0) + rw\, p(x|\theta_0)p(x|\theta_0+\Delta) + 4qw\, p(x|\theta_0-\Delta)p(x|\theta_0+\Delta)\big]\,\Delta^2}{q\, p(x|\theta_0-\Delta) + r\, p(x|\theta_0) + w\, p(x|\theta_0+\Delta)}.
\end{aligned}$$
The lower bound on the MSE is then the sum of three integrals
$$I_1 = qr\Delta^2\cdot\int_{\mathbb{R}^n}\frac{p(x|\theta_0-\Delta)\,p(x|\theta_0)\,\mathrm{d}x}{q\, p(x|\theta_0-\Delta) + r\, p(x|\theta_0) + w\, p(x|\theta_0+\Delta)}$$
$$I_2 = rw\Delta^2\cdot\int_{\mathbb{R}^n}\frac{p(x|\theta_0)\,p(x|\theta_0+\Delta)\,\mathrm{d}x}{q\, p(x|\theta_0-\Delta) + r\, p(x|\theta_0) + w\, p(x|\theta_0+\Delta)}$$
$$I_3 = 4qw\Delta^2\cdot\int_{\mathbb{R}^n}\frac{p(x|\theta_0-\Delta)\,p(x|\theta_0+\Delta)\,\mathrm{d}x}{q\, p(x|\theta_0-\Delta) + r\, p(x|\theta_0) + w\, p(x|\theta_0+\Delta)}.$$
Example 8.
Revisiting again Example 2, for a given t > 0, let us denote C(t) = [0, t]^n, and then p(x|θ0 + iΔ) = (θ0 + iΔ)^{−n}·I{x ∈ C(θ0 + iΔ)}, i = −1, 0, 1. Then,
$$\begin{aligned}
I_1 &= qr\Delta^2\cdot\int_{C(\theta_0-\Delta)}\frac{(\theta_0-\Delta)^{-n}\,\theta_0^{-n}\,\mathrm{d}x}{q(\theta_0-\Delta)^{-n} + r\theta_0^{-n} + w(\theta_0+\Delta)^{-n}} = qr\Delta^2\cdot\frac{(\theta_0-\Delta)^{n}\,(\theta_0-\Delta)^{-n}\,\theta_0^{-n}}{q(\theta_0-\Delta)^{-n} + r\theta_0^{-n} + w(\theta_0+\Delta)^{-n}} \\
&= \frac{qr\Delta^2\,\theta_0^{-n}}{q(\theta_0-\Delta)^{-n} + r\theta_0^{-n} + w(\theta_0+\Delta)^{-n}} = \frac{qr\Delta^2}{q\,[\theta_0/(\theta_0-\Delta)]^n + r + w\,[\theta_0/(\theta_0+\Delta)]^n}.
\end{aligned}$$
For Δ = θ 0 s / n , we then have, for large n:
$$I_1 \approx \frac{\theta_0^2}{n^2}\cdot\frac{qrs^2}{qe^s + r + we^{-s}} = \frac{\theta_0^2}{n^2}\cdot\frac{qrs^2 e^s}{qe^{2s} + re^s + w},$$
$$\begin{aligned}
I_2 &= rw\Delta^2\int_{C(\theta_0)}\frac{\theta_0^{-n}(\theta_0+\Delta)^{-n}\,\mathrm{d}x}{q(\theta_0-\Delta)^{-n}\,I\{x\in C(\theta_0-\Delta)\} + r\theta_0^{-n} + w(\theta_0+\Delta)^{-n}} \\
&= rw\Delta^2\cdot\frac{(\theta_0-\Delta)^n\,\theta_0^{-n}(\theta_0+\Delta)^{-n}}{q(\theta_0-\Delta)^{-n} + r\theta_0^{-n} + w(\theta_0+\Delta)^{-n}} + rw\Delta^2\cdot\frac{\big[\theta_0^n - (\theta_0-\Delta)^n\big]\,\theta_0^{-n}(\theta_0+\Delta)^{-n}}{r\theta_0^{-n} + w(\theta_0+\Delta)^{-n}} \\
&= \frac{rw\Delta^2\,(1-\Delta/\theta_0)^n}{q\,[(1+\Delta/\theta_0)/(1-\Delta/\theta_0)]^n + r(1+\Delta/\theta_0)^n + w} + \frac{rw\Delta^2\,\big[1 - (1-\Delta/\theta_0)^n\big]}{r(1+\Delta/\theta_0)^n + w} \\
&\approx \frac{\theta_0^2}{n^2}\cdot rws^2\left[\frac{e^{-s}}{qe^{2s} + re^s + w} + \frac{1-e^{-s}}{re^s + w}\right].
\end{aligned}$$
Similarly,
$$I_3 \approx \frac{\theta_0^2}{n^2}\cdot\frac{4qws^2}{qe^{2s} + re^s + w}.$$
Thus,
$$r(\theta_0) \ge \theta_0^2\cdot\sup_{s\ge 0}\ \sup_{(q,r,w)\in\mathbb{R}_+^3:\,q+r+w=1} s^2\cdot\left[\frac{qre^s + 4qw + rwe^{-s}}{qe^{2s} + re^s + w} + \frac{rw(1-e^{-s})}{re^s + w}\right] \approx 0.4624\,\theta_0^2,$$
w.r.t. ζ n = n 2 , which is a significant improvement of nearly 50% over the previous bound, 0.3102 θ 0 2 , which was obtained on the basis of two test points, let alone the bound of Theorem 1, which was 0.2414 θ 0 2 . This concludes Example 8.
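The constant 0.4624 of Example 8 requires a three-parameter numerical maximization (over s, q, and r, with w = 1 − q − r). The following sketch is one possible way to carry it out; it is illustrative only, using a crude multi-start search rather than the paper's MATLAB code, and the reported value 0.4624 is the paper's.

```python
# Illustrative sketch for Example 8: maximize the limiting expression over (s, q, r).
import numpy as np
from scipy.optimize import minimize

def objective(z):
    s, q, r = z
    w = 1.0 - q - r
    if s <= 0 or q <= 0 or r <= 0 or w <= 0:
        return 0.0                      # infeasible points contribute nothing
    es = np.exp(s)
    den1 = q * es**2 + r * es + w
    val = s**2 * ((q * r * es + 4.0 * q * w + r * w / es) / den1
                  + r * w * (1.0 - 1.0 / es) / (r * es + w))
    return -val

best = None
for s0 in np.linspace(0.5, 5.0, 10):
    for q0 in np.linspace(0.05, 0.9, 9):
        for r0 in np.linspace(0.05, 0.9, 9):
            if q0 + r0 < 0.99:
                res = minimize(objective, x0=[s0, q0, r0], method="Nelder-Mead")
                if best is None or res.fun < best.fun:
                    best = res
print(best.x, -best.fun)  # maximum approx 0.4624
```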
In general, the integrals I 1 , I 2 , and I 3 are not easy to calculate due to the summations at the denominators of the integrands. One way to proceed is to apply Lemma 1 to the sum of k = 3 terms, but this would introduce two additional parameters to be optimized. Returning to the earlier shorthand notation of a, b, and c, a different approach to get rid of summations at the denominators, at the expense of some loss of tightness, is the following:
$$\begin{aligned}
\frac{ab+bc+4ac}{a+b+c} &= \frac{a(b+2c)}{a+b+c} + \frac{c(b+2a)}{a+b+c} \ge \frac{a(b+2c)}{a+b+2c} + \frac{c(b+2a)}{2a+b+c} \\
&\ge \frac{a(b+2c)}{2\max\{a,\, b+2c\}} + \frac{c(b+2a)}{2\max\{2a+b,\, c\}} = \frac{1}{2}\cdot\big[\min\{a,\, b+2c\} + \min\{2a+b,\, c\}\big] \\
&\ge \frac{1}{2}\cdot\big[\min\{a,\, b\} + \min\{b,\, c\}\big] \\
&= \frac{1}{2}\cdot\big[\min\{q\, p(x|\theta_0-\Delta),\, r\, p(x|\theta_0)\} + \min\{r\, p(x|\theta_0),\, w\, p(x|\theta_0+\Delta)\}\big],
\end{aligned}\tag{81}$$
and so,
$$\begin{aligned}
R_n(\Theta) &\ge \frac{\Delta^2}{2}\left[\int_{\mathbb{R}^n}\min\{q\, p(x|\theta_0-\Delta),\, r\, p(x|\theta_0)\}\,\mathrm{d}x + \int_{\mathbb{R}^n}\min\{r\, p(x|\theta_0),\, w\, p(x|\theta_0+\Delta)\}\,\mathrm{d}x\right] \\
&= \frac{\Delta^2}{2}\left[(q+r)\, P_e\!\left(\frac{q}{q+r},\theta_0-\Delta,\theta_0\right) + (r+w)\, P_e\!\left(\frac{r}{r+w},\theta_0,\theta_0+\Delta\right)\right].
\end{aligned}$$
Note that by setting w = 0 , we recover the bound obtained by integrating (53), and therefore, by optimizing w, the resulting bound cannot be worse. A slightly different (better, but more complicated) route in (81) is to apply Lemma 1 with k = 2 in the following manner:
$$\begin{aligned}
\frac{ab+bc+4ac}{a+b+c} &= \frac{a(b+2c)}{a+b+c} + \frac{c(b+2a)}{a+b+c} \ge \frac{a(b+2c)}{a+b+2c} + \frac{c(b+2a)}{2a+b+c} \\
&= \frac{a(b+2c)}{\min_u\max\{a/u,\ (b+2c)/(1-u)\}} + \frac{c(b+2a)}{\min_v\max\{(2a+b)/(1-v),\ c/v\}} \\
&= \max_u\min\{(1-u)a,\ u(b+2c)\} + \max_v\min\{v(2a+b),\ (1-v)c\} \\
&\ge \max_u\min\{(1-u)a,\ ub\} + \max_v\min\{vb,\ (1-v)c\} \\
&= \max_u\min\{(1-u)\, q\, p(x|\theta_0-\Delta),\ u\, r\, p(x|\theta_0)\} + \max_v\min\{v\, r\, p(x|\theta_0),\ (1-v)\, w\, p(x|\theta_0+\Delta)\},
\end{aligned}$$
and so,
$$\begin{aligned}
R_n(\Theta) &\ge \Delta^2\cdot\left[\max_u\int_{\mathbb{R}^n}\min\{(1-u)\, q\, p(x|\theta_0-\Delta),\ u\, r\, p(x|\theta_0)\}\,\mathrm{d}x + \max_v\int_{\mathbb{R}^n}\min\{v\, r\, p(x|\theta_0),\ (1-v)\, w\, p(x|\theta_0+\Delta)\}\,\mathrm{d}x\right] \\
&= \Delta^2\cdot\left\{\max_u\,[(1-u)q + ur]\cdot P_e\!\left(\frac{(1-u)q}{(1-u)q + ur},\theta_0-\Delta,\theta_0\right) + \max_v\,[vr + (1-v)w]\cdot P_e\!\left(\frac{vr}{vr + (1-v)w},\theta_0,\theta_0+\Delta\right)\right\}.
\end{aligned}$$
We have just proved the following result:
Theorem 3.
For the MSE case,
$$R_n(\Theta) \ge \sup_{\theta_0,\Delta,q,r,w}\Delta^2\cdot\left\{\max_{u\in[0,1]}\,[(1-u)q + ur]\cdot P_e\!\left(\frac{(1-u)q}{(1-u)q + ur},\theta_0-\Delta,\theta_0\right) + \max_{v\in[0,1]}\,[vr + (1-v)w]\cdot P_e\!\left(\frac{vr}{vr + (1-v)w},\theta_0,\theta_0+\Delta\right)\right\},$$
where the maximum over q, r, and w is under the constraints that they are all non-negative and sum to unity.
Example 9.
Revisiting Example 4, it is natural that both error probabilities would correspond to a prior of 1 / 2 . This dictates the relations,
$$u = \frac{q}{q+r}, \qquad v = \frac{w}{w+r},$$
and so, the bound becomes
$$\begin{aligned}
R_n(\Theta) &\ge \max_{\Delta,q,w,r}\Delta^2\left[\frac{2qr}{q+r} + \frac{2wr}{w+r}\right]\cdot Q\!\left(\frac{\sqrt{n}\,\Delta}{2\sigma}\right) \\
&= \max_{s\ge 0}\left(\frac{2s\sigma}{\sqrt{n}}\right)^{\!2} Q(s)\cdot\max_{\{(q,r):\,q\ge 0,\,r\ge 0,\,q+r\le 1\}}\left[\frac{2qr}{q+r} + \frac{2r(1-q-r)}{1-q}\right] \\
&= \frac{\sigma^2}{n}\cdot\max_{s\ge 0}\{4s^2 Q(s)\}\cdot 0.6862 = \frac{\sigma^2}{n}\cdot 0.6629\cdot 0.6862 \approx \frac{0.4549\,\sigma^2}{n},
\end{aligned}$$
which is an improvement on the bound of Example 4, which was 0.3314 σ 2 / n . Similarly, in Example 6, for smooth signals, the multiplicative constant also improves from 0.3314 to 0.4549, and for the rectangular pulse, it improves from 0.7544 to 1.0352 (about 37% improvement in all of them). This concludes Example 9.
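The constant 0.4549 of Example 9 factors into two separate maximizations, max_s 4s²Q(s) ≈ 0.6629 and the maximization over (q, r) of 2qr/(q+r) + 2r(1−q−r)/(1−q) ≈ 0.6862. The following sketch (illustrative only, not from the paper) computes both factors.

```python
# Illustrative sketch for Example 9: the two factors behind 0.4549 sigma^2/n.
import numpy as np
from scipy.optimize import minimize_scalar, minimize
from scipy.stats import norm

# Factor 1: max_s 4 s^2 Q(s), approximately 0.6629.
f1 = minimize_scalar(lambda s: -4.0 * s**2 * norm.sf(s), bounds=(0.0, 10.0), method="bounded")

# Factor 2: max over q, r >= 0 with q + r <= 1 of 2qr/(q+r) + 2r(1-q-r)/(1-q).
def neg_f2(z):
    q, r = z
    if q <= 0 or r <= 0 or q + r >= 1:
        return 0.0
    return -(2*q*r/(q + r) + 2*r*(1 - q - r)/(1 - q))
f2 = minimize(neg_f2, x0=[0.3, 0.4], method="Nelder-Mead")

print(-f1.fun, -f2.fun, (-f1.fun) * (-f2.fun))  # approx 0.6629, 0.6862, 0.4549
```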
Example 10.
Revisiting Example 2, we now have
$$\begin{aligned}
R_n(\Theta) &\ge \Delta^2\cdot\big[\max_u\min\{(1-u)q,\ ur\alpha\} + \max_v\min\{vr,\ (1-v)w\alpha\}\big] = \Delta^2\left[\frac{qr\alpha}{q+r\alpha} + \frac{wr\alpha}{r+w\alpha}\right] \\
&= \Delta^2\left[\frac{qr\alpha}{q+r\alpha} + \frac{r(1-q-r)\alpha}{r+(1-q-r)\alpha}\right] = \frac{s^2}{n^2}\,e^{-s}\left[\frac{rq}{q+re^{-s}} + \frac{r(1-q-r)}{r+(1-q-r)e^{-s}}\right] \\
&= \frac{s^2}{n^2}\left[\frac{rq}{qe^s + r} + \frac{r(1-q-r)}{re^s + (1-q-r)}\right],
\end{aligned}$$
whose maximum is 0.3909 / n 2 , an improvement relative to the earlier bound of 0.3102 / n 2 . This concludes Example 10.

5. Extensions to the Vector Case

In this section, we outline extensions of some of our findings in Section 3 and Section 4 to the vector case. Let θ be a parameter vector of dimension d and Θ R d . Let ρ ( ε ) be a convex loss function that depends on the d-dimensional error vector ε only via its norm, ε (that is, ρ has radial symmetry).
First, observe that Theorem 1 extends verbatim to the vector case, as nothing in the proof of Theorem 1 is based on any assumption or property that holds only if θ is a scalar parameter.
Corollary 1 can also be extended by letting the two test points of the auxiliary hypothesis testing problem, θ 0 and θ 1 , be selected such that the distance between them, θ 1 θ 0 , would decay at an appropriate rate as a function of n, so as to make the probability of error, max q P e ( q , θ 0 , θ 1 ) , converge to a positive constant as n , as was achieved in Corollary 1 for the scalar case. But in the vector case considered now, there is an additional degree of freedom, which is the direction of the displacement vector, θ 1 θ 0 . This direction can now be optimized so as to yield the largest (hence the tightest) lower bound. To be specific, in the vector case, Corollary 1 remains the same except that now s should be thought of as a d-dimensional vector rather than a scalar, ρ ( u ) in the denominator of (19) should be replaced by ρ ( v · u ) , where v R d is an arbitrary fixed non-zero vector (for example, any unit norm vector in the case of MSE), and the supremum in (20) should be taken over R d . To demonstrate this point, we now revisit Example 5, but this time, for the vector case.
Example 11.
Consider the case where $p(x|\theta)=\prod_{i=1}^n p(x_i|\theta)$, with each factor in this product PDF being given by a $d$-dimensional exponential family,
$$
p(x|\theta)=\frac{\exp\{\theta' T(x)\}}{Z(\theta)},
$$
where $\theta' T(x)$ is the inner product of the $d$-dimensional parameter vector $\theta$ and a $d$-dimensional vector of statistics, $T(x)=(T_1(x),T_2(x),\ldots,T_d(x))$, and
$$
Z(\theta)=\int_{\mathbb{R}}\exp\{\theta' T(x)\}\,\mathrm{d}x,
$$
provided that the integral converges. In the above notation, both $\theta$ and $T(x)$ are understood to be column vectors, and $\theta'$ denotes the transposition of $\theta$ to a row vector. Similarly to Corollary 1, to obtain a local bound at a given $\theta$, we let $\theta_0=\theta$ and $\theta_1=\theta+\frac{2s}{\sqrt{n}}$, where $s\in\mathbb{R}^d$. A simple extension of the derivation in Example 5 (using the CLT) yields
$$
P_e(\theta,s)=Q\!\left(\sqrt{s' I(\theta)\,s}\right),
$$
where $s$ is considered a column vector, the superscript prime denotes vector transposition as before, and $I(\theta)=\nabla^2\ln Z(\theta)$ is the $d\times d$ Fisher information matrix of the exponential family. Considering the case of the MSE, $\rho(\varepsilon)=\|\varepsilon\|^2$, with $v$ being any unit-norm vector, we have
$$
w(s)=\lim_{u\to 0}\frac{\|s\cdot u\|^2}{\|v\cdot u\|^2}=\|s\|^2,
$$
and then the following lower bound is obtained w.r.t. $1/\|v/\sqrt{n}\|^2=n$:
$$
\begin{aligned}
r(\theta) &\ge \sup_{s\in\mathbb{R}^d}2\|s\|^2\,Q\!\left(\sqrt{s' I(\theta)\,s}\right)
=\sup_{t\ge 0}2t^2\cdot\max_{\{s:\,\|s\|^2=t^2\}}Q\!\left(\sqrt{s' I(\theta)\,s}\right) \\
&=\sup_{t\ge 0}2t^2\cdot Q\!\left(t\sqrt{\lambda_{\min}(\theta)}\right)
=\frac{1}{\lambda_{\min}(\theta)}\sup_{\tau\ge 0}2\tau^2\,Q(\tau)\approx\frac{0.3314}{\lambda_{\min}(\theta)},
\end{aligned}
$$
where λ min ( θ ) is the smallest eigenvalue of I ( θ ) . In this example, it is apparent that the chosen direction of the vector s is that of the eigenvector corresponding to the smallest eigenvalue of the Fisher information matrix. This completes Example 11.
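Once the Fisher information matrix is available, the bound of Example 11 is immediate to evaluate. The sketch below (the matrix `I_theta` is an arbitrary illustrative choice, not taken from the paper) computes $0.3314/\lambda_{\min}(\theta)$ by combining the universal constant $\sup_{\tau\ge 0}2\tau^2Q(\tau)$ with the smallest eigenvalue of $I(\theta)$.
```python
import numpy as np
from scipy.stats import norm

def local_mse_bound(fisher):
    """Right-hand side of Example 11: (sup_t 2 t^2 Q(t)) / lambda_min(I(theta))."""
    lam_min = np.linalg.eigvalsh(fisher).min()
    t = np.linspace(0.0, 5.0, 20001)
    c = (2.0 * t**2 * norm.sf(t)).max()        # about 0.3314
    return c / lam_min

# Illustrative 2x2 Fisher information matrix (my own example).
I_theta = np.array([[2.0, 0.3],
                    [0.3, 0.5]])
print(local_mse_bound(I_theta))
```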
In the vector case considered in this section, it is also instructive to extend the scope from two test points, $\theta_0$ and $\theta_1$, to multiple test points, $\theta_0,\theta_1,\ldots,\theta_{m-1}$, along with the corresponding weights (or priors), $q_0,q_1,\ldots,q_{m-1}$ (see the exposition at the end of Section 2). Interestingly, this will also lead to new bounds even for the case of $m=2$ test points.
To this end, we also consider a set of $m$ unitary transformation matrices, $T_0,T_1,T_2,\ldots,T_{m-1}$, with the following properties: (i) $\theta\in\Theta$ if and only if $T_i\theta\in\Theta$ for all $i=0,1,\ldots,m-1$, and (ii) $T_0+T_1+\cdots+T_{m-1}=0$. For example, if $d=2$, take $T_i$ to be the matrices of rotation by $2\pi i/m$, $i=0,1,\ldots,m-1$.
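As a quick numerical illustration of this construction (the dimension and the value of $m$ below are my own choices), the rotation matrices by $2\pi i/m$ are indeed unitary and sum to the zero matrix:
```python
import numpy as np

def rotation(phi):
    # 2x2 counter-clockwise rotation matrix by angle phi
    return np.array([[np.cos(phi), -np.sin(phi)],
                     [np.sin(phi),  np.cos(phi)]])

m = 5
Ts = [rotation(2*np.pi*i/m) for i in range(m)]
print(np.allclose(sum(Ts), np.zeros((2, 2))))             # True: they sum to zero
print(all(np.allclose(T @ T.T, np.eye(2)) for T in Ts))   # True: each is orthogonal
```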
Theorem 4.
Let $\theta_0,\theta_1,\ldots,\theta_{m-1}\in\Theta\subseteq\mathbb{R}^d$, $d\ge 1$, and $q_0,q_1,\ldots,q_{m-1}$ be given, and let $T_0,T_1,\ldots,T_{m-1}$ be unitary transformations that sum to zero, as described in the above paragraph. Then, for a convex loss function $\rho(\varepsilon)$ that depends on $\varepsilon$ only via $\|\varepsilon\|$, we have:
$$
R_n(\Theta)\ge m\cdot\rho\!\left(\frac{1}{m}\sum_{i=0}^{m-1}T_i\theta_i\right)\cdot\int_{\mathbb{R}^n}\min\{q_0\,p(x|\theta_0),\ldots,q_{m-1}\,p(x|\theta_{m-1})\}\,\mathrm{d}x.
$$
As explained in Section 2, the integral on the right-hand side can be interpreted as the probability of list error with a list size of $m-1$.
Proof of Theorem 4.
The proof is a direct extension of the proof of Theorem 1:
$$
\begin{aligned}
\sup_{\theta\in\Theta}E_\theta\{\rho(g_n[X]-\theta)\} &\ge \sum_{i=0}^{m-1}q_i\,E_{\theta_i}\{\rho(g_n[X]-\theta_i)\} \\
&=\int_{\mathbb{R}^n}\sum_{i=0}^{m-1}q_i\,p(x|\theta_i)\,\rho(g_n[x]-\theta_i)\,\mathrm{d}x \\
&\ge m\int_{\mathbb{R}^n}\min\{q_0\,p(x|\theta_0),\ldots,q_{m-1}\,p(x|\theta_{m-1})\}\cdot\frac{1}{m}\sum_{i=0}^{m-1}\rho(\theta_i-g_n[x])\,\mathrm{d}x \\
&\stackrel{(a)}{=} m\int_{\mathbb{R}^n}\min\{q_0\,p(x|\theta_0),\ldots,q_{m-1}\,p(x|\theta_{m-1})\}\cdot\frac{1}{m}\sum_{i=0}^{m-1}\rho\big(T_i(\theta_i-g_n[x])\big)\,\mathrm{d}x \\
&\stackrel{(b)}{\ge} m\int_{\mathbb{R}^n}\min\{q_0\,p(x|\theta_0),\ldots,q_{m-1}\,p(x|\theta_{m-1})\}\cdot\rho\!\left(\frac{1}{m}\sum_{i=0}^{m-1}T_i(\theta_i-g_n[x])\right)\mathrm{d}x \\
&\stackrel{(c)}{=} m\int_{\mathbb{R}^n}\min\{q_0\,p(x|\theta_0),\ldots,q_{m-1}\,p(x|\theta_{m-1})\}\cdot\rho\!\left(\frac{1}{m}\sum_{i=0}^{m-1}T_i\theta_i\right)\mathrm{d}x \\
&= m\,\rho\!\left(\frac{1}{m}\sum_{i=0}^{m-1}T_i\theta_i\right)\cdot\int_{\mathbb{R}^n}\min\{q_0\,p(x|\theta_0),\ldots,q_{m-1}\,p(x|\theta_{m-1})\}\,\mathrm{d}x,
\end{aligned}
$$
where in (a), we have used the unitarity (and hence the norm-preserving property) of $\{T_i\}$, as $\rho(\cdot)$ is assumed to depend only on the norm of the error vector; in (b), we used the convexity of $\rho$; and in (c), we used the fact that $\sum_{i=0}^{m-1}T_i=0$, which implies that $\sum_{i=0}^{m-1}T_ig_n[x]=0$, thus making the bound independent of the estimator, $g_n$. This completes the proof of Theorem 4. □
Theorem 1 is a special case where $m=2$, $T_0=I$, and $T_1=-I$, where $I$ is the $d\times d$ identity matrix. The integral associated with the lower bound of Theorem 4 might not be trivial to evaluate in general for $m\ge 3$. However, there are some choices of the auxiliary parameters that may facilitate calculations. One such choice is as follows. For some positive integer $k<m$, take $\theta_0=\theta_1=\cdots=\theta_{k-1}=\vartheta_0$ for some $\vartheta_0\in\Theta$, $q_0=q_1=\cdots=q_{k-1}=Q/k$ for some $Q\in(0,1)$, $\theta_k=\theta_{k+1}=\cdots=\theta_{m-1}=\vartheta_1$ for some $\vartheta_1\in\Theta$, and finally, $q_k=q_{k+1}=\cdots=q_{m-1}=(1-Q)/(m-k)$. The integrand then becomes the minimum between two functions only, as in Section 3. Denoting $\alpha=k/m$, the bound then becomes
$$
\begin{aligned}
R_n(\Theta) &\ge m\,\rho\!\left(\frac{1}{m}\sum_{i=0}^{k-1}T_i(\vartheta_0-\vartheta_1)\right)\cdot\int_{\mathbb{R}^n}\min\left\{\frac{Q}{k}\,p(x|\vartheta_0),\,\frac{1-Q}{m-k}\,p(x|\vartheta_1)\right\}\mathrm{d}x \\
&=\rho\!\left(\frac{1}{m}\sum_{i=0}^{k-1}T_i(\vartheta_0-\vartheta_1)\right)\cdot\int_{\mathbb{R}^n}\min\left\{\frac{Q}{\alpha}\,p(x|\vartheta_0),\,\frac{1-Q}{1-\alpha}\,p(x|\vartheta_1)\right\}\mathrm{d}x \\
&=\rho\!\left(\frac{1}{m}\sum_{i=0}^{k-1}T_i(\vartheta_0-\vartheta_1)\right)\cdot\left[\frac{Q}{\alpha}+\frac{1-Q}{1-\alpha}\right]\cdot P_e\!\left(\frac{(1-\alpha)Q}{(1-\alpha)Q+\alpha(1-Q)},\,\vartheta_0,\,\vartheta_1\right).
\end{aligned}
$$
Redefining
$$
q=\frac{(1-\alpha)Q}{(1-\alpha)Q+\alpha(1-Q)},
$$
we have
$$
\frac{Q}{\alpha}+\frac{1-Q}{1-\alpha}=\frac{1}{1-\alpha-q+2\alpha q},
$$
and the following corollary to Theorem 4 is obtained.
Corollary 2.
Let the conditions of Theorem 4 be satisfied. Then,
$$
\begin{aligned}
R_n(\Theta) &\ge \sup_{\vartheta_0,\vartheta_1,\alpha,q}\,\rho\!\left(\frac{1}{m}\sum_{i=0}^{k-1}T_i(\vartheta_0-\vartheta_1)\right)\cdot\frac{P_e(q,\vartheta_0,\vartheta_1)}{1-\alpha-q+2\alpha q} \\
&=\sup_{\vartheta_0,\vartheta_1,q}\,\rho\!\left(\frac{1}{m}\sum_{i=0}^{k-1}T_i(\vartheta_0-\vartheta_1)\right)\cdot\frac{P_e(q,\vartheta_0,\vartheta_1)}{\min\{q,1-q\}}.
\end{aligned}
$$
Note that if $m$ is even, $\alpha=\frac{1}{2}$, and the transformations are chosen as $T_0=\cdots=T_{k-1}=I$ and $T_k=\cdots=T_{m-1}=-I$, then we are actually back to the bound of $m=2$, and so, the optimal bound for even $m>2$ cannot be worse than our bound for $m=2$. We do not have, however, precisely the same argument for an odd $m$, but for a large $m$, it becomes immaterial whether $m$ is even or odd. In its general form, the bound of Theorem 4 is a heavy optimization problem, as we have the freedom to optimize $\theta_0,\ldots,\theta_{m-1}$, $T_0,\ldots,T_{m-1}$ (under the constraints that they are all unitary and sum to zero), and $q_0,\ldots,q_{m-1}$ (under the constraints that they are all non-negative and sum to unity).
Another relatively convenient choice is to take $\theta_i=T_i^{-1}\theta_0$, $i=1,\ldots,m-1$, to obtain another corollary to Theorem 4:
Corollary 3.
Let the conditions of Theorem 4 be satisfied. Then,
$$
R_n(\Theta)\ge\sup_{\theta_0,T_0,\ldots,T_{m-1},q_0,\ldots,q_{m-1}} m\,\rho(\theta_0)\cdot\int_{\mathbb{R}^n}\min\{q_0\,p(x|\theta_0),\,q_1\,p(x|T_1^{-1}\theta_0),\ldots,\,q_{m-1}\,p(x|T_{m-1}^{-1}\theta_0)\}\,\mathrm{d}x.
$$
Example 12.
To demonstrate a calculation of the extended lower bound for m = 3 , consider the following model. We are observing a noisy signal,
$$
Z_i=\vartheta\phi_i+(\vartheta+\zeta)\psi_i+N_i,\qquad i=1,2,\ldots,n,
$$
where $\vartheta$ is the desired parameter to be estimated, $\zeta$ is a nuisance parameter, taking values within an interval $[-\delta,\delta]$ for some $\delta>0$, $\{N_i\}$ are i.i.d. Gaussian random variables with zero mean and variance $\sigma^2$, and $\phi_i$ and $\psi_i$ are two given orthogonal waveforms with $\sum_{i=1}^n\phi_i^2=\sum_{i=1}^n\psi_i^2=n$. Suppose we are interested in estimating $\vartheta$ based on the sufficient statistics $X=\frac{1}{n}\sum_{i=1}^n Z_i\phi_i$ and $Y=\frac{1}{n}\sum_{i=1}^n Z_i\psi_i$, which are jointly Gaussian random variables with the mean vector $(\vartheta,\vartheta+\zeta)$ and the covariance matrix $\frac{\sigma^2}{n}\cdot I$, $I$ being the $2\times 2$ identity matrix. We denote realizations of $(X,Y)$ by $(x,y)$. Let us also denote $\theta=(\vartheta,\vartheta+\zeta)$. Since we are interested only in estimating $\vartheta$, our loss function will depend only on the estimation error of the first component of $\theta$, which is $\vartheta$. Consider the choice $m=3$ and let $T_i$ be the counter-clockwise rotation transformations by $2\pi i/3$, $i=0,1,2$. For a given $\Delta\in(0,\delta]$, let us select $\theta_0=(-\Delta,0)$, $\theta_1=T_1^{-1}\theta_0=(\Delta/2,\Delta\sqrt{3}/2)$, and $\theta_2=T_2^{-1}\theta_0=(\Delta/2,-\Delta\sqrt{3}/2)$. Finally, let $q_0=q_1=q_2=\frac{1}{3}$. In order to calculate the integral
$$
I=\int_{\mathbb{R}^2}\min\left\{\frac{1}{3}p(x,y|\theta_0),\,\frac{1}{3}p(x,y|\theta_1),\,\frac{1}{3}p(x,y|\theta_2)\right\}\mathrm{d}x\,\mathrm{d}y,
$$
the plane $\mathbb{R}^2$ can be partitioned into three slices whose contributions to the integral are equal. In each such region, the smallest of the three densities $p(x,y|\theta_i)$ is integrated. In other words, each $p(x,y|\theta_i)$, in its turn, is integrated over the region whose points are farther (in Euclidean distance) from $\theta_i$ than from the other two values of $\theta$. For $\theta_0=(-\Delta,0)$, this is the region $\{(x,y):\,x\ge 0,\,|y|\le x\sqrt{3}\}$. The factor of $\frac{1}{3}$ cancels against the three identical contributions from $\theta_0$, $\theta_1$, and $\theta_2$, due to the symmetry. Therefore,
$$
\begin{aligned}
I &=\int_0^\infty\mathrm{d}x\int_{-x\sqrt{3}}^{x\sqrt{3}}\mathrm{d}y\;p(x,y|\theta_0) \\
&=\int_0^\infty\mathrm{d}x\int_{-x\sqrt{3}}^{x\sqrt{3}}\frac{n}{2\pi\sigma^2}\exp\left\{-\frac{n(x+\Delta)^2+ny^2}{2\sigma^2}\right\}\mathrm{d}y \\
&=\sqrt{\frac{n}{2\pi\sigma^2}}\int_0^\infty\exp\left\{-\frac{n(x+\Delta)^2}{2\sigma^2}\right\}\left[1-2Q\!\left(\frac{x\sqrt{3n}}{\sigma}\right)\right]\mathrm{d}x.
\end{aligned}
$$
Our next mathematical manipulations in this example are in the spirit of the passage from Theorem 1 to Corollary 1, that is, selecting the test points increasingly close to each other as functions of $n$, so that the probability of list error would tend to a positive constant. To this end, we change the integration variable from $x$ to $u=x\sqrt{n}/\sigma$ and select $\Delta=s\sigma/\sqrt{n}$ for some $s\ge 0$ to be optimized later. Then,
$$
I=\sqrt{\frac{n}{2\pi\sigma^2}}\int_0^\infty\exp\left\{-\frac{n(u\sigma/\sqrt{n}+s\sigma/\sqrt{n})^2}{2\sigma^2}\right\}\left[1-2Q(u\sqrt{3})\right]\frac{\sigma\,\mathrm{d}u}{\sqrt{n}}
=\frac{1}{\sqrt{2\pi}}\int_0^\infty e^{-(u+s)^2/2}\left[1-2Q(u\sqrt{3})\right]\mathrm{d}u.
$$
The MSE bound then becomes
$$
R_n(\theta,g_n)\ge\sup_{s\ge 0}\,3\left(\frac{s\sigma}{\sqrt{n}}\right)^2\cdot\frac{1}{\sqrt{2\pi}}\int_0^\infty e^{-(u+s)^2/2}\left[1-2Q(u\sqrt{3})\right]\mathrm{d}u
=\frac{\sigma^2}{n}\cdot\sup_{s\ge 0}\frac{3s^2}{\sqrt{2\pi}}\int_0^\infty e^{-(u+s)^2/2}\left[1-2Q(u\sqrt{3})\right]\mathrm{d}u
\approx\frac{0.2514\,\sigma^2}{n}.
$$
This bound is not as tight as the corresponding bound of m = 2 , which results in 0.3314 σ 2 / n , but it should be kept in mind that here, we have not attempted to optimize the choices of θ 0 , θ 1 , θ 2 , T 0 , T 1 , T 2 , q 0 , and q 1 . Instead, we have chosen these parameter values from considerations of computational convenience, just to demonstrate the calculation. This concludes Example 12.
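The constant 0.2514 of Example 12 can be reproduced by a one-dimensional numerical integration followed by a search over $s$; a possible sketch (the integration routine and the grid are my own choices) is the following.
```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def objective(s):
    # (3 s^2 / sqrt(2 pi)) * int_0^inf exp(-(u+s)^2/2) * (1 - 2 Q(u*sqrt(3))) du
    integrand = lambda u: np.exp(-(u + s)**2 / 2.0) * (1.0 - 2.0*norm.sf(np.sqrt(3.0)*u))
    val, _ = quad(integrand, 0.0, np.inf)
    return 3.0 * s**2 * val / np.sqrt(2.0*np.pi)

s_grid = np.linspace(0.01, 4.0, 400)
vals = [objective(s) for s in s_grid]
print(max(vals), s_grid[int(np.argmax(vals))])   # about 0.2514, in units of sigma^2/n
```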
Finally, a comment is in order regarding possible extensions of Section 4 to the vector case. Such extensions are conceptually straightforward whenever the loss function of the error vector is given by the sum of the losses associated with the different components of the error vector. Most notably, when the loss function is the MSE, $\rho(\varepsilon)=\|\varepsilon\|^2$, each component of the estimation error can be handled separately by the methods of Section 4. Of course, the two or three test points should be chosen such that they differ in all components of the parameter vector. In the case of three test points, it makes sense to select them equally spaced along a straight line of a general direction in $\Theta$. We will not pursue this extension any further in this work.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A

Proof of Equation (44).
Consider the Taylor series expansion,
$$
s(t,\theta+\delta)=s(t,\theta)+\delta\cdot\dot{s}(t,\theta)+\frac{\delta^2}{2}\cdot\ddot{s}(t,\theta)+o(\delta^2),
$$
where s ˙ ( t , θ ) and s ¨ ( t , θ ) are the first two partial derivatives of s ( t , θ ) w.r.t. θ . Correlating both sides with s ( t , θ ) yields
$$
\int_0^T s(t,\theta)s(t,\theta+\delta)\,\mathrm{d}t=E+\delta\cdot\int_0^T s(t,\theta)\dot{s}(t,\theta)\,\mathrm{d}t+\frac{\delta^2}{2}\cdot\int_0^T s(t,\theta)\ddot{s}(t,\theta)\,\mathrm{d}t+o(\delta^2).
$$
Now,
$$
\int_0^T s(t,\theta)\dot{s}(t,\theta)\,\mathrm{d}t=\frac{1}{2}\cdot\frac{\partial}{\partial\theta}\int_0^T s^2(t,\theta)\,\mathrm{d}t=\frac{1}{2}\cdot\frac{\partial E}{\partial\theta}=0,
$$
since the energy $E$ is assumed independent of $\theta$. Also, since $\partial^2 E/\partial\theta^2=0$, we have
$$
0=\frac{\partial^2}{\partial\theta^2}\int_0^T s^2(t,\theta)\,\mathrm{d}t=\frac{\partial}{\partial\theta}\left[2\int_0^T s(t,\theta)\dot{s}(t,\theta)\,\mathrm{d}t\right]=2\int_0^T\dot{s}^2(t,\theta)\,\mathrm{d}t+2\int_0^T s(t,\theta)\ddot{s}(t,\theta)\,\mathrm{d}t,
$$
which yields
$$
\int_0^T s(t,\theta)\ddot{s}(t,\theta)\,\mathrm{d}t=-\int_0^T\dot{s}^2(t,\theta)\,\mathrm{d}t,
$$
and so,
$$
\int_0^T s(t,\theta)s(t,\theta+\delta)\,\mathrm{d}t=E-\frac{\delta^2}{2}\int_0^T\dot{s}^2(t,\theta)\,\mathrm{d}t+o(\delta^2),
$$
and hence,
$$
\varrho(\theta,\theta+\delta)=1-\frac{\delta^2}{2E}\int_0^T\dot{s}^2(t,\theta)\,\mathrm{d}t+o(\delta^2).
$$
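As a simple sanity check of this expansion (with $\varrho(\theta,\theta+\delta)$ understood, as in the derivation above, as the normalized correlation $\frac{1}{E}\int_0^T s(t,\theta)s(t,\theta+\delta)\,\mathrm{d}t$), consider the constant-energy signal $s(t,\theta)=\sin(t+\theta)$ on $[0,2\pi]$ (my own illustrative choice), for which $E=\pi$, $\int_0^T\dot{s}^2(t,\theta)\,\mathrm{d}t=\pi$, and hence $\varrho(\theta,\theta+\delta)=\cos\delta=1-\delta^2/2+o(\delta^2)$, in agreement with the formula:
```python
import numpy as np
from scipy.integrate import quad

# Check rho(theta, theta+delta) approx. 1 - (delta^2/(2E)) * int sdot^2 dt
# for s(t, theta) = sin(t + theta) on [0, 2*pi] (illustrative example).
theta, T = 0.3, 2*np.pi
E, _  = quad(lambda t: np.sin(t + theta)**2, 0, T)    # energy, equals pi
Ed, _ = quad(lambda t: np.cos(t + theta)**2, 0, T)    # integral of sdot^2, equals pi

for delta in [0.2, 0.1, 0.05]:
    corr, _ = quad(lambda t: np.sin(t + theta)*np.sin(t + theta + delta), 0, T)
    rho_exact  = corr / E                      # equals cos(delta)
    rho_approx = 1 - delta**2/(2*E) * Ed       # equals 1 - delta^2/2
    print(delta, rho_exact, rho_approx)
```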

Appendix B

Proof of Lemma 1.
First, observe that
$$
\sum_{i=1}^k a_i=\sum_{i=1}^k r_i\cdot\frac{a_i}{r_i}\le\max\left\{\frac{a_1}{r_1},\ldots,\frac{a_k}{r_k}\right\},
$$
and since the inequality $\sum_{i=1}^k a_i\le\max\{a_1/r_1,\ldots,a_k/r_k\}$ applies to every $r\in S$, it also applies to the infimum of $\max\{a_1/r_1,\ldots,a_k/r_k\}$ over $S$, thus establishing the inequality "≤" between the two sides. To establish the "≥" inequality, $r^*\in S$ is defined to be the vector whose components are given by $r_i^*=a_i/\sum_{j=1}^k a_j$. Then,
$$
\inf_{(r_1,\ldots,r_k)\in S}\max\left\{\frac{a_1}{r_1},\ldots,\frac{a_k}{r_k}\right\}\le\max\left\{\frac{a_1}{r_1^*},\ldots,\frac{a_k}{r_k^*}\right\}=\sum_{i=1}^k a_i.
$$
This completes the proof of Lemma 1. □
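Lemma 1 (with $S$ the probability simplex and the $a_i$ positive, as the proof indicates) is also easy to verify numerically. The sketch below compares the left-hand side with a random search over the simplex and with the explicit minimizer $r_i^*=a_i/\sum_j a_j$.
```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(0.1, 2.0, size=5)

lhs = a.sum()                                  # left-hand side of Lemma 1

# Right-hand side: inf over the simplex of max_i a_i / r_i, via random search.
best = np.inf
for _ in range(100000):
    r = rng.dirichlet(np.ones(5))
    best = min(best, (a / r).max())

r_star = a / a.sum()                           # the minimizer used in the proof
print(lhs, best, (a / r_star).max())           # all three (approximately) agree
```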

References

  1. Van Trees, H.L. Detection, Estimation and Modulation Theory: Part I; John Wiley & Sons, Inc.: New York, NY, USA, 1968. [Google Scholar]
  2. Bobrovsky, B.; Zakai, M. A lower bound on the estimation error for certain diffusion processes. IEEE Trans. Inform. Theory 1976, 22, 45–52. [Google Scholar] [CrossRef]
  3. Bellini, S.; Tartara, G. Bounds on error in signal parameter estimation. IEEE Trans. Commun. 1974, 22, 340–342. [Google Scholar] [CrossRef]
  4. Chazan, D.; Zakai, M.; Ziv, J. Improved lower bounds on signal parameter estimation. IEEE Trans. Inform. Theory 1975, 21, 90–93. [Google Scholar] [CrossRef]
  5. Weiss, A.J. Fundamental Bounds in Parameter Estimation. Ph.D. Thesis, Department of Electrical Engineering—Systems, Tel Aviv University, Tel Aviv, Israel, June 1985. [Google Scholar]
  6. Weiss, A.J.; Weinstein, E. A lower bound on the mean square error in random parameter estimation. IEEE Trans. Inform. Theory 1985, 31, 680–682. [Google Scholar] [CrossRef]
  7. Van Trees, H.L.; Bell, K.L. (Eds.) Bayesian Bounds for Parametric Estimation and Nonlinear Filtering/Tracking; John Wiley & Sons, Inc.: New York, NY, USA, 2007. [Google Scholar]
  8. Fisher, R.A. On the mathematical foundations of theoretical statistics. Philos. Trans. R. Soc. 1922, 222, 309. [Google Scholar]
  9. Dugue, D. Application des proprietes de la limite au sens du calcul des probabilities a l’etudes des diverses questions d’estimation. Ecol. Poly. 1937, 3, 305–372. [Google Scholar]
  10. Frechet, M. Sur l’extension de certaines evaluations statistiques au cas de petits echantillons. Rev. Inst. Int. Stat. 1943, 11, 182–205. [Google Scholar] [CrossRef]
  11. Rao, C.R. Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 1945, 37, 81–91. [Google Scholar]
  12. Cramér, H. Mathematical Methods of Statistics; Princeton University Press: Princeton, NJ, USA, 1946. [Google Scholar]
  13. Bhattacharyya, A. On some analogues of the amount of information and their use in statistical estimation. Sankhya India J. Stat. 1946, 8, 1–14. [Google Scholar]
  14. Barankin, E.W. Locally best unbiased estimators. Ann. Math. Stat. 1949, 20, 477–501. [Google Scholar] [CrossRef]
  15. Chapman, D.G.; Robbins, H. Minimum variance estimation without regularity assumption. Ann. Math. Stat. 1951, 22, 581–586. [Google Scholar] [CrossRef]
  16. Fraser, D.A.; Guttman, I. Bhattacharyya bound without regularity assumptions. Ann. Math. Stat. 1952, 23, 629–632. [Google Scholar] [CrossRef]
  17. Kiefer, J. On minimum variance estimation. Ann. Math. Stat. 1952, 23, 627–629. [Google Scholar] [CrossRef]
  18. Ziv, J.; Zakai, M. Some lower bounds on signal parameter estimation. IEEE Trans. Inform. Theory 1969, 15, 386–391. [Google Scholar] [CrossRef]
  19. Hájek, J. Local asymptotic minimax and admissibility in estimation. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability; University of California Press: Berkeley, CA, USA, 1972; pp. 175–194. [Google Scholar]
  20. Le Cam, L.M. Convergence of estimates under dimensionality restrictions. Ann. Stat. 1973, 1, 38–53. [Google Scholar]
  21. Assouad, P. Deux remarques sur l’estimation. C. R. L’Acad. Sci. Paris Ser. Math. 1983, 296, 1021–1024. [Google Scholar]
  22. Fano, R.M. Transmission of Information: A Statistical Theory of Communication; The M.I.T. Press: Cambridge, MA, USA, 1961. [Google Scholar]
  23. Lehmann, E.L. Theory of Point Estimation; John Wiley & Sons: New York, NY, USA, 1983. [Google Scholar]
  24. Nazin, A.V. On minimax bound for parameter estimation in ball (bias accounting). In New Trends in Probability and Statistics; Walter de Gruyter GmbH: Berlin, Germany, 1991; pp. 612–616. [Google Scholar]
  25. Yang, Y.; Barron, A.R. Information-theoretic determination of minimax rates of convergence. Ann. Stat. 1999, 27, 1564–1599. [Google Scholar] [CrossRef]
  26. Guntuboyina, A. Minimax Lower Bounds. Ph.D. Thesis, Statistics Department, Yale University, New Haven, CT, USA, 2011. [Google Scholar]
  27. Guntuboyina, A. Lower bounds for the minimax risk using f-divergences, and applications. IEEE Trans. Inform. Theory 2011, 57, 2386–2399. [Google Scholar] [CrossRef]
  28. Kim, A.K.H. Obtaining minimax lower bounds: A review. J. Korean Stat. Soc. 2020, 49, 673–701. [Google Scholar] [CrossRef]
  29. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 2006. [Google Scholar]
  30. Le Cam, L. Asymptotic Methods in Statistical Decision Theory; Springer: New York, NY, USA, 1986. [Google Scholar]
  31. Yu, B. Assouad, Fano and Le Cam. In A Festschrift for Lucien Le Cam; Pollard, D., Torgersen, E., Yang, G.L., Eds.; Springer: New York, NY, USA, 1997; pp. 423–435. [Google Scholar]
  32. Cai, T.T.; Zhou, H.H. Optimal rates of convergence for sparse covariance matrix estimation. Ann. Stat. 2012, 40, 2389–2420. [Google Scholar] [CrossRef]
  33. Helstrom, C.W. Statistical Theory of Signal Detection, 2nd ed.; Pergamon Press: Oxford, UK, 1975. [Google Scholar]
  34. Whalen, D. Detection of Signals in Noise; Academic Press: New York, NY, USA, 1971. [Google Scholar]
  35. Parag, P. Lecture 17: Minimax Bounds. In Foundations of Machine Learning; Indian Institute of Science: Bangalore, India, 2022; Available online: https://ece.iisc.ac.in/~parimal/2022/ml/lecture-17.pdf (accessed on 18 September 2024).
  36. Rogers, J. Information Theoretic Minimax Lower Bounds. 2018. Available online: https://jenniferbrennan.github.io/resources/Minimax_Lower_Bound_Notes.pdf (accessed on 18 September 2024).
  37. Rigollet, P. Chapter 5: Minimax lower bounds. In High Dimensional Statistics; OpenCourseWare Lecture Notes; Massachusetts Institute of Technology: Cambridge, MA, USA, 2015; Available online: https://ocw.mit.edu/courses/18-s997-high-dimensional-statistics-spring-2015/501374d1714bfd55ff6345189b9c2e26_MIT18_S997S15_Chapter5.pdf (accessed on 18 September 2024).
  38. Zhang, Z.; Shi, Z.; Gu, Y. Ziv–Zakai bound for DOAs estimation. IEEE Trans. Signal Process. 2022, 71, 136–149. [Google Scholar] [CrossRef]
  39. Bell, K.L.; Steinberg, Y.; Ephraim, Y.; Van Trees, H.L. Extended Ziv–Zakai lower bound for vector parameter estimation. IEEE Trans. Inform. Theory 1997, 43, 624–637. [Google Scholar] [CrossRef]
  40. Zhang, Z.; Shi, Z.; Zhou, C.; Yan, C.; Gu, Y. Ziv–Zakai bound for compressive time delay estimation. IEEE Trans. Signal Process. 2022, 70, 4006–4019. [Google Scholar] [CrossRef]
  41. Sun, J.; Ma, S.; Xu, G.; Li, S. Trade-off between positioning and communication for millimeter wave systems with Ziv–Zakai bound. IEEE Trans. Commun. 2023, 71, 3752–3762. [Google Scholar] [CrossRef]
  42. Mishra, K.V.; Eldar, Y.C. Performance of time delay estimation in a cognitive radar. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 3141–3145. [Google Scholar]
  43. Bell, K.L.; Ephraim, Y.; Van Trees, H.L. Explicit Ziv–Zakai lower bound for bearing estimation. IEEE Trans. Signal Process. 1996, 44, 2810–2824. [Google Scholar] [CrossRef]
  44. Zhang, Z.; Shi, Z.; Gu, Y.; Greco, M.S.; Gini, F. Ziv–Zakai bound for compressive time delay estimation from zero-mean Gaussian signal. Signal Process. Lett. 2023, 30, 1112–1116. [Google Scholar] [CrossRef]
  45. Chiriac, V.M.; He, Q.; Haimovich, A.M.; Blum, R.S. Ziv–Zakai bound for joint parameter estimation in MIMO radar systems. IEEE Trans. Signal Process. 2015, 63, 4956–4968. [Google Scholar] [CrossRef]
