1. Introduction
The theory of parameter estimation offers a plethora of lower bounds (as well as upper bounds), which characterize the fundamental performance limits of any estimator in a given parametric model. In this context, it is common to distinguish between Bayesian bounds (see, e.g., the Bayesian Cramér–Rao bound [1], the Bayesian Bhattacharyya bound, the Bobrovsky–Zakai bound [2], the Bellini–Tartara bound [3], the Chazan–Zakai–Ziv bound [4], the Weiss–Weinstein bound [5,6], and more; see [7] for a comprehensive overview) and non-Bayesian bounds. In the former, the parameter to be estimated is considered a random variable with a given probability law, whereas in the latter, it is assumed to be an unknown deterministic constant. The category of non-Bayesian bounds is further subdivided into two subclasses: one is associated with local bounds that hold for classes of estimators with certain limitations, such as unbiased estimators (see, e.g., the Cramér–Rao bound [8,9,10,11,12], the Bhattacharyya bound [13], the Barankin bound [14], the Chapman–Robbins bound [15], the Fraser–Guttman bound [16], the Kiefer bound [17], and more), and the other is the subclass of minimax bounds (see, e.g., Ziv and Zakai [18], Hajek [19], Le Cam [20], Assouad [21], Fano [22], Lehmann (Sections 4.2–4.4 in [23]), Nazin [24], Yang and Barron [25], Guntuboyina [26,27], Kim [28], and many more).
In this paper, we focus on the minimax approach, and more concretely, on the local minimax approach. According to the minimax approach, we are given a parametric family of probability density functions (or probability mass functions, in the discrete case), $\{f_\theta(x),\ \theta\in\Theta\}$, where $\theta$ is a d-dimensional parameter vector, $\Theta \subseteq \mathbb{R}^d$ is the parameter space, n is a positive integer designating the number of observations, and we define a loss function, $L(\hat{\theta},\theta)$, where $\hat{\theta} = \hat{\theta}(x)$ is an estimator, which is a function of the observations $x = (x_1,\ldots,x_n)$ only. The minimax performance is defined as
$$R_n = \inf_{\hat{\theta}} \sup_{\theta\in\Theta} E_\theta\{L(\hat{\theta},\theta)\},$$
where $E_\theta\{\cdot\}$ denotes expectation w.r.t. $f_\theta$. As customary, we consider here loss functions with the property that $L(\hat{\theta},\theta)$ depends on $\hat{\theta}$ and $\theta$ only via their difference, that is, $L(\hat{\theta},\theta) = \rho(\hat{\theta}-\theta)$, where the function $\rho$ satisfies certain assumptions (see Section 2). The local asymptotic minimax performance at the point $\theta_0 \in \Theta$ is defined as follows (see also, e.g., [19]). Let $\{a_n\}$ be a positive sequence, tending to infinity, with the property that $R(\theta_0) \stackrel{\triangle}{=} \lim_{n\to\infty} a_n R_n(\theta_0)$, with $R_n(\theta_0)$ being the minimax risk localized at $\theta_0$, is a strictly positive finite constant. Then, we say that $R(\theta_0)$ is the local asymptotic minimax performance with respect to (w.r.t.) $\{a_n\}$ at the point $\theta_0$. Roughly speaking, the significance is that the performance of a good estimator, $\hat{\theta}$, at $\theta_0$ is about $R(\theta_0)/a_n$. For example, in the scalar mean square error (MSE) case, where $\rho(\varepsilon) = \varepsilon^2$, and where the observations are Gaussian, i.i.d., with mean $\theta$ and known variance $\sigma^2$, it is actually shown in Example 2.4, p. 257 in [23] that $R(\theta) = \sigma^2$ w.r.t. $a_n = n$, for all $\theta$, which is attained by the sample-mean estimator, $\hat{\theta} = \frac{1}{n}\sum_{i=1}^n X_i$.
Our focus in this work is on the derivation of some new lower bounds with the following properties: (i) they are essentially free of regularity conditions on the smoothness of the parametric family $\{f_\theta,\ \theta\in\Theta\}$; (ii) they are relatively simple and easy to calculate, at least numerically, which amounts to the property that the bound contains only a small number of auxiliary parameters to be numerically optimized (typically, no more than two or three parameters); (iii) they are tighter than earlier reported bounds that are associated with similar calculation efforts as described in (ii); and (iv) they lend themselves to extensions that yield even stronger bounds (albeit with more auxiliary parameters to be optimized), as well as to extensions to vector parameters. We propose two families of lower bounds on the minimax risk, along with their local versions, with the four above-described properties. The first applies to any convex, symmetric loss function $\rho$, whereas the second is more specifically oriented to the moments of the estimation error, $E_\theta|\hat{\theta}-\theta|^t$, where t is a positive real, not necessarily an integer, with special attention devoted to the MSE case, $t = 2$. For the sake of simplicity and clarity of the exposition, in the first two main sections of the paper (Sections 3 and 4), our focus is on the case of a scalar parameter, as most of our examples are associated with the scalar case. In Section 5, we extend some of our findings to the vector case.
To put this work in the perspective of earlier work on minimax estimation, we next briefly review some of the basic approaches in this problem area. Admittedly, considering the vast amount of literature on the subject, our review below is by no means exhaustive. For a more comprehensive review, the reader is referred to Kim [28].
First, observe the simple fact that the minimax performance is lower bounded by the Bayesian performance under the same loss function (see, e.g., [1,2,3,4,5,6,7]) for any prior on the parameter, $\theta$, and so, every lower bound on the Bayesian performance is automatically a valid lower bound on the minimax performance as well (see Section 4.2 in [23]). Indeed, in Section 2.3 in [26], it is argued that the vast majority of existing minimax lower bounding techniques are based upon bounding the Bayes risk from below w.r.t. some prior. Many of these Bayesian bounds, however, are subject to certain restrictions and regularity conditions concerning the smoothness of the prior and of the family of densities, $\{f_\theta,\ \theta\in\Theta\}$.
The earliest bound of this kind dates back to Ziv and Zakai's 1969 article [18] on parameter estimation, applied mostly in the context of time-delay estimation. There, the prior puts all its mass equally on two values, $\theta_1$ and $\theta_2$, of the parameter $\theta$, and an auxiliary hypothesis testing problem of distinguishing between the two hypotheses, $H_1$ and $H_2$, with equal priors, is considered (see Section 2 for exact definitions). A simple argument regarding the sub-optimality of a decision rule that is based on estimating $\theta$ and deciding on the hypothesis with the closer value of $\theta$, combined with Chebychev's inequality, yields a simple lower bound on the corresponding Bayes risk, and hence also on the minimax risk, in terms of the probability of error of the optimal decision rule. Five years later, Bellini and Tartara [3], and then independently, Chazan, Zakai, and Ziv [4], improved the bound of [18] using somewhat different arguments and obtained Bayesian bounds that apply to the uniform prior. These bounds are also given in terms of the error probability pertaining to the optimal maximum a posteriori (MAP) decision rule of binary hypothesis testing with equal priors, but this time in an integral form. These bounds were demonstrated to be fairly tight in several application examples, but they are rather difficult to calculate in most cases. Shortly before the Bellini–Tartara and the Chazan–Zakai–Ziv articles were published, Le Cam [20] proposed a minimax lower bound, which is also given in terms of the error probability associated with binary hypothesis testing, or equivalently, the total variation distance between $f_{\theta_1}$ and $f_{\theta_2}$, under the postulate that the loss function is a metric. We will refer to Le Cam's bound in a more detailed manner later, in the context of our first proposed bound.
A decade later, Assouad [21] extended Le Cam's two-point testing bound to multiple points: instead of just two test points, $\theta_1$ and $\theta_2$, as before, there are, more generally, m test points, $\theta_1,\ldots,\theta_m$, and correspondingly, the auxiliary hypothesis testing problem consists of m hypotheses, $H_i$, $i = 1,\ldots,m$, with certain priors (again, to be defined in Section 2). Based on such test points, Assouad devised the so-called hypercube method. Another related bounding technique that is based on multiple test points, referred to as Fano's method, amounts to further bounding from below the error probability of multiple hypotheses using Fano's inequality (see Section 2.10 in [29]). Considering the large number of auxiliary parameters to be optimized when multiple hypotheses are present, these bounds demand heavy computational effort. Moreover, Fano's inequality is often loose, even though it is adequate for the purpose of proving converses to coding theorems in information theory [29]. In later years, Le Cam [30] and Yu [31] extended Le Cam's original approach to apply to testing mixtures of densities. More recently, Yang and Barron [25] related the minimax problem to the metric entropy of the parametric family, Cai and Zhou [32] combined Le Cam's and Assouad's methods by considering a larger number of dimensions, and Guntuboyina [26,27] pursued a different direction by deriving minimax lower bounds using f-divergences.
The outline of this article is as follows. In Section 2, we define the problem setting, provide a few formal definitions along with background, and establish the notation. In Section 3, we develop the first family of bounds, and in Section 4, we present the second family, both for the scalar parameter case. Finally, in Section 5, we extend some of our findings to the vector case.
2. Problem Setting, Background, Definitions, and Notation
As outlined in the last paragraph of the Introduction, Sections 3 and 4 address the scalar parameter case, whereas in Section 5, we extend some of the results to the case of a vector parameter of dimension d. In order to avoid repetition, we formalize the problem and define the notation here for the more general vector case, with the understanding that for the scalar case, all definitions remain the same, except that they are confined to the special case of $d = 1$.
Consider a family of probability density functions (PDFs), $\{f_\theta(x),\ \theta\in\Theta\}$, where $\theta = (\theta_1,\ldots,\theta_d)$ is a parameter vector of dimension d to be estimated, and $\Theta \subseteq \mathbb{R}^d$ is the parameter space. We denote by $E_\theta\{\cdot\}$ the expectation operator w.r.t. $f_\theta$. Let X be a random vector of observations, governed by $f_\theta$ for some $\theta\in\Theta$. The support of $f_\theta$ is assumed to be $\mathcal{X}^n$, the nth Cartesian power of the alphabet $\mathcal{X}$ of each component $X_i$, $i = 1,\ldots,n$. The alphabet $\mathcal{X}$ may be a finite set, a countable set, a finite interval, an infinite interval, or the entire real line. In the first two cases, the PDFs should be understood to be replaced by probability mass functions, and integrations over the observation space should be replaced by summations. A realization of X will be denoted by $x = (x_1,\ldots,x_n)$.
An estimator, $\hat{\theta}$, is given by any function of the observation vector, x, that is, $\hat{\theta} = \hat{\theta}(x)$. Since X is random, so is the estimate $\hat{\theta}(X)$, as well as the estimation error, $\varepsilon = \hat{\theta}(X) - \theta$. We associate with every possible vector value, $\varepsilon = (\varepsilon_1,\ldots,\varepsilon_d)$, of the estimation error a certain loss (or "cost", or "price"), $\rho(\varepsilon)$, where $\rho$ is a non-negative function with the following properties: (i) monotonically non-increasing in each component, $\varepsilon_i$, of $\varepsilon$, wherever $\varepsilon_i < 0$, $i = 1,\ldots,d$; (ii) monotonically non-decreasing in each component, $\varepsilon_i$, of $\varepsilon$, wherever $\varepsilon_i > 0$, $i = 1,\ldots,d$; and (iii) $\rho(0) = 0$.
Referring to Sections 3 and 4, which deal with the scalar case ($d = 1$), we further adopt the following assumptions regarding the loss function $\rho$. In Section 3, we assume that $\rho$ is also: (iv) convex, and (v) symmetric, i.e., $\rho(-\varepsilon) = \rho(\varepsilon)$ for every $\varepsilon$. In Section 4, we assume, more specifically, that $\rho(\varepsilon) = |\varepsilon|^t$, where t is a positive constant, not necessarily an integer. This is a special case of the class of loss functions considered in Section 3, except when $t < 1$, in which case $\rho$ is a concave (rather than a convex) function of $|\varepsilon|$. In Section 5, for the vector extension of the results of Section 3, the above-mentioned symmetry assumption (v) is extended to become radial symmetry; namely, $\rho(\varepsilon)$ will be assumed to depend on $\varepsilon$ only via its Euclidean norm, $\|\varepsilon\|$.
The expected cost of an estimator $\hat{\theta}$ at a point $\theta \in \Theta$ is defined as
$$R_n(\hat{\theta};\theta) = E_\theta\{\rho(\hat{\theta}(X) - \theta)\}. \quad (1)$$
The global minimax performance is defined as
$$R_n = \inf_{\hat{\theta}} \sup_{\theta\in\Theta} R_n(\hat{\theta};\theta). \quad (2)$$
Another related notion is that of local asymptotic minimax performance, defined in Section 1, and repeated here for the sake of completeness. Let $\{a_n\}$ be a positive sequence, tending to infinity, with the property that $\lim_{n\to\infty} a_n R_n(\theta)$, with the risk defined as in (2) but localized at $\theta$, is a strictly positive finite constant, $R(\theta)$. Then, we say that $R(\theta)$ is the local asymptotic minimax performance w.r.t. $\{a_n\}$ at $\theta$. The sequence $\{a_n\}$ is referred to as the convergence rate of the minimax estimator.
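The convergence-rate definition can be illustrated with the Gaussian example recalled in Section 1: for i.i.d. $N(\theta,\sigma^2)$ observations and the sample-mean estimator, the risk decays like $\sigma^2/n$, so with $a_n = n$ the normalized risk settles at $\sigma^2$. A minimal Monte Carlo sketch (our own illustration; the parameter value, sample sizes, and seed are arbitrary choices):

```python
import random

def mse_of_sample_mean(n, theta=0.7, sigma=1.0, trials=20000, seed=1):
    # Monte Carlo estimate of E_theta{(thetahat - theta)^2} for the sample mean
    rng = random.Random(seed)
    s = 0.0
    for _ in range(trials):
        xbar = sum(rng.gauss(theta, sigma) for _ in range(n)) / n
        s += (xbar - theta) ** 2
    return s / trials

for n in (5, 20, 80):
    print(n, n * mse_of_sample_mean(n))   # a_n * risk ≈ sigma^2 = 1
```

The product $n \cdot \mathrm{MSE}$ is essentially constant across sample sizes, which is exactly the sense in which $a_n = n$ is the convergence rate here.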
As in earlier articles on minimax lower bounds, many of our proposed bounds are given in terms of the error probability pertaining to certain auxiliary hypothesis testing problems that are associated with two or more test points in the parameter space, where the choice of those test points is subject to optimization. We therefore provide a few definitions, notation conventions, and background associated with elementary hypothesis testing. For further details, the reader is referred to any one of many textbooks that cover the topic, for example, Van Trees (Sections 2.2 and 2.3 in [1]), Helstrom (Chapter III in [33]), or Whalen (Chapter 5 in [34]).
For given $\theta_1$ and $\theta_2$, both in $\Theta$, consider the problem of deciding between two possible hypotheses regarding the probability distribution that governs the given vector of observations, X. Under hypothesis $H_1$, X is governed by $f_{\theta_1}$, and under hypothesis $H_2$, X is governed by $f_{\theta_2}$. Suppose also that the a priori probability of $H_1$ being the actual underlying hypothesis is q, where $0 \le q \le 1$ is given, and so the a priori probability of $H_2$ is the complementary probability, $1-q$. When $q = 1/2$, we say that the priors are equal; otherwise, the priors are unequal. A decision rule $\Omega$ is a partition of the observation space, $\mathcal{X}^n$, into two disjoint decision regions, $\Omega_1$ and its complementary region, $\Omega_2 = \mathcal{X}^n \setminus \Omega_1$: given that $x \in \Omega_1$, we decide in favor of $H_1$; otherwise, we decide in favor of $H_2$. The probability of error, associated with the hypotheses $H_1$ and $H_2$, referring to test points $\theta_1$ and $\theta_2$, respectively, with priors q and $1-q$, respectively, when using the decision rule $\Omega$, is defined as
$$P_e(\Omega) = q \int_{\Omega_2} f_{\theta_1}(x)\,dx + (1-q) \int_{\Omega_1} f_{\theta_2}(x)\,dx.$$
A well-known elementary result in decision theory asserts that the optimal decision rule, $\Omega^*$, in the sense of minimizing $P_e(\Omega)$, is given by
$$\Omega_1^* = \{x:\ q f_{\theta_1}(x) \ge (1-q) f_{\theta_2}(x)\},$$
where it should be pointed out that attributing the case of a tie, $q f_{\theta_1}(x) = (1-q) f_{\theta_2}(x)$, to $H_1$ is completely arbitrary, and could alternatively have been attributed to $H_2$ without affecting the probability of error. Here and in the sequel, the minimum probability of error, associated with $\Omega^*$, namely, $P_e(\Omega^*)$, will be denoted more simply by $P_e(q;\theta_1,\theta_2)$. On substituting $\Omega^*$ into the above definition of the probability of error, we obtain
$$P_e(q;\theta_1,\theta_2) = \int \min\{q f_{\theta_1}(x),\ (1-q) f_{\theta_2}(x)\}\,dx = 1 - \int \max\{q f_{\theta_1}(x),\ (1-q) f_{\theta_2}(x)\}\,dx. \quad (8)$$
The expression with the minimum will appear in some of our lower bounds in the sequel and will be recognized and interpreted as $P_e(q;\theta_1,\theta_2)$. The expression with the maximum is the one that extends to multiple hypothesis testing, as will be detailed below.
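The minimum error probability in (8) is easy to evaluate numerically for simple models. The following sketch (our own illustration, assuming a pair of unit-variance Gaussian densities) integrates the min-expression on a grid and checks it against the closed-form MAP error probability for this Gaussian pair:

```python
import math

def phi(x):
    # standard normal pdf
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    # standard normal cdf
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def pe_min_integral(q, th1, th2, lo=-10.0, hi=11.0, steps=200001):
    # numerically integrate  min{ q f_th1(x), (1-q) f_th2(x) }, as in Eq. (8)
    h = (hi - lo) / (steps - 1)
    s = 0.0
    for i in range(steps):
        x = lo + i * h
        s += min(q * phi(x - th1), (1.0 - q) * phi(x - th2)) * h
    return s

def pe_closed_form(q, th1, th2):
    # MAP threshold for two unit-variance Gaussians with priors (q, 1-q)
    xs = 0.5 * (th1 + th2) + math.log(q / (1.0 - q)) / (th2 - th1)
    return q * (1.0 - Phi(xs - th1)) + (1.0 - q) * Phi(xs - th2)

q, th1, th2 = 0.3, 0.0, 1.0
print(pe_min_integral(q, th1, th2), pe_closed_form(q, th1, th2))  # the two agree
```

For $q = 1/2$ and unit separation, both evaluations return the familiar value $\Phi(-1/2) \approx 0.3085$.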
The auxiliary problem of binary hypothesis testing extends from two hypotheses to a general number, m, of hypotheses, associated with m test points, $\theta_1,\ldots,\theta_m \in \Theta$, in the following manner. Under hypothesis $H_i$, the observation vector X is governed by $f_{\theta_i}$, and the a priori probability of $H_i$ being the underlying true hypothesis is denoted $p_i$, $i = 1,\ldots,m$, where $p_1,\ldots,p_m$ are given non-negative numbers summing to unity. Here, a decision rule $\Omega = (\Omega_1,\ldots,\Omega_m)$ is a partition of $\mathcal{X}^n$ into m disjoint regions such that if $x \in \Omega_i$, we decide in favor of $H_i$, $i = 1,\ldots,m$. The probability of error associated with $\Omega$, $\{\theta_i\}$ and $\{p_i\}$ is defined as
$$P_e(\Omega) = \sum_{i=1}^m p_i \int_{\Omega_i^{\mathrm{c}}} f_{\theta_i}(x)\,dx,$$
where $\Omega_i^{\mathrm{c}}$ is complementary to $\Omega_i$. The optimal MAP decision rule $\Omega^*$ selects the hypothesis $H_i$ whose index i maximizes the product $p_i f_{\theta_i}(x)$ among all $i \in \{1,\ldots,m\}$, for the given x, where ties are broken arbitrarily. The probability of error associated with $\Omega^*$ is well known to be given by
$$P_e(\Omega^*) = 1 - \int \max_{1\le i\le m} p_i f_{\theta_i}(x)\,dx.$$
Note that for $m > 2$, this is different from the expression
$$\int \min_{1\le i\le m} p_i f_{\theta_i}(x)\,dx,$$
as the latter can be interpreted as the probability that the index i of the true hypothesis minimizes (rather than maximizes) the product $p_i f_{\theta_i}(x)$ over $i \in \{1,\ldots,m\}$ for the given x. Imagine an observer that, upon observing a realization x of the random vector X, creates a list of the indices with the k largest values of $p_i f_{\theta_i}(x)$ for some $k < m$, and an error is defined as the event where the correct i is not in that list. This is referred to as a list error, which is a term borrowed from the fields of coded communication and information theory. The last expression is the probability of list error for $k = m - 1$. We will encounter this expression later in certain versions of our lower bound. This completes the background needed about hypothesis testing.
As described in Section 1, our objective in this work is to derive relatively simple and easily computable lower bounds on the minimax risk that are as tight as possible. While many existing lower bounds in the literature are satisfactory in terms of yielding the correct rate of convergence, $\{a_n\}$, here we wish also to improve the bound on the constant factor, $R(\theta)$. Many of our examples involve numerical calculations, which include optimization over auxiliary parameters and occasionally also numerical integrations. All these calculations were carried out using MATLAB R2019a (9.6.0.1072779) 64-bit (win64).
3. Lower Bounds for Convex Symmetric Loss Functions
As explained earlier, here and in Section 4, we consider a scalar parameter, namely, $d = 1$.
Theorem 1. Let the assumptions of Section 2 be satisfied for $d = 1$ and let $\rho$ be a symmetric convex loss function. Then,
$$\inf_{\hat{\theta}} \sup_{\theta\in\Theta} E_\theta\{\rho(\hat{\theta}-\theta)\} \ \ge\ \sup_{\theta_1,\theta_2\in\Theta}\ \sup_{0\le q\le 1}\ 2\rho\!\left(\frac{\theta_2-\theta_1}{2}\right) P_e(q;\theta_1,\theta_2), \quad (11)$$
where $P_e(q;\theta_1,\theta_2)$ is defined as in Equation (8).

Proof of Theorem 1. For every $\theta_1, \theta_2 \in \Theta$ and $q \in [0,1]$,
$$\begin{aligned}
\sup_{\theta\in\Theta} E_\theta\{\rho(\hat{\theta}-\theta)\} &\ge q E_{\theta_1}\{\rho(\hat{\theta}-\theta_1)\} + (1-q) E_{\theta_2}\{\rho(\hat{\theta}-\theta_2)\}\\
&\stackrel{\text{(a)}}{=} q E_{\theta_1}\{\rho(\hat{\theta}-\theta_1)\} + (1-q) E_{\theta_2}\{\rho(\theta_2-\hat{\theta})\}\\
&= \int \left[q f_{\theta_1}(x)\rho(\hat{\theta}(x)-\theta_1) + (1-q) f_{\theta_2}(x)\rho(\theta_2-\hat{\theta}(x))\right] dx\\
&\ge \int \min\{q f_{\theta_1}(x),\ (1-q) f_{\theta_2}(x)\}\left[\rho(\hat{\theta}(x)-\theta_1) + \rho(\theta_2-\hat{\theta}(x))\right] dx\\
&\stackrel{\text{(b)}}{\ge} 2\rho\!\left(\frac{\theta_2-\theta_1}{2}\right) \int \min\{q f_{\theta_1}(x),\ (1-q) f_{\theta_2}(x)\}\,dx\\
&= 2\rho\!\left(\frac{\theta_2-\theta_1}{2}\right) P_e(q;\theta_1,\theta_2),
\end{aligned}$$
where (a) is due to the assumed symmetry of $\rho$ and (b) is by its assumed convexity, as $\rho(u) + \rho(v) \ge 2\rho((u+v)/2)$. Since the resulting inequality applies to every $\theta_1$, $\theta_2$ and q, it applies, in particular, also to the supremum over these auxiliary parameters. This completes the proof of Theorem 1. □
Before we proceed, two comments are in order:

Note that $P_e(q;\theta_1,\theta_2)$ is a concave function of q for fixed $(\theta_1,\theta_2)$, as it can be presented as the minimum among a family of affine functions of q, given by $q\int_{\Omega_2} f_{\theta_1}(x)\,dx + (1-q)\int_{\Omega_1} f_{\theta_2}(x)\,dx$, where $\Omega_1$ runs over all possible subsets of the observation space, $\mathcal{X}^n$ (with $\Omega_2$ its complement). Another way to see why this is true is by observing that $P_e(q;\theta_1,\theta_2)$ is given in (8) by an integral whose integrand, $\min\{q f_{\theta_1}(x),\ (1-q) f_{\theta_2}(x)\}$, is concave in q. Clearly, $P_e(q;\theta_1,\theta_2) = 0$ whenever $q = 0$ or $q = 1$. Thus, $P_e(q;\theta_1,\theta_2)$ is maximized by some q between 0 and 1. If $P_e(q;\theta_1,\theta_2)$ is strictly concave in q, then the maximizing q is unique.

Note that the lower bound (11) is tighter than the lower bound of Ziv and Zakai, which was obtained in Equations (6)–(9a) in [18], both because of the factor of 2 and because of the freedom to optimize q rather than setting $q = 1/2$. In a further development of [18], the factor of 2 was accomplished too, but at the price of assuming that the density of the estimation error is symmetric about the origin (see the discussion after (10) therein), which limits the class of estimators to which the bound applies. The factor of 2 and the degree of freedom q are also the two ingredients that make the difference between (11) and the lower bound due to Le Cam [20] (see also [26,28]). In Chapter 2 in [26], Guntuboyina reviews standard bounding techniques, including those of Le Cam, Assouad, and Fano. In particular, in Example 2.3.2 therein, Guntuboyina presents a lower bound in terms of the error probability associated with general priors, which in the case of two hypotheses is given in terms of $P_e(q;\theta_1,\theta_2)$ in our notation. Now, if $\rho$ is symmetric and monotonically non-decreasing in the absolute error, the resulting bound is of the same form as (11), except that it lacks the prefactor of 2.
Our first example demonstrates Theorem 1 on a somewhat technical but simple model, with an emphasis on the point that the optimal q may differ from $1/2$, and that it is therefore useful to maximize w.r.t. q in order to improve the bound relative to the choice $q = 1/2$.
Example 1. Let X be a random variable distributed exponentially according to and , so that the only possibilities to select two different values of θ in the lower bound are and . In terms of the hypothesis testing problem pertaining to the lower bound, the likelihood ratio test (LRT) is by comparison of to . Now, if , or equivalently, , the decision is always in favor of , and then . For , the optimal LRT compares X to . If , one decides in favor of ; otherwise, one decides in favor of . Thus, In summary, It turns out that for , , whereas the maximum is , attained at . Thus, This concludes Example 1.

In the above example, we considered just one observation, $n = 1$. From now on, we will refer to the case of a general number, n, of observations. In particular, the following simple corollary to Theorem 1 yields a local asymptotic minimax lower bound.
Corollary 1. For a given $\theta$ and a constant s, let $\{\delta_n\}$ denote a positive sequence tending to zero with the property that exists and is given by a strictly positive constant, which will be denoted by . Also, let . Then, the local asymptotic minimax performance w.r.t. $\{a_n\}$ is lower bounded by .

Corollary 1 is readily obtained from Theorem 1 by substituting $\theta_1 = \theta$ and $\theta_2 = \theta + \delta_n$ in Equation (11), then multiplying both sides of the inequality by $a_n$, and finally, taking the limit inferior of both sides.
Next, we study a few examples of the use of Corollary 1. As in Example 1, we emphasize again in Example 2 below the importance of having the degree of freedom to maximize over the prior q, rather than fixing $q = 1/2$. Also, in all the examples that were examined, the rate of convergence is the same as the optimal rate of convergence. In other words, it is tight in the sense that there exists an estimator (for example, the maximum likelihood estimator) whose risk tends to zero at the same rate. In some of these examples, we compare our lower bound to those of earlier reported results on the same models.
Example 2. Let be independently, identically distributed (i.i.d.) random variables, uniformly distributed in the range . In the corresponding hypothesis testing problem of Theorem 1, the hypotheses are and , with priors q and . There are two cases: If , or equivalently, , one always decides in favor of , and so the probability of error is q. If, on the other hand, , namely, , we decide in favor of whenever , and then an error occurs only if is true, yet , which happens with probability . Thus, which is readily seen to be maximized by , and then . Now, to apply Corollary 1, we let and , which amounts to and . Then, in the case of the MSE criterion, , we have , and so, w.r.t. . This bound will be further improved upon in Section 4. If, instead of maximizing w.r.t. q, we select , then , and then the resulting bound would become w.r.t. . Therefore, the maximization over q plays an important role here in terms of tightening the lower bound. More generally, for (), , and we obtain w.r.t. , where the supremum, which is in fact a maximum, can always be calculated numerically for every given t. For large t, the maximizing u is approximately , which yields . On the other hand, for , we end up with . For large t, the bound of is inferior to the bound with the optimal q by a factor of about . This concludes Example 2.

Example 3. Let be i.i.d. random variables, uniformly distributed in the interval . For the hypothesis testing problem, let be chosen between and . Clearly, if , the underlying hypothesis is certainly . Likewise, if , the decision is in favor of with certainty. Thus, an error can occur only if all fall in the interval , an event that occurs with probability . In this event, the best one can do is to select the hypothesis with the larger prior, with a probability of error given by . Thus, and so, , achieved by . Now, let us select , which yields . For (), we have w.r.t. . For the case of MSE, , . The constant should be compared with that of (see Example 4.9 in [35]), which is two orders of magnitude smaller. This concludes Example 3.

Example 4. Let , where are i.i.d. Gaussian random variables with zero mean and variance . Here, for the corresponding binary hypothesis testing problem, the optimal value of q is always $1/2$. This can be readily seen from the concavity of in q and its symmetry around , as . Since , where , we select , which yields , and then, for the MSE case, , w.r.t. , and so the asymptotic lower bound is . We now compare this bound (which will be further improved in Section 4) with a few earlier reported results. In one of the versions of Le Cam's bound (see Example 4.7 in [35]) for the same model, the lower bound turns out to be , namely, an order of magnitude smaller. Also, in Example 3.1 in [28], another version of Le Cam's method yields . According to Corollary 4.3 in [36], . Yet another comparison is with Theorem 5.9 in [37], where we find an inequality, which in our notation reads as follows: . Combining it with Chebychev's inequality yields . In [23] (p. 257), it is shown that when , the minimax estimator for this model is the sample mean, and so, in this case, the correct constant in front of is actually 1. This concludes Example 4.

The next example is related to Example 4, as it is based on the use of the central limit theorem (CLT), which means that the Gaussian tail distribution is used here too.
Example 5. Consider an exponential family, where is a given function and is a normalization function given by , assuming that the integral converges. In the auxiliary binary hypothesis problem, the test statistic is . If and , the LRT amounts to examining whether is larger than . In this case, the probability of error can be asymptotically assessed using the CLT, which, after a simple algebraic manipulation, becomes , where is the Fisher information. Thus, for the MSE, w.r.t. . This concludes Example 5.

In several earlier examples, our bound was shown to outperform (sometimes significantly so) earlier reported bounds for the corresponding models. However, to be fair, we should not ignore the fact that there are also situations where our bound may not be tighter than earlier bounds applied to the same model. Such a case is demonstrated in Example 6 below, where our result is compared to those of Ziv and Zakai [18] and Chazan, Zakai, and Ziv [4] in the context of estimating the delay of a known continuous-time signal corrupted by additive white Gaussian noise (for further developments and applications of the Ziv–Zakai and the Chazan–Zakai–Ziv bounds, see, e.g., [38,39,40,41,42,43,44,45] and references therein). Having said that, it should also be kept in mind that our emphasis in this work is on bounds that are relatively simple and easy to calculate (at least numerically), whereas the Chazan–Zakai–Ziv bound, although very tight in many situations, is notoriously difficult to calculate. Indeed, in this setting, the explicit behavior of the resulting complicated bound of [4] is transparent only at high values of the signal-to-noise ratio (SNR); see Equations (11), (12), and (14) in [4].
Example 6. Let , , where is additive white Gaussian noise (AWGN) with double-sided spectral density , and is a deterministic signal that depends on the unknown parameter θ. It is assumed that the signal energy, , does not depend on θ (which is the case, for example, when θ is a delay parameter of a pulse fully contained in the observation interval, or when θ is the frequency or the phase of a sinusoidal waveform). We further assume that is at least twice differentiable w.r.t. θ, and that the energies of the first two derivatives are also independent of θ. Then, as shown in Appendix A, for small , where is the energy . The optimal LRT for deciding between the two hypotheses is based on the comparison between the correlations and . Again, the optimal value of q is $1/2$. Thus, where is the power of . Since we are dealing here with continuous time, instead of a sequence , we use a function, , of the observation time, T, which in this case would be . Let and . Then, which, for the MSE case, yields w.r.t. , which means that the minimax loss is lower bounded by . This has the same form as the Cramér–Rao lower bound (CRLB), except that the multiplicative factor is 0.3314 rather than 1. In Equation (20) in [18], the bound is of the same form, but with a multiplicative constant of 0.16 at high signal-to-noise ratio (SNR). However, in [4], the constant of proportionality was improved to 1 in the high-SNR limit, just like in the Cramér–Rao lower bound for the same model. The constant 0.3314 will be improved later to 0.4549 (in Example 9 in the sequel), but it will still be below 1.

The case where is not everywhere differentiable w.r.t. θ can be handled in a similar manner, but some caution should be exercised. For example, consider the model, where , , is AWGN as before, and is a rectangular pulse with duration Δ and amplitude , E being the signal energy. Here, ; namely, it also includes a linear term in , not just the quadratic one. This changes the asymptotic behavior of the resulting lower bound to , which turns out to be w.r.t. (namely, a minimax lower bound of ). It is interesting to compare this bound to the Chapman–Robbins bound for the same model, which is a local bound of the same form, but with a multiplicative constant of instead of , and which is limited to unbiased estimators. The Chazan–Zakai–Ziv bound [4] for this case is difficult to calculate, but in the high-SNR regime, it behaves like w.r.t. . It is conceivable that the Chazan–Zakai–Ziv bound for estimating the delay of non-differentiable signals in Gaussian white noise has been improved even further since its publication in 1975, but the point of this particular example remains: our bound may not always be the best available bound. Still, since our bound is easy to calculate, it is worth comparing it to other bounds for every given model. This concludes Example 6.
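Under our reading of (11), the constant 0.3314 appearing in Example 6 arises from the one-dimensional maximization of $2u^2 Q(u)$ over $u > 0$, where Q is the Gaussian tail function: taking test points at distance $2\delta$ and normalizing $\delta$ by the noise scale leaves exactly this expression. The following sketch (a hypothesis based on the reconstructed form of the bound, not a derivation from the paper's omitted equations) reproduces the number:

```python
import math

def Q(u):
    # Gaussian tail function Q(u) = P(N(0,1) > u)
    return 0.5 * math.erfc(u / math.sqrt(2.0))

def g(u):
    # the normalized bound constant 2 u^2 Q(u)
    return 2.0 * u * u * Q(u)

us = [i / 1000.0 for i in range(1, 5000)]
u_star = max(us, key=g)
print(u_star, g(u_star))   # maximum near u ≈ 1.19, value ≈ 0.3314
```

The same maximization governs the Gaussian location model of Example 4, since only the normalized separation between the two test points enters the bound.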
5. Extensions to the Vector Case
In this section, we outline extensions of some of our findings in Sections 3 and 4 to the vector case. Let $\theta$ be a parameter vector of dimension d and $\Theta \subseteq \mathbb{R}^d$. Let $\rho$ be a convex loss function that depends on the d-dimensional error vector $\varepsilon$ only via its norm, $\|\varepsilon\|$ (that is, $\rho$ has radial symmetry).
First, observe that Theorem 1 extends verbatim to the vector case, as nothing in the proof of Theorem 1 is based on any assumption or property that holds only when $\theta$ is a scalar parameter.
Corollary 1 can also be extended by letting the two test points of the auxiliary hypothesis testing problem, $\theta_1$ and $\theta_2$, be selected such that the distance between them, $\|\theta_2 - \theta_1\|$, decays at an appropriate rate as a function of n, so as to make the probability of error, $P_e(q;\theta_1,\theta_2)$, converge to a positive constant as $n \to \infty$, as was achieved in Corollary 1 for the scalar case. But in the vector case considered now, there is an additional degree of freedom, which is the direction of the displacement vector, $\theta_2 - \theta_1$. This direction can now be optimized so as to yield the largest (hence the tightest) lower bound. To be specific, in the vector case, Corollary 1 remains the same, except that now s should be thought of as a d-dimensional vector rather than a scalar, the scalar quantity in the denominator of (19) should be replaced by its counterpart involving v, where v is an arbitrary fixed non-zero vector (for example, any unit-norm vector in the case of MSE), and the supremum in (20) should be taken over s. To demonstrate this point, we now revisit Example 5, but this time for the vector case.
Example 11. Consider the case where , with each factor in this product PDF being given by a d-dimensional exponential family, where is the inner product of the d-dimensional parameter vector θ and a d-dimensional vector of statistics, , and , provided that the integral converges. In the above notation, both θ and are understood to be column vectors, and denotes transposition of θ to a row vector. Similarly as in Corollary 1, to obtain a local bound at a given θ, we let and , where . A simple extension of the derivation in Example 5 (using the CLT) yields , where s is considered a column vector, the superscript prime denotes vector transposition as before, and is the Fisher information matrix of the exponential family. Considering the case of the MSE, with v being any unit-norm vector, we have , and then the following lower bound is obtained w.r.t. : , where is the smallest eigenvalue of . In this example, it is apparent that the chosen direction of the vector s is that of the eigenvector corresponding to the smallest eigenvalue of the Fisher information matrix. This completes Example 11.

In the vector case considered in this section, it is also instructive to extend the scope from two test points, $\theta_1$ and $\theta_2$, to multiple test points, $\theta_1,\ldots,\theta_m$, along with the corresponding weights (or priors), $p_1,\ldots,p_m$ (see the exposition at the end of Section 2). Interestingly, this will also lead to new bounds even for the case of $m = 2$ test points.
To this end, we also consider a set of m unitary transformation matrices, $\{T_1,\ldots,T_m\}$, with the following properties: (i) $T_i\varepsilon = T_j\varepsilon$, for all $\varepsilon \ne 0$, if and only if $i = j$, and (ii) $\sum_{i=1}^m T_i = 0$. For example, if $d = 2$, take $T_i$ to be the matrix of rotation by $2\pi i/m$, $i = 1,\ldots,m$.
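For $d = 2$, the two stated properties of the rotation family are easy to verify directly. A sketch (assuming rotation angles $2\pi i/m$, $i = 1,\ldots,m$, and our own matrix notation):

```python
import math

def rotation(angle):
    # 2x2 rotation matrix (an orthogonal, hence norm-preserving, transformation)
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s], [s, c]]

def make_family(m):
    # m rotations by 2*pi*i/m, i = 1..m; an assumed concrete choice for d = 2
    return [rotation(2.0 * math.pi * i / m) for i in range(1, m + 1)]

m = 3
Ts = make_family(m)

# property (ii): the matrices sum to the zero matrix
S = [[sum(T[r][c] for T in Ts) for c in range(2)] for r in range(2)]
print(S)

# property (i): for eps != 0, the images T_i eps are pairwise distinct,
# and each image preserves the Euclidean norm of eps
eps = (1.0, 0.5)
images = [(T[0][0] * eps[0] + T[0][1] * eps[1],
           T[1][0] * eps[0] + T[1][1] * eps[1]) for T in Ts]
print(images)
```

The zero-sum property is what cancels the estimator-dependent terms in the proof of Theorem 4 below, while norm preservation keeps the radially symmetric loss unchanged.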
Theorem 4. Let , and be given, and let $T_1,\ldots,T_m$ be unitary transformations that sum to zero, as described in the above paragraph. Then, for a convex loss function that depends on ε only via $\|\varepsilon\|$, we have:

As explained in Section 2, the integral on the right-hand side can be interpreted as the probability of list error with a list size of $m - 1$.
Proof of Theorem 4. The proof is a direct extension of the proof of Theorem 1:
where in (a), we have used the unitarity (and hence the norm-preserving property) of , as is assumed to depend only on the norm of the error vector; in (b), we used the convexity of ; and in (c), we used the fact that , which implies that , thus making the bound independent of the estimator, . This completes the proof of Theorem 4. □
Theorem 1 is a special case where , , and , where I is the identity matrix. The integral associated with the lower bound of Theorem 4 might not be trivial to evaluate in general for . However, there are some choices of the auxiliary parameters that may facilitate the calculations. One such choice is as follows. For some positive integer , take , for some , , for some , , for some , and finally, . The integrand then becomes the minimum between two functions only, as in Section 3. Denoting , the bound then becomes
Redefining
we have
and the following corollary to Theorem 4 is obtained.
Corollary 2. Let the conditions of Theorem 4 be satisfied. Then,
Note that if m is even, , and ; then, we are actually back to the bound of , and so, the optimal bound for even cannot be worse than our bound of . We do not, however, have precisely the same argument for odd m, but for large m, it becomes immaterial whether m is even or odd. In its general form, the bound of Theorem 4 poses a heavy optimization problem, as we have the freedom to optimize , (under the constraints that they are all unitary and sum to zero), and (under the constraints that they are all non-negative and sum to unity).
Another relatively convenient choice is to take , to obtain another corollary to Theorem 4:
Corollary 3. Let the conditions of Theorem 4 be satisfied. Then,
Example 12. To demonstrate a calculation of the extended lower bound for , consider the following model. We are observing a noisy signal,
where ϑ is the desired parameter to be estimated, ζ is a nuisance parameter, taking values within an interval for some , are i.i.d. Gaussian random variables with zero mean and variance , and and are two given orthogonal waveforms with . Suppose we are interested in estimating ϑ based on the sufficient statistics and , which are jointly Gaussian random variables with the mean vector and the covariance matrix , I being the identity matrix. We denote realizations of by . Let us also denote . Since we are interested only in estimating ϑ, our loss function will depend only on the estimation error of the first component of θ, which is ϑ. Consider the choice and let be counter-clockwise rotation transformations by , . For a given , let us select , , and . Finally, let . In order to calculate the integral
the plane can be partitioned into three slices over which the contributed integrals are equal. In each such region, the smallest is integrated. In other words, each in its turn is integrated over the region whose Euclidean distance to is larger than its distances to the other two values of θ. For , this is the region . The factor of cancels with the three identical contributions from , , and due to the symmetry. Therefore,
Our next mathematical manipulations in this example are in the spirit of the passage from Theorem 1 to Corollary 1, that is, selecting the test points increasingly close to each other as functions of n, so that the probability of list error tends to a positive constant. To this end, we change the integration variable x to and select for some to be optimized later.
Then,
The MSE bound then becomes
This bound is not as tight as the corresponding bound of , which results in , but it should be kept in mind that here, we have not attempted to optimize the choices of , , , , , , , and . Instead, we have chosen these parameter values from considerations of computational convenience, just to demonstrate the calculation. This concludes Example 12.

Finally, a comment is in order regarding the possible extensions of Section 4 to the vector case. Such extensions are conceptually straightforward whenever the loss function of the error vector is given by the sum of losses associated with the different components of the error vector. Most notably, when the loss function is the MSE, , each component of the estimation error can be handled separately by the methods of Section 4. Of course, the two or three test points should be chosen such that they differ in all components of the parameter vector. In the case of three test points, it makes sense to select them equally spaced along a straight line of a general direction in . We will not pursue this extension any further in the framework of this work.
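As a side note, the three-fold symmetry exploited in Example 12 (equal contributions from the three slices of the plane) lends itself to a quick numerical check. The sketch below uses made-up test points obtained by rotating a base point by 2π/3 about the origin, together with a rotationally symmetric Gaussian sample centered at the rotation center, which is what the symmetry argument requires; the three regions of the partition then carry equal probability:

```python
import numpy as np

def rotation(angle):
    """2x2 counter-clockwise rotation matrix."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

# hypothetical test points: a base point rotated by 0, 120, and 240 degrees
base = np.array([1.0, 0.0])
thetas = [rotation(2 * np.pi * k / 3) @ base for k in range(3)]

# rotationally symmetric (standard Gaussian) sample centered at the origin
rng = np.random.default_rng(0)
x = rng.standard_normal((200_000, 2))

# for each sample, find the test point FARTHEST from it, matching the
# partition used in Example 12 (each theta_j is integrated over the region
# farther from it than from the other two test points)
dists = np.stack([np.linalg.norm(x - t, axis=1) for t in thetas])
farthest = dists.argmax(axis=0)
probs = np.bincount(farthest, minlength=3) / len(x)

# by the three-fold symmetry, the three regions have equal probability
print(np.round(probs, 2))  # each entry is close to 1/3
```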