#### *2.1. Preliminaries*

The minimax principle addresses the worst case and provides the best solution for it [2]; thus, this approach gives a guaranteed result [5]. Applied to the problem of estimation of location, it yields a robust version of the maximum likelihood principle [2]. Usually, estimation of location is of primary interest, and, in this study, we focus on it.

Let *x*1, ... , *xn* be a sample from a distribution with density *p*(*x* − θ) belonging to a convex class P, where θ is a location parameter. Further, we assume that the density *p* is symmetric; hence, θ is the center of symmetry to be estimated.

An *M*-estimate *Tn* of θ is a solution to the following minimization problem:

$$T\_n = \arg\min\_{\theta} \sum\_{i=1}^n \rho(x\_i - \theta),\tag{1}$$

where ρ(*u*) is called the *function of contrast* [15]: ρ(*xi* − θ) is a measure of the difference between the observation *xi* and the estimated center of symmetry. The following particular cases of (1) are of special interest: (i) for ρ(*u*) = *u*<sup>2</sup>, we have the least squares method with the sample mean *x* as the estimate of location; (ii) for ρ(*u*) = |*u*|, we arrive at the least absolute values method with the sample median estimate med *x*; and (iii), most importantly, for a given density *p*, the choice ρ(*u*) = −log *p*(*u*) yields the maximum likelihood (*ML*) estimate of location.

It is more convenient to formulate the properties of *M*-estimates in terms of the derivative of the function of contrast ψ(*u*) = ρ′(*u*), called a *score function*. In this case, an *M*-estimate is defined as a solution to the following implicit estimating equation:

$$\sum\_{i=1}^{n} \psi(x\_i - T\_n) = 0.\tag{2}$$
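As a numerical illustration, Equation (2) can be solved by Newton iterations started from the sample median; the following minimal Python sketch (the function name `m_estimate` is ours, chosen for illustration) assumes a smooth score ψ with derivative ψ′:

```python
import numpy as np

def m_estimate(x, psi, dpsi, tol=1e-10, max_iter=100):
    """Solve sum_i psi(x_i - T) = 0 for T by Newton's method."""
    t = np.median(x)  # a robust starting point
    for _ in range(max_iter):
        r = x - t
        step = np.sum(psi(r)) / np.sum(dpsi(r))  # Newton step for the estimating equation
        t += step
        if abs(step) < tol:
            break
    return t

# Sanity check: psi(u) = u reproduces the sample mean (the least squares case).
rng = np.random.default_rng(0)
x = rng.standard_normal(100)
print(m_estimate(x, lambda u: u, lambda u: np.ones_like(u)), x.mean())
```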

Under rather general regularity conditions imposed on the class Ψ of score functions ψ and on the class P of densities *p* (their various forms can be found in References [2,5,6]), *M*-estimates are consistent and asymptotically normal with the asymptotic variance *AV*:

$$Var\{n^{1/2}T\_n\} = AV(\psi, p) = \frac{A}{B^2} = \frac{\int \psi^2(x)\, p(x) \, dx}{\left(\int \psi'(x)\, p(x) \, dx\right)^2},\tag{3}$$

where

$$A(\psi, p) = \int \psi^2(x)\, p(x) \, dx, \qquad B(\psi, p) = \int \psi'(x)\, p(x) \, dx.$$

For *M*-estimates (2), the following result holds [5].

**Theorem 1.** *(Huber, 1964) Under regularity conditions, M-estimates satisfy the minimax property*

$$AV(\psi^\*, p) \le AV(\psi^\*, p^\*) = \sup\_{p \in \mathcal{P}} \inf\_{\psi \in \Psi} AV(\psi, p) \le AV(\psi, p^\*), \tag{4}$$

*where p*<sup>∗</sup>(*x*) *is the least informative (least favorable) density minimizing Fisher information for location I*(*p*) *over the class* P*:*

$$p^\* = \arg\min\_{p \in \mathcal{P}} I(p), \qquad I(p) = \int \left[\frac{p'(x)}{p(x)}\right]^2 p(x) \, dx.\tag{5}$$

From (4) and (5), it follows that the minimax function of contrast and the minimax score function are given by the maximum likelihood method for the least informative density *p*<sup>∗</sup>:

$$
\rho^\*(x) = -\log p^\*(x), \qquad \psi^\*(x) = -p^{\*\prime}(x)/p^\*(x).\tag{6}
$$

Thus, the pair (ψ<sup>∗</sup>, *p*<sup>∗</sup>) is the saddle-point of the functional *AV*(ψ, *p*). The right-hand part of inequality (4) is the Rao–Cramér inequality:

$$AV(\psi, p^\*) \ge AV(-p^{\*\prime}/p^\*, p^\*) = 1\Big/\!\int \frac{\left(p^{\*\prime}(x)\right)^2}{p^\*(x)} \, dx = 1/I(p^\*),$$

whereas its left-hand part guarantees the asymptotic accuracy of robust minimax estimation with the following upper bound upon the asymptotic variance of the minimax variance robust *M*-estimate of location: *AV*(ψ<sup>∗</sup>, *p*) ≤ 1/*I*(*p*<sup>∗</sup>).

The key point of the minimax approach is the solution of the variational problem (5); in what follows, various classes P with the corresponding least informative densities *p*<sup>∗</sup> and minimax estimates are listed.

Now, we recall Huber's classical solution for ε-contaminated normal distributions (Tukey's gross-error model):

$$\mathcal{P}\_{\varepsilon} = \left\{ p : p(x) \ge (1 - \varepsilon)\, \varphi(x), \; \forall x \right\},\tag{7}$$

where ϕ(*x*) = (2π)<sup>−1/2</sup> exp(−*x*<sup>2</sup>/2) is the standard normal distribution density.

**Theorem 2.** *(Huber, 1964) In the class* P<sub>ε</sub>*, the least informative density p*<sup>∗</sup> *and the optimal score function have the following forms [2]:*

$$p^\*(x) = p\_{Huber}(x) = \begin{cases} (1 - \varepsilon)\, \varphi(x), & |x| \le k, \\ C \exp\left(-D|x|\right), & |x| > k, \end{cases} \tag{8}$$


$$\psi^\*(x) = \psi\_{Huber}(x) = \begin{cases} x, & |x| \le k, \\ D\, \mathrm{sgn}\,(x), & |x| > k, \end{cases} \tag{9}$$

*where the parameters C, D, and k satisfy the conditions of norming, continuity, and continuous differentiability of the solution at x* = *k:*

$$\int p^\*(x) \, dx = 1, \qquad p^\*(k-0) = p^\*(k+0), \qquad p^{\*\prime}(k-0) = p^{\*\prime}(k+0).$$

Finally, we get the linear bounded score ψ*Huber*(*x*) = max{−*k*, min{*x*, *k*}}, where *k* depends on the value of the contamination parameter ε, as follows:

$$\frac{2\varphi(k)}{k} - 2\Phi(-k) = \frac{\varepsilon}{1 - \varepsilon}, \qquad \Phi(x) = \int\_{-\infty}^{x} \varphi(t) \, dt;$$

the values of the parameter *k* = *k*(ε) are tabulated in Reference [5].

The particular cases of this solution for ε = 0 and ε → 1 are given by *k* → ∞ (the sample mean) and *k* = 0 (the sample median), respectively.
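For a given ε, the threshold *k*(ε) is easily computed by a standard root-finder; a minimal sketch (the bracket [10<sup>−6</sup>, 10] is our assumption, wide enough for practical contamination levels):

```python
from scipy.optimize import brentq
from scipy.stats import norm

def huber_k(eps):
    """Solve 2*phi(k)/k - 2*Phi(-k) = eps/(1 - eps) for the threshold k."""
    f = lambda k: 2 * norm.pdf(k) / k - 2 * norm.cdf(-k) - eps / (1 - eps)
    return brentq(f, 1e-6, 10.0)

print(huber_k(0.1))  # approximately 1.14
```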

Huber's score function ψ*Huber*(*x*) provides a robust version of *ML* estimation: in the central zone |*xi* − θ| ≤ *k*, the data are processed by the *ML* method, and they are trimmed within the exponential distribution tails |*xi* − θ| > *k*. In the limiting case of a completely unknown density as ε → 1, the minimax variance *M*-estimate of location tends to the sample median.
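In practice, the Huber *M*-estimate of location is conveniently computed by iterative reweighting; a minimal sketch (function names are ours, for illustration):

```python
import numpy as np

def huber_location(x, k=1.14, tol=1e-10, max_iter=200):
    """M-estimate of location with the Huber score via iterative reweighting:
    each iteration solves sum_i w_i (x_i - t) = 0 with weights w_i = psi(r_i)/r_i."""
    t = np.median(x)
    for _ in range(max_iter):
        r = np.abs(x - t)
        w = np.minimum(1.0, k / np.maximum(r, 1e-12))  # psi(r)/r for the Huber score
        t_new = np.sum(w * x) / np.sum(w)
        if abs(t_new - t) < tol:
            return t_new
        t = t_new
    return t

# A 10%-contaminated normal sample: the estimate stays near 0 despite the outliers.
rng = np.random.default_rng(1)
x = np.concatenate([rng.standard_normal(90), 10 + rng.standard_normal(10)])
print(huber_location(x), np.mean(x))
```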

#### *2.2. Free Extremals of the Basic Variational Problem*

Consider the problem of minimization of Fisher information for location (5) under two basic side conditions of non-negativeness and norming: *p*(*x*) ≥ 0, ∫<sub>−∞</sub><sup>∞</sup> *p*(*x*) *dx* = 1.

Set *p*(*x*) = *q*<sup>2</sup>(*x*), and rewrite this minimization problem as

$$\text{minimize } I(p) = 4 \int\_{-\infty}^{\infty} (q'(x))^2 \, dx \quad \text{under } \int\_{-\infty}^{\infty} q^2(x) \, dx = 1.$$

Introducing the Lagrange multiplier λ related to the norming condition, we obtain the following differential equation for the function *q*(*x*): 4*q*′′(*x*) + λ*q*(*x*) = 0.

The general solutions of this harmonic oscillator equation have the well-known forms depending on the sign of λ: (i) the exponential *q*(*x*) = *C*<sub>1</sub>*e<sup>kx</sup>* + *C*<sub>2</sub>*e*<sup>−*kx*</sup> for λ < 0, (ii) the cosine *q*(*x*) = *C*<sub>1</sub> sin *kx* + *C*<sub>2</sub> cos *kx* for λ > 0, and (iii) the linear *q*(*x*) = *C*<sub>1</sub> + *C*<sub>2</sub>*x* for λ = 0, where *k* = √|λ|/2.
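These branches can be verified symbolically; a small SymPy sketch fixing the sign of λ through the substitution λ = ∓4*k*<sup>2</sup>:

```python
import sympy as sp

x = sp.symbols('x')
k = sp.symbols('k', positive=True)
q = sp.Function('q')

# lambda = -4k^2 < 0: the exponential branch C1*exp(k*x) + C2*exp(-k*x)
print(sp.dsolve(4 * q(x).diff(x, 2) - 4 * k**2 * q(x), q(x)))
# lambda = +4k^2 > 0: the trigonometric branch C1*sin(k*x) + C2*cos(k*x)
print(sp.dsolve(4 * q(x).diff(x, 2) + 4 * k**2 * q(x), q(x)))
# lambda = 0: the linear branch C1 + C2*x
print(sp.dsolve(q(x).diff(x, 2), q(x)))
```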

Further, we show that all these forms work in the structures of least informative distribution densities.

#### *2.3. Least Informative Distributions*

Neighborhoods of the normal distribution, generally, are not the only models of interest. In real-life applications, information about the distribution central part, tails, moments, and/or subranges is rather often available. The empirical distribution and quantile functions, histograms, and kernel estimates, together with their tolerance limits, provide other examples. To enhance the efficiency of robust estimates, this information can be used in minimax settings.

Further, we deal with symmetric distribution densities: *p*(−*x*) = *p*(*x*). Evidently, distribution densities must also satisfy the non-negativeness and norming conditions common for all classes; for brevity, we do not write out these conditions each time we define a distribution density class.

Now, we list several examples of distribution classes convenient for the description of prior knowledge about data distributions [7].


(2) The class of distributions with bounded variances:

$$\mathcal{P}\_2 = \left\{ p : \sigma^2(p) = \int\_{-\infty}^{\infty} x^2 p(x) \, dx \le \overline{\sigma}^2 \right\}.$$

All distribution densities with bounded variances are the members of this class. Evidently, the Cauchy-type distributions do not belong to it. The least informative density in this class is normal [17]:

$$p\_2^\*(x) = N(x; 0, \overline{\sigma}) = \frac{1}{\sqrt{2\pi}\,\overline{\sigma}} \exp\left(-\frac{x^2}{2\overline{\sigma}^2}\right).$$
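As a quick numerical check, the Fisher information (5) of this density equals 1/σ̄<sup>2</sup>; a sketch with σ̄ = 1.5 taken as an arbitrary example value:

```python
import numpy as np
from scipy.integrate import quad

sigma = 1.5  # an example value of the bound on the standard deviation
p = lambda x: np.exp(-x**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
dp = lambda x: -x / sigma**2 * p(x)  # derivative of the normal density

# I(p) = integral of (p'/p)^2 * p over the real line
I = quad(lambda x: (dp(x) / p(x))**2 * p(x), -np.inf, np.inf)[0]
print(I, 1 / sigma**2)  # both print 0.4444...
```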


(4) The class of finite distributions concentrated on the interval [−*l*, *l*]: the least informative density in this class is [7]

$$p\_4^\*(x) = \begin{cases} \frac{1}{l} \cos^2\left(\frac{\pi x}{2l}\right) & \text{for} \quad |x| \le l, \\ 0 & \text{for} \quad |x| > l. \end{cases}$$

(5) The class of distributions with a bounded interquantile distribution mass:

$$\mathcal{P}\_5 = \left\{ p : \int\_{-l}^{l} p(x) \, dx \ge 1 - \beta \right\}, \quad 0 \le \beta < 1.$$

The parameters *l* and β are assumed given. The restriction upon the interquantile mass means that *P*(|*X*| ≤ *l*) ≥ 1 − β. We can redefine this class in a different way as the class with an upper-bounded interquantile range *IQR*<sub>β</sub> = *P*<sup>−1</sup>((1 + β)/2) − *P*<sup>−1</sup>((1 − β)/2) ≤ 2*l*, where *P*(*x*) is the cumulative distribution function. The least informative density in this class has both the cosine and exponential parts working in the center and tail areas, respectively [7],

$$p\_5^\*(x) = \begin{cases} A\_1 \cos^2(B\_1 x) & \text{for} \quad |x| \le l, \\ A\_2 \exp(-B\_2 |x|) & \text{for} \quad |x| > l, \end{cases}$$

where the constants *A*<sub>1</sub>, *A*<sub>2</sub>, *B*<sub>1</sub>, and *B*<sub>2</sub> are determined from the simultaneous equations (restrictions) of the class P<sub>5</sub>, namely, the conditions of normalization and upon the distribution interquantile range, and the conditions of continuity and continuous differentiability at *x* = *l* (for details, see Reference [7]); a numerical sketch is given below. It is worth noting that the classes P<sub>4</sub> and P<sub>1</sub> are particular cases of the class P<sub>5</sub> when β → 0 and β → 1, respectively.
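The four constants are easily found numerically from these four conditions; a minimal sketch for the example values *l* = 1 and β = 0.1 (the initial guess is our assumption and may need tuning for other parameter values):

```python
import numpy as np
from scipy.optimize import fsolve

l, beta = 1.0, 0.1  # example values of the class parameters

def conditions(v):
    A1, A2, B1, B2 = v
    central = A1 * (l + np.sin(2 * B1 * l) / (2 * B1))   # 2*A1*int_0^l cos^2(B1 x) dx
    tails = 2 * A2 * np.exp(-B2 * l) / B2                # 2*A2*int_l^inf exp(-B2 x) dx
    return [central + tails - 1.0,                       # normalization
            central - (1 - beta),                        # interquantile mass
            A1 * np.cos(B1 * l)**2 - A2 * np.exp(-B2 * l),   # continuity at x = l
            2 * A1 * B1 * np.cos(B1 * l) * np.sin(B1 * l)
            - A2 * B2 * np.exp(-B2 * l)]                 # smoothness at x = l

A1, A2, B1, B2 = fsolve(conditions, x0=[0.5, 0.5, 0.5, 1.0])
print(A1, A2, B1, B2)
```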

#### **3. Hampel's Robust and Shurygin's Stable Estimates of Location**

Robust methods have lower sensitivity to possible departures from the accepted distribution models as compared to conventional statistical methods. To analyze the sensitivity of estimation, it is natural to have a specific indicator of it. In what follows, we introduce such indicators, namely the influence function and related measures.

#### *3.1. Hampel's Local Approach to Robustness*

Let *P* be a distribution function corresponding to *p* ∈ P, the class of distribution densities, and let *T*(*P*) be a functional defined on a subset of all distribution functions. The natural estimate defined by *T* is *Tn* = *T*(*Pn*), i.e., the functional computed at the sample distribution function *Pn*.

The influence function *IF*(*x*; *T*, *p*) of this functional is defined as

$$IF(x; T, p) = \lim\_{t \to 0} \frac{T((1 - t)P + t\Delta\_x) - T(P)}{t},\tag{10}$$

where Δ*x* is the degenerate distribution taking mass 1 at *x* [6].

The influence function measures the impact of an infinitesimal contamination at *x* on the value of an estimate, formally being the Gâteaux derivative of the functional *T*(*P*). For an *M*-estimate with a score function ψ, the influence function is proportional to it: *IF*(*x*; ψ, *p*) = ψ(*x*)/*B*(ψ, *p*), where the term *B*(ψ, *p*) stands in Equation (3) for the asymptotic variance of *M*-estimates.

Based on the influence function, several local measures of robustness are defined [6], including *the gross-error sensitivity* of *T* at *p*:

$$\gamma^\*(T, p) = \sup\_{x} \left| IF(x; T, p) \right|.$$

This indicator of sensitivity gives an upper bound upon the asymptotic estimate bias and measures the worst influence of an infinitesimal contamination on the value of an estimate. Maximizing the efficiency of an *M*-estimate of location under the condition of a bounded gross-error sensitivity at the normal distribution

$$\max\_{\psi} \mathit{eff}(\psi, \varphi) \quad \text{under} \quad \gamma^\*(\psi, \varphi) \le \overline{\gamma},$$

where *eff*(ψ, *p*) = 1/[*AV*(ψ, *p*)*I*(*p*)], formally leads to Huber's minimax linear bounded score ψ*Huber*(*x*) = max{−*k*, min{*x*, *k*}} in the class of contaminated normal distributions [6]. This particular result confirms the following general observation: the best estimates within both approaches, Huber's minimax and Hampel's local, are rather close in their performances.
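For instance, at the standard normal density with *k* = 1.14, both the efficiency and the gross-error sensitivity of Huber's score can be evaluated by direct quadrature; a minimal sketch (using *I*(ϕ) = 1 and ψ′ = 1 on [−*k*, *k*]):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

k = 1.14
psi = lambda x: np.clip(x, -k, k)                  # Huber's score

A = quad(lambda x: psi(x)**2 * norm.pdf(x), -np.inf, np.inf)[0]
B = norm.cdf(k) - norm.cdf(-k)   # integral of psi'*p: psi' = 1 on [-k, k], 0 outside
eff = B**2 / A                   # 1/(AV*I), with I(phi) = 1
gamma = k / B                    # gross-error sensitivity sup|psi|/B
print(eff, gamma)                # approximately 0.92 and 1.53
```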

#### *3.2. Meshalkin-Shurygin's Stable Estimates of Location*

This setting partially reverses the conventional one: the maximum of some measure of sensitivity is minimized under a guaranteed value of the estimate variance or efficiency.

The conventional point-wise local measures of sensitivity, such as the influence and change-of-variance functions [6], are not appropriate here—a global indicator of sensitivity is desirable. We show that a novel global indicator of robustness proposed by Shurygin [18], the estimate *stability*, is closely related to the classical variation of the functional of the estimate asymptotic variance. Although the related theory has been developed for stable estimation of an arbitrary parameter of the underlying distribution, here, we focus on stable estimation of location.

A measure of the *M*-estimate sensitivity called *variance sensitivity* (*VS*) is introduced in Reference [19]. Formally, it is defined as the Lagrange functional derivative of the asymptotic variance (3):

$$VS(\psi, p) = \frac{\partial AV(\psi, p)}{\partial p} = \frac{\partial}{\partial p} \left( \frac{\int \psi^2(x) p(x) \, dx}{\left( \int \psi'(x) p(x) \, dx \right)^2} \right) = \frac{\int \psi^2(x) \, dx}{\left( \int \psi'(x) p(x) \, dx \right)^2}. \tag{11}$$

Equation (11) gives a global measure of the stability of an *M*-estimate in a model where the outliers occur uniformly anywhere on the real line. The boundedness of the Lagrange derivative (11) holds under the condition of square integrability of ψ, with the corresponding *redescending* scores satisfying ψ(*x*) → 0 as |*x*| → ∞.

In Reference [7], it is shown that the principal part of the variation δ*AV*(ψ, *p*) of the asymptotic variance *AV*(ψ, *p*) with respect to δ*p* is proportional to the variance sensitivity (11), i.e., to the Lagrange derivative of the asymptotic variance: δ*AV*(ψ, *p*) ∝ ∂*AV*(ψ, *p*)/∂*p*.

Further, consider the following optimization problem: what is the minimum variance sensitive score function for a given distribution density *p*? The solution of this optimization problem is given by the minimum variance sensitive (*MVS*) score function:

$$\psi\_{MVS}(x) = \arg\min\_{\psi} VS(\psi, p) = -p'(x).\tag{12}$$

The estimate with this optimal score function (12) is called the estimate of *minimum variance sensitivity* with *VS*<sub>min</sub> = *VS*(ψ*MVS*, *p*). We define a global measure of the *stability* of any *M*-estimate by comparing its variance sensitivity with the minimum one:

$$0 \le stb(\psi, p) = VS\_{\min}(p)/VS(\psi, p) \le 1.$$

A number of optimization criteria with different weights for efficiency and stability of *M*-estimates of location have been proposed and solved; for example, efficiency is maximized under guaranteed stability, and vice versa [18]. Practically, in all these cases, we deal with score functions ψ*opt*(*x*) having the following limiting forms: the maximum likelihood case ψ*opt*(*x*) → ψ*ML*(*x*) = −*p*′(*x*)/*p*(*x*) when the requirement of high efficiency mostly matters, and the opposite redescending case ψ*opt*(*x*) → ψ*MVS*(*x*) = −*p*′(*x*) when the requirement of high stability is important.

A compromise case is given by a stable estimate called *radical*, with equal efficiency and stability, *eff*(ψ, *p*) = *stb*(ψ, *p*), desirably both high: its score function is given by

$$
\psi\_{rad}(x) = \psi\_{ML}(x) \sqrt{p(x)} = -p'(x) / \sqrt{p(x)}. \tag{13}
$$
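At the standard normal density, the efficiency and stability of the MVS and radical scores are easily evaluated numerically; the sketch below reproduces *stb*(ψ*MVS*, ϕ) = 1 and the radical balance *eff* = *stb*:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

p = norm.pdf  # standard normal density, I(phi) = 1

def eff_and_stb(psi, dpsi):
    """Efficiency 1/(AV*I) and stability VS_min/VS at p = phi, by quadrature."""
    A = quad(lambda x: psi(x)**2 * p(x), -np.inf, np.inf)[0]
    B = quad(lambda x: dpsi(x) * p(x), -np.inf, np.inf)[0]
    S = quad(lambda x: psi(x)**2, -np.inf, np.inf)[0]               # int psi^2 dx
    vs_min = 1.0 / quad(lambda x: (x * p(x))**2, -np.inf, np.inf)[0]  # 1/int (p')^2 dx
    return B**2 / A, vs_min * B**2 / S

# MVS score psi = -p'(x) = x*phi(x): eff ~ 0.65, stb = 1
print(eff_and_stb(lambda x: x * p(x), lambda x: (1 - x**2) * p(x)))
# radical score psi = -p'/sqrt(p) = x*sqrt(phi(x)): eff = stb ~ 0.84
print(eff_and_stb(lambda x: x * np.sqrt(p(x)),
                  lambda x: (1 - x**2 / 2) * np.sqrt(p(x))))
```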

Finally, it should be noted that the minimum sensitivity and radical estimates belong to the class of *M*-estimates with the exponentially weighted maximum likelihood score functions previously proposed by Meshalkin [20].

#### **4. Asymptotic Efficiency of M-Estimates: A Comparative Study**

Here, we compare the asymptotic efficiency of various robust and stable estimates of location at the distributions conventional in robustness studies; some particular results can be found in References [19,21]. We mainly focus on the minimax variance estimates for distributions with bounded interquantile ranges, since, until present, their performance has not been thoroughly studied. Notably, some of the obtained results are unexpected and surprising.

#### *4.1. Robust and Stable M-Estimates of a Location Parameter*

We test the following *M*-estimates of location (a code sketch of their score functions is given after this list): (i) the sample mean and median; (ii) Huber's minimax variance *M*-estimate with the linear bounded score ψ*Huber*(*x*) = max[−1.14, min(*x*, 1.14)], optimal for ε-contaminated standard normal distributions with the contamination parameter ε = 0.1; (iii) Hampel's *M*-estimate with the redescending three-part score

$$\psi\_{Hampel}(x) = \begin{cases} x & \text{for} \quad 0 \le |x| \le a, \\ a\, \mathrm{sgn}(x) & \text{for} \quad a \le |x| \le b, \\ a\, \frac{r - |x|}{r - b}\, \mathrm{sgn}(x) & \text{for} \quad b \le |x| \le r, \\ 0 & \text{for} \quad r \le |x|, \end{cases}$$

where the parameters *a* = 1.31, *b* = 2.039, and *r* = 4 (see Reference [6], pp. 166–167); (iv) the minimax variance *M*-estimates with the scores ψ<sub>β</sub>(*x*) for various values of the parameter β: 0.01, 0.05, 0.1, 0.5, 0.9, 0.95, 0.99; and (v) Shurygin's stable minimum variance sensitivity and radical *M*-estimates with the scores ψ*MVS*(*x*) = −*p*′(*x*) and ψ*rad*(*x*) = −*p*′(*x*)/√*p*(*x*), respectively.
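For reference, the listed score functions admit a direct implementation; a minimal sketch (the density *p* and/or its derivative *p*′ must be supplied for the stable scores):

```python
import numpy as np

def psi_huber(x, k=1.14):
    """Huber's linear bounded score."""
    return np.clip(x, -k, k)

def psi_hampel(x, a=1.31, b=2.039, r=4.0):
    """Hampel's redescending three-part score."""
    ax = np.abs(x)
    return np.sign(x) * np.where(ax <= a, ax,
                        np.where(ax <= b, a,
                        np.where(ax <= r, a * (r - ax) / (r - b), 0.0)))

def psi_mvs(x, dp):
    """Shurygin's minimum variance sensitivity score psi = -p'."""
    return -dp(x)

def psi_radical(x, p, dp):
    """Shurygin's radical score psi = -p'/sqrt(p)."""
    return -dp(x) / np.sqrt(p(x))
```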

#### *4.2. Data Distributions*


(iv) the heavy-tailed Tukey gross-error model as the Cauchy contaminated normal density *p*(*x*) = 0.9 *N*(*x*; 0, 1) + 0.1 *C*(*x*; 0, 1).
