## **2. Minimum Empirical Divergence Estimators**

### *2.1. Statistical Divergences*

Let *ϕ* be a convex function defined on R and [0, ∞]-valued, such that *ϕ*(1) = 0, and let *P* ∈ *M*<sup>1</sup> be some probability measure. For any signed finite measure *Q* defined on the same measurable space (R<sup>*m*</sup>, B(R<sup>*m*</sup>)), absolutely continuous (a.c.) with respect to *P*, the *ϕ* divergence between *Q* and *P* is defined by

$$D\_{\varphi}(Q, P) := \int \varphi\left(\frac{dQ}{dP}(\mathbf{x})\right)dP(\mathbf{x}).\tag{5}$$

When *Q* is not a.c. with respect to *P*, we set *D<sub>ϕ</sub>*(*Q*, *P*) = ∞. This extension gives a single definition of divergence that covers both cases of interest—that of continuous probability laws and that of discrete probability laws.

This definition extends the one of divergences between probability measures [23], and the necessity of working with signed finite measures will be explained in Section 2.2.

Largely used in information theory, the Kullback–Leibler divergence is associated with the real convex function *ϕ*(*x*) := *x* log *x* − *x* + 1 and is defined by

$$KL(Q,P) := \int \log\left(\frac{dQ}{dP}\right) dQ.$$

The modified Kullback–Leibler divergence is associated with the convex function *ϕ*(*x*) := − log *x* + *x* − 1 and is defined through

$$KL\_m(Q, P) := \int -\log\left(\frac{dQ}{dP}\right)dP.$$

Other divergences, largely used in inferential statistics, are the *χ*<sup>2</sup> and the modified *χ*<sup>2</sup> divergences, namely

$$
\chi^2(Q, P) := \frac{1}{2} \int \left(\frac{dQ}{dP} - 1\right)^2 dP,
$$

$$
\chi^2\_m(Q, P) := \frac{1}{2} \int \frac{\left(\frac{dQ}{dP} - 1\right)^2}{\frac{dQ}{dP}} dP,
$$

these being associated with the convex functions *ϕ*(*x*) := (*x* − 1)<sup>2</sup>/2 and *ϕ*(*x*) := (*x* − 1)<sup>2</sup>/(2*x*), respectively. The Hellinger distance and the *L*<sup>1</sup> distance are also *ϕ* divergences. They are associated with the convex functions *ϕ*(*x*) := 2(√*x* − 1)<sup>2</sup> and *ϕ*(*x*) := |*x* − 1|, respectively.
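For discrete laws, each of these divergences reduces to a finite sum, *D<sub>ϕ</sub>*(*Q*, *P*) = ∑<sub>*i*</sub> *p<sub>i</sub>* *ϕ*(*q<sub>i</sub>*/*p<sub>i</sub>*). The following plain-Python sketch (helper names are ours, purely illustrative) evaluates the examples above on two small probability vectors:

```python
import math

# phi functions for the classical examples (defined for x > 0)
PHIS = {
    "KL":        lambda x: x * math.log(x) - x + 1,
    "KL_m":      lambda x: -math.log(x) + x - 1,
    "chi2":      lambda x: 0.5 * (x - 1) ** 2,
    "chi2_m":    lambda x: 0.5 * (x - 1) ** 2 / x,
    "Hellinger": lambda x: 2 * (math.sqrt(x) - 1) ** 2,
    "L1":        lambda x: abs(x - 1),
}

def phi_divergence(q, p, phi):
    """D_phi(Q, P) = sum_i p_i * phi(q_i / p_i) for discrete Q a.c. w.r.t. P."""
    return sum(pi * phi(qi / pi) for qi, pi in zip(q, p))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

for name, phi in PHIS.items():
    print(name, phi_divergence(q, p, phi))

# Sanity check: KL computed via phi agrees with the familiar sum q_i log(q_i/p_i)
kl_direct = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))
assert abs(phi_divergence(q, p, PHIS["KL"]) - kl_direct) < 1e-12
```

The final check uses ∑<sub>*i*</sub> *q<sub>i</sub>* = ∑<sub>*i*</sub> *p<sub>i</sub>* = 1, which makes the terms −*x* + 1 in *ϕ*<sub>1</sub> cancel in the sum.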

All the preceding examples, except the *L*<sup>1</sup> distance, belong to the class of power divergences introduced by Cressie and Read [24] and defined by the convex functions:

$$x \in \mathbb{R}\_+^\* \mapsto \varphi\_\gamma(x) := \frac{x^\gamma - \gamma x + \gamma - 1}{\gamma(\gamma - 1)},\tag{6}$$

for *γ* ∈ R \ {0, 1}, and *ϕ*<sub>0</sub>(*x*) := − log *x* + *x* − 1, *ϕ*<sub>1</sub>(*x*) := *x* log *x* − *x* + 1. The Kullback–Leibler divergence is associated with *ϕ*<sub>1</sub>, the modified Kullback–Leibler with *ϕ*<sub>0</sub>, the *χ*<sup>2</sup> divergence with *ϕ*<sub>2</sub>, the modified *χ*<sup>2</sup> divergence with *ϕ*<sub>−1</sub>, and the Hellinger distance with *ϕ*<sub>1/2</sub>. When *ϕ<sub>γ</sub>* is not defined on (−∞, 0) or when *ϕ<sub>γ</sub>* is not convex, the definition of the corresponding power divergence function *Q* ∈ *M*<sup>1</sup> ↦ *D<sub>ϕγ</sub>*(*Q*, *P*) can be extended to the whole set of signed finite measures by taking the following extension of *ϕ<sub>γ</sub>*:

$$\overline{\varphi}\_{\gamma}\colon\ x \in \mathbb{R} \mapsto \varphi\_{\gamma}(x)\mathbf{1}\_{[0,\infty)}(x) + (+\infty)\mathbf{1}\_{(-\infty,0)}(x).$$
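As a quick numerical sanity check (a plain-Python sketch, our own), the power-divergence functions (6) reproduce the convex functions listed above for *γ* = 2, −1, 1/2, and approach *ϕ*<sub>0</sub> and *ϕ*<sub>1</sub> in the limits *γ* → 0 and *γ* → 1:

```python
import math

def phi_gamma(x, g):
    """Cressie-Read convex function (6), for gamma not in {0, 1} and x > 0."""
    return (x ** g - g * x + g - 1) / (g * (g - 1))

x = 1.7
# gamma = 2   -> chi^2:           (x - 1)^2 / 2
assert abs(phi_gamma(x, 2) - 0.5 * (x - 1) ** 2) < 1e-12
# gamma = -1  -> modified chi^2:  (x - 1)^2 / (2x)
assert abs(phi_gamma(x, -1) - 0.5 * (x - 1) ** 2 / x) < 1e-12
# gamma = 1/2 -> Hellinger:       2(sqrt(x) - 1)^2
assert abs(phi_gamma(x, 0.5) - 2 * (math.sqrt(x) - 1) ** 2) < 1e-12
# gamma -> 0 and gamma -> 1 recover phi_0 and phi_1 as limits
assert abs(phi_gamma(x, 1e-8) - (-math.log(x) + x - 1)) < 1e-6
assert abs(phi_gamma(x, 1 + 1e-8) - (x * math.log(x) - x + 1)) < 1e-6
```

The limits follow from the expansion *x*<sup>*γ*</sup> = 1 + *γ* log *x* + O(*γ*<sup>2</sup>) in the numerator of (6).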

The *ϕ* divergence between some set Ω of signed finite measures and a probability measure *P* is defined by

$$D\_{\varphi}(\Omega, P) = \inf\_{Q \in \Omega} D\_{\varphi}(Q, P). \tag{7}$$

Assuming that *D<sub>ϕ</sub>*(Ω, *P*) is finite, a measure *Q*<sup>∗</sup> ∈ Ω is called a *ϕ*-projection of *P* on Ω if

$$D\_{\boldsymbol{\varphi}}(Q^\*, P) \le D\_{\boldsymbol{\varphi}}(Q, P), \text{ for all } Q \in \Omega.$$

### *2.2. Minimum Empirical Divergence Estimators*

Let *X*<sub>1</sub>, ... , *X<sub>n</sub>* be an i.i.d. sample of the random vector *X* with the probability distribution *P*<sub>0</sub>. The "plug-in" estimator of the *ϕ* divergence *D<sub>ϕ</sub>*(M<sup>1</sup><sub>*θ*</sub>, *P*<sub>0</sub>) between the set M<sup>1</sup><sub>*θ*</sub> and the probability measure *P*<sub>0</sub> is defined by replacing *P*<sub>0</sub> with the empirical measure associated with the sample. More precisely,

$$\widehat{D}\_{\varphi}(\mathcal{M}\_{\theta}^1, P\_0) = \inf\_{Q \in \mathcal{M}\_{\theta}^1} D\_{\varphi}(Q, P\_n) = \inf\_{Q \in \mathcal{M}\_{\theta}^1} \int \varphi\left(\frac{dQ}{dP\_n}(\mathbf{x})\right) dP\_n(\mathbf{x}), \tag{8}$$

where *P<sub>n</sub>* := (1/*n*) ∑<sup>*n*</sup><sub>*i*=1</sub> *δ*<sub>*X<sub>i</sub>*</sub> is the empirical measure associated with the sample, *δ<sub>x</sub>* being the Dirac measure putting all mass at *x*. If the projection of the measure *P<sub>n</sub>* on M<sup>1</sup><sub>*θ*</sub> exists, it is a law a.c. with respect to *P<sub>n</sub>*. Then, it is natural to consider

$$\mathcal{M}\_{\theta}^{(n)} = \left\{ Q \in M^1 : Q \text{ a.c. with respect to } P\_n \text{ and } \sum\_{i=1}^n g(X\_i, \theta)\, Q(X\_i) = 0 \right\},\tag{9}$$

and then, the plug-in estimator (8) can be written as

$$\widehat{D}\_{\varphi}(\mathcal{M}\_{\theta}^{1}, P\_{0}) = \inf\_{Q \in \mathcal{M}\_{\theta}^{(n)}} \frac{1}{n} \sum\_{i=1}^{n} \varphi(n\, Q(X\_{i})).\tag{10}$$
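To make (10) concrete, consider the *χ*<sup>2</sup> divergence with the single illustrative constraint *g*(*x*, *θ*) = *x* − *θ* (our choice, not an example from the paper). When the infimum is attained at an interior point, the Lagrange conditions give *ϕ*′(*nQ*(*X<sub>i</sub>*)) = *λ*<sub>0</sub> + *λ*<sub>1</sub>(*X<sub>i</sub>* − *θ*), which for this *ϕ* can be solved in closed form. A plain-Python sketch:

```python
def chi2_plugin(xs, theta):
    """Infimum in (10) for phi(x) = (x - 1)^2 / 2 and g(x, theta) = x - theta,
    assuming the solution is an interior point (all weights positive)."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / n       # empirical variance
    lam = -(xbar - theta) / s2                      # multiplier of the moment constraint
    q = [(1 + lam * (x - xbar)) / n for x in xs]    # optimal weights Q(X_i)
    div = sum(0.5 * (n * qi - 1) ** 2 for qi in q) / n
    return q, div

xs, theta = [0.1, 0.9, 1.4, 2.0, 2.6], 1.0
q, div = chi2_plugin(xs, theta)
xbar = sum(xs) / len(xs)
s2 = sum((x - xbar) ** 2 for x in xs) / len(xs)

assert all(qi > 0 for qi in q)                                      # interior point
assert abs(sum(q) - 1) < 1e-12                                      # total mass 1
assert abs(sum(qi * (x - theta) for qi, x in zip(q, xs))) < 1e-12   # moment constraint
assert abs(div - (xbar - theta) ** 2 / (2 * s2)) < 1e-12            # closed-form value
```

For this constraint the minimized *χ*<sup>2</sup> value equals (x̄ − *θ*)<sup>2</sup>/(2*s*<sup>2</sup><sub>*n*</sub>), a familiar quadratic distance between the sample mean and *θ*.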

The infimum in the above expression (10) may be achieved at a point situated on the frontier of the set M<sup>(*n*)</sup><sub>*θ*</sub>, a case in which the Lagrange method for characterizing the infimum and computing *D̂<sub>ϕ</sub>*(M<sup>1</sup><sub>*θ*</sub>, *P*<sub>0</sub>) cannot be applied. In order to avoid this difficulty, Broniatowski and Keziou [12,25] proposed to work on sets of signed finite measures and defined

$$\mathcal{M}\_{\theta} := \left\{ Q \in M : \int dQ = 1 \text{ and } \int g(\mathbf{x}, \theta)\, dQ(\mathbf{x}) = 0 \right\}, \tag{11}$$

where *M* denotes the set of all signed finite measures on the measurable space (R<sup>*m*</sup>, B(R<sup>*m*</sup>)), and

$$\mathcal{M} := \bigcup\_{\theta \in \Theta} \mathcal{M}\_{\theta}. \tag{12}$$

They showed that, if *Q*<sup>∗</sup><sub>1</sub>, the projection of *P<sub>n</sub>* on M<sup>1</sup><sub>*θ*</sub>, is an interior point of M<sup>1</sup><sub>*θ*</sub>, and *Q*<sup>∗</sup>, the projection of *P<sub>n</sub>* on M<sub>*θ*</sub>, is an interior point of M<sub>*θ*</sub>, then the two approaches for defining minimum divergence estimators, based on signed finite measures and on probability measures respectively, coincide. On the other hand, when *Q*<sup>∗</sup><sub>1</sub> is a frontier point of M<sup>1</sup><sub>*θ*</sub>, the estimator of the parameter *θ*<sub>0</sub> defined in the framework of signed finite measures still converges to *θ*<sub>0</sub>. These facts justify the substitution of M<sup>1</sup><sub>*θ*</sub> by M<sub>*θ*</sub>.

In the following, we briefly recall the definitions of the estimators for moment condition models proposed in [12] in the context of signed finite measure sets.

Denote by *ḡ* the function defined on R<sup>*m*</sup> × Θ and R<sup>*l*+1</sup>-valued:

$$\overline{g}(\mathbf{x},\theta) := (\mathbf{1}\_{\mathcal{X}}(\mathbf{x}), g\_1(\mathbf{x},\theta), \dots, g\_l(\mathbf{x},\theta))^{\top}. \tag{13}$$

Given a *ϕ* divergence, when the function *ϕ* is strictly convex on its domain, denote

$$\varphi^\*(u) := u\, {\varphi'}^{-1}(u) - \varphi\left({\varphi'}^{-1}(u)\right),$$

the convex conjugate of the function *ϕ*. For a given probability measure *P* ∈ *M*<sup>1</sup> and a fixed *θ* ∈ Θ, define

$$\Lambda\_{\theta}(P) := \left\{ t \in \mathbb{R}^{l+1} : \int \left|\varphi^\*\left(t\_0 + \sum\_{j=1}^l t\_j g\_j(\mathbf{x}, \theta)\right)\right| dP(\mathbf{x}) < \infty \right\}. \tag{14}$$

We also use the notations Λ<sub>*θ*</sub> for Λ<sub>*θ*</sub>(*P*<sub>0</sub>) and Λ<sup>(*n*)</sup><sub>*θ*</sub> for Λ<sub>*θ*</sub>(*P<sub>n</sub>*).
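As worked instances of the conjugate formula (these are standard computations, added for illustration): for *ϕ*<sub>1</sub>(*x*) = *x* log *x* − *x* + 1 (Kullback–Leibler), *ϕ*′(*x*) = log *x* gives *ϕ*′<sup>−1</sup>(*u*) = *e*<sup>*u*</sup>, while for *ϕ*<sub>2</sub>(*x*) = (*x* − 1)<sup>2</sup>/2 (*χ*<sup>2</sup>), *ϕ*′<sup>−1</sup>(*u*) = *u* + 1, so that

$$\varphi\_1^\*(u) = u e^{u} - \varphi\_1(e^{u}) = u e^{u} - \left(u e^{u} - e^{u} + 1\right) = e^{u} - 1, \qquad \varphi\_2^\*(u) = u(u+1) - \tfrac{1}{2}u^2 = u + \tfrac{u^2}{2}.$$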

Supposing that *P*<sub>0</sub> admits a projection *Q*<sup>∗</sup><sub>*θ*</sub> on M<sub>*θ*</sub> with the same support as *P*<sub>0</sub>, and that the function *ϕ* is strictly convex on its domain, the *ϕ* divergence *D<sub>ϕ</sub>*(M<sub>*θ*</sub>, *P*<sub>0</sub>) admits the dual representation:

$$D\_{\varphi}(\mathcal{M}\_{\theta}, P\_0) = \sup\_{t \in \Lambda\_{\theta}} \int m(\mathbf{x}, \theta, t)\, dP\_0(\mathbf{x}), \tag{15}$$

where *m*(*x*, *θ*, *t*) := *t*<sub>0</sub> − *ϕ*<sup>∗</sup>(*t*<sup>⊤</sup>*ḡ*(*x*, *θ*)).

The supremum in (15) is unique and is reached at a point that we denote by *t<sub>θ</sub>* = *t<sub>θ</sub>*(*P*<sub>0</sub>):

$$t\_{\theta} := \underset{t \in \Lambda\_{\theta}}{\text{arg}\, \text{sup}} \int m(\mathbf{x}, \theta, t)dP\_{0}(\mathbf{x}). \tag{16}$$

Then, *D<sub>ϕ</sub>*(M<sub>*θ*</sub>, *P*<sub>0</sub>), *t<sub>θ</sub>*, *D<sub>ϕ</sub>*(M, *P*<sub>0</sub>), and *θ*<sub>0</sub> can be estimated, respectively, by

$$\widehat{D}\_{\varphi}(\mathcal{M}\_{\theta}, P\_0) := \sup\_{t \in \Lambda\_{\theta}^{(n)}} \int m(\mathbf{x}, \theta, t)\, dP\_n(\mathbf{x}), \tag{17}$$

$$\widehat{t}\_{\theta} := \arg\sup\_{t \in \Lambda\_{\theta}^{(n)}} \int m(\mathbf{x}, \theta, t)\, dP\_n(\mathbf{x}), \tag{18}$$

$$\widehat{D}\_{\varphi}(\mathcal{M}, P\_0) := \inf\_{\theta \in \Theta} \sup\_{t \in \Lambda\_{\theta}^{(n)}} \int m(\mathbf{x}, \theta, t)\, dP\_n(\mathbf{x}), \tag{19}$$

$$\widehat{\theta}\_{\varphi} := \arg\inf\_{\theta \in \Theta} \sup\_{t \in \Lambda\_{\theta}^{(n)}} \int m(\mathbf{x}, \theta, t)\, dP\_n(\mathbf{x}). \tag{20}$$

The estimators defined in (20) are called minimum empirical divergence estimators. We refer to [12] for the complete study of the existence and of the asymptotic properties of the above estimators.
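To illustrate how (17)–(20) are computed in practice, here is a minimal numerical sketch (plain Python, all names ours) for the *χ*<sup>2</sup> divergence, whose conjugate is *ϕ*<sup>∗</sup>(*u*) = *u* + *u*<sup>2</sup>/2, with the illustrative moment function *g*(*x*, *θ*) = *x* − *θ* (so *l* = 1). The inner supremum over *t* = (*t*<sub>0</sub>, *t*<sub>1</sub>) is found by gradient ascent and the outer infimum by a grid search over *θ*:

```python
def m(x, theta, t0, t1):
    """m(x, theta, t) = t0 - phi*(t0 + t1 g(x, theta)) for the chi^2 case."""
    u = t0 + t1 * (x - theta)
    return t0 - (u + 0.5 * u * u)

def sup_t(xs, theta, steps=2000, lr=0.1):
    """Inner sup in (17)-(18): gradient ascent on the empirical mean of m over t."""
    n = len(xs)
    t0 = t1 = 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for x in xs:
            u = t0 + t1 * (x - theta)
            g0 += -u                       # d/dt0 [t0 - phi*(u)] = 1 - (1 + u)
            g1 += -(1 + u) * (x - theta)   # d/dt1 = -(1 + u) g(x, theta)
        t0 += lr * g0 / n
        t1 += lr * g1 / n
    return (t0, t1), sum(m(x, theta, t0, t1) for x in xs) / n

def theta_hat(xs, grid):
    """Outer inf in (19)-(20): minimize the inner sup over a grid of theta values."""
    return min(grid, key=lambda th: sup_t(xs, th)[1])

xs = [0.1, 0.9, 1.4, 2.0, 2.6]
xbar = sum(xs) / len(xs)
s2 = sum((x - xbar) ** 2 for x in xs) / len(xs)

# Dual value (17) at theta = 1 agrees with the primal chi^2 value (xbar-theta)^2/(2 s2)
_, val = sup_t(xs, 1.0)
assert abs(val - (xbar - 1.0) ** 2 / (2 * s2)) < 1e-8

# The estimator (20) recovers the sample mean, up to the grid resolution
grid = [0.8 + 0.01 * k for k in range(121)]
assert abs(theta_hat(xs, grid) - xbar) <= 0.005 + 1e-9
```

For this *g*, the dual value has the closed form (x̄ − *θ*)<sup>2</sup>/(2*s*<sup>2</sup><sub>*n*</sub>), so the minimizer is the sample mean; the sketch just confirms the duality numerically.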

The influence functions of these estimators and the corresponding robustness properties were studied in [17]. According to those results, for *θ* ∈ Θ fixed, the influence function of the estimator *t̂<sub>θ</sub>* is given by

$$\text{IF}(\mathbf{x}; t\_{\theta}, P\_0) = -\left[ \int \frac{\partial^2}{\partial^2 t} m(\mathbf{y}, \theta, t\_{\theta}(P\_0)) dP\_0(\mathbf{y}) \right]^{-1} \frac{\partial}{\partial t} m(\mathbf{x}, \theta, t\_{\theta}(P\_0)), \tag{21}$$

where

$$\frac{\partial}{\partial t}m(\mathbf{x},\theta,t) = (1,\mathbf{0}\_{l})^{\top} - {\varphi'}^{-1}(t^{\top}\overline{g}(\mathbf{x},\theta))\,\overline{g}(\mathbf{x},\theta),\tag{22}$$

$$\frac{\partial^2}{\partial^2 t} m(\mathbf{x}, \theta, t) = -\frac{1}{\varphi''\left({\varphi'}^{-1}(t^{\top}\overline{g}(\mathbf{x},\theta))\right)}\, \overline{g}(\mathbf{x},\theta)\, \overline{g}(\mathbf{x},\theta)^{\top},\tag{23}$$

with the particular case *θ* = *θ*0:

$$\text{IF}(\mathbf{x}; t\_{\theta\_0}, P\_0) = -\varphi''(1) \left[ \int \overline{g}(\mathbf{y}, \theta\_0)\, \overline{g}(\mathbf{y}, \theta\_0)^{\top} dP\_0(\mathbf{y}) \right]^{-1} (0, g(\mathbf{x}, \theta\_0)^{\top})^{\top}. \tag{24}$$
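The derivative formulas (22) and (23) are easy to verify numerically. A plain-Python sketch for the *χ*<sup>2</sup> case, where *ϕ*′<sup>−1</sup>(*u*) = *u* + 1 and *ϕ*″ ≡ 1, with a scalar constraint so that *ḡ*(*x*, *θ*) = (1, *g*(*x*, *θ*))<sup>⊤</sup> (the finite-difference step *h* and all names are our choices):

```python
def m(t0, t1, gbar0, gbar1):
    """m = t0 - phi*(t^T gbar) for the chi^2 divergence, gbar = (1, g(x, theta))."""
    u = t0 * gbar0 + t1 * gbar1
    return t0 - (u + 0.5 * u * u)

def grad_m(t0, t1, gbar0, gbar1):
    """Formula (22): (1, 0)^T - phi'^{-1}(t^T gbar) gbar, with phi'^{-1}(u) = u + 1."""
    u = t0 * gbar0 + t1 * gbar1
    return (1 - (u + 1) * gbar0, -(u + 1) * gbar1)

t0, t1 = 0.3, -0.2
gb = (1.0, 0.7)            # gbar(x, theta) evaluated at some fixed x
h = 1e-6

# Check (22) against central finite differences of m
fd0 = (m(t0 + h, t1, *gb) - m(t0 - h, t1, *gb)) / (2 * h)
fd1 = (m(t0, t1 + h, *gb) - m(t0, t1 - h, *gb)) / (2 * h)
g0, g1 = grad_m(t0, t1, *gb)
assert abs(fd0 - g0) < 1e-6 and abs(fd1 - g1) < 1e-6

# For chi^2, (23) reduces to -gbar gbar^T (phi'' = 1); check one off-diagonal entry
fd01 = (grad_m(t0, t1 + h, *gb)[0] - grad_m(t0, t1 - h, *gb)[0]) / (2 * h)
assert abs(fd01 - (-gb[0] * gb[1])) < 1e-6
```

At *t* = 0 these formulas reduce to the quantities appearing in (24), since *ϕ*′<sup>−1</sup>(0) = 1 makes (22) equal to −(0, *g*(*x*, *θ*))<sup>⊤</sup>.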

On the other hand, the influence function of the estimator *θ̂<sub>ϕ</sub>* is given by

$$\begin{split} \text{IF}(\mathbf{x}; T\_{\varphi}, P\_{0}) &= -\left\{ \left[ \int \frac{\partial}{\partial \theta} g(\mathbf{y}, \theta\_0)\, dP\_{0}(\mathbf{y}) \right]^{\top} \left[ \int g(\mathbf{y}, \theta\_0) g(\mathbf{y}, \theta\_0)^{\top}\, dP\_{0}(\mathbf{y}) \right]^{-1} \int \frac{\partial}{\partial \theta} g(\mathbf{y}, \theta\_0)\, dP\_{0}(\mathbf{y}) \right\}^{-1} \\ & \quad \cdot \left[ \int \frac{\partial}{\partial \theta} g(\mathbf{y}, \theta\_0)\, dP\_{0}(\mathbf{y}) \right]^{\top} \left[ \int g(\mathbf{y}, \theta\_0) g(\mathbf{y}, \theta\_0)^{\top}\, dP\_{0}(\mathbf{y}) \right]^{-1} g(\mathbf{x}, \theta\_0). \end{split} \tag{25}$$

Since the function *x* ↦ *g*(*x*, *θ*) is usually not bounded, for example, when we have linear constraints, the influence function IF(*x*; *T<sub>ϕ</sub>*, *P*<sub>0</sub>) is not bounded; therefore, the minimum empirical divergence estimators *θ̂<sub>ϕ</sub>* defined in (20) are generally not robust.

A direct calculation shows that there is a connection between the influence functions IF(*x*; *t<sub>θ0</sub>*, *P*<sub>0</sub>) and IF(*x*; *T<sub>ϕ</sub>*, *P*<sub>0</sub>), namely the relation

$$\left[\int \frac{\partial}{\partial \theta} \overline{g}(\mathbf{y}, \theta\_0)\, dP\_0(\mathbf{y}) \right]^{\top} \cdot \left\{ \frac{\partial}{\partial \theta} t(\theta\_0, P\_0)\, \text{IF}(\mathbf{x}; T\_{\varphi}, P\_0) + \text{IF}(\mathbf{x}; t\_{\theta\_0}, P\_0) \right\} = 0.$$

Since IF(*x*; *T<sub>ϕ</sub>*, *P*<sub>0</sub>) is linearly related to IF(*x*; *t<sub>θ0</sub>*, *P*<sub>0</sub>), using a robust estimator of *t<sub>θ</sub>* = *t<sub>θ</sub>*(*P*<sub>0</sub>) in the original duality Formula (15) leads to a new robust estimator of *θ*<sub>0</sub>. This is the idea at the basis of our proposal in this paper for constructing new robust estimators for moment condition models.

## **3. Robust Estimators for Moment Condition Models**
