#### 3.2.1. Limit Properties of the Estimator $Q_{\widehat{\theta}_n}$

Under the hypothesis that the true model $Q$ belongs to the family of models $(P_\theta)$, hence $Q = P_{\theta_0}$, we prove the consistency and the asymptotic normality of the estimator $Q_{\widehat{\theta}_n}$.

Note that

$$Q_{\widehat{\theta}_n} = \frac{1}{\gamma+1} \ln \left( \int p_{\widehat{\theta}_n}^{\gamma+1} \,\mathrm{d}\lambda \right) - \frac{1}{\gamma} \ln \left( \frac{1}{n} \sum_{i=1}^{n} p_{\widehat{\theta}_n}^{\gamma} (X_i) \right) \tag{19}$$

$$=-\ln\left[\frac{\frac{1}{n}\sum_{i=1}^{n}p_{\widehat{\theta}_n}^{\gamma}(X_i)}{\left(\int p_{\widehat{\theta}_n}^{\gamma+1}\,\mathrm{d}\lambda\right)^{\frac{\gamma}{\gamma+1}}}\right]^{\frac{1}{\gamma}}=-\ln[\widehat{\mathcal{R}}_{\gamma}(\theta_0)]^{\frac{1}{\gamma}},\tag{20}$$

where $\int p_{\widehat{\theta}_n}^{\gamma+1}\,\mathrm{d}\lambda = \left[\int p_{\theta}^{\gamma+1}\,\mathrm{d}\lambda\right]_{\theta=\widehat{\theta}_n}$ and $\widehat{\mathcal{R}}_\gamma(\theta_0)$ is given by (7).

First we prove that $\widehat{\mathcal{R}}_\gamma(\theta_0)$ is a consistent estimator of $R_\gamma(\theta_0)$. Indeed, using Theorem 1 and the fact that $\int \frac{\partial}{\partial\theta} h(x, \theta_0)\,\mathrm{d}P_{\theta_0}(x) = 0$, a Taylor expansion of $\frac{1}{n}\sum_{i=1}^{n} h(X_i, \theta)$ in $\widehat{\theta}_n$ around $\theta_0$ gives

$$\widehat{\mathcal{R}}\_{\gamma}(\theta\_0) = \frac{1}{n} \sum\_{i=1}^{n} h(X\_i, \theta\_0) + o\_P(n^{-1/2}).\tag{21}$$

Using the weak law of large numbers,

$$\frac{1}{n}\sum\_{i=1}^{n}h(X\_i,\theta\_0) = R\_\gamma(\theta\_0) + o\_P(1). \tag{22}$$

Combining (21) and (22), we obtain that $\widehat{\mathcal{R}}_\gamma(\theta_0)$ converges to $R_\gamma(\theta_0)$ in probability.

Then, using the continuous mapping theorem, since $g(t) = -\ln t^{\frac{1}{\gamma}}$ is a continuous function, we get

$$Q_{\widehat{\theta}_n} = -\ln[\widehat{\mathcal{R}}_\gamma(\theta_0)]^{\frac{1}{\gamma}} \to -\ln[R_\gamma(\theta_0)]^{\frac{1}{\gamma}} = \mathcal{W}_{\theta_0}$$

in probability.
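The consistency step above can be illustrated with a small Monte Carlo sketch for the standard normal model. Here we take $h(x,\theta_0) = p_{\theta_0}^{\gamma}(x)/(\int p_{\theta_0}^{\gamma+1}\mathrm{d}\lambda)^{\gamma/(\gamma+1)}$, consistent with (20) and with the closed form (34), so that $R_\gamma(\theta_0) = (\int p_{\theta_0}^{\gamma+1}\mathrm{d}\lambda)^{1/(\gamma+1)}$; all function names and settings below are ours, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.3
m, sigma = 0.0, 1.0  # true parameters θ0 of the N(m, σ) model

def int_p_pow(a, sigma=1.0):
    """Closed form of ∫ p^a dλ for a N(m, σ) density: (σ√(2π))^(1-a) / √a."""
    return (sigma * np.sqrt(2 * np.pi)) ** (1 - a) / np.sqrt(a)

# normalizing constant (∫ p^{γ+1} dλ)^{γ/(γ+1)} appearing in h
C = int_p_pow(gamma + 1, sigma) ** (gamma / (gamma + 1))

def h(x):
    """h(x, θ0) = p^γ(x) / (∫ p^{γ+1} dλ)^{γ/(γ+1)}, cf. (20) and (34)."""
    p = np.exp(-((x - m) / sigma) ** 2 / 2) / (sigma * np.sqrt(2 * np.pi))
    return p ** gamma / C

# population value R_γ(θ0) = (∫ p^{γ+1} dλ)^{1/(γ+1)}
R_true = int_p_pow(gamma + 1, sigma) ** (1 / (gamma + 1))

for n in (100, 10_000):
    X = rng.normal(m, sigma, n)
    print(n, h(X).mean() - R_true)  # Monte Carlo error of R̂_γ(θ0)
```

The empirical mean of $h(X_i,\theta_0)$ approaches the closed-form value, in line with (21)–(22).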

On the other hand, using the asymptotic normality of the estimator $\widehat{\mathcal{R}}_\gamma(\theta_0)$ (according to Theorem 1 (c)) together with the univariate delta method, we obtain the asymptotic normality of $Q_{\widehat{\theta}_n}$. The proposition below summarizes the above asymptotic results.

**Proposition 4.** *Under (C1)–(C3), when $Q = P_{\theta_0}$, it holds that*

*(a)* $Q_{\widehat{\theta}_n}$ *converges to* $\mathcal{W}_{\theta_0}$ *in probability.*

*(b)* $\sqrt{n}\,(Q_{\widehat{\theta}_n} - \mathcal{W}_{\theta_0})$ *converges in distribution to a centered univariate normal random variable with variance* $\frac{\sigma^2(\theta_0)}{\gamma^2 R_\gamma(\theta_0)^2}$, $\sigma^2(\theta_0)$ *being defined in Theorem 1.*

#### 3.2.2. Robustness Properties of the Estimator $Q_{\widehat{\theta}_n}$

The influence function is a useful tool for describing the robustness of an estimator. Recall that a map $T$, defined on a set of probability measures and taking values in the parameter space, is a statistical functional corresponding to an estimator $\widehat{\theta}_n$ of the parameter $\theta$ whenever $\widehat{\theta}_n = T(P_n)$, where $P_n$ is the empirical measure associated with the sample. The influence function of $T$ at $P_\theta$ is defined by

$$\mathrm{IF}(x; T, P_\theta) := \left.\frac{\partial T(\widetilde{P}_{\varepsilon x})}{\partial \varepsilon}\right|_{\varepsilon=0},$$

where $\widetilde{P}_{\varepsilon x} := (1-\varepsilon)P_\theta + \varepsilon\delta_x$, $\varepsilon > 0$, $\delta_x$ being the Dirac measure putting all its mass at $x$. The gross error sensitivity of the estimator is defined by

$$\gamma^*(T, P_\theta) = \sup_x \left\| \mathrm{IF}(x; T, P_\theta) \right\|.$$

Whenever the influence function is bounded with respect to *x*, the corresponding estimator is called B-robust (see [19]).

In what follows, for a given $\gamma > 0$, we derive the influence function of the estimator $Q_{\widehat{\theta}_n}$. The statistical functional associated with this estimator, which we denote by $U$, is defined by

$$U(P) := \frac{1}{\gamma+1} \ln \left( \int p_{T(P)}^{\gamma+1} \,\mathrm{d}\lambda \right) - \frac{1}{\gamma} \ln \left( \int p_{T(P)}^{\gamma} \,\mathrm{d}P \right),$$

where $T$ is the statistical functional corresponding to the minimum pseudodistance estimator $\widehat{\theta}_n$ used, namely

$$T(P) := \arg\sup_{\theta}\, C_\gamma(\theta)^{-1} \int p_\theta^{\gamma} \,\mathrm{d}P,$$

where $C_\gamma(\theta) = \left(\int p_\theta^{\gamma+1}\,\mathrm{d}\lambda\right)^{\gamma/(\gamma+1)}$.

Due to the Fisher consistency of the functional $T$, according to (6), we have $T(P_{\theta_0}) = \theta_0$, which implies that $U(P_{\theta_0}) = \mathcal{W}_{\theta_0}$.

**Proposition 5.** *When $Q = P_{\theta_0}$, the influence function of $Q_{\widehat{\theta}_n}$ is given by*

$$\mathrm{IF}(x; U, P_{\theta_0}) = \frac{1}{\gamma}\left[1 - \frac{p_{\theta_0}^{\gamma}(x)}{\int p_{\theta_0}^{\gamma+1}\,\mathrm{d}\lambda}\right].\tag{23}$$

Note that the influence function of the estimator $Q_{\widehat{\theta}_n}$ does not depend on the estimator $\widehat{\theta}_n$, but it does depend on the pseudodistance used. Usually, $p_{\theta_0}^{\gamma}(x)$ is bounded with respect to $x$, and therefore $Q_{\widehat{\theta}_n}$ is a robust estimator of $\mathcal{W}_{\theta_0}$.

**Figure 1.** Influence functions in the case of the normal model.

For comparison at the level of the influence function, we consider the AIC criterion which is defined by

$$\mathrm{AIC} = -2\ln(\mathcal{L}(\widehat{\theta}_n)) + 2d,$$

where $\mathcal{L}(\widehat{\theta}_n)$ is the maximum value of the likelihood function for the model, $\widehat{\theta}_n$ the maximum likelihood estimator and $d$ the dimension of the parameter. The statistical functional corresponding to the statistic $-2\ln(\mathcal{L}(\widehat{\theta}_n))$ is

$$V(P) = -2\int \ln p\_{T(P)}dP$$

where *T* here is the statistical functional corresponding to the maximum likelihood estimator. The influence function of the functional *V* is given by

$$IF(\mathbf{x}; V, P\_{\theta\_0}) = 2 \left[ \int \ln p\_{\theta\_0} dP\_{\theta\_0} - \ln p\_{\theta\_0}(\mathbf{x}) \right]. \tag{24}$$

This influence function is not bounded with respect to $x$; therefore the statistic $-2\ln(\mathcal{L}(\widehat{\theta}_n))$ is not robust.

For example, in the case of the univariate normal model, for a positive *γ*, the influence function (23) writes as

$$\mathrm{IF}(x; U, P_{\theta_0}) = \frac{1}{\gamma}\left(1 - \sqrt{\gamma+1}\cdot\exp\left(-\frac{\gamma}{2}\left(\frac{x-m}{\sigma}\right)^2\right)\right)\tag{25}$$

while the influence function (24) writes as

$$\mathrm{IF}(x; V, P_{\theta_0}) = \left(\frac{x-m}{\sigma}\right)^2 - \frac{2m^2}{\sigma^2} - 1\tag{26}$$

(here $\theta_0 = (m, \sigma)$). For all the pseudodistances, the influence function (25) is bounded with respect to $x$; therefore the selection criteria based on the statistic $Q_{\widehat{\theta}_n}$ will be robust. On the other hand, the influence function (26) is not bounded with respect to $x$, showing the non-robustness of AIC in this case. Moreover, the gross error sensitivities corresponding to these influence functions are $\gamma^*(U, P_{\theta_0}) = \frac{1}{\gamma}$ and $\gamma^*(V, P_{\theta_0}) = \infty$. These results show that, in the case of the normal model, when $\gamma$ increases the gross error sensitivity decreases. Therefore, larger values of $\gamma$ are associated with more robust procedures. For the particular case $m = 0$ and $\sigma = 1$, the influence functions (25) and (26) are represented in Figure 1.
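A short numerical sketch of the two influence functions in the standard normal case $m = 0$, $\sigma = 1$ (the function names are ours, not from the paper) illustrates that (25) stays bounded by $1/\gamma$ while (26) grows quadratically in $x$:

```python
import numpy as np

def if_pseudo(x, gamma, m=0.0, sigma=1.0):
    """Influence function (25) of the pseudodistance-based statistic."""
    z = (x - m) / sigma
    return (1.0 - np.sqrt(gamma + 1) * np.exp(-gamma * z**2 / 2)) / gamma

def if_aic(x, m=0.0, sigma=1.0):
    """Influence function (26) associated with the AIC statistic -2 ln L."""
    return ((x - m) / sigma) ** 2 - 2 * m**2 / sigma**2 - 1

x = np.linspace(-10, 10, 2001)
gamma = 0.5
print(np.abs(if_pseudo(x, gamma)).max(), 1 / gamma)  # stays below 1/γ
print(np.abs(if_aic(x)).max())                       # unbounded: grows like x²
```

Plotting these two curves reproduces the qualitative picture of Figure 1: a bounded, saturating influence for the pseudodistance statistic versus an unbounded parabola for AIC.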

#### *3.3. Model Selection Criteria Using Pseudodistances*

#### 3.3.1. The Case of Univariate Normal Family

The criteria that we propose in this section correspond to the case where the candidate model is a univariate normal model from the family of normal models $(P_\theta)$ indexed by $\theta = (\mu, \sigma)$. We also suppose that the true model $Q$ belongs to $(P_\theta)$.

In the case of the univariate normal model, $M_\gamma(\theta_0)$ defined in (14) expresses as

$$M_\gamma(\theta_0) = \frac{(\gamma+1)^2}{(2\gamma+1)^{3/2}}\, A(\gamma)\, V^{-1},\tag{27}$$

where $V$ is the asymptotic covariance matrix given by (8) and the matrix $A(\gamma)$ is given by

$$A(\gamma) = \begin{pmatrix} 1 & 0\\ 0 & \frac{3\gamma^2 + 4\gamma + 2}{2(2\gamma + 1)} \end{pmatrix}.$$

For small positive values of $\gamma$, the matrix $A(\gamma)$ can be approximated by the identity matrix $I$. According to Theorem 1, $\sqrt{n}\,(\widehat{\theta}_n - \theta_0)$ is asymptotically multivariate normal, and then the statistic $n(\theta_0 - \widehat{\theta}_n)^t V^{-1}(\theta_0 - \widehat{\theta}_n)$ has approximately a $\chi^2_d$ distribution. For large $n$, it holds

$$E[(\theta\_0 - \hat{\theta}\_n)^t M\_\gamma(\theta\_0)(\theta\_0 - \hat{\theta}\_n)] \approx \frac{(\gamma + 1)^2}{(2\gamma + 1)^{3/2}} \cdot \frac{d}{n}.\tag{28}$$

Also, for the normal model, it holds

$$\frac{\int p\_{\theta\_0}^{2\gamma+1} \mathrm{d}\lambda}{\left(\int p\_{\theta\_0}^{\gamma+1} \mathrm{d}\lambda\right)^2} = \frac{\gamma+1}{\sqrt{2\gamma+1}}.\tag{29}$$
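Identity (29) can be checked numerically for a normal density. The following sketch (the grid, tolerances and helper name are our own choices) compares a plain Riemann-sum evaluation of the left-hand side with the closed form on the right; note that $\sigma$ cancels in the ratio, as (29) requires:

```python
import numpy as np

def moment_ratio(gamma, m=0.0, sigma=1.0):
    """Numerically evaluate ∫ p^{2γ+1} dλ / (∫ p^{γ+1} dλ)² for a N(m, σ) density."""
    x = np.linspace(m - 12 * sigma, m + 12 * sigma, 200_001)
    dx = x[1] - x[0]
    p = np.exp(-((x - m) / sigma) ** 2 / 2) / (sigma * np.sqrt(2 * np.pi))
    num = (p ** (2 * gamma + 1)).sum() * dx
    den = ((p ** (gamma + 1)).sum() * dx) ** 2
    return num / den

for g in (0.1, 0.25, 0.5):
    print(g, moment_ratio(g), (g + 1) / np.sqrt(2 * g + 1))  # the two columns agree
```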

Therefore, (18) becomes

$$E[\mathcal{W}_{\widehat{\theta}_n}] \cong E[Q_{\widehat{\theta}_n}] + \frac{(\gamma+1)^2}{(2\gamma+1)^{3/2}}\cdot\frac{d}{n} + \frac{1}{2\gamma n}\left[1 - \frac{\gamma+1}{\sqrt{2\gamma+1}}\right] + E\left[\mathcal{R}_n\right] + \frac{1}{\gamma}E\left[\mathcal{R}_n'\right].\tag{30}$$

Using the central limit theorem and the asymptotic properties of $\widehat{\theta}_n$ given in Theorem 1, the following hold:

$$n \cdot o(\|\hat{\theta}\_n - \theta\_0\|^2) \quad = \quad o\_P(1), \tag{31}$$

$$n \cdot o(\|\frac{1}{n}\sum\_{i=1}^{n} p\_{\theta\_0}^{\gamma}(X\_i) - \int p\_{\theta\_0}^{\gamma+1} \mathbf{d}\lambda\|^2) \quad = \quad o\_P(1). \tag{32}$$

Using (30), (31) and (32) we obtain:

**Proposition 6.** *For the univariate normal family, an asymptotically unbiased estimator of the expected overall discrepancy is given by*

$$Q\_{\widehat{\theta}\_n} + \frac{(\gamma + 1)^2}{(2\gamma + 1)^{3/2}} \cdot \frac{d}{n} + \frac{1}{2\gamma n} \left[ 1 - \frac{\gamma + 1}{\sqrt{2\gamma + 1}} \right],\tag{33}$$

*where $\widehat{\theta}_n$ is a minimum pseudodistance estimator given by (3).*

Under the hypothesis that $(P_\theta)$ is the univariate normal model, as we supposed in this subsection, the function $h$ writes as

$$h(x, \theta) = (\sqrt{\gamma+1})^{\gamma/(\gamma+1)} \cdot (\sigma\sqrt{2\pi})^{-\gamma/(\gamma+1)} \cdot \exp\left(-\frac{\gamma}{2}\left(\frac{x-m}{\sigma}\right)^2\right)\tag{34}$$

and it can be easily checked that all the conditions (C1)–(C5) are fulfilled. Therefore we can use all the results presented in the preceding subsections, so that Proposition 6 is fully justified.

Moreover, the selection criteria based on (33) are consistent on the basis of Proposition 4. It should also be noted that the bias correction term in (33) decreases slowly as the parameter $\gamma$ increases, always staying very close to zero ($\sim 10^{-2}$). As expected, the larger the sample size, the smaller the bias correction. As we saw in Section 3.2.2, since the gross error sensitivity of $Q_{\widehat{\theta}_n}$ is $\gamma^*(U, P_{\theta_0}) = \frac{1}{\gamma}$, larger values of $\gamma$ are associated with more robust procedures. On the other hand, the approximation of $A(\gamma)$ by the identity matrix holds for values of $\gamma$ close to zero. Thus, positive values of $\gamma$ smaller than 0.5, for example, could represent choices satisfying both the robustness requirement and the approximation of $A(\gamma)$ by the identity matrix, an approximation which is necessary for constructing the criterion in this case.
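As an illustration of Proposition 6, the sketch below evaluates the criterion (33) on a simulated standard normal sample. The crude grid minimization of the empirical pseudodistance objective and all names are our own simplification offered under stated assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, n, d = 0.2, 200, 2   # d = dim(θ) for θ = (m, σ)
X = rng.normal(0.0, 1.0, n)

def objective(m, s, x, g):
    """Empirical pseudodistance objective; its minimum value is Q_{θ̂n}, cf. (19)."""
    # ln ∫ φ_s^{γ+1} dλ in closed form for the normal density
    ln_int = -g * np.log(s * np.sqrt(2 * np.pi)) - 0.5 * np.log(g + 1)
    phi_g = np.exp(-g * ((x - m) / s) ** 2 / 2) / (s * np.sqrt(2 * np.pi)) ** g
    return ln_int / (g + 1) - np.log(phi_g.mean()) / g

# crude grid search around the sample moments (illustration only)
ms = np.linspace(X.mean() - 0.5, X.mean() + 0.5, 101)
ss = np.linspace(0.6 * X.std(), 1.6 * X.std(), 101)
vals = np.array([[objective(mm, s, X, gamma) for s in ss] for mm in ms])
i, j = np.unravel_index(vals.argmin(), vals.shape)
Q_hat = vals[i, j]

# bias-corrected criterion (33)
crit = Q_hat \
    + (gamma + 1) ** 2 / (2 * gamma + 1) ** 1.5 * d / n \
    + (1 - (gamma + 1) / np.sqrt(2 * gamma + 1)) / (2 * gamma * n)
print(ms[i], ss[j], crit)
```

On clean data and for small $\gamma$, the grid minimizer lands close to the maximum likelihood estimates, and the bias correction added to $Q_{\widehat{\theta}_n}$ is indeed of order $10^{-2}$ here.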

#### 3.3.2. The Case of Linear Regression Models

In the following, we adapt the pseudodistance based model selection criterion in the case of linear regression models. Consider the linear regression model

$$Y = \alpha + \beta^t X + e,\tag{35}$$

where $e \sim \mathcal{N}(0, \sigma)$ and $e$ is independent of $X$. Suppose we have a sample given by the i.i.d. random vectors $Z_i = (X_i, Y_i)$, $i = 1, \ldots, n$, such that $Y_i = \alpha + \beta^t X_i + e_i$.

We consider the joint distribution of the entire data and write a pseudodistance between the theoretical model and the true model corresponding to the data. Let $P_\theta$, $\theta := (\alpha, \beta, \sigma)$, be the probability measure associated with the theoretical model given by the random vector $Z = (X, Y)$ and $Q$ the probability measure associated with the true model corresponding to the data. Denote by $p_\theta$, respectively $q$, the corresponding densities. For $\gamma > 0$, the pseudodistance between $P_\theta$ and $Q$ is defined by

$$R_{\gamma}(P_{\theta}, Q) := \frac{1}{\gamma+1}\ln\left(\int p_{\theta}^{\gamma}(x,y)\,\mathrm{d}P_{\theta}(x,y)\right) + \frac{1}{\gamma(\gamma+1)}\ln\left(\int q^{\gamma}(x,y)\,\mathrm{d}Q(x,y)\right) - \frac{1}{\gamma}\ln\left(\int p_{\theta}^{\gamma}(x,y)\,\mathrm{d}Q(x,y)\right).\tag{36}$$

Similar to [18], since the middle term above does not depend on $P_\theta$, a minimum pseudodistance estimator of the parameter $\theta_0 = (\alpha_0, \beta_0, \sigma_0)$ is defined by

$$\widehat{\theta}_n = (\widehat{\alpha}_n, \widehat{\beta}_n, \widehat{\sigma}_n) = \arg\min_{\alpha,\beta,\sigma}\left\{\frac{1}{\gamma+1}\ln\left(\int p_\theta^{\gamma}(x,y)\,\mathrm{d}P_\theta(x,y)\right) - \frac{1}{\gamma}\ln\left(\int p_\theta^{\gamma}(x,y)\,\mathrm{d}P_n(x,y)\right)\right\},\tag{37}$$

where *P<sup>n</sup>* is the empirical measure associated with the sample. This estimator can be written as

$$\widehat{\theta}_n = (\widehat{\alpha}_n, \widehat{\beta}_n, \widehat{\sigma}_n) = \arg\min_{\alpha,\beta,\sigma}\left\{\frac{1}{\gamma+1}\ln\left(\int \phi_\sigma^{\gamma+1}(e)\,\mathrm{d}e\right) - \frac{1}{\gamma}\ln\left(\frac{1}{n}\sum_{i=1}^{n}\phi_\sigma^{\gamma}(Y_i - \alpha - \beta^t X_i)\right)\right\},\tag{38}$$

where $\phi_\sigma$ is the density of the random variable $e \sim \mathcal{N}(0, \sigma)$. Then, the estimator $Q_{\widehat{\theta}_n}$ can be written as

$$Q_{\widehat{\theta}_n} = \min_{\alpha,\beta,\sigma}\left\{\frac{1}{\gamma+1}\ln\left(\frac{1}{(\sigma\sqrt{2\pi})^{\gamma}\sqrt{\gamma+1}}\right) - \frac{1}{\gamma}\ln\left(\frac{1}{n}\sum_{i=1}^{n}\frac{1}{(\sigma\sqrt{2\pi})^{\gamma}}\cdot\exp\left(-\frac{\gamma}{2\sigma^2}(Y_i - \alpha - \beta^t X_i)^2\right)\right)\right\}.\tag{39}$$
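The minimization in (38)–(39) can be sketched for a single covariate using a generic derivative-free search started at the OLS fit. The simulation settings, the OLS warm start and the function names are our own assumptions, offered only as an illustration of the objective, not as the authors' implementation:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
gamma, n = 0.2, 300
X = rng.normal(1.0, 3.0, (n, 1))
Y = 1.0 + 2.0 * X[:, 0] + rng.normal(0.0, 1.0, n)

def pd_objective(params, X, Y, g):
    """Empirical objective of (38); its minimum value is Q_{θ̂n} of (39)."""
    alpha, beta, log_s = params[0], params[1:-1], params[-1]
    s = np.exp(log_s)                      # parametrize σ = exp(log_s) > 0
    r = Y - alpha - X @ beta
    # (1/(γ+1)) ln ∫ φ_σ^{γ+1}(e) de, closed form for the normal density
    t1 = (-g * np.log(s * np.sqrt(2 * np.pi)) - 0.5 * np.log(g + 1)) / (g + 1)
    # (1/γ) ln( (1/n) Σ φ_σ^γ(Y_i - α - βᵗX_i) )
    t2 = -np.log(s * np.sqrt(2 * np.pi)) \
         + np.log(np.mean(np.exp(-g * r**2 / (2 * s**2)))) / g
    return t1 - t2

# start from OLS, then refine with a derivative-free search
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
x0 = np.r_[coef, np.log((Y - A @ coef).std())]
res = minimize(pd_objective, x0, args=(X, Y, gamma), method="Nelder-Mead",
               options={"xatol": 1e-9, "fatol": 1e-12, "maxiter": 5000})
alpha_hat, beta_hat, sigma_hat = res.x[0], res.x[1], np.exp(res.x[2])
Q_hat = res.fun
print(alpha_hat, beta_hat, sigma_hat, Q_hat)
```

For small $\gamma$ and clean data the minimizer stays close to the OLS/ML estimates; the exponential downweighting in the second term is what supplies robustness when outliers are present.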

In order to construct an asymptotically unbiased estimator of the expected overall discrepancy in the case of linear regression models, we evaluate the second and the third terms from (18).

For values of $\gamma$ close to 0 ($\gamma$ smaller than 0.3), we found the following approximation of the matrix $M_\gamma(\theta_0)$:

$$M\_{\gamma}(\theta\_0) \simeq \frac{(\gamma+1)^2}{(2\gamma+1)^{3/2}} V^{-1}\begin{pmatrix} I & 0\\ 0 & \frac{3\gamma^2 + 4\gamma + 2}{2\gamma+1} \end{pmatrix} \tag{40}$$

where $V$ is the asymptotic covariance matrix of $\widehat{\theta}_n$ and $I$ is the identity matrix. We refer to [15] for the asymptotic properties of the minimum pseudodistance estimators in the case of linear regression models. Since $\sqrt{n}\,(\widehat{\theta}_n - \theta_0)$ is asymptotically multivariate normally distributed, using the $\chi^2$ distribution, we obtain the approximation

$$E[(\widehat{\theta}\_n - \theta\_0)^t M\_\gamma(\theta\_0)(\widehat{\theta}\_n - \theta\_0)] \simeq \frac{1}{n} \cdot \frac{(\gamma + 1)^2}{(2\gamma + 1)^{3/2}} \left[ (d - 1) + \frac{3\gamma^2 + 4\gamma + 2}{2(\gamma + 1)(2\gamma + 1)} \right].\tag{41}$$

Also, the third term in (18) is given by

$$\frac{1}{2\gamma n} \left[ 1 - \left( \frac{\gamma + 1}{\sqrt{2\gamma + 1}} \right)^d \right]. \tag{42}$$

Then, according to Proposition 3, an asymptotically unbiased estimator of the expected overall discrepancy is given by

$$Q_{\widehat{\theta}_n} + \frac{1}{n}\cdot\frac{(\gamma+1)^2}{(2\gamma+1)^{3/2}}\left[(d-1) + \frac{3\gamma^2+4\gamma+2}{2(\gamma+1)(2\gamma+1)}\right] + \frac{1}{2\gamma n}\left[1 - \left(\frac{\gamma+1}{\sqrt{2\gamma+1}}\right)^d\right],\tag{43}$$

where $Q_{\widehat{\theta}_n}$ is given by (39). Note that, using the asymptotic properties of $\widehat{\theta}_n$ and the central limit theorem, the last two terms in (18) of Proposition 3 are $o_P(\frac{1}{n})$.

When we compare different linear regression models, as in Section 4 below, we can ignore the terms depending only on *n* and *γ* in (43). Therefore, we can use as model selection criterion the simplified expression

$$Q_{\widehat{\theta}_n} + \frac{(\gamma+1)^2}{(\sqrt{2\gamma+1})^3}\cdot\frac{d}{n} - \frac{1}{2\gamma n}\left(\frac{\gamma+1}{\sqrt{2\gamma+1}}\right)^d,\tag{44}$$

which we call the Pseudodistance based Information Criterion (PIC).
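For concreteness, the penalty part of (44) can be written as a small helper. The name `pic_penalty` is ours; reading off the surrounding derivation, $d$ is the dimension of $\theta$, i.e. $d = p + 2$ for a model with $p$ covariates (intercept, slopes and $\sigma$):

```python
import numpy as np

def pic_penalty(d, n, gamma):
    """Penalty terms added to Q_{θ̂n} in the PIC criterion (44)."""
    c = (gamma + 1) ** 2 / (2 * gamma + 1) ** 1.5
    return c * d / n - ((gamma + 1) / np.sqrt(2 * gamma + 1)) ** d / (2 * gamma * n)

# the penalty grows with the model dimension d, so PIC trades fit for parsimony
for d in (3, 4, 5, 6):
    print(d, pic_penalty(d, n=50, gamma=0.15))
```

Although the second term is subtracted and its base $(\gamma+1)/\sqrt{2\gamma+1}$ exceeds 1, for the small values of $\gamma$ considered here the linear term dominates, so larger candidate models pay a larger penalty.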

#### **4. Applications**

#### *4.1. Simulation Study*

In order to illustrate the performance of the PIC criterion (44) in the case of linear regression models, we performed a simulation study using for comparison the model selection criteria AIC, BIC and MDIC. These criteria are defined respectively by

$$\mathrm{AIC} = n\log\widehat{\sigma}_p^2 + 2(p+2),$$

$$\mathrm{BIC} = n\log\widehat{\sigma}_p^2 + (p+2)\log n,$$

where $n$ is the sample size, $p$ the number of covariates of the model and $\widehat{\sigma}_p^2$ the classical unbiased estimator of the variance of the model, and

$$\mathrm{MDIC} = n\,MQ_{\widehat{\theta}} + (2\pi)^{-\alpha/2}(1+\alpha)^{2+p/2}\,p,$$

with *α* = 0.25 and

$$MQ_{\widehat{\theta}} = -\left[(1+\alpha^{-1})\,\frac{1}{n}\sum_{i=1}^{n} f_{\widehat{\theta}}^{\alpha}(X_i)\right],$$

where $\widehat{\theta}$ is a consistent estimate of the vector of unknown parameters involved in the model with $p$ covariates and $f_{\widehat{\theta}}$ is the associated probability density function. Note that MDIC is based on the well-known BHHJ family of divergence measures indexed by a parameter $\alpha > 0$ and on the minimum divergence estimating method for robust parameter estimation (see [20]). The value $\alpha = 0.25$ was found in [9] to be an ideal one for a great variety of settings. The above three criteria have been chosen for this comparative study with PIC not only due to their popularity, but also due to their special characteristics. Indeed, AIC is the classical representative of asymptotically efficient criteria, BIC is known to be consistent, while MDIC is associated with robust estimation (see e.g., [20–23]).

Let $X_1, X_2, X_3, X_4$ be four variables following respectively the normal distributions $\mathcal{N}(0,3)$, $\mathcal{N}(1,3)$, $\mathcal{N}(2,3)$ and $\mathcal{N}(3,3)$. We consider the model

$$Y = a\_0 + a\_1 X\_1 + a\_2 X\_2 + \varepsilon$$

with $a_0 = a_1 = a_2 = 1$ and $\varepsilon \sim \mathcal{N}(0,1)$. This is the uncontaminated model. In order to evaluate the robustness of the new PIC criterion, we also consider the contaminated model

$$Y = d_1(a_0 + a_1X_1 + a_2X_2 + \varepsilon) + d_2(a_0 + a_1X_1 + a_2X_2 + \varepsilon^*)$$

where $\varepsilon^* \sim \mathcal{N}(5,1)$ and $d_1, d_2 \in [0,1]$ such that $d_1 + d_2 = 1$. Note that for $d_1 = 1$ and $d_2 = 0$ the uncontaminated model is obtained.

The simulated data corresponding to the contaminated model are

$$Y_i = d_1(1 + X_{1,i} + X_{2,i} + \varepsilon_i) + d_2(1 + X_{1,i} + X_{2,i} + \varepsilon_i^*),$$

for $i = 1, \ldots, n$, where $X_{1,i}, X_{2,i}, \varepsilon_i, \varepsilon_i^*$ are values of the variables $X_1, X_2, \varepsilon, \varepsilon^*$ independently generated from the normal distributions $\mathcal{N}(1,3)$, $\mathcal{N}(2,3)$, $\mathcal{N}(0,1)$, $\mathcal{N}(5,1)$ correspondingly.

With a set of four possible regressors there are $2^4 - 1 = 15$ possible model specifications that include at least one regressor. These 15 possible models constitute the set of candidate models in our study. More precisely, this set contains the full model $(X_1, X_2, X_3, X_4)$ given by

$$Y = b_0 + b_1X_1 + b_2X_2 + b_3X_3 + b_4X_4 + \varepsilon,$$

as well as all 14 possible subsets of the full model consisting of one ($X_{j_1}$), two ($X_{j_1}, X_{j_2}$) and three ($X_{j_1}, X_{j_2}, X_{j_3}$) of the four regressors $X_1, X_2, X_3$ and $X_4$, with $j_1 \neq j_2 \neq j_3$, $j_i \in \{1, 2, 3, 4\}$ and $i = 1, 2, 3$.

In our simulation study, for several values of the parameter $\gamma$ associated with the pseudodistance, we compared the new criterion PIC with the other model selection criteria. Different levels of contamination and different sample sizes have been considered. In the examples presented in this work, $d_1 \in \{0.8, 0.9, 0.95, 1\}$ and $n \in \{20, 50, 100\}$. Additional examples for $n = 30, 75, 200, 500$ have been analyzed (results not shown) with similar findings (see below). For each setting, fifty experiments were performed in order to select the best model among the available candidate models. In the framework of each of the fifty experiments, on the basis of the simulated observations, the value of each of the above model selection criteria was calculated for each of the 15 possible models. Then, for each criterion, the 15 candidate models were ranked from 1st to 15th according to the value of the criterion. The model chosen by a given criterion is the one for which the value of the criterion is the lowest among all the 15 candidate models.
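The data-generating and ranking protocol described above can be sketched as follows, shown here with the classical AIC for brevity. The seed and helper names are our assumptions; note also that the text gives two slightly different distributions for $X_1, X_2$, and we follow the ones stated when the variables are first introduced:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
n, d1 = 100, 0.9
d2 = 1.0 - d1

Xs = {j: rng.normal(j - 1.0, 3.0, n) for j in (1, 2, 3, 4)}  # X_j ~ N(j-1, 3)
eps = rng.normal(0.0, 1.0, n)
eps_star = rng.normal(5.0, 1.0, n)           # contaminating noise
Y = d1 * (1 + Xs[1] + Xs[2] + eps) + d2 * (1 + Xs[1] + Xs[2] + eps_star)

def aic(subset):
    """Classical AIC = n log σ̂_p² + 2(p+2) for the OLS fit on a regressor subset."""
    A = np.column_stack([np.ones(n)] + [Xs[j] for j in subset])
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
    resid = Y - A @ coef
    p = len(subset)
    s2 = resid @ resid / (n - p - 1)         # unbiased residual variance
    return n * np.log(s2) + 2 * (p + 2)

# all 2^4 - 1 = 15 candidate subsets, ranked by the criterion value
models = [c for r in (1, 2, 3, 4) for c in combinations((1, 2, 3, 4), r)]
ranked = sorted(models, key=aic)
print(ranked[0])  # the selected model; the true regressors are X1 and X2
```

In the study itself this ranking is repeated fifty times per setting, for each criterion (PIC, AIC, BIC, MDIC), and the proportions of selected models are recorded.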

Tables 1–12 present the proportions of models selected by the considered criteria. Among the 15 candidate models only 4 were chosen at least once. These four models are the same in all instances and appear in the 2nd column of all Tables.

For small sample sizes (*n* = 20, *n* = 30) the criteria PIC and MDIC yield the best results. When the level of contamination is 10% or 20%, the PIC criterion yields very good results and beats the other competitors almost all the time. When the level of contamination is small, for example 5% or when there is no contamination, the two criteria are comparable, in the sense that in many cases the proportions of selected models by the two criteria are very close, so that sometimes PIC wins and sometimes MDIC wins. Tables 1–4 present these results for *n* = 20, but similar results are obtained for *n* = 30, too.

For medium sample sizes (*n* = 50, *n* = 75), the criteria PIC and BIC yield the best results. The results for *n* = 50 are given in Tables 5–8. Note that the PIC criterion yields the best results for 0% and 10% contamination. For the other levels of contamination, there are values of *γ* for which PIC is the best among all the considered criteria. On the other hand, in most cases when BIC wins, the proportions of selections of the true model by BIC and PIC are close.

When the sample size is large (*n* = 100, *n* = 200, *n* = 500), BIC generally yields better results than PIC which stays relatively close behind, but sometimes BIC and PIC have the same performance. Tables 9–12 present the results obtained for *n* = 100.

Thus, the new PIC criterion works very well for small to medium sample sizes and for levels of contamination up to 20%, but falls behind BIC for large sample sizes. Note that for contaminated data, PIC with $\gamma = 0.15$ prevails in most of the considered cases. On the other hand, for uncontaminated data, it is PIC with $\gamma = 0.2$ that prevails in all the considered instances. It is also worth mentioning that PIC with $\gamma = 0.3$ appears to behave very satisfactorily in most cases irrespective of the proportion of contamination (0%–20%) and the sample size. Observe also that in all cases AIC has the highest overestimation rate, which is somewhat expected (see [24]).

Although consistency is the main focus of the applications presented in this work, one should point out that if prediction is part of the objective of a regression analysis, then model selection carried out using criteria such as the ones used in this work has desirable properties. In fact, the case of finite-dimensional normal regression models has been shown to be associated with satisfactory prediction errors for criteria such as AIC and BIC (see [25]). Furthermore, it should be pointed out that in many instances PIC has a behavior quite similar to the above criteria by choosing the same models. Also, according to the presented simulation results, the proportion of choosing the true model by PIC is always better than that of AIC (even in the case of non-contaminated data) and sometimes better than that of BIC. These results imply a satisfactory prediction ability for the proposed PIC criterion.

In conclusion, the new PIC criterion is a good competitor of the well known model selection criteria AIC, BIC and MDIC and may have superior performance especially in the case of small and contaminated samples.


**Table 1.** Proportions of selected models by the considered criteria (*n* = 20, *d*<sup>1</sup> = 0.8).

**Table 2.** Proportions of selected models by the considered criteria (*n* = 20, *d*<sup>1</sup> = 0.9).



**Table 3.** Proportions of selected models by the considered criteria (*n* = 20, *d*<sup>1</sup> = 0.95).

**Table 4.** Proportions of selected models by the considered criteria (*n* = 20, *d*<sup>1</sup> = 1).



**Table 5.** Proportions of selected models by the considered criteria (*n* = 50, *d*<sup>1</sup> = 0.8).

**Table 6.** Proportions of selected models by the considered criteria (*n* = 50, *d*<sup>1</sup> = 0.9).



**Table 7.** Proportions of selected models by the considered criteria (*n* = 50, *d*<sup>1</sup> = 0.95).

**Table 8.** Proportions of selected models by the considered criteria (*n* = 50, *d*<sup>1</sup> = 1).



**Table 9.** Proportions of selected models by the considered criteria (*n* = 100, *d*<sup>1</sup> = 0.8).

**Table 10.** Proportions of selected models by the considered criteria (*n* = 100, *d*<sup>1</sup> = 0.9).



**Table 11.** Proportions of selected models by the considered criteria (*n* = 100, *d*<sup>1</sup> = 0.95).

**Table 12.** Proportions of the selected models by the considered criteria (*n* = 100, *d*<sup>1</sup> = 1).


#### *4.2. Real Data Example*

In order to illustrate the proposed method, we used the Hald cement data (see [26]), a popular example for multiple linear regression. This example concerns the heat evolved in calories per gram of cement $Y$ as a function of the amount of each of four ingredients in the mix: tricalcium aluminate ($X_1$), tricalcium silicate ($X_2$), tetracalcium alumino-ferrite ($X_3$) and dicalcium silicate ($X_4$). The data are presented in Table 13.


**Table 13.** Hald cement data.

Since 4 variables are available, there are 15 possible candidate models (involving at least one regressor) for this data set. Note that the 4 single-variable models should be excluded from the analysis, because cement involves a mixture of at least two components that react chemically (see [27], p. 102). The model selection criteria that have been used are PIC for several values of *γ*, AIC, BIC and MDIC with *α* = 0.25. Table 14 shows the model selected by each of the considered criteria.

**Table 14.** Selected models by model selection criteria.


Observe that, in this example, PIC behaves similarly to AIC and MDIC, having a slight tendency of overestimation. Note that for this specific dataset the collinearity is quite strong, with $X_1$ and $X_3$ as well as $X_2$ and $X_4$ being seriously correlated. It should be pointed out that the model $(X_1, X_2, X_4)$ is chosen not only by AIC and PIC, but also by Mallows' $C_p$ criterion ([1]), with $(X_1, X_2, X_3)$ coming a very close second. Note further that $(X_1, X_2, X_4)$ has also been chosen by cross validation ([28], p. 33) and PRESS ([26], p. 325). Finally, it is worth noticing that these two models share the highest adjusted $R^2$ values, which are almost identical (0.976 for $(X_1, X_2, X_4)$ and 0.974 for $(X_1, X_2, X_3)$), making the distinction between them extremely hard. Thus, in this example, the new PIC criterion gives results as good as other recognized classical model selection criteria.

#### **5. Conclusions**

In this work, by applying the same methodology as for AIC to a family of pseudodistances, we constructed new model selection criteria using minimum pseudodistance estimators. We proved theoretical properties of these criteria, including asymptotic unbiasedness, robustness and consistency, as well as the limit laws. The case of linear regression models was studied in detail and specific selection criteria based on pseudodistances were proposed.

For linear regression models, a comparative study based on Monte Carlo simulations illustrates the performance of the new methodology. Thus, for small sample sizes, the criteria PIC and MDIC yield the best results and in many cases PIC wins, for example when the level of contamination is 10% or 20%. For medium sample sizes, the criteria PIC and BIC yield the best results. When the sample size is large, BIC generally yields better results than PIC, which stays relatively close behind, but sometimes BIC and PIC have the same performance.

Based on the results of the simulation study and on the real data example, we conclude that the new PIC criterion is a good competitor of the well known criteria AIC, BIC and MDIC with an overall performance which is very satisfactory for all possible settings according to the sample size and contamination rate. Also PIC may have superior performance, especially in the case of small and contaminated samples.

An important issue that needs further investigation is the choice of an appropriate value for the parameter $\gamma$ associated with the procedure. The findings of the presented simulation study show that, for contaminated data, the value $\gamma = 0.15$ leads to very good results, irrespective of the sample size. Also, $\gamma = 0.3$ produces overall very satisfactory results, irrespective of the sample size and the contamination rate. We hope to explore further and provide a clear solution to this problem in future work. We also intend to extend this methodology to other types of models, including nonlinear models and time series models.

**Author Contributions:** A.T. conceived the methodology, obtained the theoretical results. A.T., A.K. and P.T. conceived the application part. A.K. and P.T. implemented the method in R and obtained the numerical results. All authors wrote the paper. All authors have read and approved the final manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** The work of the first author was partially supported by a grant of the Romanian National Authority for Scientific Research, CNCS-UEFISCDI, project number PN-II-RU-TE-2012-3-0007. The work of the third author was completed as part of the activities of the Laboratory of Statistics and Data Analysis of the University of the Aegean.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

**Proof of Proposition 1.** Using a Taylor expansion of $\mathcal{W}_\theta$ around the true parameter $\theta_0$ and taking $\theta = \widehat{\theta}_n$, on the basis of (12) and (13) we obtain

$$\mathcal{W}\_{\widehat{\theta}\_n} = \mathcal{W}\_{\theta\_0} + \frac{1}{2} (\widehat{\theta}\_n - \theta\_0)^t M\_{\gamma}(\theta\_0) (\widehat{\theta}\_n - \theta\_0) + o(\|\widehat{\theta}\_n - \theta\_0\|^2). \tag{A1}$$

Then (15) holds.

**Proof of Proposition 2.** Using a Taylor expansion of $Q_\theta$ around $\widehat{\theta}_n$ and taking $\theta = \theta_0$, we obtain

$$Q\_{\theta\_0} = Q\_{\hat{\theta}\_n} + \left[\frac{\partial}{\partial \theta} Q\_{\theta}\right]\_{\theta = \hat{\theta}\_n}^t (\theta\_0 - \hat{\theta}\_n) + \frac{1}{2} (\theta\_0 - \hat{\theta}\_n)^t \left[\frac{\partial^2}{\partial \theta^2} Q\_{\theta}\right]\_{\theta = \hat{\theta}\_n} (\theta\_0 - \hat{\theta}\_n) + o(\|\hat{\theta}\_n - \theta\_0\|^2). \tag{A2}$$

Note that $\left[\frac{\partial}{\partial\theta} Q_\theta\right]_{\theta=\widehat{\theta}_n} = 0$ by the very definition of $\widehat{\theta}_n$. By applying the weak law of large numbers and the continuous mapping theorem, we get

$$
\left[\frac{\partial^2}{\partial\theta^2}Q\_\theta\right]\_{\theta=\theta\_0} - \left[\frac{\partial^2}{\partial\theta^2}W\_\theta\right]\_{\theta=\theta\_0} \stackrel{P}{\to} 0\tag{A3}
$$

and using (13)

$$
\left[\frac{\partial^2}{\partial\theta^2}Q\_\theta\right]\_{\theta=\theta\_0} - M\_\gamma(\theta\_0) \stackrel{P}{\to} 0.\tag{A4}
$$

Then, using the consistency of *θ*b*<sup>n</sup>* and (A4), we obtain

$$\left[\frac{\partial^2}{\partial\theta^2}Q\_\theta\right]\_{\theta=\widehat{\theta}\_n} = M\_\gamma(\theta\_0) + o\_P(1). \tag{A5}$$

Consequently,

$$Q\_{\theta\_0} = Q\_{\widehat{\theta}\_n} + \frac{1}{2} (\theta\_0 - \widehat{\theta}\_n)^t M\_\gamma(\theta\_0) (\theta\_0 - \widehat{\theta}\_n) + o(\|\widehat{\theta}\_n - \theta\_0\|^2) \tag{A6}$$

and we deduce (17).

**Proof of Proposition 3.** Using Proposition 1 and Proposition 2, we obtain

$$E[\mathcal{W}\_{\widehat{\theta}\_n}] = E[Q\_{\widehat{\theta}\_n}] + E[(\theta\_0 - \widehat{\theta}\_n)^t M\_\gamma(\theta\_0)(\theta\_0 - \widehat{\theta}\_n)] - E[Q\_{\theta\_0}] + \mathcal{W}\_{\theta\_0} + E[\mathcal{R}\_n] \tag{A7}$$

where $\mathcal{R}_n = o(\|\widehat{\theta}_n - \theta_0\|^2)$.

In order to evaluate $\mathcal{W}_{\theta_0} - E[Q_{\theta_0}]$, note that

$$Q\_{\theta\_0} - W\_{\theta\_0} = -\frac{1}{\gamma} \left[ \ln \left( \frac{1}{n} \sum\_{i=1}^n p\_{\theta\_0}^{\gamma} (X\_i) \right) - \ln \left( \int p\_{\theta\_0}^{\gamma+1} \mathbf{d} \lambda \right) \right]. \tag{A8}$$

A Taylor expansion of the function $\ln x$ around $\int p_{\theta_0}^{\gamma+1}\,\mathrm{d}\lambda$ yields

$$\begin{split} \ln\left(\frac{1}{n}\sum\_{i=1}^{n}p\_{\theta\_{0}}^{\gamma}(X\_{i})\right) &= \quad \ln\left(\int p\_{\theta\_{0}}^{\gamma+1}\mathbf{d}\lambda\right) + \frac{1}{\int p\_{\theta\_{0}}^{\gamma+1}\mathbf{d}\lambda} \left[\frac{1}{n}\sum\_{i=1}^{n}p\_{\theta\_{0}}^{\gamma}(X\_{i}) - \int p\_{\theta\_{0}}^{\gamma+1}\mathbf{d}\lambda\right] - \\ &\quad \quad \quad \quad - \frac{1}{2}\cdot\frac{1}{(\int p\_{\theta\_{0}}^{\gamma+1}\mathbf{d}\lambda)^{2}} \left[\frac{1}{n}\sum\_{i=1}^{n}p\_{\theta\_{0}}^{\gamma}(X\_{i}) - \int p\_{\theta\_{0}}^{\gamma+1}\mathbf{d}\lambda\right]^{2} + \\ &\quad \quad \quad + o(\|\frac{1}{n}\sum\_{i=1}^{n}p\_{\theta\_{0}}^{\gamma}(X\_{i}) - \int p\_{\theta\_{0}}^{\gamma+1}\mathbf{d}\lambda\|^{2}). \end{split} \tag{A9}$$

Then

$$\begin{split} \mathbb{E}[\mathcal{Q}\_{\theta\_{0}} - \mathcal{W}\_{\theta\_{0}}] &= \ -\frac{1}{\gamma} \mathbb{E}\left[\ln\left(\frac{1}{n}\sum\_{i=1}^{n} p\_{\theta\_{0}}^{\gamma}(\mathbf{X}\_{i})\right) - \ln\left(\int p\_{\theta\_{0}}^{\gamma+1} \mathbf{d}\lambda\right)\right] \\ &= \ -\frac{1}{\gamma} \left\{\frac{1}{\int p\_{\theta\_{0}}^{\gamma+1} \mathbf{d}\lambda} \mathbb{E}\left[\frac{1}{n}\sum\_{i=1}^{n} p\_{\theta\_{0}}^{\gamma}(\mathbf{X}\_{i}) - \int p\_{\theta\_{0}}^{\gamma+1} \mathbf{d}\lambda\right] \right. \\ &\left. -\frac{1}{2} \cdot \frac{1}{(\int p\_{\theta\_{0}}^{\gamma+1} \mathbf{d}\lambda)^{2}} \mathbb{E}\left[\left(\frac{1}{n}\sum\_{i=1}^{n} p\_{\theta\_{0}}^{\gamma}(\mathbf{X}\_{i}) - \int p\_{\theta\_{0}}^{\gamma+1} \mathbf{d}\lambda\right)^{2}\right] + \mathbb{E}[\mathcal{R}\_{n}']\right\} .\end{split}$$

where $\mathcal{R}_n' = o\left(\left\|\frac{1}{n}\sum_{i=1}^{n} p_{\theta_0}^{\gamma}(X_i) - \int p_{\theta_0}^{\gamma+1}\,\mathrm{d}\lambda\right\|^2\right)$. On the other hand,

$$\begin{split} \mathbb{E}\left[\left(\frac{1}{n}\sum\_{i=1}^{n}p\_{\theta\_{0}}^{\gamma}(X\_{i})-\int p\_{\theta\_{0}}^{\gamma+1}\mathbf{d}\lambda\right)^{2}\right] &= \text{Var}\left[\frac{1}{n}\sum\_{i=1}^{n}p\_{\theta\_{0}}^{\gamma}(X\_{i})\right] = \frac{1}{n}\text{Var}\left[p\_{\theta\_{0}}^{\gamma}(X)\right] \\ &= \frac{1}{n}\left\{\mathbb{E}[p\_{\theta\_{0}}^{2\gamma}(X)]-\mathbb{E}[p\_{\theta\_{0}}^{\gamma}(X)]^{2}\right\} \\ &= \frac{\int p\_{\theta\_{0}}^{2\gamma+1}\mathbf{d}\lambda - (\int p\_{\theta\_{0}}^{\gamma+1}\mathbf{d}\lambda)^{2}}{n}. \end{split} \tag{A10}$$

Consequently,

$$\mathbb{E}[Q\_{\theta\_0}] - \mathcal{W}\_{\theta\_0} = -\frac{1}{2\gamma n} \left[ 1 - \frac{\int p\_{\theta\_0}^{2\gamma+1} \mathrm{d}\lambda}{\left( \int p\_{\theta\_0}^{\gamma+1} \mathrm{d}\lambda \right)^2} \right] - \frac{1}{\gamma} \mathbb{E}\left[ \mathcal{R}\_n' \right]. \tag{A11}$$

Using (A7) and (A11), we obtain (18).
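The variance computation in (A10) admits a closed form for the standard normal model, since $\int p^a\,\mathrm{d}\lambda = (2\pi)^{(1-a)/2}a^{-1/2}$ for $p$ the $N(0,1)$ density. A quick Monte Carlo sketch (illustrative only, not part of the proof) confirms the identity:

```python
import numpy as np

# Closed form of Var[p^gamma(X)] from (A10) for X ~ N(0,1):
# int p^a d lambda = (2*pi)^((1-a)/2) / sqrt(a), hence
# Var = int p^{2*gamma+1} - (int p^{gamma+1})^2
#     = (2*pi)^(-gamma) * (1/sqrt(2*gamma + 1) - 1/(gamma + 1)).
gamma = 0.3
closed_form = (2 * np.pi) ** (-gamma) * (1 / np.sqrt(2 * gamma + 1) - 1 / (gamma + 1))

# Monte Carlo estimate of the same variance.
rng = np.random.default_rng(1)
x = rng.normal(size=500_000)
p = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
mc = np.var(p**gamma)

assert abs(mc - closed_form) / closed_form < 0.05
```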

**Proof of Proposition 5.** For the contaminated model $\widetilde{P}_{\varepsilon x} = (1-\varepsilon)P_{\theta_0} + \varepsilon\delta_x$, it holds that

$$\mathcal{U}(\tilde{P}\_{\varepsilon x}) = \frac{1}{\gamma + 1} \ln \left( \int p\_{T(\tilde{P}\_{\varepsilon x})}^{\gamma + 1} d\lambda \right) - \frac{1}{\gamma} \ln \left( \int p\_{T(\tilde{P}\_{\varepsilon x})}^{\gamma} d\tilde{P}\_{\varepsilon x} \right). \tag{A12}$$

Differentiating with respect to $\varepsilon$ yields

$$\begin{split} \frac{\partial}{\partial \varepsilon}[\mathcal{U}(\widetilde{P}\_{\varepsilon x})]\_{\varepsilon=0} &= \frac{1}{\int p\_{\theta\_0}^{\gamma+1}\,\mathrm{d}\lambda}\cdot \int p\_{\theta\_0}^{\gamma}\,\dot{p}\_{\theta\_0}\,\mathrm{d}\lambda\cdot \mathrm{IF}(x; T, P\_{\theta\_0}) \\ &\quad - \frac{1}{\gamma}\cdot\frac{1}{\int p\_{\theta\_0}^{\gamma+1}\,\mathrm{d}\lambda}\left\{ -\int p\_{\theta\_0}^{\gamma+1}\,\mathrm{d}\lambda + \gamma \int p\_{\theta\_0}^{\gamma}\,\dot{p}\_{\theta\_0}\,\mathrm{d}\lambda\cdot \mathrm{IF}(x; T, P\_{\theta\_0}) + p\_{\theta\_0}^{\gamma}(x) \right\} \\ &= \frac{1}{\gamma}\left[ 1 - \frac{p\_{\theta\_0}^{\gamma}(x)}{\int p\_{\theta\_0}^{\gamma+1}\,\mathrm{d}\lambda} \right], \end{split}$$

where $\dot{p}_{\theta_0} = \left[\frac{\partial}{\partial\theta}p_\theta\right]_{\theta=\theta_0}$; the two terms involving $\mathrm{IF}(x; T, P_{\theta_0})$ cancel.

Thus we obtain

$$\text{IF}(\mathbf{x}; \mathbf{U}, P\_{\theta\_0}) = \frac{1}{\gamma} \left[ 1 - \frac{p\_{\theta\_0}^{\gamma}(\mathbf{x})}{\int p\_{\theta\_0}^{\gamma + 1} \mathbf{d} \lambda} \right]. \tag{A13}$$
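Formula (A13) makes the robustness of the procedure visible: since $p_{\theta_0}^{\gamma}(x)\to 0$ as $\|x\|\to\infty$ for $\gamma>0$, the influence function is bounded, with tail limit $1/\gamma$. A short sketch for the standard normal model (illustrative, using the closed form of the normalizing integral):

```python
import numpy as np

gamma = 0.3
# int p^{gamma+1} d lambda for p the N(0,1) density (closed form).
c = (2 * np.pi) ** (-gamma / 2) / np.sqrt(gamma + 1)

def influence(x):
    """IF(x; U, P_{theta_0}) from (A13) for the standard normal model."""
    p = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
    return (1 - p**gamma / c) / gamma

x = np.linspace(-50, 50, 10_001)
assert np.all(np.abs(influence(x)) <= 1 / gamma + 1e-9)   # bounded in x
assert abs(influence(np.array([50.0]))[0] - 1 / gamma) < 1e-6  # tail limit 1/gamma
```

By contrast, for the log-likelihood (the limit $\gamma \to 0$) the corresponding influence is unbounded, which is the usual source of non-robustness of classical criteria on contaminated samples.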

