*Article* **Model Selection in a Composite Likelihood Framework Based on Density Power Divergence**

#### **Elena Castilla <sup>1,\*</sup>, Nirian Martín <sup>2</sup>, Leandro Pardo <sup>1</sup> and Konstantinos Zografos <sup>3</sup>**


Received: 22 January 2020; Accepted: 25 February 2020; Published: 27 February 2020

**Abstract:** This paper presents a model selection criterion in a composite likelihood framework based on density power divergence measures and on the composite minimum density power divergence estimators, which depend on a tuning parameter *α*. After introducing such a criterion, some asymptotic properties are established. We present a simulation study and two numerical examples in order to illustrate the robustness properties of the introduced model selection criterion.

**Keywords:** composite likelihood; composite minimum density power divergence estimators; model selection

#### **1. Introduction**

Composite likelihood inference is an important approach to deal with real situations involving large data sets or very complex models, in which classical likelihood methods are computationally difficult or even impossible to manage. Composite likelihood methods have been successfully used in many applications concerning, for example, genetics ([1]), generalized linear mixed models ([2]), spatial statistics ([3–5]), frailty models ([6]), multivariate survival analysis ([7,8]), etc.

Let us introduce the problem, adopting here the notation of [9]. Let $\{f(\cdot; \boldsymbol{\theta}),\ \boldsymbol{\theta} \in \Theta \subseteq \mathbb{R}^p,\ p \geq 1\}$ be a parametric identifiable family of distributions for an observation $\boldsymbol{y} = (y_1, \ldots, y_m)^T$, a realization of a random $m$-vector $\boldsymbol{Y}$. In this setting, the composite likelihood function based on $K$ different marginal or conditional distributions has the form

$$\mathcal{CL}(\boldsymbol{\theta}, \boldsymbol{y}) = \prod_{k=1}^{K} \left( f_{A_k}(\boldsymbol{y}_j,\ j \in A_k;\ \boldsymbol{\theta}) \right)^{w_k}$$

and the corresponding composite log-likelihood

$$\log \mathcal{CL}(\boldsymbol{\theta}, \boldsymbol{y}) = \sum\_{k=1}^{K} w\_k \ell\_{A\_k}(\boldsymbol{\theta}, \boldsymbol{y}), \tag{1}$$

with $\ell_{A_k}(\boldsymbol{\theta}, \boldsymbol{y}) = \log f_{A_k}(y_j,\ j \in A_k;\ \boldsymbol{\theta})$, where $\{A_k\}_{k=1}^{K}$ is a family of sets of indices associated either with marginal or conditional distributions involving some $y_j$, $j \in \{1, \ldots, m\}$, and $w_k$, $k = 1, \ldots, K$, are non-negative and known weights. If the weights are all equal, then they can be ignored; in this case, all the statistical procedures give equivalent results. The composite maximum likelihood estimator (CMLE), $\widehat{\boldsymbol{\theta}}_c$, is obtained by maximizing, with respect to $\boldsymbol{\theta} \in \Theta$, the expression (1).
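To fix ideas, the construction (1) can be sketched numerically. The following fragment is our own illustration, not code from the paper: the blocks $A_1 = \{1,2\}$, $A_2 = \{3,4\}$, the equal weights and the parameter values are all hypothetical, and the marginal blocks of a multivariate normal are again normal.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_composite_likelihood(y, blocks, mean, cov, weights=None):
    """Composite log-likelihood (1): weighted sum of the log-densities of the
    marginal blocks A_k of a multivariate normal distribution."""
    if weights is None:
        weights = np.ones(len(blocks))          # equal weights can be ignored
    total = 0.0
    for A_k, w_k in zip(blocks, weights):
        idx = np.array(A_k)
        # marginal of a multivariate normal: pick the sub-mean and sub-covariance
        block = multivariate_normal(mean[idx], cov[np.ix_(idx, idx)])
        total += w_k * block.logpdf(y[idx])
    return total

# Hypothetical example: 4-variate normal split into the blocks {1,2} and {3,4}
rho = 0.1
Sigma = np.array([[1, rho, 2*rho, 2*rho],
                  [rho, 1, 2*rho, 2*rho],
                  [2*rho, 2*rho, 1, rho],
                  [2*rho, 2*rho, rho, 1]])
mu = np.zeros(4)
y = np.array([0.2, -0.1, 0.3, 0.0])
print(log_composite_likelihood(y, [(0, 1), (2, 3)], mu, Sigma))
```

A convenient sanity check: with singleton blocks and a diagonal covariance matrix, the composite log-likelihood coincides with the full log-likelihood.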

The CMLE is consistent and asymptotically normal and, based on it, we can establish hypothesis testing procedures in a similar way to the classical likelihood ratio test, Wald test or Rao's score test. A development of the asymptotic theory of the CMLE, including its application to obtain composite likelihood ratio statistics, Wald-type tests and Rao score tests in the context of composite likelihood, can be seen in [10]. However, it is shown in [11–13] that the CMLE and the derived testing procedures present an important lack of robustness. In this sense, [11–13] derived some new distance-based estimators and tests with good robustness behaviour without an important loss of efficiency. In this paper, we consider the composite minimum density power divergence estimator (CMDPDE), introduced in [12], in order to present a model selection criterion in a composite likelihood framework.

Model selection criteria, which summarize data evidence in favor of a model, are a very well studied subject in the statistical literature, especially in the context of the full likelihood. The construction of such criteria requires a measure of similarity between two models, which are typically described in terms of their distributions. This can be achieved if an unbiased estimator of the expected overall discrepancy is found, which measures the statistical distance between the true, but unknown, model and the entertained model. The model with the smallest value of the criterion is then the most preferable one. The use of divergence measures, in particular the Kullback–Leibler divergence ([14]), to measure this discrepancy is the main idea behind some of the best-known criteria: the Akaike Information Criterion (AIC, [15,16]), the criterion proposed by Takeuchi (TIC, [17]) and other modifications of AIC [18]. The DIC criterion, based on the density power divergence (DPD), was presented in [19] and, recently, [20] presented a local BHHJ power divergence information criterion following [21]. In the context of composite likelihood there are some criteria based on the Kullback–Leibler divergence; see for instance [22–24] and references therein. To the best of our knowledge, only the Kullback–Leibler divergence has been used to develop model selection criteria in a composite likelihood framework. To fill this gap, our interest is now focused on the DPD.

In this paper, we present a new information criterion for model selection in the framework of composite likelihood based on the DPD measure. This divergence measure, introduced and studied in the case of the complete likelihood by [25], has been considered previously in [12,13] in the context of composite likelihood. In these papers, a new estimator, the CMDPDE, was introduced and its robustness in relation to the CMLE, as well as the robustness of some families of test statistics, was studied, but the problem of model selection was not considered. That problem is the subject of this paper. The criterion introduced here will be called the composite likelihood DIC criterion (CLDIC). The motivation for considering a criterion based on the DPD instead of the Kullback–Leibler divergence is the robustness of the procedures based on the DPD in statistical inference, not only in the context of the full likelihood [25,26], but also in the context of composite likelihood [12,13]. In Section 2, the CMDPDE is presented and some properties of this estimator are discussed. The new model selection criterion, CLDIC, based on the CMDPDE, is introduced in Section 3 and some of its asymptotic properties are studied. A simulation study is carried out in Section 4 and some numerical examples are presented in Section 5. Finally, some concluding remarks are given in Section 6.

#### **2. Composite Minimum Density Power Divergence Estimator**

Given two probability density functions *g* and *f* , associated with two *m*-dimensional random variables respectively, the DPD ([25]) measures a statistical distance between *g* and *f* by

$$d_\alpha(g, f) = \int_{\mathbb{R}^m} \left\{ f(\boldsymbol{y})^{1+\alpha} - \left(1 + \frac{1}{\alpha}\right) f(\boldsymbol{y})^{\alpha} g(\boldsymbol{y}) + \frac{1}{\alpha}\, g(\boldsymbol{y})^{1+\alpha} \right\} d\boldsymbol{y}, \tag{2}$$

for *α* > 0, while for *α* = 0 it is defined by

$$d_0(g, f) = \lim_{\alpha \to 0^+} d_\alpha(g, f) = d_{KL}(g, f),$$

where $d_{KL}(g, f)$ is the Kullback–Leibler divergence (see, for example, [26]). For $\alpha = 1$, the expression (2) leads to the $L_2$ distance, $L_2(g, f) = \int_{\mathbb{R}^m} (f(\boldsymbol{y}) - g(\boldsymbol{y}))^2\, d\boldsymbol{y}$. It is also interesting to note that (2) is a special case of the so-called Bregman divergence

$$\int_{\mathbb{R}^m} \left[ T(g(\boldsymbol{y})) - T(f(\boldsymbol{y})) - \left\{ g(\boldsymbol{y}) - f(\boldsymbol{y}) \right\} T'(f(\boldsymbol{y})) \right] d\boldsymbol{y}. \tag{3}$$

If we consider $T(l) = \frac{1}{\alpha}\, l^{1+\alpha}$ in (3), we get $d_\alpha(g, f)$. The parameter $\alpha$ controls the trade-off between robustness and asymptotic efficiency of the parameter estimates which are the minimizers of this family of divergences. For more details about this family of divergence measures we refer to [27].
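For concreteness, the divergence (2) and its two special cases can be checked numerically. The sketch below is our own illustration (the two univariate normal densities are hypothetical): it evaluates $d_\alpha(g, f)$ by quadrature, recovering the Kullback–Leibler divergence as $\alpha \to 0^+$ and the $L_2$ distance at $\alpha = 1$.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def dpd(g, f, alpha, lo=-15, hi=15):
    """Density power divergence (2) between two univariate densities g and f,
    evaluated by numerical quadrature for alpha > 0."""
    integrand = lambda y: (f(y)**(1 + alpha)
                           - (1 + 1/alpha) * f(y)**alpha * g(y)
                           + (1/alpha) * g(y)**(1 + alpha))
    return quad(integrand, lo, hi)[0]

g = norm(0, 1).pdf           # hypothetical "true" density
f = norm(0.5, 1).pdf         # hypothetical model density
print(dpd(g, f, 0.5))        # positive, since f differs from g
print(dpd(g, g, 0.5))        # the divergence of a density from itself vanishes
```

For this pair of densities, $d_\alpha(g, f)$ approaches $d_{KL}(g, f) = (0.5)^2/2 = 0.125$ as $\alpha \to 0^+$, while $d_1(g, f)$ equals the squared $L_2$ distance between the two normal densities.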

Let now $\boldsymbol{Y}_1, \ldots, \boldsymbol{Y}_n$ be independent and identically distributed replications of $\boldsymbol{Y}$, which is characterized by the true but unknown distribution $g$. Taking into account that the true model $g$ is unknown, suppose that $\Xi = \{f(\cdot; \boldsymbol{\theta}),\ \boldsymbol{\theta} \in \Theta \subseteq \mathbb{R}^p,\ p \geq 1\}$ is a parametric identifiable family of candidate distributions to describe the observations $\boldsymbol{y}_1, \ldots, \boldsymbol{y}_n$. Then, the DPD between the true model $g$ and the composite likelihood function $\mathcal{CL}(\boldsymbol{\theta}, \cdot)$, associated with the parametric model $f(\cdot; \boldsymbol{\theta})$, is defined as

$$d_\alpha(g(\cdot), \mathcal{CL}(\boldsymbol{\theta}, \cdot)) = \int_{\mathbb{R}^m} \left\{ \mathcal{CL}(\boldsymbol{\theta}, \boldsymbol{y})^{1+\alpha} - \left(1 + \frac{1}{\alpha}\right) \mathcal{CL}(\boldsymbol{\theta}, \boldsymbol{y})^{\alpha} g(\boldsymbol{y}) + \frac{1}{\alpha}\, g(\boldsymbol{y})^{1+\alpha} \right\} d\boldsymbol{y}, \tag{4}$$

for $\alpha > 0$, while for $\alpha = 0$ we have $d_{KL}(g(\cdot), \mathcal{CL}(\boldsymbol{\theta}, \cdot))$, which is defined by

$$d_{KL}(g(\cdot), \mathcal{CL}(\boldsymbol{\theta}, \cdot)) = \int_{\mathbb{R}^m} g(\boldsymbol{y}) \log \frac{g(\boldsymbol{y})}{\mathcal{CL}(\boldsymbol{\theta}, \boldsymbol{y})}\, d\boldsymbol{y}. \tag{5}$$

In Section 3, we are going to introduce and study the CLDIC criterion based on (4).

Let

$$\{M\_k\}\_{k \in \{1, \ldots, \ell\}}\tag{6}$$

be a family of candidate models to govern the observations $\boldsymbol{Y}_1, \ldots, \boldsymbol{Y}_n$. We shall assume that the true model is included in $\{M_k\}_{k \in \{1, \ldots, \ell\}}$. For a specific $k = 1, \ldots, \ell$, the parametric model $M_k$ is described by the composite likelihood function

$$\mathcal{CL}(\boldsymbol{\theta}, \cdot), \quad \boldsymbol{\theta} \in \Theta_k \subseteq \mathbb{R}^{p_k}.$$

In this setting, it is quite clear that the most suitable candidate model to describe the observations is the model that minimizes the DPD in (4). However, the unknown parameter $\boldsymbol{\theta}$ is included in it, so this measure cannot be used directly for the choice of the most suitable model. A way to overcome this problem is to plug into (4), in place of the unknown parameter $\boldsymbol{\theta}$, an estimator which obeys some desirable properties, like consistency and asymptotic normality. On this basis, the CMDPDE, introduced in [12], can be used. This estimator is described in the sequel for the sake of completeness.

If we denote the kernel of (4) as

$$W_\alpha(\boldsymbol{\theta}) = \int_{\mathbb{R}^m} \mathcal{CL}(\boldsymbol{\theta}, \boldsymbol{y})^{1+\alpha}\, d\boldsymbol{y} - \left(1 + \frac{1}{\alpha}\right) \int_{\mathbb{R}^m} \mathcal{CL}(\boldsymbol{\theta}, \boldsymbol{y})^{\alpha} g(\boldsymbol{y})\, d\boldsymbol{y}, \tag{7}$$

we can write

$$d_\alpha(g(\cdot), \mathcal{CL}(\boldsymbol{\theta}, \cdot)) = W_\alpha(\boldsymbol{\theta}) + \frac{1}{\alpha} \int_{\mathbb{R}^m} g(\boldsymbol{y})^{1+\alpha}\, d\boldsymbol{y}$$

and the term $\frac{1}{\alpha} \int_{\mathbb{R}^m} g(\boldsymbol{y})^{1+\alpha}\, d\boldsymbol{y}$ does not depend on $\boldsymbol{\theta}$, so it can be ignored in the minimization with respect to $\boldsymbol{\theta}$. A natural estimator of $W_\alpha(\boldsymbol{\theta})$, given in (7), can be obtained by observing that the last integral in (7) can be expressed in the form $\int_{\mathbb{R}^m} \mathcal{CL}(\boldsymbol{\theta}, \boldsymbol{y})^{\alpha}\, dG(\boldsymbol{y})$, with $G$ the distribution function corresponding to $g$. Hence, if the empirical distribution function of $\boldsymbol{Y}_1, \ldots, \boldsymbol{Y}_n$ is exploited, this last integral is approximated by $\frac{1}{n} \sum_{i=1}^{n} \mathcal{CL}(\boldsymbol{\theta}, \boldsymbol{Y}_i)^{\alpha}$, i.e.,

$$W_{n,\alpha}(\boldsymbol{\theta}) = \int_{\mathbb{R}^m} \mathcal{CL}(\boldsymbol{\theta}, \boldsymbol{y})^{1+\alpha}\, d\boldsymbol{y} - \left(1 + \frac{1}{\alpha}\right) \frac{1}{n} \sum_{i=1}^{n} \mathcal{CL}(\boldsymbol{\theta}, \boldsymbol{Y}_i)^{\alpha}. \tag{8}$$

**Definition 1.** *The CMDPDE of $\boldsymbol{\theta}$, $\widehat{\boldsymbol{\theta}}_c^\alpha$, is defined, for $\alpha > 0$, by*

$$\widehat{\boldsymbol{\theta}}_c^\alpha = \arg\min_{\boldsymbol{\theta} \in \Theta} W_{n,\alpha}(\boldsymbol{\theta}). \tag{9}$$
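As a minimal sketch of Definition 1 (ours, not from the paper), consider the hypothetical univariate model $N(\theta, 1)$, with the full likelihood playing the role of the composite likelihood (a single block). The integral term of (8) is then free of $\theta$ and the CMDPDE can be obtained by direct numerical minimization; the example also hints at the robustness of the estimator under an assumed contamination.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def W_n_alpha(theta, y, alpha):
    """Empirical objective (8) for the N(theta, 1) model, the full likelihood
    acting as the composite likelihood; here the first integral in (8),
    (2*pi)^(-alpha/2) * (1+alpha)^(-1/2), does not depend on theta."""
    int_term = (2*np.pi)**(-alpha/2) / np.sqrt(1 + alpha)
    dens = np.exp(-0.5*(y - theta)**2) / np.sqrt(2*np.pi)
    return int_term - (1 + 1/alpha) * np.mean(dens**alpha)

def cmdpde(y, alpha):
    # search window around the bulk of the data keeps the sketch simple
    return minimize_scalar(W_n_alpha, args=(y, alpha),
                           bounds=(-5, 5), method="bounded").x

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, 200)
y[:20] = 10.0                      # hypothetical 10% gross contamination
print(np.mean(y))                  # MLE of theta, dragged towards the outliers
print(cmdpde(y, alpha=0.5))        # CMDPDE, stays near the true value 0
```

The contaminated observations contribute with exponentially small weight to the density power term, which is what yields the robustness of the CMDPDE relative to the MLE.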

We shall denote the score of the composite likelihood by

$$
\boldsymbol{u}(\boldsymbol{\theta}, \boldsymbol{y}) = \frac{\partial \log \mathcal{CL}(\boldsymbol{\theta}, \boldsymbol{y})}{\partial \boldsymbol{\theta}}. \tag{10}
$$

Let $\boldsymbol{\theta}_0$ be the true value of the parameter $\boldsymbol{\theta}$. In [12], it was shown that the asymptotic distribution of $\widehat{\boldsymbol{\theta}}_c^\alpha$ is given by

$$\sqrt{n}\left(\widehat{\boldsymbol{\theta}}_c^\alpha - \boldsymbol{\theta}_0\right) \xrightarrow[n \to \infty]{\mathcal{L}} \mathcal{N}\left(\boldsymbol{0}_p,\ \boldsymbol{H}_\alpha(\boldsymbol{\theta}_0)^{-1} \boldsymbol{J}_\alpha(\boldsymbol{\theta}_0) \boldsymbol{H}_\alpha(\boldsymbol{\theta}_0)^{-1}\right),$$

being

$$\boldsymbol{H}_\alpha(\boldsymbol{\theta}) = \int_{\mathbb{R}^m} \mathcal{CL}(\boldsymbol{\theta}, \boldsymbol{y})^{1+\alpha}\, \boldsymbol{u}(\boldsymbol{\theta}, \boldsymbol{y}) \boldsymbol{u}(\boldsymbol{\theta}, \boldsymbol{y})^T\, d\boldsymbol{y} \tag{11}$$

and

$$\begin{split} \boldsymbol{J}_\alpha(\boldsymbol{\theta}) &= \int_{\mathbb{R}^m} \mathcal{CL}(\boldsymbol{\theta}, \boldsymbol{y})^{2\alpha+1}\, \boldsymbol{u}(\boldsymbol{\theta}, \boldsymbol{y}) \boldsymbol{u}(\boldsymbol{\theta}, \boldsymbol{y})^T\, d\boldsymbol{y} \\ &\quad - \int_{\mathbb{R}^m} \mathcal{CL}(\boldsymbol{\theta}, \boldsymbol{y})^{1+\alpha}\, \boldsymbol{u}(\boldsymbol{\theta}, \boldsymbol{y})\, d\boldsymbol{y} \int_{\mathbb{R}^m} \boldsymbol{u}(\boldsymbol{\theta}, \boldsymbol{y})^T\, \mathcal{CL}(\boldsymbol{\theta}, \boldsymbol{y})^{1+\alpha}\, d\boldsymbol{y}. \end{split} \tag{12}$$

**Remark 1.** *For α* = 0 *we get the CMLE of θ*

$$\widehat{\boldsymbol{\theta}}_c = \arg\min_{\boldsymbol{\theta} \in \Theta} \left( -\frac{1}{n} \sum_{i=1}^{n} \log \mathcal{CL}(\boldsymbol{\theta}, \boldsymbol{y}_i) \right). \tag{13}$$

*At the same time it is well-known that*

$$\sqrt{n}\left(\widehat{\boldsymbol{\theta}}_c - \boldsymbol{\theta}\right) \xrightarrow[n \to \infty]{\mathcal{L}} \mathcal{N}\left(\boldsymbol{0}_p,\ \boldsymbol{G}_*(\boldsymbol{\theta})^{-1}\right),$$

*where $\boldsymbol{G}_*(\boldsymbol{\theta})$ denotes the Godambe information matrix, defined by $\boldsymbol{G}_*(\boldsymbol{\theta}) = \boldsymbol{H}(\boldsymbol{\theta}) \boldsymbol{J}(\boldsymbol{\theta})^{-1} \boldsymbol{H}(\boldsymbol{\theta})$, with $\boldsymbol{H}(\boldsymbol{\theta})$ being the sensitivity or Hessian matrix and $\boldsymbol{J}(\boldsymbol{\theta})$ being the variability matrix, defined, respectively, by*

$$\boldsymbol{H}(\boldsymbol{\theta}) = E_{\boldsymbol{\theta}}\left[ -\frac{\partial}{\partial \boldsymbol{\theta}}\, \boldsymbol{u}(\boldsymbol{\theta}, \boldsymbol{Y})^T \right], \quad \boldsymbol{J}(\boldsymbol{\theta}) = E_{\boldsymbol{\theta}}\left[ \boldsymbol{u}(\boldsymbol{\theta}, \boldsymbol{Y}) \boldsymbol{u}(\boldsymbol{\theta}, \boldsymbol{Y})^T \right].$$

#### **3. A New Model Selection Criterion**

In order to describe the CLDIC criterion, we consider the model $M_k$ given in (6). Following standard methodology (cf. [28], p. 240), the most suitable candidate model to describe the data $\boldsymbol{Y}_1, \ldots, \boldsymbol{Y}_n$ is the model that minimizes the expected estimated DPD

$$E_{\boldsymbol{Y}_1, \ldots, \boldsymbol{Y}_n}\left[ d_\alpha\left(g(\cdot), \mathcal{CL}(\widehat{\boldsymbol{\theta}}_c^\alpha, \cdot)\right) \right], \tag{14}$$

subject to the assumption that the unknown model $g$ belongs to $\Xi$, i.e., the true model is included in $\{M_s\}_{s \in \{1, \ldots, \ell\}}$, and taking into account that $\widehat{\boldsymbol{\theta}}_c^\alpha$, defined in (9), is a consistent and asymptotically normally distributed estimator of $\boldsymbol{\theta}$. However, this expected value still depends on the unknown parameter $\boldsymbol{\theta}$. So, as a criterion, an asymptotically unbiased estimator of (14) should be used, for $g \in \Xi$.

The most appropriate model to select is the model which minimizes the expected value

$$E_{\boldsymbol{Y}_1, \ldots, \boldsymbol{Y}_n}\left[ W_\alpha\left(\widehat{\boldsymbol{\theta}}_c^\alpha\right) \right].$$

This expected value still depends on the unknown parameter $\boldsymbol{\theta}$. So, an asymptotically unbiased estimator of the above expected value could be the basis of a selection criterion, for $g \in \Xi$. The empirical version of $W_\alpha(\boldsymbol{\theta})$ in (7) is $W_{n,\alpha}(\boldsymbol{\theta})$, given in (8), and it plays a central role in the development of the model selection criterion on the basis of the next theorem, which expresses the expected value $E_{\boldsymbol{Y}_1, \ldots, \boldsymbol{Y}_n}\left[ W_\alpha\left(\widehat{\boldsymbol{\theta}}_c^\alpha\right) \right]$ by means of the respective expected value of $W_{n,\alpha}(\widehat{\boldsymbol{\theta}}_c^\alpha)$, in an asymptotically equivalent way.

**Theorem 1.** *If the true distribution g belongs to the parametric family* Ξ *and θ*<sup>0</sup> *denotes the true value of the parameter θ, then we have*

$$E_{\boldsymbol{Y}_1, \ldots, \boldsymbol{Y}_n}\left[ W_\alpha\left(\widehat{\boldsymbol{\theta}}_c^\alpha\right) \right] = E_{\boldsymbol{Y}_1, \ldots, \boldsymbol{Y}_n}\left[ W_{n,\alpha}\left(\widehat{\boldsymbol{\theta}}_c^\alpha\right) + \frac{\alpha+1}{n}\, \mathrm{trace}\left( \boldsymbol{J}_\alpha(\boldsymbol{\theta}_0) \boldsymbol{H}_\alpha(\boldsymbol{\theta}_0)^{-1} \right) \right] + o_p(1)$$

*with H<sup>α</sup>* (*θ*) *and J<sup>α</sup>* (*θ*) *given in (11) and (12), respectively.*

Based on the above theorem, whose proof is presented in full detail in Appendix A, an asymptotically unbiased estimator of $E_{\boldsymbol{Y}_1, \ldots, \boldsymbol{Y}_n}\left[ W_\alpha(\widehat{\boldsymbol{\theta}}_c^\alpha) \right]$ is given by

$$W_{n,\alpha}(\widehat{\boldsymbol{\theta}}_c^\alpha) + \frac{\alpha+1}{n}\, \mathrm{trace}\left( \boldsymbol{J}_\alpha(\widehat{\boldsymbol{\theta}}_c^\alpha) \boldsymbol{H}_\alpha(\widehat{\boldsymbol{\theta}}_c^\alpha)^{-1} \right).$$

This result is the basis and a strong motivation for the next definition, which introduces the model selection criterion.

**Definition 2.** *Let $\{M_k\}_{k \in \{1, \ldots, \ell\}}$ be candidate models for the observations $\boldsymbol{Y}_1, \ldots, \boldsymbol{Y}_n$. The selected model $M^*$ verifies*

$$M^* = \arg\min_{k \in \{1, \ldots, \ell\}} \mathrm{CLDIC}_\alpha\left(M_k\right),$$

*where*

$$\mathrm{CLDIC}_\alpha\left(M_k\right) = W_{n,\alpha}(\widehat{\boldsymbol{\theta}}_c^\alpha) + \frac{\alpha+1}{n}\, \mathrm{trace}\left( \boldsymbol{J}_\alpha(\widehat{\boldsymbol{\theta}}_c^\alpha) \boldsymbol{H}_\alpha(\widehat{\boldsymbol{\theta}}_c^\alpha)^{-1} \right),$$

*where $W_{n,\alpha}(\boldsymbol{\theta})$ was given in (8) and $\boldsymbol{J}_\alpha(\boldsymbol{\theta})$ and $\boldsymbol{H}_\alpha(\boldsymbol{\theta})$ were defined in (11) and (12), respectively.*
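As a toy illustration of the criterion (our own sketch under simplifying assumptions, not part of the paper): for the hypothetical univariate model $N(\theta, \sigma^2)$ with known $\sigma$, taking the full likelihood as the composite likelihood, $H_\alpha$ and $J_\alpha$ reduce to closed-form scalars, so $\mathrm{CLDIC}_\alpha$ can be computed directly and compared across two candidate scales.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def cldic_normal(y, alpha, sigma):
    """CLDIC_alpha for the location model N(theta, sigma^2) with known sigma,
    the full likelihood acting as the composite likelihood; H_alpha and
    J_alpha of (11)-(12) are scalars in closed form for this model."""
    n = len(y)
    int_term = (2*np.pi*sigma**2)**(-alpha/2) / np.sqrt(1 + alpha)

    def W(theta):  # empirical objective W_{n,alpha}(theta) of (8)
        dens = np.exp(-0.5*((y - theta)/sigma)**2) / np.sqrt(2*np.pi*sigma**2)
        return int_term - (1 + 1/alpha) * np.mean(dens**alpha)

    theta_hat = minimize_scalar(W, bounds=(-10, 10), method="bounded").x
    H = (2*np.pi*sigma**2)**(-alpha/2) * (1 + alpha)**(-1.5) / sigma**2
    J = (2*np.pi*sigma**2)**(-alpha) * (1 + 2*alpha)**(-1.5) / sigma**2
    return W(theta_hat) + (alpha + 1)/n * (J/H)  # the trace term is scalar here

rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, 500)                  # data from N(0, 1)
# the criterion prefers the candidate with the smaller CLDIC value:
print(cldic_normal(y, alpha=0.3, sigma=1.0))   # correctly scaled candidate
print(cldic_normal(y, alpha=0.3, sigma=3.0))   # misspecified scale
```

Under this simplification the trace in Definition 2 is just the scalar $J_\alpha/H_\alpha$, and the correctly scaled candidate attains the smaller criterion value.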

The next remark summarizes the model selection criterion in the case $\alpha = 0$; it therefore extends, in a sense, the pioneering and classic AIC.

**Remark 2.** *For α* = 0 *we have,*

$$d_{KL}(g(\cdot), \mathcal{CL}(\boldsymbol{\theta}, \cdot)) = W_0(\boldsymbol{\theta}) + \int_{\mathbb{R}^m} g(\boldsymbol{y}) \log g(\boldsymbol{y})\, d\boldsymbol{y}$$

*with $W_0(\boldsymbol{\theta}) = -\int_{\mathbb{R}^m} \log \mathcal{CL}(\boldsymbol{\theta}, \boldsymbol{y})\, g(\boldsymbol{y})\, d\boldsymbol{y}$. Therefore, the most appropriate model, which should be selected, is the model which minimizes the expected value*

$$E\_{Y\_1,\ldots,Y\_n} \left[ \mathcal{W}\_0(\widehat{\boldsymbol{\theta}}\_c) \right] \tag{15}$$

*where* b*θ<sup>c</sup> is the CMLE of θ*<sup>0</sup> *defined in (9).*

*The expected value (15) still depends on the unknown parameter $\boldsymbol{\theta}$. A natural estimator of $W_0(\widehat{\boldsymbol{\theta}}_c)$ can be obtained by replacing the distribution function $G$ of $g$ by the empirical distribution function based on $\boldsymbol{Y}_1, \ldots, \boldsymbol{Y}_n$,*

$$\mathcal{W}\_{n,0}(\boldsymbol{\theta}) = -\frac{1}{n} \sum\_{i=1}^{n} \log \mathcal{CL}(\boldsymbol{\theta}, \boldsymbol{y}\_i).$$

*Based on it, we select the model M*∗ *that verifies*

$$M^* = \arg\min_{k \in \{1, \ldots, \ell\}} \mathrm{CLDIC}_0\left(M_k\right),$$

*with*

$$\mathrm{CLDIC}_0(M_k) = W_{n,0}(\widehat{\boldsymbol{\theta}}_c) + \frac{1}{n}\, \mathrm{trace}\left( \boldsymbol{J}(\widehat{\boldsymbol{\theta}}_c) \boldsymbol{H}(\widehat{\boldsymbol{\theta}}_c)^{-1} \right),$$

*where $\boldsymbol{J}(\widehat{\boldsymbol{\theta}}_c)$ and $\boldsymbol{H}(\widehat{\boldsymbol{\theta}}_c)$ are defined in Remark 1. In a manner quite similar to that of the previous theorem, it can be established that $\mathrm{CLDIC}_0(M_k)$ is an asymptotically unbiased estimator of $E_{\boldsymbol{Y}_1, \ldots, \boldsymbol{Y}_n}\left[ W_0(\widehat{\boldsymbol{\theta}}_c) \right]$.*

*This is the model selection criterion in a composite likelihood framework based on the Kullback–Leibler divergence. We can observe that this criterion coincides with the criterion given in [22] as a generalization of the classical criterion of Akaike, which will be referred to from now on as the Composite Akaike Information Criterion (CAIC).*

#### **4. Numerical Simulations**

#### *4.1. Scenario 1: Two-Component Mixed Model*

We start with a simulation example, which is motivated by and follows ideas from the paper [29] and Example 4.1 in [20], in order to compare the behaviour of the proposed criterion with the CAIC criterion, obtained for *α* = 0 (see Remark 2).

Consider the random vector $\boldsymbol{Y} = (Y_1, Y_2, Y_3, Y_4)^T$ from an unknown density $g$ and let $\boldsymbol{Y}_1, \ldots, \boldsymbol{Y}_n$ be independent and identically distributed replications of $\boldsymbol{Y}$, which is described by the true but unknown distribution $g$. Taking into account that the true model $g$ is unknown, suppose that $\{f(\cdot; \boldsymbol{\theta}),\ \boldsymbol{\theta} \in \Theta \subseteq \mathbb{R}^p,\ p \geq 1\}$ is a parametric identifiable family of candidate distributions to describe the observations $\boldsymbol{y}_1, \ldots, \boldsymbol{y}_n$. Let also $\mathcal{CL}(\boldsymbol{\theta}, \boldsymbol{y})$ denote the composite likelihood function associated with the parametric model $f(\cdot; \boldsymbol{\theta})$.

We consider the problem of choosing (on the basis of $n$ independent and identically distributed replications $\boldsymbol{y}_1, \ldots, \boldsymbol{y}_n$ of $\boldsymbol{Y} = (Y_1, Y_2, Y_3, Y_4)^T$) between a 4-variate normal distribution, $\mathcal{N}(\boldsymbol{\mu}^N, \boldsymbol{\Sigma})$, with $\boldsymbol{\mu}^N = (\mu_1^N, \mu_2^N, \mu_3^N, \mu_4^N)^T$ and

$$
\boldsymbol{\Sigma} = \begin{pmatrix} 1 & \rho & 2\rho & 2\rho \\ \rho & 1 & 2\rho & 2\rho \\ 2\rho & 2\rho & 1 & \rho \\ 2\rho & 2\rho & \rho & 1 \end{pmatrix},
$$

and a 4-variate $t$-distribution with $\nu$ degrees of freedom, $t_\nu(\boldsymbol{\mu}^{t_\nu}, \boldsymbol{\Sigma}^*)$, with different location parameters $\boldsymbol{\mu}^{t_\nu} = (\mu_1^{t_\nu}, \mu_2^{t_\nu}, \mu_3^{t_\nu}, \mu_4^{t_\nu})^T$, the same variance-covariance matrix $\boldsymbol{\Sigma}$, and density

$$C_m\, |\boldsymbol{\Sigma}^*|^{-1/2}\left[1 + \frac{1}{\nu}(\boldsymbol{y} - \boldsymbol{\mu}^{t_\nu})^T (\boldsymbol{\Sigma}^*)^{-1} (\boldsymbol{y} - \boldsymbol{\mu}^{t_\nu})\right]^{-(\nu+m)/2},$$

with $\boldsymbol{\Sigma}^* = \frac{\nu-2}{\nu}\, \boldsymbol{\Sigma}$, $C_m = (\pi\nu)^{-m/2}\, \frac{\Gamma[(\nu+m)/2]}{\Gamma(\nu/2)}$ and $m = 4$.
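As a numerical sanity check of this density (a sketch of ours; the parameter values are hypothetical), one may evaluate the formula directly and compare it with `scipy.stats.multivariate_t`, using the shape matrix $\boldsymbol{\Sigma}^* = \frac{\nu-2}{\nu}\boldsymbol{\Sigma}$:

```python
import numpy as np
from scipy.stats import multivariate_t
from scipy.special import gammaln

def t_density(y, mu, Sigma, nu):
    """Density displayed above, with shape matrix Sigma* = ((nu-2)/nu) * Sigma,
    so that Sigma is the variance-covariance matrix (requires nu > 2)."""
    m = len(mu)
    S = (nu - 2)/nu * Sigma
    Cm = (np.pi*nu)**(-m/2) * np.exp(gammaln((nu + m)/2) - gammaln(nu/2))
    d = y - mu
    q = d @ np.linalg.solve(S, d)
    return Cm * np.linalg.det(S)**(-0.5) * (1 + q/nu)**(-(nu + m)/2)

rho = 0.1
Sigma = np.array([[1, rho, 2*rho, 2*rho],
                  [rho, 1, 2*rho, 2*rho],
                  [2*rho, 2*rho, 1, rho],
                  [2*rho, 2*rho, rho, 1]])
mu = np.array([3.2, 1.5, 0.5, 2.0])
y = np.array([3.0, 1.0, 0.5, 2.5])
print(t_density(y, mu, Sigma, nu=5))
print(multivariate_t(mu, (5 - 2)/5*Sigma, df=5).pdf(y))  # same value from SciPy
```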

Consider the composite likelihood function,

$$\mathcal{CL}_N(\rho, \boldsymbol{y}) = f_{A_1}^N(\boldsymbol{y}; \rho)\, f_{A_2}^N(\boldsymbol{y}; \rho),$$

with $f_{A_1}^N(\boldsymbol{y}; \rho) = f_{12}^N(y_1, y_2; \mu_1^N, \mu_2^N; \rho)$ and $f_{A_2}^N(\boldsymbol{y}; \rho) = f_{34}^N(y_3, y_4; \mu_3^N, \mu_4^N; \rho)$, where $f_{12}^N$ and $f_{34}^N$ are the densities of the bivariate marginals of $\boldsymbol{Y}$, i.e., bivariate normal distributions with mean vectors $(\mu_1^N, \mu_2^N)^T$ and $(\mu_3^N, \mu_4^N)^T$, respectively, and common variance-covariance matrix

$$
\Sigma_0 = \left(\begin{array}{cc} 1 & \rho \\ \rho & 1 \end{array}\right).
$$

In a similar manner consider the composite likelihood

$$\mathcal{CL}_{t_\nu}(\rho, \boldsymbol{y}) = f_{A_1}^{t_\nu}(\boldsymbol{y}; \rho)\, f_{A_2}^{t_\nu}(\boldsymbol{y}; \rho),$$

with $f_{A_1}^{t_\nu}(\boldsymbol{y}; \rho) = f_{12}^{t_\nu}(y_1, y_2; \mu_1^{t_\nu}, \mu_2^{t_\nu}; \rho)$ and $f_{A_2}^{t_\nu}(\boldsymbol{y}; \rho) = f_{34}^{t_\nu}(y_3, y_4; \mu_3^{t_\nu}, \mu_4^{t_\nu}; \rho)$, where $f_{12}^{t_\nu}$ and $f_{34}^{t_\nu}$ are the densities of the bivariate marginals of $\boldsymbol{Y}$, i.e., bivariate $t$-distributions with mean vectors $(\mu_1^{t_\nu}, \mu_2^{t_\nu})^T$ and $(\mu_3^{t_\nu}, \mu_4^{t_\nu})^T$, respectively, and common variance-covariance matrix

$$
\Sigma\_0 = \left(\begin{array}{cc} 1 & \rho \\ \rho & 1 \end{array}\right).
$$

Under this formulation, the simulation study proceeds under the following two scenarios.

#### 4.1.1. Scenario 1a

Following Example 4.1 in [20], the steps of the simulation study are the following:

• Generate 1000 samples of size *n* = 5, 7, 10, 20, 40, 50, 70, 100 from a two-component mixture of two 4-variate distributions, namely a 4-variate normal and a 4-variate *t*-distribution,

$$h\_{\omega}(\boldsymbol{y}) = \omega \mathcal{N}\left(\boldsymbol{\mu}^{\rm N}, \boldsymbol{\Sigma}\right) + (1 - \omega)t\_{\boldsymbol{\nu}}\left(\boldsymbol{\mu}^{t\_{\boldsymbol{\nu}}}, \boldsymbol{\Sigma}^\*\right), \quad 0 \le \omega \le 1,$$

with $\boldsymbol{\mu}^N = (0, 0, 0.5, 0)^T$ and $\boldsymbol{\mu}^{t_\nu} = (3.2, 1.5, 0.5, 2)^T$, for $\omega = 0, 0.25, 0.45, 0.5, 0.55, 0.75, 1$, $\nu = 5, 10, 30$ degrees of freedom and with specific values of $\rho = -0.15, -0.10, 0.10$. As pointed out in [29], taking into account that $\boldsymbol{\Sigma}$ should be positive semi-definite, the following condition is imposed: $-\frac{1}{5} \leq \rho \leq \frac{1}{3}$.

• Estimate the common parameter $\rho$, separately in each model, by using the CMDPDE for different values of the tuning parameter $\alpha = 0, 0.3$. The composite density which corresponds to the mixture $h_\omega(\boldsymbol{y})$ is defined by

$$\mathcal{CL}(\rho, \boldsymbol{y}) = \omega\, \mathcal{CL}_N(\rho, \boldsymbol{y}) + (1 - \omega)\, \mathcal{CL}_{t_\nu}(\rho, \boldsymbol{y}), \quad 0 \leq \omega \leq 1,$$

and it is used to obtain the CMDPDE, $\widehat{\rho}$, of $\rho$.

• Define the mixture composite likelihood function

$$\mathcal{CL}(\widehat{\rho}, \boldsymbol{y}) = \omega\, \mathcal{CL}_N(\widehat{\rho}, \boldsymbol{y}) + (1 - \omega)\, \mathcal{CL}_{t_\nu}(\widehat{\rho}, \boldsymbol{y}), \quad 0 \leq \omega \leq 1.$$

• Calculate *CLDIC<sup>α</sup>* (*Mk*), the value of the model selection criterion considered in this paper, for the two candidate models, with

$$\mathrm{CLDIC}_\alpha\left(M_k\right) = W_{n,\alpha}\left(\widehat{\rho}\right) + \frac{\alpha+1}{n}\, \mathrm{trace}\left( \boldsymbol{J}_\alpha\left(\widehat{\rho}\right) \boldsymbol{H}_\alpha\left(\widehat{\rho}\right)^{-1} \right).$$

An explanation of how to obtain this value for both candidate models is given in Appendix B.

• Compute the number of times that the 4-variate normal model was selected.
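The steps above can be sketched in a drastically simplified univariate analogue, which is entirely our own illustration: the candidates are $N(\theta, 1)$ versus a Student $t_3$ location family, only 50 replications are used, the full likelihood acts as the composite likelihood, and the scalar $H_\alpha$, $J_\alpha$ are obtained by quadrature.

```python
import numpy as np
from scipy.stats import norm, t
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

def cldic_location(y, alpha, pdf, lo=-30, hi=30):
    """CLDIC_alpha for a univariate location family f(y; theta) = pdf(y - theta),
    the full likelihood acting as the composite likelihood."""
    n = len(y)
    int_f = quad(lambda u: pdf(u)**(1 + alpha), lo, hi)[0]  # theta-free integral

    def W(theta):                                           # objective (8)
        return int_f - (1 + 1/alpha) * np.mean(pdf(y - theta)**alpha)

    th = minimize_scalar(W, bounds=(-10, 10), method="bounded").x
    eps = 1e-5                                  # numerical score u(theta, y)
    score = lambda u: (np.log(pdf(u - eps)) - np.log(pdf(u + eps))) / (2*eps)
    H = quad(lambda u: pdf(u)**(1 + alpha) * score(u)**2, lo, hi)[0]
    b = quad(lambda u: pdf(u)**(1 + alpha) * score(u), lo, hi)[0]
    J = quad(lambda u: pdf(u)**(2*alpha + 1) * score(u)**2, lo, hi)[0] - b**2
    return W(th) + (alpha + 1)/n * (J / H)

# Selection frequencies over replications, in the spirit of the steps above
rng = np.random.default_rng(2)
wins_normal = 0
for _ in range(50):
    y = rng.standard_normal(100)               # true model: N(0, 1)
    if cldic_location(y, 0.3, norm.pdf) < cldic_location(y, 0.3, t(3).pdf):
        wins_normal += 1
print(wins_normal, "out of 50 replications select the normal model")
```

The counting of the replications in which the true model attains the smaller criterion value mirrors the last step of the simulation design, on a much smaller scale.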

Results are summarized in Table 1. The extreme values $\omega = 0, 1$ represent the number of times that the 4-variate normal model was selected under the 4-variate $t$-distribution and the 4-variate normal distribution, respectively. This means that, for $\omega = 1$, perfect discrimination is achieved when 1000 of the 1000 simulated samples are correctly assigned, while for $\omega = 0$, the closer to 0, the better the discrimination of the criterion. For $\omega = 0.5$, each sample was generated from the normal and the $t$-distribution in the same proportion.


**Table 1.** Main results, Scenario 1a.

#### 4.1.2. Scenario 1b

The same scenario is evaluated under the closer means $\boldsymbol{\mu}^N = (0, 1.5, 0.5, -0.75)^T$ and $\boldsymbol{\mu}^{t_\nu} = (0, 1.5, 0.5, 2)^T$ for moderate-to-large sample sizes and $\alpha \in \{0, 0.2, 0.4\}$. Here $\nu = 5$ and $\rho = -0.15$. Results are shown in Table 2. In this case, the models under consideration are more similar, so it would be understandable if the CLDIC criterion did not discriminate as well.


**Table 2.** Main results, Scenario 1b.

#### *4.2. Scenario 2: Three-Component Mixed Model*

Now, we consider a mixed model composed of two 4-variate normal distributions and a 4-variate $t$-distribution with $\nu = 10$ degrees of freedom. The three distributions have a common variance-covariance matrix, as in the previous scenario, with unknown $\rho = -0.15$, and different but known means $\boldsymbol{\mu}_1^N = (0, 0, 0.5, 0)^T$, $\boldsymbol{\mu}_2^N = (0, 1.5, 0.5, 0)^T$ and $\boldsymbol{\mu}^t = (0, 1.5, 0.5, 2)^T$. The model is defined by

$$
\omega\, \mathcal{N}(\boldsymbol{\mu}_1^N, \boldsymbol{\Sigma}) + \lambda\, \mathcal{N}(\boldsymbol{\mu}_2^N, \boldsymbol{\Sigma}) + (1 - \omega - \lambda)\, t_{\nu=10}(\boldsymbol{\mu}^t, \boldsymbol{\Sigma}^*), \quad 0 \leq \omega,\ \lambda,\ \omega + \lambda \leq 1,
$$

with **Σ** being again a common variance-covariance matrix with unknown parameter *ρ* of the form

$$
\boldsymbol{\Sigma} = \begin{pmatrix} 1 & \rho & 2\rho & 2\rho \\ \rho & 1 & 2\rho & 2\rho \\ 2\rho & 2\rho & 1 & \rho \\ 2\rho & 2\rho & \rho & 1 \end{pmatrix}.
$$

Following the same steps as in the first scenario, we generate 1000 samples of the three-component mixture for different sample sizes $n = 5, 7, 10, 20, 40, 50, 70, 100$ and different values of $\omega$ and $\lambda$. Then, we consider the problem of choosing among the two 4-variate normal distributions and the 4-variate $t$-distribution through the CLDIC criterion, for different values of the tuning parameter $\alpha = 0, 0.3, 0.5, 0.7$. See Table 3 for the results. Here, the normal models are denoted by N1 and N2, respectively, while the 4-variate $t$-distribution is denoted by MT. The first three cases evaluate the selected model under each of these multivariate distributions. In the last two cases, a mixed model is considered as the true distribution.

#### *4.3. Discussion of Results*

In Scenario 1a, two well-differentiated multivariate models are considered. In this case, the CLDIC criterion works in a very efficient way, with almost perfect discrimination for extreme values of $\omega$. Good behaviour is also observed for not so extreme values of $\omega$, such as $\omega = 0.55$ or $0.45$. We cannot observe a significant difference with respect to the choice of $\alpha$.

In Scenario 1b we consider closer models, which affects the discrimination power of the CLDIC. However, in this case, we do observe great differences when considering different values of $\alpha$. While the discrimination power of the CLDIC for $\alpha = 0$ (CAIC) and $\omega = 1$ is around 75%, for $\alpha = 0.2$ or $\alpha = 0.4$ the behaviour is excellent. This also happens for large but not extreme values of $\omega$, such as $\omega = 0.75$. However, a medium value of $\alpha$ turns into worse discrimination for low values of $\omega$.


**Table 3.** Main results, Scenario 2.

<sup>∗</sup> Here the candidate models are expressed as N1, N2, MT to denote $\mathcal{N}(\boldsymbol{\mu}_1^N, \boldsymbol{\Sigma})$, $\mathcal{N}(\boldsymbol{\mu}_2^N, \boldsymbol{\Sigma})$ and $t_{10}(\boldsymbol{\mu}^t, \boldsymbol{\Sigma})$, respectively.

Scenario 2 deals with three different models, two multivariate normal and one multivariate $t$ (N1, N2 and MT, respectively). The second normal distribution is closer to MT in terms of means. While the CLDIC criterion discriminates well between N1 and N2 and between N1 and MT, it has difficulties in distinguishing the N2 and MT distributions, especially for small sample sizes and $\alpha = 0$.

It seems, therefore, that when we have well-discriminated models, the CLDIC criterion works very well, independently of the sample size and the tuning parameter $\alpha$ considered. Dealing with closer models leads, as expected, to worse results, especially for $\alpha = 0$ (CAIC).

Note that the behaviour of Wald-type and Rao tests based on CMDPDEs was studied in [12,13] through extensive simulation studies.

#### **5. Numerical Examples**

#### *5.1. Choice of the Tuning Parameter*

In the previous sections, we have seen that the CLDIC criterion works generally very well, independently of *α*, but that some values present better behaviour, especially when distinguishing similar models. In these situations, values close to 0.2 or 0.3 appear to work well, while the CAIC criterion presents worse behaviour. A data-driven approach for the choice of the tuning parameter would therefore be helpful in practice. The approach of [30] was adapted in [13] for the choice of the optimum *α* in CMDPDEs. This approach consists of minimizing the estimated mean squared error by means of a pilot estimator, *θ <sup>P</sup>*. This approximation is given by

$$\widehat{MSE}_{\alpha} = \left(\widehat{\boldsymbol{\theta}}_{c}^{\alpha} - \boldsymbol{\theta}^{P}\right)^{T}\left(\widehat{\boldsymbol{\theta}}_{c}^{\alpha} - \boldsymbol{\theta}^{P}\right) + \frac{1}{n}\operatorname{trace}\left(\boldsymbol{H}_{\alpha}^{-1}(\widehat{\boldsymbol{\theta}}_{c}^{\alpha})\,\boldsymbol{J}_{\alpha}(\widehat{\boldsymbol{\theta}}_{c}^{\alpha})\,\boldsymbol{H}_{\alpha}^{-1}(\widehat{\boldsymbol{\theta}}_{c}^{\alpha})\right), \tag{16}$$

where *Hα*(*θ*) and *Jα*(*θ*) are given in (11) and (12). The optimum *α* is the one that minimizes expression (16). The choice of the pilot estimator is probably one of the major drawbacks of this approach, as it may lead to a choice of *α* too close to that used for the pilot estimator. A pilot estimator with *α* ≈ 0.4 was proposed in [13] after some simulations, in concordance with [30], where the initial pilot is suggested to be a robust one in order to obtain the best results in terms of robustness.
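To make the procedure concrete, the following Python sketch implements the grid search over *α* minimizing (16) for the simplest one-parameter case — a (non-composite) normal location model with unit variance, where *Hα* and *Jα* have closed forms. All function names, the grid-based optimizer and the parameter values are our own illustration under these simplifying assumptions, not the authors' code.

```python
import numpy as np

def phi(t):
    """Standard normal density."""
    return np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)

def mdpde_normal_mean(y, alpha):
    """Minimum DPD estimate of a normal location (sigma = 1 assumed).
    The term int f^{1+alpha} dy is constant in theta for a location model,
    so only the empirical term of the DPD objective matters."""
    if alpha == 0:
        return y.mean()                       # maximum likelihood limit
    grid = np.linspace(y.min(), y.max(), 400)
    score = [np.mean(phi(y - th)**alpha) for th in grid]
    return grid[int(np.argmax(score))]

def H_alpha(alpha):
    """Closed-form sensitivity for N(theta, 1)."""
    return (2 * np.pi)**(-alpha / 2) * (1 + alpha)**(-1.5)

def J_alpha(alpha):
    """Closed-form variability for N(theta, 1)."""
    return (2 * np.pi)**(-alpha) * (1 + 2 * alpha)**(-1.5)

def select_alpha(y, alphas, alpha_pilot=0.4):
    """Grid search for the tuning parameter minimizing the estimated
    MSE of Equation (16), using a robust pilot estimator."""
    theta_p = mdpde_normal_mean(y, alpha_pilot)
    def mse(a):
        th = mdpde_normal_mean(y, a)
        return (th - theta_p)**2 + J_alpha(a) / (len(y) * H_alpha(a)**2)
    return min(alphas, key=mse)
```

In the composite likelihood setting of this paper, `mdpde_normal_mean`, `H_alpha` and `J_alpha` would be replaced by the CMDPDE and the matrices of (11) and (12); the outer grid search is unchanged.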

#### *5.2. Iris Data*

The Iris data (Fisher, [31]) include 3 classes of 50 observations each, where each class refers to a type of iris plant: *setosa*, *versicolor* and *virginica*. Each plant is categorized in its class and described by 4 further variables: (1) sepal length, (2) sepal width, (3) petal length and (4) petal width. This is one of the best-known data sets for discriminant analysis. [32] proposed the use of a finite Gaussian mixture for modeling the Iris data, in which each known class is modeled by a single Gaussian term with the same variance-covariance matrix. The resulting model is as follows

$$f(\mathbf{x}) = \frac{1}{3}\mathcal{N}(\boldsymbol{\mu}\_1, \boldsymbol{\Sigma}) + \frac{1}{3}\mathcal{N}(\boldsymbol{\mu}\_2, \boldsymbol{\Sigma}) + \frac{1}{3}\mathcal{N}(\boldsymbol{\mu}\_3, \boldsymbol{\Sigma}), \tag{17}$$

with

$$\mu\_1 = \left(\mu\_{11}, \mu\_{12}, \mu\_{13}, \mu\_{14}\right)^T, \quad \mu\_2 = \left(\mu\_{21}, \mu\_{22}, \mu\_{23}, \mu\_{24}\right)^T, \quad \mu\_3 = \left(\mu\_{31}, \mu\_{32}, \mu\_{33}, \mu\_{34}\right)^T$$

and

$$
\Sigma = \begin{pmatrix}
\sigma\_1^2 & \sigma\_{12} & \sigma\_{13} & \sigma\_{14} \\
\sigma\_{21} & \sigma\_2^2 & \sigma\_{23} & \sigma\_{24} \\
\sigma\_{31} & \sigma\_{32} & \sigma\_3^2 & \sigma\_{34} \\
\sigma\_{41} & \sigma\_{42} & \sigma\_{43} & \sigma\_4^2
\end{pmatrix}.
$$

Exact values can be obtained with the *MclustDA*() function of the *mclust* package in R ([32]).

We propose a composite likelihood approach to modeling (17) in which we suppose independence between the first two and the last two variables. That is,

$$f\_{CL}(\mathbf{y}) = \frac{1}{3}CLN\_1 + \frac{1}{3}CLN\_2 + \frac{1}{3}CLN\_3\tag{18}$$

with

$$CLN_{i} = f_{A_{i1}}^{N}(\rho_{12}, \boldsymbol{y})\, f_{A_{i2}}^{N}(\rho_{34}, \boldsymbol{y}),$$

where $f_{A_{i1}}^{N}(\rho_{12}, \boldsymbol{y}) = f_{A_{i1}}^{N}(\rho_{12}, \mu_{i1}, \mu_{i2}, \boldsymbol{\Sigma}_{A_1}, \boldsymbol{y})$ and $f_{A_{i2}}^{N}(\rho_{34}, \boldsymbol{y}) = f_{A_{i2}}^{N}(\rho_{34}, \mu_{i3}, \mu_{i4}, \boldsymbol{\Sigma}_{A_2}, \boldsymbol{y})$, *i* = 1, 2, 3, are bivariate normal densities with variance-covariance matrices

$$\boldsymbol{\Sigma}\_{A\_1} = \begin{pmatrix} \sigma\_1^2 & \rho\_{12}\sigma\_1\sigma\_2 \\ \rho\_{12}\sigma\_1\sigma\_2 & \sigma\_2^2 \end{pmatrix}, \quad \boldsymbol{\Sigma}\_{A\_2} = \begin{pmatrix} \sigma\_3^2 & \rho\_{34}\sigma\_3\sigma\_4 \\ \rho\_{34}\sigma\_3\sigma\_4 & \sigma\_4^2 \end{pmatrix}.$$
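The composite density (18) is straightforward to evaluate numerically as a product of bivariate normal blocks. The sketch below shows one way to do so; the mean vectors and standard deviations are illustrative placeholders, not the values fitted to the Iris data.

```python
import numpy as np
from scipy.stats import multivariate_normal

def bivariate_block(pair, mu_pair, s_a, s_b, rho):
    """Density of one bivariate normal block of the composite likelihood."""
    cov = np.array([[s_a**2, rho * s_a * s_b],
                    [rho * s_a * s_b, s_b**2]])
    return multivariate_normal(mean=mu_pair, cov=cov).pdf(pair)

def composite_mixture_pdf(y, mus, sd, rho12, rho34):
    """f_CL(y) of Equation (18): an equal-weight mixture of three composite
    terms, each the product of an independent block on variables (1,2)
    and another on variables (3,4)."""
    return sum(
        bivariate_block(y[:2], mu[:2], sd[0], sd[1], rho12)
        * bivariate_block(y[2:], mu[2:], sd[2], sd[3], rho34)
        for mu in mus) / 3.0
```

With the class means and common covariance estimated beforehand, only the two correlation parameters *ρ*12 and *ρ*34 remain to be estimated by minimizing the composite DPD.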

We now evaluate the behavior of the CLDIC criterion proposed in the previous sections. After estimating the parameters *ρ*<sup>12</sup> and *ρ*<sup>34</sup> in (18), we consider 10 different subsets of the Iris data:


Table 4 reports the model chosen for each of the subsets by the proposed CLDIC criterion. When a "pure" subset is considered, all the tuning parameters lead to optimal decisions, but when a "contaminated" subset is under consideration, only *α* = 0.2, 0.3 give an optimal response in all the cases.


**Table 4.** Selected model in each of the subsets. Iris data.

We now apply the ad hoc approach presented in Section 5.1 for selecting the tuning parameter *α* in a composite likelihood framework. Applying this procedure to our data set through a grid search of length 100 and by means of a pilot estimator with *α* = 0.4 leads to the optimal tuning parameter *α* = 0.22, which is in concordance with the results obtained (see Table 5). We can see that the use of other pilot estimators would not greatly affect the final decision.

**Table 5.** Selected *α* for different pilot estimators, ad hoc tuning parameter selection procedure. Iris and Wine data.


#### *5.3. Wine Data*

We now work with the Wine data ([33]), which contain a chemical analysis of 178 Italian wines from three different cultivars (Barolo, Grignolino, Barbera), with 13 measurements each. In order to illustrate our criterion, we work with only the first four explanatory variables: Alcohol, Malic, Ash and Alkalinity. As in the previous section, we fit a Gaussian mixture model, in this case with weights 59/178, 72/178 and 47/178 corresponding to the Barolo, Grignolino and Barbera classes, respectively. We now consider these 10 different subsets of the Wine data:


We can observe that, for medium values of *α*, the discrimination is perfect (see Table 6). Applying the ad hoc tuning parameter choice procedure we obtain *αopt* ≈ 0.51, again with perfect discrimination (Table 5).


**Table 6.** Selected model in each of the subsets. Wine data.

#### **6. Conclusions and Future Research**

In this paper, we have addressed the problem of model selection in the framework of composite likelihood methodology, on the basis of the DPD as a measure of the closeness between the composite density and the true model that drives the data. In this context, an information criterion is introduced and studied, defined by means of composite minimum distance type estimators of the unknown parameters, which are well known for their nice robustness properties. Through a simulation study, we have shown that the model selection criterion proposed here works well in practice and, mainly, that the use of CMDPDEs makes the criterion more robust than the criteria based on the classical CMLE and the Kullback–Leibler divergence, given in [22]. The analysis of two real data examples from the literature illustrates how the model selection criterion presented here can be applied in practical cases. This paper is part of a series of papers by the authors in which composite likelihood ideas and methods are woven together with divergence-theoretic methods in order to develop statistical inference (estimation and testing of hypotheses) as well as model selection criteria. We envision future work in several directions. The development of change-point methodology on the basis of the composite density with CMDPDEs and divergence measures may be an appealing problem for future research on the topic. Moreover, all the information-theoretic methods developed on the basis of the composite likelihood depend on the choice of the family of sets $\{A_k\}_{k=1}^{K}$ appearing in Formula (1). A question is raised at this point: how are the information-theoretic procedures developed on the basis of the composite likelihood affected by this family of sets? This is an appealing problem which also deserves investigation in future work.

**Author Contributions:** Conceptualization, E.C., N.M., L.P. and K.Z.; Methodology, E.C., N.M., L.P. and K.Z.; Software, E.C., N.M., L.P. and K.Z.; Validation, E.C., N.M., L.P. and K.Z.; Formal Analysis, E.C., N.M., L.P. and K.Z.; Investigation, E.C., N.M., L.P. and K.Z.; Resources, E.C., N.M., L.P. and K.Z.; Data Curation, E.C., N.M., L.P. and K.Z.; Writing—Original Draft Preparation, E.C., N.M., L.P. and K.Z.; Writing—Review & Editing, E.C., N.M., L.P. and K.Z.; Visualization, E.C., N.M., L.P. and K.Z.; Supervision, E.C., N.M., L.P. and K.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research is partially supported by Grant PGC2018-095194-B-I00 and Grant FPU16/03104 from Ministerio de Ciencia, Innovación y Universidades (Spain). E. Castilla, N. Martín and L. Pardo are members of the Instituto de Matemática Interdisciplinar, Complutense University of Madrid.

**Acknowledgments:** The authors would like to thank the Editor and Reviewers for taking their precious time to make several valuable comments on the manuscript.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **Appendix A. Proof of Theorem 1**

**Proof.** A Taylor expansion of $\mathcal{W}_{\alpha}(\boldsymbol{\theta})$ around the true parameter $\boldsymbol{\theta}_{0}$, evaluated at $\boldsymbol{\theta} = \widehat{\boldsymbol{\theta}}_{c}^{\alpha}$, gives

$$\begin{split} \mathcal{W}_{\alpha}\left(\widehat{\boldsymbol{\theta}}_{c}^{\alpha}\right) &= \mathcal{W}_{\alpha}\left(\boldsymbol{\theta}_{0}\right) + \left(\frac{\partial \mathcal{W}_{\alpha}\left(\boldsymbol{\theta}\right)}{\partial \boldsymbol{\theta}}\right)_{\boldsymbol{\theta}=\boldsymbol{\theta}_{0}}\left(\widehat{\boldsymbol{\theta}}_{c}^{\alpha} - \boldsymbol{\theta}_{0}\right) \\ &\quad + \frac{1}{2}\left(\widehat{\boldsymbol{\theta}}_{c}^{\alpha} - \boldsymbol{\theta}_{0}\right)^{T}\left(\frac{\partial^{2} \mathcal{W}_{\alpha}\left(\boldsymbol{\theta}\right)}{\partial \boldsymbol{\theta}\,\partial \boldsymbol{\theta}^{T}}\right)_{\boldsymbol{\theta}=\boldsymbol{\theta}_{0}}\left(\widehat{\boldsymbol{\theta}}_{c}^{\alpha} - \boldsymbol{\theta}_{0}\right) + o\left(\left\|\widehat{\boldsymbol{\theta}}_{c}^{\alpha} - \boldsymbol{\theta}_{0}\right\|^{2}\right). \end{split}$$

Now,

$$\begin{split} \frac{\partial \mathcal{W}_{\alpha}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} &= \int_{\mathbb{R}^{m}} (1+\alpha)\,\mathcal{CL}(\boldsymbol{\theta},\boldsymbol{y})^{\alpha}\frac{\partial \mathcal{CL}(\boldsymbol{\theta},\boldsymbol{y})}{\partial \boldsymbol{\theta}}\,d\boldsymbol{y} - \left(1+\frac{1}{\alpha}\right)\alpha\int_{\mathbb{R}^{m}} \mathcal{CL}(\boldsymbol{\theta},\boldsymbol{y})^{\alpha-1}\frac{\partial \mathcal{CL}(\boldsymbol{\theta},\boldsymbol{y})}{\partial \boldsymbol{\theta}}\,g(\boldsymbol{y})\,d\boldsymbol{y} \\ &= (1+\alpha)\int_{\mathbb{R}^{m}} \mathcal{CL}(\boldsymbol{\theta},\boldsymbol{y})^{\alpha+1}\boldsymbol{u}\left(\boldsymbol{\theta},\boldsymbol{y}\right)d\boldsymbol{y} - (1+\alpha)\int_{\mathbb{R}^{m}} \mathcal{CL}(\boldsymbol{\theta},\boldsymbol{y})^{\alpha}\boldsymbol{u}\left(\boldsymbol{\theta},\boldsymbol{y}\right)g(\boldsymbol{y})\,d\boldsymbol{y}. \end{split}$$

It is clear that if the true distribution *g* belongs to the parametric family *f*(·; *θ*), *θ* ∈ Θ, and *θ*<sup>0</sup> denotes the true value of the parameter *θ*, we get

$$\left(\frac{\partial \mathcal{W}_{\alpha}\left(\boldsymbol{\theta}\right)}{\partial \boldsymbol{\theta}}\right)_{\boldsymbol{\theta}=\boldsymbol{\theta}_{0}} = \mathbf{0}.$$

We now compute the second derivative:

$$\begin{split} \frac{\partial^{2}\mathcal{W}_{\alpha}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}^{T}} &= (1+\alpha)\left\{\int_{\mathbb{R}^{m}} (1+\alpha)\,\mathcal{CL}(\boldsymbol{\theta},\boldsymbol{y})^{\alpha+1}\boldsymbol{u}\left(\boldsymbol{\theta},\boldsymbol{y}\right)\boldsymbol{u}\left(\boldsymbol{\theta},\boldsymbol{y}\right)^{T}d\boldsymbol{y}\right. \\ &\quad - \int_{\mathbb{R}^{m}} \mathcal{CL}(\boldsymbol{\theta},\boldsymbol{y})^{\alpha+1}\left(-\frac{\partial^{2}\log\mathcal{CL}(\boldsymbol{\theta},\boldsymbol{y})}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}^{T}}\right)d\boldsymbol{y} \\ &\quad \left. - \alpha\int_{\mathbb{R}^{m}} \mathcal{CL}(\boldsymbol{\theta},\boldsymbol{y})^{\alpha}\boldsymbol{u}\left(\boldsymbol{\theta},\boldsymbol{y}\right)\boldsymbol{u}\left(\boldsymbol{\theta},\boldsymbol{y}\right)^{T}g(\boldsymbol{y})\,d\boldsymbol{y} + \int_{\mathbb{R}^{m}} \mathcal{CL}(\boldsymbol{\theta},\boldsymbol{y})^{\alpha}\left(-\frac{\partial^{2}\log\mathcal{CL}(\boldsymbol{\theta},\boldsymbol{y})}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}^{T}}\right)g(\boldsymbol{y})\,d\boldsymbol{y}\right\}. \end{split}$$

If the true distribution *g* belongs to the parametric family *f*(·; *θ*), *θ* ∈ Θ, and *θ*<sup>0</sup> denotes the true value of the parameter *θ*, this second derivative verifies

$$\begin{split} \left(\frac{\partial^{2}\mathcal{W}_{\alpha}\left(\boldsymbol{\theta}\right)}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}^{T}}\right)_{\boldsymbol{\theta}=\boldsymbol{\theta}_{0}} &= (1+\alpha)\int_{\mathbb{R}^{m}}\mathcal{CL}(\boldsymbol{\theta}_{0},\boldsymbol{y})^{\alpha+1}\boldsymbol{u}\left(\boldsymbol{\theta}_{0},\boldsymbol{y}\right)\boldsymbol{u}\left(\boldsymbol{\theta}_{0},\boldsymbol{y}\right)^{T}d\boldsymbol{y} \\ &= (1+\alpha)\,\boldsymbol{H}_{\alpha}\left(\boldsymbol{\theta}_{0}\right). \end{split}$$

Therefore,

$$n\mathcal{W}_{\alpha}\left(\widehat{\boldsymbol{\theta}}_{c}^{\alpha}\right) = n\mathcal{W}_{\alpha}\left(\boldsymbol{\theta}_{0}\right) + \frac{(1+\alpha)}{2}\sqrt{n}\left(\widehat{\boldsymbol{\theta}}_{c}^{\alpha}-\boldsymbol{\theta}_{0}\right)^{T}\boldsymbol{H}_{\alpha}\left(\boldsymbol{\theta}_{0}\right)\sqrt{n}\left(\widehat{\boldsymbol{\theta}}_{c}^{\alpha}-\boldsymbol{\theta}_{0}\right) + n\,o\left(\left\|\widehat{\boldsymbol{\theta}}_{c}^{\alpha}-\boldsymbol{\theta}_{0}\right\|^{2}\right).$$

But

$$\sqrt{n}\left(\widehat{\boldsymbol{\theta}}_{c}^{\alpha}-\boldsymbol{\theta}_{0}\right)\underset{n\to\infty}{\overset{\mathcal{L}}{\longrightarrow}}N\left(\mathbf{0},\,\boldsymbol{H}_{\alpha}(\boldsymbol{\theta}_{0})^{-1}\boldsymbol{J}_{\alpha}(\boldsymbol{\theta}_{0})\boldsymbol{H}_{\alpha}(\boldsymbol{\theta}_{0})^{-1}\right)$$

and $n\,o\left(\left\|\widehat{\boldsymbol{\theta}}_{c}^{\alpha}-\boldsymbol{\theta}_{0}\right\|^{2}\right) = o(O_{p}(1)) = o_{p}(1)$.

The asymptotic distribution of the quadratic form $\sqrt{n}\left(\widehat{\boldsymbol{\theta}}_{c}^{\alpha}-\boldsymbol{\theta}_{0}\right)^{T}\boldsymbol{H}_{\alpha}\left(\boldsymbol{\theta}_{0}\right)\sqrt{n}\left(\widehat{\boldsymbol{\theta}}_{c}^{\alpha}-\boldsymbol{\theta}_{0}\right)$ verifies

$$\sqrt{n}\left(\widehat{\boldsymbol{\theta}}_{c}^{\alpha}-\boldsymbol{\theta}_{0}\right)^{T}\boldsymbol{H}_{\alpha}\left(\boldsymbol{\theta}_{0}\right)\sqrt{n}\left(\widehat{\boldsymbol{\theta}}_{c}^{\alpha}-\boldsymbol{\theta}_{0}\right)\underset{n\to\infty}{\overset{\mathcal{L}}{\longrightarrow}}\sum_{r=1}^{k}\lambda_{r}Z_{r}^{2},$$

where $\lambda_{r}$, $r = 1, \dots, k$, are the eigenvalues of the matrix

$$\boldsymbol{H}_{\alpha}\left(\boldsymbol{\theta}_{0}\right)\boldsymbol{H}_{\alpha}\left(\boldsymbol{\theta}_{0}\right)^{-1}\boldsymbol{J}_{\alpha}\left(\boldsymbol{\theta}_{0}\right)\boldsymbol{H}_{\alpha}\left(\boldsymbol{\theta}_{0}\right)^{-1} = \boldsymbol{J}_{\alpha}\left(\boldsymbol{\theta}_{0}\right)\boldsymbol{H}_{\alpha}\left(\boldsymbol{\theta}_{0}\right)^{-1}$$

and $Z_{r}$ are independent normal random variables with mean zero and variance 1. Therefore,

$$\begin{split} E_{\boldsymbol{Y}_{1},\dots,\boldsymbol{Y}_{n}}\left[\sqrt{n}\left(\widehat{\boldsymbol{\theta}}_{c}^{\alpha}-\boldsymbol{\theta}_{0}\right)^{T}\boldsymbol{H}_{\alpha}\left(\boldsymbol{\theta}_{0}\right)\sqrt{n}\left(\widehat{\boldsymbol{\theta}}_{c}^{\alpha}-\boldsymbol{\theta}_{0}\right)\right] &= \sum_{r=1}^{k}\lambda_{r} + o_{p}(1) \\ &= \operatorname{trace}\left(\boldsymbol{J}_{\alpha}(\boldsymbol{\theta}_{0})\boldsymbol{H}_{\alpha}(\boldsymbol{\theta}_{0})^{-1}\right) + o_{p}(1) \end{split}$$

and

$$E_{\boldsymbol{Y}_{1},\dots,\boldsymbol{Y}_{n}}\left[n\mathcal{W}_{\alpha}(\widehat{\boldsymbol{\theta}}_{c}^{\alpha})\right] = n\mathcal{W}_{\alpha}\left(\boldsymbol{\theta}_{0}\right) + \frac{(1+\alpha)}{2}\operatorname{trace}\left(\boldsymbol{J}_{\alpha}(\boldsymbol{\theta}_{0})\boldsymbol{H}_{\alpha}(\boldsymbol{\theta}_{0})^{-1}\right) + o_{p}(1).$$

Now a Taylor expansion of $\mathcal{W}_{n,\alpha}(\boldsymbol{\theta})$ around $\widehat{\boldsymbol{\theta}}_{c}^{\alpha}$, evaluated at $\boldsymbol{\theta} = \boldsymbol{\theta}_{0}$, gives

$$\begin{split} \mathcal{W}_{n,\alpha}(\boldsymbol{\theta}_{0}) &= \mathcal{W}_{n,\alpha}(\widehat{\boldsymbol{\theta}}_{c}^{\alpha}) + \left(\frac{\partial\mathcal{W}_{n,\alpha}\left(\boldsymbol{\theta}\right)}{\partial\boldsymbol{\theta}}\right)_{\boldsymbol{\theta}=\widehat{\boldsymbol{\theta}}_{c}^{\alpha}}\left(\boldsymbol{\theta}_{0}-\widehat{\boldsymbol{\theta}}_{c}^{\alpha}\right) \\ &\quad + \frac{1}{2}\left(\boldsymbol{\theta}_{0}-\widehat{\boldsymbol{\theta}}_{c}^{\alpha}\right)^{T}\left(\frac{\partial^{2}\mathcal{W}_{n,\alpha}\left(\boldsymbol{\theta}\right)}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}^{T}}\right)_{\boldsymbol{\theta}=\widehat{\boldsymbol{\theta}}_{c}^{\alpha}}\left(\boldsymbol{\theta}_{0}-\widehat{\boldsymbol{\theta}}_{c}^{\alpha}\right) + o\left(\left\|\boldsymbol{\theta}_{0}-\widehat{\boldsymbol{\theta}}_{c}^{\alpha}\right\|^{2}\right). \end{split}$$

But

$$\frac{\partial\mathcal{W}_{n,\alpha}\left(\boldsymbol{\theta}\right)}{\partial\boldsymbol{\theta}} = (\alpha+1)\int_{\mathbb{R}^{m}}\mathcal{CL}(\boldsymbol{\theta},\boldsymbol{y})^{\alpha+1}\boldsymbol{u}\left(\boldsymbol{\theta},\boldsymbol{y}\right)d\boldsymbol{y} - (\alpha+1)\,\frac{1}{n}\sum_{i=1}^{n}\mathcal{CL}(\boldsymbol{\theta},\boldsymbol{y}_{i})^{\alpha}\boldsymbol{u}\left(\boldsymbol{\theta},\boldsymbol{y}_{i}\right),$$

therefore

$$\left(\frac{\partial\mathcal{W}_{n,\alpha}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right)_{\boldsymbol{\theta}=\widehat{\boldsymbol{\theta}}_{c}^{\alpha}}\underset{n\to\infty}{\overset{P}{\longrightarrow}}\mathbf{0}.$$

On the other hand

$$\begin{split} \frac{\partial^{2}\mathcal{W}_{n,\alpha}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}^{T}} &= (1+\alpha)\left\{\int_{\mathbb{R}^{m}}(1+\alpha)\,\mathcal{CL}(\boldsymbol{\theta},\boldsymbol{y})^{\alpha+1}\boldsymbol{u}\left(\boldsymbol{\theta},\boldsymbol{y}\right)\boldsymbol{u}\left(\boldsymbol{\theta},\boldsymbol{y}\right)^{T}d\boldsymbol{y} + \int_{\mathbb{R}^{m}}\mathcal{CL}(\boldsymbol{\theta},\boldsymbol{y})^{\alpha+1}\frac{\partial\boldsymbol{u}\left(\boldsymbol{\theta},\boldsymbol{y}\right)}{\partial\boldsymbol{\theta}^{T}}\,d\boldsymbol{y}\right. \\ &\quad \left. - \frac{1}{n}\sum_{i=1}^{n}\alpha\,\mathcal{CL}(\boldsymbol{\theta},\boldsymbol{y}_{i})^{\alpha}\boldsymbol{u}\left(\boldsymbol{\theta},\boldsymbol{y}_{i}\right)\boldsymbol{u}\left(\boldsymbol{\theta},\boldsymbol{y}_{i}\right)^{T} - \frac{1}{n}\sum_{i=1}^{n}\mathcal{CL}(\boldsymbol{\theta},\boldsymbol{y}_{i})^{\alpha}\frac{\partial\boldsymbol{u}\left(\boldsymbol{\theta},\boldsymbol{y}_{i}\right)}{\partial\boldsymbol{\theta}^{T}}\right\}. \end{split}$$

But

$$\frac{1}{n}\sum_{i=1}^{n}\mathcal{CL}(\boldsymbol{\theta},\boldsymbol{y}_{i})^{\alpha}\boldsymbol{u}\left(\boldsymbol{\theta},\boldsymbol{y}_{i}\right)\boldsymbol{u}\left(\boldsymbol{\theta},\boldsymbol{y}_{i}\right)^{T}\underset{n\to\infty}{\overset{P}{\longrightarrow}}\int_{\mathbb{R}^{m}}\mathcal{CL}(\boldsymbol{\theta},\boldsymbol{y})^{\alpha+1}\boldsymbol{u}\left(\boldsymbol{\theta},\boldsymbol{y}\right)\boldsymbol{u}\left(\boldsymbol{\theta},\boldsymbol{y}\right)^{T}d\boldsymbol{y}$$

and

$$\frac{1}{n}\sum_{i=1}^{n}\mathcal{CL}(\boldsymbol{\theta},\boldsymbol{y}_{i})^{\alpha}\frac{\partial\boldsymbol{u}\left(\boldsymbol{\theta},\boldsymbol{y}_{i}\right)}{\partial\boldsymbol{\theta}^{T}}\underset{n\to\infty}{\overset{P}{\longrightarrow}}\int_{\mathbb{R}^{m}}\mathcal{CL}(\boldsymbol{\theta},\boldsymbol{y})^{\alpha+1}\frac{\partial\boldsymbol{u}\left(\boldsymbol{\theta},\boldsymbol{y}\right)}{\partial\boldsymbol{\theta}^{T}}\,d\boldsymbol{y}.$$

Therefore

$$\left(\frac{\partial^{2}\mathcal{W}_{n,\alpha}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}^{T}}\right)_{\boldsymbol{\theta}=\widehat{\boldsymbol{\theta}}_{c}^{\alpha}}\underset{n\to\infty}{\overset{P}{\longrightarrow}}(1+\alpha)\,\boldsymbol{H}_{\alpha}\left(\boldsymbol{\theta}_{0}\right).$$

We can now write

$$n\mathcal{W}_{n,\alpha}\left(\boldsymbol{\theta}_{0}\right) = n\mathcal{W}_{n,\alpha}\left(\widehat{\boldsymbol{\theta}}_{c}^{\alpha}\right) + \frac{(1+\alpha)}{2}\sqrt{n}\left(\boldsymbol{\theta}_{0}-\widehat{\boldsymbol{\theta}}_{c}^{\alpha}\right)^{T}\boldsymbol{H}_{\alpha}\left(\boldsymbol{\theta}_{0}\right)\sqrt{n}\left(\boldsymbol{\theta}_{0}-\widehat{\boldsymbol{\theta}}_{c}^{\alpha}\right) + o_{p}(1).$$

It is clear that

$$\begin{split} E_{\boldsymbol{Y}_{1},\dots,\boldsymbol{Y}_{n}}\left[\sqrt{n}\left(\boldsymbol{\theta}_{0}-\widehat{\boldsymbol{\theta}}_{c}^{\alpha}\right)^{T}\boldsymbol{H}_{\alpha}\left(\boldsymbol{\theta}_{0}\right)\sqrt{n}\left(\boldsymbol{\theta}_{0}-\widehat{\boldsymbol{\theta}}_{c}^{\alpha}\right)\right] &= \sum_{r=1}^{k}\lambda_{r} + o_{p}(1) \\ &= \operatorname{trace}\left(\boldsymbol{J}_{\alpha}\left(\boldsymbol{\theta}_{0}\right)\boldsymbol{H}_{\alpha}\left(\boldsymbol{\theta}_{0}\right)^{-1}\right) + o_{p}(1). \end{split}$$

Then

$$E_{\boldsymbol{Y}_{1},\dots,\boldsymbol{Y}_{n}}\left[n\mathcal{W}_{n,\alpha}(\boldsymbol{\theta}_{0})\right] = E_{\boldsymbol{Y}_{1},\dots,\boldsymbol{Y}_{n}}\left[n\mathcal{W}_{n,\alpha}(\widehat{\boldsymbol{\theta}}_{c}^{\alpha})\right] + \frac{(1+\alpha)}{2}\operatorname{trace}\left(\boldsymbol{J}_{\alpha}(\boldsymbol{\theta}_{0})\boldsymbol{H}_{\alpha}(\boldsymbol{\theta}_{0})^{-1}\right) + o_{p}(1)$$

and, on the other hand, it is clear that

$$E\_{Y\_1,\dots,Y\_n} \left[ \mathcal{W}\_{n,\alpha}(\boldsymbol{\theta}\_0) \right] = \mathcal{W}\_{\alpha}(\boldsymbol{\theta}\_0).$$

Therefore,

$$\begin{split} E_{\boldsymbol{Y}_{1},\dots,\boldsymbol{Y}_{n}}\left[n\mathcal{W}_{\alpha}(\widehat{\boldsymbol{\theta}}_{c}^{\alpha})\right] &= n\mathcal{W}_{\alpha}\left(\boldsymbol{\theta}_{0}\right) + \frac{(1+\alpha)}{2}\operatorname{trace}\left(\boldsymbol{J}_{\alpha}(\boldsymbol{\theta}_{0})\boldsymbol{H}_{\alpha}(\boldsymbol{\theta}_{0})^{-1}\right) + o_{p}(1) \\ &= E_{\boldsymbol{Y}_{1},\dots,\boldsymbol{Y}_{n}}\left[n\mathcal{W}_{n,\alpha}\left(\boldsymbol{\theta}_{0}\right)\right] + \frac{(1+\alpha)}{2}\operatorname{trace}\left(\boldsymbol{J}_{\alpha}(\boldsymbol{\theta}_{0})\boldsymbol{H}_{\alpha}(\boldsymbol{\theta}_{0})^{-1}\right) + o_{p}(1) \\ &= E_{\boldsymbol{Y}_{1},\dots,\boldsymbol{Y}_{n}}\left[n\mathcal{W}_{n,\alpha}\left(\widehat{\boldsymbol{\theta}}_{c}^{\alpha}\right)\right] + \frac{(1+\alpha)}{2}\operatorname{trace}\left(\boldsymbol{J}_{\alpha}(\boldsymbol{\theta}_{0})\boldsymbol{H}_{\alpha}(\boldsymbol{\theta}_{0})^{-1}\right) \\ &\quad + \frac{(1+\alpha)}{2}\operatorname{trace}\left(\boldsymbol{J}_{\alpha}(\boldsymbol{\theta}_{0})\boldsymbol{H}_{\alpha}(\boldsymbol{\theta}_{0})^{-1}\right) + o_{p}(1) \\ &= E_{\boldsymbol{Y}_{1},\dots,\boldsymbol{Y}_{n}}\left[n\mathcal{W}_{n,\alpha}(\widehat{\boldsymbol{\theta}}_{c}^{\alpha})\right] + (1+\alpha)\operatorname{trace}\left(\boldsymbol{J}_{\alpha}(\boldsymbol{\theta}_{0})\boldsymbol{H}_{\alpha}(\boldsymbol{\theta}_{0})^{-1}\right) + o_{p}(1). \end{split}$$

Hence, $n\mathcal{W}_{n,\alpha}(\widehat{\boldsymbol{\theta}}_{c}^{\alpha}) + (1+\alpha)\operatorname{trace}\left(\boldsymbol{J}_{\alpha}(\boldsymbol{\theta}_{0})\boldsymbol{H}_{\alpha}(\boldsymbol{\theta}_{0})^{-1}\right)$ is an asymptotically unbiased estimator of

$$E_{\boldsymbol{Y}_{1},\dots,\boldsymbol{Y}_{n}}\left[n\mathcal{W}_{\alpha}\left(\widehat{\boldsymbol{\theta}}_{c}^{\alpha}\right)\right].$$

#### **Appendix B. Computation of the CLDIC in Section 4.1**

We have to compute

$$\text{CLDIC}\left(M_{k}\right) = \mathcal{W}_{n,\alpha}\left(\widehat{\rho}\right) + \frac{\alpha+1}{n}\,\frac{J_{\alpha}\left(\widehat{\rho}\right)}{H_{\alpha}\left(\widehat{\rho}\right)},$$

where

$$\mathcal{W}_{n,\alpha}\left(\widehat{\rho}\right) = \int_{\mathbb{R}^{4}}\mathcal{CL}(\widehat{\rho},\boldsymbol{y})^{\alpha+1}\,d\boldsymbol{y} - \left(1+\frac{1}{\alpha}\right)\frac{1}{n}\sum_{i=1}^{n}\mathcal{CL}(\widehat{\rho},\boldsymbol{y}_{i})^{\alpha},\tag{A1}$$

$$J_{\alpha}(\widehat{\rho}) = \int_{\mathbb{R}^{4}}\mathcal{CL}(\widehat{\rho},\boldsymbol{y})^{2\alpha+1}u(\widehat{\rho},\boldsymbol{y})^{2}\,d\boldsymbol{y} - \left(\int_{\mathbb{R}^{4}}\mathcal{CL}(\widehat{\rho},\boldsymbol{y})^{\alpha+1}u(\widehat{\rho},\boldsymbol{y})\,d\boldsymbol{y}\right)^{2},\tag{A2}$$

$$H_{\alpha}(\widehat{\rho}) = \int_{\mathbb{R}^{4}}\mathcal{CL}(\widehat{\rho},\boldsymbol{y})^{\alpha+1}u(\widehat{\rho},\boldsymbol{y})^{2}\,d\boldsymbol{y},\tag{A3}$$

for our candidate models, namely, the composite normal and the composite 4-variate *t*-distribution. As commented in Section 4.1, we consider a composite likelihood function based on the product of two bivariate distributions with common variance-covariance matrix. It is therefore necessary in this example to obtain the values (A1), (A2) and (A3) for both the composite normal and composite *t*-distributions. However, as stated in [10], while the sensitivity and variability matrices can sometimes be evaluated explicitly, it is more usual to use empirical estimates. Following this comment, in the current example we compute Equations (A1), (A2) and (A3) empirically through the sample data using

$$\begin{split} \widehat{\mathcal{W}}_{n,\alpha}\left(\widehat{\rho}\right) &= \sum_{i=1}^{n}\mathcal{CL}(\widehat{\rho},\boldsymbol{y}_{i})^{\alpha+1} - \left(1+\frac{1}{\alpha}\right)\frac{1}{n}\sum_{i=1}^{n}\mathcal{CL}(\widehat{\rho},\boldsymbol{y}_{i})^{\alpha}, \\ \widehat{J}_{\alpha}(\widehat{\rho}) &= \sum_{i=1}^{n}\mathcal{CL}(\widehat{\rho},\boldsymbol{y}_{i})^{2\alpha+1}u(\widehat{\rho},\boldsymbol{y}_{i})^{2} - \left(\sum_{i=1}^{n}\mathcal{CL}(\widehat{\rho},\boldsymbol{y}_{i})^{\alpha+1}u(\widehat{\rho},\boldsymbol{y}_{i})\right)^{2}, \\ \widehat{H}_{\alpha}(\widehat{\rho}) &= \sum_{i=1}^{n}\mathcal{CL}(\widehat{\rho},\boldsymbol{y}_{i})^{\alpha+1}u(\widehat{\rho},\boldsymbol{y}_{i})^{2}. \end{split}$$
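For a scalar parameter such as *ρ*, these empirical quantities combine into the criterion in a few lines. The following Python sketch assumes the values CL(*ρ̂*, *y<sub>i</sub>*) and *u*(*ρ̂*, *y<sub>i</sub>*) have already been computed for each candidate model (with the DPD coefficient 1 + 1/*α* and a positive sensitivity term); the function names and the model-label dictionary are our own illustration, not the authors' code.

```python
import numpy as np

def cldic(cl_vals, u_vals, alpha, n):
    """Empirical CLDIC for a one-parameter model: cl_vals[i] = CL(rho_hat, y_i)
    and u_vals[i] = u(rho_hat, y_i) are computed beforehand."""
    W = np.sum(cl_vals**(alpha + 1)) \
        - (1 + 1 / alpha) * np.mean(cl_vals**alpha)
    J = np.sum(cl_vals**(2 * alpha + 1) * u_vals**2) \
        - np.sum(cl_vals**(alpha + 1) * u_vals)**2
    H = np.sum(cl_vals**(alpha + 1) * u_vals**2)
    return W + (alpha + 1) / n * J / H

def select_model(candidates, alpha, n):
    """Pick the candidate model with the smallest CLDIC value.
    candidates maps a model label to its (cl_vals, u_vals) pair."""
    return min(candidates, key=lambda m: cldic(*candidates[m], alpha, n))
```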

Now, we obtain the score of the composite likelihood $u(\widehat{\rho}, \boldsymbol{y}_{i})$ explicitly in both cases. By Equation (A.5) in [12],

$$\begin{split} u^{N}(\widehat{\rho},\boldsymbol{y}_{i}) &= \frac{\widehat{\rho}}{1-\widehat{\rho}^{2}}\left[2 + \frac{1}{\widehat{\rho}}\left(t_{1i}t_{2i}+t_{3i}t_{4i}\right)\right. \\ &\quad \left. - \frac{1}{1-\widehat{\rho}^{2}}\left(t_{1i}^{2}-2\widehat{\rho}\,t_{1i}t_{2i}+t_{2i}^{2}\right) - \frac{1}{1-\widehat{\rho}^{2}}\left(t_{3i}^{2}-2\widehat{\rho}\,t_{3i}t_{4i}+t_{4i}^{2}\right)\right], \end{split}$$

with $t_{ji} = y_{ji} - \mu_{j}$, $j = 1, \dots, 4$. On the other hand, we want to compute $u^{t_{\nu}}(\widehat{\rho}, \boldsymbol{y}_{i})$:

$$\begin{split} u^{t_{\nu}}(\widehat{\rho},\boldsymbol{y}_{i}) &= \frac{\partial\log\mathcal{CL}^{t_{\nu}}(\widehat{\rho},\boldsymbol{y}_{i})}{\partial\widehat{\rho}} = \frac{1}{\mathcal{CL}^{t_{\nu}}(\widehat{\rho},\boldsymbol{y}_{i})}\frac{\partial\mathcal{CL}^{t_{\nu}}(\widehat{\rho},\boldsymbol{y}_{i})}{\partial\widehat{\rho}} \\ &= \frac{1}{f_{12}^{t_{\nu}}(\boldsymbol{y}_{i};\widehat{\rho})f_{34}^{t_{\nu}}(\boldsymbol{y}_{i};\widehat{\rho})}\frac{\partial}{\partial\widehat{\rho}}\left(f_{12}^{t_{\nu}}(\boldsymbol{y}_{i};\widehat{\rho})f_{34}^{t_{\nu}}(\boldsymbol{y}_{i};\widehat{\rho})\right) \\ &= \frac{1}{f_{12}^{t_{\nu}}(\boldsymbol{y}_{i};\widehat{\rho})f_{34}^{t_{\nu}}(\boldsymbol{y}_{i};\widehat{\rho})}\left(\frac{\partial f_{12}^{t_{\nu}}(\boldsymbol{y}_{i};\widehat{\rho})}{\partial\widehat{\rho}}f_{34}^{t_{\nu}}(\boldsymbol{y}_{i};\widehat{\rho}) + f_{12}^{t_{\nu}}(\boldsymbol{y}_{i};\widehat{\rho})\frac{\partial f_{34}^{t_{\nu}}(\boldsymbol{y}_{i};\widehat{\rho})}{\partial\widehat{\rho}}\right) \\ &= \frac{1}{f_{12}^{t_{\nu}}(\boldsymbol{y}_{i};\widehat{\rho})}\frac{\partial f_{12}^{t_{\nu}}(\boldsymbol{y}_{i};\widehat{\rho})}{\partial\widehat{\rho}} + \frac{1}{f_{34}^{t_{\nu}}(\boldsymbol{y}_{i};\widehat{\rho})}\frac{\partial f_{34}^{t_{\nu}}(\boldsymbol{y}_{i};\widehat{\rho})}{\partial\widehat{\rho}}. \end{split}$$

Now, it can be shown that

$$\frac{\partial f_{12}^{t_{\nu}}(\boldsymbol{y}_{i};\widehat{\rho})}{\partial\widehat{\rho}} = f_{12}^{t_{\nu}}(\boldsymbol{y}_{i};\widehat{\rho})\,\frac{\nu\left[(\nu-2)\widehat{\rho}^{3} - t_{1i}t_{2i}\nu\widehat{\rho}^{2} + \left((t_{1i}^{2}+t_{2i}^{2}-1)\nu + t_{1i}^{2}+t_{2i}^{2}+2\right)\widehat{\rho} - t_{1i}t_{2i}\nu - 2t_{1i}t_{2i}\right]}{(1-\widehat{\rho}^{2})\left[(\nu-2)\widehat{\rho}^{2} + 2t_{1i}t_{2i}\widehat{\rho} - \nu - t_{1i}^{2} - t_{2i}^{2} + 2\right]}$$

and

$$\frac{\partial f_{34}^{t_{\nu}}(\boldsymbol{y}_{i};\widehat{\rho})}{\partial\widehat{\rho}} = f_{34}^{t_{\nu}}(\boldsymbol{y}_{i};\widehat{\rho})\,\frac{\nu\left[(\nu-2)\widehat{\rho}^{3} - t_{3i}t_{4i}\nu\widehat{\rho}^{2} + \left((t_{3i}^{2}+t_{4i}^{2}-1)\nu + t_{3i}^{2}+t_{4i}^{2}+2\right)\widehat{\rho} - t_{3i}t_{4i}\nu - 2t_{3i}t_{4i}\right]}{(1-\widehat{\rho}^{2})\left[(\nu-2)\widehat{\rho}^{2} + 2t_{3i}t_{4i}\widehat{\rho} - \nu - t_{3i}^{2} - t_{4i}^{2} + 2\right]}.$$
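Closed-form scores of this kind are easy to verify numerically. The sketch below checks the normal composite score above against a central finite difference of the composite log-likelihood, under the simplifying assumption of standardized variables (unit variances, *t* = *y* − *µ*); it is our own sanity check, not part of the authors' derivation.

```python
import numpy as np

def log_cl_normal(rho, t):
    """Composite log-likelihood log f12 + log f34 for standardized
    bivariate normal blocks sharing the correlation rho (t = y - mu)."""
    def log_biv(a, b):
        q = a * a - 2 * rho * a * b + b * b
        return -np.log(2 * np.pi) - 0.5 * np.log(1 - rho**2) \
               - q / (2 * (1 - rho**2))
    return log_biv(t[0], t[1]) + log_biv(t[2], t[3])

def u_normal(rho, t):
    """Closed-form composite score, as in Equation (A.5) of [12]."""
    q12 = t[0]**2 - 2 * rho * t[0] * t[1] + t[1]**2
    q34 = t[2]**2 - 2 * rho * t[2] * t[3] + t[3]**2
    return rho / (1 - rho**2) * (2 + (t[0] * t[1] + t[2] * t[3]) / rho
                                 - (q12 + q34) / (1 - rho**2))

# central finite difference at an arbitrary point
rho, t, h = 0.3, np.array([0.5, -1.2, 0.8, 0.1]), 1e-6
numeric = (log_cl_normal(rho + h, t) - log_cl_normal(rho - h, t)) / (2 * h)
```

The same check, applied to the two *t*-distribution derivatives above, is a quick way to catch sign or index slips before plugging them into (A2) and (A3).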

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
