Projection Pursuit Through ϕ-Divergence Minimisation

Touboul, Jacques

doi:10.3390/e12061581

Laboratoire de Statistique Théorique et Appliquée, Université Pierre et Marie Curie, 175 rue du Chevaleret, 75013 Paris, France

Entropy 2010, 12(6), 1581-1611; https://doi.org/10.3390/e12061581

Submission received: 8 April 2010 / Revised: 27 May 2010 / Accepted: 31 May 2010 / Published: 14 June 2010

Abstract

In his 1985 article (“Projection pursuit”), Huber demonstrates the value of his method for estimating a density from a data set in a simple given case. He considers the factorization of a density into a Gaussian component and a residual density. Huber’s work is based on maximizing the Kullback–Leibler divergence. Our proposal leads to a new algorithm. Furthermore, we also consider the case where the density to be factorized is estimated from an i.i.d. sample. We then propose a test for the factorization of the estimated density. Applications include a new test of fit pertaining to elliptical copulas.

Keywords:

projection pursuit; minimum φ-divergence; elliptical distribution; goodness-of-fit; copula; regression

MSC Classification:

94A17; 62F05; 62J05; 62G08

1. Outline of the Article

The objective of projection pursuit is to generate one or several projections providing as much information as possible about the structure of the data set, regardless of its size.

Once a structure has been isolated, the corresponding data are transformed through a Gaussianization. Through a recursive approach, this process is iterated to find another structure in the remaining data, until no further structure can be evidenced in the data left at the end.

Friedman [1] and Huber [2] were among the first authors to introduce this type of approach for evidencing structures. Each describes, with many examples, how to evidence such a structure and consequently how to estimate the density of such data, each through two different methodologies. Their work is based on maximizing the Kullback–Leibler divergence.

For a very long time, the two methodologies presented by each of the above authors were thought to be equivalent, but Zhu [3] showed that this is not the case when the number of iterations in the algorithms exceeds the dimension of the space containing the data, i.e., in the case of density estimation. In the present article, we will therefore focus only on Huber’s study while taking Zhu’s remarks into account.

Let us now briefly introduce Huber’s methodology. We will then present our approach and objective.

1.1. Huber’s analytic approach

Let f be a density on $\mathbb{R}^{d}$. We define an instrumental density g with the same mean and variance as f. Huber’s methodology requires us to start by performing the $K(f, g) = 0$ test, with K being the Kullback–Leibler divergence. Should this test turn out to be positive, then $f = g$ and the algorithm stops. Otherwise, the first step of Huber’s algorithm amounts to defining a vector $a_{1}$ and a density $f^{(1)}$ by

$$a_{1} = \arg\inf_{a \in \mathbb{R}_{*}^{d}} K\left(f \frac{g_{a}}{f_{a}}, g\right) \quad \text{and} \quad f^{(1)} = f \frac{g_{a_{1}}}{f_{a_{1}}} \tag{1.1}$$

where $\mathbb{R}_{*}^{d}$ is the set of non-null vectors of $\mathbb{R}^{d}$, and where $f_{a}$ (resp. $g_{a}$) stands for the density of $a^{⊤} X$ (resp. $a^{⊤} Y$) when f (resp. g) is the density of X (resp. Y). More exactly, this results from the maximisation of $a \mapsto K(f_{a}, g_{a})$ since $K(f, g) = K(f_{a}, g_{a}) + K\left(f \frac{g_{a}}{f_{a}}, g\right)$ and it is assumed that $K(f, g)$ is finite. In a second step, Huber replaces f with $f^{(1)}$ and goes through the first step again.

By iterating this process, Huber thus obtains a sequence $(a_{1}, a_{2}, \ldots)$ of vectors of $\mathbb{R}_{*}^{d}$ and a sequence of densities $f^{(i)}$.

Remark 1.1. Huber stops his algorithm when the Kullback–Leibler divergence equals zero or when his algorithm reaches the $d^{\text{th}}$ iteration; he then obtains an approximation of f from g:

When there exists an integer j such that $K(f^{(j)}, g) = 0$ with $j \leq d$, he obtains $f^{(j)} = g$, i.e., $f = g \prod_{i=1}^{j} \frac{f_{a_{i}}^{(i-1)}}{g_{a_{i}}}$, since by induction $f^{(j)} = f \prod_{i=1}^{j} \frac{g_{a_{i}}}{f_{a_{i}}^{(i-1)}}$. Similarly, when Huber gets $K(f^{(j)}, g) > 0$ for all $j \leq d$, he assumes $g = f^{(d)}$ in order to derive $f = g \prod_{i=1}^{d} \frac{f_{a_{i}}^{(i-1)}}{g_{a_{i}}}$.

He can also stop his algorithm when the Kullback–Leibler divergence equals zero without the condition $j \leq d$ being met. Therefore, since by induction we have $f^{(j)} = f \prod_{i=1}^{j} \frac{g_{a_{i}}}{f_{a_{i}}^{(i-1)}}$ with $f^{(0)} = f$, we obtain $g = f \prod_{i=1}^{j} \frac{g_{a_{i}}}{f_{a_{i}}^{(i-1)}}$. Consequently, we derive a representation of f as

$$f = g \prod_{i=1}^{j} \frac{f_{a_{i}}^{(i-1)}}{g_{a_{i}}}.$$

Finally, he obtains $K(f^{(0)}, g) \geq K(f^{(1)}, g) \geq \cdots \geq 0$ with $f^{(0)} = f$.

1.2. Huber’s synthetic approach

Keeping the notations of the above section, we start by performing the $K(f, g) = 0$ test; should this test turn out to be positive, then $f = g$ and the algorithm stops. Otherwise, the first step of his algorithm consists in defining a vector $a_{1}$ and a density $g^{(1)}$ by

$$a_{1} = \arg\inf_{a \in \mathbb{R}_{*}^{d}} K\left(f, g \frac{f_{a}}{g_{a}}\right) \quad \text{and} \quad g^{(1)} = g \frac{f_{a_{1}}}{g_{a_{1}}} \tag{1.2}$$

More exactly, this optimisation results from the maximisation of $a \mapsto K(f_{a}, g_{a})$ since $K(f, g) = K(f_{a}, g_{a}) + K\left(f, g \frac{f_{a}}{g_{a}}\right)$ and it is assumed that $K(f, g)$ is finite. In a second step, Huber replaces g with $g^{(1)}$ and goes through the first step again. By iterating this process, Huber thus obtains a sequence $(a_{1}, a_{2}, \ldots)$ of vectors of $\mathbb{R}_{*}^{d}$ and a sequence of densities $g^{(i)}$.
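The decomposition used in the synthetic approach can be checked in a few lines. The derivation below is a sketch that assumes the convention $K(p, q) = \int p \log (p / q)\, dx$ and that all integrals are finite; it uses only the fact that $f_{a}$ is the density of $a^{⊤} X$ when X has density f.

```latex
% Assumption: K(p, q) = \int p \log (p / q)\, dx, all integrals finite.
\begin{aligned}
K(f, g)
  &= \int f(x) \log \frac{f_{a}(a^{\top} x)}{g_{a}(a^{\top} x)}\, dx
   + \int f(x) \log \frac{f(x)\, g_{a}(a^{\top} x)}{g(x)\, f_{a}(a^{\top} x)}\, dx \\
  &= \int f_{a}(t) \log \frac{f_{a}(t)}{g_{a}(t)}\, dt
   + K\!\left(f,\; g \frac{f_{a}}{g_{a}}\right) \\
  &= K(f_{a}, g_{a}) + K\!\left(f,\; g \frac{f_{a}}{g_{a}}\right).
\end{aligned}
```

Since the left-hand side does not depend on a, minimising the second term over a amounts to maximising $K(f_{a}, g_{a})$. Note also that $g \frac{f_{a}}{g_{a}}$ is a density, because integrating it over x reduces to $\int g_{a}(t) \frac{f_{a}(t)}{g_{a}(t)}\, dt = 1$.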

Remark 1.2. First, in a similar manner to the analytic approach, this methodology enables us to approximate and even to represent f from g:

To obtain an approximation of f, Huber either stops his algorithm when the Kullback–Leibler divergence equals zero, i.e., $K(f, g^{(j)}) = 0$ implies $g^{(j)} = f$ with $j \leq d$, or when his algorithm reaches the $d^{\text{th}}$ iteration, i.e., he approximates f with $g^{(d)}$.

To obtain a representation of f, Huber stops his algorithm when the Kullback–Leibler divergence equals zero, since $K(f, g^{(j)}) = 0$ implies $g^{(j)} = f$. Therefore, since by induction we have $g^{(j)} = g \prod_{i=1}^{j} \frac{f_{a_{i}}}{g_{a_{i}}^{(i-1)}}$ with $g^{(0)} = g$, we then obtain

$$f = g \prod_{i=1}^{j} \frac{f_{a_{i}}}{g_{a_{i}}^{(i-1)}}.$$

Second, he gets $K(f, g^{(0)}) \geq K(f, g^{(1)}) \geq \cdots \geq 0$ with $g^{(0)} = g$.

1.3. Proposal

Let us first introduce the concept of ϕ-divergence.

Let ϕ be a strictly convex function defined by $ϕ : \overline{\mathbb{R}^{+}} \to \overline{\mathbb{R}^{+}}$ and such that $ϕ(1) = 0$. We define a ϕ-divergence of P from Q, where P and Q are two probability distributions over a space Ω such that Q is absolutely continuous with respect to P, by

$$D_{ϕ}(Q, P) = \int ϕ\left(\frac{dQ}{dP}\right) dP$$

or $D_{ϕ}(q, p) = \int ϕ\left(\frac{q(x)}{p(x)}\right) p(x)\, dx$, if P and Q have densities p and q respectively.

Throughout this article, we will also assume that $ϕ(0) < \infty$, that $ϕ'$ is continuous and that this divergence is greater than the $L^{1}$ distance; see also Appendix A.1 page 1604.
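As an illustration of the definition, the following Python sketch evaluates $D_{ϕ}(q, p)$ numerically for two univariate densities sampled on a grid. The function names, the grid, the two Gaussian densities and the two choices of ϕ (Kullback–Leibler and a Hellinger-type function) are assumptions made for this example only; they are not taken from the article.

```python
import numpy as np

def phi_kl(t):
    """phi(t) = t log t - t + 1 (Kullback-Leibler), with phi(0) = 1."""
    t = np.asarray(t, dtype=float)
    safe = np.where(t > 0, t, 1.0)           # avoid log(0); t log t -> 0 as t -> 0
    return np.where(t > 0, t * np.log(safe), 0.0) - t + 1.0

def phi_hellinger(t):
    """phi(t) = 2 (sqrt(t) - 1)^2 (Hellinger-type divergence), with phi(0) = 2."""
    t = np.asarray(t, dtype=float)
    return 2.0 * (np.sqrt(t) - 1.0) ** 2

def trapezoid(y, x):
    """Composite trapezoidal rule on a 1-D grid."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def phi_divergence(q, p, x, phi):
    """D_phi(q, p) = int phi(q/p) p dx, with q and p sampled on the grid x (p > 0)."""
    return trapezoid(phi(q / p) * p, x)

def normal_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

x = np.linspace(-12.0, 12.0, 20001)
p = normal_pdf(x, 0.0, 1.0)                  # reference density
q = normal_pdf(x, 1.0, 1.5)                  # density compared against p

d_kl = phi_divergence(q, p, x, phi_kl)
d_hellinger = phi_divergence(q, p, x, phi_hellinger)
# Closed form of the Kullback-Leibler divergence of N(1, 1.5^2) from N(0, 1).
kl_exact = np.log(1.0 / 1.5) + (1.5 ** 2 + 1.0 ** 2) / 2.0 - 0.5
```

Both functions satisfy $ϕ(1) = 0$ and $ϕ(0) < \infty$, in line with the assumptions above; the Kullback–Leibler value can be compared with `kl_exact` as a sanity check of the quadrature.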

Now, let us introduce our algorithm.

We start by performing the $D_{ϕ}(g, f) = 0$ test; should this test turn out to be positive, then $f = g$ and the algorithm stops. Otherwise, the first step of our algorithm consists in defining a vector $a_{1}$ and a density $g^{(1)}$ by

$$a_{1} = \arg\inf_{a \in \mathbb{R}_{*}^{d}} D_{ϕ}\left(g \frac{f_{a}}{g_{a}}, f\right) \quad \text{and} \quad g^{(1)} = g \frac{f_{a_{1}}}{g_{a_{1}}} \tag{1.3}$$

Later on, we will prove that $a_{1}$ simultaneously optimises (1.1), (1.2) and (1.3).

In our second step, we replace g with $g^{(1)}$ and repeat the first step.

By iterating this process, we obtain a sequence $(a_{1}, a_{2}, \ldots)$ of vectors in $\mathbb{R}_{*}^{d}$ and a sequence of densities $g^{(i)}$.

We will thus prove that the underlying structures of f evidenced through this method are identical to the ones obtained through Huber’s method. We will also evidence the above structures, which will enable us to infer more information on f—see example below.

Remark 1.3. As in the previous algorithm, we first provide an approximation and even a representation of f from g:

To obtain an approximation of f, we stop our algorithm when the divergence equals zero, i.e., $D_{ϕ}(g^{(j)}, f) = 0$ implies $g^{(j)} = f$ with $j \leq d$, or when our algorithm reaches the $d^{\text{th}}$ iteration, i.e., we approximate f with $g^{(d)}$.

To obtain a representation of f, we stop our algorithm when the divergence equals zero. Therefore, since by induction we have $g^{(j)} = g \prod_{i=1}^{j} \frac{f_{a_{i}}}{g_{a_{i}}^{(i-1)}}$ with $g^{(0)} = g$, we then obtain

$$f = g \prod_{i=1}^{j} \frac{f_{a_{i}}}{g_{a_{i}}^{(i-1)}}.$$

Second, we get $D_{ϕ}(g^{(0)}, f) \geq D_{ϕ}(g^{(1)}, f) \geq \cdots \geq 0$ with $g^{(0)} = g$.

Finally, the specific form of relationship (1.3) establishes that we deal with M-estimation. We can therefore state that our method is more robust than Huber’s—see Yohai [4], Toma [5] as well as Huber [6].

At present, let us study two examples:

Example 1.1. Let f be a density defined on $\mathbb{R}^{3}$ by $f(x_{1}, x_{2}, x_{3}) = n(x_{1}, x_{2})\, h(x_{3})$, with n being a bi-dimensional Gaussian density and h being a non-Gaussian density. Let us also consider g, a Gaussian density with the same mean and variance as f.

Since $g(x_{1}, x_{2} / x_{3}) = n(x_{1}, x_{2})$, we then have $D_{ϕ}\left(g \frac{f_{3}}{g_{3}}, f\right) = D_{ϕ}(n \cdot f_{3}, f) = D_{ϕ}(f, f) = 0$ as $f_{3} = h$, i.e., the function $a \mapsto D_{ϕ}\left(g \frac{f_{a}}{g_{a}}, f\right)$ reaches zero for $e_{3} = (0, 0, 1)'$, where $f_{3}$ and $g_{3}$ are the third marginal densities of f and g respectively.

We therefore obtain $g(x_{1}, x_{2} / x_{3}) = f(x_{1}, x_{2} / x_{3})$.
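The mechanism of Example 1.1 can be reproduced numerically in a two-dimensional analogue. The sketch below is an illustration under assumptions that are not in the article: f is the product of a standard Gaussian density in $x_{1}$ and a two-component Gaussian mixture in $x_{2}$, ϕ is the Kullback–Leibler function, the directions are scanned on a coarse grid of angles, and the integral is replaced by a Riemann sum on a truncated grid.

```python
import numpy as np

def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def phi_kl(t):
    """phi(t) = t log t - t + 1; t > 0 everywhere on the grid used below."""
    return t * np.log(t) - t + 1.0

# Hypothetical toy setting: f(x1, x2) = n(x1) h(x2), with n = N(0, 1) and
# h = 0.5 N(-M, S2) + 0.5 N(M, S2), a non-Gaussian density.
M, S2 = 1.5, 0.36
VAR2 = S2 + M ** 2                       # variance of the second coordinate under f

grid = np.linspace(-7.0, 7.0, 281)
step = grid[1] - grid[0]
X1, X2 = np.meshgrid(grid, grid, indexing="ij")

def h(u):
    return 0.5 * normal_pdf(u, -M, S2) + 0.5 * normal_pdf(u, M, S2)

f = normal_pdf(X1, 0.0, 1.0) * h(X2)
# g: Gaussian density with the same mean (0, 0) and covariance diag(1, VAR2) as f.
g = normal_pdf(X1, 0.0, 1.0) * normal_pdf(X2, 0.0, VAR2)

def criterion(theta):
    """Riemann-sum approximation of D_phi(g f_a / g_a, f), a = (cos theta, sin theta)."""
    a1, a2 = np.cos(theta), np.sin(theta)
    u = a1 * X1 + a2 * X2                # projection a^T x
    var_n = a1 ** 2 + a2 ** 2 * S2       # variance of each mixture component of a^T X
    f_a = 0.5 * normal_pdf(u, -a2 * M, var_n) + 0.5 * normal_pdf(u, a2 * M, var_n)
    g_a = normal_pdf(u, 0.0, a1 ** 2 + a2 ** 2 * VAR2)
    q = g * f_a / g_a
    return float(np.sum(phi_kl(q / f) * f) * step ** 2)

thetas = np.linspace(0.0, np.pi, 37)     # contains theta = pi/2, i.e. a = e_2
values = np.array([criterion(t) for t in thetas])
best_theta = thetas[int(np.argmin(values))]
```

In this toy setting the criterion vanishes (up to rounding) at $a = e_{2}$, the direction carrying the non-Gaussian factor, and is strictly positive at the other scanned directions, which mirrors the role of $e_{3}$ in the example.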

Example 1.2. Assume that the ϕ-divergence is greater than the $L^{2}$ norm. Let us consider $(X_{n})_{n \geq 0}$, a Markov chain with continuous state space E. Let f be the density of $(X_{0}, X_{1})$ and let g be the normal density with the same mean and variance as f.

Let us now assume that $D_{ϕ}(g^{(1)}, f) = 0$ with $g^{(1)}(x) = g(x) \frac{f_{1}}{g_{1}}$, i.e., let us assume that our algorithm stops for $a_{1} = (1, 0)'$. Consequently, if $(Y_{0}, Y_{1})$ is a random vector with density g, then the distribution of $X_{1}$ given $X_{0}$ is Gaussian and is equal to the distribution of $Y_{1}$ given $Y_{0}$.

Then, for any sequence $(A_{i})$, where $A_{i} \subset E$, we have

$$\begin{array}{l} P(X_{n+1} \in A_{n+1} \mid X_{0} \in A_{0}, X_{1} \in A_{1}, \dots, X_{n-1} \in A_{n-1}, X_{n} \in A_{n}) \\ \quad = P(X_{n+1} \in A_{n+1} \mid X_{n} \in A_{n}), \text{ based on the very definition of a Markov chain,} \\ \quad = P(X_{1} \in A_{1} \mid X_{0} \in A_{0}), \text{ through the Markov property,} \\ \quad = P(Y_{1} \in A_{1} \mid Y_{0} \in A_{0}), \text{ as a consequence of the above nullity of the ϕ-divergence.} \end{array}$$

To recapitulate our method, if $D_{ϕ}(g, f) = 0$, we derive f from the relationship $f = g$; should a sequence $(a_{i})_{i = 1, \ldots, j}$, $j < d$, of vectors in $\mathbb{R}_{*}^{d}$ defining $g^{(j)}$ and such that $D_{ϕ}(g^{(j)}, f) = 0$ exist, then $f(. / a_{i}^{⊤} x, 1 \leq i \leq j) = g(. / a_{i}^{⊤} x, 1 \leq i \leq j)$, i.e., f coincides with g on the complement of the vector subspace generated by the family $\{a_{i}\}_{i = 1, \ldots, j}$; see also Section 2 for a more detailed explanation.

In this paper, after having clarified the choice of g, we will consider the statistical solution to the representation problem, assuming that f is unknown and $X_{1}, X_{2}, \ldots, X_{m}$ are i.i.d. with density f. We will provide asymptotic results pertaining to the family of optimizing vectors $a_{k, m}$, which we will define more precisely below, as m goes to infinity. Our results also prove that the empirical representation scheme converges towards the theoretical one. As an application, Section 3.4 permits a new test of fit pertaining to the copula of an unknown density f, Section 3.5 gives us an estimate of a density deconvoluted with a Gaussian component and Section 3.6 presents some applications to regression analysis. Finally, we will present simulations and an application to real datasets.

2. The Algorithm

2.1. The model

As explained by Friedman [1] and Diaconis [7], the choice of g depends on the family of distributions one wants to find in f. Until now, the choice has only been to use the class of Gaussian distributions. This can be extended to the class of elliptical distributions with almost all ϕ-divergences.

Elliptical laws

The interest of this class lies in the fact that conditional densities with elliptical distributions are also elliptical—see Cambanis [8], Landsman [9]. This very property allows us to use this class in our algorithm.

Definition 2.1. X is said to abide by a multivariate elliptical distribution, noted $X \sim E_{d}(μ, Σ, ξ_{d})$, if X has the following density, for any x in $\mathbb{R}^{d}$:

$$f_{X}(x) = \frac{c_{d}}{|Σ|^{1/2}}\, ξ_{d}\left(\frac{1}{2} (x - μ)' Σ^{-1} (x - μ)\right)$$

with Σ being a $d \times d$ positive-definite matrix and μ being a d-column vector,
with $ξ_{d}$ being referred to as the “density generator”,
with $c_{d}$ being a normalisation constant such that $c_{d} = \frac{Γ(d/2)}{(2π)^{d/2}} \left(\int_{0}^{\infty} x^{d/2 - 1} ξ_{d}(x)\, dx\right)^{-1}$, with $\int_{0}^{\infty} x^{d/2 - 1} ξ_{d}(x)\, dx < \infty$.

Property 2.1.

1/ For any $X \sim E_{d}(μ, Σ, ξ_{d})$, for any A being an $m \times d$ matrix with rank $m \leq d$, and for any b being an m-dimensional vector, we have $AX + b \sim E_{m}(Aμ + b, AΣA', ξ_{m})$.

Therefore, any marginal density of a multivariate elliptical distribution is elliptical, i.e., $X = (X_{1}, X_{2}, \ldots, X_{d}) \sim E_{d}(μ, Σ, ξ_{d}) \Rightarrow X_{i} \sim E_{1}(μ_{i}, σ_{i}^{2}, ξ_{1})$, with $f_{X_{i}}(x) = \frac{c_{1}}{σ_{i}}\, ξ_{1}\left(\frac{1}{2} \left(\frac{x - μ_{i}}{σ_{i}}\right)^{2}\right)$, $1 \leq i \leq d$.

2/ Corollary 5 of Cambanis [8] states that conditional densities with elliptical distributions are also elliptical. Indeed, if $X = (X_{1}, X_{2})' \sim E_{d}(μ, Σ, ξ_{d})$, with $X_{1}$ (resp. $X_{2}$) being of size $d_{1} < d$ (resp. $d_{2} < d$), then $X_{1} / (X_{2} = a) \sim E_{d_{1}}(μ', Σ', ξ_{d_{1}})$ with $μ' = μ_{1} + Σ_{12} Σ_{22}^{-1} (a - μ_{2})$ and $Σ' = Σ_{11} - Σ_{12} Σ_{22}^{-1} Σ_{21}$, with $μ = (μ_{1}, μ_{2})$ and $Σ = (Σ_{ij})_{1 \leq i, j \leq 2}$.
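The conditional parameters $μ'$ and $Σ'$ of part 2/ are straightforward to compute. The following Python sketch is a direct transcription of the two formulas; the function name, the argument layout (the first `d1` coordinates form $X_{1}$) and the numerical example are assumptions chosen for illustration.

```python
import numpy as np

def conditional_parameters(mu, sigma, d1, a):
    """Parameters of X_1 given X_2 = a for X = (X_1, X_2) ~ E_d(mu, sigma, xi_d).

    X_1 collects the first d1 coordinates. Returns (mu', Sigma') with
    mu' = mu_1 + S12 S22^{-1} (a - mu_2) and Sigma' = S11 - S12 S22^{-1} S21.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    a = np.asarray(a, dtype=float)
    mu1, mu2 = mu[:d1], mu[d1:]
    s11, s12 = sigma[:d1, :d1], sigma[:d1, d1:]
    s21, s22 = sigma[d1:, :d1], sigma[d1:, d1:]
    # Solve linear systems rather than forming S22^{-1} explicitly.
    mu_cond = mu1 + s12 @ np.linalg.solve(s22, a - mu2)
    sigma_cond = s11 - s12 @ np.linalg.solve(s22, s21)
    return mu_cond, sigma_cond

# Hypothetical bivariate example: condition the first coordinate on X_2 = 1.
mu_c, sigma_c = conditional_parameters([0.0, 0.0], [[2.0, 1.0], [1.0, 1.0]], 1, [1.0])
```

Only the location and scale parameters are computed here; the generator $ξ_{d_{1}}$ of the conditional law is left aside.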

Remark 2.1. Landsman [9] shows that multivariate Gaussian distributions derive from $ξ_{d}(x) = e^{-x}$. He also shows that if $X = (X_{1}, \ldots, X_{d})$ has an elliptical density such that its marginals verify $E(X_{i}) < \infty$ and $E(X_{i}^{2}) < \infty$ for $1 \leq i \leq d$, then μ is the mean of X and Σ is a multiple of the covariance matrix of X. Consequently, from now on, we will assume that we are in this case.
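The link between the generator $ξ_{d}(x) = e^{-x}$ and the Gaussian density can be verified numerically from Definition 2.1. The sketch below computes $c_{d}$ by quadrature and evaluates the resulting elliptical density; the function names, the truncation of the integral at a finite upper bound and the grid size are assumptions of this illustration.

```python
import math
import numpy as np

def normalising_constant(d, xi, upper=12.0, num=200001):
    """c_d = Gamma(d/2) / (2 pi)^(d/2) * (int_0^inf x^(d/2-1) xi(x) dx)^(-1).

    The integral is computed after the substitution x = u^2, which removes the
    singularity at 0 when d = 1: int x^(d/2-1) xi(x) dx = int 2 u^(d-1) xi(u^2) du.
    """
    u = np.linspace(0.0, upper, num)
    y = 2.0 * u ** (d - 1) * xi(u ** 2)
    integral = float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(u)))
    return math.gamma(d / 2.0) / ((2.0 * math.pi) ** (d / 2.0) * integral)

def elliptical_density(x, mu, sigma, xi):
    """f_X(x) = c_d |Sigma|^(-1/2) xi(0.5 (x - mu)' Sigma^{-1} (x - mu))."""
    x, mu, sigma = (np.asarray(v, dtype=float) for v in (x, mu, sigma))
    d = mu.size
    diff = x - mu
    quad = float(diff @ np.linalg.solve(sigma, diff))
    return normalising_constant(d, xi) / math.sqrt(np.linalg.det(sigma)) * float(xi(0.5 * quad))

def gaussian_generator(t):
    """Density generator of the Gaussian family."""
    return np.exp(-t)

c1 = normalising_constant(1, gaussian_generator)   # close to (2 pi)^(-1/2)
```

With this generator the integral equals $Γ(d/2)$, so $c_{d}$ reduces to $(2π)^{-d/2}$ and the elliptical density coincides with the usual multivariate normal density.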

Definition 2.2. Let t be an elliptical density on $\mathbb{R}^{k}$ and let q be an elliptical density on $\mathbb{R}^{k'}$. The elliptical densities t and q are said to belong to the same family (or class) of elliptical densities if their generating densities, $ξ_{k}$ and $ξ_{k'}$ respectively, belong to a common given family of densities.

Example 2.1. Consider two Gaussian densities $N(0, 1)$ and $N((0, 0), Id_{2})$. They are said to belong to the same elliptical family as they both have $x \mapsto e^{-x}$ as generating density.

Choice of g

Let us begin by studying the following case:

Let f be a density on $\mathbb{R}^{d}$. Let us assume there exist d non-null linearly independent vectors $a_{j}$, with $1 \leq j \leq d$, of $\mathbb{R}^{d}$, such that

$$f(x) = n(a_{j+1}^{⊤} x, \ldots, a_{d}^{⊤} x)\, h(a_{1}^{⊤} x, \ldots, a_{j}^{⊤} x) \tag{2.1}$$

with $j < d$, with n being an elliptical density on $\mathbb{R}^{d-j}$ (n takes the $d - j$ arguments $a_{j+1}^{⊤} x, \ldots, a_{d}^{⊤} x$) and with h being a density on $\mathbb{R}^{j}$ which does not belong to the same family as n. Let $X = (X_{1}, \ldots, X_{d})$ be a vector with density f.

Define g as an elliptical distribution with the same mean and variance as f.

For simplicity, let us assume that the family $\{a_{j}\}_{1 \leq j \leq d}$ is the canonical basis of $\mathbb{R}^{d}$:

The very definition of f implies that $(X_{j+1}, \ldots, X_{d})$ is independent from $(X_{1}, \ldots, X_{j})$. Hence, the density of $(X_{j+1}, \ldots, X_{d})$ given $(X_{1}, \ldots, X_{j})$ is n.

Let us assume that $D_{ϕ}(g^{(j)}, f) = 0$ for some $j \leq d$. We then get $\frac{f(x)}{f_{a_{1}} f_{a_{2}} \cdots f_{a_{j}}} = \frac{g(x)}{g_{a_{1}}^{(1-1)} g_{a_{2}}^{(2-1)} \cdots g_{a_{j}}^{(j-1)}}$, since, by induction, we have $g^{(j)}(x) = g(x) \frac{f_{a_{1}}}{g_{a_{1}}^{(1-1)}} \frac{f_{a_{2}}}{g_{a_{2}}^{(2-1)}} \cdots \frac{f_{a_{j}}}{g_{a_{j}}^{(j-1)}}$.

Consequently, the fact that conditional densities with elliptical distributions are also elliptical enables us to infer that

$$n(a_{j+1}^{⊤} x, \ldots, a_{d}^{⊤} x) = f(. / a_{i}^{⊤} x, 1 \leq i \leq j) = g(. / a_{i}^{⊤} x, 1 \leq i \leq j).$$

In other words, f coincides with g on the complement of the vector subspace generated by the family $\{a_{i}\}_{i = 1, \ldots, j}$.

Now, if the family $\{a_{j}\}_{1 \leq j \leq d}$ is no longer the canonical basis of $\mathbb{R}^{d}$, then this family is again a basis of $\mathbb{R}^{d}$. Hence, Lemma D.1 (page 1607) implies that

$$g(. / a_{1}^{⊤} x, \ldots, a_{j}^{⊤} x) = n(a_{j+1}^{⊤} x, \ldots, a_{d}^{⊤} x) = f(. / a_{1}^{⊤} x, \ldots, a_{j}^{⊤} x) \tag{2.2}$$

which is equivalent to having $D_{ϕ}(g^{(j)}, f) = 0$, since by induction $g^{(j)} = g \frac{f_{a_{1}}}{g_{a_{1}}^{(1-1)}} \frac{f_{a_{2}}}{g_{a_{2}}^{(2-1)}} \cdots \frac{f_{a_{j}}}{g_{a_{j}}^{(j-1)}}$.

The end of our algorithm implies that f coincides with g on the complement of the vector subspace generated by the family $\{a_{i}\}_{i = 1, \ldots, j}$. Therefore, the nullity of the ϕ-divergence provides us with information on the density structure.

In summary, the following proposition clarifies our choice of g, which depends on the family of distributions one wants to find in f:

Proposition 2.1. With the above notations, $D_{ϕ}(g^{(j)}, f) = 0$ is equivalent to

$$g(. / a_{1}^{⊤} x, \ldots, a_{j}^{⊤} x) = f(. / a_{1}^{⊤} x, \ldots, a_{j}^{⊤} x).$$

More generally, the above proposition leads us to define the co-support of f as the vector space generated by the vectors $a_{1}, \ldots, a_{j}$.

Definition 2.3. Let f be a density on $\mathbb{R}^{d}$. We define the co-vectors of f as the sequence of vectors $a_{1}, \ldots, a_{j}$ which solves the problem $D_{ϕ}(g^{(j)}, f) = 0$, where g is an elliptical distribution with the same mean and variance as f. We define the co-support of f as the vector space generated by the vectors $a_{1}, \ldots, a_{j}$.

Remark 2.2. Any family $(a_{i})$ defining f as in (2.1) is an orthogonal basis of $\mathbb{R}^{d}$; see Lemma D.2.

2.2. Stochastic outline of our algorithm

Let $X_{1}, X_{2}, \ldots, X_{m}$ (resp. $Y_{1}, Y_{2}, \ldots, Y_{m}$) be a sequence of m independent random vectors with the same density f (resp. g). As is customary in nonparametric ϕ-divergence optimizations, all estimates of f and $f_{a}$, as well as all uses of Monte Carlo methods, are performed using subsamples $X_{1}, X_{2}, \ldots, X_{n}$ and $Y_{1}, Y_{2}, \ldots, Y_{n}$, extracted respectively from $X_{1}, X_{2}, \ldots, X_{m}$ and $Y_{1}, Y_{2}, \ldots, Y_{m}$, since the estimates are bounded below by some positive deterministic sequence $θ_{m}$; see Appendix B.

Let $P_{n}$ be the empirical measure of the subsample $X_{1}, X_{2}, \ldots, X_{n}$. Let $f_{n}$ (resp. $f_{a, n}$ for any a in $\mathbb{R}_{*}^{d}$) be the kernel estimate of f (resp. $f_{a}$), which is built from $X_{1}, X_{2}, \ldots, X_{n}$ (resp. $a^{⊤} X_{1}, a^{⊤} X_{2}, \ldots, a^{⊤} X_{n}$).
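A kernel estimate of a projected density $f_{a}$ can be sketched as follows. The article does not specify the kernel or the bandwidth at this point, so the Gaussian kernel, the rule-of-thumb bandwidth, the function name `projected_kde` and the simulated sample are all assumptions of this illustration.

```python
import numpy as np

def projected_kde(sample, a, bandwidth=None):
    """Gaussian-kernel estimate of f_a, the density of a^T X, from an (n, d) sample.

    The Gaussian kernel and the rule-of-thumb bandwidth are assumptions of this
    sketch; the text only requires some kernel estimate.
    """
    proj = np.asarray(sample, dtype=float) @ np.asarray(a, dtype=float)   # a^T X_i
    n = proj.size
    if bandwidth is None:
        bandwidth = 1.06 * proj.std(ddof=1) * n ** (-1.0 / 5.0)

    def f_a_n(t):
        t = np.atleast_1d(np.asarray(t, dtype=float))
        z = (t[:, None] - proj[None, :]) / bandwidth
        return np.exp(-0.5 * z ** 2).sum(axis=1) / (n * bandwidth * np.sqrt(2.0 * np.pi))

    return f_a_n

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                 # hypothetical sample with d = 3
f_hat = projected_kde(X, a=[1.0, 0.0, 1.0])   # estimate of the density of X_1 + X_3
```

The returned function can be evaluated at any set of points; the truncation of the sample that keeps the estimates bounded below by $θ_{m}$ is not reproduced here.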

As defined in Section 1.3, we introduce the following sequences $(a_{k})_{k \geq 1}$ and $(g^{(k)})_{k \geq 1}$:

• $a_{k}$ is a non-null vector of $\mathbb{R}^{d}$ such that $a_{k} = \arg\min_{a \in \mathbb{R}_{*}^{d}} D_{ϕ}\left(g^{(k-1)} \frac{f_{a}}{g_{a}^{(k-1)}}, f\right)$, (2.3)

• $g^{(k)}$ is the density such that $g^{(k)} = g^{(k-1)} \frac{f_{a_{k}}}{g_{a_{k}}^{(k-1)}}$ with $g^{(0)} = g$.

The stochastic setting up of the algorithm uses $f_{n}$ and $g_{n}^{(0)} = g$ instead of f and $g^{(0)} = g$, since g is known. Thus, at the first step, we build the vector $\check{a}_{1}$ which minimizes the ϕ-divergence between $f_{n}$ and $g \frac{f_{a, n}}{g_{a}}$ and which estimates $a_{1}$:

Proposition B.1 page 1606 and Lemma D.3 page 1607 enable us to minimize the ϕ-divergence between $f_{n}$ and $g \frac{f_{a, n}}{g_{a}}$. Defining $\check{a}_{1}$ as the argument of this minimization, Proposition 3.3 page 1589 shows us that this vector tends to $a_{1}$.

Finally, we define the density $\check{g}_{m}^{(1)}$ as $\check{g}_{m}^{(1)} = g \frac{f_{\check{a}_{1}, m}}{g_{\check{a}_{1}}}$, which estimates $g^{(1)}$ through Theorem 3.1.
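To make the first-step update concrete, the sketch below builds the plug-in density $x \mapsto g(x)\, f_{a, n}(a^{⊤} x) / g_{a}(a^{⊤} x)$ for a given direction a. It is an illustration only: g is taken to be the Gaussian density fitted to the sample, $f_{a, n}$ is a Gaussian-kernel estimate with a rule-of-thumb bandwidth, the direction is supplied by the caller rather than obtained by minimisation, and the name `first_step_density` is hypothetical.

```python
import numpy as np

def gaussian_pdf_1d(t, mean, var):
    return np.exp(-0.5 * (t - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def first_step_density(sample, a):
    """Return x -> g(x) f_{a,n}(a^T x) / g_a(a^T x) for a Gaussian g fitted to the sample."""
    sample = np.asarray(sample, dtype=float)
    a = np.asarray(a, dtype=float)
    n, d = sample.shape
    mu = sample.mean(axis=0)
    cov = np.cov(sample, rowvar=False)
    cov_inv = np.linalg.inv(cov)
    log_norm = -0.5 * (d * np.log(2.0 * np.pi) + np.linalg.slogdet(cov)[1])
    proj = sample @ a                                  # a^T X_i
    bw = 1.06 * proj.std(ddof=1) * n ** (-0.2)         # rule-of-thumb bandwidth (assumption)

    def density(x):
        x = np.atleast_2d(np.asarray(x, dtype=float))
        diff = x - mu
        g = np.exp(log_norm - 0.5 * np.einsum("ij,jk,ik->i", diff, cov_inv, diff))
        t = x @ a
        f_a = gaussian_pdf_1d(t[:, None], proj[None, :], bw ** 2).mean(axis=1)  # kernel estimate
        g_a = gaussian_pdf_1d(t, a @ mu, a @ cov @ a)  # exact marginal of g along a
        return g * f_a / g_a

    return density
```

Because $g_{a}$ is the exact marginal of g along a, the returned function integrates to one whatever the direction, which is a convenient check when experimenting with this update.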

Now, from the second step onwards and as defined in Section 1.3, the density $g^{(k-1)}$ is unknown. Consequently, once again, we have to truncate the samples:

All estimates of f and $f_{a}$ (resp. $g^{(1)}$ and $g_{a}^{(1)}$) are performed using a subsample $X_{1}, X_{2}, \ldots, X_{n}$ (resp. $Y_{1}^{(1)}, Y_{2}^{(1)}, \ldots, Y_{n}^{(1)}$) extracted from $X_{1}, X_{2}, \ldots, X_{m}$ (resp. $Y_{1}^{(1)}, Y_{2}^{(1)}, \ldots, Y_{m}^{(1)}$, which is a sequence of m independent random vectors with the same density $g^{(1)}$) such that the estimates are bounded below by some positive deterministic sequence $θ_{m}$; see Appendix B.

Let $P_{n}$ be the empirical measure of the subsample $X_{1}, X_{2}, \ldots, X_{n}$. Let $f_{n}$ (resp. $g_{n}^{(1)}$, $f_{a, n}$, $g_{a, n}^{(1)}$ for any a in $\mathbb{R}_{*}^{d}$) be the kernel estimate of f (resp. $g^{(1)}$ and $f_{a}$ as well as $g_{a}^{(1)}$) which is built from $X_{1}, X_{2}, \ldots, X_{n}$ (resp. $Y_{1}^{(1)}, Y_{2}^{(1)}, \ldots, Y_{n}^{(1)}$ and $a^{⊤} X_{1}, a^{⊤} X_{2}, \ldots, a^{⊤} X_{n}$ as well as $a^{⊤} Y_{1}^{(1)}, a^{⊤} Y_{2}^{(1)}, \ldots, a^{⊤} Y_{n}^{(1)}$). The stochastic setting up of the algorithm uses $f_{n}$ and $g_{n}^{(1)}$ instead of f and $g^{(1)}$.

Thus, we build the vector $\check{a}_{2}$ which minimizes the ϕ-divergence between $f_{n}$ and $g_{n}^{(1)} \frac{f_{a, n}}{g_{a, n}^{(1)}}$ (since $g^{(1)}$ and $g_{a}^{(1)}$ are unknown) and which estimates $a_{2}$.

Proposition B.1 page 1606 and Lemma D.3 page 1607 enable us to minimize the ϕ-divergence between $f_{n}$ and $g_{n}^{(1)} \frac{f_{a, n}}{g_{a, n}^{(1)}}$. Defining $\check{a}_{2}$ as the argument of this minimization, Proposition 3.3 page 1589 shows us that this vector tends to $a_{2}$ in n. Finally, we define the density $\check{g}_{n}^{(2)}$ as $\check{g}_{n}^{(2)} = g_{n}^{(1)} \frac{f_{\check{a}_{2}, n}}{g_{\check{a}_{2}, n}^{(1)}}$, which estimates $g^{(2)}$ through Theorem 3.1.

Proceeding in this way, we obtain a sequence $(\check{a}_{1}, \check{a}_{2}, \ldots)$ of vectors in $\mathbb{R}_{*}^{d}$ estimating the co-vectors of f and a sequence of densities $(\check{g}_{n}^{(k)})_{k}$ such that $\check{g}_{n}^{(k)}$ estimates $g^{(k)}$ through Theorem 3.1.

3. Results

3.1. Convergence results

3.1.1. Hypotheses on f

In this paragraph, we define the set of hypotheses on f which could possibly be of use in our work. Discussion on several of these hypotheses can be found in Appendix C.

In this section, for legibility, we write g in place of $g^{(k-1)}$. Let

$$Θ = \mathbb{R}^{d}, \qquad Θ^{D_{ϕ}} = \left\{ b \in Θ \;\middle|\; \int φ^{*}\left(φ'\left(\frac{g(x)}{f(x)} \frac{f_{b}(b^{⊤} x)}{g_{b}(b^{⊤} x)}\right)\right) dP < \infty \right\},$$

$$M(b, a, x) = \int φ'\left(\frac{g(x)}{f(x)} \frac{f_{b}(b^{⊤} x)}{g_{b}(b^{⊤} x)}\right) g(x) \frac{f_{a}(a^{⊤} x)}{g_{a}(a^{⊤} x)}\, dx - φ^{*}\left(φ'\left(\frac{g(x)}{f(x)} \frac{f_{b}(b^{⊤} x)}{g_{b}(b^{⊤} x)}\right)\right),$$

$$P_{n} M(b, a) = \int M(b, a, x)\, dP_{n}, \qquad P M(b, a) = \int M(b, a, x)\, dP,$$

where P is the probability measure with density f.
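The function $φ^{*}$ is not defined in this section. Assuming, as in the duality approach of Broniatowski [11], that it denotes the convex conjugate of φ, the term $φ^{*}(φ'(\cdot))$ has a simple closed form, worked out below for the Kullback–Leibler case. The shorthand $h_{b}$ and $q_{a}$ is introduced for this sketch only.

```latex
% Assumption: \phi^{*} is the convex conjugate (Legendre transform) of \phi.
\phi^{*}(s) = \sup_{t}\,\{ s\,t - \phi(t) \}
\quad\Longrightarrow\quad
\phi^{*}(\phi'(t)) = t\,\phi'(t) - \phi(t).

% Kullback--Leibler case: \phi(t) = t \log t - t + 1, \quad \phi'(t) = \log t.
\phi^{*}(\phi'(t)) = t \log t - (t \log t - t + 1) = t - 1.

% Shorthand for this sketch: h_{b} = \frac{g\, f_{b}}{f\, g_{b}}, \quad q_{a} = g\, \frac{f_{a}}{g_{a}}.
M(b, a, x) = \int \log h_{b}(y)\; q_{a}(y)\, dy \;-\; \bigl( h_{b}(x) - 1 \bigr),
\qquad
P M(a, a) = \int q_{a} \log \frac{q_{a}}{f}\, dy - \Bigl( \int q_{a}\, dy - 1 \Bigr) = K(q_{a}, f).
```

In this case $P M(a, a)$ is exactly the criterion $D_{ϕ}\left(g \frac{f_{a}}{g_{a}}, f\right)$, since $h_{a} f = q_{a}$ integrates to one.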

Similarly to Chapter V of Van der Vaart [10], let us define:

(H1): For all $ε > 0$, there is $η > 0$ such that, for all $c \in Θ^{D_{ϕ}}$ verifying $∥ c - a_{k} ∥ \geq ε$, we have $P M(c, a) - η > P M(a_{k}, a)$, with $a \in Θ$.
(H2): $\exists Z < 0, n_{0} > 0$ such that $n \geq n_{0} \Rightarrow \sup_{a \in Θ} \sup_{c \in (Θ^{D_{ϕ}})^{c}} P_{n} M(c, a) < Z$.
(H3): There is a neighbourhood V of $a_{k}$ and a positive function H such that, for all $c \in V$, we have $| M(c, a_{k}, x) | \leq H(x)$ (P-a.s.) with $P H < \infty$.
(H4): There is a neighbourhood V of $a_{k}$ such that, for all ε, there is an η such that, for all $c \in V$ and $a \in Θ$ verifying $∥ a - a_{k} ∥ \geq ε$, we have $P M(c, a_{k}) < P M(c, a) - η$.

Putting

$$I_{a_{k}} = \frac{\partial^{2}}{\partial a^{2}} D_{ϕ}\left(g \frac{f_{a_{k}}}{g_{a_{k}}}, f\right)$$

and

$$x \mapsto ρ(b, a, x) = φ'\left(\frac{g(x) f_{b}(b^{⊤} x)}{f(x) g_{b}(b^{⊤} x)}\right) \frac{g(x) f_{a}(a^{⊤} x)}{g_{a}(a^{⊤} x)},$$

we further define:

(H5): The function φ is $C^{3}$ in $(0, +\infty)$ and there is a neighbourhood $V_{k}'$ of $(a_{k}, a_{k})$ such that, for all $(b, a)$ of $V_{k}'$, the gradient $\nabla\left(\frac{g(x) f_{a}(a^{⊤} x)}{g_{a}(a^{⊤} x)}\right)$ and the Hessian $H\left(\frac{g(x) f_{a}(a^{⊤} x)}{g_{a}(a^{⊤} x)}\right)$ exist (λ-a.s.), and the first order partial derivatives of $\frac{g(x) f_{a}(a^{⊤} x)}{g_{a}(a^{⊤} x)}$ and the first and second order derivatives of $(b, a) \mapsto ρ(b, a, x)$ are dominated (λ-a.s.) by λ-integrable functions.
(H6): The function $(b, a) \mapsto M(b, a)$ is $C^{3}$ in a neighbourhood $V_{k}$ of $(a_{k}, a_{k})$ for all x; and the partial derivatives of $(b, a) \mapsto M(b, a)$ are all dominated in $V_{k}$ by a P-integrable function $H(x)$.
(H7): $P ∥ \frac{\partial}{\partial b} M(a_{k}, a_{k}) ∥^{2}$ and $P ∥ \frac{\partial}{\partial a} M(a_{k}, a_{k}) ∥^{2}$ are finite and the expressions $P \frac{\partial^{2}}{\partial b_{i} \partial b_{j}} M(a_{k}, a_{k})$ and $I_{a_{k}}$ exist and are invertible.
(H8): There exists k such that $P M(a_{k}, a_{k}) = 0$.
(H9): $(\mathrm{Var}_{P}(M(a_{k}, a_{k})))^{1/2}$ exists and is invertible.
(H0): f and g are assumed to be positive and bounded and such that $K(g, f) \geq \int | f(x) - g(x) |\, dx$.

3.1.2. Estimation of the first co-vector of f

Let $\mathcal{R}$ be the class of all positive functions r defined on $\mathbb{R}$ and such that $g(x)\, r(a^{⊤} x)$ is a density on $\mathbb{R}^{d}$ for all a belonging to $\mathbb{R}_{*}^{d}$. The following proposition shows that there exists a vector a such that $\frac{f_{a}}{g_{a}}$ minimizes $D_{ϕ}(g r, f)$ in r:

Proposition 3.1. There exists a vector a belonging to $\mathbb{R}_{*}^{d}$ such that

$$\arg\min_{r \in \mathcal{R}} D_{ϕ}(g r, f) = \frac{f_{a}}{g_{a}} \quad \text{and} \quad r(a^{⊤} x) = \frac{f_{a}(a^{⊤} x)}{g_{a}(a^{⊤} x)}.$$

Remark 3.1. This proposition proves that $a_{1}$ simultaneously optimises (1.1), (1.2) and (1.3). In other words, it proves that the underlying structures of f evidenced through our method are identical to the ones obtained through Huber’s methods.

Following Broniatowski [11], let us introduce the estimate of $D_{ϕ}\left(g \frac{f_{a, n}}{g_{a}}, f_{n}\right)$ through

$$\check{D}_{ϕ}\left(g \frac{f_{a, n}}{g_{a}}, f_{n}\right) = \int M(a, a, x)\, dP_{n}(x).$$

Proposition 3.2. Let $\check{a}$ be such that $\check{a} := \arg\inf_{a \in \mathbb{R}_{*}^{d}} \check{D}_{ϕ}\left(g \frac{f_{a, n}}{g_{a}}, f_{n}\right)$. Then, $\check{a}$ is a strongly convergent estimate of a, as defined in Proposition 3.1.

Let us also introduce the following sequences $(\check{a}_{k})_{k \geq 1}$ and $(\check{g}_{n}^{(k)})_{k \geq 1}$, for any given n (see Section 2.2):

$\check{a}_{k}$ is an estimate of $a_{k}$ as defined in Proposition 3.2 with $\check{g}_{n}^{(k-1)}$ instead of g,
$\check{g}_{n}^{(k)}$ is such that $\check{g}_{n}^{(0)} = g$, $\check{g}_{n}^{(k)}(x) = \check{g}_{n}^{(k-1)}(x) \frac{f_{\check{a}_{k}, n}(\check{a}_{k}^{⊤} x)}{[\check{g}^{(k-1)}]_{\check{a}_{k}, n}(\check{a}_{k}^{⊤} x)}$, i.e., $\check{g}_{n}^{(k)}(x) = g(x) \prod_{j=1}^{k} \frac{f_{\check{a}_{j}, n}(\check{a}_{j}^{⊤} x)}{[\check{g}^{(j-1)}]_{\check{a}_{j}, n}(\check{a}_{j}^{⊤} x)}$.

We also note that $\check{g}_{n}^{(k)}$ is a density.

3.1.3. Convergence study at the $k^{\text{th}}$ step of the algorithm

In this paragraph, we will show that the sequence $(\check{a}_{k})_{n}$ converges towards $a_{k}$ and that the sequence $(\check{g}_{n}^{(k)})_{n}$ converges towards $g^{(k)}$.

Let $\check{c}_{n}(a) = \arg\sup_{c \in Θ} P_{n} M(c, a)$, with $a \in Θ$, and $\check{γ}_{n} = \arg\inf_{a \in Θ} \sup_{c \in Θ} P_{n} M(c, a)$. We state

Proposition 3.3. Almost surely, $\sup_{a \in Θ} ∥ \check{c}_{n}(a) - a_{k} ∥$ converges toward 0 and $\check{γ}_{n}$ converges toward $a_{k}$.

Finally, the following theorem shows that $\check{g}_{n}^{(k)}$ converges almost everywhere towards $g^{(k)}$:

Theorem 3.1. It holds $\check{g}_{n}^{(k)} \to_{n} g^{(k)}$ a.s.

3.2. Asymptotic Inference at the $k^{\text{th}}$ step of the algorithm

The following theorem shows that $\check{g}_{n}^{(k)}$ converges towards $g^{(k)}$ at the rate $O_{P}(n^{-\frac{2}{2+d}})$ in three different cases, namely for any given x, with the $L^{1}$ distance and with the Kullback–Leibler divergence:

Theorem 3.2. It holds

$$| \check{g}_{n}^{(k)}(x) - g^{(k)}(x) | = O_{P}(n^{-\frac{2}{2+d}}),$$

$$\int | \check{g}_{n}^{(k)}(x) - g^{(k)}(x) |\, dx = O_{P}(n^{-\frac{2}{2+d}})$$

and

$$| K(\check{g}_{n}^{(k)}, f) - K(g^{(k)}, f) | = O_{P}(n^{-\frac{2}{2+d}}).$$
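To get a feel for how the rate of Theorem 3.2 degrades with the dimension, the short Python sketch below tabulates the order $n^{-2/(2+d)}$ for a few sample sizes. It ignores the constants hidden in the $O_{P}$ notation, and the function name and the chosen values of n and d are arbitrary.

```python
def kernel_rate(n, d):
    """Order n^(-2 / (2 + d)) appearing in Theorem 3.2 (constants are ignored)."""
    return n ** (-2.0 / (2.0 + d))

# One line per dimension: the order of the error for n = 10^2, 10^4 and 10^6.
for d in (1, 2, 5, 10):
    print(d, [round(kernel_rate(n, d), 4) for n in (10**2, 10**4, 10**6)])
```

For a fixed sample size the order increases with d, so reaching a given accuracy requires many more observations in higher dimension.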

The following theorem shows that the laws of our estimators of $a_{k}$, namely $\check{c}_{n}(a_{k})$ and $\check{γ}_{n}$, converge towards a linear combination of Gaussian variables.

Theorem 3.3. It holds

$$\sqrt{n}\, A \cdot (\check{c}_{n}(a_{k}) - a_{k}) \overset{\text{Law}}{\longrightarrow} B \cdot \mathcal{N}_{d}\left(0, P ∥ \tfrac{\partial}{\partial b} M(a_{k}, a_{k}) ∥^{2}\right) + C \cdot \mathcal{N}_{d}\left(0, P ∥ \tfrac{\partial}{\partial a} M(a_{k}, a_{k}) ∥^{2}\right)$$

and

$$\sqrt{n}\, A \cdot (\check{γ}_{n} - a_{k}) \overset{\text{Law}}{\longrightarrow} C \cdot \mathcal{N}_{d}\left(0, P ∥ \tfrac{\partial}{\partial b} M(a_{k}, a_{k}) ∥^{2}\right) + C \cdot \mathcal{N}_{d}\left(0, P ∥ \tfrac{\partial}{\partial a} M(a_{k}, a_{k}) ∥^{2}\right)$$

where

$$A = P \frac{\partial^{2}}{\partial b \partial b} M(a_{k}, a_{k}) \left( P \frac{\partial^{2}}{\partial a_{i} \partial a_{j}} M(a_{k}, a_{k}) + P \frac{\partial^{2}}{\partial a_{i} \partial b_{j}} M(a_{k}, a_{k}) \right),$$

$$C = P \frac{\partial^{2}}{\partial b \partial b} M(a_{k}, a_{k})$$

and

$$B = P \frac{\partial^{2}}{\partial b \partial b} M(a_{k}, a_{k}) + P \frac{\partial^{2}}{\partial a_{i} \partial a_{j}} M(a_{k}, a_{k}) + P \frac{\partial^{2}}{\partial a_{i} \partial b_{j}} M(a_{k}, a_{k}).$$

3.3. A stopping rule for the procedure

In this paragraph, we will call $\check{g}_{n}^{(k)}$ (resp. $\check{g}_{a, n}^{(k)}$) the kernel estimator of $\check{g}^{(k)}$ (resp. $\check{g}_{a}^{(k)}$). We will first show that $g_{n}^{(k)}$ converges towards f in k and n. Then, we will provide a stopping rule for this identification procedure.

3.3.1. Estimation of f

The following proposition provides us with an estimate of f:

Theorem 3.4.

We have

{lim}_{n} {lim}_{k} {\overset{ˇ}{g}}_{n}^{(k)} = f

a.s.

Consequently, the following corollary shows that

D_{ϕ} (g_{n}^{(k - 1)} \frac{f_{a_{k}, n}}{g_{a_{k}, n}^{(k - 1)}}, f_{a_{k}, n})

converges towards zero as k and then as n go to infinity:

Corollary 3.1.

We have

{lim}_{n} {lim}_{k} D_{ϕ} ({\overset{ˇ}{g}}_{n}^{(k)} \frac{f_{a_{k}, n}}{{[{\overset{ˇ}{g}}^{(k)}]}_{a_{k}, n}}, f_{n}) = 0

a.s.

3.3.2. Testing of the criteria

In this paragraph, through a test of our criteria, namely

a \mapsto D_{ϕ} ({\overset{ˇ}{g}}_{n}^{(k)} \frac{f_{a, n}}{{[{\overset{ˇ}{g}}^{(k)}]}_{a, n}}, f_{n})

, we will build a stopping rule for this procedure. First, the next theorem enables us to derive the law of our criteria:

Theorem 3.5.

For a fixed k, we have

\sqrt{n} {(Var_{P} (M ({\overset{ˇ}{c}}_{n} ({\overset{ˇ}{γ}}_{n}), {\overset{ˇ}{γ}}_{n})))}^{- 1 / 2} (P_{n} M ({\overset{ˇ}{c}}_{n} ({\overset{ˇ}{γ}}_{n}), {\overset{ˇ}{γ}}_{n}) - P_{n} M (a_{k}, a_{k})) \overset{Law}{\to} N (0, I),

where k represents the

k^{t h}

step of our algorithm and where I is the identity matrix in

R^{d}

.

Note that k is fixed in Theorem 3.5 since

{\overset{ˇ}{γ}}_{n} = arg {inf}_{a \in Θ} {sup}_{c \in Θ} P_{n} M (c, a)

where M is a known function of k—see Section 3.1. Thus, in the case when

D_{ϕ} (g^{(k - 1)} \frac{f_{a_{k}}}{g_{a_{k}}^{(k - 1)}}, f) = 0

, we obtain

Corollary 3.2.

We have

\sqrt{n} {(Var_{P} (M ({\overset{ˇ}{c}}_{n} ({\overset{ˇ}{γ}}_{n}), {\overset{ˇ}{γ}}_{n})))}^{- 1 / 2} P_{n} M ({\overset{ˇ}{c}}_{n} ({\overset{ˇ}{γ}}_{n}), {\overset{ˇ}{γ}}_{n}) \overset{Law}{\to} N (0, I)

.

Hence, we propose the test of the null hypothesis

(H_{0}) : D_{ϕ} (g^{(k - 1)} \frac{f_{a_{k}}}{g_{a_{k}}^{(k - 1)}}, f) = 0 versus the alternative (H_{1}) : D_{ϕ} (g^{(k - 1)} \frac{f_{a_{k}}}{g_{a_{k}}^{(k - 1)}}, f) \neq 0 .

Based on this result, we stop the algorithm. Then, defining

a_{k}

as the last vector generated, we derive from Corollary 3.2 an α-level confidence ellipsoid around

a_{k}

, namely

E_{k} = {b \in R^{d}; \sqrt{n} {(Var_{P} (M (b, b)))}^{- 1 / 2} P_{n} M (b, b) \leq q_{α}^{N (0, 1)}}

where

q_{α}^{N (0, 1)}

is the α-level quantile of the standard (centered and reduced) normal distribution and where

P_{n}

is the empirical measure arising from a realization of the sequences

(X_{1}, \dots, X_{n})

and

(Y_{1}, \dots, Y_{n})

.

Consequently, the following corollary provides us with a confidence region for the above test:

Corollary 3.3.

E_{k}

is a confidence region for the test of the null hypothesis

(H_{0})

versus

(H_{1})

.
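The decision rule behind the ellipsoid can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the author's program: `m_values` is a hypothetical list holding the evaluations of M(b, b) at each observation, the unknown variance under P is replaced by the empirical variance, and the quantile `q_alpha` is supplied by the user.

```python
import math
import statistics


def stopping_test(m_values, q_alpha):
    """Return (statistic, accept_h0) for the rule sqrt(n) * Var^{-1/2} * P_n M <= q_alpha.

    m_values: hypothetical evaluations of M(b, b), one per observation.
    q_alpha:  quantile of the standard normal distribution chosen by the user.
    """
    n = len(m_values)
    mean_m = statistics.mean(m_values)    # P_n M(b, b)
    std_m = statistics.stdev(m_values)    # empirical stand-in for Var_P(M(b, b))^{1/2}
    statistic = math.sqrt(n) * mean_m / std_m
    return statistic, statistic <= q_alpha
```

A call such as `stopping_test(values, 0.2533)` returns the standardised statistic together with a boolean saying whether b falls inside the region; the algorithm would stop when the boolean is true.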

3.4. Goodness-of-fit test for copulas

Let us begin with studying the following case:

Let f be a density defined on

R^{2}

and let g be an elliptical density with the same mean and variance as f. Assume first that our algorithm leads to

D_{ϕ} (g^{(2)}, f) = 0

where the family

(a_{i})

is the canonical basis of

R^{2}

. Hence, we have

g^{(2)} (x) = g (x) \frac{f_{1}}{g_{1}} \frac{f_{2}}{g_{2}^{(1)}} = g (x) \frac{f_{1}}{g_{1}} \frac{f_{2}}{g_{2}}

—through Lemma D.4 page 1608—and

g^{(2)} = f

. Therefore,

f = g (x) \frac{f_{1}}{g_{1}} \frac{f_{2}}{g_{2}},

i.e.,

\frac{f}{f_{1} f_{2}} = \frac{g}{g_{1} g_{2}}

, and then

\frac{\partial^{2}}{\partial x \partial y} C_{f} = \frac{\partial^{2}}{\partial x \partial y} C_{g}

where

C_{f}

(resp.

C_{g}

) is the copula of f (resp. g).
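The last implication rests on the standard link between a joint density and its copula density (Sklar's theorem). The worked step below is a sketch; the marginal distribution functions F_1, F_2 (of f) and G_1, G_2 (of g) are notation introduced here for illustration and are assumed continuous.

```latex
f(x, y) = \frac{\partial^{2} C_{f}}{\partial u \, \partial v}\bigl(F_{1}(x), F_{2}(y)\bigr) \, f_{1}(x) \, f_{2}(y)
\quad\Longrightarrow\quad
\frac{f(x, y)}{f_{1}(x) \, f_{2}(y)} = \frac{\partial^{2} C_{f}}{\partial u \, \partial v}\bigl(F_{1}(x), F_{2}(y)\bigr),
\qquad
\frac{g(x, y)}{g_{1}(x) \, g_{2}(y)} = \frac{\partial^{2} C_{g}}{\partial u \, \partial v}\bigl(G_{1}(x), G_{2}(y)\bigr).
```

Read this way, the equality of the two ratios says that the copula densities of f and g agree when each is evaluated at its own marginal distribution functions.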

At present, let f be a density on

R^{d}

and let g be the density defined in Section 2.1.

Let us assume that our algorithm implies that

D_{ϕ} (g^{(d)}, f) = 0

.

Hence, we have, for any

x \in R^{d}

,

g (x) Π_{k = 1}^{d} \frac{f_{a_{k}} (a_{k}^{⊤} x)}{{[g^{(k - 1)}]}_{a_{k}} (a_{k}^{⊤} x)} = f (x)

, i.e.,

\frac{g (x)}{Π_{k = 1}^{d} g_{a_{k}} (a_{k}^{⊤} x)} = \frac{f (x)}{Π_{k = 1}^{d} f_{a_{k}} (a_{k}^{⊤} x)}

, since Lemma D.4 page 1608 implies that

g_{a_{k}}^{(k - 1)} = g_{a_{k}}

if

k \leq d

.

Moreover, the family

{(a_{i})}_{i = 1 . . . d}

is a basis of

R^{d}

—see Lemma D.5 page 1608. Hence, putting

A = (a_{1}, . . ., a_{d})

and defining vector y (resp. density

\tilde{f}

, copula

{\tilde{C}}_{f}

of

\tilde{f}

, density

\tilde{g}

, copula

{\tilde{C}}_{g}

of

\tilde{g}

) as the expression of vector x (resp. density f, copula

C_{f}

of f, density g, copula

C_{g}

of g) in basis A, the above equality implies

\frac{\partial^{d}}{\partial y_{1} . . . \partial y_{d}} {\tilde{C}}_{f} = \frac{\partial^{d}}{\partial y_{1} . . . \partial y_{d}} {\tilde{C}}_{g} .

Finally, we perform a statistical test of the null hypothesis

(H_{0})

:

\frac{\partial^{d}}{\partial y_{1} . . . \partial y_{d}} {\tilde{C}}_{f} = \frac{\partial^{d}}{\partial y_{1} . . . \partial y_{d}} {\tilde{C}}_{g}

versus the alternative

(H_{1})

:

\frac{\partial^{d}}{\partial y_{1} . . . \partial y_{d}} {\tilde{C}}_{f} \neq \frac{\partial^{d}}{\partial y_{1} . . . \partial y_{d}} {\tilde{C}}_{g}

. Since, under

(H_{0})

, we have

D_{ϕ} (g^{(d)}, f) = 0

, then, as explained in Section 3.3, Corollary 3.3 provides us with a confidence region for our test.

Theorem 3.6.

Keeping the notations of Corollary 3.3, we infer that

E_{d}

is a confidence region for the test of the null hypothesis

(H_{0})

versus the alternative hypothesis

(H_{1})

.

3.5. Rewriting of the convolution product

In the present paper, we first elaborated an algorithm aiming at isolating several known structures from initial data. Our objective was to verify whether, for a known density f on

R^{d}

, a known density n on

R^{d - j - 1}

such that, for

d > 1

,

f (x) = n (a_{j + 1}^{⊤} x, . . ., a_{d}^{⊤} x) h (a_{1}^{⊤} x, . . ., a_{j}^{⊤} x)

(3.1)

did indeed exist, with

j < d

, with

(a_{1}, \dots, a_{d})

being a basis of

R^{d}

and with h being a density on

R^{j}

.

Secondly, our next step consisted in building an estimate (resp. a representation) of f without necessarily assuming that f meets relationship (3.1)—see Theorem 3.4.

Consequently, let us consider

Z_{1}

and

Z_{2}

, two random vectors with respective densities

h_{1}

and

h_{2}

—which is elliptical—on

R^{d}

. Let us consider a random vector X such that

X = Z_{1} + Z_{2}

and let f be its density. This density can then be written as

f (x) = h_{1} * h_{2} (x) = \int_{R^{d}} h_{1} (t) h_{2} (x - t) d t .
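As a quick numerical illustration, the density of a sum of independent variables can be checked against the convolution integral of h_1(t) h_2(x - t) over t. The sketch below is restricted to dimension 1 and uses two Gaussian densities chosen purely for illustration, because their sum has a known closed form; the function names are hypothetical.

```python
import math


def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def convolve_at(h1, h2, x, lower=-30.0, upper=30.0, steps=6000):
    """Approximate (h1 * h2)(x) = integral of h1(t) * h2(x - t) dt by the midpoint rule."""
    width = (upper - lower) / steps
    total = 0.0
    for i in range(steps):
        t = lower + (i + 0.5) * width
        total += h1(t) * h2(x - t)
    return total * width


# Illustrative choice: Z1 ~ N(0, 1) and Z2 ~ N(1, 2^2), so that Z1 + Z2 ~ N(1, 5).
h1 = lambda t: normal_pdf(t, 0.0, 1.0)
h2 = lambda t: normal_pdf(t, 1.0, 2.0)
approx = convolve_at(h1, h2, 0.5)
exact = normal_pdf(0.5, 1.0, math.sqrt(5.0))
```

Here `approx` and `exact` should agree closely, which is the property the product representation of the following proposition is meant to reproduce without an integral.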

Then, the following property enables us to represent f under the form of a product and without the integral sign.

Proposition 3.4.

Let φ be a centered elliptical density with

σ^{2} . I_{d}

,

σ^{2} > 0

, as covariance matrix, such that it is a product density in all orthogonal coordinate systems and such that its characteristic function

s \mapsto Ψ (\frac{1}{2} | s |^{2} σ^{2})

is integrable—see Landsman [9]. Let f be a density on

R^{d}

which can be deconvoluted with ϕ, i.e.,

f (x) = \bar{f} * ϕ (x) = \int_{R^{d}} \bar{f} (t) ϕ (x - t) d t,

where

\bar{f}

is some density on

R^{d}

. Let

g^{(0)}

be the elliptical density belonging to the same elliptical family as f and having same mean and variance as f.

Then, the sequence

{(g^{(k)})}_{k}

converges uniformly a.s. and in

L^{1}

towards f in k, i.e.,

lim_{k \to \infty} sup_{x \in R^{d}} | g^{(k)} (x) - f (x) | = 0, and lim_{k \to \infty} \int_{R^{d}} | g^{(k)} (x) - f (x) | d x = 0 .

Finally, with the notations of Section 3.3 and of Proposition 3.4, the following theorem enables us to estimate any convolution product of a multivariate elliptical density φ with a continuous density

\bar{f}

:

Theorem 3.7.

It holds

{lim}_{n} {lim}_{k} {\overset{ˇ}{g}}_{n}^{(k)} = \bar{f} * ϕ a.s.

3.6. On the regression

In this section, we will study several applications of our algorithm pertaining to the regression analysis. We define

(X_{1}, . . ., X_{d})

(resp.

(Y_{1}, . . ., Y_{d})

) as a vector with density f (resp. g—see Section 2.1).

Remark 3.2.

In this paragraph, we will work in the

L^{2}

space. Then, we will first only consider the

ϕ -

divergences which are greater than or equal to the

L^{2}

distance—see Vajda [12]. Note also that the co-vectors of f can be obtained in the

L^{2}

space—see Lemma D.3 and Proposition B.1.

3.6.1. The basic idea

In this paragraph, we will assume that

Θ = R_{*}^{2}

and that our algorithm stops for

j = 1

and

a_{1} = {(0, 1)}^{'}

. The following theorem provides us with the regression of

X_{1}

on

X_{2}

:

Theorem 3.8.

The probability measure of

X_{1}

given

X_{2}

is the same as the probability measure of

Y_{1}

given

Y_{2}

. Moreover, the regression between

X_{1}

and

X_{2}

is

X_{1} = E (Y_{1} / Y_{2}) + ε,

where ε is a centered random variable orthogonal to

E (X_{1} / X_{2})

.

Remark 3.3.

This theorem implies that

E (X_{1} / X_{2}) = E (Y_{1} / Y_{2})

. This equation can be used in many fields of research. The Markov chain theory has been used for instance in Example 1.2.

Moreover, if g is a Gaussian density with same mean and variance as f, then Saporta [14] implies that

E (Y_{1} / Y_{2}) = E (Y_{1}) + \frac{Cov (Y_{1}, Y_{2})}{Var (Y_{2})} (Y_{2} - E (Y_{2}))

and then

X_{1} = E (Y_{1}) + \frac{Cov (Y_{1}, Y_{2})}{Var (Y_{2})} (Y_{2} - E (Y_{2})) + ε .

3.6.2. General case

In this paragraph, we will assume that

Θ = R_{*}^{d}

and that our algorithm stops with j for

j < d

. Lemma D.6 implies the existence of an orthogonal and free family

{(b_{i})}_{i = j + 1, . ., d}

of

R_{*}^{d}

such that

R^{d} = Vect {a_{i}} \overset{⊥}{\oplus} Vect {b_{k}}

and such that

\begin{matrix} g (b_{j + 1}^{⊤} x, . . ., b_{d}^{⊤} x / a_{1}^{⊤} x, . . ., a_{j}^{⊤} x) = f (b_{j + 1}^{⊤} x, . . ., b_{d}^{⊤} x / a_{1}^{⊤} x, . . ., a_{j}^{⊤} x) \end{matrix}

(3.2)

Hence, the following theorem provides us with the regression of

b_{k}^{⊤} X

,

k = j + 1, . . ., d

, on

(a_{1}^{⊤} X, . . ., a_{j}^{⊤} X)

:

Theorem 3.9.

The probability measure of

(b_{j + 1}^{⊤} X, . . ., b_{d}^{⊤} X)

given

(a_{1}^{⊤} X, . . ., a_{j}^{⊤} X)

is the same as the probability measure of

(b_{j + 1}^{⊤} Y, . . ., b_{d}^{⊤} Y)

given

(a_{1}^{⊤} Y, . . ., a_{j}^{⊤} Y)

. Moreover, the regression of

b_{k}^{⊤} X

,

k = j + 1, . . ., d

, on

(a_{1}^{⊤} X, . . ., a_{j}^{⊤} X)

is

b_{k}^{⊤} X = E (b_{k}^{⊤} Y / a_{1}^{⊤} Y, . . ., a_{j}^{⊤} Y) + b_{k}^{⊤} ε

, where ε is a centered random vector such that

b_{k}^{⊤} ε

is orthogonal to

E (b_{k}^{⊤} X / a_{1}^{⊤} X, . . ., a_{j}^{⊤} X)

.

Corollary 3.4.

If g is a Gaussian density with same mean and variance as f, and if

Cov (X_{i}, X_{j}) = 0

for any

i \neq j

, then, the regression of

b_{k}^{⊤} X

,

k = j + 1, . . ., d

, on

(a_{1}^{⊤} X, . . ., a_{j}^{⊤} X)

is

b_{k}^{⊤} X = E (b_{k}^{⊤} Y) + b_{k}^{⊤} ε

, where ε is a centered random vector such that

b_{k}^{⊤} ε

is orthogonal to

E (b_{k}^{⊤} X / a_{1}^{⊤} X, . . ., a_{j}^{⊤} X)

.

4. Simulations

Let us study five simulations. The first involves a

χ^{2}

-divergence, the second a Hellinger distance, the third and the fourth a Cressie–Read divergence (still with

γ = 1.25

), and the fifth a Kullback–Leibler divergence.

In each example, our program will follow our algorithm and will aim at creating a sequence of densities

(g^{(j)})

,

j = 1, . ., k

,

k < d

, such that

g^{(0)} = g,

g^{(j)} = g^{(j - 1)} f_{a_{j}} / {[g^{(j - 1)}]}_{a_{j}}

and

D_{ϕ} (g^{(k)}, f) = 0,

with

D_{ϕ}

being a divergence and

a_{j} = arg {inf}_{b} D_{ϕ} (g^{(j - 1)} f_{b} / {[g^{(j - 1)}]}_{b}, f),

for all

j = 1, . . ., k

. Moreover, in the second example, we will study the robustness of our method with two outliers. In the third and the fourth example, defining

(X_{0}, X_{1})

as a vector with f as density, we will study the regression of

X_{1}

on

X_{0}

. And finally, in the fifth example, we will perform our goodness-of-fit test for copulas.
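The update g^{(j)} = g^{(j - 1)} f_{a_j} / [g^{(j - 1)}]_{a_j} can be illustrated on the product density of Simulation 4.3, where a single step along a = (1, 0) recovers f exactly. The sketch below uses the exact marginal densities rather than kernel estimates, and it assumes that the Gumbel parameters -5 and 1 denote location and scale; all function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329
LOC, SCALE = -5.0, 1.0  # assumed to be the location and scale of the Gumbel law


def gumbel_pdf(x, loc, scale):
    z = (x - loc) / scale
    return math.exp(-(z + math.exp(-z))) / scale


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def f(x0, x1):
    """Target density f(x) = Gumbel(x0) . Normal(x1)."""
    return gumbel_pdf(x0, LOC, SCALE) * normal_pdf(x1, 0.0, 1.0)


# Gaussian g^{(0)} with the same mean and variance as f (independent components here).
M0 = LOC + SCALE * EULER_GAMMA       # mean of the Gumbel component
V0 = (math.pi * SCALE) ** 2 / 6.0    # variance of the Gumbel component


def g0(x0, x1):
    return normal_pdf(x0, M0, V0) * normal_pdf(x1, 0.0, 1.0)


def g1(x0, x1):
    """One update g^{(1)} = g^{(0)} f_a / [g^{(0)}]_a along a = (1, 0)."""
    f_a = gumbel_pdf(x0, LOC, SCALE)  # density of a^T X = X0 under f
    g_a = normal_pdf(x0, M0, V0)      # density of a^T Y = Y0 under g^{(0)}
    return g0(x0, x1) * f_a / g_a
```

In this idealised setting `g1` coincides with `f` at every point, which is the situation D_ϕ(g^{(1)}, f) = 0 with k = 1; with estimated marginals the equality would only hold approximately.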

Simulation 4.1

(With the

χ^{2}

divergence).

We are in dimension 3(=d), and we consider a sample of 50(=n) values of a random variable X with a density law f defined by

f (x) = Gaussian (x_{1} + x_{2}) . Gaussian (x_{0} + x_{2}) . Gumbel (x_{0} + x_{1})

where the Normal law parameters are

(- 5, 2)

and

(1, 1)

and where the Gumbel distribution parameters are

- 3

and 4. Let us generate then a Gaussian random variable Y with a density—that we will name g—presenting the same mean and variance as f.

We theoretically obtain

k = 1

and

a_{1} = (1, 1, 0)

. To get this result, we perform the following test:

(H_{0}) : a_{1} = (1, 1, 0) versus (H_{1}) : a_{1} \neq (1, 1, 0) .

Then, Corollary 3.3 enables us to estimate

a_{1}

by the following 0.9(=α) level confidence ellipsoid

E_{1} = {b \in R^{3}; {(Var_{P} (M (b, b)))}^{(- 1 / 2)} P_{n} M (b, b) \leq q_{α}^{N (0, 1)} / \sqrt{n} ≃ 0.2533 / 7.0710678 = 0.03582203}
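The numerical bound of the ellipsoid is simply the quantile divided by the square root of the sample size. The short check below recomputes it for the quantile value 0.2533 quoted in the text and for the sample sizes used across the simulations and the real-data application; the function name is hypothetical.

```python
import math


def ellipsoid_threshold(quantile, n):
    """Return quantile / sqrt(n), the bound used in the confidence ellipsoid."""
    return quantile / math.sqrt(n)


# Quantile value 0.2533 and sample sizes as reported in the text.
thresholds = {n: ellipsoid_threshold(0.2533, n) for n in (50, 100, 500, 84)}
for n, value in thresholds.items():
    print(n, round(value, 8))
```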

And, we obtain

**Table 1.** Simulation 1: Numerical results of the optimisation.
Our Algorithm
Projection Study 0 :	minimum : 0.0201741
	at point : (1.00912,1.09453,0.01893)
	P-Value : 0.81131
Test :	$H_{0}$ : $a_{1} \in E_{1}$ : True
$χ^{2}$ (Kernel Estimation of $g^{(1)}$ , $g^{(1)}$ )	6.1726

Therefore, we conclude that f = g⁽¹⁾.

Simulation 4.2

(With the Hellinger distance H).

We are in dimension 20(=d). We first generate a sample with 100(=n) observations, namely two outliers

x = (2, 0, \dots, 0)

and 98 values of a random variable X with a density f defined by

f (x) = Gumbel (x_{0}) . Normal (x_{1}, \dots, x_{9})

where the Gumbel law parameters are -5 and 1 and where the normal distribution is reduced and centered. Our reasoning is the same as in Simulation 4.1.

In the first part of the program, we theoretically obtain

k = 1

and

a_{1} = (1, 0, \dots, 0)

. To get this result, we perform the following test

(H_{0}) : a_{1} = (1, 0, \dots, 0) versus (H_{1}) : a_{1} \neq (1, 0, \dots, 0) .

We estimate

a_{1}

by the following 0.9(=α) level confidence ellipsoid

E_{1} = {b \in R^{20}; {(Var_{P} (M (b, b)))}^{- 1 / 2} P_{n} M (b, b) \leq q_{α}^{N (0, 1)} / \sqrt{n} ≃ 0.02533}

And, we obtain

**Table 2.** Simulation 2: Numerical results of the optimisation.
Our Algorithm
Projection Study 0	minimum : 0.002692
	at point : (1.01326, 0.0657, 0.0628, 0.1011, 0.0509, 0.1083,
	0.1261, 0.0573, 0.0377, 0.0794, 0.0906, 0.0356, 0.0012,
	0.0292, 0.0737, 0.0934, 0.0286, 0.1057, 0.0697, 0.0771)
	P-Value : 0.80554
Test :	$H_{0}$ : $a_{1} \in E_{1}$ : True
H(Est. of $g^{(1)}$ , $g^{(1)}$ )	3.042174

Therefore, we conclude that f = g⁽¹⁾.

Simulation 4.3

(With the Cressie-Read divergence (

D_{ϕ}

)).

We are in dimension 2(=d), and we consider a sample of 50(=n) values of a random variable

X = (X_{0}, X_{1})

with a density law f defined by

f (x) = Gumbel (x_{0}) . Normal (x_{1})

where the Gumbel law parameters are -5 and 1 and where the normal distribution parameters are

(0, 1)

. Let us generate then a Gaussian random variable Y with a density—that we will name g—presenting the same mean and variance as f.

We theoretically obtain

k = 1

and

a_{1} = (1, 0)

. To get this result, we perform the following test

(H_{0}) : a_{1} = (1, 0) versus (H_{1}) : a_{1} \neq (1, 0) .

Then, Corollary 3.3 enables us to estimate

a_{1}

by the following 0.9(=α) level confidence ellipsoid

E_{1} = {b \in R^{2}; {(Var_{P} (M (b, b)))}^{(- 1 / 2)} P_{n} M (b, b) \leq q_{α}^{N (0, 1)} / \sqrt{n}}, with q_{α}^{N (0, 1)} / \sqrt{n} ≃ 0.03582203 .

And, we obtain

**Table 3.** Simulation 3: Numerical results of the optimisation.
Our Algorithm
Projection Study 0 :	minimum : 0.0210058
	at point : (1.001,0.0014)
	P-Value : 0.989552
Test :	$H_{0}$ : $a_{1} \in E_{1}$ : True
$D_{ϕ}$ (Kernel Estimation of $g^{(1)}$ , $g^{(1)}$ )	6.47617

Therefore, we conclude that f = g⁽¹⁾.

Figure 1. Graph of the distribution to estimate (red) and of our own estimate (green).

Figure 2. Graph of the distribution to estimate (red) and of Huber’s estimate (green).

At present, keeping the notations of this simulation, let us study the regression of

X_{1}

on

X_{0}

.

Our algorithm leads us to infer that the density of

X_{1}

given

X_{0}

is the same as the density of

Y_{1}

given

Y_{0}

. Moreover, Property A.1 implies that the co-factors of f are the same for any divergence. Consequently, applying Theorem 3.8 implies that

X_{1} = E (Y_{1} / Y_{0}) + ε,

where ε is a centered random variable orthogonal to

E (X_{1} / X_{0})

.

Thus, since g is a Gaussian density, Remark 3.3 implies that

X_{1} = E (Y_{1}) + \frac{Cov (Y_{1}, Y_{0})}{Var (Y_{0})} (Y_{0} - E (Y_{0})) + ε .

Now, using the least squares method, we estimate

α_{1}

and

α_{2}

such that

X_{1} = α_{1} + α_{2} . X_{0} + ε .

Thus, the following table presents the results of our regression and of the least squares method if we assume that ε is Gaussian.

**Table 4.** Simulation 3: Numerical results of the regression.
Our Regression	$E (Y_{1})$	-4.545483
	$Cov (Y_{1}, Y_{0})$	0.0380534
	$Var (Y_{0})$	0.9190052
	$E (Y_{0})$	0.3103752
	correlation $(Y_{1}, Y_{0})$	0.02158213
Least squares method	$α_{1}$	-4.34159227
	Std Error of $α_{1}$	0.19870
	$α_{2}$	0.06803317
	Std Error of $α_{2}$	0.21154
	correlation $(X_{1}, X_{0})$	0.04888484
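To compare the two fits on the same footing, the Gaussian regression of Remark 3.3 can be rewritten in intercept and slope form using the moments reported in Table 4. The snippet below only performs this arithmetic on the table values; the variable names are illustrative.

```python
# Values copied from Table 4 ("Our Regression" block).
mean_y1 = -4.545483
cov_y1_y0 = 0.0380534
var_y0 = 0.9190052
mean_y0 = 0.3103752

slope = cov_y1_y0 / var_y0              # coefficient of Y0 in E(Y1 / Y0)
intercept = mean_y1 - slope * mean_y0   # constant term of E(Y1 / Y0)

# Least squares estimates from the same table, for comparison.
alpha_1, alpha_2 = -4.34159227, 0.06803317
print(f"our regression: X1 = {intercept:.4f} + {slope:.4f} * X0")
print(f"least squares:  X1 = {alpha_1:.4f} + {alpha_2:.4f} * X0")
```

Both slopes are small relative to the standard error reported for α_2, which is consistent with the weak correlations shown in the table.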

Figure 3. Graph of the regression of

X_{1}

on

X_{0}

based on the least squares method (red) and based on our theory (green).

Simulation 4.4

(With the Cressie-Read divergence (

D_{ϕ}

)).

We are in dimension 2(=d), and we consider a sample of 500(=n) values of a random variable

X = (X_{0}, X_{1})

with a density law f defined by

f (x) = Gumbel (x_{1} - x_{0}) . Normal (x_{1} + x_{0})

where the Gumbel law parameters are -5 and 1 and where the normal distribution parameters are

(0, 1)

. Let us generate then a Gaussian random variable Y with a density—that we will name g—presenting the same mean and variance as f.

We theoretically obtain

k = 1

and

a_{1} = (1, - 1)

. To get this result, we perform the following test

(H_{0}) : a_{1} = (1, - 1) versus (H_{1}) : a_{1} \neq (1, - 1) .

Then, Corollary 3.3 enables us to estimate

a_{1}

by the following 0.9(=α) level confidence ellipsoid

E_{1} = {b \in R^{2}; {(Var_{P} (M (b, b)))}^{(- 1 / 2)} P_{n} M (b, b) \leq q_{α}^{N (0, 1)} / \sqrt{n} ≃ 0.2533 / \sqrt{500} = 0.01132792}

And, we obtain

**Table 5.** Simulation 4: Numerical results of the optimisation.
Our Algorithm
Projection Study 0 :	minimum : 0.010920
	at point : (1.09,-0.9701)
	P-Value : 0.889400
Test :	$H_{0}$ : $a_{1} \in E_{1}$ : True
$D_{ϕ}$ (Kernel Estimation of $g^{(1)}$ , $g^{(1)}$ )	5.25077

Therefore, we conclude that f = g⁽¹⁾.

At present, keeping the notations of this simulation, let us study the regression of

X_{1} + X_{0}

on

X_{1} - X_{0}

. Our algorithm leads us to infer that the density of

X_{1} + X_{0}

given

X_{1} - X_{0}

is the same as the density of

Y_{1} + Y_{0}

given

Y_{1} - Y_{0}

. Moreover, Property A.1 implies that the co-factors of f are the same for any divergence. Consequently, putting

U = X_{1} + X_{0}

,

V = X_{1} - X_{0}

,

U^{'} = Y_{1} + Y_{0}

and

V^{'} = Y_{1} - Y_{0}

, and since

{{(1, 1)}^{'}, {(1, - 1)}^{'}}

is an orthogonal basis, we can therefore infer from Theorem 3.8 that

U = E (U^{'} / V^{'}) + ε,

where ε is a centered random variable orthogonal to

E (U / V)

.

Thus, since g is a Gaussian density, Remark 3.3 implies that

U = E (U^{'}) + \frac{Cov (U^{'}, V^{'})}{Var (V^{'})} (V^{'} - E (V^{'})) + ε .

In other words, we apply the same reasoning as the one used in the regression studies in Simulation 4.3 to

(U, V)

instead of

(X_{1}, X_{0})

. This is possible since

{{(1, 1)}^{'}, {(1, - 1)}^{'}}

is an orthogonal basis of

R^{2}

, i.e., we implement a change in basis from the canonical basis of

R^{2}

to

{{(1, 1)}^{'}, {(1, - 1)}^{'}}

.

Thus, in the canonical basis

U = E (U^{'} / V^{'}) + ε

becomes

X_{1} + X_{0} = E (Y_{1} + Y_{0} / Y_{1} - Y_{0}) + ε

, i.e., we obtain that

X_{1} + X_{0} = E (Y_{1} + Y_{0}) + \frac{Cov (Y_{1} + Y_{0}, Y_{1} - Y_{0})}{Var (Y_{1} - Y_{0})} (Y_{1} - Y_{0} - E (Y_{1} - Y_{0})) + ε

where ε is a centered random variable orthogonal to

E (X_{1} + X_{0} / X_{1} - X_{0})

.

The following table presents the results of our regression.

We run the regression 10 times and obtain a and b such that

X_{1} = a + b X_{0} + ε

:

**Table 6.** Simulation 4: Numerical results of the regression.
Simulation	a	Std Error of a	b	Std Error of b
1	-4.83739	0.11149	-0.95861	0.04677
2	-4.56895	0.09989	-0.88577	0.04225
3	-4.4926	0.1057	-1.2085	0.0452
4	-4.70619	0.10350	-1.04549	0.04235
5	-4.40331	0.10248	-1.00890	0.0438
6	-4.61757	0.09813	-1.20890	0.04649
7	-4.40572	0.09172	-1.16085	0.04091
8	-4.39581	0.10174	-1.38696	0.04487
9	-4.42780	0.10018	-0.93672	0.04066
10	-4.55394	0.09923	-0.98065	0.04382

Figure 4. Graph of the regression of

X_{1}

on

X_{0}

based on our theory (green).

Simulation 4.5

(With the Kullback-Leibler divergence K).

We are in dimension 2(=d), and we use the Kullback–Leibler divergence to perform our optimisations. Let us consider a sample of 50(=n) values of a random variable X with a density f defined by

f (x) = c_{ρ} (F_{Gumbel} (x_{0}), F_{Exponential} (x_{1})) . Gumbel (x_{0}) . Exponential (x_{1})

where

$c_{ρ}$ is the Gaussian copula density with correlation coefficient $ρ = 0.5$ ,
the Gumbel distribution parameters are $- 1$ and 1 and
the Exponential density parameter is 2.
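A density of this form can be evaluated directly from its three ingredients. The sketch below is illustrative only: it assumes that the Gumbel parameters -1 and 1 denote location and scale and that the exponential parameter 2 is a rate (neither convention is stated in the text), and it uses the closed form of the bivariate Gaussian copula density.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def gaussian_copula_density(u, v, rho):
    """Density of the bivariate Gaussian copula with correlation rho, for u, v in (0, 1)."""
    a, b = STD_NORMAL.inv_cdf(u), STD_NORMAL.inv_cdf(v)
    one_minus = 1.0 - rho * rho
    exponent = -(rho * rho * (a * a + b * b) - 2.0 * rho * a * b) / (2.0 * one_minus)
    return math.exp(exponent) / math.sqrt(one_minus)


def gumbel_cdf(x, loc=-1.0, scale=1.0):
    return math.exp(-math.exp(-(x - loc) / scale))


def gumbel_pdf(x, loc=-1.0, scale=1.0):
    z = (x - loc) / scale
    return math.exp(-(z + math.exp(-z))) / scale


def exponential_cdf(x, rate=2.0):
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0


def exponential_pdf(x, rate=2.0):
    return rate * math.exp(-rate * x) if x > 0 else 0.0


def f(x0, x1, rho=0.5):
    """Joint density: copula density at the marginal cdfs times the marginal densities."""
    if x1 <= 0:
        return 0.0  # outside the support of the exponential marginal
    c = gaussian_copula_density(gumbel_cdf(x0), exponential_cdf(x1), rho)
    return c * gumbel_pdf(x0) * exponential_pdf(x1)
```

The function is meant for moderate arguments; far in the tails the marginal distribution functions round to 0 or 1 and the inverse normal cdf is no longer defined.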

Let us generate then a Gaussian random variable Y with a density—that we will name g—presenting the same mean and variance as f. We theoretically obtain

k = 2

and

(a_{1}, a_{2}) = ((1, 0), (0, 1))

. To get this result, we perform the following test

(H_{0}) : (a_{1}, a_{2}) = ((1, 0), (0, 1)) versus (H_{1}) : (a_{1}, a_{2}) \neq ((1, 0), (0, 1)) .

Then, Theorem 3.6 enables us to verify

(H_{0})

by the following 0.9(=α) level confidence ellipsoid

E_{2} = {b \in R^{2}; {(Var_{P} (M (b, b)))}^{(- 1 / 2)} P_{n} M (b, b) \leq q_{α}^{N (0, 1)} / \sqrt{n} ≃ 0.2533 / 7.0710678 = 0.0358220}

And, we obtain

**Table 7.** Simulation 5: Numerical results of the optimisation.
Our Algorithm
Projection Study number 0 :	minimum : 0.445199
	at point : (1.0142,0.0026)
	P-Value : 0.94579
Test :	$H_{1}$ : $a_{1} \notin E_{1}$ : True
Projection Study number 1 :	minimum : 0.0263
	at point : (0.0084,0.9006)
	P-Value : 0.97101
Test :	$H_{0}$ : $a_{2} \in E_{2}$ : True
K(Kernel Estimation of $g^{(2)}$ , $g^{(2)}$ )	4.0680

Therefore, we can conclude that H₀ is verified.

Figure 5. Graph of the estimate of

(x_{0}, x_{1}) \mapsto c_{ρ} (F_{Gumbel} (x_{0}), F_{Exponential} (x_{1}))

.

Application to real datasets

Let us now apply our theory to real datasets.

Let us for instance study the moves in the stock prices of Nokia and Sanofi from January 11, 2010 to May 10, 2010. We thus gather 84(=n) observations of these stock prices (see Tables 10 and 11).

Let us also consider

X_{1}

(resp.

X_{2}

) the random variable defining the stock price of Nokia (resp. Sanofi). We will assume—as it is commonly done in mathematical finance—that the stock market abides by the classical hypotheses of the Black–Scholes model—see [13].

Consequently,

X_{1}

and

X_{2}

each present a log-normal distribution as probability distribution. Let f be the density of the vector

(ln (X_{1}), X_{2})

, and let us now apply our algorithm to f with the Kullback–Leibler divergence as φ-divergence. Let us then generate a Gaussian random variable Y with a density g having the same mean and variance as f.

We first assume that there exists a vector a such that

D_{ϕ} (g \frac{f_{a}}{g_{a}}, f) = 0

.

In order to verify this hypothesis, our reasoning will be the same as in Simulation 4.1. Indeed, we assume that this vector is a co-factor of f. Consequently, Corollary 3.3 enables us to estimate a by the following 0.9(=α) level confidence ellipsoid

E_{1} = {b \in R^{2}; {(Var_{P} (M (b, b)))}^{(- 1 / 2)} P_{n} M (b, b) \leq q_{α}^{N (0, 1)} / \sqrt{n} ≃ 0.2533 / \sqrt{84} = 0.02763730}

And, we obtain

**Table 8.** Numerical results of the optimisation.
Our Algorithm
Projection Study 0 :	minimum : 0.017345
	at point : (0.027,3.18)
	P-Value : 0.890210
Test :	$H_{0}$ : $a_{1} \in E_{1}$ : True
K(Kernel Estimation of $g^{(1)}$ , $g^{(1)}$ )	2.7704005

Therefore, we conclude that

f = g^{(1)}

, i.e., our hypothesis is confirmed.

Consequently, as explained in Simulations 4.3 and 4.4, we can say that

log (X_{1}) = 0.027 . X_{2} + 3.18 + ε

where ε is a centered random variable orthogonal to

E (log (X_{1}) / X_{2})

.

Finally, using the least squares method, we estimate

α_{1}

and

α_{2}

such that

log (X_{1}) = α_{1} + α_{2} . X_{2} + ε .

Thus, the following table presents the results of the least squares method if we assume that ε is Gaussian:

**Table 9.** Numerical results of the regression.
Simulation	$α_{1}$	Std Error of $α_{1}$	$α_{2}$	Std Error of $α_{2}$
1	3.153694	0.230380	0.026578	0.004236

Figure 6. Graph of the regression of log of Nokia on Sanofi based on the least squares method (red) and based on our theory (green).

**Table 10.** Stock prices of Nokia and Sanofi.
Date	Nokia	Log-of-Nokia	Sanofi	Date	Nokia	Log-of-Nokia	Sanofi
10/05/10	84.75	4.44	51.62	07/05/10	81.85	4.4	48.5
06/05/10	87.3	4.47	50.35	05/05/10	87.75	4.47	50.95
04/05/10	87.25	4.47	50.49	03/05/10	87.85	4.48	51.51
30/04/10	87.8	4.48	51.66	29/04/10	87.85	4.48	51.41
28/04/10	87.85	4.48	51.88	27/04/10	89	4.49	52.11
26/04/10	89.2	4.49	54.09	23/04/10	90.7	4.51	53.47
22/04/10	92.75	4.53	53.59	21/04/10	108.4	4.69	53.95
20/04/10	108.9	4.69	54.43	19/04/10	108.3	4.68	54.05
16/04/10	106.8	4.67	54.04	15/04/10	109.9	4.7	54.95
14/04/10	109.8	4.7	54.86	13/04/10	108.3	4.68	54.67
12/04/10	109.1	4.69	55.27	09/04/10	110.1	4.7	55.41
08/04/10	110.7	4.71	54.96	07/04/10	113.2	4.73	55.3
06/04/10	112.4	4.72	54.64	01/04/10	113.3	4.73	55.16
31/03/10	112.4	4.72	55.19	30/03/10	112.5	4.72	55.39
29/03/10	111.8	4.72	55.49	26/03/10	112.5	4.72	55.72
25/03/10	111.4	4.71	56.33	24/03/10	110.2	4.7	55.95
23/03/10	109.1	4.69	56.12	22/03/10	109.2	4.69	56.33
19/03/10	108.5	4.69	56.57	18/03/10	108.4	4.69	56.56
17/03/10	109.9	4.7	56.28	16/03/10	107	4.67	57.21

**Table 11.** Stock prices of Nokia and Sanofi.
Date	Nokia	Log-of-Nokia	Sanofi	Date	Nokia	Log-of-Nokia	Sanofi
15/03/10	105.3	4.66	55.95	12/03/10	105	4.65	55.4
11/03/10	103	4.63	55.65	10/03/10	104	4.64	56.13
09/03/10	101.5	4.62	56.17	08/03/10	100.7	4.61	55.75
05/03/10	100.2	4.61	55.76	04/03/10	98.7	4.59	54.81
03/03/10	99.8	4.6	55.14	02/03/10	97.25	4.58	54.99
01/03/10	95.85	4.56	54.82	26/02/10	95.85	4.56	53.72
25/02/10	94.55	4.55	52.92	24/02/10	96.3	4.57	53.92
23/02/10	96.2	4.57	54.05	22/02/10	96.7	4.57	54.14
19/02/10	97.3	4.58	54.71	18/02/10	96.6	4.57	54.43
17/02/10	96.1	4.57	53.88	16/02/10	94.95	4.55	53.56
15/02/10	93.65	4.54	53.2	12/02/10	93.55	4.54	53.01
11/02/10	94.6	4.55	52.52	10/02/10	95.55	4.56	52.2
09/02/10	98.4	4.59	52.66	08/02/10	99.2	4.6	52.98
05/02/10	99.8	4.6	51.68	04/02/10	102.6	4.63	53.42
03/02/10	103.9	4.64	54.06	02/02/10	103.8	4.64	53.8
01/02/10	102.4	4.63	53.23	29/01/10	103.6	4.64	53.6
28/01/10	101.8	4.62	52.68	27/01/10	92.55	4.53	53.8
26/01/10	92.7	4.53	54.42	25/01/10	91.9	4.52	53.66
22/01/10	94.1	4.54	54.65	21/01/10	93.7	4.54	55.28
20/01/10	92.75	4.53	56.67	19/01/10	93.6	4.54	57.69
18/01/10	94.55	4.55	56.67	15/01/10	93.55	4.54	56.85
14/01/10	93.7	4.54	56.91	13/01/10	92.5	4.53	56.18
12/01/10	92.35	4.53	55.83	11/01/10	93	4.53	56.08

5. Critique of the Simulations

In the case where f is unknown, we can never be sure to have reached the minimum of the φ-divergence: we have indeed used the simulated annealing method to solve our optimisation problem, and therefore it is only when the number of random jumps tends in theory towards infinity that the probability of reaching the minimum tends to 1. We also note that no theory exists on the optimal number of jumps to implement, as this number depends on the specificities of each particular problem. Moreover, we choose the

50^{- \frac{4}{4 + d}}

(resp.

500^{- \frac{4}{4 + d}}

and

100^{- \frac{4}{4 + d}}

) for the AMISE of Simulations 4.1, 4.2 and 4.3 (resp. Simulations 4.4 and 4.5). This choice leads us to simulate 50 (resp. 500 and 100) random variables—see Scott [15] page 151—none of which have been discarded to obtain the truncated sample. This has also been the case in our application to real datasets.

Finally, we remark that the key advantages of our method over Huber’s are a considerably shorter computation time, since there exist divergences smaller than the Kullback–Leibler divergence, and superior robustness.

6. Conclusions

Projection Pursuit is useful in evidencing characteristic structures as well as one-dimensional projections and their associated distributions in multivariate data. Huber [2] shows us how to achieve it through maximization of the Kullback–Leibler divergence.

The present article shows that our ϕ-divergence method constitutes a good alternative to Huber’s, particularly in terms of regression and robustness as well as in the study of copulas. Indeed, the convergence results and the simulations we carried out convincingly fulfilled our expectations regarding our methodology.

References

1. Friedman, J.H.; Stuetzle, W.; Schroeder, A. Projection pursuit density estimation. J. Amer. Statist. Assoc. 1984, 79, 599–608.
2. Huber, P.J. Projection pursuit. Ann. Statist. 1985, 13, 435–525, With discussion.
3. Zhu, M. On the forward and backward algorithms of projection pursuit. Ann. Statist. 2004, 32, 233–244.
4. Yohai, V.J. Optimal robust estimates using the Kullback-Leibler divergence. Stat. Probab. Lett. 2008, 78, 1811–1816.
5. Toma, A. Optimal robust M-estimators using divergences. Stat. Probab. Lett. 2009, 79, 1–5.
6. Huber, P.J. Robust Statistics; Wiley: Hoboken, NJ, USA, 1981; republished in paperback, 2004.
7. Diaconis, P.; Freedman, D. Asymptotics of graphical projection pursuit. Ann. Statist. 1984, 12, 793–815.
8. Cambanis, S.; Huang, S.; Simons, G. On the theory of elliptically contoured distributions. J. Multivariate Anal. 1981, 11, 368–385.
9. Landsman, Z.M.; Valdez, E.A. Tail conditional expectations for elliptical distributions. N. Am. Actuar. J. 2003, 7, 55–71.
10. Van der Vaart, A.W. Asymptotic Statistics. In Cambridge Series in Statistical and Probabilistic Mathematics; Cambridge University Press: Cambridge, MA, USA, 1998; Volume 3.
11. Broniatowski, M.; Keziou, A. Parametric estimation and tests through divergences and the duality technique. J. Multivariate Anal. 2009, 100, 16–36.
12. Vajda, I. χ^α-divergence and generalized Fisher’s information. In Transactions of the Sixth Prague Conference on Information Theory, Statistical Decision Functions, Random Processes; Czech Technical University in Prague: Prague, Czech, 1971; dedicated to the memory of Antonín Spacek; Academia: Prague, Czech; pp. 873–886.
13. Black, F.; Scholes, M.S. The pricing of options and corporate liabilities. J. Polit. Econ. 1973, 3, 637–654.
14. Saporta, G. Probabilités, Analyse des données et Statistique; Technip: Paris, France, 2006.
15. Scott, D.W. Multivariate Density Estimation. Theory, Practice, and Visualization; John Wiley and Sons: New York, NY, USA, 1992.
16. Cressie, N.; Read, T.R.C. Multinomial goodness-of-fit tests. J. Roy. Statist. Soc. 1984, Ser. B 46, 440–464.
17. Csiszár, I. On topology properties of f-divergences. Studia Sci. Math. Hungar. 1967, 2, 329–339.
18. Liese, F.; Vajda, I. Convex Statistical Distances. In Teubner-Texte zur Mathematik [Teubner Texts in Mathematics]; B.G. Teubner Verlagsgesellschaft: Leipzig, Germany, 1987; Volume 95.
19. Pardo, L. Statistical inference based on divergence measures. In Statistics: Textbooks and Monographs; Chapman & Hall/CRC: Boca Raton, FL, USA, 2006; Volume 185.
20. Zografos, K.; Ferentinos, K.; Papaioannou, T. ϕ-divergence statistics: sampling properties and multinomial goodness of fit and divergence tests. Comm. Statist. Theory Methods 1990, 19, 1785–1802.
21. Azé, D. Eléments d’analyse convexe et variationnelle; Ellipse: Minneapolis, MN, USA, 1997.
22. Touboul, J. Projection pursuit through φ-divergence minimisation. arXiv:0912.2883, 2009.
23. Bosq, D.; Lecoutre, J.-P. Livre—Theorie De L’Estimation Fonctionnelle; Economica: Hoboken, NJ, USA, 1999.

Appendix

A. Reminders

A.1. φ-Divergence

Let us call $h_{a}$ the density of $a^{⊤} Z$ if h is the density of Z. Let ϕ be a strictly convex function defined by $φ : \overline{R^{+}} \to \overline{R^{+}}$, and such that $φ (1) = 0$.

Definition A.1. We define the ϕ-divergence of P from Q, where P and Q are two probability distributions over a space Ω such that Q is absolutely continuous with respect to P, by

$$D_{ϕ} (Q, P) = \int φ \left(\frac{d Q}{d P}\right) d P . \qquad (A.1)$$

The above expression (A.1) is also valid if P and Q are both dominated by the same probability.

The most used distances (Kullback, Hellinger or $χ^{2}$) belong to the Cressie–Read family (see Cressie [16], Csiszár [17] and the books of Liese [18], Pardo [19] and Zografos [20]). They are defined by a specific ϕ. Indeed,

-: with the Kullback–Leibler divergence, we associate $φ (x) = x \ln (x) - x + 1$
-: with the Hellinger distance, we associate $φ (x) = 2 (\sqrt{x} - 1)^{2}$
-: with the $χ^{2}$ distance, we associate $φ (x) = \frac{1}{2} (x - 1)^{2}$
-: more generally, with power divergences, we associate $φ (x) = \frac{x^{γ} - γ x + γ - 1}{γ (γ - 1)}$, where $γ \in R ∖ \{0, 1\}$
-: and, finally, with the $L^{1}$ norm, which is also a divergence, we associate $φ (x) = | x - 1 |$.
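The generators listed above can be tried out on finite distributions, where (A.1) reduces to a sum. The following is only a minimal sketch under that assumption (discrete P and Q with every $p_i > 0$); the function names are illustrative and not taken from the article.

```python
import math

def phi_kullback_leibler(x):
    # x ln(x) - x + 1, extended by continuity at x = 0
    return x * math.log(x) - x + 1.0 if x > 0.0 else 1.0

def phi_hellinger(x):
    return 2.0 * (math.sqrt(x) - 1.0) ** 2

def phi_chi2(x):
    return 0.5 * (x - 1.0) ** 2

def phi_l1(x):
    return abs(x - 1.0)

def phi_power(gamma):
    # Power-divergence generator; the formula is undefined for gamma in {0, 1}.
    if gamma in (0.0, 1.0):
        raise ValueError("gamma must differ from 0 and 1")
    def phi(x):
        return (x ** gamma - gamma * x + gamma - 1.0) / (gamma * (gamma - 1.0))
    return phi

def phi_divergence(phi, q, p):
    # Discrete analogue of (A.1): D_phi(Q, P) = sum_i p_i * phi(q_i / p_i).
    # Assumption of this sketch: every p_i is strictly positive.
    return sum(pi * phi(qi / pi) for qi, pi in zip(q, p))

q = [0.1, 0.2, 0.7]
p = [0.3, 0.3, 0.4]
for name, phi in [("KL", phi_kullback_leibler), ("Hellinger", phi_hellinger),
                  ("chi2", phi_chi2), ("L1", phi_l1)]:
    print(name, phi_divergence(phi, q, p))
```

With these definitions, `phi_power(2.0)` coincides with the $χ^{2}$ generator and `phi_power(0.5)` with the Hellinger generator, which is a quick way to check the formulas against each other.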

Let us now present some well-known properties of divergences.

Property A.1. We have $D_{ϕ} (P, Q) = 0 \Leftrightarrow P = Q$.

Property A.2. The divergence function $Q \mapsto D_{ϕ} (Q, P)$ is convex and lower semi-continuous (l.s.c.) for the topology that makes all the applications of the form $Q \mapsto \int f d Q$ continuous, where f is bounded and continuous, as well as l.s.c. for the topology of the uniform convergence.

Property A.3. (corollary (1.29), page 19 of Liese [18]). If $T : (X, A) \to (Y, B)$ is measurable and if $D_{ϕ} (P, Q) < \infty$, then

$$D_{ϕ} (P, Q) \geq D_{ϕ} (P T^{- 1}, Q T^{- 1}),$$

with equality being reached when T is surjective for $(P, Q)$.
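Property A.3 can be checked numerically in the simplest setting. The sketch below assumes finite distributions, the Kullback–Leibler generator, and a map T that merges states; all names are illustrative and not taken from the article.

```python
import math

def phi_kl(x):
    # Kullback-Leibler generator x ln(x) - x + 1, extended by continuity at 0
    return x * math.log(x) - x + 1.0 if x > 0.0 else 1.0

def phi_divergence(phi, first, second):
    # Discrete divergence sum_i second_i * phi(first_i / second_i);
    # assumes every entry of `second` is strictly positive.
    return sum(s * phi(f / s) for f, s in zip(first, second))

def push_forward(prob, mapping, n_out):
    # Image measure of `prob` under T, where T sends state i to mapping[i].
    out = [0.0] * n_out
    for i, mass in enumerate(prob):
        out[mapping[i]] += mass
    return out

P = [0.1, 0.2, 0.3, 0.4]
Q = [0.25, 0.25, 0.25, 0.25]
T = [0, 0, 1, 1]  # merges states {0, 1} and {2, 3}

d_full = phi_divergence(phi_kl, P, Q)
d_coarse = phi_divergence(phi_kl, push_forward(P, T, 2), push_forward(Q, T, 2))
print(d_full, d_coarse)  # the first value should not be smaller than the second
```

Merging states can only lose information, which is why the divergence between the image measures cannot exceed the original one.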

Theorem A.1. (theorem III.4 of Azé [21]). Let $f : I \to R$ be a convex function. Then f is a Lipschitz function in all compact intervals $[a, b] \subset \mathrm{int} \{I\}$. In particular, f is continuous on $\mathrm{int} \{I\}$.

A.2. Miscellaneous

In the present section, all demonstrations can be found in Touboul [22].

Lemma A.1. The set $Γ_{c}$ is closed in $L^{1}$ for the topology of the uniform convergence.

Lemma A.2. For all $c > 0$, we have $Γ_{c} \subset \overline{B}_{L^{1}} (f, c)$, where $B_{L^{1}} (f, c) = \{p \in L^{1}; ∥ f - p ∥_{1} \leq c\}$.

Lemma A.3. G is closed in $L^{1}$ for the topology of the uniform convergence.

Lemma A.4. Let us consider the sequence $(a_{i})$ defined in (2.3) page 1587. We then have

$$\lim_{n} \lim_{k} K \left(\check{g}_{n}^{(k)} \frac{f_{a_{k}, n}}{[\check{g}^{(k)}]_{a_{k}, n}}, f_{n}\right) = 0 \quad \text{a.s.}$$

In the case where f is known and keeping the notations introduced in Section 3.1, we have

Proposition A.1. Assume that $(H 1)$ to $(H 3)$ hold. Then both $\sup_{a \in Θ} ∥ \check{c}_{n} (a) - a_{k} ∥$ and $\check{γ}_{n}$ tend to $a_{k}$ a.s.

Theorem A.2. Assume that $(H 0)$ to $(H 3)$ hold. Then, for any $k = 1, \dots, d$ and any $x \in R^{d}$, we have

$$| \check{g}^{(k)} (x) - g^{(k)} (x) | = O_{P} (n^{- 1 / 2})$$

and

$$\int | \check{g}^{(k)} (x) - g^{(k)} (x) | d x = O_{P} (n^{- 1 / 2})$$

as well as

$$| K (\check{g}^{(k)}, f) - K (g^{(k)}, f) | = O_{P} (n^{- 1 / 2}) .$$

Theorem A.3. Assume that $(H 1)$ to $(H 3)$, $(H 6)$ and $(H 8)$ hold. Then,

$$\sqrt{n} \, (\mathrm{Var}_{P} (M (\check{c}_{n} (\check{γ}_{n}), \check{γ}_{n})))^{- 1 / 2} (P_{n} M (\check{c}_{n} (\check{γ}_{n}), \check{γ}_{n}) - P_{n} M (a_{k}, a_{k})) \overset{\text{Law}}{\to} N (0, I),$$

where k represents the $k^{th}$ step of the algorithm and I is the identity matrix in $R^{d}$.

B. Study of the sample

Let $X_{1}, X_{2}, \dots, X_{m}$ be a sequence of independent random vectors with same density f. Let $Y_{1}, Y_{2}, \dots, Y_{m}$ be a sequence of independent random vectors with same density g. Then, the kernel estimators $f_{m}$, $g_{m}$, $f_{a, m}$ and $g_{a, m}$ of f, g, $f_{a}$ and $g_{a}$, for all $a \in R_{*}^{d}$, almost surely and uniformly converge since we assume that the bandwidth $h_{m}$ of these estimators meets the following conditions (see Bosq [23]), with $L (u) = \ln (u \lor e)$:

$$(Hyp) : \quad h_{m} \searrow_{m} 0, \quad m h_{m} \nearrow_{m} \infty, \quad m h_{m} / L (h_{m}^{- 1}) \to_{m} \infty \quad \text{and} \quad L (h_{m}^{- 1}) / L L m \to_{m} \infty .$$

Let us consider

$$B_{1} (n, a) = \frac{1}{n} \sum_{i = 1}^{n} φ' \left\{\frac{f_{a, n} (a^{⊤} Y_{i})}{g_{a, n} (a^{⊤} Y_{i})} \frac{g_{n} (Y_{i})}{f_{n} (Y_{i})}\right\} \frac{f_{a, n} (a^{⊤} Y_{i})}{g_{a, n} (a^{⊤} Y_{i})}$$

and

$$B_{2} (n, a) = \frac{1}{n} \sum_{i = 1}^{n} φ^{*} \left\{φ' \left\{\frac{f_{a, n} (a^{⊤} X_{i})}{g_{a, n} (a^{⊤} X_{i})} \frac{g_{n} (X_{i})}{f_{n} (X_{i})}\right\}\right\} .$$
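To see how the difference $B_{1} - B_{2}$ behaves, here is a hedged toy sketch for the Kullback–Leibler case, where $φ' (x) = \ln x$ and $φ^{*} (φ' (x)) = x - 1$. It makes several simplifying assumptions that are not in the article: the true densities replace the kernel estimates, f is the Gaussian $N ((1, 0), I)$, g is $N (0, I)$ in dimension 2, and the direction a is restricted to the two canonical axes. All function names are illustrative.

```python
import math
import random

def normal_pdf(t, mean=0.0):
    return math.exp(-0.5 * (t - mean) ** 2) / math.sqrt(2.0 * math.pi)

def f_density(x):
    # Assumed target density: N((1, 0), I)
    return normal_pdf(x[0], 1.0) * normal_pdf(x[1], 0.0)

def g_density(x):
    # Assumed reference density: N((0, 0), I)
    return normal_pdf(x[0]) * normal_pdf(x[1])

def marginal_ratio(axis, t):
    # f_a(t) / g_a(t) when a is the canonical vector e_axis
    mean = 1.0 if axis == 0 else 0.0
    return normal_pdf(t, mean) / normal_pdf(t, 0.0)

def dual_kl_estimate(axis, n=20000, seed=0):
    # Monte Carlo version of B_1 - B_2 with known densities.
    rng = random.Random(seed)
    ys = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]  # drawn from g
    xs = [(rng.gauss(1.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]  # drawn from f

    def ratio(x):
        # (f_a / g_a)(a^T x) * g(x) / f(x)
        return marginal_ratio(axis, x[axis]) * g_density(x) / f_density(x)

    b1 = sum(math.log(ratio(y)) * marginal_ratio(axis, y[axis]) for y in ys) / n
    b2 = sum(ratio(x) - 1.0 for x in xs) / n
    return b1 - b2

print(dual_kl_estimate(0))  # close to 0: replacing the first marginal of g gives f
print(dual_kl_estimate(1))  # close to 0.5, i.e. K(g, f): this direction changes nothing
```

In this toy setting the estimate is smallest along the first axis, which is the direction a projection pursuit step would select.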

Our goal is to estimate the minimum of $D_{ϕ} (g \frac{f_{a}}{g_{a}}, f)$. To do this, it is necessary for us to truncate our samples.

Let us now consider a positive sequence $θ_{m}$ such that $θ_{m} \to 0$ and $y_{m} / θ_{m}^{2} \to 0$, where $y_{m}$ is the almost sure convergence rate of the kernel density estimator ($y_{m} = O_{P} (m^{- \frac{2}{4 + d}})$, see Lemma D.7), such that $y_{m}^{(1)} / θ_{m}^{2} \to 0$, where $y_{m}^{(1)}$ is defined by

$$\left| φ \left(\frac{g_{m} (x)}{f_{m} (x)} \frac{f_{b, m} (b^{⊤} x)}{g_{b, m} (b^{⊤} x)}\right) - φ \left(\frac{g (x)}{f (x)} \frac{f_{b} (b^{⊤} x)}{g_{b} (b^{⊤} x)}\right) \right| \leq y_{m}^{(1)}$$

for all b in $R_{*}^{d}$ and all x in $R^{d}$, and finally such that $y_{m}^{(2)} / θ_{m}^{2} \to 0$, where $y_{m}^{(2)}$ is defined by

$$\left| φ' \left(\frac{g_{m} (x)}{f_{m} (x)} \frac{f_{b, m} (b^{⊤} x)}{g_{b, m} (b^{⊤} x)}\right) - φ' \left(\frac{g (x)}{f (x)} \frac{f_{b} (b^{⊤} x)}{g_{b} (b^{⊤} x)}\right) \right| \leq y_{m}^{(2)}$$

for all b in $R_{*}^{d}$ and all x in $R^{d}$.

We will generate $f_{m}$, $g_{m}$ and $g_{b, m}$ from the starting sample and we will select the $X_{i}$ and $Y_{i}$ vectors such that $f_{m} (X_{i}) \geq θ_{m}$ and $g_{b, m} (b^{⊤} Y_{i}) \geq θ_{m}$, for all i and for all $b \in R_{*}^{d}$.

The vectors meeting these conditions will be called $X_{1}, X_{2}, \dots, X_{n}$ and $Y_{1}, Y_{2}, \dots, Y_{n}$.

Consequently, the next proposition provides us with the condition required for us to derive our estimations.

Proposition B.1. Using the notations introduced in Broniatowski [11] and in Section 3.1, it holds

$$\lim_{n \to \infty} \sup_{a \in R_{*}^{d}} \left| (B_{1} (n, a) - B_{2} (n, a)) - D_{ϕ} \left(g \frac{f_{a}}{g_{a}}, f\right) \right| = 0 .$$

Remark B.1. With the Kullback–Leibler divergence, we can take for $θ_{m}$ the expression $m^{- ν}$, with $0 < ν < \frac{1}{4 + d}$.
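The truncation step with the threshold of Remark B.1 can be sketched as follows. This is an illustration only, under assumptions that are not in the article: dimension $d = 1$, a Gaussian kernel, the bandwidth $h_{m} = m^{- 1 / (4 + d)}$ borrowed from the rate used in Lemma D.7, and only the condition $f_{m} (X_{i}) \geq θ_{m}$ (the condition on $g_{b, m}$ is not reproduced). The function names are hypothetical.

```python
import math
import random

def gaussian_kde(sample, h):
    # One-dimensional Gaussian kernel density estimator with bandwidth h.
    m = len(sample)
    norm = 1.0 / (m * h * math.sqrt(2.0 * math.pi))
    def estimate(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample)
    return estimate

def truncate_sample(sample, nu, d=1):
    # Keep the points X_i with f_m(X_i) >= theta_m, where theta_m = m^(-nu).
    if not 0.0 < nu < 1.0 / (4 + d):
        raise ValueError("nu must lie in (0, 1/(4+d))")
    m = len(sample)
    h = m ** (-1.0 / (4 + d))   # assumed bandwidth choice
    theta = m ** (-nu)          # threshold of Remark B.1
    f_m = gaussian_kde(sample, h)
    kept = [x for x in sample if f_m(x) >= theta]
    return kept, theta

rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(600)]
kept, theta = truncate_sample(sample, nu=0.19)
print(len(sample), len(kept), theta)
```

Since $θ_{m}$ decreases slowly, the threshold removes a large share of the points at moderate sample sizes; the conditions above are asymptotic.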

C. Hypotheses’ discussion

C.1. Discussion of $(H 2)$ .

Let us work with the Kullback–Leibler divergence and with g and $a_{1}$. In this case, $φ' (x) = \ln x$ and $φ^{*} (t) = e^{t} - 1$, so that $φ^{*} (φ' (x)) = x - 1$.

For all $b \in R_{*}^{d}$, we have

$$\int φ^{*} \left(φ' \left(\frac{g (x) f_{b} (b^{⊤} x)}{f (x) g_{b} (b^{⊤} x)}\right)\right) f (x) d x = \int \left(\frac{g (x) f_{b} (b^{⊤} x)}{f (x) g_{b} (b^{⊤} x)} - 1\right) f (x) d x = 0,$$

since, for any b in $R_{*}^{d}$, the function $x \mapsto g (x) \frac{f_{b} (b^{⊤} x)}{g_{b} (b^{⊤} x)}$ is a density. The complement of $Θ^{D_{ϕ}}$ in $R_{*}^{d}$ is ∅ and then the supremum looked for in $\overline{R}$ is $- \infty$. We can therefore conclude. It is interesting to note that we obtain the same verification with f, $g^{(k - 1)}$ and $a_{k}$.

C.2. Discussion of $(H 4)$ .

This hypothesis consists in the following assumptions:

(0) We work with the Kullback–Leibler divergence.
(1) We have $f (. / a_{1}^{⊤} x) = g (. / a_{1}^{⊤} x)$, i.e., $K (g \frac{f_{1}}{g_{1}}, f) = 0$ (we could also derive the same proof with f, $g^{(k - 1)}$ and $a_{k}$).

Preliminary $(A)$: We show that

$$A = \left\{(c, x) \in R_{*}^{d} ∖ \{a_{1}\} \times R^{d}; \ \frac{f_{a_{1}} (a_{1}^{⊤} x)}{g_{a_{1}} (a_{1}^{⊤} x)} > \frac{f_{c} (c^{⊤} x)}{g_{c} (c^{⊤} x)}, \ g (x) \frac{f_{c} (c^{⊤} x)}{g_{c} (c^{⊤} x)} > f (x)\right\} = \emptyset$$

through a reductio ad absurdum, i.e., we assume $A \neq \emptyset$.

Thus, our hypothesis enables us to derive

$$f (x) = f (. / a_{1}^{⊤} x) f_{a_{1}} (a_{1}^{⊤} x) = g (. / a_{1}^{⊤} x) f_{a_{1}} (a_{1}^{⊤} x) > g (. / c^{⊤} x) f_{c} (c^{⊤} x) > f (x),$$

since $\frac{f_{a_{1}} (a_{1}^{⊤} x)}{g_{a_{1}} (a_{1}^{⊤} x)} \geq \frac{f_{c} (c^{⊤} x)}{g_{c} (c^{⊤} x)}$ implies

$$g (. / a_{1}^{⊤} x) f_{a_{1}} (a_{1}^{⊤} x) = g (x) \frac{f_{a_{1}} (a_{1}^{⊤} x)}{g_{a_{1}} (a_{1}^{⊤} x)} \geq g (x) \frac{f_{c} (c^{⊤} x)}{g_{c} (c^{⊤} x)} = g (. / c^{⊤} x) f_{c} (c^{⊤} x),$$

i.e., $f > f$. We can therefore conclude.

Preliminary $(B)$: We show that

$$B = \left\{(c, x) \in R_{*}^{d} ∖ \{a_{1}\} \times R^{d}; \ \frac{f_{a_{1}} (a_{1}^{⊤} x)}{g_{a_{1}} (a_{1}^{⊤} x)} < \frac{f_{c} (c^{⊤} x)}{g_{c} (c^{⊤} x)}, \ g (x) \frac{f_{c} (c^{⊤} x)}{g_{c} (c^{⊤} x)} < f (x)\right\} = \emptyset$$

through a reductio ad absurdum, i.e., we assume $B \neq \emptyset$.

Thus, our hypothesis enables us to derive

$$f (x) = f (. / a_{1}^{⊤} x) f_{a_{1}} (a_{1}^{⊤} x) = g (. / a_{1}^{⊤} x) f_{a_{1}} (a_{1}^{⊤} x) < g (. / c^{⊤} x) f_{c} (c^{⊤} x) < f (x) .$$

We can therefore conclude as above.

Let us now verify $(H 4)$:

We have

$$P M (c, a_{1}) - P M (c, a) = \int \ln \left(\frac{g (x) f_{c} (c^{⊤} x)}{g_{c} (c^{⊤} x) f (x)}\right) \left\{\frac{f_{a_{1}} (a_{1}^{⊤} x)}{g_{a_{1}} (a_{1}^{⊤} x)} - \frac{f_{c} (c^{⊤} x)}{g_{c} (c^{⊤} x)}\right\} g (x) d x .$$

Moreover, the logarithm $\ln$ is negative on $\left\{x \in R_{*}^{d}; \ \frac{g (x) f_{c} (c^{⊤} x)}{g_{c} (c^{⊤} x) f (x)} < 1\right\}$ and is positive on $\left\{x \in R_{*}^{d}; \ \frac{g (x) f_{c} (c^{⊤} x)}{g_{c} (c^{⊤} x) f (x)} \geq 1\right\}$.

Thus, the preliminary studies $(A)$ and $(B)$ show that $\ln \left(\frac{g (x) f_{c} (c^{⊤} x)}{g_{c} (c^{⊤} x) f (x)}\right)$ and $\left\{\frac{f_{a_{1}} (a_{1}^{⊤} x)}{g_{a_{1}} (a_{1}^{⊤} x)} - \frac{f_{c} (c^{⊤} x)}{g_{c} (c^{⊤} x)}\right\}$ always present a negative product. We can therefore conclude, since $(c, a) \mapsto P M (c, a_{1}) - P M (c, a)$ is not null for all c and for all a, with $a \neq a_{1}$.

D. Proofs

Preliminary remark :

Let us note that if $K (g, f) \geq \int | f (x) - g (x) | d x$, a simple reductio ad absurdum enables us to infer that $K (g^{(1)}, f) \geq \int | f (x) - g^{(1)} (x) | d x$. Therefore, through an induction, we immediately obtain that, for any k, $K (g^{(k)}, f) \geq \int | f (x) - g^{(k)} (x) | d x$. Thus, for any k and from a certain rank n, we derive that $K (g_{n}^{(k)}, f) \geq \int | f (x) - g_{n}^{(k)} (x) | d x$.

Proof of Lemma D.1.

Lemma D.1. We have $g (. / a_{1}^{⊤} x, \dots, a_{j}^{⊤} x) = n (a_{j + 1}^{⊤} x, \dots, a_{d}^{⊤} x) = f (. / a_{1}^{⊤} x, \dots, a_{j}^{⊤} x)$.

Putting $A = (a_{1}, \dots, a_{d})$, let us determine f in basis A. Let us first study the function defined by $ψ : R^{d} \to R^{d}$, $x \mapsto (a_{1}^{⊤} x, \dots, a_{d}^{⊤} x)$.

We can immediately say that ψ is continuous and, since A is a basis, its bijectivity is obvious. Moreover, let us study its Jacobian.

By definition, it is

$$J_{ψ} (x_{1}, \dots, x_{d}) = \begin{vmatrix} \frac{\partial ψ_{1}}{\partial x_{1}} & \dots & \frac{\partial ψ_{1}}{\partial x_{d}} \\ \dots & \dots & \dots \\ \frac{\partial ψ_{d}}{\partial x_{1}} & \dots & \frac{\partial ψ_{d}}{\partial x_{d}} \end{vmatrix} = \begin{vmatrix} a_{1, 1} & \dots & a_{1, d} \\ \dots & \dots & \dots \\ a_{d, 1} & \dots & a_{d, d} \end{vmatrix} = | A | \neq 0$$

since A is a basis. We can therefore infer:

$$\forall x \in R^{d}, \ \exists! \, y \in R^{d} \ \text{such that} \ f (x) = | A |^{- 1} Ψ (y),$$

i.e., Ψ (resp. y) is the expression of f (resp. of x) in basis A, namely $Ψ (y) = \tilde{n} (y_{j + 1}, \dots, y_{d}) \tilde{h} (y_{1}, \dots, y_{j})$, with $\tilde{n}$ and $\tilde{h}$ being the expressions of n and h in basis A. Consequently, our results in the case where the family $\{a_{j}\}_{1 \leq j \leq d}$ is the canonical basis of $R^{d}$ still hold for Ψ in basis A (see Section 2.1). And then, if $\tilde{g}$ is the expression of g in basis A, we have $\tilde{g} (. / y_{1}, \dots, y_{j}) = \tilde{n} (y_{j + 1}, \dots, y_{d}) = Ψ (. / y_{1}, \dots, y_{j})$, i.e., $g (. / a_{1}^{⊤} x, \dots, a_{j}^{⊤} x) = n (a_{j + 1}^{⊤} x, \dots, a_{d}^{⊤} x) = f (. / a_{1}^{⊤} x, \dots, a_{j}^{⊤} x)$.

Proof of Lemma D.2.

Lemma D.2. Should there exist a family $(a_{i})_{i = 1, \dots, d}$ such that

$$f (x) = n (a_{j + 1}^{⊤} x, \dots, a_{d}^{⊤} x) h (a_{1}^{⊤} x, \dots, a_{j}^{⊤} x),$$

with $j < d$, with f, n and h being densities, then this family is an orthogonal basis of $R^{d}$.

Using a reductio ad absurdum, we have

$$\int f (x) d x = 1 \neq + \infty = \int n (a_{j + 1}^{⊤} x, \dots, a_{d}^{⊤} x) h (a_{1}^{⊤} x, \dots, a_{j}^{⊤} x) d x .$$

We can therefore conclude.

Lemma D.3. $\inf_{a \in R_{*}^{d}} D_{ϕ} (g^{*}, f)$ is reached when the ϕ-divergence is greater than the $L^{1}$ distance as well as the $L^{2}$ distance.

Indeed, let G be $\{g \frac{f_{a}}{g_{a}}; a \in R_{*}^{d}\}$ and $Γ_{c}$ be $Γ_{c} = \{p; K (p, f) \leq c\}$ for all $c > 0$. From Lemmas A.1, A.2 and A.3 (see page 1605), we get that $Γ_{c} \cap G$ is a compact for the topology of the uniform convergence, if $Γ_{c} \cap G$ is not empty. Hence, and since property A.2 (see page 1605) implies that $Q \mapsto D_{ϕ} (Q, P)$ is lower semi-continuous in $L^{1}$ for the topology of the uniform convergence, the infimum is reached in $L^{1}$. (Taking for example $c = D_{ϕ} (g, f)$, Ω is necessarily not empty because we always have $D_{ϕ} (g \frac{f_{a}}{g_{a}}, f) \leq D_{ϕ} (g, f)$.) Moreover, when the ϕ-divergence is greater than the $L^{2}$ distance, the very definition of the $L^{2}$ space enables us to provide the same proof as for the $L^{1}$ distance.

Proof of Lemma D.4.

Lemma D.4. For any $p \leq d$, we have $f_{a_{p}}^{(p - 1)} = f_{a_{p}}$ (see Huber’s analytic method), $g_{a_{p}}^{(p - 1)} = g_{a_{p}}$ (see Huber’s synthetic method) and $g_{a_{p}}^{(p - 1)} = g_{a_{p}}$ (see our algorithm).

As it is equivalent to prove either our algorithm or Huber’s, we will only develop here the proof for our algorithm. Assuming, without any loss of generality, that the $a_{i}$, $i = 1, \dots, p$, are the vectors of the canonical basis, since

$$g^{(p - 1)} (x) = g (x) \frac{f_{1} (x_{1})}{g_{1} (x_{1})} \frac{f_{2} (x_{2})}{g_{2} (x_{2})} \dots \frac{f_{p - 1} (x_{p - 1})}{g_{p - 1} (x_{p - 1})},$$

we derive immediately that $g_{p}^{(p - 1)} = g_{p}$. We note that it is sufficient to operate a change in basis on the $a_{i}$ to obtain the general case.

Proof of Lemma D.5.

Lemma D.5. If there exists p, $p \leq d$, such that $D_{ϕ} (g^{(p)}, f) = 0$, then the family $(a_{i})_{i = 1, \dots, p}$, derived from the construction of $g^{(p)}$, is free and orthogonal.

Without any loss of generality, let us assume that $p = 2$ and that the $a_{i}$ are the vectors of the canonical basis. Using a reductio ad absurdum with the hypotheses $a_{1} = (1, 0, \dots, 0)$ and $a_{2} = (α, 0, \dots, 0)$, where $α \in R$, we get $g^{(1)} (x) = g (x_{2}, \dots, x_{d} / x_{1}) f_{1} (x_{1})$ and

$$f = g^{(2)} (x) = g (x_{2}, \dots, x_{d} / x_{1}) f_{1} (x_{1}) \frac{f_{α a_{1}} (α x_{1})}{[g^{(1)}]_{α a_{1}} (α x_{1})} .$$

Hence

$$f (x_{2}, \dots, x_{d} / x_{1}) = g (x_{2}, \dots, x_{d} / x_{1}) \frac{f_{α a_{1}} (α x_{1})}{[g^{(1)}]_{α a_{1}} (α x_{1})} .$$

It consequently implies that $f_{α a_{1}} (α x_{1}) = [g^{(1)}]_{α a_{1}} (α x_{1})$ since

$$1 = \int f (x_{2}, \dots, x_{d} / x_{1}) d x_{2} \dots d x_{d} = \int g (x_{2}, \dots, x_{d} / x_{1}) d x_{2} \dots d x_{d} \, \frac{f_{α a_{1}} (α x_{1})}{[g^{(1)}]_{α a_{1}} (α x_{1})} = \frac{f_{α a_{1}} (α x_{1})}{[g^{(1)}]_{α a_{1}} (α x_{1})} .$$

Therefore, $g^{(2)} = g^{(1)}$, i.e., $p = 1$, which leads to a contradiction. Hence, the family is free. Moreover, using a reductio ad absurdum we get the orthogonality. Indeed, we have

$$\int f (x) d x = 1 \neq + \infty = \int n (a_{j + 1}^{⊤} x, \dots, a_{d}^{⊤} x) h (a_{1}^{⊤} x, \dots, a_{j}^{⊤} x) d x .$$

The use of the same argument as in the proof of Lemma D.2 enables us to infer the orthogonality of $(a_{i})_{i = 1, \dots, p}$.

Proof of Lemma D.6.

Lemma D.6. If there exists p, $p \leq d$, such that $D_{ϕ} (g^{(p)}, f) = 0$, where $g^{(p)}$ is built from the free and orthogonal family $a_{1}, \dots, a_{j}$, then there exists a free and orthogonal family $(b_{k})_{k = j + 1, \dots, d}$ of vectors of $R_{*}^{d}$, such that

$$g^{(p)} (x) = g (b_{j + 1}^{⊤} x, \dots, b_{d}^{⊤} x / a_{1}^{⊤} x, \dots, a_{j}^{⊤} x) f_{a_{1}} (a_{1}^{⊤} x) \dots f_{a_{j}} (a_{j}^{⊤} x)$$

and such that $R^{d} = \mathrm{Vect} \{a_{i}\} \overset{⊥}{\oplus} \mathrm{Vect} \{b_{k}\}$.

Through the incomplete basis theorem and similarly as in Lemma D.5, we obtain the result thanks to Fubini’s theorem.

Proof of Lemma D.7.

Lemma D.7. For any continuous density f, we have $y_{m} = | f_{m} (x) - f (x) | = O_{P} (m^{- \frac{2}{4 + d}})$.

Defining $b_{m} (x)$ as $b_{m} (x) = | E (f_{m} (x)) - f (x) |$, we have $y_{m} \leq | f_{m} (x) - E (f_{m} (x)) | + b_{m} (x)$. Moreover, from page 150 of Scott [15], we derive that $b_{m} (x) = O_{P} (\sum_{j = 1}^{d} h_{j}^{2})$ where $h_{j} = O_{P} (m^{- \frac{1}{4 + d}})$. Then, we obtain $b_{m} (x) = O_{P} (m^{- \frac{2}{4 + d}})$. Finally, since the central limit theorem rate is $O_{P} (m^{- \frac{1}{2}})$, we infer that $y_{m} \leq O_{P} (m^{- \frac{1}{2}}) + O_{P} (m^{- \frac{2}{4 + d}}) = O_{P} (m^{- \frac{2}{4 + d}})$.
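The two rates compared in this proof can be evaluated numerically to see why the bias term dominates. The helper names below are illustrative only, and constants hidden in the $O_{P}$ notation are ignored.

```python
def kde_rate(m, d):
    # Order of the kernel estimator error in Lemma D.7: m^(-2/(4+d))
    return m ** (-2.0 / (4 + d))

def clt_rate(m):
    # Order of the central limit theorem term: m^(-1/2)
    return m ** -0.5

def samples_needed(target, d):
    # Sample size at which m^(-2/(4+d)) equals `target`
    return target ** (-(4 + d) / 2.0)

for d in (1, 2, 5, 10):
    print(d, kde_rate(10_000, d), clt_rate(10_000), samples_needed(0.1, d))
```

Since $\frac{2}{4 + d} < \frac{1}{2}$ for every $d \geq 1$, the kernel rate is always the slower one, and the sample size needed to reach a given order grows quickly with the dimension.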

Proof of Proposition 3.1.

Without loss of generality, we reason with $x_{1}$ in lieu of $a^{⊤} x$.

Let us define $g^{*} = g r$. We remark that g and $g^{*}$ present the same density conditionally to $x_{1}$. Indeed,

$$g_{1}^{*} (x_{1}) = \int g^{*} (x) d x_{2} \dots d x_{d} = \int h (x_{1}) g (x) d x_{2} \dots d x_{d} = h (x_{1}) \int g (x) d x_{2} \dots d x_{d} = h (x_{1}) g_{1} (x_{1}) .$$

We can therefore prove this proposition.

First, since f and g are known, then, for any given function $h : x_{1} \mapsto h (x_{1})$, the application T, which is defined by

$$T : g (. / x_{1}) \frac{h (x_{1}) f_{1} (x_{1})}{g_{1} (x_{1})} \mapsto g (. / x_{1}) f_{1} (x_{1}),$$
$$T : f (. / x_{1}) f_{1} (x_{1}) \mapsto f (. / x_{1}) f_{1} (x_{1}),$$

is measurable.

Second, the above remark implies that

$$D_{ϕ} (g^{*}, f) = D_{ϕ} \left(g^{*} (. / x_{1}) \frac{g_{1} (x_{1}) h (x_{1})}{f_{1} (x_{1})}, f (. / x_{1}) f_{1} (x_{1})\right) = D_{ϕ} \left(g (. / x_{1}) \frac{g_{1} (x_{1}) h (x_{1})}{f_{1} (x_{1})}, f (. / x_{1}) f_{1} (x_{1})\right) .$$

Consequently, property A.3 page 1605 implies:

$$\begin{aligned} D_{ϕ} \left(g (. / x_{1}) \frac{g_{1} (x_{1}) h (x_{1})}{f_{1} (x_{1})}, f (. / x_{1}) f_{1} (x_{1})\right) &\geq D_{ϕ} \left(T^{- 1} \left(g (. / x_{1}) \frac{g_{1} (x_{1}) h (x_{1})}{f_{1} (x_{1})}\right), T^{- 1} (f (. / x_{1}) f_{1} (x_{1}))\right) \\ &= D_{ϕ} (g (. / x_{1}) f_{1} (x_{1}), f (. / x_{1}) f_{1} (x_{1})), \ \text{by the very definition of } T, \\ &= D_{ϕ} \left(g \frac{f_{1}}{g_{1}}, f\right), \end{aligned}$$

which completes the proof of this proposition.

Proof of Proposition 3.3. Proposition 3.3 comes immediately from Proposition B.1 page 1606 and Lemma A.1 page 1605.

Proof of Theorem 3.1. First, by the very definition of the kernel estimator, $\check{g}_{n}^{(0)} = g_{n}$ converges towards g. Moreover, the continuity of $a \mapsto f_{a, n}$ and $a \mapsto g_{a, n}$ and Proposition 3.3 imply that $\check{g}_{n}^{(1)} = \check{g}_{n}^{(0)} \frac{f_{a, n}}{\check{g}_{a, n}^{(0)}}$ converges towards $g^{(1)}$. Finally, since, for any k, $\check{g}_{n}^{(k)} = \check{g}_{n}^{(k - 1)} \frac{f_{\check{a}_{k}, n}}{\check{g}_{\check{a}_{k}, n}^{(k - 1)}}$, we conclude by an immediate induction.

Proof of Theorem 3.2. First, from Lemma D.7, we derive that, for any x, $\sup_{a \in R_{*}^{d}} | f_{a, n} (a^{⊤} x) - f_{a} (a^{⊤} x) | = O_{P} (n^{- \frac{2}{4 + d}})$. Then, let us consider

$$Ψ_{j} = \frac{f_{\check{a}_{j}, n} (\check{a}_{j}^{⊤} x)}{\check{g}_{\check{a}_{j}, n}^{(j - 1)} (\check{a}_{j}^{⊤} x)} - \frac{f_{a_{j}} (a_{j}^{⊤} x)}{g_{a_{j}}^{(j - 1)} (a_{j}^{⊤} x)} .$$

We have

$$Ψ_{j} = \frac{1}{\check{g}_{\check{a}_{j}, n}^{(j - 1)} (\check{a}_{j}^{⊤} x) \, g_{a_{j}}^{(j - 1)} (a_{j}^{⊤} x)} \left( \left(f_{\check{a}_{j}, n} (\check{a}_{j}^{⊤} x) - f_{a_{j}} (a_{j}^{⊤} x)\right) g_{a_{j}}^{(j - 1)} (a_{j}^{⊤} x) + f_{a_{j}} (a_{j}^{⊤} x) \left(g_{a_{j}}^{(j - 1)} (a_{j}^{⊤} x) - \check{g}_{\check{a}_{j}, n}^{(j - 1)} (\check{a}_{j}^{⊤} x)\right) \right),$$

i.e., $| Ψ_{j} | = O_{P} (n^{- \frac{1}{2} 1_{d = 1} - \frac{2}{4 + d} 1_{d > 1}})$ since $f_{a_{j}} (a_{j}^{⊤} x) = O (1)$ and $g_{a_{j}}^{(j - 1)} (a_{j}^{⊤} x) = O (1)$. We can therefore conclude similarly as in the proof of Theorem A.2.

Proof of Theorem D.1.

Theorem D.1. In the case where f is known and under the hypotheses assumed in Section 3.1, it holds

$$\sqrt{n} A . (\check{c}_{n} (a_{k}) - a_{k}) \overset{\text{Law}}{\to} B . N_{d} \left(0, P ∥ \frac{\partial}{\partial b} M (a_{k}, a_{k}) ∥^{2}\right) + C . N_{d} \left(0, P ∥ \frac{\partial}{\partial a} M (a_{k}, a_{k}) ∥^{2}\right)$$

and

$$\sqrt{n} A . (\check{γ}_{n} - a_{k}) \overset{\text{Law}}{\to} C . N_{d} \left(0, P ∥ \frac{\partial}{\partial b} M (a_{k}, a_{k}) ∥^{2}\right) + C . N_{d} \left(0, P ∥ \frac{\partial}{\partial a} M (a_{k}, a_{k}) ∥^{2}\right)$$

where

$$A = P \frac{\partial^{2}}{\partial b \partial b} M (a_{k}, a_{k}) \left(P \frac{\partial^{2}}{\partial a_{i} \partial a_{j}} M (a_{k}, a_{k}) + P \frac{\partial^{2}}{\partial a_{i} \partial b_{j}} M (a_{k}, a_{k})\right),$$

$$C = P \frac{\partial^{2}}{\partial b \partial b} M (a_{k}, a_{k})$$

and

$$B = P \frac{\partial^{2}}{\partial b \partial b} M (a_{k}, a_{k}) + P \frac{\partial^{2}}{\partial a_{i} \partial a_{j}} M (a_{k}, a_{k}) + P \frac{\partial^{2}}{\partial a_{i} \partial b_{j}} M (a_{k}, a_{k}) .$$

First of all, let us remark that hypotheses $(H 1)$ to $(H 3)$ imply that $\check{γ}_{n}$ and $\check{c}_{n} (a_{k})$ converge towards $a_{k}$ in probability. Hypothesis $(H 4)$ enables us to differentiate under the integral sign and to obtain, after calculation,

$$P \frac{\partial}{\partial b} M (a_{k}, a_{k}) = P \frac{\partial}{\partial a} M (a_{k}, a_{k}) = 0,$$

$$P \frac{\partial^{2}}{\partial a_{i} \partial b_{j}} M (a_{k}, a_{k}) = P \frac{\partial^{2}}{\partial b_{j} \partial a_{i}} M (a_{k}, a_{k}) = \int φ'' \left(\frac{g f_{a_{k}}}{f g_{a_{k}}}\right) \frac{\partial}{\partial a_{i}} \frac{g f_{a_{k}}}{f g_{a_{k}}} \frac{\partial}{\partial b_{j}} \frac{g f_{a_{k}}}{f g_{a_{k}}} f d x,$$

$$P \frac{\partial^{2}}{\partial b_{i} \partial b_{j}} M (a_{k}, a_{k}) = - \int φ'' \left(\frac{g f_{a_{k}}}{f g_{a_{k}}}\right) \frac{\partial}{\partial b_{i}} \frac{g f_{a_{k}}}{f g_{a_{k}}} \frac{\partial}{\partial b_{j}} \frac{g f_{a_{k}}}{f g_{a_{k}}} f d x,$$

$$P \frac{\partial^{2}}{\partial a_{i} \partial a_{j}} M (a_{k}, a_{k}) = \int φ' \left(\frac{g f_{a_{k}}}{f g_{a_{k}}}\right) \frac{\partial^{2}}{\partial a_{i} \partial a_{j}} \frac{g f_{a_{k}}}{f g_{a_{k}}} f d x,$$

and consequently

$$P \frac{\partial^{2}}{\partial b_{i} \partial b_{j}} M (a_{k}, a_{k}) = - P \frac{\partial^{2}}{\partial a_{i} \partial b_{j}} M (a_{k}, a_{k}) = - P \frac{\partial^{2}}{\partial b_{j} \partial a_{i}} M (a_{k}, a_{k}),$$

which implies

$$\begin{aligned} \frac{\partial^{2}}{\partial a_{i} \partial a_{j}} K \left(g \frac{f_{a_{k}}}{g_{a_{k}}}, f\right) &= P \frac{\partial^{2}}{\partial a_{i} \partial a_{j}} M (a_{k}, a_{k}) - P \frac{\partial^{2}}{\partial b_{i} \partial b_{j}} M (a_{k}, a_{k}) \\ &= P \frac{\partial^{2}}{\partial a_{i} \partial a_{j}} M (a_{k}, a_{k}) + P \frac{\partial^{2}}{\partial a_{i} \partial b_{j}} M (a_{k}, a_{k}) \\ &= P \frac{\partial^{2}}{\partial a_{i} \partial a_{j}} M (a_{k}, a_{k}) + P \frac{\partial^{2}}{\partial b_{j} \partial a_{i}} M (a_{k}, a_{k}) . \end{aligned}$$

The very definition of the estimators $\check{γ}_{n}$ and $\check{c}_{n} (a_{k})$ implies that

$$\begin{cases} P_{n} \frac{\partial}{\partial b} M (b, a) = 0 \\ P_{n} \frac{\partial}{\partial a} M (b (a), a) = 0 \end{cases}$$

i.e.

$$\begin{cases} P_{n} \frac{\partial}{\partial b} M (\check{c}_{n} (a_{k}), \check{γ}_{n}) = 0 \\ P_{n} \frac{\partial}{\partial a} M (\check{c}_{n} (a_{k}), \check{γ}_{n}) + P_{n} \frac{\partial}{\partial b} M (\check{c}_{n} (a_{k}), \check{γ}_{n}) \frac{\partial}{\partial a} \check{c}_{n} (a_{k}) = 0, \end{cases}$$

i.e.

$$\begin{cases} P_{n} \frac{\partial}{\partial b} M (\check{c}_{n} (a_{k}), \check{γ}_{n}) = 0 & (E 0) \\ P_{n} \frac{\partial}{\partial a} M (\check{c}_{n} (a_{k}), \check{γ}_{n}) = 0 & (E 1) \end{cases}$$

Under $(H 5)$ and $(H 6)$, and using a Taylor development of the $(E 0)$ (resp. $(E 1)$) equation, we infer that there exists $(\bar{c}_{n}, \bar{γ}_{n})$ (resp. $(\tilde{c}_{n}, \tilde{γ}_{n})$) on the interval $[(\check{c}_{n} (a_{k}), \check{γ}_{n}), (a_{k}, a_{k})]$ such that

$$- P_{n} \frac{\partial}{\partial b} M (a_{k}, a_{k}) = \left[\left(P \frac{\partial^{2}}{\partial b \partial b} M (a_{k}, a_{k})\right)^{⊤} + o_{P} (1), \ \left(P \frac{\partial^{2}}{\partial a \partial b} M (a_{k}, a_{k})\right)^{⊤} + o_{P} (1)\right] a_{n}$$

(resp.

$$- P_{n} \frac{\partial}{\partial a} M (a_{k}, a_{k}) = \left[\left(P \frac{\partial^{2}}{\partial b \partial a} M (a_{k}, a_{k})\right)^{⊤} + o_{P} (1), \ \left(P \frac{\partial^{2}}{\partial a^{2}} M (a_{k}, a_{k})\right)^{⊤} + o_{P} (1)\right] a_{n}$$

) with $a_{n} = ((\check{c}_{n} (a_{k}) - a_{k})^{⊤}, (\check{γ}_{n} - a_{k})^{⊤})$. Thus we get

$$\sqrt{n} a_{n} = \sqrt{n} \begin{bmatrix} P \frac{\partial^{2}}{\partial b^{2}} M (a_{k}, a_{k}) & P \frac{\partial^{2}}{\partial a \partial b} M (a_{k}, a_{k}) \\ P \frac{\partial^{2}}{\partial b \partial a} M (a_{k}, a_{k}) & P \frac{\partial^{2}}{\partial a^{2}} M (a_{k}, a_{k}) \end{bmatrix}^{- 1} \begin{bmatrix} - P_{n} \frac{\partial}{\partial b} M (a_{k}, a_{k}) \\ - P_{n} \frac{\partial}{\partial a} M (a_{k}, a_{k}) \end{bmatrix} + o_{P} (1)$$

$$= \sqrt{n} \left(P \frac{\partial^{2}}{\partial b \partial b} M (a_{k}, a_{k}) \frac{\partial^{2}}{\partial a \partial a} K \left(g \frac{f_{a_{k}}}{g_{a_{k}}}, f\right)\right)^{- 1} . \begin{bmatrix} P \frac{\partial^{2}}{\partial b \partial b} M (a_{k}, a_{k}) + \frac{\partial^{2}}{\partial a \partial a} K \left(g \frac{f_{a_{k}}}{g_{a_{k}}}, f\right) & P \frac{\partial^{2}}{\partial b \partial b} M (a_{k}, a_{k}) \\ P \frac{\partial^{2}}{\partial b \partial b} M (a_{k}, a_{k}) & P \frac{\partial^{2}}{\partial b \partial b} M (a_{k}, a_{k}) \end{bmatrix} . \begin{bmatrix} - P_{n} \frac{\partial}{\partial b} M (a_{k}, a_{k}) \\ - P_{n} \frac{\partial}{\partial a} M (a_{k}, a_{k}) \end{bmatrix} + o_{P} (1) .$$

Moreover, the central limit theorem implies:

$$P_{n} \frac{\partial}{\partial b} M (a_{k}, a_{k}) \overset{\text{Law}}{\to} N_{d} \left(0, P ∥ \frac{\partial}{\partial b} M (a_{k}, a_{k}) ∥^{2}\right), \qquad P_{n} \frac{\partial}{\partial a} M (a_{k}, a_{k}) \overset{\text{Law}}{\to} N_{d} \left(0, P ∥ \frac{\partial}{\partial a} M (a_{k}, a_{k}) ∥^{2}\right),$$

since $P \frac{\partial}{\partial b} M (a_{k}, a_{k}) = P \frac{\partial}{\partial a} M (a_{k}, a_{k}) = 0$, which leads us to the result.

Proof of Theorem 3.3. We derive this theorem through Proposition B.1 and Theorem D.1.

Proof of Theorem 3.4. We recall that $g_{n}^{(k)}$ is the kernel estimator of $\check{g}^{(k)}$. Since the Kullback–Leibler divergence is greater than the $L^{1}$-distance, we then have

$$\lim_{n} \lim_{k} K (g_{n}^{(k)}, f_{n}) \geq \lim_{n} \lim_{k} \int | g_{n}^{(k)} (x) - f_{n} (x) | d x .$$

Moreover, Fatou’s lemma implies that

$$\lim_{k} \int | g_{n}^{(k)} (x) - f_{n} (x) | d x \geq \int \lim_{k} [| g_{n}^{(k)} (x) - f_{n} (x) |] d x = \int | [\lim_{k} g_{n}^{(k)} (x)] - f_{n} (x) | d x$$

and

$$\lim_{n} \int | [\lim_{k} g_{n}^{(k)} (x)] - f_{n} (x) | d x \geq \int \lim_{n} [| [\lim_{k} g_{n}^{(k)} (x)] - f_{n} (x) |] d x = \int | [\lim_{n} \lim_{k} g_{n}^{(k)} (x)] - \lim_{n} f_{n} (x) | d x .$$

Through Lemma A.4, we then obtain that

$$0 = \lim_{n} \lim_{k} K (g_{n}^{(k)}, f_{n}) \geq \int | [\lim_{n} \lim_{k} g_{n}^{(k)} (x)] - \lim_{n} f_{n} (x) | d x \geq 0,$$

i.e., that $\int | [\lim_{n} \lim_{k} g_{n}^{(k)} (x)] - \lim_{n} f_{n} (x) | d x = 0$. Moreover, for any given k and any given n, the function $g_{n}^{(k)}$ is a convex combination of multivariate Gaussian distributions. As derived at Remark 2.1 of page 1585, for all k, the determinant of the covariance of the random vector with density $g^{(k)}$ is greater than or equal to the product of a positive constant times the determinant of the covariance of the random vector with density f. The form of the kernel estimate therefore implies that there exists an integrable function φ such that, for any given k and any given n, we have $| g_{n}^{(k)} | \leq φ$.

Finally, the dominated convergence theorem enables us to say that $\lim_{n} \lim_{k} g_{n}^{(k)} = \lim_{n} f_{n} = f$, since $f_{n}$ converges towards f and since $\int | [\lim_{n} \lim_{k} g_{n}^{(k)} (x)] - \lim_{n} f_{n} (x) | d x = 0$.

Proof of Corollary 3.1. Through the dominated convergence theorem and through Theorem 3.4, we get the result using a reductio ad absurdum.

Proof of Theorem 3.5. Through Proposition B.1 and Theorem A.3, we derive Theorem 3.5.

© 2010 by the author; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license http://creativecommons.org/licenses/by/3.0/.
