Article

Projection Pursuit Through ϕ-Divergence Minimisation

Laboratoire de Statistique Théorique et Appliquée, Université Pierre et Marie Curie, 175 rue du Chevaleret, 75013 Paris, France
Entropy 2010, 12(6), 1581-1611; https://doi.org/10.3390/e12061581
Submission received: 8 April 2010 / Revised: 27 May 2010 / Accepted: 31 May 2010 / Published: 14 June 2010

Abstract

In his 1985 article (“Projection pursuit”), Huber demonstrates the value of his method for estimating a density from a data set in a simple given case. He considers the factorization of a density into a Gaussian component and a residual density. Huber’s work is based on maximizing the Kullback–Leibler divergence. Our proposal, based on minimising a ϕ-divergence, leads to a new algorithm. We also consider the case where the density to be factorized is estimated from an i.i.d. sample, and we then propose a test for the factorization of the estimated density. Applications include a new goodness-of-fit test for elliptical copulas.
MSC Classification:
94A17; 62F05; 62J05; 62G08

1. Outline of the Article

The objective of projection pursuit is to generate one or several projections providing as much information as possible about the structure of the data set, regardless of its size.
Once a structure has been isolated, the corresponding data are transformed through a Gaussianization. This process is then iterated recursively to find another structure in the remaining data, until no further structure can be found in the data left at the end.
Friedman [1] and Huber [2] were among the first authors to introduce this type of approach for revealing structure. Each describes, with many examples, how to reveal such a structure and consequently how to estimate the density of such data, through two different methodologies each. Their work is based on maximizing the Kullback–Leibler divergence.
For a very long time, the two methodologies presented by each of these authors were thought to be equivalent, but Zhu [3] showed that this is not the case when the number of iterations in the algorithms exceeds the dimension of the space containing the data, i.e., in the case of density estimation. In the present article, we will therefore focus only on Huber’s study while taking Zhu’s remarks into account.
Let us now briefly introduce Huber’s methodology. We will then present our approach and objective.

1.1. Huber’s analytic approach

Let $f$ be a density on $\mathbb{R}^d$. We define an instrumental density $g$ with the same mean and variance as $f$. Huber’s methodology starts by testing whether $K(f,g)=0$, with $K$ being the Kullback–Leibler divergence. If this test is positive, then $f=g$ and the algorithm stops. Otherwise, the first step of Huber’s algorithm amounts to defining a vector $a_1$ and a density $f^{(1)}$ by
$$a_1 = \arg\inf_{a\in\mathbb{R}^d_*} K\Big(f\frac{g_a}{f_a},\, g\Big) \quad\text{and}\quad f^{(1)} = f\,\frac{g_{a_1}}{f_{a_1}} \qquad (1.1)$$
where $\mathbb{R}^d_*$ is the set of non-null vectors of $\mathbb{R}^d$ and where $f_a$ (resp. $g_a$) stands for the density of $a^\top X$ (resp. $a^\top Y$) when $f$ (resp. $g$) is the density of $X$ (resp. $Y$). More exactly, this results from the maximisation of $a \mapsto K(f_a,g_a)$, since $K(f,g) = K(f_a,g_a) + K\big(f\frac{g_a}{f_a}, g\big)$ and it is assumed that $K(f,g)$ is finite. In a second step, Huber replaces $f$ with $f^{(1)}$ and goes through the first step again.
By iterating this process, Huber thus obtains a sequence $(a_1,a_2,\dots)$ of vectors of $\mathbb{R}^d_*$ and a sequence of densities $f^{(i)}$.
Remark 1.1. Huber stops his algorithm when the Kullback–Leibler divergence equals zero or when the algorithm reaches the $d$-th iteration; he then obtains an approximation of $f$ from $g$:
When there exists an integer $j$ such that $K(f^{(j)},g)=0$ with $j\le d$, he obtains $f^{(j)}=g$, i.e., $f = g\prod_{i=1}^{j}\frac{f^{(i-1)}_{a_i}}{g_{a_i}}$, since by induction $f^{(j)} = f\prod_{i=1}^{j}\frac{g_{a_i}}{f^{(i-1)}_{a_i}}$. Similarly, when Huber gets $K(f^{(j)},g)>0$ for all $j\le d$, he assumes $g=f^{(d)}$ in order to derive $f = g\prod_{i=1}^{d}\frac{f^{(i-1)}_{a_i}}{g_{a_i}}$.
He can also stop his algorithm when the Kullback–Leibler divergence equals zero without the condition $j\le d$ being met. Therefore, since by induction we have $f^{(j)} = f\prod_{i=1}^{j}\frac{g_{a_i}}{f^{(i-1)}_{a_i}}$ with $f^{(0)}=f$, we obtain $g = f\prod_{i=1}^{j}\frac{g_{a_i}}{f^{(i-1)}_{a_i}}$. Consequently, we derive a representation of $f$ as $f = g\prod_{i=1}^{j}\frac{f^{(i-1)}_{a_i}}{g_{a_i}}$.
Finally, he obtains $K(f^{(0)},g) \ge K(f^{(1)},g) \ge \dots \ge 0$ with $f^{(0)}=f$.

1.2. Huber’s synthetic approach

Keeping the notations of the above section, we start by testing whether $K(f,g)=0$. If this test is positive, then $f=g$ and the algorithm stops. Otherwise, the first step of his algorithm consists in defining a vector $a_1$ and a density $g^{(1)}$ by
$$a_1 = \arg\inf_{a\in\mathbb{R}^d_*} K\Big(f,\, g\frac{f_a}{g_a}\Big) \quad\text{and}\quad g^{(1)} = g\,\frac{f_{a_1}}{g_{a_1}} \qquad (1.2)$$
More exactly, this optimisation results from the maximisation of $a \mapsto K(f_a,g_a)$, since $K(f,g) = K(f_a,g_a) + K\big(f, g\frac{f_a}{g_a}\big)$ and it is assumed that $K(f,g)$ is finite. In a second step, Huber replaces $g$ with $g^{(1)}$ and goes through the first step again. By iterating this process, Huber thus obtains a sequence $(a_1,a_2,\dots)$ of vectors of $\mathbb{R}^d_*$ and a sequence of densities $g^{(i)}$.
Remark 1.2. First, in a similar manner to the analytic approach, this methodology enables us to approximate and even to represent $f$ from $g$:
To obtain an approximation of $f$, Huber either stops his algorithm when the Kullback–Leibler divergence equals zero, i.e., $K(f,g^{(j)})=0$ implies $g^{(j)}=f$ with $j\le d$, or when his algorithm reaches the $d$-th iteration, i.e., he approximates $f$ with $g^{(d)}$.
To obtain a representation of $f$, Huber stops his algorithm when the Kullback–Leibler divergence equals zero, since $K(f,g^{(j)})=0$ implies $g^{(j)}=f$. Therefore, since by induction we have $g^{(j)} = g\prod_{i=1}^{j}\frac{f_{a_i}}{g^{(i-1)}_{a_i}}$ with $g^{(0)}=g$, we then obtain $f = g\prod_{i=1}^{j}\frac{f_{a_i}}{g^{(i-1)}_{a_i}}$.
Second, he gets $K(f,g^{(0)}) \ge K(f,g^{(1)}) \ge \dots \ge 0$ with $g^{(0)}=g$.
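The decomposition behind the synthetic approach can be checked in a few lines. The sketch below assumes the convention $K(p,q)=\int p\log(p/q)$, that every integral involved is finite, and that $g\frac{f_a}{g_a}$ is a density; the variable $t$ is introduced here only for the change of variable $t=a^\top x$.

```latex
\begin{aligned}
K\!\left(f,\; g\frac{f_a}{g_a}\right)
 &= \int f(x)\,\log\frac{f(x)\,g_a(a^\top x)}{g(x)\,f_a(a^\top x)}\,dx \\
 &= \int f(x)\,\log\frac{f(x)}{g(x)}\,dx
    \;-\; \int f(x)\,\log\frac{f_a(a^\top x)}{g_a(a^\top x)}\,dx \\
 &= K(f,g) \;-\; \int f_a(t)\,\log\frac{f_a(t)}{g_a(t)}\,dt \\
 &= K(f,g) \;-\; K(f_a,g_a).
\end{aligned}
```

Since $K(f,g)$ does not depend on $a$, minimising $K\big(f, g\frac{f_a}{g_a}\big)$ over $a$ amounts to maximising $K(f_a,g_a)$.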

1.3. Proposal

Let us first introduce the concept of ϕ-divergence.
Let $\varphi$ be a strictly convex function $\varphi : \overline{\mathbb{R}^+} \to \overline{\mathbb{R}^+}$ such that $\varphi(1)=0$. We define the ϕ-divergence of $P$ from $Q$, where $P$ and $Q$ are two probability distributions over a space $\Omega$ such that $Q$ is absolutely continuous with respect to $P$, by
$$D_\phi(Q,P) = \int \varphi\Big(\frac{dQ}{dP}\Big)\,dP$$
or $D_\phi(q,p) = \int \varphi\Big(\frac{q(x)}{p(x)}\Big)\,p(x)\,dx$, if $P$ and $Q$ have densities $p$ and $q$ respectively.
Throughout this article, we will also assume that $\varphi(0)<\infty$, that $\varphi$ is continuous and that this divergence is greater than the $L^1$ distance; see also Appendix A.1, page 1604.
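To make the definition concrete, here is a minimal sketch of its discrete analogue, in which the integral becomes a finite sum. The function names and the two generators are illustrative choices, not taken from the article, and requiring every $p_i>0$ is a simplification of the absolute-continuity condition.

```python
import math

def phi_kl(x):
    """Generator x*log(x) - x + 1 (Kullback-Leibler); phi(0) = 1 by continuity."""
    return 1.0 if x == 0.0 else x * math.log(x) - x + 1.0

def phi_chi2(x):
    """Generator (x - 1)^2 / 2 (half the chi-square divergence)."""
    return 0.5 * (x - 1.0) ** 2

def phi_divergence(q, p, phi):
    """Discrete analogue of D_phi(Q, P) = sum_i phi(q_i / p_i) * p_i.

    Assumes p_i > 0 for every i (a simplification of Q << P).
    """
    if len(q) != len(p):
        raise ValueError("q and p must have the same length")
    if any(pi <= 0.0 for pi in p):
        raise ValueError("every p_i must be strictly positive")
    return sum(phi(qi / pi) * pi for qi, pi in zip(q, p))

q = [0.5, 0.5]
p = [0.25, 0.75]
print(phi_divergence(q, p, phi_kl))    # equals sum_i q_i * log(q_i / p_i)
print(phi_divergence(q, p, phi_chi2))
print(phi_divergence(p, p, phi_kl))    # the divergence of P from itself is zero
```

Both generators are strictly convex, non-negative and vanish at 1, as required of $\varphi$ above.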
Now, let us introduce our algorithm.
We start by testing whether $D_\phi(g,f)=0$. If this test is positive, then $f=g$ and the algorithm stops. Otherwise, the first step of our algorithm consists in defining a vector $a_1$ and a density $g^{(1)}$ by
$$a_1 = \arg\inf_{a\in\mathbb{R}^d_*} D_\phi\Big(g\frac{f_a}{g_a},\, f\Big) \quad\text{and}\quad g^{(1)} = g\,\frac{f_{a_1}}{g_{a_1}} \qquad (1.3)$$
Later on, we will prove that $a_1$ simultaneously optimises (1.1), (1.2) and (1.3).
In our second step, we replace $g$ with $g^{(1)}$ and repeat the first step.
By iterating this process, we obtain a sequence $(a_1,a_2,\dots)$ of vectors in $\mathbb{R}^d_*$ and a sequence of densities $g^{(i)}$.
We will thus prove that the underlying structures of $f$ revealed through this method are identical to the ones obtained through Huber’s method. We will also exhibit these structures, which will enable us to infer more information on $f$; see the examples below.
Remark 1.3. As in the previous algorithm, we first provide an approximation and even a representation of $f$ from $g$: To obtain an approximation of $f$, we stop our algorithm when the divergence equals zero, i.e., $D_\phi(g^{(j)},f)=0$ implies $g^{(j)}=f$ with $j\le d$, or when our algorithm reaches the $d$-th iteration, i.e., we approximate $f$ with $g^{(d)}$.
To obtain a representation of $f$, we stop our algorithm when the divergence equals zero. Therefore, since by induction we have $g^{(j)} = g\prod_{i=1}^{j}\frac{f_{a_i}}{g^{(i-1)}_{a_i}}$ with $g^{(0)}=g$, we then obtain $f = g\prod_{i=1}^{j}\frac{f_{a_i}}{g^{(i-1)}_{a_i}}$.
Second, we get $D_\phi(g^{(0)},f) \ge D_\phi(g^{(1)},f) \ge \dots \ge 0$ with $g^{(0)}=g$.
Finally, the specific form of relationship (1.3) establishes that we deal with M-estimation. We can therefore state that our method is more robust than Huber’s—see Yohai [4], Toma [5] as well as Huber [6].
Let us now study two examples:
Example 1.1. Let $f$ be a density defined on $\mathbb{R}^3$ by $f(x_1,x_2,x_3) = n(x_1,x_2)\,h(x_3)$, with $n$ being a bi-dimensional Gaussian density and $h$ being a non-Gaussian density. Let us also consider $g$, a Gaussian density with the same mean and variance as $f$.
Since $g(x_1,x_2/x_3) = n(x_1,x_2)$, we then have $D_\phi\big(g\frac{f_3}{g_3}, f\big) = D_\phi(n\cdot f_3, f) = D_\phi(f,f) = 0$ as $f_3=h$, i.e., the function $a \mapsto D_\phi\big(g\frac{f_a}{g_a}, f\big)$ reaches zero for $e_3=(0,0,1)$, where $f_3$ and $g_3$ are the third marginal densities of $f$ and $g$ respectively.
We therefore obtain $g(x_1,x_2/x_3) = f(x_1,x_2/x_3)$.
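The mechanism of Example 1.1 can be illustrated numerically on a two-dimensional grid, a simplified stand-in for the three-dimensional case. In this sketch, the grid, the bimodal choice of $h$, the Kullback–Leibler generator and all function names are assumptions made for illustration; the Gaussian factor of $g$ is only a discretised approximation of a Gaussian with the mean and variance of $h$.

```python
import math

def phi(x):
    """KL generator x*log(x) - x + 1, used here as one admissible phi."""
    return 1.0 if x == 0.0 else x * math.log(x) - x + 1.0

def normalise(weights):
    total = sum(weights)
    return [w / total for w in weights]

grid = [-6.0 + 0.25 * i for i in range(49)]

# n: Gaussian factor; h: bimodal (hence non-Gaussian) factor.
n = normalise([math.exp(-0.5 * x * x) for x in grid])
h = normalise([math.exp(-2.0 * (x - 2.0) ** 2) + math.exp(-2.0 * (x + 2.0) ** 2)
               for x in grid])

# m: discretised Gaussian built from the mean and variance of h.
mean_h = sum(x * w for x, w in zip(grid, h))
var_h = sum((x - mean_h) ** 2 * w for x, w in zip(grid, h))
m = normalise([math.exp(-0.5 * (x - mean_h) ** 2 / var_h) for x in grid])

f = [[ni * hj for hj in h] for ni in n]   # f(x1, x2) = n(x1) h(x2)
g = [[ni * mj for mj in m] for ni in n]   # g(x1, x2) = n(x1) m(x2)

def marginal(p, axis):
    """Marginal of a 2-D table along coordinate `axis` (0 or 1)."""
    if axis == 0:
        return [sum(row) for row in p]
    return [sum(row[j] for row in p) for j in range(len(p[0]))]

def replace_marginal(g, f, axis):
    """Return g * f_a / g_a for a the `axis`-th canonical basis vector."""
    fa, ga = marginal(f, axis), marginal(g, axis)
    return [[g[i][j] * (fa[i] / ga[i] if axis == 0 else fa[j] / ga[j])
             for j in range(len(g[0]))] for i in range(len(g))]

def divergence(q, p):
    """Discrete D_phi(q, p) = sum phi(q / p) * p over all grid cells."""
    return sum(phi(qij / pij) * pij
               for qrow, prow in zip(q, p) for qij, pij in zip(qrow, prow))

d_along_h = divergence(replace_marginal(g, f, 1), f)   # direction carrying h
d_along_n = divergence(replace_marginal(g, f, 0), f)   # Gaussian direction
print(d_along_h, d_along_n)
```

Replacing the marginal of $g$ along the non-Gaussian direction recovers $f$ up to rounding, so the divergence vanishes there, whereas replacing it along the Gaussian direction leaves $g$ unchanged and the divergence stays positive.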
Example 1.2. Assume that the ϕ-divergence is greater than the $L^2$ norm. Let us consider $(X_n)_{n\ge 0}$, a Markov chain with continuous state space $E$. Let $f$ be the density of $(X_0,X_1)$ and let $g$ be the normal density with the same mean and variance as $f$.
Let us now assume that $D_\phi(g^{(1)},f)=0$ with $g^{(1)}(x) = g(x)\frac{f_1}{g_1}$, i.e., let us assume that our algorithm stops for $a_1=(1,0)$. Consequently, if $(Y_0,Y_1)$ is a random vector with density $g$, then the distribution law of $X_1$ given $X_0$ is Gaussian and is equal to the distribution law of $Y_1$ given $Y_0$.
And then, for any sequence $(A_i)$, where $A_i \subset E$, we have
$$\begin{aligned} P(X_{n+1}\in A_{n+1} \mid X_0\in A_0, X_1\in A_1, \dots, X_{n-1}\in A_{n-1}, X_n\in A_n) &= P(X_{n+1}\in A_{n+1} \mid X_n\in A_n), \text{ based on the very definition of a Markov chain,} \\ &= P(X_1\in A_1 \mid X_0\in A_0), \text{ through the Markov property,} \\ &= P(Y_1\in A_1 \mid Y_0\in A_0), \text{ as a consequence of the above nullity of the ϕ-divergence.} \end{aligned}$$
To recapitulate our method: if $D_\phi(g,f)=0$, we derive $f$ from the relationship $f=g$; should a sequence $(a_i)_{i=1,\dots,j}$, $j<d$, of vectors in $\mathbb{R}^d_*$ defining $g^{(j)}$ and such that $D_\phi(g^{(j)},f)=0$ exist, then $f(\,\cdot\,/a_i^\top x, 1\le i\le j) = g(\,\cdot\,/a_i^\top x, 1\le i\le j)$, i.e., $f$ coincides with $g$ on the complement of the vector subspace generated by the family $\{a_i\}_{i=1,\dots,j}$; see also Section 2 for a more detailed explanation.
In this paper, after having clarified the choice of $g$, we will consider the statistical solution to the representation problem, assuming that $f$ is unknown and that $X_1, X_2, \dots, X_m$ are i.i.d. with density $f$. We will provide asymptotic results pertaining to the family of optimizing vectors $a_{k,m}$, which we will define more precisely below, as $m$ goes to infinity. Our results also prove that the empirical representation scheme converges towards the theoretical one. As an application, Section 3.4 provides a new goodness-of-fit test pertaining to the copula of an unknown density $f$, Section 3.5 gives an estimate of a density deconvoluted with a Gaussian component, and Section 3.6 presents some applications to regression analysis. Finally, we will present simulations and an application to real datasets.

2. The Algorithm

2.1. The model

As explained by Friedman [1] and Diaconis [7], the choice of g depends on the family of distributions one wants to find in f. Until now, the only choice has been the class of Gaussian distributions. This can be extended to the class of elliptical distributions with almost all ϕ-divergences.

Elliptical laws 

The interest of this class lies in the fact that conditional densities with elliptical distributions are also elliptical—see Cambanis [8], Landsman [9]. This very property allows us to use this class in our algorithm.
Definition 2.1. 
$X$ is said to follow a multivariate elliptical distribution, noted $X \sim E_d(\mu,\Sigma,\xi_d)$, if $X$ has the following density, for any $x$ in $\mathbb{R}^d$:
$$f_X(x) = \frac{c_d}{|\Sigma|^{1/2}}\,\xi_d\Big(\frac12 (x-\mu)^\top \Sigma^{-1}(x-\mu)\Big)$$
  • with $\Sigma$ being a $d\times d$ positive-definite matrix and with $\mu$ being a $d$-column vector,
  • with $\xi_d$ being referred to as the “density generator”,
  • with $c_d$ being a normalisation constant such that $c_d = \frac{\Gamma(d/2)}{(2\pi)^{d/2}}\Big(\int_0^\infty x^{d/2-1}\xi_d(x)\,dx\Big)^{-1}$, with $\int_0^\infty x^{d/2-1}\xi_d(x)\,dx < \infty$.
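The normalisation constant can be checked numerically. The sketch below, whose function names and quadrature settings are illustrative assumptions, evaluates the integral with a midpoint rule after the substitution $x=u^2$ and neglects the tail beyond a cut-off, which is only reasonable for fast-decaying generators. With the generator $\xi(x)=e^{-x}$ it should give $c_d=(2\pi)^{-d/2}$, the Gaussian constant.

```python
import math

def gaussian_generator(x):
    """Density generator xi(x) = exp(-x), which yields Gaussian laws."""
    return math.exp(-x)

def generator_integral(d, xi, upper=12.0, steps=24000):
    """Midpoint-rule value of int_0^inf x^(d/2 - 1) xi(x) dx.

    The substitution x = u^2 turns it into 2 * int_0^inf u^(d-1) xi(u^2) du,
    which has no singularity at 0; the tail beyond `upper` is neglected.
    """
    width = upper / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * width
        total += 2.0 * u ** (d - 1) * xi(u * u)
    return total * width

def normalising_constant(d, xi):
    """c_d = Gamma(d/2) / (2 pi)^(d/2) / int_0^inf x^(d/2 - 1) xi(x) dx."""
    return math.gamma(d / 2.0) / (2.0 * math.pi) ** (d / 2.0) / generator_integral(d, xi)

def elliptical_density_1d(x, mu, sigma2, xi):
    """Univariate case of the elliptical density, with Sigma = sigma2 > 0."""
    z = 0.5 * (x - mu) ** 2 / sigma2
    return normalising_constant(1, xi) / math.sqrt(sigma2) * xi(z)

print(normalising_constant(3, gaussian_generator), (2.0 * math.pi) ** -1.5)
print(elliptical_density_1d(0.3, 0.0, 1.0, gaussian_generator))
```

Other generators can be passed in the same way, provided the integral is finite and the cut-off is adapted to their tails.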
Property 2.1. 
1/ For any $X \sim E_d(\mu,\Sigma,\xi_d)$, for any $m\times d$ matrix $A$ with rank $m\le d$, and for any $m$-dimensional vector $b$, we have $AX+b \sim E_m(A\mu+b, A\Sigma A^\top, \xi_m)$.
Therefore, any marginal density of a multivariate elliptical distribution is elliptical, i.e., $X=(X_1,X_2,\dots,X_d) \sim E_d(\mu,\Sigma,\xi_d) \Rightarrow X_i \sim E_1(\mu_i,\sigma_i^2,\xi_1)$, $f_{X_i}(x) = \frac{c_1}{\sigma_i}\,\xi_1\Big(\frac12\Big(\frac{x-\mu_i}{\sigma_i}\Big)^2\Big)$, $1\le i\le d$.
2/ Corollary 5 of Cambanis [8] states that conditional densities with elliptical distributions are also elliptical. Indeed, if $X=(X_1,X_2)^\top \sim E_d(\mu,\Sigma,\xi_d)$, with $X_1$ (resp. $X_2$) being of size $d_1<d$ (resp. $d_2<d$), then $X_1/(X_2=a) \sim E_{d_1}(\mu',\Sigma',\xi_{d_1})$ with $\mu' = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(a-\mu_2)$ and $\Sigma' = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$, with $\mu=(\mu_1,\mu_2)$ and $\Sigma=(\Sigma_{ij})_{1\le i,j\le 2}$.
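As a small worked instance of the conditioning formulas, the sketch below treats the bivariate case, where every block of $\Sigma$ is a scalar. The function name and the numerical values are illustrative assumptions.

```python
def conditional_parameters(mu, sigma, a):
    """Location and scale of X1 given X2 = a for a bivariate elliptical law.

    mu = (mu1, mu2); sigma = ((s11, s12), (s21, s22)) with s22 > 0.
    Returns (mu', Sigma') with
        mu'    = mu1 + s12 / s22 * (a - mu2)
        Sigma' = s11 - s12 * s21 / s22
    """
    mu1, mu2 = mu
    (s11, s12), (s21, s22) = sigma
    if s22 <= 0.0:
        raise ValueError("s22 must be strictly positive")
    mu_cond = mu1 + s12 / s22 * (a - mu2)
    sigma_cond = s11 - s12 * s21 / s22
    return mu_cond, sigma_cond

# mu = (1, 2), Sigma = [[4, 2], [2, 3]], conditioning on X2 = 5.
print(conditional_parameters((1.0, 2.0), ((4.0, 2.0), (2.0, 3.0)), 5.0))
```

With these values, $\mu' = 1 + \frac{2}{3}(5-2) = 3$ and $\Sigma' = 4 - \frac{4}{3} = \frac{8}{3}$.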
Remark 2.1. 
Landsman [9] shows that multivariate Gaussian distributions derive from $\xi_d(x) = e^{-x}$. He also shows that if $X=(X_1,\dots,X_d)$ has an elliptical density such that its marginals satisfy $E(X_i)<\infty$ and $E(X_i^2)<\infty$ for $1\le i\le d$, then $\mu$ is the mean of $X$ and $\Sigma$ is a multiple of the covariance matrix of $X$. Consequently, from now on, we will assume that we are in this case.
Definition 2.2. 
Let $t$ be an elliptical density on $\mathbb{R}^k$ and let $q$ be an elliptical density on $\mathbb{R}^{k'}$. The elliptical densities $t$ and $q$ are said to belong to the same family (or class) of elliptical densities if their generating densities, $\xi_k$ and $\xi_{k'}$ respectively, belong to a common given family of densities.
Example 2.1. 
Consider two Gaussian densities $N(0,1)$ and $N((0,0),\mathrm{Id}_2)$. They are said to belong to the same elliptical family, as both have $x \mapsto e^{-x}$ as generating density.

Choice of g 

Let us begin with studying the following case:
Let $f$ be a density on $\mathbb{R}^d$. Let us assume there exist $d$ non-null linearly independent vectors $a_j$, with $1\le j\le d$, of $\mathbb{R}^d$, such that
$$f(x) = n(a_{j+1}^\top x,\dots,a_d^\top x)\,h(a_1^\top x,\dots,a_j^\top x) \qquad (2.1)$$
with $j<d$, with $n$ being an elliptical density on $\mathbb{R}^{d-j-1}$ and with $h$ being a density on $\mathbb{R}^j$ which does not belong to the same family as $n$. Let $X=(X_1,\dots,X_d)$ be a vector with density $f$.
Define $g$ as an elliptical distribution with the same mean and variance as $f$.
For simplicity, let us assume that the family $\{a_j\}_{1\le j\le d}$ is the canonical basis of $\mathbb{R}^d$:
The very definition of $f$ implies that $(X_{j+1},\dots,X_d)$ is independent from $(X_1,\dots,X_j)$. Hence, the density of $(X_{j+1},\dots,X_d)$ given $(X_1,\dots,X_j)$ is $n$.
Let us assume that $D_\phi(g^{(j)},f)=0$ for some $j\le d$. We then get $\frac{f(x)}{f_{a_1}f_{a_2}\cdots f_{a_j}} = \frac{g(x)}{g^{(1-1)}_{a_1}g^{(2-1)}_{a_2}\cdots g^{(j-1)}_{a_j}}$, since, by induction, we have $g^{(j)}(x) = g(x)\frac{f_{a_1}}{g^{(1-1)}_{a_1}}\frac{f_{a_2}}{g^{(2-1)}_{a_2}}\cdots\frac{f_{a_j}}{g^{(j-1)}_{a_j}}$.
Consequently, the fact that conditional densities with elliptical distributions are also elliptical enables us to infer that
$$n(a_{j+1}^\top x,\dots,a_d^\top x) = f(\,\cdot\,/a_i^\top x, 1\le i\le j) = g(\,\cdot\,/a_i^\top x, 1\le i\le j)$$
In other words, $f$ coincides with $g$ on the complement of the vector subspace generated by the family $\{a_i\}_{i=1,\dots,j}$.
Now, if the family $\{a_j\}_{1\le j\le d}$ is no longer the canonical basis of $\mathbb{R}^d$, then this family is still a basis of $\mathbb{R}^d$. Hence, Lemma D.1 (page 1607) implies that
$$g(\,\cdot\,/a_1^\top x,\dots,a_j^\top x) = n(a_{j+1}^\top x,\dots,a_d^\top x) = f(\,\cdot\,/a_1^\top x,\dots,a_j^\top x)$$
which is equivalent to having $D_\phi(g^{(j)},f)=0$, since by induction $g^{(j)} = g\frac{f_{a_1}}{g^{(1-1)}_{a_1}}\frac{f_{a_2}}{g^{(2-1)}_{a_2}}\cdots\frac{f_{a_j}}{g^{(j-1)}_{a_j}}$.
The end of our algorithm implies that $f$ coincides with $g$ on the complement of the vector subspace generated by the family $\{a_i\}_{i=1,\dots,j}$. Therefore, the nullity of the ϕ-divergence provides us with information on the density structure.
In summary, the following proposition clarifies our choice of g which depends on the family of distribution one wants to find in f:
Proposition 2.1. 
With the above notations, $D_\phi(g^{(j)},f)=0$ is equivalent to
$$g(\,\cdot\,/a_1^\top x,\dots,a_j^\top x) = f(\,\cdot\,/a_1^\top x,\dots,a_j^\top x)$$
More generally, the above proposition leads us to define the co-support of $f$ as the vector space generated by the vectors $a_1,\dots,a_j$.
Definition 2.3. 
Let $f$ be a density on $\mathbb{R}^d$. We define the co-vectors of $f$ as the sequence of vectors $a_1,\dots,a_j$ which solves the problem $D_\phi(g^{(j)},f)=0$, where $g$ is an elliptical distribution with the same mean and variance as $f$. We define the co-support of $f$ as the vector space generated by the vectors $a_1,\dots,a_j$.
Remark 2.2. 
Any family $(a_i)$ defining $f$ as in (2.1) is an orthogonal basis of $\mathbb{R}^d$; see Lemma D.2.

2.2. Stochastic outline of our algorithm

Let $X_1, X_2, \dots, X_m$ (resp. $Y_1, Y_2, \dots, Y_m$) be a sequence of $m$ independent random vectors with the same density $f$ (resp. $g$). As is customary in nonparametric ϕ-divergence optimization, all estimates of $f$ and $f_a$, as well as all uses of Monte Carlo methods, are performed using subsamples $X_1, X_2, \dots, X_n$ and $Y_1, Y_2, \dots, Y_n$, extracted respectively from $X_1, X_2, \dots, X_m$ and $Y_1, Y_2, \dots, Y_m$, such that the estimates are bounded below by some positive deterministic sequence $\theta_m$; see Appendix B.
Let $P_n$ be the empirical measure of the subsample $X_1, X_2, \dots, X_n$. Let $f_n$ (resp. $f_{a,n}$ for any $a$ in $\mathbb{R}^d_*$) be the kernel estimate of $f$ (resp. $f_a$), which is built from $X_1, X_2, \dots, X_n$ (resp. $a^\top X_1, a^\top X_2, \dots, a^\top X_n$).
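A kernel estimate of a projected density can be sketched as follows. The Gaussian kernel, the fixed bandwidth and the function names are assumptions made for illustration; this passage does not specify the kernel or the bandwidth rule, and the truncation by $\theta_m$ is not reproduced here.

```python
import math

def project(a, x):
    """Scalar projection a^T x of a point x onto the direction a."""
    return sum(ai * xi for ai, xi in zip(a, x))

def gaussian_kde(points, bandwidth):
    """Return t -> (1 / (n h)) * sum_i K((t - points_i) / h), K standard normal."""
    if bandwidth <= 0.0:
        raise ValueError("bandwidth must be strictly positive")
    if not points:
        raise ValueError("at least one point is required")
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2.0 * math.pi))

    def estimate(t):
        return norm * sum(math.exp(-0.5 * ((t - p) / bandwidth) ** 2) for p in points)

    return estimate

def projected_density_estimate(a, sample, bandwidth):
    """Kernel estimate of the density of a^T X built from a^T X_1, ..., a^T X_n."""
    return gaussian_kde([project(a, x) for x in sample], bandwidth)

sample = [(0.0, 1.0), (1.0, -1.0), (2.0, 0.5)]
f_an = projected_density_estimate((1.0, 0.0), sample, bandwidth=0.5)
print(f_an(1.0))
```

Only the projected values enter the estimate, so the same routine serves for every candidate direction $a$.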
As defined in Section 1.3, we introduce the following sequences $(a_k)_{k\ge 1}$ and $(g^{(k)})_{k\ge 1}$:
$a_k$ is a non-null vector of $\mathbb{R}^d$ such that $a_k = \arg\min_{a\in\mathbb{R}^d_*} D_\phi\Big(g^{(k-1)}\frac{f_a}{g^{(k-1)}_a},\, f\Big)$,
$g^{(k)}$ is the density such that $g^{(k)} = g^{(k-1)}\frac{f_{a_k}}{g^{(k-1)}_{a_k}}$ with $g^{(0)}=g$.
The stochastic version of the algorithm uses $f_n$ and $g_n^{(0)}=g$ instead of $f$ and $g^{(0)}=g$, since $g$ is known. Thus, at the first step, we build the vector $\check a_1$ which minimizes the ϕ-divergence between $f_n$ and $g\frac{f_{a,n}}{g_a}$ and which estimates $a_1$:
Proposition B.1 (page 1606) and Lemma D.3 (page 1607) enable us to minimize the ϕ-divergence between $f_n$ and $g\frac{f_{a,n}}{g_a}$. Defining $\check a_1$ as the argument of this minimization, Proposition 3.3 (page 1589) shows that this vector tends to $a_1$.
Finally, we define the density $\check g_m^{(1)}$ as $\check g_m^{(1)} = g\frac{f_{\check a_1,m}}{g_{\check a_1}}$, which estimates $g^{(1)}$ through Theorem 3.1.
Now, from the second step onwards and as defined in Section 1.3, the density $g^{(k-1)}$ is unknown. Consequently, once again, we have to truncate the samples:
All estimates of $f$ and $f_a$ (resp. $g^{(1)}$ and $g_a^{(1)}$) are performed using a subsample $X_1, X_2, \dots, X_n$ (resp. $Y_1^{(1)}, Y_2^{(1)}, \dots, Y_n^{(1)}$) extracted from $X_1, X_2, \dots, X_m$ (resp. $Y_1^{(1)}, Y_2^{(1)}, \dots, Y_m^{(1)}$, which is a sequence of $m$ independent random vectors with the same density $g^{(1)}$) such that the estimates are bounded below by some positive deterministic sequence $\theta_m$; see Appendix B.
Let $P_n$ be the empirical measure of the subsample $X_1, X_2, \dots, X_n$. Let $f_n$ (resp. $g_n^{(1)}$, $f_{a,n}$, $g_{a,n}^{(1)}$ for any $a$ in $\mathbb{R}^d_*$) be the kernel estimate of $f$ (resp. $g^{(1)}$, $f_a$ and $g_a^{(1)}$), which is built from $X_1, X_2, \dots, X_n$ (resp. $Y_1^{(1)}, Y_2^{(1)}, \dots, Y_n^{(1)}$, $a^\top X_1, a^\top X_2, \dots, a^\top X_n$ and $a^\top Y_1^{(1)}, a^\top Y_2^{(1)}, \dots, a^\top Y_n^{(1)}$). The stochastic version of the algorithm uses $f_n$ and $g_n^{(1)}$ instead of $f$ and $g^{(1)}$.
Thus, we build the vector $\check a_2$ which minimizes the ϕ-divergence between $f_n$ and $g_n^{(1)}\frac{f_{a,n}}{g_{a,n}^{(1)}}$, since $g^{(1)}$ and $g_a^{(1)}$ are unknown, and which estimates $a_2$.
Proposition B.1 (page 1606) and Lemma D.3 (page 1607) enable us to minimize the ϕ-divergence between $f_n$ and $g_n^{(1)}\frac{f_{a,n}}{g_{a,n}^{(1)}}$. Defining $\check a_2$ as the argument of this minimization, Proposition 3.3 (page 1589) shows that this vector tends to $a_2$ in $n$. Finally, we define the density $\check g_n^{(2)}$ as $\check g_n^{(2)} = g_n^{(1)}\frac{f_{\check a_2,n}}{g_{\check a_2,n}^{(1)}}$, which estimates $g^{(2)}$ through Theorem 3.1.
Continuing in this way, we obtain a sequence $(\check a_1,\check a_2,\dots)$ of vectors in $\mathbb{R}^d_*$ estimating the co-vectors of $f$ and a sequence of densities $(\check g_n^{(k)})_k$ such that $\check g_n^{(k)}$ estimates $g^{(k)}$ through Theorem 3.1.

3. Results

3.1. Convergence results

3.1.1. Hypotheses on f

In this paragraph, we define the set of hypotheses on f which could possibly be of use in our work. Discussion on several of these hypotheses can be found in Appendix C.
In this section, to lighten the notation, we write $g$ for $g^{(k-1)}$. Let
$$\Theta = \mathbb{R}^d, \qquad \Theta^{D_\phi} = \Big\{ b\in\Theta \;\Big|\; \int \varphi^*\Big(\varphi'\Big(\frac{g(x)}{f(x)}\frac{f_b(b^\top x)}{g_b(b^\top x)}\Big)\Big)\,dP < \infty \Big\}$$
$$M(b,a,x) = \int \varphi'\Big(\frac{g(x)}{f(x)}\frac{f_b(b^\top x)}{g_b(b^\top x)}\Big)\, g(x)\frac{f_a(a^\top x)}{g_a(a^\top x)}\,dx \;-\; \varphi^*\Big(\varphi'\Big(\frac{g(x)}{f(x)}\frac{f_b(b^\top x)}{g_b(b^\top x)}\Big)\Big)$$
$$P_nM(b,a) = \int M(b,a,x)\,dP_n, \qquad PM(b,a) = \int M(b,a,x)\,dP$$
where $P$ is the probability measure with density $f$.
As in Chapter V of Van der Vaart [10], let us define:
(H1): For all $\varepsilon>0$, there is $\eta>0$ such that, for all $c\in\Theta^{D_\phi}$ verifying $\|c-a_k\|\ge\varepsilon$, we have $PM(c,a)-\eta > PM(a_k,a)$, with $a\in\Theta$.
(H2): $\exists\, Z<0$, $n_0>0$ such that $\big(n\ge n_0 \Rightarrow \sup_{a\in\Theta}\sup_{c\in\{\Theta^{D_\phi}\}^c} P_nM(c,a) < Z\big)$.
(H3): There is a neighbourhood $V$ of $a_k$ and a positive function $H$ such that, for all $c\in V$, we have $|M(c,a_k,x)|\le H(x)$ ($P$-a.s.) with $PH<\infty$.
(H4): There is a neighbourhood $V$ of $a_k$ such that, for all $\varepsilon$, there is an $\eta$ such that, for all $c\in V$ and $a\in\Theta$ verifying $\|a-a_k\|\ge\varepsilon$, we have $PM(c,a_k) < PM(c,a)-\eta$.
Putting $I_{a_k} = \frac{\partial^2}{\partial a^2}D_\phi\big(g\frac{f_{a_k}}{g_{a_k}}, f\big)$ and $x\mapsto\rho(b,a,x) = \varphi'\Big(\frac{g(x)f_b(b^\top x)}{f(x)g_b(b^\top x)}\Big)\frac{g(x)f_a(a^\top x)}{g_a(a^\top x)}$, we further define:
(H5): The function $\varphi$ is $C^3$ in $(0,+\infty)$ and there is a neighbourhood $V_k$ of $(a_k,a_k)$ such that, for all $(b,a)$ of $V_k$, the gradient $\nabla\big(\frac{g(x)f_a(a^\top x)}{g_a(a^\top x)}\big)$ and the Hessian $\mathcal{H}\big(\frac{g(x)f_a(a^\top x)}{g_a(a^\top x)}\big)$ exist ($\lambda$-a.s.), and the first-order partial derivatives of $\frac{g(x)f_a(a^\top x)}{g_a(a^\top x)}$ and the first- and second-order derivatives of $(b,a)\mapsto\rho(b,a,x)$ are dominated ($\lambda$-a.s.) by $\lambda$-integrable functions.
(H6): The function $(b,a)\mapsto M(b,a)$ is $C^3$ in a neighbourhood $V_k$ of $(a_k,a_k)$ for all $x$, and the partial derivatives of $(b,a)\mapsto M(b,a)$ are all dominated in $V_k$ by a $P$-integrable function $H(x)$.
(H7): $P\|\frac{\partial}{\partial b}M(a_k,a_k)\|^2$ and $P\|\frac{\partial}{\partial a}M(a_k,a_k)\|^2$ are finite, and the expressions $P\frac{\partial^2}{\partial b_i\partial b_j}M(a_k,a_k)$ and $I_{a_k}$ exist and are invertible.
(H8): There exists $k$ such that $PM(a_k,a_k)=0$.
(H9): $(\mathrm{Var}_P(M(a_k,a_k)))^{1/2}$ exists and is invertible.
(H0): $f$ and $g$ are assumed to be positive and bounded and such that $K(g,f)\ge\int|f(x)-g(x)|\,dx$.

3.1.2. Estimation of the first co-vector of f

Let $\mathcal{R}$ be the class of all positive functions $r$ defined on $\mathbb{R}$ and such that $g(x)\,r(a^\top x)$ is a density on $\mathbb{R}^d$ for all $a$ belonging to $\mathbb{R}^d_*$. The following proposition shows that there exists a vector $a$ such that $\frac{f_a}{g_a}$ minimizes $D_\phi(gr,f)$ in $r$:
Proposition 3.1. 
There exists a vector $a$ belonging to $\mathbb{R}^d_*$ such that
$$\arg\min_{r\in\mathcal{R}} D_\phi(gr,f) = \frac{f_a}{g_a} \quad\text{and}\quad r(a^\top x) = \frac{f_a(a^\top x)}{g_a(a^\top x)}$$
Remark 3.1. 
This proposition proves that $a_1$ simultaneously optimises (1.1), (1.2) and (1.3). In other words, it proves that the underlying structures of $f$ revealed through our method are identical to the ones obtained through Huber’s methods.
Following Broniatowski [11], let us introduce the estimate of $D_\phi\big(g\frac{f_{a,n}}{g_a}, f_n\big)$ through
$$\check D_\phi\Big(g\frac{f_{a,n}}{g_a},\, f_n\Big) = \int M(a,a,x)\,dP_n(x)$$
Proposition 3.2. 
Let $\check a$ be such that $\check a := \arg\inf_{a\in\mathbb{R}^d_*}\check D_\phi\big(g\frac{f_{a,n}}{g_a}, f_n\big)$.
Then, $\check a$ is a strongly convergent estimate of $a$, as defined in Proposition 3.1.
Let us also introduce the following sequences $(\check a_k)_{k\ge 1}$ and $(\check g_n^{(k)})_{k\ge 1}$, for any given $n$ (see Section 2.2):
  • $\check a_k$ is an estimate of $a_k$ as defined in Proposition 3.2 with $\check g_n^{(k-1)}$ instead of $g$,
  • $\check g_n^{(k)}$ is such that $\check g_n^{(0)}=g$, $\check g_n^{(k)}(x) = \check g_n^{(k-1)}(x)\frac{f_{\check a_k,n}(\check a_k^\top x)}{[\check g^{(k-1)}]_{\check a_k,n}(\check a_k^\top x)}$, i.e., $\check g_n^{(k)}(x) = g(x)\prod_{j=1}^{k}\frac{f_{\check a_j,n}(\check a_j^\top x)}{[\check g^{(j-1)}]_{\check a_j,n}(\check a_j^\top x)}$.
We also note that $\check g_n^{(k)}$ is a density.

3.1.3. Convergence study at the k-th step of the algorithm

In this paragraph, we will show that the sequence $(\check a_k)_n$ converges towards $a_k$ and that the sequence $(\check g_n^{(k)})_n$ converges towards $g^{(k)}$.
Let $\check c_n(a) = \arg\sup_{c\in\Theta}P_nM(c,a)$, with $a\in\Theta$, and $\check\gamma_n = \arg\inf_{a\in\Theta}\sup_{c\in\Theta}P_nM(c,a)$. We state
Proposition 3.3. 
Almost surely, $\sup_{a\in\Theta}\|\check c_n(a)-a_k\|$ converges to zero and $\check\gamma_n$ converges towards $a_k$.
Finally, the following theorem shows that $\check g_n^{(k)}$ converges almost everywhere towards $g^{(k)}$:
Theorem 3.1. 
It holds $\check g_n^{(k)} \to_n g^{(k)}$ a.s.

3.2. Asymptotic Inference at the k-th step of the algorithm

The following theorem shows that $\check g_n^{(k)}$ converges towards $g^{(k)}$ at the rate $O_P(n^{-\frac{2}{2+d}})$ in three different cases, namely for any given $x$, with the $L^1$ distance and with the Kullback–Leibler divergence:
Theorem 3.2. 
It holds $|\check g_n^{(k)}(x)-g^{(k)}(x)| = O_P(n^{-\frac{2}{2+d}})$, $\int|\check g_n^{(k)}(x)-g^{(k)}(x)|\,dx = O_P(n^{-\frac{2}{2+d}})$ and $|K(\check g_n^{(k)},f)-K(g^{(k)},f)| = O_P(n^{-\frac{2}{2+d}})$.
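To get a feel for how the dimension affects this rate, the sketch below reads the exponent as $-\frac{2}{2+d}$ and ignores the constants hidden in $O_P$, so the numbers indicate orders of magnitude only. The function names are illustrative assumptions.

```python
def rate(n, d):
    """Order n^(-2 / (2 + d)) of the estimation error, constants omitted."""
    if n <= 0 or d <= 0:
        raise ValueError("n and d must be strictly positive")
    return n ** (-2.0 / (2.0 + d))

def sample_size_for(target, d):
    """Value of n solving n^(-2 / (2 + d)) = target, i.e. target^(-(2 + d) / 2)."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    return target ** (-(2.0 + d) / 2.0)

for d in (1, 2, 5, 10):
    print(d, rate(10_000, d), sample_size_for(0.01, d))
```

For a fixed sample size the order of the error degrades as $d$ grows, which is the usual curse of dimensionality for kernel-based estimates.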
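The following theorem shows that the laws of our estimators of $a_k$, namely $\check c_n(a_k)$ and $\check\gamma_n$, converge towards a linear combination of Gaussian variables.
Theorem 3.3. 
It holds $\sqrt n\,A.(\check c_n(a_k)-a_k) \xrightarrow{\text{Law}} B.N_d\big(0, P\|\tfrac{\partial}{\partial b}M(a_k,a_k)\|^2\big) + C.N_d\big(0, P\|\tfrac{\partial}{\partial a}M(a_k,a_k)\|^2\big)$ and $\sqrt n\,A.(\check\gamma_n-a_k) \xrightarrow{\text{Law}} C.N_d\big(0, P\|\tfrac{\partial}{\partial b}M(a_k,a_k)\|^2\big) + C.N_d\big(0, P\|\tfrac{\partial}{\partial a}M(a_k,a_k)\|^2\big)$, where $A = P\frac{\partial^2}{\partial b\partial b}M(a_k,a_k)\Big(P\frac{\partial^2}{\partial a_i\partial a_j}M(a_k,a_k) + P\frac{\partial^2}{\partial a_i\partial b_j}M(a_k,a_k)\Big)$, $C = P\frac{\partial^2}{\partial b\partial b}M(a_k,a_k)$ and $B = P\frac{\partial^2}{\partial b\partial b}M(a_k,a_k) + P\frac{\partial^2}{\partial a_i\partial a_j}M(a_k,a_k) + P\frac{\partial^2}{\partial a_i\partial b_j}M(a_k,a_k)$.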

3.3. A stopping rule for the procedure

In this paragraph, we will call $\check g_n^{(k)}$ (resp. $\check g_{a,n}^{(k)}$) the kernel estimator of $\check g^{(k)}$ (resp. $\check g_a^{(k)}$). We will first show that $g_n^{(k)}$ converges towards $f$ in $k$ and $n$. Then, we will provide a stopping rule for this identification procedure.

3.3.1. Estimation of f

The following theorem provides us with an estimate of $f$:
Theorem 3.4. 
We have $\lim_n\lim_k \check g_n^{(k)} = f$ a.s.
Consequently, the following corollary shows that $D_\phi\Big(g_n^{(k-1)}\frac{f_{a_k,n}}{g^{(k-1)}_{a_k,n}},\, f_{a_k,n}\Big)$ converges towards zero as $k$ and then as $n$ go to infinity:
Corollary 3.1. 
We have $\lim_n\lim_k D_\phi\Big(\check g_n^{(k)}\frac{f_{a_k,n}}{[\check g^{(k)}]_{a_k,n}},\, f_n\Big) = 0$ a.s.

3.3.2. Testing of the criteria

In this paragraph, through a test of our criterion, namely $a \mapsto D_\phi\Big(\check g_n^{(k)}\frac{f_{a,n}}{[\check g^{(k)}]_{a,n}},\, f_n\Big)$, we will build a stopping rule for this procedure. First, the next theorem enables us to derive the law of our criterion:
Theorem 3.5. 
For a fixed $k$, we have
$$\sqrt n\,\big(\mathrm{Var}_P(M(\check c_n(\check\gamma_n),\check\gamma_n))\big)^{-1/2}\big(P_nM(\check c_n(\check\gamma_n),\check\gamma_n) - P_nM(a_k,a_k)\big) \xrightarrow{\text{Law}} N(0,I),$$
where $k$ represents the $k$-th step of our algorithm and where $I$ is the identity matrix in $\mathbb{R}^d$.
Note that $k$ is fixed in Theorem 3.5 since $\check\gamma_n = \arg\inf_{a\in\Theta}\sup_{c\in\Theta}P_nM(c,a)$, where $M$ is a known function of $k$; see Section 3.1. Thus, in the case where $D_\phi\Big(g^{(k-1)}\frac{f_{a_k}}{g^{(k-1)}_{a_k}},\, f\Big) = 0$, we obtain
Corollary 3.2. 
We have $\sqrt n\,\big(\mathrm{Var}_P(M(\check c_n(\check\gamma_n),\check\gamma_n))\big)^{-1/2}\,P_nM(\check c_n(\check\gamma_n),\check\gamma_n) \xrightarrow{\text{Law}} N(0,I)$.
Hence, we propose the test of the null hypothesis
$(H_0)$: $D_\phi\Big(g^{(k-1)}\frac{f_{a_k}}{g^{(k-1)}_{a_k}},\, f\Big) = 0$ versus the alternative $(H_1)$: $D_\phi\Big(g^{(k-1)}\frac{f_{a_k}}{g^{(k-1)}_{a_k}},\, f\Big) \ne 0$.
Based on this result, we stop the algorithm; then, defining $a_k$ as the last vector generated, we derive from Corollary 3.2 an α-level confidence ellipsoid around $a_k$, namely
$$E_k = \Big\{ b\in\mathbb{R}^d;\ \sqrt n\,\big(\mathrm{Var}_P(M(b,b))\big)^{-1/2}\,P_nM(b,b) \le q_\alpha^{N(0,1)} \Big\}$$
where $q_\alpha^{N(0,1)}$ is the α-level quantile of the standard normal distribution and where $P_n$ is the empirical measure arising from a realization of the sequences $(X_1,\dots,X_n)$ and $(Y_1,\dots,Y_n)$.
Consequently, the following corollary provides us with a confidence region for the above test:
Corollary 3.3. 
$E_k$ is a confidence region for the test of the null hypothesis $(H_0)$ versus $(H_1)$.
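The membership check defining $E_k$ can be sketched in a scalar setting. The sketch below assumes that the values $M(b,b,X_i)$ have already been computed for the subsample, replaces $\mathrm{Var}_P$ by the sample variance, and takes the quantile level as a free parameter; all three choices, and the function names, are simplifying assumptions rather than the article's procedure.

```python
import math
from statistics import NormalDist

def criterion_statistic(m_values):
    """sqrt(n) * mean / standard deviation of the values M(b, b, X_i)."""
    n = len(m_values)
    if n < 2:
        raise ValueError("at least two values are required")
    mean = sum(m_values) / n
    variance = sum((v - mean) ** 2 for v in m_values) / (n - 1)
    if variance == 0.0:
        raise ValueError("the sample variance must be non-zero")
    return math.sqrt(n) * mean / math.sqrt(variance)

def in_confidence_region(m_values, level=0.95):
    """True when the statistic does not exceed the N(0, 1) quantile at `level`."""
    return criterion_statistic(m_values) <= NormalDist().inv_cdf(level)

print(in_confidence_region([-1.0, 1.0, -1.0, 1.0]))      # mean 0: inside
print(in_confidence_region([1.0, 2.0, 3.0, 4.0] * 25))   # mean far from 0: outside
```

When the check passes, the hypothesis of a null divergence is not rejected and the iteration would stop at the current step.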

3.4. Goodness-of-fit test for copulas

Let us begin with studying the following case:
Let $f$ be a density defined on $\mathbb{R}^2$ and let $g$ be an elliptical distribution with the same mean and variance as $f$. Assume first that our algorithm leads to $D_\phi(g^{(2)},f)=0$, where the family $(a_i)$ is the canonical basis of $\mathbb{R}^2$. Hence, we have $g^{(2)}(x) = g(x)\frac{f_1}{g_1}\frac{f_2}{g_2^{(1)}} = g(x)\frac{f_1}{g_1}\frac{f_2}{g_2}$, through Lemma D.4 (page 1608), and $g^{(2)}=f$. Therefore, $f = g(x)\frac{f_1}{g_1}\frac{f_2}{g_2}$, i.e., $\frac{f}{f_1f_2} = \frac{g}{g_1g_2}$, and then $\frac{\partial^2}{\partial x\partial y}C_f = \frac{\partial^2}{\partial x\partial y}C_g$, where $C_f$ (resp. $C_g$) is the copula of $f$ (resp. $g$).
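The passage from the ratio identity to the copula identity rests on Sklar's theorem. Assuming that $f$ and $g$ are absolutely continuous, and writing $F_i$ and $G_i$ for their marginal distribution functions and $c_f$, $c_g$ for their copula densities (notation introduced here for illustration), the step can be written out as:

```latex
\begin{aligned}
f(x,y) &= c_f\big(F_1(x),F_2(y)\big)\,f_1(x)\,f_2(y),
 \qquad c_f(u,v) = \frac{\partial^2 C_f}{\partial u\,\partial v}(u,v), \\
g(x,y) &= c_g\big(G_1(x),G_2(y)\big)\,g_1(x)\,g_2(y),
 \qquad c_g(u,v) = \frac{\partial^2 C_g}{\partial u\,\partial v}(u,v), \\
\frac{f}{f_1 f_2} = \frac{g}{g_1 g_2}
 &\;\Longrightarrow\;
 c_f\big(F_1(x),F_2(y)\big) = c_g\big(G_1(x),G_2(y)\big).
\end{aligned}
```

In words, the two copula densities agree once each is evaluated at the marginal distribution functions of its own density.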
Now, let $f$ be a density on $\mathbb{R}^d$ and let $g$ be the density defined in Section 2.1.
Let us assume that our algorithm implies that $D_\phi(g^{(d)},f)=0$.
Hence, we have, for any $x\in\mathbb{R}^d$, $g(x)\prod_{k=1}^{d}\frac{f_{a_k}(a_k^\top x)}{[g^{(k-1)}]_{a_k}(a_k^\top x)} = f(x)$, i.e., $\frac{g(x)}{\prod_{k=1}^{d}g_{a_k}(a_k^\top x)} = \frac{f(x)}{\prod_{k=1}^{d}f_{a_k}(a_k^\top x)}$, since Lemma D.4 (page 1608) implies that $g^{(k-1)}_{a_k} = g_{a_k}$ if $k\le d$.
Moreover, the family $(a_i)_{i=1\dots d}$ is a basis of $\mathbb{R}^d$; see Lemma D.5 (page 1608). Hence, putting $A=(a_1,\dots,a_d)$ and defining the vector $y$ (resp. density $\tilde f$, copula $\tilde C_f$ of $\tilde f$, density $\tilde g$, copula $\tilde C_g$ of $\tilde g$) as the expression of the vector $x$ (resp. density $f$, copula $C_f$ of $f$, density $g$, copula $C_g$ of $g$) in basis $A$, the above equality implies $\frac{\partial^d}{\partial y_1\cdots\partial y_d}\tilde C_f = \frac{\partial^d}{\partial y_1\cdots\partial y_d}\tilde C_g$.
Finally, we perform a statistical test of the null hypothesis $(H_0)$: $\frac{\partial^d}{\partial y_1\cdots\partial y_d}\tilde C_f = \frac{\partial^d}{\partial y_1\cdots\partial y_d}\tilde C_g$ versus the alternative $(H_1)$: $\frac{\partial^d}{\partial y_1\cdots\partial y_d}\tilde C_f \ne \frac{\partial^d}{\partial y_1\cdots\partial y_d}\tilde C_g$. Since, under $(H_0)$, we have $D_\phi(g^{(d)},f)=0$, then, as explained in Section 3.3, Corollary 3.3 provides us with a confidence region for our test.
Theorem 3.6. 
Keeping the notations of Corollary 3.3, we infer that $E_d$ is a confidence region for the test of the null hypothesis $(H_0)$ versus the alternative hypothesis $(H_1)$.

3.5. Rewriting of the convolution product

In the present paper, we first elaborated an algorithm aiming at isolating several known structures from initial data. Our objective was to verify whether, for a known density $f$ on $\mathbb{R}^d$, there did indeed exist a known density $n$ on $\mathbb{R}^{d-j-1}$ such that, for $d>1$,
$$f(x) = n(a_{j+1}^\top x,\dots,a_d^\top x)\,h(a_1^\top x,\dots,a_j^\top x) \qquad (3.1)$$
with $j<d$, with $(a_1,\dots,a_d)$ being a basis of $\mathbb{R}^d$ and with $h$ being a density on $\mathbb{R}^j$.
Secondly, our next step consisted in building an estimate (resp. a representation) of $f$ without necessarily assuming that $f$ meets relationship (3.1); see Theorem 3.4.
Consequently, let us consider $Z_1$ and $Z_2$, two independent random vectors with respective densities $h_1$ and $h_2$ (the latter being elliptical) on $\mathbb{R}^d$. Let us consider a random vector $X$ such that $X=Z_1+Z_2$ and let $f$ be its density. This density can then be written as $f(x) = h_1*h_2(x) = \int_{\mathbb{R}^d}h_1(t)\,h_2(x-t)\,dt$.
Then, the following property enables us to represent f under the form of a product and without the integral sign.
Proposition 3.4.
Let $\phi$ be a centered elliptical density with $\sigma^2 I_d$, $\sigma^2 > 0$, as covariance matrix, such that it is a product density in all orthogonal coordinate systems and such that its characteristic function $s \mapsto \Psi(\frac{1}{2} |s|^2 \sigma^2)$ is integrable (see Landsman [9]). Let $f$ be a density on $\mathbb{R}^d$ which can be deconvoluted with $\phi$, i.e., $f = \bar{f} * \phi = \int_{\mathbb{R}^d} \bar{f}(t)\, \phi(x - t)\, dt$, where $\bar{f}$ is some density on $\mathbb{R}^d$. Let $g^{(0)}$ be the elliptical density belonging to the same elliptical family as $f$ and having the same mean and variance as $f$.
Then, the sequence $(g^{(k)})_k$ converges towards $f$ as $k \to \infty$, uniformly a.s. and in $L^1$, i.e.,
$$\lim_{k \to \infty} \sup_{x \in \mathbb{R}^d} |g^{(k)}(x) - f(x)| = 0, \quad \text{and} \quad \lim_{k \to \infty} \int_{\mathbb{R}^d} |g^{(k)}(x) - f(x)|\, dx = 0$$
Finally, with the notations of Section 3.3 and of Proposition 3.4, the following theorem enables us to estimate any convolution product of a multivariate elliptical density $\phi$ with a continuous density $\bar{f}$:
Theorem 3.7.
It holds that $\lim_{n \to \infty} \lim_{k \to \infty} \check{g}_n^{(k)} = \bar{f} * \phi$ a.s.
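The convolution representation used in this section can be checked numerically in a simple case. The sketch below is an illustration only: it assumes dimension one and two Gaussian densities $h_1 = N(0, 1)$ and $h_2 = N(0, 2^2)$ (choices made here, not taken from the article), for which the sum has the known density $N(0, 5)$. The helper names `normal_pdf` and `convolve_at` are hypothetical.

```python
import math

def normal_pdf(x, mean=0.0, sd=1.0):
    """Density of the univariate normal law N(mean, sd^2)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def convolve_at(h1, h2, x, lo=-30.0, hi=30.0, steps=6000):
    """Approximate (h1 * h2)(x) = integral of h1(t) * h2(x - t) dt (midpoint rule)."""
    dt = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        t = lo + (i + 0.5) * dt
        total += h1(t) * h2(x - t)
    return total * dt

# Assumed example densities: Z1 ~ N(0, 1) and Z2 ~ N(0, 4), independent.
h1 = lambda t: normal_pdf(t, 0.0, 1.0)
h2 = lambda t: normal_pdf(t, 0.0, 2.0)

for x in (-1.0, 0.0, 2.5):
    numeric = convolve_at(h1, h2, x)
    exact = normal_pdf(x, 0.0, math.sqrt(5.0))  # density of Z1 + Z2
    print(x, numeric, exact)
```

The two printed columns should agree closely, which is the one-dimensional Gaussian instance of $f = h_1 * h_2$.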

3.6. On the regression

In this section, we will study several applications of our algorithm pertaining to regression analysis. We define $(X_1, \ldots, X_d)$ (resp. $(Y_1, \ldots, Y_d)$) as a vector with density $f$ (resp. $g$, see Section 2.1).
Remark 3.2.
In this paragraph, we will work in the $L^2$ space. We will therefore first consider only the ϕ-divergences which are greater than or equal to the $L^2$ distance (see Vajda [12]). Note also that the co-vectors of $f$ can be obtained in the $L^2$ space (see Lemma D.3 and Proposition B.1).

3.6.1. The basic idea

In this paragraph, we will assume that $\Theta = \mathbb{R}_*^2$ and that our algorithm stops for $j = 1$ and $a_1 = (0, 1)$. The following theorem provides us with the regression of $X_1$ on $X_2$:
Theorem 3.8.
The probability measure of $X_1$ given $X_2$ is the same as the probability measure of $Y_1$ given $Y_2$. Moreover, the regression between $X_1$ and $X_2$ is $X_1 = E(Y_1 / Y_2) + \varepsilon$, where $\varepsilon$ is a centered random variable orthogonal to $E(X_1 / X_2)$.
Remark 3.3.
This theorem implies that $E(X_1 / X_2) = E(Y_1 / Y_2)$. This equation can be used in many fields of research. The Markov chain theory has been used for instance in Example 1.2.
Moreover, if $g$ is a Gaussian density with the same mean and variance as $f$, then Saporta [14] implies that $E(Y_1 / Y_2) = E(Y_1) + \frac{Cov(Y_1, Y_2)}{Var(Y_2)} (Y_2 - E(Y_2))$ and then $X_1 = E(Y_1) + \frac{Cov(Y_1, Y_2)}{Var(Y_2)} (Y_2 - E(Y_2)) + \varepsilon$.

3.6.2. General case

In this paragraph, we will assume that $\Theta = \mathbb{R}_*^d$ and that our algorithm stops at step $j$, with $j < d$. Lemma D.6 implies the existence of an orthogonal and free family $(b_i)_{i=j+1,\ldots,d}$ of $\mathbb{R}_*^d$ such that $\mathbb{R}^d = \mathrm{Vect}\{a_i\} \overset{\perp}{\oplus} \mathrm{Vect}\{b_k\}$ and such that
$$g(b_{j+1}^\top x, \ldots, b_d^\top x / a_1^\top x, \ldots, a_j^\top x) = f(b_{j+1}^\top x, \ldots, b_d^\top x / a_1^\top x, \ldots, a_j^\top x)$$
Hence, the following theorem provides us with the regression of $b_k^\top X$, $k = 1, \ldots, d$, on $(a_1^\top X, \ldots, a_j^\top X)$:
Theorem 3.9.
The probability measure of $(b_{j+1}^\top X, \ldots, b_d^\top X)$ given $(a_1^\top X, \ldots, a_j^\top X)$ is the same as the probability measure of $(b_{j+1}^\top Y, \ldots, b_d^\top Y)$ given $(a_1^\top Y, \ldots, a_j^\top Y)$. Moreover, the regression of $b_k^\top X$, $k = 1, \ldots, d$, on $(a_1^\top X, \ldots, a_j^\top X)$ is $b_k^\top X = E(b_k^\top Y / a_1^\top Y, \ldots, a_j^\top Y) + b_k^\top \varepsilon$, where $\varepsilon$ is a centered random vector such that $b_k^\top \varepsilon$ is orthogonal to $E(b_k^\top X / a_1^\top X, \ldots, a_j^\top X)$.
Corollary 3.4.
If $g$ is a Gaussian density with the same mean and variance as $f$, and if $Cov(X_i, X_j) = 0$ for any $i \neq j$, then the regression of $b_k^\top X$, $k = 1, \ldots, d$, on $(a_1^\top X, \ldots, a_j^\top X)$ is $b_k^\top X = E(b_k^\top Y) + b_k^\top \varepsilon$, where $\varepsilon$ is a centered random vector such that $b_k^\top \varepsilon$ is orthogonal to $E(b_k^\top X / a_1^\top X, \ldots, a_j^\top X)$.

4. Simulations

Let us study five simulations. The first involves a $\chi^2$-divergence, the second a Hellinger distance, the third and the fourth a Cressie–Read divergence (still with $\gamma = 1.25$), and the fifth a Kullback–Leibler divergence.
In each example, our program will follow our algorithm and will aim at creating a sequence of densities $(g^{(j)})$, $j = 1, \ldots, k$, $k < d$, such that $g^{(0)} = g$, $g^{(j)} = g^{(j-1)} f_{a_j} / [g^{(j-1)}]_{a_j}$ and $D_\phi(g^{(k)}, f) = 0$, with $D_\phi$ being a divergence and $a_j = \arg\inf_b D_\phi(g^{(j-1)} f_b / [g^{(j-1)}]_b, f)$, for all $j = 1, \ldots, k$. Moreover, in the second example, we will study the robustness of our method with two outliers. In the third and the fourth examples, defining $(X_0, X_1)$ as a vector with density $f$, we will study the regression of $X_1$ on $X_0$. Finally, in the fifth example, we will perform our goodness-of-fit test for copulas.
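A single update $g^{(1)} = g\, f_{a} / g_{a}$ can be illustrated with closed-form densities. The sketch below borrows the two-dimensional setting of Simulation 4.3, $f(x) = \mathrm{Gumbel}(x_0) \cdot \mathrm{Normal}(x_1)$, and assumes that the Gumbel parameters (-5, 1) are a location and a scale; it takes the direction $a = (1, 0)$ as given rather than searching for it, and all function names are hypothetical. It only checks that replacing the $x_0$-marginal of the Gaussian $g$ by that of $f$ recovers $f$.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def gumbel_pdf(x, loc=-5.0, scale=1.0):
    # Assumption: the parameters (-5, 1) are read as (location, scale).
    z = (x - loc) / scale
    return math.exp(-(z + math.exp(-z))) / scale

# Mean and standard deviation of Gumbel(loc=-5, scale=1).
G_MEAN = -5.0 + EULER_GAMMA
G_SD = math.pi / math.sqrt(6.0)

def f(x0, x1):
    """Target density: independent Gumbel and standard normal components."""
    return gumbel_pdf(x0) * normal_pdf(x1, 0.0, 1.0)

def g(x0, x1):
    """Gaussian density with the same mean and covariance as f."""
    return normal_pdf(x0, G_MEAN, G_SD) * normal_pdf(x1, 0.0, 1.0)

def g1(x0, x1):
    """One step g^(1) = g * f_a / g_a for the direction a = (1, 0)."""
    f_a = gumbel_pdf(x0)                # marginal of f along a
    g_a = normal_pdf(x0, G_MEAN, G_SD)  # marginal of g along a
    return g(x0, x1) * f_a / g_a

for point in [(-5.0, 0.0), (-4.0, 1.0), (-6.5, -0.5)]:
    print(point, f(*point), g1(*point))
```

In the article's procedure the direction is not known in advance: it is estimated by minimising the divergence over $b$, which this sketch does not attempt.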
Simulation 4.1 (With the $\chi^2$ divergence).
We are in dimension $d = 3$, and we consider a sample of $n = 50$ values of a random variable $X$ with a density $f$ defined by
$$f(x) = \mathrm{Gaussian}(x_1 + x_2) \cdot \mathrm{Gaussian}(x_0 + x_2) \cdot \mathrm{Gumbel}(x_0 + x_1)$$
where the Normal law parameters are $(5, 2)$ and $(1, 1)$ and where the Gumbel distribution parameters are 3 and 4. Let us then generate a Gaussian random variable $Y$ with a density, which we will name $g$, having the same mean and variance as $f$.
We theoretically obtain $k = 1$ and $a_1 = (1, 1, 0)$. To get this result, we perform the following test:
$$(H_0): a_1 = (1, 1, 0) \quad \text{versus} \quad (H_1): a_1 \neq (1, 1, 0)$$
Then, Corollary 3.3 enables us to estimate $a_1$ by the following confidence ellipsoid of level $\alpha = 0.9$:
$$\mathcal{E}_1 = \{ b \in \mathbb{R}^3;\ (Var_P(M(b, b)))^{-1/2}\, P_n M(b, b) \le q_\alpha^{N(0,1)} / \sqrt{n} \simeq 0.2533 / 7.0710678 = 0.03582203 \}$$
And, we obtain
Table 1. Simulation 1: Numerical results of the optimisation.

| Our Algorithm | |
|---|---|
| Projection Study 0 | minimum: 0.0201741 |
| | at point: (1.00912, 1.09453, 0.01893) |
| | P-Value: 0.81131 |
| Test | $H_0$: $a_1 \in \mathcal{E}_1$: True |
| $\chi^2$(Kernel Estimation of $g^{(1)}$, $g^{(1)}$) | 6.1726 |

Therefore, we conclude that $f = g^{(1)}$.
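The numerical bound appearing in the confidence ellipsoids is simply $q_\alpha^{N(0,1)} / \sqrt{n}$ with the quantile value 0.2533 quoted in the text. The short sketch below recomputes it for the sample sizes used in the article (50, 100, 500 and 84); the function name `ellipsoid_threshold` is a hypothetical helper, and the quantile is taken as given rather than derived.

```python
import math

Q_ALPHA = 0.2533  # quantile value quoted in the text (taken as given)

def ellipsoid_threshold(n, q=Q_ALPHA):
    """Right-hand side q / sqrt(n) of the confidence ellipsoid for sample size n."""
    return q / math.sqrt(n)

# Sample sizes of Simulations 4.1/4.3/4.5 (50), 4.2 (100), 4.4 (500)
# and of the real dataset (84).
for n in (50, 100, 500, 84):
    print(n, round(ellipsoid_threshold(n), 8))
```

The printed values can be compared with the bounds 0.03582203, 0.02533, 0.01132792 and 0.02763730 reported alongside the ellipsoids.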
Simulation 4.2 (With the Hellinger distance $H$).
We are in dimension $d = 20$. We first generate a sample with $n = 100$ observations, namely two outliers $x = (2, 0, \ldots, 0)$ and 98 values of a random variable $X$ with a density $f$ defined by
$$f(x) = \mathrm{Gumbel}(x_0) \cdot \mathrm{Normal}(x_1, \ldots, x_9)$$
where the Gumbel law parameters are -5 and 1 and where the normal distribution is reduced and centered. Our reasoning is the same as in Simulation 4.1.
In the first part of the program, we theoretically obtain $k = 1$ and $a_1 = (1, 0, \ldots, 0)$. To get this result, we perform the following test
$$(H_0): a_1 = (1, 0, \ldots, 0) \quad \text{versus} \quad (H_1): a_1 \neq (1, 0, \ldots, 0)$$
We estimate $a_1$ by the following confidence ellipsoid of level $\alpha = 0.9$:
$$\mathcal{E}_1 = \{ b \in \mathbb{R}^{20};\ (Var_P(M(b, b)))^{-1/2}\, P_n M(b, b) \le q_\alpha^{N(0,1)} / \sqrt{n} \simeq 0.02533 \}$$
And, we obtain
Table 2. Simulation 2: Numerical results of the optimisation.

| Our Algorithm | |
|---|---|
| Projection Study 0 | minimum: 0.002692 |
| | at point: (1.01326, 0.0657, 0.0628, 0.1011, 0.0509, 0.1083, 0.1261, 0.0573, 0.0377, 0.0794, 0.0906, 0.0356, 0.0012, 0.0292, 0.0737, 0.0934, 0.0286, 0.1057, 0.0697, 0.0771) |
| | P-Value: 0.80554 |
| Test | $H_0$: $a_1 \in \mathcal{E}_1$: True |
| $H$(Est. of $g^{(1)}$, $g^{(1)}$) | 3.042174 |

Therefore, we conclude that $f = g^{(1)}$.
Simulation 4.3 (With the Cressie–Read divergence $D_\phi$).
We are in dimension $d = 2$, and we consider a sample of $n = 50$ values of a random variable $X = (X_0, X_1)$ with a density $f$ defined by
$$f(x) = \mathrm{Gumbel}(x_0) \cdot \mathrm{Normal}(x_1)$$
where the Gumbel law parameters are -5 and 1 and where the normal distribution parameters are $(0, 1)$. Let us then generate a Gaussian random variable $Y$ with a density, which we will name $g$, having the same mean and variance as $f$.
We theoretically obtain $k = 1$ and $a_1 = (1, 0)$. To get this result, we perform the following test
$$(H_0): a_1 = (1, 0) \quad \text{versus} \quad (H_1): a_1 \neq (1, 0)$$
Then, Corollary 3.3 enables us to estimate $a_1$ by the following confidence ellipsoid of level $\alpha = 0.9$:
$$\mathcal{E}_1 = \{ b \in \mathbb{R}^2;\ (Var_P(M(b, b)))^{-1/2}\, P_n M(b, b) \le q_\alpha^{N(0,1)} / \sqrt{n} \}, \quad \text{with } q_\alpha^{N(0,1)} / \sqrt{n} \simeq 0.03582203.$$
And, we obtain
Table 3. Simulation 3: Numerical results of the optimisation.

| Our Algorithm | |
|---|---|
| Projection Study 0 | minimum: 0.0210058 |
| | at point: (1.001, 0.0014) |
| | P-Value: 0.989552 |
| Test | $H_0$: $a_1 \in \mathcal{E}_1$: True |
| $D_\phi$(Kernel Estimation of $g^{(1)}$, $g^{(1)}$) | 6.47617 |

Therefore, we conclude that $f = g^{(1)}$.
Figure 1. Graph of the distribution to estimate (red) and of our own estimate (green).
Figure 2. Graph of the distribution to estimate (red) and of Huber’s estimate (green).
Now, keeping the notations of this simulation, let us study the regression of $X_1$ on $X_0$.
Our algorithm leads us to infer that the density of $X_1$ given $X_0$ is the same as the density of $Y_1$ given $Y_0$. Moreover, Property A.1 implies that the co-factors of $f$ are the same for any divergence. Consequently, Theorem 3.8 implies that $X_1 = E(Y_1 / Y_0) + \varepsilon$, where $\varepsilon$ is a centered random variable orthogonal to $E(X_1 / X_0)$.
Thus, since $g$ is a Gaussian density, Remark 3.3 implies that
$$X_1 = E(Y_1) + \frac{Cov(Y_1, Y_0)}{Var(Y_0)} (Y_0 - E(Y_0)) + \varepsilon$$
Now, using the least squares method, we estimate $\alpha_1$ and $\alpha_2$ such that $X_1 = \alpha_1 + \alpha_2 X_0 + \varepsilon$.
Thus, the following table presents the results of our regression and of the least squares method if we assume that $\varepsilon$ is Gaussian.
Table 4. Simulation 3: Numerical results of the regression.

| Method | Quantity | Value |
|---|---|---|
| Our Regression | $E(Y_1)$ | -4.545483 |
| | $Cov(Y_1, Y_0)$ | 0.0380534 |
| | $Var(Y_0)$ | 0.9190052 |
| | $E(Y_0)$ | 0.3103752 |
| | correlation$(Y_1, Y_0)$ | 0.02158213 |
| Least squares method | $\alpha_1$ | -4.34159227 |
| | Std Error of $\alpha_1$ | 0.19870 |
| | $\alpha_2$ | 0.06803317 |
| | Std Error of $\alpha_2$ | 0.21154 |
| | correlation$(X_1, X_0)$ | 0.04888484 |
Figure 3. Graph of the regression of $X_1$ on $X_0$ based on the least squares method (red) and based on our theory (green).
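The Gaussian regression line of Simulation 4.3 can be written in slope and intercept form directly from the moments reported in Table 4. The sketch below does this arithmetic and prints both fitted lines at a few abscissae; `gaussian_regression_line` is a hypothetical helper name, and the least squares coefficients are copied from the table rather than re-estimated.

```python
def gaussian_regression_line(mean_y1, mean_y0, cov_y1_y0, var_y0):
    """Return (intercept, slope) of E(Y1) + Cov(Y1, Y0) / Var(Y0) * (y0 - E(Y0))."""
    slope = cov_y1_y0 / var_y0
    intercept = mean_y1 - slope * mean_y0
    return intercept, slope

# Moments of the Gaussian vector (Y0, Y1) as reported in Table 4.
intercept, slope = gaussian_regression_line(
    mean_y1=-4.545483, mean_y0=0.3103752,
    cov_y1_y0=0.0380534, var_y0=0.9190052,
)

# Least squares estimates copied from Table 4.
ls_intercept, ls_slope = -4.34159227, 0.06803317

for x0 in (-1.0, 0.0, 1.0):
    ours = intercept + slope * x0
    least_squares = ls_intercept + ls_slope * x0
    print(x0, round(ours, 4), round(least_squares, 4))
```

Both slopes are small, which is consistent with the near-zero correlations listed in the table.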
Simulation 4.4 (With the Cressie–Read divergence $D_\phi$).
We are in dimension $d = 2$, and we consider a sample of $n = 500$ values of a random variable $X = (X_0, X_1)$ with a density $f$ defined by
$$f(x) = \mathrm{Gumbel}(x_1 - x_0) \cdot \mathrm{Normal}(x_1 + x_0)$$
where the Gumbel law parameters are -5 and 1 and where the normal distribution parameters are $(0, 1)$. Let us then generate a Gaussian random variable $Y$ with a density, which we will name $g$, having the same mean and variance as $f$.
We theoretically obtain $k = 1$ and $a_1 = (1, -1)$. To get this result, we perform the following test: $(H_0): a_1 = (1, -1)$ versus $(H_1): a_1 \neq (1, -1)$. Then, Corollary 3.3 enables us to estimate $a_1$ by the following confidence ellipsoid of level $\alpha = 0.9$:
$$\mathcal{E}_1 = \{ b \in \mathbb{R}^2;\ (Var_P(M(b, b)))^{-1/2}\, P_n M(b, b) \le q_\alpha^{N(0,1)} / \sqrt{n} \simeq 0.2533 / \sqrt{500} = 0.01132792 \}$$
And, we obtain
Table 5. Simulation 4: Numerical results of the optimisation.

| Our Algorithm | |
|---|---|
| Projection Study 0 | minimum: 0.010920 |
| | at point: (1.09, -0.9701) |
| | P-Value: 0.889400 |
| Test | $H_0$: $a_1 \in \mathcal{E}_1$: True |
| $D_\phi$(Kernel Estimation of $g^{(1)}$, $g^{(1)}$) | 5.25077 |

Therefore, we conclude that $f = g^{(1)}$.
Now, keeping the notations of this simulation, let us study the regression of $X_1 + X_0$ on $X_1 - X_0$. Our algorithm leads us to infer that the density of $X_1 + X_0$ given $X_1 - X_0$ is the same as the density of $Y_1 + Y_0$ given $Y_1 - Y_0$. Moreover, Property A.1 implies that the co-factors of $f$ are the same for any divergence. Consequently, putting $U = X_1 + X_0$, $V = X_1 - X_0$, $U' = Y_1 + Y_0$ and $V' = Y_1 - Y_0$, and since $\{(1, 1), (1, -1)\}$ is an orthogonal basis, we can infer from Theorem 3.8 that $U = E(U' / V') + \varepsilon$, where $\varepsilon$ is a centered random variable orthogonal to $E(U / V)$.
Thus, since $g$ is a Gaussian density, Remark 3.3 implies that
$$U = E(U') + \frac{Cov(U', V')}{Var(V')} (V' - E(V')) + \varepsilon$$
In other words, we apply the same reasoning as the one used in the regression study of Simulation 4.3 to $(U, V)$ instead of $(X_1, X_0)$. This is possible since $\{(1, 1), (1, -1)\}$ is an orthogonal basis of $\mathbb{R}^2$, i.e., we implement a change of basis from the canonical basis of $\mathbb{R}^2$ to $\{(1, 1), (1, -1)\}$.
Thus, in the canonical basis, $U = E(U' / V') + \varepsilon$ becomes $X_1 + X_0 = E(Y_1 + Y_0 / Y_1 - Y_0) + \varepsilon$, i.e., we obtain that
$$X_1 + X_0 = E(Y_1 + Y_0) + \frac{Cov(Y_1 + Y_0, Y_1 - Y_0)}{Var(Y_1 - Y_0)} (Y_1 - Y_0 - E(Y_1 - Y_0)) + \varepsilon$$
where $\varepsilon$ is a centered random variable orthogonal to $E(X_1 + X_0 / X_1 - X_0)$.
The following table presents the results of our regression.
We simulate the regression 10 times and we obtain $a$ and $b$ such that $X_1 = a + b X_0 + \varepsilon$:
Table 6. Simulation 4: Numerical results of the regression.

| Simulation | $a$ | Std Error of $a$ | $b$ | Std Error of $b$ |
|---|---|---|---|---|
| 1 | -4.83739 | 0.11149 | -0.95861 | 0.04677 |
| 2 | -4.56895 | 0.09989 | -0.88577 | 0.04225 |
| 3 | -4.4926 | 0.1057 | -1.2085 | 0.0452 |
| 4 | -4.70619 | 0.10350 | -1.04549 | 0.04235 |
| 5 | -4.40331 | 0.10248 | -1.00890 | 0.0438 |
| 6 | -4.61757 | 0.09813 | -1.20890 | 0.04649 |
| 7 | -4.40572 | 0.09172 | -1.16085 | 0.04091 |
| 8 | -4.39581 | 0.10174 | -1.38696 | 0.04487 |
| 9 | -4.42780 | 0.10018 | -0.93672 | 0.04066 |
| 10 | -4.55394 | 0.09923 | -0.98065 | 0.04382 |
Figure 4. Graph of the regression of $X_1$ on $X_0$ based on our theory (green).
Simulation 4.5 (With the Kullback–Leibler divergence $K$).
We are in dimension $d = 2$, and we use the Kullback–Leibler divergence to perform our optimisations. Let us consider a sample of $n = 50$ values of a random variable $X$ with a density $f$ defined by:
$$f(x) = c_\rho(F_{Gumbel}(x_0), F_{Exponential}(x_1)) \cdot \mathrm{Gumbel}(x_0) \cdot \mathrm{Exponential}(x_1)$$
where:
  • $c_\rho$ is the Gaussian copula with correlation coefficient $\rho = 0.5$,
  • the Gumbel distribution parameters are 1 and 1 and
  • the Exponential density parameter is 2.
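The density of this simulation can be evaluated pointwise with the standard library alone. The sketch below is illustrative and rests on two labelled assumptions: the Gumbel parameters (1, 1) are read as location and scale, and the Exponential parameter 2 is read as a rate. The function names are hypothetical; the copula density is the usual ratio of the bivariate normal density to the product of its standard normal marginals.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal law, used for its quantile function
RHO = 0.5                  # correlation coefficient quoted in the text

def gaussian_copula_density(u, v, rho=RHO):
    """Density c_rho(u, v) of the bivariate Gaussian copula, for 0 < u, v < 1."""
    a = STD_NORMAL.inv_cdf(u)
    b = STD_NORMAL.inv_cdf(v)
    r2 = rho * rho
    exponent = (2.0 * rho * a * b - r2 * (a * a + b * b)) / (2.0 * (1.0 - r2))
    return math.exp(exponent) / math.sqrt(1.0 - r2)

def gumbel_pdf(x, loc=1.0, scale=1.0):
    # Assumption: parameters (1, 1) are (location, scale).
    z = (x - loc) / scale
    return math.exp(-(z + math.exp(-z))) / scale

def gumbel_cdf(x, loc=1.0, scale=1.0):
    return math.exp(-math.exp(-(x - loc) / scale))

def exponential_pdf(x, rate=2.0):
    # Assumption: the parameter 2 is a rate.
    return rate * math.exp(-rate * x) if x >= 0.0 else 0.0

def exponential_cdf(x, rate=2.0):
    return 1.0 - math.exp(-rate * x) if x >= 0.0 else 0.0

def f(x0, x1):
    """Joint density; needs x1 > 0 so that both CDF values lie strictly in (0, 1)."""
    u = gumbel_cdf(x0)
    v = exponential_cdf(x1)
    return gaussian_copula_density(u, v) * gumbel_pdf(x0) * exponential_pdf(x1)

print(f(1.0, 0.3))
```

With `rho=0.0` the copula density is identically 1 and `f` reduces to the product of the two marginal densities, which is a quick sanity check of the construction.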
Let us then generate a Gaussian random variable $Y$ with a density, which we will name $g$, having the same mean and variance as $f$. We theoretically obtain $k = 2$ and $(a_1, a_2) = ((1, 0), (0, 1))$. To get this result, we perform the following test
$$(H_0): (a_1, a_2) = ((1, 0), (0, 1)) \quad \text{versus} \quad (H_1): (a_1, a_2) \neq ((1, 0), (0, 1))$$
Then, Theorem 3.6 enables us to verify $(H_0)$ by the following confidence ellipsoid of level $\alpha = 0.9$:
$$\mathcal{E}_2 = \{ b \in \mathbb{R}^2;\ (Var_P(M(b, b)))^{-1/2}\, P_n M(b, b) \le q_\alpha^{N(0,1)} / \sqrt{n} \simeq 0.2533 / 7.0710678 = 0.0358220 \}$$
And, we obtain
Table 7. Simulation 5: Numerical results of the optimisation.

| Our Algorithm | |
|---|---|
| Projection Study number 0 | minimum: 0.445199 |
| | at point: (1.0142, 0.0026) |
| | P-Value: 0.94579 |
| Test | $H_0$: $a_1 \in \mathcal{E}_1$: True |
| Projection Study number 1 | minimum: 0.0263 |
| | at point: (0.0084, 0.9006) |
| | P-Value: 0.97101 |
| Test | $H_0$: $a_2 \in \mathcal{E}_2$: True |
| $K$(Kernel Estimation of $g^{(2)}$, $g^{(2)}$) | 4.0680 |

Therefore, we can conclude that $(H_0)$ is verified.
Figure 5. Graph of the estimate of $(x_0, x_1) \mapsto c_\rho(F_{Gumbel}(x_0), F_{Exponential}(x_1))$.

Application to real datasets

Let us now apply our theory to real datasets.
Let us for instance study the moves in the stock prices of Nokia and Sanofi from January 11, 2010 to May 10, 2010. We thus gather $n = 84$ data points from these stock prices (see the data below).
Let us also consider $X_1$ (resp. $X_2$), the random variable defining the stock price of Nokia (resp. Sanofi). We will assume, as is commonly done in mathematical finance, that the stock market abides by the classical hypotheses of the Black–Scholes model (see [13]).
Consequently, $X_1$ and $X_2$ each follow a log-normal distribution. Let $f$ be the density of the vector $(\ln(X_1), X_2)$, and let us now apply our algorithm to $f$ with the Kullback–Leibler divergence as ϕ-divergence. Let us then generate a Gaussian random variable $Y$ with a density, which we will name $g$, having the same mean and variance as $f$.
We first assume that there exists a vector $a$ such that $D_\phi(g \frac{f_a}{g_a}, f) = 0$.
In order to verify this hypothesis, our reasoning will be the same as in Simulation 4.1. Indeed, we assume that this vector is a co-factor of $f$. Consequently, Corollary 3.3 enables us to estimate $a$ by the following confidence ellipsoid of level $\alpha = 0.9$:
$$\mathcal{E}_1 = \{ b \in \mathbb{R}^2;\ (Var_P(M(b, b)))^{-1/2}\, P_n M(b, b) \le q_\alpha^{N(0,1)} / \sqrt{n} \simeq 0.2533 / \sqrt{84} = 0.02763730 \}$$
And, we obtain
Table 8. Numerical results of the optimisation.

| Our Algorithm | |
|---|---|
| Projection Study 0 | minimum: 0.017345 |
| | at point: (0.027, 3.18) |
| | P-Value: 0.890210 |
| Test | $H_0$: $a_1 \in \mathcal{E}_1$: True |
| $K$(Kernel Estimation of $g^{(1)}$, $g^{(1)}$) | 2.7704005 |

Therefore, we conclude that $f = g^{(1)}$, i.e., our hypothesis is confirmed.
Consequently, as explained in Simulations 4.3 and 4.4, we can say that
$$\log(X_1) = 0.027 \cdot X_2 + 3.18 + \varepsilon$$
where $\varepsilon$ is a centered random variable orthogonal to $E(\log(X_1) / X_2)$.
Finally, using the least squares method, we estimate $\alpha_1$ and $\alpha_2$ such that $\log(X_1) = \alpha_1 + \alpha_2 X_2 + \varepsilon$. Thus, the following table presents the results of the least squares method if we assume that $\varepsilon$ is Gaussian:
Table 9. Numerical results of the regression.

| Simulation | $\alpha_1$ | Std Error of $\alpha_1$ | $\alpha_2$ | Std Error of $\alpha_2$ |
|---|---|---|---|---|
| 1 | 3.153694 | 0.230380 | 0.026578 | 0.004236 |
Figure 6. Graph of the regression of log of Nokia on Sanofi based on the least squares method (red) and based on our theory (green).
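Table 10. Stock prices of Nokia and Sanofi.

| Date | Nokia | Log-of-Nokia | Sanofi | Date | Nokia | Log-of-Nokia | Sanofi |
|---|---|---|---|---|---|---|---|
| 10/05/10 | 84.75 | 4.44 | 51.62 | 07/05/10 | 81.85 | 4.4 | 48.5 |
| 06/05/10 | 87.3 | 4.47 | 50.35 | 05/05/10 | 87.75 | 4.47 | 50.95 |
| 04/05/10 | 87.25 | 4.47 | 50.49 | 03/05/10 | 87.85 | 4.48 | 51.51 |
| 30/04/10 | 87.8 | 4.48 | 51.66 | 29/04/10 | 87.85 | 4.48 | 51.41 |
| 28/04/10 | 87.85 | 4.48 | 51.88 | 27/04/10 | 89 | 4.49 | 52.11 |
| 26/04/10 | 89.2 | 4.49 | 54.09 | 23/04/10 | 90.7 | 4.51 | 53.47 |
| 22/04/10 | 92.75 | 4.53 | 53.59 | 21/04/10 | 108.4 | 4.69 | 53.95 |
| 20/04/10 | 108.9 | 4.69 | 54.43 | 19/04/10 | 108.3 | 4.68 | 54.05 |
| 16/04/10 | 106.8 | 4.67 | 54.04 | 15/04/10 | 109.9 | 4.7 | 54.95 |
| 14/04/10 | 109.8 | 4.7 | 54.86 | 13/04/10 | 108.3 | 4.68 | 54.67 |
| 12/04/10 | 109.1 | 4.69 | 55.27 | 09/04/10 | 110.1 | 4.7 | 55.41 |
| 08/04/10 | 110.7 | 4.71 | 54.96 | 07/04/10 | 113.2 | 4.73 | 55.3 |
| 06/04/10 | 112.4 | 4.72 | 54.64 | 01/04/10 | 113.3 | 4.73 | 55.16 |
| 31/03/10 | 112.4 | 4.72 | 55.19 | 30/03/10 | 112.5 | 4.72 | 55.39 |
| 29/03/10 | 111.8 | 4.72 | 55.49 | 26/03/10 | 112.5 | 4.72 | 55.72 |
| 25/03/10 | 111.4 | 4.71 | 56.33 | 24/03/10 | 110.2 | 4.7 | 55.95 |
| 23/03/10 | 109.1 | 4.69 | 56.12 | 22/03/10 | 109.2 | 4.69 | 56.33 |
| 19/03/10 | 108.5 | 4.69 | 56.57 | 18/03/10 | 108.4 | 4.69 | 56.56 |
| 17/03/10 | 109.9 | 4.7 | 56.28 | 16/03/10 | 107 | 4.67 | 57.21 |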
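Table 11. Stock prices of Nokia and Sanofi.

| Date | Nokia | Log-of-Nokia | Sanofi | Date | Nokia | Log-of-Nokia | Sanofi |
|---|---|---|---|---|---|---|---|
| 15/03/10 | 105.3 | 4.66 | 55.95 | 12/03/10 | 105 | 4.65 | 55.4 |
| 11/03/10 | 103 | 4.63 | 55.65 | 10/03/10 | 104 | 4.64 | 56.13 |
| 09/03/10 | 101.5 | 4.62 | 56.17 | 08/03/10 | 100.7 | 4.61 | 55.75 |
| 05/03/10 | 100.2 | 4.61 | 55.76 | 04/03/10 | 98.7 | 4.59 | 54.81 |
| 03/03/10 | 99.8 | 4.6 | 55.14 | 02/03/10 | 97.25 | 4.58 | 54.99 |
| 01/03/10 | 95.85 | 4.56 | 54.82 | 26/02/10 | 95.85 | 4.56 | 53.72 |
| 25/02/10 | 94.55 | 4.55 | 52.92 | 24/02/10 | 96.3 | 4.57 | 53.92 |
| 23/02/10 | 96.2 | 4.57 | 54.05 | 22/02/10 | 96.7 | 4.57 | 54.14 |
| 19/02/10 | 97.3 | 4.58 | 54.71 | 18/02/10 | 96.6 | 4.57 | 54.43 |
| 17/02/10 | 96.1 | 4.57 | 53.88 | 16/02/10 | 94.95 | 4.55 | 53.56 |
| 15/02/10 | 93.65 | 4.54 | 53.2 | 12/02/10 | 93.55 | 4.54 | 53.01 |
| 11/02/10 | 94.6 | 4.55 | 52.52 | 10/02/10 | 95.55 | 4.56 | 52.2 |
| 09/02/10 | 98.4 | 4.59 | 52.66 | 08/02/10 | 99.2 | 4.6 | 52.98 |
| 05/02/10 | 99.8 | 4.6 | 51.68 | 04/02/10 | 102.6 | 4.63 | 53.42 |
| 03/02/10 | 103.9 | 4.64 | 54.06 | 02/02/10 | 103.8 | 4.64 | 53.8 |
| 01/02/10 | 102.4 | 4.63 | 53.23 | 29/01/10 | 103.6 | 4.64 | 53.6 |
| 28/01/10 | 101.8 | 4.62 | 52.68 | 27/01/10 | 92.55 | 4.53 | 53.8 |
| 26/01/10 | 92.7 | 4.53 | 54.42 | 25/01/10 | 91.9 | 4.52 | 53.66 |
| 22/01/10 | 94.1 | 4.54 | 54.65 | 21/01/10 | 93.7 | 4.54 | 55.28 |
| 20/01/10 | 92.75 | 4.53 | 56.67 | 19/01/10 | 93.6 | 4.54 | 57.69 |
| 18/01/10 | 94.55 | 4.55 | 56.67 | 15/01/10 | 93.55 | 4.54 | 56.85 |
| 14/01/10 | 93.7 | 4.54 | 56.91 | 13/01/10 | 92.5 | 4.53 | 56.18 |
| 12/01/10 | 92.35 | 4.53 | 55.83 | 11/01/10 | 93 | 4.53 | 56.08 |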

5. Critics of the Simulations

In the case where $f$ is unknown, we will never be sure to have reached the minimum of the ϕ-divergence: we have indeed used the simulated annealing method to solve our optimisation problem, and therefore it is only when the number of random jumps tends in theory towards infinity that the probability of reaching the minimum tends to 1. We also note that no theory on the optimal number of jumps exists, as this number depends on the specificities of each particular problem. Moreover, we choose $50^{-\frac{4}{4+d}}$ (resp. $500^{-\frac{4}{4+d}}$ and $100^{-\frac{4}{4+d}}$) for the AMISE of Simulations 4.1, 4.2 and 4.3 (resp. Simulations 4.4 and 4.5). This choice leads us to simulate 50 (resp. 500 and 100) random variables (see Scott [15], page 151), none of which have been discarded to obtain the truncated sample. This has also been the case in our application to real datasets.
Finally, we remark that our method has two key advantages over Huber's: since there exist divergences smaller than the Kullback–Leibler divergence, our method requires a considerably shorter computation time, and it is also more robust.

6. Conclusions

Projection Pursuit is useful in evidencing characteristic structures as well as one-dimensional projections and their associated distributions in multivariate data. Huber [2] shows us how to achieve it through maximization of the Kullback–Leibler divergence.
The present article shows that our ϕ-divergence method constitutes a good alternative to Huber's, particularly in terms of regression and robustness as well as in terms of the study of copulas. Indeed, the convergence results and the simulations we carried out convincingly fulfilled our expectations regarding our methodology.

References

  1. Friedman, J.H.; Stuetzle, W.; Schroeder, A. Projection pursuit density estimation. J. Amer. Statist. Assoc. 1984, 79, 599–608.
  2. Huber, P.J. Projection pursuit. Ann. Statist. 1985, 13, 435–525, With discussion.
  3. Zhu, M. On the forward and backward algorithms of projection pursuit. Ann. Statist. 2004, 32, 233–244.
  4. Yohai, V.J. Optimal robust estimates using the Kullback-Leibler divergence. Stat. Probab. Lett. 2008, 78, 1811–1816.
  5. Toma, A. Optimal robust M-estimators using divergences. Stat. Probab. Lett. 2009, 79, 1–5.
  6. Huber, P.J. Robust Statistics; Wiley: Hoboken, NJ, USA, 1981; republished in paperback, 2004.
  7. Diaconis, P.; Freedman, D. Asymptotics of graphical projection pursuit. Ann. Statist. 1984, 12, 793–815.
  8. Cambanis, S.; Huang, S.; Simons, G. On the theory of elliptically contoured distributions. J. Multivariate Anal. 1981, 11, 368–385.
  9. Landsman, Z.M.; Valdez, E.A. Tail conditional expectations for elliptical distributions. N. Am. Actuar. J. 2003, 7, 55–71.
  10. Van der Vaart, A.W. Asymptotic Statistics. In Cambridge Series in Statistical and Probabilistic Mathematics; Cambridge University Press: Cambridge, UK, 1998; Volume 3.
  11. Broniatowski, M.; Keziou, A. Parametric estimation and tests through divergences and the duality technique. J. Multivariate Anal. 2009, 100, 16–36.
  12. Vajda, I. χα-divergence and generalized Fisher’s information. In Transactions of the Sixth Prague Conference on Information Theory, Statistical Decision Functions, Random Processes; Czech Technical University in Prague: Prague, Czech, 1971; dedicated to the memory of Antonín Spacek; Academia: Prague, Czech; pp. 873–886.
  13. Black, F.; Scholes, M.S. The pricing of options and corporate liabilities. J. Polit. Econ. 1973, 81, 637–654.
  14. Saporta, G. Probabilités, Analyse des données et Statistique; Technip: Paris, France, 2006.
  15. Scott, D.W. Multivariate Density Estimation. Theory, Practice, and Visualization; John Wiley and Sons: New York, NY, USA, 1992.
  16. Cressie, N.; Read, T.R.C. Multinomial goodness-of-fit tests. J. Roy. Statist. Soc. 1984, Ser. B 46, 440–464.
  17. Csiszár, I. On topology properties of f-divergences. Studia Sci. Math. Hungar. 1967, 2, 329–339.
  18. Liese, F.; Vajda, I. Convex Statistical Distances. In Teubner-Texte zur Mathematik [Teubner Texts in Mathematics]; B.G. Teubner Verlagsgesellschaft: Leipzig, Germany, 1987; Volume 95.
  19. Pardo, L. Statistical inference based on divergence measures. In Statistics: Textbooks and Monographs; Chapman & Hall/CRC: Boca Raton, FL, USA, 2006; Volume 185.
  20. Zografos, K.; Ferentinos, K.; Papaioannou, T. ϕ-divergence statistics: sampling properties and multinomial goodness of fit and divergence tests. Comm. Statist. Theory Methods 1990, 19, 1785–1802.
  21. Azé, D. Eléments d’analyse convexe et variationnelle; Ellipse: Minneapolis, MN, USA, 1997.
  22. Touboul, J. Projection pursuit through φ-divergence minimisation. arXiv:0912.2883, 2009.
  23. Bosq, D.; Lecoutre, J.-P. Livre—Theorie De L’Estimation Fonctionnelle; Economica: Hoboken, NJ, USA, 1999.

Appendix

A. Reminders

A.1. φ-Divergence

Let us call $h_a$ the density of $a^\top Z$ if $h$ is the density of $Z$. Let $\varphi$ be a strictly convex function defined by $\varphi : \overline{\mathbb{R}^+} \to \overline{\mathbb{R}^+}$, and such that $\varphi(1) = 0$.
Definition A.1. We define the ϕ-divergence of $P$ from $Q$, where $P$ and $Q$ are two probability distributions over a space $\Omega$ such that $Q$ is absolutely continuous with respect to $P$, by
$$D_\phi(Q, P) = \int \varphi\left(\frac{dQ}{dP}\right) dP \qquad \text{(A.1)}$$
The above expression (A.1) is also valid if $P$ and $Q$ are both dominated by the same probability.
The most used distances (Kullback, Hellinger or $\chi^2$) belong to the Cressie–Read family (see Cressie [16], Csiszár [17] and the books of Liese [18], Pardo [19] and Zografos [20]). They are defined by a specific $\varphi$. Indeed,
  • with the Kullback–Leibler divergence, we associate $\varphi(x) = x \ln(x) - x + 1$,
  • with the Hellinger distance, we associate $\varphi(x) = 2(\sqrt{x} - 1)^2$,
  • with the $\chi^2$ distance, we associate $\varphi(x) = \frac{1}{2}(x - 1)^2$,
  • more generally, with power divergences, we associate $\varphi(x) = \frac{x^\gamma - \gamma x + \gamma - 1}{\gamma(\gamma - 1)}$, where $\gamma \in \mathbb{R} \setminus \{0, 1\}$,
  • and, finally, with the $L^1$ norm, which is also a divergence, we associate $\varphi(x) = |x - 1|$.
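The generators listed above can be tried out on a toy case. The sketch below is a minimal illustration under the assumption of two discrete distributions with strictly positive weights (the article works with densities), for which (A.1) becomes $D_\phi(Q, P) = \sum_i p_i\, \varphi(q_i / p_i)$. All function names and the example weights are hypothetical.

```python
import math

def phi_kl(x):
    """Kullback-Leibler generator: x ln(x) - x + 1."""
    return x * math.log(x) - x + 1.0

def phi_hellinger(x):
    """Hellinger generator: 2 (sqrt(x) - 1)^2."""
    return 2.0 * (math.sqrt(x) - 1.0) ** 2

def phi_chi2(x):
    """Chi-square generator: (x - 1)^2 / 2."""
    return 0.5 * (x - 1.0) ** 2

def phi_power(x, gamma):
    """Power-divergence generator, for gamma different from 0 and 1."""
    return (x ** gamma - gamma * x + gamma - 1.0) / (gamma * (gamma - 1.0))

def phi_l1(x):
    """L1 generator: |x - 1|."""
    return abs(x - 1.0)

def divergence(phi, q, p):
    """Discrete analogue of (A.1): sum of p_i * phi(q_i / p_i), all weights > 0."""
    return sum(pi * phi(qi / pi) for qi, pi in zip(q, p))

# Assumed example weights.
p = [0.3, 0.4, 0.3]
q = [0.2, 0.5, 0.3]

print("KL        :", divergence(phi_kl, q, p))
print("Hellinger :", divergence(phi_hellinger, q, p))
print("chi2      :", divergence(phi_chi2, q, p))
print("power 1.25:", divergence(lambda x: phi_power(x, 1.25), q, p))
print("L1        :", divergence(phi_l1, q, p))
```

Each generator vanishes at 1, so every divergence of a distribution from itself is zero, and `phi_power` with `gamma=2` coincides with `phi_chi2`.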
Let us now present some well-known properties of divergences.
Property A.1. We have $D_\phi(P, Q) = 0 \Leftrightarrow P = Q$.
Property A.2. The divergence function $Q \mapsto D_\phi(Q, P)$ is convex and lower semi-continuous (l.s.c.) for the topology that makes all the applications of the form $Q \mapsto \int f\, dQ$ continuous, where $f$ is bounded and continuous, as well as l.s.c. for the topology of the uniform convergence.
Property A.3 (Corollary (1.29), page 19 of Liese [18]). If $T : (X, A) \to (Y, B)$ is measurable and if $D_\phi(P, Q) < \infty$, then $D_\phi(P, Q) \ge D_\phi(P T^{-1}, Q T^{-1})$, with equality being reached when $T$ is surjective for $(P, Q)$.
Theorem A.1 (Theorem III.4 of Azé [21]). Let $f : I \to \mathbb{R}$ be a convex function. Then $f$ is a Lipschitz function in all compact intervals $[a, b] \subset \mathrm{int}\{I\}$. In particular, $f$ is continuous on $\mathrm{int}\{I\}$.

A.2. Miscellaneous

In the present section, all demonstrations can be found in Touboul [22].
Lemma A.1. The set $\Gamma_c$ is closed in $L^1$ for the topology of the uniform convergence.
Lemma A.2. For all $c > 0$, we have $\Gamma_c \subset \overline{B}_{L^1}(f, c)$, where $B_{L^1}(f, c) = \{ p \in L^1;\ \| f - p \|_1 \le c \}$.
Lemma A.3. $G$ is closed in $L^1$ for the topology of the uniform convergence.
Lemma A.4. Let us consider the sequence $(a_i)$ defined in (2.3), page 1587.
We then have $\lim_{n} \lim_{k} K\left(\check{g}_n^{(k)} \frac{f_{a_k, n}}{[\check{g}^{(k)}]_{a_k, n}}, f_n\right) = 0$ a.s.
In the case where f is known and keeping the notations introduced in Section 3.1, we have
Proposition A.1. Assuming ( H 1 ) to ( H 3 ) hold. Both sup a Θ c ˇ n ( a ) a k and γ ˇ n tends to a k a.s.
Theorem A.2. Assuming that $(H_0)$ to $(H_3)$ hold, for any $k = 1, \ldots, d$ and any $x \in \mathbb{R}^d$, we have $|\check{g}^{(k)}(x) - g^{(k)}(x)| = O_P(n^{-1/2})$ and $\int |\check{g}^{(k)}(x) - g^{(k)}(x)|\, dx = O_P(n^{-1/2})$, as well as $|K(\check{g}^{(k)}, f) - K(g^{(k)}, f)| = O_P(n^{-1/2})$.
Theorem A.3. Assume that $(H_1)$ to $(H_3)$, $(H_6)$ and $(H_8)$ hold. Then, $\sqrt{n}\, (Var_P(M(\check{c}_n(\check{\gamma}_n), \check{\gamma}_n)))^{-1/2} (P_n M(\check{c}_n(\check{\gamma}_n), \check{\gamma}_n) - P_n M(a_k, a_k)) \xrightarrow{\text{Law}} N(0, I)$, where $k$ represents the $k$th step of the algorithm and with $I$ being the identity matrix in $\mathbb{R}^d$.

B. Study of the sample

Let $X_1, X_2, \ldots, X_m$ be a sequence of independent random vectors with the same density $f$. Let $Y_1, Y_2, \ldots, Y_m$ be a sequence of independent random vectors with the same density $g$. Then, the kernel estimators $f_m$, $g_m$, $f_{a,m}$ and $g_{a,m}$ of $f$, $g$, $f_a$ and $g_a$, for all $a \in \mathbb{R}_*^d$, converge almost surely and uniformly, since we assume that the bandwidth $h_m$ of these estimators meets the following conditions (see Bosq [23]), with $L(u) = \ln(u \vee e)$:
$$(Hyp): \quad h_m \underset{m}{\to} 0, \quad m h_m \underset{m}{\to} \infty, \quad m h_m / L(h_m^{-1}) \underset{m}{\to} \infty \quad \text{and} \quad L(h_m^{-1}) / LLm \underset{m}{\to} \infty.$$
Let us consider
$$B_1(n, a) = \frac{1}{n} \sum_{i=1}^{n} \varphi'\left\{ \frac{f_{a,n}(a^\top Y_i)}{g_{a,n}(a^\top Y_i)} \frac{g_n(Y_i)}{f_n(Y_i)} \right\} \frac{f_{a,n}(a^\top Y_i)}{g_{a,n}(a^\top Y_i)} \quad \text{and} \quad B_2(n, a) = \frac{1}{n} \sum_{i=1}^{n} \varphi^*\left\{ \varphi'\left\{ \frac{f_{a,n}(a^\top X_i)}{g_{a,n}(a^\top X_i)} \frac{g_n(X_i)}{f_n(X_i)} \right\} \right\}.$$
Our goal is to estimate the minimum of $D_\phi(g \frac{f_a}{g_a}, f)$. To do this, it is necessary for us to truncate our samples:
Let us now consider a positive sequence $\theta_m$ such that $\theta_m \to 0$ and $y_m / \theta_m^2 \to 0$, where $y_m$ is the almost sure convergence rate of the kernel density estimator ($y_m = O_P(m^{-\frac{2}{4+d}})$, see Lemma D.7); such that $y_m^{(1)} / \theta_m^2 \to 0$, where $y_m^{(1)}$ is defined by $\left| \varphi\left( \frac{g_m(x)}{f_m(x)} \frac{f_{b,m}(b^\top x)}{g_{b,m}(b^\top x)} \right) - \varphi\left( \frac{g(x)}{f(x)} \frac{f_b(b^\top x)}{g_b(b^\top x)} \right) \right| \le y_m^{(1)}$, for all $b$ in $\mathbb{R}_*^d$ and all $x$ in $\mathbb{R}^d$; and finally such that $y_m^{(2)} / \theta_m^2 \to 0$, where $y_m^{(2)}$ is defined by $\left| \varphi'\left( \frac{g_m(x)}{f_m(x)} \frac{f_{b,m}(b^\top x)}{g_{b,m}(b^\top x)} \right) - \varphi'\left( \frac{g(x)}{f(x)} \frac{f_b(b^\top x)}{g_b(b^\top x)} \right) \right| \le y_m^{(2)}$, for all $b$ in $\mathbb{R}_*^d$ and all $x$ in $\mathbb{R}^d$.
We will generate $f_m$, $g_m$ and $g_{b,m}$ from the starting sample and we will select the $X_i$ and $Y_i$ vectors such that $f_m(X_i) \ge \theta_m$ and $g_{b,m}(b^\top Y_i) \ge \theta_m$, for all $i$ and for all $b \in \mathbb{R}_*^d$.
The vectors meeting these conditions will be called $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_n$.
Consequently, the next proposition provides us with the condition required for us to derive our estimations.
Proposition B.1. Using the notations introduced in Broniatowski [11] and in Section 3.1, it holds that $\lim_{n \to \infty} \sup_{a \in \mathbb{R}_*^d} \left| (B_1(n, a) - B_2(n, a)) - D_\phi(g \frac{f_a}{g_a}, f) \right| = 0$.
Remark B.1. With the Kullback–Leibler divergence, we can take for $\theta_m$ the expression $m^{-\nu}$, with $0 < \nu < \frac{1}{4+d}$.
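The truncation step can be sketched in one dimension as follows. This is an illustration under several assumptions that are not in the article: a Gaussian kernel, a bandwidth of the form $\hat\sigma\, m^{-1/(4+d)}$, a simulated $N(0, 0.2^2)$ sample, and only the condition $f_m(X_i) \ge \theta_m$ (the analogous condition on $g_{b,m}$ is omitted). The helper names `gaussian_kde` and `truncate` are hypothetical.

```python
import math
import random
import statistics

def gaussian_kde(sample, bandwidth):
    """Return a Gaussian-kernel density estimate built from a 1-D sample."""
    m = len(sample)
    norm = 1.0 / (m * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample)
    return density

def truncate(sample, nu, d=1):
    """Keep the points X_i with f_m(X_i) >= theta_m, where theta_m = m ** (-nu)."""
    m = len(sample)
    if not 0.0 < nu < 1.0 / (4 + d):
        raise ValueError("nu must lie in (0, 1 / (4 + d))")
    # Assumed bandwidth choice: sample standard deviation times m^(-1/(4+d)).
    bandwidth = statistics.stdev(sample) * m ** (-1.0 / (4 + d))
    f_m = gaussian_kde(sample, bandwidth)
    theta_m = m ** (-nu)
    kept = [x for x in sample if f_m(x) >= theta_m]
    return kept, theta_m, f_m

random.seed(0)
sample = [random.gauss(0.0, 0.2) for _ in range(300)]  # assumed toy data
kept, theta_m, f_m = truncate(sample, nu=0.15)
print(len(sample), len(kept), theta_m)
```

Because $m^{-\nu}$ decays slowly for $\nu < 1/(4+d)$, the threshold is still sizeable at moderate sample sizes, which is why the toy data here are deliberately concentrated.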

C. Hypotheses’ discussion

C.1. Discussion of $(H2)$.

Let us work with the Kullback–Leibler divergence and with $g$ and $a_1$.
For all $b \in \mathbb{R}^d_*$, we have $\int \varphi^*\Big(\varphi'\Big(\frac{g(x) f_b(b^\top x)}{f(x) g_b(b^\top x)}\Big)\Big) f(x)\,dx = \int \Big(\frac{g(x) f_b(b^\top x)}{f(x) g_b(b^\top x)} - 1\Big) f(x)\,dx = 0$, since, for any $b$ in $\mathbb{R}^d_*$, the function $x \mapsto g(x)\frac{f_b(b^\top x)}{g_b(b^\top x)}$ is a density. The complement of $\Theta^{D_\phi}$ in $\mathbb{R}^d_*$ is $\emptyset$, and hence the supremum looked for in $\overline{\mathbb{R}}$ is $-\infty$. We can therefore conclude. Note that the same verification holds with $f$, $g^{(k-1)}$ and $a_k$.
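The simplification used in this verification rests on the identity $\varphi^*(\varphi'(x)) = x - 1$ for the Kullback–Leibler case. The sketch below checks it numerically, assuming the usual generator $\varphi(x) = x\ln x - x + 1$ (an assumption, since the definition is given elsewhere in the article) and its convex conjugate $\varphi^*(t) = e^t - 1$; the grid search `phi_star_numeric` is only a sanity check and its bounds are arbitrary.

```python
import math

def phi(x):
    """Assumed KL generator: phi(x) = x ln x - x + 1, for x > 0."""
    return x * math.log(x) - x + 1.0

def phi_prime(x):
    """Derivative of phi."""
    return math.log(x)

def phi_star(t):
    """Closed form of the convex conjugate sup_x (t x - phi(x)) = exp(t) - 1."""
    return math.exp(t) - 1.0

def phi_star_numeric(t, lo=1e-6, hi=50.0, steps=100000):
    """Brute-force the supremum on a grid to cross-check phi_star."""
    best = -math.inf
    for i in range(steps + 1):
        x = lo + (hi - lo) * i / steps
        best = max(best, t * x - phi(x))
    return best

for x in (0.1, 1.0, 2.5, 10.0):
    # phi_star(phi_prime(x)) should equal x - 1 up to rounding
    print(x, phi_star(phi_prime(x)), x - 1.0)
```

With this identity, the integrand reduces to a density ratio minus one, whose integral against $f$ vanishes as soon as the numerator is itself a density.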

C.2. Discussion of $(H4)$.

This hypothesis consists in the following assumptions:
  • We work with the Kullback–Leibler divergence, (0)
  • We have $f(\cdot/a_1^\top x) = g(\cdot/a_1^\top x)$, i.e., $K(g\frac{f_1}{g_1}, f) = 0$ (we could also derive the same proof with $f$, $g^{(k-1)}$ and $a_k$). (1)
Preliminary $(A)$: We show that $A = \Big\{(c,x) \in \mathbb{R}^d_*\setminus\{a_1\} \times \mathbb{R}^d;\ \frac{f_{a_1}(a_1^\top x)}{g_{a_1}(a_1^\top x)} > \frac{f_c(c^\top x)}{g_c(c^\top x)},\ g(x)\frac{f_c(c^\top x)}{g_c(c^\top x)} > f(x)\Big\} = \emptyset$ through a reductio ad absurdum, i.e., we assume $A \neq \emptyset$.
Then our hypothesis enables us to derive $f(x) = f(\cdot/a_1^\top x) f_{a_1}(a_1^\top x) = g(\cdot/a_1^\top x) f_{a_1}(a_1^\top x) > g(\cdot/c^\top x) f_c(c^\top x) > f(x)$, since $\frac{f_{a_1}(a_1^\top x)}{g_{a_1}(a_1^\top x)} \ge \frac{f_c(c^\top x)}{g_c(c^\top x)}$ implies $g(\cdot/a_1^\top x) f_{a_1}(a_1^\top x) = g(x)\frac{f_{a_1}(a_1^\top x)}{g_{a_1}(a_1^\top x)} \ge g(x)\frac{f_c(c^\top x)}{g_c(c^\top x)} = g(\cdot/c^\top x) f_c(c^\top x)$, i.e., $f > f$. We can therefore conclude.
Preliminary $(B)$: We show that $B = \Big\{(c,x) \in \mathbb{R}^d_*\setminus\{a_1\} \times \mathbb{R}^d;\ \frac{f_{a_1}(a_1^\top x)}{g_{a_1}(a_1^\top x)} < \frac{f_c(c^\top x)}{g_c(c^\top x)},\ g(x)\frac{f_c(c^\top x)}{g_c(c^\top x)} < f(x)\Big\} = \emptyset$ through a reductio ad absurdum, i.e., we assume $B \neq \emptyset$.
Then our hypothesis enables us to derive $f(x) = f(\cdot/a_1^\top x) f_{a_1}(a_1^\top x) = g(\cdot/a_1^\top x) f_{a_1}(a_1^\top x) < g(\cdot/c^\top x) f_c(c^\top x) < f(x)$.
We can therefore conclude as above.
Let us now verify $(H4)$:
We have $PM(c, a_1) - PM(c, a) = \int \ln\Big(\frac{g(x) f_c(c^\top x)}{g_c(c^\top x) f(x)}\Big)\Big\{\frac{f_{a_1}(a_1^\top x)}{g_{a_1}(a_1^\top x)} - \frac{f_c(c^\top x)}{g_c(c^\top x)}\Big\} g(x)\,dx$. Moreover, the logarithm is negative on $\Big\{x \in \mathbb{R}^d_*;\ \frac{g(x) f_c(c^\top x)}{g_c(c^\top x) f(x)} < 1\Big\}$ and positive on $\Big\{x \in \mathbb{R}^d_*;\ \frac{g(x) f_c(c^\top x)}{g_c(c^\top x) f(x)} \ge 1\Big\}$.
Thus, the preliminary studies $(A)$ and $(B)$ show that $\ln\Big(\frac{g(x) f_c(c^\top x)}{g_c(c^\top x) f(x)}\Big)$ and $\Big\{\frac{f_{a_1}(a_1^\top x)}{g_{a_1}(a_1^\top x)} - \frac{f_c(c^\top x)}{g_c(c^\top x)}\Big\}$ always have a negative product. We can therefore conclude, since $(c,a) \mapsto PM(c, a_1) - PM(c, a)$ is not null for all $c$ and for all $a$ with $a \neq a_1$.

D. Proofs

Preliminary remark :
Let us note that if $K(g, f) \ge \int |f(x) - g(x)|\,dx$, a simple reductio ad absurdum enables us to infer that $K(g^{(1)}, f) \ge \int |f(x) - g^{(1)}(x)|\,dx$. Therefore, by induction, we immediately obtain that, for any $k$, $K(g^{(k)}, f) \ge \int |f(x) - g^{(k)}(x)|\,dx$. Thus, for any $k$ and from a certain rank $n$, we derive that $K(g_n^{(k)}, f) \ge \int |f(x) - g_n^{(k)}(x)|\,dx$.
Proof of Lemma D.1.
Lemma D.1. We have $g(\cdot/a_1^\top x, \ldots, a_j^\top x) = n(a_{j+1}^\top x, \ldots, a_d^\top x) = f(\cdot/a_1^\top x, \ldots, a_j^\top x)$.
Putting $A = (a_1, \ldots, a_d)$, let us determine $f$ in basis $A$. Let us first study the function defined by $\psi : \mathbb{R}^d \to \mathbb{R}^d$, $x \mapsto (a_1^\top x, \ldots, a_d^\top x)$. It is immediate that $\psi$ is continuous and, since $A$ is a basis, that it is bijective. Let us now study its Jacobian.
By definition, it is
$$J_\psi(x_1, \ldots, x_d) = \begin{vmatrix} \frac{\partial \psi_1}{\partial x_1} & \cdots & \frac{\partial \psi_1}{\partial x_d} \\ \vdots & \ddots & \vdots \\ \frac{\partial \psi_d}{\partial x_1} & \cdots & \frac{\partial \psi_d}{\partial x_d} \end{vmatrix} = \begin{vmatrix} a_{1,1} & \cdots & a_{1,d} \\ \vdots & \ddots & \vdots \\ a_{d,1} & \cdots & a_{d,d} \end{vmatrix} = |A| \neq 0,$$
since $A$ is a basis. We can therefore infer: $\forall x \in \mathbb{R}^d$, $\exists!\, y \in \mathbb{R}^d$ such that $f(x) = |A|^{-1}\Psi(y)$, i.e., $\Psi$ (resp. $y$) is the expression of $f$ (resp. of $x$) in basis $A$, namely $\Psi(y) = \tilde n(y_{j+1}, \ldots, y_d)\,\tilde h(y_1, \ldots, y_j)$, with $\tilde n$ and $\tilde h$ being the expressions of $n$ and $h$ in basis $A$. Consequently, our results in the case where the family $\{a_j\}_{1 \le j \le d}$ is the canonical basis of $\mathbb{R}^d$ still hold for $\Psi$ in basis $A$ (see Section 2.1). Then, if $\tilde g$ is the expression of $g$ in basis $A$, we have $\tilde g(\cdot/y_1, \ldots, y_j) = \tilde n(y_{j+1}, \ldots, y_d) = \Psi(\cdot/y_1, \ldots, y_j)$, i.e., $g(\cdot/a_1^\top x, \ldots, a_j^\top x) = n(a_{j+1}^\top x, \ldots, a_d^\top x) = f(\cdot/a_1^\top x, \ldots, a_j^\top x)$.
Proof of Lemma D.2.
Lemma D.2. Should there exist a family $(a_i)_{i=1,\ldots,d}$ such that $f(x) = n(a_{j+1}^\top x, \ldots, a_d^\top x)\,h(a_1^\top x, \ldots, a_j^\top x)$, with $j < d$ and with $f$, $n$ and $h$ being densities, then this family is an orthogonal basis of $\mathbb{R}^d$.
Using a reductio ad absurdum, we have $\int f(x)\,dx = 1 \neq +\infty = \int n(a_{j+1}^\top x, \ldots, a_d^\top x)\,h(a_1^\top x, \ldots, a_j^\top x)\,dx$. We can therefore conclude.
Lemma D.3. $\inf_{a \in \mathbb{R}^d_*} D_\phi(g^*, f)$ is reached when the ϕ-divergence is greater than the $L^1$ distance as well as the $L^2$ distance.
Indeed, let $G$ be $\{g\frac{f_a}{g_a};\ a \in \mathbb{R}^d_*\}$ and let $\Gamma_c = \{p;\ K(p, f) \le c\}$ for all $c > 0$. From Lemmas A.1, A.2 and A.3 (see page 1605), we get that $\Gamma_c \cap G$ is compact for the topology of the uniform convergence, if $\Gamma_c \cap G$ is not empty. Hence, since property A.2 (see page 1605) implies that $Q \mapsto D_\phi(Q, P)$ is lower semi-continuous in $L^1$ for the topology of the uniform convergence, the infimum is reached in $L^1$. (Taking for example $c = D_\phi(g, f)$, $\Omega$ is necessarily not empty because we always have $D_\phi(g\frac{f_a}{g_a}, f) \le D_\phi(g, f)$.) Moreover, when the ϕ-divergence is greater than the $L^2$ distance, the very definition of the $L^2$ space enables us to provide the same proof as for the $L^1$ distance.
Proof of Lemma D.4.
Lemma D.4. For any $p \le d$, we have $f^{(p-1)}_{a_p} = f_{a_p}$ (see Huber's analytic method), $g^{(p-1)}_{a_p} = g_{a_p}$ (see Huber's synthetic method) and $g^{(p-1)}_{a_p} = g_{a_p}$ (see our algorithm).
As the proof is the same for our algorithm and for Huber's, we only develop it here for our algorithm. Assuming, without any loss of generality, that the $a_i$, $i = 1, \ldots, p$, are the vectors of the canonical basis, since $g^{(p-1)}(x) = g(x)\frac{f_1(x_1)}{g_1(x_1)}\frac{f_2(x_2)}{g_2(x_2)}\cdots\frac{f_{p-1}(x_{p-1})}{g_{p-1}(x_{p-1})}$, we derive immediately that $g^{(p-1)}_p = g_p$. A change of basis on the $a_i$ is then sufficient to obtain the general case.
Proof of Lemma D.5.
Lemma D.5. If there exists $p$, $p \le d$, such that $D_\phi(g^{(p)}, f) = 0$, then the family $(a_i)_{i=1,\ldots,p}$, derived from the construction of $g^{(p)}$, is free and orthogonal.
Without any loss of generality, let us assume that $p = 2$ and that the $a_i$ are the vectors of the canonical basis. Using a reductio ad absurdum with the hypotheses $a_1 = (1, 0, \ldots, 0)$ and $a_2 = (\alpha, 0, \ldots, 0)$, where $\alpha \in \mathbb{R}$, we get $g^{(1)}(x) = g(x_2, \ldots, x_d/x_1) f_1(x_1)$ and $f = g^{(2)}(x) = g(x_2, \ldots, x_d/x_1) f_1(x_1)\frac{f_{\alpha a_1}(\alpha x_1)}{[g^{(1)}]_{\alpha a_1}(\alpha x_1)}$. Hence $f(x_2, \ldots, x_d/x_1) = g(x_2, \ldots, x_d/x_1)\frac{f_{\alpha a_1}(\alpha x_1)}{[g^{(1)}]_{\alpha a_1}(\alpha x_1)}$. This implies that $f_{\alpha a_1}(\alpha x_1) = [g^{(1)}]_{\alpha a_1}(\alpha x_1)$, since $1 = \int f(x_2, \ldots, x_d/x_1)\,dx_2 \ldots dx_d = \int g(x_2, \ldots, x_d/x_1)\,dx_2 \ldots dx_d\,\frac{f_{\alpha a_1}(\alpha x_1)}{[g^{(1)}]_{\alpha a_1}(\alpha x_1)} = \frac{f_{\alpha a_1}(\alpha x_1)}{[g^{(1)}]_{\alpha a_1}(\alpha x_1)}$. Therefore, $g^{(2)} = g^{(1)}$, i.e., $p = 1$, which leads to a contradiction. Hence, the family is free. Moreover, using a reductio ad absurdum we get the orthogonality. Indeed, we have $\int f(x)\,dx = 1 \neq +\infty = \int n(a_{j+1}^\top x, \ldots, a_d^\top x)\,h(a_1^\top x, \ldots, a_j^\top x)\,dx$. The same argument as in the proof of Lemma D.2 enables us to infer the orthogonality of $(a_i)_{i=1,\ldots,p}$.
Proof of Lemma D.6.
Lemma D.6. If there exists $p$, $p \le d$, such that $D_\phi(g^{(p)}, f) = 0$, where $g^{(p)}$ is built from the free and orthogonal family $a_1, \ldots, a_j$, then there exists a free and orthogonal family $(b_k)_{k=j+1,\ldots,d}$ of vectors of $\mathbb{R}^d_*$ such that $g^{(p)}(x) = g(b_{j+1}^\top x, \ldots, b_d^\top x/a_1^\top x, \ldots, a_j^\top x)\,f_{a_1}(a_1^\top x) \cdots f_{a_j}(a_j^\top x)$ and such that $\mathbb{R}^d = Vect\{a_i\} \oplus Vect\{b_k\}$.
Through the incomplete basis theorem, and similarly to Lemma D.5, we obtain the result thanks to Fubini's theorem.
Proof of Lemma D.7.
Lemma D.7. For any continuous density $f$, we have $y_m = |f_m(x) - f(x)| = O_P(m^{-\frac{2}{4+d}})$.
Defining $b_m(x)$ as $b_m(x) = |E(f_m(x)) - f(x)|$, we have $y_m \le |f_m(x) - E(f_m(x))| + b_m(x)$. Moreover, from page 150 of Scott [15], we derive that $b_m(x) = O_P(\sum_{j=1}^d h_j^2)$ where $h_j = O_P(m^{-\frac{1}{4+d}})$. Then, we obtain $b_m(x) = O_P(m^{-\frac{2}{4+d}})$. Finally, since the central limit theorem rate is $O_P(m^{-\frac{1}{2}})$, we infer that $y_m \le O_P(m^{-\frac{1}{2}}) + O_P(m^{-\frac{2}{4+d}}) = O_P(m^{-\frac{2}{4+d}})$.
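The last step of this proof uses the fact that the bias term decays more slowly than the central limit theorem term in every dimension, since $\frac{2}{4+d} < \frac{1}{2}$ for all $d \ge 1$. The following small sketch, with illustrative function names, makes this comparison of exponents explicit using exact fractions.

```python
from fractions import Fraction

CLT_EXPONENT = Fraction(1, 2)  # stochastic term: O_P(m^{-1/2})

def bias_exponent(d):
    """Exponent of the bias term O_P(m^{-2/(4+d)}) in dimension d."""
    return Fraction(2, 4 + d)

def dominant_exponent(d):
    """Exponent r with y_m = O_P(m^{-r}): the slower (smaller) of the two rates."""
    return min(bias_exponent(d), CLT_EXPONENT)

for d in (1, 2, 3, 5, 10):
    print("d =", d, "bias exponent =", bias_exponent(d),
          "dominant exponent =", dominant_exponent(d))
```

The dominant exponent always equals the bias exponent, which is why the sum of the two terms is $O_P(m^{-\frac{2}{4+d}})$.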
Proof of Proposition 3.1.
Without loss of generality, we reason with $x_1$ in lieu of $a^\top x$.
Let us define $g^* = g r$. We remark that $g$ and $g^*$ have the same density conditionally on $x_1$. Indeed, $g_1^*(x_1) = \int g^*(x)\,dx_2 \ldots dx_d = \int h(x_1) g(x)\,dx_2 \ldots dx_d = h(x_1) \int g(x)\,dx_2 \ldots dx_d = h(x_1) g_1(x_1)$.
We can therefore prove this proposition.
First, since f and g are known, then, for any given function h : x 1 h ( x 1 ) , the application T, which is defined by
  • T : g ( . / x 1 ) h ( x 1 ) f 1 ( x 1 ) g 1 ( x 1 ) g ( . / x 1 ) f 1 ( x 1 )
  • T : f ( . / x 1 ) f 1 ( x 1 ) f ( . / x 1 ) f 1 ( x 1 )
is measurable.
Second, the above remark implies that D ϕ ( g * , f ) = D ϕ ( g * ( . / x 1 ) g 1 ( x 1 ) h ( x 1 ) f 1 ( x 1 ) , f ( . / x 1 ) f 1 ( x 1 ) ) = D ϕ ( g ( . / x 1 ) g 1 ( x 1 ) h ( x 1 ) f 1 ( x 1 ) , f ( . / x 1 ) f 1 ( x 1 ) ) .
Consequently, property A.3 page 1605 infers: D ϕ ( g ( . / x 1 ) g 1 ( x 1 ) h ( x 1 ) f 1 ( x 1 ) , f ( . / x 1 ) f 1 ( x 1 ) ) D ϕ ( T 1 ( g ( . / x 1 ) g 1 ( x 1 ) h ( x 1 ) f 1 ( x 1 ) ) , T 1 ( f ( . / x 1 ) f 1 ( x 1 ) ) )
= D ϕ ( g ( . / x 1 ) f 1 ( x 1 ) , f ( . / x 1 ) f 1 ( x 1 ) ) , by the very definition of T.
= D ϕ ( g f 1 g 1 , f ) , which completes the proof of this proposition.
Proof of Proposition 3.3. Proposition 3.3 comes immediately from Proposition B.1 page 1606 and Lemma A.1 page 1605.
Proof of Theorem 3.1. First, by the very definition of the kernel estimator, $\check g_n^{(0)} = g_n$ converges towards $g$. Moreover, the continuity of $a \mapsto f_{a,n}$ and $a \mapsto g_{a,n}$ and Proposition 3.3 imply that $\check g_n^{(1)} = \check g_n^{(0)}\frac{f_{a,n}}{\check g^{(0)}_{a,n}}$ converges towards $g^{(1)}$. Finally, since, for any $k$, $\check g_n^{(k)} = \check g_n^{(k-1)}\frac{f_{\check a_k,n}}{\check g^{(k-1)}_{\check a_k,n}}$, we conclude by an immediate induction.
Proof of Theorem 3.2. First, from Lemma D.7, we derive that, for any $x$, $\sup_{a \in \mathbb{R}^d_*} |f_{a,n}(a^\top x) - f_a(a^\top x)| = O_P(n^{-\frac{2}{4+d}})$. Then, let us consider $\Psi_j = \frac{f_{\check a_j,n}(\check a_j^\top x)}{\check g^{(j-1)}_{\check a_j,n}(\check a_j^\top x)} - \frac{f_{a_j}(a_j^\top x)}{g^{(j-1)}_{a_j}(a_j^\top x)}$. We have $\Psi_j = \frac{1}{\check g^{(j-1)}_{\check a_j,n}(\check a_j^\top x)\,g^{(j-1)}_{a_j}(a_j^\top x)}\Big(\big(f_{\check a_j,n}(\check a_j^\top x) - f_{a_j}(a_j^\top x)\big) g^{(j-1)}_{a_j}(a_j^\top x) + f_{a_j}(a_j^\top x)\big(g^{(j-1)}_{a_j}(a_j^\top x) - \check g^{(j-1)}_{\check a_j,n}(\check a_j^\top x)\big)\Big)$, i.e., $|\Psi_j| = O_P\big(n^{-\frac{1}{2}\mathbf{1}_{d=1} - \frac{2}{4+d}\mathbf{1}_{d>1}}\big)$, since $f_{a_j}(a_j^\top x) = O(1)$ and $g^{(j-1)}_{a_j}(a_j^\top x) = O(1)$. We can therefore conclude similarly to the proof of Theorem A.2.
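The decomposition of $\Psi_j$ used above is the elementary identity $\frac{u}{v} - \frac{f}{g} = \frac{(u - f)\,g + f\,(g - v)}{v\,g}$, where $u$ and $v$ stand for the estimated numerator and denominator. The sketch below checks it with exact rational arithmetic; the variable names and the sample values are hypothetical and carry no statistical meaning.

```python
from fractions import Fraction

def psi_direct(f_est, g_est, f_true, g_true):
    """Difference of the estimated ratio and the true ratio."""
    return f_est / g_est - f_true / g_true

def psi_decomposed(f_est, g_est, f_true, g_true):
    """Same quantity, split into the error on f and the error on g."""
    numerator = (f_est - f_true) * g_true + f_true * (g_true - g_est)
    return numerator / (g_est * g_true)

# Arbitrary positive rationals standing in for density values at a point.
values = (Fraction(3, 7), Fraction(5, 11), Fraction(2, 5), Fraction(1, 2))
print(psi_direct(*values) == psi_decomposed(*values))
```

Written this way, each of the two error terms is multiplied by a bounded factor, which is how the rate of $\Psi_j$ is read off from the rates of the marginal estimators.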
Proof of Theorem D.1.
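Theorem D.1. In the case where $f$ is known and under the hypotheses assumed in Section 3.1, it holds that
$$\sqrt{n}\,A.(\check c_n(a_k) - a_k) \xrightarrow{\mathcal{L}aw} B.\mathcal{N}_d\Big(0, P\Big\|\frac{\partial}{\partial b}M(a_k, a_k)\Big\|^2\Big) + C.\mathcal{N}_d\Big(0, P\Big\|\frac{\partial}{\partial a}M(a_k, a_k)\Big\|^2\Big)$$
and
$$\sqrt{n}\,A.(\check\gamma_n - a_k) \xrightarrow{\mathcal{L}aw} C.\mathcal{N}_d\Big(0, P\Big\|\frac{\partial}{\partial b}M(a_k, a_k)\Big\|^2\Big) + C.\mathcal{N}_d\Big(0, P\Big\|\frac{\partial}{\partial a}M(a_k, a_k)\Big\|^2\Big),$$
where $A = P\frac{\partial^2}{\partial b\,\partial b}M(a_k, a_k)\Big(P\frac{\partial^2}{\partial a_i\,\partial a_j}M(a_k, a_k) + P\frac{\partial^2}{\partial a_i\,\partial b_j}M(a_k, a_k)\Big)$, $C = P\frac{\partial^2}{\partial b\,\partial b}M(a_k, a_k)$ and $B = P\frac{\partial^2}{\partial b\,\partial b}M(a_k, a_k) + P\frac{\partial^2}{\partial a_i\,\partial a_j}M(a_k, a_k) + P\frac{\partial^2}{\partial a_i\,\partial b_j}M(a_k, a_k)$.
First of all, let us remark that hypotheses $(H1)$ to $(H3)$ imply that $\check\gamma_n$ and $\check c_n(a_k)$ converge towards $a_k$ in probability. Hypothesis $(H4)$ enables us to differentiate under the integral sign; after calculation, we get
$$P\frac{\partial}{\partial b}M(a_k, a_k) = P\frac{\partial}{\partial a}M(a_k, a_k) = 0,$$
$$P\frac{\partial^2}{\partial a_i\,\partial b_j}M(a_k, a_k) = P\frac{\partial^2}{\partial b_j\,\partial a_i}M(a_k, a_k) = \int \varphi''\Big(\frac{g f_{a_k}}{f g_{a_k}}\Big)\frac{\partial}{\partial a_i}\frac{g f_{a_k}}{f g_{a_k}}\,\frac{\partial}{\partial b_j}\frac{g f_{a_k}}{f g_{a_k}}\,f\,dx,$$
$$P\frac{\partial^2}{\partial b_i\,\partial b_j}M(a_k, a_k) = -\int \varphi''\Big(\frac{g f_{a_k}}{f g_{a_k}}\Big)\frac{\partial}{\partial b_i}\frac{g f_{a_k}}{f g_{a_k}}\,\frac{\partial}{\partial b_j}\frac{g f_{a_k}}{f g_{a_k}}\,f\,dx,$$
$$P\frac{\partial^2}{\partial a_i\,\partial a_j}M(a_k, a_k) = \int \varphi'\Big(\frac{g f_{a_k}}{f g_{a_k}}\Big)\frac{\partial^2}{\partial a_i\,\partial a_j}\frac{g f_{a_k}}{f g_{a_k}}\,f\,dx,$$
and consequently $P\frac{\partial^2}{\partial b_i\,\partial b_j}M(a_k, a_k) = -P\frac{\partial^2}{\partial a_i\,\partial b_j}M(a_k, a_k) = -P\frac{\partial^2}{\partial b_j\,\partial a_i}M(a_k, a_k)$, which implies
$$\frac{\partial^2}{\partial a_i\,\partial a_j}K\Big(g\frac{f_{a_k}}{g_{a_k}}, f\Big) = P\frac{\partial^2}{\partial a_i\,\partial a_j}M(a_k, a_k) - P\frac{\partial^2}{\partial b_i\,\partial b_j}M(a_k, a_k) = P\frac{\partial^2}{\partial a_i\,\partial a_j}M(a_k, a_k) + P\frac{\partial^2}{\partial a_i\,\partial b_j}M(a_k, a_k) = P\frac{\partial^2}{\partial a_i\,\partial a_j}M(a_k, a_k) + P\frac{\partial^2}{\partial b_j\,\partial a_i}M(a_k, a_k).$$
The very definition of the estimators $\check\gamma_n$ and $\check c_n(a_k)$ implies that
$$\begin{cases} P_n\frac{\partial}{\partial b}M(b, a) = 0 \\ P_n\frac{\partial}{\partial a}M(b(a), a) = 0 \end{cases} \quad\text{i.e.}\quad \begin{cases} P_n\frac{\partial}{\partial b}M(\check c_n(a_k), \check\gamma_n) = 0 \\ P_n\frac{\partial}{\partial a}M(\check c_n(a_k), \check\gamma_n) + P_n\frac{\partial}{\partial b}M(\check c_n(a_k), \check\gamma_n)\frac{\partial}{\partial a}\check c_n(a_k) = 0 \end{cases} \quad\text{i.e.}\quad \begin{cases} P_n\frac{\partial}{\partial b}M(\check c_n(a_k), \check\gamma_n) = 0 & (E0) \\ P_n\frac{\partial}{\partial a}M(\check c_n(a_k), \check\gamma_n) = 0 & (E1) \end{cases}$$
Under $(H5)$ and $(H6)$, and using a Taylor expansion of equation $(E0)$ (resp. $(E1)$), we infer that there exists $(\bar c_n, \bar\gamma_n)$ (resp. $(\tilde c_n, \tilde\gamma_n)$) on the segment $[(\check c_n(a_k), \check\gamma_n), (a_k, a_k)]$ such that
$$-P_n\frac{\partial}{\partial b}M(a_k, a_k) = \Big[\Big(P\frac{\partial^2}{\partial b\,\partial b}M(a_k, a_k)\Big)^\top + o_P(1),\ \Big(P\frac{\partial^2}{\partial a\,\partial b}M(a_k, a_k)\Big)^\top + o_P(1)\Big]\,a_n$$
(resp. $-P_n\frac{\partial}{\partial a}M(a_k, a_k) = \Big[\Big(P\frac{\partial^2}{\partial b\,\partial a}M(a_k, a_k)\Big)^\top + o_P(1),\ \Big(P\frac{\partial^2}{\partial a^2}M(a_k, a_k)\Big)^\top + o_P(1)\Big]\,a_n$), with $a_n = \big((\check c_n(a_k) - a_k)^\top, (\check\gamma_n - a_k)^\top\big)$. Thus we get
$$\sqrt{n}\,a_n = \sqrt{n}\begin{pmatrix} P\frac{\partial^2}{\partial b^2}M(a_k, a_k) & P\frac{\partial^2}{\partial a\,\partial b}M(a_k, a_k) \\ P\frac{\partial^2}{\partial b\,\partial a}M(a_k, a_k) & P\frac{\partial^2}{\partial a^2}M(a_k, a_k) \end{pmatrix}^{-1}\begin{pmatrix} -P_n\frac{\partial}{\partial b}M(a_k, a_k) \\ -P_n\frac{\partial}{\partial a}M(a_k, a_k) \end{pmatrix} + o_P(1)$$
$$= \sqrt{n}\,\Big(P\frac{\partial^2}{\partial b\,\partial b}M(a_k, a_k)\,\frac{\partial^2}{\partial a\,\partial a}K\Big(g\frac{f_{a_k}}{g_{a_k}}, f\Big)\Big)^{-1}.\begin{pmatrix} P\frac{\partial^2}{\partial b\,\partial b}M(a_k, a_k) + \frac{\partial^2}{\partial a\,\partial a}K\big(g\frac{f_{a_k}}{g_{a_k}}, f\big) & P\frac{\partial^2}{\partial b\,\partial b}M(a_k, a_k) \\ P\frac{\partial^2}{\partial b\,\partial b}M(a_k, a_k) & P\frac{\partial^2}{\partial b\,\partial b}M(a_k, a_k) \end{pmatrix}.\begin{pmatrix} -P_n\frac{\partial}{\partial b}M(a_k, a_k) \\ -P_n\frac{\partial}{\partial a}M(a_k, a_k) \end{pmatrix} + o_P(1).$$
Moreover, the central limit theorem implies $\sqrt{n}\,P_n\frac{\partial}{\partial b}M(a_k, a_k) \xrightarrow{\mathcal{L}aw} \mathcal{N}_d\Big(0, P\Big\|\frac{\partial}{\partial b}M(a_k, a_k)\Big\|^2\Big)$ and $\sqrt{n}\,P_n\frac{\partial}{\partial a}M(a_k, a_k) \xrightarrow{\mathcal{L}aw} \mathcal{N}_d\Big(0, P\Big\|\frac{\partial}{\partial a}M(a_k, a_k)\Big\|^2\Big)$, since $P\frac{\partial}{\partial b}M(a_k, a_k) = P\frac{\partial}{\partial a}M(a_k, a_k) = 0$, which leads us to the result.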
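The matrix inversion in this proof can be checked in the scalar case. The sketch below assumes that the cross derivative equals minus the second derivative in $b$ (as the relations between the second derivatives of $M$ are read here) and writes `k` for $\frac{\partial^2}{\partial a\,\partial a}K$, so that the second derivative in $a$ is `bb + k`; the numerical values are hypothetical and serve only to exercise the algebra.

```python
from fractions import Fraction

def inverse_2x2(m):
    """Exact inverse of a 2x2 matrix given as nested lists."""
    (p, q), (r, s) = m
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]

def claimed_inverse(bb, k):
    """(bb * k)^{-1} * [[bb + k, bb], [bb, bb]], the form used in the proof."""
    scale = 1 / (bb * k)
    return [[scale * (bb + k), scale * bb], [scale * bb, scale * bb]]

bb = Fraction(-3, 2)  # hypothetical value of P d^2 M / db^2
k = Fraction(5, 4)    # hypothetical value of d^2 K / da^2
aa = bb + k           # P d^2 M / da^2, since d^2 K / da^2 = aa - bb
hessian = [[bb, -bb], [-bb, aa]]  # cross terms assumed equal to -bb

print(inverse_2x2(hessian) == claimed_inverse(bb, k))
```

The determinant equals `bb * k`, which matches the factor $A$ of Theorem D.1, and the first row of the inverse carries the coefficients $B$ and $C$.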
Proof of Theorem 3.3. We derive this theorem through Proposition B.1 and Theorem D.1.
Proof of Theorem 3.4. We recall that $g_n^{(k)}$ is the kernel estimator of $\check g^{(k)}$. Since the Kullback–Leibler divergence is greater than the $L^1$-distance, we have $\lim_n \lim_k K(g_n^{(k)}, f_n) \ge \lim_n \lim_k \int |g_n^{(k)}(x) - f_n(x)|\,dx$.
Moreover, Fatou's lemma implies that $\lim_k \int |g_n^{(k)}(x) - f_n(x)|\,dx \ge \int \lim_k |g_n^{(k)}(x) - f_n(x)|\,dx = \int |[\lim_k g_n^{(k)}(x)] - f_n(x)|\,dx$ and $\lim_n \int |[\lim_k g_n^{(k)}(x)] - f_n(x)|\,dx \ge \int \lim_n |[\lim_k g_n^{(k)}(x)] - f_n(x)|\,dx = \int |[\lim_n \lim_k g_n^{(k)}(x)] - \lim_n f_n(x)|\,dx$. Through Lemma A.4, we then obtain that $0 = \lim_n \lim_k K(g_n^{(k)}, f_n) \ge \int |[\lim_n \lim_k g_n^{(k)}(x)] - \lim_n f_n(x)|\,dx \ge 0$, i.e., that $\int |[\lim_n \lim_k g_n^{(k)}(x)] - \lim_n f_n(x)|\,dx = 0$. Moreover, for any given $k$ and any given $n$, the function $g_n^{(k)}$ is a convex combination of multivariate Gaussian distributions. As derived in Remark 2.1 on page 1585, for all $k$, the determinant of the covariance of the random vector with density $g^{(k)}$ is greater than or equal to the product of a positive constant times the determinant of the covariance of the random vector with density $f$. The form of the kernel estimate therefore implies that there exists an integrable function $\varphi$ such that, for any given $k$ and any given $n$, we have $|g_n^{(k)}| \le \varphi$.
Finally, the dominated convergence theorem enables us to say that $\lim_n \lim_k g_n^{(k)} = \lim_n f_n = f$, since $f_n$ converges towards $f$ and since $\int |[\lim_n \lim_k g_n^{(k)}(x)] - \lim_n f_n(x)|\,dx = 0$.
Proof of Corollary 3.1. Through the dominated convergence theorem and through Theorem 3.4, we get the result using a reductio ad absurdum.
Proof of Theorem 3.5. Through Proposition B.1 and Theorem A.3, we derive Theorem 3.5.

Share and Cite

MDPI and ACS Style

Touboul, J. Projection Pursuit Through ϕ-Divergence Minimisation. Entropy 2010, 12, 1581-1611. https://doi.org/10.3390/e12061581
