Article

Non-Linear Canonical Correlation Analysis Using Alpha-Beta Divergence

Abhijit Mandal * and Andrzej Cichocki
Laboratory for Advanced Brain Signal Processing, RIKEN Brain Science Institute, RIKEN, 2-1 Hirosawa, Wako, 351-0198 Saitama, Japan
* Author to whom correspondence should be addressed.
Entropy 2013, 15(7), 2788-2804; https://doi.org/10.3390/e15072788
Submission received: 14 June 2013 / Revised: 12 July 2013 / Accepted: 15 July 2013 / Published: 18 July 2013

Abstract
We propose a generalized method of canonical correlation analysis using the Alpha-Beta divergence, called AB-canonical analysis (ABCA). From observations of two random variables, $\mathbf{x} \in \mathbb{R}^P$ and $\mathbf{y} \in \mathbb{R}^Q$, ABCA finds directions, $\mathbf{w}_x \in \mathbb{R}^P$ and $\mathbf{w}_y \in \mathbb{R}^Q$, such that the AB-divergence between the joint distribution of $(\mathbf{w}_x^T\mathbf{x}, \mathbf{w}_y^T\mathbf{y})$ and the product of their marginal distributions is maximized. The number of significant non-zero canonical coefficients is determined using a sequential permutation test. The advantage of our method over standard canonical correlation analysis (CCA) is that it can reconstruct the hidden non-linear relationship between $\mathbf{w}_x^T\mathbf{x}$ and $\mathbf{w}_y^T\mathbf{y}$, and it is robust against outliers. We extend ABCA to the case where data are observed as tensors. We further generalize the method by imposing sparseness constraints. An extensive simulation study is performed to justify our approach.

1. Introduction

In statistics and data analysis, we are often interested in finding the relationship between two sets of multi-dimensional random variables, $\mathbf{x} \in \mathbb{R}^P$ and $\mathbf{y} \in \mathbb{R}^Q$. Canonical correlation analysis (CCA) focuses on the correlation between a linear combination of the variables in one set and a linear combination of the variables in the other set. The idea is to first determine linear combinations of $\mathbf{x}$ and $\mathbf{y}$, called canonical variables, such that the correlation between them is the highest possible among all such linear combinations.
Based on the observed random sample, the aim of standard CCA is to find the linear relationship between $\mathbf{x}$ and $\mathbf{y}$. The method therefore fails if the relationship is non-linear. Another disadvantage of standard CCA is that it is very sensitive to outliers, as it is based on the correlation coefficient. In this paper, we generalize the concept of CCA so that it can extract the non-linear relationship between two sets of variables and, at the same time, remain robust against outliers. We assume that there exists a hidden relationship of the following type:
$$\mathbf{w}_y^T\mathbf{y} = \psi(\mathbf{w}_x^T\mathbf{x}) + \epsilon, \qquad (1)$$
where ψ is an unknown smooth function and ε is a random error. Our aim is to find the vectors $\mathbf{w}_x \in \mathbb{R}^P$ and $\mathbf{w}_y \in \mathbb{R}^Q$ from observed values of $\mathbf{x}$ and $\mathbf{y}$. Yin (2004) [1] developed a technique to solve this problem based on an information theoretic approach (see, also, Yin et al., 2008 [2]; Iaci et al., 2010 [3]). Recently, Iaci and Sriram (2013) [4] applied this method using the beta-divergence and the power divergence. Wang et al. (2012) [5] used the Bregman divergence to perform CCA. We explore this problem in detail and extend the method by using the Alpha-Beta divergence (or AB-divergence) (Cichocki et al., 2011 [6]), which is a generalized measure of divergence. Moreover, the earlier methods are limited to the case where $\mathbf{x}$ and $\mathbf{y}$ are random vectors; we extend them to tensor (multiway array) valued random variables.
Kernel CCA (Lai and Fyfe, 2000 [7]; Shawe-Taylor and Cristianini, 2004 [8]) also deals with a non-linear relationship between two sets of random variables, but the setting of the problem is different from ours. Kernel CCA first transforms the data to a higher (or infinite) dimensional non-linear space, called the reproducing kernel Hilbert space, and then assumes that there exists a linear relationship between the variables in the transformed space. In kernel CCA, it is not possible to recover the non-linear relationship itself, whereas in our case, the unknown function, ψ, in Equation (1) can be found by further analysis (see Breiman and Friedman, 1985 [9]). However, in this paper, our main interest is to recover $\mathbf{w}_x$ and $\mathbf{w}_y$, which satisfy Equation (1).
The rest of the paper is organized as follows. In Section 2 and Section 3, we discuss the basic formulations of CCA and the AB-divergence, respectively. The new method, AB-canonical analysis (ABCA), is proposed in Section 4. In Section 5, we describe the ABCA algorithm. A sequential permutation test to determine the number of significant canonical variable pairs is proposed in Section 6. In Section 7, we generalize ABCA to data sets observed as tensors. Sparsity constraints are introduced in Section 8. Numerical illustrations of the performance of the method are presented in Section 9. Section 10 contains some concluding remarks.

2. Canonical Correlation Analysis

Suppose we have N pairs of observations from two sets of random variables, $\mathbf{x}$ and $\mathbf{y}$: $\{\mathbf{x}(n) \in \mathbb{R}^P, \mathbf{y}(n) \in \mathbb{R}^Q;\ n = 1, 2, \ldots, N\}$. In CCA, we look for linear combinations of $\mathbf{x}$ and $\mathbf{y}$ that have maximum correlation with each other (Hotelling, 1936 [10]). Formally, classical CCA computes two projection vectors, $\mathbf{w}_x \in \mathbb{R}^P$ and $\mathbf{w}_y \in \mathbb{R}^Q$, such that the correlation coefficient:
$$\rho = \frac{\mathbf{w}_x^T \Sigma_{xy} \mathbf{w}_y}{\sqrt{\mathbf{w}_x^T \Sigma_x \mathbf{w}_x}\,\sqrt{\mathbf{w}_y^T \Sigma_y \mathbf{w}_y}}$$
is maximized, where $\Sigma_{xy}$ is the covariance matrix between $\mathbf{x}$ and $\mathbf{y}$, and $\Sigma_x$ and $\Sigma_y$ are the dispersion matrices of $\mathbf{x}$ and $\mathbf{y}$, respectively. Since ρ is invariant to the scaling of the vectors $\mathbf{w}_x$ and $\mathbf{w}_y$, CCA can be formulated equivalently as the following constrained optimization problem:
$$\max_{\mathbf{w}_x, \mathbf{w}_y} \mathbf{w}_x^T \Sigma_{xy} \mathbf{w}_y, \quad \text{subject to } \mathbf{w}_x^T \Sigma_x \mathbf{w}_x = \mathbf{w}_y^T \Sigma_y \mathbf{w}_y = 1.$$
We denote the optimum values of $(\mathbf{w}_x, \mathbf{w}_y)$ as $({}_1\mathbf{w}_x, {}_1\mathbf{w}_y)$. We refer to $u_1 = {}_1\mathbf{w}_x^T \mathbf{x}$ and $v_1 = {}_1\mathbf{w}_y^T \mathbf{y}$ as the first pair of canonical variables.
Next, we determine a new pair of linear combinations, say $u_2$ and $v_2$, which has the highest correlation subject to $u_2$ being uncorrelated with $u_1$ and $v_2$ being uncorrelated with $v_1$ (the construction actually ensures that $u_1$ and $v_2$ are uncorrelated as well, as are $u_2$ and $v_1$). Therefore, at the i-th step, the canonical vectors are obtained as:
$$({}_i\mathbf{w}_x, {}_i\mathbf{w}_y) = \arg\max_{\mathbf{w}_x, \mathbf{w}_y} \mathbf{w}_x^T \Sigma_{xy} \mathbf{w}_y$$
subject to:
$${}_i\mathbf{w}_x^T \Sigma_x\, {}_i\mathbf{w}_x = {}_i\mathbf{w}_y^T \Sigma_y\, {}_i\mathbf{w}_y = 1,$$
$${}_j\mathbf{w}_x^T \Sigma_x\, {}_i\mathbf{w}_x = {}_j\mathbf{w}_y^T \Sigma_y\, {}_i\mathbf{w}_y = 0,$$
for all $j = 1, 2, \ldots, i-1$ and $i \leq \min\{P, Q\}$. The process continues until subsequent pairs of linear combinations no longer produce a significant correlation.
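To make the classical procedure concrete, here is a minimal NumPy sketch (our own, not from the paper) that solves the constrained problem above via an SVD of the whitened cross-covariance; the function name and interface are our own choices:

```python
import numpy as np

def classical_cca(X, Y):
    """Classical CCA via SVD of the whitened cross-covariance.

    X is (N, P) and Y is (N, Q); rows are observations.  Returns the
    canonical correlations rho and matrices Wx, Wy whose columns w
    satisfy w^T Sigma w = 1 and maximize w_x^T Sigma_xy w_y.
    """
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    N = X.shape[0]
    Sx = Xc.T @ Xc / (N - 1)                      # Sigma_x
    Sy = Yc.T @ Yc / (N - 1)                      # Sigma_y
    Sxy = Xc.T @ Yc / (N - 1)                     # Sigma_xy
    Lx, Ly = np.linalg.cholesky(Sx), np.linalg.cholesky(Sy)
    # Whitened cross-covariance K = Lx^{-1} Sigma_xy Ly^{-T}
    K = np.linalg.solve(Lx, np.linalg.solve(Ly, Sxy.T).T)
    U, rho, Vt = np.linalg.svd(K)
    r = min(X.shape[1], Y.shape[1])
    Wx = np.linalg.solve(Lx.T, U[:, :r])          # unwhiten: Lx^{-T} U
    Wy = np.linalg.solve(Ly.T, Vt.T[:, :r])       # unwhiten: Ly^{-T} V
    return rho[:r], Wx, Wy
```

The columns of Wx and Wy deliver the successive canonical vector pairs, and the orthogonality constraints above are satisfied automatically because the singular vectors are orthonormal.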

3. AB-Divergence

Consider two density functions, f and g, with respect to the Lebesgue measure. Then, the AB-divergence (Cichocki et al., 2011 [6]) between f and g is denoted by $D_{\alpha,\beta}(f\|g)$ and is defined by:
$$D_{\alpha,\beta}(f\|g) = -\frac{1}{\alpha\beta} \int \left( f^{\alpha}(x)\, g^{\beta}(x) - \frac{\alpha}{\alpha+\beta}\, f^{\alpha+\beta}(x) - \frac{\beta}{\alpha+\beta}\, g^{\alpha+\beta}(x) \right) dx,$$
where $\alpha, \beta, \alpha+\beta \neq 0$. The singularities for certain values of the parameters are avoided by taking continuous limits with respect to the parameters. Thus, the AB-divergence is expressed in a more explicit form as:
$$D_{\alpha,\beta}(f\|g) = \int d_{\alpha,\beta}(f, g)\, dx,$$
where:
$$d_{\alpha,\beta}(f, g) = \begin{cases} -\dfrac{1}{\alpha\beta}\left( f^{\alpha} g^{\beta} - \dfrac{\alpha}{\alpha+\beta} f^{\alpha+\beta} - \dfrac{\beta}{\alpha+\beta} g^{\alpha+\beta} \right) & \text{if } \alpha, \beta, \alpha+\beta \neq 0 \\[2ex] \dfrac{1}{\alpha^2}\left( f^{\alpha} \ln\left(\dfrac{f}{g}\right)^{\!\alpha} - f^{\alpha} + g^{\alpha} \right) & \text{if } \alpha \neq 0,\ \beta = 0 \\[2ex] \dfrac{1}{\alpha^2}\left( \ln\left(\dfrac{g}{f}\right)^{\!\alpha} + \left(\dfrac{g}{f}\right)^{\!-\alpha} - 1 \right) & \text{if } \alpha = -\beta \neq 0 \\[2ex] \dfrac{1}{\beta^2}\left( g^{\beta} \ln\left(\dfrac{g}{f}\right)^{\!\beta} - g^{\beta} + f^{\beta} \right) & \text{if } \alpha = 0,\ \beta \neq 0 \\[2ex] \dfrac{1}{2}\left( \ln f - \ln g \right)^2 & \text{if } \alpha, \beta = 0. \end{cases}$$
Several important divergences belong to the class of AB-divergences: for suitable choices of the parameters α and β, we can recover them (Amari, 2007 [11]; Minami and Eguchi, 2002 [12]). For example, when $\alpha + \beta = 1$, the AB-divergence reduces to the Alpha-divergence (Amari, 2007 [11]; Cichocki et al., 2011 [6]). On the other hand, when $\alpha = 1$, it becomes the Beta-divergence (Basu et al., 1998 [13]; Cichocki et al., 2006 [14]; Kompass, 2007 [15]; Minami and Eguchi, 2002 [12]; Févotte et al., 2009 [16]). The AB-divergence becomes the standard Kullback-Leibler divergence for $\alpha = 1$ and $\beta = 0$. The Itakura-Saito divergence and the Hellinger distance also belong to the class of AB-divergences (Cichocki et al., 2006 [14]; Févotte et al., 2009 [16]).
One important property of the divergence is that $D_{\alpha,\beta}(f\|g)$ is non-negative for all f and g, and is equal to zero if and only if $f = g$ almost everywhere (Cichocki et al., 2011 [6]). Let us take f to be the joint density of two random variables, $\mathbf{x}$ and $\mathbf{y}$, and g to be the product of their marginal densities. Then, $D_{\alpha,\beta}(f\|g) = 0$ if and only if $\mathbf{x}$ and $\mathbf{y}$ are independent. We will use this property of the AB-divergence to find the canonical variables.
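As a numerical illustration (our own sketch, not part of the paper), the five cases of $d_{\alpha,\beta}$ can be evaluated directly for two densities sampled on a common grid:

```python
import numpy as np

def ab_divergence(f, g, dx, alpha, beta, eps=1e-12):
    """AB-divergence D_{alpha,beta}(f || g) for two densities sampled on
    a common grid with cell size dx, following the five cases of
    d_{alpha,beta} above.  eps guards the logarithms and divisions."""
    f, g = np.maximum(f, eps), np.maximum(g, eps)
    a, b = alpha, beta
    if a != 0 and b != 0 and a + b != 0:
        d = -(f**a * g**b - a/(a+b) * f**(a+b) - b/(a+b) * g**(a+b)) / (a*b)
    elif a != 0 and b == 0:
        d = (f**a * np.log(f**a / g**a) - f**a + g**a) / a**2
    elif a == -b and a != 0:
        d = (np.log(g**a / f**a) + (g/f)**(-a) - 1.0) / a**2
    elif a == 0 and b != 0:
        d = (g**b * np.log(g**b / f**b) - g**b + f**b) / b**2
    else:                                   # alpha = beta = 0
        d = 0.5 * (np.log(f) - np.log(g))**2
    return float(np.sum(d) * dx)
```

For instance, ab_divergence(f, g, dx, 1.0, 0.0) returns the (generalized) Kullback-Leibler divergence mentioned above.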

4. AB-Canonical Analysis

Let us denote the joint distribution of two random variables by $f(\cdot, \cdot)$ and the marginal distributions by $f(\cdot)$. We define the AB-divergence between the joint distribution of $(\mathbf{w}_x^T\mathbf{x}, \mathbf{w}_y^T\mathbf{y})$ and the product of their marginal distributions as:
$$D_{\alpha,\beta}(\mathbf{w}_x, \mathbf{w}_y) = D_{\alpha,\beta}\!\left( f(\mathbf{w}_x^T\mathbf{x}, \mathbf{w}_y^T\mathbf{y}) \,\big\|\, f(\mathbf{w}_x^T\mathbf{x})\, f(\mathbf{w}_y^T\mathbf{y}) \right).$$
From the properties of the AB-divergence, we know that $D_{\alpha,\beta}(\mathbf{w}_x, \mathbf{w}_y) = 0$ if and only if $\mathbf{w}_x^T\mathbf{x}$ and $\mathbf{w}_y^T\mathbf{y}$ are statistically independent. Here, our aim is to find directions $\mathbf{w}_x$ and $\mathbf{w}_y$ such that $\mathbf{w}_x^T\mathbf{x}$ and $\mathbf{w}_y^T\mathbf{y}$ are as dependent as possible. Therefore, we find $\mathbf{w}_x$ and $\mathbf{w}_y$ from the optimization problem:
$$\max_{\mathbf{w}_x, \mathbf{w}_y} D_{\alpha,\beta}(\mathbf{w}_x, \mathbf{w}_y), \quad \text{subject to } \mathbf{w}_x^T\mathbf{w}_x = \mathbf{w}_y^T\mathbf{w}_y = 1.$$
We denote the first set of AB-canonical vectors as $({}_1\mathbf{w}_x, {}_1\mathbf{w}_y)$. The i-th set of canonical vectors is obtained as:
$$({}_i\mathbf{w}_x, {}_i\mathbf{w}_y) = \arg\max_{\mathbf{w}_x, \mathbf{w}_y} D_{\alpha,\beta}(\mathbf{w}_x, \mathbf{w}_y), \qquad (12)$$
subject to:
$${}_i\mathbf{w}_x^T\, {}_i\mathbf{w}_x = {}_i\mathbf{w}_y^T\, {}_i\mathbf{w}_y = 1,$$
$${}_j\mathbf{w}_x^T\, {}_i\mathbf{w}_x = {}_j\mathbf{w}_y^T\, {}_i\mathbf{w}_y = 0,$$
for all $j = 1, 2, \ldots, i-1$ and $i \leq \min\{P, Q\}$. As in CCA, we continue until subsequent pairs of canonical variables no longer produce a significant dependence.
We note that $D_{\alpha,\beta}(\mathbf{w}_x, \mathbf{w}_y) = 0$ implies that $\mathbf{w}_x^T\mathbf{x}$ and $\mathbf{w}_y^T\mathbf{y}$ are statistically independent, regardless of the distributions of $\mathbf{x}$ and $\mathbf{y}$. On the other hand, in standard CCA, a zero canonical correlation implies that $\mathbf{x}$ and $\mathbf{y}$ are uncorrelated, but, in general, they may not be independent. (If $\mathbf{x}$ and $\mathbf{y}$ jointly follow a normal distribution, however, uncorrelatedness does imply independence.) The concept of statistical dependence is more general and flexible than that of correlation: if $\mathbf{x}$ and $\mathbf{y}$ are independent, then they are also uncorrelated, but not vice versa.

5. ABCA Algorithm

Suppose we have N pairs of observations from two sets of random variables, $\mathbf{x}$ and $\mathbf{y}$: $\{\mathbf{x}(n) \in \mathbb{R}^P, \mathbf{y}(n) \in \mathbb{R}^Q;\ n = 1, 2, \ldots, N\}$. We calculate $D^{(N)}_{\alpha,\beta}(\mathbf{w}_x, \mathbf{w}_y)$, the sample version of $D_{\alpha,\beta}(\mathbf{w}_x, \mathbf{w}_y)$, using kernel density estimates (Yin, 2004 [1]). Therefore,
$$D^{(N)}_{\alpha,\beta}(\mathbf{w}_x, \mathbf{w}_y) = D_{\alpha,\beta}\!\left( f_N(\mathbf{w}_x^T\mathbf{x}, \mathbf{w}_y^T\mathbf{y}) \,\big\|\, f_N(\mathbf{w}_x^T\mathbf{x})\, f_N(\mathbf{w}_y^T\mathbf{y}) \right), \qquad (15)$$
where:
$$f_N(u) = \frac{1}{Nh} \sum_{n=1}^{N} K\!\left( \frac{u - u_n}{h} \right), \quad u \in \mathbb{R},$$
and:
$$f_N(u, v) = \frac{1}{N h_1 h_2} \sum_{n=1}^{N} K_2\!\left( \frac{u - u_n}{h_1}, \frac{v - v_n}{h_2} \right), \quad (u, v) \in \mathbb{R}^2.$$
Here, $h$, $h_1$ and $h_2$ are suitably chosen bandwidths, and $K(\cdot)$ and $K_2(\cdot, \cdot)$ are univariate and bivariate kernels, respectively. For simplicity, we take the product kernel (Scott, 1992 [17]), i.e.:
$$f_N(u, v) = \frac{1}{N h_1 h_2} \sum_{n=1}^{N} K\!\left( \frac{u - u_n}{h_1} \right) K\!\left( \frac{v - v_n}{h_2} \right), \quad (u, v) \in \mathbb{R}^2.$$
For the kernel density estimates to converge to the corresponding underlying densities, the bandwidth parameters must tend to zero as the sample size increases. We follow the method described in Silverman (1986) [18] by taking $h = 1.06\, s\, N^{-1/5}$ and $h_j = s_j N^{-1/6}$, $j = 1, 2$, where $s$, $s_1$ and $s_2$ are the corresponding standard deviations. Moreover, this choice of the bandwidth parameters satisfies the condition of Theorem 1, stated later in this section. Here, we use the Gaussian kernel. A robust kernel could be used to make the procedure robust against outliers (Kim and Scott, 2012 [19]), but we prefer to choose suitable tuning parameters, α and β, to achieve robustness.
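To illustrate this estimator, the following Python sketch (ours; the authors' implementation is the MATLAB program in [22]) computes $D^{(N)}_{\alpha,\beta}$ by evaluating the product-kernel density estimates on a regular grid and reusing the ab_divergence helper sketched in Section 3. The grid quadrature and the three-bandwidth padding are our own simplifications:

```python
import numpy as np

def d_ab_sample(u, v, alpha, beta, grid_size=64):
    """Sample objective: AB-divergence between the product-kernel joint
    KDE of (u, v) = (w_x^T x, w_y^T y) and the product of the marginal
    KDEs, approximated by simple grid quadrature."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    N = len(u)
    K = lambda t: np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)  # Gaussian kernel
    # Bandwidths as in the text: h_j = s_j N^{-1/6} for the joint density,
    # h = 1.06 s N^{-1/5} for the marginals
    h1, h2 = np.std(u) * N**(-1/6), np.std(v) * N**(-1/6)
    hu, hv = 1.06 * np.std(u) * N**(-0.2), 1.06 * np.std(v) * N**(-0.2)
    gu = np.linspace(u.min() - 3*h1, u.max() + 3*h1, grid_size)
    gv = np.linspace(v.min() - 3*h2, v.max() + 3*h2, grid_size)
    Ku = K((gu[:, None] - u[None, :]) / h1)        # (grid_size, N)
    Kv = K((gv[:, None] - v[None, :]) / h2)
    f_joint = Ku @ Kv.T / (N * h1 * h2)            # product-kernel joint KDE
    f_u = K((gu[:, None] - u[None, :]) / hu).sum(axis=1) / (N * hu)
    f_v = K((gv[:, None] - v[None, :]) / hv).sum(axis=1) / (N * hv)
    f_prod = np.outer(f_u, f_v)                    # product of marginal KDEs
    cell = (gu[1] - gu[0]) * (gv[1] - gv[0])
    return ab_divergence(f_joint.ravel(), f_prod.ravel(), cell, alpha, beta)
```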
The AB-canonical vectors obtained from Equation (15) are consistent in the sense that they converge to the true canonical vectors for large sample sizes. The following theorem ensures this result; its proof follows the same line of reasoning as Proposition 3 of Yin (2004) [1] or Theorem 1 of Iaci and Sriram (2013) [4].
Theorem 1: Assume that both the univariate and bivariate density functions, $f(\cdot)$ and $f(\cdot, \cdot)$, are continuous. Suppose that the kernel, K, is a function of bounded variation, and the sequence of bandwidth parameters, $h_n$, used in the k-dimensional density estimation satisfies the following bound:
$$\sum_{n=1}^{\infty} e^{-\gamma n h_n^{2k}} < \infty, \quad \text{for all } \gamma > 0,$$
where $k = 1, 2$. Let us denote $(\hat{\mathbf{w}}_x, \hat{\mathbf{w}}_y) = \arg\max D^{(N)}_{\alpha,\beta}(\mathbf{w}_x, \mathbf{w}_y)$ and $(\mathbf{w}_x, \mathbf{w}_y) = \arg\max D_{\alpha,\beta}(\mathbf{w}_x, \mathbf{w}_y)$, where $(\alpha, \beta) \in \mathbb{R}^2$. Then, $(\hat{\mathbf{w}}_x, \hat{\mathbf{w}}_y) \to (\mathbf{w}_x, \mathbf{w}_y)$ almost surely as $N \to \infty$.
It should be mentioned that the optimization problem in Equation (12) is non-linear, and the algorithm may get stuck at a local maximum. Therefore, it is often necessary to repeat the algorithm several times with different initial values to obtain an appropriate solution. We use the interior-point algorithm (see Byrd et al., 1999 [20]; Byrd et al., 2000 [21]) to estimate the canonical vectors, $\mathbf{w}_x$ and $\mathbf{w}_y$. A MATLAB program for ABCA can be found in [22].
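A corresponding optimization sketch follows (again ours; SciPy's 'trust-constr' solver stands in for the interior-point method of [20,21], and the random restarts address the local-maxima issue just mentioned):

```python
import numpy as np
from scipy.optimize import minimize, NonlinearConstraint

def abca_first_pair(X, Y, alpha, beta, n_restarts=10, seed=0):
    """Estimate the first AB-canonical pair by maximizing the sample
    divergence d_ab_sample (sketched above) under unit-norm constraints,
    restarting from several random initial values."""
    P, Q = X.shape[1], Y.shape[1]
    rng = np.random.default_rng(seed)

    def neg_obj(w):                                # minimize the negative
        return -d_ab_sample(X @ w[:P], Y @ w[P:], alpha, beta)

    unit = NonlinearConstraint(
        lambda w: [w[:P] @ w[:P], w[P:] @ w[P:]], [1.0, 1.0], [1.0, 1.0])
    best = None
    for _ in range(n_restarts):
        w0 = rng.standard_normal(P + Q)
        w0[:P] /= np.linalg.norm(w0[:P])
        w0[P:] /= np.linalg.norm(w0[P:])
        res = minimize(neg_obj, w0, method='trust-constr', constraints=[unit])
        if best is None or res.fun < best.fun:
            best = res
    return best.x[:P], best.x[P:]
```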
The value of $D_{\alpha,\beta}(\mathbf{w}_x, \mathbf{w}_y)$ is always non-negative, but there is no fixed upper limit valid for all values of α and β. Therefore, it is difficult to interpret the result from the value of the AB-divergence alone. In standard CCA, by contrast, a canonical coefficient close to one signifies a strong relationship. We therefore calculate the maximal correlation (Breiman and Friedman, 1985 [9]) as a measure of dependency. The maximal correlation coefficient between $\mathbf{w}_x^T\mathbf{x}$ and $\mathbf{w}_y^T\mathbf{y}$ is denoted by ρ* and is defined as:
$$\rho^* = \max_{\psi} \mathrm{Corr}\!\left( \mathbf{w}_y^T\mathbf{y}, \psi(\mathbf{w}_x^T\mathbf{x}) \right).$$
We call ρ* the AB-canonical coefficient. It is the maximum possible correlation between $\mathbf{w}_y^T\mathbf{y}$ and any function of $\mathbf{w}_x^T\mathbf{x}$, and its value lies in [0, 1]. We calculate ρ* using the alternating conditional expectation (ACE) algorithm (Breiman and Friedman, 1985 [9]).
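Since the maximizing ψ above is the conditional expectation $E[\mathbf{w}_y^T\mathbf{y} \mid \mathbf{w}_x^T\mathbf{x}]$, a rough plug-in estimate of ρ* can be sketched as follows (a one-step simplification of ours; the paper uses the full ACE algorithm):

```python
import numpy as np

def ab_canonical_coefficient(u, v):
    """Estimate rho* = max_psi Corr(v, psi(u)).  The maximizing psi is
    the conditional mean E[v | u], approximated here by a
    Nadaraya-Watson kernel smoother."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    h = 1.06 * np.std(u) * len(u)**(-0.2)          # Silverman-type bandwidth
    W = np.exp(-0.5 * ((u[:, None] - u[None, :]) / h)**2)
    psi_hat = W @ v / W.sum(axis=1)                # psi_hat(u_n) = E[v | u_n]
    return float(np.corrcoef(v, psi_hat)[0, 1])
```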

6. Sequential Permutation Test

One advantage of ABCA is that a zero AB-canonical coefficient implies that the corresponding AB-canonical variables are independent, regardless of the distributions of $\mathbf{x}$ and $\mathbf{y}$. Therefore, a non-parametric sequential permutation test can be applied to determine the number of significant AB-canonical variables (Yin, 2004 [1]; Efron and Tibshirani, 1993 [23]; Davison and Hinkley, 1997 [24]). On the other hand, the test of significance for standard CCA is very complicated, and it typically relies on the normality assumption (Yin, 2004 [1]).
Let $({}_i\mathbf{w}_x, {}_i\mathbf{w}_y)$ be the i-th pair of AB-canonical vectors. We want to test the following hypothesis:
$${}_iH_0: D_{\alpha,\beta}({}_i\mathbf{w}_x, {}_i\mathbf{w}_y) = 0 \quad \text{vs.} \quad {}_iH_1: D_{\alpha,\beta}({}_i\mathbf{w}_x, {}_i\mathbf{w}_y) > 0.$$
Under ${}_iH_0$, the two canonical variables, ${}_i\mathbf{w}_x^T\mathbf{x}$ and ${}_i\mathbf{w}_y^T\mathbf{y}$, are independent. First, we fix the previously found AB-canonical vectors, $({}_j\mathbf{w}_x, {}_j\mathbf{w}_y)$, $j = 1, 2, \ldots, i-1$. Then, we take a random permutation of the N observations of $\mathbf{x}$, say $\mathbf{x}^*$, and perform ABCA with $\mathbf{x}^*$ and $\mathbf{y}$ using the algorithm described in Section 5. Let us denote the corresponding AB-divergence measure as $D^*_{\alpha,\beta}$.
We repeat this procedure a sufficient number of times (say, R times) and calculate $D^*_{\alpha,\beta}(r)$, the AB-divergence measure for the r-th permutation, $r = 1, 2, \ldots, R$. Let $D_{\gamma}$ be the $(1-\gamma)$-th percentile point of $D^*_{\alpha,\beta}(r)$, $r = 1, 2, \ldots, R$, where γ is the level of significance of the test. Then, we reject the null hypothesis, ${}_iH_0$, if:
$$D^{(N)}_{\alpha,\beta}({}_i\mathbf{w}_x, {}_i\mathbf{w}_y) > D_{\gamma},$$
where $D^{(N)}_{\alpha,\beta}({}_i\mathbf{w}_x, {}_i\mathbf{w}_y)$ is the value actually observed for $D_{\alpha,\beta}({}_i\mathbf{w}_x, {}_i\mathbf{w}_y)$ without permuting the data. If ${}_iH_0$ is rejected, we proceed to the next step and calculate another AB-canonical variable pair.
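The test can be sketched as follows (ours, reusing the abca_first_pair and d_ab_sample sketches from Section 5; for simplicity, only the first pair is tested, and the conditioning on previously found pairs is omitted):

```python
import numpy as np

def permutation_test_first_pair(X, Y, alpha, beta, R=200, gamma=0.05, seed=0):
    """Build the null distribution of the divergence by re-fitting ABCA
    on row-permuted copies of X; reject 1H0 when the observed divergence
    exceeds the (1 - gamma) percentile D_gamma."""
    rng = np.random.default_rng(seed)
    wx, wy = abca_first_pair(X, Y, alpha, beta)
    d_obs = d_ab_sample(X @ wx, Y @ wy, alpha, beta)
    d_null = np.empty(R)
    for r in range(R):
        Xp = X[rng.permutation(len(X))]            # permuted observations x*
        wxp, wyp = abca_first_pair(Xp, Y, alpha, beta)
        d_null[r] = d_ab_sample(Xp @ wxp, Y @ wyp, alpha, beta)
    d_gamma = np.quantile(d_null, 1.0 - gamma)     # critical value D_gamma
    return d_obs > d_gamma, d_obs, d_gamma
```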

7. Extension to Tensors

In this section, we extend the concept of ABCA to the case of tensor data. In many applications, the data structures contain higher-order modes, such as subjects, groups, trials, classes and conditions, together with the intrinsic dimensions of space, time and frequency. Many studies in neuroscience involve recording data over time for multiple subjects (people or animals) under different conditions, leading to experimental data structures conveniently represented by multiway tensors. We generalize the idea of ABCA to extract meaningful components from this type of high-dimensional tensor data.
Tensors are denoted by underlined capital boldface letters, e.g., $\underline{\mathbf{Y}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_Q}$. The order of a tensor is the number of modes, also known as ways or dimensions (e.g., frequency, subjects, trials, classes, groups and conditions). Throughout this section, we use the basic tensor operations proposed in the literature (Kolda and Bader, 2009 [25]; Cichocki et al., 2009 [26]). Specifically, the mode-n multiplication of a tensor, $\underline{\mathbf{Y}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_Q}$, by a vector, $\mathbf{a} \in \mathbb{R}^{I_n}$, is denoted by:
$$\underline{\mathbf{Y}} \,\bar{\times}_n\, \mathbf{a} \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times I_{n+1} \times \cdots \times I_Q},$$
where the $(i_1, i_2, \ldots, i_{n-1}, i_{n+1}, \ldots, i_Q)$-th element is given by:
$$\sum_{i_n=1}^{I_n} y_{i_1, i_2, \ldots, i_Q}\, a_{i_n}.$$
The mode-n multiplication of a tensor, $\underline{\mathbf{Y}} \in \mathbb{R}^{I \times J \times K}$, by the vectors $\mathbf{a} \in \mathbb{R}^I$, $\mathbf{b} \in \mathbb{R}^J$ and $\mathbf{c} \in \mathbb{R}^K$ can be expressed as:
$$\underline{\mathbf{Y}} \,\bar{\times}_1\, \mathbf{a} \,\bar{\times}_2\, \mathbf{b} \,\bar{\times}_3\, \mathbf{c} = \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} y_{ijk}\, a_i b_j c_k.$$
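In code, mode-n multiplication is a single tensor contraction; the following NumPy sketch (ours) mirrors the definitions above:

```python
import numpy as np

def mode_n_product(T, a, n):
    """Mode-n multiplication of tensor T by vector a: sums over the n-th
    index, reducing the order of T by one."""
    return np.tensordot(T, a, axes=([n], [0]))

# Contracting a third-order tensor along all three modes yields the
# scalar triple sum above.  After each contraction the remaining modes
# shift down by one, hence n=0 in every call.
Y = np.random.randn(4, 3, 2)
a, b, c = np.random.randn(4), np.random.randn(3), np.random.randn(2)
u = mode_n_product(mode_n_product(mode_n_product(Y, a, 0), b, 0), c, 0)
assert np.isclose(u, np.einsum('ijk,i,j,k->', Y, a, b, c))
```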
Suppose we have two sets of data from the tensor-valued random variables, $\underline{\mathbf{X}}$ and $\underline{\mathbf{Y}}$: $\{\underline{\mathbf{X}}(n) \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_P}, \underline{\mathbf{Y}}(n) \in \mathbb{R}^{K_1 \times K_2 \times \cdots \times K_Q};\ n = 1, 2, \ldots, N\}$, where N is the sample size. In tensor ABCA, our aim is to find $\mathbf{w}_x^{(1)} \in \mathbb{R}^{I_1}, \mathbf{w}_x^{(2)} \in \mathbb{R}^{I_2}, \ldots, \mathbf{w}_x^{(P)} \in \mathbb{R}^{I_P}$ and $\mathbf{w}_y^{(1)} \in \mathbb{R}^{K_1}, \mathbf{w}_y^{(2)} \in \mathbb{R}^{K_2}, \ldots, \mathbf{w}_y^{(Q)} \in \mathbb{R}^{K_Q}$, such that the AB-divergence between the joint distribution of the canonical variables:
$$u_1 = \underline{\mathbf{X}} \,\bar{\times}_1\, \mathbf{w}_x^{(1)} \,\bar{\times}_2\, \mathbf{w}_x^{(2)} \cdots \bar{\times}_P\, \mathbf{w}_x^{(P)},$$
$$v_1 = \underline{\mathbf{Y}} \,\bar{\times}_1\, \mathbf{w}_y^{(1)} \,\bar{\times}_2\, \mathbf{w}_y^{(2)} \cdots \bar{\times}_Q\, \mathbf{w}_y^{(Q)},$$
and the product of their marginal distributions is maximized. We define:
$$D_{\alpha,\beta}(\mathbf{w}_x^{(1)}, \ldots, \mathbf{w}_x^{(P)}, \mathbf{w}_y^{(1)}, \ldots, \mathbf{w}_y^{(Q)}) = D_{\alpha,\beta}\!\left( f(u_1, v_1) \,\big\|\, f(u_1)\, f(v_1) \right).$$
Here, we find $\mathbf{w}_x^{(1)}, \ldots, \mathbf{w}_x^{(P)}$ and $\mathbf{w}_y^{(1)}, \ldots, \mathbf{w}_y^{(Q)}$ from the optimization problem:
$$\max_{\mathbf{w}_x^{(1)}, \ldots, \mathbf{w}_x^{(P)},\, \mathbf{w}_y^{(1)}, \ldots, \mathbf{w}_y^{(Q)}} D_{\alpha,\beta}(\mathbf{w}_x^{(1)}, \ldots, \mathbf{w}_x^{(P)}, \mathbf{w}_y^{(1)}, \ldots, \mathbf{w}_y^{(Q)})$$
subject to:
$$\mathbf{w}_x^{(p)T}\, \mathbf{w}_x^{(p)} = \mathbf{w}_y^{(q)T}\, \mathbf{w}_y^{(q)} = 1,$$
for $p = 1, 2, \ldots, P$ and $q = 1, 2, \ldots, Q$.
We denote the first set of AB-canonical vectors as $({}_1\mathbf{w}_x^{(1)}, \ldots, {}_1\mathbf{w}_x^{(P)}, {}_1\mathbf{w}_y^{(1)}, \ldots, {}_1\mathbf{w}_y^{(Q)})$. The i-th set of AB-canonical vectors, $({}_i\mathbf{w}_x^{(1)}, \ldots, {}_i\mathbf{w}_x^{(P)}, {}_i\mathbf{w}_y^{(1)}, \ldots, {}_i\mathbf{w}_y^{(Q)})$, is obtained as:
$$\arg\max_{\mathbf{w}_x^{(1)}, \ldots, \mathbf{w}_x^{(P)},\, \mathbf{w}_y^{(1)}, \ldots, \mathbf{w}_y^{(Q)}} D_{\alpha,\beta}\!\left( \mathbf{w}_x^{(1)}, \ldots, \mathbf{w}_x^{(P)}, \mathbf{w}_y^{(1)}, \ldots, \mathbf{w}_y^{(Q)} \right)$$
subject to:
$${}_i\mathbf{w}_x^{(p)T}\, {}_i\mathbf{w}_x^{(p)} = {}_i\mathbf{w}_y^{(q)T}\, {}_i\mathbf{w}_y^{(q)} = 1,$$
$${}_j\mathbf{w}_x^{(p)T}\, {}_i\mathbf{w}_x^{(p)} = {}_j\mathbf{w}_y^{(q)T}\, {}_i\mathbf{w}_y^{(q)} = 0,$$
for all $j = 1, 2, \ldots, i-1$.

8. Sparseness Constraints

Standard CCA has some disadvantages, especially for large-scale and noisy problems. In general, the canonical variables are linear combinations of all the components of $\mathbf{x}$ (or $\mathbf{y}$). This means the canonical variables are dense (not sparse), which often makes the physical interpretation of CCA difficult. In many applications (in genetics, image analysis, etc.), the coordinate axes have a physical interpretation (each axis may correspond to a specific feature), so a sparse canonical variable is more meaningful than a dense one. Recently, several modifications of CCA have been proposed that impose sparseness conditions on the canonical variables; the corresponding method is called sparse canonical correlation analysis (SCCA); see Torres et al. (2007) [27]. The main idea in SCCA is to force the canonical variables to be sparse; however, the sparsity profile should be adjustable or well controlled via some parameters in order to discover specific features in the observed data. In a similar way, we propose sparse AB-canonical analysis.
For sparse AB-canonical analysis, we impose suitable sparsity constraints on the canonical vectors (Witten et al., 2009 [28]; Witten, 2010 [29]). Here, the optimization problem becomes:
$$(\mathbf{w}_x, \mathbf{w}_y) = \arg\max_{\mathbf{w}_x, \mathbf{w}_y} \left\{ D_{\alpha,\beta}(\mathbf{w}_x, \mathbf{w}_y) - \lambda_1 P_1(\mathbf{w}_x) - \lambda_2 P_2(\mathbf{w}_y) \right\}$$
subject to:
$$\mathbf{w}_x^T\mathbf{w}_x = 1, \quad \mathbf{w}_y^T\mathbf{w}_y = 1,$$
where $P_1$ and $P_2$ are (typically convex) penalty functions and $\lambda_1$, $\lambda_2$ are suitably chosen tuning parameters. Some frequently used penalty functions are:
$$P(\mathbf{w}) = \|\mathbf{w}\|_1 = \sum_i |w_i| \quad \text{(LASSO)},$$
$$P(\mathbf{w}) = \|\mathbf{w}\|_0 = \sum_i \mathrm{sign}(|w_i|) \quad \text{(Cardinality Penalty)},$$
$$P(\mathbf{w}) = \sum_i |w_i| + \lambda \sum_i |w_i - w_{i-1}| \quad \text{(Fused LASSO)}.$$
Here, also, we use the interior-point algorithm to estimate the canonical vectors. A MATLAB code can be obtained simply by changing the objective function of the standard ABCA program in [22]. However, if we use the cardinality penalty, then the program needs a slight modification, so that the algorithm searches for a solution in a lower-dimensional subspace. For tensor AB-canonical analysis, the sparseness constraints can be imposed in a similar way (see Allen, 2012 [30]).
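For illustration, the LASSO-penalized objective can be sketched as follows (ours, with d_ab_sample from Section 5; the unit-norm constraints are handled by the optimizer exactly as before):

```python
import numpy as np

def sparse_abca_objective(w, X, Y, alpha, beta, lam1, lam2):
    """Penalized sparse-ABCA objective: the sample AB-divergence minus
    L1 (LASSO) penalties on both canonical vectors."""
    P = X.shape[1]
    wx, wy = w[:P], w[P:]
    d = d_ab_sample(X @ wx, Y @ wy, alpha, beta)
    return d - lam1 * np.abs(wx).sum() - lam2 * np.abs(wy).sum()
```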

9. Simulation Results

The validity and performance of the proposed ABCA are evaluated on simulated data. In the following examples, we generate $\{\mathbf{x}(n), \mathbf{y}(n);\ n = 1, 2, \ldots, N\}$ such that they satisfy a relationship of the form of Equation (1). Note that relations of the following types, for example, are included in the model:
$$b_1 y_1 + b_2 y_2 = (a_0 + a_1 x_1 + a_2 x_2)^2 + \epsilon,$$
$$b_1 y_1 + b_2 y_2 = \sin(a_0 + a_1 x_1 + a_2 x_2) + \epsilon,$$
$$b_1 y_1 + b_2 y_2 = (a_0 + a_1 x_1 + a_2 x_2)^2 + \sin(a_0 + a_1 x_1 + a_2 x_2) + \epsilon,$$
where $\mathbf{x} = (x_1, x_2, x_3)^T$, $\mathbf{y} = (y_1, y_2)^T$, and $b_1, b_2$ and $a_0, a_1, a_2$ are unknown constants. Here, ε is a random error. However, if $a_2 \neq 0$, then the following models are not included in Equation (1):
$$b_1 y_1 + b_2 y_2 = (a_0 + a_1 x_1)^2 + a_2 x_2 + \epsilon,$$
$$b_1 y_1 + b_2 y_2 = \sin(a_0 + a_1 x_1) + a_2 x_2 + \epsilon,$$
$$b_1 y_1 + b_2 y_2 = (a_0 + a_1 x_1)^2 + \sin(a_0 + a_1 x_1 + a_2 x_2) + \epsilon.$$
In the first example, we generate data such that there exists a non-linear relationship between $\mathbf{x}$ and $\mathbf{y}$; ABCA successfully extracts the hidden relationship, whereas standard CCA fails. In the next example, we demonstrate the robustness property of ABCA and compare it with standard CCA. Finally, we give an example where the data sets are tensors.
Figure 1. (a) and (b): The scatter plots of the latent variables. (c) and (d): The scatter plots of the first two AB-canonical variable pairs. It is clearly seen that the non-linear relationship is reconstructed.

9.1. Extraction of Non-linear Relationship

Example 1: The dimensions of $\mathbf{x}$ and $\mathbf{y}$ are taken as six and four, respectively; so, $\mathbf{x} = (x_1, x_2, \ldots, x_6)^T$ and $\mathbf{y} = (y_1, y_2, y_3, y_4)^T$. $\mathbf{x}$ is the explanatory variable, whose components are generated from independent $N(0, 1)$ random variables. $\mathbf{y}$ is the dependent variable, constructed from the following latent variables:
$$y_1^* = \sin(3\, \mathbf{a}_1^T\mathbf{x}) + \epsilon_1,$$
$$y_2^* = (\mathbf{a}_2^T\mathbf{x})^3 - \mathbf{a}_2^T\mathbf{x} + \epsilon_2,$$
where $\epsilon_1$ and $\epsilon_2$ are random errors, with $\epsilon_i \sim 0.05\, N(0, 1)$, $i = 1, 2$. The coefficient vectors, $\mathbf{a}_1$ and $\mathbf{a}_2$, are generated from independent uniform $(-1/2, 1/2)$ random variables and then orthogonalized, so that $\mathbf{a}_1^T\mathbf{a}_2 = 0$. The relationship between $\mathbf{y}$ and the latent variables, $\mathbf{y}^* = (y_1^*, y_2^*)^T$, is assumed to be the linear combination below:
$$y_1 = \mathbf{c}_1^T\mathbf{y}^*, \quad y_2 = \mathbf{c}_2^T\mathbf{y}^*,$$
and $y_3$ and $y_4$ are independent $N(0, 1)$ random variables. The elements of the matrix, $C = (\mathbf{c}_1, \mathbf{c}_2)$, are generated from independent uniform $(-1/2, 1/2)$ random variables, and then their rows are orthogonalized, so that the columns of $C^{-1}$ become orthogonal. We generate a sample of size 100 from $\mathbf{x}$ and $\mathbf{y}$.
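For concreteness, the data-generating mechanism of this example can be sketched in NumPy as follows (our own script; the seed is arbitrary, and the row orthogonalization of C is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)                     # arbitrary seed
N = 100
X = rng.standard_normal((N, 6))                    # x with independent N(0,1) entries

# Coefficient vectors from uniform(-1/2, 1/2), then orthogonalized
a1 = rng.uniform(-0.5, 0.5, 6)
a2 = rng.uniform(-0.5, 0.5, 6)
a2 -= (a1 @ a2) / (a1 @ a1) * a1                   # Gram-Schmidt: a1^T a2 = 0

# Latent variables y1*, y2* with noise 0.05 N(0,1)
y1s = np.sin(3 * (X @ a1)) + 0.05 * rng.standard_normal(N)
y2s = (X @ a2)**3 - X @ a2 + 0.05 * rng.standard_normal(N)

# y1, y2 mix the latent variables through C; y3, y4 are pure noise
C = rng.uniform(-0.5, 0.5, (2, 2))                 # rows c1, c2
Y = np.column_stack([np.column_stack([y1s, y2s]) @ C.T,
                     rng.standard_normal((N, 2))])
```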
The scatter plots of the latent variables are given in (a) and (b) of Figure 1. We perform ABCA on this data set with divergence parameters $\alpha = 0.5$ and $\beta = 0.5$. The first two AB-canonical variable pairs are plotted in (c) and (d) of Figure 1. The values of the first two AB-canonical coefficients are 0.9616 and 0.9301. It is evident that ABCA extracts the latent variables quite accurately. We note that the scale and sign of the canonical vectors cannot be recovered by ABCA. Standard CCA fails to extract them, due to the non-linear relationship with the latent variables. The first two standard canonical variable pairs are plotted in (a) and (b) of Figure 2. The values of the first two canonical coefficients are 0.5704 and 0.3559.
Figure 2. Scatter plots for the first two standard canonical variable pairs. Here, canonical correlation analysis (CCA) fails to reconstruct the non-linear relationship.
Figure 3. (a) Simulated data with outliers inside the red circle. (b) Scatter plot for the AB-canonical variable pair.
Figure 4. (a) Scatter plot for the standard canonical variable pair. (b) Scatter plot for the canonical variable pair using the Yin (2004) [1] approach.

9.2. Robustness Property

Example 2: In this example, we examine the robustness property of ABCA. To compare it with standard CCA, we generate data such that $\mathbf{x}$ and $\mathbf{y}$ have a linear relationship, and then a few outliers are inserted. The dimensions of $\mathbf{x}$ and $\mathbf{y}$ are taken as five and three, respectively. All the components of $\mathbf{x}$ are generated from independent $N(0, 1)$ random variables. For simplicity, we take the relationship between $\mathbf{x}$ and $\mathbf{y}$ as follows:
$$y_1 = 1 + x_1 + \epsilon,$$
where ε is a random error, with $\epsilon \sim 0.05\, N(0, 1)$. Here, $x_1$ and $y_1$ are the first components of $\mathbf{x}$ and $\mathbf{y}$, respectively. The other components of $\mathbf{y}$ are generated from independent $N(0, 1)$ random variables. We generate 90 random samples from this model and add 10 outliers. For the outlying observations, we take $x_1 = 0$ and $y_1 = 10$. Figure 3a shows the original data, with the 10 outlying observations inside the red circle. In Figure 3b, we plot the first AB-canonical variable pair, with the divergence parameters taken as $\alpha = 0.5$ and $\beta = 0.5$. It is seen that ABCA successfully extracts the canonical variables, but Figure 4a shows that standard CCA completely fails. In Figure 4b, we present the scatter plot of the first pair of canonical variables using the approach of Yin (2004) [1]. This approach is based on the Kullback-Leibler divergence, so it is a special case of ABCA with $\alpha = 1$ and $\beta = 0$. The values of the first AB-canonical coefficient for $\alpha = 0.5, \beta = 0.5$ and $\alpha = 1, \beta = 0$ are 0.9121 and 0.7107, respectively. Thus, we can make ABCA robust by choosing suitable tuning parameters.

9.3. Tensor Data

Example 3: In this example, we generate data from the tensor-valued random variables, $\underline{\mathbf{X}}$ and $\underline{\mathbf{Y}}$. The dimensions of $\underline{\mathbf{X}}$ and $\underline{\mathbf{Y}}$ are taken as $(4, 3, 2)$ and $(3, 2, 2)$, respectively. $\underline{\mathbf{X}}$ is the explanatory variable, whose components are generated from independent $N(0, 1)$ random variables. Let us define:
$$u_1 = \underline{\mathbf{X}} \,\bar{\times}_1\, \mathbf{a}_x^{(1)} \,\bar{\times}_2\, \mathbf{a}_x^{(2)} \,\bar{\times}_3\, \mathbf{a}_x^{(3)},$$
$$u_2 = \underline{\mathbf{X}} \,\bar{\times}_1\, \mathbf{b}_x^{(1)} \,\bar{\times}_2\, \mathbf{b}_x^{(2)} \,\bar{\times}_3\, \mathbf{b}_x^{(3)}.$$
The vectors, $\mathbf{a}_x^{(i)}$ and $\mathbf{b}_x^{(i)}$, $i = 1, 2, 3$, are generated from independent uniform $(-1/2, 1/2)$ random variables and then orthogonalized, so that $\mathbf{a}_x^{(i)T}\mathbf{b}_x^{(i)} = 0$, $i = 1, 2, 3$. $\underline{\mathbf{Y}}$ is the dependent variable, constructed from the following latent variables:
$$y_1^* = \cos(10 u_1) + \epsilon_1,$$
$$y_2^* = \frac{2}{100 u_2^2 + 1} + \epsilon_2,$$
where $\epsilon_1$ and $\epsilon_2$ are random errors, with $\epsilon_i \sim 0.05\, N(0, 1)$, $i = 1, 2$. The relationship between $\underline{\mathbf{Y}}$ and the latent variables, $\mathbf{y}^* = (y_1^*, y_2^*)^T$, is assumed to be the linear combination below:
$$y_{1,1,1} = \mathbf{c}_1^T\mathbf{y}^*, \quad y_{2,2,2} = \mathbf{c}_2^T\mathbf{y}^*.$$
All other components of $\underline{\mathbf{Y}}$ are independent $N(0, 1)$ random variables. The elements of the matrix, $C = (\mathbf{c}_1, \mathbf{c}_2)$, are generated in the same way as in Example 1. We generate a sample of size 100 from $\underline{\mathbf{X}}$ and $\underline{\mathbf{Y}}$.
The scatter plots of the latent variables are given in Figure 5a,b. We perform tensor ABCA on this data set with divergence parameters $\alpha = 0.5$ and $\beta = 0.5$. The first two tensor AB-canonical variable pairs are plotted in Figure 5c,d. The values of the first two tensor AB-canonical coefficients are 0.98671 and 0.9712. It is evident that tensor ABCA extracts the latent variables quite accurately.
Figure 5. (a) and (b): The scatter plots of the latent variables. (c) and (d): The scatter plots of the first two tensor AB-canonical variable pairs. It is clearly seen that the non-linear relationship is reconstructed.

9.4. Choice of Divergence Parameters

There is no universal way of selecting the divergence parameters, α and β. They generally control the trade-off between the efficiency and robustness of the procedure. Although they range over the whole two-dimensional plane, the AB-divergence changes very slowly for very large or very small values of the tuning parameters. Therefore, we are usually interested in choosing the parameters in the interval [0, 1]. For $\alpha = 1$ and $\beta = 1$, the AB-divergence turns out to be the $L_2$-distance between two densities. The $L_2$-distance is regarded as a strongly robust divergence in the literature, but its robustness is achieved at some loss of efficiency (Basu et al., 1998 [13]; Scott, 2001 [31]). On the other hand, for $\alpha = 0$ and $\beta = 0$, the AB-divergence becomes the $L_2$-distance between the logarithms of two densities, which may be regarded as non-robust. Therefore, a suitable choice of the parameters is needed to balance robustness and efficiency. In our simulation examples, values of α and β around $(0.5, 0.5)$ seem to be a good choice.

10. Conclusion

We have used the AB-divergence measure to perform canonical correlation analysis. The resulting method can extract a hidden non-linear relationship between two sets of data, whereas standard CCA is designed to find only a linear relationship. Moreover, standard CCA is very non-robust against outlying observations; by choosing suitable tuning parameters, α and β, for the AB-divergence, we can make ABCA robust against outliers. Our method is very general in the sense that it uses the AB-divergence, which is a general measure of discrepancy. Moreover, we have generalized the method to the case of tensor data, and we have also considered sparseness constraints.

Acknowledgements

The authors gratefully acknowledge the comments of the referees, which led to an improved version of the paper.

Conflict of Interest

The authors declare no conflict of interest.

References

  1. Yin, X. Canonical correlation analysis based on information theory. J. Multivar. Anal. 2004, 91, 161–176. [Google Scholar] [CrossRef]
  2. Yin, X.; Sriram, T. Common canonical variates for independent groups using information theory. Stat. Sin. 2008, 18, 335–353. [Google Scholar]
  3. Iaci, R.; Sriram, T.; Yin, X. Multivariate association and dimension reduction: A generalization of canonical correlation analysis. Biometrics 2010, 66, 1107–1118. [Google Scholar] [CrossRef] [PubMed]
  4. Iaci, R.; Sriram, T. Robust multivariate association and dimension reduction using density divergences. J. Multivar. Anal. 2013, 117, 281–295. [Google Scholar] [CrossRef]
  5. Wang, X.; Crowe, M.; Fyfe, C. Dual stream data exploration. Int. J. Data Min. Model. Manag. 2012, 4, 188–202. [Google Scholar] [CrossRef]
  6. Cichocki, A.; Cruces, S.; Amari, S.I. Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. Entropy 2011, 13, 134–170. [Google Scholar] [CrossRef]
  7. Lai, P.L.; Fyfe, C. Kernel and nonlinear canonical correlation analysis. Int. J. Neural Syst. 2000, 10, 365–377. [Google Scholar] [CrossRef] [PubMed]
  8. Shawe-Taylor, J.; Cristianini, N. Kernel Methods for Pattern Analysis; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
  9. Breiman, L.; Friedman, J.H. Estimating optimal transformations for multiple regression and correlation. J. Am. Stat. Assoc. 1985, 80, 580–598. [Google Scholar] [CrossRef]
  10. Hotelling, H. Relations between two sets of variates. Biometrika 1936, 28, 321–377. [Google Scholar] [CrossRef]
  11. Amari, S.I. Integration of stochastic models by minimizing α-divergence. Neural Comput. 2007, 19, 2780–2796. [Google Scholar] [CrossRef] [PubMed]
  12. Minami, M.; Eguchi, S. Robust blind source separation by beta divergence. Neural Comput. 2002, 14, 1859–1886. [Google Scholar] [CrossRef] [PubMed]
  13. Basu, A.; Harris, I.R.; Hjort, N.L.; Jones, M. Robust and efficient estimation by minimising a density power divergence. Biometrika 1998, 85, 549–559. [Google Scholar] [CrossRef]
  14. Cichocki, A.; Zdunek, R.; Amari, S.I. Csiszár’s divergences for non-negative matrix factorization: Family of new algorithms. In Independent Component Analysis and Blind Signal Separation; Springer: Berlin/Heidelberg, Germany, 2006; pp. 32–39. [Google Scholar]
  15. Kompass, R. A generalized divergence measure for nonnegative matrix factorization. Neural Comput. 2007, 19, 780–791. [Google Scholar] [CrossRef] [PubMed]
  16. Févotte, C.; Bertin, N.; Durrieu, J.L. Nonnegative matrix factorization with the Itakura-Saito divergence with application to music analysis. Neural Comput. 2009, 21, 793–830. [Google Scholar] [CrossRef] [PubMed]
  17. Scott, D.W. Multivariate Density Estimation: Theory, Practice, and Visualization; Wiley: New York, NY, USA, 1992; Volume 1. [Google Scholar]
  18. Silverman, B.W. Density Estimation for Statistics and Data Analysis; Chapman & Hall/CRC: London, UK, 1986; Volume 26. [Google Scholar]
  19. Kim, J.S.; Scott, C. Robust kernel density estimation. J. Mach. Learn. Res. 2012, 13, 2529–2565. [Google Scholar]
  20. Byrd, R.H.; Hribar, M.E.; Nocedal, J. An interior point algorithm for large-scale nonlinear programming. SIAM J. Optim. 1999, 9, 877–900. [Google Scholar] [CrossRef]
  21. Byrd, R.H.; Gilbert, J.C.; Nocedal, J. A trust region method based on interior point techniques for nonlinear programming. Math. Program. 2000, 89, 149–185. [Google Scholar] [CrossRef]
  22. MATLAB code of ABCA. Available online: http://www.isical.ac.in/~abhijit_v/ABC.m (accessed on 17 July 2013).
  23. Efron, B.; Tibshirani, R.J. An Introduction to the Bootstrap; Chapman & Hall/CRC: New York, NY, USA, 1993; Volume 57. [Google Scholar]
  24. Davison, A.C.; Hinkley, D.V. Bootstrap Methods and Their Application; Cambridge University Press: Cambridge, UK, 1997; Volume 1. [Google Scholar]
  25. Kolda, T.G.; Bader, B.W. Tensor decompositions and applications. SIAM Rev. 2009, 51, 455–500. [Google Scholar] [CrossRef]
  26. Cichocki, A.; Zdunek, R.; Phan, A.H.; Amari, S.I. Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation; Wiley: Chichester, UK, 2009. [Google Scholar]
  27. Torres, D.A.; Turnbull, D.; Barrington, L.; Lanckriet, G.R. Identifying words that are musically meaningful. In Proceedings of the 8th International Conference on Music Information Retrieval, Vienna, Austria, 23–27 September 2007; Volume 7, pp. 405–410.
  28. Witten, D.M.; Tibshirani, R.; Hastie, T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 2009, 10, 515–534. [Google Scholar] [CrossRef] [PubMed]
  29. Witten, D.M. A penalized matrix decomposition, and its applications. Ph.D. Thesis, Stanford University, Stanford, CA, USA, 2010. [Google Scholar]
  30. Allen, G.I. Sparse higher-order principal components analysis. In Proceedings of 15th International Conference on Artificial Intelligence and Statistics, Canary Islands, Spain, 20–22 April 2012; Volume 22, pp. 27–36.
  31. Scott, D.W. Parametric statistical modeling by minimum integrated square error. Technometrics 2001, 43, 274–285. [Google Scholar] [CrossRef]
