Article

Optimization of Alpha-Beta Log-Det Divergences and their Application in the Spatial Filtering of Two Class Motor Imagery Movements

1
Department of Sensor and Biomedical Technology, School of Electronics Engineering, VIT University, Vellore, Tamil Nadu 632014, India
2
Departamento de Teoría de la Señal y Comunicaciones, Universidad de Sevilla, Camino de los Descubrimientos s/n, Seville 41092, Spain
3
Laboratory for Advanced Brain Signal Processing, Brain Science Institute, RIKEN, 2-1 Hirosawa, Wako, Saitama 351-0198, Japan
4
Systems Research Institute, Polish Academy of Sciences, Warsaw 01-447, Poland
5
Skolkovo Institute of Science and Technology (Skoltech), Moscow 143026, Russia
*
Author to whom correspondence should be addressed.
Entropy 2017, 19(3), 89; https://doi.org/10.3390/e19030089
Submission received: 13 December 2016 / Revised: 7 February 2017 / Accepted: 22 February 2017 / Published: 25 February 2017
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

The Alpha-Beta Log-Det divergences for positive definite matrices are flexible divergences, parameterized by two real constants, that specialize to several relevant classical cases such as the squared Riemannian metric, Stein's loss and the S-divergence. A novel classification criterion based on these divergences is optimized to address the problem of classifying motor imagery movements. The paper addresses this problem in three main steps: (1) first, it is proven that a suitable scaling of the class conditional covariance matrices can be used to link the Common Spatial Pattern (CSP) solution, with a predefined number of spatial filters for each class, to its representation as a divergence optimization problem, by making their different filter selection policies compatible; (2) a closed-form formula for the gradient of the Alpha-Beta Log-Det divergences is derived, which allows one to perform their optimization and eases their use in many practical applications; (3) finally, in the spirit of the work of Samek et al. (2014), which proposed robust spatial filtering of motor imagery movements based on the Beta divergence, the optimization of the Alpha-Beta Log-Det divergences is applied to this problem. The resulting subspace algorithm provides a unified framework for testing the performance and robustness of several divergences in different scenarios.

1. Introduction

Over the last few years, the use of specialized metrics and divergence measures in the design of dimensionality reduction techniques has progressively gained recognition [1,2,3,4,5]. There are numerous real scenarios and applications in which the parameters of interest belong to non-flat manifolds and where Euclidean geometry is unsuitable for evaluating similarities. Indeed, this is the usual case in the comparison of probability density functions, and also of their associated covariance matrices. The present contribution may be seen as a continuation of the work in [6], where we defined the Alpha-Beta Log-Det family of divergences between Symmetric and Positive Definite (SPD) matrices and studied its properties. The Alpha-Beta Log-Det family unifies under the same framework many existing Log-Det divergences and connects them smoothly, through intermediate versions, with the help of two real hyperparameters: α and β. In [7], a recent extension of the Alpha-Beta Log-Determinant divergences to the infinite-dimensional setting was also proposed.
The evaluation of the Alpha-Beta Log-Det divergences depends on the generalized eigenvalues of the compared SPD matrices, which makes their optimization non-trivial. In this paper, we motivate the use of these divergences with the illustrative application of dimensionality reduction in Brain-Computer Interfaces (BCI) and explain how to perform their optimization. Electroencephalogram (EEG) data are typically high-dimensional, have a low signal-to-noise ratio, and may contain artifacts/outliers. Dimensionality reduction is therefore a necessary processing step for extracting those subspaces where the features have the highest discriminative power.
Brain-Computer Interfaces have gained considerable interest in neuroscience and rehabilitation engineering. BCI systems [8,9] enable a person to operate external devices by using brain signals. Motor imagery (MI) based BCI systems are among the most popular: they use the brain signals associated with MI movements as control commands for external devices, without involving the peripheral nervous system. During the imagination process, an alteration in the rhythmic activity of the brain can be observed in the mu and β rhythms at the corresponding area of the sensory-motor cortex. This phenomenon is known as event-related synchronization (ERS) or event-related desynchronization (ERD) [10]. MI-based BCI systems use these activities as control commands. Such a system can potentially serve as a communication aid for people suffering from amyotrophic lateral sclerosis, multiple sclerosis, or complete locked-in syndrome.
One of the most popular and efficient algorithms used for MI-based BCI applications is the common spatial pattern (CSP) algorithm [11]. It was first used to detect abnormalities in EEG signals [12] and was later introduced in BCI applications [13]. The main objective of CSP is to obtain spatial filters that maximize the variance of one class while simultaneously minimizing the variance of the other class. It has been reported that this algorithm provides excellent classification accuracy for MI-based BCI systems. However, despite being the most popular method, its performance is easily affected by the presence of artifacts and nonstationarities. Since the computation of the spatial filters mainly depends on the covariance matrices, artifacts such as eye blinks, eye movements and improper placement of the electrodes lead to poorly estimated covariance matrices, which in turn degrade the classification performance.
The main contributions of this work are the following:
  • The existing link between the CSP method and the symmetric KL divergence (see [1]) is extended to the case of the minimax optimization of the AB log-det divergences. In the absence of regularization, their solutions are shown to be equivalent whenever these methods apply the same divergence-based criterion for choosing the spatial filters. Although, in general, this is not the case when the CSP method adopts the popular practical criterion of fixing a priori the number of spatial filters for each class, we show that the equivalence with the solution of the optimization of AB log-det divergences can still be preserved if a suitable scaling factor κ is used in one of the arguments of the divergence.
  • The details on how to perform the optimization of the AB log-det divergence are presented. The explicit expression of the gradient of this divergence with respect to the spatial filters is obtained. This expression generalizes and extends the gradients of several well-known divergences, for instance, the gradient of the Alpha–Gamma divergence and the gradient of the Kullback–Leibler divergence between SPD matrices.
  • The robustness of the AB log-det divergence with respect to outliers is analyzed. The study reveals that the hyperparameters of the divergence can be chosen to underweight or overweight, at convenience, the influence of the larger and smaller generalized eigenvalues in the estimating equations.
  • Motivated by the success of criteria based on the Beta divergence [1] in the robust spatial filtering of the motor imagery movements, in this work, we consider the use of a criterion based on AB log-det divergences for the same purpose. A subspace optimization algorithm based on regularized AB log-det divergences is proposed for obtaining the relevant subset of spatial filters. Some exemplary simulations illustrate its robustness over synthetic and real datasets.
This article is organized as follows: Section 2 presents the fundamental model of the observations and the paper notation. Section 3 reviews the CSP algorithm, while Section 4 discusses CSP via divergence optimization. In Section 5, we present the family of AB log-det divergences and provide a new upper bound and conditions for the equivalence between this divergence optimization and the robust CSP solution. Section 6 explains how to obtain closed-form formulas for computing the gradient of the AB log-det divergence, which is useful for its optimization. The analysis of the robustness of the divergence in terms of its hyperparameters is the objective of Section 7. Section 8 briefly reviews several related techniques, while Section 9 presents the regularized version of the criterion based on AB log-det divergences, as well as the subspace algorithm that optimizes it. Section 10 presents the experimental datasets and the steps involved in preprocessing, feature extraction and classification. The results of the simulations are presented and discussed in Section 11. Finally, Section 12 summarizes the main results.

2. Notation and Model of the Measurements

Throughout this paper, the following notations are adopted. Vectors are typically denoted by bold letters, capital bold letters are reserved for matrices, while random variables appear in italic capital letters. The operators ⌊·⌋ and ⌈·⌉ round the value of their argument to the nearest lower and higher integer, respectively. All the covariance matrices, which are denoted by Cov(·), are assumed to be positive definite and hence invertible.
Let us now describe the statistical model of the observations. As usual, the raw EEG observations are initially preprocessed by a bandpass filter that retains the activity in the bands of the mu and β rhythms, and are later normalized for each trial so as to keep their total spatial power constant. One can define a statistical model of these "normalized" observations x(t) = [x_1(t), …, x_n(t)]^T conditioned on the true imagery movement, which here is represented by a member of the class c ∈ {c_1, c_2}. In general, the EEG observations are noisy and high-dimensional, while the number of recorded trials is quite limited. Therefore, the learning of the discriminative features is quite sensitive to overfitting, a situation that would severely degrade the prediction accuracy over test samples. In this case, it is worth sacrificing some bias by choosing a simpler (less complex) model whose parameters can be estimated with a smaller variance. For this reason, we adopt the usual convention [14] of considering the observations from each class as drawn from independent and identically distributed (i.i.d.) Gaussian random vectors X|c of zero mean and with covariance matrix Cov(X|c), which in turn is set equal to the sample covariance matrix of the class, i.e.,
C o v ( X | c ) = C o v ( x | c ) for c { c 1 , c 2 } .
The observations are then modeled by the mixture distribution
p ( x ) = p ( c 1 ) p ( x | c 1 ) + p ( c 2 ) p ( x | c 2 ) ,
where p ( c ) refers to the sample probabilities of each class in the training data. When x ¯ denotes the sample mean of the observations, their sample covariance matrix is obtained by
$$\mathrm{Cov}(\mathbf{x}) = \frac{1}{T}\sum_{t=1}^{T}\big(\mathbf{x}(t)-\bar{\mathbf{x}}\big)\big(\mathbf{x}(t)-\bar{\mathbf{x}}\big)^{T} \tag{3}$$
$$\phantom{\mathrm{Cov}(\mathbf{x})} = p(c_1)\,\mathrm{Cov}(\mathbf{x}\,|\,c_1)+p(c_2)\,\mathrm{Cov}(\mathbf{x}\,|\,c_2), \tag{4}$$
and its eigenvalue decomposition is
C o v ( x ) = U 1 Δ U 1 T
where Δ and U 1 , respectively denote the matrix of eigenvalues and eigenvectors of C o v ( x ) .
We define w_i = [w_{1i}, w_{2i}, …, w_{ni}]^T as the vector with the coefficients of the i-th spatial filter, for i = 1, …, p. The collection of p spatial filters forms the overall filter matrix W = [w_1, w_2, …, w_p], which is used to reduce the dimensionality of the observations by projecting them onto the p-dimensional subspace spanned by the filter outputs
y = W T x R p ,
where p n . The model for the estimated conditional distribution p ( y | c ) is a multidimensional Gaussian of zero mean and covariance matrix C o v ( Y | c ) = W T C o v ( x | c ) W , i.e., for each class
Y | c N ( 0 , W T C o v ( x | c ) W ) .
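To make the model of this section concrete, the following minimal Python sketch instantiates the two-class, zero-mean Gaussian observation model and the projected class covariances Cov(Y|c) = W^T Cov(x|c) W. All variable names and the synthetic ground-truth covariances are illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, T = 8, 2, 1000                      # channels, spatial filters, samples per class

# Hypothetical ground-truth class covariances (SPD by construction)
A1, A2 = rng.standard_normal((n, n)), rng.standard_normal((n, n))
cov_c1_true = A1 @ A1.T + n * np.eye(n)
cov_c2_true = A2 @ A2.T + n * np.eye(n)

# Zero-mean Gaussian observations for each class, X | c
x_c1 = rng.multivariate_normal(np.zeros(n), cov_c1_true, size=T)
x_c2 = rng.multivariate_normal(np.zeros(n), cov_c2_true, size=T)

cov_c1 = np.cov(x_c1, rowvar=False)       # sample estimate of Cov(x | c1)
cov_c2 = np.cov(x_c2, rowvar=False)       # sample estimate of Cov(x | c2)

W = rng.standard_normal((n, p))           # some spatial filter matrix (p << n)
cov_y_c1 = W.T @ cov_c1 @ W               # Cov(Y | c1) = W^T Cov(x|c1) W
cov_y_c2 = W.T @ cov_c2 @ W               # Cov(Y | c2) = W^T Cov(x|c2) W
```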

3. The Common Spatial Patterns Algorithm

The development of the CSP algorithm as a technique for feature selection in classification problems can be traced back to the work of [11], while later [12,13] considered its practical application to the study of EEG signals. This technique exploits the event-related desynchronization during the limb movement imagination process, which alters the rhythmic activity in a class-dependent area of the motor cortex. The objective of the algorithm is to obtain the set of most discriminative spatial filters, i.e., those that hierarchically maximize the output activity of one class while, at the same time, minimizing the activity of the other class. Since only the direction of the spatial filters (i.e., not their scale) is of interest, the technique starts with a linear transformation y = W^T x that whitens the sample covariance of the outputs
$$\mathrm{Cov}(\mathbf{y}) = p(c_1)\,\mathrm{Cov}(\mathbf{y}|c_1) + p(c_2)\,\mathrm{Cov}(\mathbf{y}|c_2) \tag{8}$$
$$= \mathbf{W}^T\,\mathrm{Cov}(\mathbf{x})\,\mathbf{W} \tag{9}$$
$$= \mathbf{I}_p. \tag{10}$$
With the help of the eigenvalue decomposition of C o v ( x ) , the general expression of the spatial filter matrix that preserves the whitening constraint can be found as
$$\mathbf{W}^T = \boldsymbol{\Omega}^T\,\boldsymbol{\Delta}^{-\frac12}\,\mathbf{U}_1^T. \tag{11}$$
Note that W R n × p is specified up to the ambiguity in the choice of the semi-orthogonal matrix Ω R n × p (i.e., Ω T Ω = I p ) which parameterizes the relevant degrees of freedom for finding the most discriminative directions. Then, the objective of the CSP criterion [11] is implemented by first choosing one part of the spatial filters from the constrained maximization of the conditional covariances of the outputs of the first class
w i = arg max w w T C o v ( x | c 1 ) w i = 1 , , k ,
and later choosing the other part of the filters to hierarchically maximize the conditional covariances of the outputs of the second class
w i = arg max w w T C o v ( x | c 2 ) w i = k + 1 , , p ,
where, in both cases, the maximization with respect to the spatial filters takes place under the whitening or ( C o v ( x ) -orthonormality) constraints
w i T C o v ( x ) w j = δ i j j i .
The number of spatial filters k that hierarchically maximize (12) can be determined by a chosen filter selection policy. For simplicity, in most cases k is set close to p/2 with the aim of balancing the number of spatial filters devoted to each of the classes.
The maximization in (12) can be alternatively posed as the constrained optimization of the quotient
w i = arg max w w T C o v ( x | c ) w w T C o v ( x ) w subject to w T C o v ( x ) w = δ i j j i
which, in terms of the transformed and normalized spatial vectors
r i = ( C o v ( x ) ) 1 2 w i ( C o v ( x ) ) 1 2 w i 2 ,
is rewritten as a quadratic optimization under orthogonality constraints
w i = ( C o v ( x ) ) 1 2 × arg max r r T ( C o v ( x ) ) 1 2 C o v ( x | c ) ( C o v ( x ) ) 1 2 r s . t . r T r j = δ i j j i
At this point, the straightforward application of the Courant–Fisher–Weyl minimax principle ([15], p. 58) yields the variational description of the desired spatial filters as the minimax solution of the Rayleigh quotients for each class
$$\mathbf{w}_i = \big(\mathrm{Cov}(\mathbf{x})\big)^{-\frac12}\times\arg\min_{\dim\{\mathcal{R}\}=n-i+1}\ \max_{\substack{\|\mathbf{r}\|=1 \\ \mathbf{r}\in\mathcal{R}}}\ \mathbf{r}^T\big(\mathrm{Cov}(\mathbf{x})\big)^{-\frac12}\,\mathrm{Cov}(\mathbf{x}|c)\,\big(\mathrm{Cov}(\mathbf{x})\big)^{-\frac12}\,\mathbf{r} \tag{18}$$
$$= \arg\min_{\dim\{\mathcal{W}\}=n-i+1}\ \max_{\mathbf{w}\in\mathcal{W}}\ \frac{\mathbf{w}^T\,\mathrm{Cov}(\mathbf{x}|c)\,\mathbf{w}}{\mathbf{w}^T\,\mathrm{Cov}(\mathbf{x})\,\mathbf{w}} \tag{19}$$
$$= \arg\min_{\dim\{\mathcal{W}\}=n-i+1}\ \max_{\mathbf{w}\in\mathcal{W}}\ \frac{\mathrm{Cov}(y_i|c)}{\mathrm{Cov}(y_i)}. \tag{20}$$
By the same principle, the generalized eigenvectors v_i^{(c)} of the matrix pencil (p(c) Cov(y|c), Cov(y)) are the minimax solutions of the Rayleigh quotient, while the values that the criterion takes at these solutions are the generalized eigenvalues
λ i ( c ) = p ( c ) v i ( c ) T C o v ( x | c ) v i ( c ) v i ( c ) T C o v ( x ) v i ( c ) = p ( c ) min dim { W } = i max w W w T C o v ( x | c ) w w T C o v ( x ) w ,
which are sorted according to the descent in their magnitude, λ 1 ( c ) λ 2 ( c ) λ n ( c ) .
The generalized eigenvectors of the two quotients (one for each class) coincide, except for their ordering, which is reversed [11], i.e., v_i^{(c_1)} = v_{n−i+1}^{(c_2)}, while the weighted sum of the generalized eigenvalues satisfies
$$\lambda_i^{(c_1)} + \lambda_{n-i+1}^{(c_2)} = \frac{\mathbf{v}_i^{(c_1)T}\big[p(c_1)\,\mathrm{Cov}(\mathbf{x}|c_1)+p(c_2)\,\mathrm{Cov}(\mathbf{x}|c_2)\big]\mathbf{v}_i^{(c_1)}}{\mathbf{v}_i^{(c_1)T}\,\mathrm{Cov}(\mathbf{x})\,\mathbf{v}_i^{(c_1)}} = 1. \tag{22}$$
Therefore, a direction of maximum variance for one class will simultaneously minimize the variance of the other class, and vice versa. Hence, the standard CSP solution is obtained when the spatial filters match with the principal and minor eigenvectors of the generalized eigendecomposition problem [11,12,13]
C o v ( x | c 1 ) v i ( c 1 ) = λ i ( c 1 ) C o v ( x ) v i ( c 1 ) i = 1 , , n .
After sorting the eigenvalues according to their magnitude, CSP explicitly selects k spatial filters v_i^{(c_1)} from the principal eigenvectors and p − k spatial filters from the minor eigenvectors, to form the spatial filter matrix
$$\mathbf{W}_{CSP} \equiv [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_p] \tag{24}$$
$$= \big[\mathbf{v}_1^{(c_1)},\ldots,\mathbf{v}_k^{(c_1)},\ \mathbf{v}_{n-(p-k)+1}^{(c_1)},\ldots,\mathbf{v}_n^{(c_1)}\big]. \tag{25}$$
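The CSP solution in (23)–(25) can be obtained from a single generalized eigendecomposition. The following Python sketch illustrates one possible implementation; the function name, its arguments and the balanced default k = ⌊p/2⌋ are assumptions for illustration only, not the authors' code.

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(cov_c1, cov_c2, p_c1, p_c2, p, k=None):
    """Sketch of the standard CSP solution: generalized eigenvectors of the
    pencil (Cov(x|c1), Cov(x)), keeping the k principal and p-k minor ones."""
    if k is None:
        k = p // 2                                   # balanced filter selection policy
    cov_x = p_c1 * cov_c1 + p_c2 * cov_c2            # Cov(x), Equation (4)
    # eigh solves Cov(x|c1) v = lambda Cov(x) v, eigenvalues in ascending order
    lam, V = eigh(cov_c1, cov_x)
    lam, V = lam[::-1], V[:, ::-1]                   # sort descending
    idx = list(range(k)) + list(range(V.shape[1] - (p - k), V.shape[1]))
    return V[:, idx], lam                            # W_CSP and all generalized eigenvalues
```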

4. The Divergence Optimization Interpretation of CSP

Under the appropriate selection policy for the number of spatial filters for each class, the solution obtained by the CSP algorithm admits an interpretation in terms of the optimization of divergence measures (here denoted by Div(·‖·)) between the Gaussian output pdfs for each class
w i = arg min dim { W } = n i + 1 max w W D i v ( p ( y i | c 1 ) p ( y i | c 2 ) ) ,
except for a possible permutation in the ordering of some of the spatial filters.
The problem can be formulated using the following optimization problem
w i = arg min dim { W } = n i + 1 max w W D ( C o v ( y i | c 1 ) C o v ( y i | c 2 ) )
where D ( · · ) refers to a divergence between the covariances of the conditional densities of the outputs. As a consequence of the assumption of zero mean Gaussian densities, the covariances are the only necessary statistics that summarize all the relevant information of the conditional data.
In particular, the solution of the CSP algorithm was linked in [1,11,16,17] with the optimization of the symmetric Kullback–Leibler divergence (sKL)
$$\mathrm{Div}_{sKL}\big(p(y_i|c_1)\,\|\,p(y_i|c_2)\big) = \int p(y_i|c_1)\log\frac{p(y_i|c_1)}{p(y_i|c_2)}\,dy_i + \int p(y_i|c_2)\log\frac{p(y_i|c_2)}{p(y_i|c_1)}\,dy_i \tag{28}$$
$$= \int \big(p(y_i|c_1)-p(y_i|c_2)\big)\log\frac{p(y_i|c_1)}{p(y_i|c_2)}\,dy_i. \tag{29}$$
For zero-mean Gaussian densities, this divergence simplifies to the symmetric Kullback–Leibler (sKL) divergence between the class conditional covariances
$$\mathrm{Div}_{sKL}\big(p(y_i|c_1)\,\|\,p(y_i|c_2)\big) = \frac12\,\frac{\mathrm{Cov}(y_i|c_1)}{\mathrm{Cov}(y_i|c_2)} + \frac12\,\frac{\mathrm{Cov}(y_i|c_2)}{\mathrm{Cov}(y_i|c_1)} - 1 \tag{30}$$
$$\equiv D_{sKL}\big(\mathrm{Cov}(y_i|c_1)\,\|\,\mathrm{Cov}(y_i|c_2)\big). \tag{31}$$
In this paper, we propose an extension of the existing KL to the criterion of the Alpha-Beta log-det divergence (AB log-det) between the class-conditional covariances defined as [6]
$$D_{AB}^{(\alpha,\beta)}\big(\mathrm{Cov}(y_i|c_1)\,\|\,\mathrm{Cov}(y_i|c_2)\big) = \frac{1}{\alpha\beta}\log\left|\frac{\alpha\left(\dfrac{\mathrm{Cov}(y_i|c_1)}{\mathrm{Cov}(y_i|c_2)}\right)^{\beta} + \beta\left(\dfrac{\mathrm{Cov}(y_i|c_2)}{\mathrm{Cov}(y_i|c_1)}\right)^{\alpha}}{\alpha+\beta}\right|_{+} \quad \text{for } \alpha\neq 0,\ \beta\neq 0,\ \alpha+\beta\neq 0, \tag{32}$$
where
$$|x|_{+} = \begin{cases} x, & x \geq 0, \\ 0, & x < 0, \end{cases} \tag{33}$$
denotes the non-negative truncation operator. When the covariance arguments are scalars and α, β > 0, the AB log-det divergence can also be rewritten as the logarithmic ratio between the weighted arithmetic mean of the scaled covariances (Cov^{α+β}(y_i|c_1), Cov^{α+β}(y_i|c_2)) and their weighted geometric mean, i.e.,
$$D_{AB}^{(\alpha,\beta)}\big(\mathrm{Cov}(y_i|c_1)\,\|\,\mathrm{Cov}(y_i|c_2)\big) = \frac{1}{\alpha\beta}\log\frac{\frac{\alpha}{\alpha+\beta}\,\mathrm{Cov}^{\alpha+\beta}(y_i|c_2) + \frac{\beta}{\alpha+\beta}\,\mathrm{Cov}^{\alpha+\beta}(y_i|c_1)}{\big(\mathrm{Cov}^{\alpha+\beta}(y_i|c_2)\big)^{\frac{\alpha}{\alpha+\beta}}\,\big(\mathrm{Cov}^{\alpha+\beta}(y_i|c_1)\big)^{\frac{\beta}{\alpha+\beta}}}. \tag{34}$$
Additionally if α + β = 1 , the AB log-det divergence between covariances is proportional to the Alpha–Gamma divergence [18] between the conditional densities
$$D_{AB}^{(\alpha,\beta)}\big(\mathrm{Cov}(y_i|c_1)\,\|\,\mathrm{Cov}(y_i|c_2)\big) \equiv 2\,\mathrm{Div}_{AG}^{(\beta,\alpha)}\big(p(y_i|c_1)\,\|\,p(y_i|c_2)\big) = \frac{2}{\alpha\beta}\log\frac{\big(\int_{\Omega}p(y_i|c_1)\,dy_i\big)^{\beta}\big(\int_{\Omega}p(y_i|c_2)\,dy_i\big)^{\alpha}}{\int_{\Omega}p^{\beta}(y_i|c_1)\,p^{\alpha}(y_i|c_2)\,dy_i} \quad \text{for } \alpha>0,\ \beta>0,\ \alpha+\beta=1. \tag{35}$$
In Section 5.3, it is proven that, under certain conditions, the plain optimization of an AB log-det divergence also leads to the solution of the CSP algorithm. However, the potential of these divergences does not rely on their plain optimization but rather on their optimization in the presence of regularization terms that help to specify the desired solutions.
Recently, several divergence criteria have been proposed for the extraction of the spatial dimensions with maximum discriminative power. Among these, the multiclass approach based on the maximization of the harmonic mean of Kullback–Leibler divergences [16] and the regularization framework based on the beta divergences [1,17] are the most noteworthy methods. Another approach based on Bhattacharyya distance and Gamma divergence has also been proposed for classification of motor imagery movements [19]. Our proposal is motivated by the success of these methods in improving the classification accuracy and the robustness against the outliers. The distinctive property of the AB log-det divergence is that it smoothly connects (through its hyperparameters) a quite broad family of log-det divergences for SPD matrices, covering several relevant classical cases like: the KL divergence, the dual KL divergence, the Beta log-det family, the Alpha log-det family, the Power log-det family, as well as the Affine Invariant Riemannian divergence.

5. The Definition of the AB Log-Det Divergence

Henceforth, we will work on the multidimensional observation vectors x = [ x 1 , , x n ] T R n . In order to simplify the notation, the covariance matrices of the two classes are renamed as follows
P C o v ( x | c 1 ) ,
Q C o v ( x | c 2 ) .
The AB log-det divergence is a directed divergence that evaluates the dissimilarity between two multidimensional covariance matrices. It was defined in [6] as
$$D_{AB}^{(\alpha,\beta)}(\mathbf{P}\,\|\,\mathbf{Q}) = \frac{1}{\alpha\beta}\log\left|\frac{\alpha\big(\mathbf{Q}^{-\frac12}\mathbf{P}\mathbf{Q}^{-\frac12}\big)^{\beta} + \beta\big(\mathbf{Q}^{-\frac12}\mathbf{P}\mathbf{Q}^{-\frac12}\big)^{-\alpha}}{\alpha+\beta}\right|_{+} \quad \text{for } \alpha\neq 0,\ \beta\neq 0,\ \alpha+\beta\neq 0, \tag{38}$$
while, for the singular cases, its definition is given by
$$D_{AB}^{(\alpha,\beta)}(\mathbf{P}\,\|\,\mathbf{Q}) = \begin{cases} \dfrac{1}{\alpha^2}\Big[\mathrm{tr}\big((\mathbf{Q}^{\frac12}\mathbf{P}^{-1}\mathbf{Q}^{\frac12})^{\alpha} - \mathbf{I}\big) - \alpha\log\big|\mathbf{Q}^{\frac12}\mathbf{P}^{-1}\mathbf{Q}^{\frac12}\big|\Big] & \text{for } \alpha\neq 0,\ \beta = 0, \\[2ex] \dfrac{1}{\beta^2}\Big[\mathrm{tr}\big((\mathbf{Q}^{-\frac12}\mathbf{P}\mathbf{Q}^{-\frac12})^{\beta} - \mathbf{I}\big) - \beta\log\big|\mathbf{Q}^{-\frac12}\mathbf{P}\mathbf{Q}^{-\frac12}\big|\Big] & \text{for } \alpha = 0,\ \beta\neq 0, \\[2ex] \dfrac{1}{\alpha^2}\log\Big|\big(\mathbf{Q}^{-\frac12}\mathbf{P}\mathbf{Q}^{-\frac12}\big)^{\alpha}\big(\mathbf{I} + \log(\mathbf{Q}^{-\frac12}\mathbf{P}\mathbf{Q}^{-\frac12})^{\alpha}\big)^{-1}\Big|_{+} & \text{for } \alpha = -\beta\neq 0, \\[2ex] \dfrac{1}{2}\,\big\|\log\big(\mathbf{Q}^{\frac12}\mathbf{P}^{-1}\mathbf{Q}^{\frac12}\big)\big\|_F^2 & \text{for } \alpha,\beta = 0. \end{cases} \tag{39}$$
The divergence depends only on the eigenvalues Λ = diag(λ_1, …, λ_n) of the Symmetric Positive Definite (SPD) matrix Q^{-1/2} P Q^{-1/2}, which also coincide with the eigenvalues of the matrix Q^{-1} P, although their eigenspaces differ. Consider the eigenvalue decomposition
Q 1 2 P Q 1 2 = V 1 Λ V 1 T ,
where V 1 is the orthogonal matrix of eigenvectors, and Λ = diag { λ 1 , λ 2 , , λ n } is the diagonal matrix with positive eigenvalues λ i > 0 , i = 1 , 2 , , n . One of the properties of the AB log-det divergence is that it is invariant under a common change of basis on its matrix arguments, i.e., an invertible congruence transformation. Since, with the help of this specific transformation, we have
P ( V 1 T Q 1 2 ) P ( V 1 T Q 1 2 ) T = Λ ,
Q ( V 1 T Q 1 2 ) Q ( V 1 T Q 1 2 ) T = I ,
it can be inferred that the divergence is separable (over the generalized eigenvalues of the matrix pencil (P, Q)) into a sum of marginal divergences that measure how far each of the generalized eigenvalues is from unity, i.e.,
$$D_{AB}^{(\alpha,\beta)}(\mathbf{P}\,\|\,\mathbf{Q}) = D_{AB}^{(\alpha,\beta)}\big(\boldsymbol{\Lambda}\,\|\,\mathbf{I}_n\big) = \sum_{i=1}^{n} D_{AB}^{(\alpha,\beta)}(\lambda_i\,\|\,1). \tag{43}$$
Hence,
$$D_{AB}^{(\alpha,\beta)}(\mathbf{P}\,\|\,\mathbf{Q}) = \frac{1}{\alpha\beta}\sum_{i=1}^{n}\log\left|\frac{\alpha\lambda_i^{\beta} + \beta\lambda_i^{-\alpha}}{\alpha+\beta}\right|_{+}, \quad \alpha,\beta,\alpha+\beta\neq 0. \tag{44}$$
Similarly, for the singular cases, the divergence is
$$D_{AB}^{(\alpha,\beta)}(\mathbf{P}\,\|\,\mathbf{Q}) = \begin{cases} \dfrac{1}{\alpha^2}\Big[\sum_{i=1}^{n}\big(\lambda_i^{-\alpha} - \log\lambda_i^{-\alpha}\big) - n\Big] & \text{for } \alpha\neq 0,\ \beta = 0, \\[2ex] \dfrac{1}{\beta^2}\Big[\sum_{i=1}^{n}\big(\lambda_i^{\beta} - \log\lambda_i^{\beta}\big) - n\Big] & \text{for } \alpha = 0,\ \beta\neq 0, \\[2ex] \dfrac{1}{\alpha^2}\sum_{i=1}^{n}\log\left|\dfrac{\lambda_i^{\alpha}}{1+\log\lambda_i^{\alpha}}\right|_{+} & \text{for } \alpha = -\beta\neq 0, \\[2ex] \dfrac{1}{2}\sum_{i=1}^{n}\log^2(\lambda_i) & \text{for } \alpha,\beta = 0. \end{cases} \tag{45}$$
This divergence compares two symmetric positive definite matrices and returns their dissimilarity, i.e., a positive value when they are non-coincident, and D_AB^{(α,β)}(P‖Q) = 0 if and only if P = Q. As can be observed in Figure 1, the AB log-det divergence generalizes several existing log-det matrix divergences, such as Stein's loss, the S-divergence, the Alpha and Beta log-det families of divergences and the geodesic distance between covariance matrices (the squared Riemannian metric), among others (see Table 1 in [6] for a comprehensive list).
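Since the divergence is separable over the generalized eigenvalues of the pencil (P, Q), it can be evaluated numerically from those eigenvalues alone. The sketch below follows Equations (44)–(45) under that assumption; the function name and the use of SciPy's generalized eigensolver are illustrative choices, not the authors' implementation.

```python
import numpy as np
from scipy.linalg import eigh

def ab_logdet_div(P, Q, alpha, beta):
    """Sketch of D_AB^(alpha,beta)(P || Q) through the generalized eigenvalues
    of the pencil (P, Q), following Equations (44)-(45)."""
    lam = eigh(P, Q, eigvals_only=True)     # generalized eigenvalues lambda_i > 0
    a, b = float(alpha), float(beta)
    if a == 0.0 and b == 0.0:               # squared Riemannian (geodesic) metric
        return 0.5 * np.sum(np.log(lam) ** 2)
    if a == -b:                             # alpha = -beta != 0
        arg = lam**a / (1.0 + np.log(lam**a))
        return np.sum(np.log(np.maximum(arg, 0.0))) / a**2   # |.|_+ truncation
    if b == 0.0:                            # KL-type case, alpha != 0
        return (np.sum(lam**(-a) - np.log(lam**(-a))) - lam.size) / a**2
    if a == 0.0:                            # dual KL-type case, beta != 0
        return (np.sum(lam**b - np.log(lam**b)) - lam.size) / b**2
    arg = (a * lam**b + b * lam**(-a)) / (a + b)              # general case, Eq. (44)
    return np.sum(np.log(np.maximum(arg, 0.0))) / (a * b)
```

For α = β = 0 the first branch returns half the squared Frobenius norm of the logarithm of the generalized eigenvalues, i.e., the squared Riemannian metric mentioned above.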

5.1. A Tight Upper-Bound for the AB Log-Det Divergences

The divergence D A B ( α , β ) ( P Q ) depends on the generalized eigenvalues λ 1 , , λ n of the matrix pencil ( P , Q ) which, for convenience, are assumed to have a simple spectrum (the eigenvalues are unique or non-coincident) and can be sorted in descending order
λ 1 > λ 2 > > λ n > 0 .
In practice, the assumption is plausible because the real symmetric matrices with unique eigenvalues are known to form an open dense set in the space of all the real symmetric matrices [20].
Although the space of the observations is high-dimensional, most of the discriminative information between the two conditions is confined to a low-dimensional subspace. Thus, the spatial filter matrix W ∈ R^{n×p} is used to reduce the dimensionality of the samples from n to p with the linear compression transformation y = W^T x ∈ R^p. It is shown in [6] that, after applying this compression to the arguments of the divergence, the resulting output covariance matrices W^T P W and W^T Q W are more similar than in the original space, as shown in the equation below
$$D_{AB}^{(\alpha,\beta)}\big(\mathbf{W}^T\mathbf{P}\mathbf{W}\,\|\,\mathbf{W}^T\mathbf{Q}\mathbf{W}\big) = \sum_{i=1}^{p} D_{AB}^{(\alpha,\beta)}(\mu_i\,\|\,1) \leq \sum_{i=1}^{n} D_{AB}^{(\alpha,\beta)}(\lambda_i\,\|\,1) = D_{AB}^{(\alpha,\beta)}(\mathbf{P}\,\|\,\mathbf{Q}), \tag{47}$$
where μ_1 ≥ ⋯ ≥ μ_p > 0 are the generalized eigenvalues of the matrix pencil (W^T P W, W^T Q W). However, this upper bound is loose for the case of interest (dimensionality reduction), i.e., when p < n. In Appendix A.1, it is shown how the previous upper bound can be tightened with the following new proposal
$$D_{AB}^{(\alpha,\beta)}\big(\mathbf{W}^T\mathbf{P}\mathbf{W}\,\|\,\mathbf{W}^T\mathbf{Q}\mathbf{W}\big) \leq \sum_{i=1}^{p} D_{AB}^{(\alpha,\beta)}(\lambda_{\pi_i}\,\|\,1), \tag{48}$$
where π defines the permutation of the indices 1 , , n that sorts the divergence of the eigenvalues from the unity in descending order
$$D_{AB}^{(\alpha,\beta)}(\lambda_{\pi_1}\,\|\,1) \geq D_{AB}^{(\alpha,\beta)}(\lambda_{\pi_2}\,\|\,1) \geq \cdots \geq D_{AB}^{(\alpha,\beta)}(\lambda_{\pi_n}\,\|\,1). \tag{49}$$
Moreover, the equality with the upper-bound is only obtained for those extraction matrices W that lie within the span of the p generalized eigenvectors of the matrix pencil ( P , Q ) which are associated with the eigenvalues λ π 1 , , λ π p that maximize the divergence from unity in (49).

5.2. Relationship between the Generalized Eigenvalues and Eigenvectors of the Matrix Pencils ( P , Q ) and ( p ( c 1 ) P , C o v ( x ) )

We have seen in the previous section that the tight upper bound of the divergence is attained by a subset of the generalized eigenvectors of the matrix pencil (P, Q), whereas the CSP solution in (24) depends on a subset of the generalized eigenvectors of another matrix pencil (p(c_1) P, Cov(x)). In this section, we address the close relationship between both eigendecompositions. For this purpose, we denote by Λ the matrix of eigenvalues of Q^{-1} P and by Λ^{(c_1)} the matrix of eigenvalues of (Cov(x))^{-1} p(c_1) P. Then, we write
$$\big(p(c_2)\mathbf{Q}\big)^{-1}\big(p(c_1)\mathbf{P}\big) = \big[(\mathrm{Cov}(\mathbf{x}))^{-1}\big(p(c_2)\mathbf{Q}\big)\big]^{-1}\big[(\mathrm{Cov}(\mathbf{x}))^{-1}\big(p(c_1)\mathbf{P}\big)\big], \tag{50}$$
and use the decomposition of Cov(x) in (4) to substitute p(c_2) Q = Cov(x) − p(c_1) P in the previous equation. In this way, we obtain
$$\big(p(c_2)\mathbf{Q}\big)^{-1}\big(p(c_1)\mathbf{P}\big) = \big[\mathbf{I}_n - (\mathrm{Cov}(\mathbf{x}))^{-1}\big(p(c_1)\mathbf{P}\big)\big]^{-1}\big[(\mathrm{Cov}(\mathbf{x}))^{-1}\big(p(c_1)\mathbf{P}\big)\big]. \tag{51}$$
The matrix of eigenvectors V of Q 1 P diagonalizes both sides of the previous equation
$$\boldsymbol{\Lambda}\,\frac{p(c_1)}{p(c_2)} = \mathbf{V}^{-1}\,\frac{p(c_1)}{p(c_2)}\,\mathbf{Q}^{-1}\mathbf{P}\,\mathbf{V} \tag{52}$$
$$= \Big(\mathbf{V}^{-1}\big[\mathbf{I}_n - (\mathrm{Cov}(\mathbf{x}))^{-1}(p(c_1)\mathbf{P})\big]^{-1}\mathbf{V}\Big)\Big(\mathbf{V}^{-1}\big[(\mathrm{Cov}(\mathbf{x}))^{-1}(p(c_1)\mathbf{P})\big]\mathbf{V}\Big) \tag{53}$$
$$= \big[\mathbf{I}_n - \mathbf{V}^{-1}(\mathrm{Cov}(\mathbf{x}))^{-1}(p(c_1)\mathbf{P}\mathbf{V})\big]^{-1}\Big(\mathbf{V}^{-1}\big[(\mathrm{Cov}(\mathbf{x}))^{-1}(p(c_1)\mathbf{P})\big]\mathbf{V}\Big) \tag{54}$$
$$= \big(\mathbf{I}_n - \boldsymbol{\Lambda}^{(c_1)}\big)^{-1}\boldsymbol{\Lambda}^{(c_1)}. \tag{55}$$
Hence, we have the explicit relationship between the two sets of eigenvalues
$$\lambda_i\,\frac{p(c_1)}{p(c_2)} = \frac{\lambda_i^{(c_1)}}{1-\lambda_i^{(c_1)}} \equiv g\big(\lambda_i^{(c_1)}\big), \quad i = 1,\ldots,n, \tag{56}$$
where g(λ_i^{(c_1)}), as can be seen in Figure 2, is a strictly monotonically increasing function over the domain λ_i^{(c_1)} ∈ (0, 1). Moreover, Equations (52)–(55) imply that the matrix V of generalized eigenvectors of the matrix pencil (P, Q) exactly coincides with the matrix V^{(c_1)} of generalized eigenvectors of the other matrix pencil (p(c_1) P, Cov(x)).

5.3. Linking the Optimization of the Divergence and the CSP Solution

There is a link between the solutions of the CSP method and the solutions obtained with the optimization of the symmetric KL divergence between the class conditional covariances, which was studied in previous works [1,11,16]. This subsection shows that under the appropriate filter selection criteria the link also extends to the optimization of other divergences, like the AB log-det family of divergences.
We have previously assumed that the generalized eigenvalues are ordered and can be regarded as distinct. Therefore, we can cluster them into the following three sets of principal, inner and minor eigenvalues of the matrix pencil (P, Q):
$$\underbrace{\lambda_1 > \cdots > \lambda_k}_{k\ \text{principal eigenvalues}} \;>\; \underbrace{\lambda_{k+1} > \cdots > \lambda_{n-(p-k)}}_{\text{inner eigenvalues}} \;>\; \underbrace{\lambda_{n-(p-k)+1} > \cdots > \lambda_{n}}_{(p-k)\ \text{minor eigenvalues}}. \tag{57}$$
The following sequence of optimizations induces an alternative ordering of the generalized eigenvalues
D A B ( α , β ) ( λ π i 1 ) = min dim { W } = n i + 1 max w W D A B ( α , β ) ( w i T P w i w i T Q w i ) , i = 1 , , n ,
according to a permutation π that sorts their marginal divergences from 1 in descending order
$$D_{AB}^{(\alpha,\beta)}(\lambda_{\pi_1}\,\|\,1) \geq \cdots \geq D_{AB}^{(\alpha,\beta)}(\lambda_{\pi_p}\,\|\,1) \geq D_{AB}^{(\alpha,\beta)}(\lambda_{\pi_{p+1}}\,\|\,1) \geq \cdots \geq D_{AB}^{(\alpha,\beta)}(\lambda_{\pi_n}\,\|\,1). \tag{59}$$
For building the matrix of spatial filters W_Div ≡ [w_1, w_2, …, w_p], one possible selection policy is to retain only the p most discriminative spatial filters for the considered divergence optimization problem, i.e., those that solve (58) for i = 1, …, p. The filters consist of the p eigenvectors (v_{π_i} with i = 1, …, p) of the matrix pencil (P, Q) that are arranged according to the permutation π. From the one-to-one relationship that exists between the generalized eigenvalues and eigenvectors of the matrix pencils (P, Q) and (p(c_1) P, Cov(x)) (see the previous subsection), the solution takes the following form
$$\mathbf{W}_{Div} = [\mathbf{v}_{\pi_1},\ldots,\mathbf{v}_{\pi_p}] \tag{60}$$
$$= \big[\mathbf{v}_{\pi_1}^{(c_1)},\ldots,\mathbf{v}_{\pi_p}^{(c_1)}\big]. \tag{61}$$
This result tells us that the optimization of different divergences (in absence of other regularizing terms) only differs in the selection criteria for the spatial filters, which eventually determine the chosen subindices π 1 , , π p .
Now, the question of whether these spatial filters that solve the sequence of minimax divergence optimization problems
min dim { W } = n i + 1 max w W D A B ( α , β ) ( w i T P w i w i T Q w i ) , i = 1 , , p ,
essentially coincide (up to a possible permutation in the order of the spatial filters) with the spatial filters of the CSP solution in (63)
$$\mathbf{W}_{CSP} = \big[\underbrace{\mathbf{v}_1^{(c_1)},\ldots,\mathbf{v}_k^{(c_1)}}_{k\ \text{principal eigenvectors}},\ \underbrace{\mathbf{v}_{n-(p-k)+1}^{(c_1)},\ldots,\mathbf{v}_n^{(c_1)}}_{p-k\ \text{minor eigenvectors}}\big], \tag{63}$$
has a simple answer. The straightforward comparison between (61) and (63) reveals that both solutions should essentially coincide when the subindices π_1, …, π_p are a permutation of the integers 1, …, k, n − (p − k) + 1, …, n. Thus, the link between both techniques holds whenever the CSP method adopts the filter selection policy of the divergence criterion in (59).
However, many CSP implementations find it satisfactory to choose the number of spatial filters for each class a priori, respectively as k and p − k (we will refer to this case as the original CSP filter selection policy), where k is close to p/2 in order to approximately balance the number of spatial filters for each class [13,21].
In general, the use of a divergence-based selection policy does not ensure a balanced representation of the spatial filters for each of the classes. For instance, consider the synthetic but illustrative situation for n = 100, where we wish to select p = 8 spatial filters, and suppose that the generalized eigenvalues of the matrix pencil (P, Q) are shifted towards zero, for instance, equal to {10, 0.99, 0.98, …, 0.03, 0.02, 0.01}. In this case, the solution W_Div will select as its columns only k = 1 principal eigenvector and p − k = 7 minor eigenvectors, an unbalanced choice.
In view of this potential limitation, an interesting question is whether it would be possible to modify the AB log-det divergence criterion so as to enforce that its solution essentially coincides with the one obtained by the CSP method with its original filter selection policy. We will show in the following that this requires only a suitable scaling κ ∈ R_+ in one of the arguments of the divergence. Without loss of generality, we assume the scaling is applied to the second argument of the divergence. As is shown in Appendix A.2, there is a permutation π of the indices of the spatial filters 1, …, p that links the CSP solution in (24) with the optimization of the divergence
$$\mathbf{w}_{\pi_i} = \arg\min_{\dim\{\mathcal{W}\}=n-i+1}\ \max_{\mathbf{w}\in\mathcal{W}}\ D_{AB}^{(\alpha,\beta)}\big(\mathbf{w}_i^T\mathbf{P}\mathbf{w}_i\,\|\,\kappa\,\mathbf{w}_i^T\mathbf{Q}\mathbf{w}_i\big), \quad i = 1,\ldots,p, \tag{64}$$
for any given
κ κ inf , κ sup
with
$$\kappa_{\inf} \equiv K\big(\lambda_{k+1},\ \lambda_{n-(p-k)+1}\big) \tag{66}$$
$$\kappa_{\sup} \equiv K\big(\lambda_{k},\ \lambda_{n-(p-k)}\big) \tag{67}$$
where the function
$$K(a,b) = \begin{cases} \left[\dfrac{(a^{\beta}-b^{\beta})/\beta}{(a^{-\alpha}-b^{-\alpha})/(-\alpha)}\right]^{\frac{1}{\alpha+\beta}} & \text{for } \alpha,\beta,\alpha+\beta\neq 0 \\[2.5ex] \left[\dfrac{\log(a/b)}{(a^{-\alpha}-b^{-\alpha})/(-\alpha)}\right]^{\frac{1}{\alpha}} & \text{for } \alpha\neq 0,\ \beta = 0 \\[2.5ex] \left[\dfrac{(a^{\beta}-b^{\beta})/\beta}{\log(a/b)}\right]^{\frac{1}{\beta}} & \text{for } \alpha = 0,\ \beta\neq 0 \\[2.5ex] \exp\left(\dfrac{a^{\alpha}\log(e\,b^{\alpha}) - b^{\alpha}\log(e\,a^{\alpha})}{\alpha\,(a^{\alpha}-b^{\alpha})}\right) & \text{for } \alpha = -\beta\neq 0 \\[2.5ex] \sqrt{ab} & \text{for } \alpha = \beta = 0 \end{cases} \tag{68}$$
determines the value of the constant κ = K(a, b) ∈ R that equalizes the value of the AB log-det divergences between two arbitrary constants a, b ∈ R (in the first argument) and κ (in the second argument), i.e.,
D A B ( α , β ) ( a κ ) = D A B ( α , β ) ( b κ ) .
Note that the only role of the scaling factor κ is to adjust the reference value in one of the arguments of the divergence to ensure an exact balance in the number of spatial filters that are specialized in each class. As is shown in Appendix A.2, this scaling factor prevents the minimax solutions for i = 1, …, p from being attained by eigenvectors associated with elements of the inner set of eigenvalues in (57), so the chosen subset of eigenvectors has to essentially coincide with the principal and minor eigenvectors that form the CSP solution in (63). In practice, a value of κ which is closer to unity and meets the required bounds can be obtained from the truncated choice
$$\kappa = \begin{cases} \kappa_{\inf} + \varepsilon & \text{for } \kappa_{\inf} \geq 1 \\ 1 & \text{for } 1 \in (\kappa_{\inf},\,\kappa_{\sup}) \\ \kappa_{\sup} - \varepsilon & \text{for } \kappa_{\sup} \leq 1 \end{cases} \tag{70}$$
for an arbitrarily small value of the constant ε ≪ κ_sup − κ_inf.
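A hypothetical implementation of the function K(a, b) in (68) and of the truncated choice of κ in (70) could look as follows. The eigenvalues are assumed to be sorted in descending order, and the tolerance eps is an illustrative assumption.

```python
import numpy as np

def K(a, b, alpha, beta):
    """Sketch of K(a, b) in Equation (68): the scaling kappa that equalizes
    D_AB(a || kappa) = D_AB(b || kappa)."""
    al, be = float(alpha), float(beta)
    if al == 0.0 and be == 0.0:
        return np.sqrt(a * b)
    if al == -be:
        num = a**al * np.log(np.e * b**al) - b**al * np.log(np.e * a**al)
        return np.exp(num / (al * (a**al - b**al)))
    if be == 0.0:
        return (np.log(a / b) / ((a**(-al) - b**(-al)) / (-al)))**(1.0 / al)
    if al == 0.0:
        return (((a**be - b**be) / be) / np.log(a / b))**(1.0 / be)
    num = (a**be - b**be) / be
    den = (a**(-al) - b**(-al)) / (-al)
    return (num / den)**(1.0 / (al + be))

def choose_kappa(lams, k, p, alpha, beta, eps=1e-6):
    """Truncated choice of kappa in Equation (70); lams holds the generalized
    eigenvalues of (P, Q) sorted in descending order (lams[0] = lambda_1)."""
    n = lams.size
    kappa_inf = K(lams[k], lams[n - (p - k)], alpha, beta)          # K(lambda_{k+1}, lambda_{n-(p-k)+1})
    kappa_sup = K(lams[k - 1], lams[n - (p - k) - 1], alpha, beta)  # K(lambda_k, lambda_{n-(p-k)})
    if kappa_inf >= 1.0:
        return kappa_inf + eps
    if kappa_sup <= 1.0:
        return kappa_sup - eps
    return 1.0
```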

6. The Gradient of the AB Log-Det Divergence

The AB log-det divergence between the conditional covariance of the outputs Y = W T x for each of the classes
$$f(\mathbf{W}) = D_{AB}^{(\alpha,\beta)}\big(\mathrm{Cov}(Y|c_1)\,\|\,\mathrm{Cov}(Y|c_2)\big) \tag{71}$$
$$= D_{AB}^{(\alpha,\beta)}\big(\mathbf{W}^T\mathbf{P}\mathbf{W}\,\|\,\mathbf{W}^T\mathbf{Q}\mathbf{W}\big), \tag{72}$$
is a function of the matrix W R n × p .
The optimization of this function with respect to W is non-trivial, so in this section we show how the gradient of the AB log-det divergences can be derived. One may note that this is not only of natural interest for the optimization that we would like to perform in this work, but it also paves the way for the potential practical use of the AB log-det divergence in other scenarios and applications.
As we have shown previously, the divergence is separable
D A B ( α , β ) ( W T PW W T QW ) = D A B ( α , β ) ( M I p ) = i = 1 p D A B ( α , β ) ( μ i ( W ) 1 )
over the eigenvalues of the matrix
$$\mathbf{M} = (\mathbf{W}^T\mathbf{Q}\mathbf{W})^{-\frac12}\,\mathbf{W}^T\mathbf{P}\mathbf{W}\,(\mathbf{W}^T\mathbf{Q}\mathbf{W})^{-\frac12} \tag{74}$$
$$= \mathbf{U}\,\mathrm{diag}\{\mu_1,\ldots,\mu_p\}\,\mathbf{U}^T, \tag{75}$$
where diag { μ 1 , , μ p } and U = [ u 1 , , u p ] , respectively denote the matrices of eigenvalues and eigenvectors of M , which are functions of the matrix W .
The differential of f ( W ) can be expressed as
d f ( W ) = tr d W T f ( W ) W
where
f ( W ) W = f ( W ) W i j i j R n × p
denotes the gradient of the function. The divergence directly depends on the generalized eigenvalues, which in turn depend on the matrix W . The suitable tool to obtain the gradient of this composition of functions is the chain rule, which can be written as
f ( W ) W = i = 1 p μ i W f ( W ) μ i .
So, the gradient can be evaluated after finding f ( W ) μ i and μ i W .
Since the divergence is a separable function of the generalized eigenvalues, the first term is easier to obtain,
$$\frac{\partial f(\mathbf{W})}{\partial \mu_i} = \frac{\partial D_{AB}^{(\alpha,\beta)}(\mu_i\,\|\,1)}{\partial \mu_i} = \frac{\mu_i^{\beta-1}-\mu_i^{-\alpha-1}}{\alpha\mu_i^{\beta}+\beta\mu_i^{-\alpha}} = \begin{cases} \dfrac{\mu_i^{\alpha+\beta}-1}{\mu_i\,(\alpha\mu_i^{\alpha+\beta}+\beta)} & \text{for } \alpha+\beta\neq 0, \\[2ex] \dfrac{\log\mu_i}{\mu_i\,(1+\alpha\log\mu_i)} & \text{for } \alpha+\beta = 0. \end{cases} \tag{79}$$
Obtaining the second term ∂μ_i/∂W is not so easy and requires employing our previous plausible assumption that the generalized eigenvalues have a simple spectrum. Under this condition, the Hadamard first variation formula can be used to write the differential of the eigenvalues as
d μ i = u i T d M u i ,
where u i denotes the normalized eigenvector ( u i 2 = 1 ) corresponding to each eigenvalue μ i .
With the help of the product rule for differentials, we obtain
d M = d ( W T QW ) 1 2 ( W T QW ) 1 2 M + M ( W T QW ) 1 2 d ( W T QW ) 1 2 + ( W T QW ) 1 2 ( d W T PW + W T P d W ) ( W T QW ) 1 2 .
As we show in the Appendix A.3, it can be simplified as follows
( 82 ) d ( W T QW ) 1 2 ( W T QW ) 1 2 = 1 2 ( W T QW ) 1 2 d ( W T QW ) ( W T QW ) 1 2 ( 83 ) = 1 2 ( W T QW ) 1 2 ( d W T QW + W T Q d W ) ( W T QW ) 1 2
hence
d M = 1 2 ( W T QW ) 1 2 ( d W T QW + W T Q d W ) ( W T QW ) 1 2 M 1 2 ( W T QW ) 1 2 ( d W T QW + W T Q d W ) ( W T QW ) 1 2 M T + ( W T QW ) 1 2 ( d W T PW + W T P d W ) ( W T QW ) 1 2 .
Thus, after substituting (84) in (80) and using the invariance of the trace under transpositions ( tr { A } = tr { A T } ) and the cyclic shifts ( tr { A B } = tr { B A } ), the following values are obtained
( 85 ) d μ i = u i T d M u i ( 86 ) = tr u i T d M u i = 1 2 tr u i T ( W T QW ) 1 2 ( d W T QW + W T Q d W ) ( W T QW ) 1 2 M u i 1 2 tr u i T ( W T QW ) 1 2 ( d W T QW + W T Q d W ) ( W T QW ) 1 2 M T u i ( 87 ) + tr u i T ( W T QW ) 1 2 ( d W T PW + W T P d W ) ( W T QW ) 1 2 u i = 2 tr d W T QW ( W T QW ) 1 2 M u i u i T ( W T QW ) 1 2 ( 88 ) + 2 tr d W T PW ( W T QW ) 1 2 u i u i T ( W T QW ) 1 2 .
At this point, we can use the identity for the differential
d μ i = tr d W T μ i W
in (88) to identify the second desired term
$$\frac{\partial\mu_i}{\partial\mathbf{W}} = -2\,\mathbf{Q}\mathbf{W}\,(\mathbf{W}^T\mathbf{Q}\mathbf{W})^{-\frac12}\,\mathbf{M}\,\mathbf{u}_i\mathbf{u}_i^T\,(\mathbf{W}^T\mathbf{Q}\mathbf{W})^{-\frac12} + 2\,\mathbf{P}\mathbf{W}\,(\mathbf{W}^T\mathbf{Q}\mathbf{W})^{-\frac12}\,\mathbf{u}_i\mathbf{u}_i^T\,(\mathbf{W}^T\mathbf{Q}\mathbf{W})^{-\frac12}. \tag{90}$$
Substituting the expressions (79) and (90) in (78), we obtain
$$\frac{\partial f(\mathbf{W})}{\partial\mathbf{W}} = \sum_{i=1}^{p}\frac{\partial\mu_i}{\partial\mathbf{W}}\,\frac{\partial D_{AB}^{(\alpha,\beta)}(\mu_i\,\|\,1)}{\partial\mu_i} = -2\,\mathbf{Q}\mathbf{W}\,(\mathbf{W}^T\mathbf{Q}\mathbf{W})^{-\frac12}\,\mathbf{M}\,\mathbf{Z}\,(\mathbf{W}^T\mathbf{Q}\mathbf{W})^{-\frac12} + 2\,\mathbf{P}\mathbf{W}\,(\mathbf{W}^T\mathbf{Q}\mathbf{W})^{-\frac12}\,\mathbf{Z}\,(\mathbf{W}^T\mathbf{Q}\mathbf{W})^{-\frac12} \tag{91}$$
where, for convenience, the following matrix is defined
$$\mathbf{Z} = \sum_{i=1}^{p}\mathbf{u}_i\,\frac{\partial D_{AB}^{(\alpha,\beta)}(\mu_i\,\|\,1)}{\partial\mu_i}\,\mathbf{u}_i^T \tag{92}$$
$$= \mathbf{U}\,\mathrm{diag}\left\{\frac{\partial D_{AB}^{(\alpha,\beta)}(\mu_1\,\|\,1)}{\partial\mu_1},\ldots,\frac{\partial D_{AB}^{(\alpha,\beta)}(\mu_p\,\|\,1)}{\partial\mu_p}\right\}\mathbf{U}^T. \tag{93}$$
The matrix Z can also be represented directly in terms of the matrix M (which we have defined previously in Equation (74)) as
$$\mathbf{Z} = \begin{cases} \mathbf{M}^{-1}\big(\alpha\mathbf{M}^{\alpha+\beta}+\beta\mathbf{I}\big)^{-1}\big(\mathbf{M}^{\alpha+\beta}-\mathbf{I}\big) & \text{for } \alpha+\beta\neq 0, \\[1ex] \mathbf{M}^{-1}\big((\log\mathbf{M})^{-1}+\alpha\mathbf{I}\big)^{-1} & \text{for } \alpha+\beta = 0, \end{cases} \tag{94}$$
where log(·) for matrix arguments denotes the matrix logarithm. After grouping common terms in (91), we obtain the final gradient expression, which is given by
$$\frac{\partial f(\mathbf{W})}{\partial\mathbf{W}} = 2\,\big[\mathbf{P}\mathbf{W} - \mathbf{Q}\mathbf{W}(\mathbf{W}^T\mathbf{Q}\mathbf{W})^{-1}(\mathbf{W}^T\mathbf{P}\mathbf{W})\big]\,(\mathbf{W}^T\mathbf{Q}\mathbf{W})^{-\frac12}\,\mathbf{Z}\,(\mathbf{W}^T\mathbf{Q}\mathbf{W})^{-\frac12}. \tag{95}$$
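The gradient in (95), together with the auxiliary matrix Z in (94), can be assembled directly from matrix functions of W^T P W and W^T Q W. The following sketch is a straightforward (not numerically optimized) transcription of these formulas; names are illustrative, and a practical implementation would need safeguards for ill-conditioned W^T Q W.

```python
import numpy as np
from scipy.linalg import fractional_matrix_power, logm, inv

def ab_logdet_gradient(W, P, Q, alpha, beta):
    """Sketch of the gradient in Equation (95) with Z from Equation (94);
    P, Q are assumed SPD and W of full column rank."""
    a, b = float(alpha), float(beta)
    WQW = W.T @ Q @ W
    WPW = W.T @ P @ W
    WQW_mh = fractional_matrix_power(WQW, -0.5)     # (W^T Q W)^{-1/2}
    M = WQW_mh @ WPW @ WQW_mh                        # Equation (74)
    p = M.shape[0]
    if a + b != 0.0:
        Mab = fractional_matrix_power(M, a + b)      # M^{alpha+beta}
        Z = inv(M) @ inv(a * Mab + b * np.eye(p)) @ (Mab - np.eye(p))
    else:
        Z = inv(M) @ inv(inv(logm(M)) + a * np.eye(p))
    core = WQW_mh @ Z @ WQW_mh
    return 2.0 * (P @ W - Q @ W @ inv(WQW) @ WPW) @ core
```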

6.1. Validation of Equation (95) with the Gradient of the KL Divergence

The Kullback–Leibler (KL) divergence between the Gaussian densities p ( x | c 2 ) and the p ( x | c 1 ) , of zero mean and the respective covariance matrices C o v ( Y | c 1 ) and C o v ( Y | c 2 ) , is given by
$$\mathrm{Div}_{KL}\big(p(\mathbf{x}|c_2)\,\|\,p(\mathbf{x}|c_1)\big) = \int p(\mathbf{x}|c_2)\log\frac{p(\mathbf{x}|c_2)}{p(\mathbf{x}|c_1)}\,d\mathbf{x} = \frac12\log\big|\mathrm{Cov}(Y|c_1)\big| - \frac12\log\big|\mathrm{Cov}(Y|c_2)\big| + \frac12\,\mathrm{tr}\big\{\mathrm{Cov}^{-1}(Y|c_1)\,\mathrm{Cov}(Y|c_2) - \mathbf{I}_p\big\}. \tag{96}$$
Since this divergence only involves trace and log-det operators, as it is shown in the Appendix A.4, its gradient with respect to W , i.e.,
$$\nabla_{\mathbf{W}}\,\mathrm{Div}_{KL}\big(p(\mathbf{x}|c_2)\,\|\,p(\mathbf{x}|c_1)\big) = -\mathbf{Q}\mathbf{W}(\mathbf{W}^T\mathbf{Q}\mathbf{W})^{-1} + \mathbf{P}\mathbf{W}(\mathbf{W}^T\mathbf{P}\mathbf{W})^{-1} + \mathbf{Q}\mathbf{W}(\mathbf{W}^T\mathbf{P}\mathbf{W})^{-1} - \mathbf{P}\mathbf{W}(\mathbf{W}^T\mathbf{P}\mathbf{W})^{-1}(\mathbf{W}^T\mathbf{Q}\mathbf{W})(\mathbf{W}^T\mathbf{P}\mathbf{W})^{-1}, \tag{97}$$
is relatively easy to obtain. Then, we can use the fact that the KL divergence is proportional to the AB log-det divergence between the class conditional covariance matrices, as long as the conditional covariance matrices appear in the AB log-det divergence interchanged in position with respect to class conditional density arguments of the KL divergence. So for the specific case of α = 1 and β = 0 , i.e.,
D A B ( 1 , 0 ) ( C o v ( Y | c 1 ) C o v ( Y | c 2 ) ) = 2 D i v K L ( p ( x | c 2 ) p ( x | c 1 ) ) ,
to test whether there is coherence between the obtained gradient formula in (95) and twice the gradient of the KL divergence that was independently obtained in the Appendix A.4. For this purpose, in the specific case of α = 1 and β = 0 , from (94) the following auxiliary matrices are evaluated
$$\mathbf{Z} = \mathbf{M}^{-1}\big(\mathbf{I}_p - \mathbf{M}^{-1}\big) \tag{99}$$
$$(\mathbf{W}^T\mathbf{Q}\mathbf{W})^{-\frac12}\,\mathbf{Z}\,(\mathbf{W}^T\mathbf{Q}\mathbf{W})^{-\frac12} = (\mathbf{W}^T\mathbf{P}\mathbf{W})^{-1} - (\mathbf{W}^T\mathbf{P}\mathbf{W})^{-1}(\mathbf{W}^T\mathbf{Q}\mathbf{W})(\mathbf{W}^T\mathbf{P}\mathbf{W})^{-1} \tag{100}$$
and are substituted in the expression of the gradient of the AB log-det divergence (95). After the following straightforward simplifications,
$$\nabla_{\mathbf{W}}\,D_{AB}^{(1,0)}\big(\mathrm{Cov}(Y|c_1)\,\|\,\mathrm{Cov}(Y|c_2)\big) = 2\,\big[\mathbf{P}\mathbf{W} - \mathbf{Q}\mathbf{W}(\mathbf{W}^T\mathbf{Q}\mathbf{W})^{-1}(\mathbf{W}^T\mathbf{P}\mathbf{W})\big]\times\big[(\mathbf{W}^T\mathbf{Q}\mathbf{W})^{-\frac12}\,\mathbf{Z}\,(\mathbf{W}^T\mathbf{Q}\mathbf{W})^{-\frac12}\big] \tag{101}$$
$$= 2\,\big[-\mathbf{Q}\mathbf{W}(\mathbf{W}^T\mathbf{Q}\mathbf{W})^{-1}(\mathbf{W}^T\mathbf{P}\mathbf{W}) + \mathbf{P}\mathbf{W}\big]\times\big[(\mathbf{W}^T\mathbf{P}\mathbf{W})^{-1} - (\mathbf{W}^T\mathbf{P}\mathbf{W})^{-1}(\mathbf{W}^T\mathbf{Q}\mathbf{W})(\mathbf{W}^T\mathbf{P}\mathbf{W})^{-1}\big] \tag{102}$$
$$= 2\,\big[-\mathbf{Q}\mathbf{W}(\mathbf{W}^T\mathbf{Q}\mathbf{W})^{-1} + \mathbf{P}\mathbf{W}(\mathbf{W}^T\mathbf{P}\mathbf{W})^{-1} + \mathbf{Q}\mathbf{W}(\mathbf{W}^T\mathbf{P}\mathbf{W})^{-1} - \mathbf{P}\mathbf{W}(\mathbf{W}^T\mathbf{P}\mathbf{W})^{-1}(\mathbf{W}^T\mathbf{Q}\mathbf{W})(\mathbf{W}^T\mathbf{P}\mathbf{W})^{-1}\big] \tag{103}$$
$$= 2\,\nabla_{\mathbf{W}}\,\mathrm{Div}_{KL}\big(p(\mathbf{x}|c_2)\,\|\,p(\mathbf{x}|c_1)\big). \tag{104}$$
the proportionality between the gradient of D A B ( 1 , 0 ) ( C o v ( Y | c 1 ) C o v ( Y | c 2 ) ) and the gradient of the KL divergence in (97) is confirmed.

6.2. Validation of Equation (95) with the Gradient of the AG Divergence

The Alpha–Gamma divergence between the Gaussian densities p ( x | c 2 ) and p ( x | c 1 ) , of zero mean and with respective covariance matrices C o v ( Y | c 1 ) = W T PW and C o v ( Y | c 2 ) = W T QW , is equal to
$$\mathrm{Div}_{AG}^{(\alpha,\beta)}\big(p(y_i|c_2)\,\|\,p(y_i|c_1)\big) \equiv \frac{1}{\alpha\beta}\log\frac{\big(\int_{\Omega}p(y_i|c_1)\,dy_i\big)^{\beta}\big(\int_{\Omega}p(y_i|c_2)\,dy_i\big)^{\alpha}}{\int_{\Omega}p^{\beta}(y_i|c_1)\,p^{\alpha}(y_i|c_2)\,dy_i} \tag{105}$$
$$= \frac{1}{2\alpha\beta}\log\big|\mathbf{W}^T(\alpha\mathbf{P}+\beta\mathbf{Q})\mathbf{W}\big| - \frac{1}{2\beta}\log\big|\mathbf{W}^T\mathbf{P}\mathbf{W}\big| - \frac{1}{2\alpha}\log\big|\mathbf{W}^T\mathbf{Q}\mathbf{W}\big| \quad \text{for } \alpha>0,\ \beta>0,\ \alpha+\beta=1. \tag{106}$$
Due to the constraint α + β = 1, we assume that β is determined by α, i.e., β = 1 − α, throughout this subsection. Since
W log | ( W T PW ) | = 2 PW ( W T PW ) 1 ,
the gradient of the AG divergence with respect to W is given by
$$\nabla_{\mathbf{W}}\,\mathrm{Div}_{AG}^{(\alpha,\beta)}\big(p(\mathbf{x}|c_2)\,\|\,p(\mathbf{x}|c_1)\big) = \frac{2}{2\alpha\beta}(\alpha\mathbf{P}+\beta\mathbf{Q})\mathbf{W}\big[\mathbf{W}^T(\alpha\mathbf{P}+\beta\mathbf{Q})\mathbf{W}\big]^{-1} - \frac{2}{2\alpha}\mathbf{Q}\mathbf{W}(\mathbf{W}^T\mathbf{Q}\mathbf{W})^{-1} - \frac{2}{2\beta}\mathbf{P}\mathbf{W}(\mathbf{W}^T\mathbf{P}\mathbf{W})^{-1}$$
$$= -\frac{1}{\beta}\,\mathbf{P}\mathbf{W}\Big[(\mathbf{W}^T\mathbf{P}\mathbf{W})^{-1} - \big(\mathbf{W}^T(\alpha\mathbf{P}+\beta\mathbf{Q})\mathbf{W}\big)^{-1}\Big] - \frac{1}{\alpha}\,\mathbf{Q}\mathbf{W}\Big[(\mathbf{W}^T\mathbf{Q}\mathbf{W})^{-1} - \big(\mathbf{W}^T(\alpha\mathbf{P}+\beta\mathbf{Q})\mathbf{W}\big)^{-1}\Big]. \tag{108}$$
Then, we can use the equivalence between the AG divergence and the AB log-det divergence between the class conditional covariance matrices
D A B ( α , β ) ( C o v ( Y | c 1 ) C o v ( Y | c 2 ) ) = 2 D i v A G ( α , β ) p ( y i | c 2 ) p ( y i | c 1 ) ,
which is valid for the specific case of α + β = 1 and α , β > 0 , to also test the coherence between the obtained gradient formula in (95) and twice the gradient of the AG divergence. For α + β = 1 , the auxiliary matrices in the definition of the gradient are
$$\mathbf{Z} = (\alpha\mathbf{M}+\beta\mathbf{I})^{-1}\big[\mathbf{M}^{-1}(\mathbf{M}-\mathbf{I})\big] = (\alpha\mathbf{M}+\beta\mathbf{I})^{-1} - \big(\alpha\mathbf{M}^2+\beta\mathbf{M}\big)^{-1} \tag{110}$$
and
( W T QW ) 1 2 Z ( W T QW ) 1 2 = ( α ( W T QW ) 1 2 M ( W T QW ) 1 2 + β W T QW ) 1 ( α ( W T QW ) 1 2 M ( W T QW ) 1 2 ( W T QW ) 1 ( W T QW ) 1 2 M ( W T QW ) 1 2 ( 111 ) + β ( W T QW ) 1 2 M ( W T QW ) 1 2 ) 1 ( 112 ) = [ I ( W T P W ) 1 ( W T QW ) ] ( W T ( α P + β Q ) W ) 1 .
After substituting this last expression in the gradient of the AB log-det divergence (95), we obtain
W D A B ( α , β ) ( C o v ( Y | c 1 ) C o v ( Y | c 2 ) ) = 2 [ PW QW ( W T QW ) 1 ( W T PW ) ] ( W T QW ) 1 2 Z ( W T QW ) 1 2 = + 2 ( PW + QW ) ( W T ( α P + β Q ) W ) 1 2 β α PW [ ( W T α P W ) + ( W T α P W ) ( W T β QW ) 1 ( W T α P W ) ] 1 2 α β QW [ ( W T β QW ) + ( W T β QW ) ( W T α P W ) 1 ( W T β QW ) ] 1 .
With the help of the particular form of the Woodbury identity for the matrix inverse
$$\big[\mathbf{A} + \mathbf{A}\mathbf{B}^{-1}\mathbf{A}\big]^{-1} = \mathbf{A}^{-1} - (\mathbf{A}+\mathbf{B})^{-1} \tag{114}$$
we simplify the terms within the brackets. Finally, we use the fact that α + β = 1 to confirm the proportionality with the gradient of the AG divergence given in (108),
W D A B ( α , β ) ( C o v ( Y | c 1 ) C o v ( Y | c 2 ) ) = + 2 ( PW + QW ) ( W T ( α P + β Q ) W ) 1 2 β α PW [ ( W T α P W ) 1 ( W T ( α P + β Q ) W ) 1 ] ( 115 ) 2 α β QW [ ( W T β QW ) 1 ( W T ( α P + β Q ) W ) 1 ] = + 2 ( PW + QW ) ( W T ( α P + β Q ) W ) 1 2 β PW [ ( W T P W ) 1 α ( W T ( α P + β Q ) W ) 1 ] ( 116 ) 2 α QW [ ( W T QW ) 1 β ( W T ( α P + β Q ) W ) 1 ] = + 2 ( ( 1 + α β ) PW + ( 1 + β α ) QW ) ( W T ( α P + β Q ) W ) 1 ( 117 ) 2 β PW ( W T PW ) 1 2 α QW ( W T QW ) 1 = + 2 ( 1 β PW + 1 α QW ) ( W T ( α P + β Q ) W ) 1 ( 118 ) 2 β PW ( W T PW ) 1 2 α QW ( W T QW ) 1 ( 119 ) = 2 W D i v A G ( α , β ) ( p ( x | c 2 ) p ( x | c 1 ) ) .

7. Robustness of the AB Log-Det Divergence in Terms of α and β

The squared Riemannian metric is known to be the natural distance in the manifold of SPD matrices, as it measures the squared length of the geodesic path between the arguments of the divergence [3]. However, real data usually contain several model contaminations (mismatches), including outliers or artifacts, that can make other, more robust divergences preferable. In this section, we study how the hyperparameters α and β influence the robustness of the AB log-det divergence relative to the behavior of the squared Riemannian metric, which is used as a reference.
For convenience, we denote the AB log-det divergence as a function of the spatial filter matrix W by
f ( α , β ) ( W ) D A B ( α , β ) ( W T PW W T QW ) ,
and we consider its gradient expression given by Equation (78). The spatial filters that maximize this divergence should satisfy the following estimating equations
f ( α , β ) ( W ) W = i = 1 p μ i W ψ ( α , β ) ( μ i ) = 0 ,
where μ i , i = 1 , , p , are the eigenvalues of matrix M , which was defined in Equation (74), and
ψ ( α , β ) ( μ i ) = f ( α , β ) ( W ) μ i , i = 1 , , p ,
may be regarded as influence functions for each pair ( α , β ) that account for the penalty variation in the divergence with respect to μ i . The complementary term to ψ ( α , β ) ( μ i ) in (121), i.e., μ i W , is a matrix of partial derivatives of the generalized eigenvalues μ i with respect to the elements of the spatial filters W and, therefore, it is independent of the considered divergence. It is easy to observe that, in the particular case of α = β = 0 , the expression in (121) represents the estimating equation for the squared Riemann metric
f ( 0 , 0 ) ( W ) W = i = 1 p μ i W ψ ( 0 , 0 ) ( μ i ) = 0 .
In order to study the relative robustness to outliers, one can rewrite the estimating equation for a chosen pair of hyperparameters (α, β) in terms of the influence function for the squared Riemannian metric as
f ( α , β ) ( W ) W = i = 1 p μ i W ψ ( 0 , 0 ) ( μ i ) w ( α , β ) ( μ i ) = 0 ,
where the scalar term
$$w^{(\alpha,\beta)}(\mu) = \frac{\psi^{(\alpha,\beta)}(\mu)}{\psi^{(0,0)}(\mu)}$$
acts as a weight function that controls, for a given pair ( α , β ) , the magnitude of the effect in the estimation equation of departures of μ i from unity.
The presence of outliers in real data typically results in eigenvalues μ_i that are far from unity. However, depending on the problem, the prevalence of outliers may be stronger only in the largest eigenvalues, only in the smallest eigenvalues, or in both simultaneously. Those hyperparameters (α, β) that are able to down-weight the contribution of the outliers are considered more robust. Therefore, the shape of the weight functions w^{(α,β)}(μ_i) is useful for studying the relative immunity of the AB log-det divergence to outliers.
Figure 3a shows the squared Riemannian metric (α = β = 0) and its weight function, which is flat since this divergence is taken as the reference. Figure 3b presents a similar plot for the Power log-det divergence with α = β = 1. In this case, the bell shape of the weight function is an indicator of robustness with respect to the presence of outliers in the greatest and smallest eigenvalues, since they will be down-weighted in the estimating Equation (124). Similar plots can be obtained by increasing the magnitude of α = β, which progressively enhances the robustness. When α ≠ β the divergence is asymmetric. Figure 4a,b respectively present the Kullback–Leibler divergence for SPD matrices (α = 1, β = 0) and its dual version (α = 0, β = 1), together with their associated weight functions. These plots illustrate the asymmetric cases in situations where α + β > 0 and reveal that, when α > β, the AB log-det divergences tend to be more robust against outliers in the large eigenvalues while, for α < β, the robustness tends to be with respect to outliers in the small eigenvalues.
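The influence and weight functions discussed above follow directly from Equation (79). The short sketch below evaluates ψ^{(α,β)} and w^{(α,β)} on a grid of eigenvalues, which is enough to reproduce qualitatively the down-weighting behavior described in this section; names are illustrative assumptions.

```python
import numpy as np

def psi(mu, alpha, beta):
    """Influence function psi^(alpha,beta)(mu) = dD_AB(mu||1)/dmu, Equation (79)."""
    a, b = float(alpha), float(beta)
    if a + b != 0.0:
        return (mu**(a + b) - 1.0) / (mu * (a * mu**(a + b) + b))
    return np.log(mu) / (mu * (1.0 + a * np.log(mu)))

def weight(mu, alpha, beta):
    """Weight w^(alpha,beta)(mu) relative to the squared Riemannian metric (0, 0)."""
    return psi(mu, alpha, beta) / psi(mu, 0.0, 0.0)

# Example: the KL case (alpha=1, beta=0) down-weights large eigenvalues,
# while its dual (alpha=0, beta=1) down-weights small ones.
mu = np.array([0.01, 0.1, 0.5, 2.0, 10.0, 100.0])
print(weight(mu, 1.0, 0.0))
print(weight(mu, 0.0, 1.0))
```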

8. Review of Some Related Techniques for the Spatial Filtering of Motor Imagery Movements

In this section, we review the regularized variants of CSP that have been proposed to improve the classification performance. The regularization approaches for CSP act mainly either on the estimation of the covariance matrices or on the CSP objective function.
Most of them combine the estimation of the covariance matrices for each class with the regularization of the CSP objective function using penalty terms. Some of the approaches incorporate prior information [22], data from other subjects [23,24] and data from previous sessions [25] for estimating the class covariance matrices. Another approach used M-estimators to compute robust class covariance matrices [26], and yet another approach obtained the covariance matrices by finding the minimum squared error [27]. The authors of [28] applied Multiple Kernel Learning (MKL) to combine the information from different subjects.
It has been shown in [29] that regularizing the objective function is more useful than regularizing the estimated covariance matrix, and several approaches that regularize the objective function have been proposed. The authors of [21] additionally incorporated the electrooculogram (EOG) signals in order to reduce the ocular artifacts. Other authors have tried to ensure robustness by selecting only the important channels and producing sparse spatial filters [30,31,32]. Another approach is to robustify the system by obtaining only the stationary features. A robust maximin CSP method was proposed that used a set of covariance matrices instead of an individual covariance matrix, without using any other user data or data from previous sessions [33,34]. In order to mitigate the influence of outliers, the CSP objective function has been formulated using the l_p-norm in [35,36]. The Stationary Subspace Analysis (SSA) algorithm was proposed to obtain the stationary subspaces of the time-series EEG signals by considering only the stationary components of the signals; its limitation is that dissimilarities between the classes may be detected as non-stationary features [37]. The group-wise SSA (gwSSA) algorithm aims at obtaining the non-stationarities by dividing the dataset into different groups and minimizing the KL divergence between the estimated source distribution of each trial in a group and the average distribution of the corresponding group. This algorithm allows the combination not only of multisubject data but also of multiclass data [38]. However, the gwSSA algorithm cannot find the discriminative information between the classes, so the same group proposed a new approach for extracting the discriminative information by subtracting the inter-class divergences from the gwSSA objective function [39]. To overcome the limitation of the SSA algorithm, two-step approaches have been proposed in which the initial extraction of the stationary sources is done using the SSA method and, later, CSP is used for the computation of the spatial filters [40]. Another approach to extract stationary features is to reduce the nonstationarities between the two sessions: supervised and unsupervised methods for the adaptation of the data space have been proposed using the KL divergence between the inter-session data [41]. Recently, the authors of [14] presented the maximum a posteriori CSP (MAP-CSP) algorithm, derived from a probabilistic model of CSP, to resolve the issue of overfitting of the baseline CSP algorithm.
One of the limitations of the CSP algorithm is that it is mainly suitable for the discrimination of two classes, while, in general, an efficient BCI system requires more than two motor imagery movements. In order to formulate it for the multiclass setting, the authors of [42,43] reduced the multiclass problem to a binary problem. The authors of [44] proposed two approaches for the multiclass problem: firstly, finding the spatial filters for one class with respect to all the other classes and, secondly, using simultaneous diagonalization methods. Other approaches, like [45], proposed to solve the multiclass problem by combining information-theoretic criteria with joint diagonalization methods. Several other methods have been proposed for the multiclass paradigm using independent component analysis [46] and Riemannian geometry to obtain the spatial filters [47]. The authors of [48] derived a relation between the Bayes classification error and the Rayleigh quotient and used this approach to solve the multiclass problem. In spite of all these different approaches, the performance of MI-based BCI systems is degraded by the presence of non-stationarities and outliers, which remains a challenge for BCI systems in real applications. Hence, a robust feature extraction algorithm is needed to increase the overall performance of the system.

9. Proposed Criterion and Algorithm for Spatial Filtering

For the presentation of the proposed criterion some additional notation needs to be defined. Let x ˜ ( j ) ( t ) | c denote the output of the passband filtering of the raw observations at time t and for the jth trial of class c { c 1 , c 2 } . The power of the trials of a given class c is normalized by the operation
$$\mathbf{x}^{(j)}(t) = \frac{\tilde{\mathbf{x}}^{(j)}(t)}{\mathrm{tr}\{\mathrm{Cov}(\tilde{\mathbf{x}}^{(j)}|c)\}},$$
where
$$\mathrm{Cov}(\mathbf{x}^{(j)}|c) = \frac{1}{L}\sum_{t=1}^{L}\big(\mathbf{x}^{(j)}(t)-\bar{\mathbf{x}}^{(j)}\big)\big(\mathbf{x}^{(j)}(t)-\bar{\mathbf{x}}^{(j)}\big)^{T} \quad \text{with} \quad \bar{\mathbf{x}}^{(j)} = \frac{1}{L}\sum_{t=1}^{L}\mathbf{x}^{(j)}(t)$$
denotes the sample covariance matrix of the j-th trial x^{(j)} of class c, and L is the number of samples in each trial. In order to simplify the notation, the covariance matrices of the two classes are renamed as
P j C o v ( x ( j ) | c 1 ) and Q j C o v ( x ( j ) | c 2 ) ,
and their averaged versions (the centroids of each class) are denoted as
$$\mathbf{P} \equiv \langle\mathbf{P}_j\rangle = \frac{1}{N_1}\sum_{j=1}^{N_1}\mathbf{P}_j \quad \text{and} \quad \mathbf{Q} \equiv \langle\mathbf{Q}_j\rangle = \frac{1}{N_2}\sum_{j=1}^{N_2}\mathbf{Q}_j.$$
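In code, the per-trial normalization, the trial covariances and the class centroids defined above could be sketched as follows for one class. The trial layout (L samples × n channels), the function name and the literal normalization by the trace are illustrative assumptions.

```python
import numpy as np

def trial_covariances(trials):
    """Sketch: power-normalize each bandpass-filtered trial, compute its sample
    covariance, and average the covariances of a class to obtain its centroid.
    `trials` is a list of (L, n) arrays, one per trial of a single class."""
    covs = []
    for X in trials:                              # X: L samples x n channels
        C = np.cov(X, rowvar=False)               # covariance of the raw trial
        Xn = X / np.trace(C)                      # power normalization of the trial
        covs.append(np.cov(Xn, rowvar=False))     # Cov(x^(j) | c)
    return covs, sum(covs) / len(covs)            # {P_j} (or {Q_j}) and centroid P (or Q)
```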
The classification of imagery movements involves extracting the relevant features of the observations and the classification of the observed patterns in the feature space. In the considered application, the data is high-dimensional but only a few features are sufficient to capture the discriminative information about the intended movements. Thus, the extraction of the relevant features involves a dimensionality reduction step for the observations from R n to R p where p n . This step is implemented through the spatial filtering, i.e., by projecting the n-dimensional observations onto a p-dimensional subspace which should allow a good discrimination of the cluster centroids and, at the same time, guarantee a compact representation of the clusters.
As mentioned earlier, the CSP solution will be obtained by a minimax optimization of the divergence between the projected and scaled centroids of the classes, i.e., D_{AB}^{(\alpha,\beta)}(w_i^T P w_i \,\|\, \kappa\, w_i^T Q w_i). However, since this solution completely ignores the within-class dispersion of the samples, it is quite sensitive to artifacts and outliers in the training dataset. In similarity with the divergence framework presented in [1] and with some variants of Fisher LDA (p. 366 in [49]), one can regularize the previous problem by controlling the dispersion of the trials of each class around their centroids and also by exploiting the degrees of freedom in the selection of the hyperparameters of the divergences. Then, a robust criterion based on the AB log-det divergence takes the following form
F(W) = D_{AB}^{(\alpha,\beta)}\bigl(W^{T} P W \,\|\, \kappa\, W^{T} Q W\bigr) - \eta\,\bigl(p(c_1)\, R_1 + p(c_2)\, R_2\bigr),
where the penalties associated to the within-class dispersion involve the averaged divergences
R_1 = \frac{1}{N_1}\sum_{j=1}^{N_1} D_{AB}^{(\alpha,\beta)}\bigl(W^{T} P_j W \,\|\, W^{T} P W\bigr),
R_2 = \frac{1}{N_2}\sum_{j=1}^{N_2} D_{AB}^{(\alpha,\beta)}\bigl(W^{T} Q_j W \,\|\, W^{T} Q W\bigr),
and the parameter η ∈ R_+ controls the balance between the maximization of the between-class scatter and the minimization of the within-class scatter. Note that in (132) we have used the fact that the AB log-det divergence is invariant under a common scaling of its arguments to simplify D_{AB}^{(\alpha,\beta)}(\kappa W^T Q_j W \,\|\, \kappa W^T Q W) = D_{AB}^{(\alpha,\beta)}(W^T Q_j W \,\|\, W^T Q W).
The optimization of the criterion in (130) can be performed simultaneously, for all the spatial filters, with the use of subspace techniques [1]. In the next section, we present a subspace optimization algorithm based on AB log-det divergences.

The Subspace Optimization Algorithm (Sub-ABLD)

The subspace method aims to extract the desired set of p spatial filters in two steps. The idea is to first use a robust method to determine the discriminative subspace of the spatial filters, for instance, by optimizing a robust criterion like (130). Later, another criterion is used to identify the individual spatial filters within the subspace. Since the influence of outliers on the solution is significantly reduced once the discriminative subspace has been determined, the standard CSP criterion can be safely used in the second step to determine the final spatial directions within the chosen subspace.
The input parameters of the subspace optimization algorithm based on AB log-det divergences (Sub-ABLD) are the set of covariance matrices for each class ( P j , Q j ), the dimension of subspace to be extracted p, and the hyperparameters α, β and η. The method starts with the computation of the sample prior probabilities as well as the average covariance matrices for each class, i.e., p ( c 1 ) , p ( c 2 ) and ( P , Q ). The spatial filter matrix decomposes as W T = Ω T T into the product of a whitening transformation matrix T of the observations and a semi-orthogonal matrix Ω T , which satisfies Ω T Ω = I p . The whitening transformation is obtained from eigenvalue decomposition of C o v ( x ) = p ( c 1 ) P + p ( c 2 ) Q = U 1 Δ U 1 T as follows
T = \Delta^{-\frac{1}{2}}\, U_1^{T},
where Δ and U 1 represent the matrices of eigenvalues and eigenvectors. This transformation is applied to both sides of the covariance matrices to obtain the whitened trial covariances
P ˘ j = T P j T T , Q ˘ j = T Q j T T ,
and their averaged versions
P ˘ = T P T T , Q ˘ = T Q T T .
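A minimal sketch of this whitening step is given below, assuming that P, Q, the trial covariances and the sample class probabilities have already been computed as above; it follows the eigenvalue decomposition of Cov(x) described in the text.

```python
import numpy as np

def whitening_transform(P, Q, p_c1, p_c2):
    """Whitening matrix T = Delta^{-1/2} U1^T of Cov(x) = p(c1) P + p(c2) Q."""
    C = p_c1 * P + p_c2 * Q
    delta, U1 = np.linalg.eigh(C)             # eigenvalues (ascending) and eigenvectors
    return np.diag(1.0 / np.sqrt(delta)) @ U1.T

# T = whitening_transform(P, Q, p_c1, p_c2)
# P_w = [T @ Pj @ T.T for Pj in P_list];  Q_w = [T @ Qj @ T.T for Qj in Q_list]
# Pm_w, Qm_w = T @ P @ T.T, T @ Q @ T.T     # whitened class centroids
```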
The scaling parameter κ, which pursues the balance of the number of features for each class in the absence of regularizers, is determined with the truncation procedure proposed in (70). The semi-orthogonal matrix Ω^T that projects the whitened observations onto a p-dimensional subspace is initialized with the truncated identity matrix of dimension n × p. This is equivalent to starting the optimization by projecting onto the principal p-dimensional subspace of the observations, which ensures a good initial signal-to-noise ratio. Once the whitening transformation is fixed, the criterion to optimize F(W) can be rewritten, in terms of Ω, as the following function
f(\Omega) = D_{AB}^{(\alpha,\beta)}\bigl(\Omega^{T}\breve{P}\Omega \,\|\, \kappa\,\Omega^{T}\breve{Q}\Omega\bigr) - \eta\Biggl[\, p(c_1)\,\frac{1}{N_1}\sum_{j=1}^{N_1} D_{AB}^{(\alpha,\beta)}\bigl(\Omega^{T}\breve{P}_j\Omega \,\|\, \Omega^{T}\breve{P}\Omega\bigr) + p(c_2)\,\frac{1}{N_2}\sum_{j=1}^{N_2} D_{AB}^{(\alpha,\beta)}\bigl(\Omega^{T}\breve{Q}_j\Omega \,\|\, \Omega^{T}\breve{Q}\Omega\bigr)\Biggr],
whose ordinary gradient can be determined from (95), obtaining
\frac{\partial f(\Omega)}{\partial \Omega} = 2\Bigl[\breve{P}\Omega - \kappa\breve{Q}\Omega\,\bigl(\kappa\Omega^{T}\breve{Q}\Omega\bigr)^{-1}\bigl(\Omega^{T}\breve{P}\Omega\bigr)\Bigr]\bigl(\kappa\Omega^{T}\breve{Q}\Omega\bigr)^{-\frac{1}{2}}\, Z_1\, \bigl(\kappa\Omega^{T}\breve{Q}\Omega\bigr)^{-\frac{1}{2}}
- \eta\Biggl\{ p(c_1)\,\frac{2}{N_1}\sum_{j=1}^{N_1}\Bigl[\breve{P}_j\Omega - \breve{P}\Omega\,\bigl(\Omega^{T}\breve{P}\Omega\bigr)^{-1}\bigl(\Omega^{T}\breve{P}_j\Omega\bigr)\Bigr]\bigl(\Omega^{T}\breve{P}\Omega\bigr)^{-\frac{1}{2}}\, Z_2\, \bigl(\Omega^{T}\breve{P}\Omega\bigr)^{-\frac{1}{2}}
+ p(c_2)\,\frac{2}{N_2}\sum_{j=1}^{N_2}\Bigl[\breve{Q}_j\Omega - \breve{Q}\Omega\,\bigl(\Omega^{T}\breve{Q}\Omega\bigr)^{-1}\bigl(\Omega^{T}\breve{Q}_j\Omega\bigr)\Bigr]\bigl(\Omega^{T}\breve{Q}\Omega\bigr)^{-\frac{1}{2}}\, Z_3\, \bigl(\Omega^{T}\breve{Q}\Omega\bigr)^{-\frac{1}{2}} \Biggr\},
where the matrices Z_i should be defined for each case (i = 1, …, 3) as in (94). However, this gradient is not the fastest ascent direction in the structured manifold of semi-orthogonal matrices (the Stiefel manifold). Instead, the fastest ascent direction is given by the "natural" gradient in this manifold [50,51],
\nabla_{\Omega} f(\Omega) = \frac{\partial f(\Omega)}{\partial \Omega} - \Omega\left(\frac{\partial f(\Omega)}{\partial \Omega}\right)^{T}\Omega.
Let Ω^{(i)} denote the semi-orthogonal matrix at iteration i and let μ^{(i)} denote the step-size; the gradient ascent update is then performed as
\Omega_{tg}^{(i+1)} = \Omega^{(i)} + \mu^{(i)}\, \nabla_{\Omega} f\bigl(\Omega^{(i)}\bigr).
The resulting matrix Ω_{tg}^{(i+1)} belongs to the tangent space of the manifold at Ω^{(i)} and asymptotically follows the geodesic path of maximum ascent for a sufficiently small stepsize μ → 0. However, for practical stepsizes, like the one that we consider next,
\mu^{(i)} = \frac{0.02}{\bigl\|\nabla_{\Omega} f(\Omega^{(i)})\bigr\|_{F}},
the resulting updates Ω_{tg}^{(i+1)} are not exactly semi-orthogonal and, in order to restore this property, a retraction onto the manifold is necessary after each iteration. The retraction can be implemented with the help of the MATLAB command for a "thin" singular value decomposition as
(141)  [Q_L,\, D,\, Q_R] = \mathrm{svd}\bigl(\Omega_{tg}^{(i+1)},\, 0\bigr),
(142)  \Omega^{(i+1)} = Q_L\, Q_R^{T}.
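The tangent update and the retraction can be coded compactly. The sketch below assumes an external routine grad_f(Omega) returning the ordinary gradient of (136) as in (137) (a hypothetical helper, not shown), and mirrors Equations (138)–(142) in NumPy rather than MATLAB.

```python
import numpy as np

def stiefel_ascent_step(Omega, grad_f, step_scale=0.02):
    """One natural-gradient ascent step on the Stiefel manifold with SVD retraction."""
    G = grad_f(Omega)                               # ordinary gradient, Equation (137)
    nat_G = G - Omega @ G.T @ Omega                 # natural gradient, Equation (138)
    mu = step_scale / np.linalg.norm(nat_G, 'fro')  # step-size, Equation (140)
    Omega_tg = Omega + mu * nat_G                   # tangent update, Equation (139)
    Q_L, _, Q_Rt = np.linalg.svd(Omega_tg, full_matrices=False)  # thin SVD
    return Q_L @ Q_Rt                               # retraction, Equations (141)-(142)
```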
The procedure is then repeated until convergence to a maximum of the criterion at a given iteration i_max. After that, the solution (Ω^{(i_max)})^T T identifies the subspace of the spatial filters, but not each of their individual directions. In order to determine them, one can solve a CSP problem within the previously identified subspace. We compute the generalized eigenvalues of the matrix pencil ((Ω^{(i_max)})^T P̆ Ω^{(i_max)}, (Ω^{(i_max)})^T Q̆ Ω^{(i_max)}) and use the resulting principal and minor eigenvectors v̆_j to form the spatial filter matrix
\breve{V} = \bigl[\,\breve{v}_1, \ldots, \breve{v}_{p/2},\; \breve{v}_{n-p+1+p/2}, \ldots, \breve{v}_n\,\bigr].
The final matrix of spatial filters that solves the problem is the product of the whitening matrix T, the projection matrix (Ω^{(i_max)})^T and a CSP rotation matrix V̆^T which operates within the subspace, i.e.,
W T = V ˘ T ( Ω ( i m a x ) ) T T .
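Once the subspace has been identified, the remaining CSP rotation within it reduces to a small generalized eigenvalue problem. The following sketch (with scipy.linalg.eigh for the symmetric-definite pencil) illustrates the composition in (144); the index selection assumes an even number p of filters split equally between the two classes, and is only an illustration of the step described above.

```python
import numpy as np
from scipy.linalg import eigh

def final_spatial_filters(Omega, Pm_w, Qm_w, T, p):
    """Compose W^T = V^T Omega^T T from the CSP rotation within the subspace."""
    A = Omega.T @ Pm_w @ Omega                  # projected class-1 centroid
    B = Omega.T @ Qm_w @ Omega                  # projected class-2 centroid
    w, V = eigh(A, B)                           # generalized eigenvalues, ascending order
    keep = np.r_[np.arange(V.shape[1] - p // 2, V.shape[1]),  # principal eigenvectors
                 np.arange(p // 2)]                           # minor eigenvectors
    V_sel = V[:, keep]
    return V_sel.T @ Omega.T @ T                # rows are the p spatial filters
```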
The main steps of the Sub-ABLD iteration are summarized in Algorithm 1.
Algorithm 1 Sub-ABLD algorithm
1: function Sub-ABLD({P_j}, {Q_j}, p, α, β, η)
2:   Compute the average covariance matrices P and Q.
3:   Compute the total covariance matrix Cov(x) = p(c_1) P + p(c_2) Q.
4:   Compute the whitening transform matrix T using (133).
5:   Whiten the trial and average covariance matrices to respectively obtain {P̆_j}, {Q̆_j} and P̆, Q̆.
6:   Compute the scaling parameter κ using (70) and initialize the iteration counter: i = 0.
7:   Initialize the semi-orthogonal matrix Ω^{(i)} = I_{n×p}.
8:   repeat
9:     Compute the robust criterion f(Ω^{(i)}) using (136).
10:    Compute the ordinary gradient ∂f(Ω^{(i)})/∂Ω using (137).
11:    Compute the natural gradient on the Stiefel manifold ∇_Ω f(Ω^{(i)}) using (138).
12:    Obtain the tangent matrix Ω_tg^{(i+1)} using (139).
13:    Obtain the projection matrix Ω^{(i+1)} using (141) and (142) (the retraction onto the manifold).
14:    Increase the iteration counter: i = i + 1.
15:  until convergence at iteration i_max.
16:  Collect in V̆ the princip./minor eigenvect. of the pencil ((Ω^{(i_max)})^T P̆ Ω^{(i_max)}, (Ω^{(i_max)})^T Q̆ Ω^{(i_max)}).
17:  return W^T = V̆^T (Ω^{(i_max)})^T T.
18: end function
The proposed subspace algorithm (Sub-ABLD) is similar in structure to the one presented in [1] for Beta divergences. Although they optimize different criteria, the main difference between both subspace algorithms lies in the specific way that the updates of the estimates are implemented. In [1] the authors opted for multiplicative updates that require the determination of the gradient of the criterion in the space of skew-symmetric matrices, whereas our proposal performs tangent updates on the manifold of semi-orthogonal matrices, followed by a projection or retraction onto the manifold. These updates are quite common in the research field of Independent Component Analysis [50,51,52,53].

10. Experimental Study

The discrimination of two-class MI movements consists of the following steps. The MI EEG signals are acquired, preprocessed and spatially filtered. The filtered signals are then used for extracting the required features, which are classified using a linear classifier. In the following subsections, we describe the experimental steps used in the testing and comparison of the proposed algorithm.

10.1. Simulations Data and Preprocessing

Initially, we have explored the robustness of the proposed algorithm in a controlled situation with synthetic data. Two sets of symmetric positive definite (SPD) matrices that represent the trial covariance matrices of the two classes were randomly generated. Each set consists of 200 trials. For further preprocessing, both sets of matrices were concatenated. The concatenated data were cross-validated using k-fold cross-validation (k = 10). This divides the data into 10 equal subsets, in which a single subset was used as testing data and the remaining 9 subsets were used for training the classifier. The performance of the proposed algorithm was studied in the presence of outliers. The outliers consist of matrices with abnormally high variances that were inserted in the training set of both classes. The proposed Sub-ABLD algorithm was tested by progressively varying the percentage of outliers in the training trials from 0% to 30% (see Figure 5). The robustness of Sub-ABLD and its comparison with respect to the other algorithms mentioned in the figure will be addressed in Section 11.
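As an illustration of this controlled setting, the sketch below generates random SPD trial covariances for two classes and injects a chosen fraction of outlier trials with abnormally high variance. The particular generative model (sample-covariance-like products plus a small ridge, and the scale of the outliers) is our own illustrative choice, not the exact simulation protocol of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_spd(n, scale=1.0):
    """Random SPD matrix built as a scaled sample covariance plus a small ridge."""
    A = rng.standard_normal((n, 4 * n))
    return scale * (A @ A.T) / (4 * n) + 1e-3 * np.eye(n)

n_ch, n_trials = 10, 200
class1 = [random_spd(n_ch, scale=1.0) for _ in range(n_trials)]
class2 = [random_spd(n_ch, scale=1.5) for _ in range(n_trials)]

outlier_fraction = 0.10                        # varied from 0.0 to 0.30 in the experiment
n_out = int(outlier_fraction * n_trials)
for k in range(n_out):                         # replace the first trials by outliers
    class1[k] = random_spd(n_ch, scale=25.0)   # abnormally high variance
    class2[k] = random_spd(n_ch, scale=25.0)
```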

10.2. EEG Dataset and Preprocessing

To evaluate the proposed Sub-ABLD algorithm on BCI competition datasets, we utilized two datasets from competition III, data set 3a and data set 4a (which can be downloaded from [54]), and one dataset from competition IV, data set 2a (which can be downloaded from [55]). The data were acquired during MI movements. The first dataset, data set 3a [56] from BCI competition III [57], was acquired from 3 healthy subjects, namely K3, K6 and L1, using a 60-channel EEG acquisition system. The signals were recorded while executing MI movements of the left hand, right hand, foot and tongue. The signals were sampled at a frequency of 250 Hz and bandpass filtered in the frequency range between 1 and 50 Hz. The data set consists of two sessions, i.e., training and testing sessions. For subject K3, both sessions consist of 45 trials for each class, whereas the other two subjects, i.e., K6 and L1, performed 30 trials per class in both sessions. For the second dataset, data set 4a [44] of BCI competition III [57], the signals were acquired from five subjects, namely AA, AL, AV, AW and AY, using a 118-channel EEG system. The acquisition was done during imagery movements of the left hand, right hand and right foot. The recorded signals were down-sampled to 100 Hz, and a band-pass filter between 0.05 and 200 Hz was applied to the signals. The data set of each subject consists of 280 total trials. The size of the training sessions differs from that of the testing sessions: the training sessions consist of 168, 224, 84, 56 and 28 trials for subjects AA, AL, AV, AW and AY, respectively, and the remaining trials constitute the testing sets of the corresponding subjects. The last dataset, data set 2a [46] of BCI competition IV [58], was acquired from nine subjects (A1 to A9) while performing left hand, right hand, foot and tongue MI movements using 22 electrodes. The sampling frequency of the signals was 250 Hz, and the band-pass filtering of the acquired signals was performed between 0.5 and 100 Hz. For each subject, the data were acquired on different days and each set consists of 72 trials for each class.
In this approach, the performances were obtained using only two MI movements and considering all the channels from each dataset. The preprocessing step was implemented identically for all the algorithms. First, a fifth-order band-pass filter with cut-off frequencies between 8 and 30 Hz was applied to the raw EEG signals. A time window of 2 s during the imagination of the movements was extracted for each trial. The extracted trials were concatenated for each class, and a k-fold cross-validation (k = 10) was applied to the concatenated data. The CV process divides the data into 10 equal sets, where one set was used as testing data and the remaining 9 sets were used for training. Finally, the optimal spatial filters were obtained using the training dataset. The number of filters selected for each class is k = 3, so the total number of spatial filters is p = 6.
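A possible NumPy/SciPy sketch of this preprocessing chain is shown below. The window offset after the cue and the filter design (Butterworth with zero-phase filtfilt) are reasonable assumptions on our part, since the text only fixes the filter order, the 8–30 Hz band and the 2 s window.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess_trial(x_raw, fs, band=(8.0, 30.0), order=5, t_start=0.5, t_len=2.0):
    """Band-pass filter one raw EEG trial and extract a 2 s window.

    x_raw: (channels, samples) array; fs: sampling frequency in Hz.
    The offset t_start is a placeholder; the exact cue timing depends on the dataset.
    """
    b, a = butter(order, [band[0] / (fs / 2), band[1] / (fs / 2)], btype='bandpass')
    x_f = filtfilt(b, a, x_raw, axis=1)             # zero-phase band-pass filtering
    i0 = int(t_start * fs)
    return x_f[:, i0:i0 + int(t_len * fs)]          # 2 s window during imagination
```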

10.3. Feature Extraction and Feature Classification

For both the synthetic and the BCI datasets, the obtained spatial filters were used for filtering the training and testing data. The training and testing features were obtained by taking the log-variance of the filtered data, so that their distribution is closer to Gaussian. The Linear Discriminant Analysis (LDA) [59] classifier was used for discriminating the features of the two classes. The classifier was trained using the training features and its performance was evaluated using the testing features. The preprocessing, feature extraction and classification steps were repeated 10 times and, finally, the average performance was obtained.
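For reference, a minimal sketch of the feature extraction and classification stage using scikit-learn's LDA follows; the spatial filter matrix W_T is assumed to come from any of the compared algorithms, and the variable names are only illustrative.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def logvar_features(W_T, trials):
    """Log-variance of the spatially filtered trials; W_T has shape (p, channels)."""
    return np.array([np.log(np.var(W_T @ x, axis=1)) for x in trials])

# X_train = logvar_features(W_T, train_trials); X_test = logvar_features(W_T, test_trials)
# clf = LinearDiscriminantAnalysis().fit(X_train, y_train)
# accuracy = clf.score(X_test, y_test)
```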

10.4. Selection of α, β and η Values

The selection of α and β is one of the crucial steps for the proposed algorithm. Depending on the values of α and β, the AB Log-Det divergence reduces to different particular divergences [6]. The proposed algorithm performed better when α = β, a situation where the AB Log-Det divergence is symmetric, i.e., invariant under the permutation of its arguments. In this experiment, we have observed the performance for various values of α = β and η, and a suitable configuration of the parameters was selected for each dataset.
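In practice, this selection can be organized as a simple grid search over α = β and η on the training folds, as sketched below. The callable cross_validated_accuracy stands for the full Sub-ABLD training and 10-fold evaluation pipeline (a hypothetical helper, not shown here), and the candidate grids are only examples.

```python
import numpy as np

def select_hyperparameters(cross_validated_accuracy):
    """Grid search over alpha = beta and eta.

    cross_validated_accuracy(alpha, beta, eta) is a user-supplied callable returning
    the mean 10-fold cross-validated accuracy (hypothetical helper, not shown).
    """
    alphas = np.arange(0.25, 2.01, 0.25)     # candidate values for alpha = beta
    etas = [0.25, 0.5, 1.0, 2.0]             # candidate regularization strengths
    best_params, best_acc = None, -np.inf
    for a in alphas:
        for eta in etas:
            acc = cross_validated_accuracy(alpha=a, beta=a, eta=eta)
            if acc > best_acc:
                best_params, best_acc = (a, eta), acc
    return best_params, best_acc
```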

11. Results and Discussion

The performance of the proposed Sub-ABLD algorithm is compared with the performance of the existing algorithms CSP, JADE, MAPCSP and divCSP-WS, for both the synthetic and the real BCI competition datasets. The JADE algorithm performs a joint approximate diagonalization of the trial covariance matrices of the classes [45]. MAPCSP is a Bayesian algorithm that tries to find the maximum a posteriori estimates of the patterns and sources in a generative model with additive Gaussian isotropic noise [14]. The subspace implementation of divCSP-WS finds a balance between the maximization of the Beta divergence between the conditional covariances of the filtered outputs for each class and the minimization of the variability within each class [1]. This algorithm contains two hyperparameters: the regularization factor λ and the real scalar β* that specifies the chosen Beta divergence. The factor λ admits an equivalence in terms of the regularization parameter η of Sub-ABLD, which links them through the mapping λ = η/(1 + η), while the parameter β* of the Beta divergence was chosen in the simulations to maximize the performance.
In order to carry out a fair performance comparison, a total of six features (i.e., p = 6) has been selected for all the algorithms. The implementations of the JADE and divCSP-WS algorithms were taken from the webpages of their authors: the baseline divCSP-WS algorithm was downloaded from [60], while the implementation of the JADE algorithm can be found at [61]. The performance comparison between all the algorithms is presented in the following subsections.

11.1. Observations for Simulated Data

To study the performance of the proposed algorithm in the presence of outliers, the experiment was done by increasing the percentage of outlier trials in the training set of both classes. The performance of the proposed Sub-ABLD algorithm was obtained for (α, β) = (1.5, 1.5) and η = 1. The performances of the above algorithms for an increasing percentage of outliers in the training set are presented in Figure 5. It can be observed that CSP and MAPCSP degrade the most in the presence of outliers. The performances of JADE and divCSP-WS are much more robust than those of CSP and MAPCSP, but overall the proposed Sub-ABLD algorithm seems to outperform the compared algorithms in the presence of outliers.

11.2. Observations for BCI Competition Datasets

In this section, the proposed algorithm is tested using three BCI competition datasets. For each dataset, the performance of the proposed algorithm was observed for different values of (α, β) and η, and the (α, β) and η values yielding the maximum performance of the Sub-ABLD algorithm were selected. The selected performance is compared with the performances of the other existing algorithms. A further analysis is done by using a box plot comparison for all the algorithms. The box plot analysis shows the distribution of the performances. In a box plot representation, the line inside the box represents the median performance. The upper and lower hinges of the box denote the 75th and 25th percentiles of the overall performance distribution. The whiskers are symbolized by the two lines outside the box; the upper and lower whiskers represent the maximum and minimum performance observed.
For BCI competition III dataset 3a, Figure 6a shows the comparison of the highest average performance of the Sub-ABLD algorithm with the average performances of the other existing algorithms. From the figure, it is observed that the Sub-ABLD algorithm outperforms the other existing algorithms with an average classification accuracy of 89% for this dataset. The box plot comparison is shown in Figure 6b. Although the median performance is slightly higher for CSP, JADE and divCSP, their 25th percentile performance is much smaller than that of the Sub-ABLD algorithm. As we will see later, this is a consequence of the fact that, with the Sub-ABLD algorithm, the most difficult subjects have attained a significant improvement in their classification performance.
Figure 7 shows the observed average performances using BCI competition III dataset 4a. For this dataset, the algorithms JADE, Sub-ABLD and divCSP-WS perform essentially similarly and slightly above the average performance of CSP, which is 88.1%. From the box plot of the results, we can observe that the 25th percentiles of these four algorithms are also quite close.
Similar results have been obtained for the BCI competition IV dataset 2a, which are shown in Figure 8. Again, the algorithms JADE, Sub-ABLD and divCSP-WS perform essentially the same as CSP, whose average performance is 81%. In the box plot, we can observe that the quartiles of these algorithms are approximately coincident.
To analyze the effect of the different divergences on the performance, we varied the parameters (α, β) for a single subject (subject K6 from BCI competition III dataset 3a, which is one of the subjects with the worst performance in the experiment) and obtained the corresponding performance. The values of (α, β) were varied to cover the interval [0, 2] × [0, 2] with a mesh of 0.1 spacing. The observed performance is shown in Figure 9. This figure reveals a tendency to improve the classification accuracy of the worst user for values of α and β that are close to the diagonal and large enough so that they can effectively down-weight the contribution, in the estimating equations, coming from the largest and smallest generalized eigenvalues.
The proposed Sub-ABLD algorithm has been tested on both simulated and real EEG signals. On the one hand, the results with synthetic data indicate that the proposed Sub-ABLD exhibits a certain robustness to the presence of outlier trials in the dataset. On the other hand, the analysis of real EEG signals is also challenging because of the possible presence of artifacts and non-stationarities. We have presented the performance of the Sub-ABLD algorithm using several real BCI datasets. For BCI competition III dataset 3a, we can observe that the proposed Sub-ABLD algorithm also outperforms the other algorithms, whereas the performance of the proposed algorithm is very similar to the ones obtained by JADE, divCSP-WS and CSP for the other two datasets, i.e., BCI competition III dataset 4a and BCI competition IV dataset 2a. Additionally, the analysis of the box plots reveals that the proposed Sub-ABLD algorithm increased the classification performance of the subjects that do not perform well with the other methods, while retaining an almost similar performance for the remaining subjects. These observations meet our initial goal of developing a robust algorithm. The classification performance is also affected by the regularization parameter η that controls the penalty term. In general, data with outliers give the best performance for higher values of η and, otherwise, smaller values are preferable. In this study, the value of η has been kept constant across subjects in each dataset.

12. Conclusions

In this paper, we have explained how one can use and optimize the recently proposed family of Alpha-Beta Log-Det divergences. For this purpose, we have summarized the key properties of these divergences and derived an original explicit formula for their gradient. As an illustrative application, we have adopted the problem of spatial filter selection for the classification of two-class imagery movements. We have reexamined the relation between the Common Spatial Pattern criterion with a predefined number of spatial filters for each class and its interpretation as an Alpha-Beta Log-Det divergence optimization problem, showing that a scaling factor in one of the arguments of the divergence is necessary for the equivalence of the solutions. We have proposed a subspace algorithm (Sub-ABLD) for obtaining the spatial filters that retain the discriminative information of two-class MI movements. This algorithm was tested with synthetic and real datasets and compared with other existing algorithms. The simulations have confirmed the possibility of tuning the hyperparameters of the divergence so as to improve the robustness of the obtained solutions without deteriorating the expected accuracy.

Acknowledgments

This work was supported by the Spanish Government under the MICINN project TEC2014-53103-P, and by the MES Russian Federation under the grant 14.756.31.0001.

Author Contributions

Deepa Beeta Thiyam and Sergio Cruces have collaborated in the task of writing the manuscript. Sergio Cruces was in charge of developing the theoretical content and has coordinated the proposal, while Deepa Beeta Thiyam was in charge of the design of the proposed algorithm and simulations. Javier Olias and Andrzej Cichocki have respectively collaborated in the experimental part and in the study of the AB log-det divergences. All authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Appendix A.1 Obtaining the Upper-Bound of the AB Log-Det Divergence

The divergence D_{AB}^{(\alpha,\beta)}(P \,\|\, Q) depends on the generalized eigenvalues of the matrix pencil (P, Q), which have been denoted by λ_i for i = 1, …, n. Similarly, the divergence of the compressed arguments D_{AB}^{(\alpha,\beta)}(W^T P W \,\|\, W^T Q W) depends on μ_i for i = 1, …, p, the eigenvalues of the matrix pencil (W^T P W, W^T Q W). The Cauchy interlacing inequalities [62]
\lambda_j \leq \mu_j \leq \lambda_{n-p+j}
provide upper and lower bounds for μ_j in terms of the eigenvalues of the uncompressed matrix pencil. This property implies that each eigenvalue μ_j, for j = 1, …, p, should lie in a sequence of possibly partially overlapping intervals given by [λ_j, λ_{n−p+j}].
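The interlacing inequalities are easy to check numerically. The following sketch draws a random pencil and a random compression matrix W and verifies (A1) with scipy.linalg.eigh, whose generalized eigenvalues are returned in ascending order; the random construction is purely illustrative.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
n, p = 8, 3
A = rng.standard_normal((n, n)); P = A @ A.T + n * np.eye(n)
B = rng.standard_normal((n, n)); Q = B @ B.T + n * np.eye(n)
W = rng.standard_normal((n, p))

lam = eigh(P, Q, eigvals_only=True)                      # eigenvalues of the pencil (P, Q)
mu = eigh(W.T @ P @ W, W.T @ Q @ W, eigvals_only=True)   # eigenvalues of the compressed pencil

for j in range(p):                                       # Cauchy interlacing, Equation (A1)
    assert lam[j] - 1e-10 <= mu[j] <= lam[n - p + j] + 1e-10
```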
The divergence D_{AB}^{(\alpha,\beta)}(\lambda \,\|\, 1) is minimum (zero) for λ = 1, strictly monotone descending for λ < 1 and strictly monotone ascending for λ > 1. So we can bound the AB log-det divergence in each interval by
D_{AB}^{(\alpha,\beta)}(\mu_j \,\|\, 1) \leq \max\bigl\{ D_{AB}^{(\alpha,\beta)}(\lambda_j \,\|\, 1),\; D_{AB}^{(\alpha,\beta)}(\lambda_{n-p+j} \,\|\, 1) \bigr\},
where the maximum value occurs at one of the extreme eigenvalues of the interval. By construction, the interlacing property prevents any eigenvalue with a given index from appearing more than once as the upper extreme or as the lower extreme of an interval. This fact, combined with the strict monotonicity of the divergence, implies that the maxima of the divergence for the different intervals can only be attained by eigenvalues with different indices. Finally, the result of adding these p maximum values cannot exceed the sum of the divergences for those eigenvalues which maximize the divergence from unity,
(A3)  D_{AB}^{(\alpha,\beta)}\bigl(W^T P W \,\|\, W^T Q W\bigr) = \sum_{j=1}^{p} D_{AB}^{(\alpha,\beta)}(\mu_j \,\|\, 1)
(A4)  \leq \sum_{j=1}^{p} \max\bigl\{ D_{AB}^{(\alpha,\beta)}(\lambda_j \,\|\, 1),\; D_{AB}^{(\alpha,\beta)}(\lambda_{n-p+j} \,\|\, 1) \bigr\}.
With the help of the permutation π of the indices 1, …, n that sorts the divergences of the eigenvalues from unity in descending order,
D_{AB}^{(\alpha,\beta)}(\lambda_{\pi_1} \,\|\, 1) \geq D_{AB}^{(\alpha,\beta)}(\lambda_{\pi_2} \,\|\, 1) \geq \cdots \geq D_{AB}^{(\alpha,\beta)}(\lambda_{\pi_n} \,\|\, 1),
we can write
(A6)  D_{AB}^{(\alpha,\beta)}\bigl(W^T P W \,\|\, W^T Q W\bigr) = \sum_{j=1}^{p} D_{AB}^{(\alpha,\beta)}(\mu_j \,\|\, 1)
(A7)  \leq \sum_{j=1}^{p} \max\bigl\{ D_{AB}^{(\alpha,\beta)}(\lambda_j \,\|\, 1),\; D_{AB}^{(\alpha,\beta)}(\lambda_{n-p+j} \,\|\, 1) \bigr\}
(A8)  \leq \sum_{j=1}^{p} D_{AB}^{(\alpha,\beta)}(\lambda_{\pi_j} \,\|\, 1),
which is the desired upper-bound.

Appendix A.2 Proof of the Link between the Optimization of the Divergence and the CSP Solution

The fact that any Rayleigh quotient is bounded by the maximum and minimum eigenvalues of the associated matrix pencil
\lambda_1 \leq \frac{w_i^T P w_i}{w_i^T Q w_i} \leq \lambda_n
can be used to recursively prove that the minimax value of the divergence is equal to
\min_{\dim\{W\} = n-i+1}\; \max_{w \in W}\; D_{AB}^{(\alpha,\beta)}\bigl(w_i^T P w_i \,\|\, \kappa\, w_i^T Q w_i\bigr) = \min_{\dim\{W\} = n-i+1}\; \max_{w \in W}\; D_{AB}^{(\alpha,\beta)}\!\left(\frac{w_i^T P w_i}{w_i^T Q w_i} \,\Big\|\, \kappa\right) = D_{AB}^{(\alpha,\beta)}(\lambda_{\pi_i} \,\|\, \kappa),
where permutation π sorts the divergence of the eigenvalues from κ in descending order
D_{AB}^{(\alpha,\beta)}(\lambda_{\pi_1} \,\|\, \kappa) \geq D_{AB}^{(\alpha,\beta)}(\lambda_{\pi_2} \,\|\, \kappa) \geq \cdots \geq D_{AB}^{(\alpha,\beta)}(\lambda_{\pi_n} \,\|\, \kappa).
The minimax value is then attained for the eigenvectors
v_{\pi_i} = \arg\min_{\dim\{W\} = n-i+1}\; \max_{w \in W}\; D_{AB}^{(\alpha,\beta)}\bigl(w_i^T P w_i \,\|\, \kappa\, w_i^T Q w_i\bigr), \qquad i = 1, \ldots, p.
For the set of solutions {v_{π_1}, …, v_{π_p}} in (A12) to coincide with the set of spatial filters {v_1^{(c_1)}, …, v_k^{(c_1)}, v_{n-(p-k)+1}^{(c_1)}, …, v_n^{(c_1)}} that define W_{CSP}, the eigenvalues λ_{π_1}, …, λ_{π_p} that maximize their divergence from κ should all belong to the upper and lower sets of eigenvalues defined in (57). For this to be true, it is necessary and sufficient that the divergence of the last selected eigenvalue λ_{π_p} from κ strictly upper-bounds all the divergences between an inner eigenvalue λ_i and κ, in the sense that
D_{AB}^{(\alpha,\beta)}(\lambda_{\pi_p} \,\|\, \kappa) > \max_{i \,\in\, [k+1,\; n-(p-k)]} D_{AB}^{(\alpha,\beta)}(\lambda_i \,\|\, \kappa).
The domain of κ for which this strict inequality holds true is
\kappa \in \bigl(\kappa_{\inf},\, \kappa_{\sup}\bigr),
where the bounds
(A15)  \kappa_{\inf} \triangleq K\bigl(\lambda_{k+1},\, \lambda_{n-(p-k)+1}\bigr)
(A16)  \kappa_{\sup} \triangleq K\bigl(\lambda_{k},\, \lambda_{n-(p-k)}\bigr)
respectively equalize the value of the divergences
D_{AB}^{(\alpha,\beta)}(\lambda_{\pi_p} \,\|\, \kappa_{\inf}) = D_{AB}^{(\alpha,\beta)}(\lambda_{k+1} \,\|\, \kappa_{\inf}) = D_{AB}^{(\alpha,\beta)}(\lambda_{n-(p-k)+1} \,\|\, \kappa_{\inf})
and
D_{AB}^{(\alpha,\beta)}(\lambda_{\pi_p} \,\|\, \kappa_{\sup}) = D_{AB}^{(\alpha,\beta)}(\lambda_{k} \,\|\, \kappa_{\sup}) = D_{AB}^{(\alpha,\beta)}(\lambda_{n-(p-k)} \,\|\, \kappa_{\sup}).

Appendix A.3 Differential of the Inverse Square Root of a SPD Matrix

Let X be any symmetric positive definite (SPD) matrix. We would like to obtain the differential of its inverse square root, dX^{-1/2}, in terms of the matrix X and its differential dX, and later use this result to simplify the desired expression dX^{-1/2} X^{1/2}. For this purpose, we start from the trivial identity I_p = X^{-1/2} X^{1/2} and take differentials on both sides of this equality; with the help of the product rule for differentials we obtain
0 = dI_p = d\bigl(X^{-1/2} X^{1/2}\bigr) = dX^{-1/2}\, X^{1/2} + X^{-1/2}\, dX^{1/2}.
Solving for the differential,
dX^{-1/2} = -\,X^{-1/2}\, dX^{1/2}\, X^{-1/2},
we see it as a function of X and dX^{1/2}. Then, we simplify dX^{1/2} with the help of another trivial identity, X^{1/2} (X^{1/2})^T = X. We take again differentials on both sides of the equality
d\bigl(X^{1/2} (X^{1/2})^T\bigr) = dX^{1/2}\,(X^{1/2})^T + X^{1/2}\,(dX^{1/2})^T = dX
and obtain the special solution
dX^{1/2} = \tfrac{1}{2}\, dX\, (X^{-1/2})^T.
The substitution of (A22) in (A20) yields the differential of the inverse symmetric square root of the SPD matrix
dX^{-1/2} = -\,X^{-1/2}\,\bigl(\tfrac{1}{2}\, dX\, (X^{-1/2})^T\bigr)\, X^{-1/2}.
Finally, by the symmetry of X^{-1/2}, we prove the desired result
(A24)  dX^{-1/2}\, X^{1/2} = -\tfrac{1}{2}\, X^{-1/2}\, dX\, (X^{-1/2})^T
(A25)  \phantom{dX^{-1/2}\, X^{1/2}} = -\tfrac{1}{2}\, X^{-1/2}\, dX\, X^{-1/2}.

Appendix A.4 The Gradient of the KL Divergence between Gaussian Densities

The Kullback–Leibler (KL) divergence between the Gaussian densities p ( x | c 2 ) and p ( x | c 1 ) , of zero mean and with respective covariance matrices C o v ( Y | c 1 ) = W T P W and C o v ( Y | c 2 ) = W T Q W , is equal to
\mathrm{Div}_{KL}\bigl(p(x|c_2)\,\|\,p(x|c_1)\bigr) = \tfrac{1}{2}\log\bigl|W^T P W\bigr| - \tfrac{1}{2}\log\bigl|W^T Q W\bigr| + \tfrac{1}{2}\operatorname{tr}\bigl\{(W^T P W)^{-1}(W^T Q W) - I_p\bigr\}.
This subsection explains the operations involved in obtaining its gradient. The first differential of the log-determinant terms is
(A27)  d\log\bigl|W^T P W\bigr| = \operatorname{tr}\bigl\{(W^T P W)^{-1}\, d(W^T P W)\bigr\}
(A28)  \phantom{d\log\bigl|W^T P W\bigr|} = \operatorname{tr}\bigl\{(W^T P W)^{-1}\bigl(dW^T\, P W + W^T P\, dW\bigr)\bigr\}
(A29)  \phantom{d\log\bigl|W^T P W\bigr|} = 2\operatorname{tr}\bigl\{\bigl[P W\, (W^T P W)^{-1}\bigr]\, dW^T\bigr\}.
By using the relationship between the first differential and the gradient
d\log\bigl|W^T P W\bigr| = \operatorname{tr}\bigl\{\bigl[\nabla_W \log|W^T P W|\bigr]\, dW^T\bigr\},
one can identify from (A29) that
\nabla_W\, \tfrac{1}{2}\log\bigl|W^T P W\bigr| = P W\, (W^T P W)^{-1}
and, similarly,
\nabla_W\bigl[-\tfrac{1}{2}\log\bigl|W^T Q W\bigr|\bigr] = -\,Q W\, (W^T Q W)^{-1}.
On the other hand, the first differential of the trace term simplifies to
d\,\tfrac{1}{2}\operatorname{tr}\bigl\{(W^T P W)^{-1}(W^T Q W) - I_p\bigr\}
= \tfrac{1}{2}\operatorname{tr}\bigl\{(W^T P W)^{-1}\, d(W^T Q W)\bigr\} + \tfrac{1}{2}\operatorname{tr}\bigl\{d(W^T P W)^{-1}\,(W^T Q W)\bigr\}
= \tfrac{1}{2}\operatorname{tr}\bigl\{(W^T P W)^{-1}\, d(W^T Q W)\bigr\} - \tfrac{1}{2}\operatorname{tr}\bigl\{(W^T P W)^{-1}\, d(W^T P W)\,(W^T P W)^{-1}(W^T Q W)\bigr\}
= \tfrac{1}{2}\operatorname{tr}\bigl\{(W^T P W)^{-1}\bigl(dW^T\, Q W + W^T Q\, dW\bigr)\bigr\} - \tfrac{1}{2}\operatorname{tr}\bigl\{(W^T P W)^{-1}\bigl(dW^T\, P W + W^T P\, dW\bigr)\,(W^T P W)^{-1}(W^T Q W)\bigr\}
= \tfrac{1}{2}\operatorname{tr}\bigl\{2\, Q W\, (W^T P W)^{-1}\, dW^T\bigr\} - \tfrac{1}{2}\operatorname{tr}\bigl\{2\, P W\, (W^T P W)^{-1}(W^T Q W)(W^T P W)^{-1}\, dW^T\bigr\},
from which one can also identify
\nabla_W\, \tfrac{1}{2}\operatorname{tr}\bigl\{(W^T P W)^{-1}(W^T Q W) - I_p\bigr\} = Q W\, (W^T P W)^{-1} - P W\, (W^T P W)^{-1}(W^T Q W)(W^T P W)^{-1}.
Once we have obtained in (A31), (A32) and (A34) the gradients of the partial terms that are involved in the definition (A26) of the KL divergence, their addition yields the complete gradient of the KL divergence with respect to W, which is given by
\nabla_W\, \mathrm{Div}_{KL}\bigl(p(x|c_2)\,\|\,p(x|c_1)\bigr) = -\,Q W\, (W^T Q W)^{-1} + P W\, (W^T P W)^{-1} + Q W\, (W^T P W)^{-1} - P W\, (W^T P W)^{-1}(W^T Q W)(W^T P W)^{-1}.
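The reconstructed gradient (A35) can be validated with a central finite-difference check of (A26). The sketch below is only a numerical sanity test of the formulas above, with randomly generated P, Q and W.

```python
import numpy as np

def kl_div(W, P, Q):
    """Equation (A26): KL divergence between the projected Gaussian densities."""
    A, B = W.T @ P @ W, W.T @ Q @ W
    return 0.5 * (np.linalg.slogdet(A)[1] - np.linalg.slogdet(B)[1]
                  + np.trace(np.linalg.solve(A, B)) - W.shape[1])

def kl_grad(W, P, Q):
    """Equation (A35), as reconstructed above."""
    A_inv = np.linalg.inv(W.T @ P @ W)
    B_inv = np.linalg.inv(W.T @ Q @ W)
    return (P @ W @ A_inv - Q @ W @ B_inv + Q @ W @ A_inv
            - P @ W @ A_inv @ (W.T @ Q @ W) @ A_inv)

rng = np.random.default_rng(2)
n, p = 6, 3
Ap = rng.standard_normal((n, n)); P = Ap @ Ap.T + n * np.eye(n)
Aq = rng.standard_normal((n, n)); Q = Aq @ Aq.T + n * np.eye(n)
W = rng.standard_normal((n, p))

G, eps = kl_grad(W, P, Q), 1e-6
G_num = np.zeros_like(W)
for i in range(n):
    for j in range(p):
        E = np.zeros_like(W); E[i, j] = eps
        G_num[i, j] = (kl_div(W + E, P, Q) - kl_div(W - E, P, Q)) / (2 * eps)
print(np.max(np.abs(G - G_num)))   # should be close to zero
```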

References

  1. Samek, W.; Kawanabe, M.; Müller, K.R. Divergence-based framework for common spatial patterns algorithms. IEEE Rev. Biomed. Eng. 2014, 7, 50–72. [Google Scholar] [CrossRef] [PubMed]
  2. Huang, Z.; Wang, R.; Shan, S.; Li, X.; Chen, X. Log-Euclidean Metric Learning on Symmetric Positive Definite Manifold with Application to Image Set Classification. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015.
  3. Salzmann, M.; Hartley, R. Dimensionality Reduction on SPD Manifolds: The Emergence of Geometry-Aware Methods. IEEE Trans. Pattern Anal. Mach. Intell. 2016. [Google Scholar] [CrossRef]
  4. Sra, S.; Hosseini, R. Geometric Optimization in Machine Learning Algorithmic Advances in Riemannian Geometry and Applications: For Machine Learning, Computer Vision, Statistics, and Optimization; Springer: Berlin/Heidelberg, Germany, 2016. [Google Scholar]
  5. Horev, I.; Yger, F.; Sugiyama, M. Geometry-aware principal component analysis for symmetric positive definite matrices. In Proceedings of the 7th Asian Conference on Machine Learning, Hong Kong, China, 20–22 November 2015; pp. 1–16.
  6. Cichocki, A.; Cruces, S.; Amari, S.i. Log-Determinant Divergences Revisited: Alpha-Beta and Gamma Log-Det Divergences. Entropy 2015, 17, 2988–3034. [Google Scholar] [CrossRef]
  7. Minh, H.Q. Infinite-dimensional Log-Determinant divergences II: Alpha-Beta divergences. arXiv 2017. [Google Scholar]
  8. Dornhege, G. Toward Brain-Computer Interfacing; MIT Press: Cambridge, MA, USA, 2007. [Google Scholar]
  9. Wolpaw, J.; Wolpaw, E.W. Brain-computer Interfaces: Principles and Practice; Oxford University Press: Oxford, UK, 2012. [Google Scholar]
  10. Pfurtscheller, G.; Da Silva, F.L. Event-related EEG/MEG synchronization and desynchronization: Basic principles. Clin. Neurophysiol. 1999, 110, 1842–1857. [Google Scholar] [CrossRef]
  11. Fukunaga, K.; Koontz, W.L.G. Application of the Karhunen-Loeve Expansion to Feature Selection and Ordering. IEEE Trans. Comput. 1970, C-19, 440–447. [Google Scholar] [CrossRef]
  12. Koles, Z.J. The quantitative extraction and topographic mapping of the abnormal components in the clinical EEG. Electroencephalogr. Clin. Neurophysiol. 1991, 79, 440–447. [Google Scholar] [CrossRef]
  13. Ramoser, H.; Müller-Gerking, J.; Pfurtscheller, G. Optimal spatial filtering of single trial EEG during imagined hand movement. IEEE Trans. Rehabil. Eng. 2000, 8, 441–446. [Google Scholar] [CrossRef] [PubMed]
  14. Wu, W.; Chen, Z.; Gao, X.; Li, Y.; Brown, E.N.; Gao, S. Probabilistic common spatial patterns for multichannel EEG analysis. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 639–653. [Google Scholar] [CrossRef] [PubMed]
  15. Bhatia, R. Matrix Analysis; Graduate Texts in Mathematics; Springer: Berlin/Heidelberg, Germany, 1997. [Google Scholar]
  16. Wang, H. Harmonic mean of Kullback–Leibler divergences for optimizing multi-class EEG spatio-temporal filters. Neural Process. Lett. 2012, 36, 161–171. [Google Scholar] [CrossRef]
  17. Samek, W.; Blythe, D.; Müller, K.R.; Kawanabe, M. Robust spatial filtering with beta divergence. In Proceedings of the Advances in Neural Information Processing Systems, Stateline, NV, USA, 5–10 December 2013; pp. 1007–1015.
  18. Cichocki, A.; Amari, S. Families of Alpha- Beta- and Gamma- Divergences: Flexible and Robust Measures of Similarities. Entropy 2010, 12, 1532–1568. [Google Scholar] [CrossRef]
  19. Brandl, S.; Müller, K.R.; Samek, W. Robust common spatial patterns based on Bhattacharyya distance and Gamma divergence. In Proceedings of the 2015 3rd International Winter Conference on Brain-Computer Interface (BCI), Jeongsun-Kun, Korea, 12–14 January 2015; pp. 1–4.
  20. Tao, T. Topics in Random Matrix Theory; American Mathematical Society: Providence, RI, USA, 2012; Volume 132. [Google Scholar]
  21. Blankertz, B.; Kawanabe, M.; Tomioka, R.; Hohlefeld, F.; Müller, K.R.; Nikulin, V.V. Invariant common spatial patterns: Alleviating nonstationarities in brain-computer interfacing. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 3–6 December 2007; pp. 113–120.
  22. Lotte, F.; Guan, C. Spatially regularized common spatial patterns for EEG classification. In Proceedings of the 2010 20th International Conference on Pattern Recognition (ICPR), Istanbul, Turkey, 23–26 August 2010; pp. 3712–3715.
  23. Kang, H.; Nam, Y.; Choi, S. Composite common spatial pattern for subject-to-subject transfer. IEEE Signal Process. Lett. 2009, 16, 683–686. [Google Scholar] [CrossRef]
  24. Lotte, F.; Guan, C. Learning from other subjects helps reducing brain-computer interface calibration time. In Proceedings of the 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), Dallas, TX, USA, 14–19 March 2010; pp. 614–617.
  25. Lu, H.; Plataniotis, K.N.; Venetsanopoulos, A.N. Regularized common spatial patterns with generic learning for EEG signal classification. In Proceedings of the 2009 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Minneapolis, MN, USA, 3–6 September 2009; pp. 6599–6602.
  26. Xinyi Yong, R.K.W.; Birch, G.E. Robust Common Spatial Patterns for EEG Signal Preprocessing. In Proceedings of the IEEE EMBS 30th Annual International Conference, Vancouver, BC, Canada, 20–25 August 2008; pp. 2087–2090.
  27. Kawanabe, M.; Vidaurre, C. Improving BCI performance by modified common spatial patterns with robustly averaged covariance matrices. In Proceedings of the World Congress on Medical Physics and Biomedical Engineering, Munich, Germany, 7–12 September 2009.
  28. Samek, W.; Binder, A.; Müller, K.R. Multiple kernel learning for brain-computer interfacing. In Proceedings of the 2013 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Osaka, Japan, 3–7 July 2013; pp. 7048–7051.
  29. Lotte, F.; Guan, C. Regularizing common spatial patterns to improve BCI designs: unified theory and new algorithms. IEEE Trans. Biomed. Eng. 2011, 58, 355–362. [Google Scholar] [CrossRef] [PubMed]
  30. Arvaneh, M.; Guan, C.; Ang, K.K.; Quek, H.C. Spatially sparsed common spatial pattern to improve BCI performance. In Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, 22–27 May 2011; pp. 2412–2415.
  31. Farquhar, J.; Hill, N.; Lal, T.N.; Schölkopf, B. Regularised CSP for sensor selection in BCI. In Proceedings of the 3rd International BCI workshop, Graz, Austria, 21–24 September 2006; pp. 1–2.
  32. Yong, X.; Ward, R.K.; Birch, G.E. Sparse spatial filter optimization for EEG channel reduction in brain-computer interface. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2008), Las Vegas, NV, USA, 31 March–4 April 2008; pp. 417–420.
  33. Kawanabe, M.; Vidaurre, C.; Scholler, S.; Müller, K.R. Robust common spatial filters with a maxmin approach. In Proceedings of the 2009 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Minneapolis, MN, USA, 3–6 September 2009; pp. 2470–2473.
  34. Kawanabe, M.; Samek, W.; Müller, K.R.; Vidaurre, C. Robust common spatial filters with a maxmin approach. Neural Comput. 2014, 26, 349–376. [Google Scholar] [CrossRef] [PubMed]
  35. Wang, H.; Tang, Q.; Zheng, W. L1-norm-based common spatial patterns. IEEE Trans. Biomed. Eng. 2012, 59, 653–662. [Google Scholar] [CrossRef] [PubMed]
  36. Park, J.; Chung, W. Common spatial patterns based on generalized norms. In Proceedings of the 2013 International Winter Workshop on Brain-Computer Interface (BCI), Jeongsun-kun, Korea, 18–20 February 2013; pp. 39–42.
  37. Von Bünau, P.; Meinecke, F.C.; Király, F.C.; Müller, K.R. Finding stationary subspaces in multivariate time series. Phys. Rev. Lett. 2009, 103, 214101. [Google Scholar] [CrossRef] [PubMed]
  38. Samek, W.; Kawanabe, M.; Vidaurre, C. Group-wise stationary subspace analysis–A novel method for studying non-stationarities. In Proceedings of the International Brain–Computer Interfacing Conference, Graz, Austria, 22–24 September 2011; pp. 16–20.
  39. Samek, W.; Müller, K.R.; Kawanabe, M.; Vidaurre, C. Brain-computer interfacing in discriminative and stationary subspaces. In Proceedings of the 2012 Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), San Diego, CA, USA, 28 August–1 September 2012; pp. 2873–2876.
  40. Von Bünau, P.; Meinecke, F.C.; Scholler, S.; Müller, K.R. Finding stationary brain sources in EEG data. In Proceedings of the 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Buenos Aires, Argentina, 31 August–4 September 2010; pp. 2810–2813.
  41. Arvaneh, M.; Guan, C.; Ang, K.K.; Quek, C. Optimizing spatial filters by minimizing within-class dissimilarities in electroencephalogram-based brain–computer interface. IEEE Trans. Neural Netw. Learn. Syst. 2013, 24, 610–619. [Google Scholar] [CrossRef] [PubMed]
  42. Müller-Gerking, J.; Pfurtscheller, G.; Flyvbjerg, H. Designing optimal spatial filters for single-trial EEG classification in a movement task. Clin. Neurophysiol. 1999, 110, 787–798. [Google Scholar] [CrossRef]
  43. Allwein, E.L.; Schapire, R.E.; Singer, Y. Reducing multiclass to binary: A unifying approach for margin classifiers. J. Mach. Learn. Res. 2001, 1, 113–141. [Google Scholar]
  44. Dornhege, G.; Blankertz, B.; Curio, G.; Müller, K.R. Boosting bit rates in noninvasive EEG single-trial classifications by feature combination and multiclass paradigms. IEEE Trans. Biomed. Eng. 2004, 51, 993–1002. [Google Scholar] [CrossRef] [PubMed]
  45. Grosse-Wentrup, M.; Buss, M. Multiclass common spatial patterns and information theoretic feature extraction. IEEE Trans. Biomed. Eng. 2008, 55, 1991–2000. [Google Scholar] [CrossRef] [PubMed]
  46. Naeem, M.; Brunner, C.; Leeb, R.; Graimann, B.; Pfurtscheller, G. Separability of four-class motor imagery data using independent components analysis. J. Neural Eng. 2006, 3, 208–216. [Google Scholar] [CrossRef] [PubMed]
  47. Barachant, A.; Bonnet, S.; Congedo, M.; Jutten, C. Multiclass brain–Computer interface classification by Riemannian geometry. IEEE Trans. Biomed. Eng. 2012, 59, 920–928. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  48. Zhang, H.; Yang, H.; Guan, C. Bayesian learning for spatial filtering in an EEG-based brain–computer interface. IEEE Trans. Neural Netw. Learn. Syst. 2013, 24, 1049–1060. [Google Scholar] [CrossRef] [PubMed]
  49. Barber, D. Bayesian Reasoning and Machine Learning; Cambridge University Press: Cambridge, UK, 2012. [Google Scholar]
  50. Edelman, A.; Arias, T.A.; Smith, S.T. The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl. 1998, 20, 303–353. [Google Scholar] [CrossRef]
  51. Amari, S.I. Natural gradient works efficiently in learning. Neural Comput. 1998, 10, 251–276. [Google Scholar] [CrossRef]
  52. Cruces, S.; Cichocki, A.; Amari, S. From Blind Signal Extraction to Blind Instantaneous Signal Separation. IEEE Trans. Neural Netw. 2004, 15, 859–873. [Google Scholar] [CrossRef] [PubMed]
  53. Nishimori, Y. Learning algorithm for ICA by geodesic flows on orthogonal group. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Washington, DC, USA, 10–16 July 1999; pp. 933–938.
  54. BCI Competition III. Available online: http://www.bbci.de/competition/iii/ (accessed on 26 August 2014).
  55. BCI Competition IV. Available online: http://www.bbci.de/competition/iv/ (accessed on 26 August 2014).
  56. Schlögl, A.; Lee, F.; Bischof, H.; Pfurtscheller, G. Characterization of four-class motor imagery EEG data for the BCI-competition 2005. J. Neural Eng. 2005, 2, L14. [Google Scholar] [CrossRef] [PubMed]
  57. Blankertz, B.; Müller, K.R.; Krusienski, D.J.; Schalk, G.; Wolpaw, J.R.; Schlögl, A.; Pfurtscheller, G.; Millan, J.R.; Schröder, M.; Birbaumer, N. The BCI competition III: Validating alternative approaches to actual BCI problems. IEEE Trans. Neural Syst. Rehabil. Eng. 2006, 14, 153–159. [Google Scholar] [CrossRef] [PubMed]
  58. Tangermann, M.; Müller, K.R.; Aertsen, A.; Birbaumer, N.; Braun, C.; Brunner, C.; Leeb, R.; Mehring, C.; Miller, K.J.; Mueller-Putz, G.; et al. Review of the BCI competition IV. Front. Neurosci. 2012, 6, 55. [Google Scholar]
  59. Duda, R.O.; Hart, P.E. Pattern Classification and Scene Analysis; Wiley: New York, NY, USA, 1973; Volume 3. [Google Scholar]
  60. The Divergence Methods Web Site. Available online: http://www.divergence-methods.org (accessed on 4 April 2016).
  61. Machine Learning in Neural Engineering. Available online: http://brain-computer-interfaces.net/ (accessed on 29 January 2015).
  62. Li, R. Rayleigh Quotient Based Optimization Methods For Eigenvalue Problems. In Summary of Lectures Delivered at Gene Golub SIAM Summer School 2013; Fudan University: Shanghai, China, 2013; pp. 1–27. [Google Scholar]
Figure 1. This illustration shows the AB log-det divergence D A B ( α , β ) ( P Q ) positioned in a plane as a function of their real pair of hyperparameters ( α , β ) . It is clear from the figure, that the parameterization smoothly connects several relevant positive definite matrix divergences, like: the squared Riemannian metric ( α = 0 , β = 0 ), the KL matrix divergence or Stein’s loss ( α = 1 , β = 0 ), the dual KL matrix divergence ( α = 0 , β = 1 ), and the S-divergence ( α = 1 2 , β = 1 2 ) among others.
Figure 2. Illustration of the strictly monotonous ascending transformation g ( · ) that, through Equation (56), maps eigenvalues of the matrix pencil ( p ( c 1 ) P , C o v ( x ) ) into the eigenvalues of the matrix pencil ( P , Q ) , in a case where the sample probabilities of the classes are uniform p ( c 1 ) = p ( c 2 ) . Note that the eigenvalues of the first pencil are bounded in the interval ( 0 , 1 ) , while the domain of the eigenvalues of the second pencil is ( 0 , ) .
Figure 3. Illustration of the behavior of the AB log-det divergence D A B ( α , β ) ( μ , 1 ) , and of its associated weight function w α , β ( μ ) , versus μ for different values of α = β . Note that μ is shown in log-scale. (a) Squared Riemannian metric for α = β = 0 (upper plot) and its weight function (lower plot); (b) Power Log-det divergence for α = β = 1 (upper plot) and its weight function (lower plot).
Figure 4. Illustration of the behavior of the AB log-det divergence D A B ( α , β ) ( μ , 1 ) , and of its associated weight function w α , β ( μ ) , versus μ for different values of α β . Note that μ is shown in log-scale. (a) Kullback–Leibler (KL) positive definite matrix divergence for α = 1 , β = 0 , and its weight function (lower plot); (b) Dual KL positive definite matrix div. for α = 0 , β = 1 , and its weight function (lower plot).
Figure 5. Performance comparison of the proposed algorithm Sub-ABLD ( η = 1 , α = β = 1.5 ) with CSP, JADE, MAPCSP and divCSP-WS ( λ = 0.5 , β * = 0.25 ), versus the percentage of outlier trials.
Figure 6. (a) Performance comparison of the proposed algorithm Sub-ABLD ( η = 2 , α = β = 1.5 ) with CSP, JADE, MAPCSP and divCSP-WS ( λ = 0.66 , β * = 1 ) using BCI competition III dataset 3a and (b) its corresponding boxplot.
Figure 7. (a) Performance comparison of the proposed algorithm Sub-ABLD ( η = 0.5 , α = β = 2 ) with CSP, JADE, MAPCSP and divCSP-WS ( λ = 0.33 , β * = 0.5 ) using BCI competition datasets III dataset 4a and (b) its corresponding boxplot.
Figure 8. (a) Performance comparison of the proposed algorithm Sub-ABLD ( η = 0.25 , α = β = 1.25 ) with CSP, JADE, MAPCSP and divCSP-WS ( λ = 0.2 , β * = 0 ) using BCI competition datasets IV dataset 2a and (b) its corresponding boxplot.
Figure 9. Results of the Sub-ABLD algorithm for the subject k6 from BCI competition III dataset 3a. This figure illustrates the changes in the average classification performance with respect to the variation of the parameters α and β. Relatively good performance results are obtained close to the diagonal and for moderately large values of the parameters.
